mirror of
https://github.com/kennethreitz/python-guide.git
synced 2026-06-05 23:00:18 +00:00
Grammar fix, got rid of DOS line endings
+12
-10
@@ -6,8 +6,8 @@ Web Scraping
 
 Web sites are written using HTML, which means that each web page is a
 structured document. Sometimes it would be great to obtain some data from
-them and preserve the structure while we're at it. Web sites provide
-don't always provide their data in comfortable formats such as ``.csv``.
+them and preserve the structure while we're at it. Web sites don't always
+provide their data in comfortable formats such as ``csv`` or ``json``.
 
 This is where web scraping comes in. Web scraping is the practice of using a
 computer program to sift through a web page and gather the data that you need
@@ -19,9 +19,10 @@ lxml and Requests
 
 `lxml <http://lxml.de/>`_ is a pretty extensive library written for parsing
 XML and HTML documents really fast. It even handles messed up tags. We will
-also be using the `Requests <http://docs.python-requests.org/en/latest/>`_ module instead of the already built-in urlib2
-due to improvements in speed and readability. You can easily install both
-using ``pip install lxml`` and ``pip install requests``.
+also be using the `Requests <http://docs.python-requests.org/en/latest/>`_
+module instead of the already built-in urlib2 due to improvements in speed and
+readability. You can easily install both using ``pip install lxml`` and
+``pip install requests``.
 
 Lets start with the imports:
 
||||||
@@ -43,12 +44,13 @@ we can go over two different ways: XPath and CSSSelect. In this example, I
 will focus on the former.
 
 XPath is a way of locating information in structured documents such as
-HTML or XML documents. A good introduction to XPath is on `W3Schools <http://www.w3schools.com/xpath/default.asp>`_ .
+HTML or XML documents. A good introduction to XPath is on
+`W3Schools <http://www.w3schools.com/xpath/default.asp>`_ .
 
 There are also various tools for obtaining the XPath of elements such as
-FireBug for Firefox or if you're using Chrome you can right click an
-element, choose 'Inspect element', highlight the code and then right
-click again and choose 'Copy XPath'.
+FireBug for Firefox or the Chrome Inspector. If you're using Chrome, you
+can right click an element, choose 'Inspect element', highlight the code,
+right click again and choose 'Copy XPath'.
 
 After a quick analysis, we see that in our page the data is contained in
 two elements - one is a div with title 'buyer-name' and the other is a
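For readers following the context lines of this hunk, the XPath lookup they describe can be sketched roughly as below. This is only an illustration under stated assumptions: the sample markup and the `extract_buyers` helper are hypothetical, and the sketch uses the standard library's `xml.etree.ElementTree` (which supports a limited XPath subset) instead of lxml so that it runs without third-party packages. Only the `div` with title `buyer-name` is shown, since the second element is cut off in the hunk.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the scraped page (not the real page).
SAMPLE_PAGE = """
<html>
  <body>
    <div title="buyer-name">Alice Example</div>
    <div title="buyer-name">Bob Example</div>
    <div title="something-else">ignored</div>
  </body>
</html>
""".strip()


def extract_buyers(markup):
    """Return the text of every div whose title attribute is 'buyer-name'."""
    root = ET.fromstring(markup)
    # './/' searches all descendants; the predicate filters on the attribute.
    return [div.text for div in root.findall(".//div[@title='buyer-name']")]


buyers = extract_buyers(SAMPLE_PAGE)
print(buyers)
```

Unlike lxml, ElementTree needs well-formed markup, so real-world HTML with broken tags would likely still call for lxml as the guide suggests.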
@@ -92,7 +94,7 @@ Lets see what we got exactly:
 Congratulations! We have successfully scraped all the data we wanted from
 a web page using lxml and Requests. We have it stored in memory as two
 lists. Now we can do all sorts of cool stuff with it: we can analyze it
-using Python or we can save it a file and share it with the world.
+using Python or we can save it to a file and share it with the world.
 
 A cool idea to think about is modifying this script to iterate through
 the rest of the pages of this example dataset or rewriting this
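The "save it to a file" step mentioned in this hunk might look something like the sketch below, using the standard library's `csv` module. The list names, their sample values, the column headers, and the `write_rows` helper are all hypothetical. The hunk only says the data is held in two lists, so placeholder values stand in for the second one.

```python
import csv
import io

# Hypothetical sample data standing in for the two scraped lists.
buyers = ["Alice Example", "Bob Example"]
second_column = ["value-1", "value-2"]


def write_rows(handle, first, second):
    """Write a header row, then pair up the two lists row by row."""
    writer = csv.writer(handle)
    writer.writerow(["buyer", "value"])
    writer.writerows(zip(first, second))


# An in-memory buffer keeps the sketch free of side effects.
buffer = io.StringIO()
write_rows(buffer, buyers, second_column)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, the same helper could be passed a handle from `open("output.csv", "w", newline="")`, where the file name is again just an example.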