Grammar fix, got rid of DOS line endings

Kyle Kelley
2013-03-21 17:19:00 -04:00
parent a899f414f4
commit 7ec689f4bf
+12 -10
@@ -6,8 +6,8 @@ Web Scraping
 Web sites are written using HTML, which means that each web page is a
 structured document. Sometimes it would be great to obtain some data from
-them and preserve the structure while we're at it. Web sites provide
-don't always provide their data in comfortable formats such as ``.csv``.
+them and preserve the structure while we're at it. Web sites don't always
+provide their data in comfortable formats such as ``csv`` or ``json``.
 This is where web scraping comes in. Web scraping is the practice of using a
 computer program to sift through a web page and gather the data that you need
@@ -19,9 +19,10 @@ lxml and Requests
 `lxml <http://lxml.de/>`_ is a pretty extensive library written for parsing
 XML and HTML documents really fast. It even handles messed up tags. We will
-also be using the `Requests <http://docs.python-requests.org/en/latest/>`_ module instead of the already built-in urlib2
-due to improvements in speed and readability. You can easily install both
-using ``pip install lxml`` and ``pip install requests``.
+also be using the `Requests <http://docs.python-requests.org/en/latest/>`_
+module instead of the already built-in urlib2 due to improvements in speed and
+readability. You can easily install both using ``pip install lxml`` and
+``pip install requests``.
 Lets start with the imports:
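Note on this hunk: the imports it leads into fall outside the diff context, so the following is only a minimal sketch of how the two libraries might be combined. The helper names `parse_page` and `fetch_tree` and the sample markup are assumptions for illustration, not taken from the file being changed.

```python
from lxml import html


def parse_page(markup):
    """Parse an HTML string (or bytes) into an lxml element tree."""
    return html.fromstring(markup)


def fetch_tree(url):
    """Download a page with Requests and parse it (needs network access)."""
    # Imported here so the offline demo below runs without Requests installed.
    import requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return parse_page(response.content)


# Offline demo: the <p> and <b> tags are never closed, yet lxml still
# builds a usable tree from the markup.
sample = "<html><body><p>First<p>Second <b>bold</body></html>"
tree = parse_page(sample)
paragraphs = [p.text_content().strip() for p in tree.xpath("//p")]
print(paragraphs)
```

Keeping the parsing step separate from the download makes it easy to try the parser on a saved or hand-written page before pointing it at a live site.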
@@ -43,12 +44,13 @@ we can go over two different ways: XPath and CSSSelect. In this example, I
 will focus on the former.
 XPath is a way of locating information in structured documents such as
-HTML or XML documents. A good introduction to XPath is on `W3Schools <http://www.w3schools.com/xpath/default.asp>`_ .
+HTML or XML documents. A good introduction to XPath is on
+`W3Schools <http://www.w3schools.com/xpath/default.asp>`_ .
 There are also various tools for obtaining the XPath of elements such as
-FireBug for Firefox or if you're using Chrome you can right click an
-element, choose 'Inspect element', highlight the code and then right
-click again and choose 'Copy XPath'.
+FireBug for Firefox or the Chrome Inspector. If you're using Chrome, you
+can right click an element, choose 'Inspect element', highlight the code,
+right click again and choose 'Copy XPath'.
 After a quick analysis, we see that in our page the data is contained in
 two elements - one is a div with title 'buyer-name' and the other is a
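Note on this hunk: the context stops before the second element is named, so the sketch below covers only the div with title 'buyer-name'. The sample markup and the buyer names are made up for illustration; the real page structure may differ.

```python
from lxml import html

# Hypothetical markup: only the 'buyer-name' div is described in the hunk.
sample = """
<html><body>
  <div title="buyer-name">Alice Example</div>
  <div title="buyer-name">Bob Example</div>
  <div title="something-else">ignored</div>
</body></html>
"""

tree = html.fromstring(sample)

# //div                    selects every div anywhere in the document
# [@title="buyer-name"]    keeps only those whose title attribute matches
# /text()                  returns the text nodes directly inside each match
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
print(buyers)
```

An XPath copied from the browser inspector is usually an absolute path to a single element; an attribute-based expression like the one above tends to be more useful when the same kind of element repeats down the page.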
@@ -92,7 +94,7 @@ Lets see what we got exactly:
 Congratulations! We have successfully scraped all the data we wanted from
 a web page using lxml and Requests. We have it stored in memory as two
 lists. Now we can do all sorts of cool stuff with it: we can analyze it
-using Python or we can save it a file and share it with the world.
+using Python or we can save it to a file and share it with the world.
 A cool idea to think about is modifying this script to iterate through
 the rest of the pages of this example dataset or rewriting this
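Note on this hunk: "save it to a file" could be done with the standard library's csv module. This is a sketch under assumptions: the list names `buyers` and `prices`, the column headers, and the sample values are placeholders, since the diff does not show the real variables.

```python
import csv
import io

# Placeholder data standing in for the two scraped lists.
buyers = ["Alice Example", "Bob Example"]
prices = ["$10.00", "$12.50"]


def write_rows(stream, first, second, header=("buyer", "price")):
    """Write two parallel lists to `stream` as CSV, one pair per row."""
    writer = csv.writer(stream)
    writer.writerow(header)
    writer.writerows(zip(first, second))  # zip pairs items by position


# Write to an in-memory buffer so the sketch has no side effects.
buffer = io.StringIO()
write_rows(buffer, buyers, prices)
print(buffer.getvalue())
```

To produce a real file, the same function can be given a file object opened with `open("buyers.csv", "w", newline="")`; the file name here is likewise only an example.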