Grammar fix, got rid of DOS line endings

2026-06-05 23:00:18 +00:00 · 2013-03-21 17:19:00 -04:00
parent a899f414f4
commit 7ec689f4bf
1 changed files with 101 additions and 99 deletions
@@ -1,99 +1,101 @@
-HTML Scraping
+HTML Scraping
-=============
+=============
-
+
-Web Scraping
+Web Scraping
------------
+------------
-
+
-Web sites are written using HTML, which means that each web page is a
+Web sites are written using HTML, which means that each web page is a
-structured document. Sometimes it would be great to obtain some data from 
+structured document. Sometimes it would be great to obtain some data from
-them and preserve the structure while we're at it. Web sites provide 
+them and preserve the structure while we're at it. Web sites don't always
-don't always provide their data in comfortable formats such as ``.csv``. 
+provide their data in comfortable formats such as ``csv`` or ``json``.
-
+
-This is where web scraping comes in. Web scraping is the practice of using a
+This is where web scraping comes in. Web scraping is the practice of using a
-computer program to sift through a web page and gather the data that you need
+computer program to sift through a web page and gather the data that you need
-in a format most useful to you while at the same time preserving the structure
+in a format most useful to you while at the same time preserving the structure
-of the data.
+of the data.
-
+
-lxml and Requests
+lxml and Requests
-----------------
+-----------------
-
+
-`lxml <http://lxml.de/>`_ is a pretty extensive library written for parsing
+`lxml <http://lxml.de/>`_ is a pretty extensive library written for parsing
-XML and HTML documents really fast. It even handles messed up tags. We will 
+XML and HTML documents really fast. It even handles messed up tags. We will
-also be using the `Requests <http://docs.python-requests.org/en/latest/>`_ module instead of the already built-in urlib2 
+also be using the `Requests <http://docs.python-requests.org/en/latest/>`_
-due to improvements in speed and readability. You can easily install both 
+module instead of the already built-in urllib2 due to improvements in speed and
-using ``pip install lxml`` and ``pip install requests``.
+readability. You can easily install both using ``pip install lxml`` and
-
+``pip install requests``.
-Lets start with the imports:
+
-
+Let's start with the imports:
-.. code-block:: python
+
-
+.. code-block:: python
-    from lxml import html
+
-    import requests
+    from lxml import html
-    
+    import requests
-Next we will use ``requests.get`` to retrieve the web page with our data 
+
-and parse it using the ``html`` module and save the results in ``tree``:
+Next we will use ``requests.get`` to retrieve the web page with our data
-
+and parse it using the ``html`` module and save the results in ``tree``:
-.. code-block:: python
+
-
+.. code-block:: python
-    page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
+
-    tree = html.fromstring(page.text)
+    page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
-
+    tree = html.fromstring(page.text)
-``tree`` now contains the whole HTML file in a nice tree structure which
+
-we can go over two different ways: XPath and CSSSelect. In this example, I
+``tree`` now contains the whole HTML file in a nice tree structure which
-will focus on the former. 
+we can go over in two different ways: XPath and CSSSelect. In this example, I
-
+will focus on the former.
-XPath is a way of locating information in structured documents such as 
+
-HTML or XML documents. A good introduction to XPath is on `W3Schools <http://www.w3schools.com/xpath/default.asp>`_ .
+XPath is a way of locating information in structured documents such as
-
+HTML or XML documents. A good introduction to XPath is on
-There are also various tools for obtaining the XPath of elements such as
+`W3Schools <http://www.w3schools.com/xpath/default.asp>`_ .
-FireBug for Firefox or if you're using Chrome you can right click an 
+
-element, choose 'Inspect element', highlight the code and then right 
+There are also various tools for obtaining the XPath of elements such as
-click again and choose 'Copy XPath'.
+FireBug for Firefox or the Chrome Inspector. If you're using Chrome, you
-
+can right click an element, choose 'Inspect element', highlight the code,
-After a quick analysis, we see that in our page the data is contained in 
+right click again and choose 'Copy XPath'.
-two elements - one is a div with title 'buyer-name' and the other is a 
+
-span with class 'item-price':
+After a quick analysis, we see that in our page the data is contained in
-
+two elements - one is a div with title 'buyer-name' and the other is a
-::
+span with class 'item-price':
-
+
-    <div title="buyer-name">Carson Busses</div>
+::
-    <span class="item-price">$29.95</span>
+
-
+    <div title="buyer-name">Carson Busses</div>
-Knowing this we can create the correct XPath query and use the lxml
+    <span class="item-price">$29.95</span>
-``xpath`` function like this:
+
-
+Knowing this we can create the correct XPath query and use the lxml
-.. code-block:: python
+``xpath`` function like this:
-
+
-    #This will create a list of buyers:
+.. code-block:: python
-    buyers = tree.xpath('//div[@title="buyer-name"]/text()')
+
-    #This will create a list of prices
+    #This will create a list of buyers:
-    prices = tree.xpath('//span[@class="item-price"]/text()')
+    buyers = tree.xpath('//div[@title="buyer-name"]/text()')
-
+    #This will create a list of prices
-Lets see what we got exactly:
+    prices = tree.xpath('//span[@class="item-price"]/text()')
-
+
-.. code-block:: python
+Let's see what we got exactly:
-
+
-    print 'Buyers: ', buyers
+.. code-block:: python
-    print 'Prices: ', prices
+
-
+    print('Buyers: ', buyers)
-::
+    print('Prices: ', prices)
-
+
-    Buyers:  ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes', 
+::
-    'Derri Anne Connecticut', 'Moe Dess', 'Leda Doggslife', 'Dan Druff',
+
-    'Al Fresco', 'Ido Hoe', 'Howie Kisses', 'Len Lease', 'Phil Meup',
+    Buyers:  ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes',
-    'Ira Pent', 'Ben D. Rules', 'Ave Sectomy', 'Gary Shattire',
+    'Derri Anne Connecticut', 'Moe Dess', 'Leda Doggslife', 'Dan Druff',
-    'Bobbi Soks', 'Sheila Takya', 'Rose Tattoo', 'Moe Tell']
+    'Al Fresco', 'Ido Hoe', 'Howie Kisses', 'Len Lease', 'Phil Meup',
-    
+    'Ira Pent', 'Ben D. Rules', 'Ave Sectomy', 'Gary Shattire',
-    Prices:  ['$29.95', '$8.37', '$15.26', '$19.25', '$19.25',
+    'Bobbi Soks', 'Sheila Takya', 'Rose Tattoo', 'Moe Tell']
-    '$13.99', '$31.57', '$8.49', '$14.47', '$15.86', '$11.11',
+
-    '$15.98', '$16.27', '$7.50', '$50.85', '$14.26', '$5.68',
+    Prices:  ['$29.95', '$8.37', '$15.26', '$19.25', '$19.25',
-    '$15.00', '$114.07', '$10.09']
+    '$13.99', '$31.57', '$8.49', '$14.47', '$15.86', '$11.11',
-
+    '$15.98', '$16.27', '$7.50', '$50.85', '$14.26', '$5.68',
-Congratulations! We have successfully scraped all the data we wanted from
+    '$15.00', '$114.07', '$10.09']
-a web page using lxml and Requests. We have it stored in memory as two 
+
-lists. Now we can do all sorts of cool stuff with it: we can analyze it 
+Congratulations! We have successfully scraped all the data we wanted from
-using Python or we can save it a file and share it with the world.
+a web page using lxml and Requests. We have it stored in memory as two
-
+lists. Now we can do all sorts of cool stuff with it: we can analyze it
-A cool idea to think about is modifying this script to iterate through 
+using Python or we can save it to a file and share it with the world.
-the rest of the pages of this example dataset or rewriting this 
+
-application to use threads for improved speed.
+A cool idea to think about is modifying this script to iterate through
 the rest of the pages of this example dataset or rewriting this
 application to use threads for improved speed.
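As a follow-up to the closing suggestion about saving the scraped data to a file, here is a minimal sketch of one way to pair the two lists and write them out as CSV. It is not part of the tutorial or of this commit: the helper names ``rows_from_lists`` and ``write_csv`` and the column headers are hypothetical, and it assumes every price is a string such as ``$29.95``. An in-memory buffer stands in for a real file so the example runs without touching the disk or the network.

```python
import csv
import io


def rows_from_lists(buyers, prices):
    """Pair each buyer with a price parsed from a string such as '$29.95'.

    Hypothetical helper: assumes a leading dollar sign and optional
    thousands separators, as in the tutorial's sample output.
    """
    if len(buyers) != len(prices):
        raise ValueError("buyers and prices must have the same length")
    return [
        (buyer, float(price.lstrip("$").replace(",", "")))
        for buyer, price in zip(buyers, prices)
    ]


def write_csv(rows, out):
    """Write a header line and the (buyer, price) rows to an open text file."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["buyer", "price"])  # assumed column names
    writer.writerows(rows)


# First three values from the output shown in the tutorial.
buyers = ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes']
prices = ['$29.95', '$8.37', '$15.26']

rows = rows_from_lists(buyers, prices)

# io.StringIO stands in for a real file; with a file on disk you would
# pass open('buyers.csv', 'w', newline='') instead.
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

Converting the prices to ``float`` is a design choice that makes later analysis easier; if exact currency arithmetic matters, ``decimal.Decimal`` may be a better fit.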