Grammar fix, got rid of DOS line endings

2026-06-05 14:50:19 +00:00 · 2013-03-21 17:19:00 -04:00
parent a899f414f4
commit 7ec689f4bf
1 changed files with 101 additions and 99 deletions
@@ -1,99 +1,101 @@
-HTML Scraping
-=============
-
-Web Scraping
------------
-
-Web sites are written using HTML, which means that each web page is a
-structured document. Sometimes it would be great to obtain some data from 
-them and preserve the structure while we're at it. Web sites provide 
-don't always provide their data in comfortable formats such as ``.csv``. 
-
-This is where web scraping comes in. Web scraping is the practice of using a
-computer program to sift through a web page and gather the data that you need
-in a format most useful to you while at the same time preserving the structure
-of the data.
-
-lxml and Requests
-----------------
-
-`lxml <http://lxml.de/>`_ is a pretty extensive library written for parsing
-XML and HTML documents really fast. It even handles messed up tags. We will 
-also be using the `Requests <http://docs.python-requests.org/en/latest/>`_ module instead of the already built-in urlib2 
-due to improvements in speed and readability. You can easily install both 
-using ``pip install lxml`` and ``pip install requests``.
-
-Lets start with the imports:
-
-.. code-block:: python
-
-    from lxml import html
-    import requests
-    
-Next we will use ``requests.get`` to retrieve the web page with our data 
-and parse it using the ``html`` module and save the results in ``tree``:
-
-.. code-block:: python
-
-    page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
-    tree = html.fromstring(page.text)
-
-``tree`` now contains the whole HTML file in a nice tree structure which
-we can go over two different ways: XPath and CSSSelect. In this example, I
-will focus on the former. 
-
-XPath is a way of locating information in structured documents such as 
-HTML or XML documents. A good introduction to XPath is on `W3Schools <http://www.w3schools.com/xpath/default.asp>`_ .
-
-There are also various tools for obtaining the XPath of elements such as
-FireBug for Firefox or if you're using Chrome you can right click an 
-element, choose 'Inspect element', highlight the code and then right 
-click again and choose 'Copy XPath'.
-
-After a quick analysis, we see that in our page the data is contained in 
-two elements - one is a div with title 'buyer-name' and the other is a 
-span with class 'item-price':
-
-::
-
-    <div title="buyer-name">Carson Busses</div>
-    <span class="item-price">$29.95</span>
-
-Knowing this we can create the correct XPath query and use the lxml
-``xpath`` function like this:
-
-.. code-block:: python
-
-    #This will create a list of buyers:
-    buyers = tree.xpath('//div[@title="buyer-name"]/text()')
-    #This will create a list of prices
-    prices = tree.xpath('//span[@class="item-price"]/text()')
-
-Lets see what we got exactly:
-
-.. code-block:: python
-
-    print 'Buyers: ', buyers
-    print 'Prices: ', prices
-
-::
-
-    Buyers:  ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes', 
-    'Derri Anne Connecticut', 'Moe Dess', 'Leda Doggslife', 'Dan Druff',
-    'Al Fresco', 'Ido Hoe', 'Howie Kisses', 'Len Lease', 'Phil Meup',
-    'Ira Pent', 'Ben D. Rules', 'Ave Sectomy', 'Gary Shattire',
-    'Bobbi Soks', 'Sheila Takya', 'Rose Tattoo', 'Moe Tell']
-    
-    Prices:  ['$29.95', '$8.37', '$15.26', '$19.25', '$19.25',
-    '$13.99', '$31.57', '$8.49', '$14.47', '$15.86', '$11.11',
-    '$15.98', '$16.27', '$7.50', '$50.85', '$14.26', '$5.68',
-    '$15.00', '$114.07', '$10.09']
-
-Congratulations! We have successfully scraped all the data we wanted from
-a web page using lxml and Requests. We have it stored in memory as two 
-lists. Now we can do all sorts of cool stuff with it: we can analyze it 
-using Python or we can save it a file and share it with the world.
-
-A cool idea to think about is modifying this script to iterate through 
-the rest of the pages of this example dataset or rewriting this 
-application to use threads for improved speed.
+HTML Scraping
+=============
+
+Web Scraping
+------------
+
+Web sites are written using HTML, which means that each web page is a
+structured document. Sometimes it would be great to obtain some data from
+them and preserve the structure while we're at it. Web sites don't always
+provide their data in comfortable formats such as ``csv`` or ``json``.
+
+This is where web scraping comes in. Web scraping is the practice of using a
+computer program to sift through a web page and gather the data that you need
+in a format most useful to you while at the same time preserving the structure
+of the data.
+
+lxml and Requests
+-----------------
+
+`lxml <http://lxml.de/>`_ is a pretty extensive library written for parsing
+XML and HTML documents really fast. It even handles messed up tags. We will
+also be using the `Requests <http://docs.python-requests.org/en/latest/>`_
+module instead of the already built-in urlib2 due to improvements in speed and
+readability. You can easily install both using ``pip install lxml`` and
+``pip install requests``.
+
+Lets start with the imports:
+
+.. code-block:: python
+
+    from lxml import html
+    import requests
+
+Next we will use ``requests.get`` to retrieve the web page with our data
+and parse it using the ``html`` module and save the results in ``tree``:
+
+.. code-block:: python
+
+    page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
+    tree = html.fromstring(page.text)
+
+``tree`` now contains the whole HTML file in a nice tree structure which
+we can go over two different ways: XPath and CSSSelect. In this example, I
+will focus on the former.
+
+XPath is a way of locating information in structured documents such as
+HTML or XML documents. A good introduction to XPath is on
+`W3Schools <http://www.w3schools.com/xpath/default.asp>`_ .
+
+There are also various tools for obtaining the XPath of elements such as
+FireBug for Firefox or the Chrome Inspector. If you're using Chrome, you
+can right click an element, choose 'Inspect element', highlight the code,
+right click again and choose 'Copy XPath'.
+
+After a quick analysis, we see that in our page the data is contained in
+two elements - one is a div with title 'buyer-name' and the other is a
+span with class 'item-price':
+
+::
+
+    <div title="buyer-name">Carson Busses</div>
+    <span class="item-price">$29.95</span>
+
+Knowing this we can create the correct XPath query and use the lxml
+``xpath`` function like this:
+
+.. code-block:: python
+
+    #This will create a list of buyers:
+    buyers = tree.xpath('//div[@title="buyer-name"]/text()')
+    #This will create a list of prices
+    prices = tree.xpath('//span[@class="item-price"]/text()')
+
+Lets see what we got exactly:
+
+.. code-block:: python
+
+    print 'Buyers: ', buyers
+    print 'Prices: ', prices
+
+::
+
+    Buyers:  ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes',
+    'Derri Anne Connecticut', 'Moe Dess', 'Leda Doggslife', 'Dan Druff',
+    'Al Fresco', 'Ido Hoe', 'Howie Kisses', 'Len Lease', 'Phil Meup',
+    'Ira Pent', 'Ben D. Rules', 'Ave Sectomy', 'Gary Shattire',
+    'Bobbi Soks', 'Sheila Takya', 'Rose Tattoo', 'Moe Tell']
+
+    Prices:  ['$29.95', '$8.37', '$15.26', '$19.25', '$19.25',
+    '$13.99', '$31.57', '$8.49', '$14.47', '$15.86', '$11.11',
+    '$15.98', '$16.27', '$7.50', '$50.85', '$14.26', '$5.68',
+    '$15.00', '$114.07', '$10.09']
+
+Congratulations! We have successfully scraped all the data we wanted from
+a web page using lxml and Requests. We have it stored in memory as two
+lists. Now we can do all sorts of cool stuff with it: we can analyze it
+using Python or we can save it to a file and share it with the world.
+
+A cool idea to think about is modifying this script to iterate through
+the rest of the pages of this example dataset or rewriting this
+application to use threads for improved speed.