Added scenario about web scraping using lxml

2026-06-05 14:50:19 +00:00 · 2012-12-31 10:22:38 -05:00
parent 2a9c7322a5
commit faae04c3a3
1 changed files with 82 additions and 0 deletions
@@ -0,0 +1,82 @@
+HTML Scraping
+=============
+
+Web Scraping
+------------
+
+Web sites are written using HTML, which means that each web page is a
+structured document. Sometimes it would be great to obtain some data from
+them and preserve the structure while we're at it, but this isn't always
+easy: web sites rarely provide their data in convenient formats such as
+``.csv``.
+
+This is where web scraping comes in. Web scraping is the practice of using
+a computer program to sift through a web page and gather the data that you
+need in a format most useful to you.
+
+lxml
+----
+
+`lxml <http://lxml.de/>`_ is a pretty extensive library written for parsing
+XML and HTML documents, which you can easily install using ``pip``. We will
+be using its ``html`` module to get data from this web page:
+`econpy <http://econpy.pythonanywhere.com/ex/001.html>`_.
+
+First we shall import the required modules:
+
+.. code-block:: python
+
+    from lxml import html
+    from urllib.request import urlopen
+
+We will use ``urlopen`` from ``urllib.request`` to retrieve the web page
+with our data and parse it using the ``html`` module:
+
+.. code-block:: python
+
+    page = urlopen('http://econpy.pythonanywhere.com/ex/001.html')
+    tree = html.fromstring(page.read())
+
+``tree`` now contains the whole HTML file in a nice tree structure which
+we can go over in many different ways, one of which is using XPath. XPath
+is a way of locating information in structured documents such as HTML or XML
+pages. A good introduction to XPath is available
+`on W3Schools <http://www.w3schools.com/xpath/default.asp>`_.
+There are also various tools for obtaining the XPath of elements, such as
+Firebug for Firefox. In Chrome, you can right click an element, choose
+'Inspect element', highlight the code, then right click again and choose
+'Copy XPath'.
+
+After a quick analysis, we see that in our page the data is contained in
+two elements: one is a div with title 'buyer-name' and the other is a
+span with class 'item-price'. Knowing this, we can create the correct XPath
+query and use the lxml ``xpath`` method like this:
+
+.. code-block:: python
+
+    # This will create a list of buyers:
+    buyers = tree.xpath('//div[@title="buyer-name"]/text()')
+    # This will create a list of prices:
+    prices = tree.xpath('//span[@class="item-price"]/text()')
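+
+To experiment with these queries without fetching the page, the same
+``xpath`` calls can be run against a small inline document. The sketch
+below is only an illustration: the ``sample`` markup is an assumption
+modelled on the structure described above, not a copy of the real page.
+
+.. code-block:: python
+
+    from lxml import html
+
+    # Hypothetical markup that mimics the structure described above.
+    sample = """
+    <html><body>
+      <div title="buyer-name">Carson Busses</div>
+      <span class="item-price">$29.95</span>
+      <div title="buyer-name">Earl E. Byrd</div>
+      <span class="item-price">$8.37</span>
+    </body></html>
+    """
+
+    sample_tree = html.fromstring(sample)
+
+    # text() selects the text nodes directly inside each matching element.
+    sample_buyers = sample_tree.xpath('//div[@title="buyer-name"]/text()')
+    sample_prices = sample_tree.xpath('//span[@class="item-price"]/text()')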
+
+Let's see what we got exactly:
+
+.. code-block:: python
+
+    print('Buyers: ', buyers)
+    print('Prices: ', prices)
+
+::
+
+    Buyers:  ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes',
+    'Derri Anne Connecticut', 'Moe Dess', 'Leda Doggslife', 'Dan Druff',
+    'Al Fresco', 'Ido Hoe', 'Howie Kisses', 'Len Lease', 'Phil Meup',
+    'Ira Pent', 'Ben D. Rules', 'Ave Sectomy', 'Gary Shattire',
+    'Bobbi Soks', 'Sheila Takya', 'Rose Tattoo', 'Moe Tell']
+    
+    Prices:  ['$29.95', '$8.37', '$15.26', '$19.25', '$19.25',
+    '$13.99', '$31.57', '$8.49', '$14.47', '$15.86', '$11.11',
+    '$15.98', '$16.27', '$7.50', '$50.85', '$14.26', '$5.68',
+    '$15.00', '$114.07', '$10.09']
+
+Congratulations! We have successfully scraped all the data we wanted from
+a web page using lxml, and we have it stored in memory as two lists. Now we
+can either continue working on it, analyzing it using Python, or we can
+export it to a file and share it with friends.
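+
+As one possible next step, the two lists can be paired up, the prices
+converted to numbers, and the result written out as CSV. The sketch below
+assumes Python 3 and uses only the standard library; the helper name
+``write_purchases``, the column names and the three sample rows (taken from
+the output above) are illustrative assumptions.
+
+.. code-block:: python
+
+    import csv
+    import io
+    from decimal import Decimal
+
+    def write_purchases(buyers, prices, out):
+        """Write (buyer, price) rows as CSV to the file-like object out."""
+        writer = csv.writer(out)
+        writer.writerow(['buyer', 'price'])
+        for buyer, price in zip(buyers, prices):
+            # Strip the leading dollar sign so the price becomes a number.
+            writer.writerow([buyer, Decimal(price.lstrip('$'))])
+
+    sample_buyers = ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes']
+    sample_prices = ['$29.95', '$8.37', '$15.26']
+
+    # An in-memory buffer stands in for a real file here.
+    buffer = io.StringIO()
+    write_purchases(sample_buyers, sample_prices, buffer)
+    csv_text = buffer.getvalue()
+
+To write to disk instead, the object returned by
+``open('purchases.csv', 'w', newline='')`` could be passed as ``out``
+(the file name is just an example).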