From faae04c3a3c89e8033e8e79181008aca4b81ae4f Mon Sep 17 00:00:00 2001 From: sirMackk Date: Mon, 31 Dec 2012 10:22:38 -0500 Subject: [PATCH] Added scenario about web scraping using lxml --- docs/scenarios/scrape.rst | 82 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 82 insertions(+) create mode 100644 docs/scenarios/scrape.rst diff --git a/docs/scenarios/scrape.rst b/docs/scenarios/scrape.rst new file mode 100644 index 0000000..3537938 --- /dev/null +++ b/docs/scenarios/scrape.rst @@ -0,0 +1,82 @@ +HTML Scraping +============= + +Web Scraping +------------ + +Web sites are written using HTML, which means that each web page is a + structured document. Sometimes it would be great to obtain some data from +them and preserve the structure while we're at it, but this isn't always easy + - it's not often that web sites provide their data in comfortable formats + such as `.csv`. + +This is where web scraping comes in. Web scraping is the practice of using +computer program to sift through a web page and gather the data that you need +in a format most useful to you. + +lxml +---- + +`lxml `_ is a pretty extensive library written for parsing +XML and HTML documents, which you can easily install using `pip`. We will +be using its `html` module to get data from this web page: `econpy '_ . + +First we shall import the required modules: + +.. code-block:: python + + from lxml import html + from urllib2 import urlopen + +We will use `urllib2.urlopen` to retrieve the web page with our data and +parse it using the `html` module: + +.. code-block:: python + + page = urlopen('http://econpy.pythonanywhere.com/ex/001.html') + tree = html.fromstring(page.read()) + +`tree` now contains the whole HTML file in a nice tree structure which +we can go over in many different ways, one of which is using XPath. XPath +is a way of locating information in structured documents such as HTML or XML +pages. A good introduction to XPath is 'here '_ . +One can also use various tools for obtaining the XPath of elements such as +FireBug for Firefox or in Chrome you can right click an element, choose +'Inspect element', highlight the code and the right click again and choose +'Copy XPath'. + +After a quick analysis, we see that in our page the data is contained in +two elements - one is a div with title 'buyer-name' and the other is a +span with class 'item-price'. Knowing this we can create the correct XPath +query and use the lxml `xpath` function like this: + +.. code-block:: python + + #This will create a list of buyers: + buyers = tree.xpath('//div[@title="buyer-name"]/text()') + #This will create a list of prices + prices = tree.xpath('//span[@class="item-price"]/text()') + +Lets see what we got exactly: + +.. code-block:: python + + print 'Buyers: ', buyers + print 'Prices: ', prices + +:: + Buyers: ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes', + 'Derri Anne Connecticut', 'Moe Dess', 'Leda Doggslife', 'Dan Druff', + 'Al Fresco', 'Ido Hoe', 'Howie Kisses', 'Len Lease', 'Phil Meup', + 'Ira Pent', 'Ben D. Rules', 'Ave Sectomy', 'Gary Shattire', + 'Bobbi Soks', 'Sheila Takya', 'Rose Tattoo', 'Moe Tell'] + + Prices: ['$29.95', '$8.37', '$15.26', '$19.25', '$19.25', + '$13.99', '$31.57', '$8.49', '$14.47', '$15.86', '$11.11', + '$15.98', '$16.27', '$7.50', '$50.85', '$14.26', '$5.68', + '$15.00', '$114.07', '$10.09'] + +Congratulations! We have successfully scraped all the data we wanted from +a web page using lxml and we have it stored in memory as two lists. Now we +can either continue our work on it, analyzing it using python or we can +export it to a file and share it with friends.