HTML Scraping
=============

Web Scraping
------------

Web sites are written using HTML, which means that each web page is a
structured document. Sometimes it would be great to obtain some data from
them and preserve the structure while we're at it, but this isn't always easy.
Web sites rarely provide their data in convenient formats such as ``.csv``.

This is where web scraping comes in. Web scraping is the practice of using a
computer program to sift through a web page and gather the data that you need
in a format most useful to you.

lxml
----

`lxml <http://lxml.de/>`_ is a pretty extensive library written for parsing
XML and HTML documents, which you can easily install using ``pip``. We will
be using its ``html`` module to get data from this web page:
`econpy <http://econpy.pythonanywhere.com/ex/001.html>`_.

First we shall import the required modules:

.. code-block:: python

    from lxml import html
    from urllib.request import urlopen

We will use ``urllib.request.urlopen`` to retrieve the web page with our data
and parse it using the ``html`` module:

.. code-block:: python

    page = urlopen('http://econpy.pythonanywhere.com/ex/001.html')
    tree = html.fromstring(page.read())

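A slightly more defensive version of the fetch step might set a timeout and
close the connection once the page has been read. This is only a sketch: the
``fetch_page`` helper and its 10-second default are our own illustrative
choices, not part of lxml or of the example page.

.. code-block:: python

    from urllib.request import urlopen


    def fetch_page(url, timeout=10):
        """Return the raw bytes served at ``url``.

        ``fetch_page`` is a hypothetical helper; the default timeout
        is an assumption, not a recommended value.
        """
        # The ``with`` block closes the connection even if reading fails.
        with urlopen(url, timeout=timeout) as response:
            return response.read()

The returned bytes can be handed to ``html.fromstring`` in the same way as
``page.read()``.
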
``tree`` now contains the whole HTML file in a nice tree structure which
we can go over in many different ways, one of which is using XPath. XPath
is a way of locating information in structured documents such as HTML or XML
pages. A good introduction to XPath is
`here <http://www.w3schools.com/xpath/default.asp>`_.
One can also use various tools for obtaining the XPath of elements, such as
Firebug for Firefox. In Chrome, you can right-click an element, choose
'Inspect element', highlight the code, then right-click again and choose
'Copy XPath'.

After a quick analysis, we see that in our page the data is contained in
two kinds of elements: one is a div with title 'buyer-name' and the other is
a span with class 'item-price'. Knowing this, we can create the correct XPath
queries and use the lxml ``xpath`` function like this:

.. code-block:: python

    # This will create a list of buyers:
    buyers = tree.xpath('//div[@title="buyer-name"]/text()')
    # This will create a list of prices:
    prices = tree.xpath('//span[@class="item-price"]/text()')

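To experiment with these queries without fetching anything, the same calls
can be run against a small inline document. The markup below is hand-written
to mimic the structure described above; its names and prices are invented
for illustration and are not taken from the real page.

.. code-block:: python

    from lxml import html

    # Made-up sample that copies only the structure of the real page.
    SAMPLE = '''
    <html><body>
      <div title="buyer-name">Ada Example</div>
      <span class="item-price">$1.50</span>
      <div title="buyer-name">Bob Example</div>
      <span class="item-price">$2.25</span>
    </body></html>
    '''

    sample_tree = html.fromstring(SAMPLE)

    # '//' searches the whole document; '[@attr="value"]' filters on an
    # attribute; 'text()' selects the text inside each matching element.
    sample_buyers = sample_tree.xpath('//div[@title="buyer-name"]/text()')
    sample_prices = sample_tree.xpath('//span[@class="item-price"]/text()')

Each query returns a plain Python list of strings, one per matching element,
in document order.
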
Let's see what we got exactly:

.. code-block:: python

    print('Buyers: {}'.format(buyers))
    print('Prices: {}'.format(prices))

::

    Buyers: ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes',
    'Derri Anne Connecticut', 'Moe Dess', 'Leda Doggslife', 'Dan Druff',
    'Al Fresco', 'Ido Hoe', 'Howie Kisses', 'Len Lease', 'Phil Meup',
    'Ira Pent', 'Ben D. Rules', 'Ave Sectomy', 'Gary Shattire',
    'Bobbi Soks', 'Sheila Takya', 'Rose Tattoo', 'Moe Tell']

    Prices: ['$29.95', '$8.37', '$15.26', '$19.25', '$19.25',
    '$13.99', '$31.57', '$8.49', '$14.47', '$15.86', '$11.11',
    '$15.98', '$16.27', '$7.50', '$50.85', '$14.26', '$5.68',
    '$15.00', '$114.07', '$10.09']

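Both queries walk the page in document order, so the two lists should line
up by position. A minimal sketch of pairing them and turning the price
strings into numbers could look like this. It assumes every buyer has exactly
one price, and ``parse_price`` is a hypothetical helper of our own.

.. code-block:: python

    from decimal import Decimal

    # A short excerpt of the scraped lists, repeated so the block runs alone.
    buyers = ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes']
    prices = ['$29.95', '$8.37', '$15.26']


    def parse_price(text):
        """Turn a string such as '$29.95' into a Decimal."""
        return Decimal(text.strip().lstrip('$').replace(',', ''))


    # zip() pairs items by position and stops at the shorter list.
    purchases = [(buyer, parse_price(price))
                 for buyer, price in zip(buyers, prices)]
    total = sum(amount for _, amount in purchases)

``Decimal`` is used instead of ``float`` so that currency amounts add up
without binary rounding surprises.
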
Congratulations! We have successfully scraped all the data we wanted from
a web page using lxml and we have it stored in memory as two lists. Now we
can either continue our work on it, analyzing it using Python, or we can
export it to a file and share it with friends.

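As one possible way to export the data, the standard library's ``csv`` module
can write the two lists side by side. This is a sketch rather than the only
approach: the ``write_purchases`` helper, the column headers, and the file
name ``purchases.csv`` are all arbitrary choices made for this example.

.. code-block:: python

    import csv

    # A short excerpt of the scraped lists, repeated so the block runs alone.
    buyers = ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes']
    prices = ['$29.95', '$8.37', '$15.26']


    def write_purchases(fileobj, buyers, prices):
        """Write a header row, then one buyer/price row per purchase."""
        writer = csv.writer(fileobj)
        writer.writerow(['buyer', 'price'])
        writer.writerows(zip(buyers, prices))


    # newline='' lets the csv module control line endings itself.
    with open('purchases.csv', 'w', newline='') as csv_file:
        write_purchases(csv_file, buyers, prices)

The resulting file opens in any spreadsheet program, which makes it easy to
share.
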