mirror of https://github.com/kennethreitz/python-guide.git, synced 2026-06-05 14:50:19 +00:00
2nd draft of web scraping scenario
Fixed some markup.
@@ -5,10 +5,10 @@ Web Scraping
 ------------
 
 Web sites are written using HTML, which means that each web page is a
 structured document. Sometimes it would be great to obtain some data from
-them and preserve the structure while we're at it, but this isn't always easy
-- it's not often that web sites provide their data in comfortable formats
-such as `.csv`.
+them and preserve the structure while we're at it, but this isn't always easy.
+It's not often that web sites provide their data in comfortable formats
+such as ``.csv``.
 
 This is where web scraping comes in. Web scraping is the practice of using
 computer program to sift through a web page and gather the data that you need
@@ -18,8 +18,8 @@ lxml
 ----
 
 `lxml <http://lxml.de/>`_ is a pretty extensive library written for parsing
-XML and HTML documents, which you can easily install using `pip`. We will
-be using its `html` module to get data from this web page: `econpy <http://econpy.pythonanywhere.com/ex/001.html>'_ .
+XML and HTML documents, which you can easily install using ``pip``. We will
+be using its `html` module to get data from this web page: `econpy <http://econpy.pythonanywhere.com/ex/001.html>`_ .
 
 First we shall import the required modules:
 
@@ -28,8 +28,8 @@ First we shall import the required modules:
     from lxml import html
     from urllib2 import urlopen
 
-We will use `urllib2.urlopen` to retrieve the web page with our data and
-parse it using the `html` module:
+We will use ``urllib2.urlopen`` to retrieve the web page with our data and
+parse it using the ``html`` module:
 
 .. code-block:: python
 
@@ -39,7 +39,7 @@ parse it using the `html` module:
 `tree` now contains the whole HTML file in a nice tree structure which
 we can go over in many different ways, one of which is using XPath. XPath
 is a way of locating information in structured documents such as HTML or XML
-pages. A good introduction to XPath is 'here <http://www.w3schools.com/xpath/default.asp>'_ .
+pages. A good introduction to XPath is `here <http://www.w3schools.com/xpath/default.asp>`_ .
 One can also use various tools for obtaining the XPath of elements such as
 FireBug for Firefox or in Chrome you can right click an element, choose
 'Inspect element', highlight the code and the right click again and choose
@@ -65,6 +65,7 @@ Lets see what we got exactly:
     print 'Prices: ', prices
 
 ::
 
     Buyers: ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes',
     'Derri Anne Connecticut', 'Moe Dess', 'Leda Doggslife', 'Dan Druff',
     'Al Fresco', 'Ido Hoe', 'Howie Kisses', 'Len Lease', 'Phil Meup',
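
As a rough illustration of the XPath lookup that the changed section describes, the sketch below runs attribute-based queries over a small inline document. It is a sketch under stated assumptions, not the guide's own code: the markup, the ``buyer-name`` and ``item-price`` attribute values, and the prices are hypothetical stand-ins, and the standard library's ``xml.etree.ElementTree`` is swapped in for ``lxml`` so that it runs without a network connection or a third-party install. ElementTree implements only a subset of XPath, so the text is read from each matched element's ``.text`` attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the scraped page.
# The real page's markup and prices may differ.
PAGE = """
<html>
  <body>
    <div title="buyer-name">Carson Busses</div>
    <span class="item-price">$29.95</span>
    <div title="buyer-name">Earl E. Byrd</div>
    <span class="item-price">$8.37</span>
  </body>
</html>
"""

# Parse the string into an element tree rooted at <html>.
tree = ET.fromstring(PAGE)

# ElementTree's XPath subset supports attribute predicates but has no
# text() step, so collect .text from each matched element instead.
buyers = [el.text for el in tree.findall(".//div[@title='buyer-name']")]
prices = [el.text for el in tree.findall(".//span[@class='item-price']")]

print('Buyers:', buyers)
print('Prices:', prices)
```

With ``lxml``, the parsed tree's ``xpath()`` method accepts full XPath 1.0 expressions, including a trailing ``/text()`` step, so the list comprehensions would not be needed there.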