mirror of
https://github.com/kennethreitz/python-guide.git
synced 2026-06-05 23:00:18 +00:00
Using requests instead of urllib2, final draft.
+12
-10
@@ -14,27 +14,29 @@ This is where web scraping comes in. Web scraping is the practice of using
 computer program to sift through a web page and gather the data that you need
 in a format most useful to you.
 
-lxml
-----
+lxml and Requests
+-----------------
 
 `lxml <http://lxml.de/>`_ is a pretty extensive library written for parsing
-XML and HTML documents, which you can easily install using ``pip``. We will
-be using its ``html`` module to get example data from this web page: `econpy.org <http://econpy.pythonanywhere.com/ex/001.html>`_ .
+XML and HTML documents really fast. It even handles messed up tags. We will
+also be using the `Requests <http://docs.python-requests.org/en/latest/>`_ module instead of the already built-in urllib2
+due to improvements in speed and readability. You can easily install both
+using ``pip install lxml`` and ``pip install requests``.
 
-First we shall import the required modules:
+Let's start with the imports:
 
 .. code-block:: python
 
     from lxml import html
-    from urllib2 import urlopen
+    import requests
 
-We will use ``urllib2.urlopen`` to retrieve the web page with our data and
-parse it using the ``html`` module:
+Next we will use ``requests.get`` to retrieve the web page with our data,
+parse it using the ``html`` module, and save the results in ``tree``:
 
 .. code-block:: python
 
-    page = urlopen('http://econpy.pythonanywhere.com/ex/001.html')
-    tree = html.fromstring(page.read())
+    page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
+    tree = html.fromstring(page.text)
 
 ``tree`` now contains the whole HTML file in a nice tree structure which
 we can go over two different ways: XPath and CSSSelect. In this example, I
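For readers trying the change locally, the fetch-and-parse step introduced by this commit could be exercised roughly as follows. This is only a sketch under stated assumptions: the helper name ``fetch_tree``, the ``timeout`` value, and the inline sample markup are hypothetical and are not part of the commit. It also passes ``response.content`` (bytes) rather than ``page.text`` so that lxml can work out the encoding itself, which is a design choice of this sketch rather than what the diff does.

```python
from lxml import html
import requests


def fetch_tree(url, timeout=10):
    """Download a page and parse it into an lxml element tree.

    Hypothetical helper for illustration; not part of the commit.
    """
    response = requests.get(url, timeout=timeout)
    # Fail loudly on 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    # Hand lxml the raw bytes so it can detect the document encoding.
    return html.fromstring(response.content)


# Offline illustration: the same parser applied to made-up markup,
# then queried with XPath (one of the two approaches the guide mentions).
SAMPLE = (
    '<html><body>'
    '<div class="name">Alice</div><span class="price">$1.50</span>'
    '<div class="name">Bob</div><span class="price">$2.00</span>'
    '</body></html>'
)
sample_tree = html.fromstring(SAMPLE)
names = sample_tree.xpath('//div[@class="name"]/text()')
prices = sample_tree.xpath('//span[@class="price"]/text()')
```

With network access, ``fetch_tree('http://econpy.pythonanywhere.com/ex/001.html')`` would return a tree that can be queried the same way, although the XPath expressions would need to match that page's actual markup.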