From faae04c3a3c89e8033e8e79181008aca4b81ae4f Mon Sep 17 00:00:00 2001 From: sirMackk Date: Mon, 31 Dec 2012 10:22:38 -0500 Subject: [PATCH 1/7] Added scenario about web scraping using lxml --- docs/scenarios/scrape.rst | 82 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 82 insertions(+) create mode 100644 docs/scenarios/scrape.rst diff --git a/docs/scenarios/scrape.rst b/docs/scenarios/scrape.rst new file mode 100644 index 0000000..3537938 --- /dev/null +++ b/docs/scenarios/scrape.rst @@ -0,0 +1,82 @@ +HTML Scraping +============= + +Web Scraping +------------ + +Web sites are written using HTML, which means that each web page is a + structured document. Sometimes it would be great to obtain some data from +them and preserve the structure while we're at it, but this isn't always easy + - it's not often that web sites provide their data in comfortable formats + such as `.csv`. + +This is where web scraping comes in. Web scraping is the practice of using +computer program to sift through a web page and gather the data that you need +in a format most useful to you. + +lxml +---- + +`lxml `_ is a pretty extensive library written for parsing +XML and HTML documents, which you can easily install using `pip`. We will +be using its `html` module to get data from this web page: `econpy '_ . + +First we shall import the required modules: + +.. code-block:: python + + from lxml import html + from urllib2 import urlopen + +We will use `urllib2.urlopen` to retrieve the web page with our data and +parse it using the `html` module: + +.. code-block:: python + + page = urlopen('http://econpy.pythonanywhere.com/ex/001.html') + tree = html.fromstring(page.read()) + +`tree` now contains the whole HTML file in a nice tree structure which +we can go over in many different ways, one of which is using XPath. XPath +is a way of locating information in structured documents such as HTML or XML +pages. A good introduction to XPath is 'here '_ . 
+One can also use various tools for obtaining the XPath of elements such as +FireBug for Firefox or in Chrome you can right click an element, choose +'Inspect element', highlight the code and the right click again and choose +'Copy XPath'. + +After a quick analysis, we see that in our page the data is contained in +two elements - one is a div with title 'buyer-name' and the other is a +span with class 'item-price'. Knowing this we can create the correct XPath +query and use the lxml `xpath` function like this: + +.. code-block:: python + + #This will create a list of buyers: + buyers = tree.xpath('//div[@title="buyer-name"]/text()') + #This will create a list of prices + prices = tree.xpath('//span[@class="item-price"]/text()') + +Lets see what we got exactly: + +.. code-block:: python + + print 'Buyers: ', buyers + print 'Prices: ', prices + +:: + Buyers: ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes', + 'Derri Anne Connecticut', 'Moe Dess', 'Leda Doggslife', 'Dan Druff', + 'Al Fresco', 'Ido Hoe', 'Howie Kisses', 'Len Lease', 'Phil Meup', + 'Ira Pent', 'Ben D. Rules', 'Ave Sectomy', 'Gary Shattire', + 'Bobbi Soks', 'Sheila Takya', 'Rose Tattoo', 'Moe Tell'] + + Prices: ['$29.95', '$8.37', '$15.26', '$19.25', '$19.25', + '$13.99', '$31.57', '$8.49', '$14.47', '$15.86', '$11.11', + '$15.98', '$16.27', '$7.50', '$50.85', '$14.26', '$5.68', + '$15.00', '$114.07', '$10.09'] + +Congratulations! We have successfully scraped all the data we wanted from +a web page using lxml and we have it stored in memory as two lists. Now we +can either continue our work on it, analyzing it using python or we can +export it to a file and share it with friends. From 3aef3bd8ef0c9fa2c008e1fc31795af86cde34ce Mon Sep 17 00:00:00 2001 From: sirMackk Date: Mon, 31 Dec 2012 10:27:29 -0500 Subject: [PATCH 2/7] 2nd draft of web scraping scenario Fixed some markup. 
--- docs/scenarios/scrape.rst | 19 ++++++++++--------- 1 file changed, 10 insertions(+), 9 deletions(-) diff --git a/docs/scenarios/scrape.rst b/docs/scenarios/scrape.rst index 3537938..b2bd070 100644 --- a/docs/scenarios/scrape.rst +++ b/docs/scenarios/scrape.rst @@ -5,10 +5,10 @@ Web Scraping ------------ Web sites are written using HTML, which means that each web page is a - structured document. Sometimes it would be great to obtain some data from -them and preserve the structure while we're at it, but this isn't always easy - - it's not often that web sites provide their data in comfortable formats - such as `.csv`. +structured document. Sometimes it would be great to obtain some data from +them and preserve the structure while we're at it, but this isn't always easy. +It's not often that web sites provide their data in comfortable formats + such as ``.csv``. This is where web scraping comes in. Web scraping is the practice of using computer program to sift through a web page and gather the data that you need @@ -18,8 +18,8 @@ lxml ---- `lxml `_ is a pretty extensive library written for parsing -XML and HTML documents, which you can easily install using `pip`. We will -be using its `html` module to get data from this web page: `econpy '_ . +XML and HTML documents, which you can easily install using ``pip``. We will +be using its `html` module to get data from this web page: `econpy `_ . First we shall import the required modules: @@ -28,8 +28,8 @@ First we shall import the required modules: from lxml import html from urllib2 import urlopen -We will use `urllib2.urlopen` to retrieve the web page with our data and -parse it using the `html` module: +We will use ``urllib2.urlopen`` to retrieve the web page with our data and +parse it using the ``html`` module: .. 
code-block:: python @@ -39,7 +39,7 @@ parse it using the `html` module: `tree` now contains the whole HTML file in a nice tree structure which we can go over in many different ways, one of which is using XPath. XPath is a way of locating information in structured documents such as HTML or XML -pages. A good introduction to XPath is 'here '_ . +pages. A good introduction to XPath is `here `_ . One can also use various tools for obtaining the XPath of elements such as FireBug for Firefox or in Chrome you can right click an element, choose 'Inspect element', highlight the code and the right click again and choose @@ -65,6 +65,7 @@ Lets see what we got exactly: print 'Prices: ', prices :: + Buyers: ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes', 'Derri Anne Connecticut', 'Moe Dess', 'Leda Doggslife', 'Dan Druff', 'Al Fresco', 'Ido Hoe', 'Howie Kisses', 'Len Lease', 'Phil Meup', From c3d7bddf3d6c0fa8781f7a0e64389abc20e6892a Mon Sep 17 00:00:00 2001 From: sirMackk Date: Mon, 31 Dec 2012 10:31:08 -0500 Subject: [PATCH 3/7] Third, final markup fixes. --- docs/scenarios/scrape.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/scenarios/scrape.rst b/docs/scenarios/scrape.rst index b2bd070..8fe6e03 100644 --- a/docs/scenarios/scrape.rst +++ b/docs/scenarios/scrape.rst @@ -8,7 +8,7 @@ Web sites are written using HTML, which means that each web page is a structured document. Sometimes it would be great to obtain some data from them and preserve the structure while we're at it, but this isn't always easy. It's not often that web sites provide their data in comfortable formats - such as ``.csv``. +such as ``.csv``. This is where web scraping comes in. Web scraping is the practice of using computer program to sift through a web page and gather the data that you need @@ -19,7 +19,7 @@ lxml `lxml `_ is a pretty extensive library written for parsing XML and HTML documents, which you can easily install using ``pip``. 
We will -be using its `html` module to get data from this web page: `econpy `_ . +be using its ``html`` module to get example data from this web page: `econpy.org `_ . First we shall import the required modules: From 83c9cba2a3ad6c161eb686bd071c625873f4aa32 Mon Sep 17 00:00:00 2001 From: sirMackk Date: Mon, 31 Dec 2012 10:37:21 -0500 Subject: [PATCH 4/7] Added a bit more code to improve understanding. --- docs/scenarios/scrape.rst | 26 ++++++++++++++++++++------ 1 file changed, 20 insertions(+), 6 deletions(-) diff --git a/docs/scenarios/scrape.rst b/docs/scenarios/scrape.rst index 8fe6e03..aa42bcc 100644 --- a/docs/scenarios/scrape.rst +++ b/docs/scenarios/scrape.rst @@ -36,10 +36,13 @@ parse it using the ``html`` module: page = urlopen('http://econpy.pythonanywhere.com/ex/001.html') tree = html.fromstring(page.read()) -`tree` now contains the whole HTML file in a nice tree structure which -we can go over in many different ways, one of which is using XPath. XPath -is a way of locating information in structured documents such as HTML or XML -pages. A good introduction to XPath is `here `_ . +``tree`` now contains the whole HTML file in a nice tree structure which +we can go over two different ways: XPath and CSSSelect. In this example, I +will focus on the former. + +XPath is a way of locating information in structured documents such as +HTML or XML pages. A good introduction to XPath is `here `_ . + One can also use various tools for obtaining the XPath of elements such as FireBug for Firefox or in Chrome you can right click an element, choose 'Inspect element', highlight the code and the right click again and choose @@ -47,8 +50,15 @@ FireBug for Firefox or in Chrome you can right click an element, choose After a quick analysis, we see that in our page the data is contained in two elements - one is a div with title 'buyer-name' and the other is a -span with class 'item-price'. 
Knowing this we can create the correct XPath -query and use the lxml `xpath` function like this: +span with class 'item-price': + +.. code-bloc:: html + +
<div title="buyer-name">Carson Busses</div>
+ <span class="item-price">$29.95</span> + +Knowing this we can create the correct XPath query and use the lxml +``xpath`` function like this: .. code-block:: python @@ -81,3 +91,7 @@ Congratulations! We have successfully scraped all the data we wanted from a web page using lxml and we have it stored in memory as two lists. Now we can either continue our work on it, analyzing it using python or we can export it to a file and share it with friends. + +A cool idea to think about is writing a script to iterate through the rest +of the pages of this example data set or making this application use +threads to improve its speed. From a22a6e92fa7c60bf6673a6c8b1b1bb9d6a4ec4bb Mon Sep 17 00:00:00 2001 From: sirMackk Date: Mon, 31 Dec 2012 10:39:25 -0500 Subject: [PATCH 5/7] Fixing html code-block --- docs/scenarios/scrape.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/scenarios/scrape.rst b/docs/scenarios/scrape.rst index aa42bcc..ca7a44e 100644 --- a/docs/scenarios/scrape.rst +++ b/docs/scenarios/scrape.rst @@ -52,7 +52,7 @@ After a quick analysis, we see that in our page the data is contained in two elements - one is a div with title 'buyer-name' and the other is a span with class 'item-price': -.. code-bloc:: html +::
<div title="buyer-name">Carson Busses</div>
<span class="item-price">$29.95</span> From 32dea94b808925ff9104fec1ba489ecef49cc838 Mon Sep 17 00:00:00 2001 From: sirMackk Date: Mon, 31 Dec 2012 17:16:00 -0500 Subject: [PATCH 6/7] Using requests instead of urllib2, final draft. --- docs/scenarios/scrape.rst | 22 ++++++++++++---------- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/docs/scenarios/scrape.rst b/docs/scenarios/scrape.rst index ca7a44e..e7333c0 100644 --- a/docs/scenarios/scrape.rst +++ b/docs/scenarios/scrape.rst @@ -14,27 +14,29 @@ This is where web scraping comes in. Web scraping is the practice of using computer program to sift through a web page and gather the data that you need in a format most useful to you. -lxml ----- +lxml and Requests +----------------- `lxml `_ is a pretty extensive library written for parsing -XML and HTML documents, which you can easily install using ``pip``. We will -be using its ``html`` module to get example data from this web page: `econpy.org `_ . +XML and HTML documents really fast. It even handles messed up tags. We will +also be using the `Requests `_ module instead of the already built-in urllib2 +due to improvements in speed and readability. You can easily install both +using ``pip install lxml`` and ``pip install requests``. -First we shall import the required modules: +Let's start with the imports: .. code-block:: python from lxml import html - from urllib2 import urlopen + import requests -We will use ``urllib2.urlopen`` to retrieve the web page with our data and -parse it using the ``html`` module: +Next we will use ``requests.get`` to retrieve the web page with our data +and parse it using the ``html`` module and save the results in ``tree``: ..
code-block:: python - page = urlopen('http://econpy.pythonanywhere.com/ex/001.html') - tree = html.fromstring(page.read()) + page = requests.get('http://econpy.pythonanywhere.com/ex/001.html') + tree = html.fromstring(page.text) ``tree`` now contains the whole HTML file in a nice tree structure which we can go over two different ways: XPath and CSSSelect. In this example, I From aa7f9aac98faa1d4c86cce8fa0e60a56ab5e6d6d Mon Sep 17 00:00:00 2001 From: sirMackk Date: Mon, 31 Dec 2012 17:25:34 -0500 Subject: [PATCH 7/7] Final version --- docs/scenarios/scrape.rst | 32 ++++++++++++++++---------------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/docs/scenarios/scrape.rst b/docs/scenarios/scrape.rst index e7333c0..17a0281 100644 --- a/docs/scenarios/scrape.rst +++ b/docs/scenarios/scrape.rst @@ -6,13 +6,13 @@ Web Scraping Web sites are written using HTML, which means that each web page is a structured document. Sometimes it would be great to obtain some data from -them and preserve the structure while we're at it, but this isn't always easy. -It's not often that web sites provide their data in comfortable formats -such as ``.csv``. +them and preserve the structure while we're at it. Web sites +don't always provide their data in comfortable formats such as ``.csv``. -This is where web scraping comes in. Web scraping is the practice of using +This is where web scraping comes in. Web scraping is the practice of using a computer program to sift through a web page and gather the data that you need -in a format most useful to you. +in a format most useful to you while at the same time preserving the structure +of the data. lxml and Requests ----------------- @@ -43,12 +43,12 @@ we can go over two different ways: XPath and CSSSelect. In this example, I will focus on the former. XPath is a way of locating information in structured documents such as -HTML or XML pages. A good introduction to XPath is `here `_ . +HTML or XML documents.
A good introduction to XPath is on `W3Schools `_ . -One can also use various tools for obtaining the XPath of elements such as -FireBug for Firefox or in Chrome you can right click an element, choose -'Inspect element', highlight the code and the right click again and choose -'Copy XPath'. +There are also various tools for obtaining the XPath of elements such as +FireBug for Firefox or if you're using Chrome you can right click an +element, choose 'Inspect element', highlight the code and then right +click again and choose 'Copy XPath'. After a quick analysis, we see that in our page the data is contained in two elements - one is a div with title 'buyer-name' and the other is a @@ -90,10 +90,10 @@ Lets see what we got exactly: '$15.00', '$114.07', '$10.09'] Congratulations! We have successfully scraped all the data we wanted from -a web page using lxml and we have it stored in memory as two lists. Now we -can either continue our work on it, analyzing it using python or we can -export it to a file and share it with friends. +a web page using lxml and Requests. We have it stored in memory as two +lists. Now we can do all sorts of cool stuff with it: we can analyze it +using Python or we can save it to a file and share it with the world. -A cool idea to think about is modifying this script to iterate through -the rest of the pages of this example data set or making this application use -threads to improve its speed. +A cool idea to think about is modifying this script to iterate through +the rest of the pages of this example dataset or rewriting this +application to use threads for improved speed.
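
Follow-up note on the closing paragraph of the final patch, which suggests saving the scraped lists to a file. One way this might look is sketched below. It is not part of the patch series: it assumes Python 3 rather than the Python 2 used in the scenario, it hard-codes the first three entries from the sample output instead of fetching the page, and the helper name ``pairs_to_csv`` and the column names are hypothetical.

```python
import csv
import io


def pairs_to_csv(buyers, prices):
    """Return CSV text with one (buyer, price) row per scraped entry.

    Hypothetical helper for illustration; not part of the scenario.
    """
    if len(buyers) != len(prices):
        raise ValueError("buyers and prices must have the same length")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["buyer", "price_usd"])
    for buyer, price in zip(buyers, prices):
        # Prices arrive as strings such as '$29.95'; strip the dollar sign
        # and convert to a number so the column is usable for analysis.
        writer.writerow([buyer, float(price.lstrip("$"))])
    return buffer.getvalue()


# First three entries taken from the sample output in the scenario.
buyers = ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes']
prices = ['$29.95', '$8.37', '$15.26']

csv_text = pairs_to_csv(buyers, prices)
print(csv_text)
```

Building the text in memory keeps the sketch free of side effects; writing ``csv_text`` to disk with ``open(path, 'w', newline='')`` would be the natural next step. The length check is there because the two XPath queries run independently, so a page with a missing price would otherwise pair buyers with the wrong values without any warning.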