<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
 
<title>sitescraper.net</title>
 <link href="http://sitescraper.net/atom.xml" rel="self"/>
 <link href="http://sitescraper.net/"/>
 <updated>2012-02-19T01:47:51+11:00</updated>
 <id>http://sitescraper.net/</id>
 <author>
   <name>Richard Penman</name>
   <email>richard@sitescraper.net</email>
 </author>

 
 <entry>
   <title>Automating webkit</title>
   <link href="http://sitescraper.net/blog/Automating-webkit/"/>
   <updated>2012-02-14T00:00:00+11:00</updated>
   <id>http://sitescraper.net/blog/Automating-webkit</id>
   <content type="html">&lt;p&gt;I have received some enquiries about using webkit for web scraping, so here is an example using the &lt;a href=&quot;http://code.google.com/p/webscraping/&quot;&gt;webscraping module&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;webscraping&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;webkit&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;webkit&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;WebkitBrowser&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gui&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; 
&lt;span class=&quot;c&quot;&gt;# load webpage&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;http://duckduckgo.com&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# fill search textbox &lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fill&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;input[id=search_form_input_homepage]&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;sitescraper&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# take screenshot of browser&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;screenshot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;duckduckgo_search.jpg&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# click search button &lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;click&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;input[id=search_button_homepage]&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# wait on results page&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;wait&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# take another screenshot&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;screenshot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;duckduckgo_results.jpg&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;Here are the saved screenshots:&lt;/p&gt;

&lt;table&gt;
    &lt;tr&gt;
        &lt;td&gt;&lt;img src=&quot;/static/img/blog/duckduckgo_search.jpg&quot; /&gt;&lt;/td&gt;
        &lt;td&gt;&lt;img src=&quot;/static/img/blog/duckduckgo_results.jpg&quot; /&gt;&lt;/td&gt;
    &lt;/tr&gt;
&lt;/table&gt;


&lt;p&gt;I often use webkit when working with websites that rely heavily on JavaScript.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Caching data efficiently</title>
   <link href="http://sitescraper.net/blog/Caching-data-efficiently/"/>
   <updated>2012-02-10T00:00:00+11:00</updated>
   <id>http://sitescraper.net/blog/Caching-data-efficiently</id>
   <content type="html">&lt;p&gt;When crawling websites I usually cache all HTML on disk to avoid having to redownload later. I wrote the &lt;a href=&quot;http://code.google.com/p/webscraping/source/browse/pdict.py&quot;&gt;pdict module&lt;/a&gt; to automate this process. Here is an example:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;webscraping&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pdict&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# initialize cache&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;cache&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pdict&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PersistentDict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;test.db&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# compress and store content in the database&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;cache&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;html&lt;/span&gt; 

&lt;span class=&quot;c&quot;&gt;# iterate all data in the database&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;key&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cache&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cache&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;Insertions are the bottleneck with the cache, so for efficiency records can be buffered and then inserted in a single transaction:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;python&quot;&gt;&lt;span class=&quot;c&quot;&gt;# dictionary of data to insert&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# cache each record individually (2m49.827s)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;cache&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pdict&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PersistentDict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;test.db&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_buffer_size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;items&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;cache&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# cache all records in a single transaction (0m0.774s)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;cache&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pdict&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PersistentDict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;test.db&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_buffer_size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;items&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;cache&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;In this example, caching all records at once takes less than a second, but caching each record individually takes almost 3 minutes.&lt;/p&gt;
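
&lt;p&gt;The difference between the two strategies can be sketched with the standard sqlite3 and zlib modules. This is a minimal illustration written for Python 3, not the pdict implementation: the table layout and the function names are assumptions made for the example.&lt;/p&gt;

```python
import sqlite3
import zlib


def open_cache(path=':memory:'):
    """Create a key/value table; ':memory:' keeps the sketch self-contained."""
    conn = sqlite3.connect(path)
    conn.execute('CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value BLOB)')
    return conn


def insert_individually(conn, data):
    """Commit after every record, so each insert is its own transaction."""
    for key, value in data.items():
        conn.execute('INSERT OR REPLACE INTO cache VALUES (?, ?)',
                     (key, zlib.compress(value.encode('utf-8'))))
        conn.commit()


def insert_buffered(conn, data):
    """Buffer all records, then write them in one transaction."""
    rows = [(key, zlib.compress(value.encode('utf-8')))
            for key, value in data.items()]
    with conn:  # commits once on success, rolls back on error
        conn.executemany('INSERT OR REPLACE INTO cache VALUES (?, ?)', rows)


def lookup(conn, key):
    """Fetch and decompress the stored value for a key."""
    row = conn.execute('SELECT value FROM cache WHERE key = ?', (key,)).fetchone()
    return zlib.decompress(row[0]).decode('utf-8')
```

&lt;p&gt;With an on-disk database each commit normally waits for the data to reach the disk, which is why one commit for many rows tends to be much cheaper than one commit per row.&lt;/p&gt;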
</content>
 </entry>
 
 <entry>
   <title>How to make python faster</title>
   <link href="http://sitescraper.net/blog/How-to-make-python-faster/"/>
   <updated>2012-02-01T00:00:00+11:00</updated>
   <id>http://sitescraper.net/blog/How-to-make-python-faster</id>
   <content type="html">&lt;p&gt;Python and other scripting languages are sometimes dismissed because of their inefficiency compared to compiled languages like C. For example here are implementations of the &lt;a href=&quot;http://en.wikipedia.org/wiki/Fibonacci_number&quot;&gt;fibonacci sequence&lt;/a&gt; in C and Python:&lt;/p&gt;

&lt;table cellspacing=&quot;10&quot;&gt;
&lt;tr&gt;
&lt;td&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;c&quot;&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;){&lt;/span&gt;
   &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
     &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
   &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;
     &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;fib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;40&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;/td&gt;
&lt;td&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;python&quot;&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;fib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
     &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
     &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;fib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;40&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;/td&gt;
&lt;/tr&gt;
&lt;/table&gt;


&lt;p&gt;And here are the execution times:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ time ./fib
3.099s
$ time python fib.py
16.655s
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As expected, C has a much faster execution time: about 5x faster in this case.&lt;/p&gt;
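
&lt;p&gt;The shell's time command also counts interpreter start-up. As a rough cross-check, here is a minimal sketch that times only the function call with the standard timeit module. The helper name time_fib and the smaller argument 25 are arbitrary choices to keep the run short, so its output is not comparable to the timings above.&lt;/p&gt;

```python
import timeit


def fib(n):
    if n > 1:
        return fib(n - 1) + fib(n - 2)
    return n  # base cases: fib(0) = 0 and fib(1) = 1


def time_fib(n, repeat=3):
    """Return the best wall-clock time in seconds over a few runs."""
    return min(timeit.repeat(lambda: fib(n), number=1, repeat=repeat))


if __name__ == '__main__':
    print('fib(25) took %.3fs' % time_fib(25))
```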

&lt;p&gt;In the context of web scraping, execution speed matters less because the bottleneck is I/O: downloading the webpages. But I use Python in other contexts too, so let's see if we can do better.&lt;/p&gt;

&lt;p&gt;First install &lt;a href=&quot;http://psyco.sourceforge.net/&quot;&gt;psyco&lt;/a&gt;. On Debian or Ubuntu this is just:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;sudo apt-get install python-psyco
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Then modify the Python script to call psyco:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;psyco&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;psyco&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;full&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;fib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
     &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
     &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;fib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;40&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;And here is the updated execution time:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ time python fib.py
3.190s
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Just 3 seconds: with psyco the execution time is now equivalent to the C example! Psyco achieves this by compiling functions to machine code on the fly instead of interpreting them.&lt;/p&gt;

&lt;p&gt;I now add the snippet below to most of my Python scripts to take advantage of psyco when installed:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;python&quot;&gt;&lt;span class=&quot;k&quot;&gt;try&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;psyco&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;psyco&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;full&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;except&lt;/span&gt; &lt;span class=&quot;ne&quot;&gt;ImportError&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;pass&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# psyco not installed so continue as usual&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;



</content>
 </entry>
 
 <entry>
   <title>Automatic web scraping</title>
   <link href="http://sitescraper.net/blog/Automatic-web-scraping/"/>
   <updated>2012-01-04T00:00:00+11:00</updated>
   <id>http://sitescraper.net/blog/Automatic-web-scraping</id>
   <content type="html">&lt;p&gt;I have been interested in automatic approaches to web scraping for a while. During university I created the &lt;a href=&quot;http://code.google.com/p/sitescraper&quot;&gt;SiteScraper library&lt;/a&gt;, which used training cases to automatically scrape webpages. These days most of my work involves scraping listings from website directories so an automatic solution is not necessary. It is easier for me to specify the XPaths required than build a model with SiteScraper.&lt;/p&gt;

&lt;p&gt;Still, I would like to work more efficiently and use an automated approach. Ideally I would have a solution that, when given a URL:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;crawls the website&lt;/li&gt;
&lt;li&gt;organizes the webpages into groups that share the same template (a directory page will have a different HTML structure than a listing page)&lt;/li&gt;
&lt;li&gt;assumes the group with the most webpages is the listings&lt;/li&gt;
&lt;li&gt;compares these listing webpages to find what is static (the template) and what changes&lt;/li&gt;
&lt;li&gt;treats the parts that change as dynamic data such as description, reviews, etc.&lt;/li&gt;
&lt;/ul&gt;
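
&lt;p&gt;The grouping and comparison steps above can be sketched as follows. This is a toy illustration written for Python 3 with the standard html.parser module, not SiteScraper or any existing tool. It assumes the crawl has already produced a dictionary mapping each URL to its HTML, and it treats two pages as sharing a template only when their tag sequences are identical, which is far stricter than real webpages allow. All function names are hypothetical.&lt;/p&gt;

```python
from collections import defaultdict
from html.parser import HTMLParser


class _Outline(HTMLParser):
    """Collect the tag sequence (the template) and the text chunks of a page."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        if data.strip():
            self.texts.append(data.strip())


def outline(html):
    """Return (tag sequence, text chunks) for one page."""
    parser = _Outline()
    parser.feed(html)
    parser.close()
    return tuple(parser.tags), parser.texts


def group_by_template(pages):
    """Group URLs whose pages share an identical tag sequence."""
    groups = defaultdict(list)
    for url, html in pages.items():
        groups[outline(html)[0]].append(url)
    return list(groups.values())


def dynamic_text(pages, urls):
    """For each URL keep the text chunks that are not on every page in the group."""
    texts = {url: outline(pages[url])[1] for url in urls}
    static = set.intersection(*(set(chunks) for chunks in texts.values()))
    return {url: [t for t in chunks if t not in static]
            for url, chunks in texts.items()}


def scrape_listings(pages):
    """Assume the largest template group holds the listings."""
    listings = max(group_by_template(pages), key=len)
    return dynamic_text(pages, listings)
```

&lt;p&gt;Calling scrape_listings with the crawled pages returns, for each listing URL, the text that differs from the other listings. A practical version would need fuzzy template matching, since listing pages rarely have exactly the same structure.&lt;/p&gt;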


&lt;p&gt;Apparently this process of scraping data automatically is known as &lt;em&gt;wrapper induction&lt;/em&gt; in academia. Unfortunately there do not seem to be any good open source solutions yet. The most commonly referenced one is &lt;a href=&quot;http://www.holovaty.com/writing/templatemaker&quot;&gt;Templatemaker&lt;/a&gt;, which is aimed at small text blocks and crashes on my test cases of real webpages. The author stopped development in 2007.&lt;/p&gt;

&lt;p&gt;Some commercial groups have developed their own solutions, so this certainly is possible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://www.dapper.net/technology.php&quot;&gt;Dapper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://kapowsoftware.com/products/kapow-katalyst-platform/index.php&quot;&gt;Kapow&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;If no open source solutions are released I plan to attempt building my own later this year.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Threading with webkit</title>
   <link href="http://sitescraper.net/blog/Threading-with-webkit/"/>
   <updated>2011-12-30T00:00:00+11:00</updated>
   <id>http://sitescraper.net/blog/Threading-with-webkit</id>
   <content type="html">&lt;p&gt;In &lt;a href=&quot;/blog/Scraping-multiple-JavaScript-webpages-with-webkit/&quot;&gt;a previous post&lt;/a&gt; I showed how to scrape a list of webpages. That is fine for small crawls but will take too long otherwise. Here is an updated example that downloads the content in multiple threads.&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sys&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;collections&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;deque&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# threadsafe datatype&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;PyQt4.QtCore&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;PyQt4.QtGui&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;PyQt4.QtWebKit&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;NUM_THREADS&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# how many threads to use&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Render&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;QWebView&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;active&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;deque&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# track how many threads are still active&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# store the data&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;urls&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;QWebView&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;loadFinished&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;connect&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_loadFinished&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;urls&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;urls&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;crawl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;crawl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;try&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;url&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;urls&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pop&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;downloading&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;Render&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;active&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;QUrl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;except&lt;/span&gt; &lt;span class=&quot;ne&quot;&gt;IndexError&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;c&quot;&gt;# no more urls to process&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;not&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Render&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;active&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
                &lt;span class=&quot;c&quot;&gt;# no more threads downloading&lt;/span&gt;
                &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;finished&amp;#39;&lt;/span&gt;
                &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;close&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;_loadFinished&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;c&quot;&gt;# process the downloaded html&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;frame&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;page&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mainFrame&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;url&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;frame&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toString&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;Render&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;frame&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toHtml&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;Render&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;active&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;popleft&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;crawl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# crawl next URL in the list&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;app&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;QApplication&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sys&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;argv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# can only instantiate this once so must move outside class &lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;urls&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;deque&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;http://sitescraper.net&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;http://sitescraper.net/questions&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&amp;#39;http://sitescraper.net/blog&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;http://sitescraper.net/projects&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;renders&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Render&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;urls&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NUM_THREADS&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;exec_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# will execute qt loop until class calls close event&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Render&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;keys&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;



</content>
 </entry>
 
 <entry>
   <title>Scraping multiple JavaScript webpages with webkit</title>
   <link href="http://sitescraper.net/blog/Scraping-multiple-JavaScript-webpages-with-webkit/"/>
   <updated>2011-12-06T00:00:00+11:00</updated>
   <id>http://sitescraper.net/blog/Scraping-multiple-JavaScript-webpages-with-webkit</id>
   <content type="html">&lt;p&gt;I made &lt;a href=&quot;/blog/Scraping-JavaScript-webpages-with-webkit/&quot;&gt;an earlier post&lt;/a&gt; about using webkit to process the JavaScript in a webpage so you can access the resulting HTML. A few people asked how to apply this to multiple webpages, so here it is:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sys&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;PyQt4.QtCore&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;PyQt4.QtGui&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;PyQt4.QtWebKit&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Render&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;QWebPage&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;  
  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;urls&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;  
    &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;app&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;QApplication&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sys&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;argv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  
    &lt;span class=&quot;n&quot;&gt;QWebPage&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  
    &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;loadFinished&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;connect&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_loadFinished&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  
    &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;urls&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;urls&lt;/span&gt;  
    &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# store downloaded HTML in a dict  &lt;/span&gt;
    &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;crawl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;  
    &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;exec_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;  
      
  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;crawl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;  
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;urls&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;  
      &lt;span class=&quot;n&quot;&gt;url&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;urls&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pop&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  
      &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;Downloading&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;  
      &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mainFrame&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;QUrl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;  
    &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;  
      &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;quit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;  
        
  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;_loadFinished&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;  
    &lt;span class=&quot;n&quot;&gt;frame&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mainFrame&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;  
    &lt;span class=&quot;n&quot;&gt;url&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;frame&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toString&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt;  
    &lt;span class=&quot;n&quot;&gt;html&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;frame&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toHtml&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;  
    &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;html&lt;/span&gt;  
    &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;crawl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;  
  
&lt;span class=&quot;n&quot;&gt;urls&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;http://sitescraper.net&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;http://sitescraper.net/blog&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;  
&lt;span class=&quot;n&quot;&gt;r&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Render&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;urls&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  
&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;keys&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;This is a simple solution that will keep all HTML in memory, which is not practical for large crawls. For large crawls you should save the results to disk. I use the &lt;a href=&quot;http://code.google.com/p/webscraping/source/browse/pdict.py&quot;&gt;pdict module&lt;/a&gt; for this.&lt;/p&gt;
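
&lt;p&gt;As a rough sketch of what saving to disk could look like, here is a version that uses the standard library shelve module as a stand-in. This is not how pdict works internally; the &lt;em&gt;save_page&lt;/em&gt; and &lt;em&gt;load_page&lt;/em&gt; helpers, the cache path and the placeholder HTML are made up for illustration:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;python&quot;&gt;import os
import shelve
import tempfile

def save_page(db_path, url, html):
    """Store the HTML for this URL in an on-disk shelf (hypothetical helper)"""
    db = shelve.open(db_path)
    try:
        # values must be picklable, so convert Qt strings to plain Python strings first
        db[url] = html
    finally:
        db.close()

def load_page(db_path, url):
    """Return the stored HTML for this URL, or None if it was never saved"""
    db = shelve.open(db_path)
    try:
        return db.get(url)
    finally:
        db.close()

# usage: store one page in a temporary directory and read it back
cache_path = os.path.join(tempfile.mkdtemp(), 'pages')
save_page(cache_path, 'http://sitescraper.net', 'example html')
print(load_page(cache_path, 'http://sitescraper.net'))
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;With something like this, the line that stores the HTML in the dict could call &lt;em&gt;save_page&lt;/em&gt; instead, so memory use stays flat no matter how many pages are crawled.&lt;/p&gt;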
</content>
 </entry>
 
 <entry>
   <title>How to teach yourself web scraping</title>
   <link href="http://sitescraper.net/blog/How-to-teach-yourself-web-scraping/"/>
   <updated>2011-12-03T00:00:00+11:00</updated>
   <id>http://sitescraper.net/blog/How-to-teach-yourself-web-scraping</id>
   <content type="html">&lt;p&gt;I often get asked how to learn about &lt;a href=&quot;http://en.wikipedia.org/wiki/Web_scraping&quot;&gt;web scraping&lt;/a&gt;. Here is my advice.&lt;/p&gt;

&lt;p&gt;First learn a popular high-level scripting language. A higher-level language lets you write and test ideas faster. You don't need a more efficient compiled language like C because the bottleneck in web scraping is usually bandwidth rather than code execution. Pick a popular language so that there is already a community of people working on similar problems whose work you can reuse. I use Python, but Ruby or Perl would also be good choices.&lt;/p&gt;

&lt;p&gt;The following advice will assume you want to use Python for web scraping.&lt;br/&gt;
If you have some programming experience then I recommend working through the &lt;a href=&quot;http://www.diveintopython.net/toc/index.html&quot;&gt;Dive Into Python&lt;/a&gt; book.&lt;/p&gt;

&lt;p&gt;Make sure you learn all the details of the &lt;a href=&quot;http://docs.python.org/library/urllib2.html&quot;&gt;urllib2 module&lt;/a&gt;. Here are some additional good resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://www.doughellmann.com/PyMOTW/urllib2/&quot;&gt;http://www.doughellmann.com/PyMOTW/urllib2/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.voidspace.org.uk/python/articles/urllib2.shtml&quot;&gt;http://www.voidspace.org.uk/python/articles/urllib2.shtml&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Learn about the &lt;a href=&quot;http://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol&quot;&gt;HTTP protocol&lt;/a&gt;, which is how you will interact with websites.&lt;/p&gt;

&lt;p&gt;Learn about &lt;a href=&quot;http://docs.python.org/library/re.html&quot;&gt;regular expressions&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://docs.python.org/howto/regex&quot;&gt;http://docs.python.org/howto/regex&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.doughellmann.com/PyMOTW/re/&quot;&gt;http://www.doughellmann.com/PyMOTW/re/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
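
&lt;p&gt;To give a feel for why regular expressions matter for scraping, here is a minimal sketch that pulls email-like strings out of a block of text. The pattern is deliberately loose and is only an illustration, not a full address validator, and the &lt;em&gt;find_emails&lt;/em&gt; name and the sample text are made up for this example:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;python&quot;&gt;import re

# one or more name characters, an @, then a dotted domain name
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')

def find_emails(text):
    """Return the unique email-like strings in text, in order of appearance"""
    seen = []
    for email in EMAIL_RE.findall(text):
        if email not in seen:
            seen.append(email)
    return seen

sample = 'Contact sales@example.com or support@example.org. Again: sales@example.com'
print(find_emails(sample))
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;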


&lt;p&gt;Learn about XPath:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://www.w3schools.com/xpath/&quot;&gt;http://www.w3schools.com/xpath/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.learn-xslt-tutorial.com/XPath.cfm&quot;&gt;http://www.learn-xslt-tutorial.com/XPath.cfm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://lxml.de/dev/xpathxslt.html&quot;&gt;http://lxml.de/dev/xpathxslt.html&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;If necessary learn about JavaScript:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://developer.mozilla.org/en/A_re-introduction_to_JavaScript&quot;&gt;https://developer.mozilla.org/en/A_re-introduction_to_JavaScript&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://eloquentjavascript.net/contents.html&quot;&gt;http://eloquentjavascript.net/contents.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.yuiblog.com/crockford/&quot;&gt;http://www.yuiblog.com/crockford/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;These Firefox extensions can make web scraping easier:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://getfirebug.com/&quot;&gt;Firebug&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://addons.mozilla.org/en-US/firefox/addon/live-http-headers/&quot;&gt;Live HTTP Headers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://groups.csail.mit.edu/uid/chickenfoot/&quot;&gt;Chickenfoot&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://getfoxyproxy.org/&quot;&gt;FoxyProxy&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Some libraries that can make web scraping easier:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://code.google.com/p/webscraping/&quot;&gt;webscraping module&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://lxml.de/lxmlhtml.html&quot;&gt;lxml&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://wwwsearch.sourceforge.net/mechanize/&quot;&gt;mechanize&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://code.google.com/p/httplib2/&quot;&gt;httplib2&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Some other resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://dev.lethain.com/an-introduction-to-compassionate-screenscraping/&quot;&gt;http://dev.lethain.com/an-introduction-to-compassionate-screenscraping&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://stackoverflow.com/questions/tagged/screen-scraping&quot;&gt;http://stackoverflow.com/questions/tagged/screen-scraping&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://blog.sitescraper.net/&quot;&gt;http://blog.sitescraper.net&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</content>
 </entry>
 
 <entry>
   <title>How to use proxies</title>
   <link href="http://sitescraper.net/blog/How-to-use-proxies/"/>
   <updated>2011-11-29T00:00:00+11:00</updated>
   <id>http://sitescraper.net/blog/How-to-use-proxies</id>
   <content type="html">&lt;p&gt;First you need some working proxies. You can try to collect them from the various free lists such as &lt;a href=&quot;http://hidemyass.com/proxy-list/&quot;&gt;this one&lt;/a&gt;, but many people use these websites so they won't be reliable.&lt;br/&gt;
If this is more than a hobby then it would be a better use of your time to rent your proxies from a provider like &lt;a href=&quot;http://packetflip.com/&quot;&gt;packetflip&lt;/a&gt; or &lt;a href=&quot;http://proxybonanza.com/&quot;&gt;proxybonanza&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Each proxy will have the format &lt;em&gt;login:password@IP:port&lt;/em&gt;&lt;br/&gt;
The login details and port are optional. Here are some examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;bob:eakej34@66.12.121.140:8000&lt;/li&gt;
&lt;li&gt;219.66.12.12&lt;/li&gt;
&lt;li&gt;219.66.12.14:8080&lt;/li&gt;
&lt;/ul&gt;
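
&lt;p&gt;If you need the individual parts of a proxy string, the format above can be split with a few string operations. This is only a sketch: &lt;em&gt;parse_proxy&lt;/em&gt; is a hypothetical helper, not part of the webscraping library, and it assumes the password does not contain an @ character:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;python&quot;&gt;def parse_proxy(proxy):
    """Split 'login:password@IP:port' into (login, password, host, port)

    Parts that are missing from the string are returned as None
    """
    login = password = port = None
    host = proxy
    if '@' in proxy:
        # everything before the last @ is the login details
        credentials, host = proxy.rsplit('@', 1)
        login, _, password = credentials.partition(':')
    if ':' in host:
        host, port = host.split(':', 1)
        port = int(port)
    return login, password, host, port

for proxy in ['bob:eakej34@66.12.121.140:8000', '219.66.12.12', '219.66.12.14:8080']:
    print(parse_proxy(proxy))
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;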


&lt;p&gt;With the &lt;a href=&quot;http://code.google.com/p/webscraping/&quot;&gt;webscraping library&lt;/a&gt; you can then use the proxies like this:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;webscraping&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;download&lt;/span&gt;  
&lt;span class=&quot;n&quot;&gt;D&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;download&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Download&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;proxies&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;proxies&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;user_agent&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;user_agent&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  
&lt;span class=&quot;n&quot;&gt;html&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;The above script will download content through a random proxy from the given list. Here is a standalone version:&lt;/p&gt;

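&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;urllib&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;urllib2&lt;/span&gt;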
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;gzip&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;random&lt;/span&gt;  
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;StringIO&lt;/span&gt;  
  
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;fetch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;proxies&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;user_agent&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;Mozilla/5.0&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;  
    &lt;span class=&quot;sd&quot;&gt;&amp;quot;&amp;quot;&amp;quot;Download the content at this url and return the content  &lt;/span&gt;
&lt;span class=&quot;sd&quot;&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;  
    &lt;span class=&quot;n&quot;&gt;opener&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;urllib2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;build_opener&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;  
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;proxies&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;  
        &lt;span class=&quot;c&quot;&gt;# download through a random proxy from the list  &lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;proxy&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;random&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;choice&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;proxies&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lower&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;startswith&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;https://&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;  
            &lt;span class=&quot;n&quot;&gt;opener&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add_handler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;urllib2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ProxyHandler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;({&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;https&amp;#39;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;proxy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}))&lt;/span&gt;  
        &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;  
            &lt;span class=&quot;n&quot;&gt;opener&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add_handler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;urllib2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ProxyHandler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;({&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;http&amp;#39;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;proxy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}))&lt;/span&gt;  
      
    &lt;span class=&quot;c&quot;&gt;# submit these headers with the request  &lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;headers&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;  &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;User-agent&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;user_agent&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;Accept-encoding&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;Referer&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;  
      
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;isinstance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;  
        &lt;span class=&quot;c&quot;&gt;# need to post this data  &lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;urllib&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;urlencode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  
    &lt;span class=&quot;k&quot;&gt;try&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;  
        &lt;span class=&quot;n&quot;&gt;response&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;opener&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;open&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;urllib2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Request&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;headers&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;  
        &lt;span class=&quot;n&quot;&gt;content&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;response&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;  
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;response&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;headers&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;content-encoding&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;gzip&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;  
            &lt;span class=&quot;c&quot;&gt;# data came back gzip-compressed so decompress it            &lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;content&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gzip&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GzipFile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fileobj&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;StringIO&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;StringIO&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;content&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;  
    &lt;span class=&quot;k&quot;&gt;except&lt;/span&gt; &lt;span class=&quot;ne&quot;&gt;Exception&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;  
        &lt;span class=&quot;c&quot;&gt;# so many kinds of errors are possible here so just catch them all  &lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;Error: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;%s&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;%s&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  
        &lt;span class=&quot;n&quot;&gt;content&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;  
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;content&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;



</content>
 </entry>
 
 <entry>
   <title>How to automatically find contact details</title>
   <link href="http://sitescraper.net/blog/How-to-automatically-find-contact-details/"/>
   <updated>2011-11-06T00:00:00+11:00</updated>
   <id>http://sitescraper.net/blog/How-to-automatically-find-contact-details</id>
   <content type="html">&lt;p&gt;I often find businesses hide their contact details behind layers of navigation. I guess they want to cut down their support costs.&lt;/p&gt;

&lt;p&gt;This wastes my time so I use this snippet to automate extracting the available emails:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sys&lt;/span&gt;  
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;webscraping&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;common&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;download&lt;/span&gt;  
  
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;get_emails&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;website&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_depth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;  
    &lt;span class=&quot;sd&quot;&gt;&amp;quot;&amp;quot;&amp;quot;Returns a list of emails found at this website  &lt;/span&gt;
&lt;span class=&quot;sd&quot;&gt;  &lt;/span&gt;
&lt;span class=&quot;sd&quot;&gt;max_depth is how deep to follow links  &lt;/span&gt;
&lt;span class=&quot;sd&quot;&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;  
    &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;download&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Download&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;  
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_emails&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;website&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_depth&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;max_depth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  
  
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;__name__&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;__main__&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;  
    &lt;span class=&quot;k&quot;&gt;try&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;  
        &lt;span class=&quot;n&quot;&gt;website&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sys&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;argv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;  
        &lt;span class=&quot;n&quot;&gt;max_depth&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sys&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;argv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;  
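    &lt;span class=&quot;k&quot;&gt;except&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;ne&quot;&gt;IndexError&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;ne&quot;&gt;ValueError&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;  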
        &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;Usage: &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;%s&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; &amp;lt;URL&amp;gt; &amp;lt;max depth&amp;gt;&amp;#39;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sys&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;argv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;  
    &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;  
        &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;get_emails&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;website&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;max_depth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;Example use:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;python&quot;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;get_emails&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;http://www.sitescraper.net&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  
&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;richard@sitescraper.net&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
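
&lt;p&gt;For readers without the webscraping module, the core idea of pulling emails out of downloaded text can be sketched with a regular expression. This is only an illustration in Python 3 and not how the module itself works; the pattern, the find_emails name and the sample string are my own assumptions, and the pattern is deliberately simple rather than a full address validator.&lt;/p&gt;

```python
import re

# Deliberately simple pattern (an assumption, not taken from the webscraping module):
# a local part, an @ sign, then a domain with at least one dot
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')

def find_emails(text):
    """Return the unique emails found in text, in order of first appearance"""
    seen = []
    for email in EMAIL_RE.findall(text):
        if email not in seen:
            seen.append(email)
    return seen

# hypothetical sample input
sample = 'Contact richard@sitescraper.net or richard@sitescraper.net for details'
print(find_emails(sample))
```

&lt;p&gt;A crawler would apply a function like this to each downloaded page and follow links until the maximum depth is reached.&lt;/p&gt;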



</content>
 </entry>
 
 <entry>
   <title>Free service to extract article from webpage</title>
   <link href="http://sitescraper.net/blog/Free-service-to-extract-article-from-webpage/"/>
   <updated>2011-10-11T00:00:00+11:00</updated>
   <id>http://sitescraper.net/blog/Free-service-to-extract-article-from-webpage</id>
   <content type="html">&lt;p&gt;In a previous post I showed a tool for automatically extracting article summaries. Recently I came across a free online service from instapaper.com that does an even better job.&lt;/p&gt;

&lt;p&gt;Here is one of my &lt;a href=&quot;/How-to-protect-your-data/&quot;&gt;blog articles&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/img/blog/how_to_stop_scraper.jpg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;And here are the results when &lt;a href=&quot;http://www.instapaper.com/text?u=http://blog.sitescraper.net/2010/06/how-to-stop-scraper.html&quot;&gt;submitted to instapaper&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/img/blog/instapaper1.jpg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;And here is a &lt;a href=&quot;http://www.bbc.co.uk/news/14185334&quot;&gt;BBC article&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/img/blog/bbc.jpg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;And again the &lt;a href=&quot;http://www.instapaper.com/text?u=http://www.bbc.co.uk/news/14185334&quot;&gt;results from instapaper&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/img/blog/instapaper1.jpg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Instapaper has not made this service public, so hopefully they add it to their &lt;a href=&quot;http://www.instapaper.com/api/full&quot;&gt;official API&lt;/a&gt; in future.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Webpage screenshots with webkit</title>
   <link href="http://sitescraper.net/blog/Webpage-screenshots-with-webkit/"/>
   <updated>2011-09-20T00:00:00+10:00</updated>
   <id>http://sitescraper.net/blog/Webpage-screenshots-with-webkit</id>
   <content type="html">&lt;p&gt;For a recent project I needed to render screenshots of webpages. Here is my solution using webkit:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sys&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;time&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;PyQt4.QtCore&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;PyQt4.QtGui&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;PyQt4.QtWebKit&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Screenshot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;QWebView&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;app&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;QApplication&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sys&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;argv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;QWebView&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_loaded&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;loadFinished&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;connect&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_loadFinished&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;capture&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;output_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;QUrl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;wait_load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;c&quot;&gt;# set to webpage size&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;frame&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;page&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mainFrame&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;page&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;setViewportSize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;frame&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;contentsSize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt;
        &lt;span class=&quot;c&quot;&gt;# render image&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;image&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;QImage&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;page&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;viewportSize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;QImage&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Format_ARGB32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;painter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;QPainter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;image&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;frame&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;render&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;painter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;painter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;end&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;saving&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;output_file&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;image&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;save&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;output_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;wait_load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;delay&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;c&quot;&gt;# process app events until page loaded&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;while&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;not&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_loaded&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;processEvents&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sleep&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;delay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_loaded&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;_loadFinished&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_loaded&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Screenshot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;capture&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;http://sitescraper.net&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;website.png&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;capture&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;http://sitescraper.net/blog&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;blog.png&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;



</content>
 </entry>
 
 <entry>
   <title>Google interview</title>
   <link href="http://sitescraper.net/blog/Google-interview/"/>
   <updated>2011-09-05T00:00:00+10:00</updated>
   <id>http://sitescraper.net/blog/Google-interview</id>
   <content type="html">&lt;p&gt;Recently I was invited to Sydney to interview for a developer position.&lt;/p&gt;

&lt;p&gt;I waited in the reception for a 1pm start. Apart from a tire swing, which people ignored anyway, this could have been any office. Nothing like the famous Googleplex in Mountain View.&lt;/p&gt;

&lt;p&gt;I was led to a small room for the interviews. Along the way I passed rows of coders at work and an effigy of Sarah Palin. Poor taste.&lt;/p&gt;

&lt;p&gt;I then had back-to-back technical interviews until past 5pm. Contrary to what I had understood, lunch was not provided, so I was feeling weak by the final interview.&lt;/p&gt;

&lt;p&gt;Surprisingly by then the office was already empty and so I was escorted out. No tour. No introductions. No snacks. No schwag.&lt;/p&gt;

&lt;p&gt;The interviewers were clearly talented and I could learn a lot working with them but the poor administration left a bad taste. In conclusion I don't feel tempted to leave my freelancing + travel lifestyle.&lt;/p&gt;

&lt;p&gt;My advice for other applicants: practice writing algorithms on a whiteboard. And bring your own lunch.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Can you extract data from any website?</title>
   <link href="http://sitescraper.net/blog/Can-you-extract-data-from-this-website/"/>
   <updated>2011-08-10T00:00:00+10:00</updated>
   <id>http://sitescraper.net/blog/Can-you-extract-data-from-this-website</id>
   <content type="html">&lt;p&gt;Yes - if the data is publically available then it can be extracted. The majority of websites are straightforward to scrape, however some are more difficult and may not be practical to scrape if you have time or budget restrictions.&lt;/p&gt;

&lt;p&gt;For example if the website restricts how many pages each IP address can access then it could take months to download the entire website. In that case I can use proxies to provide me multiple IP addresses and download the data faster, but this can get expensive if many proxies are required.&lt;/p&gt;
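
&lt;p&gt;To make the trade-off concrete, here is a rough back-of-the-envelope sketch in Python 3. The page count, the per-IP daily limit and the number of proxies are all hypothetical figures chosen for illustration, and the function name is my own.&lt;/p&gt;

```python
import math

def days_to_download(total_pages, pages_per_ip_per_day, num_ips=1):
    """Estimate the days needed when each IP address has a daily page limit"""
    return math.ceil(total_pages / (pages_per_ip_per_day * num_ips))

# hypothetical website: 100,000 pages, limited to 1,000 pages per IP per day
print(days_to_download(100000, 1000))              # 100
print(days_to_download(100000, 1000, num_ips=20))  # 5
```

&lt;p&gt;With these made-up figures, twenty proxies cut the download from over three months to under a week, at the cost of paying for twenty proxies.&lt;/p&gt;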

&lt;p&gt;If the website uses JavaScript and AJAX to load their data then I usually use a tool like &lt;a href=&quot;http://getfirebug.com&quot;&gt;Firebug&lt;/a&gt; to reverse engineer how the website works, and then call the appropriate AJAX URLs directly. And if the JavaScript is obfuscated or particularly complicated I can use a browser renderer like webkit to execute the JavaScript and provide me with the final HTML.&lt;/p&gt;

&lt;p&gt;Another difficulty is if the website uses CAPTCHAs or stores their data in images. Then I would need to get people (with cheaper hourly costs) to manually interpret the images. Fortunately there are services available like &lt;a href=&quot;http://deathbycaptcha.com&quot;&gt;Death by CAPTCHA&lt;/a&gt; for parsing images. So data in images can be extracted but it adds extra time and cost to the project.&lt;/p&gt;

&lt;p&gt;In summary I can always extract publicly available data from a website, but some websites will take more time and cost.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>User agents</title>
   <link href="http://sitescraper.net/blog/User-agents/"/>
   <updated>2011-07-20T00:00:00+10:00</updated>
   <id>http://sitescraper.net/blog/User-agents</id>
   <content type="html">&lt;p&gt;Your web browser will send what is known as a &quot;&lt;a href=&quot;http://en.wikipedia.org/wiki/User_agent&quot;&gt;User Agent&lt;/a&gt;&quot; for every page you access. This is a string to tell the server what kind of device you are accessing the page with. Here are some common User Agent strings:&lt;/p&gt;

&lt;table&gt;&lt;tbody&gt;
&lt;tr&gt; &lt;th&gt;Browser&lt;/th&gt; &lt;th&gt;User Agent&lt;/th&gt; &lt;/tr&gt;
&lt;tr&gt; &lt;td&gt;Firefox on Windows XP&lt;/td&gt; &lt;td&gt;Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6&lt;/td&gt; &lt;/tr&gt;
&lt;tr&gt; &lt;td&gt;Chrome on Linux&lt;/td&gt; &lt;td&gt;Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3&lt;/td&gt; &lt;/tr&gt;
&lt;tr&gt; &lt;td&gt;Internet Explorer on Windows XP&lt;/td&gt; &lt;td&gt;Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)&lt;/td&gt; &lt;/tr&gt;
&lt;tr&gt; &lt;td&gt;Opera on Windows XP&lt;/td&gt; &lt;td&gt;Opera/9.00 (Windows NT 5.1; U; en)&lt;/td&gt; &lt;/tr&gt;
&lt;tr&gt; &lt;td&gt;Android&lt;/td&gt; &lt;td&gt;Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522+ (KHTML, like Gecko) Safari/419.3&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt; &lt;td&gt;iPhone&lt;/td&gt; &lt;td&gt;Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt; &lt;td&gt;Blackberry&lt;/td&gt; &lt;td&gt;Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, Like Gecko) Version/6.0.0.141 Mobile Safari/534.1+&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt; &lt;td&gt;Python urllib&lt;/td&gt; &lt;td&gt;Python-urllib/2.1&lt;/td&gt; &lt;/tr&gt;
&lt;tr&gt; &lt;td&gt;Old Google Bot&lt;/td&gt; &lt;td&gt;Googlebot/2.1 ( http://www.googlebot.com/bot.html)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt; &lt;td&gt;New Google Bot&lt;/td&gt; &lt;td&gt;Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt; &lt;td&gt;MSN Bot&lt;/td&gt; &lt;td&gt;msnbot/1.1 (+http://search.msn.com/msnbot.htm)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt; &lt;td&gt;Yahoo Bot&lt;/td&gt; &lt;td&gt;Yahoo! Slurp/Site Explorer&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;


&lt;p&gt;You can find your own current User Agent &lt;a href=&quot;http://whatsmyuseragent.com/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Some webpages will use the User Agent to display content that is customized to your particular browser. For example if your User Agent indicates you are using an old browser then the website may return the plain HTML version without any AJAX features, which may be easier to scrape.&lt;/p&gt;

&lt;p&gt;Some websites will automatically block certain User Agents, for example if your User Agent indicates you are accessing their server with a script rather than a regular web browser.&lt;/p&gt;

&lt;p&gt;Fortunately it is easy to set your User Agent to whatever you like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For Firefox you can use the &lt;a href=&quot;https://addons.mozilla.org/en-US/firefox/addon/user-agent-switcher/&quot;&gt;User Agent Switcher&lt;/a&gt; extension.&lt;/li&gt;
&lt;li&gt;For Chrome there is currently no extension, but you can set the User Agent from the command line at startup: &lt;em&gt;chromium-browser --user-agent=&quot;my custom user agent&quot;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;For Internet Explorer you can use the &lt;a href=&quot;http://www.enhanceie.com/ietoys/uapick.asp&quot;&gt;UAPick&lt;/a&gt; extension.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;And for Python scripts you can set the User-Agent header with:&lt;/p&gt;

&lt;p&gt; import urllib2&lt;br/&gt;
 opener = urllib2.build_opener()&lt;br/&gt;
 opener.addheaders = [('User-agent', 'my custom user agent')]&lt;br/&gt;
 opener.open('http://www.google.com')&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Using the default User Agent for your scraper is a common reason to be blocked, so don't forget to change it.&lt;/p&gt;
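
&lt;p&gt;In Python 3 the urllib2 module became urllib.request. As a minimal sketch, assuming Python 3, a custom User Agent can be attached to a single request as follows. The build_request name, the URL and the agent string are placeholders of my own, and the example only builds the request without sending it.&lt;/p&gt;

```python
import urllib.request

def build_request(url, user_agent):
    """Return a request object that carries a custom User-Agent header"""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})

# placeholder URL and agent string
req = build_request('http://example.com', 'Mozilla/5.0 (compatible; MyScraper/1.0)')
# urllib stores header names capitalized, so the lookup key is 'User-agent'
print(req.get_header('User-agent'))
```

&lt;p&gt;Passing the request to urllib.request.urlopen would then send it with that header instead of the default Python-urllib one.&lt;/p&gt;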
</content>
 </entry>
 
 <entry>
   <title>Taking advantage of mobile interfaces</title>
   <link href="http://sitescraper.net/blog/Taking-advantage-of-mobile-interfaces/"/>
   <updated>2011-07-05T00:00:00+10:00</updated>
   <id>http://sitescraper.net/blog/Taking-advantage-of-mobile-interfaces</id>
   <content type="html">&lt;p&gt;Sometimes a website will have multiple versions: one for regular users with a modern browser, a HTML version for browsers that don't support JavaScript, and a simplified version for mobile users.&lt;/p&gt;

&lt;p&gt;For example Gmail has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://gmail.com/&quot;&gt;gmail.com&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://mail.google.com/mail/h/&quot;&gt;mail.google.com/mail/h/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://m.gmail.com/&quot;&gt;m.gmail.com&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;All three of these interfaces will display the content of your emails but use different layouts and features. The main entrance at gmail.com is well known for its use of AJAX to load content dynamically without refreshing the page. This leads to a better user experience but makes web automation or scraping harder.&lt;/p&gt;

&lt;p&gt;On the other hand the static HTML interface has fewer features and is less efficient for users, but much easier to automate or scrape because all the content is available when the page loads.&lt;/p&gt;

&lt;p&gt;So before scraping a website check for its HTML or mobile version, which, when it exists, should be easier to scrape.&lt;/p&gt;

&lt;p&gt;To find the HTML version try disabling JavaScript in your browser and see what happens.&lt;br/&gt;
To find the mobile version try adding the &quot;m&quot; subdomain (domain.com -&gt; m.domain.com) or using a mobile &lt;a href=&quot;http://en.wikipedia.org/wiki/User_agent&quot;&gt;user-agent&lt;/a&gt;.&lt;/p&gt;
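
&lt;p&gt;The subdomain guess can be automated with a small helper. This is a sketch in Python 3 under my own assumptions: the mobile_url name is hypothetical, and dropping a leading www before adding the m prefix is a guess that will not suit every website.&lt;/p&gt;

```python
from urllib.parse import urlsplit, urlunsplit

def mobile_url(url):
    """Guess the mobile URL by switching the host to the m subdomain"""
    parts = urlsplit(url)
    host = parts.netloc
    if host.startswith('www.'):
        host = host[4:]  # assumption: the mobile site replaces www rather than nesting under it
    if not host.startswith('m.'):
        host = 'm.' + host
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))

print(mobile_url('http://www.domain.com/page'))  # http://m.domain.com/page
```

&lt;p&gt;The guessed URL still needs to be requested to confirm that a mobile version actually exists there.&lt;/p&gt;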
</content>
 </entry>
 
 <entry>
   <title>Parsing Flash with Swiffy</title>
   <link href="http://sitescraper.net/blog/Parsing-Flash-with-Swiffy/"/>
   <updated>2011-06-30T00:00:00+10:00</updated>
   <id>http://sitescraper.net/blog/Parsing-Flash-with-Swiffy</id>
   <content type="html">&lt;p&gt;Google has released a tool called &lt;a href=&quot;http://swiffy.googlelabs.com/&quot;&gt;Swiffy&lt;/a&gt; for parsing Flash files into HTML5. This is relevant to web scraping because content embedded in Flash is a pain to extract, as I &lt;a href=&quot;/Scraping-Flash-based-websites/&quot;&gt;wrote about earlier&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I tried some test files and found the results no more useful for parsing text content than the output produced by &lt;a href=&quot;http://webscraping.googlecode.com/files/swf2html&quot;&gt;swf2html&lt;/a&gt; (Linux version). Some neat example conversions are &lt;a href=&quot;http://swiffy.googlelabs.com/gallery.html&quot;&gt;available here&lt;/a&gt;. Currently Swiffy supports ActionScript 2.0 and works best with Flash 5, which was released back in 2000 so there is still a lot of work to do.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Google App Engine limitations</title>
   <link href="http://sitescraper.net/blog/Google-App-Engine-limitations/"/>
   <updated>2011-06-19T00:00:00+10:00</updated>
   <id>http://sitescraper.net/blog/Google-App-Engine-limitations</id>
   <content type="html">&lt;p&gt;Most of the discussion about &lt;a href=&quot;http://code.google.com/appengine/&quot;&gt;Google App Engine&lt;/a&gt; seems to focus on how it allows you to scale your app, however I find it most useful for small client apps where we want a reliable platform while avoiding any ongoing hosting fee. For large apps paying for hosting would not be a problem.&lt;/p&gt;

&lt;p&gt;These are some of the downsides I have found using Google App Engine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slow - if your app has not been accessed recently (within the last minute) then it can take up to 10 seconds to load for the user&lt;/li&gt;
&lt;li&gt;Pure Python/Java code only - this prevents using a lot of good libraries, most importantly for me lxml&lt;/li&gt;
&lt;li&gt;CPU quota easily gets exhausted when uploading data&lt;/li&gt;
&lt;li&gt;Proxies not supported, which makes apps that rely on external websites risky. For example the Twitter API has a per IP quota which you would be sharing with all other GAE apps.&lt;/li&gt;
&lt;li&gt;Blocked in some countries, such as Turkey&lt;/li&gt;
&lt;li&gt;Indexes - &lt;a href=&quot;http://code.google.com/appengine/docs/quotas.html&quot;&gt;the free quota&lt;/a&gt; is 1 GB but often over half of this is taken up by indexes&lt;/li&gt;
&lt;li&gt;&lt;strike&gt;Maximum 1000 records per query&lt;/strike&gt; - &lt;a href=&quot;http://code.google.com/p/googleappengine/wiki/SdkReleaseNotes#Version_1.3.6_-_August_17,_2010&quot;&gt;no longer a limitation&lt;/a&gt;!&lt;/li&gt;
&lt;li&gt;20 second request limit, so you often need the overhead of using &lt;a href=&quot;http://code.google.com/appengine/docs/python/taskqueue/overview.html&quot;&gt;Task Queues&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Despite these problems I still find Google App Engine a fantastic platform and a pleasure to develop on.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Using Google Translate to crawl a website</title>
   <link href="http://sitescraper.net/blog/Using-Google-Translate-to-crawl-a-website/"/>
   <updated>2011-05-29T00:00:00+10:00</updated>
   <id>http://sitescraper.net/blog/Using-Google-Translate-to-crawl-a-website</id>
   <content type="html">&lt;p&gt;I wrote previously about using &lt;a href=&quot;/Using-Google-Cache-to-crawl-a-website/&quot;&gt;Google Cache to crawl a website&lt;/a&gt;. Sometimes, for whatever reason, Google Cache does not include a webpage so it is good to have backup options.&lt;/p&gt;

&lt;p&gt;One option is using &lt;a href=&quot;http://translate.google.com/&quot;&gt;Google Translate&lt;/a&gt;, which lets you translate a webpage into another language. If you select a source language that you know the page is not written in (e.g. Dutch) then no translation will take place and you will just get back the &lt;a href=&quot;http://translate.google.com/translate?sl=nl&amp;amp;tl=en&amp;amp;u=http%3A%2F%2Famazon.com&quot;&gt;original content&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/img/blog/google_translate.jpg&quot; /&gt;&lt;/p&gt;
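
&lt;p&gt;The link above follows a simple pattern, with sl as the source language, tl as the target language and u as the page to fetch. Here is a minimal Python 3 sketch that builds such a URL. The translate_url name and its defaults are my own, and the parameters are taken only from the link above, so treat the endpoint as an assumption that may change.&lt;/p&gt;

```python
from urllib.parse import urlencode

def translate_url(url, source='nl', target='en'):
    """Build a Google Translate URL that wraps the given webpage"""
    # sl, tl and u are the parameters used in the example link above
    query = urlencode({'sl': source, 'tl': target, 'u': url})
    return 'http://translate.google.com/translate?' + query

print(translate_url('http://amazon.com'))
```

&lt;p&gt;Choosing a source language that does not match the page is what keeps the returned content untranslated.&lt;/p&gt;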

&lt;p&gt;I added a function to download a URL via Google Translate and Google Cache to my &lt;a href=&quot;http://code.google.com/p/webscraping/&quot;&gt;webscraping&lt;/a&gt; library. Here is an example:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;webscraping&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;download&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;xpath&lt;/span&gt;  
  
&lt;span class=&quot;n&quot;&gt;D&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;download&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Download&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;  
&lt;span class=&quot;n&quot;&gt;url&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;http://sitescraper.net/faq&amp;#39;&lt;/span&gt;  
&lt;span class=&quot;n&quot;&gt;html1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# download directly  &lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;html2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gcache_get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# download via Google Cache  &lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;html3&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;D&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gtrans_get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# download via Google Translate  &lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;html&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;html1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;html2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;html3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;  
    &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;xpath&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;html&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;//title&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;This example downloads the same webpage directly, via Google Cache, and via Google Translate. Then it parses the title to show the same webpage has been downloaded. The output when run is:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Frequently asked questions | SiteScraper
Frequently asked questions | SiteScraper
Frequently asked questions | SiteScraper
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The same title was extracted from each source, which shows that the correct result was downloaded from Google Cache and Google Translate.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Using Google Cache to crawl a website</title>
   <link href="http://sitescraper.net/blog/Using-Google-Cache-to-crawl-a-website/"/>
   <updated>2011-05-15T00:00:00+10:00</updated>
   <id>http://sitescraper.net/blog/Using-Google-Cache-to-crawl-a-website</id>
   <content type="html">&lt;p&gt;Occasionally I come across a website that blocks your IP after only a few requests. If the website contains a lot of data then downloading it quickly would take an expensive amount of proxies.&lt;/p&gt;

&lt;p&gt;Fortunately there is an alternative - Google.&lt;/p&gt;

&lt;p&gt;If a website doesn't exist in Google's search results then for most people it doesn't exist at all. Websites want visitors so will usually be happy for Google to crawl their content. This means Google has likely already downloaded all the web pages we want. And after downloading, Google makes much of the content available through their cache.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/img/blog/google_sitescraper.jpg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;So instead of downloading a URL we want directly we can download it indirectly via &lt;a href=&quot;http://www.google.com/search?&amp;amp;q=cache%3Ahttp%3A//sitescraper.net&quot;&gt;Google Cache&lt;/a&gt;. Then the source website cannot block you and does not even know you are crawling their content.&lt;/p&gt;
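
&lt;p&gt;As a rough sketch, the cache URL can be built from any page URL by percent-encoding it after a cache: prefix. The helper name google_cache_url is my own for illustration and is not part of the webscraping module. The URL format is copied from the link above, with the redundant separator before q omitted, and the snippet is written for Python 3:&lt;/p&gt;

```python
from urllib.parse import quote


def google_cache_url(url):
    """Return the Google Cache search URL for the given page URL."""
    # quote() leaves '/' untouched by default but percent-encodes ':',
    # which produces the cache%3Ahttp%3A//... form used in the link above
    return 'http://www.google.com/search?q=' + quote('cache:' + url)


print(google_cache_url('http://sitescraper.net'))
```

&lt;p&gt;The returned URL can then be passed to whatever download function you already use.&lt;/p&gt;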
</content>
 </entry>
 
 <entry>
   <title>Crawling with threads</title>
   <link href="http://sitescraper.net/blog/Crawling-with-threads/"/>
   <updated>2011-04-10T00:00:00+10:00</updated>
   <id>http://sitescraper.net/blog/Crawling-with-threads</id>
   <content type="html">&lt;p&gt;The bottleneck for web scraping is generally bandwidth - the time waiting for webpages to download. This delay can be minimized by downloading multiple webpages concurrently in separate threads.&lt;/p&gt;

&lt;p&gt;Here are examples of both the sequential and the concurrent approach:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;python&quot;&gt;&lt;span class=&quot;c&quot;&gt;# a list of 100 webpage URLs to download&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;urls&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# first try downloading sequentially&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;urllib&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;url&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;urls&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;urllib&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;urlopen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# now try concurrently&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sys&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;webscraping&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;download&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;num_threads&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sys&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;argv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;download&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;threaded_get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;urls&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;urls&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;delay&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_threads&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num_threads&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
    &lt;span class=&quot;n&quot;&gt;read_cache&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;write_cache&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# disable cache&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;Here are the results:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ time python sequential.py
4m25.602s
$ time python concurrent.py 10
1m7.430s
$ time python concurrent.py 100
0m31.528s
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As expected, threading the downloads makes a big difference. You may have noticed the time saved is not linearly proportional to the number of threads. That is primarily because my web server struggles to keep up with all the requests. When crawling websites with threads be careful not to overload their web server by downloading too fast. Otherwise the website will become slower for other users and your IP risks being blacklisted.&lt;/p&gt;
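
&lt;p&gt;For readers without the webscraping module, here is a minimal standard-library sketch of the same idea for Python 3. It is not how threaded_get is implemented. The function names are hypothetical, and fake_download only sleeps to imitate network latency so the example runs without touching a real server:&lt;/p&gt;

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(url):
    """Stand-in for a real download: sleep briefly to imitate network latency."""
    time.sleep(0.05)
    return url, 'html for ' + url


def threaded_download(urls, num_threads, fetch=fake_download):
    """Fetch every URL using a pool of num_threads worker threads."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() returns results in the same order as the input URLs
        return list(pool.map(fetch, urls))


urls = ['http://example.com/page%d' % i for i in range(20)]
for num_threads in (1, 10):
    start = time.time()
    pages = threaded_download(urls, num_threads)
    elapsed = time.time() - start
    print('%2d threads: %d pages in %.2fs' % (num_threads, len(pages), elapsed))
```

&lt;p&gt;Swapping fake_download for a real fetch function is where a per-request delay would belong, to avoid overloading the target server.&lt;/p&gt;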
</content>
 </entry>
 
 <entry>
   <title>Google Storage</title>
   <link href="http://sitescraper.net/blog/Google-Storage/"/>
   <updated>2011-03-30T00:00:00+11:00</updated>
   <id>http://sitescraper.net/blog/Google-Storage</id>
   <content type="html">&lt;p&gt;Often the datasets I scrape are too big to send via email and would take up too much space on my web server, so I upload them to &lt;a href=&quot;http://code.google.com/apis/storage/docs/getting-started.html&quot;&gt;Google Storage&lt;/a&gt;. &lt;br/&gt;
Here is an example snippet to create a bucket on GS, upload a file, and then download it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ gsutil mb gs://bucket_name
$ gsutil ls
gs://bucket_name
$ gsutil cp path/to/file.ext gs://bucket_name
$ gsutil ls gs://bucket_name
file.ext
$ gsutil cp gs://bucket_name/file.ext file_copy.ext
&lt;/code&gt;&lt;/pre&gt;
</content>
 </entry>
 
 <entry>
   <title>The SiteScraper module</title>
   <link href="http://sitescraper.net/blog/The-SiteScraper-module/"/>
   <updated>2011-03-01T00:00:00+11:00</updated>
   <id>http://sitescraper.net/blog/The-SiteScraper-module</id>
   <content type="html">&lt;p&gt;A few years ago I developed the &lt;a href=&quot;http://code.google.com/p/sitescraper/&quot;&gt;sitescraper library&lt;/a&gt; for automatically scraping website data based on example cases:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;python&quot;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sitescraper&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sitescraper&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ss&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sitescraper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;url&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&amp;amp;field-keywords=python&amp;amp;x=0&amp;amp;y=0&amp;#39;&lt;/span&gt;  
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;Amazon.com: python&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;Learning Python, 3rd Edition&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   
  &lt;span class=&quot;s&quot;&gt;&amp;quot;Programming in Python 3: A Complete Introduction to the Python Language (Developer&amp;#39;s Library)&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   
  &lt;span class=&quot;s&quot;&gt;&amp;quot;Python in a Nutshell, Second Edition (In a Nutshell (O&amp;#39;Reilly))&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]]&lt;/span&gt;  
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ss&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# we can add multiple example cases, but this is a simple example so 1 will do (I generally use 3)  &lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# ss.add(url2, data2)   &lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ss&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;scrape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;http://www.amazon.com/s/ref=nb_ss_gw?url=search-alias%3Daps&amp;amp;field-keywords=linux&amp;amp;x=0&amp;amp;y=0&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  
&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;quot;Amazon.com: linux&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&amp;quot;A Practical Guide to Linux(R) Commands, Editors, and Shell Programming&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
    &lt;span class=&quot;s&quot;&gt;&amp;quot;Linux Pocket Guide&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
    &lt;span class=&quot;s&quot;&gt;&amp;quot;Linux in a Nutshell (In a Nutshell (O&amp;#39;Reilly))&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
    &lt;span class=&quot;s&quot;&gt;&amp;#39;Practical Guide to Ubuntu Linux (Versions 8.10 and 8.04), A (2nd Edition)&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; 
    &lt;span class=&quot;s&quot;&gt;&amp;#39;Linux Bible, 2008 Edition&amp;#39;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;]]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;See &lt;a href=&quot;http://sitescraper.googlecode.com/files/sitescraper.pdf&quot;&gt;this paper&lt;/a&gt; for more info.&lt;/p&gt;

&lt;p&gt;It was designed for scraping websites over time, where their layout may change. Unfortunately I don't use it much these days because most of my projects are one-off scrapes.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Automating CAPTCHAs</title>
   <link href="http://sitescraper.net/blog/Automating-CAPTCHAs/"/>
   <updated>2011-02-20T00:00:00+11:00</updated>
   <id>http://sitescraper.net/blog/Automating-CAPTCHAs</id>
   <content type="html">&lt;p&gt;By now you would be used to entering the text for an image like this:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/img/blog/captcha.jpg&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The idea is this will prevent bots because only a real user can interpret the image.&lt;/p&gt;

&lt;p&gt;However this is not an obstacle for a determined scraper because of services like &lt;a href=&quot;http://www.deathbycaptcha.com&quot;&gt;deathbycaptcha&lt;/a&gt; that will solve the CAPTCHA for you. These services use cheap labor to manually interpret the images and send the result back through an API.&lt;/p&gt;

&lt;p&gt;CAPTCHAs are still useful because they deter most bots. However they cannot prevent a determined scraper and are annoying to genuine users.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>New scraping quote tool</title>
   <link href="http://sitescraper.net/blog/New-scraping-quote-tool/"/>
   <updated>2010-11-06T00:00:00+11:00</updated>
   <id>http://sitescraper.net/blog/New-scraping-quote-tool</id>
   <content type="html">&lt;p&gt;An ongoing problem for my web scraping work is how much to quote for a job. I prefer &lt;a href=&quot;/Fixed-fee-or-hourly/&quot;&gt;fixed fee to hourly&lt;/a&gt; rates so I need to consider the complexity upfront. My initial strategy was simply to quote low to ensure I got business and hopefully build up some regular clients.&lt;/p&gt;

&lt;p&gt;Through experience I found the following factors most affected the time required for a job:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Website size&lt;/li&gt;
&lt;li&gt;Login protected&lt;/li&gt;
&lt;li&gt;IP restrictions&lt;/li&gt;
&lt;li&gt;HTML quality&lt;/li&gt;
&lt;li&gt;JavaScript/AJAX&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;I developed a formula based on these factors and have now &lt;a href=&quot;http://sitescraper.net/contact&quot;&gt;built an interface&lt;/a&gt; that lets potential clients clarify the costs involved with different kinds of web scraping jobs. Additionally I hope this will reduce the communication overhead by helping clients to provide the necessary information upfront.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Increase your Google App Engine quotas for free</title>
   <link href="http://sitescraper.net/blog/Increase-your-Google-App-Engine-quotas-for-free/"/>
   <updated>2010-10-27T00:00:00+11:00</updated>
   <id>http://sitescraper.net/blog/Increase-your-Google-App-Engine-quotas-for-free</id>
   <content type="html">&lt;p&gt;Google App Engine provides &lt;a href=&quot;http://code.google.com/appengine/docs/quotas.html&quot;&gt;generous free quotas&lt;/a&gt; for your app and additional paid quotas. &lt;br/&gt;
I always enable billing for my GAE apps even though I rarely exhaust the free quotas. Enabling billing and setting paid quotas does not mean you have to pay anything and in fact increases what you get for free.&lt;/p&gt;

&lt;p&gt;Here is a screenshot of the billing panel:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/img/blog/gae_billing.png&quot; /&gt;&lt;/p&gt;

&lt;p&gt;GAE lets you allocate a daily budget to the various resources, with the minimum permitted budget being USD $1. When you exhaust a free quota you will only be charged for the budget allocated to it. In the above screenshot I have allocated all my budget to emailing, but since my app does not use the Mail API I can be confident this free quota will never be exhausted and I will never pay a cent. For another app that does use Mail I have allocated all the budget to Bandwidth Out instead.&lt;/p&gt;

&lt;p&gt;Now with billing enabled my app:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;can access the &lt;a href=&quot;http://code.google.com/appengine/docs/quotas.html#Blobstore&quot;&gt;Blobstore API&lt;/a&gt; to store larger amounts of data&lt;/li&gt;
&lt;li&gt;enjoys much higher free limits for the &lt;a href=&quot;http://code.google.com/appengine/docs/quotas.html#Mail&quot;&gt;Mail&lt;/a&gt;, &lt;a href=&quot;http://code.google.com/appengine/docs/quotas.html#Task_Queue&quot;&gt;Task Queue&lt;/a&gt;, and &lt;a href=&quot;http://code.google.com/appengine/docs/quotas.html#UrlFetch&quot;&gt;UrlFetch&lt;/a&gt; APIs - for example by default an app can make 7000 Mail API calls but with billing enabled this limit jumps to 1,700,000 calls&lt;/li&gt;
&lt;li&gt;has a higher per minute &lt;a href=&quot;http://code.google.com/appengine/docs/quotas.html#Requests&quot;&gt;CPU&lt;/a&gt; limit, which I find particularly useful when uploading a mass of records to the Datastore&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;So in summary you can enable billing to extend your free quotas without risk of paying.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Extracting article summaries</title>
   <link href="http://sitescraper.net/blog/Extracting-article-summaries/"/>
   <updated>2010-10-06T00:00:00+11:00</updated>
   <id>http://sitescraper.net/blog/Extracting-article-summaries</id>
   <content type="html">&lt;p&gt;I made my own version of &lt;a href=&quot;http://blog.davidziegler.net/post/122176962/a-python-script-to-automatically-extract-excerpts-from&quot;&gt;this technique&lt;/a&gt; to extract article summaries.&lt;br/&gt;
Source code can be &lt;a href=&quot;http://code.google.com/p/webscraping/source/browse/alg.py&quot;&gt;found here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The idea is simple - extract the biggest text block - but performs well.&lt;br/&gt;
Here are some test results:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.nytimes.com/2010/03/23/technology/23google.html?_r=1&quot;&gt;http://www.nytimes.com/2010/03/23/technology/23google.html?_r=1&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The decision to shut down google.cn will have a limited financial impact on Google, which is based in Mountain View, Calif. China accounted for a small fraction of Google&amp;rsquo;s $23.6 billion in global revenue last year. Ads that once appeared on google.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.theregister.co.uk/2010/09/29/novell_suse_appliance_1_1/&quot;&gt;http://www.theregister.co.uk/2010/09/29/novell_suse_appliance_1_1/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Being able to spin up appliance images for EC2 and spit them out onto the Amazon cloud meshes with Novell's EC2-based SUSE Linux licensing, which was announced back in August. Novell is only selling priority-level (24x7) support contract for SUSE Linux li&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://blog.sitescraper.net/2010/08/best-website-for-freelancers.html&quot;&gt;http://blog.sitescraper.net/2010/08/best-website-for-freelancers.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;However with Elance there is a high barrier to entry: you have to pass a test, receive a phone call to confirm your identity, and pay money for each job you bid on. Often I see jobs on Elance with no bids because it requires obscure experience - people we&lt;/em&gt;&lt;/p&gt;
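
&lt;p&gt;To make the biggest-text-block idea concrete, here is a small standard-library sketch for Python 3. It is not the code in alg.py. The class and function names, the set of tags treated as block boundaries, and the 250 character cut-off are all assumptions of mine for illustration:&lt;/p&gt;

```python
from html.parser import HTMLParser

# Tags treated as block boundaries in this sketch (an assumption)
BLOCK_TAGS = {'p', 'div', 'td', 'li', 'article', 'section', 'br', 'h1', 'h2', 'h3'}
# Tags whose text is never article content
SKIP_TAGS = {'script', 'style'}


class BlockCollector(HTMLParser):
    """Collect the text found between block-level tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.blocks = []    # finished text blocks
        self.buffer = []    # text fragments of the block being read
        self.skipping = 0   # depth inside script/style tags

    def flush(self):
        # join the fragments and normalise whitespace
        text = ' '.join(''.join(self.buffer).split())
        if text:
            self.blocks.append(text)
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skipping += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skipping = max(0, self.skipping - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self.skipping:
            self.buffer.append(data)


def summarize(html, max_chars=250):
    """Return the first max_chars characters of the longest text block."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    collector.flush()
    if not collector.blocks:
        return ''
    return max(collector.blocks, key=len)[:max_chars]
```

&lt;p&gt;Inline tags such as links and emphasis do not end a block, so a paragraph containing them is still measured as one piece of text.&lt;/p&gt;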
</content>
 </entry>
 
 <entry>
   <title>Image efficiencies</title>
   <link href="http://sitescraper.net/blog/Image-efficiencies/"/>
   <updated>2010-09-14T00:00:00+10:00</updated>
   <id>http://sitescraper.net/blog/Image-efficiencies</id>
   <content type="html">&lt;p&gt;I needed to store a large quantities of images so took the following measurements:&lt;/p&gt;

&lt;table cellspacing=&quot;10px&quot;&gt;
    &lt;thead&gt;
        &lt;tr&gt;
            &lt;th&gt;Format&lt;/th&gt;
            &lt;th&gt;Time&lt;/th&gt;
            &lt;th&gt;Size (bytes)&lt;/th&gt;
            &lt;th&gt;Sample&lt;/th&gt;
        &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;bmp&lt;/td&gt;&lt;td&gt;0.715670&lt;/td&gt;&lt;td&gt;1769526&lt;/td&gt;&lt;th&gt;
&lt;img border=&quot;0&quot; src=&quot;/static/img/blog/efficiencies/test.bmp&quot; /&gt;&lt;/th&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;gif&lt;/td&gt;&lt;td&gt;4.184417&lt;/td&gt;&lt;td&gt;501931&lt;/td&gt;&lt;th&gt;
&lt;img border=&quot;0&quot; src=&quot;/static/img/blog/efficiencies/test.gif&quot; /&gt;&lt;/th&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;jpg&lt;/td&gt;&lt;td&gt;2.507811&lt;/td&gt;&lt;td&gt;22252&lt;/td&gt;&lt;th&gt;
&lt;img border=&quot;0&quot; src=&quot;/static/img/blog/efficiencies/test.jpg&quot; /&gt;&lt;/th&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;png&lt;/td&gt;&lt;td&gt;10.909442&lt;/td&gt;&lt;td&gt;67295&lt;/td&gt;&lt;th&gt;
&lt;img border=&quot;0&quot; src=&quot;/static/img/blog/efficiencies/test.png&quot; /&gt;&lt;/th&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;ppm&lt;/td&gt;&lt;td&gt;0.648540&lt;/td&gt;&lt;td&gt;1769488&lt;/td&gt;&lt;td&gt;
Browsers do not support PPM&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;tiff&lt;/td&gt;&lt;td&gt;1.011216&lt;/td&gt;&lt;td&gt;1769600&lt;/td&gt;&lt;td&gt;
Browsers do not support TIFF&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;


&lt;p&gt;GIF is the clear loser - it takes a long time to process but still looks terrible.&lt;br/&gt;
For space use JPEG; for speed use PPM.&lt;/p&gt;

&lt;p&gt;Google's new &lt;a href=&quot;http://code.google.com/speed/webp/index.html&quot;&gt;WebP format&lt;/a&gt; looks promising.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Client feedback</title>
   <link href="http://sitescraper.net/blog/Client-Feedback/"/>
   <updated>2010-09-06T00:00:00+10:00</updated>
   <id>http://sitescraper.net/blog/Client-Feedback</id>
   <content type="html">&lt;p&gt;I was concerned about what blind spots I might have with the way I run my business. For example I am Australian and Australian's are usually very informal, even in a professional setting - was my communication with international clients too informal?&lt;/p&gt;

&lt;p&gt;To try and address these concerns I developed a &lt;a href=&quot;http://sitescraper.net/feedback&quot;&gt;feedback survey&lt;/a&gt; with Google Docs, which I have been (politely) requesting my clients to complete at the end of a job. The results have been helpful, and it also seems to have impressed some clients that I wanted their feedback. Wish I had thought of this earlier!&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Why reinvent the wheel?</title>
   <link href="http://sitescraper.net/blog/Why-reinvent-the-wheel/"/>
   <updated>2010-08-27T00:00:00+10:00</updated>
   <id>http://sitescraper.net/blog/Why-reinvent-the-wheel</id>
   <content type="html">&lt;p&gt;I have been asked a few times why I chose to &lt;em&gt;reinvent the wheel&lt;/em&gt; when libraries such as &lt;a href=&quot;http://scrapy.org/&quot;&gt;Scrapy&lt;/a&gt; and &lt;a href=&quot;http://codespeak.net/lxml/&quot;&gt;lxml&lt;/a&gt; already exist.&lt;/p&gt;

&lt;p&gt;I am aware of these libraries and have used them in the past with good results. However my current work involves building relatively simple web scraping scripts that I want to run without hassle on the client's machine. This rules out installing full frameworks such as Scrapy or compiling C-based libraries such as lxml - I need a pure Python solution. This also gives me the flexibility to run the script on &lt;a href=&quot;http://code.google.com/appengine/&quot;&gt;Google App Engine&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To scrape webpages there are generally two stages: parse the HTML and then select the relevant nodes.&lt;br/&gt;
The most well known Python HTML parser seems to be &lt;a href=&quot;http://www.crummy.com/software/BeautifulSoup/&quot;&gt;BeautifulSoup&lt;/a&gt;, however I find it &lt;a href=&quot;http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/&quot;&gt;slow&lt;/a&gt; and difficult to use (compared to XPath), it often parses HTML inaccurately, and, significantly, the original author has &lt;a href=&quot;http://www.crummy.com/software/BeautifulSoup/3.1-problems.html&quot;&gt;lost interest&lt;/a&gt; in further developing it. So I would not recommend using it - instead go with &lt;a href=&quot;http://code.google.com/p/html5lib/&quot;&gt;html5lib&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To select HTML content I use &lt;a href=&quot;http://blog.sitescraper.net/2010/06/how-to-use-xpaths-robustly.html&quot;&gt;XPath&lt;/a&gt;. Is there a decent pure Python XPath solution? I didn't find one 6 months ago when I needed it so developed &lt;a href=&quot;http://code.google.com/p/webscraping/source/browse/xpath.py&quot;&gt;this simple version&lt;/a&gt; that covers my typical use cases. I would deprecate this in future if a decent solution does come along, but for now I am happy with my pure Python infrastructure.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Best website for freelancers</title>
   <link href="http://sitescraper.net/blog/Best-website-for-freelancers/"/>
   <updated>2010-08-20T00:00:00+10:00</updated>
   <id>http://sitescraper.net/blog/Best-website-for-freelancers</id>
   <content type="html">&lt;p&gt;When I started freelancing I tried competing for as much work as possible by creating accounts on every freelance site I could find: &lt;a href=&quot;http://www.odesk.com/&quot;&gt;oDesk&lt;/a&gt;, &lt;a href=&quot;http://www.guru.com/&quot;&gt;guru&lt;/a&gt;, &lt;a href=&quot;http://www.scriptlance.com/&quot;&gt;scriptlance&lt;/a&gt;, and many others. However to my surprise I got almost all my work from just one source - &lt;a href=&quot;http://www.elance.com/?rid=1I2ZU&quot;&gt;&lt;em&gt;Elance&lt;/em&gt;&lt;/a&gt;. How is Elance different?&lt;/p&gt;

&lt;p&gt;With most freelancing sites you create an account and then can start bidding for jobs straight away. There is generally no cost to bidding so freelancers tend to bid on projects even if they don't have the skills or time to complete them. This is obviously frustrating for clients, who waste a lot of time sifting through bids.&lt;/p&gt;

&lt;p&gt;However with Elance there is a high barrier to entry: you have to pass a test, receive a phone call to confirm your identity, and pay money for each job you bid on. Often I see jobs on Elance with no bids because they require obscure experience - people weren't willing to waste their money bidding for a job they can't do. This barrier serves to weed out some of the less serious workers so that the average bid is of higher quality.&lt;/p&gt;

&lt;p&gt;From my experience the clients are different on Elance too. On most freelancing sites clients are trying to get the job done for the smallest amount of money possible and so are often willing to spend their time sifting through dozens of proposals, hoping to get lucky. Elance seems to attract clients who consider their time valuable and are willing to pay a premium for good service.&lt;br/&gt;
Often clients contact me directly through Elance because I am a native English speaker and they want to avoid potential communication problems. One client even asked me to double my bid because &quot;we are not cheap!&quot;&lt;/p&gt;

&lt;p&gt;After a year of freelancing I now get the majority of work directly through &lt;a href=&quot;http://sitescraper.net/&quot;&gt;my website&lt;/a&gt;, but still get a decent percentage of clients through Elance.&lt;/p&gt;

&lt;p&gt;My advice for new freelancers - focus on building your Elance profile and don't waste your time with the others. (Though do let me know if you have had good experience elsewhere.)&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>All your data are belong to us?</title>
   <link href="http://sitescraper.net/blog/All-your-data-are-belong-to-us/"/>
   <updated>2010-07-24T00:00:00+10:00</updated>
   <id>http://sitescraper.net/blog/All-your-data-are-belong-to-us</id>
   <content type="html">&lt;p&gt;Regarding the title of this blog &quot;All your data are belong to us&quot; - I realized not everyone gets the reference. See &lt;a href=&quot;http://en.wikipedia.org/wiki/All_your_base_are_belong_to_us&quot;&gt;this wikipedia article&lt;/a&gt; for an explanation.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Caching crawled webpages</title>
   <link href="http://sitescraper.net/blog/Caching-crawled-webpages/"/>
   <updated>2010-07-10T00:00:00+10:00</updated>
   <id>http://sitescraper.net/blog/Caching-crawled-webpages</id>
   <content type="html">&lt;p&gt;When crawling a website I store the HTML in a local cache so if I need to rescrape the website later I can load the webpages quickly from my local cache and avoid extra load on their website server. This is often necessary when a client realizes they require additional features scraped.&lt;/p&gt;

&lt;p&gt;I built the &lt;a href=&quot;http://code.google.com/p/webscraping/source/browse/pdict.py&quot;&gt;pdict library&lt;/a&gt; to manage my cache. Pdict provides a dictionary-like interface but stores the data in a sqlite database on disk rather than in memory. All data is automatically compressed (using zlib) before writing and decompressed after reading. Both zlib and sqlite3 come built in with Python (2.5+) so there are no external dependencies.&lt;/p&gt;

&lt;p&gt;Here is some example usage of pdict:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;python&quot;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;webscraping.pdict&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;PersistentDict&lt;/span&gt;  
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cache&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;PersistentDict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CACHE_FILE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cache&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;html1&lt;/span&gt;  
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cache&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;html2&lt;/span&gt;  
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;url1&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cache&lt;/span&gt;  
&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;  
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cache&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;  
&lt;span class=&quot;n&quot;&gt;html1&lt;/span&gt;  
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cache&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;keys&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;  
&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;url2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;  
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;del&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cache&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;  
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;url1&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cache&lt;/span&gt;  
&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;  
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
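
&lt;p&gt;The idea behind such a cache can be sketched in a few lines. This is not the actual pdict source, only a minimal illustration written for Python 3: the class name &lt;code&gt;TinyPersistentDict&lt;/code&gt; and the table layout are assumptions made for the example, and values are assumed to be text.&lt;/p&gt;

```python
import sqlite3
import zlib
from collections.abc import MutableMapping


class TinyPersistentDict(MutableMapping):
    """Hypothetical sketch: a dict stored in sqlite with zlib-compressed values."""

    def __init__(self, filename):
        self.conn = sqlite3.connect(filename)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value BLOB)')
        self.conn.commit()

    def __setitem__(self, key, value):
        # compress before writing
        blob = sqlite3.Binary(zlib.compress(value.encode('utf-8')))
        self.conn.execute(
            'INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)', (key, blob))
        self.conn.commit()

    def __getitem__(self, key):
        row = self.conn.execute(
            'SELECT value FROM cache WHERE key = ?', (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        # decompress after reading
        return zlib.decompress(bytes(row[0])).decode('utf-8')

    def __delitem__(self, key):
        if key not in self:
            raise KeyError(key)
        self.conn.execute('DELETE FROM cache WHERE key = ?', (key,))
        self.conn.commit()

    def __iter__(self):
        for (key,) in self.conn.execute('SELECT key FROM cache ORDER BY rowid'):
            yield key

    def __len__(self):
        return self.conn.execute('SELECT COUNT(*) FROM cache').fetchone()[0]
```

&lt;p&gt;Inheriting from &lt;code&gt;MutableMapping&lt;/code&gt; supplies &lt;code&gt;in&lt;/code&gt;, &lt;code&gt;keys()&lt;/code&gt; and &lt;code&gt;get()&lt;/code&gt; for free, so only the five methods above need to touch the database.&lt;/p&gt;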



</content>
 </entry>
 
 <entry>
   <title>Fixed fee or hourly?</title>
   <link href="http://sitescraper.net/blog/Fixed-fee-or-hourly/"/>
   <updated>2010-07-01T00:00:00+10:00</updated>
   <id>http://sitescraper.net/blog/Fixed-fee-or-hourly</id>
   <content type="html">&lt;p&gt;I prefer to quote per project rather than per hour for my web scraping work because it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;gives me incentive to increase my efficiency (by improving my infrastructure)&lt;/li&gt;
&lt;li&gt;gives the client security about the total cost&lt;/li&gt;
&lt;li&gt;avoids distrust about the number of hours actually worked&lt;/li&gt;
&lt;li&gt;makes me look more competitive compared to the hourly rates available in Asia and Eastern Europe&lt;/li&gt;
&lt;li&gt;avoids the difficulty of tracking time fairly when working on two or more projects simultaneously&lt;/li&gt;
&lt;li&gt;is easy to estimate from past experience, at least compared to building websites&lt;/li&gt;
&lt;li&gt;involves less administration&lt;/li&gt;
&lt;/ul&gt;

</content>
 </entry>
 
 <entry>
   <title>Open sourced web scraping code</title>
   <link href="http://sitescraper.net/blog/Open-sourced-web-scraping-code/"/>
   <updated>2010-06-12T00:00:00+10:00</updated>
   <id>http://sitescraper.net/blog/Open-sourced-web-scraping-code</id>
   <content type="html">&lt;p&gt;For most scraping jobs I use the same general approach of crawling, selecting the appropriate nodes, and then saving the results. Consequently I reuse a lot of code across projects, which I have now combined into a library. Most of this infrastructure is available open sourced on &lt;a href=&quot;http://code.google.com/p/webscraping/&quot;&gt;Google Code&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The code in that repository is licensed under the &lt;a href=&quot;http://www.opensource.org/licenses/lgpl-2.1.php&quot;&gt;LGPL&lt;/a&gt;, which means you are free to use it in your own applications (including commercial ones) but are obliged to release any changes you make to the library. This is different from the more popular &lt;a href=&quot;http://www.opensource.org/licenses/gpl-2.0.php&quot;&gt;GPL&lt;/a&gt;, which would make the library unusable in most commercial projects. It is also different from the &lt;a href=&quot;http://www.opensource.org/licenses/bsd-license.php&quot;&gt;BSD&lt;/a&gt; and &lt;a href=&quot;http://sam.zoy.org/wtfpl/&quot;&gt;WTFPL&lt;/a&gt; style licenses, which would let people do whatever they want with the library, including making changes and not releasing them.&lt;/p&gt;

&lt;p&gt;I think the LGPL is a good balance for libraries because it lets anyone use the code while everyone can benefit from improvements made by individual users.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Why web2py?</title>
   <link href="http://sitescraper.net/blog/Why-web2py/"/>
   <updated>2010-05-03T00:00:00+10:00</updated>
   <id>http://sitescraper.net/blog/Why-web2py</id>
   <content type="html">&lt;p&gt;In a previous post I mentioned that &lt;a href=&quot;http://web2py.com/&quot;&gt;web2py&lt;/a&gt; is my weapon of choice for building web applications. Before web2py I had learnt a variety of approaches to building dynamic websites (raw PHP, Python CGI, Turbogears, Symfony, Rails, Django), but find myself most productive with web2py.&lt;/p&gt;

&lt;p&gt;This is because web2py:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;uses a pure Python templating system without restrictions - &quot;we're all consenting adults here&quot;&lt;/li&gt;
&lt;li&gt;supports database migrations&lt;/li&gt;
&lt;li&gt;has automatic form generation and validation with &lt;a href=&quot;http://web2py.com/book/default/chapter/07#SQLFORM&quot;&gt;SQLFORM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;runs on &lt;a href=&quot;http://web2py.com/book/default/chapter/11#Google-App-Engine&quot;&gt;Google App Engine&lt;/a&gt; without modification&lt;/li&gt;
&lt;li&gt;has a highly active and friendly user &lt;a href=&quot;http://groups.google.com/group/web2py&quot;&gt;forum&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;is developed rapidly - feature requests are often written and committed to trunk within the day&lt;/li&gt;
&lt;li&gt;supports multiple apps for a single install&lt;/li&gt;
&lt;li&gt;lets me develop apps through the browser admin&lt;/li&gt;
&lt;li&gt;commits to backward compatibility&lt;/li&gt;
&lt;li&gt;has no configuration files or dependencies - works out of the box&lt;/li&gt;
&lt;li&gt;has sensible defaults for view templates, imported modules, etc&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;The downsides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;highly dependent on Massimo (the project leader)&lt;/li&gt;
&lt;li&gt;the name &lt;em&gt;web2py&lt;/em&gt; is unattractive compared to rails, pylons, web.py, etc&lt;/li&gt;
&lt;li&gt;few designers, so the example applications look crude&lt;/li&gt;
&lt;li&gt;&lt;strike&gt;inconsistent scattered documentation&lt;/strike&gt; [online book now available &lt;a href=&quot;http://web2py.com/book&quot;&gt;here&lt;/a&gt;!]&lt;/li&gt;
&lt;/ul&gt;

</content>
 </entry>
 
 <entry>
   <title>Why Google App Engine?</title>
   <link href="http://sitescraper.net/blog/Why-Google-App-Engine/"/>
   <updated>2010-04-15T00:00:00+10:00</updated>
   <id>http://sitescraper.net/blog/Why-Google-App-Engine</id>
   <content type="html">&lt;p&gt;In the previous post I covered three alternative approaches to regularly scrape a website for a client, with the most common one being in the form of a web application. However hosting the web application on either my own or the clients server has problems.&lt;/p&gt;

&lt;p&gt;My solution is to host the application on a neutral third party platform - &lt;a href=&quot;http://code.google.com/appengine/whyappengine.html&quot;&gt;Google App Engine&lt;/a&gt; (GAE). Here is my overview of deploying on GAE:&lt;/p&gt;

&lt;p&gt;Pros:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;provides a stable and consistent platform that I can use for multiple applications&lt;/li&gt;
&lt;li&gt;both the customer and I can login and manage it, so we do not need to expose our servers&lt;/li&gt;
&lt;li&gt;has &lt;a href=&quot;http://code.google.com/appengine/docs/quotas.html&quot;&gt;generous free quotas&lt;/a&gt;, which I rarely exhaust&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Cons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;only supports pure Python (or Java), so libraries that rely on C such as lxml are not supported (&lt;a href=&quot;http://code.google.com/p/googleappengine/issues/detail?id=18&quot;&gt;yet&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;limitations on maximum job time and interacting with the database&lt;/li&gt;
&lt;li&gt;have to trust Google with storing our scraped data&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Often deploying on GAE works well for both the client and me, but it is not always practical/possible. I am still looking for a &lt;a href=&quot;http://en.wikipedia.org/wiki/No_Silver_Bullet&quot;&gt;silver bullet&lt;/a&gt;!&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Scraping dynamic data</title>
   <link href="http://sitescraper.net/blog/Scraping-dynamic-data/"/>
   <updated>2010-04-12T00:00:00+10:00</updated>
   <id>http://sitescraper.net/blog/Scraping-dynamic-data</id>
   <content type="html">&lt;p&gt;Usually my clients request for a website to be scraped into a standard format like &lt;a href=&quot;http://en.wikipedia.org/wiki/Comma-separated_values&quot;&gt;CSV&lt;/a&gt;, which they can then integrate with their existing applications. However sometimes a client need a website scraped periodically because its data is continually updated. An example of the first use case is census statistics, and of the second stock prices.&lt;/p&gt;

&lt;p&gt;I have three solutions for periodically scraping a website:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;I provide the client with my web scraping code, which they can then execute regularly&lt;/li&gt;
&lt;li&gt;Client pays me a small fee in future whenever they want the data rescraped&lt;/li&gt;
&lt;li&gt;I build a web application that scrapes regularly and provides the data in a useful form&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;The first option is not always practical if the client does not have a technical background. Additionally my solutions are developed and tested on Linux and may not work on Windows.&lt;/p&gt;

&lt;p&gt;The second option is generally not attractive to the client because it puts them in a weak position where they are dependent on me being contactable and cooperative in future. &lt;br/&gt;
Also it involves ongoing costs for them.&lt;/p&gt;

&lt;p&gt;So usually I end up building a basic web application that consists of a cron job to do the scraping, an interface to the scraped data, and some administration settings. If the scraping jobs are not too big I am happy to host the application on my own server. However, most clients prefer the security of hosting it on their own server in case the app breaks down.&lt;/p&gt;

&lt;p&gt;Unfortunately I find hosting on their server does not work well because the client will often have different versions of libraries or use a platform I am not familiar with. Additionally I prefer to build my web applications in Python (using &lt;a href=&quot;http://web2py.com/&quot;&gt;web2py&lt;/a&gt;), and though Python is great for development it cannot compare to PHP for ease of deployment. I can usually figure this all out but it takes time and also trust from the client to give me root privilege on their server. And given that these web applications are generally low cost (~ $1000) the ease of deployment is important.&lt;/p&gt;

&lt;p&gt;All this is far from ideal. The solution? - see my next post.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Scraping Flash based websites</title>
   <link href="http://sitescraper.net/blog/Scraping-Flash-based-websites/"/>
   <updated>2010-03-27T00:00:00+11:00</updated>
   <id>http://sitescraper.net/blog/Scraping-Flash-based-websites</id>
   <content type="html">&lt;p&gt;Flash is a pain. It is flaky on Linux and can not be scraped like HTML because it uses a binary format. &lt;a href=&quot;http://en.wikipedia.org/wiki/HTML5&quot;&gt;HTML5&lt;/a&gt; and &lt;a href=&quot;http://www.apple.com/hotnews/thoughts-on-flash/&quot;&gt;Apple's criticism of Flash&lt;/a&gt; are good news for me because they encourage developers to use non-Flash solutions.&lt;/p&gt;

&lt;p&gt;The reality, though, is that many sites currently use Flash to display content that I need to access. Here are some approaches for scraping Flash that I have tried:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check for AJAX requests between the Flash app and the server that may carry the data you are after&lt;/li&gt;
&lt;li&gt;Extract text with the &lt;a href=&quot;http://www.adobe.com/aboutadobe/pressroom/pressreleases/200806/070108AdobeRichMediaSearch.html&quot;&gt;Macromedia Flash Search Engine SDK&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Use &lt;a href=&quot;http://en.wikipedia.org/wiki/Optical_character_recognition&quot;&gt;OCR&lt;/a&gt; to extract the text directly&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;Most Flash apps are self-contained and so don't use AJAX, which rules out (1). And I have had poor results with (2) and (3).&lt;/p&gt;

&lt;p&gt;Still no silver bullet...&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>I love AJAX!</title>
   <link href="http://sitescraper.net/blog/I-love-AJAX/"/>
   <updated>2010-03-16T00:00:00+11:00</updated>
   <id>http://sitescraper.net/blog/I-love-AJAX</id>
   <content type="html">&lt;p&gt;&lt;a href=&quot;http://en.wikipedia.org/wiki/Ajax_%28programming%29&quot;&gt;AJAX&lt;/a&gt; is a JavaScript technique that allows a webpage to request URLs from its backend server and then make use of the returned data. For example gmail uses AJAX to load new messages. The old way to do this was reloading the webpage and then embedding the new content in the HTML, which was inefficient because it required downloading the entire webpage again rather that just the updated data.&lt;br/&gt;
AJAX is good for developers because it makes more complex web applications possible. It is good for users because it gives them a faster and smoother browsing experience. And it is good for me because AJAX powered websites are often &lt;em&gt;easier&lt;/em&gt; to scrape.&lt;/p&gt;

&lt;p&gt;The trouble with scraping websites is that they obscure the data I am after within a layer of HTML presentation. However, AJAX calls typically return just the data in an easy-to-parse format like &lt;a href=&quot;http://en.wikipedia.org/wiki/JSON&quot;&gt;JSON&lt;/a&gt; or &lt;a href=&quot;http://en.wikipedia.org/wiki/XML&quot;&gt;XML&lt;/a&gt;. So effectively they provide an API to their backend database.&lt;/p&gt;

&lt;p&gt;These AJAX calls can be monitored through tools such as &lt;a href=&quot;http://getfirebug.com/&quot;&gt;Firebug&lt;/a&gt; to see what URLs are called and what they return from the server. Then I can call these URLs directly myself from outside the application and change the query parameters to fetch other records.&lt;/p&gt;
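
&lt;p&gt;Replaying such a call might look like the following Python 3 sketch. The endpoint, the &lt;code&gt;page&lt;/code&gt; parameter and the shape of the JSON response are all hypothetical; substitute whatever the monitoring tool actually shows for the site in question.&lt;/p&gt;

```python
import json
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.request import urlopen


def with_params(url, **changes):
    """Return url with the given query parameters added or overridden."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query.update((key, str(value)) for key, value in changes.items())
    return urlunsplit(parts._replace(query=urlencode(query)))


def parse_records(body):
    """Decode a JSON body; the 'records' key is an assumed response shape."""
    return json.loads(body).get('records', [])


def fetch_records(url):
    """Download one AJAX URL and decode it (needs network access)."""
    with urlopen(url) as response:
        return parse_records(response.read().decode('utf-8'))


# URL as it might be observed in the monitoring tool (hypothetical endpoint)
observed = 'http://example.com/ajax/search?page=1'

# the same call for the first three pages of results
page_urls = [with_params(observed, page=n) for n in range(1, 4)]
```

&lt;p&gt;Each URL in &lt;code&gt;page_urls&lt;/code&gt; can then be passed to &lt;code&gt;fetch_records&lt;/code&gt;, with no HTML parsing involved.&lt;/p&gt;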
</content>
 </entry>
 
 <entry>
   <title>Scraping JavaScript webpages with webkit</title>
   <link href="http://sitescraper.net/blog/Scraping-JavaScript-webpages-with-webkit/"/>
   <updated>2010-03-12T00:00:00+11:00</updated>
   <id>http://sitescraper.net/blog/Scraping-JavaScript-webpages-with-webkit</id>
   <content type="html">&lt;p&gt;In the previous post I covered how to tackle JavaScript based websites with &lt;a href=&quot;http://groups.csail.mit.edu/uid/chickenfoot/&quot;&gt;Chickenfoot&lt;/a&gt;. Chickenfoot is great but not perfect because it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;requires me to program in JavaScript rather than my beloved Python (with all its great libraries)&lt;/li&gt;
&lt;li&gt;is slow because I have to wait for Firefox to render the entire webpage&lt;/li&gt;
&lt;li&gt;is somewhat buggy and has a small user/developer community, mostly at MIT&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;An alternative solution that addresses all these points is &lt;a href=&quot;http://en.wikipedia.org/wiki/WebKit&quot;&gt;webkit&lt;/a&gt;, the open source browser engine used most famously in Apple's Safari browser. Webkit has now been ported to the &lt;a href=&quot;http://en.wikipedia.org/wiki/Qt_%28framework%29&quot;&gt;Qt framework&lt;/a&gt; and can be used through its &lt;a href=&quot;http://www.pyside.org/docs/pyside/PySide/QtWebKit/index.html&quot;&gt;Python bindings&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here is a simple class that renders a webpage (including executing any JavaScript) and then makes the final HTML available as a string:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sys&lt;/span&gt;  
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;PyQt4.QtGui&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;  
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;PyQt4.QtCore&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;  
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;PyQt4.QtWebKit&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;  
  
&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Render&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;QWebPage&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;  
  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;  
    &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;app&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;QApplication&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sys&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;argv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  
    &lt;span class=&quot;n&quot;&gt;QWebPage&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  
    &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;loadFinished&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;connect&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_loadFinished&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  
    &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mainFrame&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;QUrl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;  
    &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;exec_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;  
  
  &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;_loadFinished&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;  
    &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;frame&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mainFrame&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;  
    &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;app&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;quit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;  
  
&lt;span class=&quot;n&quot;&gt;url&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;http://sitescraper.net&amp;#39;&lt;/span&gt;  
&lt;span class=&quot;n&quot;&gt;r&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Render&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  
&lt;span class=&quot;n&quot;&gt;html&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;frame&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;toHtml&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;  
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;I can then analyze this resulting HTML with my standard Python tools like the &lt;a href=&quot;http://code.google.com/p/webscraping/&quot;&gt;webscraping module&lt;/a&gt;.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Scraping JavaScript based web pages with Chickenfoot</title>
   <link href="http://sitescraper.net/blog/Scraping-JavaScript-based-web-pages-with-Chickenfoot/"/>
   <updated>2010-03-02T00:00:00+11:00</updated>
   <id>http://sitescraper.net/blog/Scraping-JavaScript-based-web-pages-with-Chickenfoot</id>
   <content type="html">&lt;p&gt;The data from most webpages can be scraped by simply downloading the HTML and then parsing out the desired content. However some webpages load their content dynamically with JavaScript after the page loads so that the desired data is not found in the original HTML. This is usually done for legitimate reasons such as loading the page faster, but in some cases is designed solely to inhibit scrapers.&lt;br/&gt;
This can make scraping a little tougher, but not impossible.&lt;/p&gt;

&lt;p&gt;The easiest case is where the content is stored in JavaScript structures which are then inserted into the &lt;a href=&quot;http://www.w3schools.com/htmldom/default.asp&quot;&gt;DOM&lt;/a&gt; at page load. This means the content is still embedded in the downloaded page but needs to be scraped from the JavaScript code rather than the HTML tags.&lt;/p&gt;
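
&lt;p&gt;For this easy case a regular expression is often enough. The sketch below (Python 3) assumes the page assigns an array literal to a variable and that the literal happens to be valid JSON; the variable name &lt;code&gt;products&lt;/code&gt; and the sample source are made up for the example.&lt;/p&gt;

```python
import json
import re


def extract_js_array(source, variable):
    """Return the array assigned to `variable` in the page source, or None."""
    pattern = r'var\s+' + re.escape(variable) + r'\s*=\s*(\[.*?\])\s*;'
    match = re.search(pattern, source, re.DOTALL)
    if match is None:
        return None
    # only works when the JavaScript literal is also valid JSON
    return json.loads(match.group(1))


# stand-in for the script portion of a downloaded page
page = 'var products = [{"name": "Widget", "price": 9.5}];'
products = extract_js_array(page, 'products')
```

&lt;p&gt;If the literal uses JavaScript-only syntax such as unquoted keys, &lt;code&gt;json.loads&lt;/code&gt; will raise an error and a more forgiving parser is needed.&lt;/p&gt;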

&lt;p&gt;A trickier case is where websites encode their content in the HTML and then use JavaScript to decode it on page load. It is possible to convert such functions into Python and then run them over the downloaded HTML, but often an easier and quicker alternative is to execute the original JavaScript. One tool that does this is the Firefox &lt;a href=&quot;http://groups.csail.mit.edu/uid/chickenfoot/&quot;&gt;Chickenfoot&lt;/a&gt; extension. Chickenfoot consists of a Firefox panel where you can execute arbitrary JavaScript code within a webpage and across multiple webpages. It also comes with a number of &lt;a href=&quot;http://groups.csail.mit.edu/uid/chickenfoot/api.html&quot;&gt;high level functions&lt;/a&gt; to make interaction and navigation easier.&lt;/p&gt;

&lt;p&gt;To get a feel for Chickenfoot here is an example to crawl a website:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;javascript&quot;&gt;&lt;span class=&quot;c1&quot;&gt;// crawl given website url recursively to given depth  &lt;/span&gt;
&lt;span class=&quot;kd&quot;&gt;function&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;crawl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;website&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;max_depth&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;links&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;  
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;links&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;  
    &lt;span class=&quot;nx&quot;&gt;links&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{};&lt;/span&gt;  
    &lt;span class=&quot;nx&quot;&gt;go&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;website&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;  
    &lt;span class=&quot;nx&quot;&gt;links&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;website&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;  
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;  
  
  &lt;span class=&quot;c1&quot;&gt;// TODO: insert code to act on current webpage here&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;max_depth&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;  
    &lt;span class=&quot;c1&quot;&gt;// iterate links  &lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;link&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;find&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&amp;quot;link&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;link&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;hasMatch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;link&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;link&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;next&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;    
      &lt;span class=&quot;kd&quot;&gt;var&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;url&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;link&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;element&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;href&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;  
      &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;links&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;  
        &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;indexOf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;website&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;  
          &lt;span class=&quot;c1&quot;&gt;// same domain  &lt;/span&gt;
          &lt;span class=&quot;nx&quot;&gt;go&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;  
          &lt;span class=&quot;nx&quot;&gt;links&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;  
          &lt;span class=&quot;nx&quot;&gt;crawl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nx&quot;&gt;website&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;max_depth&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;links&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;  
        &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;  
      &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;  
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;  
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;  
  &lt;span class=&quot;nx&quot;&gt;back&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt; &lt;span class=&quot;nx&quot;&gt;wait&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;  
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;This is part of a script I built on my Linux machine for a client on Windows and it worked fine for both of us. To find out more about Chickenfoot check out their &lt;a href=&quot;http://video.google.com/videoplay?docid=-8967914974980683249&quot;&gt;video&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Chickenfoot is a useful weapon in my web scraping arsenal, particularly for quick jobs with a low to medium amount of data. For larger websites there is a more suitable alternative, which I will cover in the next post.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>How to crawl websites without being blocked</title>
   <link href="http://sitescraper.net/blog/How-to-crawl-websites-without-being-blocked/"/>
   <updated>2010-02-08T00:00:00+11:00</updated>
   <id>http://sitescraper.net/blog/How-to-crawl-websites-without-being-blocked</id>
   <content type="html">&lt;p&gt;Websites want users who will purchase their products and click on their advertising. They want to be crawled by search engines so their users can find them, however they don't (generally) want to be crawled by others. One such company is &lt;a href=&quot;http://www.google.com/accounts/TOS&quot;&gt;Google&lt;/a&gt;, ironically.&lt;/p&gt;

&lt;p&gt;Some websites will actively try to stop scrapers so here are some suggestions to help you crawl beneath their radar.&lt;/p&gt;

&lt;h2&gt;Speed&lt;/h2&gt;

&lt;p&gt;If you download 1 webpage a day then you will not be blocked, but your crawl would take too long to be useful. If you instead used threading to crawl multiple URLs asynchronously then they might mistake you for a &lt;a href=&quot;http://en.wikipedia.org/wiki/Denial-of-service_attack&quot;&gt;DOS attack&lt;/a&gt; and blacklist your IP. So what is the happy medium? The &lt;a href=&quot;http://en.wikipedia.org/wiki/Web_crawler#Politeness_policy&quot;&gt;Wikipedia article on web crawlers&lt;/a&gt; currently states &lt;em&gt;Anecdotal evidence from access logs shows that access intervals from known crawlers vary between 20 seconds and 34 minutes&lt;/em&gt;. This is a little slow and I have found 1 download every 5 seconds is usually fine. If you don't need the data quickly then use a longer delay to reduce your risk and be kinder to their server.&lt;/p&gt;

&lt;h2&gt;Identity&lt;/h2&gt;

&lt;p&gt;Websites do not want to block genuine users so you should try to look like one. Set your &lt;a href=&quot;http://en.wikipedia.org/wiki/User_agent&quot;&gt;user-agent&lt;/a&gt; to a &lt;a href=&quot;http://whatsmyuseragent.com/CommonUserAgents.asp&quot;&gt;common web browser&lt;/a&gt; instead of using the library default (such as &lt;em&gt;wget/version or urllib/version&lt;/em&gt;). You could even pretend to be the Google Bot (only for the brave): &lt;em&gt;Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&lt;/em&gt;&lt;/p&gt;
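&lt;p&gt;As a rough sketch of the user-agent advice, here is how the header could be set with Python 3's &lt;em&gt;urllib.request&lt;/em&gt; (the successor to urllib2). The browser string and the &lt;em&gt;build_request&lt;/em&gt; helper are illustrative assumptions, not values any particular website requires:&lt;/p&gt;

```python
import urllib.request

# Assumed example value: any current mainstream browser string would do.
BROWSER_USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0"


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Return a Request that identifies itself as a browser instead of Python-urllib."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = build_request("http://example.com/")
# urllib stores header names capitalized, hence "User-agent" here.
print(request.get_header("User-agent"))
```

&lt;p&gt;Passing this request to &lt;em&gt;urllib.request.urlopen&lt;/em&gt; would then send the chosen header with the download.&lt;/p&gt;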

&lt;p&gt;If you have access to multiple IP addresses (for example via proxies) then distribute your requests among them so that it appears your downloading comes from multiple users.&lt;/p&gt;

&lt;h2&gt;Consistency&lt;/h2&gt;

&lt;p&gt;Avoid accessing webpages sequentially: &lt;em&gt;/product/1&lt;/em&gt;, &lt;em&gt;/product/2&lt;/em&gt;, etc. And don't download a new webpage exactly every N seconds. Both of these mistakes can attract attention to your downloading because a real user browses more randomly. So make sure to crawl webpages in an unordered manner and add a random offset to the delay between downloads.&lt;/p&gt;
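&lt;p&gt;Here is a minimal Python 3 sketch of both ideas, shuffling the crawl order and adding a random offset to the delay. The &lt;em&gt;crawl_politely&lt;/em&gt; name, the &lt;em&gt;download&lt;/em&gt; callable, and the 2 second jitter are assumptions for illustration; the 5 second base delay is the figure from the Speed section:&lt;/p&gt;

```python
import random
import time


def crawl_politely(urls, download, base_delay=5.0, jitter=2.0, sleep=time.sleep):
    """Visit urls in random order, pausing a randomized interval between downloads.

    download is a hypothetical callable that fetches one URL; sleep is
    injectable so the pacing can be checked without actually waiting.
    """
    order = list(urls)
    random.shuffle(order)  # avoid /product/1, /product/2, ... in sequence
    delays = []
    for position, url in enumerate(order):
        if position:  # no need to wait before the first request
            delay = base_delay + random.uniform(0, jitter)
            sleep(delay)
            delays.append(delay)
        download(url)
    return order, delays
```

&lt;p&gt;Calling it with a list of product URLs and a real download function would fetch every page once, in a different order each run, with between 5 and 7 seconds separating requests.&lt;/p&gt;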

&lt;p&gt;Following these recommendations will allow you to crawl most websites without being detected.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>How to protect your data</title>
   <link href="http://sitescraper.net/blog/How-to-protect-your-data/"/>
   <updated>2010-02-05T00:00:00+11:00</updated>
   <id>http://sitescraper.net/blog/How-to-protect-your-data</id>
   <content type="html">&lt;p&gt;You spent time and money collecting the data in your website so you want to prevent someone else downloading and reusing it. However you still want Google to index your website so that people can find you. This is a common problem. Below I will outline some strategies to protect your data.&lt;/p&gt;

&lt;h2&gt;Restrict&lt;/h2&gt;

&lt;p&gt;Firstly if your data really is valuable then perhaps it shouldn't be all publicly available. Often websites will display the basic data to standard users / search engines and the more valuable data (such as emails) only to logged in users. Then the website can easily track and control how much valuable data each account is accessing.&lt;/p&gt;

&lt;p&gt;If requiring accounts isn't practical and you want search engines to crawl your data, then realistically you can't prevent it being scraped, but you can discourage scrapers by setting a high enough barrier.&lt;/p&gt;

&lt;h2&gt;Obfuscate&lt;/h2&gt;

&lt;p&gt;Scrapers typically work by downloading the HTML for a URL and then extracting out the desired content. To make this process harder you can obfuscate your valuable data.&lt;/p&gt;

&lt;p&gt;The simplest way to obfuscate your data is to have it encoded on the server and then dynamically decoded with JavaScript in the client's browser. The scraper would then need to decode this JavaScript to extract the original data. This is not difficult for an experienced scraper, but would at least provide a small barrier.&lt;/p&gt;

&lt;p&gt;A better way is to encapsulate the key data within images or flash. Optical Character Recognition (OCR) techniques would then need to be used to extract the original data, which requires a lot of effort to do accurately. (Make sure the URL of the image does not reveal the original data, as one website did!) The free OCR tools that I have tested are at best 80% accurate, which makes the resulting data useless.&lt;br/&gt;
The tradeoff with encoding data in images is that there is more data for the client to download and genuine users can no longer conveniently copy the text. For example people often display their email address within an image to combat spammers, which then forces everyone else to type it out manually.&lt;/p&gt;

&lt;h2&gt;Challenge&lt;/h2&gt;

&lt;p&gt;A popular way to prevent automated scrapers is by forcing users to pass a &lt;a href=&quot;http://en.wikipedia.org/wiki/CAPTCHA&quot;&gt;CAPTCHA&lt;/a&gt;. For example Google does this when it gets too many search requests from the same IP within a short timeframe. To avoid the CAPTCHA the scraper could proceed slowly, but they probably can't afford to wait. To speed things up they may purchase multiple anonymous proxies to provide multiple IP addresses, but that is expensive - 10 anonymous proxies will cost ~$30 / month to rent. The CAPTCHA can also be solved automatically by a service like &lt;a href=&quot;http://deathbycaptcha.com/&quot;&gt;deathbycaptcha.com&lt;/a&gt;. This takes some effort to set up so would only be implemented by experienced scrapers for valuable data.&lt;/p&gt;

&lt;p&gt;CAPTCHAs are not a good solution for protecting your content - they annoy genuine users, can be bypassed by a determined scraper, and additionally make it difficult for the Google Bot to index your website properly. They are only a good solution when being indexed by Google is not a priority and you want to stop most scrapers.&lt;/p&gt;

&lt;h2&gt;Corrupt&lt;/h2&gt;

&lt;p&gt;If you are suspicious of an IP that is accessing your website you could block the IP, but then they would know they are detected and try a different approach. Instead you could allow the IP to continue downloading but return incorrect text or figures. This should be done subtly so that it is not clear which data is correct, leaving their entire data set corrupted. Perhaps they won't notice and you will be able to track them down later by searching for &lt;em&gt;purple monkey dishwasher&lt;/em&gt; or whatever other content was inserted!&lt;/p&gt;

&lt;h2&gt;Structure&lt;/h2&gt;

&lt;p&gt;Another factor that makes sites easy to scrape is when they use a URL structure like:&lt;br/&gt;
domain/product/product_title/product_id&lt;br/&gt;
For example these two URLs point to the same content on Amazon:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;http://www.amazon.com/Lets-Go-Australia-10th-Inc/dp/0312385757&quot;&gt;http://www.amazon.com/Lets-Go-Australia-10th-Inc/dp/0312385757&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;http://www.amazon.com/FAKE_TITLE/dp/0312385757&quot;&gt;http://www.amazon.com/FAKE_TITLE/dp/0312385757&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;The title is just to make the URL look pretty. This makes the site easy to crawl because the scraper can just iterate through all the IDs (in this case ISBNs). If the title here had to be consistent with the product ID then it would take more work to scrape.&lt;/p&gt;
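&lt;p&gt;From the website owner's side, such a consistency check could look something like this minimal Python 3 sketch. The &lt;em&gt;resolve_product&lt;/em&gt; function and the &lt;em&gt;titles&lt;/em&gt; lookup are hypothetical, and this is not a description of how Amazon actually routes its URLs:&lt;/p&gt;

```python
def resolve_product(path, titles):
    """Return the product ID only when the title in the URL matches that ID.

    titles is a hypothetical mapping of product ID to its URL title.
    """
    parts = path.strip("/").split("/")
    if len(parts) != 3 or parts[1] != "dp":
        return None
    url_title, _, product_id = parts
    if titles.get(product_id) != url_title:
        return None  # the website would answer with a 404 here
    return product_id


titles = {"0312385757": "Lets-Go-Australia-10th-Inc"}
print(resolve_product("/Lets-Go-Australia-10th-Inc/dp/0312385757", titles))
print(resolve_product("/FAKE_TITLE/dp/0312385757", titles))
```

&lt;p&gt;With a check like this a scraper can no longer guess URLs from the IDs alone, because it also needs to know each title in advance.&lt;/p&gt;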

&lt;h2&gt;Google&lt;/h2&gt;

&lt;p&gt;All of the above strategies could be ignored for the Google Bot to ensure your website is properly indexed. Be aware that anyone could pretend to be the Google Bot by setting their user-agent to &lt;em&gt;Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)&lt;/em&gt;, so to be confident you should also verify the IP address via a &lt;a href=&quot;http://www.google.com/support/webmasters/bin/answer.py?hl=en&amp;amp;answer=80553&quot;&gt;reverse DNS lookup&lt;/a&gt;. Be warned that Google has been known to punish websites that show their bot different content from what regular users see.&lt;/p&gt;
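&lt;p&gt;The DNS check can be sketched in Python 3 roughly as follows: resolve the IP to a hostname, confirm the hostname belongs to Google, then resolve that hostname forward and confirm it returns the same IP. The &lt;em&gt;is_google_bot&lt;/em&gt; name and the injectable lookup parameters are my own assumptions; a production version would also want caching and handling for hostnames with several addresses:&lt;/p&gt;

```python
import socket


def is_google_bot(ip, reverse_lookup=socket.gethostbyaddr,
                  forward_lookup=socket.gethostbyname):
    """Check that ip reverse-resolves to a Google hostname that resolves back to ip.

    The two lookup functions are injectable so the logic can be exercised
    without touching the network.
    """
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:  # no reverse DNS record for this IP
        return False
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        # A spoofed reverse record will not survive the forward lookup.
        return forward_lookup(hostname) == ip
    except OSError:
        return False
```

&lt;p&gt;Only requests that claim the Google Bot user-agent need this check, so the extra DNS lookups should be rare.&lt;/p&gt;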

&lt;p&gt;In the next post I will take the opposite point of view of someone trying to scrape a website.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Why Python</title>
   <link href="http://sitescraper.net/blog/Why-Python/"/>
   <updated>2010-02-02T00:00:00+11:00</updated>
   <id>http://sitescraper.net/blog/Why-Python</id>
   <content type="html">&lt;p&gt;Sometimes people ask why I use &lt;a href=&quot;http://www.python.org/&quot;&gt;Python&lt;/a&gt; instead of something faster like C/C++. For me the speed of a language is a low priority because in my work the overwhelming majority of execution time is spent waiting for data to be downloaded rather than for program instructions to finish. So it makes sense to use whatever language I can write good code fastest in, which is currently Python because of its high level syntax and excellent ecosystem of libraries. ESR wrote &lt;a href=&quot;http://www.python.org/about/success/esr/&quot;&gt;an article&lt;/a&gt; on why he likes Python that I expect resonates with many.&lt;/p&gt;

&lt;p&gt;Additionally Python is an interpreted language so it is easier for me to distribute my solutions to clients than it would be for a compiled language like C. Most of my scraping jobs are relatively small so distribution overhead is important.&lt;/p&gt;

&lt;p&gt;A few people have suggested I use &lt;a href=&quot;http://www.ruby-lang.org/&quot;&gt;ruby&lt;/a&gt; instead. I have used ruby and like it, but found it lacks the depth of libraries available to Python.&lt;/p&gt;

&lt;p&gt;However Python is by no means perfect - for example there are limitations with threading, using unicode is awkward, and distributing on Windows can be difficult. And there are also many redundant or poorly designed builtin libraries. Some of these issues are being addressed in Python 3, some not.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>The SiteScraper library</title>
   <link href="http://sitescraper.net/blog/The-SiteScraper-library/"/>
   <updated>2010-01-29T00:00:00+11:00</updated>
   <id>http://sitescraper.net/blog/The-SiteScraper-library</id>
   <content type="html">&lt;p&gt;As a student I was fortunate to have the opportunity to learn about web scraping, guided by &lt;a href=&quot;http://ww2.cs.mu.oz.au/~tim/&quot;&gt;Professor Timothy Baldwin&lt;/a&gt;. I aimed to build a tool to make scraping web pages easier, resulting from frustration with a previous project.&lt;/p&gt;

&lt;p&gt;My idea for this tool was that it should be possible to train a program to scrape a website by just giving the desired outputs for some example webpages. The program would build a model of how to extract this content and then this model could be applied to scrape other webpages that used the same template.&lt;/p&gt;

&lt;p&gt;The tool was eventually called &lt;em&gt;SiteScraper&lt;/em&gt; and is available for download on &lt;a href=&quot;http://code.google.com/p/sitescraper/&quot;&gt;Google Code&lt;/a&gt;. For more information have a browse of &lt;a href=&quot;http://sitescraper.googlecode.com/files/sitescraper.pdf&quot;&gt;this paper&lt;/a&gt;, which covers the implementation and results in detail.&lt;/p&gt;

&lt;p&gt;I use SiteScraper for much of my scraping work and often make updates based on experience gained from a project.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Web scraping with regular expressions</title>
   <link href="http://sitescraper.net/blog/Web-scraping-with-regular-expressions/"/>
   <updated>2010-01-20T00:00:00+11:00</updated>
   <id>http://sitescraper.net/blog/Web-scraping-with-regular-expressions</id>
   <content type="html">&lt;p&gt;Using regular expressions for web scraping is &lt;a href=&quot;http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454&quot;&gt;sometimes criticized&lt;/a&gt;, but I believe they still have their place, particularly for one-off scrapes. Let's say I want to extract the title of a particular webpage - here is an implementation using BeautifulSoup, lxml, and regular expressions:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;re&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;time&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;urllib2&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;BeautifulSoup&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BeautifulSoup&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;lxml&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;html&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lxmlhtml&lt;/span&gt;


&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;timeit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;t1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;fn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;t2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;time&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;%s&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; took &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;%0.3f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt; ms&amp;#39;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fn&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;func_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;t2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;t1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;1000.0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    
    
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;bs_test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;html&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;soup&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BeautifulSoup&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;html&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;soup&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;html&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;head&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;title&lt;/span&gt;
    
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;lxml_test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;html&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;tree&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lxmlhtml&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fromstring&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;html&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tree&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xpath&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;//title&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;text_content&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;regex_test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;html&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;re&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;findall&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;&amp;lt;title&amp;gt;(.*?)&amp;lt;/title&amp;gt;&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;html&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    
    
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;__name__&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;__main__&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;html&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;urllib2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;urlopen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;http://sitescraper.net/blog/Web-scraping-with-regular-expressions/&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fn&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bs_test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lxml_test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;regex_test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;timeit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;html&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;The results are:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;regex_test took 40.032 ms
lxml_test took 1863.463 ms
bs_test took 54206.303 ms
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;That means for this use case lxml takes roughly 47x longer than regular expressions and BeautifulSoup over 1300x! This is because lxml and BeautifulSoup parse the entire document into their internal format, when only the title is required.&lt;/p&gt;

&lt;p&gt;XPaths are very useful for most web scraping tasks, but there still is a use case for regular expressions.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>How to use XPaths robustly</title>
   <link href="http://sitescraper.net/blog/How-to-use-XPaths-robustly/"/>
   <updated>2010-01-05T00:00:00+11:00</updated>
   <id>http://sitescraper.net/blog/How-to-use-XPaths-robustly</id>
   <content type="html">&lt;p&gt;In an earlier post I referred to XPaths but did not explain how to use them.&lt;/p&gt;

&lt;p&gt;Say we have the following HTML document:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;html&amp;gt;  
&amp;lt;body&amp;gt;  
&amp;lt;div&amp;gt;&amp;lt;/div&amp;gt;  
&amp;lt;div id=&quot;content&quot;&amp;gt;  
&amp;lt;ul&amp;gt;  
&amp;lt;li&amp;gt;First item&amp;lt;/li&amp;gt;  
&amp;lt;li&amp;gt;Second item&amp;lt;/li&amp;gt;  
&amp;lt;/ul&amp;gt;  
&amp;lt;/div&amp;gt;  
&amp;lt;/body&amp;gt;  
&amp;lt;/html&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To access the list elements we follow the HTML structure from the root tag down to the li's:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;html &amp;gt; body &amp;gt; 2nd div &amp;gt; ul &amp;gt; many li's.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;An XPath to represent this traversal is:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;/html[1]/body[1]/div[2]/ul[1]/li
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;If a tag has no index then every tag of that type will be selected:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;/html/body/div/ul/li
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;XPaths can also use attributes to select nodes:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;/html/body/div[@id=&quot;content&quot;]/ul/li 
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And instead of using an absolute XPath from the root, the XPath can be made relative to a particular node by starting with a double slash, which matches that node wherever it appears in the document:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;//div[@id=&quot;content&quot;]/ul/li
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This is more reliable than an absolute XPath because it can still locate the correct content after the surrounding structure is changed.&lt;/p&gt;

&lt;p&gt;There are other features in the &lt;a href=&quot;http://www.w3.org/TR/xpath/&quot;&gt;XPath standard&lt;/a&gt; but the above are all I use regularly.&lt;/p&gt;

&lt;p&gt;A handy way to find the XPath of a tag is with Firefox's &lt;a href=&quot;http://getfirebug.com/&quot;&gt;Firebug extension&lt;/a&gt;. To do this open the HTML tab in Firebug, right click the element you are interested in, and select &quot;Copy XPath&quot;. (Alternatively use the &quot;Inspect&quot; button to select the tag.)&lt;br/&gt;
This will give you an XPath with indices only where there are multiple tags of the same type, such as:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;/html/body/div[2]/ul/li
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;One thing to keep in mind is Firefox will always create a tbody tag within tables whether it existed in the original HTML or not. This has tripped me up a few times!&lt;/p&gt;

&lt;p&gt;For one-off scrapes the above XPath should be fine. But for long term repeat scrapes it is better to use a relative XPath around an ID element with attributes instead of indices. From my experience such an XPath is more likely to survive minor modifications to the layout. However for a more robust solution see my &lt;a href=&quot;http://code.google.com/p/sitescraper/&quot;&gt;SiteScraper library&lt;/a&gt;, which I will introduce in a later post.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Parsing HTML with Python</title>
   <link href="http://sitescraper.net/blog/Parsing-HTML-with-Python/"/>
   <updated>2010-01-02T00:00:00+11:00</updated>
   <id>http://sitescraper.net/blog/Parsing-HTML-with-Python</id>
   <content type="html">&lt;p&gt;HTML is a tree structure: at the root is a &amp;lt;html&amp;gt; tag followed by the &amp;lt;head&amp;gt; and &amp;lt;body&amp;gt; tags and then more tags before the content itself. However when a webpage is downloaded all one gets is a series of characters. Working directly with that text is fine when using regular expressions, but often we want to traverse the webpage content, which requires parsing the tree structure.&lt;/p&gt;

&lt;p&gt;Unfortunately the HTML of many webpages around the internet is invalid - for example a list may be missing closing tags:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;&amp;lt;ul&amp;gt;  
&amp;lt;li&amp;gt;abc
&amp;lt;li&amp;gt;def  
&amp;lt;li&amp;gt;ghi
&amp;lt;/ul&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;but it still needs to be interpreted as a proper list:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;abc&lt;/li&gt;
&lt;li&gt;def&lt;/li&gt;
&lt;li&gt;ghi&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;This means we can't naively parse HTML by assuming a tag ends when we find the next closing tag. Instead it is best to use one of the many HTML parsing libraries available, such as &lt;a href=&quot;http://www.crummy.com/software/BeautifulSoup/&quot;&gt;BeautifulSoup&lt;/a&gt;, &lt;a href=&quot;http://codespeak.net/lxml/&quot;&gt;lxml&lt;/a&gt;, &lt;a href=&quot;http://code.google.com/p/html5lib/&quot;&gt;html5lib&lt;/a&gt;, and &lt;a href=&quot;http://www.boddie.org.uk/python/libxml2dom.html&quot;&gt;libxml2dom&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Seemingly the most well known and used such library is &lt;a href=&quot;http://www.crummy.com/software/BeautifulSoup/&quot;&gt;BeautifulSoup&lt;/a&gt;. A Google search for &lt;a href=&quot;http://www.google.com/search?q=python+web+scraping+module&quot;&gt;Python web scraping module&lt;/a&gt; currently returns BeautifulSoup as the first result.&lt;br/&gt;
However I instead use lxml because I find it more robust when parsing bad HTML. Additionally Ian Bicking found &lt;a href=&quot;http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/&quot;&gt;lxml more efficient&lt;/a&gt; than the other parsing libraries, though my priority is accuracy over speed.&lt;/p&gt;

&lt;p&gt;You will need to use version 2 onwards of lxml, which includes the &lt;em&gt;&lt;strong&gt;html&lt;/strong&gt;&lt;/em&gt; module. On Ubuntu releases up to 8.10, which came with an earlier version, this meant compiling lxml from source.&lt;/p&gt;

&lt;p&gt;Here is an example of how to parse the previous broken HTML with lxml:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;python&quot;&gt;    &lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;lxml&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;html&lt;/span&gt;  
    &lt;span class=&quot;n&quot;&gt;tree&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;html&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fromstring&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;&amp;lt;ul&amp;gt;&amp;lt;li&amp;gt;abc&amp;lt;/li&amp;gt;&amp;lt;li&amp;gt;def&amp;lt;li&amp;gt;ghi&amp;lt;/li&amp;gt;&amp;lt;/ul&amp;gt;&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  
    &lt;span class=&quot;n&quot;&gt;tree&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;xpath&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;#39;//ul/li&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  
    &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Element&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;li&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;at&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;959553&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Element&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;li&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;at&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;95952&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fc&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Element&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;li&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;at&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;959544&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;



</content>
 </entry>
 
 <entry>
   <title>Typical web scraping job</title>
   <link href="http://sitescraper.net/blog/Typical-web-scraping-job/"/>
   <updated>2009-12-30T00:00:00+11:00</updated>
   <id>http://sitescraper.net/blog/Typical-web-scraping-job</id>
   <content type="html">&lt;p&gt;In this post I will clarify what I do by walking through a simple web scraping job I worked on.&lt;/p&gt;

&lt;p&gt;A few months back a client asked me for a quote to get demographic data for every county and city in the US. I first checked around for an existing data set but did not find one, so I would need to scrape it from the &lt;a href=&quot;http://quickfacts.census.gov/qfd/&quot;&gt;official census website&lt;/a&gt;. I spent some time getting to know this website and found it followed a simple hierarchy, with navigation performed through selecting options from select boxes:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://quickfacts.census.gov/qfd/&quot;&gt;Overview page&lt;/a&gt; / &lt;a href=&quot;http://quickfacts.census.gov/qfd/states/01000.html&quot;&gt;state pages&lt;/a&gt; / &lt;a href=&quot;http://quickfacts.census.gov/qfd/states/01/01001.html&quot;&gt;county pages&lt;/a&gt; | &lt;a href=&quot;http://quickfacts.census.gov/qfd/states/01/0103076.html&quot;&gt;city pages&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I viewed the source of these webpages and found the content I was after embedded, which meant the content was defined statically rather than being loaded dynamically with JavaScript or AJAX. This would make scraping it more straightforward.&lt;/p&gt;

&lt;p&gt;I emailed the client that the census website was small sized, easily navigable, and I would be able to provide a CSV file of the data within 3 days. I would be willing to do this for US $200 with half deposited beforehand (by PayPal) and the remainder after they were satisfied with the results. The client was satisfied with this arrangement, so it was time to get started.&lt;/p&gt;

&lt;p&gt;The first step was to get all the state page URLs from the select box. I could hardcode these URLs but I don't like grunt work, so I constructed a regular expression to extract them automatically.&lt;br/&gt;
This expression can also be used to extract all the county and city URLs from their respective select boxes, so now I have access to all the required URLs.&lt;br/&gt;
(Note that using regular expressions is generally a bad approach to web scraping, which I will expand on in a future post.)&lt;/p&gt;

&lt;p&gt;There are many factors involved in crawling that I will leave to future posts and will now jump ahead to after I have downloaded the HTML and am ready to scrape data from it.&lt;/p&gt;

&lt;p&gt;If you have a look at a sample &lt;a href=&quot;http://quickfacts.census.gov/qfd/states/01/01001.html&quot;&gt;census page&lt;/a&gt;, you will see that there is a lot of data. It would be too cumbersome to craft regular expressions that extracted each of these fields in a structured way so I used &lt;a href=&quot;http://en.wikipedia.org/wiki/XPath&quot;&gt;XPaths&lt;/a&gt; instead. (I will also leave a detailed coverage of XPaths for a future post, but essentially it is a convenient method for selecting HTML nodes.)&lt;/p&gt;

&lt;p&gt;Now I am on the home stretch. I combine these various parts together into a single script that iterates the HTML pages, extracts the content with XPath, and writes out the result to a CSV file.&lt;/p&gt;
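&lt;p&gt;The final write-out step can be sketched with Python 3's &lt;em&gt;csv&lt;/em&gt; module. This is not the script I delivered: the &lt;em&gt;write_rows&lt;/em&gt; helper, the field names, and the two records are placeholders standing in for whatever the XPaths extract from each page:&lt;/p&gt;

```python
import csv
import io


def write_rows(rows, fieldnames, output):
    """Write one CSV line per scraped page, with a header row first."""
    writer = csv.DictWriter(output, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Placeholder records standing in for values extracted with XPath.
rows = [
    {"name": "Example County", "population": "12345"},
    {"name": "Example City", "population": "678"},
]
buffer = io.StringIO()
write_rows(rows, ["name", "population"], buffer)
print(buffer.getvalue())
```

&lt;p&gt;In a real run the output would be a file opened with &lt;em&gt;newline=''&lt;/em&gt; rather than an in-memory buffer, so the csv module controls the line endings itself.&lt;/p&gt;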

&lt;p&gt;QED&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>What is web scraping?</title>
   <link href="http://sitescraper.net/blog/What-is-web-scraping/"/>
   <updated>2009-12-20T00:00:00+11:00</updated>
   <id>http://sitescraper.net/blog/What-is-web-scraping</id>
   <content type="html">&lt;p&gt;The internet contains a huge amount of useful data but most is not easily accessible. For example let's suppose you run a business and want your prices to match a competitor. The competitor's website contains all the product details but there are too many products to track manually - to be useful you need this data extracted automatically into an Excel spreadsheet. The process is known as &lt;em&gt;web scraping&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Here are other common use cases for web scraping:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Collect business contact details (eg from &lt;a href=&quot;http://maps.google.com&quot;&gt;Google Places&lt;/a&gt; or &lt;a href=&quot;http://yellowpages.com&quot;&gt;Yellow Pages&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Collect reviews (eg from &lt;a href=&quot;http://yelp.com&quot;&gt;Yelp&lt;/a&gt; or &lt;a href=&quot;http://amazon.com&quot;&gt;Amazon&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;Track what people are saying about your product (eg from &lt;a href=&quot;http://twitter.com&quot;&gt;Twitter&lt;/a&gt; or &lt;a href=&quot;http://facebook.com&quot;&gt;Facebook&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;If this sounds interesting then feel welcome to &lt;a href=&quot;/contact&quot;&gt;contact me&lt;/a&gt; to discuss further.&lt;/p&gt;
</content>
 </entry>
 
 
</feed>

