Scrape the web with Ruby

Introduction

In the last few months I have taken some time to play with a number of dynamic languages. My experiments were mostly in the “web hacks” category, e.g. fetching files from the web and extracting data of interest from them. For my most recent hack (fetching wordpress.com weblog statistics) I used Ruby.

The task at hand

The task at hand consists of fetching the weblog statistics for my wordpress.com weblog and displaying them in the terminal window.
This includes handling possible redirections to the wordpress.com login page, parsing the HTML page obtained, and extracting the various weblog statistics.

The tools used

After briefly surveying the tools and libraries available in Ruby-land I settled on WWW::Mechanize, a Ruby implementation of Perl's venerable WWW-Mechanize CPAN module.

Under the hood WWW::Mechanize uses the Hpricot HTML parser.
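
As a quick taste of how the two fit together, here is a minimal sketch, not part of the tool itself. It assumes the pre-1.0 WWW::Mechanize API (WWW::Mechanize namespace, as used in the code below); the URL is just an example:

#!/usr/bin/env ruby
require 'rubygems'
require 'mechanize'
require 'hpricot'

agent = WWW::Mechanize.new
agent.user_agent_alias = 'Mac Safari'
# fetch a page; Mechanize parses the HTML for us behind the scenes
page = agent.get('http://example.com/')
puts page.title
# the raw body can also be handed to Hpricot directly
doc = Hpricot(page.body)
doc.search('//a').each { |link| puts link['href'] }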

The approach

The wordpress.com weblog statistics pages have a URL with the following structure (where <user> is your wordpress.com user name):

    http://<user>.wordpress.com/wp-admin/index.php?page=stats

They contain the following statistics for today and yesterday respectively:

  • Referrers: people clicked links from these pages to get to your weblog
  • Top posts: these posts on your weblog got the most traffic
  • Search engine terms: these are terms people used to find your weblog
  • Clicks: your visitors clicked these links on your weblog

Each of the above is structured as follows:

<div class="statsdiv">
<h3><a href="7-day page URL">statistics type</a></h3>
<p>..explanatory text..</p>
<h4>Today</h4>
  <table class="statsDay">
    <tr><th>..</th><th class="views">..</th></tr>
    <tr>
      <td class="label">URL or term</td>
      <td class="views">number of views</td>
    </tr>
    ...
  </table>
<h4>Yesterday</h4>
  <table class="statsDay">
    <tr><th>..</th><th class="views">..</th></tr>
    <tr>
      <td class="label">URL or term</td>
      <td class="views">number of views</td>
    </tr>
    ...
  </table>
</div>

The Ruby code below first finds the <div class="statsdiv"> sub-trees and then extracts today’s data from them.
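
Before the full listing, here is a minimal, self-contained sketch of just that selector logic, run against a stripped-down, made-up fragment of the structure shown above; the same Hpricot queries appear in the full listing:

require 'rubygems'
require 'hpricot'

# a made-up fragment mirroring the statsdiv structure above
html = <<EOS
<div class="statsdiv">
<h3><a href="#">Referrers</a></h3>
<h4>Today</h4>
<table class="statsDay">
<tr><td class="label">example.org/some-page</td><td class="views">16</td></tr>
</table>
</div>
EOS

doc = Hpricot(html)
doc.search("//div[@class='statsdiv']").each do |div|
    # the statistics type sits in the h3 link text
    puts div.search("h3/a/text()")
    # walk the rows of the first (i.e. today's) table
    div.search("table").first.search("tr").each do |tr|
        label = tr.search("td[@class='label']").inner_text.strip
        views = tr.search("td[@class='views']").inner_text.strip
        puts "#{label} -- #{views}" unless label.empty?
    end
end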

The code

#!/usr/bin/env ruby

require 'rubygems'
require 'mechanize'
require 'hpricot'   # used directly below via Hpricot()

HELP_STRING = <<EOS

Tool for fetching wordpress.com weblog statistics. Usage:

    wls.rb [user] [pwd]

where 'user' is your wordpress.com user name and 'pwd' is your
password respectively.

EOS

unless ARGV.grep(/-h|--help/).empty?
    puts HELP_STRING
    exit(0)
end

# try to access the weblog statistics page
user = 'muharem'
password = nil  # set your password here if you dislike being prompted for it

user = ARGV[0] if ARGV[0]
password = ARGV[1] if ARGV[1]

stats_url = "http://#{user}.wordpress.com/wp-admin/index.php?page=stats"

# instantiate/initialise the web agent ..
agent = WWW::Mechanize.new
agent.user_agent_alias = 'Mac Safari'
# .. and get the weblog statistics page
page = agent.get(stats_url)

# did we get redirected to the login form?
if page.title.strip.split[-1] == 'Login'
    # yes, fill it in and submit it
    loginf = page.form('loginform')
    loginf.log = user
    unless password
        print "Enter your wordpress.com password: "
        password = $stdin.gets.chomp
    end
    loginf.pwd = password
    agent.submit(loginf, loginf.buttons.first)
end

# now fetch the actual weblog statistics page (as raw HTML this time) ..
page = agent.get_file(stats_url)
# .. and parse it!
doc = Hpricot(page)

# search for the div elements that contain the statistics data
stats_divs = doc.search("//div[@class='statsdiv']")
stats_divs.each do |div|
    heading = div.search("h3/a/text()")
    # we are only interested in the statistics for today
    day = div.search("h4/text()").first
    if heading and day
        heading = "==== #{heading} (#{day.inner_text.downcase}) ====".center(50)
        puts "\n#{heading}\n"
        # find the table with today's statistics data
        tab = div.search("table").first
        if tab
            # extract the statistics data from the <tr> elements
            tab.search("tr").each do |tr|
                what = tr.search("td[@class='label']")
                views = tr.search("td[@class='views']")
                whats = what.inner_text.strip
                unless whats.empty?
                    views = views.inner_text.strip
                    printf("%s -- %5s\n", whats.center(45), views)
                end
            end
        end
    end
end

# grab the div with the general (weblog level) statistics data
gbdiv = doc.search("//div[@id='generalblog']")
# find the <p> element with the number of views today
vtoday = gbdiv.search("p").find { |p| p.inner_text.index('Views today') }
if vtoday
    printf("\n%s\n\n", "=> #{vtoday.inner_text.strip} <=".center(45))
else
    puts "\n\n!! No weblog statistics data found."
    puts "   Did you enter a wrong user name and/or password?"
end

Example output

           ==== Referrers (today) ====
   stumbleupon.com/refer.php?url=http%3A…     --    16
   stumbleupon.com/refer.php?url=http%3A…     --     3
   planeterlang.org/story.php?title=Erla…     --     2
   linuxquestions.org/questions/showthre…     --     2
       del.icio.us/jdkimball/stackless        --     1
   rodenas.org/blog/2007/08/27/erlang-ri…     --     1
   intertwingly.net/blog/2007/08/14/Long…     --     1
             ozone.wordpress.com              --     1
   programming.reddit.com/search?q=erlan…     --     1

           ==== Top Posts (today) ====
          Processing XML in Erlang            --    22
  Erlang vs. Stackless python: a first ben    --    18
  Python: file find, grep and in-line repl    --     4
  Python decorator mini-study (part 1 of 3    --     2
  Code refactoring with python's functoo      --     2
  Python: find files using Unix shell-styl    --     2
  Determine order of execution by (re-)seq    --     2
           A first look at Groovy             --     1
  Python decorator mini-study (part 2 of 3    --     1
   Turn on line numbers while searching in    --     1

      ==== Search Engine Terms (today) ====
              erlang benchmark                --     3
             stackless vs erlang              --     2
         python decorators argument           --     2
      source code of an execution path        --     2
          python functools partial            --     2
                 python grep                  --     2
         python string replace 2.4.4          --     1
              erlang vs C speed               --     1
    erlang command line arguments getopt      --     1
   Python + parsing command line arguments    --     1

             ==== Clicks (today) ====
        hpcwire.com/hpc/1295541.html          --     2
   pragmaticprogrammer.com/titles/jaerla…     --     1
   hrnjad.net/src/6/scriptutil.py.html#f…     --     1

            => Views today: 68 <=

Conclusion

I am a total Ruby beginner but have a lot of experience with Python and Perl. It took approximately two hours (and frequent look-ups in the Pickaxe book) to write the tool above, and I enjoyed it 🙂

Being the kind of person who stays away from all things over-hyped, I ignored Ruby for the last two years or so, but I have to say it's a cool language after all.

Click here to download the code.

9 thoughts on “Scrape the web with Ruby”

  1. Pingback: Scraping Dynamic Websites Using JRuby and HtmlUnit | Innovation On The Run

  2. You can check out scRUBYt! as well, which is a Ruby web-scraping framework based on Mechanize and Hpricot (and, most recently, FireWatir). It has its quirks, but if you use it well it can accomplish the above task with a considerably shorter DSL script.

  3. Hi muharem!
    I encourage you to take a look at Scrapy [http://scrapy.org], a Python framework for screen scraping. It does the job very well. Scrapy is being used for big screen-scraping projects and is reaching the stage of production use.

    I hope you like it.

    Kind regards,
    Andres

  4. Pingback: No Friday » Blog Archive » Garagiste Invoice Fetcher
