Introduction
In the last few months I have taken some time to play with a number of dynamic languages. My experiments were mostly in the “web hacks” category, e.g. fetching files from the web and extracting data of interest from them. For my most recent hack (getting WordPress weblog statistics) I used Ruby.
The task at hand
The task at hand consists of fetching the statistics for my WordPress weblog and displaying them in the terminal window.
This includes handling possible redirections to the wordpress.com login page, parsing the HTML page that comes back and extracting the various weblog statistics.
The tools used
After briefly surveying the tools and libraries available in Ruby-land I settled on WWW::Mechanize, a Ruby implementation of Perl’s venerable WWW-Mechanize CPAN module.
Under the hood, WWW::Mechanize uses the Hpricot HTML parser.
The approach
The wordpress.com weblog statistics pages have a URL with the following structure:
    http://#{user}.wordpress.com/wp-admin/index.php?page=stats
They contain the following statistics for today and yesterday respectively:
- Referrers: people clicked links from these pages to get to your weblog
- Top posts: these posts on your weblog got the most traffic
- Search engine terms: these are terms people used to find your weblog
- Clicks: your visitors clicked these links on your weblog
Each of the above is structured as follows:
    <div class="statsdiv">
      <h3><a href="7-day page URL">statistics type</a></h3>
      <p>..explanatory text..</p>
      <h4>Today</h4>
      <table class="statsDay">
        <tr><th>..</th><th class="views">..</th></tr>
        <tr>
          <td class="label">URL or term</td>
          <td class="views">number of views</td>
        </tr>
        ...
      </table>
      <h4>Yesterday</h4>
      <table class="statsDay">
        <tr><th>..</th><th class="views">..</th></tr>
        <tr>
          <td class="label">URL or term</td>
          <td class="views">number of views</td>
        </tr>
        ...
      </table>
    </div>
The Ruby code below first finds the <div class="statsdiv"> sub-trees and then extracts today’s data from them.
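Before the full script, those two steps (find the statistics divs, then read today's table) can be tried offline. The following is only a minimal sketch: it swaps Hpricot for REXML, which ships with Ruby, purely so that the example needs no gems. The sample markup, the example.org labels and the todays_stats name are made up for illustration. REXML needs well-formed XML, so unlike Hpricot it would probably choke on real-world HTML.

```ruby
require 'rexml/document'

# Made-up sample that follows the structure described above.
SAMPLE = <<~HTML
  <body>
    <div class="statsdiv">
      <h3><a href="#">Referrers</a></h3>
      <p>explanatory text</p>
      <h4>Today</h4>
      <table class="statsDay">
        <tr><th>Referrer</th><th class="views">Views</th></tr>
        <tr><td class="label">example.org/a</td><td class="views">16</td></tr>
        <tr><td class="label">example.org/b</td><td class="views">3</td></tr>
      </table>
      <h4>Yesterday</h4>
      <table class="statsDay">
        <tr><th>Referrer</th><th class="views">Views</th></tr>
        <tr><td class="label">example.org/c</td><td class="views">7</td></tr>
      </table>
    </div>
  </body>
HTML

# Hypothetical helper: maps each statistics heading to today's rows,
# where every row is a [label, views] pair.
def todays_stats(html)
  doc = REXML::Document.new(html)
  result = {}
  # step 1: find the div sub-trees that hold the statistics
  REXML::XPath.each(doc, "//div[@class='statsdiv']") do |div|
    heading = REXML::XPath.first(div, 'h3/a')
    # step 2: the first table in each div holds today's data
    table = REXML::XPath.first(div, 'table')
    next unless heading && table

    rows = []
    REXML::XPath.each(table, 'tr') do |tr|
      label = REXML::XPath.first(tr, "td[@class='label']")
      views = REXML::XPath.first(tr, "td[@class='views']")
      next unless label && views # the header row only has <th> cells

      rows << [label.text.strip, views.text.strip.to_i]
    end
    result[heading.text.strip] = rows
  end
  result
end

todays_stats(SAMPLE).each do |heading, rows|
  puts heading
  rows.each { |label, views| puts "  #{label}: #{views}" }
end
```

Yesterday's table is ignored simply because only the first table of each div is read, which is the same shortcut the full script takes.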
The code
    #!/usr/bin/env ruby

    require 'rubygems'
    require 'mechanize'  # also pulls in the Hpricot parser used below

    HELP_STRING =<<EOS

    Tool for fetching wordpress.com weblog statistics. Usage:

        wls.rb [user] [pwd]

    where 'user' is your wordpress user name and 'pwd' is your
    password respectively.

    EOS

    # only treat an argument as a help request if it is exactly -h or --help
    if not ARGV.grep(/\A(-h|--help)\z/).empty?
      puts HELP_STRING
      exit(0)
    end

    # try to access the weblog statistics page
    user = 'muharem'
    password = nil # set your password here if you dislike being prompted for it

    if ARGV[0]
      user = ARGV[0]
    end
    if ARGV[1]
      password = ARGV[1]
    end

    stats_url = "http://#{user}.wordpress.com/wp-admin/index.php?page=stats"

    # instantiate/initialise web agent ..
    agent = WWW::Mechanize.new
    agent.user_agent_alias = 'Mac Safari'
    # .. and get the weblog statistics page
    page = agent.get(stats_url)

    # did we get back the login form?
    if (page.title.strip.split[-1] == 'Login')
      # yes, fill it in and submit it
      loginf = page.form('loginform')
      loginf.log = user
      if not password
        print "Enter your wordpress.com password: "
        password = $stdin.gets.chomp
      end
      loginf.pwd = password
      agent.submit(loginf, loginf.buttons.first)
    end

    # now get the actual weblog statistics page
    page = agent.get_file(stats_url)
    # parse it!
    doc = Hpricot(page)

    # search for the div elements that contain the statistics data
    stats_divs = doc.search("//div[@class='statsdiv']")
    stats_divs.each do |div|
      heading = div.search("h3/a/text()")
      # we are only interested in the statistics for today
      day = div.search("h4/text()").first
      if (heading and day)
        heading = "==== #{heading} (#{day.inner_text.downcase}) ====".center(50)
        puts "\n#{heading}\n"
        # find the table with today's statistics data
        tab = div.search("table").first
        if tab
          # extract the statistics data from the <tr> elements
          tab.search("tr").each do |tr|
            what = tr.search("td[@class='label']")
            views = tr.search("td[@class='views']")
            whats = what.inner_text.strip()
            # header rows have no label cell, skip them
            if not whats.empty?
              views = views.inner_text.strip()
              printf("%s -- %5s\n", whats.center(45), views)
            end
          end
        end
      end
    end
    # grab the div with the general (weblog level) statistics data
    gbdiv = doc.search("//div[@id='generalblog']")
    # find the <p> element with the number of views today
    vtoday = gbdiv.search("p").find { |p| p.inner_text.index('Views today') }
    if vtoday
      printf("\n%s\n\n", "=> #{vtoday.inner_text.strip} <=".center(45))
    else
      puts "\n\n!! No weblog statistics data found."
      puts "   Did you enter a wrong user name and/or password?"
    end
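The login handling in the script comes down to two small decisions: does the page title end in "Login", and is a password already known or must it be prompted for? Here is a minimal sketch of just those two decisions, with no web access involved. The helper names login_page? and resolve_password, and the title strings used below, are hypothetical and only serve the illustration.

```ruby
require 'stringio'

# True when the last word of the page title is "Login",
# mirroring the check the script applies to page.title.
def login_page?(title)
  title.to_s.strip.split.last == 'Login'
end

# Returns the given password, or prompts for one when none was supplied.
# input/output are parameters so the prompt can be exercised without a terminal.
def resolve_password(given, input: $stdin, output: $stdout)
  return given if given

  output.print 'Enter your wordpress.com password: '
  input.gets.to_s.chomp
end

# Assumed example titles, not taken from wordpress.com
puts login_page?('WordPress.com - Login')
puts login_page?('Blog Stats')

# Simulate a user typing a password at the prompt
typed = StringIO.new("s3cret\n")
puts resolve_password(nil, input: typed, output: StringIO.new)
```

Checking the title is a cheap heuristic; it would stop working if the login page title ever changed, so treat it as a convenience rather than a robust test.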
Example output
    ==== Referrers (today) ====
    stumbleupon.com/refer.php?url=http%3A? -- 16
    stumbleupon.com/refer.php?url=http%3A? -- 3
    planeterlang.org/story.php?title=Erla? -- 2
    linuxquestions.org/questions/showthre? -- 2
    del.icio.us/jdkimball/stackless -- 1
    rodenas.org/blog/2007/08/27/erlang-ri? -- 1
    intertwingly.net/blog/2007/08/14/Long? -- 1
    ozone.wordpress.com -- 1
    programming.reddit.com/search?q=erlan? -- 1

    ==== Top Posts (today) ====
    Processing XML in Erlang -- 22
    Erlang vs. Stackless python: a first ben -- 18
    Python: file find, grep and in-line repl -- 4
    Python decorator mini-study (part 1 of 3 -- 2
    Code refactoring with python's functoo -- 2
    Python: find files using Unix shell-styl -- 2
    Determine order of execution by (re-)seq -- 2
    A first look at Groovy -- 1
    Python decorator mini-study (part 2 of 3 -- 1
    Turn on line numbers while searching in -- 1

    ==== Search Engine Terms (today) ====
    erlang benchmark -- 3
    stackless vs erlang -- 2
    python decorators argument -- 2
    source code of an execution path -- 2
    python functools partial -- 2
    python grep -- 2
    python string replace 2.4.4 -- 1
    erlang vs C speed -- 1
    erlang command line arguments getopt -- 1
    Python + parsing command line arguments -- 1

    ==== Clicks (today) ====
    hpcwire.com/hpc/1295541.html -- 2
    pragmaticprogrammer.com/titles/jaerla? -- 1
    hrnjad.net/src/6/scriptutil.py.html#f? -- 1

    => Views today: 68 <=
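The script lays out each row with String#center and a printf-style format: the label is centred in 45 columns and the view count is right-aligned in 5. A minimal sketch of that formatting on its own follows. The helper names heading_line and stats_row and the two constants are hypothetical; the widths are the ones the script uses.

```ruby
LABEL_WIDTH = 45 # columns reserved for the URL or term
VIEWS_WIDTH = 5  # columns reserved for the view count

# Section heading, centred in 50 columns like in the script.
def heading_line(title, day)
  "==== #{title} (#{day.downcase}) ====".center(50)
end

# One statistics row: centred label, separator, right-aligned count.
def stats_row(label, views)
  format("%s -- %#{VIEWS_WIDTH}s", label.center(LABEL_WIDTH), views)
end

puts heading_line('Top Posts', 'Today')
puts stats_row('Processing XML in Erlang', 22)
puts stats_row('A first look at Groovy', 1)
```

Note that String#center never truncates: a label longer than 45 characters is returned unchanged and pushes the count to the right, so the columns only line up for labels that fit.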
Conclusion
I am a total Ruby beginner but have a lot of experience with Python and Perl. It took approximately 2 hours (and frequent look-ups in the pickaxe book) to write the tool above and I enjoyed it 🙂
Being the kind of person that stays away from all things over-hyped, I ignored Ruby for the last two years or so, but I have to say it’s a cool language after all.
You can check out scRUBYt! as well, which is a Ruby web-scraping framework based on Mechanize and Hpricot (and FireWatir, most recently). It has its quirks, but if you can use it well, it can accomplish the above task with a considerably shorter DSL script.
@Peter: thanks for the pointer! I’ll have a look.
Hello sir,
could you please port this code to PHP?
Thank you so much.
very cool!
I enjoy coding ruby too =)
Hi muharem!
I encourage you to take a look at Scrapy [http://scrapy.org], a Python framework for screen scraping. It does the job very well. Actually, Scrapy is being used for big screen scraping projects and is reaching the stage of production use.
I hope you like it.
Kind regards,
Andres
Love your blog, post more often. Thanks!
Good code snippet and tips. I’m a beginner too.