Scrape the web with Ruby

Introduction

In the last few months I have taken some time to play with a number of dynamic languages. My experiments were mostly in the “web hacks” category, e.g. fetching files from the web and extracting data of interest from them. For my most recent hack (getting WordPress weblog statistics) I used Ruby.

The task at hand

The task at hand consists of fetching the weblog statistics for my WordPress weblog and displaying them in the terminal window.
This includes handling a possible redirection to the wordpress.com login page, parsing the HTML page obtained and extracting the various weblog statistics.

The tools used

After briefly surveying the tools and libraries available in Ruby-land I settled on WWW::Mechanize, a Ruby implementation of Perl’s venerable WWW-Mechanize CPAN module.

Under the hood WWW::Mechanize uses the Hpricot HTML parser.

The approach

The wordpress.com weblog statistics pages have URLs with the following structure:


http://#{user}.wordpress.com/wp-admin/index.php?page=stats

They contain the following statistics for today and yesterday respectively:

  • Referrers: people clicked links from these pages to get to your weblog
  • Top posts: these posts on your weblog got the most traffic
  • Search engine terms: these are terms people used to find your weblog
  • Clicks: your visitors clicked these links on your weblog

Each of the above is structured as follows:

<div class="statsdiv">
<h3><a href="7-day page URL">statistics type</a></h3>
<p>..explanatory text..</p>
<h4>Today</h4>
  <table class="statsDay">
    <tr><th>..</th><th class="views">..</th></tr>
    <tr>
      <td class="label">URL or term</td>
      <td class="views">number of views</td>
    </tr>
    ...
  </table>
<h4>Yesterday</h4>
  <table class="statsDay">
    <tr><th>..</th><th class="views">..</th></tr>
    <tr>
      <td class="label">URL or term</td>
      <td class="views">number of views</td>
    </tr>
    ...
  </table>
</div>

The Ruby code below first finds the <div class="statsdiv"> sub-trees and then extracts today’s data from them.

The code

#!/usr/bin/env ruby

require 'rubygems'
require 'mechanize'

HELP_STRING =<<EOS

Tool for fetching wordpress.com weblog statistics. Usage:

    wls.rb [username] [pwd]

where 'user' is your wordpress user name and 'pwd' is your
password respectively.

EOS

if not ARGV.grep(/-h|--help/).empty?
    puts HELP_STRING
    exit(0)
end

# try to access the weblog statistics page
user = 'muharem'
password = nil  # set your password here if you dislike being prompted for it

if ARGV[0]
    user = ARGV[0]
end
if ARGV[1]
    password = ARGV[1]
end

stats_url = "http://#{user}.wordpress.com/wp-admin/index.php?page=stats"

# instantiate/initialise web agent ..
agent = WWW::Mechanize.new
agent.user_agent_alias = 'Mac Safari'
# .. and get the weblog statistics page
page = agent.get(stats_url)

# did we get back the login form?
if (page.title.strip.split[-1] == 'Login')
    # yes, fill it in and submit it
    loginf = page.form('loginform')
    loginf.log = user
    if not password
        print "Enter your wordpress.com password: "
        password = $stdin.gets.chomp
    end
    loginf.pwd = password
    agent.submit(loginf, loginf.buttons.first)
end

# now get the actual weblog statistics page
page = agent.get_file(stats_url)
# parse it!
doc = Hpricot(page)

# search for the div elements that contain the statistics data
stats_divs = doc.search("//div[@class='statsdiv']")
stats_divs.each do |div|
    heading = div.search("h3/a/text()")
    # we are only interested in the statistics for today
    day = div.search("h4/text()").first
    if (heading and day)
        heading = "==== #{heading} (#{day.inner_text.downcase}) ====".center(50)
        puts "\n#{heading}\n"
        # find the table with today's statistics data
        tab = div.search("table").first
        if tab
            # extract the statistics data from the <tr> elements
            tab.search("tr").each do |tr|
                what = tr.search("td[@class='label']")
                views = tr.search("td[@class='views']")
                whats = what.inner_text.strip()
                if not whats.empty?
                    views = views.inner_text.strip()
                    printf("%s -- %5s\n", whats.center(45), views)
                end
            end
        end
    end
end
# grab the div with the general (weblog level) statistics data
gbdiv = doc.search("//div[@id='generalblog']")
# find the <p> element with the number of views today
vtoday = gbdiv.search("p").find { |p| p.inner_text.index('Views today') }
if vtoday
    printf("\n%s\n\n", "=> #{vtoday.inner_text.strip} <=".center(45))
else
    puts "\n\n!! No weblog statistics data found."
    puts "   Did you enter a wrong user name and/or password?"
end

Example output

           ==== Referrers (today) ====
   stumbleupon.com/refer.php?url=http%3A?     --    16
   stumbleupon.com/refer.php?url=http%3A?     --     3
   planeterlang.org/story.php?title=Erla?     --     2
   linuxquestions.org/questions/showthre?     --     2
       del.icio.us/jdkimball/stackless        --     1
   rodenas.org/blog/2007/08/27/erlang-ri?     --     1
   intertwingly.net/blog/2007/08/14/Long?     --     1
             ozone.wordpress.com              --     1
   programming.reddit.com/search?q=erlan?     --     1

           ==== Top Posts (today) ====
          Processing XML in Erlang            --    22
  Erlang vs. Stackless python: a first ben    --    18
  Python: file find, grep and in-line repl    --     4
  Python decorator mini-study (part 1 of 3    --     2
  Code refactoring with python's functoo      --     2
  Python: find files using Unix shell-styl    --     2
  Determine order of execution by (re-)seq    --     2
           A first look at Groovy             --     1
  Python decorator mini-study (part 2 of 3    --     1
   Turn on line numbers while searching in    --     1

      ==== Search Engine Terms (today) ====
              erlang benchmark                --     3
             stackless vs erlang              --     2
         python decorators argument           --     2
      source code of an execution path        --     2
          python functools partial            --     2
                 python grep                  --     2
         python string replace 2.4.4          --     1
              erlang vs C speed               --     1
    erlang command line arguments getopt      --     1
   Python + parsing command line arguments    --     1

             ==== Clicks (today) ====
        hpcwire.com/hpc/1295541.html          --     2
   pragmaticprogrammer.com/titles/jaerla?     --     1
   hrnjad.net/src/6/scriptutil.py.html#f?     --     1

            => Views today: 68 <=

Conclusion

I am a total Ruby beginner but have a lot of experience with Python and Perl. It took approximately 2 hours (and frequent look-ups in the pickaxe book) to write the tool above and I enjoyed it :-)

Being the kind of person who stays away from all things over-hyped, I ignored Ruby for the last two years or so, but I have to say it’s a cool language after all.

Click here to download the code.

Processing XML in Erlang

Introduction

This is my second stab at Erlang (see the ring benchmark article for my first take on it). This time around I wanted to get a sense of how well Erlang and its libraries support more mundane tasks such as XML parsing.

I am not a big fan of XML but it is the lingua franca of the web and any language that aspires to become “mainstream” has to support it in an efficient manner.

In order to get a feeling for how well Erlang is doing in this respect I am going to repeat my recent XML processing experiments with Groovy but this time using Erlang.

Example

I’ll be doing some basic processing of RSS files. For a full example of what these look like see e.g. MacBreak Weekly’s RSS file. Here’s an excerpt (abridged for the sake of clarity):

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
  <channel>
    <title>MacBreak Weekly</title>
    <item>
      <title>MacBreak Weekly 53: Bill In A Box</title>
      <link>http://www.podtrac.com/twit.cachefly.net/MBW-053.mp3</link>
      <pubDate>Wed, 15 Aug 2007 12:58:11 -0700</pubDate>
      <enclosure url="http://www.podtrac.com/twit.cachefly.net/MBW-053.mp3" />
    </item>
  </channel>
</rss>

Each of the potentially many <item> tags keeps the data pertaining to a single audiocast episode. What we want to extract is:

  • the audiocast episode title (<title> tag)
  • the publication date (<pubDate> tag)
  • the URL pointing to the MP3 file (<link> tag)

Depending on the publisher the format of the RSS file may vary slightly. The publication date is e.g. sometimes buried in a <dc:date> tag.
Likewise, the MP3 URL is sometimes not contained in a <link> tag but in the url attribute of the <enclosure> tag.

The RSS files I will be using for testing are as follows:

bbox33:audiocasts $ pwd
/Users/mhr/dl/audio/audiocasts
bbox33:audiocasts $ find . -type f -name \*.rss
./metadata/Cc_zwei.rss
./metadata/Elrep.rss
./metadata/Security_now_.rss
./metadata/Technometria.rss
./metadata/This_week_in_tech.rss
./metadata/Windows_weekly.rss

Opening remarks

Unfortunately, the documentation of Erlang’s XML parsing library is pretty scant and — apart from the xmerl User’s Guide — I could not find any tutorials on xmerl on the web.

Again, a language aspiring to widespread adoption should have more material covering these kinds of basics.

There were a few choices as to how to go about the parsing business:

  • parse the file with xmerl_scan and pull the data out of the resulting record tree by hand
  • query the parsed document with XPath expressions via xmerl_xpath

I chose the first approach because it was better documented. Only after finishing the first cut of the parsing code did I find xmerl_xpath usage examples on a mailing list and played with it.
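
For the curious, here is a rough sketch of what the xmerl_xpath route could look like. This is only an illustration and not part of the tool below: the module name (xpath_sketch), the titles/1 function and the XPath expression are made up for this example.

-module(xpath_sketch).
-export([titles/1]).
-include_lib("xmerl/include/xmerl.hrl").

% return the episode titles contained in a single RSS file
titles(FName) ->
    {Doc, _Rest} = xmerl_scan:file(FName),
    % grab the text nodes of all <title> tags nested inside an <item> tag
    Nodes = xmerl_xpath:string("//item/title/text()", Doc),
    [ T#xmlText.value || T <- Nodes ].

For simple queries this is considerably more compact than walking the record tree by hand, which is what the code below does.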

The code

The Erlang code that finds the RSS files listed above and parses them is as follows:

-module(xml2).
-export([main/1]).
-include_lib("xmerl/include/xmerl.hrl").

parseAll(D) ->
    % find all RSS files underneath D
    FL = filelib:fold_files(D, ".+.rss$", true, fun(F, L) -> [F|L] end, []),
    [ parse(F) || F <- FL ].

parse(FName) ->
    % parses a single RSS file
    {R,_} = xmerl_scan:file(FName),
    % extract episode titles, publication dates and MP3 URLs
    L = lists:reverse(extract(R, [])),
    % print channel title and data for first two episodes
    io:format("~n>> ~p~n", [element(1,lists:split(3,L))]),
    L.

% handle 'xmlElement' tags
extract(R, L) when is_record(R, xmlElement) ->
    case R#xmlElement.name of
        enclosure ->
            if element(1, hd(R#xmlElement.parents)) == item ->
                    FFunc = fun(X) -> X#xmlAttribute.name == url end,
                    U = hd(lists:filter(FFunc, R#xmlElement.attributes)),
                    [ {url, U#xmlAttribute.value} | L ];
                true -> L
            end;
        channel ->
            lists:foldl(fun extract/2, L, R#xmlElement.content);
        item ->
            ItemData = lists:foldl(fun extract/2, [], R#xmlElement.content),
            [ ItemData | L ];
        _ -> % for any other XML elements, simply iterate over children
            lists:foldl(fun extract/2, L, R#xmlElement.content)
    end;

extract(#xmlText{parents=[{title,_},{channel,2},_], value=V}, L) ->
    [{channel, V}|L]; % extract channel/audiocast title

extract(#xmlText{parents=[{title,_},{item,_},_,_], value=V}, L) ->
    [{title, V}|L]; % extract episode title

extract(#xmlText{parents=[{link,_},{item,_},_,_], value=V}, L) ->
    [{link, V}|L]; % extract episode link

extract(#xmlText{parents=[{pubDate,_},{item,_},_,_], value=V}, L) ->
    [{pubDate, V}|L]; % extract episode publication date ('pubDate' tag)

extract(#xmlText{parents=[{'dc:date',_},{item,_},_,_], value=V}, L) ->
    [{pubDate, V}|L]; % extract episode publication date ('dc:date' tag)

extract(#xmlText{}, L) -> L.  % ignore any other text data

% 'main' function (invoked from shell, receives command line arguments)
main(A) ->
    D = atom_to_list(hd(A)),
    parseAll(D).

Conclusion

Erlang’s filelib:fold_files() function is cool and a good example of how easy things should be.

On the other hand:

  • it took quite a bit of time and effort to write the code above (perhaps due to my lack of experience in functional programming) and it was not fun :-(
  • beauty is in the eye of the beholder, as they say, and the promise of being able to write clean and attractive code is a major reason to pick up a new programming language. Again, maybe it’s just me, but I did not find the resulting Erlang code particularly attractive.
  • the XML parser chokes reproducibly on XML files with non-ASCII character sets (try the code e.g. with the following RSS file (containing German characters))
  • the XPath implementation appears to be incomplete: I could not use the | operator in an XPath expression to select several paths, for example (see the sketch right below this list).
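
For what it’s worth, the obvious workaround is to run the sub-queries separately and append the result lists. The snippet below (pasteable into an erl shell) is only a sketch: the file path and the two XPath expressions are made up for illustration, reusing the xmerl_xpath:string/2 call shown earlier.

% instead of xmerl_xpath:string("//item/link/text() | //item/pubDate/text()", Doc)
{Doc, _Rest} = xmerl_scan:file("metadata/Security_now_.rss"),
Links = xmerl_xpath:string("//item/link/text()", Doc),
Dates = xmerl_xpath:string("//item/pubDate/text()", Doc),
Both  = Links ++ Dates.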

Anyway, that’s just my $0.02 on XML parsing with Erlang. I am not an expert by any stretch of the imagination so feel free to point out anything I may have missed.

A first look at Groovy

Introduction

Recently I started playing with Groovy, a dynamic language that — according to the first chapter of the Groovy in Action book — is Python-inspired.

One of the reasons why I find Groovy attractive is that it can be compiled to Java byte code (despite being a dynamic language), i.e. you gain access to all the Java libraries and the ability to deploy on the Java platform without actually having to write Java code.

Groovy is even supported by the Spring Framework. So, if you’re “forced” to work in a Java project in order to make a living but would rather be using a dynamic language then Groovy is definitely worth a look.

Example

Some time ago I wrote a small tool (using Python) that manages my audiocasts. Based on the RSS feeds I am interested in, it synchronises

  1. the MP3 files on my MP3 player with the ones available on the local hard disk of my computer
  2. the MP3 files on my computer (local hard disk) with the ones available on the web

It is also cognizant of the date of an audiocast and will only consider MP3 files that were published on a particular channel N days ago.

I thought building a similar tool in Groovy would be a nice exercise and a good way to get to know the language.

I started playing with the code that parses the RSS (XML) files first and was pleasantly surprised how quickly I was able to get my hands on the data required.

For an example of what an RSS file looks like, see e.g. MacBreak Weekly’s RSS file. In essence the top level tag is <channel> with a number of embedded <item> tags. Each item is an audiocast and the data I needed for my purpose is as follows:

  • audiocast episode title (<title> tag)
  • publication date (<pubDate> tag)
  • the URL pointing to the MP3 file (<link> tag)

Depending on the publisher the format of the RSS file may vary slightly. The publication date is e.g. sometimes buried in a <dc:date> tag (where dc is an XML name space pointing to http://purl.org/dc/elements/1.1/).

Likewise, the MP3 file URL is sometimes not contained in a <link> tag but in the url attribute of the <enclosure> tag.

The RSS files on my system are here:

bbox33:audiocasts $ pwd
/Users/mhr/dl/audio/audiocasts
bbox33:audiocasts $ find . -type f -name \*.rss
./metadata/Cc_zwei.rss
./metadata/Elrep.rss
./metadata/Security_now_.rss
./metadata/Technometria.rss
./metadata/This_week_in_tech.rss
./metadata/Windows_weekly.rss

The code

The Groovy code that finds the RSS files listed above and parses them is as follows:

def audioDir = new File("/Users/mhr/dl/audio/audiocasts")

// initialise RSS files list
files = []
// find RSS files underneath 'audioDir'
audioDir.eachFileRecurse { if (it =~ /.*\.(xml|rss)$/) { files << it } }

println("\n-------- RSS files --------")
println files.join('\n')

// iterate over RSS files found
for (rssf in files) {
    println("\n-------- $rssf --------")
    // parse the RSS file
    def d = new XmlSlurper().parse(rssf)
    d.declareNamespace(dc:"http://purl.org/dc/elements/1.1/")

    // iterate over item tags in RSS file (take only the first two)
    d.channel.item[0..1].each {
        println "==> ${it.title}"
        if (it.pubDate.toString().trim()) {
            println "pubDate: ${it.pubDate}"
        } else {
            println "dc:date: ${it.'dc:date'}"
        }
        if (it.link.toString().trim()) {
            println "link: ${it.link}"
        } else {
            println "url: ${it.enclosure.@url}"
        }
    }
}

Here’s the output generated by the code above:

bbox33:groovy $ groovy xml.groovy

-------- RSS files --------
/Users/mhr/dl/audio/audiocasts/metadata/Cc_zwei.rss
/Users/mhr/dl/audio/audiocasts/metadata/Elrep.rss
/Users/mhr/dl/audio/audiocasts/metadata/Security_now_.rss
/Users/mhr/dl/audio/audiocasts/metadata/Technometria.rss
/Users/mhr/dl/audio/audiocasts/metadata/This_week_in_tech.rss
/Users/mhr/dl/audio/audiocasts/metadata/Windows_weekly.rss

-------- /Users/mhr/dl/audio/audiocasts/metadata/Cc_zwei.rss --------
==> CC2 - 62. Folge
pubDate: Mon, 13 Aug 2007 20:00:00 +0200 +0200
url: http://www.media01-live.de/CC-Zwei-62.mp3
==> CC2 - 61. Folge
pubDate: Mon, 06 Aug 2007 20:00:00 +0200 +0200
url: http://www.media01-live.de/CC-Zwei-61.mp3

-------- /Users/mhr/dl/audio/audiocasts/metadata/Elrep.rss --------
==> 36: Lutz Schmitt über Machinima
dc:date: 2007-08-05T21:30:00+01:00
link: http://www.elektrischer-reporter.de/index.php/site/film/48/
==> 35: Peter Schaar über Vorratsdatenspeicherung und Online-Durchsuchungen
dc:date: 2007-07-22T21:30:00+01:00
link: http://www.elektrischer-reporter.de/index.php/site/film/47/

-------- /Users/mhr/dl/audio/audiocasts/metadata/Security_now_.rss --------
==> Security Now 104: Steve's Questions, Your Answers 22 - sponsored by Astaro Corp.
pubDate: Thu, 09 Aug 2007 08:22:49 -0700
link: http://www.podtrac.com/pts/redirect.mp3/aolradio.podcast.aol.com/sn/SN-104.mp3
==> Security Now 103: Paypal Security Key - sponsored by Astaro Corp.
pubDate: Thu, 02 Aug 2007 15:17:58 -0700
link: http://www.podtrac.com/pts/redirect.mp3/aolradio.podcast.aol.com/sn/SN-103.mp3

-------- /Users/mhr/dl/audio/audiocasts/metadata/Technometria.rss --------
==> Scott Lemon, Ben Galbraith - Technometria
pubDate: Tue, 14 Aug 2007 00:00:00 CDT
link: http://www.itconversations.com/shows/detail1892.html
==> Drew Major - Technometria
pubDate: Tue, 7 Aug 2007 00:00:00 CDT
link: http://www.itconversations.com/shows/detail1886.html

-------- /Users/mhr/dl/audio/audiocasts/metadata/This_week_in_tech.rss --------
==> TWiT 109: The Numinous From The Quotidian
pubDate: Sun, 12 Aug 2007 22:39:47 -0700
link: http://www.podtrac.com/pts/redirect.mp3/aolradio.podcast.aol.com/twit/TWiT0109H.mp3
==> TWiT 108: The Crash of 2007
pubDate: Sun, 05 Aug 2007 20:48:59 -0700
link: http://www.podtrac.com/pts/redirect.mp3/aolradio.podcast.aol.com/twit/TWiT0108H.mp3

-------- /Users/mhr/dl/audio/audiocasts/metadata/Windows_weekly.rss --------
==> Windows Weekly 32: Neener Neener Neener
pubDate: Fri, 27 Jul 2007 18:49:31 -0700
link: http://www.podtrac.com/pts/redirect.mp3/twit.cachefly.net/WW-032.mp3
==> Windows Weekly 31: Computing In The Clouds
pubDate: Fri, 20 Jul 2007 08:57:52 -0700
link: http://www.podtrac.com/pts/redirect.mp3/twit.cachefly.net/WW-031.mp3

Conclusion

I have not written any substantial code in Groovy yet but my first impressions of the language are favourable. Groovy should be an interesting addition to your arsenal (particularly if you know the Java libraries fairly well).

Declarative parsing of command line arguments in Python

Introduction

You will probably agree with me that one of the most boring programming chores is the parsing of command line arguments. After programming one too many argument handling routines I decided to write a utility that does the job for me.

The following example shows the use of the resulting Python class (CLAP):

 1  #!/usr/bin/env python
 2
 3  # please note: the code below is just a usage example (modelled on a
 4  # hypothetical crypto utility)
 5
 6  import sys, pprint
 7  from parseargs import CLAP
 8
 9  class Crypto:
10      def __init__(self):
11          self.handleArgs()
12
13      def handleArgs(self):
14          # dictionary with command line args along with their types and defaults
15          args = {
16              ('-a', '-x', '--algo')      :   ('algo', str, None),
17              ('-c', '--crypt')           :   ('crypt', bool, None),
18              ('-d', '--decrypt')         :   ('decrypt', bool, None),
19              ('-e', '--echo', '--fyo')   :   ('echo', bool, None),
20              ('-l', '--lines')           :   ('lines', int, '25'),
21              ('-i', )                    :   ('input', str, None),
22              ('-o', )                    :   ('output', str, None),
23              ('-p', '--pager', '--pgr')  :   ('pager', str, '/usr/bin/less'),
24              ('-r', '--recipient')       :   ('recipient', str, None)
25          }
26
27          apu = CLAP(sys.argv[1:], args, min_args=2)
28          self.args = apu.check_args()
29
30  if __name__ == '__main__':
31      c = Crypto()
32      pp = pprint.PrettyPrinter(indent=4)
33      pp.pprint(c.args)

As you can see, the main action is on lines 15-28. I opted for a more declarative approach, i.e. I wanted to be able to “declare” the expected command line arguments (along with their types and default values) and be done.

Here is a usage example:

mhr@playground2:~/src/published$ python clap_example.py -d -a blowfish --echo
{   'algo': 'blowfish',
    'decrypt': True,
    'echo': True,
    'lines': 25,
    'pager': '/usr/bin/less'}

The first three values were supplied on the command line whereas the last two stem from defaults declared in the client code (see lines 20 and 23 above).

Here is what happens in case of erroneous user input (e.g. supplying the value ‘abc’ for the ‘lines’ argument, which is of type integer):

mhr@playground2:~/src/published$ python clap_example.py -d -l abc
!! Invalid parameter value: invalid literal for int(): abc !!

In the invocation below an unsupported parameter (‘-z’) was passed:

mhr@playground2:~/src/published$ python clap_example.py -d -z
!! Error: option -z not recognized !!

The utility class

The utility class CLAP is reasonably straightforward and will be introduced below. To see it in full beauty click here :-)

  1  #!/usr/bin/env python
  2  """
  3  Utility class for handling of command line arguments, see the bottom of the
  4  file for an example showing how it should be used.
  5  """
  6  # Copyright: (c) 2006 Muharem Hrnjadovic
  7  # created: 21/11/2006 15:15:49
  8
  9  __version__ = "$Id$"
 10  # $HeadURL $
 11
 12  import sys, getopt, re
 13  import itertools as IT
 14  import operator as OP
 15
 16  class CLAP(object):
 17      """A class that uses a declarative technique for command line
 18      argument parsing"""
 19
 20      def __init__(self, argv, args, min_args = 0, help_string = None):
 21          """initialiser, just copies its arguments to attributes"""
 22          self.args = args
 23          self.min_args = min_args
 24          self.help_string = help_string
 25          # skip any leading arguments that don't start with a dash (since
 26          # this confuses the getopt utility)
 27          self.argv = list(IT.dropwhile(lambda s: not s.startswith('-'), argv))

lines 20-27 (initialiser method): copies the parameters passed to it to attributes of the same name; the only twist is that any leading arguments that don't start with a dash are skipped since they would confuse the getopt utility.

 28
 29      def check_args(self):
 30          if not self.argv or not self.args or len(self.argv) < self.min_args:
 31              sys.stderr.write("!! Error: not enough arguments or data " \
 32                               "for parsing !!\n")
 33              self.help(1)
 34
 35          self.construct_getopt_data()
 36          try:
 37              opts, args = getopt.getopt(self.argv, self.shortflags,
 38                  self.longflags)
 39          except getopt.GetoptError, e:
 40              sys.stderr.write("!! Error: %s !!\n" % str(e))
 41              self.help(2)

lines 35-41: the data required for the getopt() function is put together (from the command line argument “declaration” supplied by the client code). Subsequently getopt() is invoked to perform the low-level argument parsing.

 42
 43          # holds arguments that were actually supplied on the command line
 44          suppliedd = {}
 45          # result dictionary
 46          resultd = {}
 47
 48          # initialise args where appropriate
 49          try:
 50              for flags, (argn,typef,initv) in self.args.iteritems():
 51                  if initv is not None: resultd[argn] = typef(initv)
 52          except Exception, e:
 53              sys.stderr.write("!! Internal error: %s !!\n" % str(e))
 54              self.help(3)

lines 49-54: any arguments that have default values are initialised with them. Please note how the code uses Python's type functions to perform the initialisation (line 51).

 55
 56          # dictionary needed for matching against the command line flags
 57          matchd = dict([(arg, (OP.itemgetter(0)(v), OP.itemgetter(1)(v))) for \
 58                         args, v in self.args.iteritems() for arg in args])
 59
 60          # check the arguments provided on the command line
 61          try:
 62              for opt, argv in opts:
 63                  if opt in matchd:
 64                      argn, typef = matchd[opt]
 65                      suppliedd[argn] = (typef == bool and True) or typef(argv)
 66          except Exception, e:
 67              sys.stderr.write("!! Invalid parameter value: %s !!\n" % str(e))
 68              self.help(4)

lines 57-68: given the command line arguments shown in the “Introduction” section, the matchd dictionary will have the following value:

pp.pprint(matchd)
{   '--algo': ('algo', <type 'str'>),
    '--crypt': ('crypt', <type 'bool'>),
    '--decrypt': ('decrypt', <type 'bool'>),
    '--echo': ('echo', <type 'bool'>),
    '--fyo': ('echo', <type 'bool'>),
    '--lines': ('lines', <type 'int'>),
    '--pager': ('pager', <type 'str'>),
    '--pgr': ('pager', <type 'str'>),
    '--recipient': ('recipient', <type 'str'>),
    '-a': ('algo', <type 'str'>),
    '-c': ('crypt', <type 'bool'>),
    '-d': ('decrypt', <type 'bool'>),
    '-e': ('echo', <type 'bool'>),
    '-i': ('input', <type 'str'>),
    '-l': ('lines', <type 'int'>),
    '-o': ('output', <type 'str'>),
    '-p': ('pager', <type 'str'>),
    '-r': ('recipient', <type 'str'>),
    '-x': ('algo', <type 'str'>)}

It is used to add the arguments that were actually supplied on the command line to the suppliedd dictionary. Please note again how Python's type functions are used to convert (non-boolean) command line arguments from strings to the desired type (line 65).

 69
 70          # merge arguments (supplied on the command line) with the defaults
 71          resultd.update(suppliedd)
 72
 73          return (resultd)

lines 71-73: last but not least, we merge the arguments that were actually supplied on the command line with the default values and return the result.

 74
 75      def construct_getopt_data(self):
 76          # pair all flags with their respective types
 77          flags = [(arg, OP.itemgetter(1)(v)) for args, v in \
 78                                          self.args.iteritems() for arg in args]
 79          def ff(((argf, argt), fchar)):
 80              return argt == bool and argf.lstrip('-') or \
 81              "%s%s" % (argf.lstrip('-'), fchar)
 82          # single character flags
 83          self.shortflags = ''.join(map(ff, zip(filter(lambda t: len(t[0]) <= 2,
 84                                                       flags), IT.repeat(':'))))
 85          # multiple character flags
 86          self.longflags = map(ff, zip(filter(lambda t: len(t[0]) > 2, flags),
 87                                       IT.repeat('=')))

Again, based on the example above, the flags list will be as follows:

pp.pprint(flags)
[   ('-l', <type 'int'>),
    ('--lines', <type 'int'>),
    ('-e', <type 'bool'>),
    ('--echo', <type 'bool'>),
    ('--fyo', <type 'bool'>),
    ('-p', <type 'str'>),
    ('--pager', <type 'str'>),
    ('--pgr', <type 'str'>),
    ('-a', <type 'str'>),
    ('-x', <type 'str'>),
    ('--algo', <type 'str'>),
    ('-r', <type 'str'>),
    ('--recipient', <type 'str'>),
    ('-o', <type 'str'>),
    ('-i', <type 'str'>),
    ('-d', <type 'bool'>),
    ('--decrypt', <type 'bool'>),
    ('-c', <type 'bool'>),
    ('--crypt', <type 'bool'>)]

The ensuing manipulations result in the following data (to be passed to getopt()):

pp.pprint(self.shortflags)
'l:ep:a:x:r:o:i:dc'

pp.pprint(self.longflags)
[   'lines=',
    'echo',
    'fyo',
    'pager=',
    'pgr=',
    'algo=',
    'recipient=',
    'decrypt',
    'crypt']

 88
 89      def help(self, exit_code=0):
 90          if self.help_string: sys.stderr.write(self.help_string)
 91          sys.exit(exit_code)

In case of an error check_args() will invoke the help() method, which terminates program execution after printing a help string (if one was supplied).

Conclusion

In case you liked the command line argument processing class introduced above, please feel free to download it from here and play with it. The colorised source code without any interspersed commentary can be viewed here.