Processing XML in Erlang

Introduction

This is my second stab at Erlang (see the ring benchmark article for my first take on it). This time around I wanted to get a sense of how well Erlang and its libraries support more mundane tasks such as XML parsing.

I am not a big fan of XML but it is the lingua franca of the web and any language that aspires to become “mainstream” has to support it in an efficient manner.

In order to get a feeling for how well Erlang is doing in this respect I am going to repeat my recent XML processing experiments with Groovy but this time using Erlang.

Example

I’ll be doing some basic processing of RSS files. For a full example of what these look like see e.g. the MacBreak Weekly RSS file. Here’s an excerpt (abridged for the sake of clarity):

    <?xml version="1.0" encoding="utf-8"?>
    <rss version="2.0">
      <channel>
        <title>MacBreak Weekly</title>
        <item>
          <title>MacBreak Weekly 53: Bill In A Box</title>
          <link>http://www.podtrac.com/twit.cachefly.net/MBW-053.mp3</link>
          <pubDate>Wed, 15 Aug 2007 12:58:11 -0700</pubDate>
          <enclosure url="http://www.podtrac.com/twit.cachefly.net/MBW-053.mp3" />
        </item>
      </channel>
    </rss>

Each of the potentially many <item> tags keeps the data pertaining to a single audiocast episode. What we want to extract is:

  • the audiocast episode title (<title> tag)
  • the publication date (<pubDate> tag)
  • the URL pointing to the MP3 file (<link> tag)

Depending on the publisher, the format of the RSS file may vary slightly. The publication date, for example, is sometimes buried in a <dc:date> tag.
Likewise, the MP3 URL is sometimes not contained in a <link> tag but in the url attribute of the <enclosure> tag.

The RSS files I will be using for testing are as follows:

    bbox33:audiocasts $ pwd
    /Users/mhr/dl/audio/audiocasts
    bbox33:audiocasts $ find . -type f -name \*.rss
    ./metadata/Cc_zwei.rss
    ./metadata/Elrep.rss
    ./metadata/Security_now_.rss
    ./metadata/Technometria.rss
    ./metadata/This_week_in_tech.rss
    ./metadata/Windows_weekly.rss

Opening remarks

Unfortunately, the documentation of Erlang’s XML parsing library is pretty scant and — apart from the xmerl User’s Guide — I could not find any tutorials on xmerl on the web.

Again, a language aspiring to widespread adoption should have more material covering these kinds of basics.

There were a few choices as to how to go about the parsing business, among them walking the record tree returned by xmerl_scan by hand and querying that tree with xmerl_xpath.

I chose the former because it was better documented. Only after finishing the first cut of the parsing code did I find xmerl_xpath usage examples on a mailing list and play with them.
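
For comparison, an xmerl_xpath query might look roughly like the sketch below. It is not the code I ended up using; the module name xpath_sketch and the function episode_titles/1 are made up for illustration, and the input is taken from a string rather than a file to keep the example self-contained.

```erlang
-module(xpath_sketch).
-export([episode_titles/1]).
-include_lib("xmerl/include/xmerl.hrl").

%% Hypothetical helper: returns the text of every <title> element
%% that is a direct child of an <item> element.
episode_titles(XmlString) ->
    % xmerl_scan:string/1 returns the root element and any trailing input
    {Doc, _Rest} = xmerl_scan:string(XmlString),
    % the XPath query yields a list of #xmlText{} records
    Nodes = xmerl_xpath:string("//item/title/text()", Doc),
    [V || #xmlText{value = V} <- Nodes].
```

Because the path goes through item, the channel's own <title> is not selected.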

The code

The Erlang code that finds the RSS files listed above and parses them is as follows:

    -module(xml2).
    -export([main/1]).
    -include_lib("xmerl/include/xmerl.hrl").

    parseAll(D) ->
        % find all RSS files underneath D (the dot before 'rss' must be escaped)
        FL = filelib:fold_files(D, ".+\\.rss$", true, fun(F, L) -> [F|L] end, []),
        [ parse(F) || F <- FL ].

    parse(FName) ->
        % parses a single RSS file
        {R,_} = xmerl_scan:file(FName),
        % extract episode titles, publication dates and MP3 URLs
        L = lists:reverse(extract(R, [])),
        % print channel title and data for first two episodes
        % (lists:sublist/2 does not fail on feeds with fewer entries)
        io:format("~n>> ~p~n", [lists:sublist(L, 3)]),
        L.

    % handle 'xmlElement' tags
    extract(R, L) when is_record(R, xmlElement) ->
        case R#xmlElement.name of
            enclosure ->
                % only consider <enclosure> tags directly below an <item>
                if element(1, hd(R#xmlElement.parents)) == item ->
                        FFunc = fun(X) -> X#xmlAttribute.name == url end,
                        U = hd(lists:filter(FFunc, R#xmlElement.attributes)),
                        [ {url, U#xmlAttribute.value} | L ];
                    true -> L
                end;
            channel ->
                lists:foldl(fun extract/2, L, R#xmlElement.content);
            item ->
                % collect the data of a single episode in its own list
                ItemData = lists:foldl(fun extract/2, [], R#xmlElement.content),
                [ ItemData | L ];
            _ -> % for any other XML elements, simply iterate over children
                lists:foldl(fun extract/2, L, R#xmlElement.content)
        end;

    % the channel position is not hard-coded so that feeds with a
    % different layout in front of <channel> still match
    extract(#xmlText{parents=[{title,_},{channel,_},_], value=V}, L) ->
        [{channel, V}|L]; % extract channel/audiocast title

    extract(#xmlText{parents=[{title,_},{item,_},_,_], value=V}, L) ->
        [{title, V}|L]; % extract episode title

    extract(#xmlText{parents=[{link,_},{item,_},_,_], value=V}, L) ->
        [{link, V}|L]; % extract episode link

    extract(#xmlText{parents=[{pubDate,_},{item,_},_,_], value=V}, L) ->
        [{pubDate, V}|L]; % extract episode publication date ('pubDate' tag)

    extract(#xmlText{parents=[{'dc:date',_},{item,_},_,_], value=V}, L) ->
        [{pubDate, V}|L]; % extract episode publication date ('dc:date' tag)

    extract(#xmlText{}, L) -> L.  % ignore any other text data

    % 'main' function (invoked from shell, receives command line arguments;
    % the arguments are expected as atoms, as passed by erl's -s flag)
    main(A) ->
        D = atom_to_list(hd(A)),
        parseAll(D).
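
Each episode comes back as a list of tagged tuples ({title, ...}, {pubDate, ...} and, depending on the feed, {link, ...} and/or {url, ...}). Picking the MP3 URL from such a list, preferring the <link> value and falling back to the <enclosure> attribute, could be sketched as follows. The module name rss_fallback and the function mp3_url/1 are hypothetical and not part of the code above.

```erlang
-module(rss_fallback).
-export([mp3_url/1]).

%% Item is assumed to be a property list of the shape produced for a
%% single episode, e.g. [{title, "..."}, {link, "..."}, {url, "..."}].
%% Returns the link if present, else the enclosure url, else 'undefined'.
mp3_url(Item) ->
    case proplists:get_value(link, Item) of
        undefined -> proplists:get_value(url, Item);
        Link      -> Link
    end.
```

This mirrors the link/enclosure fallback described in the example section; the same pattern would work for any other pair of alternative tags.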

Conclusion

Erlang’s filelib:fold_files() function is cool and a good example of how easy things should be.

On the other hand:

  • it took quite a bit of time and effort to write the code above (perhaps due to my lack of experience in functional programming) and it was not fun :-(
  • beauty is in the eye of the beholder as they say and the promise of being able to write clean and attractive code is a major reason to pick up a new programming language. Again, maybe it’s just me but I did not find the resulting Erlang code to be particularly attractive.
  • the XML parser chokes reproducibly on XML files with non-ASCII character sets (try the code, for example, with an RSS file containing German characters)
  • the XPath implementation appears to be incomplete: I could not, for example, use the | operator in an XPath expression to select several paths.
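
One way around the missing | operator might be to run one query per path and concatenate the results. The sketch below does this for the two places an MP3 URL can live; the module name xpath_union and the function episode_urls/1 are made up for illustration, and I have only reasoned this through on small inputs.

```erlang
-module(xpath_union).
-export([episode_urls/1]).
-include_lib("xmerl/include/xmerl.hrl").

%% Hypothetical helper: emulates "//item/link | //item/enclosure/@url"
%% with two separate queries whose results are appended.
episode_urls(XmlString) ->
    {Doc, _Rest} = xmerl_scan:string(XmlString),
    % text nodes of <link> elements below <item>
    Links = [V || #xmlText{value = V} <-
                      xmerl_xpath:string("//item/link/text()", Doc)],
    % url attributes of <enclosure> elements below <item>
    Urls  = [V || #xmlAttribute{value = V} <-
                      xmerl_xpath:string("//item/enclosure/@url", Doc)],
    Links ++ Urls.
```

Unlike a real XPath union, the result is not in document order: all <link> values come first, followed by all enclosure URLs.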

Anyway, that’s just my $0.02 on XML parsing with Erlang. I am not an expert by any stretch of the imagination, so feel free to point out anything I may have missed.

A first look at Groovy

Introduction

Recently I started playing with Groovy, a dynamic language that — according to the first chapter of the Groovy in Action book — is Python-inspired.

One of the reasons why I find Groovy attractive is that it can be compiled to Java byte code (despite being a dynamic language), i.e. you gain access to all the Java libraries and can deploy on the Java platform without actually having to write Java code.

Groovy is even supported by the Spring Framework. So, if you’re “forced” to work in a Java project in order to make a living but would rather be using a dynamic language then Groovy is definitely worth a look.

Example

Some time ago I wrote a small tool (using Python) that manages my audiocasts. Based on the RSS feeds I am interested in, it synchronises

  1. the MP3 files on my MP3 player with the ones available on the local hard disk of my computer
  2. the MP3 files on my computer (local hard disk) with the ones available on the web

It is also cognizant of the date of an audiocast and will only consider MP3 files that were published on a particular channel N days ago.

I thought building a similar tool in Groovy would be a nice exercise and a good way to get to know the language.

I started playing with the code that parses the RSS (XML) files first and was pleasantly surprised how quickly I was able to get my hands on the data required.

For an example of what an RSS file looks like see e.g. the MacBreak Weekly RSS file. In essence, the <rss> root tag contains a <channel> tag with a number of embedded <item> tags. Each item is an audiocast and the data I needed for my purpose is as follows:

  • audiocast episode title (<title> tag)
  • publication date (<pubDate> tag)
  • the URL pointing to the MP3 file (<link> tag)

Depending on the publisher, the format of the RSS file may vary slightly. The publication date, for example, is sometimes buried in a <dc:date> tag (where dc is an XML namespace prefix bound to http://purl.org/dc/elements/1.1/).

Likewise, the MP3 file URL is sometimes not contained in a <link> tag but in the url attribute of the <enclosure> tag.

The RSS files on my system are here:

    bbox33:audiocasts $ pwd
    /Users/mhr/dl/audio/audiocasts
    bbox33:audiocasts $ find . -type f -name \*.rss
    ./metadata/Cc_zwei.rss
    ./metadata/Elrep.rss
    ./metadata/Security_now_.rss
    ./metadata/Technometria.rss
    ./metadata/This_week_in_tech.rss
    ./metadata/Windows_weekly.rss

The code

The Groovy code that finds the RSS files listed above and parses them is as follows:

    def audioDir = new File("/Users/mhr/dl/audio/audiocasts")

    // initialise RSS files list
    files = []
    // find RSS files underneath 'audioDir'
    audioDir.eachFileRecurse { if (it =~ /.*\.(xml|rss)$/) { files << it } }

    println("\n-------- RSS files --------")
    println files.join('\n')

    // iterate over RSS files found
    for (rssf in files) {
        println("\n-------- $rssf --------")
        // parse the RSS file
        def d = new XmlSlurper().parse(rssf)
        // make the 'dc' prefix known so that 'dc:date' can be looked up
        d.declareNamespace(dc:"http://purl.org/dc/elements/1.1/")

        // iterate over item tags in RSS file (take only the first two)
        d.channel.item[0..1].each {
            println "==> ${it.title}"
            // fall back to 'dc:date' if there is no 'pubDate' tag
            if (it.pubDate.toString().trim()) {
                println "pubDate: ${it.pubDate}"
            } else {
                println "dc:date: ${it.'dc:date'}"
            }
            // fall back to the enclosure's 'url' attribute if there is no 'link' tag
            if (it.link.toString().trim()) {
                println "link: ${it.link}"
            } else {
                println "url: ${it.enclosure.@url}"
            }
        }
    }

Here’s the output generated by the code above:

    bbox33:groovy $ groovy xml.groovy

    -------- RSS files --------
    /Users/mhr/dl/audio/audiocasts/metadata/Cc_zwei.rss
    /Users/mhr/dl/audio/audiocasts/metadata/Elrep.rss
    /Users/mhr/dl/audio/audiocasts/metadata/Security_now_.rss
    /Users/mhr/dl/audio/audiocasts/metadata/Technometria.rss
    /Users/mhr/dl/audio/audiocasts/metadata/This_week_in_tech.rss
    /Users/mhr/dl/audio/audiocasts/metadata/Windows_weekly.rss

    -------- /Users/mhr/dl/audio/audiocasts/metadata/Cc_zwei.rss --------
    ==> CC2 - 62. Folge
    pubDate: Mon, 13 Aug 2007 20:00:00 +0200 +0200
    url: http://www.media01-live.de/CC-Zwei-62.mp3
    ==> CC2 - 61. Folge
    pubDate: Mon, 06 Aug 2007 20:00:00 +0200 +0200
    url: http://www.media01-live.de/CC-Zwei-61.mp3

    -------- /Users/mhr/dl/audio/audiocasts/metadata/Elrep.rss --------
    ==> 36: Lutz Schmitt ^_ber Machinima
    dc:date: 2007-08-05T21:30:00+01:00
    link: http://www.elektrischer-reporter.de/index.php/site/film/48/
    ==> 35: Peter Schaar ^_ber Vorratsdatenspeicherung und Online-Durchsuchungen
    dc:date: 2007-07-22T21:30:00+01:00
    link: http://www.elektrischer-reporter.de/index.php/site/film/47/

    -------- /Users/mhr/dl/audio/audiocasts/metadata/Security_now_.rss --------
    ==> Security Now 104: SteveOs Questions, Your Answers 22 - sponsored by Astaro Corp.
    pubDate: Thu, 09 Aug 2007 08:22:49 -0700
    link: http://www.podtrac.com/pts/redirect.mp3/aolradio.podcast.aol.com/sn/SN-104.mp3
    ==> Security Now 103: Paypal Security Key - sponsored by Astaro Corp.
    pubDate: Thu, 02 Aug 2007 15:17:58 -0700
    link: http://www.podtrac.com/pts/redirect.mp3/aolradio.podcast.aol.com/sn/SN-103.mp3

    -------- /Users/mhr/dl/audio/audiocasts/metadata/Technometria.rss --------
    ==> Scott Lemon, Ben Galbraith - Technometria
    pubDate: Tue, 14 Aug 2007 00:00:00 CDT
    link: http://www.itconversations.com/shows/detail1892.html
    ==> Drew Major - Technometria
    pubDate: Tue, 7 Aug 2007 00:00:00 CDT
    link: http://www.itconversations.com/shows/detail1886.html

    -------- /Users/mhr/dl/audio/audiocasts/metadata/This_week_in_tech.rss --------
    ==> TWiT 109: The Numinous From The Quotidian
    pubDate: Sun, 12 Aug 2007 22:39:47 -0700
    link: http://www.podtrac.com/pts/redirect.mp3/aolradio.podcast.aol.com/twit/TWiT0109H.mp3
    ==> TWiT 108: The Crash of 2007
    pubDate: Sun, 05 Aug 2007 20:48:59 -0700
    link: http://www.podtrac.com/pts/redirect.mp3/aolradio.podcast.aol.com/twit/TWiT0108H.mp3

    -------- /Users/mhr/dl/audio/audiocasts/metadata/Windows_weekly.rss --------
    ==> Windows Weekly 32: Neener Neener Neener
    pubDate: Fri, 27 Jul 2007 18:49:31 -0700
    link: http://www.podtrac.com/pts/redirect.mp3/twit.cachefly.net/WW-032.mp3
    ==> Windows Weekly 31: Computing In The Clouds
    pubDate: Fri, 20 Jul 2007 08:57:52 -0700
    link: http://www.podtrac.com/pts/redirect.mp3/twit.cachefly.net/WW-031.mp3

Conclusion

I have not written any substantial code in Groovy yet but my first impressions of the language are favourable. Groovy should be an interesting addition to your arsenal (particularly if you know the Java libraries fairly well).