Processing XML in Erlang

Introduction

This is my second stab at Erlang (see the ring benchmark article for my first take on it). This time around I wanted to get a sense of how well Erlang and its libraries support more mundane tasks such as XML parsing.

I am not a big fan of XML but it is the lingua franca of the web and any language that aspires to become “mainstream” has to support it in an efficient manner.

To get a feeling for how well Erlang does in this respect, I am going to repeat my recent XML processing experiments with Groovy, this time using Erlang.

Example

I’ll be doing some basic processing of RSS files. For a full example of what these look like, see e.g. the MacBreak Weekly RSS file. Here’s an excerpt (abridged for the sake of clarity):

  <?xml version="1.0" encoding="utf-8"?>
  <rss version="2.0">
    <channel>
      <title>MacBreak Weekly</title>
      <item>
        <title>MacBreak Weekly 53: Bill In A Box</title>
        <link>http://www.podtrac.com/twit.cachefly.net/MBW-053.mp3</link>
        <pubDate>Wed, 15 Aug 2007 12:58:11 -0700</pubDate>
        <enclosure url="http://www.podtrac.com/twit.cachefly.net/MBW-053.mp3" />
      </item>
    </channel>
  </rss>

Each of the potentially many <item> tags keeps the data pertaining to a single audiocast episode. What we want to extract is:

  • the audiocast episode title (<title> tag)
  • the publication date (<pubDate> tag)
  • the URL pointing to the MP3 file (<link> tag)

Depending on the publisher, the format of the RSS file may vary slightly. The publication date, for example, is sometimes buried in a <dc:date> tag.
Likewise, the MP3 URL is sometimes not contained in a <link> tag but in the url attribute of the <enclosure> tag.
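
To make the <link> versus <enclosure> variation concrete, here is a minimal sketch of a fallback lookup on a single <item> element as produced by xmerl. The module name rss_fallback and the function names are made up for illustration, and the sketch assumes that a <link> tag, where present, holds the MP3 URL as its first text node.

```erlang
-module(rss_fallback).
-export([episode_url/1]).
-include_lib("xmerl/include/xmerl.hrl").

%% Returns the MP3 URL of an <item> element: the text of its <link> child
%% if there is one, otherwise the url attribute of its <enclosure> child,
%% otherwise the atom 'undefined'.
episode_url(#xmlElement{name = item, content = Children}) ->
    case [E || #xmlElement{name = link} = E <- Children] of
        [#xmlElement{content = [#xmlText{value = V} | _]} | _] -> V;
        _ -> enclosure_url(Children)
    end.

%% Looks for a url attribute on any <enclosure> child element.
enclosure_url(Children) ->
    case [V || #xmlElement{name = enclosure, attributes = As} <- Children,
               #xmlAttribute{name = url, value = V} <- As] of
        [V | _] -> V;
        [] -> undefined
    end.
```

The same idea would apply to <pubDate> versus <dc:date>: try the common tag first and fall back to the alternative only if it is missing.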

The RSS files I will be using for testing are as follows:

  bbox33:audiocasts $ pwd
  /Users/mhr/dl/audio/audiocasts
  bbox33:audiocasts $ find . -type f -name \*.rss
  ./metadata/Cc_zwei.rss
  ./metadata/Elrep.rss
  ./metadata/Security_now_.rss
  ./metadata/Technometria.rss
  ./metadata/This_week_in_tech.rss
  ./metadata/Windows_weekly.rss

Opening remarks

Unfortunately, the documentation of xmerl, Erlang’s XML parsing library, is pretty scant and, apart from the xmerl User’s Guide, I could not find any tutorials on it on the web.

Again, a language aspiring to widespread adoption should have more material covering these kinds of basics.

There were a few choices as to how to go about the parsing business:

I chose the first approach because it was better documented. Only after finishing the first cut of the parsing code did I find xmerl_xpath usage examples on a mailing list and play with them.
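
For comparison, this is roughly what the xmerl_xpath route looks like. It is only a sketch: the module name xpath_sketch and the function episode_titles/1 are made up for illustration, and the XPath expression assumes the RSS layout shown in the excerpt above.

```erlang
-module(xpath_sketch).
-export([episode_titles/1]).
-include_lib("xmerl/include/xmerl.hrl").

%% Parses an RSS document given as a string and returns the text of every
%% <title> tag that sits directly inside an <item> tag.
episode_titles(Xml) ->
    {Doc, _Rest} = xmerl_scan:string(Xml),
    Nodes = xmerl_xpath:string("//item/title/text()", Doc),
    [V || #xmlText{value = V} <- Nodes].
```

The channel title is not picked up here because its <title> tag is a child of <channel>, not of <item>.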

The code

The Erlang code that finds the RSS files listed above and parses them is as follows:

  -module(xml2).
  -export([main/1]).
  -include_lib("xmerl/include/xmerl.hrl").

  % find all RSS files underneath D and parse each of them
  parse_all(D) ->
      % the dot is escaped so that only names ending in ".rss" match
      FL = filelib:fold_files(D, ".+\\.rss$", true, fun(F, L) -> [F|L] end, []),
      [ parse(F) || F <- FL ].

  % parse a single RSS file
  parse(FName) ->
      {R,_} = xmerl_scan:file(FName),
      % extract episode titles, publication dates and MP3 URLs
      L = lists:reverse(extract(R, [])),
      % print channel title and data for first two episodes
      % (lists:sublist/2 does not fail if L has fewer than three elements)
      io:format("~n>> ~p~n", [lists:sublist(L, 3)]),
      L.

  % handle 'xmlElement' tags
  extract(R, L) when is_record(R, xmlElement) ->
      case R#xmlElement.name of
          enclosure ->
              % only consider <enclosure> tags directly below an <item>
              case R#xmlElement.parents of
                  [{item,_}|_] ->
                      % take the 'url' attribute, if there is one
                      Attrs = R#xmlElement.attributes,
                      case lists:keyfind(url, #xmlAttribute.name, Attrs) of
                          #xmlAttribute{value=U} -> [ {url, U} | L ];
                          false -> L
                      end;
                  _ -> L
              end;
          item ->
              % collect the data of a single episode in a list of its own
              ItemData = lists:foldl(fun extract/2, [], R#xmlElement.content),
              [ ItemData | L ];
          _ -> % for any other XML elements (e.g. channel), simply iterate over children
              lists:foldl(fun extract/2, L, R#xmlElement.content)
      end;

  % the position of <channel> depends on whitespace, so it is not matched
  extract(#xmlText{parents=[{title,_},{channel,_},_], value=V}, L) ->
      [{channel, V}|L]; % extract channel/audiocast title

  extract(#xmlText{parents=[{title,_},{item,_},_,_], value=V}, L) ->
      [{title, V}|L]; % extract episode title

  extract(#xmlText{parents=[{link,_},{item,_},_,_], value=V}, L) ->
      [{link, V}|L]; % extract episode link

  extract(#xmlText{parents=[{pubDate,_},{item,_},_,_], value=V}, L) ->
      [{pubDate, V}|L]; % extract episode publication date ('pubDate' tag)

  extract(#xmlText{parents=[{'dc:date',_},{item,_},_,_], value=V}, L) ->
      [{pubDate, V}|L]; % extract episode publication date ('dc:date' tag)

  extract(_, L) -> L.  % ignore any other text data and node types (e.g. comments)

  % 'main' function (invoked from shell, receives command line arguments);
  % with "erl -s xml2 main <dir>" the arguments arrive as a list of atoms
  main(A) ->
      D = atom_to_list(hd(A)),
      parse_all(D).

Conclusion

Erlang’s filelib:fold_files() function is cool and a good example of how easy things should be.
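
To show fold_files() on its own, here is a small sketch that just collects matching file names. The module name fold_sketch and the function rss_files/1 are made up for illustration. The regular expression is matched against the file name only, and the dot is escaped so that it stands for a literal ".".

```erlang
-module(fold_sketch).
-export([rss_files/1]).

%% Returns the sorted paths of all files below Dir (searched recursively)
%% whose names end in ".rss".
rss_files(Dir) ->
    Collect = fun(File, Acc) -> [File | Acc] end,
    lists:sort(filelib:fold_files(Dir, "\\.rss$", true, Collect, [])).
```

The accumulator can be anything, so the same call could just as well count files or sum their sizes by swapping the fun and the initial value.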

On the other hand:

  • it took quite a bit of time and effort to write the code above (perhaps due to my lack of experience in functional programming) and it was not fun :-(
  • beauty is in the eye of the beholder, as they say, and the promise of being able to write clean and attractive code is a major reason to pick up a new programming language. Again, maybe it’s just me, but I did not find the resulting Erlang code particularly attractive.
  • the XML parser chokes reproducibly on XML files with non-ASCII character sets (try the code e.g. with the following RSS file, which contains German characters)
  • the XPath implementation appears to be incomplete: for example, I could not use the | operator in an XPath expression to select several paths.
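
Where the | operator is not available, one possible workaround is to run one XPath query per path and merge the results in Erlang. The following is only a sketch under that assumption: the module name xpath_union_sketch and the function episode_urls/1 are made up for illustration, and the two paths are the <link> and <enclosure> variants described earlier.

```erlang
-module(xpath_union_sketch).
-export([episode_urls/1]).
-include_lib("xmerl/include/xmerl.hrl").

%% Runs two separate XPath queries on an already parsed document and merges
%% the results, instead of relying on a single "a | b" expression.
%% lists:usort/1 sorts the URLs and drops duplicates, so document order
%% is not preserved.
episode_urls(Doc) ->
    Links = [V || #xmlText{value = V}
                      <- xmerl_xpath:string("//item/link/text()", Doc)],
    Enclosures = [V || #xmlAttribute{value = V}
                           <- xmerl_xpath:string("//item/enclosure/@url", Doc)],
    lists:usort(Links ++ Enclosures).
```

Doc is the first element of the tuple returned by xmerl_scan:file/1 or xmerl_scan:string/1.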

Anyway, that’s just my $0.02 on XML parsing with Erlang. I am not an expert by any stretch of the imagination, so feel free to point out anything I may have missed.