Processing XML in Erlang

Introduction

This is my second stab at Erlang (see the ring benchmark article for my first take on it). This time around I wanted to get a sense of how well Erlang and its libraries support more mundane tasks such as XML parsing.

I am not a big fan of XML, but it is the lingua franca of the web, and any language that aspires to become “mainstream” has to support it efficiently.

To get a feeling for how well Erlang does in this respect, I am going to repeat my recent Groovy XML processing experiments, this time in Erlang.

Example

I’ll be doing some basic processing of RSS files. For a full example of what these look like, see e.g. the MacBreak Weekly RSS file. Here’s an excerpt (abridged for the sake of clarity):

  <?xml version="1.0" encoding="utf-8"?>
  <rss version="2.0">
    <channel>
      <title>MacBreak Weekly</title>
      <item>
        <title>MacBreak Weekly 53: Bill In A Box</title>
        <link>http://www.podtrac.com/twit.cachefly.net/MBW-053.mp3</link>
        <pubDate>Wed, 15 Aug 2007 12:58:11 -0700</pubDate>
        <enclosure url="http://www.podtrac.com/twit.cachefly.net/MBW-053.mp3" />
      </item>
    </channel>
  </rss>

Each of the potentially many <item> tags keeps the data pertaining to a single audiocast episode. What we want to extract is:

  • the audiocast episode title (<title> tag)
  • the publication date (<pubDate> tag)
  • the URL pointing to the MP3 file (<link> tag)

Depending on the publisher, the format of the RSS file may vary slightly. The publication date, for example, is sometimes buried in a <dc:date> tag.
Likewise, the MP3 URL is sometimes not contained in a <link> tag but in the url attribute of the <enclosure> tag.
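
The fallback rules just described can be sketched in a few lines. This is only an illustration: the module name item_fields, the function names and the {Tag, Value} pair representation are assumptions made for this sketch, and so is the choice to prefer <link> over the <enclosure> url.

```erlang
-module(item_fields).
-export([mp3_url/1, pub_date/1]).

%% Item is a hypothetical list of {Tag, Value} pairs collected for one <item>.

%% Prefer the <link> text, fall back to the url attribute of <enclosure>.
mp3_url(Item) ->
    first_of([link, url], Item).

%% Prefer <pubDate>, fall back to <dc:date>.
pub_date(Item) ->
    first_of([pubDate, 'dc:date'], Item).

%% Return the value of the first key in Keys that is present in Item,
%% or the atom 'undefined' when none of them is.
first_of([], _Item) ->
    undefined;
first_of([K | Ks], Item) ->
    case lists:keyfind(K, 1, Item) of
        {K, V} -> V;
        false  -> first_of(Ks, Item)
    end.
```

Keeping the candidate tags in a list means another publisher quirk only needs one more entry rather than one more function clause.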

The RSS files I will be using for testing are as follows:

  bbox33:audiocasts $ pwd
  /Users/mhr/dl/audio/audiocasts
  bbox33:audiocasts $ find . -type f -name \*.rss
  ./metadata/Cc_zwei.rss
  ./metadata/Elrep.rss
  ./metadata/Security_now_.rss
  ./metadata/Technometria.rss
  ./metadata/This_week_in_tech.rss
  ./metadata/Windows_weekly.rss

Opening remarks

Unfortunately, the documentation of Erlang’s XML parsing library is pretty scant and, apart from the xmerl User’s Guide, I could not find any tutorials on xmerl on the web.

Again, a language aspiring to widespread adoption should have more material covering these kinds of basics.

There were a few choices as to how to go about the parsing business, among them walking the record tree returned by xmerl_scan by hand and querying it with xmerl_xpath.

I chose the former because it was better documented. Only after finishing the first cut of the parsing code did I find xmerl_xpath usage examples on a mailing list and play with them.
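
For comparison, here is a minimal sketch of what the xmerl_xpath route might look like for the episode titles. The module name xpath_titles and the function item_titles/1 are made up for this illustration; xmerl_scan:string/1 and xmerl_xpath:string/2 are the xmerl calls it relies on.

```erlang
-module(xpath_titles).
-export([item_titles/1]).
-include_lib("xmerl/include/xmerl.hrl").

%% Xml is a string holding a complete RSS document.
item_titles(Xml) ->
    {Doc, _Rest} = xmerl_scan:string(Xml),
    %% select the text nodes below every <item>/<title>
    Nodes = xmerl_xpath:string("//item/title/text()", Doc),
    %% keep only the character data of each text node
    [V || #xmlText{value = V} <- Nodes].
```

The channel title is not matched because its <title> element sits directly below <channel> rather than below an <item>.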

The code

The Erlang code that finds the RSS files listed above and parses them is as follows:

  -module(xml2).
  -export([main/1]).
  -include_lib("xmerl/include/xmerl.hrl").

  parseAll(D) ->
      % find all RSS files underneath D (note the escaped dot in the regex)
      FL = filelib:fold_files(D, ".+\\.rss$", true, fun(F, L) -> [F|L] end, []),
      [ parse(F) || F <- FL ].

  parse(FName) ->
      % parses a single RSS file
      {R,_} = xmerl_scan:file(FName),
      % extract episode titles, publication dates and MP3 URLs
      L = lists:reverse(extract(R, [])),
      % print channel title and data for first two episodes
      % (sublist/2 does not fail on feeds with fewer than two episodes)
      io:format("~n>> ~p~n", [lists:sublist(L, 3)]),
      L.

  % handle 'xmlElement' tags
  extract(R, L) when is_record(R, xmlElement) ->
      case R#xmlElement.name of
          enclosure ->
              % only <enclosure> tags directly below an <item> are relevant
              Attrs = R#xmlElement.attributes,
              case {R#xmlElement.parents,
                    lists:keyfind(url, #xmlAttribute.name, Attrs)} of
                  {[{item,_}|_], #xmlAttribute{value=U}} -> [ {url, U} | L ];
                  _ -> L  % not inside an item, or no 'url' attribute
              end;
          item ->
              ItemData = lists:foldl(fun extract/2, [], R#xmlElement.content),
              [ ItemData | L ];
          _ -> % for any other XML elements, simply iterate over children
              lists:foldl(fun extract/2, L, R#xmlElement.content)
      end;

  % the position of <channel> inside <rss> depends on whitespace, so ignore it
  extract(#xmlText{parents=[{title,_},{channel,_},_], value=V}, L) ->
      [{channel, V}|L]; % extract channel/audiocast title

  extract(#xmlText{parents=[{title,_},{item,_},_,_], value=V}, L) ->
      [{title, V}|L]; % extract episode title

  extract(#xmlText{parents=[{link,_},{item,_},_,_], value=V}, L) ->
      [{link, V}|L]; % extract episode link

  extract(#xmlText{parents=[{pubDate,_},{item,_},_,_], value=V}, L) ->
      [{pubDate, V}|L]; % extract episode publication date ('pubDate' tag)

  extract(#xmlText{parents=[{'dc:date',_},{item,_},_,_], value=V}, L) ->
      [{pubDate, V}|L]; % extract episode publication date ('dc:date' tag)

  extract(_, L) -> L.  % ignore any other text data and nodes (e.g. comments)

  % 'main' function (invoked from shell, receives command line arguments)
  % e.g. erl -noshell -s xml2 main DIR -s init stop  (arguments arrive as atoms)
  main(A) ->
      D = atom_to_list(hd(A)),
      parseAll(D).
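
The text-node clauses above match on the parents field, which lists the enclosing elements from the nearest one outwards as {Name, Position} pairs. A small sketch makes that visible by returning the parents of every text node in a document; the module name parents_demo and its functions are made up for this illustration.

```erlang
-module(parents_demo).
-export([text_parents/1]).
-include_lib("xmerl/include/xmerl.hrl").

%% Return {Value, Parents} for every text node in the XML string,
%% in document order.
text_parents(Xml) ->
    {Doc, _Rest} = xmerl_scan:string(Xml),
    lists:reverse(walk(Doc, [])).

%% elements: descend into the children
walk(#xmlElement{content = Content}, Acc) ->
    lists:foldl(fun walk/2, Acc, Content);
%% text nodes: record the character data together with the parents list
walk(#xmlText{value = V, parents = P}, Acc) ->
    [{V, P} | Acc];
%% anything else (comments, processing instructions) is skipped
walk(_Other, Acc) ->
    Acc.
```

A channel title sits three levels deep (title, channel, rss) while an episode title sits four levels deep (title, item, channel, rss), which is why the patterns in the listing have three and four elements respectively.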

Conclusion

Erlang’s filelib:fold_files() function is cool and a good example of how easy things should be.

On the other hand:

  • it took quite a bit of time and effort to write the code above (perhaps due to my lack of experience in functional programming) and it was not fun :-(
  • beauty is in the eye of the beholder, as they say, and the promise of being able to write clean and attractive code is a major reason to pick up a new programming language. Again, maybe it’s just me, but I did not find the resulting Erlang code to be particularly attractive.
  • the XML parser chokes reproducibly on XML files containing non-ASCII characters (I hit this with an RSS file containing German characters)
  • the XPath implementation appears to be incomplete: for example, I could not use the | operator in an XPath expression to select several paths.
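
Where the | operator is not available, one possible workaround is to run each path separately and concatenate the results. The sketch below does that for the two places an MP3 URL can live; the module name xpath_union and the function urls/1 are made up for this illustration, and the results come back grouped by path rather than in document order.

```erlang
-module(xpath_union).
-export([urls/1]).
-include_lib("xmerl/include/xmerl.hrl").

%% Xml is a string holding a complete RSS document.
urls(Xml) ->
    {Doc, _Rest} = xmerl_scan:string(Xml),
    %% one query per alternative instead of a single "a | b" expression
    Paths = ["//item/link/text()", "//item/enclosure/@url"],
    Nodes = lists:append([xmerl_xpath:string(P, Doc) || P <- Paths]),
    [value(N) || N <- Nodes].

%% text nodes and attribute nodes both carry their data in a 'value' field
value(#xmlText{value = V})      -> V;
value(#xmlAttribute{value = V}) -> V.
```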

Anyway, that’s just my $0.02 on XML parsing with Erlang. I am not an expert by any stretch of the imagination, so feel free to point out anything I may have missed.

8 thoughts on “Processing XML in Erlang”

  1. I’m trying to port Mark Pilgrim’s FeedParser lib from Python to Erlang and I’ve felt the pain of XML parsing in Erlang, too. You might want to try the erlsom XML library. I found its SAX interface to be what I was used to seeing coming from a Java/Python background.

    Also, you might want to try running the string you’re about to parse through xmerl_ucs:to_utf8/1 before parsing it. That cleaned up a bunch of problems I was having dealing with non-ASCII chars.

    Now, if there was a way to parse XML files without having to read them all in memory at once, I’d be golden.

  2. Hi there,

    For XML parsing, you can try erlsom, which I found to be faster, to produce smaller output, and to have a better SAX implementation.

    Best regards,

    Ahmed

  3. Pingback: Simple Example of XML Parsing With Erlang and Erlsom.

  4. Your parsing code could be much less verbose. Here’s my version:

    http://pastebin.com/nnKbjdjh

    Don’t give up on Erlang. Once you’ve overcome the initial pain of the syntax and different mode of thinking you’ll realize how incredibly powerful it is.

  5. libexpat is used for xml parsing in ejabberd. Our custom servers handle tens of thousands of active connections each. You can find libexpat (expat_erl.so) in any ejabberd source distribution. You will need to load it using

    erl_ddll:load_driver(".", "expat_erl").

    in your initialization code. The ejabberd source distribution also includes the files xml_stream.erl and xml.erl, which provide an API to the library. For instance, you can use xml_stream:parse_element(Data) to get an XML term, which you can then process with the functions in xml.erl. No XPath, but a nice native Erlang API instead.
