Introduction
After a long break I picked up the Erlang book again and my appetite for writing some erlang code was soon kindled.
A small Python component I produced at work seemed like a good candidate for my (sequential) erlang exercises. It is a fairly simple component that removes user/password data embedded in URLs.
Just so you know where I am coming from:
- my main/favourite programming language is Python
- my exercises are mainly about sequential, non-distributed and non-telecoms-related problems whereas erlang’s main strength and appeal lies in the area of parallel/distributed telecoms/networking systems
- I have played with erlang a little bit before (ring benchmark, XML parsing) and liked it in general although IMHO it lacks severely when it comes to the availability and quality of standard library components.
Now that my particular set of preconceptions is clear and in the open, let’s look at the stuff below
File processing with erlang’s regexp module
The initial implementation of the URL filter in erlang used its regexp library.
1 -module(regex). 2 -export([main/1]). 3 4 isalphanum(C) when C > 47, C < 58; C > 64, C < 91; C > 96, C < 123 -> true; 5 isalphanum(_) -> false. 6 7 %% Generate a temporary file name of length N 8 genname(0, L) -> L; 9 genname(N, L) -> 10 R = random:uniform(123), 11 case isalphanum(R) of 12 true -> genname(N-1, [R|L]); 13 false -> genname(N, L) 14 end. 15 16 %% Returns a randomly generated temporary file path where the basename is 17 %% of length N 18 mktemppath(Prefix, N) -> Prefix ++ "/" ++ genname(N, []). 19
Please note how I had to implement functionality absent from the standard library above.
20 %% Removes passwords embedded in URLs from a log file. 21 scrub_file(Tmpdir, F) -> 22 %% make a temporary directory if it does not exist yet. 23 case file:make_dir(Tmpdir) of 24 ok -> ok; 25 {error,eexist} -> ok; 26 _ -> exit({error, failed_to_make_tmpdir}) 27 end, 28 29 %% Move the original file out of the way. 30 T = mktemppath(Tmpdir, 16), 31 case file:rename(F, T) of 32 ok -> ok; 33 _ -> exit({error, failed_to_move_file}) 34 end, 35 36 %% Now open it for reading. 37 {_, In} = file:open([T], read), 38 %% Open the original path for writing. 39 {_, Out} = file:open([F], write), 40 41 %% Call the function that will scrub the lines. 42 scrub_lines(In, Out), 43 44 %% Close the file handles and return the path to the original file. 45 file:close(Out), 46 file:close(In), 47 T. 48
The code that scrubs the URLs is below, the scrub_lines() function is tail recursive.
49 %% This is where the log file is actually read linewise and where 50 %% the scrubbing function is invoked for lines that contain URLs. 51 scrub_lines(In, Out) -> 52 L = io:get_line(In, ''), 53 case L of 54 eof -> ok; 55 _ -> 56 %% Does the line contain URLs? 57 case string:str(L, "://") of 58 0 -> io:format(Out, "~s", [L]); 59 _ -> 60 case regexp:gsub(L, "://[^:]+:[^@]+@", "://") of 61 {ok, S, _} -> io:format(Out, "~s", [S]); 62 {R, S, _} -> io:format("Failed: {~p,~p}", [R,S]) 63 end 64 end, 65 %% Continue with next line. 66 scrub_lines(In, Out) 67 end. 68 69 %% Main function. 70 main([A]) -> 71 {A1,A2,A3} = now(), 72 random:seed(A1, A2, A3), 73 74 %% A single argument (the name of the file to be scrubbed) is expected. 75 F = atom_to_list(A), 76 T = scrub_file("tmp", F), 77 78 %% The scrubbed file content will be written to a new file that's 79 %% in the place of the original file. Where was the latter moved to? 80 io:format("~s~n", [T]), 81 82 init:stop().
The cursory benchmarks performed (on log files of varying size) using the python and the erlang code confirmed other people’s experiences with erlang’s regex performance (but see also this interesting “rebuttal”).
| Log file size | Python times | Erlang times |
|---|---|---|
| 1 MB | 0m0.230s | 0m1.896s |
| 10 MB | 0m1.510s | 0m8.766s |
| 100 MB | 0m14.793s | 1m17.662s |
| 1 GB | 2m55.012s | 13m54.588s |
The do-it-yourself construction
Curious to learn whether the performance can be improved by abstaining from regular expressions I came up with an alternative implementation that does not use regexp.
As you can see below the do-it-yourself construction is indeed performing slightly better at the expense of being very specialized and requiring 60% more code.
| Log file size | Python times | Erlang regexp | Erlang do-it-yourself |
|---|---|---|---|
| 1 MB | 0m0.230s | 0m1.896s | 0m1.969s |
| 10 MB | 0m1.510s | 0m8.766s | 0m8.459s |
| 100 MB | 0m14.793s | 1m17.662s | 1m12.448s |
| 1 GB | 2m55.012s | 13m54.588s | 13m3.360s |
In conclusion
Every couple of months or so I develop a euphoria towards erlang which is consistently dampened by using the language to tackle problems for which the language admittedly was not designed in first place.
I guess most people start using a language for simple programming exercises first as opposed to building something like a Jabber/XMPP instant messaging server straightaway.
I hate to repeat myself but improving the standard library (by adding common functionality and making sure it performs decently) would do a lot to attract fresh talent to the erlang community and I hear that a certain rate of influx of “fresh blood” is a necessary prerequisite for success.
Ah, and no, you were not supposed to grok the sentence above unless you read it three times