First experiments with golang

I finally found some time to look at the Go programming language (aka golang). In order to get a feeling for it I picked a random Google code jam problem and programmed it in Go.

The code used in the experiments that follow is pretty simple.

first impressions

My first impressions were mostly positive: Go has

  • decent documentation covering the language proper as well as the standard library
  • a fast compiler resulting in short edit-compile-test cycles
  • a nice standard library and a wealth of packages (provided by the community)
  • a lively and friendly mailing list and irc channel

The language has quite a “direct” feel to it: I could get to work and be productive almost immediately.
This is in stark contrast to other languages I tried to learn recently, e.g. Scala (back in January): it required a lot of reading, and even a couple of days in I was not really productive in Scala.

Go is quite the opposite: the barrier to entry is low, and the language is clean and simple. The combined declaration and initialisation operator (':=') alone is a godsend.
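For readers who have not seen it yet, here is a tiny example of the operator in action:

```go
package main

import "fmt"

func main() {
	// ':=' declares and initialises in one step; the type is inferred
	n := 42
	msg := "hello"
	fmt.Println(n, msg)
}
```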

Coming from a Python background, the main thing I missed was a REPL. Who knows, maybe there is one out there and I just have not found it yet?

playing with goroutines

One of the most attractive golang features is its support for concurrent programming via goroutines and I wanted to play with these.

The programming problem chosen came with an input for 50 calculations. I used it to create inputs with 50, 100 and 200 *thousand* calculations. All calculations are independent of each other i.e. ideally parallelisable.

Being a fairly young language, Go does not parallelise code across cores by default. If CPU parallelism is desired, one must tell the run-time how many goroutines may execute simultaneously.

The code I wrote starts each calculation in a separate goroutine and allows the user to specify the number of CPUs/cores that should be used to execute the program.
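The pattern looks roughly like the sketch below. The `calculate` function and the inputs are made up for illustration (the real program reads its cases from a file), and the core count, hard-coded here, came from the command line in the actual code:

```go
package main

import (
	"fmt"
	"runtime"
)

// calculate is a hypothetical stand-in for the real per-case computation.
func calculate(x int) int { return x * x }

func main() {
	numCPU := 4                // came from a command-line flag in the real program
	runtime.GOMAXPROCS(numCPU) // allow up to numCPU goroutines to run in parallel

	inputs := []int{1, 2, 3, 4, 5}
	results := make(chan int, len(inputs)) // buffered result channel

	// start each calculation in its own goroutine
	for _, x := range inputs {
		go func(x int) { results <- calculate(x) }(x)
	}

	// collect the results (order does not matter for a sum)
	sum := 0
	for range inputs {
		sum += <-results
	}
	fmt.Println(sum)
}
```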

Using a bash script I ran the resulting program varying both the number of calculations and the number of CPU cores.

These experiments were conducted on a 32-core server (Quad-Core AMD Opteron Processor 8356) with 64GB of RAM running Ubuntu 11.04 server. Also, I ran each configuration for three consecutive times and used the average duration in the graph below.

Apparently the golang run-time was not able to utilise more than 8 cores when running this particular program.

50, 100 and 200 thousand calculations running on 1 through 16 CPU cores

As can be seen from the graph (full size) above, executing the program on more than 8 cores did not decrease its running time further.

The 200K-calculations input file is a bit over half a gigabyte, so I suspected that the program is dominated by I/O and that the goroutines cannot execute because the result channel is full.

That led me to experiment with different result channel sizes. The resulting running times (e.g. for 200K calculations) can be seen in the graph (full size) below.

The 200K calculations running on 1 through 16 CPU cores and with varying result channel sizes

However, varying the result channel sizes did not seem to have a big effect.

Anyway, I am pretty happy with the code at this point but suggestions are always welcome, particularly those aiming at improving the degree of parallelism πŸ™‚


I am amazed how far I got by investing approx. 10 hours in learning Go and programming in it.

Having used Python almost exclusively for the last 5 years, I am pretty spoiled when it comes to code conciseness and productivity.
Go is not too far behind though, and programming in it was fun and enjoyable.

I will definitely continue to explore it. Maybe you should give it a whirl as well πŸ™‚


Erlang vs. Stackless python: a first benchmark

Change of Heart

Mon Aug 17 17:51:58 CEST 2009


I have had more time to study Erlang since writing the article below and
have come to realize that the conclusions I drew are skewed.

Erlang

  - is a language and a high-quality runtime system optimized for
    concurrent and distributed execution. The examples below do not
    really exercise any of these strengths.
  - has proven itself in quite a number of high-profile projects and
    deployments (ejabberd, CouchDB and RabbitMQ, just to mention a few).
  - has a strong, vibrant and competent community that's driving the
    Erlang system and improving it continuously.

For all these reasons Stackless Python -- although an interesting
piece of technology in its own right -- is certainly no match for Erlang.

Please see the linked article for what appears to be a more even-handed
comparison.


I obtained a copy of Joe Armstrong’s Erlang book recently. After reading through the first half I have found it to be a very commendable work, making a strong case for message passing concurrency — a different approach to building large scale and highly concurrent systems.

After finishing chapter 8, I came across the following exercise:

Write a ring benchmark. Create N processes in a ring. Send a message round the ring M times so that a total of N*M messages get sent. Time how long this takes [..]

Write a similar program in some other programming language you are familiar with. Compare the results. Write a blog, and publish the results on the internet!

I liked it, it looked like a great opportunity to get my feet wet with Erlang and message passing concurrency πŸ™‚

So, here we are, that’s precisely what this article is about.

The “other programming language” is of course Python, in this particular case Stackless Python featuring tasklets and channels to counter Erlang’s concurrency machinery.

The source code

Both the Erlang and the Stackless Python code set up a ring of N processes so that each process knows its successor. The resulting “ring” has the shape of an open necklace:

  • there is a first and a last bead in the chain
  • each message
    • is sent to the first bead
    • is relayed by beads number [1 .. N-1] to their immediate neighbour
    • stops at bead number N
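Since the exercise explicitly invites implementations in other languages, the same open-necklace structure can also be sketched with goroutines and channels in Go (diagnostic output omitted, and `n` and `m` hard-coded for illustration):

```go
package main

import "fmt"

func main() {
	n, m := 4, 3
	first := make(chan int)
	done := make(chan struct{})

	// beads 1 .. n-1 relay every message to their immediate neighbour
	cin := first
	for s := 1; s < n; s++ {
		cout := make(chan int)
		go func(cin, cout chan int) {
			for {
				r := <-cin
				cout <- r
				if r == 0 { // the zero message terminates the ring
					return
				}
			}
		}(cin, cout)
		cin = cout
	}

	// bead n terminates the messages
	go func(cin chan int) {
		for {
			if r := <-cin; r == 0 {
				close(done)
				return
			}
		}
	}(cin)

	// send messages m-1 .. 0 to the first bead
	for r := m - 1; r >= 0; r-- {
		first <- r
	}
	<-done
	fmt.Println("ring terminated")
}
```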

The Erlang program

This is my very first Erlang program, so please bear with me. If there’s anything that could be done better, feel free to comment on it (in a polite way).

  1 % sets up a ring of N processes, a message is passed M times around the ring.
  2 -module(oringb).
  3 -export([run_benchmark/2, loop/2, mloop/1, main/1]).
  4 % builds a ring of N processes and sends a message around the ring M times.
  5 run_benchmark(N, M) ->
  6     io:format(">> Erlang R11B-5 here (N=~w, M=~w)!~n~n", [N, M]),
  7     % spawn the other N-1 processes needed for the ring
  8     LastP = lists:foldl(fun(S, Pid) -> spawn(?MODULE, loop, [N-S, Pid]) end,
  9                         self(), lists:seq(1, N-1)),
 10     spawn(fun() -> [ LastP ! R || R <- lists:reverse(lists:seq(0, M-1)) ] end),
 11     mloop(N).

The function above sets up the ring. The lists:foldl construct is equivalent to Python’s built-in reduce() function.

The process spawned on line 10 sends the M messages around the ring. The messages are just integers taken from the [M-1 .. 0] sequence.

The main process acts as process number N and invokes the function mloop() thus entering the appropriate receive loop on line 11.

 12 % receive loop executed by the first N-1 processes in the ring
 13 %   - 'S' is the sequence number allowing us to keep track of a
 14 %     process' position in the ring
 15 %   - 'NextP' is the next process in the ring
 16 %   - 'R' is the round/message number, zero indicates that the processes
 17 %      should terminate;
 18 loop(S, NextP) ->
 19     receive
 20         % the message number is above zero => forward the message to the next
 21         % process in the ring
 22         R when R > 0 -> NextP ! R,
 23             io:format(": Proc: ~8w, Seq#: ~w, Msg#: ~w ..~n", [self(), S, R]),
 24             loop(S, NextP);
 25         % the message number is zero => forward message and terminate
 26         R when R =:= 0 -> NextP ! R,
 27             io:format("* Proc: ~8w, Seq#: ~w, Msg#: terminate!~n", [self(), S]);
 28         % error: the message number is below zero => raise exception
 29         R when R < 0 ->
 30             erlang:error({"internal error", "invalid message number"})
 31     end.

The loop() function is executed by the first N-1 processes in the ring. These know their position in the ring (S) as well as their immediate neighbour (NextP) to which they forward all messages received.

The messages (R) are just integers. Values above zero keep the ring going, a zero value (the last message) results in the termination of the ring.

 32 % receive loop executed by the last (Nth) process;
 33 % it won't forward any messages
 34 mloop(S) ->
 35     receive
 36         R when R > 0 ->
 37             io:format("> Proc: ~8w, Seq#: ~w, Msg#: ~w ..~n", [self(), S, R]),
 38             mloop(S);
 39         0 ->
 40             io:format("@ Proc: ~8w, Seq#: ~w, ring terminated.~n", [self(), S])
 41     end.

The mloop() function acts as the receive loop of the process number N. It “terminates” the messages that go around the ring. Its main purpose is to produce diagnostic output.

 42 % 'main' function allowing the invocation from the shell as well as the
 43 % passing of command line arguments
 44 main(A) ->
 45     Args = [ list_to_integer(Litem) ||
 46              Litem <- [ atom_to_list(Atom) || Atom <- A ]],
 47     [N, M] = Args,
 48     run_benchmark(N, M).

What follows is an example invocation of the oringb module (a ring of 4 processes with 3 messages going around):

  1 bbox33:ring $ erl -noshell -s oringb main 4 3 -s init stop
  2 >> Erlang R11B-5 here (N=4, M=3)!
  4 : Proc: <0.29.0>, Seq#: 1, Msg#: 2 ..
  5 : Proc: <0.28.0>, Seq#: 2, Msg#: 2 ..
  6 : Proc: <0.27.0>, Seq#: 3, Msg#: 2 ..
  7 : Proc: <0.29.0>, Seq#: 1, Msg#: 1 ..
  8 : Proc: <0.28.0>, Seq#: 2, Msg#: 1 ..
  9 > Proc:  <0.1.0>, Seq#: 4, Msg#: 2 ..
 10 : Proc: <0.27.0>, Seq#: 3, Msg#: 1 ..
 11 * Proc: <0.29.0>, Seq#: 1, Msg#: terminate!
 12 * Proc: <0.28.0>, Seq#: 2, Msg#: terminate!
 13 > Proc:  <0.1.0>, Seq#: 4, Msg#: 1 ..
 14 * Proc: <0.27.0>, Seq#: 3, Msg#: terminate!
 15 @ Proc:  <0.1.0>, Seq#: 4, ring terminated.

Lines 4-6,9 are generated by the first message, lines 7,8,10 and 13 show the second message going around the ring — lines 11,12,14 and 15 are the result of the zero-valued (terminate ring!) message.

The Stackless Python program

I have made an attempt to keep the Python code as similar as possible to its Erlang counterpart. The run_benchmark() function below builds the ring of N tasklets (lines 9-18) and sends the M messages around it (lines 19-21).

One minor difference to note: the main process/tasklet here is not the last process in the ring but sends the messages (after spawning the last process on lines 16-18).

  1 #!/Library/Frameworks/Python.framework/Versions/2.5/bin/python
  2 # encoding: utf-8
  3 import sys
  4 import stackless as SL
  6 def run_benchmark(n, m):
  7     print(">> Python 2.5.1, stackless 3.1b3 here (N=%d, M=%d)!\n" % (n, m))
  8     firstP = cin = SL.channel()
  9     for s in xrange(1, n):
 10         seqn = s
 11         cout = SL.channel()
 12         # print("*> s = %d" % (seqn, ))
 13         t = SL.tasklet(loop)(seqn, cin, cout)
 14         cin = cout
 15     else:
 16         seqn = s+1
 17         # print("$> s = %d" % (seqn, ))
 18         t = SL.tasklet(mloop)(seqn, cin)
 19     for r in xrange(m-1, -1, -1):
 20         # print("+ sending Msg#  %d" % r)
 21         firstP.send(r)
 22     SL.schedule()

The loop() function is executed by the first N-1 processes in the ring as in the Erlang code. Apart from relaying the messages received, it mainly produces diagnostic output.

Each tasklet knows its position in the ring (S) and is given an input channel from which to read its messages (cin) as well as an output channel to which to write all the messages received (cout).

 23 def loop(s, cin, cout):
 24     while True:
 25         r = cin.receive()
 26         cout.send(r)
 27         if r > 0:
 28             print(": Proc: <%s>, Seq#: %s, Msg#: %s .." % (pid(), s, r))
 29         else:
 30             print("* Proc: <%s>, Seq#: %s, Msg#: terminate!" % (pid(), s))
 31             break

The last tasklet runs the mloop() function which mainly produces diagnostic output.

 32 def mloop(s, cin):
 33     while True:
 34         r = cin.receive()
 35         if r > 0:
 36             print("> Proc: <%s>, Seq#: %s, Msg#: %s .." % (pid(), s, r))
 37         else:
 38             print("@ Proc: <%s>, Seq#: %s, ring terminated." % (pid(), s))
 39             break
 41 def pid(): return repr(SL.getcurrent()).split()[-1][2:-1]
 43 if __name__ == '__main__':
 44     run_benchmark(int(sys.argv[1]), int(sys.argv[2]))

The following is an example invocation of the Python code (a ring of 4 processes with 3 messages going around, as above):

  1 bbox33:ring $ ./ 4 3
  2 >> Python 2.5.1, stackless 3.1b3 here (N=4, M=3)!
  4 > Proc: <6b870>, Seq#: 4, Msg#: 2 ..
  5 : Proc: <6b730>, Seq#: 1, Msg#: 2 ..
  6 : Proc: <6b770>, Seq#: 2, Msg#: 2 ..
  7 : Proc: <6b7f0>, Seq#: 3, Msg#: 2 ..
  8 > Proc: <6b870>, Seq#: 4, Msg#: 1 ..
  9 : Proc: <6b730>, Seq#: 1, Msg#: 1 ..
 10 : Proc: <6b770>, Seq#: 2, Msg#: 1 ..
 11 : Proc: <6b7f0>, Seq#: 3, Msg#: 1 ..
 12 @ Proc: <6b870>, Seq#: 4, ring terminated.
 13 * Proc: <6b730>, Seq#: 1, Msg#: terminate!
 14 * Proc: <6b770>, Seq#: 2, Msg#: terminate!
 15 * Proc: <6b7f0>, Seq#: 3, Msg#: terminate!

Lines 4-7 show the first message going around, lines 8-11 stem from the second message, and lines 12-15 make the last (terminating) message visible. (I guess the funny “order of execution” above arises from the fact that the Stackless Python scheduler by default prefers the message-receiving tasklets.)

The first benchmarks

Now for the actual benchmarks, all run on a machine with the following specs:

  Model Name:	MacBook Pro 15"
  Model Identifier:	MacBookPro2,2
  Processor Name:	Intel Core 2 Duo
  Processor Speed:	2.33 GHz
  Number Of Processors:	1
  Total Number Of Cores:	2
  L2 Cache (per processor):	4 MB
  Memory:	2 GB
  Bus Speed:	667 MHz

I ran two kinds of benchmarks:

  1. fixed number of processes (ring size N = 100) and varying numbers of messages (10000 <= M <= 90000 stepped up in 10K increments)
  2. fixed number of messages (M = 100) and varying numbers of processes (10000 <= N <= 90000 stepped up in 10K increments)

In both cases the resulting total number of messages sent was between 1 and 9 million.

I expected that Erlang would clearly outperform Stackless python because this was the former’s “sweet spot” scenario after all.

I was pleasantly surprised by python not only holding up to the challenge but winning it!

As can be seen from the diagrams depicting the results of the two benchmarks, Stackless Python performed much better than Erlang!

100 processes varying message numbers
100 messages varying number of processes

The second round

Eventually I came to believe that Erlang’s I/O library is very slow and that this might be the reason for its sub-optimal performance in the benchmarks above.

I hence modified both sources (Erlang here, Stackless python here) and commented out all of the output.

Following that, I reran the benchmarks with the following results.

100 processes varying message numbers (no output)
100 messages varying number of processes (no output)

Now Erlang performed much more like a system that was built with message passing concurrency in mind, but with Stackless Python still close on its heels.

In conclusion

Even as a Python aficionado I am sometimes amazed by the gems that can be found in the Python treasure trove.

The Python community should stop worrying about the language falling out of fashion and/or being overtaken by RoR/Ruby or some other contender of the day.

Python’s potential is so huge, all it takes is attracting notice to it e.g. to the fact that Python is ready for Our Manycore Future.

Last but not least, I find it quite a shame that Stackless Python seems to be treated like the “unloved stepchild” of the community.

Highly concurrent systems are the future of software, and Stackless facilitates the construction of such systems today.

Isn’t such a capability worthy of being maintained, improved and expanded upon as opposed to building yet another web framework or whatever ..?