Richards is an interesting medium-sized language benchmark (400-500
lines). It simulates the task dispatcher in the kernel of an
operating system. The original version was written in BCPL by Martin Richards at
Cambridge University, England. It has since been translated into many
languages, including C, C++ and Smalltalk.
One translation into the Java programming language is due to Jonathan Gibbons, based on
a C version by Mick Jordan.
This character of this translation is very similar to that of the
original BCPL
program by Martin Richards, and does not take advantage
of the object-oriented features of Java.
I wrote a different version in Java, derived from a
Smalltalk implementation
by L. Peter Deutsch. Like the Smalltalk version, this is much more
object-oriented, and has quite different performance characteristics.
This version, in C++ (by Urs
Hölzle),
Smalltalk and
Self, has been
used as the basis
for comparison in all the papers published by the
Self project.
After creating the latter version, I was intrigued by the performance
difference between the two Java translations, and decided to study the
two versions more closely. In
the end I created seven different versions of this benchmark, each
exhibiting a different stylistic element and performance characteristic.
The different versions span a range in object-orientedness. One
version is the `natural' translation from C code, without particular
attention to source-level optimization. Another version uses features
of the Java language to improve performance. A third is like the C-based
version except that the program state is not as highly encoded.
The other four versions are based on the Smalltalk original, from which
a C++ version was also derived. One is a
`natural' translation from the C++, taking the most obvious mapping
from C++ to Java. The others are closer to the Smalltalk version, in
that all instance variables are accessed via methods; they vary in how
those methods are declared.
What does Richards do?
At the heart of Richards is a task dispatcher (Richards.schedule).
Tasks come in 4 different flavors, each represented by a class
(DeviceTask, HandlerTask, IdleTask, WorkTask, all subclasses of Task).
Each kind of task has an associated work function (fn). At startup
(Richards.run), a particular task mix is created, and then the tasks
are scheduled, each having its work function invoked. The work
functions manipulate work packets and packet queues.
At the end of the benchmark (in Richards.run) the number of queued and
held packets is checked against the correct value to assist in
verifying that the benchmark ran correctly.
1) The Gibbons translation from C (richards_gibbons)
The first version is due to Jonathan Gibbons, and is close to the
original BCPL code in operation. The C version is by Mick Jordan.
It is very close to the original version in BCPL, and is not very
object-oriented. It uses a switch statement (line 127) to distinguish
between one of 8 states, encoded in an integer variable (Task.state).
2) Using final to avoid call overhead (richards_gibbons_final)
All calls in richards_gibbons are virtual; richards_gibbons_final
defines classes and methods to be final where possible, to diminish
the call overhead.
3) Gibbons, but without a `switch' (richards_gibbons_no_switch)
The third version is a simple modification of the first: the 8-valued
integer (Task.state) is replaced with three booleans (packetPending,
taskWaiting and taskHolding), and the switch in Richards.schedule is
replaced by tests of these variables and an extra function,
Richards.runTask. This inner loop now is used in all subsequent
versions.
4) The Deutsch version without accessors (richards_deutsch_no_acc)
The implementation by Deutsch moves some of the task state into
separate objects (DeviceTaskRec, IdleTaskRec, HandlerTaskRec and
WorkerTaskRec, all subclasses of TaskRec). This version follows that
scheme, requiring more, finer classes and objects, and more calls.
The state within these objects is represented by public instance
variables, which are directly accessed where necessary.
5) Deutsch with virtual accessors (richards_deutsch_acc_virtual)
In the remaining versions based on Deutsch, all object state is
accessed via methods, according to good encapsulation practices.
This is a natural translation from the C++ version of Deutsch, which
in turn was based on the Smalltalk version. Note,
however, that where C++ methods default to being statically bound, Java
methods default to being virtually bound. This is also the closest
to the versions in Smalltalk and Self.
6) Deutsch with final methods and classes (richards_deutsch_acc_final)
In an attempt to regain any efficiency lost by the virtual calls, in
this version we make the calls non-virtual by making classes and
methods final where possible. Note that there is still a virtual
method (Task.fn, overridden in subclasses) at the heart of the
program, called on line 247.
7) Deutsch with virtual methods and interfaces
(richards_deutsch_acc_interface)
Finally, we make all classes inherit from an interface class, to
simulate the most object-oriented framework-like behavior. All
methods are virtual.
Performance results and interpretation
Below is a typical performance graph for the 7 versions of Richards
run on a JavaTM Virtual Machine (lower points are faster).
Versions 1 and 2 (richards_gibbons and richards_gibbons_final) are
almost the same in performance, the extra finals buying little
speedup. Version 3 (richards_gibbons_no_switch) is slightly slower.
Version 4 (richards_deutsch_no_acc) is slightly slower again.
However, when virtual accessors are introduced in #5
(richards_deutsch_acc_virtual) performance degrades significantly.
Using finals (#6, richards_deutsch_acc_final) can recover this
performance. Using interfaces (#7, richards_deutsch_acc_interface) is
as bad as using virtuals, if not slightly worse.
Compare this with the results for C++ (below). There are no results
for #2 (it is the same as #1) and #7 (there are no interfaces in
C++).
Note that an all-virtual program is much slower in C++ compared to the
non-virtual versions (#5 vs #6). Because of this, C++ programmers
tend to use virtuals sparingly. When translating from C++ to Java
extra effort must be expended to make methods final, as the default
binding in Java is virtual.
Contrast this with an experimental implementation of the JavaTM Virtual
Machine, called Pep,
which was implemented on the
Self Virtual Machine (do not
compare absolute times as these runs are on a much slower machine).
The Self Virtual Machine performs adaptive optimization and is even
capable of inlining virtuals. Note that the curve is essentially
flat, with the exception of #1 (Self has no good way of implementing a
switch statement). This suggests that adaptive optimization will work
for Java (some programs, at least). More details can be found in
"Design and Implementation of Pep, a Java Just-In-Time Translator",
by Ole Agesen, published in the Theory and Practice of Object Systems,
3(2), 1997, pp.127-155. Similar techniques are used in the `HotSpot'
Java VM.