[Beowulf] Java vs C++ for interfacing to parallel library

Sun Aug 20 11:25:05 PDT 2006

On Sun, 20 Aug 2006, Joe Landman wrote:

> Jonathan:
>
> Jonathan Ennis-King wrote:
>> Does anyone have experience writing parallel Java code (using MPI) with
>> calls to C libraries which also use MPI? Is this possible/sensible? Is
>> there a big performance hit relative to doing the same in C++?
>
> Unless all of the important optimizable calculation is done in libraries
> that you are stitching together with Java glue, the compiled languages
> are likely to be quite a bit faster.
>
> There is a sizeable abstraction penalty associated with OO languages.
> Many of the design patterns that they encourage (object factories,
> inheritance chains, etc) are anathema to high performance.

Hear, hear!

>> I'm considering writing some parallel code to do fluid flow in porous
>> media, the heart of which is solving systems of sparse linear equations.
>> There are some good libraries in C which provide the parallel solver
>> (e.g. PETSC), but I'm trying to resolve which language to use for my
>> code. The choice is between C++ and Java, and although I'm favouring
>> Java at present, I'm not sure about its performance in this context.
>
> Hmmm.  For this, C or Fortran may be far more appropriate.  Depends upon
> what it is you want to do with the code.  High performance using MPI
> depends upon many factors.  If there is one particular part of the code
> that is better served by an OO based language, then I might suggest
> designing/implementing all the speed sensitive bits in a language which
> lets you achieve high performance, and then interfacing them to your OO
> language so that the OO system isn't being used for the critical time
> sensitive portions.

<disclaimer>Parts of the stuff below are editorial comment and religious
belief and can be ignored or sniffed at by those of differing
belief.</disclaimer>

Remember well the observation that you can write object oriented code in
a procedural language (and ditto, you can write procedural code in an OO
language).  Matching the language to the kind of code -- or more
likely, the personal taste of the coder -- simply makes development a
bit simpler and more natural.

Ultimately, OO vs procedural code is a matter of style as much as
anything else.  I write "real" code exclusively in C.  I'm in the
process of (re)writing a random number testing program (dieharder) as
a library-based tool.  It was originally (first pass) quite procedural
in its design.  In the second pass, as I came to fully understand the
data objects better in practice and could start to see how the code
could be simplified and compressed, I began to introduce a set of "lazy"
shared objects for certain parts of the code.

In the third (current) pass I'm splitting off all of the actual testing
code, as opposed to the startup/results/presentation UI code, into a
library.  Since most of the tests share a very similar implementation
structure and certain control variables in common, I can now see
precisely how to make the code very object oriented with a set of "test
objects" (structs and similarly structured test implementations that
read from them and fill them in) and a single set of "shell" code for
calling a standard test.  This reduces writing a UI to nothing but
simple, repetitive boilerplate for calling the actual tests and
displaying the returned results -- one can focus on the human side of
the UI and stop worrying about the tests, and one can relatively easily
and scalably add more tests or RNGs to test.
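To make the "test object" idea concrete, here is a minimal sketch in C.
It is NOT the actual dieharder code; every name in it (rng_test,
make_mean_test, run_test, toy_uniform, toy_stuck) is made up for
illustration, and the "test" is a deliberately trivial check on the
sample mean.  The point is only the shape: a struct carrying the shared
control variables and result slots, a function pointer for the
test-specific part, and one shell routine that runs any test.

```c
#include <assert.h>
#include <stddef.h>

typedef struct rng_test rng_test;
typedef double (*rng_fn)(void);      /* generator returning values in [0,1) */

/* Hypothetical test object: controls in, results out. */
struct rng_test {
    const char *name;                /* label a UI can display */
    size_t nsamples;                 /* control: how many samples to draw */
    double tolerance;                /* control: allowed deviation */
    double statistic;                /* result: filled in by run() */
    int passed;                      /* result: filled in by run() */
    void (*run)(rng_test *self, rng_fn rng);
};

/* One test implementation: the sample mean should be close to 0.5. */
static void mean_run(rng_test *self, rng_fn rng)
{
    double sum = 0.0;
    for (size_t i = 0; i < self->nsamples; ++i)
        sum += rng();
    self->statistic = sum / (double)self->nsamples;

    double dev = self->statistic - 0.5;
    if (dev < 0.0)
        dev = -dev;
    self->passed = dev < self->tolerance;
}

/* "Constructor" for the mean test with a fixed tolerance of 0.05. */
rng_test make_mean_test(size_t nsamples)
{
    rng_test t = { "mean", nsamples, 0.05, 0.0, 0, mean_run };
    return t;
}

/* The single shell: identical for every test, whatever it computes. */
int run_test(rng_test *t, rng_fn rng)
{
    assert(t != NULL && t->run != NULL && rng != NULL && t->nsamples > 0);
    t->run(t, rng);
    return t->passed;
}

/* Toy generators for trying it out: one cycles evenly through
   0.05, 0.15, ..., 0.95; the other is stuck at 0.9. */
double toy_uniform(void)
{
    static unsigned k = 0;
    double v = ((double)(k % 10u) + 0.5) / 10.0;
    ++k;
    return v;
}

double toy_stuck(void)
{
    return 0.9;
}
```

A UI then needs nothing but make_mean_test(1000), run_test(&t,
toy_uniform) and a printf of t.name, t.statistic and t.passed.  Adding
a new test means writing one more run function and one more
constructor; the shell and the UI do not change.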

Since the code is still both lazy OO and C, I can freely intersperse the
use of pointers, can choose to treat variables (including all
structs/objects) as "opaque" or not as makes sense in the code, and keep
the code as efficient as C can make it, which is to say damn near as
efficient as assembler.  The "objectness" of the encapsulated tests just
permits me to write a relatively clean API to the library (without too
many test specific global/shared variables or the even greater hassle of
dealing with passing variable length argument lists through layers of
encapsulating subroutines) so that when I'm done, adding a UI or GUI or
implementing the tests natively inside e.g. R or octave or whatever will
be fairly straightforward.

The point being that one CAN write non-lazy OO code in C or even in
Fortran -- that's more a question of program design and an understanding
of the basic data objects that a program requires, although it certainly
helps if the language permits the definition of a struct of one sort or
another.  One has the choice in C, though, of writing fully OO, lazy
(mixed) OO or fully procedural code when and where that is appropriate
for either ease of coding or program efficiency.  I suppose that choice
exists to some extent for at least some non-fascist OO environments
(e.g. C++ as a sort-of superset of C) but I think that the only people
who even know how to do so are those who have learned to code in a
non-OO language first -- people who learn C++ as their primary language
tend to be pretty clueless about pointers or the performance advantages
of NOT using protection and inheritance in your structs but just letting
everything access them directly.  C provides few safety nets but rather
permits you to do pretty much anything you like, at your own risk, in
code that is ultimately transparent.

Now, I personally believe that all nontrivial programs go through stages
like the three described above no matter what language they are written
in.  This is one of the reasons that Wirth's Pascal had its day and that
it passed -- whether one starts at the top or at the bottom or both, one
is likely to encounter mismatches that require rethinking all or part of
the memory hierarchy one begins with in any difficult project.  In that
SECOND pass and beyond, both strict-topdown and strict-bottomup
languages tend to require MORE work to fix than one that is less
hierarchically prestructured.

Perhaps there are OO ubercoders that can just "see" what the data
objects appropriate to a complex application are from the beginning and
can start off with the right top level, mid, AND bottom level objects
all perfectly enmeshed and integrated but I have yet to meet one.  One
of the great (IMO) illusions promoted by OO fanatics is that by using an
OO language (per se) to write the code in the first place one can
somehow shorten this process and home in on the correct hierarchy of
data structures (objects or not) that optimally support the
application's efficient implementation from top to bottom.  This is not
my experience, but hey, the world is a big place and there may be people
who just think that way and for them it may be true.

For code like the specific stuff you want to implement above, which has
efficient libraries written in C, my guess is that you would do best
using C -- this is pretty much a no-brainer.  It is highly probable that
in C you have the best access to example programs using the library,
UIs, human support in the form of others who use the libraries in their
C code, and more.  Even communicating with the author/maintainers of the
library is bound to be simplest if you are implementing in C.  Second
best would almost certainly be C++, as C++ can (I believe) call C
libraries fairly transparently or with a minimal C++ encapsulation of
the C prototypes and data structures.
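For what it is worth, the usual "minimal encapsulation" is a guard in
the C header so that a C++ compiler gives the prototypes C linkage.
The sketch below shows the pattern on a made-up two-function-free
"library" (vec and vec_dot are hypothetical names, not anything from
PETSc); in real life the part between the guards lives in a .h file.

```c
#include <assert.h>
#include <stddef.h>

/* ---- what would normally sit in the library's header file ---- */
#ifdef __cplusplus
extern "C" {            /* seen only by C++ compilers: use C linkage */
#endif

typedef struct {
    size_t n;           /* number of elements */
    double *data;       /* caller-owned storage */
} vec;                  /* hypothetical type */

double vec_dot(const vec *a, const vec *b);

#ifdef __cplusplus
}
#endif

/* ---- what would normally sit in the library's .c file ---- */
double vec_dot(const vec *a, const vec *b)
{
    assert(a != NULL && b != NULL && a->n == b->n);
    double sum = 0.0;
    for (size_t i = 0; i < a->n; ++i)
        sum += a->data[i] * b->data[i];
    return sum;
}
```

A C compiler never sees the extern "C" lines because __cplusplus is not
defined; a C++ compiler sees them and does not mangle the name vec_dot,
so the same header serves both languages.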

OTOH Fortran and C tend to have somewhat different subroutine call
mechanisms so binding a C library into fortran code or VV tends to be a
PITA -- for example, C always passes subroutine arguments by value,
fortran by reference.  In addition, C and fortran use slightly different
conventions for other simple stuff e.g. terminating a string.  Some of
the issues associated with the port are mentioned here:
http://star-www.rl.ac.uk/star/dvi/sun209.htx/node4.html as well as
elsewhere on the web.  Basically, calling C libraries in fortran code is
possible but requires some work and code encapsulation (and vice versa
for calling fortran routines from inside C code, IIRC -- fortran/C
compiler folks can check me on this:-).
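A rough sketch of what that encapsulation looks like from the C side,
under assumptions that need to be checked against your own compiler:
many Fortran compilers lowercase the routine name and append an
underscore, pass every argument as an address, and hand over CHARACTER
data as a blank-padded buffer with its length supplied separately and
no terminating NUL.  The names scale_vec_ and fstr_to_cstr are made up
for the example.  (Fortran 2003's ISO_C_BINDING module standardizes
much of this where it is available.)

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Intended to be callable from Fortran as CALL SCALE_VEC(N, ALPHA, X),
   ASSUMING the lowercase-plus-underscore naming convention.  Every
   argument arrives by reference, hence the pointers. */
void scale_vec_(const int *n, const double *alpha, double *x)
{
    assert(n != NULL && alpha != NULL && x != NULL);
    for (int i = 0; i < *n; ++i)
        x[i] *= *alpha;
}

/* Convert a blank-padded Fortran string of length flen into a
   NUL-terminated C string in out (capacity outsize, must be > 0).
   Trailing blanks are dropped.  Returns the length of the C string. */
size_t fstr_to_cstr(const char *fstr, size_t flen, char *out, size_t outsize)
{
    assert(fstr != NULL && out != NULL && outsize > 0);
    while (flen > 0 && fstr[flen - 1] == ' ')
        --flen;                          /* trim the padding */
    if (flen > outsize - 1)
        flen = outsize - 1;              /* truncate rather than overflow */
    memcpy(out, fstr, flen);
    out[flen] = '\0';
    return flen;
}
```

Calling scale_vec_ from C itself shows the cost of the convention: even
the scalar N and ALPHA have to live in variables so that their
addresses can be passed.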

Java, octave, matlab, python, perl etc. are MUCH WORSE in this regard.
All require NONTRIVIAL encapsulation of the library into the interactive
environment.  I have never done an actual encapsulation into any of
them, but I'll wager that it is really quite difficult because each of
them has its very own internal data types that are REALLY opaque
objects that bear little overt resemblance to the simple "all data
objects can be viewed as a projection onto a block of memory with either
typed or pointer driven offset arithmetic" view of data in C or for that
matter C++ or Fortran (with slightly different projective views in each
case).
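Here is a small illustration of that "projection onto a block of
memory" view, using a made-up record struct and two hypothetical helper
functions.  The first reads a struct field out of a raw block using
nothing but byte offsets; the second treats a matrix as a flat run of
doubles with pointer arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    int id;
    double value;
} record;                /* hypothetical data object */

/* Fetch the value field of the index-th record in a raw block, using
   only the block address and computed byte offsets. */
double record_value_at(const void *block, size_t index)
{
    assert(block != NULL);
    const unsigned char *p = (const unsigned char *)block
                           + index * sizeof(record)
                           + offsetof(record, value);
    double v;
    memcpy(&v, p, sizeof v);     /* copy the bytes out at that offset */
    return v;
}

/* Element (i, j) of a row-major matrix stored as one flat block. */
double mat_get(const double *base, size_t ncols, size_t i, size_t j)
{
    assert(base != NULL && j < ncols);
    return *(base + i * ncols + j);
}
```

Given record r[2] = {{1, 1.5}, {2, 2.5}}, record_value_at(r, 1) hands
back the same 2.5 that r[1].value does; the typed access is just a more
convenient spelling of the offset arithmetic.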

These languages typically permit you to allocate memory by just using a
named variable.  This is marvelously convenient for an interactive
environment -- it is marvelously expensive in terms of program
efficiency because the underlying environment has to manage allocating
the memory transparently and extensibly (most of the languages permit
you to allocate whole vectors or matrices of variables by just
referencing them), tracking instances of the memory in code, and freeing
the memory when it is no longer referenced or being used.  They do this
conservatively, keeping things if there is ANY CHANCE of their ever
being referenced, which typically makes them memory hogs almost as bad
as a C program would be if every memory reference in the program was to
static global memory -- no memory allocation or freeing at all, beyond
whatever goes on stack/heap in the course of subroutine calls or
internal function execution.  Complicated hashes or advanced list
structures are used to keep the execution itself moderately efficient
(but highly INefficient compared to a decent compiler with flat memory
layouts).
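To give a feel for the bookkeeping involved, here is a toy
reference-counted "boxed" vector in C.  This is a sketch of one common
technique (reference counting), not a description of how any of the
languages named above actually manages memory; the type and function
names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A vector as an interpreter might hold it: the data plus the
   bookkeeping needed to know when it can be freed. */
typedef struct boxed {
    size_t refcount;     /* how many live references exist */
    size_t len;          /* number of elements */
    double *items;       /* the actual payload, zero-initialized */
} boxed;

/* Create a vector with one reference; returns NULL if out of memory. */
boxed *boxed_new(size_t len)
{
    boxed *b = malloc(sizeof *b);
    if (b == NULL)
        return NULL;
    b->items = calloc(len ? len : 1, sizeof *b->items);
    if (b->items == NULL) {
        free(b);
        return NULL;
    }
    b->refcount = 1;
    b->len = len;
    return b;
}

/* Every new name bound to the vector bumps the count. */
boxed *boxed_retain(boxed *b)
{
    assert(b != NULL);
    ++b->refcount;
    return b;
}

/* Every name going away drops the count; the memory is released only
   when the last reference disappears.  Returns 1 if it was freed. */
int boxed_release(boxed *b)
{
    assert(b != NULL && b->refcount > 0);
    if (--b->refcount > 0)
        return 0;
    free(b->items);
    free(b);
    return 1;
}
```

Every assignment and every scope exit in the interpreted program turns
into calls like these behind the scenes, which is overhead that a plain
double x[N] in C simply never pays.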

The point being that you have to interface these opaque and not
obviously documented data types to the C library calls.  This is surely
possible -- it is how all those perl libraries, matlab toolboxes, java
interfaces come about.  It will probably require that you learn WAY more
about how the language itself is implemented at the source level than
you are likely to want to know, and it is probably not going to be
terribly easy...
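As a cartoon of the glue such an interface has to contain, the sketch
below invents a tagged value type (ivalue) standing in for some
interpreter's internal representation, and a stand-in "library" routine
(lib_sum) that wants a flat array of doubles.  None of this corresponds
to a real interpreter or to a real library API; real glue layers (JNI
for Java, for instance) have their own types and rules.  The shape of
the problem is the same, though: check types, copy into a flat buffer,
then call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical interpreter-side value: a tag plus a payload. */
typedef enum { IV_INT, IV_DOUBLE, IV_STRING } iv_tag;

typedef struct {
    iv_tag tag;
    union {
        long i;
        double d;
        const char *s;
    } as;
} ivalue;

/* Stand-in for a C library routine that expects flat memory. */
double lib_sum(const double *x, size_t n)
{
    assert(x != NULL || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += x[i];
    return sum;
}

/* Glue: convert n interpreter values into a flat array of doubles.
   Returns 0 on success, -1 if any value is not numeric. */
int unbox_doubles(const ivalue *vals, size_t n, double *out)
{
    assert(vals != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        switch (vals[i].tag) {
        case IV_INT:
            out[i] = (double)vals[i].as.i;
            break;
        case IV_DOUBLE:
            out[i] = vals[i].as.d;
            break;
        default:
            return -1;           /* e.g. a string: cannot be passed on */
        }
    }
    return 0;
}
```

Note the copy: the library cannot work on the interpreter's values in
place, so every call pays for a pass over the data in each direction,
on top of whatever the library itself costs.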

    rgb


-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu