[Beowulf] Java vs C++ for interfacing to parallel library

Sun Aug 20 14:55:18 PDT 2006

On Sun, 20 Aug 2006, Joe Landman wrote:

>
>
> Robert G. Brown wrote:
>
> [...]
>
>> Java, octave, matlab, python, perl etc. are MUCH WORSE in this regard.
>> All require NONTRIVIAL encapsulation of the library into the interactive
>> environment.  I have never done an actual encapsulation into any of
>
> Can't speak to Octave/Java/Matlab.  Python and Perl make this relatively
> easy.  In Perl you have the Inline:: modules.  If you have installed
> Inline::C, this example
>
> 	   #!/usr/bin/perl
> 	   use Inline C;
>           greet('Joe');
>
>           __END__
>           __C__
>           void greet(char* name) {
>             printf("Hello %s!\n", name);
>           }
>
> does this
>
> 	[landman at balto:~]
> 	105 >./inline.pl
> 	Hello Joe!
>
> Obviously this is a trivial example, but if you create a reasonable set
> of API's that you can express as we have indicated, even pass function
> prototypes in using a header file, and a little config stuff at the
> front end to give paths to libraries, this is not generally very hard.

I'm a bit skeptical about this for heavy lifting.  As in, could you
encapsulate the GSL in this way?  I doubt it.

> Only when you have some ... odd ... structures or objects passing back
> and forth which require a bit more work.

What's an odd structure?  All typedef structs (an object by any other
name) are "odd" in that they aren't part of the standard language
specification.  However C also permits a variety of freeform data
objects created by alloc'ing a block of memory and setting up any form
of offset addressing one needs/likes.
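
As a rough illustration of the kind of freeform object meant here, the
following C sketch packs a count and a run of doubles into one malloc'ed
block and reaches them purely by byte offsets.  The layout and the
blob_* names are invented for this example; they do not come from any
library discussed in this thread.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Layout of the block (a convention of this sketch, not any standard):
 *   bytes [0, sizeof(size_t))     element count n
 *   bytes [sizeof(size_t), ...)   n doubles, packed back to back
 * memcpy is used for every access so alignment never matters. */
unsigned char *blob_new(size_t n)
{
    unsigned char *b = calloc(1, sizeof(size_t) + n * sizeof(double));
    if (b != NULL)
        memcpy(b, &n, sizeof n);
    return b;
}

/* Read the element count back out of the header bytes. */
size_t blob_count(const unsigned char *b)
{
    size_t n;
    memcpy(&n, b, sizeof n);
    return n;
}

/* Store v as element i, addressed by byte offset from the block start. */
void blob_set(unsigned char *b, size_t i, double v)
{
    assert(i < blob_count(b));
    memcpy(b + sizeof(size_t) + i * sizeof(double), &v, sizeof v);
}

/* Fetch element i using the same offset arithmetic. */
double blob_get(const unsigned char *b, size_t i)
{
    double v;
    assert(i < blob_count(b));
    memcpy(&v, b + sizeof(size_t) + i * sizeof(double), sizeof v);
    return v;
}
```

Nothing in the type system describes this layout; only the accessor
code knows it, and a wrapper for another language would have to
reproduce that knowledge by hand.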

> Python has similar facilities.  Generally speaking the dynamic
> languages (Perl, Python, Ruby) are pretty easy to wrap around things
> and link with other stuff, as long as the API/data structures are pretty
> clean.

Ay, that's the rub...;-) That and what you consider "pretty easy"...:-)

> Hmmm... methinks you are thinking of strongly typed languages.  In
> non-strongly-typed languages, internal data types are not usually opaque
> unless they are objects with well formed classes/accessors behind them.
> Even then, the data stores tend to be quite flexible. In Perl (as an
> example) you have several types.  SV's (scalar variables), AV's (array
> variables), and HV's (hash variables), as well as pointers to the same.
> Notice that I didn't talk about ints, floats, etc.  Python has a
> similar view, though its data types include "lower level" types (ints,
> floats, ...).  In Ruby, everything is an object.

I'm simply pointing out that in perl (the case I'm most familiar with)
it is really quite difficult to know the details of what a data object
looks like from the "inside".  When I talk about an array object in C, I
know EXACTLY what it looks like.  A **array holds addresses of a set of
*vectors of data.  A ***array holds addresses of a set of **arrays, each
holding addresses of a set of *vectors.  I can take steps to ensure that
the actual data of the array is in a single contiguous block of memory
or can allocate vector blocks all over the place or I can just let
malloc generate them wherever it likes.  In perl, arrays are an opaque
data type.  One cannot in general assume that an ordinary C subroutine
can dereference a perl array passed by reference as a **array or a
matrix[][].  I'm assuming that this is what the "Inline" stuff above
does -- perform all the requisite translations of perl data types into
forms that C can grok.  In SOME cases they may be pretty much the same
as in C -- but I doubt it.
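
To make the C side of this concrete, here is a minimal sketch of the
"single contiguous block" variant of a **array: one vector of row
pointers, each aimed into one data block.  The names matrix_alloc and
matrix_free are made up for the illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate a rows x cols matrix whose data lives in ONE contiguous
 * block.  m[i] points at the start of row i inside that block, so
 * m[i][j] works, and m[0] can be handed to any routine that expects a
 * flat array of rows * cols doubles. */
double **matrix_alloc(size_t rows, size_t cols)
{
    assert(rows > 0 && cols > 0);
    double **m = malloc(rows * sizeof *m);
    if (m == NULL)
        return NULL;
    double *block = calloc(rows * cols, sizeof *block);
    if (block == NULL) {
        free(m);
        return NULL;
    }
    for (size_t i = 0; i < rows; i++)
        m[i] = block + i * cols;
    return m;
}

/* Release both pieces: the data block first, then the row pointers. */
void matrix_free(double **m)
{
    if (m != NULL) {
        free(m[0]);
        free(m);
    }
}
```

With this layout m[1] is exactly m[0] + cols.  Allocating each row
with its own malloc instead would give the scattered variant, where no
such relation holds; either way the C programmer knows which one it is.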

I think perl uses a (de facto) struct/complex object for even simple
$variable types for a variety of reasons -- probably mostly because
"Perl is a contextually polymorphic language whose scalars can be
strings, numbers, or references (which includes objects)."  Hence there
is metadata associated with the storage of even the simplest data
objects.  I vaguely remember all sorts of dire warnings in the language
reference manual on this very point although I'm not going to go dig out
my copy to verify my recollection.

The point being that there is a nontrivial step for ANY language
translating data structures and objects, quite possibly including the
very simplest ones e.g. simple scalar variables containing e.g. ints,
uints, doubles, into a subroutine.  A subroutine written to use uint
inputs is going to be unhappy or do odd things when fed with ints.  What
can one do about this when feeding it from a perl variable named $i that
is intrinsically polymorphic?  When you set $anumber = -214748365; in
perl and pass it to a C routine, did you just pass it a signed long long
integer or an unsigned integer?  C will expect a binary representation
in a definite type, but perl has no such type.  Ditto char variables --
what is a perl string and how is it terminated and what does perl do
with char variables one byte wide?  In C this is a valid concept -- an
unterminated single byte character.  In perl I'm GUESSING that it ALWAYS
saves a character as a string in an actual struct, with the usual
metadata and probably with a terminator.

There is some discussion of this and some advice here:

   http://world.std.com/~swmcd/steven/perl/pm/xs/intro/index.html

discussing "XS" -- a translation/interface system that permits one to
integrate C source into perl native.  See also man perlxs, of course.
This is "the right way" to integrate libraries or complex C sources with
perl, but I quote:

   If you want to write XS, you have to learn it. Learning XS is very
   difficult, for two reasons.

   The first is that the core Perl docs, such as perlxs and perlguts,
   tacitly assume that you already understand XS. Accordingly, they omit or
   gloss over crucial assumptions and background information. This sounds
   bad, but it is actually rather common in the Unix world.

   The second is that you can't learn XS. Not as such. Not from the top
   down. This problem is much more profound than the first, and it stems
   not from any inadequacy in the documentation, but from what XS is and
   isn't.

   The Perl docs refer to XS as a language, but it isn't. XS is a
   collection of macros. The XS language processor is a program called
   xsubpp, where pp is short for PreProcessor, and PreProcessor is a
   polite term for macro expander. xsubpp expands XS macros into the bits
   of C code necessary to connect the Perl interpreter to your C-language
   subroutines.

   Because XS isn't a language, it lacks structure. The underlying C code
   has structure, but you can't see it, because it is hidden behind the
   macros. This makes it virtually impossible to learn XS on its own
   terms.

...

   In order to learn XS, you have to work from the bottom up. You have to
   learn the Perl C API. You have to understand Perl's internal data
   structures. You have to understand how the Perl stack works, and how a C
   subroutine gets access to it. You have to understand how C subroutines
   get linked into the Perl executable. You have to understand the data
   paths through the DynaLoader module that bind the name of a Perl
   subroutine to the entry point of a C subroutine.

As I suggested, to interface a complex program or library with perl
"correctly" (efficiently), you have to learn the perl API.  Which is Not
Trivial At All.  Basically, to use perlxs, you start by learning how
perl is written -- all its data types and memory management and how
routines are written and called -- and THEN suddenly you see how to use
perlxs to encapsulate your C code to run as commands in native perl.

Or (as this site also suggests) you can try SWIG: http://www.swig.org/.

This is a quick-and-dirty solution that works (AFAICS) by first putting
your C routines in a SWIG wrapper, which is relatively simple because
SWIG is designed to simply wrap C routines.  Then SWIG does all the
magic, pre-encapsulated, of translating into and out of the language of
choice via its (now hidden) API.  Presumably it has library layers that
manage all of the interfacing cleanly for you.

NEITHER of these seems like they are for the faint of heart or anyone
less than a bloomin' expert top coder.  XS for the Ubercoder who thinks
nothing of reading the source code of the linux kernel as a pleasant
summer diversion, SWIG for less lofty but still highly competent coders.
At a guess, with SWIG you can do any simple project, but probably not
encapsulate the GSL.  With XS you can encapsulate the GSL directly into
perl, but only if you are a class of programmer that, frankly, I find a
bit frightening (think of a "Ramanujan" or "Riemann" of programming) or a
team of very good programmers.  I don't know which class the referenced
library is in.

> My first experience with Perl (my gosh, more than a decade ago) was
> wrapping a long calculation that we were doing to extract total energies
> by autogenerating input files and preparing data sets.  Took me a few
> hours to write the code, started it running, and two weeks later, we had
> our results.  The perl code did not do the calculations, that was the
> fortran code.  The perl code drove the fortran, extracted the relevant
> information and wrote it to a file, and prepared the next input.  I
> would hate to think how long the dynamics would have taken had it been
> written in anything other than fortran/c.

Yeah, I do the same thing, only I use programs written in C native, and
write perl programs only as controllers (which is really what it is
for).  Good old "system()" or the various flavors of exec*() -- just let
perl generate your command line(s) for you and screen or file scrape
your results in for processing.  This is the SIMPLEST encapsulation of C
libraries that permit their use in perl -- just wrap them up with a
simple command line interface and stdout output and embed them into perl
code that way.  It has the further advantage that you can actually run
the programs by hand from the command line.
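
The C half of that arrangement might look roughly like the sketch
below: a library routine plus a thin front end that takes its inputs
from the argument list and prints one line that is easy to scrape.
Both mean() and cli_run() are hypothetical names standing in for
whatever library is actually being wrapped.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for a routine from the library being wrapped (hypothetical). */
double mean(const double *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return n > 0 ? sum / (double)n : 0.0;
}

/* Command-line front end: argv[1..] are numbers, the result goes to
 * `out` as a single line.  Returns 0 on success, 1 on bad input. */
int cli_run(int argc, char **argv, FILE *out)
{
    assert(out != NULL);
    if (argc < 2) {
        fprintf(stderr, "usage: %s number [number ...]\n",
                argc > 0 ? argv[0] : "mean");
        return 1;
    }
    size_t n = (size_t)(argc - 1);
    double *x = malloc(n * sizeof *x);
    if (x == NULL)
        return 1;
    for (size_t i = 0; i < n; i++) {
        char *end;
        x[i] = strtod(argv[i + 1], &end);
        if (end == argv[i + 1] || *end != '\0') {
            fprintf(stderr, "not a number: %s\n", argv[i + 1]);
            free(x);
            return 1;
        }
    }
    fprintf(out, "mean %.6g\n", mean(x, n));
    free(x);
    return 0;
}
```

A real program's main() would be nothing more than
return cli_run(argc, argv, stdout); and the controlling script only
has to build the argument list and split the output line on whitespace.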

There is something to be said for using the Unix "build complex tools
out of pipelines of simple tools" approach to things.  Having tackled a
couple of GUI programs with some back end complexity at this point, I
learned the hard way that the only sane way to proceed here is to put
the complexity in a library so you can focus on the UI at the UI stage
and on the complexity at the library stage.  That forces you to build
an API, and THEN you can sometimes do the required integration relatively
simply.  But it is almost invariably much simpler to write a simple
program that outputs the results to a file and then generate a graph of
the file with e.g. gnuplot by hand than to write a program that puts the
same results into memory and generates a graph of them via library calls
inside the program.  Not as nice for users, but way simpler for the
programmer.  If making something commercial, it might be justified.  If
you just want to make a bunch of plots, it probably isn't.

>> The point being that you have to interface these opaque and not
>> obviously documented data types to the C library calls.  This is surely
>> possible -- it is how all those perl libraries, matlab toolboxes, java
>> interfaces come about.  It will probably require that you learn WAY more
>> about how the language itself is implemented at the source level than
>> you are likely to want to know, and it is probably not going to be
>> terribly easy...
>
> Hmmm.  Did speak to Perl and python above.  Not sure how to do it with
> Octave, but the Matlab folks have some good connectivity with external
> libraries.  I don't know if it is easy to extend.  Java likes to talk to
> Java.

I think that in all cases the recipe is pretty much the same.  Either
use SWIG (or equivalent), use macros or wrappers designed to help you
encapsulate, and/or learn the language API way down deep.  Deep enough
to fully understand its method of data management and command linkage.
I'd be deeply suspicious of anyone who says this is simple.  At the very
least, I'd want to know their IQ and compare it to my own and if it is
more than 10 points higher reject their conclusion out of hand...;-)

With the exception of the people who "do" this regularly, of course.
They've invested the (possibly considerable) amount of time necessary to
MAKE it simple.  For example, >>I'm<< not going to do a port of
dieharder into R (as yet another interactive programming language
interface) -- but I'm willing to simply encapsulate the dieharder
routines in a library interface that I'm HOPING will be simple enough to
encapsulate in R for somebody that has already figured out and written
extensions for it.

    rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu