[Beowulf] PVM on wireless...

Robert G. Brown rgb at phy.duke.edu
Wed Feb 6 10:21:55 PST 2008


On Wed, 6 Feb 2008, Bill Rankin wrote:

> Hey Rob,
>
> Could it be a node naming issue where the wireless IP does not resolve to the 
> same address as that used in the machinefile?  I seem to recall a similar 
> issue back when we ran PVM on machines with multiple network connections.

pvmd is actually starting up on the target machine -- it works that far.
The master node IP number is correct, as is the slave IP number (both
visible as arguments to pvmd).  The name I'm using is the one associated
with the wireless interface in question, and both machines ping in all
four directions by name with the correct internet address.  All my machines
are configured more or less identically, use the same environment
variables, support transparent ssh command execution (which obviously
works even in PVM as the daemon is being spawned on the correct target).
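
A minimal sketch of the name-resolution check described above, assuming
Python is available on both nodes.  The helper names (resolve_all,
check_hosts) and the host name and address in the usage note are
hypothetical; this is not part of PVM.

```python
import socket

def resolve_all(name):
    """Return the set of IP addresses that `name` resolves to on this host."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        # Name does not resolve at all on this machine.
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    return {info[4][0] for info in infos}

def check_hosts(expected):
    """Map each name to True if it resolves to the address we expect."""
    return {name: addr in resolve_all(name) for name, addr in expected.items()}
```

Run on both master and slave with the names from the hostfile, e.g.
check_hosts({"node1-wlan": "192.168.1.12"}) (hypothetical values).  A
False entry on either side would point at the kind of mismatch Bill
describes.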

The wireless interfaces have the right MTU and look exactly like the
ethernet devices they in fact are to the kernel AFAIK.  In every other
aspect I've ever tested, including my own homemade socket code, response
to both tcp and udp daemons, ability to mount NFS, support ssh, and so
on and so forth, they behave like TCP/IP sockets over ethernet devices
as far as system calls go.  They use the same interface, and the whole
point of OSI/ISO layering is that code should not depend on the hardware
layer.  In general, on even a roughly POSIX-compliant machine using
standard devices and e.g. the socket API, it doesn't.

Last time I encountered this, I actually cranked up the -d0x0 stuff and
"watched" as the system went through to where it hung in the middle of
doing some part of the post-spawn handshaking.

I suspect a race condition, probably caused by using raw UDP with some
assumption of latency during the handshake.  The one way I can think of
that the two connections differ is in their latency -- the bandwidth of
wireless is every bit as great as that of the 10Base2 networks I've run
PVM on in years past (on proportionally slower CPUs, of course).  If the
master or slave sends out an acknowledgement packet either before the
window where the other can receive it or after it has grown bored and
stopped listening, it might fail to properly bind or something.  It
seems like it would be a bug, not a feature, but if I were feeling
infinitely masochistic and were to wander down into Other People's
Source (ouch!) to try to debug this, that's what I'd look for first.
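
To illustrate the kind of failure suspected here, the following is a toy
sketch only.  It is not PVM's actual handshake, and the names
(slow_acker, handshake), the message contents, and the timings are all
assumptions.  It shows a UDP exchange over loopback where the listener
gives up after a fixed window, so an acknowledgement that arrives late
is simply never seen.

```python
import socket
import threading
import time

def slow_acker(sock, delay):
    """Hypothetical slave: receive one datagram, wait `delay` s, then ACK."""
    _, addr = sock.recvfrom(1024)
    time.sleep(delay)  # stands in for extra link latency
    sock.sendto(b"ACK", addr)

def handshake(delay, window, retries=1):
    """Return True if an ACK arrives within `window` s on some attempt."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))  # let the OS pick a free port
    worker = threading.Thread(target=slow_acker, args=(server, delay))
    worker.start()

    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.settimeout(window)  # the fixed "listening window"
    ok = False
    try:
        for _ in range(retries):
            client.sendto(b"HELLO", server.getsockname())
            try:
                data, _ = client.recvfrom(1024)
                ok = (data == b"ACK")
                break
            except socket.timeout:
                continue  # window expired; try again if retries remain
    finally:
        worker.join()
        client.close()
        server.close()
    return ok
```

With no added delay, handshake(0.0, 1.0) completes; with
handshake(0.3, 0.1) the single attempt times out before the ACK is
sent, which is the "grown bored and stopped listening" case.  Raising
`retries` so the total wait exceeds the delay lets the same exchange
succeed, which is roughly what a latency-tolerant handshake would have
to do.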

Any PVM developers still on list?  Any comments from them?

    rgb

>
> Just a thought,
>
> -bill
>
>
> On Feb 6, 2008, at 10:40 AM, Robert G. Brown wrote:
>
>> Anybody on list have any idea why PVM fails to add hosts over a wireless
>> link?  I've now tried this over multiple distro versions and at least one
>> PVM update, and it just doesn't work.  Works fine over a wire, fails on
>> wireless, and as far as I know wire and wireless are both "identical"
>> at the kernel interface layer so that any e.g. socket one might open is
>> absolutely ecumenical about what the underlying hardware is (good old
>> ISO/OSI layering, right?).
>

-- 
Robert G. Brown                            Phone(cell): 1-919-280-8443
Duke University Physics Dept, Box 90305
Durham, N.C. 27708-0305
Web: http://www.phy.duke.edu/~rgb
Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977


