[Fwd: Software DSM, for help!]

Shu Xiao xiao@usc.edu
Tue, 22 Jun 1999 21:22:09 -0400


Dear All!

We are porting JIAJIA 2.1, the software DSM package, to our new
Linux-based cluster system, but I am stuck on a problem. We would
really appreciate your help; I hope this is not a bother.

Basically, we installed JIAJIA smoothly. There was no problem creating the library
libjia.a, and we also built all the provided applications successfully. But when I
run an application, it fails. A sample of the output is shown below:

===============================================
[root@andy src]# ./pi

***JIAJIA---Software DSM***
***  Cachepages = 1024  Pagesize=4096***

Host[0]: andy [192.168.1.253]
Host[1]: trojan1 [192.168.1.1]
*********Total of 2 hosts found!**********

******Start to copy system files to slaves!******
Copy files to root@trojan1.
Remote copy succeed!

******Start to create processes on slaves!******

Starting CMD rsh -l root trojan1 ./pi -P27952  & on host trojan1
End of Initialization
Map 0x1000 bytes in home    0! globaladdr = 0x0

***JIAJIA---Software DSM***
***  Cachepages = 1024  Pagesize=4096***

Host[0]: andy [192.168.1.253]
Host[1]: trojan1 [192.168.1.1]
Assert0 error from host 0 --- outsend()-->sendto()
Unix Error: Connection refused
[root@andy src]#
==================================================

It seems that both rcp and rsh are OK; the problem is in sendto(). The
support people at dsm@water.chpc.ict.ac.cn said they use rsh instead of
rexec() because rexec() doesn't work on Linux, and that rsh might cause problems
with global synchronization. They also suggested that I add a sleep() after
jia_init() in the application program to work around this. I tried it and it
didn't help. I traced the traffic using tcpdump, and it seems that the slave machine
doesn't respond to the UDP packets it receives. I am using Red Hat Linux 6.0.
The same code runs OK under Red Hat 5.2.
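
For anyone who wants to reproduce the symptom outside JIAJIA, below is a
minimal Python sketch of how a UDP send can end in "Connection refused" on
Linux: when a datagram reaches a port with no listener, the target answers
with an ICMP port-unreachable, and a connected UDP socket then reports
ECONNREFUSED on a later call. This is not JIAJIA code; the names
free_udp_port and probe_udp are my own, and whether outsend() fails for the
same reason is only an assumption.

```python
import socket


def free_udp_port(host="127.0.0.1"):
    """Return a UDP port that was free a moment ago (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.bind((host, 0))          # port 0 lets the kernel pick a free port
        return s.getsockname()[1]  # the socket is closed on leaving the block


def probe_udp(host, port, attempts=5, wait=0.2):
    """Send small datagrams to host:port over a connected UDP socket.

    Returns "refused" if the kernel reports ECONNREFUSED, i.e. an ICMP
    port-unreachable came back. Returns "no-error" otherwise, which means
    either something is listening or the ICMP reply was filtered.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect((host, port))    # connected sockets receive ICMP errors
        s.settimeout(wait)
        for _ in range(attempts):
            try:
                s.send(b"ping")
                s.recv(16)         # a pending ICMP error surfaces here
            except ConnectionRefusedError:
                return "refused"
            except socket.timeout:
                continue           # no reply and no error; try again
    return "no-error"
```

Running probe_udp against the slave's host name and the port the master sends
to (taken from the tcpdump trace) would show whether anything is bound to that
port at the moment the master starts sending.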

Could you please give me any advice on this?

Thanks in advance.

Sincerely,

Shu Xiao