On Mon, 22 Jul 2002, Randy Clarksean wrote:
> There are two problems that keep coming up and any suggestions would be
> greatly appreciated.
>
> -  at times there are delays in the network because a ping can take up to 1
> sec to go to any one or all of the machines.  The machines are all on 10/100
> NICs with a 100 MB Hub.  If the machines are carefully rebooted (meaning
> rebooting one at a time until it is all the way back up again) this problem
> will seemingly go away.  So .. any time there is the ping delay, the rsh
> delays everything - which causes the cluster many problems.  Any thoughts on
> rsh parameters that I should change or set differently?  OR is there
> something running on the system that can cause these ping or rsh delays?
>

I've not had any experience with beowulf, but you might try to see if you
can reduce this to the lowest common denominator, i.e. have you tried
rebooting the hub instead of the systems? If this is a hub, then you've
got everything in half-duplex, so there's a (however remote)
possibility that a collision is screwing something up. Another thought:
can you bring up just 2 systems, with the network running and none of the
cluster software, and beat the snot out of the tcpip connections between
the 2 systems to re-create the issue? ftp a large file back & forth between
one system and /dev/null on the other side; if possible, set it up in a
while loop and come back tomorrow, something on that order.
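
If ftp is awkward to script, the same kind of soak test can be roughed out
with plain TCP sockets. What follows is only a sketch, and it swaps Python
in for ftp: the names sink, blast and loopback_demo and the 64 KB chunk
size are made up for illustration. sink throws away whatever it receives
(the /dev/null side) and blast pushes a fixed number of bytes at it and
returns how long that took.

```python
import socket
import threading
import time

CHUNK = 64 * 1024  # bytes per send; an arbitrary choice for this sketch


def sink(server_sock, result):
    """Accept one connection and discard everything, like writing to /dev/null."""
    conn, _ = server_sock.accept()
    received = 0
    with conn:
        while True:
            data = conn.recv(CHUNK)
            if not data:  # empty read: the sender closed the connection
                break
            received += len(data)
    result["received"] = received


def blast(host, port, total_bytes):
    """Send total_bytes of zeros to host:port and return the elapsed seconds."""
    payload = bytes(CHUNK)
    sent = 0
    start = time.monotonic()
    with socket.create_connection((host, port)) as s:
        while sent < total_bytes:
            n = min(CHUNK, total_bytes - sent)
            s.sendall(payload[:n])
            sent += n
    return time.monotonic() - start


def loopback_demo(total_bytes=1_000_000):
    """Run sink and blast on one machine, only to show the pieces fit together."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    result = {}
    t = threading.Thread(target=sink, args=(server, result))
    t.start()
    elapsed = blast("127.0.0.1", port, total_bytes)
    t.join()
    server.close()
    return result["received"], elapsed
```

For a real test the sink would run on one node and blast would be called
in a loop from the other, logging the elapsed time of each pass; passes
that suddenly take far longer than the rest would line up with the ping
delays if the problem sits below the cluster software. Note that the timer
stops once the data has been handed to the local kernel, so short runs
over a fast link will look quicker than the wire really is.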
	Another long shot: instead of letting the NIC driver
auto-negotiate, force it to 100-half for a hub or 100-full for a switch.
(Yes, I've seen "auto" not work correctly in some environments.)
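
One way to see whether collisions or a duplex problem are in play is to
watch the per-interface counters the kernel keeps. The sketch below reads
them from /proc/net/dev on Linux. It assumes the usual layout of that file
(two header lines, then 8 receive and 8 transmit columns per interface,
with "colls" as the sixth transmit column), so check the header on your
own kernel before trusting the numbers. The function names are made up
for illustration.

```python
def parse_net_dev(text):
    """Return {interface: {"rx_errs", "tx_errs", "collisions"}} from /proc/net/dev text."""
    stats = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        # Split on the colon, since some kernels print "eth0:12345" with no space.
        name, _, rest = line.partition(":")
        fields = rest.split()
        if len(fields) < 16:  # not the assumed 8 receive + 8 transmit layout
            continue
        stats[name.strip()] = {
            "rx_errs": int(fields[2]),      # receive column "errs"
            "tx_errs": int(fields[10]),     # transmit column "errs"
            "collisions": int(fields[13]),  # transmit column "colls"
        }
    return stats


def read_net_dev(path="/proc/net/dev"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as f:
        return parse_net_dev(f.read())
```

Some collisions are normal on a hub, since everything on it is
half-duplex. What would be suspicious is a collision or error count that
climbs quickly during the slow periods, so comparing two readings taken a
minute apart tells you more than a single snapshot.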
	Is there any other connectivity into the hub? Can it be removed?
	I've seen similar issues in AIX RS6000 clusters; there the cause
ultimately was a memory leak in the NIC driver. Are you using the
latest (or most stable) driver code for the NIC?

	As I said above, I'm no beowulf expert, but this issue has to be
in one of a few places: the network itself (bad card/hub), the card
driver, tcpip, or the cluster software. I'd find it unlikely that this is
due to rsh or ping; they both rely on tcpip and everything below it. The
cluster software could be to blame, so try & get that out of the equation.

-- LINUX, because rebooting is for adding hardware!
www.linuxsnob.com <-- a little linux humor, and a very little support.