[StarCluster] Error Report

Justin Riley jtriley at MIT.EDU
Tue Aug 23 17:21:00 EDT 2011


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Josh,

Apparently I didn't read your original post carefully enough - you
already used the restart command. In either case it should be safe to
call restart again once your connection improves if that was the issue.
Are you getting the "Connection reset by peer" error consistently?

Also, concerning your jobs, which I'm assuming you're using SGE for, are
you still having issues? If so, it's hard to tell what might be the
issue without more details. Have you checked that the jobs are not in an
error state with "qstat -f" and have you tried logging onto the
execution host running the jobs to check that the processes are running,
proper files being generated, etc? Just some starting points...
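
For the error-state check, here is a minimal Python sketch that scans qstat output for jobs whose state contains "E" (for example "Eqw"). It assumes the default SGE column layout, with the state in the fifth column and no whitespace in job names. The function name jobs_in_error and the sample text are made up for illustration and do not come from a real cluster.

```python
def jobs_in_error(qstat_output):
    """Return (job_id, state) pairs for jobs whose state contains 'E'.

    Assumes default qstat columns: job-ID, prior, name, user, state, ...
    """
    flagged = []
    for line in qstat_output.splitlines():
        fields = line.split()
        # Job rows start with a numeric job ID; this skips the header
        # line and the dashed rule underneath it.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        job_id, state = fields[0], fields[4]
        if "E" in state:
            flagged.append((job_id, state))
    return flagged


# Illustrative sample only, not output from a real cluster.
SAMPLE = """\
job-ID  prior   name       user         state submit/start at     queue            slots ja-task-ID
---------------------------------------------------------------------------------------------------
      1 0.55500 job1.sh    sgeadmin     r     08/05/2011 14:02:11 all.q@node001        1
      2 0.55500 job2.sh    sgeadmin     Eqw   08/05/2011 14:02:12                      1
"""

if __name__ == "__main__":
    print(jobs_in_error(SAMPLE))
```

On a real cluster you would feed it the text printed by qstat on the master node. For any job it flags, "qstat -j <job_id>" should show the reason for the error.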

~Justin

On 08/23/2011 05:13 PM, Justin Riley wrote:
> Hi Josh,
> 
> My apologies for the delay. The error you submitted could be related
> to:
> 
> 1. A spotty Internet connection
> 2. DOS attacks on your instances' SSH daemons
> 
> In the first case there's not much I can do; you really need to have
> a solid Internet connection for StarCluster to work. I've played
> around with auto-reconnecting but it's a hack and likely to break in
> other more extravagant ways so I've been hesitant to add it in.
> 
> However, if you lose your connection during a 'start' command there's
> no need to destroy the cluster, just run restart instead:
> 
> $ starcluster restart mycluster
> 
> This will simply reboot the instances and reconfigure the cluster
> all over again rather than terminating the instances and wasting
> instance hours.
> 
> I'm working on a solution for the second case, which is basically to
> restrict SSH access to only those IP addresses that attempt to connect
> with valid credentials. Essentially StarCluster would:
> 
> 1. Figure out your current IP
> 2. Modify the security group permissions if necessary to allow SSH
>    access from your current IP
> 
> This would happen for each new IP you run
> start/sshmaster/sshnode/etc. from.
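
The decision in step 2 above can be sketched in a few lines of Python. This is only an illustration of the idea, not StarCluster's implementation. The function name ssh_rule_needed is hypothetical, the current IP is passed in rather than looked up, and the call that would actually modify the security group is left out. It assumes IPv4 addresses.

```python
import ipaddress


def ssh_rule_needed(current_ip, allowed_cidrs):
    """Return the /32 CIDR to authorize for SSH, or None if one of the
    existing rules already covers current_ip.

    allowed_cidrs is the list of CIDR blocks the security group already
    allows on port 22 (an assumed input for this sketch).
    """
    addr = ipaddress.ip_address(current_ip)
    for cidr in allowed_cidrs:
        # strict=False tolerates CIDRs written with host bits set.
        if addr in ipaddress.ip_network(cidr, strict=False):
            return None
    # No existing rule matches, so a single-host rule would be needed.
    return "%s/32" % addr


if __name__ == "__main__":
    # Addresses below come from the reserved documentation ranges.
    print(ssh_rule_needed("203.0.113.7", ["198.51.100.0/24"]))
    print(ssh_rule_needed("203.0.113.7", ["203.0.113.7/32"]))
```

Returning a single-host /32 keeps the group as narrow as possible, which is the point of restricting access per IP.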
> 
> HTH,
> 
> ~Justin
> 
> On 08/05/2011 03:05 PM, josh katz wrote:
>> This error occurred after I had submitted 10 jobs, 2 of which were
>> never completed. So I deleted them and tried again, but those jobs
>> were also never finished. Then when I tried to restart the cluster
>> this error appeared and asked me to send it. Thus this email.
> 
>> Thanks, Josh
> 
> 
> _______________________________________________
> StarCluster mailing list
> StarCluster at mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.17 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk5UGbwACgkQ4llAkMfDcrnn3QCfaKbZ3fi8axgSO2g9N1xpyaUA
Rk8AniiSgpdfgy8CkU7hZ1aGbYi+qi23
=AFu+
-----END PGP SIGNATURE-----


