[StarCluster] jobs on slave nodes disappear

Tue Jan 3 12:44:43 EST 2012

Hi Liang,

Ahh, I think I know what's happening. You're submitting the jobs as root
from root's home folder, which is not NFS-shared. That's why you don't
see any results, output files, error files, etc. unless the job
happens to be scheduled on the master node, which you're currently
logged into.
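
If you want to confirm this, the job's stdout/stderr files should be
sitting on the local disk of the node that ran it. A rough sketch,
assuming SGE's default output naming (<jobname>.o<jobid> and
<jobname>.e<jobid>), the job name single.sh, and that the files landed in
/root on node006 (the host shown in your qacct output). The helper name
is made up for illustration:

```shell
# Print the default SGE stdout/stderr file names for a job.
# Usage: sge_output_files <job_name> <job_id>
sge_output_files() {
    printf '%s.o%s\n%s.e%s\n' "$1" "$2" "$1" "$2"
}

# Example (not run here): look for job 23's files on the node that ran it.
# The /root location is an assumption; adjust it if the job was submitted
# with -cwd from a different directory.
#   for f in $(sge_output_files single.sh 23); do
#       ssh node006 ls -l "/root/$f"
#   done
```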

If you switch to the CLUSTER_USER and submit the job from CLUSTER_USER's
$HOME folder, you'll probably find everything works fine. This is because
that $HOME is NFS-shared across the cluster by default, so no matter which
node the job gets scheduled on, you'll be able to see the results from any
node in the cluster.
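
Something along these lines should do it. This is only a sketch: it
assumes CLUSTER_USER is sgeadmin (substitute whatever your config uses),
that your script is called single.sh, and the ~/jobs directory is made
up. The is_under_dir and submit_from_home helpers are hypothetical; they
just guard against submitting as root or from outside the shared home
directory:

```shell
# Return 0 if directory $1 is $2 itself or lies somewhere below it.
is_under_dir() {
    case "$1/" in
        "$2"/*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Hypothetical wrapper: submit only as a non-root user from inside $HOME.
submit_from_home() {
    if [ "$(id -u)" -eq 0 ]; then
        echo "don't submit as root; switch to the cluster user first" >&2
        return 1
    fi
    if ! is_under_dir "$PWD" "$HOME"; then
        echo "$PWD is not under $HOME; results may not be visible" >&2
        return 1
    fi
    # -cwd: run the job in the current directory, so output files land there
    qsub -cwd "$@"
}

# Example session on the master (not run here):
#   su - sgeadmin
#   mkdir -p ~/jobs && cd ~/jobs
#   cp /path/to/single.sh .
#   submit_from_home single.sh
```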

HTH,

~Justin

On 12/31/2011 03:18 PM, liang cheng wrote:
> Hi Justin,
>  
> Thanks for your reply. There is no error log or output log, even when I
> use the "-e" or "-o" options.
>  
> I created a cluster with one master and 10 slaves. I made a minor change
> on the master node and ran "starcluster createimage i-xxxx AAA BBB",
> where "i-xxxx" is the instance ID of the master. After I got ami-yyyy, I
> ran "starcluster start ami-yyyy". I found that all jobs submitted to slave
> nodes finish instantly, as you can see in the log I sent earlier. The
> jobs on the master node run normally.
>  
> I haven't used the "restart" command but will give it a try.
>  
> -Liang
> 
> On Sat, Dec 31, 2011 at 12:03 PM, Justin Riley <jtriley at mit.edu
> <mailto:jtriley at mit.edu>> wrote:
> 
> Hi Liang,
> 
> Is this happening consistently even after restarting the cluster using
> "starcluster restart mycluster"? Also, is there anything in your
> job(s) error logs? Given the output you provided these would most
> likely be located in the directory you submitted the job from and
> should be named something like "single.sh.e23".
> 
> ~Justin
> 
> 
> On 12/30/2011 08:58 PM, liang cheng wrote:
>> Greetings !
> 
>> I created a StarCluster cluster on EC2 and use qsub to submit jobs. It
>> used to work well. Since this afternoon, after I requested
>> additional EC2 instances from Amazon, a problem has appeared.
> 
>> Only the jobs submitted to the master node are executed. Other
>> jobs disappear almost immediately. Some diagnostics are below. Any
>> help is appreciated!
> 
>> Happy New Year !
> 
> 
>> root@master:/# qacct -j 23
>> ==============================================================
>> qname        all.q
>> hostname     node006
>> group        root
>> owner        root
>> project      NONE
>> department   defaultdepartment
>> jobname      single.sh out 3
>> jobnumber    23
>> taskid       undefined
>> account      sge
>> priority     0
>> qsub_time    Sat Dec 31 01:38:32 2011
>> start_time   Sat Dec 31 01:38:39 2011
>> end_time     Sat Dec 31 01:38:39 2011
>> granted_pe   NONE
>> slots        1
>> failed       0
>> exit_status  0
>> ru_wallclock 0
>> ru_utime     0.010
>> ru_stime     0.010
>> ru_maxrss    2276
>> ru_ixrss     0
>> ru_ismrss    0
>> ru_idrss     0
>> ru_isrss     0
>> ru_minflt    2648
>> ru_majflt    0
>> ru_nswap     0
>> ru_inblock   0
>> ru_oublock   272
>> ru_msgsnd    0
>> ru_msgrcv    0
>> ru_nsignals  0
>> ru_nvcsw     12
>> ru_nivcsw    3
>> cpu          0.020
>> mem          0.000
>> io           0.000
>> iow          0.000
>> maxvmem      0.000
>> arid         undefined
> 
>> =========================
> 
>> Thanks, -Liang
> 