[StarCluster] jobs on slave nodes disappear

liang cheng liang.cheng at gmail.com
Sat Dec 31 15:18:33 EST 2011


Hi Justin,

Thanks for your reply. There is no error log or output log, even when I use
the "-e" or "-o" option.

I created a cluster with one master and 10 slaves. I made a minor change on
the master node and ran "starcluster createimage i-xxxx AAA BBB", where
"i-xxxx" is the instance id of the master. After I got ami-yyyy, I ran
"starcluster start ami-yyyy". I found that all jobs submitted to the slave
nodes finish instantly, as you can see in the log I sent earlier. The jobs on
the master node run normally.
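
To make the symptom easier to spot across many jobs, here is a rough sketch of a check on the accounting record. It reads "qacct -j <id>" style output and flags a job whose wallclock time is 0 while its exit status is also 0, which is the pattern in the record quoted below. The function name check_instant_finish is made up for illustration, and the sample record only copies four fields from that output; it assumes qacct prints one field per line.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE or StarCluster): reads
# `qacct -j <id>` style output on stdin and reports whether the job
# appears to have ended the moment it started.
check_instant_finish() {
    awk '
        $1 == "hostname"     { host = $2 }
        $1 == "jobnumber"    { job = $2 }
        $1 == "exit_status"  { status = $2 }
        $1 == "ru_wallclock" { wall = $2 }
        END {
            if (job == "") { print "no job record found"; exit 1 }
            # Adding 0 forces a numeric comparison of the field values.
            if (wall + 0 == 0 && status + 0 == 0)
                printf "job %s on %s: finished instantly (wallclock 0, exit 0)\n", job, host
            else
                printf "job %s on %s: wallclock %s, exit_status %s\n", job, host, wall, status
        }
    '
}

# Sample record using values from the accounting output in this thread,
# assuming one field per line.
sample='hostname     node006
jobnumber    23
exit_status  0
ru_wallclock 0'

printf '%s\n' "$sample" | check_instant_finish
```

On a live cluster the same function could be fed directly, for example with "qacct -j 23 | check_instant_finish". A job flagged this way ran no real work even though SGE recorded it as successful, which suggests looking at the job script and its environment on that node rather than at the scheduler.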

I haven't used the "restart" command yet, but I will give it a try.

-Liang

On Sat, Dec 31, 2011 at 12:03 PM, Justin Riley <jtriley at mit.edu> wrote:

> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi Liang,
>
> Is this happening consistently even after restarting the cluster using
> "starcluster restart mycluster"? Also, is there anything in your
> job(s) error logs? Given the output you provided these would most
> likely be located in the directory you submitted the job from and
> should be named something like "single.sh.e23".
>
> ~Justin
>
>
> On 12/30/2011 08:58 PM, liang cheng wrote:
> > Greetings!
> >
> > I created a StarCluster cluster on EC2 and use qsub to submit jobs. It
> > used to work well. Since this afternoon, after I requested
> > additional EC2 instances from Amazon, the following issue has appeared.
> >
> > Only the jobs submitted to the master node are executed. Other
> > jobs disappear almost immediately. Some diagnostics are below. Any
> > help is appreciated!
> >
> > Happy New Year!
> >
> >
> > root@master:/# qacct -j 23
> > ==============================================================
> > qname        all.q
> > hostname     node006
> > group        root
> > owner        root
> > project      NONE
> > department   defaultdepartment
> > jobname      single.sh out 3
> > jobnumber    23
> > taskid       undefined
> > account      sge
> > priority     0
> > qsub_time    Sat Dec 31 01:38:32 2011
> > start_time   Sat Dec 31 01:38:39 2011
> > end_time     Sat Dec 31 01:38:39 2011
> > granted_pe   NONE
> > slots        1
> > failed       0
> > exit_status  0
> > ru_wallclock 0
> > ru_utime     0.010
> > ru_stime     0.010
> > ru_maxrss    2276
> > ru_ixrss     0
> > ru_ismrss    0
> > ru_idrss     0
> > ru_isrss     0
> > ru_minflt    2648
> > ru_majflt    0
> > ru_nswap     0
> > ru_inblock   0
> > ru_oublock   272
> > ru_msgsnd    0
> > ru_msgrcv    0
> > ru_nsignals  0
> > ru_nvcsw     12
> > ru_nivcsw    3
> > cpu          0.020
> > mem          0.000
> > io           0.000
> > iow          0.000
> > maxvmem      0.000
> > arid         undefined
> >
> > =========================
> >
> > Thanks, -Liang
>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v2.0.17 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>
> iEYEARECAAYFAk7/aooACgkQ4llAkMfDcrmFegCfULuLAaDIrEvDi1257HZR3ico
> B5wAn2rGWD5D9c4rETIq07d6jKq/jrCs
> =pb1b
> -----END PGP SIGNATURE-----
>