[StarCluster] Eqw errors in SGE with default starcluster configuration

Josh Moore jlmo at cs.cornell.edu
Tue Feb 7 11:27:06 EST 2012


Ah ok, the problem is obvious now:

02/07/2012 16:20:31|worker|master|W|job 1.1 failed on host node001 general changing into working directory because: 02/07/2012 16:20:30 [0:1568]: error: can't chdir to /root/lme: No such file or directory
02/07/2012 16:20:31|worker|master|W|rescheduling job 1.1
02/07/2012 16:20:31|worker|master|W|job 3.1 failed on host node001 general changing into working directory because: 02/07/2012 16:20:30 [0:1569]: error: can't chdir to /root/lme: No such file or directory
02/07/2012 16:20:31|worker|master|W|rescheduling job 3.1

The directory structure isn't there on the other nodes. Like I said, this
script runs fine on my department's cluster, which also runs SGE, without any
extra steps to set up directories on the nodes. What do I need to configure so
that the directory structure is replicated across the nodes?
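
For what it's worth, my working theory is that the job's working directory
just needs to live on a path every node can see. As I understand the
StarCluster defaults, /home is NFS-shared from the master to all nodes while
/root is local to each instance, so something like this should avoid the
chdir failure (an untested sketch; /home/sgeadmin/lme and run_job.sh are
placeholder names):

    # move the job directory onto the NFS-shared mount and submit from there
    mkdir -p /home/sgeadmin/lme
    cp -r /root/lme/. /home/sgeadmin/lme/
    cd /home/sgeadmin/lme
    qsub -cwd run_job.sh   # -cwd makes SGE run the job in this (shared) directory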

Best,
Josh

On Tue, Feb 7, 2012 at 11:02 AM, Dustin Machi <dmachi at vbi.vt.edu> wrote:

> I'd check out /opt/sge6/default/spool/qmaster/messages to see if there is
> anything useful about what is happening there. It will generally tell you
> why it's not queuing an additional job. Are the parallel environments set
> up the same between your two clusters?
>
> Dustin
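
(Checking that file is what produced the log excerpt above. For the archives,
the commands were roughly:

    tail -n 50 /opt/sge6/default/spool/qmaster/messages   # recent qmaster warnings
    qconf -spl                                            # list parallel environments
    qconf -sp orte                                        # show one PE's settings

qconf -sp takes a PE name reported by qconf -spl; orte is my assumption here,
since that is the PE StarCluster configures by default.)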
>
> On Feb 7, 2012, at 2:30 AM, Josh Moore wrote:
>
> > I tried submitting a bunch of jobs using qsub with a script that works
> > fine on another (non-Amazon) cluster's SGE configuration. But on a
> > cluster configured with StarCluster, only the first 8 enter the queue
> > without error, and all of those are immediately executed on the master
> > node. Even if I delete one of the jobs on the master node, another one
> > never takes its place. The cluster has 8 c1.xlarge nodes (8 cores each).
> > Here is the output of qconf -ssconf:
> >
> > algorithm                         default
> > schedule_interval                 0:0:15
> > maxujobs                          0
> > queue_sort_method                 load
> > job_load_adjustments              np_load_avg=0.50
> > load_adjustment_decay_time        0:7:30
> > load_formula                      np_load_avg
> > schedd_job_info                   false
> > flush_submit_sec                  0
> > flush_finish_sec                  0
> > params                            none
> > reprioritize_interval             0:0:0
> > halftime                          168
> > usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
> > compensation_factor               5.000000
> > weight_user                       0.250000
> > weight_project                    0.250000
> > weight_department                 0.250000
> > weight_job                        0.250000
> > weight_tickets_functional         0
> > weight_tickets_share              0
> > share_override_tickets            TRUE
> > share_functional_shares           TRUE
> > max_functional_jobs_to_schedule   200
> > report_pjob_tickets               TRUE
> > max_pending_tasks_per_job         50
> > halflife_decay_list               none
> > policy_hierarchy                  OFS
> > weight_ticket                     0.010000
> > weight_waiting_time               0.000000
> > weight_deadline                   3600000.000000
> > weight_urgency                    0.100000
> > weight_priority                   1.000000
> > max_reservation                   0
> > default_duration                  INFINITY
> >
> > I can't figure out how to change schedd_job_info to true to find out
> > more about the error message...
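
(In case someone finds this thread later: the scheduler configuration is
edited interactively, something along the lines of:

    qconf -msconf   # opens the scheduler config in $EDITOR;
                    # change "schedd_job_info false" to "schedd_job_info true"

With that set, qstat -j <jobid> should print a "scheduling info" section
explaining why a pending job isn't being dispatched. Details may vary with
your SGE version.)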