[StarCluster] Eqw errors in SGE with default starcluster configuration

Tue Feb 7 11:02:33 EST 2012

I'd check /opt/sge6/default/spool/qmaster/messages to see whether it says anything useful about what is happening.  It will generally tell you why it's not queuing an additional job.  Are the parallel environments set up the same way on your two clusters?
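
Something along these lines would pull just the problem entries out of that file. It is only a sketch: the function name sge_recent_problems is made up for illustration, and it assumes the qmaster messages file uses the usual pipe-delimited layout (date time|thread|host|level|text) with E, W and C as the error, warning and critical levels. If your file looks different, adjust the field test.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE or StarCluster): print the most recent
# warning/error/critical lines from a qmaster messages file.
# Assumption: each line is "date time|thread|host|level|text".
sge_recent_problems() {
    # $1: path to the messages file (defaults to the path mentioned above)
    # $2: how many matching lines to show (defaults to 20)
    log_file="${1:-/opt/sge6/default/spool/qmaster/messages}"
    max_lines="${2:-20}"

    if [ ! -r "$log_file" ]; then
        echo "cannot read $log_file" >&2
        return 1
    fi

    # Keep only lines whose 4th pipe-separated field is E, W or C,
    # then show the last $max_lines of them.
    awk -F'|' '$4 == "E" || $4 == "W" || $4 == "C"' "$log_file" | tail -n "$max_lines"
}
```

Run on the master as "sge_recent_problems" with no arguments, or pass another path and line count. For a job that is already in Eqw, "qstat -j <job_id>" should also print the error reason, and "qconf -msconf" opens the scheduler configuration in an editor, which is where schedd_job_info can be set to true.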

Dustin

On Feb 7, 2012, at 2:30 AM, Josh Moore wrote:

> I tried submitting a bunch of jobs using qsub with a script that works fine on another (non-Amazon) cluster's configuration of SGE. But on a cluster configured with StarCluster, only the first 8 enter the queue without error, and all of those are immediately executed on the master node. The cluster has 8 c1.xlarge nodes with 8 cores each. Even if I delete one of the jobs on the master node, another one never takes its place. Here is the output of qconf -ssconf:
> 
> algorithm                         default
> schedule_interval                 0:0:15
> maxujobs                          0
> queue_sort_method                 load
> job_load_adjustments              np_load_avg=0.50
> load_adjustment_decay_time        0:7:30
> load_formula                      np_load_avg
> schedd_job_info                   false
> flush_submit_sec                  0
> flush_finish_sec                  0
> params                            none
> reprioritize_interval             0:0:0
> halftime                          168
> usage_weight_list                 cpu=1.000000,mem=0.000000,io=0.000000
> compensation_factor               5.000000
> weight_user                       0.250000
> weight_project                    0.250000
> weight_department                 0.250000
> weight_job                        0.250000
> weight_tickets_functional         0
> weight_tickets_share              0
> share_override_tickets            TRUE
> share_functional_shares           TRUE
> max_functional_jobs_to_schedule   200
> report_pjob_tickets               TRUE
> max_pending_tasks_per_job         50
> halflife_decay_list               none
> policy_hierarchy                  OFS
> weight_ticket                     0.010000
> weight_waiting_time               0.000000
> weight_deadline                   3600000.000000
> weight_urgency                    0.100000
> weight_priority                   1.000000
> max_reservation                   0
> default_duration                  INFINITY
> 
> I can't figure out how to change schedd_job_info to true to find out more about the error message...
> _______________________________________________
> StarCluster mailing list
> StarCluster at mit.edu
> http://mailman.mit.edu/mailman/listinfo/starcluster