[StarCluster] StarCluster LoadBalancer

Sergio Mafra sergiohmafra at gmail.com
Wed Jan 23 08:39:56 EST 2013


Hi fellows,

I started testing StarCluster's load balancer, but something is not working as expected.
The load balancer tells me that there are no jobs in the OGE queue.
Here is the full history:

1- Launched a 5-node cluster (Ubuntu HVM) on cc1.4xlarge instances
-> Used the mpich2 plugin (mpich2 v1.4.1, native)

2- Submitted an application job to OGE (mympiapp):
$ qsub -N Newaveinth -b y -pe orte 80 -cwd mpiexec -n 80 mympiapp

3- Checked the queue:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
      2 0.55500 Newaveinth sgeadmin     r     01/23/2013 13:29:48 all.q@master                      80
sgeadmin@master:~/pmo0113$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@master                   BIP   0/16/16        3.06     linux-x64
      2 0.55500 Newaveinth sgeadmin     r     01/23/2013 13:29:48    16
---------------------------------------------------------------------------------
all.q@node001                  BIP   0/16/16        2.87     linux-x64
      2 0.55500 Newaveinth sgeadmin     r     01/23/2013 13:29:48    16
---------------------------------------------------------------------------------
all.q@node002                  BIP   0/16/16        2.74     linux-x64
      2 0.55500 Newaveinth sgeadmin     r     01/23/2013 13:29:48    16
---------------------------------------------------------------------------------
all.q@node003                  BIP   0/16/16        2.86     linux-x64
      2 0.55500 Newaveinth sgeadmin     r     01/23/2013 13:29:48    16
---------------------------------------------------------------------------------
all.q@node004                  BIP   0/16/16        2.11     linux-x64
      2 0.55500 Newaveinth sgeadmin     r     01/23/2013 13:29:48    16
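
As a cross-check on the listing above, here is a minimal Python sketch that tallies jobs by state from plain `qstat` output. It is not part of StarCluster; the helper name `count_job_states` is hypothetical, and the sample text is the single job row shown above. In SGE, state `r` means running and `qw` means queued and waiting, so a job in state `r` would not be counted as queued.

```python
from collections import Counter


def count_job_states(qstat_text):
    """Tally jobs by SGE state from plain `qstat` output (hypothetical helper)."""
    states = Counter()
    for line in qstat_text.splitlines():
        fields = line.split()
        # Job rows start with a numeric job-ID; header and rule lines do not.
        if len(fields) >= 5 and fields[0].isdigit():
            states[fields[4]] += 1  # fifth column is the job state
    return states


sample = """\
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
      2 0.55500 Newaveinth sgeadmin     r     01/23/2013 13:29:48 all.q@master                      80
"""

counts = count_job_states(sample)
print("running:", counts["r"], "queued:", counts["qw"])
```

For the job above this reports one running job and no queued ones, since the 80 requested slots are already spread over the 5 hosts at 16 slots each.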

4- Tried to load-balance the cluster launched with 5 nodes:

ubuntu@ip-10-112-98-159:~$ starcluster loadbalance mycluster --max_nodes=6
StarCluster - (http://web.mit.edu/starcluster) (v. 0.93.3)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster at mit.edu

>>> Starting load balancer (Use ctrl-c to exit)
Maximum cluster size: 6
Minimum cluster size: 1
Cluster growth rate: 1 nodes/iteration

>>> Loading full job history
Execution hosts: 5
Queued jobs: 0
Avg job duration: 1699 secs
Avg job wait time: 8 secs
Last cluster modification time: 2013-01-23 13:32:33
>>> Cluster was modified less than 180 seconds ago
>>> Waiting for cluster to stabilize...
>>> Sleeping...(looping again in 60 secs)

>>> Loading full job history
Execution hosts: 5
Queued jobs: 0 (<-- Why 0 jobs ???)
Avg job duration: 1699 secs
Avg job wait time: 8 secs
Last cluster modification time: 2013-01-23 13:32:33
>>> Cluster was modified less than 180 seconds ago
>>> Waiting for cluster to stabilize...
>>> Sleeping...(looping again in 60 secs)

^C (<-- Ctrl-C and a lot of messages...)
Traceback (most recent call last):
  File "/usr/local/bin/starcluster", line 9, in <module>
    load_entry_point('StarCluster==0.93.3', 'console_scripts', 'starcluster')()
  File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.93.3-py2.7.egg/starcluster/cli.py", line 312, in main
    StarClusterCLI().main()
  File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.93.3-py2.7.egg/starcluster/cli.py", line 255, in main
    sc.execute(args)
  File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.93.3-py2.7.egg/starcluster/commands/loadbalance.py", line 90, in execute
    lb.run(cluster)
  File "/usr/local/lib/python2.7/dist-packages/StarCluster-0.93.3-py2.7.egg/starcluster/balancers/sge/__init__.py", line 619, in run
    time.sleep(self.polling_interval)
KeyboardInterrupt
Exception in thread Thread-1 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
  File "/usr/local/lib/python2.7/dist-packages/ssh-1.7.13-py2.7.egg/ssh/transport.py", line 1602, in run
<type 'exceptions.AttributeError'>: 'NoneType' object has no attribute 'error'
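
The "Cluster was modified less than 180 seconds ago" messages in the log suggest a simple time-window check before each balancing pass. Below is a minimal Python sketch of such a check. It is not StarCluster's actual code: the function name `is_stabilizing` and the comparison are my assumptions, and only the 180-second figure and the timestamp are taken from the log.

```python
from datetime import datetime, timedelta

STABILIZATION_SECS = 180  # figure taken from the log message


def is_stabilizing(last_modified, now, window_secs=STABILIZATION_SECS):
    """Return True while the cluster changed less than window_secs ago.

    Hypothetical helper, not a StarCluster function.
    """
    return now - last_modified < timedelta(seconds=window_secs)


# "Last cluster modification time" from the log
last_mod = datetime(2013, 1, 23, 13, 32, 33)

print(is_stabilizing(last_mod, last_mod + timedelta(seconds=60)))   # inside the window
print(is_stabilizing(last_mod, last_mod + timedelta(seconds=240)))  # outside the window
```

Under this assumption, both polling iterations shown in the log would fall inside the window, so no scaling decision would be made yet in either of them.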

All the best,

Sergio