[Starcluster] Load Balancer Problems

Tue Aug 3 10:15:59 EDT 2010

Raj,

> I know what that error is - when no job has completed, qacct returns 1
> instead of 0. I am looking for a way to get rid of that message. It
> has no effect on the balancer.

Passing ignore_exit_status=True to the ssh execute call should suppress
that message:

master.ssh.execute('qacct .....', ignore_exit_status=True)

~Justin
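
For readers unfamiliar with the flag, here is a minimal local sketch of the
idea behind ignore_exit_status. It is not StarCluster's ssh code: the execute
function below is a hypothetical stand-in built on subprocess, and the child
process that exits with status 1 only imitates qacct when no job has
completed.

```python
import logging
import subprocess
import sys

log = logging.getLogger("balancer_example")


def execute(command, ignore_exit_status=False):
    """Run a command (list of arguments) and return (exit_status, output_lines).

    Hypothetical helper for illustration only. A nonzero exit status is
    logged as an error unless ignore_exit_status is True.
    """
    proc = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    if proc.returncode != 0 and not ignore_exit_status:
        log.error("command %s failed with status %d",
                  " ".join(command), proc.returncode)
    return proc.returncode, proc.stdout.splitlines()


# Stand-in for "qacct with no completed jobs": prints nothing, exits with 1.
fake_qacct = [sys.executable, "-c", "import sys; sys.exit(1)"]

status, lines = execute(fake_qacct, ignore_exit_status=True)
```

With the flag set, the caller still receives the exit status and can treat
"status 1 with no output" as "no completed jobs yet" without an ERROR line
appearing in the log.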

________________________________________
From: rqbanerjee at gmail.com [rqbanerjee at gmail.com] On Behalf Of Rajat Banerjee [rbanerj at fas.harvard.edu]
Sent: Monday, August 02, 2010 4:53 PM
To: Amaro Taylor
Cc: Justin Riley; starcluster at mit.edu
Subject: Re: [Starcluster] Load Balancer Problems

Hey Amaro,
Great. Glad it is working for you.

I know what that error is - when no job has completed, qacct returns 1
instead of 0. I am looking for a way to get rid of that message. It
has no effect on the balancer.

Rajat

On Mon, Aug 2, 2010 at 4:51 PM, Amaro Taylor
<amaro.taylor at resgroupinc.com> wrote:
> Hey Rajat,
>
> Just to update you on the testing progress: I'm currently running a job and
> it seems to be working as expected. We also got one error that didn't seem to
> change anything: ssh.py:248 - ERROR - command source /etc/profile && qacct
> -j -b 201008021652 failed with status 1. The balancer looks to be working
> great.
> Best,
> Amaro Taylor
> RES Group, Inc.
> 1 Broadway • Cambridge, MA 02142 • U.S.A.
> Tel: 310 880-1906 (Direct) • Fax: 617-812-8042 • Email:
> amaro.taylor at resgroupinc.com
>
> Disclaimer: The information contained in this email message may be
> confidential. Please be careful if you forward, copy or print this message.
> If you have received this email in error, please immediately notify the
> sender and delete the message.
>
>
> On Mon, Aug 2, 2010 at 12:59 PM, Amaro Taylor <amaro.taylor at resgroupinc.com>
> wrote:
>>
>> Hey Guys,
>>
>> As far as the node idle time goes, I think we just misinterpreted what was
>> happening. The modulus statement was what we wanted.
>>
>> Thanks
>> Amaro Taylor
>>
>>
>> On Mon, Aug 2, 2010 at 12:30 PM, Justin Riley <jtriley at mit.edu> wrote:
>>>
>>> -----BEGIN PGP SIGNED MESSAGE-----
>>> Hash: SHA1
>>>
>>> Raj,
>>>
>>> > 2. What is your preference for how long a job should stay idle before
>>> > being killed?
>>>
>>> I think you meant *node* not job...
>>>
>>> > I usually don't check how long it has been idle. If it
>>> > is idle now and the queue is empty then kill it. I could add code to
>>> > check how long it has been idle, if it seems useful. Is there a use
>>> > case?
>>>
>>> Also, the node must be up for the "majority of the hour" before it can
>>> be considered for removal. This provides flexibility for the queue to
>>> stabilize and also saves money given that you pay for the entire
>>> instance hour anyway.
>>>
>>> As far as the "code to check how long a node has been idle" goes, I'm not
>>> sure I understand the use case/context either. Mind bringing the list up
>>> to date on this discussion?
>>>
>>> ~Justin
>>>
>>> On 08/02/2010 02:38 PM, Rajat Banerjee wrote:
>>> > Hey Amaro,
>>> > Cool thanks. I called Brian and got info regarding the array of jobs.
>>> > I checked in some test code that works fine on my (simple) cluster
>>> > with qsub -t 1-20:1. I'd appreciate it if you'd test and let me know
>>> > how it goes. Just committed to github:
>>> >
>>> > http://github.com/rqbanerjee/StarCluster/commit/17998a68feab3d1440aa5d9edc2e74697e43ef54
>>> >
>>> > Making requests during a business day has its rewards :)
>>> >
>>> > Regarding the host that had been inactive for a short time:
>>> > 1. If the "tasks" field was properly recognized, as it is now, the
>>> > queue should be recognized as full, and that node probably wouldn't
>>> > have been killed.
>>> > 2. What is your preference for how long a job should stay idle before
>>> > being killed? I usually don't check how long it has been idle. If it
>>> > is idle now and the queue is empty then kill it. I could add code to
>>> > check how long it has been idle, if it seems useful. Is there a use
>>> > case?
>>> >
>>> > Thanks,
>>> > Rajat
>>> > _______________________________________________
>>> > Starcluster mailing list
>>> > Starcluster at mit.edu
>>> > http://mailman.mit.edu/mailman/listinfo/starcluster
>>>
>>> -----BEGIN PGP SIGNATURE-----
>>> Version: GnuPG v2.0.15 (GNU/Linux)
>>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>>>
>>> iEYEARECAAYFAkxXHOcACgkQ4llAkMfDcrnbvACghwwDpZn2uMUcr88lqH/bFdAr
>>> MAIAn39LoXOe4j1iJ0x0crm4IsSI5kZC
>>> =TQh9
>>> -----END PGP SIGNATURE-----
>>
>
>
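
The removal rule discussed in the quoted messages (a node is idle, the queue
is empty, and the node has been up for the "majority of the hour") can be
sketched as below. This is only an illustration under stated assumptions, not
the balancer's actual code: the function names are hypothetical, and the
45-minute threshold is an assumed value, since the thread does not say what
counts as the "majority" of an instance hour.

```python
from datetime import datetime, timedelta


def minutes_into_billing_hour(launch_time, now):
    """Minutes elapsed in the node's current instance hour (0-59)."""
    elapsed_minutes = int((now - launch_time).total_seconds() // 60)
    # The modulus wraps total uptime onto the current paid-for hour.
    return elapsed_minutes % 60


def can_remove_node(node_is_idle, queue_is_empty, launch_time, now,
                    min_minutes_into_hour=45):
    """Return True if the node is a candidate for removal.

    min_minutes_into_hour is an assumed threshold for illustration.
    """
    if not (node_is_idle and queue_is_empty):
        return False
    return minutes_into_billing_hour(launch_time, now) >= min_minutes_into_hour


launch = datetime(2010, 8, 2, 12, 0)

# 110 minutes of uptime is 50 minutes into the second instance hour.
late_in_hour = can_remove_node(True, True, launch,
                               launch + timedelta(minutes=110))

# 70 minutes of uptime is only 10 minutes into the second instance hour.
early_in_hour = can_remove_node(True, True, launch,
                                launch + timedelta(minutes=70))
```

Waiting until late in the hour costs nothing extra, because the whole
instance hour is paid for anyway, and it gives the queue time to stabilize
before a node is taken away.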