[StarCluster] grid engine not initialize on gpu hvm image
Jesse Lu
jesselu at stanford.edu
Mon Aug 27 17:59:11 EDT 2012
Okay, figured out that using ami-999d49f0 for non-HVM master
and ami-4583572c for HVM nodes makes SGE work well. It's my fault for not
looking at the available public starcluster images carefully enough.
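
For anyone hitting the same problem, here is a minimal sketch of what the
relevant cluster section of the StarCluster config might look like. The two
AMI IDs are the ones mentioned above; the cluster name, key name, cluster
size, and both instance types are placeholders I'm assuming for illustration,
so adjust them to your own setup.

```ini
# Sketch only: section name, KEYNAME, CLUSTER_SIZE and the instance
# types are assumed placeholders, not values from my actual config.
[cluster gpucluster]
KEYNAME = mykey
CLUSTER_SIZE = 2
CLUSTER_USER = sgeadmin
# Non-HVM StarCluster image for the master
MASTER_IMAGE_ID = ami-999d49f0
MASTER_INSTANCE_TYPE = m1.large
# HVM StarCluster image for the compute nodes
NODE_IMAGE_ID = ami-4583572c
NODE_INSTANCE_TYPE = cg1.4xlarge
```

If MASTER_IMAGE_ID is left out, the master falls back to NODE_IMAGE_ID.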
On Mon, Aug 27, 2012 at 2:26 PM, Jesse Lu <jesselu at stanford.edu> wrote:
> Sorry for the spam, but here's another follow-up.
>
> I found that this only happens when I use a non-HVM EBS AMI for the
> master, but an HVM EBS AMI for the nodes.
>
> This is probably because StarCluster copies the sge install from the
> master to the nodes, and this doesn't play nice when the nodes are CentOS
> based but the master is Ubuntu based.
>
> Any ideas for a work-around?
>
>
> On Mon, Aug 27, 2012 at 2:07 PM, Jesse Lu <jesselu at stanford.edu> wrote:
>
>> Follow-up,
>>
>> Here are the contents of the installation log file (for grid engine)
>>
>> cat /opt/sge6/default/common/install_logs/execd_install_node001_2012-08-27_14:04:29.log
>>
>>
>> Your $SGE_ROOT directory: /opt/sge6
>>
>>
>> Using cell: >default<
>>
>>
>>
>>
>>
>> Using local execd spool directory
>> [/opt/sge6/default/spool/exec_spool_local]
>>
>> Creating local configuration for host >node001<
>> sgeadmin at node001 modified "node001" in configuration list
>> Local configuration for host >node001< created.
>>
>> Host >master< already in submit host list!
>> Host >node001< already in submit host list!
>>
>>
>> starting sge_execd
>>
>>
>> No modification because "node001" already exists in "hostlist" of
>> "hostgroup"
>> root at node001 modified "@allhosts" in host group list
>> root at node001 modified "all.q" in cluster queue list
>>
>> got select error: Connection refused
>> got select error: closing "node001/execd/1"
>> Execd on host node001 is not started!
>>
>>
>> On Mon, Aug 27, 2012 at 1:37 PM, Jesse Lu <jesselu at stanford.edu> wrote:
>>
>>> ami-12b6477b produces the following error on cluster startup:
>>>
>>> !!! ERROR - command 'cd /opt/sge6 && TERM=rxvt ./inst_sge -x -noremote
>>> -auto ./ec2_sge.conf' failed with status 1
>>>
>>> I'm guessing the sge6 installation is faulty? Can anyone help? Thanks!
>>>
>>> Jesse
>>>
>>
>>
>