[StarCluster] Twilight Zone: sge_gethostbyname failed

Rayson Ho raysonlogin at gmail.com
Fri Dec 27 19:43:00 EST 2013


/etc/sysconfig/network is read at boot time, and possibly again after DHCP...

To see if that is the issue, set HOSTNAME back to master in
/etc/sysconfig/network, and also run "hostname master" as root.
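
As a rough illustration of that check, here is a minimal Python sketch. It is
not part of StarCluster or Grid Engine, and the function names
set_sysconfig_hostname and hostname_resolves are made up for this example. The
first helper rewrites the HOSTNAME key in the text of /etc/sysconfig/network;
the second tests whether a name resolves, which is the kind of lookup that
fails with HOST_NOT_FOUND further down in this thread:

```python
import re
import socket


def set_sysconfig_hostname(text, new_name):
    """Return the contents of /etc/sysconfig/network with HOSTNAME=new_name.

    `text` is the current file contents as a string; nothing is written
    to disk here, so the caller decides when to save the result.
    """
    line = "HOSTNAME=%s" % new_name
    pattern = r"^HOSTNAME=.*$"
    if re.search(pattern, text, flags=re.M):
        # Replace the existing key (a function avoids escape handling).
        return re.sub(pattern, lambda m: line, text, flags=re.M)
    # No HOSTNAME key yet: append one on its own line.
    if text and not text.endswith("\n"):
        text += "\n"
    return text + line + "\n"


def hostname_resolves(name=None):
    """Return True if `name` (default: the current hostname) resolves."""
    if name is None:
        name = socket.gethostname()
    try:
        socket.gethostbyname(name)
        return True
    except OSError:
        # Covers socket.gaierror / socket.herror (e.g. host not found).
        return False
```

On the master, one would read /etc/sysconfig/network, pass its contents
through set_sysconfig_hostname(text, "master"), write the result back as root,
and run "hostname master". After that, hostname_resolves() should return True,
assuming /etc/hosts still has the entry for master.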

Rayson

==================================================
Open Grid Scheduler - The Official Open Source Grid Engine
http://gridscheduler.sourceforge.net/
http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html


On Fri, Dec 27, 2013 at 7:40 PM, Lyn Gerner <schedulerqueen at gmail.com> wrote:
> Thanks for digging, Rayson.
>
> So, /etc/sysconfig/network had HOSTNAME=centos-ami when the problem first
> occurred.  I tried resetting it to "master" and then retried the SGE
> commands (qstat, qsub, etc.).  They still failed with the same error at that
> point, so I switched them back, not knowing for sure if they'd been set to
> master and node001 to begin with.
>
> Thanks,
> Lyn
>
>
> On Fri, Dec 27, 2013 at 2:35 PM, Rayson Ho <raysonlogin at gmail.com> wrote:
>>
>> (Updating the list...)
>>
>> The hostname on the master gets reset to centos-ami, which is not
>> resolvable. Thus Grid Engine complains about the hostname issue.
>>
>> Lyn: what is the value of the HOSTNAME key in "/etc/sysconfig/network"
>> on your master instance??
>>
>> Justin & other devs: set_hostname() in node.py works on Ubuntu because
>> Ubuntu uses /etc/hostname, but RHEL (and RHEL-based distros like
>> CentOS, Oracle Linux, Scientific Linux) uses /etc/sysconfig/network,
>> and yet SuSE uses /etc/HOSTNAME!
>>
>> Rayson
>>
>>
>> On Fri, Dec 27, 2013 at 6:39 PM, Lyn Gerner <schedulerqueen at gmail.com>
>> wrote:
>> > I used the Scientific Linux AMI (been a long time, but I found it from
>> > the SC site), and 0.94.3 is my SC version.
>> >
>> >
>> > On Fri, Dec 27, 2013 at 1:36 PM, Rayson Ho <raysonlogin at gmail.com>
>> > wrote:
>> >>
>> >> Hmm, which AMI did you use, and what's the version of SC?
>> >>
>> >> Rayson
>> >>
>> >>
>> >> On Fri, Dec 27, 2013 at 6:33 PM, Lyn Gerner <schedulerqueen at gmail.com>
>> >> wrote:
>> >> > root at AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
>> >> > # /opt/sge6/utilbin/linux-x64/gethostname -name
>> >> > error resolving local host: can't resolve host name (h_errno = HOST_NOT_FOUND)
>> >> >
>> >> > root at AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
>> >> > # hostname
>> >> > centos-ami
>> >> >
>> >> > root at AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
>> >> > # hostname -f
>> >> > hostname: Unknown host
>> >> >
>> >> > What's weird is that I have never mucked with any of this under
>> >> > StarCluster, and have only recently started having problems.  Can't
>> >> > pinpoint any specific event or thing that changed--except that I
>> >> > started leaving the config up for days instead of hours at a stretch.
>> >> >
>> >> > Thanks,
>> >> > Lyn
>> >> >
>> >> >
>> >> > On Fri, Dec 27, 2013 at 1:30 PM, Rayson Ho <raysonlogin at gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> No problem, and I think that's why it is failing. Can you also send
>> >> >> me the output of:
>> >> >>
>> >> >> 1) gethostname -name
>> >> >>
>> >> >> 2) hostname
>> >> >>
>> >> >> 3) hostname -f
>> >> >>
>> >> >> Rayson
>> >> >>
>> >> >>
>> >> >> On Fri, Dec 27, 2013 at 6:27 PM, Lyn Gerner
>> >> >> <schedulerqueen at gmail.com>
>> >> >> wrote:
>> >> >> > My bad:
>> >> >> >
>> >> >> > root at AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
>> >> >> > # /opt/sge6/utilbin/linux-x64/gethostname -all
>> >> >> > error resolving local host: can't resolve host name (h_errno = HOST_NOT_FOUND)
>> >> >> >
>> >> >> > Thanks for any insights,
>> >> >> > Lyn
>> >> >> >
>> >> >> >
>> >> >> > On Fri, Dec 27, 2013 at 1:25 PM, Rayson Ho <raysonlogin at gmail.com>
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> But I need the output of "gethostname", not "gethostbyname"... :-P
>> >> >> >>
>> >> >> >> Rayson
>> >> >> >>
>> >> >> >>
>> >> >> >> On Fri, Dec 27, 2013 at 6:11 PM, Lyn Gerner
>> >> >> >> <schedulerqueen at gmail.com>
>> >> >> >> wrote:
>> >> >> >> > Thanks for the quick response, Rayson.  Output from
>> >> >> >> > gethostbyname is in between the ****s below:
>> >> >> >> >
>> >> >> >> > On Fri, Dec 27, 2013 at 1:04 PM, Rayson Ho
>> >> >> >> > <raysonlogin at gmail.com>
>> >> >> >> > wrote:
>> >> >> >> >>
>> >> >> >> >> What is the output of "gethostname"? (gethostname is shipped
>> >> >> >> >> with SGE in the util dir.)
>> >> >> >> >>
>> >> >> >> >> Rayson
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> On Fri, Dec 27, 2013 at 5:34 PM, Lyn Gerner
>> >> >> >> >> <schedulerqueen at gmail.com>
>> >> >> >> >> wrote:
>> >> >> >> >> > Hi All,
>> >> >> >> >> >
>> >> >> >> >> > Okay, I'm in the Twilight Zone now.  After starting a small
>> >> >> >> >> > cluster on the 23rd, and doing minimal reconfig (qmod -d) to
>> >> >> >> >> > disable the sge_execd on the master and qconf -mq all.q to
>> >> >> >> >> > change some slot counts -- all of which worked fine -- I come
>> >> >> >> >> > back a few days later to find an unusable SGE config:
>> >> >> >> >> >
>> >> >> >> >> > root at AWS-VTMXmaster-w2b ~
>> >> >> >> >> > # qstat -f
>> >> >> >> >> > error: sge_gethostbyname failed
>> >> >> >> >> >
>> >> >> >> >> > /etc/hosts is correct for all its (internal) host addrs:
>> >> >> >> >> >
>> >> >> >> >> > root at AWS-VTMXmaster-w2b ~
>> >> >> >> >> > # cat /etc/hosts
>> >> >> >> >> > 127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
>> >> >> >> >> > ::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
>> >> >> >> >> > 10.250.65.204 master
>> >> >> >> >> > 10.251.30.12 node001
>> >> >> >> >> >
>> >> >> >> >> *****
>> >> >> >> >>
>> >> >> >> >> > The gethostbyname utility works correctly (so does gethostbyaddr):
>> >> >> >> >> >
>> >> >> >> >> > root at AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
>> >> >> >> >> > # /opt/sge6/utilbin/linux-x64/gethostbyname master
>> >> >> >> >> > Hostname: master
>> >> >> >> >> > Aliases:
>> >> >> >> >> > Host Address(es): 10.250.65.204
>> >> >> >> >> >
>> >> >> >> >> > root at AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
>> >> >> >> >> > # /opt/sge6/utilbin/linux-x64/gethostbyname node001
>> >> >> >> >> > Hostname: node001
>> >> >> >> >> > Aliases:
>> >> >> >> >> > Host Address(es): 10.251.30.12
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > ******
>> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> > root at AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
>> >> >> >> >> > # qstat -f
>> >> >> >> >> > error: sge_gethostbyname failed
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> > I went so far as to edit the hostname in /etc/sysconfig/network
>> >> >> >> >> > to contain "master" and "node001" on the two nodes.  Same error.
>> >> >> >> >> >
>> >> >> >> >> > I have been all over the 'net looking for solutions, but have
>> >> >> >> >> > found nothing with a clear resolution.  gridengine.sunsource.net
>> >> >> >> >> > is gone.  The follow-on at http://gridengine.org/pipermail/users/
>> >> >> >> >> > doesn't seem to be searchable, except on an onerous,
>> >> >> >> >> > month-by-month click-thru basis (which hasn't yielded anything
>> >> >> >> >> > useful as I slog thru it).
>> >> >> >> >> >
>> >> >> >> >> > Short of starcluster restart'ing, I'll appreciate anyone's
>> >> >> >> >> > inputs on what to try next.
>> >> >> >> >> >
>> >> >> >> >> > Thanks much,
>> >> >> >> >> > Lyn
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> > _______________________________________________
>> >> >> >> >> > StarCluster mailing list
>> >> >> >> >> > StarCluster at mit.edu
>> >> >> >> >> > http://mailman.mit.edu/mailman/listinfo/starcluster
>> >> >> >> >> >
>> >> >> >> >
>> >> >> >> >
>> >> >> >
>> >> >> >
>> >> >
>> >> >
>> >
>> >
>
>

