[StarCluster] Twilight Zone: sge_gethostbyname failed
Lyn Gerner
schedulerqueen at gmail.com
Fri Dec 27 20:07:21 EST 2013
Thanks very much, Rayson.
Best,
Lyn
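
For reference, the interim workaround Rayson describes in the quoted message below could be scripted roughly as follows. This is only a sketch for RHEL-style nodes: set_sysconfig_hostname is a hypothetical helper name (not part of StarCluster or Grid Engine), and the file path argument is added only so the function can be tried on a scratch copy first.

```shell
#!/bin/sh
# Sketch of the manual workaround: persist the hostname in
# /etc/sysconfig/network so it is still set after the file is next read.
# set_sysconfig_hostname is a hypothetical helper, not a StarCluster or
# Grid Engine command.

set_sysconfig_hostname() {
    name=$1
    file=${2:-/etc/sysconfig/network}
    tmp="$file.tmp.$$"

    # Create the file if it is missing so grep has something to read.
    [ -f "$file" ] || : > "$file"

    # Drop any existing HOSTNAME= line, then append the desired one.
    grep -v '^HOSTNAME=' "$file" > "$tmp"
    echo "HOSTNAME=$name" >> "$tmp"

    # Write back in place so the original file's permissions are kept.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example, as root on the master (use node001 etc. on the other nodes):
#   set_sysconfig_hostname master
#   hostname master
# Then restart the OGS/GE daemons; the restart command depends on how
# Grid Engine was installed, so it is not shown here.
```

The function only edits the file; running "hostname <name>" and restarting Grid Engine remain separate manual steps, as in the original advice.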
On Fri, Dec 27, 2013 at 2:57 PM, Rayson Ho <raysonlogin at gmail.com> wrote:
> We need to change the SC code for RHEL-based distros. Each distro does
> things slightly differently, and that's why you get that behavior.
>
> In the meantime, you might want to go to each node and set the
> hostname by editing /etc/sysconfig/network and running hostname <name>
> as root, and then restart OGS/GE.
>
> Rayson
>
> ==================================================
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
>
>
> On Fri, Dec 27, 2013 at 7:47 PM, Lyn Gerner <schedulerqueen at gmail.com> wrote:
> > Yep, it works again with those changes.
> >
> > So, how should I stop the regression in a non-kludgy way?
> >
> > Thanks again,
> > Lyn
> >
> >
> > On Fri, Dec 27, 2013 at 2:43 PM, Rayson Ho <raysonlogin at gmail.com> wrote:
> >>
> >> /etc/sysconfig/network is read during reboot, and may be after DHCP...
> >>
> >> To see if it is the issue, set HOSTNAME back to master, and also run
> >> "hostname master" as root.
> >>
> >> Rayson
> >>
> >> ==================================================
> >> Open Grid Scheduler - The Official Open Source Grid Engine
> >> http://gridscheduler.sourceforge.net/
> >> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
> >>
> >>
> >> On Fri, Dec 27, 2013 at 7:40 PM, Lyn Gerner <schedulerqueen at gmail.com> wrote:
> >> > Thanks for digging, Rayson.
> >> >
> >> > So, /etc/sysconfig/network had HOSTNAME=centos-ami when the problem
> >> > first occurred. I tried resetting it to "master" and then retried the
> >> > SGE commands (qstat, qsub, etc.). They still failed with the same
> >> > error at that point, so I switched them back, not knowing for sure if
> >> > they'd been set to master and node001 to begin with.
> >> >
> >> > Thanks,
> >> > Lyn
> >> >
> >> >
> >> > On Fri, Dec 27, 2013 at 2:35 PM, Rayson Ho <raysonlogin at gmail.com> wrote:
> >> >>
> >> >> (Updating the list...)
> >> >>
> >> >> The hostname on the master gets reset to centos-ami, which is not
> >> >> resolvable. Thus Grid Engine complains about the hostname issue.
> >> >>
> >> >> Lyn: what is the value of the HOSTNAME key in
> >> >> "/etc/sysconfig/network" on your master instance??
> >> >>
> >> >> Justin & other devs: set_hostname() in node.py works on Ubuntu
> >> >> because Ubuntu uses /etc/hostname, but RHEL (and RHEL-based distros
> >> >> like CentOS, Oracle Linux, Scientific Linux) uses
> >> >> /etc/sysconfig/network, and yet SuSE uses /etc/HOSTNAME!
> >> >>
> >> >> Rayson
> >> >>
> >> >> ==================================================
> >> >> Open Grid Scheduler - The Official Open Source Grid Engine
> >> >> http://gridscheduler.sourceforge.net/
> >> >> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
> >> >>
> >> >>
> >> >> On Fri, Dec 27, 2013 at 6:39 PM, Lyn Gerner <schedulerqueen at gmail.com> wrote:
> >> >> > I used the Scientific Linux AMI (been a long time, but I found it
> >> >> > from the SC site), and 0.94.3 is my SC version.
> >> >> >
> >> >> >
> >> >> > On Fri, Dec 27, 2013 at 1:36 PM, Rayson Ho <raysonlogin at gmail.com> wrote:
> >> >> >>
> >> >> >> Hmm, which AMI did you use, and what's the version of SC?
> >> >> >>
> >> >> >> Rayson
> >> >> >>
> >> >> >> ==================================================
> >> >> >> Open Grid Scheduler - The Official Open Source Grid Engine
> >> >> >> http://gridscheduler.sourceforge.net/
> >> >> >> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
> >> >> >>
> >> >> >>
> >> >> >> On Fri, Dec 27, 2013 at 6:33 PM, Lyn Gerner <schedulerqueen at gmail.com> wrote:
> >> >> >> > root at AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
> >> >> >> > # /opt/sge6/utilbin/linux-x64/gethostname -name
> >> >> >> > error resolving local host: can't resolve host name (h_errno = HOST_NOT_FOUND)
> >> >> >> >
> >> >> >> > root at AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
> >> >> >> > # hostname
> >> >> >> > centos-ami
> >> >> >> >
> >> >> >> > root at AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
> >> >> >> > # hostname -f
> >> >> >> > hostname: Unknown host
> >> >> >> >
> >> >> >> > What's weird is that I have never mucked with any of this under
> >> >> >> > StarCluster, and have only recently started having problems.
> >> >> >> > Can't pinpoint any specific event or thing that changed--except
> >> >> >> > that I started leaving the config up for days instead of hours
> >> >> >> > at a stretch.
> >> >> >> >
> >> >> >> > Thanks,
> >> >> >> > Lyn
> >> >> >> >
> >> >> >> >
> >> >> >> > On Fri, Dec 27, 2013 at 1:30 PM, Rayson Ho <raysonlogin at gmail.com> wrote:
> >> >> >> >>
> >> >> >> >> No problem, and I think that's why it is failing. Can you also
> >> >> >> >> send me the output of:
> >> >> >> >>
> >> >> >> >> 1) gethostname -name
> >> >> >> >>
> >> >> >> >> 2) hostname
> >> >> >> >>
> >> >> >> >> 3) hostname -f
> >> >> >> >>
> >> >> >> >> Rayson
> >> >> >> >>
> >> >> >> >> ==================================================
> >> >> >> >> Open Grid Scheduler - The Official Open Source Grid Engine
> >> >> >> >> http://gridscheduler.sourceforge.net/
> >> >> >> >> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> On Fri, Dec 27, 2013 at 6:27 PM, Lyn Gerner <schedulerqueen at gmail.com> wrote:
> >> >> >> >> > My bad:
> >> >> >> >> >
> >> >> >> >> > root at AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
> >> >> >> >> > # /opt/sge6/utilbin/linux-x64/gethostname -all
> >> >> >> >> > error resolving local host: can't resolve host name (h_errno = HOST_NOT_FOUND)
> >> >> >> >> >
> >> >> >> >> > Thanks for any insights,
> >> >> >> >> > Lyn
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > On Fri, Dec 27, 2013 at 1:25 PM, Rayson Ho <raysonlogin at gmail.com> wrote:
> >> >> >> >> >>
> >> >> >> >> >> But I need the output of "gethostname", not "gethostbyname"...
> >> >> >> >> >> :-P
> >> >> >> >> >>
> >> >> >> >> >> Rayson
> >> >> >> >> >>
> >> >> >> >> >> ==================================================
> >> >> >> >> >> Open Grid Scheduler - The Official Open Source Grid Engine
> >> >> >> >> >> http://gridscheduler.sourceforge.net/
> >> >> >> >> >> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> On Fri, Dec 27, 2013 at 6:11 PM, Lyn Gerner <schedulerqueen at gmail.com> wrote:
> >> >> >> >> >> > Thanks for the quick response, Rayson. Output from
> >> >> >> >> >> > gethostbyname is in between the ****s below:
> >> >> >> >> >> >
> >> >> >> >> >> > On Fri, Dec 27, 2013 at 1:04 PM, Rayson Ho <raysonlogin at gmail.com> wrote:
> >> >> >> >> >> >>
> >> >> >> >> >> >> What is the output of "gethostname"? (gethostname is
> >> >> >> >> >> >> shipped with SGE in the util dir.)
> >> >> >> >> >> >>
> >> >> >> >> >> >> Rayson
> >> >> >> >> >> >>
> >> >> >> >> >> >> ==================================================
> >> >> >> >> >> >> Open Grid Scheduler - The Official Open Source Grid Engine
> >> >> >> >> >> >> http://gridscheduler.sourceforge.net/
> >> >> >> >> >> >> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
> >> >> >> >> >> >>
> >> >> >> >> >> >>
> >> >> >> >> >> >> On Fri, Dec 27, 2013 at 5:34 PM, Lyn Gerner <schedulerqueen at gmail.com> wrote:
> >> >> >> >> >> >> > Hi All,
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > Okay, I'm in the Twilight Zone now. After starting a small
> >> >> >> >> >> >> > cluster on the 23rd, and doing minimal reconfig (qmod -d)
> >> >> >> >> >> >> > to disable the sge_execd on the master and qconf -mq all.q
> >> >> >> >> >> >> > to change some slot counts -- all of which worked fine --
> >> >> >> >> >> >> > I come back days later to find an unusable SGE config:
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > root at AWS-VTMXmaster-w2b ~
> >> >> >> >> >> >> > # qstat -f
> >> >> >> >> >> >> > error: sge_gethostbyname failed
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > /etc/hosts is correct for all its (internal) host addrs:
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > root at AWS-VTMXmaster-w2b ~
> >> >> >> >> >> >> > # cat /etc/hosts
> >> >> >> >> >> >> > 127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
> >> >> >> >> >> >> > ::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
> >> >> >> >> >> >> > 10.250.65.204 master
> >> >> >> >> >> >> > 10.251.30.12 node001
> >> >> >> >> >> >> >
> >> >> >> >> >> >> *****
> >> >> >> >> >> >>
> >> >> >> >> >> >> > The gethostbyname utility works correctly (so does gethostbyaddr):
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > root at AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
> >> >> >> >> >> >> > # /opt/sge6/utilbin/linux-x64/gethostbyname master
> >> >> >> >> >> >> > Hostname: master
> >> >> >> >> >> >> > Aliases:
> >> >> >> >> >> >> > Host Address(es): 10.250.65.204
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > root at AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
> >> >> >> >> >> >> > # /opt/sge6/utilbin/linux-x64/gethostbyname node001
> >> >> >> >> >> >> > Hostname: node001
> >> >> >> >> >> >> > Aliases:
> >> >> >> >> >> >> > Host Address(es): 10.251.30.12
> >> >> >> >> >> >
> >> >> >> >> >> >
> >> >> >> >> >> > ******
> >> >> >> >> >> >>
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > root at AWS-VTMXmaster-w2b /opt/sge6/default/common/install_logs
> >> >> >> >> >> >> > # qstat -f
> >> >> >> >> >> >> > error: sge_gethostbyname failed
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > I went so far as to edit the hostname in
> >> >> >> >> >> >> > /etc/sysconfig/network to contain "master" and "node001"
> >> >> >> >> >> >> > on the two nodes. Same error.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > I have been all over the 'net looking for solutions, but
> >> >> >> >> >> >> > have found nothing with a clear resolution.
> >> >> >> >> >> >> > gridengine.sunsource.net is gone. The follow-on at
> >> >> >> >> >> >> > http://gridengine.org/pipermail/users/ doesn't seem to be
> >> >> >> >> >> >> > searchable, except on an onerous, month-by-month
> >> >> >> >> >> >> > click-thru basis (which hasn't yielded anything useful as
> >> >> >> >> >> >> > I slog thru it).
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > Short of starcluster restart'ing, I'll appreciate
> >> >> >> >> >> >> > anyone's inputs on what to try next.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > Thanks much,
> >> >> >> >> >> >> > Lyn
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > _______________________________________________
> >> >> >> >> >> >> > StarCluster mailing list
> >> >> >> >> >> >> > StarCluster at mit.edu
> >> >> >> >> >> >> > http://mailman.mit.edu/mailman/listinfo/starcluster
> >> >> >> >> >> >> >
> >> >> >> >> >> >
> >> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >
> >> >> >
> >> >
> >> >
> >
> >
>
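
The diagnosis in the thread comes down to two checks: whether the node's current hostname appears in /etc/hosts, and which file the distro uses to persist the hostname. A rough sketch of both checks follows. The function names and the order in which the files are probed are illustrative assumptions, and the file list is only the three locations Rayson mentions (Ubuntu, RHEL-based, SuSE).

```shell
#!/bin/sh
# Hypothetical diagnostic helpers; names and probe order are assumptions
# made for illustration, not StarCluster or Grid Engine code.

# Succeed if NAME appears as a hostname or alias in a hosts(5)-style file.
name_in_hosts_file() {
    name=$1
    file=${2:-/etc/hosts}
    awk -v n="$name" '
        { sub(/#.*/, "") }   # strip comments before matching
        { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$file"
}

# Print the first hostname config file found under ROOT (default: /).
# Fails with status 1 if none of the known locations exists.
hostname_config_file() {
    root=${1:-}
    for f in etc/sysconfig/network etc/HOSTNAME etc/hostname; do
        if [ -f "$root/$f" ]; then
            echo "$root/$f"
            return 0
        fi
    done
    return 1
}

# Example:
#   name_in_hosts_file "$(hostname)" || echo "hostname not in /etc/hosts"
#   hostname_config_file
```

On the master in this thread, the first check would fail for centos-ami, since /etc/hosts lists only master and node001.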