[StarCluster] docker daemon not found when docker command executed with qsub

Xander Dunn xander.dunn at icloud.com
Tue Nov 17 02:36:17 EST 2015


You’re right, thanks very much!  

Submitting the job `qsub -b y -cwd id` produces: 
uid=1001(sgeadmin) gid=1001(sgeadmin) groups=1001(sgeadmin),20000

Strangely, however, executing the same command on the same node with ssh yields a different result:
sgeadmin@master:~$ ssh node001 id
uid=1001(sgeadmin) gid=1001(sgeadmin) groups=1001(sgeadmin),999(docker)

This explains the discrepancy I’m seeing. Why does the qsub job run as uid 1001 without the docker group, while the ssh session runs as uid 1001 with it?
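To see exactly which groups differ between the two environments, the two `id` lines can be compared mechanically. This is a sketch that assumes a POSIX shell with `sed`, `tr`, and `grep`. `id_groups` and `missing_groups` are hypothetical helper names, not part of Grid Engine or StarCluster.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Grid Engine or StarCluster) for
# comparing the group lists printed by two runs of `id`.

# Print the groups from one line of `id` output, one per line.
# Entries such as "999(docker)" become "docker"; bare numeric ids with no
# name (such as the 20000 in the qsub output) are printed unchanged.
id_groups() {
  printf '%s\n' "$1" |
    sed -n 's/.*groups=//p' |
    tr ',' '\n' |
    sed 's/^[0-9]*(\(.*\))$/\1/'
}

# Print every group that appears in the second `id` output but not in the first.
missing_groups() {
  for g in $(id_groups "$2"); do
    id_groups "$1" | grep -qxF "$g" || printf '%s\n' "$g"
  done
}

# The two outputs from this thread: qsub job first, ssh session second.
qsub_out='uid=1001(sgeadmin) gid=1001(sgeadmin) groups=1001(sgeadmin),20000'
ssh_out='uid=1001(sgeadmin) gid=1001(sgeadmin) groups=1001(sgeadmin),999(docker)'
missing_groups "$qsub_out" "$ssh_out"
```

With the two outputs above, this prints `docker`, the group the qsub job is missing.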

My first thought was to `usermod` the sgeadmin user on my AMI to add it to the docker group, but I realized there is no sgeadmin user on my AMI: StarCluster creates it when the node boots.

How can I get the docker group added to sgeadmin in that case?

Thanks,
Xander

> On Nov 16, 2015, at 19:26, Rayson Ho <raysonlogin at gmail.com> wrote:
> 
> Xander,
> 
> Can you check whether the Grid Engine job environment has the "docker" group as one of its supplementary groups, by submitting a job that runs "id"?
> 
> http://man7.org/linux/man-pages/man1/id.1.html
> 
> IIRC, Docker requires the process to be a member of the docker group in order to connect to /var/run/docker.sock.
> 
> Rayson
> 
> ==================================================
> Open Grid Scheduler - The Official Open Source Grid Engine
> http://gridscheduler.sourceforge.net/
> http://gridscheduler.sourceforge.net/GridEngine/GridEngineCloud.html
> 
> 
> 
> 
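Rayson's point about group membership and /var/run/docker.sock can be checked from inside a job itself. This is a minimal sketch assuming a POSIX shell on Linux, where connecting to a Unix domain socket requires write permission on the socket file. `in_group` and `check_docker_socket` are hypothetical names, and the default socket path is the one that appears in this thread.

```shell
#!/bin/sh
# Hypothetical diagnostic helpers; not part of Docker, Grid Engine, or StarCluster.

# Succeed if the current process has the named group, per `id -nG`.
in_group() {
  id -nG | tr ' ' '\n' | grep -qxF "$1"
}

# Explain whether this process could connect to the Docker socket.
# Returns 0 if the socket is writable, 1 if it is not, 2 if it is missing.
check_docker_socket() {
  sock=${1:-/var/run/docker.sock}
  if [ ! -S "$sock" ]; then
    echo "no socket at $sock (is the daemon running?)"
    return 2
  fi
  if [ -w "$sock" ]; then
    echo "$sock is writable by $(id -un)"
    return 0
  fi
  if in_group docker; then
    echo "$(id -un) is in the docker group but still cannot write $sock"
  else
    echo "$(id -un) is not in the docker group for this process"
  fi
  return 1
}
```

Saved as a script that ends with a call to `check_docker_socket`, it could be submitted with `qsub -b y -cwd sh SCRIPT` and also run with `ssh node001 sh SCRIPT`, so the two environments can be compared side by side (SCRIPT is a placeholder file name).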
> On Mon, Nov 16, 2015 at 7:15 PM, Xander Dunn <xander.dunn at icloud.com> wrote:
> >
> > I have StarCluster installed from the develop branch because I need to use c4 instance types, which aren’t in a released version yet. I have Open Grid Scheduler 2011.11 installed on an Ubuntu 14.04 AMI.
> >
> > I have Docker installed on that AMI, and the daemon starts on boot. If I manually ssh into my master node or any worker node and execute a Docker command, it works: the Docker daemon is found and the command succeeds. Furthermore, running any docker command from the master node in the form `ssh node001 docker pull IMAGE` also works correctly.
> >
> > However, those same commands fail when executed with qsub, with an error suggesting the running Docker daemon can’t be reached:
> > Post IMAGE: dial unix /var/run/docker.sock: permission denied.
> > * Are you trying to connect to a TLS-enabled daemon without TLS?
> > * Is your docker daemon up and running?
> >
> > Example: `qsub -V -b y -cwd docker pull ubuntu:14.04`
> >
> > What difference in the way qsub executes commands is causing this?
> >
> > Thanks,
> > Xander
> > _______________________________________________
> > StarCluster mailing list
> > StarCluster at mit.edu
> > http://mailman.mit.edu/mailman/listinfo/starcluster