This is bizarre and while I have a workaround, I’d prefer a permanent fix.
I have a small group of GPU machines running Ubuntu 14.04 which I am using as workers for a cloud service that’s effected via Docker images. I have nvidia-docker installed on all the worker machines, so that docker has access to the GPUs. The worker machines also function as individual servers which lab members can do experiments on directly (academic environment, the cloud service is experimental, etc). For the latter purpose, all the machines automount individual user shares over NFS. We recently switched to automount from a static fstab configuration, and I’m still getting used to it — it’s entirely possible there’s some obvious issue at play here I’m not seeing because I’m an automount n00b. Finally, I haven’t set anything up for docker images to be able to access the NFS shares, so in theory there should be no connection… in theory.
This week one of our lab members reported the
Too many levels of symbolic linkserror when attempting to access their share drive from one of the GPU machines. They’re not using docker at all (to their knowledge). There are no questionable symbolic links in their tree (via
find -type l), so it has to be something else getting into a weird state. The mount point looks like this under
ls -lfrom the parent directory:
dr-xr-xr-x 2 root root 0 Dec 5 18:38 labmember1
which seems… bad? root:root 555, really? and when you try to browse it you get, indeed:
$ cd /path/to/labmember1/ -bash: cd: /path/to/labmember1/: Too many levels of symbolic links
The share doesn’t seem to actually be mounted — it does not appear in /etc/mtab, and (predictably) attempts to unmount it manually report:
$ sudo umount /path/to/labmember1/ umount: /path/to/labmember1/: not mounted
Restarting autofs (
service autofs restart) did nothing.
What I thought was unrelated at the time: docker had been spewing veth interfaces everywhere. This was a machine being actively used as a cloud worker, so I figured it was our cloud software. Now I’m not so sure.
Too many levels of symbolic linksfailure occurred on another GPU machine, which has docker/nvidia-docker installed but does not run the cloud worker software. Lo and behold, veth interfaces everywhere, though in far fewer numbers than on the cloud worker machine.
On a whim, I stopped the docker service (
service docker stop). Magic! The share mounts normally and our lab member can use their stuff again. The share remains in working condition after starting docker back up again.
So I can clearly fix this issue by restarting docker if(when) it happens again, but I’d like to know
- what is causing this in the first place? or, how can I find out?
- is there a way to prevent this from happening again, or am I stuck just fixing it every time it breaks?
How have you defined mount options for
/etc/auto.master, are you doing direct or indirect automount?
Also did you
autofs entirely within a Docker container with the
--privileged option added to the docker run command? Using this approach you should be able to perform NFS mounts without any issues.
Please note that bind mounting an
autofs mount into a container with an independent autofs daemon running can’t be done because it may conflict with the autofs daemon running in the originating namespace.
For indirect mounts, running
autofs in the root namespace provides automounting for Docker containers by binding the autofs top level mounts into containers with the Docker volume option should function mostly as expected.