openstack team mailing list archive

Re: Libvirt LXC with volume-attach broken ?

 

Quoting Eric W. Biederman (ebiederm@xxxxxxxxxxxx):
> "Daniel P. Berrange" <berrange@xxxxxxxxxx> writes:
> 
> > On Thu, Jul 05, 2012 at 06:49:06PM -0700, Eric W. Biederman wrote:
> >> Serge Hallyn <serge.hallyn@xxxxxxxxxxxxx> writes:
> >> 
> >> > Quoting Daniel P. Berrange (berrange@xxxxxxxxxx):
> >> >> On Thu, Jul 05, 2012 at 03:00:26PM +0100, Daniel P. Berrange wrote:
> >> >> > Now, when using 'nova volume-attach':
> >> >> > 
> >> >> >   # nova volume-attach 05eb16df-03b8-451b-85c1-b838a8757736 a5ad1d37-aed0-4bf6-8c6e-c28543cd38ac /dev/sdf
> >> >> > 
> >> >> > nova will import an iSCSI LUN from the nova volume service, on the compute
> >> >> > node. The kernel will assign it the next free SCSI drive letter, in my
> >> >> > case '/dev/sdc'.
> >> >> > 
> >> >> > The libvirt nova driver will then do a mknod, using the volume name
> >> >> > passed to 'nova volume-attach'.
> >> >> > eg it will do
> >> >> > 
> >> >> >   mknod  /var/lib/nova/instances/instance-0000000e/rootfs/dev/sdf
> >> >> 
> >> >> Oops, I'm slightly wrong here. What it actually does is
> >> >> 
> >> >>   mount --bind /dev/sdc /var/lib/nova/instances/instance-0000000e/rootfs/dev/sdf
> >> >> 
> >> >> so you get a 'sdf' device, but with the major/minor number of the 'sdc'
> >> >> device. I can't say I particularly like this approach. Ultimately I
> >> >> think we need the kernel support to make this work correctly. In any
> >> >
> >> > Yes, that's what the 'devices namespace' is meant to address.  I'm hoping
> >> > we can have some serious design discussion on that in the next few months.
> >> 
> >> This is not the device namespace problem.
> >> 
> >> This is the setns problem for mount namespaces, and the unprivileged
> >> mount problem.
> >> 
> >> There may be a notification issue so user space can perform actions
> >> in a container when a device shows up.
> >> 
> >> But it should be very possible on the host to call:
> >> setns(containers_mount_namespace_fd, CLONE_NEWNS);
> >> mknod("/dev/foo", S_IFBLK | 0600, makedev(major, minor));
> >> chown("/dev/foo", CONTAINER_ROOT_UID, CONTAINER_ROOT_GID);
> >> 
> >> And then from inside the container especially when I get the rest of
> >> the user namespace merged it should be very possible to manipulate
> >> the block device because you have permission, and to mount the
> >> partitions of the block device, because you are root in your container.
> >> 
> >> But until the user namespace is merged you really are root so you can
> >> mount whatever.
> >> 
> >> Daniel does that sound like the support you are looking for?
> >
> > Yes, the setns(mnt) approach you describe above is exactly what I'd
> > like to be able to do, to solve the first half of the problem.
> >
> > The other part of the problem is that I have a /dev/sdf, or even a
> > /dev/volgroup00/logvol3 in the host (with whatever major:minor
> > number that implies), and I want to be able to make it always
> > appear as /dev/sda in the container (with the correspondingly
> > different major:minor number).  I'm guessing this is what Serge
> > was referring to as the 'device' namespace problem.

Right.

> Getting the device to always appear with the name /dev/sda is easy.

It's easy to log in and make it look that way.  It's not easy to
make all distros see it that way across boot.

> Where does the need to have a specific device come from?  I would have
> thought by now that hotplug had been around long enough that in general
> user space would not care.

Yes, the *primary* need for the devices namespace is to prevent udev
storms in the host, to send uevents to the right place, and to handle
macvtap and loop devices.

> The only case that I know of where keeping the same device number seems
> reasonable is in the case of live migration of an application, in order to
> avoid issues with stat changing for the same file over the transition,
> and I think a synthesized hotplug event could probably handle that case.
> 
> Is there another case besides buggy applications that have hard
> coded device numbers that need specific device numbers?

Other cases where specific device major:minor numbers are important
are things like makedev.  There is lots of software, and especially
automatic update software, which insists that things have specific
'correct' major:minor numbers.

FWIW my (presumably naive) view is that for each non-init devicens
we'd have a list of

type-major:minor::type2-major2:minor2

(:: meaning maps-to).  Then if a uevent comes through not aimed at
any type2-major2:minor2 valid in the namespace, that ns doesn't get
the uevent.
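
As a rough illustration of that mapping idea, here is a userspace sketch in
Python.  It is only a model of the lookup, not kernel code: the
DeviceNamespace class, its method names, and the dict-shaped uevent are all
hypothetical, and the 8:32 -> 8:0 numbers are just the usual sdc and sda
block device numbers picked as an example.

```python
from typing import Dict, Optional, Tuple

# (type, major, minor), e.g. ("b", 8, 32) for a block device.
DevId = Tuple[str, int, int]


class DeviceNamespace:
    """Hypothetical per-namespace table of host-device -> namespace-device mappings."""

    def __init__(self, mappings: Dict[DevId, DevId]) -> None:
        self._host_to_ns = dict(mappings)

    def translate(self, host_dev: DevId) -> Optional[DevId]:
        """Return the device id this namespace should see, or None if unmapped."""
        return self._host_to_ns.get(host_dev)

    def deliver_uevent(self, action: str, host_dev: DevId) -> Optional[dict]:
        """Rewrite a uevent for this namespace, or drop it (None) if unmapped."""
        ns_dev = self.translate(host_dev)
        if ns_dev is None:
            # Not aimed at any device valid in this namespace: no uevent.
            return None
        dev_type, major, minor = ns_dev
        return {"action": action, "type": dev_type, "major": major, "minor": minor}


# Host block device 8:32 (sdc) should appear as 8:0 (sda) in the container.
ns = DeviceNamespace({("b", 8, 32): ("b", 8, 0)})
print(ns.deliver_uevent("add", ("b", 8, 32)))  # rewritten event for 8:0
print(ns.deliver_uevent("add", ("b", 8, 48)))  # unmapped device, dropped
```

A real implementation would presumably keep this table in the kernel and
consult it when broadcasting uevents, so the host never sees a udev storm
from container-only devices.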

-serge

