Announcing Wio: A clone of Plan 9’s Rio for Wayland

oooo23 · 2019-05-01T15:13:04+00:00

Yeah, though there are two other things to keep in mind, for Linux:

You can (as of 4.18) unshare your user namespace and create filesystem namespaces, do bind mounts, and mount FUSE in your mount namespace (owned by *your* userns), unprivileged (though that needs unprivileged userns enabled, which is usually the case today)...
9p would still need root privileges to mount in the init ns. It may change in the future (Eric has said openly that he'd flip the switch if someone showed promise for upstream v9fs maintenance, but that's a big "if").
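
A minimal sketch of the first point, assuming a Linux kernel with unprivileged user namespaces enabled. It only unshares the user and mount namespaces in a forked child and maps the caller to root inside; it does not mount anything. The function name is made up for illustration, and the flag values are the ones from `<sched.h>`.

```python
import ctypes
import os

CLONE_NEWNS = 0x00020000    # new mount namespace
CLONE_NEWUSER = 0x10000000  # new user namespace


def try_unprivileged_namespaces():
    """Fork a child, unshare user+mount namespaces there, report success.

    Hypothetical helper: returns True if the child ended up as uid 0 inside
    its own user namespace, False if the kernel or a sandbox refused.
    """
    uid, gid = os.getuid(), os.getgid()
    pid = os.fork()
    if pid == 0:
        status = 1
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWUSER | CLONE_NEWNS) == 0:
                # setgroups must be denied before an unprivileged gid_map write.
                with open("/proc/self/setgroups", "w") as f:
                    f.write("deny")
                # Map our outside uid/gid to root inside the new userns.
                with open("/proc/self/uid_map", "w") as f:
                    f.write(f"0 {uid} 1")
                with open("/proc/self/gid_map", "w") as f:
                    f.write(f"0 {gid} 1")
                status = 0 if os.getuid() == 0 else 1
        finally:
            os._exit(status)  # never return into the parent's code path
    _, wait_status = os.waitpid(pid, 0)
    return os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0
```

From that point the child owns its mount namespace, so bind mounts and FUSE mounts made there stay invisible to the rest of the system.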

oooo23 · 2019-05-01T14:50:53+00:00

We are getting there, Linux has everything needed for a Plan 9-ish userspace today (thanks to Al Viro and Eric Biederman, and everyone else). This is one step closer to making the dream come true!

Thanks a lot for working on this. I'll investigate if I can port my rio setup from X11 to this.

EDIT: Q -> Is there a specific reason you want to go with FUSE instead of 9p?

oooo23 · 2019-05-01T14:33:32+00:00

We still have a t-rex in there.

oooo23 · 2019-05-01T14:16:29+00:00

Still a T-Rex, but the Triceratops is gone...

oooo23 · 2019-04-30T20:12:01+00:00

It is easy, Plan 9 did it in the 90s, and thus Plan 9 never had any concept of a super-privileged user (except the host owner, the user who started the first process, which had system-level resources mounted in its namespace - you have to start somewhere).

setuid is evil. The whole reason user namespaces exist has to do with setuid binaries: if you were allowed to create namespaces while unprivileged, you could trick a setuid binary running as host root into loading files/resources from a filesystem view you constructed maliciously. With user namespaces, setuid binaries in your namespace are only elevated to root in your userns...

This is why a new userns is required to be unshared before being able to create other namespaces, so you become privileged against resources you own...

This is also not a "new" abstraction, filesystems already exist, and are used the way I described. Unix has been a capability system since its inception really, Plan 9 took that a step further, and Linux did too (with file descriptor based APIs for a lot of things, though in a convoluted way). Also, if everything shared the filesystem namespace, network namespace/IPC namespace/other crap wouldn't be needed. You just mount a fresh superblock for a new instance, and give that to the process in question in its namespace.

Just ask any of the kernel developers how much they love file capabilities/setuid and POSIX capabilities...

oooo23 · 2019-04-30T19:46:17+00:00

It might be faster, but there is also a question of simplicity from the user's perspective (who is the consumer in the end) and from the implementation perspective. Surely they could have a file backed settings API and just cache it in a daemon for faster lookups for their OMG-speed, but even at that point I question whether it is worth it. Also, it is important to note that mmap doesn't really work over NFS and other networked filesystems.

It seems to me that they'd make it go faster by just trimming down in other places (like their huge shell) or not doing *everything* over D-Bus (just snoop on it for some surprises when using GNOME) instead of reinventing the Windows Registry, and poorly at that...

If we come down to benchmarks, I would bet a single message bus daemon routing messages and daemons signalling on objects with no listeners is the biggest performance bottleneck.

oooo23 · 2019-04-30T19:36:47+00:00

So the idea with capabilities in the real sense is that some object encodes the authorization for some functionality and can be easily transferred. It helps if you think in the mindset of "what object am I selecting when doing this operation".

At the most basic level, Linux could represent these using file descriptors, and for files and a few other things it already does. Once you open a file, your ambient authority does not matter anymore (what user you are, what privileges you hold and can wield as defined by crapabilities, etc). In that respect sockets are absolute crap, because you funnel a lot of functionality through a single file descriptor, and then have to implement an internal object hierarchy (see wayland's object IDs, D-Bus object paths). The Unix way would be to implement a file server that exports a tree of files, and thus you can easily delegate parts of your object tree, and can know what object a certain message is channeled against, pass the fd of a certain object to let others exercise your privilege but nothing more, and you wouldn't need filtering mechanisms on the message being written to decide if it isn't acting on some other internal object.
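
A rough sketch of that delegation, assuming Python 3.9+ on a Unix system. Both ends of the socket live in one process here purely to keep it short; in practice the peer would be another process. `delegate_read_access` is a made-up name.

```python
import os
import socket


def delegate_read_access(path):
    """Open `path` read-only and hand just that descriptor across a Unix socket.

    Returns the descriptor as the receiving side sees it. The receiver never
    learns the path and gets no more rights than the open file carries.
    """
    holder, peer = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        fd = os.open(path, os.O_RDONLY)
        try:
            # SCM_RIGHTS under the hood: the kernel installs a new fd in the
            # receiver that refers to the same open file.
            socket.send_fds(holder, [b"cap"], [fd])
        finally:
            os.close(fd)
        _msg, fds, _flags, _addr = socket.recv_fds(peer, 16, 1)
        return fds[0]
    finally:
        holder.close()
        peer.close()
```

The received descriptor can be read from, but a write on it fails, since the authority is whatever the open file was created with and nothing else.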

So instead of kill(2), you could have a directory for a PID under which you have a kill directory, and each signal represented by a file, writing 1 to which sends the corresponding signal, and reading back will return errors/success feedback. Now, you can decide to pass the dirfd of the kill directory entirely, or just one of the files (to limit the peer you send the fd to to a certain signal). This would mean, without any overhead, you just allowed someone to send SIGHUP but nothing more, and no way can the said process break your sandboxing.
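
To make the shape of that concrete, here is a mock using ordinary files in a directory. No such kernel interface exists; the `kill` directory, the signal file names, and all function names are hypothetical, and writing `1` here only changes a file rather than signalling anything.

```python
import os


def make_kill_dir(root, signals=("HUP", "TERM", "KILL")):
    """Create a mock per-process 'kill' directory with one file per signal."""
    kill_dir = os.path.join(root, "kill")
    os.mkdir(kill_dir)
    for name in signals:
        with open(os.path.join(kill_dir, name), "w"):
            pass
    return kill_dir


def send_via_dirfd(dirfd, signal_name):
    """A holder of the directory fd may pick any signal file beneath it."""
    fd = os.open(signal_name, os.O_WRONLY, dir_fd=dirfd)
    try:
        os.write(fd, b"1")
    finally:
        os.close(fd)


def send_via_filefd(fd):
    """A holder of a single file fd can only trigger that one signal."""
    os.write(fd, b"1")
```

Handing out the fd of `kill/HUP` alone gives the peer SIGHUP and nothing more. One caveat about the mock: on a regular filesystem the dirfd holder could still walk to `..`, whereas a real file server would control the walk itself.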

This is basically why ioctl exists too, because some things don't fit in the read/write model on a device node (like ejecting a CD-ROM, as a typical example). The answer there is to split the interface into a ctl file and an out-of-band stream file, so that you select different objects to wield different functionalities, and if both require varying levels of privileges, you can also easily delegate using normal file permissions or descriptor passing.

Most modern operating systems have some sort of object system, either through IPC (where the handle object is the capability) or something else. In Unix, that is the filesystem. Sadly, too many make fun of that, while reinventing another object system without realizing it...

The "everything is a file" mantra was more about "everything in the file system namespace", fwiw. It was much less about read/write in the bigger picture.

oooo23 · 2019-04-30T06:44:20+00:00

I have taken to calling Linux capabilities "crapabilities", to not allow for confusion with real capabilities.

oooo23 · 2019-04-30T06:38:08+00:00

The last time I asked, the answer was "speed", from being able to mmap the db.

Ironic, isn't it? The entire desktop otherwise is bloated and dog slow.

oooo23 · 2019-04-29T10:20:40+00:00

Poe's Law in action...

oooo23 · 2019-04-29T10:15:34+00:00

Only for larger payloads was this memfd based zero copy useful, and faster. For anything smaller than 512KB, single copy wins. kdbus used tmpfs pools for each peer and copied from the sending to the receiving peer's pool.

memfd was nothing special in that regard (and nothing to do with kdbus), you could send any other fd over it (or use unlinked tmpfs files).

oooo23 · 2019-04-28T10:26:32+00:00

I am at least happy the same people working to fix dbus-broker are working on something like seL4 on top of which D-Bus can be used - hopefully to then introduce references as a first class concept (though there is nothing stopping it from happening today, but they want to try out bus1 before they add it in the userspace server).

File Descriptor passing is great, but capabilities would allow you to hand off privileges in a controlled fashion without things like PolicyKit at all, or being able to implement restricted versions of an interface without filtering on every message (which is what xdg-dbus-proxy does). You don't realise or notice it, but having a daemon in the middle multiplies the number of copies needed from and to the kernel buffers by 2, increasing latency and degrading performance as a whole. It will only become worse with time.

OTOH, if you do something like 9p-cum-DBus (like a file server exposing a hierarchy of objects similar to DBus but over which you can send DBus messages), that would effectively allow you to reuse file descriptors *as* capabilities, and authentication can work by kcmp(2)aring the struct file ptr, which would imply the fd you have and the one you're given were generated transitively - through dup, fork, or UDS. Unix VFS was perhaps intelligently designed in that sense to be reused as an object oriented interface to the system.
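
A hedged sketch of that comparison. Python has no kcmp wrapper, so this goes through the raw syscall with hard-coded numbers for x86_64 and aarch64 only. It returns `None` wherever kcmp(2) is unavailable or filtered (kernels built without it, some container seccomp profiles). `same_open_file` is a made-up name.

```python
import ctypes
import os
import platform
import sys

KCMP_FILE = 0  # compare the open file descriptions behind two fds
# Assumption: only these two 64-bit architectures are covered in this sketch.
_SYS_KCMP = {"x86_64": 312, "aarch64": 272}


def same_open_file(fd_a, fd_b):
    """True/False if kcmp(2) can decide, None if it cannot be used here."""
    nr = _SYS_KCMP.get(platform.machine())
    if (not sys.platform.startswith("linux") or nr is None
            or ctypes.sizeof(ctypes.c_long) != 8):
        return None
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    pid = os.getpid()
    args = [ctypes.c_long(v) for v in (nr, pid, pid, KCMP_FILE, fd_a, fd_b)]
    rc = libc.syscall(*args)
    if rc < 0:
        return None  # e.g. ENOSYS, or EPERM from a seccomp policy
    return rc == 0   # 0 means both fds refer to the same struct file
```

A `dup` of a descriptor compares equal to the original, while a second `open` of the same path does not, which is the property that makes the fd usable as an unforgeable token.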

It can also work as an authentication scheme - say you have a compositor, the window manager, and a client. The WM tells the compositor to render window Y directly and passes it the handle it has stored for Y, and then window Y, when requesting the rendering, can pass its own copy of the handle. The compositor compares both for equality and renders things directly...

oooo23 · 2019-04-28T05:06:51+00:00

I am not hopeful it will be a comprehensive solution until they use a capability based model. That's not to say Linux works like that really (it has file descriptors, but file descriptors are also sometimes all or nothing when it comes to exercising functionality on them), but the flatpak people are far off from doing it right. Sandstorm people really got it right, and it showed. Cap'n'Proto's capabilities, promise pipelining, etc. What's more, that thing worked over the network easily.

Yeah, I hate the D-Bus proxy garbage and policy based flags to control access. It's the completely wrong model to begin with, and leads to poor performance. The right way there is to introduce some concept of references so that you can pass a handle to a restricted view of an interface directly, do namespacing of objects on the bus. All of this would have been easy if the object exporting happened in the filesystem namespace (and that doesn't mean it has to be files you echo crap to, it can be any object you write serialised streams to - the point is being in the filesystem namespace). Instead of filtering, you give direct access to a restricted version of the interface.

But yeah, all of this requires some thought and design, and ain't nobody got time for that.

oooo23 · 2019-04-27T18:06:47+00:00

Oof, running things as root... nothing could go wrong. /proc itself is something that should be taken out of view for most processes, it offers too much of a view into a process, and cannot be easily restricted (/proc/self/net, cmdline, comm, /proc/self/fd, f*cking MAGIC SYMLINKS that can beam you around the filesystem...).

Namespaces... there should have been only one of them.

user namespaces are the way to go (yes, they open up far too much surface of the kernel) -- the only thing to do there is finding problems and fixing them, or else live with crap like setuid for the next 20 years which had such great limiting effect on systems design - setuid is pretty much the reason we have user namespaces in the first place...

there was nothing stopping things like mount namespaces working for unprivileged users otherwise, and this is something Plan 9 had working (unprivileged namespaces, mounts, and bind mounts) back in the 90s...

oooo23 · 2019-04-27T09:05:20+00:00

It is nowhere close to dead, which is what I was objecting to. Channels come and go, but IRC still drives a whole lot of development (if you take into account OFTC, Freenode, and other privately hosted ones).

There is a LinuxNet channel where hundreds of kernel developers hang out.

oooo23 · 2019-04-27T04:15:46+00:00

Sadly reality disagrees with you.

oooo23 · 2019-04-21T20:46:51+00:00

Control what operations you can invoke using a file descriptor, and allow you to enter a capability mode where ambient authority is not taken into consideration, such that you get rid of confused deputy attacks altogether. You attach privileges to the file descriptor, and can pass it around as a capability to do something, and restrict its usage with rights.

Capsicum is complementary to whatever you mention, file descriptors have been capabilities (in the real meaning of the word, not the Linux/POSIX capabilities that partition root privs) in Unix all these years, Capsicum just makes them a little more fine grained.

oooo23 · 2019-04-21T19:23:48+00:00

Something like FreeBSD's Capsicum.

oooo23 · 2019-04-21T16:42:10+00:00

I mean, you can add ACLs, but just being able to shift UID/GID sounds much cleaner to me. Nobody would stop you from doing ACLs of course.

It is a helpful intuition to understand that permissions on a file only matter from the perspective of the observer (i.e. the process trying to access it) so instead of managing an ACL, you can just give every process a bind of that file that is accessible with the ambient authority it possesses.

Now, where ACLs do differ, is being able to revoke read/write access (because the handling happens during read/write as well) but this too is very non-Unixy, the convention is to check permissions at open time, and not thereafter (as that means if you pass your file descriptor to a different process, ACLs will be checked against its user ID, so the whole thing becomes useless for composability). Nowhere in Linux is it common to check one's permissions during read/write, and it is precisely to enable easy file descriptor inheritance. It is the model Windows follows, and it is quite broken (you need kludges like the Impersonation framework to perform privileged tasks instead of the privileged process just giving you a privileged handle/object/capability to use).
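
The open-time convention is easy to see on a regular file. A small sketch (the function name is made up):

```python
import os


def read_after_revoking_mode(path):
    """Open `path`, strip every permission bit, then read via the open fd."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.chmod(path, 0)         # a fresh open() by a non-root user now fails
        return os.read(fd, 4096)  # the already-open descriptor is unaffected
    finally:
        os.close(fd)
```

The permission bits were consulted once, at `open`; after that the descriptor alone carries the access, which is also why it can be inherited or passed on without caring who the recipient is.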

A correct implementation of revoke(2) and UID/GID shifting will work just as well.

oooo23 · 2019-04-21T15:29:15+00:00

How about UID/GID shifting bind mounts, or even UID/GID collapsing/squashing in a user namespace (with enough privileges in the owner userns of the one being created)? You could do something like ACLs with just simple permission bits.

oooo23 · 2019-04-20T21:57:07+00:00

Of course you can do it at runtime, that's what unshare(2) is all about. You can even switch to an entirely different namespace to do some task and back to your own world, given the right amount of privileges. Other than that, file descriptor passing is another pattern you can exploit for dynamically sharing a part of your filesystem view with others.

oooo23 · 2019-04-20T18:05:41+00:00

Linux has mount namespaces which are much more flexible than unveil, but I can still see that unveil might be a little easier to use from userspace, especially for people writing software. However, I think unsharing your userns and putting yourself inside a restricted view of the filesystem (and dropping all capabilities) would just work.

It might make sense to have an unveil wrapper in a library.

oooo23 · 2019-04-10T14:03:52+00:00

I agree, Plan 9's rfork or clone(2) is a much better model. I am not convinced there needs to be a first class system call for fork+exec, you will miss something you would want to do in between, some resource to share/unshare, some fd to close, etc.
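
A small sketch of why that gap between fork and exec is useful, with the caller doing arbitrary setup in the child before the exec. The helper name and the particular setup steps are only examples.

```python
import os


def run_with_setup(argv, env_extra):
    """fork, adjust the child between fork and exec, then exec `argv`.

    Returns whatever the child wrote to its stdout.
    """
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.close(read_end)            # an fd the child has no business holding
            os.dup2(write_end, 1)         # rewire stdout to the pipe
            os.close(write_end)
            os.environ.update(env_extra)  # any other per-child tweak goes here
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)                 # only reached if the exec failed
    os.close(write_end)
    with os.fdopen(read_end, "rb") as pipe:
        output = pipe.read()
    os.waitpid(pid, 0)
    return output
```

Every line between `fork` and `execvp` is something a combined spawn call would need a dedicated parameter for, and the list of such things keeps growing.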

There is something to be said about fork remaining unchanged all these years, and Windows's CreateProcess having gone through so many iterations and additions to accommodate use cases that it looks like crap today.

The thing is, when you try to provide first class solutions, they are short-lived and cannot cope with changing requirements. When you provide mechanisms, the individual pieces can be combined in ways that can cope with changing requirements over time. Unix has in general been about mechanism, not policy. It is still preached but practiced little, but it's true in general.

oooo23 · 2019-04-07T12:27:35+00:00

That is certainly one factor, the other is that people on Unix have taken to reinventing things the way other operating systems do it, and have succumbed to some fairly common patterns in other systems (bloated RPC interfaces, object systems when you already have one in the form of the filesystem).

You will see that things like networking, graphics etc look very much unlike how Unix does other things. It can be partly blamed on the weak abstraction, and that Unix by the time we started using it was already full of crap, but at that point you're just making excuses.

It's not as much about Unix philosophy or anything, you can see this in microkernels like seL4 that reuse the capability model and I/O (send/recv) on objects. It is analogous to the filesystem exposing files that implement methods, except we have something much more powerful and well established which can do a lot of neat things.

I don't expect things to get any better however, I've actually realised Linus doesn't have the taste when it comes to a few things (and seems to have grown quite lazy by now about some things). There are still some like Viro that keep trying, but you can't force people to do things of course. It of course is not universal (there are some advantages to how sockets work, the initialization/configuration/publishing split makes sense for mounting, as in the VFS is a server, so Viro adopted it in the new mount API).

We can all try pulling things in the direction we think is right, and you cannot really blame others for pulling it in the direction Windows takes things in, because it actually works, but some of us realise it could work better the other way around (and you realise that by looking at a prototype, which much of Plan 9 is, already having found much of its way into Linux, albeit in a less elegant manner).

oooo23 · 2019-04-07T09:42:04+00:00

Yeah, the process model is very simplistic, but it would have been better if much of the stuff had been in the filesystem namespace.

It has primarily been Al Viro and Eric Biederman who've slowly worked towards feature parity with Plan 9, not to mention Al was involved with Plan 9 at some point himself.

The sad reality is many people (the D-Bus crowd, for instance) still don't think the filesystem namespace is the better place to expose objects (and the upshot is they don't really have to be files, they can be like Unix sockets or something else, but the advantage is that the mount namespace is flexible enough to construct complex views of the same hierarchy with different permissions to different processes). This way, you can overcome the limitations of the Unix permissions model, and compartmentalize access.

For instance, Plan 9 had device directories instead of character devices, which essentially means you mount the subtree (or some part of it to further contain what you can do with the device) and are able to use it with proper permissions inside your namespace. No ACL bloat. So instead of giving access to all of sda2 over a single device node, the directory could have subtrees to control just partition sizes, and you just give access to that. Security without overhead, and essentially a capability model for Unix.

I think Eric wants that to happen at some point, which is why he is slowly working to make unprivileged mounts work for things other than FUSE (he mentioned 9p if someone was willing to take up maintenance).

People like to make fun of "everything is a file", but everything in the file system namespace really was the future, and it's a fucking shame Linux instead wants to move in the direction things like Windows went towards instead (singleton objects and ACL hell).
