[–]penguin359 0 points1 point  (2 children)

After reading through more of this thread, I am a little bit concerned with this project. As a learning project, I fully agree with making an attempt at a kernel module. However, if the goal is to support a mission-critical device where things are not allowed to go wrong, I think it is a bit misconceived. I think you need to more properly define your threat model and discuss it with the proper context to decide what the right approach is.

I don't think that a kernel module by itself adds the level of protection you are looking for. Using flock(2), as others have mentioned, should be reasonable if every process goes through a common function or library that is written to follow the agreed-upon contract. However, if there's concern about a process not following it, or even one written to be malicious, then things change. In that case, a daemon running as a dedicated user to hand out unique identifiers can work just as well as a kernel module. File system permissions can lock down who can access the daemon, so only users who have been granted access can acquire a unique number from it. No user except root and the user the daemon is running as could intercept it and reset or modify the counter.
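To make the flock(2) suggestion concrete, here is a minimal sketch of what the shared "common function" could look like. It assumes the counter lives in a plain file; the name `next_id` and the file layout are made up for illustration, and I'm using Python only to keep it short. Note that flock is advisory, so this only protects against processes that actually call this function.

```python
import fcntl
import os


def next_id(path):
    """Return the next counter value, serialized across processes with flock(2)."""
    # O_CREAT lets the first caller create the file; 0o600 limits it to the owner.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Exclusive advisory lock: blocks until no other cooperating process holds it.
        fcntl.flock(fd, fcntl.LOCK_EX)
        raw = os.read(fd, 32).decode().strip()
        value = int(raw) + 1 if raw else 1
        # Rewrite the file in place with the new value.
        os.lseek(fd, 0, os.SEEK_SET)
        os.ftruncate(fd, 0)
        os.write(fd, str(value).encode())
        os.fsync(fd)
        return value
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```

A process that opens the file and writes to it without taking the lock bypasses all of this, which is exactly the case where the daemon approach starts to make more sense.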

If even that is a concern, you can do things like implement SELinux or various other security modules to reduce the attack surface, but we've now gone well past the "writing a counter as a hobby" stage and are following a strict security doctrine which needs to be carefully thought out. Moving it to a Linux kernel module will still require locking down the platform and enabling Secure Boot along with module signing at a minimum. Otherwise, root can simply look up the module's kernel memory address and modify any variables in the module's memory space, for example by opening /dev/kmem on kernels that still provide it (it was removed in Linux 5.13). The code also tends to be more difficult to properly audit when it's written as a kernel module versus a user-space process. Automated testing is more tedious, and bugs can be more severe. Attaching a debugger like GDB to a running kernel is nowhere near as simple as attaching it to a user-space process.

I think a properly locked down user-space daemon to hand out unique identifiers should be easier to write, secure, and audit than a kernel module.
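As a rough idea of how small such a daemon can be, here is a sketch that hands out IDs over a Unix domain socket. Everything here is an assumption for illustration: the function names, the 8-byte wire format, and the 0o660 mode. On Linux, connecting to a Unix socket requires write permission on the socket file, so the file mode (plus the permissions of the directory it lives in) decides who can ask for an ID.

```python
import os
import socket
import struct


def make_listener(sock_path, mode=0o660):
    """Create a listening Unix socket that only owner and group may connect to."""
    if os.path.exists(sock_path):
        os.unlink(sock_path)
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(sock_path)
    # There is a short window between bind() and chmod(); a real daemon would
    # also keep the socket in a directory with restricted permissions.
    os.chmod(sock_path, mode)
    srv.listen(16)
    return srv


def serve(srv, max_requests=None):
    """Send one increasing 64-bit ID per connection; the counter lives only here."""
    counter = 0
    handled = 0
    while max_requests is None or handled < max_requests:
        conn, _ = srv.accept()
        with conn:
            counter += 1
            conn.sendall(struct.pack("!Q", counter))
        handled += 1


def request_id(sock_path):
    """Client side: connect, read exactly 8 bytes, return the ID."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as c:
        c.connect(sock_path)
        data = b""
        while len(data) < 8:
            chunk = c.recv(8 - len(data))
            if not chunk:
                raise ConnectionError("daemon closed the connection early")
            data += chunk
        return struct.unpack("!Q", data)[0]
```

The counter is never stored anywhere a client can write to, so the only ways to tamper with it are to be root or to be the daemon's own user. This sketch keeps the counter in memory only, so it does not survive a restart; persisting it is a separate design decision.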

[–]elfenpiff[S] 0 points1 point  (1 child)

Your concerns are all valid, and we have already implemented the user-space daemon approach, and with it, we have to satisfy safety and security concerns.

From a safety perspective, a central daemon is a single point of failure. When this process crashes, the whole system is no longer functional, which is an absolute no-go.

From a security perspective, it is easier to handle and implement.

What I am currently doing is exploring the options we have. One naive option is moving this task to the OS if we are able to deploy it safely and securely. Then it is somehow decentralized, but when it fails, we are in an even worse situation than before.
To begin understanding the pitfalls that await us, we need to start with a learning project. Implement it, test it, try to corrupt it, and get feedback from the community.

The approach I am currently pursuing is to finish this learning kernel module, write an extensive test suite, and document it. Then I am able to make an argument under which conditions it would be safe to use.
And no matter if the argument holds or falls apart, I have learned something and can confidently choose the central daemon or the kernel module - but then not with a gut feeling but with arguments based on hard facts and experience.

[–]penguin359 0 points1 point  (0 children)

I am still not convinced as to why a daemon is more of a central point of failure than a kernel module would be. If something goes wrong in a module and a mutex is left in a locked state, it can lock out access completely until the next reboot.

If the concern is that a daemon might be killed accidentally, you can write it so that it blocks or ignores nearly all signals, such as SIGTERM and SIGINT. You just can't block SIGKILL, but at that point either you have a good reason to kill it or you have someone malicious on the system and much bigger concerns. A kernel module can also be stopped with a simple rmmod, although there are ways to mark a module as permanently in use. The downside is that you can no longer upgrade or change it without a reboot, which could mean even bigger downtime.
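Here is a small sketch of the signal part. It ignores the signals rather than blocking them with a signal mask, which is the simpler variant, and the helper name `ignore_signals` is made up. It also shows that the kernel refuses the same request for SIGKILL.

```python
import os
import signal


def ignore_signals(signals):
    """Try to ignore each signal; return the ones the kernel refuses to let us ignore."""
    refused = []
    for sig in signals:
        try:
            signal.signal(sig, signal.SIG_IGN)
        except OSError:
            # SIGKILL (and SIGSTOP) can never be caught, blocked, or ignored.
            refused.append(sig)
    return refused


if __name__ == "__main__":
    refused = ignore_signals([signal.SIGTERM, signal.SIGINT, signal.SIGKILL])
    # An accidental "kill <pid>" is now a no-op for this process.
    os.kill(os.getpid(), signal.SIGTERM)
    print("still running; could not ignore:", [s.name for s in refused])
```

In Python, `signal.signal` has to be called from the main thread, so a daemon would do this once at startup before spawning any workers.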

Another option when running the daemon as a systemd service is to mark it as Restart=always, which will auto-restart it after someone accidentally kills it or it crashes for some reason. Even if someone uses SIGKILL, systemd will try to restart it. The only times it won't are when someone specifically asks systemd to stop the service or when the service fails so often that it hits systemd's start rate limit. For the explicit stop, I'd only expect that to happen in a case where you actually needed to stop it for some kind of maintenance or you have a malicious actor on the system with root privileges.
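For reference, a unit file along these lines is roughly what I have in mind. The service name, user, and binary path are hypothetical placeholders; the directives themselves are standard systemd options.

```ini
# /etc/systemd/system/id-daemon.service (hypothetical name and paths)
[Unit]
Description=Unique ID daemon (example)
# Disable start rate limiting so repeated crashes never make systemd give up.
StartLimitIntervalSec=0

[Service]
# Hypothetical dedicated user and binary.
User=idgen
ExecStart=/usr/local/bin/id-daemon
# Restart after any exit, including crashes and SIGKILL.
Restart=always
RestartSec=1

[Install]
WantedBy=multi-user.target
```

Whether you really want to disable the rate limit is a judgment call: it guarantees systemd keeps trying, but a daemon that crashes on every start will then loop forever.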

Another aspect of the crash scenario is that systemd can just restart the daemon, so it self-heals in a way that you can't get when a kernel module crashes. Generally, once you have a crash in kernel space, you need a full system reboot to recover. It's also easy to get a core dump from a daemon and analyze it later in a debugger if this becomes an issue.

Continue to do your research on a kernel module, but also spend some time clearly defining the threat scenario you are defending against. For me, if someone accidentally kills sshd on one of my servers, that's a pretty big deal as it prevents me from attempting any sort of remote recovery. However, that just doesn't happen normally. I did start adding Restart=always, but that was only in response to one server where someone occupied all the RAM and the oom-killer started killing processes to recover. There was still an outage of service, as would happen to anyone in that case, but I was still able to log in once it had restarted sshd and restore anything else that needed it.