Yet people use container based isolation all the time in practice and the sky doesn't fall.
Also, every security domain in an Android systems shares a kernel, yet Android is one of the most secure systems out there. Sure, it uses tons of SELinux, but so what? It still has a shared kernel, and a quite featureful one at that.
I don't buy the idea that we can't do intra-kernel security isolation and so we shouldn't care about local privilege escalation.
Android delegated some security features to a different kernel called Trusty that is separated from the main Linux kernel using virtualisation. That kernel runs high value security services.
Yes, but that's not the main load-bearing security part of the system. Trusty doesn't isolate apps from each other. It doesn't isolate work profiles from user profiles. Regular SELinux-augmented thoughtfully-used uid- and process-isolation does that.
Semantics make hard assertions about "containers" worthless. It depends on what one means by a container exactly, since Linux has no such concept and our ecosystem doesn't have a strict definition.
It is the most widely used sandbox layer for pretty much everything. What escapes are you talking about? Are we supposed to take your word for it? Come on
Wait. What? What escapes? Is it that bubblewrap not faithfully implement the policy you give it or that there are surprising gaps in the kernel's namespace isolation?
Ironically Ubuntu 24 now blocks users from accessing namespaces because that kernel interface had a bunch of local privilege escalations, breaking programs that want to use them for isolation.
For the last 10 years or so, namespaces in Linux were the source of the absolute hightest number of local privilege escalations and sometimes even arbitrary code executions in kernel space. Building a kernel without user namespace support has been goto-advice for multiuser systems for almost as long. Ubuntu is just late to the game because they mostly have server or single-user-desktop customers.
Actually I think device drivers got you beat there, but no ones suggesting we break them for users safety. Ubuntu today is more user hostile than Windows.
Device drivers are worse if you just count the numbers. But they are usually far less exploitable because very often you need to have the corresponding hardware plugged in or even need to manipulate said hardware to provide crafted inputs. So in reality, device driver problems are almost never exploitable.
Yes, but actually no: usually setting up those namespaces is done through a privileged daemon or suid-root binaries. Both of those are prone to root exploits, which isn't as bad as a kernel exploit, but only a 'modprobe' away. Group membership in the 'docker' group is famous for being root-equivalent.
It isn't impossible to do things right, but in practice, things are usually done badly.
No. Things like eBPF, strace, and packet filtering are enabled. Android uses SELinux and other facilities to limit the amount of code the kernel will allow to access these features. Big difference from their being compiled out of the kernel entirely as the OP suggests is necessary.
Container isolation can fail at shared libraries in shared layers too can't it? My evil service is based on the same cooltechframework base layer as your safety critical hardware control service and if there is a mistake in the framework...
Separate processes running the same shared instructions. If you compromise and modify those shared instructions, the othe container runs instructions of your choosing.
Layers are COW so one container modifying a layer has no effect on other containers started from the same image. Of course, preexisting vulnerabilities will remain but they'd have to be separately exploited in each container.
> thought it saved memory by loading one instance of shared objects into memory
It does! The trick is that it loads the shared object read-only as far as the CPU is concerned. If a program tries to modify the memory, the CPU (I'm simplifying a lot here) throws an exception. The kernel catches that exception, makes a copy of the memory the program is trying to modify, puts the copy of the original memory at the same address as the original read-only memory, and tells the program to re-try the write operation, which now succeeds. All of this happens without the application doing the writing being aware of what's going on. From its point of view, writes Just Work.
This way, you get the memory savings of sharing and the flexibility to do writes all without the security problems of shared mutability.
You might enjoy reading about OS virtual memory operation more generally!
Also, every security domain in an Android systems shares a kernel, yet Android is one of the most secure systems out there. Sure, it uses tons of SELinux, but so what? It still has a shared kernel, and a quite featureful one at that.
I don't buy the idea that we can't do intra-kernel security isolation and so we shouldn't care about local privilege escalation.