
Yet people use container based isolation all the time in practice and the sky doesn't fall.

Also, every security domain in an Android system shares a kernel, yet Android is one of the most secure systems out there. Sure, it uses tons of SELinux, but so what? It still has a shared kernel, and a quite featureful one at that.

I don't buy the idea that we can't do intra-kernel security isolation and so we shouldn't care about local privilege escalation.



Android delegated some security features to a different kernel called Trusty that is separated from the main Linux kernel using virtualisation. That kernel runs high value security services.

https://source.android.com/docs/security/features/trusty


Yes, but that's not the main load-bearing security part of the system. Trusty doesn't isolate apps from each other. It doesn't isolate work profiles from user profiles. Regular SELinux-augmented thoughtfully-used uid- and process-isolation does that.


If you weren't aware, containers aren't a security boundary. Things like bubblewrap are.


Semantics make hard assertions about "containers" worthless. It depends on what one means by a container exactly, since Linux has no such concept and our ecosystem doesn't have a strict definition.


What do you think bubblewrap is, if not a container runtime?


bubblewrap is actually worse - there are known escapes in there that haven't been fixed for years


It is the most widely used sandbox layer for pretty much everything. What escapes are you talking about? Are we supposed to take your word for it? Come on


Wait. What? What escapes? Is it that bubblewrap doesn't faithfully implement the policy you give it, or that there are surprising gaps in the kernel's namespace isolation?


Ironically Ubuntu 24 now blocks users from accessing namespaces because that kernel interface had a bunch of local privilege escalations, breaking programs that want to use them for isolation.
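
If you want to see how your own machine is configured, here is a rough sketch that reads the sysctl files I know of that gate unprivileged user namespaces. Assumptions: `user/max_user_namespaces` is the upstream knob, `kernel/unprivileged_userns_clone` only exists on kernels carrying the Debian/Ubuntu patch, and `kernel/apparmor_restrict_unprivileged_userns` only exists on recent Ubuntu kernels. The function name and the `proc_sys` parameter are made up for illustration, and a missing file just means that kernel doesn't have the knob.

```python
from pathlib import Path

# Sysctl files (relative to /proc/sys) that can restrict unprivileged
# user namespaces. Which ones exist depends on the kernel and distro.
KNOBS = (
    "user/max_user_namespaces",                      # upstream; 0 means none may be created
    "kernel/unprivileged_userns_clone",              # Debian/Ubuntu patch; 0 disables
    "kernel/apparmor_restrict_unprivileged_userns",  # recent Ubuntu; 1 restricts
)


def read_userns_knobs(proc_sys="/proc/sys"):
    """Return {knob: value-as-string, or None if the file is absent or unreadable}."""
    result = {}
    for rel in KNOBS:
        try:
            result[rel] = (Path(proc_sys) / rel).read_text().strip()
        except OSError:
            # Knob not present on this kernel, or not readable by this user.
            result[rel] = None
    return result


if __name__ == "__main__":
    for knob, value in read_userns_knobs().items():
        print(f"{knob}: {value if value is not None else 'not present'}")
```

This only reports the settings. It doesn't try to create a namespace, so it won't tell you whether some other policy (seccomp, an LSM profile) blocks it.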


For the last 10 years or so, namespaces in Linux were the source of the absolute highest number of local privilege escalations and sometimes even arbitrary code executions in kernel space. Building a kernel without user namespace support has been go-to advice for multiuser systems for almost as long. Ubuntu is just late to the game because they mostly have server or single-user-desktop customers.


Actually I think device drivers got you beat there, but no one's suggesting we break them for users' safety. Ubuntu today is more user hostile than Windows.


Device drivers are worse if you just count the numbers. But they are usually far less exploitable because very often you need to have the corresponding hardware plugged in or even need to manipulate said hardware to provide crafted inputs. So in reality, device driver problems are almost never exploitable.


Seems ironic considering namespaces are highly utilized for isolation/security purposes.


I presume they're left enabled for root.


The same software that wants to use namespaces for isolation will refuse to run as root.


Not true. Docker, for example. There's plenty of cases where you set up an isolation environment as root and then use it as non-root.


Yes, but actually no: usually setting up those namespaces is done through a privileged daemon or suid-root binaries. Both of those are prone to root exploits, which isn't as bad as a kernel exploit, but only a 'modprobe' away. Group membership in the 'docker' group is famous for being root-equivalent.

It isn't impossible to do things right, but in practice, things are usually done badly.


I've even seen namespaces used for hiding malicious software in Ubuntu systems too.


Wouldn't Android's kernel have most of the hardening steps / disabled features described in GP's comment?


No. Things like eBPF, strace, and packet filtering are enabled. Android uses SELinux and other facilities to limit the amount of code the kernel will allow to access these features. Big difference from their being compiled out of the kernel entirely as the OP suggests is necessary.


Container isolation can fail at shared libraries in shared layers too, can't it? My evil service is based on the same cooltechframework base layer as your safety critical hardware control service and if there is a mistake in the framework...


then it affects each one separately since they are separate processes. The fact they run the same code is irrelevant if the data is separate.


Separate processes running the same shared instructions. If you compromise and modify those shared instructions, the other container runs instructions of your choosing.


Layers are COW so one container modifying a layer has no effect on other containers started from the same image. Of course, preexisting vulnerabilities will remain but they'd have to be separately exploited in each container.


I learned something new today! Thank you.

Edit: to be clear, I knew the disk was COW but I thought it saved memory by loading one instance of shared objects into memory.


> thought it saved memory by loading one instance of shared objects into memory

It does! The trick is that it loads the shared object read-only as far as the CPU is concerned. If a program tries to modify the memory, the CPU (I'm simplifying a lot here) throws an exception. The kernel catches that exception, makes a copy of the memory the program is trying to modify, puts the copy of the original memory at the same address as the original read-only memory, and tells the program to re-try the write operation, which now succeeds. All of this happens without the application doing the writing being aware of what's going on. From its point of view, writes Just Work.

This way, you get the memory savings of sharing and the flexibility to do writes all without the security problems of shared mutability.
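
You can poke at the same mechanism from user space. This is a minimal sketch, assuming a Unix-like system where Python's `mmap` module exposes `MAP_PRIVATE`: it maps a temporary file copy-on-write, overwrites a few bytes through the mapping, and then re-reads the file. The function name and the file contents are made up for the demo. It's an analogy for how a shared object's pages behave, not what the dynamic loader literally does.

```python
import mmap
import os
import tempfile


def private_write_leaves_file_untouched():
    """Return (bytes seen through the private mapping, bytes still in the file)."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"original library code")
        path = f.name
    try:
        with open(path, "r+b") as f:
            # MAP_PRIVATE requests a copy-on-write mapping: the first write
            # to a page gives this process its own private copy of that page.
            m = mmap.mmap(
                f.fileno(),
                0,
                flags=mmap.MAP_PRIVATE,
                prot=mmap.PROT_READ | mmap.PROT_WRITE,
            )
            m[0:8] = b"PATCHED!"  # lands in the private copy only
            seen = m[:]
            m.close()
        with open(path, "rb") as f:
            on_disk = f.read()  # the underlying file is unchanged
        return seen, on_disk
    finally:
        os.unlink(path)


if __name__ == "__main__":
    seen, on_disk = private_write_leaves_file_untouched()
    print("through mapping:", seen)
    print("in the file:    ", on_disk)
```

The write succeeds from the program's point of view, but anyone else reading or mapping the same file still sees the original bytes.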

You might enjoy reading about OS virtual memory operation more generally!


Worse, you cannot disable eBPF because too many packages demand it.

Namely, nftables and its filtering.



