Yet people use container based isolation all the time in practice and the sky do...

antoinealb · 2025-06-23T15:15:29 1750691729

Android delegated some security features to a different kernel called Trusty that is separated from the main Linux kernel using virtualisation. That kernel runs high value security services.

https://source.android.com/docs/security/features/trusty

quotemstr · 2025-06-23T18:30:00 1750703400

Yes, but that's not the main load-bearing security part of the system. Trusty doesn't isolate apps from each other. It doesn't isolate work profiles from user profiles. Regular SELinux-augmented thoughtfully-used uid- and process-isolation does that.

zamalek · 2025-06-23T15:15:38 1750691738

If you weren't aware, containers aren't a security boundary. Things like bubblewrap are.

rlpb · 2025-06-23T15:35:52 1750692952

Semantics make hard assertions about "containers" worthless. It depends on what one means by a container exactly, since Linux has no such concept and our ecosystem doesn't have a strict definition.

NewJazz · 2025-06-23T16:09:54 1750694994

What do you think bubblewrap is, if not a container runtime?

eyberg · 2025-06-23T21:39:59 1750714799

bubblewrap is actually worse - there are known escapes in there that haven't been fixed for years

udev4096 · 2025-06-25T16:21:36 1750868496

It is the most widely used sandbox layer for pretty much everything. What escapes are you talking about? Are we supposed to take your word for it? Come on

quotemstr · 2025-06-24T14:42:26 1750776146

Wait. What? What escapes? Is it that bubblewrap doesn't faithfully implement the policy you give it, or that there are surprising gaps in the kernel's namespace isolation?

stefan_ · 2025-06-23T15:31:44 1750692704

Ironically Ubuntu 24 now blocks users from accessing namespaces because that kernel interface had a bunch of local privilege escalations, breaking programs that want to use them for isolation.
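If you want to see what your own machine does, here is a rough Python sketch that reads the sysctl knobs I believe are involved. Treat the knob list as an assumption: `user.max_user_namespaces` is mainline, while `kernel.unprivileged_userns_clone` and `kernel.apparmor_restrict_unprivileged_userns` are distro-specific as far as I know and may not exist on your kernel. The function names are made up for illustration.

```python
from pathlib import Path

# Assumed knob locations under /proc; any of them may be absent on a given kernel.
KNOBS = {
    "user.max_user_namespaces": "sys/user/max_user_namespaces",
    "kernel.unprivileged_userns_clone": "sys/kernel/unprivileged_userns_clone",
    "kernel.apparmor_restrict_unprivileged_userns":
        "sys/kernel/apparmor_restrict_unprivileged_userns",
}


def read_userns_knobs(proc_root="/proc"):
    """Return {knob name: int value, or None if missing/unreadable}."""
    values = {}
    for name, rel_path in KNOBS.items():
        try:
            values[name] = int((Path(proc_root) / rel_path).read_text().strip())
        except (OSError, ValueError):
            values[name] = None
    return values


def unprivileged_userns_likely_blocked(knobs):
    """Heuristic only: True if any knob is set to a value that restricts
    unprivileged user namespaces. Other mechanisms (seccomp, LSM policy)
    are not considered."""
    if knobs["user.max_user_namespaces"] == 0:
        return True
    if knobs["kernel.unprivileged_userns_clone"] == 0:
        return True
    if knobs["kernel.apparmor_restrict_unprivileged_userns"] == 1:
        return True
    return False


if __name__ == "__main__":
    knobs = read_userns_knobs()
    print(knobs)
    print("likely blocked:", unprivileged_userns_likely_blocked(knobs))
```

A missing knob is reported as `None` and does not count as a restriction, so a `False` result means "none of these knobs block it", not "it will definitely work".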

holowoodman · 2025-06-23T16:02:04 1750694524

For the last 10 years or so, namespaces in Linux were the source of the absolute highest number of local privilege escalations and sometimes even arbitrary code executions in kernel space. Building a kernel without user namespace support has been go-to advice for multiuser systems for almost as long. Ubuntu is just late to the game because they mostly have server or single-user-desktop customers.

stefan_ · 2025-06-23T20:52:52 1750711972

Actually I think device drivers got you beat there, but no one's suggesting we break them for users' safety. Ubuntu today is more user hostile than Windows.

holowoodman · 2025-06-23T23:02:33 1750719753

Device drivers are worse if you just count the numbers. But they are usually far less exploitable because very often you need to have the corresponding hardware plugged in or even need to manipulate said hardware to provide crafted inputs. So in reality, device driver problems are almost never exploitable.

ranger_danger · 2025-06-23T16:20:05 1750695605

Seems ironic considering namespaces are highly utilized for isolation/security purposes.

immibis · 2025-06-23T16:33:21 1750696401

I presume they're left enabled for root.

stefan_ · 2025-06-23T20:53:42 1750712022

The same software that wants to use namespaces for isolation will refuse to run as root.

immibis · 2025-06-24T09:43:42 1750758222

Not true. Docker, for example. There's plenty of cases where you set up an isolation environment as root and then use it as non-root.

holowoodman · 2025-06-25T17:46:01 1750873561

Yes, but actually no: usually setting up those namespaces is done through a privileged daemon or suid-root binaries. Both of those are prone to root exploits, which isn't as bad as a kernel exploit, but only a 'modprobe' away. Group membership in the 'docker' group is famous for being root-equivalent.
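As a quick self-check for the 'docker' group point, a minimal Python sketch (Unix only) that reports whether the current process carries a given group. The helper name `in_group` is hypothetical, and 'docker' is simply the conventional group name; your setup may use a different one.

```python
import grp
import os


def in_group(group_name):
    """Return True if the current process has the named group as its
    effective or a supplementary group. Returns False if the group
    does not exist on this system."""
    try:
        gid = grp.getgrnam(group_name).gr_gid
    except KeyError:
        return False
    return gid == os.getegid() or gid in os.getgroups()


if __name__ == "__main__":
    if in_group("docker"):
        print("This account can talk to the Docker daemon; treat it as root-equivalent.")
    else:
        print("Not in the 'docker' group (or no such group exists).")
```

This only inspects the running process's credentials. A group added to the account after login will not show up until a new session is started.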

It isn't impossible to do things right, but in practice, things are usually done badly.

NexRebular · 2025-06-23T17:39:00 1750700340

I've even seen namespaces used for hiding malicious software in Ubuntu systems too.

pxeger1 · 2025-06-23T14:14:13 1750688053

Wouldn't Android's kernel have most of the hardening steps / disabled features described in GP's comment?

quotemstr · 2025-06-23T14:41:00 1750689660

No. Things like eBPF, strace, and packet filtering are enabled. Android uses SELinux and other facilities to limit the amount of code the kernel will allow to access these features. Big difference from their being compiled out of the kernel entirely as the OP suggests is necessary.

galangalalgol · 2025-06-23T15:23:53 1750692233

Container isolation can fail at shared libraries in shared layers too, can't it? My evil service is based on the same cooltechframework base layer as your safety critical hardware control service and if there is a mistake in the framework...

immibis · 2025-06-23T16:32:52 1750696372

then it affects each one separately since they are separate processes. The fact they run the same code is irrelevant if the data is separate.

galangalalgol · 2025-06-23T21:39:59 1750714799

Separate processes running the same shared instructions. If you compromise and modify those shared instructions, the other container runs instructions of your choosing.

kbolino · 2025-06-23T21:59:40 1750715980

Layers are COW so one container modifying a layer has no effect on other containers started from the same image. Of course, preexisting vulnerabilities will remain but they'd have to be separately exploited in each container.
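To make that concrete, here is a toy Python model of the idea. It is not how overlayfs or any real container runtime is implemented, only an illustration of "shared read-only lower layer, private writable upper layer". `ToyContainer` and the file path are invented for the example.

```python
from types import MappingProxyType


class ToyContainer:
    """Toy union view: a shared read-only image layer plus a private
    writable layer that shadows it."""

    def __init__(self, image_layer):
        self._lower = image_layer  # shared between containers, never written
        self._upper = {}           # private to this container

    def read(self, path):
        # The private layer wins if this container has written the path.
        if path in self._upper:
            return self._upper[path]
        return self._lower[path]

    def write(self, path, data):
        # "Copy up": the change lands only in this container's own layer.
        self._upper[path] = data


# One image layer shared by both containers; MappingProxyType makes it read-only.
base = MappingProxyType({"/lib/libframework.so": b"original"})
evil = ToyContainer(base)
victim = ToyContainer(base)

evil.write("/lib/libframework.so", b"patched")
print(evil.read("/lib/libframework.so"))
print(victim.read("/lib/libframework.so"))
```

The evil container sees its own patched copy, while the victim still reads the untouched file from the shared layer. A bug that already exists in the shared file is of course present in both.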

galangalalgol · 2025-06-24T11:25:16 1750764316

I learned something new today! Thank you.

Edit: to be clear, I knew the disk was COW but I thought it saved memory by loading one instance of shared objects into memory.

quotemstr · 2025-06-24T14:32:23 1750775543

> thought it saved memory by loading one instance of shared objects into memory

It does! The trick is that it loads the shared object read-only as far as the CPU is concerned. If a program tries to modify the memory, the CPU (I'm simplifying a lot here) throws an exception. The kernel catches that exception, makes a copy of the memory the program is trying to modify, puts the copy of the original memory at the same address as the original read-only memory, and tells the program to re-try the write operation, which now succeeds. All of this happens without the application doing the writing being aware of what's going on. From its point of view, writes Just Work.

This way, you get the memory savings of sharing and the flexibility to do writes all without the security problems of shared mutability.
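You can watch the same mechanism from userspace with a private file mapping. This is a small Python sketch, not what the dynamic loader literally does: it maps one temporary file twice with `mmap.ACCESS_COPY` (a private, copy-on-write mapping), writes through one mapping, and checks what the other mapping and the file on disk see. The function name and file contents are made up for the demo.

```python
import mmap
import os
import tempfile


def private_mapping_demo():
    """Write through one of two copy-on-write mappings of the same file.

    Returns (view_a, view_b, on_disk) as bytes objects.
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"shared code page")
        os.close(fd)

        with open(path, "r+b") as f:
            # ACCESS_COPY: writes affect this mapping's memory only,
            # never the underlying file.
            a = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)
            b = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)

            a[0:6] = b"HACKED"  # triggers the private copy for mapping a

            view_a, view_b = bytes(a), bytes(b)
            a.close()
            b.close()

        with open(path, "rb") as f:
            on_disk = f.read()
        return view_a, view_b, on_disk
    finally:
        os.unlink(path)


if __name__ == "__main__":
    for label, data in zip(("mapping a", "mapping b", "file"), private_mapping_demo()):
        print(label, data)
```

Only mapping `a` shows the modified bytes. Mapping `b` and the file itself still hold the original contents, which is the property that keeps one process from tampering with the code another process runs.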

You might enjoy reading about OS virtual memory operation more generally!

egberts1 · 2025-06-23T15:53:51 1750694031

Worse, cannot disable eBPF due to too many packages demanding it.

Namely, nftables and its filtering.