There is an interesting tool in their GitHub repo called go_generics [0]. It looks like it transforms a Go source file and writes out a new file by doing name replacement and prefixing/suffixing of variable, method, and type names.
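To give a flavor of the kind of transformation such a tool performs (a hypothetical input and naming scheme, not go_generics' actual flags or conventions):

    // "Template" Go source, written against a placeholder type:
    package stack

    type T interface{} // placeholder the tool substitutes

    type Stack struct{ items []T }

    func (s *Stack) Push(v T) { s.items = append(s.items, v) }

    // After substituting T=int and suffixing names with "Int", the
    // generated file would look roughly like:
    //
    //     type StackInt struct{ items []int }
    //
    //     func (s *StackInt) Push(v int) { s.items = append(s.items, v) }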
Since I see that some of the developers are in this thread, I'll post my question to them here.
What are your plans to deal with the overhead and nastiness of ptrace? Beyond the performance losses, there's also the annoyance that you can't ptrace a single task twice, so no debuggers.
Are you familiar with the FlexSC paper? Have you considered using a FlexSC-like RPC interface (a secure one, of course) to achieve your syscall interception, instead of ptrace? That would allow you to not just match the performance of native system calls, but even theoretically exceed their performance, while still having the same level of control. (I have been working on such an approach, so I was excited to see this gvisor project posted - I hoped you might have already done this and saved me some work :))
Not sure how far this project can go if it sticks with ptrace...
To correct one misconception, the project is not bound to ptrace. There is a generic platform interface, and the repository also includes a KVM-based platform (wherein the Sentry acts as guest kernel and host VMM simultaneously) described in the README. The default platform is ptrace so that it works out of the box everywhere.
> What are your plans to deal with the overhead and nastiness of ptrace? Beyond the performance losses, there's also the annoyance that you can't ptrace a single task twice, so no debuggers.
It's true that you can't trace the sandbox itself, and that's annoying, but you can still use ptrace inside the sandbox (ptrace is implemented by the Sentry). Just wanted to make sure that was clear.
> Are you familiar with the FlexSC paper? Have you considered using a FlexSC-like RPC interface (a secure one, of course) to achieve your syscall interception, instead of ptrace? That would allow you to not just match the performance of native system calls, but even theoretically exceed their performance, while still having the same level of control. (I have been working on such an approach, so I was excited to see this gvisor project posted - I hoped you might have already done this and saved me some work :))
I am familiar with FlexSC. There are certainly opportunities for improvement, including kernel hooks, shared regions for async system calls, etc. Given the pace at which this space is evolving, our priority was to share these pieces so that we can discuss things in the open. While I don't think we'll be able to save you work (sorry!), we're aiming for collaboration and cross-fertilization.
Good to have a gVisor developer on here. At Dropbox one way we use secure containers is to run machine learning models. Do you know if TensorFlow works in a gVisor container? How is GPU support in the container? If running on a CPU, are BLAS libraries supported to speed up matrix math in the container? Finally, do you know if OpenCV currently runs in gVisor containers?
Tensorflow security geek here: TF works in gVisor. I strongly recommend a VM solution when dealing with GPUs. Haven't tried GPU in gVisor because the size of the attack surface against the GPU device driver is so large. Depends on your level of paranoia, though.
For CPU, gVisor is great.
You can absolutely use BLAS libraries. Eigen (the TF default) works for sure. I don't see any reason MKL wouldn't, but I haven't personally tested it.
Sadly, at least without violating Nvidia's license agreement, you can't do this without a Tesla-series GPU. The consumer drivers won't let you put them in a VM using an IOMMU, which is the "right" way to do it.
Your main options are a Tesla P40 if you've got six grand to drop, which you can happily stuff in a VM and use the IOMMU to guarantee isolation for, or trying to massively optimize for your host CPU. Fortunately, with a lot of inference tasks, CPU isn't too bad. If you can get your batch sizes up, using MKL or MKL-DNN is a quite reasonable option on a decent Intel CPU. Installing TF with MKL-DNN is pretty easy these days, too (https://www.tensorflow.org/performance/performance_guide#ten... ). That would honestly be my first try if you need serious isolation on a budget.
You can map a GPU into a general container, it's just... not very good security. Any flaw in the (very large, complex) nvidia binary blob will leave you exposed to potential container escapes (or outright root). This really depends on your threat model. Putting it in a container is better than handing someone the root ssh keys to your server, it's just not very satisfying if you're serious about security.
If you're just accepting untrusted, e.g., CSV or image input from a customer and then running your own trusted model on it, you could half-a## it and do the format decode, some validation, and initial processing in a gVisor sandbox, and then pass that to your own trusted process that has GPU access. It doesn't protect you against all exploits, and if you were doing it at Google, I'd tell you not to do that, but if you have less to lose it may be an acceptable middle ground. It's kinda complicated, though, and would incur decent data-copy overhead.
This is one area where using a cloud hosted service is appealing, 'cause they've bought the datacenter GPUs or TPUs or whateverPUs for you and handled the isolation story. (Disclaimer - as you probably gathered, I also work part time at Google, frequently on tensorflow and cloudml security, but this is all my opinion.)
> The consumer drivers won't let you put them in a VM using an IOMMU, which is the "right" way to do it.
For Windows guests at least, the workaround is pretty easy. You just prevent KVM from identifying itself. I have a gaming machine with a GTX1080Ti running in KVM at home.
> To correct one misconception, the project is not bound to ptrace. There is a generic platform interface, and the repository also includes a KVM-based platform (wherein the Sentry acts as guest kernel and host VMM simultaneously) described in the README. The default platform is ptrace so that it works out of the box everywhere.
Doesn't that have even worse perf than ptrace? When I tried to do that, I ended up getting bitten by the roughly 4 context switches needed to get most anything done: guest -> host_kernel -> host_user_vmm -> host_kernel -> guest.
Or are you just saying that if/when some third option that doesn't suck as much comes out, you'd be able to hopefully transparently switch to it?
You are right that ptrace is slow and nasty. The key problem, I believe, is the tracer-tracee model: it involves two host processes, and the switch is asynchronous (ptrace(PTRACE_SYSEMU) and then waitpid).
We do have the KVM platform that offers the synchronous switch, which performs better if you have bare-metal virtualization support.
This seems neat, but every time I read about something of this sort, I am left wondering: what's wrong with the pledge/seccomp model? According to TFA:
> Kernel features like seccomp filters can provide better isolation between the application and host kernel, but they require the user to create a predefined whitelist of system calls.
Isn't that something you'd effectively have to do anyway if you want a sandbox? Like, a sandbox isn't worth that much if you don't define what it can and can't do, no?
I'm far from an expert in this area. It's an honest question, not a veiled criticism.
The short version is that pledge and seccomp are implemented in the kernel, so any mistakes there let the attacker win immediately. As a result, they are implemented to be as simple as possible and don't let you express rules like "You can open /etc/ld.so.conf but nothing else in /etc". (Namespaces, which are also in the kernel but are more complex, let you do this sort of thing but definitely have had game-over bugs recently.)
The ideal way to deploy something like this is to stick seccomp/pledge on the userspace sandboxing process, and implement all your complicated policy decisions in that process.
> As a result, they are implemented to be as simple as possible and don't let you express rules like "You can open /etc/ld.so.conf but nothing else in /etc".
OpenBSD's pledgepaths (soon to be renamed?) will address this, but in general, "hoisting" the opening of files to before pledge or "sandboxing" is often the correct approach.
The proposed semantics for pledgepath are explicit paths/files are allowed, and must be done before any pledge including "rpath/wpath" or "cpath".
That seems cool, but I hope you don't end up in the place where namespaces ended up where there are too many security-sensitive codepaths in the kernel before you hit the sandbox. :) Unprivileged user namespaces exist but are disabled by default on many distros. It would be a shame for pledge to stay in the use case of "hardening programs that were previously running as root and that was okay" and not in the use case of "sandboxing actually untrusted programs".
OpenBSD has practically pioneered privsep and privdrop designs, including techniques like file descriptor passing. pledge() already works for privileged/unprivileged users/processes and most of the base system is pledged.
There are two primary strengths that gVisor provides over the seccomp model, the second of which you've actually alluded to above.
1. Layered security
While seccomp allows users to limit the attack surface on the kernel, the application is still directly interacting with it and any single bug in an allowed system call will allow compromise. One of the design principles of gVisor is that no single bug should allow compromise of the host system/user data.
By intercepting and handling all application system calls, the gVisor kernel is the first layer of defense against the application. The gVisor kernel then places itself inside a seccomp sandbox as a second layer of defense, so if the application achieves privilege escalation into the gVisor kernel, its attack surface to the host is still limited.
The gVisor kernel seccomp policy [1] is much more restrictive than the system calls we implement. For example, note that "open" and friends are not allowed at all. File system access is mediated by an external agent [2] which does not trust the gVisor kernel, so even a compromised gVisor kernel has no elevated file system access.
2. Ease of use
> > Kernel features like seccomp filters can provide better isolation between the application and host kernel, but they require the user to create a predefined whitelist of system calls.
> Isn't that something you'd effectively have to do anyway if you want a sandbox?
This is something we'd like to challenge with gVisor. gVisor intends to be "secure by default" and configuration-free to the largest extent possible.
gVisor runs and sandboxes arbitrary, unmodified Linux binaries. You don't need to specify a sandbox policy because gVisor safely implements the entire Linux API [3].
Building a sandbox policy can be difficult and time-consuming. It can also be a difficult maintenance burden to update as the application changes over time, especially if you've made modifications to the application to reduce its syscall surface. Additionally, some use cases involve sandboxing arbitrary workloads, for which a sandbox policy cannot be predefined.
With gVisor, we hope to remove this painful step in sandboxing and enable developers to easily sandbox their workloads.
The former is a set of kernel libraries derived from NetBSD, and the latter is a unikernel built on top of them. gVisor is different in a couple of ways: 1) gVisor is written from scratch in Go for its memory and type safety; 2) gVisor aims to be compatible with Linux, which most people use. In theory, gVisor could be restructured as a unikernel, but we still like to retain the ring privilege boundary for additional isolation. We are working on an academic paper which will have more details.
This is true, Go is not memory safe in the presence of data races, and data races are possible in safe Go.
But they're also generally easy to code-review out. There's definitely a huge difference between C and Go, regardless of this one caveat to Go's memory safety guarantees.
They aren't using single threaded Go from what I can see.
Data races are not easy to "code review out". That is contrary to decades of experience. All you have to do in Go to get a race is to close over a for loop induction variable in a goroutine.
There is not a large difference between C and Go here. In fact, races might be easier in Go than in C, because it's easier for goroutines to close over mutable variables.
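A minimal example of exactly that, which the race detector (go run -race) flags immediately:

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        var wg sync.WaitGroup
        for i := 0; i < 10; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                // Data race: every goroutine reads the single loop
                // variable i while the loop keeps writing it.
                fmt.Println(i)
            }()
        }
        wg.Wait()
    }

The conventional fix is shadowing (i := i) before spawning the goroutine, or passing i as an argument -- easy once you see it, but easy to miss in review.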
> I haven't really seen this as a big problem in Go.
Go certainly does have problems with data races all the time. Just Google for "golang data race": you'll find many blog posts explaining common data race gotchas in Go.
> Regardless, I think there's a world of difference between C and Go when it comes to memory safety.
It also ignores that seccomp-bpf allows for far more fine-grained rules for syscalls (like specifying that certain bits be cleared or certain arguments be equal to a value). And they're adding more and more features to it over time. I don't get why you would use ptrace (and if you don't use ptrace then you don't need another layer -- just play with the existing OCI support for seccomp and use runc directly).
seccomp-bpf doesn't let you follow pointers, so you can't even implement most pledge() restrictions in it. For instance, pledge() always permits open("/etc/localtime"), but at the point seccomp is run, all you know is open(some pointer to userspace).
You could imagine combining seccomp-bpf with some other system that reads the arguments after they've been copied to kernelspace, which is basically Landlock's approach https://lwn.net/Articles/698226/. But I've been personally waiting for something like this since 2011 or so when people were saying seccomp mode 2 should use ftrace, and Landlock itself has been in review (slash argument) for two years. An approach like gVisor works today.
> seccomp-bpf doesn't let you follow pointers, so you can't even implement most pledge() restrictions in it. For instance, pledge() always permits open("/etc/localtime"), but at the point seccomp is run, all you know is open(some pointer to userspace).
This is something that is being worked on (separately but similar to Landlock) in the form of seccomp syscall emulation (I don't remember the actual name of the patchset at the moment but it was proposed a month ago I think). However after talking to some seccomp folks I was told that in theory eBPF maps could be used for this purpose (though I'm not really convinced to be honest).
The real downside of ptrace is that you cannot filter which syscalls you're interested in -- so you pay the price of tracing for every syscall. seccomp doesn't have this problem.
You can use SECCOMP_RET_TRACE to kick complicated cases back to the ptracer but handle the easy cases without the slowdown. So you can write a seccomp policy that does something like this pseudocode:
    if syscall == SYS_open:
        if flags == O_RDONLY:
            return (SECCOMP_RET_TRACE, 0)
        else:
            return (SECCOMP_RET_ERRNO, EPERM)
    else if syscall in (SYS_read, SYS_write, ...):
        return (SECCOMP_RET_ALLOW, 0)
    else:
        return (SECCOMP_RET_ERRNO, ENOSYS)
and it would be much much faster than tracing every system call, since most programs call open() rarely and read() and write() very often.
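If anyone wants to play with this, here's a rough Go rendering of that pseudocode as a raw seccomp-bpf program via golang.org/x/sys/unix. It's a sketch for x86-64 only; a real filter must also validate seccomp_data.arch, and the BPF opcodes and SECCOMP_RET_* values are written out by hand rather than assumed to exist as named constants:

    package sandbox

    import (
        "unsafe"

        "golang.org/x/sys/unix"
    )

    // Classic-BPF opcodes and seccomp return values.
    const (
        bpfLdAbsW = 0x20       // BPF_LD | BPF_W | BPF_ABS
        bpfJeqK   = 0x15       // BPF_JMP | BPF_JEQ | BPF_K
        bpfRetK   = 0x06       // BPF_RET | BPF_K
        retTrace  = 0x7ff00000 // SECCOMP_RET_TRACE: defer to the ptracer
        retAllow  = 0x7fff0000 // SECCOMP_RET_ALLOW
        retErrno  = 0x00050000 // SECCOMP_RET_ERRNO, errno in the low bits
    )

    func installFilter() error {
        filter := []unix.SockFilter{
            // A = seccomp_data.nr (offset 0). NB: a real filter checks
            // seccomp_data.arch (offset 4) before anything else.
            {Code: bpfLdAbsW, K: 0},
            {Code: bpfJeqK, K: unix.SYS_OPEN, Jt: 2, Jf: 0},  // open -> inspect flags
            {Code: bpfJeqK, K: unix.SYS_READ, Jt: 5, Jf: 0},  // read -> allow
            {Code: bpfJeqK, K: unix.SYS_WRITE, Jt: 4, Jf: 5}, // write -> allow, else ENOSYS
            // A = low 32 bits of args[1], i.e. open's flags (little-endian).
            {Code: bpfLdAbsW, K: 24},
            {Code: bpfJeqK, K: unix.O_RDONLY, Jt: 0, Jf: 1},
            {Code: bpfRetK, K: retTrace},                      // read-only open -> tracer
            {Code: bpfRetK, K: retErrno | uint32(unix.EPERM)}, // other open -> EPERM
            {Code: bpfRetK, K: retAllow},                      // read/write -> allow
            {Code: bpfRetK, K: retErrno | uint32(unix.ENOSYS)},
        }
        prog := unix.SockFprog{Len: uint16(len(filter)), Filter: &filter[0]}
        if err := unix.Prctl(unix.PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0); err != nil {
            return err
        }
        return unix.Prctl(unix.PR_SET_SECCOMP, unix.SECCOMP_MODE_FILTER,
            uintptr(unsafe.Pointer(&prog)), 0, 0)
    }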
That said, the ptracer's job here is kind of hard, because the kernel still gets an untrusted userspace pointer, and another thread, another process, etc. can modify that memory in between the ptracer okaying it and the kernel getting to it. (See "Argument races" in Tal Garfinkel's 2003 "Traps and Pitfalls" paper.) So you either want the filtering to happen in the kernel after it's been copied to kernelspace (which is Landlock's approach), or do the open from a trusted process and send the fd over (which is I think gVisor's approach).
> seccomp-bpf doesn't let you follow pointers, so you can't even implement most pledge() restrictions in it.
That's not even the worst of it: it's impossible to implement the ratcheting-down semantics of pledge() using seccomp-bpf. For example, an initial pledge("stdio rpath") may later be reduced to pledge("stdio").
"Subsequent calls to pledge can reduce the abilities further, but abilities can never be regained."
Personally I don't think pledge()'s semantics are the best idea in the world, especially since all of the restrictions are cleared on exec() if it is permitted IIRC -- so it's useless for sandboxing.
Of course. If you promise "exec", then you're allowed exec. This allows pledge to be used in software where it would otherwise be impossible, for example text editors and shells. It still reduces attack surface, as the shell itself can no longer open sockets or issue random device ioctls. The alternative is no protection at all.
> Package ptrace provides a ptrace-based implementation of the platform interface. This is useful for development and testing purposes primarily, and runs on stock kernels without special permissions.
I hate to rain on this interesting project's parade, but if you need full sandbox isolation then you should probably look to full VM isolation (a la Kata Containers, formerly Clear Containers). User namespacing, seccomp, and SELinux/AppArmor buy you about as much of a sandbox as you'll need, with the one caveat that a kernel exploit could still take you down (all other exploits are rendered sandboxed).
If you need that final layer of kernel sandboxing, then a full VM is your only guarantee. UML still sits on top of an exploitable kernel, and presumably this project itself can be hacked. While it certainly is better than nothing, the only thing it seems to be buying you above Kata Containers is a faster spin-up time and the ability to dynamically resize the container. Maybe that trade-off is worth it to someone, but the added performance overhead and the demi-kernel isolation seem like a high price to pay for those features.
Nothing wrong with a full VM, but I don't think it's a panacea (and you probably shouldn't guarantee security). You may have taken the UML comparison to heart: did you look at the KVM platform? I'm not clear on the distinctions you're making in that case -- a kernel escape would look a lot like a user-space VMM code-execution vulnerability (which also sits on top of an exploitable kernel).
Well, I wasn’t arguing for a VM being a panacea, but in the context of not being satisfied with the Linux primitives for sandboxing, I think it is the next logical step up in security.
From my perspective this project seems like an intermediate jump from Linux containerization primitives and a full blown VM, and I was wondering out loud who fits that use case?
Finally, I didn't mention KVM, but my understanding of KVM is that its isolation primitive is the hardware virtualization instructions (or at least could be; I'm not sure if it has a PV mode or not).
I guess my question for you would be:
In what context would I want to use this over something like Kata containers?
KVM is the kernel interface for virtualization features, but the model created (i.e. the emulated hardware or lack thereof) is up to the user space component (normally QEMU). I think your understanding of KVM is tied with a specific implementation.
FWIW, I don't disagree that it's an intermediate step in some regards. The use cases follow (also from trade-offs discussed in the README). I can't speculate on a stranger's needs, it's great if Kata works for yours. I also think that approach is valuable (as an aside, I authored an experimental project with a similar approach years ago [1]).
Is there a comparison of Kata and gVisor based on how they act functionally rather than how they are implemented under-the-hood? Like the OP, I'm curious when you'd use this over Kata.
Not a direct comparison of these projects specifically but here is the write-up that was presented in the context of the Kubernetes SIG Node discussions about this topic:
VMs on commercial and FOSS platforms have been unable to ensure security as far back as the first pentest of VM/370. They were too complicated and depended on a lot of privileged code. The founders of INFOSEC took aim at this with KVM/370 and the VAX VMM Security Kernel. One example (see esp. the layering and assurance sections):
Even those systems had too much complexity by our standards. They aimed for nearly-perfect implementation of the core kernels mediating everything. That led to separation kernels like INTEGRITY-178B, LynxSecure, seL4, and Muen. They're usually just 4-12Kloc. Whereas, projects trying to achieve similar assurance activities on KVM and Xen mostly gave up due to complexity. If aiming for correctness more than security, projects like Nova microhypervisor and GenodeOS show similar partitioning can still help.
Although separation kernels succeeded, their requirements assumed hardware/firmware that was sane and would work correctly. What people are using for virtualized workloads has been overly complex with shoddy implementations. Bypasses keep happening via everything from CPU's to RAM to peripheral firmware. Although methods exist to assure them, the companies providing them are not using them.
So, if you want anything close to a guarantee, you have to use physical separation with strongly-mediated communication systems over optical links. The boards need to be electrically isolated from each other in their own TEMPEST boxes or safes. Alternatively, use the old technique from the separation-kernel era of putting all the complex, enemy-facing stuff on its own boxes that interpret stuff down to requests in simple protocols, interfacing with the trusted boxes on better hardware. That reduces hardware cost, but what you get is nothing like a Google or Amazon cloud. It's more like a partitioned version of SoftLayer.
This is a tangent: I'm baffled by the subtext that the attackers have such an easy time attacking, and defenders have such a hard time defending that security experts are our only hope. It's really self serving, even if partly true. I don't know where the truth begins and the self serving ends though.
Developers are focused on developing, not securing. Attackers are focused on attacking, not developing. Defenders are focused on attackers and securing.
It's not that security experts are our only hope, it's that you don't go to a mechanical engineer when your car breaks down. One person is really good at designing, and another person is really good at fixing. In the ideal case, the fixer brings the broken thing to the designer so the designer can improve the design. The tighter that loop is, the quicker the quality gets better.
Your comment seems to be a murkier restatement of the discussion in the post, which also mentioned Kata. It’d be more useful to give details about why you don’t think the flexibility and lower resource usage are worthwhile rather than acting like they didn’t explicitly discuss the trade offs.
Whatever happened to User-Mode Linux (UML)? It seems like gVisor takes a similar approach: a "kernel" in userspace handling syscalls. I was playing with UML the other day; a UML kernel is still in Debian's repo and appears roughly up to date [1], but all the documentation I could find on it was many years old and seemed out of date.
My experience with UML is very old, but I think that it had a significant impact on performance. About a 50% loss in certain cases, vs running the same applications on the host directly.
https://twitter.com/tallclair/status/991621542265180161 -- "Google has been relying on gVisor to sandbox production workloads for years. I'm super excited that we've open sourced! Now we can talk about #Kubernetes integrations :)"
The GitHub page mentions that Postgres, nginx, and Elasticsearch aren't currently supported by the gVisor kernel (or pseudo-kernel?). Those are significant, but they shouldn't be showstoppers for a lot of deployments. They list each of these shortcomings as bugs, so hopefully there's a real effort to support the capabilities required by those applications.
Also, I'd like to see what the performance impact is like. What types of applications would suffer the worst performance running on gVisor? What applications will be the least affected?
If it's not safe then it wouldn't be a proper sandbox in the first place.
The goal is to intercept all system calls and reimplement them in a lightweight kernel that talks to the host kernel only via a minimal 9P-based protocol.
I.e. there is never a direct syscall being served by the host kernel on behalf of a process running inside the container.
From what I can read in their design docs, the user id running inside the container seems completely irrelevant.
User namespaces can already get you this, but this is an added layer of defence in case there are exploitable kernel vulnerabilities that could allow an attacker to break out of a container. With this runtime, if you break out of the container you are in an isolated kernel. As others have pointed out, though, this is basically just a stripped-down UML, so the performance is likely not great. Though in contexts where security is at a premium (compliance contexts for the healthcare and finance industries) it might be worth the cost.
Proponents of Illumos zones and FreeBSD jails claim that these solutions offer better security than Linux containers while maintaining the performance of running directly on a shared kernel. And now, both Illumos and FreeBSD have Linux emulation for x86-64. Has Google tried these solutions and found them wanting? Has anyone done research on whether Illumos zones or FreeBSD jails really provide better security than Linux containers?
> Has anyone done research on whether Illumos zones or FreeBSD jails really provide better security than Linux containers?
This is an ever changing property of both systems, and a little bit subjective, so a study is both difficult and outdated as soon as it's done.
What we can do is look at the number of published vulnerabilities over a timeframe, and compare the overall system designs and development philosophies. I don't know of a comparison of the numbers of vulnerabilities, but for a bit of history and why I would personally trust zones over containers I previously wrote this comment.
Your comment really speaks to the different philosophy of Linux vs FreeBSD.
FreeBSD is very much a single system - the kernel and userland are designed and built together. Linux, on the other hand, is a kernel which has multiple different userlands made up of different pieces that distributions pick and choose. Ubuntu, for example, is quite different from OpenSUSE, which is quite different from CentOS, but they're all still Linux.
Linux being only a kernel, versus BSD being an OS, makes it easier to leverage for Android, ChromeOS, etc. Google uses the Linux kernel everywhere: from CC to Google Home and WiFi, etc., but then also in their cloud.
The appengine-java-vm-runtime is for Flexible Environment.
I want to know about the sandbox used by the Standard Environment.
The Java 8 runtime seems to use a new sandbox mechanism.
Is this the same actual codebase that is running in production on Google's core infrastructure? Or is it more like Kubernetes, that is to say: a distinct open reimplementation based on the experience of running the in-house system?
From a security perspective, I don't think there is a big difference between process isolation and kernel isolation. Oh, you think you made some really secure software? That's great, here is how I will use a side-channel to work around it.
If it weren't for the fact that the vanilla Linux kernel is the security equivalent of swiss cheese, process isolation should be good enough for basic "sandboxing". Add SELinux and some patches and it's good enough for the NSA.
So rather than waste time on piling on another layer of abstraction for what is, in practical terms, no significant security advantage, just make userland containers (don't run your container manager as root) and secure the OS and stop reinventing the wheel with added complexity.
"Add SELinux and some patches and it's good enough for the NSA."
Exactly. They didn't accept that, though, since the underlying TCB was too insecure. I'll elaborate for other readers.
SELinux was a prototype to add a tiny amount of functionality from "Trusted Operating Systems" to Linux to see if its security could be improved. Those assessing security at NSA rated the confidence of systems like that at C2/EAL4+. Here's what that means as described by Shapiro when Windows 2000 got the rating:
Highly-assured software they accept as close to secure is rated EAL6/7 (new criteria) or B3/A1 (older criteria). One of my favorite papers illustrating the kind of rigor that goes into that was the VAX VMM Security Kernel for secure virtualization back in the early 1990's. It was designed for A1/EAL7. Look at the layering and Assurance sections for examples of techniques high-assurance security still uses today, albeit with different tooling.
One of the newer ones at EAL6+ is INTEGRITY-178B whose page nicely illustrates the kinds of features and evidence packages they had to use to assure the separation kernel. There's more politics in the process these days, though, where they try to play down any weaknesses. The features and analyses are still good examples, though, of what would be in an openly-developed alternative.
SELinux and common virtualization solutions don't begin to compare in the confidence that attackers will find only minimal hacks. Instead, the endless complexity demanded by the features people add assures there will be plenty of vulnerabilities to come, even in things that used to be safe. That's the default for both proprietary and FOSS software. Stuff making sacrifices for maximum security is rare. SELinux isn't one of the latter. Its predecessor, LOCK, was, though. Compare and contrast them, too, esp. on the UNIX functionality set and the what/why of the modifications.
Why do you think so? For compute (non-syscall-bound) workloads it should be native speed (just like a VM). For syscall-heavy operations it will depend on the syscall, how it's implemented, and the backend processing.
For example, if you call lots of syscalls that are fully implemented in gVisor (ones that do not rely on an external backend/service to complete) but that happen to be implemented inefficiently there (because they aren't a top priority right now / not a lot of users rely on them being fast), versus the same syscalls already being optimized in Linux, then obviously there will be a big difference. But for syscalls which need an external service (e.g. network/storage) to complete the request, depending on the latency of the external service, the processing-speed difference between gVisor and a guest Linux kernel may not matter.
It really depends on the workload, there is indeed potential that some workloads will be significantly slower with gVisor, all other things being equal, but it doesn't seem to me to be a general thing.
Clear Containers has merged into Kata Containers -- gVisor offers a different set of tradeoffs than Kata (or CC). The big one is resource footprint -- gVisor will typically be lower, often much lower than Kata. On the other hand, for very syscall heavy workloads Kata needn't take exits (while gVisor exits for many syscalls that make it beyond the sandbox), so performance on gVisor will have higher variance with respect to syscall rate. Kata also offers the opportunity for sandboxing things like kernel driver blobs (since you get a whole new Linux ring 0) while gVisor relies on the host kernel for many tasks.
There's a lot of overlap between cgroups, gVisor-style sandboxes, and VM-style sandboxes like Kata. The tradeoffs between them are mostly with respect to compatibility, robustness of the security boundaries, and performance. So, you know, the usual suspects.
We've reached a stage where we probably need better vocabulary for describing these tradeoffs :)
(I work on "near" the gVisor folks at Google, and I'm involved in the Kata community)
Basically run `mount` in a gvisor container and run `mount` in a runc container and see the major differences there. Just one example, but as you can see, linux mount namespaces tend to leak lots of mount information. some of it could be cleaned up with additional unmounts after setting up the new root for the container, but knowing what to unmount is not so simple (plus it's just janky AF).
Is it possible to hook a custom filesystem implementation to this? So an app won't touch filesystem at all but still will be able to use some sort of virtual files.
You can implement a custom file system as a separate process (called a Gofer) and have the sandbox connect to that. The custom file system has to adhere to the p9.File API (see pkg/p9/file.go) - the protocol is an extended version of 9P2000.L.
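To give a flavor of what that means, here's a conceptual, self-contained sketch -- the real p9.File interface in pkg/p9/file.go has many more methods, and these signatures are illustrative, not gVisor's:

    package main

    import (
        "fmt"
        "io"
    )

    // virtualFile serves bytes from memory instead of the host
    // filesystem. A real gofer would implement the full p9.File
    // interface (Walk, Open, GetAttr, ...) and speak 9P2000.L to the
    // Sentry over a socket.
    type virtualFile struct {
        data []byte
    }

    // ReadAt mirrors the pread-style request the Sentry issues when a
    // sandboxed app read()s the file.
    func (f *virtualFile) ReadAt(p []byte, offset uint64) (int, error) {
        if offset >= uint64(len(f.data)) {
            return 0, io.EOF
        }
        return copy(p, f.data[offset:]), nil
    }

    func main() {
        f := &virtualFile{data: []byte("hello from a virtual file\n")}
        buf := make([]byte, 4)
        n, _ := f.ReadAt(buf, 6)
        fmt.Printf("%s\n", buf[:n]) // prints "from"
    }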
This has been possible and quite easy with FUSE (and overlayfs / mergerfs / mhddfs / aufs if you don't want to implement all the functionality yourself) for the last 8 years at least -- I've used mhddfs and FUSE for exactly this on Ubuntu 10.04.
I'm not interested in fuse here. Gvisor has fsgofer, which is a proxy of some kind to a filesystem, there is even ReadAt [1]. But it is light on details, I was curious how and to what extent it proxies filesystem API.
> "Since gVisor is itself a user-space application, it will make some host system calls to support its operation, but much like a VMM, it will not allow the application to directly control the system calls it makes." [https://github.com/google/gvisor]
TL;DR: This is a user-space process that hooks syscalls/ioctls made by your "containerised" applications.
(1) This is hardly a strong security model. Proper security cannot be guaranteed by simply hooking API calls in user-space alone.
(2) With this framework in mind, developers now need to worry about yet another layer of indirection. Assume <application> was tested to work on Ubuntu; that fact alone is not sufficient to assume it will keep running under gVisor.
(3) I would personally like to see more documentation/benchmarks regarding the performance impacts that come with using this.
(4) This is strongly coupled with internal Kernel implementations. It will not be easy to port and maintain this across different Kernels.
> "but much like a VMM, it will not allow the application to directly control the system calls it makes."
It has two modes of operation. One uses ptrace with PTRACE_SYSEMU, which was implemented so that User Mode Linux could intercept all syscalls. This works in all environments, whether or not hardware virtualization is available (including VMs which don't enable nested virtualization).
The other is that it can use KVM, without any hardware emulation, to utilize hardware virtualization support and do it more efficiently.
Neither way relies purely on user-space; they both use kernel features that are designed specifically for allowing one user-space process to virtualize another.
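To make the ptrace mode concrete, here's a bare-bones sketch of that interception loop -- Linux/amd64 only, and emphatically not gVisor's code. PTRACE_SYSEMU is written as a raw constant since the Go standard library doesn't wrap it, and a real tracer handles signals, threads, and much more:

    package main

    import (
        "fmt"
        "os/exec"
        "runtime"
        "syscall"
    )

    const ptraceSysemu = 31 // PTRACE_SYSEMU: stop at syscall entry, never run it

    func sysemu(pid int) error {
        // Equivalent to ptrace(PTRACE_SYSEMU, pid, 0, 0).
        _, _, errno := syscall.Syscall6(syscall.SYS_PTRACE, ptraceSysemu,
            uintptr(pid), 0, 0, 0, 0)
        if errno != 0 {
            return errno
        }
        return nil
    }

    func main() {
        runtime.LockOSThread() // all ptrace calls must come from one thread

        cmd := exec.Command("/bin/true")
        cmd.SysProcAttr = &syscall.SysProcAttr{Ptrace: true}
        if err := cmd.Start(); err != nil {
            panic(err)
        }
        pid := cmd.Process.Pid
        var ws syscall.WaitStatus
        syscall.Wait4(pid, &ws, 0, nil) // initial stop after exec

        // Toy policy: intercept the child's first few syscalls,
        // "emulate" each as returning -ENOSYS, then give up.
        var regs syscall.PtraceRegs
        for i := 0; i < 5; i++ {
            if err := sysemu(pid); err != nil {
                panic(err)
            }
            syscall.Wait4(pid, &ws, 0, nil)
            if !ws.Stopped() {
                break
            }
            syscall.PtraceGetRegs(pid, &regs)
            fmt.Printf("intercepted syscall %d\n", regs.Orig_rax)
            // The kernel never ran the syscall; a Sentry-like tracer
            // would emulate it here and write the result back.
            regs.Rax = uint64(-int64(syscall.ENOSYS))
            syscall.PtraceSetRegs(pid, &regs)
        }
        cmd.Process.Kill()
    }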
> (1) This is hardly a strong security model. Proper security cannot be guaranteed by simply hooking API calls in user-space alone.
The thing you're talking about is not a security model, it is a (reliable) mechanism that can be used in the implementation of security models.
> (2) With this framework in mind, developers now need to worry about yet another layer of indirection. Assume <application> was tested to work on Ubuntu; that fact alone is not sufficient to assume it will keep running under gVisor.
This is true of existing container technologies. An application running under Ubuntu on bare hardware will potentially not run in an Ubuntu Docker image. You'll need to test it extensively.
> (4) This is strongly coupled with internal Kernel implementations. It will not be easy to port and maintain this across different Kernels.
I don't understand this—gVisor is a userspace application and is not itself tied to kernel implementations the way a kernel module would be. The interface gVisor exposes is the Linux syscall ABI, which is the thing Linux tries very hard to hold stable. There are multiple production reimplementations of this ABI (Windows Subsystem for Linux, FreeBSD's Linuxulator, Solaris's branded zones). You'll need to add new features if you want them, of course, but holding at a specific emulated kernel version is totally fine.
> From user-space? hold my beer.
ptrace (with PTRACE_O_EXITKILL from kernel 3.8+) is designed to be reliable for this.
Also, if you don't trust it, just set everything to SECCOMP_RET_TRACE, which kills the process if there is no ptracer.
> "ptrace (with PTRACE_O_EXITKILL from kernel 3.8+) is designed to be reliable for this."
One of the key points I'd like to raise is the (unsurprising) substantial performance degradation caused by tracing every syscall/ioctl a process makes. I submit that tracing tens (or hundreds) of processes with ptrace/gVisor simply won't fly. Tracing the syscalls alone is expensive, let alone applying any other intricate mid-hook logic.
> "gVisor is a userspace application and is not itself tied to kernel implementations the way a kernel module would be"
I was referring to the never-ending chase that comes with having to keep tabs on any new/existent ioctls/syscalls. Ioctls are device/driver/hardware specific, which complicates things further.
> "This is true of existing container technologies. An application running under Ubuntu on bare hardware will potentially not run in an Ubuntu Docker image. You'll need to test it extensively."
There's a similarity, but solutions like VMware/VirtualBox/hypervisors are fighting to be as transparent as possible to the software running on them. That makes things easier on software developers - as we don't all have to spend our time testing those products.
It would appear that gVisor is fundamentally different. It intercepts and tampers with the various syscalls a process makes with the sole purpose of affecting the underlying application - ie. failing a syscall that would otherwise succeed.
Oh, thanks. (It's still safe, because the inability to execute system calls basically translates into an inability to do anything the process was not previously authorized to do via... mmapped memory, and I think that's it.)
Do you have some pointers where we can read more about weaknesses of ptrace syscall interception?
To me this seems like an improvement over having to worry about the full host syscall surface area.
Until nested hardware virtualization is broadly available, I cannot run things like clear containers on major cloud vendors, so I'm pretty excited to have a way to increase the isolation between containers ... well, unless you point me to something that shows that all this is moot.
> To me this seems like an improvement over having to worry about the full host syscall surface area.
Seccomp already permits this type of attack surface restriction, and Docker (with runc) already has a default seccomp whitelist. So by default you already get this.
> Do you have some pointers where we can read more about weaknesses of ptrace syscall interception?
The basic problem is that the policy runs in userspace and is thus more vulnerable than a kernel-side policy. It also has the downside that you don't get any of the contextual information the kernel has about a syscall if you just know the syscall being called (such as labels or namespaces or LSMs or access rights or what state the process is in or whether another process is doing something nasty or ...).
There's a reason that UML didn't overtake KVM in virtualization: it had a worrying security model, since the only thing stopping a process from seeing the host was another process on the host tricking it. Everyone I've talked to about UML has cited security as the main drawback.
> Seccomp already permits this type of attack surface restriction, and Docker (with runc) already has a default seccomp whitelist. So by default you already get this.
gVisor's doc addresses this with:
in practice it can be extremely difficult (if not impossible) to reliably define a policy for arbitrary, previously unknown applications, making this approach challenging to apply universally.
gVisor's Sentry process in fact uses seccomp to limit the syscalls it can make (and thus, in the worst case, what the guest process can achieve by tricking the Sentry). Furthermore, it uses an actual network filesystem protocol (good old 9P) to encode the rest of the file-oriented system calls so that they get executed by a separate process.
This arrangement shuffles the wide part of the kernel API surface into the per-container "proxy kernel", while requiring a very narrow (and controlled) API surface to the rest of the host.
This is pretty much the same kind of deal (although quantitatively and qualitatively different) that OS level virtualization employs: guest kernels have a very narrow API surface area to the underlying hypervisor (and thus with the rest of system).
> The basic problem is that the policy runs in userspace and is thus more vulnerable than a kernel-side policy.
Color me skeptical, but running things kernel-side doesn't strike me as necessarily less vulnerable or more trustworthy. The Linux kernel is quite a complicated beast with a very wide internal API surface area, and despite its age it is still moving forward at quite an interesting pace.
There is a significant amount of research in running kernels with significant portions in user space (see the whole L4 family), and IIRC the problem has always been more about performance and adoption rather than an inherent problem of user-space vs kernel-space.
> It also has the downside that you don't get any of the contextual information the kernel has about a syscall if you just know the syscall being called (such as labels or namespaces or LSMs or access rights or what state the process is in or whether another process is doing something nasty or ...).
which in this case seems perfectly reasonable since this is not a generic "transparent sandbox" solution that enhances the security of regular processes, but more of a "lightweight kernel" that runs processes.
For example, imagine you have a good single-process sandbox (e.g. NaCl or https://pdos.csail.mit.edu/~baford/vm/) that is able to fully offer all necessary services to the logical guest process and only requires a single TCP connection to perform all its input and output (through which you can e.g. run the 9p protocol and thus implement arbitrary I/O patterns with willing parties). It's easy to define a seccomp ruleset that will enforce that the sandbox host does only this.
gVisor is something "like that", except it's able to execute unmodified docker workloads.
> "I guess it rewrite syscalls that would otherwise fail for lack of privileges"
That's not how access control works (on most systems). The system calls are still issued from a single process. If the gVisor process is running in the context of a non-privileged user, the system calls will fail regardless of the codepath.
It rewrites the syscalls to make them succeed: reading a privileged file? Rewrite to read a non-privileged shadow file. Killing a privileged process? Return success without killing anything, etc.
Whatever is not rewritten will fail in kernel: there is no security risk.
Imagine some applications expect the syscall to actually fail (i.e. some odd way of testing permissions).
I fail to grasp how this is a strength. Pulling the rug from under applications is dangerous. You're in a direct battle with internal implementation specifics - you don't want to get into that as an abstraction layer. You don't want to tailor various hacks for specific applications.
Very different. It's also very different from Solaris Zones. The design of Linux containers is namespace-based while Jails and Zones are (basically) ID-based.
In addition, gVisor is basically a ptrace wrapper around your process that applies restrictions and other things on top of containers. I don't really understand what the benefit is over seccomp-bpf (which is slowly becoming as powerful as ptrace, but without the overhead and without the security flaw of your sandbox rules living entirely in userspace without any protections like seccomp). They call it a kernel, likely because it is based on the idea of UML (User-Mode Linux), but there's a reason that UML never took off as a virtualisation tool -- its entire security was predicated on using ptrace to trick processes spawned in the "guest OS" into not being able to see the host. In this configuration it looks like gVisor is using both namespaces and ptrace -- but then you have to worry about the massive overhead of ptrace (it affects every syscall and signal event involving the process, and requires four context switches plus signal delivery to the tracing process on top of the normal syscall costs).
gVisor appears to be working on a KVM shim, but I'm not quite sure how you can use KVM and still differentiate yourself from Kata Containers (the project that came from Clear Containers and HyperHQ). Seems like duplicated effort to me.
EDIT: I just re-read the article and it looks like they don't actually use containers at all. Unless I'm mistaken this means that they are not taking advantage of any of the sophisticated security primitives in the kernel that ordinary containers use, and thus have the same (bad) security model as UML.
> I don't really understand what the benefit is over seccomp-bpf
With seccomp-bpf you are filtering what syscalls can be made, but those syscalls still happen in the host kernel. If a kernel syscall has a vulnerability, it could allow exploitation of the host and other containers in a multi-tenancy environment. The gVisor kernel actually implements the syscalls; it doesn't just pass them on.
> gVisor appears to be working on a KVM shim, but I'm not quite sure how you can use KVM and still differentiate yourself from Kata Containers
Kata Containers virtualize hardware and run a regular Linux kernel on the virtual hardware. gVisor doesn't virtualize hardware, it is a kernel running in userspace, implementing syscalls.
> I just re-read the article and it looks like they don't actually use containers at all. Unless I'm mistaken this means that they are not taking advantage of any of the sophisticated security primitives in the kernel that ordinary containers use, and thus have the same (bad) security model as UML.
The gVisor kernel (Sentry) runs in an empty user namespace with seccomp filters applied.
> With seccomp-bpf you are filtering what syscalls can be made, but those syscalls still happen in the host kernel.
I'm not sure what the distinction you're making is. By the same token, because PTRACE_SYSCALL/PTRACE_SYSEMU only signals the calling process after the syscall boundary has been crossed, then ptrace also doesn't help with the problem you are describing (though I also don't really agree that it's a problem in the first place -- user-kernel context switches are not security vulnerabilities). In fact, the seccomp restrictions are applied immediately after PTRACE_SYSCALL/PTRACE_SYSEMU -- there is only a few lines of code separating the two cases in the syscall entry path[1]. And in the case of PTRACE_SYSEMU, seccomp rules are still executed even though the syscall is never going to be executed.
> gVisor doesn't virtualize hardware, it is a kernel running in userspace, implementing syscalls.
I understand that, but that's the ptrace helper (which I spent the rest of my comment talking about). In the README (which I did read before commenting) they mention an experimental KVM driver, which is what my VM comments were referring to.
Is there an explanation somewhere about how gVisor uses KVM to virtualize syscalls? I'm not sure I understand how you could use KVM to do that. That's why I mentioned Kata, because it's the only point of reference I have for using KVM to "emulate syscalls" (though of course it emulates more than that).
> The gVisor kernel (Sentry) runs in an empty user namespace with seccomp filters applied.
Good to know, but I didn't see that mentioned anywhere? If you're affiliated with the project it'd be great if you could add it somewhere in the README or the blog post.
Basically the Sentry binary works as both the VMM and the guest kernel. It uses the KVM API to set up an address space and installs fault handlers through which it regains control when the guest payload faults (on memory accesses or soft interrupts/sysenter).
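For reference, the VMM half of that starts with the plain /dev/kvm ioctl handshake. A minimal sketch in Go -- the KVM ioctl numbers are written out by hand since x/sys/unix doesn't define them, and this is obviously not the Sentry's actual machinery:

    package main

    import (
        "fmt"

        "golang.org/x/sys/unix"
    )

    // KVM ioctl request numbers (_IO(0xAE, n)), defined by hand.
    const (
        kvmGetAPIVersion = 0xAE00
        kvmCreateVM      = 0xAE01
        kvmCreateVCPU    = 0xAE41
    )

    func main() {
        kvm, err := unix.Open("/dev/kvm", unix.O_RDWR|unix.O_CLOEXEC, 0)
        if err != nil {
            panic(err)
        }
        if v, _ := unix.IoctlRetInt(kvm, kvmGetAPIVersion); v != 12 {
            panic("unexpected KVM API version")
        }
        // One VM per sandbox; the Sentry then maps its own address space
        // into the guest and creates vCPUs for application threads.
        vm, err := unix.IoctlRetInt(kvm, kvmCreateVM)
        if err != nil {
            panic(err)
        }
        vcpu, err := unix.IoctlRetInt(vm, kvmCreateVCPU)
        if err != nil {
            panic(err)
        }
        fmt.Println("vm fd:", vm, "vcpu fd:", vcpu)
        // From here a real VMM would mmap the vcpu's kvm_run region, set
        // up memory slots and page tables, and loop on KVM_RUN, handling
        // exits (the "fault handlers" mentioned above).
    }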
What I couldn't find on a quick skim is how much of that logic works on which "side" of the wall, i.e. how much logic the Sentry can evaluate without crossing the VM boundary.
I can imagine this can depend quite a lot on Go runtime internals. How do you setup the environment inside the "guest" so that it can run the Go code?
jail (and its Linux equivalent, namespaces) is a kernel feature. This is a userspace application that emulates all system calls. "Container" is only used here in the sense of API/functionality, not implementation.
Basically it is like running an application inside qemu, but lighter-weight.
[0] https://github.com/google/gvisor/blob/master/tools/go_generi...