> "Since gVisor is itself a user-space application, it will make some host syste...

lambda · on May 2, 2018

> From user-space? hold my beer.

You can follow that link to see how it does it.

It has two modes of operation; one in which it uses ptrace with PTRACE_SYSEMU, which was implemented so that User Mode Linux could intercept all syscalls. This works in all environments, whether or not hardware virtualization is available (included VMs which don't enable nested virtualization).

The other is that it can use KVM, without any hardware emulation, to utilize hardware virtualization support and do it more efficiently.

Neither way relies purely on user-space; they both use kernel features that are designed specifically for allowing one user-space process to virtualize another.

geofft · on May 2, 2018

> (1) This is hardly a strong security model. Proper security cannot be guaranteed by simply hooking API calls in user-space alone.

The thing you're talking about is not a security model, it is a (reliable) mechanism that can be used in the implementation of security models.

> (2) With this framework in mind, developers now need to worry about yet another layer of indirection. Assume <application> was tested to work on Ubuntu, that fact alone is not sufficient to assume it will keep running under gVisor.

This is true of existing container technologies. An application running under Ubuntu on bare hardware will potentially not run in an Ubuntu Docker image. You'll need to test it extensively.

> (4) This is strongly coupled with internal Kernel implementations. It will not be easy to port and maintain this across different Kernels.

I don't understand this—gVisor is a userspace application and is not itself tied to kernel implementations the way a kernel module would be. The interface gVisor exposes is the Linux syscall ABI, which is the thing Linux tries very hard to hold stable. There are multiple production reimplementations of this ABI (Windows Subsystem for Linux, FreeBSD's Linuxulator, Solaris's branded zones). You'll need to add new features if you want them, of course, but holding at a specific emulated kernel version is totally fine.

> From user-space? hold my beer.

ptrace (with PTRACE_O_EXITKILL from kernel 3.8+) is designed to be reliable for this.

Also, if you don't trust it, just set everything to SECCOMP_RET_TRACE, which kills the process if there is no ptracer.

xtrapolate · on May 2, 2018

> "ptrace (with PTRACE_O_EXITKILL from kernel 3.8+) is designed to be reliable for this."

One of the key points I'd like to raise is the (unsurprisingly) substantial performance degradation caused by tracing every syscall/ioctl a process makes. I submit tracing tens (or hundreds) of processes with ptrace/gVisor simply won't fly. Tracing the syscalls alone is expensive, let alone applying any other intricate mid-hook logic.

http://www.linux-kongress.org/2009/slides/system_call_tracin...

> "gVisor is a userspace application and is not itself tied to kernel implementations the way a kernel module would be"

I was referring to the never-ending chase that comes with having to keep tabs on any new/existent ioctls/syscalls. Ioctls are device/driver/hardware specific, which complicates things further.

> "This is true of existing container technologies. An application running under Ubuntu on bare hardware will potentially not run in an Ubuntu Docker image. You'll need to test it extensively."

There's a similarity, but solutions like VMWare/VirtualBox/hypervisors are fighting to be as transparent as possible to the underlying software. That makes things easier on software developers - as we don't all have to spend our time testing those products.

It would appear that gVisor is fundamentally different. It intercepts and tampers with the various syscalls a process makes with the sole purpose of affecting the underlying application - ie. failing a syscall that would otherwise succeed.

jagger11 · on May 2, 2018

Also, if you don't trust it, just set everything to SECCOMP_RET_TRACE, which kills the process if there is no ptracer

A small correction, it causes for the syscall not to be executed, and return with errno==ENOSYS

geofft · on May 2, 2018

Oh, thanks. (It's still safe, because the inability to execute system calls basically translates into an inability to do anything the process was not previously authorized to do via... mmapped memory, and I think that's it.)

ithkuil · on May 2, 2018

Do you have some pointers where we can read more about weaknesses of ptrace syscall interception?

To me this seems like an improvement over having to worry about the full host syscall surface area.

Until nested hardware virtualization is broadly available, I cannot run things like clear containers on major cloud vendors, so I'm pretty excited to have a way to increase the isolation between containers ... well, unless you point me to something that shows that all this is moot.

cyphar · on May 2, 2018

> To me this seems like an improvement over having to worry about the full host syscall surface area.

Seccomp already permits this type of attack surface restriction, and Docker (with runc) already has a default seccomp whitelist. So by default you already get this.

> Do you have some pointers where we can read more about weaknesses of ptrace syscall interception?

The basic problem is that the policy runs in userspace and is thus more vulnerable than a kernel-side policy. It also has the downside that you don't get any of the contextual information the kernel has about a syscall if you just know the syscall being called (such as labels or namespaces or LSMs or access rights or what state the process is in or whether another process is doing something nasty or ...).

There's a reason that UML didn't overtake KVM in virtualization. Because it had a worrying security model, since the only thing stopping a process from seeing the host was another process on the host tricking it. Everyone I've talked to about UML has cited security as the main drawback.

ithkuil · on May 2, 2018

> Seccomp already permits this type of attack surface restriction, and Docker (with runc) already has a default seccomp whitelist. So by default you already get this.

gVisor's doc addresses this with:

    in practice it can be extremely difficult (if not impossible) to reliably define a policy for arbitrary, previously unknown applications, making this approach challenging to apply universally.

gVisor's Sentry process in fact uses seccomp to limit the syscalls it can make (and thus in worst case the guest process by tricking Sentry). Furthermore it uses an actual networking filesystem protocol (good old 9p) to encode the rest of the file-oriented system calls so that they get executed by a separate process.

This arrangement shuffles the wide part of the kernel API surface into the per-container "proxy kernel", while requiring a very narrow (and controlled) API surface to the rest of the host.

This is pretty much the same kind of deal (although quantitatively and qualitatively different) that OS level virtualization employs: guest kernels have a very narrow API surface area to the underlying hypervisor (and thus with the rest of system).

> The basic problem is that the policy runs in userspace and is thus more vulnerable than a kernel-side policy.

Color me skeptical, but running things in the kernel-side don't strike me as necessarily less vulnerable or more trustworthy. The linux kernel is quite a complicated beast with a very wide internal API surface area and despite the age still moving forward at a quite interesting pace.

There is a significant amount of research in running kernels with significant portions in user space (see the whole L4 family), and IIRC the problem has always been more about performance and adoption rather than an inherent problem of user-space vs kernel-space.

> It also has the downside that you don't get any of the contextual information the kernel has about a syscall if you just know the syscall being called (such as labels or namespaces or LSMs or access rights or what state the process is in or whether another process is doing something nasty or ...).

which in this case seems perfectly reasonable since this is not a generic "transparent sandbox" solution that enhances the security of regular processes, but more of a "lightweight kernel" that runs processes.

For example imagine you have a good single process sandbox (e.g. NaCl or https://pdos.csail.mit.edu/~baford/vm/) that is able to fully offer all necessary services to the logical guest process and only require a single TCP connection to perform all it's input and output (trough which you can e.g. run a 9p protocol and thus implement arbitrary I/O patterns with willing parties). It's easy to define a seccomp ruleset that will enforce that the sandbox host does only this.

gVisor is something "like that", except it's able to execute unmodified docker workloads.

cat199 · on May 2, 2018

Not the same approach, but user space items can be prone to race conditions, esp. under load - see also:

https://en.wikipedia.org/wiki/Sysjail

dullgiulio · on May 2, 2018

As for the safety, I guess it rewrite syscalls that would otherwise fail for lack of privileges.

In a way, they have made an advanced version of the old fakeroot (a LD_PRELOAD magic thing--thus very limited in scope.)

xtrapolate · on May 2, 2018

> "I guess it rewrite syscalls that would otherwise fail for lack of privileges"

That's not how access control works (on most systems). The system calls are still issued from a single process. If the gVisor process is running in the context of a non-privileged user, the system calls will fail regardless of the codepath.

dullgiulio · on May 2, 2018

It rewrites the syscalls to make them succeed: reading a privileged file? Rewrite to read a non-privileged shadow file. Killing a privileged process? Return success without killing anything, etc.

Whatever is not rewritten will fail in kernel: there is no security risk.

xtrapolate · on May 2, 2018

> "It rewrites the syscalls to make them succeed"

Imagine some applications expect the syscall to actually fail (ie. some odd way to test permissions).

I fail to grasp how this is a strength. Pulling the rug from under applications is dangerous. You're in a direct battle with internal implementation specifics - you don't want to get into that as an abstraction layer. You don't want to tailor various hacks for specific applications.