They may have decided that ZT's encryption isn't proven well enough for their needs. It's also possible they rejected ZT because they didn't want to use ZT's centralized infrastructure. Two years ago was long before ZT started working on making that optional.
ZT allows you to run your own "Moons", meaning you don't need their infrastructure... a bit more config required on the client end, but less reliance on ZeroTier...
Moons are going to be deprecated soon and, as far as I understand, they never actually worked the way people wanted. I.e. they still needed ZT's root servers, even if you were running your own controller.
By contrast, with Nebula you run your own root(s) (lighthouses), and you don't need a controller because the important config (IP, group, hostname) is signed by the same CA.
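Very roughly, the idea looks something like this. This is only a conceptual sketch using Go's standard ed25519 package, not Nebula's actual certificate format; the `HostIdentity` struct and its field names are made up for illustration:

```go
// Conceptual sketch only (not Nebula's real certificate format): the CA key
// signs a host's identity (name, overlay IP, groups), and any node that holds
// the CA public key can verify that identity offline, with no controller.
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"encoding/json"
	"fmt"
)

// HostIdentity is a hypothetical stand-in for the fields a certificate
// would bind together.
type HostIdentity struct {
	Name   string   `json:"name"`
	IP     string   `json:"ip"`     // overlay address, e.g. 10.10.0.5/16
	Groups []string `json:"groups"` // used for group-based filtering rules
}

func main() {
	// The CA key pair lives with whoever issues certificates.
	caPub, caPriv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}

	host := HostIdentity{Name: "web-1", IP: "10.10.0.5/16", Groups: []string{"webservers"}}
	blob, _ := json.Marshal(host)
	sig := ed25519.Sign(caPriv, blob)

	// Any peer that trusts caPub can check the identity locally.
	fmt.Println("identity verifies:", ed25519.Verify(caPub, blob, sig))
}
```

Because verification only needs the CA public key, the lighthouses don't have to be trusted controllers; they just help peers find each other.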
The moon terminology will also go away, since there will no longer be a difference between these and our core roots. They'll all just be roots and will be interchangeable. The use of a common underlying key/value store will allow ZeroTier to keep its unified namespace and its ability to easily join anyone's network or communicate with anyone, regardless of which roots they're using (as long as their roots are on the same global network as yours... obviously you can't hop air gaps).
We are about to make that much easier and also allow true infrastructure federation -- you won't have to have your nodes contact our roots directly. We came up with something very interesting.
As for crypto: we plan some improvements in 2.0, but note that these days easily >90% of the traffic over ZeroTier networks (or any other VPN / overlay) tends to be already encrypted via SSH, SSL, etc. Another layer of encryption in the overlay just provides some additional defense in depth. We're rapidly moving to a world where everything layer 3 and above encrypts everything. That's why we have not prioritized sexier crypto for our L2 overlay tech. It's a bit redundant.
This is really interesting news!! Ryan was at GopherCon 2018 and was talking casually about his pet '20%' project with some of us. Happy to see it finally released as open source. Great work, Ryan! His off-the-record remarks really made me change my mind about the Slack engineering team in general. Before that, I was always cursing them about their Electron client.
I feel like I don't totally follow how you would set this up for, say, a company that has infra in two cloud providers (but no office network or datacenter or anything)... I think the answer is you set up one or more lighthouses with stable IPs on the public internet, and you make sure all your ephemeral cloud machines have IPs on the public internet? And all your ephemeral cloud machines get RFC-1918 addresses that are effectively in a giant flat subnet with no broadcast / no L2 domain and no implied structure?
It feels a little different from WireGuard, in that with WireGuard your engineers would be able to connect from behind a NAT, but my reading of how it works is that machines route directly to each other. Which is good for a production network where you care deeply about routing (bandwidth, latency, costs, debugging, etc.), but it seems that here your engineers would still need to connect to a bastion host or something, i.e., it isn't a VPN in the sense of being able to join the corporate network directly.
I guess if you've also got the lighthouse node internally routable by all your machines (e.g. you have an internal datacenter network and something like AWS Direct Connect) it would work too?
I think the answer is that your lighthouse(s) are the only machines that need publicly routable IPs. Your ephemeral cloud machines get any RFC-1918 address you want, with any subnet you want.
Engineers would have Nebula set up on their laptops with a configuration that knows about your lighthouse(s)' static IP(s). They use the lighthouses for meeting other nodes, UDP hole punching, etc., but otherwise every connection is peer to peer.
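To make that concrete, here's a rough sketch of what a node does, in Go, with a made-up wire format. It isn't Nebula's actual protocol, and the addresses are examples; it's only meant to show the lighthouse's role as a rendezvous point:

```go
// Rough sketch of the rendezvous idea from a node's point of view (made-up
// wire format, not Nebula's real protocol). The node tells the lighthouse,
// which has a static public IP, where it is, asks where a peer is, and then
// sends UDP directly to that peer's public endpoint, which is also what
// punches the hole in its own NAT.
package main

import (
	"fmt"
	"log"
	"net"
)

func main() {
	lighthouse, _ := net.ResolveUDPAddr("udp", "198.51.100.10:4242") // example static public IP
	conn, err := net.ListenUDP("udp", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// 1. Register: the lighthouse records the source IP:port it sees, which
	//    is this node's public (NAT-translated) endpoint.
	conn.WriteToUDP([]byte("REGISTER 10.10.0.5"), lighthouse)

	// 2. Query: ask where overlay address 10.10.0.9 currently lives.
	conn.WriteToUDP([]byte("WHERE 10.10.0.9"), lighthouse)
	buf := make([]byte, 1500)
	n, _, err := conn.ReadFromUDP(buf)
	if err != nil {
		log.Fatal(err)
	}
	peer, _ := net.ResolveUDPAddr("udp", string(buf[:n])) // e.g. "203.0.113.7:4242"

	// 3. Send directly to the peer. If both sides do this at roughly the
	//    same time, most NATs will start letting the traffic through.
	conn.WriteToUDP([]byte("HELLO from 10.10.0.5"), peer)
	fmt.Println("sent direct packet to", peer)
}
```

After that, traffic flows peer to peer; the lighthouse is only involved in discovery.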
NAT traversal sounds like a thing I very much don't want to deal with for a production network, instinctively. It's fine for video games with friends but I've seen enough stuff go wrong with even normal networks that I wouldn't want to trust it. If this is what Slack is actually doing, I'd be very curious to hear how it's working out for them and how they debug network outages.
(Which is why I suspect it's not and the readme isn't clear)
Yeah, after doing a few years of VoIP I pretty much learned the same. Yes, there are multiple methods of traversal, yes they are sound in theory. Yet stuff breaks all the time on consumer routers.
Network virtualization seems to have been extremely slow to be adopted. Even companies pushing "cloud first" seem to be running their physical networks like it's 2003.
When your Kubernetes falls over, you can ssh to it and run commands like it's 2003 (or 1993). When your network falls over, not so much. We deprecated our network virtualization at $work and have been way happier for it.
Also, the end-to-end principle argues for putting complicated logic in the endpoints and making the network boring. See also: TCP is implemented at the endpoints and just requires network infrastructure to drop packets sometimes. You could imagine a congestion control protocol implemented on each router on the Internet, but it would be much more fragile and also much harder to deploy changes to.
Part of the reason for the end-to-end argument is to enable more clever (or, at least, more purpose-designed) functionality to ride on top of the dumb network. So e2e would suggest (to my reading at least) that you keep the "real" IP layer dumb and flexible, and do the fun stuff in overlays, which is what this is.
It's also mostly about two endpoints rather than N. The later, follow-on Blumenthal & Clark paper is essentially a long list of 'it's complicated's from end-to-end-principle analysis.
Consul Connect is under MPLv2, which is a perfectly reasonable license unless you want to do shady things. There may be other differentiators, but this is not one.
I don't think the BSL is perfect. We're thinking about, and discussing with a number of people, potentially better licenses that would be closer to traditional FOSS while still preventing "SaaSification" and the like. I think we're in the early stages of a renegotiation of the open source social contract and I don't think we've figured out the best model yet.
The AGPL is close but suffers from two problems: (1) it isn't perfect either and has numerous loopholes, and (2) there are a ton of companies out there with an irrational but nevertheless very entrenched phobia of anything associated with the GPL (as we have discovered). Maybe something a bit like the AGPL but not GPL branded would work.
I'm not judging, I'm simply relating a fact: Nebula is MIT licensed, and ZeroTier is BSL'd; a paid license is required to use ZeroTier in a closed-source application.
Yes, that's intentional. It used to be GPL which imposed the same requirement, but we shifted to BSL because it's a bit more explicit and because of (again, irrational) GPL-phobia on the part of some non-trivial subset of corporate users.
BTW the closed-source restriction in the BSL is effectively the same as the GPL's, and the only other meaningful restriction is on direct SaaS monetization. Companies can still run ZT for free, including behind the scenes. It's a lot like the AGPL.
The guys on Linux Unplugged interviewed the developer in their latest episode: https://linuxunplugged.com/329
Starts at about 28:20. He explains more of the why and how.
WireGuard is a VPN, and Nebula is an overlay network (also known as a service mesh). They are closely related concepts.
VPNs are primarily used for remote access, to get random machines access to closed IP networks. Service meshes synthesize a new network (sometimes IP, sometimes something else) to connect a bunch of related machines, almost always with policy controls for who can talk to what, usually cryptographic.
It would be weird (but not "wrong") to use a service mesh to get developer laptops access to staging Postgres.
It would be weird (but not "wrong") to use WireGuard to connect an application server to its Postgres instance.
WireGuard is a much tighter and more limited design, intended for integration directly into operating system kernels, with a strong emphasis on performance. Nebula is a much more ambitious design; it includes direct DNS support, certificates, and server infrastructure. WireGuard is a few thousand lines of very carefully written C code; Nebula is a typical Go project.
Why do you think it is "weird" to use WireGuard to connect an application server to a DB instance?
(Backdrop: I have recently moved our various prod servers into a WireGuard-based VPN to encrypt the traffic between them. I found it easier/more pragmatic to do this than:
* to set up SSL for my DB
* to figure out how to encrypt traffic between my application server and Redis or my application server and Nginx
)
I like WireGuard and wouldn't blink at a client proposing to use it to create a secure network fabric for their deployment environment, but it is not the norm for people to do stuff like this; in K8s land, this is what service meshes like Istio do, and more generally this is what people use overlay networks for. WireGuard could form the basis of an overlay network, if you added the same bells and whistles Nebula has. But I don't think Jason plans to add those bells and whistles himself, because that's not really WireGuard's charter.
Like WireGuard, Nebula uses the Noise Protocol Framework[1], but it seems that Nebula uses a CA (certificate authority) to tie together the peers in the same Nebula network[2].
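For anyone curious what using Noise from Go looks like, here's a minimal two-party handshake sketch with the github.com/flynn/noise package (a Go implementation of the framework). This is illustrative only and not Nebula's actual handshake code; Nebula's handshake additionally carries the CA-signed certificates so each side can verify the other's identity:

```go
// Minimal Noise IX handshake sketch with github.com/flynn/noise.
// Illustrative only: Nebula's real handshake also exchanges CA-signed certs.
package main

import (
	"crypto/rand"
	"fmt"

	"github.com/flynn/noise"
)

func main() {
	cs := noise.NewCipherSuite(noise.DH25519, noise.CipherAESGCM, noise.HashSHA256)

	// Each peer has a long-lived static key pair (Nebula ties these to certs).
	initKey, _ := cs.GenerateKeypair(rand.Reader)
	respKey, _ := cs.GenerateKeypair(rand.Reader)

	initiator, _ := noise.NewHandshakeState(noise.Config{
		CipherSuite:   cs,
		Pattern:       noise.HandshakeIX,
		Initiator:     true,
		StaticKeypair: initKey,
	})
	responder, _ := noise.NewHandshakeState(noise.Config{
		CipherSuite:   cs,
		Pattern:       noise.HandshakeIX,
		StaticKeypair: respKey,
	})

	// -> e, s : initiator's first message (the payload could carry its cert).
	msg1, _, _, _ := initiator.WriteMessage(nil, []byte("initiator hello"))
	payload1, _, _, _ := responder.ReadMessage(nil, msg1)

	// <- e, ee, se, s, es : the responder's reply completes the handshake.
	// Both sides now hold a pair of CipherStates, one per traffic direction
	// (error handling and direction bookkeeping omitted for brevity).
	msg2, rcs1, rcs2, _ := responder.WriteMessage(nil, []byte("responder hello"))
	payload2, ics1, ics2, _ := initiator.ReadMessage(nil, msg2)

	fmt.Println(string(payload1), "/", string(payload2))
	_, _, _, _ = rcs1, rcs2, ics1, ics2
}
```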
I get a certificate error for `www.noiseprotocol.org`. It turns out they're serving a certificate for `noiseprotocol.org` instead. The URL is still valid without `www.` [0].
It sounds like encryption was a necessary but not a sufficient requirement for Nebula.
In addition to the VPN itself, Nebula added traffic filtering and the ability to span different clouds and data centers. I don't think WireGuard had those as goals.
They serve very different purposes. I use WireGuard to encrypt my mobile traffic but I wouldn't have picked it to connect the various hosts in my network at work. Nebula, however, might do the trick.
It feels like their issues would have been solved by a service mesh using e.g. Consul or Istio. If so, I'd wonder whether writing a tool from scratch was the right use of engineering time. Anyway, as an engineer, I'd certainly have found this a fun project. Kudos to Slack for trying something new and open sourcing it.
Not entirely, as it only really allows stuff that's running in that service mesh's world to connect to the network.
But they want a global VPN for _everything_, including laptops. This means some level of access control.
What I like here is the use of lighthouses to allow external nodes to punch in and discover the rest of the network, something that is very difficult to do if you are relying on a service mesh in an unknown and unconnectable network.
The thing that immediately stands out is the routing. It looks like cjdns is a traditional-ish multi-hop network. The DHT routing table allows you to map out a route to peer A via peers B, R, & D.
What WireGuard and Nebula allow is for the underlying network to figure out most of the routing, effectively creating a massive point-to-point network. Whilst you can have concentrators/gateways, the idea is that most of the traffic goes directly from peer to peer. This can reduce load considerably.
I think cjdns allows arbitrary peering, so you can certainly set up a full mesh if you want point-to-point traffic, with multiple hops only for cases where the underlying network topology requires it.
Can nodes communicate with each other directly even if they're behind NAT, without port mappings or UPnP?
I know there are ways to make this happen (e.g. using the techniques from Samy Kamkar's pwnat/chownat), but am not sure whether Nebula is designed to work within this constraint.
Having two nodes communicate with each other when you have a cooperating third-party server (a lighthouse or discovery node) that isn't behind a NAT isn't hard. That's what STUN servers and other forms of UDP hole punching accomplish.
pwnat is notable because it doesn't require a public STUN-like server, but Nebula already assumes there are public servers, so traversing NAT is a non-issue.
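The rendezvous-server side is conceptually tiny. Here's a sketch in Go, again with a made-up wire format rather than Nebula's actual lighthouse protocol, just to show why this part is the easy bit:

```go
// Sketch of the public rendezvous ("lighthouse"/STUN-like) side, with a
// made-up wire format rather than Nebula's real protocol: remember the public
// IP:port each node's packets arrive from, and hand that endpoint to peers
// that ask for it.
package main

import (
	"log"
	"net"
	"strings"
)

func main() {
	addr, _ := net.ResolveUDPAddr("udp", ":4242")
	conn, err := net.ListenUDP("udp", addr)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// overlay address -> last observed public endpoint
	endpoints := map[string]*net.UDPAddr{}

	buf := make([]byte, 1500)
	for {
		n, from, err := conn.ReadFromUDP(buf)
		if err != nil {
			continue
		}
		fields := strings.Fields(string(buf[:n]))
		if len(fields) != 2 {
			continue
		}
		switch fields[0] {
		case "REGISTER": // "REGISTER <overlay-ip>"; from is the node's public endpoint
			endpoints[fields[1]] = from
		case "WHERE": // "WHERE <overlay-ip>"; reply with the peer's public endpoint
			if ep, ok := endpoints[fields[1]]; ok {
				conn.WriteToUDP([]byte(ep.String()), from)
			}
		}
	}
}
```

The hard, flaky part is everything after this: whether the two NATs involved actually let the direct packets through, which is exactly the consumer-router pain described upthread.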
The readme says "Discovery nodes allow individual peers to find each other and optionally use UDP hole punching to establish connections from behind most firewalls or NATs".
In practice, I didn't see any code that implements it, but I didn't look too hard.
Sounds like a service mesh. How is this any different from Istio/Linkerd? This library may be useful, but the stated problem it seeks to solve is hardly a unique one.
To me it reads more like Nebula is a VPN solution, with end-to-end encryption and security groups baked in.
To my understanding, a service mesh does not establish a common VPN-like network, but assumes it's there already. Nebula and service meshes both provide authentication, end-to-end encryption and role-based access control. A service mesh can do more than Nebula: beyond "security group"-like filtering, it can, for example, shift traffic between services.
However, I might be mistaken. Any corrections are more than welcome.
That being said, the code is full of TODOs and other comments indicating that shortcuts were taken which should be fixed later. I would be worried about running such a thing in prod given the criticality of its function. At best you risk performance issues under load, and at worst you could have significant security issues allowing unintended traffic in or out.
I wonder if they tried ZeroTier. It really sounds like what they wanted.