Application data caching using SSDs (netflix.com)
107 points by sgmansfield on May 25, 2016 | 28 comments


> The decision to use Go was deliberate, because we needed something that had lower latency than Java (where garbage collection pauses are an issue) and is more productive for developers than C, while also handling tens of thousands of client connections. Go fits this space well.

This is interesting. There haven't been many instances yet where Go's memory/GC performance is compared directly and favorably to Java's performance.


In this case I don't think it's the raw performance they are looking for but rather freedom from the unpredictable GC pauses. Although I think they should have explained why they didn't use Azul Zing for this, I realize it's expensive but this seems like the exact use case for it.

Found some interesting discussion here about this: https://news.ycombinator.com/item?id=10149961


Author here (of both Rend and the post).

We really are looking for raw performance, but not at any cost. I spent a bunch of time, and continue to do so, to eliminate extra work, indirection, and garbage.

Efficiency of runtime and development is a careful balance that Go fits well. Rust is immature. C is difficult for people who haven't been writing it professionally. Java (less Zing or other special cases) has too much latency, even if it's very fast when it's running because of HotSpot optimizations. The list goes on, but I didn't see a reason to choose the others over Go, nor have I found something yet that beats it for our use case.

Really, Java just wasn't the language for the job. Go has a lower level and simpler interface, simpler runtime (with less indirection), and better model for concurrency. Yes, it's a new language for us, but it was the right tool for the job. Our garbage collection pauses are mostly under a millisecond, so they aren't really an issue.

Zing never really was on my radar. Looking at it now I shudder to think of the licensing costs, especially since we run tens of thousands of cache instances. We already pay enough just to run these servers. It's also a new Java runtime which very few people know about, and probably nobody at Netflix has used. I would have had to learn it all independently and continue to be on my own from then on. The Go Slack team and community in general are very helpful and supportive, and I would speculate that the same level of support isn't available for Zing without paying.


Cool, thanks for the insight! I love reading the thinking process behind design decisions.

On the subject of Azul's price: I think they would give you a high volume discount, but even half the price (1750) per server would be pretty outrageous for your use case.

For my use case I'm going with Apache Ignite right now for tiered off heap caching, but I'm not sure it would work for you guys. Might be worth a look though.

I admire Netflix OSS ecosystem. Thanks for the great work you guys do!


Just curious, is that 3500 per server per year? Or 3500 for the lifetime of the server?


Here's how they word it on the informational page for Zing:

“Zing® is priced on a subscription basis per server (physical or virtual). With per-server pricing, you don’t need to worry about core counts, memory size, or number of instances deployed per server. The single license annual subscription price for Zing is $3500 USD per physical server, with significantly lower prices for higher volumes and longer-term subscriptions. Please contact us to learn about the special pricing available for start-ups and companies with $25 million or less in annual revenue, and for ISVs and manufacturers looking to embed/integrate Zing with their products.”

Seen here: https://www.azul.com/products/zing/
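
To put the quoted list price next to the fleet size the author mentions, here is a rough back-of-the-envelope sketch. The fleet size of 10,000 is an assumption (the low end of "tens of thousands"), as is treating each cache instance as one licensed server. The real volume pricing is not public in this thread, so the second figure only reflects the hypothetical half-price case raised earlier.

```go
package main

import "fmt"

// annualLicenseCost returns the yearly subscription cost in USD for a fleet,
// assuming a flat per-server price with no further discount.
func annualLicenseCost(pricePerServerUSD, servers int) int {
	return pricePerServerUSD * servers
}

func main() {
	const listPrice = 3500 // USD per server per year, from the quoted Zing page
	const halfPrice = 1750 // hypothetical 50% volume discount
	const servers = 10000  // assumption: low end of "tens of thousands"

	fmt.Println(annualLicenseCost(listPrice, servers)) // 35000000
	fmt.Println(annualLicenseCost(halfPrice, servers)) // 17500000
}
```

Even under the discounted assumption that is $17.5 million per year, which is the scale behind the "outrageous for your use case" remark.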


> Although I think they should have explained why they didn't use Azul Zing for this, I realize it's expensive but this seems like the exact use case for it.

Azul Zing is extremely expensive for commodity hardware; that's a massive price-tag for just the lower latency constraint. Their pricing model is a much better fit for vertically-scaled services, though they even argue that it can lead to reduced costs on horizontally scaled services, presumably on the assumption that you can use fewer hardware instances at peak to handle latency guarantees:

> With Zing, Azul has created the most scalable Java Virtual Machine (JVM) for enterprise workloads and made it available on cost-efficient commodity hardware. With Zing, enterprises can now dramatically simplify Java deployments by using fewer instances while achieving greater response time consistency under load and dramatically lowering operating costs.

- https://www.azul.com/products/zing/zinqfaq/#importance-produ...

Somehow, I don't see a cache solution working well with this pricing model. The underlying operations are presumably simple enough you can get the performance you want with thin, simple software deployed on many commodity machines.


Good points, I agree. It seems they're probably going for finance with that pricing model.


Zing needs heavily over-provisioned machines, like 64+ GB of RAM, to make those GC guarantees.


Can you provide a source? I haven't used Zing (only read about it) but my understanding is that it should work for low memory and high memory JVMs.


I found a PDF link through a Google search some time back. That document mentions it. You should try searching.


I ask because the PDF you're probably referring to seems to imply the opposite. I only skimmed it, but here's what I found:

"An entire Java heap (large or small) can be marked consistently without falling behind and resorting to long garbage collection pauses because the C4 employs: • Single pass completion to avoid the possibility of applications changing references faster than the mark phase can revisit them • “Self-healing” traps that ensure a finite workload and avoid unnecessary overhead • Parallel marking threads to provide scalability that can support large applications and large dataset sizes"

Maybe I'm looking at the wrong PDF. Can you provide a source for your claim? It's something useful to know if it's true.


I am hesitant to post a direct link, as the document says 'Confidential and Proprietary'. But here is something in that doc about the memory requirement:

`The Zing components require the following amount of RAM on the ZVM machine.

• 64+ GB RAM recommended

16-32 GB RAM minimum, depending on the application memory requirements, for demonstration purposes only. Minimum is not sufficient for performance or scalability production testing of large or memory consuming enterprise applications. Increase the memory as needed to accommodate the number and size of your Java application/Zing Virtual Machine (ZVM) instances.

The Pool license server requires the following amount of RAM on the license server machine.

• 8 GB RAM recommended (4 GB minimum)`

The document runs 300+ pages, with all the details about tuning and optimizations, which makes me think it is not simple plug-and-play once you have paid for the license.


Ah I see the document now, thanks. Based on this it seems like Zing is only a good solution for vertically scaling software.

Now it makes sense why it's priced at 3500 per machine.


I think the bigger thing is the deliberate design of Go for parallel processing. I feel like the Go GC is still an issue, although it's getting better every day, but the benefits with regard to "handling tens of thousands of client connections" potentially make its GC easier to swallow than Java's.

But that's just my outside-looking-in take on it.


If you check this link: http://benchmarksgame.alioth.debian.org/u64q/go.html

Java uses about 2-10 times more memory than Go most of the time. So I'd think Java's GC has to be at least 10 times better or more sophisticated than Go's, though I doubt it really is 10 times better.





As a Python programmer, I am intimately familiar with the distinction between concurrency (Coroutines) and Parallelism...and additionally the difference between multi-threaded and multi-processing within Parallelism.

It was my understanding that Go's focus on concurrency also makes parallelism easier to approach than in most languages:

https://golang.org/doc/effective_go.html#parallel

Although there are still additional challenges with parallel vs. simple concurrent programming.
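
As a rough illustration of the point about concurrency making parallelism approachable, here is a minimal sketch in the spirit of the linked Effective Go section. It is not code from Rend. It splits a slice across goroutines, one chunk per worker, and combines the partial results. The `parallelSum` name and the workload are invented for the example.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// parallelSum splits nums into at most `workers` contiguous chunks, sums each
// chunk in its own goroutine, and then adds up the partial sums.
func parallelSum(nums []int, workers int) int {
	if workers < 1 {
		workers = 1
	}
	partial := make([]int, workers)
	chunk := (len(nums) + workers - 1) / workers // ceiling division

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		lo := w * chunk
		if lo >= len(nums) {
			break // fewer chunks than workers
		}
		hi := lo + chunk
		if hi > len(nums) {
			hi = len(nums)
		}
		wg.Add(1)
		go func(w, lo, hi int) {
			defer wg.Done()
			// Each goroutine writes only its own slot, so no lock is needed.
			for _, n := range nums[lo:hi] {
				partial[w] += n
			}
		}(w, lo, hi)
	}
	wg.Wait()

	total := 0
	for _, p := range partial {
		total += p
	}
	return total
}

func main() {
	nums := make([]int, 1000)
	for i := range nums {
		nums[i] = i + 1
	}
	fmt.Println(parallelSum(nums, runtime.NumCPU())) // 500500
}
```

Whether the goroutines actually run in parallel depends on `GOMAXPROCS` and the number of available cores. The same code is still correct, just concurrent rather than parallel, on a single core.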


The link redirects to a https URL that doesn't work (Firefox and Opera). I'm not using a https anywhere addon so I was very surprised that wget http://techblog.netflix.com/2016/05/application-data-caching... doesn't get redirected. Loading the downloaded file in the browser let me read the article.


Sorry, not sure what's wrong. I tested in the latest Firefox, Safari, Opera, and Chrome and it works for me.

Edit: blogger doesn't do SSL on custom domains. Sorry =/


The http:// link you posted works fine, but if I attempt to access the link via https ( https://techblog.netflix.com/2016/05/application-data-cachin... ), I get a "Secure Connection Failed" in Firefox.

Screenshot:

http://storage8.static.itmages.com/i/16/0525/h_1464199626_32...

That said, I'm not being redirected and I don't know why gp is.


OK, I can reproduce. I'll forward, something seems wrong (obviously). Thanks for reporting this.

edit: No SSL on blogger custom domains. Sorry =/


What was the problem? I ask so I can avoid making the same mistake on my own servers. Thanks.


I get the same result in Firefox, and in Chrome it says "This site can't be reached" with a specific ERR_CONNECTION_CLOSED error. I've been wondering why this item kept getting voted up when I couldn't even access it. I'm on Time Warner Cable in southern Kentucky, by the way. Could this be a network issue?

Edit: I should note that I cannot access this via HTTP either as I'm getting redirected to HTTPS.

Edit 2: Correction: If I force http:// in Chrome I can access the blog entry. I tried even using Private mode in Firefox and while I specify http:// the URL gets modified to https:// and I get the insecure connection message. Above I used the word redirect but I can see no evidence of an actual redirect in the developer interface. But it's not clear to me if the Firefox development interface Network tab would indicate a redirect response.


Works for me, but is it possible there was an incorrect HSTS configuration at some point? Possibly even on the parent domain with includeSubDomains.


Are you sure you don't have an extension installed that changes http requests to https?

The https is broken on that site.



