Have read a bit of the intro material, but I'm still not grokking what makes Kafka fundamentally different from ActiveMQ / Apollo. Can anyone sum up where and why one might need Kafka?
It's an architecture thing. Most message queues are written the way you'd initially think to write one: you keep a big queue of objects in memory, and in order to get delivery guarantees, you hold on to them until the consumer confirms receipt. This leads to pathological garbage collection scenarios, the old "holding onto an object just long enough to make it really expensive to GC".
Kafka, on the other hand, writes each message it receives straight to an on-disk log rather than holding it in memory. But isn't that slower? No, it's not, because the write lands in the page cache, which is managed more efficiently than garbage-collected memory. Then, when consuming, rather than keeping track of each individual message being received, consumers simply have a log position: they periodically commit it, which tells the broker that all of the messages up to that point have been consumed. If they never commit, eventually another consumer will get those messages.
So basically, it scales a ton better because you're just doing scads of sequential I/O with occasional commits, rather than tracking a bunch of messages in memory individually (which in theory should be fast but causes GC problems).
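If it helps to see the "just a log position" idea in code, here's a rough sketch using the newer Java consumer client (which post-dates the 0.8 release this thread is about); the broker address, topic, and group id are made up. Note there's no per-message ack back to the broker, only a periodic offset commit:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class OffsetSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // made-up address
            props.put("group.id", "example-group");           // made-up group
            props.put("enable.auto.commit", "false");         // we commit positions ourselves
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("events")); // made-up topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> r : records) {
                        process(r.value()); // no ack per message
                    }
                    // One commit marks everything read so far as consumed.
                    consumer.commitSync();
                }
            }
        }

        private static void process(String value) { /* application logic */ }
    }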
I'm not familiar at all with RabbitMQ, so I can't really comment, but I'm pretty sure they give guarantees like a producer being able to wait until a given message is consumed. This means there's no fire-and-forget: even though the message is logged to disk at some point, you still need to do all that per-message bookkeeping.
It is more accurate to say that RabbitMQ supports both fire-and-forget and producer-waits. The exact behavior is specified by a combination of how you configure exchanges and queues, per-message settings, and how you write your client code. For example, your application can decide that some of the messages it injects into a queue are to be durable and others not. It is quite flexible (though the docs are lacking when it comes to specific advice for various use cases).
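To make that concrete, here's a rough sketch with the RabbitMQ Java client (host and queue name are made up): one durable queue, one message published as persistent and one left transient, so the producer chooses per message how much bookkeeping it wants:

    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;
    import com.rabbitmq.client.MessageProperties;

    public class MixedDurability {
        public static void main(String[] args) throws Exception {
            ConnectionFactory factory = new ConnectionFactory();
            factory.setHost("localhost"); // made-up host
            Connection conn = factory.newConnection();
            Channel channel = conn.createChannel();

            // durable=true: the queue definition itself survives a broker restart
            channel.queueDeclare("tasks", true, false, false, null);

            // Persistent message: written to disk by the broker (delivery mode 2)...
            channel.basicPublish("", "tasks",
                MessageProperties.PERSISTENT_TEXT_PLAIN, "important".getBytes());

            // ...while this one is fire-and-forget and kept in memory only.
            channel.basicPublish("", "tasks", null, "best-effort".getBytes());

            channel.close();
            conn.close();
        }
    }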
It trades reliability for throughput. So real-time event-tracking logs are the most common use case, I think, where there's a ton of data. But we have a MapReduce job calculate from the raw logs at the end of the day for accuracy.
Which part is not correct? They go into great detail about their design for throughput. Re: reliability, "Not all use cases require such strong guarantees. For uses which are latency sensitive we allow the producer to specify the durability level it desires."
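For reference, this is roughly what that knob looks like from the producer side with the newer Java client (in 0.8 itself the setting was request.required.acks); the broker address and topic are made up:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class AcksSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // made-up address
            // The durability knob: "0" = fire-and-forget, "1" = the partition leader
            // has written it, "all" = every in-sync replica has it.
            props.put("acks", "1");
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("events", "key", "value")); // made-up topic
            }
        }
    }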
Pre-0.8, if a machine failed you lost all the data on that machine; only the lower durability levels were available. It guarantees at-least-once processing, while other queues generally make stronger claims. Etc.
Before 0.8, you are correct that if you lose a disk, you lose the data. However, just because a broker goes down doesn't mean you lose the data on it - it just becomes unavailable to consumers. The log files backing Kafka partitions are fully durable, append-only files. So the durability guarantees are pretty good, even before 0.8.
What you're talking about is failover and fault tolerance, which are greatly improved in 0.8 with the addition of replication.
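For anyone reading this later: with replication, the replication factor is just a per-topic setting. A rough sketch using the Java AdminClient (an API added well after 0.8, but the concept is the same; the names and sizes are made up):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class ReplicatedTopicSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // made-up address
            try (AdminClient admin = AdminClient.create(props)) {
                // 8 partitions, each stored on 3 brokers, so losing one broker
                // no longer makes that partition's data unavailable.
                NewTopic topic = new NewTopic("events", 8, (short) 3);
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }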
"A full company, scalable event bus like this can totally revolutionize the way you build services."
Shameless (and shameful) plug, but if anyone wants to be part of such an enterprise that's already gained traction with big companies, send me a message!
It keeps all messages in a (sorted) log, and each client keeps a position of where it is in that log. So clients can restart some parts, or skip some. This makes it very fast.
And it has sharding (partitions), which no other message queue has (I think).
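Because the broker just keeps a log per partition (the "shards") and the client owns its own position, replaying or skipping is a one-line seek. A rough sketch with the newer Java consumer (names and the offset are made up):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class SeekSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // made-up address
            props.put("group.id", "replayer");                // made-up group
            props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                TopicPartition shard = new TopicPartition("events", 0); // one partition
                consumer.assign(Collections.singletonList(shard));
                consumer.seekToBeginning(Collections.singletonList(shard)); // restart from the top
                consumer.seek(shard, 1_000_000L); // or skip ahead to an arbitrary position
            }
        }
    }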
I wrote about this release and Kafka in general in Chinese, and gathered some information that may be useful for Chinese readers: http://geek.csdn.net/news/detail/3866
This has nothing to do with the software itself, but it bothers me: Why do you call a "high-throughput distributed messaging system" Kafka? Kafka's stories essentially describe the polar opposite: crippled, ineffective, labyrinthine message systems that are exceedingly hierarchical in nature. They are also rather user-unfriendly, i.e. their users usually die horrible, lonely deaths. Am I just missing the in-joke here, and it's called Kafka because it is exactly the opposite of this, or did somebody overdo the hipster naming scheme?
That was my first thought. Based on the novels I've read, I would interpret it in a couple of possible ways:
- You receive a message, but the system can't tell you why you received it or what you should do. (The Trial)
- It's not a distributed messaging system with bugs. Actually, you are the bug. (Metamorphosis)
As an aside, I went to a tech conference in Prague two months ago and visited Café Slavia, a hangout not just of Kafka, but also author Milan Kundera and president/poet Václav Havel. I had a glass of absinthe in their honour.
The rationale was that it should be a writer because we were building a distributed log or journal (a service dedicated to writing). The writer needed to be someone that (1) I liked, (2) sounded cool as a name, (3) was dead (because it would be creepy to use the name of someone who was alive).
At LinkedIn, they started code naming projects after authors and literary works. E.g., Kafka, Camus, and a bunch of Harry Potter references. Kafka was donated to Apache by LinkedIn and the name stuck.
Java is pretty high performance, usually on par with C++, and usually faster than e.g. Go.
Writing Java code so that there are no perceivable GC pauses is an art, but it is not impossible to achieve.
The JVM might require more RAM upfront, but a well-written program is usually reasonably memory-efficient too, so the consumed memory grows slowly with the problem size.
Writing things in pure C is often just too time-consuming.
> Java is pretty high performance, usually on par with C++
I'm not convinced. Java I/O is far from perfect, and Kafka is probably very heavy on the I/O side.
> and usually faster than e.g. Go.
That's strange, since Go was to some degree intended as a replacement for Java without Java's downsides. Why would Go be less performant?
I'd be interested if someone wrote such a framework in Rust, though. C++ is of course the default expectation, but the use of Java somehow surprises me in this case.
Kafka is written in Scala, not Java. The only Java in the project is the Java API and the Hadoop connectors.
Comparing languages in absolute performance terms is a bad idea; it's an extreme simplification of what really goes into creating performant applications.
I didn't say it's just "slow". I said it's an expected performance hit in comparison with languages like C++. In some cases that hit is tolerable, in others not.
Right, but even what you said isn't really within shouting distance of reality.
Certainly, many naive people would expect such a performance hit; personally, I'd strongly expect that the actual performance is predicated on the architecture and the expertise of the teams involved and that any differences between the languages become either apples to oranges or noise.
I'd want an actual performance comparison to be sure, rather than relying on the expertise of the teams and assuming it. And as with any claim that the difference between languages becomes irrelevant as hardware advances, it can still be relevant in plenty of cases.
Definitely, definitely not true. Go is extremely fast; it has its strengths in some areas and trade-offs in others. If you want a good comparison to Apache Kafka, you should look at Apcera's NATS server, a publish-subscribe and distributed queueing messaging system written in Go by Derek Collison: https://github.com/apcera/gnatsd
You should never discredit a language, especially with blanket claims such as "it's faster than Go". In what respects and in what areas? Here's a blog post which benchmarks Go and Scala: http://eng.42go.com/scala-vs-go-tcp-benchmark/
They found Go to perform better than Scala; however, it had a high footprint. Every language has its trade-offs, and Java and Go are no exception.
Your link says the opposite of what you posted. From the article: "To our surprise, Scala did quite a bit better than Go, with an average round trip time of ~1.6 microseconds (0.0016 milliseconds) vs. Go’s ~11 microseconds (0.011 milliseconds).", and "Notable in the opposite direction was that the Go server had a memory footprint of only about 10 MB vs. Scala’s nearly 200 MB."
I have no idea where people got the idea that Go was faster than Java, or that Java/the JVM is slow in 2013. Not trying to discredit Go (it's my language of choice), but to say it surpassed Java while only being around for almost 5 years is disingenuous.
I/O is I/O: the big differences are the lack of certain APIs (which could make it poorly suited to some -- but not the majority of -- I/O-intensive use cases; needing native bindings for kernel AIO is one example, and afaik this isn't resolved with NIO.2), poor APIs (ByteBuffer, I'm looking at you, but here's an alternative: https://github.com/airlift/slice), and the fact that I/O obviously involves making JNI calls, which do have a latency penalty, but one that is dominated by the cost of the I/O itself (if the code is well written, we're speaking nanoseconds here, while even a FusionIO device has latency measured in microseconds). For bulk, throughput-oriented transfer from disk to socket via sendfile(), I'd say the costs of using Java are not noticeable at all.
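For the curious, the sendfile() path is exposed in Java as FileChannel.transferTo. A rough sketch (the file name and endpoint are made up); on Linux the kernel can move the bytes from the file to the socket without them ever passing through the JVM heap:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.channels.FileChannel;
    import java.nio.channels.SocketChannel;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class SendfileSketch {
        public static void main(String[] args) throws IOException {
            try (FileChannel file = FileChannel.open(Paths.get("segment.log"), // made-up file
                     StandardOpenOption.READ);
                 SocketChannel socket = SocketChannel.open(
                     new InetSocketAddress("localhost", 9092))) {              // made-up endpoint
                long position = 0;
                long remaining = file.size();
                while (remaining > 0) {
                    // Delegates to sendfile() where the OS supports it.
                    long sent = file.transferTo(position, remaining, socket);
                    position += sent;
                    remaining -= sent;
                }
            }
        }
    }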
Re: Google. The MapReduce paper was published in 2004, only a few years after Java 1.4 was released. The infrastructure it is built on top of -- like M/R itself -- was almost certainly written before Java 1.4 was released, and likely at a time when running Java on Linux meant using Blackdown, which had its own issues. Java 1.4 is when non-blocking I/O was introduced (java.util.concurrent followed in 1.5); prior to this, writing scalable socket servers in Java was far more difficult. It also wasn't until Java 1.6 that epoll() was supported by Java on Linux, etc...
The other big part is that Map/Reduce is not in a vacuum: while Map/Reduce workloads are mostly I/O dominated, other pieces of infrastructure built on those building blocks are very memory intensive. Google "Java GC" to see why that is still an issue with Java (despite it having an extremely advanced concurrent garbage collector). In addition, I'd imagine that in the case of Google the performance advantages of C++ (more about things like being able to lay memory out in precise ways and being friendly to the CPU caches than about pure speed) really do matter -- as Joshua Bloch put it, "inner loops of indexers would likely stay in C++".
Finally, Google does have at least some infrastructural pieces written in Java: I do know that there are first-class supported APIs for BigTable/Spanner and Map/Reduce in Java, there is also the FlumeJava paper (an EDSL for Map/Reduce in Java), and while I believe C++ infrastructure has since superseded MegaStore, it is one example of Java being used for Google's core infrastructure.
That's not even to mention Google's well-known, extensive use of Java for web applications/middleware, including AdSense/AdWords billing and management and Gmail.
In the end, you can certainly write I/O-intensive applications in Java, but with some caveats: be sure it's actually I/O intensive and that the non-I/O-intensive parts perform fine in Java, be wary of GC hell, know how to avoid excess/implicit copying (pretty much the case in any language), and be sure that the APIs you plan to use are available.
In the end, I don't think it's an either/or answer: given a (hate to use a buzzword, but it certainly applies) service-oriented architecture, you can implement various pieces of a system in the languages best suited for each task. This is becoming easier in the H* world now that protobufs is the "native" RPC language as opposed to Java serialization.