Apache Kafka 0.8.0 released (apache.org)
131 points by mumrah on Dec 9, 2013 | 51 comments


Have read a bit of the intro material, but I'm still not grokking what makes Kafka fundamentally different from ActiveMQ / Apollo. Can anyone sum up where and why one might need Kafka?


It's an architecture thing. Most message queues are written the way you'd initially think to write one: you keep a big queue of objects in memory, and to provide delivery guarantees you have to hold onto each object until a consumer confirms receipt. This leads to pathological garbage-collection scenarios, the old "holding onto an object just long enough to make it really expensive to GC".

Kafka, on the other hand, works differently: when you write a message to the broker, the broker immediately appends it to an on-disk log rather than holding it in memory. But isn't that slower? No, because the write lands in the page cache, which the OS manages more efficiently than garbage-collected memory. Then, when consuming, rather than tracking receipt of each individual message, consumers simply keep a log position -- they periodically commit it, which tells the broker that all messages up to that point have been consumed. If a consumer never commits, another consumer will eventually get those messages.

So basically, it scales a ton better because you're just doing scads of sequential I/O with occasional commits, rather than tracking a bunch of messages in memory individually (which in theory should be fast but causes GC problems).
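
To make the log-position model concrete, here is a toy sketch in Java (emphatically not the real Kafka API; all names are illustrative): the broker side only ever appends sequentially, and a consumer's entire durable state is one offset.

  import java.util.ArrayList;
  import java.util.List;

  class ToyLog {
      private final List<byte[]> log = new ArrayList<>(); // stands in for the on-disk log

      // Producer path: append-only, strictly sequential writes.
      synchronized long append(byte[] message) {
          log.add(message);
          return log.size() - 1; // offset of the appended message
      }

      // Consumer path: read from a position; the broker keeps no per-message state.
      synchronized List<byte[]> read(long offset, int max) {
          int from = (int) Math.min(offset, log.size());
          int to = Math.min(from + max, log.size());
          return new ArrayList<>(log.subList(from, to));
      }
  }

  class ToyConsumer {
      long committedOffset = 0; // the only durable consumer-side state

      void poll(ToyLog log) {
          List<byte[]> batch = log.read(committedOffset, 100);
          for (byte[] msg : batch) {
              process(msg); // application logic goes here
          }
          committedOffset += batch.size(); // "commit": everything before this is consumed
      }

      void process(byte[] msg) { }
  }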


How does that compare to RabbitMQ's disk backed store?


I'm not familiar at all with RabbitMQ, so I can't really comment, but I'm pretty sure it offers guarantees like letting a producer wait until a given message is consumed. That means it's not fire-and-forget: even though the message is logged to disk at some point, you still need all that per-message bookkeeping.


It is more accurate to say that RabbitMQ supports both fire-and-forget and producer-waits. The exact behavior is determined by a combination of how you configure exchanges and queues, per-message settings, and how you write your client code. For example, your application can decide that some of the messages it injects into a queue are to be durable and others not. It is quite flexible (though the docs are lacking when it comes to specific advice for various use cases).
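
For illustration, here is a minimal sketch with the RabbitMQ Java client (assuming the standard com.rabbitmq.client API and a broker on localhost): the same channel can publish both a fire-and-forget message and one the producer waits on, chosen per message.

  import com.rabbitmq.client.Channel;
  import com.rabbitmq.client.Connection;
  import com.rabbitmq.client.ConnectionFactory;
  import com.rabbitmq.client.MessageProperties;

  public class RabbitDurabilityDemo {
      public static void main(String[] args) throws Exception {
          ConnectionFactory factory = new ConnectionFactory();
          factory.setHost("localhost"); // assumption: a local broker
          Connection conn = factory.newConnection();
          Channel channel = conn.createChannel();

          // Durable queue: the queue definition survives a broker restart.
          channel.queueDeclare("events", true, false, false, null);

          // Fire-and-forget: transient message, no confirmation requested.
          channel.basicPublish("", "events", null, "best-effort".getBytes());

          // Producer-waits: persistent message plus publisher confirms.
          channel.confirmSelect();
          channel.basicPublish("", "events",
                  MessageProperties.PERSISTENT_TEXT_PLAIN,
                  "must-not-lose".getBytes());
          channel.waitForConfirms(); // blocks until the broker acknowledges

          channel.close();
          conn.close();
      }
  }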


How does it compare to services like Amazon SNS, IronIO, etc. in terms of costs and other tradeoffs?


Since you host it yourself, you can't compare them directly: you take on the variable cost of the servers and your own time to maintain them.


In Kafka, topics are (partitioned) streams of messages. Consumers of a topic keep track of their "cursor" in that stream.

EDIT: should add that morkbot had a great link too:

https://news.ycombinator.com/item?id=6874607 http://www.quora.com/RabbitMQ/RabbitMQ-vs-Kafka-which-one-fo...
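
To make the "cursor" idea concrete, here is a hedged sketch using the 0.8 high-level consumer's Java wrapper (class and property names assume the 0.8 docs; with auto-commit disabled, commitOffsets() is the explicit "advance my cursor" call):

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  import java.util.Properties;
  import kafka.consumer.Consumer;
  import kafka.consumer.ConsumerConfig;
  import kafka.consumer.ConsumerIterator;
  import kafka.consumer.KafkaStream;
  import kafka.javaapi.consumer.ConsumerConnector;

  public class CursorDemo {
      public static void main(String[] args) {
          Properties props = new Properties();
          props.put("zookeeper.connect", "localhost:2181"); // assumption: local ZooKeeper
          props.put("group.id", "demo-group");
          props.put("auto.commit.enable", "false"); // we move the cursor ourselves

          ConsumerConnector consumer =
                  Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

          Map<String, Integer> topicCounts = new HashMap<String, Integer>();
          topicCounts.put("events", 1); // one stream for the "events" topic
          Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                  consumer.createMessageStreams(topicCounts);

          ConsumerIterator<byte[], byte[]> it = streams.get("events").get(0).iterator();
          for (int i = 0; i < 100 && it.hasNext(); i++) {
              byte[] payload = it.next().message(); // process the message
          }
          consumer.commitOffsets(); // tell the broker: everything up to here is consumed
          consumer.shutdown();
      }
  }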


It stores messages for two days by default, and messages can be replayed. Plus it runs across distributed servers.
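
The retention window is just broker configuration. A hedged sketch of the relevant server.properties settings (key names assume the 0.8 broker config docs):

  # keep messages for two days, then delete old log segments
  log.retention.hours=48
  # time-based segment rollout (new in 0.8): start a new segment daily
  log.roll.hours=24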


It trades reliability for throughput. So real-time event-tracking logs, where there's a ton of data, are the most common use case I think. But we have a MapReduce job recompute from the raw logs at the end of the day for accuracy.



Which part is not correct? They go into great detail about their design for throughput. Re: reliability, "Not all use cases require such strong guarantees. For uses which are latency sensitive we allow the producer to specify the durability level it desires."

Pre-0.8, if a machine failed you lost all the data on that machine; only the lower durability levels were available. It guarantees at-least-once processing, while other queues generally make stronger claims. Etc.

It's still very good at what it does.
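
For concreteness, here is a hedged sketch of that per-producer durability knob with the 0.8 producer's Java wrapper (class and property names assume the 0.8 docs; request.required.acks selects the durability level):

  import java.util.Properties;
  import kafka.javaapi.producer.Producer;
  import kafka.producer.KeyedMessage;
  import kafka.producer.ProducerConfig;

  public class AcksDemo {
      public static void main(String[] args) {
          Properties props = new Properties();
          props.put("metadata.broker.list", "localhost:9092"); // assumption: local broker
          props.put("serializer.class", "kafka.serializer.StringEncoder");
          // 0 = fire-and-forget, 1 = wait for the leader's ack,
          // -1 = wait until all in-sync replicas have the message
          props.put("request.required.acks", "-1");

          Producer<String, String> producer =
                  new Producer<String, String>(new ProducerConfig(props));
          producer.send(new KeyedMessage<String, String>("events", "hello"));
          producer.close();
      }
  }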


Before 0.8, you are correct that if you lose a disk, you lose the data. However, just because a broker goes down doesn't mean you lose the data on it; it just becomes unavailable to consumers. The log files backing Kafka partitions are fully durable, append-only files. So the durability guarantees are pretty good, even before 0.8.

What you're talking about is failover and fault tolerance, which are greatly improved in 0.8 with the addition of replication.


Use cases (from the project page): http://kafka.apache.org/documentation.html#uses


Kafka is an integral part of Shopify's infrastructure. It's brilliant but underappreciated technology.

A company-wide, scalable event bus like this can totally revolutionize the way you build services.


And thank you for the amazing Go client.


"A full company, scalable event bus like this can totally revolutionize the way you build services."

Shameless (and shameful) plug, but if anyone wants to be part of such an enterprise that's already gained traction with big companies, send me a message!


This is the first release as an Apache top-level project and represents many months of hard work.

A few of the major improvements (from https://archive.apache.org/dist/kafka/0.8.0/RELEASE_NOTES.ht...):

  * Intra-cluster replication support
  * Support for multiple data directories
  * Many new internal metrics
  * Time-based log segment rollout
Plus many bug fixes and other improvements.


May I ask how Kafka compares with an AMQP solution such as RabbitMQ? Thank you.



Quora link where you don't have to register/log-in: http://www.quora.com/RabbitMQ/RabbitMQ-vs-Kafka-which-one-fo...


It keeps all messages in a (sorted) log, and each client keeps track of its own position in that log. So clients can restart from an earlier point, or skip parts. This makes it very fast.

And it has sharding, which no other message queue has (I think).


RabbitMQ supports sharding via exchanges.
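
One hedged sketch of that, using the consistent-hash exchange type from the rabbitmq_consistent_hash_exchange plugin (assumes the plugin is enabled; exchange and queue names are illustrative):

  import com.rabbitmq.client.Channel;
  import com.rabbitmq.client.Connection;
  import com.rabbitmq.client.ConnectionFactory;

  public class ShardedExchangeDemo {
      public static void main(String[] args) throws Exception {
          ConnectionFactory factory = new ConnectionFactory();
          factory.setHost("localhost"); // assumption: a local broker
          Connection conn = factory.newConnection();
          Channel channel = conn.createChannel();

          // Exchange type provided by the consistent-hash plugin.
          channel.exchangeDeclare("events.sharded", "x-consistent-hash", true);
          channel.queueDeclare("events-shard-1", true, false, false, null);
          channel.queueDeclare("events-shard-2", true, false, false, null);

          // For this exchange type the binding key is a weight, not a pattern.
          channel.queueBind("events-shard-1", "events.sharded", "1");
          channel.queueBind("events-shard-2", "events.sharded", "1");

          // Messages are hashed on the routing key, so one user's messages
          // consistently land on the same shard queue.
          channel.basicPublish("events.sharded", "user-42", null, "payload".getBytes());

          channel.close();
          conn.close();
      }
  }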


I wrote about this release and Kafka in general in Chinese, and gathered some information that may be useful for Chinese readers: http://geek.csdn.net/news/detail/3866


This has nothing to do with the software itself, but it bothers me: why do you call a "high-throughput distributed messaging system" Kafka? Kafka's stories essentially describe the polar opposite: crippled, ineffective, labyrinthine message systems that are exceedingly hierarchical in nature. They are also rather user-unfriendly, i.e. their users usually die horrible, lonely deaths. Am I just missing the in-joke here, and it's called Kafka because it is exactly the opposite of this, or did somebody overdo the hipster naming scheme?


That was my first thought. Based on the novels I've read, I would interpret it in a couple of possible ways:

- You receive a message, but the system can't tell you why you received it or what you should do. (The Trial)

- It's not a distributed messaging system with bugs. Actually, you are the bug. (Metamorphosis)

As an aside, I went to a tech conference in Prague two months ago and visited Café Slavia, a hangout not just of Kafka, but also author Milan Kundera and president/poet Václav Havel. I had a glass of absinthe in their honour.


The rationale was that it should be a writer because we were building a distributed log or journal (a service dedicated to writing). The writer needed to be someone that (1) I liked, (2) sounded cool as a name, (3) was dead (because it would be creepy to use the name of someone who was alive).


At LinkedIn, they started code naming projects after authors and literary works. E.g., Kafka, Camus, and a bunch of Harry Potter references. Kafka was donated to Apache by LinkedIn and the name stuck.


It's written in Scala, so perhaps the reference is to the source code rather than the client interface ...

... or maybe you're just overthinking it.


Congratulations to the Kafka team! Their work is always extremely impressive.


Was this developed in Scala?


It is a mixture of Java and Scala, though mostly Scala. GitHub mirror of the source: https://github.com/apache/kafka


Yup.


I like the naming trend and expect releases of Apache Ionesco, Apache Lovecraft, and Apache Poe real soon now. </obligatory-joke>


Finally!


Why is it written in Java? Isn't that a performance hit by default? I'd expect such frameworks to be written in high-performance languages.


Java is pretty high performance, usually on par with C++, and usually faster than e.g. Go.

Writing Java code so that there are no perceivable GC pauses is an art, but it is not impossible to achieve.

The JVM might require more RAM upfront, but a well-written program is usually memory-efficient too, so consumption grows reasonably slowly with problem size.

Writing things in pure C is often just too time-consuming.


> Java is pretty high performance, usually on par with C++

I'm not convinced. Java I/O is far from perfect, and Kafka is probably very heavy on the I/O side.

> and usually faster than e.g. Go.

That's strange, since Go was to some degree intended as a replacement for Java without Java's downsides. Why would Go be less performant?

I'd be interested to see someone write such a framework in Rust, though. C++ is of course the default expectation, but the use of Java somehow surprises me in this case.


> Java I/O is far from perfect

The API certainly isn't perfect, but what do you find lacking about the performance of Java I/O?

http://docs.oracle.com/javase/7/docs/api/java/nio/channels/p...


Kafka is written in Scala, not Java. The only Java in the project is the Java API and the Hadoop connectors.

Comparing languages in absolute performance terms is a bad idea; it's an extreme simplification of what really goes into creating performant applications.


> Kafka is written in Scala, not Java.

Thanks for the correction.


Why are you so deadset on assuming that Java is slow? That's an incredibly misinformed and narrow view of what's possible on the JVM.


I didn't say it's just "slow". I said there's an expected performance hit in comparison with languages like C++. In some cases that hit is tolerable, in others not.


Right, but even what you said isn't really within shouting distance of reality.

Certainly, many naive people would expect such a performance hit; personally, I'd strongly expect that the actual performance is predicated on the architecture and the expertise of the teams involved and that any differences between the languages become either apples to oranges or noise.


I'd want an actual performance comparison to be sure, rather than relying on assumptions about the teams' expertise. And as for the claim that differences between languages become irrelevant as hardware advances, those differences can still matter in plenty of cases.


Definitely, definitely not true. Go is extremely fast; it has strengths in some areas and trade-offs in others. If you want a good comparison to Apache Kafka, look at Apcera's NATS server, a publish-subscribe and distributed-queueing messaging system written in Go by Derek Collison: https://github.com/apcera/gnatsd

You should never dismiss a language, especially with blanket claims such as "it's faster than Go". In what respects, and in what areas? Here's a blog post that benchmarks Go against Scala: http://eng.42go.com/scala-vs-go-tcp-benchmark/

They found Go to perform better than Scala, though it had a higher footprint. Every language has its tradeoffs; Java and Go are no exception.


Your link says the opposite of what you posted. From the article: "To our surprise, Scala did quite a bit better than Go, with an average round trip time of ~1.6 microseconds (0.0016 milliseconds) vs. Go’s ~11 microseconds (0.011 milliseconds).", and "Notable in the opposite direction was that the Go server had a memory footprint of only about 10 MB vs. Scala’s nearly 200 MB."

I have no idea where people got the idea that Go was faster than Java, or that Java/JVM is slow in 2013. Not trying to discredit Go (it's my language of choice), but to say it surpassed Java while only being around for almost 5 years is disingenuous.


nemothekid,

You're absolutely right, my apologies; it's been a while since I read that article. Thank you for the correction! Much appreciated.


Java I/O is plenty fast. Just take a look at Hadoop, Lucene, Jetty, Netty, or any number of other pieces of high performance software written in Java.

Here's a good writeup talking about sequential I/O in Java: http://mechanical-sympathy.blogspot.com/2011/12/java-sequent...


Thanks for the pointer. Is there any comparison with C/C++ I/O on various operating systems?

Note that unlike Hadoop, Google's original MapReduce system was written in C++.


I/O is I/O. The big differences are: a lack of certain APIs, which could make Java poorly suited to some (but not most) I/O-intensive use cases; needing native bindings for kernel AIO is one example, and as far as I know NIO.2 doesn't resolve it. Poor APIs (ByteBuffer, I'm looking at you; here's an alternative: https://github.com/airlift/slice). And the fact that I/O obviously involves making JNI calls, which do carry a latency penalty, but one dominated by the cost of the I/O itself: if the code is well written we're talking nanoseconds, while even a FusionIO device has latency measured in microseconds. For bulk, throughput-oriented transfer from disk to socket via sendfile(), I'd say the cost of using Java isn't noticeable at all.
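
As a small sketch of that zero-copy path with java.nio.channels (the file path and endpoint are made up): FileChannel.transferTo can map to sendfile() on Linux, so bytes move from the page cache to the socket without ever entering the JVM heap.

  import java.io.FileInputStream;
  import java.net.InetSocketAddress;
  import java.nio.channels.FileChannel;
  import java.nio.channels.SocketChannel;

  public class SendfileDemo {
      public static void main(String[] args) throws Exception {
          FileChannel file = new FileInputStream("/var/log/app.log").getChannel();
          SocketChannel socket =
                  SocketChannel.open(new InetSocketAddress("localhost", 9000));
          long position = 0;
          long size = file.size();
          while (position < size) {
              // The kernel performs the copy; transferTo returns bytes sent.
              position += file.transferTo(position, size - position, socket);
          }
          socket.close();
          file.close();
      }
  }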

Re: Google. The MapReduce paper was published in 2004, only a couple of years after Java 1.4 was released. The infrastructure it was built on top of, like M/R itself, was almost certainly written before Java 1.4 shipped, and likely at a time when running Java on Linux meant using Blackdown, which had its own issues. Java 1.4 is when non-blocking I/O was introduced (java.util.concurrent followed in Java 5); prior to that, writing scalable socket code in Java was far more difficult. It also wasn't until Java 6 that epoll() was supported by Java on Linux, etc.

The other big part is that Map/Reduce does not exist in a vacuum: while Map/Reduce workloads are mostly I/O-dominated, other pieces of infrastructure built on those building blocks are very memory-intensive. Google "Java GC" to see why that is still an issue with Java (despite its extremely advanced concurrent garbage collector). In addition, I'd imagine that in Google's case the performance advantages of C++ (more about being able to lay memory out in precise ways and being friendly to CPU caches than about raw speed) really do matter; as Joshua Bloch put it, "inner loops of indexers would likely stay in C++".

Finally, Google does have at least some infrastructural pieces in Java: I know there are first-class supported Java APIs for BigTable/Spanner and Map/Reduce, there is also the FlumeJava paper (an EDSL for Map/Reduce in Java), and while I believe C++ infrastructure has since superseded MegaStore, it was one piece of Java used in Google's core infrastructure.

That's not even to mention Google's well-known, extensive use of Java for web applications and middleware, including AdSense/AdWords billing and management, and Gmail.

In the end, you can certainly write I/O-intensive applications in Java, but with some caveats: be sure the application actually is I/O-intensive and that the non-I/O parts perform fine in Java, be wary of GC hell, know how to avoid excess/implicit copying (true in pretty much any language), and be sure the APIs you plan to use are available.

Ultimately, I don't think it's an either/or answer: given a (hate to use a buzzword, but it certainly applies) service-oriented architecture, you can implement the various pieces of a system in the languages best suited to each task. This is becoming easier in the H* world now that protobufs, as opposed to Java serialization, is the "native" RPC language.



