Cassandra Internals – Reading

ableal · on March 18, 2010

Those interested in the topic may also want to read this:

"Why we’re using HBase (at Adobe)": http://hstack.org/why-were-using-hbase-part-1/

It is a fine "war-story" of picking new technology and making it work without losing data.

(It was submitted yesterday by the author here: http://news.ycombinator.com/item?id=1196382, but got killed with 5 points, which baffles me. I found it when puzzling out why my submission today was instantly killed, with a different item id ...)

[P.S. minor bug report: my 'dead' item has a working link to the article, which it perhaps shouldn't. http://news.ycombinator.com/item?id=1200833]

jbellis · on March 18, 2010

The reason uncached reads are slower in Cassandra is not that the sstable is inherently I/O-intensive (it's actually better than B-tree-based storage on a 1:1 basis) but that in the average case you'll have to merge row fragments from 2-4 sstables to complete the request, since sstables are not updated in place.
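To make the merge step concrete, here is a minimal sketch in Python. It is not Cassandra's actual read path: the in-memory dicts standing in for sstables, the `merge_row_fragments` helper, and the sample data are all assumptions made for illustration. It assumes each column carries a timestamp and that the newest timestamp wins, and it ignores the memtable, deletions, and any on-disk format.

```python
from typing import Dict, List, Tuple

# One row fragment as found in a single sstable (illustrative model only):
# column name -> (value, write timestamp)
Fragment = Dict[str, Tuple[str, int]]


def merge_row_fragments(fragments: List[Fragment]) -> Dict[str, str]:
    """Combine fragments of one row, keeping the newest value per column."""
    merged: Fragment = {}
    for fragment in fragments:
        for column, (value, timestamp) in fragment.items():
            # Keep this column only if it is new or more recent than what we have.
            if column not in merged or timestamp > merged[column][1]:
                merged[column] = (value, timestamp)
    # Drop the timestamps before handing the row back to the caller.
    return {column: value for column, (value, _) in merged.items()}


# Hypothetical fragments of the same row, spread across three sstables
# because later writes never modified the earlier files in place.
sstables = [
    {"name": ("alice", 1), "email": ("a@old.example", 1)},
    {"email": ("a@new.example", 5)},
    {"city": ("Paris", 3)},
]

row = merge_row_fragments(sstables)
print(row)
```

Each extra sstable in the list is one more place the read has to look, which is where the extra cost of an uncached read comes from in this model.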

suhail · on March 18, 2010

A little misleading, imo. While Cassandra has eventual consistency, reads are not necessarily slow. With the cache settings (KeysCached/RowsCached) tuned correctly and enough available memory, Cassandra actually performs quite well. Cassandra is virtually worthless without those cache features, kind of like MySQL is without indexes. Reads are slower than writes, but I think it would have been more appropriate, and more interesting, to talk about how the cache works.

As with any database (MySQL, Postgres, etc.), understanding how to make it work well is a dark art.
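For readers wondering why a row cache changes read performance so much, here is a rough sketch of a read-through row cache in Python. It is only an illustration of the idea, not Cassandra's implementation: the `RowCache` class, the LRU eviction policy, and the `read_from_sstables` stand-in are all assumptions, and the only link to the real setting is that a capacity limits how many rows are kept in memory.

```python
from collections import OrderedDict


class RowCache:
    """Read-through cache holding at most `capacity` rows (LRU eviction assumed)."""

    def __init__(self, capacity, load_row):
        self.capacity = capacity
        self.load_row = load_row      # slow path, e.g. a read that hits disk
        self.rows = OrderedDict()     # least recently used row first
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.rows:
            self.rows.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.rows[key]
        self.misses += 1
        row = self.load_row(key)
        self.rows[key] = row
        if len(self.rows) > self.capacity:
            self.rows.popitem(last=False)  # evict the least recently used row
        return row


disk_reads = []


def read_from_sstables(key):
    """Hypothetical slow read; records each call so the cost is visible."""
    disk_reads.append(key)
    return {"id": key}


cache = RowCache(capacity=2, load_row=read_from_sstables)
for key in ["a", "b", "a", "c", "b"]:
    cache.get(key)

print(cache.hits, cache.misses, disk_reads)
```

With a capacity that is too small for the working set, most requests still fall through to the slow path, which is why the setting has to be tuned against available memory.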

jbellis · on March 18, 2010

Right. Digg dropped memcached entirely from their architecture when we added RowsCached to Cassandra.

CWIZO · on March 18, 2010

Cached (by Google) text only version: http://209.85.129.132/search?q=cache:http://www.mikeperham.c...

ra · on March 18, 2010

Cassandra isn't as easy to learn as, say, CouchDB. But Couch uses JSON (an awesome choice, BTW), and Cassandra uses Thrift.

Cassandra is kinda difficult to pick up because there is no SQL equivalent, there are no relationships, joins or "where"s.
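As a rough illustration of what "no joins, no where" means in practice, here is a toy model in Python. It is not the Thrift API: the nested dicts, the `Users` and `UsersByCity` names, and the helper functions are all hypothetical. The point is only that rows are fetched by key, so a query like "users in a city" needs a second, hand-maintained structure written at insert time.

```python
# Toy model: column family -> row key -> {column name: value}
keyspace = {"Users": {}, "UsersByCity": {}}


def insert_user(user_id, name, city):
    """Write the user row and, by hand, the index row that replaces a WHERE."""
    keyspace["Users"][user_id] = {"name": name, "city": city}
    keyspace["UsersByCity"].setdefault(city, {})[user_id] = name


def users_in_city(city):
    """Look up by row key only; there is no query planner to do this for us."""
    return sorted(keyspace["UsersByCity"].get(city, {}).values())


insert_user("u1", "alice", "Paris")
insert_user("u2", "bob", "Berlin")
insert_user("u3", "carol", "Paris")

paris_users = users_in_city("Paris")
print(paris_users)
```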

So, basically, it's an engine without user-friendly controls. But it's probably the most awesomely powerful storage engine yet available in the public domain.

Imagine if Google released a server image of one of their storage nodes... ostensibly, that's what Facebook did with Cassandra.