
We used a combination of Kafka + HBase + Phoenix (http://phoenix.apache.org/) for a similar purpose. It takes some effort to set up the initial HBase cluster, but once you do it manually once and automate it with Ansible/systemd, it's pretty robust in operation.

All our development was around a query engine using plain JDBC/SQL to talk to HBase via Phoenix. Scaling is as simple as adding a node to the cluster.
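For context, the "plain SQL" here is ordinary-looking queries that Phoenix compiles into HBase scans. A hedged sketch of what such a query might look like, assuming a hypothetical METRICS table whose columns (series, ts, val) are invented for illustration:

```sql
-- Hypothetical Phoenix query: the series name and time range map onto
-- the table's composite primary key, so this compiles to a range scan.
SELECT ts, val
FROM METRICS
WHERE series = 'cpu.load'
  AND ts >= TO_TIMESTAMP('2020-01-01 00:00:00')
  AND ts <  TO_TIMESTAMP('2020-01-08 00:00:00')
ORDER BY ts;
```

Because the leading key columns appear in the WHERE clause, Phoenix can bound the scan instead of reading the whole table.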



That's interesting. What are query times like? Say, for a single series, querying a week of data at a five-minute interval, how many seconds would it take?


Does Kafka have timestamps? I didn't see any when I looked, but I was working with an older client version & didn't get far into it.


We didn't rely on it, or on the ordering of messages received from Kafka. Timestamps and transaction IDs were generated by the client app/Kafka publisher and were part of the message put into the Kafka topic. When we consume that message with one of the parallel Kafka consumers and save a row in an HBase table, the original timestamp + transaction ID becomes part of the rowkey string, the other parts being attributes that we wanted to index (secondary indices are supported in HBase/Phoenix, but we didn't use them much; basically, the composite rowkey is the index). Then, when querying, HBase works as a parallel scanning machine and can do a time-range scan + filtering + aggregation very fast.
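The rowkey scheme described above can be sketched as follows. This is a hedged illustration, not the poster's actual code: the field order, separator, and padding widths are assumptions, and the sorted list stands in for HBase's key-ordered storage.

```python
from bisect import bisect_left, bisect_right

def make_rowkey(attribute: str, ts_millis: int, txn_id: str) -> str:
    # Composite rowkey: indexed attribute first, then a zero-padded
    # epoch-millis timestamp, then the transaction ID. Zero-padding
    # keeps lexicographic order equal to numeric time order, which is
    # what makes time-range scans cheap.
    return f"{attribute}|{ts_millis:013d}|{txn_id}"

# HBase stores rows sorted by rowkey; model that with a sorted list.
rows = sorted(
    make_rowkey(attr, ts, txn)
    for attr, ts, txn in [
        ("sensor-a", 1_600_000_000_000, "tx1"),
        ("sensor-a", 1_600_000_060_000, "tx2"),
        ("sensor-a", 1_600_000_120_000, "tx3"),
        ("sensor-b", 1_600_000_060_000, "tx4"),
    ]
)

def time_range_scan(attribute: str, start_ms: int, end_ms: int) -> list[str]:
    # A range scan is just two binary searches over the sorted keys,
    # analogous to an HBase scan with start/stop rowkeys.
    lo = bisect_left(rows, f"{attribute}|{start_ms:013d}|")
    hi = bisect_right(rows, f"{attribute}|{end_ms:013d}|\xff")
    return rows[lo:hi]
```

Calling `time_range_scan("sensor-a", 1_600_000_000_000, 1_600_000_060_000)` touches only the keys inside the requested window, never the whole key space.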

On a separate note, we didn't use joins even though they're supported in Phoenix; the data was completely denormalized into one big table.
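A minimal sketch of that denormalization idea, with invented table and field names: instead of joining event rows against a reference table at query time, the publisher embeds the reference attributes into every row it writes.

```python
# Hypothetical reference data that would otherwise live in a separate
# table and require a join at query time.
devices = {"dev-1": {"site": "fra", "model": "m3"}}

def denormalize(event: dict) -> dict:
    # Copy the reference attributes into the event itself, producing
    # one wide row that can be queried without any join.
    wide = dict(event)
    wide.update(devices[event["device_id"]])
    return wide

row = denormalize({"device_id": "dev-1", "ts": 1_600_000_000_000, "value": 42})
```

The trade-off is the usual one: writes duplicate the reference attributes into every row, and reads in return never pay for a join.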



