I want to optimize my usage of HBase for faster writes. I have a task that reads from a Kafka topic and then writes to HBase based on what it reads. Since Kafka keeps a log of everything that is to be written, it's an easy source to recover from. I'm reading "HBase High Performance Cookbook" and there's this note:
Note that this brings an interesting thought about when to use WAL and when not to. By default, WAL writes are on, and the data are always written to, WAL. But if you are sure the data can be rewritten or a small loss won't be impacting the overall outcome of the processing, you disable the write to WAL. WAL provides an easy and definitive recovery. This is the fundamental reason why, by default, it's always enabled. In scenarios where data loss is not expectable, you should leave it in the default settings; otherwise, change it to use memstore. Alternatively, you can plan for a DR (disaster recovery)
How do I configure this recovery to be automatic? I see two options:
- I write to HBase without the WAL (memstore only) and am somehow notified when writes were lost before being flushed to disk. Then I go back in the Kafka log and replay them; or
- I write to HBase without the WAL (memstore only) and every so often HBase tells me which Kafka offset can safely be committed.
How do I do either of these?
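
To make the second option concrete, here is a rough sketch of the control flow I have in mind. It is not a working HBase client: the `Store` and `OffsetCommitter` interfaces and the `FlushThenCommit` class are hypothetical stand-ins I made up. In a real job I assume `put` would be a `Put` with `Durability.SKIP_WAL`, `flush` would be `Admin.flush(TableName)`, and `commit` would be `KafkaConsumer.commitSync(...)`. The sketch also assumes the flush call only returns once the memstore contents are on disk, and that replaying a record after a crash is harmless (same row key, same value).

```java
import java.util.ArrayList;
import java.util.List;

public class FlushThenCommit {
    /** Hypothetical stand-in for the HBase table; put() only reaches the memstore. */
    interface Store {
        void put(String rowKey, String value); // assumed: Put with Durability.SKIP_WAL
        void flush();                          // assumed: Admin.flush(tableName)
    }

    /** Hypothetical stand-in for the Kafka consumer's offset commit. */
    interface OffsetCommitter {
        void commit(long nextOffset);          // assumed: KafkaConsumer.commitSync(...)
    }

    private final Store store;
    private final OffsetCommitter committer;
    private final int flushEvery;
    private int unflushed = 0;
    private long lastWrittenOffset = -1;

    FlushThenCommit(Store store, OffsetCommitter committer, int flushEvery) {
        this.store = store;
        this.committer = committer;
        this.flushEvery = flushEvery;
    }

    /** Write one Kafka record; checkpoint once enough writes have piled up. */
    void onRecord(long offset, String rowKey, String value) {
        store.put(rowKey, value);
        lastWrittenOffset = offset;
        unflushed++;
        if (unflushed >= flushEvery) {
            checkpoint();
        }
    }

    /** Flush first, commit second: a committed offset never covers unflushed data. */
    void checkpoint() {
        if (unflushed == 0) {
            return;
        }
        store.flush();
        // Kafka convention: the committed offset is the next offset to read.
        committer.commit(lastWrittenOffset + 1);
        unflushed = 0;
    }

    public static void main(String[] args) {
        List<String> events = new ArrayList<>();
        Store store = new Store() {
            @Override public void put(String rowKey, String value) { /* memstore only */ }
            @Override public void flush() { events.add("flush"); }
        };
        FlushThenCommit writer =
            new FlushThenCommit(store, next -> events.add("commit:" + next), 3);
        for (long offset = 0; offset < 7; offset++) {
            writer.onRecord(offset, "row" + offset, "v");
        }
        System.out.println(events); // prints [flush, commit:3, flush, commit:6]
    }
}
```

With this ordering, a crash between checkpoints loses only writes whose offsets were never committed, so restarting the consumer from the last committed offset replays them. In the run above, offset 6 is written but not yet committed, so it would be read again after a restart.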