HBase (Easy): How to Perform Range Prefix Scan in hbase shell
4

34

I am designing an app to run on HBase and want to interactively explore the contents of my cluster. I am in the hbase shell and I want to scan all keys starting with the characters "abc". Such keys might include "abc4", "abc92", "abc20014", etc. I tried this scan:

hbase(main):003:0> scan 'mytable', {STARTROW => 'abc', ENDROW => 'abc'}

But this does not seem to return anything, since there is technically no rowkey "abc", only rowkeys starting with "abc".

What I want is something like:

hbase(main):003:0> scan 'mytable', {STARTSROWPREFIX => 'abc', ENDROWPREFIX => 'abc'}

I hear HBase can do this quickly and that it is one of its main selling points. How do I do this in the hbase shell?

Muff answered 9/7, 2013 at 21:26 Comment(0)
56

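So it turns out to be very easy. The scan range is half-open: the start row is inclusive and the end row is exclusive, so the logic is start <= key < end. Incrementing the last character of the prefix ('c' becomes 'd') gives an end row that sorts just after every key starting with "abc". So the answer is: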

scan 'mytable', {STARTROW => 'abc', ENDROW => 'abd'}
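
To make the start <= key < end rule concrete, here is a small toy model in plain Ruby. It is not HBase code: the row keys and the range_scan helper are made up for illustration, and it only mimics the half-open range over lexicographically sorted keys.

```ruby
# Toy model of a half-open row range: start <= key < stop.
# The keys are sorted bytewise, which is how HBase orders row keys.
rows = %w[abb9 abc4 abc20014 abc92 abd abd1 xyz].sort

# Hypothetical helper: keeps every key inside [start_row, stop_row).
def range_scan(rows, start_row, stop_row)
  rows.select { |key| key >= start_row && key < stop_row }
end

# An identical start and stop row leaves an empty range in this model.
p range_scan(rows, "abc", "abc")   # prints []

# 'abd' sorts just after every key that starts with 'abc'.
p range_scan(rows, "abc", "abd")   # prints ["abc20014", "abc4", "abc92"]
```

Note that the row "abd" itself is excluded, because the end row is exclusive.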
Muff answered 9/7, 2013 at 21:46 Comment(4)
That's right - looks like you found this out the hard way. :) Do you want to mark this as the right answer? - Maritime
However, the HBase doc should say that STARTROW is actually a start-row prefix. - Whole
If your rows only use 'ASCII' values then it is as simple as you describe here. If you really use binary rowkeys then it becomes a lot more difficult. Check issues.apache.org/jira/browse/HBASE-11990 to see the discussion and edge cases that trying to create a generic solution brought to light. - Filling
Does this {STARTROW => 'abc', ENDROW => 'abd'} have a Java API equivalent? I've only managed to find PrefixFilter, and this range-like approach would suit me better. - Mullet
44

In recent versions of HBase you can now do this in the hbase shell:

scan 'mytable', {ROWPREFIXFILTER => 'abc'}

This is effectively equivalent to the following, and it also works for binary row keys:

scan 'mytable', {STARTROW => 'abc', ENDROW => 'abd'}

This method is a LOT more efficient than the "PrefixFilter" approach, because the latter puts all records through the comparison code that is present in the PrefixFilter class.
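
For the curious, the 'abc' to 'abd' step can be sketched in plain Ruby in a way that also copes with binary prefixes. This is only my sketch of the idea discussed in HBASE-11990, not the actual HBase implementation, and stop_row_for_prefix is a name I made up. The assumption is that trailing 0xFF bytes cannot be incremented, so they are dropped first, and a prefix with nothing left to increment means "scan to the end of the table".

```ruby
# Sketch: derive an exclusive stop row from a row key prefix.
# Returns nil when no finite stop row exists (empty prefix, or only 0xFF bytes),
# which a caller would treat as "no stop row".
def stop_row_for_prefix(prefix)
  bytes = prefix.bytes
  # Trailing 0xFF bytes would overflow if incremented, so strip them.
  bytes.pop while !bytes.empty? && bytes.last == 0xFF
  return nil if bytes.empty?
  # Increment the last remaining byte to get the first key past the prefix.
  bytes[-1] += 1
  bytes.pack("C*")
end

p stop_row_for_prefix("abc")                  # prints "abd"
p stop_row_for_prefix("ab\xFF".b).bytes       # prints [97, 99]
p stop_row_for_prefix("\xFF\xFF".b)           # prints nil
```

With plain ASCII prefixes this reduces to the manual STARTROW / ENDROW pair shown above.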

Filling answered 28/7, 2016 at 9:19 Comment(4)
I'm having trouble understanding the purpose of the PrefixFilter, when startrow and stoprow appear to be superior. Do you know of any use cases? I've also heard that people combine all three. - Instalment
I never use the PrefixFilter at all anymore. Perhaps there is a good reason to use it when doing something in a coprocessor; otherwise I would even vote to remove the class from HBase altogether. - Filling
Unfortunately I've been using it this whole time because I mistakenly assumed that you needed to have an exact match on the start and end rows. I ran a test on 5 million rows divided between 26 different rowkey prefixes, and the prefix filter is about 300% slower for me on average. Now I'm spending my Saturday refactoring all of my jobs :) - Instalment
Not sure if you would know the answer to this, but I figured I would send it your way: #40198383 - Instalment
26

The accepted solution won't work in all cases (binary keys). In addition, using a PrefixFilter on its own can be slow because it performs a table scan until it reaches the prefix. A more performant solution is to use a STARTROW and a FILTER, like so:

 scan 'my_table', {STARTROW => 'abc', FILTER => "PrefixFilter('abc')"}
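
A rough way to see why the STARTROW matters is to count how many rows a prefix check has to look at. The following is a toy model in plain Ruby, not HBase code: filtered_scan and the generated row keys are hypothetical, and it assumes the filter can stop as soon as the sorted keys move past the prefix.

```ruby
# Toy model: count the rows a prefix check inspects, with and without a start row.
# Returns [matching_keys, number_of_rows_examined].
def filtered_scan(sorted_rows, prefix, start_row = nil)
  examined = 0
  matches = []
  sorted_rows.each do |key|
    # A real scan would seek straight to the start row instead of iterating.
    next if start_row && key < start_row
    examined += 1
    if key.start_with?(prefix)
      matches << key
    elsif key > prefix
      break # keys are sorted, so nothing later can match the prefix
    end
  end
  [matches, examined]
end

# 1000 made-up rows that sort before the prefix, then the rows of interest.
rows = ((0...1000).map { |i| format("aaa%04d", i) } + %w[abc4 abc92 abc20014 abd1]).sort

filter_only = filtered_scan(rows, "abc")
with_start  = filtered_scan(rows, "abc", "abc")

puts "filter only:       #{filter_only[1]} rows examined"
puts "start row + filter: #{with_start[1]} rows examined"
```

Both calls return the same three matching keys; in this model the only difference is how many rows are examined before the first match.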
Krutz answered 30/10, 2015 at 16:0 Comment(2)
I'm having trouble understanding the purpose of the PrefixFilter, when startrow and stoprow appear to be superior. Do you know of any use cases? I've also heard that people combine all three. - Instalment
This is the solution that worked for me. My key is composed of AAA_B_CCC. I needed all the rows where the key started with AAA_. - Cosmetic
1

I think what you need is a filter.

Check out the answer to the following question: Scan with filter using HBase shell.

More filters are listed at http://hbase.apache.org/book/client.filter.html

Cloak answered 9/7, 2013 at 21:37 Comment(2)
I am under the impression filters are much slower than range scans. #10943138. Is there a way to do this with a range scan? - Muff
@DavidWilliams: Yes, range queries are faster. - Eventual