HBase Scan with Multiple Ranges
I have an HBase table and need to fetch results from several row key ranges. For example, I may need to get data from different ranges such as rows 1-6, 100-150, and so on. I know that for each scan I can define a start row and a stop row, but with 6 ranges I would have to run 6 scans. Is there any way to get the results for multiple ranges in a single scan, or in one RPC? My HBase version is 0.98.

Tanta answered 29/10, 2015 at 20:31 Comment(6)
HBase 2 has MultiRowRangeFilter, which allows you to set multiple ranges. If the ranges are small, multiple Scan queries can also be fast. – Defalcate
Will MultiRowRangeFilter send just one RPC for multiple ranges? – Tanta
If you cannot use MultiRowRangeFilter, then multiple scans are your best choice, especially if the number of keys between the ranges is big. – Targe
I changed to HBase 2 and used MultiRowRangeFilter in the end. – Tanta
Rahul and Kostya, can either of you post MultiRowRangeFilter as an answer? I already took your advice and used it to solve my problem. BTW, could you upvote my question if you don't mind? I need some reputation to get the privilege to comment on others' questions. – Tanta
Cheng, good question! +1. Since no one gave an answer, I thought of giving a detailed answer with an example. Please go through it. – Huai

MultiRowRangeFilter is a filter that supports scanning multiple row key ranges. It constructs the row key ranges from the passed list, and the ranges can be accessed by each region server.

HBase is quite efficient when scanning only one small row key range. If a user needs to specify multiple row key ranges in one scan, the typical solutions are:

  1. a FilterList containing a list of row key Filters, or
  2. a SQL layer over HBase (such as Hive or Phoenix) that joins two tables.

However, both solutions are inefficient: neither can use the range information to fast-forward during the scan, so a lot of time is wasted scanning the gaps between ranges. If the number of ranges is very big (e.g. millions), a join is the proper solution, though it is slow. But when a user wants to specify only a small number of ranges (e.g. fewer than 1000), neither solution performs satisfactorily.

MultiRowRangeFilter supports exactly this use case (scanning multiple row key ranges): it constructs the row key ranges from the user-specified list and fast-forwards during the scan, so the scan is quite efficient.
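The fast-forwarding idea can be illustrated with a small self-contained sketch (plain Java, not the HBase implementation; the `Range` class and `seekHint` method here are hypothetical helpers for illustration): with ranges sorted by start key, the filter either includes the current row, asks the scanner to seek straight to the start of the next range, or ends the scan. HBase realizes the "seek" outcome via the filter return code `SEEK_NEXT_USING_HINT`, which is what lets it skip the gaps instead of reading every row.

```java
import java.util.Arrays;
import java.util.List;

public class FastForwardSketch {
    /** Hypothetical half-open row key range [start, stop); not the HBase API. */
    static class Range {
        final String start, stop;
        Range(String start, String stop) { this.start = start; this.stop = stop; }
    }

    /**
     * Decide what to do with the current row key, given ranges sorted by start
     * key: "INCLUDE" if the row falls inside a range, "SEEK:<key>" to jump
     * forward to the next range's start, or "DONE" once past the last range.
     */
    static String seekHint(List<Range> ranges, String row) {
        for (Range r : ranges) {
            if (row.compareTo(r.stop) < 0) {       // row is before this range ends
                if (row.compareTo(r.start) >= 0) {
                    return "INCLUDE";              // inside the range: emit the row
                }
                return "SEEK:" + r.start;          // in a gap: fast-forward
            }
        }
        return "DONE";                             // past the last range: stop scanning
    }

    public static void main(String[] args) {
        List<Range> ranges = Arrays.asList(
                new Range("001", "002"),
                new Range("003", "004"),
                new Range("005", "006"));
        System.out.println(seekHint(ranges, "0015")); // inside [001, 002)
        System.out.println(seekHint(ranges, "002"));  // in the gap before "003"
        System.out.println(seekHint(ranges, "007"));  // past all ranges
    }
}
```

The FilterList alternative from the list above can only answer include/exclude per row, which is why it has to touch every row in the gaps; the seek hint is the whole performance difference.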

package chengchen;

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.MultiRowRangeFilter;
import org.apache.hadoop.hbase.filter.MultiRowRangeFilter.RowRange;
import org.apache.hadoop.hbase.util.Bytes;

public class MultiRowRangeFilterTest {
    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            throw new Exception("Table name not specified.");
        }
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, args[0]);

        // Build the row key ranges. In the released API the inner class is
        // MultiRowRangeFilter.RowRange, taking start/stop keys with
        // inclusive flags; each range here is [start, stop).
        List<RowRange> ranges = new ArrayList<RowRange>();
        ranges.add(new RowRange(Bytes.toBytes("001"), true, Bytes.toBytes("002"), false));
        ranges.add(new RowRange(Bytes.toBytes("003"), true, Bytes.toBytes("004"), false));
        ranges.add(new RowRange(Bytes.toBytes("005"), true, Bytes.toBytes("006"), false));

        // A single scan covers all three ranges.
        Scan scan = new Scan();
        Filter filter = new MultiRowRangeFilter(ranges);
        scan.setFilter(filter);

        int count = 0;
        ResultScanner scanner = table.getScanner(scan);
        for (Result r = scanner.next(); r != null; r = scanner.next()) {
            count++;
        }
        System.out.println("++ Scanning finished with count : " + count + " ++");
        scanner.close();
        table.close();
    }
}

Please see the test case above for how to implement this in Java.

Note: for this kind of requirement, Solr or ES is the best way, in my opinion. You can check my answer with Solr for a high-level architecture overview. I'm suggesting that because an HBase scan over huge data will be very slow.

Huai answered 2/2, 2017 at 6:33 Comment(3)
Hi Ram, what do you mean by your final statement? Can you please clarify, as it is not clear. Do you mean that Solr or ES would be a better solution for this issue? If so, can you please add a high-level architectural view of how that would work? – Robinia
Yes, Solr works well alongside HBase in my experience, for querying data from HBase and also for publishing to UI dashboards. Since the question is not related to Solr, I thought of not adding it. – Huai
@eboni: I updated my answer. Since Solr is a different context, I can't elaborate on it here. I added a link to my answer above. – Huai

© 2022 - 2024 — McMap. All rights reserved.