Performance difference between Scan and Get?

Asked 27/1, 2013 at 8:10 Answered 15/3, 2023 at 4:10

I have an HBase table containing 8 GB of data.

When I use a partial key scan on that table to retrieve a value for a given key I get almost constant time value retrieval.

When I use a Get, the time taken is far greater than with the scan. However, when I looked at the source code, I found that Get itself uses a Scan.

Can anyone explain this time difference?

Conduit answered 27/1, 2013 at 8:10 Comment(2)

They should at least be equally fast in my opinion. Can you post both your scan and get query? – Strophic 28/1, 2013 at 15:7

I think you should post your code, and table design, so we can discuss it. – Beaner 4/4, 2014 at 15:57

Correct, when you issue a Get, there is a scan happening behind the scenes. Cloudera's blog post confirms this: "Each time a get or a scan is issued, HBase scan (sic) through each file to find the result."

I can't confirm your results, but I think the clue may lie in your "partial key scan". When you compare a partial key scan and a get, remember that the row key you use for Get can be a much longer string than the partial key you use for the scan.

In that case, for the Get, HBase has to do an exact lookup to locate the full row key it needs to match, and then fetch it. With the partial key, HBase does not need to look up an exact key match; it only needs to find the approximate location of that key prefix.
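To make the exact-match versus prefix distinction concrete, here is a toy sketch. It is not HBase code: a sorted in-memory `TreeMap` stands in for HBase's sorted row keys, and the class and method names (`SortedRowsModel`, `get`, `prefixScan`) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model only: a sorted map stands in for HBase's sorted row keys.
// None of these names are HBase APIs.
class SortedRowsModel {
    private final NavigableMap<String, String> rows = new TreeMap<>();

    void put(String rowKey, String value) {
        rows.put(rowKey, value);
    }

    // Exact lookup: the full row key must match, otherwise null.
    String get(String rowKey) {
        return rows.get(rowKey);
    }

    // Prefix scan: seek to the first key >= prefix, then read forward
    // for as long as keys still start with the prefix.
    List<String> prefixScan(String prefix) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : rows.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // left the prefix range, stop reading
            }
            out.add(e.getValue());
        }
        return out;
    }
}
```

In this model both operations start with a seek into sorted key space. The prefix scan can return several rows, so when timing the two it is worth checking that both actually return the same single row.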

So the answer is: it depends. I think it will depend on:

Your row key "schema" or composition
The length of the Get key and the Scan prefix
How many regions you have

and possibly other factors.

Vasos answered 28/1, 2013 at 23:1 Comment(2)

The blog suggests this, but there has to be a difference, since a Get can make use of a Bloom filter but Scans cannot, right? – Wilma 15/3, 2023 at 3:53

Yes, apparently it is possible to speed up scans using a bloom filter, but I don't know enough about bloom filters to answer this question, hopefully someone else knows. – Vasos 29/3, 2023 at 1:15

On the backend, in HRegion, both Scan and Get amount to nearly the same thing. They both end up being executed by HRegion.RegionScannerImpl. Note below that the get() within that class instantiates a RegionScanner, much as invoking a Scan does.

org.apache.hadoop.hbase.regionserver.HRegion.RegionScannerImpl

public List<Cell> get(Get get, boolean withCoprocessor)
throws IOException {

List<Cell> results = new ArrayList<Cell>();

// pre-get CP hook
if (withCoprocessor && (coprocessorHost != null)) {
   if (coprocessorHost.preGet(get, results)) {
     return results;
   }
}

Scan scan = new Scan(get);

In the case of a get(), only a single row is returned, by invoking scanner.next() once:

RegionScanner scanner = null;
try {
  scanner = getScanner(scan);
  scanner.next(results);
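The shape of that code path can be sketched with a toy model. This is not the HBase implementation: `GetViaScanModel` and its methods are hypothetical, and the assumption that a Get becomes a scan whose start and stop row are the same key is my reading of `new Scan(get)`.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model only: shows a "get" implemented on top of a "scan".
class GetViaScanModel {
    private final NavigableMap<String, String> rows = new TreeMap<>();

    void put(String rowKey, String value) {
        rows.put(rowKey, value);
    }

    // "Scan": iterate rows in key order within [startRow, stopRow], both inclusive.
    Iterator<Map.Entry<String, String>> scan(String startRow, String stopRow) {
        return rows.subMap(startRow, true, stopRow, true).entrySet().iterator();
    }

    // "Get": a scan bounded to a single row key, advanced at most once.
    String get(String rowKey) {
        Iterator<Map.Entry<String, String>> scanner = scan(rowKey, rowKey);
        return scanner.hasNext() ? scanner.next().getValue() : null;
    }
}
```

Seen this way, a Get should not be inherently slower than a Scan over the same row, which is why a large gap between the two usually points at how the requests are built rather than at the server side.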

Festinate answered 19/9, 2014 at 19:33 Comment(0)

The Cloudera documentation suggests that a Get is a Scan behind the scenes:

Get and Scan are the two ways to read data from HBase, aside from manually parsing HFiles. A Get is simply a Scan limited by the API to one row. A Scan fetches zero or more rows of a table

There is a subtle difference between them, though: Get calls can make use of Bloom filters and skip reading some StoreFiles, but Scans cannot use these Bloom filters. As quoted here:

In terms of HBase, Bloom filters provide a lightweight in-memory structure to reduce the number of disk reads for a given Get operation (Bloom filters do not work with Scans) to only the StoreFiles likely to contain the desired Row. The potential performance gain increases with the number of parallel reads.

Hence it depends purely on the use case, but the performance of a Scan should be no better than that of a Get.
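A rough sketch of why a row Bloom filter helps an exact-key lookup but not a prefix lookup. This is a toy model, not how HBase implements its filters: `BloomModel`, `StoreFile`, the 1024-bit size, and the two hash functions are all assumptions chosen to keep the example small.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy model only: each "store file" keeps a tiny Bloom filter over its full row keys.
class BloomModel {
    static final int BITS = 1024;

    static class StoreFile {
        final String name;
        final BitSet bloom = new BitSet(BITS);

        StoreFile(String name, String... rowKeys) {
            this.name = name;
            for (String k : rowKeys) {
                bloom.set(h1(k));
                bloom.set(h2(k));
            }
        }

        // false means "definitely absent"; true means "possibly present".
        boolean mightContain(String rowKey) {
            return bloom.get(h1(rowKey)) && bloom.get(h2(rowKey));
        }
    }

    // Two cheap hash functions taken from different bits of String.hashCode().
    static int h1(String s) {
        return Math.floorMod(s.hashCode(), BITS);
    }

    static int h2(String s) {
        return Math.floorMod(s.hashCode() >>> 16, BITS);
    }

    // An exact-key lookup only needs to open files whose filter says "possibly present".
    static List<String> filesToRead(List<StoreFile> files, String rowKey) {
        List<String> out = new ArrayList<>();
        for (StoreFile f : files) {
            if (f.mightContain(rowKey)) {
                out.add(f.name);
            }
        }
        return out;
    }
}
```

With a file `f1` holding `"row1"` and a file `f2` holding `"row2"` and `"row3"`, a lookup for `"row2"` only needs to open `f2`. Asking `f2` about the bare prefix `"row"` gets a "definitely absent" answer even though `f2` holds rows with that prefix, because the filter was built from full row keys. That is the sense in which a filter keyed on whole rows cannot answer a prefix scan.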

Wilma answered 15/3, 2023 at 4:10 Comment(0)
