Why is the Impala Scan Node very slow (RowBatchQueueGetWaitTime)?

This query returns in 10 seconds most of the time, but occasionally it needs 40 seconds or more.

There are two executor nodes in the cluster, and there is no remarkable difference between the profiles of the two nodes. The following is from one of them:

      HDFS_SCAN_NODE (id=0):(Total: 39s818ms, non-child: 39s818ms, % non-child: 100.00%)
     - AverageHdfsReadThreadConcurrency: 0.07 
     - AverageScannerThreadConcurrency: 1.47 
     - BytesRead: 563.73 MB (591111366)
     - BytesReadDataNodeCache: 0
     - BytesReadLocal: 0
     - BytesReadRemoteUnexpected: 0
     - BytesReadShortCircuit: 0
     - CachedFileHandlesHitCount: 0 (0)
     - CachedFileHandlesMissCount: 560 (560)
     - CollectionItemsRead: 0 (0)
     - DecompressionTime: 1s501ms
     - MaterializeTupleTime(*): 11s685ms
     - MaxCompressedTextFileLength: 0
     - NumColumns: 9 (9)
     - NumDictFilteredRowGroups: 0 (0)
     - NumDisksAccessed: 1 (1)
     - NumRowGroups: 56 (56)
     - NumScannerThreadMemUnavailable: 0 (0)
     - NumScannerThreadReservationsDenied: 0 (0)
     - NumScannerThreadsStarted: 4 (4)
     - NumScannersWithNoReads: 0 (0)
     - NumStatsFilteredRowGroups: 0 (0)
     - PeakMemoryUsage: 142.10 MB (149004861)
     - PeakScannerThreadConcurrency: 2 (2)
     - PerReadThreadRawHdfsThroughput: 151.39 MB/sec
     - RemoteScanRanges: 1.68K (1680)
     - RowBatchBytesEnqueued: 2.32 GB (2491334455)
     - RowBatchQueueGetWaitTime: 39s786ms
     - RowBatchQueuePeakMemoryUsage: 1.87 MB (1959936)
     - RowBatchQueuePutWaitTime: 0.000ns
     - RowBatchesEnqueued: 6.38K (6377)
     - RowsRead: 73.99M (73994828)
     - RowsReturned: 6.40M (6401849)
     - RowsReturnedRate: 161.27 K/sec
     - ScanRangesComplete: 56 (56)
     - ScannerThreadsInvoluntaryContextSwitches: 99 (99)
     - ScannerThreadsTotalWallClockTime: 1m10s
       - ScannerThreadsSysTime: 630.808ms
       - ScannerThreadsUserTime: 12s824ms
     - ScannerThreadsVoluntaryContextSwitches: 1.25K (1248)
     - TotalRawHdfsOpenFileTime(*): 9s396ms
     - TotalRawHdfsReadTime(*): 3s789ms
     - TotalReadThroughput: 11.70 MB/sec
    Buffer pool:
       - AllocTime: 1.240ms
       - CumulativeAllocationBytes: 706.32 MB (740630528)
       - CumulativeAllocations: 578 (578)
       - PeakReservation: 140.00 MB (146800640)
       - PeakUnpinnedBytes: 0
       - PeakUsedReservation: 33.83 MB (35471360)
       - ReadIoBytes: 0
       - ReadIoOps: 0 (0)
       - ReadIoWaitTime: 0.000ns
       - WriteIoBytes: 0
       - WriteIoOps: 0 (0)
       - WriteIoWaitTime: 0.000ns

We can see that RowBatchQueueGetWaitTime is very high, almost 40 seconds, but I cannot figure out why. Even granting that TotalRawHdfsOpenFileTime takes 9 seconds and TotalRawHdfsReadTime takes almost 4 seconds, I still cannot explain where the other 27 seconds are spent.

Can you suggest what the issue might be and how I can solve it?

Erlandson answered 14/8, 2020 at 2:34 Comment(0)

The threading model in the scan nodes is pretty complex because there are two layers of worker threads for scanning and I/O - I'll call them scanner threads and I/O threads. I'll go top-down and call out some potential bottlenecks and how to identify them.

High RowBatchQueueGetWaitTime indicates that the main thread consuming from the scan is spending a lot of time waiting for the scanner threads to produce rows. One major source of variance can be the number of scanner threads - if the system is under resource pressure, each query can get fewer threads. So keep an eye on AverageScannerThreadConcurrency to understand whether that varies between the fast and slow runs.
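
If it helps, here is a minimal sketch (my own, not an Impala tool) for pulling those counters out of a saved text profile (e.g. from impala-shell's PROFILE command or the query page in the web UI) so you can compare a fast run against a slow one; the file name is just a placeholder:

    # Counter names as they appear in the HDFS_SCAN_NODE section above.
    COUNTERS = ("AverageScannerThreadConcurrency",
                "NumScannerThreadsStarted",
                "RowBatchQueueGetWaitTime")

    # "profile_slow.txt" is a placeholder for a saved query profile.
    with open("profile_slow.txt") as f:
        for line in f:
            if any(name in line for name in COUNTERS):
                print(line.rstrip())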

The scanner threads would be spending their time doing a variety of things. The bulk of the time is generally spent:

  1. Not running because the operating system scheduled a different thread.
  2. Waiting for I/O threads to read data from the storage system
  3. Decoding data, evaluating predicates, other work

With #1 you would see a higher value for ScannerThreadsInvoluntaryContextSwitches and ScannerThreadsUserTime/ScannerThreadsSysTime much lower than ScannerThreadsTotalWallClockTime. If ScannerThreadsUserTime is much lower than MaterializeTupleTime, that would be another symptom.

With #3 you would see high ScannerThreadsUserTime and MaterializeTupleTime. It looks like a significant amount of CPU time is going to that here, but it is not the bulk of the time.

To identify #2, I would recommend looking at TotalStorageWaitTime in the fragment profile to understand how much time threads actually spent waiting for I/O. I also added ScannerIoWaitTime in more recent Impala releases, which is more convenient since it's in the scanner profile.
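
As a rough way to put numbers on #1 versus #2/#3 (my own back-of-the-envelope split using the counters quoted in your profile, not an official breakdown):

    # Values from the HDFS_SCAN_NODE profile in the question.
    wall_clock_s   = 70.0    # ScannerThreadsTotalWallClockTime (1m10s)
    user_time_s    = 12.824  # ScannerThreadsUserTime
    sys_time_s     = 0.631   # ScannerThreadsSysTime
    invol_switches = 99      # ScannerThreadsInvoluntaryContextSwitches

    cpu_s     = user_time_s + sys_time_s   # time actually spent on a CPU (#3)
    off_cpu_s = wall_clock_s - cpu_s       # time the threads existed but were off-CPU (#1 or #2)

    print(f"on-CPU  (#3 decode/predicates): {cpu_s:.1f}s ({cpu_s / wall_clock_s:.0%})")
    print(f"off-CPU (#1 descheduled or #2 waiting): {off_cpu_s:.1f}s ({off_cpu_s / wall_clock_s:.0%})")

    # Only 99 involuntary context switches over ~70s of wall clock makes #1 an
    # unlikely explanation for ~56s off-CPU, so checking TotalStorageWaitTime /
    # ScannerIoWaitTime (i.e. #2) is the next step.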

If the storage wait time is high, there are a few things to consider:

  • If TotalRawHdfsOpenFileTime is high, it could be that opening the files is a bottleneck. This can happen on any storage system, including HDFS. See Why Impala spend a lot of time Opening HDFS File (TotalRawHdfsOpenFileTime)?
  • If TotalRawHdfsReadTime is high, reading from the storage system may be slow (e.g. if the data is not in the OS buffer cache or it is a remote filesystem like S3)
  • Other queries may be contending for I/O resources and/or I/O threads

I suspect that in your case the root cause is both slowness opening files for this query and slowness opening files for other queries, causing scanner threads to be occupied. Most likely, enabling file handle caching will solve the problem - we've seen dramatic performance improvements on production deployments from doing that.
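
A rough sanity check with the numbers in your profile (my own arithmetic; the flag names in the comment are the ones I believe control the cache, so double-check them against your Impala release):

    # Values from the HDFS_SCAN_NODE profile in the question.
    open_time_s = 9.396    # TotalRawHdfsOpenFileTime
    read_time_s = 3.789    # TotalRawHdfsReadTime
    hits, misses = 0, 560  # CachedFileHandlesHitCount, CachedFileHandlesMissCount

    print(f"opening files took ~{open_time_s / read_time_s:.1f}x as long as reading them")
    print(f"file handle cache hit rate: {hits / (hits + misses):.0%}")  # 0% => every scan re-opens its files

    # A hit rate stuck at 0% suggests the file handle cache is disabled or too
    # small; on the releases I know it is tuned via impalad startup flags such as
    # --max_cached_file_handles (and --cache_remote_file_handles for non-local
    # reads, which matter here since BytesReadLocal is 0).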

Another possibility worth mentioning is that the built-in JVM is doing some garbage collection - this could block some of the HDFS operations. We have some pause detection that logs messages when there is a JVM pause. You can also look at the /memz debug page, which I think has some GC stats. Or connect up other Java debugging tools.
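
For the GC angle, a quick check (a sketch that assumes the impalad debug web UI is reachable on the default port 25000 and that your release exposes a Hadoop-style /jmx endpoint; the hostname is made up):

    from urllib.request import urlopen

    # "impalad-host.example.com" is a placeholder for one of your executor nodes.
    url = "http://impalad-host.example.com:25000/jmx"
    raw = urlopen(url).read().decode()

    # Scan the raw response rather than assuming an exact JSON layout, since the
    # page format can differ between releases; GC collection counts/times are
    # what we're after.
    for line in raw.splitlines():
        if any(key in line for key in ("GarbageCollector", "CollectionCount", "CollectionTime")):
            print(line.strip())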

Amaryllidaceous answered 18/8, 2020 at 19:52 Comment(1)
Thank you, this answer is very helpful. But I'm still wondering: how do you explain that ScannerThreadsVoluntaryContextSwitches is very high if you think it's not because of an I/O performance issue? – Erlandson

ScannerThreadsVoluntaryContextSwitches: 1.25K (1248) means that there were 1248 situations where scanner threads got "stuck" waiting for some external resource and were subsequently put to sleep. Most likely that resource was disk I/O. That would explain the quite low average reading speed (TotalReadThroughput: 11.70 MB/sec) despite the "normal" per-read throughput (PerReadThreadRawHdfsThroughput: 151.39 MB/sec).
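
For what it's worth, the two throughput counters are consistent with each other once you account for how rarely a read thread is actually active (my own arithmetic using the values from the profile):

    # Values from the HDFS_SCAN_NODE profile in the question.
    per_thread_mb_s  = 151.39  # PerReadThreadRawHdfsThroughput
    avg_read_threads = 0.07    # AverageHdfsReadThreadConcurrency
    total_mb_s       = 11.70   # TotalReadThroughput

    # aggregate throughput ~= per-thread throughput x average active read threads
    print(f"expected: {per_thread_mb_s * avg_read_threads:.1f} MB/s, reported: {total_mb_s} MB/s")
    # ~10.6 vs 11.7 MB/s: individual reads are fast, and the low aggregate number
    # comes from read threads being active only ~7% of the time.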

EDIT

To increase performance, you may want to try:

  • enabling short-circuit reads (in your profile BytesReadShortCircuit and BytesReadLocal are both 0, so all data is read over the network from the DataNodes)
  • enabling HDFS caching for the table being scanned

(Note that both are only applicable if you're running Impala against HDFS, not some sort of object store.)
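
A quick way to see whether the reads actually become local/short-circuit after such a change (counter names taken from the profile above; my own sketch):

    # Values from the posted profile: today every byte is read remotely.
    bytes_read = 591_111_366          # BytesRead
    fractions = {
        "local":          0,          # BytesReadLocal
        "short-circuit":  0,          # BytesReadShortCircuit
        "datanode-cache": 0,          # BytesReadDataNodeCache
    }

    for name, value in fractions.items():
        print(f"{name}: {value / bytes_read:.0%} of bytes read")
    # After the change, rerun the query and re-check these counters in the new
    # profile; they should move away from 0% on nodes holding local replicas.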

Lashaunda answered 17/8, 2020 at 22:1 Comment(4)
Thank you, that is very helpful! But I'm still wondering how to solve this issue without adding more physical resources. Since most of these slow scans happen on a specific table, I'm wondering whether duplicating that table (so that queries against it are spread out) would help. – Erlandson
I'd be surprised if HDFS caching helped and it's generally best to avoid manual cache management. I've only ever seen it be particularly useful when there was hot-spotting on a particular node and it was a way to increase the replication of a small table or partition. – Amaryllidaceous
I also think the I/O performance theory is likely wrong. We learned this the hard way in the past by spending significant time chasing I/O performance when the bottleneck was HDFS namenode RPCs. Back then we included TotalRawHdfsOpenFileTime in TotalRawHdfsReadTime so had trouble distinguishing the two in profiles. – Amaryllidaceous
@tim-armstrong very interesting, appreciate the insight! Thanks! It'd be really nice to check NN RPC call queue and timings to see if you're right :) – Lashaunda
