UI Tests Flaky on CI
Asked Answered
M

1

6

update:

VNC into the build machine (without making any changes) somehow fixes this....strange


Context

We use Mac EC2 instances CI cluster to run UI tests. The cluster is set up to recycle every night. Meaning new cluster is provisioned in the morning. We notice that the tests get increasing more and more flaky throughout the day.

enter image description here

  • Build 312 is the first build in the morning.
  • Red is failed tests, Orange is silenced tests, Yellow is flaky tests, Grey is unknown status
  • The commit are just white space auto commit that trigger off builds. The tests content are the same.
  • Note: Bluepill configure to run one Simulator per machine (no concurrency)

Log

The flaky test or failure are usually due to "waiting for some resource to idle" and reach timeout, or some network request to finish. The tests that failed or flaky also has a commonality which is they are usually related to video

Here are a couple sample logs

Waiting for app to idle

Timed out waiting for app to idle. 
The following idling resources are busy.
1. GREYAppStateTracker:
Waiting for network requests to finish. By default, EarlGrey tracks all network requests. To change this behavior, refer to GREYConfiguration.
<__NSCFLocalDataTask:0x7fcfcfd87450, URL:"https://v.myorgimg.com/videos/mc/hls/e7/01/ca/e701caebac12684056eb699f316fc411_360w.m3u8"> => Waiting for network requests to finish. By default, EarlGrey tracks all network requests. To change this behavior, refer to GREYConfiguration.
(
    0   MyorgDevelopmentEG2             0x000000010ebf3205 -[GREYAppStateTrackerObject setState:] + 69
    1   MyorgDevelopmentEG2             0x000000010ebf22b3 __133-[GREYAppStateTracker grey_changeState:usingOperation:forObject:orInternalObjectDeallocationTracker:orExternalAppStateTrackerObject:]_block_invoke + 931
    2   MyorgDevelopmentEG2             0x000000010ebf18f0 -[GREYAppStateTracker grey_performBlockInCriticalSection:] + 224
    3   MyorgDevelopmentEG2             0x000000010ebf1e34 -[GREYAppStateTracker grey_changeState:usingOperation:forObject:orInternalObjectDeallocationTracker:orExternalAppStateTrackerObject:] + 1076
    4   MyorgDevelopmentEG2             0x000000010ebf0ac6 -[GREYAppStateTracker trackState:forObject:] + 102
    5   MyorgDevelopmentEG2             0x000000010ebca867 -[__NSCFLocalDataTask_GREYApp grey_track] + 135
    6   MyorgDevelopmentEG2             0x000000010ebcac55 -[__NSCFLocalDataTask_GREYApp greyswizzled_resume] + 181
    7   MediaToolbox                        0x00007fff2b617285 figHttpRequestSetupNSURLSessionTask + 1380
    8   MediaToolbox                        0x00007fff2b616bb6 _FigHTTPRequestCreateWithNSURLSession + 2923
    9   MediaToolbox                        0x00007fff2b64aaaf figHTTPRequestSessionNSCreateHTTPRequest + 199
    10  MediaToolbox                        0x00007fff2b71f5b1 segPumpCreateHTTPRequest + 944
    11  MediaToolbox                        0x00007fff2b721de6 segPumpSendIndexFileRequest + 1703
    12  MediaToolbox                        0x00007fff2b7431d5 segPumpRequestIndexForStream + 753
    13  MediaToolbox                        0x00007fff2b7421ee segPumpSetAlternateForStream + 3145
    14  MediaToolbox                        0x00007fff2b73e9e5 segPumpSetCurrentAlternate + 9898
    15  MediaToolbox                        0x00007fff2b791d6a ProduceStreamingAssetProperty + 661
    16  MediaToolbox                        0x00007fff2b78e07c URLAssetPropertyWorkFunction + 872
    17  libdispatch.dylib                   0x00007fff201078df _dispatch_client_callout + 8
    18  libdispatch.dylib                   0x00007fff2010de15 _dispatch_lane_serial_drain + 715
    19  libdispatch.dylib                   0x00007fff2010e9c3 _dispatch_lane_invoke + 455
    20  libdispatch.dylib                   0x00007fff20117a8e _dispatch_root_queue_drain + 350
    21  libdispatch.dylib                   0x00007fff2011786c _dispatch_worker_thread + 222
    22  libsystem_pthread.dylib             0x00007fff60342950 _pthread_start + 224
    23  libsystem_pthread.dylib             0x00007fff6033e47b thread_start + 15
)

and waiting for network request to finish

9. GREYAppStateTracker:
Waiting for network requests to finish. By default, EarlGrey tracks all network requests. To change this behavior, refer to GREYConfiguration.
Waiting for UIView's draw/layout pass to complete. A draw/layout pass normally completes in the next runloop drain.
<ASCollectionView: 0x7feca21ad000> => Waiting for UIView's draw/layout pass to complete. A draw/layout pass normally completes in the next runloop drain.
(
    0   MyorgDevelopmentEG2             0x000000010bb2e335 -[GREYAppStateTrackerObject setState:] + 69
    1   MyorgDevelopmentEG2             0x000000010bb2d3e3 __133-[GREYAppStateTracker grey_changeState:usingOperation:forObject:orInternalObjectDeallocationTracker:orExternalAppStateTrackerObject:]_block_invoke + 931
    2   MyorgDevelopmentEG2             0x000000010bb2ca20 -[GREYAppStateTracker grey_performBlockInCriticalSection:] + 224
    3   MyorgDevelopmentEG2             0x000000010bb2cf64 -[GREYAppStateTracker grey_changeState:usingOperation:forObject:orInternalObjectDeallocationTracker:orExternalAppStateTrackerObject:] + 1076
    4   MyorgDevelopmentEG2             0x000000010bb2bbf6 -[GREYAppStateTracker trackState:forObject:] + 102
    5   MyorgDevelopmentEG2             0x000000010bb02b0d -[UIView(GREYApp) greyswizzled_setNeedsLayout] + 93
    6   UIKitCore                           0x00007fff24ba5675 -[UIScrollView setNeedsLayout] + 79
    7   UIKitCore                           0x00007fff23dc0390 -[UICollectionView _updateAnimationDidStop:finished:context:] + 2236
    8   UIKitCore                           0x00007fff23dbf9e4 __102-[UICollectionView _updateWithItems:tentativelyForReordering:propertyAnimator:collectionViewAnimator:]_block_invoke.2110 + 77
    9   MyorgDevelopmentEG2             0x000000010bb2fdef __60-[GREYDispatchQueueTracker grey_dispatchAsyncCallWithBlock:]_block_invoke + 47
    10  libdispatch.dylib                   0x00007fff2010670d _dispatch_call_block_and_release + 12
    11  libdispatch.dylib                   0x00007fff201078df _dispatch_client_callout + 8
    12  libdispatch.dylib                   0x00007fff20114a27 _dispatch_main_queue_callback_4CF + 1045
    13  CoreFoundation                      0x00007fff203908f8 __CFRUNLOOP_IS_SERVICING_THE_MAIN_DISPATCH_QUEUE__ + 9
    14  CoreFoundation                      0x00007fff2038b169 __CFRunLoopRun + 2781
    15  CoreFoundation                      0x00007fff2038a1a7 CFRunLoopRunSpecific + 567
    16  GraphicsServices                    0x00007fff2b874d85 GSEventRunModal + 139
    17  UIKitCore                           0x00007fff246c14df -[UIApplication _run] + 912
    18  UIKitCore                           0x00007fff246c639c UIApplicationMain + 101
    19  MyorgDevelopmentEG2             0x0000000108f175de main + 190
    20  libdyld.dylib                       0x00007fff2025abbd start + 1
    21  ???                                 0x0000000000000005 0x0 + 5
)
<_ASDisplayLayer: 0x600007996580> => Waiting for UIView's draw/layout pass to complete. A draw/layout pass normally completes in the next runloop drain.

Hypothesis

This leads me to think there are some orphaned subprocess or daemon that still holds onto to resources. Or there are artifacts not being clean up properly.

Things I have tried

  • shutdown, erase, delete all simulator between builds (No Help)
    xcrun simctl shutdown all 
    xcrun simctl erase all
    xcrun simctl delete all
    osascript -e 'tell application "iOS Simulator" to quit'
    osascript -e 'tell application "Simulator" to quit'
  • Kill CoreSimulator related services between builds (No Help)
    launchctl remove com.apple.CoreSimulator.CoreSimulatorService || true
    launchctl remove com.apple.CoreSimulator.SimLaunchHost-x86 || true
    launchctl remove com.apple.CoreSimulator.SimulatorTrampoline || true
    launchctl remove com.apple.CoreSimulator.SimLaunchHost-arm64 || true
  • delete simulator devices, logs, and cache between builds. This actually helped a bit. The flakiness went away initially but came back quickly after a few builds while the machines still performing the cleaning between builds....unsure why.
    echo "Delete all simulator devices, cache, and temp"
    rm -rf ~/Library/Developer/CoreSimulator

    echo "Delete Simulator Logs"
    rm -rf ~/Library/Logs/CoreSimulator

Anyone else running into the same issue? Appreciate if someone can shed any light on this or give suggestions to try.

Note: We uses Bluepill and EarlGrey2 to run UI tests (iOS 14.5, Xcode 12.5) on Buildkite CI pipeline.

Mendelian answered 19/7, 2021 at 5:17 Comment(12)
Could you label which log is from which build?Elisabethelisabethville
@I’L’I sure thing! Just updated.Mendelian
Have you tried removing the DerivedData directory between builds?Elisabethelisabethville
I don’t think I have tried that yet. You got any tips?Mendelian
All you would need to do to test it is to rm -rf /app/build/path/to/DerivedData before your next build to see if your results differ. DerivedData is basically a cache of sorts which doesn't get cleared unless you clean that directory before each build.Elisabethelisabethville
Hmmm clearing DerjvedData doesn’t seem to help :/Mendelian
If your tests are not hermetic, they might be failing due to external factors. Have you tried delaying tests to start later in the day to see if the network resources (network routing, or the host itself) aren't the cause?Tumbleweed
@Eric Shieh The test suites run hourly so we should have a good sample of tests performance through out the days. I wondered about failure due to external factors as well, but then we should see failure more sporadic/random instead of tests always being green when a new build machine fleet gets provisioned.Mendelian
If the network resource is public facing then network contention would not necessarily be random. Also, are other tests running in parallel (on other instances) that would also increase load on the network resource?Tumbleweed
@EricShieh I am guessing the network resource is not public facing. We don't have any tests running in parellel (we run 1 test without concurrency per machine). I don't know a lot about network. Is there a way to find residual network call that might hung from previous builds/tests?Mendelian
The EarlGrey error seems to indicate that network calls are hanging. Termination of the app process should also kill any network connections. You can verify using netstat -a | grep <hostname> . You mentioned an EC2 cluster, but you're saying only one test is running at any given time within the cluster? You mentioned blacklisting the URL helps in the stability. Is it ALWAYS the same host/server/services that it's hanging on?Tumbleweed
@EricShieh sorry let me clarify. One test is running at any given time on a single build machine not within the build cluster. The request seem to be hanging on different resources. Here are some sample logs gist.github.com/OliverKoo/9d193bea732fcaaace363326341b9ec3Mendelian
T
1

Have you tried running:

sudo systemsetup -setcomputersleep never

Titled answered 7/8, 2021 at 2:0 Comment(1)
this fixes it! I baked this into the AMI and the flakiness drop significantly! Also running the caffeinate command pre-build works too.Mendelian

© 2022 - 2024 — McMap. All rights reserved.