Algorithm for estimating remaining time for a time-intensive loop with heterogeneous iterations
I have a loop of instructions, such as (pseudocode):

for i = 1 to 1000000
    // Process the ith input
    doSomething(input[i])
end

This takes a long time to complete. I would like to output some kind of progress and more importantly remaining time estimate to the user, so that they can decide whether they should just sit there twiddling their thumbs, go get some coffee, go for a walk, or go on a weeklong vacation to Europe while the algorithm crunches its numbers.

To simplify matters, you can assume the number of iterations will be large (for instance, larger than 100), so progress can be printed at every percent.

A common algorithm is to simply measure the time the last iteration took, then multiply that by the number of iterations left and give that as output. This breaks down if each iteration can vary a lot in how long it takes to execute.

Another approach is to divide the time elapsed since the first iteration by the number of iterations completed, then multiply that by the remaining iterations. This breaks down if the durations of iterations are not evenly distributed. For example, if the first few inputs are "difficult" and they get easier towards the end of the input array, the algorithm will overestimate remaining time until it is almost finished (at which point it will overestimate only very slightly).

So how can one get a better estimate of remaining time when the time each iteration takes is an arbitrary, non-straightforward function of the iteration index (such that analytically deriving and implementing each iteration's time-to-complete is impractical)?

Two ideas I can imagine might be fruitful avenues of research, but am unable to fully explore myself at this time:

  • Exponential average of the time to complete each past iteration multiplied by remaining iterations.
  • Tracking time used to complete each iteration, then fitting a function and extrapolating.
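The first idea can be sketched in a few lines. Here is a hedged Python sketch of an exponential moving average of per-iteration durations (the function names and the default alpha are my own assumptions, not from the question):

```python
def make_ema_estimator(alpha=0.2):
    """Exponential moving average of per-iteration durations.
    alpha is the weight of the newest observation (an assumed default)."""
    state = {"ema": None}

    def update(last_duration, iterations_left):
        if state["ema"] is None:
            state["ema"] = last_duration
        else:
            state["ema"] = alpha * last_duration + (1 - alpha) * state["ema"]
        return state["ema"] * iterations_left  # estimated remaining time

    return update
```

With alpha near 1 this behaves like the "time the last iteration took" approach; with alpha near 0 it approaches the overall average, so the parameter interpolates between the two common algorithms described above.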

Why computationally intensive solutions (like fitting equations) are okay:

Firstly, for truly large tasks where this discussion is worthwhile, run time may be measured in hours or days. Complicated mathematical operations these days take milliseconds, so the added burden would not be great - in my above example, clearly doSomething takes so long as to dwarf the cost of doing some math, otherwise I would not care so much about precisely estimating remaining time in the first place.

Second, it is possible to, for instance, bin iterations into percentiles. Then, instead of operating on a dataset of "iterations complete vs. time taken", the estimator would operate on a dataset of "percent complete vs. time taken", which has at most 100 data points. This affords further complexity: say your task takes a day or more to complete. Estimating remaining time only once each percent is complete means 100 evaluations of the estimator function. When the task already takes a day, an extra minute and a half spent estimating remaining time isn't a big deal, and that already gives you a 1-second window per evaluation for fitting equations and whatnot. 1 second is a lot of time for doing math on a modern system. As such, I welcome computationally intensive solutions.
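The binning step itself is cheap. A minimal Python sketch (the function name is mine) that collapses per-iteration durations into at most 100 (percent-complete, time-in-that-percent) points:

```python
def percent_bins(durations, total_iterations):
    """Collapse per-iteration durations into at most 100
    (percent, total time spent in that percent) data points."""
    bins = {}
    for i, d in enumerate(durations):
        pct = (i * 100) // total_iterations  # which percent this iteration falls in
        bins[pct] = bins.get(pct, 0.0) + d
    return sorted(bins.items())
```

The estimator then fits whatever model it likes to these ≤100 points instead of the full million-iteration history.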

tl;dr: How to over-engineer an accurate remaining time estimator function for very lengthy tasks.

Raincoat answered 20/8, 2012 at 0:44

In addition to Penguino's algorithm: instead of fitting n and f(n) you might want to fit log(n) and log(f(n)). As long as your complexity is polynomial this will work.
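Concretely, fitting a straight line to (log n, log t) recovers t ≈ a·n^b, where the slope b is the polynomial degree. A self-contained Python sketch (the function name is mine):

```python
import math

def fit_power_law(ns, ts):
    """Least-squares line through (log n, log t), i.e. fit t ≈ a * n^b.
    Returns (a, b); b estimates the polynomial degree."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(t) for t in ts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = math.exp(mean_y - b * mean_x)
    return a, b
```

Once (a, b) are known, the remaining time is the sum (or integral) of a·n^b over the remaining indices.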

Carter answered 20/8, 2012 at 5:33

If you want to get a consistently good prediction, then the second method (fit and extrapolate) is likely to do best - but only on the assumption that the fitting function is a reasonable match for the true dependence of processing time on index. For example, if f(n) is an O(n^2) algorithm, then

for i = 1 to N
  f(i)

will take approximately k*N^3 time in total. So fitting a cubic to total time should provide a pretty good approximation, but fitting a quadratic or exponential might be worse than the simple percent-completed approximation. Likewise, if f is O(2^n), then any polynomial fit will massively underestimate the remaining time. This all assumes that N is large enough that the true O(n^2) behaviour dominates.

So while a well chosen fitting function should be able to accurately predict remaining time, a generic predictive function is unlikely to be useful.
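To illustrate the fit-and-extrapolate idea for the O(n^2) example above, here is a Python sketch (names are mine) that assumes cumulative time T(i) ≈ k·i^3 and fits the single constant k by least squares:

```python
def remaining_time_cubic(cumulative, total_iterations):
    """Assume per-iteration cost ~ i^2, so cumulative time T(i) ~ k * i^3.
    `cumulative` is the cumulative elapsed time after each completed iteration.
    Fits k by least squares (closed form for one parameter) and extrapolates."""
    n = len(cumulative)
    num = sum(t * (i + 1) ** 3 for i, t in enumerate(cumulative))
    den = sum((i + 1) ** 6 for i in range(n))
    k = num / den
    # Predicted total time minus time already spent.
    return k * total_iterations ** 3 - cumulative[-1]
```

The same single-parameter trick works for any assumed model T(i) = k·g(i); the hard part, as noted above, is choosing g correctly.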

Precipitant answered 20/8, 2012 at 4:20

You are looking for a rolling/moving average

tl;dr ⇒ You should use a rolling average in which each element's weight decays exponentially with that element's age.

The answer is a rolling average: a type of average whose value is constantly updated as new data comes in, giving more weight to recent data points.

Here's a simplified explanation of how you can do this in R:

library(magrittr)  # for the %>% pipe

# Pre-allocate a vector to store the time taken for each iteration
# (rep(NULL, n) would just return NULL, so use NA_real_)
mean_time <- rep(NA_real_, max_number_of_iterations)

# Exponentially weighted moving average of the iteration times.
# The most recent iterations get the largest weights. Notice the number 1.1 here.
moving_average <- function(x){
  x <- x %>% na.exclude()
  weighted_x <- 1.1^(-seq_along(x)) * (x %>% rev())
  sum_weighted_x <- weighted_x %>% sum()
  sum_of_weights <- 1.1^(-seq_along(x)) %>% sum()
  sum_weighted_x / sum_of_weights
}

for (i in 1:max_number_of_iterations) {

  start_time <- Sys.time()
  do_something(i)
  finish_time <- Sys.time()

  # Store the duration in seconds
  mean_time[i] <- as.numeric(finish_time - start_time, units = "secs")

  # Calculate the moving average of the iteration times
  average_diff_time <- mean_time %>% moving_average()

  # Estimate the remaining time
  estimated_remaining_time <- (max_number_of_iterations - i) * average_diff_time
}

The number 1.1 is a factor that determines how much weight is given to recent data points. The higher this number, the more weight is given to recent data points.

This is useful when the process speed varies over time. BUT, I wanted to use this algorithm both for local computations and web scraping. And, I live in Iran where we don't have quality internet connections; the speed can vary drastically within a few minutes! On the other hand, local computations usually show stable speeds.

In cases like a low-quality internet connection, you might want to give less weight to recent data points, because they might not represent the average speed accurately; you don't want to be fooled by a temporary change in the internet speed. So in these cases you can adjust this number based on the variance of average_diff_time over the last several iterations (or over a recent time window): as the variance increases, the factor should decrease, and vice versa.

So, when writing the algorithm, I recommend anticipating this weighting mechanism, too.
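One way to sketch that variance-based adjustment (in Python rather than R; the bounds, the coefficient-of-variation measure, and the linear mapping are my own illustrative assumptions, not from the answer above):

```python
import statistics

def adaptive_base(recent_durations, lo=1.01, hi=1.5):
    """Map the coefficient of variation of recent durations to a decay base:
    noisy timings -> base near `lo` (flat weights, long memory),
    stable timings -> base near `hi` (heavy weight on recent iterations)."""
    if len(recent_durations) < 2:
        return hi
    mean = statistics.fmean(recent_durations)
    cv = statistics.pstdev(recent_durations) / mean if mean > 0 else 0.0
    cv = min(cv, 1.0)  # clamp, then interpolate linearly between hi and lo
    return hi - (hi - lo) * cv
```

The returned value would replace the hard-coded 1.1 in the moving-average weights, so a wildly fluctuating connection automatically lengthens the averaging memory.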

Bullroarer answered 19/5 at 11:58

I've done something like this before. The easiest way I've found that creates a pretty accurate time estimate is (again, in p-code):

initTime = getTime()
for i = 1 to maxIter
    doSomething()
    avgTime = (getTime() - initTime) / i
    remainTime = convertToHoursMinutes(avgTime * (maxIter - i))
next

This way, you've got your 'average' time per iteration, and after 30-50 iterations your user should have a good idea of the remaining time (eventually, the central limit theorem comes into play).
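A runnable Python version of that pseudocode might look like this (the names are mine; `time.monotonic` stands in for `getTime`):

```python
import time

def run_with_eta(tasks, do_something, report=print):
    """Overall-average estimator: elapsed/i gives the mean iteration time,
    and multiplying by the iterations left gives the remaining-time estimate."""
    start = time.monotonic()
    total = len(tasks)
    for i, task in enumerate(tasks, start=1):
        do_something(task)
        elapsed = time.monotonic() - start
        remaining = (elapsed / i) * (total - i)
        report(f"{i}/{total}, ~{remaining:.1f}s left")
```

Starting the counter at 1 avoids the division by zero a `for i = 0 ...` loop would hit on its first report.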

Lyndes answered 20/8, 2012 at 1:7

Here's what I ended up doing for a function that sometimes had several really quick iterations at the beginning that were bringing the average speed way up so that the estimated time wasn't accurate near the start.

With each iteration, I appended to an array the estimated total time (time elapsed plus estimated time remaining using "Another approach is to divide the time passed since the first iteration, by the number of iterations completed, and then multiply that by remaining iterations"). I then got a log curve of best fit of that data (estimated total time as a function of time elapsed.)

Then to get the point on the curve that should represent total time most correctly, I found where time elapsed equals estimated total time, as those should presumably be equal in the extrapolated estimate for the last iteration.
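A Python sketch of that approach (assuming the "log curve" is y ≈ a + b·ln(x), with y the estimated total time and x the elapsed time, and solving the fixed point y(x) = x by simple iteration; the model form and names are my assumptions):

```python
import math

def fit_log(xs, ys):
    """Least-squares fit of y ~ a + b * ln(x). Returns (a, b)."""
    ls = [math.log(x) for x in xs]
    mean_l = sum(ls) / len(ls)
    mean_y = sum(ys) / len(ys)
    b = sum((l - mean_l) * (y - mean_y) for l, y in zip(ls, ys)) \
        / sum((l - mean_l) ** 2 for l in ls)
    a = mean_y - b * mean_l
    return a, b

def total_time_fixed_point(a, b, guess):
    """Find x with x = a + b*ln(x), i.e. where elapsed time equals
    the fitted estimate of total time, by fixed-point iteration."""
    x = guess
    for _ in range(100):
        x = a + b * math.log(x)
    return x
```

The fixed-point iteration converges when |b/x| < 1 near the root, which holds whenever the fitted curve flattens out as the run nears completion.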

Octastyle answered 23/12, 2022 at 22:57
