How can I efficiently remove elements by index from a very large list?
Asked Answered
P

6

27

I have a very large list of integers (about 2 billion elements) and a list with indices (couple thousand elements) at which I need to remove elements from the first list. My current approach is to loop over all indices in the second list, passing each to the RemoveAt() method of the first list:

indices.Sort();
indices.Reverse();
for (i = 0; i < indices.Count; i++)
{
    largeList.RemoveAt(indices[i]);
}

However, it takes about 2 minutes to finish. I really need to perform this operation much faster. Is there any way of optimizing this?

I have a Intel i9X CPU with 10 cores, so maybe some way of parallel processing?

Palla answered 19/8, 2020 at 21:29 Comment(17)
So... this 2 billion items list - is it in memory or on disk? What is the intended result of this operation - a new list? a file on disk?Involved
Parallel Linq can help you : learn.microsoft.com/en-us/dotnet/standard/parallel-programming/…Prompter
I've added code that i'm currently using. List is in the memory and as a result I also need a list in memory for further use in code.Palla
Why indices.Reverse()?Involved
That's a large list... by which I mean: you're getting close to he vector limit even with GC-large-objects enabled. What are the characteristics of this data? Unique? Sorted? Does order matter? Could it be sensibly partitioned? If you can partition: you can parallelize. Otherwise, maybe some alternative storage format...? You can't "simply" parallelize, because of synchronization problemsEaglewood
@Vernou I don't think Parallel LINQ will help - List<T> is not thread safe.Wry
@Involved Because when you remove the item at an indice, you will shift down all following items, so you want to remove from the end so indices won't reference the wrong items.Wry
At a minimum I would suggest reversing the for instead of using List<T>.Reverse(). Does the final order of the large list matter?Wry
In particular, I'm wondering if you could just use a 250MiB bit mask to cover the 2 billion range; then you're just clearing bits by position, much quicker than shuffling - and much smaller data size too. Alternatively, there are some sparse vector formats, although frankly 2 billion integers is almost fully dense!Eaglewood
Items in largeList are not unique (that's why I'm using second list with indices) and not sorted. Also, the order of them has to be preserved.Palla
Maybe you could accumulate the effects of all those RemoveAt() calls in a single pass by iterating through each index of largeList starting at indices[0]? Everything from indices[0] + 1 through indices[1] - 1 gets shifted back one position using the indexer, everything from indices[1] + 1 (pre-removal of indices[0]) through indices[2] - 1 gets shifted back two positions, and so on. Then at the end call RemoveRange() to trim off the end of the list the same number of elements that were "removed". You'll have to double-check the math on that, but that's the general idea.Liberal
Related: Remove list elements at given indicesAnuradhapura
The question is..... why do you keep 2 billions element in memory? That's about ~8G in memory for a list of int. How about using a database for dealing with data?Smell
What additional focus does this question require? A specific definition of "much faster"? If the votes are keying on the "if your question has many valid answers..." part of the close description, then couldn't any performance question fail that criteria? How should the author know how many ways there are to make their code faster? Don't ignore the "(but no way to determine which - if any - are correct)` that follows, though; there is a way to evaluate the correctness — or, at least, merit — of an answer here: benchmarks, which is exactly what I did in my answer!Liberal
Another possibility is overwriting the deletion with an illegal value. If the requirements show that negative values are illegal, just changing the sign of an entry to 'remove' it may make overall operation faster, depending on the ratio of lookups to inserts/deletions. An occasional sweep, using one of the other suggested removal mechanisms, may be needed to trim the list size.Preterhuman
This question has been under meta effect in early October 2020: meta.#402243Breeching
I would consider iterating over 2 billion items in 2 minutes extremely fast. What do you consider to be acceptable speed here? By what objective metric can a person consider an answer to acceptably solve your problem? Voting to close as 'needs details' until that is clarified, at least.Coincidental
L
18

Each time RemoveAt() is called it has to move every element that comes after the specified index up by one element. If there are thousands of elements to remove in a very large list, that will result in many, many (needless) moves.

My thinking was that if you can calculate the start index and length of each move, you can process the list in a single pass with no overlapping moves. This is done by comparing adjacent elements in the list of indices to remove. While this does mean building a third List<> of move operations to perform, my hope is having the most efficient, minimal move strategy planned out in advance will pay off in the end (or perhaps there's a way to do that without allocating any objects that presently isn't occurring to me).

You can see my implementation, RemoveUsingMoveStrategy(), in the benchmark code below. Running the launcher program below in test mode, we can see it — as well as the other answers — produces the correct result when given indices 0, 1, 5, 10, 11, 15, 15 (repeated), 18, and 19 to remove from a 20-element List<int>...

PS> dotnet run --configuration release --framework netcoreapp3.1 -- test
[snip]
RemoveUsingMoveStrategy():
        Elements: { 2, 3, 4, 6, 7, 8, 9, 12, 13, 14, 16, 17 }
           Count: 12
        Sequence: Equal to control list
        Duration: 3.95 milliseconds
        Returned: Input list
[snip]

Benchmark notes

  • It was my intention to benchmark against a billion-element List<int>, but RemoveUsingRemoveAt() — inspired by the code in the question — is so inefficient that it was taking too long, so I only went up to 10 million elements.

  • I still hope to run a billion-element benchmark that excludes RemoveUsingRemoveAt(). For that reason, I introduced a not-quite-as-naive implementation called RemoveUsingListCopy() to serve as the baseline for comparison across all list sizes. As the name implies, it doesn't modify the input list but creates a new one with the removals applied.

  • Since not all implementations require the list of indices to remove to be unique and/or sorted, I figured the fairest way to benchmark would be to include that overhead in the benchmark time where the method requires it.

  • The only other changes I made to code from other answers was to make use of the benchmark's lists (DataList and RemovalList) and to change from var to explicit types for reader clarity.

  • RemovalListLocation indicates from where in DataList indices were removed.

    • For Beginning, Middle, and End it is a contiguous RemovalListSize-sized block removed from that location.
    • For Random it is RemovalListSize random, valid, unsorted, not-guaranteed-to-be-unique indices generated from a constant seed.

    To keep the results short(er), I opted to only benchmark the Middle — thinking it'd be a good middle-of-the-road compromise — and Random values.

Benchmark results

  • RemoveUsingRemoveAt() is terrible. Don't do that.

  • For the smaller list, RemoveUsingListCopy() is consistently the fastest, at the expense of doubling memory usage.

  • For the larger list, Vernou's answer is as least 5x as fast as the other answers, at the expense of relying on implementation details.

    This just goes to show that you shouldn't necessarily always favor List<> over an array — unless you need its additional capabilities — because it isn't always superior. It insulates the underlying array, preventing you (without reflection) from using more performant access methods like Array.Copy() and unsafe code on it.

  • Theodor Zoulias's answer does well with the smaller list, but I think having to "touch" all 10 million indices — each with a call to HashSet<>.Contains() and increment of an index variable — really hurt it with the larger list.

  • The remaining implementations (including mine) have roughly the same performance: not the best, but still pretty good.

Benchmark data

Running the launcher program defined later in this answer in benchmark mode, I get these results from BenchmarkDotNet...

PS> dotnet run --configuration release --framework netcoreapp3.1 -- benchmark
[snip]
// * Summary *

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.450 (2004/?/20H1)
Intel Core i7 CPU 860 2.80GHz (Nehalem), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=3.1.401
  [Host]    : .NET Core 3.1.7 (CoreCLR 4.700.20.36602, CoreFX 4.700.20.37001), X64 RyuJIT
  MediumRun : .NET Framework 4.8 (4.8.4200.0), X64 RyuJIT

Job=MediumRun  InvocationCount=1  IterationCount=15  
LaunchCount=2  UnrollFactor=1  WarmupCount=10  

|                        Method |       Runtime | DataListSize | RemovalListSize | RemovalListLocation |           Mean |         Error |        StdDev |         Median | Ratio | RatioSD |
|------------------------------ |-------------- |------------- |---------------- |-------------------- |---------------:|--------------:|--------------:|---------------:|------:|--------:|
|           RemoveUsingListCopy |      .NET 4.8 |        10000 |            2000 |              Random |       379.9 μs |      15.93 μs |      23.36 μs |       380.4 μs |  1.00 |    0.00 |
|           RemoveUsingRemoveAt |      .NET 4.8 |        10000 |            2000 |              Random |     1,463.7 μs |      93.79 μs |     137.47 μs |     1,434.7 μs |  3.87 |    0.41 |
|       RemoveUsingMoveStrategy |      .NET 4.8 |        10000 |            2000 |              Random |       635.9 μs |      19.77 μs |      29.60 μs |       624.0 μs |  1.67 |    0.14 |
| Answer63496225_TheodorZoulias |      .NET 4.8 |        10000 |            2000 |              Random |       372.1 μs |      11.52 μs |      16.15 μs |       373.8 μs |  0.99 |    0.07 |
|       Answer63496768_CoolBots |      .NET 4.8 |        10000 |            2000 |              Random |       594.5 μs |      25.13 μs |      36.03 μs |       593.1 μs |  1.57 |    0.12 |
|        Answer63495657_NetMage |      .NET 4.8 |        10000 |            2000 |              Random |       618.8 μs |      17.53 μs |      23.99 μs |       622.4 μs |  1.65 |    0.13 |
|         Answer63496256_Vernou |      .NET 4.8 |        10000 |            2000 |              Random |       645.6 μs |      27.28 μs |      39.99 μs |       632.2 μs |  1.71 |    0.16 |
|                               |               |              |                 |                     |                |               |               |                |       |         |
|           RemoveUsingListCopy | .NET Core 3.1 |        10000 |            2000 |              Random |       391.5 μs |      10.39 μs |      15.55 μs |       391.6 μs |  1.00 |    0.00 |
|           RemoveUsingRemoveAt | .NET Core 3.1 |        10000 |            2000 |              Random |     1,402.2 μs |      44.20 μs |      64.80 μs |     1,407.6 μs |  3.59 |    0.21 |
|       RemoveUsingMoveStrategy | .NET Core 3.1 |        10000 |            2000 |              Random |       557.9 μs |      19.73 μs |      27.00 μs |       557.2 μs |  1.43 |    0.10 |
| Answer63496225_TheodorZoulias | .NET Core 3.1 |        10000 |            2000 |              Random |       424.3 μs |      20.90 μs |      29.30 μs |       424.2 μs |  1.09 |    0.09 |
|       Answer63496768_CoolBots | .NET Core 3.1 |        10000 |            2000 |              Random |       535.0 μs |      19.37 μs |      27.16 μs |       537.1 μs |  1.37 |    0.08 |
|        Answer63495657_NetMage | .NET Core 3.1 |        10000 |            2000 |              Random |       557.7 μs |      18.73 μs |      25.63 μs |       550.0 μs |  1.43 |    0.09 |
|         Answer63496256_Vernou | .NET Core 3.1 |        10000 |            2000 |              Random |       554.2 μs |      13.82 μs |      18.45 μs |       554.0 μs |  1.42 |    0.07 |
|                               |               |              |                 |                     |                |               |               |                |       |         |
|           RemoveUsingListCopy |      .NET 4.8 |        10000 |            2000 |              Middle |       221.6 μs |       7.25 μs |      10.63 μs |       222.5 μs |  1.00 |    0.00 |
|           RemoveUsingRemoveAt |      .NET 4.8 |        10000 |            2000 |              Middle |     1,195.3 μs |      20.01 μs |      28.69 μs |     1,187.7 μs |  5.42 |    0.30 |
|       RemoveUsingMoveStrategy |      .NET 4.8 |        10000 |            2000 |              Middle |       405.0 μs |      13.65 μs |      19.14 μs |       410.7 μs |  1.83 |    0.10 |
| Answer63496225_TheodorZoulias |      .NET 4.8 |        10000 |            2000 |              Middle |       206.3 μs |       8.62 μs |      12.09 μs |       204.9 μs |  0.94 |    0.08 |
|       Answer63496768_CoolBots |      .NET 4.8 |        10000 |            2000 |              Middle |       427.5 μs |      15.56 μs |      22.81 μs |       435.4 μs |  1.93 |    0.13 |
|        Answer63495657_NetMage |      .NET 4.8 |        10000 |            2000 |              Middle |       405.4 μs |      13.80 μs |      19.35 μs |       403.8 μs |  1.84 |    0.11 |
|         Answer63496256_Vernou |      .NET 4.8 |        10000 |            2000 |              Middle |       413.9 μs |      15.26 μs |      20.89 μs |       419.8 μs |  1.87 |    0.12 |
|                               |               |              |                 |                     |                |               |               |                |       |         |
|           RemoveUsingListCopy | .NET Core 3.1 |        10000 |            2000 |              Middle |       235.2 μs |      10.73 μs |      15.73 μs |       236.2 μs |  1.00 |    0.00 |
|           RemoveUsingRemoveAt | .NET Core 3.1 |        10000 |            2000 |              Middle |     1,345.6 μs |      32.07 μs |      43.90 μs |     1,352.7 μs |  5.77 |    0.41 |
|       RemoveUsingMoveStrategy | .NET Core 3.1 |        10000 |            2000 |              Middle |       324.0 μs |       4.92 μs |       7.05 μs |       326.6 μs |  1.39 |    0.09 |
| Answer63496225_TheodorZoulias | .NET Core 3.1 |        10000 |            2000 |              Middle |       262.9 μs |       6.18 μs |       9.06 μs |       265.4 μs |  1.12 |    0.08 |
|       Answer63496768_CoolBots | .NET Core 3.1 |        10000 |            2000 |              Middle |       333.6 μs |      10.14 μs |      13.87 μs |       340.9 μs |  1.43 |    0.11 |
|        Answer63495657_NetMage | .NET Core 3.1 |        10000 |            2000 |              Middle |       313.5 μs |       9.05 μs |      12.69 μs |       310.5 μs |  1.34 |    0.11 |
|         Answer63496256_Vernou | .NET Core 3.1 |        10000 |            2000 |              Middle |       332.3 μs |       6.70 μs |       8.95 μs |       331.9 μs |  1.43 |    0.09 |
|                               |               |              |                 |                     |                |               |               |                |       |         |
|           RemoveUsingListCopy |      .NET 4.8 |     10000000 |            2000 |              Random |   253,977.1 μs |   2,721.70 μs |   3,989.43 μs |   253,809.0 μs |  1.00 |    0.00 |
|           RemoveUsingRemoveAt |      .NET 4.8 |     10000000 |            2000 |              Random | 5,191,083.4 μs |  13,200.66 μs |  18,931.99 μs | 5,187,162.3 μs | 20.43 |    0.34 |
|       RemoveUsingMoveStrategy |      .NET 4.8 |     10000000 |            2000 |              Random |    65,365.4 μs |     422.41 μs |     592.16 μs |    65,307.3 μs |  0.26 |    0.00 |
| Answer63496225_TheodorZoulias |      .NET 4.8 |     10000000 |            2000 |              Random |   240,584.4 μs |   3,687.89 μs |   5,048.03 μs |   244,336.1 μs |  0.95 |    0.02 |
|       Answer63496768_CoolBots |      .NET 4.8 |     10000000 |            2000 |              Random |    54,168.4 μs |   1,001.37 μs |   1,436.14 μs |    53,390.3 μs |  0.21 |    0.01 |
|        Answer63495657_NetMage |      .NET 4.8 |     10000000 |            2000 |              Random |    72,501.4 μs |     452.46 μs |     634.29 μs |    72,161.2 μs |  0.29 |    0.00 |
|         Answer63496256_Vernou |      .NET 4.8 |     10000000 |            2000 |              Random |     5,814.0 μs |      89.71 μs |     128.67 μs |     5,825.3 μs |  0.02 |    0.00 |
|                               |               |              |                 |                     |                |               |               |                |       |         |
|           RemoveUsingListCopy | .NET Core 3.1 |     10000000 |            2000 |              Random |   239,784.0 μs |   2,721.35 μs |   3,902.88 μs |   241,125.5 μs |  1.00 |    0.00 |
|           RemoveUsingRemoveAt | .NET Core 3.1 |     10000000 |            2000 |              Random | 5,538,337.5 μs | 353,505.30 μs | 495,565.06 μs | 5,208,226.1 μs | 23.12 |    2.15 |
|       RemoveUsingMoveStrategy | .NET Core 3.1 |     10000000 |            2000 |              Random |    33,071.8 μs |     103.80 μs |     138.57 μs |    33,030.5 μs |  0.14 |    0.00 |
| Answer63496225_TheodorZoulias | .NET Core 3.1 |     10000000 |            2000 |              Random |   240,825.5 μs |     851.49 μs |   1,248.11 μs |   240,520.9 μs |  1.00 |    0.02 |
|       Answer63496768_CoolBots | .NET Core 3.1 |     10000000 |            2000 |              Random |    26,265.0 μs |      90.76 μs |     124.23 μs |    26,253.0 μs |  0.11 |    0.00 |
|        Answer63495657_NetMage | .NET Core 3.1 |     10000000 |            2000 |              Random |    48,670.6 μs |     581.51 μs |     833.99 μs |    48,303.0 μs |  0.20 |    0.00 |
|         Answer63496256_Vernou | .NET Core 3.1 |     10000000 |            2000 |              Random |     5,905.5 μs |      96.27 μs |     131.78 μs |     5,915.1 μs |  0.02 |    0.00 |
|                               |               |              |                 |                     |                |               |               |                |       |         |
|           RemoveUsingListCopy |      .NET 4.8 |     10000000 |            2000 |              Middle |   153,776.2 μs |   2,454.90 μs |   3,674.38 μs |   152,872.0 μs |  1.00 |    0.00 |
|           RemoveUsingRemoveAt |      .NET 4.8 |     10000000 |            2000 |              Middle | 5,245,952.0 μs |  13,845.58 μs |  20,294.67 μs | 5,252,922.4 μs | 34.10 |    0.81 |
|       RemoveUsingMoveStrategy |      .NET 4.8 |     10000000 |            2000 |              Middle |    33,233.6 μs |     110.33 μs |     158.24 μs |    33,217.3 μs |  0.22 |    0.01 |
| Answer63496225_TheodorZoulias |      .NET 4.8 |     10000000 |            2000 |              Middle |   128,949.8 μs |     560.72 μs |     804.17 μs |   128,724.9 μs |  0.84 |    0.02 |
|       Answer63496768_CoolBots |      .NET 4.8 |     10000000 |            2000 |              Middle |    48,965.1 μs |      70.75 μs |      94.45 μs |    48,957.3 μs |  0.32 |    0.01 |
|        Answer63495657_NetMage |      .NET 4.8 |     10000000 |            2000 |              Middle |    32,641.5 μs |      66.85 μs |      91.51 μs |    32,610.0 μs |  0.21 |    0.01 |
|         Answer63496256_Vernou |      .NET 4.8 |     10000000 |            2000 |              Middle |     2,982.2 μs |      29.47 μs |      41.31 μs |     2,961.9 μs |  0.02 |    0.00 |
|                               |               |              |                 |                     |                |               |               |                |       |         |
|           RemoveUsingListCopy | .NET Core 3.1 |     10000000 |            2000 |              Middle |   144,208.7 μs |   2,035.88 μs |   2,984.16 μs |   142,693.2 μs |  1.00 |    0.00 |
|           RemoveUsingRemoveAt | .NET Core 3.1 |     10000000 |            2000 |              Middle | 5,235,957.7 μs |  13,674.19 μs |  20,043.46 μs | 5,241,536.1 μs | 36.32 |    0.78 |
|       RemoveUsingMoveStrategy | .NET Core 3.1 |     10000000 |            2000 |              Middle |    16,547.3 μs |      72.72 μs |     101.95 μs |    16,520.7 μs |  0.11 |    0.00 |
| Answer63496225_TheodorZoulias | .NET Core 3.1 |     10000000 |            2000 |              Middle |   137,218.2 μs |     716.45 μs |     980.69 μs |   137,027.0 μs |  0.95 |    0.02 |
|       Answer63496768_CoolBots | .NET Core 3.1 |     10000000 |            2000 |              Middle |    23,728.5 μs |      79.84 μs |     111.93 μs |    23,689.9 μs |  0.16 |    0.00 |
|        Answer63495657_NetMage | .NET Core 3.1 |     10000000 |            2000 |              Middle |    17,298.3 μs |     216.46 μs |     310.44 μs |    17,165.5 μs |  0.12 |    0.00 |
|         Answer63496256_Vernou | .NET Core 3.1 |     10000000 |            2000 |              Middle |     2,999.7 μs |      85.78 μs |     123.03 μs |     2,957.1 μs |  0.02 |    0.00 |
[snip]

Benchmark code

To see the various implementations, scroll a third of the way down and look for the methods decorated with [Benchmark()].

This requires the BenchmarkDotNet package.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;
using BenchmarkDotNet.Attributes;

namespace SO63495264
{
    public class Benchmarks
    {
        public enum RemovalLocation
        {
            Random,
            Beginning,
            Middle,
            End
        }

        [Params(10_000, 10_000_000)]
        public int DataListSize
        {
            get; set;
        }

        [Params(2_000)]
        public int RemovalListSize
        {
            get; set;
        }

        //[ParamsAllValues()]
        [Params(RemovalLocation.Random, RemovalLocation.Middle)]
        public RemovalLocation RemovalListLocation
        {
            get; set;
        }

        public List<int> DataList
        {
            get;
            set;
        }

        public IReadOnlyList<int> RemovalList
        {
            get;
            set;
        }

        [GlobalSetup()]
        public void GlobalSetup()
        {
            IEnumerable<int> removalIndices;

            if (RemovalListLocation == RemovalLocation.Random)
            {
                // Pass a constant seed so the same indices get generated for every run
                Random random = new Random(12345);

                removalIndices = Enumerable.Range(0, RemovalListSize)
                    .Select(i => random.Next(DataListSize));
            }
            else
            {
                int removalSegmentOffset = RemovalListLocation switch {
                    RemovalLocation.Beginning => 0,
                    RemovalLocation.Middle => (DataListSize - RemovalListSize) / 2,
                    RemovalLocation.End => DataListSize - RemovalListSize,
                    _ => throw new Exception($"Unexpected {nameof(RemovalLocation)} enumeration value {RemovalListLocation}.")
                };

                removalIndices = Enumerable.Range(removalSegmentOffset, RemovalListSize);
            }

            // For efficiency, create a single List<int> to be reused by all iterations for a given set of parameters
            DataList = new List<int>(DataListSize);
            RemovalList = removalIndices.ToList().AsReadOnly();
        }

        [IterationSetup()]
        public void IterationSetup()
        {
            // Each iteration could modify DataList, so repopulate its elements for the
            // next iteration.  DataList is either new or has been Clear()ed by IterationCleanup().
            for (int i = 0; i < DataListSize; i++)
                DataList.Add(i);
        }

        [IterationCleanup()]
        public void IterationCleanup()
        {
            DataList.Clear();
        }

        [GlobalCleanup()]
        public void GlobalCleanup()
        {
            // Force collection of the List<> for the current set of parameters
            int generation = GC.GetGeneration(DataList);

            DataList = null;
            GC.Collect(generation, GCCollectionMode.Forced, true);
        }

        [Benchmark(Baseline = true)]
        public List<int> RemoveUsingListCopy()
        {
            HashSet<int> removalSet = RemovalList.ToHashSet();
            List<int> newList = new List<int>(DataList.Count - removalSet.Count);

            for (int index = 0; index < DataList.Count; index++)
                if (!removalSet.Contains(index))
                    newList.Add(DataList[index]);

            return newList;
        }

        // Based on Stack Overflow question 63495264
        // https://mcmap.net/q/499604/-how-can-i-efficiently-remove-elements-by-index-from-a-very-large-list/150605
        [Benchmark()]
        public List<int> RemoveUsingRemoveAt()
        {
            // The collection of indices to remove must be in decreasing order
            // with no duplicates; all are assumed to be in-range for DataList
            foreach (int index in RemovalList.Distinct().OrderByDescending(i => i))
                DataList.RemoveAt(index);

            return DataList;
        }

        // From Stack Overflow answer 63498191
        // https://mcmap.net/q/499604/-how-can-i-efficiently-remove-elements-by-index-from-a-very-large-list
        [Benchmark()]
        public List<int> RemoveUsingMoveStrategy()
        {
            // The collection of indices to remove must be in increasing order
            // with no duplicates; all are assumed to be in-range for DataList
            int[] coreIndices = RemovalList.Distinct().OrderBy(i => i).ToArray();
            List<(int startIndex, int count, int distance)> moveOperations
                = new List<(int, int, int)>();

            int candidateIndex;
            for (int i = 0; i < coreIndices.Length; i = candidateIndex)
            {
                int currentIndex = coreIndices[i];
                int elementCount = 1;

                // Merge removal of consecutive indices into one operation
                while ((candidateIndex = i + elementCount) < coreIndices.Length
                    && coreIndices[candidateIndex] == currentIndex + elementCount
                )
                {
                    elementCount++;
                }

                int nextIndex = candidateIndex < coreIndices.Length
                    ? coreIndices[candidateIndex]
                    : DataList.Count;
                int moveCount = nextIndex - currentIndex - elementCount;

                // Removing one or more elements from the end of the
                // list will result in a no-op, 0-length move; omit it
                if (moveCount > 0)
                {
                    moveOperations.Add(
                        (
                            startIndex: currentIndex + elementCount,
                            count: moveCount,
                            distance: i + elementCount
                        )
                    );
                }
            }

            // Perform the element moves
            foreach ((int startIndex, int count, int distance) in moveOperations)
            {
                for (int offset = 0; offset < count; offset++)
                {
                    int sourceIndex = startIndex + offset;
                    int targetIndex = sourceIndex - distance;

                    DataList[targetIndex] = DataList[sourceIndex];
                }
            }

            // "Trim" the number of removed elements from the end of the list
            DataList.RemoveRange(DataList.Count - coreIndices.Length, coreIndices.Length);

            return DataList;
        }

        // Adapted from Stack Overflow answer 63496225 revision 4
        // https://stackoverflow.com/revisions/63496225/4
        [Benchmark()]
        public List<int> Answer63496225_TheodorZoulias()
        {
            HashSet<int> indicesSet = new HashSet<int>(RemovalList);
            int index = 0;

            DataList.RemoveAll(_ => indicesSet.Contains(index++));

            return DataList;
        }


        // Adapted from Stack Overflow answer 63496768 revision 2
        // https://stackoverflow.com/revisions/63496768/2
        [Benchmark()]
        public List<int> Answer63496768_CoolBots()
        {
            List<int> sortedIndicies = RemovalList.Distinct().OrderBy(i => i).ToList();

            int sourceStartIndex = 0;
            int destStartIndex = 0;
            int spanLength = 0;

            int skipCount = 0;

            // Copy items up to last index to be skipped
            foreach (int skipIndex in sortedIndicies)
            {
                spanLength = skipIndex - sourceStartIndex;
                destStartIndex = sourceStartIndex - skipCount;

                for (int i = sourceStartIndex; i < sourceStartIndex + spanLength; i++)
                {
                    DataList[destStartIndex] = DataList[i];
                    destStartIndex++;
                }

                sourceStartIndex = skipIndex + 1;
                skipCount++;
            }

            // Copy remaining items (between last index to be skipped and end of list)
            spanLength = DataList.Count - sourceStartIndex;
            destStartIndex = sourceStartIndex - skipCount;

            for (int i = sourceStartIndex; i < sourceStartIndex + spanLength; i++)
            {
                DataList[destStartIndex] = DataList[i];
                destStartIndex++;
            }

            DataList.RemoveRange(destStartIndex, sortedIndicies.Count);

            return DataList;
        }

        // Adapted from Stack Overflow answer 63495657 revision 6
        // https://stackoverflow.com/revisions/63495657/6
        [Benchmark()]
        public List<int> Answer63495657_NetMage()
        {
            List<int> removeAtList = RemovalList.Distinct().OrderBy(i => i).ToList();

            int srcCount = DataList.Count;
            int ralCount = removeAtList.Count;
            int removeAtIndice = 1;
            int freeIndex = removeAtList[0];
            int current = freeIndex + 1;
            while (current < srcCount)
            {
                while (removeAtIndice < ralCount && current == removeAtList[removeAtIndice])
                {
                    ++current;
                    ++removeAtIndice;
                }

                if (current < srcCount)
                    DataList[freeIndex++] = DataList[current++];
            }

            DataList.RemoveRange(freeIndex, srcCount - freeIndex);

            return DataList;
        }

        // Adapted from Stack Overflow answer 63496256 revision 3
        // https://stackoverflow.com/revisions/63496256/3
        [Benchmark()]
        public List<int> Answer63496256_Vernou()
        {
            List<int> indices = RemovalList.Distinct().OrderBy(i => i).ToList();

            //Get the internal array
            int[] largeArray = (int[]) typeof(List<int>)
                .GetField("_items", BindingFlags.Instance | BindingFlags.NonPublic)
                .GetValue(DataList);

            int current = 0;
            int copyFrom = 0;

            for (int i = 0; i < indices.Count; i++)
            {
                int copyTo = indices[i];
                if (copyTo < copyFrom)
                {
                    //In case the indice is duplicate,
                    //The item is already passed
                    continue;
                }
                int copyLength = copyTo - copyFrom;
                Array.Copy(largeArray, copyFrom, largeArray, current, copyLength);
                current += copyLength;
                copyFrom = copyTo + 1;
            }
            //Resize the internal array
            DataList.RemoveRange(DataList.Count - indices.Count, indices.Count);

            return DataList;
        }
    }
}

Launcher code

RunBenchmark() defines the type of benchmark job to run and for which runtime(s).

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Reflection;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Jobs;

namespace SO63495264
{
    class Program
    {
        static void Main(string[] args)
        {
            if (args?.Length == 0)
            {
                string assemblyFilePath = Assembly.GetExecutingAssembly().Location;
                string assemblyFileName = System.IO.Path.GetFileName(assemblyFilePath);

                Console.WriteLine($"{assemblyFileName} {{ benchmark | test }}");
            }
            else if (string.Equals(args[0], "benchmark", StringComparison.OrdinalIgnoreCase))
                RunBenchmark();
            else if (string.Equals(args[0], "test", StringComparison.OrdinalIgnoreCase))
                RunTest();
            else
                Console.WriteLine($"Unexpected parameter \"{args[0]}\"");
        }

        static void RunBenchmark()
        {
            Job baseJob = Job.MediumRun;
            IConfig config = DefaultConfig.Instance;

            foreach (Runtime runtime in new Runtime[] { CoreRuntime.Core31, ClrRuntime.Net48 })
            {
                config = config.AddJob(
                    baseJob.WithRuntime(runtime)
                );
            }

            BenchmarkDotNet.Running.BenchmarkRunner.Run<Benchmarks>(config);
        }

        static void RunTest()
        {
            const int ListSize = 20;
            const int MaxDisplayElements = 20;

            IEnumerable<int> data = Enumerable.Range(0, ListSize);
            IReadOnlyList<int> indices = new List<int>(
                new int[] {
                    0,                               1, // First two indices
                    ListSize / 4,
                    ListSize / 2,     ListSize / 2 + 1, // Consecutive indices
                    ListSize / 4 * 3, ListSize / 4 * 3, // Duplicate indices
                    ListSize - 2,     ListSize - 1      // Last two indices
                }
            ).AsReadOnly();

            // Discover and invoke the benchmark methods the same way BenchmarkDotNet would
            Benchmarks benchmarks = new Benchmarks() {
                RemovalList = indices
            };
            IEnumerable<MethodInfo> benchmarkMethods = benchmarks.GetType()
                .GetMethods(BindingFlags.Instance | BindingFlags.Public)
                .Where(
                    method => method.CustomAttributes.Any(
                        attribute => attribute.AttributeType == typeof(BenchmarkAttribute)
                    )
                );

            // Call a known-good method to get the correct results for comparison
            benchmarks.DataList = data.ToList();
            List<int> controlList = benchmarks.RemoveUsingListCopy();

            foreach (MethodInfo benchmarkMethod in benchmarkMethods)
            {
                List<int> inputList = data.ToList();

                benchmarks.DataList = inputList;
                Stopwatch watch = Stopwatch.StartNew();
                List<int> outputList = (List<int>) benchmarkMethod.Invoke(benchmarks, Array.Empty<object>());
                watch.Stop();

                Console.WriteLine($"{benchmarkMethod.Name}():");
                Console.WriteLine($"\tElements: {{ {string.Join(", ", outputList.Take(MaxDisplayElements))}{(outputList.Count > MaxDisplayElements ? ", ..." : "") } }}");
                Console.WriteLine($"\t   Count: {outputList.Count:N0}");
                Console.WriteLine($"\tSequence: {(outputList.SequenceEqual(controlList) ? "E" : "*Not* e")}qual to control list");
                Console.WriteLine($"\tDuration: {watch.Elapsed.TotalMilliseconds:N2} milliseconds");
                Console.WriteLine($"\tReturned: {(object.ReferenceEquals(outputList, inputList) ? "Input list" : "New list")}");
            }
        }
    }
}

In order to benchmark on .NET Framework (4.8) I had to add the following properties to my .NET Core .csproj project file:

<TargetFrameworks>netcoreapp3.1;net48</TargetFrameworks>
<LangVersion>8.0</LangVersion>

41 characters to spare!

Liberal answered 20/8, 2020 at 4:12 Comment(0)
W
17

Since the order of the source list is significant, you can move each item down the list, skipping indices to be removed, and then remove the end of the list.

UPDATE: Took the .Net Core source code for RemoveAll and modified it for a indice list instead of a predicate.

UPDATE 2: Optimized to not repeat tests if possible.

UPDATE 3: Simplified as optimization having extra code proved slower in benchmarks.

Having src as the large list, and removeAtList as the indices to remove in some random order, you can do:

removeAtList.Sort();

var srcCount = src.Count;
var ralCount = removeAtList.Count;
var removeAtIndice = 1;
var freeIndex = removeAtList[0];
var current = freeIndex+1;
while (current < srcCount) {
    while (removeAtIndice < ralCount && current == removeAtList[removeAtIndice]) {
        ++current;
        ++removeAtIndice;
    }

    if (current < srcCount)
        src[freeIndex++] = src[current++];
}

src.RemoveRange(freeIndex, srcCount-freeIndex);

For a one billion element list of random integers, and a 1000 - 3000 element list of random indices, I get 1.1 ms per remove with this algorithm. Using RemoveAt, I get over 232.77 ms per remove, so this is about 200 times faster.

Wry answered 19/8, 2020 at 22:14 Comment(6)
So, are you saying your total running time for 1_000_000_000 values and, say, 2000 indicies to remove would be 1.14ms x 2000 = 2.28s? I'm getting ~26 seconds with your solution... which is still much faster than anything I came up with so far, lol!Involved
@Involved Yes - it is taking about 12 seconds to create the list, then about 3.2 seconds to remove 2000 indices - my per millisecond time is verying from 1.2 to 3 ms using .Net Core 3.1 in LINQPad 6.Wry
@Involved I updated my answer by modifying the code from .Net Core 3.1 List<T>.RemoveAll for a skip indices list and optimizing. (PS LINQPad 6 x64.)Wry
It's faster than it was, but I can't get to the times you're claiming. I get about 12 seconds runtime for your version. I just came up with a solution that runs in under 5 seconds, will post.Involved
@Involved Nice. That could also be because I am running on a i7-4790 with 32GB of RAM under Windows 10 x64Wry
Interesting, your spec is better than mine, especially RAM - can you try my version on your machine?Involved
E
6

One way to allow this to be parallelized would be to break the list into multiple fragments; perhaps initially (arbitrarily) separate slabs of 1 million elements. As long as each slab maintains its own count, you can split the work by index into removals from different slabs (based purely on the counts), and then do the actual removal work concurrently. If you leave some spare capacity in each, you can also add elements into the middle more cheaply, as you are usually only touching one slab. Random access will be a little slower, as you may need to look at multiple slab counts to determine the correct slab, but if the slab counts are maintained in a contiguous vector (rather than against each slab), you should have excellent memory cache hit while doing it.

Eaglewood answered 19/8, 2020 at 22:56 Comment(1)
Or to parallelize with the same flat linear data structure: If you have a sorted unique list of indices to remove, you can calculate the output index for a given input index: starting at indices[i], there will have been i removals before this. If indices[] is small, you can scan through it to distribute block copying work across threads, and have all threads copy into a single array.Dilatation
I
6

This answer is based on other answers here - mainly, I am shifting elements up within the list, as suggested by @Vernou (in their answer) and @BACON (in comments). This one is finally performant (unlike my first few approaches), and is faster than other solutions posted so far, at least in my tests - I tried OP's setup of 2_000_000_000 entries and 2_000 indicies - runtime is under 10 seconds on my laptop (i7-8550U @ 1.8GHz, 16GB RAM):

static void FilterOutIndicies(List<int> values, List<int> sortedIndicies)
{
    int sourceStartIndex = 0;
    int destStartIndex = 0;
    int spanLength = 0;

    int skipCount = 0;

    // Copy items up to last index to be skipped
    foreach (var skipIndex in sortedIndicies)
    {
        spanLength = skipIndex - sourceStartIndex;
        destStartIndex = sourceStartIndex - skipCount;

        for (int i = sourceStartIndex; i < sourceStartIndex + spanLength; i++)
        {
            values[destStartIndex] = values[i];
            destStartIndex++;
        }

        sourceStartIndex = skipIndex + 1;
        skipCount++;
    }

    // Copy remaining items (between last index to be skipped and end of list)
    spanLength = values.Count - sourceStartIndex;
    destStartIndex = sourceStartIndex - skipCount;

    for (int i = sourceStartIndex; i < sourceStartIndex + spanLength; i++)
    {
        values[destStartIndex] = values[i];
        destStartIndex++;
    }

    values.RemoveRange(destStartIndex, sortedIndicies.Count);
}
Involved answered 20/8, 2020 at 0:35 Comment(5)
With a fixed Random, I get 1.22 ms for mine and 0.86 ms for yours, very nice!Wry
A quick check says my CPU should be about 23% faster than yours.Wry
@Wry thanks for checking my version on your PC. Yeah, I think it's the faster CPU and double the RAM that's really helping, due to the array size. Maybe time for an upgrade for me, lol!Involved
On my computer, NetMage solution take 22 seconds, CoolBots solution take 15 seconds. My solution take 2 seconds, but I use reflection. All solutions are equivalent in memory usage.Prompter
The semi-last “Copy remaining items” step is obsolete. You just need to use the right range for the RemoveRange call.Mediatize
P
4

The method List.RemoveAt copy all next items from the removed item. In your case, this copy 2,000 * 2,000,000,000 times each items (not really, but the true is near).

A solution is manually copy item between the removed item and the next removed item :

static void Main(string[] args)
{
    var largeList = Enumerable.Range(0, 2_000_000_000).ToList();

    var indices = new List<int>();
    var rand = new Random();
    for (var i = 0; i < 20000; i++)
    {
        indices.Add(rand.Next(0, largeList.Count - 1));
    }
    indices.Sort();

    var watch = new Stopwatch();
    watch.Start();

    // You can convert the list to array with ToArray,
    // but this duplicate the memory use.
    // Or get the internal array by reflection,
    // but reflection on external library isn't recommended
    var largeArray = (int[])typeof(List<int>)
        .GetField("_items", BindingFlags.Instance | BindingFlags.NonPublic)
        .GetValue(largeList);

    var current = 0;
    var copyFrom = 0;

    for (var i = 0; i < indices.Count; i++)
    {
        var copyTo = indices[i];
        if (copyTo < copyFrom)
        {
            //In case the indice is duplicate,
            //The item is already passed
            continue;
        }
        var copyLength = copyTo - copyFrom;
        Array.Copy(largeArray, copyFrom, largeArray, current, copyLength);
        current += copyLength;
        copyFrom = copyTo + 1;
    }
    //Resize the internal array
    largeList.RemoveRange(largeList.Count - indices.Count, indices.Count);

    watch.Stop();
    Console.WriteLine(watch.Elapsed);
    Console.WriteLine(largeList.Count);
}
Prompter answered 19/8, 2020 at 23:23 Comment(11)
It looks like correct implementation if OP starts with an array. Unfortunately as is it copies array couple extra times and hence require 3x memory which in case of 2 billion items array is non-trivial. Converting Array.Copy to regular for would be the solution. If performance of this code is ok than it would be better to just copy items into new list and skip indexes in the "remove" array - which will have just 2x memory consumption.Ananthous
Array.Copy is really more performant, because is implemented by JIT. More information in this question : #6558766Prompter
@Alexei-LevenkovList is encapsulated Array. I edited my answer to get the internal array.Prompter
@Vernou that's a really interesting approach. Mixed feelings about it, lol, but definitely interesting!Involved
@Involved That is what I figured would be the only way to speed yours up, but I definitely wouldn't use it in production :)Wry
Yes, don't use this in prod. In prod, it better to use directly Array or convert the list.Prompter
I have updated the benchmark in my answer to include your code. For the larger (10-million element) list I benchmarked against, the other answers generally took 10-20% of the time of the baseline implementation; yours tooks 5-10% of the time of those other answers (i.e. 2% of the baseline).Liberal
meta.#402282Pagination
Is there no other way to get current .NET compilers / JITs to do an efficient memcpy over a block of elements? In-place copy is clearly very good for performance, and needs to be done with SIMD vectors to keep up with memory on modern CPUs. I don't know C#, but using reflection to get at the internals of the container seems like an unfortunate requirement. If so, you should highlight that danger in the answer text. BTW, a C++ implementation would be able to use std::copy or even memmove between parts of a std::vector<uint32_t> and have it compile efficiently.Dilatation
I edited the answer to indicate the danger of the reflection on external library.Prompter
With .Net Core 5.0 it looks like List<T>.AsSpan is part of CollectionMarshal, which should make this possible without Reflection.Wry
A
-1

When you have multiple items to remove from a List, and replacing the List with a new List is not an option, the most efficient way is to use the RemoveAll method instead of the RemoveAt. The RemoveAll rearranges the internal state of the List only once, instead of doing it once per each item removed.

The RemoveAll accepts a Predicate<T> that is invoked once for each item in the list (the large list). Unfortunately this delegate doesn't receive the index of the currently tested item. You could however depend on knowing how the RemoveAll is implemented. The source code reveals that the items are tested sequentially in ascending order. So based on this knowledge you could remove the selected indices from the list, very efficiently, with this three-liner:

HashSet<int> indicesSet = new(indices);
int index = 0;
largeList.RemoveAll(_ => indicesSet.Contains(index++));

This behavior is not documented explicitly though. In theory Microsoft could change the behavior of the RemoveAll method in a future .NET release, breaking the above code. Personally I consider this scenario to be almost improbable, but if you want to be on the safe side you could use a custom RemoveAll implementation that has fixed behavior, like the one found in this answer, which also features a delegate with index. You could use it like this:

HashSet<int> indicesSet = new(indices);
largeList.RemoveAll((_, index) => indicesSet.Contains(index));
Anuradhapura answered 19/8, 2020 at 23:19 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.