Processing large arrays in PowerShell

I am having a difficult time understanding the most efficient way to process large datasets/arrays in PowerShell. I have arrays of several million items that I need to process and group. The list size varies from run to run: it could be 3.5 million items or 10 million items.

Example: with 3.5 million items, the items are grouped in fours, like the following:

Items 0,1,2,3 group together, items 4,5,6,7 group together, and so on.

I have tried processing the array on a single thread by looping through the list and assigning each group to a PSCustomObject. This works, but it takes 45-50+ minutes to complete.

I have also attempted to break the array up into smaller arrays, but that makes the process run even longer.

$i=0
$d_array = @()
$item_array # Large dataset


While ($i -lt $item_array.length){

    $o = "Test"
    $oo = "Test"
    $n = $item_array[$i];$i++
    $id = $item_array[$i];$i++
    $ir = $item_array[$i];$i++
    $cs = $item_array[$i];$i++

    $items = [PSCustomObject]@{
        'field1' = $o
        'field2' = $oo
        'field3' = $n
        'field4' = $id
        'field5' = $ir
        'field6'= $cs
    }
    $d_array += $items

}

I would imagine that a job scheduler letting me run multiple jobs would cut the processing time significantly, but I wanted to get others' takes on a quick, effective way to tackle this.

Argol answered 1/6, 2019 at 12:40 Comment(1)
First thing you should try is to not use array addition. – Communistic
R
4

If you are working with large data, using C# is also effective.

Add-Type -TypeDefinition @"
using System.Collections.Generic;

public static class Test
{
    public static List<object> Convert(object[] src)
    {
        var result = new List<object>();
        for(var i = 0; i <= src.Length - 4; i+=4)
        {
            result.Add( new {
                field1 = "Test",
                field2 = "Test",
                field3 = src[i + 0],
                field4 = src[i + 1],
                field5 = src[i + 2],
                field6 = src[i + 3]
            });
        }
        return result;
    }
}
"@

$item_array = 1..10000000
$result = [Test]::Convert($item_array)
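
A possible variant, sketched under a few assumptions: it swaps the anonymous type for a small public class so that each record has a named type. The names GroupedRecord and Grouper are hypothetical (they are not from the answer above), the C# is kept to older syntax on the assumption that it may need to compile under Windows PowerShell as well as newer editions, and it has not been timed against the version above.

```powershell
Add-Type -TypeDefinition @"
using System.Collections.Generic;

// Hypothetical record type; the field names mirror the question.
public class GroupedRecord
{
    public string field1 { get; set; }
    public string field2 { get; set; }
    public object field3 { get; set; }
    public object field4 { get; set; }
    public object field5 { get; set; }
    public object field6 { get; set; }
}

// Hypothetical helper class; not part of any library.
public static class Grouper
{
    public static List<GroupedRecord> Convert(object[] src)
    {
        // Pre-size the list: one record per complete group of four.
        var result = new List<GroupedRecord>(src.Length / 4);
        for (var i = 0; i + 3 < src.Length; i += 4)
        {
            result.Add(new GroupedRecord
            {
                field1 = "Test",
                field2 = "Test",
                field3 = src[i],
                field4 = src[i + 1],
                field5 = src[i + 2],
                field6 = src[i + 3]
            });
        }
        return result;
    }
}
"@

$small_array = 1..12                      # small input so the result is easy to inspect
$records = [Grouper]::Convert($small_array)
$records.Count                            # 3 complete groups
$records[2].field6                        # last item of the third group: 12
```

Any trailing items that do not fill a complete group of four are skipped, which matches the loop bound in the version above.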
Resupinate answered 1/6, 2019 at 17:14 Comment(4)
I was about to suggest exactly the change you just made yourself, by far the fastest version (+1). – Submerge
Everyone here is amazing, thank you for your response. – Argol
29 ms with my example. Nice. – Upstretched
Does this implementation only work on Windows or can it be used on a Mac? I am asking as I am trying to support both. – Hollow
S
4

While rokumaru's version is unsurpassed, here is my attempt, with local measurements modeled on js2010's.

The same $item_array = 1..100000 was applied to all versions.

> .\SO_56406847.ps1
measuring...BDups
measuring...LotPings
measuring...Theo
measuring...js2010
measuring...rokumaru
BDups    = 75,9949897 TotalSeconds
LotPings = 2,3663763 TotalSeconds
Theo     = 2,4469917 TotalSeconds
js2010   = 2,9198114 TotalSeconds
rokumaru = 0,0109287 TotalSeconds

## Q:\Test\2019\06\01\SO_56406847.ps1
$i=0
$item_array = 1..100000  # Large dataset

'measuring...LotPings'
$LotPings = measure-command {
    $d_array = for($i=0;$i -lt $item_array.length;$i+=4){
        [PSCustomObject]@{
            'field1' = "Test"
            'field2' = "Test"
            'field3' = $item_array[$i]
            'field4' = $item_array[$i+1]
            'field5' = $item_array[$i+2]
            'field6' = $item_array[$i+3]
        }
    }
} # measure-command
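
A minimal sketch of how a report like the one above could be produced; this is an assumption about the approach, not the original script. The label ForLoop and the $timings table are made up for illustration, and the input is kept smaller than 1..100000 so it runs quickly. The decimal commas in the output above look like the local culture's number format, so this sketch formats with the invariant culture to get a decimal point instead.

```powershell
$item_array = 1..20000   # smaller than the dataset above, just to keep the sketch fast

# One entry per candidate; the keys are only labels for the report.
$timings = [ordered]@{}

$timings['ForLoop'] = Measure-Command {
    $d_array = for ($i = 0; $i -lt $item_array.Length; $i += 4) {
        [PSCustomObject]@{
            'field1' = "Test"
            'field2' = "Test"
            'field3' = $item_array[$i]
            'field4' = $item_array[$i + 1]
            'field5' = $item_array[$i + 2]
            'field6' = $item_array[$i + 3]
        }
    }
}

# Print "Name = seconds TotalSeconds", using a decimal point regardless of locale.
foreach ($name in $timings.Keys) {
    $seconds = $timings[$name].TotalSeconds.ToString([cultureinfo]::InvariantCulture)
    '{0,-8} = {1} TotalSeconds' -f $name, $seconds
}
```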
Submerge answered 1/6, 2019 at 19:26 Comment(1)
That's cool, but the comma in TotalSeconds should be a decimal point, right? – Upstretched
U
2

How's this? 32.5x faster. Making arrays with += kills puppies. It copies the whole array every time.

$i=0
$item_array = 1..100000 # Large dataset

'measuring...'

# original 1 min 5 sec
# mine 2 sec
# other answer, 2 or 3 sec
# c# version 0.029 sec, 2241x faster!

measure-command {

    $d_array =
    While ($i -lt $item_array.length){
        $o = "Test"
        $oo = "Test"
        $n = $item_array[$i];$i++
        $id = $item_array[$i];$i++
        $ir = $item_array[$i];$i++
        $cs = $item_array[$i];$i++
        # $items =
        [PSCustomObject]@{
            'field1' = $o
            'field2' = $oo
            'field3' = $n
            'field4' = $id
            'field5' = $ir
            'field6' = $cs
        }
        # $d_array += $items
    }

}
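
To make the copying visible, here is a small sketch (the variable names are mine, chosen for illustration). A PowerShell array has a fixed size, so += cannot grow it in place; it builds a new, larger array and rebinds the variable. Keeping a second reference to the old array shows that the two are no longer the same object afterwards.

```powershell
$original  = 1, 2, 3
$sameArray = $original          # second reference to the same array object

$original += 4                  # allocates a new 4-element array and copies 1,2,3 into it

$original.IsFixedSize                              # True: arrays cannot grow in place
[object]::ReferenceEquals($original, $sameArray)   # False: $original now points at the copy
$sameArray.Count                                   # 3: the old array is unchanged
$original.Count                                    # 4
```

Repeating that copy once per appended item is what makes the loop in the question slow down as the array grows, while collecting the loop's output with a single assignment avoids it.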
Upstretched answered 1/6, 2019 at 13:51 Comment(0)
S
0

You could optimize this somewhat by using an ArrayList or, perhaps even better, a strongly typed List, but going through millions of elements in an array will still take time.

As for your code: there is no need to capture the array item values in a variable first and use that later to add to the PSCustomObject.

$item_array = 'a','b','c','d','e','f','g','h' # Large dataset
$result = New-Object System.Collections.Generic.List[PSCustomObject]
# or use an ArrayList: $result = New-Object System.Collections.ArrayList

$i = 0
While ($i -lt $item_array.Count) {
    [void]$result.Add(
        [PSCustomObject]@{
            'field1' = "Test" # $o
            'field2' = "Test" # $oo
            'field3' = $item_array[$i++]  #$n
            'field4' = $item_array[$i++]  #$id
            'field5' = $item_array[$i++]  #$ir
            'field6' = $item_array[$i++]  #$cs
        }
    )
}

# save to a CSV file maybe ?
$result | Export-Csv 'D:\blah.csv' -NoTypeInformation

If you need the result to become a 'normal' array again, use $result.ToArray()
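
A hedged variation on the above, assuming PowerShell 5 or later (for the ::new() syntax): since the number of groups is known up front, the List can be created with that capacity so it does not have to resize while filling. The names $groupSize, $fullGroups and $start are introduced here for illustration, and the sketch also reports leftover items when the input length is not a multiple of four.

```powershell
$item_array = 'a','b','c','d','e','f','g','h','i'   # 9 items: two full groups plus one leftover
$groupSize  = 4
$fullGroups = [int][math]::Floor($item_array.Count / $groupSize)

# Pre-size the list to the number of complete groups.
$result = [System.Collections.Generic.List[PSCustomObject]]::new($fullGroups)

for ($g = 0; $g -lt $fullGroups; $g++) {
    $start = $g * $groupSize                         # index of the first item in this group
    $result.Add(
        [PSCustomObject]@{
            'field1' = "Test"
            'field2' = "Test"
            'field3' = $item_array[$start]
            'field4' = $item_array[$start + 1]
            'field5' = $item_array[$start + 2]
            'field6' = $item_array[$start + 3]
        }
    )
}

# Report items that did not fill a complete group instead of padding them with $null.
$leftover = $item_array.Count % $groupSize
if ($leftover -gt 0) {
    Write-Warning "$leftover trailing item(s) were not grouped"
}

$result.Count          # 2
$result[1].field6      # h
```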

Southwestwardly answered 1/6, 2019 at 13:59 Comment(3)
Except in the case of maintaining compatibility with existing code, I don't think there's really a reason to even consider using ArrayList any more. The ArrayList documentation you linked to states as such: "We don't recommend that you use the ArrayList class for new development. Instead, we recommend that you use the generic List<T> class." Perhaps the difference between the two is not so apparent when using PowerShell as it would be with, say, a compiled language like C#, but I still think it'd be good to not present an obsolete class as a viable alternative. – Officious
@BACON AFAIK the ArrayList is not yet classified obsolete. It is recommended to not use it anymore as there is a better alternative with List. That is exactly the reason why I used List in my code and added the links to both classes in the first line of my answer. – Southwestwardly
While ArrayList has not been marked [Obsolete()] it is obsolete in that there is a newer, better alternative that should be used going forward. The very first thing your answer mentions is ArrayList and then says "perhaps" List[T] would be better. My point is, List[T] was introduced with .NET 2.0 in 2005; why even mention ArrayList at all for non-legacy code in 2019? Just my opinion, but including commented out code and a documentation link for ArrayList, which Microsoft recommends not to use, presents it as a reasonable, equivalent alternative for modern code when it's not. – Officious
