Fastest Way to get a uniquely index item from the property of an array
Asked Answered
P

2

6

Make an array like this which represents what I'm looking for:

$array = @(1..50000).foreach{[PSCustomObject]@{Index=$PSItem;Property1='Hello!';Property2=(Get-Random)}}

What's the fastest way to get the item with Index property '43122'?

Some ideas I had but I feel like there must be a quicker way:

Where pipeline

measure-command {$array | where-object index -eq 43122} | % totalmilliseconds
420.3766

Where method

measure-command {$array.where{$_ -eq 43122}} | % totalmilliseconds
155.1342

Make a hashtable first and query the "index" result. Slow at first but subsequent lookups are faster.

measure-command {$ht = @{};$array.foreach{$ht[$PSItem.index] = $psitem}} | % totalmilliseconds
124.0821

measure-command {$ht.43122} | % totalmilliseconds
3.4076

Is there a faster way than building a hashtable first? Maybe a different .NET array type like some special kind of indexed list that I can store it in initially and then run a method to pull out the item based on the unique property?

Phelloderm answered 20/12, 2019 at 23:4 Comment(3)
from what i can tell, a hashtable is the fastest lookup strategy for single-key lookups. if you need more than that, you likely should look into something like the datatable structure OR something like SQL Lite.Lo
Is there a built-in .NET type like a [List] or something that you can specify the property and it will automatically do that? I want to speed up the hashtable construction if at all possible. To clarify, the use case is pulling a bunch of objects from Exchange Online, where the alias property is known to be unique, and doing a lookup to make some changes. And before you ask, yes I thought to use -Filter, but the round trip on the API calls is waaay slower, it is more performant to pull it all down at once and cache it since I'm working on 90% of the records.Phelloderm
not that i am aware of. the time to build a hashtable does not appear increase as fast as the time to use the .Where() method when the collection size increases. the lookup time with a hashtable is pretty flat ... and i suspect that it is designed to do what you seem to want done.Lo
A
7

Partly thanks to the fact that PowerShell is able to invoke .Net methods, it offers quiet some possibilities to filter objects. At stackoverflow you will find a lot of (PowerShell) questions and answers measuring the performance of a specific extricated command or cmdlet. This usually leaves wrong impression as the performance of a complete (PowerShell) solution is supposed to be better than the sum of its parts. Each command is depended on the expected input and - output. Especially when using the PowerShell pipeline, commands (cmdlets) interact with prior commands and commands that follow. Therefore it is important to look at the bigger picture and understand how and where each command gains its performance.
This means that I can't tell which command you should choose, but with a better understanding of the commands and concepts listed below, I hope you are better able to find "fastest way" for your specific solution.

[Linq.Enumerable]::Where

Language Integrated Query (LINQ) is often (dis)qualified as the fasted solution to filter objects in PowerShell (see also High Performance PowerShell with LINQ):

(Measure-Command {
    $Result = [Linq.Enumerable]::Where($array, [Func[object,bool]] { param($Item); return $Item.Index -eq 43122 })
}).totalmilliseconds
4.0715

Just over 4ms!, none of the other methods can ever beat that...
But before jumping into any conclusions that LINQ beats any other method by a factor 100 or more you should keep the following in mind. There are two pitfalls in measuring the performance of a LINQ query when you just look at the performance of the activity itself:

  • LINQ has a big cache, meaning that you should restart a new PowerShell session to measure the actual results (or just not, if you often want to reuse the query). After restarting the PowerShell session, you will find that it will take about 6 times longer to initiate the LINQ query.
  • But more importantly, LINQ performs lazy evaluation (also called deferred execution). This means that actually nothing has been done yet other than defining what should be done. This actually shows if you want to access one of the properties of the $Result:
(Measure-Command {
    $Result.Property1
}).totalmilliseconds
532.366

Where it usually takes about 15ms to retrieve a property of a single object:

$Item = [PSCustomObject]@{Index=1; Property1='Hello!'; Property2=(Get-Random)}
(Measure-Command {
    $Item.Property1
}).totalmilliseconds
15.3708

Bottom line, you need to instantiate the results to correctly measure the performance of a LINQ query (for this, let's just retrieve one of the properties of the returned object within the measurement):

(Measure-Command {
    $Result = ([Linq.Enumerable]::Where($array, [Func[object,bool]] { param($Item); return $Item.Index -eq 43122 })).Property1
}).totalmilliseconds
570.5087

(which is still fast.)

HashTable

Hash tables are generally fast because they are based on a binary search algorithm, this means that you maximal have to guess ln 50000 / ln 2 = 16 times to find your object. Nevertheless, building a HashTabe for a single lookup is a little over done. But if you control the contruction of the object list, you might construct the hash table on the go:

(Measure-Command {
    $ht = @{}
    $array = @(1..50000).foreach{$ht[$PSItem] = [PSCustomObject]@{Index=$PSItem;Property1='Hello!';Property2=(Get-Random)}}
    $ht.43122
}).totalmilliseconds
3415.1196

vs:

(Measure-Command {
    $array = @(1..50000).foreach{[PSCustomObject]@{Index=$PSItem;Property1='Hello!';Property2=(Get-Random)}}
    $ht = @{}; $array.foreach{$ht[$PSItem.index] = $psitem}
    $ht.43122
}).totalmilliseconds
3969.6451

Where-Object cmdlet vs Where method

As you might already have concluded yourself the Where method appears about twice as fast then the Where-Object cmdlet:

Where-Object cmdlet:

(Measure-Command {
    $Result = $Array | Where-Object index -eq 43122
}).totalmilliseconds
721.545

Where method:

(Measure-Command {
    $Result = $Array.Where{$_ -eq 43122}
}).totalmilliseconds
319.0967

The reason for that is because the Where command requires you load the whole array into memory which is actually not required for the Where-Object cmdlet. If the data is already in memory (e.g. by assigning it to a variable $array = ...) it is not be a big deal but this might actually a disadvantage by itself: except that it consumes memory, you have to wait until all objects are received before you can start filtering...

Don't underestimate the power of the PowerShell cmdlets like Where-Object especially look to the solution as a whole in combination with the pipeline. As shown above, if you just measure on the specific action you might find these cmdlets slow but if you measure your whole end-to-end solution you might find that there isn't much difference and that cmdlets might even outperform methods other techniques. Where LINQ queries are extremely reactive, PowerShell cmdlets are extremely proactive.
In general, if your input is not yet in memory and supplied via the pipeline, you should try to continue to build on that pipeline and avoid stalling it in any way by avoiding variables assignments ($array = ...) and the use of brackets ((...)) :

Presume that your objects come from a slower input, in that case all the other solutions need to wait for the very last object to be able start filtering where the Where-Object has already filtered most of the objects on the fly and as soon it has found it, is indeterminately passed to the next cmdlet...

For example let's presume that the data comes from a csv file rather then memory...

$Array | Export-Csv .\Test.csv

Where-Object cmdlet:

(Measure-Command {
    Import-Csv -Path .\Test.csv | Where-Object index -eq 43122 | Export-Csv -Path .\Result.csv
}).totalmilliseconds
717.8306

Where method:

(Measure-Command {
    $Array = Import-Csv -Path .\Test.csv
    Export-Csv -Path .\Result.csv -InputObject $Array.Where{$_ -eq 43122}
}).totalmilliseconds
747.3657

This is just a single test example, but in most cases where the data isn't instantly available in memory, Where-Object streaming appears to be a often faster then using the Where method.
Besides, the Where method uses a lot more memory which might make performance even worse if your file (list of objects) size exceeds the available physical memory. (See also: Can the following Nested foreach loop be simplified in PowerShell?).

ForEach-Object cmdlet vs ForEach method vs ForEach comand

Instead of using the Where-Object cmdlet or the Where method, you might consider to iterate through all the objects and just compare them with an If statement. Before going into depth of this approach it is worth mentioning that comparison operators already iterate through the left argument by itself, quote:

When the input to an operator is a scalar value, comparison operators return a Boolean value. When the input is a collection of values, the comparison operators return any matching values. If there are no matches in a collection, comparison operators return an empty array.

This means that if you just want to know whether the object with the specific property exists and don't care about the object itself, you might just simply compare the specific property collection:

(Measure-Command {
    If ($Array.Index -eq 43122) {'Found object with the specific property value'}
}).totalmilliseconds
55.3483

For the ForEach-Object cmdlet and the ForEach method, you will see that the approach just takes a little longer then using their counterparts (Where-Object cmdlet and the Where method) as there is a little more overhead for the embedded comparison:

Directly from memory:
ForEach-Object cmdlet:

(Measure-Command {
    $Result = $Array | ForEach-Object {If ($_.index -eq 43122) {$_}}
}).totalmilliseconds
1031.1599

ForEach method:

(Measure-Command {
    $Result = $Array.ForEach{If ($_.index -eq 43122) {$_}}
}).totalmilliseconds
781.6769

Streaming from disk:
ForEach-Object cmdlet:

(Measure-Command {
    Import-Csv -Path .\Test.csv |
    ForEach-Object {If ($_.index -eq 43122) {$_}} |
    Export-Csv -Path .\Result.csv
}).totalmilliseconds
1978.4703

ForEach method:

(Measure-Command {
    $Array = Import-Csv -Path .\Test.csv
    Export-Csv -Path .\Result.csv -InputObject $Array.ForEach{If ($_.index -eq 43122) {$_}}
}).totalmilliseconds
1447.3628

ForEach command But even with the embeded comparison, the ForEach command appears close to the performance of using the Where method when the $Array is already available in memory:

Directly from memory:

(Measure-Command {
    $Result = $Null
    ForEach ($Item in $Array) {
        If ($Item.index -eq 43122) {$Result = $Item}
    }
}).totalmilliseconds
382.6731

Streaming from disk:

(Measure-Command {
    $Result = $Null
    $Array = Import-Csv -Path .\Test.csv
    ForEach ($Item in $Array) {
        If ($item.index -eq 43122) {$Result = $Item}
    }
    Export-Csv -Path .\Result.csv -InputObject $Result
}).totalmilliseconds
1078.3495

But there might be another advantage of using the ForEach command if you only looking for one (or the first) occurrence: you can Break out of the loop once you have found the object and with that simply skip the rest of the array iteration. In other words, if the item appears at the end, there might not be much of a difference but if it appears at the beginning you have a lot to win. To level this, I have taken the average index (25000) for the search:

(Measure-Command {
    $Result = $Null
    ForEach ($Item in $Array) {
        If ($item.index -eq 25000) {$Result = $Item; Break}
    }
}).totalmilliseconds
138.029

Note that you can't use the Break statement for the ForEach-Object cmdlet and ForEach method, see: How to exit from ForEach-Object in PowerShell

Breaking out (added 2023-05-23)

As commented by ili it is although not completely impossible to break out a pipeline using the Select-Object cmdlet with the -First parameter which actually breaks the pipeline when the specified number of pipeline items have been processed, E.g.:

$Array | Where-Object index -eq 43122 | Select-Object -First 1

(see also: Using continue/return statement inside .ForEach() method - is it better to use foreach ($item in $collection) instead?)

As well for the Where() method which has an optional argument that allows additional selection capabilities as limiting how many items are returned from the filter, e.g.:

$array.where({$_ -eq 43122}, 'First', 1)

Conclusion

Purely looking at the tested commands and making a few assumptions like:

  • The input isn't a bottleneck (the $Array is already resident in memory)
  • The output isn't a bottleneck (the $Result isn't actually used)
  • You only need one (the first) occurrence
  • There is nothing else to do prior, after and within the iteration

Using the ForEach command and simply comparing each index property until you find the object, appears to be the fastest way in the given/assumed boundaries of this question but as stated at the beginning; to determine what is fastest for your used case, you should understand what you doing and look at the whole solution and not just a part.

Related:

Acuminate answered 21/12, 2019 at 14:53 Comment(1)
Small correction pipelines can be stopped with ` | Select-Object -First 1`, don't know why it's so low in the post linked https://mcmap.net/q/22789/-how-to-exit-from-foreach-object-in-powershellSprag
G
0

The fastest way I think is to use a Hashtable and take it for granted that building this would take some time. Also, I would reverse the Hashtable, so that the property you want to seek is the key and the array indexd will be the value.

Note that while your example creates an array with start index 1, you need to account for that when retrieving the exact index (starting at 0) later. Also note that by using (Get-Random) for the property to search for leaves you with possible duplicate values. For demo this is fine, but remember that while doing so, the index found will be the last index in the series of duplicates..

# create the demo array of objects
$startIndex = 0
$array = @($startIndex..50000).Foreach{[PSCustomObject]@{Index=$PSItem; Property1='Hello!'; Property2=(Get-Random)}}

# create the hashtable where Property2 is the key and the array index the value
Write-Host 'Create HashTable: ' -NoNewline
(Measure-Command { $ht = @{}; foreach ($i in $array) { $ht[$i.Property2] = ($i.Index - $startIndex) } }).TotalMilliseconds

# try and find the index. This will take longer if there was no Property2 with value 43122 
Write-Host 'Find array index: ' -NoNewline
(Measure-Command { $ht[43122] }).TotalMilliseconds

Output on my Windows 7 machine (12 GB RAM, SSD disk):

Create HashTable: 250.3011
Find array index: 0.3865
Gelinas answered 21/12, 2019 at 11:13 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.