Find what is different in two very large lists

I have two lists of about 1,000 people each. What I want to do is find who is left over between the two.

$BunchoEmail = Import-Csv C:\temp\Directory.csv | Select-Object primaryEmail -ExpandProperty primaryEmail

$GoogleUsers = gam print users fields suspended | ConvertFrom-Csv | Where-Object suspended -ne $true | Select-Object primaryEmail -ExpandProperty primaryEmail

$objects = @{
    ReferenceObject  = $GoogleUsers
    DifferenceObject = $BunchoEmail
}
Compare-Object @objects

The code above didn't produce what I wanted.

What is the best way to find what is different?

Effectuate answered 22/7, 2021 at 19:29 Comment(0)

Load each list into a [hashtable]:

$emailTable = @{}
$BunchoEmail |ForEach-Object { $emailTable[$_] = $_ }

$gsuiteTable = @{}
$GoogleUsers |ForEach-Object { $gsuiteTable[$_] = $_ }

Now you can iterate over one list and check whether the other doesn't contain any particular email addresses with Where-Object:

$notInGSuite = $BunchoEmail |Where-Object { -not $gsuiteTable.ContainsKey($_) }

$notInEmailList = $GoogleUsers |Where-Object { -not $emailTable.ContainsKey($_) }

A ContainsKey() lookup on a hashtable takes O(1) time on average, so the total work grows only linearly with the size of the lists, and the approach keeps working well for lists with thousands of emails.
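
As a quick sanity check, here is a minimal end-to-end sketch of the approach with two small made-up lists (the addresses and the $listA / $listB names are placeholders, not taken from the question). Note that @{} creates a hashtable whose string keys are matched case-insensitively, so Bob@example.org and bob@example.org count as the same address:

```powershell
# Hypothetical sample data standing in for $BunchoEmail and $GoogleUsers.
$listA = 'ann@example.org', 'bob@example.org', 'cy@example.org'
$listB = 'Bob@example.org', 'cy@example.org', 'dee@example.org'

# Build one lookup table per list (keys are matched case-insensitively).
$tableA = @{}
$listA | ForEach-Object { $tableA[$_] = $_ }

$tableB = @{}
$listB | ForEach-Object { $tableB[$_] = $_ }

# Entries that appear in only one of the two lists.
$onlyInA = $listA | Where-Object { -not $tableB.ContainsKey($_) }
$onlyInB = $listB | Where-Object { -not $tableA.ContainsKey($_) }

$onlyInA   # ann@example.org
$onlyInB   # dee@example.org
```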

Wilie answered 22/7, 2021 at 19:36 Comment(0)

Compare-Object is capable of finding what elements are missing from one collection relative to the other, vice versa, or both.

That said, it can be slow, and since you mention large lists, it sounds like you're looking for a solution that performs well.

  • However, collections with 1,000 items are likely not a problem in practice.

  • Something like the following may therefore be sufficient to get all entries in $BunchoEmail that aren't also in $GoogleUsers (substitute => for <= to reverse the logic):

    (Compare-Object -PassThru $BunchoEmail $GoogleUsers).
      Where({ $_.SideIndicator -eq '<=' })
    
  • Getting those entries that aren't in both collections (that are unique to either collection) is even easier:

    Compare-Object -PassThru $BunchoEmail $GoogleUsers
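
To make the SideIndicator values concrete, here is a small sketch with made-up lists (the sample addresses and the $listA / $listB names are placeholders). '<=' marks items found only in the first (reference) collection, '=>' items found only in the second (difference) collection:

```powershell
# Hypothetical sample data.
$listA = 'ann@example.org', 'bob@example.org', 'cy@example.org'
$listB = 'bob@example.org', 'cy@example.org', 'dee@example.org'

# Without -PassThru, each difference is wrapped in an object with
# InputObject and SideIndicator properties.
$diff = Compare-Object $listA $listB

# Split the result by side.
$onlyInA = $diff.Where({ $_.SideIndicator -eq '<=' }).InputObject
$onlyInB = $diff.Where({ $_.SideIndicator -eq '=>' }).InputObject

$onlyInA   # ann@example.org
$onlyInB   # dee@example.org
```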
    

As for improving performance:

Combining type [System.Collections.Generic.HashSet`1] with LINQ enables a fast and concise solution:

Note:

  • Use of HashSet implies that the results are reported in no particular order; to get them in sorted order, use [System.Collections.Generic.SortedSet[string]] instead. (There is no built-in type for maintaining the insertion order as of .NET 6).

  • The solutions below are true set operations, i.e. they report distinct differences, unlike Compare-Object. E.g., if an email address that is unique to one collection is present twice in that collection, the solutions below report it only once, whereas Compare-Object reports both instances.

  • Unlike Compare-Object, the HashSet and SortedSet types are case-sensitive by default; you can pass an equality comparer to the constructor for case-insensitive behavior, using System.StringComparer; e.g.:

    [System.Collections.Generic.HashSet[string]]::new(
      [string[]] ('foo', 'FOO'),
      [System.StringComparer]::InvariantCultureIgnoreCase
    )
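
A small sketch of what the comparer changes, using placeholder strings and hypothetical variable names: with the default comparer the two spellings are distinct elements, whereas with a case-insensitive comparer they collapse into one.

```powershell
$values = [string[]] ('foo', 'FOO')

# Default comparer: case-sensitive, so both spellings are kept.
$caseSensitive = [System.Collections.Generic.HashSet[string]]::new($values)

# Case-insensitive comparer: the second spelling is treated as a duplicate.
$caseInsensitive = [System.Collections.Generic.HashSet[string]]::new(
  $values,
  [System.StringComparer]::InvariantCultureIgnoreCase
)

$caseSensitive.Count              # 2
$caseInsensitive.Count            # 1
$caseInsensitive.Contains('Foo')  # True
```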
    

To get all entries in $BunchoEmail that aren't also in $GoogleUsers, use [System.Linq.Enumerable]::Except() (reverse the operands for the inverse solution):

[Linq.Enumerable]::Except(
  [System.Collections.Generic.HashSet[string]] $BunchoEmail,
  [System.Collections.Generic.HashSet[string]] $GoogleUsers
)

Note: You could also use a hash set's .ExceptWith() method, but that requires storing one of the hash sets in an auxiliary variable, which is then updated in place - analogous to the .SymmetricExceptWith() solution below.
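
For comparison, a minimal sketch of the .ExceptWith() variant mentioned in the note, with made-up sample lists; the $aux variable name is an assumption. The method returns nothing and modifies the calling set in place:

```powershell
# Hypothetical sample data.
$listA = [string[]] ('ann@example.org', 'bob@example.org', 'cy@example.org')
$listB = [string[]] ('bob@example.org', 'cy@example.org', 'dee@example.org')

# Copy the first list into an auxiliary set ...
$aux = [System.Collections.Generic.HashSet[string]]::new($listA)

# ... and remove every element that also occurs in the second list.
$aux.ExceptWith($listB)

$aux   # ann@example.org
```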

Getting those entries that aren't in both collections (that are unique to either collection, called the symmetric difference in set terms) requires a bit more effort, using a hash set's .SymmetricExceptWith() method:

# Load one of the collections into an auxiliary hash set.
$auxHashSet = [System.Collections.Generic.HashSet[string]] $BunchoEmail

# Determine the symmetric difference between the two sets, which
# updates the calling set in place.
$auxHashSet.SymmetricExceptWith(
  [System.Collections.Generic.HashSet[string]] $GoogleUsers
)

# Output the result
$auxHashSet
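
If sorted output matters, the same pattern should carry over to the SortedSet type mentioned in the notes above. A sketch with placeholder data and a hypothetical $sorted variable:

```powershell
# Hypothetical sample data, deliberately unsorted.
$listA = [string[]] ('cy@example.org', 'ann@example.org', 'bob@example.org')
$listB = [string[]] ('dee@example.org', 'bob@example.org', 'cy@example.org')

# Load one list into a sorted set, then compute the symmetric
# difference in place.
$sorted = [System.Collections.Generic.SortedSet[string]]::new($listA)
$sorted.SymmetricExceptWith($listB)

$sorted   # ann@example.org, then dee@example.org
```
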
Edging answered 22/7, 2021 at 20:37 Comment(3)
This is perfect. I just noticed that [System.Collections.Generic.HashSet[string]] also handles duplicate items in the collection without throwing an exception, which is very nice. - Isolde
Indeed, @SantiagoSquarzon, it creates a true set, and you can also call .Add() repeatedly without getting an exception (the [bool] return value indicates whether the element was newly added; $false means it was already present). - Edging
Excellent question, @SantiagoSquarzon: you can - please see my update. - Edging
