Merge two CSV files while adding new and overwriting existing entries
I have a configuration.csv that holds template data like this:

| path       | item  | value  | type |
|------------|-------|--------|------|
| some/path  | item1 | value1 | ALL  |
| some/path  | item2 | UPDATE | ALL  |
| other/path | item1 | value2 | SOME |

and customization.csv that has service specific configuration:

| path       | item  | value  | type |
|------------|-------|--------|------|
| some/path  | item2 | value3 | ALL  |
| new/path   | item3 | value3 | SOME |

My goal is to merge them and end up with something like this:

| path       | item  | value  | type |
|------------|-------|--------|------|
| some/path  | item1 | value1 | ALL  |
| some/path  | item2 | value3 | ALL  |
| other/path | item1 | value2 | SOME |
| new/path   | item3 | value3 | SOME |

This should add any new entries and update any existing ones. No single column can be used for unique identification; path and item need to be combined, and that combination is guaranteed to be unique.

Roommate answered 1/3, 2019 at 15:53 Comment(1)
Does this answer your question? In PowerShell, what's the best way to join two tables into one? – Inbred

I suggest using Compare-Object; since the values from customization.csv shall persist, use that file's rows as the -ReferenceObject:

## Q:\Test\2019\03\01\SO_54948111.ps1

$conf = Import-Csv '.\configuration.csv'
$cust = Import-Csv '.\customization.csv'

$NewData = Compare-Object -ReferenceObject $cust -DifferenceObject $conf -Property path,item -PassThru -IncludeEqual |
    Select-Object -Property * -ExcludeProperty SideIndicator

$NewData
$NewData | Export-Csv '.\NewData.csv' -NoTypeInformation

Sample output

> Q:\Test\2019\03\01\SO_54948111.ps1

path       item  value  type
----       ----  -----  ----
some/path  item2 value3 ALL
some/path  item1 value1 ALL
other/path item1 value2 SOME
new/path   item3 value3 SOME
Slim answered 1/3, 2019 at 16:33 Comment(1)
Great solution. I've used Compare-Object in the past but obviously not enough to notice all the useful tricks. – Roommate

After a lot of searching, I figured the easiest way to manipulate the entries without recreating the managing framework would be through a hashtable. During the process I had to account for two edge cases:

  1. additional commas in the values
  2. empty values

The final solution I got is this:

$configuration = Import-Csv .\configuration.csv
$customization = Import-Csv .\customization.csv
$merged = New-Object System.Collections.ArrayList
$hashTable = @{}

#initializing the hashTable with the defaults
foreach ($entry in $configuration)
{
    $hashTable[$entry.path + ',' + $entry.item] = $entry.value + ',' + $entry.type
}

#updating the hashTable with customization that add or overwrite existing entries
foreach ($entry in $customization)
{
    $hashTable[$entry.path + ',' + $entry.item] = $entry.value + ',' + $entry.type
}

#the regex handles values containing commas as well as empty values.
#-split returns an empty string before and after the captured groups, so indexing starts at 1
foreach ($key in $hashTable.keys)
{
    $psobject = [PSCustomObject]@{
        path  = ($key -split '(.*),(.*)')[1]
        item  = ($key -split '(.*),(.*)')[2]
        value = ($hashTable[$key] -split '(.*),(.*)')[1]
        type  = ($hashTable[$key] -split '(.*),(.*)')[2]
    }
    [void] $merged.Add($psobject)
}
Write-Output $merged

Once imported, I transform configuration.csv into a hashtable whose keys are built from path and item (and whose values from value and type). I then run customization.csv through the same hashtable, which overwrites any existing keys or adds them as new.

The third loop converts the hashtable back into PSCustomObjects similar to what Import-Csv produces. I split each key and value string while accounting for embedded commas and empty values.
NOTE: the regex will split on the last occurrence of the separator (here it is a comma, but you can choose anything, really). If you want to split on the first occurrence, use (.*?),(.*) instead. In my case only the value column could contain an instance of the separator.
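A quick sketch of the two split behaviors (greedy vs. lazy), for illustration:

```powershell
# Greedy (.*) consumes as much as possible, so the split happens at the LAST comma:
('a,b,c' -split '(.*),(.*)')[1..2]   # -> 'a,b' and 'c'

# Lazy (.*?) matches as little as possible, splitting at the FIRST comma:
('a,b,c' -split '(.*?),(.*)')[1..2]  # -> 'a' and 'b,c'
```

Because the pattern contains capture groups, -split also returns the captured substrings; index 0 and the last index are the empty strings surrounding the match.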

If the CSV had a unique column, then a solution similar to this answer could've been used.
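For example, with a hypothetical unique id column, whole row objects could be stored in the hashtable directly, avoiding the string joining and splitting entirely (a sketch, assuming the same Import-Csv variables as above):

```powershell
$byId = @{}
foreach ($entry in $configuration) { $byId[$entry.id] = $entry }   # defaults
foreach ($entry in $customization) { $byId[$entry.id] = $entry }   # overwrite or add
$merged = $byId.Values
```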

Another alternative is to set the keys to the concatenation of all columns, which will also filter out any duplicate rows in the CSV, but the splitting can get tricky depending on the values in the columns.
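If separator collisions are a concern, the same approach works with a non-printable unit separator (ASCII 31) instead of a comma, so commas inside values cannot break the key format. A minimal sketch, assuming the same Import-Csv variables as in the solution above; the separator choice is an assumption:

```powershell
$sep = [char]31   # unit separator; very unlikely to appear in CSV values
$hashTable = @{}
foreach ($entry in $configuration + $customization) {
    # later (customization) entries overwrite earlier (configuration) ones
    $hashTable[$entry.path + $sep + $entry.item] = $entry.value + $sep + $entry.type
}
$merged = foreach ($key in $hashTable.Keys) {
    $k = $key -split $sep
    $v = $hashTable[$key] -split $sep
    [PSCustomObject]@{ path = $k[0]; item = $k[1]; value = $v[0]; type = $v[1] }
}
```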

Roommate answered 1/3, 2019 at 15:53 Comment(3)
Nice use of hash tables, which are indeed fast thanks to hashed key lookups, but I am afraid that your solution (where you treat the related items as strings) might not hold as a reusable solution. It will probably break if one of your files contains a single quote (like Frank's File.txt) or the comma you use as separator... – Leucomaine
I see your concern. In my case I tested for multiple commas, but only in the left variable (path or value). The other two are not expected to have commas. As for the single quote, I wasn't able to produce a case where it would break the script; it is handled properly as a string. – Roommate
My comment on the quotes is incorrect (as the key string will not be expanded), but the concern about the comma stands. For this you might actually consider using a non-printable character to separate the fields instead, e.g. $entry.path + [char]31 + $entry.item (it is very unlikely that it will be used in any property name). – Leucomaine

Your idea 'using the same hashTable which overwrites any existing key values or add them as new' will only work if the path, item combination is unique on each side, as you will otherwise also overwrite any duplicates... Consider this Join-Object cmdlet.

$configuration = ConvertFrom-SourceTable '

| path       | item  | value  | type |
|------------|-------|--------|------|
| some/path  | item1 | value1 | ALL  |
| some/path  | item2 | UPDATE | ALL  |
| other/path | item1 | value2 | SOME |
| other/path | item1 | value3 | ALL  |
'

$customization= ConvertFrom-SourceTable '

| path       | item  | value  | type |
|------------|-------|--------|------|
| some/path  | item2 | value3 | ALL  |
| new/path   | item3 | value3 | SOME |
| new/path   | item3 | value4 | ALL  |
'

Using the Merge-Object, alias Merge, proxy command (see help):

$configuration | Merge $customization -on path, item

path       item  value  type
----       ----  -----  ----
some/path  item1 value1 ALL
some/path  item2 value3 ALL
other/path item1 value2 SOME
other/path item1 value3 ALL
new/path   item3 value3 SOME
new/path   item3 value4 ALL
Leucomaine answered 1/3, 2019 at 16:51 Comment(3)
I saw the cmdlet during my research, but I wanted a solution that is self-sufficient and does not rely on external dependencies. – Roommate
Your idea 'using the same hashTable which overwrites any existing key values or add them as new' will only work if the path, item combination is unique on each side. See the results in the updated example of my answer and: https://mcmap.net/q/22799/-efficiently-merge-large-object-datasets-having-mulitple-matching-keys – Leucomaine
That was my initial requirement in the question: to update the values for any existing paths. As this is a configuration file, I need each entry to be unique and not repeated. Your solution is great, but for a different requirement. Thanks for pointing out the other questions. – Roommate

© 2022 - 2024 — McMap. All rights reserved.