How to Modify Large CSV with PowerShell without using all Server Memory
I am using PowerShell to do some data modification on CSV files before importing them into Oracle. I have watched the resource monitor while the process is running, and the process chews up all 20 GB of available memory on the server. One of my CSVs is roughly 90 MB, with almost 200 columns and 100K rows. The generated CSV is about 120 MB. Here is the code I am currently using:

# Process Configuration File
$path = $PSScriptRoot + "\"

#Set Extraction Date-Time in format for Oracle Timestamp with TZ
$date = Get-Date -Format "yyyy-MM-dd HH:mm:ss K"

Import-Csv -Path ($path + 'documents.csv') -Encoding UTF8 |
   # Convert Date Time values that are always populated
   % {$_.document_creation_date__v = ([datetime]($_.document_creation_date__v)).ToString('yyyy-MM-dd HH:mm:ss K');$_} |
   % {$_.version_creation_date__v = ([datetime]($_.version_creation_date__v)).ToString('yyyy-MM-dd HH:mm:ss K');$_} |
   % {$_.version_modified_date__v = ([datetime]($_.version_modified_date__v)).ToString('yyyy-MM-dd HH:mm:ss K');$_} |

   # Convert DateTime values that may be blank
   % {if($_.binder_last_autofiled_date__v -gt ""){$_.binder_last_autofiled_date__v = ([datetime]($_.binder_last_autofiled_date__v)).ToString('yyyy-MM-dd HH:mm:ss K')};$_} |
   % {if($_.locked_date__v -gt ""){$_.locked_date__v = ([datetime]($_.locked_date__v)).ToString('yyyy-MM-dd HH:mm:ss K')};$_} |

   # Fix Multi-Select Picklist fields, replacing value divider with "|"
   % {$_.clinical_data__c = ((($_.clinical_data__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.composition_formulation_ingredients__c = ((($_.composition_formulation_ingredients__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.content_category__c = ((($_.content_category__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.crm_disable_actions__v = ((($_.crm_disable_actions__v).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.indication_dosage_administration__c = ((($_.indication_dosage_administration__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.pharmacodynamics_and_pharmacokinetics__c = ((($_.pharmacodynamics_and_pharmacokinetics__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.indication__c = ((($_.indication__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.rights_channel__v = ((($_.rights_channel__v).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.rights_language__v = ((($_.rights_language__v).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.safety__c = ((($_.safety__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.special_population__c = ((($_.special_population__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.storage_stability__c = ((($_.storage_stability__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.ta_subcategory__c = ((($_.ta_subcategory__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.tags__v = ((($_.tags__v).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.user_groups__c = ((($_.user_groups__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.vaccines__c = ((($_.vaccines__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.channels__c = ((($_.channels__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.material_type__c = ((($_.material_type__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |
   % {$_.target_audience__c = ((($_.target_audience__c).Replace(',,','~comma~')).Replace(',','|')).Replace('~comma~',',');$_} |

   # Trim values that can be too long
   % {$_.product__v = ($_.product__v)[0..254] -join "";$_} |

   # Add ExtractDate Column
   Select-Object *,@{Name='Extract_Date';Expression={$date}} |

   #Export Results
   Export-Csv ($path + 'VMC_DOCUMENTS.csv') -NoTypeInformation -Encoding UTF8

Is there a more efficient way to modify large CSV files with PowerShell than what I am currently doing? The process takes roughly 10 minutes to complete. I am by no means a PowerShell guru and built my scripts based on info from this site and the MS PowerShell documentation. Any suggestions would be greatly appreciated.

Here is data to create a sample documents.csv with a single record:

allow_pdf_download__v,allow_source_download__v,annotations_all__v,annotations_anchors__v,annotations_lines__v,annotations_links__v,annotations_notes__v,annotations_resolved__v,annotations_unresolved__v,associated_content_notes__c,author__c,batch_number__v,binder__v,binder_created_from__v,binder_last_autofiled_by__v,binder_last_autofiled_date__v,binder_locked__v,binder_metadata__v,bound_source_major_version__v,bound_source_minor_version__v,classification__v,clinical_data__c,composition_formulation_ingredients__c,content_category__c,copyright__c,copyright_license_expiration__c,copyright_owner__c,copyright_title__c,country__v,created_by__v,crosslink__v,date_permissions_obtained__c,decision_date__c,description_of_copyrighted_content__c,detail_group__v,disclaimer__c,document_creation_date__v,document_fit__v,document_host_url__v,document_number__v,source_type__c,dossier_type__c,duration_of_use__c,email_domain__v,email_template_type__v,expiration_date__c,external_id__v,extra_scientific_content__c,filename__v,format__v,from_address__v,from_name__v,ftp_source_location__v,grant_type__c,id,indication_disease__c,indication_dosage_administration__c,intended_use__c,language__c,last_modified_by__v,latest_source_major_version__v,latest_source_minor_version__v,latest_version__v,legacy_document_number__c,legal_approval_form__c,legal_approval_required__c,lifecycle__v,link_status__v,locked__v,locked_by__v,locked_date__v,major_version_number__v,md5checksum__v,members_of_public__c,minor_version_number__v,name__v,obtained_by__c,one_of_use__c,other__c,pages__v,payment_amount_usd__c,payment_date__c,payment_made__c,permissions_fee__c,pharmacodynamics_and_pharmacokinetics__c,product__v,public_content__v,publication_date__c,reapproval_cycle_count__c,reapproval_date__c,reason_for_iactivation__c,region_code__c,rendition_black_list_flag__v,reply_to_address__v,reply_to_name__v,response_type__c,restrict_fragments_by_product__v,restricted_countries__c,rights_channel__v,rights_countries__v,rights_expiration_date__v,rights_language__v,rights_other__v,rights_resource_type__v,safety__c,size__v,source__c,source_binding_rule__v,source_document_id__v,source_document_name__v,source_document_number__v,source_owner__v,source_vault_id__v,source_vault_name__v,special_population__c,start_date__c,status__v,storage_stability__c,subject__v,submission_date__c,subtype__v,tags__v,target__c,target_description__c,template_doctype__v,territory__v,therapeutic_area__c,title__v,type__v,use_location__c,user_groups__c,vaccines__c,version_created_by__v,version_creation_date__v,version_id,version_modified_date__v,clm_content__v,clm_id__v,crm_custom_reaction__v,crm_directory__v,crm_disable_actions__v,crm_enable_survey_overlay__v,crm_end_date__v,crm_hidden__v,crm_segment__v,crm_start_date__v,crm_survey__v,crm_training__v,engage_html_filename__v,cdn_content__v,check_consent__v,production_cdn_url__v,crm_product__v,ta_subcategory__c,notify_msls_of_significant_update__c,global_id__sys,global_version_id__sys,link__sys,version_link__sys,activity_end_date__c,activity_name__c,activity_start_date__c,activity_type__c,business_owner__c,channels__c,material_type__c,objective__c,proactive__c,target_audience__c,indication__c
"00W000000000101",,0,0,0,0,0,0,0,,,,false,,,,false,,,,,"Immunogenicity",,"Clinical Data,Special Population",false,,,,"00C000000000389",1436711,false,,,,,,2018-05-15T09:03:51.000Z,"Fit Width",,MED--TST-1923,,,,,,2020-06-10,2634,,Test.docx,application/vnd.openxmlformats-officedocument.wordprocessingml.document,,,,,10000,"Vaccines",,,,1,,,false,TST50316,,,Advanced LC,,false,,,3,398ea1bf3682f8c8e51cde5bd133bb73,false,0,Use of XXXXXXXXXXXXXXXX vaccine recombinant in Transplant Patients,,false,,4,,,,,,"00P000000001F36",true,,1,2018-08-31,,,false,,,,,,,,,,,,,16815,,,,,,,,,,,Expired,,,,Global Response,,,,,,,Use of XXXXXXXXXXX vaccine recombinant in Transplant Patients,Global Content (Advanced),,,,1436711,2018-05-15T09:03:51.000Z,10000_3_0,2020-07-02T13:17:11.000Z,false,,,,,false,,false,,,,false,,false,,,,,,23108_10000,23108_10000_19347,,,,,,,,,,,,,
Bloated answered 13/10, 2022 at 14:47 Comment(1)
All the % (ForEach-Object) calls are probably pretty intensive and expensive (I would just create a new [PSCustomObject]@{ document_creation_date__v = ([datetime]($_.document_creation_date__v)).ToString('yyyy-MM-dd HH:mm:ss K'); .... Besides, the manipulation with the commas is odd (try to keep your code DRY); I would post a separate question about what your general "comma" issue is and what you are trying to achieve.Lx
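The DRY suggestion in the comment above can be sketched as a single loop over the affected column names instead of one pipeline stage per column. The column list below is deliberately abbreviated and illustrative; extend it with the remaining picklist fields:

```powershell
# Sketch: apply the same picklist fix to every affected column in one loop.
# $picklistColumns is abbreviated here - add the other picklist fields.
$picklistColumns = 'clinical_data__c', 'content_category__c', 'tags__v'

Import-Csv .\documents.csv | ForEach-Object {
    foreach ($col in $picklistColumns) {
        # Protect escaped commas (",,"), swap the divider for "|", then restore.
        $_.$col = $_.$col.Replace(',,', '~comma~').Replace(',', '|').Replace('~comma~', ',')
    }
    $_
} | Export-Csv .\VMC_DOCUMENTS.csv -NoTypeInformation -Encoding UTF8
```

PowerShell's `$_.$col` syntax resolves the property name dynamically, which is what makes the single loop possible.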

PowerShell's Import-Csv cmdlet is a known memory hog, primarily due to the high memory requirements of the [pscustomobject] instances that it constructs - see GitHub issue #7603.

There are several mitigation strategies, in ascending order of complexity:

  • In your ForEach-Object (%) script block (you should combine your separate % calls into one), force garbage collection every, say, 1000 objects to relieve memory pressure.

    • As Santiago Squarzon points out, the inefficient implementation of ForEach-Object - as of PowerShell 7.2.x, see GitHub issue #10982 - exacerbates the problem, both with respect to memory consumption and runtime.

    • See the code below, which combines periodic garbage collection with . { process { ... } } as a faster and memory-friendlier alternative to ForEach-Object.[1]

  • Use a custom PowerShell class to represent your CSV rows, but note that this will increase execution time.

    • See this answer for an example.

    • GitHub issue #8862 proposes building this functionality into Import-Csv, so as to make it construct instances of a given type in lieu of [pscustomobject]s to begin with.
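A minimal sketch of the class-based approach, assuming PowerShell 5.0+ class support; `CsvRow` and its abbreviated property list are illustrative names, not the full ~200 columns:

```powershell
# Sketch: a class with fixed, typed properties is more memory-compact than a
# [pscustomobject] carrying ~200 per-instance note properties.
class CsvRow {
    [string] $document_creation_date__v
    [string] $clinical_data__c
    [string] $product__v
    # ... one property per remaining CSV column ...
}

Import-Csv .\documents.csv | ForEach-Object {
    $row = [CsvRow]::new()
    $row.document_creation_date__v = $_.document_creation_date__v
    $row.clinical_data__c          = $_.clinical_data__c
    $row.product__v                = $_.product__v
    $row   # emit the compact object; the [pscustomobject] can then be collected
}
```

Copying properties explicitly, as above, avoids relying on PowerShell's object-to-class conversion, which errors out if the source object has properties the class doesn't declare.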

  • If the above approaches turn out to be too slow, you'll need to resort to a third-party .NET parser library such as CsvHelper.

    • See this blog post with comparative benchmarks and links to many libraries, and the answers to this SO question (focused on C#).

    • Unfortunately, using general-purpose .NET NuGet packages is cumbersome in PowerShell as of v7.2.x. This answer shows what is currently needed. GitHub issue #6724 asks for a future Add-Type improvement for direct support of NuGet packages.
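Short of a NuGet package, one quote-aware .NET CSV reader is already at hand: the Microsoft.VisualBasic assembly's TextFieldParser, which ships with Windows PowerShell and (as of .NET Core 3.0+) is also available in PowerShell 7. A minimal sketch:

```powershell
# Quote-aware, streaming CSV reading without a NuGet package.
Add-Type -AssemblyName Microsoft.VisualBasic

$parser = [Microsoft.VisualBasic.FileIO.TextFieldParser]::new("$PSScriptRoot\documents.csv")
$parser.TextFieldType = [Microsoft.VisualBasic.FileIO.FieldType]::Delimited
$parser.SetDelimiters(',')
$parser.HasFieldsEnclosedInQuotes = $true

$header = $parser.ReadFields()          # first row: the column names
while (-not $parser.EndOfData) {
    $fields = $parser.ReadFields()      # string[] - one element per column
    # ... transform $fields by index here and write them out ...
}
$parser.Close()
```

Because each row is just a string[] rather than a [pscustomobject], memory per row is far lower, at the cost of addressing columns by index (or by position looked up in $header).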


Here's a streamlined formulation of your code that implements periodic garbage collection to relieve memory pressure:

# Process Configuration File
$path = $PSScriptRoot + '\'

#Set Extraction Date-Time in format for Oracle Timestamp with TZ
$date = Get-Date -Format "yyyy-MM-dd HH:mm:ss K"

# See above for why . { process { ... } } is used in lieu of % { ... }
$i = 0
Import-Csv -Path ($path + 'documents.csv') -Encoding UTF8 | . {
    process {

      # Perform garbage collection every 1000 objects 
      # in order to relieve memory pressure.
      if (++$i % 1000 -eq 0) { [GC]::Collect() }

      # Convert Date Time values that are always populated
      $_.document_creation_date__v = ([datetime]($_.document_creation_date__v)).ToString('yyyy-MM-dd HH:mm:ss K')
      $_.version_creation_date__v = ([datetime]($_.version_creation_date__v)).ToString('yyyy-MM-dd HH:mm:ss K')
      $_.version_modified_date__v = ([datetime]($_.version_modified_date__v)).ToString('yyyy-MM-dd HH:mm:ss K')

      # Convert DateTime values that may be blank
      if ($_.binder_last_autofiled_date__v -gt "") { $_.binder_last_autofiled_date__v = ([datetime]($_.binder_last_autofiled_date__v)).ToString('yyyy-MM-dd HH:mm:ss K') }
      if ($_.locked_date__v -gt "") { $_.locked_date__v = ([datetime]($_.locked_date__v)).ToString('yyyy-MM-dd HH:mm:ss K') }

      # Fix Multi-Select Picklist fields, replacing value divider with "|"
      $_.clinical_data__c = ((($_.clinical_data__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.composition_formulation_ingredients__c = ((($_.composition_formulation_ingredients__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.content_category__c = ((($_.content_category__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.crm_disable_actions__v = ((($_.crm_disable_actions__v).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.indication_dosage_administration__c = ((($_.indication_dosage_administration__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.pharmacodynamics_and_pharmacokinetics__c = ((($_.pharmacodynamics_and_pharmacokinetics__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.indication__c = ((($_.indication__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.rights_channel__v = ((($_.rights_channel__v).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.rights_language__v = ((($_.rights_language__v).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.safety__c = ((($_.safety__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.special_population__c = ((($_.special_population__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.storage_stability__c = ((($_.storage_stability__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.ta_subcategory__c = ((($_.ta_subcategory__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.tags__v = ((($_.tags__v).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.user_groups__c = ((($_.user_groups__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.vaccines__c = ((($_.vaccines__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.channels__c = ((($_.channels__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.material_type__c = ((($_.material_type__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')
      $_.target_audience__c = ((($_.target_audience__c).Replace(',,', '~comma~')).Replace(',', '|')).Replace('~comma~', ',')

      # Trim values that can be too long
      $_.product__v = ($_.product__v)[0..254] -join ""

      # Finally add an ExtractDate Column and output the modified object
      # (-PassThru) - this obviates the need for a separate Select-Object call.
      Add-Member -InputObject $_ -PassThru -NotePropertyName 'Extract_Date' -NotePropertyValue $date
    }
  } |
  Export-Csv ($path + 'VMC_DOCUMENTS.csv') -NoTypeInformation -Encoding UTF8

[1] Note that the variation & { process { ... } }, i.e. execution in a child scope, can speed up execution (see this answer for an explanation), but again increases memory consumption, which is why it isn't used here.

Theseus answered 13/10, 2022 at 15:32 Comment(1)
@BarryTheSprout, if you use Write-Output, the message will become part of the command's (success) output and will end up in the file. To print to the display, use Write-Host. Another option is Write-Verbose -Verbose.Theseus

In desperate circumstances requiring the most performance and flexibility (but still requiring PowerShell), I have resorted to rolling my own CSV processing using StreamReader and StreamWriter. Here is an example that assumes a three-column source CSV file and outputs another CSV file with the values in the first column uppercased and the values in the second column lowercased:

$infilename = Join-Path $PSScriptRoot 'documents.csv'
$outfilename = Join-Path $PSScriptRoot 'VMC_DOCUMENTS.csv'
$bufsize = 1mb
$rowsep = "`r?`n"
$fieldsep = ","

New-Item -Force -Type "file" $outfilename

$readstream = New-Object -TypeName System.IO.StreamReader -ArgumentList $infilename
$writestream = New-Object -TypeName System.IO.StreamWriter -ArgumentList $outfilename

$writestream.WriteLine($readstream.ReadLine())
$partial = ''
$continue = $true
while ($continue) {
    [char[]]$chunk = New-Object char[] $bufsize
    $received = $readstream.Read($chunk, 0, $bufsize)
    $continue = ($received -gt 0)
    if ($continue -eq $false) {
        break
    }
    $chunkstr = $chunk -join ""
    $lines = (($partial, $chunkstr) -join "") -split $rowsep
    $partial = $lines[-1]
    for ($i = 0; $i -lt $lines.Length - 1; $i++) {
        $row = $lines[$i] -split ($fieldsep)
        
        # Process row/fields here:
        $new = ($row[0].ToUpper(), $row[1].ToLower(), $row[2]) -join $fieldsep 

        $writestream.WriteLine($new)
    }
}
$readstream.Close()
$writestream.Close()

Please note that the CSV parsing is quite rudimentary, and assumes no escape characters or quoted fields. If those are needed, more robust logic using regular expressions can be used.
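One common (though still imperfect) regex pattern splits only on commas that sit outside double-quoted fields; the sketch below assumes no newlines embedded inside quoted fields:

```powershell
# Split on commas that are NOT inside double-quoted fields: the lookahead
# requires an even number of quote characters between the comma and the
# end of the line. Caveat: breaks on newlines embedded in quoted fields.
$line   = '"a,b",c,"d""e",f'
$fields = $line -split ',(?=(?:[^"]*"[^"]*")*[^"]*$)'
# $fields -> '"a,b"', 'c', '"d""e"', 'f'
```

Note that the resulting fields still carry their surrounding quotes and doubled inner quotes; unescaping them is a separate step.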

One could simplify the above using ReadLine rather than the chunk processing, but only if conventional newline characters are used. The above code instead allows for an arbitrary line separator in the source file.
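The ReadLine-based simplification looks like this sketch, reusing the $readstream, $writestream, and $fieldsep variables from the code above:

```powershell
# Line-by-line variant: simpler, but tied to conventional `r`n or `n newlines.
$writestream.WriteLine($readstream.ReadLine())   # copy the header row
while ($null -ne ($line = $readstream.ReadLine())) {
    $row = $line -split $fieldsep

    # Process row/fields here, as before:
    $new = ($row[0].ToUpper(), $row[1].ToLower(), $row[2]) -join $fieldsep

    $writestream.WriteLine($new)
}
```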

Thirtieth answered 19/5, 2024 at 13:36
