PowerShell: Importing 16MB CSV Into PowerShell Variable Creates >600MB of PowerShell Memory Usage
I'm trying to understand why PowerShell's memory balloons so much when I import a ~16 MB file as a variable. I can understand there's additional memory structure around that variable, but I'm just trying to understand why it's THAT high. Here's what I'm doing below: a stripped-down, simplistic snippet from another script that anyone can run.

Notes/Questions

  1. I'm not complaining, just trying to understand why the usage is so high, and whether there's a better way of doing this or of managing memory more efficiently to respect the system I'm running this on.
  2. The same behavior occurs in PowerShell 5.1 and in PowerShell 7 RC3, which was just released. I don't think this is a bug, simply another opportunity for me to learn more.
  3. My overall objective with this is to run a foreach loop to check another, much smaller array against this array for matches or the lack thereof.

My Test Code

Invoke-WebRequest -uri "http://s3.amazonaws.com/alexa-static/top-1m.csv.zip" -OutFile C:\top-1m.csv.zip

Expand-Archive -Path C:\top-1m.csv.zip -DestinationPath C:\top-1m.csv

$alexaTopMillion = Import-Csv -Path C:\top-1m.csv
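
Not part of my original script, but a rough way to see the growth I'm describing is to query this process's working set before and after the import (background activity in the process can skew the numbers somewhat):

# Rough measurement of how much the import adds to this process's working set.
$before = (Get-Process -Id $PID).WorkingSet64
$alexaTopMillion = Import-Csv -Path C:\top-1m.csv
'{0:N0} MB added by the import' -f (((Get-Process -Id $PID).WorkingSet64 - $before) / 1MB)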

To anyone answering this: Thank you for your time and helping me learn more every day!

Cochleate answered 22/2, 2020 at 15:30 Comment(2)
You need to stream your objects by using the PowerShell pipeline, which basically means: do not assign all objects to a variable ($alexaTopMillion) or wrap the command in parentheses ((...)). Instead, pipe the input to your processing step and immediately pipe that to the output: Import-Csv -Path C:\top-1m.csv | ForEach-Object {...} | Export-Csv -Path C:\output.csv (Incur)
For point 3 ("My overall objective with this is to run a foreach loop to check another much smaller array against this array for matches or lack thereof."), you might have a look at the Join-Object script/Join-Object module (see also: In PowerShell, what's the best way to join two tables into one?): Import-Csv -Path C:\top-1m.csv | Join (Import-Csv .\Smaller.Csv) -On <something> | Export-Csv .\Output.Csv (Incur)
Generally speaking, iRon's advice in a comment on the question is worth heeding (the specific question is addressed in the section that follows this one):

To keep memory usage low, use streaming of objects in the pipeline rather than collecting them in memory first - if feasible.

That is, instead of doing this:

# !! Collects ALL objects in memory, as an array.
$rows = Import-Csv in.csv
foreach ($row in $rows) { ... }

do this:

# Process objects ONE BY ONE.
# As long as you stream to a *file* or some other output stream
# (as opposed to assigning to a *variable*), memory use should remain constant,
# except for temporarily held memory awaiting garbage collection.
Import-Csv in.csv | ForEach-Object { ... } # pipe to Export-Csv, for instance
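
Applied to the objective stated in the question - checking a much smaller array against this large one for matches or non-matches - a streaming variant might look like the following sketch; the Domain column name and the .\smaller.csv path are placeholders for illustration, not part of the original question:

# Sketch only: stream the large CSV and test each row against a small lookup set.
# 'Domain' and .\smaller.csv are placeholder names - adjust to the actual data.
# A HashSet provides fast, case-insensitive membership tests.
$smallSet = [System.Collections.Generic.HashSet[string]]::new(
  [string[]] (Import-Csv .\smaller.csv).Domain,
  [System.StringComparer]::OrdinalIgnoreCase
)

Import-Csv C:\top-1m.csv |
  # Keep matches; use -not $smallSet.Contains(...) to keep non-matches instead.
  Where-Object { $smallSet.Contains($_.Domain) } |
  Export-Csv .\matches.csv -NoTypeInformation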

However, even then you can seemingly run out of memory with very large files - see this question - possibly due to a buildup of memory from no-longer-needed objects that haven't yet been garbage-collected; therefore, periodically calling [GC]::Collect() in the ForEach-Object script block may solve the problem - see this answer for an example, and the sketch below.
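
A minimal sketch of that periodic-collection approach (the 100,000-row interval is an arbitrary choice, not something prescribed by the linked answer):

# Sketch: force a garbage collection every 100,000 rows while streaming.
# The interval is arbitrary - tune it to the data and the system's memory.
$i = 0
Import-Csv C:\top-1m.csv | ForEach-Object {
  $i++
  if ($i % 100000 -eq 0) { [GC]::Collect() }
  $_  # placeholder: replace with the actual per-row processing
} | Export-Csv C:\output.csv -NoTypeInformation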


If you do need to collect all objects output by Import-Csv in memory at once:

The inordinate memory use you observe comes from how [pscustomobject] instances (Import-Csv's output type) are implemented, as discussed in GitHub issue #7603 (emphasis added):

The memory pressure most likely comes from the cost of PSNoteProperty [which is how [pscustomobject] properties are implemented]. Each PSNoteProperty has an overhead of 48 bytes, so when you just store a few bytes per property, that becomes massive.

The same issue proposes a workaround to reduce memory consumption (as also shown in Wasif Hasan's answer):

  • Read the first CSV row and dynamically create a custom class that represents the rows, using Invoke-Expression.

    • Note: While its use is safe here, Invoke-Expression should generally be avoided.

    • If you know the column structure in advance, you can create a custom class the regular way, which also allows you to use appropriate data types for the properties (which are otherwise all strings by default); for instance, defining appropriate properties as [int] (System.Int32) further reduces memory consumption (a sketch follows the code below).

  • Pipe Import-Csv to a ForEach-Object call that converts each [pscustomobject] created to an instance of the dynamically created class, which stores the data more efficiently.

Note: This workaround comes at the expense of considerably diminished execution speed.

$csvFile = 'C:\top-1m.csv'

# Dynamically define a custom class derived from the *first* row
# read from the CSV file.
# Note: While this is a legitimate use of Invoke-Expression, 
#       it should generally be avoided.
"class CsvRow { 
 $((Import-Csv $csvFile | Select-Object -first 1).psobject.properties.Name -replace '^', '[string] $$' -join ";") 
}" | Invoke-Expression

# Import all rows and convert them from [pscustomobject] instances 
# to [CsvRow] instances to reduce memory consumption.
# Note: Casting the Import-Csv call directly to [CsvRow[]] would be noticeably
#       faster, but increases *temporary* memory pressure substantially.
$alexaTopMillion = Import-Csv $csvFile | ForEach-Object { [CsvRow] $_ }
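
For comparison, here is a sketch of the statically defined variant mentioned above, with a typed property; the class name, column names, and types are assumptions for illustration and need to be adjusted to the actual CSV structure:

# Sketch of the "regular way": a statically defined class with typed properties.
# Class name, property names, and types below are illustrative assumptions only.
class CsvRowTyped {
  [int]    $Rank    # [int] stores small numbers more compactly than [string]
  [string] $Domain
}

# -Header is only needed if the CSV file lacks a header row.
$alexaTopMillion = Import-Csv $csvFile -Header Rank, Domain |
  ForEach-Object { [CsvRowTyped] $_ }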

Longer-term, a better solution that would also be faster is to make Import-Csv support outputting parsed rows with a given output type, say, via an -OutputType parameter, as proposed in GitHub issue #8862.
If that is of interest to you, show your support for the proposal there.


Memory-use benchmarks:

The following code compares memory use with normal Import-Csv import (array of [pscustomobject]s) to the workaround (array of custom-class instances).

The measurement isn't exact, as the PowerShell process's working-set memory is simply queried, which can reflect the influence of background activities, but it gives a rough sense of how much less memory using a custom class requires.

Sample output, which shows that the custom-class workaround requires only about one fifth of the memory with the sample 10-column CSV input file with about 166,000 rows used below - the specific ratio depends on the number of input rows and columns:

MB Used Command
------- -------
 384.50  # normal import…
  80.48  # import via custom class…

Benchmark code:

# Create a sample CSV file with 10 columns about 16 MB in size.
$tempCsvFile = [IO.Path]::GetTempFileName()
('"Col1","Col2","Col3","Col4","Col5","Col6","Col7","Col8","Col9","Col10"' + "`n") | Set-Content -NoNewline $tempCsvFile
('"Col1Val","Col2Val","Col3Val","Col4Val","Col5Val","Col6Val","Col7Val","Col8Val","Col9Val","Col10Val"' + "`n") * 1.662e5 |
  Add-Content $tempCsvFile

try {

  { # normal import
    $all = Import-Csv $tempCsvFile
  },
  { # import via custom class
    "class CsvRow {
      $((Import-Csv $tempCsvFile | Select-Object -first 1).psobject.properties.Name -replace '^', '[string] $$' -join ";")
    }" | Invoke-Expression
    $all = Import-Csv $tempCsvFile | ForEach-Object { [CsvRow] $_ }
  } | ForEach-Object {
    [gc]::Collect(); [gc]::WaitForPendingFinalizers() # garbage-collect first.
    Start-Sleep 2 # Wait a little for the freed memory to be reflected in the process object.
    $before = (Get-Process -Id $PID).WorkingSet64
    # Execute the command.
    & $_
    # Measure memory consumption and output the result.
    [pscustomobject] @{
      'MB Used' = ('{0,4:N2}' -f (((Get-Process -Id $PID).WorkingSet64 - $before) / 1mb)).PadLeft(7)
      Command = $_
    }
  }

} finally {
  Remove-Item $tempCsvFile
}
Eamon answered 22/2, 2020 at 19:45 Comment(0)
You can generate a type for each item as described here: https://github.com/PowerShell/PowerShell/issues/7603

Import-Csv "C:\top-1m.csv" | Select-Object -first 1 | ForEach {$_.psobject.properties.name} | Join-String -Separator "`r`n" -OutputPrefix "class MyCsv {`r`n" -OutputSuffix "`n}" -Property {"`t`$$_"}  | Invoke-Expression
Import-Csv "C:\top-1m.csv" | Foreach {[MyCsv]$_} | Export-Csv "C:\alexa_top.csv"

This is considerably more efficient. You can measure the execution time using Measure-Command, as sketched below.
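
A minimal sketch of such a comparison, assuming the MyCsv class from the snippet above has already been defined in the session:

# Compare execution times; each call reads the full file.
(Measure-Command { Import-Csv "C:\top-1m.csv" }).TotalSeconds
(Measure-Command { Import-Csv "C:\top-1m.csv" | ForEach-Object { [MyCsv] $_ } }).TotalSeconds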

If you use Get-Content it's very, very slow. The -Raw parameter improves the speed, but memory pressure gets high.

The -ReadCount parameter sets how many lines are read per batch sent down the pipeline. This is even faster than using the -Raw parameter.
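
For illustration, reading in batches with -ReadCount might look like this; note that each object sent down the pipeline is then an array of lines rather than a single line (the batch size of 1,000 is an arbitrary choice):

# Read the file in batches of 1,000 lines; $_ is an array of up to 1,000 lines.
Get-Content "C:\top-1m.csv" -ReadCount 1000 | ForEach-Object {
  foreach ($line in $_) {
    # ... per-line processing here ...
  }
}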

The file can even be read using a switch statement, like:

Switch -File "Path" {default {$_}}

It's even faster! But regrettably, it uses even more memory.

Palais answered 22/2, 2020 at 16:30 Comment(3)
As in my comments on the question: you should not assign the results to a variable ($alexaTopMillion = ...), as that will choke the pipeline and pile everything up in memory. Instead, you should pipe directly to the next cmdlet (Export-Csv) so that objects are released from memory as they move through the pipeline. See also: Implement for the Middle of a Pipeline (Incur)
Thanks @iRon. I have updated my answer and also included how to increase the speed of reading files. (They still require a lot of memory....) (Palais)
The workaround is helpful to reduce memory consumption, but it slows down the command considerably. The OP's concern is memory use with respect to Import-Csv, whereas the remainder of your answer discusses execution speed with plain-text files. There, -Raw is by far the fastest (even if you use -ReadCount 0 to read into a single array), but you get a single, multi-line string. You can make switch stream in the pipeline too - just wrap the whole statement in & { ... }. (Eamon)
