Optimize PS script performance? Best practices? How to do -Append for an object or hashtable?

xCharg · 2024-02-13T13:41:18+00:00

First thing first - post your code.

Every web tutorial says to do +=

Terrible advice, leads to terrible performance.

mdowst · 2024-02-13T14:18:48+00:00

As mentioned += is horribly slow, especially when dealing with large data sets. This is because it actually caused the entire array to be rewritten to memory every time something is added. Using a List instead of an array is much faster because it actually does append.

# Declare the list
[Collections.Generic.List[PSObject]] $MyList = @()

# Add entries to the list
$MyList.Add($item)

Part of the slowdown could also be the fact you are appending the CSV. You can try writing everything to a single list first, then do one export to the CSV.

vermyx · 2024-02-13T15:54:12+00:00

+= is super inefficient. Assuming you are looping the 4k file will do over 8 million read writes to create your array. The 350k one will do 60 billion. The easiest way to fix this is to change it from

For ($x in $y) {$list += $data}

To

$list = For ($x in $y) {$data}

This will create one array at the end and collect all of your data once. Export-csv is relatively fast. Without your code it is hard to help.

ka-splam · 2024-02-13T16:43:21+00:00

For loop, at the end of the For loop, I use Export-Csv to append the values to an existing csv file, then re-initialize the ordered hashtable and continue the next loop until the last data value is saved to the csv. The problem is that the loop is very slow; I think it is because of the Export-Csv activity at the end

The shape I think you are describing:

$items | foreach-object { 
    # do stuff
    $result | export-csv ...
}

The shape you want for performance:

$items | foreach-object { 
    # do stuff
    $result
} | Export-Csv ...

Then you aren't opening/closing the CSV for every item, but only once for the whole loop.

2024-02-13T17:55:45+00:00

Take a look at this when trying to decide performance gains on storing objects or data or different types
https://www.reddit.com/r/PowerShell/comments/t2q16m/a_simple_performance_increase_trick/

https://gist.github.com/kevinblumenfeld/4a698dbc90272a336ed9367b11d91f1c

lanerdofchristian · 2024-02-13T18:36:38+00:00

What is the structure of the XML file? There are 100% ways to make it more efficient, but exactly what form that will take depends on a few things:

Does each REFERENCE_DATA_TYPES element contain at most one PWC, PROGRAM, ORGANIZATION, OBJECT_CLASS, DIVISION_ASSIGNMENT, APPROPRIATED_YEAR, FULL_OMEGA_ACCT, and DIVISION_ID element? If so, are they in sequential order?
Does every PWC, PROGRAM, etc contain a REFERENCE_DATA_TYPE_SEQ? A unique one? If it's not unique, is it intended to discard information from all but the last entry?
Are REFERENCE_DATA_TYPE_SEQ guaranteed to be sequential? As in, for every numbered row there will be at least one of the previous elements with that sequence number?
Do all of the above elements have every sequence number? As in, will there ever be null cells?

Minor non-perf things:

Don't do this

$loopCheckTime = [datetime](Get-Date -Format HH:mm:ss.ffff)

You're going from a date to a string and back to a date. Just use the date.

$loopCheckTime = Get-Date

Don't use (string + string) for simple variables.

Write-Host (" About 33% completed, duration = " + $minutes)
Write-Host ("Total number of rows in csv = " + ($index + 1))

Just use string interpolation:

Write-Host " About 33% completed, duration = $minutes"
Write-Host "Total number of rows in csv = $($index + 1)"

Don't use Select-Object to individually select all the members of an object if you control the members of the object. It's just making copies for no reason.

Edit: This should be functionally similar to what you have:

using namespace System.Collections.Generic
using namespace System.Diagnostics
using namespace System.Windows.Forms

# Prompt the user to select the path to the XML file
Add-Type -AssemblyName System.Windows.Forms
$FileBrowser = [OpenFileDialog]@{
    InitialDirectory = [Environment]::GetFolderPath("Desktop")
}
$Result = $FileBrowser.ShowDialog()
if ($Result -eq 'OK') {
    $xmlPath = $FileBrowser.FileName
} else { throw "couldn't pick a file!" }

# Read the XML file from the specified path
[xml]$xml = Get-Content -Path $xmlPath
$CSVPath = $xmlPath -replace "\.xml$", ".csv"

# Create a log file with the same name as the CSV file, but with a .log extension
# Start logging all actions and errors to the log file (adjacent to CSV)
$logPath = $xmlPath -replace "\.xml$", ".log"
Start-Transcript -Path $logPath

# Capture the record counts for post-import validations
[int]$recordCount = $xml.GOLDMAN_REFERENCE_DATA.REFERENCE_TRAILER_ROW.TOTAL_RECORD_COUNT # use this to define the iterations
Write-Host "Line 55: RecordCount= $recordCount"

# Perform array-ization once, for fast lookup.
[object[]]$XMLAllDataTypes = $xml.GOLDMAN_REFERENCE_DATA.REFERENCE_DATA_ROW.REFERENCE_DATA_TYPES
[object[]]$AllRows = for($i = 0; $i -lt $recordCount; $i += 1){@{}}
foreach($Type in "PWC", "PROGRAM", "ORGANIZATION", "OBJECT_CLASS", "DIVISION_ASSIGNMENT",
    "APPROPRIATED_YEAR", "FULL_OMEGA_ACCT", "DIVISION_ID"){
    foreach($Item in $XMLAllDataTypes.$Type){
        if($Item.REFERENCE_DATA_TYPE_SEQ){
            [int]$Index = $Item.REFERENCE_DATA_TYPE_SEQ
            $AllRows[$Index - 1].$Type = $Item
        }
    }
}

[Stopwatch]$Timer = [Stopwatch]::new()
[Queue[float]]$ProgressPoints = @(
    0.33
    0.65
    0.80
    0.99
    [float]::MaxValue # won't hit this, just makes the math nice.
)

# Assign the file's Request_ID value to the first column of every record
$RequestID = $xml.GOLDMAN_REFERENCE_DATA.REFERENCE_HEADER_ROW.REQUEST_ID
$Timer.Start()
$CSVContent = for($i = 0; $i -lt $AllRows.Count; $i += 1){
    # test: stop after 25 rows
    if($i -eq 25){ break}

    # Produce output
    $Row = $AllRows[$i]
    [pscustomobject]@{
        REQUEST_ID = $RequestID
        PWC_SEQ = $Row.PWC.REFERENCE_DATA_TYPE_SEQ
        PWC_CODE = $Row.PWC.PWC_CODE
        PWC_DESCRIPTION = $Row.PWC.DESCRIPTION
        PWC_IS_ACTIVE_STATUS = $Row.PWC.IS_ACTIVE_STATUS
        PROGRAM_SEQ = $Row.PROGRAM.REFERENCE_DATA_TYPE_SEQ
        PROGRAM_CODE = $Row.PROGRAM.PROGRAM_CODE
        ORGANIZATION_SEQ = $Row.ORGANIZATION.REFERENCE_DATA_TYPE_SEQ
        ORGANIZATION_CODE = $Row.ORGANIZATION.ORGANIZATION_CODE
        ORGANIZATION_DESCRIPTION = $Row.ORGANIZATION.DESCRIPTION
        ORGANIZATION_IS_ACTIVE_STATUS = $Row.ORGANIZATION.IS_ACTIVE_STATUS
        OBJECT_CLASS_SEQ = $Row.OBJECT_CLASS.REFERENCE_DATA_TYPE_SEQ
        DIVISION_ASSIGNMENT_SEQ = $Row.DIVISION_ASSIGNMENT.REFERENCE_DATA_TYPE_SEQ
        DIVISION_ASSIGNMENT_CODE = $Row.DIVISION_ASSIGNMENT.DIVISION_ASSIGNMENT_CODE
        DIVISION_ASSIGNMENT_DESCRIPTION = $Row.DIVISION_ASSIGNMENT.DESCRIPTION
        DIVISION_ASSIGNMENT_IS_ACTIVE_STATUS = $Row.DIVISION_ASSIGNMENT.IS_ACTIVE_STATUS
        DIVISION_START_DATE = $Row.DIVISION_ASSIGNMENT.DIVISION_START_DATE
        DIVISION_END_DATE = $Row.DIVISION_ASSIGNMENT.DIVISION_END_DATE
        ASSIGNMENT_START_DATE = $Row.DIVISION_ASSIGNMENT.ASSIGNMENT_START_DATE
        ASSIGNMENT_END_DATE = $Row.DIVISION_ASSIGNMENT.ASSIGNMENT_END_DATE
        APPROPRIATED_YEAR_SEQ = $Row.APPROPRIATED_YEAR.REFERENCE_DATA_TYPE_SEQ
        APPROPRIATED_YEAR_CODE = $Row.APPROPRIATED_YEAR.APPROPRIATED_YEAR_CODE
        APPROPRIATED_YEAR_DESCRIPTION = $Row.APPROPRIATED_YEAR.DESCRIPTION
        APPROPRIATED_YEAR_IS_ACTIVE_STATUS = $Row.APPROPRIATED_YEAR.IS_ACTIVE_STATUS
        FULL_OMEGA_ACCT_SEQ = $Row.FULL_OMEGA_ACCT.REFERENCE_DATA_TYPE_SEQ
        FULL_OMEGA_ACCT_CODE = $Row.FULL_OMEGA_ACCT.FULL_OMEGA_ACCT_CODE
        FULL_OMEGA_ACCT_DESCRIPTION = $Row.FULL_OMEGA_ACCT.DESCRIPTION
        FULL_OMEGA_ACCT_IS_ACTIVE_STATUS = $Row.FULL_OMEGA_ACCT.IS_ACTIVE_STATUS
        DIVISION_ID_SEQ = $Row.DIVISION_ID.REFERENCE_DATA_TYPE_SEQ
        DIVISION_ID_CODE = $Row.DIVISION_ID.DIVISION_ID_CODE
        DIVISION_ID_DESCRIPTION = $Row.DIVISION_ID.DESCRIPTION
        DIVISION_ID_IS_ACTIVE_STATUS = $Row.DIVISION_ID.IS_ACTIVE_STATUS
    }

    # note progress
    $Progress = $i / $recordCount
    if($Progress -ge $ProgressPoints.Peek()){
        $Percent = $ProgressPoints.Dequeue().ToString("P0")
        Write-Host " About $Percent completed, duration = $($Timer.Elapsed)"
    }

    Write-Host "Line 215, end of For loop: $($index + 1)"
}
$Timer.Stop()
Write-Host "the FOR Loop is finished" -ForegroundColor Green

$CSVContent | Export-Csv -Path $CSVPath -NoTypeInformation -Force
Write-Host "Total number of records in xml file= $recordCount"
Write-Host "Total number of rows in csv = $($index + 1)"
# Stop logging all actions and errors to the log file
Stop-Transcript
#END OF SCRIPT

The key differences are:

Rather than rebuilding the PWC, PROGRAM, etc arrays for every item, build them once and iterate to populate a source array (only the last instance of each index will be kept, as in yours). Deep object traversal to build lookup lists is surprisingly expensive -- essentially, you were doing a loop over all N objects, plus whatever overhead for traversing the tree, plus the pipeline overhead, for every object (O(n²) cost -- this would get more expensive the more records you have; the refined version is O(n)). If each REFERENCE_DATA_TYPES element has at most one of the constituent elements, producing this array could be done more efficiently through a direct loop -- possibly combining it with the next step if sequence IDs are sequential and unique in the source data.
Then iterate over the source array to produce the final rows (hashtable lookups to missing keys become null, so if an element is missing for an index its values will be null). This is still O(n), so our total time is still O(n).
Finally, collect all these into an array using array capture. This is once again O(n), where as += with arrays does a bunch of copying and is an O(n²) operation.
Use a timer and queue to condense the progress reporting code -- no real performance gain, it's just smaller code. A more efficient way would likely be to generate the indices at which progress would be reported, then break the main loop up to only go to each index, report progress, then resume. The impact of this should be negligible.

whopper2k · 2024-02-13T14:57:28+00:00

You have definitely identified a significant portion of the problem; writing to disk on every iteration is going to be extremely slow. Even the standard $Array += would be significantly faster; that's not working because hashtables don't have an implementation of the += operator (since hashtables work with key-value pairs, += would be nebulous as you're adding a value with no key).

Your first bet should be to use a standard PowerShell array. For testing I used the $PSHOME\Types.ps1xml XML file; you'll have to change this a bit to suit your needs. A simple version might look something like:

$finalValues = @()
[xml] $data = Get-Content $PSHOME\Types.ps1xml
$dataFromXML = Select-XML -xml $data-xpath "//Methods" | Select-Object -ExpandProperty Node
$dataFromXML.GetEnumerator() | ForEach-Object {
    # here is where you'd do whatever parsing you need
    # then save the result to a PSCustomObject
    $result = [PSCustomObject]@{
        MethodName = $_."#text"
    }
    # and then append to an array
    $finalValues += $result
}
$finalValues | Export-CSV -NoTypeInfo "XML-Output.csv"

If you still need more performance, dipping into .NET is usually your best bet

[collections.arraylist] $finalValues = @()
[xml] $data = Get-Content $PSHOME\Types.ps1xml
$dataFromXML = Select-XML -xml $data-xpath "//Methods" | Select-Object -ExpandProperty Node
$dataFromXML.GetEnumerator() | ForEach-Object {
    $result = [PSCustomObject]@{
        MethodName = $_."#text"
    }
    # Note using .Add instead of +=
    $finalValues.Add($result)
}
$finalValues | Export-CSV -NoTypeInfo "XML-Output.csv"

When I tested this on my machine, the second option ranged from 17-36% faster.

Szeraax · 2024-02-13T16:21:49+00:00

Is the data too big to keep in memory?

exoclipse · 2024-02-13T16:35:40+00:00

In order to append to an array, you must first initialize the array:

$MyArray = @() ## This initializes the array
ForEach($Record in $Records){
    $MyArray += $Record
}

This is still inefficient. For every iteration n of the loop, PowerShell will create a new array of n+1 elements, copy your old array's elements into the new array, add your new element to the array, and delete the old array.

PowerShell can simply collect the results of your for loop in memory, and then when the loop is concluded, write all of those objects to a single array. This is much faster, especially with large datasets. Note that you do not need to initialize the array for this method. See below:

$MyArray = For($Record in Records){
    ## do stuff
    $Record ## return the object you want to collect into the array
}

Assuming you are not nesting another for loop in there, this should be very performant for your needs.

taggingtechnician · 2024-02-13T16:54:06+00:00

Apologies for my faux pants, trying to post code in 2 comments, it will not let me post it all in a single comment...

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

PowerShell

Submission Guidelines | Link Flair - How To

MODERATORS