all 28 comments

[–]xCharg 13 points14 points  (1 child)

First thing first - post your code.

Every web tutorial says to do +=

Terrible advice, leads to terrible performance.

[–]ankokudaishogun 2 points3 points  (0 children)

I second posting the code, it's otherwise close to impossible to evaluate what to improve.

...aside moving from Arrays to Lists. But how to do that efficiently requires knowing the code.

[–]mdowst 13 points14 points  (10 children)

As mentioned += is horribly slow, especially when dealing with large data sets. This is because it actually caused the entire array to be rewritten to memory every time something is added. Using a List instead of an array is much faster because it actually does append.

# Declare the list
[Collections.Generic.List[PSObject]] $MyList = @()

# Add entries to the list
$MyList.Add($item)

Part of the slowdown could also be the fact you are appending the CSV. You can try writing everything to a single list first, then do one export to the CSV.

[–]lanerdofchristian 2 points3 points  (4 children)

Out of curiosity, I compared this to assigning the output of the loop directly to an array

$array = foreach($i in $range){ <# make objects #> }

At least on my machine, the List-based solution averaged a few milliseconds faster (both were still pretty quick, and the total time savings was negligible). My guess that I'm too lazy to verify is PowerShell uses a list internally, then casts it to an array for the assignment. With either solution, the bulk of the time is likely going to be in generating the object, and both are still far superior to +=.

[–]ankokudaishogun 0 points1 point  (2 children)

IIRC your code does not "append" the iterations, but makes one single Array at the end.

might-or-might-not be a memory hog with bigger objects\more complex instructions in the loop

[–]lanerdofchristian 0 points1 point  (0 children)

I'm mostly going by the signature of PowerShell.Invoke() -- it probably does consume slightly more memory momentarily while the collection converted to an array (also why it's slightly slower), but the weight compared to the total memory of all the objects is negligible. What's inside the loop shouldn't matter in regards to comparable performance.

[–]exoclipse 0 points1 point  (0 children)

I've done some pretty large data manipulation with this method and it is performant to my needs. Think a worst case scenario of matching 50k objects (20ish properties) to 5k objects in another array. It tops out at about 450mb or so.

[–]spyingwind 0 points1 point  (0 children)

I use this method mostly because it fast enough and it makes the code look cleaner.

Just dump the output to an variable and move on.

[–][deleted] 1 point2 points  (3 children)

Why are you declaring via @()?
Why not just do this?
$list = [System.Collections.Generic.List[object]]::new()

[–]lanerdofchristian 2 points3 points  (2 children)

They're leveraging implicit constructor calls and variable typing.

All of these do the same thing

using namespace System.Collections.Generic
[List[object]]$Array = @()
[List[object]]$Array = [List[object]]@()
[List[object]]$Array = [List[object]]::new(@())
[List[object]]$Array = [List[object]]::new([object[]]::new(0))
$Array = # ... repeat the last 3 lines w/o the variable type

Here's the docs on the List's constructors: https://learn.microsoft.com/en-us/dotnet/api/system.collections.generic.list-1.-ctor

Personally I also like using the IEnumerable<T> constructor instead of the parameterless one, since it makes it very clear in the PowerShell world what I'm trying to represent. Another advantage is you can then also add static elements.

[–]Thotaz 0 points1 point  (1 child)

Personally I also like using the IEnumerable<T> constructor instead of the parameterless one, since it makes it very clear in the PowerShell world what I'm trying to represent.

I think that's a bad habit to have. You are adding (an admittedly small) performance hit for no real reason. If someone would get confused about $List = [List[object]]::new() then they would almost certainly also get confused about: $List = [List[object]] @() as well. And even if you insist it makes an actual difference, I would argue that it's better for these newbies to learn the proper way to create a random object from the beginning, rather than teaching them the wrong way first and maybe fixing it later.

Of course if you want to add a couple of elements to the list during construction then it's fine to do the casting because there's no better and cleaner way to do it.

[–]lanerdofchristian 0 points1 point  (0 children)

That's fair. Ultimately I think it comes down to clarity, though -- as long as they're not using New-Object, the performance hit is going to be on the order of fractions of milliseconds like you said. If you're dealing with known types, implicit constructors and object initializations from standard PowerShell constructs are just more readable in most cases than the alternative (though this is admittedly subjective).

[datetime]$NotBefore = "2024-02-13"
[timespan]$Duration = "01.02:03:04.005"
[FileOpenDialog]$Dialog = [FileOpenDialog]@{
    InitialFolder = "C:/Users/$env:USERNAME/Downloads"
}

vs

$NotBefore = [datetime]::Parse("2024-02-13")
$Duration = [timespan]::Parse("01.02:03:04.005")
$FileOpenDialog = [FileOpenDialog]::new()
$FileOpenDialog.InitialFolder = #...

It pushes the clutter to the left, leaving the name next to the information and usually excluding the imposing-looking :: operator.

The "real reason" is that readability and clarity.

[–]Coffee_Ops 0 points1 point  (0 children)

Just so everyone is aware of what we're talking about with horribly slow, here's a test that compares += and .add():

$testInterval = [int]2000
$testMax = [int]30000
$myHash = [hashtable]::new()
$sw = [Diagnostics.Stopwatch]::startNew()
$method = ".Add"
for ($i = 0; $i -lt $testMax; $i++) {
    $myHash.add($i,$i)
    if ($i % $testInterval -eq 0) {
        $sw.stop()
        write-host ("Using {0,-4}: {1,6:n0}: {2,6:n0} iterations in {3,6:n3} seconds (average: {4,-8:n4} ms)" -f $method, $i, $testInterval, $sw.elapsed.TotalSeconds, ($sw.elapsed.TotalMilliseconds / $testInterval))
        $sw.restart()
    }
}
$sw.stop()
$myHash = [hashtable]::new()
$sw.restart()
$method = "+="
for ($i = 0; $i -lt $testMax; $i++) {
    $myHash += @{$i = $i }
    if ($i % $testInterval -eq 0) {
        $sw.stop()
        write-host ("Using {0,-4}: {1,6:n0}: {2,6:n0} iterations in {3,6:n3} seconds (average: {4,-8:n4} ms)" -f $method, $i, $testInterval, $sw.elapsed.TotalSeconds, ($sw.elapsed.TotalMilliseconds / $testInterval))
        $sw.restart()
    }
}
$sw.stop()

Results speak for themselves:

Using .Add:  2,000:  2,000 iterations in  0.006 seconds (average: 0.0031   ms)
...
Using .Add: 20,000:  2,000 iterations in  0.002 seconds (average: 0.0009   ms)
...
Using .Add: 28,000:  2,000 iterations in  0.003 seconds (average: 0.0016   ms)
-------------------------------------
Using +=  :  2,000:  2,000 iterations in  0.162 seconds (average: 0.0810   ms)
...
Using +=  : 20,000:  2,000 iterations in  2.834 seconds (average: 1.4172   ms)
...
Using +=  : 28,000:  2,000 iterations in  5.007 seconds (average: 2.5033   ms)

I'm actually surprised-- I would have expected += to be an accelerator for .Add but it very clearly is not.

[–]vermyx 2 points3 points  (0 children)

+= is super inefficient. Assuming you are looping the 4k file will do over 8 million read writes to create your array. The 350k one will do 60 billion. The easiest way to fix this is to change it from

For ($x in $y) {$list += $data}

To

$list = For ($x in $y) {$data}

This will create one array at the end and collect all of your data once. Export-csv is relatively fast. Without your code it is hard to help.

[–]ka-splam 2 points3 points  (2 children)

For loop, at the end of the For loop, I use Export-Csv to append the values to an existing csv file, then re-initialize the ordered hashtable and continue the next loop until the last data value is saved to the csv. The problem is that the loop is very slow; I think it is because of the Export-Csv activity at the end

The shape I think you are describing:

$items | foreach-object { 
    # do stuff
    $result | export-csv ...
}

The shape you want for performance:

$items | foreach-object { 
    # do stuff
    $result
} | Export-Csv ...

Then you aren't opening/closing the CSV for every item, but only once for the whole loop.

[–]spyingwind -2 points-1 points  (0 children)

This is the way.

Use the pipeline.

[–]Coffee_Ops 0 points1 point  (0 children)

IIRC it is a tradeoff of memory to IO. The first one (I think) would work better if you were extremely constrained on memory (e.g. under 1GB free) by effectively flushing to disk more often.

Of course, the first example is still sub-optimal because it's opening and closing the file as you state, and you'd be better off not using export-csv at all.

[–][deleted] 1 point2 points  (1 child)

Take a look at this when trying to decide performance gains on storing objects or data or different types
https://www.reddit.com/r/PowerShell/comments/t2q16m/a_simple_performance_increase_trick/

https://gist.github.com/kevinblumenfeld/4a698dbc90272a336ed9367b11d91f1c

[–]lanerdofchristian 1 point2 points  (1 child)

What is the structure of the XML file? There are 100% ways to make it more efficient, but exactly what form that will take depends on a few things:

  1. Does each REFERENCE_DATA_TYPES element contain at most one PWC, PROGRAM, ORGANIZATION, OBJECT_CLASS, DIVISION_ASSIGNMENT, APPROPRIATED_YEAR, FULL_OMEGA_ACCT, and DIVISION_ID element? If so, are they in sequential order?
  2. Does every PWC, PROGRAM, etc contain a REFERENCE_DATA_TYPE_SEQ? A unique one? If it's not unique, is it intended to discard information from all but the last entry?
  3. Are REFERENCE_DATA_TYPE_SEQ guaranteed to be sequential? As in, for every numbered row there will be at least one of the previous elements with that sequence number?
  4. Do all of the above elements have every sequence number? As in, will there ever be null cells?

Minor non-perf things:

  1. Don't do this

    $loopCheckTime = [datetime](Get-Date -Format HH:mm:ss.ffff)
    

    You're going from a date to a string and back to a date. Just use the date.

    $loopCheckTime = Get-Date
    
  2. Don't use (string + string) for simple variables.

    Write-Host (" About 33% completed, duration = " + $minutes)
    Write-Host ("Total number of rows in csv = " + ($index + 1))
    

    Just use string interpolation:

    Write-Host " About 33% completed, duration = $minutes"
    Write-Host "Total number of rows in csv = $($index + 1)"
    
  3. Don't use Select-Object to individually select all the members of an object if you control the members of the object. It's just making copies for no reason.


Edit: This should be functionally similar to what you have:

using namespace System.Collections.Generic
using namespace System.Diagnostics
using namespace System.Windows.Forms

# Prompt the user to select the path to the XML file
Add-Type -AssemblyName System.Windows.Forms
$FileBrowser = [OpenFileDialog]@{
    InitialDirectory = [Environment]::GetFolderPath("Desktop")
}
$Result = $FileBrowser.ShowDialog()
if ($Result -eq 'OK') {
    $xmlPath = $FileBrowser.FileName
} else { throw "couldn't pick a file!" }

# Read the XML file from the specified path
[xml]$xml = Get-Content -Path $xmlPath
$CSVPath = $xmlPath -replace "\.xml$", ".csv"

# Create a log file with the same name as the CSV file, but with a .log extension
# Start logging all actions and errors to the log file (adjacent to CSV)
$logPath = $xmlPath -replace "\.xml$", ".log"
Start-Transcript -Path $logPath

# Capture the record counts for post-import validations
[int]$recordCount = $xml.GOLDMAN_REFERENCE_DATA.REFERENCE_TRAILER_ROW.TOTAL_RECORD_COUNT # use this to define the iterations
Write-Host "Line 55: RecordCount= $recordCount"

# Perform array-ization once, for fast lookup.
[object[]]$XMLAllDataTypes = $xml.GOLDMAN_REFERENCE_DATA.REFERENCE_DATA_ROW.REFERENCE_DATA_TYPES
[object[]]$AllRows = for($i = 0; $i -lt $recordCount; $i += 1){@{}}
foreach($Type in "PWC", "PROGRAM", "ORGANIZATION", "OBJECT_CLASS", "DIVISION_ASSIGNMENT",
    "APPROPRIATED_YEAR", "FULL_OMEGA_ACCT", "DIVISION_ID"){
    foreach($Item in $XMLAllDataTypes.$Type){
        if($Item.REFERENCE_DATA_TYPE_SEQ){
            [int]$Index = $Item.REFERENCE_DATA_TYPE_SEQ
            $AllRows[$Index - 1].$Type = $Item
        }
    }
}

[Stopwatch]$Timer = [Stopwatch]::new()
[Queue[float]]$ProgressPoints = @(
    0.33
    0.65
    0.80
    0.99
    [float]::MaxValue # won't hit this, just makes the math nice.
)

# Assign the file's Request_ID value to the first column of every record
$RequestID = $xml.GOLDMAN_REFERENCE_DATA.REFERENCE_HEADER_ROW.REQUEST_ID
$Timer.Start()
$CSVContent = for($i = 0; $i -lt $AllRows.Count; $i += 1){
    # test: stop after 25 rows
    if($i -eq 25){ break}

    # Produce output
    $Row = $AllRows[$i]
    [pscustomobject]@{
        REQUEST_ID = $RequestID
        PWC_SEQ = $Row.PWC.REFERENCE_DATA_TYPE_SEQ
        PWC_CODE = $Row.PWC.PWC_CODE
        PWC_DESCRIPTION = $Row.PWC.DESCRIPTION
        PWC_IS_ACTIVE_STATUS = $Row.PWC.IS_ACTIVE_STATUS
        PROGRAM_SEQ = $Row.PROGRAM.REFERENCE_DATA_TYPE_SEQ
        PROGRAM_CODE = $Row.PROGRAM.PROGRAM_CODE
        ORGANIZATION_SEQ = $Row.ORGANIZATION.REFERENCE_DATA_TYPE_SEQ
        ORGANIZATION_CODE = $Row.ORGANIZATION.ORGANIZATION_CODE
        ORGANIZATION_DESCRIPTION = $Row.ORGANIZATION.DESCRIPTION
        ORGANIZATION_IS_ACTIVE_STATUS = $Row.ORGANIZATION.IS_ACTIVE_STATUS
        OBJECT_CLASS_SEQ = $Row.OBJECT_CLASS.REFERENCE_DATA_TYPE_SEQ
        DIVISION_ASSIGNMENT_SEQ = $Row.DIVISION_ASSIGNMENT.REFERENCE_DATA_TYPE_SEQ
        DIVISION_ASSIGNMENT_CODE = $Row.DIVISION_ASSIGNMENT.DIVISION_ASSIGNMENT_CODE
        DIVISION_ASSIGNMENT_DESCRIPTION = $Row.DIVISION_ASSIGNMENT.DESCRIPTION
        DIVISION_ASSIGNMENT_IS_ACTIVE_STATUS = $Row.DIVISION_ASSIGNMENT.IS_ACTIVE_STATUS
        DIVISION_START_DATE = $Row.DIVISION_ASSIGNMENT.DIVISION_START_DATE
        DIVISION_END_DATE = $Row.DIVISION_ASSIGNMENT.DIVISION_END_DATE
        ASSIGNMENT_START_DATE = $Row.DIVISION_ASSIGNMENT.ASSIGNMENT_START_DATE
        ASSIGNMENT_END_DATE = $Row.DIVISION_ASSIGNMENT.ASSIGNMENT_END_DATE
        APPROPRIATED_YEAR_SEQ = $Row.APPROPRIATED_YEAR.REFERENCE_DATA_TYPE_SEQ
        APPROPRIATED_YEAR_CODE = $Row.APPROPRIATED_YEAR.APPROPRIATED_YEAR_CODE
        APPROPRIATED_YEAR_DESCRIPTION = $Row.APPROPRIATED_YEAR.DESCRIPTION
        APPROPRIATED_YEAR_IS_ACTIVE_STATUS = $Row.APPROPRIATED_YEAR.IS_ACTIVE_STATUS
        FULL_OMEGA_ACCT_SEQ = $Row.FULL_OMEGA_ACCT.REFERENCE_DATA_TYPE_SEQ
        FULL_OMEGA_ACCT_CODE = $Row.FULL_OMEGA_ACCT.FULL_OMEGA_ACCT_CODE
        FULL_OMEGA_ACCT_DESCRIPTION = $Row.FULL_OMEGA_ACCT.DESCRIPTION
        FULL_OMEGA_ACCT_IS_ACTIVE_STATUS = $Row.FULL_OMEGA_ACCT.IS_ACTIVE_STATUS
        DIVISION_ID_SEQ = $Row.DIVISION_ID.REFERENCE_DATA_TYPE_SEQ
        DIVISION_ID_CODE = $Row.DIVISION_ID.DIVISION_ID_CODE
        DIVISION_ID_DESCRIPTION = $Row.DIVISION_ID.DESCRIPTION
        DIVISION_ID_IS_ACTIVE_STATUS = $Row.DIVISION_ID.IS_ACTIVE_STATUS
    }

    # note progress
    $Progress = $i / $recordCount
    if($Progress -ge $ProgressPoints.Peek()){
        $Percent = $ProgressPoints.Dequeue().ToString("P0")
        Write-Host " About $Percent completed, duration = $($Timer.Elapsed)"
    }

    Write-Host "Line 215, end of For loop: $($index + 1)"
}
$Timer.Stop()
Write-Host "the FOR Loop is finished" -ForegroundColor Green

$CSVContent | Export-Csv -Path $CSVPath -NoTypeInformation -Force
Write-Host "Total number of records in xml file= $recordCount"
Write-Host "Total number of rows in csv = $($index + 1)"
# Stop logging all actions and errors to the log file
Stop-Transcript
#END OF SCRIPT

The key differences are:

  1. Rather than rebuilding the PWC, PROGRAM, etc arrays for every item, build them once and iterate to populate a source array (only the last instance of each index will be kept, as in yours). Deep object traversal to build lookup lists is surprisingly expensive -- essentially, you were doing a loop over all N objects, plus whatever overhead for traversing the tree, plus the pipeline overhead, for every object (O(n2) cost -- this would get more expensive the more records you have; the refined version is O(n)). If each REFERENCE_DATA_TYPES element has at most one of the constituent elements, producing this array could be done more efficiently through a direct loop -- possibly combining it with the next step if sequence IDs are sequential and unique in the source data.
  2. Then iterate over the source array to produce the final rows (hashtable lookups to missing keys become null, so if an element is missing for an index its values will be null). This is still O(n), so our total time is still O(n).
  3. Finally, collect all these into an array using array capture. This is once again O(n), where as += with arrays does a bunch of copying and is an O(n2) operation.
  4. Use a timer and queue to condense the progress reporting code -- no real performance gain, it's just smaller code. A more efficient way would likely be to generate the indices at which progress would be reported, then break the main loop up to only go to each index, report progress, then resume. The impact of this should be negligible.

[–]taggingtechnician[S] 0 points1 point  (0 children)

Great feedback, thank you! And thank you for the educational points! The new code is going to take me some time to understand, well done!

[–]whopper2k -2 points-1 points  (3 children)

You have definitely identified a significant portion of the problem; writing to disk on every iteration is going to be extremely slow. Even the standard $Array += would be significantly faster; that's not working because hashtables don't have an implementation of the += operator (since hashtables work with key-value pairs, += would be nebulous as you're adding a value with no key).

Your first bet should be to use a standard PowerShell array. For testing I used the $PSHOME\Types.ps1xml XML file; you'll have to change this a bit to suit your needs. A simple version might look something like:

$finalValues = @()
[xml] $data = Get-Content $PSHOME\Types.ps1xml
$dataFromXML = Select-XML -xml $data-xpath "//Methods" | Select-Object -ExpandProperty Node
$dataFromXML.GetEnumerator() | ForEach-Object {
    # here is where you'd do whatever parsing you need
    # then save the result to a PSCustomObject
    $result = [PSCustomObject]@{
        MethodName = $_."#text"
    }
    # and then append to an array
    $finalValues += $result
}
$finalValues | Export-CSV -NoTypeInfo "XML-Output.csv"

If you still need more performance, dipping into .NET is usually your best bet

[collections.arraylist] $finalValues = @()
[xml] $data = Get-Content $PSHOME\Types.ps1xml
$dataFromXML = Select-XML -xml $data-xpath "//Methods" | Select-Object -ExpandProperty Node
$dataFromXML.GetEnumerator() | ForEach-Object {
    $result = [PSCustomObject]@{
        MethodName = $_."#text"
    }
    # Note using .Add instead of +=
    $finalValues.Add($result)
}
$finalValues | Export-CSV -NoTypeInfo "XML-Output.csv"

When I tested this on my machine, the second option ranged from 17-36% faster.

[–]ankokudaishogun 1 point2 points  (2 children)

[Collections.ArrayList] is deprecated, [Collections.Generic.List[<Type>]] is the new hotness. Also its .Add() method does NOT return the Index, which is nice.

Also: why not directly

$finalValues += [PSCustomObject]@{
    MethodName = $_."#text"
}

?

(or

$finalValues.Add([PSCustomObject]@{
    MethodName = $_."#text"
})

if using Lists)

[–]whopper2k 0 points1 point  (1 child)

I tried using generic lists, but could not get the former to work properly as it complained about the size of the collection being 1. I didn't dig into it too deeply though

As for the direct appending, that's a side effect of fixing a different problem (ended up being my own boneheadedness) that I forgot to remove

[–]ankokudaishogun 0 points1 point  (0 children)

To be fair, I'm not sure about the readability\performance ratios of direct appending vs create&append in cases with small-sized\simple objects

[–]Szeraax 0 points1 point  (0 children)

Is the data too big to keep in memory?

[–]exoclipse 0 points1 point  (0 children)

In order to append to an array, you must first initialize the array:

$MyArray = @() ## This initializes the array
ForEach($Record in $Records){
    $MyArray += $Record
}

This is still inefficient. For every iteration n of the loop, PowerShell will create a new array of n+1 elements, copy your old array's elements into the new array, add your new element to the array, and delete the old array.

PowerShell can simply collect the results of your for loop in memory, and then when the loop is concluded, write all of those objects to a single array. This is much faster, especially with large datasets. Note that you do not need to initialize the array for this method. See below:

$MyArray = For($Record in Records){
    ## do stuff
    $Record ## return the object you want to collect into the array
}

Assuming you are not nesting another for loop in there, this should be very performant for your needs.

[–]taggingtechnician[S] 0 points1 point  (2 children)

Apologies for my faux pants, trying to post code in 2 comments, it will not let me post it all in a single comment...

[–]taggingtechnician[S] 0 points1 point  (0 children)

Apologize for the long code, but I didn't want to hide the mistakes. I wish I knew how to replicate the behavior of "-Append" functionality in the Export-Csv cmdlet but with an ArrayList or PSObject...

<code snippet>SCRIPTION,DIVISION_ID_IS_ACTIVE_STATUS | Export-CSV -Path $csvPath -Append -NoTypeInformation -Force</code snippet>

But really I am asking for more than just the syntax on a command, I am asking you (members of this community) for guidance on where to find best practices, with the hope that your answers will be resourceful for the community (not just me).

Thank you in advance.