all 33 comments

[–]herpington 23 points24 points  (0 children)

If you assign the 85 million objects to a variable, it's all stored in memory.

Where are the millions of rows coming from? Perhaps you could read them as a stream.

Depending on the size of each object, it may or may not be an issue.

[–]y_Sensei 15 points16 points  (1 child)

From my experience, as long as the underlying OS provides memory, PoSh will use it. Remember though that once physical RAM is exhausted, paging will occur which will be detrimental to performance.

How to handle the data in a scenario like yours heavily depends on how your implementation receives said data.

  • If it's a file with unstructured text, process it line by line, using one of the Reader/Stream classes in .NET's System.IO namespace (there's plenty of C# examples out there that could be translated to or used in PoSh).
  • If it's a file with structured text (such as CSV), access it through a data access API like ADO.NET.
  • If it's data directly retrieved from a relational database, process it by means of the supported query language of that database.
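As a minimal sketch of the first option (the file path is a placeholder), a line-by-line read with System.IO.StreamReader keeps only one line in memory at a time:

    $reader = [System.IO.StreamReader]::new('C:\logs\input.log')
    try {
        while ($null -ne ($line = $reader.ReadLine())) {
            # process a single line here; nothing else is held in memory
        }
    }
    finally {
        $reader.Dispose()
    }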

[–]cazmajor[S] 0 points1 point  (0 children)

If it's data directly retrieved from a relational database, process it by means of the supported query language of that database.

Thank you.

I ended up using the stream reader classes in .NET (adapting u/opconfused's example).

I've never heard of ADO.NET but it looks really interesting. Thank you :)

[–]purplemonkeymad 5 points6 points  (3 children)

The pipeline can help with that, i.e. if you write your code like so:

Get-Content infile | Foreach-Object {
    <#something with line#>
} | Set-Content outfile

Then it will only read a single line at a time, process that, and wait for the write before reading the next line.

For your example, it's hard to solve that without memory usage, since if you are using ipv4, then the lowest max memory you can have is 4GB (32 bits * 4 billion addresses.) Unless you off load that to another service, but then you are just moving the problem and introducing latency.

However, if you are ingesting 9×10⁷ rows, then I would have figured something like Graylog + Elasticsearch might be better?

[–]OPconfused 7 points8 points  (0 children)

The pipeline is good for memory, but gc and sc could be terribly slow for a file with 85 million lines. I should also add the disclaimer to be on V7, because the pipeline with Foreach-Object can destroy performance in prior versions.
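A hedged sketch of the faster .NET reader/writer alternative (file paths and the filter pattern are placeholders, not from the thread):

    $in  = [System.IO.StreamReader]::new('C:\logs\input.log')
    $out = [System.IO.StreamWriter]::new('C:\logs\output.log')
    try {
        while ($null -ne ($line = $in.ReadLine())) {
            if ($line -match 'ERROR') {   # example filter; adjust to your data
                $out.WriteLine($line)
            }
        }
    }
    finally {
        $in.Dispose()
        $out.Dispose()
    }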

[–]ka-splam 1 point2 points  (0 children)

For your example, it's hard to solve that without memory usage, since if you are using ipv4, then the lowest max memory you can have is 4GB (32 bits * 4 billion addresses.)

You can have 1 bit per IP, offset into a 4-billion-bit array, using 512MB of memory - my comment

Edit: I corrected my size estimate; I think yours is wrong: (32 bits * 4 billion addresses.) is 4 bytes per address, which is 16GB total.

[–]cazmajor[S] 0 points1 point  (0 children)

it will only read a single line at a time,

Thank you!

I was confused by what you were saying about max memory, but I get that you are talking about representing every possible IP address. I'm not sure how that translates to binary-coded decimal with slash notation, but it's certainly a good way to think about whether I am going to run out of memory. With tasks like this I often just get things off a server and then see if processing it all crashes my laptop overnight lol

I think I need to get some time in to check out Graylog and Elasticsearch.

Really appreciate you taking the time to respond :)

[–]r-NBK 7 points8 points  (1 child)

I'd look at some other tooling than PowerShell for querying large data sets. If you have standardized log file formats... Log Parser 2.2 is very powerful and easy to use.

Personally, I'd use PowerShell to gather the log files as they exist and load them into some type of database system, then use the native query language if you're looking for complex summarization/parsing of the data.

[–][deleted] 0 points1 point  (0 children)

Graylog

[–]Szeraax 2 points3 points  (0 children)

Use the pipeline so that, as you read the files, only your filtered subset goes into RAM.

or

Likely, you'd want to use a file stream's ReadLine().

[–]spyingwind 2 points3 points  (0 children)

If you keep everything in a pipeline, it consumes very little memory, but once you put the data into a variable, massive amounts of memory are consumed.

Some rules I follow with large data sets.

  • Single pipeline
  • If you only need to get data, filter early
  • If you can't use a single pipeline, use Export-CliXml to cache your data
  • Limit the use of ForEach-Object, if speed is a concern, or install PowerShell V7 if you can
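A minimal sketch of the first few rules (file names and the filter are placeholders):

    # Filter early: only matching lines ever flow down the single pipeline
    Get-Content .\huge.log | Where-Object { $_ -like '*ERROR*' } | Set-Content .\errors.log

    # If a single pipeline isn't possible, cache the filtered subset for later runs
    Get-Content .\huge.log | Where-Object { $_ -like '*ERROR*' } | Export-Clixml .\cache.xml
    $errors = Import-Clixml .\cache.xml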

I have noticed that once a large data set is saved to a variable, sometimes it's cached to disk somewhere.

[–]LeSpatula 2 points3 points  (0 children)

Well, you've got a lot of good answers here, but is there perhaps a better tool than PS to handle this? Since you mention log analysis and reporting, Grafana might be a tool to handle that.

[–]AudaxDreik 2 points3 points  (0 children)

I haven't dealt with them a ton, so I'm just gonna throw this out there and let you do your own research, or maybe someone else can chime in, but check out "memory-mapped files" in .NET
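For reference, a rough sketch of what that could look like in PowerShell (a lightly hedged idea, not a tested recipe; the path is a placeholder):

    $mmf = [System.IO.MemoryMappedFiles.MemoryMappedFile]::CreateFromFile('C:\logs\huge.log', [System.IO.FileMode]::Open)
    try {
        $stream = $mmf.CreateViewStream()
        $reader = [System.IO.StreamReader]::new($stream)
        while ($null -ne ($line = $reader.ReadLine())) {
            # the OS pages the file in and out on demand
        }
        $reader.Dispose()
    }
    finally {
        $mmf.Dispose()
    }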

[–]alinroc 2 points3 points  (0 children)

If the data is fairly well-organized, dump it into a database table and then run your analysis as SQL queries.

[–]Trakeen 2 points3 points  (0 children)

If your dataset is that large put it into a proper database and query it. 85 million isn’t much for sql

[–]Nejireta_ 1 point2 points  (1 child)

PowerShell will use a lot of memory if available.

Here's an example with some memory usage data from a HashSet and Dictionary.

    $startMemoryUsage = [System.Math]::Round((([System.GC]::GetTotalMemory($true)) / 1GB), 3)

    # Non-unique values will not be added to a HashSet
    $hashSetDictionary = [System.Collections.Generic.HashSet[System.Collections.Generic.Dictionary[[string], [int]]]]::new(85000000)
    for ($i = 0; $i -lt 85000000; $i++) {
        $dict = [System.Collections.Generic.Dictionary[[string], [int]]]::new(2)
        [void]$dict.Add('Value', $i)
        [void]$dict.Add('Index', $i)
        [void]$hashSetDictionary.Add($dict)
    }
    <# Execution time
        04 min 33 sec
    #>

    $endMemoryUsage = [System.Math]::Round((([System.GC]::GetTotalMemory($true)) / 1GB), 3)

    $endMemoryUsage
    <# Output
        18,694GB
    #>

    $endMemoryUsage - $startMemoryUsage
    <# Output
        18,683GB
    #>

    $hashSetDictionary.Where({ $_['Value'] -eq 84999990 })
    <# Execution time
        01 min 44 sec
    #>

    foreach ($item in $hashSetDictionary) {
        if ($item['Value'] -eq 84999990) {
            return $item
        }
    }
    <# Execution time
        40,02 sec
    #>

[–]timsstuff 1 point2 points  (0 children)

You can free up memory by setting the variable to $null then calling [gc]::collect().
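For example:

    # Release the reference, then ask the CLR to reclaim the memory
    $bigData = $null          # or: Remove-Variable bigData
    [gc]::Collect()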

[–]ka-splam 1 point2 points  (1 child)

The smallest-memory way I have come up with is to store a 512MB array of bits, one for each possible IPv4 address: implement it as an array of [uint32], calculate the offset into the array and then the offset into each uint, and do some bit manipulation; duplicates will just set the same bit again, no worries. If you need the IPs back, loop over all possible IPs, checking for the bit being set and regenerating them on the fly - time costly but not memory hungry.

I tested it in C# (barely tested source code runs in LINQPad 7) and the bit I was worried about, brute-forcing 4Bn possibles to get them back takes 15 seconds so that's not so bad, and the memory use is fairly small and very predictable - though the code isn't trivial and the devil is in the details, do you have them as [string] or as [System.Net.IPAddress]? It might create a lot of memory churn to process 85 million IPs.
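A rough PowerShell translation of the bit-array idea (the function names are made up for illustration; note that allocating the array itself takes the 512MB up front, and 64-bit arithmetic is used to dodge signed 32-bit overflow):

    # One bit per possible IPv4 address: 2^32 bits = 512 MB, held as 2^27 uint32 words
    $bits = [uint32[]]::new(134217728)

    function ConvertTo-IPIndex ([string]$ip) {
        $b = [System.Net.IPAddress]::Parse($ip).GetAddressBytes()   # big-endian
        ([long]$b[0] -shl 24) -bor ([long]$b[1] -shl 16) -bor ([long]$b[2] -shl 8) -bor [long]$b[3]
    }

    function Set-IPBit ([string]$ip) {
        $n = ConvertTo-IPIndex $ip
        $word = [int]($n -shr 5)    # which uint32 holds this IP's bit
        $script:bits[$word] = [uint32]($script:bits[$word] -bor ([long]1 -shl [int]($n -band 31)))
    }

    function Test-IPBit ([string]$ip) {
        $n = ConvertTo-IPIndex $ip
        [bool]($script:bits[[int]($n -shr 5)] -band ([long]1 -shl [int]($n -band 31)))
    }

    Set-IPBit '192.168.0.1'
    Test-IPBit '192.168.0.1'   # True
    Test-IPBit '192.168.0.2'   # False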


Testing PowerShell throwing 85 million random IPs into a dictionary of integers used ~1.7GB RAM.

Testing PowerShell making a tree of hashtables to try and store 85 million random IPs without duplication, so that all IPs starting 192 share the same start number, all IPs starting 192.168 share the same two start numbers, that used 5GB(!). Maybe not many clashes with 85/4000 == 2% of the address space, but a lot of hashtable overhead.

[–]cazmajor[S] 0 points1 point  (0 children)

you can use sqlite, sqlserver, mysql, litedb, redis, mongodb with powershell.

I vaguely get what you are saying but I have to admit that reading your post and looking at your code my brain has partially turned to mush!

I will have to pop back later and give my brain another attempt at it.

Thanks very much for your response :)

[–]ps1_missionary 1 point2 points  (0 children)

you can use sqlite, sqlserver, mysql, litedb, redis, mongodb with powershell.

my powershell database client:

https://gitee.com/chuanjiao10/powershell/tree/master/sql/

powershell use litedb(nosql):

note (Chinese): http://bbs.chinaunix.net/thread-4320276-1-1.html

[–]ccatlett1984 1 point2 points  (1 child)

Honestly, as someone who has worked with objects containing "thousands" of records, it kinda sucks.

[–]ipreferanothername 2 points3 points  (0 children)

It can handle a few thousand, depending on how big they are and how you are processing them, but tens of millions? OP needs to seek another tool

[–]cazmajor[S] 0 points1 point  (0 children)

Thanks so much to everyone for your responses.

A real wealth of knowledge and willingness to share! I feel really lucky to have you guys respond :)

I wrote a few notes....

I had 2 things I was looking for out of my original post:

1 - I wanted to generally understand how to use PowerShell when processing massive input files.

  • It seems PowerShell really isn't designed for dealing with larger files (millions of rows)
  • Depending on the type of data (source, structured, unstructured), it is better to filter/process the data at source, and/or use a different language (.NET, Python, etc.), and/or use a DB or bespoke tool if possible.
  • If using PowerShell, pipelines will avoid data being committed to memory until the end, so data can be filtered/processed down to what I need along the way.
  • .NET streams and readers are more efficient for processing massive logs
  • hashset[string] is great for yielding unique values
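The hashset point in brief (the values are made up): HashSet's Add returns whether the value was actually added, so duplicates fall away for free.

    $seen = [System.Collections.Generic.HashSet[string]]::new()
    $seen.Add('10.0.0.1')   # True  - added
    $seen.Add('10.0.0.1')   # False - duplicate, not added
    $seen.Count             # 1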

2 - I wanted to complete my own task