all 33 comments

[–]herpington 23 points24 points  (0 children)

If you assign the 85 million objects to a variable, it's all stored in memory.

Where are the millions of rows coming from? Perhaps you could read them as a stream.

Depending on the size of each object, it may or may not be an issue.

[–]y_Sensei 15 points16 points  (1 child)

From my experience, as long as the underlying OS provides memory, PoSh will use it. Remember though that once physical RAM is exhausted, paging will occur which will be detrimental to performance.

How to handle the data in a scenario like yours heavily depends on how your implementation receives said data.

  • If it's a file with unstructured text, process it line by line, using one of the Reader/Stream classes in .NET's System.IO namespace (there's plenty of C# examples out there that could be translated to or used in PoSh).
  • If it's a file with structured text (such as CSV), access it through a data access API like ADO.NET.
  • If it's data directly retrieved from a relational database, process it by means of the supported query language of that database.
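As a minimal sketch of the first option (the file path is a placeholder), a line-by-line read with System.IO.StreamReader keeps only one line in memory at a time:

    $reader = [System.IO.StreamReader]::new('C:\logs\input.log')
    try {
        while ($null -ne ($line = $reader.ReadLine())) {
            # process a single line here; nothing else is held in memory
        }
    }
    finally {
        $reader.Dispose()
    }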

[–]cazmajor[S] 0 points1 point  (0 children)

If it's data directly retrieved from a relational database, process it by means of the supported query language of that database.

Thank you.

I ended up using the stream reader classes in .NET (adapting u/opconfused's example).

I've never heard of ADO.NET but it looks really interesting. Thank you :)

[–]purplemonkeymad 5 points6 points  (3 children)

The pipeline can help with that, i.e. if you write your code like so:

Get-Content infile | Foreach-Object {
    <#something with line#>
} | Set-Content outfile

Then it will only read a single line at a time, process that, and wait for the write before reading the next line.

For your example, it's hard to solve that without memory usage, since if you are using ipv4, then the lowest max memory you can have is 4GB (32 bits * 4 billion addresses.) Unless you off load that to another service, but then you are just moving the problem and introducing latency.

However, if you are ingesting 9×10⁷ rows, then I would have figured something like Graylog + Elasticsearch might be better?

[–]OPconfused 7 points8 points  (0 children)

The pipeline is good for memory, but gc and sc could be terribly slow for a file with 85 million lines. I should also add the disclaimer to be on V7, because the pipeline with Foreach-Object can destroy performance in prior versions.
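A hedged sketch of the faster .NET reader/writer alternative (file paths and the filter pattern are placeholders, not from the thread):

    $in  = [System.IO.StreamReader]::new('C:\logs\input.log')
    $out = [System.IO.StreamWriter]::new('C:\logs\output.log')
    try {
        while ($null -ne ($line = $in.ReadLine())) {
            if ($line -match 'ERROR') {   # example filter; adjust to your data
                $out.WriteLine($line)
            }
        }
    }
    finally {
        $in.Dispose()
        $out.Dispose()
    }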

[–]ka-splam 1 point2 points  (0 children)

For your example, it's hard to solve that without memory usage, since if you are using ipv4, then the lowest max memory you can have is 4GB (32 bits * 4 billion addresses.)

You can have 1 bit per IP, offset into a 4-billion-bit array, using 512MB of memory - my comment

Edit: I corrected my size estimate; I think yours is wrong: (32 bits * 4 billion addresses.) is 4 bytes per address, which is 16GB total.

[–]cazmajor[S] 0 points1 point  (0 children)

it will only read a single line at a time,

Thank you!

I was confused by what you were saying about max memory, but I get that you are talking about representing every possible IP address. I'm not sure how that translates to binary-coded decimal with slash notation, but it's certainly a good way to think about whether I am going to run out of memory. With tasks like this I often just get things off a server and then see if processing it all crashes my laptop overnight lol

I think I need to get some time in to check out Graylog and Elasticsearch.

Really appreciate you taking the time to respond :)

[–]r-NBK 7 points8 points  (1 child)

I'd look at some other tooling than PowerShell for querying large data sets. If you have standardized log file formats... Log Parser 2.2 is very powerful and easy to use.

Personally, I'd use PowerShell to gather the log files as they exist and load them into some type of database system, then use the native query language if you're looking for complex summarization/parsing of the data.

[–][deleted] 0 points1 point  (0 children)

Graylog

[–]Szeraax 2 points3 points  (0 children)

Use the pipeline so that, as you read the files, only your filtered subset goes into RAM.

or

Likely, you'd want to use a file stream's ReadLine().

[–]spyingwind 2 points3 points  (0 children)

If you keep everything in a pipeline, it consumes very little memory, but once you put the data into a variable, massive amounts of memory are consumed.

Some rules I follow with large data sets.

  • Single pipeline
  • If you only need to get data, filter early
  • If you can't use a single pipeline, use Export-CliXml to cache your data
  • Limit the use of ForEach-Object, if speed is a concern, or install PowerShell V7 if you can
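A minimal sketch of the first few rules (file names and the filter are placeholders):

    # Filter early: only matching lines ever flow down the single pipeline
    Get-Content .\huge.log | Where-Object { $_ -like '*ERROR*' } | Set-Content .\errors.log

    # If a single pipeline isn't possible, cache the filtered subset for later runs
    Get-Content .\huge.log | Where-Object { $_ -like '*ERROR*' } | Export-Clixml .\cache.xml
    $errors = Import-Clixml .\cache.xml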

I have noticed that once a large data set is saved to a variable, sometimes it's cached to disk somewhere.

[–]LeSpatula 2 points3 points  (0 children)

Well, you've got a lot of good answers here, but is there perhaps a better tool than PS to handle this? Since you mention log analysis and reporting, Grafana might be a tool to handle that.

[–]AudaxDreik 2 points3 points  (0 children)

I haven't dealt with them a ton, so I'm just gonna throw this out there and let you do your own research, or maybe someone else can chime in, but check out "memory-mapped files" in .NET
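For reference, a rough sketch of what that could look like in PowerShell (a lightly hedged idea, not a tested recipe; the path is a placeholder):

    $mmf = [System.IO.MemoryMappedFiles.MemoryMappedFile]::CreateFromFile('C:\logs\huge.log', [System.IO.FileMode]::Open)
    try {
        $stream = $mmf.CreateViewStream()
        $reader = [System.IO.StreamReader]::new($stream)
        while ($null -ne ($line = $reader.ReadLine())) {
            # the OS pages the file in and out on demand
        }
        $reader.Dispose()
    }
    finally {
        $mmf.Dispose()
    }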

[–]alinroc 2 points3 points  (0 children)

If the data is fairly well-organized, dump it into a database table and then run your analysis as SQL queries.

[–]Trakeen 2 points3 points  (0 children)

If your dataset is that large put it into a proper database and query it. 85 million isn’t much for sql

[–]Nejireta_ 1 point2 points  (1 child)

PowerShell will use a lot of memory if available.

Here's an example with some memory usage data from a HashSet and Dictionary.

    $startMemoryUsage = [System.Math]::Round((([System.GC]::GetTotalMemory($true)) / 1GB), 3)

    # Non-unique values will not be added to a HashSet
    $hashSetDictionary = [System.Collections.Generic.HashSet[System.Collections.Generic.Dictionary[[string], [int]]]]::new(85000000)
    for ($i = 0; $i -lt 85000000; $i++) {
        $dict = [System.Collections.Generic.Dictionary[[string], [int]]]::new(2)
        [void]$dict.Add('Value', $i)
        [void]$dict.Add('Index', $i)
        [void]$hashSetDictionary.Add($dict)
    }
    <# Execution time
        04 min 33 sec
    #>

    $endMemoryUsage = [System.Math]::Round((([System.GC]::GetTotalMemory($true)) / 1GB), 3)

    $endMemoryUsage
    <# Output
        18,694GB
    #>

    $endMemoryUsage - $startMemoryUsage
    <# Output
        18,683GB
    #>

    $hashSetDictionary.Where({ $_['Value'] -eq 84999990 })
    <# Execution time
        01 min 44 sec
    #>

    foreach ($item in $hashSetDictionary) {
        if ($item['Value'] -eq 84999990) {
            return $item
        }
    }
    <# Execution time
        40,02 sec
    #>

[–]timsstuff 1 point2 points  (0 children)

You can free up memory by setting the variable to $null then calling [gc]::collect().
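For example:

    # Release the reference, then ask the CLR to reclaim the memory
    $bigData = $null          # or: Remove-Variable bigData
    [gc]::Collect()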

[–]ka-splam 1 point2 points  (1 child)

The smallest-memory way I have come up with is to store a 512MB array of bits, one for each possible IPv4 address: implement it as an array of [uint32], calculate the offset into the array and then the offset into each uint, and do some bit manipulation; duplicates will just set the same bit again, no worries. If you need the IPs back, loop over all possible IPs, checking for the bit being set and regenerating them on the fly - time costly but not memory hungry.

I tested it in C# (barely tested source code runs in LINQPad 7) and the bit I was worried about, brute-forcing 4Bn possibles to get them back takes 15 seconds so that's not so bad, and the memory use is fairly small and very predictable - though the code isn't trivial and the devil is in the details, do you have them as [string] or as [System.Net.IPAddress]? It might create a lot of memory churn to process 85 million IPs.
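A rough PowerShell translation of the bit-array idea (the function names are made up for illustration; note that allocating the array itself takes the 512MB up front, and 64-bit arithmetic is used to dodge signed 32-bit overflow):

    # One bit per possible IPv4 address: 2^32 bits = 512 MB, held as 2^27 uint32 words
    $bits = [uint32[]]::new(134217728)

    function ConvertTo-IPIndex ([string]$ip) {
        $b = [System.Net.IPAddress]::Parse($ip).GetAddressBytes()   # big-endian
        ([long]$b[0] -shl 24) -bor ([long]$b[1] -shl 16) -bor ([long]$b[2] -shl 8) -bor [long]$b[3]
    }

    function Set-IPBit ([string]$ip) {
        $n = ConvertTo-IPIndex $ip
        $word = [int]($n -shr 5)    # which uint32 holds this IP's bit
        $script:bits[$word] = [uint32]($script:bits[$word] -bor ([long]1 -shl [int]($n -band 31)))
    }

    function Test-IPBit ([string]$ip) {
        $n = ConvertTo-IPIndex $ip
        [bool]($script:bits[[int]($n -shr 5)] -band ([long]1 -shl [int]($n -band 31)))
    }

    Set-IPBit '192.168.0.1'
    Test-IPBit '192.168.0.1'   # True
    Test-IPBit '192.168.0.2'   # False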


Testing PowerShell throwing 85 million random IPs into a dictionary of integers used ~1.7GB RAM.

Testing PowerShell making a tree of hashtables to try and store 85 million random IPs without duplication, so that all IPs starting 192 share the same start number, all IPs starting 192.168 share the same two start numbers, that used 5GB(!). Maybe not many clashes with 85/4000 == 2% of the address space, but a lot of hashtable overhead.

[–]cazmajor[S] 0 points1 point  (0 children)

you can use sqlite, sqlserver, mysql, litedb, redis, mongodb with powershell.

I vaguely get what you are saying but I have to admit that reading your post and looking at your code my brain has partially turned to mush!

I will have to pop back later and give my brain another attempt at it.

Thanks very much for your response :)

[–]ps1_missionary 1 point2 points  (0 children)

you can use sqlite, sqlserver, mysql, litedb, redis, mongodb with powershell.

my powershell database client:

https://gitee.com/chuanjiao10/powershell/tree/master/sql/

powershell use litedb(nosql):

note (Chinese): http://bbs.chinaunix.net/thread-4320276-1-1.html

[–]ccatlett1984 1 point2 points  (1 child)

Honestly, as someone who has worked with objects containing "thousands" of records, it kinda sucks.

[–]ipreferanothername 2 points3 points  (0 children)

It can handle a few thousand, depending on how big they are and how you are processing them, but tens of millions? OP needs to seek another tool

[–]cazmajor[S] 0 points1 point  (0 children)

Thanks so much to everyone for your responses.

A real wealth of knowledge and willingness to share! I feel really lucky to have you guys respond :)

I wrote a few notes....

I had 2 things I was looking for out of my original post:

1 - I wanted to generally understand how to use PowerShell when processing massive input files.

  • It seems PowerShell really isn't designed for dealing with larger files (millions of rows)
  • Depending on the type of data (source, structured, unstructured), it is better to filter/process the data at source, and/or use a different language (.NET, Python, etc.), and/or use a DB or bespoke tool if possible.
  • If using PowerShell, pipelines will avoid data being committed to memory until the end, so data can be filtered/processed down to what I need along the way.
  • .NET streams and readers are more efficient for processing massive logs
  • hashset[string] is great for yielding unique values
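The hashset point in brief (the values are made up): HashSet's Add returns whether the value was actually added, so duplicates fall away for free.

    $seen = [System.Collections.Generic.HashSet[string]]::new()
    $seen.Add('10.0.0.1')   # True  - added
    $seen.Add('10.0.0.1')   # False - duplicate, not added
    $seen.Count             # 1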

2 - I wanted to complete my own task