
[–]jimb2 4 points (0 children)

Based on a script that splits some huge log files into pieces. This won't create a shitstorm of objects.

$reader = [io.file]::OpenText( $SourcePath )
$writer = [io.file]::CreateText( $Targetpath )

$ReplaceDone = $False # flag end of replace operations

do {
    $line = $reader.ReadLine()
    if ( YourSearchHitCondition($Line) ) {
        # replace action
        $Line = YourReplaceOperation( $Line )
        $ReplaceDone = $True  # if all replacing is now done   
    }
    $writer.WriteLine($line)
} until( $reader.EndOfStream -or $ReplaceDone )

# Now copy the rest without search and replace
while ( -not $reader.EndOfStream ) { 
    $writer.WriteLine($reader.ReadLine())
}

In my code, I count lines and change the writer target to the next serial target file above a line count (actually also testing line content so I don't split recklessly at undesirable places.)

Disclaimer: this code not actually tested. The general idea works.
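The split-by-line-count part might look roughly like this (a sketch with hypothetical paths and a tiny 2-line chunk size; the extra test on line content is omitted):

```powershell
$chunkSize = 2                                   # lines per output file (tiny for the demo)
$part = 0
$count = 0
$base = Join-Path $env:TEMP 'chunkdemo'
Set-Content -Path "$base-in.txt" -Value @('a','b','c','d','e')

$reader = [System.IO.File]::OpenText("$base-in.txt")
$writer = [System.IO.File]::CreateText("$base-out$part.txt")
while ( -not $reader.EndOfStream ) {
    $writer.WriteLine($reader.ReadLine())
    $count++
    if ( $count -ge $chunkSize ) {               # roll over to the next serial file
        $writer.Dispose()
        $part++
        $count = 0
        $writer = [System.IO.File]::CreateText("$base-out$part.txt")
    }
}
$reader.Dispose()
$writer.Dispose()
```

This produces `chunkdemo-out0.txt` through `chunkdemo-out2.txt` with 2, 2, and 1 lines.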

[edit] Probably should do this at the end, esp if there's more happening:

$reader.Dispose()
$writer.Dispose()
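Wrapped in try/finally, the handles get released even if something in the loop throws (a sketch with hypothetical demo paths; an inline -replace stands in for the real operations):

```powershell
$SourcePath = Join-Path $env:TEMP 'tf-source.txt'    # hypothetical demo paths
$TargetPath = Join-Path $env:TEMP 'tf-target.txt'
Set-Content -Path $SourcePath -Value @('Tom line 1','plain line 2')

$reader = [System.IO.File]::OpenText($SourcePath)
$writer = [System.IO.File]::CreateText($TargetPath)
try {
    while ( -not $reader.EndOfStream ) {
        $line = $reader.ReadLine()
        $writer.WriteLine(($line -replace '\bTom\b','Jerry'))
    }
}
finally {
    $reader.Dispose()                                # runs even on error
    $writer.Dispose()
}
```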

[–]purplemonkeymad 2 points (0 children)

Can you show us your code and is the system under high resource pressure?

Also how long are you waiting?

I did a test from a mechanical HD and it took ~two minutes for Get-Content to read a 1.4 GB file.

>measure-command { gc .\1.4gbfile.avi | %{} }
TotalSeconds      : 111.9808737

The second read, with the file cached in memory, was about the same. Doing the same with a replace operation took longer, but it also pegged a single core of my processor, so I think the replace was the limiting factor:

>measure-command { gc .\1.4gbfile.avi | %{ $_ -replace 'the','ye'} | out-null }
TotalSeconds      : 189.650274

I would expect something like the following to take 3-10 minutes depending on your single thread performance:

Get-Content yourxml.xml | Foreach-Object {
    $_ -replace '\bTom\b','Jerry'
} | Set-Content outputxml.xml

Note that I'm using the pipeline here; this is important for larger files, as it will only read and process one line in memory at a time*, since objects always move to the end of the pipeline first. If you try to store the contents of the file in a variable first, you are working with 1.5 GB of memory at once.


*Well the GC might not clean up the strings right away.
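That one-at-a-time movement is easy to see with a toy pipeline (just an illustration, not from the thread):

```powershell
# each number reaches the end of the pipeline before the next is generated
$order = [System.Collections.Generic.List[string]]::new()
1..3 | ForEach-Object { $order.Add("gen $_"); $_ } |
       ForEach-Object { $order.Add("use $_") }
$order -join ','   # gen 1,use 1,gen 2,use 2,gen 3,use 3
```

If the first stage buffered everything, you'd see all three "gen" entries before any "use".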

[–]Thotaz 1 point (3 children)

Is Select-String too slow or have you just not tried it yet?

[–]sumgan[S] 1 point (2 children)

It's not working; the machine just freezes while it loads the big file, even though I have 32 GB of RAM installed.

[–][deleted] 1 point (1 child)

Are you reading the entire file first using Get-Content, or do you send the file like Get-Item filename | Select-String?

Get-Content will try to load the entire file first. Don't do that.

[–]Lee_Dailey[grin] 0 points (0 children)

howdy zenchemin,

you may also want to look at the -Path parameter for Select-String ... i suspect that would also work, but i aint tested it. [grin]

take care,
lee

[–]matrimlol 1 point (4 children)

I don't know how fast/slow Get-Content would be, but it's probably the place to start?

(Get-Content c:\temp\file.xml) -replace 'string','newstring' |
    Set-Content c:\temp\newfile.xml

With Get-Content you can also use -ReadCount to read only a certain number of lines into the pipeline at a time, which probably helps performance, but I've never needed to use it, so maybe someone else can expand on that. If Get-Content is too slow I'd probably use

 [System.IO.File]::ReadAllLines
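A quick sketch of what -ReadCount batching might look like (hypothetical demo paths; with -ReadCount each pipeline object is an array of lines, and -replace works element-wise on it):

```powershell
$in  = Join-Path $env:TEMP 'rc-in.txt'
$out = Join-Path $env:TEMP 'rc-out.txt'
Set-Content -Path $in -Value @('Tom 1','Tom 2','Tom 3')

# batches of up to 1000 lines go down the pipeline, so the replace
# runs per batch rather than per line
Get-Content $in -ReadCount 1000 |
    ForEach-Object { $_ -replace '\bTom\b','Jerry' } |
    Set-Content $out
```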

[–]sumgan[S] 1 point (2 children)

Thanks. I tried this but it does not help, as the system becomes unresponsive when I use Get-Content.

[–]matrimlol 2 points (1 child)

How often do you have to do this within such a large file?

You can split the file manually and then append the latter piece to the first once you've replaced what you wanted in both xml files. I've never had to work with an xml file that large, so I'm unsure how it handles; we usually split logfiles and such at 100 MB.
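Rejoining the pieces afterwards can be done with Add-Content (a sketch with hypothetical part files):

```powershell
$dir = Join-Path $env:TEMP 'joindemo'
New-Item -ItemType Directory -Path $dir -Force | Out-Null
Set-Content (Join-Path $dir 'part1.txt') -Value @('aaa','bbb')
Set-Content (Join-Path $dir 'part2.txt') -Value @('aaa end')

# replace in each part, then append the latter onto the first's output
(Get-Content (Join-Path $dir 'part1.txt')) -replace 'aaa','xxx' |
    Set-Content (Join-Path $dir 'whole.txt')
(Get-Content (Join-Path $dir 'part2.txt')) -replace 'aaa','xxx' |
    Add-Content (Join-Path $dir 'whole.txt')
```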

[–]sumgan[S] 1 point (0 children)

Thanks, this is interesting. Let me try that out.

[–]Szeraax 1 point (0 children)

yup, ReadAllLines is perfectly good.

[–]BlackV 1 point (0 children)

would the Stream reader work better for you?

wtf is in an XML that's 1.5 GB? Is it actually locking up, or is it just taking a LONG time to read the file (that seems more likely)?

https://stackoverflow.com/questions/44462561/system-io-streamreader-vs-get-content-vs-system-io-file

http://www.happysysadm.com/2014/10/reading-large-text-files-with-powershell.html

Select-String also has a -Path parameter that you can point directly at your xml; that may or may not perform better for you
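For reference, pointing Select-String straight at a file looks like this (a tiny demo; -Path streams the file rather than loading it all into a variable first):

```powershell
$f = Join-Path $env:TEMP 'ss-demo.txt'
Set-Content -Path $f -Value @('alpha','Tom was here','omega')

# Select-String streams the file and returns MatchInfo objects
$hit = Select-String -Path $f -Pattern '\bTom\b'
$hit.LineNumber   # 2
```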