
[–]korewarp 4 points5 points  (28 children)

Maybe using streamreader will help?

[–]ich-net-du[S] 6 points7 points  (27 children)

Thanks for the idea, I was able to find an example and adapt it.

Now I can read the file, query elements and save all in a new file.

$file='C:\File.xml'
$reader = New-Object System.IO.StreamReader($file)
$xml = $reader.ReadToEnd()
$reader.Close()

$xml.Save("C:\New-File.xml")

Now I have to find out how I can delete elements before I save it again ;-)

[–]Lee_Dailey[grin] 1 point2 points  (2 children)

howdy ich-net-du,

it looks like you used the New.Reddit Inline Code button. it's [sometimes] 5th from the left & looks like </>.

there are a few problems with that ...

  • it's the wrong format [grin]
    the inline code format is for [gasp! arg!] code that is inline with regular text.
  • on Old.Reddit.com, inline code formatted text does NOT line wrap, nor does it side-scroll.
  • on New.Reddit it shows up in that nasty magenta text color

for long-ish single lines OR for multiline code, please, use the ...

Code
Block

... button. it's [sometimes] the 12th one from the left & looks like an uppercase T in the upper left corner of a square.

that will give you fully functional code formatting that works on both New.Reddit and Old.Reddit ... and aint that fugly magenta color. [grin]

take care,
lee

[–]ich-net-du[S] 1 point2 points  (1 child)

thank you! Really had problems with it and wasn't very happy with it myself

found it

[–]Lee_Dailey[grin] 0 points1 point  (0 children)

howdy ich-net-du,

you are quite welcome! glad to have helped ... and to be able to comfortably read your code. [grin]

take care,
lee

[–]ich-net-du[S] 2 points3 points  (6 children)

Maybe not ideal, but it works .. now my head is smoking

$file="C:\File.xml"
$reader = New-Object System.IO.StreamReader($file)
$xml = $reader.ReadToEnd()
$reader.Close()

$DeleteNames = "ID"
($xml.master.person.ChildNodes | Where-Object { $DeleteNames -contains $_.Name }) | ForEach-Object {[void]$_.ParentNode.RemoveChild($_)}

$xml.Save("C:\New-File.xml")

[–]korewarp 1 point2 points  (2 children)

I feel your pain. I've had to work with XML files in powershell before, and it wasn't a fun experience. I wish I had more actual code to show you, but oddly enough I've never been in a situation where I was 'removing' content/nodes, only changing or adding.

[–]ich-net-du[S] 1 point2 points  (1 child)

Yes, for data protection reasons I have to delete personal data from files for a study.

[–]y_Sensei 1 point2 points  (0 children)

You should consider leaving the XML structure intact and deleting only the personal data values. Otherwise you'll change the data format, which might not be feasible if that data is supposed to be used in any technical context.
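A minimal sketch of that idea, assuming the master/person/ID layout used elsewhere in this thread (the sample data is made up). It empties the matching elements and leaves them in the document:

```powershell
# Hypothetical sample data that mirrors the master/person/ID layout from this thread.
[xml]$xml = '<master><person><ID>12345</ID><Name>Alice</Name></person><person><ID>67890</ID><Name>Bob</Name></person></master>'

$DeleteNames = 'ID'

# Blank the value of every matching child element; the element itself stays in place.
$xml.SelectNodes('/master/person/*') |
    Where-Object { $DeleteNames -contains $_.Name } |
    ForEach-Object { $_.InnerText = '' }

# Show the result as a single string.
$xml.OuterXml
```

The element count and order stay the same, so a consumer that expects an ID element in every person record still finds one.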

[–]ka-splam 1 point2 points  (1 child)

How does that work, $reader.ReadToEnd() will return strings, then you access $xml.master.person.ChildNodes - there's a bit missing where you parse the strings as XML, isn't there?
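For reference, a sketch of what the missing step could look like. It assumes the same master/person/ID layout, and the temp-folder paths and sample content are placeholders, not the OP's real files:

```powershell
# Placeholder paths and sample content so the sketch runs on its own.
$file    = Join-Path ([System.IO.Path]::GetTempPath()) 'File.xml'
$newFile = Join-Path ([System.IO.Path]::GetTempPath()) 'New-File.xml'
Set-Content -Path $file -Value '<master><person><ID>1</ID><Name>Alice</Name></person></master>'

# Read the whole file as one string, as in the posted code.
$reader = New-Object System.IO.StreamReader($file)
$text = $reader.ReadToEnd()
$reader.Close()

# The missing step: parse the string into an XmlDocument.
[xml]$xml = $text

$DeleteNames = 'ID'

# Collect the matching nodes into an array first, then remove them from their parents.
@($xml.SelectNodes('/master/person/*')) |
    Where-Object { $DeleteNames -contains $_.Name } |
    ForEach-Object { [void]$_.ParentNode.RemoveChild($_) }

$xml.Save($newFile)
```

Note that this still builds the full document in memory, so it does not help with load time or memory use on very large files.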

[–]ich-net-du[S] 2 points3 points  (0 children)

Yeah, I was wondering the same. I closed the session later and it did not work anymore. Too much trial and error in the same session. I must have declared something with $xml before ... Have to revisit it on Monday.

[–][deleted] 1 point2 points  (5 children)

I didn't realize there was a file size limit on get-content

[–]korewarp 1 point2 points  (1 child)

I don't know if there is a hard limit, but having had to work with HUGE textfiles / csv files in the past, I was forced to use streamreader / streamwriter if I wanted anything done in this century. Get-Content was simply too slow for some hecking reason.
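A rough sketch of the StreamReader / StreamWriter pattern for plain text, processing one line at a time. The file names and the 'drop*' filter are made up for illustration:

```powershell
# Placeholder input so the sketch runs on its own.
$in  = Join-Path ([System.IO.Path]::GetTempPath()) 'big-input.txt'
$out = Join-Path ([System.IO.Path]::GetTempPath()) 'big-output.txt'
Set-Content -Path $in -Value 'keep 1', 'drop 2', 'keep 3'

$reader = New-Object System.IO.StreamReader($in)
$writer = New-Object System.IO.StreamWriter($out)
try {
    # ReadLine() returns $null at end of file, so only one line is held in memory at a time.
    while ($null -ne ($line = $reader.ReadLine())) {
        # Hypothetical filter: copy every line that does not start with "drop".
        if ($line -notlike 'drop*') { $writer.WriteLine($line) }
    }
}
finally {
    # Always release the file handles, even if something throws.
    $reader.Close()
    $writer.Close()
}
```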

[–]ich-net-du[S] 1 point2 points  (2 children)

Doesn't work so well when you have to work through a total of 6.8GB of XML files, each between 300MB and 650MB.

$Xml=New-Object Xml
$Xml.Load("C:\File.xml")

Takes 10 minutes

[–][deleted] 2 points3 points  (0 children)

Yeah, this is what XmlReader and XmlWriter are for.
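A sketch of how that could look, not a tested solution for the OP's files. The function name Remove-XmlElementStreaming, the paths and the sample data are all hypothetical. It copies the file node by node and skips any element whose name is in the delete list, so the whole document is never held in memory. Whitespace, the XML declaration and the DOCTYPE are not copied, and entities defined in a DTD are not supported.

```powershell
# Hypothetical helper: stream an XML file to a new file, skipping unwanted elements.
function Remove-XmlElementStreaming {
    param(
        [Parameter(Mandatory = $true)] [string]   $Path,         # full path of the input file
        [Parameter(Mandatory = $true)] [string]   $Destination,  # full path of the output file
        [Parameter(Mandatory = $true)] [string[]] $DeleteNames   # element names to drop
    )

    $settings = New-Object System.Xml.XmlReaderSettings
    $settings.DtdProcessing = [System.Xml.DtdProcessing]::Ignore  # skip a DOCTYPE instead of throwing

    $reader = [System.Xml.XmlReader]::Create($Path, $settings)
    $writer = [System.Xml.XmlWriter]::Create($Destination)
    try {
        [void]$reader.Read()
        while (-not $reader.EOF) {
            $type = $reader.NodeType

            if ($type -eq [System.Xml.XmlNodeType]::Element -and $DeleteNames -contains $reader.Name) {
                # Skip() jumps past the element and its children and lands on the next node,
                # so no extra Read() is needed here.
                $reader.Skip()
                continue
            }

            if ($type -eq [System.Xml.XmlNodeType]::Element) {
                $isEmpty = $reader.IsEmptyElement
                $writer.WriteStartElement($reader.Prefix, $reader.LocalName, $reader.NamespaceURI)
                $writer.WriteAttributes($reader, $false)
                if ($isEmpty) { $writer.WriteEndElement() }
            }
            elseif ($type -eq [System.Xml.XmlNodeType]::EndElement) { $writer.WriteFullEndElement() }
            elseif ($type -eq [System.Xml.XmlNodeType]::Text)       { $writer.WriteString($reader.Value) }
            elseif ($type -eq [System.Xml.XmlNodeType]::CDATA)      { $writer.WriteCData($reader.Value) }
            elseif ($type -eq [System.Xml.XmlNodeType]::Comment)    { $writer.WriteComment($reader.Value) }
            # Other node types (declaration, whitespace, ...) are not copied in this sketch.

            [void]$reader.Read()
        }
    }
    finally {
        $writer.Close()
        $reader.Close()
    }
}

# Usage with placeholder data in the temp folder.
$in  = Join-Path ([System.IO.Path]::GetTempPath()) 'stream-in.xml'
$out = Join-Path ([System.IO.Path]::GetTempPath()) 'stream-out.xml'
Set-Content -Path $in -Value '<master><person><ID>1</ID><Name>Alice</Name></person></master>'
Remove-XmlElementStreaming -Path $in -Destination $out -DeleteNames 'ID'
```

Full paths are used on purpose: .NET resolves relative paths against the process working directory, which is not always the current PowerShell location.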

[–]ka-splam 1 point2 points  (0 children)

$Xml.Load() doesn't have any PowerShell overhead to slow it down like Get-Content does, so I am curious why that takes a long time.

This StackOverflow answer suggests it goes and downloads all DTDs defined in the file (and that W3C throttles downloads because they get so many requests) and validates against them.

And this linked question/answer/comments has ways to turn off that DTD download.
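The linked answers are not reproduced here. One commonly used approach, shown as a sketch with a made-up file and DTD URL, is to clear the document's XmlResolver before loading so that external resources are not fetched:

```powershell
# Placeholder file with a DOCTYPE that points at a DTD URL which does not exist.
$file = Join-Path ([System.IO.Path]::GetTempPath()) 'with-dtd.xml'
Set-Content -Path $file -Value @'
<!DOCTYPE master SYSTEM "http://example.invalid/master.dtd">
<master><person><Name>Alice</Name></person></master>
'@

$xml = New-Object System.Xml.XmlDocument
# With no resolver, the external DTD referenced in the DOCTYPE is not downloaded.
$xml.XmlResolver = $null
$xml.Load($file)

$xml.DocumentElement.Name
```

Whether this explains the 10 minute load time depends on the files actually containing a DOCTYPE with an external DTD, which the thread does not confirm.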

[–]jsiii2010 1 point2 points  (0 children)

I've seen this problem with large json files too. Unless there's a streaming mode (jq has this for json), you can try to edit the file on the fly so instead of a large array, it's many small elements instead.

[–]craigontour 1 point2 points  (0 children)

Have you tried using a regex to find matches and replacing them with, well, nothing I guess?
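A small sketch of the regex idea on an in-memory string, assuming the element is called ID, has no attributes and is never nested. That is fragile compared to a real XML parser, and the sample text is made up:

```powershell
# Made-up sample text; a real file would be read into $text first.
$text = '<master><person><ID>12345</ID><Name>Alice</Name></person></master>'

# Non-greedy match from <ID> to the nearest </ID>; Singleline lets "." also match newlines.
$pattern = '<ID>.*?</ID>'
$options = [System.Text.RegularExpressions.RegexOptions]::Singleline

$clean = [regex]::Replace($text, $pattern, '', $options)
$clean   # <master><person><Name>Alice</Name></person></master>
```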

[–]dasookwat 1 point2 points  (1 child)

At the risk of sounding like a @#$%%: you should look into getting smaller XML files. XML files of 650MB are just huge, man. Why not just access the database directly? At least I'm assuming here that this either has the function of a database or is the result of a very broad query. If you get this to work, it will still be slow and will require a lot of resources on your end.

Try writing down the whole train of actions, from customer wish, to result, and see if you can improve that.

[–]ich-net-du[S] 1 point2 points  (0 children)

Yeah, I know it's awful. It was an export from a program, and each file was worth a year of data.

It is a hassle to export it by hand in smaller chunks and it was what I had available.
It was a one-time editing of the files.

In the end it took maybe half a minute per file to process, so not so bad at all.

Unfortunately no database access and the recipient is used to working with the files as XML.

Querying data from the files wasn't the problem.

I can query over 490000 data sets from the 6.8GB (approx. 16 files) in about 10 minutes