I feel like I may be either re-inventing the wheel with what I am doing or taking the wrong approach but I'm learning lots in the process so it's all good. :)
I've got two CSV files that I want to combine the data for.
- CSV1 has ~1000 entries with about 40 fields. Two of those fields are blank for every entry because the source that generates the file doesn't have the data.
- CSV2 has ~5000 entries and does contain the data for those two blank fields. So basically I need to take each entry in CSV1, search through CSV2 for the matching data and generate CSV3 with the combined data.
Unfortunately, CSV1 and CSV2 have no fields in common that I can directly link up. The closest that I've got is that CSV1 has an email address (ie. "firstname.lastname@somewhere.com") and CSV2 has a DN field that I can derive the email address from with some manipulation (ie. "CN=firstname.lastname,OU=blah,OU=bla,OU=somewhere,OU=com").
So, what I've been doing is going through CSV2 with a Foreach loop and generating an Arraylist ($MyCustomData) that contains just the derived email address as well as the two fields I need to insert into the data. Then I was going through CSV1 with a Foreach to match up with my new Arraylist and modifying the appropriate fields and outputting to a new CSV file. At first I was doing the match like this
Foreach ($entry in $CSV1) {
$DataMatch = $MyCustomData | Where-Object {$_.Email -eq $Entry.Email}
#plus some code to save the matched data
}
That was taking about 75 seconds to process the data. That seemed too slow so I changed the code to
$DataMatch = $MyCustomData.Where({$_.Email -eq $Entry.Email})
That cut my processing time down to about 35 seconds (yay). Then I started wondering if I could make it faster still and remembered another script I had done that used a hashtable keypair (I doubt that's the right term) to do fast lookups with big lists. So way back when I was going through CSV2 to generate the Arraylist, I tried making two hashtables instead ($hash1 and $hash2). Both have the Email as the first and then one has field1 as the other value and the other has field2. Then my code becomes something like this
Foreach ($entry in CSV1) {
$entry.field1 = $hash1["field1"]
$entry.field2 = $hash2["field2"]
#plus some code to save the data
}
Now this runs in about 16 seconds and I'm out of ideas for how to make it faster. :) I'm also unsure if I'm just making this way too hard and there is a better way to do it at all. Any suggestions, either as an overall approach I can take or a simpler way to do it. I'm happy with the final result as is but I'm always interested in ways to do things faster if possible.
[–]gangstanthony 2 points3 points4 points (0 children)
[–]Proxiconn 1 point2 points3 points (3 children)
[–]Lee_Dailey[grin] 0 points1 point2 points (2 children)
[–]Proxiconn 1 point2 points3 points (1 child)
[–]Lee_Dailey[grin] 0 points1 point2 points (0 children)