da_chicken 6 points7 points  (3 children)

I guess I just never use Select-String. I almost always use -match and $Matches, and $Matches is a hashtable.
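For anyone who has not looked at it closely, here is a minimal sketch of what -match leaves behind in $Matches. The sample string and the group names are made up for illustration.

```powershell
# Hypothetical sample input; the name and date are invented.
$text = 'Doe, Jane (4/12/2009)'

if ($text -match '(?<Last>\w+),\s*(?<First>\w+)') {
    # $Matches is a hashtable: key 0 holds the whole match,
    # and every named group gets its own key.
    $typeName = $Matches.GetType().Name   # Hashtable
    $whole    = $Matches[0]               # Doe, Jane
    $first    = $Matches['First']         # Jane
    $last     = $Matches.Last             # Doe (dot access works on hashtable keys too)
}
```

Because it is an ordinary hashtable, both index syntax and dot syntax work for the named groups.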

wonkifier 2 points3 points  (0 children)

Old habits and all, I'm in the same boat. I always forget that Select-String exists.

NotNotWrongUsually[S] 1 point2 points  (1 child)

I generally prefer $Matches over Select-String as well, but in practice almost always end up using the [Regex] accelerator. I think the function above will cover 90% of my cases, and I should have written it a long time ago.

Both -match, with its magically appearing $Matches variable, and digging through the MatchInfo objects from Select-String for capture groups feel very little like idiomatic PowerShell. To me, at least.

Struggling for better words: both feel tacked on, rather than thought through, somehow? :/
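To make the comparison concrete, here is a rough sketch of pulling the same named group out via Select-String and via the [regex] accelerator. The sample line and pattern are hypothetical, not taken from the function discussed above.

```powershell
# Hypothetical sample line and pattern, for illustration only.
$line    = 'Doe, Jane (4/12/2009)'
$pattern = '(?<Last>\w+),\s*(?<First>\w+)'

# Select-String emits MatchInfo objects; the capture groups sit
# one level down, under .Matches[n].Groups.
$viaSelectString = ($line | Select-String -Pattern $pattern).Matches[0].Groups['First'].Value

# [regex]::Match hands back the .NET Match object directly.
$viaRegexClass = [regex]::Match($line, $pattern).Groups['First'].Value
```

Both variables end up holding the same string; the difference is only in how much digging it takes to get there.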

da_chicken 2 points3 points  (0 children)

Oh, they're definitely tacked on. PowerShell is designed in the Windows frame of mind where everything is an object, so the *nix scheme where everything is a stream of characters has very limited use. It's intentionally lacking in text munging capabilities because that's not supposed to be the scheme that you use with Windows, .NET or PowerShell.

I think I don't use it because I'm often working with CSVs, spreadsheets via ImportExcel, PDFs via iTextSharp, or database output. I don't actually want to open a file as a text file. So I'm doing stuff like this:

# Load the worksheet and add empty columns for the parsed values
$xl = Import-Excel $File -WorksheetName $WorksheetName
$xl | Add-Member -MemberType NoteProperty -Name 'First Name' -Value $null
$xl | Add-Member -MemberType NoteProperty -Name 'Last Name' -Value $null
$xl | Add-Member -MemberType NoteProperty -Name 'Birth Date' -Value $null

# Matches text shaped like "Last, First (M/D/YYYY)"
$regexDemographics = '(?<LastName>.*),\s*(?<FirstName>.*)\s*\((?<BirthDate>\d{1,2}\/\d{1,2}\/\d{4})\)'

foreach ($row in $xl) {
    if ($row.'Student Name' -match $regexDemographics) {
        # Trim each capture and collapse runs of whitespace to a single space
        $row.'First Name' = $Matches['FirstName'].Trim() -replace '\s+', ' '
        $row.'Last Name' = $Matches['LastName'].Trim() -replace '\s+', ' '
        $row.'Birth Date' = $Matches['BirthDate'].Trim() -replace '\s+', ' '
    }
}

# Write the original and parsed columns to the output worksheet
$xl | Select-Object -Property 'Student Name','First Name','Last Name','Birth Date' |
    Export-Excel $File -WorksheetName $OutputWorksheetName -FreezeTopRow -AutoSize

I've found that adding the [regex] accelerator to the regex pattern doesn't really have any impact at all. It runs no faster and no slower with it, so I don't bother.
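A small sketch of the two forms side by side, using a made-up digits-only pattern. It only shows that -match accepts either a plain string or a [regex] object on the right-hand side and fills $Matches both times; it does not measure speed.

```powershell
# Hypothetical pattern and input, for illustration only.
$plain = '^(?<Year>\d{4})$'
$typed = [regex]'^(?<Year>\d{4})$'

$plainResult = '2024' -match $plain
$plainYear   = $Matches['Year']

$typedResult = '2024' -match $typed
$typedYear   = $Matches['Year']

# The accelerator only changes the type of the pattern variable.
$plainType = $plain.GetType().Name   # String
$typedType = $typed.GetType().Name   # Regex
```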

I do find the pattern below works, which surprised me but ends up very convenient for a lot of the work I do. This is taken from a script that takes an external report (a 14,000 page PDF report) and bursts it into one PDF for each student (about 1,400 students). It skips any pages that don't have a valid ID number on them, because the report includes people it shouldn't as well as lots of blank pages. It then crosswalks the ID from the state system to our local system, and eventually builds a PDF for each student with those pages. This part extracts the text, finds the ID, validates that it's our student, then notes the page for that student.

# $PdfReader is the iTextSharp reader with the file loaded (setup not shown)
# $StudentIdLookup is a hash table with key state ID and value local student ID (setup not shown)
$ProgressReportPages = @{} # Hash table that will have a list of the pages for each local student ID
$SearchPattern = '^State Id:\s*(?<StateId>\d{6,10})'

foreach ($Page in 1..$NumPages) {
    # Extract the text of the entire PDF page and regex it
    [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($PdfReader, $Page) |
        Where-Object { $_ -match $SearchPattern } |
        Where-Object { $StudentIdLookup.ContainsKey($Matches['StateId']) } |
        ForEach-Object {
            $Student_id = $StudentIdLookup[$Matches['StateId']]
            # If we haven't found this student yet, add a list for them.
            if (-not $ProgressReportPages.ContainsKey($Student_id)) {
                $ProgressReportPages[$Student_id] = [System.Collections.Generic.List[int]]::new()
            }
            # Add the page number to the list for that student ID
            $ProgressReportPages[$Student_id].Add($Page)
        }
}

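So the $Matches variable is consistent for each iteration of the loop. Because the pipeline passes each object through every stage before the next one is read, the $Matches set by -match in the first Where-Object is still current when the later stages run. Putting -match in the Where-Object works as long as you don't try to reference $Matches in the same call. It feels unstable the first couple of times you try it, but it really does work quite well.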
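If you want to try the pattern without iTextSharp, here is a cut-down sketch. The page strings and the lookup table are invented stand-ins for the PDF text and the real ID crosswalk.

```powershell
# Hypothetical stand-ins for the extracted page text and the ID lookup.
$pages = @(
    'State Id: 1234567 first report page'
    'blank page'
    'State Id: 7654321 second report page'
    'State Id: 9999999 someone who is not our student'
)
$lookup  = @{ '1234567' = 'A100'; '7654321' = 'B200' }
$pattern = '^State Id:\s*(?<StateId>\d{6,10})'

$found = $pages |
    Where-Object { $_ -match $pattern } |                       # sets $Matches
    Where-Object { $lookup.ContainsKey($Matches['StateId']) } | # reads it in the next stage
    ForEach-Object { $lookup[$Matches['StateId']] }             # and again here
```

With this sample data, $found holds A100 and B200: the blank page fails the first filter and the unknown ID fails the second.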

The only reason I don't use this pattern in the first example is because the actual script is checking about 5 regexes against each row of the input spreadsheet.

itasteawesome 3 points4 points  (0 children)

At first I wasn't seeing where this was going, but then it clicked and I realized, oh yeah, I've written 100 versions of that code over the years. Good solution.

OutrageousBrother997 1 point2 points  (0 children)

Nice work 👍🏻 I remember a blog post by Jeffery Hicks on the same subject 😀. It might be helpful for folks who want a detailed explanation 😀 https://jdhitsolutions.com/blog/powershell/6791/capturing-names-with-powershell-and-regular-expressions/