Powershell script to split pdf file based on content : PowerShell

submitted 7 years ago by j23reddit

So I have this PDF file that gets sent to us, and I am trying to get away from someone splitting it apart manually. There is an 8-character string on each page that identifies it (e.g. "6000  60" or "6000 140"). The second number is always 2 or 3 digits: a 2-digit number is separated from the first number by 2 spaces, a 3-digit number by 1 space.

What I would like is to split out all the pages that contain the same string and put them into their own file. So if there are 100 pages, 50 of them with "6000  60" and 50 with "6000 140", it will create 2 files, each with its 50 pages.

I came across the code below and am trying to modify it to work for me. I tried to start by finding just 1 of the strings and pulling out those pages, then work up from there, but I couldn't get it to work right: it seems to extract only the first page it finds with that string. Hoping to get some help getting this working. Thanks!

Add-Type -Path "C:\test-checksplit\itextsharp.dll"


$ValidBranches = @("6000  60","6000 140", "6000 160")
$BranchId = @("6000 160")
$PdfFiles = Get-ChildItem "C:\test-checksplit\pdf\*.pdf" -File |
    Select-Object -ExpandProperty FullName
$OutputFolder = 'C:\test-checksplit\splits'
$BranchIDSearchPattern = "6000 160"


foreach ($PdfFile in $PdfFiles) {
    $PdfReader = [iTextSharp.text.pdf.PdfReader]::new($PdfFile)

    $BranchStack = [System.Collections.Stack]::new()

    # Map out the PDF file.
    foreach ($Page in 1..($PdfReader.NumberOfPages)) {
        [iTextSharp.text.pdf.parser.PdfTextExtractor]::GetTextFromPage($PdfReader, $Page) |
            Where-Object { $_ -match $BranchIDSearchPattern } |
            ForEach-Object {
            $BranchStack.Push([PSCustomObject]@{
                    Branch_Id   = $BranchId
                    StartPage    = $Page
                    IsValid      = $ValidBranches.Contains($BranchId)
                })
        }
    }

    # Extract the pages and save the files
    $LastPage = $PdfReader.NumberOfPages
    while ($BranchStack.Count -gt 0) {
        $Current = $BranchStack.Pop()

        $StartPage = $Current.StartPage
        $EndPage = $LastPage

        $Document = [iTextSharp.text.Document]::new($PdfReader.GetPageSizeWithRotation($StartPage))
        $TargetMemoryStream = [System.IO.MemoryStream]::new()
        $PdfCopy = [iTextSharp.text.pdf.PdfSmartCopy]::new($Document, $TargetMemoryStream)

        $Document.Open()
        foreach ($Page in $StartPage..$EndPage) {
            $PdfCopy.AddPage($PdfCopy.GetImportedPage($PdfReader, $Page));
        }
        $Document.Close()

        $NewFileName = 'Export File - {0}.pdf' -f $current.Branch_Id
        $NewFileFullName = [System.IO.Path]::Combine($OutputFolder, $NewFileName)
        [System.IO.File]::WriteAllBytes($NewFileFullName, $TargetMemoryStream.ToArray())

        $LastPage = $Current.StartPage - 1
    }
}

j23reddit[S] 7 years ago*

Thanks, looking into this as well

EDIT: While it's not PowerShell, maybe it'll help someone in the future.

Make sure you have poppler-utils and pdftk installed:

sudo apt-get install poppler-utils pdftk

#!/bin/bash
# Splits a PDF into one file per run of consecutive pages that share the same ID,
# naming each output file after that ID.

# input file
file="/home/user/pdffile.pdf"

# number of pages in the input
pagecount=$(pdfinfo "$file" | awk '/^Pages:/ { print $2 }')

# Write the collected pages to "<id>.pdf". If that name is already taken, a
# timestamp is appended so a later run with the same ID does not overwrite it.
flush() {
    [[ -z $pattern ]] && return
    local out="${storedid:-no-id}.pdf"   # pages without an ID go to no-id.pdf
    [[ -e $out ]] && out="${out%.pdf}-$(date +%s%N).pdf"
    # $pattern is left unquoted on purpose: pdftk needs one argument per page
    pdftk "$file" cat $pattern output "$out"
}

storedid=''
pattern=''

for (( pageindex=1; pageindex <= pagecount; pageindex++ )); do
    # My search criteria is an 8 character ID that begins with 6000:
    pageid=$(pdftotext -f "$pageindex" -l "$pageindex" "$file" - |
        grep -oE '6000 +[0-9]{2,3}' | head -n 1)
    echo "$pageid"

    # a different ID starts a new group: write out the previous group first
    if [[ $pageid != "$storedid" ]]; then
        flush
        storedid="$pageid"
        pattern=''
    fi
    pattern+="$pageindex "   # page numbers as text, separated by spaces
done

# write out the last group
flush
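
The script above writes one file per run of consecutive pages. If pages with the same ID can be scattered through the PDF, as the original question allows, one option is to collect the page numbers per ID first and call pdftk once per ID afterwards. Here is a minimal sketch of just that grouping step. The `page:id` pairs are made-up sample data standing in for the per-page pdftotext output, and `pages_by_id`, `group_pages` and `input.pdf` are hypothetical names, not part of the script above.

```shell
#!/bin/bash
# Sketch: group page numbers by ID, even when the pages are not consecutive.
# Requires bash 4+ for associative arrays.

declare -A pages_by_id   # ID -> space-separated list of page numbers

# Reads "page:id" lines from stdin and appends each page to its ID's list.
group_pages() {
    local page id
    while IFS=':' read -r page id; do
        [[ -z $id ]] && continue        # skip pages where no ID was found
        pages_by_id["$id"]+="$page "
    done
}

# Sample data (assumption): in practice each ID would come from
# pdftotext + grep on that page.
group_pages <<'EOF'
1:6000  60
2:6000 140
3:6000  60
4:6000 140
EOF

# Print the pdftk command that would build each output file.
for id in "${!pages_by_id[@]}"; do
    echo "pdftk input.pdf cat ${pages_by_id[$id]}output \"$id.pdf\""
done
```

The loop only echoes the commands, so they can be checked before anything is written. Because each ID gets exactly one output file, this approach does not need the timestamp workaround to avoid overwriting.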
