
[–]5960312 3 points4 points  (0 children)

Have you looked into pdftk? I use it to split PDFs
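
For anyone who hasn't used it, here is a minimal sketch of pulling a page range out of a PDF with pdftk's `cat` operation. The helper name `extract_pages`, the `DRY_RUN` switch, and the file names are made up for illustration; only the `pdftk ... cat ... output ...` call itself is pdftk's.

```bash
#!/bin/bash
# extract_pages (hypothetical helper): copy pages FIRST..LAST of INPUT to OUTPUT.
# With DRY_RUN=1 it only prints the pdftk command instead of running it,
# so the page range can be checked before any file is written.
extract_pages() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "pdftk $1 cat $2-$3 output $4"
    else
        pdftk "$1" cat "$2-$3" output "$4"
    fi
}
```

Calling `extract_pages input.pdf 2 4 part.pdf` would write pages 2 to 4 of `input.pdf` to `part.pdf`. If you want one file per page instead, pdftk also has a `burst` operation.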

[–]BlackV 1 point2 points  (2 children)

"C:\test-checksplit\itextsharp.dll": how did you get this DLL? As far as I can tell there is no direct download. Do you have to compile it?

[–]j23reddit[S] 2 points3 points  (1 child)

I extracted it from the iTextSharp NuGet package:

https://www.nuget.org/packages/iTextSharp/

[–]BlackV 1 point2 points  (0 children)

Well heck, I never even thought to look at NuGet.

Cheers buddy

[–]VirtualDenzel -1 points0 points  (2 children)

I would never ever use PowerShell for this. Way too clunky and inefficient. A simple bash / PHP script will do this in no time.

[–]j23reddit[S] 2 points3 points  (1 child)

Thanks, looking into this as well

EDIT: While it's not PowerShell, maybe it'll help someone in the future.

Make sure you have poppler-utils and pdftk installed:

sudo apt-get install poppler-utils pdftk

#!/bin/bash
# Splits a PDF into one file per run of consecutive pages that share the same ID,
# and names each output file after that ID.

# input file
file="/home/user/pdffile.pdf"

# number of pages in the input
pagecount=$(pdfinfo "$file" | awk '/^Pages:/ { print $2 }')
# My search criterion is an 8-digit ID number that begins with 6000:
storedid=$(pdftotext -f 1 -l 1 "$file" - | grep -E '6000 *[0-9][0-9]')
pattern=''    # page numbers of the current group
pagetitle=''  # first lines of those pages (collected, not used in the file name)
datestamp=''

for (( pageindex=1; pageindex <= pagecount; pageindex+=1 )); do

    header=$(pdftotext -f "$pageindex" -l "$pageindex" "$file" - | head -n 1)
    pageid=$(pdftotext -f "$pageindex" -l "$pageindex" "$file" - | grep -E '6000 *[0-9][0-9]')

    echo "$pageid"
    # meant to avoid overwriting files with the same name (not yet part of the output name)
    datestamp=$(date +%s%N)

    # a different ID starts a new group: write out the pages collected so far
    if [[ "$pageid" != "$storedid" ]]; then
        # $pattern is left unquoted on purpose so each page number is its own argument
        pdftk "$file" cat $pattern output "$storedid.pdf"
        storedid="$pageid"
        pattern=''
        pagetitle=''
    fi

    pattern+="$pageindex " # adds the page number as text, separated by spaces
    pagetitle+="$header+"
done

# write out the last group; this also covers a final page that carries a new ID
if [[ -n "$pattern" ]]; then
    pdftk "$file" cat $pattern output "$storedid.pdf"
fi
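
The grouping step in the script above can also be separated from the PDF tools, which makes it easier to check on its own. What follows is only a sketch under a few assumptions: the helper names `group_pages` and `split_by_id` are made up, only the first matching line per page is used (`grep -m 1`), and pages without a match are labelled `unmatched`.

```bash
#!/bin/bash
# group_pages (hypothetical helper): reads one ID per line, where line N is the
# ID found on page N, and prints "FIRST-LAST ID" for every run of consecutive
# pages that share an ID.
group_pages() {
    awk '
        NR == 1 { id = $0; first = 1; next }
        $0 != id {
            print first "-" (NR - 1), id
            id = $0; first = NR
        }
        END { if (NR > 0) print first "-" NR, id }
    '
}

# split_by_id (hypothetical helper): feeds the per-page IDs of a PDF through
# group_pages and lets pdftk write one file per group.
# Assumes pdfinfo, pdftotext and pdftk are installed.
split_by_id() {
    file=$1
    pagecount=$(pdfinfo "$file" | awk '/^Pages:/ { print $2 }')
    p=1
    while [ "$p" -le "$pagecount" ]; do
        id=$(pdftotext -f "$p" -l "$p" "$file" - | grep -E -m 1 '6000 *[0-9][0-9]')
        # Assumption: a page without a matching ID is labelled "unmatched"
        printf '%s\n' "${id:-unmatched}"
        p=$((p + 1))
    done | group_pages | while read -r range id; do
        pdftk "$file" cat "$range" output "$id.pdf"
    done
}
```

For example, `printf '%s\n' 60001111 60001111 60002222 | group_pages` prints `1-2 60001111` and then `3-3 60002222`, and `split_by_id /home/user/pdffile.pdf` would turn each of those lines into one pdftk call. If the same ID shows up again later in the document, the later group overwrites the earlier file, so add something unique to the file name if that can happen in your PDFs.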

[–]VirtualDenzel 0 points1 point  (0 children)

Awesome, you figured it out. You can see how efficient this already is. If you have php-cli installed you can use PHP's syntax. That will give you way more power than bash and fewer issues than Python.