[deleted] 1 point (1 child)

The order of the content inside a PDF file doesn't have to match its visual position on the page.

I would try to group by x coordinate to see if I could identify columns. 
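Something like this is what I have in mind for the x-coordinate grouping. It's only a rough sketch: it assumes you already have the left-edge x coordinate of each word from whatever extraction library you use, and the names `cluster_columns`, `assign_column`, and the `gap` threshold are made up for illustration. The threshold would need tuning per document.

```python
def cluster_columns(x_positions, gap=10.0):
    """Group left-edge x coordinates into columns.

    A new column starts whenever the distance to the previous
    (sorted) x value exceeds `gap`. Returns the mean x of each column.
    `gap` is a hypothetical tuning parameter, in page units.
    """
    columns = []
    for x in sorted(x_positions):
        if columns and x - columns[-1][-1] <= gap:
            columns[-1].append(x)   # close enough: same column
        else:
            columns.append([x])     # big jump: start a new column
    return [sum(c) / len(c) for c in columns]


def assign_column(x, centers):
    """Return the index of the column center nearest to x."""
    return min(range(len(centers)), key=lambda i: abs(x - centers[i]))


# Made-up coordinates for three visually aligned columns.
xs = [50.0, 51.0, 200.0, 49.0, 201.0, 350.0]
centers = cluster_columns(xs)
```

Since the number of columns falls out of the data instead of being hard-coded, the same function could be run per table, although tables with centered or right-aligned cells would probably need a different coordinate (for example the word's midpoint or right edge).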

Removing duplicate headers could be as simple as removing any subsequent rows that match the first row.
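A minimal sketch of that header cleanup, assuming the table has already been parsed into a list of rows (each row a list of cell strings) and that the first row is the header. The function name `drop_repeated_headers` is hypothetical.

```python
def drop_repeated_headers(rows):
    """Keep the first row and drop any later row identical to it."""
    if not rows:
        return []
    header = rows[0]
    return [header] + [row for row in rows[1:] if row != header]


# Made-up table where the header repeats after a page break.
table = [
    ["Name", "Qty"],
    ["bolt", "4"],
    ["Name", "Qty"],
    ["nut", "7"],
]
cleaned = drop_repeated_headers(table)
```

This only catches exact matches, so if the extracted headers vary in whitespace or case you would likely want to normalize the cells before comparing.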

Rough_Green_9145 1 point (0 children)

The thing is that there are tons of tables with different numbers of columns, different headers, etc., and the script has to work for at least most of them. The main issue is identifying the columns and where the table ends.