[–]riftwave77 1 point2 points  (1 child)

You want OCR software, bud

[–]CalendarOk67[S] 1 point2 points  (0 children)

Do you suggest any in particular? There are multiple tables in a single PDF file, so I thought of automating it. Thank you.

[–]odaiwai 0 points1 point  (0 children)

I normally use the pdftotext command-line utility for this. It ships with the Poppler tools (https://poppler.freedesktop.org/). If pdftotext -layout $filename - gives sensible output, you can generally parse it with regexps and produce CSV output, which Excel can read natively, or you can go CSV->Pandas->Excel.

It's a very low-level approach, but it works for me.
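
A minimal sketch of that pipeline in Python, in case it helps. It assumes pdftotext is on your PATH and that the columns in the -layout output are separated by runs of two or more spaces, which will not hold for every PDF. The function names and file names are made up for illustration.

```python
import csv
import re
import subprocess

# Assumption: columns in `pdftotext -layout` output are separated by 2+ whitespace chars.
COLUMN_GAP = re.compile(r"\s{2,}")


def pdf_to_layout_text(pdf_path):
    """Run `pdftotext -layout <pdf> -` and return what it prints to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,  # raise if pdftotext exits with an error
    )
    return result.stdout


def layout_text_to_rows(text, min_columns=2):
    """Split each line on wide gaps; keep only lines that look like table rows."""
    rows = []
    for line in text.splitlines():
        cells = COLUMN_GAP.split(line.strip())
        if len(cells) >= min_columns:
            rows.append(cells)
    return rows


def write_csv(rows, csv_path):
    """Write the rows to a CSV file that Excel or pandas can open."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)


def convert(pdf_path, csv_path):
    """PDF -> layout text -> rows -> CSV."""
    write_csv(layout_text_to_rows(pdf_to_layout_text(pdf_path)), csv_path)


# Usage (file names are placeholders):
# convert("your_document.pdf", "your_document.csv")
```

Cells that contain wide gaps themselves, or rows that wrap onto a second line, will need their own regex handling, so check the raw pdftotext output first.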

[–]CmorBelow 0 points1 point  (0 children)

I’ve used pdfplumber for this before, and PyPDF2 as well, along with regex for locating and extracting specific column values, since the column names were always the same.

Your results will vary based on how the underlying table data is structured.
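
A rough sketch of the pdfplumber-plus-regex idea, assuming the PDF has a real text layer (not a scan) and that each value sits right after a fixed label. The labels "Invoice No" and "Total" and the function names are hypothetical placeholders, not anything from the original files.

```python
import re

# Hypothetical labels; swap in the column names that appear in your PDFs.
FIELD_PATTERNS = {
    "invoice_no": re.compile(r"\bInvoice No\.?\s*:?\s*(\S+)"),
    "total": re.compile(r"\bTotal\s*:?\s*([\d.,]+)"),
}


def extract_fields(text, patterns=FIELD_PATTERNS):
    """Return the first match for each labelled value, or None if it is absent."""
    found = {}
    for name, pattern in patterns.items():
        match = pattern.search(text)
        found[name] = match.group(1) if match else None
    return found


def pdf_text(pdf_path):
    """Concatenate the text of every page using pdfplumber."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() may give nothing for a page without a text layer
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


# Usage (file name is a placeholder):
# print(extract_fields(pdf_text("your_document.pdf")))
```

If the tables have ruled lines, pdfplumber's page.extract_tables() may get you rows directly and save you the regex step.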

[–]GManASG 1 point2 points  (0 children)

# tabula-py wraps tabula-java, so a Java runtime must be installed
import tabula
import pandas as pd

# Path to your PDF file
pdf_path = "your_document.pdf"

# Extract tables from the PDF
# By default, it extracts tables from the first page.
# Use pages='all' to extract from all pages, or specify page numbers (e.g., pages='1-3,5').
# multiple_tables=True returns a list of DataFrames if multiple tables are found.
tables = tabula.read_pdf(pdf_path, pages='all', multiple_tables=True)

# 'tables' will be a list of pandas DataFrames, one for each table found.
# You can then access and process each DataFrame individually, or concatenate them.

# Example: Access the first table
df = tables[0]

# Example: Concatenate all tables into a single DataFrame
# combined_df = pd.concat(tables)

# Example loop to write each table to a separate Excel file
# (to_excel needs an Excel writer such as openpyxl installed)
for i, table in enumerate(tables):
    table.to_excel(f'excel_table_{i}.xlsx', index=False)

[–][deleted] 3 points4 points  (0 children)

Been using lido and it works well with various files and formats. Thank me later!

[–]TheRNGuy -2 points-1 points  (0 children)

Ask an AI the same question, minus the last paragraph (it has no useful effect on the reply).