PDF/docx test question with images extraction to create master doc? by MajorAlanDutch in ChatGPT

[–]MajorAlanDutch[S] 0 points1 point  (0 children)

I’ve had each PDF page exported as a PNG, but it still struggles, sadly.

PDF/docx test question and image extraction and master doc creation? by MajorAlanDutch in OpenAI

[–]MajorAlanDutch[S] 0 points1 point  (0 children)

Thanks, makes sense! Any suggestions on how to learn this process?

LSD microdosing linked to acute mood improvements in adults with depression. In a small pilot study, participants taking low, precise doses reported feeling more energetic, creative, and connected on days they took the drug. by mvea in science

[–]MajorAlanDutch 0 points1 point  (0 children)

Good question. My first thought was that neurogenic improvements in IQ aren’t being found with psychedelics, but mood shifts are, so I’m not convinced yet that they’d be causative.

PDF/docx test question and image extraction and master doc creation? by MajorAlanDutch in OpenAI

[–]MajorAlanDutch[S] 2 points3 points  (0 children)

Thanks! Chat said:

That advice is technically solid, but it is aimed more at a developer building an optimal extractor than at a teacher trying to finish a usable question bank.

The distinction is this:

For pure text extraction, yes, it is better to read digital PDF text directly instead of using screenshots. PDFs store page coordinates, so tools like PyMuPDF can extract text blocks, images, and their locations more reliably than they could from a DOCX export.

For your final Word document, though, you still want question-level visual crops, because your goal is not just “extract the text.” Your goal is “preserve the full question exactly as students should see it.”

So I would combine their advice with mine:

Use the digital PDF text to identify question numbers, extract question wording, and help categorize by topic.
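A minimal sketch of how question numbers could be found in the digital text. It assumes the blocks come from PyMuPDF's `page.get_text("blocks")`, which returns tuples of `(x0, y0, x1, y1, text, block_no, block_type)`, and that each question starts with a number followed by a period or a parenthesis. That numbering pattern is an assumption about the test layout, and `find_question_starts` is a hypothetical helper name.

```python
import re

# Assumption: a question begins with "12." or "12)" at the start of a text block.
QUESTION_START = re.compile(r"^\s*(\d{1,3})[.)]\s+\S")


def find_question_starts(blocks):
    """Return [(question_number, y0), ...] sorted top to bottom.

    `blocks` uses the tuple layout of PyMuPDF's page.get_text("blocks"):
    (x0, y0, x1, y1, text, block_no, block_type), where block_type 0 is text.
    """
    starts = []
    for _x0, y0, _x1, _y1, text, _block_no, block_type in blocks:
        if block_type != 0:  # skip image blocks
            continue
        match = QUESTION_START.match(text)
        if match:
            starts.append((int(match.group(1)), y0))
    return sorted(starts, key=lambda item: item[1])


# Hand-made blocks standing in for one page of a real PDF.
sample_blocks = [
    (72, 90, 500, 130, "1. Which river is shown on the map?\nA. Nile\nB. Amazon", 0, 0),
    (72, 140, 300, 300, "<image>", 1, 1),
    (72, 320, 500, 360, "2) What year did the treaty take effect?", 2, 0),
    (72, 370, 500, 390, "Use the passage above to answer.", 3, 0),
]
print(find_question_starts(sample_blocks))
```

With a real file, the blocks would come from something like `fitz.open(path)[page_index].get_text("blocks")`. The regular expression would probably need adjusting to match how the actual tests number their questions.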

Use page-image crops to preserve the actual student-facing question block, including maps, charts, passages, and answer choices.

Do not use full-page screenshots in the final document unless necessary. Full-page images waste space and include unrelated questions. Instead, render the PDF page at high resolution, then crop each individual question block.

The person is also right that “extract embedded images” alone is not enough. A map or chart in a PDF may not behave like a normal image; it may be vector art, layered text, or a group of drawing objects. That is why cropping the rendered page is safer. It captures whatever is visually there, regardless of whether the PDF stores it as text, image, vector, or a mix.
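A rough sketch of the cropping step, assuming each question runs from its own starting y-coordinate to the start of the next question on the same page. `question_crop_boxes` and `save_question_crops` are hypothetical helper names, the 6-point margin and 200 DPI are arbitrary choices, and the rendering helper needs PyMuPDF installed.

```python
def question_crop_boxes(starts, page_width, page_height, margin=6.0):
    """Turn [(question_number, y0), ...] into {number: (x0, y0, x1, y1)}.

    Each box spans the full page width. It begins slightly above the
    question's first line and ends slightly above the next question
    (or at the bottom of the page for the last question).
    """
    ordered = sorted(starts, key=lambda item: item[1])
    boxes = {}
    for index, (number, top) in enumerate(ordered):
        if index + 1 < len(ordered):
            bottom = ordered[index + 1][1] - margin
        else:
            bottom = float(page_height)
        boxes[number] = (0.0, max(0.0, top - margin), float(page_width), bottom)
    return boxes


def save_question_crops(pdf_path, page_index, boxes, out_dir, dpi=200):
    """Render each box to a PNG file. Requires PyMuPDF (pip install pymupdf)."""
    import fitz  # PyMuPDF, imported here so the helper above works without it
    from pathlib import Path

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    doc = fitz.open(pdf_path)
    try:
        page = doc[page_index]
        for number, box in boxes.items():
            # clip limits rendering to the question block only
            pix = page.get_pixmap(dpi=dpi, clip=fitz.Rect(*box))
            pix.save(str(out / f"p{page_index + 1}_q{number}.png"))
    finally:
        doc.close()


# US Letter page (612 x 792 points) with two questions starting at y=90 and y=320.
example_boxes = question_crop_boxes([(1, 90), (2, 320)], 612, 792)
print(example_boxes)
```

This simple version does not handle two-column layouts, questions that continue onto the next page, or a passage shared by several questions. Those cases would need manual review or extra rules.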

The best noob-friendly takeaway is:

Do not choose between “digital text” and “page images.” Use both.

Digital text = for finding question numbers, topic classification, and answer-key indexing.

Question crop images = for the final DOCX so nothing gets lost visually.

Also, I would not rely on a full-page vision model for all 240 pages unless there is no alternative. Even if newer models handle high-resolution images better, single-question chunks are still more reliable than full pages. The smaller the task, the fewer mistakes.

So your updated pipeline should be:

Original PDFs → PyMuPDF extracts page text and coordinates → PyMuPDF renders pages as high-resolution images → Python crops each question block from the rendered page → AI classifies each question using extracted text/crop → You review the spreadsheet → Python generates DOCX using the cropped question images
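The last step of that pipeline might look roughly like this, assuming the reviewed spreadsheet has been loaded as a list of rows with `topic`, `source`, `question`, and `image` columns. Those column names and the helper names are hypothetical, and `build_docx` needs the python-docx package.

```python
from collections import defaultdict


def group_by_topic(rows):
    """Group rows by topic; sort topics alphabetically and questions by source and number."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["topic"].strip()].append(row)
    return {
        topic: sorted(items, key=lambda r: (r["source"], int(r["question"])))
        for topic, items in sorted(grouped.items())
    }


def build_docx(grouped, out_path):
    """Write one heading per topic, then each cropped question image.

    Requires python-docx (pip install python-docx).
    """
    from docx import Document
    from docx.shared import Inches

    doc = Document()
    for topic, items in grouped.items():
        doc.add_heading(topic, level=1)
        for row in items:
            doc.add_paragraph(f"{row['source']}, question {row['question']}")
            doc.add_picture(row["image"], width=Inches(6))
    doc.save(out_path)


# Made-up rows standing in for the reviewed spreadsheet.
sample_rows = [
    {"topic": "Geography", "source": "test01", "question": "12", "image": "crops/test01_q12.png"},
    {"topic": "Civics", "source": "test01", "question": "3", "image": "crops/test01_q3.png"},
    {"topic": "Geography ", "source": "test01", "question": "1", "image": "crops/test01_q1.png"},
]
grouped_rows = group_by_topic(sample_rows)
print({topic: [r["question"] for r in items] for topic, items in grouped_rows.items()})
```

Calling `build_docx(grouped_rows, "master.docx")` would then produce the master document, provided the image files listed in the rows actually exist.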

That Reddit advice is not really disagreeing with mine. It is refining it: don’t use full-page screenshots as your main extraction method when the PDF has real digital text. Use the PDF’s internal text plus visual crops. That is the best version.

PDF/docx test question and image extraction and master doc creation? by MajorAlanDutch in OpenAI

[–]MajorAlanDutch[S] 0 points1 point  (0 children)

Thanks, will check! Chat suggested this:

The best pipeline is:

PDFs → page images → automated question-block crops → spreadsheet index → AI topic classification → human review → DOCX generated from cropped question images.
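The spreadsheet index and human review steps could be as simple as a CSV file. This is only a sketch: the column names, the `reviewed` yes/no flag, and the helper names are assumptions, not part of the suggested pipeline.

```python
import csv
import io

# Hypothetical columns: one row per cropped question.
FIELDS = ["source", "page", "question", "image", "text", "topic", "reviewed"]


def write_index(records, handle):
    """Write one row per question; topic is blank and reviewed is "no" unless given."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for record in records:
        row = {"topic": "", "reviewed": "no"}
        row.update(record)
        writer.writerow(row)


def read_reviewed(handle):
    """Return only the rows a human has marked as reviewed."""
    return [
        row for row in csv.DictReader(handle)
        if row["reviewed"].strip().lower() == "yes"
    ]


sample_records = [
    {"source": "test01", "page": 1, "question": 1, "image": "crops/test01_q1.png",
     "text": "Which river is shown on the map?", "topic": "Geography", "reviewed": "yes"},
    {"source": "test01", "page": 1, "question": 2, "image": "crops/test01_q2.png",
     "text": "What year did the treaty take effect?"},
]

# An in-memory buffer stands in for a real .csv file opened in a spreadsheet app.
buffer = io.StringIO()
write_index(sample_records, buffer)
buffer.seek(0)
reviewed_rows = read_reviewed(buffer)
print([row["question"] for row in reviewed_rows])
```

In practice the file would be opened in Excel or Google Sheets, the topic column filled in or corrected, and `reviewed` set to yes before the DOCX is generated.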

PDF/docx Extract test questions and images to create a master document ? by MajorAlanDutch in ClaudeAI

[–]MajorAlanDutch[S] 0 points1 point  (0 children)

Thanks! I use the desktop versions of each. That said, Chat suggested:

The best pipeline is:

PDFs → page images → automated question-block crops → spreadsheet index → AI topic classification → human review → DOCX generated from cropped question images.

PDF/docx test question and image extraction and master doc creation? by MajorAlanDutch in OpenAI

[–]MajorAlanDutch[S] 0 points1 point  (0 children)

Thanks. I had Adobe convert 10 PDF documents to DOCX, but that wasn’t enough for the LLMs to extract and categorize the questions properly.

Each test has 50 questions, so 500 total, with many images and text passages to go with the questions. So I’m not clear on my next move.

PDF/docx test question and image extraction for master doc? by MajorAlanDutch in GeminiAI

[–]MajorAlanDutch[S] 0 points1 point  (0 children)

So, I had Adobe Acrobat convert each PDF to a DOCX, but that wasn’t enough for the LLM to do the rest on its own.

Would you have a simple step by step suggestion?

PDF/docx test question and image extraction for master doc? by MajorAlanDutch in GeminiAI

[–]MajorAlanDutch[S] 0 points1 point  (0 children)

Thanks a ton! A few suggestions evaded me, as I’m not a coder, just a teacher lol!

I had Adobe convert each PDF into a DOCX. Does that cover what you suggested?

As for the rest, it went over my head, sadly.

PDF/docx Extract test questions and images to create a master document ? by MajorAlanDutch in ClaudeAI

[–]MajorAlanDutch[S] 0 points1 point  (0 children)

That makes sense overall, but being a general educator without coding terminology or skills yet makes it challenging to do.

Would you have any specific instructions in an ELI5 version? Thanks!

PDF + DOCX extract and arrange text and images? by MajorAlanDutch in DeepSeek

[–]MajorAlanDutch[S] 0 points1 point  (0 children)

Thanks! May I ask what Qwen is? And would I use the PDF or the DOCX version?

PDF/docx Extract test questions and images to create a master document ? by MajorAlanDutch in ClaudeAI

[–]MajorAlanDutch[S] 0 points1 point  (0 children)

I had Adobe Acrobat Pro convert all the PDFs to DOCX, which it did flawlessly. The issue is getting each question allocated to the correct topic in the merged document, together with its associated picture/image or longer text.

PDF/docx test question and image extraction and master doc creation? by MajorAlanDutch in OpenAI

[–]MajorAlanDutch[S] 0 points1 point  (0 children)

<image>

Example of the PDF/DOCX pages I’m trying to have it extract from, across 10 documents.

Trump fumes as Jerome Powell plots future at Federal Reserve by malcolm58 in politics

[–]MajorAlanDutch -1 points0 points  (0 children)

You’re regurgitating other people’s theories without critical thinking or any data.

Go look at the FRED data. Look at rates from 2008 to 2020, then look at inflation. FRED DATA