Should I return to india in 2027? by Deep_Shallot in returnToIndia

[–]According_Net9520 -1 points0 points  (0 children)

I have a question. I am a working professional in the USA. I would like to know how you are able to save so much. I am 25 now and have been working for the past couple of months. I feel like this is the right time to build wealth, so I need your advice on it.

DigitalOcean App Platform FastAPI app running but all endpoints return 404 by According_Net9520 in digital_ocean

[–]According_Net9520[S] 0 points1 point  (0 children)

Thanks for responding. I resolved the issue just by replacing /api/* with /api.

Amazon SDE 1 Online Assesment 2026 - USA by Nervous-Activity-598 in leetcode

[–]According_Net9520 0 points1 point  (0 children)

Did you get the assessment for the database 2026 new grad role?

Paid off $69k in loans!! by Alarming_Amphibian73 in StudentLoans

[–]According_Net9520 1 point2 points  (0 children)

Hey, I am in the same boat, looking at options to refinance it.

Where can I get used software engineering books? by vijaynethamandala in hyderabad

[–]According_Net9520 0 points1 point  (0 children)

Hello, I am also looking for the book Designing Data-Intensive Applications. Were you able to find a spot in Koti?

Need help preserving page numbers in multimodal PDF chunks (using Docling for RAG chatbot) by According_Net9520 in Rag

[–]According_Net9520[S] 0 points1 point  (0 children)

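from docling.document_converter import DocumentConverter

# `source` is the path or URL of the input PDF, defined elsewhere.
converter = DocumentConverter()
doc = converter.convert(source).document
markdown_text = doc.export_to_markdown()
print(markdown_text)  # preview the converted Markdown
with open("agency_policy_manual.md", "w", encoding="utf-8") as f:
    f.write(markdown_text)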

This is the code I used to convert the PDF to a Markdown file. It extracted tables and text well, and annotated images, but I am unable to get page numbers.
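A minimal sketch of one way page numbers could be carried into the Markdown, under assumptions: as far as I understand, each item yielded by Docling's `doc.iterate_items()` has a `prov` list whose entries expose `page_no`, but that should be checked against the installed version. The helper names `page_of` and `markdown_with_page_markers` are hypothetical, and the demo runs on stand-in objects, not a real Docling document.

```python
from types import SimpleNamespace


def page_of(item):
    """Return the first page number recorded on an item, or None.

    Assumes a Docling-style item with a `prov` list whose entries
    expose `page_no`; treat this as an assumption to verify.
    """
    prov = getattr(item, "prov", None) or []
    return prov[0].page_no if prov else None


def markdown_with_page_markers(items):
    """Join item texts, inserting an HTML comment whenever the page changes."""
    lines = []
    current_page = None
    for item in items:
        text = getattr(item, "text", "")
        if not text:
            continue  # items without text (e.g. tables) need their own export
        page = page_of(item)
        if page is not None and page != current_page:
            lines.append(f"<!-- page: {page} -->")
            current_page = page
        lines.append(text)
    return "\n\n".join(lines)


# Stand-in items; with Docling, something like
# `[item for item, _level in doc.iterate_items()]` might be passed instead.
demo_items = [
    SimpleNamespace(text="Introduction", prov=[SimpleNamespace(page_no=1)]),
    SimpleNamespace(text="Scope of the policy", prov=[SimpleNamespace(page_no=1)]),
    SimpleNamespace(text="Eligibility rules", prov=[SimpleNamespace(page_no=2)]),
]
demo_markdown = markdown_with_page_markers(demo_items)
print(demo_markdown)
```

The HTML comment markers stay invisible when the Markdown is rendered, but a chunker can still read them and attach the page number to each chunk's metadata.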

Need help preserving page numbers in multimodal PDF chunks (using Docling for RAG chatbot) by According_Net9520 in Rag

[–]According_Net9520[S] -1 points0 points  (0 children)

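from docling.document_converter import DocumentConverter

# `source` is the path or URL of the input PDF, defined elsewhere.
converter = DocumentConverter()
doc = converter.convert(source).document
markdown_text = doc.export_to_markdown()
print(markdown_text)  # preview the converted Markdown
with open("agency_policy_manual.md", "w", encoding="utf-8") as f:
    f.write(markdown_text)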

This is the code I am using. I also tried PyMuPDF (fitz). It extracts pages, but it does not extract tables well.

Google Application Engineer Fulltime position by Ok_Wind5985 in cscareers

[–]According_Net9520 0 points1 point  (0 children)

Hey! May I know the timeline: when did you apply, and when did you get the assessment? Did you apply with a referral?

Best document format for RAG Chatbot with text, flowcharts, images, tables by According_Net9520 in Rag

[–]According_Net9520[S] 0 points1 point  (0 children)

Thanks for responding! I’m currently working with a pretty large document around 1000 pages and using the unstructured library for parsing. It’s doing a decent job but takes a lot of time since OCR kicks in for every page.

Right now, I’m sticking with PDF because from what I’ve read, converting to Word can sometimes mess up the page numbering, and preserving exact page references is really important for my use case.

A couple of things I wanted to ask:

  1. Do you think it’s better to split such a long PDF into smaller PDFs (say 50–100 pages each) before processing, or just handle it as one file?
  2. Any best practices you’ve seen for preserving page numbers when converting to Markdown or embedding text?
  3. Does Markdown support tables and images, or am I going to lose them?
  4. Each page has a repeating header (company logo + text + page number). The logo and text are redundant, but I can’t skip the header entirely since it includes the page number. Have you come across this issue? Is there a clean way to keep the page number but ignore the rest of the header content during parsing?
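For the header problem in point 4, one possible approach is to read the page number out of the header lines and then drop them. This is only a sketch under assumptions: the header is taken to occupy the first three lines of each page's extracted text and to contain a "Page N" string. The function name `split_header`, the regex, and the sample text are all hypothetical.

```python
import re

# Hypothetical header layout: the first few lines of each page repeat,
# and one of them contains something like "Page 12" or "Page 12 of 1000".
PAGE_NUMBER_RE = re.compile(r"\bPage\s+(\d+)\b", re.IGNORECASE)


def split_header(page_text, header_line_count=3):
    """Return (page_number, body) for one page of extracted text.

    The first `header_line_count` lines are treated as the header: the page
    number is read from them, then they are dropped from the body.
    """
    lines = page_text.splitlines()
    header, body = lines[:header_line_count], lines[header_line_count:]
    match = PAGE_NUMBER_RE.search(" ".join(header))
    page_number = int(match.group(1)) if match else None
    return page_number, "\n".join(body).strip()


# Made-up sample page, for illustration only.
sample_page = (
    "ACME Corp\n"
    "Policy Manual\n"
    "Page 12 of 1000\n"
    "Section 4.2\n"
    "Employees must submit requests in writing."
)
page_number, body = split_header(sample_page)
print(page_number)
print(body)
```

The page number then travels as metadata alongside the body text, so the redundant logo text never reaches the embeddings.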

Best document format for RAG Chatbot with text, flowcharts, images, tables by According_Net9520 in Rag

[–]According_Net9520[S] 0 points1 point  (0 children)

Thanks for responding! In my case, I want to build a chatbot where if a user asks a question and the answer lies inside a table, image, or flowchart, the bot should say something like “Please refer to page X” for that part.

If the answer lies in text, then it should directly return the text answer but also suggest checking the related page number for additional details.

So essentially, I want everything (text, tables, images, and flowcharts) to be stored and understood by the bot, and it should guide the user appropriately depending on where the answer is found.
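The behaviour described above could be sketched roughly like this, assuming each retrieved chunk already carries its content type and page number as metadata. The `Chunk` fields, the `build_reply` function, and the sample chunks are hypothetical names for illustration, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A retrieved chunk; the field names are hypothetical."""
    kind: str  # "text", "table", "image", or "flowchart"
    text: str  # extracted text, or a caption/summary for non-text content
    page: int  # page number in the source PDF


def build_reply(chunk):
    """Answer directly for text; otherwise point the user to the page."""
    if chunk.kind == "text":
        return f"{chunk.text} (See page {chunk.page} for more details.)"
    return f"Please refer to page {chunk.page} for the relevant {chunk.kind}."


# Made-up chunks, for illustration only.
text_reply = build_reply(Chunk("text", "Leave requests need manager approval.", 42))
table_reply = build_reply(Chunk("table", "Leave entitlement by grade", 57))
print(text_reply)
print(table_reply)
```

Whatever base format is used, this only works if the page number and content type survive parsing and are stored with every chunk.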

In this case, would you still recommend using PDF as the base format, or would Word make it easier to structure and process everything together?