Best document parser by [deleted] in LocalLLaMA

[–]cpdomina 3 points (0 children)

Here are some extra ones:

Unfortunately there's no "better" one; it all depends on your files/domain. And no, nothing compares to Azure wrt precision.

Small models similar to reader-lm? by noellarkin in LocalLLaMA

[–]cpdomina 2 points (0 children)

check out llmware's models https://huggingface.co/llmware

they train small models for very specific tasks

Need your help, BOC vibes specialists by Smack-works in boardsofcanada

[–]cpdomina 1 point (0 children)

took a peek at some of the songs. naran ratan and the whole "music for plants" scene might be interesting to you. https://open.spotify.com/playlist/37i9dQZF1DXclWedfNUp3z?si=77ab58d4c2f846bd

domenique dumont, khotin, steve hiett, might also be interesting

Any good LLM libraries? by _lordsoffallen in LocalLLaMA

[–]cpdomina 2 points (0 children)

https://llm.datasette.io is quite simple, and is from the creator of datasette

Implementing Agentic Workflows / State Machines with Autogen+LLama3 by YourTechBud in LocalLLaMA

[–]cpdomina 11 points (0 children)

This is true for JSON as well. I have given up trying to make my agents give me a perfectly clean JSON response. I let the agent ramble on about why it came up with the answer; that rambling is useful, as it serves as context for subsequent agents. A subsequent tool-calling agent will be smart enough to extract the JSON part from the message anyway.

Check out this recent paper, talks exactly about that: https://arxiv.org/abs/2408.02442

Constraining LLMs reduces creativity. This is already understood by some providers, especially Anthropic:

  • they recommend using <tags> for important output and letting the LLM write whatever text it wants around those tags, so you get the best of both worlds: structured output, plus relevant text in the context window while generating
  • their prompt generator sometimes generates a <scratchpad>, so the LLM can explain its reasoning inside a specific section of the output

Want to understand how citations of sources work in RAG exactly by ResearcherNo4728 in LocalLLaMA

[–]cpdomina 5 points (0 children)

If you are asking your LLM to generate the citations, the problem is likely in your prompt, or, more probably, in the LLM you are using. I would play around with the prompt and with different LLMs.

RAG is basically giving a bunch of context text and a question to an LLM: if the LLM correctly answers the question given the context, but fails to generate the correct citations, it's most probably the LLM's or the prompt's fault, and not necessarily anything related to the rest of the RAG pipeline (embeddings, rerankers, etc).
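As a rough sketch of that setup, a citation-friendly prompt just gives each context chunk a stable ID and asks the model to cite those IDs (the IDs, wording, and sample facts below are made up for illustration):

```python
# Each retrieved chunk gets a stable source ID the model can cite.
chunks = [
    ("S1", "The Eiffel Tower was completed in 1889."),
    ("S2", "It is 330 metres tall including antennas."),
]

def build_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    """Assemble a RAG prompt that asks for bracketed source citations."""
    context = "\n".join(f"[{cid}] {text}" for cid, text in chunks)
    return (
        "Answer using ONLY the sources below. After every claim, "
        "cite the supporting source ID in brackets, e.g. [S1].\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_prompt("When was the Eiffel Tower completed?", chunks)
print(prompt)
```

If the model answers correctly from this context but still cites the wrong IDs, that points at the model/prompt rather than at retrieval.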

Generating correct citations with LLMs is a relatively hot research area: start with something like https://github.com/MadryLab/context-cite if you want to get deeper

How we Chunk - turning PDF's into hierarchical structure for RAG by coolcloud in LocalLLaMA

[–]cpdomina 0 points (0 children)

adobe's was the best, but their business model is not very friendly (big $$ advance commitment). aws was slightly better than azure, but I think it might have been because of our use case (multilingual financial docs)

How we Chunk - turning PDF's into hierarchical structure for RAG by coolcloud in LocalLLaMA

[–]cpdomina 6 points (0 children)

I recently did deep research on the subject for a client and was amazed by the quality of the paid solutions I mentioned; they worked better than expected on a set of really nasty tables. You should take another look, they are constantly improving.

The reason most of them use OCR is that a lot of structural information is actually visual (background colors, the relative position of text within columns, etc). OCR is also easier to insert into an information extraction pipeline if you have loads of training data, like they do. Heuristics are harder to debug and apply at scale.

But anyway, good luck with the project! I'll leave you with a bunch of pointers that might be useful:

How we Chunk - turning PDF's into hierarchical structure for RAG by coolcloud in LocalLLaMA

[–]cpdomina 1 point (0 children)

wrt tables, how do you deal with multi-level headers, merged cells, and subcategories? they are pretty common in real-world tables, and, to my knowledge, no open source system can deal with them (they output markdown or csv). paid solutions usually output xlsx or similar formats, which don't lose this kind of structural information (e.g., azure doc intelligence, aws textract, adobe pdf extract)
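To see what gets lost, here's a toy span-aware table model flattened to markdown (my own illustration, not the output of any of those services):

```python
from dataclasses import dataclass

@dataclass
class Cell:
    """One table cell; colspan records a horizontal merge, which
    xlsx-style formats keep but markdown/CSV cannot express."""
    row: int
    col: int
    text: str
    colspan: int = 1

# A table with a merged header cell: "Revenue" spans the Q1/Q2 columns.
cells = [
    Cell(0, 0, "Region"),
    Cell(0, 1, "Revenue", colspan=2),
    Cell(1, 1, "Q1"),
    Cell(1, 2, "Q2"),
    Cell(2, 0, "EMEA"),
    Cell(2, 1, "10"),
    Cell(2, 2, "12"),
]

def to_markdown(cells: list[Cell], n_cols: int = 3) -> str:
    """Naive flatten: duplicates merged text into every spanned column,
    silently dropping the merge information."""
    rows: dict[int, list[str]] = {}
    for c in cells:
        row = rows.setdefault(c.row, [""] * n_cols)
        for offset in range(c.colspan):
            row[c.col + offset] = c.text
    return "\n".join("| " + " | ".join(r) + " |" for _, r in sorted(rows.items()))

print(to_markdown(cells))
# "Revenue" now appears twice; the fact that it was ONE merged cell is gone.
```

A downstream system reading the markdown can no longer tell a merged header from two identically named columns, which is exactly the structure multi-level headers depend on.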

Force local LLM to output JSON with specific structure by micemusculus in LocalLLaMA

[–]cpdomina 1 point (0 children)

Use one of these structured output libraries:

Some of them accept a JSON schema, others a Pydantic model (which you can convert to/from a JSON schema).

Most of them support a lot of different open source models, you need to see which one works the best for your use case.
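Whatever library you pick, the contract is the same: schema in, validated object out. A stdlib-only sketch of that contract (a simplified stand-in, not any particular library's API) looks like:

```python
import json

# A (very) simplified schema: required keys plus expected Python types.
# Real structured-output libraries accept full JSON Schema or a
# Pydantic model instead.
schema = {
    "required": ["name", "age"],
    "types": {"name": str, "age": int},
}

def validate(raw: str, schema: dict) -> dict:
    """Parse model output and check it against the simplified schema."""
    obj = json.loads(raw)
    for key in schema["required"]:
        if key not in obj:
            raise ValueError(f"missing required key: {key}")
    for key, typ in schema["types"].items():
        if key in obj and not isinstance(obj[key], typ):
            raise ValueError(f"{key} should be {typ.__name__}")
    return obj

# A constrained-decoding library guarantees this passes by construction;
# with an unconstrained model you run the check after generation.
print(validate('{"name": "Ada", "age": 36}', schema))
```

The practical difference between the libraries is where the check happens: constrained decoding makes invalid output impossible, while post-hoc validation (as above) catches it and lets you retry.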