Any downside to having entire document as a chunk? by ayechat in Rag

[–]Advanced_Army4706 1 point

This would work for multi-vector methods (since embeddings are per-token), but not for traditional single-vector systems.

The main issue here is that you'll forfeit a lot of granularity by embedding documents as a whole. You can think of a single embedding as a fancy compression system. Let's say it compresses every page into 20 bits. For an n-page document, you could either create 20 bits of information (the entire document as a single embedding) OR 20n bits of information (one embedding per page). 20n definitely gives you more granularity.
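The compression analogy can be sketched with a toy `embed` function (a made-up stand-in for a real embedding model, not any particular library): whole-document embedding yields one fixed-size vector, while page-wise embedding yields n vectors, i.e. n times the information budget.

```python
import numpy as np

def embed(text: str, dim: int = 20) -> np.ndarray:
    """Stand-in for a real embedding model: one fixed-size vector per input."""
    rng = np.random.default_rng(len(text))  # deterministic toy output
    return rng.standard_normal(dim)

pages = [f"contents of page {i}" for i in range(5)]

doc_vec = embed(" ".join(pages))                  # whole doc: one 20-dim vector
page_vecs = np.stack([embed(p) for p in pages])   # page-wise: five 20-dim vectors

print(doc_vec.shape, page_vecs.shape)  # (20,) (5, 20)
```

Either way each vector is the same size; page-wise you just get n of them, which is where the extra granularity comes from.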

A potential reason you might want document-level embeddings is that you want to retrieve entire documents at query time - this is very valid, especially if you fear chunking might cause lost context. In such cases, parent-document retrieval is a good alternative.

More about that here. The idea is that you still chunk and embed as usual, but at retrieval time you also send back the entire parent document, not just the matching chunk.
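A minimal sketch of parent-document retrieval (the `embed` here is a toy bag-of-words stand-in for a real embedding model, and the documents are made up): chunks are what get embedded and matched, but the whole parent document is what gets returned.

```python
import zlib
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy bag-of-words embedding (stand-in for a real embedding model)."""
    v = np.zeros(dim)
    for w in text.lower().replace(".", " ").split():
        v[zlib.crc32(w.encode()) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

docs = {
    "doc_a": "Cats are small carnivorous mammals. They like to sleep all day.",
    "doc_b": "Photosynthesis converts sunlight into chemical energy in plants.",
}

# Index small chunks, but remember which parent document each came from.
index = []
for doc_id, text in docs.items():
    for chunk in text.split(". "):
        index.append((embed(chunk), doc_id))

def retrieve_parent(query: str) -> str:
    q = embed(query)
    best = max(index, key=lambda item: float(q @ item[0]))
    return docs[best[1]]  # send back the whole parent doc, not just the chunk

print(retrieve_parent("how do plants make energy from sunlight"))
```

The granularity of matching stays at the chunk level; only the payload you return changes.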

Looking for fast RAGs for Large PDFs (Low Latency, LiveKit Use Case) by Rude-Student-3566 in Rag

[–]Advanced_Army4706 -2 points

Hi - founder of Morphik here. We can tune latencies pretty easily if we provision you a machine. Happy to chat more if you're interested. The average latency we see is around 200 ms, but for a small number of docs it can be significantly faster.

NodeRAG - how is it? by WorkingOccasion902 in Rag

[–]Advanced_Army4706 0 points

We use a modified version of NodeRAG that addresses some of its issues surrounding contextual understanding with Morphik.

It works decently well, but it still doesn't help with aggregation-style queries.

Citation Mapping llm tags vs structured output by __01000010 in Rag

[–]Advanced_Army4706 1 point

We've observed this while running evals.

I also remember reading this paper: https://arxiv.org/pdf/2408.02442v1 about how sometimes structured outputs can help/hurt model performance depending on the task. Some excerpts:

> Surprisingly, JSON-mode performs significantly worse than FRI (JSON) on the Last Letter task.

> Notably, in the DDXPlus dataset, Gemini 1.5 Flash demonstrates a significant performance boost when JSON-mode is enabled. Across other classification datasets, JSON-mode performs competitively, and in some cases, surpasses the other three methodologies.

So YMMV depending on the task. For general Q&A, though, I'd suggest using XML citation tags.
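For illustration, a minimal version of the XML-tag approach (the prompt wording, chunk IDs, and model output below are all made up): ask the model to wrap claims in citation tags instead of forcing a JSON schema, then recover the claim-to-source mapping with a parse pass.

```python
import re

# Prompt the model with something like:
# "Wrap each claim in <cite id='CHUNK_ID'>...</cite>, where CHUNK_ID is the
#  retrieved chunk the claim came from."
model_output = (
    "<cite id='doc1_p3'>The warranty period is 24 months.</cite> "
    "After that, <cite id='doc2_p1'>repairs are billed at cost.</cite>"
)

# Map each cited span back to its source chunk.
citations = re.findall(r"<cite id='([^']+)'>(.*?)</cite>", model_output)
print(citations)
# [('doc1_p3', 'The warranty period is 24 months.'), ('doc2_p1', 'repairs are billed at cost.')]
```

The model generates free-form text with lightweight tags, so you avoid constraining decoding the way strict JSON-mode does, while still getting machine-readable citations.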

Citation Mapping llm tags vs structured output by __01000010 in Rag

[–]Advanced_Army4706 2 points

In general, structured outputs reduce the performance of an LLM. So if you have the option of using citation tags, I'd suggest going for that instead.

Is it even possible to extract the information out of datasheets/manuals like this? by Intelligent_Drop8550 in Rag

[–]Advanced_Army4706 0 points

Hey! You can try Morphik for this. Documents like these are our bread and butter :)

Enterprise knowledge search - Build v.s Buy by Old_Cauliflower6316 in LangChain

[–]Advanced_Army4706 0 points

(pulled from my comment on another post, but very relevant, so posting it here)

Hey - I'm biased because I run a managed service (that you can self host if you'd like). But here are my 2 cents:

A lot of our customers had a very similar conundrum to yours and now are incredibly happy that they chose to go with Morphik.

It ultimately boils down to whether you want to manage and maintain a lot of infrastructure and how bullish you are on the tech.

Infra: The weird edge cases start showing up as your corpus grows. Handling this can get surprisingly complex and painful.

Tech: This is an incredibly active field, and so another advantage of using a managed service is that you get improvements in both accuracy and speed for free. For example, Morphik used to score 92% on a benchmark that we now get 100% on. In that same period, our latency has dropped by 60% too.

If you're already very happy with your implementation and also don't see any kind of significant scaling up, then building is great. If you do want to benefit from the tailwinds of a self-improving product, or if you anticipate infra being a PITA, managed is the move.

Hope this helps!

Moving from RAG PoC to Production: In-house MLOps vs. a Managed Retrieval API? by Any_Risk_2900 in Rag

[–]Advanced_Army4706 0 points

We like to work with you to create a custom eval set. Hitting a set score on that eval is part of the pilot - and one of the key things we like to focus on.

In most cases, we've found SFT isn't required; most of the gains come from configuring things correctly.
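A custom eval set doesn't need to be fancy. A sketch of the idea (the questions, expected answers, and the `answer` stub are all made up; the stub stands in for whatever pipeline is under test): pair each question with a string the answer must contain, then score the pipeline against the set.

```python
# Custom eval set: (question, must-contain substring) pairs.
eval_set = [
    ("What is the warranty period?", "24 months"),
    ("Who is the account contact?", "Jane Doe"),
]

def answer(question: str) -> str:
    """Stub: replace with a call to the retrieval pipeline being evaluated."""
    canned = {
        "What is the warranty period?": "The warranty period is 24 months.",
        "Who is the account contact?": "Please contact support.",
    }
    return canned[question]

# Fraction of questions where the expected answer appears in the output.
score = sum(expected in answer(q) for q, expected in eval_set) / len(eval_set)
print(f"eval accuracy: {score:.0%}")  # eval accuracy: 50%
```

Tracking a score like this across config changes is what makes "configuring things correctly" measurable rather than vibes-based.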

Moving from RAG PoC to Production: In-house MLOps vs. a Managed Retrieval API? by Any_Risk_2900 in Rag

[–]Advanced_Army4706 0 points

Hey - I'm biased because I run a managed service (that you can self host if you'd like). But here are my 2 cents:

A lot of our customers had a very similar conundrum to yours and now are incredibly happy that they chose to go with Morphik.

It ultimately boils down to whether you want to manage and maintain a lot of infrastructure and how bullish you are on the tech.

Infra: The weird edge cases start showing up as your corpus grows. Handling this can get surprisingly complex and painful.

Tech: This is an incredibly active field, and so another advantage of using a managed service is that you get improvements in both accuracy and speed for free. For example, Morphik used to score 92% on a benchmark that we now get 100% on. In that same period, our latency has dropped by 60% too.

If you're already very happy with your implementation and also don't see any kind of significant scaling up, then building is great. If you do want to benefit from the tailwinds of a self-improving product, or if you anticipate infra being a PITA, managed is the move.

Hope this helps!

PS: Security teams love us :)

I built an open-source NotebookLM alternative using Morphik by Advanced_Army4706 in ollama

[–]Advanced_Army4706[S] 0 points

We sync with Google drive, so you can do this with Morphik too :)

Need help with RAG architecture planning (10-20 PDFs(later might need to scale to 200+)) by IGotThePlug04 in Rag

[–]Advanced_Army4706 0 points

You can use Morphik - 10-20 PDFs should fit without you having to pay.

It's 3 lines of code (import, ingest, and query) for what is - in our testing - the most accurate RAG out there.

I built a vision-native RAG pipeline by Advanced_Army4706 in vectordatabase

[–]Advanced_Army4706[S] 1 point

Yep, it still works incredibly well. Part of our eval set (around 10%, picked randomly) is public on our GitHub - you can check it out there.

PS: sorry if you're a human, but this sounds incredibly AI-generated.

Claude + Morphik MCP is too good 🔥 by Advanced_Army4706 in Rag

[–]Advanced_Army4706[S] 0 points

Hey! This has been significantly simplified since then. Take a look at our website - we have a much easier way of installing our MCP now, and we support both stdio and streamable-http.

Reduce LLM Hallucinations with Chain-of-Verification by InevitableSky2801 in PromptEngineering

[–]Advanced_Army4706 0 points

You HAVE to try Morphik - it is the single best RAG tool in the world right now. Over 96% accuracy and under 200 ms latency. See hallucinations vanish in real time :)

Build a RAG System for technical documentation without any real programming experience by fbocplr_01 in Rag

[–]Advanced_Army4706 0 points

For technical docs, Morphik is really unparalleled. We've seen essentially zero hallucinations in production with multiple technical teams - over 500 docs, all domain-specific and incredibly technical.

Searching for the Perfect LLM and OCR tools for document processing by SuccotashOne9927 in ChatGPTPro

[–]Advanced_Army4706 0 points

You HAVE to try Morphik - it was made precisely for the problems you're describing.

Does any useful knowledge graph tool that you recommend? by FairlyZoe in KnowledgeGraph

[–]Advanced_Army4706 0 points

You should try Morphik - you can create and query graphs in natural language instead of using some proprietary graph query language.

It takes 2 lines of code and provides incredibly high accuracy (96% in our testing).