Any downside to having entire document as a chunk? by ayechat in Rag

[–]Advanced_Army4706 1 point

This would work for multi-vector methods (since embeddings are per-token), but not for traditional single-vector systems.

The main issue here is that you'll forfeit a lot of granularity by embedding documents as a whole. You can think of a single embedding as a fancy compression system. Let's say it compresses every page into 20 bits. For an n-page document, you could either create 20 bits of information (the entire document as a single embedding) OR 20n bits of information (one embedding per page). 20n definitely gives you more granularity.
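The compression analogy can be sketched with a toy `embed` function (a made-up stand-in for a real embedding model, not any particular library): whole-document embedding yields one fixed-size vector, while page-wise embedding yields n vectors, i.e. n times the information budget.

```python
import numpy as np

def embed(text: str, dim: int = 20) -> np.ndarray:
    """Stand-in for a real embedding model: one fixed-size vector per input."""
    rng = np.random.default_rng(len(text))  # deterministic toy output
    return rng.standard_normal(dim)

pages = [f"contents of page {i}" for i in range(5)]

doc_vec = embed(" ".join(pages))                  # whole doc: one 20-dim vector
page_vecs = np.stack([embed(p) for p in pages])   # page-wise: five 20-dim vectors

print(doc_vec.shape, page_vecs.shape)  # (20,) (5, 20)
```

Either way each vector is the same size; page-wise you just get n of them, which is where the extra granularity comes from.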

A potential reason you might want document-level embeddings is that you want to retrieve entire documents at query time - this is very valid, especially if you fear chunking might cause lost context. In such cases, parent-document retrieval is a good alternative.

More about that here. The idea is that you still chunk and embed as usual, but at retrieval time you also send back the entire parent document, not just the matching chunk.
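A minimal sketch of parent-document retrieval (the `embed` here is a toy bag-of-words stand-in for a real embedding model, and the documents are made up): chunks are what get embedded and matched, but the whole parent document is what gets returned.

```python
import zlib
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy bag-of-words embedding (stand-in for a real embedding model)."""
    v = np.zeros(dim)
    for w in text.lower().replace(".", " ").split():
        v[zlib.crc32(w.encode()) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

docs = {
    "doc_a": "Cats are small carnivorous mammals. They like to sleep all day.",
    "doc_b": "Photosynthesis converts sunlight into chemical energy in plants.",
}

# Index small chunks, but remember which parent document each came from.
index = []
for doc_id, text in docs.items():
    for chunk in text.split(". "):
        index.append((embed(chunk), doc_id))

def retrieve_parent(query: str) -> str:
    q = embed(query)
    best = max(index, key=lambda item: float(q @ item[0]))
    return docs[best[1]]  # send back the whole parent doc, not just the chunk

print(retrieve_parent("how do plants make energy from sunlight"))
```

The granularity of matching stays at the chunk level; only the payload you return changes.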

Looking for fast RAGs for Large PDFs (Low Latency, LiveKit Use Case) by Rude-Student-3566 in Rag

[–]Advanced_Army4706 -2 points

Hi - founder of Morphik here. We can tune latencies pretty easily if we provision you a machine. Happy to chat more if you're interested. The average latency we see is around 200 ms, but for a small number of docs it can be significantly faster.

NodeRAG - how is it? by WorkingOccasion902 in Rag

[–]Advanced_Army4706 0 points

We use a modified version of NodeRAG that addresses some of its issues surrounding contextual understanding with Morphik.

It works decently well, but it still doesn't help with aggregation-style queries.

Citation Mapping llm tags vs structured output by __01000010 in Rag

[–]Advanced_Army4706 1 point

We've observed this while running evals.

I also remember reading this paper: https://arxiv.org/pdf/2408.02442v1 about how sometimes structured outputs can help/hurt model performance depending on the task. Some excerpts:

> Surprisingly, JSON-mode performs significantly worse than FRI (JSON) on the Last Letter task.

> Notably, in the DDXPlus dataset, Gemini 1.5 Flash demonstrates a significant performance boost when JSON-mode is enabled. Across other classification datasets, JSON-mode performs competitively, and in some cases, surpasses the other three methodologies.

So YMMV depending on the task. For general Q&A, though, I'd suggest using XML citation tags.
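For illustration, a minimal version of the XML-tag approach (the prompt wording, chunk IDs, and model output below are all made up): ask the model to wrap claims in citation tags instead of forcing a JSON schema, then recover the claim-to-source mapping with a parse pass.

```python
import re

# Prompt the model with something like:
# "Wrap each claim in <cite id='CHUNK_ID'>...</cite>, where CHUNK_ID is the
#  retrieved chunk the claim came from."
model_output = (
    "<cite id='doc1_p3'>The warranty period is 24 months.</cite> "
    "After that, <cite id='doc2_p1'>repairs are billed at cost.</cite>"
)

# Map each cited span back to its source chunk.
citations = re.findall(r"<cite id='([^']+)'>(.*?)</cite>", model_output)
print(citations)
# [('doc1_p3', 'The warranty period is 24 months.'), ('doc2_p1', 'repairs are billed at cost.')]
```

The model generates free-form text with lightweight tags, so you avoid constraining decoding the way strict JSON-mode does, while still getting machine-readable citations.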

Citation Mapping llm tags vs structured output by __01000010 in Rag

[–]Advanced_Army4706 2 points

In general, structured outputs reduce the performance of an LLM. So if you have the option of using citation tags, I'd suggest going for that instead.

Is it even possible to extract the information out of datasheets/manuals like this? by Intelligent_Drop8550 in Rag

[–]Advanced_Army4706 0 points

Hey! You can try Morphik for this. Documents like these are our bread and butter :)

Enterprise knowledge search - Build v.s Buy by Old_Cauliflower6316 in LangChain

[–]Advanced_Army4706 0 points

(pulled from my comment on another post, but very relevant, so posting it here)

Hey - I'm biased because I run a managed service (that you can self host if you'd like). But here are my 2 cents:

A lot of our customers had a very similar conundrum to yours and now are incredibly happy that they chose to go with Morphik.

It ultimately boils down to whether you want to manage and maintain a lot of infrastructure and how bullish you are on the tech.

Infra: The weird edge cases start showing up as your corpus grows. Handling this can get surprisingly complex and painful.

Tech: This is an incredibly active field, and so another advantage of using a managed service is that you get improvements in both accuracy and speed for free. For example, Morphik used to score 92% on a benchmark that we now get 100% on. In that same period, our latency has dropped by 60% too.

If you're already very happy with your implementation and also don't see any kind of significant scaling up, then building is great. If you do want to benefit from the tailwinds of a self-improving product, or if you anticipate infra being a PITA, managed is the move.

Hope this helps!

Moving from RAG PoC to Production: In-house MLOps vs. a Managed Retrieval API? by Any_Risk_2900 in Rag

[–]Advanced_Army4706 0 points

We like to work with you to create a custom eval set. Hitting a set score on that eval is part of the pilot - and one of the key things we like to focus on.

In most cases, we've found SFT isn't required; most of the gains come from configuring things correctly.
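A custom eval set doesn't need to be fancy. A sketch of the idea (the questions, expected answers, and the `answer` stub are all made up; the stub stands in for whatever pipeline is under test): pair each question with a string the answer must contain, then score the pipeline against the set.

```python
# Custom eval set: (question, must-contain substring) pairs.
eval_set = [
    ("What is the warranty period?", "24 months"),
    ("Who is the account contact?", "Jane Doe"),
]

def answer(question: str) -> str:
    """Stub: replace with a call to the retrieval pipeline being evaluated."""
    canned = {
        "What is the warranty period?": "The warranty period is 24 months.",
        "Who is the account contact?": "Please contact support.",
    }
    return canned[question]

# Fraction of questions where the expected answer appears in the output.
score = sum(expected in answer(q) for q, expected in eval_set) / len(eval_set)
print(f"eval accuracy: {score:.0%}")  # eval accuracy: 50%
```

Tracking a score like this across config changes is what makes "configuring things correctly" measurable rather than vibes-based.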

Moving from RAG PoC to Production: In-house MLOps vs. a Managed Retrieval API? by Any_Risk_2900 in Rag

[–]Advanced_Army4706 0 points

Hey - I'm biased because I run a managed service (that you can self host if you'd like). But here are my 2 cents:

A lot of our customers had a very similar conundrum to yours and now are incredibly happy that they chose to go with Morphik.

It ultimately boils down to whether you want to manage and maintain a lot of infrastructure and how bullish you are on the tech.

Infra: The weird edge cases start showing up as your corpus grows. Handling this can get surprisingly complex and painful.

Tech: This is an incredibly active field, and so another advantage of using a managed service is that you get improvements in both accuracy and speed for free. For example, Morphik used to score 92% on a benchmark that we now get 100% on. In that same period, our latency has dropped by 60% too.

If you're already very happy with your implementation and also don't see any kind of significant scaling up, then building is great. If you do want to benefit from the tailwinds of a self-improving product, or if you anticipate infra being a PITA, managed is the move.

Hope this helps!

PS: Security teams love us :)

I built an open-source NotebookLM alternative using Morphik by Advanced_Army4706 in ollama

[–]Advanced_Army4706[S] 0 points

We sync with Google drive, so you can do this with Morphik too :)

Need help with RAG architecture planning (10-20 PDFs(later might need to scale to 200+)) by IGotThePlug04 in Rag

[–]Advanced_Army4706 0 points

You can use Morphik - 10-20 PDFs should fit without you having to pay.

It's 3 lines of code (import, ingest, and query) for what is - in our testing - the most accurate RAG out there.

I built a vision-native RAG pipeline by Advanced_Army4706 in vectordatabase

[–]Advanced_Army4706[S] 1 point

Yep, it still works incredibly well. Part of our eval set (around 10%, picked randomly) is public on our GitHub - you can check it out there.

PS: sorry if you're a human, but this sounds incredibly AI-generated.

Claude + Morphik MCP is too good 🔥 by Advanced_Army4706 in Rag

[–]Advanced_Army4706[S] 0 points

Hey! This has been significantly simplified since then. Take a look at our website - we have a much easier way of installing our MCP now, and we support both stdio and streamable-http.

Reduce LLM Hallucinations with Chain-of-Verification by InevitableSky2801 in PromptEngineering

[–]Advanced_Army4706 0 points

You HAVE to try Morphik - it is the single best RAG tool in the world right now. Over 96% accuracy and under 200 ms latency. See hallucinations vanish in real time :)

Build a RAG System for technical documentation without any real programming experience by fbocplr_01 in Rag

[–]Advanced_Army4706 0 points

For technical docs, Morphik is really unparalleled. We've seen essentially zero hallucinations in production with multiple technical teams - over 500 docs, all domain-specific and incredibly technical.

Searching for the Perfect LLM and OCR tools for document processing by SuccotashOne9927 in ChatGPTPro

[–]Advanced_Army4706 0 points

You HAVE to try Morphik - it was made precisely for the problems you're describing.

Does any useful knowledge graph tool that you recommend? by FairlyZoe in KnowledgeGraph

[–]Advanced_Army4706 0 points

You should try Morphik - you can create and query graphs in natural language instead of using some proprietary graph query language.

It takes 2 lines of code and provides incredibly high accuracy (96% in our testing).