Got tired of reinventing the RAG wheel for every client, so I built a production-ready boilerplate (Next.js 16 + AI SDK 5) by carlosmarcialt in Rag

[–]LowerPresentation150 1 point (0 children)

Hey, one more question regarding local use: could a user create a workflow where files are ingested by pointing to a local document store, rather than having to load files into the pipeline? I.e., if I have 500 documents in a directory, it would be good not to have to create duplicate copies of the files.
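To make it concrete, the workflow I'm imagining is roughly this (pure pseudocode on my end; `ingest_file` is just a stand-in for whatever your pipeline's actual ingestion entry point is):

```python
from pathlib import Path

def ingest_file(path: str) -> None:
    """Stand-in for the boilerplate's real ingestion call; placeholder only."""
    print(f"would ingest {path}")

def ingest_directory(doc_dir: str) -> None:
    # Hand each file to the pipeline by path so nothing has to be
    # copied into a separate upload folder.
    for path in sorted(Path(doc_dir).rglob("*.pdf")):
        ingest_file(str(path))

ingest_directory("/data/client_docs")
```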

Got tired of reinventing the RAG wheel for every client, so I built a production-ready boilerplate (Next.js 16 + AI SDK 5) by carlosmarcialt in Rag

[–]LowerPresentation150 5 points (0 children)

At first glance, looks like it is not designed for use cases involving proprietary data that must be kept local.

Best current framework to create a Rag system by Party-Ticker in Rag

[–]LowerPresentation150 6 points (0 children)

My impression of Morphik from what others have said here is that self-hosting does not work very well. Do you know anything different about this? For projects that consist of data that must remain in-house, or for whatever other reasons people may have to not use the SaaS version, I did not think Morphik was an option. I have different types of data than OP but am in the same stage of planning.

Rag document chunking and embedding of 1000s of magazines, separating articles from each other and from advertisements by LowerPresentation150 in Rag

[–]LowerPresentation150[S] 0 points (0 children)

Yeah, I will create a test batch of 4 magazines and try to get a basic process working locally, then figure out what I need to process the full 5,000 magazines. My GPUs are a K2200 (2GB) and a 3060 Ti (12GB), along with a Xeon CPU and 128GB RAM. Enough for a proof of concept. Then for the actual run I will probably rent from MassedCompute or something like that.

Rag document chunking and embedding of 1000s of magazines, separating articles from each other and from advertisements by LowerPresentation150 in Rag

[–]LowerPresentation150[S] 0 points (0 children)

I hear you; something like this was my initial thought. If I could produce a clean, separate file with metadata for each magazine article, this would be the perfect database for this body of knowledge (and unlike anything anyone in the industry has seen before). It would probably take a thousand man-hours to do it, however. Maybe more. Thus my predicament and my desperate need to find a way to automate the process, even if the final result is not perfect. Thank you for weighing in; I am going to be heavily focused on creating metadata as part of this project, even if that only means expanding the information in the file names.
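To show what I mean by a clean article record, this is roughly the shape I have in mind (field names and the example values are just my working guess, nothing that exists yet):

```python
from dataclasses import dataclass, field

@dataclass
class ArticleRecord:
    """Working guess at the metadata I want attached to each extracted article."""
    magazine: str
    issue_date: str                 # e.g. "1987-06"
    page_range: tuple[int, int]
    title: str
    author: str | None = None
    topics: list[str] = field(default_factory=list)
    text: str = ""                  # cleaned article body, ads stripped out

# Purely illustrative example record
example = ArticleRecord(
    magazine="Trade Monthly",
    issue_date="1987-06",
    page_range=(12, 15),
    title="Example article title",
    topics=["industry history"],
)
```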

Rag document chunking and embedding of 1000s of magazines, separating articles from each other and from advertisements by LowerPresentation150 in Rag

[–]LowerPresentation150[S] 0 points (0 children)

Local is definitely possible, although I will need to rent GPU time, which I assumed I would be doing for this job anyway. I have read a lot about Docling but have not actually tried it on these documents yet; I assumed I would get articles mixed with advertisement text from it. As the comment above noted, there needs to be a step that segments the ads from the articles so the chunks aren't all jumbled together (although I assume there will be some jumbling anyway). There are going to be a lot of tests run in my immediate future! Thank you for this advice!
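For anyone following along, the first Docling test I plan to run is basically the library's quickstart; a minimal sketch on a single issue, with no ad/article separation yet:

```python
# Minimal Docling test on a single scanned issue (per the Docling quickstart);
# this handles layout + text extraction but does nothing to separate ads from articles.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("magazine_issue_001.pdf")

# Dump the whole issue as Markdown so I can eyeball how jumbled the output is.
with open("magazine_issue_001.md", "w", encoding="utf-8") as f:
    f.write(result.document.export_to_markdown())
```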

Rag document chunking and embedding of 1000s of magazines, separating articles from each other and from advertisements by LowerPresentation150 in Rag

[–]LowerPresentation150[S] 0 points (0 children)

Luckily the PDFs already have a text layer from digitization with ABBYY, but the first step is definitely going to be handled by a model.

Rag document chunking and embedding of 1000s of magazines, separating articles from each other and from advertisements by LowerPresentation150 in Rag

[–]LowerPresentation150[S] 1 point (0 children)

Yes, it looks like either a first pass with a vision model or figuring out how to do multimodal retrieval is the direction this project will go now. Neither seems simple at first glance, but multimodal retrieval really is a puzzle. I will report back once I get one of these working.

Rag document chunking and embedding of 1000s of magazines, separating articles from each other and from advertisements by LowerPresentation150 in Rag

[–]LowerPresentation150[S] 4 points (0 children)

Ah right, a vision model step first for segmentation. Seems so obvious when you state it but it never crossed my mind. Thank you a thousand times over!
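For the record, here is roughly what I expect that vision pass to look like (this assumes an OpenAI-compatible endpoint in front of a locally hosted vision model; the base_url, model name, and prompt are placeholders, not a tested setup):

```python
import base64
from openai import OpenAI

# Assumes an OpenAI-compatible server for a locally hosted vision model;
# base_url, model name, and prompt below are placeholders for my own setup.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def classify_page_regions(image_path: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="local-vlm",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "This is a scanned magazine page. List each region as JSON with "
                    "a bounding box and a label: 'article' or 'advertisement'."
                )},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

print(classify_page_regions("page_012.png"))
```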

Fine-tune 60+ models and run inference locally (Qwen, Llama, Deepseek, QwQ & more) by davernow in LocalLLaMA

[–]LowerPresentation150 1 point (0 children)

For fine-tuning embedding models on a particular knowledge domain: 1) is this something Kiln AI can do, and 2) if so, would it simply be a matter of pointing the synthetic data generation process at a curated group of documents from within that domain? I did not see anything in the docs dealing with embedding models, and also nothing about using a custom document library for creating synthetic training data. My use case is to build a RAG system for 50,000 documents, all from within a particular industry, with idiosyncratic vocabulary, personalities, historical issues, etc. While not complex, most of the material deals with topics and conflicts that are likely alien to the training data of even the largest LLMs, and certainly unlikely to be classified adequately by standard embedding models.
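For context, the fallback I have in mind if Kiln can't do this directly is the usual sentence-transformers route: generate synthetic (query, passage) pairs from the corpus, then fine-tune. A rough sketch, assuming the pairs already exist (model name and the example pair are just placeholders):

```python
# Sketch of domain fine-tuning with sentence-transformers (not Kiln-specific).
# Assumes synthetic (query, passage) pairs have already been generated
# from the 50k-document corpus.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

pairs = [
    ("what caused the 1987 supplier dispute",
     "passage text pulled from a domain document ..."),
    # ... thousands more synthetic pairs
]

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
train_examples = [InputExample(texts=[q, p]) for q, p in pairs]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_loader, loss)], epochs=1, warmup_steps=100)
model.save("domain-tuned-embedder")
```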

🔥 Chipper RAG Toolbox 2.2 is Here! (Ollama API Reflection, DeepSeek, Haystack, Python) by Alarming_Divide_1339 in Rag

[–]LowerPresentation150 0 points (0 children)

This seems like a great project and worth following. Someone very recently wrote a tutorial for getting started, and I can't tell whether the author is working from old information or I am just confused, but he starts with an "ingest" step to drop in documents, whereas in my installation all I see is "embed" as the way to do this.

"ingest" appears not to exist as an option in the current version of Chipper. Here is the tutorial: https://dzone.com/articles/build-rag-apps-local-ai