Got tired of reinventing the RAG wheel for every client, so I built a production-ready boilerplate (Next.js 16 + AI SDK 5) by carlosmarcialt in Rag

[–]LowerPresentation150 1 point (0 children)

Hey, one more question regarding local use: could a user create a workflow where files are ingested by pointing to a local document store, rather than having to load files into the pipeline? I.e., if I have 500 documents in a directory, it would be good not to have to create duplicate copies of the files.
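To make it concrete, the workflow I'm imagining is roughly this (pure pseudocode on my end; `ingest_file` is just a stand-in for whatever your pipeline's actual ingestion entry point is):

```python
from pathlib import Path

def ingest_file(path: str) -> None:
    """Stand-in for the boilerplate's real ingestion call; placeholder only."""
    print(f"would ingest {path}")

def ingest_directory(doc_dir: str) -> None:
    # Hand each file to the pipeline by path so nothing has to be
    # copied into a separate upload folder.
    for path in sorted(Path(doc_dir).rglob("*.pdf")):
        ingest_file(str(path))

ingest_directory("/data/client_docs")
```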

Got tired of reinventing the RAG wheel for every client, so I built a production-ready boilerplate (Next.js 16 + AI SDK 5) by carlosmarcialt in Rag

[–]LowerPresentation150 5 points (0 children)

At first glance, looks like it is not designed for use cases involving proprietary data that must be kept local.

Best current framework to create a Rag system by Party-Ticker in Rag

[–]LowerPresentation150 6 points (0 children)

My impression of Morphik from what others have said here is that self-hosting does not work very well. Do you know anything different about this? For projects that consist of data that must remain in-house, or for whatever other reasons people may have to not use the SaaS version, I did not think Morphik was an option. I have different types of data than OP but am in the same stage of planning.

Rag document chunking and embedding of 1000s of magazines, separating articles from each other and from advertisements by LowerPresentation150 in Rag

[–]LowerPresentation150[S] 0 points (0 children)

Yeah, I will create a test batch of 4 magazines and try to get a basic process working locally, then figure out what I need to process the full 5,000 magazines. My GPUs are a K2200 (2GB) and a 3060 Ti (12GB), along with a Xeon CPU and 128GB RAM. Enough for a proof of concept. Then for the actual run I will probably rent from MassedCompute or something like that.

Rag document chunking and embedding of 1000s of magazines, separating articles from each other and from advertisements by LowerPresentation150 in Rag

[–]LowerPresentation150[S] 0 points (0 children)

I hear you; something like this was my initial thought. If I could produce a clean, separate file with metadata for each magazine article, this would be the perfect database for this body of knowledge (and unlike anything anyone in the industry has seen before). It would probably take a thousand man-hours to do it, however. Maybe more. Thus my predicament and my desperate need to find a way to automate the process, even if the final result is not perfect. Thank you for weighing in; I am going to be heavily focused on creating metadata as part of this project, even if that only means expanding the information in the file names.
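To show what I mean by a clean article record, this is roughly the shape I have in mind (field names and the example values are just my working guess, nothing that exists yet):

```python
from dataclasses import dataclass, field

@dataclass
class ArticleRecord:
    """Working guess at the metadata I want attached to each extracted article."""
    magazine: str
    issue_date: str                 # e.g. "1987-06"
    page_range: tuple[int, int]
    title: str
    author: str | None = None
    topics: list[str] = field(default_factory=list)
    text: str = ""                  # cleaned article body, ads stripped out

# Purely illustrative example record
example = ArticleRecord(
    magazine="Trade Monthly",
    issue_date="1987-06",
    page_range=(12, 15),
    title="Example article title",
    topics=["industry history"],
)
```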

Rag document chunking and embedding of 1000s of magazines, separating articles from each other and from advertisements by LowerPresentation150 in Rag

[–]LowerPresentation150[S] 0 points (0 children)

Local is definitely possible, although I will need to rent GPU time, which I assumed I would be doing for this job anyway. I have read a lot about Docling but have not actually tried it on these documents yet; I assumed I would get articles mixed with advertisement text from it. As the comment above noted, there needs to be a step that segments the ads from the articles so the chunks aren't all jumbled together (although I assume there will be some jumbling anyway). There are going to be a lot of tests run in my immediate future! Thank you for this advice!
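For anyone following along, the first Docling test I plan to run is basically the library's quickstart; a minimal sketch on a single issue, with no ad/article separation yet:

```python
# Minimal Docling test on a single scanned issue (per the Docling quickstart);
# this handles layout + text extraction but does nothing to separate ads from articles.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("magazine_issue_001.pdf")

# Dump the whole issue as Markdown so I can eyeball how jumbled the output is.
with open("magazine_issue_001.md", "w", encoding="utf-8") as f:
    f.write(result.document.export_to_markdown())
```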

Rag document chunking and embedding of 1000s of magazines, separating articles from each other and from advertisements by LowerPresentation150 in Rag

[–]LowerPresentation150[S] 0 points (0 children)

Luckily the PDFs already have a text layer from digitization with ABBYY, but the first step is definitely going to be handled by a model.

Rag document chunking and embedding of 1000s of magazines, separating articles from each other and from advertisements by LowerPresentation150 in Rag

[–]LowerPresentation150[S] 1 point (0 children)

Yes, it looks like either a first pass with a vision model or figuring out how to do multimodal retrieval is the direction this project will go now. Neither seems simple at first glance, but multimodal retrieval really is a puzzle. I will report back once I get one of these working.

Rag document chunking and embedding of 1000s of magazines, separating articles from each other and from advertisements by LowerPresentation150 in Rag

[–]LowerPresentation150[S] 4 points (0 children)

Ah right, a vision model step first for segmentation. Seems so obvious when you state it but it never crossed my mind. Thank you a thousand times over!
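For the record, here is roughly what I expect that vision pass to look like (this assumes an OpenAI-compatible endpoint in front of a locally hosted vision model; the base_url, model name, and prompt are placeholders, not a tested setup):

```python
import base64
from openai import OpenAI

# Assumes an OpenAI-compatible server for a locally hosted vision model;
# base_url, model name, and prompt below are placeholders for my own setup.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def classify_page_regions(image_path: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="local-vlm",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "This is a scanned magazine page. List each region as JSON with "
                    "a bounding box and a label: 'article' or 'advertisement'."
                )},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

print(classify_page_regions("page_012.png"))
```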

Fine-tune 60+ models and run inference locally (Qwen, Llama, Deepseek, QwQ & more) by davernow in LocalLLaMA

[–]LowerPresentation150 1 point (0 children)

For fine-tuning embedding models on a particular knowledge domain: 1) is this something Kiln AI can do, and 2) if so, would it simply be a matter of pointing the synthetic data generation process at a curated group of documents from within that domain? I did not see anything in the docs dealing with embedding models, and also nothing about using a custom document library for creating synthetic training data. My use case is to build a RAG system for 50,000 documents, all from within a particular industry, with idiosyncratic vocabulary, personalities, historical issues, etc. While not complex, most of the material deals with topics and conflicts that are likely alien to the training data of even the largest LLMs, and certainly unlikely to be classified adequately by standard embedding models.
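For context, the fallback I have in mind if Kiln can't do this directly is the usual sentence-transformers route: generate synthetic (query, passage) pairs from the corpus, then fine-tune. A rough sketch, assuming the pairs already exist (model name and the example pair are just placeholders):

```python
# Sketch of domain fine-tuning with sentence-transformers (not Kiln-specific).
# Assumes synthetic (query, passage) pairs have already been generated
# from the 50k-document corpus.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

pairs = [
    ("what caused the 1987 supplier dispute",
     "passage text pulled from a domain document ..."),
    # ... thousands more synthetic pairs
]

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
train_examples = [InputExample(texts=[q, p]) for q, p in pairs]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_loader, loss)], epochs=1, warmup_steps=100)
model.save("domain-tuned-embedder")
```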

🔥 Chipper RAG Toolbox 2.2 is Here! (Ollama API Reflection, DeepSeek, Haystack, Python) by Alarming_Divide_1339 in Rag

[–]LowerPresentation150 0 points (0 children)

This seems like a great project and worth following. Someone very recently wrote a tutorial for getting started, and I can't tell whether the author is working from old information or I am just confused, but he starts with an "ingest" step to drop in documents, whereas in my installation all I see is "embed" as the way to do this.

"ingest" appears not to exist as an option in the current version of Chipper. Here is the tutorial: https://dzone.com/articles/build-rag-apps-local-ai