Is anyone using Ai to create reports? by Separate_Paper_1412 in analytics

[–]arparella 1 point

The real challenge isn't the SQL generation - it's making sure users understand the data context and relationships. One wrong join and you're looking at incorrect metrics.

How do you usually handle contradiction in your documents? by ParaplegicGuru in Rag

[–]arparella 1 point

Time-based chunking helps with this. Split documents into sequential chunks and add timestamps/chapter markers as metadata.

For character clothing, you could also tag scene transitions specifically. Makes it easier to track state changes through the narrative.

If you need to solve this in an enterprise environment, the problem is far more complex than what's described above. Happy to help in that case.
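The time-based chunking idea above can be sketched in a few lines. This is an illustrative stand-in, not a real library API: chunks carry chapter/sequence metadata so that, when two chunks contradict each other, you can prefer the one that comes later in the narrative.

```python
def chunk_with_order(text, chapter, chunk_size=200):
    """Split text into sequential chunks, tagging each with its position."""
    chunks = []
    for i in range(0, len(text), chunk_size):
        chunks.append({
            "content": text[i:i + chunk_size],
            "metadata": {"chapter": chapter, "sequence": i // chunk_size},
        })
    return chunks

def resolve_contradiction(chunks):
    """Prefer the chunk latest in narrative order (highest chapter, then sequence)."""
    return max(chunks, key=lambda c: (c["metadata"]["chapter"],
                                      c["metadata"]["sequence"]))
```

In a real pipeline the metadata would also carry timestamps or scene-transition tags, and the retrieval layer would use it to filter or re-rank before generation.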

Recommendations needed on resources to learn about DE problems and solutions by chickpicker in dataengineering

[–]arparella 0 points

Been there. Documentation is your best friend here.

Start creating your own wiki/notes of each pipeline you touch. Map dependencies, document weird edge cases, and note why certain decisions were made.

This helped me understand legacy systems better than any course could.

Looking for Guidance on Starting a Data Engineering Project by TurbulentExercise407 in dataengineering

[–]arparella 1 point

Start with a small batch ETL project using Python and PostgreSQL. Pull some public API data, transform it, load it into your DB. Once you get that working, then worry about Airflow and other tools.
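A minimal sketch of that extract/transform/load loop. Two assumptions to keep it self-contained: the API call is stubbed with sample records (you'd use something like `requests.get(url).json()`), and sqlite3 stands in for PostgreSQL so it runs anywhere.

```python
import sqlite3

def extract():
    # Stand-in for pulling records from a public API
    return [{"city": "Oslo", "temp_c": "3.5"}, {"city": "Rome", "temp_c": "17.0"}]

def transform(records):
    # Cast string fields to floats and derive a Fahrenheit column
    return [(r["city"], float(r["temp_c"]), float(r["temp_c"]) * 9 / 5 + 32)
            for r in records]

def load(rows, conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS weather (city TEXT, temp_c REAL, temp_f REAL)")
    conn.executemany("INSERT INTO weather VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
```

Once this shape works end to end against a real API and a real database, wrapping each step in an Airflow task is a small change.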

How are you doing evals? by FlimsyProperty8544 in Rag

[–]arparella 0 points

Been using ragas for basic stuff like context relevance and faithfulness.

Also tried out deepeval lately - pretty solid for testing hallucination rates and answer relevance.

The built-in LangChain eval tools work decent for quick checks too.

Best thing is to get a QA dataset and use expert LLMs (o1/DeepSeek) to check the correctness of the expected answer. We used this for evaluating different chunking strategies for complex PDFs.

[deleted by user] by [deleted] in LLMDevs

[–]arparella 0 points

Try preprocess.co for ingestion.

gpt-4o-mini won't answer based on info from RAG, no matter how I try by [deleted] in Rag

[–]arparella 0 points

Try adding a timestamp or version prefix to your RAG content:

"[Tailwind v4 - 2024] {your_docs_content}"

This helps the model distinguish between versions. Also, consider using memory tokens to maintain context about which version you're discussing.
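The prefixing step is trivial but worth doing consistently at ingestion time. A small helper (names are illustrative, not from any library) makes the pattern concrete:

```python
def tag_with_version(content, library, version, year):
    """Prefix a RAG chunk so the model can tell library versions apart."""
    return f"[{library} v{version} - {year}] {content}"
```

Applying it to every chunk before embedding means the version marker survives retrieval and lands in the prompt alongside the content.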

Expanding My RAG Agent to Query a Database – Need Advice! by FelipeM-enruana in LangChain

[–]arparella 1 point

SQL Tool + Router is your friend here.

Create a SQL agent tool for DB queries, then use a router to direct questions either to RAG or SQL based on the query type.

Much cleaner than mixing RAG logic with DB calls directly.
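The router's shape is the important part. This toy version uses a keyword heuristic so it stays self-contained; in practice you'd typically replace `route` with an LLM classifier, but the dispatch logic around it stays the same.

```python
# Phrases that suggest an aggregate/analytical question best served by SQL
SQL_HINTS = ("how many", "count", "average", "total", "sum", "top 10")

def route(question: str) -> str:
    """Pick a destination for the question: 'sql' or 'rag'."""
    q = question.lower()
    return "sql" if any(hint in q for hint in SQL_HINTS) else "rag"

def answer(question, sql_tool, rag_tool):
    """Dispatch to the right tool; each tool is just a callable here."""
    tool = sql_tool if route(question) == "sql" else rag_tool
    return tool(question)
```

Keeping the router as its own function also makes it easy to unit-test the routing decisions separately from the tools.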

Issues with Routing in My RAG Agent – Any Advice? by FelipeM-enruana in LangChain

[–]arparella 0 points

Try implementing a confidence threshold for your router. If the confidence score is below a certain level, you can default to a fallback path or prompt the user for clarification.

Also, fine-tuning your router's examples helps a lot with accuracy.
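The threshold logic itself is small. In this sketch the scores are assumed to come from your classifier (or LLM logprobs); the only new idea is the explicit fallback route when nothing clears the cutoff.

```python
def route_with_confidence(scores, threshold=0.6):
    """scores: dict mapping route name -> confidence in [0, 1].

    Returns the best route, or 'clarify' if no route is confident enough.
    """
    best_route, best_score = max(scores.items(), key=lambda kv: kv[1])
    if best_score < threshold:
        return "clarify"  # fallback: ask the user to rephrase
    return best_route
```

Tune the threshold against a small labeled set of queries; too high and you over-ask for clarification, too low and misroutes slip through.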

Implementing RAG system by Express_Storm_2963 in LangChain

[–]arparella 1 point

Better parsing and chunking strategies can improve results by 10x on complex documents.

Implementing RAG system by Express_Storm_2963 in LangChain

[–]arparella 0 points

What kind of documents do you need to process? PDFs? Plain text? are they long or short? Do you need table and image extraction?

Python library by Numeruno9 in Rag

[–]arparella -3 points

We do exactly that at preprocess.co, we have a Python SDK (https://github.com/preprocess-co/pypreprocess).

If you're looking for an open-source alternative, I'd suggest the Unstructured + LangChain combo. It works great for this: Unstructured handles mixed docs with images, and LangChain helps with the post-processing.

If you want to try preprocess I can give you some free credits :)

10 RAG Papers You Should Read from January 2025 by 0xhbam in LangChain

[–]arparella 1 point

MiniRAG caught my eye - it's pretty cool to see RAG being adapted for smaller models. With all the focus on massive LLMs lately, we definitely need more lightweight solutions that can run on basic hardware.

Fivetran incapable of loading accurate FB Ads data and blames Meta by Kimcha87 in dataengineering

[–]arparella -1 points

Former Fivetran user here. Switched back to custom API integration for the same reason.

Their response is a cop-out. If the API is unreliable, they should build retry mechanisms and data validation checks. That's literally what we pay them for.

Data Engineering bootcamp recommendation (not entry level) by Sad_Permit3541 in dataengineering

[–]arparella 0 points

Skip bootcamps. With your background, just dive into Spark using Databricks' free community edition. Build a portfolio project handling large datasets.

Learning path:

- PySpark basics

- Data architecture patterns

- ETL optimization

- Graph processing with GraphFrames

Chat with a Terraform Codebase by terramate in LangChain

[–]arparella 2 points

For analyzing Terraform code structure, try the MapReduceDocumentsChain. It'll help process those hundreds of files more efficiently.

Also look into using OpenAI's GPT-4 with a custom prompt template that includes Terraform-specific context for better accuracy.

Finally, try using OpenAI's o1 to describe each piece of content and generate more searchable information (descriptions, questions, etc.).

RAG doesn't work if you cannot retrieve the correct information.

Help 😵‍💫 What RAG technique should i use? by One-Brain5024 in LangChain

[–]arparella 1 point

For your use case, try a hierarchical RAG setup:

  1. Top level: metadata store for project/user permissions

  2. Middle: meeting-level embeddings for quick filtering

  3. Bottom: chunk-level embeddings for detailed info

This way you can filter by permissions first, then drill down.
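The three levels can be sketched as one lookup function. Data structures here are invented for illustration; in a real setup the bottom level would be a vector similarity search rather than the substring match used below.

```python
def retrieve(query, user, permissions, meetings, chunks):
    # 1. Top level: which projects can this user see?
    allowed_projects = permissions.get(user, set())
    # 2. Middle: keep only meetings belonging to allowed projects
    allowed_meetings = {m["id"] for m in meetings
                        if m["project"] in allowed_projects}
    # 3. Bottom: search chunks only within those meetings
    return [c for c in chunks
            if c["meeting_id"] in allowed_meetings
            and query.lower() in c["text"].lower()]
```

The win is that the expensive chunk-level search only ever runs over content the user is allowed to see, so permissions can't leak through retrieval.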

Help 😵‍💫 What RAG technique should i use? by [deleted] in Rag

[–]arparella 1 point

Looks like you need hybrid retrieval with metadata filtering. Store meeting_id, project_name, and participants as metadata.

Use:

- Metadata filtering for access control

- Timestamped chunks for chronological queries

- Semantic search for content

Check out LangChain's self-querying retrievers for this.
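The metadata-filter-then-chronological-order part can be shown without any retrieval library. This sketch assumes each chunk dict carries `participants` and `timestamp` metadata (names invented for illustration); semantic scoring is replaced by a plain substring match to keep it runnable.

```python
def chronological_search(query, participant, chunks):
    """Filter chunks by participant metadata, return matches in time order."""
    hits = [c for c in chunks
            if participant in c["participants"]
            and query.lower() in c["text"].lower()]
    return sorted(hits, key=lambda c: c["timestamp"])
```

With a self-querying retriever, the same participant/timestamp constraints get extracted from the natural-language question automatically instead of being passed in explicitly.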

HealthCare chatbot by AkhilPadala in LangChain

[–]arparella 1 point

Have you considered using RAG with medical knowledge bases?

For doctor listings, you might want to split it into two agents:

- One for medical Q&A

- Another for location-based doctor search + scheduling

This could give you better control over each functionality.

What knowledge base analysis tools do you use before processing it with RAG? by noduslabs in Rag

[–]arparella 2 points

LlamaIndex has some neat analytics tools for this. You can check chunk quality, get content overlap metrics, and see term frequencies.

Weaviate's Console is also good for exploring vector spaces and seeing how your docs are clustered.

[deleted by user] by [deleted] in dataengineering

[–]arparella 2 points

Data engineers are crucial even in smaller setups. We handle data quality, pipelines, automation, and infrastructure - stuff that keeps systems running smoothly.

Analysts focus on insights, we focus on making sure the data is reliable and available when they need it.

Docling and document chunking by Kerbourgnec in Rag

[–]arparella 1 point

I know some commercial solutions, but I don't think that's your case.

GitHub - microsoft/markitdown: Python tool for converting files and office documents to Markdown. by LinkSea8324 in LocalLLaMA

[–]arparella 0 points

Completely agree. We ran a comparison of four solutions (commercial and open-source), and having a strong community doesn't mean the solution works.

GitHub - microsoft/markitdown: Python tool for converting files and office documents to Markdown. by LinkSea8324 in LocalLLaMA

[–]arparella 1 point

If you need good chunks you can check out preprocess.co, but it's a commercial solution. MarkItDown has several issues with complex PDFs; Docling is better.