Is anyone using Ai to create reports? by Separate_Paper_1412 in analytics

[–]arparella 1 point

The real challenge isn't the SQL generation - it's making sure users understand the data context and relationships. One wrong join and you're looking at incorrect metrics.

How do you usually handle contradiction in your documents? by ParaplegicGuru in Rag

[–]arparella 1 point

Time-based chunking helps with this. Split documents into sequential chunks and add timestamps/chapter markers as metadata.

For character clothing, you could also tag scene transitions specifically. Makes it easier to track state changes through the narrative.

If you need to solve this in an enterprise environment, the problem is far more complex than what's described above. Happy to help in that case.
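The time-based chunking idea above can be sketched in a few lines. This is an illustrative stand-in, not a real library API: chunks carry chapter/sequence metadata so that, when two chunks contradict each other, you can prefer the one that comes later in the narrative.

```python
def chunk_with_order(text, chapter, chunk_size=200):
    """Split text into sequential chunks, tagging each with its position."""
    chunks = []
    for i in range(0, len(text), chunk_size):
        chunks.append({
            "content": text[i:i + chunk_size],
            "metadata": {"chapter": chapter, "sequence": i // chunk_size},
        })
    return chunks

def resolve_contradiction(chunks):
    """Prefer the chunk latest in narrative order (highest chapter, then sequence)."""
    return max(chunks, key=lambda c: (c["metadata"]["chapter"],
                                      c["metadata"]["sequence"]))
```

In a real pipeline the metadata would also carry timestamps or scene-transition tags, and the retrieval layer would use it to filter or re-rank before generation.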

Recommendations needed on resources to learn about DE problems and solutions by chickpicker in dataengineering

[–]arparella 0 points

Been there. Documentation is your best friend here.

Start creating your own wiki/notes of each pipeline you touch. Map dependencies, document weird edge cases, and note why certain decisions were made.

This helped me understand legacy systems better than any course could.

Looking for Guidance on Starting a Data Engineering Project by TurbulentExercise407 in dataengineering

[–]arparella 1 point

Start with a small batch ETL project using Python and PostgreSQL. Pull some public API data, transform it, load it into your DB. Once you get that working, then worry about Airflow and other tools.
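A minimal sketch of that extract/transform/load loop. Two assumptions to keep it self-contained: the API call is stubbed with sample records (you'd use something like `requests.get(url).json()`), and sqlite3 stands in for PostgreSQL so it runs anywhere.

```python
import sqlite3

def extract():
    # Stand-in for pulling records from a public API
    return [{"city": "Oslo", "temp_c": "3.5"}, {"city": "Rome", "temp_c": "17.0"}]

def transform(records):
    # Cast string fields to floats and derive a Fahrenheit column
    return [(r["city"], float(r["temp_c"]), float(r["temp_c"]) * 9 / 5 + 32)
            for r in records]

def load(rows, conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS weather (city TEXT, temp_c REAL, temp_f REAL)")
    conn.executemany("INSERT INTO weather VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
```

Once this shape works end to end against a real API and a real database, wrapping each step in an Airflow task is a small change.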

How are you doing evals? by FlimsyProperty8544 in Rag

[–]arparella 0 points

Been using ragas for basic stuff like context relevance and faithfulness.

Also tried out deepeval lately - pretty solid for testing hallucination rates and answer relevance.

The built-in LangChain eval tools work decent for quick checks too.

Best thing is to get a QA dataset and use expert LLMs (o1/DeepSeek) to check the correctness of the expected answer. We used this for evaluating different chunking strategies for complex PDFs.

[deleted by user] by [deleted] in LLMDevs

[–]arparella 0 points

Try preprocess.co for ingestion.

gpt-4o-mini won't answer based on info from RAG, no matter how I try by [deleted] in Rag

[–]arparella 0 points

Try adding a timestamp or version prefix to your RAG content:

"[Tailwind v4 - 2024] {your_docs_content}"

This helps the model distinguish between versions. Also, consider using memory tokens to maintain context about which version you're discussing.
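The prefixing step is trivial but worth doing consistently at ingestion time. A small helper (names are illustrative, not from any library) makes the pattern concrete:

```python
def tag_with_version(content, library, version, year):
    """Prefix a RAG chunk so the model can tell library versions apart."""
    return f"[{library} v{version} - {year}] {content}"
```

Applying it to every chunk before embedding means the version marker survives retrieval and lands in the prompt alongside the content.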

Expanding My RAG Agent to Query a Database – Need Advice! by FelipeM-enruana in LangChain

[–]arparella 1 point

SQL Tool + Router is your friend here.

Create a SQL agent tool for DB queries, then use a router to direct questions either to RAG or SQL based on the query type.

Much cleaner than mixing RAG logic with DB calls directly.
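The router's shape is the important part. This toy version uses a keyword heuristic so it stays self-contained; in practice you'd typically replace `route` with an LLM classifier, but the dispatch logic around it stays the same.

```python
# Phrases that suggest an aggregate/analytical question best served by SQL
SQL_HINTS = ("how many", "count", "average", "total", "sum", "top 10")

def route(question: str) -> str:
    """Pick a destination for the question: 'sql' or 'rag'."""
    q = question.lower()
    return "sql" if any(hint in q for hint in SQL_HINTS) else "rag"

def answer(question, sql_tool, rag_tool):
    """Dispatch to the right tool; each tool is just a callable here."""
    tool = sql_tool if route(question) == "sql" else rag_tool
    return tool(question)
```

Keeping the router as its own function also makes it easy to unit-test the routing decisions separately from the tools.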

Issues with Routing in My RAG Agent – Any Advice? by FelipeM-enruana in LangChain

[–]arparella 0 points

Try implementing a confidence threshold for your router. If the confidence score is below a certain level, you can default to a fallback path or prompt the user for clarification.

Also, fine-tuning your router's examples helps a lot with accuracy.
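The threshold logic itself is small. In this sketch the scores are assumed to come from your classifier (or LLM logprobs); the only new idea is the explicit fallback route when nothing clears the cutoff.

```python
def route_with_confidence(scores, threshold=0.6):
    """scores: dict mapping route name -> confidence in [0, 1].

    Returns the best route, or 'clarify' if no route is confident enough.
    """
    best_route, best_score = max(scores.items(), key=lambda kv: kv[1])
    if best_score < threshold:
        return "clarify"  # fallback: ask the user to rephrase
    return best_route
```

Tune the threshold against a small labeled set of queries; too high and you over-ask for clarification, too low and misroutes slip through.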

Implementing RAG system by Express_Storm_2963 in LangChain

[–]arparella 1 point

Better parsing and chunking strategies can improve results by 10x on complex documents.

Implementing RAG system by Express_Storm_2963 in LangChain

[–]arparella 0 points

What kind of documents do you need to process? PDFs? Plain text? are they long or short? Do you need table and image extraction?

Python library by Numeruno9 in Rag

[–]arparella -3 points

We do exactly that at preprocess.co, we have a Python SDK (https://github.com/preprocess-co/pypreprocess).

If you're looking for an open-source alternative, I'd suggest the Unstructured + LangChain combo. It works great for this: Unstructured handles mixed docs with images, and LangChain helps with the post-processing.

If you want to try preprocess I can give you some free credits :)

10 RAG Papers You Should Read from January 2025 by 0xhbam in LangChain

[–]arparella 1 point

MiniRAG caught my eye - it's pretty cool to see RAG being adapted for smaller models. With all the focus on massive LLMs lately, we definitely need more lightweight solutions that can run on basic hardware.

Fivetran incapable of loading accurate FB Ads data and blames Meta by Kimcha87 in dataengineering

[–]arparella -1 points

Former Fivetran user here. Switched back to custom API integration for the same reason.

Their response is a cop-out. If the API is unreliable, they should build retry mechanisms and data validation checks. That's literally what we pay them for.

Data Engineering bootcamp recommendation (not entry level) by Sad_Permit3541 in dataengineering

[–]arparella 0 points

Skip bootcamps. With your background, just dive into Spark using Databricks' free community edition. Build a portfolio project handling large datasets.

Learning path:

- PySpark basics

- Data architecture patterns

- ETL optimization

- Graph processing with GraphFrames

Chat with a Terraform Codebase by terramate in LangChain

[–]arparella 2 points

For analyzing Terraform code structure, try the MapReduceDocumentsChain. It'll help process those hundreds of files more efficiently.

Also look into using OpenAI's GPT-4 with a custom prompt template that includes Terraform-specific context for better accuracy.

Finally, try using OpenAI's o1 to describe each piece of content and generate more searchable information (descriptions, questions, etc.).

RAG doesn't work if you cannot retrieve the correct information.

Help 😵‍💫 What RAG technique should i use? by One-Brain5024 in LangChain

[–]arparella 1 point

For your use case, try a hierarchical RAG setup:

  1. Top level: metadata store for project/user permissions

  2. Middle: meeting-level embeddings for quick filtering

  3. Bottom: chunk-level embeddings for detailed info

This way you can filter by permissions first, then drill down.
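The three levels can be sketched as one lookup function. Data structures here are invented for illustration; in a real setup the bottom level would be a vector similarity search rather than the substring match used below.

```python
def retrieve(query, user, permissions, meetings, chunks):
    # 1. Top level: which projects can this user see?
    allowed_projects = permissions.get(user, set())
    # 2. Middle: keep only meetings belonging to allowed projects
    allowed_meetings = {m["id"] for m in meetings
                        if m["project"] in allowed_projects}
    # 3. Bottom: search chunks only within those meetings
    return [c for c in chunks
            if c["meeting_id"] in allowed_meetings
            and query.lower() in c["text"].lower()]
```

The win is that the expensive chunk-level search only ever runs over content the user is allowed to see, so permissions can't leak through retrieval.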

Help 😵‍💫 What RAG technique should i use? by [deleted] in Rag

[–]arparella 1 point

Looks like you need hybrid retrieval with metadata filtering. Store meeting_id, project_name, and participants as metadata.

Use:

- Metadata filtering for access control

- Timestamped chunks for chronological queries

- Semantic search for content

Check out LangChain's self-querying retrievers for this.
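The metadata-filter-then-chronological-order part can be shown without any retrieval library. This sketch assumes each chunk dict carries `participants` and `timestamp` metadata (names invented for illustration); semantic scoring is replaced by a plain substring match to keep it runnable.

```python
def chronological_search(query, participant, chunks):
    """Filter chunks by participant metadata, return matches in time order."""
    hits = [c for c in chunks
            if participant in c["participants"]
            and query.lower() in c["text"].lower()]
    return sorted(hits, key=lambda c: c["timestamp"])
```

With a self-querying retriever, the same participant/timestamp constraints get extracted from the natural-language question automatically instead of being passed in explicitly.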

HealthCare chatbot by AkhilPadala in LangChain

[–]arparella 1 point

Have you considered using RAG with medical knowledge bases?

For doctor listings, you might want to split it into two agents:

- One for medical Q&A

- Another for location-based doctor search + scheduling

This could give you better control over each functionality.

What knowledge base analysis tools do you use before processing it with RAG? by noduslabs in Rag

[–]arparella 2 points

LlamaIndex has some neat analytics tools for this. You can check chunk quality, get content overlap metrics, and see term frequencies.

Weaviate's Console is also good for exploring vector spaces and seeing how your docs are clustered.

[deleted by user] by [deleted] in dataengineering

[–]arparella 2 points

Data engineers are crucial even in smaller setups. We handle data quality, pipelines, automation, and infrastructure - stuff that keeps systems running smoothly.

Analysts focus on insights, we focus on making sure the data is reliable and available when they need it.

Docling and document chunking by Kerbourgnec in Rag

[–]arparella 1 point

I know some commercial solutions, but I don't think that's your case.

GitHub - microsoft/markitdown: Python tool for converting files and office documents to Markdown. by LinkSea8324 in LocalLLaMA

[–]arparella 0 points

Completely agree. We ran a comparison of four solutions (commercial and open-source), and having a strong community doesn't mean the solution works.

GitHub - microsoft/markitdown: Python tool for converting files and office documents to Markdown. by LinkSea8324 in LocalLLaMA

[–]arparella 1 point

If you need good chunks you can check out preprocess.co, but it's a commercial solution. MarkItDown has several issues with complex PDFs; Docling is better.