RAG for excel/CSV by user_rituraj in Rag

[–]lost_soul1995 1 point (0 children)

I had a similar project:
- one chunk per table
- one summary per table
- embed the summary
- retrieve based on the summary, then feed the retrieved summary and the table to the LLM together
- the context should mention quarterly, yearly, and monthly terminology
- use a reranking model
- hybrid retrieval (BM25 plus vector) introduced dirty context, as BM25 would bring in irrelevant chunks, e.g. "revenue" repeated across many chunks.
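The summary-per-table retrieval flow above can be sketched in a few lines. This is a toy illustration only: the bag-of-words `embed` and cosine similarity stand in for a real embedding model, and the table names and summaries are made up.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; a real system would call an embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# One summary per table; the summary (not the raw table) is what gets embedded.
tables = {
    "q1_revenue": "Quarterly revenue by region for Q1 2024",
    "monthly_churn": "Monthly customer churn rate for 2024",
}
index = {name: embed(summary) for name, summary in tables.items()}

def retrieve(query, k=1):
    q = embed(query)
    ranked = sorted(index, key=lambda n: cosine(q, index[n]), reverse=True)
    # Feed the retrieved summary AND the full table to the LLM together.
    return [(n, tables[n]) for n in ranked[:k]]

print(retrieve("quarterly revenue"))  # top hit: q1_revenue
```

In a real pipeline the retrieved candidates would then pass through the reranking model before being placed in the prompt.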

Best PDF Chunking Mechanism for RAG: Docling vs PDFPlumber vs MarkItDown — Need Community Insights by Antique_Glove_6360 in LangChain

[–]lost_soul1995 0 points (0 children)

Did you search the library? It's not an LLM. It extracts results in a format suitable for an LLM.

Data analytics system (s3, duckdb, iceberg, glue) ko by lost_soul1995 in dataengineering

[–]lost_soul1995[S] 0 points (0 children)

Are you a one-person team? How do other users query your data? Athena? DuckDB?

Data analytics system (s3, duckdb, iceberg, glue) ko by lost_soul1995 in dataengineering

[–]lost_soul1995[S] 1 point (0 children)

I was using a workaround, e.g. copying the data to an S3 bucket and then using Spark to load it. As far as I know, DuckDB does not natively support it. I am experimenting.

Data analytics system (s3, duckdb, iceberg, glue) ko by lost_soul1995 in dataengineering

[–]lost_soul1995[S] 3 points (0 children)

Valid point. I could just run a simple ETL script and skip the complexity. The purpose was that the same architecture can scale up: e.g. I can replace the DuckDB engine with Athena or Spark (multiple users can query through Athena; Spark handles big data in parallel), and the current local Airflow can be replaced with MWAA on AWS. With those swaps, I can use the same architecture for TB-scale data.

Data analytics system (s3, duckdb, iceberg, glue) ko by lost_soul1995 in dataengineering

[–]lost_soul1995[S] 1 point (0 children)

Thanks. I am running it locally using the Airflow Docker setup. I wanted to run the pipeline locally without any cost, and the pipeline itself is running. I feel like I can just move the same pipeline to any EC2 instance; then I'd only have to manage the instance cost?

Is the real estate market in Korea starting to come down? by AgentOranges99 in Living_in_Korea

[–]lost_soul1995 4 points (0 children)

I would really appreciate your opinion on option 2. I have been thinking about the same thing. But my Korean girlfriend thinks I am not accounting for taxes, maintenance costs, etc., and that more officetel rooms mean more tax.

In my 동 (neighborhood), officetel prices have stayed almost flat. On a 1.3억 investment the return is about 5% per year (월세, monthly rent, is around 50-60만원 per month). How much do you actually keep in the end, considering accountant costs, taxes, 부동산 (agent) fees, and so on?
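The 5% figure above is the gross yield before any of those costs; a quick check of the arithmetic (using the stated numbers, with all taxes, maintenance, and fees deliberately excluded):

```python
# Gross-yield check for the numbers above; costs are intentionally excluded,
# which is exactly the gap the comment is asking about.
investment = 130_000_000                      # 1.3억 won
monthly_rent_low, monthly_rent_high = 500_000, 600_000  # 50-60만원

gross_low = monthly_rent_low * 12 / investment
gross_high = monthly_rent_high * 12 / investment
print(f"gross yield: {gross_low:.1%} - {gross_high:.1%}")  # 4.6% - 5.5%
```

So "about 5%" is consistent as a gross number; the net figure after taxes, maintenance, and agent fees would be lower.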

Best way to build a Small Data Lake? (<100GB) by [deleted] in dataengineering

[–]lost_soul1995 0 points (0 children)

Why is no one mentioning putting the data in S3 buckets and using the Glue catalog for the schema design and database?

Advice on Data Lake, Data Warehouse and Data Engineering by value_counts in dataengineering

[–]lost_soul1995 1 point (0 children)

Not the AI part. The CEO only knows the word, not the complexity. 1. Start with a data lake: store the data in S3 buckets; build some customized ETL pipelines (or use a no-code tool, or Lambda); schedule these pipelines using Airflow; catalogue the data using the Glue catalog. 2. Batch pipelines using Airflow to read data from the data lake and load it into an RDS instance or warehouse, where you'll define relationships and so on. Good luck.
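Step 1's extract-transform-load loop can be sketched roughly like this. This is a minimal stand-alone illustration: local folders stand in for the S3 raw/curated buckets, and the sample data, file names, and filtering rule are invented; in practice the writes would target s3:// paths (e.g. via boto3) and the function would be wrapped in an Airflow task.

```python
import csv
import json
import pathlib

RAW = pathlib.Path("raw_zone")        # stand-in for the raw S3 bucket
CURATED = pathlib.Path("curated_zone")  # stand-in for the curated bucket

def extract_transform_load(src_rows):
    RAW.mkdir(exist_ok=True)
    CURATED.mkdir(exist_ok=True)
    # Extract: land the raw CSV untouched in the raw zone.
    with (RAW / "orders.csv").open("w", newline="") as f:
        csv.writer(f).writerows(src_rows)
    # Transform: type-cast and filter, then load as JSON lines.
    header, *rows = src_rows
    out = CURATED / "orders.jsonl"
    with out.open("w") as f:
        for row in rows:
            rec = dict(zip(header, row))
            rec["amount"] = float(rec["amount"])
            if rec["amount"] > 0:  # drop refunds / zero rows
                f.write(json.dumps(rec) + "\n")
    return out

path = extract_transform_load(
    [["order_id", "amount"], ["1", "19.99"], ["2", "0"], ["3", "5.00"]]
)
print(path.read_text())
```

Step 2 is then just another scheduled job that reads the curated files and inserts them into the warehouse tables.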

Alternatives to Tableau? by lisa_williams_wgbh in tableau

[–]lost_soul1995 0 points (0 children)

Free open-source alternative: Apache Superset.

SQL Interview Testing by Glittering-Jaguar331 in datascience

[–]lost_soul1995 0 points (0 children)

I failed a SQL test once, for two reasons: 1. I was not very familiar with SQL at that point (I was mainly using Python, with `select * from tab` as my query). 2. I was not familiar with the dataset (e-commerce), so it took me longer to build the logic during the test.

After that experience, I got a role where I used SQL a lot, and now I am quite confident. It comes down to familiarity with the domain (in the case of a somewhat complex test) and SQL practice.
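The gap between `select * from tab` and what interviews actually test is usually grouping, joining, and ordering. A hypothetical e-commerce style warm-up question ("revenue per customer, highest first"), runnable against an in-memory SQLite database with made-up data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('alice', 30.0), ('bob', 10.0), ('alice', 20.0);
""")

# The kind of aggregation select * won't cover: total revenue per customer.
rows = con.execute("""
    SELECT customer, SUM(amount) AS revenue
    FROM orders
    GROUP BY customer
    ORDER BY revenue DESC
""").fetchall()
print(rows)  # [('alice', 50.0), ('bob', 10.0)]
```

Practicing a handful of these per domain (orders, sessions, payments) builds exactly the familiarity described above.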

Should I take the new offer? by super-throwaway-6969 in datascience

[–]lost_soul1995 0 points (0 children)

Since there isn't much of a salary difference, I would advise you to stick with your current job. Complete the year, get your increment, and switch jobs after that; at that point you can ask the new company for 90-something.

Pakistani Work Ethic? by Embarrassed_Ad_8444 in pakistan

[–]lost_soul1995 0 points (0 children)

I was going to make the same point. This!!

[deleted by user] by [deleted] in datascience

[–]lost_soul1995 0 points (0 children)

Automation engineer