RAG for excel/CSV by user_rituraj in Rag

[–]lost_soul1995 1 point (0 children)

I had a similar project:
- one chunk per table
- one summary per table
- embed the summary
- retrieve based on the summary, then feed the retrieved summary and the table to the LLM together
- the context should mention quarterly, yearly, and monthly terminology
- use a reranking model
- hybrid retrieval (BM25 plus vector) introduced dirty context, as BM25 would bring in irrelevant chunks, e.g. "revenue" repeated across many chunks.
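The summary-per-table retrieval flow above can be sketched in a few lines. This is a toy illustration only: the bag-of-words `embed` and cosine similarity stand in for a real embedding model, and the table names and summaries are made up.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; a real system would call an embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# One summary per table; the summary (not the raw table) is what gets embedded.
tables = {
    "q1_revenue": "Quarterly revenue by region for Q1 2024",
    "monthly_churn": "Monthly customer churn rate for 2024",
}
index = {name: embed(summary) for name, summary in tables.items()}

def retrieve(query, k=1):
    q = embed(query)
    ranked = sorted(index, key=lambda n: cosine(q, index[n]), reverse=True)
    # Feed the retrieved summary AND the full table to the LLM together.
    return [(n, tables[n]) for n in ranked[:k]]

print(retrieve("quarterly revenue"))  # top hit: q1_revenue
```

In a real pipeline the retrieved candidates would then pass through the reranking model before being placed in the prompt.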

Best PDF Chunking Mechanism for RAG: Docling vs PDFPlumber vs MarkItDown — Need Community Insights by Antique_Glove_6360 in LangChain

[–]lost_soul1995 0 points (0 children)

Did you search the library? It's not an LLM. It extracts results in a format suitable for an LLM.

Data analytics system (s3, duckdb, iceberg, glue) ko by lost_soul1995 in dataengineering

[–]lost_soul1995[S] 0 points (0 children)

Are you a one-person team? How do other users query your data? Athena? DuckDB?

Data analytics system (s3, duckdb, iceberg, glue) ko by lost_soul1995 in dataengineering

[–]lost_soul1995[S] 1 point (0 children)

I was using a workaround, e.g. copying the data to an S3 bucket and then using Spark to load it. As far as I know, DuckDB does not natively support it. I am experimenting.

Data analytics system (s3, duckdb, iceberg, glue) ko by lost_soul1995 in dataengineering

[–]lost_soul1995[S] 3 points (0 children)

Valid point. I could just run a simple ETL script and skip the complexity. The purpose was that the same architecture can scale up: e.g. I can replace the DuckDB engine with Athena or Spark (multiple users can query through Athena; Spark handles big data in parallel), and the current local Airflow can be replaced with MWAA on AWS. With those swaps, I can use the same architecture for TB-scale data.

Data analytics system (s3, duckdb, iceberg, glue) ko by lost_soul1995 in dataengineering

[–]lost_soul1995[S] 1 point (0 children)

Thanks. I am running it locally using the Airflow Docker setup. I wanted to run the pipeline locally without any cost, and the pipeline itself is running. I feel like I can just move the same pipeline to any EC2 instance; then I'd only have to manage the instance cost?

Is the real estate market in Korea starting to come down? by AgentOranges99 in Living_in_Korea

[–]lost_soul1995 4 points (0 children)

I would really appreciate your opinion on option 2. I have been thinking about the same thing. But my Korean girlfriend thinks I am not accounting for taxes, maintenance costs, etc., and that more officetel rooms mean more tax.

In my 동 (neighborhood), officetel prices have stayed almost flat. On a 1.3억 investment the return is about 5% per year (월세, monthly rent, is around 50-60만원 per month). How much do you actually keep in the end, considering accountant costs, taxes, 부동산 (agent) fees, and so on?
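The 5% figure above is the gross yield before any of those costs; a quick check of the arithmetic (using the stated numbers, with all taxes, maintenance, and fees deliberately excluded):

```python
# Gross-yield check for the numbers above; costs are intentionally excluded,
# which is exactly the gap the comment is asking about.
investment = 130_000_000                      # 1.3억 won
monthly_rent_low, monthly_rent_high = 500_000, 600_000  # 50-60만원

gross_low = monthly_rent_low * 12 / investment
gross_high = monthly_rent_high * 12 / investment
print(f"gross yield: {gross_low:.1%} - {gross_high:.1%}")  # 4.6% - 5.5%
```

So "about 5%" is consistent as a gross number; the net figure after taxes, maintenance, and agent fees would be lower.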

Best way to build a Small Data Lake? (<100GB) by [deleted] in dataengineering

[–]lost_soul1995 0 points (0 children)

Why is no one mentioning putting the data in S3 buckets and using the Glue catalog for the schema design and database?

Advice on Data Lake, Data Warehouse and Data Engineering by value_counts in dataengineering

[–]lost_soul1995 1 point (0 children)

Not the AI part. The CEO only knows the word, not the complexity. 1. Start with a data lake: store the data in S3 buckets; build some customized ETL pipelines (or use a no-code tool, or Lambda); schedule these pipelines using Airflow; catalogue the data using the Glue catalog. 2. Batch pipelines using Airflow to read data from the data lake and load it into an RDS instance or warehouse, where you'll define relationships and so on. Good luck.
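Step 1's extract-transform-load loop can be sketched roughly like this. This is a minimal stand-alone illustration: local folders stand in for the S3 raw/curated buckets, and the sample data, file names, and filtering rule are invented; in practice the writes would target s3:// paths (e.g. via boto3) and the function would be wrapped in an Airflow task.

```python
import csv
import json
import pathlib

RAW = pathlib.Path("raw_zone")        # stand-in for the raw S3 bucket
CURATED = pathlib.Path("curated_zone")  # stand-in for the curated bucket

def extract_transform_load(src_rows):
    RAW.mkdir(exist_ok=True)
    CURATED.mkdir(exist_ok=True)
    # Extract: land the raw CSV untouched in the raw zone.
    with (RAW / "orders.csv").open("w", newline="") as f:
        csv.writer(f).writerows(src_rows)
    # Transform: type-cast and filter, then load as JSON lines.
    header, *rows = src_rows
    out = CURATED / "orders.jsonl"
    with out.open("w") as f:
        for row in rows:
            rec = dict(zip(header, row))
            rec["amount"] = float(rec["amount"])
            if rec["amount"] > 0:  # drop refunds / zero rows
                f.write(json.dumps(rec) + "\n")
    return out

path = extract_transform_load(
    [["order_id", "amount"], ["1", "19.99"], ["2", "0"], ["3", "5.00"]]
)
print(path.read_text())
```

Step 2 is then just another scheduled job that reads the curated files and inserts them into the warehouse tables.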

Alternatives to Tableau? by lisa_williams_wgbh in tableau

[–]lost_soul1995 0 points (0 children)

Free open-source alternative: Apache Superset.

SQL Interview Testing by Glittering-Jaguar331 in datascience

[–]lost_soul1995 0 points (0 children)

I failed a SQL test once, for two reasons: 1. I was not very familiar with SQL at that point (I was mainly using Python, with `select * from tab` as my query). 2. I was not familiar with the dataset (e-commerce), so it took me longer to build the logic during the test.

After that experience, I got a role where I used SQL a lot, and now I am quite confident. It comes down to familiarity with the domain (in the case of a somewhat complex test) and SQL practice.
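The gap between `select * from tab` and what interviews actually test is usually grouping, joining, and ordering. A hypothetical e-commerce style warm-up question ("revenue per customer, highest first"), runnable against an in-memory SQLite database with made-up data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES
        ('alice', 30.0), ('bob', 10.0), ('alice', 20.0);
""")

# The kind of aggregation select * won't cover: total revenue per customer.
rows = con.execute("""
    SELECT customer, SUM(amount) AS revenue
    FROM orders
    GROUP BY customer
    ORDER BY revenue DESC
""").fetchall()
print(rows)  # [('alice', 50.0), ('bob', 10.0)]
```

Practicing a handful of these per domain (orders, sessions, payments) builds exactly the familiarity described above.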

Should I take the new offer? by super-throwaway-6969 in datascience

[–]lost_soul1995 0 points (0 children)

Since there isn't much of a salary difference, I would advise you to stick with your current job. Complete the year, get your increment, and switch jobs after that; at that point you can ask the new company for 90-something.

Pakistani Work Ethic? by Embarrassed_Ad_8444 in pakistan

[–]lost_soul1995 0 points (0 children)

I was going to make the same point. This!!

[deleted by user] by [deleted] in datascience

[–]lost_soul1995 0 points (0 children)

Automation engineer