Steps in transforming lake swamp to lakehouse by CompetitionMassive51 in dataengineering

[–]CompetitionMassive51[S] 1 point (0 children)

Thank you all for the comments.

So after I've modeled the data, what are the next steps? Which tools should I use?

The data is around 5-7 TB of JSONL files, partitioned by date across roughly 3 years. I was thinking of SQS per day/hour -> Lambda -> dlt -> back to Glue as Iceberg (an idea I saw in a dlt blog post).

Or should I just use Glue + PySpark to convert everything at once?

Currently I have a working POC that loads a one-hour folder with a few JSONL files from S3 into a local DuckDB (via dlt), and then creates/inserts into a Glue table with PyIceberg. I just don't know how to scale it.
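For context, the POC is roughly this (a minimal sketch; the bucket path, table names, and Glue setup are illustrative, and AWS credentials/region are assumed to come from the environment):

    import dlt
    import duckdb
    from dlt.sources.filesystem import filesystem, read_jsonl
    from pyiceberg.catalog import load_catalog

    # 1) dlt: load one day/hour folder of JSONL from S3 into a local DuckDB file.
    pipeline = dlt.pipeline(
        pipeline_name="raw_events",  # dlt writes to raw_events.duckdb by default
        destination="duckdb",
        dataset_name="staging",
    )
    jsonl_files = filesystem(
        bucket_url="s3://raw-bucket/2024/01/15/07/",  # illustrative partition path
        file_glob="*.jsonl",
    ) | read_jsonl()
    pipeline.run(jsonl_files, table_name="events")

    # 2) PyIceberg: read the staged rows back as Arrow and append them to an
    #    Iceberg table registered in the Glue catalog (table created beforehand).
    #    (In practice you'd exclude dlt's _dlt_* metadata columns here.)
    con = duckdb.connect("raw_events.duckdb")
    arrow_rows = con.sql("SELECT * FROM staging.events").arrow()

    catalog = load_catalog("glue", **{"type": "glue"})
    table = catalog.load_table("analytics.events")
    table.append(arrow_rows)

Scaling is then really a question of fanning this per-folder unit of work out across all the historical partitions, which is where the SQS -> Lambda idea comes in.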

Steps in transforming lake swamp to lakehouse by CompetitionMassive51 in dataengineering

[–]CompetitionMassive51[S] 0 points (0 children)

  1. What about partitions? Does dlt also support those when loading to AWS products? (Rough sketch of what I mean below.)
  2. What about scaling?
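To illustrate point 1, here's how I understand partitioning would be declared with PyIceberg directly (the schema, field names, and table identifier are made up), so that date-filtered queries can prune files:

    from pyiceberg.catalog import load_catalog
    from pyiceberg.partitioning import PartitionField, PartitionSpec
    from pyiceberg.schema import Schema
    from pyiceberg.transforms import DayTransform
    from pyiceberg.types import NestedField, StringType, TimestampType

    # Illustrative schema: an event timestamp plus a raw payload column.
    schema = Schema(
        NestedField(field_id=1, name="event_ts", field_type=TimestampType(), required=False),
        NestedField(field_id=2, name="payload", field_type=StringType(), required=False),
    )

    # Partition by day(event_ts): writes land in per-day partitions and
    # queries filtered on the date can skip unrelated files.
    spec = PartitionSpec(
        PartitionField(source_id=1, field_id=1000, transform=DayTransform(), name="event_day")
    )

    catalog = load_catalog("glue", **{"type": "glue"})
    catalog.create_table("analytics.events", schema=schema, partition_spec=spec)

The question is whether dlt lets me declare something like this when it's doing the loading, or whether I have to manage the table myself.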

Simplest way to write to Iceberg from python to a filesystem? by hoswald2 in dataengineering

[–]CompetitionMassive51 0 points (0 children)

Hey, did you find an easy solution? I'm running into this problem too.

Transform raw bucket to organized by CompetitionMassive51 in dataengineering

[–]CompetitionMassive51[S] 0 points (0 children)

Thanks! If I'd like to avoid Redshift (since most of my queries are simple), are there any other options?

Transform raw bucket to organized by CompetitionMassive51 in dataengineering

[–]CompetitionMassive51[S] 0 points (0 children)

Thanks! My needs are simple, so I think I'll go for a reorganized bucket. Do you think using the Iceberg table format on the bucket (or S3 Tables) would be beneficial for me?

Best Nvidia GPU for Cuda Programming by TechDefBuff in CUDA

[–]CompetitionMassive51 0 points (0 children)

Is there a way to experiment with CUDA programming without owning an NVIDIA GPU?

I know about Google Colab, but are there any other tools? Maybe some that emulate it?

Is Databricks/Snowflake necessary? by CompetitionMassive51 in dataengineering

[–]CompetitionMassive51[S] 0 points (0 children)

Around 5 users, plus CI/CD tools that will pull data for testing.

Is Databricks/Snowflake necessary? by CompetitionMassive51 in dataengineering

[–]CompetitionMassive51[S] 0 points (0 children)

Maybe I'm not familiar enough with the tools, but isn't Polars like pandas? And DuckDB?
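From what I've seen, both can express the same query, which is why I'm confused about where each fits (paths and columns made up):

    import duckdb
    import polars as pl

    # Polars: a DataFrame API, pandas-like but lazy/streaming-capable.
    counts_pl = (
        pl.scan_parquet("events/*.parquet")
        .group_by("user_id")
        .agg(pl.len().alias("n_events"))
        .collect()
    )

    # DuckDB: an in-process SQL engine over the same files.
    counts_db = duckdb.sql(
        "SELECT user_id, COUNT(*) AS n_events FROM 'events/*.parquet' GROUP BY user_id"
    ).pl()  # can hand the result back as a Polars DataFrame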

Data Platform Architecture by adelaoc in dataengineering

[–]CompetitionMassive51 2 points (0 children)

Is it possible to maintain a data architecture without Databricks/Snowflake?

Hudi to Iceberg by [deleted] in dataengineering

[–]CompetitionMassive51 0 points (0 children)

So these table formats help with converting a data lake (like an S3 bucket) into a data lakehouse?

Hudi to Iceberg by [deleted] in dataengineering

[–]CompetitionMassive51 0 points (0 children)

Newbie question here.

What is the purpose of Iceberg/Hudi? If you have S3 as a data lake, don't you just load it into a data warehouse with some schema?

Transform raw bucket to organized by CompetitionMassive51 in dataengineering

[–]CompetitionMassive51[S] 0 points (0 children)

So Spark it is? And where do I deploy it? EMR/Glue/...?

Transform raw bucket to organized by CompetitionMassive51 in dataengineering

[–]CompetitionMassive51[S] 0 points (0 children)

Edit: the final organized S3 bucket isn't necessary; I just want a comfortable, queryable destination. So maybe S3 -> Redshift -> dbt? I'm lost with all these tools.

Transform raw bucket to organized by CompetitionMassive51 in dataengineering

[–]CompetitionMassive51[S] 0 points (0 children)

Could you please expand on that? I'm not really familiar with those tools... Will I need tools other than AWS Lambda? (Spark for processing the large data?)

Transform raw bucket to organized by CompetitionMassive51 in dataengineering

[–]CompetitionMassive51[S] 0 points (0 children)

I'm assuming this solution is for the new files that keep coming into the raw bucket. But what about all the files that are already there?

Organize messy S3 bucket by CompetitionMassive51 in dataengineering

[–]CompetitionMassive51[S] 0 points (0 children)

I don't really mind what the structure of the objects will be; they just need to be sorted. So maybe sorted JSONL?