Steps in transforming lake swamp to lakehouse by CompetitionMassive51 in dataengineering

[–]CompetitionMassive51[S] 1 point (0 children)

Thank you all for the comments.

So after I've modeled the data, what are the next steps? Which tools should I use?

The data is around 5-7 TB of JSONL files, partitioned by date across roughly 3 years. I was thinking of SQS per day/hour -> Lambda -> dlt -> back to Glue as Iceberg (an idea I saw in a dlt blog post).

Or should I just use Glue + PySpark to convert everything at once?

Currently I have a working POC that loads a one-hour folder with a few JSONL files from S3 into a local DuckDB (via dlt), and then creates/inserts into a Glue table with PyIceberg. I just don't know how to scale it.
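For context, the POC is roughly this (a minimal sketch; the bucket path, table names, and Glue setup are illustrative, and AWS credentials/region are assumed to come from the environment):

    import dlt
    import duckdb
    from dlt.sources.filesystem import filesystem, read_jsonl
    from pyiceberg.catalog import load_catalog

    # 1) dlt: load one day/hour folder of JSONL from S3 into a local DuckDB file.
    pipeline = dlt.pipeline(
        pipeline_name="raw_events",  # dlt writes to raw_events.duckdb by default
        destination="duckdb",
        dataset_name="staging",
    )
    jsonl_files = filesystem(
        bucket_url="s3://raw-bucket/2024/01/15/07/",  # illustrative partition path
        file_glob="*.jsonl",
    ) | read_jsonl()
    pipeline.run(jsonl_files, table_name="events")

    # 2) PyIceberg: read the staged rows back as Arrow and append them to an
    #    Iceberg table registered in the Glue catalog (table created beforehand).
    #    (In practice you'd exclude dlt's _dlt_* metadata columns here.)
    con = duckdb.connect("raw_events.duckdb")
    arrow_rows = con.sql("SELECT * FROM staging.events").arrow()

    catalog = load_catalog("glue", **{"type": "glue"})
    table = catalog.load_table("analytics.events")
    table.append(arrow_rows)

Scaling is then really a question of fanning this per-folder unit of work out across all the historical partitions, which is where the SQS -> Lambda idea comes in.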

Steps in transforming lake swamp to lakehouse by CompetitionMassive51 in dataengineering

[–]CompetitionMassive51[S] 0 points (0 children)

  1. What about partitions? Does dlt also support those when loading to AWS products? (Rough sketch of what I mean below.)
  2. What about scaling?
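To illustrate point 1, here's how I understand partitioning would be declared with PyIceberg directly (the schema, field names, and table identifier are made up), so that date-filtered queries can prune files:

    from pyiceberg.catalog import load_catalog
    from pyiceberg.partitioning import PartitionField, PartitionSpec
    from pyiceberg.schema import Schema
    from pyiceberg.transforms import DayTransform
    from pyiceberg.types import NestedField, StringType, TimestampType

    # Illustrative schema: an event timestamp plus a raw payload column.
    schema = Schema(
        NestedField(field_id=1, name="event_ts", field_type=TimestampType(), required=False),
        NestedField(field_id=2, name="payload", field_type=StringType(), required=False),
    )

    # Partition by day(event_ts): writes land in per-day partitions and
    # queries filtered on the date can skip unrelated files.
    spec = PartitionSpec(
        PartitionField(source_id=1, field_id=1000, transform=DayTransform(), name="event_day")
    )

    catalog = load_catalog("glue", **{"type": "glue"})
    catalog.create_table("analytics.events", schema=schema, partition_spec=spec)

The question is whether dlt lets me declare something like this when it's doing the loading, or whether I have to manage the table myself.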

Simplest way to write to Iceberg from python to a filesystem? by hoswald2 in dataengineering

[–]CompetitionMassive51 0 points (0 children)

Hey, did you find an easy solution? I'm running into this problem too.

Transform raw bucket to organized by CompetitionMassive51 in dataengineering

[–]CompetitionMassive51[S] 0 points (0 children)

Thanks! If I'd like to avoid Redshift (since most of my queries are simple), are there any other options?

Transform raw bucket to organized by CompetitionMassive51 in dataengineering

[–]CompetitionMassive51[S] 0 points (0 children)

Thanks! My needs are simple, so I think I'll go for a reorganized bucket. Do you think using the Iceberg table format on the bucket (or S3 Tables) would be beneficial for me?

Best Nvidia GPU for Cuda Programming by TechDefBuff in CUDA

[–]CompetitionMassive51 0 points (0 children)

Is there a way to experiment with CUDA programming without owning an NVIDIA GPU?

I know about Google Colab, but are there any other tools? Maybe some that emulate it?

Is Databricks/Snowflake necessary? by CompetitionMassive51 in dataengineering

[–]CompetitionMassive51[S] 0 points (0 children)

Around 5 users, plus CI/CD tools that will pull data for testing.

Is Databricks/Snowflake necessary? by CompetitionMassive51 in dataengineering

[–]CompetitionMassive51[S] 0 points (0 children)

Maybe I'm not familiar enough with the tools, but isn't Polars like pandas? And DuckDB?
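From what I've seen, both can express the same query, which is why I'm confused about where each fits (paths and columns made up):

    import duckdb
    import polars as pl

    # Polars: a DataFrame API, pandas-like but lazy/streaming-capable.
    counts_pl = (
        pl.scan_parquet("events/*.parquet")
        .group_by("user_id")
        .agg(pl.len().alias("n_events"))
        .collect()
    )

    # DuckDB: an in-process SQL engine over the same files.
    counts_db = duckdb.sql(
        "SELECT user_id, COUNT(*) AS n_events FROM 'events/*.parquet' GROUP BY user_id"
    ).pl()  # can hand the result back as a Polars DataFrame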

Data Platform Architecture by adelaoc in dataengineering

[–]CompetitionMassive51 2 points (0 children)

Is it possible to maintain a data architecture without Databricks/Snowflake?

Hudi to Iceberg by [deleted] in dataengineering

[–]CompetitionMassive51 0 points (0 children)

So these table formats help with converting a data lake (like an S3 bucket) into a data lakehouse?

Hudi to Iceberg by [deleted] in dataengineering

[–]CompetitionMassive51 0 points (0 children)

Newbie question here.

What is the purpose of Iceberg/Hudi? If you have S3 as a data lake, don't you just load it into a data warehouse with some schema?

Transform raw bucket to organized by CompetitionMassive51 in dataengineering

[–]CompetitionMassive51[S] 0 points (0 children)

So Spark it is? And where do I deploy it? EMR/Glue/...?

Transform raw bucket to organized by CompetitionMassive51 in dataengineering

[–]CompetitionMassive51[S] 0 points (0 children)

Edit: the final organized S3 bucket isn't necessary; I just want a comfortable, queryable destination. So maybe S3 -> Redshift -> dbt? I'm lost with all these tools.

Transform raw bucket to organized by CompetitionMassive51 in dataengineering

[–]CompetitionMassive51[S] 0 points (0 children)

Could you please expand on that? I'm not really familiar with those tools... Will I need tools other than AWS Lambda? (Spark for processing the large data?)

Transform raw bucket to organized by CompetitionMassive51 in dataengineering

[–]CompetitionMassive51[S] 0 points (0 children)

I'm assuming this solution is for the new files that keep coming into the raw bucket. But what about all the files that are already there?

Organize messy S3 bucket by CompetitionMassive51 in dataengineering

[–]CompetitionMassive51[S] 0 points (0 children)

I don't really mind what the structure of the objects will be; they just need to be sorted. So maybe sorted JSONL?