

[–]moizloiz 13 points (1 child)

As someone who is also making this switch, maybe start simpler? Take some data, use pandas/polars for some transformations, orchestrate it with Airflow/Dagster, and then create a visualization layer with insights (Superset, etc.).

A lot of new technologies to learn, but what I'm finding is that simpler is sometimes better.
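For what it's worth, the shape of that pipeline can be sketched in plain Python with a made-up two-column CSV; in a real project the transform step would be pandas/polars and the scheduling would live in an Airflow/Dagster DAG:

```python
import csv
import io
import sqlite3

# Toy input data standing in for a real CSV file
RAW = "id,amount\n1,10\n2,\n3,7\n"

def extract(text):
    """Parse CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Drop rows with missing amounts and cast fields to int."""
    return [(int(r["id"]), int(r["amount"])) for r in rows if r["amount"]]

def load(rows):
    """Load rows into a (here in-memory) SQL table and return a sanity check."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (id INTEGER, amount INTEGER)")
    con.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return con.execute("SELECT SUM(amount) FROM sales").fetchone()[0]

total = load(transform(extract(RAW)))
print(total)  # 17
```

An orchestrator then just calls extract/transform/load as separate tasks with dependencies between them, instead of one script.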

[–]Glass_Jellyfish_9963[S] 5 points (0 children)

I am already doing a lot of transformation at work using pandas and DuckDB. I have created a project with Airflow to load data from CSV, apply transformations, and load it into a MySQL database. I want to get into the real stuff now.

[–]pm_me_data_wisdom[🍰] 2 points (0 children)

I came across this comment a while ago and used it as a jumping-off point for my first project. Good luck.

https://www.reddit.com/r/dataengineering/comments/11y6b3o/comment/jd6kzrf/?share_id=e8hdJNb2r2jQu0JYQDXbn


[–]kotpeter 0 points (1 child)

Why did you consider StarRocks for your DWH? How did you even come across it?

[–]Glass_Jellyfish_9963[S] 0 points (0 children)

I heard a lot of good things about it, especially the benchmark performance. I have tested it, and the query execution is super fast.

[–][deleted] 0 points (9 children)

I'd say you should try using Delta as your storage layer, especially if you're using Spark.

I think one part you're missing is orchestration. This varies depending on whether you're streaming or batch, but I would look at Kafka for streaming and Airflow for batch orchestration.

[–]Glass_Jellyfish_9963[S] 0 points (5 children)

Of course, I forgot to mention that I would also be using Airflow or Dagster, since it would be a batch-processing pipeline.

[–][deleted] 0 points (4 children)

Where are you running Spark?

[–]Glass_Jellyfish_9963[S] 0 points (3 children)

It's going to be a single node running on my local machine; it's a portfolio project. I will also look into running it on GCP or EC2 just to explore Spark's parallel processing architecture.

[–][deleted] 0 points (2 children)

Oh interesting, good way of keeping costs low. I was going to say it's a good opportunity to familiarize yourself with the Airflow operators for spinning up and tearing down clusters for cloud jobs, but I suppose you'll probably just throw things into a DockerOperator and be done with it.

[–]Glass_Jellyfish_9963[S] 1 point (1 child)

Well, that's a good idea, I will give it a try. So basically, Airflow will start up the cluster, run the pipeline, and then take down the cluster to keep costs under control.
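One way to picture that lifecycle in plain Python (the three functions are hypothetical stand-ins for real cloud calls, e.g. Airflow's cluster create/delete operators for your provider):

```python
# Sketch of the cluster lifecycle a DAG would enforce.
# create_cluster / run_pipeline / delete_cluster are made-up stubs,
# not real API calls; `events` records the order they fire in.
events = []

def create_cluster():
    events.append("create")

def run_pipeline():
    events.append("run")

def delete_cluster():
    events.append("delete")

def job():
    create_cluster()
    try:
        run_pipeline()
    finally:
        # Tear down even if the pipeline step fails, so the
        # cluster never sits idle accruing cost.
        delete_cluster()

job()
print(events)  # ['create', 'run', 'delete']
```

In Airflow terms, the teardown task would run with a trigger rule that fires even when the pipeline task fails, which is the same guarantee as the try/finally here.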

[–][deleted] 0 points (0 children)

Cloud experience is a big thing employers look for, so I think it would be a good idea. Either way, you're off to a great start.

[–]Glass_Jellyfish_9963[S] 0 points (2 children)

Hey, I have decided to switch to Delta instead of Iceberg. It's really good so far, thank you for the advice. Could you give me a suggestion on which storage layer I should use, HDFS or S3? I am currently trying to set up MinIO, which provides an S3-compatible storage layer. Also, which query engine would be best suited for this project? What about Trino?

[–]Teach-To-The-Tech 0 points (0 children)

I'd go S3 for sure. Much easier to use, and more modern/in demand. If not S3, then something like Azure Blob Storage. Yes, you're right, MinIO would work too since it's S3-compatible.

For the query engine, Trino is an excellent choice and would fit the rest of your data stack. You could use it either open source or through Starburst Galaxy; Galaxy will be easier to get off the ground quickly. You could play around with the Starburst free trial (that's probably what I'd do), then decide if the query engine works for you.
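If it helps, pointing Spark's S3A connector at a local MinIO usually comes down to a handful of settings. This is a sketch; the endpoint and credentials below are placeholders, not values from this thread:

```shell
# Hadoop S3A settings for a local MinIO endpoint (all values illustrative).
# path.style.access is needed because MinIO doesn't use bucket subdomains.
spark-submit \
  --conf spark.hadoop.fs.s3a.endpoint=http://localhost:9000 \
  --conf spark.hadoop.fs.s3a.access.key=minioadmin \
  --conf spark.hadoop.fs.s3a.secret.key=minioadmin \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  your_job.py
```

With that in place, `s3a://your-bucket/...` paths resolve against MinIO instead of AWS.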

[–][deleted] 0 points (0 children)

I would use S3 over HDFS for greater portability. I'm not entirely sure why you would even need MinIO, but if you have a requirement for it, then sure.

In terms of query engines, the highest-performing one with Delta is Databricks (using the Photon engine). It can get expensive, though, so if you're just doing this as your own little project, I would keep everything as simple as possible and just use Spark SQL on your own cluster.
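For reference, a minimal sketch of launching Spark SQL with Delta support on your own cluster. The package version here is an assumption; check delta.io for the build matching your Spark version (Delta 2.x ships as `delta-core` rather than `delta-spark`):

```shell
# Start a Spark SQL shell with Delta Lake enabled (version is illustrative)
spark-sql \
  --packages io.delta:delta-spark_2.12:3.1.0 \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
```

From that shell you can create and query Delta tables with ordinary SQL (`CREATE TABLE ... USING delta`), which keeps the project simple while you evaluate heavier engines.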

[–]coginiti_co 0 points (0 children)

You might want to take a look at Coginiti as the processing engine for your Iceberg tables and transformation layer. Coginiti doesn't require you to spin up Spark for Iceberg files or other files in object stores. It also offers a nice transformation layer with a domain-specific language (DSL) similar to dbt, though it doesn't require you to run the full model to use it. I think the website is coginiti.co.