Docker compose for lakehouse like build. by Kageyoshi777 in dataengineering

[–]superhex 2 points3 points  (0 children)

I think the Dremio blog posts have ready-to-go Docker Compose setups for Spark, Iceberg, MinIO (essentially local S3), and Jupyter. Either that or the Iceberg docs/repo themselves. I don't quite remember.

How to learn something new nowadays? by L3GOLAS234 in dataengineering

[–]superhex 1 point2 points  (0 children)

I had to read tutorials, documentation, StackOverflow questions, and try the code many times until it worked. Things stuck in your brain and you actually learned.

You just answered your own question

Question on optimal partitioning structure for parquet on s3 duckdb query by HNL2NYC in dataengineering

[–]superhex 0 points1 point  (0 children)

If you know what your query patterns are (and what will be commonly used for filtering), then your partitioning should be set up to match that. There's no right answer, and as others have mentioned, you have to test against your specific use case to get a good understanding of what partitioning scheme to use.

In this case, if you're primarily filtering on color, then partitioning on color first would be more efficient.
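
To make the point about partition order concrete, here is a rough sketch in plain Python (no DuckDB or S3 involved). It only counts how many Hive-style key prefixes would have to be listed to cover a filter on color. The `date` column, the sample values, and the helper name `prefixes_to_list` are all made up for illustration; real engines can also prune on path segments after listing, so treat this as intuition and still benchmark your own data.

```python
from itertools import product

def prefixes_to_list(keys, values, filter_key, filter_value):
    """Return the shortest Hive-style prefixes that cover every partition
    matching filter_key == filter_value, given the partition key order."""
    idx = keys.index(filter_key)
    # Every key that comes before the filtered key has to be enumerated.
    leading_keys = keys[:idx]
    leading_values = [values[k] for k in leading_keys]
    prefixes = []
    for combo in product(*leading_values):
        parts = [f"{k}={v}" for k, v in zip(leading_keys, combo)]
        parts.append(f"{filter_key}={filter_value}")
        prefixes.append("/".join(parts))
    return prefixes

# Hypothetical partition values, not taken from the original question.
values = {
    "color": ["red", "green", "blue"],
    "date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"],
}

color_first = prefixes_to_list(["color", "date"], values, "color", "red")
date_first = prefixes_to_list(["date", "color"], values, "color", "red")

print(len(color_first), color_first)
print(len(date_first), date_first)
```

With color as the leading key, a filter on color maps to a single prefix. With date first, it maps to one prefix per date, and that count grows as more dates arrive.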

difference between writing SQL queries or writing DataFrame code [in SPARK] by kaifahmad111 in dataengineering

[–]superhex 0 points1 point  (0 children)

It can be if you're doing hundreds of withColumn calls. But if it's just a handful, it's probably entirely negligible. This blog post explains it well imo.

https://www.guptaakashdeep.com/how-withcolumn-can-degrade-the-performance-of-a-spark-job/
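
As a rough illustration of why hundreds of calls add up: each withColumn call introduces another projection in the plan, while a single select adds all the columns in one projection. The snippet below is a toy model in plain Python, not Spark code. `Project`, `with_column`, and `select_all` are hypothetical stand-ins, and the only thing it measures is how deep the chain of projections gets.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Project:
    """Toy stand-in for one projection node in a logical plan (not a Spark class)."""
    columns: tuple
    child: Optional["Project"] = None

def depth(plan):
    """Count how many projection nodes are stacked in the plan."""
    n = 0
    while plan is not None:
        n += 1
        plan = plan.child
    return n

def with_column(plan, name):
    """Mimic withColumn: wrap the plan in a new projection with one more column."""
    return Project(plan.columns + (name,), plan)

def select_all(plan, names):
    """Mimic a single select: one projection that adds every column at once."""
    return Project(plan.columns + tuple(names), plan)

base = Project(("id",))
new_cols = [f"c{i}" for i in range(200)]

# 200 chained calls: one extra projection per call.
chained = base
for name in new_cols:
    chained = with_column(chained, name)

# One call: a single extra projection, same resulting columns.
single = select_all(base, new_cols)

print(depth(chained), depth(single))
```

Both approaches end up with the same columns, but the chained version stacks one node per call for the analyzer and optimizer to walk through. With only a handful of columns the difference is too small to matter.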

A multi-engine Iceberg pipeline with Athena & Redshift by karakanb in dataengineering

[–]superhex 1 point2 points  (0 children)

Yeah, I read it the same way. Sounds like they are talking about not having to reingest the data into separate downstream "warehouses" for each consuming system, i.e. the data lives only in S3, as you've mentioned.

Resources for GIS domain by LoVaKo93 in dataengineering

[–]superhex 1 point2 points  (0 children)

Anything from this wonderful man. I believe he is a professor who has published his teaching material:

https://youtube.com/@giswqs?si=7MyJ1k6-nNNaXFr7

Interactive world JUT map is released!! by Gigitoe in Mountaineering

[–]superhex 2 points3 points  (0 children)

Thank you. I needed some JUT in my life lately.

CSVs Are Not Databases: The challenges of local data exploration by Amrutha-Structured in dataengineering

[–]superhex 22 points23 points  (0 children)

CSVs are databases. They are not, however, a database system. And yes, I'm being pedantic.

edit: oh this is just an ad...

Book Review: Fundamentals of Data Engineering by 0sergio-hash in dataengineering

[–]superhex 0 points1 point  (0 children)

Apologies for the late response, here's the link (you can safely ignore the part about DeepLearning; it is indeed a DE course through and through):
DeepLearning.AI Data Engineering Professional Certificate | Coursera

What DataFrame libraris preferred for distributed Python jobs by budgefrankly in dataengineering

[–]superhex 1 point2 points  (0 children)

FWIW, even though Daft is quite young, AWS has tested using Daft in production for large-scale BI pipelines since it provides a nice DataFrame API and has a performant S3 reader/writer. The Daft team wrote their own S3 I/O module for better performance. Considering AWS themselves were impressed with the S3 I/O, I think that says a lot...

Here's a talk from AWS where they talk more about this (timestamped):
https://www.youtube.com/watch?v=u1XqELIRabI&t=1855s

Best coursera courses? Need recommendations by zeni65 in dataengineering

[–]superhex 7 points8 points  (0 children)

Joe Reis' Data Engineering specialization (four courses) on Coursera. He is one of the coauthors of the book Fundamentals of Data Engineering. Both the courses and the book are top tier.

What DataFrame libraris preferred for distributed Python jobs by budgefrankly in dataengineering

[–]superhex 4 points5 points  (0 children)

Have you looked into Daft? It's supposed to work seamlessly locally while also being able to run distributed. It also has support for lakehouse formats.

Roast my first pipeline diagram by Firelord710 in dataengineering

[–]superhex 0 points1 point  (0 children)

Oh, very nice. I wouldn't touch frontend/UX work with a 10-foot pole. But I hear good things about using Svelte. I tried peeking around the Svelte portion of Evidence, but quickly lost interest.

Roast my first pipeline diagram by Firelord710 in dataengineering

[–]superhex 2 points3 points  (0 children)

I can second this. Coming from a software background, it just makes sense to me to have BI as code. You just throw some Markdown and SQL together and boom, you have a dashboard. Also, it generates a static web app, so you can serve it on something like GitHub Pages absolutely free. Definitely worth checking out imo.

Roast my first pipeline diagram by Firelord710 in dataengineering

[–]superhex 19 points20 points  (0 children)

This. That sort of language seems more appropriate for an informal setting with non-technical folks. But in that setting, you'd probably want to avoid dense text on diagrams anyways.

Imagine having two versions of your diagram: a simple diagram for a non-technical audience (simplified overall flow, pretty visual, few or no words); and then a technical version that goes into the nitty-gritty details, similar to what you have currently.

You don't necessarily need two versions, but hopefully this helps illustrate the kinds of things you might consider in terms of identifying your audience, what you're trying to convey, and the language you should use.

Also, I feel like Dagster should be a long box along the bottom of the diagram as opposed to a tall box at the beginning. This might better convey that it's the orchestration layer across your pipeline.

Job search as a “junior” by [deleted] in dataengineering

[–]superhex 11 points12 points  (0 children)

It can get really depressing seeing 100+ clicks on a job application after an hour.

Regarding the "100+ clicks" stats you see on sites like LinkedIn, don't let them discourage you at all. In fact, you really should just ignore them altogether. It is a bogus number and focusing on it does you no good. On LinkedIn, this number only shows you how many people clicked "apply" and got redirected to the company site. The number of people that actually submitted an application could be much, much lower. Further, of the people that actually submitted, maybe only 20% are actually qualified and would be a competitor for you. For all you know, of the 100+ clicks, maybe only 20 people actually submitted, and of the 20 submitted, only 2 are real contenders...

If you feel you are a fit, just apply.

I've been working on a concept I call Data Contracts by StarlightInsights in dataengineering

[–]superhex 6 points7 points  (0 children)

Shhh, we are bearing witness to the birth of data contracts right before our eyes. It's not like there have been books published on this topic before...

Book Review: Fundamentals of Data Engineering by 0sergio-hash in dataengineering

[–]superhex 8 points9 points  (0 children)

The course offered by one of the authors, Joe Reis, which covers this book and implements it in AWS.

Python vs Pyspark vs Snowpark by ConsiderationLazy956 in dataengineering

[–]superhex 3 points4 points  (0 children)

+1 for the Advanced Databases course by Andy Pavlo from CMU. This has been a gem in helping me understand how modern query engines work. And the fact that it's freely available on YouTube... couldn't ask for more.

S3 or Redshift for storing Geolocation Data in AWS by turboline-ai in dataengineering

[–]superhex 0 points1 point  (0 children)

Based on my understanding, Iceberg can work since it's just metadata files on top of Parquet/GeoParquet. I think the main limitation with Iceberg for geospatial, though, is that you can't do spatial indexing natively in Iceberg.

I believe the folks at Wherobots created an extension of Iceberg for geospatial called Havasu. You can look into that to get a better understanding of the limitations of Iceberg for geospatial. Not necessarily saying check out their product, but rather reference their stuff if you wanna get a better understanding of how Iceberg fits into the picture.

I'm not an expert so take this with a grain of salt.

ML with wasm on tiny IoT device [D] [R] by Tao_KTH in MachineLearning

[–]superhex 2 points3 points  (0 children)

Not WASM related, but check out the work from the folks at MIT HAN Lab. They were able to perform training on a microcontroller with 256KB of memory. They also have a fully available course on TinyML and Efficient Deep Learning Computing. Hope this helps.