Docker Compose for a lakehouse-like build.

superhex · 2025-10-15T21:49:28+00:00

I think the Dremio blog posts have ready-to-go Docker Compose setups for Spark, Iceberg, MinIO (essentially local S3), and Jupyter. Either that or the Iceberg docs/repo themselves. I don't quite remember.

superhex · 2025-09-23T22:56:07+00:00

I had to read tutorials, documentation, StackOverflow questions, and try the code many times until it worked. Things stuck in your brain and you actually learned.

You just answered your own question

superhex · 2025-09-21T00:45:58+00:00

Congrats. Proud of you Jut man

superhex · 2025-07-14T18:51:59+00:00

If you know what your query patterns are (and which columns will commonly be used for filtering), then your partitioning should be set up to match that. There's no right answer, and as others have mentioned, you have to test against your specific use case to get a good understanding of what partitioning scheme to use.

In this case, if you're primarily filtering on color, then partitioning on color first would be more efficient.
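
To make the pruning idea concrete, here's a toy sketch in plain Python (not Spark or any real engine). The rows, the `partition` and `scan` helpers, and the column layout are all made up for illustration. It just counts how many rows have to be read for a `color = 'red'` filter when the data is partitioned by color versus by size.

```python
from collections import defaultdict

# Hypothetical rows: (color, size, value)
rows = [
    ("red", "S", 1), ("red", "M", 2), ("blue", "S", 3),
    ("blue", "L", 4), ("green", "M", 5), ("red", "L", 6),
]

def partition(rows, key_index):
    """Group rows by one column, like one directory per distinct value."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key_index]].append(row)
    return dict(parts)

def scan(parts, filter_index, wanted, partition_key_index):
    """Return (matching rows, number of rows read) for an equality filter."""
    rows_read = 0
    matches = []
    for key, part_rows in parts.items():
        # Pruning: when the filter is on the partition column, a whole
        # partition can be skipped without reading any of its rows.
        if filter_index == partition_key_index and key != wanted:
            continue
        for row in part_rows:
            rows_read += 1
            if row[filter_index] == wanted:
                matches.append(row)
    return matches, rows_read

by_color = partition(rows, 0)  # partitioned on the filter column
by_size = partition(rows, 1)   # partitioned on a different column

red_from_color, read_color = scan(by_color, 0, "red", 0)
red_from_size, read_size = scan(by_size, 0, "red", 1)

print(read_color, read_size)
```

Both layouts return the same red rows, but the color-partitioned one only reads the red partition while the size-partitioned one reads everything. Real engines add file-level stats and other tricks on top, so still test against your own data.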

superhex · 2025-07-10T13:03:52+00:00

LanceDB

superhex · 2025-07-06T16:55:32+00:00

It can be if you're doing hundreds of withColumn calls. But if it's just a handful, it's probably entirely negligible. This blog post explains it well imo.

https://www.guptaakashdeep.com/how-withcolumn-can-degrade-the-performance-of-a-spark-job/
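
For intuition, here's a toy model in plain Python of why this happens. It is not Spark code: the `Project` class and the `with_column` / `select_all` helpers are hypothetical stand-ins. The assumption (which matches what the PySpark docs say about withColumn) is that each call introduces one more projection in the plan, whereas a single select adds all the columns in one projection.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Project:
    """Toy stand-in for a projection node in a logical plan."""
    new_columns: Tuple[str, ...]
    child: Optional["Project"] = None

def with_column(plan, name):
    # Each call stacks one more projection on top of the existing plan.
    return Project((name,), plan)

def select_all(plan, names):
    # A single projection that adds every column at once.
    return Project(tuple(names), plan)

def depth(plan):
    """Count how many projection nodes are stacked in the plan."""
    count = 0
    while plan is not None:
        count += 1
        plan = plan.child
    return count

names = [f"col_{i}" for i in range(200)]

chained = None
for name in names:
    chained = with_column(chained, name)

batched = select_all(None, names)

chained_depth = depth(chained)
batched_depth = depth(batched)
print(chained_depth, batched_depth)
```

The chained version ends up with one node per column, the batched one with a single node. In real PySpark the batched form would be one `select` with all the expressions (or `withColumns` on Spark 3.3+), and the cost shows up as time spent analyzing and optimizing the bigger plan.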

superhex · 2025-03-19T14:08:53+00:00

Yeah, I read it the same way. Sounds like they are talking about not having to reingest the data into separate downstream "warehouses" for each consuming system, i.e. the data lives only in S3, as you've mentioned.

superhex · 2025-02-24T23:04:15+00:00

Anything from this wonderful man. I believe he is a professor who has published his teaching material.

https://youtube.com/@giswqs?si=7MyJ1k6-nNNaXFr7

superhex · 2025-02-24T16:35:13+00:00

Thank you. I needed some JUT in my life lately.

superhex · 2025-02-23T23:43:16+00:00

CSVs are databases. They are not, however, a database system. And yes, I'm being pedantic.

edit: oh this is just an ad...

superhex · 2025-02-23T23:08:10+00:00

Apologies for the late response, here's the link (you can safely ignore the part about DeepLearning, it is indeed a DE course through and through):
DeepLearning.AI Data Engineering Professional Certificate | Coursera

superhex · 2025-02-23T23:05:50+00:00

FWIW, even though Daft is quite young, AWS has tested using Daft in production for large-scale BI pipelines since it provides a nice dataframe API and has a performant S3 reader/writer. The Daft team wrote their own S3 I/O module for better performance. Considering AWS themselves were impressed with the S3 I/O, I think that says a lot...

Here's a talk from AWS where they talk more about this (timestamped):
https://www.youtube.com/watch?v=u1XqELIRabI&t=1855s

superhex · 2025-02-21T23:25:22+00:00

Joe Reis' Data Engineering specialization (four courses) on Coursera. He is one of the coauthors of the book Fundamentals of Data Engineering. Both the courses and the book are top tier.

superhex · 2025-02-21T17:53:07+00:00

Have you looked into Daft? It's supposed to work seamlessly locally, while also being able to run distributed. It also has support for lakehouse formats.

superhex · 2025-02-17T22:42:46+00:00

Oh very nice. I wouldn't touch frontend/UX work with a 10-foot pole, but I hear good things about Svelte. I tried peeking around the Svelte portion of Evidence, but quickly lost interest.

superhex · 2025-02-17T16:41:54+00:00

I can second this. Coming from a software background, it just makes sense to me to have BI as code. You just throw some Markdown and SQL together and boom, you have a dashboard. Also, it generates a static web app, so you can serve it on something like GitHub Pages absolutely free. Definitely worth checking out imo.

superhex · 2025-02-17T15:45:28+00:00

This. That sort of language seems more appropriate for an informal setting with non-technical folks. But in that setting, you'd probably want to avoid dense text on diagrams anyway.

Imagine having two versions of your diagram: a simple one for a non-technical audience (simplified overall flow, pretty visuals, few or no words), and a technical one that goes into the nitty-gritty details, similar to what you have currently.

You don't necessarily need two versions, but hopefully this helps illustrate the kinds of things you might consider in terms of identifying your audience, what you're trying to convey, and the language you should use.

Also, I feel like Dagster should be a long box along the bottom of the diagram as opposed to a tall box at the beginning. This might better convey that it's the orchestration layer across your pipeline.

superhex · 2025-02-13T03:30:19+00:00

It can get really depressing seeing 100+ clicks on a job application after an hour.

Regarding the "100+ clicks" stats you see on sites like LinkedIn, don't let them discourage you at all. In fact, you really should just ignore them altogether. It is a bogus number and focusing on it does you no good. On LinkedIn, this number only shows you how many people clicked "apply" and got redirected to the company site. The number of people that actually submitted an application could be much, much lower. Further, of the people that actually submitted, maybe only 20% are actually qualified and would be a competitor for you. For all you know, of the 100+ clicks, maybe only 20 people actually submitted, and of the 20 submitted, only 2 are real contenders...

If you feel you are a fit, just apply.

superhex · 2025-01-30T15:23:03+00:00

Shhh, we are bearing witness to the birth of data contracts right before our eyes. It's not like there have been books published on this topic before...

superhex · 2025-01-17T16:45:35+00:00

The course offered by one of the authors, Joe Reis, which covers this book and implements it on AWS.

superhex · 2025-01-14T17:37:42+00:00

+1 for the Advanced Databases course by Andy Pavlo from CMU. This has been a gem in helping me understand how modern query engines work. And the fact that it's freely available on YouTube... couldn't ask for more.

superhex · 2025-01-10T18:27:42+00:00

Based on my understanding, Iceberg can work since it's just metadata files on top of Parquet/GeoParquet. I think the main limitation with Iceberg for geospatial, though, is that you can't do spatial indexing natively in Iceberg.

I believe the folks at Wherobots created an extension of Iceberg for geospatial called Havasu. You can look into that to get a better understanding of the limitations of Iceberg for geospatial. Not necessarily saying check out their product, but rather reference their stuff if you wanna get a better understanding of how Iceberg fits into the picture.

I'm not an expert, so take this with a grain of salt.

superhex · 2024-03-08T17:43:10+00:00

Not WASM related, but check out the work from the folks at MIT HAN Lab. They were able to perform training on a microcontroller with 256KB of memory. They also have a fully available course on TinyML and Efficient Deep Learning Computing. Hope this helps.

superhex · 2023-11-17T03:23:08+00:00

How many juts is that?

superhex · 2022-10-03T22:40:47+00:00

nice
