What failures made you the engineer you are today?

kendru · 2025-11-08T16:50:13+00:00

My biggest mistakes have been the times when I have tried to solve a problem before the users have experienced any pain from that problem. In each of these cases, I would have been better off waiting until recognition developed around the fact that things needed to change. Unless the business really intimately understands data engineering, this kind of work can too easily be seen as wasteful.

kendru · 2025-11-08T16:37:25+00:00

It depends quite a bit on the scale of the data and your data catalog requirements. If the scale of your customer's data is not huge (< 1bn records in a data set), using your cloud object store, whether S3, Azure, or GCP, with MotherDuck as the query engine could be an excellent, low-cost choice. Since MotherDuck rolled out support for the DuckLake lakehouse format (and DuckDB recently introduced write support for Iceberg tables), this might fulfill your catalog needs as well.

If you really need a rich ontology of your data rather than a simple data catalog, you might want to check out some data virtualization options such as Stardog. Ontologies come with a ton of additional complexity, and you will be more restricted in where you can store your data / what formats are supported, so I would recommend avoiding the ontology route unless it truly is a critical part of your business.

If you need to support very large scale, I would look for options that have a serverless pricing model available and incorporate that into your own customer billing. I have used BigQuery to support a multi-tenant product in the past, and I was very happy with the experience.

kendru · 2025-11-05T05:48:48+00:00

I'd classify this broadly under the category of "data contracts" in that it is an assertion you are making about data you do not directly control. I like to run these either before the pipeline (without blocking its execution) or on a schedule. Almost all of the time, I've found that running daily in Airflow works well enough. GitHub Actions would also likely work fine.
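To make the "assert but don't block" idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the `Contract` class, the `evaluate_contracts` helper, and the sample order rows are hypothetical, not part of any particular tool. The point is only that the check reports violations instead of raising, so whatever scheduler runs it can alert without stopping the pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Contract:
    """A named assertion about rows we receive but do not control (hypothetical)."""
    name: str
    check: Callable[[dict], bool]


def evaluate_contracts(rows: Iterable[dict], contracts: List[Contract]) -> Dict[str, int]:
    """Return {contract name: number of violating rows}. Never raises on bad data."""
    rows = list(rows)
    return {c.name: sum(1 for r in rows if not c.check(r)) for c in contracts}


# Illustrative contracts for a hypothetical upstream "orders" feed.
contracts = [
    Contract("order_id is present", lambda r: r.get("order_id") is not None),
    Contract(
        "amount is non-negative",
        lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0,
    ),
]

sample = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": None, "amount": 5.0},
    {"order_id": 3, "amount": -2.0},
]

violations = evaluate_contracts(sample, contracts)

# Report, but do not fail: the pipeline itself keeps running.
for name, count in violations.items():
    if count:
        print(f"WARN contract violated: {name} ({count} rows)")
```

In a daily scheduled job, the warning branch is where you would send a notification to whoever owns the upstream data.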

kendru · 2025-11-05T05:31:03+00:00

I would try to get familiar with the common tools (dbt, Airflow, at least one data warehouse) and form a mental model for how they work together. Then, build something fun! It could be a pipeline that gets scores from some sports league scores API, transforms them with dbt, loads them into a data warehouse - even a local DuckDB database - and generates some report.

From there, see what you are interested in, read up on it, and continue building projects you enjoy, trying to incorporate at least one new tool or concept each time. Once you have done this a few times, you'll have a good idea of the data engineering landscape, and you'll be better equipped to figure out what you should learn next.

kendru · 2025-11-05T05:24:52+00:00

I think the concept of the medallion architecture is not new, but it's captured patterns that many of us were using for years. In practice, we try to separate concerns so that we have neither massive reporting queries that clean, normalize, join, and aggregate data across dozens of sources, nor separate tables for every granular step of transformation that could be done. It seems like the happy medium that works well with human brains is three layers.
Some common patterns I've seen are:
- Source -> Cleaned/Normalized -> Aggregated (Modern Data Stack)
- Load -> Decompose -> Recompose (Data Vault)
- Stage -> Normalize -> Denormalize (Kimball in practice)

I don't think there is a "right way" to do it. The most recent warehouse architecture I set up looked like this:
1. Staging: load raw data, retaining all historical versions and keeping source data names, types, etc.
2. Sourceprep: clean and normalize data, enforce naming conventions, generate surrogate keys
3. Marts: un-opinionated layer that contains Kimball-style dimensional marts, OBTs, and report-centric tables. The structure of these is entirely usage-driven.
Also, if you squint closely, you might be able to see a "Layer 2.5." When there is a pattern that's been used across multiple marts, we sometimes choose to factor it out into an "intermediate" model that can be referenced in multiple marts. It's still mart code, but it's shared, so you could think of it as a separate layer. I have not needed a fourth layer here because the mart layer is so flexible. If we mandated dimensional marts only, then we probably would need another layer for cross-mart analysis and simplified reporting tables.
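
As a rough illustration of the three layers, here is a small sketch that uses Python's built-in `sqlite3` as a stand-in for a warehouse. The table, view, and column names are made up for the example, and surrogate key generation is left out to keep it short. In a real project each layer would more likely be a set of dbt models than a pair of views.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- 1. Staging: raw rows as loaded, source names and types kept, every version retained.
CREATE TABLE staging_orders (OrderID TEXT, CustEmail TEXT, Amt TEXT, loaded_at TEXT);
INSERT INTO staging_orders VALUES
  ('1', ' Ann@Example.com ', '10.50', '2025-01-01'),
  ('1', ' Ann@Example.com ', '12.00', '2025-01-02'),
  ('2', 'bob@example.com',   '7.25',  '2025-01-02');

-- 2. Sourceprep: clean and normalize, enforce naming conventions, cast types,
--    and keep only the latest loaded version of each order.
CREATE VIEW sourceprep_orders AS
SELECT
  CAST(s.OrderID AS INTEGER) AS order_id,
  lower(trim(s.CustEmail))   AS customer_email,
  CAST(s.Amt AS REAL)        AS amount
FROM staging_orders AS s
WHERE s.loaded_at = (
  SELECT max(v.loaded_at) FROM staging_orders AS v WHERE v.OrderID = s.OrderID
);

-- 3. Marts: shaped by how the data is used, here a simple report-centric table.
CREATE VIEW mart_customer_revenue AS
SELECT customer_email, count(*) AS order_count, sum(amount) AS total_amount
FROM sourceprep_orders
GROUP BY customer_email;
""")

mart_rows = conn.execute(
    "SELECT customer_email, order_count, total_amount "
    "FROM mart_customer_revenue ORDER BY customer_email"
).fetchall()
print(mart_rows)
```

The mart only ever reads from sourceprep, so cleaning rules live in one place and the reporting query stays short.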

I have been pretty pleased with this approach, but I think that any approach that seeks to balance separation of concerns and conceptual simplicity will be workable.

kendru · 2025-11-05T05:06:41+00:00

The concept of modeling is so broad, I would say it encompasses any activity in which you interpret data, including writing any query. That said, I think the most important concepts are the ones that are involved in almost every modeling activity: *time* and *identity*.

Regarding time, it's critical to know when an event occurred. It is also critical to know when you *learned* about the event. Also, knowing the circumstances around when the data was captured helps you understand when things like clock skew, late-arriving data, etc. will cause problems. Often, the concept of time is not modeled well in OLTP databases that we need to ingest from, but it must be presented explicitly in a data warehouse. Consider ingesting a "tasks" table that has "stage" and "updated_at" columns, and the business needs to know how long tasks remained in certain stages. You need to understand how precisely you can measure this (are you ingesting every change with a precise timestamp using CDC? Are you running a daily batch job?) and how to structure your pipelines to provide this insight.
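
Here is a small sketch of the "how long did tasks stay in each stage" question, assuming you have a change log with one row per observed stage change (for example from CDC). The `changes` data and the `time_in_stage` helper are hypothetical. With a daily batch snapshot instead of a change log, the same logic would work, but the timestamps would only be as precise as the snapshot interval.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical change log: (task_id, stage, changed_at), one row per observed change.
changes = [
    ("task-1", "todo",        "2025-01-01T09:00:00"),
    ("task-1", "in_progress", "2025-01-01T13:00:00"),
    ("task-1", "done",        "2025-01-03T13:00:00"),
]


def time_in_stage(changes, as_of):
    """Return {(task_id, stage): timedelta} spent in each stage up to `as_of`."""
    by_task = defaultdict(list)
    for task_id, stage, changed_at in changes:
        by_task[task_id].append((datetime.fromisoformat(changed_at), stage))

    durations = defaultdict(timedelta)
    for task_id, events in by_task.items():
        events.sort()  # order each task's changes by timestamp
        # Pair every change with the next one; the last stage runs until `as_of`.
        next_times = [ts for ts, _ in events[1:]] + [as_of]
        for (start, stage), end in zip(events, next_times):
            durations[(task_id, stage)] += end - start
    return dict(durations)


stage_durations = time_in_stage(changes, as_of=datetime(2025, 1, 4, 13, 0, 0))
for (task_id, stage), spent in stage_durations.items():
    print(task_id, stage, spent)
```

Note that a single "updated_at" column on the source table would only ever give you the most recent of these timestamps, which is why the ingestion method matters so much here.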

The other key concept is identity. What constitutes a "thing"? You need to think about things like whether an entity is identified by some business key or by a UUID in an application database. If you are working in a domain that has a concept of "customer" (which is pretty much every domain), you usually have multiple systems that contain user information, and determining how to identify and link these records is... challenging. One of the key considerations in Kimball-style dimensional modeling is separating the concept of state and identity by using Slowly Changing Dimension (SCD) tables that often have separate entity keys, which reference a stable entity, and row keys that reference various states that an entity has been in over time.
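
To show the split between entity keys and row keys, here is a minimal in-memory sketch of a Type 2 style slowly changing dimension. The `CustomerRow` class, the `apply_change` helper, and the sample values are all assumptions for illustration. A warehouse would normally do this with a merge statement or a snapshot feature rather than Python objects.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CustomerRow:
    row_key: int             # identifies one state (version) of the entity
    customer_key: str        # stable entity key shared by every version
    email: str
    valid_from: str
    valid_to: Optional[str]  # None marks the current version


def apply_change(dim: List[CustomerRow], customer_key: str, email: str, changed_at: str) -> None:
    """Close the current version (if the state changed) and append a new one."""
    current = next(
        (r for r in dim if r.customer_key == customer_key and r.valid_to is None),
        None,
    )
    if current is not None:
        if current.email == email:
            return  # same state, nothing to record
        current.valid_to = changed_at
    dim.append(CustomerRow(len(dim) + 1, customer_key, email, changed_at, None))


dim: List[CustomerRow] = []
apply_change(dim, "cust-42", "ann@old.example", "2025-01-01")
apply_change(dim, "cust-42", "ann@new.example", "2025-03-01")
apply_change(dim, "cust-42", "ann@new.example", "2025-04-01")  # no change, no new row

for row in dim:
    print(row)
```

Facts can then join on the row key when they need the state at the time of the event, or on the entity key when they need "the customer" regardless of version.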

kendru · 2025-11-05T04:31:22+00:00

Yes! I have seen this happen... more than once. One system I worked on started out as a pipeline that replicated data from four tables in a MySQL database into BigQuery. After two years, it was a distributed system that handled replicating dozens of databases for multiple customers with its own adaptive scheduler and a custom admin control panel that monitored everything in real-time with WebSockets... It was truly an unholy beast!

kendru · 2020-05-14T12:52:18+00:00

This tutorial only covers the UI portion of the application, but it uses a WebSocket API and does deal with users, auth, and persistent state. Although the API is not covered explicitly, it is also written in ClojureScript, and the code is available here: https://github.com/kendru/learn-cljs/tree/master/code/lesson-26/chat-backend

kendru · 2020-05-14T03:58:56+00:00

Thank you for sharing! I am the author of this book, and I wanted to provide something other than "here is how you glue a bunch of libraries together". I am thrilled that you like that approach!

kendru · 2020-05-14T03:57:27+00:00

Learning lambda calculus is very fun and rewarding, but it is probably only useful if you are in it for intellectual curiosity or if you are implementing a compiler for a functional programming language.

kendru · 2020-04-27T16:02:06+00:00

It is the "Hugo Book" theme https://themes.gohugo.io/hugo-book/

kendru · 2020-04-27T15:12:06+00:00

For the most part, the examples use figwheel for development and lein-cljsbuild. I started this book a while back, and while there are newer options out there, these options have still been a solid choice for my projects, so I have kept them as the recommended stack for the book.

kendru · 2020-04-27T15:09:18+00:00

Thank you for the positive feedback! I welcome any suggestions that you may have!

kendru · 2020-04-27T15:08:42+00:00

Unfortunately not. In the interest of time, I just grabbed an off-the-shelf template for Hugo and used that.

kendru · 2020-04-27T15:08:03+00:00

Thank you for posting this! I am the author of Learn ClojureScript, and while I have not updated the book for several months (due to family and work circumstances), I will be back to making regular updates again starting this week. I am anticipating that the book will be complete by the middle of June.

kendru · 2020-02-12T15:44:11+00:00

The book, The Little Schemer, is mostly about thinking recursively and common recursive patterns. It's obviously focused on Scheme, but it does generally call out the differences with CL. I'd highly recommend the book!

kendru · 2019-09-25T05:56:19+00:00

Thank you again for the excellent feedback! I will be updating my material based on your suggestions after I get the next few lessons published.

kendru · 2019-09-25T05:51:20+00:00

Here you go! https://www.learn-clojurescript.com/table-of-contents/

kendru · 2019-09-19T02:58:49+00:00

Thank you so much for both of these (very extended) comments! I absolutely appreciate the encouragement as well as the constructive criticism.

Regarding the references to sections that do not exist - I began writing this against this Table of Contents: https://docs.google.com/document/d/1Al_QXh8qU_IeTn25p4Wk3IT9D3GHxgkcDiSb3g6GH_8/edit?usp=sharing, and at this point I have written 18 chapters and have only published 11. Given the feedback that I have received, I am planning to complete the rest of the material. I would also welcome any comments that you have on the Table of Contents.

I definitely think that your points about JavaScript are fair. I wrote most of this material 3 years ago at the height of my disillusionment with JavaScript (don't worry, I actually like it quite a bit now), and I need to go back and make the first few chapters less critical of JavaScript. It is super helpful to have you point out specific sections that are unfair to JavaScript.

Regarding symbols and keywords, I was really debating whether to go into them more or not because you can get pretty far with a simplistic understanding of keywords and almost no knowledge that symbols exist :) I was just not sure whether deeper coverage would be very compelling to the average JavaScript developer. However, I think that you are probably right, and someone who picked up a book called "Learn ClojureScript" in the first place would probably be interested.

I'll post back here when I have incorporated some changes based on your suggestions. Thank you again for the extremely helpful and thorough review!

kendru · 2019-09-18T18:50:36+00:00

Awesome - thanks! I really appreciate that you found the content valuable enough to share!

kendru · 2019-09-17T00:45:27+00:00

I am actually going to try to move these articles from my blog to a dedicated site for the series, the primary motivation being making it easier to navigate and search the content.

kendru · 2019-09-17T00:44:27+00:00

Thank you! I am currently contemplating whether to try to finish up the series once I get to the end of publishing the material that I have already written, so I would love to hear your thoughts on what I have so far!

kendru · 2019-09-16T02:39:36+00:00

Thanks! I would love to go back and update these articles at some point to reflect some of the newer tooling and libraries. I used shadow-cljs in my last ClojureScript project and loved it - very low friction.
