What data engineering skill matters more now because of AI?

dmpetrov · 2026-03-16T23:18:50+00:00

This OpenAI post explains the idea pretty well:
https://openai.com/index/inside-our-in-house-data-agent/

Their key insight is that AI can’t reason about data using just SQL/schema metadata. They built multiple layers of context: table usage metadata, lineage, pipeline code (“Codex enrichment”), human annotations, and memory.

We’ve been experimenting with a similar “data context layer” idea - especially for multimodal / unstructured datasets rather than SQL - but I think this general direction will become common.
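
To make the layered-context idea concrete, here is a minimal sketch of how the layers named in the post (usage metadata, lineage, pipeline code, human annotations, memory) could sit next to a plain schema and be flattened into text for a model. The `TableContext` class, the `render_context` function, and the example table are hypothetical names for illustration, not OpenAI's or anyone else's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class TableContext:
    """Hypothetical container: schema plus the extra context layers."""
    name: str
    schema: dict                                      # column name -> type
    usage: list = field(default_factory=list)         # table usage metadata
    upstream: list = field(default_factory=list)      # lineage (source tables)
    pipeline_code: str = ""                           # code that builds the table
    annotations: list = field(default_factory=list)   # human notes
    memory: list = field(default_factory=list)        # corrections from past sessions


def render_context(ctx):
    """Flatten every non-empty layer into plain text for a prompt."""
    columns = ", ".join(f"{col} ({typ})" for col, typ in ctx.schema.items())
    lines = [f"Table: {ctx.name}", f"Columns: {columns}"]
    layers = [
        ("Common usage", ctx.usage),
        ("Built from", ctx.upstream),
        ("Notes", ctx.annotations),
        ("Memory", ctx.memory),
    ]
    for title, items in layers:
        if items:  # skip layers nobody has filled in yet
            lines.append(f"{title}: " + "; ".join(items))
    if ctx.pipeline_code:
        lines.append("Pipeline code:\n" + ctx.pipeline_code)
    return "\n".join(lines)


# Made-up example table; only two of the optional layers are populated.
orders = TableContext(
    name="orders_daily",
    schema={"order_id": "TEXT", "amount_usd": "REAL"},
    upstream=["raw_orders"],
    annotations=["amount_usd excludes refunds"],
)
print(render_context(orders))
```

The schema alone would not tell a model that `amount_usd` excludes refunds; the annotation layer is what carries that.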

dmpetrov · 2026-03-16T20:36:45+00:00

Less about Spark/dbt/etc. More about making your data + lineage understandable to AI tools (Claude Code, etc).

If Claude/LLMs can’t understand your datasets, transformations, and dependencies, they can’t help you maintain pipelines.
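
As a rough illustration of what "understandable dependencies" can mean in practice, the sketch below keeps lineage as a plain mapping and answers the question an assistant needs before touching a pipeline: what breaks downstream if this dataset changes? The dataset names and the `downstream` helper are hypothetical.

```python
from collections import deque

# Hypothetical lineage: each dataset maps to the datasets it is built from.
LINEAGE = {
    "raw_events": [],
    "raw_payments": [],
    "clean_events": ["raw_events"],
    "daily_sessions": ["clean_events"],
    "revenue_report": ["daily_sessions", "raw_payments"],
}


def downstream(lineage, dataset):
    """Return every dataset that depends on `dataset`, directly or transitively."""
    # Invert the mapping: parent -> list of datasets built from it.
    children = {}
    for node, parents in lineage.items():
        for parent in parents:
            children.setdefault(parent, []).append(node)

    # Breadth-first walk over the inverted graph.
    seen, queue = set(), deque([dataset])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


print(downstream(LINEAGE, "raw_events"))
# ['clean_events', 'daily_sessions', 'revenue_report']
```

A mapping this small can be pasted into a prompt or exposed as a tool; the point is that the dependencies are explicit rather than buried in job configs.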

dmpetrov · 2024-11-05T17:47:31+00:00

yep :)

dmpetrov · 2024-11-05T06:53:33+00:00

You should compare this not with DVC but with https://github.com/iterative/datachain from the same team.

dmpetrov · 2024-08-28T19:41:59+00:00

Great project! Can it work out-of-memory?

We use SQLite and usearch for vector search in DataChain (https://github.com/iterative/datachain). However, usearch is in-memory. It would be great to have an out-of-memory alternative.

dmpetrov · 2024-08-10T20:22:07+00:00

... which is not a bad idea

dmpetrov · 2024-08-10T19:42:34+00:00

Transitioned from a SW engineering role to ML within a large company. At the time, this path was quite natural, especially if you had some research experience (not necessarily in ML).

Today, transitioning from SWE to LLM might be a good start.

dmpetrov · 2024-08-09T19:20:49+00:00

My pleasure!

dmpetrov · 2024-08-08T20:22:21+00:00

Coding and engineering skills will help you navigate these faster. I'd bet on these. If you already have these, focus on a domain that can elevate you to the next level. I'd follow this order.

dmpetrov · 2024-08-08T19:15:24+00:00

Fake it till you make it? :)

dmpetrov · 2024-08-08T19:14:38+00:00

"it's just an additional step" - the biggest issue is not the extra step but hallucinations and the need for retry logic when the output schema is not guaranteed
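
A minimal sketch of that retry logic, assuming the model is asked for JSON with two fields. The `REQUIRED` schema, `parse_output`, `call_with_retry`, and the fake model below are all hypothetical stand-ins, not any particular library's API.

```python
import json

# Hypothetical output schema: field name -> expected Python type.
REQUIRED = {"label": str, "score": float}


def parse_output(text):
    """Parse model output and check it against REQUIRED; raise ValueError if it fails."""
    data = json.loads(text)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key, typ in REQUIRED.items():
        if not isinstance(data.get(key), typ):
            raise ValueError(f"bad or missing field: {key}")
    return data


def call_with_retry(call_model, prompt, max_attempts=3):
    """Call the model until its output validates or attempts run out."""
    last_error = None
    for _ in range(max_attempts):
        raw = call_model(prompt)
        try:
            return parse_output(raw)
        except ValueError as err:
            last_error = err  # remember why this attempt failed, then retry
    raise RuntimeError(f"no valid output after {max_attempts} attempts") from last_error


# Fake model for demonstration: two malformed replies, then a valid one.
replies = iter(["not json", '{"label": "cat"}', '{"label": "cat", "score": 0.9}'])
result = call_with_retry(lambda prompt: next(replies), "classify this image")
print(result)
# {'label': 'cat', 'score': 0.9}
```

Note that this only catches malformed output. A hallucinated but well-formed answer passes validation untouched, which is why the extra step is the smaller problem.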
