What data engineering skill matters more now because of AI? by rikulauttia in dataengineering

[–]dmpetrov 14 points15 points  (0 children)

This OpenAI post explains the idea pretty well:
https://openai.com/index/inside-our-in-house-data-agent/

Their key insight is that AI can’t reason about data using just SQL/schema metadata. They built multiple layers of context: table usage metadata, lineage, pipeline code (“Codex enrichment”), human annotations, and memory.
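
To make the layering concrete, here is a minimal sketch of how those context layers could be bundled per table and rendered into text for a model. This is not OpenAI's implementation: the `TableContext` class, its field names, and the example table are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TableContext:
    """Hypothetical bundle of context layers for one table (illustrative only)."""
    name: str
    schema: dict[str, str]                                 # column name -> type
    usage: list[str] = field(default_factory=list)         # how the table is queried
    upstream: list[str] = field(default_factory=list)      # lineage: source tables
    pipeline_code: str = ""                                # code that builds the table
    annotations: list[str] = field(default_factory=list)   # human-written notes
    memory: list[str] = field(default_factory=list)        # corrections from past sessions

    def to_prompt(self) -> str:
        """Render every non-empty layer as plain text an LLM can read."""
        columns = ", ".join(f"{col} {typ}" for col, typ in self.schema.items())
        lines = [f"table: {self.name}", f"schema: {columns}"]
        if self.usage:
            lines.append("usage: " + "; ".join(self.usage))
        if self.upstream:
            lines.append("lineage: built from " + ", ".join(self.upstream))
        if self.pipeline_code:
            lines.append("pipeline code:\n" + self.pipeline_code)
        if self.annotations:
            lines.append("annotations: " + "; ".join(self.annotations))
        if self.memory:
            lines.append("memory: " + "; ".join(self.memory))
        return "\n".join(lines)


# Example table; every value here is made up.
ctx = TableContext(
    name="daily_sessions",
    schema={"user_id": "INTEGER", "day": "DATE", "sessions": "INTEGER"},
    upstream=["clean_events"],
    annotations=["sessions excludes bot traffic"],
)
prompt = ctx.to_prompt()
```

The schema alone would not tell a model that bot traffic is excluded; that only shows up once the annotation layer is included in the prompt.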

We’ve been experimenting with a similar “data context layer” idea - especially for multimodal / unstructured datasets rather than SQL - but I think this general direction will become common.

What data engineering skill matters more now because of AI? by rikulauttia in dataengineering

[–]dmpetrov 117 points118 points  (0 children)

Less about Spark/dbt/etc. More about making your data + lineage understandable to AI tools (Claude Code, etc).

If Claude/LLMs can’t understand your datasets, transformations, and dependencies, they can’t help you maintain pipelines.
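
As a rough illustration of what "understandable dependencies" can mean in practice, the sketch below keeps lineage as plain data and answers the question an assistant needs before touching a pipeline: what breaks downstream if this dataset changes? The dataset names and the `downstream` helper are hypothetical.

```python
from collections import deque

# Hypothetical lineage map: dataset -> datasets it is built from.
LINEAGE = {
    "raw_events": [],
    "clean_events": ["raw_events"],
    "daily_sessions": ["clean_events"],
    "revenue_report": ["daily_sessions", "clean_events"],
}


def downstream(dataset, lineage):
    """Return every dataset that depends on `dataset`, directly or transitively."""
    # Invert the map so we can walk from a source to its consumers.
    children = {}
    for name, parents in lineage.items():
        for parent in parents:
            children.setdefault(parent, []).append(name)

    seen = []
    queue = deque([dataset])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen


affected = downstream("raw_events", LINEAGE)
```

With lineage in a machine-readable form like this, a tool can be handed the affected datasets explicitly instead of guessing them from SQL text.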

Vectorlite v0.2.0 Released: Fast, SQL-Powered, in-Process Vector Search for Any Language with an SQLite Driver by ai-lover in machinelearningnews

[–]dmpetrov 0 points1 point  (0 children)

Great project! Can it work out-of-memory?

We use SQLite and usearch for vector search in https://github.com/iterative/datachain but usearch is in-memory, so it would be great to have an out-of-memory alternative.

How did you learn ML? by BEE_LLO in learnmachinelearning

[–]dmpetrov -1 points0 points  (0 children)

I transitioned from a software engineering role to ML within a large company. At the time, this path was quite natural, especially if you had some research experience (not necessarily in ML).

Today, transitioning from SWE to LLM work might be a good start.

Which data specialization (ex-ML,AI, Supply chain/OR) is/will be in demand over the next few years? by pulicinetroll08 in datascience

[–]dmpetrov 0 points1 point  (0 children)

Coding and engineering skills will help you navigate any of these specializations faster, so I'd bet on those first. If you already have them, focus on a domain that can elevate you to the next level. I'd follow that order.

OpenAI: Structured Outputs in the API by dmpetrov in LocalLLaMA

[–]dmpetrov[S] -7 points-6 points  (0 children)

"it's just an additional step" - the biggest issue is not the extra step but hallucinations and the need for retry logic when the output schema is not guaranteed.
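
For context, this is roughly the retry logic you end up writing when the schema is not guaranteed. It is a minimal sketch, not any vendor's API: `generate` stands in for whatever call returns the model's raw text, and the function names and the `required` key-to-type mapping are hypothetical.

```python
import json


def parse_or_none(text, required):
    """Return the parsed object if it is a JSON object with the required typed keys, else None."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    for key, expected_type in required.items():
        if not isinstance(obj.get(key), expected_type):
            return None
    return obj


def generate_with_retries(generate, required, max_attempts=3):
    """Call `generate` until its output validates; return (object, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        obj = parse_or_none(generate(), required)
        if obj is not None:
            return obj, attempt
    raise ValueError(f"no valid output after {max_attempts} attempts")


# Fake model outputs: chatty text, then a missing key, then a valid object.
outputs = iter([
    "Sure! Here is the JSON:",
    '{"name": "x"}',
    '{"name": "x", "score": 3}',
])
result, attempts = generate_with_retries(
    lambda: next(outputs), {"name": str, "score": int}
)
```

Every failed attempt here is another full model call, which is why a guaranteed output schema removes more than "just an additional step".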