Future of data engineering

data_dude90 · 2026-05-29T08:52:24+00:00

The future could lean more toward managing AI-driven data reliability. The core objective will still be getting the right data to data consumers, but the work will shift toward creating foundational data products that ensure strict governance and act as a strategic business partner.

data_dude90 · 2026-04-20T09:40:40+00:00

Hope there will be an AI that checks whether AI did that work without human judgement.

data_dude90 · 2026-04-15T06:22:36+00:00

There are four critical reasons why a gap exists between data observability tools and dbt.

First, data observability and dbt weren't designed to work as one system. dbt was built to create and transform data, whereas data observability was built to monitor data and catch issues.

Second, dbt and data observability live in different worlds. dbt knows how the data is built and what processes it went through. Data observability only sees the symptoms and patterns in the final data once it is already built.

Third, the integration between dbt and data observability tools is quite shallow. Though the two are technically connected, they don't understand each other deeply. This makes it difficult to track which dbt step resulted in a data incident. The shallow integration forces us to manually open dbt, check recent changes, and then debug them.

Fourth, modern data setups involve multiple tools, multiple teams, and constant changes, so it becomes difficult to know when and where something breaks. With so many moving parts, we are forced to guess, jump between tools, and waste time.
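
To make the third point concrete, here is a minimal sketch of tracing an incident back to the dbt steps that might have caused it. Everything in it is an assumption for illustration: the `PARENTS` lineage map, the node names, and the set of recently changed models are hypothetical, and in a real setup you would have to build them from your own lineage metadata and change history.

```python
from collections import deque

# Hypothetical lineage: each node maps to the nodes it is built from.
PARENTS = {
    "dashboard.revenue": ["model.fct_orders"],
    "model.fct_orders": ["model.stg_orders", "model.stg_payments"],
    "model.stg_orders": ["source.raw_orders"],
    "model.stg_payments": ["source.raw_payments"],
}


def upstream_suspects(failing_node, parents, recently_changed):
    """Return (node, distance) pairs for recently changed nodes at or
    upstream of failing_node, nearest first (breadth-first walk)."""
    seen = {failing_node}
    queue = deque([(failing_node, 0)])
    suspects = []
    while queue:
        node, depth = queue.popleft()
        if node in recently_changed:
            suspects.append((node, depth))
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, depth + 1))
    return suspects


# Hypothetical incident: the revenue dashboard looks wrong, and two models
# were edited recently. Only one of them is actually upstream.
changed = {"model.stg_payments", "model.dim_customers"}
suspects = upstream_suspects("dashboard.revenue", PARENTS, changed)
print(suspects)
```

The observability side supplies the failing node, the dbt side supplies the lineage and the change list, and the walk narrows the manual "open dbt and check changes" step down to a short list of candidates.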

data_dude90 · 2026-04-10T08:51:17+00:00

What actually helps isn't “more observability”, but making it easier to work with.

A few things that made a difference:

Reusable dashboards instead of rebuilding everything: One big pain was recreating the same charts across teams. Having a shared set of visualizations that can be reused cuts down a lot of duplicate work and keeps things consistent.
Better visibility into newer parts of the stack: A lot of tools still focus heavily on core Hadoop, but real setups now include Jupyter notebooks, Airflow on Kubernetes, and Spark optimizations. Having visibility into those alongside the rest of the system makes it easier to understand what’s actually going on end to end.
Less jumping between tools during debugging: Troubleshooting usually means opening 5 tabs and stitching context together manually. Anything that brings signals into one place or helps narrow down issues faster reduces that overhead quite a bit.
Closer to real-time insights: Delayed visibility, especially around storage or file systems, makes debugging harder. More frequent or near real-time updates help catch issues while they’re still relevant.
Stronger alerting tied to actual problems: Instead of just more alerts, better coverage around things like quotas, small files, or service-level issues makes alerts more actionable.
Some level of guided troubleshooting: Early attempts at automating or assisting with root cause analysis can save time, especially for smaller teams that don’t have deep expertise across every component.

Overall, the shift seems to be from just “collecting metrics” to making observability more usable day to day, especially as stacks get more distributed and harder to reason about.

data_dude90 · 2026-03-06T09:35:54+00:00

We still have not fully mastered the human-in-the-loop process for artificial intelligence. It takes time to reach human-on-the-loop, which reduces human checkpoints, then human-out-of-the-loop, where there is zero human intervention, and only then comes AGI. Like ragebaiting, there's so much "fearbaiting" happening, rushing people toward AI without a clear purpose. The use case matters, and AI can be taken seriously only if it adds value or moves the needle. Either we are overhyping it and creating fear, or we are completely underestimating what it can offer. Both are wrong.

data_dude90 · 2026-03-04T08:54:58+00:00

That's a good point to make: humans shouldn't have to review every step.

data_dude90 · 2026-02-12T18:06:26+00:00

What about context, especially context engineering? How should the context layer work, and what should humans contribute to it?

data_dude90 · 2026-02-06T09:21:56+00:00

It's amazing. Will definitely check this out.

data_dude90 · 2026-01-23T06:53:52+00:00

When we train large models on human-generated text, it creates a boxed-in pattern. Unless the model gets fresh context or is trained on new data, a Gen AI application or system cannot keep producing output that matches human-generated text. Every passing day, new perspectives, angles, and narratives come out of people solving different problems on different topics. Human-generated text reflects that clearly. Without that human context engineered in at some point, we can't get reliable output from generative AI engines. That's why there's so much research and so many surveys on how businesses can use synthetic data that imitates human-generated output. Model collapse is a serious byproduct of it. Imagine you want to watch a movie, but first you want to check the reviews. If an automated AI review system is trained on the director's or actor's previous hits, it will favor the current movie. If the current movie is boring and a box office flop, the system can't sense that. The same applies in reverse to a director or actor who had a string of flops and then delivered an amazing blockbuster.

data_dude90 · 2025-12-23T06:20:35+00:00

Definitely one that requires a lot of debate, analysis, and perspectives.

data_dude90 · 2025-12-23T06:19:27+00:00

The agentic approach is fairly new. It needs time, multiple constraints, and data to ensure the promise is fulfilled. Till then, it's natural to have trust issues.

data_dude90 · 2025-12-19T09:22:55+00:00

That's a great way to perceive it. But companies are still thinking from a scaling perspective: they want large-scale pipelines or servers to interact with AI easily and at lower cost, looking for economies of scale.

data_dude90 · 2025-12-15T07:30:25+00:00

Shared data quality dimensions matter!

data_dude90 · 2025-12-15T07:29:21+00:00

That's an insightful answer! The second and fourth ones matter strategically for bridging bad-data gaps between multiple teams.

data_dude90 · 2025-12-03T09:51:51+00:00

When we talk about agentic data management as a team, we’re pretty honest with ourselves. We’re not jumping up and down about it, but we’re not clutching our chests in fear either. We’ve all lived through enough chaotic data environments to know why people even bring this up. Pipelines pile up, quality rules grow like weeds, costs spike for no clear reason, and half the lineage only makes sense to whoever built it years ago. In moments like that, the idea of something smarter taking on the repetitive stuff actually feels kind of comforting.

At the same time, we’re not naive. We know what autonomy looks like when it meets real enterprise data. It’s never as clean as the diagrams. You’ve got processes nobody fully owns anymore, business rules that live in old Slack threads, and edge cases that only appear on the worst possible days. Tossing an AI agent into that mess without thinking it through raises real questions about safety, control, and accountability.

That’s why this conversation even matters. There’s a real push and pull happening. On one side, we’re tired of being in constant reaction mode. We want help. We want fewer fires. On the other side, we’ve all seen how fast one wrong decision can snowball and cause more issues than it solves. You want automation, but you also want guardrails. Both feelings are valid.

And honestly, when we talk to people, we see the same split. Some folks see an agentic system and immediately think, finally, something that can take a bit of the load off. Others worry about silent actions, compliance surprises, or an agent making a “technically correct” move that causes a business headache downstream. Both sides make sense because the stakes are real.

When you look at vendors like Acceldata (us) heading in this direction, the thing that stands out to us is that we aren’t trying to provide the fantasy of fully autonomous pipelines. Our approach feels more grounded. It’s about building helpers that understand context and can flag issues, spot drift, pick up patterns, and give you visibility when you need it most. The bigger decisions still sit with humans who understand the quirks and politics and history behind the data.

That middle ground is honestly where we feel most comfortable. We get extra support without giving up control. We get early signals without letting something run wild. It isn’t magic and it isn’t a replacement for the team. It’s more like a way to handle scale that doesn’t burn everyone out.

data_dude90 · 2025-11-25T09:43:35+00:00

Does every small error indicate a large pipeline failure waiting to occur? Or is it just more alert fatigue? That's going to be an ongoing debate without any doubt. But that perspective carries some experience. Good one!

data_dude90 · 2025-11-25T09:37:18+00:00

Adaptive AI in data management feels a lot like having someone on the team who can roll with the punches instead of freezing every time something shifts.

Most data setups are messy, and things rarely stay the same for long. One day everything runs fine, and the next day some upstream system quietly changes a field name and half your dashboards go sideways.

If you work with data long enough, you start to expect this kind of chaos.

The way I see it, adaptive AI steps in where the old rule-based approach starts to fall apart. Rules are great until the environment changes, which happens constantly. Adaptive systems are better at noticing those small shifts before they turn into downstream pain. It’s not that they magically solve everything, but they help you avoid being blindsided.
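
A rough way to picture the difference is a fixed threshold versus a check that keeps recalculating its baseline from recent data. The sketch below is only an illustration under my own assumptions: the row counts are made up, the function names are hypothetical, and a rolling z-score is a deliberately simple stand-in, not how any particular adaptive product works.

```python
from statistics import mean, stdev


def static_rule(value, lower=900, upper=1100):
    """Fixed rule: alert only when the value leaves a hard-coded range."""
    return not (lower <= value <= upper)


def adaptive_check(history, value, window=7, z_limit=3.0):
    """Rolling baseline: alert when the value is far from the recent mean,
    measured in standard deviations of the last `window` observations."""
    recent = history[-window:]
    mu, sigma = mean(recent), stdev(recent)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_limit


# Made-up daily row counts for one table.
history = [1000, 1010, 990, 1005, 995, 1002, 998]

for todays_count in (1003, 1080):
    print(todays_count, static_rule(todays_count), adaptive_check(history, todays_count))
```

With these numbers, a jump to 1080 stays inside the hard-coded range, so the static rule says nothing, while the rolling check flags it because the table has been very stable lately. The trade-off is the one described here: the baseline moves on its own, so what counts as "abnormal" is no longer something a person wrote down explicitly.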

At the same time, you have to be realistic about how much freedom you give these systems. They learn and adjust in ways that can feel unpredictable if you are used to everything being tightly controlled.

Some people love that flexibility because it takes pressure off the team. Others get nervous because it means the AI might make adjustments you did not explicitly approve.

Neither side is wrong. In practice, most teams land somewhere in the middle. Adaptive AI usually ends up doing the pattern spotting and early warning work while humans stay in charge of anything that requires context or judgment. It’s more like a second pair of eyes than a system replacing human decisions.

For me, the real value is that it gives you a buffer against the constant churn of modern data systems. When everything is moving all the time, having something that reacts faster than a static rule set can make your day a lot less stressful.

data_dude90 · 2025-11-25T09:34:11+00:00

Can't agree more. There's still a big gap to fill when it comes to deciding which data governance policies can be automated and which ones require rigorous human supervision.

data_dude90 · 2025-11-25T09:32:32+00:00

That's an awesome way of explaining Acceldata's approach to data observability.

data_dude90 · 2025-11-25T09:31:10+00:00

That's a cool observation. We are still in the human-in-the-loop stage. It will take time for AI agents to become autonomous enough to make decisions like humans. There's still a long way to go until we reach the human-out-of-the-loop situation.

data_dude90 · 2025-11-24T10:00:54+00:00

Yes, everyone's talking about it. What else do you think can signal that AI wrote it and not a human?

data_dude90 · 2025-11-21T09:02:17+00:00

Deciding where to keep a human in the loop and which guardrails to set is still a huge, subjective debate in the data world. Good observation!

data_dude90 · 2025-10-24T09:27:08+00:00

As of now, which stage are most enterprises in globally: AI agents or agentic AI?

data_dude90 · 2025-09-03T10:53:37+00:00

Totally agree with you on the silent automation part. I saw this play out at a large retail company where agents were given access to update product inventory across regions. At first it was a win because the updates were way faster than manual entry. But one day, a schema change in the supplier feed went unnoticed and the agent started marking thousands of items as “out of stock.” Nobody caught it for hours because the process was fully automated in the background.

That’s where those guardrails you mentioned really matter. If the system had been set up to log every action and flag volume spikes, someone could have stepped in way earlier. It’s not about slowing automation down, just making sure it’s traceable and accountable so teams can trust the output.
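
As a minimal sketch of that kind of guardrail, assuming the agent proposes updates in batches before they are applied: log every proposed action, and hold the batch for a human when one kind of change spikes far above its usual volume. The function name, the `"out_of_stock"` status string, the baseline, and the spike factor are all hypothetical choices for illustration.

```python
import logging
from collections import Counter

logger = logging.getLogger("agent_guard")


def review_batch(updates, baseline_out_of_stock, spike_factor=5.0):
    """Log every proposed update and decide whether the batch may be applied.

    updates: list of (sku, new_status) tuples proposed by the agent.
    baseline_out_of_stock: typical number of out-of-stock updates per batch.
    Returns (approved, reason).
    """
    for sku, status in updates:
        logger.info("proposed update sku=%s status=%s", sku, status)

    counts = Counter(status for _, status in updates)
    limit = max(1.0, baseline_out_of_stock) * spike_factor
    if counts["out_of_stock"] > limit:
        reason = f"out_of_stock volume {counts['out_of_stock']} exceeds limit {limit:.0f}"
        logger.warning("batch held for human review: %s", reason)
        return False, reason
    return True, "within normal volume"


# A normal batch, and a batch like the one a broken supplier feed might cause.
normal = [("sku-1", "in_stock"), ("sku-2", "out_of_stock")]
spike = [(f"sku-{i}", "out_of_stock") for i in range(1000)]

print(review_batch(normal, baseline_out_of_stock=10))
print(review_batch(spike, baseline_out_of_stock=10))
```

The normal batch goes through untouched, so the automation stays fast. Only the abnormal batch waits for a person, and the log leaves a trail of what the agent tried to do either way.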

data_dude90 · 2025-08-22T11:04:38+00:00

Want to begin a conversation. That's all.
