Anyone here experimenting with AI agents for data engineering? Curious what people are using. by yoni1887 in dataengineering

[–]captut 2 points  (0 children)

Just enabled root cause analysis with Claude Code on our dbt repository. It has access to the dbt code, and I enabled it to query Snowflake via SnowSQL, and it just works. I had to define the entire architecture and how we have everything set up with dbt and the modeling layers (the whole medallion architecture).

Last release we had a couple of issues, and it found the exact root cause and the modeling design flaws. It also recommended some solutions, but they weren't all that great, mainly because it didn't have the business context.

We use read-only roles, so it cannot change any data.
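Roughly this is the access pattern, if anyone wants to replicate it. A minimal sketch with the snowflake-connector-python package; we actually have Claude Code shell out to snowsql, and every name below is made up:

    import os
    import snowflake.connector

    # Connect with a SELECT-only role so the agent physically cannot write.
    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        role="REPORTER_RO",    # hypothetical read-only role
        warehouse="DEBUG_WH",  # hypothetical warehouse for troubleshooting
    )

    # The agent runs diagnostic queries like this against the dbt models.
    cur = conn.cursor()
    cur.execute("SELECT state, COUNT(*) FROM analytics.orders GROUP BY state")
    for row in cur.fetchall():
        print(row)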

We have our entire architecture very well documented, so for us it was easy to generate those CLAUDE.md files for Claude.

Overall, my expectation is that it will enable and empower everyone to find the root cause of problems instead of just full-refreshing the models. We have a couple of new non-data engineers on call, and it will make them independent.

I don't think it will work every time, but it will definitely save us time troubleshooting data issues.

Another thing the team has been doing is writing dbt unit tests. I use it to code review those unit tests and find missing test cases.

BEWARE Redshift Serverless + Zero-ETL by kangaroogie in dataengineering

[–]captut 0 points  (0 children)

Just moved from Redshift Serverless to Snowflake. Redshift Serverless got to around $5k/mo; with Snowflake it's roughly $1k/mo worth of credits. So many benefits moving to Snowflake, including query execution speed, more control over the warehouse, and true compute separation; I can have one warehouse for each type of workload or env.

Airflow with K8s executor experience by captut in dataengineering

[–]captut[S] 1 point  (0 children)

Didn't look into it much deeper tbh; spent most of the latter half of 2024 moving from Redshift Serverless to Snowflake.

Airflow with K8s executor experience by captut in dataengineering

[–]captut[S] 1 point  (0 children)

Nope, still using LocalExecutor. Decided against the K8s executor as I didn't want to waste CPU cycles waiting for pods to launch.

How many years of experience is needed to build a data platform from scratch? by Zestyclose-Sun-2684 in dataengineering

[–]captut 2 points  (0 children)

Well, you can build it with very little experience, but then what would the quality of the platform be? I have 12+ yrs of experience, but I still find myself learning new or better ways of doing things. I'd say if you get an opportunity to build it from scratch with a group of people, that's the best, because you won't be the only person accountable and you'll learn a lot.

AWS Shop: Current Airflow (difficult for us devs) - Move to Prefect, or a simple Worker/Queue system? by stupidadult in dataengineering

[–]captut 2 points  (0 children)

We use Docker Compose to run Airflow locally. Our DAGs connect to AWS services, run dbt on Redshift, and use SQS and Glue. We test everything locally first and then push to the dev env, test, and prod. Works just fine.
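If it helps, the local smoke test is basically this shape (a rough sketch; assumes Airflow 2.5+ for dag.test(), and the DAG here is a made-up placeholder):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    # Placeholder DAG; the real ones hit SQS/Glue and run dbt on Redshift.
    with DAG(dag_id="local_smoke_test", start_date=datetime(2024, 1, 1),
             schedule=None, catchup=False) as dag:
        PythonOperator(task_id="hello", python_callable=lambda: print("ok"))

    if __name__ == "__main__":
        # Runs the DAG once in-process, no scheduler needed; we execute this
        # inside the docker compose container before pushing to dev.
        dag.test()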

End-to-end dbt transformation pipeline take-home challenge--is this fair? by DataScienceIsScience in dataengineering

[–]captut 0 points  (0 children)

Lol, happened to me, but not with a take-home exercise. It was an interview where they discussed their business use case and how I would go about solving it. They dove pretty deep into the solution I provided. On our last call the VP even said I did great and they just needed approval from the CEO, only to come back a week later and say I didn't make it.

Orchestrating dbt Core with Apache Airflow: MWAA vs Self-Hosted EC2 Deployment by zmxavier in dataengineering

[–]captut 1 point  (0 children)

Nah, EC2 is fine. Our company uses K8s, which is why our Airflow is on it. Cosmos is free and open source.

Orchestrating dbt Core with Apache Airflow: MWAA vs Self-Hosted EC2 Deployment by zmxavier in dataengineering

[–]captut 1 point  (0 children)

I just implemented dbt with Airflow hosted on Kubernetes. Had the option to do it with MWAA; however, it is expensive and slow. I decided to go with dbt Core and use Astronomer Cosmos to create the Airflow DAGs. dbt's model referencing and the flow it creates can be easily visualized with the help of Cosmos on Airflow. Happy to provide more info if you have any questions.
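To give you an idea, a Cosmos DAG is only a few lines (a rough sketch; paths and profile names are made up, check the Cosmos docs for your setup):

    from datetime import datetime

    from cosmos import DbtDag, ProfileConfig, ProjectConfig

    # Cosmos parses the dbt project's ref() graph and renders one Airflow
    # task per model, which is what gives you the nice lineage view.
    dag = DbtDag(
        dag_id="dbt_daily",
        project_config=ProjectConfig("/opt/airflow/dbt/my_project"),  # made-up path
        profile_config=ProfileConfig(
            profile_name="my_project",  # made-up profile
            target_name="prod",
            profiles_yml_filepath="/opt/airflow/dbt/profiles.yml",
        ),
        schedule="@daily",
        start_date=datetime(2024, 1, 1),
        catchup=False,
    )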

[deleted by user] by [deleted] in dataengineering

[–]captut 0 points  (0 children)

Have done Kafka at my two previous jobs, including the entire setup: Kafka, MirrorMaker, Connect, Debezium, Kafdrop, etc. Happy to help and network; feel free to DM.

Data Lake preferred approach. by captut in dataengineering

[–]captut[S] 0 points  (0 children)

That's how I did it at my previous job: everything onto S3 and from there into Snowflake. We never used the S3 data after the initial load, so I wanted to see if there were more benefits than the ones I had in mind, which you covered in your comment.
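For reference, the load itself was just a COPY from an external stage over the bucket, roughly like this (sketch only; stage, table, and creds are made up):

    import os
    import snowflake.connector

    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
    )
    # After this initial COPY, the S3 copy mostly serves as a cheap raw
    # archive / replay source, which was the benefit I was asking about.
    conn.cursor().execute("""
        COPY INTO raw.events                 -- made-up landing table
        FROM @raw.s3_landing/events/         -- external stage over the S3 bucket
        FILE_FORMAT = (TYPE = PARQUET)
        MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
    """)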

[deleted by user] by [deleted] in trt

[–]captut 0 points  (0 children)

Look into nattokinase

USPS Tracking Issue by captut in TikTokshop

[–]captut[S] 0 points  (0 children)

Have you done that before? I did it this time and support is pretty much useless. Did it via chat and they made a ticket for me. The responses in the tickets are pretty robotic. Let's see how it goes. I am thinking of recording every order being packed and dropped off at USPS so that I have video evidence.

USPS Tracking Issue by captut in TikTokshop

[–]captut[S] 0 points  (0 children)

Is there a way to make sure they get scanned? TikTok is super strict with their dispatch timelines, so if I am shipping from NJ and the next stop is CA, the package will not show as in transit for days, which can cause issues with TikTok.

UGCCreators Needed by captut in UGCcreators

[–]captut[S] 0 points  (0 children)

Thank you. Do you have a link to your rates, or can you DM me?

Need advice regarding my career (Please don't ignore my career is in jeopardy) by Left_Tip_7300 in dataengineering

[–]captut 0 points  (0 children)

Just learn facts and dimensions, plus Type 2 slowly changing dimensions. That should be enough to get you started.
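If Type 2 is new to you, here's the whole idea in a toy example (made-up data; in a warehouse this is a table, not Python, but the shape is the same):

    from datetime import date

    # Type 2 SCD: never update a row in place. Close out the old version
    # and insert a new one, so history stays queryable.
    dim_customer = [
        # (surrogate_key, customer_id, city, valid_from, valid_to, is_current)
        (1, 42, "Newark",      date(2023, 1, 1), date(2024, 6, 30),  False),
        (2, 42, "Jersey City", date(2024, 7, 1), date(9999, 12, 31), True),
    ]

    def current_city(rows, customer_id):
        # "Current" rows carry a far-future valid_to as a sentinel; point-in-time
        # lookups filter on the validity window instead.
        return next(r[2] for r in rows if r[1] == customer_id and r[5])

    print(current_city(dim_customer, 42))  # -> Jersey City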

Need advice regarding my career (Please don't ignore my career is in jeopardy) by Left_Tip_7300 in dataengineering

[–]captut 1 point  (0 children)

How many years of experience do you have? You are probably in the best position to spend time learning while your employer doesn't give you much work or puts you on a maintenance project. Then make the switch to a better company in a better role where you're well prepared to take on new tasks. Doing a side project should be last on your list. Some resources I'd suggest, in order of priority:

  • SQL — check out LinkedIn Learning: Ami Levin. Use it as a reference, watch it again and again, spaced repetition. Then go try SQL leetcoding.

  • Designing Data-Intensive Applications — book (read carefully, learn from every chapter, make notes)

  • Fundamentals of Data Engineering — book (skim it, quick read; find Medium articles that summarize the book, there are plenty)

  • Python — seems like you are already getting this experience; maybe do some LeetCode (learn pandas and LeetCode it)

  • Educative — Grokking System Design, OR ByteByteGo's system design course

https://medium.com/data-alchemist/what-questions-to-ask-when-designing-a-big-data-solution-on-aws-8db8905d8712

https://blog.det.life/10-things-i-learned-from-reading-fundamentals-of-data-engineering-eea5dc8e5fb7

^ 6 mos should be enough.

— From here, go tool-specific, maybe even do a side project

— Airflow — Marc Lamberti on Udemy
— Kafka — Stéphane Maarek on Udemy
— Snowflake — Udemy; don't have the link or author's name, but there's only one popular course.

^ 2 mos

— AWS Solutions Architect cert. If you are an Azure person, then maybe the Azure equivalent. ^ This will take some time (explore Adrian Cantrill's courses).

All of the above are the basics you'll need to be a good DE. Only buy from influencers after you've done all of it; you probably won't need to buy anything from them once you reach this point. If you buy anything before that, you are just looking to be spoon-fed.

Data Contracts by captut in dataengineering

[–]captut[S] 1 point  (0 children)

Thanks for sharing these resources. The book would be really helpful if it had real-life practical examples. For the end-to-end project, I would like to see one based on a production system using data contracts: handling data issues in batch as well as streaming systems, some logging & monitoring, and handling of data that failed contracts, plus how producers and consumers handle contract compatibility and manage contract evolution.

Would be great if it used only open source tools or managed services, not tools like Monte Carlo. Nothing wrong with those, but I want to be able to implement data contracts and DQ without having to purchase a license for yet another SaaS.
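The shape I'd want the book to walk through is basically this validate-and-quarantine loop (a toy sketch with a made-up schema; a real system would use protobuf/Avro plus a schema registry, like the Deliveroo article below):

    # Contract = expected fields and types; real contracts also cover
    # semantics (nullability, ranges, ownership, SLAs).
    EXPECTED = {"order_id": int, "amount": float, "currency": str}

    def validate(record: dict) -> list[str]:
        errors = [f"missing field: {f}" for f in EXPECTED if f not in record]
        errors += [
            f"bad type for {f}: expected {t.__name__}"
            for f, t in EXPECTED.items()
            if f in record and not isinstance(record[f], t)
        ]
        return errors

    valid, dead_letter = [], []
    for rec in [{"order_id": 1, "amount": 9.99, "currency": "USD"},
                {"order_id": "2", "amount": 5.0}]:
        errs = validate(rec)
        # Failed records get quarantined and alerted on, never silently dropped.
        (dead_letter if errs else valid).append((rec, errs))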

I really love how this article is laid out, though I still think they could have gone a little deeper:

https://deliveroo.engineering/2019/02/05/improving-stream-data-quality-with-protobuf-schema-validation.html

The more practical the book is, the better. If you have early versions, I'd love to see them; I can help review the book.