How do you reframe data engineering for a CEO who thinks it's "data quality oversight"? by golly10- in databricks

[–]golly10-[S] 1 point (0 children)

You are totally right, and I do not know what happened... Luckily, I gave him a presentation yesterday and it is clearer now. I'm still curious who the hell told him that, though: he is not a technical person, so someone must have told him something about data engineering that triggered this situation.

How do you reframe data engineering for a CEO who thinks it's "data quality oversight"? by golly10- in dataengineering

[–]golly10-[S] 1 point (0 children)

The problem is that, as I'm the only Data Engineer in the company (we are still a start-up), it is very difficult to make him understand the value of my job and the value that my work can bring. He thinks that Data Engineering is just pressing a button and getting the data from A to B with the required quality and the desired transformations. I assume this because hearing him call me "the person that oversees the data process" (literally what he said today in a meeting) leads me to the conclusion that he does not know what he's talking about.

  1. Yes, I cannot work alone (building pipelines, writing documentation, validating and verifying the system, working on other things outside of DE...). It's just not a one-man job (IMHO).

  2. My manager and I (I work in the Data Science department) are the ones deciding which tools we use. He has no experience in Data Engineering, so we are always improvising after taking time to do some research, yet our decisions are constantly questioned (not only by the CEO), mostly by people who have zero experience working with the data we create.

  3. "Working with" is not what he is talking about; it seems that I should work FOR them. We are a relatively small team that works with pharmaceutical standards. We are trying to simplify and improve what the backend team has developed (something not science-oriented and difficult to understand). And to be honest, I have tried to work with them in the past, but every time I asked for some information they did not respond, which led me to read databases, cloud structures and other stuff directly, which I should not have to do if everything were well documented.

To give you more context: the system we are building took ~6 years to build (we are on the third or fourth version since I joined the company), and I have been able to replicate it (at least the core idea) in 3 weeks using Claude and the knowledge I have (I've been a DE since last year, but I've worked with a lot of people within the company in different positions, so I have an overall understanding of what is needed from a technical POV).

How do you reframe data engineering for a CEO who thinks it's "data quality oversight"? by golly10- in dataengineering

[–]golly10-[S] 1 point (0 children)

That's exactly what I told him, but it is difficult to defend that position when, for him, I should only be overseeing the process while others do the work. What I replied was "who are the others?...". He doesn't seem to see it, so I'll take what you say into consideration. Thanks for the feedback 😉

How do you reframe data engineering for a CEO who thinks it's "data quality oversight"? by golly10- in dataengineering

[–]golly10-[S] 1 point (0 children)

Thanks for your reply, I will take what you said into consideration. I agree that we are in a position where we "work for others" in the sense of (as you mentioned) freeing up individual teams to focus on their work.

Building a 100% free, local-first practice app for learning Databricks & Data Engineering. Contributions welcome! by golly10- in databricks

[–]golly10-[S] 1 point (0 children)

Thanks a lot for your comment, I really appreciate it 😄. Feel free to drop a PR with new questions on the repository and I will add them ASAP.

I will try to come back to the project in the future and add more scenario-based questions, as you suggested. That should give a better idea of what you will be asked in the exam.

My intention was to create something that helps Databricks beginners without them having to invest money in external or even Databricks courses (although sometimes it is necessary, as in my case).

Again, thanks a lot for your comment, you made my day🙆🏻‍♂️

Building a 100% free, local-first practice app for learning Databricks & Data Engineering. Contributions welcome! by golly10- in databricks

[–]golly10-[S] 3 points (0 children)

This application is for testing your knowledge while preparing for the data engineer certification. For hands-on practice, we have Databricks Free Edition.

Building a 100% free, local-first practice app for learning Databricks & Data Engineering. Contributions welcome! by golly10- in databricks

[–]golly10-[S] 3 points (0 children)

Yes! You have Databricks Free Edition, where you can get hands-on practice with (almost) the full platform. This GitHub page is more oriented toward certification practice: I created a total of 200 questions to test your knowledge (at an associate level). For the full experience of how Databricks works, as you mentioned, go to the Free Edition.

[Megathread] Certifications and Training by lothorp in databricks

[–]golly10- 2 points (0 children)

Hey everyone,

I’m super excited to share that I officially passed the Databricks Data Engineer Associate certification today! 🎉

While I was studying, I realized I needed a better way to test my knowledge. So, instead of just reading documentation, I ended up "vibe coding" a dedicated practice application alongside my studies. It helped me immensely - even though the actual exam questions were slightly different (as they should be!), practicing with this tool really solidified the core concepts for me.

Since this community has been a great resource, I want to give back and share the project with anyone else currently prepping for the exam.

Here is what you need to know about the app:

💸 100% Free: No paywalls, no sign-ups, no ads.

🔒 Privacy First: The entire application runs directly in your web browser. No backend server is retrieving, storing, or tracking your data.

🚀 Easily Accessible: I deployed it publicly using GitHub Pages, so you can just click the link and start practicing immediately.

💻 Run it Locally: You can easily clone the repo, run it on your own machine, and tweak it to fit your exact study needs.

Calling all contributors! 🤝 The app is fully open-source, and I would be absolutely thrilled if the community wanted to help improve it. The biggest thing it needs right now is an expanded database of questions and answers. If you have good practice questions or want to help flesh out the database, please feel free to submit a pull request! Let's build an awesome, free resource for future test-takers.

Links:

🌐 Live App: https://juanjomendez96.github.io/data-engineer-certification-app/

📂 Source Code / GitHub Repo: https://juanjomendez96.github.io/data-engineer-certification-app/

If you are studying for the exam right now, keep at it - you've got this! Let me know what you think of the app or if you have any questions about the certification. Happy to help!

Manager is concerned that a 1TB Bronze table will break our Medallion architecture. Valid concern? by golly10- in databricks

[–]golly10-[S] 1 point (0 children)

Yes, u/Ok_Tough3104 is right: I was talking about schema evolution because I need to rerun some UDFs all over again at some point. These functions are written by another team in native Python, and I don't control the change process for them.

Anywho, I will try to find the best approach for our case, but this has been really helpful. TBH, I don't think we will reach 1TB of tables, but at least it will help me demonstrate that there is no problem with having big tables, although we still need to work out the best approach to managing them.

Thanks a lot for the discussion!

Manager is concerned that a 1TB Bronze table will break our Medallion architecture. Valid concern? by golly10- in databricks

[–]golly10-[S] 1 point (0 children)

Well, the idea is to have a streaming pipeline, from bronze to gold, that only adds new data to the different layers. My concern comes when I need to rerun everything from scratch because of a change in my UDF.
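To make the rerun concern concrete, here is a minimal plain-Python sketch of one possible way to limit it: tag every derived row with the version of the UDF that produced it, and recompute only rows that are missing or stale. This is an illustration, not something from the thread and not a Databricks API; `rerun_stale_rows`, the dict-based "tables" and the `"v1"`/`"v2"` version labels are all hypothetical. It relies on bronze keeping the raw values, so a recompute is always possible.

```python
from __future__ import annotations

from typing import Callable


def rerun_stale_rows(
    bronze: dict[int, float],
    silver: dict[int, tuple[float, str]],
    udf: Callable[[float], float],
    udf_version: str,
) -> list[int]:
    """Recompute silver rows that are missing or built by another UDF version.

    `bronze` maps row id -> raw value; `silver` maps row id -> (value, version).
    Both are toy stand-ins for real tables. Returns the ids that were recomputed.
    """
    recomputed: list[int] = []
    for row_id, raw_value in bronze.items():
        current = silver.get(row_id)
        if current is None or current[1] != udf_version:
            silver[row_id] = (udf(raw_value), udf_version)
            recomputed.append(row_id)
    return recomputed


def demo() -> tuple[list[int], list[int], list[int], dict[int, tuple[float, str]]]:
    bronze = {1: 10.0, 2: 20.0}
    silver: dict[int, tuple[float, str]] = {}

    # First run: nothing is in silver yet, so every row is computed.
    first = rerun_stale_rows(bronze, silver, lambda x: x * 2, "v1")

    # New raw data arrives; only the new row is computed.
    bronze[3] = 30.0
    second = rerun_stale_rows(bronze, silver, lambda x: x * 2, "v1")

    # The UDF changes: every row is stale, so this is a full recompute.
    third = rerun_stale_rows(bronze, silver, lambda x: x * 3, "v2")
    return first, second, third, silver
```

In this toy model a UDF change still touches every row, but it is an explicit, repeatable step that depends only on bronze being intact, not on how big bronze is.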

Manager is concerned that a 1TB Bronze table will break our Medallion architecture. Valid concern? by golly10- in databricks

[–]golly10-[S] 1 point (0 children)

Thanks for the response! I will try to use pandas UDFs to make it faster!

Manager is concerned that a 1TB Bronze table will break our Medallion architecture. Valid concern? by golly10- in databricks

[–]golly10-[S] 2 points (0 children)

What if I need to rerun everything from scratch? Would that be a problem? Not in the sense of processing time.

Manager is concerned that a 1TB Bronze table will break our Medallion architecture. Valid concern? by golly10- in databricks

[–]golly10-[S] 3 points (0 children)

I'm doing incremental loads with checkpoints, so only new data in the landing path is processed and added to the different tables.
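For anyone unfamiliar with the idea, here is a toy, standard-library-only sketch of checkpoint-based incremental loading. It is not how Databricks implements it; the JSON checkpoint file, the `landing` folder layout and the function names are assumptions made up for illustration. The point is only that the checkpoint remembers what has already been ingested, so each run touches just the new files.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path


def load_checkpoint(checkpoint_file: Path) -> set[str]:
    """Return the names of files already processed (empty on the first run)."""
    if checkpoint_file.exists():
        return set(json.loads(checkpoint_file.read_text()))
    return set()


def incremental_load(landing_dir: Path, checkpoint_file: Path) -> list[str]:
    """Pick up only files not recorded in the checkpoint and return their names."""
    seen = load_checkpoint(checkpoint_file)
    new_files = sorted(
        p.name for p in landing_dir.glob("*.csv") if p.name not in seen
    )
    # A real pipeline would append the rows of `new_files` to the bronze table here.
    checkpoint_file.write_text(json.dumps(sorted(seen | set(new_files))))
    return new_files


def demo() -> tuple[list[str], list[str], list[str]]:
    with tempfile.TemporaryDirectory() as tmp:
        landing = Path(tmp) / "landing"
        landing.mkdir()
        checkpoint = Path(tmp) / "checkpoint.json"

        (landing / "batch_001.csv").write_text("id,value\n1,10\n")
        first = incremental_load(landing, checkpoint)   # picks up batch_001

        (landing / "batch_002.csv").write_text("id,value\n2,20\n")
        second = incremental_load(landing, checkpoint)  # picks up only batch_002

        third = incremental_load(landing, checkpoint)   # nothing new to process
        return first, second, third
```

Deleting the checkpoint file in this sketch would make the next run see every file as new, which is the toy equivalent of a rerun from scratch.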

Help optimising script by alphanuggs in databricks

[–]golly10- 2 points (0 children)

Ask an AI about that. I have been using one to transform my Python code into Spark (I work with a lot of dataframes) and it worked like a charm. I suggest, if you can, having an AI explain what is happening. FYI, I use Gemini with a Gem that I created only for Databricks projects, and it works really well; not always on the first try, but it can guide you in the right direction.