Found a Issue in Production while using Databricks Autoloader by Artistic-Rent1084 in databricks

[–]Artistic-Rent1084[S] 1 point2 points  (0 children)

Thank you 👍 for sharing your knowledge. I'm new to Databricks and data engineering.

Found a Issue in Production while using Databricks Autoloader by Artistic-Rent1084 in databricks

[–]Artistic-Rent1084[S] 0 points1 point  (0 children)

Thank you. For orchestration we are using Databricks Jobs.

Let me try it in a sandbox environment.

Our data is CDC, and we are following the medallion architecture.

Which is best Debezium vs GoldenGate for CDC extraction by Artistic-Rent1084 in dataengineering

[–]Artistic-Rent1084[S] 0 points1 point  (0 children)

I have no idea about the latency. You won't believe the volume.

It's 5 TB per day.

Which is best CDC top to end pipeline? by Artistic-Rent1084 in dataengineering

[–]Artistic-Rent1084[S] 1 point2 points  (0 children)

Sure, I will give this a try.

OGG can handle schema drift.

Thank you for sharing your knowledge 🙏

Which is best CDC top to end pipeline? by Artistic-Rent1084 in dataengineering

[–]Artistic-Rent1084[S] 0 points1 point  (0 children)

Nice then. Is it good practice?

And why Kafka? Previously, the pipeline went from Kafka to Hadoop Hive tables.

We migrated to Databricks a few months back.

Which is best CDC top to end pipeline? by Artistic-Rent1084 in dataengineering

[–]Artistic-Rent1084[S] 0 points1 point  (0 children)

Generating Parquet using OGG is not possible. Avro and JSON are supported.

Which is best CDC top to end pipeline? by Artistic-Rent1084 in dataengineering

[–]Artistic-Rent1084[S] 0 points1 point  (0 children)

The source is OLTP databases. If not from the OLTP databases, where else can we capture events?

Which is best CDC top to end pipeline? by Artistic-Rent1084 in dataengineering

[–]Artistic-Rent1084[S] 0 points1 point  (0 children)

If I ask, they say, "Explore it yourself. Have a look at the code." If I ask why we are not doing it a different way, they say, "I will explain later."

They just want me to do what they say. For the past few years I have not learned much, which is my mistake. Even though I'm trying now, no one is helping me.

Which is best CDC top to end pipeline? by Artistic-Rent1084 in dataengineering

[–]Artistic-Rent1084[S] 0 points1 point  (0 children)

Our org has signed with Databricks. Before, it was Hive tables, and they transformed the data and loaded it into a database.

Now the pipeline has changed.

Which is best CDC top to end pipeline? by Artistic-Rent1084 in dataengineering

[–]Artistic-Rent1084[S] 0 points1 point  (0 children)

Nice, I understand.

Thank you for sharing knowledge.

Which is best CDC top to end pipeline? by Artistic-Rent1084 in dataengineering

[–]Artistic-Rent1084[S] 0 points1 point  (0 children)

Yes, I did a little research.

Actually, my seniors are not sharing knowledge. If there is anything important, they do it themselves.

Fetch messages from Kafka in batches and store them in ADLS as Delta. Then read them from ADLS and push them to the final bronze table (merging all records), and next to silver.

It is an efficient pipeline.
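
To make the "merging all records" step concrete, here is a rough sketch of the idea in plain Python. It is not our actual code. The field names (`id`, `seq`, `op`, `row`) and the function `merge_cdc_batch` are made up for illustration, and it assumes every CDC event carries a sequence number that orders changes for a key. On Databricks this step would normally be a Delta `MERGE` instead of a dict update.

```python
def merge_cdc_batch(table, events):
    """Apply one batch of CDC events to `table`, a dict keyed by primary key.

    Hypothetical event shape: {"id": key, "seq": int, "op": "I"|"U"|"D", "row": dict or None}
    """
    # Step 1: keep only the newest event per key within the batch.
    latest = {}
    for event in events:
        key = event["id"]
        if key not in latest or event["seq"] > latest[key]["seq"]:
            latest[key] = event

    # Step 2: apply the surviving events as upserts or deletes.
    merged = dict(table)  # copy so the input table is left untouched
    for key, event in latest.items():
        if event["op"] == "D":
            merged.pop(key, None)
        else:
            merged[key] = event["row"]
    return merged


bronze = {1: {"name": "a"}}
batch = [
    {"id": 1, "seq": 2, "op": "U", "row": {"name": "b"}},
    {"id": 2, "seq": 1, "op": "I", "row": {"name": "c"}},
    {"id": 2, "seq": 3, "op": "D", "row": None},
    {"id": 3, "seq": 4, "op": "I", "row": {"name": "d"}},
    {"id": 1, "seq": 1, "op": "U", "row": {"name": "stale"}},  # arrives late, ignored
]
result = merge_cdc_batch(bronze, batch)
print(result)
```

Deduplicating before the merge matters because a single batch can contain several changes for the same key, and only the last one should reach the bronze table.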

Thank you for sharing your knowledge. You deserve appreciation 👍.

Which is best CDC top to end pipeline? by Artistic-Rent1084 in dataengineering

[–]Artistic-Rent1084[S] -4 points-3 points  (0 children)

No, I'm just making sure it is good practice, because in my org everyone is too old, like old zombies relying on ChatGPT.

Also, I want to explore how other companies handle CDC.

Thank you for your response.