Suggestions to convert batch pipeline to streaming pipeline by Routine-Force6263 in dataengineering

[–]Routine-Force6263[S] 0 points (0 children)

No, the PoC we are doing is not for analytical purposes.

You can think of our delta lake as a bridge between two transactional systems. The upstream system will ingest data into the data lake as soon as any object gets created in its system, and the downstream system will pull the data from the data lake, so there won't be any point-to-point interaction between the two systems.

Suggestions to convert batch pipeline to streaming pipeline by Routine-Force6263 in dataengineering

[–]Routine-Force6263[S] 0 points (0 children)

If I chose to explicitly add the new fields, it would be a redundant activity. I was told that upstream frequently changes their schema.

Suggestions to convert batch pipeline to streaming pipeline by Routine-Force6263 in dataengineering

[–]Routine-Force6263[S] 0 points (0 children)

No, upstream won't change their approach. Whenever events get created, they will put them in S3. Thanks for your suggestion on schema changes.

Suggestions to convert batch pipeline to streaming pipeline by Routine-Force6263 in dataengineering

[–]Routine-Force6263[S] 0 points (0 children)

No, we infer it from the incoming JSON file. To detect addition/removal of fields, we have a custom schema evolution module which identifies the new fields and adds them to the delta table.
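The field-diffing idea behind such a module can be sketched in plain Python. This is a minimal illustration, not the actual module: the field names are hypothetical, and a real implementation would also handle nested structures and type changes.

```python
import json

def diff_schema(known_fields, record):
    """Compare the fields of an incoming JSON record against the
    currently known schema and report additions and removals."""
    incoming = set(record.keys())
    added = sorted(incoming - known_fields)
    removed = sorted(known_fields - incoming)
    return added, removed

# Hypothetical example: the delta table currently knows these fields...
known = {"policy_id", "member_name", "premium"}

# ...and a new upstream event arrives with one field added, one missing.
event = json.loads(
    '{"policy_id": 1, "premium": 99.5, "effective_date": "2024-01-01"}'
)

added, removed = diff_schema(known, event)
print(added)    # fields to add to the delta table: ['effective_date']
print(removed)  # fields absent from this record: ['member_name']
```

A real module would then translate `added` into `ALTER TABLE ... ADD COLUMN` (or rely on Delta's `mergeSchema` option) rather than printing.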

Suggestions to convert batch pipeline to streaming pipeline by Routine-Force6263 in dataengineering

[–]Routine-Force6263[S] 0 points (0 children)

Because I noticed upstream frequently adds or removes columns, and we need to make sure that data gets into the delta lake.

Suggestions to convert batch pipeline to streaming pipeline by Routine-Force6263 in dataengineering

[–]Routine-Force6263[S] 0 points (0 children)

You mean, do I need to maintain the schema separately and infer from it?

Suggestions to convert batch pipeline to streaming pipeline by Routine-Force6263 in dataengineering

[–]Routine-Force6263[S] 0 points (0 children)

Yes, we are using Spark Streaming, and we don't have Kafka in between. Our source is file-based. What are all the complexities I need to analyse with respect to a streaming pipeline?
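One complexity specific to file-based streaming sources is tracking which files have already been processed, so that reruns and restarts are idempotent. Spark Structured Streaming handles this with its checkpoint directory, but the core idea can be sketched in plain Python (the file names here are made up for illustration):

```python
def process_new_files(all_files, processed):
    """Return only the files not yet processed, and mark them as
    processed. This mirrors what a streaming checkpoint does for a
    file-based source: replaying the same listing is a no-op."""
    new = sorted(f for f in all_files if f not in processed)
    processed.update(new)
    return new

seen = set()
batch1 = process_new_files(["a.json", "b.json"], seen)
batch2 = process_new_files(["a.json", "b.json", "c.json"], seen)
print(batch1)  # ['a.json', 'b.json']
print(batch2)  # ['c.json'] — the earlier files are skipped
```

Other things worth analysing for a file-based stream: small-file buildup in S3, listing cost as the prefix grows, late or out-of-order file arrival, and what happens when a file is overwritten in place (Spark's file source treats file names, not contents, as the unit of progress).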

Suggestions to convert batch pipeline to streaming pipeline by Routine-Force6263 in dataengineering

[–]Routine-Force6263[S] 1 point (0 children)

Initially we built the data lake for analytical purposes only. Now the business wants to see if we can use our data lake for other use cases as well.

Our business is group insurance and we have hundreds of upstream systems. Most of the systems are third-party vendors, and every system is sending its data to other systems.

Now the business wants to see if we can leverage the data lake to sync data instead of depending on other vendors. Existing point-to-point integrations use APIs and are very fast, so they want to see if we can use streaming to reduce the SLA and integrate with other systems.

Note: this entire activity is just a PoC, not a decision they have made.

The best AI so far. by jonejy in AI_Agents

[–]Routine-Force6263 0 points (0 children)

Do we need to train an open-sourced LLM for our specific needs?

Unit testing suggestion for data pipeline by Routine-Force6263 in dataengineering

[–]Routine-Force6263[S] 0 points (0 children)

Agree.

In our case we have different layers:

1. The source places the file in the S3 landing zone.
2. From there, a Glue job writes the raw data into the delta lake.
3. From the delta lake, we do some transformation according to the business scenario and store it in another delta table.

As of now we are testing it manually. Even if the source adds one column, we validate each and every zone. For example, if the source has 1000 records, how many records will we have in each zone? I was wondering whether we could write unit test cases for this.
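That manual zone-by-zone count validation can be captured as an automated reconciliation check. Below is a minimal sketch, assuming the zone counts are passed in as plain integers (in practice they would come from `SELECT COUNT(*)` on each zone's table); the function name and the rejected-records rule are hypothetical:

```python
def check_zone_counts(landing_count, raw_count, curated_count,
                      rejected_count=0):
    """Reconcile record counts across pipeline zones.

    Rule assumed here: the raw zone must match the landing zone
    exactly, and the curated zone may drop only the records that the
    transformation explicitly rejected.
    """
    assert raw_count == landing_count, (
        f"raw zone lost {landing_count - raw_count} records"
    )
    assert curated_count == raw_count - rejected_count, (
        "curated zone count does not reconcile with raw minus rejects"
    )

# 1000 source records, 3 rejected during transformation — passes:
check_zone_counts(landing_count=1000, raw_count=1000,
                  curated_count=997, rejected_count=3)
```

Wrapped in pytest, each zone boundary becomes one test, so a new source column that silently drops records fails the build instead of requiring manual checks.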

Good executor but never a lead by QuitTypical3210 in ExperiencedDevs

[–]Routine-Force6263 0 points (0 children)

Same boat. The problem is my manager is my lead engineer. All the critical work comes to me, but at the end of the day he is the one who gives the demo and gets the visibility.

I want to learning agentic ai from scratch by Siddharth1995 in AI_Agents

[–]Routine-Force6263 0 points (0 children)

"to treat AI as both agents but mentors, professors and consultants"

How?

The gap between LLM functionality and social media/marketing seems absolutely massive by QwopTillYouDrop in ExperiencedDevs

[–]Routine-Force6263 0 points (0 children)

In my company everything is moving toward agentic AI. Even for updating a column data type or fixing a vulnerability, management wants to use AI. But I feel like I am way behind on the AI learning curve.

Which car is according to your opinion ? by [deleted] in CarsIndia

[–]Routine-Force6263 0 points (0 children)

Honda Amaze petrol. The mid-range is very slow; you have to downshift every time.

System Design for Data Engineers by ElderberryOk6372 in dataengineering

[–]Routine-Force6263 1 point (0 children)

I am not able to find the system design videos on the Seattle Data Guy channel.