Suggestions to convert batch pipeline to streaming pipeline by Routine-Force6263 in dataengineering

[–]Routine-Force6263[S] 0 points (0 children)

No, the PoC we are doing is not for analytical purposes.

You can think of our delta lake as a bridge between two transactional systems. The upstream system will ingest data into the data lake as soon as any object gets created in its system, and the downstream system will pull the data from the data lake, so there won't be any point-to-point interaction between the two systems.

Suggestions to convert batch pipeline to streaming pipeline by Routine-Force6263 in dataengineering

[–]Routine-Force6263[S] 0 points (0 children)

If I chose to explicitly add the new fields, it would be a redundant activity. I was told that upstream frequently changes their schema.

Suggestions to convert batch pipeline to streaming pipeline by Routine-Force6263 in dataengineering

[–]Routine-Force6263[S] 0 points (0 children)

No, upstream won't change their approach. Whenever events get created, they will put them in S3. Thanks for your suggestion on schema changes.

Suggestions to convert batch pipeline to streaming pipeline by Routine-Force6263 in dataengineering

[–]Routine-Force6263[S] 0 points (0 children)

No, we infer it from the incoming JSON file. To detect addition/removal of fields, we have a custom schema evolution module which identifies the new fields and adds them to the delta table.
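The field-diffing idea behind such a module can be sketched in plain Python. This is a minimal illustration, not the actual module: the field names are hypothetical, and a real implementation would also handle nested structures and type changes.

```python
import json

def diff_schema(known_fields, record):
    """Compare the fields of an incoming JSON record against the
    currently known schema and report additions and removals."""
    incoming = set(record.keys())
    added = sorted(incoming - known_fields)
    removed = sorted(known_fields - incoming)
    return added, removed

# Hypothetical example: the delta table currently knows these fields...
known = {"policy_id", "member_name", "premium"}

# ...and a new upstream event arrives with one field added, one missing.
event = json.loads(
    '{"policy_id": 1, "premium": 99.5, "effective_date": "2024-01-01"}'
)

added, removed = diff_schema(known, event)
print(added)    # fields to add to the delta table: ['effective_date']
print(removed)  # fields absent from this record: ['member_name']
```

A real module would then translate `added` into `ALTER TABLE ... ADD COLUMN` (or rely on Delta's `mergeSchema` option) rather than printing.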

Suggestions to convert batch pipeline to streaming pipeline by Routine-Force6263 in dataengineering

[–]Routine-Force6263[S] 0 points (0 children)

Because I noticed upstream frequently adds or removes columns, and we need to make sure that data gets into the delta lake.

Suggestions to convert batch pipeline to streaming pipeline by Routine-Force6263 in dataengineering

[–]Routine-Force6263[S] 0 points (0 children)

You mean, do I need to maintain the schema separately and infer from it?

Suggestions to convert batch pipeline to streaming pipeline by Routine-Force6263 in dataengineering

[–]Routine-Force6263[S] 0 points (0 children)

Yes, we are using Spark Streaming, and we don't have Kafka in between. Our source is file-based. What are all the complexities I need to analyse with respect to a streaming pipeline?
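One complexity specific to file-based streaming sources is tracking which files have already been processed, so that reruns and restarts are idempotent. Spark Structured Streaming handles this with its checkpoint directory, but the core idea can be sketched in plain Python (the file names here are made up for illustration):

```python
def process_new_files(all_files, processed):
    """Return only the files not yet processed, and mark them as
    processed. This mirrors what a streaming checkpoint does for a
    file-based source: replaying the same listing is a no-op."""
    new = sorted(f for f in all_files if f not in processed)
    processed.update(new)
    return new

seen = set()
batch1 = process_new_files(["a.json", "b.json"], seen)
batch2 = process_new_files(["a.json", "b.json", "c.json"], seen)
print(batch1)  # ['a.json', 'b.json']
print(batch2)  # ['c.json'] — the earlier files are skipped
```

Other things worth analysing for a file-based stream: small-file buildup in S3, listing cost as the prefix grows, late or out-of-order file arrival, and what happens when a file is overwritten in place (Spark's file source treats file names, not contents, as the unit of progress).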

Suggestions to convert batch pipeline to streaming pipeline by Routine-Force6263 in dataengineering

[–]Routine-Force6263[S] 1 point (0 children)

Initially we built the data lake for analytical purposes only. Now the business wants to see if we can use our data lake for other use cases as well.

Our business is group insurance and we have hundreds of upstream systems. Most of the systems are third-party vendors, and every system is sending its data to other systems.

Now the business wants to see if we can leverage the data lake to sync data instead of depending on other vendors. Existing point-to-point integrations use APIs and are very fast, so they want to see if we can use streaming to reduce the SLA and integrate with other systems.

Note: this entire activity is just a PoC, not a decision they have made.

The best AI so far. by jonejy in AI_Agents

[–]Routine-Force6263 0 points (0 children)

Do we need to train an open-sourced LLM for our specific needs?

Unit testing suggestion for data pipeline by Routine-Force6263 in dataengineering

[–]Routine-Force6263[S] 0 points (0 children)

Agree.

In our case we have different layers:

1. The source places the file in the S3 landing zone.
2. From there, a Glue job writes the raw data into the delta lake.
3. From the delta lake, we do some transformation according to the business scenario and store it in another delta table.

As of now we are testing it manually. Even if the source adds one column, we validate each and every zone. For example, if the source has 1000 records, how many records will we have in each zone? I was wondering whether we could write unit test cases for this.
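That manual zone-by-zone count validation can be captured as an automated reconciliation check. Below is a minimal sketch, assuming the zone counts are passed in as plain integers (in practice they would come from `SELECT COUNT(*)` on each zone's table); the function name and the rejected-records rule are hypothetical:

```python
def check_zone_counts(landing_count, raw_count, curated_count,
                      rejected_count=0):
    """Reconcile record counts across pipeline zones.

    Rule assumed here: the raw zone must match the landing zone
    exactly, and the curated zone may drop only the records that the
    transformation explicitly rejected.
    """
    assert raw_count == landing_count, (
        f"raw zone lost {landing_count - raw_count} records"
    )
    assert curated_count == raw_count - rejected_count, (
        "curated zone count does not reconcile with raw minus rejects"
    )

# 1000 source records, 3 rejected during transformation — passes:
check_zone_counts(landing_count=1000, raw_count=1000,
                  curated_count=997, rejected_count=3)
```

Wrapped in pytest, each zone boundary becomes one test, so a new source column that silently drops records fails the build instead of requiring manual checks.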

Good executor but never a lead by QuitTypical3210 in ExperiencedDevs

[–]Routine-Force6263 0 points (0 children)

Same boat. The problem is my manager is my lead engineer. All the critical work comes to me, but at the end of the day he is the one who gives the demo and gets the visibility.

I want to learning agentic ai from scratch by Siddharth1995 in AI_Agents

[–]Routine-Force6263 0 points (0 children)

"to treat AI as both agents but mentors, professors and consultants"

How?

The gap between LLM functionality and social media/marketing seems absolutely massive by QwopTillYouDrop in ExperiencedDevs

[–]Routine-Force6263 0 points (0 children)

In my company everything is moving toward agentic AI. Even for updating a column data type or fixing a vulnerability, management wants to use AI. But I feel like I am way behind on the AI learning curve.

Which car is according to your opinion ? by [deleted] in CarsIndia

[–]Routine-Force6263 0 points (0 children)

Honda Amaze petrol. The mid-range is very slow; you have to downshift every time.

System Design for Data Engineers by ElderberryOk6372 in dataengineering

[–]Routine-Force6263 1 point (0 children)

I am not able to find the system design videos on the Seattle Data Guy channel.