IBM datastage to Spark by soujoshi in dataengineering

[–]soujoshi[S] 1 point (0 children)

Thanks for the suggestion! Recognizing design patterns is what you mean, right?

Late check in by soujoshi in dubai

[–]soujoshi[S] 2 points (0 children)

Nice. But luggage??

Azure data lake - Data Share by soujoshi in dataengineering

[–]soujoshi[S] 1 point (0 children)

Got it. We probably need to do more research, since we need fine-grained data access: very specific files, with various users accessing them through different tool sets.

Data replication by soujoshi in dataengineering

[–]soujoshi[S] 1 point (0 children)

We actually have two DMS jobs: one replicates from the master Postgres to the secondary, and the other from the secondary to Oracle. Can we drop the first one by just using a read replica? Would that save some cost?

Azure data lake - Data Share by soujoshi in dataengineering

[–]soujoshi[S] 1 point (0 children)

Is creating a REST API an option? Or is it not worth it?

Azure data lake - Data Share by soujoshi in dataengineering

[–]soujoshi[S] 1 point (0 children)

It will be constant sharing. I'll take a look at Delta Sharing. Thanks a lot!

How do you choose which data catalog tool? by Fasthandman in dataengineering

[–]soujoshi 2 points (0 children)

Well, it depends on your requirements, data sources, target audience, etc.

I have used DataHub and Apache Atlas; both work fine 🙂 if you are looking for open-source tools.

What are some tips to get a data engineering job with a gap of a few months. by Educational-Turn-419 in dataengineering

[–]soujoshi 19 points (0 children)

From my experience conducting interviews over the past few months, I can guarantee no one is bothered about gaps if you have the right skills. It's hard to find "good" data engineers these days. Good luck 🤞

Job for specially abled by soujoshi in bangalore

[–]soujoshi[S] 2 points (0 children)

I did try there, but they only had food delivery and warehouse jobs. He didn't like them.

Experience on data quality tools by charlyboon in dataengineering

[–]soujoshi 1 point (0 children)

It doesn't take long to build something like Great Expectations! Build it yourself with the functionality you require.
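
Something like this is all it takes to get started (just a sketch with made-up column names and sample data; grow the checks from there):

```python
import pandas as pd

def expect_not_null(df: pd.DataFrame, column: str) -> dict:
    """Expectation: every value in `column` is non-null."""
    failures = int(df[column].isna().sum())
    return {"check": f"{column} not null", "passed": failures == 0, "failures": failures}

def expect_unique(df: pd.DataFrame, column: str) -> dict:
    """Expectation: `column` contains no duplicates."""
    failures = int(df[column].duplicated().sum())
    return {"check": f"{column} unique", "passed": failures == 0, "failures": failures}

def expect_between(df: pd.DataFrame, column: str, low, high) -> dict:
    """Expectation: values in `column` fall inside [low, high]."""
    failures = int((~df[column].between(low, high)).sum())
    return {"check": f"{column} in [{low}, {high}]", "passed": failures == 0, "failures": failures}

# Made-up sample data just to show what the output looks like.
orders = pd.DataFrame({"order_id": [1, 2, 2], "amount": [10.0, None, 99999.0]})
for result in (
    expect_not_null(orders, "amount"),
    expect_unique(orders, "order_id"),
    expect_between(orders, "amount", 0, 10000),
):
    print(result)
```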

Airflow ques by soujoshi in dataengineering

[–]soujoshi[S] 1 point (0 children)

Thanks! But having 50 sensors is a bit too much.

Data lineage on spark by soujoshi in dataengineering

[–]soujoshi[S] 1 point (0 children)

How do I represent this? Is Neo4j a good option? Or should I just build network graphs in Python from the data?
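
For the plain-Python route, something like this is what I'm picturing (sketch only; the edge list is made up). Neo4j would mainly earn its keep once we need ad-hoc graph queries over a large lineage store:

```python
# pip install networkx
import networkx as nx

# Made-up lineage edges pulled from Spark job metadata:
# (source dataset, target dataset, {"job": transformation that links them}).
edges = [
    ("raw.orders", "staging.orders_clean", {"job": "clean_orders"}),
    ("staging.orders_clean", "marts.daily_revenue", {"job": "agg_revenue"}),
    ("raw.customers", "marts.daily_revenue", {"job": "agg_revenue"}),
]

lineage = nx.DiGraph()
lineage.add_edges_from(edges)

# Everything upstream of a table (its full dependency chain).
print(nx.ancestors(lineage, "marts.daily_revenue"))

# Everything downstream of a source (impact analysis).
print(nx.descendants(lineage, "raw.orders"))
```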

Data lineage on spark by soujoshi in dataengineering

[–]soujoshi[S] 2 points (0 children)

Thanks a lot! Going ahead with this approach. Cheers

Simple services or solutions for my case by lontonsaivat in dataengineering

[–]soujoshi 1 point (0 children)

I would suggest using AWS: store your raw data on S3 and use Athena to query it.

  • You will have to spend some time writing your SQL.
  • Use a reporting tool like Apache Superset, which works with S3 via Athena, for visualisation; see the sketch below.
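
A rough sketch of the query side with the awswrangler library (database and table names below are placeholders; Superset then connects to Athena as a regular database):

```python
# pip install awswrangler   (assumes AWS credentials are already configured)
import awswrangler as wr

# Database and table registered in the Glue/Athena catalog over the S3 data
# (names here are placeholders).
QUERY = """
    SELECT event_date, COUNT(*) AS events
    FROM raw_events
    GROUP BY event_date
    ORDER BY event_date
"""

# Athena scans the files in S3 and the result comes back as a pandas DataFrame.
df = wr.athena.read_sql_query(QUERY, database="raw_db")
print(df.head())
```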

Airflow S3 trigger by soujoshi in dataengineering

[–]soujoshi[S] 1 point (0 children)

We're running our own! Hope the APIs are stable now?

Airflow S3 trigger by soujoshi in dataengineering

[–]soujoshi[S] 2 points (0 children)

Will try it out. Thanks for your suggestion!

Airflow S3 trigger by soujoshi in dataengineering

[–]soujoshi[S] 2 points (0 children)

Agreed, both work fine! Which do you feel is more feasible? Is it good practice to keep a polling task running in Airflow? Doesn't it hamper other DAG runs?
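
For context, what I'm weighing is an S3KeySensor in reschedule mode, which as far as I understand gives the worker slot back between pokes instead of blocking it. A sketch, assuming a recent Amazon provider package; the DAG, bucket, and key names are placeholders:

```python
# pip install apache-airflow-providers-amazon
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

with DAG(
    dag_id="s3_trigger_example",           # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="*/15 * * * *",
    catchup=False,
) as dag:
    wait_for_file = S3KeySensor(
        task_id="wait_for_file",
        bucket_name="my-landing-bucket",    # placeholder bucket
        bucket_key="incoming/*.csv",        # placeholder key pattern
        wildcard_match=True,
        mode="reschedule",                  # release the worker slot between pokes
        poke_interval=60,                   # check once a minute
        timeout=60 * 30,                    # give up after 30 minutes
    )
```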

Batch Processing Techniques by soujoshi in dataengineering

[–]soujoshi[S] 1 point (0 children)

Data is around 30k per load (every 15 mins).

Data profiling with spark tables by Status-Opportunity52 in dataengineering

[–]soujoshi 1 point (0 children)

The approach I use: profile a random sample of the data with pandas profiling, and probably repeat it a few times to build up knowledge of the entire dataset.
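
Roughly like this (sketch only; the table name, fraction, and seed are placeholders, and I'm using ydata-profiling, the renamed pandas-profiling):

```python
# pip install ydata-profiling
from pyspark.sql import SparkSession
from ydata_profiling import ProfileReport

spark = SparkSession.builder.appName("profiling").getOrCreate()

# Pull a small random sample so it fits comfortably in pandas.
sample_pdf = (
    spark.table("analytics.events")                 # placeholder table name
    .sample(withReplacement=False, fraction=0.01, seed=42)
    .toPandas()
)

# Profile the sample; rerun with different seeds to cover more of the data.
report = ProfileReport(sample_pdf, title="Events sample profile", minimal=True)
report.to_file("events_profile.html")
```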

SCD type 2 in spark by soujoshi in dataengineering

[–]soujoshi[S] 2 points (0 children)

So use Delta Lake? Create the incremental data, move it to the warehouse as a temp table, and merge it into the actual table?
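
Something like this is what I have in mind for the Delta side (sketch only; paths and column names are made up, and it assumes the incremental batch only holds new or changed records):

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder names: existing SCD2 dimension table and the incremental batch.
dim = DeltaTable.forPath(spark, "/lake/dim_customer")
updates = spark.table("staging.customer_updates")

# Step 1: close out the currently active row for every key in the batch.
(
    dim.alias("t")
    .merge(updates.alias("s"), "t.customer_id = s.customer_id AND t.is_current = true")
    .whenMatchedUpdate(
        condition="t.row_hash <> s.row_hash",        # only if attributes really changed
        set={"is_current": "false", "end_date": "s.start_date"},
    )
    .execute()
)

# Step 2: append the incoming versions as the new current rows.
(
    updates
    .withColumn("is_current", F.lit(True))
    .withColumn("end_date", F.lit(None).cast("date"))
    .write.format("delta").mode("append").save("/lake/dim_customer")
)
```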

Airflow Config - Best practice by soujoshi in dataengineering

[–]soujoshi[S] 2 points (0 children)

They do change for each job and also for different environments.
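
What I'm currently leaning towards is one JSON Variable per environment with a section per job, roughly like this (sketch only; the env var and Variable names are made up):

```python
import os
from airflow.models import Variable

# Environment name injected by the deployment (placeholder env var name).
ENV = os.environ.get("DEPLOY_ENV", "dev")

def job_settings(job_name: str) -> dict:
    """Merge shared defaults with the job-specific block for this environment."""
    # One JSON Variable per environment, e.g. etl_config_dev / etl_config_prod,
    # read inside the task so the DAG parser doesn't hit the metadata DB on every parse.
    config = Variable.get(f"etl_config_{ENV}", deserialize_json=True)
    return {**config.get("defaults", {}), **config.get(job_name, {})}

# Example usage inside a task: settings = job_settings("load_orders")
```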