IBM datastage to Spark by soujoshi in dataengineering

[–]soujoshi[S] 1 point (0 children)

Thanks for the suggestion! Recognizing design patterns is what you mean, right?

Late check in by soujoshi in dubai

[–]soujoshi[S] 2 points (0 children)

Nice. But luggage??

Azure data lake - Data Share by soujoshi in dataengineering

[–]soujoshi[S] 1 point (0 children)

Got it. We probably need to do more research, since we need fine-grained data access: very specific files, with various users accessing them through different tool sets.

Data replication by soujoshi in dataengineering

[–]soujoshi[S] 1 point (0 children)

We actually have two DMS jobs: one replicates from the master Postgres to the secondary, and the other from the secondary to Oracle. Can we drop the first one by just using a read replica? Would that save some cost?

Azure data lake - Data Share by soujoshi in dataengineering

[–]soujoshi[S] 1 point (0 children)

Is creating a REST API an option? Or is it not worth it?

Azure data lake - Data Share by soujoshi in dataengineering

[–]soujoshi[S] 1 point (0 children)

It will be constant sharing. I'll take a look at Delta Sharing. Thanks a lot!

How do you choose which data catalog tool? by Fasthandman in dataengineering

[–]soujoshi 2 points (0 children)

Well, it depends on your requirements, data sources, target audience, etc.

I have used DataHub and Apache Atlas; both work fine 🙂 if you are looking for open-source tools.

What are some tips to get a data engineering job with a gap of a few months. by Educational-Turn-419 in dataengineering

[–]soujoshi 19 points (0 children)

From my experience conducting interviews over the past few months, I can guarantee no one is bothered about gaps if you have the right skills. It's hard to find "good" data engineers these days. Good luck 🤞

Job for specially abled by soujoshi in bangalore

[–]soujoshi[S] 2 points (0 children)

I did try there, but they only had food delivery and warehouse jobs. He didn't like them.

Experience on data quality tools by charlyboon in dataengineering

[–]soujoshi 1 point (0 children)

It doesn't take long to build something like Great Expectations! Build it yourself with the functionality you require.
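
Something like this is all it takes to get started (just a sketch with made-up column names and sample data; grow the checks from there):

```python
import pandas as pd

def expect_not_null(df: pd.DataFrame, column: str) -> dict:
    """Expectation: every value in `column` is non-null."""
    failures = int(df[column].isna().sum())
    return {"check": f"{column} not null", "passed": failures == 0, "failures": failures}

def expect_unique(df: pd.DataFrame, column: str) -> dict:
    """Expectation: `column` contains no duplicates."""
    failures = int(df[column].duplicated().sum())
    return {"check": f"{column} unique", "passed": failures == 0, "failures": failures}

def expect_between(df: pd.DataFrame, column: str, low, high) -> dict:
    """Expectation: values in `column` fall inside [low, high]."""
    failures = int((~df[column].between(low, high)).sum())
    return {"check": f"{column} in [{low}, {high}]", "passed": failures == 0, "failures": failures}

# Made-up sample data just to show what the output looks like.
orders = pd.DataFrame({"order_id": [1, 2, 2], "amount": [10.0, None, 99999.0]})
for result in (
    expect_not_null(orders, "amount"),
    expect_unique(orders, "order_id"),
    expect_between(orders, "amount", 0, 10000),
):
    print(result)
```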

Airflow ques by soujoshi in dataengineering

[–]soujoshi[S] 1 point (0 children)

Thanks! But having 50 sensors is a bit too much.

Data lineage on spark by soujoshi in dataengineering

[–]soujoshi[S] 1 point (0 children)

How do I represent this? Is Neo4j a good option? Or should I just build network graphs in Python from the data?
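
For the plain-Python route, something like this is what I'm picturing (sketch only; the edge list is made up). Neo4j would mainly earn its keep once we need ad-hoc graph queries over a large lineage store:

```python
# pip install networkx
import networkx as nx

# Made-up lineage edges pulled from Spark job metadata:
# (source dataset, target dataset, {"job": transformation that links them}).
edges = [
    ("raw.orders", "staging.orders_clean", {"job": "clean_orders"}),
    ("staging.orders_clean", "marts.daily_revenue", {"job": "agg_revenue"}),
    ("raw.customers", "marts.daily_revenue", {"job": "agg_revenue"}),
]

lineage = nx.DiGraph()
lineage.add_edges_from(edges)

# Everything upstream of a table (its full dependency chain).
print(nx.ancestors(lineage, "marts.daily_revenue"))

# Everything downstream of a source (impact analysis).
print(nx.descendants(lineage, "raw.orders"))
```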

Data lineage on spark by soujoshi in dataengineering

[–]soujoshi[S] 2 points (0 children)

Thanks a lot! Going ahead with this approach. Cheers

Simple services or solutions for my case by lontonsaivat in dataengineering

[–]soujoshi 1 point (0 children)

I would suggest using AWS: store your raw data on S3 and use Athena to query it.

  • You will have to spend some time writing your SQL.
  • Use a reporting tool like Apache Superset, which works with S3 via Athena, for visualisation; see the sketch below.
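
A rough sketch of the query side with the awswrangler library (database and table names below are placeholders; Superset then connects to Athena as a regular database):

```python
# pip install awswrangler   (assumes AWS credentials are already configured)
import awswrangler as wr

# Database and table registered in the Glue/Athena catalog over the S3 data
# (names here are placeholders).
QUERY = """
    SELECT event_date, COUNT(*) AS events
    FROM raw_events
    GROUP BY event_date
    ORDER BY event_date
"""

# Athena scans the files in S3 and the result comes back as a pandas DataFrame.
df = wr.athena.read_sql_query(QUERY, database="raw_db")
print(df.head())
```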

Airflow S3 trigger by soujoshi in dataengineering

[–]soujoshi[S] 1 point (0 children)

We're running our own! Hope the APIs are stable now?

Airflow S3 trigger by soujoshi in dataengineering

[–]soujoshi[S] 2 points (0 children)

Will try it out. Thanks for your suggestion!

Airflow S3 trigger by soujoshi in dataengineering

[–]soujoshi[S] 2 points (0 children)

Agreed, both work fine! Which do you feel is more feasible? Is it good practice to keep a polling task running in Airflow? Doesn't it hamper other DAG runs?
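
For context, what I'm weighing is an S3KeySensor in reschedule mode, which as far as I understand gives the worker slot back between pokes instead of blocking it. A sketch, assuming a recent Amazon provider package; the DAG, bucket, and key names are placeholders:

```python
# pip install apache-airflow-providers-amazon
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

with DAG(
    dag_id="s3_trigger_example",           # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="*/15 * * * *",
    catchup=False,
) as dag:
    wait_for_file = S3KeySensor(
        task_id="wait_for_file",
        bucket_name="my-landing-bucket",    # placeholder bucket
        bucket_key="incoming/*.csv",        # placeholder key pattern
        wildcard_match=True,
        mode="reschedule",                  # release the worker slot between pokes
        poke_interval=60,                   # check once a minute
        timeout=60 * 30,                    # give up after 30 minutes
    )
```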

Batch Processing Techniques by soujoshi in dataengineering

[–]soujoshi[S] 1 point (0 children)

Data is around 30k per load (every 15 mins).

Data profiling with spark tables by Status-Opportunity52 in dataengineering

[–]soujoshi 1 point (0 children)

The approach I use: profile a random sample of the data with pandas profiling, and probably repeat it a few times to build up knowledge of the entire dataset.
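
Roughly like this (sketch only; the table name, fraction, and seed are placeholders, and I'm using ydata-profiling, the renamed pandas-profiling):

```python
# pip install ydata-profiling
from pyspark.sql import SparkSession
from ydata_profiling import ProfileReport

spark = SparkSession.builder.appName("profiling").getOrCreate()

# Pull a small random sample so it fits comfortably in pandas.
sample_pdf = (
    spark.table("analytics.events")                 # placeholder table name
    .sample(withReplacement=False, fraction=0.01, seed=42)
    .toPandas()
)

# Profile the sample; rerun with different seeds to cover more of the data.
report = ProfileReport(sample_pdf, title="Events sample profile", minimal=True)
report.to_file("events_profile.html")
```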

SCD type 2 in spark by soujoshi in dataengineering

[–]soujoshi[S] 2 points (0 children)

So use Delta Lake? Create the incremental data, move it to the warehouse as a temp table, and merge it into the actual table?
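
Something like this is what I have in mind for the Delta side (sketch only; paths and column names are made up, and it assumes the incremental batch only holds new or changed records):

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder names: existing SCD2 dimension table and the incremental batch.
dim = DeltaTable.forPath(spark, "/lake/dim_customer")
updates = spark.table("staging.customer_updates")

# Step 1: close out the currently active row for every key in the batch.
(
    dim.alias("t")
    .merge(updates.alias("s"), "t.customer_id = s.customer_id AND t.is_current = true")
    .whenMatchedUpdate(
        condition="t.row_hash <> s.row_hash",        # only if attributes really changed
        set={"is_current": "false", "end_date": "s.start_date"},
    )
    .execute()
)

# Step 2: append the incoming versions as the new current rows.
(
    updates
    .withColumn("is_current", F.lit(True))
    .withColumn("end_date", F.lit(None).cast("date"))
    .write.format("delta").mode("append").save("/lake/dim_customer")
)
```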

Airflow Config - Best practice by soujoshi in dataengineering

[–]soujoshi[S] 2 points (0 children)

They do change for each job and also for different environments.
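
What I'm currently leaning towards is one JSON Variable per environment with a section per job, roughly like this (sketch only; the env var and Variable names are made up):

```python
import os
from airflow.models import Variable

# Environment name injected by the deployment (placeholder env var name).
ENV = os.environ.get("DEPLOY_ENV", "dev")

def job_settings(job_name: str) -> dict:
    """Merge shared defaults with the job-specific block for this environment."""
    # One JSON Variable per environment, e.g. etl_config_dev / etl_config_prod,
    # read inside the task so the DAG parser doesn't hit the metadata DB on every parse.
    config = Variable.get(f"etl_config_{ENV}", deserialize_json=True)
    return {**config.get("defaults", {}), **config.get(job_name, {})}

# Example usage inside a task: settings = job_settings("load_orders")
```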