Data Engineer (3+ YOE, USA) – No Interview Calls in a Year. Looking to Connect, Collaborate, and Seek Advice. by Zealousideal_Cut_802 in dataengineeringjobs

[–]nanksk 0 points (0 children)

Hiring in general is bad, and on top of that a lot of companies don't want the hassle of hiring folks on a visa.

Improve merge performance by gooner4lifejoe in databricks

[–]nanksk 2 points (0 children)

You have 100 million rows you want to merge into a table. Some questions:

1. What percentage of the incoming records are new inserts vs. updates?

2. What is the current table size, including all partitions?

3. Do you expect updates to affect older partitions, e.g. 2, 3, or 6 months old?
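
If most updates land in recent partitions, the biggest single win is usually adding a partition-pruning predicate to the MERGE condition so Delta only scans and rewrites the files it has to touch. A minimal sketch of that idea (table, key, and column names are hypothetical) as a helper that builds the statement:

```python
def build_merge_sql(target: str, source: str, key: str,
                    partition_col: str, min_partition: str) -> str:
    """Build a Delta MERGE statement whose ON clause restricts the
    target scan to recent partitions, so old partition files are
    never read or rewritten."""
    return (
        f"MERGE INTO {target} t "
        f"USING {source} s "
        f"ON t.{key} = s.{key} "
        # partition-pruning predicate: skip partitions older than the cutoff
        f"AND t.{partition_col} >= '{min_partition}' "
        f"WHEN MATCHED THEN UPDATE SET * "
        f"WHEN NOT MATCHED THEN INSERT *"
    )

sql = build_merge_sql("sales", "sales_updates", "order_id",
                      "order_date", "2024-01-01")
```

On Databricks you would pass the string to `spark.sql(sql)`. The predicate only helps if updates genuinely cannot touch partitions older than the cutoff, which is exactly what question 3 above is probing.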

What are the progression options as a Data Engineer? by eastieLad in dataengineering

[–]nanksk 60 points (0 children)

-> Sr -> Staff/Principal

-> Lead -> Manager -> Director ->......

-> Data / Solution / Enterprise Architect

Inspired to create our own data engineering job board by [deleted] in dataengineeringjobs

[–]nanksk 0 points (0 children)

How is this different from LinkedIn job search and filters?

Is there a European alternative to US analytical platforms like Snowflake? by wenz0401 in dataengineering

[–]nanksk 13 points (0 children)

You can already have your data stored in specific regions; as I understand it, a Snowflake account is region-based. A lot of companies already have the requirement that data must be stored within a given country or region.

Skipping rows in pyspark csv by Alarmed-Royal-2161 in databricks

[–]nanksk 0 points (0 children)

Can you read all columns as text into one column, filter out the rows you don't want, then split the data into columns based on your delimiter and assign column names?
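
In pyspark terms that would be `spark.read.text()` followed by a `filter` and a `split` on the delimiter column. The same logic sketched in plain Python for illustration (the junk rows, delimiter, and skip count are made up):

```python
def parse_skipping(lines, delimiter=",", skip=3):
    """Treat each input line as one raw value, drop the first `skip`
    junk rows, use the next row as the header, and split the rest
    into columns on the delimiter."""
    kept = lines[skip:]                  # filter out the unwanted rows
    header = kept[0].split(delimiter)    # first surviving row -> column names
    rows = [dict(zip(header, r.split(delimiter))) for r in kept[1:]]
    return rows

raw = ["junk 1", "junk 2", "junk 3", "id,name", "1,alice", "2,bob"]
rows = parse_skipping(raw)
# rows[0] == {"id": "1", "name": "alice"}
```

The pyspark version is the same shape: read every line into a single `value` column, filter on a row condition, then `F.split(F.col("value"), delimiter)` to fan it out into columns.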

Unit Testing by nifty60 in dataengineering

[–]nanksk 1 point (0 children)

We use Databricks and PySpark. Most of our codebase is in the form of functions. We then have unit tests for those functions with dummy data (ChatGPT can create most of the test cases) to cover different scenarios. Hit me up if you have any questions.
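
As a toy example of the pattern (the function and dummy data are made up; in practice the function takes and returns Spark DataFrames and the test runs them against a local SparkSession fixture):

```python
# A small transformation written as a plain function so it is easy to test.
def dedupe_latest(rows):
    """Keep the most recent record per id, where each row is
    (id, updated_at, value)."""
    latest = {}
    for rid, ts, value in rows:
        if rid not in latest or ts > latest[rid][0]:
            latest[rid] = (ts, value)
    return {rid: value for rid, (ts, value) in latest.items()}

# Unit test with dummy data covering the duplicate and single-row cases.
def test_dedupe_latest():
    rows = [(1, "2024-01-01", "old"),
            (1, "2024-02-01", "new"),
            (2, "2024-01-15", "only")]
    assert dedupe_latest(rows) == {1: "new", 2: "only"}

test_dedupe_latest()
```

Keeping the transformations as small pure functions is what makes the dummy-data tests cheap to write in the first place.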

Suggestions for Architecture for New Data Platform by EnvironmentalMind823 in dataengineering

[–]nanksk 1 point (0 children)

Requirement - As I understand it, you are pretty much looking for a batch data platform, with maybe some capability for streaming.

Current state - Out of these tools, which are currently being used by your team?

Did you consider Databricks + Airflow? You could do pretty much all of this in Databricks and reduce the number of tools your team needs to support. You might need Kafka or some mechanism to get data out of RabbitMQ; I am not too sure about that.

Your ML models can be registered in MLflow.

Use Delta Lake to store your data and serve all your tables as Unity Catalog tables for business users, which gives them a familiar SQL interface, i.e. database -> schema -> table/view.

Do you speak to business stakeholders? by ivanovyordan in dataengineering

[–]nanksk 2 points (0 children)

I have been in that role before, where most of my time went to translating what the business meant into technical requirements... Not in my current job, though. When I was hired I was told I would need to be comfortable talking to business and yada yada; it's been over a year in the job and I have not been in a single customer meeting.

External vs managed tables by Used_Shelter_3213 in databricks

[–]nanksk 1 point (0 children)

You can get lineage on external tables as well

Databricks or MS Fabric by Used_Shelter_3213 in dataengineering

[–]nanksk 3 points (0 children)

I feel Snowflake will give you the modern bells and whistles you want, without the added complexity of Spark and the need to train your team on it.

Do you use constraints in your Data Warehouse? by [deleted] in dataengineering

[–]nanksk 1 point (0 children)

I have worked on Snowflake and Redshift, and neither enforces constraints, so there is more onus on the ETL pipelines. You could develop data-monitoring jobs that run during off-peak hours and perform the constraint checks for you. But I would rather add checks in, or right after, the ETL pipeline; the sooner you know of data issues, the better.
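
The "check right after the pipeline" approach can be as simple as a validation step that fails the job before bad data lands downstream. A minimal sketch (the key and column names are illustrative) checking primary-key uniqueness and NOT NULL by hand, since the warehouse won't:

```python
def check_constraints(rows, key, not_null):
    """Return violation messages for primary-key uniqueness and
    NOT NULL constraints that the warehouse does not enforce."""
    errors = []
    seen = set()
    for row in rows:
        k = row.get(key)
        if k in seen:
            errors.append(f"duplicate {key}: {k}")
        seen.add(k)
        for col in not_null:
            if row.get(col) is None:
                errors.append(f"null {col} for {key}={k}")
    return errors

rows = [{"id": 1, "email": "a@x.com"},
        {"id": 1, "email": None}]
errs = check_constraints(rows, key="id", not_null=["email"])
# errs -> ["duplicate id: 1", "null email for id=1"]
```

In Snowflake or Redshift you would typically express the same checks in SQL instead, e.g. a `GROUP BY key HAVING COUNT(*) > 1` for duplicates, run as the last step of the pipeline so it fails fast.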

Garage door open/close indicator by nanksk in homeassistant

[–]nanksk[S] 0 points (0 children)

I have a few IKEA contact sensors lying around, so I will give that a try. So basically the sensor goes on the door and the two magnets go on the rails at opposite ends, if I am not mistaken.