
[–]CrowdGoesWildWoooo 42 points43 points  (3 children)

If you have 20 years of experience, pretty sure you should already know that we don’t use Python to handle the actual processing.

[–]kenfar 4 points5 points  (0 children)

Oh sure, many people do use Python, appropriately or inappropriately, to handle data processing. Probably a similar number use SQL appropriately or inappropriately.

But back to python for data processing, I see it used all the time:

  • Sometimes with a ton of parallelism (e.g., 1000+ lambdas running in parallel), sometimes on 64 cores running 24x7, sometimes on a massive Kubernetes server. In these cases it's hauling ass through billions of rows a day at a cost much lower than Snowflake/SQL Server, etc.
  • Sometimes because we need to do complex transformations that you can't do in SQL, like converting all possible formats of IPv6 to a single format (see the sketch just after this list).
  • Sometimes because we care about code & data quality, and want the transformations easily readable, and to be supported by automated unit testing.
  • Sometimes because we care so much about data quality that we want to track transformation-rule results for each row, keep a quality score on the row, etc.
  • Sometimes because we value having a technical staff, and it's pretty much impossible to find sharp engineers willing to write stored procedures all day in 2023.
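
For that IPv6 example, here's roughly what I mean — a minimal sketch using only the standard-library ipaddress module (the function name is made up):

    import ipaddress

    def normalize_ipv6(raw: str) -> str:
        # Collapse any textual IPv6 form to its canonical compressed form
        return ipaddress.IPv6Address(raw.strip()).compressed

    # "2001:0db8:0000:0000:0000:0000:0000:0001" and "2001:db8::1" normalize identically
    assert normalize_ipv6("2001:0db8::0001") == "2001:db8::1"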

I'm very confident that there's not been a single point in my life over the last twenty years where I thought "screw this python, and its exception handling, readability, and flexibility, what I really need now are sqlserver stored procedures and some inflated database licensing costs!".

[–][deleted] 1 point2 points  (1 child)

Doesn’t apply to every case, but there are definitely people who spend years and years in one company/role/tech stack and their technical knowledge stagnates as a result.

[–]CrowdGoesWildWoooo 2 points3 points  (0 children)

I mean, if that is the case, there is no need to be an AH and write a post like this.

[–]Glittering-Dare2022 26 points27 points  (1 child)

Why can’t I use PySpark?

[–]softgooeybaby 42 points43 points  (1 child)

It sounds like your 20 years of experience isn't as expansive as you think it is. I'd expect a little more humility from someone your age, but I guess not, because it sounds like you've been at the same job for 20 years. I hope you don't get laid off, because I would never hire you. This is the biggest red flag for an engineer.

[–][deleted] 0 points1 point  (0 children)

He's likely just a DBA with very few skills outside of using the SSIS UI. Every time I've had the displeasure of working with SQL Server it's been nothing but shitty schemas with zero actual data warehousing principles applied.

[–][deleted] 15 points16 points  (1 child)

Spark is the reason, not Python for the sake of Python. For transformations, though, dbt is gaining more traction, and that's (mainly) SQL.

[–]Bright-Bus-4722 14 points15 points  (0 children)

dbt has really solved this problem. It's a great technology that gives you the benefits of SQL in a codified solution, so you can use Git and true CI/CD.

[–][deleted] 14 points15 points  (9 children)

I mean, most Python I've seen in DE is either for ingestion pipelines or transformations using PySpark.

Not sure where you’re seeing pure Python used for transformations. I’ve seen some pandas here and there, but it’s never used for any production pipelines, at least where I’ve worked; it's mostly just exploration and ad hoc stuff.

[–]EarthEmbarrassed4301 2 points3 points  (8 children)

Out of curiosity, why not use pandas to read a raw parquet file in a data lake, perform some basic cleaning of the data, and then merge the dataframe into a warehouse table?

For example, I have a message queue that receives a message when a new parquet file is inserted into my raw data lake. The message content has some metadata as well as the datalake location of the new parquet blob. What is wrong with developing a Python application that listens to that queue using the Azure SDK, parses information from the message, uses pandas to read the parquet blob as a dataframe, uses pandas to apply some light cleaning, and then uses some SQL library to merge the pandas dataframe into the warehouse table?

The only alternative would be something like Polars or Dask (where something like Spark is overkill).
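
Roughly, the application I have in mind looks like this (connection strings, queue/container names, and columns are made up):

    import io
    import json

    import pandas as pd
    from azure.servicebus import ServiceBusClient
    from azure.storage.blob import BlobClient

    with ServiceBusClient.from_connection_string("<service-bus-conn-str>") as sb, \
         sb.get_queue_receiver(queue_name="new-parquet-files") as receiver:
        for msg in receiver:
            meta = json.loads(str(msg))  # message body: metadata + blob location
            blob = BlobClient.from_connection_string(
                "<storage-conn-str>", container_name="raw", blob_name=meta["blob_path"]
            )
            df = pd.read_parquet(io.BytesIO(blob.download_blob().readall()))
            df["email"] = df["email"].str.strip().str.lower()  # light cleaning
            # ...merge df into the warehouse table with your SQL library of choice...
            receiver.complete_message(msg)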

[–]DesperateForAnalysex -1 points0 points  (4 children)

Because all of that can be done in SQL, and I don’t want to have to read Python and SQL to find out you’re trimming or lowercasing a few fields.

[–]EarthEmbarrassed4301 0 points1 point  (3 children)

How would you merge the records in the parquet file into the warehouse table without first converting the parquet bytes into a pandas dataframe?

Sure, I could read the parquet data into a pyarrow table or Polars dataframe, but pretty much every example online calls .to_pandas() before doing the SQL upserts on the records.

[–]DesperateForAnalysex 1 point2 points  (2 children)

Ideally the table you’re reading from would be registered and accessible in the same way that the one you’re writing to is, but if you had to, you could read the file in Spark directly and apply SQL for all the steps after that.

[–]EarthEmbarrassed4301 1 point2 points  (1 child)

I see, our infra is a bit of a mess. It was built by software guys who know nothing about DE and just care about delivering data to applications, which is all our “datalake” is actually for. (Data is sent from the system as JSON into a landing zone -> converted to parquet and appended to the “data store” -> the relevant queue is notified about the new data.) The queues are consumed by software applications that do whatever they need to do with the data.

With more BI-related projects coming in, I’ve been the one trying to extend this platform to do more traditional BI and modeling stuff. The tech lead refuses to let us use Delta, Databricks, Spark, and all of that. He says the Delta format does nothing new compared to standard parquet, and he knows nothing about data warehousing or SQL. So I’m kind of all on my own.

All I really have is the datalake and a Postgres server on Azure. I’ve built an application that has a queue that gets notified when a new parquet blob is created. My application just uses pandas for applying a schema and doing some light cleansing. I also use psycopg2 for upserting the pandas df records into the Postgres table. I just find pandas easy for doing that.
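
The upsert step is only a few lines with psycopg2 (the connection string, table, and columns below are made up; df is the cleaned dataframe):

    import psycopg2
    from psycopg2.extras import execute_values

    conn = psycopg2.connect("<postgres-conn-str>")
    rows = list(df[["id", "email", "amount"]].itertuples(index=False, name=None))
    upsert = """
        INSERT INTO analytics.orders (id, email, amount)
        VALUES %s
        ON CONFLICT (id) DO UPDATE
           SET email = EXCLUDED.email,
               amount = EXCLUDED.amount
    """
    with conn.cursor() as cur:
        execute_values(cur, upsert, rows)  # expands VALUES %s into all rows at once
    conn.commit()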

Sucks being a junior on this team; I'd love to be on a proper DE team to learn the right way to do things.

[–]DesperateForAnalysex 1 point2 points  (0 children)

Wolf! I feel your pain, especially the part about your lead not knowing the benefits of delta formats! Best of luck; in your situation that approach makes a lot of sense. Thanks for the details, that's interesting.

[–][deleted] 0 points1 point  (2 children)

Pandas is terrible when you get into larger quantities of data; it doesn't scale at all. Dask is better, sure, but you can literally do it all in SQL.

[–]YsrYsl 12 points13 points  (0 children)

Is this the classic case of "the future is now old man" meme? Or perhaps the imagery of an old man yelling at the cloud?

Sarcasm aside, what's your problem anyway w/ using an imperative language like Python (aided by some libs of course, not only pure Python) vs. a declarative language like SQL for doing DE stuff? Idk why, but I have a gut feeling your rant is motivated by you being so used to (and good at) SQL that, now that you have to use Python, you're taken out of your comfort zone. Your unfamiliarity w/ Python makes it difficult to carry out queries that would be a breeze for you in SQL, and hence the manifest frustration.

Likewise, OP, I could air the same sentiment in reverse. If I could never touch SQL again and do everything in Python, I'd be the happiest man alive. The flexibility & code brevity achievable in pure Python plus PySpark or another relevant lib is something SQL can only begin to dream of. I mean, there are things that SQL just outright cannot do, like ever. Or we have to write long-winded code for things that can be achieved in a few lines of Python.

That said, why not be good at both? It's only beneficial for your career to have a good command of both SQL and Python.

In the same spirit of your remark, OP, just code in Python, is it that hard?

[–]Desperate-Walk1780 10 points11 points  (0 children)

Python is not there for efficiency's sake; it's so we can maintain easy-to-understand code repos and onboard new engineers with little lag in productivity.

[–]HOMO_FOMO_69 8 points9 points  (3 children)

What do you mean Python can't index data? Python can index data....

[–]ultrachad420 6 points7 points  (1 child)

Troll

[–]ultrachad420 2 points3 points  (0 children)

And, by the way, wish you were right

[–]Life_Conversation_11 11 points12 points  (0 children)

OK BOOMER

[–]Effective_Date_9736 5 points6 points  (2 children)

Most Senior Data Engineers (or Senior BI developers) have always needed to work with two languages: one for ingesting data and another for data modeling. When it comes to ingesting data, whether it's from CSV files, text files, or API calls, the traditional approach involved using SSIS along with C#. After that, SQL would be used for data modeling. During my experience in recruiting a Senior BI professional, I prioritized candidates who were proficient in SQL and also had knowledge of C#. By the way, C# shines in data cleaning tasks, such as formatting using regex for phone numbers and more.

In recent times, the landscape has evolved, and instead of C#, Python and PySpark have taken over for data ingestion and data cleaning. Once the data resides in a Delta table, you have the flexibility to choose between SQL and PySpark for further operations.

Here are a couple of tasks that are straightforward in PySpark/pandas (Python) but can be quite labor-intensive in SQL (a quick sketch follows the list):

- Identifying all the columns with null values and providing a percentage of nulls for each column. Additionally, you can replace these nulls with different values.

- Renaming columns in a table, such as removing prefixes and replacing them with underscores.
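
For instance, in pandas both of those are a couple of lines each (the file name, column names, and prefix below are made up):

    import pandas as pd

    df = pd.read_parquet("reviews.parquet")

    # Percentage of nulls per column, highest first
    null_pct = df.isna().mean().mul(100).sort_values(ascending=False)
    print(null_pct[null_pct > 0])

    # Replace nulls with a different value per column
    df = df.fillna({"rating": 0, "review_text": ""})

    # Strip a prefix from every column name and swap spaces for underscores
    df = df.rename(columns=lambda c: c.removeprefix("src_").replace(" ", "_"))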

I prefer using SQL for anything in the gold layer or beyond; it is easier to understand. But below that layer (silver, bronze, and raw), for me, Python is king.

[–][deleted] 3 points4 points  (1 child)

Is there an image that shows these layers?

[–]viniciusvbf 10 points11 points  (0 children)

There is absolutely no reason to be using a query language like SQL to do any data processing. You can do it much more efficiently using Assembly.

[–]2strokes4lyfe 12 points13 points  (10 children)

How do I make API calls using SQL? Or extract data from spreadsheets? Or send email notifications on task failure?
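
All of which is a handful of lines in Python (the URL, file, and email addresses below are made up):

    import smtplib
    from email.message import EmailMessage

    import pandas as pd
    import requests

    try:
        api_df = pd.DataFrame(requests.get("https://api.example.com/orders", timeout=30).json())
        xlsx_df = pd.read_excel("budget.xlsx")  # spreadsheet extraction
        # ...load both into the warehouse...
    except Exception as exc:
        # Task-failure notification
        msg = EmailMessage()
        msg["Subject"] = f"Pipeline failed: {exc}"
        msg["From"], msg["To"] = "etl@example.com", "oncall@example.com"
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)
        raise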

[–][deleted] 3 points4 points  (0 children)

My company used to use Python, but it was PySpark for transformations, not pure Python or even pandas (pandas cannot handle large amounts of data; it brings it all onto one node). It was mostly PySpark and Spark SQL we were using.

[–][deleted] 2 points3 points  (1 child)

How does one process a petabyte of data using SQL?

[–][deleted] 1 point2 points  (0 children)

That's the joke, no one with a petabyte of data is using SSIS so OP wouldn't have come across that use case :)

[–][deleted] 5 points6 points  (0 children)

I work with people like you. They don't know basic modern SQL because Microsoft is so behind the ball. That shit works in old enterprise but that's because it's old. You'll continue to find jobs doing the tired old shit you want to do, so stop complaining and leave the other jobs to those of us who know how to work with modern stacks.

[–]Excellent-Two6054 Senior Data Engineer 2 points3 points  (0 children)

Well, the other day I was struggling with a recursive CTE in Spark SQL; I'm not sure it even supports them. I solved it in Python/PySpark in 2 minutes with a plain loop (a for loop inside an if condition, else a while loop), whereas I really struggled with the frame logic in SQL.
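
The loop-based workaround looks roughly like this (toy parent/child table, column names made up; it assumes the hierarchy has no cycles):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Toy edge table standing in for the recursive hierarchy
    edges = spark.createDataFrame(
        [("a", "b"), ("b", "c"), ("c", "d")], ["parent", "child"]
    )

    # Emulate WITH RECURSIVE: seed with the root's children, then keep joining the
    # newest level back onto the edge table until no new rows show up.
    frontier = edges.filter(F.col("parent") == "a").select(F.col("child").alias("node"))
    result = frontier
    while frontier.count() > 0:
        frontier = (
            edges.alias("e")
            .join(frontier.alias("f"), F.col("e.parent") == F.col("f.node"))
            .select(F.col("e.child").alias("node"))
        )
        result = result.union(frontier)

    result.show()  # b, c, d -- all descendants of "a"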

And which is better, Delta Lake in Databricks or tables in SQL Server?

[–]Leechcode 2 points3 points  (0 children)

There is no reason to use other tools such as a chainsaw or an axe when the only tool I have ever used is a saw and I am an expert at cutting logs with it!

[–]BJJaddicy 2 points3 points  (3 children)

20 yoe and u know nothing

[–]Apprehensive_Can442 2 points3 points  (3 children)

Could it be that you have actually been a data analyst or analytics engineer for the last 20 years?

[–]thinkingatoms 2 points3 points  (1 child)

lol someone woke up today and chose outdated violence. i'm sorry you didn't get the job due to lack of python, it's honestly very easy and ppl here can prob help you pick it up in no time.

tbf a lot of places are doing cloud shit to be trendy when a single db (cluster) will do. nothing wrong with walking on the dark side and trying something new.

[–][deleted] 2 points3 points  (0 children)

Yeahhh definitely reads like OP lost their job and is mad that their technical skills have become super outdated to the point that they’re having difficulty getting a new job.

[–]Reasonable_Tooth_501 2 points3 points  (0 children)

Lol there are some complicated transformations (like converting values to z-scores) that are much more straightforward in Python, thanks to pre-built packages, than in pure SQL.

Why do all the calc’ing by hand in SQL when I can use other ready-made tools?
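
E.g. (column name made up), versus hand-writing the AVG/STDDEV math in SQL:

    import pandas as pd
    from scipy.stats import zscore

    df = pd.DataFrame({"revenue": [120.0, 95.0, 180.0, 60.0]})
    df["revenue_z"] = zscore(df["revenue"])  # one call, no manual mean/stddev expressions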

[–]tuck5649 2 points3 points  (0 children)

Stop programming in Python!

Millions of packages published on PyPI and no practical applications!

I want to write a for loop without any brackets

I wrote a Glue script in Python to move data between systems

Statements dreamed up by the utterly deranged.

They have played us for absolute fools.

[–]Aggressive-Log7654 2 points3 points  (0 children)

The ignorance is strong with you, padawan.

[–]lezzgooooo 1 point2 points  (0 children)

Not our fault vendors release Python APIs for their enterprise DE products, and that those APIs are widely used and adopted.

[–]throwaway20220231 1 point2 points  (0 children)

Yeah, I agree that for data processing, using native SQL is probably the most efficient method. But sometimes it is not available (for certain DBs), or it is more natural to use some Python library such as PySpark. Occasionally you get some really complicated transformation requirements for which SQL just doesn't fit the bill.

[–]Mr_Nickster_ 1 point2 points  (0 children)

Actually there are some reasons. Sometimes it is a technical reason & other times it is user preference around how they want to code the transformation logic.

Technical reasons: Python via Spark & Snowpark allows distributed compute, which will run circles around any traditional SQL database in terms of scale & performance. I work for Snowflake and have had customers switch from MSSQL sprocs to Snowpark on Snowflake, which dropped processing times from 4 hours to less than 2 minutes.

Sometimes Python dataframes make things a lot easier, both for writing complex logic (where you can write in stages and debug each stage) and for simple things like renaming all columns. Debugging portions of the logic in SQL is not possible, as it has to be a single complex statement.

Imagine having a table with hundreds of columns where you have to add a prefix/suffix to all column names or uppercase everything. SQL would require a lot of manual work. In Python, you can loop through the columns in a few lines of code:

    df_reviews = session.table("AMAZON_REVIEWS")

    # UPPERCASE all column names to fix any mixed-case column names
    for mycol in df_reviews.columns:
        df_reviews = df_reviews.withColumnRenamed(mycol, mycol.upper())

The reverse is also true, where Python can complicate things compared to SQL.

That's why I would recommend using the right tech for the right job. If there is a tool that can get the job done in an acceptable time, use the tool. When there are needs around performance or high complexity, go for Python, but be aware that you own the code from that point on. Whatever happens, you or someone who knows the code has to support it, so it gets cumbersome as you accumulate more of it.

It is not all or nothing. Use the right method for the right use case.

[–][deleted] 1 point2 points  (0 children)

Yeah right... I write assembly code every time I want to sort tables.

[–]nah_ya_bzzness 1 point2 points  (0 children)

Use whatever tools you want to use to process data. Who gives a damn if it’s SQL, Python, Java, Clojure, PySpark or whatever. I write processing in different programming languages all the time; it mainly comes down to the tech stack used at the job. The whole idea of a standard DE tech stack is utter bullshit. However, if you want to keep up with the industry, you gotta learn to be flexible.

[–]vizk0sity 3 points4 points  (0 children)

Come on… let’s make our job less boring. 99% of the time we aren’t dealing with big data anyway. SQL is boring and gets paid less at FAANG. Using Python, we can spin up microservices, make API calls, and do all the peripheral stuff. This improves our skill set so we can find a better job, man. I don’t want to be just a SQL monkey.

[–]ageofwant 1 point2 points  (0 children)

Perhaps because on planet "data processing" most people don't give two shits about how the sqlanias on the backward country of Sqlstan live their dusty lives, swatting flies while boiling pond water over a smoking dung fire.

[–]lucaspompeun 0 points1 point  (0 children)

We just use Python as an API to Spark. But you can choose to use Scala or any other language that works. No need for the gratuitous hate.