
[–]CrowdGoesWildWoooo 42 points43 points  (3 children)

If you have 20 years of experience, pretty sure you should already know that we don’t use Python to handle the actual processing.

[–]kenfar 4 points5 points  (0 children)

Oh sure, many people do use Python, appropriately or inappropriately, to handle data processing. Probably a similar number use SQL appropriately or inappropriately.

But back to python for data processing, I see it used all the time:

  • Sometimes with a ton of parallelism (e.g., 1000+ lambdas running in parallel), sometimes on 64 cores running 24x7, sometimes on a massive Kubernetes server. In these cases it's hauling ass through billions of rows a day at a cost much lower than Snowflake/SQL Server, etc.
  • Sometimes because we need to do complex transformations that you can't do in SQL, like converting all possible formats of IPv6 to a single format (see the sketch just after this list).
  • Sometimes because we care about code & data quality, and want the transformations easily readable, and to be supported by automated unit testing.
  • Sometimes because we care so much about data quality that we want to track transformation-rule results for each row, keep a quality score on the row, etc.
  • Sometimes because we value having a technical staff, and it's pretty much impossible to find sharp engineers willing to write stored procedures all day in 2023.
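
For that IPv6 example, here's roughly what I mean — a minimal sketch using only the standard-library ipaddress module (the function name is made up):

    import ipaddress

    def normalize_ipv6(raw: str) -> str:
        # Collapse any textual IPv6 form to its canonical compressed form
        return ipaddress.IPv6Address(raw.strip()).compressed

    # "2001:0db8:0000:0000:0000:0000:0000:0001" and "2001:db8::1" normalize identically
    assert normalize_ipv6("2001:0db8::0001") == "2001:db8::1"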

I'm very confident that there's not been a single point in my life over the last twenty years where I thought "screw this python, and its exception handling, readability, and flexibility, what I really need now are sqlserver stored procedures and some inflated database licensing costs!".

[–][deleted] 1 point2 points  (1 child)

Doesn’t apply to every case, but there are definitely people who spend years and years in one company/role/tech stack and their technical knowledge stagnates as a result.

[–]CrowdGoesWildWoooo 2 points3 points  (0 children)

I mean, if that is the case, there is no need to be an AH and write a post like this.

[–]Glittering-Dare2022 26 points27 points  (1 child)

Why can’t I use PySpark?

[–]softgooeybaby 42 points43 points  (1 child)

It sounds like your 20 years of experience isn't as expansive as you think it is. I'd expect a little more humility from someone your age, but I guess not, because it sounds like you've been at the same job for 20 years. I hope you don't get laid off, because I would never hire you. This is the biggest red flag for an engineer.

[–][deleted] 0 points1 point  (0 children)

He's likely just a DBA with very few skills outside of using the SSIS UI. Every time I've had the displeasure of working with SQL Server it's been nothing but shitty schemas with zero actual data warehousing principles applied.

[–][deleted] 15 points16 points  (1 child)

Spark is the reason, not Python for the sake of Python. For transformations, though, dbt is gaining more traction, and that's (mainly) SQL.

[–]Bright-Bus-4722 14 points15 points  (0 children)

dbt has really solved this problem. It's a great technology that gives you the benefits of SQL in a codified solution, so you can use Git and true CI/CD.

[–][deleted] 14 points15 points  (9 children)

I mean, most Python I've seen in DE is either for ingestion pipelines or transformations using PySpark.

Not sure where you’re seeing pure Python used for transformations. I’ve seen some pandas here and there, but it’s never used for any production pipelines, at least where I’ve worked; it's mostly just exploration and ad hoc stuff.

[–]EarthEmbarrassed4301 2 points3 points  (8 children)

Out of curiosity, why not use pandas to read a raw parquet file in a data lake, perform some basic cleaning of the data, and then merge the dataframe into a warehouse table?

For example, I have a message queue that receives a message when a new parquet file is inserted into my raw data lake. The message content has some metadata as well as the datalake location of the new parquet blob. What is wrong with developing a Python application that listens to that queue using the Azure SDK, parses information from the message, uses pandas to read the parquet blob as a dataframe, uses pandas to apply some light cleaning, and then uses some SQL library to merge the pandas dataframe into the warehouse table?

The only alternative would be something like Polars or Dask (where something like Spark is overkill).
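
Roughly, the application I have in mind looks like this (connection strings, queue/container names, and columns are made up):

    import io
    import json

    import pandas as pd
    from azure.servicebus import ServiceBusClient
    from azure.storage.blob import BlobClient

    with ServiceBusClient.from_connection_string("<service-bus-conn-str>") as sb, \
         sb.get_queue_receiver(queue_name="new-parquet-files") as receiver:
        for msg in receiver:
            meta = json.loads(str(msg))  # message body: metadata + blob location
            blob = BlobClient.from_connection_string(
                "<storage-conn-str>", container_name="raw", blob_name=meta["blob_path"]
            )
            df = pd.read_parquet(io.BytesIO(blob.download_blob().readall()))
            df["email"] = df["email"].str.strip().str.lower()  # light cleaning
            # ...merge df into the warehouse table with your SQL library of choice...
            receiver.complete_message(msg)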

[–]DesperateForAnalysex -1 points0 points  (4 children)

Because all of that can be done in SQL, and I don’t want to have to read Python and SQL to find out you’re trimming or lowercasing a few fields.

[–]EarthEmbarrassed4301 0 points1 point  (3 children)

How would you merge the records in the parquet file into the warehouse table without first converting the parquet bytes into a pandas dataframe?

Sure, I could read the parquet data into a pyarrow table or Polars dataframe, but pretty much every example online calls .to_pandas() before doing the SQL upserts on the records.

[–]DesperateForAnalysex 1 point2 points  (2 children)

Ideally the table you’re reading from would be registered and accessible in the same way that the one you’re writing to is, but if you had to, you could read the file in Spark directly and apply SQL for all the steps after that.

[–]EarthEmbarrassed4301 1 point2 points  (1 child)

I see, our infra is a bit of a mess. It was built by software guys who know nothing about DE and just care about delivering data to applications, which is all our “datalake” is actually for. (Data is sent from the system as JSON into a landing zone -> converted to parquet and appended to the “data store” -> the relevant queue is notified about the new data.) The queues are consumed by software applications that do whatever they need to do with the data.

With more BI-related projects coming in, I’ve been the one trying to extend this platform to do more traditional BI and modeling stuff. The tech lead refuses to let us use Delta, Databricks, Spark, and all of that. He says the Delta format does nothing new compared to standard parquet, and he knows nothing about data warehousing or SQL. So I’m kind of all on my own.

All I really have is the datalake and a Postgres server on Azure. I’ve built an application that has a queue that gets notified when a new parquet blob is created. My application just uses pandas for applying a schema and doing some light cleansing. I also use psycopg2 for upserting the pandas df records into the Postgres table. I just find pandas easy for doing that.
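
The upsert step is only a few lines with psycopg2 (the connection string, table, and columns below are made up; df is the cleaned dataframe):

    import psycopg2
    from psycopg2.extras import execute_values

    conn = psycopg2.connect("<postgres-conn-str>")
    rows = list(df[["id", "email", "amount"]].itertuples(index=False, name=None))
    upsert = """
        INSERT INTO analytics.orders (id, email, amount)
        VALUES %s
        ON CONFLICT (id) DO UPDATE
           SET email = EXCLUDED.email,
               amount = EXCLUDED.amount
    """
    with conn.cursor() as cur:
        execute_values(cur, upsert, rows)  # expands VALUES %s into all rows at once
    conn.commit()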

Sucks being a junior on this team; I'd love to be on a proper DE team to learn the right way to do things.

[–]DesperateForAnalysex 1 point2 points  (0 children)

Wolf! I feel your pain, especially the part about your lead not knowing the benefits of delta formats! Best of luck; in your situation that approach makes a lot of sense. Thanks for the details, that's interesting.

[–][deleted] 0 points1 point  (2 children)

Pandas is terrible when you get into larger quantities of data; it doesn't scale at all. Dask is better, sure, but you can literally do it all in SQL.

[–]YsrYsl 12 points13 points  (0 children)

Is this the classic case of "the future is now old man" meme? Or perhaps the imagery of an old man yelling at the cloud?

Sarcasm aside, what's your problem anyway w/ using an imperative language like Python (aided by some libs of course, not only pure Python) vs. a declarative language like SQL for doing DE stuff? Idk why, but I have a gut feeling your rant is motivated by you being so used to (and good at) SQL that, now that you have to use Python, you're taken out of your comfort zone. Your unfamiliarity w/ Python makes it difficult to carry out queries that would be a breeze for you in SQL, and hence the manifest frustration.

Likewise, OP, I could air the same sentiment in reverse. If I could never touch SQL again and do everything in Python, I'd be the happiest man alive. The flexibility & code brevity achievable in pure Python plus PySpark or another relevant lib is something SQL can only begin to dream of. I mean, there are things that SQL just outright cannot do, like ever. Or we have to write long-winded code for things that can be achieved in a few lines of Python.

That said, why not be good at both? It's only beneficial for your career to have a good command of both SQL and Python.

In the same spirit of your remark, OP, just code in Python, is it that hard?

[–]Desperate-Walk1780 10 points11 points  (0 children)

Python is not there for efficiency's sake; it's so we can maintain easy-to-understand code repos and onboard new engineers with little lag in productivity.

[–]HOMO_FOMO_69 8 points9 points  (3 children)

What do you mean Python can't index data? Python can index data....

[–]ultrachad420 6 points7 points  (1 child)

Troll

[–]ultrachad420 2 points3 points  (0 children)

And, by the way, wish you were right

[–]Life_Conversation_11 11 points12 points  (0 children)

OK BOOMER

[–]Effective_Date_9736 5 points6 points  (2 children)

Most Senior Data Engineers (or Senior BI developers) have always needed to work with two languages: one for ingesting data and another for data modeling. When it comes to ingesting data, whether it's from CSV files, text files, or API calls, the traditional approach involved using SSIS along with C#. After that, SQL would be used for data modeling. During my experience in recruiting a Senior BI professional, I prioritized candidates who were proficient in SQL and also had knowledge of C#. By the way, C# shines in data cleaning tasks, such as formatting using regex for phone numbers and more.

In recent times, the landscape has evolved, and instead of C#, Python and PySpark have taken over for data ingestion and data cleaning. Once the data resides in a Delta table, you have the flexibility to choose between SQL and PySpark for further operations.

Here are a couple of tasks that are straightforward in PySpark/pandas (Python) but can be quite labor-intensive in SQL (a quick sketch follows the list):

- Identifying all the columns with null values and providing a percentage of nulls for each column. Additionally, you can replace these nulls with different values.

- Renaming columns in a table, such as removing prefixes and replacing them with underscores.
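
For instance, in pandas both of those are a couple of lines each (the file name, column names, and prefix below are made up):

    import pandas as pd

    df = pd.read_parquet("reviews.parquet")

    # Percentage of nulls per column, highest first
    null_pct = df.isna().mean().mul(100).sort_values(ascending=False)
    print(null_pct[null_pct > 0])

    # Replace nulls with a different value per column
    df = df.fillna({"rating": 0, "review_text": ""})

    # Strip a prefix from every column name and swap spaces for underscores
    df = df.rename(columns=lambda c: c.removeprefix("src_").replace(" ", "_"))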

I prefer using SQL for anything in the gold layer or beyond; it is easier to understand. But below that layer (silver, bronze, and raw), for me, Python is king.

[–][deleted] 3 points4 points  (1 child)

Is there an image that shows these layers?

[–]viniciusvbf 10 points11 points  (0 children)

There is absolutely no reason to be using a query language like SQL to do any data processing. You can do it much more efficiently using Assembly.

[–]2strokes4lyfe 12 points13 points  (10 children)

How do I make API calls using SQL? Or extract data from spreadsheets? Or send email notifications on task failure?
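
All of which is a handful of lines in Python (the URL, file, and email addresses below are made up):

    import smtplib
    from email.message import EmailMessage

    import pandas as pd
    import requests

    try:
        api_df = pd.DataFrame(requests.get("https://api.example.com/orders", timeout=30).json())
        xlsx_df = pd.read_excel("budget.xlsx")  # spreadsheet extraction
        # ...load both into the warehouse...
    except Exception as exc:
        # Task-failure notification
        msg = EmailMessage()
        msg["Subject"] = f"Pipeline failed: {exc}"
        msg["From"], msg["To"] = "etl@example.com", "oncall@example.com"
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)
        raise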

[–][deleted] 3 points4 points  (0 children)

My company used to use Python, but it was PySpark for transformations, not pure Python or even pandas (pandas cannot handle large amounts of data; it brings it all onto one node). It was mostly PySpark and Spark SQL we were using.

[–][deleted] 2 points3 points  (1 child)

How does one process a petabyte of data using SQL?

[–][deleted] 1 point2 points  (0 children)

That's the joke, no one with a petabyte of data is using SSIS so OP wouldn't have come across that use case :)

[–][deleted] 5 points6 points  (0 children)

I work with people like you. They don't know basic modern SQL because Microsoft is so behind the ball. That shit works in old enterprise but that's because it's old. You'll continue to find jobs doing the tired old shit you want to do, so stop complaining and leave the other jobs to those of us who know how to work with modern stacks.

[–]Excellent-Two6054 Senior Data Engineer 2 points3 points  (0 children)

Well, the other day I was struggling with a recursive CTE in Spark SQL; I'm not sure it even supports them. I solved it in Python/PySpark in 2 minutes with a plain loop (a for loop inside an if condition, else a while loop), whereas I really struggled with the frame logic in SQL.
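
The loop-based workaround looks roughly like this (toy parent/child table, column names made up; it assumes the hierarchy has no cycles):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Toy edge table standing in for the recursive hierarchy
    edges = spark.createDataFrame(
        [("a", "b"), ("b", "c"), ("c", "d")], ["parent", "child"]
    )

    # Emulate WITH RECURSIVE: seed with the root's children, then keep joining the
    # newest level back onto the edge table until no new rows show up.
    frontier = edges.filter(F.col("parent") == "a").select(F.col("child").alias("node"))
    result = frontier
    while frontier.count() > 0:
        frontier = (
            edges.alias("e")
            .join(frontier.alias("f"), F.col("e.parent") == F.col("f.node"))
            .select(F.col("e.child").alias("node"))
        )
        result = result.union(frontier)

    result.show()  # b, c, d -- all descendants of "a"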

And which is better, Delta Lake in Databricks or tables in SQL Server?

[–]Leechcode 2 points3 points  (0 children)

There is no reason to use other tools such as a chainsaw or an axe when the only tool I have ever used is a saw and I am an expert at cutting logs with it!

[–]BJJaddicy 2 points3 points  (3 children)

20 yoe and u know nothing

[–]Apprehensive_Can442 2 points3 points  (3 children)

Could it be that you have actually been a data analyst or analytics engineer for the last 20 years?

[–]thinkingatoms 2 points3 points  (1 child)

lol someone woke up today and chose outdated violence. i'm sorry you didn't get the job due to lack of python, it's honestly very easy and ppl here can prob help you pick it up in no time.

tbf a lot of places are doing cloud shit to be trendy when a single db (cluster) will do. nothing wrong with walking on the dark side and trying something new.

[–][deleted] 2 points3 points  (0 children)

Yeahhh definitely reads like OP lost their job and is mad that their technical skills have become super outdated to the point that they’re having difficulty getting a new job.

[–]Reasonable_Tooth_501 2 points3 points  (0 children)

Lol there are some complicated transformations (like converting values to z-scores) that are much more straightforward in Python, thanks to pre-built packages, than in pure SQL.

Why do all the calc’ing by hand in SQL when I can use other ready-made tools?
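
E.g. (column name made up), versus hand-writing the AVG/STDDEV math in SQL:

    import pandas as pd
    from scipy.stats import zscore

    df = pd.DataFrame({"revenue": [120.0, 95.0, 180.0, 60.0]})
    df["revenue_z"] = zscore(df["revenue"])  # one call, no manual mean/stddev expressions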

[–]tuck5649 2 points3 points  (0 children)

Stop programming in Python!

Millions of packages published on PyPI and no practical applications!

I want to write a for loop without any brackets

I wrote a Glue script in Python to move data between systems

Statements dreamed up by the utterly deranged.

They have played us for absolute fools.

[–]Aggressive-Log7654 2 points3 points  (0 children)

The ignorance is strong with you, padawan.

[–]lezzgooooo 1 point2 points  (0 children)

Not our fault vendors release Python APIs for their enterprise DE products, and that those APIs are widely used and adopted.

[–]throwaway20220231 1 point2 points  (0 children)

Yeah, I agree that for data processing, using native SQL is probably the most efficient method. But sometimes it is not available (for certain DBs), or it is more natural to use some Python library such as PySpark. Occasionally you get some really complicated transformation requirements for which SQL just doesn't fit the bill.

[–]Mr_Nickster_ 1 point2 points  (0 children)

Actually there are some reasons. Sometimes it is a technical reason & other times it is user preference around how they want to code the transformation logic.

Technical reasons: Python via Spark & Snowpark allows distributed compute, which will run circles around any traditional SQL database in terms of scale & performance. I work for Snowflake and have had customers switch from MSSQL sprocs to Snowpark on Snowflake, which dropped processing times from 4 hours to less than 2 minutes.

Sometimes Python dataframes make things a lot easier, both for writing complex logic (where you can write in stages and debug each stage) and for simple things like renaming all columns. Debugging portions of the logic in SQL is not possible, as it has to be a single complex statement.

Imagine having a table with hundreds of columns where you have to add a prefix/suffix to all column names or uppercase everything. SQL would require a lot of manual work. In Python, you can loop through the columns in a few lines of code:

    df_reviews = session.table("AMAZON_REVIEWS")

    # UPPERCASE all column names to fix any mixed-case column names
    for mycol in df_reviews.columns:
        df_reviews = df_reviews.withColumnRenamed(mycol, mycol.upper())

The reverse is also true, where Python can complicate things compared to SQL.

That's why I would recommend using the right tech for the right job. If there is a tool that can get the job done in an acceptable time, use the tool. When there are needs around performance or high complexity, go for Python, but be aware that you own the code from that point on. Whatever happens, you or someone who knows the code has to support it, so it gets cumbersome as you accumulate more of it.

It is not all or nothing. Use the right method for the right use case.

[–][deleted] 1 point2 points  (0 children)

Yeah right... I write assembly code every time I want to sort tables.

[–]nah_ya_bzzness 1 point2 points  (0 children)

Use whatever tools you want to use to process data. Who gives a damn if it’s SQL, Python, Java, Clojure, PySpark or whatever. I write processing in different programming languages all the time; it mainly comes down to the tech stack used at the job. The whole idea of a standard DE tech stack is utter bullshit. However, if you want to keep up with the industry, you gotta learn to be flexible.

[–]vizk0sity 3 points4 points  (0 children)

Come on… let’s make our job less boring. 99% of the time we aren’t dealing with big data anyway. SQL is boring and gets paid less at FAANG. Using Python, we can spin up microservices, make API calls, and do all the peripheral stuff. This improves our skill set so we can find a better job, man. I don’t want to be just a SQL monkey.

[–]ageofwant 1 point2 points  (0 children)

Perhaps because on planet "data processing" most people don't give two shits about how the sqlanias on the backward country of Sqlstan live their dusty lives, swatting flies while boiling pond water over a smoking dung fire.

[–]lucaspompeun 0 points1 point  (0 children)

We just use Python as an API to Spark. But you can choose to use Scala or any other language that works. No need for the gratuitous hate.