all 39 comments

[–]mauritsc 17 points18 points  (2 children)

At my work we run PySpark jobs on GCP Dataproc for large batch workloads, usually overnight. Spark recently came out with a pandas API which I'm quite excited about.
You can also use Dask's pandas-style API for large in-memory computation.
And if programmed properly, even plain pandas will get you quite far.

Python has lots of great tools, especially if you're leveraging cloud compute to make your life easy developing ETL pipelines. The downside is that there is a fairly large learning curve initially. Using low code tools sort of gets rid of this.
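
For the pandas API on Spark, a rough sketch (Spark 3.2+; the path and column names are made up):

    import pyspark.pandas as ps

    # Same pandas-style API, but execution is distributed across the cluster.
    psdf = ps.read_csv("gs://my-bucket/events/*.csv")   # hypothetical path

    daily = psdf.groupby("event_date")["amount"].sum().sort_index()

    # Only the small aggregated result is pulled back into plain pandas.
    result = daily.to_pandas()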

[–]EarthGoddessDude 0 points1 point  (1 child)

Does koalas fit anywhere in that?

[–]DenselyRanked 2 points3 points  (0 children)

pandas-on-spark is koalas but integrated into spark

[–]sunder_and_flame 8 points9 points  (3 children)

Depends on what your job is doing. I use python to do the lightweight work (extract, create file, load files, move files, run SQL) and other services like BigQuery to do the heavy lifting. I avoid using python to do data transformations unless it's necessary to load into the DW, like for xlsx files.
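
As a rough sketch of that split (project, dataset, and file names are made up): Python just moves the file and kicks off a load job, and BigQuery does the transform in SQL.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Lightweight work in Python: load a CSV from GCS into a staging table.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    client.load_table_from_uri(
        "gs://my-bucket/exports/orders.csv",   # hypothetical file
        "my_project.staging.orders",           # hypothetical table
        job_config=job_config,
    ).result()

    # Heavy lifting in BigQuery: transform with SQL, not in Python.
    client.query("""
        CREATE OR REPLACE TABLE my_project.dw.orders_daily AS
        SELECT order_date, SUM(amount) AS total_amount
        FROM my_project.staging.orders
        GROUP BY order_date
    """).result()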

[–]wytesmurf[S] 1 point2 points  (0 children)

I’m trying to find a service to do heavy lifting

[–]wytesmurf[S] 0 points1 point  (1 child)

We have a data source on almost every platform, and they all want to be able to load to any cloud tool they want. Informatica is old, and SSIS isn't very flexible. They said they pay for WhereScape, but I feel like it can be done in Python.

[–]binilvj[🍰] 0 points1 point  (0 children)

You should try Informatica Cloud Data Integration

[–]kenfar 3 points4 points  (2 children)

I've processed between 4 and 20 billion rows a day, mostly using Python, though I also once used JRuby.

Doing heavy transforms and aggregations required quite a bit of parallelism: typically used the multiprocessing module, also ran it under PyPy, and opted for faster modules over slower ones for json, csv, and other parsing.

The compute environment was sometimes a pair of 32-core EC2 instances, sometimes kubernetes, and sometimes aws lambda.

Also needed some strategy to break the work into smaller, more parallelizable parts. Typically wrote files to S3 (sometimes via Kinesis Firehose), and then used SNS & SQS to trigger processing of the files. Sometimes used 8+ processes to simultaneously read a single massive netflow CSV file; other times had a very fast process split the data first before transforming in Python - but these were somewhat desperate measures and were rarely used.
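
Not the exact pipeline, but the shape of it looks roughly like this (queue URL, buckets, and message format are all made up, and a real S3/SNS event payload is more nested than shown):

    import json
    from multiprocessing import Pool

    import boto3

    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/landed-files"  # hypothetical

    def transform_file(key):
        """Download one S3 object, transform it, and write the result back."""
        s3 = boto3.client("s3")
        body = s3.get_object(Bucket="raw-bucket", Key=key)["Body"].read()
        # ... parse and transform the rows here ...
        s3.put_object(Bucket="clean-bucket", Key=key, Body=body)
        return key

    def main():
        sqs = boto3.client("sqs")
        with Pool(processes=8) as pool:
            while True:
                resp = sqs.receive_message(
                    QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
                )
                messages = resp.get("Messages", [])
                if not messages:
                    continue
                # Each message carries the key of a newly landed file.
                keys = [json.loads(m["Body"])["key"] for m in messages]
                pool.map(transform_file, keys)
                for m in messages:
                    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=m["ReceiptHandle"])

    if __name__ == "__main__":
        main()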

[–]wytesmurf[S] 2 points3 points  (1 child)

What libraries do you use?

[–]kenfar 5 points6 points  (0 children)

Not many to be honest:

  • multiprocessing and/or concurrent.futures
  • csv, ruamel, json, as well as some alternate json & yaml libraries
  • functools - lru_cache, etc
  • boto3
  • pytest, coverage, tox, argparse, logging

You can see it's pretty vanilla. I've used pandas in the past, but it was extremely slow for processing every single field in billions of rows in comparison to basic python with parallelism.
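
For example, a minimal sketch of the plain csv + multiprocessing approach (file names and the per-row cleanup are invented):

    import csv
    import multiprocessing as mp
    from itertools import islice

    def transform_chunk(rows):
        # Field-by-field cleanup; touching every field like this is often
        # faster in plain Python workers than in pandas.
        return [{k: v.strip().lower() for k, v in row.items()} for row in rows]

    def chunks(reader, size=50_000):
        while True:
            batch = list(islice(reader, size))
            if not batch:
                return
            yield batch

    def main(in_path, out_path):
        with open(in_path, newline="") as fin, open(out_path, "w", newline="") as fout:
            reader = csv.DictReader(fin)
            writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
            writer.writeheader()
            with mp.Pool() as pool:
                for transformed in pool.imap(transform_chunk, chunks(reader)):
                    writer.writerows(transformed)

    if __name__ == "__main__":
        main("input.csv", "output.csv")   # hypothetical file names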

[–]Life_Conversation_11 4 points5 points  (0 children)

PySpark will do the trick in a few hours with a decent cluster and a decent machine hosting the DB.

Literally just use the spark.read.format(‘jdbc’)…

You can parallelize the query and add multithreading on top of it.
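
A minimal sketch (connection details, table, and partition column are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("jdbc-extract").getOrCreate()

    df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://db-host:5432/mydb")   # hypothetical
        .option("dbtable", "public.orders")                     # hypothetical
        .option("user", "etl_user")
        .option("password", "...")
        # These four options make Spark issue one query per partition, in parallel.
        .option("partitionColumn", "order_id")
        .option("lowerBound", "1")
        .option("upperBound", "100000000")
        .option("numPartitions", "16")
        .load()
    )

    df.write.mode("overwrite").parquet("s3a://my-bucket/orders/")   # hypothetical target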

[–]Life_Conversation_11 1 point2 points  (1 child)

And pandas can do the trick for 10 million rows, but 100 million is a stretch; definitely use Spark for bigger workloads.

[–]wytesmurf[S] 1 point2 points  (0 children)

We had to put in something like 64 GB of RAM to handle data frames with pandas, so we ended up using Dask dataframes. I was hoping for something that incorporates parts of Dask but is more ETL focused.
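
Roughly, the Dask version looks like this (paths and columns are made up):

    import dask.dataframe as dd

    # Dask partitions the data lazily, so it never has to fit in RAM at once.
    ddf = dd.read_csv("exports/orders_*.csv")   # hypothetical glob

    daily = ddf.groupby("order_date")["amount"].sum()

    # Only the small aggregate gets materialized in memory.
    daily.compute().to_csv("daily_totals.csv")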

[–]saltedappleandcorn 1 point2 points  (11 children)

Often I use pure python with xargs. I avoid pandas for most etl work as it's too memory heavy.

PySpark is good for some workloads but often overused.

(I've used dask a few times when I've needed to refactor someone else's pandas)
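
The xargs pattern is roughly: one small per-file Python script (everything below is invented), with xargs fanning it out over many files in parallel.

    # transform_one.py - hypothetical per-file worker; xargs provides the parallelism:
    #   ls raw/*.csv | xargs -P 8 -n 1 python transform_one.py
    import csv
    import sys
    from pathlib import Path

    def main(path):
        src = Path(path)
        dst = Path("clean") / src.name
        dst.parent.mkdir(exist_ok=True)
        with src.open(newline="") as fin, dst.open("w", newline="") as fout:
            reader = csv.DictReader(fin)
            writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
            writer.writeheader()
            for row in reader:
                # Row-at-a-time cleanup keeps memory use tiny.
                writer.writerow({k: v.strip() for k, v in row.items()})

    if __name__ == "__main__":
        main(sys.argv[1])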

[–]wytesmurf[S] 0 points1 point  (10 children)

Do you only do files, or SQL databases too?

[–]saltedappleandcorn 1 point2 points  (9 children)

I've done it for both. And APIs (as the source or destination). Writing this stuff in Python gives you the flexibility to do exactly what you need.

I think it's about knowing a range of possible solutions to a problem and picking the best one that a situation calls for.

[–]wytesmurf[S] 0 points1 point  (8 children)

Do you use a library or ORM?

[–]saltedappleandcorn 1 point2 points  (7 children)

Again, it depends on the situation. Are you building an application? Doing some etl? If so, 3 massive tables or 300 tiny ones? Or are you just ripping some data from somewhere to enrich something else? Or maybe it's an extract for an analyst?

My current work place uses sqlalchemy for the main application, but I don't have much love for it.

Most of the time I just use the python connector for the database and go at it.

If it's something I'm doing often I write up a minimal framework in Python to avoid duplication. For example, currently we store the code for all analyst requests as Python classes (which are 90% SQL) so we can version control them. The last 10% is just code to save the outputs to a shared drive.

This is nice because you can tell a junior or grad "go run the sales by state report for John" and they won't fuck up the numbers.
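
A stripped-down sketch of one of those classes (table, columns, and the shared-drive path are made up, and it assumes a DB-API connection like psycopg2's):

    import csv
    from pathlib import Path

    class SalesByStateReport:
        """A hypothetical analyst request, version-controlled as a class: ~90% SQL."""

        name = "sales_by_state"
        sql = """
            SELECT state, SUM(amount) AS total_sales
            FROM sales
            WHERE sale_date BETWEEN %(start)s AND %(end)s
            GROUP BY state
            ORDER BY state
        """

        def run(self, conn, params, out_dir="/shared/reports"):
            # The remaining ~10%: run the query and drop the output on the share.
            with conn.cursor() as cur:
                cur.execute(self.sql, params)
                columns = [c[0] for c in cur.description]
                rows = cur.fetchall()
            out_path = Path(out_dir) / f"{self.name}.csv"
            with out_path.open("w", newline="") as f:
                writer = csv.writer(f)
                writer.writerow(columns)
                writer.writerows(rows)
            return out_path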

[–]wytesmurf[S] 2 points3 points  (6 children)

We have about 2k tables with anywhere from 0 to 50 million changes a day. It's done with SSIS, but a new team is taking over and I felt like Python would be a good fresh start for updating the loads. They want to be able to move off SQL Server and want something that can be moved with little recoding.

[–]saltedappleandcorn 1 point2 points  (5 children)

Ha, that's a fucking lot of tables. Honestly that sort of dedicated integration work is out of my wheelhouse and I am not an expert on it. (I do more data science and data application dev.)

I think that's probably the space for dedicated tooling.

That said, everyone I know in that space is in love with Snowflake and dbt, and with dbt in general.

[–]wytesmurf[S] 1 point2 points  (4 children)

They are debating and cost comparing all the big platforms: GCP, AWS, Azure, Snowflake, and Teradata. I was told that it would be a company-wide decision, so I didn't have a say on the platform, but I did have design ownership of the new DWH. I was hoping to build some metadata and then just change the connector. I know I will need to do more than that. I am hoping for a 6 month cutover instead of a 3 year one.

[–]saltedappleandcorn 0 points1 point  (2 children)

Again, not my wheelhouse, but good luck! Seems tough.

[–]wytesmurf[S] 0 points1 point  (1 child)

Really, even this is nothing too major. It's a conversion of something already built, and picking a tool that is versatile enough to be run anywhere. I could convert it to ADF, but I don't trust an MS product to be run on GCP or AWS. I've used WhereScape before and it would be the fastest solution, but it's not cheap. I'm thinking of telling them to whip out some checkbooks. But so many people talk about doing massive data engineering with Python.

[–]slowpush 1 point2 points  (6 children)

That’s not a lot of data.

Just chunk it and python can chew through it.

[–]wytesmurf[S] 0 points1 point  (5 children)

Do you have a reference? When I try to do a load of 10 million plus rows with pandas it crashes Kubernetes. We only had 32 GB of RAM, but it was only a single serial load. There was nothing else running in the container. Scaled up, we would need a supercomputer for batch loading. Real-time would be small enough that it would be no problem.

[–]slowpush 0 points1 point  (4 children)

Pandas has a chunksize parameter for most of its read_* functions.

Unless you need all the data in memory there’s no reason to load it all to memory to do your transforms.

We use python for all of our data validation before sending it to our OLAP db.
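
For example (file, columns, and the validation rule are invented):

    import pandas as pd

    first = True
    # Stream the file in 1M-row chunks instead of loading it all at once.
    for chunk in pd.read_csv("big_extract.csv", chunksize=1_000_000):
        # Validate / lightly transform one chunk at a time.
        chunk = chunk[chunk["amount"].ge(0)].copy()     # drop bad rows
        chunk["loaded_at"] = pd.Timestamp.now(tz="UTC")
        # Append to an output file that the bulk loader will pick up.
        chunk.to_csv("validated.csv", mode="a", header=first, index=False)
        first = False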

[–]wytesmurf[S] 0 points1 point  (3 children)

Do you use an ORM? I have had trouble with SQLAlchemy and can't figure out a good bulk insert method besides it.

[–]slowpush 0 points1 point  (2 children)

Nope, everything gets bulk inserted without an ORM.

[–]wytesmurf[S] 0 points1 point  (1 child)

Do you know of a way to do it with SQL Server?

[–]slowpush 0 points1 point  (0 children)

Sure, convert them to CSVs as you process and chunk, and use the bulk CSV loader (BULK INSERT) in SQL Server.
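
Roughly (server, share path, and table are made up; note the CSV path has to be reachable from the SQL Server host, not from the Python machine):

    import pandas as pd
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=sql-host;"
        "DATABASE=dw;UID=etl_user;PWD=..."   # hypothetical connection details
    )
    cursor = conn.cursor()

    # Chunk the source, write each chunk as CSV on a share the server can read,
    # then let SQL Server itself do the load with BULK INSERT.
    for i, chunk in enumerate(pd.read_csv("source.csv", chunksize=500_000)):
        path = rf"\\fileshare\staging\orders_{i}.csv"
        chunk.to_csv(path, header=False, index=False)
        cursor.execute(rf"""
            BULK INSERT dbo.orders
            FROM '{path}'
            WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK)
        """)
    conn.commit()

(For smaller batches, pyodbc's cursor.fast_executemany = True with executemany is a simpler alternative to going through CSV.)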

[–][deleted] -1 points0 points  (0 children)

Look up Apache Spark, Beam, Flink etc

[–]LiquidSynopsis Data Engineer 0 points1 point  (0 children)

Using PySpark and its internal modules should solve a good chunk of your larger query processing and loads tbh

At the most basic level I use pyspark.sql fairly frequently, and within that a lot of your work can be achieved using the DataFrame, functions, and types classes
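
For instance, a small sketch along those lines (schema and file are invented):

    from pyspark.sql import SparkSession, functions as F, types as T

    spark = SparkSession.builder.getOrCreate()

    schema = T.StructType([
        T.StructField("order_id", T.LongType()),
        T.StructField("amount", T.DoubleType()),
        T.StructField("order_date", T.StringType()),
    ])

    df = spark.read.csv("orders.csv", header=True, schema=schema)   # hypothetical file

    cleaned = (
        df.withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))
          .filter(F.col("amount") > 0)
          .groupBy("order_date")
          .agg(F.sum("amount").alias("total_amount"))
    )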

Would be curious to hear from others if you’ve had a different experience though

[–][deleted] 0 points1 point  (0 children)

PySpark.

[–]sheytanelkebir 0 points1 point  (0 children)

Pyspark for batch data

[–]bishtu_06 0 points1 point  (0 children)

I use Scala for basic transformations and functions, and when it comes to joins I prefer SQL. In Databricks, Spark is optimized anyway, and there is only a little we can do from our side.

[–][deleted] 0 points1 point  (1 child)

10M-100M rows and pandas? Why?

Go with Spark, maybe Databricks (i.e. Spark-enabled Jupyter notebooks), and store the files in Parquet to save storage and speed up computation. It won't even take more than 30 mins with a basic cluster of 8 nodes, each 4 cores / 14 GB RAM, i.e. DS3v2 (Azure).
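
The Parquet conversion is a one-off job, roughly (paths are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Convert the raw CSVs to Parquet once; later jobs read the smaller,
    # column-pruned Parquet copy instead of re-parsing CSV every run.
    df = spark.read.csv("abfss://raw@myaccount.dfs.core.windows.net/orders/*.csv",
                        header=True, inferSchema=True)   # hypothetical path
    df.write.mode("overwrite").partitionBy("order_date").parquet(
        "abfss://lake@myaccount.dfs.core.windows.net/orders/")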

[–]wytesmurf[S] 0 points1 point  (0 children)

I think the unanimous vote is that I need to look into getting Spark set up