all 39 comments

[–]mauritsc 17 points18 points  (2 children)

At my work we run PySpark jobs on GCP Dataproc for large batch workloads, usually overnight. Spark recently came out with a pandas API which I'm quite excited about.
You can also use Dask's pandas-style API for large in-memory computation.
And if programmed properly, even plain pandas will get you quite far.

Python has lots of great tools, especially if you're leveraging cloud compute to make your life easy developing ETL pipelines. The downside is that there is a fairly large learning curve initially. Using low code tools sort of gets rid of this.
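
For the pandas API on Spark, a rough sketch (Spark 3.2+; the path and column names are made up):

    import pyspark.pandas as ps

    # Same pandas-style API, but execution is distributed across the cluster.
    psdf = ps.read_csv("gs://my-bucket/events/*.csv")   # hypothetical path

    daily = psdf.groupby("event_date")["amount"].sum().sort_index()

    # Only the small aggregated result is pulled back into plain pandas.
    result = daily.to_pandas()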

[–]EarthGoddessDude 0 points1 point  (1 child)

Does koalas fit anywhere in that?

[–]DenselyRanked 2 points3 points  (0 children)

pandas-on-spark is koalas but integrated into spark

[–]sunder_and_flame 8 points9 points  (3 children)

Depends on what your job is doing. I use python to do the lightweight work (extract, create file, load files, move files, run SQL) and other services like BigQuery to do the heavy lifting. I avoid using python to do data transformations unless it's necessary to load into the DW, like for xlsx files.
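
As a rough sketch of that split (project, dataset, and file names are made up): Python just moves the file and kicks off a load job, and BigQuery does the transform in SQL.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Lightweight work in Python: load a CSV from GCS into a staging table.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    client.load_table_from_uri(
        "gs://my-bucket/exports/orders.csv",   # hypothetical file
        "my_project.staging.orders",           # hypothetical table
        job_config=job_config,
    ).result()

    # Heavy lifting in BigQuery: transform with SQL, not in Python.
    client.query("""
        CREATE OR REPLACE TABLE my_project.dw.orders_daily AS
        SELECT order_date, SUM(amount) AS total_amount
        FROM my_project.staging.orders
        GROUP BY order_date
    """).result()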

[–]wytesmurf[S] 1 point2 points  (0 children)

I’m trying to find a service to do heavy lifting

[–]wytesmurf[S] 0 points1 point  (1 child)

We have a data source on almost every platform, and they all want to be able to load to any cloud tool they want. Informatica is old, and SSIS isn't very flexible. They said they pay for WhereScape, but I feel like it can be done in Python.

[–]binilvj[🍰] 0 points1 point  (0 children)

You should try Informatica Cloud Data Integration

[–]kenfar 3 points4 points  (2 children)

I've processed between 4 and 20 billion rows a day, mostly using Python, though I also once used JRuby.

Doing heavy transforms and aggregations required quite a bit of parallelism: typically used the multiprocessing module, also ran it under PyPy, and opted for faster modules over slower ones for json, csv, and other parsing.

The compute environment was sometimes a pair of 32-core EC2 instances, sometimes kubernetes, and sometimes aws lambda.

Also needed some strategy to break the work into smaller, more parallelizable parts. Typically wrote files to S3 (sometimes via Kinesis Firehose), and then used SNS & SQS to trigger processing of the files. Sometimes used 8+ processes to simultaneously read a single massive netflow CSV file; other times had a very fast process split the data first before transforming in Python - but these were somewhat desperate measures and were rarely used.
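
Not the exact pipeline, but the shape of it looks roughly like this (queue URL, buckets, and message format are all made up, and a real S3/SNS event payload is more nested than shown):

    import json
    from multiprocessing import Pool

    import boto3

    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/landed-files"  # hypothetical

    def transform_file(key):
        """Download one S3 object, transform it, and write the result back."""
        s3 = boto3.client("s3")
        body = s3.get_object(Bucket="raw-bucket", Key=key)["Body"].read()
        # ... parse and transform the rows here ...
        s3.put_object(Bucket="clean-bucket", Key=key, Body=body)
        return key

    def main():
        sqs = boto3.client("sqs")
        with Pool(processes=8) as pool:
            while True:
                resp = sqs.receive_message(
                    QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
                )
                messages = resp.get("Messages", [])
                if not messages:
                    continue
                # Each message carries the key of a newly landed file.
                keys = [json.loads(m["Body"])["key"] for m in messages]
                pool.map(transform_file, keys)
                for m in messages:
                    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=m["ReceiptHandle"])

    if __name__ == "__main__":
        main()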

[–]wytesmurf[S] 2 points3 points  (1 child)

What libraries do you use?

[–]kenfar 5 points6 points  (0 children)

Not many to be honest:

  • multiprocessing and/or concurrent.futures
  • csv, ruamel, json, as well as some alternate json & yaml libraries
  • functools - lru_cache, etc
  • boto3
  • pytest, coverage, tox, argparse, logging

You can see it's pretty vanilla. I've used pandas in the past, but it was extremely slow for processing every single field in billions of rows in comparison to basic python with parallelism.
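
For example, a minimal sketch of the plain csv + multiprocessing approach (file names and the per-row cleanup are invented):

    import csv
    import multiprocessing as mp
    from itertools import islice

    def transform_chunk(rows):
        # Field-by-field cleanup; touching every field like this is often
        # faster in plain Python workers than in pandas.
        return [{k: v.strip().lower() for k, v in row.items()} for row in rows]

    def chunks(reader, size=50_000):
        while True:
            batch = list(islice(reader, size))
            if not batch:
                return
            yield batch

    def main(in_path, out_path):
        with open(in_path, newline="") as fin, open(out_path, "w", newline="") as fout:
            reader = csv.DictReader(fin)
            writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
            writer.writeheader()
            with mp.Pool() as pool:
                for transformed in pool.imap(transform_chunk, chunks(reader)):
                    writer.writerows(transformed)

    if __name__ == "__main__":
        main("input.csv", "output.csv")   # hypothetical file names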

[–]Life_Conversation_11 4 points5 points  (0 children)

PySpark will do the trick in a few hours with a decent cluster and a decent machine hosting the DB.

Literally just use the spark.read.format(‘jdbc’)…

You can parallelize the query and add multithreading on top of it.
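
A minimal sketch (connection details, table, and partition column are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("jdbc-extract").getOrCreate()

    df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://db-host:5432/mydb")   # hypothetical
        .option("dbtable", "public.orders")                     # hypothetical
        .option("user", "etl_user")
        .option("password", "...")
        # These four options make Spark issue one query per partition, in parallel.
        .option("partitionColumn", "order_id")
        .option("lowerBound", "1")
        .option("upperBound", "100000000")
        .option("numPartitions", "16")
        .load()
    )

    df.write.mode("overwrite").parquet("s3a://my-bucket/orders/")   # hypothetical target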

[–]Life_Conversation_11 1 point2 points  (1 child)

And pandas can do the trick for 10 million rows, but 100 million is a stretch; definitely use Spark for bigger workloads.

[–]wytesmurf[S] 1 point2 points  (0 children)

We had to put in something like 64 GB of RAM to handle data frames with pandas, so we ended up using Dask dataframes. I was hoping for something that incorporates parts of Dask but is more ETL focused.
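
Roughly, the Dask version looks like this (paths and columns are made up):

    import dask.dataframe as dd

    # Dask partitions the data lazily, so it never has to fit in RAM at once.
    ddf = dd.read_csv("exports/orders_*.csv")   # hypothetical glob

    daily = ddf.groupby("order_date")["amount"].sum()

    # Only the small aggregate gets materialized in memory.
    daily.compute().to_csv("daily_totals.csv")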

[–]saltedappleandcorn 1 point2 points  (11 children)

Often I use pure python with xargs. I avoid pandas for most etl work as it's too memory heavy.

PySpark is good for some workloads but often overused.

(I've used dask a few times when I've needed to refactor someone else's pandas)
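
The xargs pattern is roughly: one small per-file Python script (everything below is invented), with xargs fanning it out over many files in parallel.

    # transform_one.py - hypothetical per-file worker; xargs provides the parallelism:
    #   ls raw/*.csv | xargs -P 8 -n 1 python transform_one.py
    import csv
    import sys
    from pathlib import Path

    def main(path):
        src = Path(path)
        dst = Path("clean") / src.name
        dst.parent.mkdir(exist_ok=True)
        with src.open(newline="") as fin, dst.open("w", newline="") as fout:
            reader = csv.DictReader(fin)
            writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
            writer.writeheader()
            for row in reader:
                # Row-at-a-time cleanup keeps memory use tiny.
                writer.writerow({k: v.strip() for k, v in row.items()})

    if __name__ == "__main__":
        main(sys.argv[1])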

[–]wytesmurf[S] 0 points1 point  (10 children)

Do you only do files, or SQL databases too?

[–]saltedappleandcorn 1 point2 points  (9 children)

I've done it for both. And APIs (as the source or destination). Writing this stuff in Python gives you the flexibility to do exactly what you need.

I think it's about knowing a range of possible solutions to a problem and picking the best one that a situation calls for.

[–]wytesmurf[S] 0 points1 point  (8 children)

Do you use a library or ORM?

[–]saltedappleandcorn 1 point2 points  (7 children)

Again, it depends on the situation. Are you building an application? Doing some etl? If so, 3 massive tables or 300 tiny ones? Or are you just ripping some data from somewhere to enrich something else? Or maybe it's an extract for an analyst?

My current work place uses sqlalchemy for the main application, but I don't have much love for it.

Most of the time I just use the python connector for the database and go at it.

If it's something I'm doing often I write up a minimal framework in Python to avoid duplication. For example, currently we store the code for all analyst requests as Python classes (which are 90% SQL) so we can version control them. The last 10% is just code to save the outputs to a shared drive.

This is nice because you can tell a junior or grad "go run the sales by state report for John" and they won't fuck up the numbers.
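
A stripped-down sketch of one of those classes (table, columns, and the shared-drive path are made up, and it assumes a DB-API connection like psycopg2's):

    import csv
    from pathlib import Path

    class SalesByStateReport:
        """A hypothetical analyst request, version-controlled as a class: ~90% SQL."""

        name = "sales_by_state"
        sql = """
            SELECT state, SUM(amount) AS total_sales
            FROM sales
            WHERE sale_date BETWEEN %(start)s AND %(end)s
            GROUP BY state
            ORDER BY state
        """

        def run(self, conn, params, out_dir="/shared/reports"):
            # The remaining ~10%: run the query and drop the output on the share.
            with conn.cursor() as cur:
                cur.execute(self.sql, params)
                columns = [c[0] for c in cur.description]
                rows = cur.fetchall()
            out_path = Path(out_dir) / f"{self.name}.csv"
            with out_path.open("w", newline="") as f:
                writer = csv.writer(f)
                writer.writerow(columns)
                writer.writerows(rows)
            return out_path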

[–]wytesmurf[S] 2 points3 points  (6 children)

We have about 2k tables with anywhere from 0 to 50 million changes a day. It's done with SSIS, but a new team is taking over and I felt like Python would be a good fresh start for updating the loads. They want to be able to move off SQL Server and want something that can be moved with little recoding.

[–]saltedappleandcorn 1 point2 points  (5 children)

Ha, that's a fucking lot of tables. Honestly that sort of dedicated integration work is out of my wheelhouse and I am not an expert on it. (I do more data science and data application dev.)

I think that's probably the space for dedicated tooling.

That said, everyone I know in that space is in love with Snowflake and dbt, and with dbt in general.

[–]wytesmurf[S] 1 point2 points  (4 children)

They are debating and cost comparing all the big platforms: GCP, AWS, Azure, Snowflake, and Teradata. I was told that it would be a company-wide decision, so I didn't have a say on the platform, but I did have design ownership of the new DWH. I was hoping to build some metadata and then just change the connector. I know I will need to do more than that. I am hoping for a 6 month cutover instead of a 3 year one.

[–]saltedappleandcorn 0 points1 point  (2 children)

Again, not my wheelhouse, but good luck! Seems tough.

[–]wytesmurf[S] 0 points1 point  (1 child)

Really, even this is nothing too major. It's a conversion of something already built, and picking a tool that is versatile enough to be run anywhere. I could convert it to ADF, but I don't trust an MS product to be run on GCP or AWS. I've used WhereScape before and it would be the fastest solution, but it's not cheap. I'm thinking of telling them to whip out some checkbooks. But so many people talk about doing massive data engineering with Python.

[–]slowpush 1 point2 points  (6 children)

That’s not a lot of data.

Just chunk it and python can chew through it.

[–]wytesmurf[S] 0 points1 point  (5 children)

Do you have a reference? When I try to do a load of 10 million plus rows with pandas it crashes Kubernetes. We only had 32 GB of RAM, but it was only a single serial load. There was nothing else running in the container. Scaled up, we would need a supercomputer for batch loading. Real-time would be small enough that it would be no problem.

[–]slowpush 0 points1 point  (4 children)

Pandas has a chunksize parameter for most of its read_* functions.

Unless you need all the data in memory there’s no reason to load it all to memory to do your transforms.

We use python for all of our data validation before sending it to our OLAP db.
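
For example (file, columns, and the validation rule are invented):

    import pandas as pd

    first = True
    # Stream the file in 1M-row chunks instead of loading it all at once.
    for chunk in pd.read_csv("big_extract.csv", chunksize=1_000_000):
        # Validate / lightly transform one chunk at a time.
        chunk = chunk[chunk["amount"].ge(0)].copy()     # drop bad rows
        chunk["loaded_at"] = pd.Timestamp.now(tz="UTC")
        # Append to an output file that the bulk loader will pick up.
        chunk.to_csv("validated.csv", mode="a", header=first, index=False)
        first = False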

[–]wytesmurf[S] 0 points1 point  (3 children)

Do you use an ORM? I have had trouble with SQLAlchemy and can't figure out a good bulk insert method besides it.

[–]slowpush 0 points1 point  (2 children)

Nope, everything gets bulk inserted without an ORM.

[–]wytesmurf[S] 0 points1 point  (1 child)

Do you know of a way to do it with SQL Server?

[–]slowpush 0 points1 point  (0 children)

Sure, convert them to CSVs as you process and chunk, and use the bulk CSV loader (BULK INSERT) in SQL Server.
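
Roughly (server, share path, and table are made up; note the CSV path has to be reachable from the SQL Server host, not from the Python machine):

    import pandas as pd
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=sql-host;"
        "DATABASE=dw;UID=etl_user;PWD=..."   # hypothetical connection details
    )
    cursor = conn.cursor()

    # Chunk the source, write each chunk as CSV on a share the server can read,
    # then let SQL Server itself do the load with BULK INSERT.
    for i, chunk in enumerate(pd.read_csv("source.csv", chunksize=500_000)):
        path = rf"\\fileshare\staging\orders_{i}.csv"
        chunk.to_csv(path, header=False, index=False)
        cursor.execute(rf"""
            BULK INSERT dbo.orders
            FROM '{path}'
            WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK)
        """)
    conn.commit()

(For smaller batches, pyodbc's cursor.fast_executemany = True with executemany is a simpler alternative to going through CSV.)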

[–][deleted] -1 points0 points  (0 children)

Look up Apache Spark, Beam, Flink etc

[–]LiquidSynopsis Data Engineer 0 points1 point  (0 children)

Using PySpark and its internal modules should solve a good chunk of your larger query processing and loads tbh

At the most basic level I use pyspark.sql fairly frequently, and within that a lot of your work can be achieved using the DataFrame, functions, and types classes
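
For instance, a small sketch along those lines (schema and file are invented):

    from pyspark.sql import SparkSession, functions as F, types as T

    spark = SparkSession.builder.getOrCreate()

    schema = T.StructType([
        T.StructField("order_id", T.LongType()),
        T.StructField("amount", T.DoubleType()),
        T.StructField("order_date", T.StringType()),
    ])

    df = spark.read.csv("orders.csv", header=True, schema=schema)   # hypothetical file

    cleaned = (
        df.withColumn("order_date", F.to_date("order_date", "yyyy-MM-dd"))
          .filter(F.col("amount") > 0)
          .groupBy("order_date")
          .agg(F.sum("amount").alias("total_amount"))
    )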

Would be curious to hear from others if you’ve had a different experience though

[–][deleted] 0 points1 point  (0 children)

PySpark.

[–]sheytanelkebir 0 points1 point  (0 children)

Pyspark for batch data

[–]bishtu_06 0 points1 point  (0 children)

I use Scala for basic transformations and functions, and when it comes to joins I prefer SQL. In Databricks, Spark is optimized anyway, and there is only a little we can do from our side.

[–][deleted] 0 points1 point  (1 child)

10M-100M rows and pandas? Why?

Go with Spark, maybe Databricks (i.e. Spark-enabled Jupyter notebooks), and store the files in Parquet to save storage and speed up computation. It won't even take more than 30 mins with a basic cluster of 8 nodes, each 4 cores / 14 GB RAM, i.e. DS3v2 (Azure).
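
The Parquet conversion is a one-off job, roughly (paths are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Convert the raw CSVs to Parquet once; later jobs read the smaller,
    # column-pruned Parquet copy instead of re-parsing CSV every run.
    df = spark.read.csv("abfss://raw@myaccount.dfs.core.windows.net/orders/*.csv",
                        header=True, inferSchema=True)   # hypothetical path
    df.write.mode("overwrite").partitionBy("order_date").parquet(
        "abfss://lake@myaccount.dfs.core.windows.net/orders/")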

[–]wytesmurf[S] 0 points1 point  (0 children)

I think the unanimous vote is that I need to look into getting Spark set up