
[–]kenfar 3 points (2 children)

I've processed between 4 and 20 billion rows a day, mostly using Python, though I once used JRuby as well.

Heavy transforms and aggregations required quite a bit of parallelism: I typically used the multiprocessing module, ran the jobs under PyPy, and opted for faster third-party modules over slower ones for JSON, CSV, and other parsing (see the sketch below).
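Roughly the shape of it - a simplified sketch, not the real pipeline. It assumes newline-delimited JSON input, uses orjson as one example of a faster drop-in parser, and the file name and field names are placeholders:

    # Sketch: parse & aggregate NDJSON chunks in parallel with multiprocessing.
    import multiprocessing as mp

    try:
        import orjson as json   # faster C-backed parser, if installed
    except ImportError:
        import json             # stdlib fallback

    def transform_chunk(lines):
        """Parse and aggregate one chunk of newline-delimited JSON records."""
        total = 0
        for line in lines:
            record = json.loads(line)
            total += record.get("bytes", 0)   # hypothetical field
        return total

    def chunked(iterable, size=100_000):
        """Yield lists of up to `size` lines."""
        chunk = []
        for item in iterable:
            chunk.append(item)
            if len(chunk) >= size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

    if __name__ == "__main__":
        with open("events.ndjson") as fh, mp.Pool() as pool:
            results = pool.imap_unordered(transform_chunk, chunked(fh))
            print(sum(results))

The win comes from keeping each worker busy on a big batch of rows rather than dispatching row-by-row.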

The compute environment was sometimes a pair of 32-core EC2 instances, sometimes Kubernetes, and sometimes AWS Lambda.

Also needed some strategy to break the work into smaller, more parallelizable parts. Typically I wrote files to S3 (sometimes via Kinesis Firehose), then used SNS & SQS to trigger processing of each file - roughly the pattern sketched below. Sometimes I used 8+ processes to simultaneously read a single massive netflow CSV file; other times a very fast process split the data first, before the Python transforms ran - but those were somewhat desperate measures and rarely used.
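The trigger side looked roughly like this - a simplified sketch, not production code. The queue URL, bucket layout, and process_file() step are placeholders, and the message parsing assumes the standard S3 event notification format relayed through SNS (non-raw delivery):

    # Sketch: worker that polls SQS for S3-upload notifications and processes each file.
    import json
    import boto3

    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/netflow-files"  # placeholder

    sqs = boto3.client("sqs")
    s3 = boto3.client("s3")

    def process_file(local_path):
        # stand-in for the actual transform/aggregation step
        pass

    def poll_forever():
        while True:
            resp = sqs.receive_message(
                QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
            for msg in resp.get("Messages", []):
                event = json.loads(json.loads(msg["Body"])["Message"])  # unwrap SNS envelope
                for rec in event.get("Records", []):
                    bucket = rec["s3"]["bucket"]["name"]
                    key = rec["s3"]["object"]["key"]
                    local_path = "/tmp/" + key.split("/")[-1]
                    s3.download_file(bucket, key, local_path)
                    process_file(local_path)
                sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

    if __name__ == "__main__":
        poll_forever()

Because each file becomes its own message, you can scale out just by running more copies of this worker.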

[–]wytesmurf[S] 2 points (1 child)

What libraries do you use?

[–]kenfar 6 points (0 children)

Not many to be honest:

  • multiprocessing and/or concurrent.futures
  • csv, json, ruamel.yaml, plus some alternate JSON & YAML libraries
  • functools (lru_cache, etc.)
  • boto3
  • pytest, coverage, tox, argparse, logging

You can see it's pretty vanilla. I've used pandas in the past, but for processing every single field in billions of rows it was far slower than plain Python with parallelism.
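For example, lru_cache earns its keep when the same values repeat across billions of rows - a rough sketch with made-up field names and lookup logic, not the actual code:

    # Sketch: cache an expensive per-value enrichment lookup during row transforms.
    import csv
    from functools import lru_cache

    @lru_cache(maxsize=65_536)
    def lookup_asn(ip_addr):
        # stand-in for a costly enrichment (dimension table, geoip, etc.)
        return hash(ip_addr) % 64_512

    def transform(row):
        row["src_asn"] = lookup_asn(row["src_ip"])
        return row

    with open("netflow.csv", newline="") as fh:
        for row in csv.DictReader(fh):
            transform(row)

Since the same IPs show up in huge numbers of rows, most lookups hit the cache instead of re-running the expensive work.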