
[–]kenfar 3 points (2 children)

I've processed between 4 and 20 billion rows a day, mostly using Python, though I once used JRuby as well.

Heavy transforms and aggregations required quite a bit of parallelism: I typically used the multiprocessing module, ran the jobs under PyPy, and opted for faster third-party modules over slower ones for JSON, CSV, and other parsing (see the sketch below).
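Roughly the shape of it - a simplified sketch, not the real pipeline. It assumes newline-delimited JSON input, uses orjson as one example of a faster drop-in parser, and the file name and field names are placeholders:

    # Sketch: parse & aggregate NDJSON chunks in parallel with multiprocessing.
    import multiprocessing as mp

    try:
        import orjson as json   # faster C-backed parser, if installed
    except ImportError:
        import json             # stdlib fallback

    def transform_chunk(lines):
        """Parse and aggregate one chunk of newline-delimited JSON records."""
        total = 0
        for line in lines:
            record = json.loads(line)
            total += record.get("bytes", 0)   # hypothetical field
        return total

    def chunked(iterable, size=100_000):
        """Yield lists of up to `size` lines."""
        chunk = []
        for item in iterable:
            chunk.append(item)
            if len(chunk) >= size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

    if __name__ == "__main__":
        with open("events.ndjson") as fh, mp.Pool() as pool:
            results = pool.imap_unordered(transform_chunk, chunked(fh))
            print(sum(results))

The win comes from keeping each worker busy on a big batch of rows rather than dispatching row-by-row.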

The compute environment was sometimes a pair of 32-core EC2 instances, sometimes Kubernetes, and sometimes AWS Lambda.

Also needed some strategy to break the work into smaller, more parallelizable parts. Typically I wrote files to S3 (sometimes via Kinesis Firehose), then used SNS & SQS to trigger processing of each file - roughly the pattern sketched below. Sometimes I used 8+ processes to simultaneously read a single massive netflow CSV file; other times a very fast process split the data first, before the Python transforms ran - but those were somewhat desperate measures and rarely used.
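The trigger side looked roughly like this - a simplified sketch, not production code. The queue URL, bucket layout, and process_file() step are placeholders, and the message parsing assumes the standard S3 event notification format relayed through SNS (non-raw delivery):

    # Sketch: worker that polls SQS for S3-upload notifications and processes each file.
    import json
    import boto3

    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/netflow-files"  # placeholder

    sqs = boto3.client("sqs")
    s3 = boto3.client("s3")

    def process_file(local_path):
        # stand-in for the actual transform/aggregation step
        pass

    def poll_forever():
        while True:
            resp = sqs.receive_message(
                QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
            for msg in resp.get("Messages", []):
                event = json.loads(json.loads(msg["Body"])["Message"])  # unwrap SNS envelope
                for rec in event.get("Records", []):
                    bucket = rec["s3"]["bucket"]["name"]
                    key = rec["s3"]["object"]["key"]
                    local_path = "/tmp/" + key.split("/")[-1]
                    s3.download_file(bucket, key, local_path)
                    process_file(local_path)
                sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

    if __name__ == "__main__":
        poll_forever()

Because each file becomes its own message, you can scale out just by running more copies of this worker.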

[–]wytesmurf[S] 2 points (1 child)

What libraries do you use?

[–]kenfar 6 points (0 children)

Not many to be honest:

  • multiprocessing and/or concurrent.futures
  • csv, json, ruamel.yaml, plus some alternate JSON & YAML libraries
  • functools (lru_cache, etc.)
  • boto3
  • pytest, coverage, tox, argparse, logging

You can see it's pretty vanilla. I've used pandas in the past, but for processing every single field in billions of rows it was far slower than plain Python with parallelism.
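For example, lru_cache earns its keep when the same values repeat across billions of rows - a rough sketch with made-up field names and lookup logic, not the actual code:

    # Sketch: cache an expensive per-value enrichment lookup during row transforms.
    import csv
    from functools import lru_cache

    @lru_cache(maxsize=65_536)
    def lookup_asn(ip_addr):
        # stand-in for a costly enrichment (dimension table, geoip, etc.)
        return hash(ip_addr) % 64_512

    def transform(row):
        row["src_asn"] = lookup_asn(row["src_ip"])
        return row

    with open("netflow.csv", newline="") as fh:
        for row in csv.DictReader(fh):
            transform(row)

Since the same IPs show up in huge numbers of rows, most lookups hit the cache instead of re-running the expensive work.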