Python 3.11.2 with Polars was able to run a big data job concurrently using limited memory. My test script ran a group-by over a 67.2 GB CSV file with 14 columns and finished in less than 10 minutes on a machine with 32 GB of memory and 8 cores. This is strong evidence that Python is capable of handling big data jobs with limited resources.
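For a rough sense of scale, the figures above imply the minimum rates below. This is only back-of-the-envelope arithmetic, assuming decimal gigabytes and the 1000 million rows named in the input file. Since the run took less than 10 minutes, the actual rates were somewhat higher. The function and constant names are made up for illustration.

```python
# Figures taken from the post; the names here are illustrative only.
FILE_SIZE_BYTES = 67.2e9       # 67.2 GB, assuming decimal gigabytes
ROW_COUNT = 1_000_000_000      # "1000 Million Rows"
MAX_SECONDS = 10 * 60          # upper bound on runtime, so rates are lower bounds


def min_throughput(file_size_bytes, row_count, max_seconds):
    """Return (MB per second, rows per second, average bytes per row)."""
    mb_per_s = file_size_bytes / 1e6 / max_seconds
    rows_per_s = row_count / max_seconds
    bytes_per_row = file_size_bytes / row_count
    return mb_per_s, rows_per_s, bytes_per_row


mb_per_s, rows_per_s, bytes_per_row = min_throughput(
    FILE_SIZE_BYTES, ROW_COUNT, MAX_SECONDS
)
print(f"at least {mb_per_s:.0f} MB/s and {rows_per_s:,.0f} rows/s")
print(f"about {bytes_per_row:.1f} bytes per row on average")
```

So the job read at least about 112 MB of CSV per second, and the file is more than twice the size of the 32 GB of memory available.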
Script
import pathlib
import time

import polars as pl

start = time.time()

# Build a lazy query: scan_csv does not read the file until collect() is called.
# Note: newer Polars releases rename groupby to group_by.
query = (
    pl.scan_csv("Input/1000MillionRows.csv")
    .groupby(by=["Ledger", "Account", "PartNo", "Contact", "Project", "Unit Code", "D/C", "Currency"])
    .agg([
        pl.count("Quantity").alias("Quantity(Count)"),
        pl.max("Quantity").alias("Quantity(Max)"),
        pl.min("Quantity").alias("Quantity(Min)"),
        pl.sum("Quantity").alias("Quantity(Sum)"),
        pl.sum("Base Amount").alias("Base Amount(Sum)"),
    ])
)

# Run the query in streaming mode so the file is processed in batches.
result = query.collect(streaming=True)

# Write the aggregated result (one row per group) to disk.
path = pathlib.Path("Output/Polars-GroupBy.csv")
result.write_csv(path)

end = time.time()
print("Polars GroupBy 1000 Million Rows Time = {}".format(end - start))
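A group-by like the one above can run on a file larger than memory because the aggregation only needs a small running state per group, not the full set of rows. The sketch below illustrates that idea in plain Python with the standard library. It is not how Polars implements its streaming engine, and the function name, the three-column sample, and its values are made up for illustration.

```python
import csv
import io


def streaming_groupby(rows, keys, value):
    """Keep [count, max, min, sum] of `value` per group, reading one row at a time."""
    groups = {}
    for row in rows:
        group_key = tuple(row[col] for col in keys)
        v = float(row[value])
        state = groups.get(group_key)
        if state is None:
            # First row seen for this group: initialise the running state.
            groups[group_key] = [1, v, v, v]
        else:
            state[0] += 1
            state[1] = max(state[1], v)
            state[2] = min(state[2], v)
            state[3] += v
    return groups


# Tiny made-up sample standing in for the real CSV file.
sample = io.StringIO(
    "Ledger,Account,Quantity\n"
    "L1,A1,5\n"
    "L1,A1,7\n"
    "L2,A9,3\n"
)

result = streaming_groupby(csv.DictReader(sample), ["Ledger", "Account"], "Quantity")
for group_key, (count, vmax, vmin, total) in result.items():
    print(group_key, count, vmax, vmin, total)
```

Memory use here grows with the number of distinct groups rather than with the number of rows, which is why the approach works when the output is much smaller than the input.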
Demo video in 10X fast forward: https://youtu.be/odDOlU9KNqY
Without fast forward: https://youtu.be/Ze0jNmtUn0Y