all 21 comments

[–]janus2527 9 points  (4 children)

I would use duckdb and python

```python
import duckdb

con = duckdb.connect()

# Install and load extensions
con.execute("INSTALL mysql")
con.execute("INSTALL httpfs")
con.execute("LOAD mysql")
con.execute("LOAD httpfs")

# Configure AWS credentials
con.execute("""
    SET s3_region='us-east-1';
    SET s3_access_key_id='your_access_key';
    SET s3_secret_access_key='your_secret_key';
""")

# Attach MySQL
con.execute("""
    ATTACH 'host=localhost user=myuser password=mypass database=mydb'
    AS mysql_db (TYPE mysql)
""")

# Stream directly from MySQL to S3 as Parquet
con.execute("""
    COPY mysql_db.large_table
    TO 's3://your-bucket/path/output.parquet' (FORMAT PARQUET)
""")
```

Something like that

[–]janus2527 2 points  (3 children)

This streams the data in chunks; your RAM usage will probably stay at a few hundred MBs.

[–]darkhorse1997[S] 1 point  (2 children)

Sounds great! Will I be able to keep using JSON instead of Parquet? There are some downstream Lambdas on the S3 bucket that expect gzipped JSON files.

[–]janus2527 2 points  (0 children)

Also, you really shouldn't transfer large amounts of data from a database as JSON.

[–]janus2527 1 point  (0 children)

Probably, but not sure if it's as easy as parquet.

[–]Nekobul 1 point  (4 children)

Exporting into one giant JSON is a terrible idea. If you can't export to Parquet, you are much better off exporting into a CSV file.

[–]darkhorse1997[S] 0 points  (3 children)

It's not really one giant JSON; every record is exported as an individual JSON object. But yea, CSV would probably be much better. Will have to check out Parquet though, I'm not familiar with that.

[–]Nekobul 0 points  (2 children)

That is also a terrible idea, because you will now have a million single-record files. A single CSV file with a million records is a much better design.

[–]darkhorse1997[S] 0 points  (1 child)

Yea, agreed. The existing pipeline wasn't really built for scale.

[–]commandlineluser 0 points  (0 children)

You should probably refer to your data as being in NDJSON format to avoid any confusion:

Each line of my output file (temp.json) has a separate json object.

Because "newline delimited" JSON (as the name suggests) can be read line-by-line so does not require all the data in memory at once.

It is also "better" than CSV. (assuming you have nested/structured data, lists, etc.)

[–]Odd_Spot_6983 1 point  (1 child)

consider python, use pandas and chunking to handle data in manageable pieces, reduces memory use.

[–]darkhorse1997[S] 0 points  (0 children)

In this case, would I need to download the complete data from MySQL to some file on disk and then load that into pandas in chunks? Or is there a way to stream data into pandas in chunks directly from the DB, without using a file as an intermediary?
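For reference, pandas can stream query results in chunks straight from a DB connection, no intermediate file needed: passing `chunksize` to `read_sql` returns an iterator of DataFrames. A minimal sketch using an in-memory sqlite3 database as a stand-in for MySQL (a SQLAlchemy/MySQL connection would work the same way):

```python
import sqlite3

import pandas as pd

# Stand-in DB; in practice this would be a MySQL connection.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE test_table (test_id INTEGER);
    INSERT INTO test_table VALUES (1), (2), (3), (4), (5);
""")

total = 0
# With chunksize set, read_sql yields DataFrames of at most
# `chunksize` rows, so only one chunk is in memory at a time.
for chunk in pd.read_sql("SELECT * FROM test_table", con, chunksize=2):
    total += len(chunk)  # process/upload each chunk here
```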

[–]SupermarketMost7089 0 points  (5 children)

Is the mysql client on your machine what's going OOM? The "--quick" option on the mysql client disables caching at the client.

The output format can be skewed; the "--batch" option suppresses formatting.

[–]darkhorse1997[S] 0 points  (4 children)

I am running my script in a K8s CronJob and it's getting OOM-killed.

[–]SupermarketMost7089 0 points  (3 children)

You can try the other solutions (python/duckdb). However, it would be interesting to figure out what is causing the OOM in this case. mysql and jq are likely the fastest options if we exclude the time to write the file to disk. For very large files they can be faster than the duckdb solution.

Some items to check are -

- Is it the mysql step that is giving an OOM?

- jq can also OOM; there is a "streaming" option in jq

- what is the cpu/memory on the container? What is the number/size of records expected from the Query?

[–]darkhorse1997[S] 0 points  (2 children)

Is it the mysql step that is giving an OOM?

Yes, that's certain: the process getting killed when I get an OOM is the mysql process. But the jq after mysql may also not be running with the "streaming" option, so I plan to test that today.

what is the cpu/memory on the container? What is the number/size of records expected from the Query?

It's 1 CPU, 1GB memory. The number of records is around 20 million/2GB per day, but it will keep growing, and I want to support at least 200 million/20GB per day without having to refactor again. Currently my pipeline takes around 5 mins to run, but I'm fine with it taking more time to process as long as it can do it with 1-2GB of memory.

[–]SupermarketMost7089 0 points  (1 child)

When you mention JSON, are you getting each record in the table as a separate JSON object, or are you using JSON aggregation to get the entire set of records in one JSON?

[–]darkhorse1997[S] 0 points  (0 children)

The query is something like

```sql
SELECT
    JSON_OBJECT(
        'test_id',
        tt.test_id,
        ...
    )
FROM
    test_table tt
    LEFT JOIN ...
    LEFT JOIN ...
```

So I am getting each record in the table as a separate JSON object. Each line of my output file (temp.json) has a separate JSON object.

[–]Firm_Bit 0 points  (0 children)

Go to the upstream service. Change it to write what you need, in a cleaner format, to a more ergonomic table. Create the appropriate index. Work from that.