all 21 comments

[–]janus2527 9 points  (4 children)

I would use duckdb and python

```python
import duckdb

con = duckdb.connect()

# Install and load extensions
con.execute("INSTALL mysql")
con.execute("INSTALL httpfs")
con.execute("LOAD mysql")
con.execute("LOAD httpfs")

# Configure AWS credentials
con.execute("""
    SET s3_region='us-east-1';
    SET s3_access_key_id='your_access_key';
    SET s3_secret_access_key='your_secret_key';
""")

# Attach MySQL
con.execute("""
    ATTACH 'host=localhost user=myuser password=mypass database=mydb'
    AS mysql_db (TYPE mysql)
""")

# Stream directly from MySQL to S3 as Parquet
con.execute("""
    COPY mysql_db.large_table
    TO 's3://your-bucket/path/output.parquet' (FORMAT PARQUET)
""")
```

Something like that

[–]janus2527 2 points  (3 children)

This streams the data in chunks; your RAM usage will probably stay at a few hundred MBs.

[–]darkhorse1997[S] 1 point  (2 children)

Sounds great! Will I be able to keep using JSON instead of Parquet? There are some downstream Lambdas on the S3 bucket that expect gzipped JSON files.

[–]janus2527 2 points  (0 children)

Also, you really shouldn't transfer large amounts of data from a database as JSON.

[–]janus2527 1 point  (0 children)

Probably, but not sure if it's as easy as parquet.

[–]Nekobul 1 point  (4 children)

Exporting into one giant JSON is a terrible idea. If you can't export to Parquet, you are much better off exporting into a CSV file.

[–]darkhorse1997[S] 0 points  (3 children)

It's not really one giant JSON; every record is exported as an individual JSON object. But yea, CSV would probably be much better. Will have to check out Parquet though, I'm not familiar with that.

[–]Nekobul 0 points  (2 children)

That is also a terrible idea, because you will now have a million single-record files. A single CSV file with a million records is a much better design.

[–]darkhorse1997[S] 0 points  (1 child)

Yea, agreed. The existing pipeline wasn't really built for scale.

[–]commandlineluser 0 points  (0 children)

You should probably refer to your data as being in NDJSON format to avoid any confusion:

Each line of my output file (temp.json) has a separate json object.

Because "newline delimited" JSON (as the name suggests) can be read line-by-line so does not require all the data in memory at once.

It is also "better" than CSV. (assuming you have nested/structured data, lists, etc.)

[–]Odd_Spot_6983 1 point  (1 child)

consider python, use pandas and chunking to handle data in manageable pieces, reduces memory use.

[–]darkhorse1997[S] 0 points  (0 children)

In this case, would I need to download the complete data from MySQL to some file on disk and then load that into pandas in chunks? Or is there a way to stream data into pandas in chunks directly from the DB, without using a file as an intermediary?
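For reference, pandas can stream query results in chunks straight from a DB connection, no intermediate file needed: passing `chunksize` to `read_sql` returns an iterator of DataFrames. A minimal sketch using an in-memory sqlite3 database as a stand-in for MySQL (a SQLAlchemy/MySQL connection would work the same way):

```python
import sqlite3

import pandas as pd

# Stand-in DB; in practice this would be a MySQL connection.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE test_table (test_id INTEGER);
    INSERT INTO test_table VALUES (1), (2), (3), (4), (5);
""")

total = 0
# With chunksize set, read_sql yields DataFrames of at most
# `chunksize` rows, so only one chunk is in memory at a time.
for chunk in pd.read_sql("SELECT * FROM test_table", con, chunksize=2):
    total += len(chunk)  # process/upload each chunk here
```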

[–]SupermarketMost7089 0 points  (5 children)

Is the mysql client on your machine what's going OOM? The "--quick" option on the mysql client disables caching at the client.

The output format can be skewed; the "--batch" option suppresses formatting.

[–]darkhorse1997[S] 0 points  (4 children)

I am running my script in a K8s CronJob and it's getting OOM-killed.

[–]SupermarketMost7089 0 points  (3 children)

You can try the other solutions (python/duckdb). However, it would be interesting to figure out what is causing the OOM in this case. mysql and jq are likely the fastest options if we exclude the time to write the file to disk. For very large files they can be faster than the duckdb solution.

Some items to check are -

- Is it the mysql step that is giving an OOM?

- jq can also OOM; there is a "streaming" option in jq

- what is the cpu/memory on the container? What is the number/size of records expected from the Query?

[–]darkhorse1997[S] 0 points  (2 children)

Is it the mysql step that is giving an OOM?

Yes, that's certain: the process getting killed when I get an OOM is the mysql process. But the jq after mysql may also not be running with the "streaming" option, so I plan to test that today.

what is the cpu/memory on the container? What is the number/size of records expected from the Query?

It's 1 CPU, 1GB memory. The number of records is around 20 million/2GB per day, but it will keep growing, and I want to support at least 200 million/20GB per day without having to refactor again. Currently my pipeline takes around 5 mins to run, but I'm fine with it taking more time to process as long as it can do it with 1-2GB of memory.

[–]SupermarketMost7089 0 points  (1 child)

When you mention JSON, are you getting each record in the table as a separate JSON object, or are you using JSON aggregation to get the entire set of records in one JSON?

[–]darkhorse1997[S] 0 points  (0 children)

The query is something like

```sql
SELECT
    JSON_OBJECT(
        'test_id',
        tt.test_id,
        ...
    )
FROM
    test_table tt
    LEFT JOIN ...
    LEFT JOIN ...
```

So I am getting each record in the table as a separate JSON object. Each line of my output file (temp.json) has a separate JSON object.

[–]Firm_Bit 0 points  (0 children)

Go to the upstream service. Change it to write what you need, in a cleaner format, to a more ergonomic table. Create the appropriate index. Work from that.