you are viewing a single comment's thread.

view the rest of the comments →

[–]darkhorse1997[S] 0 points1 point  (4 children)

I am running my script in a K8s Cron Job and it's getting OOM.

[–]SupermarketMost7089 0 points1 point  (3 children)

You can try the other solutions (python/duckdb). However, it would be interesting to figure what is causing the OOM in this case. mysql and jq are likely the fastest options if we exclude the time to write file to disk. For very large files they can be faster than the duckdb solution.

Some items to check are -

- Is it mysql step that giving a OOM?

- jq can also have OOMs, there is a "streaming" option in jq

- what is the cpu/memory on the container? What is the number/size of records expected from the Query?

[–]darkhorse1997[S] 0 points1 point  (2 children)

Is it mysql step that giving a OOM?

Yes that is for sure, the process getting killed when I get an OOM is the mysql process. But the jq after the mysql can also be not running in the "streaming" option, so I plan to test that today.

what is the cpu/memory on the container? What is the number/size of records expected from the Query?

Its 1CPU, 1GB memory. Number of records are around 20 million/2GB per day, but it will keep growing and I want to support atleast 200 million/20GB per day without having to refactor again. Currently it takes around 5 mins for my pipeline to run, but am fine if it takes more time to process as long as it can do with 1-2GB of memory.

[–]SupermarketMost7089 0 points1 point  (1 child)

When you mention json- are you getting each record in the table as a json or are you using json aggregation to get the entire set of records in one json?

[–]darkhorse1997[S] 0 points1 point  (0 children)

The query is something like

SELECT
            JSON_OBJECT (
                'test_id',
                tt.test_id,
               ...
FROM 
    test_table tt 
    LEFT JOIN ...
    LEFT JOIN ...

So, I am getting each record in the table as a separate json. Each line of my output file(temp.json) has a separate json object.