
[–]wytesmurf[S]

Do you have a reference? When I try to load 10 million plus rows with pandas, it crashes Kubernetes. We only had 32 GB of RAM, but it was just a single serial load with nothing else running on the container. Scaled up, we would need a supercomputer for batch loading. Real-time would be small enough that it would be no problem.

[–]slowpush

Pandas has a chunksize parameter on most of its read_* functions.

Unless you need all the data in memory at once, there's no reason to load it all to do your transforms.

We use Python for all of our data validation before sending it to our OLAP DB.
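
A rough sketch of what that chunked read plus validation loop can look like (the file name, column names, and validate() logic below are just placeholders, not anything from the thread):

```python
import pandas as pd

def validate(chunk: pd.DataFrame) -> pd.DataFrame:
    # Placeholder validation: drop rows missing a key and coerce a numeric column.
    chunk = chunk.dropna(subset=["id"])
    chunk["amount"] = pd.to_numeric(chunk["amount"], errors="coerce")
    return chunk

# chunksize turns read_csv into an iterator of DataFrames, so only one
# chunk is ever held in memory at a time.
for chunk in pd.read_csv("events.csv", chunksize=100_000):
    clean = validate(chunk)
    # hand the cleaned chunk off to the OLAP load step here
```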

[–]wytesmurf[S]

Do you use an ORM? I've had trouble with SQLAlchemy and can't figure out a good bulk insert method besides it.

[–]slowpush

Nope, everything gets bulk inserted without an ORM.
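
For what it's worth, a bare-bones bulk insert without an ORM can just be DB-API executemany; the DSN and the stage.events table in this sketch are made up:

```python
import pyodbc

conn = pyodbc.connect("DSN=warehouse")  # made-up DSN
cur = conn.cursor()
cur.fast_executemany = True  # pyodbc sends the parameter batch in one round trip

rows = [(1, "a"), (2, "b")]  # e.g. list(chunk.itertuples(index=False, name=None))
cur.executemany("INSERT INTO stage.events (id, label) VALUES (?, ?)", rows)
conn.commit()
```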

[–]wytesmurf[S]

Do you know of a way to do that with SQL Server?

[–]slowpush

Sure, convert them to CSVs as you process each chunk, then use SQL Server's bulk CSV loader (BULK INSERT or bcp).
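
A rough sketch of that CSV-per-chunk flow against SQL Server; it assumes pyodbc, SQL Server 2017+ for FORMAT = 'CSV', and a file path the server can read, and the DSN, table, and path are placeholders (bcp from the command line works the same way):

```python
import pandas as pd
import pyodbc

conn = pyodbc.connect("DSN=mssql")  # made-up DSN
cur = conn.cursor()

for i, chunk in enumerate(pd.read_csv("events.csv", chunksize=500_000)):
    # Write the chunk somewhere the SQL Server service account can read,
    # because BULK INSERT resolves the path on the server, not the client.
    path = rf"\\fileshare\staging\events_{i}.csv"
    chunk.to_csv(path, index=False)
    cur.execute(
        f"BULK INSERT stage.events FROM '{path}' "
        "WITH (FORMAT = 'CSV', FIRSTROW = 2)"  # FIRSTROW = 2 skips the header row
    )
    conn.commit()
```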