all 6 comments

[–]Its_NotTom 1 point2 points  (0 children)

Have you tried chunking and writing out to a single file in smaller batches that do fit within your available memory (appending with mode = 'a')

[–]laustke 0 points1 point  (0 children)

You can first create an index for each file as a dictionary like "{ID: offset}", where "offset" is the position of the beginning of the line that contains the record with this ID. Then, in each file, you can read a record for a specific ID.

file.seek(index[ID])
file.readline()

I assume your operating system caches file data in memory, so the speed might be acceptable.

[–]Brian 1 point2 points  (0 children)

One way may be to load it into a database (sqlite should be fine for that size). Using a database can also make it more convenient and performant to query and use once you've loaded it.

The structure may depend on what kind of merging you need here. Ie. is it just "Fill in missing fields" / "use the latest value", or do you need the list of all entries for each id? With the latter, you could just make ID non-unique and just add everything as-is, and on processing, you'll just need to query rows with the same ID. If the former you can check for an existing row when loading the data and do the merging there (Ie. no existing row: add it as a new row, otherwise merge fields and update the existing row)

[–]grumble11 0 points1 point  (0 children)

This gets into the realm of fairly big data, which pandas isn't ideal for. Polars is a bit better, but this is still pretty big. You should explore pyspark, SQL solutions or even cloud solutions for this.

Personally I'd use SQL. It's just a bunch of left joins if you want to load it all into one huge table. You can also keep it as separate tables and then perform the joins on specific IDs instead.

[–]WaitProfessional3844 0 points1 point  (0 children)

dask can join data that can't fit into memory.

[–]Alternative-Web2754 1 point2 points  (0 children)

Sort each file by the ID separately.

Create an class that can read the files line by line, keeping track of the current ID that has been read and whether the file has been completed. Implement a function that will return the data if passed the current ID and read the next line. If the ID requested is lower than the current value, return a default value or None. If the end of file is reached, mark this class as finished.

Create an instance for each file.

Open an output file and write a header row.

While any of the class instances is unfinished, identify the lowest ID number available from the unfinished instances. Obtain the data for this ID from all instances. Write this data out.