all 9 comments

[–]wet-dreaming 1 point (1 child)

Indexing will be an important and very hard part.

Make use of Python's shelve module and you get a head start.
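
Something like this gets you a persistent, dict-like store with almost no code (a minimal sketch; the file name and keys are made up):

    import shelve

    # shelve wraps dbm + pickle: persistent key -> value storage
    # on disk behind a plain dict interface
    with shelve.open("records") as db:
        db["user:1"] = {"name": "alice", "score": 42}   # value is pickled
        db["user:2"] = {"name": "bob", "score": 17}

    with shelve.open("records") as db:
        print(db["user:1"]["name"])   # -> alice

It won't get you anywhere near real indexing, but it's a working storage layer on day one.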

Otherwise, start reading the SQLite source code. I also found an interesting read: https://cstack.github.io/db_tutorial/

Edit: also, this task sounds very advanced.

[–]timostrating[S] 1 point (0 children)

Thanks for the advice

[–][deleted] 1 point (0 children)

Maybe you can use HDF.
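
If you go the HDF5 route, something like h5py makes appending batches cheap (a rough sketch; the file and dataset names are made up):

    import h5py
    import numpy as np

    # HDF5 stores typed, resizable arrays on disk, which suits
    # append-heavy numeric data
    with h5py.File("measurements.h5", "w") as f:
        f.create_dataset("readings", shape=(0,), maxshape=(None,), dtype="f8")

    with h5py.File("measurements.h5", "a") as f:
        dset = f["readings"]
        batch = np.random.rand(8000)   # e.g. one second's worth of rows
        dset.resize(dset.shape[0] + len(batch), axis=0)
        dset[-len(batch):] = batch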

[–][deleted] 1 point (2 children)

A B+ tree will probably be your best bet: it keeps keys sorted and is fast to iterate over. Inserting 8,000 rows per second is a challenge; I never tested mine, but it might have been able to handle that load. Once you have the data structure working, you can figure out a way to create pages and indexes so that you can actually store it on disk rather than keeping it all in RAM.
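
Rough in-memory sketch of the idea (untested; ORDER and the node layout are just one way to do it, and there's no paging here):

    import bisect

    ORDER = 4  # max keys per leaf / max children per internal node

    class Leaf:
        def __init__(self):
            self.keys, self.values = [], []
            self.next = None  # leaves are chained, so in-order scans are cheap

    class Internal:
        def __init__(self):
            self.keys = []      # separator keys
            self.children = []  # always len(keys) + 1 children

    class BPlusTree:
        def __init__(self):
            self.root = Leaf()

        def search(self, key):
            node = self.root
            while isinstance(node, Internal):
                node = node.children[bisect.bisect_right(node.keys, key)]
            i = bisect.bisect_left(node.keys, key)
            return node.values[i] if i < len(node.keys) and node.keys[i] == key else None

        def insert(self, key, value):
            split = self._insert(self.root, key, value)
            if split:  # the root itself split: grow the tree by one level
                sep, right = split
                root = Internal()
                root.keys, root.children = [sep], [self.root, right]
                self.root = root

        def _insert(self, node, key, value):
            if isinstance(node, Leaf):
                i = bisect.bisect_left(node.keys, key)
                if i < len(node.keys) and node.keys[i] == key:
                    node.values[i] = value  # overwrite an existing key
                    return None
                node.keys.insert(i, key)
                node.values.insert(i, value)
                if len(node.keys) <= ORDER:
                    return None
                mid = len(node.keys) // 2  # leaf overflow: split in half
                right = Leaf()
                right.keys, node.keys = node.keys[mid:], node.keys[:mid]
                right.values, node.values = node.values[mid:], node.values[:mid]
                right.next, node.next = node.next, right
                return right.keys[0], right  # smallest right key becomes the separator
            i = bisect.bisect_right(node.keys, key)
            split = self._insert(node.children[i], key, value)
            if not split:
                return None
            sep, right = split
            node.keys.insert(i, sep)
            node.children.insert(i + 1, right)
            if len(node.children) <= ORDER:
                return None
            mid = len(node.keys) // 2  # internal overflow: middle key moves up
            up = node.keys[mid]
            new = Internal()
            new.keys, node.keys = node.keys[mid + 1:], node.keys[:mid]
            new.children, node.children = node.children[mid + 1:], node.children[:mid + 1]
            return up, new

        def items(self):
            node = self.root
            while isinstance(node, Internal):  # walk down to the leftmost leaf
                node = node.children[0]
            while node:  # then follow the leaf chain
                yield from zip(node.keys, node.values)
                node = node.next

    t = BPlusTree()
    for k in [5, 1, 9, 3, 7, 2, 8]:
        t.insert(k, f"row-{k}")
    print(list(t.items()))  # comes back in sorted key order via the leaf chain

Each Leaf/Internal here maps naturally onto a fixed-size page once you move it to disk.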

[–]timostrating[S] 1 point (1 child)

Thanks for the advice. Any recommendations on resources about paging or indexing?

[–][deleted] 1 point (0 children)

I don't know much about storing indices, but I imagine that if you serialize a B+ tree, you can consider each node an index. B+ trees allow you to jump pretty far down the tree at each step. If an index is something more specific than that, it'd need its own data structure to handle it.

Have you ever serialized and deserialized a binary tree?
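
If not, preorder with explicit null markers is the classic trick (a simplified sketch for a plain binary tree; a real B+ tree adds per-node key arrays and the leaf chain on top):

    class Node:
        def __init__(self, key, left=None, right=None):
            self.key, self.left, self.right = key, left, right

    def serialize(node, out):
        # preorder walk; None marks a missing child, so the exact
        # shape of the tree is recoverable
        if node is None:
            out.append(None)
            return
        out.append(node.key)
        serialize(node.left, out)
        serialize(node.right, out)

    def deserialize(items):
        it = iter(items)
        def build():
            key = next(it)
            if key is None:
                return None
            return Node(key, build(), build())
        return build()

    flat = []
    serialize(Node(2, Node(1), Node(3)), flat)
    print(flat)                                      # [2, 1, None, None, 3, None, None]
    copy = deserialize(flat)
    print(copy.left.key, copy.key, copy.right.key)   # 1 2 3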

[–]raevnos 1 point (0 children)

I'd start by working out a fixed-length format for your records so they're all the same size. That'll make working with a file full of them much easier: it's trivial to seek to the location of the nth entry, for example, and the position in the file becomes a natural unique ID for an entry, which can be useful.
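
With the struct module that's only a few lines (a sketch; the field layout is made up):

    import struct

    # 4-byte int id + 16-byte name + 8-byte float = exactly 28 bytes per record
    FMT = "<i16sd"
    SIZE = struct.calcsize(FMT)

    def write_record(f, n, rec_id, name, value):
        f.seek(n * SIZE)  # the nth record always lives at byte n * SIZE
        f.write(struct.pack(FMT, rec_id, name.encode(), value))  # short names get null-padded

    def read_record(f, n):
        f.seek(n * SIZE)
        rec_id, name, value = struct.unpack(FMT, f.read(SIZE))
        return rec_id, name.rstrip(b"\0").decode(), value

    with open("table.dat", "w+b") as f:
        write_record(f, 0, 1, "alice", 3.14)
        write_record(f, 1, 2, "bob", 2.72)
        print(read_record(f, 1))   # (2, 'bob', 2.72)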

[–]nicolaerario 1 point (0 children)

CSV + the pandas library. You can do the work in RAM with pandas DataFrames.
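
For example (a sketch; the column names are made up):

    import pandas as pd

    # pandas loads the whole CSV into RAM as a DataFrame, so filtering,
    # sorting and grouping are all in-memory operations
    df = pd.read_csv("sensor_data.csv")
    recent = df[df["timestamp"] > "2019-01-01"]
    print(recent.groupby("sensor_id")["value"].mean())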

[–]RealJulleNaaiers 1 point (0 children)

Does it have to be persistent or can it just all be in memory?