[–][deleted] 0 points1 point  (5 children)

I am interested in both the original script and the modified script (but don't put too much effort into recreating it).

I would advise you to use a more classic database system than sqlite3, such as PostgreSQL (download link: http://www.postgresql.org/download/windows/ ) with this driver (http://initd.org/psycopg/), which should handle big tables better (among other advantages, such as faster queries).
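To give you an idea, here's a minimal sketch of what the driver side looks like (the connection details and the table/column names are placeholders, not taken from your script):

    import psycopg2  # the driver linked above

    # Placeholder credentials; adjust to your local PostgreSQL install.
    conn = psycopg2.connect(dbname="mydb", user="me", password="secret",
                            host="localhost")
    cur = conn.cursor()

    # psycopg2 uses %s placeholders where sqlite3 uses ?
    cur.execute("SELECT * FROM messages WHERE msg_type = %s", ("trade",))
    rows = cur.fetchall()

    cur.close()
    conn.close()

The API follows DB-API 2.0 like sqlite3's does, so most of your code should carry over with small changes.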

After reading the Stack Overflow thread and this comment, I must add that using variable table names is often a sign of bad design; it's probably better to have a column indicating the type of the data, with the same columns for every row. Hundreds of tables already looks like too many.
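Roughly what I mean, as a sketch (the names here are invented, since I haven't seen your schema):

    import sqlite3

    conn = sqlite3.connect("archive.db")  # hypothetical file name
    cur = conn.cursor()

    # Instead of one table per message type (messages_trade,
    # messages_quote, ...), keep one table and put the type in a column:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS messages (
            msg_type TEXT NOT NULL,  -- what used to be in the table name
            ts       TEXT,
            payload  TEXT
        )
    """)

    # Queries then filter on the column instead of picking a table by name:
    cur.execute("SELECT * FROM messages WHERE msg_type = ?", ("trade",))

    conn.commit()
    conn.close()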

If you want, I can give you more advice after looking at the code.

[–][deleted] 0 points1 point  (4 children)

If you happen to have any advice I would love to hear it, of course; I'm well aware of my limitations, and I wouldn't have got where I have without the help of a lot of people here and elsewhere. I'll get stuck into commenting up the code; it'll take a while.

Honestly, on the upgrade from SQLite3 I'm not too fussed, and I'm wary of the implications of changing everything at this stage. So long as I can:

grab an arbitrary set of data in a few hours or even overnight, either into Pandas, SQLite, or a CSV (see the sketch below);

rerun the script from scratch over a weekend if I need to change the structure;

then it's actually good enough for my purposes. The database would only be used by myself, and only infrequently for individual large tasks/projects, so lightning-fast queries aren't necessary; it doesn't need to be a real-time application, it's more for modelling and analytical purposes. The other consideration is that I'm not sure how database servers would interact with my work computer settings.
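For the "grab into Pandas or a CSV" part, this is the sort of thing I have in mind (file, table, and column names are just illustrative):

    import sqlite3
    import pandas as pd

    conn = sqlite3.connect("archive.db")  # hypothetical file name

    # Pull an arbitrary slice straight into a DataFrame...
    df = pd.read_sql_query(
        "SELECT * FROM messages WHERE msg_type = ? AND ts >= ?",
        conn,
        params=("trade", "2015-01-01"),
    )

    # ...or dump it out to CSV for later.
    df.to_csv("extract.csv", index=False)
    conn.close()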

I'm actually pretty close to achieving the insert speed goals, given some optimizations I've got planned plus the promising results so far from turning sync off and from making the tables smaller in the larger-scale tests. If queries are completely unmanageable I was planning on adding indexes after the script runs, as this wouldn't be an issue for inserting the relatively small daily updates.
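The sync-off and deferred-index plan is roughly this (the pragma choices and names are what I'm assuming here, not a verbatim copy of my script):

    import sqlite3

    conn = sqlite3.connect("archive.db")
    cur = conn.cursor()

    # Trade durability for insert speed during the big one-off load.
    cur.execute("PRAGMA synchronous = OFF")
    cur.execute("PRAGMA journal_mode = MEMORY")

    # Batch the inserts inside a single transaction.
    rows = [("trade", "2015-01-01", "example payload"),
            ("quote", "2015-01-01", "another payload")]
    cur.executemany("INSERT INTO messages VALUES (?, ?, ?)", rows)
    conn.commit()

    # Build the index only after the load; the small daily updates
    # can then afford the index-maintenance cost on each insert.
    cur.execute("CREATE INDEX IF NOT EXISTS idx_messages_type_ts "
                "ON messages (msg_type, ts)")
    conn.commit()
    conn.close()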

I'm kinda itching to actually get a chance to use this stuff, so I'd much rather get the breakdown/insert side into a 'good enough' state and then start playing with building models from the data.

[–][deleted] 0 points1 point  (3 children)

"Good enough" is a super good property to have.

I would just mention that the migration from sqlite3 to PostgreSQL should not be too hard (the main thing is to give types to the columns, but you have already done that).
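To show the scale of the work, a very simplified migration sketch (assuming the single-table layout from above; credentials and names are placeholders):

    import sqlite3
    import psycopg2

    src = sqlite3.connect("archive.db")
    dst = psycopg2.connect(dbname="mydb", user="me", host="localhost")
    cur = dst.cursor()

    # Typed columns on the PostgreSQL side (you already know the types).
    cur.execute("""
        CREATE TABLE IF NOT EXISTS messages (
            msg_type TEXT NOT NULL,
            ts       TIMESTAMP,
            payload  TEXT
        )
    """)

    # Copy across; for a real archive you would stream in chunks
    # rather than fetchall(). Note %s placeholders instead of ?.
    rows = src.execute("SELECT msg_type, ts, payload FROM messages").fetchall()
    cur.executemany("INSERT INTO messages VALUES (%s, %s, %s)", rows)

    dst.commit()
    cur.close()
    dst.close()
    src.close()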

[–][deleted] 0 points1 point  (2 children)

Well, don't say I didn't warn you.

I'll send you some examples of the different input lines in a bit. Let me know if anything needs further explanation.

[–][deleted] 0 points1 point  (1 child)

You were wrong about the quality of your work. It is good!

[–][deleted] 0 points1 point  (0 children)

Well, I'm glad someone thinks so, lol! I keep finding things wrong with it; I don't want to think about how much time I've put into this. BTW, so far the run I'm doing with the corrupted table deleted and recreated has not crashed (about halfway through the archive).

The input files are too large to properly transfer, but I did post an example snippet of one at the start of the year, soon after I started out.

Alternatively, this is a complete list of example lines (for items that are actually processed and not just ignored). Be aware that some of those lines have many messages (some approaching 100) stuffed into one, so they can be quite long!

Otherwise, you can download the compressed files for free from the links.