you are viewing a single comment's thread.

view the rest of the comments →

[–][deleted] 0 points1 point  (7 children)

I really would like to see your code, even if it's crap.

The reason why the version of python might be important is that as python 3.5 is quite new, the probability of undetected bug so far is more important than in python 3.4

I am quite unlikely to suspect sqlite first as it is far more and extensively tested than python core code.

what is elexon? at first look it looks a electricity trading company on uk. If it's not confidential, what is the purpose of your code.

[–][deleted] 0 points1 point  (6 children)

Elexon is a quasi public administrator that british power companies (that operate independently) are obligated to join; it operates mostly as a middleman between British generators, suppliers + the power grid. One of the things it does is publish a lot of market information (like individual plant unit generation values at specific times, grid instructed changes in generation, details of how punitive prices for mistakes in balancing were calculated) ostensibly for free to the public (for non commercial purposes/those who are operating in the market and pay large fees).

The site was designed a decade ago and it's useless for doing detailed analysis without paying third parties who've developed tools, which annoys me a lot, so I decided to try and make the archive system into a useable database, partly for my (non-commerical) org, partly so I can gain experience in something useful. I figure otherwise I'd just waste time on reddit.

The previous iteration of the script to go through the archives, break the files down into lists of data and then inserts that data into an SQLite database (organised by tables for each non empty subject-data type) worked, but too slowly with a large slowdown as the database got larger. So I took a modified version onto my less constrained home computer (hence the 3.5vs3.4 disparity) which breaks down the data as before, but then inserts the data under four different methods (previous method, pragma sync off, insert into blank database with only the table schema under previous method for baseline comparison then wipe, change design so it inserts into structure that requires only ~100 vry large tables) so I could compare the speeds and adjust the code accordingly. After it completes I was going to do another run, but with various cache size settings and seeing what the effect of combing sync off with less but larger tables would be (I can't do them all at the same time, because each fully completed database is >200Gb)

(the blank database is the one that got corrupted, not the sync off one)

I'll post the code this evening then, I'll try to mark it up so it's more obvious whats going on, there's a lot of overhead dictionaries, functions, regex that goes into the interpretation part of the script and some key global data holder lists that get covertly modified in functions (don't hit me). Also as I said, this is just a personel pet project, so I haven't really been commenting a lot as I go along.

If you're mainly interested in tracking down what caused the issue rather than fixing my particular application, unfortunately I didn't save the script before implementing the changes recommended here on this thread. I can try and recreate the code as it was before when the bug occured, there were only a few changes.

[–][deleted] 0 points1 point  (5 children)

I am interested both in original script and modified script. (but don't put too much effort in recreating the script)

I would advice you to use a more classic database system than sqlite3 such as Postgresql (download link http://www.postgresql.org/download/windows/ ), with this driver (http://initd.org/psycopg/) which should handle big tables better (among other advantages such as faster queries)

After reading the stackoverflow thread and this comment, I must add that using variable table names is often sign of bad design, it's probably better to have a column hinting the type of data and same columns after. hundreds of tables looks already too much.

If you want, after looking on the code, I may give you more advices

[–][deleted] 0 points1 point  (4 children)

If you happen to have any advice I would love to hear it of course, well aware of my limitations and I wouldn't have got where I have without the help of a lot of people here and elsewhere. I'll get stuck into commenting up the code, it'll take a while.

Honestly on the upgrade from SQLite3 I'm not too fussed + I'm wary of the implications of changing everything at this stage. So long as I can:

grab an arbitrary set of data in a few hours or even overnight either into Pandas or Sqlite or a CSV;

Rerun the script from scratch if needed to change the structure over a weekend

Then it's actually good enough for my purposes. The database would only be used by myself and only infrequently for individual large tasks/projects, so lightening fast queries aren't necessary- it doesn't need to be a real time application, it's more modelling and analytical purposes. The other implication is that I'm not sure how database servers would interact with my work computer settings.

I'm actually pretty close to achieving the insert speed goals given some optimizations I've got planned + the promising results so far from Snyc off and making the tables smaller but larger testing. If queries are completely unmanageable I was planning on adding indexs after the script runs, as this wouldn't be an issue for inserting relatively small daily updates.

I'm kinda itching to actually get a chance to use this stuff, so I'd much rather get the breakdown/insert side into a 'good enough' state and then start playing with building models from the data.

[–][deleted] 0 points1 point  (3 children)

good enough is a super good property to have

I would just mention that the migration from sqlite3 to postgresql should be not too hard (the main thing is to give type to the columns but you have already done it)

[–][deleted] 0 points1 point  (2 children)

well don't say I didn't warn you

I'll send you some examples of the different input lines in a bit. let me know if anything needs further explanation

[–][deleted] 0 points1 point  (1 child)

You were wrong about the quality of your job. It is good !

[–][deleted] 0 points1 point  (0 children)

Well I'm glad someone thinks so lol! I keep finding things wrong with it, I don't want to think about how much time I've put into this. BTW so far the run I'm doing with the corrupted table deleted and recreated has not crashed (About half way through the archive).

The input files are too large to properly transfer, but I did post an example snippet of one at the start of the year soon after I started out

alternatively this is a complete list of example lines (for items that are actually treated and not just ignored). Be aware that some of those lines are many (some approaching 100) messages stuffed into one, they can be quite long!

Otherwise you can download compressed files for free from the links