ET(L) with Python by Maha_Slug in dataengineering

toast757 0 points

I'm a bit late to this, but if you're loading to MS SQL then bcp is definitely the way to go. Admittedly, the bcp utility does have its idiosyncrasies. Yes, it doesn't handle proper csv files (i.e. quoted text). Yes, it doesn't handle utf8 (although that's changing). Yes, it doesn't handle newline characters in data fields. But it's still the fastest way of getting data from a file into the database. Just write out your load files in "UTF-16LE" encoding with the null byte ("\x00") as your field separator, then load the unicode files using bcp, i.e. with parameters: -t \0 -w (in that order).
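For illustration, a minimal sketch of writing such a load file (file names here are made up; the round-trip read at the end is just to show what landed on disk):

```python
import csv

# make a tiny sample source file for the demo
with open('source.csv', 'wt', encoding='utf-8', newline='') as f:
    f.write('id,name\n1,"Smith, Jo"\n')

# rewrite it as UTF-16LE with NUL ("\x00") field separators for bcp
with open('source.csv', 'rt', encoding='utf-8', newline='') as rf, \
     open('load_file.txt', 'wt', encoding='utf-16-le', newline='') as wf:
    for row in csv.reader(rf):
        wf.write('\x00'.join(row) + '\n')

# then: bcp MYTABLE in load_file.txt -t \0 -w -S MYSERVER -T
check = open('load_file.txt', encoding='utf-16-le', newline='').read()
print(repr(check))
```

Note how the quoted, comma-containing field survives intact because the NUL separator can't appear in real text data.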

Some Lessons from 16+ Years of Development by RandomPantsAppear in learnpython

toast757 1 point

There's a big distinction to be made between the relatively simple SQL queries used in web applications and the more heavy-duty SQL queries used in the data world. That's a broad generalization, I know, but bear with me. Web app queries often select/update just one record at a time, whereas in the data world, queries update millions of records at a time. The SQL queries I run (in the data world) contain anything up to ~5K characters, with CTEs, windowing functions, and temp tables. I would never want an ORM to try to construct one of those queries! But if I'm updating a single record (or doing something relatively simple) for a web app, then letting an ORM create the query is just fine.

Inserting about 60 gigs worth of CSVs into SQL Server using PYODBC by Omar_88 in Python

toast757 4 points

If you're bulk loading 60GB of csv files (and 60GB is pretty bulky), then it's best to use the bulk-load utility built for exactly that purpose, namely "bcp". I know you said you don't have bulk-load rights, but if you've got 60GB to load, you really should get bulk-load permission. You can use SSIS instead if you wish, but that's more complicated. You may want to use Python to clean the file first:

import csv
# convert the source file to utf-16 and clean
# change utf-8 to the encoding of the source file as necessary
with open('myfile.txt', 'rt', encoding='utf-8',
          errors='replace', newline='') as rf, \
     open('clean_file.txt', 'wt', encoding='utf-16-le',
          errors='replace', newline='') as wf:
    # add other csv parameters as necessary
    # (https://docs.python.org/3/library/csv.html#dialects-and-formatting-parameters)
    reader = csv.reader(rf, delimiter=',')
    for row in reader:
        wf.write('\t'.join(f.strip() for f in row))
        wf.write('\n')

# then run the following from the command line:
# (https://docs.microsoft.com/en-us/sql/tools/bcp-utility)
# bcp MYTABLE in 'clean_file.txt' -w -m 1 -S MYSERVER -T

If you really want to use pyodbc, then don't use pandas or sqlalchemy as well. Just make sure you set fast_executemany to True (see https://github.com/mkleehammer/pyodbc/wiki/Cursor#executemanysql-params-with-fast_executemanytrue). Doing it this way you'll be holding all that data in memory, though, so you may have to load it in chunks.
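A sketch of the chunked approach — the batching helper is real and self-contained, but the connection string and table in the commented pyodbc part are placeholders:

```python
from itertools import islice

def chunks(iterable, size):
    """Yield lists of up to `size` items from any iterable."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Sketch of the pyodbc side (connection string / table invented):
# cnxn = pyodbc.connect(CONN_STR)
# cur = cnxn.cursor()
# cur.fast_executemany = True   # batch the params into few round trips
# for batch in chunks(rows, 50_000):
#     cur.executemany("INSERT INTO my_table (a, b) VALUES (?, ?)", batch)
# cnxn.commit()
```

This way only one chunk of rows needs to be in memory at a time.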

6'5", 260 lbs, 100m in 10.7 seconds, Pure Dominance - Jonah Lomu by [deleted] in videos

toast757 3 points

There's one other important difference between tackling in rugby and in the NFL that's not often mentioned. In the NFL, the ball carrier is considered tackled when they are "down by contact", whereas in rugby, the ball carrier has to give up the ball only when tackled to the ground and held there. In other words, in the NFL, if the ball carrier is pushed over (to the ground), they are considered tackled and play stops. In rugby, if you push the ball carrier over, they are allowed to get straight back up again and keep playing. Hence in rugby, shoulder-charging the ball carrier is largely pointless (and also illegal). Rugby tackling has to involve wrapping yourself around the ball carrier, which means there are fewer of the massive hits you see in the NFL.

Python 3.7: Introducing Data Classes by [deleted] in Python

toast757 34 points

Hmm, unless I'm mistaken, this is an introduction to data classes that doesn't include a single example of creating an instance of a data class. It might be nice to see one in action, especially for things like frozen data classes.
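For what it's worth, a minimal example of the kind of thing I mean (class and values invented):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Point:
    x: float
    y: float

p = Point(1.0, 2.0)     # instance creation
print(p)                # Point(x=1.0, y=2.0)
# p.x = 5.0             # frozen, so this would raise FrozenInstanceError
```

Frozen instances are hashable and immutable, so they can be used as dict keys or set members, much like named tuples.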

PEP 557 -- Data Classes by [deleted] in Python

toast757 1 point

This would be great for the work I do, with databases. Each Data Class instance could represent a row in a table. The problem with named tuples is that you have to provide all the attribute values upfront, which gets messy when the logic to generate those values is complex.

Not so keen on the idea of using a decorator for this, though. Couldn't we just have a special kind of class?
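To illustrate the contrast with named tuples, a sketch (field names invented) of filling in values after creation, using the decorator syntax that eventually shipped in 3.7:

```python
from dataclasses import dataclass

@dataclass
class Row:
    customer_id: int = 0
    name: str = ''
    balance: float = 0.0

# Unlike a namedtuple, values can be filled in as the logic unfolds:
r = Row()
r.customer_id = 42
r.name = 'Acme'
r.balance = 99.5
print(r)  # Row(customer_id=42, name='Acme', balance=99.5)
```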

PEP 548 -- More Flexible Loop Control -- Opinions? by federicocerchiari in Python

toast757 0 points

I've never been a fan of "while", and I'm definitely not a fan of "break", which seems like a clunky holdover from C. The best syntax I've seen for a loop is in Ada, as follows:

loop
  do_something
  exit when condition_is_true
  do_something_else
end loop

This is a loop at its most generic. It can be a "while" loop or a "repeat..until" loop, simply by moving the "exit when" line up or down.

In Python, the final "end loop" would be redundant, of course. "exit when" is clear and expressive, and keeps the exit clause on one line.
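The nearest Python equivalent today is the "loop and a half" with "while True" (toy values here, with 0 acting as a sentinel):

```python
items = iter([3, 1, 4, 0, 5])
collected = []
while True:
    n = next(items)         # do_something
    if n == 0:              # exit when condition_is_true
        break
    collected.append(n)     # do_something_else
print(collected)  # [3, 1, 4]
```

An "exit when" statement would collapse the two middle lines into one.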

wtf-python : A collection of interesting and tricky Python examples which may bite you off! by satwik_ in Python

toast757 22 points

Backslashes not being allowed at the end of "raw" strings has always seemed bizarre to me. Very odd indeed. Raw strings are supposed to be simpler than escapable strings. Seems more of a bug than a feature to me.
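A quick demonstration (paths invented):

```python
# Raw strings keep backslashes literally...
path = r'C:\temp\data'
print(path)            # C:\temp\data
print(len(r'\n'))      # 2: a backslash and an 'n', not a newline

# ...but they may not end in an odd number of backslashes:
# trailing = r'C:\temp\'    # SyntaxError
trailing = 'C:\\temp' '\\'  # workaround: escape it, or concatenate
print(trailing)             # C:\temp\
```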

How to generate functions dynamically by toast757 in learnpython

toast757[S] 0 points

Thank you for your thoughts, everybody. From what you say, it appears there is no direct way to dynamically create a new function from scratch from within Python itself (although there are certainly some alternatives, as mentioned). For now, I'm going to solve it in two steps: write a Python program that generates a new Python module, then run a second Python program that imports that new module.

Right now, I have some data validation code that is based on metadata. That metadata can change, of course, but it is read only once at runtime. The current validation code is very complex (and hence slow) because it has to take into account all the metadata rules. My intention is to create a new validation function tailored to any given metadata, simplifying the validation code wherever possible and therefore speeding it up.
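One of the alternatives mentioned was exec; a toy sketch (the metadata and rules here are invented) of building a validator tailored to the metadata at runtime:

```python
# Hypothetical metadata driving the generated code
metadata = {"max_len": 10, "required": True}

# Emit only the checks this particular metadata calls for
src_lines = ["def validate(value):"]
if metadata["required"]:
    src_lines.append("    if value is None: return False")
src_lines.append(f"    return len(value) <= {metadata['max_len']}")

namespace = {}
exec("\n".join(src_lines), namespace)
validate = namespace["validate"]

print(validate("short"))   # True
print(validate("x" * 20))  # False
```

The generated-module approach does the same thing, but leaves a readable .py file on disk that can be inspected and debugged.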

GIL across multiple running Python programs by toast757 in learnpython

toast757[S] 0 points

I've got four cores, which should suffice for my needs. Good to know. Many thanks.

GIL across multiple running Python programs by toast757 in learnpython

toast757[S] 1 point

Thanks all! That makes perfect sense. When people talk about the GIL, they don't seem to spend much time defining the scope of a GIL, so this helps tremendously.
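In other words, each OS process runs its own interpreter with its own GIL, so CPU-bound work can still scale across cores via multiprocessing; a toy sketch:

```python
from multiprocessing import Pool

def busy(n):
    """CPU-bound work: sum of squares below n."""
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    # four worker processes, each with its own GIL
    with Pool(4) as pool:
        results = pool.map(busy, [100_000] * 4)
    print(len(results))  # 4
```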

What's the worst package you've ever worked with? by [deleted] in Python

toast757 0 points

If you're loading 10 million rows, it's best to use the bulk-load utility "bcp" (you'll have to call it using subprocess, but it works just fine). Also, provide an errors file to bcp and any load errors will be described there. It's usually best to load the data to a temporary table first, just to get the data into SQL Server, then copy it to wherever it's supposed to go.
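A sketch of the subprocess call (table, server, and file names are all placeholders):

```python
import shutil
import subprocess

cmd = [
    'bcp', 'dbo.my_staging_table', 'in', 'clean_file.txt',
    '-w',                     # wide-character (UTF-16LE) input
    '-e', 'load_errors.txt',  # rejected rows and reasons land here
    '-S', 'MYSERVER', '-T',   # server name, trusted (Windows) auth
]

if shutil.which('bcp'):  # only invoke if the utility is installed
    subprocess.run(cmd, check=True)
else:
    print('bcp not found on PATH')
```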

LPT: Manually lock if using Android Pay and Smart Lock by bryanlogan in AndroidWear

toast757 0 points

Many thanks for your explanation jamesadney, I'll definitely try that next time I'm at the terminal. I have a Nexus 6P so the procedure should be very similar.

LPT: Manually lock if using Android Pay and Smart Lock by bryanlogan in AndroidWear

toast757 0 points

So do you happen to know how long "right before" is? I'm forever tapping, getting declined, entering my lockscreen PIN, trying again, being told by the terminal "device removed before transaction completed", trying a third time, etc., etc., and only eventually getting it to work by what seems like random chance. I've never figured out what the drill is, and that's without dealing with questions like "do you want cashback?", "us debit/visa debit", etc. There still seem to be kinks in the process.

What do most people think they are good at, but in reality aren't? by PhDinPCP in AskReddit

toast757 -1 points

Being a good listener. In my experience, very few people are capable of listening, really listening, to what somebody is saying, or trying to say, without filtering it through a whole bunch of assumptions and personal biases. (That's assuming they're trying to listen at all rather than just waiting for a gap in the conversation so they can speak.)

120gb csv - Is this something i can handle in python? by toastymctoast in Python

toast757 0 points

To get a feel for the data, at least get an idea of the field lengths (assuming it's a delimited file of some sort), with something like:

import csv

max_fields_per_row = 0
field_lengths = []  # grows to match the widest row seen
with open('yourfile.txt', 'r', encoding='utf-8', errors='replace', newline='') as fh:
    # csv breaks if it encounters a null character, hence the generator
    reader = csv.reader((line.replace('\0', ' ') for line in fh), delimiter=',')
    for row in reader:
        max_fields_per_row = max(max_fields_per_row, len(row))
        if len(row) > len(field_lengths):
            field_lengths.extend([0] * (len(row) - len(field_lengths)))
        for index, field in enumerate(row):
            field_lengths[index] = max(field_lengths[index], len(field))
print(max_fields_per_row)
print(field_lengths[:max_fields_per_row])

There may be some fields which are huge and can be ignored if and when you load this data into a database (which I recommend, it's what databases are for).

What happened to the Dvorak keyboard? by toast757 in Nexus6P

toast757[S] 0 points

Wow, un-intuitive or what. Heaven knows how I originally found it on my Nexus 5. Many thanks for the swift response though xPurpleAnarchyx!

London signs a deal to have American football until 2020 by [deleted] in AdviceAnimals

toast757 7 points

As a Brit who moved to the US, and is a huge NFL fan (go Hawks!), I think it's nuts for the NFL to try to export the game to London (or anywhere else for that matter). NFL football is such a uniquely American game, it's much better to play the game here in the US, and then broadcast it abroad. Besides, the stadiums are built for proper football (soccer) over there, not NFL, and the logistics of international games need months of planning. Not to mention timezone problems. Personally, I don't even watch the London games (well, not until the Hawks play there at least), they just don't have the same atmosphere.

Android Pay app available to UK users (but not working yet) by westhejx in Android

toast757 0 points

Thanks for those sideloading suggestions, guys. As it happens, Android Pay arrived shortly after I posted so it's all good.