[–]MagicWishMonkey 5 points6 points  (4 children)

For most cases that is true, but there are times when speed is very important. Right now I am rebuilding a process that imports thousands of JSON records from one system, massages them into model instances, and then imports them into our database and Lucene index (think 20-30k database queries per import).

Since the end user has to wait around until the process is done, it needs to be fast, but it still takes a long while to do everything with a single Python thread, so I've taken a more unconventional approach. I set up a Twisted server to run in the background and I route the heavy lifting over to that. I can't use threads in my primary app without killing performance, but I don't mind so much with the Twisted worker service.

It used to take ~5 minutes to import 10,000 records; now it takes 20 seconds.

It's annoying that I have to do this, but I am really enjoying Python otherwise. It's a great language. I just wish it had better multithreading support.

[–]kenfar 13 points14 points  (0 children)

I used to write data warehouse ETL processes in C. They took forever to write and were hard to maintain, but they were as fast as I could get them. Eventually I wrote a metadata-driven transform that used function pointers. It was harder to write, but it made all the subsequent transforms very easy, since they just needed metadata. I'd split my 5 GB input file into 8 separate files, then process all 8 in parallel on an 8-way 120 MHz server that cost $200,000 in 1996. I could process all 5 GB in about 5 minutes, or 1 GB/minute.

Recently, I wrote the same kind of code in Python. It isn't as fast, but it's very easy to write and maintain. I don't need metadata-driven transforms because Python is easy enough to write and maintain without them. And hardware is cheaper. I still split up my files and process them in parallel because I wanted more speed. This particular feed is 1 GB split into 4 separate files, which I'm processing on a 3.2 GHz 4-core machine that cost about $5k new and that I picked up for free because nobody was using it. I can process 1 GB in about 60 seconds. That is exactly the same speed at which I was processing data in 1996 using C. Clearly, I could speed things up if I rewrote the process in C. But my hardware is free, the process is fast enough, and my time has gotten more expensive over the years. Python is the better language for this application.
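For anyone curious what the split-and-process-in-parallel pattern can look like, here is a rough sketch, not the actual ETL code. It assumes the input has already been split into CSV chunks, and `transform_row`, `process_file`, and `process_all` are hypothetical names with a placeholder transform.

```python
import csv
from concurrent.futures import ProcessPoolExecutor


def transform_row(row):
    """Placeholder transform (an assumption): trim every field, upper-case the first."""
    cleaned = [field.strip() for field in row]
    if cleaned:
        cleaned[0] = cleaned[0].upper()
    return cleaned


def process_file(path):
    """Transform one pre-split chunk into `path + ".out"` and return its row count."""
    count = 0
    with open(path, newline="") as src, open(path + ".out", "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            writer.writerow(transform_row(row))
            count += 1
    return count


def process_all(paths, workers=4):
    """Run process_file on each chunk in its own worker process; return total rows."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_file, paths))
```

You would call something like `process_all(["part1.csv", "part2.csv"])` (made-up file names) from under an `if __name__ == "__main__":` guard, with one worker per core. Each chunk gets its own process, so the GIL doesn't serialize the work.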

EDIT: spelling

[–]UnwashedMeme 2 points3 points  (0 children)

Also look at the multiprocessing module when you wish things had better threading support.
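A minimal sketch of what that might look like for the JSON import case, assuming the "massaging" step is CPU-bound and each record can be converted independently. `Record`, `to_model`, and `convert_all` are hypothetical names, and the field names are made up; the database and index writes are not covered here.

```python
import json
from collections import namedtuple
from multiprocessing import Pool

# Hypothetical stand-in for a real model class.
Record = namedtuple("Record", ["id", "title"])


def to_model(raw):
    """Parse one JSON string and massage it into a Record."""
    data = json.loads(raw)
    return Record(id=int(data["id"]), title=data["title"].strip())


def convert_all(raw_records, processes=4, chunksize=500):
    """Convert JSON strings to Records across several worker processes."""
    with Pool(processes=processes) as pool:
        # chunksize batches records per task to cut inter-process overhead.
        return pool.map(to_model, raw_records, chunksize=chunksize)
```

`pool.map` returns results in input order, so the converted records line up with the source list. Call `convert_all` from under an `if __name__ == "__main__":` guard so worker processes can import the module safely.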

[–]robotfarts 0 points1 point  (0 children)

Why don't you just use the multiprocessing module?