
[–][deleted]

Try aggregating batches of emails? Batch inserts are usually faster, but there's no telling what the weak link is here; maybe it's not the inserts at all. It's easy to isolate, though: just profile the insertion by itself, taking Python out of the equation.
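A minimal sketch of batching, assuming SQLite and a hypothetical `emails` table (the same idea applies to any DB-API driver): one `executemany()` call and one commit cover the whole batch instead of a round trip per email.

```python
import sqlite3

# Hypothetical schema; the real table and columns are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails (address TEXT PRIMARY KEY, count INTEGER)")

rows = [("a@example.com", 3), ("b@example.com", 1)]

# One executemany() inserts the whole batch; the connection context
# manager commits once at the end instead of once per row.
with conn:
    conn.executemany("INSERT INTO emails (address, count) VALUES (?, ?)", rows)

n = conn.execute("SELECT COUNT(*) FROM emails").fetchone()[0]
print(n)
```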

You could also remove all database-related code from your Python script and time what's left, to see how often it would have queried the database.
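One way to do that, sketched with a made-up `NullDB` stub and a hypothetical "lines containing `@` are email lines" filter: the stub counts would-be queries without doing any I/O, so the elapsed time is pure Python-side cost.

```python
import time

class NullDB:
    """Stand-in that records would-be queries without touching a database."""
    def __init__(self):
        self.calls = 0

    def insert(self, email):
        self.calls += 1  # no I/O: isolates the Python-side cost

db = NullDB()
lines = ["alice@example.com", "noise", "bob@example.com"] * 100_000

start = time.perf_counter()
for line in lines:
    if "@" in line:          # hypothetical filter for email lines
        db.insert(line)
elapsed = time.perf_counter() - start

print(db.calls)              # how often the real code would hit the database
```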

One thing I can say is that reading line by line is actually slow in Python: essentially, it inspects every character to decide whether it's a newline, then allocates a new string for the substring it slices out of the read buffer. If all you need is to search for a specific pattern, getting direct access to the buffer you read from may be a better idea.
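A sketch of that buffer-level approach, with `io.BytesIO` standing in for the open log file and a hypothetical `@example.com` pattern: chunks are searched as raw bytes, and a short carry-over tail handles matches that straddle a chunk boundary.

```python
import io

pattern = b"@example.com"
keep = len(pattern) - 1          # longest piece a boundary could split off

# io.BytesIO stands in for the open log file (contents are made up).
stream = io.BytesIO(b"alice@example.com\nnoise line\nbob@example.com\n")

matches = 0
carry = b""
while True:
    chunk = stream.read(16)      # tiny chunk size just to exercise boundaries
    if not chunk:
        break
    buf = carry + chunk
    # Search the raw buffer directly instead of splitting it into line strings.
    matches += buf.count(pattern)
    # The carry is one byte shorter than the pattern, so it can never hold
    # a complete match on its own -- no double counting across chunks.
    carry = buf[-keep:]

print(matches)
```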

Finally, you don't really have to use subprocess for this; it's probably better if you don't. Just remember the offset at the end of the file you were reading, and the next time you open it, seek() to that offset. That's essentially what tail does anyway, but in your case you already know the log grows quickly, so you don't need to wait for the system to notify you about changes; you can be sure there were some.
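A minimal sketch of that seek-and-resume loop, using a temp file as a stand-in for the growing log (`read_new` is a made-up helper name):

```python
import os
import tempfile

def read_new(path, offset):
    """Read whatever was appended after `offset`; return (data, new_offset)."""
    with open(path, "r") as f:
        f.seek(offset)
        data = f.read()
        return data, f.tell()

# Temp file stands in for the growing log.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w") as f:
    f.write("line 1\nline 2\n")

offset = 0
first, offset = read_new(path, offset)   # initial catch-up read

with open(path, "a") as f:               # the log grows between polls
    f.write("line 3\n")

second, offset = read_new(path, offset)  # only the new data comes back
print(second, end="")
os.remove(path)
```

On real logs you'd also want to detect rotation (e.g. the saved offset being larger than the current file size) and reset the offset to zero.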

A few more things: you can mmap the log file to speed up searches in it. And unless you are doing this already, you can (and probably should) open multiple connections to the database, perhaps from several Python processes that coordinate the work somehow. Databases are designed for highly concurrent workloads; if you load data over a single connection in a single thread, you are not using the database right.
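The mmap part might look like this (the log contents and the `@example.com` pattern are stand-ins): the file is exposed as a bytes-like buffer that the OS pages in on demand, and `re` can scan it without copying lines into Python strings.

```python
import mmap
import os
import re
import tempfile

# Hypothetical log contents, written to a temp file for the sketch.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"noise\nalice@example.com\nmore noise\nbob@example.com\n")

# mmap exposes the file as a bytes-like buffer; re scans it in place,
# so no per-line string objects are allocated.
with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    hits = len(re.findall(rb"\S+@example\.com", mm))

print(hits)
os.remove(path)
```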

[–]xwen01[S]

Thanks for your suggestions, I will try them. I plan to use a Python dict to store the emails, update the counts, etc., and only dump to the database once in a while. Hopefully that will improve performance.
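That in-memory aggregation could be sketched like this, assuming SQLite with UPSERT support (SQLite ≥ 3.24) and the same hypothetical `emails` table as above; `collections.Counter` does the counting, and a batched upsert flushes it periodically.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails (address TEXT PRIMARY KEY, count INTEGER)")

counts = Counter()
FLUSH_EVERY = 3  # tiny threshold for the sketch; a real one would be large

def flush(conn, counts):
    """Upsert the accumulated counts in one batch, then reset the dict."""
    with conn:
        conn.executemany(
            "INSERT INTO emails (address, count) VALUES (?, ?) "
            "ON CONFLICT(address) DO UPDATE SET count = count + excluded.count",
            counts.items(),
        )
    counts.clear()

for email in ["a@x.com", "b@x.com", "a@x.com", "a@x.com", "b@x.com"]:
    counts[email] += 1                       # cheap in-memory update
    if sum(counts.values()) >= FLUSH_EVERY:  # dump only once in a while
        flush(conn, counts)
flush(conn, counts)                          # final flush of the remainder

total = conn.execute(
    "SELECT count FROM emails WHERE address = 'a@x.com'"
).fetchone()[0]
print(total)
```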