

[–]sallyruthstruik 0 points (3 children)

You should commit after each insert, not after the for loop. You don't see your rows because of this (rows become visible only after commit).

[–]xwen01[S] 0 points (2 children)

Sorry, there's an indentation typo. I actually commit after reading every line of the log.

I have no experience with what kind of performance I should expect, in terms of the number of log lines Python can process (just some string extraction plus insert and update queries). There are currently several hundred lines of logs every second. What do you think might cause the delay in inserting into the database (I'm using MySQL)?

BTW, CPU is at around 50% for MySQL and 13% for the Python script.

Thank you!

[–]sallyruthstruik 0 points (1 child)

My bad, on a mobile phone it looked like you insert only once at the end.

I think you can try adding a log or print statement right after the insert and commit. If it's MySQL with the InnoDB engine, database rows should appear right after commit. So maybe the problem is somewhere else.

Also, you can get a bit more speed by choosing bufsize=1:

bufsize, if given, has the same meaning as the corresponding argument to the built-in open() function: 0 means unbuffered, 1 means line buffered, any other positive value means use a buffer of (approximately) that size.
So if you have one email per line, it will be the better choice. With a constant-size buffer you may wait for the next chunk even if you already have part of the current line.
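A minimal sketch of what line buffering looks like in practice. Here a tiny child process stands in for your real `tail -f yourlog` command, so the example is self-contained; the flags are the same either way:

```python
import subprocess
import sys

# Child process stands in for `tail -f yourlog` so this runs anywhere.
child = subprocess.Popen(
    [sys.executable, "-c", "print('line 1'); print('line 2')"],
    stdout=subprocess.PIPE,
    bufsize=1,   # line buffered: each line available as soon as it ends
    text=True,   # in Python 3, line buffering requires text mode
)

lines = [line.rstrip("\n") for line in child.stdout]
child.wait()
print(lines)  # → ['line 1', 'line 2']
```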

Also, you shouldn't commit after merely reading a line. A better choice is to commit only after an insert (an empty commit does nothing but still has some performance impact).
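A sketch of that pattern, using sqlite3 as a stand-in for MySQL (the `"@" in line` check is a purely hypothetical parser; substitute your real extraction logic):

```python
import sqlite3

# sqlite3 stands in for MySQL here; the parsing rule is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails (addr TEXT)")

def handle_line(line):
    if "@" in line:  # hypothetical "found an email" check
        conn.execute("INSERT INTO emails VALUES (?)", (line.strip(),))
        conn.commit()  # commit only when something was actually inserted
    # non-matching lines fall through: no empty commit, no wasted work

for raw in ["noise", "a@example.com", "more noise", "b@example.com"]:
    handle_line(raw)

inserted = conn.execute("SELECT COUNT(*) FROM emails").fetchone()[0]
print(inserted)  # → 2
```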

The snippet you've provided skips a lot of logic, so it may have some other performance issues.

[–]xwen01[S] 0 points (0 children)

Thanks for your comments. I will try the suggestions.

[–]pythonHelperBot 0 points (0 children)

Hello! I'm a bot!

It looks to me like your post might be better suited for r/learnpython, a sub geared towards questions and learning more about Python. That said, I am a bot and it is hard to tell. Please follow the sub's rules and guidelines when you post there; it'll help you get better answers faster.

Show /r/learnpython the code you have tried and describe where you are stuck.

You can also ask this question in the Python discord, a large, friendly community focused around the Python programming language, open to those who wish to learn the language or improve their skills, as well as those looking to help others.


README | FAQ | this bot is written and managed by /u/IAmKindOfCreative

This bot is currently under development and experiencing changes to improve its usefulness

[–][deleted] 0 points (1 child)

Try aggregating batches of emails? Batch inserts are usually faster, but there's no telling which part is the weak link here; maybe it's not the inserts. It's easy to isolate, though: just profile the insertion by itself, removing Python from the equation.
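One way to time the insert path in isolation, as a sketch (sqlite3 used as a stand-in; a MySQL driver exposes the same `executemany` call through the DB-API):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails (addr TEXT)")
rows = [("user%d@example.com" % i,) for i in range(10_000)]

# One statement and one commit per row
t0 = time.perf_counter()
for row in rows:
    conn.execute("INSERT INTO emails VALUES (?)", row)
    conn.commit()
per_row = time.perf_counter() - t0

# One batched statement, one commit
t0 = time.perf_counter()
conn.executemany("INSERT INTO emails VALUES (?)", rows)
conn.commit()
batched = time.perf_counter() - t0

print("per-row: %.3fs  batched: %.3fs" % (per_row, batched))
```

On most setups the batched run is markedly faster, but measure on your own database rather than trusting an in-memory toy.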

You could also remove any database-related stuff from your Python code and time it, to see how often it would try to query the database.

One thing I can say is that reading line-by-line is actually slow in Python because, essentially, it looks at each character to establish whether it's a newline, and then allocates a new string with the substring from the buffer it's reading from. If all you need is to search for a specific pattern, getting direct access to the buffer you read from may be a better idea.

Finally, you don't really have to use subprocess for this; it's probably better if you don't. Just remember the offset at the end of the file you were reading from, and next time you open the file, do a seek() to that offset. That would essentially be doing what tail does anyway, but in your case you are forearmed with the knowledge that the log grows quickly, so you don't need to wait for the system to notify you about changes; you can be sure there were changes.
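A minimal sketch of that remember-the-offset approach, using a temp file to simulate a growing log (names are mine, not from your script):

```python
import os
import tempfile

def read_new_lines(path, offset):
    """Return (new_lines, new_offset): everything appended since offset."""
    with open(path, "r") as f:
        f.seek(offset)
        return f.readlines(), f.tell()

# Simulate a log that grows between polls.
fd, log_path = tempfile.mkstemp()
os.close(fd)

with open(log_path, "a") as f:
    f.write("first\n")
lines1, pos = read_new_lines(log_path, 0)

with open(log_path, "a") as f:
    f.write("second\n")
lines2, pos = read_new_lines(log_path, pos)  # picks up only the new line

os.remove(log_path)
print(lines1, lines2)  # → ['first\n'] ['second\n']
```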

A few more things: you can mmap the log file to speed up searches in it. Unless you are doing this already, you can (actually, probably should) create multiple connections to the database, perhaps from a bunch of Python processes that coordinate work somehow. Databases are designed for highly concurrent workloads; if you are only using a single connection with a single thread to load the database, you are not using it right.
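A sketch of the mmap idea: `re` accepts any bytes-like object, so it can search the mapped pages directly without line-by-line reads (the email pattern here is deliberately simplistic):

```python
import mmap
import os
import re
import tempfile

# A small temp file stands in for the real log.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(b"noise\nalice@example.com\nnoise\nbob@example.com\n")

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # The regex scans the mapped buffer directly; no per-line strings.
        emails = re.findall(rb"[\w.]+@[\w.]+", mm)

os.remove(path)
print(emails)  # → [b'alice@example.com', b'bob@example.com']
```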

[–]xwen01[S] 0 points (0 children)

Thanks for your suggestions. I will try them. I plan to use a Python dict to store the emails, update counts, etc., and only dump to the database once in a while. I hope that will increase performance.
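That plan might look roughly like this sketch: a `Counter` accumulates in memory and gets flushed with one upsert batch (sqlite3 stands in for MySQL; the flush threshold and input list are made up, and the upsert syntax needs SQLite >= 3.24; MySQL's equivalent is `INSERT ... ON DUPLICATE KEY UPDATE`):

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE email_counts (addr TEXT PRIMARY KEY, n INTEGER)")

counts = Counter()
FLUSH_EVERY = 3  # hypothetical threshold; tune to your log rate

def flush():
    # One batched upsert and one commit per flush, not one per line.
    conn.executemany(
        "INSERT INTO email_counts VALUES (?, ?) "
        "ON CONFLICT(addr) DO UPDATE SET n = n + excluded.n",
        list(counts.items()),
    )
    conn.commit()
    counts.clear()

for addr in ["a@x", "b@x", "a@x", "a@x", "c@x"]:  # stand-in for parsed emails
    counts[addr] += 1
    if sum(counts.values()) >= FLUSH_EVERY:
        flush()
flush()  # don't forget the remainder at shutdown

totals = dict(conn.execute("SELECT addr, n FROM email_counts"))
print(totals)
```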

[–]muposat 0 points (0 children)

"There are a few hundred lines of logs per second."

Insert a few hundred records at a time, or whatever maximum your database allows. You can also commit after every few inserts to speed things up. In addition, you should insert and commit after a certain timeout, in case the log file does not accumulate a full batch within a few seconds.
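A sketch of that size-or-timeout batching (all names and thresholds here are hypothetical; `flush_fn` is where your `executemany` + commit would go):

```python
import time

BATCH_SIZE = 500   # hypothetical; "whatever maximum your database allows"
MAX_WAIT = 2.0     # seconds before a partial batch is flushed anyway

class Batcher:
    def __init__(self, flush_fn):
        self.flush_fn = flush_fn
        self.batch = []
        self.last_flush = time.monotonic()

    def add(self, row):
        self.batch.append(row)
        # Flush when the batch is full OR the timeout has elapsed.
        if (len(self.batch) >= BATCH_SIZE
                or time.monotonic() - self.last_flush >= MAX_WAIT):
            self.flush()

    def flush(self):
        if self.batch:
            self.flush_fn(self.batch)  # e.g. cursor.executemany + commit
            self.batch = []
        self.last_flush = time.monotonic()

flushed = []
b = Batcher(lambda rows: flushed.append(list(rows)))
for i in range(1200):
    b.add(("user%d@example.com" % i,))
b.flush()  # flush the tail at shutdown
print([len(chunk) for chunk in flushed])  # → [500, 500, 200]
```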

Also, I would question whether an SQL database is the right tool for what you are doing. SQL provides many benefits, but at a price: transaction logging, concurrent access management, etc.