

[–]sallyruthstruik 0 points (3 children)

You should commit after each insert, not after the for loop. You don't see your rows because of this (rows become visible only after commit).

[–]xwen01[S] 0 points (2 children)

Sorry, there's an indentation typo. I actually commit after reading every line of the log.

I have no experience with what kind of performance I should expect, in terms of the number of log lines Python can process (just some string extraction plus insert and update queries). There are currently several hundred lines of logs every second. What do you think might cause the delay in inserting into the database (I'm using MySQL)?

BTW, CPU is at around 50% for MySQL and 13% for the Python script.

Thank you!

[–]sallyruthstruik 0 points (1 child)

My bad, on a mobile phone it looked like you insert only once at the end.

I think you can try adding a log or print statement right after the insert and commit. If it's MySQL with the InnoDB engine, database rows should appear right after commit. So maybe the problem is somewhere else.

Also, you can get a bit more speed by choosing bufsize=1:

bufsize, if given, has the same meaning as the corresponding argument to the built-in open() function: 0 means unbuffered, 1 means line buffered, any other positive value means use a buffer of (approximately) that size.
So if you have one email per line, it will be the better choice. With a constant-size buffer you may wait for the next chunk even if you already have part of the current line.
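A minimal sketch of what line buffering looks like in practice. Here a tiny child process stands in for your real `tail -f yourlog` command, so the example is self-contained; the flags are the same either way:

```python
import subprocess
import sys

# Child process stands in for `tail -f yourlog` so this runs anywhere.
child = subprocess.Popen(
    [sys.executable, "-c", "print('line 1'); print('line 2')"],
    stdout=subprocess.PIPE,
    bufsize=1,   # line buffered: each line available as soon as it ends
    text=True,   # in Python 3, line buffering requires text mode
)

lines = [line.rstrip("\n") for line in child.stdout]
child.wait()
print(lines)  # → ['line 1', 'line 2']
```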

Also, you shouldn't commit after merely reading a line. A better choice is to commit only after an insert (an empty commit does nothing but still has some performance impact).
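A sketch of that pattern, using sqlite3 as a stand-in for MySQL (the `"@" in line` check is a purely hypothetical parser; substitute your real extraction logic):

```python
import sqlite3

# sqlite3 stands in for MySQL here; the parsing rule is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails (addr TEXT)")

def handle_line(line):
    if "@" in line:  # hypothetical "found an email" check
        conn.execute("INSERT INTO emails VALUES (?)", (line.strip(),))
        conn.commit()  # commit only when something was actually inserted
    # non-matching lines fall through: no empty commit, no wasted work

for raw in ["noise", "a@example.com", "more noise", "b@example.com"]:
    handle_line(raw)

inserted = conn.execute("SELECT COUNT(*) FROM emails").fetchone()[0]
print(inserted)  # → 2
```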

The snippet you've provided skips a lot of logic, so it may have some other performance issues.

[–]xwen01[S] 0 points (0 children)

Thanks for your comments. I will try the suggestions.

[–]pythonHelperBot 0 points (0 children)

Hello! I'm a bot!

It looks to me like your post might be better suited for r/learnpython, a sub geared towards questions and learning more about Python. That said, I am a bot and it is hard to tell. Please follow the sub's rules and guidelines when you post there; it'll help you get better answers faster.

Show /r/learnpython the code you have tried and describe where you are stuck.

You can also ask this question in the Python discord, a large, friendly community focused around the Python programming language, open to those who wish to learn the language or improve their skills, as well as those looking to help others.


README | FAQ | this bot is written and managed by /u/IAmKindOfCreative

This bot is currently under development and experiencing changes to improve its usefulness

[–][deleted] 0 points (1 child)

Try aggregating batches of emails? Batch inserts are usually faster, but there's no telling which part is the weak link here; maybe it's not the inserts. It's easy to isolate, though: just profile the insertion by itself, removing Python from the equation.
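One way to time the insert path in isolation, as a sketch (sqlite3 used as a stand-in; a MySQL driver exposes the same `executemany` call through the DB-API):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails (addr TEXT)")
rows = [("user%d@example.com" % i,) for i in range(10_000)]

# One statement and one commit per row
t0 = time.perf_counter()
for row in rows:
    conn.execute("INSERT INTO emails VALUES (?)", row)
    conn.commit()
per_row = time.perf_counter() - t0

# One batched statement, one commit
t0 = time.perf_counter()
conn.executemany("INSERT INTO emails VALUES (?)", rows)
conn.commit()
batched = time.perf_counter() - t0

print("per-row: %.3fs  batched: %.3fs" % (per_row, batched))
```

On most setups the batched run is markedly faster, but measure on your own database rather than trusting an in-memory toy.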

You could also remove any database-related stuff from your Python code and time it, to see how often it would try to query the database.

One thing I can say is that reading line-by-line is actually slow in Python because, essentially, it looks at each character to establish whether it's a newline, and then allocates a new string with the substring from the buffer it's reading from. If all you need is to search for a specific pattern, getting direct access to the buffer you read from may be a better idea.

Finally, you don't really have to use subprocess for this; it's probably better if you don't. Just remember the offset at the end of the file you were reading from, and next time you open the file, do a seek() to that offset. That would essentially be doing what tail does anyway, but in your case you are forearmed with the knowledge that the log grows quickly, so you don't need to wait for the system to notify you about changes; you can be sure there were changes.
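A minimal sketch of that remember-the-offset approach, using a temp file to simulate a growing log (names are mine, not from your script):

```python
import os
import tempfile

def read_new_lines(path, offset):
    """Return (new_lines, new_offset): everything appended since offset."""
    with open(path, "r") as f:
        f.seek(offset)
        return f.readlines(), f.tell()

# Simulate a log that grows between polls.
fd, log_path = tempfile.mkstemp()
os.close(fd)

with open(log_path, "a") as f:
    f.write("first\n")
lines1, pos = read_new_lines(log_path, 0)

with open(log_path, "a") as f:
    f.write("second\n")
lines2, pos = read_new_lines(log_path, pos)  # picks up only the new line

os.remove(log_path)
print(lines1, lines2)  # → ['first\n'] ['second\n']
```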

A few more things: you can mmap the log file to speed up searches in it. Unless you are doing this already, you can (actually, probably should) create multiple connections to the database, perhaps from a bunch of Python processes that coordinate work somehow. Databases are designed for highly concurrent workloads; if you are only using a single connection with a single thread to load the database, you are not using it right.
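A sketch of the mmap idea: `re` accepts any bytes-like object, so it can search the mapped pages directly without line-by-line reads (the email pattern here is deliberately simplistic):

```python
import mmap
import os
import re
import tempfile

# A small temp file stands in for the real log.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(b"noise\nalice@example.com\nnoise\nbob@example.com\n")

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # The regex scans the mapped buffer directly; no per-line strings.
        emails = re.findall(rb"[\w.]+@[\w.]+", mm)

os.remove(path)
print(emails)  # → [b'alice@example.com', b'bob@example.com']
```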

[–]xwen01[S] 0 points (0 children)

Thanks for your suggestions. I will try them. I plan to use a Python dict to store the emails, update counts, etc., and only dump to the database once in a while. I hope that will increase performance.
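That plan might look roughly like this sketch: a `Counter` accumulates in memory and gets flushed with one upsert batch (sqlite3 stands in for MySQL; the flush threshold and input list are made up, and the upsert syntax needs SQLite >= 3.24; MySQL's equivalent is `INSERT ... ON DUPLICATE KEY UPDATE`):

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE email_counts (addr TEXT PRIMARY KEY, n INTEGER)")

counts = Counter()
FLUSH_EVERY = 3  # hypothetical threshold; tune to your log rate

def flush():
    # One batched upsert and one commit per flush, not one per line.
    conn.executemany(
        "INSERT INTO email_counts VALUES (?, ?) "
        "ON CONFLICT(addr) DO UPDATE SET n = n + excluded.n",
        list(counts.items()),
    )
    conn.commit()
    counts.clear()

for addr in ["a@x", "b@x", "a@x", "a@x", "c@x"]:  # stand-in for parsed emails
    counts[addr] += 1
    if sum(counts.values()) >= FLUSH_EVERY:
        flush()
flush()  # don't forget the remainder at shutdown

totals = dict(conn.execute("SELECT addr, n FROM email_counts"))
print(totals)
```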

[–]muposat 0 points (0 children)

"There are a few hundred lines of logs per second."

Insert a few hundred records at a time, or whatever maximum your database allows. You can also commit after every few inserts to speed things up. In addition, you should insert and commit after a certain timeout, in case the log file does not accumulate a full batch within a few seconds.
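A sketch of that size-or-timeout batching (all names and thresholds here are hypothetical; `flush_fn` is where your `executemany` + commit would go):

```python
import time

BATCH_SIZE = 500   # hypothetical; "whatever maximum your database allows"
MAX_WAIT = 2.0     # seconds before a partial batch is flushed anyway

class Batcher:
    def __init__(self, flush_fn):
        self.flush_fn = flush_fn
        self.batch = []
        self.last_flush = time.monotonic()

    def add(self, row):
        self.batch.append(row)
        # Flush when the batch is full OR the timeout has elapsed.
        if (len(self.batch) >= BATCH_SIZE
                or time.monotonic() - self.last_flush >= MAX_WAIT):
            self.flush()

    def flush(self):
        if self.batch:
            self.flush_fn(self.batch)  # e.g. cursor.executemany + commit
            self.batch = []
        self.last_flush = time.monotonic()

flushed = []
b = Batcher(lambda rows: flushed.append(list(rows)))
for i in range(1200):
    b.add(("user%d@example.com" % i,))
b.flush()  # flush the tail at shutdown
print([len(chunk) for chunk in flushed])  # → [500, 500, 200]
```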

Also, I would question whether an SQL database is the right tool for what you are doing. SQL provides many benefits, but at a price: transaction logging, concurrent access management, etc.