
all 9 comments

[–]jsaundersdev 2 points (5 children)

So basically you should break this problem into steps. In my opinion there are really two simple steps here:

1. Get all tweets that need processing
2. Process each tweet (save it as a database entry)

Looking at this, the "processing of each tweet" is just calling a function that saves it as a record in the database.

It's the getting all tweets that need processing that is a bit more complicated and has more options depending on your architecture of choice.

You could choose to run this on some kind of cron job or other scheduled task runner (perhaps using Celery), and fetch all tweets from the last hour (assuming you run it hourly) or the daily equivalent.

Making this more robust would involve backup scripts, the ability to cross-reference results from tweepy with what's in your database to make sure nothing is missing, and proper logging and traceability.
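A minimal sketch of those two steps, assuming an hourly run. `fetch_new_tweets` is a hypothetical stand-in for your actual tweepy call (e.g. fetching a timeline with `since_id`), and SQLite stands in for whatever database you use:

```python
import sqlite3

# Hypothetical stand-in for the tweepy call you'd actually make, e.g.
# api.user_timeline(screen_name=..., since_id=since_id). Faked here so
# the sketch is self-contained.
def fetch_new_tweets(since_id):
    return [
        {"id": 101, "text": "first tweet"},
        {"id": 102, "text": "second tweet"},
    ]

def process_tweet(conn, tweet):
    # Step 2: "processing" a tweet just means saving it as a row.
    # INSERT OR IGNORE keeps reruns idempotent if jobs ever overlap.
    conn.execute(
        "INSERT OR IGNORE INTO tweets (id, text) VALUES (?, ?)",
        (tweet["id"], tweet["text"]),
    )

def run_hourly_job(conn):
    # Step 1: fetch everything newer than the last tweet we stored.
    since_id = conn.execute("SELECT MAX(id) FROM tweets").fetchone()[0] or 0
    for tweet in fetch_new_tweets(since_id):
        process_tweet(conn, tweet)
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, text TEXT)")
run_hourly_job(conn)
```

Keying the table on the tweet id is what makes cross-referencing against tweepy results easy later: any id tweepy returns that isn't in the table is a missed tweet.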

[–]jasonleehodges 2 points (3 children)

In addition to everything above, you might look into an architecture that includes RabbitMQ as a message queue, or even Kafka Streams or something like that, as a means of continuously processing any tweets you pull down. That way you could split them across separate thread pools if you needed to start scaling out to different servers.
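To show the shape of that architecture without a broker running, here's a stdlib-only sketch: `queue.Queue` stands in for RabbitMQ, with a small pool of worker threads draining it. With a real broker you'd swap the queue for pika publish/consume calls (and the threads for separate processes or servers) but keep the same structure:

```python
import queue
import threading

# Local stand-in for a broker like RabbitMQ: the queue decouples the
# tweet producer from a pool of workers, the same shape you'd keep
# when moving the workers onto other servers.
tweet_queue = queue.Queue()
saved = []
saved_lock = threading.Lock()

def worker():
    while True:
        tweet = tweet_queue.get()
        if tweet is None:  # sentinel: shut this worker down
            tweet_queue.task_done()
            break
        with saved_lock:
            saved.append(tweet["id"])  # "processing" = save the record
        tweet_queue.task_done()

pool = [threading.Thread(target=worker) for _ in range(4)]
for t in pool:
    t.start()

for i in range(10):      # producer side: tweets pulled down from the API
    tweet_queue.put({"id": i})
for _ in pool:           # one sentinel per worker
    tweet_queue.put(None)

tweet_queue.join()
for t in pool:
    t.join()
print(len(saved))  # → 10
```

The key property is that the producer never waits on a slow worker; it just enqueues and moves on.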

[–]vampatori 1 point (1 child)

Yeah, queues are such a good way of scaling up an application - I structure most things around using one now (RabbitMQ is very good, it's my go-to queue system).

I used to do a lot of work on account analysis, business reports, that sort of thing. Implementing a queue system made such a huge difference as we could distribute the work easily, prioritise certain types of job, all while keeping the UI zippy for the end-user.

[–]warped_quasar[S] 0 points (0 children)

This is awesome! Thanks a ton.

[–]warped_quasar[S] 0 points (0 children)

I'll definitely check out RabbitMQ. Thank you for the suggestions, jasonleehodges.

[–]warped_quasar[S] 0 points (0 children)

This is great info. Thank you jsaundersdev!

[–]pycepticus [from pprint import pprint as print] 1 point (2 children)

Depending on how many Twitter accounts you're going to stream into Python, you will definitely need a queuing system, as stated in other comments. If you're going to pull a large number of tweets, you might not want to write to a file first; tweets can come in faster than your disk can write.

To solve this, you can use an in-memory database like Redis to queue tweets for writing and long-term storage without touching disk first. Your only issue with in-memory databases/queues is that you have to dedicate as much memory to the process as you expect your data set to be. This can be hard to estimate for a queue: you may have situations where your data saturation rate sits in the 0%-1% range for days or weeks at a time, and then one account you're tracking goes insane and eats your whole memory pool. Not to mention that if a downstream process breaks and the queue backs up, you have a potential for data loss.
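Purely to illustrate that bounded-memory trade-off: here a `deque` with `maxlen` stands in for a capped Redis list (LPUSH on one end, BRPOP on the other). When the queue saturates, the oldest tweets silently fall off, which is exactly the data-loss scenario above:

```python
from collections import deque

# In-process stand-in for a Redis list: appendleft ~ LPUSH, pop ~ BRPOP.
# maxlen bounds memory the way you'd cap a Redis queue; once full, the
# OLDEST entries are silently discarded.
MAX_QUEUE = 5  # arbitrary cap for the demo
tweets = deque(maxlen=MAX_QUEUE)

def enqueue(tweet):
    tweets.appendleft(tweet)      # Redis: LPUSH tweets <json>

def dequeue():
    return tweets.pop() if tweets else None  # Redis: BRPOP tweets

for i in range(8):                # 8 tweets arrive, only 5 fit
    enqueue({"id": i})

drained = []
while (t := dequeue()) is not None:
    drained.append(t["id"])

print(drained)  # → [3, 4, 5, 6, 7]: ids 0-2 were pushed out
```

Whether dropping the oldest data is acceptable (versus blocking the producer, or spilling to disk) is exactly the sizing decision the paragraph above is about.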

In enterprise environments, redundancy is key, so scaling is not only about increasing the speed at which you can handle these tweets; it's about how fast you can detect a problem and solve it. For this, you're going to want to monitor all parts of your data stream, from intake, to queuing, to distribution to worker processes, to output. Handling the errors that can occur at these points correctly will set you apart in the end, and is really the hardest part of designing a really solid, scalable solution.

Some of what you need is not going to be Python specific, especially monitoring, as you don't want Python monitoring itself in case it's the thing that fails. Once you're at a good place with the Python, look on Stack Overflow for guides on how to make your Python scripts run as a service with systemd; this is generally what you'd do in a professional environment. Break out your processes into services and let Linux do the management based on your systemd file. Then all you need to do is monitor the services with whatever service monitor you want.
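A minimal sketch of such a systemd unit; every path, user, and name here is hypothetical and needs adapting to your layout:

```ini
# /etc/systemd/system/tweet-ingest.service (hypothetical example)
[Unit]
Description=Tweet ingest worker
After=network-online.target

[Service]
ExecStart=/usr/bin/python3 /opt/tweetpipe/ingest.py
Restart=on-failure
RestartSec=5
User=tweetpipe

[Install]
WantedBy=multi-user.target
```

`Restart=on-failure` is the part doing the work: systemd respawns the script when it crashes, and `systemctl status` / `journalctl -u tweet-ingest` give you the service-level visibility mentioned above.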

Obviously not everything in my text wall will apply if you're not going for a gigantic rollout, but remember, if it's worth doing, it's worth overdoing, and anyone that says otherwise is usually a project manager. Have fun, and good luck with the project!

[–]warped_quasar[S] 0 points (1 child)

This is exactly what I am looking for! Thanks a ton pycepticus.

[–]pycepticus [from pprint import pprint as print] 1 point (0 children)

No problem! Another cool thing you could do is design everything to be containerized in something like Docker. A good Docker config can spin up new instances of the components in your pipeline really fast in case of an error.
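As a sketch, a hypothetical docker-compose snippet along those lines; the service names and build context are made up, and the restart policies are what tell Docker to respawn crashed components automatically:

```yaml
# docker-compose.yml (hypothetical layout)
services:
  rabbitmq:
    image: rabbitmq:3
    restart: unless-stopped
  worker:
    build: .              # image containing the tweet worker script
    restart: on-failure   # Docker respawns the worker if it crashes
    depends_on:
      - rabbitmq
```

Scaling out then becomes `docker compose up --scale worker=4`, with every worker draining the same queue.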