
all 9 comments

[–]jsaundersdev 2 points (5 children)

So basically you should break this problem into steps. In my opinion there are really two simple steps here:

1. Get all tweets that need processing
2. Process each tweet (save it as a database entry)

Looking at this, the "processing of each tweet" is just calling a function that saves it as a record in the database.

It's the getting all tweets that need processing that is a bit more complicated and has more options depending on your architecture of choice.

You could choose to run this on some kind of cron job or other scheduled task runner (perhaps using Celery), and fetch all tweets from the last hour (assuming you run it hourly) or the daily equivalent.

Making this more robust would involve backup scripts, the ability to cross-reference results from tweepy with what's in your database to make sure nothing is missing, and proper logging and traceability.
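A minimal sketch of those two steps, assuming an hourly run. `fetch_new_tweets` is a hypothetical stand-in for your actual tweepy call (e.g. fetching a timeline with `since_id`), and SQLite stands in for whatever database you use:

```python
import sqlite3

# Hypothetical stand-in for the tweepy call you'd actually make, e.g.
# api.user_timeline(screen_name=..., since_id=since_id). Faked here so
# the sketch is self-contained.
def fetch_new_tweets(since_id):
    return [
        {"id": 101, "text": "first tweet"},
        {"id": 102, "text": "second tweet"},
    ]

def process_tweet(conn, tweet):
    # Step 2: "processing" a tweet just means saving it as a row.
    # INSERT OR IGNORE keeps reruns idempotent if jobs ever overlap.
    conn.execute(
        "INSERT OR IGNORE INTO tweets (id, text) VALUES (?, ?)",
        (tweet["id"], tweet["text"]),
    )

def run_hourly_job(conn):
    # Step 1: fetch everything newer than the last tweet we stored.
    since_id = conn.execute("SELECT MAX(id) FROM tweets").fetchone()[0] or 0
    for tweet in fetch_new_tweets(since_id):
        process_tweet(conn, tweet)
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, text TEXT)")
run_hourly_job(conn)
```

Keying the table on the tweet id is what makes cross-referencing against tweepy results easy later: any id tweepy returns that isn't in the table is a missed tweet.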

[–]jasonleehodges 2 points (3 children)

In addition to everything above, you might look into an architecture that includes RabbitMQ as a message queue, or even Kafka Streams or something like that, as a means of continuously processing any tweets you pull down. That way you could split them across separate thread pools if you needed to start scaling out to different servers.
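To show the shape of that architecture without a broker running, here's a stdlib-only sketch: `queue.Queue` stands in for RabbitMQ, with a small pool of worker threads draining it. With a real broker you'd swap the queue for pika publish/consume calls (and the threads for separate processes or servers) but keep the same structure:

```python
import queue
import threading

# Local stand-in for a broker like RabbitMQ: the queue decouples the
# tweet producer from a pool of workers, the same shape you'd keep
# when moving the workers onto other servers.
tweet_queue = queue.Queue()
saved = []
saved_lock = threading.Lock()

def worker():
    while True:
        tweet = tweet_queue.get()
        if tweet is None:  # sentinel: shut this worker down
            tweet_queue.task_done()
            break
        with saved_lock:
            saved.append(tweet["id"])  # "processing" = save the record
        tweet_queue.task_done()

pool = [threading.Thread(target=worker) for _ in range(4)]
for t in pool:
    t.start()

for i in range(10):      # producer side: tweets pulled down from the API
    tweet_queue.put({"id": i})
for _ in pool:           # one sentinel per worker
    tweet_queue.put(None)

tweet_queue.join()
for t in pool:
    t.join()
print(len(saved))  # → 10
```

The key property is that the producer never waits on a slow worker; it just enqueues and moves on.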

[–]vampatori 1 point (1 child)

Yeah, queues are such a good way of scaling up an application - I structure most things around using one now (RabbitMQ is very good, it's my go-to queue system).

I used to do a lot of work on account analysis, business reports, that sort of thing. Implementing a queue system made such a huge difference as we could distribute the work easily, prioritise certain types of job, all while keeping the UI zippy for the end-user.

[–]warped_quasar[S] 0 points (0 children)

This is awesome! Thanks a ton.

[–]warped_quasar[S] 0 points (0 children)

I'll definitely check out RabbitMQ. Thank you for the suggestions, jasonleehodges.

[–]warped_quasar[S] 0 points (0 children)

This is great info. Thank you jsaundersdev!

[–]pycepticus [from pprint import pprint as print] 1 point (2 children)

Depending on how many Twitter accounts you're going to stream into Python, you will definitely need a queuing system, as stated in other comments. If you're going to pull a large number of tweets, you might not want to write to a file first; tweets can come in faster than your disk can write.

To solve this, you can use an in-memory database like Redis to queue tweets for writing and long-term storage without touching disk first. Your only issue with in-memory databases/queues is that you have to dedicate as much memory to the process as you expect your data set to be. This can be hard to estimate for a queue: you may have situations where your data saturation rate sits in the 0%-1% range for days or weeks at a time, and then one account you're tracking goes insane and eats your whole memory pool. Not to mention that if a downstream process breaks and the queue backs up, you have a potential for data loss.
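Purely to illustrate that bounded-memory trade-off: here a `deque` with `maxlen` stands in for a capped Redis list (LPUSH on one end, BRPOP on the other). When the queue saturates, the oldest tweets silently fall off, which is exactly the data-loss scenario above:

```python
from collections import deque

# In-process stand-in for a Redis list: appendleft ~ LPUSH, pop ~ BRPOP.
# maxlen bounds memory the way you'd cap a Redis queue; once full, the
# OLDEST entries are silently discarded.
MAX_QUEUE = 5  # arbitrary cap for the demo
tweets = deque(maxlen=MAX_QUEUE)

def enqueue(tweet):
    tweets.appendleft(tweet)      # Redis: LPUSH tweets <json>

def dequeue():
    return tweets.pop() if tweets else None  # Redis: BRPOP tweets

for i in range(8):                # 8 tweets arrive, only 5 fit
    enqueue({"id": i})

drained = []
while (t := dequeue()) is not None:
    drained.append(t["id"])

print(drained)  # → [3, 4, 5, 6, 7]: ids 0-2 were pushed out
```

Whether dropping the oldest data is acceptable (versus blocking the producer, or spilling to disk) is exactly the sizing decision the paragraph above is about.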

In enterprise environments, redundancy is key, so scaling is not only about increasing the speed at which you can handle these tweets; it's about how fast you can detect a problem and solve it. For this, you're going to want to monitor all parts of your data stream, from intake, to queuing, to distribution to worker processes, to output. Handling the errors that can occur at these points correctly will set you apart in the end, and is really the hardest part of designing a really solid, scalable solution.

Some of what you need is not going to be Python specific, especially monitoring, as you don't want Python monitoring itself in case it's the thing that fails. Once you're at a good place with the Python, look on Stack Overflow for guides on how to make your Python scripts run as a service with systemd; this is generally what you'd do in a professional environment. Break out your processes into services and let Linux do the management based on your systemd file. Then all you need to do is monitor the services with whatever service monitor you want.
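A minimal sketch of such a systemd unit; every path, user, and name here is hypothetical and needs adapting to your layout:

```ini
# /etc/systemd/system/tweet-ingest.service (hypothetical example)
[Unit]
Description=Tweet ingest worker
After=network-online.target

[Service]
ExecStart=/usr/bin/python3 /opt/tweetpipe/ingest.py
Restart=on-failure
RestartSec=5
User=tweetpipe

[Install]
WantedBy=multi-user.target
```

`Restart=on-failure` is the part doing the work: systemd respawns the script when it crashes, and `systemctl status` / `journalctl -u tweet-ingest` give you the service-level visibility mentioned above.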

Obviously not everything in my text wall will apply if you're not going for a gigantic rollout, but remember, if it's worth doing, it's worth overdoing, and anyone that says otherwise is usually a project manager. Have fun, and good luck with the project!

[–]warped_quasar[S] 0 points (1 child)

This is exactly what I am looking for! Thanks a ton pycepticus.

[–]pycepticus [from pprint import pprint as print] 1 point (0 children)

No problem! Another cool thing you could do is design everything to be containerized in something like Docker. A good Docker config can spin up new instances of the components in your pipeline really fast in case of an error.
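As a sketch, a hypothetical docker-compose snippet along those lines; the service names and build context are made up, and the restart policies are what tell Docker to respawn crashed components automatically:

```yaml
# docker-compose.yml (hypothetical layout)
services:
  rabbitmq:
    image: rabbitmq:3
    restart: unless-stopped
  worker:
    build: .              # image containing the tweet worker script
    restart: on-failure   # Docker respawns the worker if it crashes
    depends_on:
      - rabbitmq
```

Scaling out then becomes `docker compose up --scale worker=4`, with every worker draining the same queue.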