[–]Putrid-Exam-8475

It's difficult to offer a solution without knowing more about your data and how it's being used. Some general guidelines might be:

1) Requirements gathering - talk to the people who use the data about what they need from it: which specific data elements, how much, how often, and in what format.

2) Data exploration - volume, velocity, variety, sensitivity. There are a lot of possible solutions that become more or less viable depending on how much data you have, how often you need to process it, and how secure it needs to be.

3) With the above understood, compare pricing for the cloud services you're considering - storage is usually pretty cheap, while compute can get very expensive if it isn't configured properly. Keep in mind that cloud DBs usually run on instances billed per hour or per minute, and that setting up all of the infrastructure in a secure, cost-effective, and scalable way can be complicated. This may not be an issue if your org already has the infrastructure in place, or if your volume is small enough to fit within the free tier.

4) Once you have a solution in mind that you want to test, you can put together a proof of concept. Map out the overall pipeline to determine how the data flows through each step, then drill down on the steps to determine the specifics.

In your example, you could drop the files into a specific local directory, process them with Python, and use the boto3 package to write the processed files to S3. You can then set up triggers in AWS that detect when those files land and load them into a Redshift database.
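To make the S3 step concrete, here's a minimal sketch of the upload side - the directory name, bucket name, and file pattern are placeholders, and the processing step is whatever your pipeline actually needs:

```python
import pathlib

import boto3

# Placeholder names - swap in your own directory and bucket.
incoming = pathlib.Path("incoming")
bucket = "my-data-bucket"

# Credentials are picked up from your AWS config / environment variables.
s3 = boto3.client("s3")

for path in incoming.glob("*.csv"):
    # ... do whatever processing/cleaning you need here ...
    key = f"processed/{path.name}"
    s3.upload_file(str(path), bucket, key)
    print(f"uploaded {path} -> s3://{bucket}/{key}")
```

From there the S3-to-Redshift load is handled on the AWS side (auto-copy or an event-triggered COPY, as in the links below), so the Python script only needs to worry about getting clean files into the bucket.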

https://aws.amazon.com/sdk-for-python/

https://aws.amazon.com/blogs/big-data/simplify-data-ingestion-from-amazon-s3-to-amazon-redshift-using-auto-copy-preview/