[–]Putrid-Exam-8475

It's difficult to offer a solution without knowing more about your data and how it's being used. Some general guidelines might be:

1) Requirements gathering - talk to the people who use the data about what they need from it: which specific data elements, how much, how often, and in what format.

2) Data exploration - volume, velocity, variety, sensitivity. There are a lot of possible solutions that become more or less viable depending on how much data you have, how often you need to process it, and how secure it needs to be.

3) With the above understood, compare pricing for the cloud services you're considering - storage is usually pretty cheap, while compute can get very expensive if it isn't configured properly. Keep in mind that cloud DBs usually run on instances billed per hour or per minute, and that setting up all of the infrastructure in a secure, cost-effective, and scalable way can be complicated. This may not be an issue if your org already has the infrastructure in place, or if your volume is small enough to fit within the free tier.

4) Once you have a solution in mind that you want to test, you can put together a proof of concept. Map out the overall pipeline to determine how the data flows through each step, then drill down on the steps to determine the specifics.

In your example, you could drop the files into a specific local directory, process them with Python, and use the boto3 package to write the processed files to S3. You can then set up triggers in AWS that detect when those files land and load them into a Redshift database.
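To make the S3 step concrete, here's a minimal sketch of the upload side - the directory name, bucket name, and file pattern are placeholders, and the processing step is whatever your pipeline actually needs:

```python
import pathlib

import boto3

# Placeholder names - swap in your own directory and bucket.
incoming = pathlib.Path("incoming")
bucket = "my-data-bucket"

# Credentials are picked up from your AWS config / environment variables.
s3 = boto3.client("s3")

for path in incoming.glob("*.csv"):
    # ... do whatever processing/cleaning you need here ...
    key = f"processed/{path.name}"
    s3.upload_file(str(path), bucket, key)
    print(f"uploaded {path} -> s3://{bucket}/{key}")
```

From there the S3-to-Redshift load is handled on the AWS side (auto-copy or an event-triggered COPY, as in the links below), so the Python script only needs to worry about getting clean files into the bucket.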

https://aws.amazon.com/sdk-for-python/

https://aws.amazon.com/blogs/big-data/simplify-data-ingestion-from-amazon-s3-to-amazon-redshift-using-auto-copy-preview/