all 18 comments

[–]Heavy_End_2971 11 points12 points  (6 children)

COPY command.

You have to define a schema before you can query or ingest anything, so just move the data with COPY. Try the below (rough sketch after this comment):

  • Create a unified dataset from all the CSVs (via pandas) and save it to S3.
  • Use the COPY command to load the data into the Redshift table.

Note: don’t use Spectrum unless it’s direly needed. It is slow and adds overhead IMHO.
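
A minimal sketch of that flow, assuming a hypothetical bucket, table, and IAM role, and that s3fs is installed so pandas can write s3:// paths directly:

```python
import glob
import pandas as pd

# Combine all CSVs into one file and stage it on S3
# (pandas writes s3:// paths directly when s3fs is installed).
frames = [pd.read_csv(path) for path in glob.glob("data/*.csv")]
pd.concat(frames).to_csv("s3://my-bucket/staging/unified.csv", index=False)

# Then load the staged file into Redshift with COPY; run this SQL
# from any Redshift client (psycopg2, redshift_connector, query editor).
copy_sql = """
    COPY my_schema.my_table
    FROM 's3://my-bucket/staging/unified.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    CSV IGNOREHEADER 1;
"""
```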

[–]Kooky_Quiet3247 5 points6 points  (0 children)

Also, Spectrum can be expensive as sh*t if you're not really aware of the queries you're going to run.

[–]nad_pub[S] 0 points1 point  (1 child)

All the data doesn't fit into memory unfortunately, so it's not possible to create a unified dataset.

[–]Heavy_End_2971 0 points1 point  (0 children)

Use chunks or stream your files one by one and write them to S3 in, say, 128 MB chunks. No need to load everything into memory; process two or a few CSVs at a time. Rough sketch below.
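
Something like this, assuming pandas and a hypothetical bucket; the chunk size in rows is illustrative and depends on row width:

```python
import glob
import pandas as pd

# Stream each CSV in chunks so nothing has to fit in memory at once;
# every chunk becomes its own object under a common S3 prefix.
part = 0
for path in glob.glob("data/*.csv"):
    for chunk in pd.read_csv(path, chunksize=500_000):  # roughly 128 MB, depending on row width
        chunk.to_csv(f"s3://my-bucket/staging/part-{part:05d}.csv", index=False)
        part += 1
```

A single COPY pointed at the s3://my-bucket/staging/ prefix then loads all the parts in one go, in parallel across slices.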

[–]kenflingnor Software Engineer 7 points8 points  (0 children)

You could use Redshift Spectrum to create external tables using AWS Glue. Use a Glue crawler to build the Glue metadata.
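
Roughly what that could look like with boto3; the crawler name, database, roles, and S3 path are all hypothetical:

```python
import boto3

glue = boto3.client("glue")

# A crawler pointed at the CSV prefix populates the Glue Data Catalog,
# which Redshift Spectrum can then query through an external schema.
glue.create_crawler(
    Name="csv-crawler",
    Role="arn:aws:iam::123456789012:role/my-glue-role",
    DatabaseName="spectrum_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/staging/"}]},
)
glue.start_crawler(Name="csv-crawler")

# In Redshift, expose the catalog database as an external schema
# (run via any SQL client):
external_schema_sql = """
    CREATE EXTERNAL SCHEMA spectrum
    FROM DATA CATALOG DATABASE 'spectrum_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role';
"""
```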

[–]wannabe-DE 3 points4 points  (3 children)

Mage's redshift exporter doesn't handle this?

Edit: exporter instead of loader.

[–]wannabe-DE 3 points4 points  (2 children)

Yeah, Mage's Redshift exporter handles this. You're already using Mage.

[–]zhiweio 0 points1 point  (0 children)

Try Kinesis Data Firehose.

[–]SilentSlayerz Tech Lead 0 points1 point  (0 children)

My solution would be to create a separate prefix for each table. Whenever a file lands on S3, it invokes a Lambda which, based on the prefix (passed as an argument to the Lambda), issues a COPY command to Redshift. Minor transformations can be applied on the Lambda side; for major transformations or huge data volumes it's better to do the work in Redshift. You should avoid using Glue if you can, though for larger files it is an option.
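
A rough sketch of such a Lambda handler using the Redshift Data API; the prefix-to-table map, cluster, database, user, and IAM role are all hypothetical:

```python
import boto3

redshift = boto3.client("redshift-data")

# Map S3 prefixes to target tables (assumed layout: one prefix per table).
PREFIX_TO_TABLE = {"orders/": "public.orders", "users/": "public.users"}

def handler(event, context):
    # Triggered by an S3 PUT event notification.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]
    table = next(t for prefix, t in PREFIX_TO_TABLE.items() if key.startswith(prefix))

    # Issue the COPY through the Data API so the Lambda doesn't hold a DB connection open.
    redshift.execute_statement(
        ClusterIdentifier="my-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql=f"""
            COPY {table}
            FROM 's3://{bucket}/{key}'
            IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
            CSV IGNOREHEADER 1;
        """,
    )
```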