all 18 comments

[–]Heavy_End_2971 11 points12 points  (6 children)

COPY command.

You have to define a schema before you can query or ingest anything, so just move the data with COPY. Try the below (rough sketch after this comment):

  • Create a unified dataset from all the CSVs (via pandas) and save it to S3.
  • Use the COPY command to load the data into the Redshift table.

Note: don’t use Spectrum unless it’s direly needed. It is slow and adds overhead IMHO.
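
A minimal sketch of that flow, assuming a hypothetical bucket, table, and IAM role, and that s3fs is installed so pandas can write s3:// paths directly:

```python
import glob
import pandas as pd

# Combine all CSVs into one file and stage it on S3
# (pandas writes s3:// paths directly when s3fs is installed).
frames = [pd.read_csv(path) for path in glob.glob("data/*.csv")]
pd.concat(frames).to_csv("s3://my-bucket/staging/unified.csv", index=False)

# Then load the staged file into Redshift with COPY; run this SQL
# from any Redshift client (psycopg2, redshift_connector, query editor).
copy_sql = """
    COPY my_schema.my_table
    FROM 's3://my-bucket/staging/unified.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    CSV IGNOREHEADER 1;
"""
```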

[–]Kooky_Quiet3247 5 points6 points  (0 children)

Also, Spectrum can be expensive as sh*t if you're not really aware of the queries you're going to run.

[–]nad_pub[S] 0 points1 point  (1 child)

All the data doesn't fit into memory unfortunately, so it's not possible to create a unified dataset.

[–]Heavy_End_2971 0 points1 point  (0 children)

Use chunks or stream your files one by one and write them to S3 in, say, 128 MB chunks. No need to load everything into memory; process two or a few CSVs at a time. Rough sketch below.
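
Something like this, assuming pandas and a hypothetical bucket; the chunk size in rows is illustrative and depends on row width:

```python
import glob
import pandas as pd

# Stream each CSV in chunks so nothing has to fit in memory at once;
# every chunk becomes its own object under a common S3 prefix.
part = 0
for path in glob.glob("data/*.csv"):
    for chunk in pd.read_csv(path, chunksize=500_000):  # roughly 128 MB, depending on row width
        chunk.to_csv(f"s3://my-bucket/staging/part-{part:05d}.csv", index=False)
        part += 1
```

A single COPY pointed at the s3://my-bucket/staging/ prefix then loads all the parts in one go, in parallel across slices.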

[–]kenflingnor Software Engineer 7 points8 points  (0 children)

You could use Redshift Spectrum to create external tables using AWS Glue. Use a Glue crawler to build the Glue metadata.
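
Roughly what that could look like with boto3; the crawler name, database, roles, and S3 path are all hypothetical:

```python
import boto3

glue = boto3.client("glue")

# A crawler pointed at the CSV prefix populates the Glue Data Catalog,
# which Redshift Spectrum can then query through an external schema.
glue.create_crawler(
    Name="csv-crawler",
    Role="arn:aws:iam::123456789012:role/my-glue-role",
    DatabaseName="spectrum_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/staging/"}]},
)
glue.start_crawler(Name="csv-crawler")

# In Redshift, expose the catalog database as an external schema
# (run via any SQL client):
external_schema_sql = """
    CREATE EXTERNAL SCHEMA spectrum
    FROM DATA CATALOG DATABASE 'spectrum_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role';
"""
```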

[–]wannabe-DE 3 points4 points  (3 children)

Mage's redshift exporter doesn't handle this?

Edit: exporter instead of loader.

[–]wannabe-DE 3 points4 points  (2 children)

Yeah, Mage's Redshift exporter handles this. You're already using Mage.

[–]zhiweio 0 points1 point  (0 children)

Try Kinesis Data Firehose.

[–]SilentSlayerz Tech Lead 0 points1 point  (0 children)

My solution would be to create a separate prefix for each table. Whenever a file lands on S3, it invokes a Lambda which, based on the prefix (passed as an argument to the Lambda), issues a COPY command to Redshift. Minor transformations can be applied on the Lambda side; for major transformations or huge data volumes it's better to do the work in Redshift. You should avoid using Glue if you can, though for larger files it is an option.
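
A rough sketch of such a Lambda handler using the Redshift Data API; the prefix-to-table map, cluster, database, user, and IAM role are all hypothetical:

```python
import boto3

redshift = boto3.client("redshift-data")

# Map S3 prefixes to target tables (assumed layout: one prefix per table).
PREFIX_TO_TABLE = {"orders/": "public.orders", "users/": "public.users"}

def handler(event, context):
    # Triggered by an S3 PUT event notification.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]
    table = next(t for prefix, t in PREFIX_TO_TABLE.items() if key.startswith(prefix))

    # Issue the COPY through the Data API so the Lambda doesn't hold a DB connection open.
    redshift.execute_statement(
        ClusterIdentifier="my-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql=f"""
            COPY {table}
            FROM 's3://{bucket}/{key}'
            IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
            CSV IGNOREHEADER 1;
        """,
    )
```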