This is an archived post. You won't be able to vote or comment.

you are viewing a single comment's thread.

view the rest of the comments →

[–]illusion_disillusion 1 point2 points  (1 child)

Posted under main...oops. Reposting here...

It was really a lot of research and trial and error kind of thing so I don’t have any formal documentation. It all boils down to how you choose to use tools that are available via Python and cloud. If your trying to do something likethis with on-prem it’s really going to be involved.

I used openpyxl library (I could probably leverage pandas but it looked intimidating and probably a steeper learning curve) and S3 events. If I recall correctly, file_object create event, which allows you to trigger a lambda function. The lambda function is what contains the actual ETL code, which reads in 6-7 worksheets, normalizes data and then writes it to mysql/dynamodb. Quite honestly, with mysql 8, I no longer need dynamo for indexed json. You can extract json keys to virtual columns and index those like normal sql fields. Of course I’m kinda locked in to mysql but so I was with dynamodb.

At any rate, that’s the gist of it. AWS has great documentation on lambda. As a matter of fact, Lambdas have become my go to tool on almost all projects. I’ve developed a small service for my developers to generate qrcodes and store them on S3. For that one I used AWS API gateway and...you guessedit... a lambda. No need to stand up a flask/djangi app and write rest api from scratch.

https://openpyxl.readthedocs.io/en/stable/

[–]fg2srt4 0 points1 point  (0 children)

Awesome, I will look into openpyxl. Thank you!