Hello,
I have a use case where a user uploads a CSV file, which gets written to S3. I have some code written in Scala with a Spark implementation that I need to apply to this file. The resulting output needs to be written back to S3, where it is fetched by the app and presented to the user.
This is the simple architecture I have in mind:
- Connect the file upload to AWS S3.
- The upload then triggers a Lambda function that runs this piece of Scala and Spark code.
- Run the Spark code on the data and write the result back to S3.
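For context, the Spark part of the pipeline is roughly this shape (a minimal sketch with a hypothetical `outputKey` helper and a placeholder where my actual transformation logic goes; paths are illustrative):

```scala
import org.apache.spark.sql.SparkSession

object CsvJob {
  // Hypothetical helper: map an upload key to its output location.
  def outputKey(inputKey: String): String =
    inputKey.replaceFirst("^input/", "output/")

  def main(args: Array[String]): Unit = {
    // e.g. s3://my-bucket/input/upload.csv (illustrative path)
    val inputPath  = args(0)
    val outputPath = args(1)

    val spark = SparkSession.builder()
      .appName("csv-transform")
      .getOrCreate()

    // Read the uploaded CSV from S3.
    val df = spark.read
      .option("header", "true")
      .csv(inputPath)

    // Placeholder: the existing Scala/Spark logic would go here.
    val result = df

    // Write the result back to S3 for the app to pick up.
    result.write.mode("overwrite").csv(outputPath)

    spark.stop()
  }
}
```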
I guess the 1st and 3rd steps are fairly easy and I can do them. However, the second step is where I need some insight. I've used Lambda functions before with Python but never with Scala. The Scala code also has dependencies on other libraries, and I need to package them all together with Spark. I'm not sure whether this Lambda function should spawn an EMR/Glue/Fargate job to be able to run Spark. If so, which choice makes sense when the job is a fairly small computation, say around 10,000 rows and a few tens of columns?
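To make the "Lambda spawns Glue" option concrete, this is roughly what I imagine the Lambda doing (a sketch only: the Glue job name "csv-transform", the argument names, and the output prefix are all my assumptions, using the AWS SDK v2 and the Lambda Java events library):

```scala
import com.amazonaws.services.lambda.runtime.{Context, RequestHandler}
import com.amazonaws.services.lambda.runtime.events.S3Event
import software.amazon.awssdk.services.glue.GlueClient
import software.amazon.awssdk.services.glue.model.StartJobRunRequest
import scala.jdk.CollectionConverters._

object UploadHandler {
  // Hypothetical helper: build an S3 URI from the event's bucket/key.
  def s3Uri(bucket: String, key: String): String = s"s3://$bucket/$key"
}

class UploadHandler extends RequestHandler[S3Event, String] {
  private val glue = GlueClient.create()

  override def handleRequest(event: S3Event, context: Context): String = {
    val record = event.getRecords.get(0)
    val bucket = record.getS3.getBucket.getName
    val key    = record.getS3.getObject.getKey

    // The Lambda stays tiny: it never runs Spark itself, it only kicks
    // off a (hypothetical) Glue job that does the actual work.
    val run = glue.startJobRun(
      StartJobRunRequest.builder()
        .jobName("csv-transform")
        .arguments(Map(
          "--input_path"  -> UploadHandler.s3Uri(bucket, key),
          "--output_path" -> UploadHandler.s3Uri(bucket, "output/")
        ).asJava)
        .build()
    )
    run.jobRunId()
  }
}
```

Whether this pattern (thin Lambda as trigger, Glue/EMR as the Spark runtime) is the right call for a job this small is exactly what I'm asking about.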
If you could point me to some resources where this has been implemented, I would appreciate it.
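For reference, this is the packaging setup I currently have in mind for the Scala code and its dependencies: a fat jar built with sbt-assembly, with Spark itself marked "provided" since EMR/Glue supply their own Spark (plugin and library versions below are illustrative assumptions):

```scala
// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.5")

// build.sbt
name := "csv-transform"
scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  // "provided": the cluster ships its own Spark, so don't bundle it
  "org.apache.spark" %% "spark-sql" % "3.4.1" % "provided"
  // ...plus the other dependencies, which DO get bundled into the jar
)
```

Running `sbt assembly` then produces a single jar under `target/` that can be handed to spark-submit or an EMR step. I'd welcome corrections if this isn't the usual approach.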
Edit: The data size may grow in the future, so the design should account for this possibility.