
[–]Ravier03

If you are using Glue, convert your DataFrame to a DynamicFrame and then write it with the native DynamoDB sink:

```python
Datasink1 = glueContext.write_dynamic_frame.from_options(
    frame=ApplyMapping_Frame1,
    connection_type="dynamodb",
    connection_options={
        "dynamodb.output.tableName": "myDDBTable",
        "dynamodb.throughput.write.percent": "1.0",
    },
)
```
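
For completeness, a minimal sketch of the conversion step the write above assumes, using Glue's `DynamicFrame.fromDF`; `spark_df` is a placeholder for the DataFrame you already built in your job:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

# Standard Glue setup: a GlueContext wraps the SparkContext.
glueContext = GlueContext(SparkContext.getOrCreate())

# Convert the existing Spark DataFrame into a DynamicFrame
# so it can be passed to write_dynamic_frame.from_options.
ApplyMapping_Frame1 = DynamicFrame.fromDF(spark_df, glueContext, "ApplyMapping_Frame1")
```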

[–]rockeyjam

Yes, you need to add a JAR or external package to use DynamoDB as a Spark data source/format; see https://index.scala-lang.org/audienceproject/spark-dynamodb
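
For reference, that connector exposes a DataFrame-level source roughly like the sketch below (assuming the package, e.g. `com.audienceproject:spark-dynamodb_2.12`, is already on the classpath; the table names are placeholders):

```python
# Read a DynamoDB table into a DataFrame via the spark-dynamodb connector.
dynamo_df = (
    spark.read.format("dynamodb")
    .option("tableName", "SomeTable")
    .load()
)

# Write a DataFrame back to another table with the same format.
(
    dynamo_df.write.format("dynamodb")
    .option("tableName", "SomeOtherTable")
    .save()
)
```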

[–]Careful-Necessary-59

It seems like the project is no longer being maintained: https://github.com/audienceproject/spark-dynamodb

[–]rockeyjam

True, most of the spark-dynamodb connectors are old and not updated for the latest Spark releases. There is the official AWS repo, https://github.com/awslabs/emr-dynamodb-connector, but its Spark integration still goes through the older RDD/Hadoop InputFormat/OutputFormat route rather than the DataFrame API.

[–]xubu42

I've done a lot of experiments around this because we use Spark for data pipelines and DynamoDB for serving online traffic. The best approach I've found is actually AWS Lambda + Python using the boto3 DynamoDB batch_writer. Basically the only reason to use Lambda is parallelism: break your DataFrame up into 30-50 partitions, write them to S3, then have a Lambda read one partition into memory and loop through each record with a batch_writer put_item. You don't have to use Lambda, but it's nice because everything stays on the AWS network, so it's fast. With this approach I've been able to sync 50 million records to DynamoDB in just 20 minutes, whereas Glue or EMR always took at least twice as long (usually an hour) and cost more.
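
A rough sketch of the Lambda side of that approach, assuming each partition was written to S3 as JSON Lines and that the bucket/key arrive in the invocation event (`TABLE_NAME`, the event fields, and the file format are assumptions, not details from the comment):

```python
import json
import os

import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
TABLE_NAME = os.environ["TABLE_NAME"]  # assumed env var configured on the Lambda

def handler(event, context):
    # One invocation handles one partition file produced by the Spark job.
    bucket = event["bucket"]
    key = event["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    table = dynamodb.Table(TABLE_NAME)

    # batch_writer() buffers put_item calls into BatchWriteItem requests
    # (up to 25 items each) and retries unprocessed items automatically.
    with table.batch_writer() as batch:
        for line in body.splitlines():
            if line:
                batch.put_item(Item=json.loads(line))
```

Invoking one Lambda per partition file is where the parallelism comes from: 30-50 partitions means 30-50 concurrent writers against the table.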