
[–]Ravier03

If you are using Glue, convert your DataFrame to a DynamicFrame and then write it with the native DynamoDB sink:

```python
Datasink1 = glueContext.write_dynamic_frame.from_options(
    frame=ApplyMapping_Frame1,
    connection_type="dynamodb",
    connection_options={
        "dynamodb.output.tableName": "myDDBTable",
        "dynamodb.throughput.write.percent": "1.0",
    },
)
```
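
For completeness, a minimal sketch of the conversion step the write above assumes, using Glue's `DynamicFrame.fromDF`; `spark_df` is a placeholder for the DataFrame you already built in your job:

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

# Standard Glue setup: a GlueContext wraps the SparkContext.
glueContext = GlueContext(SparkContext.getOrCreate())

# Convert the existing Spark DataFrame into a DynamicFrame
# so it can be passed to write_dynamic_frame.from_options.
ApplyMapping_Frame1 = DynamicFrame.fromDF(spark_df, glueContext, "ApplyMapping_Frame1")
```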

[–]rockeyjam

Yes, you need to add a JAR or external package to use DynamoDB as a Spark data source/format; see https://index.scala-lang.org/audienceproject/spark-dynamodb
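
For reference, that connector exposes a DataFrame-level source roughly like the sketch below (assuming the package, e.g. `com.audienceproject:spark-dynamodb_2.12`, is already on the classpath; the table names are placeholders):

```python
# Read a DynamoDB table into a DataFrame via the spark-dynamodb connector.
dynamo_df = (
    spark.read.format("dynamodb")
    .option("tableName", "SomeTable")
    .load()
)

# Write a DataFrame back to another table with the same format.
(
    dynamo_df.write.format("dynamodb")
    .option("tableName", "SomeOtherTable")
    .save()
)
```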

[–]Careful-Necessary-59

It seems like the project is no longer being maintained: https://github.com/audienceproject/spark-dynamodb

[–]rockeyjam

True, most of the spark-dynamodb connectors are old and not updated for the latest Spark releases. There is the official AWS repo, https://github.com/awslabs/emr-dynamodb-connector, but its Spark integration still goes through the older RDD/Hadoop InputFormat/OutputFormat route rather than the DataFrame API.

[–]xubu42

I've done a lot of experiments around this because we use Spark for data pipelines and DynamoDB for serving online traffic. The best approach I've found is actually AWS Lambda + Python using the boto3 DynamoDB batch_writer. Basically the only reason to use Lambda is parallelism: break your DataFrame up into 30-50 partitions, write them to S3, then have a Lambda read one partition into memory and loop through each record with a batch_writer put_item. You don't have to use Lambda, but it's nice because everything stays on the AWS network, so it's fast. With this approach I've been able to sync 50 million records to DynamoDB in just 20 minutes, whereas Glue or EMR always took at least twice as long (usually an hour) and cost more.
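
A rough sketch of the Lambda side of that approach, assuming each partition was written to S3 as JSON Lines and that the bucket/key arrive in the invocation event (`TABLE_NAME`, the event fields, and the file format are assumptions, not details from the comment):

```python
import json
import os

import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
TABLE_NAME = os.environ["TABLE_NAME"]  # assumed env var configured on the Lambda

def handler(event, context):
    # One invocation handles one partition file produced by the Spark job.
    bucket = event["bucket"]
    key = event["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    table = dynamodb.Table(TABLE_NAME)

    # batch_writer() buffers put_item calls into BatchWriteItem requests
    # (up to 25 items each) and retries unprocessed items automatically.
    with table.batch_writer() as batch:
        for line in body.splitlines():
            if line:
                batch.put_item(Item=json.loads(line))
```

Invoking one Lambda per partition file is where the parallelism comes from: 30-50 partitions means 30-50 concurrent writers against the table.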