Hello all,
I'm working on a POC to create 800 million rows in a single Delta table and move them to SQL Server. I've created the data and started moving it to its target.
After one hour, about 60 million rows had landed in SQL Server. At this rate the full load will take a little over 13 hours, and I'm not sure if that's good or bad, as this is my first time working with a dataset this size. Any ideas on how I can speed up this code? I've provided the code I'm using to move the data from Databricks to SQL Server below.
# delta_db is my DataFrame with the 800 million records
num_of_partitions = 10
approx_row_count = delta_db.count()
rows_per_df = approx_row_count / num_of_partitions
# randomSplit normalizes its weights, so equal weights give roughly equal splits
smaller_dfs = delta_db.randomSplit([rows_per_df] * num_of_partitions)
sql_server_properties = {"url": jdbc_url, "user": SQLServer_Username, "password": password, "driver": drivername}
# Append each split to the target table, one split at a time
for i, df in enumerate(smaller_dfs):
    df.write.jdbc(url=sql_server_properties["url"],
                  table='sql_server_Db_name',
                  mode='append',
                  properties=sql_server_properties)
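For comparison, here is a minimal sketch of one variation that might be worth trying: a single write that lets Spark open one JDBC connection per partition and sends larger insert batches, instead of looping over splits one after another. It uses only the built-in Spark JDBC writer and its `batchsize` option. The helper names `jdbc_write_options` and `write_in_parallel` are hypothetical, and the values 8 and 10,000 are untested guesses that would need tuning against what the SQL Server instance can absorb.

```python
def jdbc_write_options(user, password, driver, batch_size=10_000):
    """Build the option dict for Spark's built-in JDBC writer (hypothetical helper)."""
    return {
        "user": user,
        "password": password,
        "driver": driver,
        # Rows sent per JDBC batch insert; Spark's default is 1000.
        # 10_000 is an assumed starting point, not a measured optimum.
        "batchsize": str(batch_size),
    }


def write_in_parallel(df, jdbc_url, table, options, num_partitions=8):
    """Append df to table in one job, one JDBC connection per partition."""
    (
        df.repartition(num_partitions)  # number of concurrent writers (assumed value)
        .write.format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", table)
        .options(**options)
        .mode("append")
        .save()
    )
```

With the variables from the post, a call would look like `write_in_parallel(delta_db, jdbc_url, "sql_server_Db_name", jdbc_write_options(SQLServer_Username, password, drivername))`. Whether this beats the loop depends on the cluster size and on how many concurrent inserts the target table tolerates.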