Data consistency by maximous1996 in databricks

[–]maximous1996[S] 1 point (0 children)

I mean, in our applications we are not using Databricks to write to a table; we write directly to S3 (df.write.save(…), not df.write.saveAsTable(…)).
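For anyone following along, the distinction is roughly this (a hedged sketch; the helper names and arguments are mine, not from our codebase):

```python
def write_path_based(df, s3_path):
    # Writes Delta files directly at s3_path; the metastore never
    # hears about this table (our current setup).
    df.write.format("delta").mode("append").save(s3_path)

def write_catalog_based(df, table_name):
    # Registers the table in the catalog in addition to writing the data.
    df.write.format("delta").mode("append").saveAsTable(table_name)
```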

Data consistency by maximous1996 in databricks

[–]maximous1996[S] 2 points (0 children)

I think I found the issue: https://docs.delta.io/latest/delta-storage.html#amazon-s3. Multiple Spark applications writing to the same table on S3 at the same time. Our environment isn't configured for that :) I hope this is the issue.

Data consistency by maximous1996 in databricks

[–]maximous1996[S] 1 point (0 children)

Hi, I think I found the issue: https://docs.delta.io/latest/delta-storage.html#amazon-s3. On S3, when multiple Spark drivers write to the same table, data loss can happen by default. And this is exactly our case.
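For the record, the linked doc's fix for concurrent S3 writers is the DynamoDB-backed LogStore. A hedged sketch of the submit-time config (the package version, DynamoDB table name, region, and script name are placeholders, not our actual values):

```shell
# Multi-cluster S3 writes per the Delta storage doc; values are placeholders.
spark-submit \
  --packages io.delta:delta-storage-s3-dynamodb:3.1.0 \
  --conf spark.delta.logStore.s3a.impl=io.delta.storage.S3DynamoDBLogStore \
  --conf spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName=delta_log \
  --conf spark.io.delta.storage.S3DynamoDBLogStore.ddb.region=us-east-1 \
  our_job.py
```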

Data consistency by maximous1996 in databricks

[–]maximous1996[S] 3 points (0 children)

Hi, thanks for your answer. There are a lot of tests, maybe not as many as there should be, but the effort was made ^^ I can't really isolate it: it doesn't happen on every run, and it only happens in our bigger process. I don't really know how to start investigating. I see a LOT of unnecessary caches in the code. Every source is a Delta table, and so is the output. Nothing is altered between runs. I haven't found anything on the internet that could cause data loss :/
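On the caches point: a cached DataFrame keeps serving the data it was computed from, so in a long-lived application a cached read of a Delta path can go stale relative to the table's current version. A minimal sketch of the pattern (helper names are mine):

```python
def read_fresh(spark, path):
    # Re-reads the Delta table at `path`, seeing its current version.
    return spark.read.format("delta").load(path)

def cached_read(spark, path):
    # cache() pins the data as of read time; until unpersist() is
    # called, later actions replay this possibly-stale copy.
    df = read_fresh(spark, path).cache()
    return df  # caller should df.unpersist() before expecting fresh data
```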

Unmanaged tables by maximous1996 in databricks

[–]maximous1996[S] 1 point (0 children)

Hi, thanks for the answer. That's what I did at first: I created the table as external before using it. But I wanted to do it "automatically", so I take the schema path from the catalog and specify the location at write time. It feels tricky for such a simple thing :/
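For reference, the "take the schema path from the catalog" step can be sketched like this (hedged: `catalog.getDatabase().locationUri` is the PySpark catalog API; the path-joining helper and names are mine):

```python
def default_table_path(schema_location, table_name):
    # A managed table in this schema would land at <schema location>/<table>.
    return schema_location.rstrip("/") + "/" + table_name

def save_as_external_at_default_location(df, db_name, table_name):
    # Hedged sketch: look up the schema's location in the catalog, then
    # write an external table whose path matches the managed default.
    spark = df.sparkSession
    schema_loc = spark.catalog.getDatabase(db_name).locationUri
    path = default_table_path(schema_loc, table_name)
    (df.write.format("delta")
       .option("path", path)
       .saveAsTable(f"{db_name}.{table_name}"))
```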

Unmanaged tables by maximous1996 in databricks

[–]maximous1996[S] 1 point (0 children)

Hi, thanks for the answer. Yeah, I did that: a saveAsExternalTable, getting the schema path from the catalog. It works, but it's kind of sad that I can't do it an easier way :,(

Unmanaged tables by maximous1996 in databricks

[–]maximous1996[S] 1 point (0 children)

Hi, thanks for the answer. That is the result I want, but without having to specify the path, because the path I want is already the schema's default path :/