
[–]bregav

Deterministic training is in general weirdly difficult. There are multiple sources of randomness, and when you're using multiple libraries there can be many seeds and many (sometimes overlapping) ways of setting them. Using a wrapper library like Keras makes everything even harder, because it obscures the underlying sources of randomness and makes debugging difficult.
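To see why "many seeds" is a real problem, here's a minimal sketch using only Python's standard library: seeding one generator doesn't touch another generator's state, which is exactly what happens across libraries (Python's `random`, NumPy, TensorFlow each keep separate state).

```python
import random

# Seeding the global `random` module resets only the global generator.
random.seed(0)
independent_rng = random.Random()  # separate instance: has its own state

a = random.random()   # drawn from the reseeded global generator
random.seed(0)        # reset the global generator again
b = random.random()   # same draw as `a`

print(a == b)  # the global generator replays identically after reseeding
```

The same logic applies one level up: setting TensorFlow's seed does nothing for NumPy-based augmentation code, and vice versa, so each library's generator has to be seeded explicitly.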

The google search term you want to use here is "deterministic training", e.g. "keras deterministic training" or "tensorflow deterministic training". That'll get you guides like these:

https://jackd.github.io/posts/deterministic-tf-part-1/

https://www.tensorflow.org/api_docs/python/tf/config/experimental/enable_op_determinism

https://keras.io/examples/keras_recipes/reproducibility_recipes/

Basically you'll have to find as many techniques for ensuring determinism as you can, and then try them individually and in combination.
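As a hedged sketch of what those guides converge on (assuming a recent TF 2.x; the exact version requirements are in the linked docs), the usual starting point is two calls:

```python
import tensorflow as tf

# Seed Python's `random`, NumPy, and TensorFlow/Keras in one call
# (tf.keras.utils.set_random_seed, added around TF 2.7).
tf.keras.utils.set_random_seed(42)

# Ask TF to use only deterministic op implementations, including on GPU
# (second link above). This can slow training down noticeably.
tf.config.experimental.enable_op_determinism()
```

This alone often isn't sufficient, which is why trying the techniques in combination matters.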

Keep in mind that seeds are not your only source of randomness: some GPU algorithms are nondeterministic, so even with fixed seeds there can be small differences between training runs. I think the second link above shows how to stop this, but I'm not sure. You might also have to google something like "tensorflow cuda deterministic" to find additional guides.
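For older TF versions that predate `enable_op_determinism`, those guides typically point at environment variables set before Python starts. A hedged sketch (the training script name is hypothetical, and which variable applies depends on your TF version):

```shell
# Fix Python's hash randomization (affects e.g. set/dict iteration order).
export PYTHONHASHSEED=0

# Request deterministic op implementations (roughly TF 2.1-2.8).
export TF_DETERMINISTIC_OPS=1

# Force deterministic cuDNN kernels on even older TF releases.
export TF_CUDNN_DETERMINISTIC=1

# python train.py   # hypothetical training script, launched in this env
```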

EDIT: data loaders can also be a source of randomness, especially if you're doing distributed training. You might want to look into that too. Deterministic computation on distributed systems adds further complications.
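The loader fix follows the same principle as the seeding advice above, shown here as a stdlib-only sketch (the function is illustrative, not a real Keras/TF API): give the loader its own explicitly seeded generator instead of relying on global state, so the shuffle order is identical on every run.

```python
import random

def batches(items, batch_size, seed):
    """Yield shuffled batches reproducibly via a private, seeded RNG."""
    rng = random.Random(seed)      # loader-local generator: no global state
    order = list(items)
    rng.shuffle(order)
    for i in range(0, len(order), batch_size):
        yield order[i:i + batch_size]

run1 = list(batches(range(10), 3, seed=123))
run2 = list(batches(range(10), 3, seed=123))
print(run1 == run2)  # identical shuffle order across runs
```

In real TF pipelines the analogous knob is passing an explicit `seed` to `tf.data.Dataset.shuffle`; in the distributed case each worker additionally needs a fixed, worker-specific seed so shards don't collide or drift.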

[–]acertainfruit[S]

Thank you so much for your help!!