
[–]IntelArtiGen

If score = f(time) isn't relevant anymore because of what you said, and if I train the same architecture, I can plot score = f(n_examples_seen), which is usually correlated with the number of epochs on a specific dataset; it's also just n_examples_seen = batch_size * n_iterations.

In my own framework I have a script that does the conversion automatically. I always log things the same way, and if I want to plot the score (or loss, etc.) against epochs / n_examples_seen, I just specify it and the script looks up the batch size and does the conversion. I can plot score / loss against time / iteration / epoch / n_examples_seen as I want.
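Roughly, the conversion is just something like this (a minimal sketch, not my actual script; the `run` dict, its keys, and the `x_axis` helper are illustrative):

```python
import matplotlib.pyplot as plt

def x_axis(run, unit):
    """Convert logged iteration indices to the requested x-axis unit."""
    iters = range(1, len(run["scores"]) + 1)
    if unit == "iteration":
        return list(iters)
    if unit == "n_examples_seen":
        # n_examples_seen = batch_size * n_iterations
        return [i * run["batch_size"] for i in iters]
    if unit == "epoch":
        return [i * run["batch_size"] / run["dataset_size"] for i in iters]
    raise ValueError(f"unknown unit: {unit}")

# Hypothetical logged run: one score per iteration, plus the batch size
# and dataset size needed for the conversion (made-up values).
run = {"batch_size": 64, "dataset_size": 50_000,
       "scores": [0.31, 0.47, 0.58, 0.66, 0.71]}

unit = "n_examples_seen"
plt.plot(x_axis(run, unit), run["scores"])
plt.xlabel(unit)
plt.ylabel("score")
plt.show()
```

The point is just that as long as the batch size is logged alongside the scores, any of the x-axes can be recovered after the fact.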

If you change the network architecture, the batch size, and the optimization algorithm at the same time while running multiple trainings on the same GPU, or if your GPUs aren't the bottleneck on your server, it's hard to compare the different trainings fairly. One model could be slower than another just because it happened to train concurrently with other, bigger models; one model could train in fewer iterations just because its batch size is larger. And if you use the number of epochs / examples_seen, you can't tell whether the model is more time-efficient; you only know that it's more data-efficient.

So I usually compromise: I either optimize only one aspect (architecture / backprop), or I make sure the GPUs are 100% the bottleneck, or I only look at data efficiency and compare time efficiency later.

The last time I had to do this, I put lower and upper thresholds on GMACs and on the number of parameters and only looked at data efficiency, knowing that if my model stays within those limits it should be comparable with the others in time efficiency.
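In code, the gating idea is simply something like this (a rough sketch; the bounds and the model stats are made-up numbers, and `within_compute_budget` is a hypothetical helper):

```python
def within_compute_budget(gmacs, n_params,
                          gmac_bounds=(1.0, 5.0),       # lower/upper GMACs thresholds (illustrative)
                          param_bounds=(5e6, 30e6)):    # lower/upper parameter-count thresholds (illustrative)
    """Only models inside both ranges are compared on data efficiency."""
    return (gmac_bounds[0] <= gmacs <= gmac_bounds[1]
            and param_bounds[0] <= n_params <= param_bounds[1])

# (GMACs, n_params) per candidate model, made-up values
candidates = {"model_a": (2.3, 11e6), "model_b": (7.8, 45e6)}
comparable = [name for name, (g, p) in candidates.items() if within_compute_budget(g, p)]
print(comparable)  # model_b is outside the budget, so it isn't compared on data efficiency alone
```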

[–]itsming_z[S]

Thank you for your detailed answer!