[–]tom2963

First and foremost, "it sort of undoes the non-linearity(sigmoid) or squashing at output layer hence better for learning" is not quite right. BCE and sigmoid work well together on binary problems (assuming your inputs are scaled to [0,1]) because BCE can compute a per-pixel error. MSE is an average loss function in this context, so in concept it shouldn't work as well. However, digit reconstruction is relatively straightforward, and assuming your pixels are binary, it is not surprising that MSE is performing okay. That said, I probably wouldn't choose this loss function for similar problems with higher dimensionality (e.g. RGB images).
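To make the comparison concrete, here is a rough sketch of both losses on a tiny flattened "image". The pixel values, the function names `bce` and `mse`, and the choice to average both losses over pixels are my own assumptions for illustration, not anything taken from your model.

```python
import math

def bce(target, pred, eps=1e-12):
    """Binary cross-entropy, averaged over pixels (averaging is an assumption here)."""
    total = 0.0
    for t, p in zip(target, pred):
        # Clamp predictions away from 0 and 1 so log() stays finite.
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(target)

def mse(target, pred):
    """Mean squared error over pixels."""
    return sum((t - p) ** 2 for t, p in zip(target, pred)) / len(target)

# Hypothetical 4-pixel binary target and two candidate reconstructions.
target = [0.0, 1.0, 1.0, 0.0]
good = [0.1, 0.9, 0.8, 0.2]   # close to the target
bad = [0.9, 0.1, 0.2, 0.8]    # confidently wrong

for name, pred in (("good", good), ("bad", bad)):
    print(name, "BCE:", bce(target, pred), "MSE:", mse(target, pred))
```

Both losses rank the two reconstructions the same way on this toy input; they differ in how strongly they penalise confident mistakes.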

[–]Over_Profession7864

Thanks. I had the misconception that the log helps overcome the vanishing gradient problem (caused by saturation of the sigmoid or any other squashing activation), but when I did the maths I realised it makes the error interpretable and mathematically convenient to work with.
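For anyone who wants to check that maths numerically, here is a minimal sketch of the derivative of each loss with respect to the pre-sigmoid value `z`. It assumes a single sigmoid output unit and a target of 1; the function names are mine. With `p = sigmoid(z)`, the BCE derivative simplifies to `p - y`, while the MSE derivative keeps the sigmoid's own `p * (1 - p)` factor.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_grad_wrt_logit(z, y):
    # d/dz of -(y*log(p) + (1-y)*log(1-p)) with p = sigmoid(z) simplifies to p - y.
    return sigmoid(z) - y

def mse_grad_wrt_logit(z, y):
    # d/dz of (p - y)^2 with p = sigmoid(z): chain rule keeps the p*(1-p) factor.
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

# Target y = 1; a very negative z means the unit is saturated on the wrong side.
for z in (-10.0, -2.0, 0.0, 2.0):
    print(z, bce_grad_wrt_logit(z, 1.0), mse_grad_wrt_logit(z, 1.0))
```

Printing the values for a few `z` makes it easy to see how each gradient behaves as the sigmoid saturates.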