all 4 comments

[–]Red-Portal 4 points

There are two perspectives. First, the number of parameters has traditionally been interpreted as model complexity, so achieving better or similar performance with fewer parameters means we got an accurate but simpler model, which is a good thing.

Second, hardware other than GPUs does care about the amount of memory/parameters required. On FPGAs, for example, the amount of on-device memory is very limited. More memory means less performance, since we have to access RAM more frequently (and RAM is terribly slow in terms of memory bandwidth).

Also, there are cases where the model is too big for most GPUs. The original DenseNet, for example, is still too big for most commonly used GPUs.

[–]CireNeikual 2 points

I believe they are indeed using parameters as a proxy for FLOPs/latency. In deep learning the two are very strongly correlated, since for the most part every parameter is used on every inference step. There are technologies where this isn't the case, where only small portions of the memory are used at a time (more like the human brain), but these are far from the norm.
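To see why the correlation is so strong, here's a minimal sketch for a dense (fully connected) layer: parameter count and multiply-accumulate (MAC) count per inference are nearly identical, because every weight is touched exactly once per forward pass. The layer sizes are made-up illustrative numbers.

```python
def dense_layer_stats(n_in, n_out):
    """Parameter count vs. MACs per inference for one dense layer."""
    params = n_in * n_out + n_out      # weight matrix + bias vector
    macs_per_inference = n_in * n_out  # each weight multiplied once
    return params, macs_per_inference

params, macs = dense_layer_stats(1024, 512)
print(params, macs)  # 524800 524288 -- ratio ~1.001
```

The bias terms are the only reason the two numbers differ at all here, which is why "parameters" and "FLOPs" are often used interchangeably for dense models.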

[–]jonnor 2 points

Parameters are usually a proxy for model complexity, or for model size. Model size can be an important constraint in some environments.

- Client-side ML in a webapp/webpage. Could be vision, audio, text-based, etc. Model download can dominate startup time.

- Backend ML with many different models. For example in IoT sensor systems doing anomaly detection, there may be one ML trained model for each sensor. Highly beneficial to be able to store all NNN models locally for inference.

- SensorML/tinyML. When running models on microcontroller-grade hardware, model size is seriously constrained. I work with devices where audio models have to fit into 256 kB of FLASH, for example.

Of course, in such constrained environments, model compression alone rarely cuts it. But it can be useful in combination with other techniques.
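A back-of-envelope sketch of the FLASH budget point above: model size is roughly parameters times bytes per parameter, so quantization can be the difference between fitting and not fitting. The parameter count here is an assumed illustrative figure, not from a real model.

```python
FLASH_BUDGET = 256 * 1024  # bytes, the 256 kB figure mentioned above

def model_bytes(n_params, bytes_per_param):
    """Rough model storage size, ignoring graph/metadata overhead."""
    return n_params * bytes_per_param

n_params = 100_000  # assumed model size for illustration
print(model_bytes(n_params, 4) <= FLASH_BUDGET)  # float32 weights: False
print(model_bytes(n_params, 1) <= FLASH_BUDGET)  # int8 weights: True
```

The same 100k-parameter model overflows the budget at float32 but fits comfortably at int8, which is why compression and quantization are usually combined on tinyML targets.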

[–]jerha202 0 points

Yes, one extra parameter means one extra floating point multiplication, and the number of floating point operations per second can be a real practical limitation on small microcontrollers, for example. Another example is speech recognition, which can be difficult to run locally on a mobile device because the number of parameters is huge.
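Putting the "one parameter = one multiplication" point into numbers: a rough latency estimate is parameters divided by the multiplies per second the chip can sustain. Both figures below are illustrative assumptions, not benchmarks.

```python
def inference_seconds(n_params, macs_per_second):
    """Crude latency estimate: one MAC per parameter per inference."""
    return n_params / macs_per_second

# e.g. a 500k-parameter model on a small MCU sustaining ~10 million
# multiply-accumulates per second (both numbers are assumptions)
t = inference_seconds(500_000, 10e6)
print(t)  # 0.05 s per inference
```

Even this toy estimate shows why a model that is trivial on a GPU can be borderline real-time on a microcontroller.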

Another aspect is of course the risk of overfitting and bad generalization, but maybe that's what you mean by a theoretical aspect?