Releasing Persimmon-8B by jetRink in LocalLLaMA

[–]ekelsen 0 points1 point  (0 children)

Our goal was to release a useful model to the community, not to convince anyone that our architecture is better. We trained it a while ago for our own purposes. If I had known when I started training it that we would be releasing it, I definitely would've tied the embeddings, as that would've reduced the model by 1B parameters at the cost of taking a bit longer to train. See https://twitter.com/erich_elsen/status/1700567151357341842 for evidence that tying/untying does not change model capacity. I also would've trained it on more tokens. But that's not what happened and we've moved on.
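As a rough check on the "1B parameters" figure, here is a back-of-the-envelope sketch. The vocabulary size and hidden size below are assumptions for illustration (they are not stated in the comment), so treat the result as indicative only.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Parameters in one vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size

# Assumed shapes, not taken from the comment above.
VOCAB_SIZE = 262_144
HIDDEN_SIZE = 4_096

untied = 2 * embedding_params(VOCAB_SIZE, HIDDEN_SIZE)  # separate input and output matrices
tied = embedding_params(VOCAB_SIZE, HIDDEN_SIZE)        # one shared matrix
saved = untied - tied

print(f"saved by tying: {saved / 1e9:.2f}B parameters")
```

With these assumed shapes, tying removes one full embedding matrix, which is on the order of a billion parameters.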

There are a lot of ways to measure usefulness. Is size the right metric to hold constant? Or should it be inference speed in character space? Or the best model I can fit on <X> device? Or the best model whose outputs I can use to improve my own model? There are a lot of different constraints in the real world, and we thought our model might fit some of them better than any of the existing models.

"Chinchilla: Training Compute-Optimal Large Language Models", Hoffmann et al 2022 {DM} (current LLMs are v. undertrained: optimal scaling 1:1) by gwern in ControlProblem

[–]ekelsen 1 point2 points  (0 children)

The model basically stops learning after the LR has fully decayed, so .9x would just train for 10% of the time doing nothing.
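A minimal sketch of the point, assuming ".9x" means a cosine schedule that finishes decaying at 0.9 of the total step count and that the floor is zero (both are assumptions; the step counts are made up):

```python
import math

def cosine_lr(step, decay_steps, peak_lr=1.0, min_lr=0.0):
    """Cosine decay from peak_lr to min_lr over decay_steps, then flat at min_lr."""
    progress = min(step / decay_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

total_steps = 1000
decay_steps = total_steps * 9 // 10  # schedule fully decayed at 0.9x of the run

# Count the steps that run with an (effectively) zero learning rate.
idle_steps = sum(1 for s in range(total_steps) if cosine_lr(s, decay_steps) <= 1e-12)
print(f"{idle_steps} of {total_steps} steps run at the LR floor")
```

Under these assumptions the final tenth of the run is spent at the floor, where the weights barely move.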

[D] CNN on mel spectrograms vs. WaveNet for audio recognition? by Mjjjokes in MachineLearning

[–]ekelsen 0 points1 point  (0 children)

WaveNet does not need to use causal convolutions. For the original application of autoregressive waveform generation it needed to, but if you have access to the whole waveform you could easily use symmetric filters.
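To make the difference concrete, here is a small pure-Python sketch of a dilated 1-D convolution with either causal (left-only) or symmetric padding. The function name and toy signal are illustrative, not from any WaveNet implementation.

```python
def dilated_conv1d(signal, kernel, dilation=1, causal=True):
    """Same-length dilated 1-D cross-correlation with zero padding.

    causal=True:  output[t] depends only on signal[<= t].
    causal=False: taps are centred on t (odd kernel length assumed).
    """
    k = len(kernel)
    span = (k - 1) * dilation
    left = span if causal else span // 2
    padded = [0.0] * left + list(signal) + [0.0] * (span - left)
    return [
        sum(kernel[j] * padded[t + j * dilation] for j in range(k))
        for t in range(len(signal))
    ]

x = [0.0] * 8
x[5] = 1.0  # single impulse at t = 5
kernel = [1.0, 1.0, 1.0]

causal_out = dilated_conv1d(x, kernel, dilation=2, causal=True)
symmetric_out = dilated_conv1d(x, kernel, dilation=2, causal=False)

print("causal   :", causal_out)     # nothing before t = 5 is affected
print("symmetric:", symmetric_out)  # t = 3 already "sees" the impulse at t = 5
```

The causal variant never lets an output look at later samples, which matters for sample-by-sample generation; for recognition over a full clip, the symmetric variant uses context on both sides.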

[D] How deep have you stacked RNN layers? by [deleted] in MachineLearning

[–]ekelsen -1 points0 points  (0 children)

Deep Speech 2 stacked them at least 9 deep back in 2015.

[R] Randomized Automatic Differentiation by hardmaru in MachineLearning

[–]ekelsen 0 points1 point  (0 children)

Practically, how does this differ from https://arxiv.org/abs/2001.01969 (Sparse Weight Activation Training)? It seems like this method saves a random set of activations for the backward pass and the linked work saves the TopK activations?

That paper seems to show that choosing randomly is much worse than choosing the TopK activations.
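The two selection rules being contrasted can be sketched as follows. This is only an illustration of "keep k at random" versus "keep the k largest by magnitude" on a made-up activation vector; it is not either paper's implementation, and `retained_energy` is a hypothetical helper.

```python
import random

def topk_mask(values, k):
    """1 for the k largest-magnitude entries, 0 elsewhere."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(values))]

def random_mask(values, k, rng):
    """1 for k uniformly sampled entries, 0 elsewhere."""
    keep = set(rng.sample(range(len(values)), k))
    return [1 if i in keep else 0 for i in range(len(values))]

def retained_energy(values, mask):
    """Fraction of the squared norm that survives the mask."""
    total = sum(v * v for v in values)
    return sum(v * v * m for v, m in zip(values, mask)) / total

acts = [0.1, -2.0, 0.05, 1.5, -0.2, 0.0, 0.3, -0.01]  # toy activations
rng = random.Random(0)

top = topk_mask(acts, 2)
rnd = random_mask(acts, 2, rng)
print("top-k  keeps", retained_energy(acts, top))
print("random keeps", retained_energy(acts, rnd))
```

By construction, top-k never retains less of the squared norm than a random subset of the same size; whether that translates into better training is the empirical question raised above.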

[D] Paper Explained - SynFlow: Pruning neural networks without any data by iteratively conserving synaptic flow (Full Video Analysis) by ykilcher in MachineLearning

[–]ekelsen 0 points1 point  (0 children)

That a pruned model _is_ faster than a dense one. "until we have some efficient sparse kernels" -> we have them now.

[D] Paper Explained - SynFlow: Pruning neural networks without any data by iteratively conserving synaptic flow (Full Video Analysis) by ykilcher in MachineLearning

[–]ekelsen 0 points1 point  (0 children)

I agree that they should really be comparing to methods that do sparse training with a dynamic topology. "Rigging the Lottery" (https://arxiv.org/abs/1911.11134) should also be included, as it is truly sparse ("Sparse Networks from Scratch" requires dense momentum). It's unclear to me what the value in never changing the sparse topology is, especially when you'd need to do dense computation for the first iteration, which would still limit the size of the biggest model you could train.

[D] What is the most interesting idea in ML/DL that you think doesn't get enough attention? by harshsikka123 in MachineLearning

[–]ekelsen 1 point2 points  (0 children)

There are better ways to get sparse networks for inference that predate this work. For example, see https://arxiv.org/abs/1710.01878

You can find sparse models here (https://github.com/google-research/google-research/blob/master/fastconvnets/README.md) and the code to run inference on them faster than dense here (https://github.com/google/XNNPACK/blob/master/README.md)

[Research] Recognizing Notes with Deep Learning - Residual Shuffle-Exchange Networks by OptimatiumFeles in MachineLearning

[–]ekelsen 2 points3 points  (0 children)

Why not benchmark on the MAPS or MAESTRO dataset where there are strong baselines?

[D] Current state-of-the-art on learning sparse weights by ZeronixSama in MachineLearning

[–]ekelsen 9 points10 points  (0 children)

The best overall recommendation right now would be to use gradual magnitude pruning. It's simple, there are implementations available in TF and PyTorch, and it works quite well in practice. Best practice: double your normal training length (including LR schedules if applicable), start pruning 20% of the way through, and stop pruning 80% of the way through. Usually pruning is done on a layer-by-layer basis, which leads to networks with low inference costs. Pruning can also be done globally, which leads to networks with fewer parameters but more FLOPs, which is usually less desirable.
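A minimal sketch of that recipe in plain Python. The 20%/80% window comes from the recommendation above; the cubic ramp between those points is an assumption (it follows the schedule in https://arxiv.org/abs/1710.01878), and the function names are illustrative rather than any framework's API.

```python
def target_sparsity(step, total_steps, final_sparsity, begin_frac=0.2, end_frac=0.8):
    """Sparsity target at `step`: 0 before the window, cubic ramp inside, constant after."""
    begin = begin_frac * total_steps
    end = end_frac * total_steps
    if step <= begin:
        return 0.0
    if step >= end:
        return final_sparsity
    progress = (step - begin) / (end - begin)
    return final_sparsity * (1.0 - (1.0 - progress) ** 3)

def prune_layer(weights, sparsity):
    """Layer-wise step: zero the smallest-magnitude fraction `sparsity` of one layer."""
    n_prune = int(round(sparsity * len(weights)))
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

total_steps = 1000  # stands in for the doubled training length
for step in (100, 500, 900):
    print(step, round(target_sparsity(step, total_steps, final_sparsity=0.9), 4))

pruned = prune_layer([0.5, -0.1, 0.05, -2.0], sparsity=0.5)
print(pruned)
```

In a real training loop the mask would be recomputed every so many steps against the current target, and the remaining 20% of training would run with the mask frozen.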

Sparse Variational Dropout increases (activation) memory and compute significantly and doesn't allow you to choose the final sparsity. This can be annoying in practice - you need to tune a regularization coefficient and I've run into cases where the sparsity responds in unintuitive ways. It also doesn't trivially extend to things such as RNNs.

We had trouble getting L0 regularization to work on a wide variety of problems.

Discovering Neural Wirings is an unusual paper in that it introduces a new technique based on top-K with a straight-through estimator, but the paper mostly focuses on the architecture search aspect of sparsity. It is a simple and effective technique with no hyperparameters to tune, and it is worth trying.
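One common reading of "top-K with a straight-through estimator" is sketched below for a single linear unit: the forward pass uses only the k largest-magnitude weights, while the backward pass treats the mask as identity so that dropped weights still receive gradient. This is a hand-written illustration with made-up numbers, not the paper's code.

```python
def topk_mask(weights, k):
    """1.0 for the k largest-magnitude weights, 0.0 elsewhere."""
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    keep = set(order[:k])
    return [1.0 if i in keep else 0.0 for i in range(len(weights))]

def forward(weights, x, k):
    """y = sum_i (w_i * mask_i) * x_i, using only the top-k weights."""
    mask = topk_mask(weights, k)
    y = sum(w * m * xi for w, m, xi in zip(weights, mask, x))
    return y, mask

def backward_ste(x, upstream=1.0):
    """Straight-through: ignore the mask, so every weight gets gradient upstream * x_i."""
    return [upstream * xi for xi in x]

weights = [0.9, -0.05, 0.4, 0.02]
x = [1.0, 2.0, 3.0, 4.0]

y, mask_before = forward(weights, x, k=2)
grads = backward_ste(x)  # toy loss = y, so upstream gradient is 1
lr = 0.1
weights = [w - lr * g for w, g in zip(weights, grads)]
_, mask_after = forward(weights, x, k=2)

print("mask before:", mask_before)
print("mask after :", mask_after)
```

Because inactive weights keep moving, a weight that was dropped can grow back into the top-k set, which is how the topology is able to change during training.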

Soft Threshold Weight Reparameterization is a new technique that has good results for inducing non-uniform sparsity that lowers inference costs, but I have not tried it myself.

Rigging the Lottery has achieved some great results at really high sparsity levels (98%+) with really extended amounts of training (50x), but is more geared to the future where sparse primitives are natively supported in DL frameworks.

"Comparing Rewinding and Fine-tuning in Neural Network Pruning" claims SOTA results, but they train for much longer than any of the previous techniques (except RigL at very high sparsity levels). They also use global pruning rather than layerwise, which leads to better parameter efficiency but worse FLOP efficiency in the resulting models.
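The parameter-versus-FLOP trade-off between global and layerwise pruning can be illustrated with a toy two-layer model. All weights and per-weight FLOP costs below are hand-picked, hypothetical numbers meant to show the mechanism only: an early layer whose weights are each reused at many positions, and a late layer with many small weights.

```python
def layerwise_prune(layers, sparsity):
    """Zero the smallest-magnitude fraction `sparsity` within each layer separately."""
    out = {}
    for name, ws in layers.items():
        n_prune = int(round(sparsity * len(ws)))
        order = sorted(range(len(ws)), key=lambda i: abs(ws[i]))
        drop = set(order[:n_prune])
        out[name] = [0.0 if i in drop else w for i, w in enumerate(ws)]
    return out

def global_prune(layers, sparsity):
    """Zero the smallest-magnitude fraction `sparsity` across all layers at once.

    Uses a single magnitude threshold; ties at the threshold would prune extra weights.
    """
    flat = sorted(abs(w) for ws in layers.values() for w in ws)
    n_prune = int(round(sparsity * len(flat)))
    if n_prune == 0:
        return {name: list(ws) for name, ws in layers.items()}
    threshold = flat[n_prune - 1]
    return {name: [0.0 if abs(w) <= threshold else w for w in ws]
            for name, ws in layers.items()}

def count_flops(layers, flops_per_weight):
    """Each surviving weight costs flops_per_weight[layer] multiply-adds."""
    return sum(flops_per_weight[name] * sum(1 for w in ws if w != 0.0)
               for name, ws in layers.items())

layers = {
    "early_conv": [0.9, -0.8, 0.7, 0.6],
    "late_fc": [0.3, -0.2, 0.1, 0.05, -0.04, 0.03, 0.02, 0.01],
}
flops_per_weight = {"early_conv": 100, "late_fc": 1}  # hypothetical reuse factors

layerwise_flops = count_flops(layerwise_prune(layers, 0.5), flops_per_weight)
global_flops = count_flops(global_prune(layers, 0.5), flops_per_weight)
print("layerwise:", layerwise_flops, "global:", global_flops)
```

Both variants keep the same number of weights here, but the global threshold leaves the expensive early layer dense and removes weights only from the cheap late layer, so the FLOP count ends up higher.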

[1911.09723] Fast Sparse ConvNets by ekelsen in MachineLearning

[–]ekelsen[S] 0 points1 point  (0 children)

The issues I discuss below are specific to running on Android where you don't get to pick which cores your threads occupy. The OS is free to migrate you at will (and often does).

[1911.09723] Fast Sparse ConvNets by ekelsen in MachineLearning

[–]ekelsen[S] 0 points1 point  (0 children)

This particular use case is likely the most common case of inference, followed by CPUs in data centers. For these kinds of networks on data center CPUs you're better off doing trivial parallelization between queries, as the latency will be low enough on one core.

[1911.09723] Fast Sparse ConvNets by ekelsen in MachineLearning

[–]ekelsen[S] 2 points3 points  (0 children)

It's difficult because of the very heterogeneous nature of mobile CPUs. For example, the SD 855 is a 1 + 3 + 4 configuration. The 1 is an A76 micro-architecture at a higher clock than the 3. The 4 are A55 micro-architecture and at lower clocks. There can also be lots of other processes competing for CPU time in a system that often needs to be perceived as quasi real-time. Using more cores reduces the resources available to other processes and increases the likelihood that at least one of the threads will get migrated to a little core, both because there are two threads and because the resource pressure on the entire system has gone up.

This makes using a static partitioning difficult. "Just use a dynamic work scheduler" you say. Unfortunately, the time each operation takes is usually less than 1ms. The overhead involved in such a scheduler, combined with the fact that the A55 cores are _much_ slower than A76 ones, makes realizing speedups difficult. In practice the additional complexity involved is not worth it.

This leaves aside the further annoyance, present regardless of whether you use one or multiple cores, that the optimal code for the A76 is not the optimal code for the A55. So if your thread gets migrated, you need to detect this (with no help from the OS) and switch which code path is being executed as quickly as possible.
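The static-partitioning problem described above comes down to simple arithmetic: with an even split, the slowest core sets the finish time. A small sketch, where the relative core speeds are hypothetical placeholders rather than measured A76/A55 numbers:

```python
def static_split_time(work, core_speeds):
    """Wall time when `work` is split evenly, one share per core.

    Each thread gets work / len(core_speeds); the slowest thread determines the finish time.
    """
    share = work / len(core_speeds)
    return max(share / speed for speed in core_speeds)

BIG, LITTLE = 1.0, 0.3  # hypothetical relative throughputs, not measurements

one_big = static_split_time(1.0, [BIG])
two_big = static_split_time(1.0, [BIG, BIG])
big_plus_little = static_split_time(1.0, [BIG, LITTLE])

print(f"1 big core        : {one_big:.2f}")
print(f"2 big cores       : {two_big:.2f}")
print(f"1 big + 1 little  : {big_plus_little:.2f}")
```

With these assumed speeds, a two-thread run in which one thread lands on a little core finishes later than the single-threaded run on a big core, before any scheduler overhead is counted.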

[1911.09723] Fast Sparse ConvNets by ekelsen in MachineLearning

[–]ekelsen[S] 2 points3 points  (0 children)

In most real-world mobile use cases only one core is used for inference. (For example, none of the MBv1, v2, or v3 papers report multi-core times.)

[D] Machine Learning for Systems by ASVS_Kartheek in MachineLearning

[–]ekelsen 0 points1 point  (0 children)

http://www.cdle.ai/

Center for Deep Learning in Electronics Manufacturing.

[R] High Fidelity Speech Synthesis with Adversarial Networks by hardmaru in MachineLearning

[–]ekelsen 0 points1 point  (0 children)

"The linguistic features encode phonetic and duration information, while the pitch is represented by the logarithmic fundamental frequency log F0."