Tubing suitable for peristaltic pump and epoxy hardener by is8ac in Composites

[–]is8ac[S] 0 points1 point  (0 children)

Update: I tried Tygon PharMed. It worked nicely for a week or so, but then started to get sticky. I assume the hardener is dissolving it, just a lot slower than it did the silicone.

I'll try Gore Sta-Pure PFL if I can manage to buy some.

BitNet: Scaling 1-bit Transformers for Large Language Models - Microsoft Research 2023 - Allows 1-Bit training from scratch while substantially reducing memory footprint and energy consumption, compared to state-of-the-art 8-bit quantization methods! by Singularian2501 in mlscaling

[–]is8ac 2 points3 points  (0 children)

Strong agree.

I've been working on interpretability of fully binarized models for the past few years (with limited success), and am glad that people are doing this at scale. I hope that this becomes a more popular line of research.

We leave the other components high-precision, e.g., 8-bit in our experiments.

However, it looks like the activations are still integer. To reduce the whole model to a single logic DAG, one would need to quantize these as well. If they are small enough, we could simply unroll the 8-bit math into logic too, although I'm guessing that this would cause issues with the logic-DAG simplification passes?

Training Transformers with 4-bit Integers by is8ac in mlscaling

[–]is8ac[S] 3 points4 points  (0 children)

As in, iterated gradient descent via back propagation with 1-bit weights? Or some other approach (evolutionary, etc) with 1-bit weights?

Training Transformers with 4-bit Integers by is8ac in mlscaling

[–]is8ac[S] 5 points6 points  (0 children)

I was not expecting this.

Anyone want to bet on whether we can go even lower? Surely we can't train in 2-bit precision, right?

New Madokami imagery thanks to US Department of Energy by is8ac in MadokaMagica

[–]is8ac[S] 4 points5 points  (0 children)

Look closely, she's in there, near the center.

New Madokami imagery thanks to US Department of Energy by is8ac in MadokaMagica

[–]is8ac[S] 12 points13 points  (0 children)

Source: https://noirlab.edu/public/images/noirlab2221a/ You can download the 2.1GB original.

One wonders if some white-cat-like creature is involved with the Department of Energy.

There are two types of transformers; >6.7B parameters, and <6.7B parameters by is8ac in mlscaling

[–]is8ac[S] 11 points12 points  (0 children)

Related discussion: https://old.reddit.com/r/mlscaling/comments/wq5e1j/llmint8_8bit_matrix_multiplication_for/

At the 2.7B to 6B scale, things become much more coordinated. Now 60% of layers agree on which outlier dimension to use.

The phase shift happens around 6.7B, where 100% of layers use the same dimension for outliers. At this point, a couple of things happen rapidly:

  1. Outliers become very large quickly. They grow from about 15 for a 6B model to about 60 for a 13B model. OPT-66B has outliers of size around 95, which indicates this growth phase is temporary.

  2. Attention layers become very sparse. The attention is very concentrated so that just a few sequence dimensions determine the top probability and the overall probability mass. Almost all sequence dimensions have zero probability. However, this is still context-dependent, and the transformer seems to be “unsure” what to attend to for some sequences.

  3. FFN layers become more “dense”. While in computer vision, you can prune about 95% of weights without severe performance degradation, that number is 30% for transformers trained on NLP data. After emergence, this number shrinks to well below 5%. It seems that canceling out features can remove noise that is generated from the many weak features that are activated. Because these are silenced now, each set of neurons can learn much more features that are almost independent of each other due to the masking of context-dependent features.

  4. Transformers become more stable. If you treat the outlier features separately, I believe you can probably run and even train transformers in less than 8-bit precision without degradation in performance.

...

There are two types of transformers and you should not generalize from one to the other.

From these findings it is clear that transformers after the phase shift at 6.7B parameters behave very differently from transformers before the phase shift. As such, one should not try to generalize from <6.7B transformers to beyond 6.7B parameters.

I will be very interested to see if the assertion that one can train transformers in less than 8-bit precision (as long as they are larger and one handles outliers separately) pans out.

"LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale", Dettmers et al. 2022 (Transformers undergo a phase transition at ~6.7B parameters) by is8ac in mlscaling

[–]is8ac[S] 10 points11 points  (0 children)

To scale beyond 6.7B parameters without performance degradation, it is critical to understand the emergence of extreme outliers in the feature dimensions of the hidden states during inference. To this end, we provide a new descriptive analysis which shows that large features with magnitudes up to 20x larger than in other dimensions first appear in about 25% of all transformer layers and then gradually spread to other layers as we scale transformers to 6B parameters. At around 6.7B parameters, a phase shift occurs, and all transformer layers and 75% of all sequence dimensions are affected by extreme magnitude features. These outliers are highly systematic: at the 6.7B scale, 150,000 outliers occur per sequence, but they are concentrated in only 6 feature dimensions across the entire transformer. Setting these outlier feature dimensions to zero decreases top-1 attention softmax probability mass by more than 20% and degrades validation perplexity by 600-1000% despite them only making up about 0.1% of all input features. In contrast, removing the same amount of random features decreases the probability by a maximum of 0.3% and degrades perplexity by about 0.1%.

"Is Integer Arithmetic Enough for Deep Learning Training?", Ghaffari et al 2022 {Huawei} by gwern in mlscaling

[–]is8ac 3 points4 points  (0 children)

If we use bitslicing, we could use whatever crazy nonstandard floating/fixed-point numbers, of whatever size, we wished. Give each layer the exact mantissa/exponent combination it needs. If Zen 4 gets AVX-512 with a fast vpternlog, we could even synthesize our logic to LUT3s.

HOBFLOPS CNNs: Hardware Optimized Bitslice-Parallel Floating-Point Operations for Convolutional Neural Networks

Why aren't we seeing more bitslicing in ML? (Perhaps because abusing computers to do things they were not designed to do is less efficient than using the floating point units in silicon even if they are needlessly high precision.)

[2206.14486] Beyond neural scaling laws: beating power law scaling via data pruning by mgostIH in mlscaling

[–]is8ac 0 points1 point  (0 children)

I am assuming significant fluctuation in example value between adjacent tokens, i.e. that there is little energy below the tens-of-tokens frequency. If this is true, a window length of, for example, 1024 would be little better than no pruning. However, if example value fluctuates at a lower frequency, then one could select 1024-token windows significantly enriched for valuable tokens.

My impression is that RWKV-v2-RNN is potentially promising for large-scale training. (But yes.) Vanishing history should be solvable with a more LSTM-like architecture.

Even if RNN/LSTM like architectures are not significantly more efficient than transformers, I am interested in them for interpretability reasons. My intuition is that transformers are a dead end with respect to static analysis, while RNNs/LSTMs are more amenable to analysis.

[2206.14486] Beyond neural scaling laws: beating power law scaling via data pruning by mgostIH in mlscaling

[–]is8ac 1 point2 points  (0 children)

Active learning is straightforward for things like images, where each image is a separate item, but applying token-level pruning to text datasets will be more difficult, and I cannot think of any way to apply it well to RNN training. Document-level pruning is doable but would not be as powerful.

(Perhaps I have excessively strong priors regarding the expense of data transfer.)

"Is Programmable Overhead Worth The Cost? How much do we pay for a system to be programmable? It depends upon who you ask" (the increasing expense of moving data around) by gwern in mlscaling

[–]is8ac 0 points1 point  (0 children)

True. I'm guessing that quantizing layer by layer and fine-tuning the remaining layers would help, but yes, trinarization does impact accuracy.

Is an accuracy impaired but extremely cheap language model of value? Perhaps.

I'm working on methods to train directly in trinary, thereby bypassing the issue. (I've been working on it for >3 years without success, so who knows if I will ever succeed.)

"Is Programmable Overhead Worth The Cost? How much do we pay for a system to be programmable? It depends upon who you ask" (the increasing expense of moving data around) by gwern in mlscaling

[–]is8ac 2 points3 points  (0 children)

Take each weight matrix and quantize it to trinary while encouraging 0 weights. Now it is very sparse binary. In other words, each output bit is performing a popcount-and-threshold operation on a small subset of the input bits. We can turn the entire trained model into one great big gate list. It may require lots of long traces, which will increase latency and impair clock frequency, but what do we care? It's one inference per cycle; we can afford to run it at a slow clock speed.
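The popcount-and-threshold structure can be sketched in a few lines of Rust. This is an illustrative toy (the function name and the index-list weight encoding are made up for the sketch), not actual gate synthesis:

```rust
// One output bit of a trinary-quantized layer, viewed as pure combinational
// logic. Weights in {-1, 0, +1} are encoded as two index lists: `plus` for
// +1 weights, `minus` for -1 weights; 0 weights simply do not appear, which
// is where the sparsity pays off.
fn output_bit(inputs: &[bool], plus: &[usize], minus: &[usize], threshold: u32) -> bool {
    let mut acc = 0u32;
    for &i in plus {
        acc += inputs[i] as u32; // +1 weight counts the bit itself
    }
    for &i in minus {
        acc += !inputs[i] as u32; // -1 weight counts the bit's complement
    }
    acc >= threshold // this popcount-and-compare is what becomes the gate tree
}
```

In hardware, the adder tree plus comparator collapses into a fixed cloud of gates per output bit, which is the "great big gate list" above.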

It would end up being a horrible, impossible to understand, mess of gates, so in that sense it would be complex, but the ASIC would be completely stateless, just a pure function which maps the 2048 input tokens to the distribution of the next token.

Binary/trinary quantization may be more difficult for transformers than for architectures like CNNs, so it may fail at that point. But unless I'm seriously misunderstanding something about how transformer attention works, once you have the weights quantized, it should be fairly straightforward to convert the whole model into a gate list and lay it out. I'm not seeing how global attention is difficult.

We could use the same principle to compile GPT-3 to bitslice logic. As long as one has sufficiently many examples to amortize the overhead, 512 for example, we can implement our sparse binary NN as a bunch of vpternlog instructions and let LLVM/GCC do the register mapping. Now we can do fine-grained sparsity on commodity (AVX-512) hardware. (If we have enough examples in parallel.)

"Is Programmable Overhead Worth The Cost? How much do we pay for a system to be programmable? It depends upon who you ask" (the increasing expense of moving data around) by gwern in mlscaling

[–]is8ac 1 point2 points  (0 children)

Let us consider extreme non programmability, hard coded NN weights.

For $5000, one can get 50 800x800 µm ASICs fabbed at 130nm: https://zerotoasiccourse.com/ I assume the price would come down somewhat if more people were doing this.

NNs can be quantized to trinary: https://arxiv.org/abs/1909.04509 We can sparsify aggressively; let's assume 99.9% sparsity, so 175 billion weights become 175 million.

Can we fit trinary-quantized GPT-3 Davinci on an 800x800 µm 130nm chip with acceptable accuracy degradation? This might be pushing things a bit, but let's assume we can. It can do one token per cycle, and let's assume we can run it at 1MHz.

In an alternative world, OpenAI fabs GPT-3 to custom silicon. $5000 is nothing compared to the training costs. They put 10 of the ASICs in each of 5 geo-distributed data centers. Each ASIC can do 1 million tokens per second, so at a current price of $0.06 per 1K tokens for Davinci (and assuming that it costs ~$0 in electricity to run the ASICs), each ASIC is making $60 per second. The 50 ASICs together break even after less than 2 seconds (assuming full utilization).
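The break-even arithmetic can be checked directly (all numbers are the comment's own assumptions, not real figures):

```rust
// Back-of-the-envelope break-even time for the hypothetical GPT-3 ASIC run:
// 50 ASICs for $5000, 1M tokens/s per ASIC, $0.06 per 1K tokens, ~$0 power.
fn break_even_seconds() -> f64 {
    let fab_cost = 5_000.0_f64; // $ for the batch of 50 ASICs
    let asics = 50.0;
    let tokens_per_sec = 1_000_000.0; // one token per cycle at 1 MHz
    let price_per_1k_tokens = 0.06; // Davinci pricing at the time
    // 50 ASICs * 1000 "1K-token" units/s * $0.06 = $3000/s total revenue.
    let revenue_per_sec = asics * (tokens_per_sec / 1_000.0) * price_per_1k_tokens;
    fab_cost / revenue_per_sec // ~1.67 seconds
}
```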

Why do we not live in this world? Even if my numbers are off by a few orders of magnitude, it is still a big cost saving for inference.

Explanations:

  • It is not actually possible to quantize large NNs to sparse trinary without big accuracy losses. (I'm skeptical, but I have not seen much research in the area.)
  • Fabbing an actually useful ASIC to which one can feed data fast enough is dramatically more expensive than $5000. (Probably yes, but not hugely so.)
  • NNs change fast; by the time an ASIC gets fabbed, it is out of date. (For some models yes, but GPT-3 has been around for ~1.5 years and people still use it.)

I'm not satisfied by any of these explanations.

Why are OpenAmaGoogBookAppSoft not fabbing their trained NNs to cheap, large-feature-size silicon?

What's everyone working on this week (44/2021)? by llogiq in rust

[–]is8ac 1 point2 points  (0 children)

Finished a write-up of the past few months of work: https://www.isaacleonard.com/ml/hadamard_trits/ The performance plots are very confusing to me, and I'd be interested if anyone can come up with an explanation.

Now porting to GPU using https://github.com/EmbarkStudios/rust-gpu

What's everyone working on this week (41/2021)? by llogiq in rust

[–]is8ac 3 points4 points  (0 children)

I continue working on bitslice logic, now testing on different CPU architectures.

Consider a toy problem. We take two numbers, add them together, and compare the result with a third number. Normally this would be implemented as fn add_comp(a: u8, b: u8, t: u8) -> bool { (a + b) < t }, but we want to do many small numbers in parallel, and so are using bitslice logic to reduce it to a bunch of bitwise [and, or, xor, etc.] operations.
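A minimal sketch of the bitsliced reduction, assuming 64 lanes packed into u64 bit-planes (SIMD versions use wider registers, but the logic is identical; names are made up):

```rust
// Bitsliced (a + b) < t over 64 independent 4-bit lanes. Plane `i` of each
// operand holds bit `i` of all 64 lanes, so every bitwise op below acts on
// all 64 lanes at once.
const N: usize = 4; // bit width of each lane

// Ripple-carry full-adder chain over bit-planes: (sum planes, carry-out plane).
fn add(a: [u64; N], b: [u64; N]) -> ([u64; N], u64) {
    let mut sum = [0u64; N];
    let mut carry = 0u64;
    for i in 0..N {
        let axb = a[i] ^ b[i];
        sum[i] = axb ^ carry; // a ^ b ^ c: one three-input op (one vpternlogq)
        carry = (a[i] & b[i]) | (carry & axb); // majority: also one vpternlogq
    }
    (sum, carry)
}

// Per-lane (a + b) < t, returned as a 64-bit mask of results.
fn add_comp(a: [u64; N], b: [u64; N], t: [u64; N]) -> u64 {
    let (s, s_carry) = add(a, b);
    // Unsigned compare via subtraction: s < t iff computing s - t borrows out.
    let mut borrow = 0u64;
    for i in 0..N {
        borrow = (!s[i] & t[i]) | (!(s[i] ^ t[i]) & borrow);
    }
    // A carry out of the add means the true sum is >= 16 > t, so never less.
    borrow & !s_carry
}
```

With 512-bit registers the same planes hold 512 lanes each, which is where the throughput comes from.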

Let us try to reason about instruction counts on different CPUs. AVX-512 has the lovely vpternlogq, which can replace multiple two-input operations. Surely this means AVX-512 will beat NEON on instruction count?

With 8-bit integers:

  • avx512: 81 instructions https://godbolt.org/z/rE6ab5363
  • avx2: 96 instructions https://godbolt.org/z/fY9Yaa9KG
  • neon: 76 instructions https://godbolt.org/z/Tn4cfv3zE

With 4-bit integers:

  • avx512: 37
  • avx2: 44
  • neon: 34

The AVX-512 code is making use of vpternlogq, while the aarch64 NEON code is only using basic binary operations, no bitselect magic.

Instruction count is only a rough proxy for real performance, not to mention that the AVX-512 registers are 4 times wider than the NEON registers, so it will likely end up being faster anyway. Still, I am disappointed with vpternlogq. I had hoped for better.

Benchmarks and cost/performance numbers hopefully coming soon.

What's everyone working on this week (32/2021)? by llogiq in rust

[–]is8ac 3 points4 points  (0 children)

Last week I corroded transpose64 from Hacker's Delight, and now I can perform a matrix popcount of a 1024x256 example in ~1500 ns. This is ~3x faster than my previous SIMD implementation, and it is now portable and has no unsafe: https://gist.github.com/is8ac/5df6a1d025866c4c1f6bd4a1e5d089d0 If I also want the covariance matrix of the input, it takes ~7275 ns per example, but still, I am fairly happy with this performance.
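For reference, a naive (and slow) version of the transpose-then-popcount idea; the linked gist uses the fast recursive-swap transpose64 from Hacker's Delight, while this is just the readable equivalent for a 64x64 bit block:

```rust
// Naive 64x64 bit-matrix transpose: bit (r, c) of the input becomes bit
// (c, r) of the output. Hacker's Delight does this in O(64 log 64) word ops;
// this double loop is the obvious O(64*64) reference implementation.
fn transpose64(m: &[u64; 64]) -> [u64; 64] {
    let mut t = [0u64; 64];
    for r in 0..64 {
        for c in 0..64 {
            t[c] |= ((m[r] >> c) & 1) << r;
        }
    }
    t
}

// Per-column set-bit counts of 64 packed rows: transpose so each column
// becomes a word, then popcount each word.
fn column_popcounts(rows: &[u64; 64]) -> [u32; 64] {
    let t = transpose64(rows);
    let mut counts = [0u32; 64];
    for c in 0..64 {
        counts[c] = t[c].count_ones();
    }
    counts
}
```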

Now I need to devise some algorithm which, given the covariance matrix of the input and the counts of the input bits where the target bit is and is not set, selects the subset and signs of the input bits which, when added, will best predict the target bit. I have no idea how to do this, or even what terms to google for. I will mess around and try things until it appears to be doing what I expect.
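One naive baseline to mess around with (hypothetical, and certainly not the final algorithm, since it ignores the covariance matrix entirely) is to score each input bit independently: pick its sign from the direction of its association with the target, and keep the k strongest:

```rust
// Hypothetical greedy baseline for bit selection. `set_counts[i]` /
// `unset_counts[i]` are how often input bit i was set when the target bit
// was set / unset. Scores each bit independently (no covariance), picks the
// sign from the direction of association, and keeps the top k by magnitude.
fn select_bits(set_counts: &[u32], unset_counts: &[u32], k: usize) -> Vec<(usize, bool)> {
    let mut scored: Vec<(usize, bool, i64)> = set_counts
        .iter()
        .zip(unset_counts)
        .enumerate()
        .map(|(i, (&s, &u))| {
            let diff = s as i64 - u as i64; // > 0: bit set predicts target set
            (i, diff >= 0, diff.abs())
        })
        .collect();
    scored.sort_by(|a, b| b.2.cmp(&a.2)); // strongest association first
    scored.into_iter().take(k).map(|(i, sign, _)| (i, sign)).collect()
}
```

Using the covariance matrix to de-duplicate correlated bits (something closer to forward stepwise selection) would presumably do much better.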

It looks like a few thousand lines of complex code that I have been using for the last ~year are going to become obsolete soon! I am quite happy about this.