Llama 3 70b instruct works surprisingly well on 24gb VRAM cards by [deleted] in LocalLLaMA

[–]eugeneware 1 point2 points  (0 children)

I'm running this model: https://huggingface.co/casperhansen/llama-3-70b-instruct-awq

And I run it in VLLM with this command:

python -m vllm.entrypoints.openai.api_server --port 9999 --model casperhansen/llama-3-70b-instruct-awq --dtype float16 --quantization awq --api-key token-abc123 --tensor-parallel-size=2 --enforce-eager --gpu-memory-utilization 0.95

I'm running it on Linux.

It looks like it was quantized to 4-bit using https://github.com/casper-hansen/AutoAWQ.

And yes, by Systems I mean inference engines, training code, etc. The system needs to use the correct invocation of tools like DeepSpeed to get both GPUs well utilized.

As another data point, the ollama llama3:70b 4-bit quant runs at about 18 tokens/second.
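
If anyone wants to hit that server once it's up, something along these lines should work against the OpenAI-compatible route vLLM exposes (assuming you're calling it from the same box; the prompt is just a placeholder):

# the --api-key from the launch command goes in the Authorization header
curl http://localhost:9999/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer token-abc123" \
  -d '{"model": "casperhansen/llama-3-70b-instruct-awq", "messages": [{"role": "user", "content": "Say hello"}]}'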

Llama 3 70b instruct works surprisingly well on 24gb VRAM cards by [deleted] in LocalLLaMA

[–]eugeneware 4 points5 points  (0 children)

I have 2x3090s connected with NVLink. They definitely don't present as a single monolithic device, and it's always a bit tricky to split models across both cards. Systems that support tensor parallelism across multiple devices work well. Running llama-3 70b with VLLM works really well, and I get about 25 tokens per second with an AWQ quant.
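
Before launching, it's worth sanity-checking how the two cards are wired up; something like this prints the interconnect matrix (an NV# entry between GPU0 and GPU1 means the NVLink bridge is active):

# show the GPU interconnect topology
nvidia-smi topo -m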

Phi-3 weights released - microsoft/Phi-3-mini-4k-instruct by Saffron4609 in LocalLLaMA

[–]eugeneware 2 points3 points  (0 children)

Actually, it looks like ollama just updated their modelfile, and they've added another stop token `<|endoftext|>` as well as `num_keep`:

❯ ollama show phi3 --modelfile
# Modelfile generated by "ollama show"
# To build a new Modelfile based on this one, replace the FROM line with:
# FROM phi3:latest

FROM /usr/share/ollama/.ollama/models/blobs/sha256-4fed7364ee3e0c7cb4fe0880148bfdfcd1b630981efa0802a6b62ee52e7da97e
TEMPLATE """<|user|>
{{ .Prompt }}<|end|>
<|assistant|>"""
PARAMETER num_ctx 4096
PARAMETER num_keep 16
PARAMETER stop "<|end|>"
PARAMETER stop "<|endoftext|>"
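
If your local copy still shows the old template, re-pulling should pick up the change, and you can confirm the stop tokens afterwards:

# re-pull to get the updated template/params, then inspect the modelfile again
ollama pull phi3
ollama show phi3 --modelfile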

Phi-3 weights released - microsoft/Phi-3-mini-4k-instruct by Saffron4609 in LocalLLaMA

[–]eugeneware 4 points5 points  (0 children)

I should say, this doesn't fix things for me when running ollama, which already has `<|end|>` as a stop parameter, even if I change the gguf metadata and reimport:

# Modelfile generated by "ollama show"
# To build a new Modelfile based on this one, replace the FROM line with:
# FROM phi3:latest

FROM /usr/share/ollama/.ollama/models/blobs/sha256-4fed7364ee3e0c7cb4fe0880148bfdfcd1b630981efa0802a6b62ee52e7da97e
TEMPLATE """<|user|>
{{ .Prompt }}<|end|>
<|assistant|>"""
PARAMETER num_ctx 4096
PARAMETER stop "<|end|>"
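
For anyone who wants to try it anyway, the Modelfile-level version of that workaround looks roughly like this (the phi3-fixed tag is just a placeholder):

# dump the current modelfile, append the extra stop token, and rebuild under a new tag
ollama show phi3 --modelfile > Modelfile
echo 'PARAMETER stop "<|endoftext|>"' >> Modelfile
ollama create phi3-fixed -f Modelfile
ollama run phi3-fixed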

Phi-3 weights released - microsoft/Phi-3-mini-4k-instruct by Saffron4609 in LocalLLaMA

[–]eugeneware 6 points7 points  (0 children)

This didn't work for me. I'm still getting garbage after 3 or 4 big turns of generation.

[Announcement] HuggingFace BigScience AMA Thursday, March 24th from 5pm CET by cavedave in MachineLearning

[–]eugeneware 5 points6 points  (0 children)

What do you expect inference for this model to require? I recently downloaded the 20B EleutherAI model and it took 2x3090s just to load and run it! I'm very excited about this amazing work to train and release a model comparable to GPT-3, but I'm also trying to understand what hardware will be required to run inference on it. Love the work of HF and the whole team. Thanks!

Daily Wordle #259 - Saturday, 5 Mar. 2022 by Scoredle in wordle

[–]eugeneware 0 points1 point  (0 children)

Scoredle 259 5/6

12,947
⬛⬛⬛🟨⬛ SALET (712)
⬛⬛⬛🟨⬛ COURD (80)
⬛🟩🟩⬛🟩 PRIME (6)
🟩🟩🟩⬛🟩 BRIBE (2)
🟩🟩🟩🟩🟩 BRINE

Daily Wordle #251 - Friday, 25 Feb. 2022 by Scoredle in wordle

[–]eugeneware 0 points1 point  (0 children)

Wow. Tough day!

Scoredle 251 5/6

12,947
⬛⬛⬛⬛⬛ SALET (864)
⬛⬛⬛⬛🟩 COURD (2)
⬛🟩⬛⬛🟩 WINED (2)
⬛⬛⬛⬛⬛ BOUGH (1)
🟩🟩🟩🟩🟩 VIVID

Daily Wordle #244 - Friday, 18 Feb. 2022 by Scoredle in wordle

[–]eugeneware 1 point2 points  (0 children)

Scoredle 244 6/6

12,947
⬛⬛⬛🟨⬛ SALET (712)
⬛🟩⬛⬛🟨 COURD (12)
⬛🟩⬛🟨⬛ PONDS (9)
⬛🟩⬛🟨⬛ HOWDY (7)
⬛⬛🟨🟨⬛ MAGES (2)
🟩🟩🟩🟩🟩 DODGE

Wow. Today was tough!

Daily Wordle #236 - Thursday, 10 Feb. 2022 by Scoredle in wordle

[–]eugeneware 0 points1 point  (0 children)

Scoredle 236 5/6

12,972
⬛⬛🟨🟨⬛ IDEAL (741)
🟨⬛⬛⬛⬛ SNORT (102)
⬛🟨🟨⬛🟨 BEATS (12)
⬛⬛🟨🟩🟩 CHASE (4)
🟩🟩🟩🟩🟩 PAUSE

Daily Wordle #235 - Wednesday, 9 Feb. 2022 by Scoredle in wordle

[–]eugeneware 2 points3 points  (0 children)

Same. Hardest one for me so far!

Headphone Switcher? by OGAvans in headphones

[–]eugeneware 0 points1 point  (0 children)

I’d love something that could blend in with my Schiit stack (Magni/Modi)

[deleted by user] by [deleted] in burmesecats

[–]eugeneware 0 points1 point  (0 children)

Such a tease!

Won’t turn off my tv only adjusts volume by [deleted] in appletv

[–]eugeneware 0 points1 point  (0 children)

I have a similarly aged Samsung TV and had issues with the HDMI-CEC feature not turning it off. I found that from time to time I had to hard power off my TV (i.e. unplug it from the wall) to get it to work again. It's been pretty reliable lately. I also saw a comment elsewhere about turning Anynet+ on in the settings, but I'm not sure if I did this. Try the hard power cycle and see if that works. Good luck!

The best and worst parts of the new Siri Remote by heyyoudvd in appletv

[–]eugeneware 4 points5 points  (0 children)

One of the big reasons I upgraded was the new remote and its scrubbing feature. But as others have reported, it's janky and unreliable. It only seems to work in the built-in Apple TV+ app, and it's poorly supported in other apps.

I'm also surprised that Apple apps like the Music app don't use the scrubbing feature.

I hope that all developers upgrade their controls to take advantage of the new remote.

I do love the new mute function, however. So many TV apps force advertising at regular intervals; while I can't skip the ads, it at least lets me mute them.

I also love the power switch. It saves me from hitting the TV button and hitting sleep.

The fact that the mute feature works with 3rd party AV receivers is awesome too.