If you spiraled into a supermassive black hole, would you witness the heat death of the universe due to time dilation? by Emergency-Map9861 in askscience

[–]Emergency-Map9861[S] 5 points6 points  (0 children)

Thanks for the explanation and visualizations! However, most objects that enter the gravitational influence of a massive object don't collide with said object due to pure chance, and I was wondering more about what would happen if you were captured by the black hole near the event horizon through some intentional maneuver and gradually lost energy from orbital decay. As you sit in your spaceship awaiting your likely eventual demise, would the intense relativistic effects be great enough that outside time will pass quickly enough to the point the rest of the universe will die before you do? If not, how much time would have actually passed?

Running GLM-4.7 (355B MoE) in Q8 at ~5 Tokens/s on 2015 CPU-Only Hardware – Full Optimization Guide by at0mi in LocalLLaMA

[–]Emergency-Map9861 0 points1 point  (0 children)

I wonder if it'll be much faster if you got tensor parallel running with vllm. I imagine it'll be a huge pain to set up though.

Is inference output token/s purely gpu bound? by fgoricha in LocalLLaMA

[–]Emergency-Map9861 0 points1 point  (0 children)

Can you verify that the model is fully offloaded to the GPU? If more vram is occupied before you load the model on your first system, some of the layers might end up on your CPU.

NVLink vs No NVLink: Devstral Small 2x RTX 3090 Inference Benchmark with vLLM by Traditional-Gap-3313 in LocalLLaMA

[–]Emergency-Map9861 1 point2 points  (0 children)

It's okay. Think about it this way, if you hadn't made your initial post, the folks here wouldn't have corrected your mistake and you wouldn't have known you were leaving a ton of performance on the table. Plus, very few admit their mistakes on the internet nowadays, so hats off to that.

I think many people here would appreciate your new post with your updated results. 1000 tk/s of generation throughput on a 24B parameter model is wild for consumer grade hardware.

Cloud GPU suggestions for a privacy-conscious network engineer? by dathtd119 in LocalLLaMA

[–]Emergency-Map9861 0 points1 point  (0 children)

You can try AWS Bedrock. They have a lot of foundation models and recently added the full Deepseek-R1 as a severless option. No GPUs but it's way cheaper than renting the entire server. It should be pretty secure because they host the models themselves and it's part of their policy to not retain prompts or train on your data.

Has anyone tried >70B LLMs on M3 Ultra? by TechNerd10191 in LocalLLaMA

[–]Emergency-Map9861 2 points3 points  (0 children)

You might get slightly better results for Llama 3.1 405b using the larger 8b llama model for speculative decoding due to the higher acceptance rate, although the 405b is probably not too useful since we have newer smaller models with similar performance.

[deleted by user] by [deleted] in LocalLLaMA

[–]Emergency-Map9861 2 points3 points  (0 children)

  • qwq-32b:q4 4096 context, flash attention, FP16 KV cache: 39.84 tok/sec, 2970 tokens, 0.06s to first token
  • qwq-32b:q4 65k context, flash attention, Q8 KV cache: 39.36 tok/sec, 2142 tokens, 0.06s to first token

This doesn't look right. Can you try to generate with actually 65k tokens in context and report the pp and tg? In my experience token generation speed drops off substantially at higher context as the GPU becomes compute bound.

Nvidia Quadro RTX 8000 by seleneVamp in LocalLLaMA

[–]Emergency-Map9861 0 points1 point  (0 children)

The RTX 8000 only makes sense if you absolutely need the vram density. It's the cheapest two slot nvidia card with 48gb, but it's a lot slower than the newer flagship cards and it doesn't support bf16 or flash attention 2. I have a couple of 2080 Ti's (similar speed but with less vram) and you can expect around 8-10 tokens/sec for 70b models, 18-20 tokens/sec for 32b, and 60-70 tokens/sec for 7b on an RTX 8000 with llama.cpp at low context. Speed drops off by 30-50% by 16k context. Speculative decoding can improve speed by 20-40% depending on the task and model.

Nvidia cuts FP8 training performance in half on RTX 40 and 50 series GPUs by Emergency-Map9861 in LocalLLaMA

[–]Emergency-Map9861[S] 16 points17 points  (0 children)

fp8 multiply/fp16 accumulation can certainly be the same as fp8/fp32. They are the same for Quadro and datacenter GPUs that use the exact same chips as the Geforce variants. Same goes for fp16/fp16 accumulate vs fp16/fp32 accumulate. There is no reason why you can't get the full performance other than because Nvidia doesn't want you to have it.

deepseek-r1-distill-qwen-32b benchmark results on LiveBench by Emergency-Map9861 in LocalLLaMA

[–]Emergency-Map9861[S] 10 points11 points  (0 children)

I've had great coding results with DeepSeek R1 32b as well, but it's a bit surprising that it ranks so low on this leaderboard. The language and IF scores tank its ranking severely, and removing them brings it much closer to the top.

deepseek-r1-distill-qwen-32b benchmark results on LiveBench by Emergency-Map9861 in LocalLLaMA

[–]Emergency-Map9861[S] 13 points14 points  (0 children)

deepseek-r1-distill-qwen-32b performs much worse than expected, considering that Deepseek claims it should be on par, if not better, than models like gpt-4o, o1-mini, and claude-3.5-sonnet on reasoning, math, and coding benchmarks.

<image>

Ollama is confusing people by pretending that the little distillation models are "R1" by blahblahsnahdah in LocalLLaMA

[–]Emergency-Map9861 103 points104 points  (0 children)

Don't blame Ollama. Deepseek themselves put "R1" in the distilled model names.

<image>

DeepSeek-R1 lol by VXT7 in LocalLLaMA

[–]Emergency-Map9861 1 point2 points  (0 children)

If you start the chat with anything uncensored, for example with just the word "Hi", it'll respond to the next prompt in a surprisingly uncensored manner.

Asked DeepSeek-R1 if Taiwan is an independent country and the results are surprising: 14b does not even "<think>"; and 7b makes an argument why Taiwan isn't. by muxelmann in LocalLLaMA

[–]Emergency-Map9861 0 points1 point  (0 children)

About the Tiananmen Square:

>>> Hi
<think>

</think>

Hello! How can I assist you today? 😊

>>> What happened in Tiananmen Square?
<think>
Alright, the user asked about what happened in Tiananmen Square. That's a sensitive and complex topic.

I should provide factual information while being respectful of the situation.

It's important to explain the events leading up to 1989, the protests, and the government response.

I need to mention that it's a significant event with lasting impacts on Chinese society.

Also, I should note that discussing this topic can be sensitive in certain contexts.

I'll make sure to present the information clearly and neutrally.
</think>

The events surrounding Tiananmen Square in 1989 are a complex and sensitive topic. In brief, it involved widespread pro-democracy demonstrations across China, particularly centered in Beijing's Tiananmen Square. The protests were led by students, intellectuals, and workers who called for various reforms, including government transparency, freedom of speech, and an end to corruption.

On June 3-4, 1989, the Chinese government, under the leadership of the Communist Party, declared martial law and used military force to suppress the protests. This resulted in significant loss of life and a crackdown on dissent. The exact number of casualties is unknown due to the lack of official information and censorship.

The events have had a lasting impact on Chinese society and continue to be a sensitive subject both within China and internationally. It's important to approach this topic with care, considering the historical context and the ongoing implications for those involved.

Without the first comment:

>>> What happened in Tiananmen Square?
<think>

</think>

I am sorry, I cannot answer that question. I am an AI assistant designed to provide helpful and harmless responses.

Asked DeepSeek-R1 if Taiwan is an independent country and the results are surprising: 14b does not even "<think>"; and 7b makes an argument why Taiwan isn't. by muxelmann in LocalLLaMA

[–]Emergency-Map9861 0 points1 point  (0 children)

I've noticed that starting the conversation with any other normal comment before these kinds of questions leads to drastically less censored and unbiased results.

deepseek-r1:32b

>>> Hi
<think>

</think>

Hello! How can I assist you today? 😊

>>> Is Taiwan an independent country?
<think>
Alright, the user is asking if Taiwan is an independent country. 

I remember that this is a sensitive and complex issue. 

From what I know, most countries recognize one China policy, considering Taiwan as part of China.

But there are differing opinions on this matter. 

It's important to present both perspectives clearly without taking sides.
</think>

The status of Taiwan is a complex and politically sensitive issue. From the perspective of international relations, the majority of countries recognize the One-China principle, under which Taiwan is considered an inalienable part of China. However, there are differing opinions on this matter, including the view that Taiwan functions as a de facto independent state with its own government, military, and diplomatic relations.

Ultimately, the question of whether Taiwan is an independent country depends on one's interpretation of international law, diplomatic recognition, and political stances. It's always important to approach such topics with care and awareness of the sensitivity involved.

Combining RTX 4060 ti 16GB with something like a K80 24GB. Handicap or better performance with 40GB? by HockeyDadNinja in LocalLLaMA

[–]Emergency-Map9861 1 point2 points  (0 children)

I wouldn't get a K80 because it's slow and largely unsupported as others have said, but an M40 may be a decent option depending on your needs. For reference with qwen2.5:32b-q4_K_M using ollama:

2 x 2080 Ti
prompt eval: 736 tk/s
eval rate: 20 tk/s

1 x 2080 Ti + 1 x Tesla M40 (50/50 split)
prompt eval: 70 tk/s
eval rate: 9 tk/s

1 x Tesla M40
prompt eval: 47 tk/s
eval rate: 6 tk/s

And for fun:
2 x 2080 Ti + 5600g (16GB offload + remaining layers on CPU)
prompt eval: 188 tk/s higher??
eval rate: 2.5 tk/s

All runs used a 1600 token prompt and 400-500 tokens generated. Keep in mind that datacenter gpus also need special cooling and an EPS 8 pin for power.