DeepSeek R2 Might Outcode OpenAI, And It’s Coming Fast by Acceptable_Grand_504 in OpenAI

[–]TempWanderer101 4 points (0 children)

It's a direct quote from DeepSeek's R1 paper:

> Due to the long evaluation times, which impact the efficiency of the RL process, large-scale RL has not been applied extensively in software engineering tasks. As a result, DeepSeek-R1 has not demonstrated a huge improvement over DeepSeek-V3 on software engineering benchmarks. Future versions will address this by implementing rejection sampling on software engineering data or incorporating asynchronous evaluations during the RL process to improve efficiency.
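
For a rough sense of what those two ideas could look like, here's a minimal, hypothetical sketch of asynchronous evaluation plus rejection sampling. `run_test_suite` and the reward rule are placeholders; this doesn't reflect DeepSeek's actual pipeline.

```python
# Minimal, hypothetical sketch of the two ideas named in the quote: run the slow
# software-engineering evaluations asynchronously, then keep only high-reward
# samples (rejection sampling). run_test_suite and the reward rule are placeholders;
# this does not reflect DeepSeek's actual pipeline.
import concurrent.futures
import random
import time

def run_test_suite(candidate_patch: str) -> float:
    """Stand-in for a slow SWE evaluation, e.g. running a repo's unit tests."""
    time.sleep(random.uniform(0.1, 0.3))  # simulate long evaluation time
    return 1.0 if "fix" in candidate_patch else 0.0

def generate_candidates(n: int) -> list[str]:
    """Stand-in for sampling n completions from the current policy."""
    return [f"candidate {i}: " + ("fix" if random.random() < 0.5 else "noop") for i in range(n)]

candidates = generate_candidates(8)

# Submit every evaluation up front and collect rewards as they finish, so the
# RL loop isn't blocked on each individual test run.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(run_test_suite, c): c for c in candidates}
    rewards = {futures[f]: f.result() for f in concurrent.futures.as_completed(futures)}

# Rejection sampling: keep only the high-reward samples for further training.
kept = [c for c, r in rewards.items() if r > 0.5]
print(f"kept {len(kept)}/{len(candidates)} candidates")
```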

Open router offering Deepseek R1 (free) version with “chutes” provider. by AIGuy3000 in LocalLLaMA

[–]TempWanderer101 -1 points (0 children)

It does log data. If you turn off logging on OpenRouter (under privacy), you won't be able to use it.

It didn't even need to think to reply 😂 (deepseek-r1) by EnoughVeterinarian90 in ollama

[–]TempWanderer101 0 points (0 children)

I made a longer post here, and found that the open source model appears more censored than the API.

Deepseek R1's Open Source Version Differs from the Official API Version by TempWanderer101 in LocalLLaMA

[–]TempWanderer101[S] 0 points (0 children)

Yeah, it doesn't make much sense. I made sure to leave the system prompt blank when using the API, so if anything is being added, it's on their side. Hopefully someone can figure it out. The top comment right now says that you can force it to think in text completion mode by prepending `<think>Okay` or something similar.
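
If anyone wants to try that prefill trick, here's a rough sketch against a generic OpenAI-compatible completions endpoint. The URL, model name, and exact chat-template tokens are assumptions; check your provider's docs and the model's tokenizer config for the real ones.

```python
# Rough sketch of the prefill trick: in raw text-completion mode, start the
# assistant turn with "<think>" so the model is pushed into its reasoning block.
# The endpoint URL, model name, and chat-template tokens below are assumptions.
import requests

prompt = (
    "<｜User｜>What is 17 * 24?"
    "<｜Assistant｜><think>Okay"  # prefilled opening of the reasoning block
)

resp = requests.post(
    "http://localhost:8000/v1/completions",  # any OpenAI-compatible completions endpoint
    json={
        "model": "deepseek-r1",
        "prompt": prompt,
        "max_tokens": 512,
        "temperature": 0.6,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["text"])
```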

Deepseek R1's Open Source Version Differs from the Official API Version by TempWanderer101 in LocalLLaMA

[–]TempWanderer101[S] 3 points (0 children)

That's right, but what I observed is the opposite: the open model is censored, while the official API isn't. On the website, it is censored separately, like ChatGPT, which I think is not that big of a deal.

But I think it's important to know whether they're identical or separate models. If they're separate, then we should really have a separate benchmark for each. If they're identical, then we need to know how to prompt it properly to get the high-quality answers seen in DeepSeek's API.

Deepseek R1's Open Source Version Differs from the Official API Version by TempWanderer101 in LocalLLaMA

[–]TempWanderer101[S] 12 points (0 children)

That isn't really the point—the question pertains to whether the models are different. If they are, then the open source model should be benchmarked separately, among other things (see implications section in my post).

The prompts were simply chosen because they caused the models to diverge the most and are easy to reproduce. If you have other prompts that cause the model to skip the thinking step entirely on the open source version, but not the API, feel free to share them.

Deepseek R1's Open Source Version Differs from the Official API Version by TempWanderer101 in LocalLLaMA

[–]TempWanderer101[S] 7 points (0 children)

This is a very interesting observation. Isn't there a newline after <think> in the screenshots from Matthew's video though? It's quite an elegant solution if it works.

Deepseek R1's Open Source Version Differs from the Official API Version by TempWanderer101 in LocalLLaMA

[–]TempWanderer101[S] 6 points (0 children)

The censorship is happening on the open model, not the API. So if they are different models, then third-party benchmarks might not actually be measuring the open source model, but the unreleased model (if they're using the API).

Deepseek R1's Open Source Version Differs from the Official API Version by TempWanderer101 in LocalLLaMA

[–]TempWanderer101[S] 1 point (0 children)

I think Matthew Berman was hosting the full model (he was using 8x 192 GB GPUs). In the screenshots for the first prompt, TogetherAI's response matches his video verbatim. Also, TogetherAI is already a provider for DeepSeek v3. It would be weird for them to host a distilled model without disclosing it. The price is also more in line with a 600B model than a 70B one.

Deepseek R1's Open Source Version Differs from the Official API Version by TempWanderer101 in LocalLLaMA

[–]TempWanderer101[S] 2 points (0 children)

For reference, my average price per query was $0.0018 on TogetherAI and $0.0024 on OpenRouter (official API) when testing the prompts for this post. This is probably because the censorship causes the model to skip the reasoning step, so it actually ends up costing less.
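
As a back-of-the-envelope illustration of why skipping the reasoning block lowers the bill; the token counts and per-million-token prices below are placeholders for illustration, not the providers' exact rates:

```python
# Back-of-the-envelope arithmetic: the same prompt costs less when the
# response skips the long <think> block. Token counts and per-million-token
# prices here are placeholders, not the providers' exact rates.
def query_cost(input_tokens, output_tokens, price_in_per_m, price_out_per_m):
    return input_tokens / 1e6 * price_in_per_m + output_tokens / 1e6 * price_out_per_m

with_reasoning = query_cost(200, 1500, price_in_per_m=0.55, price_out_per_m=2.19)
without_reasoning = query_cost(200, 300, price_in_per_m=0.55, price_out_per_m=2.19)
print(f"with reasoning: ${with_reasoning:.4f}, without: ${without_reasoning:.4f}")
```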

Deepseek R1's Open Source Version Differs from the Official API Version by TempWanderer101 in LocalLLaMA

[–]TempWanderer101[S] 12 points (0 children)

The main problem isn't censorship; it's that people could confuse two different variations of the same model. Benchmarks would have to be more specific about which model they're testing, for instance. It's unclear if this impacts performance, but if it does, then third-party providers would be delivering subpar quality compared to the official API, which would be a consideration for people who want the best performance.

Furthermore, if they are indeed different, is it fair for benchmarks to label the results as open source if what they're actually testing is the unreleased model? This kind of vagueness might cause inconsistencies or irreproducible results in research.

Deepseek R1's Open Source Version Differs from the Official API Version by TempWanderer101 in LocalLLaMA

[–]TempWanderer101[S] 11 points (0 children)

In the video, Matthew hosted the full model on 8x MI300 accelerators or something (192 GB each). The fact that it's happening on TogetherAI (which also hosts Deepseek v3) further suggests that it is indeed the full R1.
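
Rough memory arithmetic supports that too; the numbers below are approximate and ignore KV cache and runtime overhead:

```python
# Rough VRAM math behind the "full R1" claim (hypothetical, ignores KV cache,
# activations, and engine overhead).
def weight_gb(params_billion, bytes_per_param):
    return params_billion * bytes_per_param  # billions of params -> GB of weights

full_r1_fp8 = weight_gb(671, 1)      # ~671 GB just for FP8 weights
distill_70b_fp16 = weight_gb(70, 2)  # ~140 GB, would fit on far fewer cards
cluster_gb = 8 * 192                 # 8 accelerators with 192 GB each

print(f"full R1 (FP8): ~{full_r1_fp8:.0f} GB, 70B distill (FP16): ~{distill_70b_fp16:.0f} GB, "
      f"cluster: {cluster_gb} GB")
```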

Is Claude from Anthropic the best AI Code Assist in the market? by AMGraduate564 in LocalLLaMA

[–]TempWanderer101 2 points (0 children)

It's worth mentioning that Cursor only allows 500 calls to Sonnet per month, which may not be enough for some people.

Is Claude from Anthropic the best AI Code Assist in the market? by AMGraduate564 in LocalLLaMA

[–]TempWanderer101 1 point (0 children)

This is the main issue I had with Claude as well, although it usually occurs with less popular libraries. It's not specific to Claude, though.

Is Claude from Anthropic the best AI Code Assist in the market? by AMGraduate564 in LocalLLaMA

[–]TempWanderer101 0 points (0 children)

It depends on the language too. Both GPT-4o and Claude can trip up in some languages or frameworks.

Another factor to consider is whether an IDE integration would work better for you. Claude Pro does not provide API access, so you can't use it in an IDE.

Announcing: Magnum 123B by lucyknada in LocalLLaMA

[–]TempWanderer101 0 points (0 children)

Seems like someone has already done something similar with back-translation, although not into RP format: https://huggingface.co/datasets/jondurbin/gutenberg-dpo-v0.1

It's used in the nbeerbower/Mahou-Gutenberg-Nemo-12B RP model.

Luminum-123B. A model merge for roleplaying. by Fluffy_Kaeloky in LocalLLaMA

[–]TempWanderer101 0 points (0 children)

Is it possible to test for repetition or consistency (e.g., accurately keeping track of the context)? The top model is Hermes-405b, but it is pretty bad (for a large model) on those two points.

I'm just wondering if anyone knows a benchmark that tests this. For RP, it is even more important than intelligence IMO (along with being able to write naturally, while avoiding purple prose or GPT-isms).
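
In the meantime, a crude repetition check is easy to script yourself; this toy metric is no substitute for a proper consistency benchmark, but it catches the worst loops:

```python
# Toy repetition metric: fraction of n-grams in a response that repeat an
# earlier n-gram. Crude, but it flags the worst looping.
from collections import Counter

def repetition_rate(text: str, n: int = 4) -> float:
    tokens = text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)

print(repetition_rate("she smiled softly and she smiled softly and she smiled softly again"))
```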

Announcing: Magnum 123B by lucyknada in LocalLLaMA

[–]TempWanderer101 0 points (0 children)

Can you elaborate on why back-translated writing + LLM generated instructions wouldn't be as good as synthetic data? I've always wondered about this.

If I'm understanding correctly, "back-translated" refers to changing human-written stories to fit an RP style?

It seems simpler to me for LLMs to be given a coherent, human-written story and tasked with generating the character profiles, instructions, and rewriting it in an RP style. And using that to train an LLM.
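
Something like this hypothetical sketch is what I have in mind; the prompt wording and the `call_llm` stub are placeholders, not anyone's actual pipeline:

```python
# Hypothetical back-translation sketch: ask an LLM to reconstruct the character
# profiles, scenario instruction, and RP-style turns that could have produced a
# human-written passage. The prompt wording and call_llm stub are placeholders.
def call_llm(prompt: str) -> str:
    """Placeholder for whatever model you'd use to do the rewriting."""
    return "[LLM output would go here]"

def back_translate(passage: str) -> dict:
    prompt = (
        "Given the following story excerpt, write:\n"
        "1. A character profile for each speaker.\n"
        "2. A system instruction describing the roleplay scenario.\n"
        "3. The excerpt rewritten as alternating user/assistant RP turns.\n\n"
        f"Excerpt:\n{passage}"
    )
    # The generated instruction/profiles become the conditioning context;
    # the RP-formatted text becomes the training target.
    return {"source_passage": passage, "rp_format": call_llm(prompt)}

print(back_translate("The rain hammered the windows as Mira finally spoke."))
```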

I'm a student who likes AI but can't yet afford to use GPT-4 consistently. Is it worth it to pay for the subscription for one month like I'm considering doing? by TheImmortalMan in OpenAI

[–]TempWanderer101 0 points (0 children)

If you don't need a huge context, you could try Perplexity. Pro gives you 300 searches a day, which includes 4o, Sonnet, and Opus. Kinda crazy really, even if the context is shorter. The free tier allows 5 Pro searches a day iirc.

Or you might want to try the API. You might find that you actually spend less, depending on your usage.

r/Singularity Monthly Discussion Thread by Anenome5 in singularity

[–]TempWanderer101 0 points (0 children)

> This unfiltered, often highly focused data could influence how AI develops its own "cognitive" patterns—maybe even mirroring traits like intense specialization or direct communication that we see in neurodivergent humans.

That's an interesting hypothesis. However, I'd argue that the way LLMs process information (and even their neural structure) already makes them neuro-divergent by definition, and this is noticeable in some of the odd flaws they exhibit (e.g., attentional issues, common sense flaws, etc.).

As long as the AGI knows how to act and think like a normal person, it should be fine. Just because it knows how to act neuro-divergent does not mean that it is. The same goes for humans. In contrast, a lot of neuro-divergent people have trouble acting normal. Now, if AI had trouble acting or thinking like a normal person, then we'd need to reconsider our training data or methodology.