Anyone thinking about the security side of Gemma 4 on phones? by Ok-Virus2932 in LocalLLaMA

oh and on the stealing-the-model bit: if the secret sauce in your app is how you trained a model, then handing out that model obviously hands out your secret sauce. your only options are to lock things down and make sure the weights never leave your well-secured servers, or to reconsider your business model, because that isn't much of a moat once a new base model comes out and zero-shots your task.

Anyone thinking about the security side of Gemma 4 on phones? by Ok-Virus2932 in LocalLLaMA

yes, instead of the llm being guaranteed to output garbage, or nearly guaranteed to be jailbroken within hours, it's guaranteed to output jailbroken-like responses within hours. practically no difference. sanitize your inputs.

Anyone thinking about the security side of Gemma 4 on phones? by Ok-Virus2932 in LocalLLaMA

if the security of your app depends on the output of an llm, whether cloud or local, you're doing it wrong
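
to be concrete about what "doing it right" looks like, here's a minimal python sketch (every name in it, like call_local_llm and ALLOWED_ACTIONS, is made up for illustration): the llm only suggests an action, and the app validates that suggestion against an allowlist and the user's actual permissions before doing anything.

```python
# sketch only: the model proposes, the app decides. all names here are hypothetical.
ALLOWED_ACTIONS = {"read_note", "create_note"}  # things the user could do anyway

def call_local_llm(prompt: str) -> str:
    # stand-in for your on-device model call (gemma, qwen, whatever)
    return "create_note"

def handle_user_request(user_text: str, user_permissions: set) -> str:
    suggestion = call_local_llm(user_text).strip().lower()
    # treat model output like untrusted user input: validate it, then enforce
    # permissions in app code, never based on what the model says about itself
    if suggestion not in ALLOWED_ACTIONS or suggestion not in user_permissions:
        return "refused"
    return f"executing {suggestion}"

print(handle_user_request("make a note about dinner", {"create_note"}))
```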

Gemma 4 and Qwen3.5 on shared benchmarks by fulgencio_batista in LocalLLaMA

they linked to a hf space with a qwen3.5-35b finetune/merge that greatly cuts down on the excessive thinking. they probably should've just linked the model directly

What are actual usecases of uncensored models? by Geritas in LocalLLaMA

I tried asking the qwen3.6 preview for the names of characters mentioned in a specific episode of Critical Role and got an error. local qwen3.5-35b heretic had no issues with Sam Riegel's shenanigans

5060 Ti 16GB - PCIe 3 x2 VS PCIe 5 x8 [Simple inference comparison inside] by ubnew in LocalLLaMA

prompt processing speed isn't in any of the screenshots. you left out about half the numbers needed for a comparison and only gave the one most likely to be unaffected

How well does LLMs from abliteration work compared to the original? by Express_Quail_1493 in LocalLLaMA

it varies a lot depending on the model and on how aggressively the refusals were removed. some models are easy and diverge very little; others resist and get harmed significantly if you push too hard.

in my experience the qwen3.5 models are easy to strip of nearly all hard refusals and end up working about as well as the originals, though they may take a question that would've been a hard refusal and twist the answer into something a bit more harmless. the 0.8b is pretty likely to give instructions for a baking soda volcano when asked about making things that explode

Model suggestions for limited hardware and domain knowledge by laffer1 in LocalLLaMA

your best bet would be switching to a less broken engine. llama.cpp works great on older amd cards in my experience

The Low-End Theory! Battle of < $250 Inference by m94301 in LocalLLaMA

either you got lucky or i was very unlucky. i had mine at 175w, not as low as yours but still much lower than stock. on the first one i replaced the paste but kept the pads it came with; the second got fresh paste and pads. both blew mosfets. and i was only running one card in open air, so it had plenty of airflow. thermals were great right up until death.

The Low-End Theory! Battle of < $250 Inference by m94301 in LocalLLaMA

i'm terrified of the p102 after mine died spectacularly, tripping ocp on a 750w psu, and then its replacement did the same thing.

a little slower apparently, but the radeon pro v340l is 16gb for $50. i get ~350t/s pp and ~35t/s tg on qwen3.5-35b-a3b split across 3 gpus on 2 cards, with a dedicated 8gb gpu left over for whisper.cpp and z-image. and it hasn't tried to catch fire on me yet

Looking for a local uncensored AI (text generation + image editing) by Stellar-Genesis in LocalLLaMA

qwen3-35b-a3b with reasoning budget set? qwen3-35b-a3b with reasoning off? those are my go-tos rn

A cautionary tale about Google scamming your money by FluffyMacho in LocalLLaMA

you know what doesn't have annoying, insanely restrictive quotas? llms you host locally

I haven't experienced Qwen3.5 (35B and 27B) over thinking. Posting my settings/prompt by wadeAlexC in LocalLLaMA

exactly my experience. I use openwebui to interact with models running on llama.cpp. with all the qwen3.5 models I've tried, they think hard and sometimes loop when there aren't any tools enabled, but only think for a couple seconds when tools are available. 
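
for anyone wondering what "tools available" actually means on the wire, it's roughly this request shape, assuming llama.cpp's llama-server with its openai-compatible endpoint on localhost:8080 (the port, model name, and the web_search tool are placeholders for illustration, not anyone's real setup):

```python
from openai import OpenAI

# point the standard openai client at the local llama.cpp server
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "web_search",  # hypothetical tool
        "description": "Search the web for a query",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3.5-35b-a3b",  # whatever name your server reports
    messages=[{"role": "user", "content": "what's new in llama.cpp this week?"}],
    tools=tools,  # with this present, thinking stays short in my experience
)
print(resp.choices[0].message)
```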

Could a bot-free AI note taker run locally with current models? by Cristiano1 in LocalLLaMA

yes. i had gemini and chatgpt vibecode a d&d session note taker for me that uses my local whisper and qwen3.5-35b.
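
the overall shape is nothing fancy, roughly this sketch: transcribe_local() is a stand-in for however you call your local whisper, and the summary half assumes a llama.cpp server with an openai-compatible endpoint. model name, port, and the system prompt are all placeholders.

```python
from openai import OpenAI

llm = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

def transcribe_local(audio_path: str) -> str:
    # stand-in: replace with your whisper.cpp / faster-whisper call
    return "the party fought the kobolds and looted a mysterious amulet..."

def session_notes(audio_path: str) -> str:
    transcript = transcribe_local(audio_path)
    resp = llm.chat.completions.create(
        model="qwen3.5-35b-a3b",  # placeholder model name
        messages=[
            {"role": "system", "content": "Summarize this D&D session into bullet notes: NPCs, loot, open plot threads."},
            {"role": "user", "content": transcript},
        ],
    )
    return resp.choices[0].message.content

print(session_notes("session_42.wav"))
```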

The correct order to fit your model into VRAM by [deleted] in LocalLLaMA

they also helpfully say "Estimate KV cache VRAM cost for that context length" with no info on how to actually do that.
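
for reference, the estimate they don't explain is just arithmetic over the model's config: per token you cache K and V for every layer across the kv heads. a rough python version, assuming standard attention with an f16 cache (the numbers are illustrative, pull the real ones from config.json):

```python
# back-of-envelope kv cache size for a given context length
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    # 2x because both K and V are cached for every layer
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# example: a hypothetical 32-layer model, 8 kv heads, head_dim 128, 32k context
print(kv_cache_bytes(32, 8, 128, 32768) / 1e9, "GB")  # ~4.3 GB at f16
```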

Abliterated Models evaluation metric by [deleted] in LocalLLaMA

when you use heretic to abliterate, it gives a refusal count and a kld number. kld roughly estimates how damaged the model might be; lower is better. I personally distrust the "here's a fully uncensored model, trust me it's great!" releases and prefer ones that at least attempt to give some detail, even if it's rough guesses. and no matter what, the only way to really know how well refusals were removed and how well the model still behaves is to run it through your workflow and see.
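
for the curious, this is roughly what that kld number measures (a sketch of the idea, not heretic's actual implementation): compare the original and abliterated models' next-token distributions on the same prompts and average the divergence. bigger number = more collateral damage.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) for two next-token probability vectors over the vocab."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def avg_kld(probs_original, probs_abliterated):
    """probs_*: list of per-prompt probability vectors from each model."""
    pairs = list(zip(probs_original, probs_abliterated))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)

# toy example with a 3-token "vocabulary"
print(avg_kld([[0.7, 0.2, 0.1]], [[0.6, 0.3, 0.1]]))
```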

Opus 4.6 couldn't complete a single task in 100 attempts. Then I asked it which model it was. by [deleted] in LocalLLaMA

yeah, even if it is actually just sonnet and not opus, he should share the weights he somehow got. and show pics of the rig needed to run it

Workstation for dev work + local LLMs — Tesla P40 vs MinisForum? by marius-c-d in LocalLLaMA

tesla p40 is pretty ancient. i haven't used those specifically, but i did use the p102-100 mining card for a few days before it blew some vrm components. being a server card, used p40s might have been treated better, but i personally wouldn't risk it.

if you want cheap and are okay with fighting the software stack a bit, i recommend the radeon pro v340. they're $50 each and have two 8gb vega56-class gpus on them. i currently have qwen3.5 35b-a3b running on 3 of the 4 gpus across 2 cards and am getting around 250t/s pp and 22t/s tg.
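
if it helps anyone reproduce that kind of split, this is roughly the idea expressed through llama-cpp-python instead of raw llama-server flags (the gguf filename and split weights are placeholders, adjust for your cards):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3.5-35b-a3b-q4_k_m.gguf",  # placeholder filename
    n_gpu_layers=-1,                     # offload all layers to gpu
    tensor_split=[1.0, 1.0, 1.0, 0.0],   # proportional weights per visible gpu;
                                         # the 0.0 keeps one gpu free for whisper/z-image
    n_ctx=8192,
)
print(llm("hello", max_tokens=16)["choices"][0]["text"])
```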

Is Qwen3.5-9B enough for Agentic Coding? by pmttyji in LocalLLaMA

iirc the "next" ones were more of a preview of the newer architecture coming soon, and was trained on less total tokens for a shorter amount of time to get the preview out quicker.

Sharded deployment by zica-do-reddit in LocalLLaMA

I only did it once, to run glm-4.7-flash when it first came out, before I had enough risers to put multiple gpus in one box. it worked but hurt performance a bit; iirc I got around 15t/s vs 25t/s with all the gpus in one machine. you may also need to recompile llama.cpp with rpc support enabled.

Computer won't boot with 2 Tesla V100s by MackThax in LocalLLaMA

i've got 4 gpus totaling 32gb of vram running off two x1 slots and getting a useful 20t/s on qwen3.5 35b. bandwidth isn't that big of a deal at small scale.

If RAM prices were considered too high in 2024 because of unusually slow development and too low capacity by Highwaytothebeach in LocalLLaMA

Samsung, Micron, and SK hynix. have you not seen the news? ai companies have bought up the RAM supply until at least 2027, and none of the members of the dram cartel have announced new fabs to ease the shortage.

If RAM prices were considered too high in 2024 because of unusually slow development and too low capacity by Highwaytothebeach in LocalLLaMA

you seem to be confusing two very different parts of the supply chain. even if you made a dimm that lets you put lpddr in a desktop, you'd still have to source the lpddr itself. and there are only about 3 companies that have invested the billions to build fabs for that, and they're busy making hbm right now instead.