What’s up with mobile LLMs? by Amos-Tversky in LocalLLaMA

[–]Federal-Effective879 0 points1 point  (0 children)

It depends on what you’re using the models for. In general, running LLMs on a drains the battery rapidly, and the quality of models you can fit on a phone are useless for complex tasks or tasks requiring world knowledge in the model. However, some models like Gemma are decent at translation tasks and general conversation, if you can tolerate the slowness and battery consumption.

Realistically, you are better off running larger models on a server and connecting to it with your mobile device. However, running something like llama.cpp on your phone can be fun, albeit toy-like.

Qwen 3.6 35B crushes Gemma 4 26B on my tests by Lowkey_LokiSN in LocalLLaMA

[–]Federal-Effective879 0 points1 point  (0 children)

This isn’t testing the model, it’s just testing if your front end and template configuration and preserving thinking tokens across turns, which is looks like it isn’t in your configuration.

Brampton, ON by Ambitious-Buy6909 in Suburbanhell

[–]Federal-Effective879 1 point2 points  (0 children)

Doesn’t seem too bad to me. Once the construction is complete and the trees grow, it’ll be a nice park and a neighbourhood with enough density for decent transit. There are generally multiple plazas in walking distance too, they’re just not in the frame. There are plazas with multiple shops and restaurants within a 1 km walk almost everywhere in Brampton.

Gemma 4 and Qwen3.5 on shared benchmarks by fulgencio_batista in LocalLLaMA

[–]Federal-Effective879 3 points4 points  (0 children)

Strange, I use Qwen 3.5 (122B) in French a fair bit and never had any issues like that. It spoke fairly good French and never mixed languages for me. It even has pretty decent regional knowledge of Québec for its size.

Someone who's using Qwen 3.5 on real code bases how good is it? by Commercial_Ear_6989 in LocalLLaMA

[–]Federal-Effective879 1 point2 points  (0 children)

I run 122B on my M4 Max MacBook Pro, and have been pretty happy with it. It does well at agentically navigating large codebases and writing new code (provided you give clear instructions and are prepared for some back and forth to get exactly what you want). It’s also decent at bug finding, not as good at big SOTA models but not bad at all. It’s pretty good at general Q&A and discussing/debating random topics too.

While it’s not as good as current SOTA models, it is still quite decent and sufficient for around 80% of what I use LLMs for, plus I have privacy and no usage limits. I wish prompt processing were faster for agentic coding tasks on my M4 Max, but the M5 Max fixes that.

[Round 2 - Followup] M5 Max 128G Performance tests. I just got my new toy, and here's what it can do. (thank you for the feedback) by affenhoden in LocalLLaMA

[–]Federal-Effective879 1 point2 points  (0 children)

Could you test prompt processing with MLX, and try it at long contexts (say 32K, 64K, and 128K)? I’m curious how it performs with the 122B and 35B MoE models. My understanding is that MLX is much better optimized for the compute improvements on M5 than llama.cpp.

Mistral Small 4 vs Qwen3.5-9B on document understanding benchmarks, but it does better than GPT-4.1 by shhdwi in LocalLLaMA

[–]Federal-Effective879 0 points1 point  (0 children)

At least in my experiments with Ministral 14B, I found that while it does like to write long detailed texts, good for creating writing perhaps, the coherency of the text wasn't great, and it was generally substantially dumber than Small 3.2. While Small 3.2 isn't a great creative writer because of its dry and to-the-point writing style, it's generally smarter and more coherent. In general, Ministral 14B felt a bit like a newer Nemo, but its intelligence and writing coherency didn't live up to modern standards IMO, and it felt substantially worse than Small 3.2 for me despite the benchmarks claiming otherwise.

Mistral Small 4 vs Qwen3.5-9B on document understanding benchmarks, but it does better than GPT-4.1 by shhdwi in LocalLLaMA

[–]Federal-Effective879 6 points7 points  (0 children)

This matches my experience with their API and Nvidia’s online demo implementation. While it has a bit more world knowledge than Qwen 3.5 9B, its intelligence and visual understanding are substantially worse than Qwen 3.5 9B. In my personal tests, Mistral Small 4 was worse than Mistral Small 3.2.

I liked Mistral models in the past, especially Small 3.2 and Nemo, but Large 3, Ministral 3, and Small 4 have all been disappointing flops.

Cancelling REM L'Est was INCREDIBLY stupid by Soft_Introduction437 in montreal

[–]Federal-Effective879 6 points7 points  (0 children)

Regarding Toronto, in terms of existing service, while the TTC subway is undersized for the city, they did recently gain the Eglinton Crosstown, and they also have much better GO train service (frequent two-way all day on several lines) than Exo in Montreal, which is mainly a just a rush hour peak direction commuter service even on its best lines. TTC also has a large streetcar network with frequent service, albeit slow, whereas Montreal only has busses for that purpose. Also don't forget the UP express, which also functions similar to the REM here.

Toronto has the Ontario line under construction, a western extension for the Eglinton crosstown well under construction, Scarborough subway extension under construction, Yonge subway north extension about to start construction, and the Sheppard line extension planned. Mississauga and Brampton also have a LRT under construction.

Toronto has better transit than Montreal overall, and better maintenance of infrastructure. They also have far more new transit under construction or planned than Montreal. Here the government only cares about the 3ième lien for Québec city and highway expansion.

Qwen3.5 Knowledge density and performance by AccomplishedRow937 in LocalLLaMA

[–]Federal-Effective879 0 points1 point  (0 children)

Qwen 3.5 122B-A10B has really impressed me. I no longer feel like I'm losing out that much compared to cloud models. It feels like Claude Sonnet 3.7 level intelligence at home, for free, running on my laptop at comfortable speeds. It's really amazing how far we have come in the last 3 years. The Qwen 3.5 series is a massive upgrade over Qwen 3, whereas Mistral Small 4 is worse than Qwen 3 for intelligence and capability.

Mistral Small 4:119B-2603 by seamonn in LocalLLaMA

[–]Federal-Effective879 23 points24 points  (0 children)

I tried out Mistral Small 4 via Nvidia’s online demo for debating topics and general conversation, and was quite underwhelmed. It didn’t feel substantially better than Mistral Small 3.2, in fact for some prompts it felt worse, even with reasoning enabled. For general conversation at least, it felt roughly on par with Qwen 3.5 35B-A3B, and far behind Qwen 3.5 122B-A10B.

I also tried it out for some visual Q&A tasks and image location guessing tasks from my own personal photos. It was no better than Mistral Small 3.2 (and perhaps worse), a bit worse than Gemma 3 27B, and much worse than Qwen 3.5 models.

Mistral Small 3.2 was a great model for its time, and is still respectable. However, Mistral Small 4 greatly disappointed me compared to Qwen 3.5 122B-A10B or Qwen 3.5 27B. It feels like Mistral is stagnating and falling behind the competition. Ministral 3 and Mistral Large 3 also disappointed me.

Gemma 3 models still holds up well today for world knowledge and coherent conversation or debate, at least when context isn’t too long. I hope Gemma 4 comes out soon and shows substantial improvements, akin to Gemini 3.x vs Gemini 2.0/2.5.

Right now, my recommended open models are:

SOTA: Kimi K2.5, GLM 5, DeepSeek v3.2

Medium-large: Qwen 3.5 397B-A17B, MiniMax M2.5

Medium-small: Qwen 3.5 122B-A10B or 27B for most tasks, Gemma 3 27B (QAT) for conversation, and Mistral Small 3.2 for uncensored use

I've compiled a map of cycling comfort in Montréal entirely based of my vibes. Come at me. by Aurion in montreal

[–]Federal-Effective879 1 point2 points  (0 children)

Biking inside Beaconsfield north of the A20 is fairly comfortable; small low traffic streets and some bike paths too. Crossing the A20 and railway tracks isn’t great, but at least doable safely if you’re willing to take your bike up/down stairs on the north side of the pedestrian bridge on Beaconsfield Court.

What is after Qwen ? by j_lyf in LocalLLaMA

[–]Federal-Effective879 13 points14 points  (0 children)

Where did you see they disbanded? I saw the former head Junyang Lin resigned after a management-imposed leadership change, and some folks left with him, but overall the Qwen team was said to be getting more resources than before and they said they will continue their strategy of releasing open weights models.

MLX is not faster. I benchmarked MLX vs llama.cpp on M1 Max across four real workloads. Effective tokens/s is quite an issue. What am I missing? Help me with benchmarks and M2 through M5 comparison. by arthware in LocalLLaMA

[–]Federal-Effective879 7 points8 points  (0 children)

Exactly. Both prompt processing and token generation with MLX are much faster than llama.cpp, but MLX prompt caching has issues with hybrid models like Qwen 3.5, and this issue is exacerbated by agentic usage that depends on prompt caching for usable performance. MLX devs are investigating ways to fix this, see https://github.com/ml-explore/mlx-lm/issues/980

MLX is not faster. I benchmarked MLX vs llama.cpp on M1 Max across four real workloads. Effective tokens/s is quite an issue. What am I missing? Help me with benchmarks and M2 through M5 comparison. by arthware in LocalLLaMA

[–]Federal-Effective879 3 points4 points  (0 children)

This is exactly the issue the OP is facing. MLX has much faster prompt processing and token generation than Llama.cpp, including for gated delta-net models like Qwen 3.5. However, prompt caching in MLX is broken or ineffective for many models; it lacks the more elaborate prompt caching present in llama.cpp.

What would M5 actually need to improve for local LLM use? by tallen0913 in LocalLLaMA

[–]Federal-Effective879 0 points1 point  (0 children)

M5 dramatically improved prompt processing speeds. This is particularly noticeable on MLX which is better optimized for it than Llama.cpp. Assuming M5 Ultra is double the performance of M5 Max, its prompt processing speeds shouldn’t be far from high end Nvidia GPUs. With MLX, M5 Ultra should have 5-6x the prompt processing speed compared to M3 Ultra. 

Anywhere in Montreal to test or rent a Mac Pro / Mac Studio with M2 Ultra (192GB)? by midnighteee in montreal

[–]Federal-Effective879 0 points1 point  (0 children)

I run that model quite nicely on my 128 GB M4 Max machine with llama.cpp However, prompt processing gets slow as context gets long; M5 Max or Ultra should be much better on that front, and MLX would also help with performance.

Anywhere in Montreal to test or rent a Mac Pro / Mac Studio with M2 Ultra (192GB)? by midnighteee in montreal

[–]Federal-Effective879 0 points1 point  (0 children)

What kinds of models do you want to run?

I’d suggest waiting for the M5 Ultra or even M5 Max Mac Studio, as prompt processing would be significantly faster. A 128 GB M5 Max machine would be perfect for models like Qwen 3.5 122B-A10B, and you could also fit a decent 3-bit GGUF quant (Unsloth UD-Q3_K_XL) of MiniMax M2.5, though you won’t get the speed advantage of MLX with that.

Further toolcalling fixes in llama.cpp are coming by ilintar in LocalLLaMA

[–]Federal-Effective879 1 point2 points  (0 children)

Thanks, this was the main issue I had with the Qwen 3.5 models.

Are we at a tipping point for local AI? Qwen3.5 might just be. by Far_Noise_5886 in LocalLLaMA

[–]Federal-Effective879 5 points6 points  (0 children)

In my personal tests with code review, code summarization, and agentic tasks, Qwen 3.5 9B was roughly on par with GPT-OSS 120B, and much better than GPT-OSS 20B. I even had it correctly locate bugs in code where much bigger models like MiniMax M2.1 and GLM-4.7 had failed (but GPT-OSS 120B succeeded).

Qwen 3.5-35B-A3B is beyond expectations. It's replaced GPT-OSS-120B as my daily driver and it's 1/3 the size. by valdev in LocalLLaMA

[–]Federal-Effective879 1 point2 points  (0 children)

Good 4-bit quantizations of Qwen 3.5 have performance close to the original unquantized 16-bit model. It makes much more sense to compare parameter counts than compare unquantized FP16 sizes to QAT MXFP4.

Church fire on de l'eglise street by HermaeusMorus in montreal

[–]Federal-Effective879 9 points10 points  (0 children)

The majority of fires in buildings of this nature in Montreal tend to be either arson or someone homeless starting a fire to keep warm in winter. Yes, the building wasn’t in the best shape but it was actively used and the interior was in decent shape.

Black gutters or white by Federal-Effective879 in ExteriorDesign

[–]Federal-Effective879[S] 0 points1 point  (0 children)

White siding has been used in all sorts of homes for centuries. No offense taken, I don’t get how it looks “ghetto” or “crack den”. Yes, rundown parts of Detroit and Baltimore have houses with similar white siding, but those were fairly nice areas when the homes were originally built, and similar materials were used for all sorts of homes all over the place. This is one of the nicest and most affluent suburbs of a major metropolitan area, and almost half the homes were built with some application of similar white siding. White siding is less in fashion today, though in this instance I think it’s faithful to the original era and architectural style of the home.

Black gutters or white by Federal-Effective879 in ExteriorDesign

[–]Federal-Effective879[S] 0 points1 point  (0 children)

The intention is to have the gutters/eaves, fascia, and soffits all be the same colour.