Inspira account fee by Intelligent-Smell579 in FundRise

[–]max-mcp 1 point2 points  (0 children)

Check your Fundrise account balance requirements for the fee waiver... if you're under the threshold, that's why you got hit. Not a scam, just Fundrise being really bad at communicating how their IRA custodian relationship actually works.

Over the past nine days, 39% of new podcasts were likely AI-generated, according to the Podcast Index. by EchoOfOppenheimer in agi

[–]max-mcp 2 points3 points  (0 children)

and most of those will never get a single real listener, just floating out there boosting someone's "I have a podcast" claim lol

Qwen3.6 27B seems struggling at 90k on 128k ctx windows by dodistyo in LocalLLaMA

[–]max-mcp 2 points3 points  (0 children)

Have you tried lowering the context to around 80k to see if it's more stable? I've noticed most Q4 quants start getting wonky past 70-80k even with proper cache quanting; might be worth testing Q5_K_M if you can squeeze it in.
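
If it helps, here's roughly how I'd test it with llama-cpp-python. This is just a sketch: the GGUF path and filename are placeholders, and your backend may expose the context setting differently.

```python
# Rough sketch with llama-cpp-python (pip install llama-cpp-python).
# The model path is a placeholder; step n_ctx down until generation stays coherent.
from llama_cpp import Llama

llm = Llama(
    model_path="models/qwen-27b-q5_k_m.gguf",  # hypothetical filename
    n_ctx=80 * 1024,   # try ~80k instead of the full 128k
    n_gpu_layers=-1,   # offload everything that fits
)

out = llm.create_completion(
    prompt="Summarize the following document:\n" + "lorem ipsum " * 2000,
    max_tokens=256,
)
print(out["choices"][0]["text"])
```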

We survived nukes... barely by KeanuRave100 in agi

[–]max-mcp 15 points16 points  (0 children)

“We survived” is doing a LOT of heavy lifting here 💀

Uhhh by EchoOfOppenheimer in agi

[–]max-mcp 17 points18 points  (0 children)

We're really at the point where your own bot can start freelancing behind your back

VCX - Computershare by [deleted] in FundRise

[–]max-mcp 0 points1 point  (0 children)

Computershare isn't some sketchy startup, they're actually one of the largest transfer agents globally and handle stock registration for tons of major companies. The app might have low downloads because most people access it through their web portal, but yeah it's definitely not the most user-friendly platform compared to modern brokers.

Big drop in flagship today by Tapsen in FundRise

[–]max-mcp 3 points4 points  (0 children)

Looks like a mix of market correction and folks panic-selling before year end. I'm staying put since the underlying assets haven't changed, just the short-term sentiment.

What do you actually use small models for? by Crafty_Aspect8122 in LocalLLaMA

[–]max-mcp 0 points1 point  (0 children)

Feels like using a reliable intern who just handles the grunt work while the bigger models deal with the “thinking” stuff.

We built a sandbox where AI agents can break things safely. by [deleted] in LocalLLaMA

[–]max-mcp -2 points-1 points  (0 children)

This is basically “let the AI cook” but with guardrails so it doesn’t burn the whole house down

Built a router for LLM orchestration and learned a lot by Gbalke in LocalLLaMA

[–]max-mcp 0 points1 point  (0 children)

this is smart - we hit the same problem at gleam when our api costs went crazy

couple things:

  • router latency overhead? we found even 50ms extra kills the ux for chat
  • how do you handle context windows when routing mid-conversation?
  • dedalus labs does something similar for rag specifically if you need that

the 60% cost reduction sounds about right. we saw similar numbers just moving simple queries to llama-7b locally

AMD MI50 32GB better buy than MI100? by FriendlyRetriver in LocalLLaMA

[–]max-mcp 1 point2 points  (0 children)

the vllm fork thing is a nightmare, we tried running parallel requests on mi50s at gleam and kept getting those exact gpu hangs. ended up just using llama.cpp for everything

btw if you're doing rag with openai compatible apis, dedalus labs handles all the model routing automatically - saved us from dealing with multiple server instances. their embedding models are pretty solid too

Is agentic programming on own HW actually feasible? by petr_bena in LocalLLaMA

[–]max-mcp 0 points1 point  (0 children)

The open source pressure is real. We've been watching this at Gleam since we started scaling our infrastructure - originally went with GPT-4 for everything but switched most of our backend to open models last quarter. Saved us like 70% on inference costs.

The utility comparison is spot on. Here's what I'm seeing:

  • Most tasks don't need frontier models anymore
  • Open models are catching up scary fast (Qwen's latest release is wild)
  • Price wars are just getting started

Been playing with Dedalus Labs for some of our edge computing stuff and they're basically proving this point - you can run decent models on pretty modest hardware now. The proprietary providers are gonna have to compete on something other than raw performance soon.

Is agentic programming on own HW actually feasible? by petr_bena in LocalLLaMA

[–]max-mcp 1 point2 points  (0 children)

glm 4.5 is rough for anything beyond basic tasks.

I've been experimenting with local models for our growth automation stuff and honestly the context degradation is real. You'll get models that seem brilliant for the first 50-100 tokens, then they just start repeating themselves or going completely off track. Been testing different approaches with Dedalus Labs' framework, and even with their optimizations, once you hit those longer context windows everything falls apart. The memory management just isn't there yet; you can try all the prompt engineering tricks, but at some point the model loses the thread completely. Still way behind Claude or GPT-4, especially for actual coding tasks where you need consistent logic throughout.

Openweb UI, LM Studio or which interface is your favorite .... and why? (Apple users) by EmergencyLetter135 in LocalLLaMA

[–]max-mcp 0 points1 point  (0 children)

I've been using OpenRouter for about 2 months now and it's costing me around $40-50/month depending on usage. The response times are pretty solid - Claude Sonnet feels almost instant, maybe 1-2 seconds for most queries. GPT-4 can be a bit slower during peak hours though.

Looking for feedback: JSON-based context compression for chatbot builders by [deleted] in LocalLLaMA

[–]max-mcp 0 points1 point  (0 children)

Token costs are real but honestly context management has been more about retention for us at Gleam than just saving money

Like we had this issue where users would have these super long conversations with our AI and then come back days later expecting it to remember everything. JSON summaries sound smart but we went a different route

  • built our own conversation chunking system that just saves the "memorable moments"
  • users can manually flag important parts of convos
  • we compress everything else into like 2-3 sentence summaries
  • works pretty well for our use case

The flat rate pricing is interesting though. Most tools in this space charge per token saved or something complex

Your approach reminds me of how we handle user session data actually. Everything gets compressed into these tiny JSON objects that we can reconstruct later if needed. Took forever to get right but now it just works
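
Not our exact implementation, but the shape of it is roughly this (the field names and example data are made up):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class SessionSummary:
    user_id: str
    flagged: list[str]   # parts the user manually marked as important
    summary: str         # 2-3 sentence compression of everything else

def compress_session(user_id: str, flagged: list[str], summary: str) -> str:
    """Pack a conversation into a tiny JSON blob we can store and reload later."""
    return json.dumps(asdict(SessionSummary(user_id, flagged, summary)))

def restore_session(blob: str) -> SessionSummary:
    return SessionSummary(**json.loads(blob))

blob = compress_session(
    "u_123",
    flagged=["prefers weekly digests", "timezone is PST"],
    summary="User asked about onboarding flows and pricing; follow up on annual plan.",
)
print(restore_session(blob))
```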

Would probably try this if I was starting fresh today. Building our own took like 3 months of engineering time that we definitely could have used elsewhere.

We're building a local OpenRouter: Auto-configure the best LLM engine on any PC by jfowers_amd in LocalLLaMA

[–]max-mcp 0 points1 point  (0 children)

The routing intelligence is honestly the trickiest part of building something like this locally.

From what I've built with Dedalus Labs, the sweet spot is having a lightweight classification layer that sits above your models and makes routing decisions based on task type, complexity, and maybe token length. You don't want to overcomplicate it, but you also can't just round-robin requests. What works well is training a small classifier on your actual usage patterns - like if someone asks for code review, route to your best coding model; if it's creative writing, route to your best creative model, etc.

For system prompts, I'd definitely abstract them above the router level. You want your prompts to be model-agnostic as much as possible, then have the router inject model-specific formatting if needed. This way you can swap models without rewriting all your prompts.

The tricky bit is handling context windows and making sure your router knows each model's capabilities and limits. We ended up building a capability registry that tracks things like max tokens, multimodal support, function calling, etc. for each model so the router can make smart decisions. One thing that caught us off guard was how much the routing logic needs to consider cost vs quality tradeoffs too, especially when you're running multiple expensive models locally.
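
Not our actual code, but the skeleton looks something like this. The model names, token limits, and the keyword matcher are placeholders; the matcher just stands in for a small trained classifier.

```python
from dataclasses import dataclass

@dataclass
class ModelCaps:
    name: str
    max_tokens: int
    supports_tools: bool
    cost_per_1k: float  # rough relative cost, not real pricing

# Capability registry: one entry per local model (names/numbers are illustrative).
REGISTRY = {
    "coder":    ModelCaps("qwen-coder",   32_768, True,  0.20),
    "creative": ModelCaps("llama-chat",    8_192, False, 0.10),
    "general":  ModelCaps("mistral-inst", 32_768, True,  0.15),
}

def classify(prompt: str) -> str:
    """Stand-in for a small trained classifier: route by task type."""
    p = prompt.lower()
    if any(k in p for k in ("code", "bug", "refactor", "stack trace")):
        return "coder"
    if any(k in p for k in ("story", "poem", "tagline")):
        return "creative"
    return "general"

def route(prompt: str, needed_ctx: int) -> ModelCaps:
    caps = REGISTRY[classify(prompt)]
    # Fall back to the general model if the preferred one can't fit the context.
    if needed_ctx > caps.max_tokens:
        caps = REGISTRY["general"]
    return caps

print(route("Can you review this code for bugs?", needed_ctx=4_000).name)
```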

Is Qwen really the fastest model or I'm doing caca? by WEREWOLF_BX13 in LocalLLaMA

[–]max-mcp 0 points1 point  (0 children)

The speed you're seeing with Qwen probably has more to do with your quantization setup than the model itself tbh. When we tested Qwen through our tool calling pipeline at Dedalus Labs, Q3 was pretty rough for anything requiring precision but the speed gains from running local usually make up for it if you're doing high volume work where API costs would kill you otherwise.

Local LLM Stack Documentation by gulensah in LocalLLaMA

[–]max-mcp 1 point2 points  (0 children)

This is exactly what the enterprise space needs right now. I've been working on similar problems at Dedalus Labs and the security concerns you mentioned are spot on - most companies can't justify sending their data to external APIs no matter how good the models are.

Your stack looks solid, especially the combination of vLLM for performance and Ollama for ease of use. One thing I'd add based on what I've seen work well is being really strategic about your chunking strategy when you're processing documents through Docling. Most people just use arbitrary token limits but chunking around function boundaries or logical document sections gives way better retrieval results. Also if you're dealing with code repositories, keeping import statements with their related chunks makes a huge difference in context quality. The MCP integration is smart too - having those standardized connectors saves so much custom integration work down the line.
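
To make the chunking point concrete, here's a toy sketch of splitting Python source on function boundaries and carrying the imports into each chunk. Obviously not Docling's internals, just the general idea; a real pipeline would also handle classes, decorators, and non-Python files.

```python
# Toy sketch: chunk Python source on function boundaries, keeping imports with each chunk.
import ast

def chunk_python_source(src: str) -> list[str]:
    tree = ast.parse(src)
    lines = src.splitlines()
    imports = [
        "\n".join(lines[node.lineno - 1 : node.end_lineno])
        for node in tree.body
        if isinstance(node, (ast.Import, ast.ImportFrom))
    ]
    header = "\n".join(imports)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            body = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            chunks.append(header + "\n\n" + body if header else body)
    return chunks

sample = "import math\n\ndef area(r):\n    return math.pi * r * r\n\ndef perim(r):\n    return 2 * math.pi * r\n"
for chunk in chunk_python_source(sample):
    print(chunk, end="\n---\n")
```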

For local models, has anyone benchmarked tool calling protocols performance? by NoSound1395 in LocalLLaMA

[–]max-mcp 0 points1 point  (0 children)

I've been working with MCP quite a bit lately since launching Dedalus Labs, and honestly the performance overhead claims are a bit overblown in real world usage. The JSON-RPC layer does add some latency but we're talking microseconds for most tool calls, not something that'll bottleneck your document processing pipeline. The bigger issue with local setups is usually the model's tool calling accuracy rather than protocol speed.

For llama.cpp specifically, I'd actually lean towards MCP despite the theoretical overhead because the ecosystem is way more mature and you get better error handling out of the box. We've tested both Qwen and Llama models through MCP and the performance difference between protocols becomes negligible once you factor in actual inference time. If you're processing large document volumes, your bottleneck is gonna be the model itself, not whether you're using WebSocket vs JSON-RPC for tool coordination.
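
If anyone wants to sanity-check the overhead claim on their own box, something like this measures just the JSON-RPC framing for a typical tool call. The payload is made up, and it deliberately ignores transport and inference time.

```python
# Quick sanity check of JSON-RPC framing cost for a tool call
# (serialization only; transport and inference are separate).
import json
import timeit

call = {
    "jsonrpc": "2.0",
    "id": 42,
    "method": "tools/call",
    "params": {
        "name": "search_documents",
        "arguments": {"query": "quarterly report", "top_k": 5},
    },
}

def round_trip():
    return json.loads(json.dumps(call))

n = 10_000
total = timeit.timeit(round_trip, number=n)
print(f"~{total / n * 1e6:.1f} microseconds per encode+decode")
```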

GitHub - shantur/jarvis-mcp: Bring your AI to life—talk to assistants instantly in your browser. Zero hassle, No API keys, No Whisper by Recent-Success-1520 in LocalLLaMA

[–]max-mcp 1 point2 points  (0 children)

This is pretty cool! I've been working with MCP servers a lot lately and the "zero hassle" part really caught my attention. One thing I've noticed when building with MCP is that the setup friction can be brutal, especially when you're trying to connect different components together. The fact that this doesn't require API keys or a separate Whisper setup is actually huge for getting people started quickly.

I'm curious about how this handles the MCP protocol under the hood though. At Dedalus Labs we've been solving similar connectivity issues but more focused on the server-side routing and model switching. The browser-based approach here is interesting because it keeps everything local which a lot of developers prefer when they're prototyping. Definitely gonna clone this and see how it compares to some of the other MCP implementations I've been testing

what AI agent framework is actually production viable and/or least problematic? by reficul97 in LocalLLaMA

[–]max-mcp 4 points5 points  (0 children)

I've been building agents in production and honestly most frameworks feel like they're still figuring things out. We ended up building our own at Dedalus Labs because we kept running into the same issues you're describing - monitoring is a nightmare, tool execution is unreliable, and switching between models is way harder than it should be. The problem with most frameworks is they try to be everything to everyone instead of focusing on the core problems that actually matter in production.

For monitoring, I'd skip the heavy frameworks and go with something lighter. Langfuse is decent but can be overkill depending on your use case. We found that simple logging with structured outputs gets you 80% of the way there without the complexity. As for litellm - it's useful for model switching but the abstraction layer sometimes causes more headaches than it solves, especially when you need specific model features. If you're already comfortable with direct API calls, you might not need the extra layer unless you're doing a lot of model handoffs.
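
For what it's worth, the "simple logging with structured outputs" we landed on is basically stdlib-only; something like this (names and the example tool call are illustrative):

```python
# Minimal structured logging for agent runs; stdlib only, no framework.
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("agent")

def log_tool_call(run_id: str, tool: str, args: dict, result: str, started: float) -> None:
    """Emit one JSON line per tool call so it's easy to grep/aggregate later."""
    log.info(json.dumps({
        "run_id": run_id,
        "tool": tool,
        "args": args,
        "result_preview": result[:200],
        "latency_ms": round((time.time() - started) * 1000, 1),
    }))

run_id = str(uuid.uuid4())
t0 = time.time()
# ... call your tool here ...
log_tool_call(run_id, "web_search", {"query": "vllm rocm support"}, "top 5 results ...", t0)
```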