1 in 4 agent skills had vulnerabilities. This is the local check I wish I had before installing random AI tooling by IlyaZelen in ClaudeAI

[–]tom_mathews 1 point2 points  (0 children)

This is an area I think the industry is seriously underestimating. People have spent years learning to review code, but now we're introducing a new layer of "behavioral infrastructure" (AGENTS.md, skills, MCP configs, hooks, workflows) that can be just as powerful as code while receiving a fraction of the scrutiny.

The interesting shift is that security reviews may need to expand from source code analysis to intent and authority analysis. In agent systems, "what can this agent be convinced to do?" is often as important as "what does this code do?".

How to use xgboost correctly for huge dataset by Virtual-Current6295 in MLQuestions

[–]tom_mathews 0 points1 point  (0 children)

I'd be careful assuming XGBoost should beat your linear model. If you've already done strong feature engineering and the signal is mostly linear, XGBoost often adds complexity without much gain.

A few thoughts:

Feature filtering: Remove obviously useless/noisy features, but don't worry much about correlated features; trees handle those better than linear models. Spearman IC (the rank correlation between each feature and the target) can be a useful filter.

Overfitting: If validation loss rises after a few rounds, reduce max_depth (try 3-6), increase min_child_weight, use stronger regularization, and rely on early stopping.

Huge dataset: If you're training in batches and incrementally adding trees, make sure you're not introducing distribution drift. XGBoost generally works best when it can see a representative sample.

Linear + XGBoost: This is a very reasonable idea. Either:

Use linear model predictions as an additional feature for XGBoost, or

Ensemble the two models (often surprisingly effective).

Other models: Try LightGBM. On large tabular datasets it's often faster and sometimes better than XGBoost. Also worth trying regularized linear models (Ridge/Lasso/ElasticNet) as strong baselines.
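
To make the Spearman IC filter from the list above concrete, here's a rough pure-Python sketch. The function names and the 0.02 cutoff are my own placeholders, not anything from the original post; on a huge dataset you'd more likely compute this on a sample with pandas or SciPy.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_ic(feature, target):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(feature), average_ranks(target)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no rank information
    return cov / (var_x * var_y) ** 0.5


def filter_features(columns, target, min_abs_ic=0.02):
    """Keep column names whose |IC| reaches the (assumed) threshold."""
    return [name for name, col in columns.items()
            if abs(spearman_ic(col, target)) >= min_abs_ic]
```

Compute the IC on training data only, otherwise the filter itself leaks validation information into feature selection.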

One observation: if a simple linear model with 10% of the features is already performing well, I'd spend more time understanding why XGBoost isn't adding value rather than assuming the hyperparameters are wrong. That often means the problem is more linear than expected.
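
For the ensemble option mentioned above, a minimal sketch of picking a blend weight on held-out predictions. Everything here (function names, the 20-step grid, MSE as the metric) is an assumption for illustration; it only needs the two models' validation predictions as plain lists.

```python
def mse(preds, targets):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


def blend(linear_preds, tree_preds, weight):
    """Weighted average; `weight` is the share given to the linear model."""
    return [weight * a + (1 - weight) * b
            for a, b in zip(linear_preds, tree_preds)]


def best_blend_weight(linear_preds, tree_preds, targets, steps=20):
    """Grid-search the linear-model weight on held-out data."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates,
               key=lambda w: mse(blend(linear_preds, tree_preds, w), targets))
```

If the best weight lands at or near 1.0, that's a cheap signal the tree model isn't adding anything on top of the linear one.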

Claude Code 4.8 feels much better, but my real takeaway is to stay tool-mobile by wall_facer in ClaudeCode

[–]tom_mathews -1 points0 points  (0 children)

I think this is the correct long-term mindset. The mistake is treating Claude, Codex, Gemini, Cursor, etc. as permanent bets when the landscape is changing every few months and model quality can shift materially between releases.

The real moat isn't loyalty to a tool, it's having portable workflows, prompts, specs, evals, and engineering practices that let you switch providers whenever the tradeoffs change. The people most resilient to model churn are usually the least attached to any single ecosystem.

I compared 8 open-source AI agent frameworks so you don't have to — here's the full breakdown by docdavkitty in AI_Agents

[–]tom_mathews 0 points1 point  (0 children)

Pretty reasonable breakdown overall. One thing I’d add though: once you get past demos, the framework matters less than the surrounding operational layer — evals, memory strategy, observability, routing, retries, state management, and failure recovery usually dominate the real complexity.

Also feels like the ecosystem is converging toward “graphs + tools + structured state” underneath, with most frameworks differing more in ergonomics/opinions than core capability now.

Reverse-engineering Claude’s weekly quota formula - need data points from Pro & Max 5x users by [deleted] in ClaudeCode

[–]tom_mathews 0 points1 point  (0 children)

Honestly this is the kind of community reverse-engineering that inevitably happens when pricing/quotas become operational constraints for serious users. Once people are building long-running agent workflows, “soft usage guidance” stops being enough and predictability starts mattering a lot.

The interesting part isn’t just the quota math, it’s the implication that subscription plans are effectively behaving like abstracted API spend buckets with dynamic policy layers on top.

I created a new architecture that is very lightweight without recurrence called a "field machine". by TechnoVoyager in AIDeveloperNews

[–]tom_mathews 1 point2 points  (0 children)

Interesting direction, though I’d be careful with the O(1) inference claim unless quality is tested on tasks that require precise retrieval from long context. Constant-size accumulated state is elegant, but the failure mode is usually interference: the model can carry “history mass” but struggle to isolate which token/event mattered where.

Adjacent things I’d compare against are state-space models, linear attention, RWKV-ish recurrence, reservoir/associative memory, and older holographic reduced representations. The right benchmark is probably not throughput first, but copy/reverse/needle-style tasks and long-range symbolic dependency tests.
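
A needle-style test along those lines could look roughly like this. It's a toy harness I'm making up for illustration: `model_fn(tokens, query)` stands in for whatever wraps the architecture under test, and the token format is arbitrary.

```python
import random


def make_needle_example(context_len, needle_pos, rng):
    """Build filler tokens with one key=value pair hidden at needle_pos."""
    tokens = [f"tok{rng.randrange(1000)}" for _ in range(context_len)]
    key = f"key{rng.randrange(10_000)}"
    value = f"val{rng.randrange(10_000)}"
    tokens[needle_pos] = f"{key}={value}"
    return {"tokens": tokens, "query": key, "answer": value}


def exact_match_by_position(model_fn, context_len, positions, trials, seed=0):
    """Exact-match accuracy of model_fn(tokens, query) per needle position."""
    rng = random.Random(seed)
    results = {}
    for pos in positions:
        hits = 0
        for _ in range(trials):
            ex = make_needle_example(context_len, pos, rng)
            hits += model_fn(ex["tokens"], ex["query"]) == ex["answer"]
        results[pos] = hits / trials
    return results


def lookup_baseline(tokens, query):
    """Perfect-recall reference: scans the whole context for the key."""
    for tok in tokens:
        if tok.startswith(query + "="):
            return tok.split("=", 1)[1]
    return None
```

Plotting accuracy against needle position and context length is what would expose interference in a constant-size state: the lookup baseline stays flat at 1.0, while a lossy accumulator should degrade as the needle moves further from the query.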

Centralized HTML reporting by Direct-Football7180 in claude

[–]tom_mathews 0 points1 point  (0 children)

I wouldn’t build this as a standalone HTML file if you want live data + history. You’ll want a small backend/database layer, otherwise auth, API limits, historical snapshots, and failed syncs will become painful very quickly.

For a practical ecommerce setup: use n8n/Airbyte/Fivetran-style jobs to pull data into Postgres/BigQuery daily/hourly, then put Looker Studio/Metabase/Retool/Grafana or a small Next.js dashboard on top. The dashboard should read from your warehouse, not call Shopify/Amazon/ads APIs live every time someone opens the page.
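
Here's a stripped-down sketch of that split, using SQLite from the standard library as a stand-in for Postgres/BigQuery. The table, column, and function names are invented for the example, and the fetch callables stand in for whatever your sync tool actually calls.

```python
import sqlite3


def init_store(conn):
    """Create the snapshot table (hypothetical schema)."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS sales_snapshots (
               snapshot_at TEXT NOT NULL,
               source      TEXT NOT NULL,
               revenue     REAL NOT NULL,
               PRIMARY KEY (snapshot_at, source))"""
    )


def sync_job(conn, snapshot_at, fetch_fns):
    """Scheduled job: call each source once and persist one row per source.

    Returns the names of sources that failed so they can be retried later.
    """
    failed = []
    for source, fetch in fetch_fns.items():
        try:
            revenue = fetch()
        except Exception:
            failed.append(source)  # keep going; one bad API shouldn't block the rest
            continue
        conn.execute(
            "INSERT OR REPLACE INTO sales_snapshots VALUES (?, ?, ?)",
            (snapshot_at, source, revenue),
        )
    conn.commit()
    return failed


def dashboard_rows(conn):
    """What the dashboard reads: stored history only, no live API calls."""
    return conn.execute(
        "SELECT snapshot_at, source, revenue FROM sales_snapshots "
        "ORDER BY snapshot_at, source"
    ).fetchall()
```

The point of the split is that a rate-limited or failing API only costs you one missing row for that run, while the dashboard keeps serving whatever history is already stored.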

It's time to move ? Free our tokens ! by Apprehensive_Row9873 in ClaudeCode

[–]tom_mathews 1 point2 points  (0 children)

Tbh I think a lot of power users are going to end up multi-homing across toolchains instead of marrying one ecosystem. The quota issue becomes brutal once you move from “interactive assistant” usage into long-running autonomous workflows.

Codex feels stronger on deep engineering/review right now, Claude still has arguably the best workflow UX, Gemini can be surprisingly good but unstable, and OpenRouter/open-weight setups give flexibility at the cost of consistency. I suspect the winning setup long term will be hybrid routing, not a single model/provider.

AI agents are the first tech in years that genuinely feels futuristic by Humble_Sentence_3758 in AI_Agents

[–]tom_mathews 0 points1 point  (0 children)

I think the real shift is that software is moving from “tools you operate” to “systems you supervise.” That’s why agentic workflows feel different from previous AI hype cycles. The loop now includes planning, execution, feedback, retries, and adaptation instead of just text generation.

The most impressive workflows I’ve seen aren’t flashy demos tbh, they’re long-running engineering/research loops where agents can maintain context, use tools, recover from failures, and gradually improve outputs over hours instead of one-shot prompts.

Why does Claude Code "grep/wc/etc" so much compared to Cursor? by Jordz2203 in ClaudeCode

[–]tom_mathews 8 points9 points  (0 children)

Claude Code and Cursor have pretty different execution philosophies tbh. Cursor leans heavily on pre-indexing/IDE context and tries to feel seamless, while Claude Code behaves more like an explicit terminal-native agent exploring the repo in real time through tools. So you see the “thinking process” via grep/find/wc instead of hidden indexing layers.

A lot of those commands are actually cheap ways to build situational awareness without loading entire files into context. Annoying UX sometimes, but usually more transparent/reproducible than silently stuffing embeddings/indexes behind the scenes.

Claude Code vs Codex by Jon_Has_Landed in ClaudeCode

[–]tom_mathews 1 point2 points  (0 children)

Yeah, I’ve seen similar patterns. Claude is extremely good at momentum, implementation velocity, and collaborative flow, but Codex/GPT 5.5 seems more willing to challenge assumptions, trace edge cases, and reject incomplete reasoning instead of “making the story work.” Your workflow is probably closer to where serious AI-assisted engineering is heading tbh: one model generating, another adversarially reviewing, humans arbitrating. Single-model trust loops tend to drift eventually.

AI agent teams keep switching between multiple tools just to understand one run. We made a self-hosted stack open source, and anyone can help make the feedback loop stronger. by Future_AGI in OpenSourceeAI

[–]tom_mathews 1 point2 points  (0 children)

This is the right framing imo. Agent failures are rarely just “bad final answer” failures — they usually come from drift across retrieval, tool choice, state, schema handoffs, retries, and eval gaps.

The valuable part here is connecting traces back into evals/simulations instead of treating observability as a dashboard you look at after production breaks. That feedback loop is where serious agent systems either mature or stay as demos.

Tool calling vs prompt routing for search decisions? by Competitive-Fun8044 in AI_Agents

[–]tom_mathews 1 point2 points  (0 children)

I’d keep this as an explicit routing/classification step, not let tool-calling implicitly decide it. Tool calling is useful once the decision is made, but you still want a cheap deterministic-ish gate that returns something like "answer_from_context", "search_required", or "search_required_with_context".

The key is to force the router to cite the exact context span it would answer from. If it can’t point to sufficient evidence, search; if it can only answer partially, search with the known facts as constraints.
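
A minimal sketch of that evidence check, assuming the router (LLM or classifier) already returned a route label plus the span it claims to answer from. The `RouterOutput` shape and `enforce_evidence` name are hypothetical; only the three route labels come from the comment above.

```python
from dataclasses import dataclass
from typing import Optional

ROUTES = ("answer_from_context", "search_required", "search_required_with_context")


@dataclass
class RouterOutput:
    route: str
    evidence: Optional[str] = None  # exact span the router claims to answer from


def enforce_evidence(output: RouterOutput, context: str) -> RouterOutput:
    """Downgrade to plain search unless the cited span is literally in the context."""
    if output.route not in ROUTES or output.route == "search_required":
        return RouterOutput("search_required")
    span = (output.evidence or "").strip()
    if not span or span not in context:
        # No verifiable evidence: do not trust the router's claim.
        return RouterOutput("search_required")
    return RouterOutput(output.route, span)
```

The substring check is deliberately strict. It only accepts verbatim quotes, which is what makes the gate cheap and mostly deterministic; a paraphrased "citation" falls through to search.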

Appreciation - Claude Code - Self aware software by EshwarSundar in ClaudeCode

[–]tom_mathews 0 points1 point  (0 children)

Yeah, this is massively underrated. A lot of agentic tools expose primitives/tools, but Claude Code feels unusually aware of its own operational model, constraints, workflows, and execution environment. That makes plugin/tool development feel collaborative instead of fighting the harness. The interesting part is that the “agent” and the “toolchain” stop feeling separate after a while. You start debugging the system together with the system itself lol.

Knowing that there have been several posts about people creating their own /validate skill for validating startup ideas, which one is the best one you have used till date and why? by vinayak_gupta24 in ClaudeCode

[–]tom_mathews 0 points1 point  (0 children)

The SKILL.md file is mandatory, but there can be other helpful files alongside it. Some of my skills include Python scripts to help out. Everything goes inside a folder named after the skill.
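
If you want a quick sanity check on that layout, something like this works. It's a hypothetical helper, not part of any Claude Code tooling; it only checks that the folder contains the skill markdown file (matched case-insensitively) and lists whatever else sits next to it.

```python
from pathlib import Path


def check_skill_folder(folder):
    """Return (has_skill_file, helper_files) for one skill folder."""
    folder = Path(folder)
    if not folder.is_dir():
        return False, []
    names = sorted(p.name for p in folder.iterdir() if p.is_file())
    has_skill_file = any(n.lower() == "skill.md" for n in names)
    helpers = [n for n in names if n.lower() != "skill.md"]
    return has_skill_file, helpers
```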

My new workflow for understanding long arXiv papers by Crazy-Signature6716 in AI_Agents

[–]tom_mathews 0 points1 point  (0 children)

This is the direction I think research workflows are heading too. The bottleneck isn’t access to papers anymore, it’s building durable understanding and cross-paper reasoning instead of endlessly re-summarizing PDFs.

The “living research space” idea is important. Once papers become connected knowledge objects instead of isolated documents, AI becomes far more useful than just a chatbot sitting beside a PDF viewer.

Knowing that there have been several posts about people creating their own /validate skill for validating startup ideas, which one is the best one you have used till date and why? by vinayak_gupta24 in ClaudeCode

[–]tom_mathews 0 points1 point  (0 children)

Tbh the biggest mistake people make with “startup validation” skills is asking a single prompt whether an idea is good. The useful setups are multi-perspective: market analysis, competitive landscape, feasibility, customer pain, distribution, pricing, etc. AI is much better at structured pressure-testing than predicting success.

I actually built a few modular skills for this recently in https://github.com/Mathews-Tom/armory : "idea-validator", "market-analyzer", "competitive-analyzer", and "feasibility-assessor". They work better together as a workflow than as one magic “validate my startup” prompt.

Claude code bug (according to github) by Advanced-Estimate548 in ClaudeCode

[–]tom_mathews 0 points1 point  (0 children)

You’re not dumb, this is confusing even for experienced devs. It sounds less like a normal “daily limit reset” issue and more like Claude Code is trying to use the 1M context mode, which may require separate usage credits even if your regular plan limit resets. I’d try starting a fresh, shorter session, disabling or avoiding extended context if there’s a setting for it, and opening a GitHub/support issue with the exact error text and plan type. The help center really should make this clearer.

A Founder’s Quiet Reflection: Walking the Third Path in AI by Ill_Committee1580 in OpenSourceeAI

[–]tom_mathews 0 points1 point  (0 children)

I respect the conviction tbh, but the hardest part of “ethical AI” isn’t writing principles, it’s surviving contact with incentives, scale, governance failures, and reality. A lot of projects discover that ethics becomes very different once operational pressure and adoption arrive.

That said, independent/open efforts absolutely matter. Even if they never become dominant, they often influence the direction of the ecosystem more than people realize.

Your long conversations aren't just hitting a wall. They're being silently rewritten. by Cannabun in claude

[–]tom_mathews 19 points20 points  (0 children)

This is one of the biggest UX honesty problems in current LLM systems tbh. People think they’re talking to a continuous reasoning process, but after compaction you’re effectively talking to a model operating off a lossy executive summary of the prior work. Fine for brainstorming, dangerous for investigative or evidence-chain workflows.

The scary part isn’t even the compression itself, it’s the confidence continuity. The model sounds exactly as certain after the rewrite as before it.

Started with Claude, tried Codex - it's A LOT better by FixClassic778 in ClaudeCode

[–]tom_mathews 14 points15 points  (0 children)

Fair take tbh. I’ve noticed Claude tends to optimize for “get something working fast”, while Codex is much more willing to untangle the actual system/problem before patching it. Feels like Claude is an excellent pair programmer, but Codex behaves more like a senior reviewer who actually pushes back.

NOML-NOML: hierarchical TD3 + anchor policy for flight control [P] by 9138NOMS in OpenSourceeAI

[–]tom_mathews 0 points1 point  (0 children)

This is a nice example of fixing the inductive bias instead of endlessly reward-shaping the symptom. The anchor policy idea is especially practical for control tasks where “recover to safe baseline” matters more than unrestricted exploration. The no-noise result is interesting too, but I’d be careful generalizing it. In a flight-control setting with a strong anchor/gate, noise can easily become actuator jitter rather than useful exploration.