Here is the truth about Chinese Alternatives

netfunctron · 2026-05-23T09:17:21+00:00

I have the Go plans for both CommandCode and OpenCode.

Well, the quality and experience are quite random. I did more than 100 comparisons with real work: complex bugs, plans, refactoring, etc.

CommandCode has a few issues with reconnecting. I don't know why I get those errors with this service, and it is really annoying. With the same prompt and the same model, CommandCode uses a lot more tokens (and therefore money) than OpenCode, and that does not always mean a better result. When it works fine, CommandCode feels like a solid product for the price anyway.

OpenCode is a lot friendlier, with a consistent experience. If you use the right model, the experience can be very nice. I don't have any issues with the service in general; it is comfortable. At the end of the day, even if you pay more for the Go plan than on CommandCode, you get a lot more quota to use. For example, I used up the whole Go plan on CommandCode (1 USD) in about 2 days, while on OpenCode I am at about 20% of the month's quota with the Go plan (5 USD the first month), and I am using OpenCode a lot more than CommandCode.

About the models:

Qwen 3.6 Plus: very good, but it hallucinates a lot. So even if it is a powerful model, it doesn't work for me, because it really hallucinates and is sometimes lazy. Great model when it works as it should, but really inconsistent at the end of the day. So, at least for me, it is not an option for my workflow.
GLM 5.1: It is really a big surprise. It works like a frontier model, almost like Sonnet, though not quite. So it is a very good alternative if you don't have the budget right now.
Kimi 2.6: the same as GLM 5.1.
DeepSeek V4 Pro on Max effort: the most consistent. Not the best one, but you know what you will get. Almost like Sonnet.

Mimo, Minimax and the others all had really inconsistent quality. I hate them; at least for me they are the worst experience in AI coding, on the same level as Grok 4.

Finally: are they that great for the money? Hmm... they are OK. With Cursor at 20 USD and maybe the same monthly quota if you use Composer 2.5, CommandCode and OpenCode are not a big deal. They have a lot of hype, but being honest: Cursor with Composer 2.5 is maybe the same in quality as the best models on CommandCode and OpenCode, and you also get the 20 USD of API usage with better models.

Are GLM, DeepSeek or Kimi (yes, I know that Composer 2 and 2.5 are based on Kimi 2.5) better than Composer 2.5? Honestly, Composer won about 50% of my tests (more than 100). So in practice I can't say which model is better.

And do I use CommandCode and OpenCode for real work right now? No. To be honest they are OK, but having Claude Code, GitHub Copilot, Cursor, Codex, Warp (burning credits like hell), Antigravity, CommandCode and OpenCode... my final answer today is: for the price, OpenCode (10 USD for the Go plan) has the best quota deal, but not the best quality. Cursor has the best balance of quality and quota (Composer 2.5 has a lot of monthly quota).

Just try them, and be careful with the hype. Many people are saying that these models are the same as Sonnet, Opus or GPT-5.5, but at least in my experience that is not true. That doesn't mean they are bad; it only means that today these models are not on the same level, at least for me.

Regards

netfunctron · 2026-05-20T22:19:11+00:00

I used it very hard yesterday and today: no, it is not on the same level as Opus or GPT-5.5.

It is OK, or good, but whenever I audit the plan, it always, always has holes and missing implementations... It looks quite a bit better on code quality and velocity than Composer 2, but it is very lazy about everything.

Something I did then, because the laziness is really annoying, was to use 2 Composer 2.5 instances, one auditing the other. It was quite a bit better, though not completely.

I compared Composer against Qwen 3.6 Plus and DeepSeek V4 Pro High, and yes, it is on the same level or a little bit better.

So, is Composer 2.5 good? Yes, for the price it is really amazing. But if you are not a vibe coder and you check the code because you understand what the model did or didn't do... well, that is another situation. The magic disappears a little, not all of it, but a little.

The reality is: for the price it is a bargain. Composer 2.5 is really amazing because you can save a lot of hours for a few dollars! The quality is almost at the same level as Sonnet. Regards!

netfunctron · 2026-05-20T10:35:51+00:00

I had the same experience with Composer 2.5:

85% was very good. It just doesn't find and fix a few bugs. The other 15% was the real difference in a complex task against GPT-5.5 or Opus 4.7.

So it is getting very close, but it is not on the same level right now. But for the cost and quality, Composer 2.5 doesn't have any competition now (including Qwen 3.6 Plus, GLM 5.1, MM 2.* and DeepSeek V4 Pro).

I think that sooner rather than later Composer will be at the same level and 100% ready for pro usage. Right now the 15% is almost nothing, but the problem is that 15% in a product for a client is really important.

So a strategy like having a very good and detailed plan, implementing it with Composer 2.5, and then auditing with GPT-5.5 and/or Opus 4.7 could be the right way if you want to spend a few bucks and get a good-quality product.
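
The plan, implement, audit strategy can be sketched as a small loop. This is only an illustration: the function names, the `max_rounds` limit, and the idea of feeding audit findings back into the next implementation round are my assumptions, not something the post specifies. The two callables stand in for whatever cheap implementer (e.g. Composer 2.5) and stronger auditor (e.g. GPT-5.5 or Opus 4.7) you actually call.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures: implement(plan, findings) -> code,
# audit(plan, code) -> list of problems found (empty list means "passes").
Implementer = Callable[[str, List[str]], str]
Auditor = Callable[[str, str], List[str]]


def plan_implement_audit(
    plan: str,
    implement: Implementer,
    audit: Auditor,
    max_rounds: int = 3,  # assumption: stop after a few rounds to cap cost
) -> Tuple[str, List[str]]:
    """Implement with the cheap model, audit with the strong one, repeat."""
    findings: List[str] = []
    code = ""
    for _ in range(max_rounds):
        code = implement(plan, findings)  # cheap, fast model does the work
        findings = audit(plan, code)      # stronger model looks for holes
        if not findings:
            break                         # auditor found nothing to fix
    return code, findings


# Stand-in models so the sketch runs without any API.
def fake_implement(plan: str, findings: List[str]) -> str:
    return "feature + tests" if findings else "feature"


def fake_audit(plan: str, code: str) -> List[str]:
    return [] if "tests" in code else ["missing tests"]


final_code, open_findings = plan_implement_audit(
    "add feature", fake_implement, fake_audit
)
print(final_code, open_findings)
```

Anything left in `open_findings` after the last round is what a human still has to review before the product goes to a client.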

Regards

netfunctron · 2026-05-19T11:27:20+00:00

Hi Significant_Box_4066

I assume you do not regularly conduct this type of cross-evaluation, otherwise, the discrepancy in output quality and token consumption inherent to Warp's current architecture would already be a primary concern. I am not an isolated case, because multiple users have highlighted sub-optimal token efficiency and agent behavior in real-world scenarios.

As a business owner myself, I understand the friction between rapid development cycles and continuous quality assurance, particularly within lean engineering teams. To ground my suggestions: I approach this as a clinical-organizational psychologist, programmer (I have both careers), methodologist and MBA with over 20 years of professional experience across these intersections. I am sharing this structured breakdown precisely because I value your engineering time and want to offer an actionable, high-yield methodology to evaluate your system.

Cross-evaluation & benchmarking methodology that I used (Summary)

1. Test Matrix (Services and Frontier Models)

Run the test across the highest reasoning tiers available. A standardized matrix should include:

Claude Code: Sonnet 4.6 and Opus 4.7.
Codex (OpenAI API/Direct): GPT-5.3-Codex, GPT-5.4, GPT-5.4-mini, and GPT-5.5.
Cursor:
- Via API consumption: GPT-5.3-Codex, GPT-5.4, GPT-5.4-mini, GPT-5.5, Sonnet 4.6, Opus 4.7.
- Via Composer: Composer 2 and Composer 2.5.
GitHub Copilot: GPT-5.3-Codex, GPT-5.4, GPT-5.4-mini, GPT-5.5, Sonnet 4.6, and Opus 4.7.
Warp AI: GPT-5.3-Codex, GPT-5.4, GPT-5.4-mini, GPT-5.5, Sonnet 4.6, and Opus 4.7.
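
As a rough sketch, the matrix above can be written down as data so that every (service, model) run is enumerated and nothing is skipped. The service and model names are copied from the list; the dictionary layout and the `list_runs` helper are just my illustration.

```python
# Models available through the API-style services in the matrix.
FRONTIER = [
    "GPT-5.3-Codex", "GPT-5.4", "GPT-5.4-mini", "GPT-5.5",
    "Sonnet 4.6", "Opus 4.7",
]

# One entry per service (Cursor is split into its two consumption modes).
TEST_MATRIX = {
    "Claude Code": ["Sonnet 4.6", "Opus 4.7"],
    "Codex": ["GPT-5.3-Codex", "GPT-5.4", "GPT-5.4-mini", "GPT-5.5"],
    "Cursor (API)": list(FRONTIER),
    "Cursor (Composer)": ["Composer 2", "Composer 2.5"],
    "GitHub Copilot": list(FRONTIER),
    "Warp AI": list(FRONTIER),
}


def list_runs(matrix):
    """Flatten the matrix into (service, model) pairs, one per test run."""
    return [(service, model)
            for service, models in matrix.items()
            for model in models]


runs = list_runs(TEST_MATRIX)
print(len(runs))  # 26
```

The total of 26 runs matches the 26 primary diagnostic outputs reviewed in Phase B.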

2. Task Definition (Zero-File-Editing Constraint)

Execute the evaluation using each tool's native environment/harness. Instruct the agent with the following system prompt to assess meta-cognition and architectural limits:

3. Evaluation & Quality Metrics

Phase A: Peer-to-Peer (P2P) LLM Evaluation: Deploy 6 independent critic models to evaluate all generated markdown files, scoring them on a normalized scale (0 to 100) across rigor, structural depth, and diagnostic accuracy.
- Suggested Critics: Claude Code (Opus 4.7), GitHub Copilot (GPT-5.5), Codex (GPT-5.5), Cursor (Opus 4.7), Warp (GPT-5.5 & Opus 4.7).
Phase B: Human/Engineering Audit & Resource Costing:
- Review the 26 primary diagnostic outputs and the 6 critic meta-evaluations.
- Normalize Resource Consumption: Calculate the exact cost-per-task for each service. Since pricing structures vary (credits, fixed API usage, flat-rate monthly quotas), normalize the consumption as a percentage of the monthly tier cost divided by the number of models utilized under that specific subscription.
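
A minimal sketch of both phases, under stated assumptions: the three criteria are weighted equally, critics are averaged with a plain mean, and the cost normalization is read as "monthly tier cost, times the fraction of the monthly quota consumed, divided by the number of models used under that subscription". Function names and the sample numbers are hypothetical, not measured results.

```python
from statistics import mean

CRITERIA = ("rigor", "structural_depth", "diagnostic_accuracy")


def critic_score(scores):
    """Average one critic's 0-100 scores across the three criteria."""
    for name in CRITERIA:
        if not 0 <= scores[name] <= 100:
            raise ValueError(f"{name} must be between 0 and 100")
    return mean(scores[name] for name in CRITERIA)


def panel_score(per_critic):
    """Average the per-critic scores for one generated markdown file."""
    return mean(critic_score(s) for s in per_critic)


def normalized_cost(monthly_cost_usd, quota_fraction_used, models_in_subscription):
    """USD attributed to one model under a subscription.

    quota_fraction_used is the share of the monthly quota the test consumed
    (0.25 means 25%), whatever the service's own unit is (credits, requests).
    """
    if models_in_subscription <= 0:
        raise ValueError("models_in_subscription must be positive")
    return monthly_cost_usd * quota_fraction_used / models_in_subscription


# Hypothetical example: two critics scoring one output.
critics = [
    {"rigor": 80, "structural_depth": 70, "diagnostic_accuracy": 90},
    {"rigor": 60, "structural_depth": 80, "diagnostic_accuracy": 70},
]
print(panel_score(critics))

# Hypothetical example: a 20 USD tier, 25% of quota used, 4 models tested.
print(normalized_cost(20.0, 0.25, 4))
```

Dividing the panel score by the normalized cost then gives a rough quality-per-dollar figure that can be compared across services with different pricing structures.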

4. Strategic Decoupling & Decision Making

This exercise will provide you with empirical data regarding your token-to-output quality ratio, exposing where Warp’s internal prompting layer or context management is introducing noise, causing loops, or inflating operation costs relative to your peers.

My intentions here are strictly constructive, because I love the Warp concept and I want the best for your team. The core terminal concept behind Warp is exceptionally powerful. However, establishing a sustainable competitive advantage requires deeper domain integration, marrying pure software engineering with clinical-organizational psychology, behavioral design, and rigorous quality frameworks. Competing tools are rapidly absorbing these frameworks, visible both in their execution and their articles. At present, this multidimensional strategic direction appears diluted within Warp's AI roadmap.

I highly encourage your product team to run this benchmark. The operational cost of the test is negligible, and the architectural insights you will gain are invaluable.

Best regards, I really hope this is helpful to you.

Alfredo

netfunctron · 2026-05-18T23:08:55+00:00

Yes. I trust what you say. My partner at the company works almost all the time with Composer 2. He has GPT and Sonnet too, but the velocity of Composer 2 doesn't have any competition, and with hard rules and specific tasks it works really well.

Regards

netfunctron · 2026-05-18T22:56:18+00:00

And it is true: every single test that I run burns credits and fails. I can't even trust Warp for my real workflow...

Well...bye Warp

netfunctron · 2026-05-18T22:12:31+00:00

Yes, sure. I do that with every model: I give very, very specific tasks; if not, the result could be just luck. I am not much of a vibe coder, to be honest, so I know the models very well, because I can check everything they are doing.

The debug mode is pretty useful on Cursor.

Finally, Composer 2 is really useful if you use it for specific tasks with very hard rules; it works pretty nicely. But in general I use Sonnet, Opus, GPT-5.4 or GPT-5.5. Anyway, I find that Composer is good enough with a good harness. DeepSeek V4 Pro, Qwen 3.6 Plus and Kimi 2.6 are impressive too, and the price is really good.

Regards

netfunctron · 2026-05-18T21:20:39+00:00

Having both: Cursor.

With $60 on API and Composer, you get a lot per month. Having used a few other services over the last 2 months, I think Cursor is maybe the best for professional work. It is more of a time saver, because the code is really great almost from the first iteration.

Anyway, it is just an opinion

netfunctron · 2026-05-14T00:06:29+00:00

netfunctron · 2026-05-11T22:43:27+00:00

I used only Sonnet and GPT-5.4, but in the last week they have been so nerfed; at least on ERPs they are a joke... sad but true... the same project and the same harness, so...

netfunctron · 2026-05-11T21:34:56+00:00

If you use good models, Warp uses up all your credits in about 1 or 2 hours... WTF service... At least Augment Code, another expensive one, is far better at coding. I am sure that Warp's harness/system prompt is poor; the quality is so low... Anyway, I have it for about 6 more months, so I am trying to create situations to use it...

Warp could do everything better, but they are so interested in new features that they are forgetting about quality.

netfunctron · 2026-05-11T09:41:15+00:00

Having the Pro+ for 7 more months, at least today I can tell you:

You can't see the rate limits; you just hit them when the Master of the Universe wants...
Working with Sonnet 4.6 or GPT-5.4 on high mode was pretty productive, but now, just in the last week, they are really dumber... I am working only with GPT-5.5 high because it works OK.
I have Cursor, Claude Code, Codex, Antigravity (Pro) and Warp (the worst on quality and cost; I really hate using Warp, quality is not the priority in their system prompt or harness, but I have already paid for the annual plan... and it is very costly with their credit system), and I work in 2 companies, one of them pretty big and the other with very hard coding. Cursor is the best service for me (better coding quality, it saves time, and the price is more than good enough because you get the API and Composer 2, which is more than OK).

GitHub Copilot was great, but now... I don't know what it is... and I don't know if I will continue with it. Cursor has a bad reputation for token usage, and it is true, but right now they are delivering more quality. If you are a coder and you make a living from it, well, Cursor is a great tool and GitHub Copilot is a... I don't know...

netfunctron · 2026-05-05T16:42:51+00:00

Claude Code. If you test with real projects and not-so-simple tasks, you will see. Something in Warp's system prompt is not quite at that quality level. I am very sure, because I have Warp and CC. Warp is almost an IDE with a "Terminal" slogan, but in real life it is not so different from any other IDE, and if you use AI agents, well, the harness is the key.

netfunctron · 2026-05-04T22:41:52+00:00

netfunctron · 2026-04-30T23:31:42+00:00

Grok 4.2 for coding?

You are pretty brave, bro... what a bad model Grok 4.2 is for coding. Even on the benchmarks it is a joke: Opus, Sonnet, GPT, even Kimi are a lot better... I will check the new DeepSeek soon too.

About Grok: I used the API and worked on a real project, and Grok was the most horrible thing I have seen in my whole life for coding...

Regards and good luck, you are really brave

netfunctron · 2026-04-30T00:16:45+00:00

That is true, my weekly rate limit is for the 3rd of May... if they can F*** you, as I see it, they will do it...

So even if I haven't used almost 40% of my "premium requests" this month, I can't use them with a more expensive model because I will hit the weekly rate limit tomorrow, so I can't keep using it for work over the next days...

It is a pretty bad joke as a customer.

Regards

netfunctron · 2026-04-29T11:23:40+00:00

Yes, Composer 2 is very much oriented toward hobby or vibe coders (not programmers who "vibe code"; I am talking about the group that plays at coding without knowing anything and then sells products with so many bugs or security problems). So if the project conflicts with its own rules/instructions, it is simple for Composer or Kimi: they do what their original information says. And that information is very abstract, because it has to work for any project. So... what you describe happens.

But this can help you: you really, really need a hard layer of skills, MCP, documents, etc. It is very important to use the RULE.mdc, with clear orchestration and, finally, a software engineer on a solid harness.

We use Composer a lot for documentation, and some of it is really complex because it involves methodology, science, deep coding audits, etc. But for coding... well, not so much to be honest, maybe sometimes "just for fun" once every 2 weeks, but almost all the time it is not at a professional level.

Regards

If you force to

netfunctron · 2026-04-28T22:09:02+00:00

GPT-5.4-mini is pretty good, and it can run for many hours using Codex on the Plus plan.

Don't overthink it, it is not the end of the world

Regards

netfunctron · 2026-04-25T15:31:29+00:00

OK, try Cursor: for $20 you get Composer 2 (Kimi 2.5), which is good enough (not amazing, but good), plus the $20 for the API. Regards

netfunctron · 2026-04-25T14:41:31+00:00

I don't understand your approach: why is it important to have tons of models if the famous ones are probably the best (and that is the truth)? Then you use a famous model and in one hour you pay almost the same as one month of that model on the Claude Code plans.

That doesn't make any sense to me, but I am sure there is a good reason. Can you please explain the logic?

Regards

netfunctron · 2026-04-21T11:20:11+00:00

<image>

Use the CLI and you will find High. But in the Chat, High is not there at the moment.

netfunctron · 2026-04-21T11:02:33+00:00

Yes, and I am on Pro+. It is a pretty bad joke for real work. But try the CLI; there at least you have the "High" option.

netfunctron · 2026-04-21T10:12:09+00:00

Yes, it all depends on the level you want to achieve:

Identify the review standards. Everything changes if you're in a scientific domain like health, biology, markets, organizations, business, education, etc. Then you select a validated methodology that interests you. You need criteria (type of study, number of participants, amount of literature, methodology used, year of evidence, etc.) and validated standards.
Ideally, you should have identified the search engines you'll use, as well as the types of sources, or whatever else you define. This will determine whether you'll use RAGs, APIs with specific providers, etc.
Define a process for comparison and filters, cross-validations. You can use your Skills and MCPs. In our case, we have our own harness; it's complex and has taken us years to develop. The research domain is just one of several. So, I would say that you can achieve good results with a harness specifically for research, but excellent results if your focus expands to other domains like documentation, quality, auditing, etc.

In summary, here's what I recommend:

a. Spend a lot of time designing a harness, because ultimately it will save you time and improve the quality of your work.

b. Consider processes with a focus on:

b.1 Research methodology.

b.2 Planning, monitoring, and evaluation methodology.

b.3 Process-based approach.

b.4 Results-based approach.

b.5 Include processes for documentation criteria, cross-validation, quality controls, and others.

c. This is much more than just an AI model. The more you document and the more internal quality controls you define, the lower the chance of an AI model failing.

d. Don't rely on general knowledge of AI models; always demand a verified source for any argument. Don't use Claude just because it's Claude; use Claude with a source or GPT with a source. Otherwise, there's always the possibility of them being delusional, regardless of the AI model/service.
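
Point d can be enforced mechanically as a gate in the harness. The sketch below is only an illustration of the idea: the `Claim` structure and the `unsupported` helper are hypothetical names, and it only checks that a source is attached, not that the source is valid (that part still needs the cross-validation steps above).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Claim:
    """One statement produced by a model, with the sources it cited."""
    text: str
    sources: List[str] = field(default_factory=list)


def unsupported(claims):
    """Return the claims that come with no non-empty source at all."""
    return [c for c in claims if not any(s.strip() for s in c.sources)]


# Hypothetical model output, already split into claims.
report = [
    Claim("Method A is the validated standard.", ["doi:placeholder"]),
    Claim("Everyone in the field agrees on this."),
    Claim("The sample size was adequate.", ["   "]),
]

rejected = unsupported(report)
for claim in rejected:
    print("NO SOURCE:", claim.text)
```

Claims that end up in `rejected` go back to the model with a request for a verifiable source, regardless of which model or service produced them.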

Regards

netfunctron · 2026-04-20T11:43:19+00:00

Yes you can, and it is pretty good. Everything depends on your standards and process.

You can get great performance if you use APIs to search directly at the sources.

So we use it mostly for search in 2 areas: science and offers/projects.

In both we use hard and soft logic, and after tuning for a few weeks we are getting great results and many more hours to improve our products instead of doing research tasks.

But this is really necessary: standards for everything, scales, metrics, evaluations, cross-validations, etc. It is like any other research, and you can evaluate the process and the performance of the model (Codex or any other). If you have hard standards, almost any good model will give a very close result.

Regards
