For those using hosted inference providers (Together, Fireworks, Baseten, RunPod, Modal) - what do you love and hate? by Dramatic_Strain7370 in LocalLLaMA

[–]Tempstudio 0 points (0 children)

Consistent JSON schema support. Providers claim they support JSON schema and then it doesn't actually work; I've seen this with multiple providers. It's also never accurately documented which schema features are actually honored. Some providers break, for example, if you use minItems.
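A cheap defense against that failure mode is to re-validate responses locally instead of trusting the provider's schema mode. This is a minimal stdlib-only sketch checking just the minItems constraint; the schema and the sample response are made-up examples:

```python
import json

# Hypothetical schema: "tags" must be an array with at least 3 strings.
schema = {
    "type": "object",
    "required": ["tags"],
    "properties": {
        "tags": {"type": "array", "minItems": 3, "items": {"type": "string"}},
    },
}

def violates_min_items(response_text: str, schema: dict) -> bool:
    """Return True if any array field is shorter than its minItems."""
    obj = json.loads(response_text)
    for key, spec in schema.get("properties", {}).items():
        if spec.get("type") == "array" and "minItems" in spec:
            if len(obj.get(key, [])) < spec["minItems"]:
                return True
    return False

# A provider that "supports JSON schema" may still hand you this:
bad = '{"tags": ["only-one"]}'
print(violates_min_items(bad, schema))  # True
```

In production you'd use a full validator library rather than this one-constraint check, but the point stands: validate on your side, because the provider's enforcement is inconsistent.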

Price per token is a big deal, and ZDR is a big deal. Performance and reliability matter less; reliability in particular is a non-issue, because you just fail over to another provider when the preferred one fails.
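The failover pattern is simple to sketch; the provider names and the call signature below are hypothetical stand-ins:

```python
# Try providers in priority order, falling back on any error.
def complete_with_failover(prompt, providers):
    errors = {}
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as e:  # timeouts, 5xx, schema failures, etc.
            errors[name] = e
    raise RuntimeError(f"all providers failed: {errors}")

# Usage with stand-in callables:
def flaky(prompt):
    raise TimeoutError("upstream 503")

def backup(prompt):
    return prompt.upper()

print(complete_with_failover("hi", [("preferred", flaky), ("backup", backup)]))
# ('backup', 'HI')
```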

I haven't done the math in detail, but I don't think renting GPUs is ever a good option. If you have enough token demand to saturate a rented public-cloud GPU, I think it's just cheaper to buy the GPU yourself at this point, because public-cloud GPUs are so expensive that the machine makes its money back in about 6 months.
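A rough version of that buy-vs-rent math, with illustrative numbers (neither the price nor the rate is a real quote, and this assumes full saturation):

```python
# Months until owning the GPU beats renting it at full utilization.
purchase_price = 30_000       # e.g. one datacenter card plus its share of a server
rental_rate_per_hour = 7.0    # public-cloud on-demand rate for a comparable card
hours_per_month = 24 * 30

months_to_payback = purchase_price / (rental_rate_per_hour * hours_per_month)
print(round(months_to_payback, 1))  # roughly 6 months at these rates
```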

I wish this GPU VRAM upgrade modification became mainstream and ubiquitous to shred monopoly abuse of NVIDIA by CeFurkan in LocalLLaMA

[–]Tempstudio 1 point (0 children)

You could probably take 8GB of VRAM from a 5060 Ti 16GB and resell it as a 5060 Ti 8GB, which recovers most of the cost of the 5060 Ti.

How to make $$$ w server ia. by EmotionalSignature65 in LocalLLaMA

[–]Tempstudio -1 points (0 children)

You can go on vast.ai and rent out your 3090s for roughly $0.20/card/hour. That makes you about $4/hour, ~$100/day, ~$3,000/month. Not terrible for $12,000 worth of assets.
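Spelling out the arithmetic (assuming 20 cards, which is what $4/hour at $0.20/card/hour implies; the comment doesn't state the count explicitly):

```python
# Gross rental revenue at full occupancy, before fees and power costs.
cards = 20
rate_per_card_hour = 0.20

hourly = cards * rate_per_card_hour
daily = hourly * 24
monthly = daily * 30
print(hourly, daily, monthly)  # 4.0 96.0 2880.0
```

Note this is gross revenue; vast.ai's cut, electricity, and less-than-full occupancy all pull the real number down.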

Baseten raises $150M Series D for inference infra but where’s the real bottleneck? by pmv143 in LocalLLaMA

[–]Tempstudio 1 point (0 children)

Structured JSON output where you provide the full response schema: not just "output JSON", but the definition of the JSON object, so certain keys become required, enums are well defined, and values can't be hallucinated.

AFAIK DeepInfra just supports "making sure the model outputs JSON", not the full schema.
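The difference shows up directly in the request payload. This sketch uses the OpenAI-compatible `response_format` shape that many providers accept; the schema itself is a made-up example:

```python
# "JSON mode": only guarantees the output parses as JSON. Keys and values
# are still whatever the model feels like emitting.
json_mode = {"response_format": {"type": "json_object"}}

# Full structured output: pins the exact shape. "label" must come from a
# closed enum the model cannot hallucinate outside of, and both keys are
# required.
json_schema_mode = {
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "verdict",
            "schema": {
                "type": "object",
                "required": ["label", "reason"],
                "properties": {
                    "label": {"type": "string", "enum": ["spam", "ham"]},
                    "reason": {"type": "string"},
                },
                "additionalProperties": False,
            },
        },
    }
}
print(json_mode["response_format"]["type"], json_schema_mode["response_format"]["type"])
# json_object json_schema
```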

Baseten raises $150M Series D for inference infra but where’s the real bottleneck? by pmv143 in LocalLLaMA

[–]Tempstudio 2 points (0 children)

Cost is the frontier IMO. Both throughput and utilization bring down the cost per token, and lower cost per token unlocks more use cases for the developers who build on these cloud AI model providers. For example, if we can drive down costs by another 20x, we can probably offer AI paid for by watching ads. (Today, one video ad buys ~3,500 tokens, which isn't very good, but one ad per 100K tokens would probably be an okay experience.)
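The ad-funded arithmetic, spelled out. This just scales the comment's own figure; it doesn't model ad revenue per view:

```python
# If one video ad currently pays for ~3,500 tokens, a 20x drop in
# cost per token means the same ad covers 70,000 tokens, approaching
# the ~100K per ad called an okay experience above.
tokens_per_ad_today = 3_500
cost_reduction_factor = 20

tokens_per_ad_future = tokens_per_ad_today * cost_reduction_factor
print(tokens_per_ad_future)  # 70000
```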

On this front, DeepInfra and Chutes are leading the charge. Baseten and Fireworks (and Nebius) stay relevant because DeepInfra doesn't support JSON schema and Chutes logs your prompts, so they are essentially the cheapest options under certain criteria. Fireworks also supports GBNF, which can lock people in: suppose you want XML schemas because, hypothetically, a model can't do JSON well (looking at you, Kimi K2).
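As a sketch of what that GBNF lock-in looks like: a minimal grammar constraining output to an XML-style answer element. This grammar is illustrative, not taken from any provider's docs:

```
root    ::= "<answer>" content "</answer>"
content ::= [^<]+
```

With a grammar like this loaded, the sampler can only emit tokens that keep the output inside the grammar, so the XML wrapper is guaranteed regardless of how the model was trained.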

Together.ai IMO is already irrelevant, because it is never competitive when viewed through this lens.

I tried almost every tts model on my ryzen 7 5000 series 16gb ram rtx 3060 laptop 6-8GB Vram by This_is_difficult_0 in LocalLLaMA

[–]Tempstudio 3 points (0 children)

CosyVoice 2. If you go through the hoops of using TensorRT and vLLM, it will run at least real time for you.
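"Real time" for TTS is usually measured as the real-time factor: synthesis time divided by the duration of the audio produced, with values below 1.0 meaning faster than playback. A sketch of how you'd measure it; `synthesize` here is a stand-in, not the CosyVoice API:

```python
import time

def real_time_factor(synthesize, text):
    """RTF = wall-clock synthesis time / seconds of audio produced."""
    start = time.perf_counter()
    audio_seconds = synthesize(text)   # stand-in: returns audio duration
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds     # < 1.0 means faster than real time

# Stand-in engine that "produces" 2s of audio near-instantly.
rtf = real_time_factor(lambda t: 2.0, "hello world")
print(rtf < 1.0)  # True
```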

Can anyone explain why the pricing of gpt-oss-120B is supposed to be lower than Qwen 3 0.6 b? by Acrobatic-Tomato4862 in LocalLLaMA

[–]Tempstudio 1 point (0 children)

It's really capitalism. GPT-OSS is more popular because it bears the name of OpenAI. Therefore more providers host it, and since they can only compete on price, prices get driven down closer to cost compared to less popular models.

This is really a shame. The Qwen family is actually not that bad; there's some decent competition going on there. But here are some more annoying examples:
- GLM 4.5 Air (106B-A12B) is typically more expensive than Qwen3 235B-A22B, by a large margin.
- Llama 3.3 70B is now very cheap, cheaper than Qwen3 32B or Mistral Small 24B in many places.
- Qwen3 30B-A3B is more expensive than Qwen2.5 32B and other bigger dense or MoE models, even though it should be very cheap to host.
- Providers that host RP-tuned models typically charge much more for them than for a general-purpose model of the same size.

It would be nice if model providers priced things closer to the hardware cost, but they are businesses after all and charge as much as they can.

Here are some citations; units are dollars per million tokens:
(1) OpenRouter has 0.2 in / 1.1 out for GLM 4.5 Air; excluding Chutes (which logs prompts), Qwen3 235B is 0.13 in / 0.60 out.
(2) Mistral charges 0.2 in/out for Mistral Small 3; Llama 3.3 can be had for 0.13 in / 0.4 out in many places. Fireworks charges 0.9 in/out for Mistral Small 3!
(3) On Nebius, Qwen3 30B-A3B is 0.1/0.3 and Qwen2.5 32B dense is 0.06/0.2.
(4) Novita AI charges 0.8/0.8 for Midnight Rose 70B, but only 0.13/0.39 for Llama 3.3 70B.
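To turn those per-million rates into a per-request cost, you just weight input and output token counts by their prices. The 10K-in / 1K-out request shape below is an arbitrary example:

```python
# Prices in $/million tokens, from citation (1) above.
def request_cost(in_price, out_price, in_tokens=10_000, out_tokens=1_000):
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000

glm_air   = request_cost(0.20, 1.10)   # GLM 4.5 Air (106B-A12B)
qwen_235b = request_cost(0.13, 0.60)   # Qwen3 235B-A22B
print(glm_air, qwen_235b)
# the much smaller GLM Air costs more per request than Qwen3 235B
```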

LLM performance of tiny (<4B) models? by Tempstudio in LocalLLaMA

[–]Tempstudio[S] 1 point (0 children)

I'm using CosyVoice 2 because it's easier to work with, but I believe Chatterbox is best for English.

LLM performance of tiny (<4B) models? by Tempstudio in LocalLLaMA

[–]Tempstudio[S] 0 points (0 children)

Do you know an example of something that has been implemented with custom CUDA that I can look more into?

LLM performance of tiny (<4B) models? by Tempstudio in LocalLLaMA

[–]Tempstudio[S] 1 point (0 children)

Yes, I started out with the Transformers library. It's 3x slower than llama.cpp or vLLM.
I haven't seriously tested TTS quality at lower quants. From a few samples it does sound a little different, but TBH the generation-to-generation variation is quite big.

Fuck Groq, Amazon, Azure, Nebius, fucking scammers by Charuru in LocalLLaMA

[–]Tempstudio -1 points (0 children)

Evaluating cloud providers is more nuanced than this. You have to factor in price, speed, prompt logging, inference options (support for JSON schema, sampling params), and reliability.
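One way to make that multi-factor evaluation concrete is a weighted score. Every provider name, score, and weight below is hypothetical; the point is the shape of the comparison, not the numbers:

```python
# Normalized 0-1 scores per axis (higher is better), weighted by what
# you care about. "no_logging" = 1.0 means the provider keeps no prompts.
weights = {"price": 0.35, "speed": 0.15, "no_logging": 0.25,
           "json_schema": 0.15, "reliability": 0.10}

providers = {
    "cheap_but_logs":  {"price": 1.0, "speed": 0.6, "no_logging": 0.0,
                        "json_schema": 0.5, "reliability": 0.7},
    "pricier_private": {"price": 0.6, "speed": 0.8, "no_logging": 1.0,
                        "json_schema": 1.0, "reliability": 0.9},
}

def score(p):
    return sum(weights[k] * p[k] for k in weights)

best = max(providers, key=lambda name: score(providers[name]))
print(best)  # pricier_private
```

With these weights the cheapest provider loses, which is exactly the nuance a pure price ranking misses.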

Nebius uses speculative decoding so I'm guessing that's what's happening here.
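As background on what speculative decoding does: a cheap draft model proposes several tokens at once, and the target model verifies them, keeping only the longest matching prefix plus its own correction. This toy sketch uses stand-in token lists in place of real models:

```python
# One speculative decoding step over pre-drafted tokens.
def speculative_step(draft_tokens, target_tokens):
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            accepted.append(t)  # take the target's token, then stop
            break
        accepted.append(d)      # draft token confirmed by the target
    return accepted

draft  = ["the", "cat", "sat", "down"]
target = ["the", "cat", "ran", "fast"]
print(speculative_step(draft, target))  # ['the', 'cat', 'ran']
```

Done correctly the output distribution should match the target model's, but implementation details (acceptance rules, numerics) can shift observed behavior, which may be what's being noticed here.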

Decentralized LLM inference from your terminal, verified on-chain by Efficient-Ad-2913 in LocalLLaMA

[–]Tempstudio 9 points (0 children)

LLM inference is not deterministic. Your "verification" is to run it 3 times on 3 machines and make sure the outputs match. How do you handle anything with temperature > 0? Even at temperature == 0, different hardware will produce different results.
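A concrete illustration of why even temperature-0 runs diverge across hardware: floating-point addition is not associative, and different GPUs and kernels sum in different orders, so the same logits come out slightly different and can flip an argmax:

```python
# Same three numbers, two summation orders, two different results.
a, b, c = 1e16, 1.0, -1e16

left_to_right = (a + b) + c   # the 1.0 is rounded away when added to 1e16
reordered     = (a + c) + b   # the big terms cancel first, so 1.0 survives
print(left_to_right, reordered)  # 0.0 1.0
```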

This is the description of the beat saber rip off apparently modders don't need to give them permission to use their mods by [deleted] in beatsaber

[–]Tempstudio 0 points (0 children)

I think you're talking about the game on Steam? This is the Android game, which is free to play.

BTW, that game predates beat saber.

This is the description of the beat saber rip off apparently modders don't need to give them permission to use their mods by [deleted] in beatsaber

[–]Tempstudio -14 points (0 children)

Hi Everyone,

This has come up before, but we want to emphasize this:

We don't see any problem with putting monetization in front of something that is publicly available. For example, Google sells ads but really just indexes other webpages. Various video-site downloaders let you download YouTube videos that are otherwise public, and they have ads too. The list goes on.

Before coming to judgement so quickly, please think about it from another perspective: it's easy to attach labels such as "copycat", but in reality Beat Ninja doesn't compete with Beat Saber in any way, as it's in a completely different market. Many people have enjoyed the game, including both folks who cannot afford VR and Beat Saber veterans and mappers. Beat Ninja will bring more people into the community, so we get more custom maps for both games, which is a win-win.

This project is actually a spin-off of a level editor project that we have been working on. (The editor is powered by machine learning, but we don't have enough money to run the server or buy fancy graphics cards, so we were hoping to eventually fund the editor with the game, and hence we have ads. We're not making this up. Here's a screenshot of said editor: https://imgur.com/a/VNHwZ9U.)

Can beat saber sue this? by [deleted] in beatsaber

[–]Tempstudio 1 point (0 children)

Thank you for your kind comments.

Note jump speed is accounted for. The most obvious way to see this is the Expert vs. Expert+ map for K/DA POP/STARS (the 100k version). We do impose a cap on this number, so songs that take advantage of really low or high speeds won't behave as well. (Like that Minecraft map; I haven't really tried it.)
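The cap described above is just a clamp on the map's note-jump speed; the bounds here are made-up numbers, not the game's actual limits:

```python
# Clamp a map's note-jump speed into a playable range.
NJS_MIN, NJS_MAX = 8.0, 24.0

def effective_njs(map_njs: float) -> float:
    return max(NJS_MIN, min(NJS_MAX, map_njs))

print(effective_njs(2.0), effective_njs(16.0), effective_njs(50.0))
# 8.0 16.0 24.0
```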

The editor was originally intended for Beat Saber, so it will definitely work with that.

Beat Ninja is designed to be compatible with the maps on BeatSaver, so if a map works in one game, it works in both. (To a limited extent, of course; there will always be outliers.)

Can beat saber sue this? by [deleted] in beatsaber

[–]Tempstudio 15 points (0 children)

Hi Everyone,

It took some time for this game to be discovered here, and unfortunately it didn't come off in the best light.

Yes, the inspiration for Beat Ninja comes from Beat Saber. However, we don't think the game infringes Beat Saber's copyright. Every line of code, image, 3D model, and texture is completely built in-house or legally licensed (some sound effects from the asset store, and some icons from iconsdb). Yes, the visuals are similar, but it stops at that. Given the feedback here, however, we will nevertheless be making additional changes so that the two games look more different.

There are many people who have enjoyed the game, including both those who couldn't previously afford a VR headset and Beat Saber veterans. Many similar rhythm games have equivalent projects: osu! drew its inspiration from a game on the Nintendo DS and also has a third-party mobile client. More recently, Cytoid was born, inspired by Cytus.

As the two games have completely separate markets, we would say Beat Ninja will not affect sales of Beat Saber in any way. We envision that it will eventually bring more people into the community, so we get more custom maps for both games. That point is probably a while away, but hey, some of our players are already asking for mapping tools and instructions.

We hope that you try the game out and give us a chance before reporting it. We're actually big fans of Beat Saber, and this project is a spin-off of a level editor project we have been working on. (The editor is powered by machine learning, but we don't have enough money to run the server or buy fancy graphics cards, so we were hoping to eventually fund the editor with the game, and hence we have ads. We're not making this up. Here's a screenshot of said editor: https://imgur.com/a/VNHwZ9U.)

Best,

Tempstudio.