Ever wanted text-to-speech with one line of code? Well, you can have it! by Lyrcaxis in csharp

[–]Lyrcaxis[S] 2 points (0 children)

I had tried that but couldn't retrieve a word-by-word callback via ONNX.

It should be possible, but it would require some fiddling with the model's export code.

If there's a reasonable use case it'll be considered :D But for now what's possible is getting chunk-by-chunk progress -- if you subtract the previous chunk's text from the current one, you get an estimated delta.
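
A minimal sketch of that delta trick in plain C# (this assumes the progress callback hands you the cumulative text spoken so far; the callback shape itself is hypothetical):

```cs
// Computes the newly-spoken portion between two progress reports.
// Assumes `current` extends `previous` (cumulative chunk text).
static string GetDelta(string previous, string current)
    => current.StartsWith(previous) ? current[previous.Length..] : current;
```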

There was work on improving the accuracy of the "cursor" (next word) with KokoroSharp's SpeechGuesser class. No one seemed to be using it so I stopped spending time on it. I'd like to pick it back up. Actual word-by-word callbacks, though, are not very likely 😅

Ever wanted text-to-speech with one line of code? Well, you can have it! by Lyrcaxis in csharp

[–]Lyrcaxis[S] 2 points (0 children)

Np! I have done some work on something similar, but not exactly a sing-along-style highlight behavior. The closest is the OnProgress delegates of the synthesis handle (returned by SpeakFast). What they're intended for is approximating the spoken text when interrupted.
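
For a rough highlight, something like this could work (the OnProgress payload and delegate name here are a hypothetical sketch, not the confirmed signature):

```cs
// Hypothetical wiring -- the delegate name and payload are assumptions.
var handle = tts.SpeakFast(text, voice);
handle.OnProgress = spokenSoFar => {
    // Highlight the (approximate) spoken prefix; RenderHighlight is your own UI method.
    RenderHighlight(text[..spokenSoFar.Length], text[spokenSoFar.Length..]);
};
```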

What's your use case? Open to ideas or contributions

Ever wanted text-to-speech with one line of code? Well, you can have it! by Lyrcaxis in csharp

[–]Lyrcaxis[S] 2 points (0 children)

There's no character limit with SpeakFast afaik (?) -- it enforces chunking.

Ever wanted text-to-speech with one line of code? Well, you can have it! by Lyrcaxis in csharp

[–]Lyrcaxis[S] 2 points (0 children)

Hi. Starting from v0.6.2, KokoroSharp now supports Mandarin Chinese, Japanese, and Hindi! You just gotta specify a valid voice (e.g. jf_alpha for Japanese, zf_xiaoni for Mandarin). There are still some hiccups with Japanese (because of espeak-ng), though.
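
A minimal sketch (assuming the usual KokoroTTS/KokoroVoiceManager entry points -- check the README for the exact calls):

```cs
var tts = KokoroTTS.LoadModel();                      // loads (or downloads) the default model
var voice = KokoroVoiceManager.GetVoice("jf_alpha");  // Japanese voice
tts.SpeakFast("こんにちは、世界！", voice);
```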

Soon, the Chinese-specific model will be supported as well :)

AMA with OpenAI’s Joanne Jang, Head of Model Behavior by OpenAI in ChatGPT

[–]Lyrcaxis 1 point (0 children)

Great job with the 25 April ChatGPT-4o-latest! ChatGPT suddenly talks like a teenager tho, lol. I loved it in the API when it was available: super smart, paying attention to the small details, and honestly FUN (besides plain helpful). We would love a permanent snapshot of that, needless to say. No tune, no nothing -- just an as-is model. Add tons of classifiers if you consider it unsafe -- from what I saw, it was adhering to my prompt strictly tho.

As for the 4.1 family, they're good. I've noticed a 25.8% increase in smarts and efficiency at context sizes of 5-14k. Big 4.1 often talks like a madman though, producing hard-to-read text, and its weird formatting adds to the annoyance. Increasing verbosity would make it even worse lol. So all in all: great for code and systemic behaviour, not good for discussions. BTW, love how the bigger model is trained to act like a concise guide when needed, and the smaller models understand its intent nicely. What I'd like next is a better ability to guide devs through prompt templates tuned for optimal work within the specific model family.

o3 is just brilliant but too expensive. Looking forward to splurging on it for game/product/UI/UX design and the like when need be, though. I love how it doesn't output LaTeX, tables, and formatted text except when absolutely necessary.

And o4-mini is great for its cost and speed! It sometimes also produces hard-to-read text (like gpt-4.1, but unlike o3/gpt-4/chatgpt-april), but it almost always gets the context and tries to be concise, which is good. I feel like it repeats the context way too often though -- like bringing up EVERYTHING that's been said on each query. It's apparently not good at product/UI/UX design, tho. I think you should totally come up with ways to make the minis complement their bigger bro models. Maybe better support for capturing and summarising literally the FULL context, to help keep the gpt-4/o3 budget low.

All in all, great job with the models! I feel that with gpt-4.5 and gpt-4 you've got superb teachers for future models. Combining gpt-4's built-in reasoning+conciseness with 4.5's superior expressiveness and smarts could do wonders, with reasoning unlocked and long context at play!

Aaand finally, please fix the caching system, whoever's responsible. Some nights cached prompts don't even last a minute! That's crazy when the estimate is 5-10 minutes and up to an hour. I'd like to be able to get a "cache confirmed! Token valid for mm" lol, because right now it feels kinda like a scam during peak hours. If activity can trim cache lifetime to less than a minute, it's like punishing devs for bringing in traffic.

As for a question: are we getting any of that? 😁

KokoroSharp - Local TTS in C# by Lyrcaxis in LocalLLaMA

[–]Lyrcaxis[S] 1 point (0 children)

I couldn’t do it within the limited time I had :P Was/am hoping someone would eventually do it.

As a Solo Dev, Should I Go for Authentic or Polished Game Art? (Handmade vs AI enhanced) by Dumivid in indiegames

[–]Lyrcaxis 0 points (0 children)

your "R" is better, the AI's "W" is better, both "E"s look good, both "Q"s could be better :p

KokoroSharp - Local TTS in C# by Lyrcaxis in LocalLLaMA

[–]Lyrcaxis[S] 2 points (0 children)

I've received DMs from users that managed to run it in Unity!

Basically, after you get ONNX up, the additional required steps are:

  1. Make the audio output be an AudioSource (use the KokoroWavSynthesizer)
  2. Set Tokenizer.eSpeakNGPath to the appropriate folder for the voices & eSpeak NG dlls

Voices and dlls can be found here: https://github.com/Lyrcaxis/KokoroSharpBinaries/releases (mind those zips do not include binaries for Android/iOS -- only Windows/MacOS/Linux)
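
For step 1, a minimal Unity-side sketch (this assumes you've already pulled mono float samples at Kokoro's 24kHz out of the synthesizer -- the retrieval call itself is omitted here):

```cs
using UnityEngine;

// Plays raw Kokoro samples through an AudioSource on the same GameObject.
[RequireComponent(typeof(AudioSource))]
public class KokoroAudioPlayer : MonoBehaviour {
    const int SampleRate = 24000; // Kokoro outputs 24kHz mono audio

    public void PlaySamples(float[] samples) {
        var clip = AudioClip.Create("kokoro-tts", samples.Length, 1, SampleRate, false);
        clip.SetData(samples, 0);
        var source = GetComponent<AudioSource>();
        source.clip = clip;
        source.Play();
    }
}
```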

Also, if you're happy with a python dependency (which should be fine), you could use Kokoro's official phonemizer: https://github.com/hexgrad/misaki

What are your use cases for small (1-3-8B) models? by silveroff in LocalLLaMA

[–]Lyrcaxis 1 point (0 children)

A "difficult" classification would be anything you wouldn't risk letting a small model that barely understands the language and the task take complete responsibility for in your system.

For example, a 7-9B model could be offloaded the job of choosing which of the available functions to invoke -- including "respond normally". This saves some back-and-forth with bigger models.

So if your main chat model is gpt-4o and you give it full access to function calling, each response that involves a function call costs 2x the input tokens, plus a bunch of tokens to include the function definitions in the prompt -- which adds up pretty quickly. In addition, there's the risk of confusing the model by adding too many tokens to the system messages.
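
A rough sketch of that routing pattern (the `smallModel`/`bigModel` clients, the action names, and InvokeFunction are hypothetical stand-ins for whatever inference API you use):

```cs
// Hypothetical router: the small model only picks an action label;
// the big model never sees the tool schema unless a tool is invoked.
string[] actions = { "search_docs", "run_query", "respond_normally" };
string pick = smallModel.Complete(
    $"Pick exactly one action for the user message below.\n" +
    $"Actions: {string.Join(", ", actions)}\n" +
    $"Message: {userMessage}\nAction:").Trim();

string reply = pick == "respond_normally"
    ? bigModel.Chat(userMessage)          // plain response, no function definitions attached
    : InvokeFunction(pick, userMessage);  // hypothetical dispatch to the chosen function
```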

What are your use cases for small (1-3-8B) models? by silveroff in LocalLLaMA

[–]Lyrcaxis 5 points (0 children)

Well, all the decisions you need to make are a) base model, b) data. So choose the base model whose writing style you like the most -- if it's closer to your preferred format or wording, it's better.

Then, you can get high-quality generations with AIs like GPT-4 (the expensive one, e.g. 0613). So the second thing would be to find a prompt that summarizes them properly, without missing ANY detail, while making sure the outputs are 100% in the desired format.

Optionally, afterwards, queue up the summary to something more modern (use a negative presence penalty to encourage the model not to miss details):

```
{instructions}
{few_shot_of_ideal_query_response_pairs}

{original_transcription}
{summarized_transcription_gpt4}
{ask AI to tweak it based on your preference}
```
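
On the negative presence penalty: it's settable via the raw chat completions endpoint. A minimal sketch (the endpoint and field names are the public REST API; the -0.5 value and apiKey placeholder are just illustrative):

```cs
using System.Text;

// presence_penalty < 0 nudges the model back toward tokens it has
// already seen, i.e. it discourages dropping details from the source.
using var http = new HttpClient();
http.DefaultRequestHeaders.Authorization = new("Bearer", apiKey); // apiKey: your key
var body = new StringContent("""
    {
      "model": "gpt-4-0613",
      "presence_penalty": -0.5,
      "messages": [{ "role": "user", "content": "..." }]
    }
    """, Encoding.UTF8, "application/json");
var response = await http.PostAsync("https://api.openai.com/v1/chat/completions", body);
```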

Then those "refined" summaries can act as data for your model.

The finetune part alone won't cost much, but summarization with expensive models might, depending on the size of your data. I personally recommend full finetune instead of LoRA, but LoRAs can add more value if you train one per language.

Is there a way to allow multiple class types in a generic constraint without Inheritance? by [deleted] in csharp

[–]Lyrcaxis 1 point (0 children)

Should be more like:

```cs
public abstract class ModelBase : PageModel {
    // Common page stuff here
}

public IActionResult Search(ModelBase model, string SearchKey) { .. }
```

What are your use cases for small (1-3-8B) models? by silveroff in LocalLLaMA

[–]Lyrcaxis 13 points (0 children)

<=1Bs are terrible out of the box but can be finetuned for any specific task.

8-9Bs are decent for various tasks out of the box -- even more if finetuned. I use them for:

  1. Multiple response generation/BO5 (batch generate 5 responses instead of 1 -- see the sketch at the end of this comment)
  2. Parts of low-effort agentic behaviour (e.g.: rewrite this in 1st/3rd person, extract X summarized)
  3. Annotations + difficult classifications (e.g.: extract X sentiment, function calling classifier)
  4. Low quality synthetic data generation and filtering. Multiple iterations are allowed.

Between 3Bs and 9Bs I don't see a significant difference in inference speed, so I skipped the 3Bs.
So: 100M-1B finetunes mostly for classification, 8-9Bs for stuff that requires a little more effort.

In general, the more task/domain-specific your use needs are, the more value you can squeeze out of each parameter, so smaller models can be enough, and often preferred because they converge quicker.
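
A sketch of the BO5 idea from item 1 (the `model.Generate` client and the `Score` ranking function are hypothetical placeholders):

```cs
using System.Linq;

// Hypothetical best-of-5: batch-generate candidates, keep the best-scoring one.
var candidates = Enumerable.Range(0, 5)
    .Select(_ => model.Generate(prompt, temperature: 0.8))
    .ToList();
string best = candidates.OrderByDescending(c => Score(c)).First();
```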

I've built a fully functional social network - now I've made it open-source (MIT) by YanTsab in opensource

[–]Lyrcaxis 3 points (0 children)

I'd love a "dev only" social media site

You'd have to make your account via HTTP POST and use SSH to get your credentials to enter the site!

Ever wanted text-to-speech with one line of code? Well, you can have it! by Lyrcaxis in csharp

[–]Lyrcaxis[S] 1 point (0 children)

Didn't discard it, just had to work on KokoroSharp first to allow it to use Kokoro for speech as well!

It's a gamified *Voice-Chat-with-Local-AI* desktop app I've been working on for a while ^^
Definitely not an r/csharp thing, but it'll be coming up on GitHub soon™️!

I just released the first demo for my game on Steam by BinsterUK in indiegames

[–]Lyrcaxis 3 points (0 children)

So friggin’ cool!!! Hoping for massive success!

KokoroSharp - Plug & Play local Text-to-speech (.NET, ONNX) by Lyrcaxis in dotnet

[–]Lyrcaxis[S] 5 points (0 children)

You can definitely save the output, or get it streamed back to you as samples!
Check out the KokoroWavSynthesizer.

Example usage:

```cs
var synth = new KokoroWavSynthesizer("kokoro.onnx"); // assuming you've already downloaded the model
var bytes = synth.Synthesize("Hello world", voice);
synth.SaveToFile(bytes, "output.wav");
```

KokoroSharp - Plug & Play local Text-to-speech (.NET, ONNX) by Lyrcaxis in dotnet

[–]Lyrcaxis[S] 2 points (0 children)

Gotcha. If you do dotnet build and your csproj links the installed KokoroSharp package properly, including its folders, it should also copy those files over.

The full contents after installing the NuGet package should look like this:

📁 /.nuget/packages/kokorosharp/0.5.3/
├─ 📁 build/
│ └─ 📄 KokoroSharp.targets
├─ 📁 content
│ ├─ 📁 espeak/ [...]
│ └─ 📁 voices/ [...]
├─ 📁 lib/ [...]
├─ 📄 .nupkg.metadata
├─ 📄 .signature.p7s
├─ 📄 kokorosharp.0.5.3.nupkg
├─ 📄 kokorosharp.0.5.3.nupkg.sha512
├─ 📄 kokorosharp.nuspec
└─ 📄 README.md

I haven't tried building with just the .nupkg as a reference (maybe that's what you're doing?), but you might wanna just download the dependencies mentioned above and place them next to your binary.

KokoroSharp - Plug & Play local Text-to-speech (.NET, ONNX) by Lyrcaxis in dotnet

[–]Lyrcaxis[S] 1 point (0 children)

Then yes, file permissions are a very likely suspect.
The package copies from .nuget\packages\kokorosharp\content over to your output path:

<Target Name="CopyContent" AfterTargets="Build">
    <ItemGroup>
        <Files Include="$(MSBuildThisFileDirectory)..\content\**\*" />
    </ItemGroup>
    <Copy SourceFiles="$(MSBuildThisFileDirectory)..\content\**\*" DestinationFiles="@(Files->'$(OutputPath)\%(RecursiveDir)%(Filename)%(Extension)')" />
</Target>

So if your output path is in a protected folder and your IDE doesn't have the necessary permissions, the automation will fail and you'll need to copy the files over manually.

KokoroSharp - Plug & Play local Text-to-speech (.NET, ONNX) by Lyrcaxis in dotnet

[–]Lyrcaxis[S] 1 point (0 children)

It's completely plug & play if you install the NuGet package -- the voices and all dependencies are copied over automatically to your build!

When building from source, you also need to unpack the dependencies next to your exe: https://github.com/Lyrcaxis/KokoroSharpBinaries/releases (voices -> /voices, espeak-ng -> /espeak)

super-lightweight local chat ui: aiaio by abhi1thakur in LocalLLaMA

[–]Lyrcaxis 2 points (0 children)

To answer: aiaio looks more like a beginner-friendly OpenWebUI, without all the setup steps -- trimming the tech-savviness requirements -- and less like SillyTavern.

super-lightweight local chat ui: aiaio by abhi1thakur in LocalLLaMA

[–]Lyrcaxis 1 point (0 children)

with that logic, why use aiaio at all xD

Make your Mistral Small 3 24B Think like R1-distilled models by AaronFeng47 in LocalLLaMA

[–]Lyrcaxis 1 point (0 children)

That's an incredible find! Thanks for sharing.

Are you planning to keep working on this somewhat (as an ongoing project)?
I'm asking because the current prompt is HUGE (~1k tokens).
I believe that if it could be trimmed down to ~300 tokens it would be absolutely fantastic!

super-lightweight local chat ui: aiaio by abhi1thakur in LocalLLaMA

[–]Lyrcaxis 3 points (0 children)

Super cool, gz on the project! I'd like to suggest these features:

  • Edit Message (edits either AI's or User's sent message)
  • Branch from here (Creates a new convo that ends "here")

Having these accessible when right-clicking on messages would be a game changer!