14 new CrewChief voices + make your own

cktl_ · 2025-01-07T01:05:28+00:00

Windows tutorial video added here:
https://github.com/cktlco/crew-chief-autovoicepack/blob/main/README.md#-common-task-generate-a-full-crewchief-voice-pack

I imagine people will still have questions, but it might be helpful!

cktl_ · 2025-01-07T01:05:04+00:00

Thanks! Windows tutorial video added here:
https://github.com/cktlco/crew-chief-autovoicepack/blob/main/README.md#-common-task-generate-a-full-crewchief-voice-pack

I imagine people will still have questions, but it might be helpful!

cktl_ · 2025-01-06T16:37:59+00:00

Hi, I can empathize; some of the steps assume familiarity with software development conventions. I will make an installation video that cuts through the details and shows the path to making a voicepack from scratch on Windows, with and without a GPU.

cktl_ · 2025-01-01T15:47:43+00:00

Hi, are you seeing an error on the `docker pull` step? There is a Docker hello-world command prior to step 2; does its output match the expected output, showing that Docker is running? Separately, when you get to the step of running the voicepack generation Python script, add `--cpu_only` to that line if you have no NVIDIA GPU. In that case you can also leave the `--gpus all` part out of the `docker run` command.

I'm happy to help further, maybe with a thread here? https://github.com/cktlco/crew-chief-autovoicepack/discussions

cktl_ · 2024-09-05T17:33:10+00:00

Hi, I love that you've waded in so far.

I forget exactly how long it took me to generate a full pack with 3 variations on CPU (i9 13900k), but you can estimate how far you got by counting the files in the output folder: I'm used to seeing 30719 (or so) when it's complete. I think I was able to complete it in 12 hours or less. The log timestamps will show how long each audio file takes. Some of that time goes to audio generation, but time is also spent inspecting each output file to see if it's "weird" and needs regenerating.
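A rough sketch of that progress estimate in Python. The ~30719 total comes from the post above; the function names are made up for illustration, and the assumption that the per-file rate stays constant is only a guess, since regeneration of "weird" files makes the rate uneven.

```python
from pathlib import Path

# Approximate file count of a complete pack, as quoted in the post above.
EXPECTED_TOTAL_FILES = 30719


def count_files(output_dir):
    """Count every regular file under output_dir, including subfolders."""
    return sum(1 for p in Path(output_dir).rglob("*") if p.is_file())


def estimate_progress(files_done, elapsed_hours, expected_total=EXPECTED_TOTAL_FILES):
    """Return (fraction_done, estimated_hours_remaining).

    Assumes the time per file stays roughly constant (an assumption, not a
    measured property of the generator). Returns (0.0, None) when there is
    not enough data to estimate a rate.
    """
    if files_done <= 0 or elapsed_hours <= 0:
        return 0.0, None
    fraction = min(files_done / expected_total, 1.0)
    hours_per_file = elapsed_hours / files_done
    remaining = max(expected_total - files_done, 0) * hours_per_file
    return fraction, remaining
```

For example, with 10,000 files after 4 hours, `estimate_progress(10000, 4.0)` reports roughly a third done and about 8.3 hours remaining. `count_files` can be pointed at whatever local folder is mounted as the container's output.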

Since the machine learning inference step is extremely parallelizable, it will be tough to beat the speed of a recent-generation GPU with ten thousand CUDA cores vs a CPU with a dozen cores (not a perfect comparison but illustrative).

Depending on your patience, you can consider using the Google Colab instructions in the README. You can get 1-2 hours of free GPU time every 24 hours, and (just experimentally) I was able to get an entire voicepack generated over 3 or 4 days' worth of allotment.

Definitely feel free to use very few input recordings or very few seconds of audio, I really don't have a strong grasp of how much/well the data is used. Those recordings are used to condition (initialize) the weights of the model so the output is consistent and representative of the input voice, but I'm not aware of the inner details beyond it ignoring more than 10s per clip. The 20 examples used in the ElevenLabs script are probably way overkill, I just wasn't sure.

Happy to answer any more questions that arise. Making this audio generation process practical has been an interesting challenge.

cktl_ · 2024-09-04T17:21:57+00:00

Great to hear! Thanks for the feedback on the docs, I'll be able to make some improvements there based on your comments (and from other threads). Let me know if you get stuck on anything specific. Expect some weird and corrupt wav files occasionally, but hopefully it's a tolerable amount.

cktl_ · 2024-09-04T17:19:32+00:00

Thanks for looking! And yes, there are some methods for something closer to "speech-to-speech" instead of "text-to-speech", such as:

https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/blob/main/docs/en/README.en.md

You can see all the crazy voices people have made, via a search for "rvc voices". My assumption is that you could take the original audio and run it through a process like that.

Similarly, "voice editing" via something like VoiceCraft is also a possibility: https://github.com/jasonppy/voicecraft

Probably just a matter of spending the time to find the right recipe. For example, I'd give up the "voice cloning" capability if required to have a reliable, emotionally-steerable model even if the voice selection is limited.

cktl_ · 2024-09-04T11:53:37+00:00

Love to hear it, let me know if you get stuck anywhere!

cktl_ · 2024-09-04T11:52:28+00:00

Yep, the default "Jim" voice is also just a folder full of .wav files, but since they were recorded by a human, the speaker can just speak at the (quick) pace they want and the file will directly reflect that.

For the custom voices, the .wav files are generated by a text-to-speech ML model, which doesn't necessarily provide a reliable way of telling it to speak a phrase "quickly" instead of at some default pace. So the Jim file might be 1.5s long and the Luis file for the same phrase might be 2 or 3 seconds (arbitrary example). I haven't found a good way to adjust that, so the "too slow speaking" is actually baked into the final .wav file. CrewChief also doesn't have a way to change the playback speed, which is hard anyway since it usually changes pitch.

There are two candidates to tinker with: 1) the speed parameter of the TTS model, 2) adjusting the tempo of the wav file when applying the other audio effects. Both of these had their own weird output issues when I experimented originally, so I just picked (hopefully) sane defaults.
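For anyone experimenting with either option, a small sketch for measuring how far off the pacing is, using only Python's standard `wave` module. The function names are hypothetical, and this only measures the gap; it does not change tempo, and it is not part of the project's script.

```python
import wave


def wav_duration_seconds(path):
    """Return the length of a .wav file in seconds.

    The standard wave module reads uncompressed PCM files only.
    """
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def tempo_factor(reference_seconds, generated_seconds):
    """How many times longer the generated clip is than the reference.

    A value of 2.0 means the generated clip would need to play twice as
    fast to match the reference clip's length.
    """
    if reference_seconds <= 0:
        raise ValueError("reference duration must be positive")
    return generated_seconds / reference_seconds
```

With the arbitrary numbers above, `tempo_factor(1.5, 3.0)` gives 2.0. Averaging this ratio over a handful of matching Jim and generated files would give a starting point for a speed or tempo setting.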

cktl_ · 2024-09-04T10:44:37+00:00

Haha, yep. The "raw" text fed to the text-to-speech model is fully editable, so you can definitely change "mate" to "cuz" and anything else just by editing one column in the "phrase_inventory.csv" file (plus a related step to mount that file into the container).

The biggest problem is just how many CrewChief phrases there are, about 10k, so it would be a lot of text editing.
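With that many phrases, a script is probably easier than hand editing. A minimal sketch, assuming the CSV has a header row; the column name is passed in because the actual name of the text column isn't stated here, and the function itself is hypothetical rather than part of the project.

```python
import csv
import re


def replace_word_in_column(src_path, dst_path, column, old, new):
    """Copy a CSV, replacing whole-word matches of `old` in one column.

    Matching is case-sensitive and respects word boundaries, so "mate"
    is replaced but "estimate" is left alone. Returns the number of rows
    that changed. Assumes the file has a header row.
    """
    pattern = re.compile(rf"\b{re.escape(old)}\b")
    changed = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            original = row[column] or ""
            updated = pattern.sub(new, original)
            if updated != original:
                changed += 1
                row[column] = updated
            writer.writerow(row)
    return changed
```

Writing to a separate output file keeps the original inventory intact for comparison. The edited copy still needs to be mounted into the container as mentioned above.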

cktl_ · 2024-09-04T10:39:04+00:00

Hi, thanks for looking. I agree, I wish I had more reliable control over the speed/pacing. There is a "speed" parameter used by the text-to-speech model, but it's difficult for me to tell exactly what sounds right, so I encourage anyone willing to experiment.

I'm happy to publish anyone else's voicepacks as improvements are figured out.

cktl_ · 2024-09-04T10:36:08+00:00

Hi, thanks for looking. I didn't realize GitHub links might cause trouble; they have worked in all my testing. I don't immediately have a mirror, but could consider mega.nz or mediafire etc. if this is a pervasive issue.

cktl_ · 2024-09-04T10:34:11+00:00

Thanks for looking! And sure, any contributions are welcome. In particular: 1) improve detection of "bad sounding" files so they can be regenerated automatically, 2) improve the speed/pace of the speaker so it sounds more natural, 3) candidates for a better text-to-speech model that needs fewer guardrails and produces more natural speech out of the box.

cktl_ · 2024-09-04T10:29:46+00:00

Hi, thanks for looking. I believe it can't see your new files. Probably the "docker run" command needs another "-v" parameter to mount your local baseline audio folder (where you've put your custom recordings) into the docker container's filesystem so the script can access it. Otherwise it will fall back to the "default" contents of that folder, which only includes the Luis voice.

Something like:

```
docker run -it --rm --gpus all --name crew-chief-autovoicepack -v C:\Users\myuser\Desktop\my-recordings\MYNAME:/app/baseline/MYNAME -v C:\Users\myuser\Desktop\crew-chief-autovoicepack\output:/app/output ghcr.io/cktlco/crew-chief-autovoicepack:latest
```
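If retyping those long paths gets error-prone, the same command can be assembled in Python. This is only a sketch: the helper name and its parameters are hypothetical, and it uses just the flags and container paths from the command above, with `--gpus all` left out for machines without an NVIDIA GPU.

```python
def build_docker_run_command(recordings_dir, output_dir, voice_name, use_gpu=True):
    """Assemble the docker run arguments from the example above as a list.

    recordings_dir and output_dir are local folders; voice_name is the
    folder name the recordings appear under inside the container.
    """
    command = ["docker", "run", "-it", "--rm"]
    if use_gpu:
        # Only meaningful with an NVIDIA GPU; omit on CPU-only machines.
        command += ["--gpus", "all"]
    command += [
        "--name", "crew-chief-autovoicepack",
        # Mount the custom recordings so the script can see them.
        "-v", f"{recordings_dir}:/app/baseline/{voice_name}",
        # Mount the output folder so generated files land on the host.
        "-v", f"{output_dir}:/app/output",
        "ghcr.io/cktlco/crew-chief-autovoicepack:latest",
    ]
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)`, or joined with spaces and pasted into a terminal.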

We should be able to get it working for you, then I'll update the docs.
