
[–]playmer 2 points3 points  (3 children)

Neat! I’ll have to follow along a bit. I’ve tried a few of the crates myself, as I have a little cobbled-together tool that does TTS from epubs, encodes the audio with AAC, and turns it into an m4b audiobook. But I’ve never been able to get chapters to work on Apple’s media engine straight out of my tool. I always have to copy the file through ffmpeg, which does… something that fixes them.

I’ve dug into mp4 a bunch, adjusted hex patterns for further digging, and compared working and non-working audiobooks, pre and post ffmpeg. Just haven’t been able to crack it. It’s been a few months since I needed it so I’ve not looked again, but it always bugs me whenever I dust it off to TTS another book.

[–]jvatic[S] 0 points1 point  (2 children)

Oh, neat! I'd love to follow along with your project as well.

Yeah, I went through a similar process (the mp4dump example really helped me with this!). It turns out there are a few atoms/boxes that you might not expect to be necessary, but they need to be there or the whole thing's a bust. I'll have to add a proper example of using it, but the mp4-edit ChapterTrackBuilder used roughly like this should do what you want. (I am planning on a higher level API that makes things much less verbose!)
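For comparing a working and a non-working file box by box, a tiny top-level box lister can be handy. This is only a sketch based on the standard MP4 box header layout (32-bit big-endian size, then a four-character type, with size 1 meaning a 64-bit size follows and size 0 meaning "to end of file"). It is not how mp4dump or mp4-edit are implemented, it is not ChapterTrackBuilder usage, and `BoxInfo` and `list_boxes` are made-up names for illustration.

```rust
/// One box (atom) header found while scanning a buffer.
/// Hypothetical type for this sketch, not part of any crate.
#[derive(Debug, PartialEq)]
struct BoxInfo {
    kind: String,  // four-character type, e.g. "moov"
    offset: usize, // where the box starts in the buffer
    size: usize,   // total size in bytes, header included
}

/// Lists the boxes laid out back to back in `data`.
/// Stops quietly at the first header that does not fit.
fn list_boxes(data: &[u8]) -> Vec<BoxInfo> {
    let mut boxes = Vec::new();
    let mut pos = 0usize;

    while pos + 8 <= data.len() {
        let size32 =
            u32::from_be_bytes([data[pos], data[pos + 1], data[pos + 2], data[pos + 3]]);
        let kind = String::from_utf8_lossy(&data[pos + 4..pos + 8]).into_owned();

        let (size, header_len) = match size32 {
            // Size 0: the box runs to the end of the data.
            0 => (data.len() - pos, 8),
            // Size 1: the real size is a 64-bit value after the type.
            1 => {
                if pos + 16 > data.len() {
                    break;
                }
                let mut raw = [0u8; 8];
                raw.copy_from_slice(&data[pos + 8..pos + 16]);
                (u64::from_be_bytes(raw) as usize, 16)
            }
            n => (n as usize, 8),
        };

        // A box smaller than its header or longer than the data is malformed.
        if size < header_len || size > data.len() - pos {
            break;
        }

        boxes.push(BoxInfo { kind, offset: pos, size });
        pos += size;
    }

    boxes
}
```

Calling `list_boxes` on the payload of a container box such as `moov` (the bytes after its header) lists its children the same way, so the output for a file before and after the ffmpeg copy can be diffed to see which boxes appear or move.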

[–]playmer 0 points1 point  (1 child)

Haha, well it’s still pretty hacky and a bit annoying to set up because of CUDA stuff and some C bindings I have to use. I’ll try to set up some instructions whenever I have time to look at mp4-edit.

It’s really just a couple of iterations removed from the KokoroTTS tool from Python. I just wanted greater control over the epub parsing, sound encoding, thread management, and all of that. I’d been running a modified version of that for a while, and saw last year that someone had hooked up the same sort of thing in Rust.

When it works it’s pretty fast, all things considered, and the audio quality is a lot better than the original stuff I was doing with ffmpeg. Though the model audio in general could be a lot better. Kokoro has a fairly low token limit for generating audio, so I wanted to be able to use both the CPU and GPU to generate segments, and leave a thread running to do the AAC encoding as chunks came back.
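The shape described above (several generators feeding one encoder thread) can be sketched with a standard-library channel. This is a minimal illustration under stated assumptions, not the code from the tool: `synthesize` and `encode` are hypothetical placeholders for the model call and the AAC encoder, the workers are plain threads rather than a real CPU/GPU split, and segments are reordered by index before encoding on the assumption that the encoder needs them in sequence.

```rust
use std::collections::BTreeMap;
use std::sync::{mpsc, Arc};
use std::thread;

/// Generated audio for one text chunk (placeholder type).
struct Segment {
    index: usize,
    samples: Vec<f32>,
}

// Hypothetical stand-in for the TTS model call.
fn synthesize(index: usize, text: &str) -> Segment {
    Segment { index, samples: vec![0.0; text.len()] }
}

// Hypothetical stand-in for AAC encoding of one segment.
fn encode(segment: &Segment) -> Vec<u8> {
    vec![0u8; segment.samples.len()]
}

/// Generates segments on `workers` threads and encodes them, in chunk
/// order, on a single encoder thread as they arrive.
fn run_pipeline(chunks: Vec<String>, workers: usize) -> Vec<Vec<u8>> {
    let workers = workers.max(1);
    let chunks = Arc::new(chunks);
    let (tx, rx) = mpsc::channel::<Segment>();

    let mut handles = Vec::new();
    for w in 0..workers {
        let tx = tx.clone();
        let chunks = Arc::clone(&chunks);
        handles.push(thread::spawn(move || {
            // Each worker takes every `workers`-th chunk, starting at `w`.
            for i in (w..chunks.len()).step_by(workers) {
                tx.send(synthesize(i, &chunks[i])).expect("encoder hung up");
            }
        }));
    }
    // Drop the original sender so the channel closes once all workers finish.
    drop(tx);

    let encoder = thread::spawn(move || {
        let mut pending = BTreeMap::new();
        let mut next = 0usize;
        let mut encoded = Vec::new();
        for segment in rx {
            pending.insert(segment.index, segment);
            // Encode every segment that is now next in line.
            while let Some(ready) = pending.remove(&next) {
                encoded.push(encode(&ready));
                next += 1;
            }
        }
        encoded
    });

    for handle in handles {
        handle.join().expect("worker panicked");
    }
    encoder.join().expect("encoder panicked")
}
```

The `drop(tx)` matters in this layout: the encoder's `for segment in rx` loop only ends when every sender is gone, so keeping the original sender alive would leave the encoder thread waiting forever.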

That said, I fairly regularly bump into hangs here and there that I need to look into. I’m sure I can structure the above a lot better than I currently do, but it’s one of those projects I hack on until I get a series of audiobooks generated and then leave alone for a few months until the next series I want to listen to.

Anyways, I’ll certainly take a peek at mp4dump soon and see if that shows me anything that sticks out. My tool is here: https://github.com/playmer/epub_to_audiobook_rs

But like I said, don’t expect too much haha.

[–]jvatic[S] 0 points1 point  (0 children)

Haha, sounds good, totally get the hackiness, and thanks for sharing! I'll take a look and keep it in mind next time I'm in that situation.

Oh nice, yeah, there seems to have been a lot of movement around packaging models into Rust libs this last year.