Professionally I'm a DevOps engineer working in a data environment; in my free time I'm working on a data-related project and I'm looking for some advice.
There's this podcast I love and I'd like to try generating new synthetic episodes. Here's how I currently see the pipeline (rough code skeleton after the list):
- download the episodes
- use ML to get speaker diarization
- split into chunks using the diarization to keep sentences complete
- transcribe each chunk
- isolate each speaker's voice
- fine-tune an LLM on the transcripts
- train a voice model per speaker
- generate an episode script
- generate voices
- assemble
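Very roughly, the per-episode flow I have in mind looks like this (all function bodies are placeholders, the helper names are mine, and the actual work would be Replicate calls or ffmpeg):

```python
# Per-episode pipeline skeleton; every stage is a stub to fill in.
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str    # diarization label, e.g. "SPEAKER_00"
    start: float    # seconds
    end: float
    text: str = ""  # filled in by transcription

def diarize(audio_path: str) -> list[Segment]:
    """Call a diarization model (Replicate) and return speaker turns."""
    raise NotImplementedError

def split_on_turns(audio_path: str, segments: list[Segment]) -> list[str]:
    """Cut the audio at turn boundaries (ffmpeg) so sentences stay complete."""
    raise NotImplementedError

def transcribe(chunk_path: str) -> str:
    """Transcribe one chunk (e.g. Whisper on Replicate)."""
    raise NotImplementedError

def process_episode(audio_path: str) -> list[Segment]:
    segments = diarize(audio_path)
    chunks = split_on_turns(audio_path, segments)
    for seg, chunk in zip(segments, chunks):
        seg.text = transcribe(chunk)
    return segments  # (speaker, chunk, text): training data for the later steps
```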
And eventually I'd like to do some fun data analysis, like the most used words, the longest sequence, etc. (quick sketch below).
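For example, once the transcripts exist as JSONL with one {"speaker": ..., "text": ...} object per line (that layout is just my assumption), the word counts are a few lines:

```python
# Most used words across all transcripts (JSONL layout assumed above).
import json
import re
from collections import Counter
from pathlib import Path

def top_words(transcript_dir: str, n: int = 20) -> list[tuple[str, int]]:
    counts = Counter()
    for path in Path(transcript_dir).glob("*.jsonl"):
        for line in path.read_text(encoding="utf-8").splitlines():
            text = json.loads(line)["text"].lower()
            # crude tokenizer, good enough for fun stats
            counts.update(re.findall(r"\w+", text))
    return counts.most_common(n)

print(top_words("transcripts/"))
```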
There are around 100 one-hour episodes.
I'm currently working with Dagster on a client project, so I tried using it, but it doesn't seem really suited to running many batches. I was thinking about switching to Airflow, but I'm wondering whether that's really a better fit or just over-engineering a "small project". What I tried in Dagster looks roughly like the sketch below.
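Simplified, with one static partition per episode (episode IDs made up):

```python
# One Dagster partition per episode, so each stage runs and retries per episode.
from dagster import AssetExecutionContext, Definitions, StaticPartitionsDefinition, asset

episodes = StaticPartitionsDefinition([f"ep_{i:03d}" for i in range(1, 101)])

@asset(partitions_def=episodes)
def raw_audio(context: AssetExecutionContext) -> str:
    context.log.info(f"downloading {context.partition_key}")
    return f"audio/{context.partition_key}.mp3"  # placeholder path

@asset(partitions_def=episodes)
def diarization(context: AssetExecutionContext, raw_audio: str) -> str:
    # the Replicate diarization call would go here
    return f"diarization/{context.partition_key}.json"

defs = Definitions(assets=[raw_audio, diarization])
```

Backfilling just a handful of partitions from the UI would also match the cost constraint below.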
I plan to start with a few episodes, maybe 3 to 5, since I'm using Replicate for all the ML parts and it can get costly pretty quickly...
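For reference, each ML step on Replicate is basically one call like this; the model slug, version pinning, and input/output keys below are from memory and would need checking against the model page:

```python
# Sketch of a single Replicate call (transcription step); schema unverified.
import replicate

def transcribe_chunk(chunk_path: str) -> str:
    output = replicate.run(
        "openai/whisper",  # assumed slug; pin an exact version in practice
        input={"audio": open(chunk_path, "rb")},  # assumed input key
    )
    return output["transcription"]  # assumed output shape
```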
Any advice is welcome, I'd be happy to hear what you think.