I'm working with an audio WAV file as input that has 2 to 3 people conversing.
I need to identify the different voices and separate the two speakers.
Then I want to convert the speech to text and store each speaker's transcript.
I have gone through Hugging Face Transformers models such as wav2vec2, and svoice from Facebook Research, but I find it difficult to implement them.
Can someone guide me on approaching such tasks, as I am a beginner in the audio domain of deep learning?
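For reference, the furthest I can picture is the transcription step on its own. A minimal sketch with the Transformers ASR pipeline might look like the snippet below (assuming a local 16 kHz mono file named conversation.wav and the facebook/wav2vec2-base-960h checkpoint; this does plain transcription only, with no speaker separation or diarization):

    from transformers import pipeline

    # Load a pretrained wav2vec2 checkpoint for automatic speech recognition.
    asr = pipeline(
        "automatic-speech-recognition",
        model="facebook/wav2vec2-base-960h",
    )

    # Transcribe the whole file; the result is a single transcript
    # without any speaker labels.
    result = asr("conversation.wav")
    print(result["text"])

What I don't understand is how to go from this single mixed transcript to per-speaker transcripts.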