
It appears that RingCentral records calls in mono, and when you run the MP3 through a transcription service, the speaker labels keep switching. Is there a way to prevent that, so that the first person speaking is always speaker 1, or speaker 1 is always associated with the agent on inbound calls, etc.?

Most speech-to-text services provide only diarization: they can label different voices as speaker 0, speaker 1, speaker 2, and so on, but nothing more. I am not aware of any ML service that supports true speaker identification, which normally requires pre-training with a sample of each speaker's voice.
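One thing you can do with diarization output alone is post-process it so that whichever voice appears first always gets a stable label (e.g. "agent" on an inbound call). Here is a rough sketch; the segment fields (speaker, start, text) are assumptions, since every service formats its output differently:

```python
# Minimal post-processing sketch, assuming the transcription service returns
# diarized segments as dicts with "speaker", "start", and "text" keys
# (field names are hypothetical; check your service's actual response format).

def relabel_speakers(segments, roles=("agent", "customer")):
    """Map raw diarization labels (speaker 0, speaker 1, ...) to stable roles
    based on order of first appearance: the first voice heard becomes the
    first role, the next distinct voice the second role, and so on."""
    mapping = {}
    for seg in sorted(segments, key=lambda s: s["start"]):
        raw = seg["speaker"]
        if raw not in mapping:
            # Assign the next unused role; fall back to the raw label if the
            # call has more distinct voices than roles.
            mapping[raw] = roles[len(mapping)] if len(mapping) < len(roles) else raw
        seg["speaker"] = mapping[raw]
    return segments

# Example: on an inbound call the agent usually speaks first, so whatever
# label the service assigned to the first voice is remapped to "agent".
diarized = [
    {"speaker": "speaker 1", "start": 0.0, "text": "Thanks for calling, how can I help?"},
    {"speaker": "speaker 0", "start": 3.2, "text": "Hi, I have a billing question."},
]
print(relabel_speakers(diarized))
```

Note this only helps if the diarization itself is stable; it cannot fix cases where the service splits one voice into several labels mid-call.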

If this is critical for your app or service, you can implement your own app that records calls in multiple channels; that way you will know, for example, which channel is the agent and which is the customer. Let me know if this is what you want to implement so I can help further.
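For example, once you have a two-channel recording, you can split it into one mono file per channel and transcribe each file separately, so attribution no longer depends on diarization at all. A rough sketch using the pydub library (an assumption on my part, not something RingCentral provides); whether channel 0 is the agent or the customer depends entirely on how your recording app writes the channels:

```python
# Sketch: split a stereo call recording into per-channel mono files with pydub
# (pip install pydub; decoding MP3 requires ffmpeg to be installed).
from pydub import AudioSegment

stereo_call = AudioSegment.from_file("call_stereo.mp3")
channels = stereo_call.split_to_mono()  # [left channel, right channel]

channels[0].export("agent.wav", format="wav")     # assumed: agent track
channels[1].export("customer.wav", format="wav")  # assumed: customer track

# Each file can now be sent to the transcription service on its own, so every
# line of transcript is unambiguously attributed to one party.
```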


I'm not sure that mono audio is the cause of this problem; perhaps try a few different transcription services and compare the results?

