ElevenLabs Scribe: we tried this speech transcription model, and it's stunning
ElevenLabs and its Scribe model. © ElevenLabs
The startup ElevenLabs has just unveiled Scribe, its new AI model dedicated to transcribing audio into text (speech to text). According to the company, it is the most accurate model in the world, with an error rate of only 3% in both English and French. We tried it and pushed it to its limits. One thing is certain: Scribe's accuracy is genuinely stunning.
Testing Scribe on a French rap track
The Scribe model is officially available in the ElevenLabs interface, and also through an API for application developers. To check its performance, we ran two tests, uploading two a cappella rap tracks to the ElevenLabs dashboard, one in French and one in English. For the first test, we chose a track released in 2000: Disiz la Peste – "J'pète les plombs".
Once we had obtained the vocals-only version, separated from the instrumental track, we sent it to the ElevenLabs transcription tool. Here is an excerpt of the result:
Damn, I'm tackling. Worse, I'm in a trance. Transport blocks, it hobs. It's been two hours since I’m only advancing. Little by little, I fart the lead. The Beauf from behind insults me and treats me with con. It's too much. I sit from my box, take my bag in the trunk and break. Leave my body on the periphery. Nothing to fuck, I trace. Too much stress, I'm too hungry. I need an MCDO. I find one, come back. There's a bastard tail, but hey, I do it. Thirty minutes later, I ask very politely: hello, a McMorning, please. When she answers me: too late. Sorry sir, it is noon and afternoon, well the McMorning, it's over. I said to him: when I asked you, it was eleven fifty-nine. Make an effort. I want a salty morning. You know, the one with the egg. She said to me: it's noon. I told you, it's over.
We then compared this excerpt of the transcription with the original lyrics:
Damn I am tulying, worse I am in a trance the transport blocks it Klaxon It has been for 2 hours that I am going to go that shortly by little I fart the beauf the beauf from behind insult me and treat me with con (asshole, go!) It is too much I go out of my box take my bag in the trunk and get my crate on the periphery, I donate a mc do I find one, come back, there is a bastard tail but hey I do it about 30 minutes later I ask very politely: “Hello a mc morning please” when it answers me “too late, sorry sir it is noon and afternoon, well the mc morning it is finished” I say it: “when I asked you it was 11:59 am I said: “It's noon, I told you it's over
The two texts are close to identical, and it is immediately clear that ElevenLabs correctly picks up most of what the artist says, despite the fast delivery typical of rap.
Scribe transcribes the lyrics of a very fast Eminem track
Impressed by the capabilities of this transcription model, we ran a second test that was meant to be harder. We kept the a cappella rap concept, but this time we chose Eminem – Rap God, a 2013 track in which the American rapper delivers his words at a frantic pace. We focused on one of the fastest passages, the one starting at the 4:25 mark.
Following the same protocol as in our first test, we retrieved an excerpt of the ElevenLabs transcription:
Uh, Suma-Luma-Duma-Luma, You Assuming I'm A Human. What I Gotta do to get it through to you i'm superhuman? Innovative and i'm made of Rubber so that anything you say is ricochetting. OFF OF ME AND IT'LL GLUE TO YOU AND DEVASTABING. More Than Ever Demonstracting How To Give A Motherfucking Audience. At Feeling Like It's Levitating, Never Fading. And I Know the Haters Are Forever Waiting for the Day that they can say i Fell Off, they'll be celebracting. 'Cause I Know the Way to get' Em Motivated. I Make Elevating Music, You make Elevator Music. “Oh, He's Too mainstream.” Well, that's what they do when they get Jealous, they confuse it. It's not hip-hop, it's pop. 'Cause I Found A Killer Way to Fuse it. With rock, shock rap with doc. Throw on lose yourself and make 'em lose it.
Once we had the transcription, we compared it with the original lyrics:
Uh, summa-lumma, dooma-lumma, you assumin 'i'm a human What i gotta do to get it through to you i'm superhuman? Innovative and i'm made of Rubber so that anything you say is ricochetin 'OFF OF ME, AND IT'LL GLUE TO YOU AND I'M DEVASTABING, MORE THAN EVER DEMONTRATING How to GIVE A MARCHERFUCKIN' Audience A Feeling LIKE IT's Levitating Never Fading, and I Know the Haters Are Forever Wait. The day that they can say i fell off, they'll be celebrating 'cause i know the way to get' em motivated i make elevating music, you make an elevator music “oh, he's tooo mainstiem” Well, that's What they do when they are Jealous, they confuse it ” Hip-hop, it's pop, “'cause i found a hella way to fuse it with rock, shock rap with doc throw on” lose yourself “and make' em lose it
Once again, the accuracy of the model designed by ElevenLabs is remarkable, if not stunning. The company says a low-latency version of its AI model will be released soon, which should let developers build new real-time conversational applications.