- The tech, dubbed VoCo (voice conversion), presents the user with a text box. Initially the text box shows the spoken content of the audio clip. You can then move the words around, delete fragments, or type in entirely new words. When you type in a new word, there's a small pause while the word is constructed—then you can press play and listen to the new clip.
VoCo works by ingesting a large amount of voice data (about 20 minutes right now, but that'll be improved), breaking it down into phonemes (each of the distinct sounds that make up a spoken language), and then attempting to create a voice model of the speaker—presumably stuff like cadence, stresses, quirks, etc., but Adobe hasn't provided much detail yet.
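Under the hood that pipeline sounds like a form of concatenative synthesis: map a typed word to phonemes, then stitch together snippets of those phonemes pulled from the speaker's recordings. Here's a toy sketch of the idea; everything in it (the mini pronunciation dictionary, the snippet names, the `synthesize` helper) is invented for illustration and isn't Adobe's actual code:

```python
# Toy sketch of a VoCo-style concatenative step. The dictionary and the
# "audio snippets" are placeholder strings standing in for real data.

# Tiny hand-written grapheme-to-phoneme lookup (ARPAbet-style symbols).
PRONUNCIATIONS = {
    "edit": ["EH", "D", "IH", "T"],
}

# Pretend phoneme inventory extracted from ~20 minutes of the speaker's
# voice: each phoneme maps to one or more recorded snippets.
speaker_units = {
    "EH": ["eh_clip_01"],
    "D": ["d_clip_01", "d_clip_02"],
    "IH": ["ih_clip_01"],
    "T": ["t_clip_01"],
}

def synthesize(word, units, dictionary=PRONUNCIATIONS):
    """Return the sequence of stored snippets needed to 'speak' a new word."""
    phonemes = dictionary[word.lower()]
    # Naively take the first snippet per phoneme; a real system would pick
    # units that match the surrounding cadence, stress, and pitch.
    return [units[p][0] for p in phonemes]

print(synthesize("edit", speaker_units))
# ['eh_clip_01', 'd_clip_01', 'ih_clip_01', 't_clip_01']
```

The hard part VoCo is demoing isn't this lookup, it's making the joins sound natural, which is presumably where the speaker model (cadence, stresses, quirks) comes in.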
Not quite sure why these corporate events always have to be so painful to watch, but the technology is impressive. Makes me wonder: if Adobe's close to turning this into a product you can run on your home computer, then which other organizations already have similar technology deployed? What could the CIA do with a few choice edits to a leaked recording, for example? You have to suspect that similar technologies may be in use outside of the public view.
Up until about 2 years ago it only ran on Windows, which kept most of the music-making world from fucking with it. Up until last year it only spoke Japanese, with no real English port for any of its instructions. That meant if you wanted to do something not in Japanese, you had to use Japanese phonemes and katakana instructions, and it was kinda torturous. Also, there are few things as fuckin' Otaku Japanese as Vocaloid. Now with bonus Archer clip.
We've been doing this the hard way for years. It doesn't take as long as you'd think - it's not that the process is difficult, it's that it's offensive - "if you need that line, go record that guy saying that line, ass." "But we don't have tyyyyyyymmmmmeeeeeeUHHHHHH!" "I will literally take an iPhone recording." "Why are you being so difffffffficulllllllllltUHHHHHHHH!" I think it's indicative of Adobe that they're simplifying a process that nobody needs rather than, oh, coming out with an audio editor that doesn't suck. Because the people that need this are Youtubers that didn't get that tedious c-grade celebrity saying that thing they needed for that tedious Youtube video, and they don't have the skills to edit around their failure. Real tools with a real function that accomplish a real task have been available for a decade.
I don't doubt that there are tools with a lot more utility than Adobe's software when it comes to editing vocal takes and such, like the one you linked. And like you said, it's not really that bad of a process to do manually - at least not from the limited experience I've had with it. The main interest I had in what Adobe showed was the ability to type in new words and phrases that a person hasn't actually said, and the software's ability to produce a somewhat realistic-sounding take. That's the cool part for me, vs. the basic vocal adjustment capability.
That's because Adobe has had this latent voice recognition thing that they've been trying to do for a while so they can tag Flash videos. It's why they came out with Story, so they could increase their corpus. My point is that the underlying technology Adobe is leveraging has been available for years from far better vendors using far better tools; it's just that nobody but Adobe would come up with a "let's fake news clippings" use for it.
Us audio engineers are going to be out of a job soon! ;-P