Language is essential to human interaction — but so too are its associated emotions.
Expressing our joy, sadness, anger or frustration helps communicate our messages and create bonds between us.
Generative AI has proven adept in many other areas; however, it still struggles to grasp the subtleties and complexities of human emotion.
Typecast, a startup that uses artificial intelligence to create synthetic voices and videos, claims it is disrupting this area with its new Cross-Speaker Emotion Transfer feature.
The technology is now available through Typecast's My Voice Maker feature, enabling users to apply emotions recorded from another voice to their own speaking style and making content creation faster and more efficient.
“AI actors still lack the emotional range necessary for human performances, making this their biggest limitation,” noted Taesu Kim, CEO and cofounder of Neosapience and Typecast in Seoul, South Korea.
With Typecast's Cross-Speaker Emotion Transfer, anyone can now easily use AI actors with emotional depth based on only a small sample of their own voice.
Kim noted that the traditional categories for emotion (happiness, sadness, anger, fear, surprise and disgust) don't adequately cover the full range of emotional expression possible in generated speech.
Speaking, he pointed out, is more complex than simply mapping text onto output speech.
“People can speak the same sentence in thousands of different ways,” Kim told VentureBeat in an exclusive interview. “We can also express different emotions within one sentence (or word).”
For example, recording "How can you do this to me?" with an emotion prompt such as "In a sad voice, as if disappointed" would yield dramatically different results from recording it with the prompt "Angry, like scolding."
Emotions such as the one described in the prompt "So sad because her father passed away but showing smiles on her face" are complex and hard to fit into a single neat category.
In their paper on the new technology, Kim and his fellow researchers write that humans can speak with a wide range of emotions, which makes for rich and varied conversation.
The emotional limits of text-to-speech
Text-to-speech technology has made remarkable advances in just a few short years, riding the broader wave of generative AI models such as ChatGPT, LaMDA, LLaMA, Bard and Claude from established providers and new entrants alike.
Emotional text-to-speech has made significant advances, Kim explained, but it remains challenging to implement because it requires large volumes of labeled data that most people cannot access. Capturing all the subtleties of emotion in voice recordings has proven time-consuming and difficult.
Additionally, “it can be extremely challenging to record multiple sentences for extended periods while maintaining emotion,” Kim and his colleagues write in their paper.
Traditional emotional speech synthesis requires all training data to be identified with an emotion label, while alternative approaches often necessitate additional emotion encoding or reference audio files.
This poses a considerable challenge, since data must be available for every emotion and every speaker. Existing approaches also often suffer from mislabeling, because intensity information is difficult to extract.
Cross-speaker emotion transfer is harder still when an emotion never recorded for a given speaker must be assigned to that speaker. The technology has so far had limited success: emotional speech rendered in a neutral speaker's voice tends to sound unnatural compared with the original source, and there is often no effective way to control emotion intensity.
“Even when it is possible to acquire an emotional speech dataset,” Kim and his co-researchers write, “there remains some difficulty controlling emotion intensity.”
Leveraging deep neural networks and unsupervised learning
To address this challenge, the researchers first fed emotion labels into a generative deep neural network, something Kim described as a world first. While this method succeeded in delivering emotional accuracy and expressive speaking styles, it was not sufficient for conveying more sophisticated emotions and speaking styles.
The researchers then built an unsupervised learning algorithm capable of distinguishing speaking styles and emotions across a large database, training it without any emotion labels, Kim explained.
These computations yield numerical representations of a given speech sample. Although unintelligible to humans, the representations can be used by text-to-speech algorithms to convey the emotions captured in the database.
The researchers also trained a perception neural network to translate natural-language emotion descriptions into these representations.
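To make the idea concrete, here is a minimal sketch of how such a pipeline could be wired together: one encoder learns a style/emotion latent from unlabeled reference audio, while a second network maps a written emotion description into the same latent space. The module names, dimensions and mel-spectrogram front end are assumptions made for illustration, not Typecast's published architecture.

```python
# Illustrative PyTorch sketch; all names and sizes are assumptions, not Typecast's actual system.
import torch
import torch.nn as nn

LATENT_DIM = 128  # assumed size of the unlabeled style/emotion latent

class StyleEncoder(nn.Module):
    """Learns a style/emotion embedding from reference audio without labels."""
    def __init__(self, n_mels=80):
        super().__init__()
        self.rnn = nn.GRU(n_mels, LATENT_DIM, batch_first=True)

    def forward(self, mel):              # mel: (batch, frames, n_mels)
        _, h = self.rnn(mel)             # summarize the whole utterance
        return h[-1]                     # (batch, LATENT_DIM) style latent

class EmotionPromptEncoder(nn.Module):
    """Maps a natural-language emotion description to the same latent space."""
    def __init__(self, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 256)
        self.rnn = nn.GRU(256, LATENT_DIM, batch_first=True)

    def forward(self, prompt_ids):       # prompt_ids: (batch, tokens)
        _, h = self.rnn(self.embed(prompt_ids))
        return h[-1]                     # predicted style latent

# A text-to-speech decoder (not shown) would then condition on the input text,
# a speaker embedding, and either encoder's output to render speech.
```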
“Using this technology, the user doesn’t have to record hundreds or thousands of different speaking styles/emotions as it learns from a large database of emotional voices,” Kim stated.
By exploiting this latent representation, the researchers were able to replicate voice characteristics from only short samples, writing that the approach enables "transferable and controllable emotion speech synthesis." Domain adversarial training and a cycle-consistency loss are used to disentangle speaker identity from speaking style.
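Both disentanglement techniques are standard tools in the speech synthesis literature. The sketch below shows the general form they usually take: a gradient-reversal layer feeds a speaker classifier so the style latent sheds speaker identity, and a cycle term re-encodes the generated speech and compares it with the style that produced it. The shapes, weights and helper names are illustrative assumptions rather than the paper's exact formulation.

```python
# Generic sketch of the two losses named above; not the paper's exact implementation.
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, flipped gradients in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None   # reverse gradients flowing into the encoder

def adversarial_speaker_loss(style_latent, speaker_ids, classifier, lam=1.0):
    # A speaker classifier tries to identify the speaker from the style latent;
    # gradient reversal pushes the encoder to remove speaker cues from that latent.
    logits = classifier(GradReverse.apply(style_latent, lam))
    return F.cross_entropy(logits, speaker_ids)

def cycle_consistency_loss(style_latent, resynth_mel, style_encoder):
    # Re-encode the generated speech; its style should match the style that drove it.
    return F.l1_loss(style_encoder(resynth_mel), style_latent)
```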
The technology analyzes massive volumes of recorded human voices gathered from audiobooks, videos and other media to detect emotional patterns, including variations in tone, emotional state and inflection within a speech sample.
Kim demonstrated how his method effectively transposed emotion onto neutral reading-style speakers using only a handful of labeled samples, with emotion intensity controlled via an intuitive scalar value.
He stated that this approach enables emotion transfer in an organic way without altering identity, noting that users can record a short snippet of their voice and apply an array of emotions and intensities, while AI adapts to individual voice characteristics.
Users can select styles of emotional speech recorded by someone else and apply them to their own voice while maintaining their individual vocal identity. By recording only five minutes of their own voice, they can express happiness, sadness, anger or any other emotion through natural-sounding speech.
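In practice, that workflow comes down to three inputs at synthesis time: the user's enrolled voice, an emotion latent taken from another speaker's recording, and an intensity value. The hypothetical helpers below are only meant to illustrate how those pieces could fit together; they are not Typecast's actual API.

```python
# Hypothetical usage sketch; `enroll_speaker`, `style_encoder` and `tts_decoder`
# stand in for whatever the product uses internally and are not real API calls.
import torch

def apply_emotion(emotion_latent: torch.Tensor, intensity: float) -> torch.Tensor:
    """Scale a reference emotion latent by a scalar in [0, 1].

    One simple way the "intuitive scalar value" for intensity described
    above could be applied before conditioning the decoder.
    """
    return intensity * emotion_latent

# Intended flow (assumed helpers, left as comments so the sketch stays self-contained):
# user_voice = enroll_speaker("my_recordings/")              # ~5 minutes of the user's voice
# sad_style  = style_encoder(load_mel("reference_sad.wav"))  # emotion captured from another speaker
# audio = tts_decoder("How can you do this to me?",
#                     speaker=user_voice,
#                     style=apply_emotion(sad_style, intensity=0.7))
```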
Typecast's technology has been adopted by Samsung Securities (a Samsung Group subsidiary), LG Electronics in South Korea and others since the company's founding in 2017, and the startup has raised $26.8 million to date. Kim said it is now working on adapting its core speech synthesis technologies to facial expressions.
Controllability is key to successful AI systems
Kim noted the fast-changing nature of media.
Text-based blogs used to be the go-to format for corporate media; now short-form videos reign supreme and companies and individuals must produce more audio/video content more frequently.
“For an effective corporate message delivery, an expressive voice of high quality is absolutely indispensable,” Kim noted.
He stressed the importance of rapid and cost-effective production; manual work by human actors simply is not practical.
“Controllability in generative AI is key for content production,” Kim explained. “These technologies empower individuals and companies alike to unleash their creative potential while improving productivity.”