Photo-Real Talking Head

Established: November 25, 2011

    • The 3D Photo-Real talking head project won “Demo of the Year” 2011 at MSRA. It was also shown at Craig Mundie’s Techforum 2011, Techfest 2011 (including the public day), Exec Retreat 2011, and MGX 2011, with extensive press coverage (MSNBC, PCWorld, CNET, The Seattle Times, etc.).
    • The Dictionary Talking Head was selected as one of MSR’s 18 highlighted “tech transfers” (i.e., work with significant product impact) of 2010 from the worldwide labs (as reported by PCWorld).
    • The Photo-Real talking head project took first place in the audio-visual consistency test of the LIPS Challenge 2009, an international audio/visual lip rendering contest held at the AVSP Workshop.
  • abstract

    We propose an HMM trajectory-guided, real-image-sample concatenation approach to photo-real talking head synthesis. It renders a smooth, natural video of the articulators in sync with given speech signals. With audio/video footage as short as 20 minutes from a speaker, the proposed system can synthesize a highly photo-real video in sync with the given speech signals. This system won first place in the audio-visual match contest of the LIPS 2009 Challenge.
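
    The abstract leaves the sample-selection step implicit; one common way to realize trajectory-guided concatenation is a Viterbi (dynamic-programming) search that balances closeness to the HMM-generated guide trajectory against smoothness between consecutive real samples. The following is a minimal sketch under that reading; the function name select_samples, the squared-error costs, the weight w_concat, and the toy data are illustrative assumptions, not the project’s actual implementation.

```python
import numpy as np

def select_samples(guide, library, w_concat=1.0):
    """Pick one library frame per time step so the chosen sequence tracks
    the HMM-generated guide trajectory (target cost) while staying smooth
    between consecutive picks (concatenation cost).  Solved exactly with
    dynamic programming (a Viterbi search).  Illustrative sketch only.

    guide:   (T, D) HMM-predicted visual feature trajectory
    library: (N, D) features of the N real image samples on file
    returns: length-T list of library indices
    """
    T, N = len(guide), len(library)
    # Target cost: squared distance from every guide point to every sample.
    target = ((guide[:, None, :] - library[None, :, :]) ** 2).sum(-1)    # (T, N)
    # Concatenation cost: squared distance between every pair of samples.
    concat = ((library[:, None, :] - library[None, :, :]) ** 2).sum(-1)  # (N, N)

    cost = target[0].copy()               # best path cost ending in sample j at t = 0
    back = np.zeros((T, N), dtype=int)    # backpointers for path recovery
    for t in range(1, T):
        total = cost[:, None] + w_concat * concat   # total[i, j]: prev i -> next j
        back[t] = total.argmin(axis=0)
        cost = total.min(axis=0) + target[t]

    path = [int(cost.argmin())]           # trace the optimal path backwards
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy usage: 10 guide points and 50 library samples in a 4-dim feature space.
rng = np.random.default_rng(0)
print(select_samples(rng.normal(size=(10, 4)), rng.normal(size=(50, 4))))
```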

    Video Demo:

  • abstract

    This demo shows a trainable, Hidden Markov Model (HMM)-based talking and singing head that can synthesize speech from given text or a singing voice from given lyrics and a music score (melody).

    In training, audio/visual features, along with the corresponding scripts (text, or lyrics and melody), are used to train statistical HMMs in which the key features of the basic audio/visual components and their dynamics are captured and parameterized statistically. In speech synthesis, the given text is first analyzed and decomposed into a sequence of phonemes along with their corresponding durations and F0 prosody; the trained HMMs then generate speech parameter trajectories, which are used to synthesize the final speech waveform. In singing voice synthesis, the lyrics and melody of the given song determine the pitch trajectory and phoneme durations, and this information drives the trained HMMs to synthesize a singing voice.

    Since the HMMs are trained on a person’s speech or a singer’s voice data, a personalized speech or singing voice can be reproduced optimally in the maximum-likelihood sense. Head motions and synchronized lip movements are automatically synthesized from the corresponding prosodic cues and viseme sequence, and they can also be modified interactively.
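
    The “maximum-likelihood sense” above refers to the standard parameter-generation step of HMM-based synthesis (often called MLPG), in which static and delta statistics jointly determine one smooth trajectory. Below is a minimal single-stream sketch assuming diagonal covariances and first-order deltas; the function mlpg and the toy numbers are illustrative, not the system’s code.

```python
import numpy as np

def mlpg(means, variances):
    """Maximum-likelihood parameter generation (MLPG) for one feature stream.

    means, variances: (T, 2) per-frame Gaussian statistics for the
    [static, delta] features read off the HMM state sequence.  Solving
        c = (W^T U^-1 W)^-1 W^T U^-1 mu
    yields the static trajectory c that is jointly most likely under both
    the static and the delta statistics.  Illustrative sketch only.
    """
    T = means.shape[0]
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0            # static row: c[t]
        W[2 * t + 1, t] = 1.0        # delta row:  c[t] - c[t-1] (just c[0] at t=0)
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    U_inv = np.diag(1.0 / variances.reshape(-1))   # diagonal covariance assumed
    mu = means.reshape(-1)                         # interleaved [s0, d0, s1, d1, ...]
    return np.linalg.solve(W.T @ U_inv @ W, W.T @ U_inv @ mu)

# Toy usage: a rising static target with tight delta statistics.
T = 8
means = np.stack([np.linspace(0.0, 1.0, T), np.full(T, 1.0 / (T - 1))], axis=1)
variances = np.stack([np.full(T, 1.0), np.full(T, 0.01)], axis=1)
print(mlpg(means, variances).round(3))
```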

     Video Demo:

  • abstract

    For foreign-language learners, mastering correct pronunciation is considered by many to be one of the most arduous tasks if one does not have access to a personal tutor. The reason is that the most common method for learning pronunciation, listening to and repeating audio tapes, has two important deficiencies: incompleteness and lack of engagement. Incompleteness, in that audio alone does not show users how to move their mouth and lips to sound out phonemes that may not exist in their mother tongue. Audio alone is also less motivating and less personalized for learners; as studies in cognitive informatics support, humans process information more efficiently when it arrives as both audio and visual input.

    The ambition is to create a visual language teacher that can engage in many aspects of language learning, from detailed pronunciation training to conversational practice. An initial implementation is a photo-realistic talking head for pronunciation training that demonstrates highly precise lip-sync animation for arbitrary text input. ESL users can thus watch synthesized videos on Bing Dictionary (Engkoo) to learn, for many sample sentences, how the mouth moves in sync with the speech.

    Video Demo:

    A live demo can be found on Bing Dictionary (http://dict.bing.com.cn).

    News Coverage:

  • abstract

    We propose a new 3D talking head with a personalized, photo-realistic appearance. Different head motions and facial expressions can be controlled and rendered freely. It extends our prior high-quality 2D photo-real talking head to 3D.

    Around 20 minutes of 2D audio-visual video are first recorded of a speaker reading prompted sentences. We use a 2D-to-3D reconstruction algorithm to automatically wrap the 3D geometric mesh with the 2D frames and so construct a training database. In training, super feature vectors consisting of 3D geometry, texture, and speech are formed to train a statistical, multi-stream Hidden Markov Model (HMM). The HMM is then used to synthesize both the geometry animation trajectories and the dynamic texture. The 3D talking head animation is controlled by the rendered geometric trajectory, while the facial expressions and articulator movements are rendered with the dynamic 2D image sequences. Head motions and facial expressions can also be controlled separately by manipulating the corresponding parameters. The new 3D talking head has many useful applications, such as voice agents, telepresence, gaming, and speech-to-speech translation.
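
    A multi-stream HMM, as described above, gives each stream (3D geometry, texture, speech) its own output distribution and combines the streams with weights. The sketch below shows that combination for a single state, assuming diagonal Gaussians; the class StreamStats, the stream names, and the weights are illustrative assumptions, not the trained model.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class StreamStats:
    mean: np.ndarray   # diagonal-Gaussian mean for this stream
    var: np.ndarray    # diagonal-Gaussian variances
    weight: float      # stream exponent (relative reliability)

def multistream_loglik(obs, streams):
    """Weighted log-likelihood of one HMM state whose output distribution
    factors over several feature streams, the usual multi-stream HMM
    formulation.  obs and streams are dicts keyed by stream name."""
    total = 0.0
    for name, stats in streams.items():
        x = obs[name]
        ll = -0.5 * np.sum(np.log(2 * np.pi * stats.var)
                           + (x - stats.mean) ** 2 / stats.var)
        total += stats.weight * ll
    return total

# Toy usage: a 3-dim geometry stream plus 2-dim texture and speech streams.
rng = np.random.default_rng(2)
streams = {
    name: StreamStats(mean=rng.normal(size=d), var=np.ones(d), weight=w)
    for name, d, w in [("geometry", 3, 1.0), ("texture", 2, 0.5), ("speech", 2, 1.0)]
}
obs = {name: s.mean + 0.1 for name, s in streams.items()}
print(multistream_loglik(obs, streams))
```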

    Video Demo:

     News Coverage:

  • abstract

    Speaking a foreign language fluently without ever attending a traditional or self-paced language course sounds incredible, if not impossible. In this demo, we create a talking head that can speak foreign languages. We use Chinese (the language to be learned) and English (the native language) as the language pair to demonstrate the technology: authentic Chinese is spoken lip-synchronously by an English speaker’s talking head, in the original speaker’s voice. The talking head and the corresponding Mandarin TTS are trained on the English speaker’s audio/video recordings. Two advanced technologies, the 3D photo-realistic talking head and cross-lingual TTS (text-to-speech) synthesis, are combined seamlessly. The Mandarin Chinese TTS was trained with one hour of the speaker’s English data. The synthesized Chinese is then lip-synced with the English speaker’s 3D photo-realistic talking head by matching corresponding inter-language lip articulations between the English speaker and a reference Chinese speaker. We predict the talking head’s trajectories with a statistically trained Hidden Markov Model (HMM) and render natural facial expressions and lip movements time-synchronously with the corresponding speech. The prototype is useful for applications such as speech-to-speech translation, voice agents, gaming, telepresence, and computer-assisted language learning.
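
    The matching of “corresponding inter-language lip articulations” can be read, in its simplest form, as a nearest-centroid mapping between the two speakers’ viseme inventories in a shared lip-feature space. The sketch below illustrates that reading; the function build_viseme_map, the 2-dimensional features, and the viseme labels are hypothetical, not the published method.

```python
import numpy as np

def build_viseme_map(src_centroids, ref_centroids):
    """Map each viseme of the reference (Chinese) speaker to the closest
    viseme of the recorded (English) speaker, measured as the Euclidean
    distance between mean lip-feature vectors in a shared feature space.
    Hypothetical sketch only."""
    return {
        ref_v: min(src_centroids,
                   key=lambda v: np.linalg.norm(ref_feat - src_centroids[v]))
        for ref_v, ref_feat in ref_centroids.items()
    }

# Toy usage with 2-dim lip features and made-up viseme labels.
src = {"AA": np.array([0.9, 0.1]), "IY": np.array([0.1, 0.8]), "UW": np.array([0.5, 0.5])}
ref = {"a": np.array([0.8, 0.2]), "i": np.array([0.2, 0.7])}
print(build_viseme_map(src, ref))   # {'a': 'AA', 'i': 'IY'}
```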

    Video Demo:

    News Coverage:

  • abstract

    We present a high-fidelity, speech-to-lips conversion talking head that requires no linguistic knowledge of the input speech. A context-dependent, multi-layer Deep Neural Network (DNN) is first trained with the error back-propagation procedure on thousands of hours of speaker-independent data, establishing a highly discriminative mapping between acoustic speech input and 9k tied states. Additionally, an HMM-based lip-motion synthesizer is trained on a speaker’s audio/visual data, in which each state is statistically mapped to its corresponding lip images. At test time, for given speech input, the DNN predicts the likely states in terms of their posterior probabilities. Photo-realistic lip animation is then rendered through the DNN-predicted state lattice with the HMM lip-motion synthesizer. In addition to speaker independence, the DNN can also be trained language-independently for corresponding gaming or telepresence applications.
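
    To make the acoustics-to-states mapping concrete, here is a toy forward pass of the kind of network described: a multi-layer perceptron with ReLU hidden layers and a softmax over tied states. The real system is a context-dependent DNN with roughly 9k senone outputs trained on thousands of hours of data; the layer sizes, random weights, and input below are toy-scale stand-ins.

```python
import numpy as np

def dnn_posteriors(frames, weights, biases):
    """Toy MLP forward pass: acoustic frames -> tied-state posteriors.

    frames: (T, D) acoustic feature vectors; weights/biases define ReLU
    hidden layers followed by a softmax output.  Everything here is
    toy-sized for illustration.
    """
    h = frames
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(h @ W + b, 0.0)              # ReLU hidden layers
    logits = h @ weights[-1] + biases[-1]
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

# Toy usage: 5 frames of 13-dim features, two hidden layers, 50 "tied states".
rng = np.random.default_rng(1)
sizes = [13, 32, 32, 50]
Ws = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(n) for n in sizes[1:]]
post = dnn_posteriors(rng.normal(size=(5, 13)), Ws, bs)
print(post.shape, post[0].argsort()[-3:])           # top-3 states for frame 0
```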

    Video Demo:

    Video demos (MP4 downloads) are available in English (en-US), Chinese (zh-CN), Japanese (ja-JP), Spanish (es-ES), and French (fr-FR).

  • abstract

    In this work, we turn our high-quality 3D photo-realistic talking head into a talking robot. Instead of displaying the 3D talking head on a flat-screen display, our new physical 3D robot has its 2D rendered image sequence projected onto a plastic robot face. The 3D talking robot has photo-realistic facial animation that is lip-synced with the corresponding audio speech signals. The system consists of three components: a plastic face mask for the robot, a mini-projector that back-projects the rendered video images onto the plastic mask, and a laptop computer that renders high-quality audio/video for any given text input. The technology can drive different robots in many natural and user-friendly applications.
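
    As a rough illustration of the three-component data flow (laptop renderer, geometric pre-warp, back-projection with synchronized audio), here is a stub pipeline; every function in it is a hypothetical placeholder, not the robot’s actual software.

```python
def synthesize(text):
    """Hypothetical stand-in for the laptop-side talking-head renderer:
    yields lip-synced (image, audio) chunks for the input text."""
    for word in text.split():
        yield f"<frame:{word}>", f"<audio:{word}>"

def warp_to_mask(image):
    """Hypothetical geometric pre-warp so the flat rendered frame lands
    correctly on the curved plastic mask when back-projected."""
    return f"warped({image})"

def run_talking_robot(text):
    # Laptop renders -> mini-projector back-projects onto the mask,
    # while the matching audio chunk plays.
    for image, audio in synthesize(text):
        print("project:", warp_to_mask(image), "| play:", audio)

run_talking_robot("hello world")
```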

People

Lijuan Wang

Principal Research Manager