Friday, March 7, 2008

Voice munching

Lip reading in computer vision tries to recover the conversation from a video sequence that has no audio. It exploits the movements of the lips and the jaw, which are assumed to correspond uniquely to what we speak. Now take the broadcasting of media content, say a news show: ideally you would only need to transmit the video of the person, and software running locally could give you the news through lip reading. But would it really be worth the effort? How much bandwidth does the audio signal take, after all?

I would be far more impressed if the whole concept were reversed: generate the lip movement by looking at the voice. Of course this would not replicate the exact video of the person talking, but I cannot think of a better example to show off the idea. Transmit only the first frame of video and guess the next frames, I mean the lip movement, from the transmitted voice. The bandwidth needed for a news channel would then be just the bandwidth of voice (telephone-quality speech compresses to a few kbit/s, while even a modest video stream needs hundreds), which means I could use my landline phone, with no special modem, to make a video call. Crazy stuff! But it all depends on how well you can make the person's lips dance to the tune of his voice.
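
For the curious, here is a toy sketch in Python of that "lips dance to the voice" step. It is only a sketch under big assumptions: it pretends some hypothetical upstream recogniser has already turned the voice into timed phonemes, and the phoneme list, the viseme table and the frame rate are all made up for illustration. A real system would then warp the lip region of the single transmitted frame to match each viseme.

```python
# Toy sketch: drive mouth shapes (visemes) from the voice.
# The phoneme segments, viseme table and frame rate below are
# illustrative assumptions, not the output of a real speech front end.

# Assumption: an upstream recogniser has already produced
# (phoneme, start_sec, end_sec) segments from the voice signal.
PHONEME_SEGMENTS = [
    ("HH", 0.00, 0.08),
    ("AH", 0.08, 0.20),
    ("L",  0.20, 0.28),
    ("OW", 0.28, 0.45),
]

# Coarse phoneme-to-viseme table. A viseme is the mouth shape that a
# group of visually similar phonemes shares.
VISEME_OF = {
    "HH": "open",
    "AH": "open",
    "L":  "tongue-up",
    "OW": "rounded",
}

FPS = 25  # frame rate at which the transmitted key frame gets animated


def visemes_per_frame(segments, fps=FPS):
    """Sample the phoneme timeline at the video frame rate and return
    the viseme (mouth shape) to render for each frame."""
    end = max(stop for _, _, stop in segments)
    frames = []
    t = 0.0
    while t < end:
        # Find the phoneme active at time t; default to a closed mouth.
        shape = "closed"
        for phoneme, start, stop in segments:
            if start <= t < stop:
                shape = VISEME_OF.get(phoneme, "closed")
                break
        frames.append(shape)
        t += 1.0 / fps
    return frames


if __name__ == "__main__":
    for i, shape in enumerate(visemes_per_frame(PHONEME_SEGMENTS)):
        print(f"frame {i:2d}: {shape}")
```

Sampling the phoneme timeline at the video frame rate is the cheap part; the rendering, making those mouth shapes look like the person in frame one, is where all the real work hides.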