Welcome to the world of intelligent speech processing! Ordinary automatic speech recognition (ASR) systems use only audio information. Human speech, however, is bimodal in nature: it carries both audio and visual information. The audio information is the usual acoustic speech signal; the visual information consists of lip movements and other visible cues. In a lip reading system, the relevant region of the video signal is isolated and features are extracted from it. From the resulting series of feature vectors we build a higher-level set of semantic elements called visemes, the visual equivalents of phonemes, which are the basic elements of audio speech. Using visual speech in addition to audio can therefore improve the performance of a recognition system, especially over a noisy channel. Effective intelligent speech recognition systems can help hearing-impaired persons, and they can also be used to automate telephone systems, for transcription in universities, and so on.
This paper deals with the basic concepts of lip reading and its importance in speech signal processing. The basic block diagram of a lip reading system is explained, along with the considerations involved in designing each block. Visual information alone, however, is not sufficient for effective speech recognition; this is where Joint Audio Visual Speech Recognition (JAVSR) becomes relevant. The various considerations involved in merging the audio and video information are discussed. Finally, based on our knowledge of lip reading and JAVSR systems, we propose a Joint Audio Visual Speech Processing System that can reduce errors in automatic speech recognition.
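To make the idea of merging the two streams concrete, the following is a minimal sketch of one common approach, a weighted combination of per-class log-likelihoods from an audio recognizer and a visual recognizer. It is an illustration only and not necessarily the scheme proposed in this paper: the function names, the `audio_weight` parameter, and all score values are hypothetical, chosen so that the syllables "ba" and "ga" are hard to tell apart acoustically but easy to tell apart on the lips.

```python
import math

def fuse_scores(audio_logp, visual_logp, audio_weight):
    """Weighted sum of per-class log-likelihoods from the two streams.

    audio_weight is a hypothetical reliability weight in [0, 1];
    the visual stream receives the remaining weight (1 - audio_weight).
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return {
        label: audio_weight * audio_logp[label]
        + (1.0 - audio_weight) * visual_logp[label]
        for label in audio_logp
    }

def recognize(audio_logp, visual_logp, audio_weight):
    """Return the class label with the highest fused score."""
    fused = fuse_scores(audio_logp, visual_logp, audio_weight)
    return max(fused, key=fused.get)

# Made-up scores for illustration: in noise the audio stream slightly
# prefers the wrong syllable, while the visual stream clearly sees
# the closed lips of "ba".
audio_logp = {"ba": math.log(0.45), "ga": math.log(0.55)}
visual_logp = {"ba": math.log(0.90), "ga": math.log(0.10)}

audio_only = recognize(audio_logp, visual_logp, audio_weight=1.0)
joint = recognize(audio_logp, visual_logp, audio_weight=0.5)
print(audio_only, joint)
```

With these assumed scores the audio-only decision picks "ga", while the equally weighted joint decision picks "ba". In a practical system the weight would be lowered as the acoustic channel becomes noisier.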