Date of Award

Summer 1996

Document Type

Thesis

Degree Name

Master of Science (MS)

Department

Electrical & Computer Engineering

Program/Concentration

Electrical Engineering

Committee Director

Peter L. Silsbee

Committee Member

Stephen A. Zahorian

Committee Member

John W. Stoughton

Call Number for Print

Special Collections LD4331.E55 C433

Abstract

Motivated by the fact that human speech perception is a bimodal process (auditory and visual), several researchers have designed and implemented automatic speech recognition (ASR) systems consisting of both audio and visual subsystems, and have shown improved performance relative to traditional, purely auditory systems. Several visual speechreading approaches have used deformable templates to model the shape of a speaker's lips. Deformable templates are models of image objects that can be deformed, by adjusting a set of parameters, to match the object in an optimal way as defined by a cost function. Using a single deformable lip model has disadvantages such as limited flexibility and reduced accuracy. In this thesis, a novel technique that alleviates some of those drawbacks, called the multimodel approach, is introduced and implemented.
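To make the deformable-template idea concrete, here is a minimal sketch of fitting a toy lip template to an edge image, assuming a model built from two parabolic arcs and an edge-strength cost; the thesis's actual model, cost function, and optimizer are not specified here, and all names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def lip_template(params, xs):
    """Upper/lower lip boundaries modeled as two parabolic arcs."""
    xc, yc, width, h_up, h_lo = params
    t = (xs - xc) / width                  # normalized horizontal position
    upper = yc - h_up * (1.0 - t ** 2)     # upper-lip arc
    lower = yc + h_lo * (1.0 - t ** 2)     # lower-lip arc
    return upper, lower

def cost(params, edge_map):
    """Negative edge strength along the template: lower cost = better fit."""
    rows_max, cols_max = edge_map.shape
    xs = np.linspace(params[0] - params[2], params[0] + params[2], 50)
    total = 0.0
    for ys in lip_template(params, xs):
        cols = np.clip(xs.astype(int), 0, cols_max - 1)
        rows = np.clip(ys.astype(int), 0, rows_max - 1)
        total -= edge_map[rows, cols].sum()   # reward edges under the curves
    return total

# "Deforming" the template = adjusting its parameters to minimize the cost.
edge_map = np.random.rand(64, 64)              # stand-in for a real edge image
x0 = np.array([32.0, 40.0, 20.0, 6.0, 8.0])    # initial center, width, heights
fit = minimize(cost, x0, args=(edge_map,), method="Nelder-Mead")
print(fit.x)                                   # fitted lip-model parameters
```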

The multimodel approach is based on classifying lip images into groups according to their characteristics. A key image is selected for each group and assigned a lip model, initial values for the model parameters, and visual feature extraction guidelines. Test images are matched with key images using a classification algorithm; the classification result determines which lip model, initialization, and extraction guidelines are applied to each test image. Better template fits are possible because of better initialization and more appropriate constraints in the lip model definition.
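A minimal sketch of that selection step, assuming test images are assigned to the group whose key image is nearest in simple pixel distance; the thesis's actual classification algorithm is not given here, and the group names, parameter values, and feature lists are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each group pairs a key image with its own lip-model setup: initial
# parameter values and feature-extraction guidelines.
groups = {
    "closed": {"key": rng.random((32, 32)),
               "init": [32, 40, 18, 3, 4],
               "features": ["outer_contour"]},
    "open":   {"key": rng.random((32, 32)),
               "init": [32, 40, 22, 8, 10],
               "features": ["outer_contour", "teeth_region"]},
    "wide":   {"key": rng.random((32, 32)),
               "init": [32, 40, 26, 10, 12],
               "features": ["outer_contour", "teeth_region", "tongue_region"]},
}

def classify(test_image, groups):
    """Assign a test image to the group with the closest key image."""
    return min(groups,
               key=lambda g: np.linalg.norm(test_image - groups[g]["key"]))

test_image = rng.random((32, 32))
group = classify(test_image, groups)
cfg = groups[group]   # lip model, initialization, and guidelines to apply
```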

Information about the visibility of the teeth and tongue inside a speaker's mouth had not previously been thoroughly exploited for visual feature extraction. Using the multimodel approach, it has been possible to extract, and investigate in detail, visual features that carry this information.
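As one plausible reading of such features, the sketch below measures teeth and tongue visibility as pixel fractions inside the mouth interior, assuming teeth appear bright and the tongue appears dark in a normalized grayscale image; the thresholds are illustrative, not taken from the thesis.

```python
import numpy as np

def visibility_features(mouth_interior, teeth_thresh=0.75, tongue_thresh=0.35):
    """Fractions of interior pixels classified as visible teeth or tongue."""
    n = mouth_interior.size
    teeth_frac = np.count_nonzero(mouth_interior > teeth_thresh) / n
    tongue_frac = np.count_nonzero(mouth_interior < tongue_thresh) / n
    return teeth_frac, tongue_frac
```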

The multimodel algorithm was tested on a database of images of a speaker's mouth uttering consonants. Recognition accuracy improved from 15% with a single-model approach to 45% with the multimodel approach. It is therefore evident that the multimodel technique yields better, more robust visual speech recognition because of appropriate model definition, better template initialization, and more detailed feature extraction.

Rights

In Copyright. URI: http://rightsstatements.org/vocab/InC/1.0/ This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).

DOI

10.25777/m2e0-qa58
