Error Awareness and Recovery in Conversational Spoken Language Interfaces

One of the most important and persistent problems in the development of conversational spoken language interfaces is their lack of robustness when confronted with understanding errors. Most of these errors stem from limitations in current speech recognition technology and, as a result, appear across all domains and interaction types. There are two approaches to increasing robustness: prevent the errors from happening, or recover from them through conversation, by interacting with the users. In this dissertation we have engaged in a research program centered on the second approach. We argue that three capabilities are needed in order to seamlessly and efficiently recover from errors: (1) systems must be able to detect the errors, preferably as soon as they happen; (2) systems must be equipped with a rich repertoire of error recovery strategies that can be used to set the conversation back on track; and (3) systems must know how to choose optimally between different recovery strategies at run-time, i.e. they must have good error recovery policies. This work makes a number of contributions in each of these areas.

Subsequent experiments with the machine learning infrastructure used in parts of this work have revealed a small defect in the procedure for logistic regression model construction and evaluation. In places where such models were constructed through a stepwise model-building process (chapters 6 and 7), candidate features were scored by assessing performance on the entire dataset (i.e. both the training and the development folds), instead of exclusively on the training folds. However, once a feature was selected for addition to a model, the model was trained exclusively on the training folds, i.e. the corresponding feature weight in the max-ent model was determined based only on the training data, and the evaluation was performed on the held-out development fold. Subsequent experiments with a correct setup (in which feature scoring looks only at the training folds) on several problems show that this defect does not significantly affect results. While under a correct setup the numbers reported in cross-validation might differ by small amounts, we believe the results reported in this work stand.
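To make the distinction concrete, the correct setup can be sketched as below: during stepwise forward selection, every candidate feature is scored by fitting on the training folds only and evaluating on the held-out development fold. This is an illustrative sketch, not the dissertation's actual infrastructure; `forward_select`, `toy_scorer`, and the data layout are hypothetical stand-ins (a majority-vote classifier substitutes for the max-ent/logistic regression models used in the thesis).

```python
from collections import Counter, defaultdict

def forward_select(data, candidates, n_folds, score_fn):
    """Stepwise forward feature selection under cross-validation.

    score_fn(features, train_examples, dev_examples) -> float is a
    hypothetical scorer: it fits a model using `features` on the
    training examples and returns its performance on the dev examples.
    """
    folds = [data[i::n_folds] for i in range(n_folds)]

    def cv_score(features):
        # Correct setup: fit on the training folds only, evaluate on the
        # held-out development fold, and average across folds. The buggy
        # setup would have scored candidates on the full dataset instead.
        total = 0.0
        for k in range(n_folds):
            train = [x for i, f in enumerate(folds) if i != k for x in f]
            dev = folds[k]
            total += score_fn(features, train, dev)
        return total / n_folds

    selected, current = [], cv_score([])
    remaining = list(candidates)
    while remaining:
        best, best_score = max(
            ((f, cv_score(selected + [f])) for f in remaining),
            key=lambda t: t[1])
        if best_score <= current:
            break  # no candidate improves cross-validated performance
        selected.append(best)
        remaining.remove(best)
        current = best_score
    return selected

def toy_scorer(features, train, dev):
    # Majority-vote classifier over the selected feature values, fitted
    # on train and evaluated on dev -- a stand-in for training and
    # evaluating a logistic regression model.
    table = defaultdict(Counter)
    for x, y in train:
        table[tuple(x[f] for f in features)][y] += 1
    default = Counter(y for _, y in train).most_common(1)[0][0]
    correct = 0
    for x, y in dev:
        key = tuple(x[f] for f in features)
        pred = table[key].most_common(1)[0][0] if key in table else default
        correct += pred == y
    return correct / len(dev)
```

For example, on toy data where feature "a" perfectly predicts the label and "b" is noise, `forward_select(data, ["a", "b"], 5, toy_scorer)` selects only `["a"]`: adding "b" cannot improve the cross-validated score, so selection stops.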