Practitioners and researchers who work with speech technology face a more complex problem than those who work with GUIs. In a VUI (voice user interface) design, information is conveyed auditorily, not visually, and involves aspects of human communication and social behavior that are not well understood in applied work. Additionally, the previous definitions of usability don’t account for another favorite theme in the speech industry, that of persona, or the personality of a speech interface. Applied practitioners and researchers must create or modify methods of usability testing to accommodate their unique needs.
The purpose of this article is to present several views of usability issues in speech technology. Our respondents address questions in three areas: 1) the definition of speech usability, 2) usability and design, and 3) usability testing methodology. They are three professionals who represent the range of educational backgrounds currently involved in speech technology design, and all of them currently work in the speech industry. Their unique perspectives demonstrate the wide variety of practices encountered today in speech usability and highlight potential areas of future research.
1. Define usability in the context of a speech-enabled application.
JUAN: Speech-enabled applications present unique challenges versus GUIs. For example, speech does not persist. Once something is said, it’s gone. In a GUI, however, information is presented on a screen that persists: the user can look at it, go away, and then come back to it.
The transience of a speech interface makes it vital that users immediately understand menu option names because they are unlikely to be able to hold several options in memory while they decide between them. Furthermore, if 1,000 people type the word “speech” using the same keyboard, the input will always be interpreted the same way. In speech-enabled applications, however, it is highly likely that those same 1,000 people could say “speech” and the interpretations would differ, due to inherent limitations in the algorithms that decode the spoken acoustic signal into words. These differences make the design, implementation, and evaluation of speech-enabled applications different from and more challenging than those of GUIs.
MELANIE: To me, speech usability is very similar to pragmatics, the study of language usage in a conversational context. My research suggests speech usability is a four-factor construct that consists of:
- User Goal Orientation—The efficiency of the dialogue and the extent to which a speech application is focused on user needs.
- Speech Characteristics—The extent to which an application’s voice meets existing expectations (including expectations of voice quality derived from popular media like TV or radio).
- Customer Service Behavior—The extent to which an application uses common vocabulary and speaks in a friendly, supportive, courteous manner.
- Verbosity—The extent to which the system is perceived as inappropriately talkative.
2. Where is the dividing line between “technical difficulties” (in other words, recognition accuracy) and “real usability” for speech applications?
SUSAN: Some of the problems encountered by the user of speech technology will not be due to the design of the user interface, but to a failure of the technology itself. In the speech community, some argue that the source of problems experienced by users is ultimately unimportant because the end result is the same: the user is unable to use the speech application to accomplish his goals. There is some validity to this argument—a problem is a problem, who cares why it happened?
However, others believe that the diagnostic power of usability testing is diluted by giving equal weight to recognition failures and interface design problems. This view does not seek to minimize the impact of problems due to failure of speech recognition algorithms. Instead, it considers recognition failures and user interface failures separately to understand the contributions of each. In fact, the “divide and conquer” approach deals very seriously with recognition errors by emphasizing error-handling in the VUI design. This emphasis allows us to design prompts to minimize the negative effects of imperfect technology for the user.
JUAN: I believe that the technical difficulties associated with speech recognition accuracy are part of the usability evaluation. Usability evaluation should include recognition accuracy as a matter of fact. However, what’s more important is task completion. A poor recognition rate can be neutralized with error recovery. I don’t see a “dividing line” between the “technical difficulties” and “real usability” of speech applications. A good design expects speech recognition errors and handles them appropriately. It is part of human communication to misrecognize and then recover. When done well, the user won’t even notice.
MELANIE: I agree with aspects of both my colleagues’ answers and would reinforce the assertion that error handling is a critical aspect of VUI design. In fact, some social cognitive researchers have argued that we can never be sure that any communicative partner fully understands an utterance in the same way that the speaker intended it. Therefore, most communication is negotiating miscommunication.
Unfortunately, the design of error recovery strategies is also one of the least developed areas of VUI design, and there is relatively little research on miscommunication and communicative breakdown. At the same time, however, there is at least anecdotal evidence that users may perceive a speech application as helpful and a good negotiator of meaning if it has an effective error recovery strategy. This issue certainly has important implications for the community of applied VUI designers and researchers.
3. Is usability the same as, or different from, persona? Which is more important?
MELANIE: In the speech industry, persona has become one of the more popular aspects of VUI design, especially for selling the technology. In speech technology, unlike GUIs, the term refers to the “personality” of a speech interface, the set of personal characteristics that are conveyed by the choice of voice talent (the narrator), the style and tone of prompts, and the flow of dialogue.
In other words, a persona is an imaginary customer service representative who is emulated by the VUI. This concept has been very controversial—different designers put relatively more or less emphasis on persona. Some designers deny that persona exists at all, while others suggest that persona is the most important characteristic of an application.
Usability as a separate term also has received quite a bit of attention in speech technology: it is most often considered the ease of use and efficiency of a speech application.
In my mind, persona and usability are synonymous, and effective planning for both is critical to the success of a speech application. The research suggests that aspects of speech conveyed by the voice talent (for example, pitch, pitch range, loudness, loudness range, how well the speaker articulates), the overall organization of a VUI, and the linguistic style of prompts (for example, word choice, syntax, content and structure of error recovery prompts) all combine to form an impression of dialogue efficiency, interactive ease, and affective response in the user. These impressions are similar to Nielsen’s characteristics of general usability. The same processes are at work in speaking with a machine or human—our social perception is extremely rapid, automatic, and strongly associated with our emotional responses to others.
SUSAN: I’d like to suggest that it is impossible for users not to infer a persona when participating in a spoken conversation, whether with a person or with an automated system. Part of the human language faculty involves the process of sizing up who you’re talking to by the way they speak. This isn’t something people can turn on and off, and it isn’t possible for there to be no persona in a VUI. (VUIs supposedly designed with “no persona” tend to sound schizophrenic, robotic, or both.)
Persona in a VUI is like color on a web page. Color is not the single defining quality of usability on a web page, but users’ overall impressions are affected in significant ways by color. Persona functions similarly, but I would argue that persona and VUI usability are more tightly entwined than color and GUI usability for this reason: it is possible to test a colorless version of a web page, but it is not possible to run a persona-less test for a VUI.
JUAN: Usability is not the same as persona. Persona does have an impact on usability, but the two are not the same. For example, each person has a persona—language, appearance, sound, etc. However, our interactions with others can be successful or usable, or not.
4. In a typical speech project, what is the best time for usability testing?
SUSAN: Because speech technology is imperfect, it is vital to gather user requirements early in speech projects. For instance, an application that will be used primarily in a noisy environment (like at a public kiosk or in a car) dictates choosing a grammar-based recognition strategy (one with a limited number of built-in words or phrases) that will minimize the effects of acoustic interference. Because different recognition strategies require very different sorts of prompting, making these strategic decisions based on user input can be a “make or break” factor for a speech application.
As in any usability project, there is a trade-off between how early you test and how representative the data gathered really are. The impact of acoustic properties on a speech application is much more dramatic than the impact of a paper or wireframe model on a GUI. To deliver the true flavor of a VUI, users must hear the prompts as they are recorded by the voice talent, not spoken live by a researcher for each participant. One technique that VUI designers sometimes use in early stages, instead of prototypes, is peer review.
Tuning is the process of optimizing the performance of a production speech application. The focus in tuning is very broad: everything from low level recognition parameters to prompts is evaluated. If tuning data are used judiciously and interpreted narrowly, it is possible to come to some conclusions about usability from tuning data.
MELANIE: I’ve done some early prototype evaluation work in which usability participants listened to sample system-user dialogues to judge the quality of a speech system. In these third-party observations, linguistics and psychology experts and a general audience rated a set of user-system interactions. The high-to-low ranking of interface quality was similar between the groups, suggesting some validation for this methodology. However, the expert judges provided more negative ratings of the user interfaces overall, possibly as a result of their more sophisticated knowledge of conversational structures and norms.
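One way to put a number on how similar two groups' high-to-low rankings are is a rank correlation. The sketch below is illustrative only: the interview does not say which statistic was used, the ratings are made-up values, and the function names are hypothetical. It computes Spearman's rank correlation using the simple formula that assumes no tied ratings.

```python
from typing import Sequence


def ranks(values: Sequence[float]) -> list[int]:
    """Return 1-based ranks (1 = lowest value). Assumes no ties."""
    if len(set(values)) != len(values):
        raise ValueError("this simple sketch does not handle tied values")
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman_rho(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman rank correlation for two equal-length lists without ties."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    n = len(a)
    d_squared = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1 - (6 * d_squared) / (n * (n * n - 1))


# Hypothetical mean quality ratings for four interfaces (not study data).
expert_ratings = [2.1, 3.0, 3.8, 4.2]    # experts rate lower overall
general_ratings = [3.0, 3.9, 4.1, 4.8]   # general audience rates higher

rho = spearman_rho(expert_ratings, general_ratings)
```

In this made-up example the experts give lower ratings across the board, yet both groups order the interfaces identically, so the correlation is 1.0. That is the pattern described above: similar rankings despite a harsher expert baseline.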
JUAN: I believe the entire process is the best time to test. Usability testing can be done during design, implementation, pre-deployment, and post-deployment. During design, “Wizard of Oz” experiments can be done to test designs. Heuristic evaluations can be done as well. During implementation, prototypes can be tested using numerous methods. Before deployment, a summative evaluation can be done. Finally, during post-deployment, in-use, in-situation testing can be done.
5. When a team designs a user interface, there may be significant disagreement about the right way to create a callflow and phrase prompts. When is “evaluation by opinion” helpful and when does it create a problem?
MELANIE: I find this particular problem especially pronounced in speech technology development. In my experience, speech development teams often consist of the recognized VUI design team plus the armchair quarterbacks who argue for or against a certain wording, voice talent, and method of organizing an interface based on their own preferences. (If you had the opportunity to create a communication partner, wouldn’t you design him or her to be someone you’d like to talk to?)
I’ve found that this kind of evaluation by personal opinion is very destructive to a development team. It often undermines the credibility of the VUI designers (especially if disagreements occur in the presence of a client) and also can cause trust and collaboration to break down among project stakeholders. At the same time, a certain amount of opinion is also important to the practical success of a project; ultimately, you are designing an interface to represent a client’s brand. The client rightfully should collaborate with the design team during design and have the final say over the wording and the choice of voice talent to represent that brand.
As designers and experts in linguistics and social interaction, I think it is vital that we use our discipline’s existing knowledge to educate clients about the implications of their design decisions. If their lay reactions ultimately lead to a degradation of interface efficiency or affective response (in other words, usability), we need to make sure they understand the implications of their choices.
SUSAN: Armchair quarterbacks often fail to see prompts as an instantiation of an overall design philosophy that is based on a recognized set of user-centered design principles. It is important that, up front, VUI designers establish not only their credentials but the legitimacy of the UCD process with clients. If clients understand that designs are based on user data, not on the designer’s opinion, then it is easier to explain why changing one prompt can make such a big difference. This doesn’t mean that clients won’t still try to make changes, but at least they’ll understand the repercussions.
JUAN: Evaluation by opinion depends on the person with the opinion. If that person is an expert, then the opinion carries more weight. However, I recommend that, when in doubt—test! Get users and test the alternatives.
6. What are the measures or metrics for speech usability?
JUAN: There are many tools in the usability professional’s toolbox. I don’t think it is possible to name all of the measures or metrics, but some common ones include:
- Task completion time or rate
- Word error rate or recognition rate
- User satisfaction
These and other measures and metrics should be used to get a complete picture of usability.
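For readers unfamiliar with the first two metrics in the list, here is a minimal sketch of how they are commonly computed. It assumes the standard definition of word error rate (word-level edit distance divided by the number of reference words) and a simple pass/fail notion of task completion. The function names and sample utterances are hypothetical and do not come from any of the respondents' tools.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Dynamic-programming edit distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def task_completion_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of test sessions in which the user completed the task."""
    if not outcomes:
        raise ValueError("need at least one session outcome")
    return sum(outcomes) / len(outcomes)


# Hypothetical example: what the caller said vs. what the recognizer returned.
wer = word_error_rate("check my account balance", "check my count balance")
completion = task_completion_rate([True, True, False, True])
```

Here one of four reference words is misrecognized, giving a word error rate of 0.25, and three of four sessions succeed, giving a completion rate of 0.75. The two numbers can diverge: a session with recognition errors may still end in a completed task if error recovery works.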
SUSAN: I don’t agree that recognition rate is a true usability measure. When the speech recognition algorithm fails to produce a good result, I don’t see this as primarily a usability problem, but as a technical one. Even if we had perfect recognition algorithms, there would still be usability issues with VUIs. Those of us who evaluate the usability of VUIs should not allow imperfect recognition to cloud other usability issues.
MELANIE: I think the most important measures of interface quality are subjective ratings of the social perceptions known to be associated with speech and language usage. One example: friendliness is not only expected from customer service representatives, it is associated with a relatively wide range of pitch and loudness in speech. Thus, we can get an indication of how effectively a voice talent is presenting prompts if we measure the “friendliness” of a speech interface.
7. Are there any good measures of speech usability? How were they developed?
MELANIE: All three of us have developed separate but parallel measures of usability, each with a very different approach. These differences are due to differences in our individual disciplines and in our ideas about usability.
For example, I’ve approached usability measurement as a diagnostic and statistical problem. Over the past six years, I’ve developed measures that combine social communication, linguistics, speech, and customer-service behavior. Through this effort, I’ve discovered that my first scales (which measured only aspects of speech and linguistics) were too limited and did not discriminate satisfactorily among interfaces of different quality. More recently, I broadened the types of items measured to also include expectations of customer service providers, e-service provided through the Internet, and ease of use. Thus, my best measurement tool so far is a twenty-five-item rating scale that includes a variety of speech, linguistic, customer service, and ease-of-use items. Since it was developed using the statistical methods of the behavioral sciences, I have evidence that the new scale has good validity, reliability, and sensitivity.
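The interview does not say which statistics were used to assess the scale. As one common example from the behavioral sciences, the sketch below computes Cronbach's alpha, a widely used estimate of the internal-consistency reliability of a multi-item rating scale. The function name and the ratings are hypothetical and are not data from the twenty-five-item scale.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(ratings: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for a respondents-by-items matrix of ratings.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    n_respondents = len(ratings)
    n_items = len(ratings[0])
    if n_respondents < 2 or n_items < 2:
        raise ValueError("need at least 2 respondents and 2 items")
    if any(len(row) != n_items for row in ratings):
        raise ValueError("every respondent must rate every item")

    # Sample variance of each item (column) across respondents.
    item_variances = [
        variance([row[item] for row in ratings]) for item in range(n_items)
    ]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in ratings])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")

    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical ratings: 4 respondents, 3 items on a 1-5 scale (not real data).
sample = [
    [4, 5, 4],
    [2, 2, 3],
    [5, 5, 5],
    [3, 3, 2],
]
alpha = cronbach_alpha(sample)
```

Values closer to 1 indicate that the items vary together across respondents. Alpha addresses only reliability; validity and sensitivity, which are also mentioned above, require separate evidence.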
SUSAN: My work has focused more on methods than metrics, but like Melanie, I also developed a subjective scale that allows users to rate their VUI experiences. The content of my survey (developed independently) is nearly identical to Melanie’s, which tends to validate both metrics. Additionally, I recently organized a workshop with Dr. Jim Larson at the SpeechTEK West 2006 Conference, in which the attendees’ goals were to set out theoretical and practical guidelines for VUI designers and usability specialists. Look for an article in an upcoming issue of Speech Technology Magazine.
JUAN: The Holistic Usability Measure (HUM) by Gupta and Gilbert is a measure that can be used for speech usability. It was developed at Auburn University in the Human-Centered Computing (HCC) Lab as part of an ongoing research effort on usability.