Human-to-Human Interaction Style in Voice User Interface Design

Advancements in natural language processing, voice recognition technology, and speech synthesis allow voice-enabled devices to mimic human-to-human interactions fairly well. How well a device can simulate a human voice and generate natural(-like) language in a conversation varies across platforms, and because the technology is relatively new, users often do not have consistent expectations of their conversation with a conversational user interface (CUI). These inconsistent expectations are often exacerbated by the differences between verbal and written language when the CUI modality is voice; such CUIs form a subset of conversational UIs called voice user interfaces (VUIs; often described as “voice-enabled” when embedded into a device).

This can lead to unpredictable user behavior: when the user does not know how to act, we cannot predict how the user will act. This article attempts to mitigate some of this uncertainty by outlining a few general guidelines for designers to keep in mind when working with VUIs.

Natural Language Input and Output

There are two components of any conversational user interface (CUI)—in any modality, including voice—that determine the user experience:

User input: How well can the CUI interpret what the user is saying?
CUI output: How accurately is the CUI responding to the user, and how human-like is that response?

The natural language understanding (NLU) process allows the CUI to use regular, human language as its input, rather than a predetermined list of commands or utterances with an inviolable word order. In turn, a CUI responds with natural(-like) language using the natural language generation (NLG) process, enabling back-and-forth interactions that can mimic, as much as is programmatically possible, a conversation between two humans.

When a CUI is voice-enabled (which is to say, when the CUI is a VUI), another aspect of the NLG process layers into an appropriately human-like output: whether the speech synthesis engine can successfully mimic the pauses, intonations, and inflections of human speech. For a VUI, the generated natural language output alone is not enough to create a “human-like” experience; a human-like voice needs to accompany the output in order for a VUI to achieve a human-to-human-like feel.
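As a rough illustration of how pauses and pacing can be requested from a speech synthesis engine, the sketch below wraps a response in Speech Synthesis Markup Language (SSML). The function name `to_ssml`, its parameters, and the sample sentences are hypothetical, and support for individual SSML elements varies by engine, so treat this as a sketch rather than a recipe for any particular platform.

```python
from xml.sax.saxutils import escape


def to_ssml(lead, detail, pause_ms=300):
    """Wrap a two-part response in SSML (hypothetical helper).

    The lead sentence is spoken normally, followed by a short pause,
    and the detail is spoken at a slower rate.
    """
    return (
        "<speak>"
        f"{escape(lead)}"                      # escape &, <, > in the text
        f'<break time="{pause_ms}ms"/>'        # pause between the two parts
        f'<prosody rate="slow">{escape(detail)}</prosody>'
        "</speak>"
    )


print(to_ssml("Your alarm is set.", "Six thirty tomorrow morning."))
```

The same generated sentence can sound noticeably more or less human depending on markup like this, which is why the voice is worth designing alongside the wording.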

Domain Boundaries

The actual domain that a CUI covers is not as important as whether or not the user understands the boundaries of that domain and, relatedly, the functionalities the CUI can perform. The natural language understanding and generation processes of a CUI are by necessity domain-specific, but the subconscious inferences that a user makes are based on how these components behave and are, if not domain-agnostic, at least domain-neutral.

You can reasonably—if subconsciously—assume that a VUI that can tell you the weather forecast for Minneapolis can also tell you the weather forecast for Sacramento or Tokyo; however, it is much less reasonable to assume that a VUI that has just told you Minneapolis’ five-day forecast is also capable of telling you how the New York Stock Exchange is doing, even though stocks also use “forecasting.” On the other hand, if a VUI allows you to search for an item and place it in your virtual shopping cart, then it is reasonable to assume that you could also ask for the price of that item, or for information on other items in your cart, or check out and buy the item in question.

Discoverability, or the user’s potential to uncover what functionalities a CUI offers, is more difficult in a VUI than in a chatbot or other type of text-based CUI. There are contextual clues within a text-based CUI that allow a user to determine what a CUI can or cannot do; often chatbots will provide guidelines or suggestions in a greeting message in order to enhance the likelihood that the user will be able to take advantage of all of its domain knowledge. VUIs, however, cannot offer visual cues without additional hardware, and so often must rely more heavily on managing the user’s expectations within the conversation and providing contextually-appropriate responses to out-of-scope inputs.
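One way to manage expectations inside the conversation is to have the out-of-scope response itself name what the VUI can do. The sketch below is a minimal, assumed design: the `CAPABILITIES` table, the intent names, and the wording are all hypothetical and not taken from any real product.

```python
# Hypothetical in-scope domains, each with a short spoken description.
CAPABILITIES = {
    "weather": "the weather forecast for a city",
    "alarms": "setting an alarm",
}


def out_of_scope_response(capabilities):
    """Build a fallback prompt that tells the user what is in scope."""
    options = list(capabilities.values())
    if len(options) > 1:
        listed = ", ".join(options[:-1]) + " or " + options[-1]
    else:
        listed = options[0]
    return f"Sorry, I can't help with that yet. I can help with {listed}."


def respond(intent):
    """Route an already-classified intent, falling back when out of scope."""
    if intent in CAPABILITIES:
        return f"(handled by the {intent} domain)"
    return out_of_scope_response(CAPABILITIES)


print(respond("stocks"))
```

Because the fallback is spoken, keeping the list of capabilities short matters more here than it would in a chatbot greeting.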

Creating a Human-to-Human Experience

The technical limitations a designer needs to keep in mind when creating a VUI experience center around how the user is expected to interact with the VUI. Since spoken language is not the same as written language, the natural language understanding process will need to account for a wide variety of vocabulary and syntax options.

The user will expect the VUI to understand the following inputs as identical:

“Hey can you uh set an alarm for, an alarm for uh six six thirty tomorrow ay em?”
“Can you set an alarm for six thirty tomorrow morning, please?”
“Set alarm, six thirty tomorrow morning.”
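To make the idea concrete, here is a minimal sketch of keyword-based parsing that maps all three utterances above to the same structured request. The vocabularies, the `parse_alarm` function, and the output format are assumptions made for illustration; a production NLU would rely on trained models and a far larger vocabulary.

```python
import re

# Hypothetical vocabularies, kept tiny for illustration.
FILLERS = {"uh", "um", "hey", "please"}
HOURS = {
    word: number
    for number, word in enumerate(
        ["one", "two", "three", "four", "five", "six",
         "seven", "eight", "nine", "ten", "eleven", "twelve"],
        start=1,
    )
}
MINUTES = {"fifteen": 15, "thirty": 30}


def parse_alarm(utterance):
    """Return a structured alarm request, or None if none is recognized."""
    # Lowercase, drop punctuation, and remove filler words.
    tokens = [t for t in re.findall(r"[a-z]+", utterance.lower())
              if t not in FILLERS]
    if "set" not in tokens or "alarm" not in tokens:
        return None

    # Keep the last hour word that is directly followed by a minute word,
    # so a false start such as "six six thirty" still yields 6:30.
    hour = minute = None
    for prev, tok in zip(tokens, tokens[1:]):
        if prev in HOURS and tok in MINUTES:
            hour, minute = HOURS[prev], MINUTES[tok]

    text = " ".join(tokens)
    # Treat "morning", "am", and the spelled-out "ay em" as equivalent.
    period = "am" if re.search(r"\b(morning|am|ay em)\b", text) else None
    day = "tomorrow" if "tomorrow" in tokens else None
    return {"intent": "set_alarm", "hour": hour, "minute": minute,
            "period": period, "day": day}


for spoken in [
    "Hey can you uh set an alarm for, an alarm for uh six six thirty tomorrow ay em?",
    "Can you set an alarm for six thirty tomorrow morning, please?",
    "Set alarm, six thirty tomorrow morning.",
]:
    print(parse_alarm(spoken))
```

Each of the three inputs produces the same dictionary, which is the behavior the user expects even though the surface forms differ widely.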

A designer also needs to consider how the user might behave while the VUI is answering the user. Does the VUI convey enough necessary information at the beginning of the response that the user is unlikely to interrupt it? Are there technical limitations that determine whether or not the user can interrupt the VUI? Shorter VUI outputs mean that, if the VUI cannot be interrupted, there is less user frustration if the NLU misinterprets a user input and triggers a mismatching response.

Of course, if the VUI uses a particular word or phrase in an output, it must accept that as an input. The VUI should never say something it does not “understand.”
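This rule lends itself to an automated check during design. The sketch below compares the words in a set of output prompts against the vocabulary the NLU is assumed to accept; the function names, prompts, and vocabulary are hypothetical, and a real check would likely compare whole phrases against the NLU grammar or training data rather than single words.

```python
import re


def words(text):
    """Split a prompt into lowercase words, ignoring punctuation."""
    return set(re.findall(r"[a-z']+", text.lower()))


def unrecognized_output_words(prompts, understood_vocabulary):
    """Return words the VUI says but would not accept as input."""
    spoken = set()
    for prompt in prompts:
        spoken |= words(prompt)
    return sorted(spoken - set(understood_vocabulary))


# Hypothetical prompts and input vocabulary for an alarm feature.
prompts = [
    "Your alarm is set for six thirty tomorrow morning.",
    "Do you want to snooze?",
]
understood = {
    "your", "alarm", "is", "set", "for", "six", "thirty",
    "tomorrow", "morning", "do", "you", "want", "to",
}

print(unrecognized_output_words(prompts, understood))  # ['snooze']
```

Any word that appears in the result is a candidate either for removal from the prompts or for addition to what the VUI understands.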

Understanding User Expectations

The human aspect of a voice-enabled smart device is a self-sustaining premise. People assume their conversational partner understands them in a particular way based on who (or what) said conversational partner is; when people treat computers like humans, the interaction design must account for that. The more human-like the device, the more the user will treat the device like a human; the less human-like its responses are, or the more robotic its synthesized voice sounds, the more the user will treat the device like a computer.

This subconscious assessment of a device’s “humanity” extends to its domain expertise and reasonable expectations of its capabilities, and there are pros and cons to creating a VUI that users subconsciously consider “more human.” If a VUI’s responses make the user feel like their interaction is ersatz, that user will begin to default toward simpler vocabulary and syntax, reverting to commands instead of questions, and limit themselves to domains in which they know the VUI is competent.

This means that users are less likely to encounter out-of-domain errors. However, if users begin to feel too comfortable while conversing with the VUI and begin to forget or at least get comfortable with the fact that they are interacting with a computer, then they may include more complex syntax, wider vocabulary choices, patterns that are found in verbal language but not written language, or content that requires extralinguistic contextual cues.

This extra trust in the VUI’s competence comes with a price: if the natural language understanding process cannot keep up with the user’s less formal language, or if the VUI’s knowledge is too narrow to include adjacent domains that a user might reasonably expect to touch upon in a conversation, the user is essentially unable to recognize the VUI’s boundaries. When that happens, user expectations rise and the VUI is unable to respond appropriately; the user interprets this as VUI incompetence rather than recognizing that they have linguistically “stepped” out of bounds.

Personification: Managing Subconscious Expectations

One way in which designers can balance user expectations with natural language generation and speech synthesis capabilities is to choose a VUI name that accurately reflects how robust the VUI’s NLU and domain expertise are. Whether the VUI has a human or human-like name, or a title or descriptor in lieu of one, can affect how the user interacts with the VUI: what the user says for input, what the user expects the VUI to be able to do, and whether the user feels that the interaction is human-to-human.

Take for example two well-known voice-enabled smart devices with some overlapping domains: Amazon’s Echo (Alexa) and Alphabet’s Google Home lineup. Figure 1 shows an Alexa device; Figure 2 shows a Google Home device, part of the Google Connected Home roster.

Image of an Amazon Echo device.

Figure 1. Amazon’s Echo answers to “Alexa” and uses first-person pronouns. (Credit: Piyush Maru)

Image of an activated Google Home Mini device.

Figure 2. When listening to user input, the lights on the Google Home Mini light up. (Credit: Andrea Marchitelli)

As “Alexa” is a human name, using it as a wake word (the set of syllables in a name or phrase that a VUI actively “listens” for and that prompts it to begin an interaction) encourages the subconscious impression of a human-to-human interaction. When the VUI fails to perform a task correctly or returns an incorrect answer, the user may be evaluating the interaction at a more human-to-human level.

Google Home, on the other hand, is triggered by “Hey Google,” and there is no name assigned to the VUI persona. As the user cannot engage the VUI verbally without using that wake word, it serves as an intentional reminder that the interactions are human-to-computer, realigning user expectations.

While both Amazon Alexa and Google Home undeniably have personalities—in casual conversation, both are referred to with the pronoun “she,” and when an error occurs we use the human-like terms “she made a mistake” or “that was stupid” rather than something like “there was a backend process that triggered incorrectly”—Google Home’s lack of a single, unifying name disperses some of that personification and allows the system to settle into a human-to-computer interaction role with its user.

Evolution of Human-Like Features Within A VUI: A Case Study of Radar Pace

Three years ago, I worked on Oakley’s Radar Pace, voice-enabled smart glasses with an interactive coaching system for running and cycling (see Figure 3). The “coach” allowed users to request information verbally (user-initiated dialogue) or to receive verbal updates from the system when it volunteered time-sensitive information relevant to the user (system-initiated output). (For more information on Radar Pace, see Danielescu and Christian’s “A Bot is Not A Polyglot: Designing Personalities for Multilingual Conversational Agents.”)

Image of Radar Pace sunglasses that have earbuds connected to the frames.

Figure 3. Oakley’s Radar Pace sunglasses have no visual interface, but they double as fashionable athletic eyewear. (Credit: Gwen Christian)

As an example, the user might be able to say something like, “how much farther should I run?” or “how much longer?” at any point, and the Radar Pace coach would reply with the correct response. However, the coach would also volunteer information like “you’re halfway done” or “one mile to go,” which the user might not know or be able to ask for given the activities the user was undertaking when interacting with the coach. (Being too winded to ask the coach a question was a fairly common problem among test subjects.)
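The system-initiated side of this behavior can be pictured as a check that runs on each new distance reading and speaks up when a milestone is crossed. The sketch below is purely illustrative and is not Radar Pace's actual implementation; the function name, the two milestones, and the use of miles are assumptions.

```python
def milestone_announcements(total_miles, previous_miles, current_miles):
    """Return prompts for milestones crossed between two distance readings."""
    milestones = [
        (total_miles / 2, "You're halfway done."),
        (total_miles - 1, "One mile to go."),
    ]
    # A milestone fires only on the reading that crosses its threshold,
    # so the coach does not repeat itself on later readings.
    return [
        message
        for threshold, message in milestones
        if previous_miles < threshold <= current_miles
    ]


# A six-mile run, checked at three points along the way.
print(milestone_announcements(6, 1.0, 2.0))  # []
print(milestone_announcements(6, 2.9, 3.1))  # ["You're halfway done."]
print(milestone_announcements(6, 4.8, 5.0))  # ['One mile to go.']
```

Comparing against the previous reading is the design choice that matters here: it lets the system volunteer each piece of information once, at the moment it becomes relevant, without the user having to ask.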

The System Seemed Smart

Test feedback indicated that this was a feature that users appreciated: The system was delivering the content the users wanted before the users even knew they wanted it. This system-initiated output became a key component in making Radar Pace interactions appear human-like to the user; determining what information is relevant and then passing it along is a human-like trait.

The system-initiated output was also an opportunity to train the user in the available domains and vocabulary. The users had no understanding of the VUI domain boundaries, and the topics that Radar Pace chose to bring up in its commentary allowed the user to understand what functionalities they could access.

This did not preclude user exploration, but its guided learning provided structure for users to expand upon within that exploration, whether or not they were consciously aware of it. If Radar Pace said “your stride rate is too long,” the user knew they could say “what’s my stride rate?” or “how is my stride rate?” The more often the user interacted with Radar Pace, the more knowledge they gained about the domains and vocabulary that were available to them.

User Reactions to Human-Like Output

Users personified the Radar Pace coach and felt as though they were having a human-to-human interaction with it. Evidence for this came from user feedback on the workout compliance summary provided at the end of the coaching session (again, a system-initiated rather than user-initiated communication): if the coach said anything about the user not completing part of a workout, users said they felt “judged” and that the coach was “disappointed” in them. Disappointment and judgement are human traits rather than electronic ones; if you say “my sunglasses are disappointed in me,” it does not have quite the same level of melodrama.

Conclusion

There are five major takeaways for designers working with Voice User Interfaces:

It is not just what you say, it is how you say it. Speech synthesis can matter just as much as natural language generation.
Users need to understand what a voice-enabled device can and cannot do, and what it can and cannot understand, in order to use the device.
If you have done a good enough job with your voice user interface that users subconsciously think they are in a human-to-human interaction, they will start to act like it—so make sure you have accounted for that.
Voice user interface names can affect user perception.
Human-like interactions can have positive and negative consequences on user feedback.
