Improve Your Usability Tests: Learnings from Evaluating 100-Plus Usability Tests

In this article, we present important insights from observing and evaluating intermediate and experienced usability professionals working to obtain their CPUX-UT certification (Certified Professional for User Experience – Usability Testing) from UXQB, the International Usability and UX Qualification Board. More than 100 usability tests resulted in insights for conducting remote testing; improving usability test tasks, for example, by avoiding pretender tasks and simulation; facilitating relaxed think-aloud; protecting the dignity of test participants; and more.

Background: CPUX-UT Certification

The insights reported in this article come from our experience of evaluating videos and usability test reports submitted by more than 100 candidates, who are mostly of German and British origin. Candidates submitted the reports and videos to obtain their CPUX-UT certificate. Training is based on the CPUX-UT curriculum, and both the curriculum and the public test questions are openly accessible.

Candidates obtain the CPUX-UT certificate by passing a theoretical and a practical exam. In the practical exam, candidates must demonstrate their ability to conduct a simple usability test of a website specified by the examiner, for example, a weather website. They must recruit three usability test participants and write four usability test tasks, which must address two given use cases. They have seven days to carry out the usability test. The seven-day period can be scheduled whenever candidates have time. After the usability test has been completed, candidates submit unedited videos of the three usability test sessions and their usability test report. Each video must show the screen, the test participant, and the candidate during the whole test session.

Each practical exam is evaluated by an examiner based on a publicly available checklist of 106 items; we are currently examiners. Since 2014, both the CPUX-UT curriculum and checklist have been revised several times based on our evaluation of submissions from candidates and feedback from trainers. Examiners use the checklist to document problems in the test sessions, and we provide detailed comments so candidates can understand exactly why they obtained a particular score. The comments help candidates improve their usability testing abilities.

Figure 1. A summary of the learnings reported in this article.

Improving Usability Test Tasks

Diagnostic Value

We observed that one of the main problems for many certificate candidates is creating tasks that provide high diagnostic value. A test task with high diagnostic value addresses an important use case that is not covered by previous tasks, and it cannot be solved too easily. Some candidates, unfortunately, use tasks that can be solved simply by clicking a link or a button that is prominently displayed on the start page.

Often, tasks are similar or trivial, contain hints, do not match the assignment, or lack precise success criteria, leaving it unclear if the task was successfully completed.

Successful candidates ensure that each task has maximum diagnostic value. For example, after the task, “Find the current weather forecast for Los Angeles,” it would not be helpful to also ask, “Find the current weather forecast for San Francisco.” Instead, this task has a higher diagnostic value: “Suppose you plan a two-hour walk starting right here, right now. Find out on the website whether or not you’ll need an umbrella.”

Hints

Successful candidates avoid hints in the task. For example, they ask, “What happens after your trial period for the magazine ends?” instead of, “If you were to subscribe for a free trial today, when will you need to cancel your subscription to avoid being charged for a full subscription?” The former phrasing is better because “cancel your subscription” is used repeatedly on the magazine’s website, which makes it a hint.

Some candidates read tasks to test participants. Successful candidates ask test participants to read the task silently and then paraphrase it before they start working on it; in this way, the candidates ensure that test participants are working on the intended task.

The CPUX-UT curriculum discusses several types of non-optimal test tasks. We frequently observe pretender tasks and imprecise tasks.

Pretender Tasks

Figure 2. An example of a pretender task.

A pretender task is a test task that asks test participants to pretend to be someone they are not. Doing so creates a less realistic context for the task solution, and test participants may even perceive it as an insult. An example of a pretender task is: “You are planning a vacation in Porto, Portugal. Rent a powerful sports car to get around.” This task becomes a pretender task if the participant would never rent a sports car, but rather an economy-size car or a bicycle (Figure 2).

We observed that pretender tasks become problematic when the role-play goes into the political-ideological realm. The participants must then not only play but act against their world views and values. For example:

“You love to visit Dubai, and tomorrow you will depart for a wonderful vacation there. Please find out what the weather will be like for the next seven days.”

The pretense can be eliminated by replacing the scenario with a task:

“Find the weather forecast for Dubai for the next seven days.”

Scenarios that involve motivations the participant may not have are sometimes the cause of pretender tasks.

Pretense that does not ask test participants to act against their values is acceptable. For example:

“You live in Frankfurt. A client, who lives in Hamburg, calls you to say that she needs to talk to you in person today. Use your smartphone to find the fastest train connection to Hamburg.”

Imprecise Tasks

An imprecise task is a test task with an unclear goal that makes it difficult to determine when the task is completed. For example:

“Find an article that interests you in the newspaper,” is imprecise because any article is a valid answer, whereas the task, “Find a review of the movie, Cruella,” is sufficiently precise.
“Find and read the cancellation policy,” is imprecise, whereas a similar task, “How much does it cost to cancel a reservation?” is sufficiently precise.

Moderation

Results from a usability test are unique in one aspect: They show what representative users can accomplish with the interactive system when they carry out representative tasks. Eliciting personal opinions from users or discussing them does not support this objective and should be left to other methods.

Figure 3. A candidate who talks excessively during a task.

Some candidates interview the test participant rather than moderate a usability test. They talk excessively between tasks, sometimes during tasks (as shown in Figure 3), and during the debriefing. Successful candidates observe quietly during moderation and limit the debriefing to two questions, “Which two to three things did you like most about the website?” and, “Which two to three things are most in need of improvement?” At most, two or three minutes are required to answer these questions.

We observed that some candidates encourage relaxed think-aloud incorrectly: They ask test participants to reflect on what they are doing, for example, “Please think aloud. Comment on what you are doing, and tell me what you like and dislike.” Successful candidates simply say, “Please think aloud.” Some candidates seem to have a strong need to say more than these three simple words, for example, “Please let me listen in to your thinking. Just think aloud.”

Simulation

Simulation takes place when a test participant pretends that they are using the website instead of actually using it to complete tasks. In a simulation, the candidate no longer gets observational data but results from a usability inspection carried out by the test participant, that is, a layperson. For example, a test participant might start making assumptions about other users: “Other users might not notice the search button,” or, “Other users might be annoyed by this ad.” Another example of simulation can be found in Figure 4, in which a test participant makes assumptions about other users, for example, “Some people don’t know that clicking the logo in the upper left corner returns you to the home page.” Successful candidates point out diplomatically that the session is only about the test participant’s own actions and not about their assumptions about what other people might do.

We observed that simulation may be triggered by the following:

Questions that encourage simulation, for example, “What would you do if search does not return a helpful result?”
Asking for alternative ways of carrying out the task, such as, “What would you have done if you had not found the article on the front page?”
Questions during debriefing, for example, “Take another look at the home page and tell me what you think,” or, “Review the shopping cart page.”

Giving Test Participants Leeway

Some candidates’ openness triggered curiosity and exploration, which led to important insights.

For example, one candidate created an open-ended task regarding accessories in connection with a transportation scenario. Almost all candidates think of child seats when it comes to transportation, and many candidates build tasks that explicitly ask for child seats. This candidate’s test participant had a large dog. The candidate gave her test participant so much openness in the test task that he spontaneously started looking for ways to accommodate large dogs instead of child seats.

The Dignity of Test Participants

Figure 5. A candidate’s behavior may be interpreted by the participant as disrespectful.

A basic, ethical principle for a usability test is this: Always respect the dignity of the participants. Make sure participants are willing to come for another test. For some business profiles, participants are expensive; don’t alienate them! For example, in Figure 5, the candidate looks at his smartphone while moderating, which could be interpreted by the test participant as a lack of interest in what she is doing.

We observed:

Lack of respect for the test participant led to an unwillingness to participate. For an example, see Figure 5.
Candidates sometimes interrupted test participants while they were thinking aloud.
Some candidates were impatient and interrupted test participants even if they were still speaking.
Some candidates did not maintain appropriate physical distance to the participant.

These candidates did not notice that participants physically leaned more and more away from them. Some candidates reached over participants’ arms for the mouse, instead of installing their own mouse.

Successful candidates responded promptly and assertively to test participants’ self-criticism or perceived task failure.

The following two responses from candidates are exemplary:

The test participant said, “It took me a long time.” The candidate responded, “Yes, it did, but there is no failure here because knowing this difficulty helps us improve the website.”
After giving up on the first task, the test participant said, “I am a bit embarrassed that I can’t solve this task.” The candidate responded empathetically, “You do not need to feel embarrassed. You have provided us with valuable information by showing us what we need to improve. Please remember that we are not testing you.”

Remote Testing

During the pandemic, about 20 candidates used remote testing instead of in-person testing to attain certification. Here’s what we learned about remote testing by observing these candidates.

Remote Control

Test participants can access the website from their own computer or from the candidate’s computer. Table 1 compares the two approaches.

Table 1: Alternative Access for Remote Test Participants

	Participant’s Computer	Candidate’s Computer
Access	Test participants open the website on their computer and share the browser.	Test participants access the website using remote control of the candidate’s computer. It is, of course, important that remote control is allowed only for the browser, not for the entire computer. Some company computers do not allow outsiders to control the computer because of security concerns.
Advantages	Test participants work in their own well-known environment. There are no delays in mouse movements.	The candidate can easily intervene if required, for example, if the test participant attempts to make a purchase, conclude an agreement, or enter real personal data.

Data Privacy

Many video conferencing tools, for example, Zoom, allow renaming test participants so their names are not visible on the video or to observers. Alternatively, test participants can be asked to use pseudonyms.

Task Presentation

If the test participant uses the candidate’s computer, tasks can be presented via a PDF in a separate browser tab. In the PDF, each task should appear on a separate page, so test participants do not inadvertently see more than the current task.

Alternatively, tasks can be presented in the chat. Because test participants often have difficulties finding the chat in screen sharing mode, candidates should provide advice about where to find it as part of the briefing.

Remote control and task presentation have migrated into the usability lab; we now use video conferencing tools in the usability lab because many things are brilliantly simple with them. For example, in a remote test, candidates have all documents on the screen, as opposed to paper shuffling in tests that are conducted face-to-face. Moreover, video conferencing tools have extensive recording capabilities; some even offer automatic transcription.

Trivial Usability Findings

Usability findings must be substantial.

We observed trivial usability findings, such as, “Too many ads,” or, “Test participants liked the home page.” Such findings should be avoided or turned into substantial findings, for example, by explaining how ads interfered with test participants’ use of the website or how the home page specifically helped test participants to get their job done. Most trivial findings were vague positive findings. Below, Table 2 shows some examples of trivial and non-trivial findings.

Table 2: Trivial and Non-Trivial Findings

Trivial	Non-Trivial
“Test participants noticed that search was available.”	“All test participants immediately located search and used it without problems.”
“None of the test participants had difficulties finding current news,” when “Current News” was a main category that was prominently displayed on the start page.	“All test participants quickly found local news, which seemed inconspicuous on the start page.”

Conclusion

It is rewarding for us to be allowed to observe candidates at work—we also learn from every examination. Observing other professionals is a wonderful opportunity for us to reflect on and improve our own methodology. We recommend that you observe your colleagues and ask them to observe you. From observing candidates, we gained particularly important insights regarding pretender tasks, the diagnostic value of tasks, and remote control, and we were able to share these results.

Rolf Molich

Rolf Molich has conducted hundreds of usability tests since 1984. He conceived and managed the Comparative Usability Evaluation studies. He received the UXPA Lifetime Achievement Award in 2014.

Bernard Rummel

Bernard Rummel has been working as a UX and user research expert at SAP since 2000, currently focusing on quantitative research and online usability testing. Together with Rolf, he developed the CPUX-UT curriculum.

Susanne Wasserroth

Susanne Waßerroth has worked in the field of UX and usability since 1998. She specializes in usability testing and joined the team of CPUX-UT examiners 2.5 years ago. Currently, she is employed at Deutsche Telekom IT GmbH, where she also advocates accessibility.

User Experience