Automated Usability Testing: A Case Study

In our user experience team at Fidelity Investments, we’ve conducted over forty unmoderated remote usability tests over the past five years. We use them as an adjunct to traditional lab tests and remote, moderated usability tests. We’ve found that unmoderated remote tests reveal usability variations between different design solutions that typical lab tests generally don’t detect. The advantage of the unmoderated remote tests lies in the sheer number of participants. We usually have at least 500 participants in just a few days when we can use our own employees as participants in these tests, and it’s not uncommon to have over 1,000 participants. When performing evaluations with panels of our customers, we commonly have at least 200 participants in a week. These numbers provide tremendous data. We routinely get statistically significant differences in task completion rates, task times, and subjective ratings when comparing alternative designs. Even what appears to be a minor design difference (e.g., a different phrase to describe a single link on a website) can yield significant differences in usability measures.

A Sample Unmoderated Remote Usability Study

The best way to describe unmoderated remote usability tests is with an example, so I devised a test comparing two Apollo space program websites: the official NASA site (Figure 1) and the Wikipedia site (Figure 2).


Figure 1. Apollo program home page on NASA.


Figure 2. Apollo program home page on Wikipedia.

Participants in the study were randomly assigned to use only one of these sites. Most of the unmoderated remote studies I’ve conducted are this “between-subjects” design, where each participant uses only one of the alternatives being tested.

The next step was to develop tasks for the participants. Based on my own knowledge of the Apollo program, I developed a set of candidate tasks before studying either site. I then eliminated any tasks that I couldn’t find the answer to on both sites. That left nine tasks:

  1. How many legs did the Lunar Module (lander) have?
  2. Which Apollo mission brought back pieces of the Surveyor 3 spacecraft that had landed on the moon two years earlier?
  3. The famous photo called Earthrise, showing the Earth rising over the Moon, was taken on which Apollo mission?
  4. Which manned Apollo mission was struck by lightning shortly after launch?
  5. Who was the Command Module pilot for Apollo 14?
  6. Who were the last two men to walk on the moon?
  7. Which Apollo mission brought back the so-called Genesis Rock?
  8. What was the name of the Apollo 12 Lunar Module?
  9. Which area of the moon did Apollo 14 explore?

The best tasks have clearly defined correct answers. In this study, the participants chose the answer to each question from a dropdown list. We’ve also used free-form text entry for answers, but the results are more challenging to analyze.

We design most of our unmoderated remote usability studies so that most participants can complete them in under thirty minutes. One way to keep the time down is to randomly select a smaller number of tasks from the full set. Across many participants, this gives us good task coverage while minimizing each participant’s time. We gave each participant four randomly selected tasks out of the full set of nine, presented in a random order to minimize order effects.
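This selection scheme is easy to sketch in code. A minimal illustration, assuming the nine tasks from this study (abbreviated to short labels) and a hypothetical `tasks_for_participant` helper:

```python
import random

# The nine candidate tasks from the study, abbreviated to short labels.
ALL_TASKS = [
    "Lunar Module legs", "Surveyor 3 pieces", "Earthrise mission",
    "Lightning-strike mission", "Apollo 14 CM pilot",
    "Last two moonwalkers", "Genesis Rock mission",
    "Apollo 12 LM name", "Apollo 14 landing area",
]

def tasks_for_participant(all_tasks, n=4):
    """Pick n tasks without replacement. random.sample also returns
    the chosen tasks in a random order, which covers both the random
    selection and the random presentation order described above."""
    return random.sample(all_tasks, n)

print(tasks_for_participant(ALL_TASKS))
```

Across many participants, each task is drawn roughly equally often, which is what gives the full set good coverage.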

When a potential participant went to the starting page (http://www.webusabilitystudy.com/Apollo/), an overview of the study was displayed. When the user clicked “Next,” a set of instructions was shown. As explained in those instructions, when the user clicked “Begin Study,” two windows opened, filling the screen (Figure 3).


Figure 3. Screen and window configuration for an unmoderated remote usability study.

The small window at the top presents the tasks to perform; the larger window presents one of the two sites being evaluated. The users were free to use any of the features of the site; however, they were instructed not to use any other sites to find the answers (e.g., Google).

Each task included a dropdown list of possible answers, including “None of the above” and “Give Up.” Three to six other options were listed, one of which was the correct answer to the question. We required the user to select an answer (which could be “Give Up”) to continue to the next task. The participant was also asked to rate the task on a 5-point scale ranging from “Very Difficult” to “Very Easy.” We automatically recorded the time required to select an answer for each task, as well as the answer given.

After attempting all four tasks, we asked the participant to rate the site on two seven-point scales, each of which had an associated comment field:

  1. Overall, how easy or difficult was it to find the information you were looking for?
  2. Overall, how visually appealing do you think this website is?

We vary these rating scales from one study to another depending on the sites being tested and the study goals. We followed with two open-ended questions about any aspects of the website they found particularly challenging or frustrating, and any they thought were particularly effective or intuitive. We use these questions in most of our usability studies.

We also modified the System Usability Scale (SUS) to help evaluate websites. The original version of SUS was developed by John Brooke while working at Digital Equipment Corporation in 1986. We instructed participants to select the response that best describes their overall reactions to the website using each of ten rating scales (e.g., “I found this website unnecessarily complex,” or “I felt very confident using this website.”) Each statement was presented along with a 5-point scale of “Strongly Disagree” to “Strongly Agree”; half of the statements were positive and half negative.

Results of the Sample Study

The main purpose of the study was to illustrate the testing technique, not to seriously evaluate these particular sites. We posted a link to the unmoderated remote study on several usability-related email lists, and collected data from March 11 – 20, 2008. Many of the participants in the study work in the usability field or a related field, so they can’t be considered a random sample.

A total of 192 people began the study and 130 (68 percent) completed the tasks in some manner. Undoubtedly, some people simply wanted to see what the online study looked like and were not really interested in taking it.

One of the challenges with unmoderated remote studies is identifying participants who are not performing the tasks but simply clicking through them, answering randomly or choosing “Give Up.” They might not be interested in the tasks themselves but simply want to enter the drawing. In studies like this, about 10 percent of the participants usually fall into this category.

To identify these participants, I first completed all nine of the tasks myself several times using both sites, having first studied the sites to find exactly where the answers were. The best time I was able to achieve was an average of thirty seconds per task. I then eliminated thirteen participants (10 percent) whose average time per task was less than thirty seconds, bringing the total number of participants to 117. Of those, fifty-six used the NASA site and sixty-one used the Wikipedia site.
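This screening step is straightforward to automate. A sketch, assuming per-task times in seconds have been recorded for each participant; the data and the `screen` helper are hypothetical, while the 30-second threshold is the author’s measured best:

```python
# Hypothetical per-task times (seconds) for three participants.
participants = {
    "p01": [25, 18, 22, 20],    # suspiciously fast: likely clicking through
    "p02": [95, 140, 60, 180],  # plausible effort
    "p03": [40, 33, 51, 38],
}

THRESHOLD = 30  # best average achievable even knowing the answers in advance

def screen(times_by_participant, threshold=THRESHOLD):
    """Keep only participants whose average time per task meets the threshold."""
    return {
        pid: times
        for pid, times in times_by_participant.items()
        if sum(times) / len(times) >= threshold
    }

kept = screen(participants)
print(sorted(kept))  # ['p02', 'p03'] -- p01 is excluded
```

Any screening rule like this should be set before looking at the results, so it can’t be tuned to favor one site over the other.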

The basic findings of the study were that users of the Wikipedia site:

  • Got significantly more tasks correct than did users of the NASA site (71 percent vs. 58 percent, p=.03 by t-test).
  • Were marginally faster than users of the NASA site in doing their tasks (1.8 vs. 2.2 minutes per task, or about 23 seconds shorter, p=.07).
  • Rated the tasks as significantly easier on a 5-point scale than did users of the NASA site (3.1 vs. 2.6, p<.01).
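For readers who want to run the same kind of comparison on their own data, the two-sample t statistic behind these p-values can be computed with nothing beyond the standard library (in practice you would use a statistics package that also reports the p-value). The sample values below are made up for illustration:

```python
import statistics

def welch_t(a, b):
    """Welch's two-sample t statistic (robust to unequal variances)."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    return (mean_a - mean_b) / (var_a / len(a) + var_b / len(b)) ** 0.5

# Hypothetical per-participant mean task times in minutes for each site.
nasa_times = [2.5, 1.9, 2.4, 2.1, 2.0, 2.6, 2.3]
wiki_times = [1.7, 1.9, 1.6, 2.0, 1.8, 1.9, 1.7]

t = welch_t(nasa_times, wiki_times)
print(round(t, 2))  # positive: NASA times are longer on average
```

With the sample sizes these studies produce (hundreds of participants), even modest differences in means can reach statistical significance.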

One way to see an overall view of the task data for each site is to convert the accuracy, time, and rating data to percentages and then average those together. This provides an “overall usability score” for each task that gives equal weight to speed, accuracy, and task ease rating (Figure 4). With this score, if a given task had perfect accuracy, the fastest time, and a perfect rating of task ease, it would get an overall score of 100 percent. These results clearly show that Tasks 3 and 7 were the easiest, especially for the Wikipedia site, and Tasks 4 and 8 were among the most difficult.


Figure 4. Average usability scores for each task and site, with equal weighting for accuracy, speed, and task ease.
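The article doesn’t spell out the exact normalization, so this sketch makes one common assumption: speed is expressed relative to the fastest mean task time observed, so the fastest task earns 100 percent on the speed component. The `usability_score` helper is my own:

```python
def usability_score(accuracy, mean_time, fastest_time, mean_rating, rating_max=5):
    """Combine accuracy, speed, and ease into one 0-100 score with
    equal weight. Accuracy is the fraction of correct answers, times
    are in the same units, and the ease rating is on a 1-to-rating_max
    scale. The time normalization here is an assumption, not taken
    from the article."""
    accuracy_pct = accuracy * 100
    speed_pct = fastest_time / mean_time * 100
    rating_pct = mean_rating / rating_max * 100
    return (accuracy_pct + speed_pct + rating_pct) / 3

# A task with perfect accuracy, the fastest observed time, and a
# perfect ease rating gets the maximum score.
print(usability_score(1.0, 60, 60, 5.0))  # 100.0
```

Composite scores like this are useful for ranking tasks at a glance, but the individual measures should still be examined, since they can disagree.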

After attempting their four tasks, the participants were asked to rate the site they had just used on two scales: Ease of Finding Information and Visual Appeal. The Wikipedia site received a significantly better rating for Ease of Finding Information (p<.01), while the NASA site received a marginally better rating for Visual Appeal (p=.06).

The final part of the study was the System Usability Scale (SUS), which consists of ten rating scales. A single SUS score was calculated for each participant by combining the ratings on the ten scales such that the best possible score is 100 and the worst is 0. Think of the SUS score as a percentage of the maximum possible score. The Wikipedia site received a significantly better SUS rating than the NASA site (64 vs. 40, p<.00001).
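For reference, the standard SUS scoring calculation can be sketched as follows. The assumption that odd-numbered items are positively worded and even-numbered items negatively worded follows Brooke’s original questionnaire ordering:

```python
def sus_score(ratings):
    """Standard SUS scoring. `ratings` is a list of ten responses on a
    1-5 scale, in questionnaire order: odd-numbered items are positively
    worded, even-numbered items negatively worded."""
    if len(ratings) != 10:
        raise ValueError("SUS requires exactly ten ratings")
    total = 0
    for i, r in enumerate(ratings):
        if i % 2 == 0:           # items 1, 3, 5, 7, 9: positive wording
            total += r - 1
        else:                    # items 2, 4, 6, 8, 10: negative wording
            total += 5 - r
    return total * 2.5           # rescale the 0-40 sum to 0-100

# Best possible answers: 5 on positive items, 1 on negative items.
print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))  # 100.0
```

Reversing the negatively worded items before summing is what lets a single number summarize all ten scales.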

What about Usability Issues?

The study yielded a rich set of verbatim comments. The NASA site received 132 individual comments from the various open-ended questions while the Wikipedia site received 135. Some of these comments were distinctly negative (e.g., for the NASA site: “Search on this site is next to useless”) while others were quite positive (e.g., for the Wikipedia site: “The outlines for the pages were helpful in locating a specific section of the site to find the desired information.”)

The performance data, subjective ratings, and verbatim comments can be used to help identify usability issues within the test site. Verbatim comments often provide clues indicating why tasks for a given site yield particularly low success rates, long task times, or poor task ratings.

Strengths and Weaknesses

The primary strength of an unmoderated remote usability study is the potential for collecting data from a large number of participants in a short period of time. Since they all participate “in parallel” on the web, the number of participants is mainly limited by your resourcefulness in recruiting. Larger numbers provide additional advantages:

  • Unlike traditional moderated usability testing, there’s no significant increase in costs or resources with each participant.
  • The larger number of participants allows you to test a better cross-section of representative users, especially when the user base is large and diverse.
  • Since users participate from their own locations using their own systems, you potentially have more diverse environments (e.g., screen resolutions, monitor sizes, browsers, etc.)
  • Because of the larger sample sizes, you can potentially detect differences in usability metrics (task success rates, times, ratings, etc.) that you normally can’t detect in moderated tests.

Unmoderated remote usability studies are especially good at enabling comparisons between alternative designs. We’ve performed these studies where we simultaneously compared up to ten different designs. In just a few days we were able to test these designs with a large number of users and quickly identify the most promising designs.

Unmoderated remote usability studies aren’t always appropriate; some limitations of the technique follow:

  • The prototypes or designs to be tested must support the tasks at some level. Users need to be able to reasonably decide whether they have completed the task.
  • The prototypes need to be reasonably stable. Since the users are working on their own without a moderator to intervene if things go wrong, you don’t want any major surprises.

You need to be able to develop tasks that have relatively well defined end-states. Tasks like “find explicit information about this” work well.

Early exploratory studies, where you want to have an ongoing dialog with the participants about what they’re doing, are obviously not well suited to an unmoderated remote approach.

Unmoderated remote usability tests will never completely replace traditional moderated usability tests. A moderated test, with direct observation and the potential for interaction with the participant if needed, provides a much richer set of qualitative data from each session. But an unmoderated remote test can provide a surprisingly powerful set of data from a large number of users that often compensates for the lack of direct observation and interaction.

Tullis, T. (2008). Automated Usability Testing: A Case Study. User Experience Magazine, 7(3).
Retrieved from http://uxpamagazine.org/automated_usability_testing/
