A Moderated Debate: Comparing Lab and Remote Testing

Remote unmoderated testing provides statistically significant results, which executives often say they require to make informed decisions about their websites. However, usability practitioners may question whether the large numbers, although seductive, really tell them what’s wrong and what’s right with a site.

With a remote unmoderated test, you can judge whether an observation is important by how many people in a group of 200 to 1,000 participants hit the problem. Of course, you can’t observe participants’ difficulties and successes firsthand. That lack of visibility might seem to be a serious disadvantage, but being unable to see participants does not prevent you from recognizing success or failure, and online testing tools anticipate and prevent most of the problems you might expect. Most importantly, you don’t have to leave the lab (or your own experience) behind.

Practical Differences

Online and in-lab testing methods are different in what you can test, in how you find and qualify participants, and in terms of rewards.

What you can test:

Moderated – A prototype at any stage of development.
Unmoderated – Only web-based software or websites.

Finding participants:

Moderated – Use a marketing or recruiting company or a corporate mail or email list.
Unmoderated – Send invitations to a corporate email list, intercept people online, or use pre-qualified participants from online panels like Greenfield Online, Survey Sampling International, and e-Rewards Opinion Panel, or from the remote testing firms themselves.

Qualifying participants:

Moderated – Ask them; have them fill in questionnaires at the start of the session.
Unmoderated – Ask them qualifying questions in a screener section and knock out anyone who doesn’t fit (age, geography, ownership of a particular type of car, etc.). Don’t let them retry: set a cookie.
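The unmoderated screener logic can be sketched in a few lines. This is only an illustration, not any vendor's implementation: the criteria, the field names (`age`, `region`, `owns_hybrid`), and the in-memory `seen_tokens` set that stands in for the browser cookie are all assumptions made for the example.

```python
# Hypothetical screener: criteria and field names are illustrative only.
CRITERIA = {
    "min_age": 25,
    "max_age": 54,
    "regions": {"US-West", "US-East"},
    "must_own_hybrid": True,
}

# Stands in for the cookie: tokens of browsers that have already tried the screener.
seen_tokens = set()


def screen(token, answers):
    """Return 'accepted', 'rejected', or 'already_attempted' for one visitor."""
    if token in seen_tokens:
        # No retries, whatever the new answers say.
        return "already_attempted"
    seen_tokens.add(token)  # "set the cookie" on the first attempt
    qualifies = (
        CRITERIA["min_age"] <= answers["age"] <= CRITERIA["max_age"]
        and answers["region"] in CRITERIA["regions"]
        and answers["owns_hybrid"] == CRITERIA["must_own_hybrid"]
    )
    return "accepted" if qualifies else "rejected"
```

Recording the attempt before checking the answers is what blocks a knocked-out visitor from coming back with different answers.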

Rewards:

Moderated – $50 cash, gifts. At $50 each, cash rewards for ten to twenty participants cost $500 to $1,000.
Unmoderated – $10 online gift certificates, coupons or credits, or raffles. At $10 each, rewards for 200 to 1,000 participants cost $2,000 to $10,000.
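The reward totals above are simple multiplication, shown here as a small sketch so you can plug in your own figures. The function name `reward_budget` is made up for the example; the dollar amounts and sample sizes are the ones quoted above.

```python
def reward_budget(reward_per_person, min_participants, max_participants):
    """Return the (low, high) total reward cost for a range of sample sizes."""
    return (
        reward_per_person * min_participants,
        reward_per_person * max_participants,
    )


moderated = reward_budget(50, 10, 20)       # (500, 1000)
unmoderated = reward_budget(10, 200, 1000)  # (2000, 10000)
```

Even at one-fifth of the reward per person, the unmoderated total is higher because the sample is so much larger.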

What Isn’t Much Different

Although remote unmoderated testing requires very little human interaction during the tests, it does require a high level of expertise both for the setup and for the analysis at the end. The remote testing software provides data crunching and statistical analysis, but neither scripting nor report writing is automated, nor is it possible to automate them. Information is never the same as understanding.

Scripting

The test scripts are not significantly different in content, only in delivery. Every moderated test has a script—the questions the client wants answered—even if the participant doesn’t know what it is.

In an unmoderated test, on the other hand, the analyst has to share the questions with the participants. The trick is to make sure that these questions don’t lead participants to the “right” answer or tip them off about the purpose of the test.

Also, the online test script has to be more detailed and the follow-up questions have to be thought out in advance. For example, if you think participants might say they hate a particular option, the script needs to be able to branch to a question asking why they hate it. You also need to include open-ended questions so that participants can comment on items about which you hadn’t thought to ask.
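One way to picture a branching script is as a small table of questions in which each answer names the next question. The sketch below is an assumption-laden illustration, not the format used by any remote testing product: the question ids, wording, and answer options are invented.

```python
# Hypothetical script: question ids, wording, and answer options are illustrative.
SCRIPT = {
    "q1": {
        "text": "How do you feel about the comparison option?",
        "choices": {"love it": "q3", "neutral": "q3", "hate it": "q2"},
    },
    "q2": {"text": "Why do you hate it?", "open_ended": True, "next": "q3"},
    "q3": {
        "text": "Is there anything else you would like to tell us?",
        "open_ended": True,
        "next": None,
    },
}


def walk(script, answers, start="q1"):
    """Return the ordered list of question ids one participant sees."""
    path, current = [], start
    while current is not None:
        path.append(current)
        question = script[current]
        if question.get("open_ended"):
            # Open-ended questions always continue to a fixed next question.
            current = question["next"]
        else:
            # Closed questions branch on the participant's choice.
            current = question["choices"][answers[current]]
    return path
```

A participant who answers "hate it" sees q1, q2, and q3; everyone else skips the follow-up and goes straight to the closing open-ended question.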

Relative Costs

The cost of a remote unmoderated test is comparable to the cost of lab testing.

If you have to, you can do a lab test very cheaply (say, for a small non-profit website) by borrowing a conference room, asking some friends to show up, and doing all the recording by hand. You can’t do a casual remote unmoderated test, however, because there are too many people and too much infrastructure involved.

But when you test high-priced, high-value websites, the costs for in-lab and remote testing are similar. As Liz Webb, co-founder and partner of eVOC Insights, points out, “If you need to run a lab test in an expensive area like Los Angeles, that’s $20,000 to $30,000,” whereas “a remote unmoderated test would start around $30,000 for 200 participants.”

Also keep in mind that the three best-known firms, Keynote, RelevantView, and UserZoom, offer different services at different price points. For example, a one-time test in which you write your own script and analyze your own results costs $8,000 to $13,000, depending on the firm. (You’ll probably need some training or handholding the first time, which might add to the cost.)
For a study in which the firm writes both the script and the report, the cost is likely to be between $20,000 and $75,000. All three firms offer yearly licenses at $70,000 to $100,000 depending on the level of service—this is a good option if you do more than five or six remote unmoderated tests a year. The costs for participants and rewards are additional.
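To see where a yearly license starts to pay off, you can compare it with the one-time prices quoted above. This is a rough sketch under a stated assumption: that the license simply replaces the per-test fee, which the firms' actual terms may not match. The function name `break_even_tests` is invented for the example.

```python
import math


def break_even_tests(license_cost, cost_per_test):
    """Smallest number of one-time tests per year whose total cost reaches the license cost."""
    return math.ceil(license_cost / cost_per_test)


low_license_high_test = break_even_tests(70_000, 13_000)
high_license_high_test = break_even_tests(100_000, 13_000)
low_license_low_test = break_even_tests(70_000, 8_000)
```

With the cheapest license ($70,000) and the most expensive self-service test ($13,000), the license costs the same as between five and six tests, so it pays for itself on the sixth. With a pricier license or cheaper one-time tests, the break-even point moves to eight or nine tests a year.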

Time Required

Liz’s timeframe for unmoderated tests is four to six weeks: “From kickoff with the client, one week to write the survey, one week for review by the client, one week in the field,” if all goes well, “and two weeks for analysis.” For a lab study, she estimates a week for the screener—it has to be more detailed than the unmoderated one so that the recruiting firm can find the right people; two weeks for recruiting and writing the discussion guide; one or two days in the lab; and two weeks for analysis. She also points out that if you’re doing moderated tests in multiple locations, you have to add at least one travel day between sites.

Comparing the Results

The real differences between moderated and unmoderated tests appear when you look at the results. Deciding whether unmoderated testing should be part of your toolkit depends on the answers to these questions:

Are large numbers of participants important?
Will online participants’ comments be as helpful as those captured in the lab?
What is the quality of the data?
Will there be too much data to analyze?
What kinds of information are likely to be missing in an unmoderated test?

Are More Numbers Better Numbers?

Asking what the difference is between samples of ten (moderated) and a hundred (unmoderated) participants is really the same question as “How many users are enough?” You have to look at your goals. Is the goal to assess quality? For benchmarking and comparisons, high numbers are good. Or is your goal to address problems and reduce risk before the product is released? To improve the product, small, ongoing tests are better.

Ania Rodriguez, who used to do unmoderated tests at IBM and is now a director at Keynote Systems, ran small in-lab studies to decide what to address in her remote studies. She said, “The smaller number is good to pick up the main issues, but you need the larger sample to really validate whether the smaller sample is representative. I’ve noticed the numbers swinging around as we picked up more participants, at the level between 50 and 100 participants.” The numbers finally settled down at 100 participants, she said.

Michael Morgan, eBay user experience research manager, also uses both moderated and unmoderated tests. “In general, quantitative shows you where issues are happening. For why, you need qualitative.” But, he adds, “to convince the executive staff, you need quantitative data.” A new eBay product clearly required an unmoderated test, Michael said. “We needed the quantitative scale to see how people were interacting with eBay Express. [Faceted search] was a new interaction paradigm. We needed click-through information: how deep did people go, how many facets did people use?” Because the remote testing systems collect clickstream data automatically, he could generate heat maps that showed his audience exactly where and how deeply people went into the site.

Liz said that a small sample is good for the low-hanging fruit and for obvious user-interface issues. However, “to feel that the statement ‘[most] of the users are likely to return’ is reliable, you need at least 100 responses.” For 100 participants, you’d need results showing a difference of twenty points between the people who said they’d return and those who said they wouldn’t before you could trust the answer. However, “with 200 participants, about 12 percentage points is a statistically significant difference.”
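Thresholds of this size are roughly what a standard normal approximation produces. The sketch below is not Liz’s method, which the article does not describe. It assumes a simple yes/no question, a two-sided test against a 50/50 split, and a hypothetical function name, `min_significant_gap`.

```python
import math
from statistics import NormalDist


def min_significant_gap(n, confidence=0.95):
    """Smallest gap, in percentage points, between the 'yes' and 'no' shares
    that differs significantly from an even split among n respondents.

    Uses the normal approximation with the worst-case standard error at p = 0.5.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # The gap (yes minus no) equals 2p - 1, so its threshold is z / sqrt(n).
    return 100 * z / math.sqrt(n)


gap_100 = min_significant_gap(100)           # about 19.6 points
gap_200 = min_significant_gap(200)           # about 13.9 points
gap_200_90 = min_significant_gap(200, 0.90)  # about 11.6 points
```

Under these assumptions, 100 participants need a gap of about 20 points at 95 percent confidence, which matches the figure above. For 200 participants the same calculation gives about 14 points; a threshold of about 12 points corresponds more closely to 90 percent confidence.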

Figure 1. Comments are backed up by clickstream analysis. (Image: webpage with text boxes listing statistics on how often certain headings have been clicked.)

Will Typed Comments Be Good Enough?

It might not seem that charts and typed comments—the typical output of an unmoderated remote test—would be as convincing as audio or video. However, they have their place and can be quite effective.

Ania said that observing during a lab session is always better than audio, video, or typed comments. “While the test is happening, the CEOs can ask questions. They’re more engaged.” That being said, she affirms that “you can create a powerful stop-action video using Camtasia and the clickstreams” from the remote tests.

According to Michael, “The typed comments are very useful, top of mind. However, they’re not as engaging as video.” So in his reports he recommends combining qualitative Morae clips with the quantitative UserZoom data. “We also had click mapping (heat maps and first clicks),” and that was very useful. “On the first task, looking for laptops, we found that people were taking two different routes,” not just the route the company expected them to take.

Liz added, “Seeing someone struggle live does make a point. But sometimes in the lab, one person will struggle and the next person will be very happy. So which observation is more important?” The shock value of the remote tests, she says, is showing the client that, for example, a competitor has 80 percent satisfaction and they only have 30 percent. “That can be very impactful,” especially if it indicates what the client can do to make the site more useful or usable. For example, on a pharmaceutical site, if the competitor offers a list of questions to take to the doctor, the client might want to do the same. “We want developers to see people struggling,” she adds, “but executives want to see numbers.”

What’s the Quality of the Results?

The quality of the results always starts with the participants: are they interested? Engaged? Honest?

Liz pointed out that “in the lab, participants may be trying to please you. You have to be careful in the questions you ask: not ‘Do you like this?’ but rather ‘What is your impression of this?’ And it’s the same problem online. Don’t ask questions so leading that the participants know what you’re looking for.”

In the lab, you can generally figure out right away that a participant shouldn’t be there and politely dismiss him or her. Online, however, you can’t tell immediately, and there are definitely participants who are in the study just for the reward.
The remote testing companies have methods for catching the freeloaders, the most important of which is watching the studies for anomalies. Here are some of the problems the back-office teams catch:

A sudden spike in new participants. This is usually due to messages posted on “free stuff” boards. Unqualified or uninvited people see the posting and then jump on the test site to start taking the test. When the testing software sees these spikes, it shuts down the study until the research associates, who keep track of the tests while they are running, can check the boards and figure out what has happened.
Participants who zip through the studies just to get the reward. The software can be set up to automatically reject participants’ responses if they spend less than a certain amount of time or if they type fewer than a certain number of characters per task.
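Both checks are easy to picture in code. The sketch below is a simplified illustration, not how Keynote, RelevantView, or UserZoom implement them: the thresholds, the record fields (`seconds`, `comment`), and the rule of comparing the latest hour with the average of earlier hours are all assumptions.

```python
from statistics import mean

MIN_SECONDS_PER_TASK = 20  # assumed threshold
MIN_CHARS_PER_TASK = 15    # assumed threshold
SPIKE_FACTOR = 3           # assumed: latest hour vs. average of earlier hours


def is_speeder(task_records):
    """True if any task was finished too quickly or answered too briefly."""
    return any(
        record["seconds"] < MIN_SECONDS_PER_TASK
        or len(record["comment"].strip()) < MIN_CHARS_PER_TASK
        for record in task_records
    )


def is_spike(hourly_signups):
    """True if the latest hour's sign-ups far exceed the earlier average."""
    if len(hourly_signups) < 2:
        return False  # nothing to compare against yet
    *history, latest = hourly_signups
    return latest > SPIKE_FACTOR * mean(history)
```

For example, `is_spike([4, 6, 5, 40])` flags the jump to 40 sign-ups, at which point a study could be paused until a research associate has checked where the traffic is coming from.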

Another data-quality pitfall is null results: how do you find out if a participant is stuck if you can’t see them getting stuck? One way is to ask people to stop if they’re spending more time on the task than they would in real life (Figures 2 and 3).

When the participant clicks the give-up button, the script then asks them why they quit. Not every participant explains, but you generally get enough answers to tell the difference between a technical glitch and a real problem with the site.

Figure 2. Participants are told to give up if the task is taking too long. (Image: instructions for a usability test.)

Figure 3. On a Keynote Systems task-question screen, the give-up button is at the upper right corner. (Image: website with a superimposed red arrow pointing to the quit button in the usability test toolbar at the top of the screen.)

Drowning in Data?

It’s hard enough to compress dozens of pages of notes and ten to twenty videos into a fifty-page report. Working with thousands of data points and two to four thousand comments generated in an online study would be impossible if it weren’t for some tools built into the commercial products. All of the remote testing products automatically create graphs from the Likert scales, multiple-choice, and single-choice questions; do correlations between questions; and sort text comments into general categories.

Ania said, “I’ve never found that there was too much data. I might not put everything in the report, but I can drill in two or three months later if the client or CEO asks for more information about something.” With more data, “I can also do better segments,” she said. “For example, check a subset like ‘all women fifty and older versus all men fifty and older.’”

“You have to figure out upfront how much you want to know,” Michael said. “Make sure you get all the data you need for your stakeholders. You won’t necessarily present all the data to all the audiences. Not all audiences get the same presentation.” The details go into an appendix. “You also don’t want to exhaust the users by asking for too much information.” The rule of thumb is that a study should take no more than thirty minutes, about three tasks.

Liz took a broader view of quantity: “We now have stable measures from many tests. To create a benchmark, we ask certain questions (for example, ‘How likely are you to return to this site?’) only once, and in only one way per study, across multiple studies.” With this information, she has a rough idea about how good a client’s website is in comparison with others in the same industry or across industries.

The Final Decision

With unmoderated tests, you get more convincing numbers if you need them. However, preparation has to be more extensive since you can’t change the site or the questions midstream without invalidating all the expensive work you’ve done already.

Unmoderated tests provide you with more data, but they don’t automatically provide you with insight. Since they don’t create or analyze themselves, you can’t remove usability expertise from the equation. The remote testing firms will support you—even if you write the script yourself, they’ll review it for errors—but if you don’t know what you’re doing as a usability practitioner, the results will show that.

On the other hand, if you’re good at spotting where people are likely to have problems, at developing hypotheses, and at recognizing patterns, the abundance of data can be just what your stakeholders want to see.

User Experience