Level: Intermediate
Barbara Millet (bmillet@us.ibm.com), Human Factors intern, IBM Pervasive Computing
James Lewis (jimlewis@us.ibm.com), Senior Human Factors Engineer, IBM
02 Mar 2005
Interactive Voice Response (IVR) applications typically consist of statements and prompts, delivered over the telephone, that relay information to a user about the task at hand. Prompts, specifically, are turn-taking cues that should lead the user to speak and convey what may be spoken (Balentine, 1999). To ensure ease of use and overall user satisfaction, it is imperative that system dialogues not be ambiguous. This report describes the method and results of a usability evaluation of a prototype speech recognition IVR application.
The objective of the study presented here was to assess the clarity of a language selection prompt planned for use in a speech recognition IVR system. Language selection was the first turn-taking prompt in the application: "Select English o seleccione Espanol." This prompt requires the user to select a language in which to navigate through the system. The goal of this evaluation was to determine how well the prompt met this requirement.
The method
The participants
Ten IBM employees (7 men, 3 women) and two contractors (1 man, 1 woman) participated in this study. English was the native language of five of the twelve participants. The participants' ages ranged from 20 to 49 years. All participants had at least some college education and identified themselves as "very skilled" computer users with more than five years of computer experience. Six participants reported previous experience with speech recognition software, and all participants indicated that they had used speech recognition systems by telephone. The participants were given two tasks: select a language in which to proceed, and purchase a service contract. The second task served as a dummy task, as the primary purpose of the study was to assess the clarity of the language-selection prompt.
Materials and equipment
A prototype of the application was developed using the IBM® Voice Tool Kit for WebSphere® Studio Version 5.0. We started with the Call Flow Builder, followed by direct modification of the VoiceXML code. A voice talent was employed to record the audio files. The VoiceXML code and audio files were then placed on a voice server, which in turn allowed connectivity to the system by telephone. Each test session was video recorded to capture voice inputs as well as any nonverbal gestures produced by the user. Additionally, a phone tap was used to record all dialogue generated by the system.
The procedure
All participants were tested, individually, in the Human Factors lab in IBM's Boca Raton, Florida, facility in the summer of 2004, with each test session taking no more than 15 minutes. At the start of each test session, a background questionnaire (see Appendix A in the source file) was provided to all participants. Immediately following the questionnaire, the test user was presented with a task scenario and the test task (see Appendix B in the source file).
Based on the task scenario provided, participants understood the task to be the purchase of a service contract using the IVR system. Note that this test evaluated both the participant's ability to correctly Select a Language (Task 1) and to Purchase a Service Contract (Task 2). The second task (purchasing a service contract) served as a dummy task masking the actual task of interest because the user needed to select a language before reaching the main menu.
While the participant executed the tasks, the experimenter logged the participant's actions. If the participant completed the tasks by providing the correct inputs, resulting in the correct outputs, then he or she had completed the task successfully. For Task 1, a correct input was to say "English" or "Espanol," or to press 1 or 2. Similarly, for Task 2, a correct input for buying a service contract would be to say "Make a Purchase" or press 2 at the main menu and then say "Service Contract" or press 4. If the user progressed to the global introduction of the system (by selecting English) or if the user was transferred directly to a Spanish-speaking agent (by selecting Spanish), then the user produced the correct output for Task 1. To produce the correct output for Task 2, the user had to be transferred to a Service Contract Specialist. The navigation strategy for producing these outputs is depicted in Figure 1.
Figure 1. The navigation strategy required to produce the correct outputs for the test tasks
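The success criteria described above can be sketched as a small state table. This is only an illustration in Python, not the VoiceXML used in the prototype: the state names and the `navigate` helper are hypothetical, the pairing of key 1 with English and key 2 with Spanish is an assumption, and the global introduction that precedes the main menu is collapsed into the main-menu state.

```python
# Hypothetical state table for the two test tasks (names are illustrative).
# Assumption: key 1 selects English and key 2 selects Spanish.
CALL_FLOW = {
    "language_selection": {
        "english": "main_menu",      # global introduction, then main menu
        "1": "main_menu",
        "espanol": "spanish_agent",  # direct transfer to a Spanish-speaking agent
        "2": "spanish_agent",
    },
    "main_menu": {
        "make a purchase": "purchase_menu",
        "2": "purchase_menu",
    },
    "purchase_menu": {
        "service contract": "service_contract_specialist",
        "4": "service_contract_specialist",
    },
}


def navigate(inputs):
    """Return the state reached after a sequence of spoken or keyed inputs."""
    state = "language_selection"
    for raw in inputs:
        key = raw.strip().lower()
        next_state = CALL_FLOW.get(state, {}).get(key)
        if next_state is None:
            continue  # no-match: stay in the current state and wait for new input
        state = next_state
    return state


if __name__ == "__main__":
    print(navigate(["English", "Make a Purchase", "Service Contract"]))
    print(navigate(["Espanol"]))
```

Under these assumptions, Task 1 succeeds when the caller leaves `language_selection`, and Task 2 succeeds when the caller reaches `service_contract_specialist`.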
Upon task completion, the test users completed the After Scenario Questionnaire (ASQ). To inquire further about the user's experience interacting with the system, the evaluator asked each participant an additional five questions (see Appendix C in the source file).
Data analysis
The measures of usability in this study were:
- Time to complete the tasks (measured for Task 1 as the time from when the language selection prompt began until the participant provided a correct input resulting in the correct output; for Task 2, as the time from when the main menu prompt began until the correct output was produced).
- ASQ scores for Task 2.
- Successful Task Completion Rates, ranging from 0, for unsuccessful completion of a task, to 1, for a successful task completion.
Time and preference data were analyzed using Microsoft® Excel (ANOVA with α = 0.05 for significance, α = 0.10 for marginal significance). Success and error rates were analyzed with 95% binomial confidence intervals (Lewis, 1996).
The results
Data was collected for this study with 12 participants (N = 12). Of these participants, 6 were identified as Expert users, given that they had indicated previous experience using speech recognition software on a computer, and 6 were categorized as Novice users (as they had no previous experience with speech recognition software on a computer). The overall mean times to successfully complete Task 1 and Task 2 were 10.9 and 47.5 seconds with standard deviations of 2.8 and 16.3, respectively. (Data is provided in Appendix D in the source file.)
For Task 1, only one participant generated a Help prompt to complete the task. The participant was commenting, out loud, about his uncertainty in providing a valid input and received a support statement as a result of a no-match input. Nonetheless, all participants completed Task 1 successfully (0% error). A 95% binomial confidence interval for this error percentage ranged from 0.0 to 26.5%.
For Task 2, 5 of 12 participants initially requested a repeat of the menu options prior to selecting an option from the main menu. Therefore, it can be expected (with 95% confidence) that the rate of requesting a repeat at this prompt is at least 15.2% and at most 72.3%. Similarly, one of 12 participants generated a Help prompt from the main menu prior to providing the correct input for the Purchase a Service Contract task. This indicates an observed rate for selecting "Help" of 8.3%, with a 95% confidence interval from 0.2 to 38.5%.
In completing the second task, 8 of 12 participants reached the target final path (that is, that of being transferred to a Service Contract Specialist) by selecting Make a Purchase from the main menu. This yields an observed success rate of 66.7%, which with 95% confidence can be as low as 34.9% or as high as 90.1%. Three participants selected All Other Needs, an observed rate of 25.0%, with a 95% confidence interval from 5.5 to 57.2%. Because the agent reached by requesting All Other Needs would transfer a caller to the appropriate agent, this is also a successful, albeit less efficient, task completion. Using this relaxed criterion, 11 of 12 participants successfully completed Task 2 (with a 95% binomial confidence interval ranging from 61.5 to 99.8%). Unexpectedly, one participant provided "Rebates" as the input for the main menu prompt. Tables 1 through 5, presented below, provide the data (that is, the means, standard deviations, success rates, and 95% confidence intervals) collected for Tasks 1 and 2.
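The intervals reported above appear consistent with exact (Clopper-Pearson) binomial limits. The following Python sketch assumes that method; the function names are hypothetical, and the limits are found by simple bisection on the binomial tail probabilities rather than with a statistics library.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve(g, target, increasing):
    """Bisection on [0, 1] for g(p) = target, where g is monotone in p."""
    lo, hi = 0.0, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if (g(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_binomial_ci(x, n, conf=0.95):
    """Exact (Clopper-Pearson) interval for x events in n trials."""
    alpha = 1 - conf
    if x == 0:
        lower = 0.0
    else:
        # P(X >= x | p) rises with p; the lower limit puts it at alpha / 2.
        lower = _solve(lambda p: 1 - binom_cdf(x - 1, n, p), alpha / 2, True)
    if x == n:
        upper = 1.0
    else:
        # P(X <= x | p) falls with p; the upper limit puts it at alpha / 2.
        upper = _solve(lambda p: binom_cdf(x, n, p), alpha / 2, False)
    return lower, upper


if __name__ == "__main__":
    for events in (0, 1, 3, 5, 8, 11):
        low, high = exact_binomial_ci(events, 12)
        print(f"{events} of 12: {100 * low:.1f}% to {100 * high:.1f}%")
```

With only 12 participants, the exact interval stays wide even when no errors are observed, which is why a 0% observed error rate still carries an upper limit of roughly 26%.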
Table 1. Summary of data
| Task # | Task description | Mean completion time (secs) | Success rate | 95% binomial CI lower limit | 95% binomial CI upper limit |
|---|---|---|---|---|---|
| 1 | Language selection | 10.9 | 100% | 73.5% | 100% |
| 2 | Purchase a service contract (by selecting "Make a Purchase" from the main menu) | 47.5 | 66.7% | 34.9% | 90.1% |
| 2 | Purchase a service contract (by selecting "Make a Purchase" or "All Other Needs" from the main menu) | 44.1 | 91.7% | 61.5% | 99.8% |
Table 2. Task 1: Language selection
| Group | Time to successfully complete task (secs) | | | | | | Mean | Std dev | 95% CI |
|---|---|---|---|---|---|---|---|---|---|
| Expert | 9 | 10 | 11 | 9 | 9 | 9 | 9.5 | 0.8 | ±0.7 |
| Novice | 17 | 16 | 10 | 10 | 12 | 9 | 12.3 | 3.4 | ±2.7 |
| All | | | | | | | 10.9 | 2.8 | ±1.6 |
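The means, standard deviations, and interval half-widths for Task 1 can be recomputed from the individual times. The sketch below assumes a normal-approximation interval (z = 1.96 times the standard error), which is what Excel's CONFIDENCE function returns; that the authors used this function is an assumption, and `mean_with_ci` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, stdev

Z_95 = 1.96  # two-sided 95% critical value of the standard normal distribution


def mean_with_ci(times, z=Z_95):
    """Return (mean, sample standard deviation, CI half-width) for a list of times."""
    m = mean(times)
    sd = stdev(times)                   # sample standard deviation (n - 1)
    half = z * sd / sqrt(len(times))    # assumed normal-approximation half-width
    return m, sd, half


# Task 1 completion times in seconds, as listed in Table 2.
expert = [9, 10, 11, 9, 9, 9]
novice = [17, 16, 10, 10, 12, 9]

if __name__ == "__main__":
    for label, times in (("Expert", expert), ("Novice", novice), ("All", expert + novice)):
        m, sd, half = mean_with_ci(times)
        print(f"{label}: mean {m:.1f}, std dev {sd:.1f}, 95% CI +/- {half:.1f}")
```

Under this assumption the half-widths come out at about 0.7 seconds for the Expert group, 2.7 seconds for the Novice group, and 1.6 seconds overall. With groups of six, a t-based interval would be wider, so these values are best read as approximate.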
Table 3. Task 2: Purchase a service contract *
| Group | Time to successfully complete task (secs) | | | | | | Mean | Std dev | 95% CI |
|---|---|---|---|---|---|---|---|---|---|
| Expert | | 53 | 68 | | 44 | | 55.0 | 12.1 | ±9.7 |
| Novice | 34 | | 66 | 33 | 58 | 24 | 43.0 | 18.0 | ±14.4 |
| All | | | | | | | 47.5 | 16.3 | ±9.2 |
* Participants that selected "Make a Purchase" from the main menu.
Table 4. Task 2: Purchase a service contract**
| Group | Time to successfully complete task (secs) | | | | | | Mean | Std dev | 95% CI |
|---|---|---|---|---|---|---|---|---|---|
| Expert | 40 | 53 | 68 | 22 | 44 | | 45.6 | 16.5 | ±13.3 |
| Novice | 34 | 42 | 66 | 33 | 58 | 24 | 42.8 | 16.1 | ±12.9 |
| All | | | | | | | 44.1 | 15.6 | ±8.8 |
** Participants that selected "Make a Purchase" or "All other needs" from the main menu.
Table 5. ASQ scores for task scenario
| Group | Score | | | | | | Mean | Std dev | 95% CI |
|---|---|---|---|---|---|---|---|---|---|
| Expert | 2.0 | 1.3 | 1.3 | 2.0 | 2.3 | 2.0 | 1.8 | 0.4 | ±0.3 |
| Novice | 1.0 | 1.0 | 1.3 | 2.0 | 1.3 | 1.5 | 1.4 | 0.4 | ±0.3 |
| All | | | | | | | 1.6 | 0.4 | ±0.3 |
There were no statistically significant differences (with α = 0.05) between user groups on the time to successfully complete Task 1 (F(1,10) = 3.96, p = 0.075) or Task 2 (F(1,9) = 0.08, p = 0.786). Similarly, there was no statistically significant difference between groups in participant satisfaction (F(1,10) = 4.39, p = 0.063). However, marginal significance (with α = 0.10) was detected between user groups on the time to complete Task 1 and in participant satisfaction when using the system.
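The between-groups comparison for Task 1 can be recomputed from the times listed in Table 2. The following Python sketch is a plain one-way ANOVA and not the authors' Excel worksheet; `one_way_anova` is a hypothetical helper, and it returns only the F statistic and its degrees of freedom (the p value would additionally require the F distribution).

```python
from statistics import mean


def one_way_anova(*groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual values around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Task 1 completion times in seconds, as listed in Table 2.
expert = [9, 10, 11, 9, 9, 9]
novice = [17, 16, 10, 10, 12, 9]

if __name__ == "__main__":
    f_stat, df1, df2 = one_way_anova(expert, novice)
    print(f"F({df1},{df2}) = {f_stat:.2f}")  # prints F(1,10) = 3.96
```

The result agrees with the F(1,10) = 3.96 reported for Task 1. The ASQ scores in Table 5 are rounded to one decimal place, so recomputing that comparison from the table will not match the reported value exactly.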
Discussion and recommendations
The purpose of this investigation was to evaluate the clarity of a language selection prompt to be used in a speech recognition IVR system. In doing so, it was discovered that few usability problems occurred, and none were severe. Additionally, responses to the ASQ indicated that participants were highly satisfied with the system.
The key findings and recommendations from this study are as follows:
- For the Language Selection task, all participants were able to complete the task, but five participants displayed facial expressions suggesting confusion with the prompt wording. Additionally, one participant generated a no-match Help prompt in trying to complete this task. Recommendation: Change the first word of the prompt from "Select" to "Say."
- 11 of 12 participants successfully completed Task 2. Eight of these participants selected "Make a Purchase" from the main menu, while three participants chose "All Other Needs."
- Five of 12 participants requested a repeat of the main menu options prior to making a selection. This is probably due to the participants' unfamiliarity with the application, is likely to occur only with initial system use, and therefore has no system design implications.
- At the end of the experiment, most participants suggested that there were too many options in the main menu. However, had the correct option been more intuitive, participants might have barged in upon hearing it rather than listening to all the options.
- Based upon the statistical analyses, user skill level had a marginally significant effect on the time needed to successfully complete Task 1 and on participant satisfaction in completing the task scenario. Furthermore, there was no statistically significant difference between user groups in time to complete Task 2. On average, the Expert group was somewhat faster in completing Task 1 and slightly more critical of the system. Compared to the Expert group, the Novice users were at an initial disadvantage in using the system because they lacked experience with speech recognition software. As the Novice user progressed to Task 2 and gained more familiarity with the system, the difference between groups, in time to complete the task, was no longer detectable. Similarly, the observed marginal difference in user satisfaction (with Novice users indicating a greater satisfaction with the system) is a likely consequence of different user exposure to speech recognition systems.
- Although there were many bilingual participants, all participants selected English as the language in which to navigate within the system.
- Two participants provided variations of the required inputs in the completion of Task 2. Participant 8 provided "Purchase" instead of "Make a Purchase," while Participant 11 provided "Service" instead of "Service Contract."
Recommendation: Equip the application with flexible grammars to accommodate various inputs for each option. For example, include inputs that begin with "Purchase" to select the "Make a Purchase" option from the main menu, and include "Service" or "Contracts" for selecting the purchase of a service contract.
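The flexible-grammar recommendation can be illustrated with a small matcher. This is a sketch in Python rather than the VoiceXML grammar format used by the prototype; the option identifiers, the phrase lists, and the `match_option` helper are hypothetical, and the menus shown are partial.

```python
import re

# Hypothetical, partial phrase lists. Each option accepts its full name plus
# the shorter variants suggested in the recommendation above.
MAIN_MENU_GRAMMAR = {
    "make_a_purchase": ["make a purchase", "purchase"],
    "all_other_needs": ["all other needs"],
}
PURCHASE_MENU_GRAMMAR = {
    "service_contract": ["service contract", "service", "contracts"],
}


def match_option(utterance, grammar):
    """Return the option whose phrase equals or begins the utterance, else None."""
    text = " ".join(re.sub(r"[^a-z ]", " ", utterance.lower()).split())
    for option, phrases in grammar.items():
        for phrase in phrases:
            if text == phrase or text.startswith(phrase + " "):
                return option
    return None  # no-match: the application would play its help or retry prompt


if __name__ == "__main__":
    print(match_option("Purchase", MAIN_MENU_GRAMMAR))
    print(match_option("Service", PURCHASE_MENU_GRAMMAR))
```

With these lists, the variants observed in the study ("Purchase" and "Service") resolve to the intended options instead of producing a no-match. A production grammar would also need to be checked for variants that overlap between options.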
Resources
The following books are recommended reading on the topic of speech recognition technology:
- How to Build a Speech Recognition Application, (2nd edition) by Balentine, Bruce, and David P. Morgan (2001); EIG Press.
- IBM computer usability satisfaction questionnaires: Psychometric evaluation and instructions for use, by Lewis, J. R. (1995); International Journal of Human-Computer Interaction, 7, pp 57-78.
- Binomial confidence intervals for small sample usability studies, by Lewis, J. R. (1996), In G. Salvendy and A. Ozok (Eds.); Advances in Applied Ergonomics: Proceedings of the 1st International Conference on Applied Ergonomics -- ICAE '96, (pp 732-737). Istanbul, Turkey: USA Publishing.
- Psychometric evaluation of the PSSUQ using data from five years of usability studies, by Lewis, J. R. (2002); International Journal of Human-Computer Interaction, 14, pp 463-488.
The following tools and resources can also prove useful in developing speech applications:
- Go to the Wireless zone to find all the resources you need to learn more and work better. The WebSphere Voice zone also has many additional tools for developing voice and speech.
About the authors
Barbara Millet is a Ph.D. student in the Industrial Engineering program at the University of Miami. She is currently an IBM Human Factors intern working with the Pervasive Computing User Centered Design group. Her research interests include Usability Engineering, Cognitive Ergonomics, and Decision Making.
Dr. James R. Lewis is a Senior Human Factors Engineer working in IBM Pervasive Computing, with a current focus on all user-centered design aspects of IBM's speech products (including speech user interface design, speech recognition, and artificial speech production, runtimes, and tooling).