IBM Speech Sandbox
Using a speech interface in Virtual Reality
Why a Sandbox?
Designing for VR
You may ask: when does Virtual Reality make sense? Unlike Augmented Reality, which overlays digital content on the real world, Virtual Reality places users in an enveloping, immersive experience. Because of this, Virtual Reality works best when your goal is to transport your user into a completely different environment. Use Augmented Reality when it's important for your user to stay tied to the context of reality.
Designing experiences for virtual reality means not only creating what the user sees and hears, but also crafting rules for how they interact with the world around them. In IBM Speech Sandbox, there are many different interaction points to be considered. We had to account for how our users would move through the world, create objects, pick them up, and manipulate them. Each interaction point has to feel natural and intuitive, so users are not stuck recalling button functions instead of engaging with the Sandbox.
For a large part of the design process, we specifically focused on the voice and speech interactions to ensure that we were showcasing Watson services.
Because the world is immersive and users are fully absorbed in the experience, even slightly unintuitive behaviors can be extremely jarring in unpredictable ways. Therefore, it’s important to test often with real users as you are building your app.
People are affected by VR very differently. Some have used VR before and immediately understand the controls, others might not even know to look around the environment, and many are susceptible to motion sickness. It is important to test with people who have varying degrees of familiarity with VR systems because they will all react differently.
Our process for user testing involved creating different hypotheses, then implementing simple builds of the app that demonstrated each hypothesis. Creating objects using voice was one of the most difficult things to get right; we had to ensure that every user would be able to successfully create objects with their voice and that objects would materialize in the world where our users expected them to appear. We created three different interaction models and tested each one, which is how we arrived at the laser-pointer system in the current version of the game.
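The core of a laser-pointer placement scheme is a ray-plane intersection: the object appears where the controller's pointer ray meets the ground. The sketch below is our own illustration of that idea (the function name and the flat-ground assumption are ours, not taken from the Sandbox code, which runs in Unity):

```python
def spawn_point(origin, direction, ground_y=0.0):
    """Intersect the controller's pointer ray with a flat ground plane.

    origin: (x, y, z) position of the controller
    direction: (dx, dy, dz) direction the controller is pointing
    Returns the (x, y, z) point where the created object should appear,
    or None if the ray never reaches the ground.
    """
    ox, oy, oz = origin
    dx, dy, dz = direction
    if dy >= 0:  # pointing level or upward: the ray never hits the ground
        return None
    t = (ground_y - oy) / dy  # ray parameter at the ground plane
    return (ox + t * dx, ground_y, oz + t * dz)
```

For example, pointing forward and slightly down from a controller held 1.5 m up lands the reticle (and the spawned object) a couple of meters ahead of the user, which matches the "objects appear where users expect" behavior described above.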
Why a Speech Interface in VR?
Speech is a natural way of interacting with a virtual world, and it provides far more flexibility than limited controller and UI options. Controllers can be cumbersome: they pull the user out of the immersive experience, and because users cannot see their own hands, pressing the right buttons is difficult. When a user must select from a long list of options, as in a Sandbox where a large number of objects can be created, speech is a great way to bypass long scrolling menus and search directly for the intended option.
Users prefer to direct their speech
Try talking to an empty room — feels kind of silly, right? It’s the same in VR. When using a speech interface, users are essentially ‘conversing’ with technology. Users feel more comfortable when they have someone to talk to, or even just a simple indicator of the direction that they should be speaking. In our tests, when users were asked to speak to an ‘empty room’ in the virtual environment, they would look around as if trying to find somewhere to direct their speech and overall they would use the speech commands less frequently. At first, we had issues getting people to engage with the speech interface. However, as soon as we added the laser pointer with the reticle to our game, users were able to intuitively direct their speech towards the reticle and thus became much more comfortable speaking.
Ambient Listening, not Push to Talk
To simplify the speech interface, we first designed for the user to press a button on the controller when they wanted to speak a command. However, this 'push to talk' functionality had issues. Users would often forget to press the button at all, and once they got used to 'push to talk', they would almost universally hold the controller up to their mouth as if it were a microphone, even the expert users who knew the microphone was actually on the headset. This behavior suggests that if a user needs to press a button to speak, the natural inclination is to speak into that button. Because users in our game can create objects at any place and time, we eliminated button presses and switched to ambient listening: users can interact with the speech interface whenever they choose and manipulate objects with both controllers while they speak.
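The difference between the two designs boils down to a gating decision on each recognized transcript. This minimal sketch is illustrative only; the mode names and API are our own, not the Sandbox's:

```python
class SpeechGate:
    """Decide whether a recognized transcript should be handled.

    In 'push_to_talk' mode, transcripts count only while the button is
    held; in 'ambient' mode, every transcript is handled, so users never
    have to remember (or speak into) a button.
    """

    def __init__(self, mode="ambient"):
        self.mode = mode
        self.button_held = False

    def should_handle(self, transcript):
        if not transcript:
            return False
        if self.mode == "push_to_talk":
            return self.button_held
        return True  # ambient: always listening
```

With ambient listening, the gate passes everything through, which is what lets users keep both hands on the controllers while speaking.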
Set Expectations for Commands
With voice, the space of possible interactions is effectively infinite, and you may be surprised how far off script users can go when testing the versatility of conversational interaction. To guide users away from attempting unsupported commands (and to ease our developers' lives when training Watson), we introduced a tutorial that lets the user know which aspects of the world can be affected by voice and which commands to use. Once users understand the commands, they are much less likely to try to interact with the game using unsupported phrases.
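The idea of a small, explicitly supported command set can be sketched as a tiny interpreter that either matches a known command or steers the user back toward one. The Sandbox's real understanding lives in Watson Conversation; the patterns and hint below are hypothetical and only illustrate the principle:

```python
import re

# Hypothetical command grammar: a deliberately small set of supported
# phrases, mirroring what a tutorial would teach the user.
COMMANDS = [
    (re.compile(r"^create (a |an )?(?P<obj>\w+)$"), "create"),
    (re.compile(r"^delete (the )?(?P<obj>\w+)$"), "delete"),
    (re.compile(r"^make it (?P<size>bigger|smaller)$"), "resize"),
]


def interpret(utterance):
    """Return (intent, slots) for a supported command, or a hint otherwise."""
    text = utterance.strip().lower()
    for pattern, intent in COMMANDS:
        m = pattern.match(text)
        if m:
            return intent, m.groupdict()
    # Unsupported phrase: point the user back to known commands,
    # playing the same role as the in-game tutorial.
    return "unsupported", {"hint": "try 'create a <object>'"}
```

A user who has seen the tutorial says "create a tree" and gets the expected result; a user who goes off script gets a nudge toward the supported phrasing instead of silence.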
Developing for VR
To implement the speech interface in our HTC Vive experience, we used the Watson Unity SDK to add the Conversation and Speech-to-Text services. Our lead developer wrote an in-depth blog post, linked below, detailing how to implement the interface.