Making Voice Applications Easy

We spend a lot of time thinking about ease of use at Lucas, and not just when we are testing customer applications in the mock warehouse set up in the middle of our engineering offices. We think that using a voice application, like using a smartphone, should be natural, intuitive, and easy. But what exactly makes a great user interface – in a voice system or a smartphone?

Consider the ‘pinch to zoom’ feature that figured prominently in the Apple-Samsung patent case. For those not familiar with it, pinch to zoom is the two-finger gesture a user makes on a smartphone screen to zoom in or out on an image. Pinch to zoom is a fabulous feature because it’s a natural movement that is easy to use. And it makes sense – the gesture suggests the action you’re performing.

In a voice-directed warehouse application, the voice dialogue (the prompts the user hears and the responses the user speaks) is the user interface, and it is the key to making the system as easy as pinch to zoom.

To get the voice interface right, you need to think about things from the perspective of the people who will be using it.

The Container Store, a growing retailer that installed Lucas, took this advice to heart. As part of their voice project, they gave hourly employees the chance to hear different voice systems and weigh in on which system to choose. Like thousands of employees at other distribution centers (DCs), the Container Store employees preferred the human voice of Jennifer over a computer-synthesized voice (referred to as text-to-speech, or TTS). (You can hear more about The Container Store’s voice experience in a Webcast presented by DC Velocity Magazine.)

In our experience, employee preference for natural human voice pays off in higher user satisfaction and rapid system adoption. Another reason natural voice improves ease of use is less obvious, unless you’ve ever observed – and heard – an experienced voice user working at full speed.

As voice users gain experience with a voice application, they almost always want the prompts they hear to be delivered faster. Voice systems therefore typically offer several voice speeds, which users adjust with spoken commands: “Jennifer faster” and “Jennifer slower.” While TTS voices tend to become incomprehensible at faster speech rates, human voice can be compressed without losing clarity, allowing workers to go as fast as they want. In fact, even the slowest Jennifer voice prompts are delivered slightly faster than “normal” human speech, and for most users that isn’t fast enough. At top speed, Jennifer prompts are delivered about twice as fast as normal speech, so a phrase that would take one second to say is delivered in about half a second. Half a second per prompt doesn’t sound like much, but it matters greatly to the people using the system.
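To make the speed arithmetic concrete, here is a minimal Python sketch of how spoken speed commands and prompt durations might fit together. It is an illustration only, not how Jennifer is implemented: the function names and the number of speed levels are hypothetical, and the intermediate rate values are invented. Only the endpoints follow the description above (slowest slightly faster than normal speech, fastest about twice normal speed).

```python
# Hypothetical playback rates relative to normal speech (1.0).
# Only the endpoints reflect the text; the middle values are assumed.
SPEED_LEVELS = [1.1, 1.4, 1.7, 2.0]


def adjust_speed(level: int, command: str) -> int:
    """Return the new speed level index after a spoken command."""
    if command == "Jennifer faster":
        return min(level + 1, len(SPEED_LEVELS) - 1)  # stay at top speed
    if command == "Jennifer slower":
        return max(level - 1, 0)  # stay at the slowest speed
    return level  # any other phrase leaves the speed unchanged


def prompt_duration(normal_seconds: float, level: int) -> float:
    """Seconds needed to play a prompt that takes normal_seconds at rate 1.0."""
    return normal_seconds / SPEED_LEVELS[level]


level = 0
for _ in range(3):
    level = adjust_speed(level, "Jennifer faster")

# At the top level, a one-second phrase plays in half a second.
print(prompt_duration(1.0, level))
```

Across hundreds of prompts in a shift, a saving of that size on each prompt adds up, which is why experienced users tend to ask for the faster speeds.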

Just as accurate, effective speech recognition ensures ease of use and user satisfaction with a voice system (see our earlier post on this subject), the voice of the system is foundational to a positive user experience.

The other part of the equation – focus on the voice dialogue

With voice applications, getting the voice right is only part of the “make it easy” challenge. The other part of the equation is to focus on the voice dialogue you create for users – the conversation the user has with the system.

When defining your voice dialogue (the system prompts and user responses), you need to think in terms of process flow, but you can’t simply take a process and systems perspective. It’s not just a matter of translating the text that appears on an RF screen to voice prompts, for example. You have to think about what information your users need, when they need it, and how they will interpret the information when they are standing at the pick face. You also need to think about what other information pickers may need on an occasional basis, and how they will ask for that information, get help, or manage exceptions. In short, making the dialogue easy is harder than it sounds. It’s not rocket science, but it’s not a no-brainer, either.

In some DCs, making a better voice dialogue may require using some DC-specific terminology, slang or shorthand rather than more general warehousing terms that can be used differently in different facilities. In many DCs you also need to consider pack factors and special instructions to avoid user confusion and prevent chronic over- or under-picks.

For example, while a case is almost always a case, not every “each” is the same: some are boxes, others are tubes, bottles, etc. And some inner-packs may look like cases or eaches. In those situations, it’s important that the voice system give appropriate unit-of-measure prompts. Pick quantities may need to be very specific: “pick one box of three.” In other cases, you may want the voice system to provide a warning message: “careful – this is a two-part item.”
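As a rough illustration of the prompts described above, the following Python sketch builds a pick prompt from a quantity, a unit of measure, a pack factor, and an optional special instruction. Everything in it is an assumption made for the example: the `PickLine` fields, the `build_prompts` function, and the exact wording rules are hypothetical, not the way any particular voice system stores or generates its dialogue.

```python
from dataclasses import dataclass
from typing import List, Optional

# Small lookup so quantities are spoken as words; larger numbers fall back to digits.
NUMBER_WORDS = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five",
                6: "six", 7: "seven", 8: "eight", 9: "nine", 10: "ten"}


def spoken_number(n: int) -> str:
    return NUMBER_WORDS.get(n, str(n))


@dataclass
class PickLine:
    """Hypothetical pick instruction for one location."""
    quantity: int
    unit: str                      # e.g. "case", "box", "tube"
    unit_plural: str               # e.g. "cases", "boxes", "tubes"
    pack_factor: int = 1           # items inside each unit; 1 means nothing to announce
    warning: Optional[str] = None  # special instruction, spoken before the pick


def build_prompts(line: PickLine) -> List[str]:
    """Return the prompts to speak, in order, for one pick line."""
    prompts = []
    if line.warning:
        prompts.append(f"careful, {line.warning}")
    unit = line.unit if line.quantity == 1 else line.unit_plural
    prompt = f"pick {spoken_number(line.quantity)} {unit}"
    if line.pack_factor > 1:
        prompt += f" of {spoken_number(line.pack_factor)}"
    prompts.append(prompt)
    return prompts


print(build_prompts(PickLine(1, "box", "boxes", pack_factor=3)))
print(build_prompts(PickLine(2, "tube", "tubes", warning="this is a two-part item")))
```

The point of keeping the unit, pack factor, and warning as separate fields is that each one can be reviewed with the people on the floor, using the terms they actually use, before any prompt is recorded.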

Pack factors, units of measure, and special instructions are just a few of the variables that need to be considered to make sure your voice dialogue will be easy and intuitive for the people using the system. The point is that you need to think about these details up front and not treat the voice dialogue – or the voice itself – as a given or an afterthought. Try to incorporate “natural” prompts and responses, and consider what your users are going to hear and how they will interpret that when they are standing at the pick face. Better yet, let your users test out the voice and give you feedback on the dialogue up front so you can get it right the first time.