Some of us take keyboard and mouse input for granted: You type a character on the keyboard, and it appears in your window, or you type a string of characters, press Return, and some action happens—either locally or at a networked distance. What else is there to expect? But what if you don't have or cannot use a keyboard or mouse, or you want to put a keystroke into one window and have it do something in a different window on a different desktop? Or, perhaps, you want to create a window, resize it, pull up a browser in that window, navigate to a URL, then tab through a number of links in the web page and click one—all without a keyboard or mouse, using your voice through a speech recognizer. This approach calls for keyboard and mouse emulation.
Access to the system calls that perform all of these tasks is perfectly possible but perhaps
inconvenient because it involves a lower level of coding than most developers
typically use. Just as a speech recognizer makes access to recorded speech patterns
easy, other libraries give access to the windowing schemes of Linux desktops.
Enter xdotool.
xdotool (see Resources for a link)
is a command-line tool, built on the libxdo library, that helps you send instructions to the windowing
environment. It can send a string to be typed into a remote window, resize the
window, send individual keystrokes and keystroke combinations, or even perform
a double-click of the right mouse button. And the remote application is unaware
that it is being manipulated, or voxipulated, indirectly. The website has
specific instructions for installation using make.
Sending xdotool commands is straightforward: The syntax
is a simple xdotool followed by options and arguments.
xdotool search --name 'mytest'
In this example, xdotool searches for windows with a
title of mytest and returns their ID numbers, which you can record for
later use. Beware of searching for windows without some kind of criterion—you
will end up with a long list of meaningless numbers.
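As a minimal sketch of recording those ID numbers from PHP, the following helpers build the search command and filter its output. The names build_search_command() and parse_window_ids() are hypothetical, not part of xdotool, and the sketch assumes that the search prints one decimal window ID per line.

```php
<?php
// Hypothetical helpers (not part of xdotool) for recording search results.

// Build the search command, quoting the title so the shell sees one argument.
function build_search_command($title) {
    return 'xdotool search --name ' . escapeshellarg($title);
}

// Keep only the lines that look like decimal window IDs.
function parse_window_ids(array $lines) {
    $ids = array();
    foreach ($lines as $line) {
        $line = trim($line);
        if ($line !== '' && ctype_digit($line)) {
            $ids[] = $line;
        }
    }
    return $ids;
}

// Usage, assuming xdotool is installed and an X session is running:
// exec(build_search_command('mytest'), $out);
// $ids = parse_window_ids($out);
```

Filtering the output this way means that a warning line mixed into the results is not mistaken for a window ID.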
You can get hints on the names to use for keys and keystroke combinations from the website. One little trap is that although tab seems intuitive, only Tab works on my system.
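One way to guard against that capitalization trap is a small lookup table, sketched below. The functions key_name() and key_combo() are hypothetical, the table holds only a few of the key names used in this article, and the sketch assumes the plus sign that xdotool uses to join the parts of a keystroke combination.

```php
<?php
// Hypothetical lookup from lowercase words to key names as xdotool spells them.
// Extend the table with names taken from the xdotool website.
function key_name($word) {
    $names = array(
        'tab' => 'Tab', 'return' => 'Return', 'backspace' => 'BackSpace',
        'up' => 'Up', 'down' => 'Down', 'left' => 'Left', 'right' => 'Right',
    );
    $lower = strtolower($word);
    return isset($names[$lower]) ? $names[$lower] : $word;
}

// Join modifiers and a key with '+' to form a keystroke combination.
function key_combo(array $parts) {
    return implode('+', array_map('key_name', $parts));
}

// Usage: exec('xdotool key ' . escapeshellarg(key_combo(array('ctrl', 'tab'))));
```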
Although it is possible to issue xdotool commands one at
a time from a terminal window, some internal variables are lost each time the
instance of xdotool ends, so chaining a number of
options and arguments after a single invocation of xdotool
is more efficient.
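A chained invocation can be assembled from PHP as in the sketch below. The helper chain_xdotool() is hypothetical; the chain it builds here (getactivewindow followed by windowsize) is the one used later in Listing 2, and other combinations should be checked against the xdotool documentation before you rely on them.

```php
<?php
// Hypothetical helper: join several xdotool steps into a single invocation.
// Each step is an array of words, for example array('windowsize', '--sync', 750, 750).
function chain_xdotool(array $steps) {
    $words = array('xdotool');
    foreach ($steps as $step) {
        foreach ($step as $word) {
            // quote every word so the shell passes it through unchanged
            $words[] = escapeshellarg((string) $word);
        }
    }
    return implode(' ', $words);
}

// One invocation: take the active window, then resize it.
$cmd = chain_xdotool(array(
    array('getactivewindow'),
    array('windowsize', '--sync', 750, 750),
));
echo $cmd, "\n";
```

Because both steps run inside one xdotool process, the window found by the first step is still known when the second step runs.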
Probably the best way to get a feel for this is to see a working example.
Listing 1 shows a test that uses one terminal window (call it
work) to open another terminal (mytest) and send keystrokes to
it. The commands are scripted with PHP only to provide a way of slowing down the
process so you can see what happens: You can use Perl or Python or whatever
your favourite scripting engine might be. The effect of the script is to define a
number of keystrokes, have the working terminal say "Trying A" (where
A is a key), and send that key to the second terminal. You should see
activity in both terminals at the same time. If the code runs too fast for you to
see, increase the defined WAIT constant.
Listing 1. Duelling terminals
<?php
// script to test xdotool functionality
define('WAIT',200000); // pause after each command, in microseconds
$charrarray = setca();
// look for an existing window titled 'mytest'
$execcmd = "xdotool search --name 'mytest'";
exec($execcmd,$out);
$windowid = isset($out[0]) ? $out[0] : null;
if ($windowid) {
    echo "Found window ID ".$windowid."\n";
} else {
    // no such window yet: open a terminal with that title and search again
    exec('konsole -p LocalTabTitleFormat=mytest');
    $out = array();
    exec($execcmd,$out);
    $windowid = isset($out[0]) ? $out[0] : null;
    // die("Cannot find the window\n");
}
// now activate the window, if the search found one
if ($windowid) {
    $execcmd = "xdotool windowactivate --sync $windowid";
    exec($execcmd,$out);
}
// test keys: the array keys are the key names that xdotool expects
foreach ($charrarray as $k=>$char) {
    echo "Trying $char \n";
    send_key($k);
}
// functions
function send_string($string) {
    // quote the text so the shell passes it to xdotool as one argument
    $execcmd = "xdotool type ".escapeshellarg($string);
    exec($execcmd,$out);
    usleep(WAIT);
}
function send_key($key) {
    $execcmd = "xdotool key ".escapeshellarg($key);
    exec($execcmd,$out);
    usleep(WAIT);
}
function send_return() {
    $execcmd = "xdotool key Return";
    exec($execcmd,$out);
    usleep(WAIT);
}
function setca() {
$ca = array(
"Up" => 'up',
"Down" => 'down',
"ampersand" => '&',
"apostrophe" => '\'',
"Left" => 'L',
"Right" => 'R',
"asciicircum" => '^',
"asciitilde" => '~',
"asterisk" => '*',
"at" => '@',
"backslash" => '\\',
"bar" => '|',
"braceleft" => '{',
"braceright" => '}',
"bracketleft" => '[',
"bracketright" => ']',
"colon" => ':',
"comma" => ',',
"dollar" => '$',
"equal" => '=',
"exclam" => '!',
"grave" => '`',
"greater" => '>',
"less" => '<',
"minus" => '-',
"numbersign" => '#',
"parenleft" => '(',
"parenright" => ')',
"percent" => '%',
"period" => '.',
"plus" => '+',
"question" => '?',
"quotedbl" => '"',
"semicolon" => ';',
"space" => ' ',
"BackSpace" => 'bs',
"Tab" => '\t',
"underscore" => '_',
"slash" => '/',
"eacute" => 'eac',
"ccedilla" => 'cced',
"udiaeresis" => 'uum',
"idiaeresis" => 'ïdi',
"Home" => '<',
"End" => '>',
"Return" => '\n'
);
return $ca;
}
?>
This script begins by defining a WAIT constant that has
global scope. This constant slows the functions down so you can see what is
happening. It then fills an array with the keystrokes you want to test. The full
list is, of course, much longer: This sample contains many of the standard characters
plus a few of the more unusual possibilities. xdotool then
looks for a window with the title mytest; if it does not find one, PHP
creates one by issuing a konsole instruction with a
parameter that gives the window a specific title:
exec('konsole -p LocalTabTitleFormat=mytest');
After creating the new window, you repeat the xdotool
search and set a global variable with the identity number of that window. This ensures that you are sending subsequent commands to the right destination. With
the window available, the script then activates it and begins looping through the
array of test characters. For each one, it prints in the work console that it's trying
a specific character and then prints that character in the target window.
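A new terminal can take a moment to appear, so a single repeat of the search may come back empty. The sketch below polls until an ID turns up. The function wait_for_window() is hypothetical, and so are the retry count and pause length; the search itself is passed in as a callable so that the logic can be tried without xdotool.

```php
<?php
// Hypothetical helper: repeat a search until it returns a window ID or we give up.
// $search is any callable that returns an array of output lines.
function wait_for_window(callable $search, $tries = 10, $pause = 200000) {
    for ($i = 0; $i < $tries; $i++) {
        $lines = $search();
        if (!empty($lines[0])) {
            return $lines[0]; // first matching window ID
        }
        usleep($pause); // give the window time to appear
    }
    return null; // no window found within the allowed attempts
}

// Real use, assuming xdotool is installed and an X session is running:
// $windowid = wait_for_window(function () {
//     exec("xdotool search --name 'mytest'", $out);
//     return $out;
// });
```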
When you're sure that you have the right combination of xdotool
arguments and options, you can gain the benefit of brevity by using
xdotool chaining and scripting capabilities. The PHP
scaffolding is useful when you're not sure what to use and need to have full
diagnostic control.
Although Listing 1 is productive in the sense that it allows you to test the names of keys, it would be even more productive to retrieve information from the Internet. The example in Listing 2 opens a new window, starts an instance of the Lynx text browser, and navigates to the website www.ibm.com.
Listing 2. Automate Lynx
<?php
// script to test xdotool functionality
// first get the window to launch lynx
define('WAIT',200000); // pause after each command, in microseconds
$winname = 'mylynx';
$execcmd = "xdotool search --name ".escapeshellarg($winname);
exec($execcmd,$out);
$windowid = isset($out[0]) ? $out[0] : null;
if ($windowid) {
    echo "Found window ID ".$windowid."\n";
} else {
    // no such window yet: open a terminal with that title and search again
    exec("konsole -p LocalTabTitleFormat=$winname");
    $out = array();
    exec($execcmd,$out);
    $windowid = isset($out[0]) ? $out[0] : null;
    // die("Cannot find the window\n");
}
// now activate the window, if the search found one
if ($windowid) {
    $execcmd = "xdotool windowactivate --sync $windowid";
    exec($execcmd,$out);
}
// resize whichever window is now active
$execcmd = "xdotool getactivewindow windowsize --sync 750 750";
exec($execcmd,$out);
// start sending stuff
send_string("lynx");
send_return(); // opens lynx browser
send_key("g"); // get URL prompt
send_string("www.ibm.com");
send_return(); // send a URL
function send_string($string) {
    // quote the text so the shell passes it to xdotool as one argument
    $execcmd = "xdotool type ".escapeshellarg($string);
    exec($execcmd,$out);
    usleep(WAIT);
}
function send_key($key) {
    $execcmd = "xdotool key ".escapeshellarg($key);
    exec($execcmd,$out);
    usleep(WAIT);
}
function send_return() {
    $execcmd = "xdotool key Return";
    exec($execcmd,$out);
    usleep(WAIT);
}
?>
The code in Listing 2 is similar to the code in Listing 1 in that it defines
the wait period to slow things down a bit, then looks for a window called mylynx
(so as not to confuse it with mytest above), saves the window ID for later use,
activates the window, resizes it, and initiates an instance of Lynx in the window.
The script then sends a g keystroke into that window,
which tells Lynx to expect a URL, and follows that up with the string
www.ibm.com and the Return key. The website is now
open and ready to receive further commands, such as Tab
and Return for navigation.
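The follow-up navigation can be sent in one call, because xdotool key accepts several key names at once. The helper below is a hypothetical sketch that presses Tab a given number of times and then Return; which link that reaches depends on where Lynx has placed its selection when the keys arrive.

```php
<?php
// Hypothetical helper: build one xdotool call that presses Tab $n times
// and then Return, to move forward through the links and follow one.
function follow_link_command($n) {
    $keys = array();
    for ($i = 0; $i < (int) $n; $i++) {
        $keys[] = 'Tab';
    }
    $keys[] = 'Return';
    return 'xdotool key ' . implode(' ', $keys);
}

echo follow_link_command(3), "\n";
```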
Why choose Lynx as an application to automate? You might find that in attempting to automate other browsers or applications some automated keystrokes are ignored. The roots of this issue lie in fears of cross-site scripting. Lynx seems to allow a few more faked keystrokes than, say, Mozilla Firefox. Your mileage may vary.
When you get everything working and can see how easy it is to automate using
indirect keystrokes, the question arises of input from other devices. Voice
is one option. A speech recognizer using a properly trained audio model can pick
up verbal instructions and use xdotool to execute
commands. If you're looking for a ready-made application that employs
xdotool and the VoxForge audio model, check out
kiku (see Resources for links
to more information), which is available for Ubuntu.
xdotool and the dialog manager
The example code in Listing 3 shows how to use
xdotool in a dialog manager.
Listing 3. xdotool in the dialog manager
<?php
...
function process($major,$minor) {
    global $globalcontext;
    switch ($major) {
    case 'CONTEXT':
        switch ($minor) {
        case 'SOFTWARE':
            $globalcontext = $minor;
            echo "Set global context to software\n";
            break;
        default:
            echo "Defaulted out $minor\n";
            break;
        }
        break;
    case 'BROWSER':
        // browser commands are valid only in the software context
        $incontext = ($globalcontext == 'SOFTWARE');
        if ($incontext) {
            switch ($minor) {
            case 'OPEN':
                echo "Open browser\n";
                browser_open();
                break;
            case 'CLOSE':
                echo "Close browser\n";
                browser_close();
                break;
            case 'LOCATION':
                echo "Go to URL\n";
                browser_location();
                break;
            default:
                echo "Defaulted out $minor\n";
                break;
            }
        } else {
            // OOC
            echo "Recognized but out of context\n";
        }
        break;
    default:
        echo "Defaulted out $major\n";
        break;
    }
}
function browser_open() {
    global $windowid;
    $winname = 'mylynx';
    $execcmd = "xdotool search --name ".escapeshellarg($winname);
    exec($execcmd,$out);
    $windowid = isset($out[0]) ? $out[0] : null;
    if ($windowid) {
        echo "Found window ID ".$windowid."\n";
    } else {
        // no such window yet: open a terminal with that title and search again
        exec("konsole -p LocalTabTitleFormat=$winname");
        $out = array();
        exec($execcmd,$out);
        $windowid = isset($out[0]) ? $out[0] : null;
        echo "Using windowid $windowid\n";
        // die("Cannot find the window\n");
    }
    // now activate the window, if the search found one
    if ($windowid) {
        $execcmd = "xdotool windowactivate --sync $windowid";
        exec($execcmd,$out);
    }
    $execcmd = "xdotool getactivewindow windowsize --sync 800 800";
    exec($execcmd,$out);
    // start sending stuff
    send_string("lynx");
    send_return(); // opens lynx browser
}
...
?>
Listing 3 shows two functions. The first is process(), which
is expecting two arguments. These arguments are strings that the speech-recognition
process returns. To find out more about the process of getting these strings, see the
tutorials in VoxForge or Sphinx (see Resources for links). In
this case, the expected strings come from a grammar that, in part, consists of the
instructions CONTEXT SOFTWARE and BROWSER
OPEN.
Following the process through, consider what happens when the speech recognizer hears
CONTEXT SOFTWARE. The major string is
CONTEXT, and the minor is SOFTWARE.
The process function declares access to a global variable in which the context is
stored, and then in the switch it sets the context variable
to SOFTWARE. This variable is known globally, so later,
when the speech recognizer picks up the major and minor BROWSER
OPEN, the switch can handle this but first verifies that the context is right.
The point of using a context at all is to help eliminate incorrect results from the
recognizer. If your context is SOCCER and you don't think
opening a browser is relevant in that context, the browser won't open.
If the context is right, then the command BROWSER OPEN
proceeds to the browser_open() function, which you will
quickly recognize as basically the same code used in Listing 2.
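The step that turns a recognized utterance into the two arguments is not shown in the listing. A minimal sketch of it follows, assuming that the recognizer hands back a single string of two words; split_utterance() is a hypothetical name, and the uppercase conversion matches the case labels used in Listing 3.

```php
<?php
// Hypothetical glue between a recognizer and process(): split an utterance
// such as "BROWSER OPEN" into its major and minor parts.
function split_utterance($utterance) {
    $words = preg_split('/\s+/', trim($utterance));
    if (count($words) !== 2 || $words[0] === '') {
        return null; // not a two-word command from the grammar
    }
    return array(strtoupper($words[0]), strtoupper($words[1]));
}

// Usage with the process() function from Listing 3:
// $parts = split_utterance('browser open');
// if ($parts !== null) {
//     process($parts[0], $parts[1]);
// }
```

Rejecting anything that is not exactly two words keeps stray recognizer output away from the switch statements.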
The more complex your context management becomes, the more you need a method of
reeling out switch statements that follow rules defined elsewhere that faithfully
reproduce your context management structure. You can define these rules in a
number of ways. One way is to use the structure of the Speech Recognition Grammar
Specification and Semantic Interpretation for Speech Recognition—see
Resources for links to more information. A slightly
simpler approach is to store the necessary code fragments, including the
xdotool instructions, in an XML structure.
Use XML to store context and code fragments
Listing 4 is an XML file that has details of the building blocks
for an imaginary dialog manager. The advantage of a flat, editable XML file is
that all of your contexts and functions are seen in the same file; because
they are separated from the more complex switching code of the dialog manager,
it is easier to see and edit the context structure. This data contains one
xdotool instruction to perform a left-click wherever the
mouse cursor might be at the time. This command is valid in all contexts and therefore
does not have a ctxt attribute.
Listing 4. The XML store
<?xml version="1.0" encoding="UTF-8"?>
<snips>
<context>
<func>click_left</func>
<func ctxt="software">browser_open</func>
<func ctxt="software">browser_location</func>
<func ctxt="software">browser_close</func>
<func ctxt="hardware">cpu_temperature</func>
<func ctxt="hardware">fan_speed</func>
</context>
<snip fn="click_left">
<![CDATA[
function click_left() {
exec('xdotool click 1');
}
]]>
</snip>
<snip fn="cpu_temperature">
<![CDATA[
function cpu_temperature() {
$g = 0;
}
]]>
</snip>
<snip fn="fan_speed">
<![CDATA[
function fan_speed() {
$g = 1;
}
]]>
</snip>
<snip fn="browser_open">
<![CDATA[
function browser_open() {
$f = 0;
}
]]>
</snip>
<snip fn="browser_location">
<![CDATA[
function browser_location() {
$f = 1;
}
]]>
</snip>
<snip fn="browser_close">
<![CDATA[
function browser_close() {
$f = 2;
}
]]>
</snip>
</snips>
In this code, the root element is snips. This element
has two types of children:
- A context element with contextual information
- A number of snip elements that contain the code fragments
The context element contains func elements, each of which
has the name of a function as the element value, and an attribute
ctxt that contains the context in which the function is
valid. Thus, the function fan_speed would not be valid if
the context were set as software. The function
click_left, however, has no context, so it is valid
anywhere. The code snippets are stored in CDATA segments, which ensures that
the XML parser skips these sections and accepts them regardless of the coding they
contain.
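The context rule can be checked directly with SimpleXML, as in the sketch below. It inlines a cut-down copy of the store so that it is self-contained; valid_in_context() is a hypothetical helper, not part of the dialog manager.

```php
<?php
// A cut-down copy of the store, inlined so the sketch is self-contained.
$xml = simplexml_load_string(<<<'XML'
<snips>
<context>
<func>click_left</func>
<func ctxt="software">browser_open</func>
<func ctxt="hardware">fan_speed</func>
</context>
<snip fn="click_left"><![CDATA[
function click_left() {
exec('xdotool click 1');
}
]]></snip>
</snips>
XML
);

// Hypothetical helper: is $fn listed in the store and valid in $context?
function valid_in_context($xml, $fn, $context) {
    foreach ($xml->context->func as $func) {
        if ((string) $func === $fn) {
            $ctxt = (string) $func['ctxt'];
            // no ctxt attribute means the function is valid everywhere
            return $ctxt === '' || $ctxt === $context;
        }
    }
    return false; // function not listed at all
}

// Casting a snip element to a string yields the code held in its CDATA section.
$code = trim((string) $xml->snip[0]);
```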
Now, all you need is the script to expand the dialog data into its own script. The PHP code in Listing 5 does just that.
Listing 5. Dialog manager generator
<?php
// pull DM structure from an xml file
$xml = simplexml_load_file('snipstor.xml');
// code that closes one major case: inner default, context test, outer break
$closemajor = "      default:\n"
    ."        echo \"Failed \$minor\";\n"
    ."        break;\n"
    ."      }\n"
    ."    } else {\n"
    ."      echo 'OOC';\n"
    ."    }\n"
    ."    break;\n";
// generate the main switch
echo "function process(\$major,\$minor) {\n";
echo "  global \$globalcontext;\n";
echo "  switch (\$major) {\n";
$majtmp = "";
foreach ($xml->context->func as $mjmn) {
    $fn = (string) $mjmn;
    list($major,$minor) = explode("_",$fn);
    if ($major != $majtmp) {
        // a new major word: close the previous case, then open a new one
        if ($majtmp != "") echo $closemajor;
        // functions without a ctxt attribute are valid in every context
        $ctxt = (string) $mjmn['ctxt'];
        $test = ($ctxt != '') ? "\$globalcontext == '$ctxt'" : 'true';
        echo "  case '$major':\n";
        echo "    if ($test) {\n";
        echo "      switch (\$minor) {\n";
        $majtmp = $major;
    }
    // every minor word gets a case that calls its function
    echo "      case '$minor':\n";
    echo "        $fn();\n";
    echo "        break;\n";
}
echo $closemajor;
echo "  default:\n";
echo "    echo \"Failed \$major\";\n";
echo "    break;\n";
echo "  }\n";
echo "}\n";
// generate the code snippets
echo "// functions\n";
foreach ($xml->snip as $snipfn) {
    echo trim($snipfn);
    echo "\n";
}
?>
The code in Listing 5 writes more PHP code that you can fit into a dialog manager.
It begins by reading the dialog manager structure of Listing 4
saved in a file called snipstor.xml into a SimpleXML variable. It then
reads the contents of the function names in the context element, extracts the
major and minor components from these names, and uses them to build the
switch statements that control the flow in the dialog manager. As it builds the
code, the script looks for a context (from the ctxt
attribute) in which the command is valid. It inserts a conditional if
statement to take care of any contexts that might apply to the switch case. If
there is a context, it's inserted as an expression to be evaluated at run time;
otherwise, it just inserts true so that the programme
always executes the enclosed statements; that is, the command is valid in all
contexts. Each major case ends with a break so that control does not fall through
into the next one. Finally, the script outputs the library of code that the switch cases need.
The result of running Listing 5 on the data contained in Listing 4 follows in Listing 6.
Listing 6. Resulting dialog manager extract
function process($major,$minor) {
  global $globalcontext;
  switch ($major) {
  case 'click':
    if (true) {
      switch ($minor) {
      case 'left':
        click_left();
        break;
      default:
        echo "Failed $minor";
        break;
      }
    } else {
      echo 'OOC';
    }
    break;
  case 'browser':
    if ($globalcontext == 'software') {
      switch ($minor) {
      case 'open':
        browser_open();
        break;
      case 'location':
        browser_location();
        break;
      case 'close':
        browser_close();
        break;
      default:
        echo "Failed $minor";
        break;
      }
    } else {
      echo 'OOC';
    }
    break;
  case 'cpu':
    if ($globalcontext == 'hardware') {
      switch ($minor) {
      case 'temperature':
        cpu_temperature();
        break;
      default:
        echo "Failed $minor";
        break;
      }
    } else {
      echo 'OOC';
    }
    break;
  case 'fan':
    if ($globalcontext == 'hardware') {
      switch ($minor) {
      case 'speed':
        fan_speed();
        break;
      default:
        echo "Failed $minor";
        break;
      }
    } else {
      echo 'OOC';
    }
    break;
  default:
    echo "Failed $major";
    break;
  }
}
// functions
function click_left() {
exec('xdotool click 1');
}
function cpu_temperature() {
$g = 0;
}
function fan_speed() {
$g = 1;
}
function browser_open() {
$f = 0;
}
function browser_location() {
$f = 1;
}
function browser_close() {
$f = 2;
}
This output is similar to what you started with in Listing 3. Note
that the actual CDATA content is valid code but is abbreviated for simplicity. The
OOC is just shorthand for Out of Context.
You see this message when the speech recognizer hears an enunciation that is valid but does not
make sense according to the structure of the snipstor.xml file. To make the example
more meaningful, you can replace the CDATA section for the
browser_open() function with the code for that function
from Listing 3.
As you can see, xdotool is a handy tool for making calls to the
windowing system. Combined with a speech recognizer, it lets you use voice to initiate
xdotool commands under the control of a dialog manager.
And finally, because the dialog manager can get quite complex and unwieldy when
contexts are complicated, you can use code fragments containing your
xdotool instructions stored conveniently in XML CDATA
sections to generate the dialog manager consistently and effectively.
Learn
- VoxForge and Carnegie Mellon's CMU Sphinx: Find out more about putting together a speech recognition model.
- Look, Ma! No keyboard! Voice input and response using fixed grammars (Colin Beckingham, developerWorks, November 2010): Read more about voice recognition using grammars.
- Querying a database using open source voice control software (Colin Beckingham, linux.com, May 2008): Get an overview of open source software working in a voice/speech context.
- Dealing with data in XML (Chris Herborth, developerWorks, January 2010): Learn more about CDATA elements and how to use them effectively to ship marked-up data along with your XML file.
- More articles by this author (Colin Beckingham, developerWorks, March 2009-current): Read articles about XML, voice recognition, XHTML, PHP, SMIL, and other technologies.
- New to XML? Get the resources you need to learn XML.
- XML area on developerWorks: Find the resources you need to advance your skills in the XML arena, including DTDs, schemas, and XSLT. See the XML technical library for a wide range of technical articles and tips, tutorials, standards, and IBM Redbooks.
- IBM XML certification: Find out how you can become an IBM-Certified Developer in XML and related technologies.
- developerWorks technical events and webcasts: Stay current with technology in these sessions.
- developerWorks on Twitter: Join today to follow developerWorks tweets.
- developerWorks podcasts: Listen to interesting interviews and discussions for software developers.
- developerWorks on-demand demos: Watch demos ranging from product installation and setup for beginners to advanced functionality for experienced developers.
Get products and technologies
- xdotool: Download and get more information on setup and use to simulate keyboard input and mouse activity, move and resize windows, and so on.
- kiku: Find out more about the kiku speech recognition and dialog manager and how to use voice recognition to control your operating system.
- Lynx source distribution directory: Download and learn more about the Lynx web browser with its User Guide and main help page.
- IBM product evaluation versions: Download or explore the online trials in the IBM SOA Sandbox and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
Discuss
- XML zone discussion forums: Participate in any of several XML-related discussions.
- The developerWorks community: Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.
Colin Beckingham is a freelance researcher, writer, and programmer who lives in eastern Ontario, Canada. Holding degrees from Queen's University, Kingston, and the University of Windsor, he has worked in a rich variety of fields including banking, horticulture, horse racing, teaching, civil service, retail, and travel and tourism. The author of database applications and numerous newspaper, magazine, and online articles, his research interests include open source programming, VoIP, and voice-control applications on Linux. You can reach Colin at colbec@start.ca.




