Trigger keyboard and mouse actions with your voice and xdotool

Generate keystrokes by combining voice input with the xdotool library

xdotool is a helpful library of instructions that allows programmers to emulate keystrokes and mouse actions. The particular strength of the tool comes when the keyboard or mouse is absent or in accessibility situations where the user is not physically able to employ regular input methods. This article has two goals—first, to provide an introduction to the use of xdotool in a Linux® desktop environment, and second, to use voice input to trigger actions typically done through hardware input. A concluding example uses XML to store xdotool-oriented code fragments for insertion into auto-generated dialog manager code.

Colin Beckingham, Writer and Researcher, Freelance

Colin Beckingham is a freelance researcher, writer, and programmer who lives in eastern Ontario, Canada. Holding degrees from Queen's University, Kingston, and the University of Windsor, he has worked in a rich variety of fields including banking, horticulture, horse racing, teaching, civil service, retail, and travel and tourism. The author of database applications and numerous newspaper, magazine, and online articles, his research interests include open source programming, VoIP, and voice-control applications on Linux. You can reach Colin at colbec@start.ca.



06 September 2011


Some of us take keyboard and mouse input for granted: You type a character on the keyboard, and it appears in your window, or you type a string of characters, press Return, and some action happens—either locally or at a networked distance. What else is there to expect? But what if you don't have or cannot use a keyboard or mouse, or you want to put a keystroke into one window and have it do something in a different window on a different desktop? Or, perhaps, you want to create a window, resize it, pull up a browser in that window, navigate to a URL, then tab through a number of links in the web page and click one—all without a keyboard or mouse, using your voice through a speech recognizer. This approach calls for keyboard and mouse emulation.

Frequently used acronyms

  • CDATA: Character data
  • URL: Uniform Resource Locator
  • XML: Extensible Markup Language

Access to the system calls that perform all of these tasks is perfectly possible but perhaps inconvenient, because it involves coding at a lower level than most developers typically use. Just as a speech recognizer makes access to recorded speech patterns easy, other libraries give access to the windowing schemes of Linux desktops. Enter xdotool.

xdotool

xdotool (see Resources for a link) is a library of functions that help you send instructions to the windowing environment. It can send a string to be typed into a remote window, resize the window, send individual keystrokes and keystroke combinations, or even perform a double-click of the right mouse button. And the remote application is unaware that it is being manipulated, or voxipulated, indirectly. The website has specific instructions for installation using make.

Sending xdotool commands is straightforward: The syntax is a simple xdotool followed by options and arguments.

xdotool search --name 'mytest'

In this example, xdotool searches for windows with a title of mytest and returns their ID numbers, which you can record for later use. Beware of searching for windows without some kind of criterion—you will end up with a long list of meaningless numbers.
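To record the IDs for later use from a script, something like the following minimal sketch may help. It assumes that xdotool prints one decimal window ID per line; the helper names parse_window_ids() and find_windows() are hypothetical and are not part of xdotool.

```php
<?php
// Hypothetical helper: turn the lines printed by "xdotool search" into integer IDs.
function parse_window_ids(array $lines): array {
  $ids = array();
  foreach ($lines as $line) {
    $line = trim($line);
    if (preg_match('/^[0-9]+$/', $line)) {   // skip blank lines and any messages
      $ids[] = (int) $line;
    }
  }
  return $ids;
}

// Hypothetical wrapper: run the search and return the matching IDs (possibly none).
function find_windows(string $name): array {
  $out = array();
  exec('xdotool search --name ' . escapeshellarg($name), $out);
  return parse_window_ids($out);
}
```

Calling find_windows('mytest') returns an empty array when nothing matches, which is easier to test for than a missing $out[0].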

You can get hints on the names to use for keys and keystroke combinations from the website. One little trap is that key names are case-sensitive: Although tab seems intuitive, only Tab works on my system.

Although it is possible to issue xdotool commands one at a time from a terminal window, some internal variables are lost each time the instance of xdotool ends, so chaining a number of options and arguments after a single invocation of xdotool is more efficient.
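As a rough illustration of chaining, the sketch below joins several xdotool commands into one invocation. build_chain() is a hypothetical helper, and the chain assumes that xdotool hands the window found by search to the commands that follow it in the same invocation.

```php
<?php
// Hypothetical helper: join several xdotool commands into a single invocation.
// Each step is an array of words; every word is shell-escaped before joining.
function build_chain(array $steps): string {
  $words = array('xdotool');
  foreach ($steps as $step) {
    foreach ($step as $word) {
      $words[] = escapeshellarg((string) $word);
    }
  }
  return implode(' ', $words);
}

$cmd = build_chain(array(
  array('search', '--name', 'mytest'),   // find the window
  array('windowactivate', '--sync'),     // activate the first match
  array('key', 'Tab'),                   // then send a keystroke
));
echo $cmd . "\n";
// To run the chain: exec($cmd, $out);
```

Building the command as a list of words keeps the quoting in one place, which matters once window names or typed strings contain spaces.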

Probably the best way to get a feel for this is to see a working example.


Automate a console

Listing 1 shows a test that uses one terminal window (call it work) to open another terminal (mytest) and send keystrokes to it. The commands are scripted with PHP only to provide a way of slowing down the process so you can see what happens: You can use Perl or Python or whatever your favourite scripting engine might be. The effect of the script is to define a number of keystrokes, have the working terminal say "Trying A" (where A is a key), and send that key to the second terminal. You should see activity in both terminals at the same time. If the code runs too fast for you to see, increase the defined WAIT constant.

Listing 1. Duelling terminals
<?php
// script to test xdotool functionality
define('WAIT',200000);
$charrarray = setca();
$execcmd = "xdotool search --name 'mytest'";
exec($execcmd,$out);
$windowid = $out[0];
if ($windowid) {
  echo "Found window ID ".$windowid."\n";
} else {
  exec('konsole -p LocalTabTitleFormat=mytest');
  $execcmd = "xdotool search --name 'mytest'";
  exec($execcmd,$out);
  $windowid = $out[0];
  // die("Cannot find the window\n");
}
// now activate the window
$execcmd = "xdotool windowactivate --sync $windowid";
exec($execcmd,$out);
// test keys
foreach ($charrarray as $k=>$char) {
  echo "Trying $char \n";
  send_key($k);
}
// functions
function send_string($string) {
  // quote the text so that spaces and shell metacharacters arrive intact
  $execcmd = "xdotool type ".escapeshellarg($string);
  exec($execcmd,$out);
  usleep(WAIT);
}
function send_key($key) {
  $execcmd = "xdotool key $key";
  exec($execcmd,$out);
  usleep(WAIT);
}
function send_return() {
  $execcmd = "xdotool key Return";
  exec($execcmd,$out);
  usleep(WAIT);
}
function setca() {
$ca = array(
   "Up" => 'up',
   "Down" => 'down',
   "ampersand" => '&',
   "apostrophe" => '\'',
   "Left" => 'L',
   "Right" => 'R',
   "asciicircum" => '^',
   "asciitilde" => '~', 
   "asterisk" => '*',   
   "at" => '@',
   "backslash" => '\\',
   "bar" => '|',
   "braceleft" => '{',
   "braceright" => '}',
   "bracketleft" => '[',
   "bracketright" => ']',
   "colon" => ':',
   "comma" => ',',
   "dollar" => '$',
   "equal" => '=', 
   "exclam" => '!',
   "grave" => '`', 
   "greater" => '>',
   "less" => '<',   
   "minus" => '-',  
   "numbersign" => '#',
   "parenleft" => '(', 
   "parenright" => ')',
   "percent" => '%',   
   "period" => '.',    
   "plus" => '+',      
   "question" => '?',  
   "quotedbl" => '"',  
   "semicolon" => ';', 
   "space" => ' ',
   "BackSpace" => 'bs',
   "Tab" => '\t', 
   "underscore" => '_',
   "slash" => '/', 
   "eacute" => 'eac', 
   "ccedilla" => 'cced', 
   "udiaeresis" => 'uum', 
   "idiaeresis" => 'ïdi', 
   "Home" => '<',
   "End" => '>',
   "Return" => '\n'
);
return $ca;
}
?>

This script begins by defining a WAIT constant that has global scope. This constant slows the functions down so you can see what is happening. It then fills an array with the keystrokes you want to test. The full list is, of course, much longer: This sample contains many of the standard characters plus a few of the more unusual possibilities. xdotool then looks for a window with the title mytest; if it does not find one, PHP creates one by issuing a konsole instruction with a parameter that gives the window a specific title:

exec('konsole -p LocalTabTitleFormat=mytest');

After creating the new window, you repeat the xdotool search and set a global variable with the identity number of that window. This ensures that you are sending subsequent commands to the right destination. With the window available, the script then activates it and begins looping through the array of test characters. For each one, it prints in the work console that it's trying a specific character and then prints that character in the target window.
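Because xdotool key expects key names such as ampersand rather than the characters themselves, a lookup in the other direction can be handy. The following is a minimal sketch, not part of the original script: keyname_for() is a hypothetical helper, and its table is a partial copy of the names in Listing 1.

```php
<?php
// Hypothetical lookup: the X key name for a single character.
// The names are copied from the array in Listing 1; the table is deliberately partial.
function keyname_for(string $char): ?string {
  static $names = array(
    '&' => 'ampersand', '@' => 'at', ':' => 'colon', ',' => 'comma',
    '$' => 'dollar', '=' => 'equal', '-' => 'minus', '.' => 'period',
    '+' => 'plus', '?' => 'question', ' ' => 'space', '/' => 'slash',
    '_' => 'underscore', "\t" => 'Tab', "\n" => 'Return',
  );
  if (isset($names[$char])) {
    return $names[$char];
  }
  // Plain ASCII letters and digits are their own key names.
  return preg_match('/^[A-Za-z0-9]$/', $char) ? $char : null;
}
```

A null result signals a character the table does not cover, so the caller can decide whether to skip it or fall back to xdotool type.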

When you're sure that you have the right combination of xdotool arguments and options, you can gain the benefit of brevity by using xdotool chaining and scripting capabilities. The PHP scaffolding is useful when you're not sure what to use and need to have full diagnostic control.


Automate a text browser

Although Listing 1 is productive in the sense that it allows you to test the names of keys, it would be even more productive to retrieve information from the Internet. The example in Listing 2 opens a new window, starts an instance of the Lynx text browser, and navigates to the website www.ibm.com.

Listing 2. Automate Lynx
<?php
// script to test xdotool functionality
// first get the window to launch lynx
define('WAIT',200000);
$winname = 'mylynx';
$execcmd = "xdotool search --name $winname";
exec($execcmd,$out);
$windowid = $out[0];
if ($windowid) {
  echo "Found window ID ".$windowid."\n";
} else {
  exec("konsole -p LocalTabTitleFormat=$winname");
  exec($execcmd,$out);
  $windowid = $out[0];
  // die("Cannot find the window\n");
}
// now activate the window
$execcmd = "xdotool windowactivate --sync $windowid";
exec($execcmd,$out);
$execcmd = "xdotool getactivewindow windowsize --sync 750 750";
exec($execcmd,$out);
// start sending stuff
send_string("lynx");
send_return(); // opens lynx browser
send_key("g"); // get URL prompt
send_string("www.ibm.com"); 
send_return(); // send a URL
function send_string($string) {
  // quote the text so that spaces and shell metacharacters arrive intact
  $execcmd = "xdotool type ".escapeshellarg($string);
  exec($execcmd,$out);
  usleep(WAIT);
}
function send_key($key) {
  $execcmd = "xdotool key $key";
  exec($execcmd,$out);
  usleep(WAIT);
}
function send_return() {
  $execcmd = "xdotool key Return";
  exec($execcmd,$out);
  usleep(WAIT);
}
?>

The code in Listing 2 is similar to the code in Listing 1 in that it defines the wait period to slow things down a bit, then looks for a window called mylynx (so as not to confuse it with mytest above), saves the window ID for later use, activates the window, resizes it, and initiates an instance of Lynx in the window. The script then sends a g keystroke into that window, which tells Lynx to expect a URL, and follows that up with the string www.ibm.com and the Return key. The website is now open and ready to receive further commands, such as Tab and Return for navigation.
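Navigation with Tab and Return could be sketched as follows. follow_link_command() is a hypothetical helper; it relies on xdotool key accepting several key names in one call, and the number of Tab presses needed to reach a given link depends entirely on the page.

```php
<?php
// Hypothetical helper: build one "xdotool key" command that presses Tab
// $tabs times to move forward through the links, then Return to follow
// the link that is selected at that point.
function follow_link_command(int $tabs): string {
  $keys = array_fill(0, max(0, $tabs), 'Tab');
  $keys[] = 'Return';
  return 'xdotool key ' . implode(' ', $keys);
}

echo follow_link_command(3) . "\n";   // xdotool key Tab Tab Tab Return
```

Passing the result to exec() while the Lynx window is active would send the whole sequence in a single xdotool invocation.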

Why choose Lynx as an application to automate? You might find that in attempting to automate other browsers or applications some automated keystrokes are ignored. Some applications discard input events that are flagged as synthetic, typically as a security precaution. Lynx seems to allow a few more faked keystrokes than, say, Mozilla Firefox. Your mileage may vary.


Automate with voice

When you get everything working and can see how easy it is to automate using indirect keystrokes, it raises the question of input from other devices. Voice is one option. A speech recognizer using a properly trained audio model can pick up verbal instructions and use xdotool to execute commands. If you're looking for a ready-made application that employs xdotool and the VoxForge audio model, check out kiku (see Resources for links to more information), which is available for Ubuntu.


xdotool and the dialog manager

The example code in Listing 3 shows how to use xdotool in a dialog manager.

Listing 3. xdotool in the dialog manager
<?php
...
function process($major,$minor) {
global $globalcontext;
  switch ($major) {
    case 'CONTEXT':
      switch ($minor) {
        case 'SOFTWARE':
          $globalcontext = $minor;
          echo "Set global context to software\n";
        break;
        default:
          echo "Defaulted out $minor\n";
        break;
      }
    break;
    case 'BROWSER':
      $ooc = ($globalcontext == 'SOFTWARE') ? true : false ;
      if ($ooc) {
        switch ($minor) {
          case 'OPEN':
            echo "Open browser\n";
            browser_open();
          break;
          case 'CLOSE':
            echo "Close browser\n";
            browser_close();
          break;
          case 'LOCATION':
            echo "Go to URL\n";
            browser_location();
          break;
          default:
            echo "Defaulted out $minor\n";
          break;
        }
      } else {
        // OOC
        echo "Recognized but out of context\n";
      }
    break;
    default:
      echo "Defaulted out $major\n";
    break;
  }
}
function browser_open() {
global $windowid;
  $winname = 'mylynx';
  $execcmd = "xdotool search --name $winname";
  exec($execcmd,$out);
  $windowid = $out[0];
  if ($windowid) {
    echo "Found window ID ".$windowid."\n";
  } else {
    exec("konsole -p LocalTabTitleFormat=$winname");
    $execcmd = "xdotool search --name $winname";
    exec($execcmd,$out);
    $windowid = $out[0];
    echo "Using windowid $windowid";
    // die("Cannot find the window\n");
  }
  // now activate the window
  $execcmd = "xdotool windowactivate --sync $windowid";
  exec($execcmd,$out);
  $execcmd = "xdotool getactivewindow windowsize --sync 800 800";
  exec($execcmd,$out);
  // start sending stuff
  send_string("lynx");
  send_return(); // opens lynx browser
}
...
?>

Listing 3 shows two functions. The first is process(), which expects two arguments: the strings that the speech-recognition process returns. To find out more about the process of getting these strings, see the tutorials in VoxForge or Sphinx (see Resources for links). In this case, the expected strings come from a grammar that, in part, consists of the instructions CONTEXT SOFTWARE and BROWSER OPEN.
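If the recognizer hands back each instruction as a single phrase, a small helper can split it into the two arguments. This is only a sketch under that assumption; split_phrase() is a hypothetical name, and a real recognizer may deliver its results in a different shape.

```php
<?php
// Hypothetical helper: split a recognized phrase such as "BROWSER OPEN"
// into the major and minor strings that process() expects.
function split_phrase(string $phrase): ?array {
  $words = preg_split('/\s+/', strtoupper(trim($phrase)), -1, PREG_SPLIT_NO_EMPTY);
  if (count($words) !== 2) {
    return null;   // not a two-word instruction; let the caller ignore it
  }
  return $words;
}
```

A caller might then write list($major, $minor) = split_phrase($heard) after checking for null, and pass both values to process().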

Following the process through, consider what happens when the speech recognizer hears CONTEXT SOFTWARE. The major string is CONTEXT, and the minor is SOFTWARE. The process function declares access to a global variable in which the context is stored, and then in the switch it sets the context variable to SOFTWARE. This variable is known globally, so later, when the speech recognizer picks up the major and minor BROWSER OPEN, the switch can handle this but first verifies that the context is right. The point of using a context at all is to help eliminate incorrect results from the recognizer. If your context is SOCCER and you don't think opening a browser is relevant in that context, the browser won't open.
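The context check itself can be expressed as a small table lookup. The sketch below is illustrative rather than taken from the dialog manager: the rule table, the CLICK entry, and the in_context() function are all assumptions.

```php
<?php
// Assumed rule table: the context each major command needs (null = any context).
$required_context = array(
  'BROWSER' => 'SOFTWARE',
  'CLICK'   => null,
);

// Hypothetical check: may $major run while $context is the active context?
function in_context(array $rules, string $major, ?string $context): bool {
  if (!array_key_exists($major, $rules)) {
    return false;   // unknown command
  }
  return $rules[$major] === null || $rules[$major] === $context;
}
```

With this table, BROWSER is accepted only while the context is SOFTWARE, whereas CLICK is accepted in any context, including SOCCER.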

If the context is right, then the command BROWSER OPEN proceeds to the browser_open() function, which you will quickly recognize as basically the same code used in Listing 2.

The more complex your context management becomes, the more you need a method of generating switch statements from rules defined elsewhere, so that the code faithfully reproduces your context management structure. You can define these rules in a number of ways. One way is to use the structure of the Speech Recognition Grammar Specification and Semantic Interpretation for Speech Recognition (see Resources for links to more information). A slightly simpler approach is to store the necessary code fragments, including the xdotool instructions, in an XML structure.


Use XML to store context and code fragments

Listing 4 is an XML file that has details of the building blocks for an imaginary dialog manager. The advantage of a flat, editable XML file is that all of your contexts and functions are seen in the same file; because they are separated from the more complex switching code of the dialog manager, it is easier to see and edit the context structure. This data contains one xdotool instruction to perform a left-click wherever the mouse cursor might be at the time. This command is valid in all contexts and therefore does not have a ctxt attribute.

Listing 4. The XML store
<?xml version="1.0" encoding="UTF-8"?>
<snips>
  <context>
    <func>click_left</func> 
    <func ctxt="software">browser_open</func>    
    <func ctxt="software">browser_location</func>    
    <func ctxt="software">browser_close</func>    
    <func ctxt="hardware">cpu_temperature</func>    
    <func ctxt="hardware">fan_speed</func>    
  </context>
  <snip fn="click_left">
<![CDATA[
function click_left() {
  exec('xdotool click 1');
}
]]>
  </snip>
  <snip fn="cpu_temperature">
<![CDATA[
function cpu_temperature() {
  $g = 0;
}
]]>
  </snip>
  <snip fn="fan_speed">
<![CDATA[
function fan_speed() {
  $g = 1;
}
]]>
  </snip>
  <snip fn="browser_open">
<![CDATA[
function browser_open() {
  $f = 0;
}
]]>
  </snip>
  <snip fn="browser_location">
<![CDATA[
function browser_location() {
  $f = 1;
}
]]>
  </snip>
  <snip fn="browser_close">
<![CDATA[
function browser_close() {
  $f = 2;
}
]]>
  </snip>
</snips>

In this code, the root element is snips. This element has two types of children:

  • A context element with contextual information
  • A number of snip elements that contain the code fragments

The context element contains func elements, each of which has the name of a function as the element value and an optional attribute ctxt that contains the context in which the function is valid. Thus, the function fan_speed would not be valid if the context were set as software. The function click_left, however, has no context, so it is valid anywhere. The code snippets are stored in CDATA sections, which tell the XML parser to treat their contents as literal text rather than markup, so the parser accepts them regardless of the code they contain.
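To see how the func elements and their ctxt attributes come out of SimpleXML, consider this minimal sketch. It inlines a cut-down copy of the structure so that it runs on its own; context_map() is a hypothetical helper name.

```php
<?php
// A cut-down copy of the structure in Listing 4, inlined so the sketch is self-contained.
$source = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<snips>
  <context>
    <func>click_left</func>
    <func ctxt="software">browser_open</func>
    <func ctxt="hardware">fan_speed</func>
  </context>
</snips>
XML;

// Hypothetical helper: map each function name to its context (null = valid anywhere).
function context_map(SimpleXMLElement $xml): array {
  $map = array();
  foreach ($xml->context->func as $func) {
    $name = trim((string) $func);
    $map[$name] = isset($func['ctxt']) ? (string) $func['ctxt'] : null;
  }
  return $map;
}

$map = context_map(simplexml_load_string($source));
```

Here $map holds null for click_left, 'software' for browser_open, and 'hardware' for fan_speed, which is all a generator needs to decide whether a context test is required.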

Now, all you need is the script to expand the dialog data into its own script. The PHP code in Listing 5 does just that.

Listing 5. Dialog manager generator
<?php
// pull DM structure from an xml file
$xml = simplexml_load_file('snipstor.xml');
// get the contexts
// generate the main switch
echo "function process(\$major,\$minor) {\n";
echo "global \$globalcontext;\n";
echo "  switch (\$major) {\n";
$majtmp = "";
$mintmp = "";
foreach ($xml->context->func as $mjmn) {
  list($major,$minor) = explode("_",$mjmn);
  //echo "$major,$minor\n";
  if ($major != $majtmp) {
    if ($majtmp != "") echo "      default:
      echo \"Failed \$minor\";
      break;
      }\n    } else {
      echo 'OOC';\n    }\n";
    echo "  case '$major':\n";
    if ($minor != $mintmp) {
      $test = ($mjmn['ctxt']) ? "\$globalcontext == '".$mjmn['ctxt']."'" : 'true' ;
      echo "    if ($test) {\n";
      echo "      switch (\$minor) {\n";
      $mintmp = $minor;
    }
    echo "      case '$minor':
        $mjmn();
      break;\n";
    $majtmp = $major;
  } else {
    echo "      case '$minor':
      break;\n";
  }
}
echo "      default:
      echo \"Failed \$minor\";
      break;
      }\n    } else {
      echo 'OOC';\n    }\n";
echo "  default:
    echo \"Failed \$major\";
  break;
  }\n";
echo "}\n";
// generate the code snippets
echo "// functions\n";
foreach ($xml->snip as $snipfn) {
  echo trim($snipfn);
  echo "\n";
}
?>

The code in Listing 5 writes more PHP code that you can fit into a dialog manager. It begins by reading the dialog manager structure of Listing 4 saved in a file called snipstor.xml into a SimpleXML variable. It then reads the contents of the function names in the context element, extracts the major and minor components from these names, and uses them to build the switch statements that control the flow in the dialog manager. As it builds the code, the script looks for a context (from the ctxt attribute) in which the command is valid. It inserts a conditional if statement to take care of any contexts that might apply to the switch case. If there is a context, it's inserted as an expression to be evaluated at run time; otherwise, it just inserts true so that the programme always executes enclosed statements—that is, the command is valid in all contexts. Finally, the script outputs the library of code that the switch cases need.

The result of running Listing 5 on the data contained in Listing 4 follows in Listing 6.

Listing 6. Resulting dialog manager extract
function process($major,$minor) {
global $globalcontext;
  switch ($major) {
  case 'click':
    if (true) {
      switch ($minor) {
      case 'left':
        click_left();
      break;
      default:
      echo "Failed $minor";
      break;
      }
    } else {
      echo 'OOC';
    }
  case 'browser':
    if ($globalcontext == 'software') {
      switch ($minor) {
      case 'open':
        browser_open();
      break;
      case 'location':
      break;
      case 'close':
      break;
      default:
      echo "Failed $minor";
      break;
      }
    } else {
      echo 'OOC';
    }
  case 'cpu':
    if ($globalcontext == 'hardware') {
      switch ($minor) {
      case 'temperature':
        cpu_temperature();
      break;
      default:
      echo "Failed $minor";
      break;
      }
    } else {
      echo 'OOC';
    }
  case 'fan':
    if ($globalcontext == 'hardware') {
      switch ($minor) {
      case 'speed':
        fan_speed();
      break;
      default:
      echo "Failed $minor";
      break;
      }
    } else {
      echo 'OOC';
    }
  default:
    echo "Failed $major";
  break;
  }
}
// functions
function click_left() {
  exec('xdotool click 1');
}
function cpu_temperature() {
  $g = 0;
}
function fan_speed() {
  $g = 1;
}
function browser_open() {
  $f = 0;
}
function browser_location() {
  $f = 1;
}
function browser_close() {
  $f = 2;
}

This output is similar to what you started with in Listing 3. Note that the actual CDATA content is valid code but is abbreviated for simplicity. OOC is shorthand for Out of Context: You see this message when the speech recognizer hears an utterance that is valid but does not make sense in the current context, as defined by the structure of the snipstor.xml file. To make the example more meaningful, you can replace the CDATA section for the browser_open() function with the code for that function from Listing 3.
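In Listing 6 as generated, the location and close cases contain no function call, and no break separates one major case from the next. The sketch below is one possible variant of the generator, not the original script: it groups the functions by major word first, emits a call for every function, and closes each major case with a break. It also moves the context test inside each minor case. The function name generate_process() and the shape of its input array (function name mapped to context, null meaning any context) are assumptions.

```php
<?php
// Sketch of a generator variant. $funcs maps function name => context
// (null = valid in any context); this input shape is an assumption.
function generate_process(array $funcs): string {
  // Group the functions by their major word (the part before the underscore).
  $groups = array();
  foreach ($funcs as $name => $ctxt) {
    list($major, $minor) = explode('_', $name, 2);
    $groups[$major][] = array($minor, $name, $ctxt);
  }
  $code = "function process(\$major,\$minor) {\nglobal \$globalcontext;\n  switch (\$major) {\n";
  foreach ($groups as $major => $entries) {
    $code .= "  case '$major':\n    switch (\$minor) {\n";
    foreach ($entries as $entry) {
      list($minor, $name, $ctxt) = $entry;
      // Context test evaluated at run time; 'true' when no context applies.
      $test = ($ctxt === null) ? 'true' : "\$globalcontext == '$ctxt'";
      $code .= "    case '$minor':\n";
      $code .= "      if ($test) { $name(); } else { echo 'OOC'; }\n";
      $code .= "    break;\n";
    }
    $code .= "    default:\n      echo \"Failed \$minor\";\n    break;\n    }\n  break;\n";
  }
  $code .= "  default:\n    echo \"Failed \$major\";\n  break;\n  }\n}\n";
  return $code;
}
```

Feeding it the names and contexts read from snipstor.xml yields a process() function in which every minor case calls its function or prints OOC.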


Conclusion

As you can see, xdotool is a handy library of calls to the windowing system. When you combine it with a speech recognizer, you can use voice to initiate xdotool commands under the control of a dialog manager. And finally, because the dialog manager can get quite complex and unwieldy when contexts are complicated, you can use code fragments containing your xdotool instructions stored conveniently in XML CDATA sections to generate the dialog manager consistently and effectively.

Resources

Get products and technologies

  • xdotool: Download and get more information on setup and use to simulate keyboard input and mouse activity, move and resize windows, and so on.
  • kiku: Find out more about the kiku speech recognition and dialog manager and how to use voice recognition to control your operating system.
  • Lynx source distribution directory: Download and learn more about the Lynx web browser with its User Guide and main help page.
  • IBM product evaluation versions: Download or explore the online trials in the IBM SOA Sandbox and get your hands on application development tools and middleware products from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
