
Toward Automatic Generation of Novice User Test Scripts

David J. Kasik, Harry G. George

Boeing Commercial Airplane Group
P.O. Box 3707, Mail Stop 6H-WT
Seattle WA 98124 USA
+1 206 234 0575


Abstract

Graphical user interfaces (GUIs) make applications easier to learn and use. At the same time, they make application design, construction, and especially test more difficult because user-directed dialogs increase the number of potential execution paths. This paper considers a subset of GUI-based application testing: how to exercise an application like a novice user. We discuss different solutions and a specific implementation that uses genetic algorithms to automatically generate user events in an unpredictable yet controlled manner to produce novice-like test scripts.


Keywords

Automated test generation, dialog model specification, genetic algorithms, software engineering test process.


The role of the user interface has become increasingly important to the success of computer applications. Graphical interfaces make complex applications visually attractive and accessible to a wide range of users.

Users can exercise an interactive application in many different ways. In reactive, GUI-based applications, even more options are available because multiple widgets and paths can be active concurrently. This causes problems as an application is tested for failures, especially when it is made available to a large user community. Unsophisticated and novice users often exercise applications in ways that the designer, the developer, and the tester did not anticipate.

Tools and techniques to improve testing for program failures (e.g., capture/playback tools) are becoming more widely available.

As the tools have been deployed, the prime beneficiaries have been experts. An expert user or tester usually follows a predictable path through an application to accomplish a familiar task. A developer knows where to probe to find the 'weak points' in an application.

As a result, applications contain state transitions that work well for predicted usage patterns but become unstable when given to novice users. Novices follow unexpected paths and do things 'no one would ever do.' Such program failures are hard to generate, reproduce, diagnose, and predict. Current methods (e.g., recruiting naive users, beta testing) are manual, costly, and usually occur after development is finished.

This paper presents a technique to move novice-like testing earlier in the overall system test process. We use genetic algorithms as a repeatable technique for generating user events that drive conventional automated test tools. Reprogramming the genetic algorithm reward system can mimic different forms of novice user behavior. Our prototype implementation works at run-time and is independent of application design and development tools.


Testing is the subject of an entire branch of computer science [1, 13]. The test process is shown in Figure 1.


Figure 1. Test Process Framework

This paper deals with one aspect of the overall test process: how to automate generation of test scripts that exhibit novice user characteristics as part of Design Test. Novice-like testing is often the target of beta programs and can involve many people (e.g., Microsoft's 400,000 beta copies of Windows 95). When this type of novice testing is done, it is designed to find program failures rather than to determine usability characteristics.

GUI-based applications present special problems in test design. Older interactive applications often embedded command hierarchies directly in the structure of the program. The user then followed the command tree while navigating from one function to another. The test designer could assume that certain logical conditions held because the user could not deviate from the program-controlled sequence.

Multiple dialog sequences are available concurrently in a GUI-based application. The application becomes reactive and paths more unpredictable. Therefore, a test designer must consciously assure that the program maintains logical relationships through the state changes needed to control:

The size and complexity of both the application program and the test set grow dramatically when concurrent dialog is involved. Industry statistics [12] indicate that many GUI-based applications devote half of the code to managing the user interface. As discussed in the next section, there are a number of different techniques for improving the user interface design/build process. Testing is still required even with improvements in the early stages of the system engineering process.

The general problem of test automation for GUI-based applications has received some attention. The principal effort has been in the area of automated record/playback tools [16]. Such tools accept input at two levels:

  1. Raw keystroke capture saves individual user keyboard and mouse events.
  2. Logical widget names provide higher level specifications that are screen position insensitive.

The tools generally offer a language that can be used to program test scripts. Some form of bitmap compare is included to help a tester compare results of a test run with a known-good result.

Test automation tools work well when given to an expert and poorly when given to a novice because:

Novice Testing

Testing can be conducted on a number of different levels to discover application failures and to measure and verify acceptable performance. The most common characterization of testing describes phases where:

Different people are assigned responsibilities for testing. Developers isolate specific application functions and interfaces during unit and functional test. These tests require extensive knowledge and analysis of code internals. System test is more task oriented. In addition to finding failures, system test determines if an application can be used to accomplish tasks. The testing is done in a black box manner and must be driven from the user interface. Experts perform most system testing, and novices are occasionally recruited to find application weak spots.


Figure 2. Different Paths through an Application

As shown in Figure 2, there are three paths that can be taken to perform a task during system test:

Novices do not work at random: they learn the application by performing a task. Random testing does have a role in exploring the outer limits of an application but is not our primary focus.

Novice testing is often ignored. At worst, some applications have been released in anticipation that users will find errors and accept fixes in the next version. Beta programs find some failures caused by novices, but beta users are often quite literate in an application domain. Some companies recruit large numbers of naive users to test true novice behavior prior to beta release.

All three approaches to novice testing are costly. Not only do they occur late in the cycle, but a novice's actions are also hard to replicate. Novices wander through convoluted paths that only rigorous keystroke recording can capture. Large numbers of novice keystroke files become difficult to manage and upgrade to the next version.

We chose to automate generation of novice test scripts to address these problems. Automation requires a script environment to record and playback sessions and a method to generate user events in a novice-like manner for those scripts. Our approach assumes the existence of an automated record/playback tool. The automation effort focused on user event generation.


In order to generate meaningful events for a script, we require a clear, accurate specification of both the user interface dialog [14] and the program state information. The state information controls the legality of specific dialog components and the names of the legal commands during a session. Without access to state, the generator could produce many meaningless input events.

This section contains a brief review and analysis of three techniques designers and developers use to specify user interface dialog and control its state.

GUI Toolkits and Source Code

The most commonly used form of dialog and state specification is the application program itself. The UI designer and implementer often use a GUI toolkit to specify graphical layout and input gathering mechanics.

A GUI toolkit does not manage state information. Therefore, both logical relationships and concurrency must be managed in programmer maintained code. Application code grows more complex as the number of relationships increases. In addition, the way in which a programmer manages logical relationships and concurrency is different from but interleaved with the algorithmic code that performs a requested function.

Because the code is an integral part of the specification and state information cannot be derived from static code analysis, any automated test generation scheme needs to be able to obtain current user interface and state information from the application code itself.

User Interface Management Systems

The second alternative is a user interface management system (UIMS). UIMSs have been formally pursued since the early 1980s [4, 9]. The UIMS ideal is to operate the same application across a wide variety of interface styles (e.g., GUI, command line, form, script file) with little or no change to the dialog or application.

This is possible because a UIMS contains a dialog specification language that contains state information. The language is translated into an informal model that is interpreted to control program execution. Because they contain state information, UIMS specifications are inherently more complete than GUI widget hierarchies and more suitable for use in generating test scripts.

However, few applications are specified with a UIMS. Applying any automated test generation technique to a manually developed application requires reverse engineering to derive the UIMS model. This is a daunting task: a working program often contains 'features' and loopholes that are difficult to capture in a more abstract UIMS model.

Process Modeling

The third specification alternative for dialog is formal process modeling. Formal models are commonly used in time and state sensitive areas (e.g., real time software, business process modeling, temporal protocol logic).

In contrast to the informal models in UIMSs, formal models can be validated prior to execution. Theorem proving techniques can be used to determine correctness for text-based formal specifications [5]. Graphical formal models like Petri nets can be analyzed mathematically [17] or via discrete event simulation [7].

Formal models have been used as direct input to the test process, especially for safety critical applications. Two examples are path models built from source code to generate test data values [6] and finite state models to reduce the number of required tests [2]. Little has been done to automate the generation of the scripts themselves.

Petri nets have been used as a type of process model applied to dialog [15, 19]. Palanque and Bastide have documented reasonably complex interfaces based on Petri nets. The nets proved to be an effective specification technique. However, they were manually verified and translated into an executable application. Manual translation of a formal model to an executing program can contain errors and deviations from the specification.


Both user interface management systems and formal process models contain the information necessary to generate test scripts automatically from the specification itself. But neither technique is used widely. As a result, we would have needed to build a reverse engineering tool to recreate either a UIMS or formal model specification from an existing program. This approach would make the automated test generator itself difficult to use and error prone. Formal models also suffer from a lack of tools to translate the abstract specification into an executable form.

Therefore, we chose to work with the application itself as the dialog specification. Moreover, we chose to use the executing application rather than the source code. Source code analysis techniques cannot deduce program and dialog state changes that occur during execution. The state changes allow each new step in the script to send syntactically correct input to an active part of the application.


Given that an executing program specifies the application user interface dialog, we needed to develop methods to:

  1. Simulate user inputs. Genetic algorithms proved to be a good method to control the generation of random numbers used to simulate input events.
  2. Capture the current state of the user interface during application execution. Our implementation approach requires no added processing logic in the application.
  3. Tie the genetic algorithm generated input values to the user interface during execution in an application independent manner.
  4. Allow the tester to generate and save novice-like scripts.

The prototype uses standard tools and techniques wherever possible. The prototype is application independent and needs no special application dialog design or code structure.

Simulated User Inputs

Commercial test tools provide a mechanism to drive a GUI-based application from captured keystrokes. Automation requires a method to generate keystrokes.

Analyzing user behavior led to the conclusion that emulating novice user behavior requires a way to 'remember' success. Both novices and experts use an application to perform tasks. Novices explore to learn the semantics of individual functions and how to combine sequences of functions into meaningful work. Experts have already discovered successful paths through an application and rely on past experience to accomplish new tasks.

Random number generators alone are inadequate because they do not rely on past history to govern future choices. Genetic algorithms do rely on past history. Success has been reported in applying genetic algorithms to hardware test sequence generation [18]. Therefore, we chose to use genetic algorithms to simulate novice user events.

Genetic Algorithms

Genetic algorithms [11] can be programmed to simulate a pseudo-natural selection process. In its simplest form, a genetic algorithm manipulates a table (or pool) of random numbers. Each row in the table represents a different gene. The individual components of a gene are called alleles and contain a numeric genetic 'code.' The interpretation of the allele values varies according to application.

Allele values start as random numbers that define an initial 'genetic code'. The genetic algorithm lets genes that contain 'better' alleles survive to compete against new genes in subsequent generations.

During a run through multiple generations to determine the best genes, the shape of the gene pool stays fixed: the pool always contains the same number of genes, and every gene has the same number of alleles. The number of genes in the pool and the number of alleles in a gene can vary from run to run.

A basic genetic algorithm:

Gene crossover styles, mutation rates, and death rates can be programmed and varied. These techniques are useful in determining how many genes survive into a new generation, how existing allele values are swapped among genes, and how new random allele values are inserted into surviving or new genes. Each generation is guaranteed to produce a set of genes that survive. Allowing a genetic algorithm user to vary survival techniques lets the user tune the algorithm to a particular problem.
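The survival, crossover, and mutation controls described above can be illustrated with a minimal sketch. It is not the prototype's GenePool (which was written in Modula-3); the function name `evolve`, the parameter names, the single-point crossover style, and the default rates are all assumptions chosen for illustration.

```python
import random

def evolve(score, n_genes=8, n_alleles=6, generations=20,
           survival_rate=0.5, mutation_rate=0.05, max_allele=255, seed=0):
    """Return the best gene found; `score` maps a gene (list of ints) to a number.

    All names and defaults here are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Gene pool: a fixed-size table, one row per gene, one column per allele.
    pool = [[rng.randint(0, max_allele) for _ in range(n_alleles)]
            for _ in range(n_genes)]
    n_survivors = max(2, int(n_genes * survival_rate))
    for _ in range(generations):
        # Rank the genes; the best ones survive into the next generation.
        pool.sort(key=score, reverse=True)
        survivors = pool[:n_survivors]
        children = []
        while len(survivors) + len(children) < n_genes:
            mother, father = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_alleles)  # single-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: occasionally replace an allele with a fresh random value.
            child = [rng.randint(0, max_allele)
                     if rng.random() < mutation_rate else allele
                     for allele in child]
            children.append(child)
        pool = survivors + children  # pool size stays constant
    return max(pool, key=score)

if __name__ == "__main__":
    # Toy scoring function: prefer genes with large allele values.
    print(evolve(sum))
```

Because the top-ranked genes are carried over unchanged, the best score in the pool never decreases from one generation to the next. The scoring function is the only problem-specific part of the sketch.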

Genetic algorithms are not an effective way to explore all possible paths in a dialog sequence. Instead, a genetic algorithm uses previous sets of random numbers as the basis for new ones. This means that the method used to compute the 'best results' becomes the key factor in any application that uses genetic algorithms. As applied to user interface event generation, 'best results' meant designing an algorithm that represents how novice users learn to use an application.

Run-time User Interface Capture

At run-time, different GUIs require different implementations to capture the current state of the user interface as the Application Under Test executes. Our prototype, whose architecture is shown in Figure 3, works with applications built with Motif 1.2 and X11R5. All software components are written as objects in Modula-3 [3].


Figure 3. Run-time Architecture

The test script generator (XTest) can be applied to different Applications Under Test (AUT). During execution,

XProbe, TestDriver, and TestPort use standard Motif and X communications with a protocol based on the editres protocol. This approach proved superior to other UNIX techniques for sharing information across process spaces like shared memory, RPC servers, pipes, and shared files.

XTest controls the AUT through the standard XSendEvent mechanism. While this technique works, XSendEvent operates strictly at the keystroke level. Therefore, XTracer generates test scripts at the keystroke level. A reasonable extension is to implement a reverse protocol to pass widget-level information to the application under test and to generate a more readable test script based on widget names.

The editres protocol asks the AUT to put some of its process-specific information into a form for use by TestDriver. The prototype requires only one slight modification to the AUT source code. Two procedure calls must be inserted to establish the TestPort callback and to provide the ID of its top level window. These calls are executed only once during AUT initialization, and all other processing happens transparently. A preferable approach would require no source code change.

Integrating Genetic Algorithms and UI Capture

The prototype interprets the alleles in each gene to dynamically generate events that are legal for the AUT. At run-time, TestDriver observes and controls the AUT as shown in Figure 4.


Figure 4. Observe and Control Loop

Prior to execution, the tester defines the number of alleles in each gene and the total number of genes in the pool. The following steps are executed for each gene in the gene pool in XProbe:

Each gene in the pool restarts the application to insure an identical initial state. From then on, each gene can start at a different spot because allele values are used to randomly select an active widget. Finally, even though the number of alleles is the same for all genes, the number of input events varies from gene to gene because a different number of input events is needed for each dialog state.
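One way to picture how allele values select an active widget, and why the number of events differs from gene to gene, is the sketch below. The dialog snapshot `TOY_UI`, the widget kinds, and the modulo mapping are assumptions for illustration. In the prototype the set of active widgets would be re-read from the running AUT after every event, whereas this toy snapshot is static.

```python
# Hypothetical snapshot of the active dialog:
# window name -> list of (widget name, widget kind, list options or None).
TOY_UI = {
    "Root": [("open", "button", None), ("dataTable", "button", None)],
    "Open Dialog": [("sb_text", "text", None),
                    ("files", "list", ["a.dat", "b.dat"]),
                    ("OK", "button", None)],
}

def active_widgets(ui):
    """Flatten the snapshot into a list of currently legal input targets."""
    return [(window, name, kind, options)
            for window, widgets in ui.items()
            for name, kind, options in widgets]

def gene_to_events(gene, ui):
    """Interpret a gene as a list of (window, widget, value) events."""
    events = []
    alleles = iter(gene)
    for allele in alleles:
        candidates = active_widgets(ui)
        # The allele value picks one legal widget, whatever the pool size is.
        window, name, kind, options = candidates[allele % len(candidates)]
        if kind == "button":
            events.append((window, name, "press"))
            continue
        # Lists and text fields consume one more allele for their value.
        extra = next(alleles, None)
        if extra is None:
            break  # gene exhausted before the event was complete
        if kind == "list":
            events.append((window, name, options[extra % len(options)]))
        else:  # text field: type a single lowercase letter
            events.append((window, name, chr(ord("a") + extra % 26)))
    return events

if __name__ == "__main__":
    print(gene_to_events([0, 3, 1, 4], TOY_UI))
```

Here a four-allele gene yields three events, because the list selection uses two alleles while each button press uses one.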

After all the genes in the pool are tried, the genetic algorithm in GenePool (refer back to Figure 3) lets the winning genes survive, generates new genes, and mutates the pool before proceeding to the next generation. The script that corresponds to the top scoring gene in the last generation is output via XTracer.

Reward System

A realistic reward system lets the genes that generated the 'best' novice-like behavior survive by assigning a weighted score to user events. For example, one score can be given to a set of alleles that picks a list widget, a second to a set that types in specific characters, and a third to a set that provides input to a widget on the same window as the previous widget. Adjusting the weights allows the reward system to represent different types of novice user behavior.

We based our prototype reward system on the observation that a novice user often learns how to use an application via controlled exploration. A novice starts one function in a dialog sequence and experiments with a number of different parameters. In this way, the novice uses localized parameter settings to understand the overall effect of a single function. This is only one of the possible characterizations of novice user behavior.

To implement this reward system, we set the weight for all user events to zero except one. A gene receives a positive score each time its allele value(s) generate input for a widget (e.g., entering data into a text field, choosing an item from a list, selecting a radio button) that has the same window name as the last active window name. No additional score is generated to differentiate among the possible types of widgets on the window. The net result is that the longer a gene stays on the same window, the higher its score and better its odds of survival.
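A minimal sketch of this scoring rule follows, assuming each generated event is recorded as a hypothetical `(window, widget, value)` tuple. The function name and the `weight` parameter are illustrative and not taken from the prototype.

```python
def same_window_score(events, weight=1):
    """Score a gene's events: reward each event that stays on the same window.

    `events` is a list of (window, widget, value) tuples (an assumed format).
    The widget type is deliberately ignored, as in the rule described above.
    """
    score = 0
    last_window = None
    for window, _widget, _value in events:
        if window == last_window:
            score += weight
        last_window = window
    return score

if __name__ == "__main__":
    events = [("Root", "open", "press"),
              ("Open Dialog", "files", "b.dat"),
              ("Open Dialog", "OK", "press")]
    print(same_window_score(events))
```

Other characterizations of novice behavior could be expressed by giving non-zero weights to other event properties.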

Tester's Interface

To simulate novice behavior, a tester:

Because the prototype works outside an automated test tool, we defined a simplified script language to specify the expert test. A postprocessor generates scripts for commercial test tools. The resulting scripts can then be included as part of a complete test suite. The tester can use realistic, previously created scripts that contain novice behavior to initiate new simulations.

This interface strategy lets the tester control when deviations occur because a DEVIATE command can be inserted at arbitrary script locations. The script can then continue in either of the two modes shown in Figure 5. Pullback mode rewards genes for returning to the original script, while meander mode allows the activity to wander indefinitely. Even though pullback mode returns to the expert script, it will generally not generate the same results because additional functions are exercised.


Figure 5. Deviation Modes
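The difference between the two modes can be thought of as a difference in scoring. The sketch below is one possible reading, not the prototype's scoring code: it assumes events are `(window, widget, value)` tuples, that meander mode only rewards staying on the same window, and that pullback mode adds a hypothetical `return_bonus` when the deviation ends on the window the expert script needs next.

```python
def deviation_score(events, mode, next_expert_window=None,
                    stay_weight=1, return_bonus=5):
    """Score one gene's deviation under 'meander' or 'pullback' mode.

    All names and weights are illustrative assumptions.
    """
    if mode not in ("meander", "pullback"):
        raise ValueError("mode must be 'meander' or 'pullback'")
    # Both modes: reward consecutive events that stay on the same window.
    score = sum(stay_weight
                for previous, current in zip(events, events[1:])
                if previous[0] == current[0])
    # Pullback only: reward ending where the expert script can resume.
    if mode == "pullback" and events and events[-1][0] == next_expert_window:
        score += return_bonus
    return score

if __name__ == "__main__":
    wander = [("Open Dialog", "files", "a.dat"),
              ("Open Dialog", "OK", "press"),
              ("Root", "dataTable", "press")]
    print(deviation_score(wander, "meander"))
    print(deviation_score(wander, "pullback", next_expert_window="Root"))
```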

The following script demonstrates pullback mode. It opens and loads a data file and deviates to see that an application can still enter data for an x,y graphing program after novice-like activity:

(# expert script, in the form of a list of window, widget pairs with pullback #)
("SGE: On Version Dialog" "Cancel")
("Style Guide Example - <New File>" "file")
("Root" "open")
("SGE: Open Dialog" "sb_text" "solardat")
("SGE: Open Dialog" "OK")
("Root" "dataTable")
("SGE: Table Dialog" "text" 4 "30")
("SGE: Table Dialog" "text" 5 "40")
("SGE: Table Dialog" "OK")
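For readers who want to experiment with scripts in this form, the sketch below parses the lines shown above into window, widget, and argument fields. It assumes that each command occupies one parenthesized line, that fields are double-quoted strings or bare numbers, and that lines beginning with `(#` are comments. The full grammar of the script language is not given here, so anything beyond the sample is a guess.

```python
import shlex

def parse_expert_script(text):
    """Parse script lines into (window, widget, arguments) tuples.

    Handles only the line shapes visible in the sample script; the function
    name and the returned structure are illustrative assumptions.
    """
    commands = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("(") or line.startswith("(#"):
            continue  # skip blank lines, comments, and unrecognized lines
        # Drop the surrounding parentheses, then split on quoted fields.
        fields = shlex.split(line.strip("()"))
        window, widget, *arguments = fields
        commands.append((window, widget, arguments))
    return commands

if __name__ == "__main__":
    sample = '''(# expert script #)
("SGE: Open Dialog" "sb_text" "solardat")
("SGE: Table Dialog" "text" 4 "30")
("SGE: Table Dialog" "OK")'''
    for command in parse_expert_script(sample):
        print(command)
```

Arguments are kept as strings, so a caller that needs the numeric field index would convert it explicitly.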

The implementation of meander mode is simple: execute the expert script and turn control over to the genetic algorithm when the DEVIATE command is encountered. The reward system then identifies genes that stay on the same window.
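The control flow just described can be sketched as a small driver loop. The callbacks `send_event` and `deviate` are hypothetical stand-ins for sending an expert command to the AUT and for running the genetic algorithm. Representing the DEVIATE command as the string "DEVIATE" is also an assumption, since the script syntax for it is not shown.

```python
def run_script(script, send_event, deviate):
    """Replay an expert script, handing control over at each DEVIATE.

    `script` is a list whose items are either expert commands or the
    string "DEVIATE" (an assumed representation).
    """
    for position, command in enumerate(script):
        if command == "DEVIATE":
            # Peek at the next expert command; a meander-style generator can
            # ignore it, while a pullback-style one could use it as a target.
            has_next = position + 1 < len(script)
            next_command = script[position + 1] if has_next else None
            deviate(next_command)
        else:
            send_event(command)

if __name__ == "__main__":
    log = []
    script = [("Root", "open"), "DEVIATE", ("Root", "dataTable")]
    run_script(script, log.append, lambda nxt: log.append(("deviate", nxt)))
    print(log)
```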

Pullback mode relies on the ability to look ahead to the next command in the expert script when processing a DEVIATE command. To implement pullback mode,


This section analyzes specific aspects of the prototype:

Implementation Architecture

The multi-process implementation architecture proved to be flexible, adaptable, and application independent. The widget tree was correctly captured during execution and keystrokes successfully sent to different AUTs without changing TestDriver. The applications we instrumented were small but did support concurrent dialog sequences. Isolating GenePool let us change the reward system without affecting the basic observe and control loop.

Reward System

We quickly discovered that implementation of the reward system was the most sensitive part of the genetic algorithm. The reward system governs which genes survive to generate a novice-like test script that a tester can keep.

We first tried our reward system to see if the genes could learn to simulate novice-like behavior in a standalone manner. The test script was a single DEVIATE command. We varied the genetic algorithm parameters to let the algorithm itself generate novice-like events. At best, the resulting scripts seemed more chimpanzee-like than novice-like. Getting a script to accomplish anything meaningful was unlikely. This occurred because we could not provide any application semantic knowledge based on GUI widgets alone. A higher level specification (as found in a UIMS or process modeler) is needed to insure that a particular run can even open a file.

We then used meander mode and inserted a DEVIATE command at the end of an existing expert script. In this way, we were able to open files and do other activities before turning control over to the genetic algorithm. The results were better but could be attributed more to starting with something already done than to the genetic algorithm.

Pullback mode produced the best results. This occurred because pullback mode forces the script back to a state in which the script performs some meaningful activity. We were able to insert more than one DEVIATE command in a script to insure that the application could continue to operate through multiple encounters with unpredictable user events.

We have been able to evaluate the 'novice-ness' of the resulting scripts only on an informal basis. We asked other group members to observe a set of automatically generated scripts and collected their feedback. The scripts used with pullback mode were judged to be the best representation of their understanding of novice user behavior.

Impact on Test Process

Our experiments demonstrate the ability to use a small number of inputs to generate a large number of test scripts in an hour. The tester decides how many scripts should be saved as 'winners' that represent novice behavior.

Using automated script generation decreases the total number of scripts that need to be saved and modified as application versions change. Only the parameters that govern genetic algorithm execution are saved. Therefore, new novice scripts can be regenerated for a new user interface and application. The regeneration process does not guarantee identical scripts because the application dialog state changes when the user interface changes.


Our implementation resulted in a framework that can be used to simulate novice user behavior and solve some real problems. It provides a basis that demonstrates the potential of automated script generation and can be used for additional projects to improve:

Test script configuration management. Our approach can be used to generate a large number of novice test scripts quickly. Measures must be developed to determine when enough novice testing has occurred.

Test results evaluation and comparison. The results of each novice test script must be evaluated to insure that the system has worked properly. At a minimum, the novice tests can be used to insure that the application does not break. The applications we tested with the prototype did not fail, which increased our confidence in them. We used manual observation to determine that the applications continued to work properly.

More work is required to develop effective ways for determining that a script produces the proper results. A simple comparison of the results produced by a companion script is inadequate even in pullback mode. The deviation may have deleted or edited data in a way that makes direct results comparison impossible.

Emulation and evaluation of more types of novice user behavior. Additional genetic scoring algorithms and reward systems will expand the repertoire of characterizations of novice user behavior beyond our learning-by-experimentation style.

Care must be taken to formally evaluate the results of any automated techniques to insure their value and validity. The context for such comparisons is well documented (for a good example, see [8]). Usability testing facilities offer a solid technology foundation for conducting real versus automated novice evaluations.

Integration with automated test tools. Given the current implementation of the novice test script generator, the use of genetic algorithms could be incorporated as a new command in any existing widget-based test tool.

Higher level user interface specifications. In the long term, higher level dialog models than a GUI widget tree should be used to generate both application user interface code and test scripts. Automatic UI code generation will decrease user interface state management errors (although application code errors will still occur). Such specifications should be analyzable for usability for experts or novices. Testing will still be needed across multiple user skill levels to determine if program failures occur whether the application is driven by real or simulated events.


Automated test script generation that simulates novice users can help identify application failures earlier than beta test or production. Such failures are costly to fix and frustrating to users, especially novices.

Our technique works best as a companion to automated test tools and expert test scripts. Expert users still must generate complex scripts that exercise an application thoroughly. Genetic algorithms provide a controllable method of emulating novice input events to test an application in an unexpected, but not purely random, way. Including automated novice testing early in the development process should improve overall application quality.


Acknowledgments

Rob Jasper and Dan Murphy of the Boeing Commercial Airplane Group provided insight into the issues involved with testing and genetic algorithms. Keith Butler of Boeing Information and Support Services helped mold early drafts. The SIGCHI referees provided excellent comments as part of their reviews.


References

  1. Beizer, B. Software System Testing and Quality Assurance, Van Nostrand, 1984.
  2. Bernhard, P.J. A Reduced Test Suite for Protocol Conformance Testing. ACM Transactions on Software Engineering and Methodology 3, 3 (Jul 1994), 201-220.
  3. Harbison, S.P. Modula-3, Prentice-Hall, 1992.
  4. Hartson, H.R. and Hix, D. Human Computer Interface Development: Concepts and Systems. ACM Computing Surveys 21, 1 (Mar. 1989), 5-92.
  5. Hoare, C.A.R. Communicating Sequential Processes, Prentice-Hall International, 1985.
  6. Jasper, R., Brennan, M., Williamson, K., Currier, W., and Zimmerman, D. Test Data Generation and Feasible Path Analysis. International Symposium on Software Testing and Analysis (Seattle, Aug. 1994).
  7. Jensen, K. Coloured Petri Nets: A High Level Language for System Design and Analysis. Advances in Petri Nets 1990 (G. Rozenberg editor), Lecture Notes in Computer Science 483, Springer-Verlag, pp. 342-416.
  8. Karat, C-M., Campbell, R., and Fiegel, T. Comparison of Empirical Testing and Walkthrough Methods in User Interface Evaluation. Proc. CHI'92 Human Factors in Computing Systems (May 1992), ACM Press, 397-404.
  9. Kasik, D.J. A User Interface Management System. Computer Graphics (Proc. SIGGRAPH 82), July 1982, ACM Press, pp. 99-106.
  10. Kieras, D.E. Towards a Practical GOMS Model Methodology for User Interface Design. Handbook of Human-Computer Interaction (M. Helander editor), Elsevier Science, 1988, pp. 135-157.
  11. Michalewicz, Z. Genetic Algorithms + Data Structures = Evolution Programs, Springer-Verlag, 1992.
  12. Myers, B.A. and Rosson, M.B. Survey on User Interface Programming. Proc. CHI'92 Human Factors in Computing Systems (May 1992), ACM Press, 195-202.
  13. Myers, G.J. The Art of Software Testing, Van Nostrand, 1978.
  14. Olsen, D.R. User Interface Management Systems: Models and Algorithms, Morgan Kaufmann, 1992.
  15. Palanque, P., Bastide, R., Dourte, L., and Sibertin-Blanc, C. Design of User-Driven Interfaces Using Petri Nets and Objects. Proc. Advanced Information Systems Engineering: 5th International Conference, CAiSE 93, Springer-Verlag, pp. 569-585.
  16. Parker, T. Automated Software Testing. Unix Review, January 1995, pp. 49-56.
  17. Peterson, J.L. Petri Net Theory and the Modeling of Systems, Prentice-Hall, 1981.
  18. Rudnick, E., Patel, J., Greenstein, G., and Niewmann, T. Sequential Circuit Generation in a Genetic Algorithm Framework. Proc. 31st Design Automation Conference (Jun 1994), 698-704.
  19. vanBiljon, W.R. Extending Petri Nets for Specifying Man-machine Dialogues. International Journal of Man-Machine Studies 28, 1988, 437-455.