Testing Overhaul

This page describes the status of VisIt's testing infrastructure prior to June 2009, as well as the plans and current work to overhaul that infrastructure and address a variety of shortcomings.

Current VisIt test infrastructure, its limitations and shortcomings

VisIt's testing infrastructure involves a combination of Python and shell (sh) scripts.

Currently, only some of VisIt's components are exercised during testing: the viewer, engine, mdserver and cli components. The gui component is not tested due to a variety of issues that have made that impractical thus far. Indeed, there is nothing in our current testing overhaul plan to address this shortcoming. Perhaps there should be.

The build_visit script is tested with a completely different process, which is not described here. This is partly because the primary purpose in testing build_visit is to ensure it successfully builds all of VisIt's 3rd party libraries (as well as VisIt itself), and we don't necessarily want to be doing that each night. The code supporting build_visit also does NOT change as frequently as the VisIt source code itself.

Test categories

Tests are written in Python. Each Python test script generates any number of image and/or text results. The expected or baseline results from all tests are stored in the baseline directory under the test directory. Python test scripts are organized into categories.

  • Databases -- which test database plugins.
  • Operators -- which test operator plugins.
  • Plots -- which test plot plugins.
  • Queries -- which test queries.
  • Rendering -- which test aspects of rendering.
  • Session -- which test management of internal state across restarts of the viewer.
  • Fault Tolerance -- which test VisIt's response to some error conditions.

These categories exist primarily to indicate the intent of a given test. Nonetheless, tests in any category often involve the use of functionality from other categories. For example, it is not possible to fully test a database plugin without actually reading data from the database and plotting and/or querying it. This means there are probably cases of unnecessary redundancy in the functionality that is tested by different Python scripts. As the test suite continues to grow, we may find it necessary to audit the test suite to minimize this redundancy so as to improve performance. Presently, we are not even measuring how much coverage the test suite is providing in terms of VisIt source code. There are a variety of reasons why it would be good to do that.

Within each category there are a number of Python scripts that focus on testing different aspects of VisIt. For example, in the rendering category, there are tests for legends, annotations and volume rendering, among other things. As of this writing, there are about 200 Python scripts, totaling about 35,000 lines of code, which produce roughly 2800 png (image) or txt (text) files as results.

As an aside, these 200 Python scripts represent a great library of examples of using VisIt's cli. However, they all refer to functions defined in a helper Python script, Testing.py, at the top level of the test directory. That means the test scripts cannot execute correctly when fed directly to VisIt unless the functionality in Testing.py is also included. This is not often a problem, but it does mean we cannot simply pass testing code around to users and expect them to run it without modification.
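
As an illustration only (the exact helper names here are recalled from memory and may differ), a typical test script mixes ordinary cli calls with Testing.py helpers roughly like this:

   OpenDatabase("example.silo")       # hypothetical database file
   AddPlot("Pseudocolor", "density")  # hypothetical plot and variable
   DrawPlots()
   Test("example_00")                 # Testing.py helper: save an image and diff it against the baseline
   DeleteAllPlots()
   Exit()                             # Testing.py helper: report results and exit

Everything except the Test() and Exit() calls would run in a plain cli session, which is why the scripts are useful as examples but not directly reusable by users.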

The Testing.py Python support script

The Testing.py helper script contains a number of functions used throughout the testing process. These include functions to compute differences between expected and actual outputs for both images and text, to generate HTML pages for convenient browsing of results upon completion (though not all HTML generation functionality resides here), to compute the amount of memory VisIt is using, to detect possible memory leaks, and to manage various aspects of VisIt's global state, such as annotation attributes or engine launching. There is also code to compute and manage difference metrics and thresholds, which we describe in more detail later.
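
As a rough sketch only, and not the actual Testing.py implementation (which has its own metrics and thresholds), the image comparison amounts to something like the following, written here with Pillow for brevity:

   from PIL import Image, ImageChops

   def image_diff_fraction(baseline_path, current_path):
       """Return the fraction of pixels that differ between two same-sized images."""
       baseline = Image.open(baseline_path).convert("RGB")
       current = Image.open(current_path).convert("RGB")
       diff = ImageChops.difference(baseline, current)
       differing = sum(1 for px in diff.getdata() if px != (0, 0, 0))
       return differing / float(baseline.size[0] * baseline.size[1])

   def images_match(baseline_path, current_path, threshold=0.01):
       """True if the images differ by no more than the (illustrative) threshold."""
       return image_diff_fraction(baseline_path, current_path) <= threshold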

The runtest shell script

The entire testing process is orchestrated by yet another shell script, runtest, also at the top level of the test directory. The runtest script performs a number of functions, including:

  • removing test results from the prior run,
  • confirming the (re-)generation of data used as input to tests before attempting to run them,
  • generating the HTML needed to tie results from all tests together,
  • launching VisIt in various ways for each test,
  • enforcing OS-based time limits on tests (so that, for example, parallel deadlocks do not prevent the test suite from completing),
  • diagnosing the return code from each test, and
  • checking for core files.

Because VisIt is fairly tolerant of faults, it is possible for tests to complete and produce expected results but to have done so in an unexpected and undesirable way. For example, the engine may have cored and restarted. Therefore, runtest includes logic to detect this kind of situation, as well as other odd-ball events, and inform developers that the test did not behave as expected.
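
For illustration, the kind of bookkeeping runtest does for each test could be sketched in Python roughly as follows (the real runtest is shell code, and the names below are hypothetical):

   import glob
   import subprocess

   def run_one_test(visit_cmd, timeout_seconds=600):
       """Run one test command under a time limit and classify the outcome."""
       try:
           returncode = subprocess.run(visit_cmd, timeout=timeout_seconds).returncode
       except subprocess.TimeoutExpired:
           return "timed out"              # e.g. a parallel deadlock
       if glob.glob("core*"):              # the engine may have cored and restarted
           return "completed but left a core file"
       return "passed" if returncode == 0 else "failed with code %d" % returncode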

Nightly Testing

Testing is initiated on a nightly basis by cron jobs running under a given user's name (Mark Miller presently) on systems at LLNL (naboo.llnl.gov) and NERSC (davinci.nersc.gov). In addition, testing can be invoked manually on any checked-out trunk that includes the test and data directories directly beneath it.

Typically, a manual test run is invoked like so:

   runtest <test py files>

Test Modes

Each night, all tests are executed under three different modes: serial, parallel, and scalable-parallel. The serial mode runs a non-parallel engine. The parallel mode runs a two-processor parallel engine. In both the serial and parallel modes, rendering is done in software (Mesa) in the viewer component of VisIt. The scalable-parallel mode runs a two-processor parallel engine but also does all rendering in the engine component of VisIt in scalable rendering or SR mode. Compiling VisIt sources on 3 processors takes about 1 to 1.5 hours. Then, each test mode takes about 1.5 to 2 hours. So, the whole test sequence takes about 8 hours.
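
In terms of the cli, the three modes correspond roughly to the following (a hedged sketch; runtest sets all of this up itself, and the exact function and attribute names should be checked against the cli documentation):

   # serial: no explicit engine launch; rendering happens in the viewer (Mesa)
   # parallel: a two-processor engine; rendering still happens in the viewer
   OpenComputeEngine("localhost", ("-np", "2"))
   # scalable-parallel: the same two-processor engine, but with scalable
   # rendering forced on so that all rendering happens in the engine
   ra = GetRenderingAttributes()
   ra.scalableActivationMode = ra.Always
   SetRenderingAttributes(ra)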

In the past, a valgrind mode was routinely run, as well as a dynamic load balance mode. It would be good to re-institute those. In addition, it would be good to routinely run some other modes, such as client-server (where VisIt's engine and viewer run on different and disparate platforms), hardware-render (where all rendering is done on the local graphics card), and an optimized mode (where VisIt is compiled with optimization). Presently, all tests are run with a version of VisIt compiled with debugging (e.g. with the -g flag passed to the compiler). This is bad because the version of VisIt deployed to users is compiled with optimizations. But we aren't testing it that way!

Other Shortcomings

We currently cannot test on Windows systems because one of the key pieces of code that manages testing, runtest, is shell code and will not run on Windows. On SMPs like davinci, it would be nice to run multiple serial (and maybe even parallel) tests simultaneously; we do not do that now, though it would not be hard to modify runtest to support it. Beyond that, it would be nice to:

  • install VisIt from a test build if we decide we'd like to,
  • make it easy for developers to examine results and simply click a button or two to re-baseline them if the new results are acceptable,
  • make it easy for any developer to run tests of his or her own choosing on a regular basis and have those tests reported in a common place with all other testing results,
  • associate with each test the files and/or lines of code that it exercises and that were changed, so that developers get a good initial clue as to which change(s) in code caused which changes in test results, and
  • easily search over test history in a general way so that we can find related cases of interest.

New Testing System Design