Selections

VisIt 2.1 provides a capability called named selections that lets you take the set of cells associated with the output of a plot and save that list of cells so it can be used to restrict the cells operated on by other plots. As the plot that originates the selection changes, the plots that use it can change too. Selections can also be saved to a file and loaded later for repeated use in analysis.

This page describes the additional features that are wanted for selections.

Cumulative queries

The original conception of the Selection window had 3 tabs that let the user specify various selection properties. The original scope of Brad's work was to add selections to the GUI and viewer, exposing what was implemented. The 3 tabs were omitted from the VisIt 2.1 release because the additional selection functionality for cumulative queries was never implemented.

A cumulative query is a selection with additional criteria that let the user further restrict/specify the selection's list of cells as the conditions change over time.

The goal of the cumulative option is to allow the user to form a query that is applied to all or a subset of the time slices. The idea was that the user would form a query as they would for a typical named selection and then apply that query over the desired time steps. This step would be done in the "Range" tab.

An important part is that the results from each time slice would be retrieved and stored so that one could display histogram information, which could then be refined. For example, the user has state information: trapped or untrapped. The user wants to see the particles that are trapped most frequently. To accomplish this task, each time step must be queried, keeping a running count of how often the event occurred for each particle. However, the user may instead want to see the time steps that have the most trapped particles. Again, a running total must be kept. Both questions can be answered from the same query if the information from each time slice is stored appropriately. Both options appear in the "Histogram" tab under "Display Axis Type:" as "Matches" and "Time Slice" respectively. Once the data is collected from the query the user would then decide

Within the "Histogram" tab the user would be able to do a sub-selection. For instance, there may be 1000 particles that are trapped at least once during the simulation, and the user wants to see the top 10. This step would be done via the sub-selection in the Histogram tab.
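The running counts described above can be sketched as follows. This is only an illustration, not VisIt code: the input dictionary `matches_by_slice` and the helper names `accumulate` and `top_n` are hypothetical, and the per-slice match sets are assumed to have been produced by the query already.

```python
from collections import Counter

# Hypothetical input: for each time slice, the set of particle ids that
# matched the query (e.g. particles whose state is "trapped").
matches_by_slice = {
    0: {1, 2, 3},
    1: {2, 3},
    2: {3, 7},
    3: set(),
}

def accumulate(matches_by_slice):
    """Return (per-particle match counts, per-slice match counts)."""
    per_particle = Counter()
    per_slice = {}
    for t, ids in sorted(matches_by_slice.items()):
        per_particle.update(ids)   # feeds the "Matches" display axis
        per_slice[t] = len(ids)    # feeds the "Time Slice" display axis
    return per_particle, per_slice

def top_n(per_particle, n):
    """Sub-selection: the n most frequently matching particle ids."""
    ranked = sorted(per_particle.items(), key=lambda kv: (-kv[1], kv[0]))
    return [pid for pid, _ in ranked[:n]]

per_particle, per_slice = accumulate(matches_by_slice)
top_two = top_n(per_particle, 2)
```

Storing both counters from one pass over the time slices is what lets the same query answer either question without being run again.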

Proposed tabs in the Selection window:

Range

The initial idea was to collect the variables and their ranges from a parallel coordinates plot and use them on each time slice. To collect the variables, one would use the "Get variable range data" button. However, this communication may not be possible. Instead, Brad allows the user to manually add each variable and its range. This more manual approach is fine, as the user needs to be able to add a variable that may not be in the parallel coordinates plot. In addition, the user can obtain the necessary range information from the parallel coordinates UI. Further, the histogram provided gives the user information not available in the parallel coordinates plot.

March 25, Brad: So, to make this easier I can try and initialize a new selection with the variables used in the plot that originated the selection. That might let me use the ParallelCoordinates attributes in the viewer to initialize the selection's variables with selected min/max ranges. The "Get variable range data" button would do essentially the same operation to an existing selection, updating the ranges and picking up new variables.

In addition, the user needs to be able to add a time-dependent expression (the "Add time dependent expression" control), which is similar to a CMFE and is used for looking at state changes. While this could be written as an ordinary expression, it would need to be indexed. As such, having it here may allow us to take advantage of the indexing to speed up the evaluation. For example:

if(ne( trapped, conn_cmfe(<[-1]id:trapped>, particles) ), 1, 0)

could be evaluated by returning all of the particles that are trapped at the current time step and all of the particles that are not trapped in the previous time step, then adding the results together.
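One way to picture an index-based evaluation of such a state-change expression is with set operations on id lists. The sketch below is an assumption for illustration only: `state_changes` is a hypothetical helper, its inputs stand in for whatever an index lookup would return, and it reports changes in both directions, which is what `ne(...)` tests.

```python
def state_changes(prev_trapped, curr_trapped):
    """Ids whose trapped state differs between two consecutive time steps.

    Both arguments are sets of particle ids that an index lookup reports
    as trapped (hypothetical inputs; no VisIt API is used here).
    """
    newly_trapped = curr_trapped - prev_trapped    # untrapped -> trapped
    newly_released = prev_trapped - curr_trapped   # trapped -> untrapped
    return newly_trapped | newly_released

prev_ids = {1, 2, 5}
curr_ids = {2, 5, 9}
changed = state_changes(prev_ids, curr_ids)
```

Working on id sets rather than on full per-particle arrays is the kind of shortcut that an index could make possible.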


The "Query over time" option when enabled would evaluate over the desired "Time Slice Start and Stop" values.

The "Get ID data" button is similar to Update Selection, except that when doing a query over time it would run the query and also populate the Histogram tab. When doing a single time slice, the Histogram tab would not be active.

Brad's notes: The range tab lets you pick variables and give them ranges. This seems like a multi-variable threshold. I think the "useful" part is that it can be evaluated for a range of time steps. The sets of cells from each time step can then be combined with AND/OR to produce a comprehensive set of cells for the selection. This lets the user say "give me all cells that fit the ranges in ALL time steps" or "give me all cells that fit the ranges in ANY time step".

By the looks of the window, the evaluation over time is optional and applies the same to each variable. Under the covers, this might amount to running the data range selection on each time state in the range and then using some combining operation to get the final set of cells for the selection.
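The combining operation mentioned above might look like the following minimal sketch. The function name `combine_over_time` and the rule strings "all" and "any" are hypothetical; the per-time-step cell sets are assumed to come from running the range selection on each time state.

```python
from functools import reduce

def combine_over_time(cells_per_step, rule):
    """Combine per-time-step cell sets with "all" (AND) or "any" (OR)."""
    if not cells_per_step:
        return set()
    if rule == "all":
        # Cells that fit the ranges in every time step.
        return reduce(set.intersection, cells_per_step)
    if rule == "any":
        # Cells that fit the ranges in at least one time step.
        return reduce(set.union, cells_per_step)
    raise ValueError(f"unknown rule: {rule!r}")

steps = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]
in_all = combine_over_time(steps, "all")
in_any = combine_over_time(steps, "any")
```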

Notes on the current implementation:

Not sure what "Query over time" does versus "Cumulative Query", as cumulative should imply over time. I think that checking the cumulative option overrides what might be used in the parallel coordinates plot for selection. Then the user can add in the time slices. I like that approach but we need some reorganization. Perhaps:

Selection Range:
  o Current plot
  o Specified variables

With "Current plot", it would work as is on the current plot. With "Specified variables", "add variable" would be active along with "query over time".

The "Use cells" option should be part of the Histogram tab, where it is the "Summation" "Type".

Selection1.png

Histogram

Present information about the selection back to the user so they can see a summary.

An important part is that the results from each time slice would be retrieved and stored so that one could display histogram information, which could then be refined. For instance, the user could find the particle for which the query is true most frequently.

Options for creating the histogram, i.e. "Display Axis Type:"

"Time Slice" - Each bin would represent one or more contiguous time slices (in numerical order), and the bin's sum would be the number of matches to the query. This option would let the user observe whether one or more time slices have a significant event.


"Matches" or perhaps "Match frequency" - Each bin would represent one or more particles and the frequency with which each matches the query. The particles would be sorted based on their frequency of matches.

"Id / Variable" - Each bin would represent scalar values. If based on Id, it would be like "Matches" but the results would not be sorted, i.e. each bin would contain one or more ids in numerical order. If based on a variable, the binning would use the variable's values. Binning by variable would require running the query twice: the first pass acquires the variable's range and the second pass does the binning.

"Number of bins" should be self-explanatory.

"Auto update" would automatically update the selection rather than requiring the current "Update Selection" button.

The histogram would contain limits that the user could set in order to do a sub-selection. When the sub-selection is done, the "Min" and "Max" would be updated.

"Type" would set the summation. For instance, if the display is by time slice, the type is "And", and the user selects 5 time slices, then only those ids present in all 5 time slices would be selected. If "Or", then all ids present in any of them would be selected. We could also have "Xor" if two slices are selected. (This selection is the current "Use cells", but it needs to be delayed.)

Brad's notes: The idea here is that CQ ranges will be applied to each cell in the dataset for each specified time step. At each time step cell i will either match the CQ ranges or it won't. We keep a count for the number of times each cell matches the CQ ranges. What we get is a cell-centered array that has the number of matches over time. We can throw out the cells that don't have any matches, leaving some subset. We can then sort from high to low match frequency and bin them up by frequency so we have different bins with each bin having a set of cell id's. The user could further refine the CQ definition to turn off different bins, each time shaving off lists of cells from the selection.

It sounds like the histogram tab would have a single histogram that would let the user remove cells from the selection. I assume that the restriction in the histogram would be based on some frequency min/max values. If that is the case, I could continue using expression-based evaluation but switch from average_over_time to sum_over_time of the on/off cell values. That would give me a number of times each cell matched the CQ ranges over time, which is similar to what I already calculate. From there, I'd just use the histogram min/max to threshold the sum_over_time frequency values to get my list of cells for the selection.

  • User wants to have min/max selection controls on the histogram widget to restrict the ranges.
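The frequency-then-threshold idea from the notes above can be sketched as below. This is not the expression-based implementation; it is a plain-Python stand-in, and the names `match_frequency` and `threshold_by_frequency` as well as the sample flags are assumptions made for illustration.

```python
def match_frequency(on_off_per_step):
    """Sum per-cell 0/1 match flags over time (the sum_over_time idea)."""
    ncells = len(on_off_per_step[0])
    freq = [0] * ncells
    for flags in on_off_per_step:
        for cell, flag in enumerate(flags):
            freq[cell] += flag
    return freq

def threshold_by_frequency(freq, fmin, fmax):
    """Keep cells that matched at least once and lie within [fmin, fmax]."""
    return [c for c, f in enumerate(freq) if f > 0 and fmin <= f <= fmax]

# One row per time step, one 0/1 flag per cell (hypothetical data).
flags = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
]
freq = match_frequency(flags)
selected = threshold_by_frequency(freq, 2, 3)
```

Here the histogram's min/max controls correspond to `fmin` and `fmax`, so moving them only re-runs the cheap threshold step, not the query over time.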


Selection3.png

Statistics

The statistics tab gives the user an idea of properties of the cells that were actually selected.

  • Actual extents for the variables requested by range selection
  • How many cells were returned by the selection. Display ncells_that_passed_selection and ntotal_cells.
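The two statistics listed above could be gathered roughly as follows. The function `selection_statistics` and its input layout (one list of cell values per variable, plus a list of selected cell indices) are hypothetical; only the names ncells_that_passed_selection and ntotal_cells come from the text.

```python
def selection_statistics(values_by_var, selected_cells):
    """Summarize a selection: actual extents per variable and cell counts."""
    ntotal = len(next(iter(values_by_var.values())))
    extents = {}
    for name, values in values_by_var.items():
        picked = [values[c] for c in selected_cells]
        # No extents can be reported for an empty selection.
        extents[name] = (min(picked), max(picked)) if picked else None
    return {
        "extents": extents,
        "ncells_that_passed_selection": len(selected_cells),
        "ntotal_cells": ntotal,
    }

stats = selection_statistics({"pressure": [0.1, 0.5, 0.9, 0.3]}, [1, 2])
```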


Selection2.png

Approach

The avtNamedSelectionManager class is responsible for creating named selections. Since my approach makes use of avtFilters and the avtNamedSelectionManager class exists within libavtpipeline, I created a class called avtNamedSelectionExtension that gets passed to the avtNamedSelectionManager (NSM) so it can use additional logic to set up the pipeline it needs to execute in order to create the data for the named selection. I created a subclass of avtNamedSelectionExtension called CumulativeQueryNamedSelectionExtension (CQNSE) and put it into the engine/main directory so it exists in the compute engine. This placement enables the extension to use both libavtpipeline and libavtfilters.

The CQNSE injects additional filters into the pipeline created for named selections. The filters are custom and exist also in the CumulativeQueryNamedSelectionExtension source files. When CQ selections are created, the input dataset is fed through 3 additional filters: CQHistogramCalculationFilter, avtThresholdFilter, and CQFilter.

CQHistogramCalculationFilter: This class is a specialized filter that computes histograms on specific variables. We cache the histograms and make them available to the named selection extension object that creates this filter.

avtThresholdFilter: The standard AVT filter that implements the threshold operator. This filter is used to remove the cells that do not match the variables and ranges specified by the CQ selection properties.

CQFilter: This is a time loop filter that gets the thresholded output for all time steps. We examine the datasets, count the number of cells in each time step, and then figure out the list of cells that exist in any or all time steps, depending on the summation rule. We then sort that set of cells based on the desired histogramming method and select a range of bins from the histogram to contribute cells to our final selection. The final dataset produced by the filter is a dummy rectilinear dataset that contains the original cell numbers that describe the cells in the selection. That array is used later by the named selection code to create the selection.

The CQNSE accepts SelectionProperties from the engine which get passed when the user wants to create a new selection. The SelectionProperties contain the variables and ranges that will be used in creating the selection as well as the histogram method, etc. The SelectionProperties are used to inform the filters that are created to fulfill the CQ selection.

The CQHistogramCalculationFilter gets the original data for each variable in the CQ selection properties and creates histograms for those variables. The histograms are cached in the filter until the subpipeline has been executed. At that point, the histograms are read out and stored in the SelectionSummary state object, which gets returned to the viewer as a result of creating the CQ selection.

The threshold filter just removes the cells that do not match the conditions set forth in the CQ selection properties.

The CQFilter is added to the end of the pipeline and is the piece that does most of the work. Since it is a time iteration filter, the threshold and histogramming filters get executed for each time step. Once the pipeline has been executed for the time steps of interest, the resulting datasets are passed into the CQFilter's ExecuteAllTimesteps method where we begin to analyze the results and construct the CQ selection. The operations are broken down into stages where we first calculate the frequency for each cell so there is a count of how many time steps contain the cell. Next, the frequencies are put through a summation, which creates the initial CQ selection. Next, the selection is globalized and sent to all processors so each has the selection. Next, the selection is sorted according to the histogram method and further reduced. Finally, the selection is limited to only the cells that are valid for the local processor. The set of cells is used to create a dummy rectilinear dataset onto which we put the avtOriginalCellNumbers array which is used by the NSM to construct the final named selection.
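The stages listed above can be sketched in serial Python as follows. This is a simplified model, not the CQFilter code: `cq_select` and its arguments are hypothetical, the exchange between processors is replaced by merging per-rank counters, each cell is assumed to be owned by exactly one rank, and each histogram bin is assumed to hold a single cell.

```python
from collections import Counter

def cq_select(per_rank_steps, rule, first_bin, last_bin, rank):
    """per_rank_steps[r][t] is the set of original cell numbers that
    passed the threshold on rank r at time step t (hypothetical layout)."""
    nsteps = len(per_rank_steps[0])
    summed = []
    for steps in per_rank_steps:
        # Stage 1: frequency, i.e. how many time steps contain each cell.
        freq = Counter()
        for cells in steps:
            freq.update(cells)
        # Stage 2: summation rule ("all" keeps cells present in every step).
        if rule == "all":
            freq = Counter({c: n for c, n in freq.items() if n == nsteps})
        summed.append(freq)
    # Stage 3: globalize, so every rank knows the whole selection.
    global_freq = Counter()
    for freq in summed:
        global_freq.update(freq)
    # Stage 4: sort by match frequency (high to low) and keep a bin range.
    ordered = sorted(global_freq, key=lambda c: (-global_freq[c], c))
    kept = set(ordered[first_bin:last_bin + 1])
    # Stage 5: restrict to the cells that are valid on this rank.
    return sorted(kept & set(summed[rank]))

steps = [
    [{0, 1}, {0}, {0, 1}],       # rank 0
    [{10}, {10, 11}, {10}],      # rank 1
]
rank0_cells = cq_select(steps, "any", 0, 1, 0)
rank1_cells = cq_select(steps, "any", 0, 1, 1)
```

The globalize step is needed before sorting because the bin range is chosen from the ordering of all cells, not just the ones local to a processor.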

The CQFilter also creates some information to put into the SelectionSummary that gets sent back to the client.

CQSelections.png
Screen shot of initial CQ implementation

Reported issues

This table lists some issues that have been reported with selections.

Issue | Status
When the user deletes a plot that creates a selection, the selection is also deleted. This is needed since the selection is not saved to disk and when it no longer has a plot that creates it, there's no way to recreate the selection's definition. Some developers were unprepared for this behavior. One suggestion was that if we're deleting the originating plot then the selection should be saved to disk so it can safely be decoupled from the deleted plot. I still think if the selection does not get saved then we should remove the selection from all plots since its definition has been lost and that causes problems with restore session, etc. | Needs fixing
A previously saved selection that has been loaded cannot be resaved. VisIt issues an error. This must be a problem in the engine since saving the selection should just cause its old file to be overwritten. | Needs fixing
Selections are always saved to ~/.visit. This is just how selection load/save was implemented in the first place. Users might want to load/save selections to other directories. | Needs fixing
If the user cancels while looking for a selection file then VisIt attempts to load an invalid selection. | Fixed
If the user has a selection called "test" and then wants to load a selection called "test" from a file then either the previous test selection is overwritten or the load is not carried out. Selection names need to be unique so ideally, one of the selections would be automatically renamed so both selections could coexist. Then give the user the ability to rename selections. | Needs fixing
A plot that creates more than 1 selection only shows the first selection it creates when the plot is expanded in the plot list. | Needs fixing

Issues with new CQ implementation

This table lists some issues I've noticed with my new CQ implementation. Once 2.3.0 is released, I'll convert these into redmine tickets for the ones I don't figure out.

Issue | Status
Creating these CQ selections is expensive because of the time component. Is there something that can be cached in the database? The reason I'm logging this is that histogramming operations such as changing the bins could be done on the resulting histogram dataset instead of figuring out what that is all over again. | Fixed
When loading/saving a CQ selection, I'll probably need to save the CQ selection parameters to the .ns file so we can populate from it. We should probably return the selection properties from the Load operation. | Needs fixing
Test with parallel | Works with a domain-decomposed dataset
Investigate weirdness with setting var min/max. Do CQ with wave.visit and use pressure. The default 0..1 limits do something weird with the pressure limits, which are different. Find a good reproducer. | Fixed (implemented min/max convention)
When you have 2 plots in a window and one creates a selection that is applied to the other, there are problems when you turn on automatic selection updates. The plot that uses the selection never gets created. | Needs fixing
Selections use a default name scheme of selections%d.ns. If you save a bunch of selections that way and then try to load them via the Load button, they will be grouped because of automatic file grouping. File grouping should be temporarily turned off when using the file dialog for selections. | Needs fixing
When a selection is applied to a plot and the window (viewer) is copied, the selection is not applied in the copied window. | Needs fixing
When you switch to matches histogram mode, clear the histogram so the user doesn't see the old #times bins when expecting to see some other number of bins. Maybe add text in the histogram saying "no data" or similar. This comes up when people do not use auto update. | Done
Creating a selection from a database can fail when the dataset is too large. We might want to defer the selection creation until we have initialized its CQ attributes. | Needs fixing. I did increase the max selection size to 50M instead of 1M cells so that should help
Make the "Initialize ranges" button able to initialize its values from the Threshold operator too. | Needs fixing
Make sure the variable histograms show data or at least words saying "no data". Sometimes the data may just be hard to see. Make it more visible. | Done
Loading a second selection from a file causes VisIt to remove the first selection from the selection list. | Done
Min/Max values are not reliably calculated in parallel. | Needs fixing

Named Selection Implementation Observations

This section applies to the implementation of named selections in general and not CQ selections.

The only types of named selections I've seen in the code are avtZoneIdNamedSelection selections and avtFloatingPointIdNamedSelection.

When the user creates a named selection, the avtNamedSelectionManager creates a new contract based on the contract used to create the data object. In this new contract, original cell numbers are requested and, if they are not present, the pipeline is re-executed so they will be present. The avtNamedSelectionManager then flattens the data object tree and puts the original cell array's domain and cell ids into new arrays that are used to instantiate an avtZoneIdNamedSelection object. The avtZoneIdNamedSelection is merely a derived type of avtDataSelection that contains the domain/cellid pairs for the cells.

avtNamedSelectionFilter and avtZoneIdDataSelection

Later, when a named selection is applied to a pipeline, the avtNamedSelectionFilter gets added to the pipeline. The filter is given the name of the selection to apply to the pipeline. The filter modifies the contract, restricting the domains to be processed to the set of domains present in the selection by calling the selection's GetDomainList method. After restricting domains, the named selection attempts to instantiate an avtDataSelection for itself so the selection can be sent down through the pipeline to the database readers, which may try to implement the selection as a data selection. Derived types of avtNamedSelection provide a virtual CreateSelection method that lets them instantiate an appropriate avtDataSelection object. avtZoneIdNamedSelection does not create an avtDataSelection, so it must be handled outside of the database readers. In fact, the avtNamedSelectionFilter handles the actual implementation of avtZoneIdNamedSelection, where the data is restricted to the cells present in the selection. The restriction is handled by creating an array of on/off values for the cells by comparing the dataset's avtOriginalCellNumbers array with the cells in the selection. The dataset is then filtered using the VTK threshold filter. This means that zone selections always require all of the data, since they do not pass any data selection down to the database to read a subset of cell values. It's just as well, since few file formats would support such reads or construction of the inevitable unstructured mesh.
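The domain restriction and on/off masking described above can be illustrated with a small sketch. It uses plain lists rather than VTK arrays, and the helper names `domain_list`, `on_off_mask`, and `apply_mask` are hypothetical; the selection is modeled as a set of (domain, cell id) pairs, as described for the zone id selection.

```python
def domain_list(selection):
    """Domains that contain at least one selected cell."""
    return sorted({dom for dom, _ in selection})

def on_off_mask(domain, original_cells, selection):
    """1 for each cell of this domain whose original number is selected."""
    wanted = {cell for dom, cell in selection if dom == domain}
    return [1 if c in wanted else 0 for c in original_cells]

def apply_mask(values, mask):
    """Stand-in for thresholding the dataset on the on/off array."""
    return [v for v, m in zip(values, mask) if m]

selection = {(0, 2), (0, 5), (3, 1)}
domains = domain_list(selection)
# Original cell numbers carried by the cells of domain 0 (hypothetical).
mask = on_off_mask(0, [0, 2, 4, 5], selection)
kept = apply_mask([10.0, 20.0, 30.0, 40.0], mask)
```

Note that the mask is built from every cell in the domain, which mirrors the point that zone selections need all of the data to be read first.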

avtFloatingPointIdNamedSelection

From what I can tell, avtFloatingPointIdNamedSelection can be created from the threshold and parallel coordinates filters. In addition, you can read one in from a file and that creates the named selection.

The named selection is handled in the avtNamedSelectionFilter too but there are some differences from the ZONE_ID named selection. First, the avtFloatingPointIdNamedSelection adds a new data selection to the contract's data request object. The data selection that gets created is an avtIdentifierSelection and it gets the values stored in the avtFloatingPointIdNamedSelection. As a data selection, the avtIdentifierSelection gets passed all the way down to the data reader where it will be fulfilled. When the avtNamedSelectionFilter filter executes, it first checks whether its selection was handled already by the reader. In this case, it has to be because there's no mechanism at the higher level to implement the named selection.

Down in the H5Part reader, there is code to handle the avtIdentifierSelection, which is called "Identifier Data Selection". This code is apparently called by the plugin's GetAuxiliaryData method. What happens is the GetAuxiliaryData method wants AUXILIARY_DATA_IDENTIFIERS and it gets a set of data selections to look through. It picks out all of the avtRangeSelection and avtIdentifierSelection objects and passes them to its ConstructIdentifiersFromDataRangeSelection method. That method iterates over the data selections and assembles a query to H5Part that gets executed to produce a set of particle ids whose data matches the selection query. The resulting array of particle ids is packaged up into a new avtIdentifierSelection object and returned. The GetAuxiliaryData method then returns the avtIdentifierSelection.

Of course, what next? The avtIdentifierSelection is passed up from the generic database's GetAuxiliaryData method, ultimately through avtMetaData::GetIdentifiers.

Somewhere, the reader gets its list of selections from the database through a call to its RegisterDataSelections method. The H5Part reader's RegisterDataSelections method either stores the avtIdentifierSelection's ids away or, in the more complex case, creates a query string from the selections. Later, when GetVar is called, the query gets executed so only the requested particles are read and returned. The reader tells VisIt which selections were applied by setting bools to true for the selections that were passed into RegisterDataSelections.

Note: I noticed that in the Threshold operator, there's code for the filter to create a named selection. Well, it actually creates a set of avtDataRangeSelections and then passes them to GetMetaData()->GetIdentifiers(). That looks to have the effect of immediately querying the database for the cellids that match the ranges in the data selection. If the cellids are successfully passed back in an avtIdentifierSelection then they are used to create a new avtFloatingPointIdNamedSelection. I think this gets called from the avtNamedSelectionManager::CreateNamedSelection method and it is used for specialized selections that accelerate selection creation (by limiting the data that gets read) when databases support it. When the reader does not support the avtFloatingPointIdNamedSelection, it will not have been handled at the reader level.