Distributed Metadata

Strategies to support Distributed Mesh Metadata

Intro

We currently use global data structures to support:

  • AMR Hierarchies (Domain Nesting)
  • Ghost Zone Communication (Domain Boundaries)
  • Downstream optimizations, such as spatial & data extents.

This was primarily done because it mirrors how many codes represent this info natively. They save this info in a global fashion; we read it via a database reader, analyze it, and propagate it to the domain level.

For scaling reasons some codes are starting to transition away from this model to a distributed model where each domain provides the complete info about its neighbors and nesting relationships. This page will hold info related to adding support for this new model, as well as any implications there may be to the old model.

Note: A middle ground solution would be a hierarchical/tree based representation - however this is kind of the worst of both worlds. If the codes want to provide us with direct info, that is ideal for both parties.

Notes

I (Cyrus) have a powerpoint presentation that enumerates a lot of the issues. Here are some random notes:

Example local domain boundary spec:

from visit_utils import *

# Boundary spec for local domain 0: its own logical extents plus, for each
# neighbor, that neighbor's extents, the overlap region, and the relative
# orientation of its IJK indexing.
dbn = PropertyTree()
dbn.id = 0
dbn.lext = [0,10 , 0,10 , 0,10]   # logical extents: [ilo,ihi, jlo,jhi, klo,khi]
dbn.nneighbors = 2
dbn.neighbors = [PropertyTree() for i in range(2)]

# Neighbor touching domain 0 on the i=10 face.
nei = dbn.neighbors[0]
nei.id = 1
nei.lext = [10,20 , 0,20 , 0,10]
nei.overlap = [10,10 , 0,10 , 0,10]
nei.orient = [0,1,2]   # identity orientation: neighbor's IJK matches ours

# Neighbor touching domain 0 on the j=10 face.
nei = dbn.neighbors[1]
nei.id = 2
nei.lext = [0,10 , 10,20 , 0,10]
nei.overlap = [0,10 , 10,10 , 0,10]
nei.orient = [0,1,2]
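For structured patches like these, the overlap entries are just the intersection of the two patches' logical extents. A minimal sketch in plain Python, assuming the same [ilo,ihi, jlo,jhi, klo,khi] extent layout as the example above, with face contact (lo == hi) counting as overlap:

```python
def extent_overlap(a, b):
    """Intersect two logical extents stored as [ilo,ihi, jlo,jhi, klo,khi].

    Returns the overlap extent, or None if the patches are disjoint.
    Touching faces (lo == hi) count as overlap, matching the example
    above where neighbor 1 meets domain 0 on the i=10 face.
    """
    out = []
    for axis in range(3):
        lo = max(a[2*axis], b[2*axis])
        hi = min(a[2*axis + 1], b[2*axis + 1])
        if lo > hi:
            return None  # disjoint along this axis
        out.extend([lo, hi])
    return out

# Domain 0 vs its first neighbor from the example spec:
print(extent_overlap([0,10, 0,10, 0,10], [10,20, 0,20, 0,10]))
# -> [10, 10, 0, 10, 0, 10]
```

Applied to the second neighbor it likewise reproduces the overlap field written in the example spec.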

Looking at a moderately complex AMR nesting example, it appears the 'local' info required for a patch can easily be on the order of the size of the entire AMR nesting tree.


Silo Objects and VisIt's Silo Plugin

Silo's multi-block (MB) objects are a fundamental part of using Silo in parallel. An MB object is simply a list of the individual silo objects, blocks, that comprise it. Each member object of an MB object is enumerated by a character string representing its path within the file it is stored in and, optionally, a path within the filesystem to the file holding the object. It is not uncommon for the character string for each block to reach hundreds of characters in length.

As we scale to larger problems, that is, MB objects comprised of more and more blocks, the current approach begins to break down in several ways. First, the sheer storage cost for an MB object becomes non-trivial. For example, for N=128,000 blocks and at 128 characters per member object name, that is ~15.6 megabytes for a single MB object. So, there is a lot of memory (and file) storage consumed by MB objects at scale. Next, the application responsible for writing the Silo files winds up taking longer and longer to construct and write these MB objects to the file. This presents a serious scaling issue during I/O for applications. The time and storage required to read the objects during post-processing likewise present a serious scaling issue.
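The arithmetic behind that figure, as a quick sanity check:

```python
# Storage for a single multi-block object's explicit name list.
n_blocks = 128000
chars_per_name = 128
total_bytes = n_blocks * chars_per_name
print("%.1f MiB" % (total_bytes / 2**20))  # -> 15.6 MiB
```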

Another kind of information that relates to MB objects is something in VisIt called domain neighbor information. In the context of MB objects in Silo, it is probably more apt to call this 'block neighbor' information. Nonetheless, this is the information that says which blocks are neighbors of a given block and, for structured grids, the relative orientations of the IJK indexing schemes within the blocks as well as the hyperslabs of nodes (or zones) over which neighbors touch one another. Except for IJK indexing orientation, unstructured grids have a similar need for this kind of neighbor information.

The multi-mesh adjacency object was introduced to Silo to accommodate block neighbor information. But, the object suffers from the same scaling issues mentioned above. It causes scaling issues in the applications writing Silo files due to the time, communication and storage required to construct the object. Because the object can be constructed and output to a Silo file in pieces, it is somewhat less susceptible to ordinary MB scaling issues but nonetheless not practical for Exascale.

In addition there are a few other issues related to VisIt's handling of Silo's MB objects that need consideration. These are...

  1. Need to handle filename and dirname part of MB block paths separately.
  2. Need to distinguish between Silo dirs holding MB block parts and other Silo dirs to control Silo dir traversal.
  3. Need to support "EMPTY" block functionality.
  4. Need to match MB var objects to their corresponding MB mesh object.
  5. Need to maintain a list of MB objects read from Silo.

For Exascale, we aim to find a way to achieve basically the same thing we currently do with MB objects but to not wind up forcing the applications writing Silo files to have to engage in any unnecessary communication or non-scalable object construction. Likewise, we aim to arrive at a Silo object that in post-processing does not require a tool like VisIt to have to store non-scalable metadata or engage in non-scalable I/O to obtain the metadata it needs to open a file. This may involve revisiting just what information VisIt really does need during PopulateDatabaseMetadata to properly open any file, let alone a Silo file.

We consider two possible solutions here. One is to employ a predictable, computable _namescheme_ making it possible to generate the names of each member object of an MB object on demand. The fact is, typically these objects are named in very predictable and repeatable ways and it is easily possible to codify their on-demand generation. The other is to employ a tree of MB objects that terminates on leaves being the actual member objects of the MB object.

Nameschemes

In the namescheme approach, the application writing the Silo files decides upon a particular schema to be used to name the members of an MB object. For example, the application can decide that all the parts of a mesh will be named "domain00000", "domain00001", etc, or that all the parts of the pressure variable, p, will be named "p00000", "p00001". In addition, the application also decides upon a particular schema to be used to name the various Silo files into which different blocks of the MB objects are placed. For example, the application can decide to name its files "nnq_128_00000.0", "nnq_128_00000.1", etc. Then, instead of generating a long list of names for an MB object, the application needs only to specify the namescheme or rule by which these names can be generated. It is a simple, short character string sprintf-like expression for the name such as "domain%05d".

This approach also has the advantage that Nameschemes have already been implemented in Silo. The various options for building nameschemes are general enough to suit almost any application's needs.
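The idea can be illustrated in a few lines of Python. Note this is only the core sprintf-style idea; Silo's actual namescheme strings (handled via DBMakeNamescheme/DBGetName in the C library) are richer, supporting embedded expressions and multiple substitutions:

```python
def expand_namescheme(fmt, n):
    """Generate the n-th member name from a printf-style rule.

    Illustration only: Silo's real namescheme strings support
    arithmetic expressions and multiple substitutions, not just a
    single integer substitution as here.
    """
    return fmt % n

# Instead of storing 128,000 names, store one short rule per object:
print([expand_namescheme("domain%05d", i) for i in range(3)])
# -> ['domain00000', 'domain00001', 'domain00002']
print([expand_namescheme("nnq_128_%05d.0", i) for i in range(2)])
# -> ['nnq_128_00000.0', 'nnq_128_00001.0']
```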

"EMPTY" Block Functionality and Nameschemes

However, there is a difficulty in handling "EMPTY" blocks. If some random set of members of an MB object are "EMPTY", then it is not possible to easily construct a namescheme that will produce "EMPTY" for those blocks. So, either there has to be some means for the application writing the data to indicate which blocks are "EMPTY" or the "EMPTY" functionality as it is currently implemented in VisIt's Silo plugin will have to change.

Requiring the application to output a separate array of integers or a bitmap indicating which parts of an MB object are "EMPTY" is likely to present scaling issues. So, the question arises what is the "EMPTY" functionality really doing for us and can we get away with not continuing to support it?

As far as I can tell, it was introduced primarily to avoid having to modify the Silo library while still enabling the Silo plugin to behave more or less normally in the presence of MB objects with missing blocks. However, the route taken to explicitly enumerate "EMPTY" blocks is actually not necessary. Instead, what the plugin could do is simply not treat the absence of a block as an error.

The one situation where this could be a bit problematic is during file open. Currently, for each and every MB object, we descend into its first non-empty block to obtain additional metadata about the object apparently not present on the MB object itself. That metadata includes

  • Spatial (and topological) dimensions of the mesh
  • The coordinate system of the mesh (e.g. XY or RZ)
  • Axis labels and units
  • The cellOrigin (don't really know what that is)

All of this information could be promoted to the MB object instead, averting the need to descend into a non-empty block. However, this would also require changes to applications writing data to include this metadata on the MB objects.

The value in having blocks with the string "EMPTY" in them is that we never attempt to consider trying to descend into those blocks during this process.

Another option to remove the need for "EMPTY" block functionality is to simply loop over blocks, attempting to read each one, and terminate the loop when a read actually succeeds. If an MB object had a large number of initial "EMPTY" blocks, performance would suffer.

Why do we need to know all that metadata during file open? Don't we really only need to know the name of the mesh? All that other information can be obtained later, when we're actually going to do something with the mesh like plot it or some variable on it. Then, having the spatial and topological dimension is important. I wonder if we could significantly pare down the amount of information we need to collect during file open to free ourselves from having to tease all this information out of the file so early on.

If we decide to handle "EMPTY" blocks naturally rather than explicitly, then we lose the ability to distinguish between cases where missing blocks are intentional and cases where they are not. In that case, it's possible to enhance Silo and the plugin to annotate MB objects with a Silo option indicating whether they are expected to be fully populated. This is not as foolproof a way to distinguish intentionally missing blocks from unintentional ones, but I wonder how often we'd really encounter the need for that in practice anyway.

So, to use the Namescheme approach, we'd remove the "EMPTY" logic from the Silo plugin, add a new option to Silo's MB objects, DBOPT_MAY_HAVE_MISSING_BLOCKS, and then treat it as an error only if we descend inside an MB object and discover a missing block when the option is not set. A less desirable approach might be to have the application actually output tiny Silo objects that stand in for a missing block so that when one is read from the Silo file, we know it as such. This is something Dan Laney proposed. However, I think the former approach is simpler both for the plugin and for application developers.

Pseudo-Algorithm for Processing Multi-Block Objects with Optional Nameschemes

First, we enhance the Silo library to support all the metadata (optlist options) currently supported on individual blocks on the multi-block objects too. This would include units, labels, topological and spatial dimensions, etc. Next, we encourage application developers -- especially those exploiting EMPTY block functionality -- to adjust their applications to write these Silo options on the multi-block objects instead of the individual block objects. The new Silo plugin code in PopulateDatabaseMetadata would take the following form...

   Read an MB object
   If the MB object holds all the metadata necessary to complete the
       equiv. avt<>MetaData object, then add that avt object and we're done.
   Otherwise, we need to go read an individual block of the MB object.
       If the MB object contains a normal list of names (e.g. not using
           Nameschemes), find the first one not named "EMPTY" as we currently
           do, read that object and use the information from it to construct the
           avt<>MetaData object.
       If the MB object uses a namescheme for its blocks, then start iterating
           in a loop attempting to read a first non-empty block object. Upon the first
           successful read, terminate the loop and populate the avt<>MetaData object
           with information from that block
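In plain Python, the flow above might look roughly like the following. The reader callbacks (read_mb_object, read_block) are hypothetical stand-ins for the plugin's Silo read calls, not real API; read_block returning None models a missing or "EMPTY" block:

```python
def populate_metadata(read_mb_object, read_block, mb_name):
    """Sketch of the PopulateDatabaseMetadata flow described above.

    read_mb_object/read_block are hypothetical stand-ins for the
    plugin's Silo reads; MB objects are modeled as plain dicts.
    """
    mb = read_mb_object(mb_name)

    # Case 1: the MB object itself carries all needed metadata
    # (new-style files with Silo options on the MB object).
    if mb.get("has_full_metadata"):
        return mb["metadata"]

    # Case 2: explicit name list -- skip "EMPTY" entries as today.
    if "block_names" in mb:
        for name in mb["block_names"]:
            if name != "EMPTY":
                return read_block(name)
        raise RuntimeError("all blocks EMPTY")

    # Case 3: namescheme -- probe blocks until one read succeeds.
    for i in range(mb["nblocks"]):
        blk = read_block(mb["namescheme"] % i)
        if blk is not None:
            return blk
    raise RuntimeError("no non-empty block found")
```

Note that in case 1 no individual block is ever touched, which is where the performance win for updated applications comes from.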

First and foremost, if the application has been updated to write Silo options on the MB objects, there will be no attempt to read an individual block of the MB object. So, once applications get updated to do that, they will see a performance improvement right away.

Next, legacy files will still behave normally. The plugin will examine the explicit list of names, skipping past initial EMPTY blocks to find the first non-empty block to read. This is how things work now.

Third, applications neglecting to write Silo options on the MB object and using nameschemes but not EMPTY block functionality will behave pretty much as things do today. That's because the first non-empty block in such cases will be the first block of the mesh anyway, so the loop attempting to read the first non-empty block will terminate after a single iteration.

Finally, if for some reason an application is using nameschemes but also creating MB objects with empty blocks, and that application does not write Silo options on the MB object, then it will pay a price during file open to iterate over the block names attempting to read each one until a read succeeds. If the MB object has a slew of initial empty blocks, this loop will have to iterate through all of them before getting to the first non-empty block. But, the application can avoid this simply by writing the essential Silo options on the MB object.

Even if the Silo library hasn't been updated to handle the necessary optlist options on the MB objects, all of the other work defined here can proceed on the Silo plugin without it. Later, when the Silo library has been updated, we can adjust the plugin to short-circuit all the logic to read an individual block.

Changes Elsewhere to Silo plugin to Support Nameschemes

About the only other places to modify the Silo plugin to support Nameschemes are the DetermineFileAndDirectory and RegisterDomainDirs methods.

DetermineFileAndDirectory currently takes a name explicitly stored as an entry in an MB object's list of names and returns file and block information. Instead, calls to that method must either generate the name when using a Namescheme object or, better yet, the method interface could be modified to simply take a Silo MB object pointer and do the right thing.

RegisterDomainDirs is designed to register certain directories in the Silo file as having already been processed so that the plugin's dir-traversal algorithm does not attempt to descend into those directories. The rationale is that directories used to store individual blocks of an MB object are not ever used to store Silo variables that are not already bound to some MB object. Therefore, those directories do not need to be traversed to individual Silo variable objects.

Tree of MB objects

This is something that has been requested by users for other purposes and was proposed recently as a solution to scaling problems as well. In this approach, we add an option to an MB object, DBOPT_MB_TREEBOUNDS, which is an array of integers equal in length to the MB object (e.g. the number of member objects it lists). The presence of this option would indicate that the member objects are themselves other MB objects. Second, the values in the array would indicate the upper bounds of the leaf member object indices to be found in the associated subtree. In particular, the last value in this array would be the total number of leaf members (e.g. the number of domains of the mesh). We might also need a way to indicate the mesh type or variable type of all the leaf member objects as another option.
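With such a bounds array, locating the subtree holding a given leaf (domain) index is a binary search. A sketch of the DBOPT_MB_TREEBOUNDS idea above, assuming entries are exclusive upper bounds of leaf indices per child (that exact convention is an assumption here, not settled Silo API):

```python
from bisect import bisect_right

def subtree_for_leaf(treebounds, leaf):
    """Return the child index of the subtree holding a given leaf index.

    treebounds follows the DBOPT_MB_TREEBOUNDS idea: entry k is the
    (exclusive) upper bound of leaf indices under child k, so the last
    entry equals the total leaf count. This convention is an assumption.
    """
    if leaf < 0 or leaf >= treebounds[-1]:
        raise IndexError("leaf index out of range")
    return bisect_right(treebounds, leaf)

# 400 domains split across three child MB objects:
bounds = [100, 250, 400]
print(subtree_for_leaf(bounds, 99))   # -> 0
print(subtree_for_leaf(bounds, 250))  # -> 2
```

A reader need only follow one such lookup per tree level to reach the leaf it wants, rather than scanning a flat 400-entry name list.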

Any given processor writing or reading a tree of MB objects would need only have to deal with the path through the tree for the part(s) of the tree it holds. So, it should scale much better than our ordinary MB objects.

The tree of MB objects approach also supports the "EMPTY" functionality without alteration.

Recommendations

The Namescheme approach is without a doubt much more efficient in storage and I/O performance. However, it is also somewhat more complex and somewhat less general than the tree of MB objects approach, and we have less experience using Nameschemes in practice. The tree of MB objects approach is only incrementally different from what we have been doing and what has been working for 20+ years in Silo.

Nonetheless, as we continue to grow in scale (e.g. number of blocks), the namescheme approach will at least have to be part of the solution. Even within the next several months, we expect to have files with more than 1,000,000 domains and maybe 100,000 files. Nameschemes will yield a significant reduction in storage and I/O costs.

Impact of AMR on Global Domain Mapping

Given global domain number i, a key thing the Silo plugin needs to know or be able to figure out is which file that domain is in and which Silo directory within the file. When the assignment of domains to files is regular, this mapping information is relatively trivial to maintain. The problem is that AMR makes the mapping very irregular, because each patch in an AMR hierarchy can wind up being refined into an arbitrary number of new patches. Consequently, as the total domain count grows, the cost to maintain this mapping in the presence of AMR can become a scaling problem. Even with the use of nameschemes we can still wind up having to store a pair of integer values for each and every domain. If there are ~3 million domains and the integer values are 4 bytes, that's 24 megabytes of data.

Worse, due to a variety of issues including SIL selection, read optimizations and the use of non-absolute load-balancing, a given domain can wind up needing to be processed by different processors during different plot executions. So, the information necessary to support the mapping of a global domain number i to where it is in the ocean of files that are storing the given data set can wind up being needed by every processor.

One simple way to reduce the storage is to not attempt to store the entire map on every processor. In so doing, though, we either introduce the need to communicate whenever a processor is required to process a domain not handled by the part of the map that processor stores, or we must ensure that a given domain is only ever processed by the same processor. This latter step, ensuring that once a file is opened a given domain is always processed by the same processor, is somewhat of a departure from the way VisIt currently operates but is also a relatively simple thing to enforce. In so doing, we eliminate the need for any one processor to maintain the entire domain-to-file-and-dir map.

Of course, there will be some work when the file is first opened to decide upon a good absolute assignment of domains to processors, taking into account which domains are in which files as well as which domains are neighbors of each other. But, this can be computed during PopulateDatabaseMetadata. Next, no one processor will even know the entire domain-to-processor map; each will instead simply know the list of domains it owns. When VisIt executes a plot, it will first restrict its work to include only those domains the given processor has been assigned a priori. If none of those domains are even on in the SIL selection, then that processor has no work to perform. If the per-processor domain count is high, which we'd expect for these larger meshes, then the likelihood of bad load balance using this approach is low.
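A small sketch of this scheme, with hypothetical names throughout: each processor keeps only its own slice of the domain-to-file-and-dir map (round-robin is just a placeholder assignment policy) and then intersects its owned domains with the current SIL selection:

```python
def build_local_map(rank, nprocs, domain_locations):
    """Keep only this processor's slice of the domain -> (file, dir) map.

    domain_locations: full {domain_id: (file, dir)} map, which in a real
    implementation would be computed once during PopulateDatabaseMetadata
    rather than held whole on every rank. Round-robin ownership is a
    placeholder for a locality-aware assignment.
    """
    return {d: loc for d, loc in domain_locations.items()
            if d % nprocs == rank}

def domains_to_process(local_map, sil_selection):
    """Intersect this rank's owned domains with the currently-on SIL set."""
    return sorted(d for d in local_map if d in sil_selection)

# 6 domains, 2 per file, over 2 ranks; domains 1 and 4 turned off in the SIL:
full = {d: ("file_%02d" % (d // 2), "domain%05d" % d) for d in range(6)}
local = build_local_map(0, 2, full)              # rank 0 owns 0, 2, 4
print(domains_to_process(local, {0, 2, 3, 5}))   # -> [0, 2]
```

No rank ever stores entries for domains it does not own, so the 24-megabyte whole-map cost is divided across the processor count.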

Another option Brad mentioned here is to instead use individual files in place of our regular notion of domains. Files get assigned to processors, files get turned on and off in the SIL, etc., and a GetMesh call returns all the domains in a given file. Or, we could introduce a GetAllMeshesInFile call that VisIt can use in place of GetMesh (and other var equivalents as well) to operate this way. We lose some fine-grained control over individual domains, but that might be ok, particularly in the short term in trying to resolve our scaling issues. However, we'd still need to maintain the ability to control which AMR levels are on and off. So, maybe the plugin returns all meshes in the file in a GetMesh call but VisIt trims what it gets back from the plugin based on the current SIL selection, etc.