Parallelization over time
This discussion of the `next' operation in VisIt's VCR controls is related to VisIt's notion of `time' but is not essential to a discussion of parallelizing VisIt over time, so the reader may want to skip to the next paragraph. VisIt's notion of `next' in a sequence is pretty much tied to a time series right now. It could be generalized to support unstructured AMR, iterating over refinements instead of time (which might feed into doing time in parallel), or a completely arbitrary state space. Looking at this from the notion of the `next' operation, I can envision representing the state space as a graph. For time, it is a simple graph where each node represents a time state and each edge represents a transition from one timestep to the next. However, it could easily be generalized to an arbitrary graph. At a particular node in the graph, `next' may have multiple options and you could pick which one to follow. `Back' would just keep a record of the path you have taken through the graph. You could envision a `default' path that visits every node in the graph once; that would be the default order in which the nodes are visited -- a single traversal of the whole state-space graph. In a simple 2D state space, the graph would be very regular, like a quadmesh, but the graph also permits bifurcation. It would also be nice to `render' the state space, perhaps even with labels, so a user could see the whole space, see where they are `at' in it, and go directly to specific states in addition to stepping along a path. Different classes of edges would represent different state transitions, and the `next' operation could have an `active' edge class to follow.
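To make the idea concrete, here is a minimal sketch of the kind of state-space graph this suggests; StateGraph, StateNode, and the edge-class keys are hypothetical names, not anything in VisIt:

    #include <map>
    #include <string>
    #include <vector>

    // Each node is a state (e.g. a timestep); outgoing edges are grouped
    // into classes so `next' can follow whichever class is active.
    struct StateNode
    {
        int         id;
        std::string label;  // so the state space could be rendered with labels
        std::map<std::string, std::vector<int> > next; // edge class -> targets
    };

    class StateGraph
    {
    public:
        // Follow the first edge of the active class, recording the path so
        // Back() can retrace it. At a bifurcation, a UI could let the user
        // pick among it->second instead of taking the first edge.
        bool Next(const std::string &activeClass)
        {
            auto it = nodes[current].next.find(activeClass);
            if (it == nodes[current].next.end() || it->second.empty())
                return false;
            path.push_back(current);
            current = it->second[0];
            return true;
        }
        // `Back' just pops the recorded traversal history.
        bool Back()
        {
            if (path.empty())
                return false;
            current = path.back();
            path.pop_back();
            return true;
        }
        std::vector<StateNode> nodes;
        std::vector<int>       path;      // path taken through the graph
        int                    current = 0; // the state we are `at'
    };

For a plain time series, every node would have a single "time" edge to its successor; the quadmesh-like 2D state space would just add a second edge class.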
If you wanted to allocate processors to time and domains -- two of the dimensions used in VisIt's plugin interfaces -- that would be pretty easy. Currently, when you specify the number of processors, domains get assigned to processors using one of a few rules. If we wanted to do things in time as well, we'd need a rule to assign timesteps to processors too. Suppose we had 23 domains and 101 timesteps and we ran on 17 processors? We'd have 23*101 = 2323 domain-time chunks. Could we just assign these domain-time chunks to processors the same way we assign domain chunks now? If we did, we'd have to be really careful about pipelines requiring collective operations. Maybe streaming would help here.

More than likely, we'd have to pick how many processors we want any given timestep loaded onto. Say 5; then we could have 3 timesteps in memory for a total of 15 processors, and two would be sitting idle, at least in this example. So, when we allocate processors, we'd really need to choose a multiple of the number we'd want to use for a single timestep. Now, if the operation(s) being performed are such that VisIt can determine a priori that it needs only some subset of the domains for a timestep, then why not process that timestep on a smaller number of processors? Suppose we're doing an iso-contour calculation and we've got domain extents for the domains at each timestep. We could `look ahead' and determine how many domains we'll need for each timestep. Then, for each timestep, we'd create an MPI communicator for a smaller number of processors, tell that group to process the timestep, and move on to the next timestep.

How does the UI process communicate results back to the viewer? The UI process would probably have to become `special' in that it isn't really assigned to any processor group; it just sits and waits for RPCs from the viewer, which it sends on to the right processor group, and waits for RPC responses from the processor groups, which it turns around and ships to the viewer.
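As a quick sanity check on that arithmetic, a tiny sketch using the numbers above (the helper is hypothetical):

    #include <cstdio>

    // Given the total processor count and the processors we want per
    // timestep, compute how many timesteps can be in flight at once and
    // how many processors would sit idle.
    void AllocateTimeGroups(int totalProcs, int procsPerTimestep,
                            int &nGroups, int &nIdle)
    {
        nGroups = totalProcs / procsPerTimestep;
        nIdle   = totalProcs % procsPerTimestep;
    }

    int main()
    {
        int nGroups = 0, nIdle = 0;
        AllocateTimeGroups(17, 5, nGroups, nIdle);
        printf("%d timestep groups, %d idle\n", nGroups, nIdle); // 3 groups, 2 idle
        return 0;
    }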
Since there is only one socket connection from engine to viewer, establishing more would be difficult. So, I think we need to turn the single (rank 0) UI process into a servlet which receives data from the engine processor groups and ships data to the viewer. It also receives RPCs from the viewer and ships them to engine processor groups. You would want to allocate NxM+1 procs -- the '+1' for the UI process -- as in 'visit -np 5x20+1 ...'. Here are some of the steps in the initialization (a rough MPI sketch of the first few steps appears after the list)...
1. Engine calls MPI_Init as before
2. Engine confirms MPI comm size is NxM+1
3. Engine creates processor groups and MPI communicators for those groups
4. One process in each group is identified as the 'UI proxy' for that group
5. UI process keeps track of each processor group's UI-proxy process
6. UI goes into a servlet loop, looking for RPCs from the viewer and any outstanding requests from processor groups. Incoming RPCs are routed to the correct processor group and a non-blocking receive is posted for that group's result.
7. The timestep for the request indicates which processor group to send to.
8. Viewer issues requests like open database, add plot, execute, etc. to the engine as normal. However, it knows it can issue multiple requests up to a maximum and does so, keeping track of the RPCs it issues and waiting for their responses. Now, the polling loop in the viewer checks for multiple responses from the UI process it is connected to. As it receives them, it sticks the results in the correct actor for that plot. It's almost like animation caching in this case because we'll be populating actors which are not currently being displayed. We may not get the actors in the right order.
9. If the engine is in SR mode, then we have a tiny problem in that we have only one actor on the viewer, across all time, for an SR image. If we instead let any actor in a plot contain the SR rendered image, then animation caching would work with SR mode and the engine could render in parallel across time. The reason the avtERIA is `special' really has to do with the issuing of rendering requests to the engine. The resulting images DO NOT also need to be stored in that actor; they could instead be stored in the ViewerPlot object's actors. This would also help resolve the current issues with SR + animation caching.
10. Now, as the user advances in time, the viewer could issue a new request for the timestep that got freed up. This would be a little like a `look ahead' operation. In this way, we could maybe run VisIt such that we overlap execution of the next timestep's plots with the current timestep's render. That would be cool!
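Here is a minimal MPI sketch of steps 1 through 5, assuming N groups of M processors plus the rank-0 UI process; the variable names and the use of MPI_Comm_split are illustrative, not VisIt's actual initialization code:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);                  // step 1: as before

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int N = 5, M = 20;                 // e.g. 'visit -np 5x20+1'
        if (size != N * M + 1)                   // step 2: confirm NxM+1
            MPI_Abort(MPI_COMM_WORLD, 1);

        // Step 3: rank 0 is the UI process; the remaining NxM ranks are
        // split into N groups of M by coloring on group index.
        const bool isUI  = (rank == 0);
        const int  color = isUI ? N : (rank - 1) / M;
        MPI_Comm groupComm;
        MPI_Comm_split(MPI_COMM_WORLD, color, rank, &groupComm);

        // Step 4: rank 0 within each group acts as that group's 'UI proxy'.
        int groupRank;
        MPI_Comm_rank(groupComm, &groupRank);
        const bool isUIProxy = (!isUI && groupRank == 0);
        (void)isUIProxy;

        // Step 5: the UI process can locate every group's UI proxy in the
        // world communicator directly: group g's proxy is world rank g*M+1.

        // ... servlet loop (UI) or pipeline execution (groups) goes here ...

        MPI_Comm_free(&groupComm);
        MPI_Finalize();
        return 0;
    }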
How does the UI servlet work? It has a socket to the viewer as normal and runs a polling loop as normal. In that loop, it checks BOTH for RPCs from the viewer AND for any outstanding requests from processor groups. It uses MPI_Comm_VisIt to issue non-blocking forwards of RPCs from the viewer. The processor groups' UI proxies behave such that the communication they used to do through a socket to the viewer is now done over MPI_Comm_VisIt to the UI process. All other communication they do is on the local group's communicator.
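A sketch of what that polling loop might look like, assuming a handful of hypothetical hooks into the socket layer (Quitting, HasViewerRPC, ReadViewerRPC, SendResultToViewer, GroupForTimestep) and assuming MPI_Comm_VisIt is a dup of the world communicator:

    #include <mpi.h>
    #include <vector>

    struct RPC    { int timestep; /* opcode, arguments, ... */ };
    struct Result { char bytes[4096]; /* serialized plot output */ };
    enum { RPC_TAG = 1, RESULT_TAG = 2 };

    // Hypothetical hooks into the viewer socket and routing machinery.
    bool Quitting();
    bool HasViewerRPC();
    RPC  ReadViewerRPC();
    void SendResultToViewer(const Result &r);
    int  GroupForTimestep(int timestep);     // step 7's routing rule

    void UIServletLoop(MPI_Comm MPI_Comm_VisIt, int nGroups, int M)
    {
        std::vector<MPI_Request> pending(nGroups, MPI_REQUEST_NULL);
        std::vector<Result>      results(nGroups);

        while (!Quitting())
        {
            // Forward a new viewer RPC to the right group's UI proxy and
            // post a non-blocking receive for that group's result.
            if (HasViewerRPC())
            {
                RPC rpc = ReadViewerRPC();
                int g = GroupForTimestep(rpc.timestep);
                int proxy = g * M + 1;       // group g's UI proxy (world rank)
                MPI_Send(&rpc, sizeof(rpc), MPI_BYTE, proxy, RPC_TAG,
                         MPI_Comm_VisIt);
                MPI_Irecv(&results[g], sizeof(Result), MPI_BYTE, proxy,
                          RESULT_TAG, MPI_Comm_VisIt, &pending[g]);
            }

            // If any processor group has finished, ship its result back
            // to the viewer over the socket.
            int g = MPI_UNDEFINED, done = 0;
            MPI_Testany(nGroups, pending.data(), &g, &done,
                        MPI_STATUS_IGNORE);
            if (done && g != MPI_UNDEFINED)
                SendResultToViewer(results[g]);
        }
    }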
So, to parallelize in time, VisIt would assign each timestep to a `group' of processors up to some user-specified max size (maybe a % of the total). The viewer would issue executes for the timesteps it wants and sit back and wait for the results of each request. So, it would now be waiting on multiple responses from the engine.
The easy thing is to divide the processors into a 2D array: one dimension for time and one for domains. The viewer would know a priori how many timesteps it could issue executes for. The engine would, after getting MPI_COMM_WORLD, dup the communicator and create a bunch of groups. Either the viewer issues executes to each of the `leaders' of the groups, or the viewer issues them all to a `master' UI process that controls all groups. Each group does the `usual' thing with its MPI communicator replaced with something smaller.
How would the viewer correctly order the results arriving from different groups? If each state was coming from a different `group leader', it would be easy to order them. If they all come from the same `master' UI process, maybe the `request' associated with the message could be used to order them.
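For the `master' UI case, here is a small sketch of how the viewer might use the request id carried in each response to put results back in issue order (the names are hypothetical):

    #include <map>

    // Responses arrive in any order but carry the id of the request they
    // answer; the viewer consumes them strictly in the order the requests
    // were issued.
    struct Response { int requestId; /* payload: image, geometry, ... */ };

    class ResponseSequencer
    {
    public:
        // Buffer an out-of-order response as it arrives.
        void Add(const Response &r) { pending[r.requestId] = r; }

        // Hand back the next in-order response, if it has arrived yet.
        bool TakeNext(Response &out)
        {
            auto it = pending.find(nextToConsume);
            if (it == pending.end())
                return false;
            out = it->second;
            pending.erase(it);
            ++nextToConsume;
            return true;
        }
    private:
        std::map<int, Response> pending;
        int nextToConsume = 0;
    };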
You could parallelize in time `on the fly' by just telling VisIt to `init' for n timesteps as well. Long after MPI_Init is called, we can change the communicator(s) the engine uses as a result of a user request to parallelize in time.
Now that I think about it, though, we could opt to 'never' have an idle engine by employing a 'look ahead' strategy in the viewer: whenever it can, the viewer issues a request to the engine to render a timestep that is 'nearby' the one the user is currently looking at. As long as the viewer can reliably interrupt the engine at any point during an execution (which we can't actually do now but hope to with the MPI refactor), the only 'harm' in doing so would be that the engine needs a place to store the results it computes until the viewer asks for them. This notion of asking the engine to 'look ahead' is actually independent of parallelizing in time. But it is related, because the viewer needs to decide if it can ask the engine for results from timesteps other than the one the user is currently looking at.
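A sketch of the kind of look-ahead policy the viewer could apply; nothing here is existing VisIt code:

    #include <initializer_list>
    #include <set>
    #include <vector>

    // Given the timestep the user is viewing, pick 'nearby' timesteps that
    // have not been requested yet, up to the number of idle engine groups.
    std::vector<int> PickLookAheadStates(int current, int nStates,
                                         int idleGroups,
                                         const std::set<int> &requested)
    {
        std::vector<int> picks;
        // Prefer states just ahead of the current one, then just behind.
        for (int d = 1; d < nStates && (int)picks.size() < idleGroups; ++d)
        {
            for (int s : { current + d, current - d })
            {
                if (s >= 0 && s < nStates && requested.count(s) == 0 &&
                    (int)picks.size() < idleGroups)
                    picks.push_back(s);
            }
        }
        return picks;
    }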
If the processing load varies much from one timestep to the next, it would be best if the engine could decide to change the allocation of processors between 'domains' and 'timesteps', enabling it to trade off how many timesteps it processes simultaneously against how many domains it processes for a given timestep.
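A toy version of such a rebalancing rule, assuming the engine can estimate (e.g. from extents-based look-ahead) how many domains a timestep will actually need:

    // Shrink the per-timestep group when few domains are needed so more
    // timesteps can run concurrently; hypothetical helper, not VisIt code.
    int ChooseGroupSize(int totalProcs, int domainsNeeded, int maxGroupSize)
    {
        int size = (domainsNeeded < maxGroupSize) ? domainsNeeded
                                                  : maxGroupSize;
        if (size < 1)          size = 1;
        if (size > totalProcs) size = totalProcs;
        return size;          // number of groups would be totalProcs / size
    }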
Brad's comments
I'm sure I'll have more but here's a first pass.
1. The first paragraph with the description of the states as a graph sounds to me similar to the "state space" notion of time that you'd proposed some time back. Of course, as a graph it's more general. I think I prefer the previous "state space mesh" idea over the graph. It's still possible to have a cursor that iterates over that space in some fashion, providing the "time state" that VisIt will use for rendering. This discussion is orthogonal to the concept of parallelization over time.
2. I agree with the compute engine's processors being divided up into a UI process and N smaller groups of M processors, each of which makes up a "logical engine" that acts pretty much like VisIt's current compute engine does. The UI and master processes in each group would share a communicator that would be used for the UI to broadcast the incoming RPCs from the viewer to the masters in each group. The masters would then broadcast to their workers. I like this 2-step approach because the master in a group could do things like add its group index to the time state arguments in the RPCs. This could be an easy way to make each group execute work for its particular time step.
3. I definitely don't like the idea of each of the N group masters talking to the viewer. We have enough socket woes. The master in a group should communicate back to the viewer through the UI process.
4. I'm not quite sure what to do in the viewer as far as listening for engine output is concerned. I think the sequence of methods needed to create plots in the engine is overly complex. I think I'd like to see the compute engine interface rewritten so everything needed to render a scene is passed down to the engine at the same time. That means 1 (or a few) RPC calls to the engine to specify everything needed to render on the engine. Of course, this means patterning the existing plot RPCs to look more like the external render request that we send to the engine for scalable rendering. This would enable several things.
- The engine could be made smart enough to execute all plots in the plot list at the same time. The existing compositing code (on the engine) could remain to composite the different plots together in SR mode.
- Essentially 1 interface to get either a rendered image or a "data object reader" -- or whatever it is that we need in the viewer to execute the viewer portion of a plot. That is, we get a simplified way of getting either an image or geometry. This would make it easier to do things like make a client application talk directly to the compute engine. For example, an Internet browser controlled compute engine that displays images from the compute engine.
- When the large "render request" comes in, the engine can distribute the work over processors as it sees fit and the viewer is none the wiser. This means that the render request could be parallelized over time over a group of N processor groups. Or, the engine could even be smarter and do all plots at the same time for some set of time steps. When the viewer requests the data again for the next time state, using pretty much the same render request, but with a different time state, the work would likely already have been done. The groups that have a time state less than the new time state could be tasked by the UI to start work on the next time states in the series that have not been completed.
- Simplifying the engine interface is also attractive for using the engine as a library from simulations that use libsim.
5. I mentioned having an avtParallelContext class in the contract. In the MPI case, I see this having the world communicator, worldSize, worldRank, the group communicator, groupSize, groupRank, and a stack of communicators that could be used at various points in the pipeline for parallel error handling (created on demand within the most recent communicator to restrict communications to the procs that have data and don't have errors). I could also see the avtParallelContext class as a place to put per-thread data for a multithreaded, non-MPI compute engine. This discussion of parallelization over time does not necessarily require an avtParallelContext in the pipeline but since there will be various communicators in play in parallelization over time, it is fair to mention a better solution than using global variables -- especially since it could facilitate some advances in parallel error handling.
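A rough sketch of what such an avtParallelContext might hold, based purely on the description above (speculative, not actual VisIt code):

    #include <mpi.h>
    #include <stack>

    // Carries the world and group communicators plus a stack of derived
    // communicators for restricting collectives to the procs that have
    // data and have not hit errors.
    class avtParallelContext
    {
    public:
        MPI_Comm world, group;
        int      worldSize, worldRank;
        int      groupSize, groupRank;

        // Create, on demand, a communicator containing only the procs for
        // which 'participate' is true, split from the most recent one.
        // (An excluded proc gets MPI_COMM_NULL and should not push again
        // until the matching Pop.)
        void PushRestricted(bool participate)
        {
            MPI_Comm top = comms.empty() ? group : comms.top();
            int rank;
            MPI_Comm_rank(top, &rank);
            MPI_Comm newComm;
            MPI_Comm_split(top, participate ? 0 : MPI_UNDEFINED, rank,
                           &newComm);
            comms.push(newComm);
        }
        void Pop()
        {
            if (!comms.empty())
            {
                if (comms.top() != MPI_COMM_NULL)
                    MPI_Comm_free(&comms.top());
                comms.pop();
            }
        }
    private:
        std::stack<MPI_Comm> comms;  // per-thread data could live here too
    };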