Reference documentation for deal.II version Git d3aed38b93 20211028 13:33:27 +0200

This tutorial depends on step8.
Table of contents  

This program does not introduce any new mathematical ideas; in fact, all it does is to do the exact same computations that step8 already does, but it does so in a different manner: instead of using deal.II's own linear algebra classes, we build everything on top of classes deal.II provides that wrap around the linear algebra implementation of the PETSc library. And since PETSc allows to distribute matrices and vectors across several computers within an MPI network, the resulting code will even be able to solve the problem in parallel. If you don't know what PETSc is, then this would be a good time to take a quick glimpse at their homepage.
As a prerequisite of this program, you need to have PETSc installed, and if you want to run in parallel on a cluster, you also need METIS to partition meshes. The installation of deal.II together with these two additional libraries is described in the README file.
Now, for the details: as mentioned, the program does not compute anything new, so the use of finite element classes, etc., is exactly the same as before. The difference to previous programs is that we have replaced almost all uses of classes Vector
and SparseMatrix
by their nearequivalents PETScWrappers::MPI::Vector
and PETScWrappers::MPI::SparseMatrix
that store data in a way so that every processor in the MPI network only stores a part of the matrix or vector. More specifically, each processor will only store those rows of the matrix that correspond to a degree of freedom it "owns". For vectors, they either store only elements that correspond to degrees of freedom the processor owns (this is what is necessary for the right hand side), or also some additional elements that make sure that every processor has access the solution components that live on the cells the processor owns (socalled locally active DoFs) or also on neighboring cells (socalled locally relevant DoFs).
The interface the classes from the PETScWrapper namespace provide is very similar to that of the deal.II linear algebra classes, but instead of implementing this functionality themselves, they simply pass on to their corresponding PETSc functions. The wrappers are therefore only used to give PETSc a more modern, object oriented interface, and to make the use of PETSc and deal.II objects as interchangeable as possible. The main point of using PETSc is that it can run in parallel. We will make use of this by partitioning the domain into as many blocks ("subdomains") as there are processes in the MPI network. At the same time, PETSc also provides dummy MPI stubs, so you can run this program on a single machine if PETSc was configured without MPI.
Developing software to run in parallel via MPI requires a bit of a change in mindset because one typically has to split up all data structures so that every processor only stores a piece of the entire problem. As a consequence, you can't typically access all components of a solution vector on each processor – each processor may simply not have enough memory to hold the entire solution vector. Because data is split up or "distributed" across processors, we call the programming model used by MPI "distributed memory computing" (as opposed to "shared memory computing", which would mean that multiple processors can all access all data within one memory space, for example whenever multiple cores in a single machine work on a common task). Some of the fundamentals of distributed memory computing are discussed in the Parallel computing with multiple processors using distributed memory documentation module, which is itself a submodule of the Parallel computing module.
In general, to be truly able to scale to large numbers of processors, one needs to split between the available processors every data structure whose size scales with the size of the overall problem. (For a definition of what it means for a program to "scale", see this glossary entry.) This includes, for example, the triangulation, the matrix, and all global vectors (solution, right hand side). If one doesn't split all of these objects, one of those will be replicated on all processors and will eventually simply become too large if the problem size (and the number of available processors) becomes large. (On the other hand, it is completely fine to keep objects with a size that is independent of the overall problem size on every processor. For example, each copy of the executable will create its own finite element object, or the local matrix we use in the assembly.)
In the current program (as well as in the related step18), we will not go quite this far but present a gentler introduction to using MPI. More specifically, the only data structures we will parallelize are matrices and vectors. We do, however, not split up the Triangulation and DoFHandler classes: each process still has a complete copy of these objects, and all processes have exact copies of what the other processes have. We will then simply have to mark, in each copy of the triangulation on each of the processors, which processor owns which cells. This process is called "partitioning" a mesh into subdomains.
For larger problems, having to store the entire mesh on every processor will clearly yield a bottleneck. Splitting up the mesh is slightly, though not much more complicated (from a user perspective, though it is much more complicated under the hood) to achieve and we will show how to do this in step40 and some other programs. There are numerous occasions where, in the course of discussing how a function of this program works, we will comment on the fact that it will not scale to large problems and why not. All of these issues will be addressed in step18 and in particular step40, which scales to very large numbers of processes.
Philosophically, the way MPI operates is as follows. You typically run a program via
which means to run it on (say) 32 processors. (If you are on a cluster system, you typically need to schedule the program to run whenever 32 processors become available; this will be described in the documentation of your cluster. But under the hood, whenever those processors become available, the same call as above will generally be executed.) What this does is that the MPI system will start 32 copies of the step17
executable. (The MPI term for each of these running executables is that you have 32 MPI processes.) This may happen on different machines that can't even read from each others' memory spaces, or it may happen on the same machine, but the end result is the same: each of these 32 copies will run with some memory allocated to it by the operating system, and it will not directly be able to read the memory of the other 31 copies. In order to collaborate in a common task, these 32 copies then have to communicate with each other. MPI, short for Message Passing Interface, makes this possible by allowing programs to send messages. You can think of this as the mail service: you can put a letter to a specific address into the mail and it will be delivered. But that's the extent to which you can control things. If you want the receiver to do something with the content of the letter, for example return to you data you want from over there, then two things need to happen: (i) the receiver needs to actually go check whether there is anything in their mailbox, and (ii) if there is, react appropriately, for example by sending data back. If you wait for this return message but the original receiver was distracted and not paying attention, then you're out of luck: you'll simply have to wait until your requested over there will be worked on. In some cases, bugs will lead the original receiver to never check your mail, and in that case you will wait forever – this is called a deadlock. (See also video lecture 39, video lecture 41, video lecture 41.25, video lecture 41.5.)
In practice, one does not usually program at the level of sending and receiving individual messages, but uses higher level operations. For example, in the program we will use function calls that take a number from each processor, add them all up, and return the sum to all processors. Internally, this is implemented using individual messages, but to the user this is transparent. We call such operations collectives because all processors participate in them. Collectives allow us to write programs where not every copy of the executable is doing something completely different (this would be incredibly difficult to program) but where in essence all copies are doing the same thing (though on different data) for themselves, running through the same blocks of code; then they communicate data through collectives; and then go back to doing something for themselves again running through the same blocks of data. This is the key piece to being able to write programs, and it is the key component to making sure that programs can run on any number of processors, since we do not have to write different code for each of the participating processors.
(This is not to say that programs are never written in ways where different processors run through different blocks of code in their copy of the executable. Programs internally also often communicate in other ways than through collectives. But in practice, parallel finite element codes almost always follow the scheme where every copy of the program runs through the same blocks of code at the same time, interspersed by phases where all processors communicate with each other.)
In reality, even the level of calling MPI collective functions is too low. Rather, the program below will not contain any direct calls to MPI at all, but only deal.II functions that hide this communication from users of the deal.II. This has the advantage that you don't have to learn the details of MPI and its rather intricate function calls. That said, you do have to understand the general philosophy behind MPI as outlined above.
The techniques this program then demonstrates are:
ElasticProblem
.ElasticProblem::setup_system()
function.ElasticProblem::assemble_system()
function.ElasticProblem::solve()
and ElasticProblem::refine_grid()
functions.pcout
variable in the program, initialized in the constructor.Since all this can only be demonstrated using actual code, let us go straight to the code without much further ado.
First the usual assortment of header files we have already used in previous example programs:
And here come the things that we need particularly for this example program and that weren't in step8. First, we replace the standard output std::cout
by a new stream pcout
which is used in parallel computations for generating output only on one of the MPI processes.
We are going to query the number of processes and the number of the present process by calling the respective functions in the Utilities::MPI namespace.
Then, we are going to replace all linear algebra components that involve the (global) linear system by classes that wrap interfaces similar to our own linear algebra classes around what PETSc offers (PETSc is a library written in C, and deal.II comes with wrapper classes that provide the PETSc functionality with an interface that is similar to the interface we already had for our own linear algebra classes). In particular, we need vectors and matrices that are distributed across several processes in MPI programs (and simply map to sequential, local vectors and matrices if there is only a single process, i.e., if you are running on only one machine, and without MPI support):
Then we also need interfaces for solvers and preconditioners that PETSc provides:
And in addition, we need some algorithms for partitioning our meshes so that they can be efficiently distributed across an MPI network. The partitioning algorithm is implemented in the GridTools
namespace, and we need an additional include file for a function in DoFRenumbering
that allows to sort the indices associated with degrees of freedom so that they are numbered according to the subdomain they are associated with:
And this is simply C++ again:
The last step is as in all previous programs:
ElasticProblem
class templateThe first real part of the program is the declaration of the main class. As mentioned in the introduction, almost all of this has been copied verbatim from step8, so we only comment on the few differences between the two tutorials. There is one (cosmetic) change in that we let solve
return a value, namely the number of iterations it took to converge, so that we can output this to the screen at the appropriate place.
The first change is that we have to declare a variable that indicates the MPI communicator over which we are supposed to distribute our computations.
Then we have two variables that tell us where in the parallel world we are. The first of the following variables, n_mpi_processes
, tells us how many MPI processes there exist in total, while the second one, this_mpi_process
, indicates which is the number of the present process within this space of processes (in MPI language, this corresponds to the rank of the process). The latter will have a unique value for each process between zero and (less than) n_mpi_processes
. If this program is run on a single machine without MPI support, then their values are 1
and 0
, respectively.
Next up is a streamlike variable pcout
. It is, in essence, just something we use for convenience: in a parallel program, if each process outputs status information, then there quickly is a lot of clutter. Rather, we would want to only have one process output everything once, for example the one with rank zero. At the same time, it seems silly to prefix every place where we create output with an if (my_rank==0)
condition.
To make this simpler, the ConditionalOStream class does exactly this under the hood: it acts as if it were a stream, but only forwards to a real, underlying stream if a flag is set. By setting this condition to this_mpi_process==0
(where this_mpi_process
corresponds to the rank of an MPI process), we make sure that output is only generated from the first process and that we don't get the same lines of output over and over again, once per process. Thus, we can use pcout
everywhere and in every process, but on all but one process nothing will ever happen to the information that is piped into the object via operator<<
.
The remainder of the list of member variables is fundamentally the same as in step8. However, we change the declarations of matrix and vector types to use parallel PETSc objects instead. Note that we do not use a separate sparsity pattern, since PETSc manages this internally as part of its matrix data structures.
The following is taken from step8 without change:
ElasticProblem
class implementationThe first step in the actual implementation is the constructor of the main class. Apart from initializing the same member variables that we already had in step8, we here initialize the MPI communicator variable we shall use with the global MPI communicator linking all processes together (in more complex applications, one could here use a communicator object that only links a subset of all processes), and call the Utilities::MPI helper functions to determine the number of processes and where the present one fits into this picture. In addition, we make sure that output is only generated by the (globally) first process. We do so by passing the stream we want to output to (std::cout
) and a true/false flag as arguments where the latter is determined by testing whether the process currently executing the constructor call is the first in the MPI universe.
Next, the function in which we set up the various variables for the global linear system to be solved needs to be implemented.
However, before we proceed with this, there is one thing to do for a parallel program: we need to determine which MPI process is responsible for each of the cells. Splitting cells among processes, commonly called "partitioning the mesh", is done by assigning a subdomain id to each cell. We do so by calling into the METIS library that does this in a very efficient way, trying to minimize the number of nodes on the interfaces between subdomains. Rather than trying to call METIS directly, we do this by calling the GridTools::partition_triangulation() function that does this at a much higher level of programming.
Following partitioning, we need to enumerate all degrees of freedom as usual. However, we would like to enumerate the degrees of freedom in a way so that all degrees of freedom associated with cells in subdomain zero (which resides on process zero) come before all DoFs associated with cells on subdomain one, before those on cells on process two, and so on. We need this since we have to split the global vectors for right hand side and solution, as well as the matrix into contiguous chunks of rows that live on each of the processors, and we will want to do this in a way that requires minimal communication. This particular enumeration can be obtained by reordering degrees of freedom indices using DoFRenumbering::subdomain_wise().
The final step of this initial setup is that we get ourselves an IndexSet that indicates the subset of the global number of unknowns this process is responsible for. (Note that a degree of freedom is not necessarily owned by the process that owns a cell just because the degree of freedom lives on this cell: some degrees of freedom live on interfaces between subdomains, and are consequently only owned by one of the processes adjacent to this interface.)
Before we move on, let us recall a fact already discussed in the introduction: The triangulation we use here is replicated across all processes, and each process has a complete copy of the entire triangulation, with all cells. Partitioning only provides a way to identify which cells out of all each process "owns", but it knows everything about all of them. Likewise, the DoFHandler object knows everything about every cell, in particular the degrees of freedom that live on each cell, whether it is one that the current process owns or not. This can not scale to large problems because eventually just storing the entire mesh, and everything that is associated with it, on every process will become infeasible if the problem is large enough. On the other hand, if we split the triangulation into parts so that every process stores only those cells it "owns" but nothing else (or, at least a sufficiently small fraction of everything else), then we can solve large problems if only we throw a large enough number of MPI processes at them. This is what we are going to in step40, for example, using the parallel::distributed::Triangulation class. On the other hand, most of the rest of what we demonstrate in the current program will actually continue to work whether we have the entire triangulation available, or only a piece of it.
We need to initialize the objects denoting hanging node constraints for the present grid. As with the triangulation and DoFHandler objects, we will simply store all constraints on each process; again, this will not scale, but we show in step40 how one can work around this by only storing on each MPI process the constraints for degrees of freedom that actually matter on this particular process.
Now we create the sparsity pattern for the system matrix. Note that we again compute and store all entries and not only the ones relevant to this process (see step18 or step40 for a more efficient way to handle this).
Now we determine the set of locally owned DoFs and use that to initialize parallel vectors and matrix. Since the matrix and vectors need to work in parallel, we have to pass them an MPI communication object, as well as information about the partitioning contained in the IndexSet locally_owned_dofs
. The IndexSet contains information about the global size (the total number of degrees of freedom) and also what subset of rows is to be stored locally. Note that the system matrix needs that partitioning information for the rows and columns. For square matrices, as it is the case here, the columns should be partitioned in the same way as the rows, but in the case of rectangular matrices one has to partition the columns in the same way as vectors are partitioned with which the matrix is multiplied, while rows have to partitioned in the same way as destination vectors of matrixvector multiplications:
We now assemble the matrix and right hand side of the problem. There are some things worth mentioning before we go into detail. First, we will be assembling the system in parallel, i.e., each process will be responsible for assembling on cells that belong to this particular process. Note that the degrees of freedom are split in a way such that all DoFs in the interior of cells and between cells belonging to the same subdomain belong to the process that owns
the cell. However, even then we sometimes need to assemble on a cell with a neighbor that belongs to a different process, and in these cases when we add up the local contributions into the global matrix or right hand side vector, we have to transfer these entries to the process that owns these elements. Fortunately, we don't have to do this by hand: PETSc does all this for us by caching these elements locally, and sending them to the other processes as necessary when we call the compress()
functions on the matrix and vector at the end of this function.
The second point is that once we have handed over matrix and vector contributions to PETSc, it is a) hard, and b) very inefficient to get them back for modifications. This is not only the fault of PETSc, it is also a consequence of the distributed nature of this program: if an entry resides on another processor, then it is necessarily expensive to get it. The consequence of this is that we should not try to first assemble the matrix and right hand side as if there were no hanging node constraints and boundary values, and then eliminate these in a second step (using, for example, AffineConstraints::condense()). Rather, we should try to eliminate hanging node constraints before handing these entries over to PETSc. This is easy: instead of copying elements by hand into the global matrix (as we do in step4), we use the AffineConstraints::distribute_local_to_global() functions to take care of hanging nodes at the same time. We also already did this in step6. The second step, elimination of boundary nodes, could also be done this way by putting the boundary values into the same AffineConstraints object as hanging nodes (see the way it is done in step6, for example); however, it is not strictly necessary to do this here because eliminating boundary values can be done with only the data stored on each process itself, and consequently we use the approach used before in step4, i.e., via MatrixTools::apply_boundary_values().
All of this said, here is the actual implementation starting with the general setup of helper variables. (Note that we still use the deal.II full matrix and vector types for the local systems as these are small and need not be shared across processes.)
The next thing is the loop over all elements. Note that we do not have to do all the work on every process: our job here is only to assemble the system on cells that actually belong to this MPI process, all other cells will be taken care of by other processes. This is what the ifclause immediately after the forloop takes care of: it queries the subdomain identifier of each cell, which is a number associated with each cell that tells us about the owner process. In more generality, the subdomain id is used to split a domain into several parts (we do this above, at the beginning of setup_system()
), and which allows to identify which subdomain a cell is living on. In this application, we have each process handle exactly one subdomain, so we identify the terms subdomain
and MPI process
.
Apart from this, assembling the local system is relatively uneventful if you have understood how this is done in step8. As mentioned above, distributing local contributions into the global matrix and right hand sides also takes care of hanging node constraints in the same way as is done in step6.
The next step is to "compress" the vector and the system matrix. This means that each process sends the additions that were made to those entries of the matrix and vector that the process did not own itself to the process that owns them. After receiving these additions from other processes, each process then adds them to the values it already has. These additions are combining the integral contributions of shape functions living on several cells just as in a serial computation, with the difference that the cells are assigned to different processes.
The global matrix and right hand side vectors have now been formed. We still have to apply boundary values, in the same way as we did, for example, in step3, step4, and a number of other programs.
The last argument to the call to MatrixTools::apply_boundary_values() below allows for some optimizations. It controls whether we should also delete entries (i.e., set them to zero) in the matrix columns corresponding to boundary nodes, or to keep them (and passing true
means: yes, do eliminate the columns). If we do eliminate columns, then the resulting matrix will be symmetric again if it was before; if we don't, then it won't. The solution of the resulting system should be the same, though. The only reason why we may want to make the system symmetric again is that we would like to use the CG method, which only works with symmetric matrices. The reason why we may not want to make the matrix symmetric is because this would require us to write into column entries that actually reside on other processes, i.e., it involves communicating data. This is always expensive.
Experience tells us that CG also works (and works almost as well) if we don't remove the columns associated with boundary nodes, which can be explained by the special structure of this particular nonsymmetry. To avoid the expense of communication, we therefore do not eliminate the entries in the affected columns.
Having assembled the linear system, we next need to solve it. PETSc offers a variety of sequential and parallel solvers, for which we have written wrappers that have almost the same interface as is used for the deal.II solvers used in all previous example programs. The following code should therefore look rather familiar.
At the top of the function, we set up a convergence monitor, and assign it the accuracy to which we would like to solve the linear system. Next, we create an actual solver object using PETSc's CG solver which also works with parallel (distributed) vectors and matrices. And finally a preconditioner; we choose to use a block Jacobi preconditioner which works by computing an incomplete LU decomposition on each diagonal block of the matrix. (In other words, each MPI process computes an ILU from the rows it stores by throwing away columns that correspond to row indices not stored locally; this yields a square matrix block from which we can compute an ILU. That means that if you run the program with only one process, then you will use an ILU(0) as a preconditioner, while if it is run on many processes, then we will have a number of blocks on the diagonal and the preconditioner is the ILU(0) of each of these blocks. In the extreme case of one degree of freedom per processor, this preconditioner is then simply a Jacobi preconditioner since the diagonal matrix blocks consist of only a single entry. Such a preconditioner is relatively easy to compute because it does not require any kind of communication between processors, but it is in general not very efficient for large numbers of processors.)
Following this kind of setup, we then solve the linear system:
The next step is to distribute hanging node constraints. This is a little tricky, since to fill in the value of a constrained node you need access to the values of the nodes to which it is constrained (for example, for a Q1 element in 2d, we need access to the two nodes on the big side of a hanging node face, to compute the value of the constrained node in the middle).
The problem is that we have built our vectors (in setup_system()
) in such a way that every process is responsible for storing only those elements of the solution vector that correspond to the degrees of freedom this process "owns". There are, however, cases where in order to compute the value of the vector entry for a constrained degree of freedom on one process, we need to access vector entries that are stored on other processes. PETSc (and, for that matter, the MPI model on which it is built) does not allow to simply query vector entries stored on other processes, so what we do here is to get a copy of the "distributed" vector where we store all elements locally. This is simple, since the deal.II wrappers have a conversion constructor for the deal.II Vector class. (This conversion of course requires communication, but in essence every process only needs to send its data to every other process once in bulk, rather than having to respond to queries for individual elements):
Of course, as in previous discussions, it is clear that such a step cannot scale very far if you wanted to solve large problems on large numbers of processes, because every process now stores all elements of the solution vector. (We will show how to do this better in step40.) On the other hand, distributing hanging node constraints is simple on this local copy, using the usual function AffineConstraints::distributed(). In particular, we can compute the values of all constrained degrees of freedom, whether the current process owns them or not:
Then transfer everything back into the global vector. The following operation copies those elements of the localized solution that we store locally in the distributed solution, and does not touch the other ones. Since we do the same operation on all processors, we end up with a distributed vector (i.e., a vector that on every process only stores the vector entries corresponding to degrees of freedom that are owned by this process) that has all the constrained nodes fixed.
We end the function by returning the number of iterations it took to converge, to allow for some output.
Using some kind of refinement indicator, the mesh can be refined. The problem is basically the same as with distributing hanging node constraints: in order to compute the error indicator (even if we were just interested in the indicator on the cells the current process owns), we need access to more elements of the solution vector than just those the current processor stores. To make this happen, we do essentially what we did in solve()
already, namely get a complete copy of the solution vector onto every process, and use that to compute. This is in itself expensive as explained above and it is particular unnecessary since we had just created and then destroyed such a vector in solve()
, but efficiency is not the point of this program and so let us opt for a design in which every function is as selfcontained as possible.
Once we have such a "localized" vector that contains all elements of the solution vector, we can compute the indicators for the cells that belong to the present process. In fact, we could of course compute all refinement indicators since our Triangulation and DoFHandler objects store information about all cells, and since we have a complete copy of the solution vector. But in the interest in showing how to operate in parallel, let us demonstrate how one would operate if one were to only compute some error indicators and then exchange the remaining ones with the other processes. (Ultimately, each process needs a complete set of refinement indicators because every process needs to refine their mesh, and needs to refine it in exactly the same way as all of the other processes.)
So, to do all of this, we need to:
Now all processes have computed error indicators for their own cells and stored them in the respective elements of the local_error_per_cell
vector. The elements of this vector for cells not owned by the present process are zero. However, since all processes have a copy of the entire triangulation and need to keep these copies in sync, they need the values of refinement indicators for all cells of the triangulation. Thus, we need to distribute our results. We do this by creating a distributed vector where each process has its share and sets the elements it has computed. Consequently, when you view this vector as one that lives across all processes, then every element of this vector has been set once. We can then assign this parallel vector to a local, nonparallel vector on each process, making all error indicators available on every process.
So in the first step, we need to set up a parallel vector. For simplicity, every process will own a chunk with as many elements as this process owns cells, so that the first chunk of elements is stored with process zero, the next chunk with process one, and so on. It is important to remark, however, that these elements are not necessarily the ones we will write to. This is a consequence of the order in which cells are arranged, i.e., the order in which the elements of the vector correspond to cells is not ordered according to the subdomain these cells belong to. In other words, if on this process we compute indicators for cells of a certain subdomain, we may write the results to more or less random elements of the distributed vector; in particular, they may not necessarily lie within the chunk of vector we own on the present process. They will subsequently have to be copied into another process' memory space, an operation that PETSc does for us when we call the compress()
function. This inefficiency could be avoided with some more code, but we refrain from it since it is not a major factor in the program's total runtime.
So here is how we do it: count how many cells belong to this process, set up a distributed vector with that many elements to be stored locally, copy over the elements we computed locally, and finally compress the result. In fact, we really only copy the elements that are nonzero, so we may miss a few that we computed to zero, but this won't hurt since the original values of the vector are zero anyway.
So now we have this distributed vector that contains the refinement indicators for all cells. To use it, we need to obtain a local copy and then use it to mark cells for refinement or coarsening, and actually do the refinement and coarsening. It is important to recognize that every process does this to its own copy of the triangulation, and does it in exactly the same way.
The final function of significant interest is the one that creates graphical output. This works the same way as in step8, with two small differences. Before discussing these, let us state the general philosophy this function will work: we intend for all of the data to be generated on a single process, and subsequently written to a file. This is, as many other parts of this program already discussed, not something that will scale. Previously, we had argued that we will get into trouble with triangulations, DoFHandlers, and copies of the solution vector where every process has to store all of the data, and that there will come to be a point where each process simply doesn't have enough memory to store that much data. Here, the situation is different: it's not only the memory, but also the run time that's a problem. If one process is responsible for processing all of the data while all of the other processes do nothing, then this one function will eventually come to dominate the overall run time of the program. In particular, the time this function takes is going to be proportional to the overall size of the problem (counted in the number of cells, or the number of degrees of freedom), independent of the number of processes we throw at it.
Such situations need to be avoided, and we will show in step18 and step40 how to address this issue. For the current problem, the solution is to have each process generate output data only for its own local cells, and write them to separate files, one file per process. This is how step18 operates. Alternatively, one could simply leave everything in a set of independent files and let the visualization software read all of them (possibly also using multiple processors) and create a single visualization out of all of them; this is the path step40, step32, and all other parallel programs developed later on take.
More specifically for the current function, all processes call this function, but not all of them need to do the work associated with generating output. In fact, they shouldn't, since we would try to write to the same file multiple times at once. So we let only the first process do this, and all the other ones idle around during this time (or start their work for the next iteration, or simply yield their CPUs to other jobs that happen to run at the same time). The second thing is that we not only output the solution vector, but also a vector that indicates which subdomain each cell belongs to. This will make for some nice pictures of partitioned domains.
To implement this, process zero needs a complete set of solution components in a local vector. Just as with the previous function, the efficient way to do this would be to reuse the vector already created in the solve()
function, but to keep things more selfcontained, we simply recreate one here from the distributed solution vector.
An important thing to realize is that we do this localization operation on all processes, not only the one that actually needs the data. This can't be avoided, however, with the simplified communication model of MPI we use for vectors in this tutorial program: MPI does not have a way to query data on another process, both sides have to initiate a communication at the same time. So even though most of the processes do not need the localized solution, we have to place the statement converting the distributed into a localized vector so that all processes execute it.
(Part of this work could in fact be avoided. What we do is send the local parts of all processes to all other processes. What we would really need to do is to initiate an operation on all processes where each process simply sends its local chunk of data to process zero, since this is the only one that actually needs it, i.e., we need something like a gather operation. PETSc can do this, but for simplicity's sake we don't attempt to make use of this here. We don't, since what we do is not very expensive in the grand scheme of things: it is one vector communication among all processes, which has to be compared to the number of communications we have to do when solving the linear system, setting up the blockILU for the preconditioner, and other operations.)
This being done, process zero goes ahead with setting up the output file as in step8, and attaching the (localized) solution vector to the output object.
The only other thing we do here is that we also output one value per cell indicating which subdomain (i.e., MPI process) it belongs to. This requires some conversion work, since the data the library provides us with is not the one the output class expects, but this is not difficult. First, set up a vector of integers, one per cell, that is then filled by the subdomain id of each cell.
The elements of this vector are then converted to a floating point vector in a second step, and this vector is added to the DataOut object, which then goes off creating output in VTK format:
Lastly, here is the driver function. It is almost completely unchanged from step8, with the exception that we replace std::cout
by the pcout
stream. Apart from this, the only other cosmetic change is that we output how many degrees of freedom there are per process, and how many iterations it took for the linear solver to converge:
main
functionThe main()
works the same way as most of the main functions in the other example programs, i.e., it delegates work to the run
function of a managing object, and only wraps everything into some code to catch exceptions:
Here is the only real difference: MPI and PETSc both require that we initialize these libraries at the beginning of the program, and uninitialize them at the end. The class MPI_InitFinalize takes care of all of that. The trailing argument 1
means that we do want to run each MPI process with a single thread, a prerequisite with the PETSc parallel linear algebra.
If the program above is compiled and run on a single processor machine, it should generate results that are very similar to those that we already got with step8. However, it becomes more interesting if we run it on a multicore machine or a cluster of computers. The most basic way to run MPI programs is using a command line like
to run the step17 executable with 32 processors.
(If you work on a cluster, then there is typically a step in between where you need to set up a job script and submit the script to a scheduler. The scheduler will execute the script whenever it can allocate 32 unused processors for your job. How to write such job scripts differs from cluster to cluster, and you should find the documentation of your cluster to see how to do this. On my system, I have to use the command qsub
with a whole host of options to run a job in parallel.)
Whether directly or through a scheduler, if you run this program on 8 processors, you should get output like the following:
(This run uses a few more refinement cycles than the code available in the examples/ directory. The run also used a version of METIS from 2004 that generated different partitionings; consequently, the numbers you get today are slightly different.)
As can be seen, we can easily get to almost four million unknowns. In fact, the code's runtime with 8 processes was less than 7 minutes up to (and including) cycle 14, and 14 minutes including the second to last step. (These are numbers relevant to when the code was initially written, in 2004.) I lost the timing information for the last step, though, but you get the idea. All this is after release mode has been enabled by running make release
, and with the generation of graphical output switched off for the reasons stated in the program comments above. (See also video lecture 18.) The biggest 2d computations I did had roughly 7.1 million unknowns, and were done on 32 processes. It took about 40 minutes. Not surprisingly, the limiting factor for how far one can go is how much memory one has, since every process has to hold the entire mesh and DoFHandler objects, although matrices and vectors are split up. For the 7.1M computation, the memory consumption was about 600 bytes per unknown, which is not bad, but one has to consider that this is for every unknown, whether we store the matrix and vector entries locally or not.
Here is some output generated in the 12th cycle of the program, i.e. with roughly 300,000 unknowns:
As one would hope for, the x (left) and ydisplacements (right) shown here closely match what we already saw in step8. As shown there and in step22, we could as well have produced a vector plot of the displacement field, rather than plotting it as two separate scalar fields. What may be more interesting, though, is to look at the mesh and partition at this step:
Again, the mesh (left) shows the same refinement pattern as seen previously. The right panel shows the partitioning of the domain across the 8 processes, each indicated by a different color. The picture shows that the subdomains are smaller where mesh cells are small, a fact that needs to be expected given that the partitioning algorithm tries to equilibrate the number of cells in each subdomain; this equilibration is also easily identified in the output shown above, where the number of degrees per subdomain is roughly the same.
It is worth noting that if we ran the same program with a different number of processes, that we would likely get slightly different output: a different mesh, different number of unknowns and iterations to convergence. The reason for this is that while the matrix and right hand side are the same independent of the number of processes used, the preconditioner is not: it performs an ILU(0) on the chunk of the matrix of each processor separately. Thus, it's effectiveness as a preconditioner diminishes as the number of processes increases, which makes the number of iterations increase. Since a different preconditioner leads to slight changes in the computed solution, this will then lead to slightly different mesh cells tagged for refinement, and larger differences in subsequent steps. The solution will always look very similar, though.
Finally, here are some results for a 3d simulation. You can repeat these by changing
to
in the main function. If you then run the program in parallel, you get something similar to this (this is for a job with 16 processes):
The last step, going up to 1.5 million unknowns, takes about 55 minutes with 16 processes on 8 dualprocessor machines (of the kind available in 2003). The graphical output generated by this job is rather large (cycle 5 already prints around 82 MB of data), so we contend ourselves with showing output from cycle 4:
The left picture shows the partitioning of the cube into 16 processes, whereas the right one shows the xdisplacement along two cutplanes through the cube.
The program keeps a complete copy of the Triangulation and DoFHandler objects on every processor. It also creates complete copies of the solution vector, and it creates output on only one processor. All of this is obviously the bottleneck as far as parallelization is concerned.
Internally, within deal.II, parallelizing the data structures used in hierarchic and unstructured triangulations is a hard problem, and it took us a few more years to make this happen. The step40 tutorial program and the Parallel computing with multiple processors using distributed memory documentation module talk about how to do these steps and what it takes from an application perspective. An obvious extension of the current program would be to use this functionality to completely distribute computations to many more processors than used here.