Numerical and Parallel Utilities¶
Tridiagonal System Solver¶
Description¶
To solve systems of the form \(Ax=b\), where \(A\) is a tridiagonal matrix, Selalib offers a native, robust tridiagonal system solver. The present implementation is serial only. The algorithm is based on an LU factorization of the given matrix, with row pivoting. The tridiagonal matrix must be given as a single array, in the packed memory layout expected by the solver.
Exposed Interface¶
Factorization of the matrix \(A\) is obtained through a call to the subroutine
sll_setup_cyclic_tridiag( a, n, lu, ipiv )
where a is the matrix to be factorized, n is the problem size (the number of unknowns), lu is a real array of size \(7n\) in which the factorization information will be returned, and ipiv is an integer array of length n in which pivoting information will be returned. From the user's perspective, lu and ipiv are simply work arrays that sll_setup_cyclic_tridiag requires; they need no further consideration.
The solution of a tridiagonal system, once the original array \(A\) has been factorized, is obtained through a call to
sll_solve_cyclic_tridiag( lu, ipiv, b, n, x )
where lu and ipiv are the arrays returned by sll_setup_cyclic_tridiag(), b is the right-hand side of the original matrix equation, n is the system size and x is the array in which the solution will be returned.
Usage¶
To use the module in a stand-alone way, include the line:
use sll_tridiagonal
The following code snippet is an example of the use of the tridiagonal solver.
sll_int32 :: n = 1000
sll_int32 :: ierr
sll_real64, allocatable, dimension(:) :: a    ! packed tridiagonal matrix
sll_real64, allocatable, dimension(:) :: b    ! right-hand side
sll_real64, allocatable, dimension(:) :: lu
sll_int32, allocatable, dimension(:) :: ipiv
sll_real64, allocatable, dimension(:) :: x
SLL_ALLOCATE( lu(7*n), ierr )
SLL_ALLOCATE( ipiv(n), ierr )
SLL_ALLOCATE( x(n), ierr )
! allocate a(:) and b(1:n), then initialize a(:) with the proper
! coefficients and b(:) with the right-hand side here...
and then:
call sll_setup_cyclic_tridiag( a, n, lu, ipiv )
call sll_solve_cyclic_tridiag( lu, ipiv, b, n, x )
SLL_DEALLOCATE_ARRAY( lu, ierr )
SLL_DEALLOCATE_ARRAY( ipiv, ierr )
SLL_DEALLOCATE_ARRAY( x, ierr )
Note that if the last call had been made as
call sll_solve_cyclic_tridiag( lu, ipiv, b, n, b )
the system would have been solved in place.
Boundary Condition Descriptors¶
Description¶
This tiny module defines a few symbols that are meant to be used library-wide to indicate the various boundary condition types or their combinations. The objective is to have pre-defined symbols that avoid the need to specify this information with explicit numeric values.
Exposed Interface¶
The defined symbols are:
SLL_USER_DEFINED
SLL_PERIODIC
SLL_DIRICHLET
SLL_NEUMANN
SLL_HERMITE
SLL_NEUMANN_MODE_0
SLL_SET_TO_LIMIT
Usage¶
Simply pass the appropriate symbol as an argument to any routine call that requires a boundary condition descriptor, as in the sketch below.
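For instance (a hedged sketch; the spline constructor is described in the next section and exercised in its unit test below):
! Select periodic boundary conditions for a 1D cubic spline by
! passing the SLL_PERIODIC descriptor instead of a bare integer.
sp1 => new_spline_1D( NP, 0.0_f64, sll_pi, SLL_PERIODIC )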
Cubic Splines¶
Description¶
The cubic splines module provides capabilities for 1D and 2D data
interpolation with cubic B-splines and different boundary conditions (at the time of this writing: periodic and Hermite). The data to be
interpolated are represented by a simple array. The spline coefficients
and other information are stored in a spline object, which is also used
to interpolate the fitted data. To use this module, simply declare an
instance of the type to be used, initialize the object with the desired
parameters, use the initialized object to do interpolations and when no
longer needed, release the resources through a call to the
sll_delete( )
routine. More details below.
Exposed Interface¶
Fundamental types:
sll_cubic_spline_1d
sll_cubic_spline_2d
These types are declared as pointers and only manipulated through the functions and subroutines described below. For more explicit examples, see the usage section.
For the 1D case, the available routines are:
new_cubic_spline_1D( num_points,
xmin,
xmax,
bc_type,
[slope_L],
[slope_R] )
compute_cubic_spline_1D( data, spline_object )
interpolate_value( x, spline_object )
interpolate_derivative( x, spline_object )
interpolate_array_values( a_in, a_out, np, spline_object )
interpolate_pointer_values( a_in, a_out, np, spline_object )
interpolate_array_derivatives( a_in, a_out, np, spline_object )
interpolate_pointer_derivatives( ptr_in, ptr_out, np, spline_object )
get_x1_min( spline_object )
get_x1_max( spline_object )
get_x1_delta( spline_object )
sll_delete( spline_object )
In the above list:
new_cubic_spline_1D()
is responsible for allocating all the necessary storage for the spline object and initializing it. Arguments:
- num_points: 32-bit integer. Number of data points for which the spline needs to be computed, including the end-points (regardless of the boundary condition used).
- xmin: Double precision real. Lower bound of the domain in which the data array is defined. In other words, if we think of the data array (indexed 1:NP) as the values of a discrete function f defined over a sequence of \(x_i\)'s, then \(x_1 = xmin\).
- xmax: Double precision real. Similarly to xmin, xmax represents the maximum extent of the domain. For data indexed 1:NP, \(xmax = x_{NP}\).
- bc_type: Pre-defined parameter. Descriptor of the type of boundary condition to be imposed on the spline. Use only the symbols defined in Selalib's boundary condition descriptors module, presently one of PERIODIC_SPLINE or HERMITE_SPLINE. These are really aliases to integer flags, but in Selalib we avoid the use of non-descriptive flags.
- slope_L: Double precision real. Optional value representing the desired slope at \(x = xmin\) in the Hermite BC case.
- slope_R: Double precision real. Optional value representing the desired slope at \(x = xmax\) in the Hermite BC case.
compute_cubic_spline_1D()
calculates the spline coefficients needed to
represent the given data and stores this information in the spline
object. Arguments:
- data: Double precision 1D real array containing NP values.
- spline_object: Initialized object of type sll_cubic_spline_1d.
interpolate_value() returns the value of f(x), where x is a floating-point value between xmin and xmax, and f is a continuous function built with cubic B-splines and the user-defined boundary conditions contained in the spline object passed. Essentially, the spline object, together with the interpolate_value() function, creates the illusion of having a continuous function available when originally using the given discrete data. Arguments:
- x: Double precision value representing the ordinate whose image is sought.
- spline_object: Initialized object of type sll_cubic_spline_1d. A call to compute_cubic_spline_1D() should have been made prior to calling this function.
interpolate_derivative() returns the value of f'(x), where x is a floating-point value between xmin and xmax, and f is a continuous function built with cubic B-splines and the user-defined boundary conditions contained in the spline object passed. It takes the same types of arguments as interpolate_value().
interpolate_array_values() is a subroutine with functionality analogous to interpolate_value(), but it is capable of processing whole arrays of values of x. This is useful to avoid the overhead associated with calling interpolate_value() inside a loop (see the sketch after this list). Arguments:
- a_in: [in] Double precision 1D array containing the values of the ordinates to be interpolated.
- a_out: [out] Double precision 1D array to store the results, i.e., the images of the values contained in a_in.
- np: [in] 32-bit integer storing the number of points to be interpolated.
- spline_object: [inout] Initialized object of type sll_cubic_spline_1d.
interpolate_pointer_values() is a subroutine analogous to interpolate_array_values(), except that it receives pointers instead of arrays.
interpolate_array_derivatives()
computes multiple first-derivative
values. Its interface is identical to interpolate_array_values()
.
interpolate_pointer_derivatives()
computes multiple first-derivative
values, just as interpolate_array_derivatives()
except that its
arguments are pointers.
get_x1_min() returns the minimum value of the domain in which the spline object is defined. Arguments:
- spline_object: a pointer to a previously initialized object of type sll_cubic_spline_1d.
get_x1_max() returns the maximum value of the domain in which the spline object is defined. Arguments:
- spline_object: a pointer to a previously initialized object of type sll_cubic_spline_1d.
get_x1_delta() returns the spacing between the points used to compute the spline values. This is determined from the extent of the domain and the number of cells used. Arguments:
- spline_object: a pointer to a previously initialized object of type sll_cubic_spline_1d.
sll_delete() is a subroutine that frees the resources of a given spline object. Arguments:
- spline_object: a pointer to a previously initialized object of type sll_cubic_spline_1d.
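As a hedged sketch of the array-based interface (the names x_in and f_out are ours, not part of the library; sp is an initialized spline on which compute_cubic_spline_1D() has already been called):
! Interpolate many points with one call instead of looping over
! interpolate_value().
sll_real64, dimension(np) :: x_in   ! ordinates in [xmin, xmax]
sll_real64, dimension(np) :: f_out  ! receives the interpolated values
! ... fill x_in ...
call interpolate_array_values( x_in, f_out, np, sp )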
For the 2D case, the available routines are:
new_cubic_spline_2d( num_pts_x1,
num_pts_x2,
x1_min,
x1_max,
x2_min,
x2_max,
x1_bc_type,
x2_bc_type,
const_slope_x1_min,
const_slope_x1_max,
const_slope_x2_min,
const_slope_x2_max,
x1_min_slopes,
x1_max_slopes,
x2_min_slopes,
x2_max_slopes )
compute_cubic_spline_2d( data, spline_object )
interpolate_value_2d( x1, x2, spline_object )
interpolate_x1_derivative_2d( x1, x2, spline_object )
interpolate_x2_derivative_2d( x1, x2, spline_object )
get_x1_min( spline_object )
get_x1_max( spline_object )
get_x2_min( spline_object )
get_x2_max( spline_object )
get_x1_delta( spline_object )
get_x2_delta( spline_object )
sll_delete( spline_object )
In the above list:
new_cubic_spline_2D() is responsible for allocating all the necessary storage for the spline object and initializing it. Arguments:
- num_pts_x1: [in] 32-bit integer. Number of data points in the first (memory-contiguous) direction for which the spline needs to be fitted. Endpoints at \(x_{1,j} = x1_{min}\) and \(x_{NPX1,j} = x1_{max}\) must be included.
- num_pts_x2: [in] 32-bit integer. Number of data points in the second direction for which the spline needs to be fitted. Endpoints at \(x_{i,1} = x2_{min}\) and \(x_{i,NPX2} = x2_{max}\) must be included.
- x1_min: [in] Double precision real. Lower bound of the domain in which the data array is defined in the first (memory-contiguous) direction.
- x1_max: [in] Double precision real. Maximum extent of the domain in the first (memory-contiguous) direction.
- x2_min: [in] Double precision real. Lower bound of the domain in which the data array is defined in the second direction.
- x2_max: [in] Double precision real. Maximum extent of the domain in the second direction.
- x1_bc_type: [in] Pre-defined parameter. Descriptor of the type of boundary condition to be imposed on the spline in the first direction; presently one of PERIODIC_SPLINE or HERMITE_SPLINE, as defined in Selalib's boundary condition descriptors module.
- x2_bc_type: [in] Pre-defined parameter. Descriptor of the type of boundary condition to be imposed on the spline in the second direction; presently one of PERIODIC_SPLINE or HERMITE_SPLINE, as defined in Selalib's boundary condition descriptors module.
- const_slope_x1_min: [optional, in] Double precision real. Desired slope at \(x1_{min}\) in the Hermite BC case, if a constant value for the whole edge is to be specified.
- const_slope_x1_max: [optional, in] Double precision real. Desired slope at \(x1_{max}\) in the Hermite BC case, if a constant value for the whole edge is to be specified.
- const_slope_x2_min: [optional, in] Double precision real. Desired slope at \(x2_{min}\) in the Hermite BC case, if a constant value for the whole edge is to be specified.
- const_slope_x2_max: [optional, in] Double precision real. Desired slope at \(x2_{max}\) in the Hermite BC case, if a constant value for the whole edge is to be specified.
- x1_min_slopes: [optional, in] Array with the values of the slopes at each of the num_pts_x2 locations on the edge at \(x1_{min}\). To be used with the Hermite boundary condition only.
- x1_max_slopes: [optional, in] Array with the values of the slopes at each of the num_pts_x2 locations on the edge at \(x1_{max}\). To be used with the Hermite boundary condition only.
- x2_min_slopes: [optional, in] Array with the values of the slopes at each of the num_pts_x1 locations on the edge at \(x2_{min}\). To be used with the Hermite boundary condition only.
- x2_max_slopes: [optional, in] Array with the values of the slopes at each of the num_pts_x1 locations on the edge at \(x2_{max}\). To be used with the Hermite boundary condition only.
compute_cubic_spline_2D()
calculates the spline coefficients needed to
represent the given 2D data and stores this information in the spline
object. Arguments:
- data: [in] Double precision 2D real array containing NPX1*NPX2 values.
- spline_object: Initialized object of type sll_cubic_spline_2d.
interpolate_value_2d() returns the value of f(x1,x2), where the ordered pair \((x1,x2)\) is in the domain \([x1_{min},x1_{max}] \times [x2_{min},x2_{max}]\) and f is a continuous function built with cubic B-splines and the user-defined boundary conditions contained in the spline object passed. Arguments:
- x1: [in] Double precision value representing the value of the first coordinate.
- x2: [in] Double precision value representing the value of the second coordinate.
- spline_object: Initialized object of type sll_cubic_spline_2d. A call to compute_cubic_spline_2D() should have been made prior to calling this function.
interpolate_x1_derivative_2d() returns the value of the first derivative in the \(x1\) direction. It takes the same types of arguments as the interpolate_value_2d() function.
interpolate_x2_derivative_2d() returns the value of the first derivative in the \(x2\) direction. It takes the same types of arguments as the interpolate_value_2d() function.
get_x1_min() returns the minimum value of \(x1\) in the domain in which the spline object is defined. Arguments:
- spline_object: a pointer to a previously initialized object of type sll_cubic_spline_2d.
get_x1_max() returns the maximum value of \(x1\) in the domain in which the spline object is defined. Arguments:
- spline_object: a pointer to a previously initialized object of type sll_cubic_spline_2d.
get_x2_min() returns the minimum value of \(x2\) in the domain in which the spline object is defined. Arguments:
- spline_object: a pointer to a previously initialized object of type sll_cubic_spline_2d.
get_x2_max() returns the maximum value of \(x2\) in the domain in which the spline object is defined. Arguments:
- spline_object: a pointer to a previously initialized object of type sll_cubic_spline_2d.
get_x1_delta() returns the spacing between the points used to compute the spline values in the \(x1\) direction. This is determined from the extent of the domain and the number of cells used. Arguments:
- spline_object: a pointer to a previously initialized object of type sll_cubic_spline_2d.
get_x2_delta() returns the spacing between the points used to compute the spline values in the \(x2\) direction. This is determined from the extent of the domain and the number of cells used. Arguments:
- spline_object: a pointer to a previously initialized object of type sll_cubic_spline_2d.
sll_delete() is a subroutine that frees the resources of a given spline object. Arguments:
- spline_object: a pointer to a previously initialized object of type sll_cubic_spline_2d.
Usage¶
To use the module in a stand-alone way, include the line:
use sll_splines
The following example is an extract from the module’s unit test.
program spline_tester
#include "sll_working_precision.h"
#include "sll_assert.h"
#include "sll_memory.h"
use sll_splines
use numeric_constants
implicit none
#define NP 5000
sll_int32 :: err
sll_int32 :: i
type(sll_spline_1d), pointer :: sp1
type(sll_spline_1d), pointer :: sp2
sll_real64, allocatable, dimension(:) :: data
sll_real64 :: accumulator1, accumulator2
sll_real64 :: val
accumulator1 = 0.0_f64
accumulator2 = 0.0_f64
SLL_ALLOCATE(data(NP), err)
print *, 'initialize data array'
do i=1,NP
data(i) = sin((i-1)*sll_pi/real(NP-1,f64))
end do
sp1 => new_spline_1D( NP, &
0.0_f64, &
sll_pi, &
SLL_PERIODIC )
call compute_cubic_spline_1D( data, sp1 )
sp2 => new_spline_1D( NP, 0.0_f64, &
sll_pi, &
SLL_HERMITE, &
1.0_f64, &
-1.0_f64 )
call compute_cubic_spline_1D( data, sp2 )
print *, 'cumulative errors at nodes: '
do i=1, NP-1
val = real(i-1,f64)*sll_pi/real(NP-1,f64)
accumulator1 = accumulator1 + abs(data(i) - &
interpolate_value(val, sp1))
end do
print *, 'hermite case: '
do i=1, NP
val = real(i-1,f64)*sll_pi/real(NP-1,f64)
accumulator2 = accumulator2 + abs(data(i) - &
interpolate_value(val, sp2))
end do
print *, 'Periodic case: '
print *, 'average error at the nodes = '
print *, accumulator1/real(NP,f64)
call delete_spline_1D(sp1)
if( accumulator1/real(NP,f64) < 1.0e-15 ) then
print *, 'PASSED TEST'
else
print *, 'FAILED TEST'
end if
print *, '**************************** '
print *, 'Hermite case: '
print *, 'average error at the nodes = '
print *, accumulator2/real(NP,f64)
call delete_spline_1D(sp2)
if( accumulator2/real(NP,f64) < 1.0e-15 ) then
print *, 'PASSED TEST'
else
print *, 'FAILED TEST'
end if
end program spline_tester
Here we do not go in detail over every line, but only highlight those in which we interact with the splines module:
- use sll_splines: imports the spline module. The intent is eventually to require only a single module, say 'selalib', which will itself include all modules. For now, this is the way to include these individual capabilities.
- type(sll_spline_1d), pointer :: sp1, sp2: declaration of the spline pointers.
- The two new_spline_1D() calls: allocation and partial initialization of the splines. Note the => pointer assignment syntax, and that the two calls exercise both forms of the interface, without and with the optional slope arguments.
- The compute_cubic_spline_1D() calls: calculation of the spline coefficients. After this call the spline object becomes fully usable.
- The interpolate_value() calls: value interpolation using the existing splines.
- The delete_spline_1D() calls: destruction of the spline objects.
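The unit test above exercises only the 1D interface. The following is a minimal, hedged sketch of the 2D interface assembled from the routines described earlier (the SLL_PERIODIC descriptor is used by analogy with the 1D test; the array sizes and domain are illustrative):
type(sll_cubic_spline_2d), pointer :: sp2d
sll_real64, dimension(64,32)       :: data2d
sll_real64                         :: v
! build a spline over [0,1] x [0,1] with periodic BCs in both directions
sp2d => new_cubic_spline_2d( 64, 32, 0.0_f64, 1.0_f64, &
                             0.0_f64, 1.0_f64, &
                             SLL_PERIODIC, SLL_PERIODIC )
! ... fill data2d with the 64*32 samples to be fitted ...
call compute_cubic_spline_2d( data2d, sp2d )
! evaluate the interpolant and its x1-derivative at an interior point
v = interpolate_value_2d( 0.5_f64, 0.5_f64, sp2d )
v = interpolate_x1_derivative_2d( 0.5_f64, 0.5_f64, sp2d )
call sll_delete( sp2d )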
Gauss-Legendre Integrator¶
Description¶
This is a low-level mathematical utility that applies the Gauss-Legendre method to compute numeric integrals. This module aims at providing a single interface to the process of integrating a function on a given interval.
Exposed Interface¶
To integrate the function f(x)
(real-valued and of a single,
real-valued argument x
) over the interval \([a, b]\), the simplest way
is through a function call such as:
gauss_legendre_integrate_1D(f, a, b, n)
In the function above, n
represents the desired number of Gauss
points used in the calculation:
\[\int_{-1}^{1} f(x)\,\mathrm{d}x \approx \sum_{k=1}^{n} w_k f(x_k)\]
Presently, the implementation accepts values of n between \(2\) and \(10\) inclusive. The function gauss_legendre_integrate_1D internally performs the proper scaling of the points to adjust the integral to the desired interval.
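Concretely, the scaling is the standard affine change of variables from \([-1,1]\) to \([a,b]\):
\[\int_a^b f(x)\,\mathrm{d}x = \frac{b-a}{2}\sum_{k=1}^{n} w_k\, f\!\left(\frac{b-a}{2}\,x_k + \frac{a+b}{2}\right)\]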
The function gauss_legendre_integrate_1D
is a generic interface and as
such, it hides some alternative integrators, which are selected
depending on the type of the passed arguments. For instance, we have
available the function
gauss_legendre_integrate_interpolated_1D(f, spline, a, b, n)
which integrates a function represented by a spline object. The function
f
in this case is the spline interpolation function. It looks like
this interface could be simplified and we could eliminate the first
parameter and pass only the spline object. The only reason to leave the
interpolation function as an argument is if we find some compelling
reason to parametrize the interpolation function as well.
It will be necessary to implement other integrators for functions with different signatures, such as two- or three-parameter functions. It might also be necessary to distinguish between one-, two- and three-dimensional integration. While this is not yet implemented, here we lay out some suggestions on how to proceed in such cases.
For the class of integrals that are done in one dimension, such as the above, the cleanest but somewhat more laborious approach appears to be to write a different integrator for every function signature that is needed. For instance, one may need to write specialized integrators for \(f(x_1, x_2)\), \(f(x_1, x_2, x_3)\) and so on, with the convention that the integral is carried out over, say, the first of the variables. The variables which are not integrated can be used as parameters. The alternative to this approach would be to write a single integrator that is able to receive multiple parameters through the use of arrays or derived types, but the ugliness of this approach, and the need to write glue code (packing and unpacking the arrays or derived types with variables and parameters) every time one wants to integrate something, are reasons to reject it.
Usage¶
As mentioned above, the name of the generic function that hides the
specialized functions is gauss_legendre_integrate_1D
. The specialized
functions can be individually called to avoid the overhead of the
generic function call if desired. A one-dimensional function (user or
Fortran) can be integrated by a call like:
gauss_legendre_integral_1D( test_function, &
                            0.0_f64, &
                            sll_pi/2.0_f64, &
                            4 )
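As a hedged sketch, an integrand compatible with the call above could look as follows (test_function is hypothetical, not part of the library; any real-valued function of a single real argument will do):
! Hypothetical user-supplied integrand for the call shown above.
function test_function( x )
  sll_real64             :: test_function
  sll_real64, intent(in) :: x
  test_function = cos(x)
end function test_function
With 4 Gauss points, the call above should then return a value very close to the exact integral \(\sin(\pi/2) = 1\).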
A function that is represented by an underlying spline object can be called like:
gauss_legendre_integral_interpolated_1D( interpolate_value, &
                                         sp1, &
                                         0.0_f64, &
                                         sll_pi, &
                                         4 )
where sp1 is a spline object. It remains to be decided whether this is indeed the desired interface, or whether something simpler should be implemented instead.
FFT¶
Description¶
The FFT module intends to provide a unified interface to native or external-library FFT functions. This module plays a role analogous to the Collective module, explained below, but applied to FFT capabilities instead of a parallelization library like MPI. In this sense, the sll_fft module provides a set of types, functions and subroutines that can in principle be implemented with a choice of external libraries.
Exposed Interface¶
The interface to the FFT functions is inspired by the FFTW interface in which the FFT operation is carried out by the creation of a plan followed by the execution of this plan. The “plan” stores all relevant information (twiddle factor arrays, auxiliary arrays, etc.) that may be needed by a given type of FFT operation. The application of the plan executes the operation itself on a given data. Like all other native types in Selalib, the FFT plan is declared through a pointer, which must be allocated and initialized. To declare, call:
type(sll_fft_plan), pointer :: fft_plan
followed by an initialization step:
fft_plan => sll_new_fft( sample_number,
data_type,
fft_flags )
where the data_type parameter can take either of the values FFT_REAL or FFT_COMPLEX. Presently we only provide double precision transforms (i.e., 64-bit floating-point numbers). The fft_flags argument can take the values FFT_NORMALIZE_FORWARD and/or FFT_NORMALIZE_INVERSE. If no flags are present, an unnormalized FFT will be executed. The user may request that the FFT be normalized in both directions by combining the flags with a + sign.
Once created, the plan is applied to the data to carry out the transform itself, as described at the beginning of this section.
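A hedged sketch of the declaration and initialization steps, using only the names introduced above (the sample size 64 is illustrative; the execution call is omitted since its signature is not described here):
type(sll_fft_plan), pointer :: fft_plan
! complex-to-complex, double precision plan on 64 samples,
! normalized in both directions by combining the two flags
fft_plan => sll_new_fft( 64, FFT_COMPLEX, &
                         FFT_NORMALIZE_FORWARD + FFT_NORMALIZE_INVERSE )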
Collective Communications¶
Description¶
Selalib applies the principle of modularization throughout all levels of abstraction of the library and aims at keeping third-party library modules as what they are: separate library modules. Therefore, in its current design, even a library like MPI has a single point of entry into Selalib. The collective communications module is that point of entry. We thus focus on the functionality offered by MPI, select its most desirable features and write wrappers around them. These wrappers are the functions that are actually used throughout the program. This allows us to adjust the exposed interfaces, do additional error-checking, and would even permit completely changing the means of parallelizing a code, since MPI could be replaced by editing a single file if this were ever needed.
Exposed Interface¶
Fundamental type:
sll_collective_t
Constructors, destructors and access functions:
sll_new_collective( parent_col )
sll_delete_collective( col )
When the Selalib environment is activated there exists, in exact analogy with MPI_COMM_WORLD, a global collective named sll_world_collective. At
the beginning of a program execution, this is the only collective in
existence. Further collectives can be created down the road. The above
functions are responsible for the creation and destruction of such
collectives. The following functions are used to access the values that
a particular collective knows about.
sll_get_collective_rank( col )
sll_get_collective_size( col )
sll_get_collective_color( col )
sll_get_collective_comm( col )
Since the wrapped library requires initialization, so does
sll_collective
. To start and end the parallel environment, the user
needs to call the functions:
sll_boot_collective( )
sll_halt_collective( )
These functions would not be exposed at the top level, and would be
hidden by a further call to something akin to boot_selalib
and
halt_selalib
. Finally, the wrappers around the standard MPI
capabilities are presently exposed through the following generic
functions:
sll_collective_bcast( col, buffer, size, root )
sll_collective_gather( col, send_buf, send_sz, root,
rec_buf )
sll_collective_gatherv( col, send_buf, send_sz, recvcnts,
displs, root, recv_buf )
sll_collective_allgatherv( col, send_buf, send_sz, sizes,
displs, rec_buf )
sll_collective_scatter( col, send_buf, size, root,
rec_buf )
sll_collective_scatterv( col, send_buf, sizes, displs,
rec_szs, root, rec_buf )
sll_collective_all_reduce( col, send_buf, count, op,
rec_buf )
which presently stand for specialized versions that operate on specific types. For instance:
sll_collective_all_to_allv_real( send_buf,
send_cnts,
send_displs,
recv_buf,
recv_cnts,
recv_displs,
col )
Usage¶
To use the module as stand-alone, include the line:
use sll_collective
Any use of the module's functionalities must be preceded by calling
call sll_boot_collective()
and to “turn off” the parallel capabilities, one should finish by a call to:
call sll_halt_collective()
This booting of the parallel environment needs to be done only once in a program.
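A minimal, hedged example of the module in action (a broadcast of a small real array from rank 0, using only the wrappers listed above; the buffer size and contents are illustrative):
program collective_example
use sll_collective
#include "sll_working_precision.h"
implicit none
sll_real64, dimension(4) :: buf
sll_int32 :: myrank
call sll_boot_collective()
myrank = sll_get_collective_rank( sll_world_collective )
! only the root rank fills the buffer...
if ( myrank == 0 ) buf = 1.0_f64
! ...and after the broadcast every rank holds a copy
call sll_collective_bcast( sll_world_collective, buf, 4, 0 )
call sll_halt_collective()
end program collective_example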
Remapper¶
Description¶
Written on top of sll_collective
, the remapper is a powerful module
capable of rearranging non-overlapping data in a parallel machine. This
module also implements the concept of a layout, which is a data
structure that stores the index ranges of a given array to be found in
each processor. In other words, the layout contains a description of the
distribution of the data. Layouts are needed to specify remap
operations.
The use of the remapper module can be broken down into several stages:
1. Declare the layouts and the remap plan object, as well as the arrays to be used.
2. Initialize the layouts.
3. Use the layouts and the arrays to create the remapper plan.
4. Apply the remapper plan, using the input and output arrays as arguments.
As a final step, the resources in the layouts and the remap plan should be deallocated with sll_delete().
The remap operation is an out-of-place operation. Remap is designed to consider only non-overlapping data configurations; that is, an all_gather-type operation should not be attempted with a remap.
Exposed Interface¶
Since the layouts are a pre-requisite for dealing with the remapper, we start with these. Presently, Selalib offers the following layout types:
layout_2D
layout_3D
layout_4D
layout_5D
layout_6D
These types are each accompanied by their own constructors, destructors and accessors. A comprehensive list of the procedures that operate with the layouts is:
new_layout_2D( collective )
new_layout_3D( collective )
new_layout_4D( collective )
new_layout_5D( collective )
new_layout_6D( collective )
get_layout_collective( layout )
get_layout_num_nodes( layout )
get_layout_global_size_i( layout )
get_layout_global_size_j( layout )
get_layout_global_size_k( layout )
get_layout_global_size_l( layout )
get_layout_global_size_m( layout )
get_layout_global_size_n( layout )
get_layout_i_min( layout, rank )
get_layout_i_max( layout, rank )
get_layout_j_min( layout, rank )
get_layout_j_max( layout, rank )
get_layout_k_min( layout, rank )
get_layout_k_max( layout, rank )
get_layout_l_min( layout, rank )
get_layout_l_max( layout, rank )
get_layout_m_min( layout, rank )
get_layout_m_max( layout, rank )
get_layout_n_min( layout, rank )
get_layout_n_max( layout, rank )
set_layout_i_min( layout, rank, val )
set_layout_i_max( layout, rank, val )
set_layout_j_min( layout, rank, val )
set_layout_j_max( layout, rank, val )
set_layout_k_min( layout, rank, val )
set_layout_k_max( layout, rank, val )
set_layout_l_min( layout, rank, val )
set_layout_l_max( layout, rank, val )
set_layout_m_min( layout, rank, val )
set_layout_m_max( layout, rank, val )
set_layout_n_min( layout, rank, val )
set_layout_n_max( layout, rank, val )
initialize_layout_with_distributed_2d_array(
global_npx1,
global_npx2,
num_proc_x1,
num_proc_x2,
layout)
initialize_layout_with_distributed_3d_array(
global_npx1,
global_npx2,
global_npx3,
num_proc_x1,
num_proc_x2,
num_proc_x3,
layout)
initialize_layout_with_distributed_4d_array(
global_npx1,
global_npx2,
global_npx3,
global_npx4,
num_proc_x1,
num_proc_x2,
num_proc_x3,
num_proc_x4,
layout)
initialize_layout_with_distributed_5d_array(
global_npx1,
global_npx2,
global_npx3,
global_npx4,
global_npx5,
num_proc_x1,
num_proc_x2,
num_proc_x3,
num_proc_x4,
num_proc_x5,
layout)
initialize_layout_with_distributed_6d_array(
global_npx1,
global_npx2,
global_npx3,
global_npx4,
global_npx5,
global_npx6,
num_proc_x1,
num_proc_x2,
num_proc_x3,
num_proc_x4,
num_proc_x5,
num_proc_x6,
layout)
compute_local_sizes(
layout_Nd,
loc_sz_1,
loc_sz_2,
...,
loc_sz_N )
sll_delete( layout )
In brief:
new_layout_ND() is responsible for allocating all the necessary storage for the layout. It returns a pointer to a layout_ND object, ready for initialization. Arguments:
- collective: [pointer] sll_collective object to be associated with the layout.
get_layout_collective() returns a pointer to the sll_collective object associated with a given layout. Arguments:
- layout: [pointer] layout whose collective is requested.
get_layout_num_nodes() returns a 32-bit integer with the number of processes in the collective associated with the given layout. Arguments:
- layout: [pointer] layout whose size (in number of processes) is being requested.
get_layout_global_size_X() returns a 32-bit integer with the size of the global array (in an abstract sense, i.e., the array that is distributed) in the direction X. Arguments:
- layout: [pointer] layout whose global size in direction X is being requested.
get_layout_X_min() returns a 32-bit integer with the minimum index in direction X contained in the given process rank. Arguments:
- layout: [pointer] layout under inquiry.
- rank: [in] 32-bit integer, rank of the process under inquiry.
get_layout_X_max() returns a 32-bit integer with the maximum index in direction X contained in the given process rank. Arguments:
- layout: [pointer] layout under inquiry.
- rank: [in] 32-bit integer, rank of the process under inquiry.
Regarding the constructors: note that each layout descriptor needs to be allocated by providing an instance of sll_collective. This can be understood by reasoning about the data layout as a group of processors (the collective) plus a specification of the data boxes contained in each one. More concretely, a layout for an N-dimensional array associated with a collective of size \(N_p\) can be thought of as a collection of \(N_p\) N-dimensional boxes, each representing the intervals of the array indices contained in a given processor (rank: 0 … \(N_p-1\)). After calling any of the new_layout functions, the returned instance becomes associated with the given collective and enough memory is allocated (the size of the collective) to hold the box specifications. While the memory has been allocated, the layout itself still needs to be initialized.
There are at least two ways to accomplish the initialization. The manual way is through the individual access functions defined for the layouts. The second method involves helper routines.
All the access routines (get_layout_* / set_layout_*) listed earlier are generic: they accept layouts of any dimension as long as the field to be accessed is defined in the layout. The array dimensions are labeled i, j, k, l, m, n (although in the arguments we use the numbers 1, 2, …, 6). Thus, for
example, a layout_2D
contains information on the indices i
, j
only. Accessing the k
index would yield a compilation error. The
get_
routines are functions. Except for get_layout_collective( )
which returns a pointer to the collective associated with a given
layout, all the other get_
-functions return the value of a 32-bit
integer value, representing the minimum or maximum value of an index
contained in a given rank, or in the case of the
get_layout_global_size_X
functions, the global size of the (abstract)
array that is split among the different processors in the way described
by the layout.
The set_
procedures are subroutines. These can be used to set a given
minimum/maximum index value in a given layout to a desired value. These
are the accessors that may be used when initializing a layout manually.
However, this is not encouraged; in general, it is recommended to initialize a layout with the provided helper functions, i.e., the initialize_layout_with_distributed_Nd_array procedures listed earlier and described next.
These helper procedures are subroutines. Except for the layout to be initialized, all the arguments are 32-bit integers. The global_npxX
arguments denote the global size of the array (its total size in the
\(X\)-th direction) that is to be divided amongst multiple processors in
accordance with the partition determined by the num_proc_xX
arguments.
To understand the role of the num_proc_xX
parameters it is necessary
to imagine the domain represented by the N-dimensional array as being
divided in a number of blocks in each dimension, just as if the
processors themselves were arranged in a processor mesh of dimensions
num_proc_x1
* num_proc_x2
* … * num_proc_xN
= number of
processes in the collective. Thus, if distributing a 2D array of sizes
global_npx1
\(= 1024\), global_npx2
\(= 512\), in a process mesh of
dimensions num_proc_x1
\(= 2\) and num_proc_x2
\(= 1\), this would yield
a partition of the original array into one block of dimensions \(512 \times 512\) in each of the two processors. Note again that the number of processors corresponds to the product of the num_proc_xX arguments.
The information in the initialized layouts can be used to inquire about the size of the data distributed as per a given layout. The procedures for this are hidden behind the generic interface compute_local_sizes(), listed earlier.
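For instance (a hedged sketch; the loc_sz_* names are ours):
sll_int32 :: loc_sz_1, loc_sz_2
! dimensions of the chunk owned by the calling process under the
! 2D layout initialized in the sketch above
call compute_local_sizes( layout, loc_sz_1, loc_sz_2 )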
Once a layout is declared and initialized it can be used as an aid to
allocate the memory of the array(s) whose distribution it represents.
For instance, suppose that you start with a multidimensional array that
needs to be distributed among \(N_p\) processors. The layout of the data
(that is, the description of what ranges of the data are contained in
each processor) is specified by an instance of the type layout_XD
,
(where X
is the dimension of the data). The layout contains a notion
of an \(N_p\)-sized collection of boxes, each box representing a
contiguous chunk of the multidimensional array stored in each processor.
If in the course of a computation, you wish to reconfigure the layout of
the data (for example, if you wished to re-arrange data in a way that
would permit launching serial algorithms locally in each processor),
then you would create and initialize a new layout descriptor with the
target configuration (i.e., you define the box to be stored in each
node). Armed with the layout descriptions for the initial and target
configurations, one can execute a remap operation to reconfigure the
data. This is an out-of-place operation.
NEW_REMAPPER_PLAN_3D( initial_layout,
target_layout,
data_size_in_integer_sizes )
NEW_REMAPPER_PLAN_4D( initial_layout,
target_layout,
data_size_in_integer_sizes )
NEW_REMAPPER_PLAN_5D( initial_layout,
target_layout,
data_size_in_integer_sizes )
will yield an instance of the type remap_plan_3D_t, remap_plan_4D_t or remap_plan_5D_t, respectively, which contains all the information necessary to actually carry out the data redistribution. Finally, a call to
re-distribution. Finally, a call to
apply_remap_3D( plan, data_in, data_out )
apply_remap_4D( plan, data_in, data_out )
apply_remap_5D( plan, data_in, data_out )
will actually redistribute the data (as an out-of-place operation) according to the plan in an optimized way [3].
To appreciate the power of such facility, note that in principle, the construction of a (communications latency-limited) parallel quasi-neutral solver can be based exclusively on remapping operations. This is an important tool in any problem that would require global rearrangements of data. The remapper thus is able to present a single powerful abstraction that is general, reusable and completely hides most of the complications introduced by the data distribution.
The access functions for the layout types are always prefixed with the corresponding get_layout_XD / set_layout_XD (where X denotes the dimensionality of the data), and they presuppose knowledge of the convention for ordering the indices as i, j, k, l, m, n for the dimensions; for 3D layouts, for example, the available get/set accessors are the i, j and k variants listed earlier.
As a very inelegant convenience, the layout type allows direct access to its collective reference.
The above functions define the interface that will allow you to declare
and initialize the layout
types as desired. This is where the work
lies when using this module. Note that all the above functions could be
coalesced into a set of functions of the type
set_layout_X_XXX(layout, rank, val)
if we choose to hide all the above
functions behind a generic interface. The selection would be done
automatically depending on the type of layout passed as an argument.
The type remap_plan
exists also in multiple flavors, depending on the
dimensionality of the data to be remapped:
remap_plan_3D_t
remap_plan_4D_t
remap_plan_5D_t
The remap plan type stores the locations of the memory buffers that will be involved in the communications, the specification of the data that will be sent and received, as well as the collective within which the communications will take place. Constructor functions are available; the choice depends on the dimensionality of the data:
NEW_REMAPPER_PLAN_3D( initial_layout,
final_layout,
array_name )
NEW_REMAPPER_PLAN_4D( initial_layout,
final_layout,
array_name )
NEW_REMAPPER_PLAN_5D( initial_layout,
final_layout,
array_name )
Finally, the way to execute the plan on a particular data set is through a call of the appropriate subroutine (here presented as generic interfaces)
apply_remap_3D( plan, data_in, data_out )
apply_remap_4D( plan, data_in, data_out )
apply_remap_5D( plan, data_in, data_out )
Usage¶
To use the module in a stand-alone way, include the line:
#include "sll_remap.h"
While verbose, the best way to demonstrate the usage of the remapper is with a complete program. Below it, we examine the different statements.
program remap_test
use sll_collective
#include "sll_remap.h"
#include "sll_memory.h"
#include "sll_working_precision.h"
#include "misc_utils.h"
implicit none
! Test of the 3D remapper takes a 3D array whose global
! size Nx*Ny*Nz, distributed among pi*pj*pk processors.
integer, dimension(:,:,:), allocatable :: a3
integer, dimension(:,:,:), allocatable :: b3
! Take a 3D array of dimensions 8X8X1
integer, parameter :: total_sz_i = 8
integer, parameter :: total_sz_j = 8
integer, parameter :: total_sz_k = 1
! the process mesh
integer, parameter :: pi = 4
integer, parameter :: pj = 4
integer, parameter :: pk = 1
! Split into 16 processes, each with a local chunk 2X2X1
integer :: local_sz_i
integer :: local_sz_j
integer :: local_sz_k
integer :: ierr
integer :: myrank
integer :: colsz
integer :: i,j,k
integer :: i_min, i_max
integer :: j_min, j_max
integer :: k_min, k_max
integer :: node
integer, dimension(1:3) :: gcoords
! Remap variables
type(layout_3D_t), pointer :: conf3_init
type(layout_3D_t), pointer :: conf3_final
type(remap_plan_3D_t), pointer :: rmp3
! Boot parallel layer
call sll_boot_collective()
! Initialize and allocate the variables.
local_sz_i = total_sz_i/pi
local_sz_j = total_sz_j/pj
local_sz_k = total_sz_k/pk
SLL_ALLOCATE(a3(1:local_sz_i,1:local_sz_j,1:local_sz_k), ierr)
SLL_ALLOCATE(b3(1:local_sz_i,1:local_sz_j,1:local_sz_k), ierr)
myrank = sll_get_collective_rank(sll_world_collective)
colsz = sll_get_collective_size(sll_world_collective)
conf3_init => new_layout_3D( sll_world_collective )
conf3_final => new_layout_3D( sll_world_collective )
! Initialize the layout
do k=0, pk-1
do j=0, pj-1
do i=0, pi-1
node = i+pi*(j+pj*k) ! linear index of node
i_min = i*local_sz_i + 1
i_max = i*local_sz_i + local_sz_i
j_min = j*local_sz_j + 1
j_max = j*local_sz_j + local_sz_j
k_min = k*local_sz_k + 1
k_max = k*local_sz_k + local_sz_k
call set_layout_i_min( conf3_init, node, i_min )
call set_layout_i_max( conf3_init, node, i_max )
call set_layout_j_min( conf3_init, node, j_min )
call set_layout_j_max( conf3_init, node, j_max )
call set_layout_k_min( conf3_init, node, k_min )
call set_layout_k_max( conf3_init, node, k_max )
end do
end do
end do
! Initialize the data using layout information.
do k=1, local_sz_k
do j=1, local_sz_j
do i=1, local_sz_i
gcoords= local_to_global_3D(conf3_init,(/i,j,k/))
a3(i,j,k) = gcoords(1) + &
total_sz_i*((gcoords(2)-1) + &
total_sz_j*(gcoords(3)-1))
end do
end do
end do
! Initialize the final layout, in this case, just a
! transposition
do k=0, pk-1
do j=0, pj-1
do i=0, pi-1
node = i+pi*(j+pj*k) ! linear index of node
i_min = i*local_sz_i + 1
i_max = i*local_sz_i + local_sz_i
j_min = j*local_sz_j + 1
j_max = j*local_sz_j + local_sz_j
k_min = k*local_sz_k + 1
k_max = k*local_sz_k + local_sz_k
call set_layout_i_min( conf3_final, node, j_min )
call set_layout_i_max( conf3_final, node, j_max )
call set_layout_j_min( conf3_final, node, i_min )
call set_layout_j_max( conf3_final, node, i_max )
call set_layout_k_min( conf3_final, node, k_min )
call set_layout_k_max( conf3_final, node, k_max )
end do
end do
end do
rmp3 => NEW_REMAPPER_PLAN_3D( conf3_init, conf3_final, a3 )
call apply_remap_3D( rmp3, a3, b3 )
! At this moment, b3 contains the expected output
! from the remap operation.
! Delete the layouts
call delete_layout_3D( conf3_init )
call delete_layout_3D( conf3_final )
call sll_halt_collective()
end program remap_test
- Preamble (use sll_collective and the #include lines): required at the time of this writing. Eventually this will be replaced by a single statement to include the whole library; presently we include the various headers individually, so bear in mind that this is not the way it will end up being. sll_remap.h specifically loads the remapper facility; it is brought in as a header file because NEW_REMAPPER_PLAN_XD() is implemented as a macro.
- Arrays a3 and b3: for this example we allocate two 3D arrays, for the input and output of the remap operation.
- The total_sz_* and p* parameters: definition of the array size from a global perspective. In other words, the array to be remapped is an \(8*8*1\) array, to be distributed on a processor mesh of dimensions \(4*4*1\).
- The remaining integer declarations: miscellaneous variables that we will use.
- conf3_init, conf3_final and rmp3: pointers to the initial and final layouts and to the remap plan.
- sll_boot_collective(): presently we boot from collective. Eventually this will be replaced by a call to something like boot_selalib(), where we declare and initialize anything we need in a single call.
- The assignments and new_layout_3D() calls: initialization of the variables.
- The first triple loop: this is where the actual work is when using the remapper. We need to initialize a layout, in this case the initial configuration. We use the access functions set_layout_x_xxx() to populate the fields. Here we obviously take into account the geometry of our 'process mesh' to find out the rank of the process that we are initializing.
- The data-initialization loop: we need to initialize the data; here we choose simply to assign the index of the array, considered as a 1D array. Note the use of the helper function local_to_global_3D( layout, triplet ): we exploit the knowledge of the global layout of the data to find out the global indices of a local 3-tuple.
- The second triple loop: the other main part of the work, the initialization of the target layout. In this case we chose a simple transposition, which is achieved by switching i and j.
- NEW_REMAPPER_PLAN_3D(): here we allocate and initialize the remap plan, using the initial and final configurations as input. The third argument is passed to inform the remapper of the type of data to be exchanged.
- apply_remap_3D(): here we apply the plan. This is a call to a generic interface, hence a type-dependent sub-function must have been defined for the call to succeed; at the time of this writing, only default (32-bit) integers and double precision floats have been implemented. Please refer to the implementation notes for some commentary on our options with this interface.
- The delete_layout_3D() calls: cleanup. The layouts need to be deleted to prevent memory leaks.
Implementation Notes¶
The biggest challenge with the remapper is to attain a desired level of genericity and to preserve the modularity of the library. These two problems are intimately related. Ideally, we should be able to apply a remap operation on data of any type, including user-derived types. Another requirement has been to confine a library like MPI to a single entry point into Selalib. This means that we do not want the MPI derived types to pollute the higher abstraction levels of the library: especially at the top level, we want to express our programs with the capabilities of the FORTRAN language alone.
These requirements were solved in the prototype version of the remapper
through the use of a single datatype to represent all other types of
data at the moment of assembling the exchange buffers and launching the
MPI calls. In our case, we have chosen to represent all data as
‘integers’. This means that the exchange buffers that are stored in the
remap plans are integer arrays. Thus, the design decision in the
prototype has been to choose flexibility and ease of change over
execution speed. In contrast with the C language, the constant call to
the transfer()
function to store and retrieve data from the exchange
buffers carries with it a possibly significant execution time penalty.
The function NEW_REMAPPER_PLAN_XD()
is by nature type-independent, as
the design of the plan only depends on the layouts. However, it is also
convenient to store the send/receive buffers in the plan, and the
allocation of these buffers requires knowledge of the amount of memory
required. This information is passed in the third argument. The macro
will internally select an element of this array and determine its size
in terms of the fundamental datatype being exchanged (i.e.: integer
).
This way we know how much memory to allocate in the buffers.
Another means to achieve the illusion of genericity is Fortran's built-in support for generic interfaces. For example, we can have specialized
apply_remap_3D()
functions for the most commonly used datatypes, all
hidden behind the same generic name. These specialized functions would
not depend on the current choice of using a single type for the exchange
buffers, eliminating the penalty that we are currently paying through the calls to the transfer() intrinsic function. This
intrinsic function. This
solution would mean writing redundant code, something that could be
addressed with preprocessor macros, but this would not be a solution for
eliminating the penalizations of the transfer()
intrinsic when we are
exchanging derived types. A solution that can exchange these arbitrary
data while not requiring the use of the MPI derived types at the higher
levels is yet to be found. It could be that the Fortran way to solve
this problem would be to accept the invasion of MPI at the higher
levels…
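As a hedged illustration of the cost being discussed, this is the kind of packing and unpacking that the transfer() intrinsic performs when real data moves through the integer exchange buffers (the variable names are ours):
sll_real64 :: x, y
sll_int32, dimension(2) :: buf   ! two 32-bit integers hold one 64-bit real
! pack a double precision value into the integer exchange buffer...
x = 3.14159_f64
buf = transfer( x, buf )
! ...and recover it on the receiving side; y equals x bit-for-bit
y = transfer( buf, y )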
[3] The optimizations carried out by the remapper are related to finding the minimally-sized communicators needed to launch the exchanges, and to the selection of the lower-level communication functions (alltoall vs. alltoallv, for instance). Other optimizations depend on how the local MPI is tuned.