.. role:: hidden
    :class: hidden-section

===============
distdl.backends
===============

.. toctree::
    :glob:
    :hidden:

    backends/*

.. contents::
    :local:
    :depth: 3

Overview
========

Data Movement Backend Interface
===============================

Each back-end is responsible for providing the concrete definitions of
partition classes and the functional interfaces for primitive data movement
layers.

Partitions
----------

Currently, two partition classes must be provided for a fully compatible
back-end: ``Partition`` for partitions with no topology and
``CartesianPartition`` for partitions with a Cartesian topology.

The partition classes serve as wrappers around the back-end data movement
tool's interface.  Implementations are free to make use of any back-end
specific API, but in general that API should not be explicitly exposed to
DistDL users.

The interfaces defined in these classes inherit a number of concepts from the
MPI back-end.  As DistDL evolves, it is anticipated that these interfaces will
become more general.

Partition interfaces are explicitly for distributed communication and data
movement, and as such they provide some interfaces for general parallel
communication concepts (e.g., broadcasts and all-gathers) for use other than
tensor data movement.  While much of the interface is inspired by MPI's API,
MPI-specific terminology is avoided; e.g., we use ``broadcast_data``, not
``bcast_data``.

Partitions with No Topology
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. list-table::
    :width: 100%
    :header-rows: 1
    :align: left

    * - Back-end
      - Module
      - Class Definition
      - Public Alias
    * - :ref:`MPI `
      - :any:`backends.mpi.partition `
      - :any:`MPIPartition `
      - :any:`Partition `

The partition with no topology, exposed from each back-end as ``Partition``,
is the general container for an unstructured team of workers.  It must provide
interfaces for creating teams of workers, creating sub-teams of workers,
creating unions of teams of workers, etc.

Regardless of back-end, a number of core partition concepts are to be exposed
by a basic unstructured partition:

* ``size``: The number of workers in the partition.
* ``rank``: The lexicographic identifier of the current worker.
* ``active``: The status of a worker in a partition.  An active member of a
  partition has the ability to communicate with other workers.  An inactive
  member has no knowledge of the other workers and is entirely disconnected
  from the set of active workers.

For convenience, we also allow topology-free partitions to be treated as if
they are endowed with a 1-dimensional Cartesian topology.  Consequently, they
also expose two further concepts:

* ``shape``: Always a 1-iterable containing ``size`` as its value.
* ``index``: Always the ``rank``.

A partition must also provide the following API for creating new partitions
(a minimal usage sketch follows this list):

* ``__init__()``: A partition is initialized using whatever back-end specific
  information is required to determine the above properties.
* ``create_partition_inclusive()``: Creates a subpartition of the current
  partition, inclusive of a specified subset of workers.
* ``create_partition_union()``: Creates a new partition containing the union
  of the workers in two different partitions.  Workers in the calling instance
  are ordered before workers in the ``other`` instance and workers cannot be
  repeated.
* ``create_cartesian_topology_partition()``: Using the workers in the team for
  the unstructured partition, creates a partition endowed with a Cartesian
  topology.
* ``create_allreduction_partition()``: Creates the partition required to
  support the :ref:`All-sum-reduce ` operation.
* ``create_broadcast_partition_to()``: Following the DistDL
  :ref:`code_reference/nn/broadcast:Broadcast Rules`, creates the sending and
  receiving partitions required to support the :ref:`Broadcast ` operation
  between two partitions.
* ``create_reduction_partition_to()``: Following the DistDL
  :ref:`code_reference/nn/broadcast:Broadcast Rules`, creates the sending and
  receiving partitions required to support the :ref:`Sum-reduce ` operation
  between two partitions.

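The sketch below is a minimal, illustrative use of this interface with the MPI
back-end.  The construction arguments are back-end specific; here we assume an
``mpi4py`` communicator and at least two MPI ranks.

.. code-block:: python

    from mpi4py import MPI

    from distdl.backends.mpi.partition import MPIPartition

    # An unstructured partition wrapping every worker in the world
    # communicator.
    P_world = MPIPartition(MPI.COMM_WORLD)
    print(P_world.size, P_world.rank, P_world.active)

    # A subpartition containing only workers 0 and 1.  On all other workers
    # the resulting partition object exists but is inactive.
    P_pair = P_world.create_partition_inclusive([0, 1])
    print(P_pair.active)

Inactive workers still hold a valid partition object but, as described above,
they cannot communicate with the active team.
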
A partition must provide the following API for comparing partitions:

* ``__eq__()``: Comparison for strict equality.  This means the same team,
  ordered the same way.  If two partitions have similar structure, meaning the
  same size and shape, but have different team members or organization, they
  are not equal.

A partition must provide the following API for communicating within
partitions:

* ``broadcast_data()``: For sharing data from one worker with all workers in
  the partition.  This is distinct from the tensor broadcast operation.  This
  function does not require the receiving workers to have any knowledge of the
  structure of the data they are to receive.  Any information required to
  store this data (e.g., the ``dtype`` or ``shape`` of an array) must be
  communicated within the function.  This function supports two modes.  The
  first acts as a standard broadcast within a partition.  The second allows
  one worker in a partition that is a *subpartition* of the calling partition
  to broadcast data to the entire calling partition.  This is useful when the
  structure of the data is known only by workers on the subpartition but the
  data is needed on the superpartition.
* ``allgather_data()``: For sharing information on all workers in the
  partition with all other workers in the partition.

Partitions with Cartesian Topology
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. list-table::
    :width: 100%
    :header-rows: 1
    :align: left

    * - Back-end
      - Module
      - Class
      - Alias
    * - :ref:`MPI `
      - :any:`backends.mpi.partition `
      - :any:`MPICartesianPartition `
      - :any:`CartesianPartition `

Partitions endowed with a Cartesian topology are themselves partitions, with
additional ordering information that is useful, for example, when partitioning
a tensor.

In addition to the members required for
:ref:`code_reference/backends:Partitions with No Topology`, Cartesian
partitions also specify (see the sketch at the end of this subsection):

* ``shape``: An iterable giving the number of workers in each dimension.
* ``index``: An iterable with the lexicographic identifier of the worker in
  each dimension.

In addition to the API required for
:ref:`code_reference/backends:Partitions with No Topology`, Cartesian
partitions must also specify:

* ``cartesian_index()``: A routine for obtaining the index of any worker in
  the partition from its ``rank``.
* ``neighbor_ranks()``: A routine for obtaining the ranks of neighboring
  workers in all dimensions of the partition.

In addition to the API required for
:ref:`code_reference/backends:Partitions with No Topology`, Cartesian
partitions may also specify:

* ``create_cartesian_subtopology_partition()``: A routine for creating a
  subtopology, following the MPI subtopology specification.

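The following minimal sketch, again assuming the MPI back-end and at least 12
MPI ranks, shows how a Cartesian partition might be created from an
unstructured partition and how its members can be inspected.  Attribute and
return types (e.g., arrays versus tuples) are back-end specific.

.. code-block:: python

    from mpi4py import MPI

    from distdl.backends.mpi.partition import MPIPartition

    # Select 12 workers from the world partition and endow them with a
    # 2 x 3 x 2 Cartesian topology.
    P_world = MPIPartition(MPI.COMM_WORLD)
    P_base = P_world.create_partition_inclusive(list(range(12)))
    P_cart = P_base.create_cartesian_topology_partition([2, 3, 2])

    if P_cart.active:
        # `shape` is the worker grid and `index` is this worker's coordinate
        # within it; `cartesian_index()` maps any rank to its coordinate.
        print(P_cart.rank, tuple(P_cart.shape), tuple(P_cart.index))
        print(P_cart.cartesian_index(0))
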
Functional Primitives
---------------------

All Sum-Reduce
~~~~~~~~~~~~~~

.. list-table::
    :width: 100%
    :header-rows: 1
    :align: left

    * - Back-end
      - Module
      - Class
    * - :ref:`MPI `
      - :any:`backends.mpi.functional.all_sum_reduce `
      - :any:`AllSumReduceFunction `

The functional primitive for the AllSumReduce data movement operation does not
use the original tensor partition :math:`P_x` (input and output) directly.
Instead, the calling class must create new sub-partitions of :math:`P_x`,
along specified dimensions, to enable actual data movement.  For the
AllSumReduce operation, these are back-end specific implementations of
:ref:`code_reference/backends:Partitions with No Topology` and the data
movement occurs *within* these partitions.  The forward AllSumReduce is
equivalent to a standard SumReduce followed by a standard Broadcast, though
the actual implementation will rarely use these routines directly.

The sub-partitions are created using the ``create_allreduction_partition()``
method of the ``Partition`` class for the selected back-end, which creates a
number of partitions equal to the product of the dimensions of :math:`P_x`
that are *not* included in the reduction.  Each worker in :math:`P_x` is a
member of exactly one of these sub-partitions.

We illustrate the construction of these partitions in the following example.
Consider a :math:`2 \times 3 \times 2` input partition :math:`P_x`, as
illustrated below, where workers are identified by a *global* lexicographic
ordering.  For example, in an MPI-based environment, this would mean the
identifiers are the ranks in a group of size 12 that is a parent group of the
partition.

.. note::
    Implementations should not impose anything on the ordering of the workers.
    Back-end specific (private) routines may be used to map indices to
    workers.

.. figure:: /_images/all_sum_reduce_example_A01.png
    :alt: Image of 2x3x2 partition with labeled workers.

    A :math:`2 \times 3 \times 2` partition :math:`P_x`, with subtensors
    labeled by the global lexicographic identifiers of their workers.

If the reduction is specified over the 0th and 2nd dimensions, then this
operation creates 3 partitions as described above, illustrated below, which we
label :math:`P_0`, :math:`P_1`, and :math:`P_2` for convenience.

.. figure:: /_images/all_sum_reduce_example_A02.png
    :alt: Image of partitions induced by a 2x3x2 all-sum-reduce over the first and last dimensions, with labeled workers.

The following table gives the global lexicographic identifiers of each worker
in the partition.

.. list-table::
    :header-rows: 1
    :align: center

    * - Partition
      - Workers
    * - :math:`P_0`
      - 0, 1, 6, 7
    * - :math:`P_1`
      - 2, 3, 8, 9
    * - :math:`P_2`
      - 4, 5, 10, 11

After the forward application of the all-sum-reduction, all workers active in
the :math:`P_x` partition have as output the sum-reduction of the correct
subtensors of the original tensor.

.. figure:: /_images/all_sum_reduce_example_A03.png
    :alt: Image of a 2x3x2 all-sum-reduction along dimensions 0 and 2.

    Completed all-sum-reduction.

AllSumReduce may apply over any combination of the dimensions, including all
and none of them.  If all dimensions of the partition are included, the output
is the sum over all input tensors in the partition.  If the set of reduced
dimensions is empty, then the layer is functionally the identity.

The adjoint phase works the same way, as this operation is self-adjoint, as
described in the `motivating paper `_.

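The grouping in the table above can be reproduced with a short NumPy sketch,
assuming the row-major lexicographic ordering of worker identifiers shown in
the figures.  This is only an illustration of the bookkeeping, not of the
back-end implementation.

.. code-block:: python

    import numpy as np

    # Workers of the 2 x 3 x 2 partition P_x, labeled 0..11 in lexicographic
    # (row-major) order, as in the figures above.
    workers = np.arange(12).reshape(2, 3, 2)

    # Reducing over dimensions 0 and 2 leaves dimension 1 intact, so workers
    # that share an index along dimension 1 form one sub-partition; the sum
    # is taken within each such group.
    for j in range(workers.shape[1]):
        print(f"P_{j}:", sorted(workers[:, j, :].ravel().tolist()))

    # P_0: [0, 1, 6, 7]
    # P_1: [2, 3, 8, 9]
    # P_2: [4, 5, 10, 11]
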
Broadcast
~~~~~~~~~

.. list-table::
    :width: 100%
    :header-rows: 1
    :align: left

    * - Back-end
      - Module
      - Class
    * - :ref:`MPI `
      - :any:`backends.mpi.functional.broadcast `
      - :any:`BroadcastFunction `

The functional primitive for the Broadcast data movement operation does not
use the original tensor partitions :math:`P_x` (input) and :math:`P_y`
(output) directly.  Instead, the calling class must create two new partitions
to enable actual data movement.  For the Broadcast operation, these are
back-end specific implementations of
:ref:`code_reference/backends:Partitions with No Topology` and the data
movement occurs *within* these partitions.

The partitions are created using the ``create_broadcast_partition_to()``
method of the ``Partition`` class for the selected back-end, which creates a
number of partitions equal to the product of the shape of :math:`P_x`.  Each
of these partitions has, as its root, the worker that is the source of the
data to be copied in the broadcast; all other workers in the partition receive
those copies.  The root is specified in a back-end specific manner, e.g., the
first worker in the partition is the 0th rank in the associated group of
processors in the MPI back-end implementation.

Each worker in :math:`P_x` or :math:`P_y` has access to up to two of these
partitions.  If the worker is active in :math:`P_x`, it will have a non-null
``P_send`` partition, of which it is the root worker.  If it is active in
:math:`P_y`, it will have a non-null ``P_recv`` partition.  Usually, the
worker will be a non-root worker in ``P_recv``, indicating that it is going to
receive a copy from the root.  However, it is possible that the worker *may be
the root of* ``P_recv``.  This occurs if ``P_send`` and ``P_recv`` are the
same and thus the worker is required to copy its own data to itself as part of
the broadcast.

We illustrate the construction of these partitions in the following example.
Consider a broadcast of a tensor partitioned by a :math:`1 \times 3 \times 1`
partition :math:`P_x` to a :math:`2 \times 3 \times 2` partition :math:`P_y`,
as illustrated below, where workers are identified by a *global* lexicographic
ordering.  For example, in an MPI-based environment, this would mean the
identifiers are the ranks in a group of size 12 that is a parent group to both
partitions.

.. note::
    Implementations should not impose anything on the ordering of the workers
    in :math:`P_x` and :math:`P_y`.  The sets of workers in these partitions
    may be disjoint, partially overlapping, or completely overlapping.

.. figure:: /_images/broadcast_example_A01.png
    :alt: Image of 1x3x1 to 2x3x2 broadcast with labeled workers.

    Blue, yellow, and red subtensors on :math:`P_x` are to be broadcast to
    :math:`P_y`.  Labels on subtensors are global lexicographic identifiers
    for the workers.

This broadcast creates 3 partitions as described above, illustrated below,
which we label :math:`P_0`, :math:`P_1`, and :math:`P_2` for convenience.

.. figure:: /_images/broadcast_example_A02.png
    :alt: Image of partitions induced by a 1x3x1 to 2x3x2 broadcast, with labeled workers.

The following table gives the global lexicographic identifiers of each worker
in the partition, with the root worker listed first in bold.

.. list-table::
    :header-rows: 1
    :align: center

    * - Partition
      - Workers
    * - :math:`P_0`
      - **1**, 0, 6, 7
    * - :math:`P_1`
      - **2**, 3, 8, 9
    * - :math:`P_2`
      - **3**, 4, 5, 10, 11

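This membership can be reproduced with a short NumPy sketch, using the global
identifiers from the figures: :math:`P_x` holds workers 1, 2, and 3 and
:math:`P_y` holds workers 0 through 11 in lexicographic order.  This
illustrates only the bookkeeping; the partitions themselves are created by the
back-end.

.. code-block:: python

    import numpy as np

    # Global identifiers from the example above.
    P_x_ids = np.array([1, 2, 3]).reshape(1, 3, 1)   # 1 x 3 x 1 source
    P_y_ids = np.arange(12).reshape(2, 3, 2)         # 2 x 3 x 2 destination

    # Broadcast partition j contains the P_x worker holding subtensor j (the
    # root) together with every P_y worker that receives a copy of it.
    for j in range(3):
        root = int(P_x_ids[0, j, 0])
        receivers = set(P_y_ids[:, j, :].ravel().tolist())
        members = [root] + sorted(receivers - {root})
        print(f"P_{j}: root={root}, workers={members}")

    # P_0: root=1, workers=[1, 0, 6, 7]
    # P_1: root=2, workers=[2, 3, 8, 9]
    # P_2: root=3, workers=[3, 4, 5, 10, 11]
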
The following table gives the partition label for the partition that each
worker associates with ``P_send`` and ``P_recv``.

.. list-table::
    :width: 95%
    :widths: 10 10 10 1 10 10 10 1 10 10 10
    :header-rows: 1
    :align: center

    * - Worker
      - ``P_send``
      - ``P_recv``
      -
      - Worker
      - ``P_send``
      - ``P_recv``
      -
      - Worker
      - ``P_send``
      - ``P_recv``
    * - 0
      - n/a
      - :math:`P_0`
      -
      - 4
      - n/a
      - :math:`P_2`
      -
      - 8
      - n/a
      - :math:`P_1`
    * - 1
      - :math:`P_0`
      - :math:`P_0`
      -
      - 5
      - n/a
      - :math:`P_2`
      -
      - 9
      - n/a
      - :math:`P_1`
    * - 2
      - :math:`P_1`
      - :math:`P_1`
      -
      - 6
      - n/a
      - :math:`P_0`
      -
      - 10
      - n/a
      - :math:`P_2`
    * - 3
      - :math:`P_2`
      - :math:`P_1`
      -
      - 7
      - n/a
      - :math:`P_0`
      -
      - 11
      - n/a
      - :math:`P_2`

This example illustrates the case where a worker needs to broadcast to itself
as root (workers 1 and 2), the case where a worker broadcasts from itself as
root and receives data in another partition (worker 3), and the case where a
worker simply receives copies of the data from the roots (workers 0 and 4-11).
It also illustrates that the root worker is the one that has the data, not the
lowest globally ranked worker (e.g., in :math:`P_0`).

After the forward application of the broadcast, all workers active in a
``P_recv`` partition have as output a copy of the correct subtensor of the
original tensor.

.. figure:: /_images/broadcast_example_A03.png
    :alt: Image of 1x3x1 to 2x3x2 broadcast with labeled workers.

    Completed broadcast.

The adjoint phase works the same way, but in reverse.  However, as
demonstrated in the `motivating paper `_, the subtensors on :math:`P_y` are
sum-reduced back to :math:`P_x`, rather than copied.

Halo Exchange
~~~~~~~~~~~~~

Sum-Reduce
~~~~~~~~~~

.. list-table::
    :width: 100%
    :header-rows: 1
    :align: left

    * - Back-end
      - Module
      - Class
    * - :ref:`MPI `
      - :any:`backends.mpi.functional.sum_reduce `
      - :any:`SumReduceFunction `

The functional primitive for the SumReduce data movement operation does not
use the original tensor partitions :math:`P_x` (input) and :math:`P_y`
(output) directly.  Instead, the calling class must create two new partitions
to enable actual data movement.  For the sum-reduce operation, these are
back-end specific implementations of
:ref:`code_reference/backends:Partitions with No Topology` and the data
movement occurs *within* these partitions.

The partitions are created using the ``create_reduction_partition_to()``
method of the ``Partition`` class for the selected back-end, which creates a
number of partitions equal to the product of the shape of :math:`P_y`.  Each
of these partitions has, as its root, the worker that is the destination of
the data to be reduced in the sum-reduction; all other workers in the
partition are the sources of that data.  The root is specified in a back-end
specific manner, e.g., the first worker in the partition is the 0th rank in
the associated group of processors in the MPI back-end implementation.

Each worker in :math:`P_x` or :math:`P_y` has access to up to two of these
partitions.  If the worker is active in :math:`P_x`, it will have a non-null
``P_send`` partition.  If it is active in :math:`P_y`, it will have a non-null
``P_recv`` partition, of which it is the root worker.  Usually, the worker
will be a non-root worker in ``P_send``, indicating that it is going to send
partial sums to the root.  However, it is possible that the worker *may be the
root of* ``P_send``.  This occurs if ``P_send`` and ``P_recv`` are the same,
and thus the worker also contributes its own data to its reduction.

We illustrate the construction of these partitions in the following example.

Consider a sum-reduction of a tensor partitioned by a
:math:`2 \times 3 \times 2` partition :math:`P_x` to a
:math:`1 \times 3 \times 1` partition :math:`P_y`, as illustrated below, where
workers are identified by a *global* lexicographic ordering.  For example, in
an MPI-based environment, this would mean the identifiers are the ranks in a
group of size 12 that is a parent group to both partitions.

.. note::
    Implementations should not impose anything on the ordering of the workers
    in :math:`P_x` and :math:`P_y`.  The sets of workers in these partitions
    may be disjoint, partially overlapping, or completely overlapping.

.. figure:: /_images/sum_reduce_example_A01.png
    :alt: Image of 2x3x2 to 1x3x1 sum-reduction with labeled workers.

    Colored subtensors on :math:`P_x` are to be sum-reduced onto :math:`P_y`.
    Labels on subtensors are global lexicographic identifiers for the workers.

This sum-reduction creates 3 partitions as described above, illustrated below,
which we label :math:`P_0`, :math:`P_1`, and :math:`P_2` for convenience.

.. figure:: /_images/sum_reduce_example_A02.png
    :alt: Image of partitions induced by a 2x3x2 to 1x3x1 sum-reduction, with labeled workers.

The following table gives the global lexicographic identifiers of each worker
in the partition, with the root worker listed first in bold.

.. list-table::
    :header-rows: 1
    :align: center

    * - Partition
      - Workers
    * - :math:`P_0`
      - **1**, 0, 6, 7
    * - :math:`P_1`
      - **2**, 3, 8, 9
    * - :math:`P_2`
      - **3**, 4, 5, 10, 11

The following table gives the partition label for the partition that each
worker associates with ``P_send`` and ``P_recv``.

.. list-table::
    :width: 95%
    :widths: 10 10 10 1 10 10 10 1 10 10 10
    :header-rows: 1
    :align: center

    * - Worker
      - ``P_send``
      - ``P_recv``
      -
      - Worker
      - ``P_send``
      - ``P_recv``
      -
      - Worker
      - ``P_send``
      - ``P_recv``
    * - 0
      - :math:`P_0`
      - n/a
      -
      - 4
      - :math:`P_2`
      - n/a
      -
      - 8
      - :math:`P_1`
      - n/a
    * - 1
      - :math:`P_0`
      - :math:`P_0`
      -
      - 5
      - :math:`P_2`
      - n/a
      -
      - 9
      - :math:`P_1`
      - n/a
    * - 2
      - :math:`P_1`
      - :math:`P_1`
      -
      - 6
      - :math:`P_0`
      - n/a
      -
      - 10
      - :math:`P_2`
      - n/a
    * - 3
      - :math:`P_1`
      - :math:`P_2`
      -
      - 7
      - :math:`P_0`
      - n/a
      -
      - 11
      - :math:`P_2`
      - n/a

This example illustrates the case where a worker reduces onto itself as root
(workers 1 and 2), the case where a worker is the root of one partition's
reduction while contributing its own data to a different partition (worker 3),
and the case where a worker simply contributes data to the reduction (workers
0 and 4-11).  It also illustrates that the root worker is the one that is the
destination of the data, not the lowest globally ranked worker (e.g., in
:math:`P_0`).

After the forward application of the sum-reduction, all workers active in a
``P_recv`` partition have as output the sum-reduction of the appropriate
subtensors of the original tensor.

.. figure:: /_images/sum_reduce_example_A03.png
    :alt: Image of 2x3x2 to 1x3x1 sum-reduction with labeled workers.

    Completed sum-reduction.  CMY channels of :math:`P_x` subtensors add up to
    red, yellow, and blue subtensors of :math:`P_y`.

The adjoint phase works the same way, but in reverse.  However, as
demonstrated in the `motivating paper `_, the subtensors on :math:`P_y` are
broadcast back to :math:`P_x`, rather than sum-reduced.

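The ``P_send``/``P_recv`` assignments in the table above can be reproduced
with a short NumPy sketch, using the global identifiers from the figures:
:math:`P_x` holds workers 0 through 11 in lexicographic order and the
:math:`P_y` destinations (roots) are workers 1, 2, and 3.  This is only the
bookkeeping; the actual partitions are created by the back-end.

.. code-block:: python

    import numpy as np

    # Global identifiers from the example above.
    P_x_ids = np.arange(12).reshape(2, 3, 2)   # 2 x 3 x 2 source
    root_of = {0: 1, 1: 2, 2: 3}               # output index -> root worker
    index_of_root = {v: k for k, v in root_of.items()}

    for w in range(12):
        # P_send: the partition matching this worker's index along dim 1.
        j = int(np.argwhere(P_x_ids == w)[0][1])
        p_send = f"P_{j}"
        # P_recv: non-null only if this worker is a P_y destination (a root).
        p_recv = f"P_{index_of_root[w]}" if w in index_of_root else "n/a"
        print(f"worker {w:2d}: P_send={p_send}, P_recv={p_recv}")
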
Transpose
~~~~~~~~~

.. list-table::
    :width: 100%
    :header-rows: 1
    :align: left

    * - Back-end
      - Module
      - Class
    * - :ref:`MPI `
      - :any:`backends.mpi.functional.repartition `
      - :any:`RepartitionFunction `

The functional primitive for the Transpose data movement operation does not
use the original tensor partitions :math:`P_x` (input) and :math:`P_y`
(output) directly.  Instead, the calling class creates a new union partition,
within which data is moved.  Within this union, workers either share data with
other workers, receive shared data from other workers, or keep subsets of
their own data.  All sharing of data occurs through intermediate buffers.

.. note::
    These buffers are allocated by the calling class, using a helper function
    that must be specified by the back-end.

The calling class provides two sets of meta information and buffers, one for
any sending and one for any receiving a worker has to do.  The meta
information is a triplet, ``(slice, size, partner)``.  When the worker must
send data, ``slice`` is a Python ``slice`` object describing the indices of
the input tensor that it must copy to worker ``partner``, ``size`` is the
volume of ``slice``, and ``partner`` is the lexicographic identifier (or rank)
of the partner worker.  The associated send buffer will be of size ``size``,
specified in records, not bytes.  Meta information for the receive is
essentially the same, except ``slice`` describes the slice of the output
tensor the data will go to and ``partner`` is the source of the data.

While most operations can be completed using information about the local
subtensor only, this operation requires the global input tensor size.  Because
we do not require the global tensor size to be specified when the layer is
instantiated, the layer determines that size when it is called.  Consequently,
the shape of the local output subtensor cannot be known until then, either.

A worker may be part of :math:`P_x`, :math:`P_y`, or both.  If it is part of
:math:`P_x`, then it will only need to share parts of its local input
subtensor with other workers.  If it is not part of :math:`P_x`, it will take
a zero-volume tensor as input.  If it is part of :math:`P_y`, its output
subtensor will be formed by receiving copies of parts of other workers' local
input subtensors.  If it is not part of :math:`P_y`, then its output will be a
zero-volume tensor.  If a worker is part of both, its output may be a direct
copy of part of its own input.

The adjoint phase works the same way, but in reverse.  However, as
demonstrated in the `motivating paper `_, the subtensors on :math:`P_y` are
summed back to :math:`P_x`, rather than copied.  Because there are no overlaps
in the local scatters, this sum can safely be replaced with a copy.

Backends
========

.. list-table::
    :widths: 35 65
    :width: 100%
    :header-rows: 0
    :align: left

    * - :ref:`code_reference/backends/mpi:MPI Backend`
      - The MPI data movement backend.