AllSumReduce Layer

Overview

The AllSumReduce distributed data movement primitive sums data within a set of workers in a Partition.

In DistDL, all-sum-reductions sum data from (sub)tensors along slices of a partition. The all-sum-reduce operation applies to partitions with and without a (Cartesian) topology.

For the purposes of this documentation, we will assume that an arbitrary global input tensor \({x}\) is partitioned by \(P_x\).

Note

The definition of an all-sum-reduction in DistDL goes beyond the classical parallel reduction operation, for example, MPI_Allreduce() in MPI. Such reductions typically assume 1-dimensional arrays, reduced within a group of workers, and neither impose nor exploit topological structure on the set of workers.

Motivation

In distributed deep learning, there are many applications of the all-reduction primitive. For example, in normalization layers, including distributed Batch Normalization layers, global tensor statistics are required on all workers.

Implementation

A back-end functional implementation supporting DistDL's AllSumReduce allows the user to specify the partition dimensions along which the reduction occurs. No other options are required because the all-reduction happens entirely within the input partition.

Input tensors may be partitioned by this partition, but they are not required to be. If the AllSumReduce is part of a broader all-sum-reduction of a tensor along specific tensor dimensions, then the local reduction must be performed first and the distributed reduction afterward. (This is normal for distributed all-reductions.)
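For example, to compute a full all-sum-reduction of a distributed tensor over its second dimension, a worker first reduces its local subtensor and then applies AllSumReduce along the matching partition dimension. A minimal sketch, assuming a 2 x 3 Cartesian partition P_x and a local subtensor x_local already exist (these names are illustrative):

>>> import torch
>>>
>>> from distdl.nn import AllSumReduce
>>>
>>> # Local reduction: sum this worker's subtensor along tensor dimension 1.
>>> # keepdim=True preserves the tensor rank expected by the partition.
>>> partial = torch.sum(x_local, dim=1, keepdim=True)
>>>
>>> # Distributed reduction: sum the partial results along partition
>>> # dimension 1, so every worker in a row holds the full row sum.
>>> layer = AllSumReduce(P_x, (1,))
>>> y_local = layer(partial)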

Assumptions

  • The all-sum-reduction operation is not in-place. Even if the operation is equivalent to an identity (no dimensions are used in the reduction), a Torch .clone() of the tensor is returned.
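For example (a sketch, assuming the partition, layer, and input tensor constructed in the Examples section below):

>>> y = layer(x)
>>> assert y is not x          # the output is always a distinct tensor
>>> assert y.shape == x.shape  # the reduction does not change the local shape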

Forward

The forward operation sums the subtensors of \(x\) within subpartitions of \(P_x\), as specified by the user, and broadcasts the result within each subpartition. The reduce and broadcast operations are not necessarily explicit.

  • A worker that is active in \(P_x\) will take a subtensor of \(x\) as input and return a subtensor of \(y\) as output.

  • A worker that is not active in \(P_x\) will take a zero-volume tensor as input and return a clone of that tensor as output.
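The following serial NumPy sketch models these semantics for a single reduced slice of three hypothetical workers (it is a model of the behavior, not the distributed implementation):

import numpy as np

# One subtensor per active worker in the reduced slice.
subtensors = [np.random.rand(3, 5) for _ in range(3)]

# Reduce within the slice, then broadcast: every worker in the slice
# receives the elementwise sum of all subtensors in that slice.
slice_sum = sum(subtensors)
outputs = [slice_sum.copy() for _ in subtensors]

# A worker outside P_x passes a zero-volume tensor through as a copy.
inactive_input = np.empty((0,))
inactive_output = inactive_input.copy()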

This class provides only an interface to the back-end implementation of the forward algorithm. This interface does not impose any mechanism for performing the reduction. Performance details and optimizations are back-end dependent.

The back-end forward operation is implemented through the PyTorch autograd functional interface and called through the distdl.nn.AllSumReduce.forward function.
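The general pattern is a torch.autograd.Function whose forward and backward are supplied by the back end, wrapped by a module whose forward dispatches to it. A schematic sketch of that pattern (not DistDL's actual back-end code; the distributed sum itself is omitted):

import torch

class _ToyAllSumReduceFunction(torch.autograd.Function):

    @staticmethod
    def forward(ctx, input):
        # A real back end would perform the distributed sum here and
        # return a new tensor (the operation is not in-place).
        return input.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # Self-adjoint (see the Adjoint section): the backward applies
        # the same reduction to the incoming gradient.
        return grad_output.clone()

class ToyAllSumReduce(torch.nn.Module):

    def forward(self, input):
        # The module's forward only dispatches to the autograd function.
        return _ToyAllSumReduceFunction.apply(input)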

Adjoint

AllSumReduce is self-adjoint. Thus, the adjoint (backward) operation is exactly the same as the forward operation.

This class provides only an interface to the back-end implementation of the adjoint algorithm. This interface does not impose any mechanism for performing the reduction. Performance details and optimizations are back-end dependent.

The adjoint operation (PyTorch grad function class) is generated automatically via autograd and calls the backward() function implemented by the back-end functional interface.
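Self-adjointness can be checked with the standard adjoint test, \(\langle A x, \delta y \rangle = \langle x, A^{*} \delta y \rangle\). A serial NumPy sketch of this test on a single reduced slice of three hypothetical workers, using the same sum-then-broadcast model as in the forward sketch above:

import numpy as np

def all_sum_reduce(subtensors):
    # Model of the forward and adjoint: every worker in the slice
    # receives the elementwise sum of all subtensors in the slice.
    s = sum(subtensors)
    return [s.copy() for _ in subtensors]

x = [np.random.rand(3, 5) for _ in range(3)]
dy = [np.random.rand(3, 5) for _ in range(3)]

y = all_sum_reduce(x)    # forward
dx = all_sum_reduce(dy)  # adjoint: the same operation

lhs = sum(np.vdot(a, b) for a, b in zip(y, dy))  # <A x, dy>
rhs = sum(np.vdot(a, b) for a, b in zip(x, dx))  # <x, A* dy>
assert np.isclose(lhs, rhs)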

Examples

To reduce a 3-dimensional tensor that lives on a 2 x 2 x 3 partition along the last two partition dimensions:

>>> import numpy as np
>>> import torch
>>>
>>> from distdl.nn import AllSumReduce
>>> from distdl.utilities.torch import zero_volume_tensor
>>>
>>> # P_world is a partition covering all workers, assumed to already exist.
>>> P_x_base = P_world.create_partition_inclusive(np.arange(0, 12))
>>> P_x = P_x_base.create_cartesian_topology_partition([2, 2, 3])
>>>
>>> x_local_shape = np.array([3, 7, 5])
>>>
>>> axes_reduce = (1, 2)
>>> layer = AllSumReduce(P_x, axes_reduce)
>>>
>>> x = zero_volume_tensor()
>>> if P_x.active:
>>>     x = torch.rand(*x_local_shape)
>>>
>>> y = layer(x)

Here, each subtensor of \({y}\) is the sum of the subtensors of \({x}\) from 6 workers.
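Gradients flow through the layer like any other PyTorch module. Continuing the example (a sketch; gradients must be enabled on the active workers' inputs before the forward call):

>>> x = zero_volume_tensor()
>>> if P_x.active:
>>>     x = torch.rand(*x_local_shape, requires_grad=True)
>>>
>>> y = layer(x)
>>>
>>> if P_x.active:
>>>     y.sum().backward()

Because the backward operation is the same all-sum-reduction, each entry of x.grad on an active worker should equal the number of workers in its reduced slice (here, 6).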

API