Pooling Layers

Overview

DistDL’s distributed pooling layers use the distributed primitive layers to build distributed versions of PyTorch’s pooling layers.

For the purposes of this documentation, we will assume that an arbitrary global input tensor \({x}\) is partitioned by \(P_x\).

Implementation

Currently, all pooling operations follow the same pattern. Therefore, a single base class implements the core distributed work and the actual pooling operation is deferred to the underlying PyTorch layer.

As there are no learnable parameters in these layers, the parallelism is induced by the partition of the input (and therefore output) tensors. Here, the input (and output) tensors are distributed in feature-space only.

Construction of this layer is driven by the partitioning of the input tensor \(x\) only; thus, the partition \(P_x\) drives the algorithm design. With a pure feature-space partition, the output partition has the same structure, so there is no need to specify it.
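
For example, a pure feature-space partition of a 2D input over four workers has shape \(1 \times 1 \times 2 \times 2\). The following is a minimal construction sketch, assuming the MPI backend and the DistributedMaxPool2d layer; see the API reference below for exact names and signatures.

    import numpy as np
    import distdl
    from mpi4py import MPI
    from distdl.backends.mpi.partition import MPIPartition

    # Batch and channel dimensions are not partitioned (size 1); the two
    # feature dimensions are split over a 2 x 2 worker grid.
    P_world = MPIPartition(MPI.COMM_WORLD)
    P_x_base = P_world.create_partition_inclusive(np.arange(4))
    P_x = P_x_base.create_cartesian_topology_partition([1, 1, 2, 2])

    # Only P_x and the usual kernel parameters are required; the output
    # partition is implicit and is never specified.
    layer = distdl.nn.DistributedMaxPool2d(P_x, kernel_size=2, stride=2)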

In general, due to the non-centered nature of pooling kernels, halos will be one-sided. See the motivating paper for more details.

Assumptions

  • The global input tensor \(x\) has shape \(n_{\text{b}} \times n_{c_{\text{in}}} \times n_{D-1} \times \cdots \times n_0\).

  • The input partition \(P_x\) has shape \(1 \times 1 \times P_{D-1} \times \cdots \times P_0\), where \(P_{d}\) is the number of workers partitioning the \(d^{\text{th}}\) feature dimension of \(x\).

  • The global output tensor \(y\) will have shape \(n_{\text{b}} \times n_{c_{\text{in}}} \times m_{D-1} \times \cdots \times m_0\); pooling does not change the channel dimension. The precise values of \(m_{D-1}, \ldots, m_0\) depend on the input shape and the kernel parameters (a worked example follows this list).

  • The output partition \(P_y\) implicitly has the same shape as \(P_x\).
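
As a concrete, hypothetical illustration of these shapes, take \(D = 2\), a global feature shape of \(10 \times 10\), kernel size 2, and stride 2, with no padding; the output feature sizes then follow the usual PyTorch pooling formula.

    # Hypothetical global feature shape of x (D = 2): n_1 x n_0 = 10 x 10.
    n = [10, 10]
    kernel_size = [2, 2]
    stride = [2, 2]

    # m_d = floor((n_d - k_d) / s_d) + 1, with no padding or dilation.
    m = [(n_d - k_d) // s_d + 1 for n_d, k_d, s_d in zip(n, kernel_size, stride)]
    print(m)  # [5, 5]

    # With P_x of shape 1 x 1 x 2 x 2, each worker owns a 5 x 5 feature
    # subtensor of x, and the 5 x 5 global output is split so that workers
    # own 3 or 2 entries along each partitioned dimension.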

Forward

Under the above assumptions, the forward algorithm is:

  1. Perform the halo exchange on the subtensors \(x_j\) of \(x\). Here, \(x_j\) must be padded to accept local halo regions (in a potentially unbalanced way) before the halos are exchanged. The output of this operation is \(\hat x_j\).

  2. Perform the local forward pooling application using the underlying PyTorch pooling layer. Each worker’s output is its subtensor of the global output \(y\). Both steps are sketched below.
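
The two steps can be written schematically as follows. This is a sketch only; pad_widths, halo_exchange, and local_pool stand in for the layer’s internal buffers and sub-layers and are not DistDL API names.

    import torch.nn.functional as F

    def pooling_forward_sketch(x_j, pad_widths, halo_exchange, local_pool):
        # 1. Pad the local subtensor so that it can receive halo regions
        #    (the padding may be unbalanced between the left and right
        #    sides), then exchange halos with neighboring workers.
        x_hat_j = F.pad(x_j, pad_widths)
        x_hat_j = halo_exchange(x_hat_j)

        # 2. Apply the sequential PyTorch pooling layer (e.g.,
        #    torch.nn.MaxPool2d) to the haloed subtensor; the result is
        #    this worker's subtensor of the global output y.
        return local_pool(x_hat_j)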

Adjoint

The adjoint algorithm is not explicitly implemented. PyTorch’s autograd feature automatically builds the adjoint of the Jacobian of the feature-distributed pooling forward application. Essentially, the algorithm is as follows:

  1. Each worker computes its local contribution to \(\delta \hat x\), given by \(\delta \hat x_j\), using PyTorch’s native implementation of the adjoint of the Jacobian of the local sequential pooling layer.

  2. The adjoint of the halo exchange is applied to \(\delta \hat x_j\), and the result is unpadded, producing this worker’s subtensor \(\delta x_j\) of the gradient input \(\delta x\). From the user’s perspective, this is triggered by an ordinary call to backward(), as sketched below.
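
A minimal sketch, assuming dist_pool is a distributed pooling layer built on \(P_x\) and x_j is this worker’s local subtensor of \(x\):

    x_j.requires_grad_(True)

    y_j = dist_pool(x_j)

    # Backpropagation applies, in reverse order, the adjoint of the local
    # pooling Jacobian, the adjoint of the halo exchange, and the unpadding.
    y_j.sum().backward()

    delta_x_j = x_j.grad  # this worker's subtensor of the gradient input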

Pooling Mixin

Some distributed pooling layers require more than their local subtensor to compute the correct local output. This is governed by the “left” and “right” extent of the pooling window. As these calculations are the same for all pooling operations, they are mixed into every pooling layer requiring a halo exchange.

Assumptions

  • Pooling kernels are not centered; the origin of the window is the “upper left” entry.

  • When a kernel has even size, the left side of the kernel is the shorter side.

Warning

Current calculations of the subtensor index ranges required do not correctly take padding and dilation into account.
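
The required index-range calculation can be sketched in one dimension as follows; consistent with the warning above, padding and dilation are ignored. The function names are illustrative only and are not the mixin’s actual methods.

    def required_input_range(out_lo, out_hi, kernel_size, stride):
        # A non-centered window anchored at output index o covers input
        # indices [o * stride, o * stride + kernel_size - 1], so a worker
        # owning output indices [out_lo, out_hi] needs this input range.
        return out_lo * stride, out_hi * stride + kernel_size - 1

    def halo_extents(owned_in_lo, owned_in_hi, out_lo, out_hi, kernel_size, stride):
        # The left/right halo is however much of the required range lies
        # outside the input indices this worker already owns.
        in_lo, in_hi = required_input_range(out_lo, out_hi, kernel_size, stride)
        left_halo = max(0, owned_in_lo - in_lo)
        right_halo = max(0, in_hi - owned_in_hi)
        return left_halo, right_halo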

Examples
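
A sketch of end-to-end use on four MPI ranks, max-pooling a \(1 \times 3 \times 10 \times 10\) global tensor distributed over a \(1 \times 1 \times 2 \times 2\) partition. Names such as MPIPartition, DistributedMaxPool2d, and zero_volume_tensor follow the patterns used elsewhere in DistDL; consult the API reference below for exact signatures.

    import numpy as np
    import torch
    from mpi4py import MPI

    import distdl
    from distdl.backends.mpi.partition import MPIPartition
    from distdl.utilities.torch import zero_volume_tensor

    # Build a feature-space partition of shape 1 x 1 x 2 x 2 over 4 workers.
    P_world = MPIPartition(MPI.COMM_WORLD)
    P_x_base = P_world.create_partition_inclusive(np.arange(4))
    P_x = P_x_base.create_cartesian_topology_partition([1, 1, 2, 2])

    # Distributed max pooling; only P_x and the kernel parameters are needed.
    layer = distdl.nn.DistributedMaxPool2d(P_x, kernel_size=2, stride=2)

    # Each active worker holds a 1 x 3 x 5 x 5 subtensor of the global
    # 1 x 3 x 10 x 10 input; workers outside P_x hold a zero-volume tensor.
    x = zero_volume_tensor()
    if P_x.active:
        x = torch.randn(1, 3, 5, 5)
    x.requires_grad = True

    y = layer(x)        # this worker's subtensor of the global output y
    y.sum().backward()  # adjoint application; the gradient lands in x.grad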

API

distdl.nn.pooling