DExTra, or Deep and Light-weight Expand-reduce Transformation, is a technique used in machine learning to learn wider representations efficiently. The light-weight expand-reduce transformation uses group linear transformations to derive the output efficiently from specific parts of the input.

What is DExTra?

DExTra is a light-weight expand-reduce transformation technique used in machine learning. It maps an input vector of dimension $d\_{m}$ to a high-dimensional space (expansion) and then reduces that vector to a $d\_{o}$-dimensional output (reduction), using $N$ layers of group transformations. These layers use group linear transformations, which learn local representations faster and more efficiently than standard linear transformations because each group operates on only a part of the input. To learn global representations as well, DExTra shares information between the different groups using feature shuffling.

The DExTra transformation is controlled by five configuration parameters (a sketch of the underlying group linear transformation follows this list):

  1. Depth N
  2. Width multiplier $m\_{w}$
  3. Input dimension $d\_{m}$
  4. Output dimension $d\_{o}$
  5. Maximum groups $g\_{max}$ in a group linear transformation
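To make the core building block concrete, below is a minimal sketch of a group linear transformation in PyTorch. The function name `group_linear` and the tensor layout are illustrative assumptions, not the authors' implementation.

```python
import torch

def group_linear(x, weight, bias, g):
    """Group linear transformation: split the input into g groups,
    apply an independent linear map to each, then concatenate.

    x:      (batch, d_in)
    weight: (g, d_in // g, d_out // g)  -- one weight matrix per group
    bias:   (g, d_out // g)
    """
    batch = x.shape[0]
    # split into g groups: (batch, g, d_in // g) -> (g, batch, d_in // g)
    x = x.reshape(batch, g, -1).transpose(0, 1)
    # independent linear transformation per group: (g, batch, d_out // g)
    y = torch.bmm(x, weight) + bias.unsqueeze(1)
    # concatenate the group outputs: (batch, d_out)
    return y.transpose(0, 1).reshape(batch, -1)
```

With `g = 1` this is an ordinary linear layer; with larger `g`, each group gets its own, much smaller weight matrix, which is what makes the transformation light-weight.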

How does DExTra work?

DExTra works by employing group linear transformations that derive the output from specific parts of the input. The transformation consists of a two-phase process: expansion and reduction.

In the expansion phase, the input vector is linearly projected to a high-dimensional space of size $d\_{max} = m\_{w}d\_{m}$ using the first $\text{ceil}\left(\frac{N}{2}\right)$ layers. In the reduction phase, the $d\_{max}$-dimensional vector is projected down to a $d\_{o}$-dimensional space using the remaining $N - \text{ceil}\left(\frac{N}{2}\right)$ layers.
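To illustrate the two phases, the sketch below computes a possible per-layer width schedule. Linear interpolation of widths between $d\_{m}$, $d\_{max}$, and $d\_{o}$ is an assumption made here for illustration; the exact schedule may differ from the paper's.

```python
import math

def dextra_widths(d_m, d_o, N, m_w):
    """Per-layer output widths for a DExTra block (illustrative sketch;
    assumes N >= 2 so both phases have at least one layer)."""
    d_max = m_w * d_m
    n_exp = math.ceil(N / 2)   # layers in the expansion phase
    n_red = N - n_exp          # layers in the reduction phase
    widths = []
    for l in range(1, n_exp + 1):   # grow from d_m toward d_max
        widths.append(round(d_m + (d_max - d_m) * l / n_exp))
    for l in range(1, n_red + 1):   # shrink from d_max toward d_o
        widths.append(round(d_max + (d_o - d_max) * l / n_red))
    return widths

print(dextra_widths(d_m=128, d_o=128, N=4, m_w=2))  # [192, 256, 192, 128]
```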

The output $\mathbf{Y}^{l}$ at each layer $l$ is defined as:

$$ \mathbf{Y}^{l} = \begin{cases} \mathcal{F}\left(\mathbf{X}, \mathbf{W}^{l}, \mathbf{b}^{l}, g^{l}\right) & \text{if } l = 1 \\ \mathcal{F}\left(\mathcal{H}\left(\mathbf{X}, \mathbf{Y}^{l-1}\right), \mathbf{W}^{l}, \mathbf{b}^{l}, g^{l}\right) & \text{otherwise} \end{cases} $$

The number of groups at each layer $l$ is computed as:

$$ g^{l} = \begin{cases} \text{min}\left(2^{l-1}, g\_{max}\right) & \text{if } 1 \leq l \leq \text{ceil}\left(N/2\right) \\ g^{N-l} & \text{otherwise} \end{cases} $$
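In code, this schedule reads as follows. Note that for $l = N$ the rule as written would reference $g^{0}$, so the sketch below clamps the recursive index to layer 1; that clamping is our assumption.

```python
import math

def num_groups(l, N, g_max):
    """Number of groups g^l at layer l (1-indexed), per the piecewise rule.
    The index N - l is clamped to 1 (an assumption; the rule as written
    would otherwise reference g^0 at the last layer)."""
    if 1 <= l <= math.ceil(N / 2):
        return min(2 ** (l - 1), g_max)
    return num_groups(max(N - l, 1), N, g_max)  # mirror the expansion phase

print([num_groups(l, N=6, g_max=8) for l in range(1, 7)])  # [1, 2, 4, 2, 1, 1]
```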

In the above equations, $\mathcal{F}$ is a group linear transformation function. It takes the input $\left(\mathbf{X} \text{ or } \mathcal{H}\left(\mathbf{X}, \mathbf{Y}^{l-1}\right) \right)$, splits it into $g^{l}$ groups, and applies a linear transformation with learnable weights $\mathbf{W}^{l}$ and bias $\mathbf{b}^{l}$ to each group independently. The outputs of the groups are then concatenated to produce the final output $\mathbf{Y}^{l}$. The function $\mathcal{H}$ first shuffles the output of each group in $\mathbf{Y}^{l-1}$ and then combines it with the input $\mathbf{X}$ using an input mixer connection.
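A minimal sketch of $\mathcal{H}$ is shown below. Combining the shuffled output with $\mathbf{X}$ via concatenation is an illustrative assumption; the input mixer connection in the paper may combine the tensors differently.

```python
import torch

def feature_shuffle_and_mix(x, y_prev, g):
    """H(X, Y^{l-1}): shuffle features across the g groups of Y^{l-1},
    then combine the result with the input X."""
    batch, d = y_prev.shape
    # feature shuffling: (batch, g, d // g) -> (batch, d // g, g) -> (batch, d),
    # so each new group receives features from every previous group
    shuffled = y_prev.reshape(batch, g, d // g).transpose(1, 2).reshape(batch, d)
    # input mixer connection: combine with the input X (concatenation is an
    # illustrative choice here, not necessarily the paper's exact operation)
    return torch.cat([x, shuffled], dim=-1)
```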

In their experiments, the authors set $g\_{max} = \text{ceil}\left(\frac{d\_{m}}{32}\right)$, which ensures that each group processes at least 32 input elements; for example, $d\_{m} = 128$ gives $g\_{max} = 4$. It is important to note that group linear transformations reduce to standard linear transformations when $g^{l} = 1$. Likewise, DExTra is equivalent to a multi-layer perceptron when $g\_{max} = 1$.
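Using the `num_groups` sketch above, setting $g\_{max} = 1$ indeed collapses every layer to a single group, so each group linear transformation becomes an ordinary linear layer and the block behaves like a multi-layer perceptron:

```python
# with g_max = 1 every layer uses a single group
print([num_groups(l, N=6, g_max=1) for l in range(1, 7)])  # [1, 1, 1, 1, 1, 1]
```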

DExTra is a light-weight expand-reduce transformation that allows wider representations to be learned efficiently in machine learning. Because its group linear transformations derive each part of the output from a specific part of the input, it is faster and more efficient than standard linear transformations. As a result, it is an innovative machine learning technique that is likely to see widespread use in the years ahead.
