The Basics of Electric: A Cloze Model for Text Representation Learning

Electric is an energy-based cloze model for representation learning over text. Its structure is similar to that of the popular BERT, but it differs in a few important ways in how it is parameterized and trained.

The primary purpose of Electric is to produce vector representations of text. Like BERT, it is pre-trained as a cloze model: it models $p_{\text{data}}\left(x_{t} \mid \mathbf{x}_{\backslash t}\right)$, the conditional probability of a token $x_t$ given its surrounding context $\mathbf{x}_{\backslash t}$.

How Electric Works: An Overview of the Model Architecture and Training Process

Electric uses a transformer network that maps the input text $\mathbf{x}=\left[x_{1}, \ldots, x_{n}\right]$ into contextualized vector representations $\mathbf{h}(\mathbf{x})=\left[\mathbf{h}_{1}, \ldots, \mathbf{h}_{n}\right]$. Unlike BERT, it uses neither masking nor an output softmax layer.
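As a rough illustration, the sketch below builds a small transformer encoder in PyTorch that turns a sequence of token ids into one contextualized vector per position, with no masking and no output softmax. The layer sizes and the use of `nn.TransformerEncoder` are illustrative assumptions, not Electric's actual configuration.

```python
import torch
import torch.nn as nn

# Illustrative stand-in for Electric's transformer encoder: token embeddings
# followed by a stack of transformer layers. It maps a sequence of token ids
# to one contextualized vector h(x)_t per position -- no masking, no softmax.
# Sizes here are arbitrary placeholders, not the configuration from the paper.
vocab_size, hidden_size, num_layers = 30522, 256, 4

embedding = nn.Embedding(vocab_size, hidden_size)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden_size, nhead=4, batch_first=True),
    num_layers=num_layers,
)

x = torch.randint(0, vocab_size, (1, 10))  # one 10-token sequence of token ids
h = encoder(embedding(x))                  # h(x): shape (1, 10, hidden_size)
```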

Electric assigns a scalar energy score to each input token, where a lower energy indicates the token is more plausible given its context. This energy function is defined as

$$E(\mathbf{x})_{t}=\mathbf{w}^{T} \mathbf{h}(\mathbf{x})_{t}$$

Here, $\mathbf{w}$ is a weight vector learned during training. The energy function defines a distribution over the possible tokens at position $t$, given by the following equation:

$$p_{\theta}\left(x_{t} \mid \mathbf{x}_{\backslash t}\right)=\frac{\exp \left(-E(\mathbf{x})_{t}\right)}{Z_{\theta}\left(\mathbf{x}_{\backslash t}\right)}=\frac{\exp \left(-E(\mathbf{x})_{t}\right)}{\sum_{x^{\prime} \in \mathcal{V}} \exp \left(-E\left(\operatorname{REPLACE}\left(\mathbf{x}, t, x^{\prime}\right)\right)_{t}\right)}$$

Here, $\operatorname{REPLACE}\left(\mathbf{x}, t, x^{\prime}\right)$ denotes the input with the token at position $t$ replaced by $x^{\prime}$, and $\mathcal{V}$ is the vocabulary, typically word pieces. Using this equation, Electric assigns a probability to a token occurring at a given position conditioned on its context. Unlike BERT, which produces probabilities for all possible tokens at once through an output softmax layer, Electric must pass a candidate $x^{\prime}$ through the transformer network in order to score it.
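Continuing the sketch above, the snippet below shows how the energy $E(\mathbf{x})_t = \mathbf{w}^T \mathbf{h}(\mathbf{x})_t$ could be computed at each position, and how a single candidate token would be scored by running the replaced sequence back through the transformer. The helper names (`energy`, `replace`) and the random weight vector are hypothetical, for illustration only.

```python
import torch

# Hypothetical helpers for illustration; `embedding`, `encoder`, `hidden_size`,
# and `x` come from the sketch above.

def energy(x_ids):
    """E(x)_t = w^T h(x)_t for every position t; returns a (seq_len,) tensor."""
    h = encoder(embedding(x_ids))      # (1, seq_len, hidden_size)
    return h.squeeze(0) @ w            # (seq_len,)

def replace(x_ids, t, x_prime):
    """REPLACE(x, t, x'): a copy of x with the token at position t set to x'."""
    x_new = x_ids.clone()
    x_new[0, t] = x_prime
    return x_new

w = torch.randn(hidden_size)           # stand-in for the learned weight vector

# Unnormalized score exp(-E(REPLACE(x, t, x'))_t) for one candidate token:
# turning it into a probability requires the partition function Z.
t, x_prime = 3, 1234
score = torch.exp(-energy(replace(x, t, x_prime))[t])
```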

Still, computing $p_{\theta}$ exactly is prohibitively expensive: the partition function $Z_{\theta}\left(\mathbf{x}_{\backslash t}\right)$ sums the exponentiated negative energies of every candidate token, so producing the probabilities for a single position would require running the transformer $|\mathcal{V}|$ times. This intractability is typical of energy-based models, and Electric is therefore trained with noise-contrastive estimation, which avoids computing the partition function.
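To make the cost concrete, the following sketch (reusing the hypothetical `energy` and `replace` helpers and `vocab_size` from above) computes the exact distribution naively by looping over the whole vocabulary, one transformer pass per candidate. This is precisely the computation that is avoided during training.

```python
import torch

# Naive, exact computation of p_theta(x_t | context): one full transformer
# pass per vocabulary item just to fill in the partition function.
def exact_conditional(x_ids, t):
    energies = torch.empty(vocab_size)
    for x_prime in range(vocab_size):              # |V| transformer runs
        energies[x_prime] = energy(replace(x_ids, t, x_prime))[t]
    return torch.softmax(-energies, dim=0)         # exp(-E(.)_t) / Z

# p = exact_conditional(x, t)  # impractically slow for a real word-piece vocab
```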

The Advantages of Electric Over Other Models

Although BERT and Electric share many similarities, Electric's energy-based formulation offers some advantages. Because it scores tokens directly with an energy function rather than through a softmax layer, it allows for better rejection of noisy samples and more flexibility in assigning importance scores. Training Electric can also require less text, and the model is straightforward to integrate into downstream tasks; in some cases it even outperforms BERT on sentiment classification.

In summary, Electric is an energy-based cloze model for representation learning over text, offering an approach to producing vector representations with several potential advantages over comparable models. Although its architecture and training process are more involved than those of some other models, these advantages make it a promising and distinctive model for text representation learning.
