What is Gradient-Based Subword Tokenization?

Gradient-Based Subword Tokenization (GBST), introduced with the Charformer model (Tay et al., 2021), is a method for automatically learning latent subword representations directly from characters. Rather than relying on a fixed tokenizer, it is a soft, differentiable module: it enumerates candidate subword blocks and uses a block scoring network to score them at every position in a data-driven way.

The scoring network assigns a score to each candidate subword block, and a position-wise soft selection (a softmax over the scores) combines the candidates into a single latent subword representation per position. GBST differs from prior tokenization-free methods in that it learns interpretable latent subwords: the lexical representations can be inspected directly, and the resulting models are faster and more memory-efficient than comparable byte-level models.
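To make "enumerating subword blocks" concrete, here is a tiny illustrative sketch (the `enumerate_blocks` function and the block sizes are ours, not part of any library): for each block size, the character sequence is cut into contiguous, non-overlapping blocks.

```python
# Minimal illustration of candidate subword-block enumeration.
# For each block size b, the character sequence is split into
# contiguous, non-overlapping blocks of b characters.

def enumerate_blocks(chars, block_sizes=(1, 2, 3, 4)):
    """Return {block_size: list of character blocks} for one sequence."""
    return {
        b: [chars[i:i + b] for i in range(0, len(chars), b)]
        for b in block_sizes
    }

print(enumerate_blocks(list("tokenization")))
# size 1 -> ['t'], ['o'], ... ; size 2 -> ['to'], ['ke'], ... ; size 3 -> ['tok'], ...
```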

The Benefits of GBST

The main benefits of GBST are efficiency and interpretability. Because the module downsamples the character sequence into a shorter sequence of latent subwords, the downstream model processes far fewer positions than it would for raw characters or bytes, saving both compute and memory. At the same time, the learned latent subwords can be inspected directly: examining which block sizes the model prefers at each position surfaces common prefixes, root words, and suffixes, making linguistic patterns in the text easier to identify.
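As a back-of-the-envelope illustration of the efficiency claim (the sequence length and downsampling rate below are assumed values, not measurements): self-attention cost grows roughly quadratically with sequence length, so downsampling characters by a factor d cuts that cost by about d².

```python
# Rough cost comparison with illustrative numbers only.
# Self-attention cost grows roughly with the square of sequence length,
# so downsampling by a factor d cuts it by about d**2.

seq_len = 2048     # characters in the input (assumed)
downsample = 4     # hypothetical GBST downsampling rate

char_cost = seq_len ** 2
gbst_cost = (seq_len // downsample) ** 2
print(f"relative attention cost: {char_cost / gbst_cost:.0f}x reduction")  # -> 16x
```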

How GBST Works

The GBST module works through the following steps; a code sketch of the full forward pass follows the list:

  1. Preprocessing - the input text is converted into numerical form so that the model can process it: each document or sentence is split into characters (or bytes), and each character is mapped to a learned character embedding.
  2. Candidate Subword Block Enumeration - GBST enumerates candidate subword blocks by segmenting the character sequence into contiguous blocks of several sizes (for example, one to four characters), so that every position has one candidate representation per block size.
  3. Scoring Network - a learned block scoring network (in the simplest case a linear layer) assigns a score to each candidate block's pooled representation. The scores are learned from the training signal rather than computed from fixed frequency statistics.
  4. Soft Selection - at every position, the block scores are passed through a softmax and the candidate block representations are combined as a weighted average. High-scoring blocks dominate the resulting latent subword, while the whole operation remains differentiable.
  5. Training - GBST is trained end-to-end with the downstream model on a large corpus of text. Gradients from the task loss flow back into the scoring network (hence "gradient-based"), adjusting its weights so that the soft selection favors useful subwords.
  6. Inference - once trained, the module is applied to new documents or sentences with the same steps, tokenizing them into latent subwords and handing the shortened sequence to the downstream model.
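Putting these steps together, the following is a minimal, hedged sketch of a GBST-style forward pass in PyTorch. It is a simplification, not the Charformer implementation: the class name, the mean-pooling block enumeration, the linear scoring layer, and the downsampling rate are illustrative choices, and details from the paper such as block offsets and position-aware scoring are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GBST(nn.Module):
    """A simplified sketch of a gradient-based subword tokenization module."""

    def __init__(self, vocab_size, dim, max_block_size=4, downsample=2):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, dim)  # step 1: character embeddings
        self.score = nn.Linear(dim, 1)                 # step 3: block scoring network
        self.block_sizes = list(range(1, max_block_size + 1))
        self.downsample = downsample

    def forward(self, char_ids):                       # char_ids: (batch, seq_len)
        x = self.char_emb(char_ids)                    # (batch, seq_len, dim)
        candidates = []
        for b in self.block_sizes:                     # step 2: enumerate blocks
            # Mean-pool non-overlapping blocks of size b, then repeat each
            # pooled block b times so every character position has one
            # candidate representation per block size.
            pooled = F.avg_pool1d(x.transpose(1, 2), kernel_size=b,
                                  stride=b, ceil_mode=True).transpose(1, 2)
            pooled = pooled.repeat_interleave(b, dim=1)[:, :x.size(1)]
            candidates.append(pooled)
        blocks = torch.stack(candidates, dim=2)        # (batch, seq, n_sizes, dim)
        scores = self.score(blocks).squeeze(-1)        # (batch, seq, n_sizes)
        weights = scores.softmax(dim=-1)               # step 4: soft selection
        latent = (weights.unsqueeze(-1) * blocks).sum(dim=2)  # weighted average
        # Shorten the sequence before the downstream Transformer.
        latent = F.avg_pool1d(latent.transpose(1, 2), kernel_size=self.downsample,
                              stride=self.downsample, ceil_mode=True).transpose(1, 2)
        return latent

gbst = GBST(vocab_size=256, dim=64)
out = gbst(torch.randint(0, 256, (2, 32)))
print(out.shape)  # torch.Size([2, 16, 64])
```

With a downsampling rate of 2, a 32-character input becomes 16 latent subword positions, which is the sequence the downstream model actually attends over.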

Applications of GBST

GBST has several applications in natural language processing (NLP).

Text Classification

GBST can be used for text classification tasks such as sentiment analysis or topic classification. Because it tokenizes text into latent subwords, it can surface the most common prefixes, root words, and suffixes within each category of text, exposing linguistic patterns that give insight into the nature of the data.
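One way to perform that kind of inspection, assuming access to the position-wise soft-selection weights computed inside a GBST-style module (like the `weights` tensor in the sketch above; the values here are random stand-ins, not trained weights):

```python
import torch

# Stand-in for the (batch, seq, n_block_sizes) soft-selection weights that
# the GBST sketch above computes; real values would come from a trained model.
weights = torch.rand(1, 12, 4).softmax(dim=-1)

# The dominant block size at each character position hints at the subword
# granularity (single characters vs. longer affix- or root-like blocks).
preferred = weights.argmax(dim=-1) + 1  # +1: index 0 corresponds to block size 1
print(preferred[0].tolist())            # e.g. [3, 3, 1, 2, ...]
```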

Word Embeddings

GBST can be used to generate word embeddings, dense vector representations of words that capture semantic meaning. Because it operates on subword tokens, embeddings can be composed for out-of-vocabulary words, i.e., words that never appeared in the training corpus, from the subword pieces that did.
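As a schematic illustration of that out-of-vocabulary benefit (the subwords, vectors, and the `embed_oov` helper below are hypothetical, not part of GBST or any library):

```python
import numpy as np

# Hypothetical learned subword vectors (values are illustrative, not trained).
subword_vecs = {
    "token": np.array([0.2, 0.7, 0.1]),
    "iza":   np.array([0.5, 0.1, 0.4]),
    "tion":  np.array([0.3, 0.3, 0.9]),
}

def embed_oov(subwords):
    """Compose an embedding for an unseen word by averaging its subword vectors."""
    return np.mean([subword_vecs[s] for s in subwords], axis=0)

# "tokenization" never appeared as a whole word, but its pieces did.
print(embed_oov(["token", "iza", "tion"]))  # -> [0.333 0.367 0.467] (approx.)
```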

Translation

GBST can be used for machine translation. Tokenizing text into subwords exposes subword units that recur in a given language, which helps when translating sentences containing words with no direct equivalent: the model can fall back on subword-level correspondences instead of whole-word matches.

Gradient-Based Subword Tokenization is an efficient and interpretable approach to subword tokenization: a data-driven module that enumerates candidate subword blocks, scores them with a learned scoring network, and softly selects among them. The latent subwords it learns are easy to inspect, and the downsampled sequences make it faster than comparable byte-level models. Its applications in NLP include text classification, word embeddings, and machine translation.
