
Foundations of Large Language Models


DOCUMENT INFORMATION

Basic information

Title: Foundations of Large Language Models
Authors: Tong Xiao, Jingbo Zhu
Institution: Northeastern University
Field: Natural Language Processing
Type: Book
Year: 2025
City: Boston
Pages: 231
File size: 1.93 MB


Structure

  • 1.1 Pre-training NLP Models
    • 1.1.1 Unsupervised, Supervised and Self-supervised Pre-training
    • 1.1.2 Adapting Pre-trained Models
  • 1.2 Self-supervised Pre-training Tasks
    • 1.2.1 Decoder-only Pre-training
    • 1.2.2 Encoder-only Pre-training
    • 1.2.3 Encoder-Decoder Pre-training
    • 1.2.4 Comparison of Pre-training Tasks
  • 1.3 Example: BERT
    • 1.3.1 The Standard Model
    • 1.3.2 More Training and Larger Models
    • 1.3.3 More Efficient Models
    • 1.3.4 Multi-lingual Models
  • 1.4 Applying BERT Models
  • 1.5 Summary
  • 2.1 A Brief Introduction to LLMs
    • 2.1.1 Decoder-only Transformers
    • 2.1.2 Training LLMs
    • 2.1.3 Fine-tuning LLMs
    • 2.1.4 Aligning LLMs with the World
    • 2.1.5 Prompting LLMs
  • 2.2 Training at Scale
    • 2.2.1 Data Preparation
    • 2.2.2 Model Modifications
    • 2.2.3 Distributed Training
    • 2.2.4 Scaling Laws
  • 2.3 Long Sequence Modeling
    • 2.3.1 Optimization from HPC Perspectives
    • 2.3.2 Efficient Architectures
    • 2.3.3 Cache and Memory
    • 2.3.4 Sharing across Heads and Layers
    • 2.3.5 Position Extrapolation and Interpolation
    • 2.3.6 Remarks
  • 2.4 Summary
  • 3.1 General Prompt Design
    • 3.1.1 Basics
    • 3.1.2 In-context Learning
    • 3.1.3 Prompt Engineering Strategies
    • 3.1.4 More Examples
  • 3.2 Advanced Prompting Methods
    • 3.2.1 Chain of Thought
    • 3.2.2 Problem Decomposition
    • 3.2.3 Self-refinement
    • 3.2.4 Ensembling
    • 3.2.5 RAG and Tool Use
  • 3.3 Learning to Prompt
    • 3.3.1 Prompt Optimization
    • 3.3.2 Soft Prompts
    • 3.3.3 Prompt Length Reduction
  • 3.4 Summary
  • 4.1 An Overview of LLM Alignment
  • 4.2 Instruction Alignment
    • 4.2.1 Supervised Fine-tuning
    • 4.2.2 Fine-tuning Data Acquisition
    • 4.2.3 Fine-tuning with Less Data
    • 4.2.4 Instruction Generalization
    • 4.2.5 Using Weak Models to Improve Strong Models
  • 4.3 Human Preference Alignment: RLHF
    • 4.3.1 Basics of Reinforcement Learning
    • 4.3.2 Training Reward Models
    • 4.3.3 Training LLMs
  • 4.4 Improved Human Preference Alignment
    • 4.4.1 Better Reward Modeling
    • 4.4.2 Direct Preference Optimization
    • 4.4.3 Automatic Preference Data Generation
    • 4.4.4 Step-by-step Alignment
    • 4.4.5 Inference-time Alignment
  • 4.5 Summary

Contents

Pre-training NLP Models

Unsupervised, Supervised and Self-supervised Pre-training

In deep learning, pre-training involves optimizing a neural network before it is fine-tuned for specific tasks, based on the premise that a model trained on one task can be adapted for another. This method alleviates the need to train complex neural networks from scratch, especially in scenarios with limited labeled data. By leveraging tasks with more readily available supervision signals, pre-training reduces dependence on task-specific labeled data and fosters the creation of more generalized models applicable across various problems.

The resurgence of neural networks through deep learning saw early pre-training efforts focus on unsupervised learning, where neural network parameters are optimized using criteria unrelated to specific tasks. For instance, minimizing the reconstruction cross-entropy of the input vectors for each layer is a common approach. Unsupervised pre-training serves as a valuable preliminary step before supervised learning, providing advantages such as facilitating the discovery of better local minima and introducing a regularization effect during training. These benefits contribute to a more manageable and stable subsequent supervised learning phase.

A second approach to pre-training is to pre-train a neural network on supervised learning tasks. For example, consider a sequence model designed to encode input sequences into some representation.

In natural language processing (NLP), tokens are the fundamental units of text, often used interchangeably with words. During the pre-training phase, such a model is combined with a classification layer to create a system that classifies sentences based on sentiment, identifying whether they express positive or negative emotions. This model is then adapted for downstream tasks by developing a new classification system that assesses whether sequences are subjective or objective. Fine-tuning the model's parameters with task-specific labeled data is crucial for optimizing performance on these new tasks. The straightforward nature of supervised pre-training aligns with established supervised learning paradigms, but the increasing complexity of neural networks necessitates larger amounts of labeled data, making pre-training more challenging when such data is scarce.

Self-supervised learning is a third approach to pre-training neural networks, where the model generates its own supervision signals from unlabeled data instead of relying on human input. This method involves creating pseudo labels and iteratively refining the model, similar to self-training, which has been effectively used in various NLP tasks such as word sense disambiguation and document classification. Unlike traditional self-training, self-supervised pre-training does not depend on an initial model for data annotation; instead, it derives all supervision signals directly from text, allowing the model to be trained from scratch. A prominent example of this technique is training sequence models to predict masked words based on their context, facilitating large-scale self-supervised learning and enhancing performance in understanding, writing, and reasoning tasks.

Figure 1.1 illustrates a comparison of the three pre-training methods, highlighting the dominance of self-supervised pre-training in contemporary state-of-the-art NLP models. This chapter, along with the rest of the book, will concentrate on self-supervised pre-training, detailing the process of pre-training sequence models through self-supervision and the subsequent applications of these pre-trained models.

Adapting Pre-trained Models

As mentioned above, two major types of models are widely used in NLP pre-training.

Sequence encoding models transform a sequence of words or tokens into a real-valued vector or a series of vectors, effectively capturing the essence of the sequence. This representation serves as crucial input for other models, such as sentence classification systems, enhancing their ability to process and understand language.

(Figure 1.1: each pre-training paradigm is followed by an adaptation step, e.g., supervised pre-training followed by tuning, or self-supervised pre-training followed by tuning or zero/few-shot learning.)

Pre-training methods in machine learning can be categorized into unsupervised, supervised, and self-supervised approaches. Unsupervised pre-training involves training on large-scale unlabeled data, serving as a foundational step for later optimization with labeled data. In contrast, supervised pre-training operates on the premise that various learning tasks are interconnected, allowing a model trained on one task to be adapted to another with minimal additional training. Self-supervised pre-training, on the other hand, utilizes large-scale unlabeled data through self-supervision, enabling effective model training that can be efficiently fine-tuned or prompted for new tasks.

Sequence generation models in NLP involve generating a sequence of tokens based on specific contexts. The term "context" varies in meaning depending on the application; for instance, it refers to the preceding tokens in language modeling and to the source-language sequence in machine translation.

We need different techniques for applying these models to downstream tasks after pre-training. Here we are interested in the following two methods.

1.1.2.1 Fine-tuning of Pre-trained Models

For sequence encoding pre-training, fine-tuning is a widely used method to adapt pre-trained models. Let Encode_θ(·) represent an encoder with parameters θ, such as a standard Transformer encoder. Once we have pre-trained the model and obtained the optimal parameters θ̂, we can use it to model any sequence and generate the corresponding representation.

The encoder function H = Encode_θ̂(x) transforms the input sequence {x_0, x_1, ..., x_m} into a sequence of real-valued vectors {h_0, h_1, ..., h_m}. Typically, the encoder is not used in isolation within NLP systems; rather, it is integrated into larger frameworks. For instance, in a text classification task, the encoder helps determine the sentiment polarity of a text, categorizing it as positive or negative.

In auto-regressive decoding for machine translation, each token in the target language is generated by considering both the preceding tokens and the source-language sequence. To develop a text classification system, we can implement a classifier on top of the encoder. This text classification model can be represented as a neural network, denoted Classify_ω(·), with parameters ω.

The resulting probability distribution Pr_{ω,θ̂}(·|x) covers three classes: positive, negative, and neutral. The output is determined by selecting the label with the highest probability from this distribution. For simplicity, we write this composed model as F_{ω,θ̂}(·), which represents Classify_ω(Encode_θ̂(·)).

The parameters ω and θ̂ are not yet suited to the classification task, so the model needs to be adapted to the specific task requirements. A common way to achieve this is to fine-tune the model on explicitly labeled data from the downstream task.

We can fine-tune the parameters ω and θ on a labeled dataset as a supervised learning task, resulting in optimized parameters ω̃ and θ̃. Alternatively, by freezing the encoder parameters θ̂, we can optimize ω alone, enabling the classifier to adapt to the pre-trained encoder.

Once we have obtained a fine-tuned model, we can use it to classify a new text. For example, suppose we have a comment posted on a travel website:

I love the food here. It’s amazing!

We first tokenize this text into tokens, and then feed the token sequence x_new into the fine-tuned model F_{ω̃,θ̃}(·). The model generates a distribution over classes by

F_{ω̃,θ̃}(x_new) = [ Pr(positive | x_new)  Pr(negative | x_new)  Pr(neutral | x_new) ]    (1.4)

We then select the label of the entry with the maximum value as the output. In this example it is positive.
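The following PyTorch-style sketch illustrates this setup under stated assumptions: the encoder is any module returning a sequence of hidden vectors (a placeholder here), the classifier head Classify_ω is a single linear layer with a softmax over three classes, and the label names and hidden size are illustrative, not taken from the book.

    import torch
    import torch.nn as nn

    LABELS = ["positive", "negative", "neutral"]   # hypothetical label set from the example

    class TextClassifier(nn.Module):
        """F_{omega,theta}(x) = Classify_omega(Encode_theta(x)) — a minimal sketch."""
        def __init__(self, encoder: nn.Module, hidden_size: int = 768, num_classes: int = 3):
            super().__init__()
            self.encoder = encoder                                   # pre-trained Encode_theta(.)
            self.classifier = nn.Linear(hidden_size, num_classes)    # Classify_omega(.)

        def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
            h = self.encoder(token_ids)        # (batch, seq_len, hidden_size)
            pooled = h[:, 0]                   # use the first position (e.g., [CLS]) as the sequence vector
            return torch.softmax(self.classifier(pooled), dim=-1)    # Pr(.|x), as in Eq. (1.4)

    # Fine-tuning step on labeled data: update omega (and optionally theta).
    def fine_tune_step(model, optimizer, token_ids, gold_labels):
        probs = model(token_ids)
        loss = nn.functional.nll_loss(torch.log(probs), gold_labels)
        optimizer.zero_grad(); loss.backward(); optimizer.step()
        return loss.item()

    # Prediction: pick the entry with the maximum probability, e.g., "positive".
    # label = LABELS[model(x_new).argmax(dim=-1).item()]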

Fine-tuning a pre-trained model requires significantly less labeled data than the extensive pre-training data, resulting in lower computational costs. This efficiency allows pre-trained models to be adapted to specific downstream tasks by simply collecting a small amount of labeled data and making minor adjustments to the model parameters. For a more comprehensive discussion of fine-tuning, please refer to Section 1.4.

1.1.2.2 Prompting of Pre-trained Models

Sequence generation models are typically used on their own to tackle language generation tasks like question answering and machine translation, eliminating the need for extra components. This independence makes fine-tuning such models relatively straightforward.

Text tokenization can be approached in various ways; one straightforward method is to segment the text into individual English words and punctuation marks, such as {I, love, the, food, here, ., It, ’s, amazing, !}. Fine-tuning then serves as a foundational step for downstream tasks; for instance, fine-tuning a pre-trained encoder-decoder multilingual model on bilingual data can enhance its effectiveness in specific translation tasks.

Large language models, trained on extensive datasets, excel in sequence generation by predicting the next token based on the previous ones. This seemingly simple task, traditionally limited to language modeling, allows these models to learn comprehensive language knowledge through repeated practice. Consequently, pre-trained large language models demonstrate impressive token prediction capabilities, enabling the transformation of various NLP challenges into straightforward text generation tasks, such as reinterpreting text classification as a text generation problem:

I love the food here. It’s amazing! I’m

To classify text as positive, we look for specific predicted words or phrases, such as "happy," "glad," or "satisfied." A straightforward prompting method involves appending "I'm" to the input text, which aids in determining the appropriate label for the original content based on the predicted completion.

Large language models excel in language understanding and generation, allowing users to issue prompts that guide them in executing complex tasks. One such application is polarity classification, where a specific prompt instructs the model to analyze and categorize sentiments, for example:

Assume that the polarity of a text is a label chosen from {positive, negative, neutral}. Identify the polarity of the input.

Input: I love the food here. It’s amazing!

Self-supervised Pre-training Tasks

Decoder-only Pre-training

The decoder-only architecture has been widely used in developing language models [Radford et al., 2018]. A Transformer decoder can be used as a language model by omitting the cross-attention sub-layers, allowing it to predict the distribution of the next token based on the preceding tokens. The model outputs the token with the highest probability, and training involves minimizing a loss function over token sequences. Denoting the decoder as Decoder_θ(·), it produces at each position i a distribution over the next token given the previous tokens {x_0, ..., x_i}, written Pr_θ(·|x_0, ..., x_i) or simply p^θ_{i+1}. The gold-standard distribution at the same position, p^gold_{i+1}, can be viewed as a one-hot representation of the correct next word. This leads to a loss function for language modeling,

L(p^θ_{i+1}, p^gold_{i+1}), which measures the difference between the model prediction and the true prediction. In NLP, the log-scale cross-entropy loss is typically used.

Given a sequence of tokens {x_0, ..., x_m}, the loss on this sequence is the sum of the loss over the positions {0, ..., m−1}, given by

Loss = Σ_{i=0}^{m−1} LogCrossEntropy(p^θ_{i+1}, p^gold_{i+1})    (1.5)

where LogCrossEntropy(·) is the log-scale cross-entropy, and p^gold_{i+1} is the one-hot representation of x_{i+1}.

The loss function can be extended to a collection of sequences D, and the goal of pre-training is to find the parameters that minimize the total loss on D:

θ̂ = arg min_θ Σ_{x∈D} Loss(x)

Note that this objective is mathematically equivalent to maximum likelihood estimation, and can be re-expressed as

θ̂ = arg max_θ Σ_{x∈D} Σ_{i=0}^{m−1} log Pr_θ(x_{i+1} | x_0, ..., x_i)

With the optimized parameters θ̂, we can use the pre-trained language model Decoder_θ̂(·) to compute the probability Pr_θ̂(x_{i+1} | x_0, ..., x_i) at each position of a given sequence.
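As a concrete illustration of Eq. (1.5), here is a small PyTorch-style sketch of the per-sequence training loss for a decoder-only language model. The decoder is treated as a black box that returns next-token logits at every position; its interface and the vocabulary size are assumptions made for illustration.

    import torch
    import torch.nn.functional as F

    def causal_lm_loss(decoder, token_ids: torch.Tensor) -> torch.Tensor:
        """Sum of log-scale cross-entropy over positions 0..m-1 (Eq. 1.5), sketched.

        token_ids: (seq_len,) token indices x_0 ... x_m for one sequence.
        decoder(inputs) is assumed to return logits of shape (seq_len, vocab_size),
        where row i scores the next token given x_0 ... x_i.
        """
        inputs = token_ids[:-1]          # x_0 ... x_{m-1}
        targets = token_ids[1:]          # x_1 ... x_m (the "gold" next tokens)
        logits = decoder(inputs)         # unnormalized scores for p^theta_{i+1}, one row per position
        # cross_entropy applies log-softmax internally; reduction="sum" matches the summed loss
        return F.cross_entropy(logits, targets, reduction="sum")

    # Pre-training then minimizes the accumulated loss over a collection of sequences D,
    # which is equivalent to maximizing the log-likelihood of D under the model.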

Encoder-only Pre-training

An encoder, defined in Section 1.1.2.1, is a function that maps a sequence of tokens x = x_0 ... x_m to a sequence of vectors H = h_0 ... h_m. Training this model poses challenges because there is no gold-standard data for directly evaluating the output of such a real-valued function. To facilitate encoder pre-training, a common approach is to combine the encoder with output layers that provide more accessible supervision signals. As illustrated in Figure 1.2, a typical architecture for pre-training Transformer encoders adds a Softmax layer on top of the Transformer encoder, resembling the structure of a decoder-based language model, which ultimately produces a sequence of probability distributions.

If we view h_i as a row vector, H can be written as a matrix whose rows are h_0, ..., h_m.

(Figure 1.2: (a) pre-training the encoder with a Softmax layer that reconstructs masked tokens; (b) applying the pre-trained encoder with a prediction network that produces output for downstream tasks.)

In the pre-training phase of a Transformer encoder, the model is trained with a Softmax layer through self-supervision, as depicted in Figure 1.2. Once pre-training is complete, the Softmax layer is removed, and the encoder is combined with a prediction network to tackle specific tasks. To enhance performance and adaptability for these tasks, the system is fine-tuned with labeled data.

In this context, the output distribution Pr(·|x) at position i is denoted p^{W,θ}_i. The Softmax layer, denoted Softmax_W(·), is parameterized by W, meaning that Softmax_W(H) equals Softmax(H·W). For ease of notation, we may occasionally omit the superscripts W and θ from the probability distributions.

The distinction between this model and standard language models lies in the interpretation of the output p_i. In language modeling, p_i is the probability distribution for predicting the next word in an auto-regressive decoding process, where the model only considers words preceding position i. Conversely, in encoder pre-training, the model can access the entire sequence simultaneously, so there is little point in predicting tokens that are already visible in that sequence.

Masked language modeling, a key method for encoder pre-training, underpins the BERT model [Devlin et al., 2019]. This technique creates prediction tasks by masking certain tokens in an input sequence and prompting the model to predict these hidden tokens. Unlike causal language modeling, which relies solely on the left-context for predictions while ignoring the right-context, masked language modeling utilizes all unmasked tokens, enabling a bidirectional approach that incorporates both left and right contexts for improved accuracy in word prediction.

Given an input sequence x = x_0 x_1 ... x_m, we can mask the tokens at a set of positions A(x) = {i_1, ..., i_u}. This results in a modified token sequence x̄ in which each token at a masked position is replaced by a special symbol, [MASK]. For instance, given the sequence

[CLS] The early bird catches the worm

we may have a masked token sequence like this

[CLS] The [MASK] bird catches the [MASK]

where we mask the tokens early and worm (i.e., i_1 = 2 and i_2 = 6).

In this process, we have two sequences, x and x̄, and the model is optimized to predict x given x̄. This resembles an autoencoding mechanism, where the training objective is to maximize the reconstruction probability Pr(x|x̄). Notably, there is a straightforward position-wise alignment between x and x̄: unmasked tokens in x̄ correspond directly to tokens of x at the same positions, so these tokens do not need to be predicted. The training objective therefore simplifies to maximizing the probabilities of the masked tokens only, which can be framed as maximum likelihood estimation

Σ_{i∈A(x)} log Pr^{W,θ}_i(x_i | x̄)    (1.10)

or, alternatively, expressed using the cross-entropy loss, summing LogCrossEntropy(p^{W,θ}_i, p^gold_i) over the masked positions i ∈ A(x). Here Pr^{W,θ}_k(x_k | x̄) denotes the probability of the true token x_k at position k given the corrupted input x̄, p^{W,θ}_k is the probability distribution predicted at that position, and p^gold_k is its one-hot counterpart. For instance, for the sequence "The early bird catches the worm" with the two tokens masked as above, the goal is to maximize the sum of the log-scale probabilities of the correct tokens at the masked positions

log Pr(x_2 = early | x̄ = [CLS] The [MASK] bird catches the [MASK])
+ log Pr(x_6 = worm | x̄ = [CLS] The [MASK] bird catches the [MASK])

Once we obtain the optimized parametersWc and θ, we can dropˆ W Then, we can furtherc fine-tune the pre-trained encoderEncoder θ ˆ (ã)or directly apply it to downstream tasks.

Masked language modeling, despite its simplicity and widespread use, has several drawbacks. A key issue is the special token [MASK], which is used during training but never appears at test time, creating a gap between the two phases. In addition, the auto-encoding approach ignores the dependencies between masked tokens; for instance, the first masked token is predicted without considering the second masked token, which can lead to less accurate results.

The permuted language modeling approach, introduced by Yang et al. [2019], addresses these issues by allowing tokens to be predicted in any order, unlike causal language modeling, which follows the natural left-to-right order. The actual order of the tokens remains unchanged; only the prediction order is permuted, and the model is trained with standard language modeling techniques over that order. For instance, for a sequence of five tokens (x_0, x_1, x_2, x_3, x_4), the model may predict in a flexible order rather than strictly left to right. The standard left-to-right generation process factorizes the probability of the sequence as

Pr(x) = Pr(x_0) · Pr(x_1|x_0) · Pr(x_2|x_0, x_1) · Pr(x_3|x_0, x_1, x_2) · Pr(x_4|x_0, x_1, x_2, x_3)

Now, let us consider a different order for token prediction: x_0 → x_4 → x_2 → x_1 → x_3. The sequence generation process can then be expressed as follows:

Pr(x) = Pr(x_0) · Pr(x_4|e_0) · Pr(x_2|e_0, e_4) · Pr(x_1|e_0, e_4, e_2) · Pr(x_3|e_0, e_4, e_2, e_1)

This prediction order lets the model use a broader context during generation: for predicting x_3, it uses both the left-context (e_0, e_1, e_2) and the right-context (e_4). The method retains the positional information of the tokens, maintaining their original order. It therefore resembles masked language modeling, in which the target token (x_3) is masked and the surrounding tokens (x_0, x_1, x_2, x_4) are used to predict it.

Implementing permuted language models in Transformers is straightforward because the self-attention mechanism is insensitive to the input order. The permutation can be realized by applying appropriate masks in self-attention, without explicitly reordering the sequence. For instance, when computing Pr(x_1 | e_0, e_4, e_2), we keep x_0, x_1, x_2, x_3, x_4 in their original order and simply block x_1 from attending to x_3, as in the following attention mask.

(Blue box = valid attention; gray box = blocked attention.)

For a more illustrative example, we compare the self-attention masking results of causal language modeling, masked language modeling and permuted language modeling in Figure 1.3.
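To make the masking idea concrete, here is a small sketch that builds a self-attention mask for a given prediction order. It illustrates the general idea only, not the exact construction used by any particular model (e.g., XLNet uses a two-stream variant); the boolean convention True = attention allowed is an assumption.

    import torch

    def permutation_attention_mask(order):
        """Build an (n x n) boolean mask where entry [i, j] is True iff the position
        predicting x_i may attend to x_j, i.e., j comes before i in the prediction order."""
        n = len(order)
        rank = {pos: k for k, pos in enumerate(order)}   # rank of each position in the order
        mask = torch.zeros(n, n, dtype=torch.bool)
        for i in range(n):
            for j in range(n):
                mask[i, j] = rank[j] < rank[i]           # only earlier-in-order tokens are visible
        return mask

    # Prediction order x_0 -> x_4 -> x_2 -> x_1 -> x_3 from the example above:
    print(permutation_attention_mask([0, 4, 2, 1, 3]))
    # Row 1 (predicting x_1) allows attention to positions 0, 2, and 4 and blocks 1 and 3,
    # matching Pr(x_1 | e_0, e_4, e_2).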

1.2.2.3 Pre-training Encoders as Classifiers

A popular alternative for training an encoder is to use classification tasks, particularly in self-supervised learning, where new classification problems are generated from unlabeled text. There are various ways to design such tasks; we discuss two widely used examples.

BERT's original paper introduces a method known as next sentence prediction (NSP), which posits that an effective text encoder should understand the relationship between pairs of sentences. NSP uses the encoded output of two consecutive sentences, Sent A and Sent B, to determine whether Sent B logically follows Sent A. For instance, if Sent A is "It is raining." and Sent B is "I need an umbrella.", the encoder processes the combined input sequence to evaluate their connection.

In the Transformer encoding, the sequence begins with a start symbol [CLS], and a separator [SEP] delineates the two sentences "It is raining." and "I need an umbrella.". Each token in this sequence is represented by its corresponding embedding, and the embeddings are processed by the encoder to produce an output sequence. The first output, h_0, serves as the representation of the entire sequence, and a Softmax layer can be applied on top of it to build a binary classification system.

(Figure 1.3: self-attention masking for (a) causal language modeling with order x_0 → x_1 → x_2 → x_3 → x_4, factorized as Pr(x_0) · Pr(x_1|e_0) · Pr(x_2|e_0, e_1) · Pr(x_3|e_0, e_1, e_2) · Pr(x_4|e_0, e_1, e_2, e_3); (b) masked language modeling, where x_1 and x_3 are masked and predicted from the unmasked tokens; and (c) permuted language modeling with order x_0 → x_4 → x_2 → x_1 → x_3, factorized as Pr(x_0) · Pr(x_4|e_0) · Pr(x_2|e_0, e_4) · Pr(x_1|e_0, e_4, e_2) · Pr(x_3|e_0, e_4, e_2, e_1).)

Encoder-Decoder Pre-training

Encoder-decoder architectures play a crucial role in NLP for sequence-to-sequence tasks like machine translation and question answering. These models can also be adapted to tackle a variety of other problems by treating both the input and the output as text. For instance, given a piece of text, an encoder-decoder model can generate an output that conveys the sentiment of the input, categorizing it as positive, negative, or neutral.

Such an idea allows us to develop a single text-to-text system to address any NLP problem.

We can transform various problems into a unified text-to-text format. Initially, we train an encoder-decoder model to acquire general language knowledge through self-supervised learning. Subsequently, this model is fine-tuned for specific tasks using task-focused text-to-text datasets.

1.2.3.1 Masked Encoder-Decoder Pre-training

In Raffel et al. [2020]'s T5 model, many different tasks are framed as the same text-to-text task. Each sample in T5 follows the format

source text → target text

where the source text contains the task description or instruction together with the problem input, and the target text is the desired system response. For example, a training sample for translation from Chinese to English is structured as follows:

[CLS] Translate from Chinese to English: 你好! → ⟨s⟩ Hello!

where [CLS] and ⟨s⟩ are the start symbols on the source and target sides, respectively.

Likewise, we can express other tasks in the same way. For example

[CLS] Answer: when was Albert Einstein born?

→ ⟨s⟩ He was born on March 14, 1879.

[CLS] Simplify: the professor, who has published numerous papers in his field, will be giving a lecture on the topic next week.

→ ⟨s⟩ The experienced professor will give a lecture next week.

The phrase "when in Rome, do as the Romans do" emphasizes the importance of adapting to local customs and practices In Chinese, this is expressed as "人在罗马就像罗马人一样做事," which conveys the same message of embracing local traditions.

→ hsi0.81 where instructions are highlighted in gray An interesting case is that in the last example we

In our approach, we utilize the same start symbol for various sequences, employing distinct symbols to differentiate between the encoder and decoder sides Additionally, we redefine the scoring challenge as a text generation task, aiming to produce a textual representation of the number 0.81 instead of simply outputting it as a numerical figure.

This framework offers a method for universal language understanding and generation: task instructions and problem inputs are supplied in text form, and the system follows the instructions to accomplish the corresponding tasks. Many different problems can thus be handled in one format, enabling a single model to be trained to perform numerous tasks, which improves efficiency and versatility.

Fine-tuning is still commonly used to adapt such pre-trained models to specific downstream tasks, with various ways to instruct the model, such as task prefixes or detailed descriptions. Since task instructions are part of the input text, the language understanding acquired during pre-training aids this process. This can also enable zero-shot learning, allowing pre-trained models to generalize to new problems even when they have not previously encountered the specific task instructions.

Self-supervised learning has provided effective techniques for both Transformer encoders and decoders, making it straightforward to pre-train encoder-decoder models. A popular approach is to train these models as language models, where the encoder processes a prefix of the sequence and the decoder generates the remaining tokens. This differs from causal language modeling, which generates the entire sequence autoregressively from the first token: in prefix language modeling, the encoder reads the prefix in parallel, and the decoder predicts the subsequent tokens conditioned on it. For example, we may have a training sample like

[CLS] The puppies are frolicking

→ ⟨s⟩ outside the house

An encoder-decoder model can be trained on such examples, with the encoder learning to understand the prefix and the decoder generating text based on this understanding. Moreover, for large-scale pre-training, it is straightforward to generate a vast number of such training examples from unlabeled text.
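The following sketch shows one simple way such prefix-language-modeling examples could be produced from raw text, splitting each token sequence at a random point into an encoder prefix and a decoder target. The split strategy and the literal markers [CLS] and ⟨s⟩ are written out here for illustration; real pipelines would operate on token ids.

    import random

    def make_prefix_lm_example(tokens, rng):
        """Split a token sequence into (encoder input, decoder target) for prefix LM training."""
        split = rng.randint(1, len(tokens) - 1)          # keep at least one token on each side
        source = "[CLS] " + " ".join(tokens[:split])      # prefix read by the encoder
        target = "<s> " + " ".join(tokens[split:])        # continuation generated by the decoder
        return source, target

    rng = random.Random(0)
    print(make_prefix_lm_example("The puppies are frolicking outside the house".split(), rng))
    # e.g. ('[CLS] The puppies are frolicking', '<s> outside the house')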

For pre-trained encoder-decoder models to excel in multilingual and cross-lingual tasks like machine translation, they must be trained on multilingual data, which requires a vocabulary covering tokens from all languages. This allows the models to develop shared representations across languages, enhancing their ability to understand and generate language in both multilingual and cross-lingual settings.

Masked language modeling is another key method for pre-training encoder-decoder models: certain tokens in a sequence are randomly substituted with a mask symbol, and the model learns to predict these masked tokens from the context provided by the rest of the sequence.

As an illustration, consider the task of masking and reconstructing the sentence

The puppies are frolicking outside the house

By masking two tokens (say, frolicking and the), we have the BERT-style input and output of the model, as follows

[CLS] The puppies are [MASK] outside [MASK] house

Predictions are required only at the masked positions. This approach is flexible: by adjusting the percentage of masked tokens, it can be tuned toward BERT-style training or toward language-modeling-style training. For instance, when all tokens are masked, the model learns to generate the complete sequence

[CLS] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK]

→ ⟨s⟩ The puppies are frolicking outside the house

In this case, we train the decoder as a language model.

In the encoder-decoder architecture, the encoder can process a masked sequence while the decoder predicts the original sequence, so that the model functions as a denoising autoencoder: the encoder converts a corrupted input into a hidden representation, and the decoder reconstructs the uncorrupted input from this representation. This is a form of denoising training, for example

[CLS] The puppies are [MASK] outside [MASK] house

→ ⟨s⟩ The puppies are frolicking outside the house

The model thus learns to map a corrupted sequence back to its original form, improving its understanding during encoding and its generation during decoding. This process is illustrated in Figure 1.4, which shows how an encoder-decoder model is trained with BERT-style and denoising autoencoding objectives.

As we randomly select tokens for masking, we can certainly mask consecutive tokens [Joshi et al., 2020]. For example

[CLS] The puppies are [MASK] outside [MASK] [MASK]

→ ⟨s⟩ The puppies are frolicking outside the house

An alternative way of handling consecutive masked tokens is to represent them as spans. Following Raffel et al. [2020], we use sentinel tokens [X], [Y], and [Z], each covering one or more consecutive masked tokens. With this notation, the training example can be rewritten as

[CLS] The puppies are [X] outside [Y]
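A small sketch of this sentinel-style span corruption is given below. It follows the general T5-style recipe of replacing each masked span in the input with a sentinel and emitting the spans after their sentinels on the target side; the exact target layout (including whether a final closing sentinel is appended) varies between implementations and is an assumption here.

    from typing import List, Tuple

    SENTINELS = ["[X]", "[Y]", "[Z]"]

    def sentinel_mask(tokens: List[str], spans: List[Tuple[int, int]]) -> Tuple[str, str]:
        """Replace each (start, end) span (end exclusive) with a sentinel in the source,
        and list the removed spans after their sentinels in the target."""
        source, target, last = [], [], 0
        for sid, (start, end) in enumerate(spans):
            source.extend(tokens[last:start])
            source.append(SENTINELS[sid])                # span -> single sentinel token
            target.append(SENTINELS[sid])
            target.extend(tokens[start:end])             # the masked-out tokens to be generated
            last = end
        source.extend(tokens[last:])
        return "[CLS] " + " ".join(source), "<s> " + " ".join(target)

    tokens = "The puppies are frolicking outside the house".split()
    print(sentinel_mask(tokens, [(3, 4), (5, 7)]))
    # ('[CLS] The puppies are [X] outside [Y]', '<s> [X] frolicking [Y] the house')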

(Figure 1.4: training an encoder-decoder model on the masked input [CLS] The puppies are [M] in [M] house. (a) With BERT-style masked language modeling, only the masked tokens (frolicking, the) are predicted. (b) With denoising autoencoding, the decoder reconstructs the full sequence ⟨s⟩ The puppies are frolicking in the house.)

Comparison of Pre-training Tasks

In our discussion of pre-training tasks, we have seen that the same training objective can be used with different architectures, for example, masked language modeling for both encoder-only and encoder-decoder models. It is therefore more natural to categorize pre-training tasks by their training objectives than by model architecture.

• Language Modeling. Typically, this refers to an auto-regressive generation procedure over sequences: at each step, the model predicts the next token based on its previous context.

• Masked Language Modeling. Masked language modeling belongs to a general mask-predict framework. It randomly masks tokens in a sequence and predicts these tokens using the entire masked sequence.

• Permuted Language Modeling. Permuted language modeling builds on masked language modeling by introducing an order of token prediction. It (implicitly) permutes the input sequence and predicts the tokens one after another, with each prediction conditioned on the context tokens determined by the randomly chosen permutation.

• Discriminative Training. Supervision signals are generated from classification tasks, and the pre-trained model is incorporated into a classifier. The classifier and its components are trained jointly, improving classification performance.

• Denoising Autoencoding. A technique used for pre-training encoder-decoder models, where the model learns to reconstruct the original sequence from a corrupted version of that sequence.

Table 1.1 presents these methods with examples, indicating which model architectures each pre-training task suits. Each example shows an input token sequence, with the output given as a token sequence or a probability. In generation tasks such as language modeling, superscripts denote the order of generation; the absence of superscripts means the output can be generated autoregressively or simultaneously. We assume a standard Transformer encoding process on the source side, in which each token can access the entire sequence through self-attention, except for permuted language modeling, which uses attention masks for autoregressive generation. For clarity, we omit the token ⟨s⟩ from the target side in these examples.

Various pre-training tasks have been compared within unified frameworks and experimental setups, as in the studies by Dong et al. [2019], Raffel et al. [2020], and Lewis et al. [2020]. Because the number of pre-training tasks is large, a comprehensive list cannot be provided here; readers interested in a deeper exploration can refer to surveys such as Qiu et al. [2020] and Han et al. [2021].

Example: BERT

The Standard Model

The standard BERT model, introduced by Devlin et al. [2019], uses a Transformer encoder and is trained with masked language modeling and next sentence prediction tasks. The training process minimizes the combined loss of both tasks

Loss_BERT = Loss_MLM + Loss_NSP    (1.17)

As in training other deep neural networks, we optimize the model parameters by minimizing this loss. This involves collecting a set of training samples and, at each training step, using a batch of these samples to update the model.

(Table 1.1: a comparison of pre-training tasks, including causal, prefix, masked, and permuted language modeling, MASS-style and BERT-style masking, discriminative training (e.g., next sentence prediction and sentence-pair scoring with Score(h_a, h_b)), and denoising objectives such as token deletion, token reordering, span masking, and sentinel masking. Each row lists an example input and output built from the sentence "The kitten is chasing the ball" and indicates whether the task applies to encoder-only, decoder-only, or encoder-decoder models. [C] denotes [CLS], [M] denotes [MASK], and [X] and [Y] are sentinel tokens; superscripts denote the order of generation.)

During training, we randomly select samples from the collection and accumulate Loss_BERT over them. The model parameters are then updated through gradient descent or its variants. This iterative process continues until a predefined stopping criterion is met, such as convergence of the training loss.

BERT models are primarily designed to represent either a single sentence or a pair of sentences, making them effective for a range of language understanding tasks. In the following, we consider the input as a sequence containing two sentences, referred to as Sent A and Sent B.

[CLS] Sent A [SEP] Sent B [SEP]

Here we follow the notation in BERT’s paper and use [SEP] to denote the separator.

Given such a sequence, Loss_MLM and Loss_NSP can be computed independently. For masked language modeling, we predict a portion of the tokens in the sequence; standard practice is to randomly select about 15% of the tokens, as in the original BERT model. The selected tokens are then altered in three different ways.

• Token Masking. 80% of the selected tokens are masked, i.e., replaced with the symbol [MASK]. For example,

Original: [CLS] It is raining [SEP] I need an umbrella [SEP]
Masked: [CLS] It is [MASK] [SEP] I need [MASK] umbrella [SEP]

Predicting the masked tokens forces the model to represent words from their surrounding context: the missing words in "It is [MASK]" and "I need [MASK] umbrella" can both be inferred from the surrounding text, which strengthens the model's language representations.

• Random Replacement. 10% of the selected tokens are changed to a random token. For example,

Original: [CLS] It is raining [SEP] I need an umbrella [SEP]
Random Token: [CLS] It is raining [SEP] I need an hat [SEP]

This helps the model learn to recover a token from a noisy input.

• Unchanged. 10% of the selected tokens are kept unchanged. For example,

Original: [CLS] It is raining [SEP] I need an umbrella [SEP]
Unchanged Token: [CLS] It is raining [SEP] I need an umbrella [SEP]

This is not a difficult prediction task, but can guide the model to use easier evidence for prediction.

Let A(x) be the set of selected positions of a given token sequence x, and x̄ be the modified sequence of x. The loss function of masked language modeling can be defined as

Loss_MLM = − Σ_{i∈A(x)} log Pr_i(x_i | x̄)    (1.18)

where Pr_i(x_i | x̄) is the probability of predicting x_i at position i given x̄. Figure 1.5 shows a running example of computing Loss_MLM.

For next sentence prediction, we follow the method described in Section 1.2.2.3. Each training sample is labeled with a class from the set {IsNext, NotNext}, for example,

Sequence: [CLS] It is raining [SEP] I need an umbrella [SEP]
Label: IsNext

Sequence: [CLS] The cat sleeps on the windowsill [SEP] Apples grow on trees [SEP]
Label: NotNext

(Figure 1.5: a running example of BERT-style masked language modeling on the sequence [CLS] It is raining [SEP] I need an umbrella [SEP]. Tokens are selected with a probability of 15%; selected tokens are masked with a probability of 80% (giving [CLS] It is [MASK] [SEP] I need [MASK] umbrella [SEP]), altered to random tokens with a probability of 10% (giving [CLS] It is [MASK] [SEP] I need [MASK] hat [SEP]), or kept unchanged with a probability of 10%. The Transformer encoder is then trained on the modified sequence to predict the original tokens at the selected positions.)

In BERT-style masked language modeling, 15% of the tokens of a sequence are randomly chosen. Each selected token is replaced with [MASK] 80% of the time, substituted with a random token 10% of the time, or left unchanged 10% of the time. The model is then trained to predict the original tokens given the altered sequence. In the figure, e_i denotes the embedding of the token at position i, and the gray boxes denote Softmax layers.
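A sketch of this corruption procedure is shown below. The 15%/80%/10%/10% rates follow the description above; the mask token id, vocabulary size, and the decision to never select the first ([CLS]) position are illustrative assumptions.

    import random
    from typing import List, Tuple

    def corrupt_for_mlm(token_ids: List[int], vocab_size: int, mask_id: int,
                        rng: random.Random) -> Tuple[List[int], List[int]]:
        """Return (modified sequence, selected positions) following the 15%/80%/10%/10% recipe."""
        corrupted, selected = list(token_ids), []
        for i in range(1, len(token_ids)):                 # skip position 0 ([CLS]) for illustration
            if rng.random() < 0.15:                        # select ~15% of the tokens
                selected.append(i)
                r = rng.random()
                if r < 0.8:
                    corrupted[i] = mask_id                 # 80%: replace with [MASK]
                elif r < 0.9:
                    corrupted[i] = rng.randrange(vocab_size)   # 10%: random token
                # else: 10%: keep the token unchanged
        return corrupted, selected

    # The encoder is then trained to predict the original token_ids[i] for each i in `selected`
    # from the corrupted sequence, as in Loss_MLM (Eq. 1.18).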

For next sentence prediction, the encoder's output vector for the initial token [CLS], denoted h_cls (or h_0), serves as the representation of the sequence. A classifier is built on top of h_cls, giving the probability of a label c given h_cls, written Pr(c | h_cls). Various loss functions can be used for classification; with maximum likelihood training, Loss_NSP can be defined as

Loss_NSP = − log Pr(c_gold | h_cls)    (1.19)

where c_gold is the correct label for this sample.
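The NSP head can be sketched as a small binary classifier over h_cls; the hidden size and the use of a single linear layer are assumptions made for illustration rather than details taken from the BERT paper.

    import torch
    import torch.nn as nn

    class NSPHead(nn.Module):
        """Binary classifier over h_cls for {IsNext, NotNext}, as a minimal sketch."""
        def __init__(self, hidden_size: int = 768):
            super().__init__()
            self.linear = nn.Linear(hidden_size, 2)      # two classes: IsNext / NotNext

        def forward(self, h_cls: torch.Tensor) -> torch.Tensor:
            return torch.log_softmax(self.linear(h_cls), dim=-1)    # log Pr(.|h_cls)

    def nsp_loss(log_probs: torch.Tensor, gold: torch.Tensor) -> torch.Tensor:
        """Loss_NSP = -log Pr(c_gold | h_cls), averaged over the batch (Eq. 1.19)."""
        return nn.functional.nll_loss(log_probs, gold)

    # Loss_BERT for a batch is then Loss_MLM + Loss_NSP, as in Eq. (1.17).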

BERT models use the standard Transformer encoder architecture, as illustrated in Figure 1.6. The input is a sequence of embedding vectors, each obtained by summing the token embedding, the positional embedding, and the segment embedding:

e = x + e_pos + e_seg    (1.20)

The token embedding x and the positional embedding e_pos are standard components of Transformer models. The segment embedding e_seg is new: it indicates whether a token belongs to Sent A or Sent B.

(For example, for the sequence [CLS] It is raining [SEP] I need an umbrella [SEP], the token at position i contributes its token embedding x_i and positional embedding PE(i), and the segment embedding is e_A for the tokens of Sent A ([CLS] It is raining [SEP]) and e_B for the tokens of Sent B (I need an umbrella [SEP]).)
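A sketch of this input construction is given below, using three embedding tables and summing their outputs as in Eq. (1.20); the vocabulary size, maximum length, hidden size, and the use of learned positional embeddings are placeholder assumptions.

    import torch
    import torch.nn as nn

    class BertInputEmbedding(nn.Module):
        """e = x + e_pos + e_seg (Eq. 1.20), as a minimal sketch."""
        def __init__(self, vocab_size: int = 30000, max_len: int = 512, d_e: int = 768):
            super().__init__()
            self.token = nn.Embedding(vocab_size, d_e)     # token embedding x
            self.position = nn.Embedding(max_len, d_e)     # positional embedding e_pos (learned here)
            self.segment = nn.Embedding(2, d_e)            # segment embedding e_seg: 0 = Sent A, 1 = Sent B

        def forward(self, token_ids: torch.Tensor, segment_ids: torch.Tensor) -> torch.Tensor:
            positions = torch.arange(token_ids.size(-1), device=token_ids.device)
            return self.token(token_ids) + self.position(positions) + self.segment(segment_ids)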

BERT models are built on a multi-layer Transformer network, in which each Transformer layer consists of a self-attention sub-layer and a feed-forward network (FFN) sub-layer. These sub-layers use a post-norm architecture:

output = LNorm(F(input) + input)

where F(·) denotes the core function of the sub-layer and LNorm(·) denotes the layer normalization unit. A deep network is obtained by stacking multiple Transformer layers, and the final layer outputs a real-valued vector as the representation for each position of the sequence.

There are several aspects one may consider in developing BERT models.

• Vocabulary Size (|V|). In Transformers, each input token corresponds to an entry of a vocabulary V. Larger vocabularies can cover a wider range of word variants, but they also increase storage requirements.

• Embedding Size (d_e). Every token is represented as a d_e-dimensional real-valued vector. As presented above, this vector is the sum of the token embedding, positional embedding, and segment embedding, all of which are also d_e-dimensional real-valued vectors.

More Training and Larger Models

BERT represents a significant advance in NLP and has inspired numerous efforts to improve it. One major line of work scales the model up by adding training data and building larger model architectures.

RoBERTa, an extension of the standard BERT model, is an example of such efforts [Liu et al., 2019]. It shows that increasing the training data and computation can improve performance without changing the model architecture, and that removing the next sentence prediction (NSP) loss does not hurt performance on downstream tasks when training is sufficiently scaled. These findings point to a promising direction for pre-training: improvements can be obtained simply by scaling up simple pre-training tasks.

Another effective way to enhance BERT models is to increase the number of parameters. For example, He et al. [2021] developed a BERT-like model with 1.5 billion parameters by increasing the model depth and hidden size. However, scaling up BERT and similar pre-trained models introduces training challenges, such as instability and convergence difficulties, which require careful attention to model architecture, parallel computation, and parameter initialization. Shoeybi et al. [2019] trained a 3.9 billion-parameter BERT-like model, using hundreds of GPUs to meet the increased computational requirements.

More Efficient Models

BERT was larger than its predecessors when introduced and demands significant memory, which can slow down systems built on it. Building smaller and faster BERT models is part of the wider effort to improve Transformer efficiency, as explored by Tay et al. [2020] and Xiao and Zhu [2023]. Here, rather than covering that broader topic, we focus on several efficient variants of BERT.

First, a widely used approach is knowledge distillation, in which smaller student models are trained using the outputs of well-trained teacher models, transferring knowledge from larger BERT models. Since BERT models consist of multiple layers, knowledge distillation can be applied at various levels of representation: not only the output layers, but also training losses that measure the discrepancy between the hidden-layer outputs of the teacher and student models. Knowledge distillation has thus become a prominent method for developing compact pre-trained models.
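The following sketch illustrates a generic distillation loss of this kind, combining a KL term between the teacher's and student's output distributions with a mean-squared-error term on one pair of hidden layers. The temperature, the layer pairing, and the loss weights are illustrative assumptions, not the recipe of any specific distilled BERT model.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
                          student_hidden: torch.Tensor, teacher_hidden: torch.Tensor,
                          temperature: float = 2.0, alpha: float = 0.5) -> torch.Tensor:
        """Output-level KL + hidden-level MSE, as a minimal sketch of layer-wise distillation."""
        # Soft targets from the teacher's output distribution (temperature-scaled)
        kl = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * (temperature ** 2)
        # Match the hidden representations of one chosen teacher/student layer pair
        hidden_mse = F.mse_loss(student_hidden, teacher_hidden)
        return alpha * kl + (1.0 - alpha) * hidden_mse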

Second, conventional model compression techniques can be used to compress BERT models. A prevalent method is to apply general-purpose pruning strategies [Gale et al., 2019], for example removing entire layers from the Transformer encoding networks [Fan et al., 2019] or a certain percentage of the parameters in the networks [Sanh et al., 2020; Chen et al., 2020]. Pruning is also applicable to multi-head attention: Michel et al. [2019] show that removing some of the heads in BERT models does not notably hurt performance but improves inference speed. Quantization offers another way to compress BERT models by representing model parameters with low-precision numbers [Shen et al., 2020]. Although quantization is not specific to BERT, it is particularly useful for large Transformer-based architectures.

Third, since BERT models are deep and large, they can be adapted for more efficient inference through dynamic networks. One approach is to dynamically choose the layers used to process a token, allowing early exits at appropriate depths in depth-adaptive models and thereby skipping unnecessary layers [Xin et al., 2020; Zhou et al., 2020]. Similarly, length-adaptive models adjust the input sequence length dynamically, allowing the model to skip less important tokens and reduce computation, improving overall efficiency.

Fourth, it is also possible to share parameters across layers to reduce the size of BERT models.

Sharing the parameters of an entire Transformer layer throughout the layer stack can significantly reduce the number of parameters. This approach reuses the same layer across a multi-layer Transformer network and results in a smaller memory footprint at test time.
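A minimal sketch of this cross-layer parameter sharing is shown below: a single Transformer encoder layer object is applied repeatedly instead of stacking independently parameterized layers. The use of PyTorch's nn.TransformerEncoderLayer and the layer sizes are illustrative choices (this mirrors ALBERT-style sharing rather than any model described here).

    import torch
    import torch.nn as nn

    class SharedLayerEncoder(nn.Module):
        """Apply one shared Transformer layer num_layers times, sketched."""
        def __init__(self, d_model: int = 768, nhead: int = 12, num_layers: int = 12):
            super().__init__()
            # One set of layer parameters, reused at every depth
            self.layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
            self.num_layers = num_layers

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            for _ in range(self.num_layers):
                x = self.layer(x)            # same parameters at each of the stacked applications
            return x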

Multi-lingual Models

The original BERT model was designed for English, but it was quickly extended to multiple languages. One option is to train an individual model for each language; a more popular approach is to train multilingual models on data from all languages simultaneously. This led to multilingual BERT (mBERT), which is trained on text from a wide range of languages.

mBERT models differ from monolingual BERT models in using larger vocabularies that cover tokens from 104 languages. Tokens from different languages are thus mapped into a unified representation space, enabling knowledge sharing across languages through a universal representation model.

Multi-lingual pre-trained models are particularly useful for cross-lingual learning, where a model trained on tasks in one language is applied to the same tasks in another language. For instance, in cross-lingual text classification, we can fine-tune a multi-lingual pre-trained model on annotated English documents and then apply the fine-tuned model to classify Chinese documents.

An enhancement to multi-lingual pre-trained models like mBERT is to incorporate bilingual data during pre-training, so that the model explicitly learns the relationships between tokens in two languages. Such bilingual training equips the model with inherent cross-lingual transfer capabilities, facilitating its adaptation to various languages. Lample and Conneau [2019] proposed pre-training cross-lingual language models (XLMs) in this way, where the model can be trained with either causal or masked language modeling.

In the masked language modeling setting, the model is a BERT-style encoder whose training objective is to maximize the probabilities of selected tokens, which may be masked, replaced, or kept unchanged. With bilingual data, pairs of aligned sentences are sampled and packed into a single sequence for training. For instance, consider an English-Chinese sentence pair

鲸鱼 是 哺乳 动物 。 ↔ Whales are mammals

We can pack them to obtain a sequence, like this

[CLS]鲸鱼 是 哺乳 动物 。 [SEP] Whales are mammals [SEP]

We then select a certain percentage of the tokens and replace them with [MASK].

[CLS] [MASK] 是 [MASK] 动物 。 [SEP] Whales [MASK] [MASK] [SEP]

The pre-training objective is to maximize the probabilities of the masked tokens given the rest of the sequence, so the model learns representations of both the English and the Chinese segments. This training also allows the model to capture relationships between tokens in the two languages, for example predicting the Chinese token 鲸鱼 with the help of the English token Whales. Because it aligns the representations of English and Chinese in this way, the model behaves somewhat like a translation model, and this training objective is therefore called translation language modeling.
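A small sketch of how such translation-language-modeling training sequences could be assembled is shown below: an aligned sentence pair is packed into one sequence and then corrupted with the same masking routine used for monolingual masked language modeling. The packing format with [CLS] and [SEP] mirrors the example above; reusing a generic corruption function is an assumption for illustration.

    from typing import List

    def pack_bilingual_pair(src_tokens: List[str], tgt_tokens: List[str]) -> List[str]:
        """Pack an aligned sentence pair into one sequence for translation language modeling."""
        return ["[CLS]"] + src_tokens + ["[SEP]"] + tgt_tokens + ["[SEP]"]

    packed = pack_bilingual_pair("鲸鱼 是 哺乳 动物 。".split(), "Whales are mammals".split())
    print(" ".join(packed))
    # [CLS] 鲸鱼 是 哺乳 动物 。 [SEP] Whales are mammals [SEP]
    # Masking is then applied to the packed sequence (e.g., with a corruption routine like
    # corrupt_for_mlm above), and the model predicts the masked tokens on both sides,
    # possibly using cross-lingual context such as "Whales" to recover 鲸鱼.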

Multi-lingual pre-trained models also handle code-switching, the practice of alternating between languages within a text, which is common in NLP and linguistics. For instance, the following text mixes Chinese and English:

周末 我们 打算 去 做 hiking , 你 想 一起 来 吗 ?

(We plan to go hiking this weekend, would you like to join us?)

With multi-lingual pre-trained models, there is no need to distinguish between languages such as Chinese and English: every token is simply part of a shared vocabulary, effectively forming a "new" language that merges all the languages we intend to process.

The effectiveness of multi-lingual pre-training is shaped by several factors. With a fixed model architecture, these include the shared vocabulary size, the sampling proportions of the languages, and the model size. Conneau et al. [2020] show that as the number of supported languages grows, a larger model becomes necessary, and that a larger shared vocabulary helps the model capture linguistic diversity. Low-resource languages can benefit significantly from cross-lingual transfer, especially when trained together with similar high-resource languages. However, prolonged training may lead to interference between languages within the model.

(Figure: translation language modeling on the masked bilingual sequence [CLS] [MASK] 是 [MASK] 动物 。 [SEP] Whales [MASK] [MASK] [SEP], where each token carries a language label (zh or en) and the model predicts the masked tokens 鲸鱼, 哺乳, are, and mammals.)

Translation language modeling predicts masked tokens in a bilingual sequence, enabling cross-lingual modeling. In Lample and Conneau [2019], the input embeddings combine token, positional, and language embeddings, so each token must carry a language label. In multi-lingual pre-training with shared vocabularies, however, specifying the language of each token may not be necessary, and relying on language embeddings complicates the handling of code-switching; it is therefore often assumed that token representations are language-independent. It has also been observed that the performance of pre-trained models tends to decline after a certain point of training, suggesting that pre-training may need to be terminated early to avoid such interference.

A Brief Introduction to LLMs

Training at Scale

Long Sequence Modeling

General Prompt Design

Advanced Prompting Methods

Learning to Prompt

Instruction Alignment

Human Preference Alignment: RLHF

Improved Human Preference Alignment

