NLP with Transformers: Introduction — Part 1

Cahit Barkin Ozer
Jun 27, 2023

This is the first section of a thorough summary of the book Natural Language Processing with Transformers [1].

https://towardsdatascience.com/transformers-89034557de14

If you can afford to buy and read the book [1], I strongly advise you to do so rather than reading my notes. But if you can’t… let’s get started :)

This series will include 11 sections, with the goal of teaching you how to:

  • Build, debug, and optimize transformer models for core NLP tasks such as text classification, named entity recognition, and question answering.
  • Use transformers for cross-lingual transfer learning.
  • Apply transformers in real-world scenarios where labeled data is scarce.
  • Make transformer models more deployable by employing approaches like distillation, pruning, and quantization.
  • Train transformers from the ground up and scale them across many GPUs and distributed environments.

The Transformer architecture is so good at capturing patterns in long sequences of data and coping with large datasets that it is already being used for applications other than NLP, such as image processing.

Most projects will not provide you with a large dataset to train a model from scratch. Fortunately, it’s sometimes possible to download a model that has already been trained on a generic dataset; all you have to do now is fine-tune it on your own (much smaller) dataset.

Pretraining has been widely used in image processing since the early 2010s, but it had previously been limited to contextless word embeddings in NLP. Then, in 2018, several papers proposed full-fledged language models that could be pre-trained and fine-tuned for a variety of NLP tasks; this completely changed the game.

Hugging Face’s Transformers library is open source, supports TensorFlow and PyTorch, and allows you to quickly download a cutting-edge pre-trained model from the Hugging Face Hub, configure it for your task, fine-tune it on your dataset, and assess it. The library’s use is rapidly increasing.

It was the integration of various ideas that were bubbling in the academic community at the time, such as attention, transfer learning, and scaling up neural networks, that revolutionized the field overnight.

Introduction to Transformers

Combining the Transformer architecture with unsupervised pretraining eliminated the need to train task-specific architectures (typically built on RNNs and LSTMs) from scratch, and significantly outperformed almost every NLP benchmark. A plethora of transformer models has emerged since the publication of GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers).

To understand what makes transformers unique, we must first explain the encoder-decoder framework, attention mechanisms, and transfer learning.

The Encoder-Decoder Framework

The encoder’s job is to convert the information from the input sequence into a numerical representation known as the last hidden state. The decoder then uses this state to construct the output sequence.

Encoder and decoder components can be any type of neural network design that can model sequences in general.

With RNNs, for example, the English sentence “Transformers are great!” is encoded as a hidden state vector, which is then decoded to produce the German translation. The encoder processes the input words sequentially, and the output words are generated one at a time.

The final hidden state of the encoder creates an information bottleneck: it must capture the meaning of the whole input sequence, since this is all the decoder has access to when generating the output. This is especially challenging for long sequences, because information at the beginning of the sequence may be lost in the process of compressing everything into a single, fixed representation.

Fortunately, this bottleneck can be avoided by granting the decoder access to all of the encoder’s hidden states. The general mechanism for this is called attention, and it is a critical component in many modern neural network architectures.

Attention Mechanisms

The key principle underpinning attention is that, rather than producing a single hidden state for the input sequence, the encoder outputs a hidden state at each step that the decoder can access. However, using all of these states at once would create a massive input for the decoder, so some mechanism is needed to prioritize which states to use. This is where attention comes in: it allows the decoder to assign a different amount of weight, or “attention,” to each of the encoder’s states at each decoding timestep.

These attention-based models can learn nontrivial alignments between words in a generated translation and those in a source sentence by focusing on which input tokens are most relevant at each timestep.

Although attention allowed for considerably better translations, there remained one big drawback to employing recurrent models for the encoder and decoder: the computations are fundamentally sequential and cannot be parallelized across the input sequence.

With the Transformer, a new modeling paradigm was introduced: abandon recurrence entirely in favor of a special form of attention known as self-attention. Self-attention allows attention to operate on all the states in the same layer of the neural network. Both the encoder and the decoder have their own self-attention mechanisms, whose outputs are fed to feed-forward neural networks (FF NNs). This architecture can be trained significantly faster than recurrent models and has paved the way for many recent advances in NLP.
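
As a rough illustration of the idea (a minimal sketch, assuming PyTorch and random, untrained projection matrices rather than anything from the book), scaled dot-product self-attention lets every token build its updated representation as a weighted mix of all the states in the same layer:

import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: [batch, seq_len, dim]; w_q / w_k / w_v: [dim, dim] projection matrices
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)  # pairwise similarities
    weights = F.softmax(scores, dim=-1)                     # attention weights per token
    return weights @ v                                      # weighted mix of all states

dim = 8
x = torch.randn(1, 5, dim)                                  # 5 tokens, 8-dim embeddings
w_q, w_k, w_v = (torch.randn(dim, dim) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)               # torch.Size([1, 5, 8])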

In the original Transformer paper, the translation model was trained on a large corpus of sentence pairs in various languages. Most of us, however, do not have access to massive amounts of labeled text data with which to train our models. Transfer learning was the final missing piece needed to kickstart the transformer revolution.

Transfer Learning

In computer vision, transfer learning is used to train a convolutional neural network such as ResNet on one task and then adapt or fine-tune it to a new task. This enables the network to apply the knowledge gained from the original task. Architecturally, this entails splitting the model into a body and a head, where the head is a task-specific network.

During training, the body’s weights learn broad properties of the source domain, and these weights (a neural network’s learnable parameters) are used to initialize a new model for the new task. Compared to traditional supervised learning, this approach typically yields high-quality models that can be trained much more efficiently on a variety of downstream tasks, and with significantly less labeled data.
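
A rough sketch of this body/head split (assuming torchvision’s ResNet-18; this is not code from the book): freeze the pretrained body and attach a new task-specific head.

import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")   # body pretrained on ImageNet
for param in model.parameters():                   # freeze the body's weights
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 2)      # new head for a 2-class task
# Only the new head's parameters are trainable on the downstream task.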

In 2017 and 2018, OpenAI researchers obtained strong performance on a sentiment classification task by using features extracted from unsupervised pretraining. This was followed by ULMFiT, which established a general framework for adapting pre-trained LSTM models to a variety of tasks.

Pretraining

The first training objective is to predict the next word based on the previous words; this process is called language modeling. The beauty of this approach is that no labeled data is required, and text from freely available sources such as Wikipedia can be used.
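
As a hedged sketch of this objective (the "gpt2" checkpoint here is an assumption for illustration, not the book's code), you can ask a causal language model to score candidates for the next word given the previous ones:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Transformers are", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits               # [batch, seq_len, vocab_size]
next_token_probs = logits[0, -1].softmax(dim=-1)  # distribution over the next word
top = next_token_probs.topk(5)
print([tokenizer.decode(int(idx)) for idx in top.indices])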

Domain adaptation

After pre-training the language model on a large-scale corpus, the next step is to adapt it to the in-domain corpus (for example, going from Wikipedia to the IMDb corpus of movie reviews). Language modeling is still used at this stage, but now the model must predict the next word in the target corpus.

Fine-tuning

At this step, the language model is fine-tuned with a classification layer for the target task (for example, classifying the sentiment of movie reviews).

In 2018, two transformers that merged self-attention with transfer learning were released:

GPT

GPT uses only the decoder part of the Transformer architecture, along with the same language modeling approach as ULMFiT. GPT was pre-trained on the BookCorpus, which consists of 7,000 unpublished books in genres such as Adventure, Fantasy, and Romance.

BERT

BERT uses the encoder part of the Transformer architecture and a form of language modeling known as masked language modeling. Masked language modeling attempts to predict randomly masked words in a text. Given a sentence like “I looked at my [MASK] and saw that [MASK] was late,” the model must estimate the most likely candidates for the masked words marked by [MASK]. BERT was pre-trained on the BookCorpus and English Wikipedia.
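
A hedged sketch of masked language modeling in practice, using the fill-mask pipeline (the bert-base-uncased checkpoint and the single-mask sentence are illustrative choices, not the book's code):

from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
predictions = unmasker("I looked at my [MASK] and saw that it was late.")
for p in predictions:
    print(p["token_str"], round(p["score"], 3))   # most likely fillers and their scores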

However, because different research labs released their models in incompatible frameworks (PyTorch or TensorFlow), NLP practitioners found it difficult to transfer these models to their own applications. With the release of Transformers, a uniform API for more than 50 architectures was gradually constructed. This library sparked an explosion of transformer research, which swiftly filtered down to NLP practitioners, making it simple to integrate these models into many real-world applications today.

Hugging Face Transformers: Bridging the Gap

Applying a new machine learning architecture to a new task can be difficult, and often entails the following steps:

  1. Implement the model architecture in code, typically using PyTorch or TensorFlow.
  2. Load the pre-trained weights from a server (if available).
  3. Preprocess the inputs, then run them through the model with some task-specific postprocessing.
  4. To train the model, implement data loaders and define loss functions and optimizers.

Each of these steps requires custom logic for each model and task. Traditionally, when a research group publishes a new paper, the code and model weights are also released. This code, however, is rarely standardized and frequently requires days of engineering to adapt to new use cases.

Hugging Face Transformers provides a standardized interface to a diverse set of transformer models, as well as code and tools for adapting these models to new applications. The library currently supports three major deep learning frameworks (PyTorch, TensorFlow, and JAX) and lets you switch between them with ease. It also includes task-specific heads, which allow you to easily fine-tune transformers for downstream tasks like text classification, named entity recognition, and question answering.

Hugging Face Transformers features a tiered API that lets you interact with the library at different levels of abstraction. Pipelines abstract away all of the steps required to convert raw text into a set of predictions from a fine-tuned model. In Hugging Face Transformers, we instantiate a pipeline by calling the pipeline() function and supplying the task name:

from transformers import pipeline
classifier = pipeline("text-classification")
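
The pipeline examples below all operate on a variable named text holding the input string. The book builds them around a fictional customer complaint about a Transformers action figure; the stand-in below is only illustrative, not the book’s exact wording:

text = ("Dear Amazon, last week I ordered an Optimus Prime action figure, "
        "but when I opened the package I found a Megatron figure instead. "
        "I demand an exchange. Sincerely, Bumblebee.")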

Each pipeline accepts a text string (or a list of strings) as input and outputs a list of predictions. Because each prediction is a Python dictionary, we can use Pandas to present it as a DataFrame:

import pandas as pd
outputs = classifier(text)
pd.DataFrame(outputs)

This was a text classification example for sentiment analysis. The pipeline returns a label (such as POSITIVE or NEGATIVE) together with a confidence score between 0 and 1.

Named Entity Recognition

We frequently want to know if the comment was about a specific item or service. Real-world items such as products, places, and people are referred to as named entities in NLP, and extracting them from text is referred to as named entity recognition (NER).
We can apply NER by loading the appropriate pipeline and feeding it our customer review:

ner_tagger = pipeline("ner", aggregation_strategy="simple")
outputs = ner_tagger(text)
pd.DataFrame(outputs)

The pipeline detected all of the entities and assigned each of them a category such as ORG (organization), LOC (location), or PER (person). Here, we used the aggregation_strategy option to group words together according to the model’s predictions. For example, the entity “Optimus Prime” consists of two words but is assigned a single category, MISC (miscellaneous). The scores indicate how confident the model was about the entities it recognized.

Question Answering

We give the model a chunk of text called the context, as well as a query whose response we want to extract. The model then returns the text span associated with the response.

reader = pipeline("question-answering")
question = "What does the customer want?"
outputs = reader(question=question, context=text)
pd.DataFrame([outputs])

The pipeline also produced start and end integers corresponding to the character indices where the answer span was identified (similar to NER tagging). There are various types of question answering, but this one is known as extractive question answering since the answer is extracted straight from the text.

Summarization

This is a considerably more difficult task than the previous ones because the model must generate coherent text.

summarizer = pipeline("summarization")
outputs = summarizer(text, max_length=45, clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])

In this example, you can also see that we gave some keyword parameters to the pipeline, such as max_length and clean_up_tokenization_spaces, which allow us to alter the outputs at runtime.

Translation

Translation, like summarization, is a task whose output is created text. Let us use a translation pipeline to convert a text from one language to another:

translator = pipeline("translation_en_to_de",
                      model="Helsinki-NLP/opus-mt-en-de")
outputs = translator(text, clean_up_tokenization_spaces=True, min_length=100)
print(outputs[0]['translation_text'])

Text Generation

Imagine you want to respond to customer comments more quickly by having access to an autocomplete capability: given a prompt, the model generates a likely continuation.

generator = pipeline("text-generation")
response = "Dear Bumblebee, I am sorry to hear that your order was mixed up."
prompt = text + "\n\nCustomer service response:\n" + response
outputs = generator(prompt, max_length=200)
print(outputs[0]['generated_text'])

All of the models used in this chapter are freely available and have already been fine-tuned for the task at hand. In general, you’ll want to fine-tune models on your own data, which you’ll learn how to do in the next chapters.

The Hugging Face Ecosystem

The Hugging Face ecosystem is divided into two parts: a library family and the Hub. The libraries supply the code, while the Hub provides the pre-trained model weights, datasets, evaluation metrics scripts, and other resources.

(Figure: an overview of the Hugging Face ecosystem.)

The Hugging Face Hub

As previously stated, transfer learning is one of the primary aspects driving transformer success because it allows for the reuse of pre-trained models for various tasks. It is also critical to be able to quickly load and perform experiments with pre-trained models.
The Hugging Face Hub hosts almost 20,000 free models. Filters for tasks, frameworks, datasets, and other attributes are available to help you navigate the Hub and quickly find promising candidates.

The Hub also includes model and dataset cards, which let you describe the contents of models and datasets and make an informed choice about whether they’re suited for you. One of the most intriguing aspects of the Hub is the ability to immediately test any model via the many task-specific interactive widgets.

Note that PyTorch and TensorFlow have their own hubs that are worth checking out if a specific model or dataset is not available on the Hugging Face Hub.

Hugging Face Tokenizers

Each of the pipeline examples in this chapter includes a tokenization step that splits the raw text into smaller pieces known as tokens. Hugging Face Tokenizers supports a wide range of tokenization schemes and is particularly fast at tokenizing text thanks to its Rust backend. It also handles all the preprocessing and postprocessing steps, such as normalizing the inputs and converting the model outputs to the required format. We can load a tokenizer with Hugging Face Tokenizers in the same way that we can load pre-trained model weights with Hugging Face Transformers.
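
A minimal sketch of loading and using a tokenizer (the checkpoint name is an assumption for illustration; AutoTokenizer from Transformers uses the fast Rust-backed tokenizers under the hood):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoded = tokenizer("Transformers are great!")
print(encoded["input_ids"])                                   # token ids, with special tokens added
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))  # the corresponding tokens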

Hugging Face Datasets

Loading, processing, and saving datasets can be time-consuming, especially when the datasets become too large to fit in the RAM of your laptop. Furthermore, several scripts are frequently required to download the data and convert it into a common format.

Hugging Face Datasets makes this process easier by offering a consistent interface for hundreds of datasets available on the Hub. It also offers smart caching (so you don’t have to redo your preprocessing every time you run your code) and avoids RAM limitations by leveraging memory mapping, a special mechanism that stores the contents of a file in virtual memory and allows multiple processes to modify a file more efficiently. The library is also compatible with popular frameworks such as Pandas and NumPy, so you won’t have to leave your favorite data-wrangling tools behind.
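
A minimal sketch of that interface (the IMDb dataset name is an assumption, chosen because the domain-adaptation example above mentions it):

from datasets import load_dataset

imdb = load_dataset("imdb")                # downloaded once, then served from the cache
print(imdb)                                # DatasetDict with train/test/unsupervised splits
print(imdb["train"][0]["text"][:100])      # first 100 characters of the first review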

A decent dataset and a powerful model are useless if you can’t properly measure performance. Unfortunately, classic NLP metrics have many different implementations that can differ subtly and produce misleading results. Hugging Face Datasets makes experiments more reproducible and findings more trustworthy by providing scripts for numerous metrics. In some cases, fine-grained control over the training loop is required. This is where the ecosystem’s final library, Hugging Face Accelerate, comes into play.

Hugging Face Accelerate

If you’ve ever had to write your own training script in PyTorch, you’ve probably had some trouble porting the code that runs on your laptop to the code that runs on your organization’s cluster. Hugging Face Accelerate adds an abstraction layer to your usual training loops that handles all of the custom logic required for the training infrastructure. This speeds up your workflow by making it easy to change infrastructure when necessary.
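
A hedged sketch of the Accelerate pattern on a toy model (not the book's code): prepare() wraps the model, optimizer, and dataloader for whatever hardware the script is launched on, and accelerator.backward() replaces loss.backward():

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
loader = DataLoader(dataset, batch_size=8)

model, optimizer, loader = accelerator.prepare(model, optimizer, loader)
for features, labels in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(features), labels)
    accelerator.backward(loss)    # handles mixed precision / distributed backward
    optimizer.step()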

Main Challenges with Transformers

We have gotten a taste of the wide range of NLP tasks that transformer models can handle, but their powers are far from boundless. The main challenges are outlined below.

Language

The English language dominates NLP research. There are models for various other languages; however, finding pre-trained models for rare or low-resource languages is more difficult. We’ll look at multilingual transformers and their capacity for zero-shot cross-lingual transfer in later chapters.

Data Availability

Although we can use transfer learning to drastically reduce the amount of labeled training data our models require, it is still a significant amount compared to how much a human needs to perform the same task.

Working with long documents

Self-attention works incredibly well on paragraph-length texts, but it becomes prohibitively expensive when applied to longer texts, such as entire documents.

Opacity

Transformers, like other deep learning models, are largely opaque. It is difficult or impossible to determine “why” a model made a particular prediction. This is an especially hard problem when these models are used to make critical decisions.

Bias

Transformer models are typically pre-trained on text data obtained from the internet. This imprints all of the biases in the data into the models. It is difficult to ensure that these are not racist, sexist, or worse.

References

[1] Lewis Tunstall, Leandro von Werra, and Thomas Wolf, Natural Language Processing with Transformers, Revised Edition, O’Reilly Media, May 2022.
