Chapter 2 Data

All of the data is obtained from the tensorflow_datasets library. We begin with the ted_hrlr_translate resource and the Portuguese to English language pair.

import tensorflow_datasets as tfds

resource = 'ted_hrlr_translate'
pair = 'pt_to_en'
examples, metadata = tfds.load(f'{resource}/{pair}', with_info=True,
                               as_supervised=True)
                               
keys = metadata.supervised_keys
train_examples, eval_examples = examples['train'], examples['validation']

print(f'Keys: {metadata.supervised_keys}')
## Keys: ('pt', 'en')

The individual examples have the following format:

example1 = next(iter(train_examples))
print(example1)
## (<tf.Tensor: shape=(), dtype=string, numpy=b'e quando melhoramos a procura ,
## tiramos a \xc3\xbanica vantagem da impress\xc3\xa3o , que \xc3\xa9 a
## serendipidade .'>, <tf.Tensor: shape=(), dtype=string, numpy=b'and when you
## improve searchability , you actually take away the one advantage of print ,
## which is serendipity .'>)

2.1 Tokenizers

As usual in language modeling, sentences must be converted to sequences of integers before they can serve as input to a neural network, a process called tokenization. The input sentences are tokenized using the class SubwordTokenizer in the script tokenizer/subword_tokenizer.py. It is closely based on the CustomTokenizer class from the Subword tokenizer tutorial, which is in turn based on the BertTokenizer from tensorflow_text.

From the tutorial: “The main advantage of a subword tokenizer is that it interpolates between word-based and character-based tokenization. Common words get a slot in the vocabulary, but the tokenizer can fall back to word pieces and individual characters for unknown words.” SubwordTokenizer takes a sentence and first splits it into words using BERT’s token splitting algorithm and then applies a subword tokenizer using the WordPiece algorithm.
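As a concrete (if simplified) sketch, the core of such a tokenizer can be assembled directly from tensorflow_text's BertTokenizer. The vocabulary file path below is a placeholder, and the actual SubwordTokenizer adds further conveniences such as the [START]/[END] tokens, lookup and detokenize:

import tensorflow as tf
import tensorflow_text as text

# Hypothetical vocabulary file; in this project the vocabularies are built by
# prepare_tokenizers (see below).
VOCAB_PATH = 'train/vocab_en.txt'

# BertTokenizer splits text into words using BERT's splitting rules and then
# applies the WordPiece algorithm to each word.
bert_tokenizer = text.BertTokenizer(VOCAB_PATH, lower_case=True)

sentences = tf.constant(['searchability is serendipity .'])
tokens = bert_tokenizer.tokenize(sentences)  # ragged, shape (batch, words, wordpieces)
tokens = tokens.merge_dims(-2, -1)           # flatten to (batch, tokens)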

The script prepare_tokenizers.py provides the prepare_tokenizers function which builds a pair of SubwordTokenizers from the input examples and saves them to disk for later reuse, as they take some time to build. The parameters below indicate that all text is converted to lowercase and that the maximum vocabulary size of both the inputs and targets is \(2^{13} = 8192\).

from prepare_tokenizers import prepare_tokenizers

TRAIN_DIR = 'train'

tokenizers, _ = prepare_tokenizers(train_examples,
                                   lower_case=True,
                                   input_vocab_size=2 ** 13,
                                   target_vocab_size=2 ** 13,
                                   name=metadata.name + '-' + keys[0] + '_to_' + keys[1],
                                   tokenizer_dir=TRAIN_DIR,
                                   reuse=True)
                                
input_vocab_size = tokenizers.inputs.get_vocab_size()
target_vocab_size = tokenizers.targets.get_vocab_size()
print("Number of input tokens: {}".format(input_vocab_size))
## Number of input tokens: 8318
print("Number of target tokens: {}".format(target_vocab_size))
## Number of target tokens: 7010
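The vocabulary-building step inside prepare_tokenizers is not listed here. As a rough sketch, following the same tutorial and assuming the bert_vocab_from_dataset helper from tensorflow_text, with the reserved tokens ordered so that [START] and [END] receive indices 2 and 3 (as in the tokenized example below), a WordPiece vocabulary for the English side could be generated like this:

from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset as bert_vocab

# Assumed ordering of the reserved tokens ([START]=2, [END]=3).
reserved_tokens = ['[PAD]', '[UNK]', '[START]', '[END]']

train_en = train_examples.map(lambda pt, en: en)  # keep only the English sentences
en_vocab = bert_vocab.bert_vocab_from_dataset(
    train_en.batch(1000).prefetch(2),
    vocab_size=2 ** 13,
    reserved_tokens=reserved_tokens,
    bert_tokenizer_params=dict(lower_case=True))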

The tokenizer is demonstrated on the English sentence from example 1 above.

example1_en_string = example1[1].numpy().decode('utf-8')
tokenizer = tokenizers.targets
print(f'Sentence: {example1_en_string}')
## Sentence: and when you improve searchability , you actually take away the one
## advantage of print , which is serendipity .
tokens = tokenizer.tokenize([example1_en_string])
print(f'Tokenized sentence: {tokens}')
## Tokenized sentence: <tf.RaggedTensor [[2, 72, 117, 79, 1259, 1491, 2362, 13,
## 79, 150, 184, 311, 71, 103, 2308, 74, 2679, 13, 148, 80, 55, 4840, 1434, 2423,
## 540, 15, 3]]>
text_tokens = tokenizer.lookup(tokens)
print(f'Text tokens: {text_tokens}')
## Text tokens: <tf.RaggedTensor [[b'[START]', b'and', b'when', b'you',
## b'improve', b'search', b'##ability', b',', b'you', b'actually', b'take',
## b'away', b'the', b'one', b'advantage', b'of', b'print', b',', b'which', b'is',
## b's', b'##ere', b'##nd', b'##ip', b'##ity', b'.', b'[END]']]>
round_trip = tokenizer.detokenize(tokens)
print(f"Convert tokens back to original sentence: " \
      f"{round_trip.numpy()[0][0].decode('utf-8')}")
## Convert tokens back to original sentence: and when you improve searchability ,
## you actually take away the one advantage of print , which is serendipity .

The tokenize method converts a sentence (or any block of text) into a sequence of tokens (i.e. integers). The SubwordTokenizer methods are intended for lists of sentences, corresponding to the batched inputs fed to the neural network; in this example we use a batch of size one. The lookup method shows which subword each token represents. Note that the tokenizer has added special start and end tokens to the tokenized sequence, which tell the model where each input begins and ends. detokenize maps the tokens back to the original sentence.
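The start and end markers are simply reserved vocabulary entries prepended and appended to every tokenized sequence. A minimal sketch of this step, assuming [START] and [END] occupy indices 2 and 3 as in the example above:

import tensorflow as tf

START, END = 2, 3  # assumed indices of the reserved [START] and [END] tokens

def add_start_end(ragged):
    # ragged has shape (batch, None); prepend START and append END to every row.
    count = ragged.bounding_shape()[0]
    starts = tf.fill([count, 1], tf.constant(START, dtype=tf.int64))
    ends = tf.fill([count, 1], tf.constant(END, dtype=tf.int64))
    return tf.concat([starts, ragged, ends], axis=1)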

2.2 Data Pipeline

The tf.data.Dataset API is used for the input pipeline, producing data suitable for consumption by TensorFlow/Keras models. Since our data comes from tensorflow_datasets, it is already a tf.data.Dataset object to which we can apply the necessary transformations and then iterate over in batches.

Our input pipeline tokenizes the sentences from both languages into sequences of integers, discards any example where either the source or the target has more than MAX_LEN tokens, and collects the remaining examples into batches of size BATCH_SIZE. The reason for limiting the length of the input sequences is that both the transformer's run time and its memory usage are quadratic in the input length, which is evident from the attention mechanism shown in equation (3.1) below.
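As a rough illustration of that quadratic cost (the head count and depth below are arbitrary), the matrix of attention scores alone already has one entry per pair of positions:

import tensorflow as tf

batch, heads, seq_len, depth = 1, 8, 40, 64  # illustrative sizes only
q = tf.random.normal((batch, heads, seq_len, depth))
k = tf.random.normal((batch, heads, seq_len, depth))
scores = tf.matmul(q, k, transpose_b=True)  # shape (batch, heads, seq_len, seq_len)
# Doubling seq_len quadruples the number of score entries and the matmul cost.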

The result is a tf.data dataset that returns a tuple of (inputs, targets) for each batch. As is typical for encoder–decoder auto-regressive sequence-to-sequence architectures, inputs is a dictionary with keys encoder_input and decoder_input, where encoder_input is the tokenized source sentence and decoder_input is the tokenized target sentence with its last token dropped, while targets is the tokenized target sentence shifted one position to the left (its first token dropped), so that at each position the model is trained to predict the next token.
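For example, a tokenized target of length six would be split as follows (the token values are hypothetical):

target = [2, 72, 117, 79, 15, 3]  # [START] ... [END]
decoder_input = target[:-1]       # [2, 72, 117, 79, 15]  -- last token dropped
targets = target[1:]              # [72, 117, 79, 15, 3]  -- shifted left by one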

The input pipeline encapsulated in our Dataset class follows the TensorFlow Data Pipeline Performance Guide:

transformer/dataset.py

import tensorflow as tf

BUFFER_SIZE = 20000


class Dataset:
    """
    Provides a data pipeline suitable for use with transformers
    """
    def __init__(self, tokenizers, batch_size, input_seqlen, target_seqlen):
        self.tokenizers = tokenizers
        self.batch_size = batch_size
        self.input_seqlen = input_seqlen
        self.target_seqlen = target_seqlen

    def data_pipeline(self, examples, num_parallel_calls=None):
        return (
            examples
                .cache()  # cache the raw sentence pairs after the first pass
                .map(tokenize_pairs(self.tokenizers),
                     num_parallel_calls=num_parallel_calls)
                .filter(filter_max_length(max_x_length=self.input_seqlen,
                                          max_y_length=self.target_seqlen))
                .shuffle(BUFFER_SIZE)  # shuffle within a buffer of examples
                .padded_batch(self.batch_size)  # pad each batch to its longest sequence
                .prefetch(tf.data.AUTOTUNE)  # overlap preprocessing with model execution
        )


def filter_max_length(max_x_length, max_y_length):
    # y is the shifted target, one token shorter than the full target sequence,
    # so the strict inequality keeps full targets within max_y_length tokens.
    def filter(x, y):
        return tf.logical_and(tf.size(x['encoder_input']) <= max_x_length,
                              tf.size(y) < max_y_length)

    return filter


def tokenize_pairs(tokenizers):
    def tokenize(x, y):
        # Tokenize one (source, target) sentence pair.
        inputs = tokenizers.inputs.tokenize([x])[0]
        targets = tokenizers.targets.tokenize([y])[0]

        # Teacher forcing: the decoder input drops the last token and the
        # targets drop the first, so the model predicts the next token.
        decoder_inputs = targets[:-1]
        decoder_targets = targets[1:]
        return dict(encoder_input=inputs, decoder_input=decoder_inputs), decoder_targets

    return tokenize

We extract the first batch from the data pipeline:

import tensorflow as tf
from transformer.dataset import Dataset

BATCH_SIZE = 64
MAX_LEN = 40

dataset = Dataset(tokenizers, batch_size=BATCH_SIZE, 
                  input_seqlen=MAX_LEN, target_seqlen=MAX_LEN)
data_train = dataset.data_pipeline(train_examples, 
                                   num_parallel_calls=tf.data.experimental.AUTOTUNE)    
data_eval = dataset.data_pipeline(eval_examples, 
                                  num_parallel_calls=tf.data.experimental.AUTOTUNE)                           
batch1 = next(iter(data_train))
print(batch1)
## ({'encoder_input': <tf.Tensor: shape=(64, 40), dtype=int64, numpy=
## array([[   2,  695,   14, ...,    0,    0,    0],
##        [   2,   88,   44, ...,    0,    0,    0],
##        [   2, 3248,   86, ...,    0,    0,    0],
##        ...,
##        [   2,   40,  225, ...,    0,    0,    0],
##        [   2, 3701,   14, ...,    0,    0,    0],
##        [   2,  100,  379, ...,    0,    0,    0]])>, 'decoder_input': <tf.Tensor: shape=(64, 37), dtype=int64, numpy=
## array([[   2,   36,   36, ...,    0,    0,    0],
##        [   2,   76,  196, ...,    0,    0,    0],
##        [   2,   96,  127, ...,    0,    0,    0],
##        ...,
##        [   2,   51,  795, ...,    0,    0,    0],
##        [   2, 1106, 2294, ...,    0,    0,    0],
##        [   2, 1507,  101, ...,    0,    0,    0]])>}, <tf.Tensor: shape=(64, 37), dtype=int64, numpy=
## array([[  36,   36,   77, ...,    0,    0,    0],
##        [  76,  196,   50, ...,    0,    0,    0],
##        [  96,  127,   97, ...,    0,    0,    0],
##        ...,
##        [  51,  795, 1173, ...,    0,    0,    0],
##        [1106, 2294,   74, ...,    0,    0,    0],
##        [1507,  101,   71, ...,    0,    0,    0]])>)
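Note that padded_batch pads each component only to the length of the longest sequence within the batch, which is why encoder_input in this batch has 40 columns while decoder_input and the targets have 37.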