Token

A token is what makes up a sequence. You can tokenize at the word level, in which case “Hello world.” becomes [“Hello”, “world”, “.”]. Or, if you decide that words are too big, you can tokenize at the character level, in which case the sequence becomes [‘H’, ‘e’, ‘l’, ‘l’, ‘o’, ‘ ’, ‘w’, ‘o’, ‘r’, ‘l’, ‘d’, ‘.’]. Notice that the space character is significant in the second case but is ignored in the first.
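
As a rough illustration in plain Python (a toy sketch; real systems typically use trained, often subword-level, tokenizers):

```python
sentence = "Hello world."

# Word-level tokenization: a very rough sketch that splits on whitespace
# and peels trailing punctuation off each word.
word_tokens = []
for word in sentence.split():
    if word and word[-1] in ".,!?":
        word_tokens.extend([word[:-1], word[-1]])
    else:
        word_tokens.append(word)
print(word_tokens)   # ['Hello', 'world', '.']  -- the space is gone

# Character-level tokenization: every character, including the space, is a token.
char_tokens = list(sentence)
print(char_tokens)   # ['H', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '.']
```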

In neural machine translation (NMT), the transformer encoder takes the sequence in language A, and the decoder outputs a probability distribution over the tokens of language B at each time step (in the case of autoregressive NMT).
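
To make “probability distribution over the tokens of language B” concrete, here is a minimal sketch with NumPy. The tiny German-like target vocabulary and the logit values are purely hypothetical; in a real model the scores come from the decoder’s final linear layer.

```python
import numpy as np

# Hypothetical language-B vocabulary.
target_vocab = ["<bos>", "<eos>", "Hallo", "Welt", "."]

# Suppose the decoder produces these raw scores (logits) at one time step.
logits = np.array([0.1, 0.2, 3.5, 0.3, 0.4])

# Softmax turns the scores into a probability distribution over the vocabulary.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

for token, p in zip(target_vocab, probs):
    print(f"{token:>6}: {p:.3f}")

# The highest-probability token ("Hallo" here) would be emitted at this step.
```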

In autoregressive NMT, the decoder input consists of the tokens the decoder has previously generated. Say that at time step T=0 the decoder outputs a token ‘P’. At the next time step, T=1, the decoder generates another token ‘Q’ conditioned on the previously generated token ‘P’. How does the decoder condition on ‘P’? It simply takes ‘P’ as the decoder input. At T=2, the decoder input will be ‘PQ’.
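
Here is a minimal sketch of that feedback loop as greedy decoding. The `encode` and `decode_step` callables are hypothetical stand-ins for the encoder and for one decoder step; real implementations also batch, cache, and often use beam search instead of greedy choices.

```python
def greedy_decode(encode, decode_step, source_tokens,
                  bos="<bos>", eos="<eos>", max_len=50):
    """Sketch of autoregressive decoding (hypothetical helpers, not a real API).

    encode(source_tokens)        -> encoder representation of the language-A sequence
    decode_step(memory, prefix)  -> most likely next language-B token given the
                                    tokens generated so far
    """
    memory = encode(source_tokens)
    generated = [bos]                    # decoding starts from a start-of-sequence token
    for _ in range(max_len):
        next_token = decode_step(memory, generated)  # condition on previous outputs
        generated.append(next_token)                 # feed it back in at the next step
        if next_token == eos:
            break
    return generated[1:]                 # drop the start token
```

In the example above, the prefix handed to `decode_step` would be [‘<bos>’] at T=0, [‘<bos>’, ‘P’] at T=1, and [‘<bos>’, ‘P’, ‘Q’] at T=2.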

The input sequence contains tokens; however, the transformer model can only take vectors (or tensors) as its input, so we need to convert each token in the input sequence into a corresponding vector. These vectors are called embedding vectors. If a vector corresponds to an input token (of the encoder, i.e., a token of language A), we call it an input embedding (vector). If it corresponds to an output token (of the decoder, i.e., a token of language B), we call it an output embedding (vector).
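
A minimal sketch of this token-to-vector lookup, using a toy vocabulary and a random matrix in place of learned weights (frameworks provide this as a layer, e.g. torch.nn.Embedding):

```python
import numpy as np

# Hypothetical toy vocabulary for language A (the encoder side).
vocab = {"<pad>": 0, "Hello": 1, "world": 2, ".": 3}

d_model = 8  # embedding dimension (512 in the original Transformer)

# The embedding table is a learned matrix with one row per vocabulary entry;
# here it is just random numbers standing in for trained weights.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

tokens = ["Hello", "world", "."]
ids = [vocab[t] for t in tokens]          # tokens -> integer ids
input_embeddings = embedding_table[ids]   # ids -> vectors, shape (3, d_model)
print(input_embeddings.shape)             # (3, 8)
```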