Transformer vs RNN#

Why are RNNs good?#

For many years, RNNs were the undisputed champions of sequence processing. Sequences include text, voice data, and all time-related data. The reason RNNs work so well is that they read through everything that came before, decide what's important to keep, and then make the next prediction. An RNN works a bit like a human mind does, looking back over past events to predict the future.
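
To make that concrete, here is a minimal sketch (assuming PyTorch; the toy GRU cell, sizes, and names are illustrative, not a real language model) of how an RNN folds each new token into a hidden state that summarizes the past and then predicts the next token:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 64, 128

embed = nn.Embedding(vocab_size, embed_dim)
rnn_cell = nn.GRUCell(embed_dim, hidden_dim)
predict_next = nn.Linear(hidden_dim, vocab_size)

tokens = torch.randint(0, vocab_size, (10,))   # a toy sequence of 10 token ids
h = torch.zeros(1, hidden_dim)                 # hidden state: the "memory" of the past

for t in tokens:
    x = embed(t).unsqueeze(0)      # current token
    h = rnn_cell(x, h)             # fold it into the running summary of everything seen so far
    logits = predict_next(h)       # predict the next token based on that summary
```

The key point is that single vector `h`: it is the only channel through which the past can influence the next prediction.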

Why are RNNs not good enough?#

All kinds of RNNs suffer from exploding or vanishing gradients. That makes it very difficult to train large-scale RNNs, to process long sequences, or simply to keep improving results, because bigger RNNs are not necessarily better.
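
A rough numerical illustration (not real backpropagation; it assumes the recurrent Jacobian behaves like a fixed matrix at every step): the gradient flowing back through T time steps contains a product of T such Jacobians, so a norm slightly below 1 vanishes and slightly above 1 explodes.

```python
import numpy as np

T = 100
for scale in (0.9, 1.1):
    W = scale * np.eye(8)               # stand-in for the recurrent Jacobian at each step
    grad = np.ones(8)
    for _ in range(T):
        grad = W.T @ grad               # chain rule applied T times through the recurrence
    print(scale, np.linalg.norm(grad))  # ~0.9**100 shrinks to ~1e-5, ~1.1**100 blows up to ~1e4
```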

Also, because RNNs have to process the tokens of a sequence one by one, training is hard to parallelize and therefore hard to speed up.
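
A hedged sketch of the two access patterns (simplified, untrained weights, assuming PyTorch): the RNN loop below has T dependent steps, while a self-attention layer computes all pairwise interactions in one batched matrix product that maps directly onto parallel hardware.

```python
import torch

T, d = 512, 64
x = torch.randn(T, d)              # one sequence of T token embeddings

# RNN-style: T dependent steps, each waiting on the previous hidden state.
h = torch.zeros(d)
W = torch.randn(d, d) * 0.01
for t in range(T):
    h = torch.tanh(x[t] + h @ W)   # step t cannot start before step t-1 finishes

# Attention-style: all T x T interactions in a single matrix product.
scores = torch.softmax(x @ x.T / d**0.5, dim=-1)
context = scores @ x
```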

Why are transformers all the rage?#

Transformers are not RNNs. That means they don't suffer from the weaknesses of RNNs described above, like slow training or an inability to scale up. However, that's not the reason transformers are getting all the attention (pun intended) right now.

The reason transformers became so popular starts with BERT, a massive pretrained transformer-based model that you can easily reuse for other tasks. Being pretrained means you don't need to train it yourself: you can simply use the model as a preprocessor, a feature extractor, and train a much smaller model on top of it for your task. BERT was the first wildly successful language processing model to make that workflow mainstream.
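
Here is a hedged sketch of that feature-extractor workflow, assuming the Hugging Face transformers library and scikit-learn are installed (the checkpoint name and toy data are only illustrative):

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
bert.eval()

texts = ["great movie", "terrible movie"]   # toy labeled data
labels = [1, 0]

# Use BERT only as a frozen feature extractor: one vector per sentence.
with torch.no_grad():
    inputs = tokenizer(texts, padding=True, return_tensors="pt")
    features = bert(**inputs).last_hidden_state[:, 0]   # [CLS] token embeddings

# Train a much smaller model (here a plain logistic regression) on those features.
clf = LogisticRegression().fit(features.numpy(), labels)
```

The heavy lifting (reading and representing language) is done once by the pretrained model; the task-specific part you actually train is tiny.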