Model Compression#

The need for model compression#

Models are getting larger and larger every day, and state-of-the-art models grow especially fast. Model compression is the science of reducing the size of a model, and it helps combat the stress this trend puts on your device: a smaller model can be transferred over the Internet more quickly, can fit in memory so it runs faster, or can simply save a lot of disk space.

Of course, model compression does come with downsides. After compression, models usually lose some accuracy. In many cases, though, it’s a trade-off people are willing to make.

Ways of doing model compression#

There are many ways of doing model compression:

Unstructured pruning#

Because deep learning models are built on linear algebra, zero-valued weights in a layer contribute nothing to the output; they only waste memory. Unstructured pruning sets individual, unimportant weights to zero, making the layers sparser so that only the values that matter need to be stored.
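As a minimal sketch, here is magnitude-based unstructured pruning using PyTorch’s built-in pruning utilities; the layer size and the 50% sparsity level are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(512, 256)

# Zero out the 50% of weights with the smallest absolute value.
prune.l1_unstructured(layer, name="weight", amount=0.5)

# The pruning mask is applied on the fly; make the zeros permanent
# and drop the mask.
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"fraction of zero weights: {sparsity:.2f}")  # ~0.50
```

Note that this only zeroes values; to actually save space you still need to store the weights in a sparse format.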

Structured pruning#

As great as unstructured pruning is, the sparse matrices it produces are slow to work with because they are hard to run efficiently on a GPU. Structured pruning takes the opposite approach: it removes whole filters, channels, or matrices, so the end result is still a network made of dense matrices.
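A sketch of the same idea at the filter level, again using PyTorch’s pruning utilities; the convolution sizes and the 25% pruning ratio are just examples.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3)

# Prune 25% of the output filters (dim=0) by their L2 norm, so whole
# filters are zeroed and can later be removed, leaving a smaller dense layer.
prune.ln_structured(conv, name="weight", amount=0.25, n=2, dim=0)
prune.remove(conv, "weight")

# Count how many filters are now entirely zero.
zero_filters = (conv.weight.abs().sum(dim=(1, 2, 3)) == 0).sum().item()
print(f"zeroed filters: {zero_filters} / {conv.weight.shape[0]}")  # 32 / 128
```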

Quantization#

Quantization means storing the weights of the model in a less precise format to save space. For example, if your model’s weights are 64-bit floating point numbers, converting them to 32-bit floating point numbers slashes the storage in half. It’s as simple as that. There are now also 16-bit floating point models, which make storing models efficiently even easier.
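A quick sketch of the floating-point case in PyTorch; the layer sizes are arbitrary, and the point is simply that halving the bit width halves the parameter storage.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))

fp32_bytes = sum(p.numel() * p.element_size() for p in model.parameters())

# Cast every parameter from 32-bit to 16-bit floating point.
model.half()

fp16_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(fp32_bytes, fp16_bytes)  # the fp16 version is half the size
```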

Some people take quantization further and use integers to store the model’s values. It’s feasible, but it can cost a noticeable amount of accuracy.
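For the integer case, PyTorch’s dynamic quantization is one readily available option; this sketch stores the Linear weights as 8-bit integers (the model here is a toy one for illustration).

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))

# Dynamic quantization stores Linear weights as 8-bit integers and
# quantizes/dequantizes activations on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 1024)
print(quantized(x).shape)  # torch.Size([1, 10])
```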

Summary#

These three techniques (or two, if you count the pruning methods as one) are the main ways people reduce the size of a model without training a new one.

If training a new model is an option, also see knowledge distillation.