Lifelong Learning#

Note

Lifelong Learning is often abbreviated as LLL.

Why LLL?#

Humans can learn and remember things. When we learn something new, we don’t really forget how to do what we already knew: a person can probably still swim after years of staying out of a swimming pool, or ride a bicycle despite not having ridden one in decades. Machine learning models, however, don’t seem to be able to do that. When a model is updated for a new task, its ability to perform the previous tasks is simply destroyed, a phenomenon known as catastrophic forgetting. That’s the reason LLL exists: to help a machine retain its memory over its lifetime.

How does LLL work?#

There are multiple ways to perform LLL. Most LLL methods use one or more of the following principles.

Don’t change the model too much.#

If a model isn’t changed too much, you normally wouldn’t expect its output to change by a huge amount. This method is called weight consolidation. Using it, you either hard-constrain the maximum distance between the old weights and the new weights, or apply a penalty that grows with that distance.
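The penalty variant can be sketched in a few lines. This is a minimal toy example, not a full method: the per-parameter `importance` is assumed uniform here, whereas methods like Elastic Weight Consolidation estimate it from data, and `task_loss` is a made-up one-parameter loss for illustration.

```python
import numpy as np

def consolidated_loss(w_new, w_old, task_loss, importance, lam=1.0):
    """New-task loss plus a penalty for drifting away from the old weights.

    `importance` weights each parameter's penalty term (uniform in this
    sketch; EWC-style methods would estimate it from old-task data).
    """
    penalty = np.sum(importance * (w_new - w_old) ** 2)
    return task_loss(w_new) + lam * penalty

# Toy setting: the new task alone prefers w = 2, the old weights were w = 0.
w_old = np.array([0.0])
importance = np.array([1.0])
task_loss = lambda w: float((w[0] - 2.0) ** 2)

# With a strong penalty, the best compromise stays closer to the old weights.
candidates = np.linspace(-1, 3, 401)
best = min(
    candidates,
    key=lambda c: consolidated_loss(np.array([c]), w_old, task_loss,
                                    importance, lam=3.0),
)
```

With `lam=3.0`, minimizing `(w - 2)^2 + 3 * w^2` gives `w = 0.5`: the new task pulls toward 2, the consolidation penalty pulls back toward the old weight 0, and the penalty strength decides the compromise.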

Make the model bigger.#

By making the model bigger, you can leave the old weights untouched: the new weights are constrained so that they don’t affect the old tasks, and only the new weights are updated when learning new tasks.
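A minimal sketch of this idea, under assumed toy shapes: a frozen linear map stands in for the previously trained model, and a fresh set of output weights is added for the new task. Only the new weights receive gradient updates, so the old-task outputs are provably unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Old task: a linear map trained earlier (weights assumed given); frozen.
W_old = rng.normal(size=(4, 3))   # never updated again

# New task: extra output units with their own trainable weights.
W_new = np.zeros((4, 2))          # only these get gradient updates

def forward(x):
    # Old-task outputs use only the frozen weights, so they are identical
    # before and after any training on the new task.
    return x @ W_old, x @ W_new

x = rng.normal(size=(5, 4))
old_out_before, _ = forward(x)

# One toy gradient step on the new head; W_old is untouched.
target = np.ones((5, 2))
grad = x.T @ (x @ W_new - target) / len(x)
W_new -= 0.1 * grad

old_out_after, _ = forward(x)
unchanged = np.allclose(old_out_before, old_out_after)
```

The design choice here is structural rather than numerical: forgetting is impossible by construction, at the cost of a model that grows with every new task.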

Mix-in data in old tasks when learning new tasks.#

When you’re learning something new, reviewing what you already know sometimes helps clarify the difference between the old and the new. This approach is no different: when learning a new task, it also re-learns the old tasks, so the model’s ability to perform them isn’t affected.
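In practice this usually means keeping a buffer of stored old-task examples and mixing some into every new-task training batch. A small sketch, with hypothetical labeled tuples standing in for real training examples and an assumed 25% replay fraction:

```python
import random

# Hypothetical replay buffer of stored old-task examples, plus new-task data.
old_task_data = [("old", i) for i in range(100)]
new_task_data = [("new", i) for i in range(100)]

def make_batch(batch_size=8, replay_fraction=0.25, seed=0):
    """Mix a fixed fraction of old-task examples into a new-task batch."""
    rng = random.Random(seed)
    n_old = int(batch_size * replay_fraction)
    batch = rng.sample(old_task_data, n_old)
    batch += rng.sample(new_task_data, batch_size - n_old)
    rng.shuffle(batch)   # interleave old and new examples
    return batch

batch = make_batch()
n_old_in_batch = sum(1 for task, _ in batch if task == "old")
```

Every gradient step then sees both distributions, so the model keeps being rewarded for staying good at the old tasks while it learns the new one.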

Don’t learn new things that conflict with what the model already knows.#

When learning a new task, the model is updated in a particular direction, and this direction can be thought of as the knowledge the model gains. This method retains the update directions from previously learned tasks and requires that learning any new task never reduces the knowledge of the old ones. In mathematical terms, the gradients for learning new tasks and the gradients for learning old tasks should never have a negative dot product.
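The dot-product condition leads directly to a projection rule, in the spirit of methods like GEM and orthogonal gradient descent: if the new-task gradient conflicts with a stored old-task gradient (negative dot product), subtract the conflicting component before taking the step. A minimal sketch with made-up 2-D gradients:

```python
import numpy as np

def project_gradient(g_new, g_old):
    """Remove the component of g_new that conflicts with g_old.

    If the dot product is negative, project g_new onto the half-space
    where it no longer decreases performance on the old task.
    """
    dot = g_new @ g_old
    if dot < 0:
        g_new = g_new - (dot / (g_old @ g_old)) * g_old
    return g_new

g_old = np.array([1.0, 0.0])           # stored old-task gradient
g_conflict = np.array([-1.0, 1.0])     # new-task gradient; dot product is -1
g_safe = project_gradient(g_conflict, g_old)
```

After projection, `g_safe` is orthogonal to `g_old` rather than opposed to it, so the update direction satisfies the non-negative dot-product constraint while keeping as much of the new-task gradient as possible.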