Continual Learning (2)

School Classes/ADL 2023. 9. 21. 22:03

The previous post explained the concept of continual learning and its three families of approaches. From here on, I introduce representative methods from each approach.

reference : Kirkpatrick et al., Overcoming catastrophic forgetting in neural networks, PNAS 2017

EWC (Elastic Weight Consolidation)

- Find which nodes or weights are important.

- Apply strong regularization to the important nodes or weights.

- The focus is on stability.

Let $θ_{A}$ be the parameters of a model trained on task A. Suppose the region of parameter space with low error for task A is shown in gray, and the region with low error for the new task B is shown in yellow. If the model is trained on task B with no penalty, the parameters move into the yellow region, but the model then has high error on the earlier task A. What about simply adding an L2 regularizer? In the figure the L2 solution ends up outside the overlap of the two regions. I was not sure why at first, and guessed that the method is simply too naive. The explanation given by Kirkpatrick et al. is that an L2 penalty constrains every weight with the same coefficient, regardless of how important it is for task A, which is too restrictive for task B to be learned. EWC instead moves the parameters from $θ_{A}$ into the region where the two low-error regions overlap, so that both tasks are performed well.

Bayesian framework

1. Single task case with a neural network model $θ$

- Let $D = \{(x_{i}, y_{i})\}_{i = 1}^{n}$ be the training data

- Maximum Likelihood (ML) estimate of $θ$

$θ_{ML}^{*} = \arg \min_{θ} \{- \log p (D | θ)\} = \arg \min_{θ} \{- \sum_{i = 1}^{n} \log p (y_{i} | x_{i}, θ)\}$

- Maximum a Posteriori (MAP) estimate of $θ$

$θ_{MAP}^{*} = \arg \min_{θ} \{- \log p (D | θ) - \log p (θ)\} = \arg \min_{θ} \{- \sum_{i = 1}^{n} \log p (y_{i} | x_{i}, θ) - \log p (θ)\}$

Here $p (θ)$ in the MAP objective is the prior over $θ$, taken to be a standard Gaussian.
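With a standard Gaussian prior, $- \log p (θ)$ equals $\frac{1}{2} \|θ\|^{2}$ plus a constant, so the MAP objective is the ML objective with an L2 penalty. The sketch below checks this numerically. The logistic-regression model, the random data, and all function names are assumptions made for illustration; they are not part of the lecture.

```python
import numpy as np

def neg_log_likelihood(theta, X, y):
    """-sum_i log p(y_i | x_i, theta) for a Bernoulli logistic-regression model."""
    logits = X @ theta
    # log(1 + exp(z)) - y * z is the per-sample Bernoulli negative log-likelihood
    return np.sum(np.logaddexp(0.0, logits) - y * logits)

def neg_log_std_gaussian(theta):
    """-log N(theta; 0, I), including the normalising constant."""
    d = theta.size
    return 0.5 * theta @ theta + 0.5 * d * np.log(2.0 * np.pi)

def map_objective(theta, X, y):
    return neg_log_likelihood(theta, X, y) + neg_log_std_gaussian(theta)

def l2_objective(theta, X, y):
    return neg_log_likelihood(theta, X, y) + 0.5 * theta @ theta

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                       # toy inputs (assumed)
y = rng.integers(0, 2, size=20).astype(float)      # toy binary labels (assumed)
theta_a, theta_b = rng.normal(size=3), rng.normal(size=3)

# The two objectives differ by the same constant at any theta,
# so they have the same minimiser.
gap_a = map_objective(theta_a, X, y) - l2_objective(theta_a, X, y)
gap_b = map_objective(theta_b, X, y) - l2_objective(theta_b, X, y)
```

Both gaps equal $\frac{d}{2} \log 2π$ with $d = 3$, independent of $θ$.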

2. Two tasks arriving sequentially

- Let $D_{1}$, $D_{2}$ be the training data for each task

- A natural MAP estimate of $θ$ after learning Task 2

$θ_{MAP, 1 : 2}^{*} = \arg \min_{θ} \{- \log p (D_{2} | θ) - \log p (θ | D_{1})\}$

The posterior given $D_{1}$ is used as the prior when learning from the entire data, so it acts as a regularizer. Because it is a posterior, however, it is hard to compute exactly, so it will be approximated.

Why does the posterior suddenly become a prior, and why does it act as a regularizer?

The posterior given $D_{1}$ contains the information about which parameters are important for task 1, so it is used as the prior. The term $\log p (D_{2} | θ)$ is the negative loss $- L (θ)$ on task 2, and $p (θ | D_{1})$ multiplies the likelihood $p (D_{2} | θ)$ (equivalently, $\log p (θ | D_{1})$ is added to the log-likelihood), so it plays the role of a regularizer. I originally wanted to ask whether using the task 1 posterior as the prior for the entire data is a heuristic or a derived result. It follows from Bayes' rule, $p (θ | D_{1}, D_{2}) \propto p (D_{2} | θ) p (θ | D_{1})$, under the assumption that $D_{1}$ and $D_{2}$ are conditionally independent given $θ$.
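The "posterior becomes prior" step can be checked exactly in a toy conjugate model. The sketch below estimates an unknown Gaussian mean with known noise variance, a model chosen only because its posterior has a closed form; the function name and the data are hypothetical. Updating on $D_{1}$ and then using that posterior as the prior for $D_{2}$ gives the same result as a single update on both datasets.

```python
import numpy as np

def gaussian_mean_posterior(prior_mean, prior_var, data, noise_var=1.0):
    """Posterior N(mean, var) over an unknown mean with a Gaussian prior and Gaussian noise."""
    n = len(data)
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mean = post_var * (prior_mean / prior_var + np.sum(data) / noise_var)
    return post_mean, post_var

rng = np.random.default_rng(0)
D1 = rng.normal(loc=2.0, size=30)   # toy "task 1" data
D2 = rng.normal(loc=2.0, size=50)   # toy "task 2" data

# Sequential: the posterior after D1 is the prior for D2.
m1, v1 = gaussian_mean_posterior(0.0, 1.0, D1)
m_seq, v_seq = gaussian_mean_posterior(m1, v1, D2)

# Batch: one update on all the data with the original prior.
m_all, v_all = gaussian_mean_posterior(0.0, 1.0, np.concatenate([D1, D2]))
```

Here the sequential and batch posteriors agree exactly. For a neural network no closed form exists, which is why the posterior given $D_{1}$ has to be approximated.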

3. Laplace Approximation to approximate posterior $\log p (θ | D_{1})$

- Let the MAP solution be $θ_{1}^{*} = \arg \min_{θ} \{- \log p (θ | D_{1})\}$

- Approximate $p (θ | D_{1})$ as a Gaussian with mean $θ_{1}^{*}$ and covariance $Σ$

- Approximate the precision $Σ^{- 1}$ with Fisher Information Matrix (FIM)

$\log p (θ | D_{1}) \approx \log p (θ_{1}^{*} | D_{1}) + \frac{1}{2} {(θ - θ_{1}^{*})}^{⊤} (\nabla_{θ}^{2} \log p (θ | D_{1}) |_{θ_{1}^{*}}) (θ - θ_{1}^{*})$

The first-order term of the Taylor expansion is absent because the gradient vanishes at the mode $θ_{1}^{*}$. Exponentiating both sides gives a Gaussian whose covariance is the inverse Hessian of the negative log-posterior:

$p (θ | D_{1}) \approx N (θ_{1}^{*}, {(\nabla_{θ}^{2} (- \log p (θ | D_{1})) |_{θ_{1}^{*}})}^{- 1})$

$\nabla_{θ}^{2} (- \log p (θ | D_{1})) = \nabla_{θ}^{2} (- \log p (D_{1} | θ) - \log p (θ)) = \frac{1}{n} \sum_{i = 1}^{n} \nabla_{θ}^{2} (- \log p (y_{i} | x_{i}, θ)) + I$

The $I$ comes from the standard Gaussian prior. Strictly, $- \log p (D_{1} | θ)$ is a sum over the $n$ samples, so the exact Hessian has no $\frac{1}{n}$ factor; the $\frac{1}{n}$ rescales the likelihood term to a per-sample average, which is the convention used in the remaining formulas of this post.

Then,

$\nabla_{θ}^{2} (- \log p (y | x, θ)) = - \frac{\nabla_{θ}^{2} p (y | x, θ)}{p (y | x, θ)} + \frac{\nabla_{θ} p (y | x, θ) {\nabla_{θ} p (y | x, θ)}^{⊤}}{{p (y | x, θ)}^{2}}$

$= - \frac{\nabla_{θ}^{2} p (y | x, θ)}{p (y | x, θ)} + \nabla_{θ} \log p (y | x, θ) {\nabla_{θ} \log p (y | x, θ)}^{⊤}$

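Taking the expectation over $y \sim p (y | x, θ)$ leads to the identity below. The first term vanishes because $E_{y} [\nabla_{θ}^{2} p (y | x, θ) / p (y | x, θ)] = \nabla_{θ}^{2} \sum_{y} p (y | x, θ) = 0$.

$E_{y \sim p (y | x, θ)} [\nabla_{θ}^{2} (- \log p (y | x, θ))] = E_{y \sim p (y | x, θ)} [\nabla_{θ} \log p (y | x, θ) {\nabla_{θ} \log p (y | x, θ)}^{⊤}] = F_{θ}$

Here $F_{θ}$ is the Fisher Information Matrix (FIM). $\nabla_{θ} \log p (y | x, θ)$ is the gradient of the neural network's log-likelihood, a vector with one entry per parameter, so its outer product is a matrix of size (parameter dimension) x (parameter dimension), which again makes the computation expensive. The FIM is therefore approximated by its diagonal to obtain the final covariance matrix.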
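The identity between the expected Hessian and the expected outer product of gradients can be verified on a small model. The sketch below uses Bernoulli logistic regression, an assumption made for illustration because its gradient and Hessian have closed forms, and computes the expectation over $y$ exactly by enumerating $y \in \{0, 1\}$. All names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def score(theta, x, y):
    """Gradient of log p(y | x, theta) for Bernoulli logistic regression."""
    return (y - sigmoid(x @ theta)) * x

def hessian_nll(theta, x):
    """Hessian of -log p(y | x, theta); for this model it does not depend on y."""
    p = sigmoid(x @ theta)
    return p * (1.0 - p) * np.outer(x, x)

def fisher(theta, x):
    """E_{y ~ p(y | x, theta)}[score score^T], computed exactly over y in {0, 1}."""
    p1 = sigmoid(x @ theta)
    F = np.zeros((theta.size, theta.size))
    for y, prob in ((0.0, 1.0 - p1), (1.0, p1)):
        s = score(theta, x, y)
        F += prob * np.outer(s, s)
    return F

rng = np.random.default_rng(0)
theta = rng.normal(size=4)
x = rng.normal(size=4)

F = fisher(theta, x)          # d x d matrix, here 4 x 4
H = hessian_nll(theta, x)     # expected Hessian of the negative log-likelihood
F_diag = np.diag(F)           # the d numbers kept by the diagonal approximation
```

`F` and `H` agree up to floating-point error. The full matrix has $d^{2}$ entries while `F_diag` has only $d$, which is the saving the diagonal approximation buys.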

$\nabla_{θ}^{2} (- \log p (θ | D_{1})) |_{θ_{1}^{*}} \approx \frac{1}{n} \sum_{i = 1}^{n} \nabla_{θ} (\log p (y_{i} | x_{i}, θ)) {\nabla_{θ} (\log p (y_{i} | x_{i}, θ))}^{⊤} + I$

Hence, the final approximation becomes

$p (θ | D_{1}) \approx N (θ_{1}^{*}, {(diag (F_{θ}) |_{θ_{1}^{*}} + I)}^{- 1})$
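A minimal sketch of the diagonal approximation, again assuming a Bernoulli logistic-regression model and random toy data. `theta_1_star` is a stand-in value rather than a trained MAP solution, and the function names are hypothetical. The diagonal of $\frac{1}{n} \sum_{i} \nabla_{θ} \log p \, {\nabla_{θ} \log p}^{⊤}$ is the mean of the squared per-sample gradients, so the $d \times d$ matrix never has to be formed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def per_sample_scores(theta, X, y):
    """Rows are grad_theta log p(y_i | x_i, theta) for Bernoulli logistic regression."""
    return (y - sigmoid(X @ theta))[:, None] * X

def diag_fisher(theta, X, y):
    """Diagonal of (1/n) * sum_i score_i score_i^T, computed as mean squared gradients."""
    return np.mean(per_sample_scores(theta, X, y) ** 2, axis=0)

rng = np.random.default_rng(0)
X1 = rng.normal(size=(100, 5))                     # toy task 1 inputs (assumed)
y1 = rng.integers(0, 2, size=100).astype(float)    # toy task 1 labels (assumed)
theta_1_star = rng.normal(size=5)                  # stand-in for the task 1 MAP solution

fisher_diag = diag_fisher(theta_1_star, X1, y1)
precision = fisher_diag + 1.0     # diag(F) + I, stored as a length-d vector
variance = 1.0 / precision        # diagonal of the approximate posterior covariance
```

Parameters with a large Fisher entry get a small posterior variance, which is what later makes them expensive to move.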

The EWC loss function for learning task 2 is then as follows.

$L_{2} (θ) = - \frac{1}{n} \sum_{i \in D_{2}} \log p (y_{i} | x_{i}, θ) + \frac{λ}{2} \sum_{θ_{j}} (1 + F_{θ, j j}^{1}) {(θ_{j} - θ_{1, j}^{*})}^{2}$
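The loss above can be sketched directly in code. This is an illustrative version under stated assumptions: the data term is the averaged negative log-likelihood of a Bernoulli logistic-regression model, and `theta_1_star`, `fisher_diag`, and `lam` are hypothetical values rather than quantities obtained from real training.

```python
import numpy as np

def data_nll(theta, X, y):
    """-(1/n) * sum_i log p(y_i | x_i, theta) for Bernoulli logistic regression."""
    logits = X @ theta
    return np.mean(np.logaddexp(0.0, logits) - y * logits)

def ewc_penalty(theta, theta_star, fisher_diag, lam):
    """(lam / 2) * sum_j (1 + F_jj) * (theta_j - theta*_j)^2."""
    return 0.5 * lam * np.sum((1.0 + fisher_diag) * (theta - theta_star) ** 2)

def ewc_loss(theta, X2, y2, theta_star, fisher_diag, lam):
    return data_nll(theta, X2, y2) + ewc_penalty(theta, theta_star, fisher_diag, lam)

rng = np.random.default_rng(0)
X2 = rng.normal(size=(50, 3))                      # toy task 2 inputs (assumed)
y2 = rng.integers(0, 2, size=50).astype(float)     # toy task 2 labels (assumed)
theta_1_star = np.array([0.5, -1.0, 2.0])          # hypothetical task 1 solution
fisher_diag = np.array([5.0, 0.0, 0.1])            # hypothetical diagonal Fisher from task 1
lam = 10.0

# Move the same distance along an important and an unimportant parameter.
step = 0.1
move_important = theta_1_star + step * np.array([1.0, 0.0, 0.0])
move_unimportant = theta_1_star + step * np.array([0.0, 1.0, 0.0])
cost_important = ewc_penalty(move_important, theta_1_star, fisher_diag, lam)
cost_unimportant = ewc_penalty(move_unimportant, theta_1_star, fisher_diag, lam)
```

With these numbers the same step costs six times more along the parameter with Fisher value 5 than along the one with Fisher value 0. A plain L2 penalty would charge both moves equally.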

For a general task t, the loss has the same form, with the FIMs of all previous tasks accumulated.

$L_{t} (θ) = - \log p (D_{t} | θ) + \frac{λ}{2} \sum_{θ_{j}} (1 + \sum_{s = 1}^{t - 1} F_{θ, j j}^{s}) {(θ_{j} - θ_{t - 1, j}^{*})}^{2}$
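The accumulated penalty for task $t$ can be sketched as below. The array `fisher_diags` (one diagonal Fisher per previous task) and the other values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np

def general_ewc_penalty(theta, theta_prev_star, fisher_diags, lam):
    """(lam / 2) * sum_j (1 + sum_s F^s_jj) * (theta_j - theta*_{t-1, j})^2.

    fisher_diags has shape (t - 1, d): one diagonal Fisher per previous task.
    """
    accumulated = 1.0 + np.sum(fisher_diags, axis=0)
    return 0.5 * lam * np.sum(accumulated * (theta - theta_prev_star) ** 2)

# Hypothetical diagonal Fishers from two previous tasks (t = 3, d = 3).
fisher_diags = np.array([[5.0, 0.0, 0.1],
                         [0.0, 4.0, 0.1]])
theta_prev_star = np.zeros(3)          # hypothetical solution after task 2
theta = np.array([0.1, 0.1, 0.1])

penalty_two_tasks = general_ewc_penalty(theta, theta_prev_star, fisher_diags, lam=2.0)
penalty_one_task = general_ewc_penalty(theta, theta_prev_star, fisher_diags[:1], lam=2.0)
```

Since only the sum over $s$ enters the formula, a running sum of the diagonal Fishers and the most recent solution $θ_{t - 1}^{*}$ are enough to evaluate the penalty; data from earlier tasks is not needed.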

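Approximating the posterior with another, already known distribution is similar in spirit to Variational Inference. Approaches that carry the approximate posterior forward using variational inference instead of a Laplace approximation go by the name Variational Continual Learning.

As shown above, EWC regularizes the important parameters so that those weights do not change much and are preserved.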
