Continual Learning (2)

School classes/ADL 2023. 9. 21. 22:03

The previous post covered the concept of continual learning and its three families of approaches. From here on, I will introduce representative methods from each family.

Reference: Kirkpatrick et al., Overcoming catastrophic forgetting in neural networks, PNAS 2017

EWC (Elastic Weight Consolidation)

- Identify which nodes or weights are important.

- Apply strong regularization to the important nodes or weights.

- The focus is on stability.

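Let $\theta_A$ be the parameters of a model trained on task A. Suppose the region of parameter space with low error for task A is drawn in gray, and the low-error region for the new task B is drawn in yellow. If the model is trained on task B with no penalty, the parameters move into the yellow region, but the model then has high error on the earlier task A. What about simply adding an L2 regularizer? It penalizes every weight equally, regardless of how much task A depends on it. Kirkpatrick et al. describe this constraint as too severe: task A is remembered only at the cost of not learning task B, so the parameters end up outside the low-error region for B. EWC instead moves $\theta_A$ toward the region where the two low-error regions overlap, so that both tasks are performed well.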

Bayesian framework

1. Single task case with a neural network model $\theta$

- Let $\mathcal{D} = \{ (x_i, y_i) \}^n_{i=1}$ be the training data

- Maximum Likelihood (ML) estimate of $\theta$

$$ \theta^*_{ML} = \underset{\theta}{\arg\min}\{-\log p(\mathcal{D}|\theta)\} = \underset{\theta}{\arg\min}\{ - \sum_{i=1}^n \log p(y_i|x_i, \theta) \}$$

- Maximum a Posteriori (MAP) estimate of $\theta$

$$ \theta^*_{MAP} = \underset{\theta}{\arg\min}\{-\log p(\mathcal{D}|\theta) - \log p(\theta) \} = \underset{\theta}{\arg\min}\{ - \sum_{i=1}^n \log p(y_i|x_i, \theta) - \log p(\theta) \}$$

Here $p(\theta)$ in the MAP objective is the prior on $\theta$, taken to be a standard Gaussian $\mathcal{N}(0, I)$.
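To make the difference between the two estimates concrete, here is a minimal sketch on a toy problem. The 1-D logistic model, the three data points, and the helper names (`nll`, `neg_log_prior`, `grid_argmin`) are all assumptions made for illustration and are not part of the lecture material. With a standard Gaussian prior, $-\log p(\theta)$ reduces to $\frac{1}{2}\theta^2$ up to a constant, i.e. an L2 penalty.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nll(theta, data):
    """-sum_i log p(y_i | x_i, theta) for a 1-D logistic model (illustrative)."""
    total = 0.0
    for x, y in data:
        p = sigmoid(theta * x)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def neg_log_prior(theta):
    """-log N(theta; 0, 1), dropping the additive constant."""
    return 0.5 * theta * theta

def grid_argmin(objective, lo=-10.0, hi=10.0, steps=2001):
    """Brute-force minimizer over an evenly spaced grid (enough for a 1-D toy)."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return min(grid, key=objective)

# Linearly separable toy data: the likelihood alone keeps improving as theta grows.
data = [(1.0, 1), (2.0, 1), (-1.0, 0)]

theta_ml = grid_argmin(lambda t: nll(t, data))
theta_map = grid_argmin(lambda t: nll(t, data) + neg_log_prior(t))
```

On this data the ML estimate runs to the edge of the search grid, while the prior term keeps the MAP estimate at a small finite value.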

2. Two tasks arriving sequentially

- Let $\mathcal{D}_1, \mathcal{D}_2$ be the training data for each task

- A natural MAP estimate of $\theta$ after learning Task 2

$$ \theta^*_{MAP, 1:2} = \underset{\theta}{\arg\min}\{-\log p(\mathcal{D}_2|\theta) - \log p(\theta|\mathcal{D}_1) \} $$

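The posterior given $\mathcal{D}_1$ is used as the prior for the whole dataset, so that it acts as a regularizer. Because this posterior is hard to compute exactly, it will be approximated.

Why does the posterior suddenly become a prior, and why does it act as a regularizer?

The posterior given $\mathcal{D}_1$ carries information about which parameters are important, so it is used as the prior. It multiplies the likelihood $p(\mathcal{D}_2|\theta)$, so in log space $\log p(\theta|\mathcal{D}_1)$ is added to $\log p(\mathcal{D}_2|\theta)$, the negative loss $-\mathcal{L}(\theta)$, and therefore plays the role of a regularizer. Whether using the task 1 posterior as the prior for all tasks is a heuristic or a mathematically derived formula is something I still need to ask about.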
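On the last question, one possible answer is that the formula follows from Bayes' rule rather than being a heuristic. The sketch below assumes that $\mathcal{D}_1$ and $\mathcal{D}_2$ are conditionally independent given $\theta$, an assumption that is not stated explicitly in this post:

$$ p(\theta|\mathcal{D}_1, \mathcal{D}_2) = \frac{p(\mathcal{D}_2|\theta, \mathcal{D}_1)\, p(\theta|\mathcal{D}_1)}{p(\mathcal{D}_2|\mathcal{D}_1)} \propto p(\mathcal{D}_2|\theta)\, p(\theta|\mathcal{D}_1) $$

$$ -\log p(\theta|\mathcal{D}_1, \mathcal{D}_2) = -\log p(\mathcal{D}_2|\theta) - \log p(\theta|\mathcal{D}_1) + \text{const} $$

Under that assumption, minimizing the right-hand side over $\theta$ is the same objective as $\theta^*_{MAP, 1:2}$ above, with the task 1 posterior in the place where the prior $p(\theta)$ sat in the single-task case.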

3. Laplace Approximation to approximate posterior $\log p(\theta|\mathcal{D}_1)$

- Let the MAP solution be $\theta^*_1 = \underset{\theta}{\arg\min}\{-\log p(\theta|\mathcal{D}_1)\}$

- Approximate $p(\theta|\mathcal{D}_1)$ as a Gaussian with mean $\theta^*_1$ and covariance $\Sigma$

- Approximate the precision $\Sigma^{-1}$ with the Fisher Information Matrix (FIM)

$$ \log p(\theta|\mathcal{D}_1) \approx \log p(\theta^*_1|\mathcal{D}_1) + {1 \over 2} (\theta-\theta^*_1)^{\top}(\nabla^2_{\theta} \log p(\theta| \mathcal{D}_1)\vert_{\theta^*_1})(\theta-\theta^*_1) $$

$$p(\theta|\mathcal{D}_1) \approx \mathcal{N}\left(\theta^*_1, \left(\nabla^2_{\theta} \left(-\log p(\theta| \mathcal{D}_1)\right)\vert_{\theta^*_1}\right)^{-1}\right) $$

$$\nabla^2_{\theta} \left(-\log p(\theta| \mathcal{D}_1)\right) = \nabla^2_{\theta} \left(-\log p(\mathcal{D}_1|\theta) - \log p(\theta)\right) = {1 \over n}\sum_{i=1}^n \nabla^2_{\theta}(-\log p(y_i| x_i, \theta))+ I $$

The last equality takes the likelihood term normalized by $n$, as in the loss used later in this post, and the $I$ comes from the standard Gaussian prior.

Then,

$$\nabla_\theta^2 \left( -\log p(y|x,\theta) \right) = -{{\nabla_\theta^2 p(y|x,\theta)} \over {p(y|x,\theta)}} + {{\nabla_\theta p(y|x,\theta){\nabla_\theta p(y|x,\theta)}^\top} \over {{p(y|x,\theta)}^2}} $$

$$= -{{\nabla_\theta^2 p(y|x,\theta)} \over {p(y|x,\theta)}} + {{\nabla_\theta \log p(y|x,\theta){\nabla_\theta \log p(y|x,\theta)}^\top}} $$

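Taking expectations under the model distribution leads to the identity below. The first term vanishes because $\mathbb{E}_{y|x,\theta}\left[ \nabla_\theta^2 p(y|x,\theta) / p(y|x,\theta) \right] = \nabla_\theta^2 \sum_y p(y|x,\theta) = \nabla_\theta^2 1 = 0$.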

$$\mathbb{E}_{x,y|\theta}\left( \nabla_\theta^2 \left( -\log p(y|x,\theta) \right)\right) = \mathbb{E}_{x,y|\theta}\left( {{\nabla_\theta \log p(y|x,\theta){\nabla_\theta \log p(y|x,\theta)}^\top}} \right) \triangleq F_\theta $$
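The identity above can be checked numerically on a small model. The sketch below is only an illustration under assumptions that are not in the lecture: a 1-D logistic model $p(y=1|x,\theta) = \sigma(\theta x)$, a single fixed input `x`, and finite-difference derivatives (`d1`, `d2`) in place of automatic differentiation. For this model both sides should equal $\sigma(\theta x)(1-\sigma(\theta x))x^2$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neg_log_p(y, x, theta):
    """-log p(y | x, theta) for the 1-D logistic model (illustrative)."""
    p = sigmoid(theta * x)
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

def d1(f, t, h=1e-4):
    """Central finite-difference first derivative."""
    return (f(t + h) - f(t - h)) / (2.0 * h)

def d2(f, t, h=1e-4):
    """Central finite-difference second derivative."""
    return (f(t + h) - 2.0 * f(t) + f(t - h)) / (h * h)

def expected_hessian_and_fisher(x, theta):
    """Return (E_y[Hessian of -log p], E_y[score^2]) with y ~ p(y | x, theta)."""
    p1 = sigmoid(theta * x)
    exp_hessian, fisher = 0.0, 0.0
    for y, weight in ((1, p1), (0, 1.0 - p1)):
        f = lambda t, y=y: neg_log_p(y, x, t)
        exp_hessian += weight * d2(f, theta)
        fisher += weight * d1(f, theta) ** 2
    return exp_hessian, fisher

x, theta = 1.5, 0.4
exp_hessian, fisher = expected_hessian_and_fisher(x, theta)
closed_form = sigmoid(theta * x) * (1.0 - sigmoid(theta * x)) * x * x
```

The expectation over `y` is taken under the model itself, which is what makes the first term of the Hessian cancel.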

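$F_\theta$ is the Fisher Information Matrix (FIM). ${\nabla_\theta \log p(y|x,\theta)}$ is the gradient of the neural network's log-likelihood, a vector with one entry per parameter, so its outer product $F_\theta$ is a (parameter dimension) x (parameter dimension) matrix, which again makes the computation expensive. The FIM is therefore approximated by its diagonal, and this gives the final covariance matrix.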

$$\nabla_\theta^2 (-\log p(\theta|\mathcal{D}_1))|_{\theta^*_1} \approx {{1} \over {n}} \sum_{i=1}^n \nabla_\theta (\log p(y_i|x_i,\theta)){\nabla_\theta (\log p(y_i|x_i,\theta))}^\top + I $$

Hence, the final approximation becomes

$$p(\theta|\mathcal{D}_1) \approx \mathcal{N}(\theta^*_1, {(\text{diag}(F_\theta)|_{\theta^*_1}+I)}^{-1})$$
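As a rough illustration of this step, the sketch below computes the diagonal of the empirical Fisher (using the observed labels $y_i$, as in the sum over the data above) and the resulting Laplace precision $\text{diag}(F_\theta) + I$. It assumes a plain logistic regression model instead of a neural network, and the function names `grad_log_lik`, `diag_fisher`, and `laplace_precision` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_log_lik(theta, x, y):
    """Gradient of log p(y | x, theta) for logistic regression: (y - sigmoid(theta . x)) * x."""
    p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    return [(y - p) * xi for xi in x]

def diag_fisher(theta_star, data):
    """Diagonal of (1/n) * sum_i g_i g_i^T, with g_i evaluated at theta_star."""
    n = len(data)
    diag = [0.0] * len(theta_star)
    for x, y in data:
        g = grad_log_lik(theta_star, x, y)
        for j, gj in enumerate(g):
            diag[j] += gj * gj / n
    return diag

def laplace_precision(theta_star, data):
    """Per-parameter precision diag(F) + 1, the +1 coming from the N(0, I) prior."""
    return [f + 1.0 for f in diag_fisher(theta_star, data)]

# Toy data where the second feature is always zero, so the second weight
# never affects the likelihood.
data = [([1.0, 0.0], 1), ([2.0, 0.0], 0), ([-1.0, 0.0], 1)]
theta_star = [0.5, 3.0]  # stand-in for the task 1 MAP solution (not actually optimized)

fisher = diag_fisher(theta_star, data)
precision = laplace_precision(theta_star, data)
```

The weight that the data never uses gets zero Fisher information, so its precision comes from the prior alone, while the weight that matters gets a larger precision and hence a tighter Gaussian.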

The EWC loss function for learning task 2 is then as follows.

$$\mathcal{L}_2({\theta}) = -{{1}\over{n}}\sum_{i \in \mathcal{D}_2}\log p(y_i|x_i,\theta) + {{\lambda}\over{2}}\sum_{\theta_j}(1+F_{\theta, jj}^1){(\theta_j-\theta_{1,j}^*)}^2$$

For a general task t, the loss function takes the form in which the FIMs of the previous tasks keep accumulating.

$$ \mathcal{L}_t({\theta}) = -\log p(\mathcal{D}_t|\theta) + {{\lambda}\over{2}}\sum_{\theta_j}(1+\sum_{s=1}^{t-1} F_{\theta, jj}^s){(\theta_j-\theta_{t-1,j}^*)}^2$$
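The penalty term in these losses can be sketched in a few lines. This is a minimal illustration, not a reference implementation: parameters are plain Python lists, `fisher_diags` is assumed to hold one diagonal Fisher list per previous task, and `task_nll` is a hypothetical callable that returns the negative log-likelihood on the current task.

```python
def ewc_penalty(theta, theta_prev, fisher_diags, lam):
    """(lam / 2) * sum_j (1 + sum_s F^s_jj) * (theta_j - theta_prev_j)^2."""
    total = 0.0
    for j, (t, t_prev) in enumerate(zip(theta, theta_prev)):
        importance = 1.0 + sum(f[j] for f in fisher_diags)
        total += importance * (t - t_prev) ** 2
    return 0.5 * lam * total

def ewc_loss(task_nll, theta, theta_prev, fisher_diags, lam):
    """Current-task negative log-likelihood plus the EWC penalty."""
    return task_nll(theta) + ewc_penalty(theta, theta_prev, fisher_diags, lam)

theta_prev = [1.0, 1.0]
fisher_diags = [[4.0, 0.0]]  # one previous task; parameter 0 is the important one

# Moving each parameter by the same amount costs very different penalties.
move_important = ewc_penalty([1.5, 1.0], theta_prev, fisher_diags, lam=2.0)
move_unimportant = ewc_penalty([1.0, 1.5], theta_prev, fisher_diags, lam=2.0)
```

Moving the parameter with high Fisher information is penalized five times more here than moving the other one by the same amount, which is the sense in which the penalty is elastic per weight.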

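Approximating the posterior with a known, tractable distribution is similar in spirit to Variational Inference. Methods that carry out this approximation with variational inference are called Variational Continual Learning.

As shown above, EWC regularizes the important parameters so that those weights are preserved without changing much.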
