Self-Supervised Learning (2)

School Classes/ADL 2023. 10. 19. 21:28

Last up are the SSL methods that use a contrastive loss. The representation is learned so that similar sample pairs end up close together and dissimilar pairs end up far apart. I'll skip the earliest contrastive loss and the triplet loss and start from Noise Contrastive Estimation (NCE). NCE comes from a 2010 paper, so it is quite old. Suppose we have observed data $X=(x_1, x_2, ..., x_N)$ and artificially generated noise data $Y=(y_1, y_2,..., y_N)$, with $x \sim p(x;\theta), y \sim q(y)$. Then the log-odds that a sample $u$ came from the model rather than from the noise is $\ell_\theta(u) = \log{{p_\theta(u)} \over {q(u)}}$. Plugging this into a loss gives

$$\mathcal{L}_{NCE} = -{1\over N}\sum_{i=1}^N\left[\log{\sigma(\ell_\theta(x_i))} + \log{(1-\sigma(\ell_\theta(y_i)))}\right]$$

where $q(y)$ is said to be a design choice, though I'm not sure what that means. In any case, the model is estimated by contrasting the data against noise samples.
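To make the loss concrete, here is a minimal sketch in plain Python. The Gaussian model $p_\theta = \mathcal{N}(\theta, 1)$, the noise distribution $q = \mathcal{N}(0, 2^2)$, the sample sizes, and every function name are assumptions chosen for illustration; none of them come from the lecture or the paper. It evaluates $\mathcal{L}_{NCE}$ once at the parameter that generated the data and once at a wrong parameter.

```python
import math
import random

def softplus(v):
    """Stable log(1 + exp(v)); note log(sigmoid(v)) = -softplus(-v)."""
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))

def log_normal_pdf(u, mean, std):
    """Log density of a 1-D Gaussian."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (u - mean) ** 2 / (2 * std ** 2)

def log_odds(u, theta):
    """l_theta(u) = log p_theta(u) - log q(u).
    Assumed for illustration: p_theta = N(theta, 1), q = N(0, 2^2)."""
    return log_normal_pdf(u, theta, 1.0) - log_normal_pdf(u, 0.0, 2.0)

def nce_loss(xs, ys, theta):
    """-(1/N) * sum_i [log sigma(l(x_i)) + log(1 - sigma(l(y_i)))]."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += softplus(-log_odds(x, theta))  # equals -log sigma(l(x))
        total += softplus(log_odds(y, theta))   # equals -log(1 - sigma(l(y)))
    return total / len(xs)

random.seed(0)
xs = [random.gauss(1.5, 1.0) for _ in range(2000)]  # observed data, true mean 1.5
ys = [random.gauss(0.0, 2.0) for _ in range(2000)]  # noise samples drawn from q

loss_true = nce_loss(xs, ys, theta=1.5)
loss_wrong = nce_loss(xs, ys, theta=-1.5)
print(loss_true, loss_wrong)
```

The loss is lower at the generating parameter than at the wrong one, which is the sense in which telling data apart from noise estimates $\theta$. Picking $q$ (here a wide Gaussian that is easy to sample from and to evaluate) is one way to read the "design choice" remark.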

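According to the reference paper, "Noise-contrastive estimation: A new estimation principle for unnormalized statistical models", if we define $$h(u;\theta) = {1 \over {1+\exp[-G(u;\theta)]}}, \quad G(u;\theta) = \ln{p_m(u;\theta)}-\ln{p_n(u)}$$

then the objective can be rewritten as $$J_T(\theta) = {1\over {2T}} \sum_t \ln[h(x_t;\theta)]+\ln[1-h(y_t;\theta)]$$

In the earlier notation, $p_m = p_\theta$, $p_n = q$, and $h(u;\theta) = \sigma(\ell_\theta(u))$. The reason it comes out this way is as follows. Treat data and noise as two classes with $p(u|C=1;\theta) = p_m(u;\theta), p(u|C=0) = p_n(u)$. With equally many data and noise samples the class priors are equal, so Bayes' rule gives $$P(C=1|u;\theta) = {{p_m(u;\theta)} \over{p_m(u;\theta)+p_n(u)}} = h(u;\theta)$$ $$P(C=0|u;\theta) = 1-h(u;\theta)$$

Since $C_t$ is Bernoulli-distributed, the log-likelihood of $\theta$ is $$\ell(\theta) = \sum_t C_t \ln{P(C_t=1|u_t;\theta)} + (1-C_t)\ln{P(C_t=0|u_t;\theta)}$$

Every data point $x_t$ has $C_t=1$ and every noise point $y_t$ has $C_t=0$, so only one of the two terms survives for each sample, which leaves $$\ell(\theta) = \sum_t \ln[h(x_t;\theta)]+ \ln[1-h(y_t; \theta)]$$ This is $J_T$ up to the factor ${1 \over {2T}}$.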
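The two steps of this argument can be checked numerically. The sketch below assumes a unit-variance Gaussian model and a standard Gaussian noise density, plus a handful of made-up sample values; all of these are hypothetical. It confirms that the logistic form of $h$ matches the Bayes posterior under equal priors, and that the labelled Bernoulli log-likelihood equals the sum split over data and noise.

```python
import math

def log_pm(u, theta):
    """Hypothetical model log-density: Gaussian with mean theta, unit variance."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * (u - theta) ** 2

def log_pn(u):
    """Hypothetical noise log-density: standard Gaussian."""
    return -0.5 * math.log(2 * math.pi) - 0.5 * u ** 2

def h(u, theta):
    """h(u; theta) = 1 / (1 + exp(-G)), with G = ln p_m - ln p_n."""
    G = log_pm(u, theta) - log_pn(u)
    return 1.0 / (1.0 + math.exp(-G))

theta = 1.0
xs = [0.8, 1.9, 1.1]    # made-up "data" points, label C = 1
ys = [-0.4, 0.3, -1.2]  # made-up "noise" points, label C = 0

# Step 1: h equals the posterior P(C=1 | u) when both classes are equally likely.
u = xs[0]
pm, pn = math.exp(log_pm(u, theta)), math.exp(log_pn(u))
posterior = pm / (pm + pn)

# Step 2: Bernoulli log-likelihood over the pooled, labelled samples.
pooled = [(x, 1) for x in xs] + [(y, 0) for y in ys]
loglik = sum(
    c * math.log(h(v, theta)) + (1 - c) * math.log(1.0 - h(v, theta))
    for v, c in pooled
)

# The same quantity written as one sum over data/noise pairs.
split = sum(
    math.log(h(x, theta)) + math.log(1.0 - h(y, theta)) for x, y in zip(xs, ys)
)
J_T = split / (2 * len(xs))

print(abs(h(u, theta) - posterior), abs(loglik - split), J_T)
```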

The version that carries this idea over to neural networks is InfoNCE.

Let $z_t = g_{enc}(x_t)$, and let $c_t = g_{ar}(z_{\le t})$ be the output of the autoregressive model. The objective is to predict the representation $z_{t+k}$. What I wondered here is why taking in a sample and predicting $z_{t+k}$ from $c_t$ counts as contrastive learning. After reading a bit of the CPC paper, it turns out the encoder and the autoregressive model are jointly optimized with InfoNCE. The positive sample is drawn from $p(x_{t+k}|c_t)$ and the negative samples are drawn from the proposal distribution $p(x_{t+k})$. Then $$\mathcal{L} = -\mathbb{E}_X\left[\log{{f_k(x_{t+k}, c_t)} \over {\sum_{x_j \in X}f_k(x_j, c_t)}}\right]$$ is optimized. Here $f_k(x_{t+k}, c_t) = \exp(z_{t+k}^T W_k c_t)$: the exponent is a simple bilinear model of the term $\log{p(x|c) \over p(x)}$ that appears in the mutual information $I(x;c) = \sum_{x,c}p(x,c)\log{p(x|c) \over p(x)}$. I'm not sure why it was modeled this way. Anyway, as in NCE, we want $p(C=pos|X,c)$. Given the context and the set of samples, the probability that a sample $x$ is the positive one equals (probability of being positive) / (probability of being positive + probability of being negative), so the posterior is:

$$p(C=pos|X,c) = {{p(x_{pos}|c)\prod_{i=1,...,N;i\neq pos}p(x_i)} \over {\sum_{j=1}^N\left[p(x_j|c)\prod_{i=1,...,N; i\neq j}p(x_i)\right]}}$$ Here the factors common to the numerator and the denominator cancel.
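A quick numeric sanity check of that cancellation: the posterior built from the full products should equal a normalized density ratio $p(x_j|c)/p(x_j)$. The three-symbol distributions and the candidate list below are made up purely for illustration.

```python
# Hypothetical discrete distributions over three symbols.
p_marginal = {"u": 0.5, "v": 0.3, "w": 0.2}     # p(x)
p_conditional = {"u": 0.1, "v": 0.2, "w": 0.7}  # p(x | c)

samples = ["u", "w", "v", "u"]  # N = 4 candidates x_1..x_N

def posterior_full(samples):
    """P(position j holds the positive), using p(x_j|c) * prod_{i != j} p(x_i)."""
    terms = []
    for j, x_j in enumerate(samples):
        term = p_conditional[x_j]
        for i, x_i in enumerate(samples):
            if i != j:
                term *= p_marginal[x_i]
        terms.append(term)
    total = sum(terms)
    return [t / total for t in terms]

def posterior_ratio(samples):
    """Same posterior after dividing everything by prod_i p(x_i)."""
    ratios = [p_conditional[x] / p_marginal[x] for x in samples]
    total = sum(ratios)
    return [r / total for r in ratios]

full = posterior_full(samples)
ratio = posterior_ratio(samples)
print(full)
print(ratio)
```

Both lists agree, and the candidate with the largest ratio $p(x|c)/p(x)$ (here the symbol `"w"`) gets the highest probability of being the positive.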

Dividing the numerator and the denominator by $\prod_{i=1}^N p(x_i)$ gives $$ {{p(x_{pos}|c)\prod_{i=1,...,N;i\neq pos}p(x_i)} \over {\sum_{j=1}^N\left[p(x_j|c)\prod_{i=1,...,N; i\neq j}p(x_i)\right]}} = {{{p(x_{pos}|c)}\over {p(x_{pos})}} \over {\sum_{j=1}^N{{p(x_j|c)}\over{p(x_j)}}}}$$ Each term is the same density ratio that appears inside the mutual information, so it is replaced by the $f_k(\cdot)$ mentioned earlier, and that yields the loss function above. The professor also showed that minimizing InfoNCE maximizes a lower bound on the mutual information, but I won't write that out here.
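As a small illustration of the resulting loss, the sketch below computes the InfoNCE term for one context with the bilinear score $f_k = \exp(z^T W_k c)$. The two-dimensional vectors, the matrices, and the function names are hypothetical and use plain Python lists in place of a real encoder and autoregressive model.

```python
import math

def bilinear_score(z, W, c):
    """z^T W c for plain lists: z has length d_z, W is d_z x d_c, c has length d_c."""
    Wc = [sum(w * c_j for w, c_j in zip(row, c)) for row in W]
    return sum(z_i * v for z_i, v in zip(z, Wc))

def info_nce(candidates, pos_index, W, c):
    """-log( f(x_pos, c) / sum_j f(x_j, c) ) with f = exp(z^T W c).
    Computed with log-sum-exp for numerical stability."""
    scores = [bilinear_score(z, W, c) for z in candidates]
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[pos_index]

c_t = [1.0, 0.0]                 # context vector (made up)
candidates = [
    [2.0, 0.0],                  # z_{t+k}: the one positive
    [-1.0, 0.5],                 # negative
    [0.0, 1.0],                  # negative
]

W_identity = [[1.0, 0.0], [0.0, 1.0]]  # scores the positive highest
W_zero = [[0.0, 0.0], [0.0, 0.0]]      # uninformative: all scores equal

loss_good = info_nce(candidates, 0, W_identity, c_t)
loss_chance = info_nce(candidates, 0, W_zero, c_t)
print(loss_good, loss_chance)
```

With an uninformative $W_k$ every candidate gets the same score and the loss sits at $\log N$ (here $\log 3$); the loss only drops below that when the positive scores higher than the negatives, which is where the contrast comes in.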

How can CPC be used for vision?

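I didn't really understand the figure for this part, so let me try to work through it now. First, overlapping patches are the input, and $z_t = g_{enc}(x_t)$ is computed for each one. Then $c_t = g_{ar}(z_{\le t})$ is used to predict $z_{t+k}$. What confused me is this: the loss we derived is the negative log-probability that $x_{t+k}$ is the positive sample, and optimizing means minimizing that loss, so we are maximizing the probability that $x_{t+k}$ is the positive given $c_t$. So what does that actually mean? Ah, I briefly forgot that $f_k(x_{t+k}, c_t) = \exp(z_{t+k}^T W_k c_t)$. Let me think again.

With $c_t$ already known and $W_k$ as a learnable parameter, the loss ends up maximizing (score of the positive sample) / (sum of the scores of the positive and negative samples), since we minimize its negative. Each score $f_k$ stands in for the density ratio inside the mutual information. To maximize the fraction, the positive's score has to grow while the negatives' scores shrink. But why derive it this way? I can't tell whether mutual information came first or NCE came first. Anyway, learning $W_k$ this way amounts to predicting $\hat{z}_{t+k}$ as $W_k c_t$.

Then where does $t$ start? This was my biggest question: how far does $t$ have to be built up? Looking at the figure again, it seems fine to simply start from 1, and that makes sense now. But it means every patch computes a similarity against all the other patches, so the computation must be fairly heavy. As for the encoder, a quick look at the experiments suggests a ResNet; I first assumed it would be pretrained, but as noted above it is optimized jointly with the autoregressive model.

One more thing: the derivation assumes exactly one positive sample and treats everything else as negative, but with patches, doesn't that mean anything other than the target patch itself is not a positive? I really don't get this part. I had vaguely thought the positives were the patches close to a given patch. I should ask the professor about this.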
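One way to read the patch setup, stated as an assumption rather than as what the lecture figure or the paper specifies: lay the patches out as a grid, let the context at a grid position summarize the patches at and above it, and ask it to pick out the patch $k$ rows further down in the same column. Under that reading the target patch is the single positive and every other patch acts as a negative. The helper name `cpc_patch_targets` and the 7x7 grid size are hypothetical.

```python
def cpc_patch_targets(rows, cols, row, col, k):
    """Return (positive, negatives) as grid coordinates.

    Assumption for illustration: the context lives at (row, col) and the
    prediction target is the patch k rows below it in the same column.
    Every other patch in the grid is treated as a negative.
    """
    target = (row + k, col)
    if not (0 <= row < rows and 0 <= col < cols):
        raise ValueError("context position is outside the grid")
    if k < 1 or target[0] >= rows:
        raise ValueError("target patch falls outside the grid")
    negatives = [
        (r, c) for r in range(rows) for c in range(cols) if (r, c) != target
    ]
    return target, negatives

positive, negatives = cpc_patch_targets(rows=7, cols=7, row=2, col=3, k=2)
print(positive, len(negatives))
```

Under this reading there is exactly one positive per (context, $k$) pair, which matches the single-positive assumption in the derivation; nearby patches are not extra positives, they are simply negatives that are harder to tell apart.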

Next, the key ingredients of contrastive learning are as follows.

heavy data augmentation - create noisy versions of given image as positives
large batch size - large batch could include diverse negative samples
hard negative sampling - correct hard negative sampling is important
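The first two ingredients can be sketched with a toy batch. The layout below (two augmented views per image, the matching view as the positive, the other images' views as negatives), the cosine similarity, the temperature value, and the function names are all assumptions for illustration; the notes do not specify them.

```python
import math

def cosine(a, b):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def batch_contrastive_loss(views_a, views_b, temperature=0.5):
    """Average InfoNCE over a batch.

    views_a[i] and views_b[i] are embeddings of two augmentations of image i
    (the positive pair). Every other entry of views_b is a negative for
    views_a[i], so a batch of n images gives n - 1 negatives per anchor.
    """
    n = len(views_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(views_a[i], views_b[j]) / temperature for j in range(n)]
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
        total += log_sum - logits[i]
    return total / n

views_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
views_b = [[0.9, 0.1], [0.1, 0.9], [-0.9, -0.1]]  # slightly perturbed views

aligned = batch_contrastive_loss(views_a, views_b)
shuffled = batch_contrastive_loss(views_a, views_b[1:] + views_b[:1])
print(aligned, shuffled)
```

The loss is low when each anchor sits closest to its own augmented view and high when the pairing is shuffled. Since the negatives are just the rest of the batch, a larger batch directly means more, and more varied, negatives per anchor.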

But to tell positive samples from negative samples, doesn't some supervision have to come in anyway? The positive samples are augmented views of the image and all the other images are negatives, so wouldn't that introduce a bias toward those few positives? Some of the negative samples are bound to belong to the same class, too.

