[Paper Review] ViDAR: Visual Point Cloud Forecasting enables Scalable Autonomous Driving (2)

Paper Study 2024. 3. 15. 18:48

These notes have gotten too long.. I should mark the open questions and the important parts.

Latent Rendering

Coming back from the preliminaries to the main topic: latent rendering, ViDAR's first method. I really wish the paper would spell out what exactly is being rendered and what the rendered result looks like.

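Unlike differentiable ray casting, and in order to obtain discriminative and representative features, the method first computes a ray-wise feature with a feature expectation function, then customizes each grid's feature by weighting the ray-wise feature with its associated conditional probability. This is the second time the word 'associated' shows up in this paper, and I wish they would stop using it. It never says how the associated thing is actually associated.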

Anyway, taking inspiration from the differentiable ray casting paper, the feature expectation function is written as follows.

$$\hat{\mathcal{F}}^{(i)} = \sum_{k=1}^{m}\hat{\textbf{p}}^{(i,k)}\mathcal{F}^{(k)}_{bev}$$

$i$ indexes the ray extended from the origin to the $i$-th grid. Here $\hat{\textbf{p}}$ is a conditional probability. A probability of what? It is a conditional probability that takes as input a learnable independent probability projected from $\mathcal{F}_{bev}$?.. what does that even mean.. It is computed in a way similar to differentiable ray casting, but I do not understand what the learnable independent probability is. In the original method it is not a learnable parameter??? Anyway, for the $i$-th ray, multiplying the probability that an object is present at each waypoint $k \in \{1,2,\dots,m\}$ by $\mathcal{F}_{bev}$ at the $k$-th waypoint, and summing over $k$, gives the expected feature of the $i$-th ray, $\hat{\mathcal{F}}^{(i)}$. In short, this yields the ray-wise feature.
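To make the feature expectation concrete for myself, here is a minimal sketch for a single ray in plain Python. It assumes the conditional probability has the usual differentiable ray casting form, $\hat{p}^{(k)} = p^{(k)}\prod_{j<k}(1-p^{(j)})$, which is my reading of Eq. (1) and not something restated in this section. The function names and the numbers are made up for illustration.

```python
def conditional_probs(p):
    """p[k]: independent occupancy probability of the k-th waypoint on one ray.

    Returns p_hat[k] = p[k] * prod_{j<k} (1 - p[j]).
    ASSUMPTION: this is the form of Eq. (1); it is not restated in these notes.
    """
    p_hat, free = [], 1.0  # free: probability that the ray reaches waypoint k
    for pk in p:
        p_hat.append(free * pk)
        free *= 1.0 - pk
    return p_hat


def ray_feature(p_hat, feats):
    """Feature expectation: sum_k p_hat[k] * feats[k] (feats[k] is a C-dim list)."""
    channels = len(feats[0])
    return [sum(w * f[c] for w, f in zip(p_hat, feats)) for c in range(channels)]


# Made-up inputs: 3 waypoints on one ray, C = 2 feature channels.
p = [0.1, 0.8, 0.5]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

p_hat = conditional_probs(p)
f_ray = ray_feature(p_hat, feats)
print(p_hat, f_ray)
```

The waypoint with the high independent probability dominates the sum, so the ray-wise feature ends up close to the BEV feature at the waypoint where the ray most likely terminates.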

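What keeps bothering me: if ray casting is done for point cloud forecasting, then temporal information should come out of it as ViDAR claims, but so far nothing seems to have learned temporal information. Also, the history encoder built BEV features from images; when we ray cast over them and get an expected feature or an expected distance, I do not see why that is good, where it is supposed to be used, why it is an 'expected' feature, or at what point in time the feature is expected. If I understood this, I think I could apply it elsewhere.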

Anyway, the ray-wise feature is shared by all grids lying on the same ray. I really wish the paper described in detail how this sharing is actually used. Overall the paper is too vague.. Isn't it bad if all grids on the same ray have the same ray-wise feature.. The paper itself criticized differentiable ray casting because BEV grids on the same ray end up with similar feature responses, so I do not see why it does this. And why is $\mathcal{F}^{(k)}_{bev}$ indexed by waypoint? Wouldn't it be better to have an $\mathcal{F}_{bev}$ per grid? In any case, the grid feature is computed with the equation below. How do we suddenly get a grid feature out of a ray-wise feature?

$$ \hat{\mathcal{F}}_{bev} = \hat{\textbf{p}}\cdot\hat{\mathcal{F}}$$

Reading the equation: there is a ray-wise feature $\hat{\mathcal{F}}$, and it is multiplied by the conditional probability once more. This highlights the responses of BEV grids with higher conditional probability, producing a discriminative $\hat{\mathcal{F}}_{bev}$. I do not really get it.. but this process is what lets the BEV encoder learn geometric features during pretraining. Maybe I should try attaching this to just the camera encoder I am using now..
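A small sketch of how I read the grid feature equation, for one ray: every waypoint gets the same ray-wise feature, scaled by that waypoint's own conditional probability. The function name and the values are hypothetical, and the numbers are made up.

```python
def grid_features(p_hat, f_ray):
    """F_hat_bev[k] = p_hat[k] * f_ray for every waypoint k on the ray."""
    return [[w * v for v in f_ray] for w in p_hat]


p_hat = [0.1, 0.72, 0.09]  # made-up conditional probabilities along one ray
f_ray = [0.19, 0.81]       # made-up ray-wise feature, C = 2

f_grid = grid_features(p_hat, f_ray)
for k, f in enumerate(f_grid):
    print(k, f)
```

Under this reading, grids on the same ray share the direction of the feature vector and differ only in magnitude, so the grid with the highest conditional probability gets the strongest response.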

In addition, a multi-group latent rendering design strengthens the diversity of the geometric features. Running latent rendering multiple times on different feature channels lets the ray-wise features retain diverse information and improves downstream performance.
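Here is a rough sketch of what multi-group latent rendering could look like for one ray: split the channels into groups, render each group separately, and concatenate. That each group gets its own independent probabilities is my assumption, not something these notes confirm; all names and numbers are made up.

```python
def render_ray(p, feats):
    """Latent rendering for one ray and one channel group.

    p[k]: independent probability at waypoint k; feats[k]: feature list at waypoint k.
    ASSUMPTION: conditional probability p_hat[k] = p[k] * prod_{j<k} (1 - p[j]).
    """
    p_hat, free = [], 1.0
    for pk in p:
        p_hat.append(free * pk)
        free *= 1.0 - pk
    channels = len(feats[0])
    f_ray = [sum(w * f[c] for w, f in zip(p_hat, feats)) for c in range(channels)]
    return [[w * v for v in f_ray] for w in p_hat]


def multi_group_render(p_groups, feats):
    """Render each channel group with its own probabilities, then concatenate.

    ASSUMPTION: one set of independent probabilities per group.
    """
    groups = len(p_groups)
    size = len(feats[0]) // groups
    out = [[] for _ in feats]
    for g, p in enumerate(p_groups):
        sub = [f[g * size:(g + 1) * size] for f in feats]
        for k, rendered in enumerate(render_ray(p, sub)):
            out[k].extend(rendered)
    return out


# Made-up inputs: 3 waypoints, C = 4 channels, 2 groups of 2 channels each.
feats = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
p_groups = [[0.1, 0.8, 0.5], [0.7, 0.2, 0.9]]

out = multi_group_render(p_groups, feats)
print(out)
```

In this toy case the first group puts most of its weight on the second waypoint while the second group emphasizes the first waypoint, which is one way different channel groups could keep different geometric cues.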

The conditional probability of each BEV grid in Eq. (1) takes into account not only the grid's own independent response but also the responses of all prior grids. So during the pretraining phase, when the model raises (????) the response of a particular grid, the prior and subsequent responses are suppressed at the same time. This mitigates the ray-shaped feature issue of differentiable ray casting during pretraining. But this equation is unchanged from the one in differentiable ray casting, so what is the paper talking about? I need to think about this more. Anyway, the claim is that the issue is resolved.
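To check the suppression claim numerically, here is a tiny experiment, again assuming the conditional probability is $\hat{p}^{(k)} = p^{(k)}\prod_{j<k}(1-p^{(j)})$ (my reading of Eq. (1)). The probabilities are made up.

```python
def conditional_probs(p):
    """p_hat[k] = p[k] * prod_{j<k} (1 - p[j]) (ASSUMED form of Eq. (1))."""
    p_hat, free = [], 1.0
    for pk in p:
        p_hat.append(free * pk)
        free *= 1.0 - pk
    return p_hat


low = conditional_probs([0.2, 0.2, 0.2])   # flat responses along the ray
high = conditional_probs([0.2, 0.9, 0.2])  # raise only the middle waypoint

print(low)
print(high)
```

Raising the middle independent probability raises its own conditional probability and pushes down the one behind it, while the earlier waypoint is unchanged in this form. The earlier waypoints matter in the other direction: the middle one can only approach 1 if the independent probabilities before it are close to 0, which is presumably what 'prior responses are suppressed' means during training.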

Future Decoder

Now let's talk about the future decoder. Almost there. Based on the previous BEV latent space $\hat{\mathcal{F}}_{t-1}$ and conditioned on the expected ego motion $\textbf{e}_t$, the future decoder predicts the next BEV feature $\hat{\mathcal{F}}_t$ at frame $t$. The predicted feature is then used to generate the point cloud.

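In this part, the architecture design is the method. But it seems the model needs to know how geometry changes over time, and I am not sure whether the architecture alone achieves that, or whether the existing latent features were trained to carry temporal information.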

Anyway, at the $t$-th iteration, the ego motion condition $\textbf{e}_t$ (the expected coordinates and heading of the ego vehicle) is encoded into a high-dimensional embedding by an MLP and added to the future BEV queries as the transformer input. Are the future BEV queries empty as well?
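A minimal sketch of the ego motion conditioning step as I understand it: a two-layer MLP maps $(x, y, \text{heading})$ to a $C$-dimensional embedding, which is added to every future BEV query. The layer sizes, the random weights, and the zero-initialized queries are all placeholders I made up; I do not know how the queries are actually initialized.

```python
import random


def linear(x, w, b):
    """Dense layer: w is a list of rows, one row per output unit."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def relu(x):
    return [max(0.0, v) for v in x]


def init(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
C, HIDDEN = 8, 16  # hypothetical embedding and hidden sizes
w1, b1 = init(HIDDEN, 3, rng), [0.0] * HIDDEN
w2, b2 = init(C, HIDDEN, rng), [0.0] * C


def embed_ego_motion(e):
    """MLP: (x, y, heading) -> C-dimensional embedding."""
    return linear(relu(linear(e, w1, b1)), w2, b2)


e_t = [1.5, 0.2, 0.05]                     # made-up expected (x, y, heading)
queries = [[0.0] * C for _ in range(4)]    # 4 BEV queries; zeros are a placeholder

emb = embed_ego_motion(e_t)
conditioned = [[q + v for q, v in zip(query, emb)] for query in queries]
print(len(conditioned), len(conditioned[0]))
```

The same embedding is broadcast to all queries, so the ego motion acts as a global condition on the whole future BEV grid.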

The input then passes through 6 transformer layers, each made up of Deformable Self-Attention and Temporal Cross-Attention, to produce the future $\hat{\mathcal{F}}_t$.

Temporal Cross-Attention follows the design of Deformable Cross-Attention. The difference lies in the reference coordinates of the query points: in Deformable Cross-Attention they are the positions in the key/value feature map that correspond to the query points. In the future decoder, however, the ego vehicle moves, so the last frame and the target frame are not aligned in ego coordinates. An extra computation is therefore needed to obtain the reference points of the future BEV queries. Once $\hat{\mathcal{F}}_t$ is produced, a projection layer is used to generate the occupancy volume $\mathcal{P}_t$.
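The extra reference point computation is presumably a rigid transform between the two ego frames. Below is a sketch under an assumed convention: `ego_motion = (dx, dy, dyaw)` is the pose of the target ego frame expressed in the last frame. The function name, the convention, and the numbers are mine, not from the paper.

```python
import math


def reference_points_in_last_frame(points, ego_motion):
    """Map BEV query positions from the target frame (t) into the last frame (t-1).

    ASSUMPTION: ego_motion = (dx, dy, dyaw) is the pose of the target ego frame
    expressed in the last ego frame, so p_last = R(dyaw) @ p_target + (dx, dy).
    """
    dx, dy, dyaw = ego_motion
    c, s = math.cos(dyaw), math.sin(dyaw)
    return [(c * x - s * y + dx, s * x + c * y + dy) for x, y in points]


# Made-up example: the ego moved 2 m forward and turned 90 degrees to the left.
queries_xy = [(0.0, 0.0), (1.0, 0.0)]
ref = reference_points_in_last_frame(queries_xy, (2.0, 0.0, math.pi / 2))
print(ref)
```

With reference points like these, each future BEV query samples the previous BEV feature map at the location that corresponds to the same physical spot, instead of at its own grid index.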

That was a long journey. Now it's time to think about some ideas.
