Monocular Dynamic View Synthesis: A Reality Check

NeurIPS 2022

UC Berkeley · CMU · Adobe

Paper · Code · Data · Overview Video

Effective Multi-view

A monocular video contains effective multi-view cues when the camera moves much faster than the scene, even though the underlying scene is observed only once at each time step.

Quantifying Effective Multi-view

We propose to quantify the amount of effective multi-view signal present in a monocular capture using effective multi-view factors (EMFs). The higher the EMF, the more multi-view cues a monocular video contains. Through EMFs, we find that existing datasets carry orders of magnitude more multi-view cues than casual captures.
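As a rough illustration of one such factor (a minimal sketch under our own assumptions, not the paper's exact implementation), the snippet below estimates an angular-speed EMF from camera centers by measuring how quickly the camera rotates about an approximate scene look-at point; cam_positions, lookat, and fps are assumed inputs.

import numpy as np

def angular_emf(cam_positions, lookat, fps):
    """Estimate an angular-speed EMF (deg/s): how fast the camera
    orbits an approximate scene look-at point between frames.

    cam_positions: (N, 3) camera centers in world space (assumed input).
    lookat: (3,) approximate scene center (assumed input).
    fps: capture frame rate.
    """
    # Unit viewing directions from the look-at point to each camera center.
    dirs = cam_positions - lookat[None, :]
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    # Angle between consecutive viewing directions, in degrees.
    cosines = np.clip((dirs[:-1] * dirs[1:]).sum(-1), -1.0, 1.0)
    angles_deg = np.degrees(np.arccos(cosines))
    # Mean angular change per frame, converted to degrees per second.
    return angles_deg.mean() * fps

Intuitively, a slow hand-held capture of a moving subject yields a small value, while a teleporting multi-camera capture yields a large one.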

Below are actual captures from existing datasets.

Existing captures feature either teleporting cameras or quasi-static scenes, which creates a discrepancy with the casual captures encountered in real life. All videos are played at their actual frame rate.

Our iPhone dataset consists of 14 casually captured training videos from a hand-held smartphone without camera teleportation. For the 7 sequences shown at the top, we provide multi-camera rig validation.

Impact of Effective Multi-view

We compare the existing teleporting setting against our non-teleporting setting.

Our results show that the effective multi-view cues in the existing experimental protocol inflate the apparent synthesis quality of prior methods, and that truly monocular captures are more challenging.

Interestingly, we make two additional findings. First, camera teleportation affects warping-based methods more than the plain T-NeRF baseline. Second, although camera teleportation improves per-frame rendering quality, it also makes the rendering more temporally jittery -- a phenomenon not captured by current evaluation protocols.

Towards Better Experimentation Practice

Monocular dynamic view synthesis is difficult to evaluate due to occlusion, since a NeRF cannot predict what it has never observed -- an issue that existing methods sidestep through camera teleportation. We recommend that future works evaluate on co-visible pixels when experimenting with casual captures. We provide code snippets for computing co-visibility and masked image metrics.
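As an illustration of the masked-metric idea (a hedged sketch rather than the exact released snippet), the code below computes PSNR restricted to a boolean co-visibility mask; pred, gt, and covisible_mask are assumed inputs, with the mask marking test pixels that were observed at some point in the training video.

import numpy as np

def masked_psnr(pred, gt, covisible_mask, max_val=1.0):
    """PSNR computed only over co-visible pixels.

    pred, gt: (H, W, 3) float images in [0, max_val] (assumed inputs).
    covisible_mask: (H, W) boolean mask of test pixels that were seen
        in the training video, so the model can be expected to predict them.
    """
    diff2 = (pred - gt) ** 2
    # Average the squared error over masked pixels (and their channels) only.
    mse = diff2[covisible_mask].mean()
    return 10.0 * np.log10(max_val ** 2 / mse)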

Columns: Test view · Prediction (w/ tele.) · Prediction (w/o tele.) · Prediction (w/o tele., w/ mask)

Good dynamic view synthesis quality does not necessarily imply good correspondence, which directly reflects the quality of deformation modeling. We recommend that future works evaluate correspondence by keypoint transfer. We provide code snippets for reading out correspondence and annotation tools for keypoint labeling.
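The sketch below shows one way transferred keypoints could be scored with a percentage-of-correct-keypoints (PCK) metric; the threshold convention (a fraction of the image diagonal) and the input names are our own assumptions, not necessarily the paper's exact protocol.

import numpy as np

def pck(pred_kps, gt_kps, img_hw, ratio=0.05):
    """Fraction of transferred keypoints within a distance threshold.

    pred_kps, gt_kps: (K, 2) predicted and annotated (x, y) pixel locations.
    img_hw: (height, width) of the target image.
    ratio: threshold as a fraction of the image diagonal (assumed convention).
    """
    h, w = img_hw
    threshold = ratio * np.sqrt(h ** 2 + w ** 2)
    # A transfer counts as correct if it lands within the threshold
    # of the annotated target keypoint.
    dists = np.linalg.norm(pred_kps - gt_kps, axis=-1)
    return float((dists < threshold).mean())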

Columns: Source kpt. · Target kpt. · Nerfies · HyperNeRF · NSFF

Reality Check without Effective Multi-view

State-of-the-art methods struggle to model accurate shapes of moving objects. We also find that the baseline time-conditioned NeRF (T-NeRF; first column) is quite competitive in this setting.

We find that existing approaches perform poorly out of the box (mPSNR of 14-16 dB) on the proposed iPhone dataset, which contains more diverse and complex real-life motions.


We explored incorporating additional regularizations from recent advances in neural rendering, including background augmentation, depth supervision, and sparsity regularization. We denote methods with these enhancements as their "++" versions. Even though the results improve considerably, there is still large room for improvement.
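As a rough sketch of what one such regularizer could look like (an illustration under our own assumptions, not the paper's exact formulation), the snippet below adds an L1 depth-supervision term over pixels with valid sensor depth; render_depth, sensor_depth, valid_mask, and the loss weight are hypothetical names and values.

import numpy as np

def depth_supervision_loss(render_depth, sensor_depth, valid_mask, weight=0.1):
    """L1 penalty between rendered and sensor depth on valid pixels.

    render_depth: (H, W) depth rendered by the radiance field.
    sensor_depth: (H, W) depth from the phone's depth sensor.
    valid_mask: (H, W) boolean mask of pixels with reliable sensor depth.
    weight: loss weight added on top of the photometric objective.
    """
    residual = np.abs(render_depth - sensor_depth)
    return weight * residual[valid_mask].mean()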

Citation


@inproceedings{gao2022dynamic,
    title={Monocular Dynamic View Synthesis: A Reality Check},
    author={Gao, Hang and Li, Ruilong and Tulsiani, Shubham and Russell, Bryan and Kanazawa, Angjoo},
    booktitle={NeurIPS},
    year={2022},
}
      

Acknowledgement

We would like to thank Zhengqi Li and Keunhong Park for valuable feedback and discussions; Matthew Tancik and Ethan Weber for proofreading. We are also grateful to our pets: Sriracha, Haru, and Mochi, for being good during capture. This project is generously supported in part by the CONIX Research Center, sponsored by DARPA, as well as the BDD and BAIR sponsors.


Inspired by HumanNeRF