SUDS: Scalable Urban Dynamic Scenes

CVPR 2023

Haithem Turki¹   Jason Y. Zhang¹   Francesco Ferroni²   Deva Ramanan¹

¹Carnegie Mellon University   ²Argo AI

Code

Additional Results

Abstract

We extend neural radiance fields (NeRFs) to dynamic large-scale urban scenes. Prior work tends to reconstruct single video clips of short durations (up to 10 seconds). Two reasons are that such methods (a) tend to scale linearly with the number of moving objects and input videos because a separate model is built for each and (b) tend to require supervision via 3D bounding boxes and panoptic labels, obtained manually or via category-specific models. As a step towards truly open-world reconstructions of dynamic cities, we introduce two key innovations: (a) we factorize the scene into three separate hash table data structures to efficiently encode static, dynamic, and far-field radiance fields, and (b) we make use of unlabeled target signals consisting of RGB images, sparse LiDAR, off-the-shelf self-supervised 2D descriptors, and most importantly, 2D optical flow. Operationalizing such inputs via photometric, geometric, and feature-metric reconstruction losses enables SUDS to decompose dynamic scenes into the static background, individual objects, and their motions. When combined with our multi-branch table representation, such reconstructions can be scaled to tens of thousands of objects across 1.2 million frames from 1700 videos spanning geospatial footprints of hundreds of kilometers, (to our knowledge) the largest dynamic NeRF built to date. We present qualitative initial results on a variety of tasks enabled by our representations, including novel-view synthesis of dynamic urban scenes, unsupervised 3D instance segmentation, and unsupervised 3D cuboid detection. To compare to prior work, we also evaluate on KITTI and Virtual KITTI 2, surpassing state-of-the-art methods that rely on ground truth 3D bounding box annotations while being 10x quicker to train.
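The three-branch factorization described in the abstract can be illustrated with a small PyTorch sketch. This is not the released SUDS code: the HashField and ThreeBranchScene classes, the toy spatial hash, and the density-weighted blending are simplified stand-ins for the multiresolution hash grids and compositing used in the paper, intended only to show how static, dynamic, and far-field branches might be queried and combined per ray sample.

# Minimal sketch (not the authors' implementation) of three hash-table-backed
# branches: static (xyz), dynamic (xyz + t), and far-field (ray direction).
import torch
import torch.nn as nn


class HashField(nn.Module):
    """Toy stand-in for one hash-table-backed radiance field branch."""

    def __init__(self, table_size: int = 2 ** 14, feat_dim: int = 8, in_dim: int = 3):
        super().__init__()
        self.table = nn.Embedding(table_size, feat_dim)  # the "hash table"
        self.table_size = table_size
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 32), nn.ReLU(),
            nn.Linear(32, 4),  # RGB + density
        )
        # Large primes for spatial hashing (Instant-NGP-style), one per input dim.
        primes = torch.tensor([1, 2654435761, 805459861, 3674653429][:in_dim])
        self.register_buffer("primes", primes)

    def forward(self, x):
        # Quantize coordinates and hash them into table indices.
        idx = ((x * 128).long() * self.primes).sum(-1).abs() % self.table_size
        out = self.mlp(self.table(idx))
        rgb = torch.sigmoid(out[..., :3])
        sigma = torch.relu(out[..., 3])
        return rgb, sigma


class ThreeBranchScene(nn.Module):
    """Composite static, dynamic, and far-field branches per sample."""

    def __init__(self):
        super().__init__()
        self.static = HashField(in_dim=3)   # time-invariant geometry
        self.dynamic = HashField(in_dim=4)  # time-varying objects, indexed by (x, y, z, t)
        self.far = HashField(in_dim=3)      # sky / distant content, indexed by ray direction

    def forward(self, xyz, t, view_dir):
        rgb_s, sig_s = self.static(xyz)
        rgb_d, sig_d = self.dynamic(torch.cat([xyz, t], dim=-1))
        # Density-weighted blend of static and dynamic color at each sample.
        w = (sig_d / (sig_s + sig_d + 1e-6)).unsqueeze(-1)
        rgb = (1 - w) * rgb_s + w * rgb_d
        sigma = sig_s + sig_d
        # Far-field branch queried by ray direction, used where rays leave the near field.
        rgb_far, _ = self.far(view_dir)
        return rgb, sigma, rgb_far


if __name__ == "__main__":
    scene = ThreeBranchScene()
    xyz = torch.rand(1024, 3)               # normalized sample positions
    t = torch.rand(1024, 1)                 # normalized timestamps
    view_dir = nn.functional.normalize(torch.randn(1024, 3), dim=-1)
    rgb, sigma, rgb_far = scene(xyz, t, view_dir)
    print(rgb.shape, sigma.shape, rgb_far.shape)  # (1024, 3) (1024,) (1024, 3)

In the full method, the per-sample colors and densities from these branches are volume-rendered along each ray and supervised with the photometric, geometric (LiDAR), feature-metric, and optical-flow losses mentioned above.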

Overview

Drive-Throughs

We visualize dynamic objects across multiple days (below) on the same city block. All renderings are generated from the same trained model.
Renderings: RGB / depth / semantics.

Re-simulation

We render various scenarios to illustrate potential “re-simulation” workflows that SUDS enables. A long-standing goal of re-simulation for robotics is the ability to regenerate sensor data corresponding to different actions taken by the robot (in our case, a vehicle that moves differently than it did during the original sensor recording) [1]. Such workflows are also used for closed-loop re-simulation in model-based reinforcement learning [2].
Each scenario below compares the original rendering with a modified one, shown as RGB, depth, and semantics:
We remove all dynamic objects from the scene.
We shift the ego-vehicle's trajectory by four meters.
We shorten the camera focal length to mimic that of fisheye cameras used for near-field detection on autonomous vehicles, propagating the rendered depth and semantic annotations.
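The trajectory shift and focal-length change above amount to editing each frame's camera parameters and re-rendering from the trained model. The NumPy sketch below shows the two edits; the helper names and the final render_frame call are hypothetical and do not come from the released code.

# Sketch of the camera edits behind the re-simulation examples above.
import numpy as np


def shift_trajectory(cam_to_world: np.ndarray, lateral_offset_m: float) -> np.ndarray:
    """Translate the camera along its own right (x) axis, e.g. by 4 m."""
    shifted = cam_to_world.copy()
    right_axis = cam_to_world[:3, 0]          # camera x-axis expressed in world frame
    shifted[:3, 3] += lateral_offset_m * right_axis
    return shifted


def shorten_focal_length(intrinsics: np.ndarray, scale: float) -> np.ndarray:
    """Scale fx/fy down to widen the field of view (fisheye-like near-field view)."""
    K = intrinsics.copy()
    K[0, 0] *= scale                          # fx
    K[1, 1] *= scale                          # fy
    return K


if __name__ == "__main__":
    # Identity pose 10 m above the origin and a pinhole camera, purely for illustration.
    pose = np.eye(4)
    pose[:3, 3] = [0.0, 0.0, 10.0]
    K = np.array([[1000.0, 0.0, 960.0],
                  [0.0, 1000.0, 600.0],
                  [0.0, 0.0, 1.0]])

    new_pose = shift_trajectory(pose, lateral_offset_m=4.0)
    new_K = shorten_focal_length(K, scale=0.5)

    # rgb, depth, semantics = render_frame(model, new_pose, new_K, timestamp)  # hypothetical
    print(new_pose[:3, 3], new_K[0, 0])

Because depth and semantics are rendered from the same model as RGB, both annotations follow the modified cameras automatically.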

Citation

            
@InProceedings{turki2023suds,
    title     = {SUDS: Scalable Urban Dynamic Scenes},
    author    = {Turki, Haithem and Zhang, Jason Y. and Ferroni, Francesco and Ramanan, Deva},
    booktitle = {Computer Vision and Pattern Recognition (CVPR)},
    year      = {2023}
}