Part 0: Calibrating Your Camera and Capturing a 3D Scan
To create a NeRF, we need a 3D scan of our object, and to capture one we first need to recover the camera intrinsics and distortion coefficients.
To this end, we ran cv2.calibrateCamera on calibration images containing ArUco tags.
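A sketch of what this step might look like (the image paths, tag dictionary, and tag size are placeholders; older OpenCV builds expose cv2.aruco.detectMarkers directly, while newer ones wrap detection in cv2.aruco.ArucoDetector):

```python
import glob
import cv2
import numpy as np

aruco_dict = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_6X6_250)  # placeholder dictionary
tag_size = 0.05  # tag side length in meters (placeholder)

obj_points, img_points = [], []  # per-view 3D tag corners and their 2D detections
for path in glob.glob("calibration/*.jpg"):  # placeholder path
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    corners, ids, _ = cv2.aruco.detectMarkers(gray, aruco_dict)
    if ids is None:
        continue
    for tag in corners:
        # For simplicity, each detected tag is treated as its own planar
        # calibration view; its 3D corners live on the tag's z=0 plane.
        obj_points.append(np.array([[0, 0, 0], [tag_size, 0, 0],
                                    [tag_size, tag_size, 0], [0, tag_size, 0]],
                                   dtype=np.float32))
        img_points.append(tag.reshape(4, 2).astype(np.float32))

# Recover the intrinsic matrix K and the distortion coefficients.
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
```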
Once we've recovered the camera intrinsics, we can create a 3D scan of our object and recover the camera extrinsics. For this part, we again used ArUco tags, this time to recover the transformation between 3D world space and camera space for each capture.
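Recovering one image's extrinsics and converting them into a camera-to-world matrix then looks roughly like this, where world_corners and image_corners are hypothetical matched 3D/2D ArUco corner arrays for that image:

```python
# Pose of the camera for one capture, from matched ArUco corners.
ok, rvec, tvec = cv2.solvePnP(world_corners, image_corners, K, dist)
R, _ = cv2.Rodrigues(rvec)          # 3x3 rotation, world -> camera

# Invert the world-to-camera pose to get the camera-to-world (c2w) matrix
# that ray generation uses later.
c2w = np.eye(4)
c2w[:3, :3] = R.T
c2w[:3, 3] = (-R.T @ tvec).ravel()  # camera center in world coordinates
```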
We can then visualize our camera frustums using Viser.
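The Viser part is just a loop over the recovered poses. A rough sketch, assuming viser's add_camera_frustum interface (the exact namespace, server.scene vs. server, differs between viser versions), with c2ws, images, and K coming from the previous steps:

```python
import time
import numpy as np
import viser
import viser.transforms as tf

server = viser.ViserServer()
for i, (c2w, img) in enumerate(zip(c2ws, images)):  # recovered poses and their images
    H, W = img.shape[:2]
    server.scene.add_camera_frustum(
        f"/cameras/{i}",
        fov=2 * np.arctan2(H / 2, K[1, 1]),         # vertical FOV from the intrinsics
        aspect=W / H,
        scale=0.05,                                  # frustum size in the viewer
        wxyz=tf.SO3.from_matrix(c2w[:3, :3]).wxyz,   # rotation as a wxyz quaternion
        position=c2w[:3, 3],
    )
while True:
    time.sleep(1.0)  # keep the server alive so the scene stays viewable
```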
Part 1: Fit a Neural Field to a 2D Image
Before we jump into 3D, we decided to try a Neural Field in 2D! The Neural Field maps \((u,v)\rightarrow (r,g,b)\) using the following architecture:

Here SE is the sinusoidal positional encoding, given by \[PE(x) = \{x,\ \sin(2^0\pi x),\ \cos(2^0\pi x),\ \sin(2^1\pi x),\ \cos(2^1\pi x),\ \dots,\ \sin(2^{L-1}\pi x),\ \cos(2^{L-1}\pi x)\}.\] The hidden layer width and L are exposed as hyperparameters, so they can be set to arbitrary values. For our results we used 4 layers with a hidden layer width of 256 (and 32 later) and L=10 (and 5 later). We trained with Adam at a learning rate of 0.01 and used mean squared error as the loss function.
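A minimal PyTorch sketch of the encoding and network described above (names and structure are illustrative, not taken from any starter code):

```python
import torch
import torch.nn as nn

def positional_encoding(x, L):
    """Sinusoidal positional encoding: keeps x and appends sin/cos at L frequencies."""
    out = [x]
    for i in range(L):
        out.append(torch.sin(2**i * torch.pi * x))
        out.append(torch.cos(2**i * torch.pi * x))
    return torch.cat(out, dim=-1)

class Field2D(nn.Module):
    """MLP mapping encoded (u, v) pixel coordinates to (r, g, b)."""
    def __init__(self, L=10, hidden=256, n_layers=4):
        super().__init__()
        in_dim = 2 + 2 * 2 * L  # (u, v) plus a sin/cos pair per frequency per input
        layers = []
        for i in range(n_layers):
            layers += [nn.Linear(in_dim if i == 0 else hidden, hidden), nn.ReLU()]
        layers += [nn.Linear(hidden, 3), nn.Sigmoid()]  # rgb in [0, 1]
        self.net = nn.Sequential(*layers)
        self.L = L

    def forward(self, uv):
        return self.net(positional_encoding(uv, self.L))

model = Field2D(L=10, hidden=256, n_layers=4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()
```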
Training Results
(hidden layer width = 32, L = 10)
We also tried different hyperparameters:
Part 2: Fit a Neural Radiance Field from Multi-view Images
Helper functions
Now that we've fit a neural field in 2D, we can move on to 3D neural radiance fields. To do this, we need to sample rays from our cameras, so we created a few helper functions.

First, we needed a function to convert camera coordinates to world coordinates. Since we already have the c2w matrix, we simply multiply it by the homogeneous camera coordinates. We then implemented a function to convert pixel coordinates to camera coordinates using the following formula: \[\begin{align} \begin{bmatrix} x_c \\ y_c \\ z_c \end{bmatrix} = \mathbf{K}^{-1} s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \end{align} \] where K is the intrinsic matrix and s is the depth of the point along the optical axis.

Using the above functions, we implemented a pixel-to-ray function. It takes the u and v coordinates of a pixel, passes them to the pixel-to-camera function with s=1 to obtain camera coordinates, and then passes those to the camera-to-world function to get world coordinates. The ray origin is the translation component \(\mathbf{t}\) of the c2w matrix \[\begin{bmatrix} \mathbf{R}_{3\times3} & \mathbf{t} \\ \mathbf{0}_{1\times3} & 1 \end{bmatrix},\] i.e. the camera center in world coordinates. Using the ray origin and the world coordinates \(\mathbf{X}_w\) from earlier, we calculate the ray direction as \[\begin{align} \mathbf{r}_d = \frac{\mathbf{X}_w - \mathbf{r}_o}{\|\mathbf{X}_w - \mathbf{r}_o\|}. \end{align}\] Our function then outputs both the ray origins and directions.
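In code, these helpers look roughly like the following NumPy sketch (function and argument names are illustrative):

```python
import numpy as np

def transform(c2w, x_c):
    """Camera -> world: apply the 4x4 c2w matrix to homogeneous camera coordinates."""
    x_h = np.concatenate([x_c, np.ones((*x_c.shape[:-1], 1))], axis=-1)
    return (x_h @ c2w.T)[..., :3]

def pixel_to_camera(K, uv, s):
    """Pixel -> camera coordinates at depth s along the optical axis."""
    uv_h = np.concatenate([uv, np.ones((*uv.shape[:-1], 1))], axis=-1)
    return s * (uv_h @ np.linalg.inv(K).T)

def pixel_to_ray(K, c2w, uv):
    """Pixel -> (ray origin, unit ray direction) in world coordinates."""
    r_o = np.broadcast_to(c2w[:3, 3], (*uv.shape[:-1], 3))
    x_w = transform(c2w, pixel_to_camera(K, uv, s=1.0))
    r_d = x_w - r_o
    r_d = r_d / np.linalg.norm(r_d, axis=-1, keepdims=True)
    return r_o, r_d
```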
Sampling
Now we must sample rays from our images, just as we sampled pixels in part 1.
To do this, we flatten all pixels on all images and sample N rays globally. We randomly
select N uv coordinates and then use our previous helper functions to get the ray origin and
direction. We also return the pixel value at each sampled uv coordinate to use as the supervision target.
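A sketch of this global sampling step, reusing the pixel_to_ray helper above (the half-pixel offset is a common convention for shooting rays through pixel centers):

```python
def sample_rays(images, K, c2ws, N):
    """Sample N rays uniformly at random across all pixels of all images.

    images: (M, H, W, 3), c2ws: (M, 4, 4). Returns ray origins, directions,
    and the ground-truth pixel colors used for supervision.
    """
    M, H, W, _ = images.shape
    flat_idx = np.random.randint(0, M * H * W, size=N)
    img_idx = flat_idx // (H * W)
    v = (flat_idx % (H * W)) // W
    u = flat_idx % W
    uv = np.stack([u + 0.5, v + 0.5], axis=-1)  # shoot rays through pixel centers
    r_o = np.empty((N, 3)); r_d = np.empty((N, 3))
    for i in range(N):  # a simple loop for clarity; this can be vectorized
        r_o[i], r_d[i] = pixel_to_ray(K, c2ws[img_idx[i]], uv[i])
    return r_o, r_d, images[img_idx, v, u]
```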
After getting the rays, we need to discretize them into samples in 3D space. To do this, we first
uniformly sample n_samples linearly between values near and far.
We then perturb these values by a random amount up to the bin size to avoid overfitting.
We call these perturbed sample values t and compute the final 3D sample coordinates as
\[\mathbf{x} = \mathbf{r}_o + \mathbf{r}_d\, t\]
Below we show some of the sampled rays for one of our NeRFs.
Neural Net
For the NeRF we created an MLP with the architecture shown below.

We expose L_dir, L_coord, N_hidden_layers, and hidden_layer_size as hyperparameters, with default values matching the image above (L_dir=4, L_coord=10).
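A PyTorch sketch roughly following the standard NeRF MLP (the exact skip-connection placement may differ from the diagram above); it reuses the positional_encoding helper from Part 1:

```python
class NeRF(nn.Module):
    """MLP mapping (position, view direction) -> (rgb, density)."""
    def __init__(self, L_coord=10, L_dir=4, n_hidden_layers=8, hidden=256):
        super().__init__()
        self.L_coord, self.L_dir = L_coord, L_dir
        in_x = 3 + 3 * 2 * L_coord   # encoded 3D position
        in_d = 3 + 3 * 2 * L_dir     # encoded view direction
        self.skip = n_hidden_layers // 2  # re-inject the input halfway through
        self.layers = nn.ModuleList()
        for i in range(n_hidden_layers):
            dim_in = in_x if i == 0 else hidden + (in_x if i == self.skip else 0)
            self.layers.append(nn.Linear(dim_in, hidden))
        self.density_head = nn.Linear(hidden, 1)
        self.feature = nn.Linear(hidden, hidden)
        self.color_head = nn.Sequential(
            nn.Linear(hidden + in_d, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid())

    def forward(self, x, d):
        x_enc = positional_encoding(x, self.L_coord)
        d_enc = positional_encoding(d, self.L_dir)
        h = x_enc
        for i, layer in enumerate(self.layers):
            if i == self.skip and i != 0:
                h = torch.cat([h, x_enc], dim=-1)  # skip connection
            h = torch.relu(layer(h))
        sigma = torch.relu(self.density_head(h))                      # density >= 0
        rgb = self.color_head(torch.cat([self.feature(h), d_enc], dim=-1))
        return rgb, sigma
```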
Volume rendering
To actually render an image, we need to use volume rendering. The core volume rendering equation is as follows \[\begin{align} C(\mathbf{r})=\int_{t_n}^{t_f} T(t) \sigma(\mathbf{r}(t)) \mathbf{c}(\mathbf{r}(t), \mathbf{d}) d t, \text { where } T(t)=\exp \left(-\int_{t_n}^t \sigma(\mathbf{r}(s)) d s\right) \end{align}\] As we are computer scientists who know not of what continuous is supposed to mean, we use the discrete approximation \[\begin{align}\hat{C}(\mathbf{r})=\sum_{i=1}^N T_i\left(1-\exp \left(-\sigma_i \delta_i\right)\right) \mathbf{c}_i, \text { where } T_i=\exp \left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right) \end{align}\] where \(c_i\) is the color obtained from our network at sample location \(i\), \(T_i\) is the probability of a ray not terminating before sample location \(i\), and \(1-e^{-\sigma_i\delta_i}\) is the probability of terminating at sample location \(i\).
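A sketch of this discrete approximation in PyTorch, assuming a constant step size \(\delta_i\) along each ray:

```python
def volrend(sigmas, rgbs, step_size):
    """Composite per-sample colors along each ray via the discrete rendering sum.

    sigmas: (N_rays, n_samples, 1), rgbs: (N_rays, n_samples, 3),
    step_size: the delta_i between consecutive samples.
    """
    alphas = 1.0 - torch.exp(-sigmas * step_size)   # probability of terminating in bin i
    # T_i = prod_{j<i} (1 - alpha_j): probability the ray survives to bin i.
    T = torch.cumprod(torch.cat(
        [torch.ones_like(alphas[:, :1]), 1.0 - alphas[:, :-1]], dim=1), dim=1)
    weights = T * alphas                            # (N_rays, n_samples, 1)
    return (weights * rgbs).sum(dim=1)              # rendered colors, (N_rays, 3)
```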
Results
Below you may find the results of the NeRF on the Lego scene from the original NeRF paper.
We used the parameters L_coord=10, L_dir=4, N_hidden_layers=8, hidden_layer_size=256, batch_size=2048, N_iter=3600, ray_samps=64, near=2, far=6.
We used MSE as the loss function and the Adam optimizer with a learning rate of 5e-4.
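Putting the sketches above together (sample_rays, sample_points_along_rays, NeRF, volrend), one training iteration under these parameters looks roughly like this; images is assumed to be a float array in [0, 1], and data loading and validation rendering are omitted:

```python
model = NeRF(L_coord=10, L_dir=4, n_hidden_layers=8, hidden=256)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
step_size = (6.0 - 2.0) / 64  # (far - near) / ray_samps

for it in range(3600):
    r_o, r_d, target = sample_rays(images, K, c2ws, N=2048)          # batch of rays
    pts = sample_points_along_rays(r_o, r_d, near=2.0, far=6.0, n_samples=64)
    dirs = np.broadcast_to(r_d[:, None, :], pts.shape)               # one direction per sample
    rgb, sigma = model(torch.as_tensor(pts, dtype=torch.float32),
                       torch.as_tensor(dirs.copy(), dtype=torch.float32))
    pred = volrend(sigma, rgb, step_size)
    loss = nn.functional.mse_loss(pred, torch.as_tensor(target, dtype=torch.float32))
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```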
Part 2.6: Training with Your Own Data
We also tried training on our own custom dataset, using the parameters L_coord=10, L_dir=4, N_hidden_layers=8, hidden_layer_size=256, batch_size=4096, N_iter=3600, ray_samps=64, near=0.04, far=0.5.
We used MSE as the loss function and the Adam optimizer with a learning rate of 1e-3.
As you can see, the custom dataset does not fit as well (likely overfitting): views whose extrinsics are very close to the training set are rendered well, but even small deviations are not generated properly.