Stereo Depth Estimation

Cross-synapse stereo matching via a graph-based neural network

Understanding the use case

Stereo vision - Two cameras separated by a known baseline capture slightly different views of the same scene. Objects closer to the cameras appear shifted more between the two images (larger disparity), while distant objects appear nearly identical. By computing this disparity for every pixel, we can reconstruct a depth map of the scene without active sensors like LiDAR.
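To make the relation concrete, here is a minimal sketch of the pinhole-camera formula; the baseline and focal-length values are illustrative, not the demo's:

```python
# Pinhole stereo relation: depth = baseline * focal / disparity.
def disparity_to_depth(disparity_px, baseline_m, focal_px):
    """Convert a pixel disparity to metric depth."""
    return baseline_m * focal_px / disparity_px

# Nearer objects produce larger disparities:
print(disparity_to_depth(16.0, baseline_m=0.1, focal_px=400.0))  # 2.5 m (near)
print(disparity_to_depth(4.0, baseline_m=0.1, focal_px=400.0))   # 10.0 m (far)
```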

Mars rover application - On planetary exploration missions, stereo cameras are the primary depth sensor. LiDAR is heavy, power-hungry, and has mechanical parts that can fail. A stereo camera pair is lightweight, passive (no laser), and has no moving parts. The trade-off is that depth must be computed from the images, which requires a matching algorithm. A neural network can learn this matching end-to-end.

Low-energy alternative to LiDAR - In terrestrial robotics, stereo depth estimation consumes a fraction of the power of a spinning LiDAR sensor. For small drones, indoor robots, and battery-constrained platforms, a stereo camera pair with on-board neural inference provides dense depth at minimal energy cost.

Synthetic data - This demo generates simple 2D scenes with rectangles at known depths. The left image shows the scene directly; the right image shifts each rectangle horizontally by baseline × focal / depth. The ground truth disparity map records the exact shift per pixel, providing perfect training labels without manual annotation.
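A rough sketch of such a generator, assuming NumPy; the image size, depth range, and rectangle count here are placeholders rather than the demo's actual parameters:

```python
import numpy as np

def make_stereo_pair(size=16, baseline_focal=32.0, seed=0):
    rng = np.random.default_rng(seed)
    left = np.zeros((size, size), dtype=np.float32)
    right = np.zeros_like(left)
    disparity = np.zeros_like(left)
    for _ in range(2):                            # two rectangles at random depths
        depth = rng.uniform(4.0, 16.0)            # arbitrary depth units
        d = int(round(baseline_focal / depth))    # shift = baseline * focal / depth
        h, w = rng.integers(3, 6), rng.integers(3, 6)
        y, x = rng.integers(0, size - h), rng.integers(d, size - w)
        left[y:y + h, x:x + w] = 1.0
        right[y:y + h, x - d:x - d + w] = 1.0     # rectangle shifts left in the right view
        disparity[y:y + h, x:x + w] = d           # exact per-pixel ground truth
    return left, right, disparity
```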

Understanding cross-synapses

Dual-branch architecture - The stereo CNN has two parallel branches (left and right) that share the same convolutional kernels. This weight sharing ensures both branches extract identical features from their respective images, which is essential for meaningful comparison.
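In miniature, weight sharing just means one kernel applied to both views. A sketch assuming NumPy and SciPy, with random placeholder images:

```python
import numpy as np
from scipy.signal import convolve2d

# One shared 3x3 kernel applied to both views: identical patterns in either
# image yield identical features, which is what makes them comparable.
rng = np.random.default_rng(0)
kernel = rng.normal(size=(3, 3))
left, right = rng.random((16, 16)), rng.random((16, 16))
feat_l = convolve2d(left, kernel, mode="valid")    # 14x14 feature map
feat_r = convolve2d(right, kernel, mode="valid")   # 14x14, same weights
```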

Disparity-offset synapses - At designated cross-layers, neurons in the left branch are connected to neurons in the right branch at horizontally offset positions. For each left neuron at column c, cross-synapses connect to right neurons at columns c, c-1, c-2, ..., c-maxDisparity. These connections allow the network to compare features at different horizontal offsets, which is exactly what stereo matching requires.

Bidirectional cross-connections - Cross-synapses are bidirectional: left-to-right and right-to-left. This allows both branches to benefit from the comparison. The cross-kernel learns a weight per channel that modulates how strongly the cross-branch signal influences each neuron.
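The two paragraphs above suggest an implementation like the following left-to-right pass (the right-to-left direction is symmetric). How the demo aggregates the offset signals is not specified, so the averaging here is an assumption, and the function name is illustrative:

```python
import numpy as np

# Left-to-right cross-synapse pass on (channels, H, W) feature maps.
# cross_w holds the learned per-channel modulation weights.
def cross_synapse(feat_l, feat_r, cross_w, max_disparity):
    C, H, W = feat_l.shape
    out = feat_l.copy()
    for d in range(max_disparity + 1):
        shifted = np.zeros_like(feat_r)
        shifted[:, :, d:] = feat_r[:, :, :W - d]   # right column c-d under left column c
        # Assumption: offsets are averaged; the demo may aggregate differently.
        out += cross_w[:, None, None] * shifted / (max_disparity + 1)
    return out

out = cross_synapse(np.ones((8, 14, 14)), np.ones((8, 14, 14)),
                    cross_w=np.full(8, 0.1), max_disparity=8)
```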

Merge layer - After the convolutional layers, a merge layer combines the two branches. In Diff mode, it computes |L - R| per feature, which directly encodes the feature disparity. In Concat mode, it concatenates both feature maps and lets the dense layer learn the combination. The merged representation is flattened and passed through a dense output layer that produces the disparity map.
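A sketch of the two merge modes on (channels, height, width) feature maps; the shapes match the Tiny preset described below, and the function name is illustrative:

```python
import numpy as np

def merge(feat_l, feat_r, mode="diff"):
    if mode == "diff":
        return np.abs(feat_l - feat_r)             # per-feature disparity signal
    if mode == "concat":
        return np.concatenate([feat_l, feat_r])    # stack along the channel axis
    raise ValueError(mode)

# The merged map is flattened for the dense output layer:
merged = merge(np.zeros((8, 14, 14)), np.ones((8, 14, 14)), mode="concat")
flat = merged.reshape(-1)   # 2 * 8 * 14 * 14 = 3136 values into the dense layer
```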

Understanding the configuration

Preset - Selects the network architecture. Tiny uses a 16×16 input with one conv layer (8 filters, 3×3) and maxDisparity=8. The output is a 14×14 disparity map. Small uses 32×32 input with two conv layers (8 and 16 filters) and maxDisparity=16, producing a 28×28 disparity map. Larger presets have more cross-synapses and take significantly longer to train.
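Expressed as data, the two presets might look like this; the key names are assumptions, while the numbers come from the descriptions above:

```python
PRESETS = {
    "tiny":  {"input": 16, "conv_filters": [8],     "kernel": 3, "max_disparity": 8},
    "small": {"input": 32, "conv_filters": [8, 16], "kernel": 3, "max_disparity": 16},
}

# Each valid 3x3 conv reduces each spatial dimension by 2, which explains the
# output sizes: 16 - 2 = 14 (Tiny, one layer); 32 - 2 - 2 = 28 (Small, two layers).
```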

Epochs - Number of complete passes through the training set. The stereo matching problem benefits from more epochs because the network must learn both feature extraction and cross-branch comparison simultaneously. Start with 5 and increase if the loss curve has not flattened.

Learning rate - Controls the weight update step size. The default of 0.003 works well with the Adam optimizer. If training is unstable (the loss jumps up rather than decreasing), reduce it to 0.001. The cross-synapse weights and shared conv kernels are updated together, so the learning rate affects both feature extraction and matching.

Samples - Number of synthetic stereo pairs to generate. Each sample includes a left image, a right image, and a ground truth disparity map. 20% of the samples are held out for testing. More samples provide better coverage of depth configurations.

Understanding the results

Disparity map - The network outputs a 2D map where each pixel value represents the horizontal shift (in pixels) between the left and right images at that location. Higher disparity means the object is closer to the cameras. The disparity values are normalized to [0, 1] where 1 corresponds to maxDisparity pixels of shift.

Good results - A well-trained network produces disparity maps where foreground objects (rectangles) appear as bright regions with sharp boundaries, and the background is uniformly dark. The predicted map should closely match the ground truth, with MSE below 0.05 and RMSE below 0.22.

Bad results - Common failure modes include: (1) uniform gray output where the network ignores depth differences entirely; (2) blurred edges where object boundaries are smoothed out; (3) incorrect depth ordering where near objects appear far. These usually indicate insufficient training epochs or too few samples.

MSE and RMSE - Mean Squared Error and Root Mean Squared Error between predicted and ground truth disparity maps. Since disparity values are normalized to [0, 1], RMSE can be interpreted as the average per-pixel error on that scale.
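A sketch of the two metrics on normalized disparity maps, with a worked example that lands just inside the "good result" thresholds quoted above:

```python
import numpy as np

def disparity_errors(pred, truth):
    """MSE and RMSE over normalized [0, 1] disparity maps."""
    mse = float(np.mean((pred - truth) ** 2))
    return mse, mse ** 0.5

# A uniform error of 0.2 gives MSE ~0.04 and RMSE ~0.2 -- inside the
# MSE < 0.05 and RMSE < 0.22 thresholds.
pred, truth = np.full((14, 14), 0.7), np.full((14, 14), 0.5)
print(disparity_errors(pred, truth))   # (~0.04, ~0.2)
```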

Inference time - Time to run the forward pass on all test samples. The stereo network processes two images simultaneously through separate branches, so inference involves roughly twice the computation of a single-branch CNN of similar depth.
