Why compress LiDAR data? - In autonomous navigation, a LiDAR sensor produces dense 3D point clouds that are projected onto a 2D occupancy grid. A typical grid at 64×64 resolution with 6 channels represents 24,576 values per frame. This is too large to feed directly into a navigation controller, especially when fused with other sensors (IMU, wheel odometry, GPS). An autoencoder compresses this grid into a compact latent vector (e.g., 64 dimensions - a 384× compression) that preserves the essential spatial structure.
Encoder-only inference - The autoencoder has two halves: an encoder (input → latent) and a decoder (latent → reconstructed output). The decoder is only needed during training to force the encoder to learn meaningful representations. In production, the decoder is discarded and only the encoder runs, producing a small latent vector that serves as input for the navigation policy.
Sensor fusion pipeline - The latent vector from the LiDAR encoder is concatenated with other sensor inputs (IMU acceleration/gyroscope, wheel speed/steering, GPS heading) and fed into a small navigation MLP that outputs driving commands. The autoencoder acts as a learned perception module that reduces raw sensor data to a manageable representation.
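As a rough illustration of this fusion step (the sensor field names and the navigation policy are placeholders, not part of this sample's API), fusing can be as simple as concatenating the vectors:

// Hypothetical shapes for the non-LiDAR sensors; the real fusion code and
// navigation policy live outside this sample.
interface ImuReading { ax: number; ay: number; az: number; gx: number; gy: number; gz: number; }
interface WheelReading { speed: number; steeringAngle: number; }

function fuseSensors(
  latent: Float64Array,        // compressed LiDAR scene from the encoder
  imu: ImuReading,
  wheels: WheelReading,
  gpsHeading: number,
): Float64Array {
  // Concatenate the latent vector with the low-dimensional sensor readings.
  return Float64Array.from([
    ...latent,
    imu.ax, imu.ay, imu.az, imu.gx, imu.gy, imu.gz,
    wheels.speed, wheels.steeringAngle,
    gpsHeading,
  ]);
}
// The fused vector (64 + 9 values in this sketch) is the input to the navigation MLP.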
Synthetic data - This demo uses procedurally generated LiDAR grids with 5 scene types (straight road, curved road, intersection, obstacles, empty field). Each scene produces characteristic spatial patterns with Gaussian noise for realism. In a real system, grids would be projected from actual LiDAR point clouds captured in simulation (e.g., CARLA, Isaac Sim) or from physical sensors.
Why graph-based? - SpikyPanda represents every neuron and synapse as an explicit object in a traversable graph. This is slower than tensor-based frameworks (PyTorch, TensorFlow) but enables structural mutations (neuroevolution), heterogeneous neuron types (rate-coded + spiking in the same network), and 3D visualization of the network topology. The autoencoder validates that this graph-based runtime produces correct results on a well-understood problem.
C0 - Point density - The number of LiDAR returns per grid cell, normalized to [0, 1]. High density means the cell is well-observed (solid surface, close range). Low density means the cell is sparse (far away, occluded, or empty space). This channel acts as a confidence measure: cells with near-zero density carry little information.
C1 - Z max (maximum height) - The highest point measured in the cell. Tall values indicate obstacles (walls, vehicles, poles), while low values indicate ground-level surfaces (road, sidewalk). This is the primary channel for obstacle detection.
C2 - Z min (minimum height) - The lowest point measured in the cell. Combined with Z max, it reveals the vertical extent of objects. A large gap between Z min and Z max indicates a tall object. Low Z min with low Z max indicates flat ground. Negative Z min may indicate curbs, ditches, or below-grade features.
C3 - Std(z) (height variance) - The standard deviation of height values within the cell. This is the most discriminative channel for navigation: vegetation (trees, bushes) produces high variance because leaves and branches return at many different heights, while solid walls produce low variance despite similar Z max values. A navigation system can use this to distinguish traversable vegetation from non-traversable solid obstacles.
C4 - Reflectivity - The mean surface reflectivity (intensity of the LiDAR return). Different materials have characteristic reflectivity: asphalt produces relatively dark returns, grass and soil fall in the mid range, and retroreflective signs and lane markings are extremely bright. This channel helps identify surface type without relying on geometry alone.
C5 - Velocity / dynamic flag - Indicates whether points in the cell belong to a moving object. A value of 0 means static; values above 0 encode relative speed. This channel is critical for navigation: moving objects (pedestrians, vehicles) require different planning strategies than static obstacles. Not all LiDAR systems provide this information natively; it may be computed by comparing successive scans.
Resolution - The spatial size of the LiDAR grid (width × height). A 16×16 grid has 1,536 input values (16×16×6 channels) and creates a manageable graph for browser training. A 32×32 grid has 6,144 inputs - 4× more neurons and synapses, so training is significantly slower. In production, 64×64 is typical but requires server-side training.
Preset - Selects the encoder architecture depth. Tiny uses a single convolutional layer (Conv 8 filters, 3×3, Same padding → MaxPool 2×2) and is recommended for quick experiments. Small adds a second convolutional layer (Conv 16 filters) for better feature extraction at the cost of more neurons and longer training. Deeper presets capture more complex spatial patterns but may overfit on simple synthetic data.
Latent dim - The number of dimensions in the compressed representation. A smaller latent (e.g., 32) forces more aggressive compression but may lose detail. A larger latent (e.g., 128) preserves more information but provides less compression. Our benchmarks show that 64 dimensions achieve information saturation on 16×16 grids - going to 128 doubles training time with no improvement in MSE.
Epochs - The number of complete passes through the training dataset. More epochs allow the optimizer to refine weights further, but with diminishing returns. The loss curve panel shows whether additional epochs would be beneficial: if the curve has flattened, more epochs won't help much.
Learning rate - Controls how large each weight update step is. Too high (> 0.01) causes oscillations where the loss bounces up and down. Too low (< 0.0005) causes very slow convergence. The default of 0.003 works well with the Adam optimizer for this architecture. If you see unstable training, try reducing it.
Samples - The number of synthetic LiDAR grids to generate for training. More samples provide better coverage of scene variations but increase training time proportionally. 100 samples (20 per scene type) is sufficient for the synthetic data distribution used here. 20% of this count is automatically set aside as a separate test set.
Channel loss weights - Controls how much each channel contributes to the training loss. Default is 1 for all channels (uniform MSE). Increasing a channel's weight forces the optimizer to prioritize its reconstruction. For example, setting C1 (Z max) to 3 and C5 (Velocity) to 5 penalizes errors on those channels 3x and 5x respectively. Use this when sparse but critical channels (obstacles, moving objects) are poorly reconstructed. Note: the reported MSE will increase because errors are amplified - compare reconstruction quality visually, not by absolute MSE. See "Known limitations" for why weighted loss alone may not solve sparse channel reconstruction.
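For intuition, this is roughly what a per-channel weighted MSE computes on a CHW-flattened grid (a sketch of the idea; the library's WeightedChannelLoss may differ in detail):

// Per-channel weighted MSE on a CHW-flattened grid (illustrative sketch).
function weightedChannelMse(
  target: Float64Array, output: Float64Array,
  weights: number[], width: number, height: number,
): number {
  const cellsPerChannel = width * height;
  let sum = 0;
  for (let i = 0; i < target.length; i++) {
    const channel = Math.floor(i / cellsPerChannel); // CHW layout: channel-major
    const diff = target[i] - output[i];
    sum += weights[channel] * diff * diff;
  }
  return sum / target.length;
}
// weightedChannelMse(gt, pred, [1, 3, 1, 1, 1, 5], 16, 16) penalizes
// Z max errors 3x and Velocity errors 5x, matching the example above.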
Synthetic data (default) - When no external dataset is loaded, the sample generates synthetic LiDAR grids procedurally. Five scene types are created: straight road, curved road, intersection, obstacles, and empty field. Each scene produces realistic channel patterns (dense road surface, sparse obstacles with Z max peaks, occasional velocity flags). This is useful for validating the architecture but does not capture the complexity of real sensor data.
Real data (recommended for validation) - For meaningful benchmarks, use real LiDAR data. The sample automatically tries to load data/lidar/train.json and data/lidar/test.json on startup. If found, real data is used instead of synthetic. The expected JSON format is:
{ "width": 64, "height": 64, "channels": 6, "count": 100,
"channelNames": ["Density","Z_max","Z_min","Std_z","Reflectivity","Velocity"],
"samples": [{ "pixels": [0.8, 0.5, ...] }] }
// pixels: CHW flat array [C0_row0_col0, C0_row0_col1, ..., C1_row0_col0, ...]
// values normalized to [0, 1]
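A matching TypeScript shape and a CHW index helper (illustrative only; these types are not exported by the sample):

// Shape of the JSON file described above, with field names as documented.
interface LidarDataset {
  width: number;
  height: number;
  channels: number;
  count: number;
  channelNames: string[];
  samples: { pixels: number[] }[];
}

// CHW addressing: all of channel 0 first, then channel 1, and so on.
function pixelIndex(c: number, row: number, col: number, width: number, height: number): number {
  return c * width * height + row * width + col;
}

// Example: Z max (C1) at row 12, col 8 of a 64x64 grid:
// const zMax = sample.pixels[pixelIndex(1, 12, 8, 64, 64)];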
nuScenes (with velocity) - The recommended real dataset. nuScenes mini (~4 GB, ~400 frames) is one of the few public datasets that provide per-object velocity annotations, which map directly to channel C5. Use the provided projection script:
pip install nuscenes-devkit numpy pyquaternion

python packages/dev/tools/python/prepare_nuscenes.py \
  --dataroot /path/to/nuscenes \
  --version v1.0-mini \
  --output packages/host/www/data/lidar \
  --grid-size 64 --max-samples 100
KITTI or raw point clouds - If you have .bin (KITTI format) or .pcd files, the same script can project them directly. Note: KITTI does not provide velocity, so C5 will be zero everywhere.
python packages/dev/tools/python/prepare_nuscenes.py \
  --raw-dir /path/to/pointclouds/ \
  --output packages/host/www/data/lidar \
  --grid-size 64
Projection parameters - The script projects 3D points into a top-down (Bird's Eye View) grid. Default range is -50m to +50m in both X and Y axes, with Z filtered between -3m and +5m. Each cell accumulates point statistics to produce the 6 channels. Adjust --x-range and --y-range for your sensor's field of view.
Grid size tradeoff - A 64x64 grid with 100m range gives ~1.56m per cell. A 128x128 grid gives ~0.78m per cell (more spatial detail but 4x more computation). For initial experiments, 64x64 is the standard in autonomous driving research.
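To make the projection concrete, here is a minimal sketch of the bird's-eye-view binning for a few of the channels (the real prepare_nuscenes.py script also computes Std(z), reflectivity, velocity, and normalization; the details below are assumptions):

// Bird's-eye-view binning of 3D points into a top-down grid (sketch only).
interface Point { x: number; y: number; z: number; }

function projectToGrid(points: Point[], gridSize = 64, range = 50) {
  const cellSize = (2 * range) / gridSize;              // e.g. 100 m / 64 ≈ 1.56 m per cell
  const n = gridSize * gridSize;
  const density = new Float64Array(n);
  const zMax = new Float64Array(n).fill(-Infinity);
  const zMin = new Float64Array(n).fill(+Infinity);

  for (const p of points) {
    if (p.z < -3 || p.z > 5) continue;                   // Z filter from the defaults above
    const col = Math.floor((p.x + range) / cellSize);
    const row = Math.floor((p.y + range) / cellSize);
    if (col < 0 || col >= gridSize || row < 0 || row >= gridSize) continue;
    const i = row * gridSize + col;
    density[i] += 1;
    zMax[i] = Math.max(zMax[i], p.z);
    zMin[i] = Math.min(zMin[i], p.z);
  }
  // Remaining steps (not shown): reset ±Infinity in empty cells, normalize each
  // channel to [0, 1], and accumulate Std(z), reflectivity, and velocity.
  return { density, zMax, zMin };
}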
Original vs. Reconstructed - Each test sample is shown as two rows of 6 channel heatmaps: the original input on top, and the autoencoder's reconstruction below. Visually similar rows mean the encoder captured the essential spatial information in its latent vector.
Per-sample MSE - Displayed next to each reconstructed sample. Lower values mean better reconstruction for that specific scene. Obstacle scenes typically show higher MSE than simple road or empty scenes because randomly placed objects create high-frequency spatial features that are harder to compress.
Latent vector - The compressed representation of each input grid. Each bar shows one dimension of the latent space. Blue bars are positive values, red bars are negative. Different scene types (road, intersection, obstacles) should produce visibly different latent patterns - this is what the encoder learns.
Channel heatmaps - Each channel is rendered with a distinct color tint: cyan for density, orange for Z max, green for Z min, yellow for height variance, gray for reflectivity, red for velocity. Bright pixels indicate high values, dark pixels indicate low values.
MSE (Mean Squared Error) - Measures how different the reconstructed output is from the original input. It is the average of squared differences across all pixels and channels: MSE = (1/N) × ∑(input[i] - output[i])². Because differences are squared, larger errors are penalized more heavily. A value of 0 means perfect reconstruction. Typical good values are below 0.01.
RMSE (Root Mean Squared Error) - The square root of MSE: RMSE = √MSE. Since pixel values are normalized to [0, 1], RMSE can be interpreted as the average error per pixel on that scale. For example, RMSE = 0.09 means each pixel is off by about 9% on average.
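A direct implementation of the two formulas above:

// MSE and RMSE over a flattened grid (all pixels, all channels).
function mse(input: Float64Array, output: Float64Array): number {
  let sum = 0;
  for (let i = 0; i < input.length; i++) {
    const diff = input[i] - output[i];
    sum += diff * diff;
  }
  return sum / input.length;
}

const rmse = (a: Float64Array, b: Float64Array) => Math.sqrt(mse(a, b));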
Inference Time - Total time to run the forward pass on all test samples. This includes both the encoder (input → latent vector) and the decoder (latent → reconstructed output). In production, only the encoder is used, so real inference time would be roughly half.
Loss curve - Shows MSE per epoch during training. A healthy training curve decreases rapidly at first, then flattens. If the loss oscillates or increases, try lowering the learning rate. If it plateaus too high, the model may need more capacity (try a larger preset or more epochs).
Channel weights - Multiplies the loss contribution of each channel. Setting C1 (Z max) to 3 and C5 (Velocity) to 5 forces the optimizer to prioritize those channels. Note that the reported MSE will be higher because errors are amplified by the weights - this is expected. Compare visually, not by absolute MSE value.
Standard MSE treats all pixels equally. In a 16x16 grid with 6 channels, that is 1,536 pixels. If only 20 of them contain obstacle data (Z max, Velocity), those 20 pixels represent just 1.3% of the total. A model that reconstructs the 1,516 empty pixels perfectly but misses every obstacle would still report an excellent MSE. These sparse metrics expose what MSE hides.
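A quick back-of-the-envelope check of this bias (the numbers follow directly from the definitions above, not from a measured run):

// Worked example of the "empty space bias": zeroing out every obstacle pixel
// still reports a "good" global MSE.
const totalPixels = 16 * 16 * 6;    // 1,536
const obstaclePixels = 20;          // ground-truth value 0.8, everything else 0.0

// Model output: all zeros -> each obstacle pixel is off by 0.8, the rest are perfect.
const globalMse = (obstaclePixels * 0.8 * 0.8) / totalPixels;  // ≈ 0.008
// Below the 0.01 "good" guideline from the MSE section, yet 100% of obstacles were lost.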
Sparse Recall - Of all real obstacle pixels (ground truth above threshold), what fraction did the model reconstruct above threshold? A recall of 0.80 means 80% of obstacles were detected. Low recall means the model is missing obstacles - critical for navigation safety. Formula: Recall = TruePositive / TotalSparseGT
▲ Higher is better. Range [0, 1]. Target: > 0.7. Below 0.5 means most obstacles are lost.
Sparse Precision - Of all pixels the model predicted as obstacles, what fraction were real? A precision of 0.90 means 90% of predicted obstacles exist in reality. Low precision means the model is hallucinating obstacles - causes unnecessary avoidance maneuvers. Formula: Precision = TruePositive / TotalPredicted
▲ Higher is better. Range [0, 1]. Target: > 0.7. Below 0.5 means more than half of predicted obstacles are false alarms.
Sparse F1 - The harmonic mean of Recall and Precision. This is the headline metric for sparse channel quality. F1 penalizes models that sacrifice either detection (recall) or accuracy (precision). A model must both find real obstacles AND avoid false alarms. Formula: F1 = 2 * Recall * Precision / (Recall + Precision)
▲ Higher is better. Range [0, 1]. Target: > 0.6. A value of 0 means complete failure. Above 0.8 is excellent.
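The three metrics in code form, for one channel (the threshold value is an assumption; the sample's actual threshold is not documented here):

// Sparse Recall / Precision / F1 on one channel, following the formulas above.
function sparsePrf(gt: Float64Array, pred: Float64Array, threshold = 0.2) {
  let tp = 0, gtPos = 0, predPos = 0;
  for (let i = 0; i < gt.length; i++) {
    const g = gt[i] > threshold;   // real obstacle pixel
    const p = pred[i] > threshold; // predicted obstacle pixel
    if (g) gtPos++;
    if (p) predPos++;
    if (g && p) tp++;
  }
  const recall = gtPos > 0 ? tp / gtPos : 0;
  const precision = predPos > 0 ? tp / predPos : 0;
  const f1 = recall + precision > 0 ? (2 * recall * precision) / (recall + precision) : 0;
  return { recall, precision, f1 };
}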
Sparse MSE - MSE computed only on pixels where the ground truth is above threshold. This removes the "empty space bias" where thousands of zero-valued pixels dominate the error. A global MSE of 0.005 might hide a sparse MSE of 0.15 on obstacle pixels.
▼ Lower is better. Range [0, +inf]. Compare relative to global MSE: if sparse MSE >> global MSE, the model is failing on obstacles specifically.
Energy Retention Ratio (ERR) - Measures how much signal amplitude the model preserves on obstacle pixels. Formula: ERR = sum(pred^2 * mask) / sum(gt^2 * mask). An ERR of 1.0 means the energy is perfectly preserved. An ERR of 0.3 means 70% of the obstacle signal energy was lost - the model is flattening obstacle peaks. CNN autoencoders typically show low ERR because convolution + pooling smooths out sharp features.
▲ Closer to 1.0 is better. Range [0, +inf]. Target: 0.7 - 1.3. Below 0.5 means severe signal loss. Above 1.5 means the model is amplifying noise.
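A direct implementation of the ERR formula (the threshold defining the sparse mask is an assumption):

// Energy Retention Ratio over the sparse mask (ground truth above threshold).
function energyRetentionRatio(gt: Float64Array, pred: Float64Array, threshold = 0.2): number {
  let predEnergy = 0, gtEnergy = 0;
  for (let i = 0; i < gt.length; i++) {
    if (gt[i] <= threshold) continue;   // mask: only obstacle pixels count
    predEnergy += pred[i] * pred[i];
    gtEnergy += gt[i] * gt[i];
  }
  return gtEnergy > 0 ? predEnergy / gtEnergy : 0;
}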
Top-K Hit Rate - Takes the K highest-value pixels in the ground truth (the most prominent obstacles) and checks if they are also among the K highest in the reconstruction. This is the most intuitive metric: "did the model find the biggest obstacles?" A hit rate of 0.90 means 90% of the most important pixels were correctly identified as important.
▲ Higher is better. Range [0, 1]. Target: > 0.6. A random model would score ~K/N (near 0). Above 0.8 means the model reliably identifies the most critical pixels.
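A sketch of the computation (the value of K used by the sample is not documented; 16 is just an example):

// Top-K Hit Rate: how many of the K strongest ground-truth pixels are also
// among the K strongest reconstructed pixels.
function topKHitRate(gt: Float64Array, pred: Float64Array, k = 16): number {
  const topIndices = (v: Float64Array) =>
    Array.from(v.keys()).sort((a, b) => v[b] - v[a]).slice(0, k);
  const predTop = new Set(topIndices(pred));
  const hits = topIndices(gt).filter((i) => predTop.has(i)).length;
  return hits / k;
}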
Contrast Preservation - Measures how well the model preserves the local sharpness of sparse features. For each obstacle pixel, computes the contrast (pixel value minus mean of its 8 spatial neighbors) in both the ground truth and the reconstruction. The ratio pred_contrast / gt_contrast tells if the peak stands out as sharply as in the original. A CNN that blurs a 3x3 obstacle into a 5x5 smooth bump will have low contrast (the peak is spread across neighbors). An attention-based model that reconstructs a sharp peak will score closer to 1.0. This is the key metric that distinguishes "obstacle detected" (F1) from "obstacle shape preserved" (Contrast).
▲ Closer to 1.0 is better. Range [0, 2]. Target: > 0.5. Below 0.3 means severe blurring - the model detects obstacles but destroys their shape. Above 0.8 means sharp reconstruction.
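A sketch of the computation (border handling, the threshold, and the exact averaging are assumptions):

// Contrast preservation on obstacle pixels: local peak height (pixel minus the
// mean of its 8 neighbours) in the reconstruction vs. the ground truth.
function contrastPreservation(
  gt: Float64Array, pred: Float64Array,
  width: number, height: number, threshold = 0.2,
): number {
  const localContrast = (v: Float64Array, row: number, col: number): number => {
    let sum = 0, count = 0;
    for (let dr = -1; dr <= 1; dr++) {
      for (let dc = -1; dc <= 1; dc++) {
        if (dr === 0 && dc === 0) continue;
        const r = row + dr, c = col + dc;
        if (r < 0 || r >= height || c < 0 || c >= width) continue;
        sum += v[r * width + c];
        count++;
      }
    }
    return v[row * width + col] - sum / count;
  };

  let ratioSum = 0, n = 0;
  for (let row = 0; row < height; row++) {
    for (let col = 0; col < width; col++) {
      const i = row * width + col;
      if (gt[i] <= threshold) continue;          // only obstacle pixels
      const gtC = localContrast(gt, row, col);
      if (Math.abs(gtC) < 1e-9) continue;        // skip flat ground-truth regions
      ratioSum += localContrast(pred, row, col) / gtC;
      n++;
    }
  }
  return n > 0 ? ratioSum / n : 0;
}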
Channel classification - Channels are automatically classified as SPARSE (enough above-threshold pixels) or DENSE (mostly uniform values). Sparse metrics are only computed for sparse channels. Typically Z max (C1) and Velocity (C5) are sparse; Density (C0), Z min (C2), Std(z) (C3), and Reflectivity (C4) are dense.
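One plausible classification rule (the sample's actual cutoffs are not documented; the values below are assumptions):

// A channel is SPARSE when only a small fraction of its pixels exceed the threshold.
function isSparseChannel(channel: Float64Array, threshold = 0.2, maxActiveFraction = 0.25): boolean {
  const active = channel.filter((v) => v > threshold).length;
  return active > 0 && active / channel.length <= maxActiveFraction;
}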
Interpreting the comparison - When comparing architectures (CNN vs ViT vs SAT), look at Sparse F1 and ERR, not global MSE. Two models can have similar MSE but very different sparse F1. The model with higher sparse F1 preserves obstacle information better, which is what matters for navigation. This is why attention-based architectures (ViT MAE, SAT) outperform CNN on these metrics despite comparable or even higher global MSE.
Sparse high-frequency features are poorly reconstructed - Channels C1 (Z max) and C5 (Velocity) contain sparse, localized signals: a few pixels with high values on a near-zero background. The autoencoder's bottleneck (Conv + Pool + Flatten + Dense) destroys spatial precision. After pooling, a 3x3 pixel obstacle becomes ~1.5 pixels. After flattening to a 64-128 dimensional vector, the exact position is lost entirely. The decoder must reconstruct position from a global vector, which produces blurred approximations instead of sharp localized features.
Weighted loss does not solve the problem - Increasing the loss weight on sparse channels (e.g., C5 Velocity x10) forces the optimizer to focus on those channels, but the architecture itself cannot preserve pixel-level spatial detail through the bottleneck. The network lacks the structural capacity to encode "there is a 3x3 obstacle at position (12, 8)" in a flat latent vector. Weighted loss produces higher overall MSE without improving sparse channel reconstruction.
This is a known limitation of vanilla autoencoders - The same issue exists in tensor-based frameworks (PyTorch, TensorFlow). Standard solutions include: (1) Skip connections (U-Net) - the decoder receives feature maps from the encoder at each resolution, bypassing the bottleneck for spatial details; (2) Strided convolutions instead of MaxPool, which preserve more spatial information; (3) Separate detection pipeline - use the autoencoder for global scene understanding and a dedicated object detector for localized obstacles.
For navigation, this may be acceptable - The latent vector does not need to perfectly reconstruct every obstacle pixel. It needs to encode "there are obstacles in this region" as a pattern in the latent space. Different obstacle configurations produce different latent vectors, which is sufficient for a downstream navigation policy to learn appropriate responses. The visual reconstruction is a diagnostic tool, not the end goal.
Build and train the autoencoder using @spiky-panda/core:
import {
buildAutoencoderFromPreset,
AutoencoderBuilder,
CnnInferenceRuntime,
CnnTrainingRuntime,
LossFunctions,
WeightedChannelLoss,
Optimizers,
} from "@spiky-panda/core";
// 1. Build autoencoder from preset
const result = buildAutoencoderFromPreset("tiny", {
width: 16,
height: 16,
channels: 6,
latentDim: 64,
});
// result.autoencoder - full encoder+decoder graph (for training)
// result.encoder - encoder-only graph (for inference)
// 2. Or build a custom architecture
const custom = new AutoencoderBuilder({
inputWidth: 32,
inputHeight: 32,
inputChannels: 6,
latentDim: 128,
})
.addConvLayer({ filters: 16, kernelSize: 3, padding: "same" })
.addPoolLayer({ type: "max", size: 2 })
.addConvLayer({ filters: 32, kernelSize: 3, padding: "same" })
.addPoolLayer({ type: "max", size: 2 })
.build();
// 3. Train with standard MSE loss
const ae = result.autoencoder;
const runtime = new CnnInferenceRuntime(ae);
const trainer = new CnnTrainingRuntime(
ae, runtime,
LossFunctions.MSE,
0.003, // learning rate
Optimizers.Adam()
);
// 4. Or use weighted loss to prioritize specific channels
// [density, z_max, z_min, std_z, reflectivity, velocity]
const weightedLoss = new WeightedChannelLoss(
LossFunctions.MSE,
[1, 3, 1, 1, 1, 5], // boost z_max x3, velocity x5
16, 16 // grid width, height
);
const trainerW = new CnnTrainingRuntime(
ae, runtime, weightedLoss, 0.003, Optimizers.Adam()
);
// 5. Training loop (target = input for autoencoder)
for (let epoch = 0; epoch < 10; epoch++) {
let totalLoss = 0;
for (const sample of trainData) {
totalLoss += trainer.trainStep(sample.pixels, sample.pixels);
}
console.log(`Epoch ${epoch + 1} - MSE: ${(totalLoss / trainData.length).toFixed(6)}`);
}
// 6. Sync weights to encoder after training
AutoencoderBuilder.syncWeights(result);
// 7. Inference with encoder only
const encRuntime = new CnnInferenceRuntime(result.encoder);
const latent = encRuntime.run(inputPixels);
// latent is a Float64Array of 64 values - ready for navigation fusion