MNIST CNN

Train a convolutional neural network on handwritten digits

Understanding the use case

What is MNIST? - The MNIST dataset is a collection of 70,000 handwritten digit images (0-9), each 28x28 pixels in grayscale. It is one of the most widely used benchmarks for validating image classification models. If a model works on MNIST, its basic mechanics (convolution, backpropagation, gradient descent) are correct.

Why CNN? - Convolutional Neural Networks exploit spatial structure in images. Instead of connecting every pixel to every neuron (like an MLP), a CNN slides small filters (kernels) across the image, detecting local patterns like edges, curves, and corners. This dramatically reduces the number of parameters while improving accuracy on image tasks.
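The parameter savings can be made concrete with a quick count - a dense layer producing the same 26x26x8 feature map that a Conv(8, 3x3) layer produces, versus the 72 shared filter weights:

```javascript
// Dense vs. convolutional parameter count on a 28x28 image.
// A 3x3 conv with 8 filters (no padding) yields a 26x26x8 feature map.
const pixels = 28 * 28;                   // 784 input pixels
const featureMap = 26 * 26 * 8;           // 5,408 output units
const denseWeights = pixels * featureMap; // fully connected: 4,239,872 weights
const convWeights = 8 * 3 * 3;            // convolution: 72 shared weights
```

Each filter's 9 weights are reused at every position, which is what pushes the network toward position-independent patterns like edges and curves.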

Graph-based CNN - In SpikyPanda, each pixel is an input neuron, each filter position creates a neuron, and each filter weight creates a synapse. A Conv(8, 3x3) layer on a 28x28 input creates 5,408 neurons (one per filter position: 26 x 26 x 8) and over 100,000 synapses. This explicit graph structure is slower than tensor-based frameworks but allows inspection, mutation, and 3D visualization of the network.
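The counts can be reproduced with a small helper (assuming a valid, unpadded convolution; attributing the synapses beyond the conv layer's own 48,672 to a dense 10-way classifier head is our reading - the demo doesn't break the total down):

```javascript
// Graph size of a valid (unpadded) Conv(filters, k x k) layer on an
// h x w input: one neuron per filter position, one synapse per filter
// weight at each position.
function convGraphSize(h, w, filters, k) {
  const outH = h - k + 1, outW = w - k + 1;
  const neurons = outH * outW * filters; // 26 * 26 * 8 = 5,408
  const synapses = neurons * k * k;      // each neuron reads a k x k patch
  return { neurons, synapses };
}

const conv = convGraphSize(28, 28, 8, 3);     // { neurons: 5408, synapses: 48672 }
const classifierSynapses = conv.neurons * 10; // dense head: 54,080
// 48,672 + 54,080 = 102,752 synapses - past the 100,000 mark.
```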

Training flow - The demo loads pre-prepared MNIST data (500 training + 200 test samples), builds a CNN graph from the selected preset, trains it using backpropagation with Adam optimizer and cross-entropy loss, then evaluates accuracy on the test set with a visual grid showing each prediction.

Understanding the configuration

CNN Architecture - Three CNN presets are available. Fast uses a single Conv(4, 3x3) layer - minimal but quick, typically reaching ~85-92% accuracy. Balanced uses Conv(8, 5x5) with larger filters that capture more context, typically ~92-96%. Accuracy uses two conv layers (16 + 32 filters) for the best results (~96-99%) but creates 20K+ neurons and trains much slower in the browser.

ViT Architecture - Two presets reproduce the Vision Transformer (ViT) architecture in a graph-based framework. The image is split into 7x7 patches (16 tokens for 28x28), each projected into an embedding space. Self-attention allows every patch to relate to every other patch from the first layer. ViT Tiny uses embed=64, 4 heads, 2 blocks. ViT Small uses embed=128, 4 heads, 4 blocks. Note: ViTs typically require more data than CNNs - accuracy may be lower on MNIST with only 500 training samples.
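The patch arithmetic is straightforward - a sketch of how a 28x28 image becomes 16 tokens:

```javascript
// Non-overlapping patch tokenization for a ViT.
function patchTokens(imgSize, patchSize) {
  const grid = imgSize / patchSize;       // 28 / 7 = 4 patches per side
  const tokens = grid * grid;             // 4 * 4 = 16 tokens
  const patchDim = patchSize * patchSize; // 49 pixels per patch, then
  return { tokens, patchDim };            // projected to the embed dim
}

const { tokens } = patchTokens(28, 7);
// Self-attention relates every token to every other from the first
// block: tokens * tokens = 256 attention weights per head.
```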

Epochs - Number of full passes through the training data. More epochs generally improve accuracy but with diminishing returns. 3-5 epochs is usually sufficient for MNIST with these architectures.

Learning rate - Controls step size during weight updates. The default 0.005 works well with Adam. If accuracy plateaus at a low value, try reducing it. If training is too slow, try increasing it slightly.
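For reference, a single Adam update for one weight looks like this (the betas and epsilon below are the standard defaults; the demo doesn't state which values it uses):

```javascript
// One Adam step for a single weight with the default learning rate 0.005.
function adamStep(w, grad, state, lr = 0.005, b1 = 0.9, b2 = 0.999, eps = 1e-8) {
  state.t += 1;
  state.m = b1 * state.m + (1 - b1) * grad;        // running mean of gradients
  state.v = b2 * state.v + (1 - b2) * grad * grad; // running mean of squared gradients
  const mHat = state.m / (1 - b1 ** state.t);      // bias correction for the
  const vHat = state.v / (1 - b2 ** state.t);      // zero-initialized moments
  return w - lr * mHat / (Math.sqrt(vHat) + eps);
}

const state = { m: 0, v: 0, t: 0 };
const w = adamStep(1.0, 0.5, state); // first step moves by ~lr: 1.0 -> ~0.995
```

Because the update is normalized by the gradient's running magnitude, the first step has size roughly lr regardless of gradient scale, which is why one small constant like 0.005 works across all layers.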


Understanding the results

Accuracy - The percentage of test images correctly classified. For MNIST, state-of-the-art tensor frameworks achieve 99.7%+. Our graph-based CNN reaches 85-99% depending on preset and epochs, which validates that the implementation is mathematically correct.

Cross-entropy loss - The training loss function. It measures how far the network's output probabilities are from the correct answer. Lower loss means the network is more confident about correct predictions. The loss should decrease steadily across epochs.
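A minimal implementation makes the "confidence" reading concrete - the loss is just the negative log of the probability the softmax assigns to the true class:

```javascript
// Softmax + cross-entropy for one sample.
function softmax(logits) {
  const max = Math.max(...logits); // subtract max for numerical stability
  const exps = logits.map(z => Math.exp(z - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

function crossEntropy(logits, label) {
  return -Math.log(softmax(logits)[label]); // -log p(true class)
}

crossEntropy([0, 0, 5], 2); // ~0.013: confident and correct -> near zero
crossEntropy([0, 0, 5], 0); // ~5.01: true class got little probability
```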

Prediction grid - Shows test images with their predicted digit. Green labels mean correct, red labels show the prediction and the actual label in parentheses. Common confusions include 4/9, 3/8, and 7/1 due to visual similarity.

Inference time - Total time to classify all test samples. In the graph-based model, each neuron is visited sequentially via a ready-queue, so inference scales linearly with graph size. Typical times range from 200ms (Fast) to 2s+ (Accuracy) for 200 test samples.
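A minimal sketch of ready-queue traversal (not SpikyPanda's actual code) shows why the cost is linear: a neuron fires once all of its inputs have arrived, so every neuron and synapse is visited exactly once - O(neurons + synapses):

```javascript
// Ready-queue forward pass over an explicit graph. Activation functions
// are omitted for brevity; values are plain weighted sums.
function forwardPass(neurons, synapses, inputs) {
  const pending = new Map(); // neuron id -> inputs not yet received
  const value = new Map();
  const queue = [];
  for (const n of neurons) {
    const fanIn = synapses.filter(s => s.to === n.id).length;
    pending.set(n.id, fanIn);
    value.set(n.id, inputs[n.id] ?? 0);
    if (fanIn === 0) queue.push(n.id); // input neurons are ready at once
  }
  while (queue.length) {
    const id = queue.shift(); // this neuron's value is now final
    for (const s of synapses.filter(s => s.from === id)) {
      value.set(s.to, value.get(s.to) + s.weight * value.get(id));
      pending.set(s.to, pending.get(s.to) - 1);
      if (pending.get(s.to) === 0) queue.push(s.to);
    }
  }
  return value;
}

// Tiny graph: neurons a and b feed c.
const out = forwardPass(
  [{ id: "a" }, { id: "b" }, { id: "c" }],
  [{ from: "a", to: "c", weight: 2 }, { from: "b", to: "c", weight: 3 }],
  { a: 1, b: 1 }
); // out.get("c") === 2*1 + 3*1 === 5
```

(The `synapses.filter` inside the loop makes this sketch quadratic; a real implementation would index outgoing edges per neuron to get true linear scaling.)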

Neurons and synapses - Displayed in the training log when the model is built. The Fast preset creates ~4K neurons, Balanced ~6K, Accuracy ~20K+. Each synapse is a JavaScript object with a weight and references to source/target neurons. More synapses = slower but potentially more accurate.
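For intuition, a synapse-as-object might look like the sketch below (hypothetical field names; the actual SpikyPanda fields may differ). The point is that weights live in plain objects you can read and mutate directly, at the cost of far more memory and pointer-chasing than a packed tensor:

```javascript
// Hypothetical neuron/synapse objects; field names are illustrative only.
const neuronA = { id: 0, bias: 0.1, activation: "relu", out: [] };
const neuronB = { id: 1, bias: -0.2, activation: "relu", out: [] };
const synapse = { source: neuronA, target: neuronB, weight: 0.42 };
neuronA.out.push(synapse);

// Inspect or mutate without any framework API:
synapse.weight *= 0.5; // now 0.21
```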