Basics of Neural Networks

From the intuition of artificial “neurons” to training with backpropagation, regularization, and modern architectures, this chapter demystifies neural nets and shows when they help (and when simpler models win).


A short history & intuition

Neural nets date back to the perceptron (1950s), a simple model that could draw linear decision boundaries. For a while, its limitations (notably its inability to learn nonlinear functions such as XOR) and a lack of compute and data caused stagnation. Interest returned with backpropagation in the 1980s and exploded in the 2010s, as GPUs and large datasets let deep networks outperform traditional methods on vision and speech. The core idea is simple: stack many layers of simple units, each performing a tiny transformation; the composition learns rich, nonlinear mappings from inputs to outputs.

Anatomy of a neural network

  • Input layer: Raw features (pixels, tokens, or engineered columns).
  • Hidden layers: Each neuron computes z = w·x + b, then applies a nonlinearity (ReLU, GELU, tanh). Nonlinearity lets networks learn curved boundaries (see the sketch after this list).
  • Output layer: For classification, a softmax turns scores into probabilities; for regression, a linear neuron outputs a number.
  • Parameters: The weights w and biases b. Training finds values that minimize loss (error).
  • Loss functions: Cross-entropy for classification, MSE/MAE for regression; others for ranking or segmentation.
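
To make this concrete, here is a minimal sketch of a single forward pass through one hidden layer in NumPy; the layer sizes and random inputs are made up for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    # Made-up sizes: 4 input features, 8 hidden units, 3 output classes.
    x = rng.normal(size=4)                         # input features
    W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)  # hidden-layer weights/biases
    W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)  # output-layer weights/biases

    z1 = W1 @ x + b1             # each hidden neuron computes z = w·x + b
    h = np.maximum(z1, 0.0)      # ReLU nonlinearity
    logits = W2 @ h + b2         # output-layer scores

    # Softmax turns scores into class probabilities.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()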

Training with backpropagation

  1. Forward pass: Feed inputs through layers to produce outputs and compute loss.
  2. Backward pass: Compute gradients of loss w.r.t. each parameter (backpropagation).
  3. Update: Adjust parameters using an optimizer (SGD, Adam). Repeat over many batches and epochs, as sketched below.
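
In code, the loop looks like the following PyTorch sketch; the toy data, model shape, and hyperparameter values are placeholders, not recommendations.

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset

    # Toy data: 256 samples, 10 features, binary labels (placeholders).
    X = torch.randn(256, 10)
    y = torch.randint(0, 2, (256,))
    loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for epoch in range(5):                 # epochs: passes over the data
        for xb, yb in loader:              # one batch per update
            loss = loss_fn(model(xb), yb)  # 1. forward pass + loss
            optimizer.zero_grad()
            loss.backward()                # 2. backward pass (backprop)
            optimizer.step()               # 3. parameter update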

Hyperparameters: learning rate (step size), batch size (samples per update), number of epochs (passes over data), initialization scheme (e.g., He/Xavier), and architecture choices (layers, width, activation).
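
As one example of an initialization scheme, He initialization (a common choice for ReLU networks) can be applied explicitly in PyTorch; the layer shape here is arbitrary.

    import torch.nn as nn

    layer = nn.Linear(10, 32)
    nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")  # He initialization
    nn.init.zeros_(layer.bias)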

Training dynamics: Too high a learning rate causes divergence; too low a rate slows learning or stalls. Use learning-rate schedules or adaptive optimizers, and monitor validation curves to avoid over- or underfitting; one scheduling pattern is sketched below.
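
The sketch below uses PyTorch's ReduceLROnPlateau scheduler to shrink the learning rate when the monitored loss stops improving. The toy model and data are placeholders, and the same tensors stand in for separate train and validation splits.

    import torch
    from torch import nn

    model = nn.Linear(10, 1)                          # placeholder model
    X, y = torch.randn(128, 10), torch.randn(128, 1)  # placeholder data
    loss_fn = nn.MSELoss()

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    # Halve the learning rate if the monitored loss stalls for 3 epochs.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=3)

    for epoch in range(20):
        loss = loss_fn(model(X), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step(loss.item())   # feed the monitored loss to the scheduler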

Regularization, normalization & generalization

  • Dropout: Randomly zeroes activations during training; prevents co-adaptation (see the sketch after this list).
  • Weight decay (L2): Penalizes large weights; smooths functions.
  • Early stopping: Halt when validation loss stops improving.
  • Batch/Layer normalization: Stabilize and speed up training by normalizing activations.
  • Data augmentation: For vision/audio/text, create realistic perturbations to improve robustness.
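
Here is a sketch combining three of these techniques in PyTorch: dropout inside the model, weight decay in the optimizer, and early stopping on validation loss. The data, sizes, and patience value are placeholders.

    import torch
    from torch import nn

    # Placeholder tensors standing in for real train/validation splits.
    X_tr, y_tr = torch.randn(256, 20), torch.randint(0, 2, (256,))
    X_va, y_va = torch.randn(64, 20), torch.randint(0, 2, (64,))

    model = nn.Sequential(
        nn.Linear(20, 64), nn.ReLU(),
        nn.Dropout(p=0.5),        # randomly zero half the activations in training
        nn.Linear(64, 2))
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=1e-3, weight_decay=1e-2)  # weight decay

    best_val, patience, bad_epochs = float("inf"), 5, 0
    for epoch in range(100):
        model.train()                             # dropout active
        loss = loss_fn(model(X_tr), y_tr)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        model.eval()                              # dropout disabled
        with torch.no_grad():
            val_loss = loss_fn(model(X_va), y_va).item()
        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:            # early stopping
                break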

Generalization comes from a mix of suitable capacity, good data, and the right regularizers. More parameters can help, but only if the data supports them.

Architectures: MLPs, CNNs, RNNs, Transformers

  • MLPs (feed-forward nets): Great for learned embeddings or simple tabular tasks, though tree models often outperform on small tabular datasets.
  • CNNs: Use convolutional filters that slide over images, capturing local patterns; excel at vision tasks (classification, detection, segmentation).
  • RNNs/LSTM/GRU: Process sequences step by step; suited for time-series and early NLP before transformers took over.
  • Transformers: Use attention to relate all positions in a sequence; dominate NLP and are increasingly strong in vision (ViT) and speech. The core attention step is sketched after this list.
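
To illustrate the transformer bullet, here is scaled dot-product attention, the operation at the heart of the architecture, in NumPy (single head, no masking); the sequence length and dimension are arbitrary.

    import numpy as np

    def attention(Q, K, V):
        # Scaled dot-product attention: softmax(Q Kᵀ / sqrt(d)) V
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)       # similarity between all position pairs
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)  # row-wise softmax
        return w @ V                        # weighted mix of value vectors

    rng = np.random.default_rng(0)
    seq_len, d = 5, 8                       # arbitrary sequence length and width
    Q, K, V = (rng.normal(size=(seq_len, d)) for _ in range(3))
    out = attention(Q, K, V)                # every position attends to all others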

When to use (and when not to)

Neural nets shine on unstructured data (text, images, audio) or when you have lots of data and compute. For small tabular problems, tree-based methods are often more accurate, simpler, and cheaper. A good rule: start with a simple baseline; reach for neural nets when the task’s structure (images, language) or scale demands it.

Serving & optimization (latency, cost)

  • Quantization: Use lower-precision numbers (int8/fp16) to speed inference with minimal accuracy loss (see the sketch after this list).
  • Distillation: Train a small “student” model to mimic a large “teacher.”
  • Caching: Cache frequent inputs (like prompts) and their outputs.
  • Batching: Combine small requests on the server to utilize hardware efficiently.
  • On-device: Run compact models where privacy and offline latency matter.
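
A sketch of dynamic int8 quantization with PyTorch; the model is a placeholder, and a real deployment should verify accuracy after conversion.

    import torch
    from torch import nn

    model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
    model.eval()

    # Convert Linear weights to int8; activations are quantized on the fly.
    quantized = torch.quantization.quantize_dynamic(
        model, {nn.Linear}, dtype=torch.qint8)

    out = quantized(torch.randn(1, 128))    # inference with the smaller model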

Interpretability & safety

Complex nets are harder to explain. Use feature attribution (e.g., SHAP, Integrated Gradients) to see which inputs drive predictions. For high-impact decisions (credit, health), combine models with reason codes and human review. Log decisions, support appeals, and monitor for drift.
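
For example, Integrated Gradients is available in the Captum library; a minimal sketch, assuming Captum is installed and using a placeholder model and inputs:

    import torch
    from torch import nn
    from captum.attr import IntegratedGradients  # pip install captum

    model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 2))
    model.eval()
    inputs = torch.randn(4, 10)               # placeholder batch

    ig = IntegratedGradients(model)
    # Per-sample attribution of each input feature toward class 1.
    attributions = ig.attribute(inputs, target=1)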

Hands-on mini project

  • Goal: Predict whether a user will click an in-app tip.
  • Data: Past sessions, device, region, and recent activity.
  • Baseline: Gradient-boosted trees (fast, interpretable).
  • Neural version: An MLP over embeddings of categorical features plus normalized numeric features (sketched below).
  • Evaluation: PR-AUC, calibration, and recall at a fixed precision.
  • Deployment: A low-latency API; set confidence thresholds; log every decision for offline analysis.
  • Improvement loop: Weekly review of feature drift, threshold tuning, and A/B tests for user experience.
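
A sketch of the neural version, with made-up feature cardinalities: embeddings for the categorical features (device, region) are concatenated with normalized numeric features and fed to an MLP.

    import torch
    from torch import nn

    class ClickMLP(nn.Module):
        # Hypothetical cardinalities: 50 devices, 200 regions; 6 numeric features.
        def __init__(self, n_devices=50, n_regions=200, n_numeric=6):
            super().__init__()
            self.device_emb = nn.Embedding(n_devices, 8)
            self.region_emb = nn.Embedding(n_regions, 8)
            self.mlp = nn.Sequential(
                nn.Linear(8 + 8 + n_numeric, 64), nn.ReLU(),
                nn.Dropout(0.3),
                nn.Linear(64, 1))             # one logit: P(click) after sigmoid

        def forward(self, device_id, region_id, numeric):
            x = torch.cat([self.device_emb(device_id),
                           self.region_emb(region_id),
                           numeric], dim=-1)  # fuse categorical + numeric features
            return self.mlp(x).squeeze(-1)

    model = ClickMLP()
    logits = model(torch.tensor([3, 7]),      # device ids (made up)
                   torch.tensor([42, 17]),    # region ids (made up)
                   torch.randn(2, 6))         # normalized numeric features
    probs = torch.sigmoid(logits)             # click probabilities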