Computer Vision Basics
How machines “see”: pixels and tensors, convolutional neural networks, essential tasks (classification, detection, segmentation, keypoints), evaluation, data strategy, deployment on web/mobile/edge, and pitfalls to avoid.
How images are represented
Images are grids of pixels. A color image is typically a 3-D tensor of shape H × W × 3 (height, width, RGB channels), with values normalized to 0–1 or standardized per dataset mean/variance. Videos add time: T × H × W × C. Models don’t “see” pictures; they receive arrays of numbers. Preprocessing (resizing, center-cropping, normalization) is part of your model’s contract. Keep it consistent across train/validation/test and production inference to avoid subtle errors.
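As a minimal sketch of such a contract, here is a typical PIL + torchvision pipeline; the crop size and channel statistics are the common ImageNet defaults, and "example.jpg" is a placeholder. Note that PyTorch stores images channels-first (3 × H × W) rather than H × W × 3.

```python
from PIL import Image
import torchvision.transforms as T

# One preprocessing "contract": reuse this exact pipeline for training,
# evaluation, and production inference.
preprocess = T.Compose([
    T.Resize(256),             # shorter side -> 256 px
    T.CenterCrop(224),         # 224 x 224 crop, as many ImageNet models expect
    T.ToTensor(),              # H x W x 3 uint8 -> 3 x H x W float in [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet channel statistics
                std=[0.229, 0.224, 0.225]),
])

img = Image.open("example.jpg").convert("RGB")
x = preprocess(img)            # tensor of shape (3, 224, 224)
print(x.shape, x.min().item(), x.max().item())
```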
A short history & core idea
Early CV relied on handcrafted features (SIFT, HOG) fed into classic ML. The breakthrough came with deep convolutional networks that learn features directly from pixels. Around 2012, such models dramatically cut error rates on the ImageNet challenge. Since then, improvements in architectures, datasets, and training tricks have fueled rapid progress. Recently, Vision Transformers (ViT) and hybrid models have matched or surpassed CNNs on many benchmarks, though CNNs remain popular for efficiency.
Core tasks
- Image classification: Assign one (or multiple) labels to the whole image (dog vs cat; “contains a QR code”).
- Object detection: Localize and classify multiple objects with bounding boxes (people, cars, products); a minimal inference sketch follows this list.
- Semantic segmentation: Label each pixel with a class (road, sky, building). Great for understanding scene layout.
- Instance segmentation: Like semantic segmentation but separates individual object instances (10 apples, not just “apple” pixels).
- Keypoint/pose estimation: Detect specific landmarks (human joints, facial landmarks) for motion, AR, safety.
- Tracking: Follow objects across frames; often detection + association (e.g., multi-object tracking in traffic).
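To make the detection task concrete, here is a minimal inference sketch with a pretrained torchvision detector (weights="DEFAULT" assumes torchvision ≥ 0.13; "street.jpg" is a placeholder):

```python
import torch
from torchvision.io import read_image
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Load a detector pretrained on COCO and switch to inference mode.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

img = read_image("street.jpg").float() / 255.0   # 3 x H x W, values in [0, 1]
with torch.no_grad():
    pred = model([img])[0]     # detection models take a list of images

keep = pred["scores"] > 0.5    # keep confident detections
print(pred["boxes"][keep])     # (x1, y1, x2, y2) pixel coordinates
print(pred["labels"][keep])    # COCO class indices
```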
Convolutions in plain English
A convolution slides a small filter over the image and multiplies/accumulates values, producing a feature map. Early layers detect edges and textures. Deeper layers combine them into motifs (wheels, eyes) and then objects. Pooling (downsampling) shrinks spatial size while preserving salient signals. Because filters are shared across the image, CNNs learn efficiently and generalize across positions.
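The sliding-filter idea fits in a few lines. The sketch below applies a handcrafted Sobel edge filter with PyTorch’s conv2d; in a CNN the filter values are learned rather than handcrafted, but the mechanics are identical:

```python
import torch
import torch.nn.functional as F

# A single 3x3 filter that responds to vertical edges (a Sobel kernel).
sobel_x = torch.tensor([[-1., 0., 1.],
                        [-2., 0., 2.],
                        [-1., 0., 1.]]).reshape(1, 1, 3, 3)

img = torch.rand(1, 1, 64, 64)               # one grayscale 64x64 image
edges = F.conv2d(img, sobel_x, padding=1)    # same filter slides everywhere
pooled = F.max_pool2d(edges, kernel_size=2)  # downsample, keep strong responses
print(edges.shape, pooled.shape)             # (1,1,64,64) -> (1,1,32,32)
```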
Architectures you’ll hear about
- ResNet: Adds skip connections that make deep nets trainable; a reliable backbone for many tasks (see the loading sketch after this list).
- MobileNet/EfficientNet: Designed for speed/size; great for mobile/edge.
- U-Net/DeepLab: Popular for segmentation; U-Net uses an encoder-decoder with skips for precise masks.
- Faster R-CNN/RetinaNet/YOLO: Common detection families; YOLO variants emphasize real-time speed.
- Vision Transformers (ViT)/Swin: Treat images as token sequences; excel at large-scale training and transfer.
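Several of these are one import away in torchvision (the weights="DEFAULT" API assumes torchvision ≥ 0.13), which makes it cheap to compare backbones on your own data:

```python
import torchvision.models as models
from torchvision.models.segmentation import deeplabv3_resnet50

resnet = models.resnet50(weights="DEFAULT")            # general-purpose backbone
mobile = models.mobilenet_v3_small(weights="DEFAULT")  # small/fast, edge-friendly
vit    = models.vit_b_16(weights="DEFAULT")            # Vision Transformer
seg    = deeplabv3_resnet50(weights="DEFAULT")         # semantic segmentation
```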
Data: labeling, augmentation & splits
Labels: For classification, a single label per image may suffice; for detection/segmentation, you’ll need bounding boxes/masks. Invest in a labeling guide with examples and edge-case policies; measure inter-annotator agreement. Inconsistent labels put a hard ceiling on model quality.
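One common agreement measure is Cohen’s kappa, which corrects raw agreement for chance. A toy sketch with scikit-learn (the labels are hypothetical):

```python
from sklearn.metrics import cohen_kappa_score

# Two annotators labeling the same eight images.
annotator_a = ["dog", "cat", "dog", "dog", "cat", "bird", "dog", "cat"]
annotator_b = ["dog", "cat", "dog", "cat", "cat", "bird", "dog", "dog"]

# Rule-of-thumb bands vary by field; set your own bar per task and
# revisit the labeling guide when agreement is low.
print(cohen_kappa_score(annotator_a, annotator_b))
```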
Augmentation: Random flips, crops, color jitter, blur, noise, and CutMix/MixUp improve robustness. Keep label semantics intact (don’t horizontally flip text if orientation matters). For detection/segmentation, use augmentations that transform boxes/masks consistently.
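A minimal train-time augmentation pipeline in torchvision might look like this (apply it during training only, and drop the flip if orientation carries meaning):

```python
import torchvision.transforms as T

train_aug = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(p=0.5),   # omit if left/right orientation matters
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    T.GaussianBlur(kernel_size=3),
    T.ToTensor(),
])
# For detection/segmentation, prefer tooling that transforms boxes and masks
# together with the image (e.g., torchvision.transforms.v2 or Albumentations).
```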
Splits: Ensure train/validation/test splits reflect deployment reality. If images are related (e.g., frames from the same video), keep them in the same split to avoid leakage. Stratify across conditions (lighting, camera types) to detect performance gaps.
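A group-aware split keeps related images together. The sketch below uses scikit-learn’s GroupShuffleSplit with hypothetical video IDs as the grouping key:

```python
from sklearn.model_selection import GroupShuffleSplit

images = [f"frame_{i}.jpg" for i in range(10)]
video_ids = [0, 0, 0, 1, 1, 2, 2, 2, 3, 3]   # which video each frame came from

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(images, groups=video_ids))

# Every frame of a given video lands on one side of the split,
# so near-duplicate frames cannot leak into the test set.
print(sorted({video_ids[i] for i in test_idx}))
```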
Evaluation metrics that matter
- Classification: Top-1/Top-5 accuracy, F1, confusion matrix to see which classes the model confuses.
- Detection: mAP (mean average precision) at various IoU thresholds (e.g., mAP@0.5, mAP@[.5:.95]); IoU itself is computed in the sketch after this list.
- Segmentation: IoU/Dice per class; boundary metrics for thin objects.
- Latency/throughput: FPS on target hardware; memory and energy use for mobile/edge.
- Slice analysis: Break performance by lighting, device, geography, and demographics (where relevant) to expose fairness issues.
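IoU (intersection over union) underlies both the detection and segmentation metrics above. A plain-Python sketch for axis-aligned boxes:

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

# At mAP@0.5, this prediction would not count as a match (IoU < 0.5).
print(box_iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```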
Deployment: web, mobile, edge
- Compression: Quantize to int8/fp16; prune redundant weights; distill large models into small ones.
- Formats & runtimes: ONNX/TensorRT for servers; Core ML/Metal for iOS; TFLite/NNAPI for Android; WebNN/WebGPU for browsers.
- Pipelines: Preprocessing matters; do the same resize/normalize in production as in training. A mismatch is a silent accuracy killer.
- Edge trade-offs: On-device inference gives privacy and low latency but limited compute. Hybrid patterns (simple edge model + server fallback) work well.
- Monitoring: Log confidence, input stats (brightness, blur), and model versions. Flag out-of-distribution inputs for manual review.
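As one example of the format step, the sketch below exports a torchvision model to ONNX so the same graph can run under ONNX Runtime or TensorRT; the dummy input shape must match your preprocessing contract:

```python
import torch
import torchvision.models as models

model = models.mobilenet_v3_small(weights="DEFAULT").eval()
dummy = torch.rand(1, 3, 224, 224)   # must match training-time preprocessing

torch.onnx.export(
    model, dummy, "mobilenet_v3_small.onnx",
    input_names=["image"], output_names=["logits"],
    dynamic_axes={"image": {0: "batch"}},   # allow variable batch size
)
```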
Fairness, safety & privacy
Vision systems can underperform on underrepresented groups or conditions (skin tones, lighting, camera quality). Build representative datasets; evaluate by cohort; provide human review for high-impact decisions. Respect privacy laws for face detection/recognition; prefer on-device and minimize retention. For moderation and safety, combine automated filters with human review and clear appeal paths.
Hands-on mini projects
- Classification: Fine-tune a small MobileNet on 5–10 product categories. Evaluate the confusion matrix and propose UI changes for common confusions (a starter sketch follows this list).
- Detection: YOLO-style detector for “find the QR code” in receipts; measure mAP@0.5 and latency on a budget phone.
- Segmentation: U-Net to separate foreground products from background; benchmark IoU and speed for batch cropping.
- Tracking: Detect + track pallet IDs in a short warehouse video; report identity switches and missed frames.
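A starter sketch for the classification project: load a pretrained MobileNet, freeze the feature extractor, and swap in a new head for your categories (NUM_CLASSES is a placeholder):

```python
import torch.nn as nn
import torchvision.models as models

NUM_CLASSES = 8   # hypothetical number of product categories

model = models.mobilenet_v3_small(weights="DEFAULT")

# Freeze the pretrained features; only the new head will be trained.
for p in model.parameters():
    p.requires_grad = False
model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, NUM_CLASSES)

# From here: train with CrossEntropyLoss on labeled product images, then
# inspect the confusion matrix on a held-out split.
```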
Bridging to Web3 & crypto
- Proof-of-physical: Vision verification for tokenized real-world assets (serial numbers, condition grading) with human audits.
- NFT authenticity cues: Perceptual hashing + similarity search to flag duplicates/wash copies (assist creators and marketplaces); see the hashing sketch after this list.
- Compliance: Logo/brand detection for user-generated listings and ads; automated flags + manual review.
- UX: AR overlays for on-chain collectibles or event tickets; keypoint/pose for try-on experiences.
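For the authenticity-cue idea, here is a minimal duplicate check with the ImageHash library (file names and the distance threshold are placeholders to tune on real data):

```python
from PIL import Image
import imagehash   # pip install ImageHash

# Perceptual hashes change little under resizing/re-encoding, so a small
# Hamming distance between hashes suggests a near-duplicate image.
h1 = imagehash.phash(Image.open("original.png"))
h2 = imagehash.phash(Image.open("suspect_copy.png"))

if h1 - h2 <= 8:   # Hamming distance; threshold is task-dependent
    print("possible duplicate, flag for human review")
```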