Introduction to Machine Learning (Supervised vs Unsupervised)
A practical, hands-on tour of machine learning: what it is, a short history, how models learn, the difference between supervised and unsupervised learning, how to evaluate systems, and how to ship something useful without drowning in jargon.
A short history of ML
While “artificial intelligence” was coined in 1956, many of the practical tools we call machine learning matured later. The 1990s saw statistical learning methods (support vector machines, decision trees, and ensemble methods) prosper thanks to growing datasets. In the 2000s, web-scale data and cheaper compute pushed ML from academic labs into mainstream products. The 2010s brought deep learning breakthroughs that transformed perception (vision, speech) and language. Through each era, one lesson held steady: better data and evaluation discipline beat algorithmic hype.
What ML actually is (and isn’t)
ML is about generalization. We give an algorithm examples: “features” describing each situation and the “labels” or outcomes we want to predict. The algorithm tunes internal parameters so that, on new, unseen data, it predicts well. ML isn’t magic: it can’t learn what isn’t in the data, and it amplifies biases if we ignore them. Success depends on good problem framing, measurement, and iteration.
Supervised learning
Supervised learning uses labeled data: each training example has inputs (features) and a correct output (label). The model tries to map inputs to labels.
- Classification: predict a category (spam vs ham, churn vs retain, safe vs risky). Outputs are classes or probabilities.
- Regression: predict a number (price, demand, time to deliver).
- Algorithms: logistic/linear regression, decision trees, random forests, gradient-boosted trees (XGBoost/LightGBM), and neural networks.
- Where it shines: When you have plenty of labeled examples that represent the real world you’ll encounter at inference time.
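To make this concrete, here is a minimal classification sketch, assuming scikit-learn and a synthetic dataset standing in for real labeled data; the library and dataset are illustrative choices, not prescriptions.

```python
# A minimal supervised-classification sketch (scikit-learn assumed; synthetic data
# stands in for real features and labels).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Synthetic "spam vs ham"-style problem: 20 features, imbalanced binary label.
X, y = make_classification(n_samples=2_000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)

# Hold out unseen data so we measure generalization, not memorization.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1_000)   # simple, strong baseline
model.fit(X_train, y_train)

print(classification_report(y_test, model.predict(X_test)))
```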
Unsupervised learning
Unsupervised learning finds structure without labels. It’s useful for discovery, compression, anomaly detection, and as a prelude to supervised tasks.
- Clustering: group similar items (k-means, hierarchical clustering, DBSCAN). Useful for customer segments or content topic discovery.
- Dimensionality reduction: project high-dimensional data into a smaller space (PCA, t-SNE, UMAP) for visualization or speed.
- Anomaly detection: identify unusual events (isolation forests, one-class SVMs, autoencoders).
Unsupervised results need human interpretation: clusters aren’t “right” or “wrong” in the same way a label is; they’re hypotheses you validate.
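As a small illustration of label-free structure-finding, here is an anomaly-detection sketch with an isolation forest, assuming scikit-learn and synthetic data; the points it flags are candidates for review, not ground truth.

```python
# Anomaly detection without labels: an isolation forest scores how "isolated"
# each point is (scikit-learn assumed; data is synthetic).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(1_000, 5))   # typical behavior
outliers = rng.normal(loc=6.0, scale=1.0, size=(10, 5))    # unusual events
X = np.vstack([normal, outliers])

detector = IsolationForest(contamination=0.01, random_state=0).fit(X)
flags = detector.predict(X)                  # -1 = anomaly, +1 = normal
print("flagged as anomalous:", int((flags == -1).sum()))
```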
Practical workflow: data → model → evaluation → deployment
- Frame the problem: Who is the user? What decision changes? What metric defines “better” (accuracy, recall, cost)? What are constraints (latency, fairness, policy)?
- Data audit: Understand sources, coverage, leakage risks, and label quality. Define train/validation/test splits that reflect reality.
- Features & baselines: Start with a simple baseline (logistic/linear regression). Engineer straightforward features; avoid premature complexity.
- Train & tune: Fit the baseline, then try stronger models. Use cross-validation and grid/random search for hyperparameters.
- Evaluate beyond averages: Report metrics by cohort (region, device, customer segment) to expose uneven performance.
- Ship a thin slice: Serve the model behind an API or on-device. Design fallbacks and human review for high-impact decisions.
- Monitor & iterate: Track quality, drift, latency, and cost. Create feedback loops (user corrections) to improve steadily.
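The middle steps of this workflow (baseline first, then stronger models with tuning) compress into a few lines; this sketch assumes scikit-learn, with synthetic data standing in for your audited dataset.

```python
# Baseline, then a stronger model with a small hyperparameter search
# (scikit-learn assumed; X, y would come from your audited data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

X, y = make_classification(n_samples=3_000, n_features=15, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# 1) Simple baseline first: if nothing beats it, ship the simple thing.
baseline = LogisticRegression(max_iter=1_000)
print("baseline CV accuracy:", cross_val_score(baseline, X_train, y_train, cv=5).mean())

# 2) Stronger model plus grid search over a small hyperparameter space.
search = GridSearchCV(
    GradientBoostingClassifier(random_state=1),
    param_grid={"n_estimators": [100, 300], "max_depth": [2, 3]},
    cv=5,
)
search.fit(X_train, y_train)
print("best params:", search.best_params_)
print("held-out test accuracy:", search.score(X_test, y_test))
```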
Metrics that matter (classification & regression)
Classification
- Accuracy: (TP+TN)/All. Misleading on imbalanced data.
- Precision & Recall: Precision = “when we predict positive, how often are we right?” Recall = “of all actual positives, how many did we catch?”
- F1 score: Harmonic mean of precision and recall; good single number when classes are imbalanced.
- ROC-AUC & PR-AUC: Threshold-free summaries; PR-AUC is more informative on heavy imbalance.
- Calibration: Do predicted probabilities match observed frequencies?
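A sketch of computing these classification metrics, assuming scikit-learn; the tiny hand-made arrays stand in for real validation outputs.

```python
# Classification metrics on toy validation outputs (scikit-learn assumed).
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score)

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.15, 0.3, 0.4, 0.35, 0.6, 0.55, 0.8, 0.9])
y_pred = (y_prob >= 0.5).astype(int)            # the threshold choice matters

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
print("roc-auc  :", roc_auc_score(y_true, y_prob))              # threshold-free
print("pr-auc   :", average_precision_score(y_true, y_prob))    # better under imbalance
# For calibration, sklearn.calibration.calibration_curve compares predicted
# probabilities to observed frequencies bucket by bucket.
```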
Regression
- MSE/MAE: Mean squared/absolute error. MAE is robust to outliers; MSE penalizes large errors more.
- R²: Variance explained; intuitive but watch for misuse on non-linear problems.
- Prediction intervals: Communicate uncertainty; critical for planning and risk.
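A companion sketch for the regression metrics, again assuming scikit-learn; the quantile-based interval at the end is one simple way to report uncertainty, not the only one.

```python
# Regression metrics plus a rough prediction interval from quantile models
# (scikit-learn assumed; data is synthetic).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=2.0, size=500)   # noisy linear relationship

point = GradientBoostingRegressor(random_state=0).fit(X, y)
pred = point.predict(X)
print("MAE:", mean_absolute_error(y, pred))
print("MSE:", mean_squared_error(y, pred))
print("R^2:", r2_score(y, pred))

# Rough 80% prediction interval via quantile loss (10th and 90th percentiles).
low = GradientBoostingRegressor(loss="quantile", alpha=0.1, random_state=0).fit(X, y)
high = GradientBoostingRegressor(loss="quantile", alpha=0.9, random_state=0).fit(X, y)
print("interval for first row:", (low.predict(X[:1])[0], high.predict(X[:1])[0]))
```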
Common pitfalls: leakage, imbalance, drift
- Target leakage: Using features that won’t be available (or are contaminated by the label) at prediction time. Fix splits; simulate production.
- Imbalanced classes: Rare positives (fraud) can make accuracy look great while recall is terrible. Use class weights, resampling, and PR-AUC.
- Overfitting: Model memorizes training quirks. Use validation, regularization, early stopping, and simpler models.
- Distribution shift: Real-world changes silently degrade models. Monitor input/label drift and refresh periodically.
- Bad labels: Inconsistent or noisy labels poison learning. Audit labeling processes and adjudicate disagreements.
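For the class-imbalance pitfall specifically, one cheap mitigation is class weighting; this sketch assumes scikit-learn and compares unweighted and balanced logistic regression on synthetic, fraud-like data (resampling, e.g. via imbalanced-learn, is another option).

```python
# Class weights plus PR-AUC on a rare-positive problem (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score, recall_score

# 2% positives, mimicking fraud-style rarity.
X, y = make_classification(n_samples=10_000, n_features=20,
                           weights=[0.98, 0.02], random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=7)

for name, clf in {
    "unweighted": LogisticRegression(max_iter=1_000),
    "balanced": LogisticRegression(max_iter=1_000, class_weight="balanced"),
}.items():
    clf.fit(X_tr, y_tr)
    prob = clf.predict_proba(X_te)[:, 1]
    print(name,
          "PR-AUC:", round(average_precision_score(y_te, prob), 3),
          "recall:", round(recall_score(y_te, clf.predict(X_te)), 3))
```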
Two case studies (churn & segmentation)
Case 1 — Supervised churn prediction
- Goal: Predict whether a customer will cancel next month so retention can intervene.
- Features: tenure, usage patterns, last-seen, payment issues, support tickets, pricing tier, device, and marketing campaigns.
- Model: Start with logistic regression; compare with gradient-boosted trees.
- Metrics: PR-AUC, recall at a fixed precision threshold (e.g., precision ≥ 0.7), and calibration.
- Action: Route high-risk users to a save offer; track uplift via A/B tests.
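Choosing the operating point for “recall at a fixed precision” is mostly threshold bookkeeping; here is a sketch, assuming scikit-learn, with toy arrays standing in for the churn model’s validation scores.

```python
# Pick the threshold with the highest recall among those meeting precision >= 0.7
# (scikit-learn assumed; y_true/y_prob are toy stand-ins for validation outputs).
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])
y_prob = np.array([0.10, 0.30, 0.80, 0.20, 0.65, 0.40, 0.15, 0.90, 0.70, 0.35, 0.55, 0.05])

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
# The last precision/recall point has no threshold, so align on [:-1].
candidates = np.where(precision[:-1] >= 0.7, recall[:-1], -1.0)
best = int(np.argmax(candidates))
if candidates[best] >= 0:
    print("threshold:", thresholds[best],
          "precision:", round(precision[best], 2),
          "recall:", round(recall[best], 2))
else:
    print("no threshold reaches precision >= 0.7")
```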
Case 2 — Unsupervised customer segments
- Goal: Discover usage-based segments to shape pricing and content.
- Method: Scale and cluster numerical features with k-means; profile clusters (size, revenue, satisfaction).
- Outcome: “Power users,” “weekend browsers,” and “trial-only” cohorts.
- Next: Validate with stakeholders; turn insights into experiments and perhaps into features for supervised models.
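A sketch of the scale–cluster–profile loop, assuming scikit-learn and pandas; the feature names and synthetic data are illustrative assumptions, and the cluster names still come from humans.

```python
# Scale, cluster with k-means, then profile segments (scikit-learn and pandas
# assumed; feature names and data are illustrative).
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "sessions_per_week": rng.gamma(2.0, 2.0, 1_000),
    "weekend_share": rng.uniform(0, 1, 1_000),
    "monthly_revenue": rng.gamma(3.0, 10.0, 1_000),
})

X = StandardScaler().fit_transform(df)      # k-means is distance-based: scale first
df["segment"] = KMeans(n_clusters=3, n_init=10, random_state=3).fit_predict(X)

# Profile each segment: size and per-feature means drive the human naming step.
print(df.groupby("segment").agg(size=("segment", "size"),
                                revenue=("monthly_revenue", "mean"),
                                sessions=("sessions_per_week", "mean")))
```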
Shipping & monitoring
- Interface: Wrap the model in an API or run on-device. Define inputs/outputs clearly; log predictions and decisions (privacy-aware).
- Guardrails: Use thresholds and human review for high-impact actions; throttle or quarantine low-confidence outputs.
- Observability: Track latency, error rates, input drift, and business outcomes. Create alerts for anomalies.
- Feedback loops: Capture corrections and outcomes; schedule retraining or prompt/policy updates.
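A minimal serving sketch, assuming FastAPI, pydantic, and a scikit-learn model saved with joblib; the endpoint name, field names, artifact path, and review threshold are all illustrative assumptions.

```python
# Wrap a trained model in a small API with a simple guardrail
# (FastAPI, pydantic, and joblib assumed; names and paths are hypothetical).
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("churn_model.joblib")      # hypothetical model artifact

class Features(BaseModel):
    tenure_months: float
    sessions_per_week: float
    support_tickets: int

@app.post("/predict")
def predict(features: Features):
    prob = float(model.predict_proba([[features.tenure_months,
                                       features.sessions_per_week,
                                       features.support_tickets]])[0, 1])
    # Guardrail: route low-confidence cases to human review instead of auto-acting.
    needs_review = 0.4 <= prob <= 0.6
    return {"churn_probability": prob, "needs_review": needs_review}
```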
Exercises and next steps
- Churn baseline: Sketch the features you have today. Which are available at prediction time? What would be target leakage?
- Metric choice: If the positive rate is 3%, which metric will you prioritize and why? (Hint: PR-AUC and recall at fixed precision.)
- Segment hypothesis: Run a mock clustering on 5–6 features you can export. Label clusters and propose one action per cluster.