Natural Language Processing (NLP) Explained

Tokens, embeddings, classic methods, transformers, evaluation, safety, and how to build language features that users actually trust.

A short history of NLP

Early NLP relied on hand-written rules and grammars. The 1990s brought statistical methods, hidden Markov models for tagging, phrase-based machine translation, and logistic/SVM classifiers for sentiment. The 2010s revolutionized NLP with distributed word vectors (Word2Vec, GloVe) and then transformers, which model long-range dependencies via attention. Today’s LLMs unify tasks, classification, extraction, summarization, through prompting and fine-tuning, enabling flexible assistants and workflows.

Tokens & embeddings

Machines don’t “see” words; they see numbers. NLP first tokenizes text into units (words, subwords, or characters). Subword tokenization (BPE, SentencePiece) balances vocabulary size and flexibility, handling rare words by composing subwords. Tokens are mapped to vectors called embeddings. In classic pipelines, embeddings were static; in modern transformers they’re contextual, the vector for “bank” shifts depending on whether the sentence is about rivers or finance.

Classic NLP (bag-of-words, TF-IDF, n-grams)

Classic pipelines convert documents into sparse vectors that count term frequencies, often weighted by inverse document frequency (TF-IDF). You can train linear models (logistic regression) or SVMs on these vectors to perform sentiment analysis, spam detection, or topic classification. Pros: fast, simple, and surprisingly strong on short texts with consistent vocabulary. Cons: poor handling of synonyms, word order, and long-range context.

Neural NLP (RNNs to Transformers)

RNNs/LSTM/GRU: Read text left-to-right; capture sequence information but struggle with very long contexts.
CNNs for text: Convolutions over tokens can capture local patterns (n-grams) efficiently.
Transformers: Use self-attention to consider relationships between all tokens simultaneously, enabling rich context understanding and parallel training. They’re the backbone of modern NLP.

With pretraining (masked-language modeling or next-token prediction), transformers learn general language patterns. Fine-tuning or prompting adapts them to specific tasks with much less data than training from scratch.

LLMs & retrieval-augmented generation (RAG)

LLMs can summarize, classify, extract, translate, and write code via instructions (“prompts”). To keep answers grounded in your sources and reduce hallucination, pair an LLM with retrieval: embed your docs, retrieve top-k relevant passages for a query, then have the model answer using those passages (and optionally cite them). RAG gives you freshness and factuality without full model retraining.

Common tasks & metrics

Classification: sentiment, intent, topic, toxicity. Metrics: accuracy, F1, PR-AUC.
Sequence labeling: named entity recognition (NER), part-of-speech tagging. Metrics: token/entity F1.
Question answering: extractive or generative. Metrics: exact-match (extractive), faithfulness/human ratings (generative).
Summarization: abstractive or extractive. Metrics: ROUGE, human evaluations focused on correctness and coverage.
Translation: BLEU/COMET plus human evals for fluency and adequacy.
Retrieval: recall@k, MRR, nDCG; evaluate on domain-specific questions.

For generative tasks, automatic metrics only go so far; you’ll want rubric-based human evaluation and spot-checks for factual accuracy.

Data cleaning, labeling & multilingual

Cleaning: Deduplicate, normalize whitespace, handle encoding issues, and remove boilerplate.
Labeling: Provide clear guidelines and examples; adjudicate disagreements; sample for quality control.
Length & truncation: Long documents may need chunking with overlap for retrieval.
Multilingual: Tokenization differs; idioms and morphology vary. Consider multilingual models and eval sets per language.
PII: Redact sensitive data; restrict inputs if needed; log with privacy in mind.

Bias, safety & secure prompting

Bias: Measure performance across cohorts; mitigate with balanced data, instructions, and post-processing rules.
Safety: Provide refusal rules; block dangerous instructions; restrict tool use; require citations for high-risk outputs.
Prompt security: Treat inputs as untrusted; avoid injecting raw URLs or executing returned code; sanitize and constrain.

How to build an NLP feature end-to-end

Scenario: Auto-triage support tickets and draft first-pass replies grounded in your policy docs.

Define outputs: Category (billing/technical/refund/other), urgency (low/med/high), and a short suggested reply with links to the exact policy sections.
Prepare data: Export historic tickets + resolutions; redact PII; create a small, high-quality labeled set for evaluation.
Choose baseline: Start with a simple TF-IDF + logistic regression classifier. Evaluate F1 by class and confusion matrix to see common mistakes.
Upgrade: Add a transformer classifier or use an LLM with few-shot prompting for classification. Compare against baseline on held-out data.
RAG: Embed policy docs; at inference, retrieve top-k passages and ask the LLM to compose a reply using only those passages; include citations.
Guardrails: Refuse if policy isn’t found; route high-risk topics to humans; log everything for audits.
Measure: Accuracy/F1 on intent; human-rated quality of replies; latency and cost. Track user satisfaction and re-open rates.
Iterate: Update retrieval index; refine prompts with examples; add evaluation rubrics for tone and correctness.

Exercises

Taxonomy design: Draft a 6–8 label taxonomy for your tickets. What ambiguity will annotators face? Write 2 rules to resolve it.
Retrieval check: Take 5 real questions and verify whether your top-k passages actually contain the answer. If not, tweak chunking and embeddings.
Rubric eval: Create a 5-point rubric for reply quality (correctness, coverage, tone, citations). Sample 50 replies and score them.