What is machine learning?

Machine learning (ML) is a subfield of artificial intelligence focused on algorithms that detect patterns in data and improve performance on a task as exposure to data increases—without being explicitly programmed for every special case. A model maps inputs to outputs (or learns structure) by optimizing parameters against an objective, subject to constraints like generalization and fairness.

Supervised learning #

In supervised learning, each training example includes an input and a corresponding label (the correct answer or target). The learner’s job is to approximate a function that predicts labels on new, unseen inputs drawn from a similar distribution. Quality depends on label accuracy, dataset coverage, and choice of hypothesis class.
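The idea of approximating a function from labeled pairs can be sketched in a few lines. The example below fits a one-dimensional linear model by closed-form least squares on synthetic data (the data and function names are illustrative, not from any particular library):

```python
# Minimal sketch: supervised learning as function approximation.
# Fit y ≈ w*x + b on labeled pairs by ordinary least squares, then
# predict on an unseen input. Data are synthetic for illustration.

def fit_linear(xs, ys):
    """Return (w, b) minimizing squared error over labeled pairs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    w = cov / var
    b = my - w * mx
    return w, b

# Labeled training set: inputs paired with targets (here y = 2x + 1 exactly).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = fit_linear(xs, ys)
prediction = w * 10.0 + b  # generalize to an unseen input
```

Real models are rarely this simple, but the shape is the same: parameters are chosen to minimize an objective over labeled data, then applied to inputs the model has never seen.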

Classification

Classification assigns discrete categories—spam versus not spam, disease versus healthy, object class in an image. Models output either class indices or probability vectors over classes. Common metrics include accuracy, precision, recall, F1, and ROC-AUC, chosen according to class imbalance and the cost of different error types.
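Precision, recall, and F1 are simple counts over the confusion matrix. A small sketch (toy labels, hypothetical helper name) makes the definitions concrete:

```python
# Precision, recall, and F1 for the positive class, computed from
# true/false positives and false negatives. Toy data for illustration.

def prf1(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
p, r, f = prf1(y_true, y_pred)  # one missed positive, one false alarm
```

Note that accuracy alone would look fine here even under heavy class imbalance, which is exactly why the metric must match the error costs.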

Regression

Regression predicts continuous quantities: price, temperature, time-to-event, or demand. Loss functions such as mean squared error or Huber loss guide optimization. Evaluations emphasize calibration and error magnitudes (MAE, RMSE) rather than a single thresholded decision.
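The losses named above differ in how hard they punish large errors. A brief sketch, with toy values, shows MSE, MAE, RMSE, and the Huber loss side by side:

```python
import math

# Regression losses on the same residuals. MSE squares errors (outlier-
# sensitive), MAE averages magnitudes, and Huber is quadratic near zero
# but linear beyond a threshold delta, blending the two.

def mse(y, yhat):
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)

def mae(y, yhat):
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def huber(y, yhat, delta=1.0):
    total = 0.0
    for a, b in zip(y, yhat):
        r = abs(a - b)
        total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return total / len(y)

y    = [3.0, 5.0, 2.0]
yhat = [2.5, 5.0, 4.0]   # residuals: 0.5, 0.0, 2.0
rmse = math.sqrt(mse(y, yhat))
```

On these residuals the 2.0 outlier dominates MSE but contributes only linearly to Huber, which is why Huber is a common choice for noisy targets.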

Practical notes for supervised ML

  • Train/validation/test splits or cross-validation estimate generalization honestly.
  • Regularization (L1/L2, dropout in neural nets) reduces overfitting when data are limited.
  • Feature engineering still matters for classical models; deep learning often learns features end-to-end.
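The first bullet, honest generalization estimates, rests on keeping validation data out of training. A minimal sketch of k-fold index generation (hypothetical helper, no library assumed) shows the mechanics:

```python
# k-fold cross-validation indices: every sample lands in exactly one
# validation fold, so each fold's score is computed on data the model
# never trained on.

def kfold_indices(n, k):
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        folds.append((train, val))
        start += size
    return folds

folds = kfold_indices(10, 3)  # three (train, validation) index pairs
```

In practice one would shuffle (or stratify) indices first; libraries such as scikit-learn provide hardened versions of this routine.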

Unsupervised learning #

Unsupervised learning works with inputs alone, seeking structure such as groups, manifolds, or compressed representations. It is widely used for exploration, preprocessing, and anomaly screening before costly labeling.

Clustering

Clustering partitions data into coherent groups—examples include k-means (centroid-based), Gaussian mixture models (probabilistic clusters), hierarchical clustering (nested partitions), and density-based methods like DBSCAN for arbitrary shapes. Choice of distance metric and scale normalization strongly affects outcomes.
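Centroid-based clustering is easy to sketch. The following bare-bones k-means (Lloyd's algorithm) runs on 1-D points; a real implementation would handle multiple dimensions, random restarts, and empty clusters:

```python
# Lloyd's algorithm: alternate assignment (each point joins its nearest
# center) and update (each center moves to its cluster mean). 1-D toy data.

def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        # Assignment step.
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            clusters[j].append(p)
        # Update step (keep a center in place if its cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

pts = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
final = kmeans_1d(pts, centers=[0.0, 10.0])  # converges near the two group means
```

The sensitivity mentioned above is visible even here: a different distance metric or unscaled features would change the assignment step and hence the clusters.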

Dimensionality reduction

Dimensionality reduction maps high-dimensional data to fewer dimensions while preserving variance or neighborhood relationships. Principal component analysis (PCA) finds orthogonal directions of maximum variance. Techniques like t-SNE and UMAP support visualization by emphasizing local structure; because their embeddings distort global distances and are not invertible, they are poorly suited to general-purpose feature compression.
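PCA's "direction of maximum variance" can be found without any linear-algebra library for 2-D data. This sketch uses power iteration on the sample covariance matrix (an illustrative approach; production code would use an SVD routine):

```python
# First principal component of 2-D data: center the data, form the 2x2
# covariance matrix, and run power iteration to find its top eigenvector.

def first_pc(data, iters=100):
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    centered = [(x - mx, y - my) for x, y in data]
    cxx = sum(x * x for x, _ in centered) / n
    cxy = sum(x * y for x, y in centered) / n
    cyy = sum(y * y for _, y in centered) / n
    v = (1.0, 0.0)
    for _ in range(iters):
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
        v = (w[0] / norm, w[1] / norm)
    return v

# Points lying on the line y = x: the top direction is (1, 1) normalized.
pc = first_pc([(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)])
```

Projecting onto this direction keeps all the variance of these points in a single coordinate, which is the compression PCA offers.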

Semi-supervised and self-supervised learning #

Semi-supervised learning combines a small labeled set with a large pool of unlabeled data. Assumptions such as smoothness (similar inputs share labels), cluster structure, or manifold regularization allow models to propagate information beyond pure supervision—valuable when labeling is expensive, as in medical imaging or legal review.
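One simple semi-supervised strategy consistent with the smoothness assumption is self-training: fit on the labeled set, pseudo-label only the unlabeled points the model is confident about, and refit. The sketch below uses a toy 1-D nearest-mean classifier (all names are illustrative):

```python
# Self-training sketch: accept a pseudo-label only when one class mean is
# clearly closer than the other (confidence margin), then refit the means.

def class_means(labeled):
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def self_train(labeled, unlabeled, margin=2.0):
    means = class_means(labeled)
    pseudo = []
    for x in unlabeled:
        dists = sorted((abs(x - m), y) for y, m in means.items())
        # Ambiguous points (small margin) stay unlabeled.
        if dists[1][0] - dists[0][0] >= margin:
            pseudo.append((x, dists[0][1]))
    return class_means(labeled + pseudo)

labeled = [(0.0, "a"), (10.0, "b")]
unlabeled = [1.0, 9.0, 5.1]          # 5.1 is ambiguous and gets skipped
means = self_train(labeled, unlabeled)
```

The margin test is what keeps confirmation bias in check: points near the decision boundary, where the smoothness assumption is weakest, never receive pseudo-labels.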

Self-supervised learning constructs supervisory signals from the data itself: predict missing words, rotated image angles, masked patches, or future frames in video. Pretext tasks pretrain representations later fine-tuned on downstream supervised tasks—central to modern NLP and computer vision.
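The masked-prediction pretext task mentioned above needs no human labels at all; the target is simply the token that was hidden. A toy sketch of generating one such training pair:

```python
import random

# Self-supervised pretext data: hide one token, keep the original as the
# training target. The label comes from the data itself, not an annotator.

def mask_example(tokens, rng, mask_token="[MASK]"):
    i = rng.randrange(len(tokens))
    target = tokens[i]
    masked = tokens[:i] + [mask_token] + tokens[i + 1:]
    return masked, i, target

sentence = ["the", "cat", "sat", "down"]
masked, i, target = mask_example(sentence, random.Random(0))
```

A model pretrained to recover `target` from `masked` learns representations of context that transfer to downstream supervised tasks, which is the core of the masked-language-modeling recipe.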

Key algorithms and techniques #

Classical algorithms remain strong baselines. Logistic regression and linear SVMs offer interpretable margins in high dimensions when features are informative. Decision trees and random forests capture nonlinear interactions with modest tuning; gradient boosting (XGBoost, LightGBM, CatBoost) often wins tabular competitions through additive ensembles.

For sequences, hidden Markov models and dynamic time warping have niche uses; neural sequence models dominate large-scale settings. k-nearest neighbors provides a simple, interpretable nonparametric baseline. Naive Bayes remains a fast probabilistic choice for text with strong independence assumptions.
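k-nearest neighbors is simple enough to show whole. This sketch classifies by majority vote among the k closest training points (1-D distance, toy data):

```python
from collections import Counter

# kNN: no training phase at all; prediction sorts the training set by
# distance to the query and takes a majority vote among the k nearest.

def knn_predict(train, x, k=3):
    neighbors = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [(0.0, "a"), (0.5, "a"), (1.0, "a"),
         (9.0, "b"), (9.5, "b"), (10.0, "b")]
pred = knn_predict(train, 8.7, k=3)  # three nearest neighbors are all "b"
```

The interpretability claim is literal: a prediction can be explained by pointing at the k training examples that produced it.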

Kernel methods extend linear classifiers to nonlinear boundaries via implicit feature maps; scalability can be limiting on very large datasets. Ensembling, bagging, and boosting reduce variance or bias, while Bayesian approaches quantify uncertainty when priors and inference approximations are tractable.

When to use classical ML

Tabular data with heterogeneous features, small-to-medium datasets, or strict interpretability requirements often favor trees, linear models, or shallow networks.

When deep learning shines

Unstructured inputs—vision, speech, language—benefit from hierarchical representations learned by deep neural networks when data and compute are sufficient.

Machine learning is not a single recipe; it is a portfolio of assumptions, objectives, and compute budgets. Strong practice pairs modeling with data governance, rigorous evaluation, and monitoring after deployment—because real-world distributions drift and yesterday’s fit becomes tomorrow’s liability without maintenance.

Evaluation, leakage, and deployment #

Sound ML projects separate concerns across training, validation (for hyperparameters), and test sets—or use nested cross-validation when data are scarce. Data leakage, in which features accidentally encode information unavailable at prediction time (future values, test-set statistics, target-derived aggregates), can inflate metrics and cause painful failures after release. Temporal splits for time-series data, group splits when samples cluster (same patient, same user), and careful deduplication for near-duplicate text or images are essential hygiene steps.
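A temporal split is the simplest of these hygiene steps: training data must strictly precede test data. A small sketch with (timestamp, value) records:

```python
# Temporal split: everything before the cutoff trains, everything at or
# after it tests, so no future information can leak into training.

def temporal_split(records, cutoff):
    records = sorted(records, key=lambda r: r[0])
    train = [r for r in records if r[0] < cutoff]
    test = [r for r in records if r[0] >= cutoff]
    return train, test

data = [(3, "c"), (1, "a"), (4, "d"), (2, "b"), (5, "e")]
train, test = temporal_split(data, cutoff=4)
```

Note that a random shuffle here would scatter future records into the training set, which is exactly the leakage a temporal split prevents; group splits apply the same idea to patient or user identifiers instead of time.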

After deployment, monitoring tracks input distributions, model confidence, and business KPIs. Triggers for retraining include concept drift, new categories of inputs, or policy changes. Documenting lineage—data sources, code versions, random seeds—supports audits and reproducibility, especially in regulated industries. Machine learning engineering is as much about reliable operations as about algorithmic novelty.
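Input-distribution monitoring can start very simply. The sketch below flags a batch whose mean has shifted by more than a threshold number of reference standard deviations; real systems use richer tests (Kolmogorov–Smirnov, population-stability metrics), so treat this as an illustrative baseline:

```python
import statistics

# Crude drift check: compare an incoming batch's mean against the training
# reference, measured in reference standard deviations.

def mean_shift_alert(reference, batch, threshold=3.0):
    ref_mean = statistics.mean(reference)
    ref_std = statistics.pstdev(reference) or 1.0  # guard constant features
    shift = abs(statistics.mean(batch) - ref_mean) / ref_std
    return shift > threshold

reference = [10.0, 11.0, 9.0, 10.5, 9.5]
ok = mean_shift_alert(reference, [10.2, 9.8, 10.1])      # in-distribution batch
drift = mean_shift_alert(reference, [50.0, 51.0, 49.0])  # clear shift
```

An alert like this is a retraining trigger, not a verdict: it tells the team to inspect the inputs, not to assume the model is wrong.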

Fairness-aware ML adds another layer: measuring disparity across groups, testing robustness to missing or noisy attributes, and exploring post-processing or constrained optimization when business rules require parity constraints. These topics connect engineering choices to organizational values and legal obligations—reminding us that machine learning is both a technical and a social discipline.