ML glossary
There may be other ML glossaries (e.g. Google's), but this one is mine, for terms that I have encountered. Sidenote: it’s possible to sort HTML tables using this Chrome extension.
| Term | Domain | Definition |
|---|---|---|
| best fit line | statistics | The line learned by a regression model; predictions are based on this line; can be a hyperplane in higher dimensions. |
| classification | statistics | Predict a categorical dependent variable |
| correlation | statistics | a relationship or pattern exists between variables; compare causation |
| causation | statistics | one event causes another; established using controlled experiments that compare differing outcomes; compare correlation |
| coefficient of determination | statistics | see r-squared |
| collinearity | statistics | |
| dependent variable | statistics | Variable that will be predicted |
| ground truth | statistics | actual observed values |
| multicollinearity | statistics | one independent variable can be predicted from another independent variable |
| independent identically distributed (IID) variables | statistics | |
| independent variable | statistics | Variable used in making a prediction |
| heteroscedasticity | statistics | non-constant variance of the residuals (errors); violates an assumption of the regression model; Greek: hetero=different; skedasis=dispersion |
| log transformation | statistics | |
| r-squared | statistics | the proportion of dependent-variable variation accounted for by the predictor(s); \(1 - \frac{SSE}{SSTO}\) |
| regression | statistics | Predict a continuous dependent variable |
| variance inflation factor (VIF) | statistics | measure of multicollinearity against other feature(s); >10 = high multicollinearity; \(\textrm{VIF}_j = \frac{1}{1 - R^2_j}\) (see r-squared; see sketch after the table) |
| Correlation coefficient | statistics | |
| \(RSS\) | statistics | Residual sum of squares |
| residual plot | statistics | |
| mean squared error (MSE) | statistics | metric to evaluate a regression model, predicted vs actual; \(\frac{1}{n}\sum{(y - \hat{y})^2}\) |
| endogenous variable | statistics | dependent variable; i.e. having an internal cause or origin |
| exogenous variable | statistics | independent variable; i.e. relating or developing from external factors |
| maximum likelihood estimation (MLE) | statistics | process to estimate the most likely parameters of a population distribution, given a sample |
| one-vs-all classification | ml | train classifier for each class vs rest; for predictions, pick the one that is most confident |
| logistic regression | ml | classification algorithm (not regression) |
| decision boundary | ml | in logistic regression, the line or hyperplane that divides the plane or space; a function of parameters; parameters are function of training data; possibly non-linear |
| massage data | ml | |
| grid search | ml | used to exhaustively search a subset of the hyperparameter space for optimal hyperparameters, using some kind of metric (see sketch after the table) |
| hyperparameter | ml | “meta” parameter used to control the learning algorithm itself (and not the model) |
| one-hot encoding | ml | widen categorical feature -> each possible value represented by boolean column; prevents model from learning accidental ordinal relationship (see sketch after the table) |
| training set | ml | used to fit the model |
| test set | ml | used for unbiased evaluation of model |
| cross-validation (CV) set | ml | |
| purity | decision trees | measures the proportion of a class in a partition |
| Gini Index | decision trees | metric used to determine purity (minimize) |
| entropy | decision trees | metric used to determine purity (minimize) |
| cost complexity | decision trees | metric used during pruning |
| pruning | decision trees | used to prevent overfitting and improve generalizability: let the tree grow to full depth, then use cost complexity to limit height and remove edges |
| ensemble methods | decision trees | |
| bagging | decision trees | Bootstrap AGgregation; random sub-sampling reduces variance |
| boosting | decision trees | reduces bias; good for unbalanced data |
| AdaBoost | decision trees | Adaptive Boosting |
| Gradient Tree Boosting | decision trees | |
| GentleBoost | decision trees | |
| BrownBoost | decision trees | |
| XGBoost | decision trees | Extreme Gradient Boosting; introduced in 2014; fast; |
| Quantile Sketch Algorithm | decision trees | |
| stacking | decision trees | |
| SMOTE | classification | Synthetic Minority Oversampling Technique; synthetically oversample the minority class in imbalanced data |
| class weighting | classification | |
| norm | linear algebra | length of a vector; \(||v||\) |
| projection | linear algebra | |
| principal component analysis (PCA) | linear algebra | |
| subspace | linear algebra | |
| eigenvector | linear algebra | |
| covariance matrix | linear algebra | |
| singular value decomposition | linear algebra | |
| determinant | linear algebra | |
| projection errors | PCA | find lower-dimensional surface onto which to project data, while minimizing projection error |
| kernel trick | svm | compute dot products in a higher-dimensional space without explicitly transforming the data into that space; PDF by Eric Kim, SO |
| radial basis function | svm | (see also: Cover’s theorem) used in ML to add dimensionality to data, so that it can be made linearly separable in higher dimensions |
| kernel | svm | similarity function (between “landmarks” and features); see sketch after the table |
| large margin classifier | svm | another name for support vector machine |
| margin | svm | the distance between the two hyperplanes parallel to the decision boundary that pass through the nearest samples; a larger margin leads to a better decision boundary |
| decision boundary | svm | perpendicular to parameters \(\theta\) (why?) |
| network architecture | neural networks | how layers are connected to each other |
| input layer | neural networks | data comes into this layer |
| hidden layer | neural networks | intermediate layer(s) |
| output layer | neural networks | the final value(s) computed by the hypothesis |
| activation function | neural networks | |
| (artificial) neural network | neural networks | a group of neurons strung together |
| bias unit | neural networks | neuron at index 0 in any layer; always outputs value of 1 |
| precision | ml | fraction of classified positives that actually are positive; \(\frac{TP}{TP + FP}\) |
| recall | ml | fraction of actual positives that we classified as positive; \(\frac{TP}{TP + FN}\) |
| accuracy | ml | fraction of correctly classified classes, regardless of actual value; flawed when skewed classes |
| f-score (f1-score) | ml | single metric between 0 and 1 combining precision and recall; \(F_1 = 2 \frac{PR}{P + R}\) (see sketch after the table) |
| skewed classes | ml | many more examples of one class than the other; as in cancer vs no cancer or purchased vs not purchased |
| false positive | ml | hypothesis predicted positive result; actual is negative |
| false negative | ml | hypothesis predicted negative result; actual is positive |
| validation set | ml | used for unbiased evaluation of model while setting hyperparameters |
| overfitting | ml | model is too complex, has learned the noise in the data; low bias, high variance; |
| underfitting | ml | model is too simple; high bias, low variance |
| objective function | ml | the function to maximize or minimize |
| unsupervised learning | ml | find patterns in the data without ground truth labels; e.g. recommender systems |
| clustering | ml | explore/visualize data, reduce data scale, detect outliers/anomalies, deduplicate records; market segmentation, analyze social networks/astronomical data, organize computing clusters |
| k-means algorithm | ml | most popular clustering algorithm; inputs: K (number of clusters), training set X |
| dimensionality reduction techniques | ml | Aid computation and identify outliers; e.g. Principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) |
| recommender systems | ml | |
| sample complexity | ml | |
| zero-shot learning | ml | recognize a category without having seen prior examples of that category; aka “meta-learning” |
| convolution | convolutional NNs | output of the operation is a new matrix where each element is the sum of the element-wise product of the filter and a same-sized region of the image, as the filter slides over the image; operation denoted by an asterisk |
| cross-correlation | convolutional NNs | |
| filter | convolutional NNs | used for edge detection; weights can be learned by network; aka “kernel” |
| padding | convolutional NNs | |
| stride | convolutional NNs | |
| recognition | zero-shot learning | nearest category vector based on sample |
| retrieval | zero-shot learning | nearest sample based on category vector |
| semantic transfer | zero-shot learning | |
| domain ontology | zero-shot learning | |
| category vector | zero-shot learning | instead of single category label |
| content-based recommendation | recommender systems | |
| collaborative filtering | recommender systems | |
| co-training | semi-supervised learning | |
| branch and bound | semi-supervised learning | |
| S3VM | semi-supervised learning | |
| density estimation | anomaly detection | |
| cost function | k-means clustering | average of the squared distance between each example and its assigned cluster centroid; \(\frac{1}{m} \sum_{i=1}^{m} \| x^{(i)} - \mu_{c^{(i)}} \|^2\) (see sketch after the table) |
| distortion | k-means clustering | another name for the cost function/optimization objective of k-means |
| cluster centroid | k-means clustering | randomly initialize K centroids to the positions of K random samples; to avoid local optima, try 50-1000 random initializations and pick the lowest cost |
| cluster assignment step | k-means clustering | assign each point in the dataset to the cluster centroid it is closest to |
| move centroid step | k-means clustering | move each centroid to the mean location of all the points assigned to that centroid |
| stopping criteria | k-means clustering | centroids do not change position; points remain in same cluster; max iterations reached |
| elbow method | k-means clustering | (doesn’t work reliably) algorithm to choose optimal K by graphing the cost J as a function of K; find the “elbow” where J begins to decrease less rapidly |
| choosing K | k-means clustering | usually K-means is used for some other downstream purpose; evaluate different values of K by feeding the clusters into the downstream algorithm; see also: elbow method |
| DBSCAN | density-based clustering | Density-based spatial clustering of applications with noise |
| level set trees | density-based clustering | |
| gradient descent | optimization | Iterative algorithm to determine parameters \(\theta\) that will yield best fit line |
| L1 regularization | optimization | “lasso”; penalty term is sum of abs. value of weights; leads to sparse solution, some \(\theta_{j} = 0\) |
| L2 regularization | optimization | “ridge”; penalty term is sum of squares of weights; helps to shrink \(\theta\) |
| elastic net regularization | optimization | in-between L1 and L2; available in scikit-learn (use l1_ratio to determine how much L1 vs L2 is applied); see sketch after the table |
| normal equation | optimization | Equation to determine parameters \(\theta\) that will yield the best fit line; requires calculating the inverse of a matrix, which is \(O(N^3)\) (see sketch after the table) |
| regularization | optimization | technique used to prevent higher-order/more-complex models; extra penalty term(s) on large parameter values, e.g. \(\sum|\theta_j|\) or \(\sum\theta_j^2\) |
| Probability density function | probability | |
| union \(\cup\) | set theory | logical OR; elements that are in either set |
| intersection \(\cap\) | set theory | logical AND; elements that are in both sets |
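Below are a few illustrative sketches for terms referenced above; the toy data, parameter values, and helper names are my own assumptions, not prescriptions. First, a minimal VIF sketch: apply \(\textrm{VIF}_j = \frac{1}{1 - R^2_j}\) by regressing each feature on the remaining features (assumes numpy and scikit-learn).

```python
# Minimal VIF sketch: regress each feature on the remaining features and
# apply VIF_j = 1 / (1 - R^2_j). Toy data with two nearly collinear columns.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 2 * x1 + rng.normal(scale=0.1, size=100)  # nearly collinear with x1
x3 = rng.normal(size=100)
X = np.column_stack([x1, x2, x3])

for j in range(X.shape[1]):
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    print(f"VIF for feature {j}: {1.0 / (1.0 - r2):.1f}")
# Features 0 and 1 should show VIF >> 10; feature 2 should stay near 1.
```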
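A minimal grid search sketch using scikit-learn’s GridSearchCV; the estimator, grid values, and scoring metric are illustrative choices.

```python
# Minimal grid search sketch: exhaustively evaluate a small hyperparameter
# grid with 5-fold cross-validation and report the best combination.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}  # illustrative values
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```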
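A minimal one-hot encoding sketch with pandas; the column name and category values are made up.

```python
# Minimal one-hot encoding sketch: each possible category value becomes its
# own boolean indicator column, so no accidental ordering is implied.
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
print(pd.get_dummies(df, columns=["color"]))
# Produces color_blue, color_green, color_red indicator columns.
```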
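A minimal sketch of an RBF kernel as a similarity function between a sample and a landmark; the bandwidth sigma and the points are assumptions.

```python
# Minimal RBF kernel sketch: similarity is near 1 when x is close to the
# landmark and decays toward 0 as the distance grows.
import numpy as np

def rbf_similarity(x, landmark, sigma=1.0):
    return np.exp(-np.sum((x - landmark) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
landmark = np.array([1.5, 2.5])
print(rbf_similarity(x, landmark))          # high: x is near the landmark
print(rbf_similarity(x, landmark * 10.0))   # low: x is far from the landmark
```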
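A small worked example of precision, recall, F1, and accuracy; the confusion-matrix counts are invented purely for illustration.

```python
# Worked example: precision/recall/F1/accuracy from made-up counts.
TP, FP, FN, TN = 30, 10, 5, 55

precision = TP / (TP + FP)                          # 0.75
recall = TP / (TP + FN)                             # ~0.857
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (TP + TN) / (TP + FP + FN + TN)          # 0.85
print(precision, recall, f1, accuracy)
```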
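A minimal k-means sketch showing the cluster assignment step, the move centroid step, and the distortion (cost); the toy 2-D data, K, and iteration count are assumptions.

```python
# Minimal k-means sketch on toy 2-D data: alternate assignment and
# move-centroid steps, then compute the distortion (average squared distance).
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
K = 2
centroids = X[rng.choice(len(X), K, replace=False)]  # init at K random samples

for _ in range(10):
    # cluster assignment step: each point goes to its closest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # move centroid step: each centroid moves to the mean of its points
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])

labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
distortion = np.mean(np.linalg.norm(X - centroids[labels], axis=1) ** 2)
print(centroids, distortion)
```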
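A minimal sketch contrasting L1 (lasso), L2 (ridge), and elastic net in scikit-learn; the alpha and l1_ratio values and the toy dataset are illustrative.

```python
# Minimal regularization sketch: lasso tends to zero out coefficients,
# ridge shrinks them, elastic net mixes both penalties via l1_ratio.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)

print((lasso.coef_ == 0).sum(), "coefficients zeroed by lasso")
print(abs(ridge.coef_).max(), "largest ridge coefficient (shrunk, not zeroed)")
print((enet.coef_ == 0).sum(), "coefficients zeroed by elastic net")
```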
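A minimal normal equation sketch, \(\theta = (X^T X)^{-1} X^T y\), on toy data; np.linalg.solve is used rather than forming the inverse explicitly.

```python
# Minimal normal equation sketch: recover known parameters from noisy data.
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])  # bias column
true_theta = np.array([1.0, 2.0, -3.0])
y = X @ true_theta + rng.normal(scale=0.1, size=100)

theta = np.linalg.solve(X.T @ X, X.T @ y)  # solves (X^T X) theta = X^T y
print(theta)  # close to [1, 2, -3]
```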