There may be other ML glossaries (e.g. Google’s), but this one will be mine, for terms that I have encountered. Sidenote: it’s possible to sort HTML tables using this Chrome extension.

Term Domain Definition
best fit line statistics The line learned by a regression model; predictions are based on this line; in higher dimensions it is a hyperplane.
classification statistics Predict a categorical dependent variable
correlation statistics a relationship or pattern exists between variables; compare causation
causation statistics one event causes another; determined using controlled experiments with differing outcomes; compare correlation
coefficient of determination statistics see r-squared
collinearity statistics two independent variables are (nearly) linearly related; see multicollinearity
dependent variable statistics Variable that will be predicted
ground truth statistics actual observed values
multicollinearity statistics one independent variable can be predicted from another independent variable
independent identically distributed (IID) variables statistics variables that are mutually independent and all drawn from the same probability distribution
independent variable statistics Variable used in making a prediction
heteroscedasticity statistics non-constant variance of the residuals (errors); violates an assumption of the regression model; Greek: hetero=different; skedasis=dispersion
log transformation statistics  
r-squared statistics the proportion of dependent variable variation accounted for by the model; \(1 - \frac{SSE}{SSTO}\)
regression statistics Predict a continuous dependent variable
variance inflation factor (VIF) statistics measure of a feature’s multicollinearity with the other feature(s); >10 = high multicollinearity; \(\textrm{VIF}_j = \frac{1}{1 - R^2_j}\), where \(R^2_j\) comes from regressing feature \(j\) on the others (see r-squared); see the VIF sketch after the table
Correlation coefficient statistics measures the strength and direction of a linear relationship between two variables; ranges from -1 to 1
\(RSS\) statistics Residual sum of squares
residual plot statistics  
mean squared error (MSE) statistics metric to evaluate a regression model, predicted vs. actual; \(\frac{1}{n}\sum{(y - \hat{y})^2}\)
endogenous variable statistics dependent variable; i.e. having an internal cause or origin
exogenous variable statistics independent variable; i.e. relating or developing from external factors
maximum likelihood estimation (MLE) statistics process to estimate the most likely parameters of a population distribution, given a sample
one-vs-all classification ml train one classifier for each class vs. the rest; to predict, pick the class whose classifier is most confident; see the one-vs-all sketch after the table
logistic regression ml classification algorithm (not regression)
decision boundary ml in logistic regression, the line or hyperplane that divides the plane or space; a function of the parameters; the parameters are a function of the training data; possibly non-linear
massage data ml  
grid search ml used to exhaustively search a subset of the hyperparameter space for optimal hyperparameters, using some kind of metric
hyperparameter ml “meta” parameter used to control the learning algorithm itself (and not the model)
one-hot encoding ml widen a categorical feature so that each possible value is represented by a boolean column; prevents the model from learning an accidental ordinal relationship; see the one-hot sketch after the table
training set ml used to fit the model
test set ml used for unbiased evaluation of model
cross-validation (CV) set ml another name for the validation set (see validation set)
purity decision trees measures the proportion of a class in a partition
Gini Index decision trees metric used to determine purity (minimize)
entropy decision trees metric used to determine purity (minimize)
cost complexity decision trees metric used during pruning
pruning decision trees used to prevent overfitting and improve generalizability: let the tree grow to full depth, then use cost complexity to limit its height and remove edges
ensemble methods decision trees  
bagging decision trees Bootstrap AGgregation; random sub-sampling reduces variance
boosting decision trees reduces bias; good for unbalanced data
AdaBoost decision trees Adaptive Boosting
Gradient Tree Boosting decision trees  
GentleBoost decision trees  
BrownBoost decision trees  
XGBoost decision trees Extreme Gradient Boosting; introduced in 2014; fast
Quantile Sketch Algorithm decision trees  
stacking decision trees  
SMOTE classification Synthetic Minority Oversampling Technique; oversamples the minority class of an imbalanced dataset by synthesizing new examples
class weighting classification  
norm linear algebra length of a vector; \(||v||\)
projection linear algebra  
principal component analysis (PCA) linear algebra  
subspace linear algebra  
eigenvector linear algebra  
covariance matrix linear algebra  
singular value decomposition linear algebra  
determinant linear algebra  
projection errors PCA the distances between the data points and the lower-dimensional surface onto which they are projected; PCA chooses the surface that minimizes these errors
kernel trick svm compute dot products in a higher-dimensional space without explicitly transforming the data into that space; PDF by Eric Kim, SO
radial basis function svm (see also: Cover’s theorem) used in ML to add dimensionality to data, so that it can be made linearly separable in higher dimensions
kernel svm similarity function (between “landmarks” and features)
large margin classifier svm another name for support vector machine
margin svm the distance between two hyperplanes, parallel to the decision boundary, that pass through the closest points of each class; a larger margin leads to a better decision boundary
decision boundary svm perpendicular to the parameter vector \(\theta\), because the boundary is the hyperplane where \(\theta^T x = 0\), and \(\theta\) is the normal vector of that hyperplane
network architecture neural networks how layers are connected to each other
input layer neural networks data comes into this layer
hidden layer neural networks intermediate layer(s)
output layer neural networks the final value(s) computed by the hypothesis
activation function neural networks non-linear function applied to a neuron’s weighted inputs, e.g. sigmoid, tanh, ReLU
(artificial) neural network neural networks a group of neurons strung together
bias unit neural networks neuron at index 0 in any layer; always outputs value of 1
precision ml fraction of classified positives that actually are positive; \(\frac{TP}{TP + FP}\)
recall ml fraction of actual positives that we classified as positive; \(\frac{TP}{TP + FN}\)
accuracy ml fraction of correctly classified examples, regardless of class; flawed when classes are skewed
f-score (f1-score) ml single metric between 0 and 1 that combines precision and recall; \(F_1 = 2 \frac{PR}{P + R}\); see the precision/recall sketch after the table
skewed classes ml many more examples of one class than the other; as in cancer vs. no cancer, or purchased vs. not purchased
false positive ml hypothesis predicted positive result; actual is negative
false negative ml hypothesis predicted negative result; actual is positive
validation set ml used for unbiased evaluation of model while setting hyperparameters
overfitting ml model is too complex, has learned the noise in the data; low bias, high variance
underfitting ml model is too simple; high bias, low variance
objective function ml the function to maximize or minimize
unsupervised learning ml find patterns in the data without ground truth labels; e.g. recommender systems
clustering ml used to explore/visualize data, reduce data scale, detect outliers/anomalies, and deduplicate records; applications: market segmentation, social network analysis, astronomical data, organizing computing clusters
k-means algorithm ml most popular clustering algorithm; inputs: K (number of clusters), training set X; see the k-means sketch after the table
dimensionality reduction techniques ml Aid computation and identify outliers; e.g. Principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE)
recommender systems ml  
sample complexity ml  
zero-shot learning ml recognize a category without having seen prior examples of that category; related to “meta-learning”
convolution convolutional NNs output is a new matrix: the filter is repeatedly “pasted over” each same-sized region of the image, multiplied element-wise, and summed; operation denoted by an asterisk
cross-correlation convolutional NNs  
filter convolutional NNs used for edge detection; weights can be learned by network; aka “kernel”
padding convolutional NNs  
stride convolutional NNs  
recognition zero-shot learning nearest category vector based on sample
retrieval zero-shot learning nearest sample based on category vector
semantic transfer zero-shot learning  
domain ontology zero-shot learning  
category vector zero-shot learning instead of single category label
content-based recommendation recommender systems  
collaborative filtering recommender systems  
co-training semi-supervised learning  
branch and bound semi-supervised learning  
S3VM semi-supervised learning  
density estimation anomaly detection  
cost function k-means clustering average of the squared distance between each example and its assigned cluster centroid; \(\frac{1}{m} \sum_{i=1}^{m} \| x^{(i)} - \mu_{c^{(i)}} \|^2\)
distortion k-means clustering another name for the cost function/optimization objective of k-means
cluster centroid k-means clustering randomly initialize the K centroids to the positions of K random samples; to avoid local optima, try 50-1000 initializations and pick the one with the lowest cost
cluster assignment step k-means clustering assign each point in the dataset to the cluster centroid it is closest to
move centroid step k-means clustering move each centroid to the mean location of all the points assigned to that centroid
stopping criteria k-means clustering centroids do not change position; points remain in same cluster; max iterations reached
elbow method k-means clustering (doesn’t work reliably) algorithm to choose optimal K by graphing J as a function of K; find the “elbow” where J begins to decrease less rapidly
choosing K k-means clustering usually K-means is used for some other downstream purpose; evaluate different values of K by feeding the clusters into downstream algorithm; see also: elbow method
DBSCAN density-based clustering Density-based spatial clustering of applications with noise
level set trees density-based clustering  
gradient descent optimization Iterative algorithm to determine parameters \(\theta\) that minimize a cost function, e.g. to yield the best fit line; see the regression sketch after the table
L1 regularization optimization “lasso”; penalty term is sum of abs. value of weights; leads to sparse solution, some \(\theta_{j} = 0\)
L2 regularization optimization “ridge”; penalty term is sum of squares of weights; helps to shrink \(\theta\)
elastic net regularization optimization in-between L1 and L2; available in scikit-learn (use l1_ratio to determine how much L1 vs L2 is applied)
normal equation optimization Equation to determine parameters \(\theta\) that will yield the best fit line, \(\theta = (X^T X)^{-1} X^T y\); requires calculating the inverse of a matrix, \(O(N^3)\)
regularization optimization technique used to discourage higher-order/more-complex models; extra term(s) penalize large parameter values, e.g. adding \(\lambda \sum_j \theta_j^2\) to the cost
Probability density function probability function whose integral over an interval gives the probability that a continuous random variable falls in that interval
union \(\cup\) set theory logical OR; elements that are in either set
intersection \(\cap\) set theory logical AND; elements that are in both sets
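
Below are a few Python sketches to make some of the entries above concrete; all data in them is made up. First, the variance inflation factor: a minimal implementation that regresses each feature on the others with least squares and plugs the resulting \(R^2_j\) into \(1/(1 - R^2_j)\). The data is constructed so that one feature is nearly a linear combination of the others.

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
    feature j on all of the other features (least squares)."""
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.c_[np.ones(len(X)), others]  # add an intercept column
        coef, *_ = np.linalg.lstsq(A, target, rcond=None)
        residuals = target - A @ coef
        r2 = 1 - residuals @ residuals / ((target - target.mean()) ** 2).sum()
        vifs.append(1 / (1 - r2))
    return vifs

# Made-up data: x2 is nearly a linear combination of x0 and x1,
# so the VIFs should come out high (> 10).
rng = np.random.default_rng(0)
x0, x1 = rng.normal(size=(2, 200))
x2 = 2 * x0 - x1 + rng.normal(scale=0.05, size=200)
print(vif(np.c_[x0, x1, x2]))
```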
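One-vs-all classification, sketched with scikit-learn’s LogisticRegression on the iris dataset: one binary classifier per class, predicting via whichever classifier is most confident.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Train one binary classifier per class: "this class" vs. "the rest".
classifiers = [
    LogisticRegression(max_iter=1000).fit(X, (y == k).astype(int))
    for k in np.unique(y)
]

# To predict, ask every classifier for its confidence that the sample
# belongs to "its" class, and pick the most confident one.
confidences = np.column_stack([clf.predict_proba(X)[:, 1] for clf in classifiers])
predictions = confidences.argmax(axis=1)
print((predictions == y).mean())  # training accuracy of the ensemble
```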
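One-hot encoding with pandas, on a hypothetical color feature: the single categorical column is widened into one boolean column per value.

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Each possible value becomes its own column (color_blue, color_green,
# color_red), so the model cannot learn an accidental ordering such as
# blue < green < red.
print(pd.get_dummies(df, columns=["color"]))
```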
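Precision, recall, and F1 from raw confusion counts; the counts below are made up to illustrate.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Single metric between 0 and 1 combining precision and recall."""
    precision = tp / (tp + fp)  # fraction of predicted positives that are correct
    recall = tp / (tp + fn)     # fraction of actual positives that were found
    return 2 * precision * recall / (precision + recall)

# Made-up counts: 8 true positives, 2 false positives, 4 false negatives.
# Precision = 0.8, recall = 2/3, F1 ≈ 0.73.
print(f1_score(tp=8, fp=2, fn=4))
```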
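A minimal NumPy sketch of k-means that alternates the cluster assignment and move centroid steps and reports the distortion; the blob data and K are made up, and it assumes no cluster ends up empty.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize: set the K centroids to the positions of K random samples.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Cluster assignment step: each point goes to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move centroid step: each centroid moves to the mean of its points
        # (assumes every cluster keeps at least one point).
        moved = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(moved, centroids):  # stopping criterion: no movement
            break
        centroids = moved
    # Distortion: average squared distance to the assigned centroid.
    cost = ((X - centroids[labels]) ** 2).sum(axis=1).mean()
    return labels, centroids, cost

# Made-up data: two blobs; rerun with several seeds and keep the
# lowest-cost result to avoid local optima.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
best = min((kmeans(X, k=2, seed=s) for s in range(10)), key=lambda r: r[2])
print(best[2])
```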
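Finally, the regression sketch: the normal equation vs. gradient descent on the same made-up least-squares problem. The learning rate and iteration count are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100
X = np.c_[np.ones(m), rng.uniform(0, 10, m)]  # design matrix with a bias column
y = 3 + 2 * X[:, 1] + rng.normal(0, 1, m)     # noisy line y = 3 + 2x

# Normal equation: direct solution, but inverting X^T X is O(N^3)
# in the number of features.
theta_closed = np.linalg.inv(X.T @ X) @ X.T @ y

# Gradient descent: repeatedly step opposite the gradient of the MSE cost.
theta, alpha = np.zeros(2), 0.01
for _ in range(5000):
    gradient = (2 / m) * X.T @ (X @ theta - y)
    theta -= alpha * gradient

print(theta_closed)  # both should be close to [3, 2]
print(theta)
```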