ML glossary
There may be other ML glossaries (e.g. Google’s), but this one will be mine, for terms that I have encountered. Sidenote: it’s possible to sort HTML tables using this Chrome extension.
Term | Domain | Definition |
---|---|---|
best fit line | statistics | The line learned by a regression model; predictions are based on this line; can be a hyperplane in higher dimensions. |
classification | statistics | Predict a categorical dependent variable |
correlation | statistics | relationship or pattern exists between data; compare causation |
causation | statistics | one event causes another; determined using experiment with differing outcomes; compare correlation |
coefficient of determination | statistics | see r-squared |
collinearity | statistics | |
dependent variable | statistics | Variable that will be predicted |
ground truth | statistics | actual observed values |
multicollinearity | statistics | one independent variable can be predicted from another independent variable |
independent identically distributed (IID) variables | statistics | |
independent variable | statistics | Variable used in making a prediction |
heteroscedasticity | statistics | non-constant variance of the residuals (errors) across predictions; violates a regression assumption and invalidates the usual standard errors; Greek: hetero=different; skedasis=dispersion
log transformation | statistics | |
r-squared | statistics | the proportion of dependent variable variation accounted for by the model; \(1 - \frac{SSE}{SSTO}\)
regression | statistics | Predict a continuous dependent variable |
variance inflation factor (VIF) | statistics | measure of one feature's multicollinearity with the other feature(s); >10 = high multicollinearity; \(\textrm{VIF}_j = \frac{1}{1 - R^2_j}\) (see r-squared); example after the table
Correlation coefficient | statistics | |
\(RSS\) | statistics | Residual sum of squares |
residual plot | statistics | |
mean squared error (MSE) | statistics | metric to evaluate a regression model, predicted vs actual; \(\frac{1}{n}\sum{(y - \hat{y})}^2\)
endogenous variable | statistics | dependent variable; i.e. having an internal cause or origin |
exogenous variable | statistics | independent variable; i.e. relating or developing from external factors |
maximum likelihood estimation (MLE) | statistics | process to estimate the most likely parameters of a population distribution, given a sample |
one-vs-all classification | ml | train classifier for each class vs rest; for predictions, pick the one that is most confident |
logistic regression | ml | classification algorithm (not regression) |
decision boundary | ml | in logistic regression, the line or hyperplane that divides the plane or space; a function of the parameters; the parameters are a function of the training data; possibly non-linear
massage data | ml | |
grid search | ml | used to exhaustively search a subset of the hyperparameter space for optimal hyperparameters, scored by some kind of metric; example after the table
hyperparameter | ml | “meta” parameter used to control the learning algorithm itself (and not the model) |
one-hot encoding | ml | widen a categorical feature -> each possible value represented by its own boolean column; prevents the model from learning an accidental ordinal relationship; example after the table
training set | ml | used to fit the model |
test set | ml | used for unbiased evaluation of model |
cross-validation (CV) set | ml | |
purity | decision trees | measures the proportion of a class in a partition |
Gini Index | decision trees | metric used to determine purity (minimize) |
entropy | decision trees | metric used to determine purity (minimize) |
cost complexity | decision trees | metric used during pruning |
pruning | decision trees | used to prevent overfitting and improve generalizability: let the tree grow to full depth, then use cost complexity to limit its height and remove branches
ensemble methods | decision trees | |
bagging | decision trees | Bootstrap AGgregation; random sub-sampling reduces variance
boosting | decision trees | reduces bias; good for imbalanced data
AdaBoost | decision trees | Adaptive Boosting |
Gradient Tree Boosting | decision trees | |
GentleBoost | decision trees | |
BrownBoost | decision trees | |
XGBoost | decision trees | Extreme Gradient Boosting; introduced in 2014; fast
Quantile Sketch Algorithm | decision trees | |
stacking | decision trees | |
SMOTE | classification | Synthetic Minority Oversampling Technique; oversamples the minority class in imbalanced data
class weighting | classification | |
norm | linear algebra | length of a vector; \(||v||\) |
projection | linear algebra | |
principal component analysis (PCA) | linear algebra | |
subspace | linear algebra | |
eigenvector | linear algebra | |
covariance matrix | linear algebra | |
singular value decomposition | linear algebra | |
determinant | linear algebra | |
projection errors | PCA | find lower-dimensional surface onto which to project data, while minimizing projection error |
kernel trick | svm | implicitly transform data into higher dimensions by computing only the dot products in the higher-dimensional space, never the transformed features themselves; PDF by Eric Kim, SO
radial basis function | svm | (see also: Cover’s theorem) used in ML to add dimensionality to data, so that it can be made linearly separable in higher dimensions |
kernel | svm | similarity function (between “landmarks” and features) |
large margin classifier | svm | another name for support vector machine |
margin | svm | the distance between the two hyperplanes parallel to the decision boundary that pass through the closest training points; maximizing it leads to a better decision boundary
decision boundary | svm | perpendicular to the parameter vector \(\theta\), because the boundary is the set of points where \(\theta^T x = 0\), so \(\theta\) is its normal vector
network architecture | neural networks | how layers are connected to each other |
input layer | neural networks | data comes into this layer |
hidden layer | neural networks | intermediate layer(s) |
output layer | neural networks | the final value(s) computed by the hypothesis |
activation function | neural networks | |
(artificial) neural network | neural networks | a group of neurons strung together |
bias unit | neural networks | neuron at index 0 in any layer; always outputs value of 1 |
precision | ml | fraction of classified positives that actually are positive; \(\frac{TP}{TP + FP}\) |
recall | ml | fraction of actual positives that we classified as positive; \(\frac{TP}{TP + FN}\) |
accuracy | ml | fraction of correctly classified examples, regardless of class; misleading when classes are skewed
f-score (f1-score) | ml | single metric between 0 and 1 combining precision and recall; \(F_1 = 2 \frac{PR}{P + R}\); worked example after the table
skewed classes | ml | many more examples of one class than the other; as in cancer vs no cancer or purchased vs not purchased
false positive | ml | hypothesis predicted positive result; actual is negative |
false negative | ml | hypothesis predicted negative result; actual is positive |
validation set | ml | used for unbiased evaluation of model while setting hyperparameters |
overfitting | ml | model is too complex, has learned the noise in the data; low bias, high variance; |
underfitting | ml | model is too simple; high bias, low variance |
objective function | ml | the function to maximize or minimize |
unsupervised learning | ml | find patterns in the data without ground truth labels; e.g. recommender systems |
clustering | ml | explore/visualize data, reduce data scale, detect outliers/anomalies, deduplicate records; market segmentation, analyze social networks/astronomical data, organize computing clusters |
k-means algorithm | ml | most popular clustering algorithm; inputs: K (number of clusters), training set X |
dimensionality reduction techniques | ml | Aid computation and identify outliers; e.g. Principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) |
recommender systems | ml | |
sample complexity | ml | |
zero-shot learning | ml | recognize a category without having seen prior examples of that category; aka “meta-learning” |
convolution | convolutional NNs | output of the operation is a new matrix, obtained by sliding the filter over each same-sized region of the image and summing the element-wise products; operation denoted by an asterisk; sketch after the table
cross-correlation | convolutional NNs | |
filter | convolutional NNs | used for edge detection; weights can be learned by network; aka “kernel” |
padding | convolutional NNs | |
stride | convolutional NNs | |
recognition | zero-shot learning | nearest category vector based on sample |
retrieval | zero-shot learning | nearest sample based on category vector |
semantic transfer | zero-shot learning | |
domain ontology | zero-shot learning | |
category vector | zero-shot learning | instead of single category label |
content-based recommendation | recommender systems | |
collaborative filtering | recommender systems | |
co-training | semi-supervised learning | |
branch and bound | semi-supervised learning | |
S3VM | semi-supervised learning | |
density estimation | anomaly detection | |
cost function | k-means clustering | average of the squared distance between each example and its assigned cluster centroid; \(\frac{1}{m} \sum_{i=1}^{m} \| x^{(i)} - \mu_{c^{(i)}} \|^2\)
distortion | k-means clustering | another name for the cost function/optimization objective of k-means |
cluster centroid | k-means clustering | randomly initialize the K centroids to the positions of K random samples; to avoid local optima, try 50-1000 initializations and pick the lowest cost; sketch after the table
cluster assignment step | k-means clustering | assign each point in the dataset to the cluster centroid it is closest to
move centroid step | k-means clustering | move each centroid to the mean location of all the points assigned to that centroid |
stopping criteria | k-means clustering | centroids do not change position; points remain in same cluster; max iterations reached |
elbow method | k-means clustering | (doesn’t work reliably) algorithm to choose optimal K by graphing J as a function of K; find the “elbow” where J begins to decrease less rapidly
choosing K | k-means clustering | usually K-means is used for some other downstream purpose; evaluate different values of K by feeding the clusters into the downstream algorithm; see also: elbow method
DBSCAN | density-based clustering | Density-based spatial clustering of applications with noise |
level set trees | density-based clustering | |
gradient descent | optimization | Iterative algorithm to determine the parameters \(\theta\) that will yield the best fit line; sketch after the table
L1 regularization | optimization | “lasso”; penalty term is sum of abs. value of weights; leads to sparse solution, some \(\theta_{j} = 0\) |
L2 regularization | optimization | “ridge”; penalty term is sum of squares of weights; helps to shrink \(\theta\) |
elastic net regularization | optimization | in-between L1 and L2; available in scikit-learn (use l1_ratio to control how much L1 vs L2 is applied); example after the table
normal equation | optimization | Equation to determine the parameters \(\theta\) that will yield the best fit line; requires calculating a matrix inverse, which is \(O(N^3)\)
regularization | optimization | technique used to discourage higher-order/more-complex models; extra term(s) in the cost function penalize large parameter values, e.g. \(\sum_j \theta_j^2\)
Probability density function | probability | |
union \(\cup\) | set theory | logical OR; elements that are in either set |
intersection \(\cap\) | set theory | logical AND; elements that are in both sets |
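
A few code sketches follow for some of the rows above. They are minimal Python examples of my own (data, names, and numbers are placeholders, not taken from any particular library or paper). First, precision, recall, accuracy and the F1 score computed from raw confusion-matrix counts, matching the formulas in the table; the counts are deliberately skewed to show why accuracy alone is misleading:

```python
# Hypothetical confusion-matrix counts for a binary classifier.
tp, fp, fn, tn = 80, 20, 10, 890   # note the skew: 90 positives vs 910 negatives

precision = tp / (tp + fp)                     # TP / (TP + FP)
recall = tp / (tp + fn)                        # TP / (TP + FN)
accuracy = (tp + tn) / (tp + fp + fn + tn)     # all correct / all examples
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"accuracy={accuracy:.2f} f1={f1:.2f}")
# Accuracy looks great (~0.97) largely because the classes are skewed;
# precision, recall and F1 tell a more honest story.
```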
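
Next, the best fit line for a toy regression problem, found two ways: the normal equation and gradient descent, with MSE as the evaluation metric. The synthetic data, learning rate, and iteration count are arbitrary choices, just to keep the sketch self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(100), rng.uniform(0, 10, 100)]   # bias column of 1s + one feature
y = 3 + 2 * X[:, 1] + rng.normal(0, 1, 100)        # "true" line plus noise

# Normal equation: theta = (X^T X)^-1 X^T y -- requires a matrix inverse, O(N^3)
theta_ne = np.linalg.inv(X.T @ X) @ X.T @ y

# Gradient descent on the same data
theta_gd = np.zeros(2)
alpha = 0.01                                        # learning rate (a hyperparameter)
for _ in range(5000):
    grad = X.T @ (X @ theta_gd - y) / len(y)        # gradient of (1/2) * MSE
    theta_gd -= alpha * grad

mse = np.mean((X @ theta_gd - y) ** 2)              # mean squared error
print(theta_ne, theta_gd, mse)                      # both thetas land near [3, 2]
```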
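
L1, L2, and elastic net regularization side by side in scikit-learn. The synthetic data is mine; the third feature is pure noise, so lasso will tend to drive its coefficient to exactly zero while ridge only shrinks it:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=200)   # feature 2 is irrelevant

for model in (Lasso(alpha=0.1),                      # L1: sparse coefficients
              Ridge(alpha=1.0),                      # L2: shrunken coefficients
              ElasticNet(alpha=0.1, l1_ratio=0.5)):  # mix of L1 and L2
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 3))
```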
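
Grid search with scikit-learn’s GridSearchCV, here over an SVM with the RBF kernel. The parameter grid and dataset are arbitrary; `C` and `gamma` are the SVC hyperparameters being tuned:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1], "kernel": ["rbf"]}

# exhaustively try every combination, scoring each by cross-validated accuracy
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```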
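
The variance inflation factor, computed directly from its formula: regress each feature on the remaining features and plug the resulting R-squared into \(1 / (1 - R^2_j)\). The random data is only there to give two nearly collinear columns:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

for j in range(X.shape[1]):
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    vif = 1.0 / (1.0 - r2)                          # VIF_j = 1 / (1 - R_j^2)
    print(f"feature {j}: VIF = {vif:.1f}")          # x1 and x2 come out well above 10
```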
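
One-hot encoding with pandas (`get_dummies`); the toy `color` column is made up:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# widen the categorical column into one boolean column per value,
# so a model cannot learn an accidental ordering (red < green < blue)
encoded = pd.get_dummies(df, columns=["color"])
print(encoded)   # columns: color_blue, color_green, color_red
```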
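
A bare-bones k-means run showing the cluster assignment step, the move centroid step, the distortion (cost), and multiple random initializations to avoid bad local optima. The two-blob data, K, and iteration counts are placeholders:

```python
import numpy as np

def kmeans(X, K, iters=100):
    """One run of k-means: random init, then alternate the two steps."""
    rng = np.random.default_rng()
    centroids = X[rng.choice(len(X), K, replace=False)]    # init at K random samples
    for _ in range(iters):
        # cluster assignment step: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move centroid step: move each centroid to the mean of its assigned points
        centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                              else centroids[k]            # leave an empty cluster in place
                              for k in range(K)])
    # distortion: average squared distance between each point and its centroid
    cost = np.mean(np.linalg.norm(X - centroids[labels], axis=1) ** 2)
    return centroids, labels, cost

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])   # two blobs
best = min((kmeans(X, K=2) for _ in range(10)), key=lambda run: run[2])
print("distortion:", best[2])
```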
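
Finally, convolution as used in convolutional networks (strictly speaking this is cross-correlation, which is what most deep learning libraries compute): slide the filter over each same-sized region of the image and sum the element-wise products. No padding (“valid”) and a configurable stride are assumed:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Each output cell is the sum of the element-wise product between
    the filter and the image region it currently covers."""
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            region = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(region * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)
sobel_x = np.array([[1., 0., -1.], [2., 0., -2.], [1., 0., -1.]])   # vertical edge filter
print(conv2d(image, sobel_x))   # 4x4 output: (6 - 3) // 1 + 1 = 4
```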