ML glossary
There may be other ML glossaries (e.g. Google's), but this one is mine, for terms that I have encountered. Sidenote: it’s possible to sort HTML tables using this Chrome extension.
| Term | Domain | Definition |
|---|---|---|
| best fit line | statistics | The line learned by a regression model; predictions are based on this line; can be a hyperplane in higher dimensions. |
| classification | statistics | Predict a categorical dependent variable |
| correlation | statistics | a relationship or pattern exists between variables; compare causation |
| causation | statistics | one event causes another; established using controlled experiments that compare differing outcomes; compare correlation |
| coefficient of determination | statistics | see r-squared |
| collinearity | statistics | |
| dependent variable | statistics | Variable that will be predicted |
| ground truth | statistics | actual observed values |
| multicollinearity | statistics | one independent variable can be predicted from another independent variable |
| independent identically distributed (IID) variables | statistics | |
| independent variable | statistics | Variable used in making a prediction |
| heteroscedasticity | statistics | non-constant variance of the residuals (errors); violates an assumption of the regression model; Greek: hetero=different; skedasis=dispersion |
| log transformation | statistics | |
| r-squared | statistics | the proportion of dependent-variable variation accounted for by the predictor(s); \(1 - \frac{SSE}{SSTO}\) |
| regression | statistics | Predict a continuous dependent variable |
| variance inflation factor (VIF) | statistics | measure of multicollinearity against other feature(s); >10 = high multicollinearity; \(\textrm{VIF}_j = \frac{1}{1 - R^2_j}\) (see r-squared; see sketch after the table) |
| Correlation coefficient | statistics | |
| \(RSS\) | statistics | Residual sum of squares |
| residual plot | statistics | |
| mean squared error (MSE) | statistics | metric to evaluate a regression model, predicted vs actual; \(\frac{1}{n}\sum{(y - \hat{y})^2}\) |
| endogenous variable | statistics | dependent variable; i.e. having an internal cause or origin |
| exogenous variable | statistics | independent variable; i.e. relating or developing from external factors |
| maximum likelihood estimation (MLE) | statistics | process to estimate the most likely parameters of a population distribution, given a sample |
| one-vs-all classification | ml | train classifier for each class vs rest; for predictions, pick the one that is most confident |
| logistic regression | ml | classification algorithm (not regression) |
| decision boundary | ml | in logistic regression, the line or hyperplane that divides the plane or space; a function of parameters; parameters are function of training data; possibly non-linear |
| massage data | ml | |
| grid search | ml | used to exhaustively search a subset of the hyperparameter space for optimal hyperparameters, using some kind of metric (see sketch after the table) |
| hyperparameter | ml | “meta” parameter used to control the learning algorithm itself (and not the model) |
| one-hot encoding | ml | widen categorical feature -> each possible value represented by boolean column; prevents model from learning accidental ordinal relationship (see sketch after the table) |
| training set | ml | used to fit the model |
| test set | ml | used for unbiased evaluation of model |
| cross-validation (CV) set | ml | |
| purity | decision trees | measures the proportion of a class in a partition |
| Gini Index | decision trees | metric used to determine purity (minimize) |
| entropy | decision trees | metric used to determine purity (minimize) |
| cost complexity | decision trees | metric used during pruning |
| pruning | decision trees | used to prevent overfitting and improve generalizability: let the tree grow to full depth, then use cost complexity to limit height and remove edges |
| ensemble methods | decision trees | |
| bagging | decision trees | Bootstrap AGgregation; random sub-sampling reduces variance |
| boosting | decision trees | reduces bias; good for unbalanced data |
| AdaBoost | decision trees | Adaptive Boosting |
| Gradient Tree Boosting | decision trees | |
| GentleBoost | decision trees | |
| BrownBoost | decision trees | |
| XGBoost | decision trees | Extreme Gradient Boosting; introduced in 2014; fast; |
| Quantile Sketch Algorithm | decision trees | |
| stacking | decision trees | |
| SMOTE | classification | Synthetic Minority Oversampling Technique; synthetically oversample the minority class in imbalanced data |
| class weighting | classification | |
| norm | linear algebra | length of a vector; \(||v||\) |
| projection | linear algebra | |
| principal component analysis (PCA) | linear algebra | |
| subspace | linear algebra | |
| eigenvector | linear algebra | |
| covariance matrix | linear algebra | |
| singular value decomposition | linear algebra | |
| determinant | linear algebra | |
| projection errors | PCA | find lower-dimensional surface onto which to project data, while minimizing projection error |
| kernel trick | svm | compute dot products in a higher-dimensional space without explicitly transforming the data into that space; PDF by Eric Kim, SO |
| radial basis function | svm | (see also: Cover’s theorem) used in ML to add dimensionality to data, so that it can be made linearly separable in higher dimensions |
| kernel | svm | similarity function (between “landmarks” and features); see sketch after the table |
| large margin classifier | svm | another name for support vector machine |
| margin | svm | the distance between the two hyperplanes parallel to the decision boundary that pass through the nearest samples; a larger margin leads to a better decision boundary |
| decision boundary | svm | perpendicular to parameters \(\theta\) (why?) |
| network architecture | neural networks | how layers are connected to each other |
| input layer | neural networks | data comes into this layer |
| hidden layer | neural networks | intermediate layer(s) |
| output layer | neural networks | the final value(s) computed by the hypothesis |
| activation function | neural networks | |
| (artificial) neural network | neural networks | a group of neurons strung together |
| bias unit | neural networks | neuron at index 0 in any layer; always outputs value of 1 |
| precision | ml | fraction of classified positives that actually are positive; \(\frac{TP}{TP + FP}\) |
| recall | ml | fraction of actual positives that we classified as positive; \(\frac{TP}{TP + FN}\) |
| accuracy | ml | fraction of correctly classified classes, regardless of actual value; flawed when skewed classes |
| f-score (f1-score) | ml | single metric between 0 and 1 combining precision and recall; \(F_1 = 2 \frac{PR}{P + R}\) (see sketch after the table) |
| skewed classes | ml | many more examples of one class than the other; as in cancer vs no cancer or purchased vs not purchased |
| false positive | ml | hypothesis predicted positive result; actual is negative |
| false negative | ml | hypothesis predicted negative result; actual is positive |
| validation set | ml | used for unbiased evaluation of model while setting hyperparameters |
| overfitting | ml | model is too complex, has learned the noise in the data; low bias, high variance; |
| underfitting | ml | model is too simple; high bias, low variance |
| objective function | ml | the function to maximize or minimize |
| unsupervised learning | ml | find patterns in the data without ground truth labels; e.g. recommender systems |
| clustering | ml | explore/visualize data, reduce data scale, detect outliers/anomalies, deduplicate records; market segmentation, analyze social networks/astronomical data, organize computing clusters |
| k-means algorithm | ml | most popular clustering algorithm; inputs: K (number of clusters), training set X |
| dimensionality reduction techniques | ml | Aid computation and identify outliers; e.g. Principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) |
| recommender systems | ml | |
| sample complexity | ml | |
| zero-shot learning | ml | recognize a category without having seen prior examples of that category; aka “meta-learning” |
| convolution | convolutional NNs | output of the operation is a new matrix where each element is the sum of the element-wise product of the filter and a same-sized region of the image, as the filter slides over the image; operation denoted by an asterisk |
| cross-correlation | convolutional NNs | |
| filter | convolutional NNs | used for edge detection; weights can be learned by network; aka “kernel” |
| padding | convolutional NNs | |
| stride | convolutional NNs | |
| recognition | zero-shot learning | nearest category vector based on sample |
| retrieval | zero-shot learning | nearest sample based on category vector |
| semantic transfer | zero-shot learning | |
| domain ontology | zero-shot learning | |
| category vector | zero-shot learning | instead of single category label |
| content-based recommendation | recommender systems | |
| collaborative filtering | recommender systems | |
| co-training | semi-supervised learning | |
| branch and bound | semi-supervised learning | |
| S3VM | semi-supervised learning | |
| density estimation | anomaly detection | |
| cost function | k-means clustering | average of the squared distance between each example and its assigned cluster centroid; \(\frac{1}{m} \sum_{i=1}^{m} \| x^{(i)} - \mu_{c^{(i)}} \|^2\) (see sketch after the table) |
| distortion | k-means clustering | another name for the cost function/optimization objective of k-means |
| cluster centroid | k-means clustering | randomly initialize K centroids to the positions of K random samples; to avoid local optima, try 50-1000 random initializations and pick the lowest cost |
| cluster assignment step | k-means clustering | assign each point in the dataset to the cluster centroid it is closest to |
| move centroid step | k-means clustering | move each centroid to the mean location of all the points assigned to that centroid |
| stopping criteria | k-means clustering | centroids do not change position; points remain in same cluster; max iterations reached |
| elbow method | k-means clustering | (doesn’t work reliably) algorithm to choose optimal K by graphing the cost J as a function of K; find the “elbow” where J begins to decrease less rapidly |
| choosing K | k-means clustering | usually K-means is used for some other downstream purpose; evaluate different values of K by feeding the clusters into the downstream algorithm; see also: elbow method |
| DBSCAN | density-based clustering | Density-based spatial clustering of applications with noise |
| level set trees | density-based clustering | |
| gradient descent | optimization | Iterative algorithm to determine parameters \(\theta\) that will yield best fit line |
| L1 regularization | optimization | “lasso”; penalty term is sum of abs. value of weights; leads to sparse solution, some \(\theta_{j} = 0\) |
| L2 regularization | optimization | “ridge”; penalty term is sum of squares of weights; helps to shrink \(\theta\) |
| elastic net regularization | optimization | in-between L1 and L2; available in scikit-learn (use l1_ratio to determine how much L1 vs L2 is applied); see sketch after the table |
| normal equation | optimization | Equation to determine parameters \(\theta\) that will yield the best fit line; requires calculating the inverse of a matrix, which is \(O(N^3)\) (see sketch after the table) |
| regularization | optimization | technique used to prevent higher-order/more-complex models; extra penalty term(s) on large parameter values, e.g. \(\sum|\theta_j|\) or \(\sum\theta_j^2\) |
| Probability density function | probability | |
| union \(\cup\) | set theory | logical OR; elements that are in either set |
| intersection \(\cap\) | set theory | logical AND; elements that are in both sets |
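Below are a few illustrative sketches for terms referenced above; the toy data, parameter values, and helper names are my own assumptions, not prescriptions. First, a minimal VIF sketch: apply \(\textrm{VIF}_j = \frac{1}{1 - R^2_j}\) by regressing each feature on the remaining features (assumes numpy and scikit-learn).

```python
# Minimal VIF sketch: regress each feature on the remaining features and
# apply VIF_j = 1 / (1 - R^2_j). Toy data with two nearly collinear columns.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 2 * x1 + rng.normal(scale=0.1, size=100)  # nearly collinear with x1
x3 = rng.normal(size=100)
X = np.column_stack([x1, x2, x3])

for j in range(X.shape[1]):
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    print(f"VIF for feature {j}: {1.0 / (1.0 - r2):.1f}")
# Features 0 and 1 should show VIF >> 10; feature 2 should stay near 1.
```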
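A minimal grid search sketch using scikit-learn’s GridSearchCV; the estimator, grid values, and scoring metric are illustrative choices.

```python
# Minimal grid search sketch: exhaustively evaluate a small hyperparameter
# grid with 5-fold cross-validation and report the best combination.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}  # illustrative values
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```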
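A minimal one-hot encoding sketch with pandas; the column name and category values are made up.

```python
# Minimal one-hot encoding sketch: each possible category value becomes its
# own boolean indicator column, so no accidental ordering is implied.
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
print(pd.get_dummies(df, columns=["color"]))
# Produces color_blue, color_green, color_red indicator columns.
```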
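A minimal sketch of an RBF kernel as a similarity function between a sample and a landmark; the bandwidth sigma and the points are assumptions.

```python
# Minimal RBF kernel sketch: similarity is near 1 when x is close to the
# landmark and decays toward 0 as the distance grows.
import numpy as np

def rbf_similarity(x, landmark, sigma=1.0):
    return np.exp(-np.sum((x - landmark) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
landmark = np.array([1.5, 2.5])
print(rbf_similarity(x, landmark))          # high: x is near the landmark
print(rbf_similarity(x, landmark * 10.0))   # low: x is far from the landmark
```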
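A small worked example of precision, recall, F1, and accuracy; the confusion-matrix counts are invented purely for illustration.

```python
# Worked example: precision/recall/F1/accuracy from made-up counts.
TP, FP, FN, TN = 30, 10, 5, 55

precision = TP / (TP + FP)                          # 0.75
recall = TP / (TP + FN)                             # ~0.857
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (TP + TN) / (TP + FP + FN + TN)          # 0.85
print(precision, recall, f1, accuracy)
```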
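A minimal k-means sketch showing the cluster assignment step, the move centroid step, and the distortion (cost); the toy 2-D data, K, and iteration count are assumptions.

```python
# Minimal k-means sketch on toy 2-D data: alternate assignment and
# move-centroid steps, then compute the distortion (average squared distance).
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
K = 2
centroids = X[rng.choice(len(X), K, replace=False)]  # init at K random samples

for _ in range(10):
    # cluster assignment step: each point goes to its closest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # move centroid step: each centroid moves to the mean of its points
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])

labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
distortion = np.mean(np.linalg.norm(X - centroids[labels], axis=1) ** 2)
print(centroids, distortion)
```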
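A minimal sketch contrasting L1 (lasso), L2 (ridge), and elastic net in scikit-learn; the alpha and l1_ratio values and the toy dataset are illustrative.

```python
# Minimal regularization sketch: lasso tends to zero out coefficients,
# ridge shrinks them, elastic net mixes both penalties via l1_ratio.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)

print((lasso.coef_ == 0).sum(), "coefficients zeroed by lasso")
print(abs(ridge.coef_).max(), "largest ridge coefficient (shrunk, not zeroed)")
print((enet.coef_ == 0).sum(), "coefficients zeroed by elastic net")
```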
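A minimal normal equation sketch, \(\theta = (X^T X)^{-1} X^T y\), on toy data; np.linalg.solve is used rather than forming the inverse explicitly.

```python
# Minimal normal equation sketch: recover known parameters from noisy data.
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])  # bias column
true_theta = np.array([1.0, 2.0, -3.0])
y = X @ true_theta + rng.normal(scale=0.1, size=100)

theta = np.linalg.solve(X.T @ X, X.T @ y)  # solves (X^T X) theta = X^T y
print(theta)  # close to [1, 2, -3]
```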