## Resources

### Implementation

#### NLP

https://pmbaumgartner.github.io/blog/notes-on-nlp-projects/

## Confusion Matrix

|                 | Predicted Negative  | Predicted Positive  |
| --------------- | ------------------- | ------------------- |
| Actual Negative | True Negative (TN)  | False Positive (FP) |
| Actual Positive | False Negative (FN) | True Positive (TP)  |
- Recall = Sensitivity = TPR = TP / (TP + FN)
- FPR = FP / (FP + TN)
- Precision = positive predictive value = TP / (TP + FP)
- The ROC curve plots TPR (recall) vs. FPR
- The PR curve plots precision vs. recall (see the computation sketch below)
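
As a quick reference, here is a minimal sketch (not from the linked notes) of computing these rates from raw binary labels with numpy; the `binary_metrics` helper and the example arrays are purely illustrative:

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and derived rates for binary labels (0/1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return {
        "recall_tpr": tp / (tp + fn),     # sensitivity
        "fpr": fp / (fp + tn),
        "precision_ppv": tp / (tp + fp),
    }

# Illustrative labels only
print(binary_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0]))
```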

### Importance of precision

$Precision = \frac{\text{True Positive}}{\text{All predicted positives}} = \frac{TP}{TP + FP}$ Out of all predicted positives, how many were really positive? When the cost of a false positive is high, we need to ensure precision is high. E.g. in spam detection, a false positive means a non-spam email (actual negative) is classified as spam (predicted positive), so it is important not to falsely flag legitimate emails.

### Importance of recall

How many actual positives were captured? $Recall = \frac{\text{True Positive}}{\text{All actual positives}} = \frac{TP}{TP + FN}$ When the cost of a false negative is high, recall needs to be high. E.g. in fraud detection or contagious disease detection, a false negative (an actual positive predicted as negative) slipping through is dangerous.

### F1 Score

$F1 = 2 \cdot \frac{Precision\cdot Recall}{Precision + Recall}$ High accuracy can be misleading if it's largely made up of true negatives, since true negatives might not be the focus in business cases, whereas false negatives (recall) and false positives (precision) usually have business costs.

- Accuracy is used when we are only concerned with TP and TN, while the F1-score is used when false negatives and false positives are more crucial. Accuracy can be used when the class distribution is similar, while the F1-score is a better metric when the classes are imbalanced, as in the cases above.
- The F1-score balances precision and recall.
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Under extreme class imbalance it's almost TN / (TN + FP), i.e. just the fraction of negatives classified correctly.
- F2 emphasizes recall over precision, e.g. when we always want to detect cancer symptoms at the expense of false alarms (see the sketch below).
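
A minimal sketch, assuming scikit-learn is available, of how accuracy can look good on an imbalanced problem while F1 and F2 expose the missed positives; the label arrays are made up for illustration:

```python
from sklearn.metrics import accuracy_score, f1_score, fbeta_score

# Illustrative imbalanced labels: mostly negatives, a few positives
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 95 + [1, 0, 0, 0, 0]   # only one of the five positives is caught

print("accuracy:", accuracy_score(y_true, y_pred))       # high despite missing most positives
print("f1:      ", f1_score(y_true, y_pred))             # balances precision and recall
print("f2:      ", fbeta_score(y_true, y_pred, beta=2))  # weights recall more heavily
```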

## Information Theory

### Entropy

Shannon defined entropy $$H$$ as the expected information content $$I$$ of a random variable $$X$$, i.e. how surprised we would expect to be on average after sampling from $$X$$: $$H(X) = \operatorname{E}[I(X)] = \operatorname{E}[-\log P(X)].$$

Intuitively, when an event has a low probability, it is more surprising and therefore contains more information, which is why $$I(X)$$ and $$P(X)$$ are inversely correlated: $$I(X) = -\log(P(X)) = \log(\frac{1}{P(X)})$$.
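
For example, using $$\log_2$$: a fair coin flip has $$I = -\log_2(1/2) = 1$$ bit of information, while an outcome with probability $$1/1024$$ has $$-\log_2(1/1024) = 10$$ bits; the rarer the event, the more informative observing it is.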

Using the definition of expected value, the entropy of a probability distribution $$p(x)$$ is explicitly:

$H(p(x)) = - \sum_{i=1}^N p(x_i) \cdot \log p(x_i)$ If we use $$\log_2$$ we can interpret entropy as the minimum number of bits needed to encode the information. More generally, $$H$$ quantifies how much information is in the data, specifically the theoretical lower bound on the number of bits/nats/hartleys/etc. we need.
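
A small sketch of this formula in numpy (the `entropy_bits` helper is illustrative, not a library function):

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy H(p) = -sum p_i * log2(p_i) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                 # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

print(entropy_bits([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy_bits([0.9, 0.1]))   # biased coin: ~0.47 bits
print(entropy_bits([0.25] * 4))   # uniform over 4 outcomes: 2.0 bits
```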

## Bayesian modeling

Compilation of notes on Bayesian statistical methods

### The duality of $$p(x|\theta)$$

It is sometimes referred to as the likelihood of the data, and sometimes referred to as a statistical model. The difference is whether we are looking at $$p(x | \theta)$$ as…

#### a function of $$x$$, where $$\theta$$ is known

If $$\theta$$ is a known model parameter, then $$p_x(x|\theta) = p(x; \theta) = p_\theta(x)$$ is the probability of $$x$$ according to a model parameterized by $$\theta$$; this is also known as a model / statistical model / observation model, measuring uncertainty about $$x$$ given $$\theta$$.

(If $$\theta$$ is a known random variable, $$p(x|\theta)$$ is just a conditional probability, $$\frac{p(x, \theta)}{p(\theta)}$$.)

#### a function of $$\theta$$, where $$x$$ is known

Unlike the above, the emphasis is on investigating the unknown $$\theta$$.

$$p(x|\theta)$$ is the probability of the observed data $$x$$, viewed as a function of the different values the random variable $$\theta$$ could take.

When doing MLE we find the assignment $$\hat{\theta}$$ for $$\theta$$ that maximizes the likelihood $$p(x|\theta)$$; $$p(x|\hat{\theta})$$ is then also called the maximum likelihood of $$\theta$$ given $$x$$, written $$\mathcal L(\hat\theta|x)$$.

In other words, itβs a function of $$\theta$$ (written more explicitly as $$p_\theta(x|\theta)$$) that measures the extent to which observed $$x$$ supports particular values of $$\theta$$ in a parametric model.

### MCMC Algorithms

#### PSIS-LOO

PSIS-LOO fits a generalized Pareto distribution (a power-law distribution) to the tail of the importance sampling weights in order to stabilize them.
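
A rough sketch of the core idea using SciPy's generalized Pareto fit; the simulated weights and the 20% tail cutoff are illustrative assumptions, and real PSIS additionally replaces the tail weights with quantiles of the fitted distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated importance sampling weights with a heavy right tail (illustrative)
log_weights = rng.standard_normal(4000) * 2.0
weights = np.exp(log_weights)

# Treat the largest ~20% of weights as the tail and fit a generalized Pareto to the exceedances
cutoff = np.quantile(weights, 0.8)
exceedances = weights[weights > cutoff] - cutoff
k_hat, loc, scale = stats.genpareto.fit(exceedances, floc=0.0)

print(f"estimated Pareto shape k_hat = {k_hat:.2f}")  # larger k_hat => heavier tail, less reliable weights
```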

### Bayesian Neural Networks

http://mlss.tuebingen.mpg.de/2015/slides/ghahramani/gp-neural-nets15.pdf A multilayer perceptron (neural network) with infinitely many hidden units and Gaussian priors on the weights converges to a Gaussian process with a certain covariance function (mind blown!).

Source: http://videolectures.net/gpip06_mackay_gpb/ Also, Gaussian processes can be thought of as applying the kernel trick to an infinite-dimensional feature space. Source: http://katbailey.github.io/gp_talk/Gaussian_Processes.pdf

#### Examples of parametric and nonparametric models

Gaussian processes are the non-parametric version of neural networks (a neural network with infinitely many hidden units, as above).

### Gaussian Processes

Nice video by Nando de Freitas. Cholesky decomposition: $$X = L L^T$$.
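
A minimal sketch (my own, not from the video) of the standard use of the Cholesky factor: if $$K = L L^T$$ is a kernel matrix, then $$f = Lz$$ with $$z \sim \mathcal N(0, I)$$ has covariance $$K$$, i.e. it is a draw from the GP prior. The RBF kernel and input grid below are assumptions for illustration:

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0):
    """Squared-exponential (RBF) covariance between two sets of 1-D inputs."""
    sq_dists = (x1[:, None] - x2[None, :]) ** 2
    return np.exp(-0.5 * sq_dists / lengthscale**2)

rng = np.random.default_rng(0)
x = np.linspace(-5, 5, 100)

K = rbf_kernel(x, x)
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(x)))  # jitter for numerical stability

# f = L @ z with z ~ N(0, I) has covariance L L^T = K
f_samples = L @ rng.standard_normal((len(x), 3))   # three draws from the GP prior
print(f_samples.shape)  # (100, 3)
```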

### Resources

GPs (Gaussian processes), DPs (Dirichlet processes), CRPs (Chinese restaurant processes), Indian buffet processes…