## Resources 🔗

### Implementation 🔗

#### NLP 🔗

https://pmbaumgartner.github.io/blog/notes-on-nlp-projects/

## Confusion Matrix 🔗

| | Predicted Negative | Predicted Positive |
|---|---|---|
| Actual Negative | True Negative (TN) | False Positive (FP) |
| Actual Positive | False Negative (FN) | True Positive (TP) |

Figure 1: From An introduction to ROC analysis by Tom Fawcett

• Recall = Sensitivity = TPR = TP/(TP+FN)
• FPR = FP/(FP+TN)
• Precision = positive predictive value = TP/(TP+FP)
• ROC plots TPR (recall) vs FPR
• PR plots precision vs recall
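
The definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming binary labels encoded as 1 (positive) and 0 (negative); the function names and the toy labels are made up for this example.

```python
def confusion_counts(y_true, y_pred):
    """Count (TN, FP, FN, TP) for binary labels where 1 = positive, 0 = negative."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def rates(tn, fp, fn, tp):
    """Recall (TPR), FPR and precision; assumes every denominator is non-zero."""
    return {
        "recall": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        "precision": tp / (tp + fp),
    }


# Hypothetical toy data: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
metrics = rates(tn, fp, fn, tp)
print(tn, fp, fn, tp)
print(metrics)
```

A ROC curve would plot `recall` against `fpr` as the decision threshold varies, and a PR curve would plot `precision` against `recall`; the sketch only computes a single point of each.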

### Importance of precision 🔗

$Precision = \frac{\text{True Positive}}{\text{All predicted positives}} = \frac{TP}{TP + FP}$

Out of all predicted positives, how many were really positive? When the cost of a false positive is high, precision needs to be high. E.g. in spam detection, a false positive is a non-spam email (Actual Negative) classified as spam (Predicted Positive). It is important not to falsely accuse non-spam emails!

### Importance of recall 🔗

How many of the actual positives were captured?

$Recall = \frac{\text{True Positive}}{\text{All actual positives}} = \frac{TP}{TP + FN}$

When the cost of a false negative is high, recall needs to be high. E.g. in fraud detection or contagious disease detection, a false negative (an actual positive predicted as negative) slipping through is dangerous.

### F1 Score 🔗

$F1 = 2 \cdot \frac{Precision\cdot Recall}{Precision + Recall}$

High accuracy can be misleading if it’s largely made up of true negatives, since true negatives might not be the focus in business cases, whereas false negatives (which lower recall) and false positives (which lower precision) usually have business costs.

• Accuracy is used when we mainly care about getting TP and TN right and the class distribution is roughly balanced, while F1-score is a better metric when False Negatives and False Positives are more crucial or when the classes are imbalanced.
• F1-score balances between precision and recall.
• Accuracy = (TP + TN) / (TP + TN + FP + FN)
• In extreme class imbalance (very few positives) it’s almost TN / (TN + FP), just the fraction of negatives classified correctly
• F2 emphasizes recall over precision. E.g. when we want to always detect cancer symptoms at the expense of false alarms.
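
To make the contrast between accuracy and the F-scores concrete, here is a small sketch on made-up, heavily imbalanced counts. It uses the general form $F_\beta = (1 + \beta^2) \cdot \frac{Precision \cdot Recall}{\beta^2 \cdot Precision + Recall}$, which gives F1 for $\beta = 1$ and F2 for $\beta = 2$. The counts and function names are hypothetical.

```python
def accuracy(tn, fp, fn, tp):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def f_beta(fp, fn, tp, beta=1.0):
    """F-beta score; beta > 1 weights recall more heavily than precision.

    Assumes tp > 0 so that precision and recall are both defined and non-zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 1000 samples, only 10 of them actually positive.
tn, fp, fn, tp = 985, 5, 8, 2

acc = accuracy(tn, fp, fn, tp)
f1 = f_beta(fp, fn, tp, beta=1.0)
f2 = f_beta(fp, fn, tp, beta=2.0)
print(f"accuracy={acc:.3f}  F1={f1:.3f}  F2={f2:.3f}")
```

In this toy case the accuracy is close to 1 because true negatives dominate, while F1 and F2 stay low because only 2 of the 10 positives are found. F2 is lower than F1 here because recall is the weaker of the two rates and F2 weights it more.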

## Information Theory 🔗

### Entropy 🔗

Shannon defined entropy $$H$$ as the expected information content $$I$$ of random variable $$X$$, or how surprised we would expect to be on average after sampling from $$X$$. $$\mathrm {H} (X)=\operatorname {E} [\operatorname {I} (X)]=\operatorname {E} [-\log(\mathrm {P} (X))].$$

Intuitively, when an event has a low probability, it is more surprising and therefore contains more information, which is why $$I(X)$$ and $$P(X)$$ are inversely related: $$I(X) = -\log(P(X)) = \log(\frac{1}{P(X)})$$.

Using the definition of expected value, the entropy of a probability distribution $$p(x)$$ is explicitly:

$H(p(x)) = - \sum_{i=0}^N p(x_i) \cdot \log p(x_i)$

If we use $$\log_2$$ we can interpret entropy as the minimum average number of bits needed to encode the information. More generally, $$H$$ quantifies how much information is in data, specifically the theoretical lower bound on the number of bits/nats/bans/etc. we need, depending on the base of the logarithm.
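
As a quick numerical check of the formula, the sketch below computes entropy for a few discrete distributions. The function name and the example distributions are illustrative assumptions; terms with zero probability are skipped, following the usual convention that $$0 \cdot \log 0 = 0$$.

```python
import math


def entropy(probs, base=2.0):
    """Shannon entropy of a discrete distribution given as a list of probabilities.

    base=2 gives bits, base=math.e gives nats, base=10 gives bans.
    Zero-probability outcomes contribute nothing to the sum.
    """
    return -sum(p * math.log(p) / math.log(base) for p in probs if p > 0)


fair_coin = entropy([0.5, 0.5])              # a fair coin needs 1 bit
biased_coin = entropy([0.9, 0.1])            # less surprising on average, below 1 bit
uniform_8 = entropy([1 / 8] * 8)             # 8 equally likely outcomes need 3 bits
fair_coin_nats = entropy([0.5, 0.5], math.e)  # same distribution measured in nats

print(fair_coin, biased_coin, uniform_8, fair_coin_nats)
```

The biased coin has lower entropy than the fair one because its outcome is easier to predict, which matches the "expected surprise" reading.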

## Bayesian modeling 🔗

Compilation of notes on Bayesian statistical methods

### The duality of $$p(x|\theta)$$ 🔗

It is sometimes referred to as the likelihood of the data, and sometimes referred to as a statistical model. The difference is whether we are looking at $$p(x | \theta)$$ as

#### a function of $$x$$, where $$\theta$$ is known 🔗

If $$\theta$$ is a known model parameter, then $$p_x(x|\theta) = p(x; \theta) = p_\theta(x)$$ is the probability of $$x$$ according to a model parameterized by $$\theta$$, also known as a model/statistical model/observation model measuring uncertainty about $$x$$ given $$\theta$$.

(If $$\theta$$ is a known random variable, $$p(x|\theta)$$ is just a conditional probability, $$\frac{p(x, \theta)}{p(\theta)}$$.)

#### a function of $$\theta$$, where $$x$$ is known 🔗

Unlike the above, the emphasis is on investigating the unknown $$\theta$$.

Here $$p(x|\theta)$$ is the probability of the fixed, observed data $$x$$, evaluated as the random variable $$\theta$$ takes on different values.

When doing MLE, we look for the assignment $$\hat{\theta}$$ for $$\theta$$ that maximizes the likelihood $$p(x|\theta)$$. The resulting value $$p(x|\hat{\theta})$$ is also called the maximum likelihood of $$\theta$$ given $$x$$, $$\mathcal L(\hat\theta|x)$$.

In other words, it’s a function of $$\theta$$ (written more explicitly as $$p_\theta(x|\theta)$$) that measures the extent to which observed $$x$$ supports particular values of $$\theta$$ in a parametric model.
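
The two readings of $$p(x|\theta)$$ can be illustrated with a Bernoulli model. This is only a sketch under assumed toy data: the coin-flip observations, the grid over $$\theta$$ and the function names are all made up for the example, and a grid search stands in for a proper optimizer.

```python
import math


def bernoulli_pmf(x, theta):
    """p(x | theta) as a function of x in {0, 1}, with theta known."""
    return theta if x == 1 else 1 - theta


def log_likelihood(theta, xs):
    """log p(xs | theta) as a function of theta, with the data xs known."""
    return sum(math.log(bernoulli_pmf(x, theta)) for x in xs)


# View 1: theta is known, x varies. The values form a probability distribution.
theta_known = 0.3
pmf_total = bernoulli_pmf(0, theta_known) + bernoulli_pmf(1, theta_known)

# View 2: x is known (7 ones out of 10, hypothetical data), theta varies.
xs = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
grid = [i / 100 for i in range(1, 100)]  # candidate values 0.01 ... 0.99
theta_hat = max(grid, key=lambda t: log_likelihood(t, xs))

print(pmf_total, theta_hat)
```

As a function of $$x$$ the values sum to 1, whereas as a function of $$\theta$$ there is no such normalization; the grid maximizer lands on the sample mean of the data, which is the known closed-form MLE for a Bernoulli parameter.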

### MCMC Algorithms 🔗

#### psis-loo 🔗

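PSIS-LOO (Pareto smoothed importance sampling, leave-one-out cross-validation): fit a generalized Pareto distribution (a power-law distribution) to the upper tail of the importance sampling weights, and use the fitted tail to smooth the largest weights.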

### Bayesian Neural Networks 🔗

A multilayer perceptron (neural network) with infinitely many hidden units and Gaussian priors on the weights -> a GP with a certain covariance function (mind is blown). Slides: http://mlss.tuebingen.mpg.de/2015/slides/ghahramani/gp-neural-nets15.pdf Source: http://videolectures.net/gpip06%5Fmackay%5Fgpb/

Also, Gaussian Processes can be thought of as applying the kernel trick to an infinite-dimensional feature space. Source: http://katbailey.github.io/gp%5Ftalk/Gaussian%5FProcesses.pdf

#### Examples of parametric and nonparametric models 🔗

GPs: non-parametric version of neural networks

### Gaussian Processes 🔗

Nice video by Nando de Freitas.

Cholesky decomposition: $$X = L L^T$$, where $$L$$ is lower triangular.
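
For reference, a Cholesky factorization can be written in a few lines of plain Python. This is a didactic sketch that assumes the input is a symmetric positive-definite matrix given as a list of lists; the function name and the example matrix are chosen for illustration, and in practice a library routine such as `numpy.linalg.cholesky` would normally be used instead.

```python
import math


def cholesky(X):
    """Return lower-triangular L with X = L L^T.

    Assumes X is symmetric positive definite; no checks are performed.
    """
    n = len(X)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # Sum of products of the already-computed entries of rows i and j.
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(X[i][i] - s)
            else:
                L[i][j] = (X[i][j] - s) / L[j][j]
    return L


def matmul_transpose(L):
    """Compute L L^T to verify the factorization."""
    n = len(L)
    return [[sum(L[i][k] * L[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


X = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]

L = cholesky(X)
reconstructed = matmul_transpose(L)
print(L)
```

In a GP setting, $$X$$ would typically be the covariance (kernel) matrix, and the factor $$L$$ is used to solve linear systems with it or to turn standard normal draws into correlated samples.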

### Resources 🔗

GP, DP, CRP, Indian buffet processes…