## Resources

### Implementation

#### NLP

https://pmbaumgartner.github.io/blog/notes-on-nlp-projects/

## Confusion Matrix

|                 | Predicted Negative  | Predicted Positive  |
| --------------- | ------------------- | ------------------- |
| Actual Negative | True Negative (TN)  | False Positive (FP) |
| Actual Positive | False Negative (FN) | True Positive (TP)  |
- Recall = Sensitivity = TPR = TP / (TP + FN)
- FPR = FP / (FP + TN)
- Precision = positive predictive value = TP / (TP + FP)
- The ROC curve plots TPR (recall) vs. FPR
- The PR curve plots precision vs. recall (see the computation sketch below)
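
As a quick reference, here is a minimal sketch (not from the linked notes) of computing these rates from raw binary labels with numpy; the `binary_metrics` helper and the example arrays are purely illustrative:

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Confusion-matrix counts and derived rates for binary labels (0/1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return {
        "recall_tpr": tp / (tp + fn),     # sensitivity
        "fpr": fp / (fp + tn),
        "precision_ppv": tp / (tp + fp),
    }

# Illustrative labels only
print(binary_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0]))
```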

### Importance of precision

$Precision = \frac{\text{True Positive}}{\text{All predicted positives}} = \frac{TP}{TP + FP}$ Out of all predicted positives, how many were really positive? When the cost of a false positive is high, we need to ensure precision is high. E.g. in spam detection, a false positive means a non-spam email (actual negative) is classified as spam (predicted positive), so it is important not to falsely flag legitimate emails.

### Importance of recall

How many actual positives were captured? $Recall = \frac{\text{True Positive}}{\text{All actual positives}} = \frac{TP}{TP + FN}$ When the cost of a false negative is high, recall needs to be high. E.g. in fraud detection or contagious disease detection, a false negative (an actual positive predicted as negative) slipping through is dangerous.

### F1 Score

$F1 = 2 \cdot \frac{Precision\cdot Recall}{Precision + Recall}$ High accuracy can be misleading if it's largely made up of true negatives, since true negatives might not be the focus in business cases, whereas false negatives (recall) and false positives (precision) usually have business costs.

- Accuracy is used when we are only concerned with TP and TN, while the F1-score is used when false negatives and false positives are more crucial. Accuracy can be used when the class distribution is similar, while the F1-score is a better metric when the classes are imbalanced, as in the cases above.
- The F1-score balances precision and recall.
- Accuracy = (TP + TN) / (TP + TN + FP + FN)
- Under extreme class imbalance it's almost TN / (TN + FP), i.e. just the fraction of negatives classified correctly.
- F2 emphasizes recall over precision, e.g. when we always want to detect cancer symptoms at the expense of false alarms (see the sketch below).
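
A minimal sketch, assuming scikit-learn is available, of how accuracy can look good on an imbalanced problem while F1 and F2 expose the missed positives; the label arrays are made up for illustration:

```python
from sklearn.metrics import accuracy_score, f1_score, fbeta_score

# Illustrative imbalanced labels: mostly negatives, a few positives
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 95 + [1, 0, 0, 0, 0]   # only one of the five positives is caught

print("accuracy:", accuracy_score(y_true, y_pred))       # high despite missing most positives
print("f1:      ", f1_score(y_true, y_pred))             # balances precision and recall
print("f2:      ", fbeta_score(y_true, y_pred, beta=2))  # weights recall more heavily
```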

## Information Theory

### Entropy

Shannon defined entropy $$H$$ as the expected information content $$I$$ of a random variable $$X$$, i.e. how surprised we would expect to be on average after sampling from $$X$$: $$H(X) = \operatorname{E}[I(X)] = \operatorname{E}[-\log P(X)].$$

Intuitively, when an event has a low probability, it is more surprising and therefore contains more information, which is why $$I(X)$$ and $$P(X)$$ are inversely correlated: $$I(X) = -\log(P(X)) = \log(\frac{1}{P(X)})$$.
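
For example, using $$\log_2$$: a fair coin flip has $$I = -\log_2(1/2) = 1$$ bit of information, while an outcome with probability $$1/1024$$ has $$-\log_2(1/1024) = 10$$ bits; the rarer the event, the more informative observing it is.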

Using the definition of expected value, the entropy of a probability distribution $$p(x)$$ is explicitly:

$H(p(x)) = - \sum_{i=1}^N p(x_i) \cdot \log p(x_i)$ If we use $$\log_2$$ we can interpret entropy as the minimum number of bits needed to encode the information. More generally, $$H$$ quantifies how much information is in the data, specifically the theoretical lower bound on the number of bits/nats/hartleys/etc. we need.
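
A small sketch of this formula in numpy (the `entropy_bits` helper is illustrative, not a library function):

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy H(p) = -sum p_i * log2(p_i) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                 # treat 0 * log(0) as 0
    return -np.sum(p * np.log2(p))

print(entropy_bits([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy_bits([0.9, 0.1]))   # biased coin: ~0.47 bits
print(entropy_bits([0.25] * 4))   # uniform over 4 outcomes: 2.0 bits
```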

## Bayesian modeling

Compilation of notes on Bayesian statistical methods

### The duality of $$p(x|\theta)$$

It is sometimes referred to as the likelihood of the data, and sometimes referred to as a statistical model. The difference is whether we are looking at $$p(x | \theta)$$ as…

#### a function of $$x$$, where $$\theta$$ is known

If $$\theta$$ is a known model parameter, then $$p_x(x|\theta) = p(x; \theta) = p_\theta(x)$$ is the probability of $$x$$ according to a model parameterized by $$\theta$$; this is also known as a model / statistical model / observation model, measuring uncertainty about $$x$$ given $$\theta$$.

(If $$\theta$$ is a known random variable, $$p(x|\theta)$$ is just a conditional probability, $$\frac{p(x, \theta)}{p(\theta)}$$.)

#### a function of $$\theta$$, where $$x$$ is known

Unlike the above, the emphasis is on investigating the unknown $$\theta$$.

$$p(x|\theta)$$ is the probability of the observed data $$x$$, viewed as a function of the different values the random variable $$\theta$$ could take.

When doing MLE we find the assignment $$\hat{\theta}$$ for $$\theta$$ that maximizes the likelihood $$p(x|\theta)$$; $$p(x|\hat{\theta})$$ is then also called the maximum likelihood of $$\theta$$ given $$x$$, written $$\mathcal L(\hat\theta|x)$$.

In other words, itβs a function of $$\theta$$ (written more explicitly as $$p_\theta(x|\theta)$$) that measures the extent to which observed $$x$$ supports particular values of $$\theta$$ in a parametric model.

### MCMC Algorithms

#### PSIS-LOO

PSIS-LOO fits a generalized Pareto distribution (a power-law distribution) to the tail of the importance sampling weights in order to stabilize them.
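
A rough sketch of the core idea using SciPy's generalized Pareto fit; the simulated weights and the 20% tail cutoff are illustrative assumptions, and real PSIS additionally replaces the tail weights with quantiles of the fitted distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated importance sampling weights with a heavy right tail (illustrative)
log_weights = rng.standard_normal(4000) * 2.0
weights = np.exp(log_weights)

# Treat the largest ~20% of weights as the tail and fit a generalized Pareto to the exceedances
cutoff = np.quantile(weights, 0.8)
exceedances = weights[weights > cutoff] - cutoff
k_hat, loc, scale = stats.genpareto.fit(exceedances, floc=0.0)

print(f"estimated Pareto shape k_hat = {k_hat:.2f}")  # larger k_hat => heavier tail, less reliable weights
```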

### Bayesian Neural Networks

http://mlss.tuebingen.mpg.de/2015/slides/ghahramani/gp-neural-nets15.pdf A multilayer perceptron (neural network) with infinitely many hidden units and Gaussian priors on the weights converges to a Gaussian process with a certain covariance function (mind blown!).

Source: http://videolectures.net/gpip06_mackay_gpb/ Also, Gaussian processes can be thought of as applying the kernel trick to an infinite-dimensional feature space. Source: http://katbailey.github.io/gp_talk/Gaussian_Processes.pdf

#### Examples of parametric and nonparametric models

Gaussian processes are the non-parametric version of neural networks (a neural network with infinitely many hidden units, as above).

### Gaussian Processes

Nice video by Nando de Freitas. Cholesky decomposition: $$X = L L^T$$.
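
A minimal sketch (my own, not from the video) of the standard use of the Cholesky factor: if $$K = L L^T$$ is a kernel matrix, then $$f = Lz$$ with $$z \sim \mathcal N(0, I)$$ has covariance $$K$$, i.e. it is a draw from the GP prior. The RBF kernel and input grid below are assumptions for illustration:

```python
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0):
    """Squared-exponential (RBF) covariance between two sets of 1-D inputs."""
    sq_dists = (x1[:, None] - x2[None, :]) ** 2
    return np.exp(-0.5 * sq_dists / lengthscale**2)

rng = np.random.default_rng(0)
x = np.linspace(-5, 5, 100)

K = rbf_kernel(x, x)
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(x)))  # jitter for numerical stability

# f = L @ z with z ~ N(0, I) has covariance L L^T = K
f_samples = L @ rng.standard_normal((len(x), 3))   # three draws from the GP prior
print(f_samples.shape)  # (100, 3)
```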

### Resources

GPs (Gaussian processes), DPs (Dirichlet processes), CRPs (Chinese restaurant processes), Indian buffet processes…