statistics:

# p(A and B) = p(A|B)p(B)

It's easy to forget that Bayes' rule is simply p(A|B) = p(A and B)/p(B). With data, we estimate p(B) by selecting all cases with B as the first event and any result for the second event (here assuming B is first and A is second). Then, we select all cases with B as 1st and A as 2nd, and divide the second count by the first.

Example: what is the probability of passing an exam, given that you have already read its statement and find it easy? Imagine we ask people whether they find the exam easy or not and collect the answers alongside the final results. Now, p(pass|easy) = p(pass AND easy)/p(easy). We find that 23% found the statement easy, so p(easy) = 0.23. We find that 10% found it easy and passed, so p(pass AND easy) = 0.1. Then, p(pass|easy) = 0.1/0.23 ≈ 0.43.

# Kendall rank correlation coefficient

Let A, B and C be three objects of study, and m1, m2 two different metrics. The three objects can be ranked according to either m1 or m2. But what is the correlation between these two rankings? This can be measured by the coefficient τ. If τ = 1, both rankings are identical. If τ = -1, one ranking is the reverse of the other. If τ = 0, the rankings are uncorrelated. It is calculated as τ = (CP - DP)/P, where CP = number of concordant pairs, DP = number of discordant pairs and P = total number of pairs.

In R, just import the data with `data = read.csv("data.csv")`. Then, `kendall_cor = cor(data, method = "kendall")`. Finally, print the matrix with `print(kendall_cor)`. In Python, use `scipy.stats.kendalltau` (a small sketch appears further below). Both give τB, a version of the coefficient that accounts for ties.

Notice that Kendall's approach only looks at the rankings. If both the magnitude and the direction of each metric are to be considered, then Pearson's correlation, which measures the linear relationship between the metrics, is a better choice.

# The key application of Bayes' rule

A model gives p(data values | parameter values) along with the prior, p(parameter values). Then, we use the rule to convert this into how strongly we should trust some parameters given the data, i.e. p(parameter values | data values). In summary, p(parameters|data) = p(data|parameters)p(parameters)/p(data).

Usually, data values are laid out in rows and parameter values in columns. Let the data be D and the parameters be λ. Then, for each cell in the table we can compute p(Di,λj) = p(Di|λj)p(λj); for each row we get a marginal p(Di) = ∑_j p(Di,λj), and for each column we get the marginal p(λj). Bayes' rule consists in dividing each cell by its row marginal. In other words, the posterior distribution over λ is obtained by renormalising within the current row. Bayes' rule is about moving the attention from the margin of the table (where the marginals are written) to the row itself, i.e. shifting from the prior (marginal) to the posterior (row). A numerical sketch of this renormalisation appears further below.

A summary: p(λ|D) = p(D|λ)p(λ)/p(D), or posterior = (likelihood)·(prior)/(evidence, or marginal likelihood). We go from the theoretical prior p(λ) to the data-supported posterior p(λ|D).

# Bayes' rule

From p(A|B) = p(A,B)/p(B) it's easy to derive Bayes' rule: p(A|B) = p(B|A)p(A)/p(B), where p(B) = ∑_i p(B|A_i)p(A_i). The numerator is just the joint probability p(A,B), or p(A and B). The denominator is a *marginal probability p(X)*, a sum (or integral) of p(X,Y) over all possible values of Y, i.e. p(X) = ∫p(X,Y)dY. The prior p(A) is a marginal; the posterior is the joint renormalised by the marginal p(B).

# Frequentist vs Bayesian

Bayes was the first to discover the rule, but it was Laplace who rediscovered and developed the method. The alternative school, the frequentist one, was pioneered by Fisher. The 20th century has been frequentist; the 21st century is shifting towards Bayesian analysis.
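As a Python counterpart of the Kendall example above, here is a minimal sketch using `scipy.stats.kendalltau`; the scores for the three objects are made up purely for illustration.

```python
# Minimal sketch: Kendall's tau between two metrics for objects A, B, C.
# The scores are invented for illustration only.
from scipy.stats import kendalltau

m1 = [0.9, 0.5, 0.1]  # metric m1 for A, B, C
m2 = [0.8, 0.6, 0.3]  # metric m2 for the same objects

tau, p_value = kendalltau(m1, m2)  # returns tau-B, which accounts for ties
print(f"tau = {tau:.3f}, p-value = {p_value:.3f}")
# Both metrics rank the objects as A > B > C, so tau = 1.0 here.
```

With only three objects the p-value carries little information; the point is just the call and the τB interpretation.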
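Similarly, a small numerical sketch of the table view from "The key application of Bayes' rule": the prior and likelihood values below are invented, and the only point is that dividing each row of the joint table by its row marginal yields the posterior.

```python
# Minimal sketch of Bayes' rule as row renormalisation of a table.
# Rows index data values D_i, columns index parameter values lambda_j.
# The prior and likelihood numbers are invented for illustration.
import numpy as np

prior = np.array([0.3, 0.7])           # p(lambda_j), one entry per column
likelihood = np.array([[0.8, 0.2],     # p(D_i | lambda_j), one row per D_i
                       [0.2, 0.8]])

joint = likelihood * prior                   # p(D_i, lambda_j) = p(D_i|lambda_j) p(lambda_j)
evidence = joint.sum(axis=1, keepdims=True)  # row marginals p(D_i)
posterior = joint / evidence                 # p(lambda_j | D_i): renormalise each row

print("joint p(D, lambda):\n", joint)
print("evidence p(D):", evidence.ravel())
print("posterior p(lambda | D):\n", posterior)  # each row sums to 1
```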
# Independent events

If A and B are independent, p(A|B) = p(A) and p(B|A) = p(B), or equivalently, p(A and B) = p(A)p(B).

# Bayes and the arrow of time

p(A|B) = p(A given B) = p(A and B)/p(B), where p(B) = ∑_i p(A_i and B). Interestingly, there is no inherent arrow of time here: p(A|B) does not mean that B happens before A. Yet, in QM, probability collapse seems aligned with causality.

# Standard deviation vs mean absolute deviation

*Doing Bayesian Data Analysis* by John K. Kruschke.

The mean of a distribution minimises ∑(x-mean)², and the median minimises ∑|x-median|. The mean is used far more often than the median, but why? According to Taleb [1], the standard deviation gives "much larger weight to larger observations". But why should larger deviations weigh more? Using the mean absolute deviation, mad = ∑|x-mean|/n, seems more intuitive than using the standard deviation, std = sqrt(∑(x-mean)²/n). For a Gaussian distribution, std/mad = sqrt(π/2) ≈ 1.25 (close to 1).

Take a million data points, with 999,999 zeros and a single 10⁶. In this case, mean = 1 and std ≈ 999.9995 ≈ 1000, whereas mad ≈ 1.999998 ≈ 2 (the median, for reference, is 0). Then, std/mad ≈ 500. With fat tails, this ratio increases dramatically.

In summary, std is far more sensitive to outliers and farther from the intuitive idea of what dispersion is. And, most importantly, the std/mad ratio increases as the tails become more prominent, so for normal distributions we don't care much, but when we depart from them, the std vs mad decision needs to be taken carefully.

[1] https://youtu.be/iKJy2YpYPe8?feature=shared&t=176
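As a quick numerical check of the std vs mad example above, a minimal Python sketch (here mad is the mean absolute deviation about the mean, which is what the quoted numbers use):

```python
# Numerical check of the std vs mad example: a million points,
# 999,999 zeros and a single 10^6.
import numpy as np

x = np.zeros(1_000_000)
x[0] = 1e6

std = x.std()                      # sqrt(sum((x - mean)^2) / n)
mad = np.abs(x - x.mean()).mean()  # sum(|x - mean|) / n

print(f"mean = {x.mean()}, median = {np.median(x)}")
print(f"std = {std:.4f}")          # ~1000
print(f"mad = {mad:.6f}")          # ~2
print(f"std/mad = {std / mad:.1f}")  # ~500

# For comparison, a Gaussian sample gives std/mad close to sqrt(pi/2) ~ 1.25.
g = np.random.default_rng(0).normal(size=1_000_000)
print(f"Gaussian std/mad = {g.std() / np.abs(g - g.mean()).mean():.3f}")
```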