Probability and Statistics Cookbook

Copyright © Matthias Vallentin, 2015 (vallentin@icir.org)
31st March, 2015

This cookbook integrates a variety of topics in probability theory and statistics. It is based on literature and in-class material from courses of the statistics department at the University of California, Berkeley, but is also influenced by other sources [2, 3]. If you find errors or have suggestions for further topics, I would appreciate it if you sent me an email. The most recent version of this document is available at http://matthias.vallentin.net/probability-and-statistics-cookbook/. To reproduce, please contact me.

Contents

1 Distribution Overview
  1.1 Discrete Distributions
  1.2 Continuous Distributions
2 Probability Theory
3 Random Variables
  3.1 Transformations
4 Expectation
5 Variance
6 Inequalities
7 Distribution Relationships
8 Probability and Moment Generating Functions
9 Multivariate Distributions
  9.1 Standard Bivariate Normal
  9.2 Bivariate Normal
  9.3 Multivariate Normal
10 Convergence
  10.1 Law of Large Numbers (LLN)
  10.2 Central Limit Theorem (CLT)
11 Statistical Inference
  11.1 Point Estimation
  11.2 Normal-Based Confidence Interval
  11.3 Empirical Distribution
  11.4 Statistical Functionals
12 Parametric Inference
  12.1 Method of Moments
  12.2 Maximum Likelihood
    12.2.1 Delta Method
  12.3 Multiparameter Models
    12.3.1 Multiparameter Delta Method
  12.4 Parametric Bootstrap
13 Hypothesis Testing
14 Exponential Family
15 Bayesian Inference
  15.1 Credible Intervals
  15.2 Function of Parameters
  15.3 Priors
    15.3.1 Conjugate Priors
  15.4 Bayesian Testing
16 Sampling Methods
  16.1 Inverse Transform Sampling
  16.2 The Bootstrap
    16.2.1 Bootstrap Confidence Intervals
  16.3 Rejection Sampling
  16.4 Importance Sampling
17 Decision Theory
  17.1 Risk
  17.2 Admissibility
  17.3 Bayes Rule
  17.4 Minimax Rules
18 Linear Regression
  18.1 Simple Linear Regression
  18.2 Prediction
  18.3 Multiple Regression
  18.4 Model Selection
19 Non-parametric Function Estimation
  19.1 Density Estimation
    19.1.1 Histograms
    19.1.2 Kernel Density Estimator (KDE)
  19.2 Non-parametric Regression
  19.3 Smoothing Using Orthogonal Functions
20 Stochastic Processes
  20.1 Markov Chains
  20.2 Poisson Processes
21 Time Series
  21.1 Stationary Time Series
  21.2 Estimation of Correlation
  21.3 Non-Stationary Time Series
    21.3.1 Detrending
  21.4 ARIMA Models
    21.4.1 Causality and Invertibility
  21.5 Spectral Analysis
22 Math
  22.1 Gamma Function
  22.2 Beta Function
  22.3 Series
  22.4 Combinatorics
1 Distribution Overview

1.1 Discrete Distributions
For each distribution the notation, CDF F_X(x), PMF f_X(x), mean E[X], variance V[X], and MGF M_X(s) are listed.¹

Uniform, Unif{a, …, b}
  F(x) = 0 for x < a;  (⌊x⌋ − a + 1)/(b − a + 1) for a ≤ x ≤ b;  1 for x > b
  f(x) = I(a ≤ x ≤ b)/(b − a + 1)
  E[X] = (a + b)/2    V[X] = ((b − a + 1)² − 1)/12
  M(s) = (e^{as} − e^{(b+1)s}) / ((b − a + 1)(1 − e^{s}))

Bernoulli, Bern(p)
  F(x) = (1 − p)^{1−x} for x ∈ {0, 1}
  f(x) = p^x (1 − p)^{1−x}
  E[X] = p    V[X] = p(1 − p)
  M(s) = 1 − p + pe^{s}

Binomial, Bin(n, p)
  F(x) = I_{1−p}(n − x, x + 1) ≈ Φ((x − np)/√(np(1 − p)))
  f(x) = (n choose x) p^x (1 − p)^{n−x}
  E[X] = np    V[X] = np(1 − p)
  M(s) = (1 − p + pe^{s})^n

Multinomial, Mult(n, p)
  f(x) = (n!/(x₁! ⋯ x_k!)) p₁^{x₁} ⋯ p_k^{x_k}  with Σ_{i=1}^k x_i = n
  E[X_i] = np_i    V[X_i] = np_i(1 − p_i)
  M(s) = (Σ_i p_i e^{s_i})^n

Hypergeometric, Hyp(N, m, n)
  f(x) = (m choose x)(N − m choose n − x) / (N choose n)
  E[X] = nm/N    V[X] = nm(N − n)(N − m) / (N²(N − 1))

Negative Binomial, NBin(r, p)
  F(x) = I_p(r, x + 1)
  f(x) = (x + r − 1 choose r − 1) p^r (1 − p)^x
  E[X] = r(1 − p)/p    V[X] = r(1 − p)/p²
  M(s) = (p / (1 − (1 − p)e^{s}))^r

Geometric, Geo(p)
  F(x) = 1 − (1 − p)^x,  x ∈ N⁺
  f(x) = p(1 − p)^{x−1},  x ∈ N⁺
  E[X] = 1/p    V[X] = (1 − p)/p²
  M(s) = pe^{s} / (1 − (1 − p)e^{s})

Poisson, Po(λ)
  F(x) = e^{−λ} Σ_{i=0}^{⌊x⌋} λ^i / i!
  f(x) = λ^x e^{−λ} / x!
  E[X] = λ    V[X] = λ
  M(s) = exp(λ(e^{s} − 1))

¹ We use the notation γ(s, x) and Γ(x) to refer to the Gamma functions (see §22.1), and use B(x, y) and I_x to refer to the Beta functions (see §22.2).
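As a quick sanity check of the tabulated moments, the following minimal sketch (assuming numpy and scipy are available) compares the closed-form mean and variance of the Binomial and Poisson entries against scipy.stats; the parameter values are arbitrary illustrations.

```python
# Sketch: verify tabulated means/variances against scipy.stats (assumed available).
from scipy import stats

n, p, lam = 30, 0.6, 4.0

binom = stats.binom(n, p)
assert abs(binom.mean() - n * p) < 1e-12            # E[X] = np
assert abs(binom.var() - n * p * (1 - p)) < 1e-12   # V[X] = np(1 - p)

pois = stats.poisson(lam)
assert abs(pois.mean() - lam) < 1e-12               # E[X] = lambda
assert abs(pois.var() - lam) < 1e-12                # V[X] = lambda
print("Binomial and Poisson moments match the table.")
```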
3
Uniform (discrete)
Binomial
●
Geometric
●
●
●
n = 40, p = 0.3
n = 30, p = 0.6
n = 25, p = 0.9
0.8
Poisson
●
●
●
●
● ●
p = 0.2
p = 0.5
p = 0.8
●
●
●
λ=1
λ=4
λ = 10
●
0.3
0.2
0.6
●
●
●
●
●
●
●
●
●
●
0.4
●
●
●
●
●
b
●
●
0
●
●
●●●
10
30
1.00
2.5
1.0
●
●
●
●
5.0
●
●
●
●
●
●
●
7.5
●
●
●
0.0
10.0
●
● ● ●
●
●
0
●
●
●
●
●
● ● ● ● ● ●
● ● ● ● ● ● ● ● ● ●
● ●
5
●
●
●
●
●
●
10
●
●
●
1.00
●
●
20
●
●
●
●
● ● ● ● ● ● ● ●
● ● ● ● ● ●
● ● ●
● ●
● ●
●
●
●
●
●
●
●
●
●
0.75
●
●
●
15
Poisson
●
●
0.8
●
●
●
●
●
x
●
●
●
●
●
●
Geometric
●●●●●●●●●●●●●●●●
● ● ●●●●●●●●●●●●●●●●●●●●●
●●
●
●
●
●
0.75
●
●
●
●
●
●
0.50
0.6
●
●
●
b
x
10
●
●
●
●
●
●
●
●
●
●
●●
●●●● ●
●●●●●●●●●●●●●●●●●
●●●●●●●●●●
0
●
0.25
●
●
a
●
0.4
●
0.00
●
●
0.25
●
●
●
●
●
●
0.50
●
●
●
●
CDF
CDF
CDF
●
0
0.0
●
●
●
●
x
●
●
i
n
40
Binomial
●
●
●
x
Uniform (discrete)
i
n
●
0.0
●
●
● ●
●
●
●
●●●●●
●●●●●●●●●●●●●●●●
20
x
1
●
●
●
●
●
●
●
●
●
0.1
●
●
●
●●
●●●●●●●
●●●●
●●●●●●●
0.0
0.2
●
●
●
a
●
● ●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
0.1
0.2
PMF
●
PMF
●
CDF
1
n
PMF
PMF
●
●
20
x
●
●
●
30
n = 40, p = 0.3
n = 30, p = 0.6
n = 25, p = 0.9
40
0.2
●
●
●
●
0.0
2.5
5.0
x
7.5
p = 0.2
p = 0.5
p = 0.8
10.0
●
●
0.00
●
●
●
●
●
●
● ● ● ●
0
5
10
15
λ=1
λ=4
λ = 10
20
x
1.2 Continuous Distributions
For each distribution the notation, CDF F_X(x), PDF f_X(x), mean E[X], variance V[X], and MGF M_X(s) are listed.

Uniform, Unif(a, b)
  F(x) = 0 for x < a;  (x − a)/(b − a) for a < x < b;  1 for x > b
  f(x) = I(a < x < b)/(b − a)
  E[X] = (a + b)/2    V[X] = (b − a)²/12
  M(s) = (e^{sb} − e^{sa}) / (s(b − a))

Normal, N(µ, σ²)
  F(x) = Φ(x) = ∫_{−∞}^{x} φ(t) dt
  f(x) = φ(x) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²))
  E[X] = µ    V[X] = σ²
  M(s) = exp(µs + σ²s²/2)

Log-Normal, ln N(µ, σ²)
  F(x) = 1/2 + (1/2) erf((ln x − µ)/√(2σ²))
  f(x) = (1/(x√(2πσ²))) exp(−(ln x − µ)²/(2σ²))
  E[X] = e^{µ + σ²/2}    V[X] = (e^{σ²} − 1) e^{2µ + σ²}

Multivariate Normal, MVN(µ, Σ)
  f(x) = (2π)^{−k/2} |Σ|^{−1/2} exp(−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ))
  E[X] = µ    V[X] = Σ
  M(s) = exp(µᵀs + (1/2) sᵀΣs)

Student's t, Student(ν)
  F(x) = I_u(ν/2, ν/2)  with u = (1/2)(1 + x/√(x² + ν))
  f(x) = (Γ((ν + 1)/2)/(√(νπ) Γ(ν/2))) (1 + x²/ν)^{−(ν+1)/2}
  E[X] = 0 (ν > 1)    V[X] = ν/(ν − 2) for ν > 2;  ∞ for 1 < ν ≤ 2

Chi-square, χ²_k
  F(x) = γ(k/2, x/2)/Γ(k/2)
  f(x) = (1/(2^{k/2} Γ(k/2))) x^{k/2−1} e^{−x/2}
  E[X] = k    V[X] = 2k
  M(s) = (1 − 2s)^{−k/2}  (s < 1/2)

F, F(d₁, d₂)
  F(x) = I_{d₁x/(d₁x + d₂)}(d₁/2, d₂/2)
  f(x) = √(((d₁x)^{d₁} d₂^{d₂}) / (d₁x + d₂)^{d₁+d₂}) / (x B(d₁/2, d₂/2))
  E[X] = d₂/(d₂ − 2) (d₂ > 2)    V[X] = 2d₂²(d₁ + d₂ − 2)/(d₁(d₂ − 2)²(d₂ − 4)) (d₂ > 4)

Exponential, Exp(β)
  F(x) = 1 − e^{−x/β}
  f(x) = (1/β) e^{−x/β}
  E[X] = β    V[X] = β²
  M(s) = 1/(1 − βs)  (s < 1/β)

Gamma, Gamma(α, β)
  F(x) = γ(α, x/β)/Γ(α)
  f(x) = (1/(Γ(α)β^α)) x^{α−1} e^{−x/β}
  E[X] = αβ    V[X] = αβ²
  M(s) = (1/(1 − βs))^α  (s < 1/β)

Inverse Gamma, InvGamma(α, β)
  F(x) = Γ(α, β/x)/Γ(α)
  f(x) = (β^α/Γ(α)) x^{−α−1} e^{−β/x}
  E[X] = β/(α − 1) (α > 1)    V[X] = β²/((α − 1)²(α − 2)) (α > 2)
  M(s) = (2(−βs)^{α/2}/Γ(α)) K_α(√(−4βs))

Dirichlet, Dir(α)
  f(x) = (Γ(Σ_i α_i)/∏_i Γ(α_i)) ∏_i x_i^{α_i−1}
  E[X_i] = α_i/Σ_k α_k    V[X_i] = E[X_i](1 − E[X_i])/(Σ_k α_k + 1)

Beta, Beta(α, β)
  F(x) = I_x(α, β)
  f(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1}(1 − x)^{β−1}
  E[X] = α/(α + β)    V[X] = αβ/((α + β)²(α + β + 1))
  M(s) = 1 + Σ_{k=1}^{∞} (∏_{r=0}^{k−1} (α + r)/(α + β + r)) s^k/k!

Weibull, Weibull(λ, k)
  F(x) = 1 − e^{−(x/λ)^k}
  f(x) = (k/λ)(x/λ)^{k−1} e^{−(x/λ)^k}
  E[X] = λΓ(1 + 1/k)    V[X] = λ²Γ(1 + 2/k) − µ²
  M(s) = Σ_{n=0}^{∞} (s^n λ^n/n!) Γ(1 + n/k)

Pareto, Pareto(x_m, α)
  F(x) = 1 − (x_m/x)^α for x ≥ x_m
  f(x) = α x_m^α / x^{α+1} for x ≥ x_m
  E[X] = αx_m/(α − 1) (α > 1)    V[X] = x_m² α/((α − 1)²(α − 2)) (α > 2)
  M(s) = α(−x_m s)^α Γ(−α, −x_m s)  (s < 0)
5
Uniform (continuous)
Normal
2.0
Log−Normal
1.00
µ = 0, σ2 = 0.2
µ = 0, σ2 = 1
µ = 0, σ2 = 5
µ = −2, σ2 = 0.5
1.0
●
●
●
a
b
0.00
−5.0
−2.5
0.0
x
x
χ
F
2
k=1
k=2
k=3
k=4
k=5
0.1
0.25
0.0
1.00
0.3
0.2
0.50
0.5
2.5
5.0
ν=1
ν=2
ν=5
ν=∞
PDF
●
PDF
1
b−a
0.75
PDF
PDF
1.5
Student's t
0.4
µ = 0, σ2 = 3
µ = 2, σ2 = 2
µ = 0, σ2 = 1
µ = 0.5, σ2 = 1
µ = 0.25, σ2 = 1
µ = 0.125, σ2 = 1
0.0
0
1
2
3
−5.0
−2.5
0.0
x
Exponential
3
2.0
β=2
β=1
β = 0.4
α = 1, β = 2
α = 2, β = 2
α = 3, β = 2
α = 5, β = 1
α = 9, β = 0.5
1.5
1.5
0.75
5.0
Gamma
2.0
d1 = 1, d2 = 1
d1 = 2, d2 = 1
d1 = 5, d2 = 2
d1 = 100, d2 = 1
d1 = 100, d2 = 100
2.5
x
PDF
0.50
PDF
PDF
PDF
2
1.0
1.0
1
0.5
0.5
0.25
0
0.00
0
2
4
6
8
0.0
0.0
0
1
x
2
3
4
5
0
1
2
x
Inverse Gamma
4
5
5
0
5
10
15
20
x
Weibull
2.0
α = 0.5, β = 0.5
α = 5, β = 1
α = 1, β = 3
α = 2, β = 2
α = 2, β = 5
4
4
x
Beta
α = 1, β = 1
α = 2, β = 1
α = 3, β = 1
α = 3, β = 0.5
3
Pareto
λ = 1, k = 0.5
λ = 1, k = 1
λ = 1, k = 1.5
λ = 1, k = 5
4
xm = 1, k = 1
xm = 1, k = 2
xm = 1, k = 4
3
1.5
3
2
PDF
PDF
PDF
PDF
3
2
1.0
2
1
0.5
1
1
0
0
0
1
2
3
x
4
5
0
0.0
0.00
0.25
0.50
x
0.75
1.00
0.0
0.5
1.0
1.5
x
2.0
2.5
1.0
1.5
2.0
2.5
x
6
Uniform (continuous)
Normal
1
Log−Normal
1.00
Student's t
1.00
µ = 0, σ2 = 3
µ = 2, σ2 = 2
µ = 0, σ2 = 1
µ = 0.5, σ2 = 1
µ = 0.25, σ2 = 1
µ = 0.125, σ2 = 1
0.75
0.75
0.75
CDF
CDF
CDF
CDF
0.50
0.50
0.50
0.25
0.25
0.25
µ = 0, σ = 0.2
µ = 0, σ2 = 1
µ = 0, σ2 = 5
µ = −2, σ2 = 0.5
2
0
0.00
a
b
−5.0
−2.5
0.0
x
2.5
0.00
5.0
χ
0.00
4
6
Exponential
0.75
0.75
0.50
0.00
1
x
2
3
4
1
2
0.00
3
4
5
0.00
1.00
0.75
0.75
0.25
0.50
x
10
0.75
1.00
15
20
0.50
0.25
λ = 1, k = 0.5
λ = 1, k = 1
λ = 1, k = 1.5
λ = 1, k = 5
0.00
0.00
5
Pareto
1.00
0.25
0.25
α = 1, β = 1
α = 2, β = 1
α = 3, β = 1
α = 3, β = 0.5
0
x
0.50
0.50
0.25
0.00
5
CDF
0.50
4
Weibull
α = 0.5, β = 0.5
α = 5, β = 1
α = 1, β = 3
α = 2, β = 2
α = 2, β = 5
CDF
0.75
CDF
0.75
3
x
Beta
1.00
x
0
α = 1, β = 2
α = 2, β = 2
α = 3, β = 2
α = 5, β = 1
α = 9, β = 0.5
CDF
Inverse Gamma
2
β=2
β=1
β = 0.4
x
1.00
1
0.25
0.00
5
5.0
0.50
0.25
d1 = 1, d2 = 1
d1 = 2, d2 = 1
d1 = 5, d2 = 2
d1 = 100, d2 = 1
d1 = 100, d2 = 100
2.5
CDF
0.75
0
0.0
Gamma
1.00
8
−2.5
x
CDF
CDF
k=1
k=2
k=3
k=4
k=5
−5.0
1.00
0.25
0.25
3
1.00
0.50
0.50
0
2
x
CDF
0.75
2
1
F
1.00
0
0.00
0
x
2
ν=1
ν=2
ν=5
ν=∞
0.0
0.5
1.0
1.5
x
2.0
2.5
xm = 1, k = 1
xm = 1, k = 2
xm = 1, k = 4
0.00
1.0
1.5
2.0
2.5
x
7
2 Probability Theory
Definitions

• Sample space Ω
• Outcome (point or element) ω ∈ Ω
• Event A ⊆ Ω
• σ-algebra A
  1. ∅ ∈ A
  2. A₁, A₂, … ∈ A ⟹ ⋃_{i=1}^∞ A_i ∈ A
  3. A ∈ A ⟹ ¬A ∈ A
• Probability distribution P
  1. P[A] ≥ 0 for every A
  2. P[Ω] = 1
  3. P[⨆_{i=1}^∞ A_i] = Σ_{i=1}^∞ P[A_i] for disjoint A_i
• Probability space (Ω, A, P)

Properties

• P[∅] = 0
• B = Ω ∩ B = (A ∪ ¬A) ∩ B = (A ∩ B) ∪ (¬A ∩ B)
• P[¬A] = 1 − P[A]
• P[B] = P[A ∩ B] + P[¬A ∩ B]
• P[Ω] = 1,  P[∅] = 0
• DeMorgan: ¬(⋃_n A_n) = ⋂_n ¬A_n  and  ¬(⋂_n A_n) = ⋃_n ¬A_n
• P[⋃_n A_n] = 1 − P[⋂_n ¬A_n]
• P[A ∪ B] = P[A] + P[B] − P[A ∩ B]  ⟹  P[A ∪ B] ≤ P[A] + P[B]
• P[A ∪ B] = P[A ∩ ¬B] + P[¬A ∩ B] + P[A ∩ B]
• P[A ∩ ¬B] = P[A] − P[A ∩ B]

Continuity of Probabilities

• A₁ ⊂ A₂ ⊂ … ⟹ lim_{n→∞} P[A_n] = P[A]  where A = ⋃_{i=1}^∞ A_i
• A₁ ⊃ A₂ ⊃ … ⟹ lim_{n→∞} P[A_n] = P[A]  where A = ⋂_{i=1}^∞ A_i

Independence ⊥

  A ⊥ B ⟺ P[A ∩ B] = P[A] P[B]

Conditional Probability

  P[A | B] = P[A ∩ B] / P[B]  (P[B] > 0)

Law of Total Probability

  P[B] = Σ_{i=1}^n P[B | A_i] P[A_i]  where Ω = ⨆_{i=1}^n A_i

Bayes' Theorem

  P[A_i | B] = P[B | A_i] P[A_i] / Σ_{j=1}^n P[B | A_j] P[A_j]  where Ω = ⨆_{i=1}^n A_i

Inclusion-Exclusion Principle

  P[⋃_{i=1}^n A_i] = Σ_{r=1}^{n} (−1)^{r−1} Σ_{i₁<⋯<i_r} P[A_{i₁} ∩ ⋯ ∩ A_{i_r}]

3 Random Variables

Random Variable (RV): X : Ω → ℝ

Probability Mass Function (PMF)
  f_X(x) = P[X = x] = P[{ω ∈ Ω : X(ω) = x}]

Probability Density Function (PDF)
  P[a ≤ X ≤ b] = ∫_a^b f(x) dx

Cumulative Distribution Function (CDF): F_X : ℝ → [0, 1],  F_X(x) = P[X ≤ x]
  1. Nondecreasing: x₁ < x₂ ⟹ F(x₁) ≤ F(x₂)
  2. Normalized: lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1
  3. Right-continuous: lim_{y↓x} F(y) = F(x)

Conditional density
  P[a ≤ Y ≤ b | X = x] = ∫_a^b f_{Y|X}(y | x) dy  (a ≤ b)
  f_{Y|X}(y | x) = f(x, y)/f_X(x)

Independence
  1. P[X ≤ x, Y ≤ y] = P[X ≤ x] P[Y ≤ y]
  2. f_{X,Y}(x, y) = f_X(x) f_Y(y)
3.1 Transformations

Transformation function: Z = ϕ(X)

Discrete
  f_Z(z) = P[ϕ(X) = z] = P[{x : ϕ(x) = z}] = P[X ∈ ϕ⁻¹(z)] = Σ_{x ∈ ϕ⁻¹(z)} f(x)

Continuous
  F_Z(z) = P[ϕ(X) ≤ z] = ∫_{A_z} f(x) dx  with A_z = {x : ϕ(x) ≤ z}

Special case if ϕ strictly monotone
  f_Z(z) = f_X(ϕ⁻¹(z)) |d/dz ϕ⁻¹(z)| = f_X(x) |dx/dz| = f_X(x)/|J|

The Rule of the Lazy Statistician
  E[Z] = ∫ ϕ(x) dF_X(x)
  E[I_A(X)] = ∫ I_A(x) dF_X(x) = ∫_A dF_X(x) = P[X ∈ A]

Convolution
• Z := X + Y:  f_Z(z) = ∫_{−∞}^{∞} f_{X,Y}(x, z − x) dx;  if X, Y ≥ 0: f_Z(z) = ∫_0^z f_{X,Y}(x, z − x) dx
• Z := |X − Y|:  f_Z(z) = 2 ∫_0^∞ f_{X,Y}(x, z + x) dx
• Z := X/Y:  f_Z(z) = ∫_{−∞}^{∞} |x| f_{X,Y}(x, xz) dx;  if X ⊥ Y: f_Z(z) = ∫_{−∞}^{∞} |x| f_X(x) f_Y(xz) dx

4 Expectation

Definition and properties
• E[X] = µ_X = ∫ x dF_X(x) = Σ_x x f_X(x) (X discrete)  or  ∫ x f_X(x) dx (X continuous)
• P[X = c] = 1 ⟹ E[X] = c
• E[cX] = c E[X]
• E[X + Y] = E[X] + E[Y]
• E[XY] = ∫∫ xy dF_{X,Y}(x, y)
• In general E[ϕ(X)] ≠ ϕ(E[X])  (cf. Jensen inequality)
• P[X ≥ Y] = 1 ⟹ E[X] ≥ E[Y]
• P[X = Y] = 1 ⟹ E[X] = E[Y]
• E[X] = Σ_{x=1}^{∞} P[X ≥ x]  (X nonnegative, integer-valued)

Sample mean
  X̄_n = (1/n) Σ_{i=1}^n X_i

Conditional expectation
• E[Y | X = x] = ∫ y f(y | x) dy
• E[X] = E[E[X | Y]]
• E[ϕ(X, Y) | X = x] = ∫ ϕ(x, y) f_{Y|X}(y | x) dy
• E[ϕ(Y, Z) | X = x] = ∫∫ ϕ(y, z) f_{(Y,Z)|X}(y, z | x) dy dz
• E[Y + Z | X] = E[Y | X] + E[Z | X]
• E[ϕ(X)Y | X] = ϕ(X) E[Y | X]
• E[Y | X] = c ⟹ Cov[X, Y] = 0

5 Variance

Definition and properties
• V[X] = σ²_X = E[(X − E[X])²] = E[X²] − E[X]²
• V[Σ_i X_i] = Σ_i V[X_i] + 2 Σ_{i≠j} Cov[X_i, X_j]
• V[Σ_i X_i] = Σ_i V[X_i]  if X_i ⊥ X_j

Standard deviation
  sd[X] = √V[X] = σ_X

Covariance
• Cov[X, Y] = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y]
• Cov[X, a] = 0
• Cov[X, X] = V[X]
• Cov[X, Y] = Cov[Y, X]
• Cov[aX, bY] = ab Cov[X, Y]
• Cov[X + a, Y + b] = Cov[X, Y]
• Cov[Σ_i X_i, Σ_j Y_j] = Σ_i Σ_j Cov[X_i, Y_j]

Correlation
  ρ[X, Y] = Cov[X, Y] / √(V[X] V[Y])

Independence
  X ⊥ Y ⟹ ρ[X, Y] = 0 ⟺ Cov[X, Y] = 0 ⟺ E[XY] = E[X] E[Y]

Sample variance
  S² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄_n)²

Conditional variance
• V[Y | X] = E[(Y − E[Y | X])² | X] = E[Y² | X] − E[Y | X]²
• V[Y] = E[V[Y | X]] + V[E[Y | X]]
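The monotone-transformation rule of §3.1 can be checked numerically. A minimal sketch (assuming numpy and scipy are available): for X ∼ N(0, 1) and Z = e^X, the rule gives f_Z(z) = φ(ln z)/z, which should match the log-normal density from §1.2.

```python
# Sketch: check f_Z(z) = f_X(phi^{-1}(z)) * |d/dz phi^{-1}(z)| for Z = exp(X), X ~ N(0,1).
import numpy as np
from scipy import stats

z = np.linspace(0.1, 5.0, 50)
# Change of variables: phi^{-1}(z) = ln z, |d/dz ln z| = 1/z.
f_z = stats.norm.pdf(np.log(z)) / z
# Reference: log-normal density with mu = 0, sigma = 1.
f_ref = stats.lognorm.pdf(z, s=1.0)
print(np.allclose(f_z, f_ref))  # True
```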
6 Inequalities

• Cauchy-Schwarz: E[XY]² ≤ E[X²] E[Y²]
• Markov: P[ϕ(X) ≥ t] ≤ E[ϕ(X)]/t
• Chebyshev: P[|X − E[X]| ≥ t] ≤ V[X]/t²
• Chernoff: P[X ≥ (1 + δ)µ] ≤ (e^δ/(1 + δ)^{1+δ})^µ  (δ > −1)
• Hoeffding: X₁, …, X_n independent, P[X_i ∈ [a_i, b_i]] = 1, 1 ≤ i ≤ n:
    P[X̄ − E[X̄] ≥ t] ≤ e^{−2nt²}  (t > 0)
    P[|X̄ − E[X̄]| ≥ t] ≤ 2 exp(−2n²t² / Σ_i (b_i − a_i)²)  (t > 0)
• Jensen: E[ϕ(X)] ≥ ϕ(E[X])  (ϕ convex)

7 Distribution Relationships

Binomial
• X_i ∼ Bern(p) ⟹ Σ_{i=1}^n X_i ∼ Bin(n, p)
• X ∼ Bin(n, p), Y ∼ Bin(m, p) ⟹ X + Y ∼ Bin(n + m, p)
• lim_{n→∞} Bin(n, p) = Po(np)  (n large, p small)
• lim_{n→∞} Bin(n, p) = N(np, np(1 − p))  (n large, p far from 0 and 1)

Negative Binomial
• X ∼ NBin(1, p) = Geo(p)
• X ∼ NBin(r, p) = Σ_{i=1}^r Geo(p)
• X_i ∼ NBin(r_i, p) ⟹ Σ_i X_i ∼ NBin(Σ_i r_i, p)
• X ∼ NBin(r, p), Y ∼ Bin(s + r, p) ⟹ P[X ≤ s] = P[Y ≥ r]

Poisson
• X_i ∼ Po(λ_i), X_i ⊥ X_j ⟹ Σ_{i=1}^n X_i ∼ Po(Σ_{i=1}^n λ_i)
• X_i ∼ Po(λ_i), X_i ⊥ X_j ⟹ X_i | Σ_{j=1}^n X_j ∼ Bin(Σ_{j=1}^n X_j, λ_i/Σ_{j=1}^n λ_j)

Exponential
• X_i ∼ Exp(β), X_i ⊥ X_j ⟹ Σ_{i=1}^n X_i ∼ Gamma(n, β)
• Memoryless property: P[X > x + y | X > y] = P[X > x]

Normal
• X ∼ N(µ, σ²) ⟹ (X − µ)/σ ∼ N(0, 1)
• X ∼ N(µ, σ²), Z = aX + b ⟹ Z ∼ N(aµ + b, a²σ²)
• X ∼ N(µ₁, σ₁²), Y ∼ N(µ₂, σ₂²) ⟹ X + Y ∼ N(µ₁ + µ₂, σ₁² + σ₂²)
• X_i ∼ N(µ_i, σ_i²) ⟹ Σ_i X_i ∼ N(Σ_i µ_i, Σ_i σ_i²)
• P[a < X ≤ b] = Φ((b − µ)/σ) − Φ((a − µ)/σ)
• Φ(−x) = 1 − Φ(x),  φ′(x) = −xφ(x),  φ″(x) = (x² − 1)φ(x)
• Upper quantile of N(0, 1): z_α = Φ⁻¹(1 − α)

Gamma
• X ∼ Gamma(α, β) ⟺ X/β ∼ Gamma(α, 1)
• Gamma(α, β) ∼ Σ_{i=1}^α Exp(β)  (α integer)
• X_i ∼ Gamma(α_i, β), X_i ⊥ X_j ⟹ Σ_i X_i ∼ Gamma(Σ_i α_i, β)
• Γ(α)/λ^α = ∫_0^∞ x^{α−1} e^{−λx} dx

Beta
• f(x) = (Γ(α + β)/(Γ(α)Γ(β))) x^{α−1}(1 − x)^{β−1} = x^{α−1}(1 − x)^{β−1}/B(α, β)
• E[X^k] = B(α + k, β)/B(α, β) = ((α + k − 1)/(α + β + k − 1)) E[X^{k−1}]
• Beta(1, 1) ∼ Unif(0, 1)

8 Probability and Moment Generating Functions

• G_X(t) = E[t^X],  |t| < 1
• M_X(t) = G_X(e^t) = E[e^{Xt}] = E[Σ_{i=0}^∞ (Xt)^i/i!] = Σ_{i=0}^∞ E[X^i] t^i/i!
• P[X = 0] = G_X(0)
• P[X = 1] = G′_X(0)
• P[X = i] = G_X^{(i)}(0)/i!
• E[X] = G′_X(1⁻)
• E[X^k] = M_X^{(k)}(0)
• E[X!/(X − k)!] = G_X^{(k)}(1⁻)
• V[X] = G″_X(1⁻) + G′_X(1⁻) − (G′_X(1⁻))²
• G_X(t) = G_Y(t) ⟹ X and Y have the same distribution

9 Multivariate Distributions

9.1 Standard Bivariate Normal

Let X, Z ∼ N(0, 1) with X ⊥ Z, and Y = ρX + √(1 − ρ²) Z.

Joint density
  f(x, y) = (1/(2π√(1 − ρ²))) exp(−(x² + y² − 2ρxy)/(2(1 − ρ²)))

Conditionals
  (Y | X = x) ∼ N(ρx, 1 − ρ²)  and  (X | Y = y) ∼ N(ρy, 1 − ρ²)

Independence
  X ⊥ Y ⟺ ρ = 0

9.2 Bivariate Normal

Let X ∼ N(µ_x, σ_x²) and Y ∼ N(µ_y, σ_y²).

  f(x, y) = (1/(2πσ_xσ_y√(1 − ρ²))) exp(−z/(2(1 − ρ²)))

  z = ((x − µ_x)/σ_x)² + ((y − µ_y)/σ_y)² − 2ρ((x − µ_x)/σ_x)((y − µ_y)/σ_y)

Conditional mean and variance
  E[X | Y] = E[X] + ρ(σ_X/σ_Y)(Y − E[Y])
  V[X | Y] = σ_X²(1 − ρ²)

9.3 Multivariate Normal

Covariance matrix Σ (precision matrix Σ⁻¹)

  Σ = ( V[X₁]        ⋯  Cov[X₁, X_k]
        ⋮            ⋱  ⋮
        Cov[X_k, X₁] ⋯  V[X_k] )

If X ∼ N(µ, Σ),

  f_X(x) = (2π)^{−n/2} |Σ|^{−1/2} exp(−(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ))

Properties
• Z ∼ N(0, 1), X = µ + Σ^{1/2}Z ⟹ X ∼ N(µ, Σ)
• X ∼ N(µ, Σ) ⟹ Σ^{−1/2}(X − µ) ∼ N(0, 1)
• X ∼ N(µ, Σ) ⟹ AX ∼ N(Aµ, AΣAᵀ)
• X ∼ N(µ, Σ), a a vector of length k ⟹ aᵀX ∼ N(aᵀµ, aᵀΣa)
10 Convergence

Let {X₁, X₂, …} be a sequence of rv's and let X be another rv. Let F_n denote the CDF of X_n and let F denote the CDF of X.

Types of convergence

1. In distribution (weakly, in law): X_n →(d) X
     lim_{n→∞} F_n(t) = F(t)  for all t where F is continuous
2. In probability: X_n →(p) X
     (∀ε > 0) lim_{n→∞} P[|X_n − X| > ε] = 0
3. Almost surely (strongly): X_n →(as) X
     P[lim_{n→∞} X_n = X] = P[ω ∈ Ω : lim_{n→∞} X_n(ω) = X(ω)] = 1
4. In quadratic mean (L²): X_n →(qm) X
     lim_{n→∞} E[(X_n − X)²] = 0

Relationships

• X_n →(qm) X ⟹ X_n →(p) X ⟹ X_n →(d) X
• X_n →(as) X ⟹ X_n →(p) X
• X_n →(d) X and (∃c ∈ ℝ) P[X = c] = 1 ⟹ X_n →(p) X
• X_n →(p) X and Y_n →(p) Y ⟹ X_n + Y_n →(p) X + Y
• X_n →(qm) X and Y_n →(qm) Y ⟹ X_n + Y_n →(qm) X + Y
• X_n →(p) X and Y_n →(p) Y ⟹ X_nY_n →(p) XY
• X_n →(p) X ⟹ ϕ(X_n) →(p) ϕ(X)
• X_n →(d) X ⟹ ϕ(X_n) →(d) ϕ(X)
• X_n →(qm) b ⟺ lim_{n→∞} E[X_n] = b and lim_{n→∞} V[X_n] = 0
• X₁, …, X_n iid, E[X] = µ, V[X] < ∞ ⟺ X̄_n →(qm) µ

Slutzky's Theorem

• X_n →(d) X and Y_n →(p) c ⟹ X_n + Y_n →(d) X + c
• X_n →(d) X and Y_n →(p) c ⟹ X_nY_n →(d) cX
• In general, X_n →(d) X and Y_n →(d) Y does not imply X_n + Y_n →(d) X + Y

10.1 Law of Large Numbers (LLN)

Let {X₁, …, X_n} be a sequence of iid rv's with E[X₁] = µ.

  Weak (WLLN): X̄_n →(p) µ as n → ∞
  Strong (SLLN): X̄_n →(as) µ as n → ∞

10.2 Central Limit Theorem (CLT)

Let {X₁, …, X_n} be a sequence of iid rv's with E[X₁] = µ and V[X₁] = σ².

  Z_n := (X̄_n − µ)/√(V[X̄_n]) = √n(X̄_n − µ)/σ →(d) Z  where Z ∼ N(0, 1)

  lim_{n→∞} P[Z_n ≤ z] = Φ(z),  z ∈ ℝ

CLT notations
  Z_n ≈ N(0, 1)
  X̄_n ≈ N(µ, σ²/n)
  X̄_n − µ ≈ N(0, σ²/n)
  √n(X̄_n − µ) ≈ N(0, σ²)
  √n(X̄_n − µ)/σ ≈ N(0, 1)

Continuity correction
  P[X̄_n ≤ x] ≈ Φ((x + 1/2 − µ)/(σ/√n))
  P[X̄_n ≥ x] ≈ 1 − Φ((x − 1/2 − µ)/(σ/√n))

Delta method
  Y_n ≈ N(µ, σ²/n) ⟹ ϕ(Y_n) ≈ N(ϕ(µ), (ϕ′(µ))² σ²/n)
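A minimal simulation sketch of the CLT (assuming numpy and scipy are available): standardized means of exponential draws are compared against the standard normal CDF at a few points; the sample size and distribution are illustrative choices.

```python
# Sketch: empirical check that Z_n = sqrt(n)(Xbar_n - mu)/sigma is approximately N(0, 1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps, beta = 50, 20000, 2.0                    # Exp(beta): mu = beta, sigma = beta
x = rng.exponential(beta, size=(reps, n))
z = np.sqrt(n) * (x.mean(axis=1) - beta) / beta

for q in (-1.0, 0.0, 1.0):
    print(q, (z <= q).mean(), stats.norm.cdf(q))  # empirical P[Z_n <= q] vs. Phi(q)
```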
11 Statistical Inference

Let X₁, …, X_n ∼ F if not otherwise noted.

11.1 Point Estimation

• Point estimator θ̂_n of θ is a rv: θ̂_n = g(X₁, …, X_n)
• bias(θ̂_n) = E[θ̂_n] − θ
• Consistency: θ̂_n →(p) θ
• Sampling distribution: F(θ̂_n)
• Standard error: se(θ̂_n) = √V[θ̂_n]
• Mean squared error: mse = E[(θ̂_n − θ)²] = bias(θ̂_n)² + V[θ̂_n]
• lim_{n→∞} bias(θ̂_n) = 0 and lim_{n→∞} se(θ̂_n) = 0 ⟹ θ̂_n is consistent
• Asymptotic normality: (θ̂_n − θ)/se →(d) N(0, 1)
• Slutzky's Theorem often lets us replace se(θ̂_n) by some (weakly) consistent estimator σ̂_n.

11.2 Normal-Based Confidence Interval

Suppose θ̂_n ≈ N(θ, sê²). Let z_{α/2} = Φ⁻¹(1 − α/2), i.e., P[Z > z_{α/2}] = α/2 and P[−z_{α/2} < Z < z_{α/2}] = 1 − α where Z ∼ N(0, 1). Then

  C_n = θ̂_n ± z_{α/2} sê

11.3 Empirical Distribution

Empirical Distribution Function (ECDF)

  F̂_n(x) = (1/n) Σ_{i=1}^n I(X_i ≤ x),   I(X_i ≤ x) = 1 if X_i ≤ x, 0 if X_i > x

Properties (for any fixed x)
• E[F̂_n(x)] = F(x)
• V[F̂_n(x)] = F(x)(1 − F(x))/n
• mse = F(x)(1 − F(x))/n → 0
• F̂_n(x) →(p) F(x)

Dvoretzky-Kiefer-Wolfowitz (DKW) inequality (X₁, …, X_n ∼ F)

  P[sup_x |F(x) − F̂_n(x)| > ε] ≤ 2e^{−2nε²}

Nonparametric 1 − α confidence band for F

  L(x) = max{F̂_n(x) − ε_n, 0},  U(x) = min{F̂_n(x) + ε_n, 1},  ε_n = √((1/(2n)) log(2/α))
  P[L(x) ≤ F(x) ≤ U(x) for all x] ≥ 1 − α

11.4 Statistical Functionals

• Statistical functional: T(F)
• Plug-in estimator of θ = T(F): θ̂_n = T(F̂_n)
• Linear functional: T(F) = ∫ ϕ(x) dF_X(x)
• Plug-in estimator for linear functional: T(F̂_n) = ∫ ϕ(x) dF̂_n(x) = (1/n) Σ_{i=1}^n ϕ(X_i)
• Often: T(F̂_n) ≈ N(T(F), sê²) ⟹ T(F̂_n) ± z_{α/2} sê
• pth quantile: F⁻¹(p) = inf{x : F(x) ≥ p}
• µ̂ = X̄_n
• σ̂² = (1/(n − 1)) Σ_{i=1}^n (X_i − X̄_n)²
• κ̂ = (1/n) Σ_{i=1}^n (X_i − µ̂)³ / σ̂³
• ρ̂ = Σ_{i=1}^n (X_i − X̄_n)(Y_i − Ȳ_n) / (√(Σ_{i=1}^n (X_i − X̄_n)²) √(Σ_{i=1}^n (Y_i − Ȳ_n)²))
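A minimal sketch of the ECDF and the DKW confidence band above (numpy assumed); the band is evaluated at the order statistics only, which is enough for plotting or checking coverage.

```python
# Sketch: ECDF and DKW 1-alpha confidence band evaluated at the sample points.
import numpy as np

def ecdf_band(x, alpha=0.05):
    x = np.sort(x)
    n = len(x)
    F_hat = np.arange(1, n + 1) / n              # ECDF at the order statistics
    eps = np.sqrt(np.log(2 / alpha) / (2 * n))   # DKW epsilon_n
    lower = np.maximum(F_hat - eps, 0.0)
    upper = np.minimum(F_hat + eps, 1.0)
    return x, F_hat, lower, upper

rng = np.random.default_rng(1)
x, F_hat, lo, hi = ecdf_band(rng.normal(size=100))
print(F_hat[:5], lo[:5], hi[:5])
```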
12 Parametric Inference

Let F = {f(x; θ) : θ ∈ Θ} be a parametric model with parameter space Θ ⊂ ℝ^k and parameter θ = (θ₁, …, θ_k).

12.1 Method of Moments

jth moment
  α_j(θ) = E[X^j] = ∫ x^j dF_X(x)

jth sample moment
  α̂_j = (1/n) Σ_{i=1}^n X_i^j

Method of moments estimator (MoM): solve the system
  α₁(θ) = α̂₁,  α₂(θ) = α̂₂,  …,  α_k(θ) = α̂_k

Properties of the MoM estimator
• θ̂_n exists with probability tending to 1
• Consistency: θ̂_n →(p) θ
• Asymptotic normality: √n(θ̂ − θ) →(d) N(0, Σ), where Σ = g E[YYᵀ] gᵀ, Y = (X, X², …, X^k)ᵀ, g = (g₁, …, g_k) and g_j = ∂α_j⁻¹(θ)/∂θ

12.2 Maximum Likelihood

Likelihood: L_n : Θ → [0, ∞)
  L_n(θ) = ∏_{i=1}^n f(X_i; θ)

Log-likelihood
  ℓ_n(θ) = log L_n(θ) = Σ_{i=1}^n log f(X_i; θ)

Maximum likelihood estimator (mle)
  L_n(θ̂_n) = sup_θ L_n(θ)

Score function
  s(X; θ) = ∂ log f(X; θ)/∂θ

Fisher information
  I(θ) = V_θ[s(X; θ)],   I_n(θ) = nI(θ)

Fisher information (exponential family)
  I(θ) = E_θ[−∂s(X; θ)/∂θ]

Observed Fisher information
  I_n^obs(θ) = −(∂²/∂θ²) Σ_{i=1}^n log f(X_i; θ)

Properties of the mle
• Consistency: θ̂_n →(p) θ
• Equivariance: θ̂_n is the mle ⟹ ϕ(θ̂_n) is the mle of ϕ(θ)
• Asymptotic normality:
    1. se ≈ √(1/I_n(θ)) and (θ̂_n − θ)/se →(d) N(0, 1)
    2. sê ≈ √(1/I_n(θ̂_n)) and (θ̂_n − θ)/sê →(d) N(0, 1)
• Asymptotic optimality (or efficiency), i.e., smallest variance for large samples. If θ̃_n is any other estimator, the asymptotic relative efficiency is are(θ̃_n, θ̂_n) = V[θ̂_n]/V[θ̃_n] ≤ 1.
• Approximately the Bayes estimator.

12.2.1 Delta Method

If τ = ϕ(θ), where ϕ is differentiable and ϕ′(θ) ≠ 0:

  (τ̂_n − τ)/sê(τ̂) →(d) N(0, 1)

where τ̂ = ϕ(θ̂) is the mle of τ and

  sê(τ̂) = |ϕ′(θ̂)| sê(θ̂_n)
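As an illustration of the mle and its Fisher-information standard error, here is a minimal sketch for the Exp(β) model of §1.2; β̂ = X̄_n and I_n(β) = n/β² are standard results for this model, and numpy is assumed available.

```python
# Sketch: mle and Fisher-information standard error for X_i ~ Exp(beta), f(x) = exp(-x/beta)/beta.
import numpy as np

rng = np.random.default_rng(2)
beta_true = 3.0
x = rng.exponential(beta_true, size=200)

beta_hat = x.mean()                  # mle: beta_hat = Xbar_n
se_hat = beta_hat / np.sqrt(len(x))  # se ~ sqrt(1 / I_n(beta_hat)), with I_n(beta) = n / beta^2
ci = (beta_hat - 1.96 * se_hat, beta_hat + 1.96 * se_hat)  # normal-based 95% CI (Sec. 11.2)
print(beta_hat, se_hat, ci)
```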
12.3 Multiparameter Models

Let θ = (θ₁, …, θ_k) and θ̂ = (θ̂₁, …, θ̂_k) be the mle.

  H_jj = ∂²ℓ_n/∂θ_j²,   H_jk = ∂²ℓ_n/∂θ_j∂θ_k

Fisher information matrix

  I_n(θ) = −( E_θ[H₁₁]  ⋯  E_θ[H₁k]
              ⋮         ⋱  ⋮
              E_θ[H_k1] ⋯  E_θ[H_kk] )

Under appropriate regularity conditions

  (θ̂ − θ) ≈ N(0, J_n)  with J_n(θ) = I_n⁻¹(θ)

Further, if θ̂_j is the jth component of θ̂, then

  (θ̂_j − θ_j)/sê_j →(d) N(0, 1)

where sê_j² = J_n(j, j) and Cov[θ̂_j, θ̂_k] = J_n(j, k).

12.3.1 Multiparameter Delta Method

Let τ = ϕ(θ₁, …, θ_k) and let the gradient of ϕ be ∇ϕ = (∂ϕ/∂θ₁, …, ∂ϕ/∂θ_k)ᵀ.

Suppose ∇ϕ evaluated at θ = θ̂ is not zero and τ̂ = ϕ(θ̂). Then

  (τ̂ − τ)/sê(τ̂) →(d) N(0, 1)

where

  sê(τ̂) = √((∇̂ϕ)ᵀ Ĵ_n (∇̂ϕ)),   Ĵ_n = J_n(θ̂) and ∇̂ϕ = ∇ϕ evaluated at θ = θ̂.

12.4 Parametric Bootstrap

Sample from f(x; θ̂_n) instead of from F̂_n, where θ̂_n could be the mle or the method of moments estimator.

13 Hypothesis Testing

  H₀ : θ ∈ Θ₀  versus  H₁ : θ ∈ Θ₁

Definitions

• Null hypothesis H₀
• Alternative hypothesis H₁
• Simple hypothesis: θ = θ₀
• Composite hypothesis: θ > θ₀ or θ < θ₀
• Two-sided test: H₀ : θ = θ₀ versus H₁ : θ ≠ θ₀
• One-sided test: H₀ : θ ≤ θ₀ versus H₁ : θ > θ₀
• Critical value c
• Test statistic T
• Rejection region R = {x : T(x) > c}
• Power function β(θ) = P[X ∈ R]
• Power of a test: 1 − P[Type II error] = 1 − β = inf_{θ∈Θ₁} β(θ)
• Test size: α = P[Type I error] = sup_{θ∈Θ₀} β(θ)

              Retain H₀            Reject H₀
  H₀ true     correct              Type I error (α)
  H₁ true     Type II error (β)    correct (power)

p-value

• p-value = sup_{θ∈Θ₀} P_θ[T(X) ≥ T(x)] = inf{α : T(x) ∈ R_α}
• p-value = sup_{θ∈Θ₀} P_θ[T(X*) ≥ T(X)] = inf{α : T(X) ∈ R_α}, since T(X*) ∼ F_θ; equivalently 1 − F_θ(T(X))

  p-value       evidence
  < 0.01        very strong evidence against H₀
  0.01 − 0.05   strong evidence against H₀
  0.05 − 0.1    weak evidence against H₀
  > 0.1         little or no evidence against H₀

Wald test

• Two-sided test
• Reject H₀ when |W| > z_{α/2} where W = (θ̂ − θ₀)/sê
• P[|W| > z_{α/2}] → α
• p-value = P_{θ₀}[|W| > |w|] ≈ P[|Z| > |w|] = 2Φ(−|w|)

Likelihood ratio test (LRT)

• T(X) = sup_{θ∈Θ} L_n(θ) / sup_{θ∈Θ₀} L_n(θ) = L_n(θ̂_n)/L_n(θ̂_{n,0})
• λ(X) = 2 log T(X) →(d) χ²_{r−q}, where Σ_{i=1}^k Z_i² ∼ χ²_k and Z₁, …, Z_k ∼ N(0, 1) iid
• p-value = P_{θ₀}[λ(X) > λ(x)] ≈ P[χ²_{r−q} > λ(x)]

Multinomial LRT

• mle: p̂_n = (X₁/n, …, X_k/n)
• T(X) = L_n(p̂_n)/L_n(p₀) = ∏_{j=1}^k (p̂_j/p₀_j)^{X_j}
• λ(X) = 2 Σ_{j=1}^k X_j log(p̂_j/p₀_j) →(d) χ²_{k−1}
• The approximate size α LRT rejects H₀ when λ(X) ≥ χ²_{k−1,α}

Pearson Chi-square Test

• T = Σ_{j=1}^k (X_j − E[X_j])²/E[X_j] where E[X_j] = np₀_j under H₀
• T →(d) χ²_{k−1}
• p-value = P[χ²_{k−1} > T(x)]
• Converges in distribution to χ²_{k−1} faster than the LRT, hence preferable for small n

Independence testing

• I rows, J columns, X a multinomial sample of size n over the I·J cells
• mles unconstrained: p̂_ij = X_ij/n
• mles under H₀: p̂⁰_ij = p̂_i· p̂_·j = (X_i·/n)(X_·j/n)
• LRT: λ = 2 Σ_{i=1}^I Σ_{j=1}^J X_ij log(nX_ij/(X_i·X_·j))
• Pearson chi-square: T = Σ_{i=1}^I Σ_{j=1}^J (X_ij − E[X_ij])²/E[X_ij]
• LRT and Pearson →(d) χ²_ν, where ν = (I − 1)(J − 1)
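A minimal sketch of the two-sided Wald test for a Bernoulli proportion (H₀: p = 0.5), using sê = √(p̂(1 − p̂)/n); the data vector is an arbitrary illustration, and numpy/scipy are assumed available.

```python
# Sketch: two-sided Wald test for H0: p = 0.5 with Bernoulli data.
import numpy as np
from scipy import stats

x = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1])
p0 = 0.5
p_hat = x.mean()
se_hat = np.sqrt(p_hat * (1 - p_hat) / len(x))
w = (p_hat - p0) / se_hat
p_value = 2 * stats.norm.cdf(-abs(w))   # p-value = 2 * Phi(-|w|)
print(w, p_value)
```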
14 Exponential Family

Scalar parameter

  f_X(x | θ) = h(x) exp{η(θ)T(x) − A(θ)} = h(x)g(θ) exp{η(θ)T(x)}

Vector parameter

  f_X(x | θ) = h(x) exp{Σ_{i=1}^s η_i(θ)T_i(x) − A(θ)} = h(x) exp{η(θ)·T(x) − A(θ)} = h(x)g(θ) exp{η(θ)·T(x)}

Natural form

  f_X(x | η) = h(x) exp{η·T(x) − A(η)} = h(x)g(η) exp{η·T(x)} = h(x)g(η) exp{ηᵀT(x)}

15 Bayesian Inference

Bayes' Theorem

  f(θ | x) = f(x | θ)f(θ)/f(xⁿ) = f(x | θ)f(θ) / ∫ f(x | θ)f(θ) dθ ∝ L_n(θ)f(θ)

Definitions

• Xⁿ = (X₁, …, X_n),  xⁿ = (x₁, …, x_n)
• Prior density f(θ)
• Likelihood f(xⁿ | θ): joint density of the data. In particular, Xⁿ iid ⟹ f(xⁿ | θ) = ∏_{i=1}^n f(x_i | θ) = L_n(θ)
• Posterior density f(θ | xⁿ)
• Normalizing constant c_n = f(xⁿ) = ∫ f(x | θ)f(θ) dθ
• Kernel: part of a density that depends on θ
• Posterior mean θ̄_n = ∫ θ f(θ | xⁿ) dθ = ∫ θ L_n(θ)f(θ) dθ / ∫ L_n(θ)f(θ) dθ

15.1 Credible Intervals

Posterior interval

  P[θ ∈ (a, b) | xⁿ] = ∫_a^b f(θ | xⁿ) dθ = 1 − α

Equal-tail credible interval

  ∫_{−∞}^{a} f(θ | xⁿ) dθ = ∫_{b}^{∞} f(θ | xⁿ) dθ = α/2

Highest posterior density (HPD) region R_n
  1. P[θ ∈ R_n] = 1 − α
  2. R_n = {θ : f(θ | xⁿ) > k} for some k

R_n unimodal ⟹ R_n is an interval.

15.2 Function of Parameters

Let τ = ϕ(θ) and A = {θ : ϕ(θ) ≤ τ}.

Posterior CDF for τ
  H(τ | xⁿ) = P[ϕ(θ) ≤ τ | xⁿ] = ∫_A f(θ | xⁿ) dθ

Posterior density
  h(τ | xⁿ) = H′(τ | xⁿ)

Bayesian delta method
  τ | Xⁿ ≈ N(ϕ(θ̂), (sê ϕ′(θ̂))²)

15.3 Priors

Choice
• Subjective bayesianism.
• Objective bayesianism.
• Robust bayesianism.

Types
• Flat: f(θ) ∝ constant
• Proper: ∫_{−∞}^{∞} f(θ) dθ = 1
• Improper: ∫_{−∞}^{∞} f(θ) dθ = ∞
• Jeffrey's prior (transformation-invariant): f(θ) ∝ √I(θ);  multiparameter: f(θ) ∝ √det(I(θ))
• Conjugate: f(θ) and f(θ | xⁿ) belong to the same parametric family

15.3.1 Conjugate Priors

Continuous likelihood (subscript c denotes a known constant); each entry lists likelihood, conjugate prior, and posterior hyperparameters:

• Unif(0, θ), Pareto(x_m, k) prior → Pareto(max{x_(n), x_m}, k + n)
• Exp(λ), Gamma(α, β) prior → Gamma(α + n, β + Σ_i x_i)
• N(µ, σ_c²), N(µ₀, σ₀²) prior on µ →
    N( (µ₀/σ₀² + Σ_i x_i/σ_c²)/(1/σ₀² + n/σ_c²),  (1/σ₀² + n/σ_c²)⁻¹ )
• N(µ_c, σ²), Scaled Inverse Chi-square(ν, σ₀²) prior on σ² →
    Scaled Inverse Chi-square( ν + n,  (νσ₀² + Σ_i (x_i − µ_c)²)/(ν + n) )
• N(µ, σ²), Normal-scaled Inverse Gamma(λ, ν, α, β) prior on (µ, σ²) →
    Normal-scaled Inverse Gamma( (νλ + nx̄)/(ν + n),  ν + n,  α + n/2,
    β + (1/2) Σ_i (x_i − x̄)² + nν(x̄ − λ)²/(2(n + ν)) )
• MVN(µ, Σ_c), MVN(µ₀, Σ₀) prior on µ →
    MVN( (Σ₀⁻¹ + nΣ_c⁻¹)⁻¹(Σ₀⁻¹µ₀ + nΣ_c⁻¹x̄),  (Σ₀⁻¹ + nΣ_c⁻¹)⁻¹ )
• MVN(µ_c, Σ), InverseWishart(κ, Ψ) prior on Σ →
    InverseWishart( n + κ,  Ψ + Σ_i (x_i − µ_c)(x_i − µ_c)ᵀ )
• Pareto(x_mc, k), Gamma(α, β) prior on k → Gamma(α + n, β + Σ_i log(x_i/x_mc))
• Pareto(x_m, k_c), Pareto(x₀, k₀) prior on x_m → Pareto(x₀, k₀ − k_c n)  where k₀ > k_c n
• Gamma(α_c, β), Gamma(α₀, β₀) prior on β → Gamma(α₀ + nα_c, β₀ + Σ_i x_i)

Discrete likelihood:

• Bern(p), Beta(α, β) prior → Beta(α + Σ_i x_i, β + n − Σ_i x_i)
• Bin(p), Beta(α, β) prior → Beta(α + Σ_i x_i, β + Σ_i N_i − Σ_i x_i)
• NBin(p), Beta(α, β) prior → Beta(α + rn, β + Σ_i x_i)
• Po(λ), Gamma(α, β) prior → Gamma(α + Σ_i x_i, β + n)
• Multinomial(p), Dir(α) prior → Dir(α + Σ_i x^(i))
• Geo(p), Beta(α, β) prior → Beta(α + n, β + Σ_i x_i)

15.4 Bayesian Testing

If H₀ : θ ∈ Θ₀:

  Prior probability P[H₀] = ∫_{Θ₀} f(θ) dθ
  Posterior probability P[H₀ | xⁿ] = ∫_{Θ₀} f(θ | xⁿ) dθ

Let H₀, …, H_{K−1} be K hypotheses. Suppose θ ∼ f(θ | H_k). Then

  P[H_k | xⁿ] = f(xⁿ | H_k)P[H_k] / Σ_k f(xⁿ | H_k)P[H_k]

Marginal likelihood

  f(xⁿ | H_i) = ∫_Θ f(xⁿ | θ, H_i) f(θ | H_i) dθ

Posterior odds (of H_i relative to H_j)

  P[H_i | xⁿ]/P[H_j | xⁿ] = (f(xⁿ | H_i)/f(xⁿ | H_j)) × (P[H_i]/P[H_j])
                             = Bayes factor BF_ij    ×   prior odds

Bayes factor

  log₁₀ BF₁₀    BF₁₀        evidence
  0 − 0.5       1 − 1.5     Weak
  0.5 − 1       1.5 − 10    Moderate
  1 − 2         10 − 100    Strong
  > 2           > 100       Decisive

  p* = (BF₁₀ · p/(1 − p)) / (1 + BF₁₀ · p/(1 − p))  where p = P[H₁] and p* = P[H₁ | xⁿ]
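A minimal sketch of one conjugate update from the table above, the Bernoulli likelihood with a Beta(α, β) prior, together with an equal-tail credible interval from §15.1 (scipy assumed; the data and prior are illustrative).

```python
# Sketch: Beta-Bernoulli conjugate update and 95% equal-tail credible interval.
import numpy as np
from scipy import stats

x = np.array([1, 1, 0, 1, 0, 1, 1, 1, 0, 1])   # Bernoulli data
alpha0, beta0 = 1.0, 1.0                        # Beta(1, 1) = Unif(0, 1) prior
alpha_n = alpha0 + x.sum()                      # alpha + sum(x_i)
beta_n = beta0 + len(x) - x.sum()               # beta + n - sum(x_i)

posterior = stats.beta(alpha_n, beta_n)
print(posterior.mean(), posterior.ppf([0.025, 0.975]))
```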
16 Sampling Methods

16.1 Inverse Transform Sampling

Setup
• U ∼ Unif(0, 1)
• X ∼ F
• F⁻¹(u) = inf{x | F(x) ≥ u}

Algorithm
1. Generate u ∼ Unif(0, 1)
2. Compute x = F⁻¹(u)
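A minimal sketch for Exp(β), whose inverse CDF is F⁻¹(u) = −β ln(1 − u) (numpy assumed):

```python
# Sketch: inverse transform sampling for X ~ Exp(beta), F(x) = 1 - exp(-x/beta).
import numpy as np

rng = np.random.default_rng(3)
beta = 2.0
u = rng.uniform(size=10000)          # step 1: u ~ Unif(0, 1)
x = -beta * np.log(1 - u)            # step 2: x = F^{-1}(u)
print(x.mean(), x.var())             # approx beta and beta^2 (Sec. 1.2)
```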
16.2 The Bootstrap

Let T_n = g(X₁, …, X_n) be a statistic.

1. Estimate V_F[T_n] with V_{F̂_n}[T_n].
2. Approximate V_{F̂_n}[T_n] using simulation:
   (a) Repeat the following B times to get T*_{n,1}, …, T*_{n,B}, an iid sample from the sampling distribution implied by F̂_n:
       i. Sample uniformly X*₁, …, X*_n ∼ F̂_n.
       ii. Compute T*_n = g(X*₁, …, X*_n).
   (b) Then

       v̂_boot = V̂_{F̂_n} = (1/B) Σ_{b=1}^{B} (T*_{n,b} − (1/B) Σ_{r=1}^{B} T*_{n,r})²
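A minimal sketch of the bootstrap variance estimate, combined with the normal-based interval of §16.2.1, for the sample median (numpy assumed; the statistic and data are illustrative):

```python
# Sketch: nonparametric bootstrap variance and normal-based CI for the sample median.
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_t(df=3, size=200)
B = 2000

t_star = np.empty(B)
for b in range(B):
    xb = rng.choice(x, size=len(x), replace=True)  # sample X*_1..X*_n ~ F_hat_n
    t_star[b] = np.median(xb)                      # T*_n = g(X*_1, ..., X*_n)

se_boot = t_star.std(ddof=0)                       # sqrt of v_boot
t_n = np.median(x)
print(t_n, (t_n - 1.96 * se_boot, t_n + 1.96 * se_boot))
```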
16.2.1 Bootstrap Confidence Intervals

Normal-based interval

  T_n ± z_{α/2} sê_boot

Pivotal interval

1. Location parameter θ = T(F)
2. Pivot R_n = θ̂_n − θ
3. Let H(r) = P[R_n ≤ r] be the cdf of R_n
4. Let R*_{n,b} = θ̂*_{n,b} − θ̂_n. Approximate H using the bootstrap:
     Ĥ(r) = (1/B) Σ_{b=1}^{B} I(R*_{n,b} ≤ r)
5. θ*_β = β sample quantile of (θ̂*_{n,1}, …, θ̂*_{n,B})
6. r*_β = β sample quantile of (R*_{n,1}, …, R*_{n,B}), i.e., r*_β = θ*_β − θ̂_n
7. Approximate 1 − α confidence interval C_n = (â, b̂) where
     â = θ̂_n − Ĥ⁻¹(1 − α/2) = θ̂_n − r*_{1−α/2} = 2θ̂_n − θ*_{1−α/2}
     b̂ = θ̂_n − Ĥ⁻¹(α/2) = θ̂_n − r*_{α/2} = 2θ̂_n − θ*_{α/2}

Percentile interval

  C_n = (θ*_{α/2}, θ*_{1−α/2})

16.3 Rejection Sampling

Setup
• We can easily sample from g(θ)
• We want to sample from h(θ), but it is difficult
• We know h(θ) up to a proportional constant: h(θ) = k(θ) / ∫ k(θ) dθ
• Envelope condition: we can find M > 0 such that k(θ) ≤ M g(θ) for all θ

Algorithm
1. Draw θ_cand ∼ g(θ)
2. Generate u ∼ Unif(0, 1)
3. Accept θ_cand if u ≤ k(θ_cand)/(M g(θ_cand))
4. Repeat until B values of θ_cand have been accepted

Example
• We can easily sample from the prior g(θ) = f(θ)
• Target is the posterior h(θ) ∝ k(θ) = f(xⁿ | θ)f(θ)
• Envelope condition: f(xⁿ | θ) ≤ f(xⁿ | θ̂_n) = L_n(θ̂_n) ≡ M
• Algorithm:
  1. Draw θ_cand ∼ f(θ)
  2. Generate u ∼ Unif(0, 1)
  3. Accept θ_cand if u ≤ L_n(θ_cand)/L_n(θ̂_n)

16.4 Importance Sampling

Sample from an importance function g rather than the target density h.

Algorithm to obtain an approximation to E[q(θ) | xⁿ]:
1. Sample from the prior θ₁, …, θ_B ∼ f(θ)
2. w_i = L_n(θ_i) / Σ_{i=1}^B L_n(θ_i)  for all i = 1, …, B
3. E[q(θ) | xⁿ] ≈ Σ_{i=1}^B q(θ_i)w_i
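A minimal sketch of rejection sampling: the target kernel k(θ) = θ(1 − θ)⁴ (a Beta(2, 5) kernel, chosen purely for illustration) is sampled with a Unif(0, 1) proposal and envelope constant M = max k(θ); numpy assumed.

```python
# Sketch: rejection sampling from h(theta) ∝ k(theta) = theta (1 - theta)^4 (a Beta(2, 5) kernel)
# using a Unif(0, 1) proposal g and envelope constant M >= max k(theta).
import numpy as np

rng = np.random.default_rng(5)
k = lambda t: t * (1 - t) ** 4
M = k(0.2)                      # k is maximized at theta = 1/5

samples = []
while len(samples) < 5000:
    cand = rng.uniform()        # 1. draw theta_cand ~ g
    u = rng.uniform()           # 2. draw u ~ Unif(0, 1)
    if u <= k(cand) / M:        # 3. accept if u <= k / (M g), with g = 1
        samples.append(cand)

print(np.mean(samples))         # approx E[Beta(2, 5)] = 2/7
```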
17 Decision Theory

Definitions

• Unknown quantity affecting our decision: θ ∈ Θ
• Decision rule: synonymous with an estimator θ̂
• Action a ∈ A: possible value of the decision rule. In the estimation context, the action is just an estimate of θ, θ̂(x).
• Loss function L: consequences of taking action a when the true state is θ, or the discrepancy between θ and θ̂;  L : Θ × A → [−k, ∞).

Loss functions

• Squared error loss: L(θ, a) = (θ − a)²
• Linear loss: L(θ, a) = K₁(θ − a) if a − θ < 0;  K₂(a − θ) if a − θ ≥ 0
• Absolute error loss: L(θ, a) = |θ − a|  (linear loss with K₁ = K₂ = 1)
• L^p loss: L(θ, a) = |θ − a|^p
• Zero-one loss: L(θ, a) = 0 if a = θ;  1 if a ≠ θ

17.1 Risk

Posterior risk

  r(θ̂ | x) = ∫ L(θ, θ̂(x)) f(θ | x) dθ = E_{θ|X}[L(θ, θ̂(x))]

(Frequentist) risk

  R(θ, θ̂) = ∫ L(θ, θ̂(x)) f(x | θ) dx = E_{X|θ}[L(θ, θ̂(X))]

Bayes risk

  r(f, θ̂) = ∫∫ L(θ, θ̂(x)) f(x, θ) dx dθ = E_{θ,X}[L(θ, θ̂(X))]
  r(f, θ̂) = E_θ[E_{X|θ}[L(θ, θ̂(X))]] = E_θ[R(θ, θ̂)]
  r(f, θ̂) = E_X[E_{θ|X}[L(θ, θ̂(X))]] = E_X[r(θ̂ | X)]
17.2 Admissibility

• θ̂′ dominates θ̂ if
    ∀θ : R(θ, θ̂′) ≤ R(θ, θ̂)  and  ∃θ : R(θ, θ̂′) < R(θ, θ̂)
• θ̂ is inadmissible if there is at least one other estimator θ̂′ that dominates it. Otherwise it is called admissible.

17.3 Bayes Rule

Bayes rule (or Bayes estimator)

• r(f, θ̂) = inf_{θ̃} r(f, θ̃)
• θ̂(x) = inf r(θ̂ | x) for all x ⟹ r(f, θ̂) = ∫ r(θ̂ | x) f(x) dx

Theorems
• Squared error loss: posterior mean
• Absolute error loss: posterior median
• Zero-one loss: posterior mode

17.4 Minimax Rules

Maximum risk

  R̄(θ̂) = sup_θ R(θ, θ̂),   R̄(a) = sup_θ R(θ, a)

Minimax rule

  sup_θ R(θ, θ̂) = inf_{θ̃} R̄(θ̃) = inf_{θ̃} sup_θ R(θ, θ̃)

  θ̂ = Bayes rule and ∃c : R(θ, θ̂) = c  ⟹  θ̂ is minimax

Least favorable prior

  θ̂_f = Bayes rule and R(θ, θ̂_f) ≤ r(f, θ̂_f) ∀θ  ⟹  θ̂_f is minimax

18 Linear Regression

Definitions

• Response variable Y
• Covariate X (aka predictor variable or feature)

18.1 Simple Linear Regression

Model

  Y_i = β₀ + β₁X_i + ε_i,   E[ε_i | X_i] = 0,   V[ε_i | X_i] = σ²

Fitted line
  r̂(x) = β̂₀ + β̂₁x

Predicted (fitted) values
  Ŷ_i = r̂(X_i)

Residuals
  ε̂_i = Y_i − Ŷ_i = Y_i − (β̂₀ + β̂₁X_i)

Residual sums of squares (rss)
  rss(β̂₀, β̂₁) = Σ_{i=1}^n ε̂_i²

Least square estimates: β̂ᵀ = (β̂₀, β̂₁)ᵀ minimizing rss:

  β̂₀ = Ȳ_n − β̂₁X̄_n
  β̂₁ = Σ_i (X_i − X̄_n)(Y_i − Ȳ_n) / Σ_i (X_i − X̄_n)² = (Σ_i X_iY_i − nX̄Ȳ)/(Σ_i X_i² − nX̄²)

  E[β̂ | Xⁿ] = (β₀, β₁)ᵀ
  V[β̂ | Xⁿ] = (σ²/(n s_X²)) ( n⁻¹Σ_i X_i²   −X̄_n
                               −X̄_n         1 )
  sê(β̂₀) = (σ̂/(s_X√n)) √(Σ_i X_i²/n)
  sê(β̂₁) = σ̂/(s_X√n)

where s_X² = (1/(n − 1)) Σ_i (X_i − X̄_n)² and σ̂² = (1/(n − 2)) Σ_i ε̂_i² (unbiased estimate).

Further properties:
• Consistency: β̂₀ →(p) β₀ and β̂₁ →(p) β₁
• Asymptotic normality: (β̂₀ − β₀)/sê(β̂₀) →(d) N(0, 1) and (β̂₁ − β₁)/sê(β̂₁) →(d) N(0, 1)
• Approximate 1 − α confidence intervals for β₀ and β₁:
    β̂₀ ± z_{α/2} sê(β̂₀)  and  β̂₁ ± z_{α/2} sê(β̂₁)
• Wald test for H₀ : β₁ = 0 vs. H₁ : β₁ ≠ 0: reject H₀ if |W| > z_{α/2} where W = β̂₁/sê(β̂₁).

R²

  R² = Σ_i (Ŷ_i − Ȳ)²/Σ_i (Y_i − Ȳ)² = 1 − Σ_i ε̂_i²/Σ_i (Y_i − Ȳ)² = 1 − rss/tss

Likelihood

  L = ∏_i f(X_i, Y_i) = ∏_i f_X(X_i) × ∏_i f_{Y|X}(Y_i | X_i) = L₁ × L₂
  L₁ = ∏_i f_X(X_i)
  L₂ = ∏_i f_{Y|X}(Y_i | X_i) ∝ σ^{−n} exp{−(1/(2σ²)) Σ_i (Y_i − (β₀ + β₁X_i))²}

Under the assumption of normality, the least squares parameter estimators are also the MLEs, but the least squares variance estimator is not the MLE:

  σ̂²_mle = (1/n) Σ_i ε̂_i²

18.2 Prediction

Observe X = x_* of the covariate and want to predict the outcome Y_*.

  Ŷ_* = β̂₀ + β̂₁x_*
  V[Ŷ_*] = V[β̂₀] + x_*²V[β̂₁] + 2x_* Cov[β̂₀, β̂₁]

Prediction interval

  ξ̂_n² = σ̂² (Σ_j (X_j − x_*)² / (n Σ_i (X_i − X̄)²) + 1)

  Ŷ_* ± z_{α/2} ξ̂_n

18.3 Multiple Regression

  Y = Xβ + ε

where

  X = ( X₁₁ ⋯ X₁k
        ⋮   ⋱  ⋮
        X_n1 ⋯ X_nk ),   β = (β₁, …, β_k)ᵀ,   ε = (ε₁, …, ε_n)ᵀ

Likelihood

  L(µ, Σ) = (2πσ²)^{−n/2} exp(−(1/(2σ²)) rss)

  rss = (y − Xβ)ᵀ(y − Xβ) = ‖Y − Xβ‖² = Σ_{i=1}^N (Y_i − x_iᵀβ)²

If the (k × k) matrix XᵀX is invertible,

  β̂ = (XᵀX)⁻¹XᵀY
  V[β̂ | Xⁿ] = σ²(XᵀX)⁻¹
  β̂ ≈ N(β, σ²(XᵀX)⁻¹)

Estimate regression function

  r̂(x) = Σ_{j=1}^k β̂_j x_j

Unbiased estimate for σ²

  σ̂² = (1/(n − k)) Σ_{i=1}^n ε̂_i²,   ε̂ = Xβ̂ − Y

mle

  µ̂ = X̄,   σ̂²_mle = ((n − k)/n) σ̂²

1 − α confidence interval

  β̂_j ± z_{α/2} sê(β̂_j)
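A minimal sketch of the least squares estimates and the plug-in standard errors of §18.1 (numpy assumed; the simulated data are only illustrative):

```python
# Sketch: simple linear regression by least squares, with the standard errors given above.
import numpy as np

rng = np.random.default_rng(6)
n = 100
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1.5, n)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
sigma2_hat = np.sum(resid ** 2) / (n - 2)            # unbiased estimate of sigma^2
s_x = np.sqrt(np.sum((x - x.mean()) ** 2) / (n - 1))
se_b1 = np.sqrt(sigma2_hat) / (s_x * np.sqrt(n))     # se(b1) = sigma_hat / (s_X sqrt(n))
se_b0 = se_b1 * np.sqrt(np.mean(x ** 2))             # se(b0) = se(b1) * sqrt(mean(X_i^2))
print(b0, b1, se_b0, se_b1)
```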
18.4 Model Selection

Consider predicting a new observation Y* for covariates X* and let S ⊂ J denote a subset of the covariates in the model, where |S| = k and |J| = n.

Issues
• Underfitting: too few covariates yields high bias
• Overfitting: too many covariates yields high variance

Procedure
1. Assign a score to each model
2. Search through all models to find the one with the highest score

Hypothesis testing

  H₀ : β_j = 0 vs. H₁ : β_j ≠ 0  ∀j ∈ J

Mean squared prediction error (mspe)

  mspe = E[(Ŷ(S) − Y*)²]

Prediction risk

  R(S) = Σ_{i=1}^n mspe_i = Σ_{i=1}^n E[(Ŷ_i(S) − Y_i*)²]

Training error

  R̂_tr(S) = Σ_{i=1}^n (Ŷ_i(S) − Y_i)²

R²

  R²(S) = 1 − rss(S)/tss = 1 − R̂_tr(S)/tss

The training error is a downward-biased estimate of the prediction risk:

  E[R̂_tr(S)] < R(S)
  bias(R̂_tr(S)) = E[R̂_tr(S)] − R(S) = −2 Σ_{i=1}^n Cov[Ŷ_i, Y_i]

Adjusted R²

  R̄²(S) = 1 − ((n − 1)/(n − k)) (rss/tss)

Mallow's Cp statistic

  R̂(S) = R̂_tr(S) + 2kσ̂² = lack of fit + complexity penalty

Akaike Information Criterion (AIC)

  AIC(S) = ℓ_n(β̂_S, σ̂_S²) − k

Bayesian Information Criterion (BIC)

  BIC(S) = ℓ_n(β̂_S, σ̂_S²) − (k/2) log n

Validation and training

  R̂_V(S) = Σ_{i=1}^m (Ŷ_i*(S) − Y_i*)²,   m = |{validation data}|, often n/4 or n/2

Leave-one-out cross-validation

  R̂_CV(S) = Σ_{i=1}^n (Y_i − Ŷ_(i))² = Σ_{i=1}^n ((Y_i − Ŷ_i(S))/(1 − U_ii(S)))²

  U(S) = X_S(X_SᵀX_S)⁻¹X_Sᵀ  ("hat matrix")

19 Non-parametric Function Estimation

19.1 Density Estimation

Estimate f(x), where P[X ∈ A] = ∫_A f(x) dx.

Integrated square error (ise)

  L(f, f̂_n) = ∫ (f(x) − f̂_n(x))² dx = J(h) + ∫ f²(x) dx

Frequentist risk

  R(f, f̂_n) = E[L(f, f̂_n)] = ∫ b²(x) dx + ∫ v(x) dx
  b(x) = E[f̂_n(x)] − f(x)
  v(x) = V[f̂_n(x)]
19.1.1 Histograms

Definitions
• Number of bins m
• Binwidth h = 1/m
• Bin B_j has ν_j observations
• Define p̂_j = ν_j/n and p_j = ∫_{B_j} f(u) du

Histogram estimator

  f̂_n(x) = Σ_{j=1}^m (p̂_j/h) I(x ∈ B_j)

  E[f̂_n(x)] = p_j/h
  V[f̂_n(x)] = p_j(1 − p_j)/(nh²)
  R(f̂_n, f) ≈ (h²/12) ∫ (f′(u))² du + 1/(nh)
  h* = (1/n^{1/3}) (6 / ∫ (f′(u))² du)^{1/3}
  R*(f̂_n, f) ≈ (C/n^{2/3}) (∫ (f′(u))² du)^{1/3},   C = (3/4)^{2/3}

Cross-validation estimate of E[J(h)]

  Ĵ_CV(h) = ∫ f̂_n²(x) dx − (2/n) Σ_{i=1}^n f̂_(−i)(X_i) = 2/((n − 1)h) − ((n + 1)/((n − 1)h)) Σ_{j=1}^m p̂_j²

19.1.2 Kernel Density Estimator (KDE)

Kernel K
• K(x) ≥ 0
• ∫ K(x) dx = 1
• ∫ xK(x) dx = 0
• ∫ x²K(x) dx ≡ σ_K² > 0

KDE

  f̂_n(x) = (1/n) Σ_{i=1}^n (1/h) K((x − X_i)/h)

  R(f, f̂_n) ≈ (1/4)(hσ_K)⁴ ∫ (f″(x))² dx + (1/(nh)) ∫ K²(x) dx
  h* = c₁^{−2/5} c₂^{−1/5} c₃^{−1/5} / n^{1/5},   c₁ = σ_K², c₂ = ∫ K²(x) dx, c₃ = ∫ (f″(x))² dx
  R*(f, f̂_n) = (c₄/n^{4/5}) (∫ K²(x) dx)^{4/5} (∫ (f″)² dx)^{1/5},   c₄ = (5/4)(σ_K²)^{2/5}

Epanechnikov Kernel

  K(x) = (3/(4√5))(1 − x²/5) for |x| < √5;  0 otherwise

Cross-validation estimate of E[J(h)]

  Ĵ_CV(h) = ∫ f̂_n²(x) dx − (2/n) Σ_{i=1}^n f̂_(−i)(X_i)
          ≈ (1/(hn²)) Σ_i Σ_j K*((X_i − X_j)/h) + (2/(nh)) K(0)

  K*(x) = K^(2)(x) − 2K(x),   K^(2)(x) = ∫ K(x − y)K(y) dy

19.2 Non-parametric Regression

Estimate f(x) where f(x) = E[Y | X = x]. Consider pairs of points (x₁, Y₁), …, (x_n, Y_n) related by

  Y_i = r(x_i) + ε_i,   E[ε_i] = 0,   V[ε_i] = σ²

k-nearest Neighbor Estimator

  r̂(x) = (1/k) Σ_{i : x_i ∈ N_k(x)} Y_i   where N_k(x) = {k values of x₁, …, x_n closest to x}
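A minimal sketch of the KDE with a Gaussian kernel (numpy assumed; the bandwidth is an arbitrary illustration rather than the optimal h*):

```python
# Sketch: kernel density estimator f_hat(x) = (1/(n h)) sum_i K((x - X_i)/h) with a Gaussian K.
import numpy as np

def kde(x_grid, data, h):
    u = (x_grid[:, None] - data[None, :]) / h        # (grid, n) matrix of (x - X_i)/h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)   # Gaussian kernel
    return K.mean(axis=1) / h

rng = np.random.default_rng(7)
data = rng.normal(size=500)
grid = np.linspace(-4, 4, 9)
print(kde(grid, data, h=0.3))
```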
Nadaraya-Watson Kernel Estimator

  r̂(x) = Σ_{i=1}^n w_i(x)Y_i
  w_i(x) = K((x − x_i)/h) / Σ_{j=1}^n K((x − x_j)/h)  ∈ [0, 1]

  R(r̂_n, r) ≈ (h⁴/4) (∫ x²K(x) dx)² ∫ (r″(x) + 2r′(x) f′(x)/f(x))² dx + (σ² ∫ K²(x) dx)/(nh) ∫ dx/f(x)

  h* ≈ c₁/n^{1/5},   R*(r̂_n, r) ≈ c₂/n^{4/5}

Cross-validation estimate of E[J(h)]

  Ĵ_CV(h) = Σ_{i=1}^n (Y_i − r̂_(−i)(x_i))² = Σ_{i=1}^n ((Y_i − r̂(x_i)) / (1 − K(0)/Σ_{j=1}^n K((x_i − x_j)/h)))²

19.3 Smoothing Using Orthogonal Functions

Approximation

  r(x) = Σ_{j=1}^∞ β_jφ_j(x) ≈ Σ_{j=1}^J β_jφ_j(x)

Multivariate regression

  Y = Φβ + η   where η_i = ε_i  and  Φ = ( φ₀(x₁) ⋯ φ_J(x₁)
                                           ⋮      ⋱  ⋮
                                           φ₀(x_n) ⋯ φ_J(x_n) )

Least squares estimator

  β̂ = (ΦᵀΦ)⁻¹ΦᵀY ≈ (1/n) ΦᵀY   (for equally spaced observations only)

Cross-validation estimate of E[J(h)]

  R̂_CV(J) = Σ_{i=1}^n (Y_i − Σ_{j=1}^J φ_j(x_i)β̂_{j,(−i)})²

20 Stochastic Processes

Stochastic Process

  {X_t : t ∈ T},   T = {0, ±1, …} = ℤ (discrete)  or  [0, ∞) (continuous)

• Notations X_t, X(t)
• State space X
• Index set T

20.1 Markov Chains

Markov chain

  P[X_n = x | X₀, …, X_{n−1}] = P[X_n = x | X_{n−1}]   ∀n ∈ T, x ∈ X

Transition probabilities

  p_ij ≡ P[X_{n+1} = j | X_n = i]
  p_ij(n) ≡ P[X_{m+n} = j | X_m = i]

Transition matrix P (n-step: P_n)
• (i, j) element is p_ij
• p_ij ≥ 0
• Σ_j p_ij = 1

Chapman-Kolmogorov

  p_ij(m + n) = Σ_k p_ik(m) p_kj(n)
  P_{m+n} = P_m P_n

n-step

  P_n = P × ⋯ × P = Pⁿ

Marginal probability

  µ_n = (µ_n(1), …, µ_n(N))   where µ_n(i) = P[X_n = i]
  µ₀ = initial distribution
  µ_n = µ₀Pⁿ
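A minimal sketch of the marginal distribution µ_n = µ₀Pⁿ for a two-state chain (numpy assumed; the transition matrix is an arbitrary illustration):

```python
# Sketch: marginal distribution mu_n = mu_0 P^n for a two-state Markov chain.
import numpy as np

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])            # transition matrix, rows sum to 1
mu0 = np.array([1.0, 0.0])            # initial distribution

mu = mu0.copy()
for _ in range(50):
    mu = mu @ P                       # mu_{n+1} = mu_n P
print(mu)                             # approaches the stationary distribution (0.8, 0.2)
```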
20.2 Poisson Processes

Poisson process
• {X_t : t ∈ [0, ∞)} = number of events up to and including time t
• X₀ = 0
• Independent increments: ∀t₀ < ⋯ < t_n: X_{t₁} − X_{t₀} ⊥ ⋯ ⊥ X_{t_n} − X_{t_{n−1}}
• Intensity function λ(t)
  – P[X_{t+h} − X_t = 1] = λ(t)h + o(h)
  – P[X_{t+h} − X_t = 2] = o(h)
• X_{s+t} − X_s ∼ Po(m(s + t) − m(s))  where m(t) = ∫₀^t λ(s) ds

Homogeneous Poisson process

  λ(t) ≡ λ ⟹ X_t ∼ Po(λt),   λ > 0

Waiting times

  W_t := time at which X_t occurs
  W_t ∼ Gamma(t, 1/λ)

Interarrival times

  S_t = W_{t+1} − W_t
  S_t ∼ Exp(1/λ)

21 Time Series

Mean function

  µ_xt = E[x_t] = ∫_{−∞}^{∞} x f_t(x) dx

Autocovariance function

  γ_x(s, t) = E[(x_s − µ_s)(x_t − µ_t)] = E[x_s x_t] − µ_s µ_t
  γ_x(t, t) = E[(x_t − µ_t)²] = V[x_t]

Autocorrelation function (ACF)

  ρ(s, t) = Cov[x_s, x_t]/√(V[x_s]V[x_t]) = γ(s, t)/√(γ(s, s)γ(t, t))

Cross-covariance function (CCV)

  γ_xy(s, t) = E[(x_s − µ_xs)(y_t − µ_yt)]

Cross-correlation function (CCF)

  ρ_xy(s, t) = γ_xy(s, t)/√(γ_x(s, s)γ_y(t, t))

Backshift operator

  B^k(x_t) = x_{t−k}

Difference operator

  ∇^d = (1 − B)^d

White noise
• w_t ∼ wn(0, σ_w²)
• Gaussian: w_t ∼ iid N(0, σ_w²)
• E[w_t] = 0, t ∈ T
• V[w_t] = σ², t ∈ T
• γ_w(s, t) = 0 for s ≠ t, s, t ∈ T

Random walk
• Drift δ
• x_t = δt + Σ_{j=1}^t w_j
• E[x_t] = δt

Symmetric moving average

  m_t = Σ_{j=−k}^{k} a_j x_{t−j}   where a_j = a_{−j} ≥ 0 and Σ_{j=−k}^{k} a_j = 1
21.1 Stationary Time Series

Strictly stationary

  P[x_{t₁} ≤ c₁, …, x_{t_k} ≤ c_k] = P[x_{t₁+h} ≤ c₁, …, x_{t_k+h} ≤ c_k]   ∀k ∈ ℕ, t_k, c_k, h ∈ ℤ

Weakly stationary
• E[x_t²] < ∞  ∀t ∈ ℤ
• E[x_t] = m  ∀t ∈ ℤ
• γ_x(s, t) = γ_x(s + r, t + r)  ∀r, s, t ∈ ℤ

Autocovariance function
• γ(h) = E[(x_{t+h} − µ)(x_t − µ)]  ∀h ∈ ℤ
• γ(0) = E[(x_t − µ)²]
• γ(0) ≥ 0
• γ(0) ≥ |γ(h)|
• γ(h) = γ(−h)

Autocorrelation function (ACF)

  ρ_x(h) = Cov[x_{t+h}, x_t]/√(V[x_{t+h}]V[x_t]) = γ(t + h, t)/√(γ(t + h, t + h)γ(t, t)) = γ(h)/γ(0)

Jointly stationary time series

  γ_xy(h) = E[(x_{t+h} − µ_x)(y_t − µ_y)]
  ρ_xy(h) = γ_xy(h)/√(γ_x(0)γ_y(0))

Linear process

  x_t = µ + Σ_{j=−∞}^{∞} ψ_j w_{t−j}   where Σ_j |ψ_j| < ∞
  γ(h) = σ_w² Σ_{j=−∞}^{∞} ψ_{j+h} ψ_j

21.2 Estimation of Correlation

Sample mean

  x̄ = (1/n) Σ_{t=1}^n x_t

Sample variance (of the sample mean)

  V[x̄] = (1/n) Σ_{h=−n}^{n} (1 − |h|/n) γ_x(h)

Sample autocovariance function

  γ̂(h) = (1/n) Σ_{t=1}^{n−h} (x_{t+h} − x̄)(x_t − x̄)

Sample autocorrelation function

  ρ̂(h) = γ̂(h)/γ̂(0)

Sample cross-covariance function

  γ̂_xy(h) = (1/n) Σ_{t=1}^{n−h} (x_{t+h} − x̄)(y_t − ȳ)

Sample cross-correlation function

  ρ̂_xy(h) = γ̂_xy(h)/√(γ̂_x(0)γ̂_y(0))

Properties
• σ_{ρ̂_x(h)} = 1/√n if x_t is white noise
• σ_{ρ̂_xy(h)} = 1/√n if x_t or y_t is white noise

21.3 Non-Stationary Time Series

Classical decomposition model

  x_t = µ_t + s_t + w_t

• µ_t = trend
• s_t = seasonal component
• w_t = random noise term
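A minimal sketch of the sample autocovariance and autocorrelation functions defined above (numpy assumed); for white noise the sample ACF at lags h ≥ 1 should be close to 0, with standard deviation about 1/√n.

```python
# Sketch: sample autocovariance gamma_hat(h) and sample ACF rho_hat(h).
import numpy as np

def sample_acf(x, max_lag):
    x = np.asarray(x, dtype=float)
    n, xbar = len(x), x.mean()
    gamma = np.array([np.sum((x[h:] - xbar) * (x[:n - h] - xbar)) / n
                      for h in range(max_lag + 1)])
    return gamma / gamma[0]          # rho_hat(h) = gamma_hat(h) / gamma_hat(0)

rng = np.random.default_rng(8)
w = rng.normal(size=500)             # white noise
print(sample_acf(w, 5))
```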
21.3.1 Detrending

Least squares
1. Choose a trend model, e.g., µ_t = β₀ + β₁t + β₂t²
2. Minimize rss to obtain the trend estimate µ̂_t = β̂₀ + β̂₁t + β̂₂t²
3. Residuals ≙ noise w_t

Moving average
• The low-pass filter v_t is a symmetric moving average m_t with a_j = 1/(2k + 1):
    v_t = (1/(2k + 1)) Σ_{i=−k}^{k} x_{t−i}
• If (1/(2k + 1)) Σ_{j=−k}^{k} w_{t−j} ≈ 0, a linear trend function µ_t = β₀ + β₁t passes without distortion

Differencing
• µ_t = β₀ + β₁t ⟹ ∇x_t = β₁

21.4 ARIMA Models

Autoregressive polynomial

  φ(z) = 1 − φ₁z − ⋯ − φ_p z^p,   z ∈ ℂ and φ_p ≠ 0

Autoregressive operator

  φ(B) = 1 − φ₁B − ⋯ − φ_p B^p

Autoregressive model of order p, AR(p)

  x_t = φ₁x_{t−1} + ⋯ + φ_p x_{t−p} + w_t ⟺ φ(B)x_t = w_t

AR(1)
• x_t = φ^k(x_{t−k}) + Σ_{j=0}^{k−1} φ^j(w_{t−j}) → Σ_{j=0}^{∞} φ^j(w_{t−j})  as k → ∞, |φ| < 1  (linear process)
• E[x_t] = Σ_{j=0}^{∞} φ^j E[w_{t−j}] = 0
• γ(h) = Cov[x_{t+h}, x_t] = σ_w² φ^h/(1 − φ²)
• ρ(h) = γ(h)/γ(0) = φ^h
• ρ(h) = φρ(h − 1),  h = 1, 2, …

Moving average polynomial

  θ(z) = 1 + θ₁z + ⋯ + θ_q z^q,   z ∈ ℂ and θ_q ≠ 0

Moving average operator

  θ(B) = 1 + θ₁B + ⋯ + θ_q B^q

MA(q) (moving average model of order q)

  x_t = w_t + θ₁w_{t−1} + ⋯ + θ_q w_{t−q} ⟺ x_t = θ(B)w_t

  E[x_t] = Σ_{j=0}^q θ_j E[w_{t−j}] = 0
  γ(h) = Cov[x_{t+h}, x_t] = σ_w² Σ_{j=0}^{q−h} θ_jθ_{j+h} for 0 ≤ h ≤ q;  0 for h > q

MA(1)

  x_t = w_t + θw_{t−1}
  γ(h) = (1 + θ²)σ_w² for h = 0;  θσ_w² for h = 1;  0 for h > 1
  ρ(h) = θ/(1 + θ²) for h = 1;  0 for h > 1

ARMA(p, q)

  x_t = φ₁x_{t−1} + ⋯ + φ_p x_{t−p} + w_t + θ₁w_{t−1} + ⋯ + θ_q w_{t−q} ⟺ φ(B)x_t = θ(B)w_t

ARIMA(p, d, q)

  ∇^d x_t = (1 − B)^d x_t is ARMA(p, q):   φ(B)(1 − B)^d x_t = θ(B)w_t

Exponentially Weighted Moving Average (EWMA)

  x_t = x_{t−1} + w_t − λw_{t−1}
  x_t = Σ_{j=1}^{∞} (1 − λ)λ^{j−1} x_{t−j} + w_t   when |λ| < 1
  x̃_{n+1} = (1 − λ)x_n + λx̃_n

Seasonal ARIMA
• Denoted by ARIMA(p, d, q) × (P, D, Q)_s
• Φ_P(B^s)φ(B)∇_s^D ∇^d x_t = δ + Θ_Q(B^s)θ(B)w_t

Partial autocorrelation function (PACF)
• x_i^{h−1}: regression of x_i on {x_{h−1}, x_{h−2}, …, x₁}
• φ_hh = corr(x_h − x_h^{h−1}, x₀ − x₀^{h−1}),  h ≥ 2
• E.g., φ₁₁ = corr(x₁, x₀) = ρ(1)
21.4.1 Causality and Invertibility

ARMA(p, q) is causal (future-independent) ⟺ ∃{ψ_j} with Σ_{j=0}^{∞} |ψ_j| < ∞ such that

  x_t = Σ_{j=0}^{∞} ψ_j w_{t−j} = ψ(B)w_t

ARMA(p, q) is invertible ⟺ ∃{π_j} with Σ_{j=0}^{∞} |π_j| < ∞ such that

  π(B)x_t = Σ_{j=0}^{∞} π_j x_{t−j} = w_t

Properties
• ARMA(p, q) causal ⟺ roots of φ(z) lie outside the unit circle
    ψ(z) = Σ_{j=0}^{∞} ψ_j z^j = θ(z)/φ(z),  |z| ≤ 1
• ARMA(p, q) invertible ⟺ roots of θ(z) lie outside the unit circle
    π(z) = Σ_{j=0}^{∞} π_j z^j = φ(z)/θ(z),  |z| ≤ 1

Behavior of the ACF and PACF for causal and invertible ARMA models

          AR(p)                  MA(q)                  ARMA(p, q)
  ACF     tails off              cuts off after lag q   tails off
  PACF    cuts off after lag p   tails off              tails off

21.5 Spectral Analysis

Periodic process

  x_t = A cos(2πωt + φ) = U₁ cos(2πωt) + U₂ sin(2πωt)

• Frequency index ω (cycles per unit time), period 1/ω
• Amplitude A
• Phase φ
• U₁ = A cos φ and U₂ = A sin φ, often normally distributed rv's

Periodic mixture

  x_t = Σ_{k=1}^{q} (U_{k1} cos(2πω_k t) + U_{k2} sin(2πω_k t))

• U_{k1}, U_{k2}, for k = 1, …, q, are independent zero-mean rv's with variances σ_k²
• γ(h) = Σ_{k=1}^{q} σ_k² cos(2πω_k h)
• γ(0) = E[x_t²] = Σ_{k=1}^{q} σ_k²

Spectral representation of a periodic process

  γ(h) = σ² cos(2πω₀h) = (σ²/2) e^{−2πiω₀h} + (σ²/2) e^{2πiω₀h} = ∫_{−1/2}^{1/2} e^{2πiωh} dF(ω)

Spectral distribution function

  F(ω) = 0 for ω < −ω₀;  σ²/2 for −ω₀ ≤ ω < ω₀;  σ² for ω ≥ ω₀
• F(−∞) = F(−1/2) = 0
• F(∞) = F(1/2) = γ(0)

Spectral density

  f(ω) = Σ_{h=−∞}^{∞} γ(h) e^{−2πiωh},   −1/2 ≤ ω ≤ 1/2

• Needs Σ_{h=−∞}^{∞} |γ(h)| < ∞ ⟹ γ(h) = ∫_{−1/2}^{1/2} e^{2πiωh} f(ω) dω,  h = 0, ±1, …
• f(ω) ≥ 0
• f(ω) = f(−ω)
• f(ω) = f(1 − ω)
• γ(0) = V[x_t] = ∫_{−1/2}^{1/2} f(ω) dω
• White noise: f_w(ω) = σ_w²
• ARMA(p, q), φ(B)x_t = θ(B)w_t:
    f_x(ω) = σ_w² |θ(e^{−2πiω})|² / |φ(e^{−2πiω})|²
  where φ(z) = 1 − Σ_{k=1}^p φ_k z^k and θ(z) = 1 + Σ_{k=1}^q θ_k z^k
Discrete Fourier Transform (DFT)

  d(ω_j) = n^{−1/2} Σ_{t=1}^n x_t e^{−2πiω_j t}

Fourier/fundamental frequencies

  ω_j = j/n

Inverse DFT

  x_t = n^{−1/2} Σ_{j=0}^{n−1} d(ω_j) e^{2πiω_j t}

Periodogram

  I(j/n) = |d(j/n)|²

Scaled periodogram

  P(j/n) = (4/n) I(j/n) = ((2/n) Σ_{t=1}^n x_t cos(2πtj/n))² + ((2/n) Σ_{t=1}^n x_t sin(2πtj/n))²

22 Math

22.1 Gamma Function

• Ordinary: Γ(s) = ∫_0^∞ t^{s−1}e^{−t} dt
• Upper incomplete: Γ(s, x) = ∫_x^∞ t^{s−1}e^{−t} dt
• Lower incomplete: γ(s, x) = ∫_0^x t^{s−1}e^{−t} dt
• Γ(α + 1) = αΓ(α)  (α > 1)
• Γ(n) = (n − 1)!  (n ∈ ℕ)
• Γ(1/2) = √π

22.2 Beta Function

• Ordinary: B(x, y) = B(y, x) = ∫_0^1 t^{x−1}(1 − t)^{y−1} dt = Γ(x)Γ(y)/Γ(x + y)
• Incomplete: B(x; a, b) = ∫_0^x t^{a−1}(1 − t)^{b−1} dt
• Regularized incomplete:
    I_x(a, b) = B(x; a, b)/B(a, b),  and for a, b ∈ ℕ:
    I_x(a, b) = Σ_{j=a}^{a+b−1} ((a + b − 1)!/(j!(a + b − 1 − j)!)) x^j(1 − x)^{a+b−1−j}
• I_0(a, b) = 0,  I_1(a, b) = 1
• I_x(a, b) = 1 − I_{1−x}(b, a)

22.3 Series

Finite
• Σ_{k=1}^n k = n(n + 1)/2
• Σ_{k=1}^n (2k − 1) = n²
• Σ_{k=1}^n k² = n(n + 1)(2n + 1)/6
• Σ_{k=1}^n k³ = (n(n + 1)/2)²
• Σ_{k=0}^n c^k = (c^{n+1} − 1)/(c − 1),  c ≠ 1

Binomial
• Σ_{k=0}^n (n choose k) = 2ⁿ
• Σ_{k=0}^n (r + k choose k) = (r + n + 1 choose n)
• Σ_{k=0}^n (k choose m) = (n + 1 choose m + 1)
• Vandermonde's identity: Σ_{k=0}^r (m choose k)(n choose r − k) = (m + n choose r)
• Binomial theorem: Σ_{k=0}^n (n choose k) a^{n−k}b^k = (a + b)ⁿ

Infinite
• Σ_{k=0}^∞ p^k = 1/(1 − p),  Σ_{k=1}^∞ p^k = p/(1 − p)  (|p| < 1)
• Σ_{k=0}^∞ kp^{k−1} = (d/dp) Σ_{k=0}^∞ p^k = (d/dp) (1/(1 − p)) = 1/(1 − p)²  (|p| < 1)
• Σ_{k=0}^∞ (r + k − 1 choose k) x^k = (1 − x)^{−r}  (r ∈ ℕ⁺)
• Σ_{k=0}^∞ (α choose k) p^k = (1 + p)^α  (|p| < 1, α ∈ ℂ)
22.4 Combinatorics

Sampling k out of n:

                     ordered                                  unordered
  w/o replacement    n^(k) = ∏_{i=0}^{k−1} (n − i) = n!/(n − k)!   (n choose k) = n^(k)/k! = n!/(k!(n − k)!)
  w/ replacement     n^k                                      (n − 1 + k choose k) = (n − 1 + k choose n − 1)

Stirling numbers, 2nd kind

  {n k} = k{n−1 k} + {n−1 k−1},  1 ≤ k ≤ n
  {n 0} = 1 if n = 0;  0 otherwise

Partitions

  P_{n+k,k} = Σ_{i=1}^{k} P_{n,i};   k > n: P_{n,k} = 0;   n ≥ 1: P_{n,0} = 0, P_{0,0} = 1

Balls and Urns: |B| = n, |U| = m, f : B → U  (D = distinguishable, ¬D = indistinguishable)

                   f arbitrary              f injective              f surjective         f bijective
  B: D,  U: D      m^n                      m^(n) if m ≥ n, else 0   m!{n m}              n! if m = n, else 0
  B: ¬D, U: D      (m + n − 1 choose n)     (m choose n)             (n − 1 choose m − 1)  1 if m = n, else 0
  B: D,  U: ¬D     Σ_{k=1}^m {n k}          1 if m ≥ n, else 0       {n m}                1 if m = n, else 0
  B: ¬D, U: ¬D     Σ_{k=1}^m P_{n,k}        1 if m ≥ n, else 0       P_{n,m}              1 if m = n, else 0

where {n k} denotes the Stirling numbers of the second kind.
References
[1] L. M. Leemis and J. T. McQueston. Univariate Distribution Relationships. The American Statistician, 62(1):45–53, 2008.
[2] A. Steger. Diskrete Strukturen – Band 1: Kombinatorik, Graphentheorie, Algebra. Springer, 2001.
[3] A. Steger. Diskrete Strukturen – Band 2: Wahrscheinlichkeitstheorie und Statistik. Springer, 2002.
[Figure: Univariate distribution relationships, courtesy of Leemis and McQueston [1].]