PS 170A: Introductory Statistics for Political
Science and Public Policy
Lecture Notes (Part 6)
Langche Zeng
zeng@ucsd.edu
Simple Correlation and Regression Analysis
1. Introduction
• Things exist in relationships. Nothing can be well understood in
isolation.
• Correlation and regression analysis are tools for studying relationships
between two or more variables.
• Correlation: is there a (linear) relationship? how strong? x and y
symmetric
• Regression: can x help explain y? what’s the relationship in mathematical
form? how to use that form to predict y?
• examples? (lifeexp data being one: lexp as a function of safewater)
• linear relationship: pp.257-259. interpretation of α and β
• example of nonlinear relationship, p.260, fig.9.5
2. Simple Correlation Analysis
• start with examining the scatter plot
e.g. sysuse lifeexp
scatter lexp safewater
scatter lexp gnppc
• the Pearson coefficient of correlation, r: numerical measure of the direction and strength of the linear relationship between two quantitative
variables.
• formula: p.270
correlate x1 x2
• to test H0 : r = 0:
$t = r\sqrt{\dfrac{n-2}{1-r^2}}$
has a t distribution with n-2 degrees of freedom.
stata:
pairwise, with significance test of r = 0: pwcorr x1 x2 x3, sig
try lifeexp data
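As a rough illustration of the test statistic above, the following Python sketch computes r and the t statistic by hand. The data are made-up toy numbers (not the lifeexp data), and the function names are illustrative assumptions, not part of Stata or R.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)); compare to a t distribution with n - 2 df."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Hypothetical toy data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
t = t_statistic(r, len(x))
print(round(r, 4), round(t, 4))
```

Swapping x and y leaves r unchanged, which is the symmetry property noted for correlation.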
• properties: (p.271)
falls in [-1,1]
meaning of sign and magnitude
symmetric to x and y
fig.9.11, p. 272: example of perfect non-linear relationship with linear
correlation 0.
• r² is called the coefficient of determination. it measures the proportion of variation in y “explained” by x
• formula and interpretation: p. 274, fig.9.13
r² can be generalized to the multiple regression case. software routinely
reports it.
3. Simple Regression Analysis
• what’s the form of the linear relationship? once we know the form,
we can do a lot of useful things (interpretation, prediction).
• how do we best estimate the relationship with a straight line? How
do we fit a straight line to the scatter plot? according to what
method/principle?
• real data are rarely (if ever) represented exactly by a straight line.
there are errors involved:
$y_i = \alpha + \beta x_i + \varepsilon_i$
• in estimating a linear model, we try to find the “best” guess of α
and β.
• “Least Squares” is one estimation principle that optimizes things in
a certain sense: it minimizes the sum of squared errors: $\min \sum_i \varepsilon_i^2$
draw figure.
• optimization is a mathematical procedure. it results in “estimators”
for the coefficients. An estimator is a formula that translates input
data into an estimated value for a quantity of interest.
• formula for $\hat{\alpha}$ and $\hat{\beta}$: p.261
for $\hat{\beta}$, it’s the same as $\dfrac{\mathrm{cov}(x,y)}{V(x)}$
Messy looking, but easy to get from statistical packages.
regress lexp safewater
R: ?lm for details
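A minimal Python sketch of the slope formula cov(x,y)/V(x), using made-up toy data. The intercept line uses the standard least-squares result that the fitted line passes through the point of means; that step and the function name `ols_fit` are assumptions added here, not taken from p.261.

```python
def ols_fit(x, y):
    """Return (alpha_hat, beta_hat) for the fitted line y = alpha + beta * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sample covariance of x and y, and sample variance of x.
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    beta_hat = cov_xy / var_x
    # The least-squares line passes through (mean_x, mean_y).
    alpha_hat = mean_y - beta_hat * mean_x
    return alpha_hat, beta_hat

# Hypothetical toy data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

alpha_hat, beta_hat = ols_fit(x, y)
print(round(alpha_hat, 2), round(beta_hat, 2))
```

In practice "regress" in Stata or "lm" in R does this (and much more) in one command.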
• goodness of fit: R²
verify that R² is “Model SS/Total SS” in the stata output.
R² can fall between 0 and 1.
MSS is $\sum_i (\hat{Y}_i - \bar{Y})^2$
TSS is $\sum_i (Y_i - \bar{Y})^2$
ESS (or SSE) is $\sum_i (Y_i - \hat{Y}_i)^2$
Show in figure: the distance from a point $Y_i$ to the mean of Y can
be decomposed into two parts: the distance from $Y_i$ to the model
prediction $\hat{Y}_i$ (residual error), and that from the model prediction to
$\bar{Y}$.
• $\sqrt{\dfrac{SSE}{n-2}}$ is the “Root MSE” in stata output. it is an estimate for the
standard deviation of the residuals (which have mean 0). gives an
idea of the average size of the residual errors.
it provides an estimate for σ. the meaning of which is illustrated
in fig. 9.8, p.267 for a relationship in the population (when the
parameter values are known).
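The sums of squares and the Root MSE can be checked numerically. The sketch below uses made-up toy data and fits the line by least squares first; it then confirms that TSS = MSS + ESS and computes R² and the Root MSE. Variable names are illustrative assumptions.

```python
import math

# Hypothetical toy data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least-squares slope and intercept for the simple linear model.
beta_hat = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
            / sum((a - mean_x) ** 2 for a in x))
alpha_hat = mean_y - beta_hat * mean_x
fitted = [alpha_hat + beta_hat * a for a in x]

tss = sum((b - mean_y) ** 2 for b in y)             # total sum of squares
mss = sum((f - mean_y) ** 2 for f in fitted)        # model sum of squares
ess = sum((b - f) ** 2 for b, f in zip(y, fitted))  # error (residual) sum of squares

r_squared = mss / tss
root_mse = math.sqrt(ess / (n - 2))

print(round(tss, 4), round(mss, 4), round(ess, 4))
print(round(r_squared, 4), round(root_mse, 4))
```

With one predictor, R² computed this way equals the square of the Pearson correlation between x and y.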
• $\hat{y} = \hat{\alpha} + \hat{\beta}x$ is called the prediction equation.
e.g., lifeexp data
also example 9.4, p.261.
stata: predict ybar
list ybar lexp safewater
se of prediction:
predict ybarse, stdp
R: predict.lm(lm.out,se.fit = FALSE); or lm.out$fitted.values
• effects of outliers
fig.9.6, p.262
fig.9.17, p.285
• inferences for the coefficients: model assumptions
p.276.
assumption 1: the sample is representative of the population
assumption 2: the model is correct
assumption 3: homoscedasticity. fig.9.9, p.269
assumption 4: normality (not critical for large N).
• $\hat{\alpha}$ and $\hat{\beta}$ as random variables
different sample data give different estimated coefficient values. we
can think about how the estimated values are distributed over possible
samples.
• and we can discuss properties of the estimators: unbiasedness, efficiency, etc.
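• can show that the standardized estimates, $(\hat{\alpha}-\alpha)/se(\hat{\alpha})$ and $(\hat{\beta}-\beta)/se(\hat{\beta})$, follow t distributions with n − 2 df. see stata
output for t-values and p-values for testing the null hypothesis, and see
the CIs constructed using these distributions.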
p.277: test of β = 0
• violations of assumptions
reality check (later chapters)
Multivariate Relationships and Multiple Regression
1. Association and Causation
• Real world relationships usually involve multiple variables.
e.g. what are possible predictors of college GPA? voting behavior?
• causal effect inference is central to social science research.
• association does not equal causation.
e.g. fig. 10.1: p.305. causal effect of height on math achievement?
relationship disappears controlling for the grade level/age.
“spurious association” caused by some common cause.
• other types of multivariate relationships: table 10.5, p.315
chain: father’s education → son’s education → son’s income
interaction: effect of x1 on y depends on the value of x2. e.g., effect
of education on income may depend on race or gender.
direct and indirect effects: e.g., gender on party ID, both directly and
through ideology.
spurious non-association: suppose education → income for the same
age group. suppose age is positively related to income but negatively
with education. then without controlling for age the relationship
between education and income may not show up.
• for a relationship to be considered causal, need to satisfy these necessary conditions:
a) association
b) appropriate time order (e.g., gender/race are causally prior to
behavioral variables)
c) elimination of alternative explanations
• to achieve c), we frequently need to “control” the influence of other
variables by holding their values constant.
e.g. controlling for grade level in the height/math score example.
grade level is called a “control variable” in this example. holding its
value constant is a case of “statistical control”.
In regression analysis, control variables enter the model as independent variables, along with the key causal variable of interest.
2. Multiple regression analysis
• For k > 1 explanatory variables. e.g. state level violent crime rate
as a function of “poverty” (percentage of the state population living
in poverty) and “metro” (percentage of the state population living
in a metropolitan area, could be a common cause of crime rate and
poverty). Model:
$y_i = \alpha + \beta_1 x_{1i} + \dots + \beta_k x_{ki} + \varepsilon_i$
Or equivalently:
$E(Y_i) = \alpha + \beta_1 x_{1i} + \dots + \beta_k x_{ki}, \quad i = 1, 2, \dots, n$
Where the $\varepsilon_i$’s are assumed to be independent and distributed
N(0, σ), among other things. (Same as in simple linear model.)
• Meaning of βk: A one unit increase in xk is associated with a βk unit
increase in E(y), holding all other x’s constant. If βk is 0 in the
population, then there is no relationship between xk and y (for any
i). βk is called the “marginal effect” of xk on E(y). Meaning of α:
E(y) when all x=0.
• How to find the “best” α and β’s? According to the same OLS
principle. Now fitting a “plane” in k+1 dimensional space, rather
than a line in 2 dimensional space. (Imagine k=2.) We minimize the
sum of squared errors from observed data points to the regression
plane. See fig. 11.1, p.322.
• Property of OLS estimators (under the “classical linear model assumptions”): BLUE (best linear unbiased estimator).
• Example: Violent Crime Rates = f(poverty, metro)
The estimated regression model is
E(Crime) = -495.87 + 33.16*Poverty + 9.57*Metro
Mean(Poverty)=13.9%, mean(Metro)=66.0%. For all states with
Poverty=13.9% and Metro= 66%, we predict the average violent
crime rate to be -495.87 + 33.16*13.9 + 9.57*66 = 597 (per 100,000
population)
Prediction for an individual state is the same, but with higher uncertainty due to the random error term.
The marginal effect of “poverty” on the average crime rate, holding
“metro” constant (at any value), is 33.16: every 1 percentage point increase in
“poverty” corresponds to 33.16 more violent crimes per 100k
population.
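The arithmetic in this example can be reproduced with a few lines of Python. The coefficients and means are the ones quoted above; the function name `predicted_crime` is an illustrative assumption.

```python
# Estimated coefficients quoted in the notes.
ALPHA = -495.87
B_POVERTY = 33.16
B_METRO = 9.57

def predicted_crime(poverty, metro):
    """Predicted mean violent crime rate (per 100,000) from the fitted model."""
    return ALPHA + B_POVERTY * poverty + B_METRO * metro

# Prediction at the sample means of both predictors.
at_means = predicted_crime(13.9, 66.0)
print(round(at_means))  # 597

# Marginal effect of poverty: raise poverty by one point, hold metro fixed.
# The difference is the same whatever value metro is held at.
effect_at_66 = predicted_crime(14.9, 66.0) - predicted_crime(13.9, 66.0)
effect_at_30 = predicted_crime(14.9, 30.0) - predicted_crime(13.9, 30.0)
print(round(effect_at_66, 2), round(effect_at_30, 2))
```

The two printed effects agree because the model is additive: the slope on poverty does not depend on metro.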
• Standardized coefficients:
To compare the relative effects of different independent variables,
we need standardized coefficients. The original coefficients depend on the units of measurement.
The standardized coefficient for xk, obtained by multiplying the estimated coefficient by $s_{x_k}/s_y$,
measures the standard deviation change
in y given a one standard deviation change in xk.
The standardized coefficients can also be obtained by using the z-scores of the original variables in the regression model.
Stata: regress y x1 x2 x3, beta
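A small Python sketch of the two routes to a standardized coefficient, restricted to the one-predictor case to keep it short (with several predictors the same idea applies, but the fit needs a full multiple regression). The data are made-up toy numbers and the helper names are illustrative assumptions.

```python
import math

def mean(v):
    return sum(v) / len(v)

def sd(v):
    """Sample standard deviation (n - 1 in the denominator)."""
    m = mean(v)
    return math.sqrt(sum((a - m) ** 2 for a in v) / (len(v) - 1))

def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = mean(x), mean(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

# Hypothetical toy data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Route 1: rescale the raw slope by s_x / s_y.
standardized = slope(x, y) * sd(x) / sd(y)

# Route 2: regress the z-scores of y on the z-scores of x.
zx = [(a - mean(x)) / sd(x) for a in x]
zy = [(b - mean(y)) / sd(y) for b in y]
slope_z = slope(zx, zy)

print(round(standardized, 4), round(slope_z, 4))
```

Both routes give the same number; with a single predictor it also equals the Pearson correlation r.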
• Goodness of Fit: the Coefficient of Determination (R²)
R² measures how well the regression model fits the data: how much
variation in the values of the response variable (y) is explained by the regression model (i.e., by all the independent variables
collectively).
The distance between an observed Y and the mean of Y in the data
set can be decomposed into two parts: from Y to E(Y) given by the
regression model, and from E(Y) to the mean of all Y. R² is defined
as MSS/TSS, or 1-ESS/TSS (p.332). The higher the R², the better
the fit.
Adding more independent variables to the model never decreases
R²; Stata reports the “adjusted R²” to account for model complexity.
Ultimately, goodness of fit measures should not be used as the model
selection criterion, as a model could over-fit the data. Compare out-of-sample prediction performance instead.
• Checking functional form assumption: partial regression plot (also
called added-variable plot)
Plots the relationship between y and xk after removing the effects
of the other predictors: residual from “reg y z” against residual from
“reg x z”, where z denotes the set of all other independent variables.
stata: avplots
• Residual plots: residual against fitted values: “rvfplot”. look for patterns
in the residuals that signal violations of assumptions.
Test of heteroscedasticity: estat hettest
• Multicollinearity: when there is relatively strong correlation among
some of the xk ’s, some of the individual variables may not add much
predictive power. Correlation also makes interpretation of results
difficult, since “holding all others constant” while moving one variable
is unrealistic when the variables are strongly correlated. Try to use
predictors with weak correlations if possible.
• Hypothesis Testing: Is There a Relationship?
The estimated α and βk values (the sample intercept and slopes) are based on
one particular sample. What are the corresponding population parameters?
a) Testing a population slope being 0 (intercept similar, but less
interesting), i.e., testing the hypothesis that there is no
relationship between some x and y:
Recall logic of hypothesis test. Under the null hypothesis, the sampling distribution of $\hat{\beta}_k / se(\hat{\beta}_k)$ is shown to follow the
“Student-t” distribution (assuming unknown σ) with n-k-1 degrees
of freedom. Software routinely reports the p-values from the test.
(See Stata output)
b) We can also test the “global” hypothesis that all βk’s are simultaneously 0, i.e., our independent variables as a group have no
significant effect on our dependent variable. This is done using the
so-called “F-test”; p-values for F(k, n-k-1) are routinely reported by software. Rejecting the null means: at least one x “matters”.
Formula for F: p.336. figure 11.9.
Stata output F statistic.
c) testing a subgroup of parameters being 0:
The global F-test being a special case.
formula for F: p.345 bottom.
Stata: test P=M=0 (for example)
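For the global test in b), a common way to write the F statistic is in terms of R²: F = (R²/k) / ((1 − R²)/(n − k − 1)), with k and n − k − 1 degrees of freedom. The sketch below assumes this standard R²-based form (whether it is written exactly this way on p.336 is an assumption), and the input values are hypothetical.

```python
def global_f(r_squared, n, k):
    """Global F statistic for H0: all k slopes are 0, from R^2, n and k.

    Degrees of freedom are (k, n - k - 1).
    """
    return (r_squared / k) / ((1 - r_squared) / (n - k - 1))

# Hypothetical values, for illustration only: 51 cases, 2 predictors.
f_stat = global_f(r_squared=0.5, n=51, k=2)
print(f_stat, (2, 51 - 2 - 1))

# With a single predictor (k = 1), F equals the square of the slope's t statistic.
f_one = global_f(r_squared=0.6, n=5, k=1)
print(round(f_one, 4))
```

The p-value then comes from the F distribution with those degrees of freedom, which Stata reports alongside the statistic.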
• Beware...
OLS not robust to outliers (regardless of the number of x variables)
Extrapolation beyond observed data region dangerous
Correlation does not imply causation.
Properties of OLS estimators hold only if the model assumptions are
satisfied
• Modeling Interaction Effects: Special Case of Non-linearity
In the linear additive model, the marginal effect of some x on E(y)
is constant, independent of the values of the other x’s in the model.
This is generally not true in a non-linear model.
Interaction effect model is a special case of a non-linear model.
Simple example:
E(Y ) = α + β1x1 + β2x2 + β3x1x2
In this model, the marginal effect of x1 depends on the value of x2.
e.g. x1 = Gender (female=1), x2 = Education (high=1), Y=Prochoice abortion opinion (higher score → stronger pro-choice views).
Estimated model: (showing reversed gender gap)
E(Y ) = 4.04 − .55x1 + 1.09x2 + 1.16x1x2
male/low educ: 4.04; female/low educ: 4.04-.55; male/high educ:
4.04+1.09; female/high educ: 4.04-.55+1.09+1.16
Can write out the prediction model for separate groups.
The slope of x1, for example, (as well as the intercept of the model),
differs when x2 takes different values.
Another example: p.342. Fig. 11.10.
Try “reg VR P M PM” (after “use http://dss.ucsd.edu/~lazeng/ps170/table9.1.dta”
and “gen PM=P*M”).
What is the marginal effect of P (M) when M (P) is at the mean?
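For the gender/education model estimated above, the group predictions and the changing gender effect can be written out in a short Python sketch. The coefficients are the ones quoted in the notes; the function names are illustrative assumptions.

```python
# Estimated coefficients from the abortion-opinion example in the notes.
A, B1, B2, B3 = 4.04, -0.55, 1.09, 1.16

def predict(x1, x2):
    """E(Y) for gender x1 (female=1) and education x2 (high=1)."""
    return A + B1 * x1 + B2 * x2 + B3 * x1 * x2

def gender_effect(x2):
    """Effect of moving x1 from 0 to 1 at a given education level: B1 + B3 * x2."""
    return predict(1, x2) - predict(0, x2)

for x2, educ in [(0, "low educ"), (1, "high educ")]:
    for x1, sex in [(0, "male"), (1, "female")]:
        print(f"{sex}/{educ}: {predict(x1, x2):.2f}")

# The gender gap changes sign between the two education groups.
print(round(gender_effect(0), 2), round(gender_effect(1), 2))
```

The same pattern answers the P and M question: with an interaction term, the marginal effect of one variable is its own coefficient plus the interaction coefficient times the value of the other variable.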
• Dummy X Variables
Sometimes one or more of our independent variables may be categorical variables, such as gender or race. Multiple valued categorical
variables can be recoded into a set of binary “dummy” variables taking values 0/1. e.g. White/Black/Hispanic/Asian (Why we don’t
want to use the multiple valued variable “race” in the regression
model, if it’s coded say 1,2,3,4?)
If there are m categories, we use m-1 dummies in the model, since
the last one does not add any information: knowing the value of
“White”, “Black”, and “Hispanic” we can infer the value of “Asian”
(assuming these exhaust the racial categories in the data). Similarly,
for “gender” we only need one variable, not two.
Dummy variables change the intercept and/or the slope of the relationships for the different groups represented by the dummy.
The most natural way of interpreting the effect of a dummy variable
is to see its effect on Y as it goes from 0 to 1.
If Y is a dummy variable, standard linear regression model doesn’t
apply. We’ll need to use models for binary distributions, such as logit
or probit, to which we turn next.
Basics of Logit/Probit Models
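• Binary dv, model:
$Y^* = x\beta + \varepsilon$
$Y = 1$ if $Y^* > 0$, and $Y = 0$ otherwise
Assuming a probability distribution for $\varepsilon$ that is symmetric about 0,
$P(Y = 1|X) = P(Y^* > 0|X) = P(\varepsilon > -X\beta) = P(\varepsilon \le X\beta)$
$= \Phi(X\beta)$ (Probit) or $\dfrac{1}{1+e^{-X\beta}}$ (logit)
Graph: fig.15.1, p.484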
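The two link functions can be evaluated directly. The Python sketch below computes P(Y = 1) for a few hypothetical values of the linear predictor Xβ; the normal CDF is written with the error function from the standard library, and the function names are illustrative assumptions.

```python
import math

def probit_prob(xb):
    """P(Y = 1) under probit: the standard normal CDF evaluated at xb."""
    return 0.5 * (1.0 + math.erf(xb / math.sqrt(2.0)))

def logit_prob(xb):
    """P(Y = 1) under logit: 1 / (1 + exp(-xb))."""
    return 1.0 / (1.0 + math.exp(-xb))

# Hypothetical values of the linear predictor, for illustration only.
for xb in (-2.0, 0.0, 2.0):
    print(xb, round(probit_prob(xb), 4), round(logit_prob(xb), 4))
```

Both curves are S-shaped, stay between 0 and 1, and equal 0.5 when Xβ = 0; the logit curve has heavier tails, so the two sets of coefficients are on different scales.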
• what can be learned from Stata output, what can’t
use http://dss.ucsd.edu/~lazeng/ps170/class.dta
gen abortion=(ab=="y")
gen lifeafter=(ld=="y")
gen gender=(ge=="f")
logit abortion gender lifeafter pi
• Predicted probabilities, first differences, etc., with uncertainty/CI:
“findit clarify” to find and install clarify
estsimp logit abortion gender lifeafter pi
setx gender 1 lifeafter 1
simqi, fd(pr) changex(pi 1 7)
mean diff in pr=-.6751331, sd=.2051763,
95% CI=(-.939707, -.155823)
(“help estsimp” and so on)
R: glm, Zelig
Relationships between categorical variables
• When both dependent and independent variables are categorical, data
can be presented in a contingency table. e.g. table 8.1, p.222. Party
ID and gender.
• Non-parametric analysis of the relationship: is there an association?
What’s the (cell) pattern of the association, and what’s the strength
of the association?
No distributional assumptions on “error” terms. Just working with
the actual observed raw data.
• Independence: the distribution of Y is independent of X:
P (Y |X = x1) = P (Y |X = x2) = . . . = P (Y |X = xr ) = P (Y )
e.g., table 8.3, p.224. Party ID distribution independent of race.
P(Y=D)=.44 whatever the race value is. Similar for other values of
Y.
In contrast, table 8.2, p.222 might be evidence for dependence/association.
“Might be” because this is sample data, thus uncertainty about the
population relationship.
• Chi-Square test of independence: (for nominal data)
Under the null hypothesis that there is no association, the distribution
of Y should be independent of the values of X. So knowing the total
number of cases, and the total number for each value of Y , we can
write down the expected frequency of observed Y values.
table 8.4, p.225: under H0, P (Y = D) = ND /N = 959/2771,
P (Y = R) = NR/N = 821/2771, P (Y = I) = NI /N =
991/2771,
These probabilities should stay the same regardless of which “Gender” value we are looking at. So for example the expected frequency
for the “Female” and “Democrat” cell should be x such that
x/1511 = 959/2771, → x = (959/2771) ∗ 1511
Expected frequencies for other cells are similarly filled.
Now we can compare the observed frequencies with the expected.
If H0 is true, we expect that they don’t differ much. The sum of
normalized squared differences supplies the χ² test statistic on p.225,
with degrees of freedom (r − 1) ∗ (c − 1) (r: no. of rows; c: no.
of cols), assuming N reasonably large: expected frequency in each
cell > 5 (if not satisfied, use Fisher’s exact test).
Density of χ² distributions: fig.8.2, p.226.
Summary of the Chi-square test: table 8.5, p.228.
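The expected-frequency calculation and the test statistic can be sketched in Python. The first part uses only the marginal totals quoted above (the male total is derived by subtraction); the second part applies the statistic to a made-up 2x2 table, since the observed cell counts of table 8.4 are not reproduced in these notes. Function names are illustrative assumptions.

```python
def expected_counts(row_totals, col_totals):
    """Expected cell counts under independence: row total * column total / N."""
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    expected = expected_counts(row_totals, col_totals)
    # Sum of (observed - expected)^2 / expected over all cells.
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df

# Marginal totals quoted in the notes; rows are gender, columns are party ID.
gender_totals = [1511, 2771 - 1511]   # Female, Male (Male derived by subtraction)
party_totals = [959, 821, 991]        # Democrat, Republican, Independent
expected = expected_counts(gender_totals, party_totals)
print(round(expected[0][0], 1))       # expected Female/Democrat count

# Hypothetical 2x2 table, made up for illustration only.
toy_table = [[30, 20],
             [20, 30]]
stat, df = chi_square(toy_table)
print(stat, df)
```

The statistic is then compared with a χ² distribution with the printed degrees of freedom to obtain the p-value.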
• Cell residual pattern: how do the data differ from the expected pattern?
Standardized residuals (box, p.230).
example data: table 8.8, p.231.
• Strength of association
Chi-square test does not measure the strength of an association; it only
tests for its existence.
For nominal data, hard to summarize the strength of association
with a single number for larger than 2x2 tables, too many possible
association patterns.
2x2 tables: can look at the difference in proportions, magnitude in
[0,1]. e.g., table 8.11, p.234.
General idea is to look at how different P (Y ) can be for different X
values. So distance measures could apply.