Assignment - Jeff Goldsmith

Biostatistics P8111, Spring 2015
Homework 5
Due April 24 by 6:00pm
Please email an electronic copy (PDF) of your solutions to ajg2202@cumc.columbia.edu. Use the title
“P8111HW5_Lastname_Firstname.pdf”.
Solutions to Problem 1 can be typed or handwritten and scanned; Problems 2-4 should be completed using R
markdown. In addition to your PDF, please also submit the .Rmd that produces your written solutions to
Problems 2-4 with the title “P8111HW5_Lastname_Firstname.Rmd”.
Problem 1. [4+3 = 7 points]
Consider the random effect model
$$y = X\beta + Zb + \epsilon$$
with response vector $y$, fixed effect design matrix $X$, random effect design matrix $Z$, coefficient vector $\beta$,
random effect vector $b$, and error vector $\epsilon$. Assume that
• The errors are multivariate normal: $\epsilon \sim N(0, \sigma^2 I)$.
• The random effects are multivariate normal: $b \sim N(0, \sigma_b^2 I)$.
Recast this model treating $u = Zb + \epsilon$ as the “total error”.
(a) What is the error variance in your new model?
(b) Assume that the random effect structure is a random intercept at the subject level and show that your
answer in part (a) is equivalent to the marginal correlation structure derived in class.
Problem 2. [5+3 = 8 points]
Suppose I have a balanced longitudinal dataset with three equally-spaced visits for each subject. I treat visit
number as a continuous variable (with values 1, 2, and 3) and pose a random effects model with visit as a
predictor and a random intercept and random slope on visit number for each subject:
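$$y_{ij} = \beta_0 + b_{i,0} + \beta_1 \mathrm{vis}_{ij} + b_{i,1} \mathrm{vis}_{ij} + \epsilon_{ij}$$
where
• $b_{i,0} \sim N[0, \tau_0^2]$
• $b_{i,1} \sim N[0, \tau_1^2]$
• $\epsilon_{ij} \sim N[0, \nu^2]$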
(a) The marginal covariance is block-diagonal with $3 \times 3$ blocks $V_i$ giving the within-subject covariance.
What are the entries in $V_i$?
(b) Simulate 10000 subjects according to the model above and compute the marginal covariance to confirm
your derivation in part (a).
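One possible way to set up the simulation in part (b) is sketched below. The parameter values, the seed, and the object names are hypothetical and chosen only for illustration, and the random intercepts and slopes are drawn independently, which is an assumption not stated in the model above.

```r
# Hypothetical parameter values, chosen only for illustration
set.seed(8111)
n_subj <- 10000
vis    <- 1:3
beta0  <- 1
beta1  <- 0.5
tau0   <- 1.0   # SD of the random intercepts
tau1   <- 0.5   # SD of the random slopes
nu     <- 0.7   # SD of the residual errors

# Subject-level random effects, drawn independently (an assumption)
b0 <- rnorm(n_subj, mean = 0, sd = tau0)
b1 <- rnorm(n_subj, mean = 0, sd = tau1)

# Residual errors: one row per subject, one column per visit
eps <- matrix(rnorm(n_subj * length(vis), mean = 0, sd = nu), nrow = n_subj)

# Responses: subject-specific intercept plus subject-specific slope times visit
y <- (beta0 + b0) + outer(beta1 + b1, vis) + eps

# Empirical 3x3 within-subject covariance, to compare with the derived V_i
V_hat <- cov(y)
print(round(V_hat, 2))
```

Storing the responses in wide form (subjects in rows, visits in columns) lets `cov()` return the within-subject covariance directly.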
Problem 3. [2+5+3+5 = 15 points]
The Alzheimer’s Disease Neuroimaging Initiative (ADNI) has collected demographic and neuroimaging
information from many affected patients and controls to better understand the progression of disease. A
subsample of the data (processed for this analysis by Wes Thompson) can be loaded into R using
data = read.csv("http://jeffgoldsmith.com/P8111/P8111_HWs/ADNI8111.csv")
This dataset contains four variables:
• age, in years
• e4, a binary indicator of APOE risk allele presence (0 negative, 1 positive)
• y, a standardized cortical thickness measure
• id, the subject ID number
The cortical thickness outcome is a biomarker for disease progression; lower values indicate worse disease.
(a) Fit a linear model examining the effect of age, e4, and their interaction on the outcome, ignoring the
correlation within subjects. Plot this model and interpret the coefficients.
(b) Fit a model with the same fixed effects and a random intercept to account for repeated observations.
Plot this model and interpret the parameters, including the coefficients and variances.
(c) Compare your models from parts (a) and (b). What are possible scientific explanations for the differences
between models?
(d) Construct a confidence interval for the intraclass correlation coefficient.
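For reference, under a random intercept model the intraclass correlation coefficient is commonly defined as the share of total variance attributable to the random intercept. The symbols below reuse the notation of Problem 1 (random effect variance $\sigma_b^2$, error variance $\sigma^2$); this is an assumption, and the course notes may use different symbols.

$$\rho = \frac{\sigma_b^2}{\sigma_b^2 + \sigma^2}$$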
Problem 4. [4+3+8 = 15 points]
This problem extends the analysis of Nepalese children considered in Homework 4. In the current dataset,
there are up to five observations per subject, and we are interested in understanding the association between
arm circumference, weight, and sex while accounting for repeated observations at the subject level. The data
can be loaded using
data = read.csv("http://jeffgoldsmith.com/P8111/P8111_HWs/NepalLong.csv")
This dataset contains the following variables:
• id, the subject ID number
• sex, a binary indicator of sex (0 male, 1 female)
• wt, weight in kg
• ht, height in cm
• arm, arm circumference in cm
• age, age in months
(a) As a first step in the analysis, you will need to remove missing observations from the dataset. As is
commonly done, missing values for arm and wt are indicated using “unnaturally large values.” Identify
the values used to indicate missingness, and remove incomplete observations from the dataset. After
removing missing values, describe the study population – how many observations do we typically have
for each child? What is the average age at first visit?
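The steps in part (a) can be sketched on a small made-up data frame. Every value in `toy`, including the placeholder `999`, is hypothetical. The real dataset may use a different value, which you need to identify yourself. Here "first visit" means the first visit remaining after filtering, which is one possible reading.

```r
# Toy data standing in for the real file; every value here is made up
toy <- data.frame(
  id  = c(1, 1, 1, 2, 2, 3),
  age = c(12, 16, 20, 30, 34, 8),
  wt  = c(8.1, 8.6, 999, 11.2, 11.5, 6.9),
  arm = c(13.0, 999, 13.4, 14.1, 14.3, 12.2)
)

# Inspecting the upper tail is one way to spot implausible placeholder values
print(summary(toy[, c("wt", "arm")]))

# Hypothetical sentinel; the real dataset may use a different value
sentinel <- 999
complete <- subset(toy, wt != sentinel & arm != sentinel)

# Number of observations per child after filtering
obs_per_child <- table(complete$id)
print(obs_per_child)

# Age at the first remaining visit for each child, then the average
first_age <- tapply(complete$age, complete$id, min)
mean_first_age <- mean(first_age)
print(mean_first_age)
```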
(b) Make a spaghetti plot of arm circumference against weight and describe the main features (trends,
within-subject correlation, amount of measurement error).
(c) Model the association between arm circumference and weight and sex, while accounting for within-child
correlation. Present your final model graphically and succinctly describe its interpretation.