Descriptive statistics - University of Warwick

DTC Quantitative Research Methods
Descriptive Statistics
Thursday 16th October 2014
Some relevant online course extracts
• Cramer (1998) Chapter 2:
- Measurement and univariate analysis.
• Diamond and Jefferies (2001) Chapter 5:
- Measures and displays of spread.
• Sarantakos (2007) Chapter 5:
- Graphical displays.
• Huizingh (2007) Chapter 12:
- SPSS material.
Some basic terminology
• Quantitative measures are typically referred to
as variables.
• Some variables are generated directly via the
data generation process, but other, derived
variables may be constructed from the
original set of variables later on.
• As the next slide indicates, variables are
frequently referred to in more specific ways.
Cause(s) and effect…?
• Often, one variable (and occasionally more than
one variable) is viewed as being the dependent
variable.
• Variables which are viewed as impacting upon
this variable, or outcome, are often referred to as
independent variables.
• However, for some forms of statistical analyses,
independent variables are referred to in more
specific ways (as can be seen within the menus of
SPSS for Windows)
Levels of measurement
(Types of quantitative data)
• A nominal variable relates to a set of categories such as
ethnic groups or political parties which is not ordered.
• An ordinal variable relates to a set of categories in
which the categories are ordered, such as social classes
or levels of educational qualification.
• An interval-level variable relates to a ‘scale’ measure,
such as age or income, that can be subjected to
mathematical operations such as averaging.
How many variables?
• The starting point for statistical analyses is typically an
examination of the distributions of values for the
variables of interest. Such examinations of variables
one at a time are a form of univariate analysis.
• Once a researcher moves on to looking at relationships
between pairs of variables she or he is engaging in
bivariate analyses.
• … and if they attempt to explain why two variables are
related with reference to another variable or variables
they have moved on to a form of multivariate analysis.
Looking at categorical variables
• For nominal/ordinal variables this largely
means looking at the frequencies of each
category, often pictorially using, say, bar charts or pie-charts.
• It is usually easier to get a sense of the relative
importance of the various categories if one
converts the frequencies into percentages!
Example of a frequency table
Place met marital or cohabiting partner     Frequency       %
At school, college or university                  872    12.4
At/through work                                  1405    19.9
In a pub/cafe/restaurant/bar/club                2096    29.7
At a social event organised by friend(s)         1055    14.9
Other                                            1631    23.1
TOTAL                                            7059   100.0
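As a minimal sketch of how the percentage column is obtained, the snippet below divides each frequency from the table above by the total and rounds to one decimal place. The variable names are illustrative only.

```python
# Frequencies from the table above (place met marital or cohabiting partner).
frequencies = {
    "At school, college or university": 872,
    "At/through work": 1405,
    "In a pub/cafe/restaurant/bar/club": 2096,
    "At a social event organised by friend(s)": 1055,
    "Other": 1631,
}

total = sum(frequencies.values())

# Percentage = 100 * category frequency / total, rounded to one decimal place.
percentages = {
    place: round(100 * count / total, 1)
    for place, count in frequencies.items()
}

for place, pct in percentages.items():
    print(f"{place}: {frequencies[place]} ({pct}%)")
print(f"TOTAL: {total}")
```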
Example of a pie-chart
[Pie chart of the frequency table above, with segments: At school, college or university; At/through work; In a pub/cafe/restaurant/bar/club; At a social event organised by friend(s); Other]
What are percentages?
• It may seem self-evident, but percentages are
a form of descriptive statistic
• Specifically, they are useful in describing the
distributions (of frequencies) for nominal or
ordinal (i.e. categorical) variables
• When we consider interval-level variables or
more than one variable, we need (somewhat)
more sophisticated descriptive statistics
Descriptive statistics...
• ... are data summaries which provide an
alternative to graphical representations of
distributions of values (or relationships)
• ... aim to describe key aspects of distributions
of values (or relationships)
• ... are of most relevance when we are thinking
about interval-level variables (scales)
Description or inference?
• Descriptive statistics summarise relevant features of a
set of values.
• Inferential statistics help researchers decide whether
features of quantitative data from a sample can be
safely concluded to be present in the population.
• Generalizing from a sample to a population is part of
the process of statistical inference
• One objective may be to produce an estimate of the
proportion of people in the population with a
particular characteristic, i.e. a process of estimation.
Types of (univariate) descriptive statistics
Measures of ...
• ... location (averages)
• ... spread
• ... skewness (asymmetry)
• ... kurtosis
• We typically want to know about the first two,
sometimes about the third, and rarely about the
fourth!
What is ‘kurtosis’ anyway?
• Increasing kurtosis is associated with the “movement of
probability mass from the shoulders of a distribution into its
center and tails.” (Balanda, K.P. and MacGillivray, H.L. 1988.
‘Kurtosis: A Critical Review’, The American Statistician 42:2:
111–119.)
• Below, kurtosis increases from left to right...
Visualising ‘scale’ variables
• For interval-level data the appropriate visual
summary of a distribution is a histogram,
examining which can allow the researcher to
assess whether it is reasonable to assume that
the quantity of interest has a particular
distributional shape (and whether it exhibits
skewness).
• Unlike bar charts, distances along the ‘horizontal’
dimension of a histogram have a well-defined,
consistent meaning: i.e. they represent
differences between values on the interval-level
scale in question.
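The point about the horizontal dimension can be illustrated with a rough sketch that groups values into equal-width bins, so that each bar covers the same distance on the scale. The ages, the bin width and the function name `histogram_counts` are hypothetical choices made for illustration.

```python
def histogram_counts(values, bin_width, start):
    """Count values in equal-width bins [start + k*w, start + (k+1)*w).

    Assumes every value is >= start.
    """
    n_bins = int((max(values) - start) // bin_width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - start) // bin_width)] += 1
    return counts


# Hypothetical ages, for illustration only.
ages = [18, 19, 22, 23, 23, 25, 31, 34, 38, 47]
counts = histogram_counts(ages, bin_width=10, start=10)

# Text version of the histogram: one row per ten-year bin.
for k, count in enumerate(counts):
    lower = 10 + 10 * k
    print(f"{lower}-{lower + 9}: {'#' * count}")
```

Because every bin has the same width, the bars can be compared directly; in a bar chart of categories, by contrast, the order and spacing of the bars carry no numerical meaning.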
Example of a histogram
Measures of location
• Mean
(the arithmetic average of the values,
i.e. the result of dividing the sum of
the values by the total number of cases)
• Median (the middle value, when the values
are ranked/ordered)
• Mode
(the most common value)
... and measures of spread
• Standard deviation (and Variance)
(This is linked with the mean, as it is based on
averaging [squared] deviations from it. The
variance is simply the standard deviation squared).
• Interquartile range / Quartile deviation
(These are linked with the median, as they
are also based on the values placed in order).
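A brief sketch of these measures using Python's standard `statistics` module follows. The household sizes are hypothetical, and the quartiles use the module's default method; other packages (including SPSS) may use slightly different quartile conventions, so results can differ a little.

```python
import statistics

# Hypothetical household sizes, for illustration only.
household_sizes = [1, 2, 2, 2, 3, 3, 4, 5, 7]

# Measures of location.
mean = statistics.mean(household_sizes)
median = statistics.median(household_sizes)
mode = statistics.mode(household_sizes)

# Measures of spread linked with the mean (sample versions, dividing by n - 1).
variance = statistics.variance(household_sizes)
std_dev = statistics.stdev(household_sizes)

# Measures of spread linked with the median.
q1, _, q3 = statistics.quantiles(household_sizes, n=4)
iqr = q3 - q1                 # interquartile range
quartile_deviation = iqr / 2  # half the interquartile range

print(f"mean={mean:.2f}, median={median}, mode={mode}")
print(f"variance={variance:.2f}, s.d.={std_dev:.2f}")
print(f"IQR={iqr}, quartile deviation={quartile_deviation}")
```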
Measures of location and spread:
an example (household size)
Mean = 2.94, Median = 2, Mode = 2
Mean = 2.96, Median = 3, Mode = 2
s.d. = 1.93, skewness = 2.10; kurtosis = 5.54
s.d. = 1.58, skewness = 1.27; kurtosis = 2.24
West Midlands
London
Why is the
standard deviation so important?
• The standard deviation (or, more precisely, the
variance) is important because it introduces
the idea of summarising variation in terms of
summed, squared deviations.
• And it is also central to some of the statistical
theory used in statistical testing/statistical
inference...
An example of the calculation of a standard deviation
• Number of seminars attended by a sample of undergraduates:
5, 4, 4, 7, 9, 8, 9, 4, 6, 5
• Mean = 61/10 = 6.1
• Variance = ((5 – 6.1)² + (4 – 6.1)² + (4 – 6.1)² + (7 – 6.1)² +
(9 – 6.1)² + (8 – 6.1)² + (9 – 6.1)² + (4 – 6.1)² + (6 – 6.1)² +
(5 – 6.1)²)/(10 – 1) = 36.9/9 = 4.1
• Standard deviation = Square root of variance = 2.025
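The calculation above can be reproduced step by step; this is a sketch with illustrative variable names that follows the same sample formula (dividing by n – 1).

```python
import math

# Number of seminars attended by the sample of undergraduates.
seminars = [5, 4, 4, 7, 9, 8, 9, 4, 6, 5]

n = len(seminars)
mean = sum(seminars) / n

# Squared deviation of each value from the mean.
squared_deviations = [(x - mean) ** 2 for x in seminars]

# Sample variance: summed squared deviations divided by n - 1.
variance = sum(squared_deviations) / (n - 1)
std_dev = math.sqrt(variance)

print(f"Mean = {mean:.1f}")
print(f"Sum of squared deviations = {sum(squared_deviations):.1f}")
print(f"Variance = {variance:.1f}")
print(f"Standard deviation = {std_dev:.3f}")
```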
The Empire Median Strikes Back!
• Comparing descriptive statistics between
groups can be done graphically in a rather nice
way using a form of display called a ‘boxplot’.
• Boxplots are based on medians and quartiles
rather than on the more commonly found
mean and standard deviation.
Example of a boxplot
Moving on to
bivariate ‘descriptive statistics’...
• These are referred to as ‘Measures of
association’, as they quantify the (strength of the)
association between two variables
• The most well-known of these is the (Pearson)
correlation coefficient, often referred to as ‘the
correlation coefficient’, or even ‘the correlation’
• This quantifies the closeness of the relationship
between two interval-level variables (scales)
Positive and negative relationships
Positive or direct relationships
• If the points cluster around a line
that runs from the lower left to upper
right of the graph area, then the
relationship between the two
variables is positive or direct.
• An increase in the value of x is more
likely to be associated with an
increase in the value of y.
• The closer the points are to the line,
the stronger the relationship.
Negative or inverse relationships
• If the points tend to cluster around
a line that runs from the upper left
to lower right of the graph, then the
relationship between the two
variables is negative or inverse.
• An increase in the value of x is
more likely to be associated with a
decrease in the value of y.
Working out the correlation coefficient
(Pearson’s r)
• Pearson’s r tells us how much one variable changes as the values
of another change – their covariation.
• Variation is measured with the standard deviation. This measures
average variation of each variable from the mean for that variable.
• Covariation is measured by calculating the amount by which each
value of X varies from the mean of X, and the amount by which
each value of Y varies from the mean of Y and multiplying the
differences together and finding the average (by dividing by n-1).
• Pearson’s r is calculated by dividing this by (SD of x) × (SD of y) in
order to standardize it:
$$ r = \frac{\sum (x - \bar{X})(y - \bar{Y})}{(n - 1)\, s_x s_y} $$
Working out the correlation coefficient
(Pearson’s r)
• Because r is standardized it will always fall
between +1 and -1.
• A correlation of either 1 or -1 means perfect
association between the two variables.
• A correlation of 0 means that there is no linear
association.
• Note: correlation does not mean causation. We
can only investigate causation by reference to
our theory. However (thinking about it the other
way round) there is unlikely to be causation if
there is not correlation.
A scatterplot of the values of
two interval-level variables
Example of calculating a
correlation coefficient
(corresponding to the last slide)
• X = 5, 4, 4, 7, 9, 8, 9, 4, 6, 5; Mean(X) = 6.1
• Y = 8, 7, 9, 7, 8, 8, 8, 5, 5, 6; Mean(Y) = 7.1
• (5 – 6.1)(8 – 7.1) + (4 – 6.1)(7 – 7.1) ... etc.
• -0.99 + 0.21 + ... = 7.9 (Covariation)
• S.D. (X) = 2.02 ; S.D. (Y) = 1.37
• (7.9 / 9) / (2.02 x 1.37) = 0.316
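The same calculation can be sketched in code, following the steps described earlier: sum the products of deviations, divide by n – 1, then divide by the product of the two standard deviations. The helper name `sample_sd` is illustrative.

```python
import math

x = [5, 4, 4, 7, 9, 8, 9, 4, 6, 5]
y = [8, 7, 9, 7, 8, 8, 8, 5, 5, 6]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n


def sample_sd(values, mean):
    """Sample standard deviation (divides by n - 1)."""
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))


# Covariation: summed products of paired deviations from the two means.
covariation = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

sd_x = sample_sd(x, mean_x)
sd_y = sample_sd(y, mean_y)

# Standardize the average covariation by the two standard deviations.
r = (covariation / (n - 1)) / (sd_x * sd_y)

print(f"Covariation = {covariation:.1f}")
print(f"S.D.(X) = {sd_x:.2f}, S.D.(Y) = {sd_y:.2f}")
print(f"r = {r:.3f}")
```

Carrying the unrounded standard deviations through gives r = 0.316; if the standard deviations are first rounded to two decimal places, the final digit can differ slightly.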
Looking at the relationship between
two categorical variables
If two variables are nominal or ordinal, i.e.
categorical, we can look at the relationship
between them in the form of a cross-tabulation,
using percentages to summarize the pattern.
(Typically, if there is one variable that can be
viewed as depending on the other, i.e. a
dependent variable, and the categories of this
variable make up the columns of the cross-tabulation, then it makes sense to have
percentages that sum to 100% across each row;
these are referred to as row percentages).
An example of a cross-tabulation
(from Jamieson et al., 2002#)
‘When you and your current partner first decided to set up home or
move in together, did you think of it as a permanent arrangement or
something that you would try and then see how it worked?’
                              Both          Both            Different
                              ‘permanent’   ‘try and see’   answers      TOTAL
Cohabiting without marriage   15 (48%)      4 (13%)         12 (39%)     31 (100%)
Cohabited and then married    16 (67%)      1 (4%)          7 (29%)      24 (100%)
Married without cohabiting    9 (100%)      0 (0%)          0 (0%)       9 (100%)
# Jamieson, L. et al. 2002. ‘Cohabitation and commitment: partnership plans of young
men and women’, Sociological Review 50.3: 356–377.
Alternative forms of percentage
• In the following example, row percentages
allow us to compare outcomes between the
categories of an independent variable.
• However, we can also use column percentages
to look at the composition of each category of
the dependent variable.
• In addition, we can use total percentages to
look at how the cases are distributed across
combinations of the two variables.
Example Cross-tabulation II:
Row percentages
Class origin * Class destination Crosstabulation
                                         Class destination
Class origin                             Service   Intermediate   Working   Total
Service        Count                     730       323            189       1242
               % within Class origin     58.8%     26.0%          15.2%     100.0%
Intermediate   Count                     857       1140           1108      3105
               % within Class origin     27.6%     36.7%          35.7%     100.0%
Working        Count                     786       1385           2916      5087
               % within Class origin     15.5%     27.2%          57.3%     100.0%
Total          Count                     2373      2848           4213      9434
               % within Class origin     25.2%     30.2%          44.7%     100.0%
Derived from: Goldthorpe, J.H. with Llewellyn, C. and Payne, C. (1987). Social Mobility
and Class Structure in Modern Britain (2nd Edition). Oxford: Clarendon Press.
Example Cross-tabulation II:
Column percentages
Class origin * Class destination Crosstabulation
                                              Class destination
Class origin                                  Service   Intermediate   Working   Total
Service        Count                          730       323            189       1242
               % within Class destination     30.8%     11.3%          4.5%      13.2%
Intermediate   Count                          857       1140           1108      3105
               % within Class destination     36.1%     40.0%          26.3%     32.9%
Working        Count                          786       1385           2916      5087
               % within Class destination     33.1%     48.6%          69.2%     53.9%
Total          Count                          2373      2848           4213      9434
               % within Class destination     100.0%    100.0%         100.0%    100.0%
Example Cross-tabulation II:
Total percentages
Class origin * Class destination Crosstabulation
                              Class destination
Class origin                  Service   Intermediate   Working   Total
Service        Count          730       323            189       1242
               % of Total     7.7%      3.4%           2.0%      13.2%
Intermediate   Count          857       1140           1108      3105
               % of Total     9.1%      12.1%          11.7%     32.9%
Working        Count          786       1385           2916      5087
               % of Total     8.3%      14.7%          30.9%     53.9%
Total          Count          2373      2848           4213      9434
               % of Total     25.2%     30.2%          44.7%     100.0%
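The three forms of percentage in the class origin by class destination tables above differ only in the denominator used: the row total, the column total, or the grand total. A short sketch, with illustrative variable names, that recomputes them from the counts:

```python
classes = ["Service", "Intermediate", "Working"]

# Rows are class origin, columns are class destination (counts from the tables).
counts = [
    [730, 323, 189],
    [857, 1140, 1108],
    [786, 1385, 2916],
]

row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
grand_total = sum(row_totals)


def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Row percentages: each cell as a share of its class-origin total.
row_pct = [[pct(c, row_totals[i]) for c in row] for i, row in enumerate(counts)]
# Column percentages: each cell as a share of its class-destination total.
col_pct = [[pct(c, col_totals[j]) for j, c in enumerate(row)] for row in counts]
# Total percentages: each cell as a share of all cases.
total_pct = [[pct(c, grand_total) for c in row] for row in counts]

for name, r, c, t in zip(classes, row_pct, col_pct, total_pct):
    print(f"{name} origin: row % {r}, column % {c}, total % {t}")
```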
Percentages and Association
• It is possibly self-evident that the differences
between the percentages in different rows (or
columns) can collectively be viewed as measuring
association
• In the case of a 2x2 cross-tabulation (i.e. one with
two rows and two columns), the difference
between the percentages is a measure of
association for that cross-tabulation
• But there are other ways of quantifying the
association in the cross-tabulation…
Odds ratios as a measure of association
• The patterns in the social mobility table examined in an
earlier session can clearly be expressed as differences in
percentages (e.g. the differences between the percentages
of sons with fathers in classes I and VII who are themselves
in classes I and VII).
• However, an alternative way of quantifying these class
differences is to compare the odds of class I fathers having
sons in class I as opposed to class VII with the odds of class
VII fathers having sons in class I as opposed to class VII.
• The ratio of these two sets of odds is an odds ratio, which
will have a value of close to 1.0 if the two sets of odds are
similar, i.e. if there is little or no difference between the
chances of being in classes I and VII for sons with fathers in
classes I and VII respectively.
Odds Ratios vs. % Differences
An Example: Gender and Higher Education
Age 30-39     Degree        No Degree
Men           56 (13.0%)    374
Women         70 (13.8%)    438
% difference = -0.8%; Odds ratio = (56/374)/(70/438) = 0.937

Age 40-49     Degree        No Degree
Men           56 (14.4%)    334
Women         38 (9.1%)     378
% difference = 5.3%; Odds ratio = (56/334)/(38/378) = 1.668

Age 50-59     Degree        No Degree
Men           34 (9.9%)     308
Women         18 (5.2%)     329
% difference = 4.7%; Odds ratio = (34/308)/(18/329) = 2.018
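A sketch of both measures, computed from the counts in the gender and higher education example. The data structure and function names are illustrative.

```python
# (degree, no degree) counts for men and women in each age group.
tables = {
    "30-39": {"men": (56, 374), "women": (70, 438)},
    "40-49": {"men": (56, 334), "women": (38, 378)},
    "50-59": {"men": (34, 308), "women": (18, 329)},
}


def pct_with_degree(degree, no_degree):
    """Percentage of the group holding a degree."""
    return 100 * degree / (degree + no_degree)


def odds_of_degree(degree, no_degree):
    """Odds of holding a degree as opposed to not holding one."""
    return degree / no_degree


results = {}
for age, groups in tables.items():
    men, women = groups["men"], groups["women"]
    pct_difference = pct_with_degree(*men) - pct_with_degree(*women)
    odds_ratio = odds_of_degree(*men) / odds_of_degree(*women)
    results[age] = (pct_difference, odds_ratio)
    print(f"Age {age}: % difference = {pct_difference:.1f}, "
          f"odds ratio = {odds_ratio:.3f}")
```

The odds ratios match the values in the example. The percentage differences computed from unrounded percentages can differ in the last digit from those in the example, which appear to be obtained by subtracting the already rounded percentages.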
Choice of measure can matter!
• The choice of differences between percentages versus
odds ratios as a way of quantifying differences between
groups can matter, as in the preceding example of the
‘effect’ of gender on the likelihood of having a degree,
according to age.
• The % difference values of 4.7%, 5.3% and -0.8% (from the oldest
to the youngest age group) suggest that inequality increased before
it disappeared, whereas the odds ratios of 2.018, 1.668 and 0.937
suggest a small decrease in inequality before a larger decrease led to
approximate equality!
• Evidently, there are competing ways of measuring
association in a cross-tabulation. But neither differences
between percentages nor odds ratios provide an overall
summary of the association in a cross-tabulation…
Another measure of association
• If we need an overall measure of association for
two cross-tabulated (categorical) variables, one
standard possibility is Cramér’s V
• Like the Pearson correlation coefficient, it has a
maximum of 1, and 0 indicates no relationship,
but it can only take on positive values, and makes
no assumption of linearity.
• It is derived from a test statistic (inferential
statistic), chi-square, which we will consider in a
later session…
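For orientation, the standard definition is V = √(χ² / (n × min(rows – 1, columns – 1))), where χ² is the chi-square statistic for the table and n is the number of cases. The sketch below applies that definition to two small hypothetical tables; the function names and the example counts are assumptions made for illustration.

```python
import math


def chi_square(table):
    """Pearson chi-square statistic and number of cases for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n


def cramers_v(table):
    """Cramér's V: sqrt(chi2 / (n * min(rows - 1, columns - 1)))."""
    chi2, n = chi_square(table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))


# Hypothetical tables, for illustration only.
perfect_association = [[10, 0], [0, 10]]
no_association = [[10, 20], [20, 40]]

v_perfect = cramers_v(perfect_association)
v_none = cramers_v(no_association)
print(f"V (perfect association) = {v_perfect:.3f}")
print(f"V (no association) = {v_none:.3f}")
```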
An example of Cramér’s V
Cramér’s V = 0.074
Other measures of association for
cross-tabulations…
• In a literature review more than thirty years ago,
Goodman and Kruskal identified several dozen of
these:
Goodman, L.A. and Kruskal, W.H. 1979. Measures of association for cross
classifications. New York, Springer-Verlag.
• … and I added one of my own, Tog, which measures
inequality (in a particular way) where both variables
are ordinal…
One of Tog’s
(distant) relatives
What if one variable is a set of
categories, and the other is a scale?
• The equivalent to comparing percentages in
this instance is comparing means… but there
may be quite a lot of these!
• So one possible overall measure of association
used in this situation is eta² (η²) (eta-squared)
• But this is a less familiar measure (at least to
researchers in some social science disciplines!)
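As a rough sketch, eta-squared can be computed as the between-group sum of squares divided by the total sum of squares, i.e. the proportion of the variation in the scale that is accounted for by differences between the category means. The groups and values below are hypothetical.

```python
def eta_squared(groups):
    """Between-group sum of squares divided by total sum of squares."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)

    # Total variation of all values around the grand mean.
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)

    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(
        len(values) * (sum(values) / len(values) - grand_mean) ** 2
        for values in groups.values()
    )
    return ss_between / ss_total


# Hypothetical scale values for two categories, for illustration only.
scores = {
    "Group A": [1, 2, 3],
    "Group B": [4, 5, 6],
}

eta2 = eta_squared(scores)
print(f"eta-squared = {eta2:.3f}")
```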