Assoc Prof Dr Zamalia Mahmud
Center of Studies for Statistics and Decision Science
Faculty of Computer and Mathematical Sciences
November 2011

Why is Statistics useful in research?
• It helps us to make sense of the information
• It helps us to understand how decisions are made
• It helps us to determine the cause and effect of a phenomenon
• It helps us to arrive at a conclusion
• It lends us the tools and techniques for collecting, analyzing and interpreting data

How to prepare yourself to be competent in statistical data analysis?
• Take up several statistics courses
• Learn statistical packages
• Read statistics books
• Learn the relevant statistical techniques
• Get to know your data very well
• Be prepared to analyze data with minimum help from a statistics consultant
• Be prepared to deal with your fear and anxiety towards statistics

What do you need to know prior to doing data analysis?
• Know how to collect the right data using the appropriate instrument
• Know the nature of data to be collected
• Know the type of data to be collected
• Know the levels of measurement of the data to be collected
• Know the types of variables associated with the data to be collected

What do you need to know prior to doing data analysis? (continued)
• Know how to relate the data with your research questions
• Know how to relate research questions with the types of analyses to be done
• Know how to relate research hypotheses with the types of analyses to be done
• Know how to recognize the variables to be measured from the research questions

Always begin your research inquiries with measurable Research Questions
• Is there a significant relationship between smoking habit and lung infection?
• Does job-related stress affect lecturers’ performance at the University?
• Is there a significant difference in the job satisfaction level between Maxis and Celcom employees?
• Does an increase in motivation cause the job satisfaction level of employees to increase?
• Does motivation moderate the relationship between job loyalty and job satisfaction?
• Is there a significant difference in the knowledge on occupational safety and health between male and female employees?
• What are the factors that motivate individuals to work at the private hospitals?

Examples of Research Objectives
• To determine if there is a relationship between smoking habit and lung infection.
• To determine if job-related stress affects workers’ performance at the private hospitals.
• To determine if there is a difference in the job satisfaction level between Maxis and Celcom employees.
• To identify factors that motivate individuals to work at the private hospitals.

Examples of Research Hypotheses
• A hypothesis is a statement that proposes an explanation, which can be tested through data obtained from further observation or experimentation.
• Two types of hypothesis: Null and Alternative
Example:
• There is a relationship between smoking habit and lung infection.
• There is a difference in the mean job satisfaction score between Maxis and Celcom employees.

Why are Research Questions, Research Objectives and Research Hypotheses important in your analysis?
• They help you to stay focused on what is to be measured
• They help you to focus on the pertinent variables to be measured
• They help you to do the correct and appropriate analysis for your research

Data Sources
• Primary (data collection): Observation, Survey, Experimentation
• Secondary (data compilation): Print or Electronic

Types of Data
• Categorical (defined categories). Examples: Marital Status, Political Party, Eye Color
• Numerical
  – Discrete (counted items). Examples: Number of workers, Number of defective items
  – Continuous (measured characteristics). Examples: Weight, Voltage

Types of Samples Used
• Non-Probability Samples (do not require a sampling frame): Judgement, Snowball, Quota, Convenience
• Probability Samples (require a sampling frame): Simple Random, Stratified, Systematic, Cluster

Levels of Measurement
• RATIO
• INTERVAL
• ORDINAL
• NOMINAL

NOMINAL
A scale in which the numbers or letters assigned to objects serve as labels for identification. Categories only. Data cannot be arranged in an ordering scheme.
• Gender: Male (1), Female (2)
• Marital Status: Married (1), Single (2), Widowed (3)

ORDINAL
A scale that arranges objects or alternatives according to their magnitude or rank order. We know “excellent” is better than “good”, but we do not know by how much.
• Monthly Salary: < RM1000 (1), RM1001 – RM2000 (2), RM2001 – RM3000 (3)
• Overall ratings of the hospital services: Excellent (4), Good (3), Fair (2), Poor (1)

INTERVAL
It has the rank-order feature of ordinal scales, but it also includes the additional characteristic of equal distances, or equal intervals, between numbers on the scale. It has no true zero point (i.e., zero is not the starting point).
• Temperature: -2 °C, 0 °C (freezing point), 2 °C
• Likert Scale: 1 – Strongly disagree; 2 – Disagree; 3 – Neutral; 4 – Agree; 5 – Strongly agree

RATIO
It has all the above properties, plus a true zero point.
• Weight: 20.8 kg, 45.5 kg, 87.5 kg
• Height: 150 cm, 126.8 cm, 106.5 cm

Descriptive and Inferential Statistics
Statistical Analysis
• Descriptive: Used to describe the basic features of the data obtained in a study: tables, charts, graphs
• Inferential: Combines the methods of descriptive statistics with the theory of probability for the purpose of learning what samples of data tell about the characteristics of the populations from which they were drawn
Prof Madya Dr Rasimah Aripin, 6/5/2004

Strategy for Data Analysis
Step                                      QUALITATIVE                                   QUANTITATIVE
FREQUENCY DISTRIBUTION FOR EVERY VARIABLE
DESCRIPTIVE STATISTICS/CHARTS             Percentage, mode, median, charts and tables   Mean, median, variance, graphs, and many more
                                          (no measure for variation)
MEASURES OF ASSOCIATION                   Cross-tabulation, nonparametric measures      Correlation analysis
                                          of association
HYPOTHESIS TESTING/MODELLING              Non-parametric methods                        Parametric methods
ESTIMATION, PREDICTION, FORECASTING

Type of Descriptive Statistics by Measurement Scale
Type of measurement                 Type of descriptive statistics
Nominal, two categories             Frequency table, proportion (percentage), mode
Nominal, more than two categories   Frequency table, category proportion (%), mode
Ordinal                             Rank order, median
Interval and Ratio                  Arithmetic mean, index numbers, variance, standard deviation, range, percentiles

Selection of Univariate/Bivariate Techniques
Classification of univariate/bivariate techniques for testing of the mean and median
• Interval/Ratio Data
  – One Sample: t test, z test
  – Two or More Samples, Independent: t test, Z test, One-way ANOVA
  – Two or More Samples, Related: Paired t test
• Nominal/Ordinal Data
  – One Sample: Frequency, Chi-Square, K-S, Runs
  – Two or More Samples, Independent: Chi-Square, Mann-Whitney, Median, K-S, K-W ANOVA
  – Two or More Samples, Related: Sign, Wilcoxon, McNemar
5/25/98

Measures of Association by Measurement Scale
Scale            Coefficients                              Research Question
Interval/Ratio   Pearson’s r, Simple Regression            Is moisture content related to temperature?
Ordinal          Spearman Rank, Kendall’s Rank             Is preference related to convenience of locations?
Nominal          Phi-Coefficient, Contingency Coefficient  Is gender associated with brand preference?

Selection of Multivariate Techniques
• Dependent Techniques
  – One Dependent Variable: Cross-tabulation, Analysis of variance and covariance, Multiple regression, Two-group discriminant analysis, Conjoint analysis
  – More than One Dependent Variable: Multivariate analysis of variance and covariance, Canonical correlation, Multiple discriminant analysis
• Interdependent Techniques
  – Variable Interdependence: Factor analysis
  – Interobject Similarity: Cluster analysis, Multidimensional scaling

BASIC DATA ANALYSIS
• DESCRIPTIVE STATISTICS
• FREQUENCY & PERCENTAGE TABLES
• CROSSTABULATION
• DATA TRANSFORMATION
• DATA COMPUTATION
• GRAPHICAL REPRESENTATION

Gender
                 Frequency   Percent   Valid Percent   Cumulative Percent
Valid   Female     216        45.6        45.6              45.6
        Male       258        54.4        54.4             100.0
        Total      474       100.0       100.0

Descriptives: Beginning Salary by Gender
Statistic                          Female         Std. Error   Male           Std. Error
Mean                               $13,091.97     $199.74      $20,301.40     $567.27
95% CI for Mean, Lower Bound       $12,698.26                  $19,184.30
95% CI for Mean, Upper Bound       $13,485.67                  $21,418.49
Median                             $12,375.00                  $15,750.00
Variance                           8617742.738                 83024550.57
Std. Deviation                     $2,935.60                   $9,111.78
Minimum                            $9,000                      $9,000
Maximum                            $30,000                     $79,980
Range                              $21,000                     $70,980
Interquartile Range                $3,118.75                   $7,687.50
Skewness                           1.767          .166         2.390          .152
Kurtosis                           5.352          .330         8.488          .302

Employment Category
             Count      %
Clerical      363     76.6%
Custodial      27      5.7%
Manager        84     17.7%
Total         474    100.0%

CONTINGENCY TABLE
The results of a crosstabulation between two categorical variables (smoking habit and hospitalization)

Graphical Methods
[Figure: pie chart and bar chart of Employment Category (Clerical 76.6%, Custodial 5.7%, Manager 17.7%)]
[Figure: comparative histogram]
[Figure: box-and-whisker plot and normal Q-Q plot]

HYPOTHESIS TESTING
• WHAT IS A HYPOTHESIS?
An unproven proposition or supposition that tentatively explains certain facts or phenomena.
• NULL HYPOTHESIS
A conservative statement which communicates the notion that any change from what has been thought to be true or observed in the past will be due entirely to error.
• ALTERNATIVE HYPOTHESIS
A statement indicating the opposite of the null hypothesis.
• SIGNIFICANCE LEVEL
The critical probability in choosing between the null and alternative hypotheses; the probability level (say, α = 0.05 or 0.01) that is too low to warrant support of the null hypothesis.
• CRITICAL VALUE
The value that lies exactly on the boundary of the region of rejection.
• p-VALUE
The probability, assuming the null hypothesis is true, of obtaining a test statistic at least as extreme as the one observed. The null hypothesis is rejected when the p-value is smaller than α.

p-Value Solution
Calculate the p-value and compare it to α (for a two-sided test the p-value is always two-sided).
[Figure: standard normal curve with rejection regions of area α/2 = .025 below -1.96 and above 1.96; the observed Z = ±2.47 cuts off a tail area of .0068 on each side]
\[ P(Z < -2.47) + P(Z > 2.47) = 2(.0068) = 0.0136 \]
Reject H0 since p-value = .0136 < α = .05

TEST OF DIFFERENCES
Investigation of hypotheses that state two (or more) groups differ with respect to measures on a variable.
e.g. To determine if male and female employees differ in their attitude towards their job.

COMMON BIVARIATE TESTS OF DIFFERENCE
Type of Measurement   Differences among two independent groups   Differences among three or more independent groups
Interval and Ratio    t-test or Z-test                            One-way ANOVA
Ordinal               Mann-Whitney U-test, Wilcoxon test          Kruskal-Wallis test (non-parametric)
                      (non-parametric)
Nominal               Z-test (two proportions), Chi-square test   Chi-square test (non-parametric)
                      (non-parametric)

TEST OF ASSOCIATION
• CHI-SQUARE (χ²) TEST OF INDEPENDENCE
A test conducted to investigate if there is an association/relationship between two nominal variables, two ordinal variables, or a nominal and an ordinal variable.

Example
H0: There is no association between gender and preference for colours
H1: There is an association between gender and preference for colours

Another example
H0: There is no association between gender and employment category
H1: There is an association between gender and employment category

Case Processing Summary
                               Valid             Missing           Total
                               N      Percent    N      Percent    N      Percent
Gender * Employment Category   474    100.0%     0      .0%        474    100.0%

Gender * Employment Category Crosstabulation
                                          Clerical   Custodial   Manager   Total
Female   Count                              206         0          10       216
         Expected Count                     165.4      12.3        38.3     216.0
         % within Gender                    95.4%       .0%         4.6%    100.0%
         % within Employment Category       56.7%       .0%        11.9%     45.6%
Male     Count                              157        27          74       258
         Expected Count                     197.6      14.7        45.7     258.0
         % within Gender                    60.9%      10.5%       28.7%    100.0%
         % within Employment Category       43.3%     100.0%       88.1%     54.4%
Total    Count                              363        27          84       474
         Expected Count                     363.0      27.0        84.0     474.0
         % within Gender                    76.6%       5.7%       17.7%    100.0%
         % within Employment Category      100.0%     100.0%      100.0%    100.0%

Chi-Square Tests
                      Value      df   Asymp. Sig. (2-sided)
Pearson Chi-Square    79.277(a)   2   .000
Likelihood Ratio      95.463      2   .000
N of Valid Cases      474
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 12.30.
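The Pearson chi-square value in the output above can be reproduced from the observed counts alone. The following is a minimal sketch in plain Python, not the SPSS procedure itself; the function name `chi_square_independence` is an illustrative assumption. The p-value line uses the fact that, for exactly 2 degrees of freedom, the chi-square upper-tail probability equals exp(-x/2), so that shortcut applies only to tables with df = 2, such as this 2 × 3 table.

```python
import math

def chi_square_independence(observed):
    """Return (chi2, df, expected) for a table of observed counts.

    Hypothetical helper for illustration; expected counts are
    row total * column total / grand total.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected

# Observed counts from the Gender * Employment Category crosstabulation
# (rows: Female, Male; columns: Clerical, Custodial, Manager).
observed = [[206, 0, 10], [157, 27, 74]]

chi2, df, expected = chi_square_independence(observed)

# Upper-tail probability; this closed form is valid only because df == 2.
p_value = math.exp(-chi2 / 2)

print(f"chi-square = {chi2:.3f}, df = {df}, p-value = {p_value:.3g}")
```

The p-value is far below α = 0.05, which agrees with the ".000" reported in the output, so H0 (no association between gender and employment category) is rejected.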
CORRELATION COEFFICIENT (r)
It is a statistical measure of the covariation of, or association between, two variables. It indicates the strength of the relationship between two variables. The correlation coefficient (r) ranges from -1.0 to +1.0.

• If r = +1.0 ⇒ perfect positive linear relationship
• If r = -1.0 ⇒ perfect negative linear relationship
• If r = 0 ⇒ no correlation
• If r = -0.92 ⇒ a relatively strong inverse relationship, i.e., the greater the value measured by variable X, the less the value measured by variable Y.
• If r = +0.92 ⇒ a relatively strong positive relationship, i.e., the greater the value measured by variable X, the more the value measured by variable Y.

Testing the Significance of the Correlation Coefficient
H0: ρ = 0 (No correlation exists between the two variables)
H1: ρ ≠ 0 (Correlation exists between the two variables)

REGRESSION ANALYSIS
A technique used for measuring the linear association between a dependent and an independent variable. Regression analysis attempts to predict the values of a continuous, interval-scaled dependent variable from the specific values of the independent variable.
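As an illustration of the correlation coefficient and its significance test, the sketch below computes Pearson's r in plain Python for the ten house price and size pairs used in the deck's regression example. The helper name `pearson_r` is an assumption for illustration. The test statistic t = r√(n − 2) / √(1 − r²) with n − 2 degrees of freedom is the standard formula for testing H0: ρ = 0; it is not stated on the slides. The critical value 2.3060 is the two-sided 5% value for 8 degrees of freedom.

```python
import math

# Sample data from the deck's house price example.
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]            # $1000s (Y)
size = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]   # sq. ft. (X)

def pearson_r(x, y):
    """Pearson correlation coefficient (hypothetical helper)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(size, price)
n = len(size)

# Standard t statistic for H0: rho = 0, with n - 2 degrees of freedom.
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

T_CRITICAL = 2.3060  # two-sided, alpha = 0.05, 8 d.f.
reject_h0 = abs(t_stat) > T_CRITICAL

print(f"r = {r:.3f}, t = {t_stat:.3f}, reject H0: {reject_h0}")
```

Because |t| exceeds the critical value, H0: ρ = 0 is rejected for these data at the 5% level.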
Simple Linear Regression Model
The population regression model:
\[ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \]
where Y_i is the dependent variable, β0 is the population Y intercept, β1 is the population slope coefficient, X_i is the independent variable and ε_i is the random error term. β0 + β1 X_i is the linear component and ε_i is the random error component.

Simple Linear Regression Model (continued)
[Figure: plot of Y against X showing the line Y_i = β0 + β1 X_i + ε_i with intercept β0 and slope β1; for a given X_i, the random error ε_i is the vertical distance between the observed value of Y and the predicted value of Y]

Simple Linear Regression Equation
The simple linear regression equation provides an estimate of the population regression line:
\[ \hat{Y}_i = b_0 + b_1 X_i \]
where Ŷ_i is the estimated (or predicted) Y value for observation i, b0 is the estimate of the regression intercept, b1 is the estimate of the regression slope and X_i is the value of X for observation i.
The individual random error terms e_i have a mean of zero.

Least Squares Method
• b0 and b1 are obtained by finding the values of b0 and b1 that minimize the sum of the squared differences between Y and Ŷ:
\[ \min \sum (Y_i - \hat{Y}_i)^2 = \min \sum \bigl(Y_i - (b_0 + b_1 X_i)\bigr)^2 \]

Interpretation of the Slope and the Intercept
• b0 is the estimated average value of Y when the value of X is zero
• b1 is the estimated change in the average value of Y as a result of a one-unit change in X

Simple Linear Regression Example
• A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)
• A random sample of 10 houses is selected
  – Dependent variable (Y) = house price in $1000s
  – Independent variable (X) = size in square feet

Sample Data for House Price Model
House Price in $1000s (Y)   Size in sq. ft. (X)
        245                       1400
        312                       1600
        279                       1700
        308                       1875
        199                       1100
        219                       1550
        405                       2350
        324                       2450
        319                       1425
        255                       1700

Graphical Presentation
• House price model: scatter plot and regression line
[Figure: scatter plot of House Price ($1000s) against Square Feet with the fitted regression line; intercept = 98.248, slope = 0.10977]
\[ \widehat{\text{house price}} = 98.24833 + 0.10977\,(\text{size}) \]

Interpretation of the Intercept, b0
\[ \widehat{\text{house price}} = 98.24833 + 0.10977\,(\text{size}) \]
• b0 is the estimated average value of Y when the value of X is zero (if X = 0 is in the range of observed X values)
  – Here, no houses had 0 square feet, so b0 = 98.24833 just indicates that, for houses within the range of sizes observed, $98,248.33 is the portion of the house price not explained by square feet

Interpretation of the Slope, b1
• b1 measures the estimated change in the average value of Y as a result of a one-unit change in X
  – Here, b1 = .10977 tells us that the value of a house increases by .10977($1000) = $109.77, on average, for each additional square foot of size

Predictions using Regression Analysis
Predict the price for a house with 2000 square feet:
\[ \widehat{\text{house price}} = 98.25 + 0.1098\,(\text{sq. ft.}) = 98.25 + 0.1098(2000) = 317.85 \]
The predicted price for a house with 2000 square feet is 317.85 ($1,000s) = $317,850

SPSS Output – House Price Model
[Figure: SPSS regression output for the house price model]

Coefficient of Determination, r²
• The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable
• The coefficient of determination is also called r-squared and is denoted as r²
\[ 0 \le r^2 \le 1 \]

SPSS Output – Coefficient of Determination (r² values)
58.1% of the variation in house prices is explained by variation in square feet

Inference about the Slope: t Test
• t test for a population slope
  – Is there a linear relationship between X and Y?
• Null and alternative hypotheses
  – H0: β1 = 0 (no linear relationship)
  – H1: β1 ≠ 0 (linear relationship does exist)
• Test statistic
\[ t = \frac{b_1 - \beta_1}{S_{b_1}}, \qquad \text{d.f.} = n - 2 \]
where:
b1 = regression slope coefficient
β1 = hypothesized slope
Sb1 = standard error of the slope

Inference about the Slope: t Test
House Price in $1000s (y)   Square Feet (x)
        245                     1400
        312                     1600
        279                     1700
        308                     1875
        199                     1100
        219                     1550
        405                     2350
        324                     2450
        319                     1425
        255                     1700
Estimated regression equation:
\[ \widehat{\text{house price}} = 98.25 + 0.1098\,(\text{sq. ft.}) \]
The slope of this model is 0.1098. Does the size of the house affect its sales price?

Inferences about the Slope: t Test
H0: β1 = 0
H1: β1 ≠ 0
From SPSS output (b1, Sb1 and the P-value are read from the coefficients table):
\[ t = \frac{b_1 - \beta_1}{S_{b_1}} = \frac{0.10977 - 0}{0.03297} = 3.32938 \]

Inferences about the Slope: t Test (continued)
Test statistic: t = 3.329
H0: β1 = 0
H1: β1 ≠ 0
From SPSS output:
               Coefficients   Standard Error   t Stat    P-value
Intercept        98.24833       58.03348       1.69296   0.12892
Square Feet       0.10977        0.03297       3.32938   0.01039
d.f. = 10 - 2 = 8, α/2 = .025 in each tail, critical values ±tα/2 = ±2.3060
[Figure: t distribution with rejection regions below -2.3060 and above 2.3060; the test statistic 3.329 falls in the upper rejection region]
Decision: Reject H0
Conclusion: There is sufficient evidence that size of house affects house price

Inferences about the Slope: t Test (continued)
H0: β1 = 0
H1: β1 ≠ 0
From Excel output: P-value = 0.01039
This is a two-tail test, so the p-value is P(t > 3.329) + P(t < -3.329) = 0.01039 (for 8 d.f.)
Decision: P-value < α, so reject H0
Conclusion: There is sufficient evidence that square footage or house size affects house price
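The least squares estimates and the slope t statistic above can be checked by hand. The following is a minimal plain-Python sketch under the usual simple linear regression formulas, not a reproduction of the SPSS or Excel procedure; the function name `fit_simple_regression` is an illustrative assumption, and the critical value 2.3060 is taken from the slides rather than computed.

```python
import math

# Sample data from the house price example.
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]            # $1000s (Y)
size = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]   # sq. ft. (X)

def fit_simple_regression(x, y):
    """Least squares fit of y = b0 + b1 * x (hypothetical helper).

    Returns (b0, b1, s_b1), where s_b1 is the standard error of the slope.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # Sum of squared errors around the fitted line, with n - 2 d.f.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_b1 = math.sqrt(sse / (n - 2) / sxx)
    return b0, b1, s_b1

b0, b1, s_b1 = fit_simple_regression(size, price)

# t statistic for H0: beta1 = 0 against H1: beta1 != 0.
t_stat = (b1 - 0) / s_b1

T_CRITICAL = 2.3060  # two-sided, alpha = 0.05, 8 d.f. (value from the slides)
reject_h0 = abs(t_stat) > T_CRITICAL

predicted_2000 = b0 + b1 * 2000  # predicted price in $1000s for 2000 sq. ft.

print(f"b0 = {b0:.5f}, b1 = {b1:.5f}, Sb1 = {s_b1:.5f}")
print(f"t = {t_stat:.5f}, reject H0: {reject_h0}")
print(f"predicted price for 2000 sq. ft. = {predicted_2000:.2f}")
```

With unrounded coefficients the prediction for 2000 square feet comes out at about 317.78 ($1000s); the 317.85 shown on the prediction slide results from using the rounded coefficients 98.25 and 0.1098.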
F-Test for Significance
• F test statistic:
\[ F = \frac{MSR}{MSE} \]
where
\[ MSR = \frac{SSR}{k}, \qquad MSE = \frac{SSE}{n - k - 1} \]
and F follows an F distribution with k numerator and (n - k - 1) denominator degrees of freedom (k = the number of independent variables in the regression model)

End of Presentation
Contact:
Center of Studies for Statistics and Decision Science
Faculty of Computer and Mathematical Sciences
zamal669@salam.uitm.edu.my; zamalia@tmsk.uitm.edu.my
Tel: 03-55435367; Fax: 03-55435501
Hp: 012-2197985