Correlation and Regression Tutorial
This tutorial comprises a series of exercises illustrating various aspects of correlation and regression
The data for the exercises is available for download on an Excel file. You are advised to save the Excel file on a floppy disk and cut and paste the data into Minitab as needed. If you use Excel for the tutorial, you can simply create new worksheets or workbooks and copy & paste the data as needed. Alternatively you should have some storage space on the I: drive, which reduces the need to carry floppy disks.
When you have worked through the various exercises, you can download an Excel file with results on them for comparison. There is also a set of solutions for the various tutorial exercises based on Minitab
Linear Regression
Data set 1 is a set of experimental data in which the relationship between the rotational speed of a centrifuge and the fat content of cream was investigated. The centrifuge speed is the predictor and the fat content the response.
Question. Can centrifuge speed be used to determine fat content of cream?
The data may be analysed either using Excel or Minitab.
Using Minitab Scatter Plots A scatter plot of these two variables can be obtained by Minitab. Select: Graph > Plot In the dialog box , enter the Yvalue as Fat and the Xvalue and Speed click OK and examine the graph that is produced. Does there appear to be a linear trend (or any relationship) between the fat content of the cream and the centrifuge speed? or does the pattern resemble a random scattering of points? Correlation The correlation coefficient is a measure of how good a relationship exists between the two variables. This is a first step in carrying out a regression analysis. To compute the correlation coefficient, r between x and y perform this procedure: Stat > Basic Statistics > Correlation Enter the two variables as Fat and Speed and ensure that "display pvalue" is selected click OK Perform the following statistical test H_{o}: There is no correlation between fat content and speed. H_{1}: Fat content and speed are linearly related Decision rule: Reject Null hypothesis if pvalue of correlation coefficient < 0.05 Examine the Minitab output for correlation coefficient. If pvalue>0.05, fail to reject null hypothesis and stop. Otherwise, reject null hypothesis and proceed to perform regression analysis. .Regression First take a few minutes to write null and alternate hypothesis and decision rule(s) for linear regression Perform this procedure: Select Stat > Regression > Regression... and a dialog box will appear. Response = Fat, Predictor = Speed. Select Graph... Under "residuals for plots", check "regular" Click in the box under "residuals vs. fits" Click OK then select Storage... Check the residuals box then click OK and OK again Look at the output of Minitab Question: Is this model useful for predicting fat content from a given centrifuge speed?
If all these conditions are met, then the equation calculated is a reasonable predictor for the fat content of cream that will be produced for a given centrifuge speed. As a final exercise, examine the fitted line on the data Select: Stat>Regression>Fitted Line Plot... Response (Y): Fat Content Predictors (X): Centrifuge Speed Ensure that linear is selected under type of model Click on Options and select to display prediction and confidence bands. Ensure the value in the confidence interval box is 95. Click OK. (twice) Examine the graph which will give a scatter plot with the straight line fitted and the limiting lines which fit the data. The significance of the lines is as follows
Testing for a line through the origin If the line passes through the origin the intercept will be zero. The Minitab output allows us to test for this. H_{0}: intercept = 0 (The line passes through the origin) H_{1}: intercept ≠ 0 (the line does not pass through the origin) Decision Rule: Reject H. if P < α (In this case, 0.05) on the Minitab output, we find under the regression equation Constant; coeff = 75.39, StDev = 16.47, T = 4.58, P = 0.002 From this we can conclude that the intercept is nonzero and the line does not pass through the origin. i.e. the intercept is in fact 75.4 Estimating the error in slope and intercept The same table that we used to test whether the line passed through the origin also enables us to estimate the 95% confidence intervals for the slope and intercept. The 95% confidence interval for the intercept is calculated from 95% CI = intercept ± StdDev x t_{0.05} t_{0.05} must be looked up in statistical tables. From tables, for α = 0.05 (2tailed), t = 2.262 Hence 95% CI = 75.4 ± 2.62 x 16.47 = 75.4 ±43.15 Now estimate the 95% CI for the slope using the same method. 

Using Excel Scatter Plots A scatter plot can be obtained in Excel by highlighting the cells containing the data and selecting the chart wizard. Select xy scatter then follow the instructions to create the graph. Note: Ensure your xvalues are in the left hand
column Does there appear to be a linear trend (or any
relationship) between the metabolic rate and body weight, or does
the pattern resemble a random scattering of points? Correlation The correlation coefficient is a measure of how good a relationship exists between the two variables. This is a first step in carrying out a regression analysis. To find the correlation coefficient, select a cell where you want the result to be stored, then use the function wizard (fx button) to insert the correlation coefficient, correl(). Perform the following statistical test H_{o}: There is no correlation between fat content and speed. H_{1}: Fat content and speed are linearly related Decision rule: Reject Null hypothesis if pvalue of correlation coefficient < 0.05 Use statistical tables to check whether the correlation coefficient (r) is significant at P = 0.05. If r<rvalue at p=0.05, fail to reject the null hypothesis and stop. Otherwise, reject null hypothesis and proceed to perform regression analysis. Regression First take a few minutes to write null and alternate hypothesis and decision rule(s) for linear regression A regression line may be inserted on an Excel chart. Using your mouse, point at one of the data points and click the right mouse button. Select "Trendline" from the menu that appears. In the dialog box select linear, then select the "Options" Tab. In Options, check the display equation on chart and display R^{2} value boxes. Select OK A regression line will be inserted and the equation and the R^{2} value will appear on the chart. Examine these. Does the straight line appear to be a good fit? Regression analysis Regression analysis in Excel requires the Analysis add in to be installed. To check this select Tools>Add InTHEN s … Ensure that the Analysis tool pack box is checked and select OK. To carry out a regression analysis, select Tools>Data Analysis … then select "Regression" from the resulting dialog box In the Regression dialog box enter the following The cells containing the x and yvalues. Under "Output options", select the "Output Range" button and enter the address of the top left cell where the output is to appear. It is suggested that this cell is just below your data and chart. Under Residuals, check the "Residual Plots" box. There is no need to select "Line Fit Plots" as your chart has already done this Under "Normal Probability" check "Normal Probability Plots" Select OK Look at the output from Excel Question: Is this model useful for predicting fat content from a given centrifuge speed?
If all these conditions are met, then the equation calculated is a reasonable predictor for the fat content of cream that will be produced for a given centrifuge speed.
Testing for line through the origin Estimating error for slope and intercept Look in the Minitab column for how to calculate these quantities. The method is the same and the same basic information is available in your Spreadsheet. 
Further regression exercises
These exercises apply the techniques you have just used to nonlinear regression
Data set 2 is a comparison of the body weight and the metabolic rate of a child.
Let x be the body weight (in kilograms), and let y be the metabolic rate of the child which is the energy expended by the body per unit time (100 Kcal/24 h).
Carry out a linear regression using Minitab. You will find that a linear equation is a satisfactory fit, but the residual plot suggests there is are systematic errors present. Careful inspection of a scatter plot of the data suggests that the true relationship is probably a curve.
Try fitting quadratic and cubic equations to the data. To do this, you need to create two new columns using the calculator. One column will contain the square of the body weight and the other the cube of the body weight. Label these columns Wtsquare and Wtcube respectively.
Carry out two further regressions using Minitab. For the first one use weight and Wtsquare as the predictors and for the second one, use weight, Wtsquare and Wtcube as the predictors. Examine the Anova, R^{2} and residual plots. Which looks like the best fit.
Also produce fitted line plots. Check Quadratic and Cubic in turn as the type of regression model. Which model produces the best fit?
Which model produces the best fit for this data?
Data set 3 is a relationship between shear stress (yaxis) and shear rate (xaxis) for a polymeric material. These properties characterise the flow behaviour of viscous materials. The relationship between shear stress and shear rate is normally expressed as a power law
Where 
t is the shear stress

This relationship can be linearised by taking logs to produce a loglog relationship. Create two new columns, one containing log shear stress and the other containing log shear rate (In Minitab, use the calculator for this). Determine the correlation coefficient for log shear stress and log shear rate and carry out a linear regression on the loglog data. Is a power law relationship reasonable in this case? Comment on your findings.
Non Parametric linear regression
We will determine Spearman’s rank correlation coefficient and use Theil’s incomplete method of fitting a straight line using data set 1. This will have to be done using Excel as these methods are not available in Minitab.
Copy the data into a new Excel workbook, putting the centrifuge speed in column A and the Fat content in column B.
As the centrifuge speed is already in order, insert the ranks, 1  10 of the centrifuge speed in column C.
Now highlight your data with the mouse and select the Data>Sort from the menu. Sort the data by fat content in ascending order.
Insert ranks, 1 – 10 for the fat content in column D. You can if you wish, sort the data back into order of centrifuge speed at this point.
In column E calculate the difference between the rank of the centrifuge speed and the rank of the fat content and in column F calculate the square of the differences.
At the foot of column F, calculate the sum of the squares of the differences.
In a cell at the foot of the table, insert the number of data points (10) and calculate the value of Spearman’s rank correlation coefficient, r from
Check the significance of r using statistical tables and compare the value with Pearson’s correlation coefficient.
You can also compute Spearman's correlation coefficient from Minitab, by carrying out a Pearson correlation on the data ranks.
In Minitab, select Manip/Rank... In the dialog box, enter the column containing the xvalues and a column to store the ranks. Repeat this for the yvalues.
Then carry out a Pearson correlation on the ranks.
Theil’s incomplete method
Copy and paste the data into a fresh table starting at Cell A21 for the data (use rows 19 and 20 for titles and column headerings). Copy centrifuge speed into column A (cells A21A30) and fat content into column B (cells B21B30).
Theil’s method involves dividing the data into two equal groups. With 10 data points this means two groups of 5. The first group will be in cells A21:B25 and the second group in cells B26:B30
Calculate the slope of the line joining the first points in each of the two groups in cell C21. The formula will be "=(B26B21)/(A26A21)". Don’t forget the () they are important!
Fill this formula down as far as cell C25 to give five slopes of lines connecting corresponding points of each groups.
In order to determine the median value we need the values of slope in order. Sorting data in columns where there are formulae can cause problems in Excel, so we are going to use an Excel function RANK() to find the order of the data.
In cell D21, type the formula =6RANK(C21,$C$21:$C$25) and press ENTER. Copy this formula to cells C22  C25. This will give the ranks of the slopes with the smallest value of slope having the rank "1".
The slope of the regression line is the median value which is the value having the rank of 3 (in cell C23). Type this value into a cell below the table.
Using this value of slope, calculate the intercept for each of the 10 data points from
Intercept = fat – slope ´ speed. Use column E for these calculations.
The value of intercept you want is the median. Type the formula =11RANK(E21,$E$21:$E$30) into cell F21, copy this formula to cells F22:F30. This will give the ranks of the intercepts in column E. The median value is the mean of the values having the fifth and sixth ranks.
Insert a line of best fit in column G and residuals in column H. Produce charts to compare the fitted line with the least squares method and also a residual plot.
You can also do a nonparametric regression using Minitab.
Select Stats/Nonparametrics/Pairwise slopes...
Enter the columns containing the y and x variables and a column to store the slopes (call this column “slopes”). Select OK.
Select Calc/Column Statistics... Select the “Median” button and enter “slopes” as the Input variable, then select OK.
The median will be displayed. This is your slope.
You now need to calculate the intercept.
Select Calc/Calculator... Enter a column to store the output (call this “intercepts”).
In the Expression box enter “yvalue  slope * xvalue”. (replace the words in italics by either a column name or, in the case of the slope, the median previously calculated.
Select OK and a series of intercept values corresponding to each (x, y) pair will be calculated.
The median of this set of values is your intercept.
Exercise
Data set 4 is data on two experiments on enzyme kinetics using two different enzymes. Analyse the data using the techniques in this tutorial.
The equation relating rate of reaction (V_{o}) to the substrate concentration ([S]) for an enzyme reaction is
V_{max} and K_{m} are constants and properties of the particular system.
This equation can be linearised as follows
Produced by Geoff
Walker
Last
Modified 3 January 2003