Correlation and Regression Tutorial

This tutorial comprises a series of exercises illustrating various aspects of correlation and regression

The data for the exercises is available for download on an Excel file. You are advised to save the Excel file on a floppy disk and cut and paste the data into Minitab as needed. If you use Excel for the tutorial, you can simply create new worksheets or workbooks and copy & paste the data as needed.  Alternatively you should have some storage space on the I: drive, which reduces the need to carry floppy disks.

When you have worked through the various exercises, you can download an Excel file with results on them for comparison.  There is also a set of solutions for the various tutorial exercises based on Minitab

Linear Regression

Data set 1 is a set of experimental data in which the relationship between the rotational speed of a centrifuge and the fat content of cream was investigated. The centrifuge speed is the predictor and the fat content the response.

Question. Can centrifuge speed be used to determine fat content of cream?

The data may be analysed either using Excel or Minitab.

Using Minitab

Scatter Plots

A scatter plot of these two variables can be obtained by Minitab. Select:

Graph > Plot

In the dialog box , enter the Y-value as Fat and the X-value and Speed

click OK and examine the graph that is produced.

Does there appear to be a linear trend (or any relationship) between the fat content of the cream and the centrifuge speed? or does the pattern resemble a random scattering of points?

Correlation

The correlation coefficient is a measure of how good a relationship exists between the two variables. This is a first step in carrying out a regression analysis.

To compute the correlation coefficient, r between x and y perform this procedure:

Stat > Basic Statistics > Correlation

Enter the two variables as Fat and Speed and ensure that "display p-value" is selected

click OK Perform the following statistical test

Ho: There is no correlation between fat content and speed.

H1: Fat content and speed are linearly related

Decision rule: Reject Null hypothesis if p-value of correlation coefficient < 0.05

Examine the Minitab output for correlation coefficient. If p-value>0.05, fail to reject null hypothesis and stop.

Otherwise, reject null hypothesis and proceed to perform regression analysis.

.Regression

First take a few minutes to write null and alternate hypothesis and decision rule(s) for linear regression

Perform this procedure:

Select Stat > Regression > Regression... and a dialog box will appear.

Response = Fat, Predictor = Speed.

Select Graph...

Under "residuals for plots", check "regular"

Click in the box under "residuals vs. fits"

Click OK then select Storage...

Check the residuals box then click OK and OK again

Look at the output of Minitab

Question: Is this model useful for predicting fat content from a given centrifuge speed?

  • Identify the equation for the straight line of best fit. The least squares method will always give a line of best fit but how good is the fit? 

  • Is there is a significant linear relationship between fat content and centrifuge speed? (p value < alpha = 0.05, from the ANOVA table) 

  • The coefficient of determination (R2) determines the % of total variability in fat content explained by centrifuge speed. For a good fit R2>0.8 or 80% 

  • Is a straight line the best form of relationship on this equation. Sometimes there appears to be a good fit to a straight line, but the underlying relationship is actally a curve. A residual plot should pick this up. The residuals are the difference between the measured y-value and the y-value calculated from the equation for a given x-value. For a good fit a plot of residuals vs. x-value should give a random scatter of points either side of the zero. If there is an underlying pattern to the data, the residual plot will have a pattern to it. 

  • Check that the scatter of the points on the residual plot looks reasonably random with equal scatter either side of the zero. 

  • Check that the errors are normally distributed.
    Select Graph>Probability Plot... and select RESI1 as your variable.
    Check that all the points lie within the 95% confidence limits (shown as dotted lines)

If all these conditions are met, then the equation calculated is a reasonable predictor for the fat content of cream that will be produced for a given centrifuge speed.

As a final exercise, examine the fitted line on the data

Select: Stat>Regression>Fitted Line Plot...

Response (Y): Fat Content

Predictors (-X): Centrifuge Speed

Ensure that linear is selected under type of model

Click on Options and select to display prediction and confidence bands. Ensure the value in the confidence interval box is 95.

Click OK. (twice)

Examine the graph which will give a scatter plot with the straight line fitted and the limiting lines which fit the data. The significance of the lines is as follows

  • The solid black line is the regression line i.e. the line that best fits the data
  • The red dotted line is the 95% confidence band. i.e. we are 95% confident that the true line will lie within these bands
  • The blue dashed line is the 95% prediction interval. i.e. It gives the 95% confidence limits for any y value calculated using the regression equation.

Testing for a line through the origin

If the line passes through the origin the intercept will be zero. The Minitab output allows us to test for this.

H0: intercept = 0 (The line passes through the origin)

H1: intercept ≠ 0 (the line does not pass through the origin)

Decision Rule: Reject H. if P < α (In this case, 0.05)

on the Minitab output, we find under the regression equation

Constant; coeff = 75.39, StDev = 16.47, T = -4.58, P = 0.002

From this we can conclude that the intercept is non-zero and the line does not pass through the origin. i.e. the intercept is in fact 75.4

Estimating the error in slope and intercept

The same table that we used to test whether the line passed through the origin also enables us to estimate the 95% confidence intervals for the slope and intercept. The 95% confidence interval for the intercept is calculated from

95% CI = intercept ± StdDev x t0.05

t0.05 must be looked up in statistical tables.

From tables, for α = 0.05 (2-tailed), t = 2.262

Hence 95% CI = 75.4 ± 2.62 x 16.47 = 75.4 ±43.15

Now estimate the 95% CI for the slope using the same method.
The value of t is the same.

 

Using Excel

Scatter Plots

A scatter plot can be obtained in Excel by highlighting the cells containing the data and selecting the chart wizard. Select x-y scatter then follow the instructions to create the graph.

Note: Ensure your x-values are in the left hand column
 
 
 

Does there appear to be a linear trend (or any relationship) between the metabolic rate and body weight, or does the pattern resemble a random scattering of points?
 
 

Correlation

The correlation coefficient is a measure of how good a relationship exists between the two variables. This is a first step in carrying out a regression analysis.

To find the correlation coefficient, select a cell where you want the result to be stored, then use the function wizard (fx button) to insert the correlation coefficient, correl().

Perform the following statistical test

Ho: There is no correlation between fat content and speed.

H1: Fat content and speed are linearly related

Decision rule: Reject Null hypothesis if p-value of correlation coefficient < 0.05

Use statistical tables to check whether the correlation coefficient (r) is significant at P = 0.05.

If r<r-value at p=0.05, fail to reject the null hypothesis and stop.

Otherwise, reject null hypothesis and proceed to perform regression analysis.

Regression

First take a few minutes to write null and alternate hypothesis and decision rule(s) for linear regression

A regression line may be inserted on an Excel chart. Using your mouse, point at one of the data points and click the right mouse button. Select "Trendline" from the menu that appears.

In the dialog box select linear, then select the "Options" Tab.

In Options, check the display equation on chart and display R2 value boxes.

Select OK

A regression line will be inserted and the equation and the R2 value will appear on the chart.

Examine these. Does the straight line appear to be a good fit?

Regression analysis

Regression analysis in Excel requires the Analysis add in to be installed. To check this select Tools>Add InTHEN s …

Ensure that the Analysis tool pack box is checked and select OK.

To carry out a regression analysis, select Tools>Data Analysis … then select "Regression" from the resulting dialog box

In the Regression dialog box enter the following

The cells containing the x and y-values.

Under "Output options", select the "Output Range" button and enter the address of the top left cell where the output is to appear. It is suggested that this cell is just below your data and chart.

Under Residuals, check the "Residual Plots" box. There is no need to select "Line Fit Plots" as your chart has already done this

Under "Normal Probability" check "Normal Probability Plots"

Select OK

Look at the output from Excel

Question: Is this model useful for predicting fat content from a given centrifuge speed?

  • Identify the equation for the straight line of best fit. The least squares method will always give a line of best fit but how good is the fit? 

  • Is there is a significant linear relationship between fat content and centrifuge speed? (p value < alpha = 0.05, from the ANOVA table) 

  • The coefficient of determination (R2) determines the % of total variability in fat content explained by centrifuge speed. For a good fit R2>0.8 or 80% 

  • Is a straight line the best form of relationship on this equation. Sometimes there appears to be a good fit to a straight line, but the underlying relationship is actally a curve. A residual plot should pick this up. The residuals are the difference between the measured y-value and the y-value calculated from the equation for a given x-value. For a good fit a plot of residuals vs. x-value should give a random scatter of points either side of the zero. If there is an underlying pattern to the data, the residual plot will have a pattern to it. 

  • Check that the scatter of the points on the residual plot looks reasonably random with equal scatter either side of the zero. 

  • Check the normally probability plot is reasonably close to a straight line (expand the graph and add trendline). Unfortunately 95% CI limits are not printed so you will just have to make a visual check.

If all these conditions are met, then the equation calculated is a reasonable predictor for the fat content of cream that will be produced for a given centrifuge speed.

 

Testing for line through the origin

Estimating error for slope and intercept

Look in the Minitab column for how to calculate these quantities. The method is the same and the same basic information is available in your Spreadsheet.


Further regression exercises

These exercises apply the techniques you have just used to non-linear regression

Data set 2 is a comparison of the body weight and the metabolic rate of a child.

Let x be the body weight (in kilograms), and let y be the metabolic rate of the child which is the energy expended by the body per unit time (100 Kcal/24 h).

Carry out a linear regression using Minitab. You will find that a linear equation is a satisfactory fit, but the residual plot suggests there is are systematic errors present. Careful inspection of a scatter plot of the data suggests that the true relationship is probably a curve.

Try fitting quadratic and cubic equations to the data. To do this, you need to create two new columns using the calculator. One column will contain the square of the body weight and the other the cube of the body weight. Label these columns Wt-square and Wt-cube respectively.

Carry out two further regressions using Minitab. For the first one use weight and Wt-square as the predictors and for the second one, use weight, Wt-square and Wt-cube as the predictors. Examine the Anova, R2 and residual plots. Which looks like the best fit.

Also produce fitted line plots. Check Quadratic and Cubic in turn as the type of regression model. Which model produces the best fit?

Which model produces the best fit for this data?
 
 

Data set 3 is a relationship between shear stress (y-axis) and shear rate (x-axis) for a polymeric material. These properties characterise the flow behaviour of viscous materials. The relationship between shear stress and shear rate is normally expressed as a power law

 

Where

t is the shear stress
g is the shear rate
K, n are constants.

This relationship can be linearised by taking logs to produce a log-log relationship. Create two new columns, one containing log shear stress and the other containing log shear rate (In Minitab, use the calculator for this). Determine the correlation coefficient for log shear stress and log shear rate and carry out a linear regression on the log-log data. Is a power law relationship reasonable in this case? Comment on your findings.


Non Parametric linear regression

We will determine Spearman’s rank correlation coefficient and use Theil’s incomplete method of fitting a straight line using data set 1. This will have to be done using Excel as these methods are not available in Minitab.

Copy the data into a new Excel workbook, putting the centrifuge speed in column A and the Fat content in column B.

As the centrifuge speed is already in order, insert the ranks, 1 - 10 of the centrifuge speed in column C.

Now highlight your data with the mouse and select the Data>Sort from the menu. Sort the data by fat content in ascending order.

Insert ranks, 1 – 10 for the fat content in column D. You can if you wish, sort the data back into order of centrifuge speed at this point.

In column E calculate the difference between the rank of the centrifuge speed and the rank of the fat content and in column F calculate the square of the differences.

At the foot of column F, calculate the sum of the squares of the differences.

In a cell at the foot of the table, insert the number of data points (10) and calculate the value of Spearman’s rank correlation coefficient, r from

 

Check the significance of r using statistical tables and compare the value with Pearson’s correlation coefficient.

You can also compute Spearman's correlation coefficient from Minitab, by carrying out a Pearson correlation on the data ranks.

In Minitab, select Manip/Rank... In the dialog box, enter the column containing the x-values and a column to store the ranks. Repeat this for the y-values.

Then carry out a Pearson correlation on the ranks.

 

Theil’s incomplete method

Copy and paste the data into a fresh table starting at Cell A21 for the data (use rows 19 and 20 for titles and column headerings). Copy centrifuge speed into column A (cells A21-A30) and fat content into column B (cells B21-B30).

Theil’s method involves dividing the data into two equal groups. With 10 data points this means two groups of 5. The first group will be in cells A21:B25 and the second group in cells B26:B30

Calculate the slope of the line joining the first points in each of the two groups in cell C21. The formula will be "=(B26-B21)/(A26-A21)". Don’t forget the () they are important!

Fill this formula down as far as cell C25 to give five slopes of lines connecting corresponding points of each groups.

In order to determine the median value we need the values of slope in order. Sorting data in columns where there are formulae can cause problems in Excel, so we are going to use an Excel function RANK() to find the order of the data.

In cell D21, type the formula =6-RANK(C21,$C$21:$C$25) and press ENTER. Copy this formula to cells C22 - C25. This will give the ranks of the slopes with the smallest value of slope having the rank "1".

The slope of the regression line is the median value which is the value having the rank of 3 (in cell C23). Type this value into a cell below the table.

Using this value of slope, calculate the intercept for each of the 10 data points from

Intercept = fat – slope ´ speed. Use column E for these calculations.

The value of intercept you want is the median. Type the formula =11-RANK(E21,$E$21:$E$30) into cell F21, copy this formula to cells F22:F30. This will give the ranks of the intercepts in column E. The median value is the mean of the values having the fifth and sixth ranks.

Insert a line of best fit in column G and residuals in column H. Produce charts to compare the fitted line with the least squares method and also a residual plot.

You can also do a non-parametric regression using Minitab.

Select Stats/Nonparametrics/Pairwise slopes...

Enter the columns containing the y and x variables and a column to store the slopes (call this column “slopes”). Select OK.

Select Calc/Column Statistics... Select the “Median” button and enter “slopes” as the Input variable, then select OK.

The median will be displayed. This is your slope.

You now need to calculate the intercept.

Select Calc/Calculator... Enter a column to store the output (call this “intercepts”).

In the Expression box enter “y-value - slope * x-value”. (replace the words in italics by either a column name or, in the case of the slope, the median previously calculated.

Select OK and a series of intercept values corresponding to each (x, y) pair will be calculated.

The median of this set of values is your intercept.


Exercise

Data set 4 is data on two experiments on enzyme kinetics using two different enzymes. Analyse the data using the techniques in this tutorial.

The equation relating rate of reaction (Vo) to the substrate concentration ([S]) for an enzyme reaction is

Vmax and Km are constants and properties of the particular system.

This equation can be linearised as follows

Produced by Geoff Walker
Last Modified 3 January 2003