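true population mean. The equation for the standard error may be seen in Equation 4.3:
$$\mathrm{SEM} = \frac{\text{standard deviation}}{\sqrt{n}} \qquad \text{(Equation 4.3)}$$
where n is the number of observations in the sample.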
If we wanted to check that the value of the standard error calculated in the Descriptive Statistics function was correct then we would insert the following formula into a cell on the spreadsheet, using the data from Group 1 as an example:
= 3.7/SQRT(9)
where 3.7 is the standard deviation of the sample, for which there were nine observations. More generally, the standard error could be calculated by:
= STDEV(range of values in sample)/SQRT(number in sample)
Figure 4.1 Descriptive Statistics functions in Excel
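The same check can be made outside Excel. The following is a minimal Python sketch, not part of the original exercise; the function name `standard_error` is an illustrative assumption, and the inputs are the standard deviation (3.7) and number of observations (9) quoted for Group 1.

```python
import math

def standard_error(std_dev: float, n: int) -> float:
    """Standard error of the mean: standard deviation divided by sqrt(n)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return std_dev / math.sqrt(n)

# Values quoted in the text for Group 1: SD = 3.7, nine observations.
sem_group1 = standard_error(3.7, 9)
print(round(sem_group1, 3))
```

This mirrors the worksheet formula = 3.7/SQRT(9), so the printed value should agree with the standard error reported by the Descriptive Statistics function.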
When presenting graphs showing mean values it is usually expected that error bars are included, using either the standard deviation to demonstrate the variability in the sample, or the standard error to demonstrate how far the sample mean is likely to deviate from the true population mean.
Kurtosis and skewness
Values for kurtosis and skewness are also produced by the Descriptive Statistics function. These are used to characterize the data relative to a normal distribution. Skewness is a measure of symmetry. Where data are symmetrical about the mean the skewness would be expected to have a value of around 0. If data are skewed to the left or right then the centre of the data is not around the mean and so a negative or positive value for skewness would be obtained. Skewed distributions are further discussed in section 4.2.
Kurtosis compares the shape of the data to a normal distribution and is a measure of whether the data tend to be peaked or flat. Where a high value for kurtosis is observed, data show a distinct peak about the mean and then decline rapidly. For lower kurtosis values, data are more spread out, giving a flat top to the shape of the distribution rather than a peak. Under the conventional definition a value of around 3 would represent a normal distribution; note that Excel reports excess kurtosis (kurtosis minus 3), for which a normal distribution gives a value of around 0.
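To see how the sign of these statistics behaves, the sketch below computes simple moment-based skewness and excess kurtosis in Python. Two assumptions should be noted: the data sets are hypothetical and chosen only for illustration, and the formulas are the plain population-moment versions, so the values will not match exactly the bias-corrected figures that Excel produces for small samples.

```python
def central_moment(values, order):
    """Mean of (value - mean) raised to the given power."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)

def moment_skewness(values):
    """Third standardized moment: about 0 for symmetrical data."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth standardized moment minus 3: about 0 for a normal distribution."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

# Hypothetical data sets, for illustration only.
symmetrical = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print(moment_skewness(symmetrical))   # symmetrical data: skewness of zero
print(moment_skewness(right_skewed))  # long right tail: positive skewness
print(excess_kurtosis(symmetrical))   # flat-topped data: negative excess kurtosis
```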
Figure 4.2 Descriptive Statistics for the television viewing data
Coefficient of variation
This function also does not appear in Excel but is a very useful parameter to calculate. The coefficient of variation represents the standard deviation as a percentage of the mean value; it is particularly useful when comparing the reproducibility of results. In quantitative analytical methods, the coefficient of variation is used as a measure of precision in quality control determinations. The coefficient of variation is calculated as shown in Equation 4.4:
$$\text{coefficient of variation} = \frac{\text{standard deviation}}{\text{mean}} \times 100\% \qquad \text{(Equation 4.4)}$$
The coefficient of variation is usually given as a percentage and expresses the variability (from the standard deviation) of the sample compared to the mean value. It is a useful parameter to use when comparing two or more samples with different means to see if the variability is the same in each sample.
Exercise 4.1
Take as an example a laboratory analysis conducted by
two students. Each performed an assay to determine the
protein concentration of a sample containing 125 mg ml⁻¹ of
protein. Each repeated the analysis 10 times and the results
are shown in Table 4.3.
Enter the data on a spreadsheet in Excel and perform the
descriptive statistics on the data. Using the data for the mean
and standard deviation for each sample, enter the following
equation into one of the cells on the worksheet, inserting the
appropriate value for the mean and standard deviation in each
case:
= (value for standard deviation/value for the mean)*100
When comparing the means you should find that both students
have a mean value of 125 mg ml⁻¹ from their protein
determinations, but student 2 has a more precise technique. For
the values listed in Table 4.3 the standard deviations are about
2.3 and 7.3 mg ml⁻¹, so the coefficient of variation is about
1.8 per cent for student 2 compared with about 5.8 per cent for
student 1.
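As a cross-check on the worksheet calculation, the following Python sketch applies Equation 4.4 to the nine values per student that are listed in Table 4.3. The function name `coefficient_of_variation` is an illustrative assumption; `statistics.stdev` gives the sample standard deviation, which is what Excel's STDEV returns.

```python
import statistics

# Values as listed in Table 4.3 (nine determinations per student).
student_1 = [125, 120, 122, 130, 115, 140, 130, 121, 125]
student_2 = [121, 124, 127, 122, 125, 126, 128, 126, 126]

def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100

cv_1 = coefficient_of_variation(student_1)
cv_2 = coefficient_of_variation(student_2)

for name, values, cv in (("Student 1", student_1, cv_1),
                         ("Student 2", student_2, cv_2)):
    print(f"{name}: mean = {statistics.mean(values):.1f}, "
          f"SD = {statistics.stdev(values):.2f}, CV = {cv:.1f}%")
```

With the values as listed, the standard deviations come out at about 7.3 and 2.3 and the coefficients of variation at about 5.8 and 1.8 per cent, so student 2 shows the smaller relative variability.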
4.2 Frequency distributions
When we conduct scientific investigations, we collect data by taking samples from much larger populations. In order to learn something about the population we use descriptive statistics, but we also need to examine the characteristics of the distribution in order to determine the best way to summarize and analyse data.
In Section 3 we learnt about presenting data in the form of bar charts. We can draw bar charts of data in which we measure frequency (the number of times a particular occurrence takes place, for example the number of individuals in a population with blue eyes); if we draw a line joining the midpoints of the tops of the bars then we obtain a frequency polygon. Increasing the number of bars in the plot, providing there is sufficient data to do so, will eventually produce a smooth curve, the shape of which will tell us something about the characteristics of the population. Figure 4.3 shows how a frequency polygon may be produced from a bar chart, using data showing height of a sample of adults from a population. This type of bar chart is known as a histogram.
Table 4.3 Protein determinations performed by two students with a sample containing 125 mg ml⁻¹ of protein
Student 1 125 120 122 130 115 140 130 121 125
Student 2 121 124 127 122 125 126 128 126 126
Figure 4.3 Normal distribution of heights of subjects
Figure 4.4 Skewed and bimodal distributions
Where the resulting frequency polygon is symmetrical about its centre, the shape of the curve is said to be ‘bell-shaped’. At each end, or tail, of the curve, there is a small number of extremely small or extremely large values, but the majority of the observations fall in the middle part of the curve, i.e. they are centred around the value for the mode. If we were to calculate the mean and the median for these data we would find that the values would be virtually identical. A curve is said to follow a normal distribution where this occurs: the mean reflects the central tendency of the distribution and so coincides with its midpoint, represented by the median.
It is useful when considering the shape of a population to look at the tail of the curve that is produced. In Figure 4.4 we can see two distributions that cannot be normal as they do not follow a bell-shape; these are known as skewed distributions, of which there are two types, positive and negative (see also the subsection ‘Descriptive statistics in Excel’ in section 4.1).
A distribution with a positive skew will contain more extremely large values than extremely small ones and therefore resembles Chart A. Clearly the mean calculated for these data would not represent the central location of the distribution. Similarly, if we consider Chart B there are clearly more extremely small values than extremely large ones, in which case the data are negatively skewed. For each of these curves, the best measure of the central tendency for the data would be represented by the median value and not the mean.
Sometimes the shape of the distribution appears as if two normal (bell-shaped) distributions have been combined together, as shown in Chart C in Figure 4.4. This would suggest that there is a mixed population, which might arise where a population contains two species.
In plotting these curves we have split the data into groups, or intervals, that are equally spaced apart. The more intervals we are able to divide the data into, the more well-defined the curve becomes. We will see how by using raw data for heights of individuals we are able to produce a frequency distribution and how the Excel Paste Function may be applied to aid this process.
Exercise 4.2
The data in Table 4.4 have been collected from a sample of 40 individuals from a population. Enter the data in one column in a new workbook in Excel. The height of each subject was recorded to the nearest centimetre, so in terms of the absolute accuracy of the results, a person whose height is between 153.5 and 154.4 cm would still be recorded as 154 cm (by rounding up or down). Height would therefore be described as being a continuous variable, but because we are taking recorded measurements correct to the nearest centimetre, we are sampling discrete values.
The data on the worksheet make little sense as they stand
and need to be organized. The first, most obvious step is to
place them in order. Using the Data | Sort command (as
described in Section 3), organize the data into ascending
order. Look down the column of data to see the results. We
can now see that the smallest (minimum) value for height
is 147 cm whereas the largest (maximum) is 188 cm, so the
heights of the individuals range from 147 to 188 cm. Even
after sorting, the data are still difficult to interpret as each
value has to be examined in relation to all the others (and
what if we had thousands of measurements?). The next
stage is clearly to group the data; this is done by dividing it
into classes, with evenly spaced intervals between
groups.
Rule: When data are divided into intervals it should usually be into no
more than 10 intervals and no fewer than five intervals. Each interval should
be of an equal width.
To determine how many groups to divide the data into, count the number
of observations. In this case n = 40.
Take the square root of the total and round to the nearest whole number
(√40 = 6.325), i.e. 6.
Table 4.4 Height (cm) of forty individuals from a university tutorial group
147 154 157 163 163 165 168 171 173 177
151 155 152 161 161 169 169 172 175 177
158 155 159 161 164 167 165 182 175 172
154 156 165 162 160 188 176 173 170 167
Excel is able to automatically group frequency data but needs to be given the parameters by which to do this. You will first of all have to make some decisions about your data.
Firstly, look at the range of the data (147–188 cm). In order to group the data we need to work out how to have evenly spaced intervals. Clearly, if we group the data into six classes then the interval between them should be:
$$\text{interval} = \frac{\text{highest number} - \text{lowest number}}{\text{number of classes}} \qquad \text{(Equation 4.5)}$$
= (188 − 147)/6
which gives us an answer of 6.83, so the interval between the classes should be 7 cm. In Table 4.5 we can see how the data need to be grouped. The number in the class column is the lower value for the class and moves upwards in steps of 7 cm.
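The two decisions made so far (how many classes, and how wide each should be) can be sketched in Python using the forty heights from the exercise. This is an illustration only: the variable names are assumptions, and rounding the interval up with `math.ceil` is one way of arriving at the 7 cm used in the text, chosen here so that six classes are guaranteed to span the whole range.

```python
import math

# The forty heights (cm) recorded in the exercise.
heights = [
    147, 154, 157, 163, 163, 165, 168, 171, 173, 177,
    151, 155, 152, 161, 161, 169, 169, 172, 175, 177,
    158, 155, 159, 161, 164, 167, 165, 182, 175, 172,
    154, 156, 165, 162, 160, 188, 176, 173, 170, 167,
]

n = len(heights)

# Square-root rule: number of classes is sqrt(n) to the nearest whole number.
number_of_classes = round(math.sqrt(n))

# Equation 4.5: range divided by the number of classes.
raw_interval = (max(heights) - min(heights)) / number_of_classes

# Round up to a whole number of centimetres (an assumption, see above).
interval = math.ceil(raw_interval)

print(n, number_of_classes, round(raw_interval, 2), interval)
```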
The first class (147–153) will contain the discrete values:
147 148 149 150 151 152 153
where 147 is the lower class boundary and 153 is the upper class boundary.
In Excel, data are divided into bins (classes) in which you define the upper class boundary. Using these bins, frequency data can be produced from a list of observations, so you will need to enter onto your data sheet the classes (bins) in which you want to categorize your data. On the worksheet, type in the upper class boundaries for the data (so from Table 4.5 the upper class boundaries will be 153, 160, 167, 174, 181 and 188; enter the data in one column).
Table 4.5 Classes for the student height data
Height (cm) 147–153 154–160 161–167 168–174 175–181 182–188
Using the histogram function
From the Tools menu select Data Analysis and from the list
provided choose Histogram. A dialogue box should appear as
shown in Figure 4.5. Enter the input range of the data and then
the range of cells containing your bins. Click on the Chart
Output box so that a histogram of the data is plotted on the
worksheet and confirm your selections.
A table should now appear on the worksheet in which the
data has been placed into the six classes provided. The data
should be presented as in Table 4.6.
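The grouping step can be reproduced outside Excel to check the output table. The Python sketch below is an illustration, not the method Excel uses internally: it assumes that each value is counted in the first bin whose upper class boundary is greater than or equal to it, with a final "More" bin for anything above the last boundary. The function name `bin_frequencies` is hypothetical.

```python
from bisect import bisect_left

# The forty heights (cm) recorded in the exercise.
heights = [
    147, 154, 157, 163, 163, 165, 168, 171, 173, 177,
    151, 155, 152, 161, 161, 169, 169, 172, 175, 177,
    158, 155, 159, 161, 164, 167, 165, 182, 175, 172,
    154, 156, 165, 162, 160, 188, 176, 173, 170, 167,
]

# Upper class boundaries (bins) for the six 7 cm classes.
upper_boundaries = [153, 160, 167, 174, 181, 188]

def bin_frequencies(values, upper_bounds):
    """Count values per bin; the extra last slot is the 'More' bin."""
    counts = [0] * (len(upper_bounds) + 1)
    for value in values:
        # Index of the first boundary that is >= value.
        counts[bisect_left(upper_bounds, value)] += 1
    return counts

frequencies = bin_frequencies(heights, upper_boundaries)

for bound, count in zip(upper_boundaries + ["More"], frequencies):
    print(bound, count)
```

The counts should sum to 40, which is a quick way of confirming that no observation has been missed or counted twice.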
We now have what is known as a frequency distribution of our data. The data is also presented in a histogram as in Figure 4.6.
Figure 4.5 Using the Histogram function in Excel
Table 4.6 Output table from Excel showing grouping of data into bins
Bin Frequency
We can see that this appears to approximate to a normal distribution, but it is difficult to be certain with a limited number in the data set. If the sample were larger we could increase the number of bars in the frequency histogram by setting classes (bins) closer together; the histogram would appear more as a smooth curve. The shape of the distribution is represented by the shape of this curve.
When considering the statistical testing of data, it is important to establish
in conducting an experiment:
(a) whether a sample is sufficiently large to represent the population
as a whole
(b) that the characteristics of the population are known (i.e. normal, skewed, bimodal) in order to choose the correct test to be applied to the data and the most appropriate summary statistics to describe it.
Figure 4.6 Frequency histogram for heights of university students
4.3 Correlation and linear regression
Sometimes we conduct an investigation to determine whether there is an association between two variables of interest. The starting point of finding out whether such a relationship exists is to examine the data visually in the form of a scattergraph; this will show us whether:
• there is a distinctive trend between the two variables (x and y) or the relationship is entirely random, i.e. whether the variables are related or independent
• the relationship, where found, is rectilinear or curvilinear
• the relationship is positive or negative.
We can then explore associations statistically by quantifying the correlation between variables; the closeness of the relationship is expressed by the correlation coefficient, r.
When r = +1 the two variables are positively related.
When r = −1 the two variables are negatively related.
A value of +1 or −1 for r indicates an undisputed relationship between x and y, so this would indicate a perfect correlation between the two variables. A value of 0 would indicate no linear relationship between x and y, so there would be no correlation whatsoever. In practice these values represent two extremes and most correlation coefficients lie in between these values; a judgement on the association between variables is therefore made on the proximity of the magnitude of r to either 0 or 1. Figure 4.7 shows a number of scattergraphs and their corresponding correlation coefficients.
Figure 4.7 Scattergraphs showing positive, negative and questionable correlations
Correlation
In order to determine statistically whether a correlation exists between two variables, x and y, we use the correlation coefficient represented by r. Using Excel it is very easy to plot a scattergraph, determine a correlation between variables and demonstrate the relationship between them by inserting a trendline (where appropriate) between data points. Note that in order for two variables to be correlated, they do not necessarily need to demonstrate a linear trend between them.
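For readers who want to see what lies behind the coefficient, the sketch below computes r in Python using the standard product-moment (Pearson) formula, which is assumed here to be the coefficient intended in this section and is the quantity returned by Excel's CORREL worksheet function. The function name `pearson_r` and the small data sets are hypothetical, chosen only to show the two extreme cases.

```python
import math

def pearson_r(xs, ys):
    """Product-moment correlation coefficient between paired values."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("xs and ys must be paired and contain at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products and squares of deviations from the means.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y_rising = [2, 4, 6, 8, 10]    # y increases steadily with x
y_falling = [10, 8, 6, 4, 2]   # y decreases steadily with x

print(pearson_r(x, y_rising))   # close to +1
print(pearson_r(x, y_falling))  # close to -1
```

Real data such as the lichen measurements in Exercise 4.3 would be expected to give a value somewhere between these extremes.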
Exercise 4.3
The mean radius of lichens growing on gravestones was measured in a churchyard, selecting the largest radius in each case. This was recorded together with the date on the gravestone. The data are presented in Table 4.7. As can be seen from the table, the first task that must be performed is to place the dates in chronological order. Enter the data into an Excel worksheet and then, using the Sort command from the Data menu, arrange the dates into ascending order (making sure that you select all of the data for sorting). Using Chart Wizard, plot the data and choose the XY Scatter format. Add a suitable title and labels for the x- and y-axes.
Table 4.7 Mean radius of lichens found on gravestones in a churchyard
Date on gravestone
Mean radius of lichen colony (mm)
Scattergraphs
In Chart Wizard select the Scattergraph option, XY (Scatter), without lines
connecting points. Make sure you edit the scale of axes where points are
clustered in one portion of the chart to ensure that all of the points are
spread out. This is accomplished by selecting the appropriate axis (x or y),
right clicking the mouse button and from the Format Axis menu selecting
the Scale tab. You will then be able to adjust the minimum or maximum
value on the axis. To add a trendline to the graph, select one of the points
and right click the mouse button. From the options, select Add Trendline.
View the different types of trendlines that are available and see how well
they fit the points. Options available can be seen in Figure 4.8.
With polynomial and moving average trendlines you may need to adapt
the fit of the line by increasing the Order (default value 2).
Figure 4.8 Inserting trendlines