4 Filling up – fuelling quantitative analysis
Chapter objectives
This chapter will help you to:
■ understand key statistical terms
■ distinguish between primary and secondary data
■ recognize different types of data
■ arrange data using basic tabulation and frequency distributions
■ use the technology: arrange data in EXCEL, MINITAB and SPSS
In previous chapters we have concentrated on techniques or models involving single values that are known with certainty. Examples of these are break-even analysis and linear programming, which we looked at in Chapter 2, and the Economic Order Quantity model featured in Chapter 3. In break-even analysis the revenue per unit, the fixed cost and the variable cost per unit are in each case a specified single value. In linear programming we assume that both profit per unit and resource usage are constant amounts. In the Economic Order Quantity model the order cost and the stock-holding cost per unit are each known single values. Because these types of models involve values that are fixed or predetermined they are called deterministic models.
Deterministic models can be useful means of understanding and resolving business problems. Their reliance on known single value inputs makes them relatively easy to use but is their key shortcoming. Companies simply cannot rely on a figure such as the amount of raw material used per unit of production being a single constant value. In practice, such an amount may not be known with certainty, because it is subject to chance variation. Because of this, company managers may well need to study the variation and incorporate it within the models they use to guide them.
Models that use input values that are uncertain rather than certain, values that are subject to chance variation rather than known, are called probabilistic models, after the field of probability, which involves the measurement and analysis of chance. We shall be dealing with probability in later chapters.
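The contrast between the two kinds of model can be sketched in a few lines of Python. The break-even quantity is the fixed cost divided by the margin per unit (revenue per unit less variable cost per unit). All figures below are hypothetical, and the uniform spread given to the variable cost is only an assumption chosen to illustrate chance variation.

```python
import random


def break_even_quantity(fixed_cost, revenue_per_unit, variable_cost_per_unit):
    """Units that must be sold for total revenue to equal total cost."""
    return fixed_cost / (revenue_per_unit - variable_cost_per_unit)


# Deterministic view: every input is a single known value (hypothetical figures).
deterministic = break_even_quantity(1000, 10, 6)

# Probabilistic view: the variable cost per unit is subject to chance variation,
# represented here (as an assumption) by a uniform spread between 5 and 7.
rng = random.Random(1)  # fixed seed so the sketch is repeatable
simulated = [break_even_quantity(1000, 10, rng.uniform(5, 7)) for _ in range(5)]

print(deterministic)
print([round(quantity, 1) for quantity in simulated])
```

The deterministic call gives one answer; the simulated list shows how the answer moves once an input is allowed to vary.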
Before you can use probability to reflect the chance variation in business situations you need to know how to get some idea of the variation. To do this we have to start by ascertaining where relevant information might be found. Having identified these sources you need to know how to arrange and present what you find from them in forms that will help you understand and communicate the variation. In order to do this in appropriate ways it is important that you are aware of the different types of data that you may meet.
The purpose of this chapter is therefore to acquaint you with some essential preliminaries for studying variation. We will start with definitions of some key terms, before looking into sources of data and considering the different types of data. Subsequently we shall look at basic methods of arranging data.
4.1 Some key words you need to know
There are several important terms that you will find mentioned frequently in this and subsequent chapters. They are:
Data: The word data is a plural noun (the singular form is datum), which means a set of known or given things, facts. Data can be numerical (e.g. wages of employees) or non-numerical (e.g. job titles of employees).
Variable: A variable is a quantity that varies, the opposite of a constant. For example, the number of telephone calls made to a call centre per hour is a variable, whereas the number of minutes in an hour is a constant. Often a capital letter, usually X or Y, is used to represent a variable.
Value: A value is a specific amount that a variable could be. For example, the number of telephone calls made to a call centre per hour could be 47 or 71. These are both possible values of the variable ‘number of calls made’.
Observation or Observed value: This is a value of a variable that has actually occurred, i.e. been counted or measured. For example, if 58 telephone calls are made to a call centre in a particular hour that is an observation or observed value of the variable ‘number of calls made’.
An observation is represented by the lower case of the letter used to represent the variable; for instance ‘x’ represents a single observed value of the variable ‘X’. A small numerical suffix is added to distinguish particular observations in a set; x₁ would represent the first observed value, x₂ the second and so on.
Data set: A data set consists of all the observations of all the variables collected in the course of a study or investigation, together with the variable names.
Random: This describes something that occurs in an unplanned way, by chance.
Random variable: A random variable has observed values that arise by chance. The number of new cars a car dealer sells during a month is a random variable, whereas the number of days in a month is a variable that is not random because its observed values are pre-determined.
Distribution: The pattern exhibited by the observed values of a variable when they are arranged in order of magnitude. A theoretical distribution is one that has been deduced, rather than compiled from observed values.
Population: Generally this means the total number of persons residing in a defined area at a given time. In quantitative methods a population is the complete set of things or elements we want to investigate. These may be human, such as all the people who have purchased a particular product, or inanimate, such as all the cars repaired at a garage.
Sample: A sample is a subset of a population, that is, a smaller number of items picked from the population. A random sample is a sample that has components chosen in a random way, on the basis that any single item in the population has no more or less chance than any other to be included in the sample.
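As a minimal sketch of the idea of a random sample, the Python fragment below draws 20 items from a population of 500 so that no item has more or less chance than any other of being chosen. The population of car identifiers, the sample size and the seed are all hypothetical choices made for illustration.

```python
import random

# Hypothetical population: identifiers for 500 cars repaired at a garage.
population = [f"car-{number:03d}" for number in range(1, 501)]

# A fixed seed is used only so that the sketch is repeatable.
rng = random.Random(42)

# sample() draws without replacement, so no car can be picked twice and
# every car has the same chance of being included.
sample = rng.sample(population, k=20)

print(len(sample), sample[:3])
```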
A typical quantitative investigation of a business problem might involve defining the population and specifying the variables to be studied. Following this a sample of elements from the population is selected and observations of the variables for each element in the sample recorded. Once the data set has been assembled work can begin on arranging and presenting the data so that the patterns of variation in the distributions of values can be examined.
At this point you may find it useful to try Review Question 4.1 at the end of the chapter.
4.2 Sources of data
The data that form the basis of an investigation might be collected at first hand in response to a specific problem. This type of data, collected by direct observation or measurement, is known as primary data.
The procedures used to gather primary data are surveys, experiments and observational methods. A survey might involve asking consumers their opinion of a product. A series of experiments might be conducted on products to assess their quality. Observation might be used to ascertain the hazards at building sites.
The advantages of using primary data are that they should match the requirements of those conducting the investigation and they are up-to-date. The disadvantages are that gathering such data is both costly and time-consuming.
An alternative might be to find data that have already been collected by someone else. These are known as secondary data. A company looking for data for a specific study will have access to internal sources of secondary data, but as well as those there are a large number of external sources: government statistical publications, company reports, academic and industry publications, and specialist information services such as the Economist Intelligence Unit. The advantages of using secondary data are that they are usually easier and cheaper to obtain. The disadvantages are that they could be out of date and may not be entirely suitable for the purposes of the investigation.
4.3 Types of data
Collecting data is usually not an end in itself. When collected the data will be in ‘raw’ form, a state that might lead someone to refer to it as ‘meaningless data’. Once it is collected the next stage is to begin transforming it into information, literally to enable it to inform us about the issue being investigated.
There is a wide range of techniques that you can use to organize, display and represent data. Selecting which ones to use depends on the type of data you have. The nature of the raw material you are working with determines your choice of tools. Scissors are fine for cutting paper but no good for cutting wood. A saw will cut wood but is useless for cutting paper. It is therefore essential that you understand the nature of the data you want to analyse before embarking on the analysis, so in this section we will look at several ways of distinguishing between different types of data.
There are different types of data because there are different ways in which facts are gathered. Some data may exist because specific things have characteristics that have been categorized whereas other data may exist as a result of things being counted, or measured, on some sort of scale.
Example 4.1
Holders of a certain type of investment account are described as ‘wealthy’. To investigate this we could use socio-economic definitions of class to categorize each account holder, or we could count the number of homes owned by each account holder, or we could measure the income of each account holder.
Perhaps the most important way of contrasting data types is on the basis of the scales of measurement used in obtaining them. The acronym NOIR stands for Nominal, Ordinal, Interval, Ratio: the four basic data types. Nominal is the ‘lowest’ form of data, which contains the least information. Ratio is the ‘highest’ form of data, which contains the most information.
The word nominal comes from the same Latin root as the word name. Nominal data are data that consist solely of names or labels. These labels might be numeric such as a bank account number, or they might be non-numeric such as gender. Nominal data can be categorized using the labels themselves to establish, for instance, the number of males and females. It is possible to represent and analyse nominal data using proportions and modes (the modal category is the one that contains the most observations), but carrying out more sophisticated analysis such as calculating an average is inappropriate; for example, adding a set of telephone numbers together and dividing by the number there are to get an average would be meaningless.
Like nominal data, ordinal or ‘order’ data consist of labels that can be used to categorize the data, but order data can also be ranked. Examples of ordinal data are academic grades and finishing positions in a horse race. An academic grade is a label (an ‘A’ grade student) that also belongs to a ranking system (‘A’ is better than ‘B’). Because ordinal data contain more information than nominal data we can use a wider variety of techniques to represent and analyse them. As well as proportions and modes we can also use order statistics, such as identifying the middle or median observation. However, any method involving arithmetic is not suitable for ordinal data because although the data can be ranked the intervals between the ranks are not consistent. For instance, the difference between the horse finishing first in a race and the one finishing second is one place. The difference between the horse finishing third and the one finishing fourth is also one place, but this does not mean that there is the same distance between the third- and fourth-placed horses as there is between the first- and second-placed horses.
Interval data consist of labels and can be ranked, but in addition the intervals are measured in fixed units so the differences between values have meaning. It follows from this that unlike nominal and ordinal, both of which can be either numeric or non-numeric, interval data are always numeric. Because interval data are based on a consistent numerical scale, techniques using arithmetical procedures can be applied to them. Temperatures measured in degrees Fahrenheit are interval data. The difference between 30° and 40° is the same as the difference between 80° and 90°.
What distinguishes interval data from the highest data form, ratio data, is that interval data are measured on a scale that does not have a meaningful zero point to ‘anchor’ it. The zero point is arbitrary, for instance 0° Fahrenheit does not mean a complete lack of heat, nor is it the same as 0° Celsius. The lack of a meaningful zero also means that ratios between the data are not consistent, for example 40° is not half as hot as 80°. (The Celsius equivalents of these temperatures are 4.4° and 26.7°, the same heat levels yet they have a completely different ratio between them.)
Ratio-type data have all the characteristics of interval data – they consist of labels that can be ranked as well as being measured in fixed amounts on a numerical scale. The difference is that the scale has a meaningful zero and ratios between observations are consistent. Distances are ratio data whether we measure them in miles or kilometres. Zero kilometres and zero miles mean the same – no distance. Ten miles is twice as far as five, and their kilometre equivalents, 16 and 8, have the same ratio between them.
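The difference between interval and ratio scales can be checked with a short calculation. The sketch below, written in Python purely for illustration, converts the temperatures and distances quoted above and compares the ratios before and after the change of unit. The function names are hypothetical.

```python
def fahrenheit_to_celsius(degrees_f):
    return (degrees_f - 32) * 5 / 9


def miles_to_km(miles):
    return miles * 1.609344  # length of the international mile in kilometres


# Interval scale: the ratio changes when the unit changes.
ratio_f = 40 / 80
ratio_c = fahrenheit_to_celsius(40) / fahrenheit_to_celsius(80)

# Ratio scale: the ratio survives the change of unit.
ratio_miles = 5 / 10
ratio_km = miles_to_km(5) / miles_to_km(10)

print(round(fahrenheit_to_celsius(40), 1), round(fahrenheit_to_celsius(80), 1))  # 4.4 26.7
print(ratio_f, round(ratio_c, 3))       # 0.5 0.167
print(ratio_miles, round(ratio_km, 3))  # 0.5 0.5
```

The Fahrenheit ratio of one half becomes roughly one sixth in Celsius, whereas the distance ratio is one half in either unit.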
Example 4.2
Identify the data types of the variables in Example 4.1.
The socio-economic classes of account holders are ordinal data because they are labels for the account holders and they can be ranked.
The numbers of homes owned by account holders and the incomes of account holders are both ratio data. Four homes are twice as many as two, and £60,000 is twice as much income as £30,000.
At this point you may find it useful to try Review Question 4.2 at the end of the chapter.
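To connect the scales of measurement with the summaries they permit, the following Python sketch picks a mode for nominal data, a median observation for ordinal data and a mean for ratio data. Every value in it is invented for illustration, and the variable names are hypothetical.

```python
from collections import Counter
from statistics import mean

# Nominal: labels only, so the modal category is an appropriate summary.
fuel_types = ["petrol", "diesel", "petrol", "petrol", "diesel"]
modal_fuel = Counter(fuel_types).most_common(1)[0][0]

# Ordinal: labels with a ranking, so the middle (median) observation also makes sense.
grade_order = ["A", "B", "C", "D", "E"]  # best to worst
grades = ["B", "A", "C", "B", "D", "C", "B"]
ranked = sorted(grades, key=grade_order.index)
median_grade = ranked[len(ranked) // 2]  # odd number of grades, so one middle value

# Ratio: a numerical scale with a meaningful zero, so arithmetic such as a mean is valid.
homes_owned = [1, 2, 1, 4, 2]
mean_homes = mean(homes_owned)

print(modal_fuel, median_grade, mean_homes)
```

Averaging the grade labels, by contrast, would require arithmetic that the ordinal scale does not support.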
Another important distinction you need to make is between qualitative data and quantitative data. Qualitative data consist of categories or types of a characteristic or attribute and are always either nominal or ordinal. The categories form the basis of the analysis of qualitative data. Quantitative data are based on counting ‘how many’ or measuring ‘how much’ and are always of interval or ratio type. The numerical scale used to produce the figures forms the basis of the analysis of quantitative data.
There are two different types of quantitative data: discrete and continuous. Discrete data are quantitative data that can take only a limited number of values because they are produced by counting in distinct or ‘discrete’ steps, or measuring against a scale made up of distinct steps.
There are three types of discrete data that you may come across. First, data that can only take certain values because other values simply cannot occur, for example the number of hats sold by a clothing retailer in a day. There could be 12 sold one day and 7 on another, but selling 9.3 hats in a day is not possible because there is no such thing as 0.3 of a hat. Such data are discrete by definition.
Secondly, data that take only certain values because those are the ones that have been established by long-standing custom and practice, for example bars in the UK sell draught beer in whole and half pints. You could try asking for three-quarters of a pint, but the bar staff would no doubt insist that you purchase the smaller or larger quantity. They simply would not have the equipment or pricing information to hand to do otherwise.
There are also data that only take certain values because the people who have provided the data or the analysis have decided, for convenience, to round values that do not have to be discrete. This is what you are doing when you give your age to the last full year. Similarly, the temperatures given in weather reports are rounded to the nearest degree, and the distances on road signs are usually rounded to the nearest mile. These data are discrete by convention rather than by definition. They are really continuous data.
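A small Python sketch shows how rounding turns continuous measurements into values that are discrete by convention. The two measurements are hypothetical.

```python
import math

# Hypothetical continuous measurements.
exact_age_years = 27.84      # age measured on a continuous time scale
exact_temperature_c = 18.37  # thermometer reading

# Age 'to the last full year' rounds down; weather reports round to the nearest degree.
reported_age = math.floor(exact_age_years)
reported_temperature = round(exact_temperature_c)

print(reported_age, reported_temperature)  # 27 18
```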
Discrete data often but not always consist of whole number values. The number of visitors to a website will always be a whole number, but shoe sizes include half sizes. In other cases, like the UK standard sizes of women’s clothing, only some whole numbers occur.
The important thing to remember about discrete data is that there are gaps between the values that can occur; that is why this type of data is sometimes referred to as discontinuous data. In contrast, continuous data consist of numerical values that are not restricted to specific numbers. Such data are called continuous because there are no gaps between feasible values. This is because measuring on a continuous scale such as distance or temperature yields continuous data.
The precision of continuous data is limited only by how precisely the quantities are measured. For instance, we measure both the length of bus journeys and athletic performances using the scale of time. In the first case a clock or a wristwatch is sufficiently accurate, but in the second case we would use a stopwatch or an even more sophisticated timing device.
The terms discrete variable and continuous variable are used in describing data sets. A discrete variable has discrete values whereas a continuous variable has continuous values.
At this point you may find it useful to try Review Questions 4.3 and 4.4 at the end of the chapter.
In most of your early work on analysing variation you will probably be using data that consist of observed values of a single variable. However, you may need to analyse data that consist of observed values of two variables in order to find out if there is a connection between them. For instance, we might want to ascertain how cab fares are related to journey times.
In dealing with a single variable we apply univariate analysis, whereas in dealing with two variables we apply bivariate analysis. The prefixes uni- and bi- in these words convey the same meanings as they do in other words like unilateral and bilateral. You may also find reference to multivariate analysis, which involves exploring relationships between more than two variables.
Example 4.3
A motoring magazine describes cars using the following variables:
Type of vehicle – Hatchback/Estate/MPV/Off-Road/Performance
Number of passengers that can be carried
Fuel type – petrol/diesel
Fuel efficiency in miles per gallon
Which variables are qualitative and which quantitative?
The type of vehicle and fuel type are qualitative; the number of passengers and the fuel efficiency are quantitative.
Which quantitative variables are discrete and which continuous?
The number of passengers is discrete; the fuel efficiency is continuous.
You may come across data referred to as either hard or soft. Hard data are facts, measurements or characteristics arising from situations that actually exist or were in existence. Temperatures recorded at a weather station and the nationalities of tourists are examples of hard data. Soft data are about beliefs, attitudes and behaviours. Asking consumers what they know about a product or how they feel about an advertisement will yield soft data. The implication of this distinction is that hard data can be subjected to a wider range of quantitative analysis. Soft data are at best ordinal and therefore offer less scope for quantitative analysis.
A further distinction you need to know is between cross-section and time series data. Cross-section data are data collected at the same point in time or based on the same period of time. Time series data consist of observations collected at regular intervals over time. The volumes of wine produced in European countries in 2002 are cross-section data whereas the volumes of wine produced in Italy in the years 1992 to 2002 are time series data.
At this point you may find it useful to try Review Question 4.5 at the end of the chapter.
4.4 Arrangement of data
Arranging or classifying data in some sort of systematic manner is the vital first stage you should take in transforming the data into information, and hence getting it to ‘talk to you’. The way you approach this depends on the type of data you wish to analyse.
4.4.1 Arranging qualitative data
Dealing with qualitative data is quite straightforward as long as the number of categories of the characteristic being studied is relatively small. Even if there are a large number of categories, the task can be made easier by merging categories.
The most basic way you can present a set of qualitative data is to tabulate it, to arrange it in the form of a summary table. A summary table consists of two parts, a list of categories of the characteristic, and the number of things that fall into each category, known as the frequency of the category. Compiling such a table is simply a matter of counting how many elements in the study fall into each category.
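Compiling a summary table amounts to counting, which the Python sketch below does for a list of outlet types. Only the figure of 12 shoe shops out of 39 outlets is taken from this section; the other categories and their counts are invented so that the sketch has something to count.

```python
from collections import Counter

# Hypothetical list of outlet types, one entry per outlet. Only the 12 shoe
# shops out of 39 outlets come from the text; the other counts are invented.
outlets = (["Shoe shop"] * 12 + ["Sports shop"] * 11
           + ["Department store"] * 9 + ["Other"] * 7)

frequencies = Counter(outlets)      # category -> frequency
total = sum(frequencies.values())

print(f"{'Type of outlet':<18}{'Frequency':>10}{'Relative (%)':>14}")
for category, frequency in frequencies.items():
    relative = 100 * frequency / total
    print(f"{category:<18}{frequency:>10}{relative:>14.1f}")
print(f"{'Total':<18}{total:>10}{100.0:>14.1f}")
```

Each frequency is also expressed as a percentage of the total, which is often the easier figure to communicate.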
Example 4.4
Suppose we want to find how many different types of retail outlet in an area sell trainers. We could tour the area or consult the telephone directory in order to compile a list of outlets, but the list itself may be too crude a form in which to present our results. By listing the types of outlet and the number of each type of outlet we find we can construct a summary table (Table 4.1).
In Table 4.1 the outlet types are qualitative data. The ‘Other’ category, which might contain several different types of outlet, such as hypermarkets and market stalls, has been created in order to keep the summary table to manageable proportions.
Notice that for each category, the number of outlets as a percentage of the total, the relative frequency of the category, is listed on the right hand side. This is to make it easier to communicate the contents; saying 30.8% of the outlets are shoe shops is more effective than saying 12/39ths of them were shoe shops, although they are different ways of saying the same thing.
You may want to use a summary table to present more than one attribute. Such a two-way tabulation is also known as a contingency table because it enables us to look for connections between the attributes, in other words to find out whether one attribute is contingent upon another.
Example 4.5
Four large retailers each operate their own loyalty scheme. Customers can apply for loyalty cards and receive points when they present them whilst making purchases. These points are accumulated and can subsequently be used to obtain gifts or discounts.
A survey of usage levels of loyalty cards provided the information in the following table:
Table 4.2 Number of transactions by loyalty card use
Retailer     With card   Without card   Total
Aptyeka            236            705     941
Botinky            294            439     733
Crassivy           145            759     904
Total              675           1903    2578
At this point you may find it useful to try Review Questions 4.6 to 4.8 at the end of the chapter.
4.4.2 Arranging quantitative data
The nature of quantitative data is different to qualitative data and therefore the methods used to arrange quantitative data are rather different. However, the most appropriate way of arranging some quantitative data is the same as the approach we have used to arrange qualitative data.
This applies to the analysis of a discrete quantitative variable that has a very few feasible values. You simply treat the values as you would the categories of a characteristic and tabulate the data to show how often each value occurs. When quantitative data are tabulated, the resulting table is called a frequency distribution because it demonstrates how frequently each value in the distribution occurs.
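A frequency distribution for a discrete variable with only a few feasible values can be sketched as follows. The variable is the number of refills taken per customer, as in the department store example in this section, but the 20 values below are invented for illustration and are not the book's data.

```python
from collections import Counter

# Hypothetical numbers of refills taken by 20 customers (invented values).
refills = [0, 1, 0, 2, 1, 0, 0, 3, 1, 0, 1, 2, 0, 1, 0, 0, 2, 1, 0, 1]

# Treat each value like a category and count how often it occurs.
frequency = Counter(refills)

print("Refills  Customers")
for value in sorted(frequency):
    print(f"{value:>7}  {frequency[value]:>9}")
print(f"{'Total':>7}  {sum(frequency.values()):>9}")
```

Listing the values in order of magnitude is what distinguishes this from a summary table of unordered categories.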
Example 4.6
The UREA department store offers free refills when customers purchase hot beverages in its café. The numbers of refills taken by 20 customers were:
These figures can be tabulated as follows:
Table 4.3 Number of hot beverage refills taken
At this point you may find it useful to try Review Questions 4.9 to 4.11 at the end of the chapter.
We can present the data in Example 4.6 in the form of a simple table because there are only a very limited number of values. Unfortunately this is not always the case, even with discrete quantitative data.
For instance, if Example 4.6 included customers who spent all day in the café and drank 20 or so cups of coffee each then the number of refills might go from none to 30. This would result in a table with far too many rows to be of use.
To get around this problem we can group the data into fewer categories or classes by compiling a grouped frequency distribution. This shows the frequency of observations in each class.
In order to compile a grouped frequency distribution you will need to exercise a little judgement because there are many sets of classes that could be used for a specific set of data. To help you, there are three rules:
1 Don’t use classes that overlap
2 Don’t leave gaps between classes
3 The first class must begin low enough to include the lowest observation and the last class must finish high enough to include the highest observation
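The three rules can be turned into a simple check. The Python sketch below assumes a discrete whole-number variable and classes written as inclusive (lower, upper) pairs; the function name, the candidate classes and the observations are all hypothetical.

```python
def check_classes(classes, observations):
    """Return a list of rule violations for inclusive (lower, upper) classes.

    Assumes a discrete whole-number variable, so consecutive classes such as
    (0, 19) and (20, 39) leave no feasible value uncovered.
    """
    problems = []
    ordered = sorted(classes)
    for (low_a, high_a), (low_b, high_b) in zip(ordered, ordered[1:]):
        if low_b <= high_a:        # rule 1: no overlapping classes
            problems.append(f"classes {low_a}-{high_a} and {low_b}-{high_b} overlap")
        elif low_b > high_a + 1:   # rule 2: no gaps between classes
            problems.append(f"gap between {high_a} and {low_b}")
    if min(observations) < ordered[0][0]:   # rule 3: lowest observation covered
        problems.append("first class starts above the lowest observation")
    if max(observations) > ordered[-1][1]:  # rule 3: highest observation covered
        problems.append("last class ends below the highest observation")
    return problems


observations = [0, 17, 20, 45, 99]  # hypothetical observed values
good = [(0, 19), (20, 39), (40, 59), (60, 79), (80, 99)]
bad = [(0, 20), (20, 40), (40, 60), (60, 80), (80, 100)]

print(check_classes(good, observations))  # no violations
print(check_classes(bad, observations))   # each shared boundary is reported as an overlap
```

An empty list means every feasible value has one and only one class to fall into.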
In Example 4.7 it would be wrong to use the classes 0–20, 20–40, 40–60 and so on because a value on the very edge of the classes like 20 could be put into either one, or even both, of two classes. Although there are numerical gaps between the classes that have been used in Example 4.7, they are not real gaps because no feasible value could fall into them. The first class finishes on 19 and the second begins on 20, but since the number of messages received is a discrete variable a value like 19.6, which would fall into the gap, simply will not occur. Since there are no observed values lower than zero or higher than 99, the third rule is satisfied.
We could sum up these rules by saying that anyone looking at a grouped frequency distribution should be in no doubt where each feasible value belongs. Every piece of data must have one and only one place for it to be. To avoid any ambiguity whatsoever, you may like to use the phrase ‘and under’ between the beginning and end of each class. The classes in Example 4.7 could be rewritten as:
0 and under 20
20 and under 40 … and so on
It is especially important to apply these rules when you are dealing with continuous quantitative data. Unless you decide to use ‘and under’ or a similar style of words, it is vital that the beginning and end of each class is specified to at least the same degree of precision as the data.
Example 4.8
The results of measuring the contents (in millilitres) of a sample of 30 bottles of ‘Nogat’ nail polish labelled as containing 10 ml were:
10.30 10.05 10.06 9.82 10.09 9.85 9.98 9.97 10.28 10.01
9.92 10.03 10.17 9.95 10.23 9.92 10.05 10.11 10.02 10.06
10.21 10.04 10.12 9.99 10.19 9.89 10.05 10.11 10.00 9.92
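One way these measurements might be grouped is sketched below in Python, using ‘and under’ classes. The class width of 0.10 ml and the starting point of 9.80 ml are assumptions made for this illustration, not classes prescribed by the text.

```python
contents_ml = [
    10.30, 10.05, 10.06, 9.82, 10.09, 9.85, 9.98, 9.97, 10.28, 10.01,
    9.92, 10.03, 10.17, 9.95, 10.23, 9.92, 10.05, 10.11, 10.02, 10.06,
    10.21, 10.04, 10.12, 9.99, 10.19, 9.89, 10.05, 10.11, 10.00, 9.92,
]

# Assumed classes: 0.10 ml wide, starting at 9.80 ml. Boundaries are held as
# whole hundredths of a millilitre to avoid floating-point edge effects.
START, WIDTH, CLASSES = 980, 10, 6

counts = [0] * CLASSES
for value in contents_ml:
    hundredths = round(value * 100)
    counts[(hundredths - START) // WIDTH] += 1  # a boundary value joins the higher class

for index, count in enumerate(counts):
    lower = (START + index * WIDTH) / 100
    upper = lower + WIDTH / 100
    print(f"{lower:.2f} and under {upper:.2f}: {count}")
```

Because each class runs from its lower limit up to but not including its upper limit, a value such as 10.00 has exactly one class to fall into.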