NATIONAL ECONOMICS UNIVERSITY
MID-TERM EXAMINATION
BUSINESS STATISTICS
Group 4: 11210447 Mai Le Chau Anh
11215591 Nguyen Quynh Anh
11210899 Nguyen Ba Gia Bach
11211907 Nguyen Ngan Ha
11214285 Dinh Bao Ngoc
11215555 Ta Thi Minh Thu
Class: Advanced International Business Administration 63B
Lecturer: Assoc. Prof. Tran Thi Bich
TABLE OF CONTENTS
PART I: QUESTIONS
PART II: ANSWERS
Question 1
1.1
1.2
1.3
1.4
Question 2
2.1
2.2
2.3
2.4
2.5
Question 3
I INTRODUCTION
II DATA DESCRIPTION
III DESCRIPTIVE STATISTICS
IV ANALYTICAL TECHNIQUES
V FINDINGS AND INSIGHTS
VI CONCLUSION
PART I: QUESTIONS
Question 1: Consider the variable income in the gss.sav file (the variable is total family income in the year before the survey).
1 Make a frequency table for the variable. Does the frequency table make sense? Does it make sense to make a histogram of the variable? A bar chart?
2 What is the scale of measurement for the variable?
3 What descriptive statistics are appropriate for describing this variable and why? Does it make sense to compute a mean?
4 Discuss the advantages and disadvantages of recording income in this manner. Describe other ways of recording income and the problems associated with each of them.
Question 2: In the gss.sav file, the variable tvhours tells you how many hours per day GSS respondents say they watch TV.
1 Make a frequency table of the hours of television watched. Do any of the values strike you as strange? Explain.
2 Based on the frequency table, answer the following questions: Of the people who answered the question, what percentage don’t watch any television? What percentage watch two hours or less? Five hours or more? Of the people who watch TV, what percentage watch one hour? What percentage watch four hours or less?
3 From the frequency table, estimate the 25th, 50th, 75th, and 95th percentiles. What is the value for the Median? The Mode?
4 Make a bar chart of the hours of TV watched. What problem do you see with this display?
5 Make a histogram of the hours of TV watched. What causes all of the values to be clumped together? Compare this histogram to the bar chart you generated in question 2.4. Which is a better display for these data?
Question 3:
Find a data set which is related to a specific organizational problem (either at the macro or micro level) and apply all possible descriptive statistical techniques that you think suitable to the problem. Write a short report, which includes the objectives of your analysis, the research questions and your findings. The maximum length of the report is 5 pages including Tables and Figures.
PART II: ANSWERS
Question 1
1.1.
=> This frequency table makes sense because it is a Grouped Frequency Table: there are so many values in the income data that we need a grouped table to describe the pay groups accurately. Furthermore, the frequencies, percentages, and cumulative percentages reflect the family income categories in general.
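As a rough illustration of what such a grouped frequency table contains, the sketch below computes frequency, valid percent, and cumulative percent for ordered categories. The band labels and the answers are hypothetical placeholders, not the actual categories or data in gss.sav.

```python
from collections import Counter

def frequency_table(values, categories):
    """Return rows of (category, frequency, valid percent, cumulative percent).

    `categories` fixes the row order so that ordered bands stay ordered.
    Missing answers (None) are excluded, as in a "valid percent" column.
    """
    valid = [v for v in values if v is not None]
    counts = Counter(valid)
    total = len(valid)
    rows, cumulative = [], 0
    for category in categories:
        freq = counts.get(category, 0)
        cumulative += freq
        rows.append((category, freq,
                     round(100 * freq / total, 1),
                     round(100 * cumulative / total, 1)))
    return rows

# Hypothetical income bands and answers, for illustration only.
bands = ["low", "middle", "high"]
answers = ["middle", "low", "middle", None, "high"]
table = frequency_table(answers, bands)
for row in table:
    print(row)
```

The cumulative percent column is only meaningful because the bands have a natural order, which is the property the answer to 1.2 relies on.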
A histogram is the most common graph used to display frequency distributions. The intervals on the histogram's X-axis represent the scale of values within which the measurements fall, while the Y-axis represents the number of times the values occurred inside the intervals. An equal-width histogram of the income variable is achievable, since family income is normally continuous data with the same class intervals, but it is not recommended in this circumstance. The missing values are not shown clearly because they are contained within the non-missing values column, which might lead to misconceptions. As a result, using a histogram for this variable makes no sense.
BAR CHART
A bar chart is a feasible alternative when displaying a distribution of data points or comparing metric values. Using the bar chart, we can observe which group has the greatest number and how the groups compare to one another. Despite the many values on the X-axis, we can readily detect the trend and draw a conclusion from this bar chart. So, in this case, using a bar chart makes sense.
1.2.
The variable income has an ordinal scale of measurement, since it has been separated into various categories that are not measured but only labeled. Furthermore, the scale of measurement listed for income in the gss.sav file is ordinal.
The yellow row is the variable “income”
1.3.
- There are 4 types of descriptive statistics:
Measures of Frequency
- Count, Percent, Frequency
- Displays how frequently something occurs
- Use this to display how often a response is delivered
Measures of Central Tendency
- Mean, Median, and Mode
- Locates the distribution by various points
- Use this when you want to show the average or most commonly indicated response
Measures of Dispersion or Variation
- Range, Variance, Standard Deviation
- Identifies the spread of scores by stating intervals
- Range = High/Low points
- Variance or Standard Deviation = difference between observed score and mean
- Use this when you want to show how "spread out" the data are. It is helpful to know when your data are so spread out that it affects the mean
Measures of Position
- Percentile Ranks, Quartile Ranks
- Describes how scores fall in relation to one another. Relies on standardized scores
- Use this when you need to compare scores to a normalized score (e.g., a national norm)
=> The Measure of Central Tendency (Median and Mode) is most suited for defining this variable since it is closer to our goal of determining the most often reported response
In this case, computing a Mean makes no sense, and there are several reasons to compute the Median and Mode rather than the Mean:
● We have no information on the actual values in the open-ended ranges (less than $1,000 and more than $110,000), where extreme values may occur
● The histogram (drawn in part 1.1) is strongly skewed: the Coefficient of Skewness is smaller than 0 (negative), which indicates a longer left tail. For skewed distributions, the mean is a poor descriptive statistic
● Because this frequency table has 65 missing values, using Mean may produce erroneous results
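The choice of Median and Mode for an ordinal variable can be sketched as follows. The category codes below are hypothetical (1 = lowest income band), and `None` stands for a missing answer; this is only an illustration, not the SPSS procedure used in the report.

```python
from collections import Counter

def ordinal_mode_and_median(codes):
    """Return (mode, median) for ordinal category codes, ignoring missing values.

    The median is taken as the lower middle observation, so the result is
    always an existing category rather than an average of two codes.
    """
    valid = sorted(c for c in codes if c is not None)
    mode = Counter(valid).most_common(1)[0][0]
    median = valid[(len(valid) - 1) // 2]
    return mode, median

# Hypothetical income category codes, for illustration only.
codes = [1, 2, 2, 3, 3, 3, 4, None, 4, 5]
mode, median = ordinal_mode_and_median(codes)
print(mode, median)
```

Both results name a category that respondents actually chose, whereas a mean of the codes would depend on the arbitrary numbering of the bands.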
1.4.
Advantages:
● It may help in determining the form and spread of income distribution
● It may help in determining the most common or usual number or range, known as the mode
● It might be useful for comparing income data across several categories, such as gender, age, or occupation
● It can assist in identifying outliers or extreme income numbers that are much higher or lower than the rest of the data
Question 2:
2.1. As can be seen from the frequency table below, the figure that stands out is 12, which corresponds to 12 hours a day spent watching television. The number of individuals who watch TV from 0 to 10 hours is fairly high, but the frequencies fall sharply from the value 11 onward. Among the values ranging from 11 to 24, only 12 has an unusually high frequency, which strikes us as strange.
Hours per day watching TV
2.2.
Of the people who answered the question:
● 6% of the people don’t watch any television
● 53.1% of the people watch TV for two hours or less
● 16.6% of the people watch TV for five hours or more (100% - (6% + 20.9% + 26.3% + 17.5% + 12.7%) = 16.6%)
● 83.4% is the total valid percent of people who watch TV from 0 to 4 hours (6% + 20.9% + 26.3% + 17.5% + 12.7% = 83.4%)
Of the people who watch TV (which means the value 0 is excluded):
● 82.28% watch TV for four hours or less (701 / 852 x 100% = 82.28%)
● 852 is the total number of people who watch TV (906 - 54 = 852)
● 701 is the total number of people who watch TV from 1 to 4 hours per day (189 + 238 + 159 + 115 = 701)
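The arithmetic above can be re-checked directly from the counts quoted in this answer (54, 189, 238, 159, and 115 respondents for 0 to 4 hours, out of 906 valid answers). The sketch below is only a verification aid; the variable and function names are our own.

```python
# Frequencies quoted in the answer above: hours per day -> respondents.
counts_0_to_4 = {0: 54, 1: 189, 2: 238, 3: 159, 4: 115}
valid_total = 906  # respondents who answered the question

def percent(part, whole):
    """Percentage of `whole` represented by `part`, to two decimals."""
    return round(100 * part / whole, 2)

# Shares among everyone who answered.
watch_none = percent(counts_0_to_4[0], valid_total)
two_or_less = percent(sum(counts_0_to_4[h] for h in (0, 1, 2)), valid_total)
four_or_less = percent(sum(counts_0_to_4.values()), valid_total)
five_or_more = round(100 - four_or_less, 2)

# Shares among people who watch TV at all (value 0 excluded).
viewers = valid_total - counts_0_to_4[0]
one_to_four = sum(counts_0_to_4[h] for h in (1, 2, 3, 4))
viewers_one_hour = percent(counts_0_to_4[1], viewers)
viewers_four_or_less = percent(one_to_four, viewers)

print(watch_none, two_or_less, four_or_less, five_or_more)
print(viewers, one_to_four, viewers_one_hour, viewers_four_or_less)
```

Working from the raw counts gives 83.33% for four hours or less, slightly below the 83.4% obtained by adding the already rounded percentages, so small differences of this kind are rounding effects rather than errors in the table.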
2.3.
[SPSS output table "Statistics: Hours per day watching TV": N (Valid, Missing), Median, Mode, and Percentiles (25, 50, 75, 95)]
In a data distribution, a percentile is the number below which a specified proportion of values falls. In SPSS, there are several methods for calculating percentiles, as well as several equations. Our group will compute the 25th, 50th, 75th, and 95th percentiles for the variable TV hours. We'll select the Frequencies option, which uses a weighted average algorithm to determine percentiles (as displayed in the SPSS data view above).
As can be seen from the results which appear in the SPSS output view:
● The value for the 25th percentile is 1.00
● The value for the 50th percentile is 2.00
● The value for the 75th percentile is 4.00
● The value for the 95th percentile is 8.00
● The value for the Median is 2.00
● The value for the Mode is 2
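The percentiles can also be estimated by hand from cumulative frequencies. The sketch below uses the simple rule "smallest value whose cumulative percent reaches p", which is an assumption of ours and not the weighted average algorithm of SPSS. Only the counts for 0 to 4 hours come from this report; the remaining 151 of the 906 valid answers are lumped under the key 5, meaning "5 hours or more", so the 95th percentile cannot be recovered from these figures.

```python
def percentile_from_counts(counts, p):
    """Smallest value whose cumulative percent is at least p (0 < p <= 100)."""
    total = sum(counts.values())
    cumulative = 0
    for value in sorted(counts):
        cumulative += counts[value]
        if 100 * cumulative / total >= p:
            return value
    raise ValueError("p must be between 0 and 100")

# Counts for 0-4 hours as quoted in 2.2; key 5 lumps "5 hours or more".
counts = {0: 54, 1: 189, 2: 238, 3: 159, 4: 115, 5: 906 - 755}

p25 = percentile_from_counts(counts, 25)
p50 = percentile_from_counts(counts, 50)  # the median
p75 = percentile_from_counts(counts, 75)
mode = max(counts, key=counts.get)        # most frequent value
print(p25, p50, p75, mode)
```

With these counts the rule gives 1, 2, and 4 hours for the 25th, 50th, and 75th percentiles and a mode of 2, which agrees with the SPSS values listed above.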
2.4.
BAR CHART
There are a few problems with this bar chart:
● A few outliers with low frequencies exist, spread across a large number of discrete values
● Some values (9, 13, 16, 17, 18, 19, 21, 22, 23) are missing since they do not occur in the survey answers. The bar chart below lacks gaps that reflect these uncollected data (missing values), which might mislead readers at first look
● The real form of the distribution is difficult to determine because the frequencies for the higher values are low
2.5.
This dataset is POSITIVELY SKEWED (most values are clustered around the left tail of the distribution, while the right tail is longer), so the values in the histogram are clumped together. This indicates that most of the survey respondents watch television for between one and four hours, with only a small percentage watching it for more than 10 hours.
In this situation, we believe a histogram performs better than a bar chart. As we mentioned in paragraph (2.4), the bar chart DOES NOT show the gaps that represent the uncollected data, whereas the histogram does. Therefore, since the histogram can both "show the distributions of the values of data collected" and "show a gap to represent these uncollected data," it is a superior way to display the data.
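The difference between the two displays comes down to their X-axes: a bar chart has one position per observed category, while a histogram covers the whole numeric range. The sketch below makes the gaps explicit. The list of observed values is inferred from the missing values named in 2.4, assuming the answers range from 0 to 24 hours.

```python
def unobserved_values(observed, lowest, highest):
    """Return every integer in [lowest, highest] that never occurs in `observed`."""
    seen = set(observed)
    return [v for v in range(lowest, highest + 1) if v not in seen]

# Observed hours, inferred from 2.4 (assumption: answers range from 0 to 24).
observed_hours = list(range(0, 9)) + [10, 11, 12, 14, 15, 20, 24]

bar_chart_axis = sorted(set(observed_hours))  # categories only, no gaps drawn
histogram_axis = list(range(0, 25))           # full numeric scale
gaps = unobserved_values(observed_hours, 0, 24)
print(len(bar_chart_axis), len(histogram_axis), gaps)
```

The nine values returned in `gaps` are exactly the positions a bar chart silently skips and a histogram leaves empty.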
Question 3:
I INTRODUCTION
This study has two core research questions:
1.1 Evaluating Perceptions of HATCO: "How do customers perceive HATCO across various attributes, including delivery speed, pricing, flexibility in negotiations, manufacturer's image, service quality, salesforce image, and product quality? What are the strengths and areas in need of improvement according to customer ratings?"
1.2 Examining Purchase Outcomes: "What are the outcomes of customer interactions with HATCO in terms of usage levels and satisfaction levels? How does this data inform HATCO's market share within its customer base and overall customer satisfaction?"
II DATA DESCRIPTION
2.1 Dataset Origin:
The dataset used in this study was acquired from the fictitious Hair, Anderson, and Tatham Company (HATCO), an industrial supplier created solely for research purposes.
2.2 Dataset Structure:
This dataset comprises 100 data points, with each data point associated with 14 variables.
These variables can be grouped into three primary categories:
2.2.1 HATCO Perceptions (Variables X1 to X7):
These attributes encompass the speed of product delivery (X1), perceived pricing level (X2), willingness to negotiate prices (X3), the overall image of the manufacturer (X4), service quality (X5), the image of HATCO's salesforce (X6), and product quality (X7).
2.2.2 Purchase Outcomes (Variables X9 and X10):
Two variables capture the outcomes of customer interactions with HATCO:
- X9, "Usage level," quantifies the percentage of a firm's total product purchases made from HATCO, with values ranging from 0 to 100 percent on a 100-point scale
- X10, "Satisfaction level," assesses customer satisfaction with prior purchases from HATCO using a visual rating scale, similar to the one applied to measure perceptions (X1 to X7)
III DESCRIPTIVE STATISTICS
3.1 Perceptions of HATCO
3.1.1 Measures of central tendency (mean, median, mode) and dispersion (range, variance,
standard deviation) for variables X1 to X7:
[SPSS output table "Statistics" for Delivery Speed, Price Level, Price Flexibility, Manufacturer Image, Service, Salesforce Image, and Product Quality]
From the result below, we visualize the distribution shapes:
The coefficients of skewness of Delivery Speed, Price Level, Manufacturer Image and Salesforce Image show a similar shape: the distribution is slightly skewed to the right. This means that the right tail of the distribution is slightly longer than the left tail. A distribution that is slightly skewed to the right is still relatively symmetrical.
Meanwhile, the coefficients of skewness of Price Flexibility, Service and Product Quality show a similar shape: the distribution is slightly skewed to the left. This means that the left tail of the distribution is slightly longer than the right tail. A distribution that is slightly skewed to the left is still relatively symmetrical. This skewness can be seen in the histograms.
3.1.3 Calculate percentiles to understand the distribution of responses for each attribute:
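[SPSS output table "Statistics" (percentiles) for Delivery Speed, Price Level, Price Flexibility, Manufacturer Image, Service, Salesforce Image, and Product Quality]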
3.2.1 Measures of central tendency (mean, median) and dispersion (variance, standard
deviation) for variables X9 (Usage level) and X10 (Satisfaction level):
From the result below, we visualize the distribution shapes:
The coefficient of skewness of Usage Level (-0.13) indicates that the distribution is slightly skewed to the left. This means that the left tail of the distribution is slightly longer than the right tail, so it is a negative skew. The distribution is not perfectly symmetrical, but it is far from being severely skewed to the left. Similarly, the coefficient of skewness of Satisfaction Level (-0.28) indicates a slight skew to the left as well.
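For readers who want to see how the sign of a coefficient of skewness arises, the sketch below computes the adjusted Fisher-Pearson sample skewness. We assume, without having verified it against the SPSS output, that this is the same statistic reported above. The three small data sets are hypothetical and are not HATCO values.

```python
import math

def sample_skewness(data):
    """Adjusted Fisher-Pearson skewness: n/((n-1)(n-2)) * sum(((x - mean)/s)^3)."""
    n = len(data)
    mean = sum(data) / n
    # Sample standard deviation (divisor n - 1).
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in data)

# Hypothetical data sets, for illustration only.
symmetric = [1, 2, 3, 4, 5]
left_tailed = [1, 5, 6, 6, 7]    # one low value stretches the left tail
right_tailed = [1, 1, 2, 2, 10]  # one high value stretches the right tail

print(sample_skewness(symmetric))
print(sample_skewness(left_tailed))
print(sample_skewness(right_tailed))
```

A negative result corresponds to a longer left tail and a positive result to a longer right tail, which is the reading applied to -0.13 and -0.28 above.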
3.2.3 Analyze the relationship between satisfaction (X10) and perceptions (X1 to X7) using
scatterplots or correlation coefficients:
** Correlation is significant at the 0.01 level (2-tailed).
Overall, the Pearson correlation technique suggests that there is a very strong positive correlation between Product Quality and Usage Level, between Satisfaction Level and Price Level, and between Satisfaction Level and Delivery Speed. This means that as Product Quality increases, Usage Level also tends to increase, and likewise for the other pairs. However, the result also shows that the relationship between Product Quality and Usage Level may not be perfectly linear. In other words, there may be some Product Quality outliers, or there may be some Usage Level values that are associated with much higher or lower Satisfaction levels than expected.
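The Pearson coefficient itself is straightforward to compute. The following sketch shows the calculation on hypothetical paired ratings; the numbers are placeholders and are not taken from the HATCO data set or from the SPSS correlation table.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical paired observations, for illustration only.
perception = [1.0, 2.0, 3.0, 4.0]
satisfaction = [2.0, 4.0, 6.0, 8.0]

r_up = pearson_r(perception, satisfaction)          # perfectly increasing line
r_down = pearson_r(perception, satisfaction[::-1])  # perfectly decreasing line
print(r_up, r_down)
```

Values near +1 or -1 indicate a nearly linear relationship, and the sign gives the direction. A coefficient can be lowered by outliers or by a curved relationship, which is why scatterplots are suggested alongside the coefficients.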
IV ANALYTICAL TECHNIQUES
4.1 Descriptive statistics
- Depict calculated measures of central tendency (mean, median, mode) and measures of
dispersion (range, variance, standard deviation) for variables X1 to X7
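As a minimal sketch of these measures, the snippet below uses Python's standard `statistics` module on a hypothetical list of ratings. The report itself used SPSS, and the values here are illustrative only.

```python
import statistics

def describe(values):
    """Central tendency and dispersion measures for one variable."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "range": max(values) - min(values),
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "std_dev": statistics.stdev(values),      # sample standard deviation
    }

# Hypothetical ratings for a single attribute, for illustration only.
ratings = [2, 4, 4, 5, 7, 8]
summary = describe(ratings)
for name, value in summary.items():
    print(name, value)
```

The same function could be applied to each of X1 to X7 in turn to build a summary table of the kind described here.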
4.2 Histogram analysis
- Visualize the distribution shapes of perceptions for each attribute (X1 to X7)
4.3 Correlation Analysis
- Calculate correlation coefficients (e.g., Pearson) to assess relationships between continuous
variables (e.g., between X1 to X7, X9, and X10)
- Interpret the strength and direction of correlations between variables
V FINDINGS AND INSIGHTS
5.1 Perceptions of HATCO
5.1.1 Summarize the central tendencies and distributions of perceptions:
The analysis reveals varying customer perceptions of HATCO across different attributes. Delivery Speed (X1) receives a moderately centered rating, indicating room for consistency improvement. Price Level (X2) is perceived as relatively lower on the scale, suggesting a need for pricing strategy enhancement. Price Flexibility (X3) garners a moderately positive perception. Manufacturer Image (X4) is positively rated with moderate consistency. Service (X5) shows consistent but lower ratings. Salesforce Image (X6) indicates a somewhat positive perception, while Product Quality (X7) is rated moderately positively with some variability.
5.1.2 Identify attributes that receive the highest and lowest ratings: