HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY SCHOOL OF ECONOMICS AND MANAGEMENT
GROUP ASSIGNMENT APPLIED STATISTICS IN BUSINESS
NGUYEN PHUONG ANH
HANOI, June 2021
Lecturer’s Signature
We affirm that this is our own research report. All the data and figures in the report are from our own study and are cited fully from known sources. We do not copy from any documents and do not violate the regulations on plagiarism.
The success and final outcome of this project required a lot of guidance and assistance from many people, and we are extremely fortunate to have received this throughout the completion of our project work. Whatever we have done is only due to such guidance and assistance, and we will not forget to thank them.
We respect and thank Mr. Nguyen Tien Dung for giving us the opportunity to do the project work in Applied Statistics and Experimental Design. We are extremely grateful to him for his excellent lectures in every online class on Microsoft Teams.
TABLE OF CONTENTS
List of Figures
List of Tables
Executive Summary
Section 1 Descriptive Statistics with Tabular and Graphical Displays
1.1 Question 1
1.2 Question 2
1.3 Question 3
1.4 Question 4
1.5 Question 5
Section 2 Descriptive Statistics with Numerical Measures
2.1 Question 1
2.2 Question 2
2.3 Question 3
2.4 Question 4
2.5 Question 5
Section 3 Hypothesis Tests
3.1 Question 1
3.2 Question 2
3.3 Question 3
3.4 Question 4
3.5 Question 5
Section 4 Experimental Design and ANOVA
4.1 Question 1
4.2 Question 2
4.3 Question 3
4.4 Question 4
4.5 Question 5
Section 5 Statistical Analysis with Real Data
5.1 Data Description
5.2 Analysis Objectives
5.3 Data Analysis and Interpretation
5.4 Concluding Remarks
References
Appendices
LIST OF FIGURES
Figure 1.1 xxxx
Figure 2.1 xxxx
Figure 2.2 xxxx
LIST OF TABLES
Table 1.1 xxxx
Table 2.1 xxxx
Table 2.2 xxxx
SECTION 1 DESCRIPTIVE STATISTICS WITH TABULAR AND GRAPHICAL DISPLAYS
1.1 Question 1
A frequency distribution is a tabular summary of data showing the number (frequency) of observations in each of several nonoverlapping categories or classes.
A percent frequency distribution summarizes the percent frequency of the data for each class.
Thus, to help management develop a customer profile, we first construct the percent frequency distributions for the key variables, which in this case are type of customer, items, net sales, method of payment, gender, marital status and age group.
1.1.1 Percent frequency distribution for Type of Customers
Table 1.1 Percent frequency distribution for Type of Customers
From the table above, we can see that in the sample of 100 customers, there are 70 promotional customers and 30 regular customers.
1.1.2 Percent frequency distribution for Items
Table 1.6 Percent frequency distribution for Age Group
The sum of the frequencies in a frequency distribution is 100, which equals the number of observations. In addition, the sum of the percentages in a percent frequency distribution always equals 100.
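The construction described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the Excel procedure used in the report: the function name `percent_frequency` is hypothetical, and the input list is rebuilt from the 70/30 split reported for Type of Customers rather than read from the original data file.

```python
from collections import Counter

def percent_frequency(values):
    """Return {category: (frequency, percent frequency)} for categorical data."""
    counts = Counter(values)
    n = len(values)
    return {cat: (freq, 100.0 * freq / n) for cat, freq in counts.items()}

# Rebuilt from the reported split (70 promotional, 30 regular), not the raw file.
customer_types = ["Promotional"] * 70 + ["Regular"] * 30
table = percent_frequency(customer_types)

for category, (freq, pct) in table.items():
    print(f"{category:<12} {freq:>4} {pct:>6.1f}%")
```

The frequencies add up to the number of observations and the percentages add up to 100, which is a quick check worth running on every table of this kind.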
These percent frequency distributions provide a profile of Pelican’s customers. We can conclude that:
- Over half of the customers purchase 1 or 2 items, but a few make numerous purchases
- The percent frequency distribution of net sales shows that 61% of the customers spent $50 or more
- Customers are distributed across all adult age groups
- The overwhelming majority of customers are female
- Most of the customers are married
1.2 Question 2
To construct a bar chart showing the number of customer purchases attributable to the method of payment, we count the number of customers for each method of payment by using a PivotTable in Excel.
Excel’s PivotTable Report is an interactive tool that allows us to quickly summarize data in a variety of ways, including developing a frequency distribution for quantitative data.
1.2.1 PivotTable showing the number of customer purchases attributable to the method of payment
Table 1.7 The number of customer purchases attributable to the method of payment
1.2.2 Bar chart showing the number of customer purchases attributable to the method of payment
Figure 1.1 The number of customer purchases attributable to the method of payment
From the bar chart above, we conclude that a large majority of the customers use the proprietary credit card.
Net sales classes (column headers; the body of the table was not recovered): up to 24.99, 25.00-49.99, 50.00-74.99, 75.00-99.99, 100.00-124.99, 125.00-149.99, 150.00-174.99, 175.00-199.99, 200+, Total
Table 1.7 A crosstabulation of types of customer versus net sales
1.3.2 Comment on similarities or differences present
In terms of similarities between promotional and regular customers, we can draw some conclusions:
- For both types of customers, the total amount charged to the credit card most often falls in the ranges 25.00-49.99 and 50.00-74.99
- Only a few customers were charged more than $125.00
In terms of differences, we can conclude that:
- Some customers who use promotional coupons have net sales above 175.00, but no regular customers do
In conclusion, from the crosstabulation above, it appears that net sales are larger for promotional customers.
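A crosstabulation like the one discussed above can be sketched as follows. This is a minimal illustration under stated assumptions: the observations are made up, the helper names `net_sales_class` and `crosstab` are hypothetical, and the classes are assumed to be 25 units wide, start at 0.00 and end with an open "200+" class.

```python
from collections import Counter

def net_sales_class(amount, width=25.0, top=200.0):
    """Map an amount to a class label such as '25.00-49.99' or '200+'."""
    if amount >= top:
        return f"{top:.0f}+"
    lower = int(amount // width) * width
    return f"{lower:.2f}-{lower + width - 0.01:.2f}"

def crosstab(rows):
    """Count observations for each (customer type, net sales class) pair."""
    return Counter((ctype, net_sales_class(sales)) for ctype, sales in rows)

# Made-up observations, for illustration only.
rows = [
    ("Promotional", 39.50),
    ("Regular", 44.00),
    ("Promotional", 62.10),
    ("Promotional", 253.00),
    ("Regular", 31.60),
]
table = crosstab(rows)
for (ctype, sales_class), count in sorted(table.items()):
    print(ctype, sales_class, count)
```

Comparing the counts row by row (one row per customer type) is what supports statements about where each group's net sales concentrate.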
Figure 1.2 The relationship between net sales and customer age
A trendline has been fitted to the data. From this, it appears that there is no relationship between net sales and age. Thus, age does not appear to be a factor in determining net sales.
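The trendline Excel draws on a scatter diagram is, by default, a least-squares line. The sketch below shows how such a line could be fitted by hand; the function name `fit_trendline` is hypothetical and the ages and net sales are made-up values chosen so that the slope is exactly zero, which is the pattern described above.

```python
def fit_trendline(x, y):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up data with no linear trend, for illustration only.
ages = [25, 35, 45, 55, 65]
net_sales = [60.0, 90.0, 50.0, 90.0, 60.0]
slope, intercept = fit_trendline(ages, net_sales)
print(f"net sales = {intercept:.2f} + {slope:.2f} * age")
```

A slope close to zero means the fitted line is nearly flat, so knowing a customer's age says little about the expected net sales.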
1.5 Conclusion
By using the tabular and graphical methods of descriptive statistics, we can conclude that promotional coupons and the proprietary card appear to increase the store’s net sales, while age is not a factor in determining net sales.
SECTION 2 DESCRIPTIVE STATISTICS WITH NUMERICAL MEASURES
2.1 Question 1
2.1.1 Descriptive statistics on net sales
Table 2.1 Descriptive statistics on net sales
From the above statistics, it can be observed that the average net sales is 77.60 units. The median is 59.71, which is less than the mean. This indicates that the data are right-skewed. That is, half of the customers have net sales of 59.71 units or less, while a few large purchases pull the mean upward. This conclusion is supported by the positive skewness value of 1.715.
The maximum net sales is 287.59 units while the minimum is 13.23 units.
2.1.2 Descriptive statistics on net sales by various classifications of customers
Table 2.2 Descriptive statistics on net sales by promotional and regular customers
From above statistics, a few observations can be made:
Customers taking advantage of the promotional coupons spent more money on average. The mean amount spent by all customers is $77.60; the average amount spent by promotional customers was $84.29.
The standard deviation of sales is $55.66. This indicates a fairly wide variability in purchase amounts across customers. This variability is quite a bit smaller for the regular customers.
The distribution of the sales data is skewed to the right. The mean ($77.60) is larger than the median ($59.71) and the skewness measure (1.71) is positive. Positive skewness is typical for this kind of data: there are no negative sales amounts and there are a few large purchases.
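The link between mean, median and skewness can be checked numerically. The sketch below uses the adjusted sample skewness formula that Excel's SKEW function documents; the function name `sample_skewness` is hypothetical and the purchase amounts are made up, with one large purchase added to mimic the right tail described above.

```python
import statistics

def sample_skewness(data):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s) ** 3)."""
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 denominator)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in data)

# Made-up purchase amounts with one large purchase, for illustration only.
amounts = [20, 30, 40, 50, 60, 70, 290]
mean_amount = statistics.mean(amounts)
median_amount = statistics.median(amounts)
skew = sample_skewness(amounts)
print(mean_amount, median_amount, round(skew, 3))
```

With a long right tail the mean exceeds the median and the skewness is positive, which is the same pattern as the reported mean of $77.60, median of $59.71 and skewness of 1.71.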
2.2 Question 2
To determine the relationship between Age and Net sales, we calculate the correlation coefficient.
Let $x$ be the age variable and $y$ be the net sales variable.
We apply the formula for the sample correlation coefficient, denoted by $r_{xy}$:
$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$
We use MegaStat to determine descriptive statistics on Age.
Table 2.3 Descriptive statistics on age
The table gives the sample standard deviation of Age, $s_x$.
The sample covariance is calculated by using the formula
$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
To get the result, we use Excel as a calculation aid.
Since the value of $r_{xy}$ is near zero, it indicates a weak linear relationship between Net sales and Age. In other words, age is not a factor in determining Net sales.
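The sample correlation coefficient discussed above is the sample covariance divided by the product of the two sample standard deviations. The sketch below shows that calculation directly; the function names are hypothetical and the ages and net sales are made-up values, not the Pelican data.

```python
import statistics

def sample_covariance(x, y):
    """Sample covariance with the n - 1 denominator."""
    n = len(x)
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

def sample_correlation(x, y):
    """Sample correlation: covariance divided by the product of standard deviations."""
    return sample_covariance(x, y) / (statistics.stdev(x) * statistics.stdev(y))

# Made-up data, for illustration only.
ages = [25, 35, 45, 55, 65]
net_sales = [60.0, 90.0, 50.0, 90.0, 60.0]
r = sample_correlation(ages, net_sales)
print(f"r = {r:.3f}")
```

A value near zero indicates a weak linear relationship, while values near +1 or -1 indicate a strong positive or negative linear relationship.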
SECTION 3 HYPOTHESIS TEST
3.1 Question 1
After conducting a hypothesis test for each of the 4 samples at the 0.01 level of significance, we have the following hypothesis testing results:
Table 3.1 Hypothesis testing results
Only sample 3 leads to the rejection of the null hypothesis $H_0$. Thus, corrective action is warranted for sample 3. The other samples indicate that $H_0$ cannot be rejected; thus, the process is operating satisfactorily. The sample mean of sample 3 shows the process is operating below the desired mean. The sample mean of sample 4 is on the high side, but the p-value of 0.03 is not sufficient to reject $H_0$.
This would increase the probability of making a Type I error.
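The decision rule used above compares each p-value with the level of significance. The sketch below illustrates it, assuming a two-tailed z test (the report does not state the test form, so this is an assumption); the function names are hypothetical and only the p-value of 0.03 and the 0.01 level come from the text.

```python
from statistics import NormalDist

def two_tailed_p_value(z):
    """p-value of a two-tailed z test for an observed test statistic z."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def reject_null(p_value, alpha=0.01):
    """Reject H0 when the p-value does not exceed the level of significance."""
    return p_value <= alpha

# The p-value of 0.03 is the one reported for sample 4.
decision_at_001 = reject_null(0.03, alpha=0.01)
decision_at_005 = reject_null(0.03, alpha=0.05)
print(decision_at_001, decision_at_005)
```

The same p-value of 0.03 would lead to rejection at the 0.05 level, which shows how a larger level of significance makes rejection, and therefore a Type I error, more likely.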
3.5 Conclusion
SECTION 4 EXPERIMENTAL DESIGN AND ANOVA
4.1 Question 1
Anova: Single Factor, data from Medical 1
Hypothesis tested
$H_0$: There is no significant difference in the mean depression score of healthy people in the three locations
$H_a$: There is a significant difference in the mean depression score of healthy people in the three locations, where:
$\mu_1$ = the mean depression score of healthy people in Florida
$\mu_2$ = the mean depression score of healthy people in New York
$\mu_3$ = the mean depression score of healthy people in North Carolina
Rejection Rule: Reject the null hypothesis if the calculated value of the F statistic is greater than the critical value
Conclusion: The null hypothesis is rejected because the sample provides enough evidence of a difference in the mean depression score of healthy people in the three locations (the F statistic is at least the critical value), so the geographical means are not all equal. The difference arises mainly between the means for New York and Florida.
Medical 2
Hypothesis tested
$H_0$: There is no significant difference in the mean depression score of healthy people in the three locations
$H_a$: There is a significant difference in the mean depression score of healthy people in the three locations, where:
$\mu_1$ = the mean depression score of healthy people in Florida
$\mu_2$ = the mean depression score of healthy people in New York
$\mu_3$ = the mean depression score of healthy people in North Carolina
Rejection Rule: Reject the null hypothesis if the calculated value of the F statistic is greater than the critical value
Conclusion: The null hypothesis cannot be rejected because the sample does not provide enough evidence of a difference in the mean depression score of healthy people in the three locations (the F statistic does not exceed the critical value), so we cannot conclude that the geographical means differ. There is no evidence of a relationship between location and depression score.
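The F statistic that Excel's "Anova: Single Factor" tool reports can be reproduced by hand. The sketch below computes it for a completely randomized design; the function name `one_way_anova_f` is hypothetical and the three groups are made-up scores, not the Medical 1 or Medical 2 data. The critical value still has to be looked up in an F table or taken from the software output.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples, one per treatment."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Sum of squares due to treatments (between groups).
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Sum of squares due to error (within groups).
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Made-up depression scores for three locations, for illustration only.
groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_stat, df_between, df_within = one_way_anova_f(groups)
print(f_stat, df_between, df_within)
```

The null hypothesis of equal means is rejected only when the computed F exceeds the critical value for the given degrees of freedom and level of significance.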
4.3 Question 3
From the above two output results, we observe that:
- There is no interaction between health and locations
- There is a large difference in depression scores between people in good health and people with chronic health conditions
- For people in reasonably good health, geographical location affects the level of depression. However, for people with some kind of chronic health problem, depression does not vary across states
SECTION 5 STATISTICAL ANALYSIS WITH REAL DATA
5.1 Introduction
5.1.1 Population and Sample
Our team found the dataset on a website named “Kaggle”. The survey gathered basic information such as height and weight from 500 respondents. Our major goal is to see if there is a difference in average height between boys and girls, as well as to examine the relationship between the respondents' height and weight.
5.1.2 Sample size
Following a discussion among team members, everyone agreed to select a sample of 500 students. This number is sufficiently large for us to obtain a definite and appropriate proportion for our testing and should give a more accurate result. Furthermore, when the sample size is too small, the picture the sample gives of height and weight may not reflect the actual condition of the total number of respondents.
5.1.3 Sampling method
There are several sampling methods available for studying people's height and weight. However, because the total population is so large (the total number of obese people is approximately 650 million) and a sample size of 500 was chosen, we decided to use a simple random sampling method to collect the responses.
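In simple random sampling, every possible sample of the chosen size has the same chance of being selected. A minimal sketch is given below; the sampling frame of 10,000 respondent IDs, the seed and the function name `simple_random_sample` are all hypothetical and only the sample size of 500 comes from the text.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct elements so that every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)

# Hypothetical sampling frame of respondent IDs, for illustration only.
respondent_ids = list(range(1, 10001))
sample = simple_random_sample(respondent_ids, 500, seed=42)
print(len(sample), sample[:5])
```

Fixing the seed makes the draw reproducible, which is useful when the same sample has to be documented in an appendix.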
5.1.4 Data Collection
After selecting the main objective and content, our group searched for datasets on the Internet. Thanks to a recommendation by Mr. Nguyen Tien Dung, we came across a survey of height and weight on a website named “Kaggle”. The dataset has been viewed by just under 100 thousand users; therefore, we are confident that the collected data are reliable. Finally, we documented the variables (gender, height, weight, index) and processed the data with the help of Excel.
The tables of the data are presented in Appendix C of this report.
5.1.5 Data Processing
Acknowledging that the data set is too large to compute manually, we collected the information and analysed the figures with the assistance of Microsoft Excel. We entered the data into Excel and computed statistics by using “MegaStat” and “PivotTable”. Furthermore, our team also used the chart tools of Excel to visualize the data in charts, namely pie charts, histograms and a regression line, which make the results much easier for readers to understand.
5.1.6 Significance level of sample test
According to our research, the average height of men and women is roughly 170 cm. Therefore, based on what we have learnt, we form the hypothesis that there is no difference in the average height between boys and girls, and we test it at the 5% level of significance.
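The hypothesis of equal average heights can be tested with a two-sample test on the difference between means. The sketch below uses the large-sample normal approximation, which is an assumption made here for simplicity (MegaStat would typically report a t test); the function name `two_sample_z_test` is hypothetical and the heights are made-up values, not the Kaggle data.

```python
import math
from statistics import NormalDist, mean, variance

def two_sample_z_test(sample_a, sample_b):
    """Return (z, two-tailed p-value) for H0: the two population means are equal."""
    standard_error = math.sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    z = (mean(sample_a) - mean(sample_b)) / standard_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value

# Made-up heights in cm with identical means, for illustration only.
boys = [168, 170, 172] * 20
girls = [169, 170, 171] * 20
z, p_value = two_sample_z_test(boys, girls)
print(z, p_value, p_value <= 0.05)
```

At the 5% level of significance, the hypothesis of no difference is rejected only when the p-value is 0.05 or less.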
5.2 Descriptive Statistics
After obtaining the raw data, we decided to divide the data into 2 groups based on gender in order to conduct further statistics.
Figure 5.1 The gender proportion (pie chart: Male 49.00%, Female 51.00%)
The percentages of boys and girls among the 500 respondents are roughly equal, as seen in the pie chart, with 49 percent and 51 percent respectively. This balance helps ensure that the comparison between the two groups is not biased.
First, we computed some descriptive statistics for males. The table is shown below:
Table 5.1 The male’s height descriptive statistic
Then we constructed the frequency distribution table and histogram.
lower upper midpoint width frequency percent
Figure 5.2 The male’s height histogram
We can observe some basic information from the table and histogram. The survey included 245 boys with an average height of 169 cm. Their heights range from 140 to 199 centimeters, with a sample standard deviation of 17.07 centimeters. Furthermore, among boys, the most common height (Mode) is 179 cm, which is higher than the Median (171 cm) and Mean (169 cm), indicating a left-skewed distribution.
Range 110
Confidence interval 95% lower 102.31
Confidence interval 95% upper 110.32
Table 5.3 The male’s weight descriptive statistic
lower upper midpoint width frequency percent
Figure 5.3 The male’s weight histogram
The sample mean weight is 106.31 kg, according to the data. The range is 110 kg, with the lowest point being 50 kg. Furthermore, the most common weight (Mode) is 80 kg, which is lower than the Median (105 kg) and Mean (106.31 kg), indicating that the frequency distribution is right-skewed with a long right tail.
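The 95% confidence interval listed with the weight statistics has the usual form of the sample mean plus or minus a margin of error. The sketch below shows how such an interval could be computed, using the normal critical value as a large-sample approximation (an assumption; MegaStat may use the t distribution); the function name `mean_confidence_interval` is hypothetical and the weights are made up.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(data, level=0.95):
    """Return (lower, upper) for the mean using a normal critical value."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for 95%
    margin = z * stdev(data) / math.sqrt(len(data))
    center = mean(data)
    return center - margin, center + margin

# Made-up weights in kg, for illustration only.
weights = [50, 80, 105, 110, 160] * 10
lower, upper = mean_confidence_interval(weights)
print(round(lower, 2), round(upper, 2))
```

The interval is centered on the sample mean, which is consistent with the reported bounds: the midpoint of 102.31 and 110.32 is 106.315, matching the sample mean of 106.31 kg up to rounding.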
We repeated the same steps for females:
Table 5.5 The female’s height descriptive statistic
lower upper midpoint width frequency percent