Introduction to Empirical Analysis Using Stata: For Beginners
Project Research Associate Department of Technology Management for innovation Graduate School of Engineering, the University of Tokyo
t-koba@tmi.t.u-tokyo.ac.jp
This lecture material may be reused under the terms of the Creative Commons Attribution license.
Please note that some parts do not treat the statistics with full rigor.
Acknowledgements: Mr Kisa Sugihara and Mr Akihiro Kawamura made a great
contribution to the English translation.
• Studying organizational management in technology and design development
• Assistant in legal affairs in a university start-up (Signpost, Corp.)
→Policy analyst in a private think tank (Mitsubishi Res Inst.)
→Hitotsubashi Univ & Univ of Tokyo
• We will learn the basic knowledge and skills needed to reveal (or prove) a causal relationship.
• The goal is to be able to run an analysis by yourself after the seminar
formula is used.
How to Load Data
① Setting research questions
[Figure: steps carried out in the head vs. steps embodied with data]
I Preparation for the Analysis: How to Load Data
                | Stata | SPSS | R | GRETL
Features        | High | High | Medium (High w/ add-in) | Medium (High w/ add-in)
User support    | Official support + a couple of books | Official support + books | A variety of information online + books | Information online
Characteristics | Strong in social science analysis | Somewhat strong in natural science analysis | Strong in data processing | Strong in economics analysis
• Created from OECD, Main Science and Technology Indicators
growth in 2013 (compared with 2008)
• Workforce population (thousands)
• PCT patent applications (number of patents): patent applications intended to be filed in foreign countries
• Industry value added (US $ million)
• Technology trade receipts (US $ million)
• Technology trade payments (US $ million)
• Technology trade balance (US $ million): receipts - payments
Variable name | Content
Region_Narrow | Region name
Region_Broad | Continent name
Laborforce_2008_thousands | Workforce population (thousands) (2008)
Laborforce_2013_thousands | Workforce population (thousands) (2013)
Laborforce_growthrate | Growth rate (2008-2013)
pctpatentapplication_2008 | Number of international patent applications (2008)
pctpatentapplication_2013 | Number of international patent applications (2013)
Pct_growthrate | Growth rate (2008-2013)
Valueadded_2008_m_usd | Industry value added (US $ million) (2008)
Valueadded_2013_m_usd | Industry value added (US $ million) (2013)
Valueadded_growthrate | Growth rate (2008-2013)
ValueAdded_Growth_M_USD | Growth value (2008-2013)
Techpayments_2008_m_usd | Technology trade payments (US $ million) (2008)
Techpayments_2013_m_usd | Technology trade payments (US $ million) (2013)
Techpayments_growthrate | Growth rate (2008-2013)
Techbalance_2008_m_usd | Technology trade balance (US $ million) (2008)
Techbalance_2013_m_usd | Technology trade balance (US $ million) (2013)
Techbalance_growth_m_usd | Growth value (US $ million) (2008-2013)
Laborforce_growth_dummy | Dummy variable: 1 if the labor force population growth rate > 0
Techbalance_growth_dummy | Dummy variable: 1 if the technology trade balance growth rate > 0
Asiapacific_dummy | Dummy variable: 1 if the country is in Asia or the Pacific (including North America)
Europe_dummy | Dummy variable: 1 if the country is in Europe
Eu_dummy | Dummy variable: 1 if the country is an EU member
• IMPP_Eng_DATA.txt or IMPP_EnglishEdu_En.xlsx
• Surveys of public high schools and junior high schools
Prefecture | | Pref_Str
High School English Teachers' English Skill (MEXT English Skill
High School Students' English Skill (MEXT English Skill Survey in 2016) | (e) Those who took an English examination among (d) | HS_S_EXAM
 | (f) Those graded Eiken Pre-2 or above among (e) | HS_S_E2
 | (g) Those regarded as equivalent to Eiken Pre-2 or above, excluding (f) | HS_S_OT
Classification | Items | Variable names
Junior High School English Teachers' English Skill (MEXT English Skill Survey in 2016) | (h) Number of English teachers in public JHS | JH_T_ALL
 | (i) Those who took an English examination among (h) | JH_T_EXAM
 | Those graded Eiken Pre-1 or above and these
Junior High School Students' English Skill (MEXT English Skill Survey in 2016) | (l) Those who took an English examination among (k) | JH_S_EXAM
 | (m) Those graded Eiken Pre-2 or above among (l) | JH_S_E2
 | Those who are regarded as equivalent to Eiken Pre-2 and
who newly attend
Num of students who newly attend colleges and
Num of students who newly attend junior colleges (by
• What factors influence the English skills of high school students?
• Individual observations go in the vertical direction (one per row)
• Variables go in the horizontal direction (one per column)
• How to name variables
Avoid using other symbols or Japanese characters in variable names
• It is possible in Stata (though old versions do not support it)
• CSV data separate variables with "," (comma).
• For numeric data, Excel and other database software may add "," as a thousands separator
• To keep such values from being split into separate variables, the software wraps them in double quotation marks, like "333,231,298", when the file is saved
• When the file is loaded, R and Stata may then treat these numeric variables as strings
• If the file is tab-delimited, you can prevent this
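As a rough sketch, a tab-delimited file can also be loaded from the command line instead of the menus. This assumes that IMPP_Eng_DATA.txt sits in the current working directory, is tab-delimited, and has variable names in its first row; none of this is confirmed by the slides.

```stata
* Sketch: load a tab-delimited text file (file location and layout are assumptions)
import delimited using "IMPP_Eng_DATA.txt", delimiters("\t") varnames(1) clear

* Look at the first few observations to confirm the columns were split correctly
list in 1/5
```

By default `import delimited` converts variable names to lowercase, which is why the commands in this material use lowercase names.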
• File menu > Import > Text data created by a spreadsheet
[i] Keep "Tab-delimited data" checked
• In the file open window, change the file type to Text Files (*.txt) and then open the data file
• Note the type of each variable in the imported data
Number (can be calculated)
String (cannot be calculated)
※ Occurs when there is garbage in the data, or when the data were output to a tab-delimited text file with “”
0/1 (can be calculated)
You can check the types in the [Variable Manager]
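The same check can be sketched with commands. This assumes the OECD dataset is already loaded and that its variable names are in lowercase, as in the commands used elsewhere in this material.

```stata
* List every variable with its storage type:
* str# means string; byte/int/long/float/double mean numeric
describe

* Show the type, range, and number of missing values for one variable
codebook pct_growthrate
```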
• The correct method
Fix it via Data menu > Create or change data > Other variable-transformation commands > Convert variables from string to numeric.
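A minimal command-line sketch of the same conversion uses `destring`. It assumes, purely for illustration, that valueadded_2008_m_usd was imported as a string because it contained thousands separators.

```stata
* Sketch: convert a string variable holding values such as "333,231,298" to numeric
* ignore(",") removes the thousands separators before converting
destring valueadded_2008_m_usd, replace ignore(",")

* Confirm that the storage type is now numeric
describe valueadded_2008_m_usd
```

If the variable contains other non-numeric characters, `destring` leaves it unchanged and reports the problem, which is safer than forcing the conversion.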
II Descriptive Statistics and Graphs
• View descriptive statistics
• Statistics menu > Summaries, tables, and tests > Summary and descriptive statistics > Summary statistics
• In “Variables”, click the variables you want to summarize
[i] Click and choose [ii] OK
#Command lines for descriptive statistics
summarize laborforce_growthrate pct_growthrate valueadded_growthrate techbalance_gro
Callouts on the output: standard deviation; long variable names are abbreviated
• In the by/if/in tab, you can restrict the sample and summarize by group
[i] Check here [ii] Select the variable that defines the groups (for example, Europe_dummy)
#Descriptive statistics by groups
pct_growthrate
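The command on this slide is cut off, so the following is only a sketch of how group-wise descriptive statistics are commonly requested. The choice of variables is an assumption.

```stata
* Sketch: summary statistics separately for europe_dummy == 0 and == 1
* bysort sorts the data by the grouping variable before running the command
bysort europe_dummy: summarize laborforce_growthrate pct_growthrate

* Alternative: restrict the sample with an if condition
summarize pct_growthrate if eu_dummy == 1
```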
• Statistics > Summaries, tables and tests > Summary and descriptive statistics > Correlations and covariances
#Correlations
• Correlations between variables (cont.)
correlate valueadded_growthrate techbalance_growth_m_usd techbalance_growth_dummy laborforce_growthrate pct_growthrate eu_dummy
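As a possible complement to `correlate`, `pwcorr` can print significance levels alongside the coefficients. This is a sketch, with the variable list shortened for readability.

```stata
* Sketch: pairwise correlations with p-values
* sig prints the p-value under each coefficient; star(0.05) marks p < 0.05
pwcorr valueadded_growthrate laborforce_growthrate pct_growthrate eu_dummy, sig star(0.05)
```

Note that `correlate` drops every observation with a missing value in any listed variable, whereas `pwcorr` uses all observations available for each pair, so the two can give slightly different numbers.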
• Drawing a histogram
Select a variable
• Drawing a histogram (cont.): Results
• You can create a histogram for each group in the “By” tab
[i] Click “By”
[ii] Select the variables to use for grouping
• Drawing a histogram by groups
# Drawing a histogram by groups
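The histogram commands themselves did not survive on these slides. A minimal sketch, assuming pct_growthrate is the variable of interest and europe_dummy the grouping variable:

```stata
* Sketch: histogram of one variable, with counts on the vertical axis
histogram pct_growthrate, frequency

* Sketch: one histogram panel per group
histogram pct_growthrate, frequency by(europe_dummy)
```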
• Graphics menu > Twoway graph (scatter, line, etc.)
• Drawing a scatter chart
[i] Click Create
#Drawing a scatter chart
[i] Select Scatter in the basic plots
[ii] Select the variable for each axis
[iii] Press Accept to return to the previous screen, then press OK
• Drawing a scatter chart (cont.): Results
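The menu steps above correspond roughly to the following commands. The choice of axis variables is an assumption made for illustration; the first variable goes on the vertical axis.

```stata
* Sketch: scatter plot with valueadded_growthrate on the y axis
twoway (scatter valueadded_growthrate pct_growthrate)

* Sketch: the same plot with a fitted straight line overlaid
twoway (scatter valueadded_growthrate pct_growthrate) ///
       (lfit valueadded_growthrate pct_growthrate)
```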
Scatter plot matrix
• Graphics > Scatterplot matrix
• Drawing a scatter plot matrix
Select variables
[Figure: scatter plot matrix of the selected variables]
#Drawing a scatter plot matrix
pct_growthrate eu_dummy
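Only the end of the command is visible above, so this is a sketch of a typical scatter plot matrix call with an assumed variable list.

```stata
* Sketch: scatter plot matrix; half draws only the lower triangle
graph matrix valueadded_growthrate laborforce_growthrate pct_growthrate eu_dummy, half
```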
• Graphics > Box plot
• Drawing a box plot
#Drawing a box plot
#Drawing a box plot by groups
• Drawing a box plot by groups: Results
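The box plot commands are missing from these slides. A minimal sketch, again assuming pct_growthrate and europe_dummy as the variables:

```stata
* Sketch: box plot of a single variable
graph box pct_growthrate

* Sketch: one box per group, side by side
graph box pct_growthrate, over(europe_dummy)
```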
variable contains errors
and value added related variables
histograms, and scatter plots
• Answer
III Data Processing
• How to compute a new variable
[i] Fill in the name of the new variable
[ii] Click “Create”
• How to compute a new variable (cont.)
[i] Mathematical functions can be chosen from Functions > Mathematical
[ii] You can choose a variable from Variables
#Create a new variable
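The command for this step is not shown, so the following is a sketch of how new variables are typically created with `generate`. The new variable names are hypothetical.

```stata
* Sketch: natural log of value added (ln() returns missing for values <= 0)
generate ln_valueadded_2013 = ln(valueadded_2013_m_usd)

* Sketch: a ratio of two existing variables
generate techpay_per_worker_2013 = techpayments_2013_m_usd / laborforce_2013_thousands

* Check the results
summarize ln_valueadded_2013 techpay_per_worker_2013
```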
• Save the modified dataset [1]
#Save the dataset in a tab delimited format text file
delimiter(tab) replace
Input a file name, then check “Tab-delimited”
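The start of the export command is cut off above. A full command would look roughly like this, where the file name is a hypothetical placeholder:

```stata
* Sketch: save the data in memory as a tab-delimited text file
* replace overwrites the file if it already exists
export delimited using "oecd_modified.txt", delimiter(tab) replace
```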
IV Regression Analysis
influence of each factor
Regression Analysis
Performance = a*Factor 1 + b*Factor 2 + c
[Figure: a regression plane with slopes a and b and intercept c]
(note) In general the green layer is not a triangle; in this example it is, because we restrict Factor 1 and Factor 2 (> 0) and Performance (< p)
• Key terms
• Dependent (explained) variable: the variable to be estimated; in many cases, a performance indicator
• Explanatory variables: variables that affect (or are thought to be strongly correlated with) the dependent variable
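A model of the form Performance = a*Factor 1 + b*Factor 2 + c can be estimated with `regress`. In this sketch the choice of dependent and explanatory variables is an assumption made only to illustrate the syntax.

```stata
* Sketch: OLS regression; the first variable is the dependent variable,
* the remaining variables are the explanatory variables
regress valueadded_growthrate laborforce_growthrate pct_growthrate
```

In the output, the coefficients on the two explanatory variables correspond to a and b, and the row labeled _cons corresponds to the intercept c.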
2 | Significantly
3 | Not Significant | Significantly (+) | Positive impact is non-linear
4 | Significantly (+) | Not Significant | A linear positive impact
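The fragment above appears to compare the significance of a variable with that of its squared term, although the column headers are missing. Assuming that reading, a squared term could be added as in this sketch; the variable names are illustrative and pct_growthrate_sq is hypothetical.

```stata
* Sketch: add a squared term to look for a non-linear effect
generate pct_growthrate_sq = pct_growthrate^2
regress valueadded_growthrate pct_growthrate pct_growthrate_sq
```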
• What can be used as an explanatory variable? (cont.)
• Use when the explanatory variable works differently depending on a condition (checks the moderator effect)
• Estimate it together with each of the explanatory variables and see the degree of influence of both
ii Interaction term (cross term) (cont.)
• Notes:
• An interaction term often causes multicollinearity with the original explanatory variables: centering or standardization is needed
• Centering: original value - mean
• Standardization: (original value - mean) / standard deviation
• If the two explanatory variables differ greatly in scale, the interaction term will be dominated by one of them: standardize, or align the number of digits
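The centering and standardization formulas above can be sketched as follows. All new variable names (lf_c, pct_c, pct_z, lf_x_pct) are hypothetical, and the choice of variables is an assumption.

```stata
* Centering: original value - mean
summarize laborforce_growthrate, meanonly
generate lf_c = laborforce_growthrate - r(mean)

summarize pct_growthrate, meanonly
generate pct_c = pct_growthrate - r(mean)

* Standardization: (original value - mean) / standard deviation
egen pct_z = std(pct_growthrate)

* Interaction term built from the centered variables
generate lf_x_pct = lf_c * pct_c

* Estimate the interaction together with both original (centered) variables
regress valueadded_growthrate lf_c pct_c lf_x_pct
```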
• What can be used as an explanatory variable? (cont.)
• A dummy variable takes 1 if a specific condition is fulfilled, otherwise 0
• Useful for controlling for differences in conditions or affiliations
Bolton, R. N., & Chapman, R. G. (1986). Searching for positive returns at the track: A multinomial logit model for handicapping horse races. Management Science, 32(8), 1040-1060.
(Example)
Previous race win dummy:
Takes 1 if the horse won in the previous race
(Source) JRA
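A dummy variable of this kind can be sketched in Stata as below. The new variable pct_growth_dummy is hypothetical, and the regression specification is only an illustration.

```stata
* Sketch: dummy = 1 if the condition holds, 0 otherwise
* The if qualifier keeps missing values missing instead of coding them as 1
generate byte pct_growth_dummy = (pct_growthrate > 0) if !missing(pct_growthrate)
tabulate pct_growth_dummy

* Sketch: use an existing dummy to control for affiliation
regress valueadded_growthrate laborforce_growthrate eu_dummy
```

The `!missing()` condition matters because Stata treats missing values as larger than any number, so `pct_growthrate > 0` alone would be true for missing observations.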
i All explanatory variables are data obtained from experiments.
(That is, they are fixed values, not random variables, i.e. uncertain values that range over some interval)
ii The expected value of the error is 0
• The error term does not have uneven variance (see next page)
• No variable that explains the explained variable is missing
• No variable affects both the explained variable and the explanatory variables.
• This is also expressed as “there is no endogeneity” or “the error term is uncorrelated with the explanatory variables”
• Conditions under which OLS can be used
Check with the Breusch-Pagan test or an LM test
If the variance of the error is uneven (heteroskedasticity):
1 Add missing variables to the model
2 Log-transform the explanatory variables and the explained variable
3 Use robust standard errors
4 Estimate by weighted least squares (details and practice omitted) or by maximum likelihood
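The check and the robust-standard-error remedy can be sketched as follows. The model specification is an assumption carried over for illustration.

```stata
* Sketch: estimate the model, then run the Breusch-Pagan test
* (null hypothesis: the error variance is constant)
regress valueadded_growthrate laborforce_growthrate pct_growthrate
estat hettest

* Sketch: re-estimate with heteroskedasticity-robust standard errors
regress valueadded_growthrate laborforce_growthrate pct_growthrate, vce(robust)
```

With robust standard errors the coefficients stay the same; only the standard errors, t values, and p-values change.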
• iv) No correlation between the error and the explanatory variables
= no endogeneity (no omitted variable bias)
[Figure labels: Amount of knowledge; Research time; Highly rated research papers; Evaluation from instructors / Awards / Number of citations; Luck; Cannot measure. Factors correlated with the amount of knowledge appear in the error term.]
Example: a situation in which the seminar instructor influences both the number of accessible articles and the evaluation
• The pure effect of the amount of knowledge cannot be estimated as long as the person's intelligence cannot be measured
Must be considered before the analysis. The Durbin-Wu-Hausman test detects endogeneity.
If there is endogeneity:
Solution
• Fixed-effects model estimation on panel data
• Adding control variables
• Adopting the instrumental variables (IV) method
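As a sketch of the instrumental variables approach and the endogeneity test, the following uses purely hypothetical variable names: y is the explained variable, x_exog an exogenous explanatory variable, x_endog the suspected endogenous variable, and z1 and z2 the instruments. These variables are not part of the datasets in this material.

```stata
* Sketch (hypothetical variables): two-stage least squares
* x_endog is instrumented by z1 and z2
ivregress 2sls y x_exog (x_endog = z1 z2)

* Durbin and Wu-Hausman tests of whether x_endog can be treated as exogenous
estat endogenous
```

A small p-value in these tests suggests that the variable is endogenous and that OLS would be biased.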
• Conditions under which OLS can be used
• iv) No correlation between the error and the explanatory variables
= no endogeneity (no omitted variable bias)
• Phenomena observed when omitted variable bias exists
• R2 is low (the model's explanatory power is weak)
• Explanatory variables and control variables that previous studies using the same explained variable found to have a significant influence have not been added
(Control variables are not important in the causal model, but they affect the explained variable)
Solution
- Check the previous research carefully!
• iv) No correlation between the error and the explanatory variables
= no endogeneity (no simultaneity bias or reverse causality)
[Figure labels: Amount of knowledge (Number of papers read); Research time (Number of hours devoted to research); Highly rated research papers (Evaluation from instructors / Awards / Number of citations)]
Example: a situation where being known for writing good papers lets you concentrate on research
• Correct estimation is impossible when the causality is circular.
Must be considered before the analysis. Detectable by the Durbin-Wu-Hausman test.
• Conditions under which OLS can be used
• If the error is not normally distributed, the estimated line may not have the correct slope
[Figure: frequency distribution of the values the error takes in the actual sample]
Confirm whether the residuals are normally distributed with the skewness/kurtosis test or the Shapiro-Wilk normality test
If they are not normally distributed:
Solution
1 Log-transform or square the dependent and explanatory variables
2 Estimate by maximum likelihood, e.g. a Poisson, probit, or Tobit model
However, if the sample is large enough (about a few hundred observations), no verification is required
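The residual checks named above can be sketched like this. The model specification is an assumption, and resid is a hypothetical name for the new variable.

```stata
* Sketch: estimate the model and store the residuals
regress valueadded_growthrate laborforce_growthrate pct_growthrate
predict resid, residuals

* Skewness/kurtosis test and Shapiro-Wilk test
* (null hypothesis in both: the residuals are normally distributed)
sktest resid
swilk resid

* A histogram with a normal curve overlaid gives a visual check
histogram resid, normal
```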
• vi) No strong correlation between explanatory variables: absence of multicollinearity
• Multicollinearity: among highly correlated explanatory variables it cannot be determined which one has the influence, and the estimated coefficients become inaccurate
• Observed phenomena
• Although the coefficient of determination is high, the t value of each explanatory variable is low (not significant)
• Abnormally high standard errors
• The sign (+ or -) of a coefficient does not agree with the sign estimated by a model that includes only one of the correlated explanatory variables.
Compute the VIF (variance inflation factor) and check whether any variable shows a value of 4 or more (or 10 or more)
If there is multicollinearity:
Solution
1 Eliminate unnecessary explanatory variables
2 Convert explanatory variables to differences or ratios
3 Apply factor analysis or principal component analysis to the explanatory variables to create uncorrelated composite variables
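The VIF check can be sketched as below. Including both eu_dummy and europe_dummy is an assumption chosen because the two are likely to be correlated.

```stata
* Sketch: estimate the model, then list the VIF of each explanatory variable
regress valueadded_growthrate laborforce_growthrate pct_growthrate eu_dummy europe_dummy
estat vif
```

Variables whose VIF exceeds the chosen threshold (4 or 10, as noted above) are candidates for dropping or transforming.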
• 7 steps in regression analysis
1 Design the causal model and translate it into indicators
• Build a model without endogeneity (omitted variable bias, simultaneity bias)
• The sample should be large (at least the number of explanatory variables × 2 or 3, plus 10)
2 Create descriptive statistics & a correlation matrix
• Be sure to create histograms to verify the distributions
• If the dependent variable is not normally distributed, consider estimators other than OLS
• If the explanatory variables differ in their number of digits, rescale them (multiply by 1,000, by 1/1,000, etc.)
• For explanatory variables whose correlation is too strong, either drop one or check for multicollinearity later
3 Build two models: one with only the control variables, and one that adds the explanatory variables
• Compare the R2 of the two models to see the contribution of the explanatory variables
4 If the model contains strongly correlated variables, check for multicollinearity
• Check the VIF: is it 4 or more (or 10 or more)?
• In the case of multicollinearity, drop one variable, transform a variable, aggregate them by principal component analysis, etc.
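The comparison of a controls-only model with a full model can be sketched as follows. Treating eu_dummy as the control variable and the two growth rates as explanatory variables is an assumption for illustration; m_controls and m_full are hypothetical names.

```stata
* Model with control variables only
regress valueadded_growthrate eu_dummy
estimates store m_controls

* Model with control variables plus explanatory variables
regress valueadded_growthrate eu_dummy laborforce_growthrate pct_growthrate
estimates store m_full

* Side-by-side table with sample size, R2, and adjusted R2
estimates table m_controls m_full, b(%9.3f) se stats(N r2 r2_a)
```

The comparison is only meaningful if both models use the same observations, so it is worth checking that N is identical in the two columns.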