Trang 1

Introduction to Empirical Analysis Using Stata: For Beginners

Project Research Associate, Department of Technology Management for Innovation, Graduate School of Engineering, The University of Tokyo

t-koba@tmi.t.u-tokyo.ac.jp

This lecture material may be reused under the Creative Commons Attribution terms.

Please note that some parts do not treat the statistics with full rigor.

Acknowledgements: Mr Kisa Sugihara and Mr Akihiro Kawamura made a great contribution to the English translation.

Trang 2

• Studies organizational management in technology and design development

• Assistant in legal affairs at a university start-up (Signpost, Corp.)

→ Policy analyst at a private think tank (Mitsubishi Res Inst.)

→ Hitotsubashi Univ & Univ of Tokyo

Trang 3

We will learn the basic knowledge and skills to

reveal (or prove) a causal relationship.

Be able to analyze data by yourself after the seminar

formula is used.

Trang 4

How to Load data

Trang 5

① Setting research questions

Carried out in the head

With data

Embodiment

Trang 6

I Preparation for the Analysis: How to Load Data

Trang 7

Comparison of Stata, SPSS, R, and GRETL

Stata
  Features: High
  User support: Official support + a couple of books
  Characteristics: Strong in social science analysis

SPSS
  Features: High
  User support: Official support + books
  Characteristics: Somewhat strong in natural science analysis

R
  Features: Medium (High with add-ins)
  User support: A variety of information online + books
  Characteristics: Strong in data processing

GRETL
  Features: Medium (High with add-ins)
  User support: Information online
  Characteristics: Strong in economics analysis

Trang 8

• Created from OECD, Main Science and Technology Indicators

growth in 2013 (compared with 2008)

• Workforce population (thousands)

• PCT patent applications (number of patents): patent applications filed with the intent of seeking protection in foreign countries

• Industry value added (US $ million)

• Technology trade receipts (US $ million)

• Technology trade payments (US $ million)

• Technology trade balance (US $ million): receipts minus payments

Trang 9

Variable name                 Content

Region_Narrow                 Region name
Region_Broad                  Continent name
Laborforce_2008_thousands     Workforce population (thousands) (2008)
Laborforce_2013_thousands     Workforce population (thousands) (2013)
Laborforce_growthrate         Growth rate (2008-2013)
pctpatentapplication_2008     Number of international patent applications (2008)
pctpatentapplication_2013     Number of international patent applications (2013)
Pct_growthrate                Growth rate (2008-2013)
Valueadded_2008_m_usd         Industry value added (US $ million) (2008)
Valueadded_2013_m_usd         Industry value added (US $ million) (2013)
Valueadded_growthrate         Growth rate (2008-2013)
ValueAdded_Growth_M_USD       Growth value (2008-2013)
Techpayments_2008_m_usd       Technology trade payments (US $ million) (2008)
Techpayments_2013_m_usd       Technology trade payments (US $ million) (2013)
Techpayments_growthrate       Growth rate (2008-2013)
Techbalance_2008_m_usd        Technology trade balance of payments (US $ million) (2008)
Techbalance_2013_m_usd        Technology trade balance of payments (US $ million) (2013)
Techbalance_growth_m_usd      Growth value (US $ million) (2008-2013)
Laborforce_growth_dummy       Dummy variable: 1 if the labor force population growth rate > 0
Techbalance_growth_dummy      Dummy variable: 1 if the technology trade balance growth rate > 0
Asiapacific_dummy             Dummy variable: 1 if the country is in Asia or the Pacific (including North America)
Europe_dummy                  Dummy variable: 1 if the country is in Europe
Eu_dummy                      Dummy variable: 1 if the country is an EU member

Trang 11

• IMPP_Eng_DATA.txt or IMPP_EnglishEdu_En.xlsx

• Surveys to public high schools and junior high schools

Trang 12

Prefecture: Pref_Str

High School English Teachers' English Skill (MEXT English Skill ...)

High School Students' English Skill (MEXT English Skill Survey in 2016)
  (e) Those who took an English examination among (d): HS_S_EXAM
  (f) Those who scored Eiken Pre-2 or higher among (e): HS_S_E2
  (g) Those regarded as equivalent to Eiken Pre-2 or higher, other than (f): HS_S_OT

Trang 13

Classification / Items / Variable names

Junior High School English Teachers' English Skill (MEXT English Skill Survey in 2016)
  (h) Number of English teachers in public JHS: JH_T_ALL
  (i) Those who took an English examination among (h): JH_T_EXAM
  Those who scored Eiken Pre-1 or higher and these

Junior High School Students' English Skill (MEXT English Skill Survey in 2016)
  (l) Those who took an English examination among (k): JH_S_EXAM
  (m) Those who scored Eiken Pre-2 or higher among (l): JH_S_E2
  Those who are regarded as equivalent to Eiken Pre-2 and

Trang 14

who newly attend

Num of students who newly attend colleges and

Num of students who newly attend junior colleges (by

Trang 15

• What factors influence the English skills of high school students?

Trang 16

• Individual observations are arranged in the vertical direction (one per row)

• Variables are arranged in the horizontal direction (one per column)

Trang 17

• How to name variables

Variable names should avoid other symbols and Japanese characters

Trang 18

• Stata can do this (though older versions cannot)

• CSV data separates variables with "," (comma).

• In numeric data, Excel and other database software may insert "," as a thousands separator

• To keep such values from being split into separate variables, these programs wrap them in double quotation marks, like “333,231,298”, when the file is saved

• When such a file is loaded, R and Stata may treat the numeric variables as strings

• Saving the file as tab-delimited text avoids this

Trang 19

• File menu > Import > choose "Text data created by a spreadsheet"

Trang 20

[i] Keep the tab-delimited data option checked

Trang 21

• In the file open window, choose Text Files (*.txt) and then open the data file

Change to Text Files (*.txt)

Trang 22

Here

Trang 23

• Note the type of each variable in the imported data

Number (can be calculated)

String (not calculated)

※ This happens when there is garbage in the data, or when values were written to the tab-delimited text file wrapped in “”

0/1(Can be calculated)

You can see it here

Trang 24

[Variable Manager]

Here

Trang 25

• The correct method

Fix it via Data menu > Create or change data > Other variable-transformation commands > Convert variables from string to numeric.
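The menu item above corresponds to Stata's destring command. The following is a minimal sketch on made-up data; the variable names valueadded_str and valueadded_num are hypothetical and not part of the lecture dataset.

```stata
* Toy data (made up): numbers stored as strings with thousands separators
clear
input str12 valueadded_str
"333,231,298"
"1,024"
"57"
end

* Strip the commas and create a numeric copy of the variable
destring valueadded_str, generate(valueadded_num) ignore(",")

* Check the storage type and the converted values
describe valueadded_num
list
```

Using generate() keeps the original string variable, so the conversion can be checked before the string version is dropped.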

Trang 26

II Descriptive Statistics and Graphs

Trang 27

• View descriptive statistics

• Statistics menu > Summaries, tables, and tests > Summary and descriptive statistics > Summary statistics

Trang 28

• Just click on the data you want to aggregate in “Variables”

[i] Just click and choose [ii] OK

Trang 29

summarize laborforce_growthrate pct_growthrate valueadded_growthrate techbalance_gro

#Command lines for descriptive statistics

Standard deviation

Long variable names are abbreviated in the output
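As a rough illustration of the summarize command shown above, the sketch below runs it on five made-up observations. The values are invented for illustration and are not taken from the OECD data.

```stata
* Toy data (made up) using two of the lecture's variable names
clear
input laborforce_growthrate pct_growthrate
 0.03  0.40
-0.01  0.10
 0.06  0.75
-0.03 -0.05
 0.02  0.55
end

* Observations, mean, standard deviation, min, and max for each variable
summarize laborforce_growthrate pct_growthrate

* The detail option adds percentiles, skewness, and kurtosis
summarize pct_growthrate, detail
```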

Trang 30

• In the by/if/in tab, you can restrict the sample and aggregate by group

[i] Check here [ii] Select the variable that defines the groups (for example, Europe_dummy)

Trang 31

#Descriptive statistics by groups

pct_growthrate
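The command on this slide is cut off in the preview. A typical way to get descriptive statistics by group is the bysort prefix, sketched here on made-up data. This is one plausible form, not necessarily the exact command the slide showed.

```stata
* Toy data (made up)
clear
input pct_growthrate europe_dummy
 0.40 1
 0.10 1
 0.75 0
-0.05 1
 0.55 0
end

* One summary table per value of the grouping variable
bysort europe_dummy: summarize pct_growthrate

* An if qualifier restricts the sample to a single group instead
summarize pct_growthrate if europe_dummy == 1
```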

Trang 32

• Statistics > Summaries, tables and tests > Summary and descriptive statistics > Correlations and covariances

#Correlations

Trang 33

• Correlations between variables (cont.)

Trang 34

correlate valueadded_growthrate techbalance_growth_m_usd techbalance_growth_dummy laborforce_growthrate pct_growthrate eu_dummy

Trang 35

• Drawing a histogram

Trang 36

Select a variable

Trang 37

• Drawing a histogram (cont.): Results

Trang 38

• You can create a histogram for each group in the “By” tab

[i]Click “By”

[ii] Select variables to use for grouping

Trang 39

• Drawing a histogram by groups

# Drawing a histogram by groups
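The command itself is missing from the preview. The "By" tab of the dialog corresponds to the by() option of histogram, so a sketch on made-up data could look like this.

```stata
* Toy data (made up)
clear
input pct_growthrate europe_dummy
 0.40 1
 0.10 1
 0.75 0
-0.05 1
 0.55 0
 0.20 1
 0.90 0
 0.05 1
 0.35 0
 0.60 1
end

* Histogram of the whole sample, with counts on the vertical axis
histogram pct_growthrate, frequency

* One panel per group
histogram pct_growthrate, frequency by(europe_dummy)
```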

Trang 40

• Graphics Menu>Twoway graph (scatter, line, etc.)

Trang 41

• Drawing a scatter chart

[i]Click Create

Trang 42

#Drawing a scatter chart

[i] Select Scatter in the basic plots

[ii] Select the variable for each axis

[iii] Press Accept to return to the previous screen, then press OK
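The dialog steps above can also be typed as a command. The sketch below uses made-up data, and the choice of which variable goes on which axis is only an example.

```stata
* Toy data (made up)
clear
input valueadded_growthrate pct_growthrate
 0.12  0.40
 0.05  0.10
 0.20  0.75
-0.02 -0.05
 0.15  0.55
 0.08  0.20
end

* The first variable goes on the vertical axis, the second on the horizontal axis
twoway scatter valueadded_growthrate pct_growthrate

* Overlay a fitted straight line on the same plot
twoway (scatter valueadded_growthrate pct_growthrate) (lfit valueadded_growthrate pct_growthrate)
```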

Trang 43

• Drawing a scatter chart (cont.): Results

Trang 44

plot matrix

• Graphics > Scatterplot matrix

Trang 45

• Drawing a scatter plot matrix


Select variables

#Drawing a scatter plot matrix

pct_growthrate eu_dummy
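Only the tail of the command is visible in the preview. A scatter plot matrix is drawn with graph matrix, so a sketch on made-up data might be the following. The variable list here is an assumption.

```stata
* Toy data (made up)
clear
input valueadded_growthrate laborforce_growthrate pct_growthrate eu_dummy
 0.12  0.03  0.40 1
 0.05 -0.01  0.10 1
 0.20  0.06  0.75 0
-0.02 -0.03 -0.05 1
 0.15  0.02  0.55 0
 0.08  0.01  0.20 1
end

* One scatter plot for every pair of the listed variables
graph matrix valueadded_growthrate laborforce_growthrate pct_growthrate eu_dummy
```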

Trang 46

• Graphics > Box plot

Trang 47

• Drawing a box plot

#Drawing a box plot

Trang 48

#Drawing a box plot by groups
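The box plot commands are not visible in the preview. In command form, a box plot and its grouped version can be sketched as follows on made-up data.

```stata
* Toy data (made up)
clear
input pct_growthrate europe_dummy
 0.40 1
 0.10 1
 0.75 0
-0.05 1
 0.55 0
 0.20 1
 0.90 0
 0.05 1
end

* Box plot for the whole sample
graph box pct_growthrate

* One box per group, side by side
graph box pct_growthrate, over(europe_dummy)
```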

Trang 49

• Drawing a box plot by groups: Results

Trang 50

variable contains errors

and value added related variables

histograms, and scatter plots

Trang 51

• Answer

Trang 52

III Data Processing

Trang 53

• How to compute a new variable

Trang 54

[i] Fill in the name of the new variable

[ii] Click “Create”

Trang 55

• How to compute a new variable (cont.)

[i] The mathematical operation can be chosen from Function > Mathematical

[ii] You can choose a variable from Variables

#Create a new variable
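The command for this step is missing from the preview. New variables are created with generate. The sketch below uses made-up values and assumes the growth rate is defined as (2013 value - 2008 value) / 2008 value, which the slides do not state explicitly.

```stata
* Toy data (made up)
clear
input valueadded_2008_m_usd valueadded_2013_m_usd
1000 1200
 500  450
2000 2600
end

* Growth value: difference between the two years
generate valueadded_growth_m_usd = valueadded_2013_m_usd - valueadded_2008_m_usd

* Growth rate (assumed definition): growth value relative to the 2008 level
generate valueadded_growthrate = valueadded_growth_m_usd / valueadded_2008_m_usd

* A mathematical function: natural logarithm
generate ln_valueadded_2013 = ln(valueadded_2013_m_usd)

list
```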

Trang 57

• Save the modified

dataset [1]

#Save the dataset in a tab delimited format text file

delimiter(tab) replace

Input a file name Check “Tab-delimited”
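The options shown above, delimiter(tab) replace, belong to export delimited. As a sketch, the round trip below saves a made-up dataset as tab-delimited text and loads it back. The file name toy_dataset.txt is hypothetical.

```stata
* Toy data (made up)
clear
input str8 region_narrow laborforce_2013_thousands
"A" 1200
"B"  350
"C" 4800
end

* Save as a tab-delimited text file, overwriting any existing file
export delimited using "toy_dataset.txt", delimiter(tab) replace

* Load it back; the first row is read as variable names
import delimited using "toy_dataset.txt", delimiters("\t") varnames(1) clear

describe
```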

Trang 59

IV Regression Analysis

Trang 60

influence of each factor

Performance = a*Factor 1 + b*Factor 2 + c

[Figure: Regression Analysis, showing the coefficients a, b and the constant c]

(Note) In general the green plane is not a triangle; it is one in this example because Factor 1 and Factor 2 are restricted to > 0 and Performance to < p.

Trang 61

• Key terms

• Dependent (explained) variable: the variable to be estimated. In many cases, a performance indicator

• Explanatory variables: variables that affect (or are thought to be strongly correlated with) the dependent variable
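In Stata the dependent variable comes first in regress, followed by the explanatory variables. The sketch below uses made-up data, and the choice of dependent and explanatory variables is only an example.

```stata
* Toy data (made up)
clear
input valueadded_growthrate laborforce_growthrate pct_growthrate
 0.12  0.03  0.40
 0.05 -0.01  0.10
 0.20  0.06  0.75
-0.02 -0.03 -0.05
 0.15  0.02  0.55
 0.08  0.01  0.20
 0.30  0.08  0.90
 0.01 -0.02  0.05
 0.18  0.05  0.35
 0.10  0.00  0.60
end

* Performance = a*Factor 1 + b*Factor 2 + c
regress valueadded_growthrate laborforce_growthrate pct_growthrate

* The estimated a, b, and c are stored as coefficients
display "a = " _b[laborforce_growthrate]
display "b = " _b[pct_growthrate]
display "c = " _b[_cons]
```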

Trang 62

2 Significantly

3 Not significant / Significantly (+): the positive impact is non-linear

4 Significantly (+) / Not significant: a linear positive impact

Trang 63

• What can be used as an explanatory variable? (cont.)

• Use when the explanatory variable works differently depending on some condition (to check a moderator effect)

• Estimate it together with each of the explanatory variables and compare the degree of influence of both

Trang 64

ii Interaction term (cont.)

• Notes:

• An interaction term often causes multicollinearity with the original explanatory variables: centering or standardization is needed

• Centering: original value - mean value

• Standardization: (original value - mean) / standard deviation

• If the two explanatory variables differ greatly in scale, the interaction term will be dominated by one of them: standardize them or align their number of digits
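Centering and standardization as defined above can be sketched in Stata as follows. The data are made up, and the new variable names (labor_c, pct_c, pct_z, labor_x_pct) are hypothetical.

```stata
* Toy data (made up)
clear
input laborforce_growthrate pct_growthrate
 0.03  0.40
-0.01  0.10
 0.06  0.75
-0.03 -0.05
 0.02  0.55
 0.01  0.20
end

* Centering: original value minus the mean
summarize laborforce_growthrate, meanonly
generate labor_c = laborforce_growthrate - r(mean)
summarize pct_growthrate, meanonly
generate pct_c = pct_growthrate - r(mean)

* Standardization: (original value - mean) / standard deviation
egen pct_z = std(pct_growthrate)

* Interaction term built from the centered variables
generate labor_x_pct = labor_c * pct_c

summarize labor_c pct_c pct_z labor_x_pct
```

The centered variables have mean zero, which usually lowers the correlation between the interaction term and its components.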

Trang 65

• What can be used as an explanatory variable? (cont.)

• A dummy variable takes 1 if a specific condition is fulfilled, and 0 otherwise

• Useful for controlling for differences in conditions or affiliations

Bolton, R. N., & Chapman, R. G. (1986). Searching for positive returns at the track: A multinomial logit model for handicapping horse races. Management Science, 32(8), 1040-1060.

(Example)

Previous race win dummy:

Takes 1 if the horse won in the previous race

(Source) JRA
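A dummy variable like Laborforce_growth_dummy in the dataset can be built from a logical expression. The sketch below is one way to do it on made-up data, and is not necessarily how the lecture dataset was prepared.

```stata
* Toy data (made up); the fourth observation is missing
clear
input laborforce_growthrate
 0.03
-0.01
 0.06
 .
 0.02
end

* 1 if the growth rate is positive, 0 otherwise.
* The if qualifier keeps missing values missing (Stata treats . as larger than any number).
generate laborforce_growth_dummy = (laborforce_growthrate > 0) if !missing(laborforce_growthrate)

tabulate laborforce_growth_dummy, missing
```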

Trang 66

i All explanatory variables are data obtained from an experiment.

(They are not random variables, i.e. not uncertain values that range over some interval)

ii The expected value of the error is 0

• The error term is not unevenly distributed (see next page)

• No variable that explains the explained variable is missing

• There are no variables that affect both the explained variable and the explanatory variable.

• This is also stated as “there is no endogeneity” or "the error term is uncorrelated with the explanatory variables"

Trang 67

• Conditions under which OLS can be used

Check with the Breusch-Pagan test or an LM test

If there is uneven dispersion of the error (heteroskedasticity):

1 Add the missing variables to the model

2 Log-transform the explanatory variables and the explained variable

3 Use robust standard errors

4 Estimate by the weighted least squares method (details and practice omitted) or the maximum likelihood method
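The check and remedy 3 above can be sketched in Stata as follows. The data are made up and the model specification is only an example.

```stata
* Toy data (made up)
clear
input valueadded_growthrate laborforce_growthrate pct_growthrate
 0.12  0.03  0.40
 0.05 -0.01  0.10
 0.20  0.06  0.75
-0.02 -0.03 -0.05
 0.15  0.02  0.55
 0.08  0.01  0.20
 0.30  0.08  0.90
 0.01 -0.02  0.05
 0.18  0.05  0.35
 0.10  0.00  0.60
end

* Ordinary OLS fit
regress valueadded_growthrate laborforce_growthrate pct_growthrate

* Breusch-Pagan test: a small p-value suggests uneven dispersion of the error
estat hettest

* Same model with robust standard errors (coefficients unchanged, standard errors adjusted)
regress valueadded_growthrate laborforce_growthrate pct_growthrate, vce(robust)
```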

Trang 68

• iv) No correlation between the error term and the explanatory variables

= no endogeneity (here: no omitted variable bias)

[Figure: path diagram. Nodes: Amount of knowledge; Research time; Luck; Highly rated research papers (evaluation from instructors / awards / number of citations). Factors that cannot be measured appear in the error term and are correlated with the amount of knowledge.]

Example: a case where the seminar instructor's influence affects both the number of accessible articles and the evaluation

• The pure effect of the amount of knowledge cannot be estimated as long as how smart the person is cannot be measured. This must be considered before the analysis.

The Durbin-Wu-Hausman test detects endogeneity.

If there is endogeneity, possible solutions:

• Fixed-effects model estimation on panel data

• Adding control variables

• Adopting the instrumental variables (IV) method

Trang 69

• Conditions under which OLS can be used

• iv) No correlation between the error term and the explanatory variables

= no endogeneity (here: no omitted variable bias)

• Phenomena observed when omitted variable bias exists

• R2 is low (the model's explanatory power is weak)

• Explanatory variables or control variables that previous studies using the same explained variable found to have a significant influence have not been added (control variables are not central to the causal model, but they affect the explained variable)

Solution

- Check the previous research carefully!

Trang 70

• iv) No correlation between the error term and the explanatory variables

= no endogeneity (here: no simultaneity bias or reverse causality)

[Figure: path diagram. Nodes: Amount of knowledge; Research time; Highly rated research papers (evaluation from instructors / awards / number of citations). Measures: number of papers read; number of hours devoted to research.]

Example: a case where being known for writing good papers lets you concentrate on research

• When the causality is circular, the effect cannot be calculated correctly.

This must be considered before the analysis. It is detectable by the Durbin-Wu-Hausman test.

Trang 71

• Conditions under which OLS can be used

• If the error is not normally distributed, significance tests on the estimated slope become unreliable, especially in small samples

[Figure: frequency of the error values in the actual sample]

Confirm whether the residuals are normally distributed with the skewness/kurtosis test or the Shapiro-Wilk normality test

If they are not normally distributed:

1 Log-transform or square the dependent variable and the explanatory variables

2 Estimate by the maximum likelihood method, e.g. a Poisson model, Probit model, or Tobit model

However, if the sample is large enough (a few hundred observations), no verification is required
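Both normality tests named above are available as Stata commands (sktest and swilk). The sketch below applies them to the residuals of a regression on made-up data. The variable name resid is hypothetical.

```stata
* Toy data (made up)
clear
input valueadded_growthrate laborforce_growthrate pct_growthrate
 0.12  0.03  0.40
 0.05 -0.01  0.10
 0.20  0.06  0.75
-0.02 -0.03 -0.05
 0.15  0.02  0.55
 0.08  0.01  0.20
 0.30  0.08  0.90
 0.01 -0.02  0.05
 0.18  0.05  0.35
 0.10  0.00  0.60
end

regress valueadded_growthrate laborforce_growthrate pct_growthrate

* Store the residuals of the fitted model
predict resid, residuals

* Skewness/kurtosis test: a small p-value suggests non-normal residuals
sktest resid

* Shapiro-Wilk test: same reading of the p-value
swilk resid
```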

Trang 72

• vi) No strong correlation between explanatory variables: no multicollinearity

• Multicollinearity: among highly correlated explanatory variables it cannot be determined which one has the influence, and the estimated coefficients become inaccurate

• Observed phenomena

• Although the coefficient of determination is high, the t value of each explanatory variable is low (not significant)

• Abnormally high standard errors

• The sign (+ or -) of a coefficient differs from the sign estimated by a model that includes only one of the correlated explanatory variables.

Check: obtain the VIF (variance inflation factor) and confirm whether any variable shows a value of 4 or more (or 10 or more)

If there is multicollinearity:

1 Eliminate unnecessary explanatory variables

2 Convert the explanatory variables to differences or ratios

3 Apply factor analysis or principal component analysis to the explanatory variables to create uncorrelated composite variables
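The VIF check can be run right after a regression with estat vif. This is a sketch on made-up data. The thresholds of 4 and 10 are the ones quoted in the slide.

```stata
* Toy data (made up)
clear
input valueadded_growthrate laborforce_growthrate pct_growthrate eu_dummy
 0.12  0.03  0.40 1
 0.05 -0.01  0.10 1
 0.20  0.06  0.75 0
-0.02 -0.03 -0.05 1
 0.15  0.02  0.55 0
 0.08  0.01  0.20 1
 0.30  0.08  0.90 0
 0.01 -0.02  0.05 1
 0.18  0.05  0.35 0
 0.10  0.00  0.60 1
end

* Look at pairwise correlations first
correlate laborforce_growthrate pct_growthrate eu_dummy

regress valueadded_growthrate laborforce_growthrate pct_growthrate eu_dummy

* Variance inflation factor for each explanatory variable
estat vif
```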

Trang 73

• 7 steps in regression analysis

1 Design the causal relationship model and translate it into indicators

• Make a model without endogeneity (omitted variable bias, simultaneity bias)

• The sample should be large (at least the number of explanatory variables × 2 or 3, plus 10)

2 Create descriptive statistics & a correlation matrix

• Be sure to create a histogram to verify the distribution

• If the dependent variable is not normally distributed, estimation methods other than OLS should also be considered

• If the explanatory variables differ in their number of digits, rescale them, e.g. multiply by 1,000 or by 1/1,000

• For explanatory variables whose correlation is too strong, either drop one of them or check for multicollinearity later

Trang 74

3 Make two models: one with only the control variables, and one with the explanatory variables added

• Compare the R2 of both models and see the contribution of the explanatory variables

4 If the model contains strongly correlated variables, check whether there is multicollinearity

• Check the VIF: is it 4 or more (or 10 or more)?

• In the case of multicollinearity, drop one variable, transform a variable, or aggregate the variables by principal component analysis, etc.
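Step 3 can be sketched as two regress calls whose R2 values are stored and compared. The data are made up, and treating eu_dummy as the control variable is only an assumption for illustration.

```stata
* Toy data (made up)
clear
input valueadded_growthrate laborforce_growthrate pct_growthrate eu_dummy
 0.12  0.03  0.40 1
 0.05 -0.01  0.10 1
 0.20  0.06  0.75 0
-0.02 -0.03 -0.05 1
 0.15  0.02  0.55 0
 0.08  0.01  0.20 1
 0.30  0.08  0.90 0
 0.01 -0.02  0.05 1
 0.18  0.05  0.35 0
 0.10  0.00  0.60 1
end

* Model with control variables only
regress valueadded_growthrate eu_dummy
scalar r2_controls = e(r2)

* Model with the explanatory variables added
regress valueadded_growthrate eu_dummy laborforce_growthrate pct_growthrate
scalar r2_full = e(r2)

* Contribution of the explanatory variables to R2
display "R2 (controls only) = " r2_controls
display "R2 (full model)    = " r2_full
display "Difference         = " r2_full - r2_controls
```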
