
Bayesian Data Analysis for Animal Scientists: The Basics

Agustín Blasco

Institute of Animal Science and Technology
Universitat Politècnica de València
València, Spain

ISBN 978-3-319-54273-7 ISBN 978-3-319-54274-4 (eBook)

DOI 10.1007/978-3-319-54274-4

Library of Congress Control Number: 2017945825

© Springer International Publishing AG 2017

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature

The registered company is Springer International Publishing AG

The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland


To Daniel and Daniel, from whom I have learnt so much in these matters


What we now call ‘Bayesian analysis’ was the standard approach to statistics in the nineteenth century and the first quarter of the twentieth century. For example, Gauss’s first deduction of the least squares method was made by using Bayesian procedures. A strong reaction against Bayesian methods, led by Ronald Fisher, Jerzy Neyman and Egon Pearson, took place in the 1920s and 1930s, with the result that Bayesian statistics was replaced by the current ‘frequentist’ methods of P-values, significance tests, confidence intervals and likelihood estimation. The reasons were in part philosophical and in part practical; Bayesian procedures often needed to solve complex multidimensional integrals. The philosophical issues were mainly related to the use of prior information inherent to Bayesian analysis; it seemed that for most cases informative priors had to be subjective, and moreover, it was not possible to represent ignorance.

When the computer era arrived, the practical problems found a solution in techniques known as ‘Markov chain Monte Carlo’ (MCMC), and a new interest in Bayesian procedures arose. Subjective informative priors were almost abandoned, and ‘objective’ priors with little information were used instead, expecting that, with enough data, the priors would not affect the results in practice. It was soon discovered that some difficult problems in biological sciences, particularly in genetics, could be approached much more easily using Bayesian techniques, and some problems that had no solution using frequentist statistics could be solved with this new tool. The geneticists Daniel Gianola and Daniel Sorensen brought MCMC procedures into animal breeding in the 1990s to solve animal breeding problems using Bayesian statistics, and their use became more and more common. Nowadays, Bayesian techniques are common in animal breeding, and there is no reason why they should not be applied in other fields of animal production.

The objective of this book is to offer the animal production scientist an introduction to this methodology, offering examples of its advantages for analysing common problems like comparison between treatments, regression or linear mixed model analyses. Classical statistical tools are often not well understood by practical scientists. Significance tests and P-values are misused so often that in some fields of research it has been suggested to eliminate them. I think that Bayesian procedures give much more intuitive results and provide easy tools, helping practical scientists to be more accurate when explaining the results of their research. Substituting stars and ‘n.s.’ by probabilities of obtaining a relevant difference between treatments helps not only in understanding results but in making further decisions. Moreover, some difficult problems have a straightforward Bayesian procedure to be solved that does not need new conceptual tools but the use of the principles learnt for simpler problems.

This book is based on the lecture notes of an introductory Bayesian course I gave in Scotland (Edinburgh, Scottish Agricultural College), France (Toulouse, Institut National de la Recherche Agronomique), Spain (Valencia, Universitat Politècnica de València), the USA (Madison, University of Wisconsin), Uruguay (Montevideo, Universidad Nacional), Brazil (Botucatu, Universidade Estadual Paulista; Lavras, Universidade Federal de Lavras) and Italy (Padova, Università di Padova). The book analyses common cases that can be found in animal production using practical examples, deferring demonstrations to appendices to facilitate reading the chapters. I have preferred to be more intuitive than rigorous, to help the reader in his first contact with Bayesian inference. The data are almost always considered to be Normally distributed conditioned to the unknowns, because the Normal distribution is the most common in animal production and all derivations are similar using other distributions. The book is structured into 10 chapters. The first one reviews the classical statistics concepts, stressing the common misinterpretations; quite a few practical scientists will discover that the techniques they were applying do not mean what they thought. Chapter 2 offers the Bayesian possibilities for common analyses like comparisons between treatments; here, the animal production scientist will see new ways of presenting the results: probability of a difference between treatments being relevant, minimum guaranteed value with a chosen probability, when it can be said ‘there is no difference between treatments’ and when ‘we do not know whether there is a difference or not’, etc. Chapter 3 gives elementary notions about distribution functions. Chapter 4 introduces MCMC procedures intuitively. Chapters 5, 6 and 7 analyse the linear model from its simplest form (only the mean and the error) to the complex multitrait mixed models. Chapter 8 gives some examples of complex problems that have no straightforward solution under classical statistics. Chapter 9 deals with all the problems related to prior information, the main object of criticism of Bayesian statistics in the past. Chapter 10 deals with the complex problem of model selection from both frequentist and Bayesian points of view. Although this is a book for practical scientists, not necessarily interested in the philosophy of Bayesian inference, I thought it would be possible to communicate the philosophical problems that made frequentism and Bayesianism two irreconcilable options in the view of classical statisticians like Fisher, Jeffreys, Pearson (father and son), Neyman, Lindley and many others. Following a classical tradition in philosophy, I wrote three dialogues between a frequentist statistician and a Bayesian one, in which they discuss the problem of using probability as a degree of belief, the limitations of the classical probability theory, the Bayesian solution and the difficult problem of induction, i.e. the reasons why we think our inferences are correct. I took Hylas and Philonous, the characters invented by Bishop Berkeley in the eighteenth century for his dialogues, assigning the frequentist role to Hylas and the Bayesian to Philonous, and I also placed the dialogues in the eighteenth century, when Bayes published his famous theorem, to make them entertaining and give some sense of humour to a book dedicated to such a rather arid matter as statistics.

I have to acknowledge many people who have contributed to this book. First of all, the colleagues who invited me to give the introductory Bayesian course on which this book is based; the experience of teaching this matter in several countries and in several languages has been invaluable for trying to be clear and concise in the explanations of all the statistical concepts and procedures contained in the book. I am grateful to José Miguel Bernardo for his suggestions about how a practical scientist can use the Bayesian credibility intervals for most common problems, developed in Chap. 2, and to Luis Varona for the intuitive interpretation of how Gibbs sampling works. I should also be grateful to all the colleagues who have read the manuscript, corrected my faults and given advice; I am especially grateful to Manolo Baselga for his detailed reading line by line and also to the colleagues who reviewed all or part of the book, Miguel Toro, Juan Manuel Serradilla, Quim Casellas, Luis Varona, Noelia Ibáñez and Andrés Legarra, although all the errors that could be found in the book are my entire responsibility. Finally, I am grateful to my son Alejandro, educated in British universities, for his thorough English revision of this book and, of course, to Concha for her patience and support when I was dedicating so many weekends to writing the book.


Colour

To help in identifying the functions used in this book, when necessary, we have used red colour for the variables and black for the constants or given parameters. Thus, [displayed formula not reproduced in this extraction] is the probability density function of a Normal distribution, because the variable is y, but the conditional [formula and the remainder of this example are not reproduced in this extraction].

Probabilities and Probability Distributions

The letter ‘P’ is used for probabilities; for example, P(a ≤ x ≤ b) means the probability of x being between a and b.

The letter ‘f’ is used to represent probability density functions; e.g. f(x) can be the probability density function of a Normal distribution or a binomial distribution. It is equivalent to the common use of the letter ‘p’ for probability density functions, but using ‘f’ stresses that it is a function.

We will use the word Normal (with a capital first letter) for the Gaussian distribution, to avoid confusion with the common word ‘normal’.


Scalars, Vectors and Matrices

Bold small letters are column vectors, e.g. y = [1 0 5]′ (a 3 × 1 column vector).

y′ is a transposed (row) vector, e.g. y′ = [1 0 5].

Bold capital letters are matrices, e.g. A = [8 1; 0 6] (a 2 × 2 matrix written row by row).

Lower case letters are scalars, e.g. y1 = 7.

Greek letters are parameters, e.g. σ² is a variance.

f(y) = k + exp(−y²) is not proportional to exp(−y²).

f(y) = exp(cy − y²) is not proportional to exp(−y²).


1 Do We Understand Classic Statistics? 1

1.1 Historical Introduction 1

1.2 Test of Hypothesis 4

1.2.1 The Procedure 4

1.2.2 Common Misinterpretations 7

1.3 Standard Errors and Confidence Intervals 13

1.3.1 Definition of Standard Error and Confidence Interval 13

1.3.2 Common Misinterpretations 14

1.4 Bias and Risk of an Estimator 16

1.4.1 Unbiased Estimators 16

1.4.2 Common Misinterpretations 16

1.5 Fixed and Random Effects 18

1.5.1 Definition of ‘Fixed’ and ‘Random’ Effects 18

1.5.2 Shrinkage of Random Effects Estimates 19

1.5.3 Bias, Variance and Risk of an Estimator when the Effect is Fixed or Random 20

1.5.4 Common Misinterpretations 21

1.6 Likelihood 22

1.6.1 Definition 22

1.6.2 The Method of Maximum Likelihood 24

1.6.3 Common Misinterpretations 25

Appendix 1.1 26

Appendix 1.2 27

Appendix 1.3 28

Appendix 1.4 29

References 30

2 The Bayesian Choice 33

2.1 Bayesian Inference 33

2.1.1 The Foundations of Bayesian Inference 33

2.1.2 Bayes Theorem 34

2.1.3 Prior Information 36

2.1.4 Probability Density 40



2.2 Features of Bayesian Inference 42

2.2.1 Point Estimates: Mean, Median and Mode 42

2.2.2 Credibility Intervals 44

2.2.3 Marginalisation 49

2.3 Test of Hypothesis 51

2.3.1 Model Choice 51

2.3.2 Bayes Factors 52

2.3.3 Model Averaging 53

2.4 Common Misinterpretations 54

2.5 Bayesian Inference in Practice 57

2.6 Advantages of Bayesian Inference 61

Appendix 2.1 62

Appendix 2.2 63

Appendix 2.3 63

References 64

3 Posterior Distributions 67

3.1 Notation 67

3.2 Probability Density Function 68

3.2.1 Definition 68

3.2.2 Transformation of Random Variables 69

3.3 Features of a Distribution 71

3.3.1 Mean 71

3.3.2 Median 71

3.3.3 Mode 72

3.3.4 Credibility Intervals 72

3.4 Conditional Distributions 72

3.4.1 Bayes Theorem 72

3.4.2 Conditional Distribution of the Sample of a Normal Distribution 73

3.4.3 Conditional Posterior Distribution of the Variance of a Normal Distribution 73

3.4.4 Conditional Posterior Distribution of the Mean of a Normal Distribution 75

3.5 Marginal Distributions 76

3.5.1 Definition 76

3.5.2 Marginal Posterior Distribution of the Variance of a Normal Distribution 77

3.5.3 Marginal Posterior Distribution of the Mean of a Normal Distribution 78

Appendix 3.1 80

Appendix 3.2 81

Appendix 3.3 82

Appendix 3.4 83

Reference 84


4 MCMC 85

4.1 Samples of Marginal Posterior Distributions 86

4.1.1 Taking Samples of Marginal Posterior Distributions 86

4.1.2 Making Inferences from Samples of Marginal Posterior Distributions 87

4.2 Gibbs Sampling 91

4.2.1 How It Works 91

4.2.2 Why It Works 92

4.2.3 When It Works 94

4.2.4 Gibbs Sampling Features 95

4.3 Other MCMC Methods 98

4.3.1 Acceptance-Rejection 98

4.3.2 Metropolis–Hastings 100

Appendix: Software for MCMC 101

References 102

5 The Baby Model 103

5.1 The Model 103

5.2 Analytical Solutions 104

5.2.1 Marginal Posterior Density Function of the Mean and Variance 104

5.2.2 Joint Posterior Density Function of the Mean and Variance 105

5.2.3 Inferences 105

5.3 Working with MCMC 109

5.3.1 The Process 109

5.3.2 Using Flat Priors 109

5.3.3 Using Vague Informative Priors 112

5.3.4 Common Misinterpretations 114

Appendix 5.1 115

Appendix 5.2 116

Appendix 5.3 117

References 118

6 The Linear Model: I The ‘Fixed Effects’ Model 119

6.1 The ‘Fixed Effects’ Model 119

6.1.1 The Model 119

6.1.2 Example 124

6.1.3 Common Misinterpretations 125

6.2 Marginal Posterior Distributions via MCMC Using Flat Priors 127

6.2.1 Joint Posterior Distribution 127

6.2.2 Conditional Distributions 128

6.2.3 Gibbs Sampling 129


6.3 Marginal Posterior Distributions via MCMC Using Vague Informative Priors 130

6.3.1 Vague Informative Priors 130

6.3.2 Conditional Distributions 131

6.4 Least Squares as a Bayesian Estimator 132

Appendix 6.1 133

Appendix 6.2 134

References 135

7 The Linear Model: II The ‘Mixed’ Model 137

7.1 The Mixed Model with Repeated Records 137

7.1.1 The Model 137

7.1.2 Common Misinterpretations 141

7.1.3 Marginal Posterior Distributions via MCMC 142

7.1.4 Gibbs Sampling 144

7.2 The Genetic Animal Model 145

7.2.1 The Model 145

7.2.2 Marginal Posterior Distributions via MCMC 150

7.3 Bayesian Interpretation of BLUP and REML 154

7.3.1 BLUP in a Frequentist Context 154

7.3.2 BLUP in a Bayesian Context 156

7.3.3 REML as a Bayesian Estimator 158

7.4 The Multitrait Model 158

7.4.1 The Model 158

7.4.2 Data Augmentation 160

7.4.3 More Complex Models 163

Appendix 7.1 164

References 165

8 A Scope of the Possibilities of Bayesian Inference + MCMC 167

8.1 Nested Models: Examples in Growth Curves 168

8.1.1 The Model 168

8.1.2 Marginal Posterior Distributions 171

8.1.3 More Complex Models 173

8.2 Modelling Residuals: Examples in Canalising Selection 174

8.2.1 The Model 175

8.2.2 Marginal Posterior Distributions 176

8.2.3 More Complex Models 177

8.3 Modelling Priors: Examples in Genomic Selection 178

8.3.1 The Model 179

8.3.2 RR-BLUP 183

8.3.3 Bayes A 185

8.3.4 Bayes B 187

8.3.5 Bayes C and Bayes Cπ 188

8.3.6 Bayes L (Bayesian Lasso) 188

8.3.7 Bayesian Alphabet in Practice 189


Appendix 8.1 190

References 191

9 Prior Information 193

9.1 Exact Prior Information 193

9.1.1 Prior Information 193

9.1.2 Posterior Probabilities with Exact Prior Information 195

9.1.3 Influence of Prior Information in Posterior Probabilities 197

9.2 Vague Prior Information 198

9.2.1 A Vague Definition of Vague Prior Information 198

9.2.2 Examples of the Use of Vague Prior Information 200

9.3 No Prior Information 203

9.3.1 Flat Priors 204

9.3.2 Jeffreys Prior 205

9.3.3 Bernardo’s ‘Reference’ Priors 206

9.4 Improper Priors 207

9.5 The Achilles Heel of Bayesian Inference 208

Appendix 9.1 209

Appendix 9.2 210

References 210

10 Model Selection 213

10.1 Model Selection 213

10.1.1 The Purpose of Model Selection 213

10.1.2 Fitting Data vs Predicting New Records 217

10.1.3 Common Misinterpretations 218

10.2 Hypothesis Tests 221

10.2.1 Likelihood Ratio Test and Other Frequentist Tests 221

10.2.2 Bayesian Model Choice 223

10.3 The Concept of Information 226

10.3.1 Fisher’s Information 227

10.3.2 Shannon Information and Entropy 231

10.3.3 Kullback–Leibler Information 232

10.4 Model Selection Criteria 233

10.4.1 Akaike Information Criterion (AIC) 233

10.4.2 Deviance Information Criterion (DIC) 237

10.4.3 Bayesian Information Criterion (BIC) 239

10.4.4 Model Choice in Practice 241

Appendix 10.1 242

Appendix 10.2 243

Appendix 10.3 244

Appendix 10.4 245

References 246


Appendix: The Bayesian Perspective—Three New Dialogues Between Hylas and Philonous 247

References 265

Index 271


1 Do We Understand Classic Statistics?

Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong.

Jerzy Neyman and Egon Pearson, 1933

In this chapter, we review the classical statistical concepts and procedures (tests of hypothesis, standard errors and confidence intervals, unbiased estimators, maximum likelihood, etc.), and we examine the most common misunderstandings about them. We will see the limitations of classical statistics in order to stress the advantages of using Bayesian procedures in the following chapters.

1.1 Historical Introduction

The Bayesian School was, in practice, founded by the French aristocrat and politician Marquis Pierre-Simon Laplace via several works published from 1774 to 1812, and it had a preponderant role in scientific inference during the nineteenth century (Hald 1998). A few years before Laplace’s first paper on the matter, the same principle was formalised in a posthumous paper presented at the Royal Society of London and attributed to a rather obscure priest, Revd Thomas Bayes, of whom we know very little; even his supposed portrait is probably false (Bellhouse 2004). Bayes wrote his essay to refute Hume’s attack on religion found in his essay ‘Of Miracles’, in which it was argued that miracles had a great improbability in comparison to the probability that they were not accurately reported (Stigler 2013). Apparently, the principle upon which Bayesian inference is based was formulated even before; Stigler (1983) attributes it to Saunderson (1683–1739), a blind professor of optics who published a large number of papers on several fields of mathematics. Due to the work of Laplace, what we now call Bayesian techniques were commonly used along the nineteenth and the first few decades of the twentieth century. For example, the first deduction of the least squares method, made by Gauss in 1795 (although it was published in 1809), was made using Bayesian theory. At that time these techniques were known as ‘inverse probability’, because their objective was to estimate the parameters from the data (i.e. to find the probability of the causes from their effects); direct probability takes place when the probability distribution originating the data is known, and the probability of a sample is calculated from it (e.g. throwing a die). The word ‘Bayesian’ is rather recent (Fienberg 2006), and it was introduced by Fisher (1950) to stress the precedence in time (not in importance) of the work of Revd Bayes. As the essay of Bayes was largely ignored, and given that Laplace was not aware of the work of Bayes when proposing the same principle, the correct name for this school should probably be ‘Laplacian’, or perhaps we should preserve the old and explicit name of ‘inverse probability’. Nevertheless, as it is commonplace today, we will use the name ‘Bayesian’ throughout this book.

Bayesian statistics uses probability to express the uncertainty about the unknowns that are being estimated. The use of probability is more efficient than any other method of expressing uncertainty (Ramsey 1931). Unfortunately, to make this possible, inverse probability needs the knowledge of some prior information. For example, we know through published literature that in many countries the pig breed Landrace has a litter size of around 10 piglets. We then perform an experiment to learn about Spanish Landrace litter size, and a sample of five sows is evaluated for litter size in their first parity. Let us say an average of six born piglets is observed; this seems a very unlikely outcome a priori, and we should not put much trust in our sample. However, it is not entirely clear how to integrate properly the prior information about Landrace litter size in our analysis. We can pretend we do not have any prior information and say that all the possible results have the same prior probability, but this leads to some inconsistencies, as we will see later in Chap. 9. Laplace became aware of this problem, and in his later works, he examined the possibility of making inferences based on the distribution of the samples rather than on the probability of the unknowns, founding in practice the frequentist school (Hald 1998).

Fisher’s work on the likelihood function in the 1920s and the frequentist work of Neyman and Pearson in the 1930s eclipsed Bayesian inference, the reason being that they offered inferences and measured uncertainty about these inferences without the need of prior information. Fisher developed the properties of the method of maximum likelihood, a method attributed to him, although Daniel Bernoulli proposed it as early as 1778 (Kendall 1961) and Johann Heinrich Lambert in 1760 (Hald 1998). When Fisher discussed this method (Fisher 1935), one of the discussants of his paper noticed that the statistician and economist of Irish-Catalan origin Francis Ysidro Edgeworth had proposed it in 1908. It is likely that Fisher did not know of this article when he first proposed to use the likelihood in an obscure paper published in 1912,¹ when he was 22 years old, but it is remarkable that Fisher never cited in his life any precedent of his work about likelihood. His main contribution was to determine the statistical properties of the likelihood and to develop the concept of information based on it. Neyman and Pearson (1933) used the likelihood ratio as a useful way to perform hypothesis tests. Their theory was based on considering hypothesis tests as a decision problem, choosing between a hypothesis and an alternative, a procedure that Fisher disliked. Fisher considered that when a null hypothesis is not rejected, we cannot assume straight away that this hypothesis is true, but instead just take this hypothesis as provisional (Fisher 1925, 1935), very much in the same sense as Popper’s theory of refutation (Popper 1934). It is interesting to note that Popper’s famous theory of conjectures and refutations was published independently of Fisher’s around the same time. Although some researchers still used Bayesian methods in the 1930s, as the geologist and statistician Harold Jeffreys did (Jeffreys 1939), the classical Fisher-Neyman-Pearson school dominated the statistical world until the 1960s, when a ‘revival’ started that has been growing ever since. Bayesian statistics had three problems to be accepted: two main theoretical problems and a practical one. The first theoretical problem was the difficulty of integrating prior information. To overcome this difficulty, Ramsey (1931) and De Finetti (1937) proposed separately and independently to consider probability as a ‘belief’. Prior information should then be evaluated by experts and the probabilities assigned to different events according to the experts’ opinion. This procedure can work for a few traits or effects, but it has serious problems in the multivariate case. The other theoretical problem is how to represent ‘ignorance’ when there is no prior information or when we would like to assess the information provided by the data without prior considerations. This second theoretical problem is still the main difficulty for many statisticians to accept Bayesian theory, and it is nowadays an area of intense research. The third problem comes from the use of probability for expressing uncertainty, a characteristic of the Bayesian School. As we will see later, with the exception of very simple inferences, this leads to multiple integrals that cannot be solved even using approximate methods. This was a big problem for applying Bayesian techniques until the 1990s, when a numerical method was applied in order to find practical solutions to all these integrals. The method, called Markov chain Monte Carlo (MCMC), enabled the integrals to be solved, contributing to the current rapid development and application of Bayesian techniques in all fields of science.

¹ The historian of statistics A. Hald (1998) asks himself why such an obscure paper passed the referee’s report and was finally published. At that time, Fisher did not use the name ‘likelihood’.

Bayesian methods express the uncertainty using probability density functions that we will see in Chap. 3. The idea of MCMC is to provide a set of random samples extracted from a probability density function, instead of using the mathematical expression of this function. The first use of random samples from a probability density function was proposed by the Guinness brewer William Sealy Gosset, called ‘Student’, in his famous paper in which he presented the t-distribution (Student 1908). The use of Markov chains to find these random samples has its origins in the work developed in Los Alamos at the end of the Second World War, and it was first used by John von Neumann and Stanislaw Ulam for solving the problem of neutron diffusion in fissionable material. The method had its success due to the appearance of the first computer in 1946, the ENIAC, which made the computations feasible. The name ‘Monte Carlo’ was proposed by Nicholas Metropolis in 1947 (Metropolis 1987), using the Monte Carlo roulette as a kind of random sampling. Metropolis and Ulam (1949) published the first paper describing MCMC. Much later, Geman and Geman (1986) applied this method to an image analysis using a particularly efficient type of MCMC that they called ‘Gibbs sampling’ because they were using Gibbs distributions. Gelfand and Smith (1990) introduced this technique in the statistical world to obtain probability distributions, and Daniel Gianola (Wang et al. 1994) and Daniel Sorensen (Sorensen et al. 1994) brought these techniques into the field of animal breeding. These techniques present several numerical problems, and there is a very active area of research in this field; Sharon McGrayne has written a lively and entertaining history of this procedure (McGrayne 2011). Today, the association of Bayesian inference and MCMC techniques has produced a dramatic development of the application of Bayesian methods to practically every field of science.

Bayesian methods have been compared many times with frequentist methods. The old criticism of Bayesian methods usually lay in the lack of objectivity because of the use of subjective priors (see, e.g. Barnett 1999). The generalised use of what Bayesians call ‘objective priors’, often linked to the use of MCMC techniques, has changed the focus of the criticism. The reader who is interested in a critical view of these procedures, even acknowledging their usefulness, can consult the recent book of Efron and Hastie (2016). A critical comparison in the field of animal breeding has been provided by Blasco (2001). The point of view of the present book is that Bayesian techniques not only give results that are more practical for the animal production scientist, but they are also easier to understand than the classical frequentist results. It is rather frequent that the practical scientist misunderstands the meaning of the frequentist statistical tool that is used. We are now going to examine these tools and comment on frequent misinterpretations.

1.2 Test of Hypothesis

1.2.1 The Procedure

Let us start with a classical problem: we have an experiment in which we want to test whether there is an effect of some treatment; for example, we are testing whether a population selected for growth rate has a higher growth rate than a control population. What we wish to find is the probability of the selected population being higher than the control one. However, classical statistics does not provide an answer to this question; classical statistics cannot give the probability of one treatment being higher than another, which is rather frustrating. The classical procedure is to start with the hypothesis, called the ‘null hypothesis H0’, that there is no difference between treatments, i.e. that the difference between the means of the selected and control populations is m1 − m2 = 0. By repeating the experiment an infinite number of times,² we would obtain an infinite number of samples, and we could calculate the difference x̄1 − x̄2 between the averages of the samples for each repetition. If the null hypothesis is true, these differences will be grouped around zero (Fig. 1.1). Notice that although there is no difference between the selected and control populations (we assumed m1 − m2 = 0), the difference between our samples x̄1 − x̄2 will never be exactly zero, and by chance, it can be high. Let us consider the 5% of the highest differences (shadow area in Fig. 1.1).

² The frequentist concept of probability, considered as the limit of the frequency of an infinite number of trials, was mainly developed by Richard von Mises in 1928 (Von Mises 1957), although there are several precedents of this use (see Howie 2002 for an entertaining history of the controversy between frequentist and Bayesian probability). A discussion about the different definitions of probability can be found in Childers (2013). Responses to some criticism about the impossibility of repeating the same experiment can be found in Neyman (1977).

We would actually take only one sample. If our sample lay in the shadow area of Fig. 1.1, we could say that:

1. There is no difference between treatments, and our sample was a very rare sample that would only occur 5% of the times, at most, if we repeated the experiment an infinite number of times.

2. The treatments are different, and if we repeat the experiment an infinite number of times, the difference between the averages of the samples (x̄1 − x̄2) would not be distributed around zero but around an unknown value different from zero.
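The logic of the rejection region is easy to reproduce numerically. Below is a minimal simulation sketch of the conceptual repetitions of Fig. 1.1; the sample size, means and standard deviation are arbitrary assumptions, not values from the book:

```python
import numpy as np

rng = np.random.default_rng(1)
n_reps, n, sigma = 100_000, 20, 50.0  # repetitions, animals per group, SD (g/d): all assumed

# Under H0 both populations share the same mean, so both samples come from it
x1 = rng.normal(500.0, sigma, (n_reps, n)).mean(axis=1)
x2 = rng.normal(500.0, sigma, (n_reps, n)).mean(axis=1)
diffs = x1 - x2                       # x1_bar - x2_bar over conceptual repetitions

# One-tail 5% rejection region: the value exceeded by only 5% of null differences
threshold = np.percentile(diffs, 95)
print(f"null differences centred at {diffs.mean():.2f} g/d")
print(f"reject H0 when x1_bar - x2_bar > {threshold:.1f} g/d")
```

Any observed difference beyond this threshold lands in the shadow area of Fig. 1.1.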

Neyman and Pearson (1933) suggested that the ‘scientific behaviour’ should be to take option 2, acting as if the null hypothesis H0 were wrong. A result of this behaviour would be that ‘in the long run’ we would be right in almost 95% of the cases.

Fig. 1.1 Distribution of the difference between the averages of conceptually repeated samples x̄1 − x̄2, if H0 is true and there is no difference between treatments (m1 − m2 = 0). When our actual difference between sample averages lies in the shadow area, we reject H0 and say that the difference is ‘significant’. This is often represented by a star


Notice that this is not the probability of the treatments being different, but the probability of finding samples higher than a given value. The relationship between both concepts is not obvious, but theoretical considerations and simulation experiments show that the probability of the treatments being different is substantially lower than what the 95% level of rejection seems to provide. For example, a P-value of 0.05 would not give an evidence of 95% against the null hypothesis, but only about 70% (Berger and Sellke 1987; Johnson 2013).

We have stated, before making our experiment, the frequency with which we will say that there are differences between treatments when actually there are not, using a conventional value of 5%. This is called ‘Type I error’. There is some discussion in the classical statistical world about what to do when we do not reject the null hypothesis. In this case we can either say that we do not know whether the two treatments are different,³ or we can accept that both treatments have the same effect, i.e. that the difference between treatments is null. Fisher (1925) defended the first choice, whereas Neyman and Pearson (1933) defended the second one, stressing that we also have the possibility of being wrong by saying that there is no difference between treatments when actually this difference exists (they called it ‘Type II error’).

Often a ‘P-value’ accompanies the result of the test. The P-value is the probability of obtaining a difference between samples x̄1 − x̄2 equal to or higher than the actual difference found, when there is no difference between populations (m1 − m2 = 0) (Fig. 1.2).⁴ Notice that the P-value is not the probability of both treatments being different, but the probability of finding samples of the difference between treatments higher than ours. Nowadays P-values are used in association with significance tests, although their rationale for inference is different. P-values were proposed and used by Fisher as exploratory analyses to examine how much evidence we can get from our samples.⁵

Fig. 1.2 The P-value gives the probability of finding the current sample value or a higher value if the null hypothesis holds

³ This attitude to scientific progress was later exposed in a less technical way by Karl Popper (1934). Scientists and philosophers attribute to Popper the theory about scientific progress based on the refutation of pre-existing theories, whereas accepting the current theory is always provisional. However, Fisher (1925, 1935) based his testing hypothesis theory on the same principle. I do not know how far back the original idea can be traced, but it is contained at least in the famous essay ‘On Liberty’ of John Stuart Mill (1848).

⁴ Here we use a one-tail test for simplicity.
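For the Normal case this one-tail P-value has a closed form. The sketch below computes it for made-up numbers (group means, common SD and group size are assumptions for illustration):

```python
import numpy as np
from scipy.stats import norm

x1_bar, x2_bar = 530.0, 500.0   # observed group means (g/d), hypothetical
s, n = 50.0, 20                 # common SD and animals per group, hypothetical

se_diff = s * np.sqrt(2.0 / n)  # standard error of the difference of two means

# P-value: probability, under H0 (m1 - m2 = 0), of a difference at least
# as large as the one observed (one-tail, as the text assumes)
p_value = norm.sf((x1_bar - x2_bar) / se_diff)
print(f"P = {p_value:.4f}")
```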

A very low P-value gives more evidence of treatments being different than a higher P-value, but we do not know how much evidence, since P-values do not measure the probability of the populations being different. In other words, a P-value of 2% does not give twice as much evidence as a P-value of 4%. Moreover, the result of the test is established with the same Type I error, normally 5%, independently of the P-value we obtain, because we define the Type I error before making the experiment, and the P-value is obtained after the experiment is made. It is important to realise that the P-value changes if we repeat the experiment; thus, a P-value of 2% does not give a ‘significance at the level or threshold of 2%’, because this will change if we repeat the experiment.⁶ Modern statisticians use P-values to express the amount of evidence the sample gives, but there is still a considerable amount of discussion about how this ‘amount’ is measured, and no standard methods have hitherto been implemented (see Sellke et al. 2001; Bayarri and Berger 2004; Johnson 2013 for a discussion).

1.2.2 Common Misinterpretations

The error level is the probability of being wrong: It is not. We choose the error level before making the experiment; thus, a small-size or a big-size experiment can have the same error level. After the experiment is performed, we behave (accepting or rejecting the null hypothesis) as if we had a probability of 100% of being right, hoping to be wrong a small number of times along our career.

The error level is a measure of the percentage of times we will be right: This is not true. For example, you may accept an error level of 5% and find along your career that your data was always distributed far away from the limit of the rejection region (Fig. 1.3).

P-value is the probability of the null hypothesis being true, i.e. of not having differences between treatments: This is not true. The P-value gives the probability of finding the current sample value or a higher value, but we are not interested in how probable it is to find our sample value; we are interested in how probable our hypothesis is, and classical statistics does not have an answer for this question. A conscious statistician knows what a P-value means, but the problem is that P-values suggest more evidence to the average researcher than they actually have. For example, Berger and Sellke (1987) have noticed that a P-value of 0.05 corresponds to a probability of 30% of the null hypothesis being true, instead of the 5% that the P-value suggests. Johnson (2013) has shown that, when testing common null hypotheses, about one quarter of significant findings are false. This means that people interpreting P-values as the probability of the null hypothesis being true are wrong and far away from the real evidence. Johnson (2013) recommends the use of P-values of 0.005 as a new threshold for significance.

⁵ Neyman and Pearson never used P-values because they are not needed for accepting or rejecting hypotheses. However, it is noticeable how both Neyman and Pearson significance and Fisher’s P-values are now blended in modern papers just because P-values are now easy to compute. Often they create more confusion than help in understanding results.

⁶ Moreover, as Goodman (1993) says, this is like having a student ranking 15th out of 100 students and reporting that she is ‘within the top 15 percent’ of the class.
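A simulation in the spirit of these papers can show the gap between a P-value and the probability of the null hypothesis. This is only a sketch: the 1:1 prior odds for a true null, the effect size and the standard error are all arbitrary assumptions, and the exact output depends on them:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n_exp, se, effect = 500_000, 15.0, 20.0  # experiments, s.e. of difference, true effect: assumed

null_true = rng.random(n_exp) < 0.5      # in half of the experiments H0 really holds
true_diff = np.where(null_true, 0.0, effect)
obs_diff = rng.normal(true_diff, se)     # observed difference in each experiment
p = norm.sf(obs_diff / se)               # one-tail P-values

# Among experiments landing just at the edge of 'significance' (P near 0.05),
# how often was the null hypothesis actually true?
band = (p > 0.04) & (p < 0.05)
print(f"P(H0 true | P-value near 0.05) = {null_true[band].mean():.2f}")
```

Under settings like these, the fraction comes out far above the 5% that a naive reading of the P-value suggests.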

P-value is a measure of ‘significance’: This is not true. A P-value of 2% does not mean that the difference between treatments is ‘significant at 2%’, because if we repeat the experiment, we will find another P-value. We cannot fix the error level of our experiment depending on our current result, because we draw conclusions not only from our sample but also from all possible repetitions of the experiment. Notice that even if a P-value is small enough for establishing significant differences, if we repeat the experiment, the new P-value will not necessarily be that small. For example, consider that the true value m1 − m2 is placed at the 5% level (Fig. 1.4). Then we take a sample and find that the difference between treatments obtained, x̄1 − x̄2, is also placed at the 5% level (this is not a rare supposition, since the samples should be distributed near the true value). We will say that the difference between treatments is significant. However, when repeating the experiment, as the new samples will be distributed around the true value m1 − m2, half of the samples will give significant differences between treatments (P < 0.05), and half of them will not (P > 0.05). When repeating the experiment, we have the same probability of obtaining significant differences as not. Thus, a P-value of 5% gives the impression of having more evidence than what we actually have.⁷ In practice, we do not know where the true value is; thus, we do not know whether we are in the situation of Fig. 1.4 or not.
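The situation of Fig. 1.4 can be checked with a few lines of simulation. The sketch assumes the true difference sits exactly at the one-tail 5% critical value; the scale is arbitrary:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
se = 1.0                           # s.e. of the difference, arbitrary scale
crit = norm.isf(0.05) * se         # one-tail 5% critical value (about 1.645 s.e.)

# Replicate the experiment with the true difference m1 - m2 at the critical value
new_diffs = rng.normal(crit, se, 100_000)
print(f"significant on replication: {(new_diffs > crit).mean():.3f}")  # close to 0.5
```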

Fig. 1.3 An error level of 5% of being wrong when rejecting the null hypothesis was accepted, but along his career, a researcher discovered that his data showed a much higher evidence about the null hypothesis being wrong

⁷ Notice that this argument is independent of the power of the test; it applies whatever this power is.


Significant difference means that a difference exists: This is not always true. We may be wrong once every twenty times, on average, if the error level is 5%. The problem is that when measuring many traits, we may detect a false significant difference once every twenty traits.⁸ The same problem arises when we are estimating many effects. It is not infrequent to see pathetic efforts of some authors trying to justify some second- or third-order interaction that appears in an analysis when all the other interactions are not significant, without realising that this interaction can be significant purely by chance.

N.S. (non-significant difference) means that there is no difference between treatments: This is usually false. First, in agriculture and biology, treatments are generally different, because they are not going to be exactly equal. Two pig breeds can differ in growth rate by less than a gram, but this is obviously irrelevant. Secondly, in well-designed experiments, N.S. appears when the difference between treatments is irrelevant, but this only happens for the trait for which the experiment was designed; thus, other traits can have relevant differences between treatments, and yet we obtain N.S. from our specific tests. The safest interpretation of N.S. is ‘we do not know whether treatments differ or not’; this is Fisher’s interpretation of N.S.

Even if the differences are N.S., we can still observe a ‘tendency’: This statement is nonsense. If we do not find significant differences between treatments A and B, this means that now A is higher than B, but after repeating the experiment, B can be higher than A. It is rather unfortunate that referees admit expressions like this, even in competent scientific journals. Moreover, it often happens that the N.S. differences are high, nothing that can be described as a ‘tendency’; N.S. describes my state of ignorance, not the size of the effect.

Our objective is to find whether two treatments are different: We are not interested in finding whether or not there are differences between treatments, because they are not going to be exactly equal. Our objective in an experiment is to find relevant differences. How big a difference should be in order to be considered relevant should be defined before making the experiment. A relevant value is a quantity under which differences between treatments have no biological or economical meaning. Higher values will also be relevant, but we are interested in the minimum value that can be considered relevant, because if two treatments differ by a lower value, we will consider that they are not different for practical purposes, i.e. for the purpose of our experiment. In classical statistics, the size of the experiment is usually established for finding a significant difference between two treatments when this difference is considered relevant. The problem then is that experiments are designed for one trait, but many other traits are often measured; thus, significant differences will not be linked to relevant differences for these other traits, as we are going to see below.

Fig. 1.4 Distribution of the samples when the true value is the same as the actual sample, and the current sample gives us a P-value of 5%. Notice that when repeating the experiment, half of the times we will obtain ‘non-significant’ differences (P > 0.05). This example is shown for a one-tail test, but the same applies to two-tail tests

⁸ Once each twenty traits as a maximum if the traits are uncorrelated. If they are correlated, the frequency of detecting false significances is different.

Significant difference means ‘relevant difference’: This is often false. In well-designed experiments, a significant difference will appear just when this difference is relevant. Thus, if we consider before performing the experiment that 100 g/d is a relevant difference between two treatments, we will calculate the size of our experiment in order to find a significant difference when the difference between the averages of our samples is |x̄1 − x̄2| ≥ 100 g/d, and we will not find a significant difference if it is lower than this. The problem arises when we analyse a trait other than the one used for defining the size of the experiment, but also in field data, where no experimental design has been made, or in poorly designed experiments. In these cases, there is no link between the relevance of the difference and its significance. In these cases, we can find:

1. Significant differences that are completely irrelevant: This case is innocuous; however, if significance is confused with relevance, the author of the paper will stress this result without reason, since the difference found is irrelevant. We will always get significant differences if the sample is big enough. If the sample is big, when repeating an experiment many times, the average of the sample will have a lower dispersion. For example, the average milk production of the daughters of a sire having only two daughters can be 4000 kg or 15,000 kg, but the average of sires with 100 daughters will be much closer to the mean of the population. Figure 1.5 shows a non-significant difference that is significant for a larger sample (a simulation sketch after this list illustrates the effect of sample size). Conversely, if we throw away part of our sample, former significant differences become non-significant. Thus, ‘significance’ itself is of little value; it only indicates whether the sample is big or not.⁹ This has been noticed at least since 1938 (Berkson 1938), and it is a classical criticism of frequentist hypothesis tests (see Johnson 1999 for a review).

⁹ In simulation studies, it is easy to find significant differences by increasing the number of repeated simulations.


2. Non-significant differences that are relevant: This means that the size of the experiment is not big enough. Sometimes experimental facilities are limited because of the nature of the experiment, but relevant differences that are non-significant would mean that perhaps there is an important effect of the treatment or perhaps not; we do not know, but it is important to know it, and a bigger sample should be analysed.

3. Non-significant differences that are irrelevant, but have high errors: Sometimes the estimated difference between treatments can be, by chance, near zero, but if its standard error is high, the true difference may be much higher and relevant. This is dangerous, because a small difference accompanied by ‘N.S.’ seems to be unimportant, but it can be rather relevant indeed. For example, if a relevant difference between treatments for growth rate is 100 g/d in pigs and the difference between the selected and control populations is 10 g/d with a s.e. of 150 g/d, when repeating the experiment, we may find a difference higher than 100 g/d, i.e. we can get a relevant difference. Thus, a small and ‘N.S.’ difference should not be interpreted as ‘there is no relevant difference’ unless the precision of this difference is good enough, i.e. its confidence interval is small.

Fig. 1.5 Distribution of the samples for two sample sizes. ‘Non-significant’ differences for n = 10 can be significant for n = 100

4. Significant differences that are relevant, but have high errors: This may lead to a dangerous misinterpretation. Imagine that we are comparing two breeds of rabbits for litter size. We decide that one kit will be enough to consider the difference between breeds to be relevant. We obtain a significant difference of two kits (we get one ‘star’). However, the confidence interval at 95% probability of this estimation goes from 0.1 to 3.9 kits. Thus, we are not sure whether the difference between breeds is 2 kits, 0.1 kits, 0.5 kits, 2.7 kits or whatever other value between 0.1 and 3.9. It may happen that the true difference is 0.5 kits, which is irrelevant. However, typically, all the discussion of the results is organised around the two kits and the ‘star’, saying ‘we found significant and important differences between breeds’, although we do not have this evidence. The same applies when comparing our results with other published results; typically the standard errors and confidence intervals of both results are ignored when discussing similarities or dissimilarities.
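The dependence of ‘significance’ on sample size, mentioned in point 1 and in Fig. 1.5, is easy to reproduce. In this sketch the true difference is fixed at a small, presumably irrelevant value; the SD, the difference and the group sizes are invented for illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
true_diff, s = 5.0, 50.0                 # small true difference and SD (g/d), hypothetical

for n in (10, 100, 10_000):              # animals per group
    se = s * np.sqrt(2.0 / n)
    diffs = rng.normal(true_diff, se, 50_000)   # replicated experiment outcomes
    signif = norm.sf(diffs / se) < 0.05         # one-tail test at the 5% level
    print(f"n = {n:>6}: significant in {signif.mean():.0%} of repetitions")
```

With a big enough sample, the same tiny difference is declared significant almost every time.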

We always know what a relevant difference is: Actually, for some problems, we do not know. A panel of experts analyses the aniseed flavour of some meat, and they find differences of three points on a scale of ten points; is this relevant? Which is the relevant value for enzyme activities? It is sometimes difficult to specify the relevant value, and in this case, we are completely disoriented when we are interpreting the tables of results, because we cannot distinguish between the four cases we have listed before. This is an important problem, because we cannot conclude anything from an experiment in which we do not know whether the differences between treatments we find are irrelevant or not. In Appendix 1.1, we propose some practical solutions to this problem.

Tests of hypothesis are always needed in experimental research: For most biological problems, we do not need any hypothesis test. The answer provided by a test is rather elementary: Is there a difference between treatments? Yes or No. However, this is not actually the question for most biological problems. In fact, we know that the answer to this question is generally Yes, because two treatments are not going to be exactly equal. The test only adds to our previous information that one of the treatments is higher than the other one. However, in most biological problems, our question is whether these treatments differ by more than a relevant quantity. In order to answer this question, we should estimate the difference between treatments accompanied by a measurement of our uncertainty. This is more informative than comparing LS-means only and showing whether they are significantly different, because as we have seen before, significance is not related to relevance but to sample size. There is a large amount of literature recommending focusing more on confidence intervals or other quantitative measurements of uncertainty than on hypothesis tests (see Johnson 1999, or Lecoutre and Poitevineau 2014, for a recent review on this controversy).


1.3 Standard Errors and Confidence Intervals

1.3.1 Definition of Standard Error and Confidence Interval

If we take an infinite number of samples, the sample mean (or the difference between two sample means) will be distributed around the true value we want to estimate, as shown in Fig. 1.1. The standard deviation of this distribution is called the ‘standard error’ (s.e.), to avoid confusion with the standard deviation of the population. A large standard error means that the sample averages will take very different values, many of them far away from the true value. As we do not take infinite samples, just one, a large standard error means that we do not know how close we are to the true value, but a small standard error means that we are close to the true value, because most of the possible sample averages when repeating the experiment will be close to this true value.

When the distribution obtained by repeating the experiment is Normal,¹⁰ twice the standard error around the true value will contain approximately 95% of the samples.¹¹ This permits the construction of the so-called confidence intervals at 95% by establishing the limits within which the true value is expected to be found. Unfortunately, we do not know the true value; thus, it is not possible to establish confidence intervals as in Fig. 1.1, and we have to use our estimate instead of the true value to define the limits of the confidence interval. Our confidence interval is approximately (sample average) ± 2 s.e.¹² A consequence of this is that each time we repeat the experiment, we have a new sample and thus a new confidence interval. For example, let’s assume we want to estimate the litter size of a pig breed, and we obtain 10 piglets with a confidence interval CI (95%) = [9, 11]. This means that if we repeat the experiment, we will get many confidence intervals: [8, 10], [9.5, 11.5], etc., and 95% of these intervals will contain the true value. However, we are not going to repeat the experiment an infinite number of times. What shall we do? In classical statistics, we behave as if our interval were one of the intervals containing the true value (see Fig. 1.7). We hope, as a consequence of our behaviour, to be wrong a maximum of 5% of times along our career.
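The repeated-sampling meaning of a confidence interval can be verified directly. The sketch below uses an invented litter-size scenario (true mean, SD and sample size are assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
true_mean, sd, n = 10.0, 2.5, 30          # litter size scenario, all hypothetical

covered = 0
n_reps = 10_000
for _ in range(n_reps):                   # conceptual repetitions of the experiment
    sample = rng.normal(true_mean, sd, n)
    se = sample.std(ddof=1) / np.sqrt(n)
    lo, hi = sample.mean() - 1.96 * se, sample.mean() + 1.96 * se
    covered += (lo <= true_mean <= hi)

print(f"intervals containing the true value: {covered / n_reps:.1%}")   # about 95%
```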

¹⁰ Computer programmes (like SAS) ask whether you have checked the Normality of your data, but Normality of the data is not needed if the sample is large enough (see Chap. 10, Sect. 10.1.3). Independently of the distribution of the original data, the average of a sample is distributed Normally if the sample size is big enough. This is often forgotten, as Fisher often complained (Fisher 1925).

¹¹ The exact value is not twice but 1.96 times.

¹² For small samples, a t-distribution should be used instead of a Normal distribution, and the confidence interval is somewhat larger, but this is not important for this discussion.


1.3.2 Common Misinterpretations

The true value lies within 2 s.e. of the estimate: We do not know whether this will happen or not. First, the distribution of the samples when repeating the experiment might not be Normal. This is common when estimating correlation coefficients that are close to 1 or to −1; it is nonsense to write a correlation coefficient as 0.95 ± 0.10. Some techniques (e.g. bootstrap), taking advantage of the easy computation with modern computers, can show the actual distribution of a sample. A correlation coefficient sampling distribution may be asymmetric, as in Fig. 1.6. If we take the most frequent value as our estimate (0.9), the s.e. has little meaning.
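The bootstrap mentioned above takes only a few lines. This sketch resamples an invented bivariate dataset with a correlation near 1 and shows the asymmetric distribution of the estimate (sample size and noise level are assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 30
x = rng.normal(0.0, 1.0, n)
y = 0.95 * x + rng.normal(0.0, 0.3, n)    # two strongly correlated traits

boot_r = []
for _ in range(10_000):
    idx = rng.integers(0, n, n)           # resample the n pairs with replacement
    boot_r.append(np.corrcoef(x[idx], y[idx])[0, 1])
boot_r = np.array(boot_r)

print("2.5%, 50%, 97.5% quantiles:", np.percentile(boot_r, [2.5, 50, 97.5]).round(3))
# The interval is asymmetric around the estimate: r is bounded above by 1
```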

CI (95%) means that the probability of the true value being contained in the interval is 95%: This is not true. We say that the true value is contained in the interval with probability P = 100%, i.e. with total certainty. We utter that our interval is one of the ‘good ones’ (Fig. 1.7). We may be wrong, but if we behave like this, we hope to be wrong only 5% of the times as a maximum along our career. As in the case of the test of hypothesis, we make inferences not only from our sample but also from the distribution of samples in ideal repetitions of the experiment.

Fig. 1.6 Sampling distribution of a correlation coefficient. Repeating the experiment, the samples are not distributed symmetrically around the true value

Fig. 1.7 Repeating the experiment many times, 95% of the intervals will contain the true value m. We do not know whether our interval is one of these, but we assume that it is. We hope not to be wrong too many times along our career


The true value should be in the centre of the CI, not at the borders: We do not know where the true value is. If the CI for differences in litter size is [0.1, 3.9], the true value may be 0.1, 0.2, 2.0, 3.9 or some other intermediate value between the confidence interval limits. In Fig. 1.7 we can see that some intervals have the true value near one side and others near the centre. We expect to have more intervals in which the true value is near the centre, but we do not know whether our interval is one of these.

Conceptual repetition leads to paradoxes: Several paradoxes produced by drawing conclusions not only from our sample but also from conceptual repetitions of it have been noticed. The following one can be found, in a slightly different form, in Berger and Wolpert (1984, pp. 91–92). Imagine we are measuring pH and we know that the estimates will be Normally distributed around the true value when repeating the experiment an infinite number of times. We obtain a sample with five measurements: 4.1, 4.5, 5, 5.5 and 5.9. We then calculate our CI at 95%. Suddenly, a colleague tells us that the pH meter was broken, and it does not work if the pH is higher than 6. Although we did not find any measure higher than 6, repeating the experiment an infinite number of times, we would obtain a truncated distribution of our samples (Fig. 1.8b). This means that we should change our confidence interval, since all possible samples higher than 6 would not be recorded. Then another colleague tells us that the pH meter was repaired before we started our experiment, and we write a paper changing the CI 95% back to the former values. However, our former colleague insists that the pH meter was still broken; thus, we change our CI again. Notice that we are changing our CI even though none of our measurements lay in the area in which the pH meter was broken. We change our CI not because we had wrong measures of the pH, but because repeating the experiment an infinite number of times would produce a different distribution of our samples. As we make inferences not only from our samples but also from conceptual repetitions of the experiment, our conclusions are different if the pH meter is broken, although all our measurements were correct.

Fig. 1.8 After repeating an experiment an infinite number of times, we arrive at different conclusions whether our pH metre works well (a) or whether it is broken and does not measure values higher than 6 (b), even though all our measurements [4.1, 4.5, 5, 5.5, 5.9] were correctly taken


1.4 Bias and Risk of an Estimator

The risk of an estimator û of a true value u is the expectation of its loss function, taking here as loss function the square of the error e = u − û:13

R(û, u) = E[l(û, u)] = E(e²)

A good estimator will have a low risk. We can express the risk as

R(û, u) = E(e²) = E(e² − ē² + ē²) = ē² + E(e² − ē²) = ē² + var(e) = Bias² + var(e)

where we define the bias as the mean of the errors, ē = E(e). An unbiased estimator has a null bias. This property is considered particularly attractive in classical statistics, because it means that, when repeating the experiment an infinite number of times, the estimates are distributed around the true value, as in Fig. 1.1. Nevertheless, unbiasedness has not always been considered a particularly attractive property of an estimator; Fisher considered that the property of unbiasedness was irrelevant due to its lack of invariance to transformations (Fisher 1959, pp. 142–143), as we will see below.

A transformation of an unbiased estimator leads to another unbiased estimator: This is normally not true. In general, a transformation of an unbiased estimator leads to an estimator that is no longer unbiased. For example, it is frequent to find unbiased estimators of the variance and to use them for estimating the standard deviation by computing their square root. However, the square root of an unbiased estimator of the variance is not an unbiased estimator of the standard deviation. It is possible to find unbiased estimators of the standard deviation, but they are not the square root of the unbiased estimator of the variance (see, e.g. Kendall et al. 1994).

13 All of this is rather arbitrary, and other solutions can be used. For example, we may express the error as a percentage of the true value, the loss function may be the absolute value of the error instead of its square, and the risk might be the mode instead of the mean of the loss function; but in this chapter, we will use the common definitions.


Unbiased estimators should always be preferred: Not always. In general, the best estimators are the ones with the lowest risk. As the risk is the sum of the squared bias plus the variance of the estimator, it may happen that a biased estimator has a lower risk, being a better estimator than an unbiased one (Fig. 1.9).

For example, take the case of the estimation of the variance. We can estimate the variance as

σ̂² = (1/k) Σᵢ (yᵢ − ȳ)²

where the sum runs from i = 1 to n, the yᵢ are the data, ȳ is their average and k is a constant to be chosen.

Fig. 1.9 An example of a biased estimator (blue) that is not distributed around the true value 'm' but has a lower risk than an unbiased estimator (red) that is distributed around the true value with a much higher variance


When k = n − 1, the estimator is unbiased, which is one of the reasons REML users give to prefer this estimator. However, the risk of REML is higher than the risk of ML because its variance is higher; thus, ML should be preferred or, even better, the minimum risk estimator.
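A quick simulation shows this trade-off for Normal data. The sketch below is ours, with arbitrary settings; it compares the bias and the risk of the estimator above for k = n − 1 (unbiased), k = n (the ML choice) and k = n + 1 (the minimum risk divisor for Normal samples):

    import numpy as np

    rng = np.random.default_rng(5)
    sigma2, n, reps = 4.0, 10, 200000

    y = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
    ss = ((y - y.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

    for k in (n - 1, n, n + 1):
        est = ss / k
        bias = est.mean() - sigma2
        risk = ((est - sigma2) ** 2).mean()
        print(f"k = {k:2d}   bias = {bias:+.3f}   risk = {risk:.3f}")
    # k = n - 1 gives a bias near zero but the largest risk of the three;
    # k = n + 1 gives the smallest risk, at the price of some bias.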

1.5 Fixed and Random Effects

1.5.1 Definition of Fixed and Random Effects

Churchill Eisenhart proposed in 1947 a distinction between two types of effects. The effect of a model was 'fixed' if we were interested in its particular value and 'random' if it could be considered just one of the possible values of a random variable. Consider, for example, an experiment in which we have 40 sows in four groups of 10 sows each, and each sow has five parities. We feed each group with a different diet, and we are interested in knowing the effect of the diet on the litter size of the sows. The effect of the diet can be considered as a 'fixed' effect, because we are interested in finding the food that leads to higher litter sizes, and we consider that, if we repeated the experiment, the effect of each diet on litter size would be the same. Some sows are more prolific than others, but we are not interested in the prolificacy of a particular sow; thus, if the sows have been assigned randomly to each diet, we consider that each sow effect is a 'random' effect. Repeating the experiment, we would have different sows with different sow effects, but these effects would not change our inferences about diets because sows are assigned randomly to each diet.

When repeating an experiment an infinite number of times, a fixed effect has always the same values, whereas a random effect changes in each repetition of the experiment. Repeating our experiment, we would always give the same four diets, but the sows would be different; thus, the effect of the food will always be the same, but the effect of the sow will randomly change in each repetition. In Fig. 1.10, we can see how the true value of the effects and their estimates are distributed. When repeating the experiment, the true value of the fixed effect remains constant, and all its estimates are distributed around this unique true value. In the case of the random effect, each repetition of the experiment leads to a new true value; thus, the true value is not constant: it varies and is distributed around its mean.

Notice that the errors are the difference between true and estimated values in both cases, but in the case of random effects, they are not the distance between the estimate and the mean of the estimates, because the true value changes in each repetition of the experiment. As random values change in each repetition, instead of 'estimating' random values, we often say 'predicting' random values.


1.5.2 Shrinkage of Random Effects Estimates

The estimate (the prediction) of a random effect depends on the amount of data used. Let us take a simple model in which we measure the growth of chickens under several diets, with a different number of chickens per diet.

If we estimate the effect of a diet as a fixed effect, the effect of the diet will be the average of the chickens' growth under this diet:

û_F = (1/n) Σᵢ yᵢ

with the sum running over the n chickens fed that diet, whereas if we consider the diet a random effect, the estimate is shrunken by the ratio between the residual variance σ²_ε and the variance of the diet effects σ²_u:

û_R = (1/(n + σ²_ε/σ²_u)) Σᵢ yᵢ

We will see this different way of estimating fixed or random effects in Chap. 7.

Notice that when 'n' is high, both estimates are similar, but when 'n' is small, the random estimate suffers a 'shrinkage' and takes a smaller value than when the effect is considered as fixed. The importance of this shrinkage will depend on the number of data used for estimating the random effect. This is well known by geneticists, who evaluate animals considering their genetic values as random. An example is developed in Appendix 1.3, and a small numerical sketch follows.
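The sketch below applies the two expressions above; it is our own toy example, and both the variance ratio σ²_ε/σ²_u = 4 and the +30 g deviation of every record are invented for illustration:

    # lam = residual variance / diet-effect variance, assumed equal to 4 here
    lam = 4.0

    for n in (2, 10, 100):
        y_sum = 30.0 * n                  # every record deviates +30 g, say
        u_fixed = y_sum / n               # fixed-effect estimate: the average
        u_random = y_sum / (n + lam)      # random-effect estimate: shrunken
        print(f"n = {n:3d}   fixed = {u_fixed:5.1f}   random = {u_random:5.1f}")

    # n =   2   fixed =  30.0   random =  10.0   -> strong shrinkage, few data
    # n =  10   fixed =  30.0   random =  21.4
    # n = 100   fixed =  30.0   random =  28.8   -> both estimates nearly agree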


As we will see later, the researcher is sometimes faced with the dilemma of considering an effect as fixed or random, particularly when correcting for noise effects. If some levels of a noise effect have few data, they will be poorly estimated, which could affect the results, but if the noise is considered random, this correction will be small. This is why it is common to consider as random an effect with many levels and few data in each level. The other side of the problem is that the corrections actually applied in these cases are very small; thus, the decision of taking an effect as random or fixed is not always clear.

The concept of bias is also different for fixed and for random effects. For a random effect, unbiasedness only means that the true values and their estimates have the same mean, a much less attractive property:14

FIXED: Bias = E(e) = E(u − û) = E(u) − E(û) = u − E(û)

RANDOM: Bias = E(e) = E(u − û) = E(u) − E(û)

The variances of the errors are also different (see Appendix 1.2 for a demonstration):

FIXED: var(e) = var(u − û) = var(û)

RANDOM: var(e) = var(u − û) = var(u) − var(û)

The best estimators are the ones with the lowest risk; in the case of unbiased estimators, as Bias = 0, the best ones have the lowest error variance. For fixed effects, as the true value 'u' is a constant, var(u) = 0; thus, the variance of the error is the same as the variance of the estimator, var(û), and the best unbiased estimators are the ones with the smallest variance. In the case of random effects, the true values are not constant, and the variance of the error is the difference between the variances of the true and estimated values. Thus, in the case of random effects, the best estimator is the one with a variance as close as possible to the variance of the true values, because this minimizes the variance of the error. The source of the confusion is that a good estimator is not the one with a small variance, but the one with a small error variance. A good estimator will give values close to the true value in each repetition; the error will be small, and the variance of the error will be small.

14 Henderson (1973) has been criticised for calling the property E(u) = E(û) 'unbiasedness', in order to defend that his popular estimator 'BLUP' was unbiased. This property should always mean that the estimates are distributed around the true value. In the case of random effects, this means u = E(û|u), a property that BLUP does not have (see Robinson 1991).


In the case of fixed effects, this variance of the error is the same as the variance of the estimator, and in the case of random effects, the variance of the error is small when the variance of the estimator is close to the variance of the true value.

An effect is fixed or random due to its nature: This is not always true. In the example before, we might have considered the four types of food as random samples of all different types of food. Thus, when conceptually repeating the experiment, we would conceptually change the food.15 Conversely, we might have considered the sow as a 'fixed' effect, and we could have estimated it, since we had five litters per sow.16 Thus, the effects can be taken as fixed or random depending on our interests.

We are not interested in the particular value of a random effect: Sometimes we can be interested in it. A particular case in which it is interesting to consider the effects as random is the prediction of genetic values. Using Mendel's laws, we know the relationships between relatives; thus, we can use this prior information when the individual genetic effects are considered random effects. Appendix 1.3 gives an example of this prediction.

Even for random effects, to be unbiased is an important property: The property of unbiasedness is not particularly attractive for random effects, since when repeating the experiment the true values change as well, and the estimates are not distributed around the true value. We have seen before that, even for fixed effects, unbiasedness may be considered rather unattractive, since usually it is not invariant to transformations.

BLUP is the best possible predictor of a genetic random value: Animal breeding values are commonly estimated (predicted) by BLUP (best linear unbiased predictor, Henderson 1973). The word 'best' is somewhat misleading because it suggests that BLUP is the best possible predictor, but we can have biased predictors with a lower risk than unbiased ones. The reason for searching for predictors only among the unbiased ones is that there is an infinite number of possible biased predictors with the same risk, depending on their bias and their variance. By adding the condition of unbiasedness, we find a single one, called BLUP, which is not the best possible predictor, but the best among the unbiased ones. In Chap. 7, Sect. 7.3.2, we will develop BLUP from a Bayesian perspective.

15 That is, when we imagine repeating the experiment, we imagine changing the food as well.

16 Fisher considered that the classification of the effects into fixed and random was worse than considering all the effects as random, as they were considered before Eisenhart's proposal (Yates 1990).


1.6 Likelihood

The concept of likelihood and the method of maximum likelihood (ML) were developed by Fisher between 1912 and 1922, although there are historical precedents attributed to Bernoulli, Lambert and Edgeworth, as we said in the historical introduction. By 1912, the theory of estimation was at an early stage, and the method was practically ignored. However, Fisher (1922) published a paper in which the properties of the estimators were defined, and he found that this method produced estimators with good properties, at least asymptotically. The method was then accepted by the scientific community, and it is now frequently used.

Consider finding the average weight of rabbits of a breed at 8 weeks of age. We take a sample of one rabbit, and its weight is y0 = 1.6 kg. The rabbit can come from a population Normally distributed with a mean of 1.5 kg, or from another population with a mean of 1.8 kg, or from many other possible populations. Figure 1.11 shows the probability density functions of several possible populations from which this rabbit can come, with population means of m1 = 1.50 kg, m2 = 1.60 kg and m3 = 1.80 kg. Notice that, at the point y0, the probability densities of the first and third populations, f(y0|m1) and f(y0|m3), are lower than that of the second one, f(y0|m2). It looks very unlikely that a rabbit of 1.6 kg comes from a population with a mean of 1.8 kg. Therefore, it seems more likely that the rabbit comes from the second population. All the values f(y0|m1), f(y0|m2), f(y0|m3), etc. are called 'likelihoods' and show how 'likely' it is that we would have obtained our sample y0 if the true value of the mean had been m1, m2, m3, etc. (Fig. 1.12).

These likelihoods can be represented for each value of the mean. They define a curve with a maximum in f(y0|m2) (Fig. 1.13). This curve varies with m, and the sample y0 is a fixed value for all those probability density functions. It is obvious that the new function defined by these values is not a probability density function, since each value belongs to a different probability density function.


We have a problem of notation here, because now the variable is 'm' instead of 'y'. Speaking about a family of probability density functions f(y0|m1), f(y0|m2), f(y0|m3), etc. for a given y0 is the same as speaking about a function L(m|y0) that is not a probability density function.17 However, this notation hides the fact that L(m|y0) is a family of probability density functions indexed at a fixed value y = y0.

Fig. 1.12 Three likelihoods for the sample y0 = 1.6

Fig. 1.13 Likelihood curve. It is not a probability because its values come from different distributions, but it is a rational degree of belief. The notation stresses that the variable (in red) is m and not y0, which is a given fixed sample

17 Some classic texts of statistics (Kendall et al. 1994) contribute to the confusion by using the notation L(y|m) for the likelihood. Moreover, some authors distinguish between 'given a parameter' (always fixed) and 'given the data' (which are random variables). They use (y|m) for the first case and (m;y) for the second. Likelihood can be found in textbooks as L(m|y), L(y|m), f(y|m) and L(m;y).


We will use a new notation, representing the variable in red colour and the constants in black colour. Then f(y0|m) means a family of probability density functions, in which the variable is m, that are indexed at a fixed value y0. For example, if the Normal functions of our example are standardised (s.d. = 1), then the likelihood will be represented as

f(y0|m) = (1/√(2π)) exp(−(y0 − m)²/2)

where the variable m is in red colour. We will use 'f' exclusively for probability density functions in a generic way, i.e. f(x) and f(y) may be different functions (e.g. Normal or Poisson), but they will always be probability density functions.
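The likelihood curve of Fig. 1.13 can be computed directly. As an illustration (our sketch, reusing only the values of the example), we evaluate the standardised Normal density at the fixed sample y0 = 1.6 over a grid of candidate means m and locate its maximum:

    import numpy as np

    y0 = 1.6                                  # the fixed sample, in kg
    m = np.linspace(1.0, 2.2, 601)            # candidate population means

    # Likelihood: Normal density with s.d. = 1, evaluated at y0, as m varies
    L = np.exp(-0.5 * (y0 - m) ** 2) / np.sqrt(2.0 * np.pi)

    print("likelihoods at m = 1.5, 1.6, 1.8:", np.interp([1.5, 1.6, 1.8], m, L))
    print("ML estimate:", m[np.argmax(L)])    # 1.6: the curve peaks at y0
    # L is not a probability density in m: each value comes from a different
    # density of y, and in general L does not integrate to 1 over m.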

Fisher (1912) proposed to take the value of m that maximises f(y0|m) because, from all the populations defined by f(y0|m1), f(y0|m2), f(y0|m3), etc., this is the one for which, if it were the true value, the sample would be most probable. Here the word probability can lead to some confusion since, as we have seen, these values belong to different probability density functions, and the likelihood function defined by taking them is not a probability function. Thus, Fisher preferred to use the word likelihood for all these values.18

Fisher (1912, 1922) not only proposed a method of estimation but also proposed the likelihood as a degree of belief, different from probability, but allowing uncertainty to be expressed in a similar manner. What Fisher proposed was to use the whole likelihood curve and not only its maximum, a practice rather unusual nowadays (Fig. 1.14).19 Today, frequentist statisticians typically use only the maximum of the curve because it has good properties in repeated sampling. Repeating the experiment an infinite number of times, the estimator will be distributed near the true value, with a variance that can also be estimated. However, all those properties are asymptotic; thus, there is no guarantee about the goodness of the estimator when samples are small. Besides, the ML estimator is not necessarily the estimator that minimizes the risk. Nevertheless, the method has an interesting property, apart from its asymptotic frequentist properties: any reparameterization leads to the same type of estimator (Appendix 1.4). For example, the ML estimator of the variance is the square of the ML estimator of the standard deviation, and, in general, the ML estimator of a transformation of a parameter is that transformation of the ML estimator.
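A minimal numerical check of this invariance (our sketch, with invented data): the closed-form ML estimator of the variance divides the sum of squared deviations by n, and maximising the likelihood over the standard deviation by a grid search returns exactly the square root of that value:

    import numpy as np

    rng = np.random.default_rng(11)
    y = rng.normal(3.0, 2.0, size=50)
    ss = np.sum((y - y.mean()) ** 2)

    # Closed-form ML estimator of the variance: divide by n (not n - 1)
    var_ml = ss / y.size

    # Numerical ML estimator of the standard deviation (profile grid search)
    sd = np.linspace(0.5, 5.0, 100001)
    neg_loglik = y.size * np.log(sd) + ss / (2.0 * sd ** 2)
    sd_ml = sd[np.argmin(neg_loglik)]

    print(var_ml, sd_ml ** 2)   # equal up to the grid resolution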


18 Strictly speaking, these quantities are densities of probability. As we will see in Sect. 3.3.1, probabilities are areas defined by f(y)·Δy.

19 In this figure, as in other figures of the book, we do not draw a known function like a Normal, Poisson, Inverted gamma, etc., but functions that help to understand the concept proposed.
