Title: The estimation of the effective reproductive number from disease outbreak data
Authors: Ariel Cintrón-Arias, Carlos Castillo-Chávez, Luís M. A. Bettencourt, Alun L. Lloyd, H. T. Banks
Institution: North Carolina State University
Field: Mathematical Biology
Type: journal article
Year of publication: 2009
City: Raleigh

AND ENGINEERING

THE ESTIMATION OF THE EFFECTIVE REPRODUCTIVE NUMBER FROM DISEASE OUTBREAK DATA

Ariel Cintrón-Arias

Center for Research in Scientific Computation, Center for Quantitative Sciences in Biomedicine, North Carolina State University, Raleigh, NC 27695, USA

Carlos Castillo-Chávez

Department of Mathematics and Statistics, Arizona State University, P.O. Box 871804, Tempe, AZ 85287-1804, USA

Luís M. A. Bettencourt

Theoretical Division, Mathematical Modeling and Analysis (T-7), Los Alamos National Laboratory, Mail Stop B284, Los Alamos, NM 87545, USA

Alun L. Lloyd and H. T. Banks

Center for Research in Scientific Computation, Biomathematics Graduate Program, Department of Mathematics, North Carolina State University, Raleigh, NC 27695, USA

Abstract. We consider a single outbreak susceptible-infected-recovered (SIR) model and corresponding estimation procedures for the effective reproductive number R(t). We discuss the estimation of the underlying SIR parameters with a generalized least squares (GLS) estimation technique. We do this in the context of appropriate statistical models for the measurement process. We use asymptotic statistical theories to derive the mean and variance of the limiting (Gaussian) sampling distribution and to perform post statistical analysis of the inverse problems. We illustrate the ideas and pitfalls (e.g., large condition numbers on the corresponding Fisher information matrix) with both synthetic and influenza incidence data sets.

1. Introduction. The transmissibility of an infection can be quantified by its basic reproductive number R0, defined as the mean number of secondary infections seeded by a typical infective into a completely susceptible (naïve) host population [1, 19, 26]. For many simple epidemic processes, this parameter determines a threshold: whenever R0 > 1, a typical infective gives rise, on average, to more than one secondary infection, leading to an epidemic. In contrast, when R0 < 1, infectives typically give rise, on average, to less than one secondary infection, and the prevalence of infection cannot increase.

2000 Mathematics Subject Classification. Primary: 62G05, 93E24, 49Q12, 37N25; Secondary: 62H12, 62N02.

Key words and phrases. Effective reproductive number, basic reproduction ratio, reproduction number, R, R(t), R0, parameter estimation, generalized least squares, residual plots.

The first author was in part supported by NSF under Agreement No. DMS-0112069, and by NIH Grant Number R01AI071915-07.

Owing to the natural history of some infections, transmissibility is better quantified by the effective, rather than the basic, reproductive number. For instance, exposure to influenza in previous years confers some cross-immunity [16, 22, 32]; the strength of this protection depends on the antigenic similarity between the current year's strain of influenza and earlier ones. Consequently, the population is non-naïve, and so it is more appropriate to consider the effective reproductive number R(t), a time-dependent quantity that accounts for the population's reduced susceptibility.

Our goal is to develop a methodology for the estimation of R(t) that also provides a measure of the uncertainty in the estimates. We apply the proposed methodology in the context of annual influenza outbreaks, focusing on data for influenza A (H3N2) viruses, which were, with the exception of the influenza seasons 2000–01 and 2002–03, the dominant flu subtype in the United States (US) over the period from 1997 to 2005 [12, 36].

The estimation of reproductive numbers is typically an indirect process because some of the parameters on which these numbers depend are difficult, if not impossible, to quantify directly. A commonly used indirect approach involves fitting a model to some epidemiological data, providing estimates of the required parameters.

In this study we estimate the effective reproductive number by fitting a deterministic epidemiological model, employing a generalized least squares (GLS) estimation scheme to obtain estimates of model parameters. Statistical asymptotic theory [18, 34] and sensitivity analysis [17, 33] are then applied to give approximate sampling distributions for the estimated parameters. Uncertainty in the estimates of R(t) is then quantified by drawing parameters from these sampling distributions, simulating the corresponding deterministic model and then calculating effective reproductive numbers. In this way, the sampling distribution of the effective reproductive number is constructed at any desired time point.

The statistical methodology provides a framework within which the adequacy of the parameter estimates can be formally assessed for a given data set. We discuss the use of residual plots as a diagnostic for the estimation, highlighting the problems that arise when the assumptions of the statistical model underlying the estimation framework are violated.

This manuscript is organized as follows: In Section 2 the data sets are introduced. A single-outbreak deterministic model is introduced in Section 3. Section 4 introduces the least squares estimation methodology used to estimate values for the parameters and quantify the uncertainty in these estimates. Our methodology for obtaining estimates of R(t) and its uncertainty is also described. Use of these schemes is illustrated in Section 5, in which they are applied to synthetic data sets. Section 6 applies the estimation machinery to the influenza incidence data sets. We conclude with a discussion of the methodologies and their application to the data sets.

2. Longitudinal incidence data. Influenza is one of the most significant infectious diseases of humans, as witnessed by the 1918 “Spanish flu” pandemic, during which 20% to 40% of the worldwide population became infected. At least 50 million deaths resulted, with 675,000 of these occurring in the US [37]. The impact of flu is still significant during inter-pandemic periods: the Centers for Disease Control and Prevention (CDC) estimate that between 5% and 20% of the US population becomes infected annually [12]. These annual flu outbreaks lead to an average of 200,000 hospitalizations (mostly involving young children and the elderly) and mortality that ranges between about 900 and 13,000 deaths per year [36].

Table 1. Number of tested specimens and influenza isolates during several annual outbreaks in the US [12].

Season | Total number of tested specimens | Number of A(H1N1) & A(H1N2) isolates | Number of A(H3N2) isolates | Number of B isolates

Figure 1. Influenza isolates reported by the CDC in the US during the 1999–00 season [12]. The number of H3N2 cases (isolates) is displayed as a function of time. Time is measured as the number of weeks since the start of the year's flu season. For the 1999–00 flu season, week number one corresponds to the fortieth week of the year, falling in October.

The Influenza Division of the CDC reports weekly information on influenza activity in the US from calendar week 40 in October through week 20 in May [12], the period referred to as the influenza season. Because the influenza virus exhibits a high degree of genetic variability, data is not only collected on the number of cases but also on the types of influenza viruses that are circulating. A sample of viruses isolated from patients undergoes antigenic characterization, with the type, subtype and, in some instances, the strain of the virus being reported [12].

The CDC acknowledges that, while these reports may help in mapping influenza activity (whether or not it is increasing or decreasing) throughout the US, they often do not provide sufficient information to calculate how many people became ill with influenza during a given season. This is true especially in light of measurement uncertainty, e.g., underreporting, longitudinal variability in reporting procedures, etc. Indeed, the sampling process that gives rise to the tested isolates is not sufficiently standardized across space and time, and results in variabilities in measurements that are difficult to quantify. We return to discuss this point later in this paper.

Despite the cautionary remarks by the CDC, we use such isolate reports as illustrative data sets to which one can apply proposed estimation methodologies. The data sets do, in fact, represent typical data sets available to modelers for many disease progression scenarios. Interpretation of the results, however, should be mindful of the issues associated with the data. For the influenza data we have chosen, the total number of tested specimens and isolates through various seasons are summarized in Table 1. It is observed that H3N2 viruses predominated in most seasons with the exception of 2000–01 and 2002–03. Consequently, we focus our attention on the H3N2 subtype. Fig. 1 depicts the number of H3N2 isolates reported over the 1999–00 influenza season.

3. Deterministic single-outbreak SIR model. The model that we use is the standard susceptible-infected-recovered (SIR) model (see, for example, [1, 8]). The state variables S(t), I(t), and X(t) denote the number of people who are susceptible, infected, and recovered, respectively, at time t. It is assumed that newly infected individuals immediately become infectious and that recovered individuals acquire permanent immunity. The influenza season, lasting nearly thirty-two weeks [12], is short compared to the average lifespan, so we ignore demographic processes (births and deaths) as well as disease-induced fatalities and assume that the total population size remains constant. The model is given by the set of nonlinear differential equations

\[ \frac{dS}{dt} = -\beta S \frac{I}{N}, \qquad (1) \]

\[ \frac{dI}{dt} = \beta S \frac{I}{N} - \gamma I, \qquad (2) \]

\[ \frac{dX}{dt} = \gamma I. \qquad (3) \]

Equation (2) for the infective population can be rewritten as

\[ \frac{dI}{dt} = \gamma \left( R(t) - 1 \right) I, \qquad (4) \]

where R(t) = (S(t)/N) R0 and R0 = β/γ. R(t) is known as the effective reproductive number, while R0 is known as the basic reproductive number. We have that R(t) ≤ R0, with the upper bound, the basic reproductive number, only being achieved when the entire population is susceptible.

We note that R(t) is the product of the per-infective rate at which new infections arise and the average duration of infection, and so the effective reproductive number gives the average number of secondary infections caused by a single infective, at a given susceptible fraction. The prevalence of infection increases or decreases according to whether R(t) is greater than or less than one, respectively. Because there is no replenishment of the susceptible pool in this SIR model, R(t) decreases over the course of an outbreak as susceptible individuals become infected.
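The behaviour of R(t) described in this section can be checked numerically. The following is a minimal sketch, not the authors' implementation: it integrates equations (1) and (2) with a hand-written fixed-step Runge-Kutta scheme and evaluates R(t) = (S(t)/N) R0 along the trajectory. The function names and all numerical values (N, β, γ, initial conditions, step size) are assumptions chosen only for illustration.

```python
def sir_rhs(s, i, beta, gamma, n):
    """Equations (1) and (2): time derivatives of S and I."""
    new_infections = beta * s * i / n
    return -new_infections, new_infections - gamma * i


def simulate_sir(s0, i0, beta, gamma, n, t_end, dt=0.01):
    """Integrate S(t) and I(t) with a fixed-step fourth-order Runge-Kutta scheme."""
    times, s_vals, i_vals = [0.0], [s0], [i0]
    s, i = s0, i0
    for step in range(1, int(round(t_end / dt)) + 1):
        k1 = sir_rhs(s, i, beta, gamma, n)
        k2 = sir_rhs(s + 0.5 * dt * k1[0], i + 0.5 * dt * k1[1], beta, gamma, n)
        k3 = sir_rhs(s + 0.5 * dt * k2[0], i + 0.5 * dt * k2[1], beta, gamma, n)
        k4 = sir_rhs(s + dt * k3[0], i + dt * k3[1], beta, gamma, n)
        s += dt / 6.0 * (k1[0] + 2.0 * k2[0] + 2.0 * k3[0] + k4[0])
        i += dt / 6.0 * (k1[1] + 2.0 * k2[1] + 2.0 * k3[1] + k4[1])
        times.append(step * dt)
        s_vals.append(s)
        i_vals.append(i)
    return times, s_vals, i_vals


def effective_reproductive_number(s, beta, gamma, n):
    """R(t) = (S(t) / N) * R0 with R0 = beta / gamma."""
    return (s / n) * (beta / gamma)


# Hypothetical values chosen only for illustration (not taken from the paper).
N, BETA, GAMMA = 10000.0, 2.0, 1.0  # basic reproductive number R0 = 2
times, s_vals, i_vals = simulate_sir(N - 10.0, 10.0, BETA, GAMMA, N, t_end=30.0)
r_vals = [effective_reproductive_number(s, BETA, GAMMA, N) for s in s_vals]

# Index of the time point at which the prevalence I(t) is largest.
peak_index = max(range(len(i_vals)), key=i_vals.__getitem__)
print("R at the start of the outbreak:", r_vals[0])
print("R when prevalence peaks:", r_vals[peak_index])
print("R at the end of the simulation:", r_vals[-1])
```

In this sketch R(t) never increases, and the prevalence peaks at the step where R(t) crosses one, in line with equation (4).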

4. Estimation scheme. To calculate R(t), one needs to know the two epidemiological parameters β and γ, as well as the number of susceptibles S(t) and the population size N. As mentioned before, difficulties in the direct estimation of β, whose value reflects the rate at which contacts occur in the population and the probability of transmission occurring when a susceptible and an infective meet, and direct estimation of S(t) preclude direct estimation of R(t). As a result, we adopt an indirect approach, which proceeds by first finding the parameter set for which the model has the best agreement with the data and then calculating R(t) by using these parameters and the model-predicted time course of S(t). Simulation of the model also requires knowledge of the initial values, S0 and I0, which must also be estimated.

Although the model is framed in terms of the prevalence of infection I(t), the time-series data provides information on the weekly incidence of infection, which, in terms of the model, is given by the integral of the rate at which new infections arise over the week: ∫ βS(t)I(t)/N dt. We observe that the parameters β and N only appear (both in the model and in the expression for incidence) as the ratio β/N, precluding their separate estimation. Consequently we need only estimate the value of this ratio, which we denote by β̃ = β/N.
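To make the link between prevalence and weekly incidence concrete, here is a small sketch under stated assumptions. It augments the SIR system with a cumulative-incidence variable C(t) satisfying dC/dt = β̃ S I, so that the incidence in week j is C(t_j) − C(t_{j−1}). The helper names, the Runge-Kutta integrator, and the parameter values are illustrative assumptions, not part of the original study.

```python
def rhs(state, beta_tilde, gamma):
    """SIR right-hand side plus cumulative incidence C, with beta_tilde = beta / N."""
    s, i, _ = state
    new_infections = beta_tilde * s * i
    return (-new_infections, new_infections - gamma * i, new_infections)


def rk4_step(state, dt, beta_tilde, gamma):
    """One classical fourth-order Runge-Kutta step."""
    def shifted(base, slope, h):
        return tuple(x + h * d for x, d in zip(base, slope))

    k1 = rhs(state, beta_tilde, gamma)
    k2 = rhs(shifted(state, k1, 0.5 * dt), beta_tilde, gamma)
    k3 = rhs(shifted(state, k2, 0.5 * dt), beta_tilde, gamma)
    k4 = rhs(shifted(state, k3, dt), beta_tilde, gamma)
    return tuple(
        x + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for x, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def weekly_incidence(s0, i0, beta_tilde, gamma, n_weeks, steps_per_week=100):
    """Return the weekly incidence values and S at the weekly time points."""
    dt = 1.0 / steps_per_week
    state = (s0, i0, 0.0)
    incidence, susceptibles = [], [s0]
    for _week in range(n_weeks):
        c_start = state[2]
        for _step in range(steps_per_week):
            state = rk4_step(state, dt, beta_tilde, gamma)
        incidence.append(state[2] - c_start)  # integral of beta_tilde*S*I over the week
        susceptibles.append(state[0])
    return incidence, susceptibles


# Hypothetical parameter values, for illustration only.
z, s_weekly = weekly_incidence(5000.0, 10.0, 4.0e-4, 1.0, n_weeks=30)
print("incidence in the first five weeks:", z[:5])
```

Because dS/dt = −β̃ S I, each weekly incidence value also equals the drop in S over that week, which gives a simple consistency check on the integration.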

We employ inverse problem methodology to obtain estimates of the vector θ = (S0, I0, β̃, γ) ∈ R^p = R^4 by minimizing the difference between the model predictions and the observed data, according to a generalized least squares (GLS) criterion. In what follows, we refer to θ as the parameter vector, or simply as the parameter, in the inverse problem, even though some of its components are initial conditions rather than parameters of the underlying dynamic model.

4.1. Generalized Least Squares (GLS) estimation. The least squares estimation methodology is based on a statistical model for the observation process (referred to as the case-counting process) as well as the mathematical model. As is standard in many statistical formulations, it is assumed that our known model, together with a particular choice of parameters (the “true” parameter vector, written as θ0) exactly describes the epidemic process, but that the n observations {Yj}, j = 1, ..., n, are affected by random deviations (e.g., measurement errors) from this underlying process. More precisely, it is assumed that

\[ Y_j = z(t_j; \theta_0) + z(t_j; \theta_0)^{\rho} \epsilon_j \quad \text{for } j = 1, \ldots, n, \qquad (5) \]

where z(tj; θ0) denotes the weekly incidence given by the model under the true parameter, θ0, and is defined by the integral

\[ z(t_j; \theta_0) = \int_{t_{j-1}}^{t_j} \tilde{\beta} S(t; \theta_0) I(t; \theta_0) \, dt. \qquad (6) \]

Here t0 denotes the time at which the epidemic observation process started and the weekly observation time points are written as t1 < · · · < tn.

We remark that the choice of a particular statistical model (i.e., the error model for the observation process) is often a difficult task. While one can never be certain of the correctness of one's choice, there are post-inverse problem quantitative methods (e.g., involving residual plots) that can be effectively used to investigate this question; see the discussions and examples in [3]. A major goal of this paper is to present and illustrate use of such ideas and techniques in the context of surveillance data modeling.

The “errors” εj (note that the total measurement errors ε̃j = z(tj; θ0)^ρ εj are model-dependent) are assumed to be independent and identically distributed (i.i.d.) random variables with zero mean (E[εj] = 0), representing measurement error as well as other phenomena that cause the observations to deviate from the model predictions z(tj; θ0). The i.i.d. assumption means that the errors are uncorrelated across time and have identical variance. We assume the variance is finite and write var(εj) = σ^2 < ∞. We make no further assumptions about the distribution of the errors: specifically, we do not assume that they are normally distributed. Under these assumptions, the observation mean is equal to the model prediction, E[Yj] = z(tj; θ0), while the variance in the observations is a function of the time point, with var(Yj) = z(tj; θ0)^(2ρ) σ^2. In particular, this variance is longitudinally nonconstant and model-dependent. One situation in which this error structure may be appropriate is when observation errors scale with the size of the measurement (so-called relative noise), a reasonable scenario in a “counting” process.

Given a set of observations Y = (Y1, ..., Yn), the estimator θGLS = θGLS(Y) is defined as the solution of the normal equations

\[ \sum_{j=1}^{n} w_j \left[ Y_j - z(t_j; \theta) \right] \nabla_{\theta} z(t_j; \theta) = 0, \qquad (7) \]

where the weights wj are defined as

\[ w_j = \frac{1}{[z(t_j; \theta)]^{2\rho}}. \qquad (8) \]

If ρ = 1, the weights are taken to be inversely proportional to the square of the predicted incidence: wj = 1/[z(tj; θ)]^2. On the other hand, if ρ = 1/2, then the weights are proportional to the reciprocal of the predicted incidence; these correspond to assuming that the variance in the observations is proportional to the value of the model (as opposed to its square). The most popular assumption, the ρ = 0 case, leads to the standard ordinary least squares (OLS) approach; see [3] for a full discussion of OLS methods. For the problem and data set we investigate here, OLS did not produce reasonable results [15].
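The three weighting choices discussed above (ρ = 0, 1/2, 1) can be compared directly. The helper below is a hypothetical illustration of the weight formula wj = 1/[z(tj; θ)]^(2ρ); the function name and the example incidence values are assumptions, not taken from the paper.

```python
def gls_weights(model_values, rho):
    """Weights w_j = 1 / z_j**(2 * rho); predicted incidence must be positive."""
    if any(z <= 0.0 for z in model_values):
        raise ValueError("predicted incidence must be positive")
    return [1.0 / z ** (2.0 * rho) for z in model_values]


# Hypothetical predicted weekly incidence values, for illustration only.
predicted = [10.0, 100.0, 1000.0]
for rho in (0.0, 0.5, 1.0):
    # rho = 0 gives constant weights (OLS); larger rho downweights large counts.
    print("rho =", rho, "weights =", gls_weights(predicted, rho))
```

With ρ = 1 a week with ten times the predicted incidence receives one hundredth of the weight, which is why GLS fits emphasize the low-incidence tails of an outbreak more than an OLS fit does.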

Suppose {yj}, j = 1, ..., n, is a realization of the case counting process {Yj}, j = 1, ..., n, and define the function L(θ) as

\[ L(\theta) = \sum_{j=1}^{n} w_j \left[ y_j - z(t_j; \theta) \right]^2. \qquad (9) \]

The sensitivity of the predicted incidence with respect to S0 follows from differentiating equation (6):

\[ \frac{\partial z}{\partial S_0}(t_j; \theta) = \tilde{\beta} \int_{t_{j-1}}^{t_j} \left[ I(t; \theta) \frac{\partial S}{\partial S_0}(t; \theta) + S(t; \theta) \frac{\partial I}{\partial S_0}(t; \theta) \right] dt. \]

The weights depend on the values of the fitted model. These values are not known before carrying out the estimation procedure and consequently the GLS estimation is implemented as an iterative process. The first iteration is carried out by setting ρ = 0, which reduces the statistical model in equation (5) to Yj = z(tj; θ0) + εj, and also implies the weights in equation (7) are equal to one (wj = 1). This results in an ordinary least squares scheme, the solution of which provides an initial set of weights via equation (8). A weighted least squares fit is then performed using these weights, obtaining updated model values and hence an updated set of weights. The weighted least squares process is repeated until some convergence criterion is satisfied, such as successive values of the estimates being deemed to be sufficiently close to each other. The process can be summarized as follows:

1. Estimate θ̂GLS by θ̂(0) using an OLS criterion. Set k = 0. Set ρ = 1 or ρ = 1/2;

2. form the weights ŵj = 1/[z(tj; θ̂(k))]^(2ρ);

3. define L(θ) = Σ_{j=1}^{n} ŵj [yj − z(tj; θ)]^2. Re-estimate θ̂GLS by solving

\[ \hat{\theta}^{(k+1)} = \arg\min_{\theta \in \Theta} L(\theta) \]

to obtain the k + 1 estimate θ̂(k+1) for θ̂GLS;

4. set k = k + 1 and return to 2. Terminate the procedure when successive estimates for θ̂GLS are sufficiently close to each other.
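The four steps above can be sketched on a toy problem. The example below is an illustration under loud assumptions, not the authors' code: it replaces the SIR incidence model with a hypothetical one-parameter decay model z(t; θ) = 100·exp(−θt), and it replaces the Nelder-Mead search with a golden-section search, which is adequate only because the toy problem has a single parameter. All names and values are invented for the illustration.

```python
import math


def model(t, theta, amplitude=100.0):
    """Hypothetical one-parameter stand-in for the model output z(t; theta)."""
    return amplitude * math.exp(-theta * t)


def golden_section_minimize(f, lo, hi, tol=1e-10):
    """Minimize a unimodal function f on the interval [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return 0.5 * (a + b)


def iterative_gls(times, observations, rho, bounds=(0.01, 2.0), tol=1e-8, max_iter=50):
    """Iteratively reweighted least squares following steps 1 to 4."""
    def fit(weights):
        def cost(theta):
            return sum(
                w * (y - model(t, theta)) ** 2
                for w, y, t in zip(weights, observations, times)
            )
        return golden_section_minimize(cost, bounds[0], bounds[1])

    theta = fit([1.0] * len(times))  # step 1: OLS fit with unit weights
    for _ in range(max_iter):
        weights = [1.0 / model(t, theta) ** (2.0 * rho) for t in times]  # step 2
        theta_new = fit(weights)  # step 3: weighted least squares fit
        if abs(theta_new - theta) < tol:  # step 4: stop when estimates agree
            return theta_new
        theta = theta_new
    return theta


# Illustrative data: model output at theta = 0.3 with alternating 5% relative errors.
times = [float(t) for t in range(1, 11)]
clean = [model(t, 0.3) for t in times]
perturbed = [z * (1.0 + 0.05 * (-1) ** j) for j, z in enumerate(clean)]
print("GLS estimate (rho = 1):", iterative_gls(times, perturbed, rho=1.0))
```

For the four-parameter SIR problem a derivative-free multivariate optimizer would take the place of the golden-section search; the reweighting loop itself is unchanged.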

The convergence of this procedure is discussed in [9, 18]. This procedure was implemented using a direct search method, the Nelder-Mead simplex algorithm, as discussed by [28], provided by the MATLAB (The Mathworks, Inc.) routine fminsearch.

4.2. Estimation of the effective reproductive number. Let the pair (θ̂, Σ̂) denote the parameter estimate and covariance matrix obtained with the GLS methodology from a given realization {yj}, j = 1, ..., n, of the case-counting process. Simulation of the SIR model then allows the time course of the susceptible population, S(t; θ̂), to be generated. The time course of the effective reproductive number can then be calculated as

\[ R(t; \hat{\theta}) = S(t; \hat{\theta}) \, \hat{\tilde{\beta}} / \hat{\gamma}. \]

This trajectory is our central estimate of R(t).

The uncertainty in the resulting estimate of R(t) can be assessed by repeated sampling of parameter vectors from the corresponding sampling distribution obtained from the asymptotic theory, and applying the above methodology to calculate the R(t) trajectory that results each time. To generate m such sample trajectories, we sample m parameter vectors, θ(k), from the 4-multivariate normal distribution N4(θ̂, Σ̂). We require that each θ(k) lies within a feasible region Θ determined by biological constraints. If this is not the case for a particular sample, we discard it and then we resample until θ(k) ∈ Θ. Numerical solution of the SIR model using θ(k) allows the sample trajectory R(t; θ(k)) to be calculated. We summarize these steps involved in the construction of the sampling distribution of the effective reproductive number:

1. Set k = 1;

2. obtain the k-th parameter sample from the 4-multivariate normal distribution: θ(k) ∼ N4(θ̂, Σ̂);

3. if θ(k) ∉ Θ (constraints are not satisfied) return to 2. Otherwise go to 4;

4. using θ = θ(k) find numerical solutions, denoted by S(t; θ(k)), I(t; θ(k)), to the nonlinear system defined by Equations (1) and (2). Construct the effective reproductive number as follows:

\[ R(t; \theta^{(k)}) = S(t; \theta^{(k)}) \frac{\tilde{\beta}^{(k)}}{\gamma^{(k)}}, \]

where θ(k) = (S0(k), I0(k), β̃(k), γ(k));

5. set k = k + 1. If k > m then terminate. Otherwise return to 2.

Uncertainty estimates for R(t) are calculated by finding appropriate percentiles of the distribution of the R(t) samples.
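The sampling procedure of this subsection can be sketched as follows. This is a minimal illustration, not the authors' implementation: the estimate and covariance matrix are invented, the feasible region is assumed to be "all components positive and S0 β̃/γ > 1", multivariate normal draws are produced with a hand-written Cholesky factor, and percentiles use a simple order-statistic rule.

```python
import math
import random


def cholesky(matrix):
    """Lower-triangular factor L with L L^T = matrix (symmetric positive definite)."""
    n = len(matrix)
    low = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            acc = matrix[r][c] - sum(low[r][k] * low[c][k] for k in range(c))
            low[r][c] = math.sqrt(acc) if r == c else acc / low[c][c]
    return low


def sample_feasible(mean, chol, rng):
    """Steps 2 and 3: draw theta ~ N(mean, cov), resampling until it is feasible."""
    while True:
        z = [rng.gauss(0.0, 1.0) for _ in mean]
        theta = [m + sum(chol[r][k] * z[k] for k in range(r + 1))
                 for r, m in enumerate(mean)]
        s0, _i0, beta_tilde, gamma = theta
        # Assumed feasible region: positive components and S0 * beta_tilde / gamma > 1.
        if min(theta) > 0.0 and s0 * beta_tilde / gamma > 1.0:
            return theta


def r_trajectory(theta, n_weeks, steps_per_week=10):
    """Step 4: solve equations (1)-(2) by RK4 and return R(t) at weekly time points."""
    s, i, beta_tilde, gamma = theta
    dt = 1.0 / steps_per_week

    def rhs(s_val, i_val):
        inf = beta_tilde * s_val * i_val
        return -inf, inf - gamma * i_val

    traj = [s * beta_tilde / gamma]
    for _week in range(n_weeks):
        for _step in range(steps_per_week):
            k1 = rhs(s, i)
            k2 = rhs(s + 0.5 * dt * k1[0], i + 0.5 * dt * k1[1])
            k3 = rhs(s + 0.5 * dt * k2[0], i + 0.5 * dt * k2[1])
            k4 = rhs(s + dt * k3[0], i + dt * k3[1])
            s += dt / 6.0 * (k1[0] + 2.0 * k2[0] + 2.0 * k3[0] + k4[0])
            i += dt / 6.0 * (k1[1] + 2.0 * k2[1] + 2.0 * k3[1] + k4[1])
        traj.append(s * beta_tilde / gamma)
    return traj


def r_percentile_bands(mean, cov, m, n_weeks, seed=0, lower=0.025, upper=0.975):
    """Pointwise lower and upper percentiles of m sampled R(t) trajectories."""
    rng = random.Random(seed)
    chol = cholesky(cov)
    samples = [r_trajectory(sample_feasible(mean, chol, rng), n_weeks) for _ in range(m)]
    bands = []
    for week in range(n_weeks + 1):
        values = sorted(traj[week] for traj in samples)
        bands.append((values[int(lower * (m - 1))],
                      values[int(math.ceil(upper * (m - 1)))]))
    return bands


# Hypothetical estimate (S0, I0, beta_tilde, gamma) and covariance, for illustration.
THETA_HAT = [5000.0, 10.0, 4.0e-4, 1.0]
SIGMA_HAT = [[4.0e4, 0.0, 0.0, 0.0],
             [0.0, 4.0, 0.0, 0.0],
             [0.0, 0.0, 4.0e-10, 5.0e-7],
             [0.0, 0.0, 5.0e-7, 2.5e-3]]
bands = r_percentile_bands(THETA_HAT, SIGMA_HAT, m=200, n_weeks=20)
print("interval for R at t0:", bands[0])
```

The sample size m = 200 keeps the sketch fast; the study itself uses m = 10,000 trajectories.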

Figure 2. Results from applying the GLS methodology to synthetic data with non-constant variance noise (α = 0.075), using n = 1,000 observations. The initial guess for the optimization routine was θ = 1.10θ0. The weights in the cost function were equal to 1/z(tj; θ)^2, for j = 1, ..., n. Panel (a) depicts the observed and fitted values and panel (b) displays 1,000 of the m = 10,000 R(t) sample trajectories. Residuals plots are presented in panels (c) and (d): modified residuals versus fitted values in (c) and modified residuals versus time in (d).

Table 2. Estimates from a synthetic data set of size n = 1,000, with non-constant variance using α = 0.075. The R(t) sample size is m = 10,000. The initial guess of the optimization algorithm was θ = 1.10θ0. Each weight in the cost function L(θ) (see Equation (9)) was equal to 1/z(tj; θ)^2 for j = 1, ..., n. The units of the estimated quantities are: people, for S0 and I0; per person per week, for β̃; and per week, for γ.

Parameter | True value | Initial guess | Estimate | Standard error

True value of the reproductive number at time t0: R(t0) = S0 β̃/γ = 3.500.

5. Estimation scheme applied to synthetic data. We generated a synthetic data set with nonconstant variance noise. The true value θ0 was fixed, and was used to calculate the numerical solution z(tj; θ0). Observations were computed in the following fashion:

\[ Y_j = z(t_j; \theta_0) + z(t_j; \theta_0) \alpha V_j = z(t_j; \theta_0) \left( 1 + \alpha V_j \right), \qquad (14) \]

where the Vj are independent random variables with standard normal distribution (i.e., Vj ∼ N(0, 1)), and 0 < α < 1 denotes a desired percentage. Hence ρ = 1 in the general formulation with εj = αVj. In this way, var(Yj) = [z(tj; θ0)α]^2, which is nonconstant across the time points tj. If the terms {vj}, j = 1, ..., n, denote a realization of {Vj}, j = 1, ..., n, then a realization of the observation process is denoted by

\[ y_j = z(t_j; \theta_0) \left( 1 + \alpha v_j \right). \]
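Equation (14) translates directly into code. The sketch below assumes the model values z(tj; θ0) have already been computed; the list of values, the function name, and the seed are hypothetical, while α = 0.075 matches the noise level quoted for the synthetic data set.

```python
import random


def synthetic_observations(model_values, alpha, seed=0):
    """Draw y_j = z_j * (1 + alpha * v_j) with v_j standard normal (relative noise)."""
    rng = random.Random(seed)
    return [z * (1.0 + alpha * rng.gauss(0.0, 1.0)) for z in model_values]


# Hypothetical weekly incidence values standing in for z(t_j; theta_0).
z_true = [5.0, 20.0, 80.0, 250.0, 400.0, 260.0, 90.0, 25.0, 6.0]
y = synthetic_observations(z_true, alpha=0.075, seed=1)
for z_j, y_j in zip(z_true, y):
    # The absolute error grows with z_j, while the relative error does not.
    print(z_j, round(y_j, 3), round((y_j - z_j) / z_j, 4))
```

Fixing the seed makes the realization reproducible, which is convenient when comparing fits obtained with different weighting schemes on the same synthetic data.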

Residuals plots are displayed in Fig. 2(c) and (d). Because αvj = (yj − z(tj; θ0))/z(tj; θ0), by construction of the synthetic data, the residuals analysis focuses on the ratios

\[ \frac{y_j - z(t_j; \hat{\theta}_{GLS})}{z(t_j; \hat{\theta}_{GLS})}, \]

which in the labels of Fig. 2(c) and (d) are referred to as “Modified residuals” (for a more detailed discussion of residuals and modified residuals, see [3]). In Fig. 2(c) these ratios are plotted against z(tj; θ̂GLS), while Panel (d) displays them versus the time points tj. The lack of any discernible patterns or trends in Fig. 2(c) and (d) suggests that the errors in the synthetic data set conform to the assumptions made in the formulation of the statistical model of equation (14). In particular, the errors are uncorrelated and have variance that scales according to the relationship stated above.
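The modified residuals used in this diagnostic are simple to compute once fitted values are available. The following sketch uses hypothetical observations and fitted values; it only prepares the two sets of pairs that would be plotted (residual versus fitted value, and residual versus time).

```python
def modified_residuals(observations, fitted):
    """Ratios (y_j - z_j) / z_j, where z_j is the fitted model value."""
    return [(y - z) / z for y, z in zip(observations, fitted)]


# Hypothetical observations and fitted values, for illustration only.
times = [1, 2, 3, 4, 5]
fitted = [10.0, 40.0, 100.0, 50.0, 20.0]
observed = [11.0, 38.0, 104.0, 47.5, 20.0]

residuals = modified_residuals(observed, fitted)
versus_fitted = list(zip(fitted, residuals))  # pairs for a panel like Fig. 2(c)
versus_time = list(zip(times, residuals))     # pairs for a panel like Fig. 2(d)
print(versus_fitted)
print(versus_time)
```

If the relative-noise assumption holds, these ratios should scatter around zero with a spread that does not depend on the fitted value or on time.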

6. Analysis of influenza outbreak data. The GLS methodology was applied to longitudinal observations of six influenza outbreaks (see Section 2), giving estimates of the parameters and the reproductive number for each season. The number of observations n varies from season to season. The R(t) sample size was m = 10,000 in each case. The set of admissible parameters Θ is defined by the lower and upper bounds listed in Table 3 along with the inequality constraint S0 β̃/γ > 1. The bounds in Table 3 were obtained from or based on [10, 29, 32] and references therein. For brevity, we only present here the results obtained using data from the 1998–99 season with GLS methods. Further results including (unsuccessful) use of OLS methodology can be found in [15].

Table 3. Lower and upper bounds on the initial conditions and parameters.

1.00×10^2 < S0 < 7.00×10^6 | people
0.00 < I0 < 5.00×10^3 | people
7.00×10^−9 < β̃ < 7.00×10^−1 | weeks^−1 people^−1
3/7 < 1/γ < 4/7 | weeks

Table 4. Results of GLS estimation applied to influenza data from season 1998–99, weights equal to 1/z(tj; θ)^2.

Parameter | Estimate | Unit | Standard error
S0 | 7.939×10^3 | people | 1.521×10^4
I0 | 2.436×10^−1 | people | 4.216×10^−1
β̃ | 3.458×10^−4 | weeks^−1 people^−1 | 5.233×10^−5
γ | 2.333×10^0 | weeks^−1 | 5.318×10^0

L(θ̂GLS) = 1.754×10^1
σ̂^2 GLS = 6.047×10^−1
Min. R(t; θ̂GLS) | 0.843 | [0.784, 1.018]
Max. R(t; θ̂GLS) | 1.177 | [1.052, 1.252]

Visual inspection suggests that the model fits obtained using the GLS approach (Fig. 3) are even worse than those obtained using OLS (the results obtained using OLS can be found in [15]). This is somewhat misleading, however, because the weights, defined as wj = 1/[z(tj; θ)]^2, mean that the GLS fitting procedure (unlike visual inspection of the figures) places increased emphasis on datapoints whose model value is small and decreased emphasis on datapoints where the model value is large. If these graphs are, instead, plotted with a logarithmic scale on the vertical axis, an accurate visualization is obtained (Fig. 4): multiplicative observation
