VIETNAM NATIONAL UNIVERSITY, HANOI
VIETNAM JAPAN UNIVERSITY
CONG PHUONG CAO
COMPARING RECEPTOR BINDING PROPERTIES OF 2019-nCoV VIRUS WITH THOSE OF SARS-CoV VIRUS USING COMPUTATIONAL BIOPHYSICS APPROACH
MASTER'S THESIS
MAJOR: NANOTECHNOLOGY
CODE: 8440140.11 QTD
RESEARCH SUPERVISOR:
Associate Prof. Dr. NGUYEN THE TOAN
Hanoi, 2021
It could be said that without Prof. Nguyen The Toan, I could not have gone this far on my scientific research path, much less conducted this master's thesis. Therefore, first of all, I want to express my sincere thanks to Prof. Nguyen The Toan, my beloved master's thesis supervisor in the VNU Key Laboratory on Multiscale Simulation of Complex Systems and the Faculty of Physics, VNU University of Science, Vietnam National University.
I also wish to thank Dr. Pham Trong Lam, who guided me in my very first steps in the machine learning field and gave me precious advice for my research during my internship period and thesis defense preparation.
I would like to thank the lecturers in the VJU Master's Program in Nanotechnology for many inspirational discussions and helpful knowledge from classes.
I would also like to thank all the staff, lecturers, and my good friends at VJU for helping me a lot during my memorable study at VJU.
This research is funded by Vietnam National University under grant number QG.20.82.
Hanoi, 17 July 2021
Cong Phuong Cao
2.1 Molecular Dynamics
2.1.1 Integration Algorithm
2.1.2 Force field
2.2 Materials and Models
2.3 Simulation Details
2.3.1 Thermostat and Barostat
2.3.2 Periodic Boundary Conditions
3 ANALYSES METHODS
3.1 Sequence Alignment
3.2 Root Mean Square Deviation
3.3 Root Mean Square Fluctuation
3.4 Principal Component Analysis
3.5 Variational Autoencoder
4 RESULTS AND DISCUSSION
4.1 Preliminary Sequence Alignments of The Viral RBDs
4.2 Deviations and Fluctuations of The Structural Backbone Atoms
4.2.1 Root Mean Square Deviations
4.2.2 Root Mean Square Fluctuations
4.3 Principal Component Analysis
4.4 Machine Learning on 6M0J System
A IN-HOUSE SOURCE CODE
A.1 Data Pre-processing Source Code
A.2 Autoencoder Source Code
List of Tables
2.1 The molecules simulated for each system
3.1 The detailed parameters of VAE model
4.1 The trace of the covariance matrix of the projections of the protein backbones on the two largest principal components
List of Figures
1.1 The binding of coronavirus spike protein to human ACE2 receptor
1.2 Antibodies neutralizing SARS-CoV-2 virus by blocking its interaction with human ACE2 receptor
2.1 A 2-dimensional PBC view along the z-axis direction of the 6VW1 system. The primitive system is surrounded by and interacts with its images
2.2 A typical snapshot of the 6M0J system after being simulated for 800 ns, showing the arrangement of RBD and ACE2 fluctuating in water
3.1 Illustration of VAE structure used for protein datasets
4.1 The sequence alignments of the viral RBD of 6VW1 and 6M0J for two variants of SARS-CoV-2 virus, and of 2AJF for SARS-CoV virus
4.2 The location of four discovered significant mutations of the viral RBD
4.3 The root mean square deviations of the backbone of the human ACE2 receptor and of the viral RBD protein
4.4 The root mean square fluctuations of the backbone of the human ACE2 receptor and of the viral RBD protein
4.5 The location of residue 113 of the viral RBD in the 6VW1 system
4.6 The location of residue 50 of the viral RBD in the 2AJF system
4.7 The probability density in the plane of the two largest principal components from the PCA of the backbone structure of proteins
4.8 Latent space projection of variational autoencoder trained on the distance matrix of RBD-ACE2 complex of 6M0J
B.1 Latent space projection of variational autoencoder trained on the distance matrix of RBD-ACE2 complex of 6M0J
B.2 Latent space projection of variational autoencoder trained on the distance matrix of RBD-ACE2 complex of 6M0J
B.3 Latent space projection of variational autoencoder trained on the distance matrix of RBD-ACE2 complex of 6M0J
List of Abbreviations
SARS Severe Acute Respiratory Syndrome
SARS-CoV-2 Severe Acute Respiratory Syndrome CoronaVirus 2
2019-nCoV 2019 Novel CoronaVirus, colloquial name of SARS-CoV-2
SARS-CoV or SARS-CoV-1 Severe Acute Respiratory Syndrome CoronaVirus (caused the epidemic in June 2003, different from 2019-nCoV)
ACE2 Angiotensin Converting Enzyme 2
EOM Newton’s Equations of Motion
RCSB The Research Collaboratory for Structural Bioinformatics
PBC Periodic Boundary Conditions
RMSD Root Mean Square Deviation
RMSF Root Mean Square Fluctuation
PCA Principal Component Analysis
VAE Variational Autoencoder
Chapter 1 INTRODUCTION
By the end of 2019, the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), also known as 2019-nCoV, was detected in Wuhan, China, and spread rapidly to countries and regions all over the world, forcing the World Health Organization to declare a public health emergency only three months later [1]. Because of the extremely fast spread, the fast mutation rate, and the toxicity of SARS-CoV-2, scientists are rushing to find a cure for the severe acute respiratory syndrome caused by the virus. It turns out that the genome of SARS-CoV-2 is very similar to the genomes of other coronaviruses, and the virus can be classified as a variant of the severe acute respiratory syndrome coronavirus (SARS-CoV), which caused the SARS epidemic in June 2003.
The structure of a coronavirus can be divided into two parts, namely the core and the shell. The viral core is the single-stranded RNA viral genome. The viral shell is a combination of lipids, envelope proteins, and spike proteins, of which the spike proteins play an important role in the entry of the RNA viral genome into the host cell. The receptor-binding domain (RBD) is a subunit of the spike glycoprotein (also known as protein S) attached to the viral outer shell [2], [3]. The RBD recognizes and binds to human cells through a receptor called Angiotensin Converting Enzyme 2 (ACE2), like a key being inserted into a lock (illustrated in Figure 1.1) [4]. After that, the coronavirus is incorporated into the host cell to release the viral RNA into the cytoplasm.
According to [6]–[10], the RBDs of SARS-CoV and SARS-CoV-2 have significant similarities in genome sequence and also use the same cellular entry receptor, namely ACE2. Because of the close relation between SARS-CoV and SARS-CoV-2, an important question arises: what are the significant differences (mutations) between them that make SARS-CoV-2 much more contagious and dangerous? It is supposed that the mutations in the RBD of SARS-CoV-2 with respect to that of SARS-CoV can impact the binding affinity for the ACE2 receptor [8], [11]. In this study, we aim to answer the above question by analyzing the structural differences in the binding of the RBDs of two variants of SARS-CoV-2 and of SARS-CoV to the human ACE2 receptor.
FIGURE 1.1: The binding of coronavirus spike protein to human ACE2 receptor. (The figure is from [5].)
One approach is to study the interactions of coronaviruses (including SARS-CoV-2) with the human ACE2 receptor using computational biophysics methods, such as molecular dynamics and unsupervised machine learning techniques. In this study, we use both molecular dynamics and machine learning. To investigate the characteristics of the binding mechanism of the complex of the RBD protein and the ACE2 receptor, conventional molecular dynamics is used to simulate the molecular interactions. The trajectories obtained from the molecular dynamics simulations are then used as input for principal component analysis (PCA) and a variational autoencoder (both unsupervised learning methods) to extract features (knowledge) of the binding.
It is expected that, by knowing the binding mechanism between the viral RBDs and the ACE2 receptor, one can build and develop antibodies or antiviral drugs based on the binding features of the RBD of the SARS-CoV-2 spike protein. The SARS-CoV-2 spike protein has been the main target for antibody and antiviral drug design throughout the vaccine history. Antibodies and some antiviral drugs work on the principle of attacking the RBD of viruses, binding to RBD regions before the viruses can interact with the ACE2 receptor (Figure 1.2). By understanding the binding mechanism between the SARS-CoV-2 RBD and the human ACE2 receptor, it is possible to design and develop therapeutic antibodies and antivirals for the treatment of acute respiratory infections caused by the virus. Notably, not all therapeutic antibodies or antivirals work well with different viruses of the same strain. This is because of the differences in structure caused by mutations between virus variants [8], [11]. Therefore, to evaluate the reliability of the model, we need to study the interaction mechanism of the SARS-CoV-2 coronavirus with ACE2 in comparison with the interaction mechanisms of other coronaviruses.
FIGURE 1.2: Antibodies neutralizing SARS-CoV-2 virus by blocking its interaction with human ACE2 receptor.
This thesis is organized as follows. After the introduction in Chapter 1, the methodologies of the simulation and of the analyses are described in Chapter 2 and Chapter 3, respectively. All results are shown and discussed in Chapter 4. The final chapter presents the conclusions.
Chapter 2 MOLECULAR DYNAMICS SIMULATION
2.1 Molecular Dynamics
Molecular dynamics (MD) is a computer simulation technique that is widely used for the theoretical study of many-body systems [12], or in our case of the biological systems of the RBD-ACE2 complex. An MD algorithm calculates the time evolution of the system from a given initial configuration (positions and velocities). In other words, from an initial configuration of the system, MD can predict the future configurations, which are called the trajectory, with some tolerable errors, while the behavior of the system still obeys the ergodic hypothesis of physics and thermodynamics. The trajectory of the system reveals detailed information on the changes and fluctuations of proteins and nucleic acids. As a result, MD is a very suitable and powerful tool to investigate the thermodynamics and structure of biological systems, or in our case of the systems of the RBD-ACE2 complex.
The conventional molecular dynamics method comes from Newton's second law. Assume that our system has $N$ particles (atoms), and that the $i$-th particle has mass $m_i$, position $\mathbf{r}_i$, and acceleration $\mathbf{a}_i = d^2\mathbf{r}_i/dt^2$ at the current time. Governed by Newton's equations of motion (EOM), the force applied to the $i$-th particle can be expressed as
\[ \mathbf{F}_i = m_i \mathbf{a}_i = m_i \frac{d^2 \mathbf{r}_i}{dt^2} \tag{2.1} \]
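On the other hand, assume that the interaction potential (potential energy) between particles is known as a function of the positions $\mathbf{r}_i$ of the $N$ particles, $U(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N)$. The force applied to the $i$-th particle can then also be derived from the derivative (the gradient) of $U(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N)$ as
\[ \mathbf{F}_i = -\nabla_i U(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N) = -\frac{\partial U}{\partial \mathbf{r}_i} \tag{2.2} \]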
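Combining equations 2.1 and 2.2 yields
\[ m_i \frac{d^2 \mathbf{r}_i}{dt^2} = -\frac{\partial U}{\partial \mathbf{r}_i} \tag{2.3} \]
2.1.1 Integration Algorithm
There are many criteria for an integration algorithm, such as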
• The algorithm should approximate the true trajectory for a long period of time with some tolerable errors.
• The algorithm should be time-reversible.
• The algorithm should be fast enough to perform.
• The algorithm should be easy to implement.
• The algorithm should conserve some macroscopic physical quantities.
In this thesis, the leap-frog algorithm is chosen for integrating Newton's EOM. The reason for this choice is that the leap-frog algorithm satisfies all of those criteria and compares well with other algorithms. Importantly, the leap-frog algorithm is not only fast but also obeys the ergodic hypothesis of physics and thermodynamics, which makes the physical quantities calculated from the system configurations reliable.
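Assume that the time-step is $\Delta t$, and that the position and acceleration vectors of the $i$-th particle at the current time $t$ are $\mathbf{r}_i(t)$ and $\mathbf{a}_i(t)$, respectively. In this algorithm, the velocities are assumed to be first calculated at the time $t - \Delta t/2$ as $\mathbf{v}_i(t - \Delta t/2)$. The leap-frog algorithm follows this scheme:
1. Compute accelerations from the current positions:
\[ \mathbf{a}_i(t) = \frac{\mathbf{F}_i(t)}{m_i} = -\frac{1}{m_i} \frac{\partial U}{\partial \mathbf{r}_i} \tag{2.4} \]
2. Update the velocities to the next half time-step:
\[ \mathbf{v}_i(t + \Delta t/2) = \mathbf{v}_i(t - \Delta t/2) + \mathbf{a}_i(t)\,\Delta t \tag{2.5} \]
3. Update the positions to the next time-step:
\[ \mathbf{r}_i(t + \Delta t) = \mathbf{r}_i(t) + \mathbf{v}_i(t + \Delta t/2)\,\Delta t \tag{2.6} \]
4. Advance to the next time-step and repeat from step (1).
To calculate the total energy at the time $t$, the velocities can be approximated by
\[ \mathbf{v}_i(t) = \frac{1}{2}\left[ \mathbf{v}_i(t - \Delta t/2) + \mathbf{v}_i(t + \Delta t/2) \right] \tag{2.7} \]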
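The leap-frog scheme described above can be sketched in a few lines of Python. This is only an illustrative toy for a one-dimensional harmonic oscillator, not the integrator used in the thesis (the production runs rely on GROMACS); the function name `leapfrog`, the spring constant, and the time-step are assumptions chosen for the example.

```python
def leapfrog(x, v_half, accel, dt, n_steps):
    """Toy leap-frog integrator (illustrative sketch).

    x      : position at time t
    v_half : velocity at time t - dt/2
    accel  : function returning the acceleration for a given position
    Returns a list of (position, velocity) pairs, both evaluated at full steps.
    """
    trajectory = []
    for _ in range(n_steps):
        a = accel(x)                     # step 1: acceleration from current position
        v_new = v_half + a * dt          # step 2: velocity at t + dt/2
        v_full = 0.5 * (v_half + v_new)  # on-step velocity, used for the energy
        trajectory.append((x, v_full))
        x = x + v_new * dt               # step 3: position at t + dt
        v_half = v_new                   # step 4: advance and repeat
    return trajectory


# Hypothetical test system: harmonic oscillator with k = m = 1.
k, m, dt = 1.0, 1.0, 0.01
accel = lambda x: -k * x / m
# Start from x(0) = 1, v(0) = 0; the half-step velocity is v(0) - a(0) * dt / 2.
trajectory = leapfrog(1.0, 0.005, accel, dt, 10000)
energies = [0.5 * m * v * v + 0.5 * k * x * x for x, v in trajectory]
```

In this toy case the total energy stays close to its initial value of 0.5 over the whole run, which illustrates the conservation criterion listed above.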
2.1.2 Force field
In MD simulation, the interaction potential $U(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N)$ that we assumed in equation 2.2 plays an important role in determining the forces between particles (atoms). The interatomic potential of an interacting system can be expanded in terms of many-body contributions as
\[ U = \sum_{i} U_1(\mathbf{r}_i) + \sum_{i<j} U_2(\mathbf{r}_i, \mathbf{r}_j) + \sum_{i<j<k} U_3(\mathbf{r}_i, \mathbf{r}_j, \mathbf{r}_k) + \sum_{i<j<k<l} U_4(\mathbf{r}_i, \mathbf{r}_j, \mathbf{r}_k, \mathbf{r}_l) + \cdots \tag{2.8} \]
where $i, j, k, l$ are indexes of the particles of the system, $U_1$ is the one-body term describing the external potential acting on a particle, $U_2$ is the two-body term describing the interaction between only two particles, $U_3$ is the three-body term describing the interaction between only three particles, and $U_4$ is the four-body term, defined similarly.
For biological systems, there is usually no one-body term, and expansion 2.8 only needs to be carried up to $U_4$, because terms involving more than four bodies are unnecessary, whereas the retained terms are enough to describe the physical picture of the system at a reasonable computational cost. Empirical force fields $U$ are usually used, whose parameters are obtained from experiments or quantum mechanical calculations. The force field $U$ from expansion 2.8 for biological systems typically has the form
\[ U = E_{vdw} + E_{elect} + E_{bonds} + E_{bends} + E_{dihedrals} \tag{2.9} \]
where $E_{vdw}$ is the van der Waals potential, $E_{elect}$ is the electrostatic potential, $E_{bonds}$ is the bond stretching potential, $E_{bends}$ is the bond angle bending potential, and $E_{dihedrals}$ is the bond torsion angle potential.
From expression 2.9, $E_{vdw} + E_{elect} + E_{bonds}$ is equivalent to the two-body term $U_2$ of expression 2.8. The terms $E_{bends}$ and $E_{dihedrals}$ are equivalent to $U_3$ and $U_4$, respectively. Moreover, the first two terms of expression 2.9 are considered non-bonded interactions between atoms. The last three terms of 2.9 represent the bonded, or intramolecular, interactions, as multiplets of atoms are connected by chemical bonds.
In detail, every term in 2.9 can be expressed as follows.
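\[ E_{vdw} = \sum_{i<j}^{\text{atoms}} \left[ \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} \right] \tag{2.10} \]
\[ E_{elect} = \sum_{i<j}^{\text{atoms}} \frac{q_i q_j}{4 \pi \varepsilon_0 r_{ij}} \tag{2.11} \]
\[ E_{bonds} = \sum_{\text{bonds}} \frac{1}{2} k_r (r - r_{eq})^2 \tag{2.12} \]
\[ E_{bends} = \sum_{\text{angles}} \frac{1}{2} k_\theta (\theta - \theta_{eq})^2 \tag{2.13} \]
\[ E_{dihedrals} = \sum_{\text{dihedrals}} \frac{V_n}{2} \left[ 1 + \cos(n\phi - \gamma) \right] \tag{2.14} \]
Here $r_{ij}$ is the distance between atoms $i$ and $j$, $A_{ij}$ and $B_{ij}$ are the van der Waals parameters of the pair, $q_i$ is the partial charge of atom $i$, and $\phi$ is the torsional angle.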
In the bonded interactions, the bond stretching and the angle bending can both be described by the harmonic oscillator model. Therefore, the terms 2.12 and 2.13 have harmonic energy functions, where $r_{eq}$ and $\theta_{eq}$ are the bond lengths and bond angles at equilibrium, and $k_r$ and $k_\theta$ are the vibrational force constants. In the bond torsion angle potential 2.14, the torsional barrier $V_n$ corresponds to the $n$-th barrier for a particular torsional angle with phase $\gamma$.
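To make the individual terms of expression 2.9 concrete, the following Python sketch evaluates one contribution of each kind. The functional forms are the ones commonly used in biomolecular force fields (harmonic bonds and angles, a cosine series for dihedrals, a 12-6 van der Waals term, and a Coulomb term); they are assumed here for illustration and are not taken from the force field files used in this work. All parameter values are hypothetical, and the Coulomb prefactor is the conversion constant in GROMACS-style units (kJ mol^-1 nm e^-2).

```python
import math

COULOMB_CONSTANT = 138.935458  # kJ mol^-1 nm e^-2, i.e. 1/(4*pi*eps0) in MD units


def bond_energy(r, k_r, r_eq):
    """Harmonic bond stretching for one bond."""
    return 0.5 * k_r * (r - r_eq) ** 2


def angle_energy(theta, k_theta, theta_eq):
    """Harmonic angle bending for one angle (angles in radians)."""
    return 0.5 * k_theta * (theta - theta_eq) ** 2


def dihedral_energy(phi, v_n, n, gamma):
    """Single cosine term of the torsion potential."""
    return 0.5 * v_n * (1.0 + math.cos(n * phi - gamma))


def vdw_energy(r, a_ij, b_ij):
    """12-6 van der Waals term for one atom pair."""
    return a_ij / r ** 12 - b_ij / r ** 6


def coulomb_energy(r, q_i, q_j):
    """Coulomb term for one atom pair with charges in units of e."""
    return COULOMB_CONSTANT * q_i * q_j / r


# Hypothetical parameters, chosen only to exercise the functions.
a_ij, b_ij = 1.0e-6, 2.0e-3
r_min = (2.0 * a_ij / b_ij) ** (1.0 / 6.0)   # location of the 12-6 minimum
well_depth = vdw_energy(r_min, a_ij, b_ij)   # equals -b_ij**2 / (4 * a_ij)
```

The bonded terms vanish at their equilibrium geometry, and the 12-6 term has its minimum at the pair distance computed above.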
2.2 Materials and Models
Two variants of the SARS-CoV-2 virus are investigated in this work. The complexes of RBD and ACE2 are obtained from the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB) [13] database, with ID 2AJF [14] for the SARS-CoV virus and IDs 6M0J [15] and 6VW1 [16] for the two variants of the SARS-CoV-2 virus. From now on, these systems will be referred to as 2AJF, 6M0J, and 6VW1, respectively, for easy identification.
The primary sequences of the RBD proteins of the SARS-CoV virus and the SARS-CoV-2 viruses are aligned using Multiple Sequence Alignment by the ClustalW [17] web-server of the Kyoto University Bioinformatics Center with the BLOSUM matrix [18].
For the main molecular simulations, the initial simulation configurations of all systems are generated using the CHARMM-GUI web-server [19] and are manually adjusted afterward. The GROMACS/2018.6 software package [20] is used to run the MD simulations on the systems.
For the force fields, several force field packages are used, depending on the function of each component of the system. The proteins and ions of the system are simulated using parameters from the CHARMM36 force field [21]. For the glycans, which are a part of the ACE2 receptor, the GLYCAM06 force field [22] is chosen for the parametrization. The explicit solvent model TIP3P [23] is applied to represent water in the system. The total charge of the system is neutralized by adding sodium and chloride ions. In addition, the physiological salt concentration in the human cell environment, about 150 mM, determines the number of added Na+ and Cl- ions. The detailed numbers of molecules in the systems are given in Table 2.1.
TABLE 2.1: The molecules simulated for each system.
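As a rough illustration of how a salt concentration translates into a number of ions, the sketch below converts 150 mM into ion pairs for a cubic box. It assumes the 14 nm box edge quoted in the section on periodic boundary conditions and ignores both the volume occupied by the proteins and the extra counter-ions needed for neutralization, so it only gives an order-of-magnitude estimate and is not the procedure used by CHARMM-GUI. The function name `ion_pairs` is hypothetical.

```python
AVOGADRO = 6.02214076e23  # particles per mole


def ion_pairs(concentration_molar, box_edge_nm):
    """Estimate the number of salt ion pairs in a cubic box.

    concentration_molar : salt concentration in mol/L
    box_edge_nm         : box edge length in nm
    """
    edge_dm = box_edge_nm * 1.0e-8   # 1 nm = 1e-8 dm, and 1 dm^3 = 1 L
    volume_litre = edge_dm ** 3
    return concentration_molar * AVOGADRO * volume_litre


estimate = ion_pairs(0.150, 14.0)
```

Under these assumptions the estimate is on the order of a few hundred Na+ and Cl- pairs per box.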
2.3 Simulation Details
2.3.1 Thermostat and Barostat
The interactions between the viral RBD and the human ACE2 happen in the human body. Accordingly, the temperature of all systems is set to the human body temperature, 310 K, and the pressure of the systems is 1 atm. However, the temperature and pressure should behave as naturally as possible. In other words, the systems need to be simulated in the correct thermodynamic ensemble, which is characterized by the constraint of some specific thermodynamic quantities. In the case of our systems, the isothermal-isobaric (NPT) ensemble describes the realistic systems best. To mimic the system in such an ensemble, thermostat and barostat algorithms are required to regulate the temperature and pressure throughout the MD run.
In the equilibration stage of the simulation (the very early stage), the velocity-rescaling thermostat and the Berendsen barostat are used to guide the system to equilibrium as fast as possible. These algorithms save a lot of time and computational cost, despite the fact that they do not have much physical meaning. Besides, the non-equilibrium configurations of the system are not of interest. The equilibration procedure is performed for 1 ns.
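The basic idea of steering the temperature by rescaling velocities can be sketched as follows. This is a deliberately simplified, deterministic rescaling that forces the instantaneous temperature to the target value in one step; it is not the stochastic velocity-rescaling thermostat implemented in GROMACS, and it ignores constraints and centre-of-mass motion when counting degrees of freedom. The units follow the GROMACS convention (masses in g/mol, velocities in nm/ps, energies in kJ/mol); the atom count, masses, and random velocities are hypothetical.

```python
import math
import random

K_B = 0.0083144626  # Boltzmann constant in kJ mol^-1 K^-1


def temperature(masses, velocities):
    """Instantaneous temperature from the kinetic energy, with 3N degrees of freedom."""
    twice_kinetic = sum(
        m * (vx * vx + vy * vy + vz * vz)
        for m, (vx, vy, vz) in zip(masses, velocities)
    )
    return twice_kinetic / (3 * len(masses) * K_B)


def rescale(masses, velocities, target_temperature):
    """Scale all velocities by one factor so that the temperature hits the target."""
    factor = math.sqrt(target_temperature / temperature(masses, velocities))
    return [(factor * vx, factor * vy, factor * vz) for vx, vy, vz in velocities]


random.seed(0)
masses = [12.011] * 100  # hypothetical system of 100 carbon-like atoms
velocities = [tuple(random.gauss(0.0, 1.0) for _ in range(3)) for _ in range(100)]
rescaled = rescale(masses, velocities, 310.0)
temperature_after = temperature(masses, rescaled)
```

Because a single uniform factor is applied, the relative velocities are preserved while the kinetic energy is brought to the value corresponding to 310 K.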
In the MD production run, the Nosé-Hoover thermostat and the Parrinello-Rahman barostat are chosen for the simulations. Both add an extra degree of freedom to the system to regulate the temperature and pressure gradually, rather than abruptly as with the velocity-rescaling thermostat and the Berendsen barostat. This is also how realistic systems behave, which makes the Nosé-Hoover thermostat and the Parrinello-Rahman barostat accurate and efficient methods for MD simulation in the isothermal-isobaric ensemble. The total simulation time of this procedure is 2 µs, with a time-step of 2 fs.
2.3.2 Periodic Boundary Conditions
The periodic boundary conditions (PBC) open the boundary of the system by mimicking infinitely many copies of the primitive system, which surround and interact with the primitive system (Figure 2.1). When a component of the system passes through the boundary of the box, it is put back on the opposite side of the box. This idea is equivalent to the description that a part of the system that is lost is compensated by the identical part of a neighboring image entering from the opposite direction.
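The wrapping rule described above, together with the closely related minimum-image convention for distances between periodic images, can be sketched for one coordinate as follows. The box edge of 14 nm is the value used in this work; the function names `wrap` and `minimum_image` are hypothetical, and a cubic box is assumed.

```python
def wrap(coordinate, box_edge):
    """Put a coordinate that left the box back on the opposite side."""
    return coordinate % box_edge


def minimum_image(delta, box_edge):
    """Shortest separation between two particles along one axis under PBC."""
    return delta - box_edge * round(delta / box_edge)


BOX_EDGE = 14.0  # nm, cubic box
wrapped_right = wrap(14.3, BOX_EDGE)         # left through the upper face
wrapped_left = wrap(-0.5, BOX_EDGE)          # left through the lower face
nearest = minimum_image(13.0, BOX_EDGE)      # the closest image lies across the boundary
```

A particle leaving through one face re-enters through the opposite one, and two particles that look 13 nm apart inside the box are only 1 nm apart once the nearest periodic image is considered.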
MD simulations are typically run using PBC to reduce boundary effects and to simulate the presence of a bulk environment, provided the box is big enough. In this work, the box is cubic with an edge of 14 nm to guarantee that the RBD-ACE2 complex and its periodic images are far enough apart, at least 3 nm from each other, to prevent unwanted electrostatic interactions between them. Because the electrostatic screening length at 150 mM NaCl concentration is around 7 Å, this 3 nm separation is more than enough to avoid the finite size effect caused by long-range electrostatic interactions between proteins in nearby simulation boxes. To deal with the long-range electrostatic interactions, the Particle Mesh Ewald (PME) method is used with a cutoff length of 1.2 nm. The cutoff length of the van der Waals interaction is also 1.2 nm.
FIGURE 2.1: A 2-dimensional PBC view along the z-axis direction of the 6VW1 system. The primitive system is surrounded by and interacts with its images.
FIGURE 2.2: A typical snapshot of the 6M0J system after being simulated for 800 ns, showing the arrangement of RBD and ACE2 fluctuating in water.
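The order of magnitude of the electrostatic screening length quoted above can be checked with the Debye length of a 1:1 electrolyte. The sketch below uses the standard Debye-Hückel expression; the relative permittivity of water at 310 K is taken as 74, which is an assumed round value, so the result should be read as an estimate only.

```python
import math

EPSILON_0 = 8.8541878128e-12      # vacuum permittivity, F/m
K_BOLTZMANN = 1.380649e-23        # J/K
ELEMENTARY_CHARGE = 1.602176634e-19  # C
AVOGADRO = 6.02214076e23          # 1/mol


def debye_length_nm(concentration_molar, temperature_kelvin, relative_permittivity):
    """Debye screening length of a monovalent (1:1) salt solution, in nm."""
    ionic_strength = concentration_molar * 1000.0  # mol/L -> mol/m^3
    numerator = relative_permittivity * EPSILON_0 * K_BOLTZMANN * temperature_kelvin
    denominator = 2.0 * AVOGADRO * ELEMENTARY_CHARGE ** 2 * ionic_strength
    return math.sqrt(numerator / denominator) * 1.0e9


screening_length = debye_length_nm(0.150, 310.0, 74.0)  # assumed permittivity of 74
separation_in_lengths = 3.0 / screening_length
```

With these assumptions the screening length comes out just under 0.8 nm, so a 3 nm gap between periodic images corresponds to several screening lengths.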
Chapter 3 ANALYSES METHODS
3.1 Sequence Alignment
Sequence alignment is a method of arranging two or more genome sequences in order to achieve maximum similarity [24]. These sequences may be interleaved with gaps at possible locations to form identical or similar columns. The term "sequence alignment" refers to the act of constructing this arrangement, or of identifying the best potential arrangements in a database of unique sequences.
In bioinformatics, this method is often used to study the evolution of sequences from a common ancestor, especially biological sequences such as protein, DNA, or RNA sequences. Mismatches in the alignment correspond to mutations, and gaps correspond to insertions or deletions. In this work, sequence alignment is applied to the viral RBDs of both the SARS-CoV virus and the SARS-CoV-2 viruses to elucidate their common features as well as the viral mutations accumulated over more than a decade. From the point of view of a biophysicist, the mutations of SARS-CoV-2 make some considerable changes in the protein backbone. These changes would make the protein backbone more rigid or more flexible, causing significant changes in the way the virus binds to the human ACE2.
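Once two sequences have been aligned, the interpretation given above (mismatches as mutations, gaps as insertions or deletions) can be turned into simple counts. The sketch below does this for two already-aligned strings; it does not perform the alignment itself, which in this work is done by ClustalW, and the two short sequences are made-up examples rather than real RBD fragments.

```python
def compare_aligned(seq_a, seq_b, gap="-"):
    """Count matching columns, mismatches (mutations), and gap columns.

    Both sequences must already be aligned, i.e. have the same length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = gaps = 0
    for residue_a, residue_b in zip(seq_a, seq_b):
        if residue_a == gap or residue_b == gap:
            gaps += 1          # insertion or deletion
        elif residue_a == residue_b:
            matches += 1       # conserved position
        else:
            mismatches += 1    # point mutation
    return matches, mismatches, gaps


# Hypothetical aligned fragments (one-letter amino acid codes).
counts = compare_aligned("ACDE-FGH", "ACNEQFGH")
identity = counts[0] / len("ACDE-FGH")
```

For these toy fragments the comparison finds six conserved positions, one mutation, and one gap.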
3.2 Root Mean Square Deviation
The root mean square deviation (RMSD) is a common measure of the displacement of a group of atoms in a system configuration at a particular time with respect to a reference system configuration.
Assume that a group of $N$ atoms is examined, with atoms labeled from 1 to $N$. $\mathbf{r}_i(t)$ is the position of atom $i$ at some time $t$ in the simulation, the reference position of atom $i$ is denoted by $\mathbf{r}_i^{ref}$, and $m_i$ is the mass of atom $i$. The RMSD at the time $t$ is calculated as
\[ \mathrm{RMSD}(t) = \left[ \frac{1}{M} \sum_{i=1}^{N} m_i \left| \mathbf{r}_i(t) - \mathbf{r}_i^{ref} \right|^2 \right]^{1/2} \tag{3.1} \]
where $M = \sum_{i=1}^{N} m_i$ is the total mass of the atom group. Commonly, $\mathbf{r}_i^{ref}$ is the position of atom $i$ in the initial configuration of the system (the configuration at the time $t = 0$).
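A direct transcription of the mass-weighted RMSD defined above might look like the following sketch. It assumes that the two configurations are already expressed in the same frame, so no translational or rotational fitting is performed here; the coordinates and masses are hypothetical.

```python
import math


def rmsd(coordinates, reference, masses):
    """Mass-weighted RMSD between a configuration and a reference configuration."""
    total_mass = sum(masses)
    weighted_sum = sum(
        m * sum((c - r) ** 2 for c, r in zip(position, ref_position))
        for m, position, ref_position in zip(masses, coordinates, reference)
    )
    return math.sqrt(weighted_sum / total_mass)


# Hypothetical three-atom group, rigidly shifted by 1 along x.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
shifted = [(x + 1.0, y, z) for x, y, z in reference]
masses = [12.011, 14.007, 15.999]
rmsd_value = rmsd(shifted, reference, masses)
```

A rigid shift of every atom by the same distance gives an RMSD equal to that distance, independent of the masses, which is a quick sanity check of the implementation.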
3.3 Root Mean Square Fluctuation
The root mean square fluctuation (RMSF) measures the average displacement of a single atom with respect to its position in a reference system configuration over the simulation time. The RMSF of an atom $i$ is calculated as
\[ \mathrm{RMSF}(i) = \left[ \frac{1}{T} \sum_{t=1}^{T} \left| \mathbf{r}_i(t) - \mathbf{r}_i^{ref} \right|^2 \right]^{1/2} \]
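The per-atom fluctuation can be sketched in the same spirit: the squared displacement of one atom from its reference position is averaged over the frames of the trajectory. The frames below are made up, and the reference position is simply passed in by the caller; whether it is the initial or the time-averaged position is a choice that this sketch does not make.

```python
import math


def rmsf(positions, reference_position):
    """RMSF of a single atom over a list of frames."""
    n_frames = len(positions)
    squared_sum = sum(
        sum((c - r) ** 2 for c, r in zip(position, reference_position))
        for position in positions
    )
    return math.sqrt(squared_sum / n_frames)


# Hypothetical atom oscillating by +/- 1 along x around the origin.
frames = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
rmsf_value = rmsf(frames, (0.0, 0.0, 0.0))
```

Repeating this for every backbone atom yields the per-residue fluctuation profile that is discussed in the results chapter.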
... the number of atoms of the RBD-ACE2 complexes is too big (around 12500 atoms). If all atoms are used for PCA, the covariance matrix will have a size of about 37500 × 37500. The size of the covariance... PCA is a powerful tool for analyzing protein dynamics because of the big data of a large number of atoms of proteins over a long time of simulation.
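The size argument can be made explicit with a little arithmetic: each atom contributes three Cartesian coordinates, so the covariance matrix has 3N rows and columns. The sketch below computes the dimension and the memory needed to hold such a matrix in double precision; the second call uses a hypothetical subset of 1000 atoms purely for comparison and is not the selection used in this work.

```python
def covariance_size(n_atoms, bytes_per_entry=8):
    """Dimension and memory (in GB, 1 GB = 1e9 bytes) of a 3N x 3N covariance matrix."""
    dimension = 3 * n_atoms
    memory_gb = dimension * dimension * bytes_per_entry / 1.0e9
    return dimension, memory_gb


all_atoms_dim, all_atoms_gb = covariance_size(12500)   # all atoms of the complex
subset_dim, subset_gb = covariance_size(1000)          # hypothetical reduced selection
```

Storing the full matrix alone would take more than ten gigabytes, whereas a reduced selection of atoms brings this down to a few tens of megabytes.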