
Building machine-learning-based models for retention time and resolution predictions in ion pair chromatography of oligonucleotides


DOCUMENT INFORMATION

Basic information

Title: Building machine-learning-based models for retention time and resolution predictions in ion pair chromatography of oligonucleotides
Authors: Martin Enmark, Jakob Häggström, Jörgen Samuelsson, Torgny Fornstedt
Institution: Karlstad University
Field: Chemical Sciences
Type: Research article
Year of publication: 2022
City: Karlstad
Number of pages: 10
File size: 1.35 MB



Contents lists available at ScienceDirect

journal homepage: www.elsevier.com/locate/chroma

Martin Enmark, Jakob Häggström, Jörgen Samuelsson∗, Torgny Fornstedt∗

Department of Engineering and Chemical Sciences, Karlstad University, SE-651 88 Karlstad, Sweden

Article info

Article history:

Received 8 December 2021

Revised 22 March 2022

Accepted 25 March 2022

Available online 27 March 2022

Keywords:

Machine-learning

Support vector regression (SVR) model

Oligonucleotides

Ion-pair chromatography

Resolution

Abstract

Support vector regression models are created and used to predict the retention times of oligonucleotides separated using gradient ion-pair chromatography with high accuracy. The experimental dataset consisted of fully phosphorothioated oligonucleotides. Two models were trained and validated using two pseudo-orthogonal gradient modes and three gradient slopes. The results show that the spread in retention time differs between the two gradient modes, which indicates a varying degree of sequence-dependent separation. Peak widths from the experimental dataset were calculated and correlated with the guanine-cytosine content and retention time of the sequence for each gradient slope. These data were used to predict the resolution of the n – 1 impurity among 250,000 random 12- and 16-mer sequences, showing that one of the investigated gradient modes has a much higher probability of exceeding a resolution of 1.5, particularly for the 16-mer sequences. Sequences having a high guanine-cytosine content and a terminal C are more likely to not reach critical resolution. The trained SVR models can be used both to identify characteristics of different separation methods and to assist in the choice of method conditions, i.e., to optimize resolution for arbitrary sequences. The methodology presented in this study can be expected to be applicable to predicting retention times of other oligonucleotide synthesis and degradation impurities, provided enough training data are available.

© 2022 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

1 Introduction

Ion-pair chromatography (IPC) is an important technique for separating synthetic oligonucleotides, which are a class of DNA- or RNA-based molecules with widespread and well-known applications in diagnostics [1,2], research [3], and, recently, therapeutics [4,5]. Oligonucleotides used for antisense therapy [6] are typically produced using stepwise solid-phase synthesis via the β-cyanoethyl phosphoramidite method [7]. Depending on the length, sequence, and miscellaneous chemical modifications of these antisense active pharmaceutical ingredients (APIs) [8], the final synthesis product will contain a large fraction of impurities. The polymeric nature of the oligonucleotides and the many impurities challenge analytical separations, and phosphorothioated (PS) oligonucleotides are especially difficult to analyze [9–12]. In this study, we focus on the shortmer impurities with respect to the parent full-length product (FLP), and in particular on the n – 1 impurity generated by, e.g., failed coupling in the last coupling step, i.e., trityl-off.

∗ Corresponding authors.
E-mail addresses: Jorgen.Samuelsson@kau.se (J. Samuelsson), Torgny.Fornstedt@kau.se (T. Fornstedt).

Amphipathic [13] oligonucleotides are predominantly separated and analyzed using IPC [9,14,15]. The most-used stationary phase is the C18 column, typically pH-stable variants such as the XBridge C18 and other reversed-phase chemistries [11,12,15,16]. Many different combinations of ion-pairing reagents (IPRs) have been evaluated [9,15]. For the separation of PS oligonucleotides, methods using tributylammonium acetate (TBuAA) as the IPR have proven successful [11,15,17]. In this study, we use TBuAA in two previously investigated gradient modes [18]. In the aforementioned study, we showed that using the phenyl column resulted in slightly improved n – 1 selectivity compared to the C18 column in the IPR gradient mode. In the co-solvent gradient elution mode, the co-solvent fraction increases over time, while the IPR concentration typically remains constant. In the IPR gradient mode, the IPR concentration decreases over time while the co-solvent fraction remains constant. Both modes elute oligonucleotides by decreasing the apparent electrostatic potential generated by the adsorption of the IPR. We have previously shown that the IPR gradient increases the selectivity for oligonucleotide impurities of the same charge, for example phosphodiester (P=O)1 impurities of fully phosphorothioated oligonucleotides, especially using a phenyl column [18]. Other chromatographic modes not using IPRs, such as HILIC, have also been investigated for the separation of PS-modified oligonucleotides [19].

https://doi.org/10.1016/j.chroma.2022.462999

0021-9673/© 2022 The Authors. Published by Elsevier B.V.

Retention time prediction models for the IPC separation of oligonucleotides are few; noteworthy works include those of Gilar et al. [20], Studzinska and Buszewski [21], Sturm et al. [22], Liang et al. [23], and Kohlbacher et al. [24]. By contrast, such models are well established for peptides and are routinely employed, for example, in shotgun proteomics to design targeted proteomics experiments and to reduce false-positive hits in mass spectrometry analysis. The many different approaches used can roughly be divided into (i) index-based, (ii) modeling-based, and (iii) machine-learning (ML)-based methods [25]. In index-based methods, the effect of each amino acid in a sequence is estimated using the multilinear regression of a large set of peptides with known retention times [26,27]. In modeling-based methods, the physicochemical properties of the peptide are used to predict the retention times [27]. In ML-based methods, a training set of peptides is used to estimate the parameters of a predefined mathematical model; many different approaches have been used for this, such as artificial neural networks [28] and support vector regression (SVR) [29,30].

Gilar et al. developed an empirical logarithmic model (hereafter denoted as LM) to predict the retention of synthetic oligonucleotides [20]. Their modeling-based method has five input variables, i.e., the amount of each nucleotide (T, C, G, and A) as well as the total number of nucleotides in the oligonucleotide. Studzinska and Buszewski used quantitative structure–retention relationships (QSRRs) to predict the retention based on descriptors such as van der Waals surface area, solvent-accessible area, dipole moment, total energy, and hydration energy [21]. All these parameters were numerically estimated and fitted to simple functions. Neither of these methods delivers excellent predictivity. The great advantage of the LM is that it is easy to use, requires few data points for calibration, and has been shown to be rather good at predicting the retention of non-phosphorothioated oligonucleotides. However, due to the selection of descriptors, the model cannot address potential structural changes such as groove and hairpin formation, nor whether the retention depends on sequence and not just on composition. The same could also be true of the QSRR method, which shares the problems of descriptor selection and of finding accurate descriptors of more complicated molecules such as oligonucleotides. Sturm et al. used SVR for retention predictions [22], mainly using sequence-based descriptors as well as descriptors correlating to stacking energies occurring in hairpin formation. They showed that their model had better predictive power than the LM and could also predict the retention change due to hairpin formation. Since the experimental systems, including solutes, investigated by Gilar et al. and Sturm et al. are similar, it is relevant to compare both approaches for phosphorothioated oligonucleotides separated in different experimental systems. Later, Liang et al. used a similar SVR model to investigate how to optimize the selectivity in gradient elution [23]. In all the above studies, the authors investigated non-phosphorothioated oligonucleotides using triethylamine as the IPR and the co-solvent gradient mode. Given the successful use of SVR models in [22,23], we decided to investigate whether such models can also be used to predict the retention of phosphorothioated oligonucleotides eluted using tributylamine as the IPR.

The aim of this study is to build SVR IPC retention time prediction models based on the oligonucleotide sequence for two different gradient modes, i.e., the conventional co-solvent gradient and the IPR gradient modes. As training and testing solutes, around 100 heteromeric, fully phosphorothioated oligonucleotides will be used. TBuAA will be used as the IPR to reduce diastereomer separation. Finally, and most importantly, the retention time prediction models will be used to predict the probability of successfully separating the impurities from synthetic oligonucleotides and to compare the two gradient modes, (i) the co-solvent gradient and (ii) the IPR gradient, using three gradient slopes.

2 Materials and methods

2.1 Chemicals and materials

The IPRs TBuAA and triethylammonium acetate (TEtAA) were prepared from tributylamine (≥99.5%, CAS number 102-82-9) and triethylamine (≥99.5%, CAS number 121-44-8) with acetic acid (≥99.8%, CAS number 64-19-7), all purchased from Sigma-Aldrich (St. Louis, MO, USA). The mobile phases were prepared using HPLC gradient-grade acetonitrile (CAS number 75-05-8) from VWR (Radnor, PA, USA) and deionized water with a resistivity of 18.2 MΩ cm from a Milli-Q water purification system (Merck Millipore, Darmstadt, Germany). An XBridge Phenyl column, 150 × 3.0 mm, 3.5 μm, 100 Å pore size, from Waters (Milford, MA, USA) was used in all experiments. Fully phosphorothioated oligonucleotides were purchased in 0.25-μmol scale from Integrated DNA Technologies (Leuven, Belgium) and delivered desalted and lyophilized. The purchased FLP oligonucleotides were not purified before use. A list of all oligonucleotide sequences can be found in Supplementary material Table S1.

2.2 Instrumentation

Experiments were conducted on an Agilent 1260 Infinity II HPLC system (Agilent Technologies, Palo Alto, CA, USA), configured with a binary pump, a 100-μL injection loop, a diode-array UV detector, a single quadrupole MS, and a column thermostat.

2.3 Procedures

2.3.1 Selection of oligonucleotides

The first part of the dataset was selected to explore the effects of length, nucleobase composition, and sequence. It contains 8-, 12-, and 16-mer oligonucleotide sequences. These were designed in silico by first generating one million sequences of length 8, 12, and 16 by randomly picking adenine (A), thymine (T), cytosine (C), or guanine (G) at each position in the sequence. The retention time of all sequences was then calculated using the LM described by Gilar et al. [20]. This allows us to estimate the variance in retention time for each population of 8-, 12-, and 16-mers. Then, for each length, we randomly picked three sequences from each of three regions of the population: mean – 2 standard deviations, the mean, and mean + 2 standard deviations, labeled SnA, SnB, and SnC, respectively, where n = 8, 12, or 16. These sequences can be found in Supplementary material Table S1. Since the LM predicts that the contribution to retention time increases according to the nucleobase in the order C < G < A < T, the base composition of the sequences varies from high proportions of guanine-cytosine content (GC-content) in the SnA sequences to high proportions of A and T in the SnC sequences. The second part of the dataset was selected to test whether the secondary oligonucleotide structure influences the retention time. The 16-mer sequences referred to as reference hairpin (RHA) and model hairpin (MHA) by Stellwagen et al. [31] were selected; Stellwagen et al. investigated the effect of monovalent cations on the thermal stability of MHA, as measured by capillary electrophoresis. According to that work, MHA should contain more than 10% hairpin structures at 50 °C, at least in a solution containing 100 mM tetrabutylammonium, no organic solvent, and a high amount of other background electrolytes. They also found that the DNA melting point decreases with increasing lipophilicity of the IPR [31]. In our study, we therefore included permutated variants of RHA and MHA that minimize hairpin formation (i.e., RHB and MHB). Finally, a sequence mimicking the MALAT-1 transcript targeting ASO described by Nilsson et al. [32] was included in the dataset. The 8-, 12-, and 16-mer sequences synthesized are hereafter referred to as FLPs of length n.

Table 1
Summary of experimental gradient conditions.

Gradient mode         Parameter                  G1     G2     G3
Co-solvent gradient   Initial MeCN (v%)          38
                      Slope (v% MeCN min⁻¹)      2.22   1.23   0.81
IPR gradient          Initial TEtAA (mM)         0.1
                      Slope (mM TEtAA min⁻¹)     0.32   0.16   0.08
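The in silico design step of Section 2.3.1 (generate random sequences, score them, then draw sequences near the mean and near the mean ± 2 standard deviations) can be sketched as follows. This is only an illustration: the LM equation of Gilar et al. is not reproduced in this text, so `toy_score` and `TOY_WEIGHTS` are hypothetical stand-ins that merely respect the reported ordering C < G < A < T, and the population size and tolerance window are arbitrary assumptions.

```python
import random
import statistics

# Hypothetical per-base weights: NOT the LM coefficients of Gilar et al.
# They only reproduce the reported ordering C < G < A < T.
TOY_WEIGHTS = {"C": 1.0, "G": 2.0, "A": 3.0, "T": 4.0}


def toy_score(seq):
    """Stand-in retention score: sum of the per-base weights."""
    return sum(TOY_WEIGHTS[base] for base in seq)


def random_sequences(length, count, rng):
    """Draw `count` sequences, picking A, C, G, or T uniformly at each position."""
    return ["".join(rng.choice("ACGT") for _ in range(length)) for _ in range(count)]


def pick_near(seqs, scores, target, k, rng, tol):
    """Randomly pick k sequences whose score lies within `tol` of `target`."""
    pool = [s for s, v in zip(seqs, scores) if abs(v - target) <= tol]
    return rng.sample(pool, k)


def design(length, count=20000, k=3, seed=0):
    """Return k sequences each for groups A (mean - 2 SD), B (mean), C (mean + 2 SD)."""
    rng = random.Random(seed)
    seqs = random_sequences(length, count, rng)
    scores = [toy_score(s) for s in seqs]
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    targets = {"A": mean - 2 * sd, "B": mean, "C": mean + 2 * sd}
    # The half-SD tolerance window is an arbitrary choice for this sketch.
    return {label: pick_near(seqs, scores, t, k, rng, tol=0.5 * sd)
            for label, t in targets.items()}


if __name__ == "__main__":
    for label, group in design(12).items():
        print(label, group)
```

With a real retention model in place of `toy_score`, the same three-region sampling yields the SnA, SnB, and SnC groups.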

2.3.2 Experimental

All samples were prepared by dissolving the lyophilized oligonucleotides by vortexing them in deionized water prepared using a Milli-Q water purification system (Merck Millipore). The stock concentration was 1 mg mL⁻¹ and the injection concentration was 0.2 mg mL⁻¹; 3 μL of this solution was injected onto the column. Mobile phases were prepared by weight using the density of water and acetonitrile (MeCN) at room temperature. For the co-solvent gradient experiments, 10 and 80 v% MeCN solutions were prepared, while for the IPR concentration gradient experiments, two 41.5 v% solutions were made. During stirring, acetic acid was added followed by tributylamine (to both eluents for co-solvent gradient experiments) or tributylamine or triethylamine separately for IPR concentration gradient experiments. All mobile phases were stirred for at least 12 hours before use to ensure that all IPR molecules were fully dissolved. Before use, the $^{s}_{w}$pH of all mobile phases (solvent/water) was determined using a pH electrode calibrated in aqueous buffer. The measured pH value of the mobile phase ranged between 7 and 8 depending on the mobile phase composition: 7 at low concentration of MeCN and 8 at high concentration of MeCN. All experiments were performed using still-air column temperature control at 50 °C. The flow rate was 0.5 mL min⁻¹, which provided sufficiently good MS signals, i.e., good enough nebulization in the spray chamber. Three gradient slopes were evaluated for each of the two gradient methods, and their details can be found in Table 1. A re-equilibration time of about three column volumes was used after the end of each gradient. A 0.01 mg mL⁻¹ sample of uracil was prepared in deionized water and used as the void volume marker.
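As a small worked illustration of the gradient conditions in Table 1, the programmed mobile-phase composition of a linear gradient is the initial value plus slope times time. The sketch below assumes that the three slopes correspond to G1, G2, and G3 in decreasing order (the text identifies G3 as the shallowest) and ignores dwell volume and gradient end points; the names `CO_SOLVENT`, `IPR`, and `composition` are illustrative only.

```python
# Values from Table 1. The G1-G3 labels are assumed to follow decreasing slope.
CO_SOLVENT = {"initial": 38.0,  # v% MeCN
              "slopes": {"G1": 2.22, "G2": 1.23, "G3": 0.81}}  # v% MeCN per min
IPR = {"initial": 0.1,  # mM TEtAA
       "slopes": {"G1": 0.32, "G2": 0.16, "G3": 0.08}}  # mM TEtAA per min


def composition(program, gradient, t_min):
    """Programmed composition t_min minutes after gradient start (linear gradient)."""
    return program["initial"] + program["slopes"][gradient] * t_min


def time_to_reach(program, gradient, target):
    """Minutes needed for the programmed composition to reach `target`."""
    return (target - program["initial"]) / program["slopes"][gradient]


if __name__ == "__main__":
    for name in ("G1", "G2", "G3"):
        print(name,
              round(composition(CO_SOLVENT, name, 10.0), 2), "v% MeCN after 10 min;",
              round(composition(IPR, name, 10.0), 2), "mM TEtAA after 10 min")
```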

The UV signal was recorded at 260 nm. Mass spectrometry analysis was performed using negative polarity in API-ES ionization mode. More details of the mass spectrometry settings can be found in Roussis et al. [33]. Retention times were obtained from both UV and MS signals. The retention time of the full-length sequence was determined from the peak apex of the UV signal. Retention times of shortmer impurity sequences were obtained by the selective ion monitoring of charge states 3 and 4. For the 8-mer samples, retention times for n = 8, 7, 6, 5, and 4 were obtained in a single injection, whereas for the 16-mer samples, retention times for n = 16, …, 12 and 11, …, 8 were obtained in two separate injections. This allowed the repeatability of experiments to be monitored. Retention times were adjusted for the additional dwell volume introduced by the tubing to the MS. To determine the correct time for samples having overlapping m/z values for different charge states, it was assumed that the retention time of the (n – x)-mer was always less than that of the n-mer.
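The two bookkeeping rules in the paragraph above (shifting MS retention times by the transit time through the extra tubing, and rejecting candidate peaks that elute after the longer species) can be sketched as below. The tubing volume is not given in the text, so `extra_volume_ml` is a hypothetical input; only the 0.5 mL min⁻¹ flow rate comes from the source.

```python
FLOW_ML_MIN = 0.5  # flow rate stated in Section 2.3.2


def correct_ms_time(t_ms_min, extra_volume_ml, flow_ml_min=FLOW_ML_MIN):
    """Subtract the transit time through the extra tubing volume to the MS.

    extra_volume_ml is a hypothetical value; the text does not report it.
    """
    return t_ms_min - extra_volume_ml / flow_ml_min


def admissible_peaks(candidate_times, t_longer_min):
    """Keep only candidate peaks eluting before the longer (n-mer) species,
    following the assumption that an (n - x)-mer always elutes earlier."""
    return [t for t in candidate_times if t < t_longer_min]


if __name__ == "__main__":
    print(correct_ms_time(10.0, extra_volume_ml=0.05))
    print(admissible_peaks([8.2, 9.7, 10.4], t_longer_min=10.0))
```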

Regarding the amount of data used: in the IPR gradient experiments, retention times were collected and determined for 98 unique sequences at all gradient slopes; in the co-solvent gradient experiments, retention times were determined for 96 sequences at the G1 and G2 gradient slopes and for 91 sequences at the G3 gradient slope. A list of all oligonucleotide sequences as well as their retention times can be found in Supplementary material Table S1. The peak widths were obtained from the n – 1, n – 2, and n – 3 peaks by first interpolating the actual peak and then determining the corresponding width at half height.

3 Calculations

All general computations were performed using Python with the NumPy supporting libraries, and all graphics were generated using Matplotlib.

The first step in finding an ML model is processing the data. Our dataset consists of the output data, i.e., the retention times, and the corresponding oligonucleotides, represented by strings of different combinations of A, T, G, and C, serving as input data. Since ML models require numerical input, the oligonucleotides must be encoded. In our implementation, we encoded the oligonucleotides in terms of different frequencies based on their primary and secondary structural properties, as described by Sturm et al. [22]. These features were divided into groups, as done by Sturm et al., where COUNT contains the frequency of each nucleotide in the sequence, CONTACT contains the frequencies of all possible dinucleotides in terms of their order (e.g., the numbers of CG, CA, CT, CC, etc. occurring in the sequence), SCONTACT contains the frequencies of all dinucleotides, disregarding their order (e.g., the numbers of CG + GC, CA + AC, CC, etc.), and finally HAIRPIN contains the numbers of stem, loop, and free bases [22]. The secondary structure of the sequences was calculated using the seqfold module [34], assuming a temperature of 50 °C.
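The three sequence-only feature groups described above can be sketched with the standard library alone. This is a minimal illustration, not the authors' code: it uses raw counts (the text speaks of both "frequencies" and "numbers", so normalization by length is left open), and the HAIRPIN group is omitted because it depends on a secondary-structure prediction.

```python
from itertools import combinations_with_replacement, product

BASES = "ACGT"


def count_features(seq):
    """COUNT: number of each nucleotide in the sequence."""
    return {base: seq.count(base) for base in BASES}


def contact_features(seq):
    """CONTACT: number of each ordered dinucleotide (16 features)."""
    feats = {"".join(pair): 0 for pair in product(BASES, repeat=2)}
    for i in range(len(seq) - 1):
        feats[seq[i:i + 2]] += 1
    return feats


def scontact_features(seq):
    """SCONTACT: dinucleotide counts disregarding order (10 features),
    e.g. CG and GC are both counted under the key 'CG'."""
    feats = {"".join(pair): 0 for pair in combinations_with_replacement(BASES, 2)}
    for i in range(len(seq) - 1):
        feats["".join(sorted(seq[i:i + 2]))] += 1
    return feats


def encode(seq, groups=("count",)):
    """Concatenate the selected feature groups into one numeric vector."""
    builders = {"count": count_features,
                "contact": contact_features,
                "scontact": scontact_features}
    vector = []
    for name in groups:
        vector.extend(builders[name](seq).values())
    return vector


if __name__ == "__main__":
    print(encode("ACGACCGGGCGGAGTC", groups=("count", "scontact")))
```

Passing different `groups` tuples to `encode` gives the feature-group combinations that are compared during model selection.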

The next step in the search for a model was the training, followed by finding the best-performing features and hyperparameters. This was done by performing a nested cross-validation, the purpose of which was to estimate how well the model responded to new data and thereby reduce the risk of model overfitting. First, the dataset was split into k subsets. Then, one subset was omitted from the training to act as validation data (1/3 of all data), while the rest of the dataset was used for training (2/3 of all data). The chosen training set was then further split into n subsets, and the same procedure was repeated. This approach is visualized in Fig. 1. The model that performed best on average in the inner cross-validation was chosen to be tested on the outer validation set. The result was then evaluated based on the average performance in the outer validation; the main metric used in this implementation was the root mean squared error (RMSE).
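The splitting scheme just described can be made concrete with a short index-only sketch. It shows only how the outer and inner folds nest, with three folds at each level; the function names are illustrative and no model is fitted here.

```python
import random


def kfold_indices(indices, k, rng):
    """Shuffle the indices and yield (kept, held_out) pairs for k folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        kept = [j for f_idx, fold in enumerate(folds) if f_idx != i for j in fold]
        yield kept, folds[i]


def nested_splits(n_samples, k_outer=3, k_inner=3, seed=0):
    """Yield (train, validation, inner_splits) for every outer fold.

    inner_splits re-splits only the outer training indices, so the outer
    validation samples are never seen during hyperparameter selection.
    """
    rng = random.Random(seed)
    for train, validation in kfold_indices(range(n_samples), k_outer, rng):
        inner = list(kfold_indices(train, k_inner, rng))
        yield train, validation, inner


if __name__ == "__main__":
    for train, validation, inner in nested_splits(98):
        print(len(train), len(validation), [len(v) for _, v in inner])
```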

This procedure was performed for each sub-dataset, where every unique combination of the described feature groups was evaluated. The inner cross-validation was done using GridSearchCV from the sklearn ML library, which performs a k-fold cross-validation for a given model (SVR) and lists of hyperparameters (regularization parameter C, epsilon tube ε, and kernel coefficient γ). Once GridSearchCV had fitted each combination of hyperparameters, the best-performing model was chosen and further evaluated on the outer validation set, which was randomly split using the sklearn class KFold [35]. The number of folds in both the outer and inner cross-validations was chosen to be three. Furthermore, results might vary due to the stochastic nature of the algorithms when performing a fit and due to the randomized split of the datasets, so the process was performed another three times to reduce the variance of the results. As a comparison, the LM developed by Gilar et al. (equation 7 in [20]) was fitted to each sub-dataset. A nonlinear least-squares regression was performed to find the optimal weights by using the lmfit module [36]. The LM requires no hyperparameter optimization and was therefore only evaluated on the outer validation split. When the best-performing features were found, a final training was done using these features on two thirds of the dataset to visualize the results in plots. In addition, the models that were trained on the whole dataset were saved for later use.

Fig. 1. Flowchart showing the steps required to train an SVR model to predict retention times.
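A minimal sketch of the nested cross-validation described in Section 3, using scikit-learn's `GridSearchCV`, `KFold`, and `SVR`, is given below. Several points are assumptions rather than details from the source: the RBF kernel (inferred from the mention of a kernel coefficient γ), the hyperparameter grid values, and above all the data, which are synthetic count-like features with a made-up linear response and not the experimental retention times.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVR


def nested_cv_rmse(X, y, param_grid, n_outer=3, n_inner=3, seed=0):
    """Outer k-fold for validation, inner grid search for hyperparameters.

    Returns the mean outer-fold RMSE and the list of per-fold RMSEs.
    """
    outer = KFold(n_splits=n_outer, shuffle=True, random_state=seed)
    rmses = []
    for train_idx, val_idx in outer.split(X):
        inner = KFold(n_splits=n_inner, shuffle=True, random_state=seed)
        search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=inner,
                              scoring="neg_root_mean_squared_error")
        search.fit(X[train_idx], y[train_idx])
        # predict() uses the best estimator refitted on the full outer training set.
        pred = search.predict(X[val_idx])
        rmses.append(float(np.sqrt(np.mean((pred - y[val_idx]) ** 2))))
    return float(np.mean(rmses)), rmses


# Synthetic stand-in data: four count-like features and an invented response.
rng = np.random.default_rng(0)
X = rng.integers(0, 9, size=(90, 4)).astype(float)
y = X @ np.array([0.2, 0.3, 0.4, 0.5]) + rng.normal(0.0, 0.05, size=90)

# Illustrative grid only; the source does not list the values that were searched.
grid = {"C": [1, 10, 100], "epsilon": [0.01, 0.1], "gamma": [0.01, 0.1]}
mean_rmse, fold_rmses = nested_cv_rmse(X, y, grid)

if __name__ == "__main__":
    print("mean outer RMSE:", mean_rmse)
```

Repeating the call with different `seed` values corresponds to rerunning the whole procedure to reduce the variance caused by the random splits.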

To evaluate the characteristics of the SVR model, we generated 250,000 unique random sequences with n = 12 and 16. We then calculated their retention times and fitted them using a normal distribution. The peak width at half height (w0.5,i) was assumed to be described by the GC-content (sum of fractions of C and G) of the sequence and its retention time plus a constant. The solution to the resulting linear matrix equation (Supplementary material S4) was determined using the least-squares method. The half-height widths of the UV trace of the FLP and the mass traces of the n – 1 to n – 7-mers of the 16-mer FLPs in the dataset, as well as of the n – 1 to n – 3-mers of the 12-mer FLPs, were used as input.
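The peak-width model described above (half-height width as a linear function of GC-content and retention time plus a constant) can be fitted by ordinary least squares. The exact matrix equation is in Supplementary material S4, which is not reproduced here, so the sketch below assumes the simplest form w = a·GC + b·tR + c; the function names are illustrative.

```python
import numpy as np


def gc_content(seq):
    """Sum of the fractions of G and C in the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def fit_width_model(gc, t_r, widths):
    """Least-squares fit of w = a*GC + b*tR + c; returns the array (a, b, c)."""
    gc = np.asarray(gc, dtype=float)
    t_r = np.asarray(t_r, dtype=float)
    design = np.column_stack([gc, t_r, np.ones(len(gc))])
    coef, *_ = np.linalg.lstsq(design, np.asarray(widths, dtype=float), rcond=None)
    return coef


def predict_width(coef, gc, t_r):
    """Predicted half-height width for a given GC-content and retention time."""
    return coef[0] * gc + coef[1] * t_r + coef[2]
```

One such fit would be made per gradient mode and slope, and `predict_width` then supplies the widths needed for resolution estimates.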

The SVR model can be downloaded from the Supplementary material.

4 Results and discussion

The shortmer population (n – 1, n – 2, …, n – n + 1) constitutes the largest number of impurities generated by the solid-phase synthesis. Successful separation and quantification of the individual shortmers are necessary for the quality control of APIs. Generally, the separation of the n – 1-mer is the most relevant and most challenging problem. Therefore, it is beneficial to have a tool that can assist in the selection of chromatographic methods and the corresponding conditions necessary to achieve critical resolution of the pair, here defined as ≥ 1.5. In Section 4.1, we present experimental retention data obtained using two methods for three gradient slopes and discuss the characteristics of the two systems. The determined retention data are then used to train ML models, whose performance and characteristics are discussed in Section 4.2. Finally, in Section 4.3, we use the ML model to estimate the probability of resolving an arbitrary oligonucleotide from its n – 1 impurity. We also demonstrate how the choice of elution method, conditions, and sequence characteristics affects the probability of success.
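To make the critical-resolution criterion concrete, the sketch below computes the resolution of a peak pair from retention times and half-height widths and then the fraction of pairs that reach the critical value of 1.5. It assumes the common half-height expression Rs = 1.18·(tR,2 – tR,1)/(w0.5,1 + w0.5,2); the text does not state which resolution formula was used, and the numbers in the usage example are invented.

```python
def resolution_half_height(t1, w1, t2, w2):
    """Resolution from retention times and half-height widths (same time units).

    Uses the common expression Rs = 1.18 * |t2 - t1| / (w1 + w2), which is an
    assumption here; the source does not give its formula.
    """
    return 1.18 * abs(t2 - t1) / (w1 + w2)


def fraction_resolved(pairs, critical=1.5):
    """Fraction of (t1, w1, t2, w2) pairs with resolution >= critical."""
    values = [resolution_half_height(*pair) for pair in pairs]
    return sum(v >= critical for v in values) / len(values)


if __name__ == "__main__":
    # Invented example: an n - 1 impurity eluting 0.3 min before the FLP.
    demo = [(10.0, 0.10, 10.3, 0.10), (10.0, 0.20, 10.3, 0.20)]
    print([round(resolution_half_height(*p), 3) for p in demo])
    print(fraction_resolved(demo))
```

Applied to predicted retention times and widths for many random sequences, `fraction_resolved` gives an estimate of the probability of success for a given method.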

4.1 Retention times

The first observation for both the co-solvent gradient and the IPR gradient was that the retention times of sequences with n = 8, 12, and 16 increased with increasing proportions of A and T (samples SnA through SnC in Supplementary material Table S1). The retention time also increased with decreasing gradient slope. Very short oligonucleotides, i.e., n < 5, were only marginally affected by the gradient compared with longer sequences, i.e., n = 16, as the system dwell volume had less of an effect on strongly retained oligonucleotides. The oligonucleotide 3′-ACGACCGGGCGGAGTC-5′ (S16A) had similar retention times using either method for all three gradients, as it was used to normalize the effects of gradient slope and starting point between the methods. This normalization had the unexpected effect that the shorter oligonucleotides, i.e., the S8x and S12x samples, were eluted significantly earlier using the IPR gradient than the co-solvent gradient. Clearly, the two methods cannot be normalized for oligonucleotides of different lengths without also changing the shape of the gradient. 16-mer sequences other than S16A had different retention times in the two modes, indicating that there were different sequence-specific contributions to retention. The hairpin-forming sequence MHA had about a 0.15-min shorter retention time than its permutated sequence MHB in the co-solvent gradient system and about a 0.3-min difference in the IPR gradient system using the shallowest gradient (G3). The second hairpin-forming sequence RHA had retention almost identical to that of its permutated variant RHB in both systems at the same gradient slope.

In Fig. 2a, we can see the difference between the two gradient modes. The shortest oligonucleotides display better selectivity, i.e., a large change in the y-direction with the addition or removal of a nucleobase subunit, in the co-solvent gradient method, whereas the opposite trend holds for the longest oligonucleotides in the IPR gradient method (the larger change is in the x-direction). However, as can also be seen in Fig. 2b, the eluted peaks in the IPR gradient are wider than in the co-solvent gradient. How this affects resolution is investigated further in Section 4.3.

4.2 Machine learning model to predict retention times

The first step in finding the best ML model was to evaluate the model performance as a function of the feature groups used, i.e., count, contact, scontact, and hairpin (see Section 3 for more details about the features). Across all combinations of gradient modes and slopes, count alone gave the smallest RMSE for three out of six systems (for a summary of all models, see Supplementary material Table S2). For the remaining three systems, different combinations of features gave only marginally improved model RMSE. This result could already be anticipated from the retention data, with permutations of the strong hairpin structures MHA and RHA only marginally affecting the retention time. We therefore decided to continue using the model with only the count feature.

In the study by Sturm et al. [22], all features were found to be required to properly predict the retention times. However, this finding cannot be directly extrapolated to our study since there are two main experimental differences between the experiments conducted by Sturm et al. and by us. Firstly, they used another IPR (TEA) and, secondly, they used unmodified oligonucleotides, whereas we used TBuAA as the IPR and fully phosphorothioated oligonucleotides as solutes. As a consequence, Sturm et al. conducted their separations with much lower amounts of acetonitrile (MeCN), a 0–16% MeCN gradient, as compared to 38–70% in this study. Previously, it was shown that in separations conducted at higher amounts of MeCN, the ability of the separation system to separate charge differences is increased while its ability to separate compounds with the same charge is decreased [18]. As a result, the count feature becomes more important, and the contact and scontact features, which capture next-neighbor effects, contribute less to the model, which was also observed in our study.

Fig. 2. (a) Normalized experimental retention times obtained in the co-solvent and IPR gradients using gradient G3 (Table 1) and (b) chromatograms showing the separation of sequence MHB (Supplementary material Table S1) at gradient G3 (Table 1).

Table 2
Summary of model performance on the training and validation sets.
Columns: Gradient mode; Gradient slope; Model; RMSE, training set (min); RMSE, validation set (min); R², training set; Q², validation set.

We also compared the SVR model with the LM. The results indicated that the SVR model gave lower RMSE in all cases (see Table 2). The relative difference in RMSE between the SVR and the LM models increased with decreasing gradient slope for both gradient modes. SVR was also markedly better at accurately predicting retention times for the IPR gradient at all gradient slopes. This could be expected since the LM was developed for co-solvent gradient elution, native oligonucleotide samples, and a different IPR and stationary phase. Furthermore, the LM was developed to give a rough estimate of the amount of acetonitrile required to elute an oligonucleotide based on its length and relative proportions of nucleobases, for which it would still be useful given the current datasets. Another way to estimate the model fit is to calculate the correlation coefficients R² and Q², where R² is estimated from the training set and Q² is estimated from data not used in the training set; R² therefore estimates the goodness of fit and Q² estimates the goodness of prediction. From Table 2, we can see that: (i) R² was always greater than Q², as expected; (ii) both R² and Q² were substantially larger for the SVR model than the LM; (iii) the LM was much worse at predicting the IPR gradient than the co-solvent gradient; and (iv) the SVR model was only slightly worse at predicting the IPR gradient than the co-solvent gradient.
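The fit metrics discussed above can be computed with a few lines of standard-library Python. The sketch assumes that R² and Q² are both the coefficient of determination, 1 – SS_res/SS_tot, evaluated on the training data and on held-out data, respectively; the source does not spell out the formula, and conventions for Q² differ (for instance in which mean enters SS_tot).

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the retention times."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def determination(y_true, y_pred):
    """1 - SS_res / SS_tot, using the mean of the evaluated set for SS_tot.

    Called on training data this gives R2 (goodness of fit); called on data
    not used for training it gives Q2 (goodness of prediction).
    """
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Invented numbers, only to show the calls.
    observed = [1.0, 2.0, 3.0, 4.0]
    predicted = [1.1, 1.9, 3.2, 3.8]
    print(rmse(observed, predicted), determination(observed, predicted))
```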

Plots of predicted versus experimental retention times for the validation subset of the experimental data obtained at gradient G3 are shown in Fig. 3a and c for the co-solvent and IPR gradient modes, respectively. The validation subset shown in these plots contains one third of the sequences in the complete dataset. The corresponding box plots of the relative error for the SVR and LM models are shown in Fig. 3b and d.

Fig. 3. Experimental (tR,exp) and predicted (tR,pred) retention times in the validation dataset obtained using the SVR model (dots) or LM model (crosses) for the co-solvent gradient mode (a, b) and the IPR gradient mode (c, d), respectively. In b) and d), the relative errors of the predictions are summarized in boxplots: the line in the boxplot is the median and the whiskers are the first and third quantiles.

The characteristics of the SVR models were evaluated by calculating the retention times of 250,000 unique random 12- and 16-mers. The distribution of retention times can be found in Fig. 4. The spread of the distributions increased with increasing oligonucleotide length and decreasing gradient slope for both gradient modes, as could be expected. In general, the spread of retention times was higher for the IPR gradient mode, suggesting that the hydrophobicity of the base pairs has a larger impact in this mode. The larger variance observed for 16-mers could already be predicted from Fig. 2a. Analyzing the base composition of sequences by fitting a normal distribution to the retention data shows that, for both gradient modes, 12-mer sequences with retention times more than 1.5 standard deviations below the mean had a higher proportion of G and especially C compared with the baseline of 25% each (see Supplementary material Table S3). On the other hand, 12-mer sequences with retention times more than 1.5 standard deviations above the mean had larger than baseline (25%) proportions of A and especially T for both gradient modes. For the 16-mer sequences, the differences in base composition were less pronounced below 1.5 standard deviations for both gradient modes, but the GC-content remained above 50%. Among the most strongly retained 16-mers, over 40% of the nucleobases in the sequence are T for both modes and all gradient slopes.
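The tail analysis described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: `toy_retention` is a hypothetical stand-in for the trained SVR model, its per-base weights are made up so that A/T-rich sequences elute later, and the sample is reduced from 250,000 to 5,000 sequences to keep the example fast.

```python
import random
from collections import Counter
from statistics import mean, stdev

BASES = "ATCG"

def random_unique_sequences(n, length, rng):
    """Draw n distinct random sequences with uniform base probabilities."""
    seqs = set()
    while len(seqs) < n:
        seqs.add("".join(rng.choice(BASES) for _ in range(length)))
    return sorted(seqs)  # sorted for a reproducible order

# Hypothetical per-base contributions (minutes); NOT fitted values from the paper.
TOY_WEIGHTS = {"A": 0.60, "T": 0.70, "C": 0.40, "G": 0.45}

def toy_retention(seq):
    """Placeholder for the SVR prediction: a simple additive model."""
    return sum(TOY_WEIGHTS[base] for base in seq)

def base_fractions(seqs):
    """Percentage of each base over all positions of the given sequences."""
    counts = Counter("".join(seqs))
    total = sum(counts.values())
    return {base: 100.0 * counts[base] / total for base in BASES}

rng = random.Random(0)
seqs = random_unique_sequences(5000, 12, rng)
t_r = [toy_retention(s) for s in seqs]

# Fitting a normal distribution reduces to estimating its mean and standard deviation.
mu, sigma = mean(t_r), stdev(t_r)

# Sequences eluting more than 1.5 standard deviations before or after the mean.
early = [s for s, t in zip(seqs, t_r) if t < mu - 1.5 * sigma]
late = [s for s, t in zip(seqs, t_r) if t > mu + 1.5 * sigma]

early_comp = base_fractions(early)
late_comp = base_fractions(late)
print("early tail:", early_comp)
print("late tail: ", late_comp)
```

With a real model, `toy_retention` would be replaced by the model's prediction for each sequence; the tail selection and composition count stay the same.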

Fig. 4. Distributions of the predicted retention times for 250,000 unique 12- and 16-mer sequences (blue and orange fill, respectively) calculated using the SVR model. Subplots a)–c) show the co-solvent gradient and d)–f) the IPR gradient. Gradient slope G1 (a, d), G2 (b, e), and G3 (c, f).

4.3 Predictions of the probability of resolving the FLP from the n – 1 impurity

Of particular interest for the quality control of synthetic oligonucleotides is determining the purity of the FLP, which requires sufficient resolution (i.e., Rs > 1.5) when using UV detection. To calculate the resolution, we need accurate predictions of retention time and peak width. In addition to retention times, we therefore also investigated how the peak widths correlated with the retention times in each gradient mode; we found only a weak correlation for the co-solvent gradient but a more pronounced correlation for the IPR gradient. In both gradient modes, the peak widths increased with increasing retention time (see Supplementary material Fig. S1). The peak widths obtained in the IPR gradient mode were greater than in the co-solvent gradient mode, both in absolute terms and in having a larger sequence variance. One possible explanation is that the gradient compression experienced by each solute differed because the solutes have different sensitivities to the gradient change. Also, the effective gradient slope (G) could be different between the two gradient modes. However, since the retention time shift of sample S16A was shown to be about the same for gradient slopes G1, G2, and G3 between the two modes, a significant difference in effective gradient slope is unlikely. Another explanation could be that the peak broadening due to partial diastereomer separation was greater in the IPR gradient mode than in the co-solvent mode. This explanation is plausible since we have shown that the diastereomer separation increased at lower and constant co-solvent concentration in the IPR gradient mode as compared with co-solvent gradient elution [18]. This would explain why the peak width increased with both decreasing gradient slope and increasing retention time. We have previously shown that the diastereomer separation involving C and G was greater than that involving A and T [17] and therefore attempted to correlate the GC-content, together with the retention time and a constant, to the observed peak width. This simple correlation provides a reasonable approximation of peak width, as summarized in Supplementary material S2.
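A correlation of this kind, peak width = a·GC + b·tR + c, can be fitted by ordinary least squares. The sketch below is a minimal pure-Python illustration and makes several assumptions: the data are synthetic and noise-free, the coefficients `TRUE_A`, `TRUE_B`, and `TRUE_C` are invented, and the function names are hypothetical. The fitted coefficients reported by the authors are in their Supplementary material and are not reproduced here.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_peak_width(gc_fraction, t_r, width):
    """Least-squares fit of width = a*GC + b*tR + c via the normal equations."""
    rows = [[g, t, 1.0] for g, t in zip(gc_fraction, t_r)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * w for r, w in zip(rows, width)) for i in range(3)]
    return solve_linear_system(xtx, xty)

# Synthetic illustration data (not measurements from the study).
TRUE_A, TRUE_B, TRUE_C = 0.10, 0.01, 0.05
gc = [0.75, 0.00, 0.50, 0.50, 0.25, 0.625]   # GC fraction of each sequence
t_r = [8.2, 11.5, 9.7, 7.1, 10.4, 8.9]       # retention times, minutes
width = [TRUE_A * g + TRUE_B * t + TRUE_C for g, t in zip(gc, t_r)]

a, b, c = fit_peak_width(gc, t_r, width)
print(f"a = {a:.4f}, b = {b:.4f}, c = {c:.4f}")
```

Because the synthetic widths contain no noise, the fit recovers the invented coefficients; with measured peak widths the residuals would show how reasonable the approximation is.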

The predicted versus experimentally calculated resolutions for 12- and 16-mer samples are presented in Table 4. Except for the steepest gradient slope investigated using both gradient modes, the prediction error is less than 10%. We also observe that the absolute mean error of prediction decreases with decreasing gradient slope. Investigating the details, we see that the n – 1 impurities of samples S12A and S12C are always resolved at a resolution of more than 1.5, regardless of the investigated gradient slope or mode. For the 16-mer sequences, the critical resolution is reached at a steeper gradient slope using the IPR gradient mode compared to the co-solvent gradient mode. Interestingly, the two 12-mer samples always have higher resolution than the 16-mers using the co-solvent gradient at any gradient slope, whereas the GC-rich sample S12A has a lower resolution than 2 out of the 8 investigated 16-mers using the IPR gradient mode. This again highlights that the IPR gradient mode has a higher degree of separation based on sequence rather than length as compared to the co-solvent gradient. An accurate estimation of resolution based on sequence composition and retention time allowed us to calculate the peak widths of all 250,000 random unique 12- and 16-mers as well as their n – 1 impurities at each gradient slope.
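As a rough illustration of how a resolution follows from predicted retention times and peak widths, the sketch below uses the common baseline-width definition Rs = 2·|tR,FLP − tR,n–1| / (wFLP + wn–1). Whether the authors used baseline or half-height widths is not stated in this section, so the formula is an assumption, and all numerical values and function names are hypothetical.

```python
def resolution(t_r_flp, t_r_impurity, w_flp, w_impurity):
    """Rs = 2 * |tR,FLP - tR,n-1| / (w_FLP + w_n-1), using baseline peak widths."""
    return 2.0 * abs(t_r_flp - t_r_impurity) / (w_flp + w_impurity)

def percent_below(resolutions, critical=1.5):
    """Percentage of separations that fail to reach the critical resolution."""
    failing = sum(1 for rs in resolutions if rs < critical)
    return 100.0 * failing / len(resolutions)

# Hypothetical (tR FLP, tR n-1, width FLP, width n-1) tuples, all in minutes.
pairs = [
    (10.00, 9.60, 0.20, 0.20),   # well resolved
    (12.00, 11.80, 0.25, 0.25),  # poorly resolved
    (8.00, 7.50, 0.20, 0.20),    # well resolved
    (9.00, 8.75, 0.20, 0.20),    # just below the critical resolution
]

rs_values = [resolution(*p) for p in pairs]
print([round(rs, 2) for rs in rs_values])
print(f"{percent_below(rs_values):.0f}% of pairs have Rs < 1.5")
```

Applied to every random sequence and its n – 1-mer, the same two functions give the kind of failure percentages that a resolution distribution summarizes.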

The resulting distributions of calculated resolutions are shown in Fig. 5. For the co-solvent gradient mode, the 12/11-mer separation always has a higher resolution than does the 16/15-mer separation, regardless of the sequences. In addition, all 12-mer sequences are predicted to reach a resolution of 1.5 at all investigated gradient slopes. The resolution of the 12-mer sequences using the co-solvent gradient mode was generally similar to or slightly better than could be achieved with the IPR gradient mode. This could also be anticipated from Fig. 2a, where the selectivity between shorter oligonucleotides is greater for the co-solvent gradient than for the IPR gradient. For the 16/15-mer separation, no sequences could be separated with a resolution of at least 1.5 using the steepest co-solvent gradient investigated. At the second and third steepest gradients, i.e., G2 and G3, 42 and 28% of the random sequences, respectively, could not be separated (see Table 3). For the IPR gradient mode, the resolution distributions of the 16/15-mer and 12/11-mer separations overlap at all gradient slopes, with the overlap increasing with decreasing gradient slope. The results indicate that some 12/11-mers are more difficult to resolve than some 16/15-mers using IPR gradients. This could be expected from the experimental resolution data showing that a GC-rich 12-mer can have lower resolution than some 16-mers (Table 4). For the 16-mer FLPs, 31, 9, and 4% of all random unique sequences are expected not to reach the critical resolution of 1.5 at gradient slopes G1, G2, and G3, respectively.

Investigating the characteristics of the 16-mer sequences that do not reach a resolution of at least 1.5, we found, for the co-solvent gradient, that they had a marginally higher frequency of C, both throughout the sequence and in the 5′ terminal nucleobase, which when missing creates the n – 1-mer (see Table 3). For the IPR gradient, there was a similar but more pronounced trend. The sequences that did not reach critical resolution at G2 and G3 contained 27 and 40% C, respectively, as well as above-average A. For the 5′ terminal nucleobase, there was a 41 and 82% probability that it was a C at gradient slopes G2 and G3, respectively. This could be understood from two earlier observations: first, a sequence containing a large proportion of C will lead to a wider peak; second, the loss of a terminal C will give a smaller than average relative decrease in retention time. These effects combined lead to difficulties obtaining sufficient resolution.

Fig. 5. Distributions of the predicted n – 1 resolutions for 250,000 unique 12- and 16-mer sequences (blue and orange fill) calculated using the SVR model. Subplots a)–c) show the co-solvent gradient and d)–f) the IPR gradient. Gradient slope G1 (a, d), G2 (b, e), and G3 (c, f). Vertical dashed line at a resolution of 1.5.

Fig. 6. Experimental and simulated chromatograms of the RHB sample at the steep gradient slope G1 (a, c) and the shallow gradient slope G3 (b, d), respectively. Co-solvent gradient mode (a, b) and IPR gradient mode (c, d).

Investigating the FLPs of the experimental dataset (Supplementary material Table S1), we found that one of the sequences that did not reach the critical resolution using the IPR gradient at the steepest gradient slope G1 was the RHB sample (3′-CGCGTGGTCCTGGTCC-5′). This sequence has a composition of 37.5% C, 37.5% G, 25% T, and 0% A as well as a terminal C at the 5′ end. The experimental resolution for the n – 1-mer was calculated to be about 1.3 at G1 (see Table 4). Decreasing the gradient slope to G3 increased the resolution to about 1.9. The resolution using the co-solvent gradient was even lower: about 1 at G1 and just 1.5 at G3. Experimental and simulated chromatograms of RHB are shown in Fig. 6. The simulated peaks were constructed by generating a normal distribution with a variance calculated from the nucleobase composition and retention time. The areas of the FLP and n – 1 peaks were manually normalized by adjusting the height separately for each peak, and the peaks were then stitched together to give the final chromatogram. The retention times and peak widths of the experimental and simulated chromatograms are in good agreement, although there is a slight underestimation of the calculated resolution in the co-solvent gradient at gradient slope G1, as also indicated in Table 4.
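A simulated two-peak chromatogram of the kind described above can be sketched with Gaussian peak shapes. This is a simplified illustration and not the authors' procedure: the peaks are summed rather than stitched, the areas are set directly instead of being normalized by hand, and the retention times, standard deviations, and areas are hypothetical.

```python
import math

def gaussian_peak(t, t_r, sigma, area):
    """Gaussian peak with retention time t_r, standard deviation sigma and given area."""
    norm = area / (sigma * math.sqrt(2.0 * math.pi))
    return norm * math.exp(-0.5 * ((t - t_r) / sigma) ** 2)

def simulate_chromatogram(times, peaks):
    """Sum the Gaussian peaks, given as (t_r, sigma, area), at every time point."""
    return [sum(gaussian_peak(t, *peak) for peak in peaks) for t in times]

# Hypothetical values in minutes: an FLP and a smaller, earlier-eluting n-1 peak.
peaks = [(10.0, 0.05, 1.0), (9.8, 0.05, 0.1)]
times = [9.0 + 0.001 * i for i in range(2001)]  # 9.0 to 11.0 min
signal = simulate_chromatogram(times, peaks)

# Trapezoidal integration should recover the sum of the two peak areas.
total_area = sum(
    0.5 * (signal[i] + signal[i + 1]) * (times[i + 1] - times[i])
    for i in range(len(times) - 1)
)
t_max = times[signal.index(max(signal))]
print(f"total area = {total_area:.3f}, apex at {t_max:.2f} min")
```

In a fuller version, each standard deviation would come from the predicted peak width (for a Gaussian, the baseline width is about four standard deviations) and each retention time from the retention model.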


Table 3

Details of predicted 16-mer failure sequences (Rs < 1.5); fx is the percentage of nucleobase x.

Gradient mode Gradient slope Below critical resolution, R s < 1.5

Frequency (%) Sequence composition Terminal nucleobase composition

fA , fT , fC , fG fA fT fC fG

Table 4

Experimentally measured resolutions (Exp) vs predictions (Pred) for FLP and n – 1 using the SVR model for retention times and the linear model for peak widths, respectively.

                Co-solvent gradient                 IPR gradient
                G1          G2          G3          G1          G2          G3
Sample  5′-end  Exp   Pred  Exp   Pred  Exp   Pred  Exp   Pred  Exp   Pred  Exp   Pred
S12A    C       1.81  1.81  2.40  2.53  2.76  2.77  2.17  2.17  2.45  2.44  2.38  2.38
S12C    C       1.73  1.71  2.45  2.63  2.92  2.92  2.26  2.39  2.63  2.70  2.82  2.81
S16A    C       1.10  1.30  1.43  1.63  1.69  2.06  1.56  1.89  1.97  2.33  2.21  2.29
S16B    A       1.03  0.73  1.49  1.57  1.75  1.76  1.49  1.49  1.90  1.98  2.52  2.52
S16C    A       1.02  1.03  1.45  1.69  1.75  1.83  1.45  1.91  1.84  2.14  2.21  2.44
MALAT   G       0.86  0.61  1.23  1.48  1.51  1.50  1.27  1.59  1.62  1.72  1.92  2.15
MHA     A       1.05  0.83  1.57  1.59  1.85  1.96  1.54  1.76  1.98  2.35  2.35  2.78
MHB     A       1.14  0.83  1.65  1.59  1.93  1.96  1.85  1.76  2.31  2.35  2.74  2.78
RHA     G       1.09  0.82  1.54  1.49  1.77  1.68  1.63  1.55  2.06  2.06  2.30  2.32
RHB     C       0.98  0.92  1.32  1.34  1.53  1.51  1.29  1.31  1.68  1.66  1.90  1.91
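The error summary quoted in the text can be checked against the Table 4 values with a short script. The sketch assumes that "prediction error" means the mean absolute percentage error over the ten samples, which this section does not define explicitly, and that the six Exp/Pred pairs are ordered co-solvent G1 to G3 followed by IPR G1 to G3 (an ordering inferred from the RHB resolutions quoted in the text). The numbers are copied from Table 4; the names `TABLE4`, `COLUMNS`, and `mean_abs_pct_error` are illustrative.

```python
COLUMNS = ["co-solvent G1", "co-solvent G2", "co-solvent G3",
           "IPR G1", "IPR G2", "IPR G3"]

# Each row holds the Exp/Pred pairs of Table 4, flattened in COLUMNS order.
TABLE4 = {
    "S12A":  [1.81, 1.81, 2.40, 2.53, 2.76, 2.77, 2.17, 2.17, 2.45, 2.44, 2.38, 2.38],
    "S12C":  [1.73, 1.71, 2.45, 2.63, 2.92, 2.92, 2.26, 2.39, 2.63, 2.70, 2.82, 2.81],
    "S16A":  [1.10, 1.30, 1.43, 1.63, 1.69, 2.06, 1.56, 1.89, 1.97, 2.33, 2.21, 2.29],
    "S16B":  [1.03, 0.73, 1.49, 1.57, 1.75, 1.76, 1.49, 1.49, 1.90, 1.98, 2.52, 2.52],
    "S16C":  [1.02, 1.03, 1.45, 1.69, 1.75, 1.83, 1.45, 1.91, 1.84, 2.14, 2.21, 2.44],
    "MALAT": [0.86, 0.61, 1.23, 1.48, 1.51, 1.50, 1.27, 1.59, 1.62, 1.72, 1.92, 2.15],
    "MHA":   [1.05, 0.83, 1.57, 1.59, 1.85, 1.96, 1.54, 1.76, 1.98, 2.35, 2.35, 2.78],
    "MHB":   [1.14, 0.83, 1.65, 1.59, 1.93, 1.96, 1.85, 1.76, 2.31, 2.35, 2.74, 2.78],
    "RHA":   [1.09, 0.82, 1.54, 1.49, 1.77, 1.68, 1.63, 1.55, 2.06, 2.06, 2.30, 2.32],
    "RHB":   [0.98, 0.92, 1.32, 1.34, 1.53, 1.51, 1.29, 1.31, 1.68, 1.66, 1.90, 1.91],
}

def mean_abs_pct_error(column_index):
    """Mean of |Pred - Exp| / Exp over all samples, in percent, for one column pair."""
    errors = []
    for values in TABLE4.values():
        exp, pred = values[2 * column_index], values[2 * column_index + 1]
        errors.append(abs(pred - exp) / exp)
    return 100.0 * sum(errors) / len(errors)

errors = {name: mean_abs_pct_error(i) for i, name in enumerate(COLUMNS)}
for name, err in errors.items():
    print(f"{name}: {err:.1f}%")
```

Under these assumptions the two G1 columns come out above 10% and the remaining four below it, with the error shrinking from G1 to G3 in both modes, which matches the trend described in the text.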

5 Conclusions

This study aimed at constructing an ML model capable of predicting the retention times of phosphorothioated oligonucleotides with high accuracy. The model was shown to predict retention times with low RMSE as well as high Q² and R² for all investigated conditions. For the investigated experimental systems, the effect of secondary oligonucleotide structure was shown to be minimal, allowing us to construct a simpler model.

The ML models were used for predicting the chromatographic characteristics of 250,000 random 12- and 16-mers. It was found that the variance in retention time was higher when using the IPR gradient mode than the co-solvent gradient mode. However, a slight skewness in the distribution of retention times for a uniform distribution of A, T, G, and C indicates that the SVR model has captured sequence-specific contributions to the retention time, which could indicate the presence of next-neighbor effects. Sequences containing high proportions of C and G gave the shortest retention times, whereas high proportions of A and T gave the longest retention times in both gradient modes.

Finally, the resolution of each of the 250,000 random sequences from its n – 1-mer was calculated using the retention time from the ML model and the peak width from the linear combination of oligonucleotide GC-content and retention time. Results indicate that the co-solvent gradient mode can be expected to easily resolve all 12-mer sequences from the 11-mers, typically with greater resolution than the IPR gradient. On the other hand, the probability of successfully resolving longer 16-mer sequences from 15-mers was significantly higher using the IPR gradient mode. For both methods, decreasing the gradient slope increased the probability of achieving critical resolution. Among the 16-mers that still could not be resolved using the IPR gradient mode, the frequency of C was very high, particularly at the terminal nucleobase.

The ML models constructed in this study could help select the appropriate gradient mode and gradient slope that would lead to successful separation before performing an experiment. The models could be expanded to account for retention shifts introduced by other oligonucleotide modifications such as 3′-MOE, methyl-C, or LNAs if sufficient data is provided. They could also cover other impurities related to the FLP, for example (P=O) or abasic impurities, if trained with such retention data. Other chromatographic systems, including other column chemistries, particle sizes, temperatures, and mobile phases, could also be added to give an even greater number of possible systems to choose from. The methodology could also be used to optimize the method run time in silico before running experiments.

Availability

Implementations and code used in this study can be found at: https://github.com/jakobhaggstrom/JCA-21-1579

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Martin Enmark: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data curation, Writing – original draft, Writing – review & editing, Visualization, Supervision. Jakob Häggström: Methodology, Software, Formal analysis, Investigation, Data curation, Writing – original draft. Jörgen Samuelsson: Conceptualization, Validation, Writing – original draft, Writing – review & editing, Supervision. Torgny Fornstedt: Conceptualization, Writing – review & editing, Supervision, Project administration, Funding acquisition.


This work was supported by the Swedish Knowledge Foundation via the project “Improved Methods for Process and Quality Controls using Digital Tools” (grant number 20210021) and by the Swedish Research Council (VR) via the project “Fundamental Studies on Molecular Interactions aimed at Preparative Separations and Biospecific Measurements” (grant number 2015-04627).

Supplementary materials

Supplementary material associated with this article can be found, in the online version, at doi: 10.1016/j.chroma.2022.462999.

References

[1] S. Yang, R.E. Rothman, PCR-based diagnostics for infectious diseases: uses, limitations, and future applications in acute-care settings, Lancet Infect. Dis. 4 (2004) 337–348, doi: 10.1016/S1473-3099(04)01044-8.

[2] L. Becherer, N. Borst, M. Bakheit, S. Frischmann, et al., Loop-mediated isothermal amplification (LAMP) – review and classification of methods for sequence-specific detection, Anal. Methods 12 (2020) 717–746, doi: 10.1039/C9AY02246E.

[3] M.J. Heller, DNA Microarray Technology: Devices, Systems, and Applications, Annu. Rev. Biomed. Eng. 4 (2002) 129–153, doi: 10.1146/annurev.bioeng.4.020702.153438.

[4] W. Yin, M. Rogge, Targeting RNA: A Transformative Therapeutic Strategy, Clin. Translat. Sci. 12 (2019) 98–112, doi: 10.1111/cts.12624.

[5] T.C. Roberts, R. Langer, M.J.A. Wood, Advances in oligonucleotide drug delivery, Nat. Rev. Drug Discovery 19 (2020) 673–694, doi: 10.1038/s41573-020-0075-7.

[6] C.F. Bennett, E.E. Swayze, RNA targeting therapeutics: molecular mechanisms of antisense oligonucleotides as a therapeutic platform, Annu. Rev. Pharmacol. Toxicol. 50 (2010) 259–293, doi: 10.1146/annurev.pharmtox.010909.105654.

[7] E. Paredes, V. Aduda, K.L. Ackley, H. Cramer, 6.11 - Manufacturing of Oligonucleotides, in: S. Chackalamannil, D. Rotella, S.E. Ward (Eds.), Comprehensive Medicinal Chemistry III, Elsevier, Oxford, 2017, pp. 233–279, doi: 10.1016/B978-0-12-409547-2.12423-0.

[8] S. Benizri, A. Gissot, A. Martin, B. Vialet, et al., Bioconjugated oligonucleotides: recent developments and therapeutic applications, Bioconjugate Chem. 30 (2019) 366–383, doi: 10.1021/acs.bioconjchem.8b00761.

[9] N.M. El Zahar, N. Magdy, A.M. El-Kosasy, M.G. Bartlett, Chromatographic approaches for the characterization and quality control of therapeutic oligonucleotide impurities, Biomed. Chromatogr. 32 (2018), doi: 10.1002/bmc.4088.

[10] D. Capaldi, A. Teasdale, S. Henry, N. Akhtar, et al., Impurities in Oligonucleotide Drug Substances and Drug Products, Nucleic Acid Ther. 27 (2017) 309–322, doi: 10.1089/nat.2017.0691.

[11] M. Enmark, J. Bagge, J. Samuelsson, L. Thunberg, et al., Analytical and preparative separation of phosphorothioated oligonucleotides: columns and ion-pair reagents, Anal. Bioanal. Chem. 412 (2020) 299–309, doi: 10.1007/s00216-019-02236-9.

[12] S.G. Roussis, M. Pearce, C. Rentel, Small alkyl amines as ion-pair reagents for the separation of positional isomers of impurities in phosphate diester oligonucleotides, J. Chromatogr. A 1594 (2019) 105–111, doi: 10.1016/j.chroma.2019.02.026.

[13] S.T. Crooke, J.L. Witztum, C.F. Bennett, B.F. Baker, RNA-Targeted Therapeutics, Cell Metab. 27 (2018) 714–739, doi: 10.1016/j.cmet.2018.03.004.

[14] M. Catani, C.D. Luca, J.M.G. Alcântara, N. Manfredini, et al., Oligonucleotides: current trends and innovative applications in the synthesis, characterization, and purification, Biotechnol. J. (2022) 1900226, doi: 10.1002/biot.201900226.

[15] A. Goyon, P. Yehl, K. Zhang, Characterization of therapeutic oligonucleotides by liquid chromatography, J. Pharm. Biomed. Anal. 182 (2020) 113105, doi: 10.1016/j.jpba.2020.113105.

[16] S. Studzińska, S. Bocian, L. Siecińska, B. Buszewski, Application of phenyl-based stationary phases for the study of retention and separation of oligonucleotides, J. Chromatogr. B 1060 (2017) 36–43, doi: 10.1016/j.jchromb.2017.05.033.

[17] M. Enmark, M. Rova, J. Samuelsson, E. Örnskov, et al., Investigation of factors influencing the separation of diastereomers of phosphorothioated oligonucleotides, Anal. Bioanal. Chem. 411 (2019) 3383–3394, doi: 10.1007/s00216-019-01813-2.

[18] M. Enmark, S. Harun, J. Samuelsson, E. Örnskov, et al., Selectivity limits of and opportunities for ion pair chromatographic separation of oligonucleotides, J. Chromatogr. A 1651 (2021) 462269, doi: 10.1016/j.chroma.2021.462269.

[19] A. Demelenne, M.-J. Gou, G. Nys, C. Parulski, et al., Evaluation of hydrophilic interaction liquid chromatography, capillary zone electrophoresis and drift tube ion-mobility quadrupole time of flight mass spectrometry for the characterization of phosphodiester and phosphorothioate oligonucleotides, J. Chromatogr. A 1614 (2020) 460716, doi: 10.1016/j.chroma.2019.460716.

[20] M. Gilar, K.J. Fountain, Y. Budman, U.D. Neue, et al., Ion-pair reversed-phase high-performance liquid chromatography analysis of oligonucleotides: Retention prediction, J. Chromatogr. A 958 (2002) 167–182, doi: 10.1016/S0021-9673(02)00306-0.

[21] S. Studzińska, B. Buszewski, Different approaches to quantitative structure–retention relationships in the prediction of oligonucleotide retention, J. Sep. Sci. 38 (2015) 2076–2084, doi: 10.1002/jssc.201401395.

[22] M. Sturm, S. Quinten, C.G. Huber, O. Kohlbacher, A statistical learning approach to the modeling of chromatographic retention of oligonucleotides incorporating sequence and secondary structure data, Nucleic Acids Res. 35 (2007) 4195–4202, doi: 10.1093/nar/gkm338.

[23] C. Liang, J.-Q. Qiao, H.-Z. Lian, A novel strategy for retention prediction of nucleic acids with their sequence information in ion-pair reversed phase liquid chromatography, Talanta 185 (2018) 592–601, doi: 10.1016/j.talanta.2018.04.030.

[24] O. Kohlbacher, S. Quinten, M. Sturm, B.M. Mayr, et al., Structure–Activity Relationships in Chromatography: Retention Prediction of Oligonucleotides with Support Vector Regression, Angew. Chem. Int. Ed. 45 (2006) 7009–7012, doi: 10.1002/anie.200602561.

[25] L. Moruz, L. Käll, Peptide retention time prediction, Mass Spec. Rev. 36 (2017) 615–623, doi: 10.1002/mas.21488.

[26] M. Gilar, A. Jaworski, P. Olivova, J.C. Gebler, Peptide retention prediction applied to proteomic data analysis, Rapid Commun. Mass Spectrom. 21 (2007) 2813–2821, doi: 10.1002/rcm.3150.

[27] O.V. Krokhin, R. Craig, V. Spicer, W. Ens, et al., An improved model for prediction of retention times of tryptic peptides in ion pair reversed-phase HPLC: its application to protein peptide mapping by off-line HPLC-MALDI MS, Mol. Cell. Proteomics 3 (2004) 908–919, doi: 10.1074/mcp.M400031-MCP200.

[28] K. Petritis, L.J. Kangas, P.L. Ferguson, G.A. Anderson, et al., Use of Artificial Neural Networks for the Accurate Prediction of Peptide Liquid Chromatography Elution Times in Proteome Analyses, Anal. Chem. 75 (2003) 1039–1048, doi: 10.1021/ac0205154.

[29] A.A. Klammer, X. Yi, M.J. MacCoss, W.S. Noble, Improving Tandem Mass Spectrum Identification Using Peptide Retention Time Prediction across Diverse Chromatography Conditions, Anal. Chem. 79 (2007) 6111–6118, doi: 10.1021/ac070262k.

[30] J. Samuelsson, F.F. Eiriksson, D. Åsberg, M. Thorsteinsdóttir, et al., Determining gradient conditions for peptide purification in RPLC with machine-learning-based retention time predictions, J. Chromatogr. A 1598 (2019) 92–100, doi: 10.1016/j.chroma.2019.03.043.

[31] E. Stellwagen, J.M. Muse, N.C. Stellwagen, Monovalent Cation Size and DNA Conformational Stability, Biochemistry 50 (2011) 3084–3094, doi: 10.1021/bi1015524.

[32] J.R. Nilsson, T. Baladi, A. Gallud, D. Baždarević, et al., Fluorescent base analogues in gapmers enable stealth labeling of antisense oligonucleotide therapeutics, Sci. Rep. 11 (2021) 11365, doi: 10.1038/s41598-021-90629-1.

[33] S.G. Roussis, C. Koch, D. Capaldi, C. Rentel, Rapid oligonucleotide drug impurity determination by direct spectral comparison of ion-pair reversed-phase high-performance liquid chromatography electrospray ionization mass spectrometry data, Rapid Commun. Mass Spectrom. 32 (2018) 1099–1106, doi: 10.1002/rcm.8125.

[34] J. Timmons, leshane, Lattice-Automation/seqfold 0.7.7, Zenodo (2021), doi: 10.5281/zenodo.4579886.

[35] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, et al., Scikit-learn: machine learning in python, J. Mach. Learn. Res. 12 (2011) 2825–2830.

[36] M. Newville, T. Stensitzki, D.B. Allen, A. Ingargiola, LMFIT: non-linear least-square minimization and curve-fitting for python, Zenodo (2014), doi: 10.5281/zenodo.11813.
