Support vector regression models are created and used to predict the retention times of oligonucleotides separated using gradient ion-pair chromatography with high accuracy. The experimental dataset consisted of fully phosphorothioated oligonucleotides. Two models were trained and validated using two pseudoorthogonal gradient modes and three gradient slopes.
Contents lists available at ScienceDirect
journal homepage: www.elsevier.com/locate/chroma
Martin Enmark, Jakob Häggström, Jörgen Samuelsson∗, Torgny Fornstedt∗
Department of Engineering and Chemical Sciences, Karlstad University, SE-651 88 Karlstad, Sweden
Article info
Article history:
Received 8 December 2021
Revised 22 March 2022
Accepted 25 March 2022
Available online 27 March 2022
Keywords:
Machine-learning
Support vector regression (SVR) model
Oligonucleotides
Ion-pair chromatography
Resolution
Abstract
Support vector regression models are created and used to predict the retention times of oligonucleotides separated using gradient ion-pair chromatography with high accuracy. The experimental dataset consisted of fully phosphorothioated oligonucleotides. Two models were trained and validated using two pseudo-orthogonal gradient modes and three gradient slopes. The results show that the spread in retention time differs between the two gradient modes, which indicates a varying degree of sequence-dependent separation. Peak widths from the experimental dataset were calculated and correlated with the guanine-cytosine content and retention time of the sequence for each gradient slope. These data were used to predict the resolution of the n – 1 impurity among 250,000 random 12- and 16-mer sequences, showing that one of the investigated gradient modes has a much higher probability of exceeding a resolution of 1.5, particularly for the 16-mer sequences. Sequences having a high guanine-cytosine content and a terminal C are more likely to not reach critical resolution. The trained SVR models can be used both to identify characteristics of different separation methods and to assist in the choice of method conditions, i.e., to optimize resolution for arbitrary sequences. The methodology presented in this study can be expected to be applicable to predicting retention times of other oligonucleotide synthesis and degradation impurities, provided that enough training data are available.
© 2022 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
1 Introduction
Ion-pair chromatography (IPC) is an important technique for separating synthetic oligonucleotides, which are a class of DNA- or RNA-based molecules with widespread and well-known applications in diagnostics [1,2], research [3], and, recently, therapeutics [4,5]. Oligonucleotides used for antisense therapy [6] are typically produced using stepwise solid-phase synthesis via the β-cyanoethyl phosphoramidite method [7]. Depending on the length, sequence, and miscellaneous chemical modifications of these antisense active pharmaceutical ingredients (APIs) [8], the final synthesis product will contain a large fraction of impurities. The polymeric nature of the oligonucleotides and the many impurities challenge analytical separations, and phosphorothioated (PS) oligonucleotides are especially difficult to analyze [9–12]. In this study, we focus on the shortmer impurities with respect to the parent full-length product (FLP), and in particular on the n – 1 impurity generated by, e.g., failed coupling in the last coupling step, i.e., trityl-off.
∗ Corresponding authors.
E-mail addresses: Jorgen.Samuelsson@kau.se (J. Samuelsson), Torgny.Fornstedt@kau.se (T. Fornstedt).
Amphipathic [13] oligonucleotides are predominantly separated and analyzed using IPC [9,14,15]. The most-used stationary phase is the C18 column, typically pH-stable variants such as the XBridge C18 and other reversed-phase chemistries [11,12,15,16]. Many different combinations of ion-pairing reagents (IPRs) have been evaluated [9,15]. For the separation of PS oligonucleotides, methods using tributylammonium acetate (TBuAA) as the IPR have proven successful [11,15,17]. In this study, we use TBuAA in two previously investigated gradient modes [18]. In that earlier study, we showed that using the phenyl column resulted in slightly improved n – 1 selectivity compared with the C18 column in the IPR gradient mode. In the co-solvent gradient elution mode, the co-solvent fraction increases over time, while the IPR concentration typically remains constant. In the IPR gradient mode, the IPR concentration decreases over time while the co-solvent fraction remains constant. Both modes elute oligonucleotides by decreasing the apparent electrostatic potential generated by the adsorption of the IPR. We have previously shown that the IPR gradient increases the selectivity for oligonucleotide impurities of the same charge, for example phosphodiester (P=O)1 impurities of fully phosphorothioated oligonucleotides, especially using a phenyl column [18]. Other chromatographic modes not using IPRs, such as HILIC, have also been investigated for the separation of PS-modified oligonucleotides [19].
https://doi.org/10.1016/j.chroma.2022.462999
Retention time prediction models for the IPC separation of oligonucleotides are few; noteworthy works include those of Gilar et al. [20], Studzinska and Buszewski [21], Sturm et al. [22], Liang et al. [23], and Kohlbacher et al. [24]. Such models are well established for peptides, where they are routinely employed, for example, in shotgun proteomics to design targeted proteomics experiments and to reduce false-positive hits in mass spectrometry analysis. The many different approaches used can roughly be divided into (i) index-based, (ii) modeling-based, and (iii) machine-learning (ML)-based methods [25]. In index-based methods, the effect of each amino acid in a sequence is estimated using the multilinear regression of a large set of peptides with known retention times [26,27]. In modeling-based methods, the physicochemical properties of the peptide are used to predict the retention times [27]. In ML-based methods, a training set of peptides is used to estimate the parameters of a predefined mathematical model; many different approaches have been used for this, such as artificial neural networks [28] and support vector regression (SVR) [29,30].
Gilar et al. have developed an empirical logarithmic model (hereafter denoted as LM) to predict the retention of synthetic oligonucleotides [20]. Their modeling-based method has five input variables, i.e., the amount of each nucleotide (T, C, G, and A) as well as the total number of nucleotides in the oligo. Studzinska and Buszewski used quantitative structure–retention relationships (QSRRs) to predict the retention based on descriptors such as van der Waals surface area, solvent-accessible area, dipole moment, total energy, and hydration energy [21]. All these parameters were numerically estimated and fitted to simple functions. Neither of these methods delivers excellent predictivity. The great advantage of the LM model is that it is easy to use, requires few data points for calibration, and has been shown to be rather good for predicting the retention of non-phosphorothioated oligonucleotides. However, due to the selection of descriptors, the model cannot address potential structural changes such as groove and hairpin formation, nor whether the retention depends on sequence and not just on composition. The same could also be true of the QSRR method, which shares the problems of descriptor selection and of finding accurate descriptors of more complicated molecules such as oligonucleotides. Sturm et al. used SVR for retention predictions [22], mainly using sequence-based descriptors as well as descriptors correlating to stacking energies occurring in hairpin formation. Sturm et al. showed that their model had better predictive power than the LM model and could also predict the retention change due to hairpin formation. Since the experimental systems, including solutes, investigated by Gilar et al. and Sturm et al. are similar, it is relevant to compare both approaches for phosphorothioated oligonucleotides separated in different experimental systems. Later, Liang et al. used a similar SVR model to investigate how to optimize the selectivity in gradient elution [23]. In all the above studies, the authors investigated non-phosphorothioated oligonucleotides using triethylamine as the IPR in the co-solvent gradient mode. Given the successful use of SVR models in [22,23], we decided to investigate whether such models can also be used to predict the retention of phosphorothioated oligonucleotides eluted using tributylamine as the IPR.
The aim of this study is to build SVR IPC retention time prediction models based on the oligonucleotide sequence for two different gradient modes, i.e., the conventional co-solvent gradient and the IPR gradient modes. As training and testing solutes, around 100 heteromeric, fully phosphorothioated oligonucleotides will be used. As the IPR, TBuAA will be used to reduce diastereomer separation. Finally, and most importantly, the retention time prediction models will be used to predict the probability of successfully separating the impurities from synthetic oligonucleotides and to compare the two gradient modes, (i) the co-solvent gradient and (ii) the IPR gradient, using three gradient slopes.
2 Materials and methods
2.1 Chemicals and materials
The IPRs TBuAA and triethylammonium acetate (TEtAA) were prepared from tributylamine (≥99.5%, CAS number 102-82-9) and triethylamine (≥99.5%, CAS number 121-44-8) with acetic acid (≥99.8%, CAS number 64-19-7), all purchased from Sigma-Aldrich (St. Louis, MO, USA). The mobile phases were prepared using HPLC gradient-grade acetonitrile (CAS number 75-05-8) from VWR (Radnor, PA, USA) and deionized water with a resistivity of 18.2 MΩ cm from a Milli-Q water purification system (Merck Millipore, Darmstadt, Germany). An XBridge Phenyl column, 150 × 3.0 mm, 3.5 μm, 100 Å pore size, from Waters (Milford, MA, USA) was used in all experiments. Fully phosphorothioated oligonucleotides were purchased in 0.25-μmol scale from Integrated DNA Technologies (Leuven, Belgium) and delivered desalted and lyophilized. The purchased FLP oligonucleotides were not purified before use. A list of all oligonucleotide sequences can be found in Supplementary material Table S1.
2.2 Instrumentation
Experiments were conducted on an Agilent 1260 Infinity II HPLC system (Agilent Technologies, Palo Alto, CA, USA), configured with a binary pump, a 100-μL injection loop, a diode-array UV detector, a single quadrupole MS, and a column thermostat.
2.3 Procedures
2.3.1 Selection of oligonucleotides
The first part of the dataset was selected to explore the effects of length, nucleobase composition, and sequence. It contains three different 8-, 12-, and 16-mer oligonucleotide sequences. These were designed in silico by first generating one million sequences each of length 8, 12, and 16 by randomly picking adenine (A), thymine (T), cytosine (C), or guanine (G) at each position in the sequence. The retention time of all sequences was then calculated using the LM model described by Gilar et al. [20]. This allowed us to estimate the variance in retention time for each population of 8-, 12-, and 16-mers. Then, for each length, we randomly picked three sequences, one each at the population mean – 2 standard deviations, the mean, and the mean + 2 standard deviations, labeled SnA, SnB, and SnC, respectively, where n = 8, 12, or 16. These sequences can be found in Supplementary material Table S1. Since the LM predicts that the contribution to retention time increases according to the nucleobase in the order C < G < A < T, the base composition of the sequences varies from high proportions of guanine-cytosine content (GC-content) in the SnA sequences to high proportions of A and T in the SnC sequences. The second part of the dataset was selected to test whether the secondary oligonucleotide structure influences the retention time. The 16-mer sequences referred to as reference hairpin (RHA) and model hairpin (MHA) by Stellwagen et al. [31] were selected; Stellwagen et al. investigated the effect of monovalent cations on the thermal stability of MHA, as measured by capillary electrophoresis. In this case, MHA should contain more than 10% hairpin structures at 50 °C, at least in a solution containing 100 mM tetrabutylammonium, no organic solvent, and high amounts of other background electrolytes. They also found that the DNA melting point decreases with increasing lipophilicity of the IPR [31]. In our study, we therefore included permutated variants of RHA and MHA that minimize hairpin formation (i.e., RHB and MHB). Finally, a sequence mimicking the MALAT-1 transcript targeting ASO described by Nilsson et al. [32] was included in the dataset. The 8-, 12-, and 16-mer sequences synthesized are hereafter referred to as FLPs of length n.

Table 1
Summary of experimental gradient conditions.

Gradient mode         Parameter                  G1      G2      G3
Co-solvent gradient   Initial MeCN (v%)          38 (all slopes)
                      Slope (v% MeCN min–1)      2.22    1.23    0.81
IPR gradient          Initial TEtAA (mM)         0.1 (all slopes)
                      Slope (mM TEtAA min–1)     0.32    0.16    0.08
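The in silico selection described in this section can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the per-base weights in `BASE_SCORE` are hypothetical placeholders that only respect the ordering C < G < A < T stated in the text (they are not the parameters of the model of Gilar et al.), the population is much smaller than one million, and the sequence closest to each target is taken rather than a random pick near it.

```python
import random
import statistics

# HYPOTHETICAL per-base weights: only the ordering C < G < A < T comes from
# the text; the numbers are placeholders, not the LM parameters.
BASE_SCORE = {"C": 1.0, "G": 2.0, "A": 3.0, "T": 4.0}


def random_sequence(length, rng):
    """Draw A, T, C, or G uniformly at each position."""
    return "".join(rng.choice("ATCG") for _ in range(length))


def surrogate_retention(seq):
    """Composition-only stand-in for a retention model (an assumption)."""
    return sum(BASE_SCORE[base] for base in seq)


def pick_representatives(length, n_sequences=20_000, seed=0):
    """Pick sequences near mean - 2 SD, the mean, and mean + 2 SD."""
    rng = random.Random(seed)
    population = [random_sequence(length, rng) for _ in range(n_sequences)]
    scores = [surrogate_retention(s) for s in population]
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    targets = (("A", mean - 2 * sd), ("B", mean), ("C", mean + 2 * sd))
    picks = {}
    for label, target in targets:
        # Take the sequence whose score lies closest to the target value.
        idx = min(range(n_sequences), key=lambda i: abs(scores[i] - target))
        picks[f"S{length}{label}"] = population[idx]
    return picks


if __name__ == "__main__":
    for n in (8, 12, 16):
        print(pick_representatives(n))
```

With any weights that follow the stated ordering, the "A" pick tends to be rich in G and C and the "C" pick rich in A and T, which mirrors the composition trend described for the SnA and SnC sequences.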
2.3.2 Experimental
All samples were prepared by dissolving the lyophilized oligonucleotides by vortexing them in deionized water prepared using a Milli-Q water purification system (Merck Millipore). The stock concentration was 1 mg mL–1 and the injection concentration was 0.2 mg mL–1; 3 μL of this solution was injected onto the column. Mobile phases were prepared by weight using the density of water and acetonitrile (MeCN) at room temperature. For the co-solvent gradient experiments, 10 and 80 v% MeCN solutions were prepared, while for the IPR concentration gradient experiments, two 41.5 v% solutions were made. During stirring, acetic acid was added followed by tributylamine (to both eluents for co-solvent gradient experiments) or by tributylamine or triethylamine separately for IPR concentration gradient experiments. All mobile phases were stirred for at least 12 hours before use to ensure that all IPR molecules were fully dissolved. Before use, the $^{s}_{w}$pH of all mobile phases (solvent/water) was determined using a pH electrode calibrated in aqueous buffer. The measured pH value of the mobile phase ranged between 7 and 8 depending on the mobile phase composition: 7 at low concentration of MeCN and 8 at high concentration of MeCN. All experiments were performed using still-air column temperature control at 50 °C. The flow rate was 0.5 mL min–1, which provided sufficiently good MS signals, i.e., good enough nebulization in the spray chamber. Three gradient slopes were evaluated for each of the two gradient methods, and their details can be found in Table 1. A re-equilibration time of about three column volumes was used after the end of each gradient. A 0.01 mg mL–1 sample of uracil was prepared in deionized water and used as the void volume marker.
The UV signal was recorded at 260 nm. Mass spectrometry analysis was performed using negative polarity in API-ES ionization mode. More details of the mass spectrometry settings can be found in Roussis et al. [33]. Retention times were obtained from both UV and MS signals. The retention time of the full-length sequence was determined from the peak apex of the UV signal. Retention times of shortmer impurity sequences were obtained by selected ion monitoring of charge states 3 and 4. For the 8-mer samples, retention times for n = 8, 7, 6, 5, and 4 were obtained in a single injection, whereas for the 16-mer samples, retention times for n = 16, …, 12 and n = 11, …, 8 were obtained in two separate injections. This allowed the repeatability of experiments to be monitored. Retention times were adjusted for the additional dwell volume introduced by the tubing to the MS. To determine the correct time for samples having overlapping m/z values for different charge states, it was assumed that the retention time of the (n – x)-mer was always less than that of the n-mer.
In total, retention times for 98 unique sequences were determined for all gradient slopes in the IPR gradient experiments; in the co-solvent gradient experiments, retention times were determined for 96 sequences at the G1 and G2 gradient slopes and for 91 sequences at the G3 gradient slope. A list of all oligonucleotide sequences as well as their retention times can be found in Supplementary material Table S1. The peak widths were obtained from the n – 1, n – 2, and n – 3 peaks by first interpolating the actual peak and then determining the corresponding width at half height.
3 Calculations
All general computations were performed using Python with the NumPy supporting library, and all graphics were generated using Matplotlib.
The first step in finding an ML model is processing the data. Our dataset consists of the output data, i.e., the retention times, and the corresponding oligonucleotides, represented by strings of different combinations of A, T, G, and C, serving as input data. Since ML models require numerical input, the oligonucleotides must be encoded. In our implementation, we encoded the oligonucleotides in terms of different frequencies based on their primary and secondary structural properties, as described by Sturm et al. [22]. These different features were divided into groups, as done by Sturm et al., where COUNT contains the frequency of each nucleotide in the sequence, CONTACT contains the frequencies of all possible dinucleotides in terms of their order (e.g., the numbers of CG, CA, CT, CC, etc. occurring in the sequence), SCONTACT contains the frequencies of all dinucleotides, disregarding their order (e.g., the numbers of CG + GC, CA + AC, CC, etc.), and finally HAIRPIN contains the numbers of stem, loop, and free bases [22]. The secondary structure of the sequences was calculated using the seqfold module [34], assuming a temperature of 50 °C.
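The feature groups described above can be illustrated with a short sketch. It rests on several assumptions: features are raw counts rather than normalized frequencies, the function names are illustrative, and the HAIRPIN group is computed from a dot-bracket string supplied by the caller (for instance from a folding tool), with "loop" taken to mean unpaired bases enclosed by a stem and "free" all other unpaired bases. The exact definitions used by Sturm et al. may differ.

```python
from itertools import combinations_with_replacement, product

BASES = "ACGT"  # alphabetical, so sorted dinucleotide keys match below


def count_features(seq):
    """COUNT: number of occurrences of each nucleotide."""
    return {base: seq.count(base) for base in BASES}


def contact_features(seq):
    """CONTACT: ordered dinucleotide counts (16 features)."""
    feats = {a + b: 0 for a, b in product(BASES, repeat=2)}
    for i in range(len(seq) - 1):
        feats[seq[i:i + 2]] += 1
    return feats


def scontact_features(seq):
    """SCONTACT: dinucleotide counts ignoring order (10 features)."""
    feats = {a + b: 0 for a, b in combinations_with_replacement(BASES, 2)}
    for i in range(len(seq) - 1):
        feats["".join(sorted(seq[i:i + 2]))] += 1
    return feats


def hairpin_features(dot_bracket):
    """HAIRPIN: stem, loop, and free base counts from a dot-bracket string.

    Assumption: '.' inside at least one open bracket counts as loop,
    '.' outside all brackets counts as free.
    """
    stem = loop = free = depth = 0
    for symbol in dot_bracket:
        if symbol == "(":
            depth += 1
            stem += 1
        elif symbol == ")":
            depth -= 1
            stem += 1
        elif depth > 0:
            loop += 1
        else:
            free += 1
    return {"stem": stem, "loop": loop, "free": free}


if __name__ == "__main__":
    s16a = "ACGACCGGGCGGAGTC"  # sequence S16A as quoted in the text
    print(count_features(s16a))
    print(scontact_features(s16a))
```

Concatenating the values of the chosen groups in a fixed key order gives the numerical input vector for one sequence.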
The next step in the search for a model was the training, followed by finding the best-performing features and hyperparameters. This was done by performing a nested cross-validation, the purpose of which was to estimate how well the model responded to new data and to reduce the risk of model overfitting. First, the dataset was split into k subsets. Then, one subset was omitted from the training to act as validation data (1/3 of all data), while the rest of the dataset was used for training (2/3 of all data). The chosen training set was then further split into n subsets, and the same procedure as described before was repeated. This approach is visualized in Fig. 1. The model that performed best on average in the inner cross-validation was chosen to be tested on the outer validation set. The result was then evaluated based on the average performance in the outer validation; the main metric used in this implementation was the root mean squared error (RMSE).
This procedure was performed for each sub-dataset, where every unique combination of the described feature groups was evaluated. The inner cross-validation was done using GridSearchCV from the sklearn ML library, which performs a k-fold cross-validation for a given model (SVR) and lists of hyperparameters (regularization parameter C, epsilon tube ε, and kernel coefficient γ). Once GridSearchCV had fitted each combination of hyperparameters, the best-performing model was chosen and further evaluated on the outer validation set, which was randomly split using the sklearn KFold splitter [35]. The number of folds in both the outer and inner cross-validations was chosen to be three. Furthermore, results might vary due to the stochastic nature of the algorithms when performing a fit and due to the randomized split of the datasets, so the process was performed another three times to reduce the variance of the results. As a comparison, the LM model developed by Gilar et al. (equation 7 in [20]) was fitted to each sub-dataset. A nonlinear least-squares regression was performed to find the optimal weights by using the lmfit module [36]. The LM requires no hyperparameter optimization and was therefore only evaluated on the outer validation split. When the best-performing features were found, a final training was done using the best-performing features on two thirds of the dataset to visualize the results in plots. Also, the models that were trained on the whole dataset were saved for later use.

Fig. 1. Flowchart showing the steps required to train an SVR model to predict retention times.
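A nested cross-validation of the kind described here might look roughly like the sketch below. It is not the authors' script: the RBF kernel, the hyperparameter grid, the scoring string, and the synthetic count-like data are all assumptions made only so the example runs on its own.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVR


def nested_cv_rmse(X, y, param_grid, n_outer=3, n_inner=3, seed=0):
    """Return the mean outer-fold RMSE and the per-fold RMSE values."""
    outer = KFold(n_splits=n_outer, shuffle=True, random_state=seed)
    fold_rmse = []
    for train_idx, val_idx in outer.split(X):
        # Inner loop: grid search with its own k-fold split of the
        # outer training data only.
        inner = KFold(n_splits=n_inner, shuffle=True, random_state=seed)
        search = GridSearchCV(
            SVR(kernel="rbf"),  # kernel choice is an assumption
            param_grid,
            scoring="neg_root_mean_squared_error",
            cv=inner,
        )
        search.fit(X[train_idx], y[train_idx])
        # The refitted best model is scored on the held-out outer fold.
        pred = search.predict(X[val_idx])
        fold_rmse.append(float(np.sqrt(np.mean((pred - y[val_idx]) ** 2))))
    return float(np.mean(fold_rmse)), fold_rmse


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for COUNT features of 16-mers and retention times.
    X = rng.multinomial(16, [0.25] * 4, size=90).astype(float)
    y = X @ np.array([0.1, 0.2, 0.3, 0.4]) + rng.normal(0.0, 0.05, 90)
    grid = {"C": [1, 10, 100], "epsilon": [0.01, 0.1], "gamma": ["scale", 0.01]}
    print(nested_cv_rmse(X, y, grid))
```

Repeating the call with different `seed` values and averaging corresponds to the repeated runs mentioned in the text for reducing the variance of the estimate.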
To evaluate the characteristics of the SVR model, we generated 250,000 unique random sequences with n = 12 and 16. We then calculated their retention times and fitted them using a normal distribution. The peak width at half height (w0.5,i) was assumed to be described by the GC-content (sum of fractions of C and G) of the sequence and its retention time plus a constant. The solution to the resulting linear matrix equation (Supplementary material S4) was determined using the least-squares method. The half-height widths of the UV trace of the FLP and of the mass traces of the n – 1 to n – 7-mers of the 16-mer FLPs in the dataset, as well as of the n – 1 to n – 3-mers of the 12-mer FLPs, were used as input.
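The peak-width correlation described above amounts to an ordinary linear least-squares problem. The sketch below assumes the form w0.5 = a·GC + b·tR + c, which is how the text reads; the actual matrix equation is in the Supplementary material and may be arranged differently. The function names and the numbers in the example are invented for illustration.

```python
import numpy as np


def fit_peak_width(gc_fraction, t_r, w_half):
    """Least-squares fit of w_half = a*GC + b*t_R + c; returns (a, b, c)."""
    gc_fraction = np.asarray(gc_fraction, dtype=float)
    t_r = np.asarray(t_r, dtype=float)
    w_half = np.asarray(w_half, dtype=float)
    # Design matrix with one column per term of the assumed model.
    design = np.column_stack([gc_fraction, t_r, np.ones_like(t_r)])
    coef, *_ = np.linalg.lstsq(design, w_half, rcond=None)
    return coef


def predict_peak_width(coef, gc_fraction, t_r):
    """Evaluate the fitted linear model for new sequences."""
    a, b, c = coef
    return a * np.asarray(gc_fraction, dtype=float) + b * np.asarray(t_r, dtype=float) + c


if __name__ == "__main__":
    # Invented example values (minutes), not data from the study.
    gc = [0.25, 0.50, 0.75, 0.40, 0.60]
    tr = [8.0, 9.5, 11.0, 10.2, 9.0]
    w = [0.10, 0.13, 0.17, 0.14, 0.13]
    coef = fit_peak_width(gc, tr, w)
    print(coef, predict_peak_width(coef, [0.5], [10.0]))
```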
The SVR model can be downloaded from the Supplementary material.
4 Results and discussion
The shortmer population (n – 1, n – 2, …, n – n + 1) constitutes the largest number of impurities generated by the solid-phase synthesis. Successful separation and quantification of the individual shortmers are necessary for the quality control of APIs. Generally, the separation of the n – 1-mer is the most relevant and most challenging problem. Therefore, it is beneficial to have a tool that can assist in the selection of chromatographic methods and the corresponding conditions necessary to achieve critical resolution of the pair, here defined as ≥ 1.5. In Section 4.1, we will present experimental retention data obtained using two methods for three gradient slopes and discuss the characteristics of the two systems. The determined retention data will then be used to train ML models, whose performance and characteristics will be discussed in Section 4.2. Finally, in Section 4.3, we will use the ML model to estimate the probability of resolving an arbitrary oligonucleotide from its n – 1 impurity. We will also demonstrate how the choice of elution method, conditions, and sequence characteristics affects the probability of success.
4.1 Retention times
The first observation for both the co-solvent gradient and the IPR gradient was that the retention times of sequences with n = 8, 12, and 16 increased with increasing proportions of A and T (samples SnA through SnC in Supplementary material Table S1). The retention time also increased with decreasing gradient slope. Very short oligonucleotides, i.e., n < 5, were only marginally affected by the gradient compared with longer sequences, i.e., n = 16, as the system dwell volume had less of an effect on strongly retained oligonucleotides. The oligonucleotide 3′-ACGACCGGGCGGAGTC-5′ (S16A) had similar retention times using either method for all three gradients, as it was used to normalize the effects of gradient slope and starting point between the methods. This normalization had the unexpected effect that the shorter oligonucleotides, i.e., the S8x and S12x samples, were eluted significantly earlier using the IPR gradient than the co-solvent gradient. Clearly, the two methods cannot be normalized for oligonucleotides of different lengths without also changing the shape of the gradient. 16-mer sequences other than S16A had different retention times in the two modes, indicating that there were different sequence-specific contributions to retention. The hairpin-forming sequence MHA had about a 0.15-min shorter retention time than did its permutated sequence MHB in the co-solvent gradient system and about a 0.3-min shorter retention time in the IPR gradient system using the shallowest gradient (G3). The second hairpin-forming sequence RHA had retention almost identical to that of its permutated variant RHB in both systems at the same gradient slope.
In Fig. 2a, we can see the difference between the two gradient modes. The shortest oligonucleotides display better selectivity, i.e., a large change in the y-direction with the addition or removal of a nucleobase subunit, in the co-solvent gradient method, whereas the opposite trend holds for the longest oligonucleotides in the IPR gradient method (the larger change is in the x-direction). However, as can also be seen in Fig. 2b, the eluted peaks in the IPR gradient are wider than in the co-solvent gradient. How this affects resolution is investigated further in Section 4.3.
4.2 Machine learning model to predict retention times
The first step in finding the best ML model was to evaluate the model performance as a function of the feature groups included, i.e., count, contact, scontact, and hairpin (see Section 3 for more details about the features). We found that, across all combinations of gradient modes and slopes, count alone gave the smallest RMSE for three out of six systems (for a summary of all models, see Supplementary material Table S2). For the remaining three systems, different combinations of features gave only marginally improved model RMSE. This result could already be anticipated from the retention data, with permutations of the strong hairpin structures MHA and RHA only marginally affecting the retention time. We therefore decided to continue using the model but with only the count feature.
In the study by Sturm et al. [22], all features were found to be required to properly predict the retention times. However, this finding cannot be directly extrapolated to our study since there are two main experimental differences between the experiments conducted by Sturm et al. and by us. Firstly, they used another IPR (TEA) and, secondly, they used unmodified oligonucleotides, whereas we used TBuAA as the IPR and fully phosphorothioated oligonucleotides as solutes. As a consequence, Sturm et al. conducted their separations with much lower amounts of acetonitrile (MeCN), a 0–16% MeCN gradient, as compared with 38–70% in this study. Previously it was shown that in separations conducted at higher amounts of MeCN, the separation system's ability to separate charge differences is increased while its ability to separate compounds with the same charge is decreased [18]. As a result, the feature count becomes more important and the features contact and scontact, which indicate next-neighbor effects, contribute less to the model, which was also observed in our study.

Fig. 2. (a) Normalized experimental retention times obtained in co-solvent and IPR gradients using gradient G3 (Table 1) and (b) chromatogram showing the separation of sequence MHB (Supplementary material Table S1) at gradient G3 (Table 1).

Table 2
Summary of model performance on the training and validation sets.
Columns: Gradient mode; Gradient slope; Model; RMSE, training set (min); RMSE, validation set (min); R2, training set; Q2, validation set.
We also compared the SVR model with the LM. The results indicated that the SVR model gave lower RMSE in all cases (see Table 2). The relative difference in RMSE between the SVR and the LM models increased with decreasing gradient slope for both gradient modes. SVR was also markedly better at accurately predicting retention times for the IPR gradient at all gradient slopes. This could be expected since the LM model was developed for co-solvent gradient elution, native oligonucleotide samples, and a different IPR and stationary phase. Furthermore, this model was developed to give a rough estimate of the amount of acetonitrile required to elute an oligonucleotide based on its length and relative proportions of nucleobases, for which it would still be useful given the current datasets. Another way to estimate the model fit is to calculate the correlation coefficients R2 and Q2, where R2 is estimated from the training set and Q2 is estimated from data not used in the training set; R2 will therefore estimate the goodness of fit and Q2 will estimate the goodness of prediction. From Table 2, we can see that (i) R2 was always greater than Q2, as expected; (ii) both R2 and Q2 were substantially larger for the SVR model than the LM model; (iii) the LM model was much worse in predicting the IPR gradient than the co-solvent gradient; and (iv) the SVR model was only slightly worse in predicting the IPR gradient than the co-solvent gradient.
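For readers who want to reproduce metrics of this kind, the following sketch computes RMSE and a coefficient of determination. It assumes the common definition 1 − SS_res/SS_tot, applied to the training data for R2 and to held-out data for Q2 (with the mean of the held-out observations as reference); the study may define Q2 slightly differently, and the numbers shown are invented.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r_squared(y_true, y_pred):
    """1 - SS_res/SS_tot.

    Called on training data this gives R2; called on data not used for
    training it gives Q2 under the assumption stated in the lead-in.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


if __name__ == "__main__":
    # Invented retention times (min), purely for illustration.
    t_exp = [8.1, 9.4, 10.2, 11.0]
    t_pred = [8.0, 9.6, 10.1, 11.3]
    print(rmse(t_exp, t_pred), r_squared(t_exp, t_pred))
```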
Plots of predicted versus experimental retention times for the validation subset of the experimental data obtained at gradient G3 are shown in Fig. 3a and c for the co-solvent and IPR gradient modes, respectively. The validation subset shown in these plots contains one third of the sequences in the complete dataset. The corresponding box plots of the relative error for the SVR and LM models are shown in Fig. 3b and d.
The characteristics of the SVR models were evaluated by calculating the retention times of 250,000 unique random 12- and 16-mers. The distribution of retention times can be found in Fig. 4. The spread of the distributions increased with increasing oligonucleotide length and decreasing gradient slope for both gradient modes, as could be expected. In general, the spread of retention times was higher for the IPR gradient mode, suggesting that the hydrophobicity of the base pairs has a larger impact in this mode. The larger variance observed for 16-mers could already be predicted from Fig. 2a. Analyzing the base composition of sequences by fitting a normal distribution to the retention data shows that, for both gradient modes, 12-mer sequences with retention times more than 1.5 standard deviations below the mean had a higher proportion of G and especially C compared with the baseline of 25% each (see Supplementary material Table S3). On the other hand, 12-mer sequences having retention times more than 1.5 standard deviations above the mean had larger than baseline (25%) proportions of A and especially T for both gradient modes. For the 16-mer sequences, the difference in base composition was less pronounced below 1.5 standard deviations for both gradient modes, but the GC-content remained above 50%. Among the most strongly retained 16-mers, over 40% of nucleobases in the sequence are T for both modes and all gradient slopes.

Fig. 3. Experimental (tR,exp) and predicted (tR,pred) retention times in the validation dataset obtained using the SVR model (dots) or LM model (crosses) for the co-solvent gradient mode (a, b) and the IPR gradient mode (c, d), respectively. In b) and d), the relative errors of the predictions are summarized in boxplots: the line in the boxplot is the median and the whiskers are the first and third quartiles.
4.3 Predictions of the probability of resolving the FLP from the n – 1 impurity
Of particular interest for the quality control of synthetic
oligonucleotides is determining the purity of the FLP, which re-
quires sufficient (i.e., Rs > 1.5) resolution when using UV detec-
tion To calculate the resolution, we need accurate predictions of
retention time and peak width In addition to retention times, we
therefore also investigated how the peak widths correlated with
the retention times in each gradient mode; we found there was
only a weak correlation for the co-solvent gradient but a more pro- nounced correlation for the IPR gradient In both gradient modes, the peak widths increased with increasing retention time (see Sup- plementary material Fig S1) The peak widths obtained in the IPR gradient mode were greater than in the co-solvent gradient mode, both in absolute terms and by having a larger sequence variance One possible explanation is the gradient compression experienced
by each solute differed because they have different sensitivity to the gradient change Also, the effective gradient slope ( G) could
be different between the two gradient modes However, since the retention time shift of sample S16A was shown to be about the same for gradient slopes G1, G2, and G3 between the two modes,
a significant difference in effective gradient slope was unlikely An- other explanation could be that the peak broadening due to par- tial diastereomer separation was greater in the IPR gradient mode than the co-solvent mode This explanation is plausible since we
Trang 7Fig 4 Distributions of the predicted retention times for 250,0 0 0 unique 12- and 16-mer sequences (blue and orange fill, respectively) calculated using the SVR model
Subplots a)–c) show the co-solvent gradient and d)–f) the IPR gradient Gradient slope G1 (a, d), G2 (b, e), and G3 (c, f)
have shown that the diastereomer separation increased at lower
and constant co-solvent concentration in the IPR gradient mode
as compared with co-solvent gradient elution [18] This would ex-
plain why the peak width increased with both decreasing gradient
slope and increasing retention time We have previously showed
the diastereomer separation involving C and G was greater than
that involving A and T [17] and therefore attempted to correlate
the GC-content together with the retention time and a constant, to
the observed peak width This simple correlation provides a rea-
sonable approximation of peak width, as summarized in Supple-
mentary material S2
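The linear correlation between peak width, GC-content, and retention time can be sketched as an ordinary least-squares fit of w = a·GC + b·t_R + c. The following Python example is only an illustration under that assumption: the function names, coefficient values, and data points are hypothetical, and the fitting routine used in the study may differ.

```python
import numpy as np

def fit_peak_width_model(gc_fraction, retention_time, peak_width):
    """Least-squares fit of w = a*GC + b*t_R + c; returns (a, b, c)."""
    X = np.column_stack([gc_fraction, retention_time, np.ones(len(gc_fraction))])
    coeffs, *_ = np.linalg.lstsq(X, np.asarray(peak_width, dtype=float), rcond=None)
    return coeffs

def predict_peak_width(coeffs, gc_fraction, retention_time):
    """Evaluate the fitted linear model for new sequences."""
    a, b, c = coeffs
    return a * np.asarray(gc_fraction) + b * np.asarray(retention_time) + c

# Hypothetical training data generated from known coefficients (not measured values).
gc = np.array([0.25, 0.50, 0.75, 0.50, 0.375])
t_r = np.array([6.0, 5.0, 4.0, 7.0, 6.5])          # min
true_coeffs = np.array([0.04, 0.01, 0.02])          # a, b, c (made up)
widths = true_coeffs[0] * gc + true_coeffs[1] * t_r + true_coeffs[2]

coeffs = fit_peak_width_model(gc, t_r, widths)
print("fitted a, b, c:", coeffs)
print("predicted width:", predict_peak_width(coeffs, [0.75], [5.5]))
```

One such model would be fitted per gradient mode and gradient slope, since the text reports that the strength of the correlation differs between conditions.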
The predicted versus experimentally calculated resolutions for the 12- and 16-mer samples are presented in Table 4. Except for the steepest gradient slope investigated using both gradient modes, the prediction error is less than 10%. We also observe that the absolute mean error of prediction decreases with decreasing gradient slope. Investigating the details, we see that the n – 1 impurities of samples S12A and S12C are always resolved with a resolution of more than 1.5, regardless of the investigated gradient slope or mode. For the 16-mer sequences, the critical resolution is reached at a steeper gradient slope using the IPR gradient mode than using the co-solvent gradient mode. Interestingly, the two 12-mer samples always have higher resolution than the 16-mers using the co-solvent gradient at any gradient slope, whereas the GC-rich sample S12A has a lower resolution than 2 out of the 8 investigated 16-mers using the IPR gradient mode. This again highlights that the IPR gradient mode has a higher degree of separation based on sequence rather than length as compared to the co-solvent gradient. An accurate estimation of resolution based on sequence composition and retention time allowed us to calculate the peak widths of all 250,000 random unique 12- and 16-mers as well as their n – 1 impurities at each gradient slope.
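As a minimal sketch of how a resolution value follows from two retention times and two peak widths, the Python example below uses the common definition Rs = 2(t_R,2 − t_R,1)/(w_1 + w_2) with baseline peak widths. This definition, the orientation of the sequence string (written 3′ to 5′, so that the 5′-terminal nucleobase is the last character), and all numerical values are assumptions for illustration.

```python
def n_minus_1(sequence):
    """Return the n - 1 sequence formed by removing the 5'-terminal nucleobase.

    Assumes the string is written 3' -> 5', so the 5' end is the last character.
    """
    return sequence[:-1]

def resolution(t_r_first, t_r_second, w_first, w_second):
    """Resolution from retention times and baseline peak widths (same time unit)."""
    return 2.0 * abs(t_r_second - t_r_first) / (w_first + w_second)

# Hypothetical values: predicted retention times (min) and peak widths (min)
# for an n - 1 impurity and its full-length product (FLP).
t_r_impurity, t_r_flp = 9.70, 10.00
w_impurity, w_flp = 0.18, 0.20

rs = resolution(t_r_impurity, t_r_flp, w_impurity, w_flp)
print(f"Rs = {rs:.2f}, critical resolution reached: {rs >= 1.5}")
```

In the workflow described in the text, the retention times would come from the SVR model and the widths from the linear peak-width correlation, evaluated once for the FLP and once for its n – 1 sequence.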
The resulting distributions of calculated resolutions are shown in Fig. 5. For the co-solvent gradient mode, the 12/11-mer separation always has a higher resolution than the 16/15-mer separation, regardless of the sequences. In addition, all 12-mer sequences are predicted to reach a resolution of 1.5 at all investigated gradient slopes. The resolution of the 12-mer sequences using the co-solvent gradient mode was generally similar to or slightly better than that achieved with the IPR gradient mode. This could also be anticipated from Fig. 2a, where the selectivity between shorter oligonucleotides is greater for the co-solvent gradient than for the IPR gradient. For the 16/15-mer separation, no sequences could be separated with a resolution of at least 1.5 using the steepest co-solvent gradient investigated. At the second and third steepest gradients, i.e., G2 and G3, 42 and 28% of the random sequences, respectively, could not be separated (see Table 3). For the IPR gradient mode, the resolution distributions of the 16/15-mer and 12/11-mer separations overlap at all gradient slopes, with the overlap increasing with decreasing gradient slope. The results indicate that some 12/11-mers are more difficult to resolve than some 16/15-mers using IPR gradients. This could be expected from the experimental resolution data showing that a GC-rich 12-mer can have lower resolution than some 16-mers (Table 4). For the 16-mer FLPs, 31, 9, and 4% of all random unique sequences are expected not to reach the critical resolution of 1.5 at gradient slopes G1, G2, and G3, respectively.
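The percentages quoted above are fractions of a random sequence population whose predicted resolution falls below 1.5. A minimal Python sketch of the two generic steps, drawing unique random sequences and computing the failing fraction, is given below; the function names, the seed, and the example resolutions are hypothetical, and the resolution values themselves would come from the trained models rather than from this snippet.

```python
import random
import numpy as np

def random_unique_sequences(n, length, seed=0):
    """Draw n unique random sequences of a given length with uniform base probabilities.

    Requires n <= 4**length; sorted output keeps the result reproducible.
    """
    rng = random.Random(seed)
    seqs = set()
    while len(seqs) < n:
        seqs.add("".join(rng.choice("ATCG") for _ in range(length)))
    return sorted(seqs)

def fraction_below_critical(resolutions, critical=1.5):
    """Fraction of sequences whose predicted resolution is below the critical value."""
    return float(np.mean(np.asarray(resolutions, dtype=float) < critical))

sequences = random_unique_sequences(1000, 12)
print(len(sequences), "unique 12-mers, e.g.", sequences[:3])

# Hypothetical predicted resolutions for four sequences.
example_rs = [1.2, 1.6, 1.4, 2.0]
print("fraction below Rs = 1.5:", fraction_below_critical(example_rs))
```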
Investigating the characteristics of the 16-mer sequences that do not reach a resolution of at least 1.5, we found, for the co-solvent gradient, that they had a marginally higher frequency of C, both throughout the sequence and in the 5′-terminal nucleobase, which when missing creates the n – 1-mer (see Table 3). For the IPR gradient, there was a similar but more pronounced trend. The sequences that did not reach critical resolution at G2 and G3 contained 27 and 40% C, respectively, as well as above average A. For the 5′-terminal nucleobase, there was a 41 or 82% probability that it was a C at gradient slopes G2 and G3, respectively. This could be understood from two earlier observations: first, a sequence containing a large proportion of C will lead to a wider peak; second, the loss of a terminal C will give a smaller than average relative decrease in retention time. These effects combined make it difficult to obtain sufficient resolution.
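The terminal-nucleobase statistics discussed above amount to counting which base sits at the 5′ end of each failing sequence. A short Python sketch follows; the function name and the example sequences are hypothetical, and the strings are assumed to be written 3′ to 5′ so that the 5′-terminal base is the last character.

```python
from collections import Counter

def terminal_base_percentages(sequences, five_prime_last=True):
    """Percentage of each nucleobase at the 5' terminus of the given sequences."""
    index = -1 if five_prime_last else 0
    counts = Counter(seq[index] for seq in sequences)
    total = len(sequences)
    return {base: 100.0 * counts.get(base, 0) / total for base in "ATCG"}

# Hypothetical sequences predicted to fail the critical resolution.
failing = ["GCGC", "ACCC", "TGCC", "CGTA"]
terminal = terminal_base_percentages(failing)
print(terminal)
```

Values well above the 25% expected for uniformly random sequences would indicate enrichment of that base at the terminus, which is the pattern reported for C in the text.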
Fig. 5. Distributions of the predicted n – 1 resolutions for 250,000 unique 12- and 16-mer sequences (blue and orange fill, respectively) calculated using the SVR model. Subplots a)–c) show the co-solvent gradient and d)–f) the IPR gradient. Gradient slope G1 (a, d), G2 (b, e), and G3 (c, f). Vertical dashed line at a resolution of 1.5.

Fig. 6. Experimental and simulated chromatograms of the RHB sample at the steep gradient slope G1 (a, c) and the shallow gradient slope G3 (b, d), respectively. Co-solvent gradient mode (a, b) and IPR gradient mode (c, d).

Investigating the FLPs of the experimental dataset (Supplementary material Table S1), we found that one of the sequences that did not reach the critical resolution using the IPR gradient at the steepest gradient slope G1 was the RHB sample (3′-CGCGTGGTCCTGGTCC-5′). This sequence has a composition of 37.5% C, 37.5% G, 25% T, and 0% A as well as a terminal C at the 5′ end. The experimental resolution for the n – 1-mer was calculated to be about 1.3 at G1 (see Table 4). Decreasing the gradient slope to G3 increased the resolution to about 1.9. The resolution using the co-solvent gradient was even lower, about 1 at G1 and just 1.5 at G3. Experimental and simulated chromatograms of RHB are shown in Fig. 6. The simulated peaks were constructed by generating a normal distribution with a variance calculated from the nucleobase composition and retention time. The areas of the FLP and n – 1 were manually normalized by adjusting the height separately for each peak, and the peaks were then stitched together to obtain the final chromatogram. The retention times and peak widths of the experimental and simulated chromatograms are in good agreement, although there is a slight underestimation of the calculated resolution in the co-solvent gradient at gradient slope G1, as also indicated by Table 4.
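A simulated chromatogram of this kind can be sketched in Python by drawing one Gaussian peak for the n – 1 impurity and one for the FLP. The example below is a simplified illustration, not the procedure used in the study: it sums the two peaks on a common time axis instead of stitching separately scaled segments, and the retention times, standard deviations, and heights are hypothetical values.

```python
import numpy as np

def gaussian_peak(t, t_r, sigma, height=1.0):
    """Gaussian peak centred at t_r with standard deviation sigma."""
    return height * np.exp(-0.5 * ((t - t_r) / sigma) ** 2)

def simulate_chromatogram(t, peaks):
    """Sum Gaussian peaks given as (retention time, sigma, height) tuples."""
    signal = np.zeros_like(t, dtype=float)
    for t_r, sigma, height in peaks:
        signal += gaussian_peak(t, t_r, sigma, height)
    return signal

# Hypothetical peak parameters (min): a small n - 1 impurity eluting before the FLP.
t = np.linspace(9.0, 11.0, 2001)
peaks = [
    (9.8, 0.05, 0.1),   # n - 1 impurity
    (10.0, 0.05, 1.0),  # full-length product
]
signal = simulate_chromatogram(t, peaks)
print("apex at t =", t[np.argmax(signal)], "min")
```

For a Gaussian peak the baseline width is approximately four standard deviations, so a predicted peak width can be converted to sigma before plotting; that conversion is a standard relation and is not stated in the text.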
Table 3
Details of predicted 16-mer failure sequences (Rs < 1.5); f_x is the percentage of nucleobase x.
Gradient mode   Gradient slope   Below critical resolution, Rs < 1.5
                                 Frequency (%)   Sequence composition   Terminal nucleobase composition
                                                 fA, fT, fC, fG         fA, fT, fC, fG
Table 4
Experimentally measured (Exp) vs. predicted (Pred) resolutions for the FLP and n – 1 using the SVR model for retention times and the linear model for peak widths, respectively.

                 Co-solvent gradient                       IPR gradient
                 G1           G2           G3              G1           G2           G3
Sample  5′-end   Exp   Pred   Exp   Pred   Exp   Pred      Exp   Pred   Exp   Pred   Exp   Pred
S12A    C        1.81  1.81   2.40  2.53   2.76  2.77      2.17  2.17   2.45  2.44   2.38  2.38
S12C    C        1.73  1.71   2.45  2.63   2.92  2.92      2.26  2.39   2.63  2.70   2.82  2.81
S16A    C        1.10  1.30   1.43  1.63   1.69  2.06      1.56  1.89   1.97  2.33   2.21  2.29
S16B    A        1.03  0.73   1.49  1.57   1.75  1.76      1.49  1.49   1.90  1.98   2.52  2.52
S16C    A        1.02  1.03   1.45  1.69   1.75  1.83      1.45  1.91   1.84  2.14   2.21  2.44
MALAT   G        0.86  0.61   1.23  1.48   1.51  1.50      1.27  1.59   1.62  1.72   1.92  2.15
MHA     A        1.05  0.83   1.57  1.59   1.85  1.96      1.54  1.76   1.98  2.35   2.35  2.78
MHB     A        1.14  0.83   1.65  1.59   1.93  1.96      1.85  1.76   2.31  2.35   2.74  2.78
RHA     G        1.09  0.82   1.54  1.49   1.77  1.68      1.63  1.55   2.06  2.06   2.30  2.32
RHB     C        0.98  0.92   1.32  1.34   1.53  1.51      1.29  1.31   1.68  1.66   1.90  1.91
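The relative prediction errors behind Table 4 can be recomputed with a few lines of Python. The sketch below uses three rows copied from the table and assumes that the six Exp/Pred column pairs are ordered co-solvent G1, G2, G3 followed by IPR G1, G2, G3; that ordering is inferred from the resolutions quoted for the RHB sample in the text and is not stated explicitly in the extracted table.

```python
# (experimental, predicted) resolution pairs for three samples from Table 4.
# Assumed column order: co-solvent G1, G2, G3, then IPR G1, G2, G3.
conditions = ["co-solvent G1", "co-solvent G2", "co-solvent G3",
              "IPR G1", "IPR G2", "IPR G3"]
table4 = {
    "S12A": [(1.81, 1.81), (2.40, 2.53), (2.76, 2.77),
             (2.17, 2.17), (2.45, 2.44), (2.38, 2.38)],
    "S16A": [(1.10, 1.30), (1.43, 1.63), (1.69, 2.06),
             (1.56, 1.89), (1.97, 2.33), (2.21, 2.29)],
    "RHB":  [(0.98, 0.92), (1.32, 1.34), (1.53, 1.51),
             (1.29, 1.31), (1.68, 1.66), (1.90, 1.91)],
}

def relative_errors(pairs):
    """Signed relative error in percent, (pred - exp) / exp * 100, for each pair."""
    return [100.0 * (pred - exp) / exp for exp, pred in pairs]

errors = {sample: relative_errors(pairs) for sample, pairs in table4.items()}
for sample, errs in errors.items():
    for condition, err in zip(conditions, errs):
        print(f"{sample:5s} {condition:14s} {err:+6.1f}%")
```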
5 Conclusions
This study aimed at constructing an ML model capable of predicting the retention times of phosphorothioated oligonucleotides with high accuracy. The model was shown to predict retention times with low RMSE as well as high Q² and R² for all investigated conditions. For the investigated experimental systems, the effect of secondary oligonucleotide structure was shown to be minimal, allowing us to construct a simpler model.

The ML models were used for predicting the chromatographic characteristics of 250,000 random 12- and 16-mers. It was found that the variance in retention time was higher when using the IPR gradient mode than the co-solvent gradient mode. However, a slight skewness in the distribution of retention times for a uniform distribution of A, T, G, and C indicates that the SVR model has captured sequence-specific contributions to the retention time, which could indicate the presence of next-neighbor effects. Sequences containing high proportions of C and G gave the shortest retention times, whereas high proportions of A and T gave the longest retention times in both gradient modes.

Finally, the resolution of each of the 250,000 random sequences to its n – 1-mer was calculated using the retention time from the ML model and the peak width from the linear combination of oligonucleotide GC-content and retention time. Results indicate that the co-solvent gradient mode can be expected to easily resolve all 12-mer sequences from the 11-mers, typically with greater resolution than the IPR gradient. On the other hand, the probability of successfully resolving longer 16-mer sequences from 15-mers was significantly higher using the IPR gradient mode. For both methods, decreasing the gradient slope increased the probability of achieving critical resolution. Among the 16-mers that still could not be resolved using the IPR gradient mode, the frequency of C was very high, particularly at the terminal nucleobase.

The ML models constructed in this study could help select the appropriate gradient mode and gradient slope that would lead to successful separation before performing an experiment. The models could be expanded to account for retention shifts introduced by other oligonucleotide modifications such as 2′-MOE, methyl-C, or LNAs if sufficient data are provided. They could also cover other impurities related to the FLP if trained with such retention data; these could, for example, include (P=O) or abasic impurities. Other chromatographic systems, including other column chemistries, particle sizes, temperatures, and mobile phases, could also be added to provide an even greater number of possible systems to choose from. The methodology could also be used to optimize the method run time in silico before running experiments.
Availability
Implementations and code used in this study can be found at: https://github.com/jakobhaggstrom/JCA-21-1579
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
CRediT authorship contribution statement
Martin Enmark: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data curation, Writing – original draft, Writing – review & editing, Visualization, Supervision. Jakob Häggström: Methodology, Software, Formal analysis, Investigation, Data curation, Writing – original draft. Jörgen Samuelsson: Conceptualization, Validation, Writing – original draft, Writing – review & editing, Supervision. Torgny Fornstedt: Conceptualization, Writing – review & editing, Supervision, Project administration, Funding acquisition.
This work was supported by the Swedish Knowledge Foundation via the project “Improved Methods for Process and Quality Controls using Digital Tools” (grant number 20210021) and by the Swedish Research Council (VR) via the project “Fundamental Studies on Molecular Interactions aimed at Preparative Separations and Biospecific Measurements” (grant number 2015-04627).
Supplementary materials
Supplementary material associated with this article can be
found, in the online version, at doi: 10.1016/j.chroma.2022.462999
References
[1] S. Yang, R.E. Rothman, PCR-based diagnostics for infectious diseases: uses, limitations, and future applications in acute-care settings, Lancet Infect. Dis. 4 (2004) 337–348, doi: 10.1016/S1473-3099(04)01044-8.
[2] L. Becherer, N. Borst, M. Bakheit, S. Frischmann, et al., Loop-mediated isothermal amplification (LAMP) – review and classification of methods for sequence-specific detection, Anal. Methods 12 (2020) 717–746, doi: 10.1039/C9AY02246E.
[3] M.J. Heller, DNA Microarray Technology: Devices, Systems, and Applications, Annu. Rev. Biomed. Eng. 4 (2002) 129–153, doi: 10.1146/annurev.bioeng.4.020702.153438.
[4] W. Yin, M. Rogge, Targeting RNA: A Transformative Therapeutic Strategy, Clin. Translat. Sci. 12 (2019) 98–112, doi: 10.1111/cts.12624.
[5] T.C. Roberts, R. Langer, M.J.A. Wood, Advances in oligonucleotide drug delivery, Nat. Rev. Drug Discovery 19 (2020) 673–694, doi: 10.1038/s41573-020-0075-7.
[6] C.F. Bennett, E.E. Swayze, RNA targeting therapeutics: molecular mechanisms of antisense oligonucleotides as a therapeutic platform, Annu. Rev. Pharmacol. Toxicol. 50 (2010) 259–293, doi: 10.1146/annurev.pharmtox.010909.105654.
[7] E. Paredes, V. Aduda, K.L. Ackley, H. Cramer, 6.11 - Manufacturing of Oligonucleotides, in: S. Chackalamannil, D. Rotella, S.E. Ward (Eds.), Comprehensive Medicinal Chemistry III, Elsevier, Oxford, 2017, pp. 233–279, doi: 10.1016/B978-0-12-409547-2.12423-0.
[8] S. Benizri, A. Gissot, A. Martin, B. Vialet, et al., Bioconjugated oligonucleotides: recent developments and therapeutic applications, Bioconjugate Chem. 30 (2019) 366–383, doi: 10.1021/acs.bioconjchem.8b00761.
[9] N.M. El Zahar, N. Magdy, A.M. El-Kosasy, M.G. Bartlett, Chromatographic approaches for the characterization and quality control of therapeutic oligonucleotide impurities, Biomed. Chromatogr. 32 (2018), doi: 10.1002/bmc.4088.
[10] D. Capaldi, A. Teasdale, S. Henry, N. Akhtar, et al., Impurities in Oligonucleotide Drug Substances and Drug Products, Nucleic Acid Ther. 27 (2017) 309–322, doi: 10.1089/nat.2017.0691.
[11] M. Enmark, J. Bagge, J. Samuelsson, L. Thunberg, et al., Analytical and preparative separation of phosphorothioated oligonucleotides: columns and ion-pair reagents, Anal. Bioanal. Chem. 412 (2020) 299–309, doi: 10.1007/s00216-019-02236-9.
[12] S.G. Roussis, M. Pearce, C. Rentel, Small alkyl amines as ion-pair reagents for the separation of positional isomers of impurities in phosphate diester oligonucleotides, J. Chromatogr. A 1594 (2019) 105–111, doi: 10.1016/j.chroma.2019.02.026.
[13] S.T. Crooke, J.L. Witztum, C.F. Bennett, B.F. Baker, RNA-Targeted Therapeutics, Cell Metab. 27 (2018) 714–739, doi: 10.1016/j.cmet.2018.03.004.
[14] M. Catani, C.D. Luca, J.M.G. Alcântara, N. Manfredini, et al., Oligonucleotides: current trends and innovative applications in the synthesis, characterization, and purification, Biotechnol. J. (2022) 1900226, doi: 10.1002/biot.201900226.
[15] A. Goyon, P. Yehl, K. Zhang, Characterization of therapeutic oligonucleotides by liquid chromatography, J. Pharm. Biomed. Anal. 182 (2020) 113105, doi: 10.1016/j.jpba.2020.113105.
[16] S. Studzińska, S. Bocian, L. Siecińska, B. Buszewski, Application of phenyl-based stationary phases for the study of retention and separation of oligonucleotides, J. Chromatogr. B 1060 (2017) 36–43, doi: 10.1016/j.jchromb.2017.05.033.
[17] M. Enmark, M. Rova, J. Samuelsson, E. Örnskov, et al., Investigation of factors influencing the separation of diastereomers of phosphorothioated oligonucleotides, Anal. Bioanal. Chem. 411 (2019) 3383–3394, doi: 10.1007/s00216-019-01813-2.
[18] M. Enmark, S. Harun, J. Samuelsson, E. Örnskov, et al., Selectivity limits of and opportunities for ion pair chromatographic separation of oligonucleotides, J. Chromatogr. A 1651 (2021) 462269, doi: 10.1016/j.chroma.2021.462269.
[19] A. Demelenne, M.-J. Gou, G. Nys, C. Parulski, et al., Evaluation of hydrophilic interaction liquid chromatography, capillary zone electrophoresis and drift tube ion-mobility quadrupole time of flight mass spectrometry for the characterization of phosphodiester and phosphorothioate oligonucleotides, J. Chromatogr. A 1614 (2020) 460716, doi: 10.1016/j.chroma.2019.460716.
[20] M. Gilar, K.J. Fountain, Y. Budman, U.D. Neue, et al., Ion-pair reversed-phase high-performance liquid chromatography analysis of oligonucleotides: Retention prediction, J. Chromatogr. A 958 (2002) 167–182, doi: 10.1016/S0021-9673(02)00306-0.
[21] S. Studzińska, B. Buszewski, Different approaches to quantitative structure–retention relationships in the prediction of oligonucleotide retention, J. Sep. Sci. 38 (2015) 2076–2084, doi: 10.1002/jssc.201401395.
[22] M. Sturm, S. Quinten, C.G. Huber, O. Kohlbacher, A statistical learning approach to the modeling of chromatographic retention of oligonucleotides incorporating sequence and secondary structure data, Nucleic Acids Res. 35 (2007) 4195–4202, doi: 10.1093/nar/gkm338.
[23] C. Liang, J.-Q. Qiao, H.-Z. Lian, A novel strategy for retention prediction of nucleic acids with their sequence information in ion-pair reversed phase liquid chromatography, Talanta 185 (2018) 592–601, doi: 10.1016/j.talanta.2018.04.030.
[24] O. Kohlbacher, S. Quinten, M. Sturm, B.M. Mayr, et al., Structure–Activity Relationships in Chromatography: Retention Prediction of Oligonucleotides with Support Vector Regression, Angew. Chem. Int. Ed. 45 (2006) 7009–7012, doi: 10.1002/anie.200602561.
[25] L. Moruz, L. Käll, Peptide retention time prediction, Mass Spec. Rev. 36 (2017) 615–623, doi: 10.1002/mas.21488.
[26] M. Gilar, A. Jaworski, P. Olivova, J.C. Gebler, Peptide retention prediction applied to proteomic data analysis, Rapid Commun. Mass Spectrom. 21 (2007) 2813–2821, doi: 10.1002/rcm.3150.
[27] O.V. Krokhin, R. Craig, V. Spicer, W. Ens, et al., An improved model for prediction of retention times of tryptic peptides in ion pair reversed-phase HPLC its application to protein peptide mapping by off-line HPLC-MALDI MS, Mol. Cell. Proteomics 3 (2004) 908–919, doi: 10.1074/mcp.M400031-MCP200.
[28] K. Petritis, L.J. Kangas, P.L. Ferguson, G.A. Anderson, et al., Use of Artificial Neural Networks for the Accurate Prediction of Peptide Liquid Chromatography Elution Times in Proteome Analyses, Anal. Chem. 75 (2003) 1039–1048, doi: 10.1021/ac0205154.
[29] A.A. Klammer, X. Yi, M.J. MacCoss, W.S. Noble, Improving Tandem Mass Spectrum Identification Using Peptide Retention Time Prediction across Diverse Chromatography Conditions, Anal. Chem. 79 (2007) 6111–6118, doi: 10.1021/ac070262k.
[30] J. Samuelsson, F.F. Eiriksson, D. Åsberg, M. Thorsteinsdóttir, et al., Determining gradient conditions for peptide purification in RPLC with machine-learning-based retention time predictions, J. Chromatogr. A 1598 (2019) 92–100, doi: 10.1016/j.chroma.2019.03.043.
[31] E. Stellwagen, J.M. Muse, N.C. Stellwagen, Monovalent Cation Size and DNA Conformational Stability, Biochemistry 50 (2011) 3084–3094, doi: 10.1021/bi1015524.
[32] J.R. Nilsson, T. Baladi, A. Gallud, D. Baždarević, et al., Fluorescent base analogues in gapmers enable stealth labeling of antisense oligonucleotide therapeutics, Sci. Rep. 11 (2021) 11365, doi: 10.1038/s41598-021-90629-1.
[33] S.G. Roussis, C. Koch, D. Capaldi, C. Rentel, Rapid oligonucleotide drug impurity determination by direct spectral comparison of ion-pair reversed-phase high-performance liquid chromatography electrospray ionization mass spectrometry data, Rapid Commun. Mass Spectrom. 32 (2018) 1099–1106, doi: 10.1002/rcm.8125.
[34] J. Timmons, leshane, Lattice-Automation/seqfold 0.7.7, Zenodo (2021), doi: 10.5281/zenodo.4579886.
[35] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, et al., Scikit-learn: machine learning in python, J. Mach. Learn. Res. 12 (2011) 2825–2830.
[36] M. Newville, T. Stensitzki, D.B. Allen, A. Ingargiola, LMFIT: non-linear least-square minimization and curve-fitting for python, Zenodo (2014), doi: 10.5281/zenodo.11813.