Improved workflow for constructing machine learning models: Predicting retention times and peak widths in oligonucleotide separation
2025 (English). In: Journal of Chromatography A, ISSN 0021-9673, E-ISSN 1873-3778, Vol. 1747, article id 465746. Article in journal (Refereed). Published
Abstract [en]
This study presents an improved workflow to support the development of machine learning models that predict oligonucleotide retention times and peak widths, and thus peak resolutions, from larger datasets where manual processing is not feasible. We explored diverse oligonucleotide forms, ranging from native to fully phosphorothioated, using three different gradient slopes. Both native and phosphorothioated oligonucleotides were separated using a chromatographic C18 system with tributylaminium ion as the ion-pair reagent in the eluent, resulting in retention time data for approximately 900 sequences per gradient. To manage these large datasets, we developed a semi-automatic rule-based approach for retention time determination, peak decomposition, peak width assessment, signal-to-noise ratio estimation, and skewness analysis. Probability density functions (PDFs) were fitted to elution profiles, with PDF selection based on an F-test. Co-eluting peaks were addressed using a multiple Gaussian PDF. The encoded sequence data were modeled using support vector regression (SVR), gradient boosting (GB), random forest (RF), and decision tree (DT) models. GB and SVR showed promise for retention predictions, while RF and DT were faster but demonstrated limited generalization capabilities. The machine learning models exhibited larger errors for the shallowest gradient and lower predictability for P=O sequences, potentially due to signal intensity and sequence heterogeneity. Improvements in signal-to-noise ratios were considered, including mass spectrometry in selected ion monitoring mode. The best model for these datasets was GB, closely followed by the SVR model. With established models for retention and peak width, chromatograms can now be predicted for various gradient slopes, offering prediction of impurity peak resolution for arbitrary sequences and gradient slopes.
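As an illustration of how a predicted retention time and peak width translate into a peak resolution, the following minimal sketch applies the common half-height resolution formula, Rs = 1.18 * |t2 - t1| / (w1 + w2). The abstract does not state which width definition or resolution formula the authors use, so the formula choice, the `Peak` class, the `resolution` function, and the numerical values are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Peak:
    """Hypothetical container for one predicted peak (not from the paper)."""
    retention_time: float      # minutes
    width_half_height: float   # minutes, full width at half maximum


def resolution(a: Peak, b: Peak) -> float:
    """Half-height resolution: Rs = 1.18 * |tR_b - tR_a| / (w_a + w_b).

    Assumes both widths are measured at half height; this is one common
    convention, not necessarily the one used in the study.
    """
    total_width = a.width_half_height + b.width_half_height
    if total_width <= 0:
        raise ValueError("peak widths must sum to a positive value")
    return 1.18 * abs(b.retention_time - a.retention_time) / total_width


# Illustrative values only; in practice these would come from the
# retention-time and peak-width models for a given gradient slope.
main_peak = Peak(retention_time=10.0, width_half_height=0.10)
impurity = Peak(retention_time=10.3, width_half_height=0.12)

rs = resolution(main_peak, impurity)
print(f"Rs = {rs:.2f}")  # roughly 1.61 for these example values
```

A resolution of about 1.5 or higher is conventionally taken to indicate baseline separation of two roughly Gaussian peaks, so a calculation of this kind could be repeated per gradient slope to screen for conditions that separate an impurity from the main product.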
Place, publisher, year, edition, pages
Elsevier, 2025. Vol. 1747, article id 465746
Keywords [en]
Oligonucleotides, Ion-pair chromatography, Machine learning, Computer simulation, Resolution predictions
National Category
Bioinformatics (Computational Biology); Analytical Chemistry
Research subject
Chemistry; Computer Science
Identifiers
URN: urn:nbn:se:kau:diva-103955
DOI: 10.1016/j.chroma.2025.465746
ISI: 001436803200001
PubMedID: 40014960
Scopus ID: 2-s2.0-85218463003
OAI: oai:DiVA.org:kau-103955
DiVA, id: diva2:1951631
Funder
Knowledge Foundation, 20210021
2025-04-11 2025-04-11 2025-04-11 Bibliographically approved