Machine learning workflows to estimate class probabilities for precision cancer diagnostics on DNA methylation microarray data

dc.date.accessioned	2021-02-13T20:24:15Z
dc.date.available	2021-02-13T20:24:15Z
dc.date.created	2020-01-28T15:01:50Z
dc.date.issued	2020
dc.identifier.citation	Maros, Máté E.; Capper, David; Jones, David T.; Hovestadt, Volker; von Deimling, Andreas; Pfister, Stefan M.; Benner, Axel; Zucknick, Karola Manuela; Sill, Martin. Machine learning workflows to estimate class probabilities for precision cancer diagnostics on DNA methylation microarray data. Nature Protocols. 2020, 15(2), 479-512
dc.identifier.uri	http://hdl.handle.net/10852/83179
dc.description.abstract	DNA methylation data-based precision cancer diagnostics is emerging as the state of the art for molecular tumor classification. Standards for choosing statistical methods with regard to well-calibrated probability estimates for these typically highly multiclass classification tasks are still lacking. To support this choice, we evaluated well-established machine learning (ML) classifiers including random forests (RFs), elastic net (ELNET), support vector machines (SVMs) and boosted trees in combination with post-processing algorithms and developed ML workflows that allow for unbiased class probability (CP) estimation. Calibrators included ridge-penalized multinomial logistic regression (MR) and Platt scaling by fitting logistic regression (LR) and Firth’s penalized LR. We compared these workflows on a recently published brain tumor 450k DNA methylation cohort of 2,801 samples with 91 diagnostic categories using a 5 × 5-fold nested cross-validation scheme and demonstrated their generalizability on external data from The Cancer Genome Atlas. ELNET was the top stand-alone classifier with the best calibration profiles. The best overall two-stage workflow was MR-calibrated SVM with linear kernels closely followed by ridge-calibrated tuned RF. For calibration, MR was the most effective regardless of the primary classifier. The protocols developed as a result of these comparisons provide valuable guidance on choosing ML workflows and their tuning to generate well-calibrated CP estimates for precision diagnostics using DNA methylation data. Computation times vary depending on the ML algorithm from <15 min to 5 d using multi-core desktop PCs. Detailed scripts in the open-source R language are freely available on GitHub, targeting users with intermediate experience in bioinformatics and statistics and using R with Bioconductor extensions.
dc.language	EN
dc.title	Machine learning workflows to estimate class probabilities for precision cancer diagnostics on DNA methylation microarray data
dc.type	Journal article
dc.creator.author	Maros, Máté E.
dc.creator.author	Capper, David
dc.creator.author	Jones, David T.
dc.creator.author	Hovestadt, Volker
dc.creator.author	von Deimling, Andreas
dc.creator.author	Pfister, Stefan M.
dc.creator.author	Benner, Axel
dc.creator.author	Zucknick, Karola Manuela
dc.creator.author	Sill, Martin
cristin.unitcode	185,51,15,0
cristin.unitname	Avdeling for biostatistikk
cristin.ispublished	true
cristin.fulltext	postprint
cristin.qualitycode	2
dc.identifier.cristin	1784347
dc.identifier.bibliographiccitation	info:ofi/fmt:kev:mtx:ctx&ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.jtitle=Nature Protocols&rft.volume=15&rft.spage=479&rft.date=2020
dc.identifier.jtitle	Nature Protocols
dc.identifier.volume	15
dc.identifier.issue	2
dc.identifier.startpage	479
dc.identifier.endpage	512
dc.identifier.doi	https://doi.org/10.1038/s41596-019-0251-6
dc.identifier.urn	URN:NBN:no-85951
dc.type.document	Journal article
dc.type.peerreviewed	Peer reviewed
dc.source.issn	1754-2189
dc.identifier.fulltext	Fulltext https://www.duo.uio.no/bitstream/handle/10852/83179/2/Maros_2020_postprint.pdf
dc.type.version	AcceptedVersion
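
The abstract names Platt scaling, fitted by logistic regression, as one of the calibrators applied to raw classifier scores. The following is a minimal sketch of that idea for the binary case only, written in Python rather than the R used by the authors' scripts. It is not the published workflow: the function names (fit_platt, platt_predict, log_loss), the plain gradient-descent fit, and the toy scores and labels are all assumptions made for illustration. In a cross-validated setup, the calibrator would normally be fit on scores from samples the primary classifier did not see during training.

```python
import math


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_platt(scores, labels, lr=0.1, n_iter=5000):
    """Fit p(y=1 | s) = sigmoid(a*s + b) by gradient descent on the mean log loss.

    scores: raw classifier outputs (e.g. decision values), one per sample.
    labels: 0/1 class labels, one per sample.
    """
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = 0.0
        grad_b = 0.0
        for s, y in zip(scores, labels):
            p = _sigmoid(a * s + b)
            grad_a += (p - y) * s
            grad_b += (p - y)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def platt_predict(scores, a, b):
    """Map raw scores to calibrated probability estimates."""
    return [_sigmoid(a * s + b) for s in scores]


def log_loss(probs, labels, eps=1e-12):
    """Mean negative log-likelihood of the labels under the given probabilities."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


# Hypothetical held-out scores and labels (illustrative values, not data from the paper).
scores = [-2.5, -1.8, -1.1, -0.6, -0.2, 0.1, 0.4, 0.9, 1.5, 2.2]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

a, b = fit_platt(scores, labels)
probs = platt_predict(scores, a, b)
```

For a task with many diagnostic classes, one common extension is to fit one such calibrator per class and renormalize, or to fit a single penalized multinomial model on all class scores at once; which variant matches the published protocol should be checked against the authors' scripts.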

Files in this item

Name: Maros_2020_postprint.pdf
Size: 1.016Mb
Format: application/

Appears in the following Collection

Institutt for medisinske basalfag [2799]
CRIStin høstingsarkiv [31446]
