dc.date.accessioned: 2021-02-13T20:24:15Z
dc.date.available: 2021-02-13T20:24:15Z
dc.date.created: 2020-01-28T15:01:50Z
dc.date.issued: 2020
dc.identifier.citation: Maros, Máté E.; Capper, David; Jones, David T.; Hovestadt, Volker; von Deimling, Andreas; Pfister, Stefan M; Benner, Axel; Zucknick, Karola Manuela; Sill, Martin. Machine learning workflows to estimate class probabilities for precision cancer diagnostics on DNA methylation microarray data. Nature Protocols. 2020, 15, 479-512
dc.identifier.uri: http://hdl.handle.net/10852/83179
dc.description.abstract: DNA methylation data-based precision cancer diagnostics is emerging as the state of the art for molecular tumor classification. Standards for choosing statistical methods with regard to well-calibrated probability estimates for these typically highly multiclass classification tasks are still lacking. To support this choice, we evaluated well-established machine learning (ML) classifiers including random forests (RFs), elastic net (ELNET), support vector machines (SVMs) and boosted trees in combination with post-processing algorithms and developed ML workflows that allow for unbiased class probability (CP) estimation. Calibrators included ridge-penalized multinomial logistic regression (MR) and Platt scaling by fitting logistic regression (LR) and Firth’s penalized LR. We compared these workflows on a recently published brain tumor 450k DNA methylation cohort of 2,801 samples with 91 diagnostic categories using a 5 × 5-fold nested cross-validation scheme and demonstrated their generalizability on external data from The Cancer Genome Atlas. ELNET was the top stand-alone classifier with the best calibration profiles. The best overall two-stage workflow was MR-calibrated SVM with linear kernels closely followed by ridge-calibrated tuned RF. For calibration, MR was the most effective regardless of the primary classifier. The protocols developed as a result of these comparisons provide valuable guidance on choosing ML workflows and their tuning to generate well-calibrated CP estimates for precision diagnostics using DNA methylation data. Computation times vary depending on the ML algorithm from <15 min to 5 d using multi-core desktop PCs. Detailed scripts in the open-source R language are freely available on GitHub, targeting users with intermediate experience in bioinformatics and statistics and using R with Bioconductor extensions.
dc.language: EN
dc.title: Machine learning workflows to estimate class probabilities for precision cancer diagnostics on DNA methylation microarray data
dc.type: Journal article
dc.creator.author: Maros, Máté E.
dc.creator.author: Capper, David
dc.creator.author: Jones, David T.
dc.creator.author: Hovestadt, Volker
dc.creator.author: von Deimling, Andreas
dc.creator.author: Pfister, Stefan M
dc.creator.author: Benner, Axel
dc.creator.author: Zucknick, Karola Manuela
dc.creator.author: Sill, Martin
cristin.unitcode: 185,51,15,0
cristin.unitname: Avdeling for biostatistikk
cristin.ispublished: true
cristin.fulltext: postprint
cristin.qualitycode: 2
dc.identifier.cristin: 1784347
dc.identifier.bibliographiccitation: info:ofi/fmt:kev:mtx:ctx&ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.jtitle=Nature Protocols&rft.volume=15&rft.spage=479&rft.date=2020
dc.identifier.jtitle: Nature Protocols
dc.identifier.volume: 15
dc.identifier.issue: 2
dc.identifier.startpage: 479
dc.identifier.endpage: 512
dc.identifier.doi: https://doi.org/10.1038/s41596-019-0251-6
dc.identifier.urn: URN:NBN:no-85951
dc.type.document: Tidsskriftartikkel
dc.type.peerreviewed: Peer reviewed
dc.source.issn: 1754-2189
dc.identifier.fulltext: Fulltext https://www.duo.uio.no/bitstream/handle/10852/83179/2/Maros_2020_postprint.pdf
dc.type.version: AcceptedVersion

