dc.date.accessioned | 2021-02-13T20:24:15Z | |
dc.date.available | 2021-02-13T20:24:15Z | |
dc.date.created | 2020-01-28T15:01:50Z | |
dc.date.issued | 2020 | |
dc.identifier.citation | Maros, Máté E.; Capper, David; Jones, David T.; Hovestadt, Volker; von Deimling, Andreas; Pfister, Stefan M; Benner, Axel; Zucknick, Karola Manuela; Sill, Martin. Machine learning workflows to estimate class probabilities for precision cancer diagnostics on DNA methylation microarray data. Nature Protocols. 2020, 15, 479-512 | |
dc.identifier.uri | http://hdl.handle.net/10852/83179 | |
dc.description.abstract | DNA methylation data-based precision cancer diagnostics is emerging as the state of the art for molecular tumor classification. Standards for choosing statistical methods with regard to well-calibrated probability estimates for these typically highly multiclass classification tasks are still lacking. To support this choice, we evaluated well-established machine learning (ML) classifiers including random forests (RFs), elastic net (ELNET), support vector machines (SVMs) and boosted trees in combination with post-processing algorithms and developed ML workflows that allow for unbiased class probability (CP) estimation. Calibrators included ridge-penalized multinomial logistic regression (MR) and Platt scaling by fitting logistic regression (LR) and Firth’s penalized LR. We compared these workflows on a recently published brain tumor 450k DNA methylation cohort of 2,801 samples with 91 diagnostic categories using a 5 × 5-fold nested cross-validation scheme and demonstrated their generalizability on external data from The Cancer Genome Atlas. ELNET was the top stand-alone classifier with the best calibration profiles. The best overall two-stage workflow was MR-calibrated SVM with linear kernels closely followed by ridge-calibrated tuned RF. For calibration, MR was the most effective regardless of the primary classifier. The protocols developed as a result of these comparisons provide valuable guidance on choosing ML workflows and their tuning to generate well-calibrated CP estimates for precision diagnostics using DNA methylation data. Computation times vary depending on the ML algorithm from <15 min to 5 d using multi-core desktop PCs. Detailed scripts in the open-source R language are freely available on GitHub, targeting users with intermediate experience in bioinformatics and statistics and using R with Bioconductor extensions. | |
dc.language | EN | |
dc.title | Machine learning workflows to estimate class probabilities for precision cancer diagnostics on DNA methylation microarray data | |
dc.type | Journal article | |
dc.creator.author | Maros, Máté E. | |
dc.creator.author | Capper, David | |
dc.creator.author | Jones, David T. | |
dc.creator.author | Hovestadt, Volker | |
dc.creator.author | von Deimling, Andreas | |
dc.creator.author | Pfister, Stefan M | |
dc.creator.author | Benner, Axel | |
dc.creator.author | Zucknick, Karola Manuela | |
dc.creator.author | Sill, Martin | |
cristin.unitcode | 185,51,15,0 | |
cristin.unitname | Department of Biostatistics (Avdeling for biostatistikk) | |
cristin.ispublished | true | |
cristin.fulltext | postprint | |
cristin.qualitycode | 2 | |
dc.identifier.cristin | 1784347 | |
dc.identifier.bibliographiccitation | info:ofi/fmt:kev:mtx:ctx&ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.jtitle=Nature Protocols&rft.volume=15&rft.spage=479&rft.date=2020 | |
dc.identifier.jtitle | Nature Protocols | |
dc.identifier.volume | 15 | |
dc.identifier.issue | 2 | |
dc.identifier.startpage | 479 | |
dc.identifier.endpage | 512 | |
dc.identifier.doi | https://doi.org/10.1038/s41596-019-0251-6 | |
dc.identifier.urn | URN:NBN:no-85951 | |
dc.type.document | Journal article | |
dc.type.peerreviewed | Peer reviewed | |
dc.source.issn | 1754-2189 | |
dc.identifier.fulltext | Fulltext https://www.duo.uio.no/bitstream/handle/10852/83179/2/Maros_2020_postprint.pdf | |
dc.type.version | AcceptedVersion | |