dc.date.accessioned: 2022-06-07T15:11:49Z
dc.date.available: 2022-06-07T15:11:49Z
dc.date.created: 2022-05-27T07:47:11Z
dc.date.issued: 2022
dc.identifier.citation: Kanduri, Chakravarthi; Pavlović, Milena; Scheffer, Lonneke; Motwani, Keshav; Chernigovskaya, Maria; Greiff, Victor; Sandve, Geir Kjetil Ferkingstad. Profiling the baseline performance and limits of machine learning models for adaptive immune receptor repertoire classification. GigaScience. 2022, 11(05)
dc.identifier.uri: http://hdl.handle.net/10852/94308
dc.description.abstract: Background: Machine learning (ML) methodology development for the classification of immune states in adaptive immune receptor repertoires (AIRRs) has seen a recent surge of interest. However, so far, there does not exist a systematic evaluation of scenarios where classical ML methods (such as penalized logistic regression) already perform adequately for AIRR classification. This hinders investigative reorientation to those scenarios where method development of more sophisticated ML approaches may be required. Results: To identify those scenarios where a baseline ML method is able to perform well for AIRR classification, we generated a collection of synthetic AIRR benchmark data sets encompassing a wide range of data set architecture-associated and immune state–associated sequence patterns (signal) complexity. We trained ≈1,700 ML models with varying assumptions regarding immune signal on ≈1,000 data sets with a total of ≈250,000 AIRRs containing ≈46 billion TCRβ CDR3 amino acid sequences, thereby surpassing the sample sizes of current state-of-the-art AIRR-ML setups by two orders of magnitude. We found that L1-penalized logistic regression achieved high prediction accuracy even when the immune signal occurs only in 1 out of 50,000 AIR sequences. Conclusions: We provide a reference benchmark to guide new AIRR-ML classification methodology by (i) identifying those scenarios characterized by immune signal and data set complexity, where baseline methods already achieve high prediction accuracy, and (ii) facilitating realistic expectations of the performance of AIRR-ML models given training data set properties and assumptions. Our study serves as a template for defining specialized AIRR benchmark data sets for comprehensive benchmarking of AIRR-ML methods.
dc.language: EN
dc.rights: Attribution 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.title: Profiling the baseline performance and limits of machine learning models for adaptive immune receptor repertoire classification
dc.title.alternative: [EN] Profiling the baseline performance and limits of machine learning models for adaptive immune receptor repertoire classification
dc.type: Journal article
dc.creator.author: Kanduri, Chakravarthi
dc.creator.author: Pavlović, Milena
dc.creator.author: Scheffer, Lonneke
dc.creator.author: Motwani, Keshav
dc.creator.author: Chernigovskaya, Maria
dc.creator.author: Greiff, Victor
dc.creator.author: Sandve, Geir Kjetil Ferkingstad
cristin.unitcode: 185,15,31,0
cristin.unitname: Senter for bioinformatikk
cristin.ispublished: true
cristin.fulltext: original
cristin.qualitycode: 1
dc.identifier.cristin: 2027629
dc.identifier.bibliographiccitation: info:ofi/fmt:kev:mtx:ctx&ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.jtitle=GigaScience&rft.volume=11&rft.spage=&rft.date=2022
dc.identifier.jtitle: GigaScience
dc.identifier.volume: 11
dc.identifier.issue: 05
dc.identifier.doi: https://doi.org/10.1093/gigascience/giac046
dc.identifier.urn: URN:NBN:no-96862
dc.type.document: Journal article
dc.type.peerreviewed: Peer reviewed
dc.source.issn: 2047-217X
dc.identifier.fulltext: Fulltext https://www.duo.uio.no/bitstream/handle/10852/94308/1/giac046.pdf
dc.type.version: PublishedVersion
cristin.articleid: giac046

This item's license is: Attribution 4.0 International