simAIRR: simulation of adaptive immune repertoires with realistic receptor sequence sharing for benchmarking of immune state prediction methods

dc.date.accessioned	2024-03-23T17:44:33Z
dc.date.available	2024-03-23T17:44:33Z
dc.date.created	2023-11-17T13:59:17Z
dc.date.issued	2023
dc.identifier.citation	Kanduri, Chakravarthi; Scheffer, Lonneke; Pavlović, Milena; Rand, Knut Dagestad; Chernigovskaia, Maria; Pirvandy, Oz; Yaari, Gur; Greiff, Victor; Sandve, Geir Kjetil Ferkingstad. simAIRR: simulation of adaptive immune repertoires with realistic receptor sequence sharing for benchmarking of immune state prediction methods. GigaScience. 2023, 12, 1-16
dc.identifier.uri	http://hdl.handle.net/10852/110070
dc.description.abstract	Background: Machine learning (ML) has gained significant attention for classifying immune states in adaptive immune receptor repertoires (AIRRs) to support the advancement of immunodiagnostics and therapeutics. Simulated data are crucial for the rigorous benchmarking of AIRR-ML methods. Existing approaches to generating synthetic benchmarking datasets produce naive repertoires that lack a key feature of antigen-experienced repertoires: many shared receptor sequences (selected for common antigens). Results: We demonstrate that a common approach to generating simulated AIRR benchmark datasets can introduce biases, which certain ML methods may exploit for undesired shortcut learning. To mitigate undesirable access to true signals in simulated AIRR datasets, we devised a simulation strategy (simAIRR) that constructs antigen-experienced-like repertoires with a realistic overlap of receptor sequences. simAIRR can be used for constructing AIRR-level benchmarks based on a range of assumptions (or experimental data sources) for what constitutes receptor-level immune signals. This includes the option of making, or not making, prior assumptions regarding the similarity or commonality of the immune state–associated sequences that will be used as true signals. We demonstrate the realism of our proposed simulation approach by showing that basic ML strategies perform similarly on simAIRR-generated and real-world experimental AIRR datasets. Conclusions: This study sheds light on the potential shortcut learning opportunities for ML methods that can arise with the state-of-the-art way of simulating AIRR datasets. simAIRR is available as a Python package: https://github.com/KanduriC/simAIRR.
dc.language	EN
dc.rights	Attribution 4.0 International
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.title	simAIRR: simulation of adaptive immune repertoires with realistic receptor sequence sharing for benchmarking of immune state prediction methods
dc.title.alternative	simAIRR: simulation of adaptive immune repertoires with realistic receptor sequence sharing for benchmarking of immune state prediction methods
dc.type	Journal article
dc.creator.author	Kanduri, Chakravarthi
dc.creator.author	Scheffer, Lonneke
dc.creator.author	Pavlović, Milena
dc.creator.author	Rand, Knut Dagestad
dc.creator.author	Chernigovskaia, Maria
dc.creator.author	Pirvandy, Oz
dc.creator.author	Yaari, Gur
dc.creator.author	Greiff, Victor
dc.creator.author	Sandve, Geir Kjetil Ferkingstad
cristin.unitcode	185,15,31,0
cristin.unitname	Senter for bioinformatikk
cristin.ispublished	true
cristin.fulltext	original
cristin.qualitycode	1
dc.identifier.cristin	2198166
dc.identifier.bibliographiccitation	info:ofi/fmt:kev:mtx:ctx&ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.jtitle=GigaScience&rft.volume=12&rft.spage=1&rft.date=2023
dc.identifier.jtitle	GigaScience
dc.identifier.volume	12
dc.identifier.startpage	1
dc.identifier.endpage	16
dc.identifier.doi	https://doi.org/10.1093/gigascience/giad074
dc.type.document	Journal article
dc.type.peerreviewed	Peer reviewed
dc.source.issn	2047-217X
dc.type.version	PublishedVersion
dc.relation.project	NFR/311341
dc.relation.project	KF/215817
dc.relation.project	NFR/331890
dc.relation.project	NFR/300740
dc.relation.project	SIGMA2/NN9603K
dc.relation.project	EC/H2020/825821