Abstract
The adaptive immune system protects the body by remembering previously encountered antigens so that it can react more efficiently when it encounters the same antigen again. The adaptive immune receptors on T-cells and B-cells, collectively called the adaptive immune receptor repertoire, play a key role in recognizing antigens. Analyzing these immune repertoires deepens our understanding of them and aids the development of diagnostic technologies. The immune signal is the set of features in the adaptive immune receptor repertoire that are associated with antigen binding or disease status. Simulating immune signals gives precise control over the ground truth when the simulated data is used to assess machine learning models. One approach to simulating an immune signal is to assume that it takes the form of full sequences. Full sequence implanting simulates the effect of an immune event on an immune repertoire dataset by implanting one or more sequences many times into the immune repertoires. Because of biases in the natural generation of immune receptors, different receptor sequences have very different probabilities of being generated. This generation probability can be computed. However, if a full sequence that is unlikely to be generated naturally is implanted many times in a dataset, it can become an easily detectable outlier. This can produce unrealistic simulated data that gives misleading benchmarking results. The full sequence implanting in immuneML, an open-source platform for machine learning on immune repertoires, can produce such generation probability outliers. This thesis presents two implementations of signal implanting strategies that take different approaches to solving this generation probability outlier problem and that extend the full sequence simulation in immuneML.
The relationship between the generation probability of sequences and how often those sequences appear was analyzed in synthetic and experimental datasets, in order to determine how the signal implanting strategies should behave and which parameters the user should control. Finally, a method that detects candidate generation probability outliers was used to assess the new signal implanting strategies. Both strategies implanted the signal in such a way that the generation probability outlier problem could not be reliably exploited. The two strategies have different strengths and weaknesses, and both can be used to simulate full sequence immune signals for different types of machine learning models.