Abstract
Synthetic data has gained attention in recent years because of its ability to safeguard the privacy of real data points while preserving data utility. These properties are beneficial in many domains and sectors that work with sensitive data, particularly for public agencies, which manage large amounts of data on individuals. Most previous work on synthetic data centres on tabular data, and while some research has been done on synthetic survival data, the topic of synthetic multi-state time-to-event (MS-TTE) data has yet to be considered. In this thesis, we develop a novel semi-parametric approach to synthesising MS-TTE data, which combines a non-parametric tabular synthesiser with a parametric multi-state survival regression model. We use Weibull regression and consider both clock-reset and clock-forward models. Moreover, we extend our approach into an MS-TTE model with a differential privacy guarantee, and we introduce a novel differentially private Weibull regression model. We review selected methods for evaluating the privacy and utility of synthetic data. The standard approach evaluates synthetic data based on a single data set and therefore does not account for the variance between synthetic data sets generated by the same synthesiser. We propose a distance-based evaluation framework that adjusts for this variance. Using an open-access data set, we demonstrate our proposed synthesisers for MS-TTE data with and without differential privacy. Furthermore, we illustrate the evaluation of these synthesisers and their synthetic data by adapting the reviewed methods to an MS-TTE setting and applying our proposed evaluation framework.