dc.contributor.author: Justnes, Vemund
dc.date.accessioned: 2021-09-07T22:32:45Z
dc.date.available: 2021-09-07T22:32:45Z
dc.date.issued: 2021
dc.identifier.citation: Justnes, Vemund. Using Word Embeddings to Determine Concepts of Values In Insurance Claim Spreadsheets. Master thesis, University of Oslo, 2021
dc.identifier.uri: http://hdl.handle.net/10852/87784
dc.description.abstract: Many business decisions are based on data that exist in spreadsheets. In the insurance domain, domain experts use spreadsheet tools for different analytical tasks, such as underwriting, and as a tool for exchanging data with a third party. In business-to-business insurance, a claimant or surveyor may send an overview of damages as a spreadsheet. Data in these spreadsheets are of interest when handling a claim, and can also include meta-information, such as parties, locations, events, etc. For instance, the claim handler will look in the spreadsheet and find the costs related to a damage, and then map those costs to specific insurance coverage types. In order to automate the mapping from a cost to an insurance coverage, both the description of the cost and the total sum need to be extracted from the spreadsheet. Automatically mining these spreadsheets is challenging as they have no standardized structure; i.e., they are semi-structured tables. I explored the idea of classifying the concept of textual values found in spreadsheets as a potential first step for mining data in semi-structured tables in the insurance domain. In order to use supervised learning, I created a data set consisting of 119,963 cell values gathered from spreadsheets linked to actual Norwegian property claims. I tried four different methods for the classification task: rule-based (RB), multinomial naïve Bayes (MNB), k-nearest neighbors (k-NN), and a method where I represented each concept by the mean embedding of all its samples (MI; short for multi-index). The RB used the raw text as input, the MNB used bag-of-character n-grams to featurize the text, and both the k-NN and the MI represented the text by sentence embeddings (where both the composition method and the distributed method were used).
Measuring the accuracy of the models by F1-score (the harmonic mean of precision and recall), the RB has an accuracy of 52.90%, the MNB has an accuracy of 79.11%, the k-NN has an accuracy of 87.09%, while the MI has an accuracy of 64.78%. Although the k-NN achieved the highest accuracy, it took nearly three hours to evaluate 23,998 test samples, whereas the MNB evaluated them in just under five seconds. Interestingly, I found that the MI (which with an inefficient implementation evaluated in roughly 20 seconds) reached an accuracy on par with the RB with just ten samples, indicating that the MI is a practical method for streamlining annotation of text found in spreadsheets. To conclude, the approach I present is not suitable for extracting information that affects the claim handling, as it relies too much on the formatting of values, such as monetary values. However, the approach makes meta-information more accessible, and can therefore be used to extract information such as organisation names and locations. Although the k-NN is inefficient at classifying, the method can be used to extract meta-information, as such an approach can be handled as a background process that does not affect the claim handling. Finally, I suggest that future work be done to improve the data set I created, to investigate whether word embeddings fine-tuned to the insurance domain improve the accuracy, and to investigate a method for making classification with the k-NN more efficient. (eng)
dc.language.iso: eng
dc.subject: insurance
dc.subject: information extraction
dc.subject: spreadsheet
dc.title: Using Word Embeddings to Determine Concepts of Values In Insurance Claim Spreadsheets (eng)
dc.type: Master thesis
dc.date.updated: 2021-09-07T22:32:44Z
dc.creator.author: Justnes, Vemund
dc.identifier.urn: URN:NBN:no-90440
dc.type.document: Masteroppgave
dc.identifier.fulltext: Fulltext https://www.duo.uio.no/bitstream/handle/10852/87784/1/using_word_embeddings_for_concept_determiniation.pdf
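
The abstract describes the MI method as representing each concept by the mean embedding of all its samples. The following is a minimal sketch of that idea, not the thesis implementation. It rests on several assumptions that are not in the record: `embed` is a hypothetical toy stand-in (hashed character trigram counts) for the sentence embeddings used in the thesis, cosine similarity is assumed as the comparison measure, and all function names and the example labels are illustrative only.

```python
import math
import zlib
from collections import defaultdict

DIM = 64  # toy embedding size (assumption, not from the thesis)


def embed(text, n=3):
    """Hypothetical stand-in for a sentence embedding:
    counts of character n-grams hashed into a fixed-size vector."""
    vec = [0.0] * DIM
    padded = f" {text.lower()} "
    for i in range(max(len(padded) - n + 1, 1)):
        gram = padded[i:i + n]
        # crc32 is deterministic across runs, unlike the built-in hash()
        vec[zlib.crc32(gram.encode("utf-8")) % DIM] += 1.0
    return vec


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fit_centroids(samples):
    """Build one mean embedding per concept from (text, concept) pairs."""
    grouped = defaultdict(list)
    for text, concept in samples:
        grouped[concept].append(embed(text))
    return {concept: mean_vector(vecs) for concept, vecs in grouped.items()}


def classify(text, centroids):
    """Return the concept whose mean embedding is most similar to the text."""
    query = embed(text)
    return max(centroids, key=lambda concept: cosine(query, centroids[concept]))
```

Calling `fit_centroids` on a handful of labelled cell values and then `classify` on an unseen value compares the query against one vector per concept only, whereas a k-NN classifier compares it against every stored sample.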

