Using Word Embeddings to Determine Concepts of Values In Insurance Claim Spreadsheets

dc.contributor.author	Justnes, Vemund
dc.date.accessioned	2021-09-07T22:32:45Z
dc.date.available	2021-09-07T22:32:45Z
dc.date.issued	2021
dc.identifier.citation	Justnes, Vemund. Using Word Embeddings to Determine Concepts of Values In Insurance Claim Spreadsheets. Master thesis, University of Oslo, 2021
dc.identifier.uri	http://hdl.handle.net/10852/87784
dc.description.abstract	Many business decisions are based on data that exist in spreadsheets. In the insurance domain, experts use spreadsheet tools for analytical tasks such as underwriting, and for exchanging data with third parties. In business-to-business insurance, a claimant or surveyor may send an overview of damages as a spreadsheet. The data in these spreadsheets are of interest when handling a claim, and can also include meta-information such as parties, locations, and events. For instance, the claim handler looks in the spreadsheet to find the costs related to a damage and then maps those costs to specific insurance coverage types. To automate the mapping from a cost to an insurance coverage, both the description of the cost and the total sum need to be extracted from the spreadsheet. Automatically mining these spreadsheets is challenging because they have no standardized structure; that is, they are semi-structured tables. I explored the idea of classifying the concept of textual values found in spreadsheets as a potential first step toward mining data in semi-structured tables in the insurance domain. To enable supervised learning, I created a data set of 119,963 cell values gathered from spreadsheets linked to actual Norwegian property claims. I tried four methods for the classification task: rule-based (RB), multinomial naïve Bayes (MNB), k-nearest neighbors (k-NN), and a method in which I represented each concept by the mean embedding of all its samples (MI; short for multi-index). The RB used the raw text as input, the MNB used bag-of-character n-grams to featurize the text, and both the k-NN and the MI represented the text by sentence embeddings (using both the composition method and the distributed method).
Measuring the accuracy of the models by F1-score (the harmonic mean of precision and recall), the RB scores 52.90%, the MNB 79.11%, the k-NN 87.09%, and the MI 64.78%. Although the k-NN achieved the highest score, it took nearly three hours to evaluate 23,998 test samples, whereas the MNB evaluated them in just under five seconds. Interestingly, I found that the MI (which, with an inefficient implementation, evaluated in roughly 20 seconds) reached a score on par with the RB with just ten samples, which indicates that the MI is a practical method for streamlining annotation of text found in spreadsheets. To conclude, the approach I present is not suitable for extracting information that affects the claim handling, as it relies too heavily on the formatting of values, such as monetary values. However, the approach makes meta-information more accessible and can therefore be used to extract information such as organisation names and locations. Although the k-NN is slow at classification, it can still be used to extract meta-information, since that extraction can run as a background process that does not affect the claim handling. Finally, I suggest future work to improve the data set I created, to investigate whether word embeddings fine-tuned to the insurance domain improve the accuracy, and to investigate a more efficient way of classifying with the k-NN.	eng
dc.language.iso	eng
dc.subject	insurance
dc.subject	information extraction
dc.subject	spreadsheet
dc.title	Using Word Embeddings to Determine Concepts of Values In Insurance Claim Spreadsheets	eng
dc.type	Master thesis
dc.date.updated	2021-09-07T22:32:44Z
dc.creator.author	Justnes, Vemund
dc.identifier.urn	URN:NBN:no-90440
dc.type.document	Master thesis (Masteroppgave)
dc.identifier.fulltext	Fulltext https://www.duo.uio.no/bitstream/handle/10852/87784/1/using_word_embeddings_for_concept_determiniation.pdf
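
The abstract describes the MI method as representing each concept by the mean embedding of all its samples. A minimal sketch of that idea follows. It is not taken from the thesis: the three-dimensional vectors are made-up stand-ins for sentence embeddings, the concept labels are hypothetical, and assigning a value to the concept with the highest cosine similarity is an assumption about how the comparison might be done.

```python
import math
from collections import defaultdict


def mean_vector(vectors):
    """Return the component-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fit_concept_means(samples):
    """Map each concept to the mean of its sample embeddings.

    `samples` is an iterable of (embedding, concept) pairs.
    """
    grouped = defaultdict(list)
    for embedding, concept in samples:
        grouped[concept].append(embedding)
    return {concept: mean_vector(vecs) for concept, vecs in grouped.items()}


def classify(embedding, concept_means):
    """Pick the concept whose mean embedding is most similar (assumed rule)."""
    return max(concept_means, key=lambda c: cosine(embedding, concept_means[c]))


# Toy data: hand-written vectors standing in for real sentence embeddings.
train = [
    ([0.9, 0.1, 0.0], "organisation"),
    ([0.8, 0.2, 0.1], "organisation"),
    ([0.1, 0.9, 0.0], "location"),
    ([0.2, 0.8, 0.1], "location"),
    ([0.0, 0.1, 0.9], "amount"),
]

concept_means = fit_concept_means(train)
prediction = classify([0.7, 0.3, 0.0], concept_means)
print(prediction)
```

Because each concept is reduced to a single vector, classifying a value costs one comparison per concept rather than one per training sample, which is consistent with the abstract's observation that the MI evaluates far faster than the k-NN.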


Appears in the following Collection

Institutt for informatikk [4956]
