dc.contributor.author: Jentoft, Matias
dc.date.accessioned: 2023-08-24T22:02:17Z
dc.date.available: 2023-08-24T22:02:17Z
dc.date.issued: 2023
dc.identifier.citation: Jentoft, Matias. Grammatical Error Correction with byte-level language models. Master thesis, University of Oslo, 2023
dc.identifier.uri: http://hdl.handle.net/10852/103885
dc.description.abstract[eng]: Grammatical Error Correction (GEC) is the task of automatically correcting errors in human language at the sentence level or above. Good performance on this task has many real-world applications for foreign language learners, children, and people with various language impairments. In this thesis we explore how different representations of language affect the performance of GEC systems for English and Norwegian. We compare sequence-to-sequence language models with byte-level (byt5) and subword-level (t5/nort5/mt5) language representation. We show that byte-level representation of model input is just as good for this kind of task as subword-level representation, and that for Norwegian it is actually better. We look at how the different levels of representation affect performance on specific error types that may occur in natural text. Special focus is put on how byte-level representation handles noisy text, i.e. text containing sequences with spelling errors that might not have occurred in the training data of the models. We release the first language models for grammatical error correction in Norwegian. This has been made possible by a large annotated second-language learner corpus: Norsk Andrespråkskorpus (ASK). We convert the corpus into a parallel corpus and use it for fine-tuning pre-trained models of the t5 family. Our best system, which is byte-level based and trained on multilingual data, achieves an F0.5 score of 0.581 on our test set and beats the subword-level, Norwegian-trained nort5.
dc.language.iso: eng
dc.subject: nlp
dc.subject: language representation
dc.subject: gec
dc.subject: grammatical error correction
dc.subject: seq2seq
dc.subject: språkteknologi
dc.subject: jentoft
dc.subject: language models
dc.subject: sequence to sequence
dc.subject: byte-level models
dc.title[eng]: Grammatical Error Correction with byte-level language models
dc.type: Master thesis
dc.date.updated: 2023-08-25T22:04:10Z
dc.creator.author: Jentoft, Matias
dc.type.document: Masteroppgave

