dc.contributor.author | Jentoft, Matias | |
dc.date.accessioned | 2023-08-24T22:02:17Z | |
dc.date.available | 2023-08-24T22:02:17Z | |
dc.date.issued | 2023 | |
dc.identifier.citation | Jentoft, Matias. Grammatical Error Correction with byte-level language models. Master thesis, University of Oslo, 2023 | |
dc.identifier.uri | http://hdl.handle.net/10852/103885 | |
dc.description.abstract | Grammatical Error Correction (GEC) is the task of automatically correcting errors in human language at the sentence level or higher. Good performance on this task can have many real-world applications for foreign language learners, children, and people with various language impairments. In this thesis we explore how different representations of language affect the performance of GEC systems for English and Norwegian. We compare sequence-to-sequence language models with byte-level (byt5) and subword-level (t5/nort5/mt5) language representations. We show that byte-level representation of model input is just as good for this kind of task as subword-level representation, and that for Norwegian it is actually better. We look at how the different levels of representation affect performance on specific error types that may occur in natural text. Special focus is put on how byte-level representation handles noisy text, i.e. text with sequences containing spelling errors that might not have occurred in the models' training data. We release the first language models for grammatical error correction in Norwegian. This was made possible by a large annotated second-language learner corpus: Norsk Andrespråkskorpus (ASK). We convert the corpus into a parallel corpus and use it to fine-tune pre-trained models of the t5 family. Our best system, which is byte-level and trained on multilingual data, achieves an F0.5 score of 0.581 on our test set and beats the subword-level, Norwegian-trained nort5. | eng |
dc.language.iso | eng | |
dc.subject | nlp | |
dc.subject | language representation | |
dc.subject | gec | |
dc.subject | grammatical error correction | |
dc.subject | seq2seq | |
dc.subject | language technology | |
dc.subject | jentoft | |
dc.subject | language models | |
dc.subject | sequence to sequence | |
dc.subject | byte-level models | |
dc.title | Grammatical Error Correction with byte-level language models | eng |
dc.type | Master thesis | |
dc.date.updated | 2023-08-25T22:04:10Z | |
dc.creator.author | Jentoft, Matias | |
dc.type.document | Masteroppgave | |