
dc.date.accessioned: 2013-03-12T08:15:23Z
dc.date.available: 2013-03-12T08:15:23Z
dc.date.issued: 2012 [en_US]
dc.date.submitted: 2012-02-13 [en_US]
dc.identifier.citation: Bauge, Ola Sverre. Dehyphenation. Master's thesis, University of Oslo, 2012 [en_US]
dc.identifier.uri: http://hdl.handle.net/10852/9043
dc.description.abstract: A growing volume of text is being digitized from paper or converted to plain text from paper-centric file formats. Because such documents were typically typeset for a two-dimensional printing surface, words are commonly hyphenated across line breaks. These hyphenations introduce noise into a corpus by either splitting words or running them together, resulting in omissions when the text is indexed. Longer content words are disproportionately affected, and in search applications there is also potential for disrupting longer exact-phrase queries. The task of dehyphenation involves detecting and removing only those hyphens that were inserted for typographical reasons at the time of typesetting, producing a text that should lie closer to the original. In this thesis, several empirical methods for dehyphenation are described, prototyped, and evaluated on a heterogeneous sample of English and Norwegian academic texts from the Norwegian Open Research Archive (NORA). Most of the methods investigated are intended to be applicable across many alphabetic languages without requiring close supervision or previously compiled dictionaries. Recommendations and suggestions for future work are given. [eng]
dc.language.iso: nob [en_US]
dc.title: Dehyphenation : Some empirical methods [en_US]
dc.type: Master thesis [en_US]
dc.date.updated: 2012-05-18 [en_US]
dc.creator.author: Bauge, Ola Sverre [en_US]
dc.subject.nsi: VDP::420 [en_US]
dc.identifier.bibliographiccitation: info:ofi/fmt:kev:mtx:ctx&ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&rft.au=Bauge, Ola sverre&rft.title=Dehyphenation&rft.inst=University of Oslo&rft.date=2012&rft.degree=Masteroppgave [en_US]
dc.identifier.urn: URN:NBN:no-30782 [en_US]
dc.type.document: Masteroppgave [en_US]
dc.identifier.duo: 151215 [en_US]
dc.contributor.supervisor: Jan Tore Lønning [en_US]
dc.identifier.bibsys: 121577325 [en_US]
dc.identifier.fulltext: Fulltext https://www.duo.uio.no/bitstream/handle/10852/9043/1/Bauge2012-dehyphenation.pdf

