dc.date.accessioned: 2024-06-26T15:35:07Z
dc.date.available: 2024-06-26T15:35:07Z
dc.date.created: 2024-05-29T16:03:27Z
dc.date.issued: 2024
dc.identifier.citation: de Gibert, Ona; Nail, Graeme; Arefev, Nikolay; Bañón, Marta; van der Linde, Jelmer; Ji, Shaoxiong; Zaragoza-Bernabeu, Jaume; Aulamo, Mikko; Ramírez-Sánchez, Gema; Kutuzov, Andrei; Pyysalo, Sampo; Oepen, Stephan; Tiedemann, Jörg. A New Massive Multilingual Dataset for High-Performance Language Technologies. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2024, 1116-1128. European Language Resources Association.
dc.identifier.uri: http://hdl.handle.net/10852/111277
dc.description.abstract: We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ≈ 5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.
dc.language: EN
dc.publisher: European Language Resources Association
dc.rights: Attribution-NonCommercial 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by-nc/4.0/
dc.title: A New Massive Multilingual Dataset for High-Performance Language Technologies
dc.title.alternative (English): A New Massive Multilingual Dataset for High-Performance Language Technologies
dc.type: Chapter
dc.creator.author: de Gibert, Ona
dc.creator.author: Nail, Graeme
dc.creator.author: Arefev, Nikolay
dc.creator.author: Bañón, Marta
dc.creator.author: van der Linde, Jelmer
dc.creator.author: Ji, Shaoxiong
dc.creator.author: Zaragoza-Bernabeu, Jaume
dc.creator.author: Aulamo, Mikko
dc.creator.author: Ramírez-Sánchez, Gema
dc.creator.author: Kutuzov, Andrei
dc.creator.author: Pyysalo, Sampo
dc.creator.author: Oepen, Stephan
dc.creator.author: Tiedemann, Jörg
cristin.unitcode: 185,15,5,48
cristin.unitname: Forskningsgruppen for språkteknologi (Research group for language technology)
cristin.ispublished: true
cristin.fulltext: original
dc.identifier.cristin: 2271892
dc.identifier.bibliographiccitation: info:ofi/fmt:kev:mtx:ctx&ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.btitle=Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)&rft.spage=1116&rft.date=2024
dc.identifier.startpage: 1116
dc.identifier.endpage: 1128
dc.type.document: Book chapter
dc.type.peerreviewed: Peer reviewed
dc.source.isbn: 9782493814104
dc.type.version: PublishedVersion
cristin.btitle: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)


This item's license is: Attribution-NonCommercial 4.0 International