A New Massive Multilingual Dataset for High-Performance Language Technologies

dc.date.accessioned	2024-06-26T15:35:07Z
dc.date.available	2024-06-26T15:35:07Z
dc.date.created	2024-05-29T16:03:27Z
dc.date.issued	2024
dc.identifier.citation	de Gibert, Ona; Nail, Graeme; Arefev, Nikolay; Bañón, Marta; van der Linde, Jelmer; Ji, Shaoxiong; Zaragoza-Bernabeu, Jaume; Aulamo, Mikko; Ramírez-Sánchez, Gema; Kutuzov, Andrei; Pyysalo, Sampo; Oepen, Stephan; Tiedemann, Jörg. A New Massive Multilingual Dataset for High-Performance Language Technologies. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2024, 1116-1128. European Language Resources Association
dc.identifier.uri	http://hdl.handle.net/10852/111277
dc.description.abstract	We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ≈ 5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.
dc.language	EN
dc.publisher	European Language Resources Association
dc.rights	Attribution-NonCommercial 4.0 International
dc.rights.uri	https://creativecommons.org/licenses/by-nc/4.0/
dc.title	A New Massive Multilingual Dataset for High-Performance Language Technologies
dc.title.alternative	A New Massive Multilingual Dataset for High-Performance Language Technologies
dc.type	Chapter
dc.creator.author	de Gibert, Ona
dc.creator.author	Nail, Graeme
dc.creator.author	Arefev, Nikolay
dc.creator.author	Bañón, Marta
dc.creator.author	van der Linde, Jelmer
dc.creator.author	Ji, Shaoxiong
dc.creator.author	Zaragoza-Bernabeu, Jaume
dc.creator.author	Aulamo, Mikko
dc.creator.author	Ramírez-Sánchez, Gema
dc.creator.author	Kutuzov, Andrei
dc.creator.author	Pyysalo, Sampo
dc.creator.author	Oepen, Stephan
dc.creator.author	Tiedemann, Jörg
cristin.unitcode	185,15,5,48
cristin.unitname	Forskningsgruppen for språkteknologi
cristin.ispublished	true
cristin.fulltext	original
dc.identifier.cristin	2271892
dc.identifier.bibliographiccitation	info:ofi/fmt:kev:mtx:ctx&ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.btitle=Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)&rft.spage=1116&rft.date=2024
dc.identifier.startpage	1116
dc.identifier.endpage	1128
dc.type.document	Book chapter
dc.type.peerreviewed	Peer reviewed
dc.source.isbn	9782493814104
dc.type.version	PublishedVersion
cristin.btitle	Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)