Challenges of releasing audio material for spoken data: The case of the London–Lund Corpus 2

dc.date.accessioned	2022-01-28T18:48:10Z
dc.date.available	2022-01-28T18:48:10Z
dc.date.created	2021-10-06T14:37:28Z
dc.date.issued	2021
dc.identifier.citation	Põldvere, Nele; Frid, Johan; Johansson, Victoria; Paradis, Carita. Challenges of releasing audio material for spoken data: The case of the London–Lund Corpus 2. Research in Corpus Linguistics. 2021, 9(1), 35-62
dc.identifier.uri	http://hdl.handle.net/10852/90265
dc.description.abstract	This article aims to describe key challenges of preparing and releasing audio material for spoken data and to propose solutions to these challenges. We draw on our experience of compiling the new London–Lund Corpus 2 (LLC-2), where transcripts are released together with the audio files. However, making the audio material publicly available required careful consideration of how to, most effectively, 1) align the transcripts with the audio and 2) anonymise personal information in the recordings. First, audio-to-text alignment was solved through the insertion of timestamps in front of speaker turns in the transcription stage, which, as we show in the article, may later be used as a valuable complement to more robust automatic segmentation. Second, anonymisation was done by means of a Praat script, which replaced all personal information with a sound that made the lexical information incomprehensible but retained the prosodic characteristics. The public release of the LLC-2 audio material is a valuable feature of the corpus that allows users to extend the corpus data relative to their own research interests and, thus, broaden the scope of corpus linguistics. To illustrate this, we present three studies that have successfully used the LLC-2 audio material.
dc.language	EN
dc.publisher	Asociación Española de Lingüística de Corpus (AELINCO)
dc.rights	Attribution 4.0 International
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.title	Challenges of releasing audio material for spoken data: The case of the London–Lund Corpus 2
dc.type	Journal article
dc.creator.author	Põldvere, Nele
dc.creator.author	Frid, Johan
dc.creator.author	Johansson, Victoria
dc.creator.author	Paradis, Carita
cristin.unitcode	185,14,34,70
cristin.unitname	Russland, Sentral-Europa og Balkan
cristin.ispublished	true
cristin.fulltext	original
cristin.qualitycode	1
dc.identifier.cristin	1943839
dc.identifier.bibliographiccitation	info:ofi/fmt:kev:mtx:ctx&ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.jtitle=Research in Corpus Linguistics&rft.volume=9&rft.spage=35&rft.date=2021
dc.identifier.jtitle	Research in Corpus Linguistics
dc.identifier.volume	9
dc.identifier.issue	1
dc.identifier.startpage	35
dc.identifier.endpage	62
dc.identifier.doi	https://doi.org/10.32714/ricl.09.01.04
dc.identifier.urn	URN:NBN:no-92860
dc.type.document	Tidsskriftartikkel
dc.type.peerreviewed	Peer reviewed
dc.source.issn	2243-4712
dc.identifier.fulltext	Fulltext https://www.duo.uio.no/bitstream/handle/10852/90265/1/Challenges%2Bof%2Breleasing%2Baudio%2Bmaterial%2Bfor%2Bspoken%2Bdata.pdf
dc.type.version	PublishedVersion