
dc.date.accessioned: 2020-06-29T18:01:55Z
dc.date.available: 2020-06-29T18:01:55Z
dc.date.created: 2019-10-23T21:21:33Z
dc.date.issued: 2019
dc.identifier.citation: Tørresen, Ole K.; Star, Bastiaan; Mier, Pablo; Andrade-Navarro, Miguel A.; Bateman, Alex; Jarnot, Patryk; Gruca, Aleksandra; Grynberg, Marcin; Kajava, Andrey V.; Promponas, Vasilis J.; Anisimova, Maria; Jakobsen, Kjetill Sigurd; Linke, Dirk. Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases. Nucleic Acids Research. 2019, 47(21), 10994-11006
dc.identifier.uri: http://hdl.handle.net/10852/77288
dc.description.abstract: The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with ‘ready-to-use’ deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses. As a case study, we provide examples of the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where mis-annotations and sequencing errors have propagated into protein databases. With this review, we aim to raise the awareness level within the community of database users, and alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others.
dc.language: EN
dc.rights: Attribution 4.0 International
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.title: Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases
dc.type: Journal article
dc.creator.author: Tørresen, Ole K.
dc.creator.author: Star, Bastiaan
dc.creator.author: Mier, Pablo
dc.creator.author: Andrade-Navarro, Miguel A.
dc.creator.author: Bateman, Alex
dc.creator.author: Jarnot, Patryk
dc.creator.author: Gruca, Aleksandra
dc.creator.author: Grynberg, Marcin
dc.creator.author: Kajava, Andrey V.
dc.creator.author: Promponas, Vasilis J.
dc.creator.author: Anisimova, Maria
dc.creator.author: Jakobsen, Kjetill Sigurd
dc.creator.author: Linke, Dirk
cristin.unitcode: 185,15,29,50
cristin.unitname: Centre for Ecological and Evolutionary Synthesis
cristin.ispublished: true
cristin.fulltext: original
cristin.qualitycode: 2
dc.identifier.cristin: 1740016
dc.identifier.bibliographiccitation: info:ofi/fmt:kev:mtx:ctx&ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.jtitle=Nucleic Acids Research&rft.volume=1&rft.spage=&rft.date=2019
dc.identifier.jtitle: Nucleic Acids Research
dc.identifier.volume: 47
dc.identifier.issue: 21
dc.identifier.startpage: 10994
dc.identifier.endpage: 11006
dc.identifier.doi: https://doi.org/10.1093/nar/gkz841
dc.identifier.urn: URN:NBN:no-80434
dc.type.document: Tidsskriftartikkel (Journal article)
dc.type.peerreviewed: Peer reviewed
dc.source.issn: 0305-1048
dc.identifier.fulltext: Fulltext https://www.duo.uio.no/bitstream/handle/10852/77288/2/T%25C3%25B8rresen-NAR-2019.pdf
dc.type.version: PublishedVersion
dc.relation.project: NFR/251076
dc.relation.project: COST/BM1405
dc.relation.project: EU/POWR.03.02.00-00-I029



This item's license is: Attribution 4.0 International