Researchdata.se

Supporting data tracks for: "Breaking insect genome records: sequencing of Stylops ater (Strepsiptera) reveals a minute, compact genome with a reduced set of genes".

https://doi.org/10.17044/SCILIFELAB.30604043
This data set contains supporting data tracks for the manuscript "Breaking insect genome records: sequencing of Stylops ater (Strepsiptera) reveals a minute, compact genome with a reduced set of genes". The assembly and gene annotation are available on ENA under the umbrella project PRJEB71963. Here we publish the following additional resources:

- repeatmasker.gff: repeat track. A repeat library was modelled with the RepeatModeler2 [1] v2.0.2a package. Because repeats can be part of actual protein-coding genes, the candidate repeats modelled by RepeatModeler were vetted against our protein set (minus transposons) to exclude any nucleotide motif stemming from low-complexity coding sequences. Repeat sequences present in the genome were then identified from this library using RepeatMasker (https://www.repeatmasker.org) v4.1.5 [2].
- repeatrunner.gff: repeat track generated with RepeatRunner [3]. RepeatRunner integrates RepeatMasker with BLASTX, which allows it to analyse highly divergent repeats and divergent portions of repeats, and to identify divergent protein-coding portions of retro-elements and retroviruses that RepeatMasker does not detect.
- trna.gff: tRNA track. tRNAs were predicted with tRNAscan-SE version 1.4 [4].
- rfam.gff: ncRNA track. The main source of information is the RNA family database Rfam (version 14.9) [5]. Rfam provides curated covariance models (CMs), which can be used together with the Infernal [6] package to predict ncRNAs in genomic sequences. By default, we restrict the set of CM profiles to broadly conserved eukaryotic ncRNA families. Note: Rfam-derived "annotations" are better regarded as "predictions". With the exception of some very well conserved ncRNA families, many of the resulting Rfam predictions should be treated with care.
- ST_1.gff3: transcriptome assembly of Illumina RNA-seq library ST_1 (ENA: SAMEA12922144, ERX11689259), assembled using our in-house pipeline transcript_assembly (https://github.com/NBISweden/pipelines-nextflow/tree/master/subworkflows/transcript_assembly) [7]. To minimise gene fusions in this parasite genome, the maximum intron length was reduced from 500000 to 20000 (hisat2 --max-intronlen 20000). Otherwise, default parameters were used.
- ST_2.gff3: transcriptome assembly of Illumina RNA-seq library ST_2 (ENA: SAMEA12922144, ERX11689261), assembled using our in-house pipeline transcript_assembly (https://github.com/NBISweden/pipelines-nextflow/tree/master/subworkflows/transcript_assembly) [7]. To minimise gene fusions in this parasite genome, the maximum intron length was reduced from 500000 to 20000 (hisat2 --max-intronlen 20000). Otherwise, default parameters were used.
- rc2_evidence_abinitio.gff: gene models created in the second MAKER run (rc2), combining evidence (from run 1, rc1) with ab initio predictors. AUGUSTUS was the ab initio predictor used for this track.
- rc2_evidence_genemark.gff: gene models created in the second MAKER run (rc2), combining evidence (from run 1, rc1) with ab initio predictors. GeneMark was the ab initio predictor used for this track.

References:

[1] Flynn JM, Hubley R, Goubert C, Rosen J, Clark AG, Feschotte C, Smit AF. (2020) RepeatModeler2 for automated genomic discovery of transposable element families. Proceedings of the National Academy of Sciences 117(17): 9451-9457. https://doi.org/10.1073/pnas.1921046117
[2] Smit AFA, Hubley R, Green P. (2013-2015) RepeatMasker Open-4.0.
[3] Yandell Lab: https://www.yandell-lab.org/software/repeatrunner.html
[4] Lowe TM, Eddy SR. (1997) tRNAscan-SE: A program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Research 25(5): 955–964. https://doi.org/10.1093/nar/25.5.955
[5] Kalvari I, Nawrocki EP, Ontiveros-Palacios N, Argasinska J, Lamkiewicz K, Marz M, Griffiths-Jones S, Toffano-Nioche C, Gautheret D, Weinberg Z, Rivas E, Eddy SR, Finn RD, Bateman A, Petrov AI. (2021) Rfam 14: expanded coverage of metagenomic, viral and microRNA families. Nucleic Acids Research 49(D1): D192–D200. https://doi.org/10.1093/nar/gkaa1047
[6] Nawrocki EP, Eddy SR. (2013) Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 29: 2933-2935. http://eddylab.org/publications.html#Nawrocki13c
[7] GitHub: https://github.com/NBISweden/pipelines-nextflow/tree/master/subworkflows/transcript_assembly
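
The tracks described above are distributed as GFF files, so a quick first look at any of them is to tally the feature types they contain. The following is a minimal sketch, assuming the files follow the standard nine-column, tab-separated GFF layout. The function name summarize_gff and the sample lines are hypothetical illustrations and are not taken from the published tracks.

```python
import io
from collections import Counter


def summarize_gff(handle):
    """Return (feature counts, summed feature lengths) keyed by GFF column 3."""
    counts = Counter()
    lengths = Counter()
    for line in handle:
        line = line.rstrip("\n")
        # Skip blank lines and comment/directive lines such as "##gff-version 3".
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            # Not a standard nine-column feature line; ignore it.
            continue
        feature_type = fields[2]
        start, end = int(fields[3]), int(fields[4])
        counts[feature_type] += 1
        # GFF coordinates are 1-based and inclusive, hence the +1.
        lengths[feature_type] += end - start + 1
    return counts, lengths


# Hypothetical sample lines for illustration only (not real data).
SAMPLE = (
    "##gff-version 3\n"
    "contig_1\ttRNAscan-SE\ttRNA\t10\t81\t.\t+\t.\tID=trna1\n"
    "contig_1\tRepeatMasker\tmatch\t101\t200\t.\t+\t.\tID=rep1\n"
    "contig_2\tRepeatMasker\tmatch\t5\t54\t.\t-\t.\tID=rep2\n"
)

counts, lengths = summarize_gff(io.StringIO(SAMPLE))
for feature_type in sorted(counts):
    print(feature_type, counts[feature_type], lengths[feature_type])
```

To run this on a downloaded track, pass an open file handle instead of the in-memory sample, for example the handle returned by open("repeatmasker.gff"). Note that the summed lengths count overlapping features more than once, so they are an upper bound on the sequence actually covered.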


scilifelab
Swedish Museum of Natural History