deTEct - transposition event detector
Transposable elements (TEs) are mobile DNA sequences that can replicate and move within a genome independently of the host cell DNA. They are the "jumping genes" that move from one genomic location to another. These events complicate large-scale functional genomic analysis, so it is important to identify TEs and to assess their impact on structural variant identification.
Protocol
Following [1], we use the tool deTEct [2] to identify transposition events.
Ingredients
Input: Structural variants (VCF file) from PBSV (on PBMM2 alignments) or Sniffles (on NGMLR alignments), transposon annotations (from reasonaTE), reference genome (FASTA)
Test
I already have SV calls from lumpy; let's see whether they work with this workflow.
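Since the listed inputs only mention PBSV and Sniffles, it helps to confirm which caller actually wrote a given VCF before passing it to deTEct. A minimal sketch, assuming the caller recorded a `##source=` line in the VCF header (this meta line is optional, so an empty result does not prove anything); `vcf_source` is a hypothetical helper name, not part of deTEct:

```shell
# Hypothetical helper: print the value of the first "##source=" header line
# of a VCF file. Prints nothing if the header has no such line.
vcf_source() {
    sed -n 's/^##source=//p' "$1" | head -n 1
}

# Example (the file name is a placeholder):
# vcf_source my_lumpy_calls.vcf
```

If the printed caller is neither Sniffles nor pbsv, the `-svTool` parser may not understand the records, so treat any result from such a run with caution.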
Installation
conda create --name py38_deTEct python=3.8
conda activate py38_deTEct
conda install -c derkevinriehl transposition_detector_detect
Usage
transposition_deTEct -help
transposition_deTEct [-help] -seqHeadTXT <FILE> -transpGFF3 <FILE> -assmFasta <FILE> -svTool <FILE> -svFile <FILE> -outParsedFile <FILE> -outResultFile <FILE>
Demo
git clone https://github.com/DerKevinRiehl/transposition_detector_deTEct.git
cd PATH_TO/transposition_detector_deTEct
conda activate py38_deTEct
# demo for sniffles_ngmlr alignments
transposition_deTEct -seqHeadTXT demoFiles/sequence_heads.txt -transpGFF3 demoFiles/FinalAnnotations_Transposons.gff3 -assmFasta demoFiles/sequence_CB4856.fasta -svTool sniffles -svFile demoFiles/SX3351_addisababa.sniffles_ngmlr.vcf -outParsedFile demoFiles/sniffles_ngmlr/SX3351_addisababa.SV.vcf.gff3 -outResultFile demoFiles/sniffles_ngmlr/SX3351_addisababa.transpositionEvents.gff3
# demo for pbsv_pbmm2 alignments
transposition_deTEct -seqHeadTXT demoFiles/sequence_heads.txt -transpGFF3 demoFiles/FinalAnnotations_Transposons.gff3 -assmFasta demoFiles/sequence_CB4856.fasta -svTool pbsv -svFile demoFiles/SX3351_addisababa.pbsv_pbmm2.vcf -outParsedFile demoFiles/pbsv_pbmm2/SX3351_addisababa.SV.vcf.gff3 -outResultFile demoFiles/pbsv_pbmm2/SX3351_addisababa.transpositionEvents.gff3
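A quick way to see whether a demo run produced anything is to count the feature lines in the result file. This is a sketch that assumes the `-outResultFile` output is ordinary GFF3, with one feature per non-comment line; `count_gff3_features` is a hypothetical helper, not something shipped with deTEct:

```shell
# Hypothetical helper: count feature lines in a GFF3 file,
# skipping comment/directive lines (starting with "#") and blank lines.
count_gff3_features() {
    awk '!/^#/ && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Example, after the Sniffles demo above has finished:
# count_gff3_features demoFiles/sniffles_ngmlr/SX3351_addisababa.transpositionEvents.gff3
```

A count of 0 would mean the tool ran but reported no transposition events, which is worth distinguishing from a run that failed outright.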
The actual scripts
conda activate py38_deTEct
seqhead=/home/ybao2/GithubWorkRepo/BulkDNA/transposon_annotation_reasonaTE/workspace/testProject/sequence_heads.txt
TE=/home/ybao2/GithubWorkRepo/BulkDNA/transposon_annotation_reasonaTE/workspace/testProject/finalResults/FinalAnnotations_Transposons.gff3
genome=/media/XLStorage/ybao2/RefGenome/dmel-all-chromosome-r6.39.fasta
SV=demoFiles/SX3351_addisababa.sniffles_ngmlr.vcf
outparse=demoFiles/sniffles_ngmlr/SX3351_addisababa.SV.vcf.gff3
outresult=demoFiles/sniffles_ngmlr/SX3351_addisababa.transpositionEvents.gff3
# Quote the variables so paths containing spaces are passed as single arguments
transposition_deTEct -seqHeadTXT "$seqhead" -transpGFF3 "$TE" -assmFasta "$genome" -svTool sniffles -svFile "$SV" -outParsedFile "$outparse" -outResultFile "$outresult"
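This tells us we need two files produced by the tool reasonaTE [3]: the sequence heads file (sequence_heads.txt) and the final transposon annotations (FinalAnnotations_Transposons.gff3).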
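Before running deTEct it is worth checking that the reasonaTE project actually contains these files. A minimal sketch, assuming the layout seen in the paths of the script above (`<workspace>/<project>/sequence_heads.txt` and `<workspace>/<project>/finalResults/FinalAnnotations_Transposons.gff3`); `reasonate_outputs` is a hypothetical helper name:

```shell
# Hypothetical helper: report whether the two reasonaTE outputs that deTEct
# consumes exist and are non-empty. Arguments: <workspace> <project>.
# Runs in a subshell, so "exit" only sets the function's return status.
reasonate_outputs() (
    project_dir="$1/$2"
    status=0
    for f in "$project_dir/sequence_heads.txt" \
             "$project_dir/finalResults/FinalAnnotations_Transposons.gff3"; do
        if [ -s "$f" ]; then
            echo "ok       $f"
        else
            echo "missing  $f"
            status=1
        fi
    done
    exit "$status"
)

# Example:
# reasonate_outputs workspace testProject
```

The function returns non-zero when either file is missing or empty, so it can gate the deTEct call in a script.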
reasonaTE
Ingredients
- Input: Genome assembly (FASTA file).
- Output: Lots of transposon annotations (GFF3 file).
Installation
# Environment 1 - including all annotation tools
conda create -y --name transposon_annotation_tools_env python=2.7
conda activate transposon_annotation_tools_env
conda install -y mamba
conda install -c bioconda repeatmodeler repeatmasker # Recommended not to install via conda
mamba install -y -c bioconda genometools-genometools # for some users: mamba install -y -c bioconda -c conda-forge genometools-genometools
mamba install -y -c derkevinriehl transposon_annotation_reasonate
mamba install -y -c derkevinriehl transposon_annotation_tools_proteinncbicdd1000
conda install -y -c derkevinriehl transposon_annotation_tools_transposonpsicli
mamba install -y -c derkevinriehl transposon_annotation_tools_mitetracker
mamba install -y -c derkevinriehl transposon_annotation_tools_sinescan=1.1.2
mamba install -y -c derkevinriehl transposon_annotation_tools_helitronscanner
mamba install -y -c derkevinriehl transposon_annotation_tools_mitefinderii
mamba install -y -c derkevinriehl transposon_annotation_tools_mustv2
mamba install -y -c derkevinriehl transposon_annotation_tools_sinefinder
mamba install -y -c anaconda biopython
conda deactivate
# Environment 2 - including CD-Hit and Transposon Classifier RFSB
conda create -y --name transposon_annotation_reasonaTE
conda activate transposon_annotation_reasonaTE
conda install -y mamba
mamba install -y -c anaconda biopython
mamba install -y -c bioconda cd-hit blast seqkit
mamba install -y -c derkevinriehl transposon_annotation_reasonate transposon_classifier_rfsb
conda deactivate
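Issue: The process is very slow, and mamba fails with the error message below. The traceback points to an f-string in tqdm, which is a syntax error under Python 2.7 (f-strings need Python 3.6 or later):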
/anaconda3/envs/transposon_annotation_tools_env/lib/python2.7/site-packages/tqdm/__init__.py", line 3, in <module>
from .cli import main # TODO: remove in v5.0.0
File "/home/ybao2/anaconda3/envs/transposon_annotation_tools_env/lib/python2.7/site-packages/tqdm/cli.py", line 202
sys.stderr.write(f"Error:Unknown argument:{argv[0]}\n{help_short}")
^
SyntaxError: invalid syntax
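The failing line is an f-string, and f-strings only parse on Python 3.6 or later, so any package using them breaks inside a Python 2.7 environment. A small sketch of that version check in plain shell; `supports_fstrings` is a hypothetical helper that only compares version numbers and does not inspect the environment itself:

```shell
# Hypothetical helper: succeed if the given Python version ("major.minor" or
# "major.minor.patch") can parse f-strings, i.e. is 3.6 or newer.
# Runs in a subshell so the temporary variables do not leak.
supports_fstrings() (
    major=${1%%.*}
    rest=${1#*.}
    minor=${rest%%.*}
    [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 6 ]; }
)

# Example, run inside the activated environment:
# supports_fstrings "$(python -c 'import platform; print(platform.python_version())')" \
#     || echo "this interpreter cannot run packages that use f-strings"
```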
New attempt
Remove the conda environments created above, then recreate both from the YAML files in the reasonaTE repository:
wget https://raw.githubusercontent.com/DerKevinRiehl/transposon_annotation_reasonaTE/main/environment_yml/transposon_annotation_tools_env.yml
wget https://raw.githubusercontent.com/DerKevinRiehl/transposon_annotation_reasonaTE/main/environment_yml/transposon_annotation_reasonaTE.yml
conda env create -f transposon_annotation_tools_env.yml
conda env create -f transposon_annotation_reasonaTE.yml
It seems to work.
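To confirm that both environments really exist, the output of `conda env list` can be scanned for their names. A minimal sketch, assuming the usual listing format with the environment name in the first column; `env_listed` is a hypothetical helper that reads the listing from standard input:

```shell
# Hypothetical helper: read `conda env list` output on stdin and succeed
# if an environment with the exact name given as $1 appears in column one.
env_listed() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Example:
# conda env list | env_listed transposon_annotation_tools_env && echo "tools env present"
# conda env list | env_listed transposon_annotation_reasonaTE && echo "pipeline env present"
```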
Usage
conda activate transposon_annotation_tools_env
mkdir workspace
wget https://raw.githubusercontent.com/DerKevinRiehl/transposon_annotation_reasonaTE/main/workspace/testProject/sequence.fasta # demo fasta you could use
The actual script
#!/bin/bash
#SBATCH --qos=normal # Quality of Service
#SBATCH --job-name=reasonaTE # Job Name
#SBATCH --time=1-0:00:00 # WallTime
#SBATCH --nodes=1 # Number of Nodes
#SBATCH --ntasks-per-node=1 # Number of tasks (MPI processes)
#SBATCH --cpus-per-task=1 # Number of threads per task (OMP threads)
#SBATCH --output=Log/reasonaTE.out # File in which to store job output (the Log/ directory must already exist)
# Activate Anaconda to make it usable in Slurm
eval "$(conda shell.bash hook)"
workspace=workspace
projectname=testProject
genome=sequence.fasta
mkdir -p $workspace
# 1. Create a project
conda activate transposon_annotation_tools_env
reasonaTE -mode createProject -projectFolder $workspace -projectName $projectname -inputFasta $genome
# 2. Annotate genome with annotation tools
conda activate transposon_annotation_tools_env
reasonaTE -mode annotate -projectFolder $workspace -projectName $projectname -tool all
# 3. Parse annotations
conda activate transposon_annotation_tools_env
reasonaTE -mode parseAnnotations -projectFolder $workspace -projectName $projectname
# 4. Run the pipeline on the genome annotations
conda activate transposon_annotation_reasonaTE
reasonaTE -mode pipeline -projectFolder $workspace -projectName $projectname
# 5. Calculate final statistics
conda activate transposon_annotation_reasonaTE
reasonaTE -mode statistics -projectFolder $workspace -projectName $projectname
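As written, the script keeps going when a step fails, so a broken annotation run can silently feed the later steps. One possible fail-fast variant is to wrap each call in a small logging function. This is only a sketch; `run_step` is a hypothetical helper, not part of reasonaTE:

```shell
# Hypothetical wrapper: run one pipeline step, log start and end with a
# timestamp, and return the step's exit status if it fails.
run_step() {
    echo "[$(date '+%F %T')] START $*"
    if "$@"; then
        echo "[$(date '+%F %T')] DONE  $*"
    else
        rc=$?
        echo "[$(date '+%F %T')] FAIL  $* (exit $rc)" >&2
        return "$rc"
    fi
}

# Example inside the job script:
# run_step reasonaTE -mode annotate -projectFolder "$workspace" -projectName "$projectname" -tool all || exit 1
```

The timestamps also end up in the Slurm output file, which makes it easier to see which annotation step consumed the wall time.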
Check!
Step 2: Annotation results
conda activate transposon_annotation_tools_env
reasonaTE -mode checkAnnotations -projectFolder workspace -projectName testProject
Checking helitronScanner ... completed
Checking ltrHarvest ... completed
Checking ltrPred ... not completed
Checking mitefind ... completed
Checking mitetracker ... completed
Checking must ... completed
Checking repeatmodel ... not completed
Checking repMasker ... not completed
Checking sinefind ... completed
Checking sinescan ... completed
Checking tirvish ... completed
Checking transposonPSI ... completed
Checking NCBICDD1000 ... completed
Notice there is no result for `ltrPred`.
To run `ltrPred`, install it separately following the tutorial [4].
Notice there are no results for `repeatmodel` and `repMasker` either.
Previously, I tried to install these two tools with conda and failed; errors have been reported for their conda environment. Running the docker image also got me nowhere. I need to revisit this and install both from source.
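With this many tools it is easy to overlook a "not completed" entry. A minimal sketch that pulls the unfinished tool names out of the check output, assuming the report prints one `Checking <tool> ... <status>` entry per line; `incomplete_tools` is a hypothetical helper:

```shell
# Hypothetical helper: read the checkAnnotations report on stdin and print
# the name of every tool whose status is "not completed".
incomplete_tools() {
    awk '/^Checking / && /not completed[[:space:]]*$/ { print $2 }'
}

# Example:
# reasonaTE -mode checkAnnotations -projectFolder workspace -projectName testProject | incomplete_tools
```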
Step 3: Parse annotations:
conda activate transposon_annotation_tools_env
reasonaTE -mode checkParsed -projectFolder workspace -projectName testProject
29 NCBICDD1000.gff3
9 helitronScanner.gff3
180 ltrHarvest.gff3
0 mitefind.gff3
41 mitetracker.gff3
25 must.gff3
12 sinefind.gff3
90 sinescan.gff3
295 tirvish.gff3
15 transposonPSI.gff3
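A parsed file with a count of 0 (here `mitefind.gff3`) means the tool ran but contributed no annotations, which is worth noticing before running the pipeline step. A small sketch that lists such files, assuming the report contains one `<count> <file>` pair per line; `empty_annotations` is a hypothetical helper:

```shell
# Hypothetical helper: read "<count> <file>" lines on stdin and print the
# files whose count is zero. Lines without exactly two fields are ignored.
empty_annotations() {
    awk 'NF == 2 && $1 == 0 { print $2 }'
}

# Example:
# reasonaTE -mode checkParsed -projectFolder workspace -projectName testProject | empty_annotations
```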
Supplementary
Annotation tools list
helitronScanner
ltrHarvest
ltrPred
mitefind
mitetracker
must
repeatmodel
repMasker
sinefind
sinescan
tirvish
transposonPSI
NCBICDD1000