EagleImp-RAP is implemented as a DNAnexus applet to run the EagleImp software on the UK Biobank Research Analysis Platform (UKB-RAP) in order to enable haplotype phasing and genotype imputation using UK Biobank haplotype reference panels.
This applet allows phasing and imputation with reference panels from UK Biobank to be executed inside the UKB-RAP environment, in compliance with UKB data-access policies.
EagleImp-RAP can only be executed under the following conditions:
-
You are an approved researcher on an active UK Biobank project.
-
You have access to the UKB Research Analysis Platform (UKB-RAP) via DNAnexus and are registered as an analyst for that project in DNAnexus.
-
You have permission from UK Biobank to use the UKB imputation reference panels.
-
You have access to the the UKB reference panel files (.qref) within your own UKB-RAP project space. (The exact mechanism - copy, link, or UKB-managed shared location - depends on UKB's policy.)
Users without UKB-RAP access or without permission to use the UKB reference panels cannot run EagleImp-RAP, because the required reference data are not accessible.
At the current stage EagleImp-RAP is not yet available as ready-to-use DNAnexus platform app. You need to use the command line interface of the DNAnexus and build your local EagleImp-RAP applet.
- dxtools: See Command Line Quickstart from the DNAnexus documentation how to set up the command line interface for the UKB-RAP on your system.
- Navigate to the repository folder on your local machine:
cd <path_to_repo>
- Navigate to the root folder (or an alternative folder) in the DNAnexus where you want to install the applet.
dx cd /
- Call
dx buildto build the applet in your project. (-fis only required if you want to overwrite an existing installation of EagleImp-RAP, e.g. for an update).
dx build -f ikmb-eagleimp
Now, the applet is ready to use on the DNAnexus.
Note: The following examples expect the applet to be installed in the root directory of your active project. If not, please use the correct path to your applet when calling dx run.
Note: The same input file conventions as for EagleImp apply, i.e. the file must not contain data for more than one chromosome and the file name must start with the chromosome number, followed by a dot. A preceding 'chr' is allowed.
To start phasing/imputation on a single file, simply copy that file to your project and run the applet with this file as input using dx run.
For a minimal running example, you could download an example data set from any of our imputation benchmark inputs at https://hybridcomputing.ikmb.uni-kiel.de/imputation-benchmark.
From the archives found there you can extract any input file and use this instead of <target.vcf.gz> in the following example.
The example uses a folder output as output directory relative to the location of the target file.
The output is downloaded afterwards using dx download.
dx cd <some_path_in_your_project>
dx upload <target.vcf.gz>
dx run /ikmb-eagleimp -itarget=<target.vcf.gz> -ibuild=hg38 -ireference=UKBcomplete --folder=output --instance-type=mem3_ssd1_v2_x16 --priority=high
dx download -r output
You can adjust the options to your need or add additional options such as -iallowStrandFlip=true to your needs. For more information on the available options see User options below.
You can also choose a different VM instance type, if you like.
IMPORTANT: Please note, that the minimum VM size for UKB imputation is 120 GB of RAM and 200 GB of hard disk space. We also recommend to always set the --priority=high option as UKB imputation takes time and thus, it is very likely that the job gets interrupted when started with lower priorities.
In order to process a batch of files using a VM instance for each file, you can generate a batch file for your inputs and execute it.
The following command generates a batch file called batch_example.0000.tsv.
Inputs are expected to be in the current local folder and match with chr<number>.*.vcf.gz.
For example, you could use the folder where you extracted one of the benchmark data sets downloaded in the example above.
dx generate_batch_inputs -i target='(chr.*).*.vcf.gz' -o batch_example
Upload your input files to the DNAnexus and run the batch file:
dx cd <some_path_in_your_project>
dx upload chr*.vcf.gz
dx run /ikmb-eagleimp --batch-tsv=batch_example.0000.tsv -ibuild=hg38 -ireference=UKBcomplete --folder=output --instance-type=mem3_ssd1_v2_x16 --brief --yes --priority=high
dx download -r output
Note: The options --brief and --yes reduce the output of the run command and disable asking for confirmation.
EagleIMP-RAP provides the following user options. The first three are mandatory:
target: The target file to be phased/imputed. Can be either in.bcfor.vcf.gzformat. The same input file conventions as for EagleImp apply, i.e. the file must not contain data for more than one chromosome and the file name must start with the chromosome number, followed by a dot. A preceding 'chr' is allowed.build: The genome build of your input file. Eitherhg19orhg38.reference: One of the supported reference panels:UKBcomplete,UKBmafor1000G.
The following options are optional and are exactly as in EagleImp:
skipPhasing: Skip phasing step (input is already phased). default =falseskipImputation: Skip imputation step (do phasing only). default =falseimputeInfo: Choose which probability information should accompany the hard called genotypes in the output:a: allele dosages,g: genotype dosages,p: genotype probabilities. Can be arbitrarily combined. default =aimputeR2filter: Output contains only variants with an estimated R2 greater or equal to this filter value. default =0imputeMAFfilter: Output contains only variants with a minor allele frequency greater or equal to this filter value. default =0outputPhasedFile: Generates an additional output file that contains only the phased input. default =falseoutputUnphased: Sites excluded from phasing are included in the phased output (if enabled). default =falseallowRefAltSwap: Allow target and reference variants to match if their reference and alternative alleles are swapped. default =trueallowStrandFlip: Allow target and reference variants to match if their reference and alternative alleles are strand flipped. default =falseK: K-best reference haplotypes are chosen for the phasing step prior to imputation. Either10000or32768. default =10000maxChunkMemory: Max memory (in GiB) used for chunks.0= auto (85% of free RAM), any other value overrides the automatic setting. Be careful, the VM may crash if the setting is too high. default =0
The output files are:
- A file containing the imputation output (not if
skipImputationwas set). - A file containing only the output of the phasing step (only if
outputPhasedFilewas set). - A file containing the (sample-wise) phasing confidences (not if
skipPhasingwas set). - A file containg additional information for each variant.
- The log file.
Note: The output genotype files are in the same format as the target input, i.e. either .bcf or .vcf.gz.
EagleImp-RAP was developed following the DNAnexus developer guidelines at https://documentation.dnanexus.com/developer/apps/intro-to-building-apps.
Using the dx-app-wizard from the dxtools package, we generated a standardized UKB-RAP applet structure containing the execution bash script, a .json configuration file and a template directory structure for required binaries and libraries. EagleImp, htslib and bcftools were compiled for the UKB-RAP target environment. We also added a pre-compiled version of Intel's libtbb (as required by EagleImp).
The execution script handles argument parsing, retrieval and indexing of input files, and downloading the selected UKB reference panel from a predefined location within UKB-RAP. After running EagleImp on the acquired virtual machine (VM), the script collects and returns all output files to the user’s project space.
Chromosome X input files require special handling, as male samples are haploid in nonPAR regions and diploid in PAR1/2. The applet therefore splits chromosome X into PAR1, nonPAR and PAR2 segments, processes each region using the appropriate reference panel, and merges the results into a unified output file.