IBM Support

Downloading data from NCBI via the command line

Question & Answer


Question

Downloading data from NCBI via the command line

Answer

Description

The National Center for Biotechnology Information (NCBI) offers a wealth of databases, analysis tools, and reports for use in research by the medical and scientific community.

These resources are freely available to download from the NCBI website. Because most of the datasets are large (on the order of gigabytes or terabytes), the recommended method of transfer is the Aspera Connect browser plugin.

You can use Aspera Connect directly through the NCBI website in your browser by clicking and downloading the datasets of your choice.

Alternatively, you can download data from NCBI through the command line with ascp, Aspera's transfer tool, which comes bundled with your Connect installation.

Usage

The general syntax for downloading data from NCBI is the following:

/path/to/ascp -T -k 1 -i /path/to/private/key anonftp@ftp.ncbi.nlm.nih.gov:/path/to/data /local/location

The components of the command can be broken down as follows:

  • /path/to/ascp You will need to specify the full path to the ascp program. Typical locations are listed in the next section.
  • -k 1 If the transfer stops because of connection loss or other issues, the -k option tells ascp to resume from where it left off rather than restarting the entire transfer. This is important because of the large size of most NCBI data. The 1 tells ascp to compare file attributes before resuming a transfer (levels 2 and 3 add a sparse or full checksum), which is a good choice for NCBI data because checksums on large files may be slow. For more information on the resume transfer option, see this Knowledge Base article.
  • -T This option disables encryption for the transfer, as NCBI's download server doesn't offer encryption.
  • -i /path/to/private/key This option specifies the path to the private key used to authenticate the transfer. Ensure that you specify the FULL path to the key (in other words, ~/path/to/key or similar shortcuts will not work).
  • anonftp is the transfer user configured on NCBI's Aspera server.
  • ftp.ncbi.nlm.nih.gov is the hostname of NCBI's Aspera server.
  • /path/to/data is the path to the data you are downloading. You can find a reference of these paths here.
  • /local/location is the path to the folder on your own machine that you want the NCBI files to be downloaded to.
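The components above can be kept in variables and assembled in a script. The following is a minimal sketch, not an official Aspera utility: the variable names and the print_ascp_cmd helper are illustrative assumptions, and the values are the placeholders from the general syntax. It only prints the command, one argument per line, so that paths containing spaces stay visibly intact.

```shell
#!/bin/sh
# Sketch: assemble the ascp download command from its parts and print it
# without running it. All values are placeholders; substitute your own.

ASCP="/path/to/ascp"                 # full path to the ascp executable
KEY="/path/to/private/key"           # full path to the private key (no ~)
REMOTE_USER="anonftp"                # transfer user on NCBI's Aspera server
REMOTE_HOST="ftp.ncbi.nlm.nih.gov"   # hostname of NCBI's Aspera server
REMOTE_PATH="/path/to/data"          # path of the data on the server
LOCAL_DIR="/local/location"          # destination folder on your machine

# print_ascp_cmd is a hypothetical helper: it prints each argument of the
# command on its own line.
print_ascp_cmd() {
    printf '%s\n' "$ASCP" -T -k 1 -i "$KEY" \
        "${REMOTE_USER}@${REMOTE_HOST}:${REMOTE_PATH}" "$LOCAL_DIR"
}

print_ascp_cmd
```

To run a real transfer, the same quoted argument list would be invoked directly instead of being passed to printf.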

Private key and ascp locations

The private key you will use is asperaweb_id_dsa.openssh, which comes with your Connect installation.

Below are locations where you can generally find the private key and the ascp executable. Where applicable, replace username with the name of the user you're logged in as.
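Because the -i option needs the full path to this key, a script can check the value before starting a transfer. The sketch below is one possible check on Mac or Linux; check_key_path is a hypothetical helper name and is not part of ascp.

```shell
#!/bin/sh
# Sketch: reject key paths that are not absolute or do not point to a file.
# check_key_path is a hypothetical helper, not an ascp feature.
check_key_path() {
    case "$1" in
        /*) ;;  # starts with "/": treat as a full path and continue
        *)  echo "not a full path: $1" >&2
            return 1 ;;
    esac
    if [ ! -f "$1" ]; then
        echo "no such file: $1" >&2
        return 1
    fi
    return 0
}
```

A quoted value such as '~/path/to/key' is rejected by this check because the shell does not expand the tilde inside quotes.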

Mac

Private key

  • Local installation of Connect - /Users/username/Applications/Aspera\ Connect.app/Contents/Resources/asperaweb_id_dsa.openssh
  • System-wide installation of Connect - /Applications/Aspera\ Connect.app/Contents/Resources/asperaweb_id_dsa.openssh

ascp

  • Local installation of Connect - /Users/username/Applications/Aspera\ Connect.app/Contents/Resources/ascp
  • System-wide installation of Connect - /Applications/Aspera\ Connect.app/Contents/Resources/ascp

Linux

Private key

  • /home/username/.aspera/connect/etc/asperaweb_id_dsa.openssh
  • /opt/aspera/etc/asperaweb_id_dsa.openssh

ascp

  • /opt/aspera/bin/ascp

Windows

Private key

  • "C:\Program Files (x86)\Aspera\Aspera Connect\etc\asperaweb_id_dsa.openssh"
  • C:\Users\username\AppData\Local\Programs\Aspera\Aspera Connect\etc\asperaweb_id_dsa.openssh

ascp

  • C:\Program Files\Aspera\Aspera Connect\bin\ascp.exe
  • C:\Users\username\AppData\Local\Programs\Aspera\Aspera Connect\bin\ascp.exe
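On Mac or Linux, the candidate locations listed above can be checked in order with a short script. This is a sketch under stated assumptions: first_existing is a hypothetical helper, and "$HOME" is assumed to stand in for /Users/username or /home/username. Windows paths are not covered.

```shell
#!/bin/sh
# Sketch: print the first existing file among the given candidate paths.
# Returns non-zero if none of them exists.
first_existing() {
    for candidate in "$@"; do
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Candidate locations taken from the Mac and Linux lists in this article.
KEY=$(first_existing \
    "$HOME/Applications/Aspera Connect.app/Contents/Resources/asperaweb_id_dsa.openssh" \
    "/Applications/Aspera Connect.app/Contents/Resources/asperaweb_id_dsa.openssh" \
    "$HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh" \
    "/opt/aspera/etc/asperaweb_id_dsa.openssh") || KEY=""

ASCP=$(first_existing \
    "$HOME/Applications/Aspera Connect.app/Contents/Resources/ascp" \
    "/Applications/Aspera Connect.app/Contents/Resources/ascp" \
    "/opt/aspera/bin/ascp") || ASCP=""

echo "private key: ${KEY:-not found}"
echo "ascp:        ${ASCP:-not found}"
```

If either line reports "not found", Connect may be installed somewhere other than the locations listed in this article.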

Examples

The following examples demonstrate usage of ascp to download real data from NCBI. Commands for Mac, Linux, and Windows are shown with the assumption that we are downloading from a user account on the system named janedoe, and that downloaded data will go to the folder NCBI_data in janedoe's home directory. The path locations of the datasets are shown in NCBI's public download directory.

1. Say you need to download all the data NCBI offers on epigenomics. There is a 223.79 GB folder on the topic containing 5 subfolders of data. To download the entire folder via ascp, you would use the following command:

On a Mac:

$ /Users/janedoe/Applications/Aspera\ Connect.app/Contents/Resources/ascp -T -k 1 -i /Users/janedoe/Applications/Aspera\ Connect.app/Contents/Resources/asperaweb_id_dsa.openssh anonftp@ftp.ncbi.nlm.nih.gov:/epigenomics /Users/janedoe/NCBI_data

On Windows:

> "C:\Users\janedoe\AppData\Local\Programs\Aspera\Aspera Connect\bin\ascp.exe" -T -k 1 -i "C:\Users\janedoe\AppData\Local\Programs\Aspera\Aspera Connect\etc\asperaweb_id_dsa.openssh" anonftp@ftp.ncbi.nlm.nih.gov:/epigenomics "C:\Users\janedoe\NCBI_data"

On Linux:

# /opt/aspera/bin/ascp -T -k 1 -i /home/janedoe/.aspera/connect/etc/asperaweb_id_dsa.openssh anonftp@ftp.ncbi.nlm.nih.gov:/epigenomics /home/janedoe/NCBI_data
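For a folder of this size, an interrupted transfer is a realistic possibility. Because -k 1 lets a rerun of the same command resume rather than start over, the command can be wrapped in a retry loop. The sketch below assumes that ascp exits with a non-zero status when a transfer fails; retry is a hypothetical helper, and the attempt count and delay in the commented example are arbitrary illustrative values.

```shell
#!/bin/sh
# Sketch: rerun a command until it succeeds or the attempt limit is reached.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS command [args...]
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        echo "attempt $attempt failed; retrying in ${delay}s" >&2
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example with the Linux paths from this article (uncomment to run a real
# transfer; each retry resumes because of -k 1):
# retry 5 30 /opt/aspera/bin/ascp -T -k 1 \
#     -i /home/janedoe/.aspera/connect/etc/asperaweb_id_dsa.openssh \
#     anonftp@ftp.ncbi.nlm.nih.gov:/epigenomics /home/janedoe/NCBI_data
```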

2. Perhaps you are conducting a study on tree-dwelling lizards and want to examine the genome data NCBI offers for the Anolis carolinensis species. To download the genome data for this species, you would use the following command:

On a Mac:

$ /Users/janedoe/Applications/Aspera\ Connect.app/Contents/Resources/ascp -T -k 1 -i /Users/janedoe/Applications/Aspera\ Connect.app/Contents/Resources/asperaweb_id_dsa.openssh anonftp@ftp.ncbi.nlm.nih.gov:/genomes/anolis_carolinensis /Users/janedoe/NCBI_data

On Windows:

> "C:\Users\janedoe\AppData\Local\Programs\Aspera\Aspera Connect\bin\ascp.exe" -T -k 1 -i "C:\Users\janedoe\AppData\Local\Programs\Aspera\Aspera Connect\etc\asperaweb_id_dsa.openssh" anonftp@ftp.ncbi.nlm.nih.gov:/genomes/anolis_carolinensis "C:\Users\janedoe\NCBI_data"

On Linux:

# /opt/aspera/bin/ascp -T -k 1 -i /home/janedoe/.aspera/connect/etc/asperaweb_id_dsa.openssh anonftp@ftp.ncbi.nlm.nih.gov:/genomes/anolis_carolinensis /home/janedoe/NCBI_data



3. As part of a research paper you're writing, you need to look at NCBI's RefSeq project data concerning protein and RNA sequencing data in humans. You know there is 1.69 GB of available data on NCBI, and you proceed to download it with the following command:

On a Mac:

$ /Users/janedoe/Applications/Aspera\ Connect.app/Contents/Resources/ascp -T -k 1 -i /Users/janedoe/Applications/Aspera\ Connect.app/Contents/Resources/asperaweb_id_dsa.openssh anonftp@ftp.ncbi.nlm.nih.gov:/refseq/H_sapiens/mRNA_prot /Users/janedoe/NCBI_data

On Windows:

> "C:\Users\janedoe\AppData\Local\Programs\Aspera\Aspera Connect\bin\ascp.exe" -T -k 1 -i "C:\Users\janedoe\AppData\Local\Programs\Aspera\Aspera Connect\etc\asperaweb_id_dsa.openssh" anonftp@ftp.ncbi.nlm.nih.gov:/refseq/H_sapiens/mRNA_prot "C:\Users\janedoe\NCBI_data"

On Linux:

# /opt/aspera/bin/ascp -T -k 1 -i /home/janedoe/.aspera/connect/etc/asperaweb_id_dsa.openssh anonftp@ftp.ncbi.nlm.nih.gov:/refseq/H_sapiens/mRNA_prot /home/janedoe/NCBI_data
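The three Linux commands above differ only in the remote path, so they can be driven from a single loop. The following is a sketch under stated assumptions: the download function and the DRY_RUN switch are hypothetical names introduced for illustration, and the paths are the janedoe examples from this article. With DRY_RUN=1 the script only prints each command.

```shell
#!/bin/sh
# Sketch: queue several NCBI paths for download with one loop.
ASCP="/opt/aspera/bin/ascp"
KEY="/home/janedoe/.aspera/connect/etc/asperaweb_id_dsa.openssh"
DEST="/home/janedoe/NCBI_data"
DRY_RUN=1   # hypothetical switch: 1 prints each command, 0 runs it

# download REMOTE_PATH: print or run the ascp command for one remote path.
download() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$ASCP" -T -k 1 -i "$KEY" "anonftp@ftp.ncbi.nlm.nih.gov:$1" "$DEST"
    else
        "$ASCP" -T -k 1 -i "$KEY" "anonftp@ftp.ncbi.nlm.nih.gov:$1" "$DEST"
    fi
}

for remote_path in /epigenomics /genomes/anolis_carolinensis /refseq/H_sapiens/mRNA_prot; do
    download "$remote_path"
done
```

Setting DRY_RUN=0 runs the transfers one after another; checking the printed commands first is a cheap way to catch a mistyped path before starting a multi-gigabyte download.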

4. Another Windows example, showing an actual command line and the file download status:

Windows_Command-line_NCBI_example.JPG


Document Information

Modified date:
18 February 2020

UID

ibm10746935