Technical Setup

Most of the contents of this part of the document are gratefully adapted from the newcomer's guide developed in the Zaugg and Huber Labs at EMBL. We thank the authors of those documents for their effort and help.

Getting Set Up

Lab disk space

Because datasets keep growing, groups at EMBL often acquire their own data storage.
Common options include acquiring storage capacity from EMBL IT or using unit-specific storage solutions such as /g/scb and /g/scb2, which are accessible from different servers at EMBL and from the EMBL cluster.
Besides shared solutions, there are also /g/ shares named after and dedicated to each group.

Before anything, check with your group or local experts what storage options are available and what share of the resources you are entitled to.

EMBL IT Services

For an overview of services provided by EMBL IT and general information about, for example, EMBL network access, user accounts, printing, data sharing, or file systems, see the IT Services intranet page.

EMBL HPC Cluster

The EMBL HPC Cluster is a shared resource that provides access to over 10k cores, 70TB of memory, 100+ GPUs, a fast all-flash scratch filesystem and a 100Gb Ethernet network for scientific computing at EMBL. Access to the cluster is controlled by the Slurm job scheduler, which implements a fair-share scheduling model to ensure balanced assignment of the cluster resources. The fast /scratch system, powered by BeeGFS, provides over 1PB of storage and is capable of withstanding demanding I/O loads. Additional details about available hardware and configuration can be found in the hardware section of the cluster wiki.

For an overview of the cluster system and help, see the cluster wiki pages and the relevant IT Services intranet pages. In particular, new users should read and understand the information in the wiki pages. They provide all essential information to make effective use of the system.

Training materials for using the cluster can be found on the Bio-IT Portal. Users of the system will be automatically subscribed to the cluster user mailing list and we also recommend that users join the cluster channel on the EMBL chat system.

Use the EMBL cluster system with care! Read the training material and wiki pages, and ask us for guidance or help if you have never worked with a cluster system before!

Here are a few guidelines and best practices:

  1. Write shell scripts for job submission.
  2. Try to estimate or measure how much memory and CPU time your job will need, particularly before submitting large quantities of jobs. This is a vital and necessary step. Poor use of the resources granted to your job(s) will incur a queue priority penalty for future submissions.
  3. If your job(s) read or write large amounts of data or many files (I/O operations), it is mandatory to use the /scratch directory as temporary input and output location. Create a directory in /scratch with your name, copy your data files to that location and modify your scripts to read from and write to this location. Submitting jobs that read or write directly to group shares (usually prefixed with /g/) is strongly discouraged as those shares have limited connectivity and are easily overloaded, affecting ALL cluster users and some systems outside the cluster. More information about /scratch can be found in the cluster wiki scratch and best practices sections.
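The guidelines above can be sketched as a single job script. This is a minimal example, not an official template: the job name, resource values and the /g/yourgroup/yourfolder paths are placeholders, and the helper functions stage_in and stage_out are hypothetical names introduced here for illustration.

```shell
#!/bin/bash
#SBATCH --job-name=example-job       # placeholder job name
#SBATCH --time=01:00:00              # requested wall time (estimate first, then refine)
#SBATCH --mem=4G                     # requested memory
#SBATCH --cpus-per-task=2            # requested CPU cores
#SBATCH --output=example-job_%j.log  # %j expands to the Slurm job ID

# Copy input data (for example from a group share) into a working directory.
stage_in() {
    local source_dir="$1" work_dir="$2"
    mkdir -p "$work_dir"
    cp -r "$source_dir"/. "$work_dir"/
}

# Copy results back to their permanent location once the job has finished.
stage_out() {
    local results_dir="$1" dest_dir="$2"
    mkdir -p "$dest_dir"
    cp -r "$results_dir"/. "$dest_dir"/
}

# Run the workflow only when started by Slurm, which sets SLURM_JOB_ID.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    set -euo pipefail
    # Personal working directory under /scratch (path layout is an assumption).
    work_dir="/scratch/$USER/example-job_$SLURM_JOB_ID"
    stage_in /g/yourgroup/yourfolder/input "$work_dir/input"
    mkdir -p "$work_dir/output"
    # ... run your analysis here, reading from input/ and writing to output/ ...
    stage_out "$work_dir/output" /g/yourgroup/yourfolder/results
fi
```

Saved as, say, job.sh, the script would be submitted from the login node with sbatch job.sh and monitored with squeue -u $USER. After a test run, sacct -j <jobid> --format=JobID,Elapsed,MaxRSS reports the elapsed time and peak memory, which helps to refine the --time and --mem requests before submitting many jobs.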

Connecting via SSH

Connect to the login node of the EMBL cluster with:

ssh login.cluster.embl.de

On Windows you may need to install PuTTY or any other SSH-capable client.

After connecting, you will have access to both your home directory at EMBL and your group share(s).
Use the login node to submit your jobs to the cluster system.
Note that you are not allowed to do any heavy computation on the login machine.
As with the shares above, the login node is shared by all cluster users, so be fair to your colleagues and avoid hogging resources there.

If you require the use of desktop software that needs to submit jobs to the cluster, there is a dedicated virtual desktop capable login node at login-gui02.cluster.embl.de.
The cluster wiki page about graphical access contains all the relevant information to connect to it.

Additionally, a few other systems at EMBL are able to submit jobs to the cluster.
These include servers managed by GBCS, which are listed on the GBCS hardware wiki page.

You may also find that your group has independent resources.
Ask your local computing expert for more specific information.

Connecting from the outside

In addition to VPN access, which brings your computer into the internal network, EMBL provides SSH access from the outside without VPN, using:

ssh yourEMBLUsername@ssh.embl.de

When successful, you will be on a gateway server from where you can then ssh to other machines such as login.cluster.embl.de.
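The two hops can also be combined into one command with the ProxyJump option of OpenSSH (the -J flag, available since OpenSSH 7.3). The sketch below assumes that the gateway permits this kind of forwarding, which is not confirmed here; the function name embl_cluster_ssh is hypothetical.

```shell
# Hypothetical helper: open a shell on the cluster login node through the gateway.
# Assumes the gateway allows connections to be forwarded (not verified here).
embl_cluster_ssh() {
    local user="$1"
    # -J tells ssh to connect to the first host and jump from there to the second.
    ssh -J "${user}@ssh.embl.de" "${user}@login.cluster.embl.de"
}
```

Calling embl_cluster_ssh with your EMBL username as the only argument would prompt for authentication on each hop and then land you on login.cluster.embl.de directly.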

If your connection is unstable and causes disconnects, or SSH feels sluggish to interact with, ssh.embl.de also supports an alternative protocol called mosh.
You can download the client from the mosh website and instead of ssh type:

mosh yourEMBLUsername@ssh.embl.de

Mosh uses a different technology that is more tolerant to interruptions.
You can even suspend your computer or move between networks and the connection will be restored.
However, mosh is a younger protocol and has some limitations compared to SSH.

Another alternative is to combine ssh or mosh with tmux or screen, which can be executed on the gateway server.
If the connection gets interrupted, you can always resume where you left off.
Many tutorials for these tools exist online. We selected the following for tmux and screen.
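As a minimal sketch of the resume workflow with tmux, the helper below attaches to a named session if it already exists and creates it otherwise. The function name resume_session and the session name used in the usage note are illustrative choices.

```shell
# Hypothetical helper: attach to a named tmux session, creating it if needed.
# With -A, new-session behaves like attach-session when the session already exists.
resume_session() {
    local name="${1:-work}"
    tmux new-session -A -s "$name"
}

# Rough screen equivalents, for reference:
#   screen -S work    start a session named "work"
#   screen -r work    reattach to it after a disconnect
```

Running resume_session analysis right after logging in to the gateway would bring back the same terminal every time. Detach deliberately with Ctrl-b d in tmux (Ctrl-a d in screen); an interrupted connection detaches the session automatically.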

Data Transfers

While it is tempting to use ssh.embl.de with scp or rsync for file transfer, heavy traffic on this server will cause slowdowns and latency problems for everyone using it.

To avoid these issues, you should use the alternative server datatransfer.embl.de, which was set up specifically for this purpose.
See also the Aspera section below for a fast protocol for large volume data transfers.
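A transfer through the dedicated server could look like the sketch below. The function name push_to_embl and the paths in the usage note are placeholders, and the example assumes that datatransfer.embl.de accepts rsync over SSH with your EMBL account.

```shell
# Hypothetical helper: copy a local directory to an EMBL location via the transfer server.
# -a preserves permissions and timestamps; -P shows progress and keeps partially
# transferred files so that an interrupted transfer can be resumed.
push_to_embl() {
    local user="$1" local_dir="$2" remote_dir="$3"
    rsync -aP "$local_dir" "${user}@datatransfer.embl.de:${remote_dir}"
}
```

For example, push_to_embl yourEMBLUsername ./results /g/yourgroup/yourfolder/ would copy the results directory into the group folder. Note that rsync treats a trailing slash on the source as "copy the contents", so ./results/ and ./results behave differently.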

SSH login without password prompt

The convenience of ssh can sometimes be hindered by the need to manually input a password.
Several options exist to overcome this obstacle; however, some may be disabled by the server administrator due to security implications.
This is the case for key-based authentication on ssh.embl.de.

The alternatives and a few of their pros and cons are listed below:

  • Key based authentication using an ssh-agent - With this approach the client proves that it holds a private key whose public counterpart is registered on the server. If the check succeeds, the client is granted access without a password, and the ssh-agent keeps the unlocked key in memory so that its passphrase only has to be typed once per session. Note: While it is possible to generate keys without a passphrase, this is generally discouraged due to security risks, particularly on a laptop.
  • Multiplexing connections using ControlMaster and ControlPath - Available in newer versions of OpenSSH, this approach allows reconnecting to the same machine without the need to type a password as long as a previous connection to the same machine is active.
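The multiplexing option can be sketched as follows. The helper appends a stanza to an SSH client configuration file; the function name add_multiplexing_stanza, the socket file name and the 10 minute persistence are illustrative choices, while ControlMaster, ControlPath and ControlPersist are standard OpenSSH client options.

```shell
# Hypothetical helper: append a connection-sharing stanza for the gateway
# to the given SSH client configuration file (normally ~/.ssh/config).
add_multiplexing_stanza() {
    local config_file="$1"
    cat >> "$config_file" <<'EOF'

Host ssh.embl.de
    # Reuse an existing connection if there is one, otherwise become the master.
    ControlMaster auto
    # Socket file for the shared connection (%r = remote user, %h = host, %p = port).
    ControlPath ~/.ssh/control-%r@%h:%p
    # Keep the master connection open for 10 minutes after the last session closes.
    ControlPersist 10m
EOF
}
```

After running add_multiplexing_stanza ~/.ssh/config once, the first ssh to the gateway asks for the password as usual, and further ssh, scp or rsync invocations to the same host reuse that connection without prompting while it remains open.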

Available software in the cluster

EMBL IT Services provide a number of software packages to EMBL users.
Site-wide or EMBL-wide license agreements with certain software providers are also available.
A list is available in the IT software services intranet.
For additional information on your software of interest, contact IT Support.

For the cluster system, a large number of bioinformatics software packages, optimized for the cluster hardware, are available through the modules environment system.
Type module avail to see a list of all available modules, and module load <module-name> to load a specific tool/toolchain for use.
You can also check whether a specific piece of software is available by using the command module spider <software-or-command>.

When using software in the cluster, always load a specific version of a module to increase reproducibility.
For example, if you want to use the aligner bowtie2, instead of module load Bowtie2 use module load Bowtie2/2.3.4.1-foss-2018b to load version 2.3.4.1.
If one specifies only the software name, without the version, the module system will load the latest available version which may lead to different results due to version differences.
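One way to enforce this habit in job scripts is a small guard that refuses module names without a version. This is only a sketch: the functions is_pinned_module and load_pinned are hypothetical helpers, not part of the module system.

```shell
# Succeed only if the module name carries an explicit version after a slash,
# as in Bowtie2/2.3.4.1-foss-2018b.
is_pinned_module() {
    case "$1" in
        ?*/?*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Load a module, but refuse names that do not specify a version.
load_pinned() {
    local name="$1"
    if ! is_pinned_module "$name"; then
        echo "Refusing to load '$name': specify a version, e.g. $name/<version>" >&2
        return 1
    fi
    module load "$name"
}
```

In a job script, load_pinned Bowtie2/2.3.4.1-foss-2018b would load the module as usual, whereas load_pinned Bowtie2 would stop with an error. Adding module list afterwards records the loaded modules in the job log, which helps when an analysis has to be reproduced later.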

Using EMBL Bio-IT Git/GitLab Service

Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.

Git enhances reproducible research by keeping a history of the changes over time.
We use it for many of the Bio-IT projects and courses/workshops and we strongly recommend it to everyone.
Specifically, we recommend that you create a repository for your project and track your scripts through git.
It is easy to set it up at EMBL using EMBL's Bio-IT GitLab Server.
This platform provides a versatile working Git solution with many applications and automation possibilities.
For common practices and further help, ask Bio-IT or people who work with it!

For an introductory guide to using git for version control, see the Git Novice - Software Carpentry lesson.
Note that, while the guide linked above refers to GitHub for remote repository hosting, we recommend using the similar, internally-managed, GitLab system at git.embl.de.
More information about the GitLab system and other associated services can be found in the "Computational Resources" chapter of this guide.
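Putting an existing project folder under version control takes only a few commands. The sketch below wraps them in a hypothetical function, start_project_repo, and assumes that your name and email address have already been set with git config --global.

```shell
# Hypothetical helper: turn a project directory into a git repository
# and record its current contents as the first commit.
start_project_repo() {
    local project_dir="$1"
    git -C "$project_dir" init
    git -C "$project_dir" add .
    git -C "$project_dir" commit -m "Add initial analysis scripts"
}
```

To publish the repository on git.embl.de, create an empty project in the GitLab web interface, then run git remote add origin <URL shown on the project page> followed by git push -u origin HEAD inside the project directory. Large data files are usually better kept out of the repository, for example by listing them in a .gitignore file before the first commit.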

Editors and licenses

When working with programming languages, we are often asked "What integrated development environment (IDE) or editor do you use?".

We list below some of the solutions offered at EMBL.

You may also want to know that EMBL is recognized as an academic institution by many of the companies that develop these kinds of solutions.
When deciding, check if educational licenses or other kinds of special discounts are available.

Working with Jupyter/JupyterLab

Jupyter notebooks are a popular solution to perform exploratory and visual data analysis in a multitude of languages, including Python, R and Julia.

EMBL has recently set up a JupyterHub server where you will have access to multiple environments.
There are environments for many different purposes, including general data exploration, image analysis, GPU computing and virtual desktops where you will be able to run graphical applications.

Working with RStudio

In addition to the JupyterHub solutions mentioned above, and particularly if you work in R, RStudio is a popular and powerful editor of choice.

To use RStudio at EMBL you have two possibilities:

Use RStudio on your computer

An open source edition of RStudio Desktop or RStudio Server can be installed on your computer at no cost.
This option is convenient to work independently of any EMBL resource or to work offline.
Note, however, that if you plan to analyze large datasets, your workstation or laptop may be restrictive due to limited resources (CPU, RAM, disk).
To cope with these larger needs, there is an RStudio Workbench (also known as RStudio Server Pro) solution, presented below.

Use RStudio through a browser

RStudio Workbench, available at rstudio.embl.de, is hosted at seneca.embl.de, a powerful server managed by GBCS.
To use it, connect via VPN or from within the EMBL network, follow the instructions on the login page (see the text in the yellow bar) and authenticate using your EMBL credentials.

Since RStudio Workbench serves many users but runs on a single powerful computer, be mindful of your colleagues and ensure your code is well behaved.

For your convenience, the server where RStudio is running also has access to the same module system available in the EMBL cluster, as well as group shares (/g/) and /scratch.
As with the cluster, avoid doing heavy I/O on this server as it will affect all RStudio users.
For heavy jobs, please submit them to the EMBL cluster.

If you need support with RStudio, the RStudio EMBL chat channel is the best place to go.

User account disk space

Each user has only 800 MB in their personal directory (/home/yourUsername).
This tends to fill up quickly as some software creates files in folders such as $HOME/.local, $HOME/.config and $HOME/.cache.
You can find how much space you are using by logging in to the cluster and executing the command quota.
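While quota reports the total, it does not show which folders are responsible. A small helper built on du can list the largest entries of a directory, hidden ones included. The function name largest_entries is a hypothetical choice.

```shell
# Hypothetical helper: list the ten largest entries directly inside a directory,
# including hidden ones such as .cache, with sizes in kilobytes, largest first.
largest_entries() {
    local dir="$1"
    du -sk "$dir"/.[!.]* "$dir"/* 2>/dev/null | sort -rn | head -n 10
}
```

Running largest_entries "$HOME" on the cluster would show which folders are worth relocating or cleaning up.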

Due to the limited storage capacity, you may need to relocate or point software to another location with more storage.
As /scratch is meant to be used as temporary space and older files are regularly deleted, it's not the ideal location.
The most common solution is to (sym)link any potentially large folders to a group share or to reconfigure software to use a folder outside $HOME.

To use the link approach you can execute:

mv ~/.cache /g/yourgroup/yourfolder/.cache
ln -s /g/yourgroup/yourfolder/.cache ~/.cache

and repeat the same operation for ~/.config, ~/.local and any other folder you identified as being large.
As these folders are used by different software and in order to avoid data corruption, make sure to only run the above commands when no other software is running.
This includes any open sessions in RStudio and JupyterHub servers.
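The move-and-link steps can be wrapped in a function that is safe to run more than once. This is a sketch under the same assumptions as the commands above: /g/yourgroup/yourfolder is a placeholder, and relocate_dir is a hypothetical helper name.

```shell
# Hypothetical helper: move a directory to another location and leave a
# symbolic link in its place. Directories that are already links are skipped.
relocate_dir() {
    local source_dir="$1" target_parent="$2"
    local name
    name="$(basename "$source_dir")"
    if [ -L "$source_dir" ] || [ ! -d "$source_dir" ]; then
        echo "Skipping $source_dir (already a link or not a directory)" >&2
        return 0
    fi
    if [ -e "$target_parent/$name" ]; then
        echo "Target $target_parent/$name already exists, not overwriting" >&2
        return 1
    fi
    mkdir -p "$target_parent"
    mv "$source_dir" "$target_parent/$name"
    ln -s "$target_parent/$name" "$source_dir"
}

# Example (run only when no other software is using these folders):
#   for d in ~/.cache ~/.config ~/.local; do
#       relocate_dir "$d" /g/yourgroup/yourfolder
#   done
```

The checks prevent two common accidents: moving a folder that was already relocated, and overwriting a folder of the same name that already exists on the group share.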

Sharing files inside and outside of EMBL

ownCloud

ownCloud is a Dropbox-like solution hosted at EMBL, run by EMBL IT and accessible at oc.embl.de.
As an EMBL employee you are entitled to 50GB of storage that you are free to use for your work needs.

This platform can also be used to share and receive documents from collaborators both inside and outside of EMBL.
You can find detailed instructions in the "ownCloud as Google Drive alternative" guide.

Aspera

If you need to transfer more data than your ownCloud quota allows, EMBL provides an Aspera endpoint.
Please refer to the IT Services Aspera page for additional information.

Backups

Advice on how to set up automatic backups on Windows and macOS can be found in the data protection section of the IT services intranet.

For Linux, you are free to use your favourite backup tool.
One common recommendation is borgbackup, which provides a simple command-line interface for encrypted backups over SSH.
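A minimal borgbackup routine might look like the sketch below. It uses the Borg 1.x command syntax (Borg 2 changed several commands), and the function name backup_dir, the repository location and the backup host in the usage note are all placeholders.

```shell
# Hypothetical helper: create one encrypted archive of a directory in a borg repository.
# Borg 1.x syntax. The repository has to be initialised once beforehand with:
#   borg init --encryption=repokey "$repo"
backup_dir() {
    local repo="$1" source_dir="$2"
    # {hostname} and {now} are placeholders that borg itself expands in archive names.
    borg create --stats "${repo}::{hostname}-{now}" "$source_dir"
}
```

For example, backup_dir backupuser@backuphost:backups/laptop ~/Documents would store a new, deduplicated archive over SSH on a machine you have access to. Old archives can be thinned out with borg prune, for instance with --keep-daily 7 --keep-weekly 4.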

Additional advice

For completeness, you can find additional technical information in the following locations: