<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>http://giwiki.gi.ucsc.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Anovak</id>
	<title>UCSC Genomics Institute Computing Infrastructure Information - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="http://giwiki.gi.ucsc.edu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Anovak"/>
	<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Special:Contributions/Anovak"/>
	<updated>2026-05-08T08:12:14Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.40.0</generator>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Slurm_Tips_for_Toil&amp;diff=728</id>
		<title>Slurm Tips for Toil</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Slurm_Tips_for_Toil&amp;diff=728"/>
		<updated>2025-12-11T22:11:55Z</updated>

		<summary type="html">&lt;p&gt;Anovak: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Here are some tips for running Toil workflows on the Phoenix Slurm cluster. Most likely you will want to run WDL workflows, but some of these tips also apply to other workflows like Cactus. You can also consult [https://github.com/DataBiosphere/toil/blob/master/docs/wdl/running.rst the Toil documentation on WDL workflows].&lt;br /&gt;
&lt;br /&gt;
* Install Toil with WDL support with:&lt;br /&gt;
&lt;br /&gt;
 pipx install 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
To use a development version of Toil, you can install from source instead:&lt;br /&gt;
&lt;br /&gt;
 pipx install 'toil[wdl]@git+https://github.com/DataBiosphere/toil.git'&lt;br /&gt;
&lt;br /&gt;
Or for a particular branch:&lt;br /&gt;
&lt;br /&gt;
 pipx install 'toil[wdl]@git+https://github.com/DataBiosphere/toil.git@issues/123-abc'&lt;br /&gt;
&lt;br /&gt;
If you don't have &amp;lt;code&amp;gt;pipx&amp;lt;/code&amp;gt;, you would first need to:&lt;br /&gt;
&lt;br /&gt;
 python3 -m pip install --user pipx&lt;br /&gt;
 python3 -m pipx ensurepath&lt;br /&gt;
&lt;br /&gt;
This may in turn require you to log out and back in.&lt;br /&gt;
&lt;br /&gt;
* For Toil options, you will want '''--batchSystem slurm''' to make it use Slurm and '''--batchLogsDir ./logs''' (or some other location on a shared filesystem) for the Slurm logs to not get lost.&lt;br /&gt;
&lt;br /&gt;
* You may be able to speed up your workflow with '''--caching true''', to cache data on nodes to be shared among multiple simultaneous tasks.&lt;br /&gt;
&lt;br /&gt;
* If using '''toil-wdl-runner''', you might want to add '''--jobStore ./jobStore''' to make sure the job store is in a defined, shared location so that you can use '''--restart''' later.&lt;br /&gt;
&lt;br /&gt;
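Putting these options together (the workflow and input file names here are just placeholders), an initial run and a later restart might look like:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --batchSystem slurm --batchLogsDir ./logs --jobStore ./jobStore workflow.wdl inputs.json&lt;br /&gt;
 toil-wdl-runner --batchSystem slurm --batchLogsDir ./logs --jobStore ./jobStore --restart workflow.wdl inputs.json&lt;br /&gt;
&lt;br /&gt;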
* If using '''toil-wdl-runner''', you will want to set the '''SINGULARITY_CACHEDIR''' and '''MINIWDL__SINGULARITY__IMAGE_CACHE''' environment variables for your workflow to locations on shared storage, such as the default cache locations in your home directory. Otherwise Toil will set them to temporary per-workflow node-local directories for each node, and thus re-download images for each workflow run, and for each cluster node. To avoid this, you can, for example, before your run or in your '''~/.bashrc''', set:&lt;br /&gt;
&lt;br /&gt;
 export SINGULARITY_CACHEDIR=$HOME/.singularity/cache&lt;br /&gt;
 export MINIWDL__SINGULARITY__IMAGE_CACHE=$HOME/.cache/miniwdl&lt;br /&gt;
&lt;br /&gt;
If you run a lot of workflows, or a workflow with a lot of containers, you will run out of space in your home directory. In that case, you can try using semi-persistent per-node storage for your image caches instead:&lt;br /&gt;
&lt;br /&gt;
 export SINGULARITY_CACHEDIR=&amp;quot;/data/tmp/$(whoami)/cache/singularity&amp;quot;&lt;br /&gt;
 export MINIWDL__SINGULARITY__IMAGE_CACHE=&amp;quot;/data/tmp/$(whoami)/cache/miniwdl&amp;quot;&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Slurm_Tips_for_Toil&amp;diff=727</id>
		<title>Slurm Tips for Toil</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Slurm_Tips_for_Toil&amp;diff=727"/>
		<updated>2025-12-11T22:11:10Z</updated>

		<summary type="html">&lt;p&gt;Anovak: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Here are some tips for running Toil workflows on the Phoenix Slurm cluster. Most likely you will want to run WDL workflows, but some of these tips also apply to other workflows like Cactus. You can also consult [https://github.com/DataBiosphere/toil/blob/master/docs/wdl/running.rst the Toil documentation on WDL workflows].&lt;br /&gt;
&lt;br /&gt;
* Install Toil with WDL support with:&lt;br /&gt;
&lt;br /&gt;
 pipx install 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
To use a development version of Toil, you can install from source instead:&lt;br /&gt;
&lt;br /&gt;
 pipx install 'toil[wdl]@git+https://github.com/DataBiosphere/toil.git'&lt;br /&gt;
&lt;br /&gt;
Or for a particular branch:&lt;br /&gt;
&lt;br /&gt;
 pipx install 'toil[wdl]@git+https://github.com/DataBiosphere/toil.git@issues/123-abc'&lt;br /&gt;
&lt;br /&gt;
If you don't have &amp;lt;code&amp;gt;pipx&amp;lt;/code&amp;gt;, you would first need to:&lt;br /&gt;
&lt;br /&gt;
 python3 -m pip install --user pipx&lt;br /&gt;
 python3 -m pipx ensurepath&lt;br /&gt;
&lt;br /&gt;
This may in turn require you to log out and back in.&lt;br /&gt;
&lt;br /&gt;
* For Toil options, you will want '''--batchSystem slurm''' to make it use Slurm and '''--batchLogsDir ./logs''' (or some other location on a shared filesystem) for the Slurm logs to not get lost.&lt;br /&gt;
&lt;br /&gt;
* You may be able to speed up your workflow with '''--caching true''', to cache data on nodes to be shared among multiple simultaneous tasks.&lt;br /&gt;
&lt;br /&gt;
* If using '''toil-wdl-runner''', you might want to add '''--jobStore ./jobStore''' to make sure the job store is in a defined, shared location so that you can use '''--restart''' later.&lt;br /&gt;
&lt;br /&gt;
* If using '''toil-wdl-runner''', you will want to set the '''SINGULARITY_CACHEDIR''' and '''MINIWDL__SINGULARITY__IMAGE_CACHE''' environment variables for your workflow to locations on shared storage, such as the default cache locations in your home directory. Otherwise Toil will set them to temporary node-local directories for each node, and thus re-download images for each workflow run, and for each cluster node. To avoid this, you can, for example, before your run or in your '''~/.bashrc''', set:&lt;br /&gt;
&lt;br /&gt;
 export SINGULARITY_CACHEDIR=$HOME/.singularity/cache&lt;br /&gt;
 export MINIWDL__SINGULARITY__IMAGE_CACHE=$HOME/.cache/miniwdl&lt;br /&gt;
&lt;br /&gt;
If you run a lot of workflows, or a workflow with a lot of containers, you will run out of space in your home directory. In that case, you can try using persistent per-node storage for your image caches instead:&lt;br /&gt;
&lt;br /&gt;
 export SINGULARITY_CACHEDIR=&amp;quot;/data/tmp/$(whoami)/cache/singularity&amp;quot;&lt;br /&gt;
 export MINIWDL__SINGULARITY__IMAGE_CACHE=&amp;quot;/data/tmp/$(whoami)/cache/miniwdl&amp;quot;&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=725</id>
		<title>Phoenix WDL Tutorial</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=725"/>
		<updated>2025-11-18T15:09:06Z</updated>

		<summary type="html">&lt;p&gt;Anovak: /* Frequently Asked Questions */ Add an error someone asked about&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Tutorial: Getting Started with WDL Workflows on Phoenix'''&lt;br /&gt;
&lt;br /&gt;
Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like the Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you.&lt;br /&gt;
&lt;br /&gt;
Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.&lt;br /&gt;
&lt;br /&gt;
This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.&lt;br /&gt;
&lt;br /&gt;
=Phoenix Cluster Setup=&lt;br /&gt;
&lt;br /&gt;
Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.&lt;br /&gt;
&lt;br /&gt;
==Getting VPN access==&lt;br /&gt;
&lt;br /&gt;
We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.&lt;br /&gt;
&lt;br /&gt;
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.&lt;br /&gt;
&lt;br /&gt;
==Connecting to Phoenix==&lt;br /&gt;
&lt;br /&gt;
Once you have VPN access, you can connect to any of the machines with access to the Phoenix cluster. These interactive nodes are fairly large machines that can do some work locally, but you will still want to run larger workflows on the actual cluster. For this tutorial, we will use &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt; as our login node.&lt;br /&gt;
&lt;br /&gt;
To connect to the cluster:&lt;br /&gt;
&lt;br /&gt;
1. Connect to the VPN.&lt;br /&gt;
2. SSH to &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt;. At the command line, run:&lt;br /&gt;
&lt;br /&gt;
 ssh emerald.prism&lt;br /&gt;
&lt;br /&gt;
If your username on the cluster (say, &amp;lt;code&amp;gt;flastname&amp;lt;/code&amp;gt;) is different than your username on your computer (which might be &amp;lt;code&amp;gt;firstname&amp;lt;/code&amp;gt;), you might instead have to run:&lt;br /&gt;
&lt;br /&gt;
 ssh flastname@emerald.prism&lt;br /&gt;
&lt;br /&gt;
The first time you connect, you will see a message like:&lt;br /&gt;
&lt;br /&gt;
 The authenticity of host 'emerald.prism (10.50.1.67)' can't be established.&lt;br /&gt;
 ED25519 key fingerprint is SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI.&lt;br /&gt;
 This key is not known by any other names.&lt;br /&gt;
 Are you sure you want to continue connecting (yes/no/[fingerprint])?&lt;br /&gt;
&lt;br /&gt;
This is your computer asking you to help it decide if it is talking to the genuine &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt;, and not an imposter. You will want to make sure that the &amp;quot;key fingerprint&amp;quot; is indeed &amp;lt;code&amp;gt;SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI&amp;lt;/code&amp;gt;. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type &amp;lt;code&amp;gt;yes&amp;lt;/code&amp;gt; to accept and remember that the server is who it says it is.&lt;br /&gt;
&lt;br /&gt;
==Installing Toil with WDL support==&lt;br /&gt;
&lt;br /&gt;
Once you are on the head node, you can install Toil, a program for running workflows.&lt;br /&gt;
&lt;br /&gt;
Toil is written in Python, and the modern way to install Python command line tools is with pipx. So [https://pipx.pypa.io/latest/installation/ install pipx]:&lt;br /&gt;
&lt;br /&gt;
 python3 -m pip install --user pipx&lt;br /&gt;
 python3 -m pipx ensurepath&lt;br /&gt;
&lt;br /&gt;
This may instruct you to '''log out and log back in''' or take some other action to adopt the new &amp;lt;code&amp;gt;PATH&amp;lt;/code&amp;gt; settings.&lt;br /&gt;
&lt;br /&gt;
When installing Toil, you need to specify that you want WDL support. To do this, you can run:&lt;br /&gt;
&lt;br /&gt;
 pipx install 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
If you also want to use AWS S3 &amp;lt;code&amp;gt;s3://&amp;lt;/code&amp;gt; and/or Google &amp;lt;code&amp;gt;gs://&amp;lt;/code&amp;gt; URLs for data, you will need to also install Toil with the &amp;lt;code&amp;gt;aws&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;google&amp;lt;/code&amp;gt; extras, respectively:&lt;br /&gt;
&lt;br /&gt;
 pipx install 'toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
To change what extras are used when you have an existing Toil installation, you will need to use the &amp;lt;code&amp;gt;--force&amp;lt;/code&amp;gt; option.&lt;br /&gt;
&lt;br /&gt;
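For example, to reinstall an existing Toil with a different set of extras (the extras here are just an illustration):&lt;br /&gt;
&lt;br /&gt;
 pipx install --force 'toil[wdl,aws]'&lt;br /&gt;
&lt;br /&gt;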
This will install Toil in the &amp;lt;code&amp;gt;.local&amp;lt;/code&amp;gt; directory inside your home directory, which we write as &amp;lt;code&amp;gt;~/.local&amp;lt;/code&amp;gt;. The program to run WDL workflows, &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, will be at &amp;lt;code&amp;gt;~/.local/bin/toil-wdl-runner&amp;lt;/code&amp;gt;. The &amp;lt;code&amp;gt;python3 -m pipx ensurepath&amp;lt;/code&amp;gt; command should have added the &amp;lt;code&amp;gt;~/.local/bin&amp;lt;/code&amp;gt; directory to your &amp;lt;code&amp;gt;PATH&amp;lt;/code&amp;gt; environment variable, to ensure you can find these commands.&lt;br /&gt;
&lt;br /&gt;
If you see something from &amp;lt;code&amp;gt;pipx&amp;lt;/code&amp;gt; like:&lt;br /&gt;
&lt;br /&gt;
     - cwltoil (symlink missing or pointing to unexpected location)&lt;br /&gt;
&lt;br /&gt;
Then run &amp;lt;code&amp;gt;pipx uninstall toil&amp;lt;/code&amp;gt;, remove the offending file from &amp;lt;code&amp;gt;~/.local/bin&amp;lt;/code&amp;gt;, and try again.&lt;br /&gt;
&lt;br /&gt;
To make sure it worked, you can run:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --help&lt;br /&gt;
&lt;br /&gt;
If everything worked correctly, it will print a long list of the various option flags that the &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; command supports.&lt;br /&gt;
&lt;br /&gt;
If you ever want to upgrade Toil to a new release, you can repeat the &amp;lt;code&amp;gt;pipx&amp;lt;/code&amp;gt; command above.&lt;br /&gt;
&lt;br /&gt;
==Configuring your Phoenix Environment==&lt;br /&gt;
&lt;br /&gt;
'''Do not try to store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. Usually you would end up with &amp;lt;code&amp;gt;/private/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Remember this path; we will need it later.&lt;br /&gt;
&lt;br /&gt;
==Configuring Toil for Phoenix==&lt;br /&gt;
&lt;br /&gt;
Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. Since these files can be large, and the home directory quota is only 30 GB, they might not all fit in your home directory.&lt;br /&gt;
&lt;br /&gt;
We would like to be able to store these on the cluster's large storage array, under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. However, Toil needs to use file locks in these directories to prevent simultaneous Singularity calls from producing internal Singularity errors, and Ceph currently has [https://tracker.ceph.com/issues/65607 a bug where these file locking operations can freeze the Ceph servers].&lt;br /&gt;
&lt;br /&gt;
If you have '''a small number of container images''' that will fit in your home directory, you can keep them there. [https://github.com/DataBiosphere/toil/commit/cb0b291bb7f6212bfe69221dd9f09d72f83e92fb Since Toil 6.1.0], this is the default behavior and you don't need to do anything. (Unless you previously set &amp;lt;code&amp;gt;SINGULARITY_CACHEDIR&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;MINIWDL__SINGULARITY__IMAGE_CACHE&amp;lt;/code&amp;gt;, in which case you need to unset them.)&lt;br /&gt;
&lt;br /&gt;
'''If you don't have room in your home directory''' for container images, currently the recommended approach is to use node-local storage under &amp;lt;code&amp;gt;/data/tmp&amp;lt;/code&amp;gt;. This results in each node pulling each container image, but images will be saved across workflows.&lt;br /&gt;
&lt;br /&gt;
You can set that up for all your workflows with:&lt;br /&gt;
&lt;br /&gt;
 echo 'export SINGULARITY_CACHEDIR=&amp;quot;/data/tmp/$(whoami)/cache/singularity&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE=&amp;quot;/data/tmp/$(whoami)/cache/miniwdl&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
Then '''log out and log back in again''', to apply the changes.&lt;br /&gt;
&lt;br /&gt;
=Running an existing workflow=&lt;br /&gt;
&lt;br /&gt;
First, let's use &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].&lt;br /&gt;
&lt;br /&gt;
First, go to your user directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;, and make a directory to work in.&lt;br /&gt;
&lt;br /&gt;
 cd /private/groups/YOURGROUPNAME/YOURUSERNAME&lt;br /&gt;
 mkdir workflow-test&lt;br /&gt;
 cd workflow-test&lt;br /&gt;
&lt;br /&gt;
Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.&lt;br /&gt;
&lt;br /&gt;
 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl&lt;br /&gt;
&lt;br /&gt;
==Preparing an input file==&lt;br /&gt;
&lt;br /&gt;
Near the top of the WDL file, there's a section like this:&lt;br /&gt;
&lt;br /&gt;
 workflow hello_caller {&lt;br /&gt;
     input {&lt;br /&gt;
         File who&lt;br /&gt;
     }&lt;br /&gt;
&lt;br /&gt;
This means that there is a workflow named &amp;lt;code&amp;gt;hello_caller&amp;lt;/code&amp;gt; in this file, and it takes as input a file variable named &amp;lt;code&amp;gt;who&amp;lt;/code&amp;gt;. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.&lt;br /&gt;
&lt;br /&gt;
So first, we have to make that list of names. Let's make it in &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 echo &amp;quot;Mridula Resurrección&amp;quot; &amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Gershom Šarlota&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Ritchie Ravi&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
&lt;br /&gt;
Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt; that references it by relative path, like this:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./names.txt&amp;quot;}' &amp;gt;inputs.json&lt;br /&gt;
&lt;br /&gt;
Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].&lt;br /&gt;
&lt;br /&gt;
==Testing at small scale on a single machine==&lt;br /&gt;
&lt;br /&gt;
We are now ready to run the workflow!&lt;br /&gt;
&lt;br /&gt;
You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:&lt;br /&gt;
&lt;br /&gt;
 srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i&lt;br /&gt;
&lt;br /&gt;
This will start a new shell on a worker node that can run for up to 2 hours.&lt;br /&gt;
&lt;br /&gt;
In your new shell, run this Toil command:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner self_test.wdl inputs.json -o local_run&lt;br /&gt;
&lt;br /&gt;
This will, by default, use the &amp;lt;code&amp;gt;single_machine&amp;lt;/code&amp;gt; Toil &amp;quot;batch system&amp;quot; to run all of the workflow's tasks locally. Output will be sent to a new directory named &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
This will print a lot of logging to standard error, and to standard output it will print:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;hello_caller.message_files&amp;quot;: [&amp;quot;local_run/Mridula Resurrecci\u00f3n.txt&amp;quot;, &amp;quot;local_run/Gershom \u0160arlota.txt&amp;quot;, &amp;quot;local_run/Ritchie Ravi.txt&amp;quot;], &amp;quot;hello_caller.messages&amp;quot;: [&amp;quot;Hello, Mridula Resurrecci\u00f3n!&amp;quot;, &amp;quot;Hello, Gershom \u0160arlota!&amp;quot;, &amp;quot;Hello, Ritchie Ravi!&amp;quot;]}&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt; directory will contain the described text files (with Unicode escape sequences like &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt; replaced by their corresponding characters), each containing a greeting for the corresponding person.&lt;br /&gt;
&lt;br /&gt;
To leave your interactive Slurm session and return to the head node, use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Running at larger scale==&lt;br /&gt;
&lt;br /&gt;
Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people!&lt;br /&gt;
&lt;br /&gt;
Go get this handy list of people and cut it to length:&lt;br /&gt;
&lt;br /&gt;
 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt&lt;br /&gt;
 head -n100 1000_names.txt &amp;gt;100_names.txt&lt;br /&gt;
&lt;br /&gt;
And make a new inputs file:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./100_names.txt&amp;quot;}' &amp;gt;inputs_big.json&lt;br /&gt;
&lt;br /&gt;
Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.&lt;br /&gt;
&lt;br /&gt;
To run against the Slurm cluster, we need to use the &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the &amp;lt;code&amp;gt;--caching false&amp;lt;/code&amp;gt; option. We will add the &amp;lt;code&amp;gt;--batchLogsDir&amp;lt;/code&amp;gt; option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the &amp;lt;code&amp;gt;-m&amp;lt;/code&amp;gt; option to save the output JSON to a file instead of printing it.&lt;br /&gt;
&lt;br /&gt;
Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the &amp;lt;code&amp;gt;TOIL_SLURM_ARGS&amp;lt;/code&amp;gt; environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in.&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json&lt;br /&gt;
&lt;br /&gt;
This will take a while to run, but eventually you should end up with 100 greeting files in the &amp;lt;code&amp;gt;slurm_run&amp;lt;/code&amp;gt; directory.&lt;br /&gt;
&lt;br /&gt;
=Writing your own workflow=&lt;br /&gt;
&lt;br /&gt;
In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. &lt;br /&gt;
&lt;br /&gt;
==Writing the file==&lt;br /&gt;
&lt;br /&gt;
===Version===&lt;br /&gt;
&lt;br /&gt;
All WDL files need to start with a &amp;lt;code&amp;gt;version&amp;lt;/code&amp;gt; statement (unless they are very old &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; files). Toil supports &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt;, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; and 1.0.&lt;br /&gt;
&lt;br /&gt;
So let's start a new WDL 1.0 workflow. Open up a file named &amp;lt;code&amp;gt;fizzbuzz.wdl&amp;lt;/code&amp;gt; and start with a version statement:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
&lt;br /&gt;
===Workflow Block===&lt;br /&gt;
&lt;br /&gt;
Then, add an empty &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; named &amp;lt;code&amp;gt;FizzBuzz&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Input Block===&lt;br /&gt;
&lt;br /&gt;
Workflows usually need some kind of user input, so let's give our workflow an &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; section.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Notice that each input has a type, a name, and an optional default value. If the type ends in &amp;lt;code&amp;gt;?&amp;lt;/code&amp;gt;, the value is optional, and it may be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.&lt;br /&gt;
&lt;br /&gt;
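For example, an inputs file for this workflow only has to provide &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;, but it can also override a default (the values here are just an illustration):&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;FizzBuzz.item_count&amp;quot;: 15, &amp;quot;FizzBuzz.to_fizz&amp;quot;: 4}&lt;br /&gt;
&lt;br /&gt;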
===Body===&lt;br /&gt;
&lt;br /&gt;
Now we'll start on the body of the workflow, to be inserted just after the inputs section. &lt;br /&gt;
&lt;br /&gt;
The first thing we're going to need to do is create an array of all the numbers up to the &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;. We can do this by calling the WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; function, and assigning the result to an &amp;lt;code&amp;gt;Array[Int]&amp;lt;/code&amp;gt; variable.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
&lt;br /&gt;
WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.&lt;br /&gt;
&lt;br /&gt;
===Scattering===&lt;br /&gt;
&lt;br /&gt;
Once we create an array of all the numbers, we can use a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt; to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a &amp;lt;code&amp;gt;map()&amp;lt;/code&amp;gt; in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; starts at 0.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
     Int one_based = i + 1&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Conditionals===&lt;br /&gt;
&lt;br /&gt;
Inside the body of the scatter, we are going to put some conditionals to determine if we should produce &amp;lt;code&amp;gt;&amp;quot;Fizz&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;Buzz&amp;quot;&amp;lt;/code&amp;gt;, or &amp;lt;code&amp;gt;&amp;quot;FizzBuzz&amp;quot;&amp;lt;/code&amp;gt;. To support our &amp;lt;code&amp;gt;fizzbuzz_override&amp;lt;/code&amp;gt;, we use an array of it and a default value, and use the WDL &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; function to find the first non-null value in that array.&lt;br /&gt;
&lt;br /&gt;
Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; at the end and take advantage of variables from un-executed conditionals being &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Note that WDL supports conditional ''expressions'' with a &amp;lt;code&amp;gt;then&amp;lt;/code&amp;gt; and an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt;, but conditional ''statements'' only have a body, not an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt; branch. If you need an else you will have to check the negated condition.&lt;br /&gt;
&lt;br /&gt;
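For example, a conditional ''expression'' (shown here as a standalone illustration, not part of our workflow) looks like:&lt;br /&gt;
&lt;br /&gt;
 String parity = if (one_based % 2 == 0) then &amp;quot;even&amp;quot; else &amp;quot;odd&amp;quot;&lt;br /&gt;
&lt;br /&gt;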
So first, let's handle the special cases.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Calling Tasks===&lt;br /&gt;
&lt;br /&gt;
Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a &amp;lt;code&amp;gt;${}&amp;lt;/code&amp;gt; substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a &amp;lt;code&amp;gt;stringify_number&amp;lt;/code&amp;gt; task, to be written later.&lt;br /&gt;
&lt;br /&gt;
To call a task (or another workflow), we use a &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt; statement and give it some inputs. Then we can fish the output values out of the task with &amp;lt;code&amp;gt;.&amp;lt;/code&amp;gt; access, but only for the iterations where we actually made the call instead of making a noise.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
        call stringify_number {&lt;br /&gt;
            input:&lt;br /&gt;
                the_number = one_based&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We can put the code into the workflow now, and set about writing the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         &lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Writing Tasks===&lt;br /&gt;
&lt;br /&gt;
Our task should go after the workflow in the file. It looks a lot like a workflow except it uses &amp;lt;code&amp;gt;task&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're going to want it to take in an integer &amp;lt;code&amp;gt;the_number&amp;lt;/code&amp;gt;, and we're going to want it to output a string &amp;lt;code&amp;gt;the_string&amp;lt;/code&amp;gt;. So let's fill that in in &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; sections.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     # ???&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now, unlike workflows, tasks can have a &amp;lt;code&amp;gt;command&amp;lt;/code&amp;gt; section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use &amp;lt;code&amp;gt;~{}&amp;lt;/code&amp;gt;, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now we need to capture the result of the command script. The WDL &amp;lt;code&amp;gt;stdout()&amp;lt;/code&amp;gt; returns a WDL &amp;lt;code&amp;gt;File&amp;lt;/code&amp;gt; containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're also going to want to add a &amp;lt;code&amp;gt;runtime&amp;lt;/code&amp;gt; section to our task, to specify resource requirements. We'll also tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:24.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;disks&amp;lt;/code&amp;gt; section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a &amp;lt;code&amp;gt;local-disk&amp;lt;/code&amp;gt; of a certain number of gigabytes, optionally hinting that it should be &amp;lt;code&amp;gt;SSD&amp;lt;/code&amp;gt; storage.&lt;br /&gt;
&lt;br /&gt;
Then we can put our task into our WDL file:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:24.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Output Block===&lt;br /&gt;
&lt;br /&gt;
Now the only thing missing is a workflow-level &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section. Technically, WDL 1.0 doesn't require one, but WDL 1.1 does, and Toil won't actually send your outputs anywhere unless you have one, so we're going to write one. We need to collect together all the strings that came out of the different tasks in our scatter into an &amp;lt;code&amp;gt;Array[String]&amp;lt;/code&amp;gt;. We'll add the &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section at the end of the &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; section, above the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
     output {&lt;br /&gt;
          Array[String] fizzbuzz_results = result&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:24.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Because the &amp;lt;code&amp;gt;result&amp;lt;/code&amp;gt; variable is defined inside a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt;, when we reference it outside the scatter we see it as being an array.&lt;br /&gt;
&lt;br /&gt;
==Running the Workflow==&lt;br /&gt;
&lt;br /&gt;
Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;FizzBuzz.item_count&amp;quot;: 20}' &amp;gt;fizzbuzz.json&lt;br /&gt;
&lt;br /&gt;
Then run it on the cluster with Toil:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --slurmTime 00:10:00 --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
Or locally:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
=Debugging Workflows=&lt;br /&gt;
Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.&lt;br /&gt;
&lt;br /&gt;
==Restarting the Workflow==&lt;br /&gt;
&lt;br /&gt;
If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have fixed, and you ran the workflow with &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; set manually to a directory that persists between attempts, you can add &amp;lt;code&amp;gt;--restart&amp;lt;/code&amp;gt; to your workflow command and make Toil try again. It will pick up from where it left off and rerun any failed tasks and then the rest of the workflow.&lt;br /&gt;
&lt;br /&gt;
This ''will not'' pick up any changes to your WDL source code files; those are read once at the beginning and not re-read on restart.&lt;br /&gt;
&lt;br /&gt;
If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques.&lt;br /&gt;
&lt;br /&gt;
==Debugging Options==&lt;br /&gt;
When debugging a workflow, make sure to run the workflow with &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt;, to set the log level to &amp;lt;code&amp;gt;DEBUG&amp;lt;/code&amp;gt;, and with &amp;lt;code&amp;gt;--jobStore /some/path/to/a/shared/directory/it/can/create&amp;lt;/code&amp;gt; so that the stored files shipped between jobs are in a place you can access them.&lt;br /&gt;
&lt;br /&gt;
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:&lt;br /&gt;
&lt;br /&gt;
 =========&amp;gt;&lt;br /&gt;
         Toil job log is here&lt;br /&gt;
 &amp;lt;=========&lt;br /&gt;
&lt;br /&gt;
Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this.&lt;br /&gt;
&lt;br /&gt;
==Reading the Log==&lt;br /&gt;
&lt;br /&gt;
When a WDL workflow fails, you are likely to see a message like this:&lt;br /&gt;
&lt;br /&gt;
 WDL.runtime.error.CommandFailed: task command failed with exit status 1&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism&lt;br /&gt;
&lt;br /&gt;
This means that the command line specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code. That happens either when the command is written wrong, or when the tool you are trying to run detects and reports an error itself.&lt;br /&gt;
&lt;br /&gt;
Go up higher in the log until you find lines that look like:&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows:&lt;br /&gt;
&lt;br /&gt;
And&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows:&lt;br /&gt;
&lt;br /&gt;
These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.&lt;br /&gt;
&lt;br /&gt;
If you would like individual task logs to be saved separately for later reference, you can use the &amp;lt;code&amp;gt;--writeLogs&amp;lt;/code&amp;gt; option to specify a directory to store them. For more information, see [https://toil.readthedocs.io/en/latest/wdl/running.html#managing-workflow-logs the Toil documentation of workflow task logs].&lt;br /&gt;
&lt;br /&gt;
==Reproducing Problems==&lt;br /&gt;
&lt;br /&gt;
When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this.&lt;br /&gt;
&lt;br /&gt;
=== Automatically Fetching Input Files ===&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;toil debug-job&amp;lt;/code&amp;gt; command has a &amp;lt;code&amp;gt;--retrieveTaskDirectory&amp;lt;/code&amp;gt; option that lets you dump out a directory with all the files that a failing WDL task would use. You can use it like:&lt;br /&gt;
&lt;br /&gt;
 toil debug-job ./jobstore WDLTaskJob --retrieveTaskDirectory dumpdir&lt;br /&gt;
&lt;br /&gt;
If there are multiple failing tasks, you might need to replace &amp;lt;code&amp;gt;WDLTaskJob&amp;lt;/code&amp;gt; with the name of one of the failing jobs. See [https://toil.readthedocs.io/en/latest/running/debugging.html#fetching-job-inputs the Toil documentation on retrieving files] for more on how to use this command.&lt;br /&gt;
&lt;br /&gt;
=== Manually Finding Input Files ===&lt;br /&gt;
&lt;br /&gt;
If you can't use &amp;lt;code&amp;gt;toil debug-job&amp;lt;/code&amp;gt;, you might need to manually dig through the job store for files. In the log of your failing Toil task, look for lines like this:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&amp;lt;/code&amp;gt; part is a Toil file ID, and it is a relative path from your &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; value to where the file is stored on disk. So if you ran the workflow with &amp;lt;code&amp;gt;--jobStore /private/groups/patenlab/anovak/jobstore&amp;lt;/code&amp;gt;, you would look for this file at:&lt;br /&gt;
&lt;br /&gt;
 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&lt;br /&gt;
&lt;br /&gt;
==More Ways of Finding Files==&lt;br /&gt;
&lt;br /&gt;
Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the &amp;lt;code&amp;gt;find&amp;lt;/code&amp;gt; command to try and find the files by name. For example, if you want to look at &amp;lt;code&amp;gt;Sample.bam&amp;lt;/code&amp;gt;, you can look for it like this:&lt;br /&gt;
&lt;br /&gt;
 find /path/to/the/jobstore -name &amp;quot;Sample.bam&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
You can take the &amp;lt;code&amp;gt;toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt; URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this:&lt;br /&gt;
&lt;br /&gt;
 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
Then you can take the part after the last colon, &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt;, and that is the path relative to the job store where this file can be found.&lt;br /&gt;
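&lt;br /&gt;
If you'd rather not paste the URI into a web page, here is a sketch of doing the same decoding at the command line with the Python standard library (just one way to do it; any stock &amp;lt;code&amp;gt;python3&amp;lt;/code&amp;gt; will work):&lt;br /&gt;
&lt;br /&gt;
```shell
# URL-decode a Toil file URI using Python's standard urllib.parse module.
python3 -c 'import sys, urllib.parse; print(urllib.parse.unquote(sys.argv[1]))' \
  'toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam'
```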
&lt;br /&gt;
==Using Development Versions of Toil==&lt;br /&gt;
&lt;br /&gt;
Sometimes, bugs will be fixed in the development version of Toil, but not released yet. To try the current development version of Toil, you can install it like this:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
If you want to use a particular branch or commit, like &amp;lt;code&amp;gt;aaa451b320fc115b3563ced25cb501301cf86f90&amp;lt;/code&amp;gt;, you can do:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
==Frequently Asked Questions==&lt;br /&gt;
&lt;br /&gt;
===I am getting warnings about &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt;===&lt;br /&gt;
You may be seeing warnings that &amp;lt;code&amp;gt;XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should upgrade Toil. [https://github.com/DataBiosphere/toil/commit/ff6bf60ab798a675c20156c749817c4313644b96 Since Toil 6.1.0], Toil no longer issues this warning, and just puts up with bad &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt; settings.&lt;br /&gt;
&lt;br /&gt;
===Toil said it was &amp;lt;code&amp;gt;Redirecting logging&amp;lt;/code&amp;gt; somewhere, but I can't find that file!===&lt;br /&gt;
The Toil worker process for each job will say that it is &amp;lt;code&amp;gt;Redirecting logging to /data/tmp/somewhere/worker_log.txt&amp;lt;/code&amp;gt;, and when running in single machine mode these messages go to the main Toil log.&lt;br /&gt;
&lt;br /&gt;
The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt; option to Toil.&lt;br /&gt;
&lt;br /&gt;
If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.&lt;br /&gt;
&lt;br /&gt;
===How do I delete files in WDL?===&lt;br /&gt;
WDL doesn't have a built-in way to delete files; if you run a task that deletes a file, it will still exist in Toil's job store storage.&lt;br /&gt;
&lt;br /&gt;
Toil [https://github.com/DataBiosphere/toil/commit/2de6eea2cc2e688b53062a98687445f0cca56669 recently gained support] for deleting files at the ''end'' of WDL workflows. So if you have a large file that you only need for part of your workflow, consider writing the part that creates and uses it as a separate sub-&amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; and invoking it with &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt;. Then the file will be cleaned up when the child workflow ends, leaving more space for files created in the parent workflow.&lt;br /&gt;
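&lt;br /&gt;
As a sketch of that pattern (the names here are invented; &amp;lt;code&amp;gt;make_big_file&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;summarize&amp;lt;/code&amp;gt; stand in for your real tasks):&lt;br /&gt;
&lt;br /&gt;
```wdl
# Child workflow: the large intermediate file is only referenced here,
# so it can be cleaned up as soon as this workflow finishes.
workflow summarize_big_file {
    call make_big_file
    call summarize {
        input:
            big_file = make_big_file.out
    }
    output {
        # Only the small summary is passed back to the parent workflow.
        File summary = summarize.out
    }
}
```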
&lt;br /&gt;
===I am unable to allocate resources!===&lt;br /&gt;
You might get an error like:&lt;br /&gt;
&lt;br /&gt;
 srun: error: Unable to allocate resources: Invalid account or account/partition combination specified&lt;br /&gt;
&lt;br /&gt;
If this happens, you probably haven't been granted access to the Phoenix cluster, or at least to the partition you are trying to use. You should email &amp;lt;code&amp;gt;cluster-admin@soe.ucsc.edu&amp;lt;/code&amp;gt; to ask for access.&lt;br /&gt;
&lt;br /&gt;
=Additional WDL resources=&lt;br /&gt;
&lt;br /&gt;
For more information on writing and running WDL workflows, see:&lt;br /&gt;
&lt;br /&gt;
* [https://docs.openwdl.org/en/stable/ The WDL documentation]&lt;br /&gt;
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The &amp;quot;Learn WDL&amp;quot; video course on YouTube]&lt;br /&gt;
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=667</id>
		<title>Phoenix WDL Tutorial</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=667"/>
		<updated>2025-03-18T20:48:24Z</updated>

		<summary type="html">&lt;p&gt;Anovak: /* Writing the file */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Tutorial: Getting Started with WDL Workflows on Phoenix'''&lt;br /&gt;
&lt;br /&gt;
Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you.&lt;br /&gt;
&lt;br /&gt;
Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.&lt;br /&gt;
&lt;br /&gt;
This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.&lt;br /&gt;
&lt;br /&gt;
=Phoenix Cluster Setup=&lt;br /&gt;
&lt;br /&gt;
Before we begin, you will need a computer to work at that you are able to install software on, and the ability to connect to other machines over SSH.&lt;br /&gt;
&lt;br /&gt;
==Getting VPN access==&lt;br /&gt;
&lt;br /&gt;
We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.&lt;br /&gt;
&lt;br /&gt;
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.&lt;br /&gt;
&lt;br /&gt;
==Connecting to Phoenix==&lt;br /&gt;
&lt;br /&gt;
Once you have VPN access, you can connect to any of the machines with access to the Phoenix cluster. These interactive nodes are fairly large machines that can do some work locally, but you will still want to run larger workflows on the actual cluster. For this tutorial, we will use &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt; as our login node.&lt;br /&gt;
&lt;br /&gt;
To connect to the cluster:&lt;br /&gt;
&lt;br /&gt;
1. Connect to the VPN.&lt;br /&gt;
2. SSH to &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt;. At the command line, run:&lt;br /&gt;
&lt;br /&gt;
 ssh emerald.prism&lt;br /&gt;
&lt;br /&gt;
If your username on the cluster (say, &amp;lt;code&amp;gt;flastname&amp;lt;/code&amp;gt;) is different than your username on your computer (which might be &amp;lt;code&amp;gt;firstname&amp;lt;/code&amp;gt;), you might instead have to run:&lt;br /&gt;
&lt;br /&gt;
 ssh flastname@emerald.prism&lt;br /&gt;
&lt;br /&gt;
The first time you connect, you will see a message like:&lt;br /&gt;
&lt;br /&gt;
 The authenticity of host 'emerald.prism (10.50.1.67)' can't be established.&lt;br /&gt;
 ED25519 key fingerprint is SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI.&lt;br /&gt;
 This key is not known by any other names.&lt;br /&gt;
 Are you sure you want to continue connecting (yes/no/[fingerprint])?&lt;br /&gt;
&lt;br /&gt;
This is your computer asking you to help it decide if it is talking to the genuine &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt;, and not an imposter. You will want to make sure that the &amp;quot;key fingerprint&amp;quot; is indeed &amp;lt;code&amp;gt;SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI&amp;lt;/code&amp;gt;. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type &amp;lt;code&amp;gt;yes&amp;lt;/code&amp;gt; to accept and remember that the server is who it says it is.&lt;br /&gt;
&lt;br /&gt;
==Installing Toil with WDL support==&lt;br /&gt;
&lt;br /&gt;
Once you are on the head node, you can install Toil, a program for running workflows.&lt;br /&gt;
&lt;br /&gt;
Toil is written in Python, and the modern way to install Python command line tools is with pipx. So [https://pipx.pypa.io/latest/installation/ install pipx]:&lt;br /&gt;
&lt;br /&gt;
 python3 -m pip install --user pipx&lt;br /&gt;
 python3 -m pipx ensurepath&lt;br /&gt;
&lt;br /&gt;
This may instruct you to '''log out and log back in''' or take some other action to adopt the new &amp;lt;code&amp;gt;PATH&amp;lt;/code&amp;gt; settings.&lt;br /&gt;
&lt;br /&gt;
When installing Toil, you need to specify that you want WDL support. To do this, you can run:&lt;br /&gt;
&lt;br /&gt;
 pipx install 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
If you also want to use AWS S3 &amp;lt;code&amp;gt;s3://&amp;lt;/code&amp;gt; and/or Google &amp;lt;code&amp;gt;gs://&amp;lt;/code&amp;gt; URLs for data, you will need to also install Toil with the &amp;lt;code&amp;gt;aws&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;google&amp;lt;/code&amp;gt; extras, respectively:&lt;br /&gt;
&lt;br /&gt;
 pipx install 'toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
To change what extras are used when you have an existing Toil installation, you will need to use the &amp;lt;code&amp;gt;--force&amp;lt;/code&amp;gt; option.&lt;br /&gt;
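&lt;br /&gt;
For example, to redo an existing installation with a different set of extras:&lt;br /&gt;
&lt;br /&gt;
 pipx install --force 'toil[wdl,aws,google]'&lt;br /&gt;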
&lt;br /&gt;
This will install Toil in the &amp;lt;code&amp;gt;.local&amp;lt;/code&amp;gt; directory inside your home directory, which we write as &amp;lt;code&amp;gt;~/.local&amp;lt;/code&amp;gt;. The program to run WDL workflows, &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, will be at &amp;lt;code&amp;gt;~/.local/bin/toil-wdl-runner&amp;lt;/code&amp;gt;. The &amp;lt;code&amp;gt;python3 -m pipx ensurepath&amp;lt;/code&amp;gt; command should have added the &amp;lt;code&amp;gt;~/.local&amp;lt;/code&amp;gt; directory to your &amp;lt;code&amp;gt;PATH&amp;lt;/code&amp;gt; environment variable, to ensure you can find these commands.&lt;br /&gt;
&lt;br /&gt;
If you see something from &amp;lt;code&amp;gt;pipx&amp;lt;/code&amp;gt; like:&lt;br /&gt;
&lt;br /&gt;
     - cwltoil (symlink missing or pointing to unexpected location)&lt;br /&gt;
&lt;br /&gt;
Then &amp;lt;code&amp;gt;pipx uninstall toil&amp;lt;/code&amp;gt;, remove the offending file from &amp;lt;code&amp;gt;~/.local/bin&amp;lt;/code&amp;gt;, and try again.&lt;br /&gt;
&lt;br /&gt;
To make sure it worked, you can run:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --help&lt;br /&gt;
&lt;br /&gt;
If everything worked correctly, it will print a long list of the various option flags that the &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; command supports.&lt;br /&gt;
&lt;br /&gt;
If you ever want to upgrade Toil to a new release, you can repeat the &amp;lt;code&amp;gt;pipx&amp;lt;/code&amp;gt; command above with the &amp;lt;code&amp;gt;--force&amp;lt;/code&amp;gt; option.&lt;br /&gt;
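&lt;br /&gt;
Alternatively, &amp;lt;code&amp;gt;pipx&amp;lt;/code&amp;gt; can upgrade an installed package in place, keeping whatever extras you installed it with:&lt;br /&gt;
&lt;br /&gt;
 pipx upgrade toil&lt;br /&gt;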
&lt;br /&gt;
==Configuring your Phoenix Environment==&lt;br /&gt;
&lt;br /&gt;
'''Do not try to store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. Usually you would end up with &amp;lt;code&amp;gt;/private/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Remember this path; we will need it later.&lt;br /&gt;
&lt;br /&gt;
==Configuring Toil for Phoenix==&lt;br /&gt;
&lt;br /&gt;
Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. On the Phoenix cluster, however, we have a shared filesystem, so we should configure Toil to use it for caching the Docker container images used for running workflow steps. Since these images can be large, and the home directory quota is only 30 GB, you might not be able to keep them in your home directory.&lt;br /&gt;
&lt;br /&gt;
We would like to be able to store these on the cluster's large storage array, under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. However, Toil needs to use file locks in these directories to prevent simultaneous Singularity calls from producing internal Singularity errors, and Ceph currently has [https://tracker.ceph.com/issues/65607 a bug where these file locking operations can freeze the Ceph servers].&lt;br /&gt;
&lt;br /&gt;
If you have '''a small number of container images''' that will fit in your home directory, you can keep them there. [https://github.com/DataBiosphere/toil/commit/cb0b291bb7f6212bfe69221dd9f09d72f83e92fb Since Toil 6.1.0], this is the default behavior and you don't need to do anything. (Unless you previously set &amp;lt;code&amp;gt;SINGULARITY_CACHEDIR&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;MINIWDL__SINGULARITY__IMAGE_CACHE&amp;lt;/code&amp;gt;, in which case you need to unset them.)&lt;br /&gt;
&lt;br /&gt;
'''If you don't have room in your home directory''' for container images, currently the recommended approach is to use node-local storage under &amp;lt;code&amp;gt;/data/tmp&amp;lt;/code&amp;gt;. This results in each node pulling each container image, but images will be saved across workflows.&lt;br /&gt;
&lt;br /&gt;
You can set that up for all your workflows with:&lt;br /&gt;
&lt;br /&gt;
 echo 'export SINGULARITY_CACHEDIR=&amp;quot;/data/tmp/$(whoami)/cache/singularity&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE=&amp;quot;/data/tmp/$(whoami)/cache/miniwdl&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
Then '''log out and log back in again''', to apply the changes.&lt;br /&gt;
&lt;br /&gt;
=Running an existing workflow=&lt;br /&gt;
&lt;br /&gt;
First, let's use &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].&lt;br /&gt;
&lt;br /&gt;
Go to your user directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;, and make a directory to work in.&lt;br /&gt;
&lt;br /&gt;
 cd /private/groups/YOURGROUPNAME/YOURUSERNAME&lt;br /&gt;
 mkdir workflow-test&lt;br /&gt;
 cd workflow-test&lt;br /&gt;
&lt;br /&gt;
Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.&lt;br /&gt;
&lt;br /&gt;
 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl&lt;br /&gt;
&lt;br /&gt;
==Preparing an input file==&lt;br /&gt;
&lt;br /&gt;
Near the top of the WDL file, there's a section like this:&lt;br /&gt;
&lt;br /&gt;
 workflow hello_caller {&lt;br /&gt;
     input {&lt;br /&gt;
         File who&lt;br /&gt;
     }&lt;br /&gt;
&lt;br /&gt;
This means that there is a workflow named &amp;lt;code&amp;gt;hello_caller&amp;lt;/code&amp;gt; in this file, and it takes as input a file variable named &amp;lt;code&amp;gt;who&amp;lt;/code&amp;gt;. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.&lt;br /&gt;
&lt;br /&gt;
So first, we have to make that list of names. Let's make it in &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 echo &amp;quot;Mridula Resurrección&amp;quot; &amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Gershom Šarlota&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Ritchie Ravi&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
&lt;br /&gt;
Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt; that references it by relative path, like this:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./names.txt&amp;quot;}' &amp;gt;inputs.json&lt;br /&gt;
&lt;br /&gt;
Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].&lt;br /&gt;
&lt;br /&gt;
==Testing at small scale on a single machine==&lt;br /&gt;
&lt;br /&gt;
We are now ready to run the workflow!&lt;br /&gt;
&lt;br /&gt;
You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:&lt;br /&gt;
&lt;br /&gt;
 srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i&lt;br /&gt;
&lt;br /&gt;
This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In your new shell, run this Toil command:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner self_test.wdl inputs.json -o local_run&lt;br /&gt;
&lt;br /&gt;
This will, by default, use the &amp;lt;code&amp;gt;single_machine&amp;lt;/code&amp;gt; Toil &amp;quot;batch system&amp;quot; to run all of the workflow's tasks locally. Output will be sent to a new directory named &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
This will print a lot of logging to standard error, and to standard output it will print:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;hello_caller.message_files&amp;quot;: [&amp;quot;local_run/Mridula Resurrecci\u00f3n.txt&amp;quot;, &amp;quot;local_run/Gershom \u0160arlota.txt&amp;quot;, &amp;quot;local_run/Ritchie Ravi.txt&amp;quot;], &amp;quot;hello_caller.messages&amp;quot;: [&amp;quot;Hello, Mridula Resurrecci\u00f3n!&amp;quot;, &amp;quot;Hello, Gershom \u0160arlota!&amp;quot;, &amp;quot;Hello, Ritchie Ravi!&amp;quot;]}&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt; directory will contain the described text files (with Unicode escape sequences like &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt; replaced by their corresponding characters), each containing a greeting for the corresponding person.&lt;br /&gt;
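&lt;br /&gt;
For example, you can print one of the greetings (quoting the filename, since it contains a space):&lt;br /&gt;
&lt;br /&gt;
 cat 'local_run/Ritchie Ravi.txt'&lt;br /&gt;
&lt;br /&gt;
which should show:&lt;br /&gt;
&lt;br /&gt;
 Hello, Ritchie Ravi!&lt;br /&gt;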
&lt;br /&gt;
To leave your interactive Slurm session and return to the head node, use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Running at larger scale==&lt;br /&gt;
&lt;br /&gt;
Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people!&lt;br /&gt;
&lt;br /&gt;
Go get this handy list of people and cut it to length:&lt;br /&gt;
&lt;br /&gt;
 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt&lt;br /&gt;
 head -n100 1000_names.txt &amp;gt;100_names.txt&lt;br /&gt;
&lt;br /&gt;
And make a new inputs file:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./100_names.txt&amp;quot;}' &amp;gt;inputs_big.json&lt;br /&gt;
&lt;br /&gt;
Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.&lt;br /&gt;
&lt;br /&gt;
To run against the Slurm cluster, we need to use the &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the &amp;lt;code&amp;gt;--caching false&amp;lt;/code&amp;gt; option. We will add the &amp;lt;code&amp;gt;--batchLogsDir&amp;lt;/code&amp;gt; option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the &amp;lt;code&amp;gt;-m&amp;lt;/code&amp;gt; option to save the output JSON to a file instead of printing it.&lt;br /&gt;
&lt;br /&gt;
Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the &amp;lt;code&amp;gt;TOIL_SLURM_ARGS&amp;lt;/code&amp;gt; environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in.&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json&lt;br /&gt;
&lt;br /&gt;
This will tick for a while, but eventually you should end up with 100 greeting files in the &amp;lt;code&amp;gt;slurm_run&amp;lt;/code&amp;gt; directory.&lt;br /&gt;
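&lt;br /&gt;
You can sanity-check the result with something like:&lt;br /&gt;
&lt;br /&gt;
 ls slurm_run | wc -l&lt;br /&gt;
&lt;br /&gt;
which should report 100, one greeting file per name.&lt;br /&gt;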
&lt;br /&gt;
=Writing your own workflow=&lt;br /&gt;
&lt;br /&gt;
In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. &lt;br /&gt;
&lt;br /&gt;
==Writing the file==&lt;br /&gt;
&lt;br /&gt;
===Version===&lt;br /&gt;
&lt;br /&gt;
All WDL files need to start with a &amp;lt;code&amp;gt;version&amp;lt;/code&amp;gt; statement (unless they are very old &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; files). Toil supports &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt;, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; and 1.0.&lt;br /&gt;
&lt;br /&gt;
So let's start a new WDL 1.0 workflow. Open up a file named &amp;lt;code&amp;gt;fizzbuzz.wdl&amp;lt;/code&amp;gt; and start with a version statement:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
&lt;br /&gt;
===Workflow Block===&lt;br /&gt;
&lt;br /&gt;
Then, add an empty &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; named &amp;lt;code&amp;gt;FizzBuzz&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Input Block===&lt;br /&gt;
&lt;br /&gt;
Workflows usually need some kind of user input, so let's give our workflow an &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; section.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Notice that each input has a type, a name, and an optional default value. If the type ends in &amp;lt;code&amp;gt;?&amp;lt;/code&amp;gt;, the value is optional, and it may be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.&lt;br /&gt;
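&lt;br /&gt;
For the &amp;lt;code&amp;gt;FizzBuzz&amp;lt;/code&amp;gt; workflow we are building, that means an inputs file only ''has'' to set &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;; the other inputs fall back to their defaults (or, for &amp;lt;code&amp;gt;fizzbuzz_override&amp;lt;/code&amp;gt;, to &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;). For example:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;FizzBuzz.item_count&amp;quot;: 20}&lt;br /&gt;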
&lt;br /&gt;
===Body===&lt;br /&gt;
&lt;br /&gt;
Now we'll start on the body of the workflow, to be inserted just after the inputs section. &lt;br /&gt;
&lt;br /&gt;
The first thing we're going to need to do is create an array of all the numbers up to the &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;. We can do this by calling the WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; function, and assigning the result to an &amp;lt;code&amp;gt;Array[Int]&amp;lt;/code&amp;gt; variable.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
&lt;br /&gt;
WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.&lt;br /&gt;
&lt;br /&gt;
===Scattering===&lt;br /&gt;
&lt;br /&gt;
Once we create an array of all the numbers, we can use a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt; to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a &amp;lt;code&amp;gt;map()&amp;lt;/code&amp;gt; in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; starts at 0.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
     Int one_based = i + 1&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Conditionals===&lt;br /&gt;
&lt;br /&gt;
Inside the body of the scatter, we are going to put some conditionals to determine if we should produce &amp;lt;code&amp;gt;&amp;quot;Fizz&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;Buzz&amp;quot;&amp;lt;/code&amp;gt;, or &amp;lt;code&amp;gt;&amp;quot;FizzBuzz&amp;quot;&amp;lt;/code&amp;gt;. To support our &amp;lt;code&amp;gt;fizzbuzz_override&amp;lt;/code&amp;gt;, we use an array of it and a default value, and use the WDL &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; function to find the first non-null value in that array.&lt;br /&gt;
&lt;br /&gt;
Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; at the end and take advantage of variables from un-executed conditionals being &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Note that WDL supports conditional ''expressions'' with a &amp;lt;code&amp;gt;then&amp;lt;/code&amp;gt; and an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt;, but conditional ''statements'' only have a body, not an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt; branch. If you need an else you will have to check the negated condition.&lt;br /&gt;
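&lt;br /&gt;
For example, a conditional ''expression'' can choose between two values in a single declaration (here with a made-up &amp;lt;code&amp;gt;parity&amp;lt;/code&amp;gt; variable):&lt;br /&gt;
&lt;br /&gt;
 String parity = if (one_based % 2 == 0) then &amp;quot;even&amp;quot; else &amp;quot;odd&amp;quot;&lt;br /&gt;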
&lt;br /&gt;
So first, let's handle the special cases.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Calling Tasks===&lt;br /&gt;
&lt;br /&gt;
Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a &amp;lt;code&amp;gt;${}&amp;lt;/code&amp;gt; substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a &amp;lt;code&amp;gt;stringify_number&amp;lt;/code&amp;gt; task, to be written later.&lt;br /&gt;
&lt;br /&gt;
To call a task (or another workflow), we use a &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt; statement and give it some inputs. Then we can fish the output values out of the task with &amp;lt;code&amp;gt;.&amp;lt;/code&amp;gt; access.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
        call stringify_number {&lt;br /&gt;
            input:&lt;br /&gt;
                the_number = one_based&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We can put the code into the workflow now, and set about writing the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         &lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Writing Tasks===&lt;br /&gt;
&lt;br /&gt;
Our task should go after the workflow in the file. It looks a lot like a workflow except it uses &amp;lt;code&amp;gt;task&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're going to want it to take in an integer &amp;lt;code&amp;gt;the_number&amp;lt;/code&amp;gt;, and we're going to want it to output a string &amp;lt;code&amp;gt;the_string&amp;lt;/code&amp;gt;. So let's fill that in in &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; sections.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     # ???&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now, unlike workflows, tasks can have a &amp;lt;code&amp;gt;command&amp;lt;/code&amp;gt; section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use &amp;lt;code&amp;gt;~{}&amp;lt;/code&amp;gt;, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now we need to capture the result of the command script. The WDL &amp;lt;code&amp;gt;stdout()&amp;lt;/code&amp;gt; returns a WDL &amp;lt;code&amp;gt;File&amp;lt;/code&amp;gt; containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're also going to want to add a &amp;lt;code&amp;gt;runtime&amp;lt;/code&amp;gt; section to our task, to specify resource requirements, and to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command. In a real workflow, you would probably set up optional inputs to let users control each task's resource requirements, but here we will just hardcode them.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:24.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;disks&amp;lt;/code&amp;gt; section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a &amp;lt;code&amp;gt;local-disk&amp;lt;/code&amp;gt; of a certain number of gigabytes, which may suggest that it be &amp;lt;code&amp;gt;SSD&amp;lt;/code&amp;gt; storage.&lt;br /&gt;
&lt;br /&gt;
Then we can put our task into our WDL file:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:24.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Output Block===&lt;br /&gt;
&lt;br /&gt;
Now the only thing missing is a workflow-level &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section. Technically, in WDL 1.0 you aren't supposed to need one, but you do need it in WDL 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to write one. We need to collect all the strings that came out of the different tasks in our scatter into an &amp;lt;code&amp;gt;Array[String]&amp;lt;/code&amp;gt;. We'll add the &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section at the end of the &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; section, above the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
     output {&lt;br /&gt;
         Array[String] fizzbuzz_results = result&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:24.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Because the &amp;lt;code&amp;gt;result&amp;lt;/code&amp;gt; variable is defined inside a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt;, when we reference it outside the scatter it appears as an &amp;lt;code&amp;gt;Array[String]&amp;lt;/code&amp;gt;, with one element per scatter iteration.&lt;br /&gt;
&lt;br /&gt;
==Running the Workflow==&lt;br /&gt;
&lt;br /&gt;
Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;FizzBuzz.item_count&amp;quot;: 20}' &amp;gt;fizzbuzz.json&lt;br /&gt;
&lt;br /&gt;
Then run it on the cluster with Toil:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --slurmTime 00:10:00 --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
Or locally:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
=Debugging Workflows=&lt;br /&gt;
Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.&lt;br /&gt;
&lt;br /&gt;
==Restarting the Workflow==&lt;br /&gt;
&lt;br /&gt;
If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have fixed, and you ran the workflow with &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; set manually to a directory that persists between attempts, you can add &amp;lt;code&amp;gt;--restart&amp;lt;/code&amp;gt; to your workflow command and make Toil try again. It will pick up from where it left off and rerun any failed tasks and then the rest of the workflow.&lt;br /&gt;
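&lt;br /&gt;
For example, if you ran the FizzBuzz workflow from earlier on this page and it failed partway through, you could resume it by re-running the same command with &amp;lt;code&amp;gt;--restart&amp;lt;/code&amp;gt; added:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --slurmTime 00:10:00 --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json --restart&lt;br /&gt;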
&lt;br /&gt;
This ''will not'' pick up any changes to your WDL source code files; those are read once at the beginning and not re-read on restart.&lt;br /&gt;
&lt;br /&gt;
If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques.&lt;br /&gt;
&lt;br /&gt;
==Debugging Options==&lt;br /&gt;
When debugging a workflow, make sure to run the workflow with &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt;, to set the log level to &amp;lt;code&amp;gt;DEBUG&amp;lt;/code&amp;gt;, and with &amp;lt;code&amp;gt;--jobStore /some/path/to/a/shared/directory/it/can/create&amp;lt;/code&amp;gt; so that the stored files shipped between jobs are in a place you can access them.&lt;br /&gt;
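&lt;br /&gt;
For example, a debugging run of the FizzBuzz workflow from earlier might look like this (with the job store path adjusted to a shared directory you can write to):&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --logDebug --jobStore /private/groups/YOURGROUPNAME/YOURUSERNAME/debug_store fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;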
&lt;br /&gt;
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:&lt;br /&gt;
&lt;br /&gt;
 =========&amp;gt;&lt;br /&gt;
         Toil job log is here&lt;br /&gt;
 &amp;lt;=========&lt;br /&gt;
&lt;br /&gt;
Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this.&lt;br /&gt;
&lt;br /&gt;
==Reading the Log==&lt;br /&gt;
&lt;br /&gt;
When a WDL workflow fails, you are likely to see a message like this:&lt;br /&gt;
&lt;br /&gt;
 WDL.runtime.error.CommandFailed: task command failed with exit status 1&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism&lt;br /&gt;
&lt;br /&gt;
This means that the command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code. This happens either when the command is written wrong, or when the tool you are trying to run detects and reports an error.&lt;br /&gt;
&lt;br /&gt;
Go up higher in the log until you find lines that look like:&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows:&lt;br /&gt;
&lt;br /&gt;
And&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows:&lt;br /&gt;
&lt;br /&gt;
These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.&lt;br /&gt;
&lt;br /&gt;
If you would like individual task logs to be saved separately for later reference, you can use the &amp;lt;code&amp;gt;--writeLogs&amp;lt;/code&amp;gt; option to specify a directory to store them. For more information, see [https://toil.readthedocs.io/en/latest/wdl/running.html#managing-workflow-logs the Toil documentation of workflow task logs].&lt;br /&gt;
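&lt;br /&gt;
For example (&amp;lt;code&amp;gt;./task_logs&amp;lt;/code&amp;gt; is just a directory name chosen for this sketch):&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --writeLogs ./task_logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;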
&lt;br /&gt;
==Reproducing Problems==&lt;br /&gt;
&lt;br /&gt;
When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this.&lt;br /&gt;
&lt;br /&gt;
=== Automatically Fetching Input Files ===&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;toil debug-job&amp;lt;/code&amp;gt; command has a &amp;lt;code&amp;gt;--retrieveTaskDirectory&amp;lt;/code&amp;gt; option that lets you dump out a directory with all the files that a failing WDL task would use. You can use it like:&lt;br /&gt;
&lt;br /&gt;
 toil debug-job ./jobstore WDLTaskJob --retrieveTaskDirectory dumpdir&lt;br /&gt;
&lt;br /&gt;
If there are multiple failing tasks, you might need to replace &amp;lt;code&amp;gt;WDLTaskJob&amp;lt;/code&amp;gt; with the name of one of the failing jobs. See [https://toil.readthedocs.io/en/latest/running/debugging.html#fetching-job-inputs the Toil documentation on retrieving files] for more on how to use this command.&lt;br /&gt;
&lt;br /&gt;
=== Manually Finding Input Files ===&lt;br /&gt;
&lt;br /&gt;
If you can't use &amp;lt;code&amp;gt;toil debug-job&amp;lt;/code&amp;gt;, you might need to manually dig through the job store for files. In the log of your failing Toil task, look for lines like this:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&amp;lt;/code&amp;gt; part is a Toil file ID, and it is a relative path from your &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; value to where the file is stored on disk. So if you ran the workflow with &amp;lt;code&amp;gt;--jobStore /private/groups/patenlab/anovak/jobstore&amp;lt;/code&amp;gt;, you would look for this file at:&lt;br /&gt;
&lt;br /&gt;
 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&lt;br /&gt;
&lt;br /&gt;
==More Ways of Finding Files==&lt;br /&gt;
&lt;br /&gt;
Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the &amp;lt;code&amp;gt;find&amp;lt;/code&amp;gt; command to try and find the files by name. For example, if you want to look at &amp;lt;code&amp;gt;Sample.bam&amp;lt;/code&amp;gt;, you can look for it like this:&lt;br /&gt;
&lt;br /&gt;
 find /path/to/the/jobstore -name &amp;quot;Sample.bam&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
You can take the &amp;lt;code&amp;gt;toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt; URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this:&lt;br /&gt;
&lt;br /&gt;
 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
Then you can take the part after the last colon, &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt;, and that is the path relative to the job store where this file can be found.&lt;br /&gt;
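&lt;br /&gt;
If you would rather not paste file IDs into a web page, you can decode the URI locally with Python's standard &amp;lt;code&amp;gt;urllib.parse.unquote&amp;lt;/code&amp;gt;, for example:&lt;br /&gt;
&lt;br /&gt;
```shell
# URL-decode the toilfile: URI from the log line above
python3 -c 'import urllib.parse, sys; print(urllib.parse.unquote(sys.argv[1]))' toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam
```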
&lt;br /&gt;
==Using Development Versions of Toil==&lt;br /&gt;
&lt;br /&gt;
Sometimes, bugs will be fixed in the development version of Toil, but not released yet. To try the current development version of Toil, you can install it like this:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
If you want to use a particular branch or commit, like &amp;lt;code&amp;gt;aaa451b320fc115b3563ced25cb501301cf86f90&amp;lt;/code&amp;gt;, you can do:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
==Frequently Asked Questions==&lt;br /&gt;
&lt;br /&gt;
===I am getting warnings about &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt;===&lt;br /&gt;
You may be seeing warnings that &amp;lt;code&amp;gt;XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should upgrade Toil. [https://github.com/DataBiosphere/toil/commit/ff6bf60ab798a675c20156c749817c4313644b96 Since version 6.1.0], Toil no longer issues this warning and just puts up with bad &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt; settings.&lt;br /&gt;
&lt;br /&gt;
===Toil said it was &amp;lt;code&amp;gt;Redirecting logging&amp;lt;/code&amp;gt; somewhere, but I can't find that file!===&lt;br /&gt;
The Toil worker process for each job will say that it is &amp;lt;code&amp;gt;Redirecting logging to /data/tmp/somewhere/worker_log.txt&amp;lt;/code&amp;gt;, and when running in single machine mode these messages go to the main Toil log.&lt;br /&gt;
&lt;br /&gt;
The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt; option to Toil.&lt;br /&gt;
&lt;br /&gt;
If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.&lt;br /&gt;
&lt;br /&gt;
===How do I delete files in WDL?===&lt;br /&gt;
WDL doesn't have a built-in way to delete files; if you run a task that deletes a file, it will still exist in Toil's job store storage.&lt;br /&gt;
&lt;br /&gt;
Toil [https://github.com/DataBiosphere/toil/commit/2de6eea2cc2e688b53062a98687445f0cca56669 recently gained support] for deleting files at the ''end'' of WDL workflows. So if you have a large file that you only need for part of your workflow, consider writing the part that creates and uses it as a separate sub-&amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; and invoking it with &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt;. Then the file will be cleaned up when the child workflow ends, leaving more space for files created in the parent workflow.&lt;br /&gt;
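&lt;br /&gt;
As a sketch (the file name &amp;lt;code&amp;gt;heavy_step.wdl&amp;lt;/code&amp;gt; and the workflow names here are hypothetical), the parent workflow would invoke the child workflow like this:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 import &amp;quot;heavy_step.wdl&amp;quot; as heavy&lt;br /&gt;
 workflow Parent {&lt;br /&gt;
     # Large files created inside the child workflow are cleaned up&lt;br /&gt;
     # when it finishes, before the parent workflow ends.&lt;br /&gt;
     call heavy.heavy_step&lt;br /&gt;
 }&lt;br /&gt;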
&lt;br /&gt;
=Additional WDL resources=&lt;br /&gt;
&lt;br /&gt;
For more information on writing and running WDL workflows, see:&lt;br /&gt;
&lt;br /&gt;
* [https://docs.openwdl.org/en/stable/ The WDL documentation]&lt;br /&gt;
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The &amp;quot;Learn WDL&amp;quot; video course on YouTube]&lt;br /&gt;
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=666</id>
		<title>Phoenix WDL Tutorial</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=666"/>
		<updated>2025-03-18T20:46:04Z</updated>

		<summary type="html">&lt;p&gt;Anovak: /* Writing your own workflow */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Tutorial: Getting Started with WDL Workflows on Phoenix'''&lt;br /&gt;
&lt;br /&gt;
Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like the Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements and which pieces depend on which other pieces, making your analyses reproducible by people other than you.&lt;br /&gt;
&lt;br /&gt;
Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.&lt;br /&gt;
&lt;br /&gt;
This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.&lt;br /&gt;
&lt;br /&gt;
=Phoenix Cluster Setup=&lt;br /&gt;
&lt;br /&gt;
Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.&lt;br /&gt;
&lt;br /&gt;
==Getting VPN access==&lt;br /&gt;
&lt;br /&gt;
We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.&lt;br /&gt;
&lt;br /&gt;
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.&lt;br /&gt;
&lt;br /&gt;
==Connecting to Phoenix==&lt;br /&gt;
&lt;br /&gt;
Once you have VPN access, you can connect to any of the machines with access to the Phoenix cluster. These interactive nodes are fairly large machines that can do some work locally, but you will still want to run larger workflows on the actual cluster. For this tutorial, we will use &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt; as our login node.&lt;br /&gt;
&lt;br /&gt;
To connect to the cluster:&lt;br /&gt;
&lt;br /&gt;
1. Connect to the VPN.&lt;br /&gt;
2. SSH to &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt;. At the command line, run:&lt;br /&gt;
&lt;br /&gt;
 ssh emerald.prism&lt;br /&gt;
&lt;br /&gt;
If your username on the cluster (say, &amp;lt;code&amp;gt;flastname&amp;lt;/code&amp;gt;) is different than your username on your computer (which might be &amp;lt;code&amp;gt;firstname&amp;lt;/code&amp;gt;), you might instead have to run:&lt;br /&gt;
&lt;br /&gt;
 ssh flastname@emerald.prism&lt;br /&gt;
&lt;br /&gt;
The first time you connect, you will see a message like:&lt;br /&gt;
&lt;br /&gt;
 The authenticity of host 'emerald.prism (10.50.1.67)' can't be established.&lt;br /&gt;
 ED25519 key fingerprint is SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI.&lt;br /&gt;
 This key is not known by any other names.&lt;br /&gt;
 Are you sure you want to continue connecting (yes/no/[fingerprint])?&lt;br /&gt;
&lt;br /&gt;
This is your computer asking you to help it decide if it is talking to the genuine &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt;, and not an imposter. You will want to make sure that the &amp;quot;key fingerprint&amp;quot; is indeed &amp;lt;code&amp;gt;SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI&amp;lt;/code&amp;gt;. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type &amp;lt;code&amp;gt;yes&amp;lt;/code&amp;gt; to accept and remember that the server is who it says it is.&lt;br /&gt;
&lt;br /&gt;
==Installing Toil with WDL support==&lt;br /&gt;
&lt;br /&gt;
Once you are on the head node, you can install Toil, a program for running workflows.&lt;br /&gt;
&lt;br /&gt;
Toil is written in Python, and the modern way to install Python command line tools is with pipx. So [https://pipx.pypa.io/latest/installation/ install pipx]:&lt;br /&gt;
&lt;br /&gt;
 python3 -m pip install --user pipx&lt;br /&gt;
 python3 -m pipx ensurepath&lt;br /&gt;
&lt;br /&gt;
This may instruct you to '''log out and log back in''' or take some other action to adopt the new &amp;lt;code&amp;gt;PATH&amp;lt;/code&amp;gt; settings.&lt;br /&gt;
&lt;br /&gt;
When installing Toil, you need to specify that you want WDL support. To do this, you can run:&lt;br /&gt;
&lt;br /&gt;
 pipx install 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
If you also want to use AWS S3 &amp;lt;code&amp;gt;s3://&amp;lt;/code&amp;gt; and/or Google &amp;lt;code&amp;gt;gs://&amp;lt;/code&amp;gt; URLs for data, you will need to also install Toil with the &amp;lt;code&amp;gt;aws&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;google&amp;lt;/code&amp;gt; extras, respectively:&lt;br /&gt;
&lt;br /&gt;
 pipx install 'toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
To change what extras are used when you have an existing Toil installation, you will need to use the &amp;lt;code&amp;gt;--force&amp;lt;/code&amp;gt; option.&lt;br /&gt;
&lt;br /&gt;
This will install Toil in the &amp;lt;code&amp;gt;.local&amp;lt;/code&amp;gt; directory inside your home directory, which we write as &amp;lt;code&amp;gt;~/.local&amp;lt;/code&amp;gt;. The program to run WDL workflows, &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, will be at &amp;lt;code&amp;gt;~/.local/bin/toil-wdl-runner&amp;lt;/code&amp;gt;. The &amp;lt;code&amp;gt;python3 -m pipx ensurepath&amp;lt;/code&amp;gt; command should have added the &amp;lt;code&amp;gt;~/.local/bin&amp;lt;/code&amp;gt; directory to your &amp;lt;code&amp;gt;PATH&amp;lt;/code&amp;gt; environment variable, to ensure you can find these commands.&lt;br /&gt;
&lt;br /&gt;
If you see something from &amp;lt;code&amp;gt;pipx&amp;lt;/code&amp;gt; like:&lt;br /&gt;
&lt;br /&gt;
     - cwltoil (symlink missing or pointing to unexpected location)&lt;br /&gt;
&lt;br /&gt;
Then &amp;lt;code&amp;gt;pipx uninstall toil&amp;lt;/code&amp;gt;, remove the offending file from &amp;lt;code&amp;gt;~/.local/bin&amp;lt;/code&amp;gt;, and try again.&lt;br /&gt;
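&lt;br /&gt;
For example, if &amp;lt;code&amp;gt;cwltoil&amp;lt;/code&amp;gt; is the offending file (adjust the name to whatever &amp;lt;code&amp;gt;pipx&amp;lt;/code&amp;gt; complained about):&lt;br /&gt;
&lt;br /&gt;
 pipx uninstall toil&lt;br /&gt;
 rm ~/.local/bin/cwltoil&lt;br /&gt;
 pipx install 'toil[wdl]'&lt;br /&gt;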
&lt;br /&gt;
To make sure it worked, you can run:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --help&lt;br /&gt;
&lt;br /&gt;
If everything worked correctly, it will print a long list of the various option flags that the &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; command supports.&lt;br /&gt;
&lt;br /&gt;
If you ever want to upgrade Toil to a new release, you can run &amp;lt;code&amp;gt;pipx upgrade toil&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Configuring your Phoenix Environment==&lt;br /&gt;
&lt;br /&gt;
'''Do not try and store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. Usually you would end up with &amp;lt;code&amp;gt;/private/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Remember this path; we will need it later.&lt;br /&gt;
&lt;br /&gt;
==Configuring Toil for Phoenix==&lt;br /&gt;
&lt;br /&gt;
Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. Since these images can be large, and the home directory quota is only 30 GB, you might not be able to keep them in your home directory.&lt;br /&gt;
&lt;br /&gt;
We would like to be able to store these on the cluster's large storage array, under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. However, Toil needs to use file locks in these directories to prevent simultaneous Singularity calls from producing internal Singularity errors, and Ceph currently has [https://tracker.ceph.com/issues/65607 a bug where these file locking operations can freeze the Ceph servers].&lt;br /&gt;
&lt;br /&gt;
If you have '''a small number of container images''' that will fit in your home directory, you can keep them there. [https://github.com/DataBiosphere/toil/commit/cb0b291bb7f6212bfe69221dd9f09d72f83e92fb Since Toil 6.1.0], this is the default behavior and you don't need to do anything. (Unless you previously set &amp;lt;code&amp;gt;SINGULARITY_CACHEDIR&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;MINIWDL__SINGULARITY__IMAGE_CACHE&amp;lt;/code&amp;gt;, in which case you need to unset them.)&lt;br /&gt;
&lt;br /&gt;
'''If you don't have room in your home directory''' for container images, currently the recommended approach is to use node-local storage under &amp;lt;code&amp;gt;/data/tmp&amp;lt;/code&amp;gt;. This results in each node pulling each container image, but images will be saved across workflows.&lt;br /&gt;
&lt;br /&gt;
You can set that up for all your workflows with:&lt;br /&gt;
&lt;br /&gt;
 echo 'export SINGULARITY_CACHEDIR=&amp;quot;/data/tmp/$(whoami)/cache/singularity&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE=&amp;quot;/data/tmp/$(whoami)/cache/miniwdl&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
Then '''log out and log back in again''', to apply the changes.&lt;br /&gt;
&lt;br /&gt;
=Running an existing workflow=&lt;br /&gt;
&lt;br /&gt;
First, let's use &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].&lt;br /&gt;
&lt;br /&gt;
To start, go to your user directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt; and make a directory to work in.&lt;br /&gt;
&lt;br /&gt;
 cd /private/groups/YOURGROUPNAME/YOURUSERNAME&lt;br /&gt;
 mkdir workflow-test&lt;br /&gt;
 cd workflow-test&lt;br /&gt;
&lt;br /&gt;
Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.&lt;br /&gt;
&lt;br /&gt;
 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl&lt;br /&gt;
&lt;br /&gt;
==Preparing an input file==&lt;br /&gt;
&lt;br /&gt;
Near the top of the WDL file, there's a section like this:&lt;br /&gt;
&lt;br /&gt;
 workflow hello_caller {&lt;br /&gt;
     input {&lt;br /&gt;
         File who&lt;br /&gt;
     }&lt;br /&gt;
&lt;br /&gt;
This means that there is a workflow named &amp;lt;code&amp;gt;hello_caller&amp;lt;/code&amp;gt; in this file, and it takes as input a file variable named &amp;lt;code&amp;gt;who&amp;lt;/code&amp;gt;. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.&lt;br /&gt;
&lt;br /&gt;
So first, we have to make that list of names. Let's make it in &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 echo &amp;quot;Mridula Resurrección&amp;quot; &amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Gershom Šarlota&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Ritchie Ravi&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
&lt;br /&gt;
Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt; that references it by relative path, like this:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./names.txt&amp;quot;}' &amp;gt;inputs.json&lt;br /&gt;
&lt;br /&gt;
Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].&lt;br /&gt;
&lt;br /&gt;
==Testing at small scale on a single machine==&lt;br /&gt;
&lt;br /&gt;
We are now ready to run the workflow!&lt;br /&gt;
&lt;br /&gt;
You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:&lt;br /&gt;
&lt;br /&gt;
 srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i&lt;br /&gt;
&lt;br /&gt;
This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In your new shell, run this Toil command:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner self_test.wdl inputs.json -o local_run&lt;br /&gt;
&lt;br /&gt;
This will, by default, use the &amp;lt;code&amp;gt;single_machine&amp;lt;/code&amp;gt; Toil &amp;quot;batch system&amp;quot; to run all of the workflow's tasks locally. Output will be sent to a new directory named &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
This will print a lot of logging to standard error, and to standard output it will print:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;hello_caller.message_files&amp;quot;: [&amp;quot;local_run/Mridula Resurrecci\u00f3n.txt&amp;quot;, &amp;quot;local_run/Gershom \u0160arlota.txt&amp;quot;, &amp;quot;local_run/Ritchie Ravi.txt&amp;quot;], &amp;quot;hello_caller.messages&amp;quot;: [&amp;quot;Hello, Mridula Resurrecci\u00f3n!&amp;quot;, &amp;quot;Hello, Gershom \u0160arlota!&amp;quot;, &amp;quot;Hello, Ritchie Ravi!&amp;quot;]}&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt; directory will contain the described text files (with Unicode escape sequences like &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt; replaced by their corresponding characters), each containing a greeting for the corresponding person.&lt;br /&gt;
&lt;br /&gt;
To leave your interactive Slurm session and return to the head node, use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Running at larger scale==&lt;br /&gt;
&lt;br /&gt;
Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people!&lt;br /&gt;
&lt;br /&gt;
Go get this handy list of people and cut it to length:&lt;br /&gt;
&lt;br /&gt;
 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt&lt;br /&gt;
 head -n100 1000_names.txt &amp;gt;100_names.txt&lt;br /&gt;
&lt;br /&gt;
And make a new inputs file:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./100_names.txt&amp;quot;}' &amp;gt;inputs_big.json&lt;br /&gt;
&lt;br /&gt;
Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.&lt;br /&gt;
&lt;br /&gt;
To run against the Slurm cluster, we need to use the &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the &amp;lt;code&amp;gt;--caching false&amp;lt;/code&amp;gt; option. We will add the &amp;lt;code&amp;gt;--batchLogsDir&amp;lt;/code&amp;gt; option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the &amp;lt;code&amp;gt;-m&amp;lt;/code&amp;gt; option to save the output JSON to a file instead of printing it.&lt;br /&gt;
&lt;br /&gt;
Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the &amp;lt;code&amp;gt;TOIL_SLURM_ARGS&amp;lt;/code&amp;gt; environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in.&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json&lt;br /&gt;
&lt;br /&gt;
This will tick for a while, but eventually you should end up with 100 greeting files in the &amp;lt;code&amp;gt;slurm_run&amp;lt;/code&amp;gt; directory.&lt;br /&gt;
&lt;br /&gt;
=Writing your own workflow=&lt;br /&gt;
&lt;br /&gt;
In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. &lt;br /&gt;
&lt;br /&gt;
==Writing the file==&lt;br /&gt;
&lt;br /&gt;
===Version===&lt;br /&gt;
&lt;br /&gt;
All WDL files need to start with a &amp;lt;code&amp;gt;version&amp;lt;/code&amp;gt; statement (unless they are very old &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; files). Toil supports &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt;, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; and 1.0.&lt;br /&gt;
&lt;br /&gt;
So let's start a new WDL 1.0 workflow. Open up a file named &amp;lt;code&amp;gt;fizzbuzz.wdl&amp;lt;/code&amp;gt; and start with a version statement:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
&lt;br /&gt;
===Workflow Block===&lt;br /&gt;
&lt;br /&gt;
Then, add an empty &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; named &amp;lt;code&amp;gt;FizzBuzz&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Input Block===&lt;br /&gt;
&lt;br /&gt;
Workflows usually need some kind of user input, so let's give our workflow an &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; section.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Notice that each input has a type, a name, and an optional default value. If the type ends in &amp;lt;code&amp;gt;?&amp;lt;/code&amp;gt;, the value is optional, and it may be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.&lt;br /&gt;
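&lt;br /&gt;
For example, an inputs file for this workflow (a sketch; the values are arbitrary) ''must'' set &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;, and ''may'' override the defaulted or optional inputs:&lt;br /&gt;
&lt;br /&gt;
 {&lt;br /&gt;
     &amp;quot;FizzBuzz.item_count&amp;quot;: 15,&lt;br /&gt;
     &amp;quot;FizzBuzz.to_fizz&amp;quot;: 4&lt;br /&gt;
 }&lt;br /&gt;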
&lt;br /&gt;
===Body===&lt;br /&gt;
&lt;br /&gt;
Now we'll start on the body of the workflow, to be inserted just after the inputs section. &lt;br /&gt;
&lt;br /&gt;
The first thing we're going to need to do is create an array of all the numbers up to the &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;. We can do this by calling the WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; function, and assigning the result to an &amp;lt;code&amp;gt;Array[Int]&amp;lt;/code&amp;gt; variable.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
&lt;br /&gt;
WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.&lt;br /&gt;
&lt;br /&gt;
===Scattering===&lt;br /&gt;
&lt;br /&gt;
Once we create an array of all the numbers, we can use a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt; to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a &amp;lt;code&amp;gt;map()&amp;lt;/code&amp;gt; in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; starts at 0.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
     Int one_based = i + 1&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Conditionals===&lt;br /&gt;
&lt;br /&gt;
Inside the body of the scatter, we are going to put some conditionals to determine if we should produce &amp;lt;code&amp;gt;&amp;quot;Fizz&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;Buzz&amp;quot;&amp;lt;/code&amp;gt;, or &amp;lt;code&amp;gt;&amp;quot;FizzBuzz&amp;quot;&amp;lt;/code&amp;gt;. To support our &amp;lt;code&amp;gt;fizzbuzz_override&amp;lt;/code&amp;gt;, we put it and a default value into an array, and use the WDL &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; function to find the first non-null value in that array.&lt;br /&gt;
&lt;br /&gt;
Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; at the end and take advantage of variables from un-executed conditionals being &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Note that WDL supports conditional ''expressions'' with a &amp;lt;code&amp;gt;then&amp;lt;/code&amp;gt; and an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt;, but conditional ''statements'' only have a body, not an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt; branch. If you need an else you will have to check the negated condition.&lt;br /&gt;
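&lt;br /&gt;
For example, a conditional ''expression'' can choose between two values in a single declaration (a sketch; &amp;lt;code&amp;gt;parity&amp;lt;/code&amp;gt; is a made-up variable name):&lt;br /&gt;
&lt;br /&gt;
 String parity = if (one_based % 2 == 0) then &amp;quot;even&amp;quot; else &amp;quot;odd&amp;quot;&lt;br /&gt;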
&lt;br /&gt;
So first, let's handle the special cases.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Calling Tasks===&lt;br /&gt;
&lt;br /&gt;
Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a &amp;lt;code&amp;gt;${}&amp;lt;/code&amp;gt; substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a &amp;lt;code&amp;gt;stringify_number&amp;lt;/code&amp;gt; task, to be written later.&lt;br /&gt;
&lt;br /&gt;
To call a task (or another workflow), we use a &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt; statement and give it some inputs. Then we can fish the output values out of the task with &amp;lt;code&amp;gt;.&amp;lt;/code&amp;gt; access. We only make the call for the normal numbers, where we don't produce a noise instead.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
        call stringify_number {&lt;br /&gt;
            input:&lt;br /&gt;
                the_number = one_based&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We can put the code into the workflow now, and set about writing the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         &lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Writing Tasks===&lt;br /&gt;
&lt;br /&gt;
Our task should go after the workflow in the file. It looks a lot like a workflow except it uses &amp;lt;code&amp;gt;task&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're going to want it to take in an integer &amp;lt;code&amp;gt;the_number&amp;lt;/code&amp;gt;, and we're going to want it to output a string &amp;lt;code&amp;gt;the_string&amp;lt;/code&amp;gt;. So let's fill that in in &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; sections.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     # ???&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now, unlike workflows, tasks can have a &amp;lt;code&amp;gt;command&amp;lt;/code&amp;gt; section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use &amp;lt;code&amp;gt;~{}&amp;lt;/code&amp;gt;, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now we need to capture the result of the command script. The WDL &amp;lt;code&amp;gt;stdout()&amp;lt;/code&amp;gt; returns a WDL &amp;lt;code&amp;gt;File&amp;lt;/code&amp;gt; containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're also going to want to add a &amp;lt;code&amp;gt;runtime&amp;lt;/code&amp;gt; section to our task, to specify resource requirements. We'll also tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;disks&amp;lt;/code&amp;gt; section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a &amp;lt;code&amp;gt;local-disk&amp;lt;/code&amp;gt; of a certain number of gigabytes, which may suggest that it be &amp;lt;code&amp;gt;SSD&amp;lt;/code&amp;gt; storage.&lt;br /&gt;
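&lt;br /&gt;
If you do want callers to be able to adjust a task's resources, one common pattern (a sketch; the &amp;lt;code&amp;gt;mem_gb&amp;lt;/code&amp;gt; input name is made up) is to expose them as task inputs with defaults:&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
         # Caller-adjustable memory requirement, in GB&lt;br /&gt;
         Int mem_gb = 1&lt;br /&gt;
     }&lt;br /&gt;
     # command and output sections as before...&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: mem_gb + &amp;quot; GB&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;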
&lt;br /&gt;
Then we can put our task into our WDL file:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Output Block===&lt;br /&gt;
&lt;br /&gt;
Now the only thing missing is a workflow-level &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1 and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an &amp;lt;code&amp;gt;Array[String]&amp;lt;/code&amp;gt;. We'll add the &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section at the end of the &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; section, above the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
     output {&lt;br /&gt;
        Array[String] fizzbuzz_results = result&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Because the &amp;lt;code&amp;gt;result&amp;lt;/code&amp;gt; variable is defined inside a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt;, when we reference it outside the scatter we see it as being an array.&lt;br /&gt;
&lt;br /&gt;
==Running the Workflow==&lt;br /&gt;
&lt;br /&gt;
Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;FizzBuzz.item_count&amp;quot;: 20}' &amp;gt;fizzbuzz.json&lt;br /&gt;
&lt;br /&gt;
Then run it on the cluster with Toil:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --slurmTime 00:10:00 --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
Or locally:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
=Debugging Workflows=&lt;br /&gt;
Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.&lt;br /&gt;
&lt;br /&gt;
==Restarting the Workflow==&lt;br /&gt;
&lt;br /&gt;
If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have fixed, and you ran the workflow with &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; set manually to a directory that persists between attempts, you can add &amp;lt;code&amp;gt;--restart&amp;lt;/code&amp;gt; to your workflow command and make Toil try again. It will pick up from where it left off and rerun any failed tasks and then the rest of the workflow.&lt;br /&gt;
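&lt;br /&gt;
For example, to resume the FizzBuzz run from above after a transient failure (assuming the same job store and options as the original command), you could run:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --slurmTime 00:10:00 --caching false --batchLogsDir ./logs --restart fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;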
&lt;br /&gt;
This ''will not'' pick up any changes to your WDL source code files; those are read once at the beginning and not re-read on restart.&lt;br /&gt;
&lt;br /&gt;
If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques.&lt;br /&gt;
&lt;br /&gt;
==Debugging Options==&lt;br /&gt;
When debugging a workflow, make sure to run the workflow with &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt;, to set the log level to &amp;lt;code&amp;gt;DEBUG&amp;lt;/code&amp;gt;, and with &amp;lt;code&amp;gt;--jobStore /some/path/to/a/shared/directory/it/can/create&amp;lt;/code&amp;gt; so that the stored files shipped between jobs are in a place you can access them.&lt;br /&gt;
&lt;br /&gt;
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:&lt;br /&gt;
&lt;br /&gt;
 =========&amp;gt;&lt;br /&gt;
         Toil job log is here&lt;br /&gt;
 &amp;lt;=========&lt;br /&gt;
&lt;br /&gt;
Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this.&lt;br /&gt;
&lt;br /&gt;
==Reading the Log==&lt;br /&gt;
&lt;br /&gt;
When a WDL workflow fails, you are likely to see a message like this:&lt;br /&gt;
&lt;br /&gt;
 WDL.runtime.error.CommandFailed: task command failed with exit status 1&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism&lt;br /&gt;
&lt;br /&gt;
This means that the command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code. That happens either when the command is written wrong, or when the tool you are trying to run detects and reports an error.&lt;br /&gt;
&lt;br /&gt;
Go up higher in the log until you find lines that look like:&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows:&lt;br /&gt;
&lt;br /&gt;
And&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows:&lt;br /&gt;
&lt;br /&gt;
These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.&lt;br /&gt;
&lt;br /&gt;
If you would like individual task logs to be saved separately for later reference, you can use the &amp;lt;code&amp;gt;--writeLogs&amp;lt;/code&amp;gt; option to specify a directory to store them. For more information, see [https://toil.readthedocs.io/en/latest/wdl/running.html#managing-workflow-logs the Toil documentation of workflow task logs].&lt;br /&gt;
&lt;br /&gt;
==Reproducing Problems==&lt;br /&gt;
&lt;br /&gt;
When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this.&lt;br /&gt;
&lt;br /&gt;
=== Automatically Fetching Input Files ===&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;toil debug-job&amp;lt;/code&amp;gt; command has a &amp;lt;code&amp;gt;--retrieveTaskDirectory&amp;lt;/code&amp;gt; option that lets you dump out a directory with all the files that a failing WDL task would use. You can use it like:&lt;br /&gt;
&lt;br /&gt;
 toil debug-job ./jobstore WDLTaskJob --retrieveTaskDirectory dumpdir&lt;br /&gt;
&lt;br /&gt;
If there are multiple failing tasks, you might need to replace &amp;lt;code&amp;gt;WDLTaskJob&amp;lt;/code&amp;gt; with the name of one of the failing jobs. See [https://toil.readthedocs.io/en/latest/running/debugging.html#fetching-job-inputs the Toil documentation on retrieving files] for more on how to use this command.&lt;br /&gt;
&lt;br /&gt;
=== Manually Finding Input Files ===&lt;br /&gt;
&lt;br /&gt;
If you can't use &amp;lt;code&amp;gt;toil debug-job&amp;lt;/code&amp;gt;, you might need to manually dig through the job store for files. In the log of your failing Toil task, look for lines like this:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&amp;lt;/code&amp;gt; part is a Toil file ID, and it is a relative path from your &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; value to where the file is stored on disk. So if you ran the workflow with &amp;lt;code&amp;gt;--jobStore /private/groups/patenlab/anovak/jobstore&amp;lt;/code&amp;gt;, you would look for this file at:&lt;br /&gt;
&lt;br /&gt;
 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&lt;br /&gt;
&lt;br /&gt;
==More Ways of Finding Files==&lt;br /&gt;
&lt;br /&gt;
Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the &amp;lt;code&amp;gt;find&amp;lt;/code&amp;gt; command to try and find the files by name. For example, if you want to look at &amp;lt;code&amp;gt;Sample.bam&amp;lt;/code&amp;gt;, you can look for it like this:&lt;br /&gt;
&lt;br /&gt;
 find /path/to/the/jobstore -name &amp;quot;Sample.bam&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
You can take the &amp;lt;code&amp;gt;toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt; URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this:&lt;br /&gt;
&lt;br /&gt;
 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
Then you can take the part after the last colon, &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt;, and that is the path relative to the job store where this file can be found.&lt;br /&gt;
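&lt;br /&gt;
If you'd rather not paste the URI into a web page, Python's standard &amp;lt;code&amp;gt;urllib.parse&amp;lt;/code&amp;gt; module can do the same decoding locally (a quick sketch, using the URI from the log line above):&lt;br /&gt;

```python
from urllib.parse import unquote

# URL-decode a toilfile: URI from the Toil log to recover the
# path of the file relative to the job store directory.
uri = "toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam"
decoded = unquote(uri)
# The job-store-relative path is everything after the last colon.
relative_path = decoded.rsplit(":", 1)[-1]
print(relative_path)
```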
&lt;br /&gt;
==Using Development Versions of Toil==&lt;br /&gt;
&lt;br /&gt;
Sometimes, bugs will be fixed in the development version of Toil, but not released yet. To try the current development version of Toil, you can install it like this:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
If you want to use a particular branch or commit, like &amp;lt;code&amp;gt;aaa451b320fc115b3563ced25cb501301cf86f90&amp;lt;/code&amp;gt;, you can do:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
==Frequently Asked Questions==&lt;br /&gt;
&lt;br /&gt;
===I am getting warnings about &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt;===&lt;br /&gt;
You may be seeing warnings that &amp;lt;code&amp;gt;XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should upgrade Toil. [https://github.com/DataBiosphere/toil/commit/ff6bf60ab798a675c20156c749817c4313644b96 Since Toil 6.1.0], Toil no longer issues this warning, and just puts up with bad &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt; settings.&lt;br /&gt;
&lt;br /&gt;
===Toil said it was &amp;lt;code&amp;gt;Redirecting logging&amp;lt;/code&amp;gt; somewhere, but I can't find that file!===&lt;br /&gt;
The Toil worker process for each job will say that it is &amp;lt;code&amp;gt;Redirecting logging to /data/tmp/somewhere/worker_log.txt&amp;lt;/code&amp;gt;, and when running in single machine mode these messages go to the main Toil log.&lt;br /&gt;
&lt;br /&gt;
The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt; option to Toil.&lt;br /&gt;
&lt;br /&gt;
If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.&lt;br /&gt;
&lt;br /&gt;
===How do I delete files in WDL?===&lt;br /&gt;
WDL doesn't have a built-in way to delete files; if you run a task that deletes a file, it will still exist in Toil's job store storage.&lt;br /&gt;
&lt;br /&gt;
Toil [https://github.com/DataBiosphere/toil/commit/2de6eea2cc2e688b53062a98687445f0cca56669 recently gained support] for deleting files at the ''end'' of WDL workflows. So if you have a large file that you only need for part of your workflow, consider writing the part that creates and uses it as a separate sub-&amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; and invoking it with &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt;. Then the file will be cleaned up when the child workflow ends, leaving more space for files created in the parent workflow.&lt;br /&gt;
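&lt;br /&gt;
As an illustrative sketch (the &amp;lt;code&amp;gt;make_big_file&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;summarize&amp;lt;/code&amp;gt; tasks here are hypothetical, not from a real pipeline), the space-hungry part of a workflow might be split out like this:&lt;br /&gt;

```
# Child workflow: the large intermediate file only has to exist
# until this workflow ends.
workflow summarize_big_file {
    call make_big_file
    call summarize { input: big_file = make_big_file.result }
    output {
        # Only the small report survives into the parent workflow.
        File report = summarize.report
    }
}
```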
&lt;br /&gt;
=Additional WDL resources=&lt;br /&gt;
&lt;br /&gt;
For more information on writing and running WDL workflows, see:&lt;br /&gt;
&lt;br /&gt;
* [https://docs.openwdl.org/en/stable/ The WDL documentation]&lt;br /&gt;
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The &amp;quot;Learn WDL&amp;quot; video course on YouTube]&lt;br /&gt;
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=665</id>
		<title>Phoenix WDL Tutorial</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=665"/>
		<updated>2025-03-18T18:43:57Z</updated>

		<summary type="html">&lt;p&gt;Anovak: /* How do I delete files in WDL? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Tutorial: Getting Started with WDL Workflows on Phoenix'''&lt;br /&gt;
&lt;br /&gt;
Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like the Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you.&lt;br /&gt;
&lt;br /&gt;
Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.&lt;br /&gt;
&lt;br /&gt;
This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.&lt;br /&gt;
&lt;br /&gt;
=Phoenix Cluster Setup=&lt;br /&gt;
&lt;br /&gt;
Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.&lt;br /&gt;
&lt;br /&gt;
==Getting VPN access==&lt;br /&gt;
&lt;br /&gt;
We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.&lt;br /&gt;
&lt;br /&gt;
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.&lt;br /&gt;
&lt;br /&gt;
==Connecting to Phoenix==&lt;br /&gt;
&lt;br /&gt;
Once you have VPN access, you can connect to any of the machines with access to the Phoenix cluster. These interactive nodes are fairly large machines that can do some work locally, but you will still want to run larger workflows on the actual cluster. For this tutorial, we will use &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt; as our login node.&lt;br /&gt;
&lt;br /&gt;
To connect to the cluster:&lt;br /&gt;
&lt;br /&gt;
1. Connect to the VPN.&lt;br /&gt;
2. SSH to &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt;. At the command line, run:&lt;br /&gt;
&lt;br /&gt;
 ssh emerald.prism&lt;br /&gt;
&lt;br /&gt;
If your username on the cluster (say, &amp;lt;code&amp;gt;flastname&amp;lt;/code&amp;gt;) is different than your username on your computer (which might be &amp;lt;code&amp;gt;firstname&amp;lt;/code&amp;gt;), you might instead have to run:&lt;br /&gt;
&lt;br /&gt;
 ssh flastname@emerald.prism&lt;br /&gt;
&lt;br /&gt;
The first time you connect, you will see a message like:&lt;br /&gt;
&lt;br /&gt;
 The authenticity of host 'emerald.prism (10.50.1.67)' can't be established.&lt;br /&gt;
 ED25519 key fingerprint is SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI.&lt;br /&gt;
 This key is not known by any other names.&lt;br /&gt;
 Are you sure you want to continue connecting (yes/no/[fingerprint])?&lt;br /&gt;
&lt;br /&gt;
This is your computer asking you to help it decide if it is talking to the genuine &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt;, and not an imposter. You will want to make sure that the &amp;quot;key fingerprint&amp;quot; is indeed &amp;lt;code&amp;gt;SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI&amp;lt;/code&amp;gt;. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type &amp;lt;code&amp;gt;yes&amp;lt;/code&amp;gt; to accept and remember that the server is who it says it is.&lt;br /&gt;
&lt;br /&gt;
==Installing Toil with WDL support==&lt;br /&gt;
&lt;br /&gt;
Once you are on the head node, you can install Toil, a program for running workflows.&lt;br /&gt;
&lt;br /&gt;
Toil is written in Python, and the modern way to install Python command line tools is with pipx. So [https://pipx.pypa.io/latest/installation/ install pipx]:&lt;br /&gt;
&lt;br /&gt;
 python3 -m pip install --user pipx&lt;br /&gt;
 python3 -m pipx ensurepath&lt;br /&gt;
&lt;br /&gt;
This may instruct you to '''log out and log back in''' or take some other action to adopt the new &amp;lt;code&amp;gt;PATH&amp;lt;/code&amp;gt; settings.&lt;br /&gt;
&lt;br /&gt;
When installing Toil, you need to specify that you want WDL support. To do this, you can run:&lt;br /&gt;
&lt;br /&gt;
 pipx install 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
If you also want to use AWS S3 &amp;lt;code&amp;gt;s3://&amp;lt;/code&amp;gt; and/or Google &amp;lt;code&amp;gt;gs://&amp;lt;/code&amp;gt; URLs for data, you will need to also install Toil with the &amp;lt;code&amp;gt;aws&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;google&amp;lt;/code&amp;gt; extras, respectively:&lt;br /&gt;
&lt;br /&gt;
 pipx install 'toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
To change which extras are used in an existing Toil installation, re-run the install with the &amp;lt;code&amp;gt;--force&amp;lt;/code&amp;gt; option, e.g. &amp;lt;code&amp;gt;pipx install --force 'toil[wdl,aws,google]'&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
This will install Toil in the &amp;lt;code&amp;gt;.local&amp;lt;/code&amp;gt; directory inside your home directory, which we write as &amp;lt;code&amp;gt;~/.local&amp;lt;/code&amp;gt;. The program to run WDL workflows, &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, will be at &amp;lt;code&amp;gt;~/.local/bin/toil-wdl-runner&amp;lt;/code&amp;gt;. The &amp;lt;code&amp;gt;python3 -m pipx ensurepath&amp;lt;/code&amp;gt; command should have added the &amp;lt;code&amp;gt;~/.local/bin&amp;lt;/code&amp;gt; directory to your &amp;lt;code&amp;gt;PATH&amp;lt;/code&amp;gt; environment variable, to ensure you can find these commands.&lt;br /&gt;
&lt;br /&gt;
If you see something from &amp;lt;code&amp;gt;pipx&amp;lt;/code&amp;gt; like:&lt;br /&gt;
&lt;br /&gt;
     - cwltoil (symlink missing or pointing to unexpected location)&lt;br /&gt;
&lt;br /&gt;
Then &amp;lt;code&amp;gt;pipx uninstall toil&amp;lt;/code&amp;gt;, remove the offending file from &amp;lt;code&amp;gt;~/.local/bin&amp;lt;/code&amp;gt;, and try again.&lt;br /&gt;
&lt;br /&gt;
To make sure it worked, you can run:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --help&lt;br /&gt;
&lt;br /&gt;
If everything worked correctly, it will print a long list of the various option flags that the &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; command supports.&lt;br /&gt;
&lt;br /&gt;
If you ever want to upgrade Toil to a new release, you can run &amp;lt;code&amp;gt;pipx upgrade toil&amp;lt;/code&amp;gt;, or repeat the &amp;lt;code&amp;gt;pipx install&amp;lt;/code&amp;gt; command above with &amp;lt;code&amp;gt;--force&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Configuring your Phoenix Environment==&lt;br /&gt;
&lt;br /&gt;
'''Do not try and store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. Usually you would end up with &amp;lt;code&amp;gt;/private/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Remember this path; we will need it later.&lt;br /&gt;
&lt;br /&gt;
==Configuring Toil for Phoenix==&lt;br /&gt;
&lt;br /&gt;
Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. On the Phoenix cluster, we do have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. But since these images can be large, and the home directory quota is only 30 GB, you might not be able to keep them in your home directory.&lt;br /&gt;
&lt;br /&gt;
We would like to be able to store these on the cluster's large storage array, under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. However, Toil needs to use file locks in these directories to prevent simultaneous Singularity calls from producing internal Singularity errors, and Ceph currently has [https://tracker.ceph.com/issues/65607 a bug where these file locking operations can freeze the Ceph servers].&lt;br /&gt;
&lt;br /&gt;
If you have '''a small number of container images''' that will fit in your home directory, you can keep them there. [https://github.com/DataBiosphere/toil/commit/cb0b291bb7f6212bfe69221dd9f09d72f83e92fb Since Toil 6.1.0], this is the default behavior and you don't need to do anything. (Unless you previously set &amp;lt;code&amp;gt;SINGULARITY_CACHEDIR&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;MINIWDL__SINGULARITY__IMAGE_CACHE&amp;lt;/code&amp;gt;, in which case you need to unset them.)&lt;br /&gt;
&lt;br /&gt;
'''If you don't have room in your home directory''' for container images, currently the recommended approach is to use node-local storage under &amp;lt;code&amp;gt;/data/tmp&amp;lt;/code&amp;gt;. This results in each node pulling each container image, but images will be saved across workflows.&lt;br /&gt;
&lt;br /&gt;
You can set that up for all your workflows with:&lt;br /&gt;
&lt;br /&gt;
 echo 'export SINGULARITY_CACHEDIR=&amp;quot;/data/tmp/$(whoami)/cache/singularity&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE=&amp;quot;/data/tmp/$(whoami)/cache/miniwdl&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
Then '''log out and log back in again''', to apply the changes.&lt;br /&gt;
&lt;br /&gt;
=Running an existing workflow=&lt;br /&gt;
&lt;br /&gt;
Let's use &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].&lt;br /&gt;
&lt;br /&gt;
First, go to your user directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;, and make a directory to work in.&lt;br /&gt;
&lt;br /&gt;
 cd /private/groups/YOURGROUPNAME/YOURUSERNAME&lt;br /&gt;
 mkdir workflow-test&lt;br /&gt;
 cd workflow-test&lt;br /&gt;
&lt;br /&gt;
Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.&lt;br /&gt;
&lt;br /&gt;
 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl&lt;br /&gt;
&lt;br /&gt;
==Preparing an input file==&lt;br /&gt;
&lt;br /&gt;
Near the top of the WDL file, there's a section like this:&lt;br /&gt;
&lt;br /&gt;
 workflow hello_caller {&lt;br /&gt;
     input {&lt;br /&gt;
         File who&lt;br /&gt;
     }&lt;br /&gt;
&lt;br /&gt;
This means that there is a workflow named &amp;lt;code&amp;gt;hello_caller&amp;lt;/code&amp;gt; in this file, and it takes as input a file variable named &amp;lt;code&amp;gt;who&amp;lt;/code&amp;gt;. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.&lt;br /&gt;
&lt;br /&gt;
So first, we have to make that list of names. Let's make it in &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 echo &amp;quot;Mridula Resurrección&amp;quot; &amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Gershom Šarlota&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Ritchie Ravi&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
&lt;br /&gt;
Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt; that references it by relative path, like this:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./names.txt&amp;quot;}' &amp;gt;inputs.json&lt;br /&gt;
&lt;br /&gt;
Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].&lt;br /&gt;
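&lt;br /&gt;
For example (this path is made up for illustration), the same input could be given as an absolute path instead of a relative one:&lt;br /&gt;

```shell
# Write an inputs file using a hypothetical absolute path; relative paths,
# absolute paths, and URLs are all accepted for File inputs.
printf '%s\n' '{"hello_caller.who": "/private/groups/examplegroup/exampleuser/names.txt"}' | tee inputs_abs.json
```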
&lt;br /&gt;
==Testing at small scale on a single machine==&lt;br /&gt;
&lt;br /&gt;
We are now ready to run the workflow!&lt;br /&gt;
&lt;br /&gt;
You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:&lt;br /&gt;
&lt;br /&gt;
 srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i&lt;br /&gt;
&lt;br /&gt;
This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In your new shell, run this Toil command:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner self_test.wdl inputs.json -o local_run&lt;br /&gt;
&lt;br /&gt;
This will, by default, use the &amp;lt;code&amp;gt;single_machine&amp;lt;/code&amp;gt; Toil &amp;quot;batch system&amp;quot; to run all of the workflow's tasks locally. Output will be sent to a new directory named &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
This will print a lot of logging to standard error, and to standard output it will print:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;hello_caller.message_files&amp;quot;: [&amp;quot;local_run/Mridula Resurrecci\u00f3n.txt&amp;quot;, &amp;quot;local_run/Gershom \u0160arlota.txt&amp;quot;, &amp;quot;local_run/Ritchie Ravi.txt&amp;quot;], &amp;quot;hello_caller.messages&amp;quot;: [&amp;quot;Hello, Mridula Resurrecci\u00f3n!&amp;quot;, &amp;quot;Hello, Gershom \u0160arlota!&amp;quot;, &amp;quot;Hello, Ritchie Ravi!&amp;quot;]}&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt; directory will contain the described text files (with Unicode escape sequences like &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt; replaced by their corresponding characters), each containing a greeting for the corresponding person.&lt;br /&gt;
&lt;br /&gt;
To leave your interactive Slurm session and return to the head node, use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Running at larger scale==&lt;br /&gt;
&lt;br /&gt;
Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people!&lt;br /&gt;
&lt;br /&gt;
Go get this handy list of people and cut it to length:&lt;br /&gt;
&lt;br /&gt;
 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt&lt;br /&gt;
 head -n100 1000_names.txt &amp;gt;100_names.txt&lt;br /&gt;
&lt;br /&gt;
And make a new inputs file:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./100_names.txt&amp;quot;}' &amp;gt;inputs_big.json&lt;br /&gt;
&lt;br /&gt;
Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.&lt;br /&gt;
&lt;br /&gt;
To run against the Slurm cluster, we need to use the &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; option to point Toil at a directory it can create on the shared filesystem, where it will store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the &amp;lt;code&amp;gt;--caching false&amp;lt;/code&amp;gt; option. We will add the &amp;lt;code&amp;gt;--batchLogsDir&amp;lt;/code&amp;gt; option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the &amp;lt;code&amp;gt;-m&amp;lt;/code&amp;gt; option to save the output JSON to a file instead of printing it.&lt;br /&gt;
&lt;br /&gt;
Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the &amp;lt;code&amp;gt;TOIL_SLURM_ARGS&amp;lt;/code&amp;gt; environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in.&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json&lt;br /&gt;
&lt;br /&gt;
This will tick for a while, but eventually you should end up with 100 greeting files in the &amp;lt;code&amp;gt;slurm_run&amp;lt;/code&amp;gt; directory.&lt;br /&gt;
&lt;br /&gt;
=Writing your own workflow=&lt;br /&gt;
&lt;br /&gt;
In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. &lt;br /&gt;
&lt;br /&gt;
==Writing the file==&lt;br /&gt;
&lt;br /&gt;
===Version===&lt;br /&gt;
&lt;br /&gt;
All WDL files need to start with a &amp;lt;code&amp;gt;version&amp;lt;/code&amp;gt; statement (unless they are very old &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; files). Toil supports &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt;, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; and 1.0.&lt;br /&gt;
&lt;br /&gt;
So let's start a new WDL 1.0 workflow. Open up a file named &amp;lt;code&amp;gt;fizzbuzz.wdl&amp;lt;/code&amp;gt; and start with a version statement:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
&lt;br /&gt;
===Workflow Block===&lt;br /&gt;
&lt;br /&gt;
Then, add an empty &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; named &amp;lt;code&amp;gt;FizzBuzz&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Input Block===&lt;br /&gt;
&lt;br /&gt;
Workflows usually need some kind of user input, so let's give our workflow an &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; section.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Notice that each input has a type, a name, and an optional default value. If the type ends in &amp;lt;code&amp;gt;?&amp;lt;/code&amp;gt;, the value is optional, and it may be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.&lt;br /&gt;
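&lt;br /&gt;
For instance (the file name here is just an example), an inputs file for this workflow must set the required &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;, and can override any of the defaults:&lt;br /&gt;

```shell
# Hypothetical inputs file for the FizzBuzz workflow: item_count is required
# (no default), to_fizz overrides its default of 3, and to_buzz and
# fizzbuzz_override keep their default/optional values.
printf '%s\n' '{"FizzBuzz.item_count": 20, "FizzBuzz.to_fizz": 4}' | tee fizzbuzz_inputs.json
```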
&lt;br /&gt;
===Body===&lt;br /&gt;
&lt;br /&gt;
Now we'll start on the body of the workflow, to be inserted just after the inputs section. &lt;br /&gt;
&lt;br /&gt;
The first thing we're going to need to do is create an array of all the numbers up to the &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;. We can do this by calling the WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; function, and assigning the result to an &amp;lt;code&amp;gt;Array[Int]&amp;lt;/code&amp;gt; variable.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
&lt;br /&gt;
WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.&lt;br /&gt;
&lt;br /&gt;
===Scattering===&lt;br /&gt;
&lt;br /&gt;
Once we create an array of all the numbers, we can use a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt; to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a &amp;lt;code&amp;gt;map()&amp;lt;/code&amp;gt; in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; starts at 0.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
     Int one_based = i + 1&lt;br /&gt;
 }&lt;br /&gt;
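&lt;br /&gt;
If it helps, here is a loose shell analogy (not WDL, and only a sketch) of what the scatter above does: run the body once per input element and collect the per-iteration results into an array.&lt;br /&gt;

```shell
# Each loop iteration stands in for one parallel scatter execution;
# the one_based array stands in for the gathered Array[Int] outside it.
numbers=(0 1 2 3 4)
one_based=()
for i in "${numbers[@]}"; do
    one_based+=("$((i + 1))")
done
echo "${one_based[@]}"
```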
&lt;br /&gt;
===Conditionals===&lt;br /&gt;
&lt;br /&gt;
Inside the body of the scatter, we are going to put some conditionals to determine if we should produce &amp;lt;code&amp;gt;&amp;quot;Fizz&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;Buzz&amp;quot;&amp;lt;/code&amp;gt;, or &amp;lt;code&amp;gt;&amp;quot;FizzBuzz&amp;quot;&amp;lt;/code&amp;gt;. To support our &amp;lt;code&amp;gt;fizzbuzz_override&amp;lt;/code&amp;gt;, we build an array containing it and a default value, and use the WDL &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; function to find the first non-null value in that array.&lt;br /&gt;
&lt;br /&gt;
Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; at the end and take advantage of variables from un-executed conditionals being &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Note that WDL supports conditional ''expressions'' with a &amp;lt;code&amp;gt;then&amp;lt;/code&amp;gt; and an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt;, but conditional ''statements'' only have a body, not an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt; branch. If you need an else you will have to check the negated condition.&lt;br /&gt;
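&lt;br /&gt;
For contrast, a conditional ''expression'' (this particular declaration is just an illustration, not part of our workflow) looks like this:&lt;br /&gt;

```
String parity = if (one_based % 2 == 0) then "even" else "odd"
```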
&lt;br /&gt;
So first, let's handle the special cases.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Calling Tasks===&lt;br /&gt;
&lt;br /&gt;
Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a &amp;lt;code&amp;gt;${}&amp;lt;/code&amp;gt; substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a &amp;lt;code&amp;gt;stringify_number&amp;lt;/code&amp;gt; task, to be written later.&lt;br /&gt;
&lt;br /&gt;
To call a task (or another workflow), we use a &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt; statement and give it some inputs. Then we can fish the output values out of the task with &amp;lt;code&amp;gt;.&amp;lt;/code&amp;gt; access. (For a call inside a conditional, the outputs will be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt; if the call did not run.)&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
        call stringify_number {&lt;br /&gt;
            input:&lt;br /&gt;
                the_number = one_based&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We can put the code into the workflow now, and set about writing the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         &lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Writing Tasks===&lt;br /&gt;
&lt;br /&gt;
Our task should go after the workflow in the file. It looks a lot like a workflow except it uses &amp;lt;code&amp;gt;task&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're going to want it to take in an integer &amp;lt;code&amp;gt;the_number&amp;lt;/code&amp;gt;, and we're going to want it to output a string &amp;lt;code&amp;gt;the_string&amp;lt;/code&amp;gt;. So let's fill that in in &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; sections.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     # ???&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now, unlike workflows, tasks can have a &amp;lt;code&amp;gt;command&amp;lt;/code&amp;gt; section, which gives a command to run. In modern WDL, this section is set off with triple angle brackets, and inside it you can use &amp;lt;code&amp;gt;~{}&amp;lt;/code&amp;gt; (Bash-style substitution, but with a tilde) to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now we need to capture the result of the command script. The WDL &amp;lt;code&amp;gt;stdout()&amp;lt;/code&amp;gt; returns a WDL &amp;lt;code&amp;gt;File&amp;lt;/code&amp;gt; containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're also going to want to add a &amp;lt;code&amp;gt;runtime&amp;lt;/code&amp;gt; section to our task, to specify resource requirements, and to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;disks&amp;lt;/code&amp;gt; entry is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a &amp;lt;code&amp;gt;local-disk&amp;lt;/code&amp;gt; of a certain number of gigabytes, optionally suggesting that it should be &amp;lt;code&amp;gt;SSD&amp;lt;/code&amp;gt; storage.&lt;br /&gt;
&lt;br /&gt;
Then we can put our task into our WDL file:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Output Block===&lt;br /&gt;
&lt;br /&gt;
Now the only thing missing is a workflow-level &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section. Technically, in WDL 1.0 you aren't supposed to need one, but you do need it in WDL 1.1, and Toil doesn't actually deliver your outputs anywhere yet if you don't have one, so we're going to write one. We need to collect all the strings that came out of the different tasks in our scatter into an &amp;lt;code&amp;gt;Array[String]&amp;lt;/code&amp;gt;. We'll add the &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section at the end of the &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; block, above the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
     output {&lt;br /&gt;
        Array[String] fizzbuzz_results = result&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Because the &amp;lt;code&amp;gt;result&amp;lt;/code&amp;gt; variable is defined inside a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt;, when we reference it outside the scatter we see it as being an array.&lt;br /&gt;
&lt;br /&gt;
==Running the Workflow==&lt;br /&gt;
&lt;br /&gt;
Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;FizzBuzz.item_count&amp;quot;: 20}' &amp;gt;fizzbuzz.json&lt;br /&gt;
&lt;br /&gt;
Then run it on the cluster with Toil:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
Or locally:&lt;br /&gt;
&lt;br /&gt;
    toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
=Debugging Workflows=&lt;br /&gt;
Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.&lt;br /&gt;
&lt;br /&gt;
==Restarting the Workflow==&lt;br /&gt;
&lt;br /&gt;
If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have fixed, and you ran the workflow with &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; set manually to a directory that persists between attempts, you can add &amp;lt;code&amp;gt;--restart&amp;lt;/code&amp;gt; to your workflow command and make Toil try again. It will pick up from where it left off and rerun any failed tasks and then the rest of the workflow.&lt;br /&gt;
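&lt;br /&gt;
For example, reusing the cluster command from earlier in this tutorial (and assuming the &amp;lt;code&amp;gt;./fizzbuzz_store&amp;lt;/code&amp;gt; job store directory still exists), a restart might look like:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json --restart&lt;br /&gt;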
&lt;br /&gt;
This ''will not'' pick up any changes to your WDL source code files; those are read once at the beginning and not re-read on restart.&lt;br /&gt;
&lt;br /&gt;
If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques.&lt;br /&gt;
&lt;br /&gt;
==Debugging Options==&lt;br /&gt;
When debugging a workflow, make sure to run the workflow with &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt;, to set the log level to &amp;lt;code&amp;gt;DEBUG&amp;lt;/code&amp;gt;, and with &amp;lt;code&amp;gt;--jobStore /some/path/to/a/shared/directory/it/can/create&amp;lt;/code&amp;gt; so that the stored files shipped between jobs are in a place you can access them.&lt;br /&gt;
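&lt;br /&gt;
For example (the job store path here is a placeholder; use a directory you can write to):&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --logDebug --jobStore /private/groups/YOURGROUPNAME/YOURUSERNAME/debug_store fizzbuzz.wdl fizzbuzz.json&lt;br /&gt;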
&lt;br /&gt;
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:&lt;br /&gt;
&lt;br /&gt;
 =========&amp;gt;&lt;br /&gt;
         Toil job log is here&lt;br /&gt;
 &amp;lt;=========&lt;br /&gt;
&lt;br /&gt;
Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this.&lt;br /&gt;
&lt;br /&gt;
==Reading the Log==&lt;br /&gt;
&lt;br /&gt;
When a WDL workflow fails, you are likely to see a message like this:&lt;br /&gt;
&lt;br /&gt;
 WDL.runtime.error.CommandFailed: task command failed with exit status 1&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism&lt;br /&gt;
&lt;br /&gt;
This means that the command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit status. That happens either when the command is written incorrectly, or when the error detection code in the tool you are trying to run detects and reports an error.&lt;br /&gt;
&lt;br /&gt;
Go up higher in the log until you find lines that look like:&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows:&lt;br /&gt;
&lt;br /&gt;
And&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows:&lt;br /&gt;
&lt;br /&gt;
These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.&lt;br /&gt;
&lt;br /&gt;
If you would like individual task logs to be saved separately for later reference, you can use the &amp;lt;code&amp;gt;--writeLogs&amp;lt;/code&amp;gt; option to specify a directory to store them. For more information, see [https://toil.readthedocs.io/en/latest/wdl/running.html#managing-workflow-logs the Toil documentation of workflow task logs].&lt;br /&gt;
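&lt;br /&gt;
For example (the &amp;lt;code&amp;gt;./task_logs&amp;lt;/code&amp;gt; directory name is just an illustration):&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out --writeLogs ./task_logs&lt;br /&gt;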
&lt;br /&gt;
==Reproducing Problems==&lt;br /&gt;
&lt;br /&gt;
When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this.&lt;br /&gt;
&lt;br /&gt;
=== Automatically Fetching Input Files ===&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;toil debug-job&amp;lt;/code&amp;gt; command has a &amp;lt;code&amp;gt;--retrieveTaskDirectory&amp;lt;/code&amp;gt; option that lets you dump out a directory with all the files that a failing WDL task would use. You can use it like:&lt;br /&gt;
&lt;br /&gt;
 toil debug-job ./jobstore WDLTaskJob --retrieveTaskDirectory dumpdir&lt;br /&gt;
&lt;br /&gt;
If there are multiple failing tasks, you might need to replace &amp;lt;code&amp;gt;WDLTaskJob&amp;lt;/code&amp;gt; with the name of one of the failing jobs. See [https://toil.readthedocs.io/en/latest/running/debugging.html#fetching-job-inputs the Toil documentation on retrieving files] for more on how to use this command.&lt;br /&gt;
&lt;br /&gt;
=== Manually Finding Input Files ===&lt;br /&gt;
&lt;br /&gt;
If you can't use &amp;lt;code&amp;gt;toil debug-job&amp;lt;/code&amp;gt;, you might need to manually dig through the job store for files. In the log of your failing Toil task, look for lines like this:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&amp;lt;/code&amp;gt; part is a Toil file ID, and it is a relative path from your &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; value to where the file is stored on disk. So if you ran the workflow with &amp;lt;code&amp;gt;--jobStore /private/groups/patenlab/anovak/jobstore&amp;lt;/code&amp;gt;, you would look for this file at:&lt;br /&gt;
&lt;br /&gt;
 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&lt;br /&gt;
&lt;br /&gt;
==More Ways of Finding Files==&lt;br /&gt;
&lt;br /&gt;
Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the &amp;lt;code&amp;gt;find&amp;lt;/code&amp;gt; command to try and find the files by name. For example, if you want to look at &amp;lt;code&amp;gt;Sample.bam&amp;lt;/code&amp;gt;, you can look for it like this:&lt;br /&gt;
&lt;br /&gt;
 find /path/to/the/jobstore -name &amp;quot;Sample.bam&amp;quot;&lt;br /&gt;
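&lt;br /&gt;
If you aren't sure of the exact file name, standard &amp;lt;code&amp;gt;find&amp;lt;/code&amp;gt; options can match a pattern or restrict to recently modified files (the pattern and time window here are just examples):&lt;br /&gt;
&lt;br /&gt;
 find /path/to/the/jobstore -name &amp;quot;*.bam&amp;quot; -mmin -60&lt;br /&gt;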
&lt;br /&gt;
If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
You can take the &amp;lt;code&amp;gt;toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt; URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this:&lt;br /&gt;
&lt;br /&gt;
 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
Then you can take the part after the last colon, &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt;, and that is the path relative to the job store where this file can be found.&lt;br /&gt;
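&lt;br /&gt;
If you would rather not paste file IDs into a web page, you can do the URL-decoding locally with Python's standard library:&lt;br /&gt;
&lt;br /&gt;
 python3 -c 'import urllib.parse,sys; print(urllib.parse.unquote(sys.argv[1]))' 'toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam'&lt;br /&gt;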
&lt;br /&gt;
==Using Development Versions of Toil==&lt;br /&gt;
&lt;br /&gt;
Sometimes, bugs will be fixed in the development version of Toil, but not released yet. To try the current development version of Toil, you can install it like this:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
If you want to use a particular branch or commit, like &amp;lt;code&amp;gt;aaa451b320fc115b3563ced25cb501301cf86f90&amp;lt;/code&amp;gt;, you can do:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
==Frequently Asked Questions==&lt;br /&gt;
&lt;br /&gt;
===I am getting warnings about &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt;===&lt;br /&gt;
You may be seeing warnings that &amp;lt;code&amp;gt;XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should upgrade Toil. [https://github.com/DataBiosphere/toil/commit/ff6bf60ab798a675c20156c749817c4313644b96 Since Toil 6.1.0], Toil no longer issues this warning, and just puts up with bad &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt; settings.&lt;br /&gt;
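&lt;br /&gt;
With the pipx-based install described earlier, upgrading can be as simple as:&lt;br /&gt;
&lt;br /&gt;
 pipx upgrade toil&lt;br /&gt;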
&lt;br /&gt;
===Toil said it was &amp;lt;code&amp;gt;Redirecting logging&amp;lt;/code&amp;gt; somewhere, but I can't find that file!===&lt;br /&gt;
The Toil worker process for each job will say that it is &amp;lt;code&amp;gt;Redirecting logging to /data/tmp/somewhere/worker_log.txt&amp;lt;/code&amp;gt;, and when running in single machine mode these messages go to the main Toil log.&lt;br /&gt;
&lt;br /&gt;
The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt; option to Toil.&lt;br /&gt;
&lt;br /&gt;
If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.&lt;br /&gt;
&lt;br /&gt;
===How do I delete files in WDL?===&lt;br /&gt;
WDL doesn't have a built-in way to delete files; if you run a task that deletes a file, it will still exist in Toil's job store storage.&lt;br /&gt;
&lt;br /&gt;
Toil [https://github.com/DataBiosphere/toil/commit/2de6eea2cc2e688b53062a98687445f0cca56669 recently gained support] for deleting files at the ''end'' of WDL workflows. So if you have a large file that you only need for part of your workflow, consider writing the part that creates and uses it as a separate sub-&amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; and invoking it with &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt;. Then the file will be cleaned up when the child workflow ends, leaving more space for files created in the parent workflow.&lt;br /&gt;
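&lt;br /&gt;
A minimal sketch of that structure (the file name &amp;lt;code&amp;gt;scratch_heavy.wdl&amp;lt;/code&amp;gt; and all identifiers here are hypothetical):&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 # scratch_heavy.wdl would hold a child workflow that creates and&lt;br /&gt;
 # consumes a large intermediate file, outputting only a small summary.&lt;br /&gt;
 import &amp;quot;scratch_heavy.wdl&amp;quot; as scratch&lt;br /&gt;
 workflow parent {&lt;br /&gt;
     input {&lt;br /&gt;
         File raw_data&lt;br /&gt;
     }&lt;br /&gt;
     # The large intermediate exists only inside the child workflow,&lt;br /&gt;
     # so it can be cleaned up as soon as the child workflow ends.&lt;br /&gt;
     call scratch.scratch_heavy { input:&lt;br /&gt;
         raw_data = raw_data&lt;br /&gt;
     }&lt;br /&gt;
     output {&lt;br /&gt;
         File summary = scratch_heavy.summary&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;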
&lt;br /&gt;
=Additional WDL resources=&lt;br /&gt;
&lt;br /&gt;
For more information on writing and running WDL workflows, see:&lt;br /&gt;
&lt;br /&gt;
* [https://docs.openwdl.org/en/stable/ The WDL documentation]&lt;br /&gt;
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The &amp;quot;Learn WDL&amp;quot; video course on YouTube]&lt;br /&gt;
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=664</id>
		<title>Phoenix WDL Tutorial</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=664"/>
		<updated>2025-03-18T18:43:13Z</updated>

		<summary type="html">&lt;p&gt;Anovak: /* How do I delete files in WDL? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Tutorial: Getting Started with WDL Workflows on Phoenix'''&lt;br /&gt;
&lt;br /&gt;
Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like the Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you.&lt;br /&gt;
&lt;br /&gt;
Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.&lt;br /&gt;
&lt;br /&gt;
This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.&lt;br /&gt;
&lt;br /&gt;
=Phoenix Cluster Setup=&lt;br /&gt;
&lt;br /&gt;
Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.&lt;br /&gt;
&lt;br /&gt;
==Getting VPN access==&lt;br /&gt;
&lt;br /&gt;
We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.&lt;br /&gt;
&lt;br /&gt;
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.&lt;br /&gt;
&lt;br /&gt;
==Connecting to Phoenix==&lt;br /&gt;
&lt;br /&gt;
Once you have VPN access, you can connect to any of the machines with access to the Phoenix cluster. These interactive nodes are fairly large machines that can do some work locally, but you will still want to run larger workflows on the actual cluster. For this tutorial, we will use &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt; as our login node.&lt;br /&gt;
&lt;br /&gt;
To connect to the cluster:&lt;br /&gt;
&lt;br /&gt;
1. Connect to the VPN.&lt;br /&gt;
2. SSH to &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt;. At the command line, run:&lt;br /&gt;
&lt;br /&gt;
 ssh emerald.prism&lt;br /&gt;
&lt;br /&gt;
If your username on the cluster (say, &amp;lt;code&amp;gt;flastname&amp;lt;/code&amp;gt;) is different than your username on your computer (which might be &amp;lt;code&amp;gt;firstname&amp;lt;/code&amp;gt;), you might instead have to run:&lt;br /&gt;
&lt;br /&gt;
 ssh flastname@emerald.prism&lt;br /&gt;
&lt;br /&gt;
The first time you connect, you will see a message like:&lt;br /&gt;
&lt;br /&gt;
 The authenticity of host 'emerald.prism (10.50.1.67)' can't be established.&lt;br /&gt;
 ED25519 key fingerprint is SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI.&lt;br /&gt;
 This key is not known by any other names.&lt;br /&gt;
 Are you sure you want to continue connecting (yes/no/[fingerprint])?&lt;br /&gt;
&lt;br /&gt;
This is your computer asking you to help it decide if it is talking to the genuine &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt;, and not an imposter. You will want to make sure that the &amp;quot;key fingerprint&amp;quot; is indeed &amp;lt;code&amp;gt;SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI&amp;lt;/code&amp;gt;. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type &amp;lt;code&amp;gt;yes&amp;lt;/code&amp;gt; to accept and remember that the server is who it says it is.&lt;br /&gt;
&lt;br /&gt;
==Installing Toil with WDL support==&lt;br /&gt;
&lt;br /&gt;
Once you are on the head node, you can install Toil, a program for running workflows.&lt;br /&gt;
&lt;br /&gt;
Toil is written in Python, and the modern way to install Python command line tools is with pipx. So [https://pipx.pypa.io/latest/installation/ install pipx]:&lt;br /&gt;
&lt;br /&gt;
 python3 -m pip install --user pipx&lt;br /&gt;
 python3 -m pipx ensurepath&lt;br /&gt;
&lt;br /&gt;
This may instruct you to '''log out and log back in''' or take some other action to adopt the new &amp;lt;code&amp;gt;PATH&amp;lt;/code&amp;gt; settings.&lt;br /&gt;
&lt;br /&gt;
When installing Toil, you need to specify that you want WDL support. To do this, you can run:&lt;br /&gt;
&lt;br /&gt;
 pipx install 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
If you also want to use AWS S3 &amp;lt;code&amp;gt;s3://&amp;lt;/code&amp;gt; and/or Google &amp;lt;code&amp;gt;gs://&amp;lt;/code&amp;gt; URLs for data, you will need to also install Toil with the &amp;lt;code&amp;gt;aws&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;google&amp;lt;/code&amp;gt; extras, respectively:&lt;br /&gt;
&lt;br /&gt;
 pipx install 'toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
To change what extras are used when you have an existing Toil installation, you will need to use the &amp;lt;code&amp;gt;--force&amp;lt;/code&amp;gt; option.&lt;br /&gt;
&lt;br /&gt;
This will install Toil in the &amp;lt;code&amp;gt;.local&amp;lt;/code&amp;gt; directory inside your home directory, which we write as &amp;lt;code&amp;gt;~/.local&amp;lt;/code&amp;gt;. The program to run WDL workflows, &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, will be at &amp;lt;code&amp;gt;~/.local/bin/toil-wdl-runner&amp;lt;/code&amp;gt;. The &amp;lt;code&amp;gt;python3 -m pipx ensurepath&amp;lt;/code&amp;gt; command should have added the &amp;lt;code&amp;gt;~/.local/bin&amp;lt;/code&amp;gt; directory to your &amp;lt;code&amp;gt;PATH&amp;lt;/code&amp;gt; environment variable, to ensure you can find these commands.&lt;br /&gt;
&lt;br /&gt;
If you see something from &amp;lt;code&amp;gt;pipx&amp;lt;/code&amp;gt; like:&lt;br /&gt;
&lt;br /&gt;
     - cwltoil (symlink missing or pointing to unexpected location)&lt;br /&gt;
&lt;br /&gt;
Then &amp;lt;code&amp;gt;pipx uninstall toil&amp;lt;/code&amp;gt;, remove the offending file from &amp;lt;code&amp;gt;~/.local/bin&amp;lt;/code&amp;gt;, and try again.&lt;br /&gt;
&lt;br /&gt;
To make sure it worked, you can run:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --help&lt;br /&gt;
&lt;br /&gt;
If everything worked correctly, it will print a long list of the various option flags that the &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; command supports.&lt;br /&gt;
&lt;br /&gt;
If you ever want to upgrade Toil to a new release, you can run &amp;lt;code&amp;gt;pipx upgrade toil&amp;lt;/code&amp;gt;, or repeat the &amp;lt;code&amp;gt;pipx install&amp;lt;/code&amp;gt; command above with the &amp;lt;code&amp;gt;--force&amp;lt;/code&amp;gt; option.&lt;br /&gt;
&lt;br /&gt;
==Configuring your Phoenix Environment==&lt;br /&gt;
&lt;br /&gt;
'''Do not try and store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. Usually you would end up with &amp;lt;code&amp;gt;/private/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Remember this path; we will need it later.&lt;br /&gt;
&lt;br /&gt;
==Configuring Toil for Phoenix==&lt;br /&gt;
&lt;br /&gt;
Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. Unfortunately, since these files can be large, and the home directory quota is only 30 GB, they might not fit in your home directory.&lt;br /&gt;
&lt;br /&gt;
We would like to be able to store these on the cluster's large storage array, under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. However, Toil needs to use file locks in these directories to prevent simultaneous Singularity calls from producing internal Singularity errors, and Ceph currently has [https://tracker.ceph.com/issues/65607 a bug where these file locking operations can freeze the Ceph servers].&lt;br /&gt;
&lt;br /&gt;
If you have '''a small number of container images''' that will fit in your home directory, you can keep them there. [https://github.com/DataBiosphere/toil/commit/cb0b291bb7f6212bfe69221dd9f09d72f83e92fb Since Toil 6.1.0], this is the default behavior and you don't need to do anything. (Unless you previously set &amp;lt;code&amp;gt;SINGULARITY_CACHEDIR&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;MINIWDL__SINGULARITY__IMAGE_CACHE&amp;lt;/code&amp;gt;, in which case you need to unset them.)&lt;br /&gt;
&lt;br /&gt;
'''If you don't have room in your home directory''' for container images, currently the recommended approach is to use node-local storage under &amp;lt;code&amp;gt;/data/tmp&amp;lt;/code&amp;gt;. This results in each node pulling each container image, but images will be saved across workflows.&lt;br /&gt;
&lt;br /&gt;
You can set that up for all your workflows with:&lt;br /&gt;
&lt;br /&gt;
 echo 'export SINGULARITY_CACHEDIR=&amp;quot;/data/tmp/$(whoami)/cache/singularity&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE=&amp;quot;/data/tmp/$(whoami)/cache/miniwdl&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
Then '''log out and log back in again''', to apply the changes.&lt;br /&gt;
&lt;br /&gt;
=Running an existing workflow=&lt;br /&gt;
&lt;br /&gt;
First, let's use &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].&lt;br /&gt;
&lt;br /&gt;
First, go to your user directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;, and make a directory to work in.&lt;br /&gt;
&lt;br /&gt;
 cd /private/groups/YOURGROUPNAME/YOURUSERNAME&lt;br /&gt;
 mkdir workflow-test&lt;br /&gt;
 cd workflow-test&lt;br /&gt;
&lt;br /&gt;
Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.&lt;br /&gt;
&lt;br /&gt;
 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl&lt;br /&gt;
&lt;br /&gt;
==Preparing an input file==&lt;br /&gt;
&lt;br /&gt;
Near the top of the WDL file, there's a section like this:&lt;br /&gt;
&lt;br /&gt;
 workflow hello_caller {&lt;br /&gt;
     input {&lt;br /&gt;
         File who&lt;br /&gt;
     }&lt;br /&gt;
&lt;br /&gt;
This means that there is a workflow named &amp;lt;code&amp;gt;hello_caller&amp;lt;/code&amp;gt; in this file, and it takes as input a file variable named &amp;lt;code&amp;gt;who&amp;lt;/code&amp;gt;. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.&lt;br /&gt;
&lt;br /&gt;
So first, we have to make that list of names. Let's make it in &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 echo &amp;quot;Mridula Resurrección&amp;quot; &amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Gershom Šarlota&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Ritchie Ravi&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
&lt;br /&gt;
Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt; that references it by relative path, like this:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./names.txt&amp;quot;}' &amp;gt;inputs.json&lt;br /&gt;
&lt;br /&gt;
Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].&lt;br /&gt;
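&lt;br /&gt;
If you are generating inputs from a script, writing the JSON with a JSON library avoids quoting mistakes. A minimal Python sketch, using the key and filename from this example:&lt;br /&gt;
&lt;br /&gt;
```python
import json

# WDL inputs are keyed by the workflow name, a dot, and the input name.
inputs = {"hello_caller.who": "./names.txt"}

# Write the inputs file next to names.txt so the relative path resolves.
with open("inputs.json", "w") as f:
    json.dump(inputs, f)

print(open("inputs.json").read())  # -> {"hello_caller.who": "./names.txt"}
```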
&lt;br /&gt;
==Testing at small scale on a single machine==&lt;br /&gt;
&lt;br /&gt;
We are now ready to run the workflow!&lt;br /&gt;
&lt;br /&gt;
You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:&lt;br /&gt;
&lt;br /&gt;
 srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i&lt;br /&gt;
&lt;br /&gt;
This will start a new shell on a worker node that can run for up to 2 hours.&lt;br /&gt;
&lt;br /&gt;
In your new shell, run this Toil command:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner self_test.wdl inputs.json -o local_run&lt;br /&gt;
&lt;br /&gt;
This will, by default, use the &amp;lt;code&amp;gt;single_machine&amp;lt;/code&amp;gt; Toil &amp;quot;batch system&amp;quot; to run all of the workflow's tasks locally. Output will be sent to a new directory named &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
This will print a lot of logging to standard error, and to standard output it will print:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;hello_caller.message_files&amp;quot;: [&amp;quot;local_run/Mridula Resurrecci\u00f3n.txt&amp;quot;, &amp;quot;local_run/Gershom \u0160arlota.txt&amp;quot;, &amp;quot;local_run/Ritchie Ravi.txt&amp;quot;], &amp;quot;hello_caller.messages&amp;quot;: [&amp;quot;Hello, Mridula Resurrecci\u00f3n!&amp;quot;, &amp;quot;Hello, Gershom \u0160arlota!&amp;quot;, &amp;quot;Hello, Ritchie Ravi!&amp;quot;]}&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt; directory will contain the described text files (with Unicode escape sequences like &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt; replaced by their corresponding characters), each containing a greeting for the corresponding person.&lt;br /&gt;
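&lt;br /&gt;
The &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt;-style sequences are just ASCII-safe JSON escapes; any JSON parser recovers the real characters. For example, in Python:&lt;br /&gt;
&lt;br /&gt;
```python
import json

# A fragment of the JSON that toil-wdl-runner prints to standard output.
printed = '["local_run/Mridula Resurrecci\\u00f3n.txt", "Hello, Gershom \\u0160arlota!"]'

decoded = json.loads(printed)
print(decoded[0])  # -> local_run/Mridula Resurrección.txt
```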
&lt;br /&gt;
To leave your interactive Slurm session and return to the head node, use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Running at larger scale==&lt;br /&gt;
&lt;br /&gt;
Back on the head node, let's prepare for a larger run. Greeting 3 people isn't cool; let's greet one hundred people!&lt;br /&gt;
&lt;br /&gt;
Go get this handy list of people and cut it to length:&lt;br /&gt;
&lt;br /&gt;
 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt&lt;br /&gt;
 head -n100 1000_names.txt &amp;gt;100_names.txt&lt;br /&gt;
&lt;br /&gt;
And make a new inputs file:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./100_names.txt&amp;quot;}' &amp;gt;inputs_big.json&lt;br /&gt;
&lt;br /&gt;
Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.&lt;br /&gt;
&lt;br /&gt;
To run against the Slurm cluster, we need to use the &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the &amp;lt;code&amp;gt;--caching false&amp;lt;/code&amp;gt; option. We will add the &amp;lt;code&amp;gt;--batchLogsDir&amp;lt;/code&amp;gt; option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the &amp;lt;code&amp;gt;-m&amp;lt;/code&amp;gt; option to save the output JSON to a file instead of printing it.&lt;br /&gt;
&lt;br /&gt;
Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the &amp;lt;code&amp;gt;TOIL_SLURM_ARGS&amp;lt;/code&amp;gt; environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in.&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json&lt;br /&gt;
&lt;br /&gt;
This will tick for a while, but eventually you should end up with 100 greeting files in the &amp;lt;code&amp;gt;slurm_run&amp;lt;/code&amp;gt; directory.&lt;br /&gt;
&lt;br /&gt;
=Writing your own workflow=&lt;br /&gt;
&lt;br /&gt;
In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. &lt;br /&gt;
&lt;br /&gt;
==Writing the file==&lt;br /&gt;
&lt;br /&gt;
===Version===&lt;br /&gt;
&lt;br /&gt;
All WDL files need to start with a &amp;lt;code&amp;gt;version&amp;lt;/code&amp;gt; statement (unless they are very old &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; files). Toil supports &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt;, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; and 1.0.&lt;br /&gt;
&lt;br /&gt;
So let's start a new WDL 1.0 workflow. Open up a file named &amp;lt;code&amp;gt;fizzbuzz.wdl&amp;lt;/code&amp;gt; and start with a version statement:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
&lt;br /&gt;
===Workflow Block===&lt;br /&gt;
&lt;br /&gt;
Then, add an empty &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; named &amp;lt;code&amp;gt;FizzBuzz&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Input Block===&lt;br /&gt;
&lt;br /&gt;
Workflows usually need some kind of user input, so let's give our workflow an &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; section.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Notice that each input has a type, a name, and an optional default value. If the type ends in &amp;lt;code&amp;gt;?&amp;lt;/code&amp;gt;, the value is optional, and it may be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.&lt;br /&gt;
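&lt;br /&gt;
These semantics map closely onto function parameters in most languages. As a Python analogy (names taken from the workflow above), a parameter with no default is required, and a &amp;lt;code&amp;gt;?&amp;lt;/code&amp;gt; type is like an argument that may be &amp;lt;code&amp;gt;None&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
```python
from typing import Optional

# item_count has no default, so the caller must supply it, like a required
# WDL input; the others mirror the defaults and the "?" (optional) type.
def fizz_buzz(item_count: int,
              to_fizz: int = 3,
              to_buzz: int = 5,
              fizzbuzz_override: Optional[str] = None):
    return (item_count, to_fizz, to_buzz, fizzbuzz_override)

print(fizz_buzz(20))  # -> (20, 3, 5, None)
```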
&lt;br /&gt;
===Body===&lt;br /&gt;
&lt;br /&gt;
Now we'll start on the body of the workflow, to be inserted just after the inputs section. &lt;br /&gt;
&lt;br /&gt;
The first thing we're going to need to do is create an array of all the numbers up to the &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;. We can do this by calling the WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; function, and assigning the result to an &amp;lt;code&amp;gt;Array[Int]&amp;lt;/code&amp;gt; variable.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
&lt;br /&gt;
WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.&lt;br /&gt;
&lt;br /&gt;
===Scattering===&lt;br /&gt;
&lt;br /&gt;
Once we create an array of all the numbers, we can use a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt; to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a &amp;lt;code&amp;gt;map()&amp;lt;/code&amp;gt; in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; starts at 0.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
     Int one_based = i + 1&lt;br /&gt;
 }&lt;br /&gt;
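&lt;br /&gt;
If the &amp;lt;code&amp;gt;map()&amp;lt;/code&amp;gt; analogy helps, the same increment written in Python looks like this (purely illustrative; WDL runs the scatter iterations in parallel):&lt;br /&gt;
&lt;br /&gt;
```python
item_count = 5

# WDL's range() matches Python's: 0, 1, ..., item_count - 1.
numbers = list(range(item_count))

# The scatter body "Int one_based = i + 1" acts like a map over numbers;
# outside the scatter, one_based is seen as an Array[Int] of the results.
one_based = [i + 1 for i in numbers]

print(one_based)  # -> [1, 2, 3, 4, 5]
```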
&lt;br /&gt;
===Conditionals===&lt;br /&gt;
&lt;br /&gt;
Inside the body of the scatter, we are going to put some conditionals to determine if we should produce &amp;lt;code&amp;gt;&amp;quot;Fizz&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;Buzz&amp;quot;&amp;lt;/code&amp;gt;, or &amp;lt;code&amp;gt;&amp;quot;FizzBuzz&amp;quot;&amp;lt;/code&amp;gt;. To support our &amp;lt;code&amp;gt;fizzbuzz_override&amp;lt;/code&amp;gt;, we use an array of it and a default value, and use the WDL &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; function to find the first non-null value in that array.&lt;br /&gt;
&lt;br /&gt;
Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; at the end and take advantage of variables from un-executed conditionals being &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Note that WDL supports conditional ''expressions'' with a &amp;lt;code&amp;gt;then&amp;lt;/code&amp;gt; and an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt;, but conditional ''statements'' only have a body, not an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt; branch. If you need an else you will have to check the negated condition.&lt;br /&gt;
&lt;br /&gt;
So first, let's handle the special cases.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
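&lt;br /&gt;
To see how the un-executed branches combine with &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt;, here is the same decision modeled in Python, with &amp;lt;code&amp;gt;None&amp;lt;/code&amp;gt; standing in for WDL's null (illustrative only; &amp;lt;code&amp;gt;str()&amp;lt;/code&amp;gt; stands in for the task we will write in the next section):&lt;br /&gt;
&lt;br /&gt;
```python
def select_first(values):
    # Like WDL's select_first(): the first non-null value wins.
    return next(v for v in values if v is not None)

def fizzbuzz_word(one_based, to_fizz=3, to_buzz=5, fizzbuzz_override=None):
    # Variables declared in branches that don't run stay None, as in WDL.
    fizz = "Fizz" if one_based % to_fizz == 0 else None
    buzz = "Buzz" if one_based % to_buzz == 0 else None
    fizzbuzz = None
    if fizz is not None and buzz is not None:
        fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
    return select_first([fizzbuzz, fizz, buzz, str(one_based)])

print([fizzbuzz_word(n) for n in range(1, 16)])
# -> ['1', '2', 'Fizz', '4', 'Buzz', 'Fizz', '7', '8', 'Fizz', 'Buzz', '11', 'Fizz', '13', '14', 'FizzBuzz']
```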
&lt;br /&gt;
===Calling Tasks===&lt;br /&gt;
&lt;br /&gt;
Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a &amp;lt;code&amp;gt;${}&amp;lt;/code&amp;gt; substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a &amp;lt;code&amp;gt;stringify_number&amp;lt;/code&amp;gt; task, to be written later.&lt;br /&gt;
&lt;br /&gt;
To call a task (or another workflow), we use a &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt; statement and give it some inputs. Then we can fish the output values out of the task with &amp;lt;code&amp;gt;.&amp;lt;/code&amp;gt; access (at least for the numbers where we don't make a noise instead).&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
        call stringify_number {&lt;br /&gt;
            input:&lt;br /&gt;
                the_number = one_based&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We can put the code into the workflow now, and set about writing the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         &lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Writing Tasks===&lt;br /&gt;
&lt;br /&gt;
Our task should go after the workflow in the file. It looks a lot like a workflow except it uses &amp;lt;code&amp;gt;task&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're going to want it to take in an integer &amp;lt;code&amp;gt;the_number&amp;lt;/code&amp;gt;, and we're going to want it to output a string &amp;lt;code&amp;gt;the_string&amp;lt;/code&amp;gt;. So let's fill that in in &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; sections.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     # ???&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now, unlike workflows, tasks can have a &amp;lt;code&amp;gt;command&amp;lt;/code&amp;gt; section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use &amp;lt;code&amp;gt;~{}&amp;lt;/code&amp;gt;, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
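&lt;br /&gt;
Conceptually, the runner just substitutes each &amp;lt;code&amp;gt;~{}&amp;lt;/code&amp;gt; placeholder into the script before handing it to Bash. A rough Python model of that substitution (not Toil's actual implementation):&lt;br /&gt;
&lt;br /&gt;
```python
import re

def render_command(template, variables):
    # Replace each ~{name} placeholder with the WDL variable's value,
    # roughly what a WDL runner does before running the Bash script.
    return re.sub(r"~\{(\w+)\}", lambda m: str(variables[m.group(1)]), template)

print(render_command("echo ~{the_number}", {"the_number": 7}))  # -> echo 7
```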
&lt;br /&gt;
Now we need to capture the result of the command script. The WDL &amp;lt;code&amp;gt;stdout()&amp;lt;/code&amp;gt; returns a WDL &amp;lt;code&amp;gt;File&amp;lt;/code&amp;gt; containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
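&lt;br /&gt;
As a quick model of that capture step: &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; writes the number followed by a newline, and &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; reads it back without the trailing newline, roughly like this Python:&lt;br /&gt;
&lt;br /&gt;
```python
def read_string(file_contents):
    # Approximates WDL read_string(): take the file's text and strip
    # any trailing newlines.
    return file_contents.rstrip("\n")

# "echo 7" produces "7\n" on standard output; read_string(stdout())
# turns that back into the string "7".
print(repr(read_string("7\n")))  # -> '7'
```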
&lt;br /&gt;
We're also going to want to add a &amp;lt;code&amp;gt;runtime&amp;lt;/code&amp;gt; section to our task, to specify resource requirements. We'll tell it to run in a Docker container, too, to make sure that absolutely nothing can go wrong with our delicate &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command. In a real workflow, you probably want to set up optional inputs that let you control each task's resource requirements, but here we will just hardcode them.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;disks&amp;lt;/code&amp;gt; section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a &amp;lt;code&amp;gt;local-disk&amp;lt;/code&amp;gt; of a certain number of gigabytes, which may suggest that it be &amp;lt;code&amp;gt;SSD&amp;lt;/code&amp;gt; storage.&lt;br /&gt;
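&lt;br /&gt;
For reference, the three space-separated fields in that string are the disk name, the size in gigabytes, and the storage type. A tiny Python parse of the Cromwell-style string (illustrative; this is not Toil's parser):&lt;br /&gt;
&lt;br /&gt;
```python
def parse_disks(spec):
    # Cromwell-style disks string: "local-disk 1 SSD" is a disk named
    # local-disk, 1 GB in size, suggested to be SSD storage.
    name, size_gb, disk_type = spec.split()
    return name, int(size_gb), disk_type

print(parse_disks("local-disk 1 SSD"))  # -> ('local-disk', 1, 'SSD')
```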
&lt;br /&gt;
Then we can put our task into our WDL file:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Output Block===&lt;br /&gt;
&lt;br /&gt;
Now the only thing missing is a workflow-level &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section. Technically, WDL 1.0 doesn't require one, but WDL 1.1 does, and Toil doesn't actually deliver your outputs anywhere yet if you leave it out, so we're going to write one. We need to collect all the strings that came out of the different tasks in our scatter into an &amp;lt;code&amp;gt;Array[String]&amp;lt;/code&amp;gt;. We'll add the &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section at the end of the &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; section, above the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
     output {&lt;br /&gt;
        Array[String] fizzbuzz_results = result&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Because the &amp;lt;code&amp;gt;result&amp;lt;/code&amp;gt; variable is defined inside a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt;, when we reference it outside the scatter we see it as being an array.&lt;br /&gt;
&lt;br /&gt;
==Running the Workflow==&lt;br /&gt;
&lt;br /&gt;
Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;FizzBuzz.item_count&amp;quot;: 20}' &amp;gt;fizzbuzz.json&lt;br /&gt;
&lt;br /&gt;
Then run it on the cluster with Toil:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
Or locally:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
=Debugging Workflows=&lt;br /&gt;
Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.&lt;br /&gt;
&lt;br /&gt;
==Restarting the Workflow==&lt;br /&gt;
&lt;br /&gt;
If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have fixed, and you ran the workflow with &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; set manually to a directory that persists between attempts, you can add &amp;lt;code&amp;gt;--restart&amp;lt;/code&amp;gt; to your workflow command and make Toil try again. It will pick up from where it left off and rerun any failed tasks and then the rest of the workflow.&lt;br /&gt;
&lt;br /&gt;
This ''will not'' pick up any changes to your WDL source files; those are read once at the beginning and are not re-read on restart.&lt;br /&gt;
&lt;br /&gt;
If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques.&lt;br /&gt;
&lt;br /&gt;
==Debugging Options==&lt;br /&gt;
When debugging a workflow, make sure to run it with &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt;, to set the log level to &amp;lt;code&amp;gt;DEBUG&amp;lt;/code&amp;gt;, and with &amp;lt;code&amp;gt;--jobStore /some/path/to/a/shared/directory/it/can/create&amp;lt;/code&amp;gt;, so that the files shipped between jobs are stored somewhere you can access them.&lt;br /&gt;
&lt;br /&gt;
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:&lt;br /&gt;
&lt;br /&gt;
 =========&amp;gt;&lt;br /&gt;
         Toil job log is here&lt;br /&gt;
 &amp;lt;=========&lt;br /&gt;
&lt;br /&gt;
Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this.&lt;br /&gt;
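&lt;br /&gt;
With &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt; the main log can get very long, so it can help to pull out just the embedded job logs. A small Python sketch that assumes the marker format shown above:&lt;br /&gt;
&lt;br /&gt;
```python
import re

def embedded_job_logs(log_text):
    # Collect the text between each =========> / <========= marker pair.
    pattern = re.compile(r"=========>\n(.*?)<=========", re.DOTALL)
    return pattern.findall(log_text)

sample = "noise\n=========>\n\tToil job log is here\n<=========\nmore noise\n"
print(embedded_job_logs(sample))  # -> ['\tToil job log is here\n']
```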
&lt;br /&gt;
==Reading the Log==&lt;br /&gt;
&lt;br /&gt;
When a WDL workflow fails, you are likely to see a message like this:&lt;br /&gt;
&lt;br /&gt;
 WDL.runtime.error.CommandFailed: task command failed with exit status 1&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism&lt;br /&gt;
&lt;br /&gt;
This means that the command line specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code. That happens either when the command is written wrong, or when the error detection code in the tool you are trying to run detects and reports a problem.&lt;br /&gt;
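&lt;br /&gt;
The exit status convention is easy to check outside any workflow runner: zero means success and anything else is a failure. For example:&lt;br /&gt;
&lt;br /&gt;
```python
import subprocess
import sys

# Run a command that fails, the way a broken task command would.
result = subprocess.run([sys.executable, "-c", "raise SystemExit(1)"])
print(result.returncode)  # -> 1
```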
&lt;br /&gt;
Go up higher in the log until you find lines that look like:&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows:&lt;br /&gt;
&lt;br /&gt;
And&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows:&lt;br /&gt;
&lt;br /&gt;
These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.&lt;br /&gt;
&lt;br /&gt;
If you would like individual task logs to be saved separately for later reference, you can use the &amp;lt;code&amp;gt;--writeLogs&amp;lt;/code&amp;gt; option to specify a directory to store them. For more information, see [https://toil.readthedocs.io/en/latest/wdl/running.html#managing-workflow-logs the Toil documentation of workflow task logs].&lt;br /&gt;
&lt;br /&gt;
==Reproducing Problems==&lt;br /&gt;
&lt;br /&gt;
When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this.&lt;br /&gt;
&lt;br /&gt;
=== Automatically Fetching Input Files ===&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;toil debug-job&amp;lt;/code&amp;gt; command has a &amp;lt;code&amp;gt;--retrieveTaskDirectory&amp;lt;/code&amp;gt; option that lets you dump out a directory with all the files that a failing WDL task would use. You can use it like:&lt;br /&gt;
&lt;br /&gt;
 toil debug-job ./jobstore WDLTaskJob --retrieveTaskDirectory dumpdir&lt;br /&gt;
&lt;br /&gt;
If there are multiple failing tasks, you might need to replace &amp;lt;code&amp;gt;WDLTaskJob&amp;lt;/code&amp;gt; with the name of one of the failing jobs. See [https://toil.readthedocs.io/en/latest/running/debugging.html#fetching-job-inputs the Toil documentation on retrieving files] for more on how to use this command.&lt;br /&gt;
&lt;br /&gt;
=== Manually Finding Input Files ===&lt;br /&gt;
&lt;br /&gt;
If you can't use &amp;lt;code&amp;gt;toil debug-job&amp;lt;/code&amp;gt;, you might need to manually dig through the job store for files. In the log of your failing Toil task, look for lines like this:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&amp;lt;/code&amp;gt; part is a Toil file ID, and it is a relative path from your &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; value to where the file is stored on disk. So if you ran the workflow with &amp;lt;code&amp;gt;--jobStore /private/groups/patenlab/anovak/jobstore&amp;lt;/code&amp;gt;, you would look for this file at:&lt;br /&gt;
&lt;br /&gt;
 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&lt;br /&gt;
&lt;br /&gt;
==More Ways of Finding Files==&lt;br /&gt;
&lt;br /&gt;
Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the &amp;lt;code&amp;gt;find&amp;lt;/code&amp;gt; command to try to find the files by name. For example, if you want to look at &amp;lt;code&amp;gt;Sample.bam&amp;lt;/code&amp;gt;, you can look for it like this:&lt;br /&gt;
&lt;br /&gt;
 find /path/to/the/jobstore -name &amp;quot;Sample.bam&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
You can take the &amp;lt;code&amp;gt;toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt; URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this:&lt;br /&gt;
&lt;br /&gt;
 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
Then you can take the part after the last colon, &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt;, and that is the path relative to the job store where this file can be found.&lt;br /&gt;
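&lt;br /&gt;
If you do this often, the decoding step is easy to script. Here is a minimal Python sketch (standard library only, using the example URI from the log line above) that URL-decodes a &amp;lt;code&amp;gt;toilfile:&amp;lt;/code&amp;gt; URI and pulls out the job-store-relative path after the last colon:&lt;br /&gt;

```python
from urllib.parse import unquote

# toilfile URI copied from the Toil log (percent-encoded)
uri = ("toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob"
       "%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351"
       "%2FSample.chr14.bam/Sample.chr14.bam")

decoded = unquote(uri)
# Everything after the last colon is the path relative to the job store.
rel_path = decoded.rpartition(":")[2]
print(rel_path)
```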
&lt;br /&gt;
==Using Development Versions of Toil==&lt;br /&gt;
&lt;br /&gt;
Sometimes, bugs will be fixed in the development version of Toil, but not released yet. To try the current development version of Toil, you can install it like this:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
If you want to use a particular branch or commit, like &amp;lt;code&amp;gt;aaa451b320fc115b3563ced25cb501301cf86f90&amp;lt;/code&amp;gt;, you can do:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
==Frequently Asked Questions==&lt;br /&gt;
&lt;br /&gt;
===I am getting warnings about &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt;===&lt;br /&gt;
You may be seeing warnings that &amp;lt;code&amp;gt;XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should upgrade Toil. [https://github.com/DataBiosphere/toil/commit/ff6bf60ab798a675c20156c749817c4313644b96 Since Toil 6.1.0], Toil no longer issues this warning, and just puts up with bad &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt; settings.&lt;br /&gt;
&lt;br /&gt;
===Toil said it was &amp;lt;code&amp;gt;Redirecting logging&amp;lt;/code&amp;gt; somewhere, but I can't find that file!===&lt;br /&gt;
The Toil worker process for each job will say that it is &amp;lt;code&amp;gt;Redirecting logging to /data/tmp/somewhere/worker_log.txt&amp;lt;/code&amp;gt;, and when running in single machine mode these messages go to the main Toil log.&lt;br /&gt;
&lt;br /&gt;
The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt; option to Toil.&lt;br /&gt;
&lt;br /&gt;
If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.&lt;br /&gt;
&lt;br /&gt;
===How do I delete files in WDL?===&lt;br /&gt;
WDL doesn't have a built-in way to delete files; if you run a task that deletes a file, the file will still exist in Toil's job store.&lt;br /&gt;
&lt;br /&gt;
Toil [https://github.com/DataBiosphere/toil/commit/2de6eea2cc2e688b53062a98687445f0cca56669 recently gained support] for deleting files at the ''end'' of WDL workflows. So if you have a large file that you only need for part of your workflow, consider writing that part as a separate sub-&amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; and invoking it with &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt;. Then the file will be cleaned up when the child workflow ends, leaving more space for files created in the parent workflow.&lt;br /&gt;
&lt;br /&gt;
=Additional WDL resources=&lt;br /&gt;
&lt;br /&gt;
For more information on writing and running WDL workflows, see:&lt;br /&gt;
&lt;br /&gt;
* [https://docs.openwdl.org/en/stable/ The WDL documentation]&lt;br /&gt;
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The &amp;quot;Learn WDL&amp;quot; video course on YouTube]&lt;br /&gt;
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=663</id>
		<title>Phoenix WDL Tutorial</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=663"/>
		<updated>2025-03-18T18:43:03Z</updated>

		<summary type="html">&lt;p&gt;Anovak: /* How do I delete files in WDL? */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Tutorial: Getting Started with WDL Workflows on Phoenix'''&lt;br /&gt;
&lt;br /&gt;
Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you.&lt;br /&gt;
&lt;br /&gt;
Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.&lt;br /&gt;
&lt;br /&gt;
This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.&lt;br /&gt;
&lt;br /&gt;
=Phoenix Cluster Setup=&lt;br /&gt;
&lt;br /&gt;
Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.&lt;br /&gt;
&lt;br /&gt;
==Getting VPN access==&lt;br /&gt;
&lt;br /&gt;
We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.&lt;br /&gt;
&lt;br /&gt;
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.&lt;br /&gt;
&lt;br /&gt;
==Connecting to Phoenix==&lt;br /&gt;
&lt;br /&gt;
Once you have VPN access, you can connect to any of the machines with access to the Phoenix cluster. These interactive nodes are fairly large machines that can do some work locally, but you will still want to run larger workflows on the actual cluster. For this tutorial, we will use &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt; as our login node.&lt;br /&gt;
&lt;br /&gt;
To connect to the cluster:&lt;br /&gt;
&lt;br /&gt;
1. Connect to the VPN.&lt;br /&gt;
2. SSH to &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt;. At the command line, run:&lt;br /&gt;
&lt;br /&gt;
 ssh emerald.prism&lt;br /&gt;
&lt;br /&gt;
If your username on the cluster (say, &amp;lt;code&amp;gt;flastname&amp;lt;/code&amp;gt;) is different than your username on your computer (which might be &amp;lt;code&amp;gt;firstname&amp;lt;/code&amp;gt;), you might instead have to run:&lt;br /&gt;
&lt;br /&gt;
 ssh flastname@emerald.prism&lt;br /&gt;
&lt;br /&gt;
The first time you connect, you will see a message like:&lt;br /&gt;
&lt;br /&gt;
 The authenticity of host 'emerald.prism (10.50.1.67)' can't be established.&lt;br /&gt;
 ED25519 key fingerprint is SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI.&lt;br /&gt;
 This key is not known by any other names.&lt;br /&gt;
 Are you sure you want to continue connecting (yes/no/[fingerprint])?&lt;br /&gt;
&lt;br /&gt;
This is your computer asking you to help it decide if it is talking to the genuine &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt;, and not an imposter. You will want to make sure that the &amp;quot;key fingerprint&amp;quot; is indeed &amp;lt;code&amp;gt;SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI&amp;lt;/code&amp;gt;. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type &amp;lt;code&amp;gt;yes&amp;lt;/code&amp;gt; to accept and remember that the server is who it says it is.&lt;br /&gt;
&lt;br /&gt;
==Installing Toil with WDL support==&lt;br /&gt;
&lt;br /&gt;
Once you are on the head node, you can install Toil, a program for running workflows.&lt;br /&gt;
&lt;br /&gt;
Toil is written in Python, and the modern way to install Python command line tools is with pipx. So [https://pipx.pypa.io/latest/installation/ install pipx]:&lt;br /&gt;
&lt;br /&gt;
 python3 -m pip install --user pipx&lt;br /&gt;
 python3 -m pipx ensurepath&lt;br /&gt;
&lt;br /&gt;
This may instruct you to '''log out and log back in''' or take some other action to adopt the new &amp;lt;code&amp;gt;PATH&amp;lt;/code&amp;gt; settings.&lt;br /&gt;
&lt;br /&gt;
When installing Toil, you need to specify that you want WDL support. To do this, you can run:&lt;br /&gt;
&lt;br /&gt;
 pipx install 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
If you also want to use AWS S3 &amp;lt;code&amp;gt;s3://&amp;lt;/code&amp;gt; and/or Google &amp;lt;code&amp;gt;gs://&amp;lt;/code&amp;gt; URLs for data, you will need to also install Toil with the &amp;lt;code&amp;gt;aws&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;google&amp;lt;/code&amp;gt; extras, respectively:&lt;br /&gt;
&lt;br /&gt;
 pipx install 'toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
To change what extras are used when you have an existing Toil installation, you will need to use the &amp;lt;code&amp;gt;--force&amp;lt;/code&amp;gt; option.&lt;br /&gt;
&lt;br /&gt;
This will install Toil in the &amp;lt;code&amp;gt;.local&amp;lt;/code&amp;gt; directory inside your home directory, which we write as &amp;lt;code&amp;gt;~/.local&amp;lt;/code&amp;gt;. The program to run WDL workflows, &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, will be at &amp;lt;code&amp;gt;~/.local/bin/toil-wdl-runner&amp;lt;/code&amp;gt;. The &amp;lt;code&amp;gt;python3 -m pipx ensurepath&amp;lt;/code&amp;gt; command should have added the &amp;lt;code&amp;gt;~/.local/bin&amp;lt;/code&amp;gt; directory to your &amp;lt;code&amp;gt;PATH&amp;lt;/code&amp;gt; environment variable, to ensure you can find these commands.&lt;br /&gt;
&lt;br /&gt;
If you see something from &amp;lt;code&amp;gt;pipx&amp;lt;/code&amp;gt; like:&lt;br /&gt;
&lt;br /&gt;
     - cwltoil (symlink missing or pointing to unexpected location)&lt;br /&gt;
&lt;br /&gt;
Then &amp;lt;code&amp;gt;pipx uninstall toil&amp;lt;/code&amp;gt;, remove the offending file from &amp;lt;code&amp;gt;~/.local/bin&amp;lt;/code&amp;gt;, and try again.&lt;br /&gt;
&lt;br /&gt;
To make sure it worked, you can run:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --help&lt;br /&gt;
&lt;br /&gt;
If everything worked correctly, it will print a long list of the various option flags that the &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; command supports.&lt;br /&gt;
&lt;br /&gt;
If you ever want to upgrade Toil to a new release, you can repeat the &amp;lt;code&amp;gt;pipx install&amp;lt;/code&amp;gt; command above with the &amp;lt;code&amp;gt;--force&amp;lt;/code&amp;gt; option.&lt;br /&gt;
&lt;br /&gt;
==Configuring your Phoenix Environment==&lt;br /&gt;
&lt;br /&gt;
'''Do not try to store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. Usually you would end up with &amp;lt;code&amp;gt;/private/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Remember this path; we will need it later.&lt;br /&gt;
&lt;br /&gt;
==Configuring Toil for Phoenix==&lt;br /&gt;
&lt;br /&gt;
Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. Since these files can be large, and the home directory quota is only 30 GB, we might not be able to keep them in your home directory.&lt;br /&gt;
&lt;br /&gt;
We would like to be able to store these on the cluster's large storage array, under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. However, Toil needs to use file locks in these directories to prevent simultaneous Singularity calls from producing internal Singularity errors, and Ceph currently has [https://tracker.ceph.com/issues/65607 a bug where these file locking operations can freeze the Ceph servers].&lt;br /&gt;
&lt;br /&gt;
If you have '''a small number of container images''' that will fit in your home directory, you can keep them there. [https://github.com/DataBiosphere/toil/commit/cb0b291bb7f6212bfe69221dd9f09d72f83e92fb Since Toil 6.1.0], this is the default behavior and you don't need to do anything. (Unless you previously set &amp;lt;code&amp;gt;SINGULARITY_CACHEDIR&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;MINIWDL__SINGULARITY__IMAGE_CACHE&amp;lt;/code&amp;gt;, in which case you need to unset them.)&lt;br /&gt;
&lt;br /&gt;
'''If you don't have room in your home directory''' for container images, currently the recommended approach is to use node-local storage under &amp;lt;code&amp;gt;/data/tmp&amp;lt;/code&amp;gt;. This results in each node pulling each container image, but images will be saved across workflows.&lt;br /&gt;
&lt;br /&gt;
You can set that up for all your workflows with:&lt;br /&gt;
&lt;br /&gt;
 echo 'export SINGULARITY_CACHEDIR=&amp;quot;/data/tmp/$(whoami)/cache/singularity&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE=&amp;quot;/data/tmp/$(whoami)/cache/miniwdl&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
Then '''log out and log back in again''', to apply the changes.&lt;br /&gt;
&lt;br /&gt;
=Running an existing workflow=&lt;br /&gt;
&lt;br /&gt;
First, let's use &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].&lt;br /&gt;
&lt;br /&gt;
First, go to your user directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;, and make a directory to work in.&lt;br /&gt;
&lt;br /&gt;
 cd /private/groups/YOURGROUPNAME/YOURUSERNAME&lt;br /&gt;
 mkdir workflow-test&lt;br /&gt;
 cd workflow-test&lt;br /&gt;
&lt;br /&gt;
Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.&lt;br /&gt;
&lt;br /&gt;
 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl&lt;br /&gt;
&lt;br /&gt;
==Preparing an input file==&lt;br /&gt;
&lt;br /&gt;
Near the top of the WDL file, there's a section like this:&lt;br /&gt;
&lt;br /&gt;
 workflow hello_caller {&lt;br /&gt;
     input {&lt;br /&gt;
         File who&lt;br /&gt;
     }&lt;br /&gt;
&lt;br /&gt;
This means that there is a workflow named &amp;lt;code&amp;gt;hello_caller&amp;lt;/code&amp;gt; in this file, and it takes as input a file variable named &amp;lt;code&amp;gt;who&amp;lt;/code&amp;gt;. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.&lt;br /&gt;
&lt;br /&gt;
So first, we have to make that list of names. Let's make it in &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 echo &amp;quot;Mridula Resurrección&amp;quot; &amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Gershom Šarlota&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Ritchie Ravi&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
&lt;br /&gt;
Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt; that references it by relative path, like this:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./names.txt&amp;quot;}' &amp;gt;inputs.json&lt;br /&gt;
&lt;br /&gt;
Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].&lt;br /&gt;
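&lt;br /&gt;
For workflows with more than one input, hand-writing JSON gets error-prone; generating it with a real JSON serializer guarantees correct quoting and escaping. Here is a small Python sketch that produces the same inputs file as the &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command above:&lt;br /&gt;

```python
import json

# Keys are "<workflow name>.<input name>"; file values are paths
# relative to the inputs file's own location.
inputs = {"hello_caller.who": "./names.txt"}

with open("inputs.json", "w") as f:
    json.dump(inputs, f)
```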
&lt;br /&gt;
==Testing at small scale on a single machine==&lt;br /&gt;
&lt;br /&gt;
We are now ready to run the workflow!&lt;br /&gt;
&lt;br /&gt;
You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:&lt;br /&gt;
&lt;br /&gt;
 srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i&lt;br /&gt;
&lt;br /&gt;
This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In your new shell, run this Toil command:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner self_test.wdl inputs.json -o local_run&lt;br /&gt;
&lt;br /&gt;
This will, by default, use the &amp;lt;code&amp;gt;single_machine&amp;lt;/code&amp;gt; Toil &amp;quot;batch system&amp;quot; to run all of the workflow's tasks locally. Output will be sent to a new directory named &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
This will print a lot of logging to standard error, and to standard output it will print:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;hello_caller.message_files&amp;quot;: [&amp;quot;local_run/Mridula Resurrecci\u00f3n.txt&amp;quot;, &amp;quot;local_run/Gershom \u0160arlota.txt&amp;quot;, &amp;quot;local_run/Ritchie Ravi.txt&amp;quot;], &amp;quot;hello_caller.messages&amp;quot;: [&amp;quot;Hello, Mridula Resurrecci\u00f3n!&amp;quot;, &amp;quot;Hello, Gershom \u0160arlota!&amp;quot;, &amp;quot;Hello, Ritchie Ravi!&amp;quot;]}&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt; directory will contain the described text files (with Unicode escape sequences like &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt; replaced by their corresponding characters), each containing a greeting for the corresponding person.&lt;br /&gt;
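&lt;br /&gt;
The &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt;-style sequences in the printed JSON are ordinary JSON string escapes, so any JSON parser will restore the original characters. For example, in Python (using a fragment of the output above):&lt;br /&gt;

```python
import json

# A fragment of the toil-wdl-runner output, with JSON Unicode escapes.
# The double backslash keeps \u00f3 as a literal escape for the JSON parser.
output = '{"hello_caller.messages": ["Hello, Mridula Resurrecci\\u00f3n!"]}'

messages = json.loads(output)["hello_caller.messages"]
print(messages[0])  # Hello, Mridula Resurrección!
```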
&lt;br /&gt;
To leave your interactive Slurm session and return to the head node, use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Running at larger scale==&lt;br /&gt;
&lt;br /&gt;
Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people!&lt;br /&gt;
&lt;br /&gt;
Go get this handy list of people and cut it to length:&lt;br /&gt;
&lt;br /&gt;
 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt&lt;br /&gt;
 head -n100 1000_names.txt &amp;gt;100_names.txt&lt;br /&gt;
&lt;br /&gt;
And make a new inputs file:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./100_names.txt&amp;quot;}' &amp;gt;inputs_big.json&lt;br /&gt;
&lt;br /&gt;
Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.&lt;br /&gt;
&lt;br /&gt;
To run against the Slurm cluster, we need to use the &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the &amp;lt;code&amp;gt;--caching false&amp;lt;/code&amp;gt; option. We will add the &amp;lt;code&amp;gt;--batchLogsDir&amp;lt;/code&amp;gt; option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the &amp;lt;code&amp;gt;-m&amp;lt;/code&amp;gt; option to save the output JSON to a file instead of printing it.&lt;br /&gt;
&lt;br /&gt;
Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the &amp;lt;code&amp;gt;TOIL_SLURM_ARGS&amp;lt;/code&amp;gt; environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in.&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json&lt;br /&gt;
&lt;br /&gt;
This will tick for a while, but eventually you should end up with 100 greeting files in the &amp;lt;code&amp;gt;slurm_run&amp;lt;/code&amp;gt; directory.&lt;br /&gt;
&lt;br /&gt;
=Writing your own workflow=&lt;br /&gt;
&lt;br /&gt;
In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. &lt;br /&gt;
&lt;br /&gt;
==Writing the file==&lt;br /&gt;
&lt;br /&gt;
===Version===&lt;br /&gt;
&lt;br /&gt;
All WDL files need to start with a &amp;lt;code&amp;gt;version&amp;lt;/code&amp;gt; statement (unless they are very old &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; files). Toil supports &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt;, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; and 1.0.&lt;br /&gt;
&lt;br /&gt;
So let's start a new WDL 1.0 workflow. Open up a file named &amp;lt;code&amp;gt;fizzbuzz.wdl&amp;lt;/code&amp;gt; and start with a version statement:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
&lt;br /&gt;
===Workflow Block===&lt;br /&gt;
&lt;br /&gt;
Then, add an empty &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; named &amp;lt;code&amp;gt;FizzBuzz&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Input Block===&lt;br /&gt;
&lt;br /&gt;
Workflows usually need some kind of user input, so let's give our workflow an &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; section.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Notice that each input has a type, a name, and an optional default value. If the type ends in &amp;lt;code&amp;gt;?&amp;lt;/code&amp;gt;, the value is optional, and it may be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.&lt;br /&gt;
&lt;br /&gt;
===Body===&lt;br /&gt;
&lt;br /&gt;
Now we'll start on the body of the workflow, to be inserted just after the inputs section. &lt;br /&gt;
&lt;br /&gt;
The first thing we're going to need to do is create an array of all the numbers up to the &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;. We can do this by calling the WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; function, and assigning the result to an &amp;lt;code&amp;gt;Array[Int]&amp;lt;/code&amp;gt; variable.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
&lt;br /&gt;
WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.&lt;br /&gt;
&lt;br /&gt;
===Scattering===&lt;br /&gt;
&lt;br /&gt;
Once we create an array of all the numbers, we can use a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt; to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a &amp;lt;code&amp;gt;map()&amp;lt;/code&amp;gt; in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; starts at 0.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
     Int one_based = i + 1&lt;br /&gt;
 }&lt;br /&gt;
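&lt;br /&gt;
The &amp;lt;code&amp;gt;map()&amp;lt;/code&amp;gt; analogy is fairly direct: the scatter body plays the role of the mapped function, and the input array is the iterable. The scatter above corresponds to this Python sketch (with a made-up &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt; of 5):&lt;br /&gt;

```python
# Python equivalent of the WDL scatter: turn each 0-based index
# from range() into a 1-based FizzBuzz number, each independently.
item_count = 5
numbers = range(item_count)
one_based = list(map(lambda i: i + 1, numbers))
print(one_based)  # [1, 2, 3, 4, 5]
```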
&lt;br /&gt;
===Conditionals===&lt;br /&gt;
&lt;br /&gt;
Inside the body of the scatter, we are going to put some conditionals to determine if we should produce &amp;lt;code&amp;gt;&amp;quot;Fizz&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;Buzz&amp;quot;&amp;lt;/code&amp;gt;, or &amp;lt;code&amp;gt;&amp;quot;FizzBuzz&amp;quot;&amp;lt;/code&amp;gt;. To support our &amp;lt;code&amp;gt;fizzbuzz_override&amp;lt;/code&amp;gt;, we use an array of it and a default value, and use the WDL &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; function to find the first non-null value in that array.&lt;br /&gt;
&lt;br /&gt;
Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; at the end and take advantage of variables from un-executed conditionals being &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Note that WDL supports conditional ''expressions'' with a &amp;lt;code&amp;gt;then&amp;lt;/code&amp;gt; and an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt;, but conditional ''statements'' only have a body, not an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt; branch. If you need an else, you will have to write a second conditional that checks the negated condition.&lt;br /&gt;
&lt;br /&gt;
So first, let's handle the special cases.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Calling Tasks===&lt;br /&gt;
&lt;br /&gt;
Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a &amp;lt;code&amp;gt;${}&amp;lt;/code&amp;gt; substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a &amp;lt;code&amp;gt;stringify_number&amp;lt;/code&amp;gt; task, to be written later.&lt;br /&gt;
&lt;br /&gt;
To call a task (or another workflow), we use a &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt; statement and give it some inputs. Then we can fish the output values out of the task with &amp;lt;code&amp;gt;.&amp;lt;/code&amp;gt; access, but only when we didn't make a noise instead.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
        call stringify_number {&lt;br /&gt;
            input:&lt;br /&gt;
                the_number = one_based&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We can put the code into the workflow now, and set about writing the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         &lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Writing Tasks===&lt;br /&gt;
&lt;br /&gt;
Our task should go after the workflow in the file. It looks a lot like a workflow except it uses &amp;lt;code&amp;gt;task&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're going to want it to take in an integer &amp;lt;code&amp;gt;the_number&amp;lt;/code&amp;gt;, and we're going to want it to output a string &amp;lt;code&amp;gt;the_string&amp;lt;/code&amp;gt;. So let's fill that in in &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; sections.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     # ???&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now, unlike workflows, tasks can have a &amp;lt;code&amp;gt;command&amp;lt;/code&amp;gt; section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use &amp;lt;code&amp;gt;~{}&amp;lt;/code&amp;gt;, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now we need to capture the result of the command script. The WDL &amp;lt;code&amp;gt;stdout()&amp;lt;/code&amp;gt; returns a WDL &amp;lt;code&amp;gt;File&amp;lt;/code&amp;gt; containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We also want to add a &amp;lt;code&amp;gt;runtime&amp;lt;/code&amp;gt; section to our task, to specify resource requirements, and to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;disks&amp;lt;/code&amp;gt; section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a &amp;lt;code&amp;gt;local-disk&amp;lt;/code&amp;gt; of a certain number of gigabytes, which may suggest that it be &amp;lt;code&amp;gt;SSD&amp;lt;/code&amp;gt; storage.&lt;br /&gt;
&lt;br /&gt;
Then we can put our task into our WDL file:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Output Block===&lt;br /&gt;
&lt;br /&gt;
Now the only thing missing is a workflow-level &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section. Technically, in WDL 1.0 you aren't supposed to need this, but it is required in WDL 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one. So we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an &amp;lt;code&amp;gt;Array[String]&amp;lt;/code&amp;gt;. We'll add the &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section at the end of the &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; section, above the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
     output {&lt;br /&gt;
         Array[String] fizzbuzz_results = result&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Because the &amp;lt;code&amp;gt;result&amp;lt;/code&amp;gt; variable is defined inside a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt;, when we reference it outside the scatter we see it as being an array.&lt;br /&gt;
&lt;br /&gt;
==Running the Workflow==&lt;br /&gt;
&lt;br /&gt;
Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;FizzBuzz.item_count&amp;quot;: 20}' &amp;gt;fizzbuzz.json&lt;br /&gt;
&lt;br /&gt;
Then run it on the cluster with Toil:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
Or locally:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
=Debugging Workflows=&lt;br /&gt;
Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.&lt;br /&gt;
&lt;br /&gt;
==Restarting the Workflow==&lt;br /&gt;
&lt;br /&gt;
If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have fixed, and you ran the workflow with &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; set manually to a directory that persists between attempts, you can add &amp;lt;code&amp;gt;--restart&amp;lt;/code&amp;gt; to your workflow command and make Toil try again. It will pick up from where it left off and rerun any failed tasks and then the rest of the workflow.&lt;br /&gt;
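&lt;br /&gt;
For example, to resume the FizzBuzz run from earlier after fixing the problem, keep the same job store and add the flag:&lt;br /&gt;
&lt;br /&gt;
```shell
toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json --restart
```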
&lt;br /&gt;
This ''will not'' pick up any changes to your WDL source code files; those are read once at the beginning and not re-read on restart.&lt;br /&gt;
&lt;br /&gt;
If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques.&lt;br /&gt;
&lt;br /&gt;
==Debugging Options==&lt;br /&gt;
When debugging a workflow, make sure to run the workflow with &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt;, to set the log level to &amp;lt;code&amp;gt;DEBUG&amp;lt;/code&amp;gt;, and with &amp;lt;code&amp;gt;--jobStore /some/path/to/a/shared/directory/it/can/create&amp;lt;/code&amp;gt; so that the stored files shipped between jobs are in a place you can access them.&lt;br /&gt;
&lt;br /&gt;
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:&lt;br /&gt;
&lt;br /&gt;
 =========&amp;gt;&lt;br /&gt;
         Toil job log is here&lt;br /&gt;
 &amp;lt;=========&lt;br /&gt;
&lt;br /&gt;
Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this.&lt;br /&gt;
&lt;br /&gt;
==Reading the Log==&lt;br /&gt;
&lt;br /&gt;
When a WDL workflow fails, you are likely to see a message like this:&lt;br /&gt;
&lt;br /&gt;
 WDL.runtime.error.CommandFailed: task command failed with exit status 1&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism&lt;br /&gt;
&lt;br /&gt;
This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code. This happens either when the command is written wrong, or when the error detection code in the tool you are trying to run detects and reports a problem.&lt;br /&gt;
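&lt;br /&gt;
You can see an exit status yourself at the shell; the &amp;lt;code&amp;gt;false&amp;lt;/code&amp;gt; command always fails, and &amp;lt;code&amp;gt;$?&amp;lt;/code&amp;gt; holds the last command's exit status:&lt;br /&gt;
&lt;br /&gt;
```shell
# false always fails; $? holds the last command's exit status
false
echo $?   # prints 1
```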
&lt;br /&gt;
Go up higher in the log until you find lines that look like:&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows:&lt;br /&gt;
&lt;br /&gt;
And&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows:&lt;br /&gt;
&lt;br /&gt;
These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.&lt;br /&gt;
&lt;br /&gt;
If you would like individual task logs to be saved separately for later reference, you can use the &amp;lt;code&amp;gt;--writeLogs&amp;lt;/code&amp;gt; option to specify a directory to store them. For more information, see [https://toil.readthedocs.io/en/latest/wdl/running.html#managing-workflow-logs the Toil documentation of workflow task logs].&lt;br /&gt;
&lt;br /&gt;
==Reproducing Problems==&lt;br /&gt;
&lt;br /&gt;
When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this.&lt;br /&gt;
&lt;br /&gt;
=== Automatically Fetching Input Files ===&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;toil debug-job&amp;lt;/code&amp;gt; command has a &amp;lt;code&amp;gt;--retrieveTaskDirectory&amp;lt;/code&amp;gt; option that lets you dump out a directory with all the files that a failing WDL task would use. You can use it like:&lt;br /&gt;
&lt;br /&gt;
 toil debug-job ./jobstore WDLTaskJob --retrieveTaskDirectory dumpdir&lt;br /&gt;
&lt;br /&gt;
If there are multiple failing tasks, you might need to replace &amp;lt;code&amp;gt;WDLTaskJob&amp;lt;/code&amp;gt; with the name of one of the failing jobs. See [https://toil.readthedocs.io/en/latest/running/debugging.html#fetching-job-inputs the Toil documentation on retrieving files] for more on how to use this command.&lt;br /&gt;
&lt;br /&gt;
=== Manually Finding Input Files ===&lt;br /&gt;
&lt;br /&gt;
If you can't use &amp;lt;code&amp;gt;toil debug-job&amp;lt;/code&amp;gt;, you might need to manually dig through the job store for files. In the log of your failing Toil task, look for lines like this:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&amp;lt;/code&amp;gt; part is a Toil file ID, and it is a relative path from your &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; value to where the file is stored on disk. So if you ran the workflow with &amp;lt;code&amp;gt;--jobStore /private/groups/patenlab/anovak/jobstore&amp;lt;/code&amp;gt;, you would look for this file at:&lt;br /&gt;
&lt;br /&gt;
 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&lt;br /&gt;
&lt;br /&gt;
==More Ways of Finding Files==&lt;br /&gt;
&lt;br /&gt;
Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the &amp;lt;code&amp;gt;find&amp;lt;/code&amp;gt; command to try and find the files by name. For example, if you want to look at &amp;lt;code&amp;gt;Sample.bam&amp;lt;/code&amp;gt;, you can look for it like this:&lt;br /&gt;
&lt;br /&gt;
 find /path/to/the/jobstore -name &amp;quot;Sample.bam&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
You can take the &amp;lt;code&amp;gt;toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt; URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this:&lt;br /&gt;
&lt;br /&gt;
 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
Then you can take the part after the last colon, &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt;, and that is the path relative to the job store where this file can be found.&lt;br /&gt;
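&lt;br /&gt;
You can also URL-decode at the command line with Python's standard &amp;lt;code&amp;gt;urllib.parse.unquote()&amp;lt;/code&amp;gt; function (shown here with a shortened, made-up file ID):&lt;br /&gt;
&lt;br /&gt;
```shell
# Decode a percent-encoded Toil file path (the file ID here is hypothetical)
python3 -c 'import urllib.parse, sys; print(urllib.parse.unquote(sys.argv[1]))' \
  'files%2Ffor-job%2Fkind-WDLTaskJob%2Ffile-c4e4%2FSample.chr14.bam'
```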
&lt;br /&gt;
==Using Development Versions of Toil==&lt;br /&gt;
&lt;br /&gt;
Sometimes, bugs will be fixed in the development version of Toil, but not released yet. To try the current development version of Toil, you can install it like this:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
If you want to use a particular branch or commit, like &amp;lt;code&amp;gt;aaa451b320fc115b3563ced25cb501301cf86f90&amp;lt;/code&amp;gt;, you can do:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
==Frequently Asked Questions==&lt;br /&gt;
&lt;br /&gt;
===I am getting warnings about &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt;===&lt;br /&gt;
You may be seeing warnings that &amp;lt;code&amp;gt;XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should upgrade Toil. [https://github.com/DataBiosphere/toil/commit/ff6bf60ab798a675c20156c749817c4313644b96 Since Toil 6.1.0], Toil no longer issues this warning, and just puts up with bad &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt; settings.&lt;br /&gt;
&lt;br /&gt;
===Toil said it was &amp;lt;code&amp;gt;Redirecting logging&amp;lt;/code&amp;gt; somewhere, but I can't find that file!===&lt;br /&gt;
The Toil worker process for each job will say that it is &amp;lt;code&amp;gt;Redirecting logging to /data/tmp/somewhere/worker_log.txt&amp;lt;/code&amp;gt;, and when running in single machine mode these messages go to the main Toil log.&lt;br /&gt;
&lt;br /&gt;
The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt; option to Toil.&lt;br /&gt;
&lt;br /&gt;
If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.&lt;br /&gt;
&lt;br /&gt;
===How do I delete files in WDL?===&lt;br /&gt;
WDL doesn't have a built-in way to delete files; if you run a task that deletes a file, it will still exist in Toil's job store storage.&lt;br /&gt;
&lt;br /&gt;
Toil [https://github.com/DataBiosphere/toil/commit/2de6eea2cc2e688b53062a98687445f0cca56669 recently gained support] for deleting files at the 'end' of WDL workflows. So if you have a large file that you only need for part of your workflow, consider writing that part as a separate sub-&amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; and invoking it with &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt;. Then the file will be cleaned up when the child workflow ends, leaving more space for files created in the parent workflow.&lt;br /&gt;
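&lt;br /&gt;
A minimal sketch of that pattern (&amp;lt;code&amp;gt;make_big_file&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;count_reads&amp;lt;/code&amp;gt; are hypothetical tasks, not defined here):&lt;br /&gt;
&lt;br /&gt;
```wdl
workflow summarize_sample {
    input {
        File reads
    }
    # The large intermediate file exists only inside this child workflow.
    call make_big_file { input: the_reads = reads }
    call count_reads { input: the_file = make_big_file.big_file }
    output {
        # Only the small count escapes; the big file can be cleaned up
        # when this child workflow ends.
        Int read_count = count_reads.the_count
    }
}
```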
&lt;br /&gt;
=Additional WDL resources=&lt;br /&gt;
&lt;br /&gt;
For more information on writing and running WDL workflows, see:&lt;br /&gt;
&lt;br /&gt;
* [https://docs.openwdl.org/en/stable/ The WDL documentation]&lt;br /&gt;
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The &amp;quot;Learn WDL&amp;quot; video course on YouTube]&lt;br /&gt;
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=662</id>
		<title>Phoenix WDL Tutorial</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=662"/>
		<updated>2025-03-18T18:42:50Z</updated>

		<summary type="html">&lt;p&gt;Anovak: /* Frequently Asked Questions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Tutorial: Getting Started with WDL Workflows on Phoenix'''&lt;br /&gt;
&lt;br /&gt;
Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like the Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you.&lt;br /&gt;
&lt;br /&gt;
Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.&lt;br /&gt;
&lt;br /&gt;
This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.&lt;br /&gt;
&lt;br /&gt;
=Phoenix Cluster Setup=&lt;br /&gt;
&lt;br /&gt;
Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.&lt;br /&gt;
&lt;br /&gt;
==Getting VPN access==&lt;br /&gt;
&lt;br /&gt;
We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.&lt;br /&gt;
&lt;br /&gt;
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.&lt;br /&gt;
&lt;br /&gt;
==Connecting to Phoenix==&lt;br /&gt;
&lt;br /&gt;
Once you have VPN access, you can connect to any of the machines with access to the Phoenix cluster. These interactive nodes are fairly large machines that can do some work locally, but you will still want to run larger workflows on the actual cluster. For this tutorial, we will use &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt; as our login node.&lt;br /&gt;
&lt;br /&gt;
To connect to the cluster:&lt;br /&gt;
&lt;br /&gt;
1. Connect to the VPN.&lt;br /&gt;
2. SSH to &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt;. At the command line, run:&lt;br /&gt;
&lt;br /&gt;
 ssh emerald.prism&lt;br /&gt;
&lt;br /&gt;
If your username on the cluster (say, &amp;lt;code&amp;gt;flastname&amp;lt;/code&amp;gt;) is different than your username on your computer (which might be &amp;lt;code&amp;gt;firstname&amp;lt;/code&amp;gt;), you might instead have to run:&lt;br /&gt;
&lt;br /&gt;
 ssh flastname@emerald.prism&lt;br /&gt;
&lt;br /&gt;
The first time you connect, you will see a message like:&lt;br /&gt;
&lt;br /&gt;
 The authenticity of host 'emerald.prism (10.50.1.67)' can't be established.&lt;br /&gt;
 ED25519 key fingerprint is SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI.&lt;br /&gt;
 This key is not known by any other names.&lt;br /&gt;
 Are you sure you want to continue connecting (yes/no/[fingerprint])?&lt;br /&gt;
&lt;br /&gt;
This is your computer asking you to help it decide if it is talking to the genuine &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt;, and not an imposter. You will want to make sure that the &amp;quot;key fingerprint&amp;quot; is indeed &amp;lt;code&amp;gt;SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI&amp;lt;/code&amp;gt;. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type &amp;lt;code&amp;gt;yes&amp;lt;/code&amp;gt; to accept and remember that the server is who it says it is.&lt;br /&gt;
&lt;br /&gt;
==Installing Toil with WDL support==&lt;br /&gt;
&lt;br /&gt;
Once you are on the head node, you can install Toil, a program for running workflows.&lt;br /&gt;
&lt;br /&gt;
Toil is written in Python, and the modern way to install Python command line tools is with pipx. So [https://pipx.pypa.io/latest/installation/ install pipx]:&lt;br /&gt;
&lt;br /&gt;
 python3 -m pip install --user pipx&lt;br /&gt;
 python3 -m pipx ensurepath&lt;br /&gt;
&lt;br /&gt;
This may instruct you to '''log out and log back in''' or take some other action to adopt the new &amp;lt;code&amp;gt;PATH&amp;lt;/code&amp;gt; settings.&lt;br /&gt;
&lt;br /&gt;
When installing Toil, you need to specify that you want WDL support. To do this, you can run:&lt;br /&gt;
&lt;br /&gt;
 pipx install 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
If you also want to use AWS S3 &amp;lt;code&amp;gt;s3://&amp;lt;/code&amp;gt; and/or Google &amp;lt;code&amp;gt;gs://&amp;lt;/code&amp;gt; URLs for data, you will need to also install Toil with the &amp;lt;code&amp;gt;aws&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;google&amp;lt;/code&amp;gt; extras, respectively:&lt;br /&gt;
&lt;br /&gt;
 pipx install 'toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
To change what extras are used when you have an existing Toil installation, you will need to use the &amp;lt;code&amp;gt;--force&amp;lt;/code&amp;gt; option.&lt;br /&gt;
&lt;br /&gt;
This will install Toil in the &amp;lt;code&amp;gt;.local&amp;lt;/code&amp;gt; directory inside your home directory, which we write as &amp;lt;code&amp;gt;~/.local&amp;lt;/code&amp;gt;. The program to run WDL workflows, &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, will be at &amp;lt;code&amp;gt;~/.local/bin/toil-wdl-runner&amp;lt;/code&amp;gt;. The &amp;lt;code&amp;gt;python3 -m pipx ensurepath&amp;lt;/code&amp;gt; command should have added the &amp;lt;code&amp;gt;~/.local/bin&amp;lt;/code&amp;gt; directory to your &amp;lt;code&amp;gt;PATH&amp;lt;/code&amp;gt; environment variable, to ensure you can find these commands.&lt;br /&gt;
&lt;br /&gt;
If you see something from &amp;lt;code&amp;gt;pipx&amp;lt;/code&amp;gt; like:&lt;br /&gt;
&lt;br /&gt;
     - cwltoil (symlink missing or pointing to unexpected location)&lt;br /&gt;
&lt;br /&gt;
Then &amp;lt;code&amp;gt;pipx uninstall toil&amp;lt;/code&amp;gt;, remove the offending file from &amp;lt;code&amp;gt;~/.local/bin&amp;lt;/code&amp;gt;, and try again.&lt;br /&gt;
&lt;br /&gt;
To make sure it worked, you can run:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --help&lt;br /&gt;
&lt;br /&gt;
If everything worked correctly, it will print a long list of the various option flags that the &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; command supports.&lt;br /&gt;
&lt;br /&gt;
If you ever want to upgrade Toil to a new release, you can run &amp;lt;code&amp;gt;pipx upgrade toil&amp;lt;/code&amp;gt;, or repeat the &amp;lt;code&amp;gt;pipx install&amp;lt;/code&amp;gt; command above with &amp;lt;code&amp;gt;--force&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Configuring your Phoenix Environment==&lt;br /&gt;
&lt;br /&gt;
'''Do not try and store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. Usually you would end up with &amp;lt;code&amp;gt;/private/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Remember this path; we will need it later.&lt;br /&gt;
&lt;br /&gt;
==Configuring Toil for Phoenix==&lt;br /&gt;
&lt;br /&gt;
Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, so we should configure Toil to use it for caching the Docker container images used for running workflow steps. Since these images can be large, and the home directory quota is only 30 GB, you might not be able to keep them in your home directory.&lt;br /&gt;
&lt;br /&gt;
We would like to be able to store these on the cluster's large storage array, under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. However, Toil needs to use file locks in these directories to prevent simultaneous Singularity calls from producing internal Singularity errors, and Ceph currently has [https://tracker.ceph.com/issues/65607 a bug where these file locking operations can freeze the Ceph servers].&lt;br /&gt;
&lt;br /&gt;
If you have '''a small number of container images''' that will fit in your home directory, you can keep them there. [https://github.com/DataBiosphere/toil/commit/cb0b291bb7f6212bfe69221dd9f09d72f83e92fb Since Toil 6.1.0], this is the default behavior and you don't need to do anything. (Unless you previously set &amp;lt;code&amp;gt;SINGULARITY_CACHEDIR&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;MINIWDL__SINGULARITY__IMAGE_CACHE&amp;lt;/code&amp;gt;, in which case you need to unset them.)&lt;br /&gt;
&lt;br /&gt;
'''If you don't have room in your home directory''' for container images, currently the recommended approach is to use node-local storage under &amp;lt;code&amp;gt;/data/tmp&amp;lt;/code&amp;gt;. This results in each node pulling each container image, but images will be saved across workflows.&lt;br /&gt;
&lt;br /&gt;
You can set that up for all your workflows with:&lt;br /&gt;
&lt;br /&gt;
 echo 'export SINGULARITY_CACHEDIR=&amp;quot;/data/tmp/$(whoami)/cache/singularity&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE=&amp;quot;/data/tmp/$(whoami)/cache/miniwdl&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
Then '''log out and log back in again''', to apply the changes.&lt;br /&gt;
&lt;br /&gt;
=Running an existing workflow=&lt;br /&gt;
&lt;br /&gt;
First, let's use &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].&lt;br /&gt;
&lt;br /&gt;
Go to your user directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;, and make a directory to work in.&lt;br /&gt;
&lt;br /&gt;
 cd /private/groups/YOURGROUPNAME/YOURUSERNAME&lt;br /&gt;
 mkdir workflow-test&lt;br /&gt;
 cd workflow-test&lt;br /&gt;
&lt;br /&gt;
Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.&lt;br /&gt;
&lt;br /&gt;
 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl&lt;br /&gt;
&lt;br /&gt;
==Preparing an input file==&lt;br /&gt;
&lt;br /&gt;
Near the top of the WDL file, there's a section like this:&lt;br /&gt;
&lt;br /&gt;
 workflow hello_caller {&lt;br /&gt;
     input {&lt;br /&gt;
         File who&lt;br /&gt;
     }&lt;br /&gt;
&lt;br /&gt;
This means that there is a workflow named &amp;lt;code&amp;gt;hello_caller&amp;lt;/code&amp;gt; in this file, and it takes as input a file variable named &amp;lt;code&amp;gt;who&amp;lt;/code&amp;gt;. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.&lt;br /&gt;
&lt;br /&gt;
So first, we have to make that list of names. Let's make it in &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 echo &amp;quot;Mridula Resurrección&amp;quot; &amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Gershom Šarlota&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Ritchie Ravi&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
&lt;br /&gt;
Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt; that references it by relative path, like this:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./names.txt&amp;quot;}' &amp;gt;inputs.json&lt;br /&gt;
&lt;br /&gt;
Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].&lt;br /&gt;
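Before running anything, you can sanity-check that the inputs file parses as JSON at all. A quick sketch (assuming python3 is available, as it is on the cluster nodes):

```shell
# Recreate the inputs file and confirm it is valid JSON.
echo '{"hello_caller.who": "./names.txt"}' > inputs.json
python3 -m json.tool inputs.json
```

If the file is malformed, python3 -m json.tool prints the parse error and exits nonzero, which is easier to debug than a failed workflow submission.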
&lt;br /&gt;
==Testing at small scale on a single machine==&lt;br /&gt;
&lt;br /&gt;
We are now ready to run the workflow!&lt;br /&gt;
&lt;br /&gt;
You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:&lt;br /&gt;
&lt;br /&gt;
 srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i&lt;br /&gt;
&lt;br /&gt;
This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In your new shell, run this Toil command:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner self_test.wdl inputs.json -o local_run&lt;br /&gt;
&lt;br /&gt;
This will, by default, use the &amp;lt;code&amp;gt;single_machine&amp;lt;/code&amp;gt; Toil &amp;quot;batch system&amp;quot; to run all of the workflow's tasks locally. Output will be sent to a new directory named &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
This will print a lot of logging to standard error, and to standard output it will print:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;hello_caller.message_files&amp;quot;: [&amp;quot;local_run/Mridula Resurrecci\u00f3n.txt&amp;quot;, &amp;quot;local_run/Gershom \u0160arlota.txt&amp;quot;, &amp;quot;local_run/Ritchie Ravi.txt&amp;quot;], &amp;quot;hello_caller.messages&amp;quot;: [&amp;quot;Hello, Mridula Resurrecci\u00f3n!&amp;quot;, &amp;quot;Hello, Gershom \u0160arlota!&amp;quot;, &amp;quot;Hello, Ritchie Ravi!&amp;quot;]}&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt; directory will contain the described text files (with Unicode escape sequences like &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt; replaced by their corresponding characters), each containing a greeting for the corresponding person.&lt;br /&gt;
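Because the outputs arrive as plain JSON, you can pick out individual values with any JSON tool. A sketch, using an abbreviated copy of the JSON above:

```shell
# Extract the first greeting from an (abbreviated) copy of the output JSON.
out='{"hello_caller.messages": ["Hello, Ritchie Ravi!"]}'
echo "$out" | python3 -c 'import json, sys; print(json.load(sys.stdin)["hello_caller.messages"][0])'
```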
&lt;br /&gt;
To leave your interactive Slurm session and return to the head node, use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Running at larger scale==&lt;br /&gt;
&lt;br /&gt;
Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people!&lt;br /&gt;
&lt;br /&gt;
Go get this handy list of people and cut it to length:&lt;br /&gt;
&lt;br /&gt;
 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt&lt;br /&gt;
 head -n100 1000_names.txt &amp;gt;100_names.txt&lt;br /&gt;
&lt;br /&gt;
And make a new inputs file:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./100_names.txt&amp;quot;}' &amp;gt;inputs_big.json&lt;br /&gt;
&lt;br /&gt;
Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.&lt;br /&gt;
&lt;br /&gt;
To run against the Slurm cluster, we need to use the &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the &amp;lt;code&amp;gt;--caching false&amp;lt;/code&amp;gt; option. We will add the &amp;lt;code&amp;gt;--batchLogsDir&amp;lt;/code&amp;gt; option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the &amp;lt;code&amp;gt;-m&amp;lt;/code&amp;gt; option to save the output JSON to a file instead of printing it.&lt;br /&gt;
&lt;br /&gt;
Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the &amp;lt;code&amp;gt;TOIL_SLURM_ARGS&amp;lt;/code&amp;gt; environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in.&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json&lt;br /&gt;
&lt;br /&gt;
This will tick for a while, but eventually you should end up with 100 greeting files in the &amp;lt;code&amp;gt;slurm_run&amp;lt;/code&amp;gt; directory.&lt;br /&gt;
&lt;br /&gt;
=Writing your own workflow=&lt;br /&gt;
&lt;br /&gt;
In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. &lt;br /&gt;
&lt;br /&gt;
==Writing the file==&lt;br /&gt;
&lt;br /&gt;
===Version===&lt;br /&gt;
&lt;br /&gt;
All WDL files need to start with a &amp;lt;code&amp;gt;version&amp;lt;/code&amp;gt; statement (unless they are very old &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; files). Toil supports &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt;, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; and 1.0.&lt;br /&gt;
&lt;br /&gt;
So let's start a new WDL 1.0 workflow. Open up a file named &amp;lt;code&amp;gt;fizzbuzz.wdl&amp;lt;/code&amp;gt; and start with a version statement:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
&lt;br /&gt;
===Workflow Block===&lt;br /&gt;
&lt;br /&gt;
Then, add an empty &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; named &amp;lt;code&amp;gt;FizzBuzz&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Input Block===&lt;br /&gt;
&lt;br /&gt;
Workflows usually need some kind of user input, so let's give our workflow an &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; section.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Notice that each input has a type, a name, and an optional default value. If the type ends in &amp;lt;code&amp;gt;?&amp;lt;/code&amp;gt;, the value is optional, and it may be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.&lt;br /&gt;
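For example, a hypothetical inputs file for this workflow only needs to provide item_count; everything else has a default or is optional. The file name and override value here are just for illustration:

```shell
# Minimal FizzBuzz inputs: item_count is required, the override is optional.
echo '{"FizzBuzz.item_count": 15, "FizzBuzz.fizzbuzz_override": "FB!"}' > fizzbuzz_inputs.json
python3 -m json.tool fizzbuzz_inputs.json
```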
&lt;br /&gt;
===Body===&lt;br /&gt;
&lt;br /&gt;
Now we'll start on the body of the workflow, to be inserted just after the inputs section. &lt;br /&gt;
&lt;br /&gt;
The first thing we're going to need to do is create an array of all the numbers up to the &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;. We can do this by calling the WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; function, and assigning the result to an &amp;lt;code&amp;gt;Array[Int]&amp;lt;/code&amp;gt; variable.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
&lt;br /&gt;
WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.&lt;br /&gt;
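One thing to keep in mind about range(): it begins at zero and stops one short of its argument. A shell comparison sketch (not part of the workflow):

```shell
# WDL's range(5) evaluates to [0, 1, 2, 3, 4]; the closest shell analog:
seq 0 4
```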
&lt;br /&gt;
===Scattering===&lt;br /&gt;
&lt;br /&gt;
Once we create an array of all the numbers, we can use a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt; to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a &amp;lt;code&amp;gt;map()&amp;lt;/code&amp;gt; in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; starts at 0.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
     Int one_based = i + 1&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Conditionals===&lt;br /&gt;
&lt;br /&gt;
Inside the body of the scatter, we are going to put some conditionals to determine if we should produce &amp;lt;code&amp;gt;&amp;quot;Fizz&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;Buzz&amp;quot;&amp;lt;/code&amp;gt;, or &amp;lt;code&amp;gt;&amp;quot;FizzBuzz&amp;quot;&amp;lt;/code&amp;gt;. To support our &amp;lt;code&amp;gt;fizzbuzz_override&amp;lt;/code&amp;gt;, we use an array of it and a default value, and use the WDL &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; function to find the first non-null value in that array.&lt;br /&gt;
&lt;br /&gt;
Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; at the end and take advantage of variables from un-executed conditionals being &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Note that WDL supports conditional ''expressions'' with a &amp;lt;code&amp;gt;then&amp;lt;/code&amp;gt; and an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt;, but conditional ''statements'' only have a body, not an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt; branch. If you need an else you will have to check the negated condition.&lt;br /&gt;
&lt;br /&gt;
So first, let's handle the special cases.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
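If you want to convince yourself the branching is right before worrying about WDL syntax, the same decision logic can be sketched as a throwaway Bash loop (for comparison only, not part of the workflow):

```shell
# Plain-Bash sketch of the FizzBuzz branching above, using the defaults 3 and 5.
to_fizz=3
to_buzz=5
for i in $(seq 1 15); do
    out=""
    if [ $(( i % to_fizz )) -eq 0 ]; then
        out="Fizz"
    fi
    if [ $(( i % to_buzz )) -eq 0 ]; then
        out="${out}Buzz"
    fi
    if [ -z "$out" ]; then
        # Just a normal number.
        out="$i"
    fi
    echo "$out"
done
```

Line 15 of the output is "FizzBuzz", lines 3, 6, 9, and 12 are "Fizz", and lines 5 and 10 are "Buzz", matching what the workflow should produce.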
&lt;br /&gt;
===Calling Tasks===&lt;br /&gt;
&lt;br /&gt;
Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a &amp;lt;code&amp;gt;${}&amp;lt;/code&amp;gt; substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a &amp;lt;code&amp;gt;stringify_number&amp;lt;/code&amp;gt; task, to be written later.&lt;br /&gt;
&lt;br /&gt;
To call a task (or another workflow), we use a &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt; statement and give it some inputs. Then we can fish the output values out of the task with &amp;lt;code&amp;gt;.&amp;lt;/code&amp;gt; access.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
        call stringify_number {&lt;br /&gt;
            input:&lt;br /&gt;
                the_number = one_based&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We can put the code into the workflow now, and set about writing the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         &lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Writing Tasks===&lt;br /&gt;
&lt;br /&gt;
Our task should go after the workflow in the file. It looks a lot like a workflow except it uses &amp;lt;code&amp;gt;task&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're going to want it to take in an integer &amp;lt;code&amp;gt;the_number&amp;lt;/code&amp;gt;, and we're going to want it to output a string &amp;lt;code&amp;gt;the_string&amp;lt;/code&amp;gt;. So let's fill that in in &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; sections.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     # ???&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now, unlike workflows, tasks can have a &amp;lt;code&amp;gt;command&amp;lt;/code&amp;gt; section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use &amp;lt;code&amp;gt;~{}&amp;lt;/code&amp;gt;, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
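To make the substitution concrete: if the_number is 7, the Bash script the task actually executes is essentially just:

```shell
# What the command section becomes once ~{the_number} is replaced with 7.
set -e
echo 7
```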
&lt;br /&gt;
Now we need to capture the result of the command script. The WDL &amp;lt;code&amp;gt;stdout()&amp;lt;/code&amp;gt; returns a WDL &amp;lt;code&amp;gt;File&amp;lt;/code&amp;gt; containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).&lt;br /&gt;
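The trailing-newline stripping works just like shell command substitution, which may be familiar (a comparison sketch, not workflow code):

```shell
# Like read_string(stdout()) in WDL, $( ) strips trailing newlines.
the_string="$(printf 'Hello, Ritchie Ravi!\n\n')"
echo "[${the_string}]"
```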
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're also going to want to add a &amp;lt;code&amp;gt;runtime&amp;lt;/code&amp;gt; section to our task, to specify resource requirements. We'll also tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;disks&amp;lt;/code&amp;gt; section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a &amp;lt;code&amp;gt;local-disk&amp;lt;/code&amp;gt; of a certain number of gigabytes, which may suggest that it be &amp;lt;code&amp;gt;SSD&amp;lt;/code&amp;gt; storage.&lt;br /&gt;
&lt;br /&gt;
Then we can put our task into our WDL file:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Output Block===&lt;br /&gt;
&lt;br /&gt;
Now the only thing missing is a workflow-level &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect the strings that came out of each iteration of our scatter into an &amp;lt;code&amp;gt;Array[String]&amp;lt;/code&amp;gt;. We'll add the &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section at the end of the &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; section, above the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
     output {&lt;br /&gt;
        Array[String] fizzbuzz_results = result&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Because the &amp;lt;code&amp;gt;result&amp;lt;/code&amp;gt; variable is defined inside a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt;, when we reference it outside the scatter we see it as being an array.&lt;br /&gt;
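&lt;br /&gt;
To make the selection logic above concrete, here is a rough Python sketch of how one FizzBuzz item is chosen (a mental model only: &amp;lt;code&amp;gt;select_first&amp;lt;/code&amp;gt; is re-implemented by hand here, and this is not how Toil actually evaluates WDL):&lt;br /&gt;
&lt;br /&gt;
```python
def select_first(values):
    # Like WDL's select_first(): return the first defined (non-None) value.
    for value in values:
        if value is not None:
            return value
    raise ValueError('select_first: all values were undefined')

def fizzbuzz_item(one_based, to_fizz=3, to_buzz=5, fizzbuzz_override=None):
    # Mirror the workflow body: each conditional either defines its
    # variable or leaves it undefined (modeled as None here).
    fizz = 'Fizz' if one_based % to_fizz == 0 else None
    buzz = 'Buzz' if one_based % to_buzz == 0 else None
    fizzbuzz = None
    if fizz is not None and buzz is not None:
        fizzbuzz = select_first([fizzbuzz_override, 'FizzBuzz'])
    # The real workflow calls the stringify_number task here; str() stands in.
    number = str(one_based) if fizz is None and buzz is None else None
    return select_first([fizzbuzz, fizz, buzz, number])

# The scatter over range(item_count), with the one-based adjustment:
results = [fizzbuzz_item(i + 1) for i in range(15)]
```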
&lt;br /&gt;
==Running the Workflow==&lt;br /&gt;
&lt;br /&gt;
Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;FizzBuzz.item_count&amp;quot;: 20}' &amp;gt;fizzbuzz.json&lt;br /&gt;
&lt;br /&gt;
Then run it on the cluster with Toil:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
Or locally:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
=Debugging Workflows=&lt;br /&gt;
Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.&lt;br /&gt;
&lt;br /&gt;
==Restarting the Workflow==&lt;br /&gt;
&lt;br /&gt;
If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have fixed, and you ran the workflow with &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; set manually to a directory that persists between attempts, you can add &amp;lt;code&amp;gt;--restart&amp;lt;/code&amp;gt; to your workflow command and make Toil try again. It will pick up from where it left off and rerun any failed tasks and then the rest of the workflow.&lt;br /&gt;
&lt;br /&gt;
This ''will not'' pick up any changes to your WDL source code files; those are read once at the beginning and not re-read on restart.&lt;br /&gt;
&lt;br /&gt;
If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques.&lt;br /&gt;
&lt;br /&gt;
==Debugging Options==&lt;br /&gt;
When debugging a workflow, make sure to run the workflow with &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt;, to set the log level to &amp;lt;code&amp;gt;DEBUG&amp;lt;/code&amp;gt;, and with &amp;lt;code&amp;gt;--jobStore /some/path/to/a/shared/directory/it/can/create&amp;lt;/code&amp;gt; so that the stored files shipped between jobs are in a place you can access them.&lt;br /&gt;
&lt;br /&gt;
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:&lt;br /&gt;
&lt;br /&gt;
 =========&amp;gt;&lt;br /&gt;
         Toil job log is here&lt;br /&gt;
 &amp;lt;=========&lt;br /&gt;
&lt;br /&gt;
Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this.&lt;br /&gt;
&lt;br /&gt;
==Reading the Log==&lt;br /&gt;
&lt;br /&gt;
When a WDL workflow fails, you are likely to see a message like this:&lt;br /&gt;
&lt;br /&gt;
 WDL.runtime.error.CommandFailed: task command failed with exit status 1&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism&lt;br /&gt;
&lt;br /&gt;
This means that the command line specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code. This happens either when the command is written incorrectly, or when the error detection code in the tool you are trying to run detects and reports an error.&lt;br /&gt;
&lt;br /&gt;
Go up higher in the log until you find lines that look like:&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows:&lt;br /&gt;
&lt;br /&gt;
And&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows:&lt;br /&gt;
&lt;br /&gt;
These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.&lt;br /&gt;
&lt;br /&gt;
If you would like individual task logs to be saved separately for later reference, you can use the &amp;lt;code&amp;gt;--writeLogs&amp;lt;/code&amp;gt; option to specify a directory to store them. For more information, see [https://toil.readthedocs.io/en/latest/wdl/running.html#managing-workflow-logs the Toil documentation of workflow task logs].&lt;br /&gt;
&lt;br /&gt;
==Reproducing Problems==&lt;br /&gt;
&lt;br /&gt;
When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this.&lt;br /&gt;
&lt;br /&gt;
=== Automatically Fetching Input Files ===&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;toil debug-job&amp;lt;/code&amp;gt; command has a &amp;lt;code&amp;gt;--retrieveTaskDirectory&amp;lt;/code&amp;gt; option that lets you dump out a directory with all the files that a failing WDL task would use. You can use it like:&lt;br /&gt;
&lt;br /&gt;
 toil debug-job ./jobstore WDLTaskJob --retrieveTaskDirectory dumpdir&lt;br /&gt;
&lt;br /&gt;
If there are multiple failing tasks, you might need to replace &amp;lt;code&amp;gt;WDLTaskJob&amp;lt;/code&amp;gt; with the name of one of the failing jobs. See [https://toil.readthedocs.io/en/latest/running/debugging.html#fetching-job-inputs the Toil documentation on retrieving files] for more on how to use this command.&lt;br /&gt;
&lt;br /&gt;
=== Manually Finding Input Files ===&lt;br /&gt;
&lt;br /&gt;
If you can't use &amp;lt;code&amp;gt;toil debug-job&amp;lt;/code&amp;gt;, you might need to manually dig through the job store for files. In the log of your failing Toil task, look for lines like this:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&amp;lt;/code&amp;gt; part is a Toil file ID, and it is a relative path from your &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; value to where the file is stored on disk. So if you ran the workflow with &amp;lt;code&amp;gt;--jobStore /private/groups/patenlab/anovak/jobstore&amp;lt;/code&amp;gt;, you would look for this file at:&lt;br /&gt;
&lt;br /&gt;
 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&lt;br /&gt;
&lt;br /&gt;
==More Ways of Finding Files==&lt;br /&gt;
&lt;br /&gt;
Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the &amp;lt;code&amp;gt;find&amp;lt;/code&amp;gt; command to try and find the files by name. For example, if you want to look at &amp;lt;code&amp;gt;Sample.bam&amp;lt;/code&amp;gt;, you can look for it like this:&lt;br /&gt;
&lt;br /&gt;
 find /path/to/the/jobstore -name &amp;quot;Sample.bam&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
You can take the &amp;lt;code&amp;gt;toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt; URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this:&lt;br /&gt;
&lt;br /&gt;
 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
Then you can take the part after the last colon, &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt;, and that is the path relative to the job store where this file can be found.&lt;br /&gt;
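&lt;br /&gt;
If you do this often, the decoding can be scripted. Here is a small Python sketch using only the standard library (with the example URI from the log line above) that recovers the job-store-relative path:&lt;br /&gt;
&lt;br /&gt;
```python
from urllib.parse import unquote

# A WDL file URI as logged by Toil (the example from above).
uri = ('toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob'
       '%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351'
       '%2FSample.chr14.bam/Sample.chr14.bam')

# Undo the percent-encoding, then take everything after the last colon:
# that is the path relative to the job store directory.
relative_path = unquote(uri).rsplit(':', 1)[-1]
print(relative_path)
# files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam
```
&lt;br /&gt;
You can then join that relative path onto your &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; directory to locate the file on disk.&lt;br /&gt;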
&lt;br /&gt;
==Using Development Versions of Toil==&lt;br /&gt;
&lt;br /&gt;
Sometimes, bugs will be fixed in the development version of Toil, but not released yet. To try the current development version of Toil, you can install it like this:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
If you want to use a particular branch or commit, like &amp;lt;code&amp;gt;aaa451b320fc115b3563ced25cb501301cf86f90&amp;lt;/code&amp;gt;, you can do:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
==Frequently Asked Questions==&lt;br /&gt;
&lt;br /&gt;
===I am getting warnings about &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt;===&lt;br /&gt;
You may be seeing warnings that &amp;lt;code&amp;gt;XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should upgrade Toil. [https://github.com/DataBiosphere/toil/commit/ff6bf60ab798a675c20156c749817c4313644b96 Since Toil 6.1.0], Toil no longer issues this warning, and just puts up with bad &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt; settings.&lt;br /&gt;
&lt;br /&gt;
===Toil said it was &amp;lt;code&amp;gt;Redirecting logging&amp;lt;/code&amp;gt; somewhere, but I can't find that file!===&lt;br /&gt;
The Toil worker process for each job will say that it is &amp;lt;code&amp;gt;Redirecting logging to /data/tmp/somewhere/worker_log.txt&amp;lt;/code&amp;gt;, and when running in single machine mode these messages go to the main Toil log.&lt;br /&gt;
&lt;br /&gt;
The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt; option to Toil.&lt;br /&gt;
&lt;br /&gt;
If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.&lt;br /&gt;
&lt;br /&gt;
===How do I delete files in WDL?===&lt;br /&gt;
WDL doesn't have a built-in way to delete files; if you run a task that deletes a file, it will still exist in Toil's job store storage.&lt;br /&gt;
&lt;br /&gt;
Toil [https://github.com/DataBiosphere/toil/commit/2de6eea2cc2e688b53062a98687445f0cca56669 recently gained support] for deleting files at the '''end''' of WDL workflows. So if you have a large file that you only need for part of your workflow, consider writing that part as a separate sub-&amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; and invoking it with &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt;. Then the file will be cleaned up when the child workflow ends, leaving more space for files created in the parent workflow.&lt;br /&gt;
&lt;br /&gt;
=Additional WDL resources=&lt;br /&gt;
&lt;br /&gt;
For more information on writing and running WDL workflows, see:&lt;br /&gt;
&lt;br /&gt;
* [https://docs.openwdl.org/en/stable/ The WDL documentation]&lt;br /&gt;
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The &amp;quot;Learn WDL&amp;quot; video course on YouTube]&lt;br /&gt;
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Converting_From_Non-MFA_VPN_to_the_MFA-Enabled_VPN_on_Windows&amp;diff=661</id>
		<title>Converting From Non-MFA VPN to the MFA-Enabled VPN on Windows</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Converting_From_Non-MFA_VPN_to_the_MFA-Enabled_VPN_on_Windows&amp;diff=661"/>
		<updated>2025-03-17T14:19:14Z</updated>

		<summary type="html">&lt;p&gt;Anovak: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;If you are using OpenVPN Connect on Windows 10 or 11 to connect to the GI VPN, and you are looking to convert to the new MFA-enabled GI VPN, you have come to the right place.  You must already have Duo set up with your CruzID (which most of you do).  If for some reason you don't have Duo set up yet on your phone, go here to enroll a device and configure Push Notifications with Duo before continuing:&lt;br /&gt;
&lt;br /&gt;
 https://its.ucsc.edu/mfa/enroll.html&lt;br /&gt;
&lt;br /&gt;
OK!  Let's get to it.&lt;br /&gt;
&lt;br /&gt;
Disconnect from the VPN if you are already connected.&lt;br /&gt;
&lt;br /&gt;
Then you will need to download the new OpenVPN config file from here:&lt;br /&gt;
&lt;br /&gt;
 https://giwiki.gi.ucsc.edu/downloads/prism-duo.ovpn&lt;br /&gt;
&lt;br /&gt;
The credentials to access that website are username: '''genecats''' and password: '''KiloKluster'''&lt;br /&gt;
&lt;br /&gt;
Download that file by right-clicking on the link above and selecting &amp;quot;Save Link As...&amp;quot;, and save it to your Desktop or some other area you will remember.  Launch the '''OpenVPN GUI''' app (usually there is an icon for it on your Desktop, but you can search for it if not).  It will launch and appear in your system tray on the bottom right (the system tray icon kind of looks like a '''^''' icon).  You should see the OpenVPN icon there; it looks like a little computer screen with a lock on it.  Right-click on the OpenVPN icon in the system tray, and you should see a small menu appear.  Select &amp;quot;Import file&amp;quot;.  In the resulting window, browse to your Desktop or wherever you saved the '''prism-duo.ovpn''' file.  Select that file and click &amp;quot;Open&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
Once you import the file, you should be able to right click on OpenVPN Connect again in the system tray and select the profile you want to connect to.  It should show multiple profiles, one for your old profile and one for your new profile.  Select the new one, then select &amp;quot;Connect&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
That's it!  It will ask you for the GI PRISM username and password you usually use to connect to our VPN, and after that it will send a Duo Push notification to your phone, and then you should be logged in.  Other than the Duo Push, the VPN behaves exactly like it did before.  If you need to use an authentication method other than Duo Push, you can append a comma, and then the name of the method (like &amp;quot;push&amp;quot;, &amp;quot;sms&amp;quot;, or &amp;quot;phone&amp;quot;), or a second factor code, to your password when you submit it.&lt;br /&gt;
&lt;br /&gt;
If you have issues you can always revert back to the old configuration, which will still work for a while.  We will disable the old VPN soon though, so make every effort to get the new VPN setup working.&lt;br /&gt;
&lt;br /&gt;
As always, please email '''cluster-admin@soe.ucsc.edu''' if you need help or have any questions.&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Converting_From_Non-MFA_VPN_to_the_MFA-Enabled_VPN_on_MacOS&amp;diff=660</id>
		<title>Converting From Non-MFA VPN to the MFA-Enabled VPN on MacOS</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Converting_From_Non-MFA_VPN_to_the_MFA-Enabled_VPN_on_MacOS&amp;diff=660"/>
		<updated>2025-03-17T14:18:57Z</updated>

		<summary type="html">&lt;p&gt;Anovak: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;If you are using Tunnelblick on MacOS and you are looking to convert to the new MFA-enabled GI VPN, you have come to the right place.  You must already have Duo set up with your CruzID (which most of you do).  If for some reason you don't have Duo set up yet on your phone, go here to enroll a device and configure Push Notifications with Duo before continuing:&lt;br /&gt;
&lt;br /&gt;
 https://its.ucsc.edu/mfa/enroll.html&lt;br /&gt;
&lt;br /&gt;
OK!  Let's get to it.&lt;br /&gt;
&lt;br /&gt;
Disconnect from the VPN if you are already connected.&lt;br /&gt;
&lt;br /&gt;
Then you will need to download the new OpenVPN config file from here:&lt;br /&gt;
&lt;br /&gt;
 https://giwiki.gi.ucsc.edu/downloads/&lt;br /&gt;
&lt;br /&gt;
The credentials to access that website are username: '''genecats''' and password: '''KiloKluster'''&lt;br /&gt;
&lt;br /&gt;
Go to the link above, right-click on '''prism-duo.ovpn''', select &amp;quot;Save Link As...&amp;quot;, and save it to your Desktop or some other area you will remember.  Then open Tunnelblick and click on the Tunnelblick icon on the top right of your screen next to the date.  It kind of looks like a small tunnel.  In the window that opens, select &amp;quot;VPN Details...&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
In the resulting window, select the &amp;quot;Configurations&amp;quot; tab on the top.  You will see a list of Configurations on the left, and it should include the current configuration you use to connect.  It may be called 'prism' or maybe 'client'.&lt;br /&gt;
&lt;br /&gt;
Drag the new configuration called '''prism-duo.ovpn''' (from your Desktop) into the Configurations area beneath your old configuration.  It should import the configuration.  It will ask you if you want to install it for &amp;quot;Only You&amp;quot; or &amp;quot;All Users&amp;quot;.  Click &amp;quot;Only You&amp;quot;.  You will also be asked to type in your laptop password.&lt;br /&gt;
&lt;br /&gt;
That's it!  Select the new configuration on the left and click the &amp;quot;Connect&amp;quot; button on the bottom right.  It will ask you for the GI PRISM username and password you usually use to connect to our VPN, and after that it will send a Duo Push notification to your phone, and then you should be logged in.  Other than the Duo Push, the VPN behaves exactly like it did before.  If you need to use an authentication method other than Duo Push, you can append a comma, and then the name of the method (like &amp;quot;push&amp;quot;, &amp;quot;sms&amp;quot;, or &amp;quot;phone&amp;quot;), or a second factor code, to your password when you submit it.&lt;br /&gt;
&lt;br /&gt;
If you have issues you can always revert back to the old configuration, which will still work for a while.  We will disable the old VPN soon though, so make every effort to get the new VPN setup working.&lt;br /&gt;
&lt;br /&gt;
Once you have the new VPN working, feel free to delete the old profile from Tunnelblick by clicking on the old profile in the &amp;quot;Configurations&amp;quot; window, then click on the '''&amp;quot;-&amp;quot;''' button below to remove the old configuration.&lt;br /&gt;
&lt;br /&gt;
As always, please email '''cluster-admin@soe.ucsc.edu''' if you need help or have any questions.&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Converting_From_Non-MFA_VPN_to_the_MFA-Enabled_VPN_on_Linux&amp;diff=659</id>
		<title>Converting From Non-MFA VPN to the MFA-Enabled VPN on Linux</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Converting_From_Non-MFA_VPN_to_the_MFA-Enabled_VPN_on_Linux&amp;diff=659"/>
		<updated>2025-03-17T14:18:44Z</updated>

		<summary type="html">&lt;p&gt;Anovak: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;If you are using OpenVPN on Linux to connect to the GI VPN and you are looking to convert to the new MFA-enabled GI VPN, you have come to the right place.  You must already have Duo set up with your CruzID (which most of you do).  If for some reason you don't have Duo set up yet on your phone, go here to enroll a device and configure Push Notifications with Duo before continuing:&lt;br /&gt;
&lt;br /&gt;
 https://its.ucsc.edu/mfa/enroll.html&lt;br /&gt;
&lt;br /&gt;
OK!  Let's get to it.&lt;br /&gt;
&lt;br /&gt;
Disconnect from the VPN if you are already connected.  The various flavors and versions of Linux differ in the specifics, so you may need to adapt these instructions to your system.  This guide is based on the Network Manager in Ubuntu, but most Ubuntu/Debian variants will be similar.&lt;br /&gt;
&lt;br /&gt;
Then you will need to download the new OpenVPN config file from here:&lt;br /&gt;
&lt;br /&gt;
 https://giwiki.gi.ucsc.edu/downloads/prism-duo.ovpn&lt;br /&gt;
&lt;br /&gt;
The credentials to access that website are username: '''genecats''' and password: '''KiloKluster'''&lt;br /&gt;
&lt;br /&gt;
Download that file by right-clicking on the link above and selecting &amp;quot;Save Link As...&amp;quot;, and save it to your Desktop or some other easy-to-remember location.&lt;br /&gt;
&lt;br /&gt;
We will be installing the Prism VPN profile via the Network Manager GUI interface.&lt;br /&gt;
&lt;br /&gt;
Open '''Network Manager''' from the '''Gnome Settings''' menu, select the '''Network''' tab, and click on the '''VPN +''' symbol:&lt;br /&gt;
&lt;br /&gt;
[[File:Configuring_1.png|600px]]&lt;br /&gt;
&lt;br /&gt;
From the '''Add VPN''' window, click on the '''Import from file...''' option:&lt;br /&gt;
&lt;br /&gt;
[[File:Configuring_2.png|600px]]&lt;br /&gt;
&lt;br /&gt;
Navigate to your .ovpn file (/path/to/your/prism-duo.ovpn) and click the '''Open''' button:&lt;br /&gt;
&lt;br /&gt;
[[File:Configuring_3.png|600px]]&lt;br /&gt;
&lt;br /&gt;
Click on the '''Add''' button:&lt;br /&gt;
&lt;br /&gt;
[[File:Configuring_4.png|600px]]&lt;br /&gt;
&lt;br /&gt;
Finally, click the '''On/Off''' button to turn on the new VPN:&lt;br /&gt;
&lt;br /&gt;
[[File:Configuring_5.png|600px]]&lt;br /&gt;
  &lt;br /&gt;
That's it!  It will ask you for the GI PRISM username and password you usually use to connect to our VPN, and after that it will send a Duo Push notification to your phone, and then you should be logged in.  Other than the Duo Push, the VPN behaves exactly like it did before.  If you need to use an authentication method other than Duo Push, you can append a comma, and then the name of the method (like &amp;quot;push&amp;quot;, &amp;quot;sms&amp;quot;, or &amp;quot;phone&amp;quot;), or a second factor code, to your password when you submit it.&lt;br /&gt;
&lt;br /&gt;
If you have issues you can always revert back to the old configuration, which will still work for a while.  We will disable the old VPN soon though, so make every effort to get the new VPN setup working.&lt;br /&gt;
&lt;br /&gt;
Once you have the new VPN working, feel free to delete the old profile from the Network Manager.&lt;br /&gt;
&lt;br /&gt;
As always, please email '''cluster-admin@soe.ucsc.edu''' if you need help or have any questions.&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Converting_From_Non-MFA_VPN_to_the_MFA-Enabled_VPN_on_MacOS&amp;diff=658</id>
		<title>Converting From Non-MFA VPN to the MFA-Enabled VPN on MacOS</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Converting_From_Non-MFA_VPN_to_the_MFA-Enabled_VPN_on_MacOS&amp;diff=658"/>
		<updated>2025-03-17T14:18:16Z</updated>

		<summary type="html">&lt;p&gt;Anovak: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;If you are using Tunnelblick on MacOS and you are looking to convert to the new MFA-enabled GI VPN, you have come to the right place.  You must already have Duo set up with your CruzID (which most of you do).  If for some reason you don't have Duo set up yet on your phone, go here to enroll a device and configure Push Notifications with Duo before continuing:&lt;br /&gt;
&lt;br /&gt;
 https://its.ucsc.edu/mfa/enroll.html&lt;br /&gt;
&lt;br /&gt;
OK!  Let's get to it.&lt;br /&gt;
&lt;br /&gt;
Disconnect from the VPN if you are already connected.&lt;br /&gt;
&lt;br /&gt;
Then you will need to download the new OpenVPN config file from here:&lt;br /&gt;
&lt;br /&gt;
 https://giwiki.gi.ucsc.edu/downloads/&lt;br /&gt;
&lt;br /&gt;
The credentials to access that website are username: '''genecats''' and password: '''KiloKluster'''&lt;br /&gt;
&lt;br /&gt;
Go to the link above, right-click on '''prism-duo.ovpn''', select &amp;quot;Save Link As...&amp;quot;, and save it to your Desktop or some other area you will remember.  Then open Tunnelblick and click on the Tunnelblick icon on the top right of your screen next to the date.  It kind of looks like a small tunnel.  In the window that opens, select &amp;quot;VPN Details...&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
In the resulting window, select the &amp;quot;Configurations&amp;quot; tab on the top.  You will see a list of Configurations on the left, and it should include the current configuration you use to connect.  It may be called 'prism' or maybe 'client'.&lt;br /&gt;
&lt;br /&gt;
Drag the new configuration called '''prism-duo.ovpn''' (from your Desktop) into the Configurations area beneath your old configuration.  It should import the configuration.  It will ask you if you want to install it for &amp;quot;Only You&amp;quot; or &amp;quot;All Users&amp;quot;.  Click &amp;quot;Only You&amp;quot;.  You will also be asked to type in your laptop password.&lt;br /&gt;
&lt;br /&gt;
That's it!  Select the new configuration on the left and click the &amp;quot;Connect&amp;quot; button on the bottom right.  It will ask you for the GI PRISM username and password you usually use to connect to our VPN, and after that it will send a Duo Push notification to your phone, and then you should be logged in.  Other than the Duo Push, the VPN behaves exactly like it did before.  If you need to use an authentication method other than Duo Push, you can append a comma, and then the name of the method (like &amp;quot;push&amp;quot;, &amp;quot;sms&amp;quot;, or &amp;quot;phone&amp;quot;), or a numeric second factor code, to your password when you submit it.&lt;br /&gt;
&lt;br /&gt;
If you have issues you can always revert back to the old configuration, which will still work for a while.  We will disable the old VPN soon though, so make every effort to get the new VPN setup working.&lt;br /&gt;
&lt;br /&gt;
Once you have the new VPN working, feel free to delete the old profile from Tunnelblick by clicking on the old profile in the &amp;quot;Configurations&amp;quot; window, then click on the '''&amp;quot;-&amp;quot;''' button below to remove the old configuration.&lt;br /&gt;
&lt;br /&gt;
As always, please email '''cluster-admin@soe.ucsc.edu''' if you need help or have any questions.&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Converting_From_Non-MFA_VPN_to_the_MFA-Enabled_VPN_on_Windows&amp;diff=657</id>
		<title>Converting From Non-MFA VPN to the MFA-Enabled VPN on Windows</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Converting_From_Non-MFA_VPN_to_the_MFA-Enabled_VPN_on_Windows&amp;diff=657"/>
		<updated>2025-03-17T14:17:49Z</updated>

		<summary type="html">&lt;p&gt;Anovak: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;If you are using OpenVPN Connect on Windows 10 or 11 to connect to the GI VPN, and you are looking to convert to the new MFA-enabled GI VPN, you have come to the right place.  You must already have Duo set up with your CruzID (which most of you do).  If for some reason you don't have Duo set up yet on your phone, go here to enroll a device and configure Push Notifications with Duo before continuing:&lt;br /&gt;
&lt;br /&gt;
 https://its.ucsc.edu/mfa/enroll.html&lt;br /&gt;
&lt;br /&gt;
OK!  Let's get to it.&lt;br /&gt;
&lt;br /&gt;
Disconnect from the VPN if you are already connected.&lt;br /&gt;
&lt;br /&gt;
Then you will need to download the new OpenVPN config file from here:&lt;br /&gt;
&lt;br /&gt;
 https://giwiki.gi.ucsc.edu/downloads/prism-duo.ovpn&lt;br /&gt;
&lt;br /&gt;
The credentials to access that website are username: '''genecats''' and password: '''KiloKluster'''&lt;br /&gt;
&lt;br /&gt;
Download that file by right-clicking on the link above and selecting &amp;quot;Save Link As...&amp;quot;, and save it to your Desktop or some other area you will remember.  Launch the '''OpenVPN GUI''' app (usually there is an icon for it on your Desktop, but you can search for it if not).  It will launch and appear in your system tray on the bottom right (the system tray icon kind of looks like a '''^''' icon).  You should see the OpenVPN icon there; it looks like a little computer screen with a lock on it.  Right-click on the OpenVPN icon in the system tray, and you should see a small menu appear.  Select &amp;quot;Import file&amp;quot;.  In the resulting window, browse to your Desktop or wherever you saved the '''prism-duo.ovpn''' file.  Select that file and click &amp;quot;Open&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
Once you import the file, you should be able to right click on OpenVPN Connect again in the system tray and select the profile you want to connect to.  It should show multiple profiles, one for your old profile and one for your new profile.  Select the new one, then select &amp;quot;Connect&amp;quot;.&lt;br /&gt;
&lt;br /&gt;
That's it!  It will ask you for the GI PRISM username and password you usually use to connect to our VPN, and after that it will send a Duo Push notification to your phone; then you should be logged in.  Other than the Duo Push, the VPN behaves exactly like it did before.  If you need to use an authentication method other than Duo Push, you can append to your password a comma and then either the name of the method (like &amp;quot;push&amp;quot;, &amp;quot;sms&amp;quot;, or &amp;quot;phone&amp;quot;) or a numeric second-factor code.&lt;br /&gt;
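&lt;br /&gt;
For example, if your password were '''hunter2''' (a made-up example), you could enter it in the password field as:&lt;br /&gt;
&lt;br /&gt;
 hunter2,sms&lt;br /&gt;
&lt;br /&gt;
to get a passcode by text message, or as '''hunter2,123456''' to use the numeric code 123456 directly.&lt;br /&gt;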
&lt;br /&gt;
If you have issues you can always revert to the old configuration, which will still work for a while.  We will disable the old VPN soon, though, so make every effort to get the new VPN working.&lt;br /&gt;
&lt;br /&gt;
As always, please email '''cluster-admin@soe.ucsc.edu''' if you need help or have any questions.&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Slurm_Tips_for_Toil&amp;diff=646</id>
		<title>Slurm Tips for Toil</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Slurm_Tips_for_Toil&amp;diff=646"/>
		<updated>2025-02-14T19:12:39Z</updated>

		<summary type="html">&lt;p&gt;Anovak: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Here are some tips for running Toil workflows on the Phoenix Slurm cluster. Mostly you might want to run WDL workflows, but you can use some of these for other workflows like Cactus. You can also consult [https://github.com/DataBiosphere/toil/blob/master/docs/wdl/running.rst the Toil documentation on WDL workflows].&lt;br /&gt;
&lt;br /&gt;
* Install Toil with WDL support with:&lt;br /&gt;
&lt;br /&gt;
 pipx install 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
To use a development version of Toil, you can install from source instead:&lt;br /&gt;
&lt;br /&gt;
 pipx install 'toil[wdl]@git+https://github.com/DataBiosphere/toil.git'&lt;br /&gt;
&lt;br /&gt;
Or for a particular branch:&lt;br /&gt;
&lt;br /&gt;
 pipx install 'toil[wdl]@git+https://github.com/DataBiosphere/toil.git@issues/123-abc'&lt;br /&gt;
&lt;br /&gt;
If you don't have &amp;lt;code&amp;gt;pipx&amp;lt;/code&amp;gt;, you would first need to:&lt;br /&gt;
&lt;br /&gt;
 python3 -m pip install --user pipx&lt;br /&gt;
 python3 -m pipx ensurepath&lt;br /&gt;
&lt;br /&gt;
This may in turn require you to log out and log back in.&lt;br /&gt;
&lt;br /&gt;
* For Toil options, you will want '''--batchSystem slurm''' to make it use Slurm and '''--batchLogsDir ./logs''' (or some other location on a shared filesystem) for the Slurm logs to not get lost.&lt;br /&gt;
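&lt;br /&gt;
For example (with a hypothetical &amp;lt;code&amp;gt;workflow.wdl&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;inputs.json&amp;lt;/code&amp;gt;), a run might look like:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 toil-wdl-runner workflow.wdl inputs.json --batchSystem slurm --batchLogsDir ./logs&lt;br /&gt;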
&lt;br /&gt;
* You may be able to speed up your workflow with '''--caching true''', to cache data on nodes to be shared among multiple simultaneous tasks.&lt;br /&gt;
&lt;br /&gt;
* If using '''toil-wdl-runner''', you might want to add '''--jobStore ./jobStore''' to make sure the job store is in a defined, shared location so that you can use '''--restart''' later.&lt;br /&gt;
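&lt;br /&gt;
For example (again with a hypothetical workflow), a failed run can then be resumed from where it left off:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --jobStore ./jobStore --batchSystem slurm workflow.wdl inputs.json&lt;br /&gt;
 # If that fails partway, fix the problem and resume:&lt;br /&gt;
 toil-wdl-runner --jobStore ./jobStore --batchSystem slurm workflow.wdl inputs.json --restart&lt;br /&gt;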
&lt;br /&gt;
* If using '''toil-wdl-runner''', you will want to set the '''SINGULARITY_CACHEDIR''' and '''MINIWDL__SINGULARITY__IMAGE_CACHE''' environment variables for your workflow to locations on shared storage, possibly the default cache locations in your home directory. Otherwise Toil will set them to node-local directories, and thus re-download images for each workflow run and on each cluster node. To avoid this, before your run or in your '''~/.bashrc''', you could:&lt;br /&gt;
&lt;br /&gt;
 export SINGULARITY_CACHEDIR=$HOME/.singularity/cache&lt;br /&gt;
 export MINIWDL__SINGULARITY__IMAGE_CACHE=$HOME/.cache/miniwdl&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=645</id>
		<title>Phoenix WDL Tutorial</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=645"/>
		<updated>2025-02-14T19:09:08Z</updated>

		<summary type="html">&lt;p&gt;Anovak: /* Installing Toil with WDL support */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Tutorial: Getting Started with WDL Workflows on Phoenix'''&lt;br /&gt;
&lt;br /&gt;
Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you.&lt;br /&gt;
&lt;br /&gt;
Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.&lt;br /&gt;
&lt;br /&gt;
This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.&lt;br /&gt;
&lt;br /&gt;
=Phoenix Cluster Setup=&lt;br /&gt;
&lt;br /&gt;
Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.&lt;br /&gt;
&lt;br /&gt;
==Getting VPN access==&lt;br /&gt;
&lt;br /&gt;
We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.&lt;br /&gt;
&lt;br /&gt;
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.&lt;br /&gt;
&lt;br /&gt;
==Connecting to Phoenix==&lt;br /&gt;
&lt;br /&gt;
Once you have VPN access, you can connect to any of the machines with access to the Phoenix cluster. These interactive nodes are fairly large machines that can do some work locally, but you will still want to run larger workflows on the actual cluster. For this tutorial, we will use &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt; as our login node.&lt;br /&gt;
&lt;br /&gt;
To connect to the cluster:&lt;br /&gt;
&lt;br /&gt;
1. Connect to the VPN.&lt;br /&gt;
2. SSH to &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt;. At the command line, run:&lt;br /&gt;
&lt;br /&gt;
 ssh emerald.prism&lt;br /&gt;
&lt;br /&gt;
If your username on the cluster (say, &amp;lt;code&amp;gt;flastname&amp;lt;/code&amp;gt;) is different than your username on your computer (which might be &amp;lt;code&amp;gt;firstname&amp;lt;/code&amp;gt;), you might instead have to run:&lt;br /&gt;
&lt;br /&gt;
 ssh flastname@emerald.prism&lt;br /&gt;
&lt;br /&gt;
The first time you connect, you will see a message like:&lt;br /&gt;
&lt;br /&gt;
 The authenticity of host 'emerald.prism (10.50.1.67)' can't be established.&lt;br /&gt;
 ED25519 key fingerprint is SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI.&lt;br /&gt;
 This key is not known by any other names.&lt;br /&gt;
 Are you sure you want to continue connecting (yes/no/[fingerprint])?&lt;br /&gt;
&lt;br /&gt;
This is your computer asking you to help it decide if it is talking to the genuine &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt;, and not an imposter. You will want to make sure that the &amp;quot;key fingerprint&amp;quot; is indeed &amp;lt;code&amp;gt;SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI&amp;lt;/code&amp;gt;. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type &amp;lt;code&amp;gt;yes&amp;lt;/code&amp;gt; to accept and remember that the server is who it says it is.&lt;br /&gt;
&lt;br /&gt;
==Installing Toil with WDL support==&lt;br /&gt;
&lt;br /&gt;
Once you are on the head node, you can install Toil, a program for running workflows.&lt;br /&gt;
&lt;br /&gt;
Toil is written in Python, and the modern way to install Python command line tools is with pipx. So [https://pipx.pypa.io/latest/installation/ install pipx]:&lt;br /&gt;
&lt;br /&gt;
 python3 -m pip install --user pipx&lt;br /&gt;
 python3 -m pipx ensurepath&lt;br /&gt;
&lt;br /&gt;
This may instruct you to '''log out and log back in''' or take some other action to adopt the new &amp;lt;code&amp;gt;PATH&amp;lt;/code&amp;gt; settings.&lt;br /&gt;
&lt;br /&gt;
When installing Toil, you need to specify that you want WDL support. To do this, you can run:&lt;br /&gt;
&lt;br /&gt;
 pipx install 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
If you also want to use AWS S3 &amp;lt;code&amp;gt;s3://&amp;lt;/code&amp;gt; and/or Google &amp;lt;code&amp;gt;gs://&amp;lt;/code&amp;gt; URLs for data, you will need to also install Toil with the &amp;lt;code&amp;gt;aws&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;google&amp;lt;/code&amp;gt; extras, respectively:&lt;br /&gt;
&lt;br /&gt;
 pipx install 'toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
To change what extras are used when you have an existing Toil installation, you will need to use the &amp;lt;code&amp;gt;--force&amp;lt;/code&amp;gt; option.&lt;br /&gt;
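&lt;br /&gt;
For example, to add the &amp;lt;code&amp;gt;aws&amp;lt;/code&amp;gt; extra to an existing installation:&lt;br /&gt;
&lt;br /&gt;
 pipx install --force 'toil[wdl,aws]'&lt;br /&gt;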
&lt;br /&gt;
This will install Toil in the &amp;lt;code&amp;gt;.local&amp;lt;/code&amp;gt; directory inside your home directory, which we write as &amp;lt;code&amp;gt;~/.local&amp;lt;/code&amp;gt;. The program to run WDL workflows, &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, will be at &amp;lt;code&amp;gt;~/.local/bin/toil-wdl-runner&amp;lt;/code&amp;gt;. The &amp;lt;code&amp;gt;python3 -m pipx ensurepath&amp;lt;/code&amp;gt; command should have added the &amp;lt;code&amp;gt;~/.local/bin&amp;lt;/code&amp;gt; directory to your &amp;lt;code&amp;gt;PATH&amp;lt;/code&amp;gt; environment variable, to ensure you can find these commands.&lt;br /&gt;
&lt;br /&gt;
If you see something from &amp;lt;code&amp;gt;pipx&amp;lt;/code&amp;gt; like:&lt;br /&gt;
&lt;br /&gt;
     - cwltoil (symlink missing or pointing to unexpected location)&lt;br /&gt;
&lt;br /&gt;
Then &amp;lt;code&amp;gt;pipx uninstall toil&amp;lt;/code&amp;gt;, remove the offending file from &amp;lt;code&amp;gt;~/.local/bin&amp;lt;/code&amp;gt;, and try again.&lt;br /&gt;
&lt;br /&gt;
To make sure it worked, you can run:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --help&lt;br /&gt;
&lt;br /&gt;
If everything worked correctly, it will print a long list of the various option flags that the &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; command supports.&lt;br /&gt;
&lt;br /&gt;
If you ever want to upgrade Toil to a new release, you can run &amp;lt;code&amp;gt;pipx upgrade toil&amp;lt;/code&amp;gt;, or repeat the &amp;lt;code&amp;gt;pipx install&amp;lt;/code&amp;gt; command above with &amp;lt;code&amp;gt;--force&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Configuring your Phoenix Environment==&lt;br /&gt;
&lt;br /&gt;
'''Do not try and store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. Usually you would end up with &amp;lt;code&amp;gt;/private/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Remember this path; we will need it later.&lt;br /&gt;
&lt;br /&gt;
==Configuring Toil for Phoenix==&lt;br /&gt;
&lt;br /&gt;
Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used to run workflow steps. Since these images can be large, and the home directory quota is only 30 GB, you might not be able to keep them in your home directory.&lt;br /&gt;
&lt;br /&gt;
We would like to be able to store these on the cluster's large storage array, under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. However, Toil needs to use file locks in these directories to prevent simultaneous Singularity calls from producing internal Singularity errors, and Ceph currently has [https://tracker.ceph.com/issues/65607 a bug where these file locking operations can freeze the Ceph servers].&lt;br /&gt;
&lt;br /&gt;
If you have '''a small number of container images''' that will fit in your home directory, you can keep them there. [https://github.com/DataBiosphere/toil/commit/cb0b291bb7f6212bfe69221dd9f09d72f83e92fb Since Toil 6.1.0], this is the default behavior and you don't need to do anything. (Unless you previously set &amp;lt;code&amp;gt;SINGULARITY_CACHEDIR&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;MINIWDL__SINGULARITY__IMAGE_CACHE&amp;lt;/code&amp;gt;, in which case you need to unset them.)&lt;br /&gt;
&lt;br /&gt;
'''If you don't have room in your home directory''' for container images, currently the recommended approach is to use node-local storage under &amp;lt;code&amp;gt;/data/tmp&amp;lt;/code&amp;gt;. This results in each node pulling each container image, but images will be saved across workflows.&lt;br /&gt;
&lt;br /&gt;
You can set that up for all your workflows with:&lt;br /&gt;
&lt;br /&gt;
 echo 'export SINGULARITY_CACHEDIR=&amp;quot;/data/tmp/$(whoami)/cache/singularity&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE=&amp;quot;/data/tmp/$(whoami)/cache/miniwdl&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
Then '''log out and log back in again''', to apply the changes.&lt;br /&gt;
&lt;br /&gt;
=Running an existing workflow=&lt;br /&gt;
&lt;br /&gt;
First, let's use &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].&lt;br /&gt;
&lt;br /&gt;
First, go to your user directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;, and make a directory to work in.&lt;br /&gt;
&lt;br /&gt;
 cd /private/groups/YOURGROUPNAME/YOURUSERNAME&lt;br /&gt;
 mkdir workflow-test&lt;br /&gt;
 cd workflow-test&lt;br /&gt;
&lt;br /&gt;
Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.&lt;br /&gt;
&lt;br /&gt;
 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl&lt;br /&gt;
&lt;br /&gt;
==Preparing an input file==&lt;br /&gt;
&lt;br /&gt;
Near the top of the WDL file, there's a section like this:&lt;br /&gt;
&lt;br /&gt;
 workflow hello_caller {&lt;br /&gt;
     input {&lt;br /&gt;
         File who&lt;br /&gt;
     }&lt;br /&gt;
&lt;br /&gt;
This means that there is a workflow named &amp;lt;code&amp;gt;hello_caller&amp;lt;/code&amp;gt; in this file, and it takes as input a file variable named &amp;lt;code&amp;gt;who&amp;lt;/code&amp;gt;. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.&lt;br /&gt;
&lt;br /&gt;
So first, we have to make that list of names. Let's make it in &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 echo &amp;quot;Mridula Resurrección&amp;quot; &amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Gershom Šarlota&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Ritchie Ravi&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
&lt;br /&gt;
Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt; that references it by relative path, like this:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./names.txt&amp;quot;}' &amp;gt;inputs.json&lt;br /&gt;
&lt;br /&gt;
Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].&lt;br /&gt;
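&lt;br /&gt;
For example, an absolute path, or a URL (a hypothetical one here), would also work as the value:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;hello_caller.who&amp;quot;: &amp;quot;/private/groups/YOURGROUPNAME/YOURUSERNAME/workflow-test/names.txt&amp;quot;}&lt;br /&gt;
 {&amp;quot;hello_caller.who&amp;quot;: &amp;quot;https://example.com/names.txt&amp;quot;}&lt;br /&gt;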
&lt;br /&gt;
==Testing at small scale on a single machine==&lt;br /&gt;
&lt;br /&gt;
We are now ready to run the workflow!&lt;br /&gt;
&lt;br /&gt;
You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:&lt;br /&gt;
&lt;br /&gt;
 srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i&lt;br /&gt;
&lt;br /&gt;
This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In your new shell, run this Toil command:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner self_test.wdl inputs.json -o local_run&lt;br /&gt;
&lt;br /&gt;
This will, by default, use the &amp;lt;code&amp;gt;single_machine&amp;lt;/code&amp;gt; Toil &amp;quot;batch system&amp;quot; to run all of the workflow's tasks locally. Output will be sent to a new directory named &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
This will print a lot of logging to standard error, and to standard output it will print:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;hello_caller.message_files&amp;quot;: [&amp;quot;local_run/Mridula Resurrecci\u00f3n.txt&amp;quot;, &amp;quot;local_run/Gershom \u0160arlota.txt&amp;quot;, &amp;quot;local_run/Ritchie Ravi.txt&amp;quot;], &amp;quot;hello_caller.messages&amp;quot;: [&amp;quot;Hello, Mridula Resurrecci\u00f3n!&amp;quot;, &amp;quot;Hello, Gershom \u0160arlota!&amp;quot;, &amp;quot;Hello, Ritchie Ravi!&amp;quot;]}&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt; directory will contain the described text files (with Unicode escape sequences like &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt; replaced by their corresponding characters), each containing a greeting for the corresponding person.&lt;br /&gt;
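&lt;br /&gt;
You can spot-check one of the outputs:&lt;br /&gt;
&lt;br /&gt;
 cat 'local_run/Ritchie Ravi.txt'&lt;br /&gt;
&lt;br /&gt;
which should print &amp;quot;Hello, Ritchie Ravi!&amp;quot;.&lt;br /&gt;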
&lt;br /&gt;
To leave your interactive Slurm session and return to the head node, use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Running at larger scale==&lt;br /&gt;
&lt;br /&gt;
Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people!&lt;br /&gt;
&lt;br /&gt;
Go get this handy list of people and cut it to length:&lt;br /&gt;
&lt;br /&gt;
 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt&lt;br /&gt;
 head -n100 1000_names.txt &amp;gt;100_names.txt&lt;br /&gt;
&lt;br /&gt;
And make a new inputs file:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./100_names.txt&amp;quot;}' &amp;gt;inputs_big.json&lt;br /&gt;
&lt;br /&gt;
Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.&lt;br /&gt;
&lt;br /&gt;
To run against the Slurm cluster, we need to use the &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the &amp;lt;code&amp;gt;--caching false&amp;lt;/code&amp;gt; option. We will add the &amp;lt;code&amp;gt;--batchLogsDir&amp;lt;/code&amp;gt; option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the &amp;lt;code&amp;gt;-m&amp;lt;/code&amp;gt; option to save the output JSON to a file instead of printing it.&lt;br /&gt;
&lt;br /&gt;
Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the &amp;lt;code&amp;gt;TOIL_SLURM_ARGS&amp;lt;/code&amp;gt; environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in.&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json&lt;br /&gt;
&lt;br /&gt;
This will tick for a while, but eventually you should end up with 100 greeting files in the &amp;lt;code&amp;gt;slurm_run&amp;lt;/code&amp;gt; directory.&lt;br /&gt;
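&lt;br /&gt;
While it ticks, you can watch your individual Slurm jobs from another shell on the head node with:&lt;br /&gt;
&lt;br /&gt;
 squeue -u $(whoami)&lt;br /&gt;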
&lt;br /&gt;
=Writing your own workflow=&lt;br /&gt;
&lt;br /&gt;
In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. &lt;br /&gt;
&lt;br /&gt;
==Writing the file==&lt;br /&gt;
&lt;br /&gt;
===Version===&lt;br /&gt;
&lt;br /&gt;
All WDL files need to start with a &amp;lt;code&amp;gt;version&amp;lt;/code&amp;gt; statement (unless they are very old &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; files). Toil supports &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt;, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; and 1.0.&lt;br /&gt;
&lt;br /&gt;
So let's start a new WDL 1.0 workflow. Open up a file named &amp;lt;code&amp;gt;fizzbuzz.wdl&amp;lt;/code&amp;gt; and start with a version statement:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
&lt;br /&gt;
===Workflow Block===&lt;br /&gt;
&lt;br /&gt;
Then, add an empty &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; named &amp;lt;code&amp;gt;FizzBuzz&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Input Block===&lt;br /&gt;
&lt;br /&gt;
Workflows usually need some kind of user input, so let's give our workflow an &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; section.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Notice that each input has a type, a name, and an optional default value. If the type ends in &amp;lt;code&amp;gt;?&amp;lt;/code&amp;gt;, the value is optional, and it may be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.&lt;br /&gt;
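&lt;br /&gt;
For example, a minimal inputs file for this workflow only needs to set &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;FizzBuzz.item_count&amp;quot;: 15}&lt;br /&gt;
&lt;br /&gt;
The defaults will be used for &amp;lt;code&amp;gt;to_fizz&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;to_buzz&amp;lt;/code&amp;gt;, and &amp;lt;code&amp;gt;fizzbuzz_override&amp;lt;/code&amp;gt; will be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;.&lt;br /&gt;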
&lt;br /&gt;
===Body===&lt;br /&gt;
&lt;br /&gt;
Now we'll start on the body of the workflow, to be inserted just after the inputs section. &lt;br /&gt;
&lt;br /&gt;
The first thing we're going to need to do is create an array of all the numbers up to the &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;. We can do this by calling the WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; function, and assigning the result to an &amp;lt;code&amp;gt;Array[Int]&amp;lt;/code&amp;gt; variable.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
&lt;br /&gt;
WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.&lt;br /&gt;
&lt;br /&gt;
===Scattering===&lt;br /&gt;
&lt;br /&gt;
Once we create an array of all the numbers, we can use a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt; to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a &amp;lt;code&amp;gt;map()&amp;lt;/code&amp;gt; in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; starts at 0.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
     Int one_based = i + 1&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Conditionals===&lt;br /&gt;
&lt;br /&gt;
Inside the body of the scatter, we are going to put some conditionals to determine if we should produce &amp;lt;code&amp;gt;&amp;quot;Fizz&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;Buzz&amp;quot;&amp;lt;/code&amp;gt;, or &amp;lt;code&amp;gt;&amp;quot;FizzBuzz&amp;quot;&amp;lt;/code&amp;gt;. To support our &amp;lt;code&amp;gt;fizzbuzz_override&amp;lt;/code&amp;gt;, we use an array of it and a default value, and use the WDL &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; function to find the first non-null value in that array.&lt;br /&gt;
&lt;br /&gt;
Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; at the end and take advantage of variables from un-executed conditionals being &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Note that WDL supports conditional ''expressions'' with a &amp;lt;code&amp;gt;then&amp;lt;/code&amp;gt; and an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt;, but conditional ''statements'' only have a body, not an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt; branch. If you need an else you will have to check the negated condition.&lt;br /&gt;
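&lt;br /&gt;
For example, a conditional ''expression'' (using a hypothetical variable) looks like this:&lt;br /&gt;
&lt;br /&gt;
 String parity = if (one_based % 2 == 0) then &amp;quot;even&amp;quot; else &amp;quot;odd&amp;quot;&lt;br /&gt;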
&lt;br /&gt;
So first, let's handle the special cases.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Calling Tasks===&lt;br /&gt;
&lt;br /&gt;
Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a &amp;lt;code&amp;gt;${}&amp;lt;/code&amp;gt; substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a &amp;lt;code&amp;gt;stringify_number&amp;lt;/code&amp;gt; task, to be written later.&lt;br /&gt;
&lt;br /&gt;
To call a task (or another workflow), we use a &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt; statement and give it some inputs. Then we can fish the output values out of the task with &amp;lt;code&amp;gt;.&amp;lt;/code&amp;gt; access, but only if the call actually ran: outputs of a call inside an un-executed conditional are &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
        call stringify_number {&lt;br /&gt;
            input:&lt;br /&gt;
                the_number = one_based&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We can put the code into the workflow now, and set about writing the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         &lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Writing Tasks===&lt;br /&gt;
&lt;br /&gt;
Our task should go after the workflow in the file. It looks a lot like a workflow except it uses &amp;lt;code&amp;gt;task&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're going to want it to take in an integer &amp;lt;code&amp;gt;the_number&amp;lt;/code&amp;gt;, and we're going to want it to output a string &amp;lt;code&amp;gt;the_string&amp;lt;/code&amp;gt;. So let's fill that in in &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; sections.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     # ???&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now, unlike workflows, tasks can have a &amp;lt;code&amp;gt;command&amp;lt;/code&amp;gt; section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use &amp;lt;code&amp;gt;~{}&amp;lt;/code&amp;gt;, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now we need to capture the result of the command script. The WDL &amp;lt;code&amp;gt;stdout()&amp;lt;/code&amp;gt; returns a WDL &amp;lt;code&amp;gt;File&amp;lt;/code&amp;gt; containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're also going to want to add a &amp;lt;code&amp;gt;runtime&amp;lt;/code&amp;gt; section to our task, to specify resource requirements. We'll also tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;disks&amp;lt;/code&amp;gt; section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a &amp;lt;code&amp;gt;local-disk&amp;lt;/code&amp;gt; of a certain number of gigabytes, optionally hinting that it should be &amp;lt;code&amp;gt;SSD&amp;lt;/code&amp;gt; storage.&lt;br /&gt;
&lt;br /&gt;
Then we can put our task into our WDL file:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Output Block===&lt;br /&gt;
&lt;br /&gt;
Now the only thing missing is a workflow-level &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section. Technically, in WDL 1.0 you aren't supposed to need one, but WDL 1.1 requires it, and Toil doesn't actually deliver your outputs anywhere without it, so we're going to write one. We need to collect all the strings that came out of the different tasks in our scatter into an &amp;lt;code&amp;gt;Array[String]&amp;lt;/code&amp;gt;. We'll add the &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section at the end of the &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; block, above the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
     output {&lt;br /&gt;
        Array[String] fizzbuzz_results = result&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Because the &amp;lt;code&amp;gt;result&amp;lt;/code&amp;gt; variable is defined inside a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt;, when we reference it outside the scatter we see it as being an array.&lt;br /&gt;
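&lt;br /&gt;
This implicit gathering can be sketched on its own (a minimal illustration with hypothetical variable names, not part of the FizzBuzz workflow):&lt;br /&gt;
&lt;br /&gt;
 scatter (i in [1, 2, 3]) {&lt;br /&gt;
     String as_text = &amp;quot;Number ~{i}&amp;quot;  # inside the scatter: a String&lt;br /&gt;
 }&lt;br /&gt;
 Array[String] gathered = as_text  # outside the scatter: an Array[String]&lt;br /&gt;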
&lt;br /&gt;
==Running the Workflow==&lt;br /&gt;
&lt;br /&gt;
Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;FizzBuzz.item_count&amp;quot;: 20}' &amp;gt;fizzbuzz.json&lt;br /&gt;
&lt;br /&gt;
Then run it on the cluster with Toil:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
Or locally:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
=Debugging Workflows=&lt;br /&gt;
Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.&lt;br /&gt;
&lt;br /&gt;
==Restarting the Workflow==&lt;br /&gt;
&lt;br /&gt;
If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have fixed, and you ran the workflow with &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; set manually to a directory that persists between attempts, you can add &amp;lt;code&amp;gt;--restart&amp;lt;/code&amp;gt; to your workflow command and make Toil try again. It will pick up from where it left off and rerun any failed tasks and then the rest of the workflow.&lt;br /&gt;
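&lt;br /&gt;
For example, reusing the Slurm run command from earlier with &amp;lt;code&amp;gt;--restart&amp;lt;/code&amp;gt; added (the job store path must be the same one used in the failed run):&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs --restart fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;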
&lt;br /&gt;
This ''will not'' pick up any changes to your WDL source code files; those are read once at the beginning and not re-read on restart.&lt;br /&gt;
&lt;br /&gt;
If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques.&lt;br /&gt;
&lt;br /&gt;
==Debugging Options==&lt;br /&gt;
When debugging a workflow, make sure to run the workflow with &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt;, to set the log level to &amp;lt;code&amp;gt;DEBUG&amp;lt;/code&amp;gt;, and with &amp;lt;code&amp;gt;--jobStore /some/path/to/a/shared/directory/it/can/create&amp;lt;/code&amp;gt; so that the stored files shipped between jobs are in a place you can access them.&lt;br /&gt;
&lt;br /&gt;
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:&lt;br /&gt;
&lt;br /&gt;
 =========&amp;gt;&lt;br /&gt;
         Toil job log is here&lt;br /&gt;
 &amp;lt;=========&lt;br /&gt;
&lt;br /&gt;
Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this.&lt;br /&gt;
&lt;br /&gt;
==Reading the Log==&lt;br /&gt;
&lt;br /&gt;
When a WDL workflow fails, you are likely to see a message like this:&lt;br /&gt;
&lt;br /&gt;
 WDL.runtime.error.CommandFailed: task command failed with exit status 1&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism&lt;br /&gt;
&lt;br /&gt;
This means that the command line specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code. That happens either when the command is written wrong, or when the error detection code in the tool you are trying to run detects and reports a problem.&lt;br /&gt;
&lt;br /&gt;
Go up higher in the log until you find lines that look like:&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows:&lt;br /&gt;
&lt;br /&gt;
And&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows:&lt;br /&gt;
&lt;br /&gt;
These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.&lt;br /&gt;
&lt;br /&gt;
If you would like individual task logs to be saved separately for later reference, you can use the &amp;lt;code&amp;gt;--writeLogs&amp;lt;/code&amp;gt; option to specify a directory to store them. For more information, see [https://toil.readthedocs.io/en/latest/wdl/running.html#managing-workflow-logs the Toil documentation of workflow task logs].&lt;br /&gt;
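&lt;br /&gt;
For example, reusing the FizzBuzz invocation from earlier:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p task_logs&lt;br /&gt;
 toil-wdl-runner --writeLogs ./task_logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;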
&lt;br /&gt;
==Reproducing Problems==&lt;br /&gt;
&lt;br /&gt;
When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this.&lt;br /&gt;
&lt;br /&gt;
=== Automatically Fetching Input Files ===&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;toil debug-job&amp;lt;/code&amp;gt; command has a &amp;lt;code&amp;gt;--retrieveTaskDirectory&amp;lt;/code&amp;gt; option that lets you dump out a directory with all the files that a failing WDL task would use. You can use it like:&lt;br /&gt;
&lt;br /&gt;
 toil debug-job ./jobstore WDLTaskJob --retrieveTaskDirectory dumpdir&lt;br /&gt;
&lt;br /&gt;
If there are multiple failing tasks, you might need to replace &amp;lt;code&amp;gt;WDLTaskJob&amp;lt;/code&amp;gt; with the name of one of the failing jobs. See [https://toil.readthedocs.io/en/latest/running/debugging.html#fetching-job-inputs the Toil documentation on retrieving files] for more on how to use this command.&lt;br /&gt;
&lt;br /&gt;
=== Manually Finding Input Files ===&lt;br /&gt;
&lt;br /&gt;
If you can't use &amp;lt;code&amp;gt;toil debug-job&amp;lt;/code&amp;gt;, you might need to manually dig through the job store for files. In the log of your failing Toil task, look for lines like this:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&amp;lt;/code&amp;gt; part is a Toil file ID, and it is a relative path from your &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; value to where the file is stored on disk. So if you ran the workflow with &amp;lt;code&amp;gt;--jobStore /private/groups/patenlab/anovak/jobstore&amp;lt;/code&amp;gt;, you would look for this file at:&lt;br /&gt;
&lt;br /&gt;
 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&lt;br /&gt;
&lt;br /&gt;
==More Ways of Finding Files==&lt;br /&gt;
&lt;br /&gt;
Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the &amp;lt;code&amp;gt;find&amp;lt;/code&amp;gt; command to try and find the files by name. For example, if you want to look at &amp;lt;code&amp;gt;Sample.bam&amp;lt;/code&amp;gt;, you can look for it like this:&lt;br /&gt;
&lt;br /&gt;
 find /path/to/the/jobstore -name &amp;quot;Sample.bam&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
You can take the &amp;lt;code&amp;gt;toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt; URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this:&lt;br /&gt;
&lt;br /&gt;
 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
Then you can take the part after the last colon, &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt;, and that is the path relative to the job store where this file can be found.&lt;br /&gt;
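&lt;br /&gt;
If you would rather not use a web decoder, a short Python one-liner (standard library only) can do the decoding and take the part after the last colon in one step, printing the job-store-relative path:&lt;br /&gt;
&lt;br /&gt;
 python3 -c 'import sys, urllib.parse; print(urllib.parse.unquote(sys.argv[1]).rsplit(&amp;quot;:&amp;quot;, 1)[-1])' 'toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam'&lt;br /&gt;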
&lt;br /&gt;
==Using Development Versions of Toil==&lt;br /&gt;
&lt;br /&gt;
Sometimes, bugs will be fixed in the development version of Toil, but not released yet. To try the current development version of Toil, you can install it like this:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
If you want to use a particular branch or commit, like &amp;lt;code&amp;gt;aaa451b320fc115b3563ced25cb501301cf86f90&amp;lt;/code&amp;gt;, you can do:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
==Frequently Asked Questions==&lt;br /&gt;
&lt;br /&gt;
===I am getting warnings about &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt;===&lt;br /&gt;
You may be seeing warnings that &amp;lt;code&amp;gt;XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should upgrade Toil. [https://github.com/DataBiosphere/toil/commit/ff6bf60ab798a675c20156c749817c4313644b96 Since Toil 6.1.0], Toil no longer issues this warning, and just puts up with bad &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt; settings.&lt;br /&gt;
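&lt;br /&gt;
With a &amp;lt;code&amp;gt;pipx&amp;lt;/code&amp;gt;-based installation, upgrading typically looks like:&lt;br /&gt;
&lt;br /&gt;
 pipx upgrade toil&lt;br /&gt;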
&lt;br /&gt;
===Toil said it was &amp;lt;code&amp;gt;Redirecting logging&amp;lt;/code&amp;gt; somewhere, but I can't find that file!===&lt;br /&gt;
The Toil worker process for each job will say that it is &amp;lt;code&amp;gt;Redirecting logging to /data/tmp/somewhere/worker_log.txt&amp;lt;/code&amp;gt;, and when running in single machine mode these messages go to the main Toil log.&lt;br /&gt;
&lt;br /&gt;
The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt; option to Toil.&lt;br /&gt;
&lt;br /&gt;
If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.&lt;br /&gt;
&lt;br /&gt;
=Additional WDL resources=&lt;br /&gt;
&lt;br /&gt;
For more information on writing and running WDL workflows, see:&lt;br /&gt;
&lt;br /&gt;
* [https://docs.openwdl.org/en/stable/ The WDL documentation]&lt;br /&gt;
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The &amp;quot;Learn WDL&amp;quot; video course on YouTube]&lt;br /&gt;
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=644</id>
		<title>Phoenix WDL Tutorial</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=644"/>
		<updated>2025-02-14T19:03:18Z</updated>

		<summary type="html">&lt;p&gt;Anovak: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Tutorial: Getting Started with WDL Workflows on Phoenix'''&lt;br /&gt;
&lt;br /&gt;
Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like the Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you.&lt;br /&gt;
&lt;br /&gt;
Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.&lt;br /&gt;
&lt;br /&gt;
This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.&lt;br /&gt;
&lt;br /&gt;
=Phoenix Cluster Setup=&lt;br /&gt;
&lt;br /&gt;
Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.&lt;br /&gt;
&lt;br /&gt;
==Getting VPN access==&lt;br /&gt;
&lt;br /&gt;
We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.&lt;br /&gt;
&lt;br /&gt;
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.&lt;br /&gt;
&lt;br /&gt;
==Connecting to Phoenix==&lt;br /&gt;
&lt;br /&gt;
Once you have VPN access, you can connect to any of the machines with access to the Phoenix cluster. These interactive nodes are fairly large machines that can do some work locally, but you will still want to run larger workflows on the actual cluster. For this tutorial, we will use &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt; as our login node.&lt;br /&gt;
&lt;br /&gt;
To connect to the cluster:&lt;br /&gt;
&lt;br /&gt;
1. Connect to the VPN.&lt;br /&gt;
2. SSH to &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt;. At the command line, run:&lt;br /&gt;
&lt;br /&gt;
 ssh emerald.prism&lt;br /&gt;
&lt;br /&gt;
If your username on the cluster (say, &amp;lt;code&amp;gt;flastname&amp;lt;/code&amp;gt;) is different than your username on your computer (which might be &amp;lt;code&amp;gt;firstname&amp;lt;/code&amp;gt;), you might instead have to run:&lt;br /&gt;
&lt;br /&gt;
 ssh flastname@emerald.prism&lt;br /&gt;
&lt;br /&gt;
The first time you connect, you will see a message like:&lt;br /&gt;
&lt;br /&gt;
 The authenticity of host 'emerald.prism (10.50.1.67)' can't be established.&lt;br /&gt;
 ED25519 key fingerprint is SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI.&lt;br /&gt;
 This key is not known by any other names.&lt;br /&gt;
 Are you sure you want to continue connecting (yes/no/[fingerprint])?&lt;br /&gt;
&lt;br /&gt;
This is your computer asking you to help it decide if it is talking to the genuine &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt;, and not an imposter. You will want to make sure that the &amp;quot;key fingerprint&amp;quot; is indeed &amp;lt;code&amp;gt;SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI&amp;lt;/code&amp;gt;. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type &amp;lt;code&amp;gt;yes&amp;lt;/code&amp;gt; to accept and remember that the server is who it says it is.&lt;br /&gt;
&lt;br /&gt;
==Installing Toil with WDL support==&lt;br /&gt;
&lt;br /&gt;
Once you are on the head node, you can install Toil, a program for running workflows.&lt;br /&gt;
&lt;br /&gt;
Toil is written in Python, and the modern way to install Python command line tools is with pipx. So [https://pipx.pypa.io/latest/installation/ install pipx]:&lt;br /&gt;
&lt;br /&gt;
 python3 -m pip install --user pipx&lt;br /&gt;
 python3 -m pipx ensurepath&lt;br /&gt;
&lt;br /&gt;
This may instruct you to log out and log back in or take some other action to adopt the new &amp;lt;code&amp;gt;PATH&amp;lt;/code&amp;gt; settings.&lt;br /&gt;
&lt;br /&gt;
When installing Toil, you need to specify that you want WDL support. To do this, you can run:&lt;br /&gt;
&lt;br /&gt;
 pipx install 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
If you also want to use AWS S3 &amp;lt;code&amp;gt;s3://&amp;lt;/code&amp;gt; and/or Google &amp;lt;code&amp;gt;gs://&amp;lt;/code&amp;gt; URLs for data, you will need to also install Toil with the &amp;lt;code&amp;gt;aws&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;google&amp;lt;/code&amp;gt; extras, respectively:&lt;br /&gt;
&lt;br /&gt;
 pipx install 'toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
To change what extras are used when you have an existing Toil installation, you will need to use the &amp;lt;code&amp;gt;--force&amp;lt;/code&amp;gt; option.&lt;br /&gt;
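&lt;br /&gt;
For example, to rebuild an existing installation with additional extras:&lt;br /&gt;
&lt;br /&gt;
 pipx install --force 'toil[wdl,aws,google]'&lt;br /&gt;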
&lt;br /&gt;
This will install Toil in the &amp;lt;code&amp;gt;.local&amp;lt;/code&amp;gt; directory inside your home directory, which we write as &amp;lt;code&amp;gt;~/.local&amp;lt;/code&amp;gt;. The program to run WDL workflows, &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, will be at &amp;lt;code&amp;gt;~/.local/bin/toil-wdl-runner&amp;lt;/code&amp;gt;. The &amp;lt;code&amp;gt;python3 -m pipx ensurepath&amp;lt;/code&amp;gt; command should have added the &amp;lt;code&amp;gt;~/.local/bin&amp;lt;/code&amp;gt; directory to your &amp;lt;code&amp;gt;PATH&amp;lt;/code&amp;gt; environment variable, to ensure you can find these commands.&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in''', to restart bash and pick up the change.&lt;br /&gt;
&lt;br /&gt;
To make sure it worked, you can run:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --help&lt;br /&gt;
&lt;br /&gt;
If everything worked correctly, it will print a long list of the various option flags that the &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; command supports.&lt;br /&gt;
&lt;br /&gt;
If you ever want to upgrade Toil to a new release, you can repeat the &amp;lt;code&amp;gt;pipx install&amp;lt;/code&amp;gt; command above with &amp;lt;code&amp;gt;--force&amp;lt;/code&amp;gt;, or run &amp;lt;code&amp;gt;pipx upgrade toil&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Configuring your Phoenix Environment==&lt;br /&gt;
&lt;br /&gt;
'''Do not try and store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. Usually you would end up with &amp;lt;code&amp;gt;/private/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Remember this path; we will need it later.&lt;br /&gt;
&lt;br /&gt;
==Configuring Toil for Phoenix==&lt;br /&gt;
&lt;br /&gt;
Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. But since these files can be large, and the home directory quota is only 30 GB, you might not be able to keep them in your home directory.&lt;br /&gt;
&lt;br /&gt;
We would like to be able to store these on the cluster's large storage array, under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. However, Toil needs to use file locks in these directories to prevent simultaneous Singularity calls from producing internal Singularity errors, and Ceph currently has [https://tracker.ceph.com/issues/65607 a bug where these file locking operations can freeze the Ceph servers].&lt;br /&gt;
&lt;br /&gt;
If you have '''a small number of container images''' that will fit in your home directory, you can keep them there. [https://github.com/DataBiosphere/toil/commit/cb0b291bb7f6212bfe69221dd9f09d72f83e92fb Since Toil 6.1.0], this is the default behavior and you don't need to do anything. (Unless you previously set &amp;lt;code&amp;gt;SINGULARITY_CACHEDIR&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;MINIWDL__SINGULARITY__IMAGE_CACHE&amp;lt;/code&amp;gt;, in which case you need to unset them.)&lt;br /&gt;
&lt;br /&gt;
'''If you don't have room in your home directory''' for container images, currently the recommended approach is to use node-local storage under &amp;lt;code&amp;gt;/data/tmp&amp;lt;/code&amp;gt;. This results in each node pulling each container image, but images will be saved across workflows.&lt;br /&gt;
&lt;br /&gt;
You can set that up for all your workflows with:&lt;br /&gt;
&lt;br /&gt;
 echo 'export SINGULARITY_CACHEDIR=&amp;quot;/data/tmp/$(whoami)/cache/singularity&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE=&amp;quot;/data/tmp/$(whoami)/cache/miniwdl&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
Then '''log out and log back in again''', to apply the changes.&lt;br /&gt;
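&lt;br /&gt;
If you re-run those &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; commands, you will accumulate duplicate lines in &amp;lt;code&amp;gt;~/.bashrc&amp;lt;/code&amp;gt;. A guarded variant (just a sketch; it writes to a scratch file here, so point &amp;lt;code&amp;gt;rc&amp;lt;/code&amp;gt; at your real &amp;lt;code&amp;gt;~/.bashrc&amp;lt;/code&amp;gt; to use it) only appends a line if it is not already present:&lt;br /&gt;
&lt;br /&gt;
```shell
# Append a line to a file only if that exact line is not already there,
# so re-running the setup never duplicates entries.
rc=$(mktemp)   # scratch file for illustration; use "$HOME/.bashrc" for real
add_once() {
    grep -qxF "$1" "$2" 2>/dev/null || printf '%s\n' "$1" | tee -a "$2" >/dev/null
}
add_once 'export SINGULARITY_CACHEDIR="/data/tmp/$(whoami)/cache/singularity"' "$rc"
add_once 'export SINGULARITY_CACHEDIR="/data/tmp/$(whoami)/cache/singularity"' "$rc"
# Even after two calls, the line appears exactly once:
grep -c 'SINGULARITY_CACHEDIR' "$rc"
```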
&lt;br /&gt;
=Running an existing workflow=&lt;br /&gt;
&lt;br /&gt;
First, let's use &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].&lt;br /&gt;
&lt;br /&gt;
First, go to your user directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;, and make a directory to work in.&lt;br /&gt;
&lt;br /&gt;
 cd /private/groups/YOURGROUPNAME/YOURUSERNAME&lt;br /&gt;
 mkdir workflow-test&lt;br /&gt;
 cd workflow-test&lt;br /&gt;
&lt;br /&gt;
Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.&lt;br /&gt;
&lt;br /&gt;
 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl&lt;br /&gt;
&lt;br /&gt;
==Preparing an input file==&lt;br /&gt;
&lt;br /&gt;
Near the top of the WDL file, there's a section like this:&lt;br /&gt;
&lt;br /&gt;
 workflow hello_caller {&lt;br /&gt;
     input {&lt;br /&gt;
         File who&lt;br /&gt;
     }&lt;br /&gt;
&lt;br /&gt;
This means that there is a workflow named &amp;lt;code&amp;gt;hello_caller&amp;lt;/code&amp;gt; in this file, and it takes as input a file variable named &amp;lt;code&amp;gt;who&amp;lt;/code&amp;gt;. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.&lt;br /&gt;
&lt;br /&gt;
So first, we have to make that list of names. Let's make it in &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 echo &amp;quot;Mridula Resurrección&amp;quot; &amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Gershom Šarlota&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Ritchie Ravi&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
&lt;br /&gt;
Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt; that references it by relative path, like this:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./names.txt&amp;quot;}' &amp;gt;inputs.json&lt;br /&gt;
&lt;br /&gt;
Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].&lt;br /&gt;
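&lt;br /&gt;
If you are unsure whether an inputs file is well-formed, you can round-trip it through a JSON parser before handing it to &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; (a quick sketch, assuming &amp;lt;code&amp;gt;python3&amp;lt;/code&amp;gt; is available):&lt;br /&gt;
&lt;br /&gt;
```shell
# Recreate the inputs file from above, then validate it:
# python3 -m json.tool parses the JSON, pretty-prints it, and fails
# loudly on any syntax error (such as a missing quote or comma).
printf '{"hello_caller.who": "./names.txt"}\n' | tee inputs.json >/dev/null
python3 -m json.tool inputs.json
```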
&lt;br /&gt;
==Testing at small scale on a single machine==&lt;br /&gt;
&lt;br /&gt;
We are now ready to run the workflow!&lt;br /&gt;
&lt;br /&gt;
You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:&lt;br /&gt;
&lt;br /&gt;
 srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i&lt;br /&gt;
&lt;br /&gt;
This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In your new shell, run this Toil command:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner self_test.wdl inputs.json -o local_run&lt;br /&gt;
&lt;br /&gt;
This will, by default, use the &amp;lt;code&amp;gt;single_machine&amp;lt;/code&amp;gt; Toil &amp;quot;batch system&amp;quot; to run all of the workflow's tasks locally. Output will be sent to a new directory named &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
This will print a lot of logging to standard error, and to standard output it will print:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;hello_caller.message_files&amp;quot;: [&amp;quot;local_run/Mridula Resurrecci\u00f3n.txt&amp;quot;, &amp;quot;local_run/Gershom \u0160arlota.txt&amp;quot;, &amp;quot;local_run/Ritchie Ravi.txt&amp;quot;], &amp;quot;hello_caller.messages&amp;quot;: [&amp;quot;Hello, Mridula Resurrecci\u00f3n!&amp;quot;, &amp;quot;Hello, Gershom \u0160arlota!&amp;quot;, &amp;quot;Hello, Ritchie Ravi!&amp;quot;]}&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt; directory will contain the described text files (with Unicode escape sequences like &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt; replaced by their corresponding characters), each containing a greeting for the corresponding person.&lt;br /&gt;
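&lt;br /&gt;
Those &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt;-style sequences are just JSON string escaping, and any JSON parser will decode them. For example (the path is the example value from above, not a file this command opens, and &amp;lt;code&amp;gt;PYTHONIOENCODING&amp;lt;/code&amp;gt; just forces UTF-8 output):&lt;br /&gt;
&lt;br /&gt;
```shell
# Decode a JSON string containing \u escapes back into real characters.
PYTHONIOENCODING=utf-8 python3 -c 'import json; print(json.loads("\"local_run/Mridula Resurrecci\\u00f3n.txt\""))'
# prints: local_run/Mridula Resurrección.txt
```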
&lt;br /&gt;
To leave your interactive Slurm session and return to the head node, use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Running at larger scale==&lt;br /&gt;
&lt;br /&gt;
Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people!&lt;br /&gt;
&lt;br /&gt;
Go get this handy list of people and cut it to length:&lt;br /&gt;
&lt;br /&gt;
 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt&lt;br /&gt;
 head -n100 1000_names.txt &amp;gt;100_names.txt&lt;br /&gt;
&lt;br /&gt;
And make a new inputs file:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./100_names.txt&amp;quot;}' &amp;gt;inputs_big.json&lt;br /&gt;
&lt;br /&gt;
Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.&lt;br /&gt;
&lt;br /&gt;
To run against the Slurm cluster, we need to use the &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the &amp;lt;code&amp;gt;--caching false&amp;lt;/code&amp;gt; option. We will add the &amp;lt;code&amp;gt;--batchLogsDir&amp;lt;/code&amp;gt; option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the &amp;lt;code&amp;gt;-m&amp;lt;/code&amp;gt; option to save the output JSON to a file instead of printing it.&lt;br /&gt;
&lt;br /&gt;
Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the &amp;lt;code&amp;gt;TOIL_SLURM_ARGS&amp;lt;/code&amp;gt; environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in.&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json&lt;br /&gt;
&lt;br /&gt;
This will tick for a while, but eventually you should end up with 100 greeting files in the &amp;lt;code&amp;gt;slurm_run&amp;lt;/code&amp;gt; directory.&lt;br /&gt;
&lt;br /&gt;
=Writing your own workflow=&lt;br /&gt;
&lt;br /&gt;
In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. &lt;br /&gt;
&lt;br /&gt;
==Writing the file==&lt;br /&gt;
&lt;br /&gt;
===Version===&lt;br /&gt;
&lt;br /&gt;
All WDL files need to start with a &amp;lt;code&amp;gt;version&amp;lt;/code&amp;gt; statement (unless they are very old &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; files). Toil supports &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt;, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; and 1.0.&lt;br /&gt;
&lt;br /&gt;
So let's start a new WDL 1.0 workflow. Open up a file named &amp;lt;code&amp;gt;fizzbuzz.wdl&amp;lt;/code&amp;gt; and start with a version statement:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
&lt;br /&gt;
===Workflow Block===&lt;br /&gt;
&lt;br /&gt;
Then, add an empty &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; named &amp;lt;code&amp;gt;FizzBuzz&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Input Block===&lt;br /&gt;
&lt;br /&gt;
Workflows usually need some kind of user input, so let's give our workflow an &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; section.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Notice that each input has a type, a name, and an optional default value. If the type ends in &amp;lt;code&amp;gt;?&amp;lt;/code&amp;gt;, the value is optional, and it may be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.&lt;br /&gt;
&lt;br /&gt;
===Body===&lt;br /&gt;
&lt;br /&gt;
Now we'll start on the body of the workflow, to be inserted just after the inputs section. &lt;br /&gt;
&lt;br /&gt;
The first thing we're going to need to do is create an array of all the numbers up to the &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;. We can do this by calling the WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; function, and assigning the result to an &amp;lt;code&amp;gt;Array[Int]&amp;lt;/code&amp;gt; variable.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
&lt;br /&gt;
WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.&lt;br /&gt;
&lt;br /&gt;
===Scattering===&lt;br /&gt;
&lt;br /&gt;
Once we create an array of all the numbers, we can use a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt; to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a &amp;lt;code&amp;gt;map()&amp;lt;/code&amp;gt; in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; starts at 0.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
     Int one_based = i + 1&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Conditionals===&lt;br /&gt;
&lt;br /&gt;
Inside the body of the scatter, we are going to put some conditionals to determine if we should produce &amp;lt;code&amp;gt;&amp;quot;Fizz&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;Buzz&amp;quot;&amp;lt;/code&amp;gt;, or &amp;lt;code&amp;gt;&amp;quot;FizzBuzz&amp;quot;&amp;lt;/code&amp;gt;. To support our &amp;lt;code&amp;gt;fizzbuzz_override&amp;lt;/code&amp;gt;, we put it and a default value into an array, and use the WDL &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; function to find the first non-null value in that array.&lt;br /&gt;
&lt;br /&gt;
Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; at the end and take advantage of variables from un-executed conditionals being &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Note that WDL supports conditional ''expressions'' with a &amp;lt;code&amp;gt;then&amp;lt;/code&amp;gt; and an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt;, but conditional ''statements'' only have a body, not an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt; branch. If you need an else you will have to check the negated condition.&lt;br /&gt;
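&lt;br /&gt;
Before writing the WDL, here is the same pick-the-first-set-value pattern sketched in plain Bash (an analogy only, not WDL; empty variables stand in for &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
```shell
# Each branch sets its variable only when its condition holds; the
# first non-empty one wins, like select_first() over possibly-null values.
fizzbuzz_line() {
    n="$1"; fizz=""; buzz=""; both=""
    if [ $((n % 3)) -eq 0 ]; then fizz="Fizz"; fi
    if [ $((n % 5)) -eq 0 ]; then buzz="Buzz"; fi
    if [ $((n % 15)) -eq 0 ]; then both="FizzBuzz"; fi
    echo "${both:-${fizz:-${buzz:-$n}}}"
}
fizzbuzz_line 15   # prints: FizzBuzz
fizzbuzz_line 9    # prints: Fizz
fizzbuzz_line 7    # prints: 7
```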
&lt;br /&gt;
So first, let's handle the special cases.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Calling Tasks===&lt;br /&gt;
&lt;br /&gt;
Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a &amp;lt;code&amp;gt;${}&amp;lt;/code&amp;gt; substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a &amp;lt;code&amp;gt;stringify_number&amp;lt;/code&amp;gt; task, to be written later.&lt;br /&gt;
&lt;br /&gt;
To call a task (or another workflow), we use a &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt; statement and give it some inputs. Then we can fish the output values out of the task with &amp;lt;code&amp;gt;.&amp;lt;/code&amp;gt; access. (For iterations where the call didn't run, because we made a noise instead of a number, its outputs are &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;.)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
        call stringify_number {&lt;br /&gt;
            input:&lt;br /&gt;
                the_number = one_based&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We can put the code into the workflow now, and set about writing the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         &lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Writing Tasks===&lt;br /&gt;
&lt;br /&gt;
Our task should go after the workflow in the file. It looks a lot like a workflow except it uses &amp;lt;code&amp;gt;task&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're going to want it to take in an integer &amp;lt;code&amp;gt;the_number&amp;lt;/code&amp;gt;, and we're going to want it to output a string &amp;lt;code&amp;gt;the_string&amp;lt;/code&amp;gt;. So let's fill that in in &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; sections.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     # ???&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now, unlike workflows, tasks can have a &amp;lt;code&amp;gt;command&amp;lt;/code&amp;gt; section, which gives a command to run. This section is usually set off with triple angle brackets, and inside it you can use &amp;lt;code&amp;gt;~{}&amp;lt;/code&amp;gt; (Bash-like substitution, but with a tilde) to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
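&lt;br /&gt;
As a rough illustration of what the substitution does (this is ''not'' how Toil actually renders commands; a simple &amp;lt;code&amp;gt;sed&amp;lt;/code&amp;gt; stands in for the real interpolation):&lt;br /&gt;
&lt;br /&gt;
```shell
# Substitute a WDL-style ~{the_number} placeholder into a command
# template, then run the rendered script.
the_number=7
template='echo ~{the_number}'
rendered=$(printf '%s' "$template" | sed "s/~{the_number}/$the_number/")
echo "$rendered"   # prints: echo 7
eval "$rendered"   # prints: 7
```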
&lt;br /&gt;
Now we need to capture the result of the command script. The WDL &amp;lt;code&amp;gt;stdout()&amp;lt;/code&amp;gt; returns a WDL &amp;lt;code&amp;gt;File&amp;lt;/code&amp;gt; containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
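&lt;br /&gt;
A handy mental model for the newline stripping: shell command substitution does the same thing that &amp;lt;code&amp;gt;read_string(stdout())&amp;lt;/code&amp;gt; does.&lt;br /&gt;
&lt;br /&gt;
```shell
# Command substitution strips the trailing newline from `echo 7`,
# just as read_string() strips it from the task's standard output.
captured=$(echo 7)
printf '[%s]' "$captured"   # prints: [7]
```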
&lt;br /&gt;
We're also going to want to add a &amp;lt;code&amp;gt;runtime&amp;lt;/code&amp;gt; section to our task, to specify resource requirements, and to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;disks&amp;lt;/code&amp;gt; section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a &amp;lt;code&amp;gt;local-disk&amp;lt;/code&amp;gt; of a certain number of gigabytes, which may suggest that it be &amp;lt;code&amp;gt;SSD&amp;lt;/code&amp;gt; storage.&lt;br /&gt;
&lt;br /&gt;
Then we can put our task into our WDL file:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Output Block===&lt;br /&gt;
&lt;br /&gt;
Now the only thing missing is a workflow-level &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section. Technically, you aren't supposed to need this in WDL 1.0, but you do need it in WDL 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect all the strings that came out of the different tasks in our scatter into an &amp;lt;code&amp;gt;Array[String]&amp;lt;/code&amp;gt;. We'll add the &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section at the end of the &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; block, above the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
     output {&lt;br /&gt;
        Array[String] fizzbuzz_results = result&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Because the &amp;lt;code&amp;gt;result&amp;lt;/code&amp;gt; variable is defined inside a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt;, when we reference it outside the scatter we see it as being an array.&lt;br /&gt;
&lt;br /&gt;
==Running the Workflow==&lt;br /&gt;
&lt;br /&gt;
Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;FizzBuzz.item_count&amp;quot;: 20}' &amp;gt;fizzbuzz.json&lt;br /&gt;
&lt;br /&gt;
Then run it on the cluster with Toil:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
Or locally:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
=Debugging Workflows=&lt;br /&gt;
Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.&lt;br /&gt;
&lt;br /&gt;
==Restarting the Workflow==&lt;br /&gt;
&lt;br /&gt;
If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have fixed, and you ran the workflow with &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; set manually to a directory that persists between attempts, you can add &amp;lt;code&amp;gt;--restart&amp;lt;/code&amp;gt; to your workflow command and make Toil try again. It will pick up from where it left off and rerun any failed tasks and then the rest of the workflow.&lt;br /&gt;
&lt;br /&gt;
This ''will not'' pick up any changes to your WDL source code files; those are read once at the beginning and not re-read on restart.&lt;br /&gt;
&lt;br /&gt;
If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques.&lt;br /&gt;
&lt;br /&gt;
==Debugging Options==&lt;br /&gt;
When debugging a workflow, make sure to run it with &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt;, to set the log level to &amp;lt;code&amp;gt;DEBUG&amp;lt;/code&amp;gt;, and with &amp;lt;code&amp;gt;--jobStore /some/path/to/a/shared/directory/it/can/create&amp;lt;/code&amp;gt;, so that the stored files shipped between jobs are somewhere you can access.&lt;br /&gt;
&lt;br /&gt;
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:&lt;br /&gt;
&lt;br /&gt;
 =========&amp;gt;&lt;br /&gt;
         Toil job log is here&lt;br /&gt;
 &amp;lt;=========&lt;br /&gt;
&lt;br /&gt;
Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this.&lt;br /&gt;
&lt;br /&gt;
==Reading the Log==&lt;br /&gt;
&lt;br /&gt;
When a WDL workflow fails, you are likely to see a message like this:&lt;br /&gt;
&lt;br /&gt;
 WDL.runtime.error.CommandFailed: task command failed with exit status 1&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism&lt;br /&gt;
&lt;br /&gt;
This means that the command line specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code. This happens either when the command is written wrong, or when the tool you are trying to run detects and reports an error.&lt;br /&gt;
&lt;br /&gt;
Go up higher in the log until you find lines that look like:&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows:&lt;br /&gt;
&lt;br /&gt;
And&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows:&lt;br /&gt;
&lt;br /&gt;
These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.&lt;br /&gt;
&lt;br /&gt;
If you would like individual task logs to be saved separately for later reference, you can use the &amp;lt;code&amp;gt;--writeLogs&amp;lt;/code&amp;gt; option to specify a directory to store them. For more information, see [https://toil.readthedocs.io/en/latest/wdl/running.html#managing-workflow-logs the Toil documentation of workflow task logs].&lt;br /&gt;
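&lt;br /&gt;
For example (the &amp;lt;code&amp;gt;task-logs&amp;lt;/code&amp;gt; directory name is arbitrary, and the workflow and input file names are placeholders):&lt;br /&gt;
&lt;br /&gt;
 mkdir -p task-logs&lt;br /&gt;
 toil-wdl-runner --writeLogs ./task-logs workflow.wdl inputs.json&lt;br /&gt;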
&lt;br /&gt;
==Reproducing Problems==&lt;br /&gt;
&lt;br /&gt;
When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this.&lt;br /&gt;
&lt;br /&gt;
=== Automatically Fetching Input Files ===&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;toil debug-job&amp;lt;/code&amp;gt; command has a &amp;lt;code&amp;gt;--retrieveTaskDirectory&amp;lt;/code&amp;gt; option that lets you dump out a directory with all the files that a failing WDL task would use. You can use it like:&lt;br /&gt;
&lt;br /&gt;
 toil debug-job ./jobstore WDLTaskJob --retrieveTaskDirectory dumpdir&lt;br /&gt;
&lt;br /&gt;
If there are multiple failing tasks, you might need to replace &amp;lt;code&amp;gt;WDLTaskJob&amp;lt;/code&amp;gt; with the name of one of the failing jobs. See [https://toil.readthedocs.io/en/latest/running/debugging.html#fetching-job-inputs the Toil documentation on retrieving files] for more on how to use this command.&lt;br /&gt;
&lt;br /&gt;
=== Manually Finding Input Files ===&lt;br /&gt;
&lt;br /&gt;
If you can't use &amp;lt;code&amp;gt;toil debug-job&amp;lt;/code&amp;gt;, you might need to manually dig through the job store for files. In the log of your failing Toil task, look for lines like this:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&amp;lt;/code&amp;gt; part is a Toil file ID, and it is a relative path from your &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; value to where the file is stored on disk. So if you ran the workflow with &amp;lt;code&amp;gt;--jobStore /private/groups/patenlab/anovak/jobstore&amp;lt;/code&amp;gt;, you would look for this file at:&lt;br /&gt;
&lt;br /&gt;
 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&lt;br /&gt;
&lt;br /&gt;
==More Ways of Finding Files==&lt;br /&gt;
&lt;br /&gt;
Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the &amp;lt;code&amp;gt;find&amp;lt;/code&amp;gt; command to try and find the files by name. For example, if you want to look at &amp;lt;code&amp;gt;Sample.bam&amp;lt;/code&amp;gt;, you can look for it like this:&lt;br /&gt;
&lt;br /&gt;
 find /path/to/the/jobstore -name &amp;quot;Sample.bam&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
You can take the &amp;lt;code&amp;gt;toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt; URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this:&lt;br /&gt;
&lt;br /&gt;
 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
Then you can take the part after the last colon, &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt;, and that is the path relative to the job store where this file can be found.&lt;br /&gt;
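&lt;br /&gt;
If you have many of these URIs, you don't need a web decoder. As a sketch, you can do the same decoding at the command line with Python's standard library (using the example URI from above):&lt;br /&gt;
&lt;br /&gt;
```shell
# URL-decode the toilfile: URI and keep the part after the last colon,
# which is the job-store-relative path.
uri='toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam'
decoded=$(python3 -c 'import sys, urllib.parse; print(urllib.parse.unquote(sys.argv[1]))' "$uri")
relative_path="${decoded##*:}"
echo "$relative_path"
```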
&lt;br /&gt;
==Using Development Versions of Toil==&lt;br /&gt;
&lt;br /&gt;
Sometimes, bugs will be fixed in the development version of Toil, but not released yet. To try the current development version of Toil, you can install it like this:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
If you want to use a particular branch or commit, like &amp;lt;code&amp;gt;aaa451b320fc115b3563ced25cb501301cf86f90&amp;lt;/code&amp;gt;, you can do:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
==Frequently Asked Questions==&lt;br /&gt;
&lt;br /&gt;
===I am getting warnings about &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt;===&lt;br /&gt;
You may be seeing warnings that &amp;lt;code&amp;gt;XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should upgrade Toil. [https://github.com/DataBiosphere/toil/commit/ff6bf60ab798a675c20156c749817c4313644b96 Since Toil 6.1.0], Toil no longer issues this warning, and just puts up with bad &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt; settings.&lt;br /&gt;
&lt;br /&gt;
===Toil said it was &amp;lt;code&amp;gt;Redirecting logging&amp;lt;/code&amp;gt; somewhere, but I can't find that file!===&lt;br /&gt;
The Toil worker process for each job will say that it is &amp;lt;code&amp;gt;Redirecting logging to /data/tmp/somewhere/worker_log.txt&amp;lt;/code&amp;gt;, and when running in single machine mode these messages go to the main Toil log.&lt;br /&gt;
&lt;br /&gt;
The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt; option to Toil.&lt;br /&gt;
&lt;br /&gt;
If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.&lt;br /&gt;
&lt;br /&gt;
=Additional WDL resources=&lt;br /&gt;
&lt;br /&gt;
For more information on writing and running WDL workflows, see:&lt;br /&gt;
&lt;br /&gt;
* [https://docs.openwdl.org/en/stable/ The WDL documentation]&lt;br /&gt;
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The &amp;quot;Learn WDL&amp;quot; video course on YouTube]&lt;br /&gt;
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=643</id>
		<title>Phoenix WDL Tutorial</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=643"/>
		<updated>2025-02-14T19:00:28Z</updated>

		<summary type="html">&lt;p&gt;Anovak: /* Installing Toil with WDL support */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Tutorial: Getting Started with WDL Workflows on Phoenix'''&lt;br /&gt;
&lt;br /&gt;
Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you.&lt;br /&gt;
&lt;br /&gt;
Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.&lt;br /&gt;
&lt;br /&gt;
This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.&lt;br /&gt;
&lt;br /&gt;
=Phoenix Cluster Setup=&lt;br /&gt;
&lt;br /&gt;
Before we begin, you will need a computer to work at, on which you are able to install software, and the ability to connect to other machines over SSH.&lt;br /&gt;
&lt;br /&gt;
==Getting VPN access==&lt;br /&gt;
&lt;br /&gt;
We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.&lt;br /&gt;
&lt;br /&gt;
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.&lt;br /&gt;
&lt;br /&gt;
==Connecting to Phoenix==&lt;br /&gt;
&lt;br /&gt;
Once you have VPN access, you can connect to any of the machines with access to the Phoenix cluster. These interactive nodes are fairly large machines that can do some work locally, but you will still want to run larger workflows on the actual cluster. For this tutorial, we will use &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt; as our login node.&lt;br /&gt;
&lt;br /&gt;
To connect to the cluster:&lt;br /&gt;
&lt;br /&gt;
1. Connect to the VPN.&lt;br /&gt;
2. SSH to &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt;. At the command line, run:&lt;br /&gt;
&lt;br /&gt;
 ssh emerald.prism&lt;br /&gt;
&lt;br /&gt;
If your username on the cluster (say, &amp;lt;code&amp;gt;flastname&amp;lt;/code&amp;gt;) is different than your username on your computer (which might be &amp;lt;code&amp;gt;firstname&amp;lt;/code&amp;gt;), you might instead have to run:&lt;br /&gt;
&lt;br /&gt;
 ssh flastname@emerald.prism&lt;br /&gt;
&lt;br /&gt;
The first time you connect, you will see a message like:&lt;br /&gt;
&lt;br /&gt;
 The authenticity of host 'emerald.prism (10.50.1.67)' can't be established.&lt;br /&gt;
 ED25519 key fingerprint is SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI.&lt;br /&gt;
 This key is not known by any other names.&lt;br /&gt;
 Are you sure you want to continue connecting (yes/no/[fingerprint])?&lt;br /&gt;
&lt;br /&gt;
This is your computer asking you to help it decide if it is talking to the genuine &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt;, and not an imposter. You will want to make sure that the &amp;quot;key fingerprint&amp;quot; is indeed &amp;lt;code&amp;gt;SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI&amp;lt;/code&amp;gt;. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type &amp;lt;code&amp;gt;yes&amp;lt;/code&amp;gt; to accept and remember that the server is who it says it is.&lt;br /&gt;
&lt;br /&gt;
==Installing Toil with WDL support==&lt;br /&gt;
&lt;br /&gt;
Once you are on the head node, you can install Toil, a program for running workflows.&lt;br /&gt;
&lt;br /&gt;
Toil is written in Python, and the modern way to install Python command line tools is with pipx. So [https://pipx.pypa.io/latest/installation/ install pipx]:&lt;br /&gt;
&lt;br /&gt;
 python3 -m pip install --user pipx&lt;br /&gt;
 python3 -m pipx ensurepath&lt;br /&gt;
&lt;br /&gt;
When installing, you need to specify that you want WDL support. To do this, you can run:&lt;br /&gt;
&lt;br /&gt;
 pipx install 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
If you also want to use AWS S3 &amp;lt;code&amp;gt;s3://&amp;lt;/code&amp;gt; and/or Google &amp;lt;code&amp;gt;gs://&amp;lt;/code&amp;gt; URLs for data, you will need to also install Toil with the &amp;lt;code&amp;gt;aws&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;google&amp;lt;/code&amp;gt; extras, respectively:&lt;br /&gt;
&lt;br /&gt;
 pipx install 'toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
This will install Toil in the &amp;lt;code&amp;gt;.local&amp;lt;/code&amp;gt; directory inside your home directory, which we write as &amp;lt;code&amp;gt;~/.local&amp;lt;/code&amp;gt;. The program to run WDL workflows, &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, will be at &amp;lt;code&amp;gt;~/.local/bin/toil-wdl-runner&amp;lt;/code&amp;gt;. The &amp;lt;code&amp;gt;python3 -m pipx ensurepath&amp;lt;/code&amp;gt; command should have added the &amp;lt;code&amp;gt;~/.local/bin&amp;lt;/code&amp;gt; directory to your &amp;lt;code&amp;gt;PATH&amp;lt;/code&amp;gt; environment variable, to ensure you can find these commands.&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in''', to restart bash and pick up the change.&lt;br /&gt;
&lt;br /&gt;
To make sure it worked, you can run:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --help&lt;br /&gt;
&lt;br /&gt;
If everything worked correctly, it will print a long list of the various option flags that the &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; command supports.&lt;br /&gt;
&lt;br /&gt;
If you ever want to upgrade Toil to a new release, you can run &amp;lt;code&amp;gt;pipx upgrade toil&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Configuring your Phoenix Environment==&lt;br /&gt;
&lt;br /&gt;
'''Do not try and store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. Usually you would end up with &amp;lt;code&amp;gt;/private/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Remember this path; we will need it later.&lt;br /&gt;
&lt;br /&gt;
==Configuring Toil for Phoenix==&lt;br /&gt;
&lt;br /&gt;
Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. But since these images can be large, and the home directory quota is only 30 GB, they might not fit in your home directory.&lt;br /&gt;
&lt;br /&gt;
We would like to be able to store these on the cluster's large storage array, under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. However, Toil needs to use file locks in these directories to prevent simultaneous Singularity calls from producing internal Singularity errors, and Ceph currently has [https://tracker.ceph.com/issues/65607 a bug where these file locking operations can freeze the Ceph servers].&lt;br /&gt;
&lt;br /&gt;
If you have '''a small number of container images''' that will fit in your home directory, you can keep them there. [https://github.com/DataBiosphere/toil/commit/cb0b291bb7f6212bfe69221dd9f09d72f83e92fb Since Toil 6.1.0], this is the default behavior and you don't need to do anything. (Unless you previously set &amp;lt;code&amp;gt;SINGULARITY_CACHEDIR&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;MINIWDL__SINGULARITY__IMAGE_CACHE&amp;lt;/code&amp;gt;, in which case you need to unset them.)&lt;br /&gt;
&lt;br /&gt;
'''If you don't have room in your home directory''' for container images, currently the recommended approach is to use node-local storage under &amp;lt;code&amp;gt;/data/tmp&amp;lt;/code&amp;gt;. This results in each node pulling each container image, but images will be saved across workflows.&lt;br /&gt;
&lt;br /&gt;
You can set that up for all your workflows with:&lt;br /&gt;
&lt;br /&gt;
 echo 'export SINGULARITY_CACHEDIR=&amp;quot;/data/tmp/$(whoami)/cache/singularity&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE=&amp;quot;/data/tmp/$(whoami)/cache/miniwdl&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
Then '''log out and log back in again''', to apply the changes.&lt;br /&gt;
&lt;br /&gt;
=Running an existing workflow=&lt;br /&gt;
&lt;br /&gt;
First, let's use &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].&lt;br /&gt;
&lt;br /&gt;
First, go to your user directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;, and make a directory to work in.&lt;br /&gt;
&lt;br /&gt;
 cd /private/groups/YOURGROUPNAME/YOURUSERNAME&lt;br /&gt;
 mkdir workflow-test&lt;br /&gt;
 cd workflow-test&lt;br /&gt;
&lt;br /&gt;
Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.&lt;br /&gt;
&lt;br /&gt;
 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl&lt;br /&gt;
&lt;br /&gt;
==Preparing an input file==&lt;br /&gt;
&lt;br /&gt;
Near the top of the WDL file, there's a section like this:&lt;br /&gt;
&lt;br /&gt;
 workflow hello_caller {&lt;br /&gt;
     input {&lt;br /&gt;
         File who&lt;br /&gt;
     }&lt;br /&gt;
&lt;br /&gt;
This means that there is a workflow named &amp;lt;code&amp;gt;hello_caller&amp;lt;/code&amp;gt; in this file, and it takes as input a file variable named &amp;lt;code&amp;gt;who&amp;lt;/code&amp;gt;. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.&lt;br /&gt;
&lt;br /&gt;
So first, we have to make that list of names. Let's make it in &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 echo &amp;quot;Mridula Resurrección&amp;quot; &amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Gershom Šarlota&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Ritchie Ravi&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
&lt;br /&gt;
Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt; that references it by relative path, like this:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./names.txt&amp;quot;}' &amp;gt;inputs.json&lt;br /&gt;
&lt;br /&gt;
Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].&lt;br /&gt;
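&lt;br /&gt;
A common stumbling block is an inputs file that is not quite valid JSON (for example, one with a trailing comma). As a quick sanity check, you can run it through Python's JSON parser; this sketch recreates the example file so it is self-contained:&lt;br /&gt;
&lt;br /&gt;
```shell
# Recreate the example inputs file from the step above.
echo '{"hello_caller.who": "./names.txt"}' > inputs.json
# json.tool pretty-prints the file, and exits nonzero if it is not valid JSON.
python3 -m json.tool inputs.json
```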
&lt;br /&gt;
==Testing at small scale on a single machine==&lt;br /&gt;
&lt;br /&gt;
We are now ready to run the workflow!&lt;br /&gt;
&lt;br /&gt;
You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:&lt;br /&gt;
&lt;br /&gt;
 srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i&lt;br /&gt;
&lt;br /&gt;
This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In your new shell, run this Toil command:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner self_test.wdl inputs.json -o local_run&lt;br /&gt;
&lt;br /&gt;
This will, by default, use the &amp;lt;code&amp;gt;single_machine&amp;lt;/code&amp;gt; Toil &amp;quot;batch system&amp;quot; to run all of the workflow's tasks locally. Output will be sent to a new directory named &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
This will print a lot of logging to standard error, and to standard output it will print:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;hello_caller.message_files&amp;quot;: [&amp;quot;local_run/Mridula Resurrecci\u00f3n.txt&amp;quot;, &amp;quot;local_run/Gershom \u0160arlota.txt&amp;quot;, &amp;quot;local_run/Ritchie Ravi.txt&amp;quot;], &amp;quot;hello_caller.messages&amp;quot;: [&amp;quot;Hello, Mridula Resurrecci\u00f3n!&amp;quot;, &amp;quot;Hello, Gershom \u0160arlota!&amp;quot;, &amp;quot;Hello, Ritchie Ravi!&amp;quot;]}&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt; directory will contain the described text files (with Unicode escape sequences like &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt; replaced by their corresponding characters), each containing a greeting for the corresponding person.&lt;br /&gt;
&lt;br /&gt;
To leave your interactive Slurm session and return to the head node, use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Running at larger scale==&lt;br /&gt;
&lt;br /&gt;
Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people!&lt;br /&gt;
&lt;br /&gt;
Go get this handy list of people and cut it to length:&lt;br /&gt;
&lt;br /&gt;
 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt&lt;br /&gt;
 head -n100 1000_names.txt &amp;gt;100_names.txt&lt;br /&gt;
&lt;br /&gt;
And make a new inputs file:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./100_names.txt&amp;quot;}' &amp;gt;inputs_big.json&lt;br /&gt;
&lt;br /&gt;
Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.&lt;br /&gt;
&lt;br /&gt;
To run against the Slurm cluster, we need to use the &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the &amp;lt;code&amp;gt;--caching false&amp;lt;/code&amp;gt; option. We will add the &amp;lt;code&amp;gt;--batchLogsDir&amp;lt;/code&amp;gt; option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the &amp;lt;code&amp;gt;-m&amp;lt;/code&amp;gt; option to save the output JSON to a file instead of printing it.&lt;br /&gt;
&lt;br /&gt;
Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the &amp;lt;code&amp;gt;TOIL_SLURM_ARGS&amp;lt;/code&amp;gt; environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in.&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json&lt;br /&gt;
&lt;br /&gt;
This will tick for a while, but eventually you should end up with 100 greeting files in the &amp;lt;code&amp;gt;slurm_run&amp;lt;/code&amp;gt; directory.&lt;br /&gt;
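&lt;br /&gt;
While the workflow runs, you can keep an eye on the Slurm jobs Toil has submitted with, for example:&lt;br /&gt;
&lt;br /&gt;
 squeue -u $(whoami)&lt;br /&gt;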
&lt;br /&gt;
=Writing your own workflow=&lt;br /&gt;
&lt;br /&gt;
In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. &lt;br /&gt;
&lt;br /&gt;
==Writing the file==&lt;br /&gt;
&lt;br /&gt;
===Version===&lt;br /&gt;
&lt;br /&gt;
All WDL files need to start with a &amp;lt;code&amp;gt;version&amp;lt;/code&amp;gt; statement (unless they are very old &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; files). Toil supports &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt;, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; and 1.0.&lt;br /&gt;
&lt;br /&gt;
So let's start a new WDL 1.0 workflow. Open up a file named &amp;lt;code&amp;gt;fizzbuzz.wdl&amp;lt;/code&amp;gt; and start with a version statement:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
&lt;br /&gt;
===Workflow Block===&lt;br /&gt;
&lt;br /&gt;
Then, add an empty &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; named &amp;lt;code&amp;gt;FizzBuzz&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Input Block===&lt;br /&gt;
&lt;br /&gt;
Workflows usually need some kind of user input, so let's give our workflow an &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; section.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Notice that each input has a type, a name, and an optional default value. If the type ends in &amp;lt;code&amp;gt;?&amp;lt;/code&amp;gt;, the value is optional, and it may be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.&lt;br /&gt;
&lt;br /&gt;
===Body===&lt;br /&gt;
&lt;br /&gt;
Now we'll start on the body of the workflow, to be inserted just after the inputs section. &lt;br /&gt;
&lt;br /&gt;
The first thing we're going to need to do is create an array of all the numbers up to the &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;. We can do this by calling the WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; function, and assigning the result to an &amp;lt;code&amp;gt;Array[Int]&amp;lt;/code&amp;gt; variable.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
&lt;br /&gt;
WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.&lt;br /&gt;
&lt;br /&gt;
===Scattering===&lt;br /&gt;
&lt;br /&gt;
Once we create an array of all the numbers, we can use a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt; to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a &amp;lt;code&amp;gt;map()&amp;lt;/code&amp;gt; in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; starts at 0.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
     Int one_based = i + 1&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Conditionals===&lt;br /&gt;
&lt;br /&gt;
Inside the body of the scatter, we are going to put some conditionals to determine if we should produce &amp;lt;code&amp;gt;&amp;quot;Fizz&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;Buzz&amp;quot;&amp;lt;/code&amp;gt;, or &amp;lt;code&amp;gt;&amp;quot;FizzBuzz&amp;quot;&amp;lt;/code&amp;gt;. To support our &amp;lt;code&amp;gt;fizzbuzz_override&amp;lt;/code&amp;gt;, we use an array of it and a default value, and use the WDL &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; function to find the first non-null value in that array.&lt;br /&gt;
&lt;br /&gt;
Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; at the end and take advantage of variables from un-executed conditionals being &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. &lt;br /&gt;
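&lt;br /&gt;
As a small illustration (separate from our FizzBuzz workflow), a variable declared inside a conditional inside a scatter is optional, and gathers into an array of optional values:&lt;br /&gt;
&lt;br /&gt;
 scatter (i in range(3)) {&lt;br /&gt;
     if (i % 2 == 0) {&lt;br /&gt;
         String label = &amp;quot;even&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
     # Inside the scatter but outside the if, label has type String?&lt;br /&gt;
 }&lt;br /&gt;
 # Outside the scatter, label has type Array[String?]: [&amp;quot;even&amp;quot;, null, &amp;quot;even&amp;quot;]&lt;br /&gt;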
&lt;br /&gt;
Note that WDL supports conditional ''expressions'' with a &amp;lt;code&amp;gt;then&amp;lt;/code&amp;gt; and an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt;, but conditional ''statements'' only have a body, not an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt; branch. If you need an else you will have to check the negated condition.&lt;br /&gt;
&lt;br /&gt;
So first, let's handle the special cases.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
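&lt;br /&gt;
To see why this null-based pattern works, here is a rough Python sketch (illustration only, not WDL), where variables from un-executed conditionals behave like &amp;lt;code&amp;gt;None&amp;lt;/code&amp;gt; and a hypothetical &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; helper stands in for the WDL function:&lt;br /&gt;
&lt;br /&gt;
```python
# Rough Python sketch of the WDL null-based conditional pattern.
# select_first here is a hypothetical stand-in for the WDL function.
def select_first(values):
    return next(v for v in values if v is not None)

def classify(one_based, to_fizz=3, to_buzz=5, fizzbuzz_override=None):
    fizz = "Fizz" if one_based % to_fizz == 0 else None
    buzz = "Buzz" if one_based % to_buzz == 0 else None
    fizzbuzz = None
    if fizz is not None and buzz is not None:
        fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
    # A normal number, stringified, if neither condition fired.
    number = str(one_based) if fizz is None and buzz is None else None
    return select_first([fizzbuzz, fizz, buzz, number])

print([classify(n) for n in range(1, 16)])
```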
&lt;br /&gt;
===Calling Tasks===&lt;br /&gt;
&lt;br /&gt;
Now, for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use the &amp;lt;code&amp;gt;${}&amp;lt;/code&amp;gt; substitution syntax in quoted strings anywhere, not just in task commands. Toil technically supports this too, but it's not in the spec, and this tutorial needs an excuse for you to call a task. So we're going to insert a call to a &amp;lt;code&amp;gt;stringify_number&amp;lt;/code&amp;gt; task, to be written later.&lt;br /&gt;
&lt;br /&gt;
To call a task (or another workflow), we use a &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt; statement and give it some inputs. Then we can fish the output values out of the task with &amp;lt;code&amp;gt;.&amp;lt;/code&amp;gt; access. Note that in iterations where we made a noise instead, the call doesn't run, and its outputs are &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
        call stringify_number {&lt;br /&gt;
            input:&lt;br /&gt;
                the_number = one_based&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We can put the code into the workflow now, and set about writing the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         &lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Writing Tasks===&lt;br /&gt;
&lt;br /&gt;
Our task should go after the workflow in the file. It looks a lot like a workflow except it uses &amp;lt;code&amp;gt;task&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're going to want it to take in an integer &amp;lt;code&amp;gt;the_number&amp;lt;/code&amp;gt;, and we're going to want it to output a string &amp;lt;code&amp;gt;the_string&amp;lt;/code&amp;gt;. So let's fill that in in &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; sections.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     # ???&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now, unlike workflows, tasks can have a &amp;lt;code&amp;gt;command&amp;lt;/code&amp;gt; section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use &amp;lt;code&amp;gt;~{}&amp;lt;/code&amp;gt;, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now we need to capture the result of the command script. The WDL &amp;lt;code&amp;gt;stdout()&amp;lt;/code&amp;gt; returns a WDL &amp;lt;code&amp;gt;File&amp;lt;/code&amp;gt; containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
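&lt;br /&gt;
As a rough Python analogy (illustration only, not WDL), &amp;lt;code&amp;gt;read_string(stdout())&amp;lt;/code&amp;gt; behaves like reading the command's captured output file and stripping any trailing newlines:&lt;br /&gt;
&lt;br /&gt;
```python
# Rough Python analogy of WDL read_string(stdout()), for illustration.
import tempfile

# Stand-in for the file that captured the command's standard output.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("42\n")  # what the echo command would leave behind
    stdout_path = f.name

def read_string(path):
    with open(path) as fh:
        return fh.read().rstrip("\n")  # trailing newlines removed

print(read_string(stdout_path))
```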
&lt;br /&gt;
We're also going to want to add a &amp;lt;code&amp;gt;runtime&amp;lt;/code&amp;gt; section to our task, to specify resource requirements. We'll also tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;disks&amp;lt;/code&amp;gt; section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a &amp;lt;code&amp;gt;local-disk&amp;lt;/code&amp;gt; of a certain number of gigabytes, optionally suggesting that it be &amp;lt;code&amp;gt;SSD&amp;lt;/code&amp;gt; storage.&lt;br /&gt;
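&lt;br /&gt;
The format is simple enough to sketch; the string has three space-separated fields (a mount name, a size in gigabytes, and a disk type hint):&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch of the three fields in a Cromwell-style disks string.
mount, size_gb, disk_type = "local-disk 1 SSD".split()
size_gb = int(size_gb)  # the middle field is the size in gigabytes
print(mount, size_gb, disk_type)
```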
&lt;br /&gt;
Then we can put our task into our WDL file:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Output Block===&lt;br /&gt;
&lt;br /&gt;
Now the only thing missing is a workflow-level &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in WDL 1.1, and Toil doesn't actually send your outputs anywhere if you don't have one, so we're going to make one. We need to collect all the strings that came out of the different iterations of our scatter into an &amp;lt;code&amp;gt;Array[String]&amp;lt;/code&amp;gt;. We'll add the &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section at the end of the &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; block, above the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
     output {&lt;br /&gt;
        Array[String] fizzbuzz_results = result&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Because the &amp;lt;code&amp;gt;result&amp;lt;/code&amp;gt; variable is defined inside a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt;, when we reference it outside the scatter we see it as being an array.&lt;br /&gt;
&lt;br /&gt;
==Running the Workflow==&lt;br /&gt;
&lt;br /&gt;
Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;FizzBuzz.item_count&amp;quot;: 20}' &amp;gt;fizzbuzz.json&lt;br /&gt;
&lt;br /&gt;
Then run it on the cluster with Toil:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
Or locally:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
=Debugging Workflows=&lt;br /&gt;
Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.&lt;br /&gt;
&lt;br /&gt;
==Restarting the Workflow==&lt;br /&gt;
&lt;br /&gt;
If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have fixed, and you ran the workflow with &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; set manually to a directory that persists between attempts, you can add &amp;lt;code&amp;gt;--restart&amp;lt;/code&amp;gt; to your workflow command and make Toil try again. It will pick up from where it left off and rerun any failed tasks and then the rest of the workflow.&lt;br /&gt;
&lt;br /&gt;
This ''will not'' pick up any changes to your WDL source code files; those are read once at the beginning and not re-read on restart.&lt;br /&gt;
&lt;br /&gt;
If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques.&lt;br /&gt;
&lt;br /&gt;
==Debugging Options==&lt;br /&gt;
When debugging a workflow, make sure to run the workflow with &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt;, to set the log level to &amp;lt;code&amp;gt;DEBUG&amp;lt;/code&amp;gt;, and with &amp;lt;code&amp;gt;--jobStore /some/path/to/a/shared/directory/it/can/create&amp;lt;/code&amp;gt; so that the stored files shipped between jobs are in a place you can access them.&lt;br /&gt;
&lt;br /&gt;
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:&lt;br /&gt;
&lt;br /&gt;
 =========&amp;gt;&lt;br /&gt;
         Toil job log is here&lt;br /&gt;
 &amp;lt;=========&lt;br /&gt;
&lt;br /&gt;
Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this.&lt;br /&gt;
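&lt;br /&gt;
If you want to pull those sections out of a saved log file programmatically, a small script can scan for the markers. This is just a sketch, assuming the marker format shown above:&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch: collect each job's log section from a saved main Toil log,
# assuming the marker lines shown above.
def extract_job_logs(log_text):
    sections = []
    current = None
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped == "=========&gt;":
            current = []  # a job log starts here
        elif stripped == "&lt;=========":
            sections.append("\n".join(current))  # a job log ends here
            current = None
        elif current is not None:
            current.append(stripped)
    return sections
```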
&lt;br /&gt;
==Reading the Log==&lt;br /&gt;
&lt;br /&gt;
When a WDL workflow fails, you are likely to see a message like this:&lt;br /&gt;
&lt;br /&gt;
 WDL.runtime.error.CommandFailed: task command failed with exit status 1&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism&lt;br /&gt;
&lt;br /&gt;
This means that the command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit status. That happens either when the command is written wrong, or when the tool you are running detects and reports an error.&lt;br /&gt;
&lt;br /&gt;
Go up higher in the log until you find lines that look like:&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows:&lt;br /&gt;
&lt;br /&gt;
And&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows:&lt;br /&gt;
&lt;br /&gt;
These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.&lt;br /&gt;
&lt;br /&gt;
If you would like individual task logs to be saved separately for later reference, you can use the &amp;lt;code&amp;gt;--writeLogs&amp;lt;/code&amp;gt; option to specify a directory to store them. For more information, see [https://toil.readthedocs.io/en/latest/wdl/running.html#managing-workflow-logs the Toil documentation of workflow task logs].&lt;br /&gt;
&lt;br /&gt;
==Reproducing Problems==&lt;br /&gt;
&lt;br /&gt;
When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this.&lt;br /&gt;
&lt;br /&gt;
=== Automatically Fetching Input Files ===&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;toil debug-job&amp;lt;/code&amp;gt; command has a &amp;lt;code&amp;gt;--retrieveTaskDirectory&amp;lt;/code&amp;gt; option that lets you dump out a directory with all the files that a failing WDL task would use. You can use it like:&lt;br /&gt;
&lt;br /&gt;
 toil debug-job ./jobstore WDLTaskJob --retrieveTaskDirectory dumpdir&lt;br /&gt;
&lt;br /&gt;
If there are multiple failing tasks, you might need to replace &amp;lt;code&amp;gt;WDLTaskJob&amp;lt;/code&amp;gt; with the name of one of the failing jobs. See [https://toil.readthedocs.io/en/latest/running/debugging.html#fetching-job-inputs the Toil documentation on retrieving files] for more on how to use this command.&lt;br /&gt;
&lt;br /&gt;
=== Manually Finding Input Files ===&lt;br /&gt;
&lt;br /&gt;
If you can't use &amp;lt;code&amp;gt;toil debug-job&amp;lt;/code&amp;gt;, you might need to manually dig through the job store for files. In the log of your failing Toil task, look for lines like this:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&amp;lt;/code&amp;gt; part is a Toil file ID, and it is a relative path from your &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; value to where the file is stored on disk. So if you ran the workflow with &amp;lt;code&amp;gt;--jobStore /private/groups/patenlab/anovak/jobstore&amp;lt;/code&amp;gt;, you would look for this file at:&lt;br /&gt;
&lt;br /&gt;
 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&lt;br /&gt;
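&lt;br /&gt;
In other words, finding the file on disk is just joining the &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; path with the relative file path from the log, using the example paths from above:&lt;br /&gt;
&lt;br /&gt;
```python
# Join the example --jobStore path with the relative Toil file path
# from the log to get the on-disk location.
import os.path

job_store = "/private/groups/patenlab/anovak/jobstore"
file_path = ("files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/"
             "file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam")
print(os.path.join(job_store, file_path))
```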
&lt;br /&gt;
==More Ways of Finding Files==&lt;br /&gt;
&lt;br /&gt;
Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the &amp;lt;code&amp;gt;find&amp;lt;/code&amp;gt; command to search for the files by name. For example, to look for &amp;lt;code&amp;gt;Sample.bam&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 find /path/to/the/jobstore -name &amp;quot;Sample.bam&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
You can take the &amp;lt;code&amp;gt;toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt; URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this:&lt;br /&gt;
&lt;br /&gt;
 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
Then you can take the part after the last colon, &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt;, and that is the path relative to the job store where this file can be found.&lt;br /&gt;
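&lt;br /&gt;
You can also do the decoding without a web tool, for example with Python's standard library (using the example URI from above):&lt;br /&gt;
&lt;br /&gt;
```python
# Decode the example toilfile: URI and recover the job-store-relative
# path, using only the Python standard library.
from urllib.parse import unquote

uri = ("toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob"
       "%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351"
       "%2FSample.chr14.bam/Sample.chr14.bam")
decoded = unquote(uri)
# The path relative to the job store is everything after the last colon.
relative_path = decoded.rsplit(":", 1)[-1]
print(relative_path)
```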
&lt;br /&gt;
==Using Development Versions of Toil==&lt;br /&gt;
&lt;br /&gt;
Sometimes, bugs will be fixed in the development version of Toil, but not released yet. To try the current development version of Toil, you can install it like this:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
If you want to use a particular branch or commit, like &amp;lt;code&amp;gt;aaa451b320fc115b3563ced25cb501301cf86f90&amp;lt;/code&amp;gt;, you can do:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
==Frequently Asked Questions==&lt;br /&gt;
&lt;br /&gt;
===I am getting warnings about &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt;===&lt;br /&gt;
You may be seeing warnings that &amp;lt;code&amp;gt;XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should upgrade Toil. [https://github.com/DataBiosphere/toil/commit/ff6bf60ab798a675c20156c749817c4313644b96 Since Toil 6.1.0], Toil no longer issues this warning, and just puts up with bad &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt; settings.&lt;br /&gt;
&lt;br /&gt;
===Toil said it was &amp;lt;code&amp;gt;Redirecting logging&amp;lt;/code&amp;gt; somewhere, but I can't find that file!===&lt;br /&gt;
The Toil worker process for each job will say that it is &amp;lt;code&amp;gt;Redirecting logging to /data/tmp/somewhere/worker_log.txt&amp;lt;/code&amp;gt;, and when running in single machine mode these messages go to the main Toil log.&lt;br /&gt;
&lt;br /&gt;
The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt; option to Toil.&lt;br /&gt;
&lt;br /&gt;
If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.&lt;br /&gt;
&lt;br /&gt;
=Additional WDL resources=&lt;br /&gt;
&lt;br /&gt;
For more information on writing and running WDL workflows, see:&lt;br /&gt;
&lt;br /&gt;
* [https://docs.openwdl.org/en/stable/ The WDL documentation]&lt;br /&gt;
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The &amp;quot;Learn WDL&amp;quot; video course on YouTube]&lt;br /&gt;
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Slurm_Tips_for_Toil&amp;diff=627</id>
		<title>Slurm Tips for Toil</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Slurm_Tips_for_Toil&amp;diff=627"/>
		<updated>2025-02-11T20:49:14Z</updated>

		<summary type="html">&lt;p&gt;Anovak: Change to new extras syntax from https://github.com/pypa/pip/pull/11617&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Here are some tips for running Toil workflows on the Phoenix Slurm cluster. Mostly you might want to run WDL workflows, but you can use some of these for other workflows like Cactus. You can also consult [https://github.com/DataBiosphere/toil/blob/master/docs/wdl/running.rst the Toil documentation on WDL workflows].&lt;br /&gt;
&lt;br /&gt;
* Install Toil with WDL support with:&lt;br /&gt;
&lt;br /&gt;
 pip3 install --upgrade 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
To use a development version of Toil, you can install from source instead:&lt;br /&gt;
&lt;br /&gt;
 pip3 install 'toil[wdl]@git+https://github.com/DataBiosphere/toil.git'&lt;br /&gt;
&lt;br /&gt;
Or for a particular branch:&lt;br /&gt;
&lt;br /&gt;
 pip3 install 'toil[wdl]@git+https://github.com/DataBiosphere/toil.git@issues/123-abc'&lt;br /&gt;
&lt;br /&gt;
* You will then need to make sure your '''~/.local/bin''' directory is on your PATH. Open up your '''~/.bashrc''' file and add:&lt;br /&gt;
&lt;br /&gt;
 export PATH=$PATH:$HOME/.local/bin&lt;br /&gt;
&lt;br /&gt;
Then make sure to log out and back in again.&lt;br /&gt;
&lt;br /&gt;
* For Toil options, you will want '''--batchSystem slurm''' to make it use Slurm and '''--batchLogsDir ./logs''' (or some other location on a shared filesystem) for the Slurm logs to not get lost.&lt;br /&gt;
&lt;br /&gt;
* You may be able to speed up your workflow with '''--caching true''', to cache data on nodes to be shared among multiple simultaneous tasks.&lt;br /&gt;
&lt;br /&gt;
* If using '''toil-wdl-runner''', you might want to add '''--jobStore ./jobStore''' to make sure the job store is in a defined, shared location so that you can use '''--restart''' later.&lt;br /&gt;
&lt;br /&gt;
* If using '''toil-wdl-runner''', you will want to set the '''SINGULARITY_CACHEDIR''' and '''MINIWDL__SINGULARITY__IMAGE_CACHE''' environment variables for your workflow to locations on shared storage, possibly the default cache locations in your home directory. Otherwise Toil will set them to node-local directories on each node, and thus re-download images for each workflow run, and for each cluster node. To avoid this, you can, for example, before your run or in your '''~/.bashrc''', set:&lt;br /&gt;
&lt;br /&gt;
 export SINGULARITY_CACHEDIR=$HOME/.singularity/cache&lt;br /&gt;
 export MINIWDL__SINGULARITY__IMAGE_CACHE=$HOME/.cache/miniwdl&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Slurm_Tips_for_Toil&amp;diff=626</id>
		<title>Slurm Tips for Toil</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Slurm_Tips_for_Toil&amp;diff=626"/>
		<updated>2025-02-11T20:42:22Z</updated>

		<summary type="html">&lt;p&gt;Anovak: Add quotes to protect brackets&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Here are some tips for running Toil workflows on the Phoenix Slurm cluster. Mostly you might want to run WDL workflows, but you can use some of these for other workflows like Cactus. You can also consult [https://github.com/DataBiosphere/toil/blob/master/docs/wdl/running.rst the Toil documentation on WDL workflows].&lt;br /&gt;
&lt;br /&gt;
* Install Toil with WDL support with:&lt;br /&gt;
&lt;br /&gt;
 pip3 install --upgrade 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
To use a development version of Toil, you can install from source instead:&lt;br /&gt;
&lt;br /&gt;
 pip3 install 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
Or for a particular branch:&lt;br /&gt;
&lt;br /&gt;
 pip3 install 'git+https://github.com/DataBiosphere/toil.git@issues/123-abc#egg=toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
* You will then need to make sure your '''~/.local/bin''' directory is on your PATH. Open up your '''~/.bashrc''' file and add:&lt;br /&gt;
&lt;br /&gt;
 export PATH=$PATH:$HOME/.local/bin&lt;br /&gt;
&lt;br /&gt;
Then make sure to log out and back in again.&lt;br /&gt;
&lt;br /&gt;
* For Toil options, you will want '''--batchSystem slurm''' to make it use Slurm and '''--batchLogsDir ./logs''' (or some other location on a shared filesystem) for the Slurm logs to not get lost.&lt;br /&gt;
&lt;br /&gt;
* You may be able to speed up your workflow with '''--caching true''', to cache data on nodes to be shared among multiple simultaneous tasks.&lt;br /&gt;
&lt;br /&gt;
* If using '''toil-wdl-runner''', you might want to add '''--jobStore ./jobStore''' to make sure the job store is in a defined, shared location so that you can use '''--restart''' later.&lt;br /&gt;
&lt;br /&gt;
* If using '''toil-wdl-runner''', you will want to set the '''SINGULARITY_CACHEDIR''' and '''MINIWDL__SINGULARITY__IMAGE_CACHE''' environment variables for your workflow to locations on shared storage, possibly the default cache locations in your home directory. Otherwise Toil will set them to node-local directories on each node, and thus re-download images for each workflow run, and for each cluster node. To avoid this, you can, for example, before your run or in your '''~/.bashrc''', set:&lt;br /&gt;
&lt;br /&gt;
 export SINGULARITY_CACHEDIR=$HOME/.singularity/cache&lt;br /&gt;
 export MINIWDL__SINGULARITY__IMAGE_CACHE=$HOME/.cache/miniwdl&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Slurm_Tips_for_vg&amp;diff=540</id>
		<title>Slurm Tips for vg</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Slurm_Tips_for_vg&amp;diff=540"/>
		<updated>2025-01-16T16:58:39Z</updated>

		<summary type="html">&lt;p&gt;Anovak: /* Setting Up */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page explains how to set up a development environment for [https://github.com/vgteam/vg vg] on the Phoenix cluster.&lt;br /&gt;
&lt;br /&gt;
==Setting Up==&lt;br /&gt;
&lt;br /&gt;
1. After connecting to the VPN, connect to an interactive node:&lt;br /&gt;
&lt;br /&gt;
 ssh razzmatazz.prism&lt;br /&gt;
&lt;br /&gt;
This node is relatively small, so you shouldn't run real work on it, but it is the place you need to be to submit Slurm jobs.&lt;br /&gt;
&lt;br /&gt;
2. Make yourself a user directory under '''/private/groups''', which is where large data must be stored. For example, if you are in the Paten lab:&lt;br /&gt;
&lt;br /&gt;
 mkdir /private/groups/patenlab/$USER&lt;br /&gt;
&lt;br /&gt;
3. (Optional) Link it into your home directory, so you can easily keep your repos on that storage. The '''/private/groups''' storage may be faster than the home directory storage.&lt;br /&gt;
&lt;br /&gt;
 mkdir -p /private/groups/patenlab/$USER/workspace&lt;br /&gt;
 ln -s /private/groups/patenlab/$USER/workspace ~/workspace&lt;br /&gt;
&lt;br /&gt;
4. Make sure you have SSH keys created and added to GitHub.&lt;br /&gt;
&lt;br /&gt;
 cat ~/.ssh/id_ed25519.pub || (ssh-keygen -t ed25519 &amp;amp;&amp;amp; cat ~/.ssh/id_ed25519.pub)&lt;br /&gt;
 # Paste into https://github.com/settings/ssh/new&lt;br /&gt;
&lt;br /&gt;
5. Make a place to put your clone, and clone vg:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p ~/workspace&lt;br /&gt;
 cd ~/workspace&lt;br /&gt;
 git clone --recursive git@github.com:vgteam/vg.git&lt;br /&gt;
 cd vg&lt;br /&gt;
&lt;br /&gt;
6. vg's dependencies should already be installed on the cluster nodes. If any of them seem to be missing, tell cluster-admin@soe.ucsc.edu to install them.&lt;br /&gt;
&lt;br /&gt;
7. Build vg as a Slurm job. This will send the build out to the cluster as a 64-core, 80G memory job, and keep the output logs in your terminal.&lt;br /&gt;
&lt;br /&gt;
 srun -c 64 --mem=80G --time=00:30:00 make -j64&lt;br /&gt;
&lt;br /&gt;
This will leave your vg binary at '''~/workspace/vg/bin/vg'''.&lt;br /&gt;
&lt;br /&gt;
==Misc Tips==&lt;br /&gt;
&lt;br /&gt;
* For a lightweight job that outputs to your terminal or that can be waited for in a Bash script, run an individual command directly from &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 srun -c1 --mem 2G --partition short --time 1:00:00 sleep 10&lt;br /&gt;
&lt;br /&gt;
* If you need to run a few commands in the same shell, use &amp;lt;code&amp;gt;sbatch --wrap&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 sbatch -c1 --mem 2G --partition short --time 1:00:00 --wrap &amp;quot;. venv/bin/activate; ./script1.py &amp;amp;&amp;amp; ./script2.py&amp;quot;&lt;br /&gt;
&lt;br /&gt;
* To watch a batch job's output live, look at the &amp;lt;code&amp;gt;Submitted batch job 5244464&amp;lt;/code&amp;gt; line from &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; and run:&lt;br /&gt;
&lt;br /&gt;
 tail -f slurm-5244464.out&lt;br /&gt;
&lt;br /&gt;
* '''Danger!''' If you ''really'' need an interactive session with appreciable resources, you can schedule one with &amp;lt;code&amp;gt;srun --pty&amp;lt;/code&amp;gt;. But it is '''very easy''' to waste resources like this, since the job will happily sit there not doing anything until it hits the timeout. Only do this for testing! For real work, use one of the other methods!&lt;br /&gt;
&lt;br /&gt;
 srun -c 16 --mem 120G --time=08:00:00 --partition=medium --pty bash -i&lt;br /&gt;
&lt;br /&gt;
* To send out a job without making a script file for it, use '''sbatch --wrap &amp;quot;your command here&amp;quot;'''.&lt;br /&gt;
&lt;br /&gt;
* Any option that a batch script sets on '''#SBATCH''' lines can also be passed directly on the &lt;code&gt;sbatch&lt;/code&gt; command line!&lt;br /&gt;
&lt;br /&gt;
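For example, options set on '''#SBATCH''' lines in a batch script (&lt;code&gt;job.sh&lt;/code&gt; and &lt;code&gt;./your_command&lt;/code&gt; are made-up names):&lt;br /&gt;
&lt;br /&gt;
 #!/bin/bash&lt;br /&gt;
 #SBATCH -c1&lt;br /&gt;
 #SBATCH --mem 2G&lt;br /&gt;
 #SBATCH --partition short&lt;br /&gt;
 ./your_command&lt;br /&gt;
&lt;br /&gt;
can equivalently be given on the command line as &lt;code&gt;sbatch -c1 --mem 2G --partition short job.sh&lt;/code&gt;.&lt;br /&gt;
&lt;br /&gt;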
* You can use [https://github.com/CLIP-HPC/SlurmCommander#readme Slurm Commander] to watch the state of the cluster with the '''scom''' command.&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Using_Docker_under_Slurm&amp;diff=539</id>
		<title>Using Docker under Slurm</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Using_Docker_under_Slurm&amp;diff=539"/>
		<updated>2024-12-04T14:45:05Z</updated>

		<summary type="html">&lt;p&gt;Anovak: Explain how to pass the right GPUs&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;__TOC__&lt;br /&gt;
&lt;br /&gt;
Sometimes it is convenient to ask Slurm to run your job in a Docker container. This is fine; however, you will need to fully test your job in its container beforehand (on mustard or emerald, for example) to see how much RAM and how many CPUs it requires, so you can accurately describe its resource needs in your Slurm job submission file.&lt;br /&gt;
&lt;br /&gt;
== Testing ==&lt;br /&gt;
&lt;br /&gt;
You can run your container on mustard and then look at &lt;code&gt;top&lt;/code&gt; to see how much RAM and CPU it uses.&lt;br /&gt;
&lt;br /&gt;
You will need to pull your Docker image from a registry, like Docker Hub or Quay, and you should run your container with the '''--rm''' flag, so the container cleans itself up after running. So your workflow would look something like this:&lt;br /&gt;
&lt;br /&gt;
 1: docker pull docker/welcome-to-docker&lt;br /&gt;
 2: docker run --rm docker/welcome-to-docker&lt;br /&gt;
&lt;br /&gt;
Optionally you can clean up your image as well, but only if you don't have many jobs using that image on the same node. For example, if I wanted to remove the image labelled &amp;quot;weiler/mytools&amp;quot;:&lt;br /&gt;
&lt;br /&gt;
 $ docker image ls&lt;br /&gt;
 REPOSITORY                          TAG                    IMAGE ID       CREATED         SIZE&lt;br /&gt;
 weiler/mytools                     latest                 be6777ad00cf   19 hours ago    396MB&lt;br /&gt;
 somedude/tools                     latest                 9b1d1f6fbf6f   3 weeks ago     607MB&lt;br /&gt;
 &lt;br /&gt;
 $ docker image rm be6777ad00cf&lt;br /&gt;
&lt;br /&gt;
== Resource Limits ==&lt;br /&gt;
&lt;br /&gt;
When running Docker containers on Slurm, Slurm cannot limit the resources that Docker uses. Therefore, when you launch a container, you will need to know beforehand how much RAM and how many CPUs it uses, as determined by your testing. Then launch your job with the following '''--cpus''' and '''--memory''' parameters so Docker itself will limit what it uses:&lt;br /&gt;
&lt;br /&gt;
 docker run --rm '''--cpus=16 --memory=1024m''' docker/welcome-to-docker&lt;br /&gt;
&lt;br /&gt;
The '''--memory''' argument is in megabytes (hence the 'm' at the end), so the above example will set a memory limit of 1024 MB, i.e. 1 GB.&lt;br /&gt;
&lt;br /&gt;
== Docker and GPUs ==&lt;br /&gt;
&lt;br /&gt;
If you are using GPUs with Docker, you need to make sure that your Docker container requests access to the ''correct'' GPUs: the ones which Slurm assigned to your job. These will be passed in the &amp;lt;code&amp;gt;SLURM_STEP_GPUS&amp;lt;/code&amp;gt; (for GPUs for a single step) or &amp;lt;code&amp;gt;SLURM_JOB_GPUS&amp;lt;/code&amp;gt; (for GPUs for a whole job) environment variables. They need to be passed to Docker like this:&lt;br /&gt;
&lt;br /&gt;
 docker run --gpus=&amp;quot;\&amp;quot;device=${SLURM_STEP_GPUS:-$SLURM_JOB_GPUS}\&amp;quot;&amp;quot; nvidia/cuda nvidia-smi&lt;br /&gt;
&lt;br /&gt;
'''Note the escaped quotes'''; the Docker command needs to have double-quotes ''inside'' the argument value. The &amp;lt;code&amp;gt;${:-}&amp;lt;/code&amp;gt; syntax will use &amp;lt;code&amp;gt;SLURM_STEP_GPUS&amp;lt;/code&amp;gt; if it is set and &amp;lt;code&amp;gt;SLURM_JOB_GPUS&amp;lt;/code&amp;gt; if it isn't; if you know which will be set for your job, you can use just that one.&lt;br /&gt;
&lt;br /&gt;
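The &lt;code&gt;${VAR:-fallback}&lt;/code&gt; expansion can be tried in any Bash shell, outside of Slurm entirely (the GPU indices here are invented):&lt;br /&gt;
&lt;br /&gt;

```shell
# Pretend Slurm assigned GPUs 0 and 1 to the whole job, with no step-level list.
export SLURM_JOB_GPUS=0,1
unset SLURM_STEP_GPUS
echo ${SLURM_STEP_GPUS:-$SLURM_JOB_GPUS}   # prints 0,1

# A step-level assignment takes precedence when it exists.
export SLURM_STEP_GPUS=2
echo ${SLURM_STEP_GPUS:-$SLURM_JOB_GPUS}   # prints 2
```

&lt;br /&gt;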
If you are using Nextflow, you will need to set &amp;lt;code&amp;gt;docker.runOptions&amp;lt;/code&amp;gt; to include this flag.&lt;br /&gt;
&lt;br /&gt;
 docker.runOptions=&amp;quot;--gpus \\\&amp;quot;device=$SLURM_JOB_GPUS\\\&amp;quot;&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If you are using Toil to run CWL or WDL, the correct GPUs will be passed to containers automatically.&lt;br /&gt;
&lt;br /&gt;
== Cleaning Scripts ==&lt;br /&gt;
&lt;br /&gt;
We also have auto-cleaning scripts running that will delete any containers and images that were created/pulled more than 7 days ago. This includes the cluster nodes and also the Phoenix head node itself. If you need a place to have your images/containers remain longer than that, please put them on mustard, emerald, crimson or razzmatazz.&lt;br /&gt;
&lt;br /&gt;
Also, there are cleaning scripts in place that will destroy any running containers that have been running for over 7 days.  We assume that such a container was not launched with '''--rm''' and needs to be cleaned up.&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Slurm_Tips_for_vg&amp;diff=520</id>
		<title>Slurm Tips for vg</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Slurm_Tips_for_vg&amp;diff=520"/>
		<updated>2024-10-11T17:05:25Z</updated>

		<summary type="html">&lt;p&gt;Anovak: /* Misc Tips */ Discourage interactive shells&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page explains how to set up a development environment for [https://github.com/vgteam/vg vg] on the Phoenix cluster.&lt;br /&gt;
&lt;br /&gt;
==Setting Up==&lt;br /&gt;
&lt;br /&gt;
1. After connecting to the VPN, connect to the cluster head node:&lt;br /&gt;
&lt;br /&gt;
 ssh phoenix.prism&lt;br /&gt;
&lt;br /&gt;
This node is relatively small, so you shouldn't run real work on it, but it is the place you need to be to submit Slurm jobs.&lt;br /&gt;
&lt;br /&gt;
2. Make yourself a user directory under '''/private/groups''', which is where large data must be stored. For example, if you are in the Paten lab:&lt;br /&gt;
&lt;br /&gt;
 mkdir /private/groups/patenlab/$USER&lt;br /&gt;
&lt;br /&gt;
3. (Optional) Link it into your home directory, so you can easily keep your repos on that storage. The '''/private/groups''' storage may be faster than the home directory storage.&lt;br /&gt;
&lt;br /&gt;
 mkdir -p /private/groups/patenlab/$USER/workspace&lt;br /&gt;
 ln -s /private/groups/patenlab/$USER/workspace ~/workspace&lt;br /&gt;
&lt;br /&gt;
4. Make sure you have SSH keys created and added to GitHub.&lt;br /&gt;
&lt;br /&gt;
 cat ~/.ssh/id_ed25519.pub || (ssh-keygen -t ed25519 &amp;amp;&amp;amp; cat ~/.ssh/id_ed25519.pub)&lt;br /&gt;
 # Paste into https://github.com/settings/ssh/new&lt;br /&gt;
&lt;br /&gt;
5. Make a place to put your clone, and clone vg:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p ~/workspace&lt;br /&gt;
 cd ~/workspace&lt;br /&gt;
 git clone --recursive git@github.com:vgteam/vg.git&lt;br /&gt;
 cd vg&lt;br /&gt;
&lt;br /&gt;
6. vg's dependencies should already be installed on the cluster nodes. If any of them seem to be missing, tell cluster-admin@soe.ucsc.edu to install them.&lt;br /&gt;
&lt;br /&gt;
7. Build vg as a Slurm job. This will send the build out to the cluster as a 64-core, 80G memory job, and keep the output logs in your terminal.&lt;br /&gt;
&lt;br /&gt;
 srun -c 64 --mem=80G --time=00:30:00 make -j64&lt;br /&gt;
&lt;br /&gt;
This will leave your vg binary at '''~/workspace/vg/bin/vg'''.&lt;br /&gt;
&lt;br /&gt;
==Misc Tips==&lt;br /&gt;
&lt;br /&gt;
* For a lightweight job that outputs to your terminal or that can be waited for in a Bash script, run an individual command directly from &amp;lt;code&amp;gt;srun&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 srun -c1 --mem 2G --partition short --time 1:00:00 sleep 10&lt;br /&gt;
&lt;br /&gt;
* If you need to run a few commands in the same shell, use &amp;lt;code&amp;gt;sbatch --wrap&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 sbatch -c1 --mem 2G --partition short --time 1:00:00 --wrap &amp;quot;. venv/bin/activate; ./script1.py &amp;amp;&amp;amp; ./script2.py&amp;quot;&lt;br /&gt;
&lt;br /&gt;
* To watch a batch job's output live, look at the &amp;lt;code&amp;gt;Submitted batch job 5244464&amp;lt;/code&amp;gt; line from &amp;lt;code&amp;gt;sbatch&amp;lt;/code&amp;gt; and run:&lt;br /&gt;
&lt;br /&gt;
 tail -f slurm-5244464.out&lt;br /&gt;
&lt;br /&gt;
* '''Danger!''' If you ''really'' need an interactive session with appreciable resources, you can schedule one with &amp;lt;code&amp;gt;srun --pty&amp;lt;/code&amp;gt;. But it is '''very easy''' to waste resources like this, since the job will happily sit there not doing anything until it hits the timeout. Only do this for testing! For real work, use one of the other methods!&lt;br /&gt;
&lt;br /&gt;
 srun -c 16 --mem 120G --time=08:00:00 --partition=medium --pty bash -i&lt;br /&gt;
&lt;br /&gt;
* To send out a job without making a script file for it, use '''sbatch --wrap &amp;quot;your command here&amp;quot;'''.&lt;br /&gt;
&lt;br /&gt;
* Any option that a batch script sets on '''#SBATCH''' lines can also be passed directly on the &lt;code&gt;sbatch&lt;/code&gt; command line!&lt;br /&gt;
&lt;br /&gt;
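For example, options set on '''#SBATCH''' lines in a batch script (&lt;code&gt;job.sh&lt;/code&gt; and &lt;code&gt;./your_command&lt;/code&gt; are made-up names):&lt;br /&gt;
&lt;br /&gt;
 #!/bin/bash&lt;br /&gt;
 #SBATCH -c1&lt;br /&gt;
 #SBATCH --mem 2G&lt;br /&gt;
 #SBATCH --partition short&lt;br /&gt;
 ./your_command&lt;br /&gt;
&lt;br /&gt;
can equivalently be given on the command line as &lt;code&gt;sbatch -c1 --mem 2G --partition short job.sh&lt;/code&gt;.&lt;br /&gt;
&lt;br /&gt;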
* You can use [https://github.com/CLIP-HPC/SlurmCommander#readme Slurm Commander] to watch the state of the cluster with the '''scom''' command.&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=509</id>
		<title>Phoenix WDL Tutorial</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=509"/>
		<updated>2024-07-16T20:40:10Z</updated>

		<summary type="html">&lt;p&gt;Anovak: /* Configuring Toil for Phoenix */ Provide Julian's preferred storage paths, note Ceph bug and default home directory image storage.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Tutorial: Getting Started with WDL Workflows on Phoenix'''&lt;br /&gt;
&lt;br /&gt;
Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like the Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you.&lt;br /&gt;
&lt;br /&gt;
Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.&lt;br /&gt;
&lt;br /&gt;
This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.&lt;br /&gt;
&lt;br /&gt;
=Phoenix Cluster Setup=&lt;br /&gt;
&lt;br /&gt;
Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.&lt;br /&gt;
&lt;br /&gt;
==Getting VPN access==&lt;br /&gt;
&lt;br /&gt;
We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.&lt;br /&gt;
&lt;br /&gt;
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.&lt;br /&gt;
&lt;br /&gt;
==Connecting to Phoenix==&lt;br /&gt;
&lt;br /&gt;
Once you have VPN access, you can connect to any of the machines with access to the Phoenix cluster. These interactive nodes are fairly large machines that can do some work locally, but you will still want to run larger workflows on the actual cluster. For this tutorial, we will use &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt; as our login node.&lt;br /&gt;
&lt;br /&gt;
To connect to the cluster:&lt;br /&gt;
&lt;br /&gt;
1. Connect to the VPN.&lt;br /&gt;
2. SSH to &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt;. At the command line, run:&lt;br /&gt;
&lt;br /&gt;
 ssh emerald.prism&lt;br /&gt;
&lt;br /&gt;
If your username on the cluster (say, &amp;lt;code&amp;gt;flastname&amp;lt;/code&amp;gt;) is different than your username on your computer (which might be &amp;lt;code&amp;gt;firstname&amp;lt;/code&amp;gt;), you might instead have to run:&lt;br /&gt;
&lt;br /&gt;
 ssh flastname@emerald.prism&lt;br /&gt;
&lt;br /&gt;
The first time you connect, you will see a message like:&lt;br /&gt;
&lt;br /&gt;
 The authenticity of host 'emerald.prism (10.50.1.67)' can't be established.&lt;br /&gt;
 ED25519 key fingerprint is SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI.&lt;br /&gt;
 This key is not known by any other names.&lt;br /&gt;
 Are you sure you want to continue connecting (yes/no/[fingerprint])?&lt;br /&gt;
&lt;br /&gt;
This is your computer asking you to help it decide if it is talking to the genuine &lt;code&gt;emerald.prism&lt;/code&gt;, and not an imposter. You will want to make sure that the &amp;quot;key fingerprint&amp;quot; is indeed &lt;code&gt;SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI&lt;/code&gt;. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type &lt;code&gt;yes&lt;/code&gt; to accept and remember that the server is who it says it is.&lt;br /&gt;
&lt;br /&gt;
==Installing Toil with WDL support==&lt;br /&gt;
&lt;br /&gt;
Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
If you also want to use AWS S3 &amp;lt;code&amp;gt;s3://&amp;lt;/code&amp;gt; and/or Google &amp;lt;code&amp;gt;gs://&amp;lt;/code&amp;gt; URLs for data, you will need to also install Toil with the &amp;lt;code&amp;gt;aws&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;google&amp;lt;/code&amp;gt; extras, respectively:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
This will install Toil in the &amp;lt;code&amp;gt;.local&amp;lt;/code&amp;gt; directory inside your home directory, which we write as &amp;lt;code&amp;gt;~/.local&amp;lt;/code&amp;gt;. The program to run WDL workflows, &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, will be at &amp;lt;code&amp;gt;~/.local/bin/toil-wdl-runner&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
By default, the command interpreter '''will not''' look there, so if you type &lt;code&gt;toil-wdl-runner&lt;/code&gt;, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run:&lt;br /&gt;
&lt;br /&gt;
 echo 'export PATH=&amp;quot;${HOME}/.local/bin:${PATH}&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in''', to restart bash and pick up the change.&lt;br /&gt;
&lt;br /&gt;
To make sure it worked, you can run:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --help&lt;br /&gt;
&lt;br /&gt;
If everything worked correctly, it will print a long list of the various option flags that the &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; command supports.&lt;br /&gt;
&lt;br /&gt;
If you ever want to upgrade Toil to a new release, you can repeat the &amp;lt;code&amp;gt;pip&amp;lt;/code&amp;gt; command above.&lt;br /&gt;
&lt;br /&gt;
==Configuring your Phoenix Environment==&lt;br /&gt;
&lt;br /&gt;
'''Do not try to store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under &lt;code&gt;/private/groups&lt;/code&gt;. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under &lt;code&gt;/private/groups&lt;/code&gt;. Usually you would end up with &lt;code&gt;/private/groups/YOURGROUPNAME/YOURUSERNAME&lt;/code&gt;.&lt;br /&gt;
&lt;br /&gt;
Remember this path; we will need it later.&lt;br /&gt;
&lt;br /&gt;
==Configuring Toil for Phoenix==&lt;br /&gt;
&lt;br /&gt;
Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. On the Phoenix cluster, however, we do have a shared filesystem, so we should configure Toil to use it for caching the Docker container images used for running workflow steps. Since these files can be large, and the home directory quota is only 30 GB, we might not be able to keep them in your home directory.&lt;br /&gt;
&lt;br /&gt;
We would like to be able to store these on the cluster's large storage array, under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. However, Toil needs to use file locks in these directories to prevent simultaneous Singularity calls from producing internal Singularity errors, and Ceph currently has [https://tracker.ceph.com/issues/65607 a bug where these file locking operations can freeze the Ceph servers].&lt;br /&gt;
&lt;br /&gt;
If you have '''a small number of container images''' that will fit in your home directory, you can keep them there. [https://github.com/DataBiosphere/toil/commit/cb0b291bb7f6212bfe69221dd9f09d72f83e92fb Since Toil 6.1.0], this is the default behavior and you don't need to do anything. (Unless you previously set &amp;lt;code&amp;gt;SINGULARITY_CACHEDIR&amp;lt;/code&amp;gt; or &amp;lt;code&amp;gt;MINIWDL__SINGULARITY__IMAGE_CACHE&amp;lt;/code&amp;gt;, in which case you need to unset them.)&lt;br /&gt;
&lt;br /&gt;
'''If you don't have room in your home directory''' for container images, currently the recommended approach is to use node-local storage under &amp;lt;code&amp;gt;/data/tmp&amp;lt;/code&amp;gt;. This results in each node pulling each container image, but images will be saved across workflows.&lt;br /&gt;
&lt;br /&gt;
You can set that up for all your workflows with:&lt;br /&gt;
&lt;br /&gt;
 echo 'export SINGULARITY_CACHEDIR=&amp;quot;/data/tmp/$(whoami)/cache/singularity&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE=&amp;quot;/data/tmp/$(whoami)/cache/miniwdl&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
Then '''log out and log back in again''', to apply the changes.&lt;br /&gt;
&lt;br /&gt;
=Running an existing workflow=&lt;br /&gt;
&lt;br /&gt;
First, let's use &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].&lt;br /&gt;
&lt;br /&gt;
First, go to your user directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;, and make a directory to work in.&lt;br /&gt;
&lt;br /&gt;
 cd /private/groups/YOURGROUPNAME/YOURUSERNAME&lt;br /&gt;
 mkdir workflow-test&lt;br /&gt;
 cd workflow-test&lt;br /&gt;
&lt;br /&gt;
Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.&lt;br /&gt;
&lt;br /&gt;
 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl&lt;br /&gt;
&lt;br /&gt;
==Preparing an input file==&lt;br /&gt;
&lt;br /&gt;
Near the top of the WDL file, there's a section like this:&lt;br /&gt;
&lt;br /&gt;
 workflow hello_caller {&lt;br /&gt;
     input {&lt;br /&gt;
         File who&lt;br /&gt;
     }&lt;br /&gt;
&lt;br /&gt;
This means that there is a workflow named &amp;lt;code&amp;gt;hello_caller&amp;lt;/code&amp;gt; in this file, and it takes as input a file variable named &amp;lt;code&amp;gt;who&amp;lt;/code&amp;gt;. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.&lt;br /&gt;
&lt;br /&gt;
So first, we have to make that list of names. Let's make it in &lt;code&gt;names.txt&lt;/code&gt;:&lt;br /&gt;
&lt;br /&gt;
 echo &amp;quot;Mridula Resurrección&amp;quot; &amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Gershom Šarlota&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Ritchie Ravi&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
&lt;br /&gt;
Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt; that references it by relative path, like this:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./names.txt&amp;quot;}' &amp;gt;inputs.json&lt;br /&gt;
&lt;br /&gt;
Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].&lt;br /&gt;
&lt;br /&gt;
==Testing at small scale on a single machine==&lt;br /&gt;
&lt;br /&gt;
We are now ready to run the workflow!&lt;br /&gt;
&lt;br /&gt;
You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:&lt;br /&gt;
&lt;br /&gt;
 srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i&lt;br /&gt;
&lt;br /&gt;
This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In your new shell, run this Toil command:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner self_test.wdl inputs.json -o local_run&lt;br /&gt;
&lt;br /&gt;
This will, by default, use the &amp;lt;code&amp;gt;single_machine&amp;lt;/code&amp;gt; Toil &amp;quot;batch system&amp;quot; to run all of the workflow's tasks locally. Output will be sent to a new directory named &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
This will print a lot of logging to standard error, and to standard output it will print:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;hello_caller.message_files&amp;quot;: [&amp;quot;local_run/Mridula Resurrecci\u00f3n.txt&amp;quot;, &amp;quot;local_run/Gershom \u0160arlota.txt&amp;quot;, &amp;quot;local_run/Ritchie Ravi.txt&amp;quot;], &amp;quot;hello_caller.messages&amp;quot;: [&amp;quot;Hello, Mridula Resurrecci\u00f3n!&amp;quot;, &amp;quot;Hello, Gershom \u0160arlota!&amp;quot;, &amp;quot;Hello, Ritchie Ravi!&amp;quot;]}&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt; directory will contain the described text files (with Unicode escape sequences like &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt; replaced by their corresponding characters), each containing a greeting for the corresponding person.&lt;br /&gt;
&lt;br /&gt;
To leave your interactive Slurm session and return to the head node, use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Running at larger scale==&lt;br /&gt;
&lt;br /&gt;
Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people!&lt;br /&gt;
&lt;br /&gt;
Go get this handy list of people and cut it to length:&lt;br /&gt;
&lt;br /&gt;
 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt&lt;br /&gt;
 head -n100 1000_names.txt &amp;gt;100_names.txt&lt;br /&gt;
&lt;br /&gt;
And make a new inputs file:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./100_names.txt&amp;quot;}' &amp;gt;inputs_big.json&lt;br /&gt;
&lt;br /&gt;
Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.&lt;br /&gt;
&lt;br /&gt;
To run against the Slurm cluster, we need to use the &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the &amp;lt;code&amp;gt;--caching false&amp;lt;/code&amp;gt; option. We will add the &amp;lt;code&amp;gt;--batchLogsDir&amp;lt;/code&amp;gt; option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the &amp;lt;code&amp;gt;-m&amp;lt;/code&amp;gt; option to save the output JSON to a file instead of printing it.&lt;br /&gt;
&lt;br /&gt;
Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the &amp;lt;code&amp;gt;TOIL_SLURM_ARGS&amp;lt;/code&amp;gt; environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in.&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json&lt;br /&gt;
&lt;br /&gt;
This will tick for a while, but eventually you should end up with 100 greeting files in the &amp;lt;code&amp;gt;slurm_run&amp;lt;/code&amp;gt; directory.&lt;br /&gt;
&lt;br /&gt;
=Writing your own workflow=&lt;br /&gt;
&lt;br /&gt;
In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. &lt;br /&gt;
&lt;br /&gt;
==Writing the file==&lt;br /&gt;
&lt;br /&gt;
===Version===&lt;br /&gt;
&lt;br /&gt;
All WDL files need to start with a &amp;lt;code&amp;gt;version&amp;lt;/code&amp;gt; statement (unless they are very old &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; files). Toil supports &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt;, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; and 1.0.&lt;br /&gt;
&lt;br /&gt;
So let's start a new WDL 1.0 workflow. Open up a file named &amp;lt;code&amp;gt;fizzbuzz.wdl&amp;lt;/code&amp;gt; and start with a version statement:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
&lt;br /&gt;
===Workflow Block===&lt;br /&gt;
&lt;br /&gt;
Then, add an empty &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; named &amp;lt;code&amp;gt;FizzBuzz&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Input Block===&lt;br /&gt;
&lt;br /&gt;
Workflows usually need some kind of user input, so let's give our workflow an &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; section.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Notice that each input has a type, a name, and an optional default value. If the type ends in &amp;lt;code&amp;gt;?&amp;lt;/code&amp;gt;, the value is optional, and it may be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.&lt;br /&gt;
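&lt;br /&gt;
For example, an inputs file for this workflow only ''must'' provide &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;; the other inputs fall back to their defaults (the value here is just an illustration):&lt;br /&gt;
&lt;br /&gt;
 {&lt;br /&gt;
     &amp;quot;FizzBuzz.item_count&amp;quot;: 15&lt;br /&gt;
 }&lt;br /&gt;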
&lt;br /&gt;
===Body===&lt;br /&gt;
&lt;br /&gt;
Now we'll start on the body of the workflow, to be inserted just after the inputs section. &lt;br /&gt;
&lt;br /&gt;
The first thing we're going to need to do is create an array of all the numbers up to the &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;. We can do this by calling the WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; function, and assigning the result to an &amp;lt;code&amp;gt;Array[Int]&amp;lt;/code&amp;gt; variable.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
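&lt;br /&gt;
Note that &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; counts from zero, so:&lt;br /&gt;
&lt;br /&gt;
 Array[Int] three_numbers = range(3)  # [0, 1, 2]&lt;br /&gt;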
&lt;br /&gt;
WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.&lt;br /&gt;
&lt;br /&gt;
===Scattering===&lt;br /&gt;
&lt;br /&gt;
Once we create an array of all the numbers, we can use a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt; to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a &amp;lt;code&amp;gt;map()&amp;lt;/code&amp;gt; in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; starts at 0.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
     Int one_based = i + 1&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Conditionals===&lt;br /&gt;
&lt;br /&gt;
Inside the body of the scatter, we are going to put some conditionals to determine whether we should produce &amp;lt;code&amp;gt;&amp;quot;Fizz&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;Buzz&amp;quot;&amp;lt;/code&amp;gt;, or &amp;lt;code&amp;gt;&amp;quot;FizzBuzz&amp;quot;&amp;lt;/code&amp;gt;. To support our &amp;lt;code&amp;gt;fizzbuzz_override&amp;lt;/code&amp;gt;, we put it in an array along with a default value, and use the WDL &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; function to find the first non-null value in that array.&lt;br /&gt;
&lt;br /&gt;
Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; at the end and take advantage of variables from un-executed conditionals being &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Note that WDL supports conditional ''expressions'' with a &amp;lt;code&amp;gt;then&amp;lt;/code&amp;gt; and an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt;, but conditional ''statements'' only have a body, not an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt; branch. If you need an else you will have to check the negated condition.&lt;br /&gt;
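&lt;br /&gt;
For example, a conditional ''expression'' (a hypothetical snippet, not part of our workflow) looks like this:&lt;br /&gt;
&lt;br /&gt;
 String parity = if (i % 2 == 0) then &amp;quot;even&amp;quot; else &amp;quot;odd&amp;quot;&lt;br /&gt;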
&lt;br /&gt;
So first, let's handle the special cases.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Calling Tasks===&lt;br /&gt;
&lt;br /&gt;
Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a &amp;lt;code&amp;gt;${}&amp;lt;/code&amp;gt; substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a &amp;lt;code&amp;gt;stringify_number&amp;lt;/code&amp;gt; task, to be written later.&lt;br /&gt;
&lt;br /&gt;
To call a task (or another workflow), we use a &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt; statement and give it some inputs. Then we can fish the output values out of the task with &amp;lt;code&amp;gt;.&amp;lt;/code&amp;gt; access; those outputs are only non-null for the numbers where the call actually ran, instead of us producing a Fizz, Buzz, or FizzBuzz.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
        call stringify_number {&lt;br /&gt;
            input:&lt;br /&gt;
                the_number = one_based&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We can put the code into the workflow now, and set about writing the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         &lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Writing Tasks===&lt;br /&gt;
&lt;br /&gt;
Our task should go after the workflow in the file. It looks a lot like a workflow except it uses &amp;lt;code&amp;gt;task&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're going to want it to take in an integer &amp;lt;code&amp;gt;the_number&amp;lt;/code&amp;gt;, and we're going to want it to output a string &amp;lt;code&amp;gt;the_string&amp;lt;/code&amp;gt;. So let's fill that in in &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; sections.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     # ???&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now, unlike workflows, tasks can have a &amp;lt;code&amp;gt;command&amp;lt;/code&amp;gt; section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use &amp;lt;code&amp;gt;~{}&amp;lt;/code&amp;gt;, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now we need to capture the result of the command script. The WDL &amp;lt;code&amp;gt;stdout()&amp;lt;/code&amp;gt; returns a WDL &amp;lt;code&amp;gt;File&amp;lt;/code&amp;gt; containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're also going to want to add a &amp;lt;code&amp;gt;runtime&amp;lt;/code&amp;gt; section to our task, to specify resource requirements. We'll also tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;disks&amp;lt;/code&amp;gt; entry is a little weird: it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a &amp;lt;code&amp;gt;local-disk&amp;lt;/code&amp;gt; of a certain number of gigabytes, optionally hinting that it should be &amp;lt;code&amp;gt;SSD&amp;lt;/code&amp;gt; storage.&lt;br /&gt;
&lt;br /&gt;
Then we can put our task into our WDL file:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Output Block===&lt;br /&gt;
&lt;br /&gt;
Now the only thing missing is a workflow-level &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an &amp;lt;code&amp;gt;Array[String]&amp;lt;/code&amp;gt;. We'll add the &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section at the end of the &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; section, above the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
     output {&lt;br /&gt;
          Array[String] fizzbuzz_results = result&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Because the &amp;lt;code&amp;gt;result&amp;lt;/code&amp;gt; variable is defined inside a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt;, when we reference it outside the scatter we see it as being an array.&lt;br /&gt;
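&lt;br /&gt;
Sketched out, the types work like this (the &amp;lt;code&amp;gt;...&amp;lt;/code&amp;gt; stands in for the real expression):&lt;br /&gt;
&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
     String result = ...  # inside the scatter: a String&lt;br /&gt;
 }&lt;br /&gt;
 # outside the scatter: result is an Array[String]&lt;br /&gt;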
&lt;br /&gt;
==Running the Workflow==&lt;br /&gt;
&lt;br /&gt;
Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;FizzBuzz.item_count&amp;quot;: 20}' &amp;gt;fizzbuzz.json&lt;br /&gt;
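&lt;br /&gt;
If you also want to override the defaults or set the optional input, add them to the same JSON object; for example:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;FizzBuzz.item_count&amp;quot;: 20, &amp;quot;FizzBuzz.fizzbuzz_override&amp;quot;: &amp;quot;FB&amp;quot;}' &amp;gt;fizzbuzz.json&lt;br /&gt;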
&lt;br /&gt;
Then run it on the cluster with Toil:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
Or locally:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
=Debugging Workflows=&lt;br /&gt;
Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.&lt;br /&gt;
&lt;br /&gt;
==Restarting the Workflow==&lt;br /&gt;
&lt;br /&gt;
If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have fixed, and you ran the workflow with &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; set manually to a directory that persists between attempts, you can add &amp;lt;code&amp;gt;--restart&amp;lt;/code&amp;gt; to your workflow command and make Toil try again. It will pick up from where it left off and rerun any failed tasks and then the rest of the workflow.&lt;br /&gt;
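&lt;br /&gt;
For example, to restart the FizzBuzz run from above (assuming its &amp;lt;code&amp;gt;./fizzbuzz_store&amp;lt;/code&amp;gt; job store still exists):&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json --restart&lt;br /&gt;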
&lt;br /&gt;
This ''will not'' pick up any changes to your WDL source code files; those are read once at the beginning and not re-read on restart.&lt;br /&gt;
&lt;br /&gt;
If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques.&lt;br /&gt;
&lt;br /&gt;
==Debugging Options==&lt;br /&gt;
When debugging a workflow, make sure to run the workflow with &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt;, to set the log level to &amp;lt;code&amp;gt;DEBUG&amp;lt;/code&amp;gt;, and with &amp;lt;code&amp;gt;--jobStore /some/path/to/a/shared/directory/it/can/create&amp;lt;/code&amp;gt; so that the stored files shipped between jobs are in a place you can access them.&lt;br /&gt;
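&lt;br /&gt;
For example (adapting the FizzBuzz run from earlier):&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --logDebug --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;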
&lt;br /&gt;
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:&lt;br /&gt;
&lt;br /&gt;
 =========&amp;gt;&lt;br /&gt;
         Toil job log is here&lt;br /&gt;
 &amp;lt;=========&lt;br /&gt;
&lt;br /&gt;
Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this.&lt;br /&gt;
&lt;br /&gt;
==Reading the Log==&lt;br /&gt;
&lt;br /&gt;
When a WDL workflow fails, you are likely to see a message like this:&lt;br /&gt;
&lt;br /&gt;
 WDL.runtime.error.CommandFailed: task command failed with exit status 1&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism&lt;br /&gt;
&lt;br /&gt;
This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code. That happens either when the command is written wrong, or when the error detection code in the tool you are trying to run detects and reports a problem.&lt;br /&gt;
&lt;br /&gt;
Go up higher in the log until you find lines that look like:&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows:&lt;br /&gt;
&lt;br /&gt;
And&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows:&lt;br /&gt;
&lt;br /&gt;
These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.&lt;br /&gt;
&lt;br /&gt;
If you would like individual task logs to be saved separately for later reference, you can use the &amp;lt;code&amp;gt;--writeLogs&amp;lt;/code&amp;gt; option to specify a directory to store them. For more information, see [https://toil.readthedocs.io/en/latest/wdl/running.html#managing-workflow-logs the Toil documentation of workflow task logs].&lt;br /&gt;
&lt;br /&gt;
==Reproducing Problems==&lt;br /&gt;
&lt;br /&gt;
When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this.&lt;br /&gt;
&lt;br /&gt;
=== Automatically Fetching Input Files ===&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;toil debug-job&amp;lt;/code&amp;gt; command has a &amp;lt;code&amp;gt;--retrieveTaskDirectory&amp;lt;/code&amp;gt; option that lets you dump out a directory with all the files that a failing WDL task would use. You can use it like:&lt;br /&gt;
&lt;br /&gt;
 toil debug-job ./jobstore WDLTaskJob --retrieveTaskDirectory dumpdir&lt;br /&gt;
&lt;br /&gt;
If there are multiple failing tasks, you might need to replace &amp;lt;code&amp;gt;WDLTaskJob&amp;lt;/code&amp;gt; with the name of one of the failing jobs. See [https://toil.readthedocs.io/en/latest/running/debugging.html#fetching-job-inputs the Toil documentation on retrieving files] for more on how to use this command.&lt;br /&gt;
&lt;br /&gt;
=== Manually Finding Input Files ===&lt;br /&gt;
&lt;br /&gt;
If you can't use &amp;lt;code&amp;gt;toil debug-job&amp;lt;/code&amp;gt;, you might need to manually dig through the job store for files. In the log of your failing Toil task, look for lines like this:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&amp;lt;/code&amp;gt; part is a Toil file ID, and it is a relative path from your &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; value to where the file is stored on disk. So if you ran the workflow with &amp;lt;code&amp;gt;--jobStore /private/groups/patenlab/anovak/jobstore&amp;lt;/code&amp;gt;, you would look for this file at:&lt;br /&gt;
&lt;br /&gt;
 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&lt;br /&gt;
&lt;br /&gt;
==More Ways of Finding Files==&lt;br /&gt;
&lt;br /&gt;
Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the &amp;lt;code&amp;gt;find&amp;lt;/code&amp;gt; command to try and find the files by name. For example, if you want to look at &amp;lt;code&amp;gt;Sample.bam&amp;lt;/code&amp;gt;, you can look for it like this:&lt;br /&gt;
&lt;br /&gt;
 find /path/to/the/jobstore -name &amp;quot;Sample.bam&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
You can take the &amp;lt;code&amp;gt;toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt; URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this:&lt;br /&gt;
&lt;br /&gt;
 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
Then you can take the part after the last colon, &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt;, and that is the path relative to the job store where this file can be found.&lt;br /&gt;
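&lt;br /&gt;
If you would rather not paste the URI into a web site, you can URL-decode it locally with Python's standard library (replace the placeholder with the URI from your log):&lt;br /&gt;
&lt;br /&gt;
 python3 -c 'import sys, urllib.parse; print(urllib.parse.unquote(sys.argv[1]))' 'paste-the-toilfile-URI-here'&lt;br /&gt;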
&lt;br /&gt;
==Using Development Versions of Toil==&lt;br /&gt;
&lt;br /&gt;
Sometimes, bugs will be fixed in the development version of Toil, but not released yet. To try the current development version of Toil, you can install it like this:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
If you want to use a particular branch or commit, like &amp;lt;code&amp;gt;aaa451b320fc115b3563ced25cb501301cf86f90&amp;lt;/code&amp;gt;, you can do:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
==Frequently Asked Questions==&lt;br /&gt;
&lt;br /&gt;
===I am getting warnings about &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt;===&lt;br /&gt;
You may be seeing warnings that &amp;lt;code&amp;gt;XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should upgrade Toil. [https://github.com/DataBiosphere/toil/commit/ff6bf60ab798a675c20156c749817c4313644b96 Since Toil 6.1.0], Toil no longer issues this warning, and just puts up with bad &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt; settings.&lt;br /&gt;
&lt;br /&gt;
===Toil said it was &amp;lt;code&amp;gt;Redirecting logging&amp;lt;/code&amp;gt; somewhere, but I can't find that file!===&lt;br /&gt;
The Toil worker process for each job will say that it is &amp;lt;code&amp;gt;Redirecting logging to /data/tmp/somewhere/worker_log.txt&amp;lt;/code&amp;gt;, and when running in single machine mode these messages go to the main Toil log.&lt;br /&gt;
&lt;br /&gt;
The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt; option to Toil.&lt;br /&gt;
&lt;br /&gt;
If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.&lt;br /&gt;
&lt;br /&gt;
=Additional WDL resources=&lt;br /&gt;
&lt;br /&gt;
For more information on writing and running WDL workflows, see:&lt;br /&gt;
&lt;br /&gt;
* [https://docs.openwdl.org/en/stable/ The WDL documentation]&lt;br /&gt;
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The &amp;quot;Learn WDL&amp;quot; video course on YouTube]&lt;br /&gt;
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=508</id>
		<title>Phoenix WDL Tutorial</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=508"/>
		<updated>2024-07-16T20:24:49Z</updated>

		<summary type="html">&lt;p&gt;Anovak: /* Frequently Asked Questions */ Note there's no more XDG_RUNTIME_DIR warning.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Tutorial: Getting Started with WDL Workflows on Phoenix'''&lt;br /&gt;
&lt;br /&gt;
Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like the Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you.&lt;br /&gt;
&lt;br /&gt;
Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.&lt;br /&gt;
&lt;br /&gt;
This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.&lt;br /&gt;
&lt;br /&gt;
=Phoenix Cluster Setup=&lt;br /&gt;
&lt;br /&gt;
Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.&lt;br /&gt;
&lt;br /&gt;
==Getting VPN access==&lt;br /&gt;
&lt;br /&gt;
We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.&lt;br /&gt;
&lt;br /&gt;
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.&lt;br /&gt;
&lt;br /&gt;
==Connecting to Phoenix==&lt;br /&gt;
&lt;br /&gt;
Once you have VPN access, you can connect to any of the machines with access to the Phoenix cluster. These interactive nodes are fairly large machines that can do some work locally, but you will still want to run larger workflows on the actual cluster. For this tutorial, we will use &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt; as our login node.&lt;br /&gt;
&lt;br /&gt;
To connect to the cluster:&lt;br /&gt;
&lt;br /&gt;
1. Connect to the VPN.&lt;br /&gt;
2. SSH to &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt;. At the command line, run:&lt;br /&gt;
&lt;br /&gt;
 ssh emerald.prism&lt;br /&gt;
&lt;br /&gt;
If your username on the cluster (say, &amp;lt;code&amp;gt;flastname&amp;lt;/code&amp;gt;) is different from your username on your computer (which might be &amp;lt;code&amp;gt;firstname&amp;lt;/code&amp;gt;), you might instead have to run:&lt;br /&gt;
&lt;br /&gt;
 ssh flastname@emerald.prism&lt;br /&gt;
&lt;br /&gt;
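If you don't want to type the cluster username every time, you can record it in your SSH configuration. This is a sketch, assuming your cluster username is &amp;lt;code&amp;gt;flastname&amp;lt;/code&amp;gt;; add it to &amp;lt;code&amp;gt;~/.ssh/config&amp;lt;/code&amp;gt; on your own computer:&lt;br /&gt;
&lt;br /&gt;
 # Always connect to emerald.prism as flastname&lt;br /&gt;
 Host emerald.prism&lt;br /&gt;
     User flastname&lt;br /&gt;
&lt;br /&gt;
After that, a plain &amp;lt;code&amp;gt;ssh emerald.prism&amp;lt;/code&amp;gt; will use the right username.&lt;br /&gt;
&lt;br /&gt;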
The first time you connect, you will see a message like:&lt;br /&gt;
&lt;br /&gt;
 The authenticity of host 'emerald.prism (10.50.1.67)' can't be established.&lt;br /&gt;
 ED25519 key fingerprint is SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI.&lt;br /&gt;
 This key is not known by any other names.&lt;br /&gt;
 Are you sure you want to continue connecting (yes/no/[fingerprint])?&lt;br /&gt;
&lt;br /&gt;
This is your computer asking you to help it decide if it is talking to the genuine &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt;, and not an imposter. You will want to make sure that the &amp;quot;key fingerprint&amp;quot; is indeed &amp;lt;code&amp;gt;SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI&amp;lt;/code&amp;gt;. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type &amp;lt;code&amp;gt;yes&amp;lt;/code&amp;gt; to accept and remember that the server is who it says it is.&lt;br /&gt;
&lt;br /&gt;
==Installing Toil with WDL support==&lt;br /&gt;
&lt;br /&gt;
Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
If you also want to use AWS S3 &amp;lt;code&amp;gt;s3://&amp;lt;/code&amp;gt; and/or Google &amp;lt;code&amp;gt;gs://&amp;lt;/code&amp;gt; URLs for data, you will need to also install Toil with the &amp;lt;code&amp;gt;aws&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;google&amp;lt;/code&amp;gt; extras, respectively:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
This will install Toil in the &amp;lt;code&amp;gt;.local&amp;lt;/code&amp;gt; directory inside your home directory, which we write as &amp;lt;code&amp;gt;~/.local&amp;lt;/code&amp;gt;. The program to run WDL workflows, &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, will be at &amp;lt;code&amp;gt;~/.local/bin/toil-wdl-runner&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
By default, the command interpreter *will not* look there, so if you type &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run:&lt;br /&gt;
&lt;br /&gt;
 echo 'export PATH=&amp;quot;${HOME}/.local/bin:${PATH}&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, **log out and log back in**, to restart bash and pick up the change.&lt;br /&gt;
&lt;br /&gt;
To make sure it worked, you can run:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --help&lt;br /&gt;
&lt;br /&gt;
If everything worked correctly, it will print a long list of the various option flags that the &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; command supports.&lt;br /&gt;
&lt;br /&gt;
If you ever want to upgrade Toil to a new release, you can repeat the &amp;lt;code&amp;gt;pip&amp;lt;/code&amp;gt; command above.&lt;br /&gt;
&lt;br /&gt;
==Configuring your Phoenix Environment==&lt;br /&gt;
&lt;br /&gt;
'''Do not try and store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. Usually you would end up with &amp;lt;code&amp;gt;/private/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Remember this path; we will need it later.&lt;br /&gt;
&lt;br /&gt;
==Configuring Toil for Phoenix==&lt;br /&gt;
&lt;br /&gt;
Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. Since these files can be large, and the home directory quota is only 30 GB, we can't keep them in your home directory. We will need to use the &amp;lt;code&amp;gt;/private/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt; directory you created earlier.&lt;br /&gt;
&lt;br /&gt;
Make that directory available in your &amp;lt;code&amp;gt;~/.bashrc&amp;lt;/code&amp;gt; file by editing this command to use your actual group and username, and then running it:&lt;br /&gt;
&lt;br /&gt;
 echo 'BIG_DATA_DIR=/private/groups/YOURGROUPNAME/YOURUSERNAME' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
Then use these commands to make sure that Toil knows where it ought to put its caches:&lt;br /&gt;
&lt;br /&gt;
 echo 'export SINGULARITY_CACHEDIR=&amp;quot;${BIG_DATA_DIR}/.singularity/cache&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE=&amp;quot;${BIG_DATA_DIR}/.cache/miniwdl&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in again''', to apply the changes.&lt;br /&gt;
&lt;br /&gt;
If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day].&lt;br /&gt;
&lt;br /&gt;
'''If you get errors about mutexes, lock files, or other weird problems with Singularity''', try moving these directories to inside &amp;lt;code&amp;gt;/data/tmp&amp;lt;/code&amp;gt; on the individual nodes, or unsetting them and letting Toil use its defaults (and exhaust our Docker pull limits). [https://github.com/DataBiosphere/toil/issues/4654 It is not clear that &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt; actually implements the necessary file locking correctly.]&lt;br /&gt;
&lt;br /&gt;
=Running an existing workflow=&lt;br /&gt;
&lt;br /&gt;
First, let's use &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].&lt;br /&gt;
&lt;br /&gt;
First, go to your user directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;, and make a directory to work in.&lt;br /&gt;
&lt;br /&gt;
 cd /private/groups/YOURGROUPNAME/YOURUSERNAME&lt;br /&gt;
 mkdir workflow-test&lt;br /&gt;
 cd workflow-test&lt;br /&gt;
&lt;br /&gt;
Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.&lt;br /&gt;
&lt;br /&gt;
 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl&lt;br /&gt;
&lt;br /&gt;
==Preparing an input file==&lt;br /&gt;
&lt;br /&gt;
Near the top of the WDL file, there's a section like this:&lt;br /&gt;
&lt;br /&gt;
 workflow hello_caller {&lt;br /&gt;
     input {&lt;br /&gt;
         File who&lt;br /&gt;
     }&lt;br /&gt;
&lt;br /&gt;
This means that there is a workflow named &amp;lt;code&amp;gt;hello_caller&amp;lt;/code&amp;gt; in this file, and it takes as input a file variable named &amp;lt;code&amp;gt;who&amp;lt;/code&amp;gt;. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.&lt;br /&gt;
&lt;br /&gt;
So first, we have to make that list of names. Let's make it in &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 echo &amp;quot;Mridula Resurrección&amp;quot; &amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Gershom Šarlota&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Ritchie Ravi&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
&lt;br /&gt;
Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt; that references it by relative path, like this:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./names.txt&amp;quot;}' &amp;gt;inputs.json&lt;br /&gt;
&lt;br /&gt;
Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].&lt;br /&gt;
&lt;br /&gt;
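Absolute paths and URLs look like this in an inputs file (with a hypothetical path and URL; you would substitute your own):&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;hello_caller.who&amp;quot;: &amp;quot;/private/groups/YOURGROUPNAME/YOURUSERNAME/workflow-test/names.txt&amp;quot;}&lt;br /&gt;
 {&amp;quot;hello_caller.who&amp;quot;: &amp;quot;https://example.com/names.txt&amp;quot;}&lt;br /&gt;
&lt;br /&gt;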
==Testing at small scale on a single machine==&lt;br /&gt;
&lt;br /&gt;
We are now ready to run the workflow!&lt;br /&gt;
&lt;br /&gt;
You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:&lt;br /&gt;
&lt;br /&gt;
 srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i&lt;br /&gt;
&lt;br /&gt;
This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In your new shell, run this Toil command:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner self_test.wdl inputs.json -o local_run&lt;br /&gt;
&lt;br /&gt;
This will, by default, use the &amp;lt;code&amp;gt;single_machine&amp;lt;/code&amp;gt; Toil &amp;quot;batch system&amp;quot; to run all of the workflow's tasks locally. Output will be sent to a new directory named &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
This will print a lot of logging to standard error, and to standard output it will print:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;hello_caller.message_files&amp;quot;: [&amp;quot;local_run/Mridula Resurrecci\u00f3n.txt&amp;quot;, &amp;quot;local_run/Gershom \u0160arlota.txt&amp;quot;, &amp;quot;local_run/Ritchie Ravi.txt&amp;quot;], &amp;quot;hello_caller.messages&amp;quot;: [&amp;quot;Hello, Mridula Resurrecci\u00f3n!&amp;quot;, &amp;quot;Hello, Gershom \u0160arlota!&amp;quot;, &amp;quot;Hello, Ritchie Ravi!&amp;quot;]}&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt; directory will contain the described text files (with Unicode escape sequences like &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt; replaced by their corresponding characters), each containing a greeting for the corresponding person.&lt;br /&gt;
&lt;br /&gt;
To leave your interactive Slurm session and return to the head node, use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Running at larger scale==&lt;br /&gt;
&lt;br /&gt;
Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people!&lt;br /&gt;
&lt;br /&gt;
Go get this handy list of people and cut it to length:&lt;br /&gt;
&lt;br /&gt;
 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt&lt;br /&gt;
 head -n100 1000_names.txt &amp;gt;100_names.txt&lt;br /&gt;
&lt;br /&gt;
And make a new inputs file:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./100_names.txt&amp;quot;}' &amp;gt;inputs_big.json&lt;br /&gt;
&lt;br /&gt;
Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.&lt;br /&gt;
&lt;br /&gt;
To run against the Slurm cluster, we need to use the &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the &amp;lt;code&amp;gt;--caching false&amp;lt;/code&amp;gt; option. We will add the &amp;lt;code&amp;gt;--batchLogsDir&amp;lt;/code&amp;gt; option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the &amp;lt;code&amp;gt;-m&amp;lt;/code&amp;gt; option to save the output JSON to a file instead of printing it.&lt;br /&gt;
&lt;br /&gt;
Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the &amp;lt;code&amp;gt;TOIL_SLURM_ARGS&amp;lt;/code&amp;gt; environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in.&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json&lt;br /&gt;
&lt;br /&gt;
This will tick for a while, but eventually you should end up with 100 greeting files in the &amp;lt;code&amp;gt;slurm_run&amp;lt;/code&amp;gt; directory.&lt;br /&gt;
&lt;br /&gt;
=Writing your own workflow=&lt;br /&gt;
&lt;br /&gt;
In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. &lt;br /&gt;
&lt;br /&gt;
==Writing the file==&lt;br /&gt;
&lt;br /&gt;
===Version===&lt;br /&gt;
&lt;br /&gt;
All WDL files need to start with a &amp;lt;code&amp;gt;version&amp;lt;/code&amp;gt; statement (unless they are very old &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; files). Toil supports &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt;, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; and 1.0.&lt;br /&gt;
&lt;br /&gt;
So let's start a new WDL 1.0 workflow. Open up a file named &amp;lt;code&amp;gt;fizzbuzz.wdl&amp;lt;/code&amp;gt; and start with a version statement:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
&lt;br /&gt;
===Workflow Block===&lt;br /&gt;
&lt;br /&gt;
Then, add an empty &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; named &amp;lt;code&amp;gt;FizzBuzz&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Input Block===&lt;br /&gt;
&lt;br /&gt;
Workflows usually need some kind of user input, so let's give our workflow an &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; section.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Notice that each input has a type, a name, and an optional default value. If the type ends in &amp;lt;code&amp;gt;?&amp;lt;/code&amp;gt;, the value is optional, and it may be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.&lt;br /&gt;
&lt;br /&gt;
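For example, an inputs file for this workflow only ''has'' to set &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;, but it can also override a default or supply the optional input:&lt;br /&gt;
&lt;br /&gt;
 {&lt;br /&gt;
     &amp;quot;FizzBuzz.item_count&amp;quot;: 15,&lt;br /&gt;
     &amp;quot;FizzBuzz.fizzbuzz_override&amp;quot;: &amp;quot;FizzBuzz!&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;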
===Body===&lt;br /&gt;
&lt;br /&gt;
Now we'll start on the body of the workflow, to be inserted just after the inputs section. &lt;br /&gt;
&lt;br /&gt;
The first thing we're going to need to do is create an array of all the numbers up to the &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;. We can do this by calling the WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; function, and assigning the result to an &amp;lt;code&amp;gt;Array[Int]&amp;lt;/code&amp;gt; variable.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
&lt;br /&gt;
WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.&lt;br /&gt;
&lt;br /&gt;
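For example (illustrative declarations, not part of our workflow):&lt;br /&gt;
&lt;br /&gt;
 Int how_many = length(numbers)            # Count the items in an array&lt;br /&gt;
 Array[Int] flat = flatten([[1, 2], [3]])  # Concatenate nested arrays&lt;br /&gt;
&lt;br /&gt;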
===Scattering===&lt;br /&gt;
&lt;br /&gt;
Once we create an array of all the numbers, we can use a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt; to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a &amp;lt;code&amp;gt;map()&amp;lt;/code&amp;gt; in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; starts at 0.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
     Int one_based = i + 1&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Conditionals===&lt;br /&gt;
&lt;br /&gt;
Inside the body of the scatter, we are going to put some conditionals to determine if we should produce &amp;lt;code&amp;gt;&amp;quot;Fizz&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;Buzz&amp;quot;&amp;lt;/code&amp;gt;, or &amp;lt;code&amp;gt;&amp;quot;FizzBuzz&amp;quot;&amp;lt;/code&amp;gt;. To support our &amp;lt;code&amp;gt;fizzbuzz_override&amp;lt;/code&amp;gt;, we use an array of it and a default value, and use the WDL &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; function to find the first non-null value in that array.&lt;br /&gt;
&lt;br /&gt;
Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; at the end and take advantage of variables from un-executed conditionals being &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
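As a sketch of how this gathering works (not part of our workflow): a &amp;lt;code&amp;gt;String?&amp;lt;/code&amp;gt; declared inside a conditional inside a scatter becomes an &amp;lt;code&amp;gt;Array[String?]&amp;lt;/code&amp;gt; outside the scatter.&lt;br /&gt;
&lt;br /&gt;
 scatter (i in range(4)) {&lt;br /&gt;
     if (i % 2 == 0) {&lt;br /&gt;
         String even = &amp;quot;even&amp;quot;  # String? here: null when i is odd&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 # After the scatter, even is Array[String?]: [&amp;quot;even&amp;quot;, null, &amp;quot;even&amp;quot;, null]&lt;br /&gt;
&lt;br /&gt;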
Note that WDL supports conditional ''expressions'' with a &amp;lt;code&amp;gt;then&amp;lt;/code&amp;gt; and an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt;, but conditional ''statements'' only have a body, not an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt; branch. If you need an else you will have to check the negated condition.&lt;br /&gt;
&lt;br /&gt;
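For example (an illustrative sketch): a conditional ''expression'' produces a value directly, while conditional ''statements'' need a second, negated check to cover the other case.&lt;br /&gt;
&lt;br /&gt;
 # Conditional expression: has an else, yields a value&lt;br /&gt;
 String parity = if (one_based % 2 == 0) then &amp;quot;even&amp;quot; else &amp;quot;odd&amp;quot;&lt;br /&gt;
 &lt;br /&gt;
 # Conditional statements: no else branch, so check both ways&lt;br /&gt;
 if (one_based % 2 == 0) {&lt;br /&gt;
     String even_label = &amp;quot;even&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
 if (one_based % 2 != 0) {&lt;br /&gt;
     String odd_label = &amp;quot;odd&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;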
So first, let's handle the special cases.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Calling Tasks===&lt;br /&gt;
&lt;br /&gt;
Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a &amp;lt;code&amp;gt;${}&amp;lt;/code&amp;gt; substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a &amp;lt;code&amp;gt;stringify_number&amp;lt;/code&amp;gt; task, to be written later.&lt;br /&gt;
&lt;br /&gt;
To call a task (or another workflow), we use a &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt; statement and give it some inputs. Then we can fish the output values out of the task with &amp;lt;code&amp;gt;.&amp;lt;/code&amp;gt; access (for example, &amp;lt;code&amp;gt;stringify_number.the_string&amp;lt;/code&amp;gt;); in iterations where the call did not run, those outputs will be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
        call stringify_number {&lt;br /&gt;
            input:&lt;br /&gt;
                the_number = one_based&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We can put the code into the workflow now, and set about writing the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         &lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Writing Tasks===&lt;br /&gt;
&lt;br /&gt;
Our task should go after the workflow in the file. It looks a lot like a workflow except it uses &amp;lt;code&amp;gt;task&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're going to want it to take in an integer &amp;lt;code&amp;gt;the_number&amp;lt;/code&amp;gt;, and we're going to want it to output a string &amp;lt;code&amp;gt;the_string&amp;lt;/code&amp;gt;. So let's fill that in in &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; sections.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     # ???&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now, unlike workflows, tasks can have a &amp;lt;code&amp;gt;command&amp;lt;/code&amp;gt; section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use &amp;lt;code&amp;gt;~{}&amp;lt;/code&amp;gt;, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now we need to capture the result of the command script. The WDL &amp;lt;code&amp;gt;stdout()&amp;lt;/code&amp;gt; returns a WDL &amp;lt;code&amp;gt;File&amp;lt;/code&amp;gt; containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're also going to add a &amp;lt;code&amp;gt;runtime&amp;lt;/code&amp;gt; section to our task, to specify resource requirements, and to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;disks&amp;lt;/code&amp;gt; section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a &amp;lt;code&amp;gt;local-disk&amp;lt;/code&amp;gt; of a certain number of gigabytes, which may suggest that it be &amp;lt;code&amp;gt;SSD&amp;lt;/code&amp;gt; storage.&lt;br /&gt;
&lt;br /&gt;
Then we can put our task into our WDL file:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Output Block===&lt;br /&gt;
&lt;br /&gt;
Now the only thing missing is a workflow-level &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section. Technically you aren't supposed to need one in WDL 1.0, but you do need one in WDL 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect all the strings that came out of the different tasks in our scatter into an &amp;lt;code&amp;gt;Array[String]&amp;lt;/code&amp;gt;. We'll add the &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section at the end of the &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; section, above the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
     output {&lt;br /&gt;
          Array[String] fizzbuzz_results = result&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Because the &amp;lt;code&amp;gt;result&amp;lt;/code&amp;gt; variable is defined inside a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt;, when we reference it outside the scatter we see it as being an array.&lt;br /&gt;
&lt;br /&gt;
==Running the Workflow==&lt;br /&gt;
&lt;br /&gt;
Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;FizzBuzz.item_count&amp;quot;: 20}' &amp;gt;fizzbuzz.json&lt;br /&gt;
&lt;br /&gt;
Then run it on the cluster with Toil:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
Or locally:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
=Debugging Workflows=&lt;br /&gt;
Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.&lt;br /&gt;
&lt;br /&gt;
==Restarting the Workflow==&lt;br /&gt;
&lt;br /&gt;
If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have fixed, and you ran the workflow with &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; set manually to a directory that persists between attempts, you can add &amp;lt;code&amp;gt;--restart&amp;lt;/code&amp;gt; to your workflow command and make Toil try again. It will pick up from where it left off and rerun any failed tasks and then the rest of the workflow.&lt;br /&gt;
&lt;br /&gt;
This ''will not'' pick up any changes to your WDL source code files; those are read once at the beginning and not re-read on restart.&lt;br /&gt;
&lt;br /&gt;
If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques.&lt;br /&gt;
&lt;br /&gt;
==Debugging Options==&lt;br /&gt;
When debugging a workflow, make sure to run the workflow with &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt;, to set the log level to &amp;lt;code&amp;gt;DEBUG&amp;lt;/code&amp;gt;, and with &amp;lt;code&amp;gt;--jobStore /some/path/to/a/shared/directory/it/can/create&amp;lt;/code&amp;gt; so that the stored files shipped between jobs are in a place you can access them.&lt;br /&gt;
&lt;br /&gt;
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:&lt;br /&gt;
&lt;br /&gt;
 =========&amp;gt;&lt;br /&gt;
         Toil job log is here&lt;br /&gt;
 &amp;lt;=========&lt;br /&gt;
&lt;br /&gt;
Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this.&lt;br /&gt;
&lt;br /&gt;
==Reading the Log==&lt;br /&gt;
&lt;br /&gt;
When a WDL workflow fails, you are likely to see a message like this:&lt;br /&gt;
&lt;br /&gt;
 WDL.runtime.error.CommandFailed: task command failed with exit status 1&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism&lt;br /&gt;
&lt;br /&gt;
This means that the command line specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code. This happens either when the command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error.&lt;br /&gt;
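As a generic illustration (outside of Toil or WDL), the exit status is all the runner sees of a failing command; anything nonzero counts as failure:&lt;br /&gt;

```python
# Generic illustration, not Toil itself: the runner only observes the
# command's numeric exit status; any nonzero value is treated as failure.
import subprocess

completed = subprocess.run(["false"])  # "false" always exits with status 1
print("exit status:", completed.returncode)
```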
&lt;br /&gt;
Go up higher in the log until you find lines that look like:&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows:&lt;br /&gt;
&lt;br /&gt;
And&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows:&lt;br /&gt;
&lt;br /&gt;
These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.&lt;br /&gt;
&lt;br /&gt;
If you would like individual task logs to be saved separately for later reference, you can use the &amp;lt;code&amp;gt;--writeLogs&amp;lt;/code&amp;gt; option to specify a directory to store them. For more information, see [https://toil.readthedocs.io/en/latest/wdl/running.html#managing-workflow-logs the Toil documentation of workflow task logs].&lt;br /&gt;
&lt;br /&gt;
==Reproducing Problems==&lt;br /&gt;
&lt;br /&gt;
When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this.&lt;br /&gt;
&lt;br /&gt;
=== Automatically Fetching Input Files ===&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;toil debug-job&amp;lt;/code&amp;gt; command has a &amp;lt;code&amp;gt;--retrieveTaskDirectory&amp;lt;/code&amp;gt; option that lets you dump out a directory with all the files that a failing WDL task would use. You can use it like:&lt;br /&gt;
&lt;br /&gt;
 toil debug-job ./jobstore WDLTaskJob --retrieveTaskDirectory dumpdir&lt;br /&gt;
&lt;br /&gt;
If there are multiple failing tasks, you might need to replace &amp;lt;code&amp;gt;WDLTaskJob&amp;lt;/code&amp;gt; with the name of one of the failing jobs. See [https://toil.readthedocs.io/en/latest/running/debugging.html#fetching-job-inputs the Toil documentation on retrieving files] for more on how to use this command.&lt;br /&gt;
&lt;br /&gt;
=== Manually Finding Input Files ===&lt;br /&gt;
&lt;br /&gt;
If you can't use &amp;lt;code&amp;gt;toil debug-job&amp;lt;/code&amp;gt;, you might need to manually dig through the job store for files. In the log of your failing Toil task, look for lines like this:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&amp;lt;/code&amp;gt; part is a Toil file ID, and it is a relative path from your &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; value to where the file is stored on disk. So if you ran the workflow with &amp;lt;code&amp;gt;--jobStore /private/groups/patenlab/anovak/jobstore&amp;lt;/code&amp;gt;, you would look for this file at:&lt;br /&gt;
&lt;br /&gt;
 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&lt;br /&gt;
&lt;br /&gt;
==More Ways of Finding Files==&lt;br /&gt;
&lt;br /&gt;
Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the &amp;lt;code&amp;gt;find&amp;lt;/code&amp;gt; command to search for the files by name. For example, if you want to look at &amp;lt;code&amp;gt;Sample.bam&amp;lt;/code&amp;gt;, you can look for it like this:&lt;br /&gt;
&lt;br /&gt;
 find /path/to/the/jobstore -name &amp;quot;Sample.bam&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
You can take the &amp;lt;code&amp;gt;toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt; URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this:&lt;br /&gt;
&lt;br /&gt;
 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
Then you can take the part after the last colon, &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt;, and that is the path relative to the job store where this file can be found.&lt;br /&gt;
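The same decoding can be done without a website; here is a sketch using Python's standard library on the example URI from the log above:&lt;br /&gt;

```python
# Decode the toilfile: URI from the log and pull out the job-store-relative
# path, following the steps described above.
from urllib.parse import unquote

uri = ("toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob"
       "%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351"
       "%2FSample.chr14.bam/Sample.chr14.bam")
decoded = unquote(uri)
# The part after the last colon is the path relative to the job store.
relative_path = decoded.rsplit(":", 1)[1]
print(relative_path)
```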
&lt;br /&gt;
==Using Development Versions of Toil==&lt;br /&gt;
&lt;br /&gt;
Sometimes, bugs will be fixed in the development version of Toil, but not released yet. To try the current development version of Toil, you can install it like this:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
If you want to use a particular branch or commit, like &amp;lt;code&amp;gt;aaa451b320fc115b3563ced25cb501301cf86f90&amp;lt;/code&amp;gt;, you can do:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
==Frequently Asked Questions==&lt;br /&gt;
&lt;br /&gt;
===I am getting warnings about &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt;===&lt;br /&gt;
You may be seeing warnings that &amp;lt;code&amp;gt;XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
You should upgrade Toil. [https://github.com/DataBiosphere/toil/commit/ff6bf60ab798a675c20156c749817c4313644b96 Since Toil 6.1.0], Toil no longer issues this warning, and just puts up with bad &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt; settings.&lt;br /&gt;
&lt;br /&gt;
===Toil said it was &amp;lt;code&amp;gt;Redirecting logging&amp;lt;/code&amp;gt; somewhere, but I can't find that file!===&lt;br /&gt;
The Toil worker process for each job will say that it is &amp;lt;code&amp;gt;Redirecting logging to /data/tmp/somewhere/worker_log.txt&amp;lt;/code&amp;gt;, and when running in single machine mode these messages go to the main Toil log.&lt;br /&gt;
&lt;br /&gt;
The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt; option to Toil.&lt;br /&gt;
&lt;br /&gt;
If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.&lt;br /&gt;
&lt;br /&gt;
=Additional WDL resources=&lt;br /&gt;
&lt;br /&gt;
For more information on writing and running WDL workflows, see:&lt;br /&gt;
&lt;br /&gt;
* [https://docs.openwdl.org/en/stable/ The WDL documentation]&lt;br /&gt;
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The &amp;quot;Learn WDL&amp;quot; video course on YouTube]&lt;br /&gt;
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=507</id>
		<title>Phoenix WDL Tutorial</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=507"/>
		<updated>2024-07-16T20:20:12Z</updated>

		<summary type="html">&lt;p&gt;Anovak: /* Reading the Log */ Explain --writeLogs&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Tutorial: Getting Started with WDL Workflows on Phoenix'''&lt;br /&gt;
&lt;br /&gt;
Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you.&lt;br /&gt;
&lt;br /&gt;
Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.&lt;br /&gt;
&lt;br /&gt;
This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.&lt;br /&gt;
&lt;br /&gt;
=Phoenix Cluster Setup=&lt;br /&gt;
&lt;br /&gt;
Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.&lt;br /&gt;
&lt;br /&gt;
==Getting VPN access==&lt;br /&gt;
&lt;br /&gt;
We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.&lt;br /&gt;
&lt;br /&gt;
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.&lt;br /&gt;
&lt;br /&gt;
==Connecting to Phoenix==&lt;br /&gt;
&lt;br /&gt;
Once you have VPN access, you can connect to any of the machines with access to the Phoenix cluster. These interactive nodes are fairly large machines that can do some work locally, but you will still want to run larger workflows on the actual cluster. For this tutorial, we will use &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt; as our login node.&lt;br /&gt;
&lt;br /&gt;
To connect to the cluster:&lt;br /&gt;
&lt;br /&gt;
1. Connect to the VPN.&lt;br /&gt;
2. SSH to &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt;. At the command line, run:&lt;br /&gt;
&lt;br /&gt;
 ssh emerald.prism&lt;br /&gt;
&lt;br /&gt;
If your username on the cluster (say, &amp;lt;code&amp;gt;flastname&amp;lt;/code&amp;gt;) is different from your username on your computer (which might be &amp;lt;code&amp;gt;firstname&amp;lt;/code&amp;gt;), you might instead have to run:&lt;br /&gt;
&lt;br /&gt;
 ssh flastname@emerald.prism&lt;br /&gt;
&lt;br /&gt;
The first time you connect, you will see a message like:&lt;br /&gt;
&lt;br /&gt;
 The authenticity of host 'emerald.prism (10.50.1.67)' can't be established.&lt;br /&gt;
 ED25519 key fingerprint is SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI.&lt;br /&gt;
 This key is not known by any other names.&lt;br /&gt;
 Are you sure you want to continue connecting (yes/no/[fingerprint])?&lt;br /&gt;
&lt;br /&gt;
This is your computer asking you to help it decide if it is talking to the genuine &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt;, and not an imposter. You will want to make sure that the &amp;quot;key fingerprint&amp;quot; is indeed &amp;lt;code&amp;gt;SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI&amp;lt;/code&amp;gt;. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type &amp;lt;code&amp;gt;yes&amp;lt;/code&amp;gt; to accept and remember that the server is who it says it is.&lt;br /&gt;
&lt;br /&gt;
==Installing Toil with WDL support==&lt;br /&gt;
&lt;br /&gt;
Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
If you also want to use AWS S3 &amp;lt;code&amp;gt;s3://&amp;lt;/code&amp;gt; and/or Google &amp;lt;code&amp;gt;gs://&amp;lt;/code&amp;gt; URLs for data, you will need to also install Toil with the &amp;lt;code&amp;gt;aws&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;google&amp;lt;/code&amp;gt; extras, respectively:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
This will install Toil in the &amp;lt;code&amp;gt;.local&amp;lt;/code&amp;gt; directory inside your home directory, which we write as &amp;lt;code&amp;gt;~/.local&amp;lt;/code&amp;gt;. The program to run WDL workflows, &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, will be at &amp;lt;code&amp;gt;~/.local/bin/toil-wdl-runner&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
By default, the command interpreter ''will not'' look there, so if you type &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run:&lt;br /&gt;
&lt;br /&gt;
 echo 'export PATH=&amp;quot;${HOME}/.local/bin:${PATH}&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in''', to restart bash and pick up the change.&lt;br /&gt;
&lt;br /&gt;
To make sure it worked, you can run:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --help&lt;br /&gt;
&lt;br /&gt;
If everything worked correctly, it will print a long list of the various option flags that the &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; command supports.&lt;br /&gt;
&lt;br /&gt;
If you ever want to upgrade Toil to a new release, you can repeat the &amp;lt;code&amp;gt;pip&amp;lt;/code&amp;gt; command above.&lt;br /&gt;
&lt;br /&gt;
==Configuring your Phoenix Environment==&lt;br /&gt;
&lt;br /&gt;
'''Do not try and store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. Usually you would end up with &amp;lt;code&amp;gt;/private/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Remember this path; we will need it later.&lt;br /&gt;
&lt;br /&gt;
==Configuring Toil for Phoenix==&lt;br /&gt;
&lt;br /&gt;
Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we do have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. Since these files can be large, and the home directory quota is only 30 GB, we can't keep them in your home directory. We will need to use the &amp;lt;code&amp;gt;/private/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt; directory you created earlier.&lt;br /&gt;
&lt;br /&gt;
Make that directory available in your &amp;lt;code&amp;gt;~/.bashrc&amp;lt;/code&amp;gt; file by editing this command to use your own group and username and then running it:&lt;br /&gt;
&lt;br /&gt;
 echo 'BIG_DATA_DIR=/private/groups/YOURGROUPNAME/YOURUSERNAME' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
Then use these commands to make sure that Toil knows where it ought to put its caches:&lt;br /&gt;
&lt;br /&gt;
 echo 'export SINGULARITY_CACHEDIR=&amp;quot;${BIG_DATA_DIR}/.singularity/cache&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE=&amp;quot;${BIG_DATA_DIR}/.cache/miniwdl&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in again''', to apply the changes.&lt;br /&gt;
&lt;br /&gt;
If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day].&lt;br /&gt;
&lt;br /&gt;
'''If you get errors about mutexes, lock files, or other weird problems with Singularity''', try moving these directories to inside &amp;lt;code&amp;gt;/data/tmp&amp;lt;/code&amp;gt; on the individual nodes, or unsetting them and letting Toil use its defaults (and exhaust our Docker pull limits). [https://github.com/DataBiosphere/toil/issues/4654 It is not clear that &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt; actually implements the necessary file locking correctly.]&lt;br /&gt;
&lt;br /&gt;
=Running an existing workflow=&lt;br /&gt;
&lt;br /&gt;
First, let's use &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].&lt;br /&gt;
&lt;br /&gt;
To begin, go to your user directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;, and make a directory to work in.&lt;br /&gt;
&lt;br /&gt;
 cd /private/groups/YOURGROUPNAME/YOURUSERNAME&lt;br /&gt;
 mkdir workflow-test&lt;br /&gt;
 cd workflow-test&lt;br /&gt;
&lt;br /&gt;
Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.&lt;br /&gt;
&lt;br /&gt;
 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl&lt;br /&gt;
&lt;br /&gt;
==Preparing an input file==&lt;br /&gt;
&lt;br /&gt;
Near the top of the WDL file, there's a section like this:&lt;br /&gt;
&lt;br /&gt;
 workflow hello_caller {&lt;br /&gt;
     input {&lt;br /&gt;
         File who&lt;br /&gt;
     }&lt;br /&gt;
&lt;br /&gt;
This means that there is a workflow named &amp;lt;code&amp;gt;hello_caller&amp;lt;/code&amp;gt; in this file, and it takes as input a file variable named &amp;lt;code&amp;gt;who&amp;lt;/code&amp;gt;. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.&lt;br /&gt;
&lt;br /&gt;
So first, we have to make that list of names. Let's make it in &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 echo &amp;quot;Mridula Resurrección&amp;quot; &amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Gershom Šarlota&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Ritchie Ravi&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
&lt;br /&gt;
Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt; that references it by relative path, like this:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./names.txt&amp;quot;}' &amp;gt;inputs.json&lt;br /&gt;
&lt;br /&gt;
Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].&lt;br /&gt;
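If you prefer, you can generate the inputs file programmatically; a minimal sketch with Python's &amp;lt;code&amp;gt;json&amp;lt;/code&amp;gt; module (equivalent to the &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command above) avoids quoting mistakes as inputs files grow:&lt;br /&gt;

```python
# Write the same inputs file as the echo command above, using the json
# module so quoting is handled for us. Keys are the workflow name, a dot,
# then the input name.
import json

inputs = {"hello_caller.who": "./names.txt"}
with open("inputs.json", "w") as f:
    json.dump(inputs, f)
```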
&lt;br /&gt;
==Testing at small scale on a single machine==&lt;br /&gt;
&lt;br /&gt;
We are now ready to run the workflow!&lt;br /&gt;
&lt;br /&gt;
You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:&lt;br /&gt;
&lt;br /&gt;
 srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i&lt;br /&gt;
&lt;br /&gt;
This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In your new shell, run this Toil command:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner self_test.wdl inputs.json -o local_run&lt;br /&gt;
&lt;br /&gt;
This will, by default, use the &amp;lt;code&amp;gt;single_machine&amp;lt;/code&amp;gt; Toil &amp;quot;batch system&amp;quot; to run all of the workflow's tasks locally. Output will be sent to a new directory named &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
This will print a lot of logging to standard error, and to standard output it will print:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;hello_caller.message_files&amp;quot;: [&amp;quot;local_run/Mridula Resurrecci\u00f3n.txt&amp;quot;, &amp;quot;local_run/Gershom \u0160arlota.txt&amp;quot;, &amp;quot;local_run/Ritchie Ravi.txt&amp;quot;], &amp;quot;hello_caller.messages&amp;quot;: [&amp;quot;Hello, Mridula Resurrecci\u00f3n!&amp;quot;, &amp;quot;Hello, Gershom \u0160arlota!&amp;quot;, &amp;quot;Hello, Ritchie Ravi!&amp;quot;]}&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt; directory will contain the described text files (with Unicode escape sequences like &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt; replaced by their corresponding characters), each containing a greeting for the corresponding person.&lt;br /&gt;
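&lt;br /&gt;
Those escape sequences are ordinary JSON string escapes, so any JSON parser will turn them back into the real characters. For example, in Python (a check you can run yourself; the sample output string is taken from above):&lt;br /&gt;
&lt;br /&gt;
```python
import json

# The workflow's standard output is plain JSON, so escape
# sequences like \u00f3 decode to their real characters.
outputs = json.loads('{"hello_caller.messages": ["Hello, Mridula Resurrecci\\u00f3n!"]}')
print(outputs['hello_caller.messages'][0])  # Hello, Mridula Resurrección!
```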
&lt;br /&gt;
To leave your interactive Slurm session and return to the head node, use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Running at larger scale==&lt;br /&gt;
&lt;br /&gt;
Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people!&lt;br /&gt;
&lt;br /&gt;
Go get this handy list of people and cut it to length:&lt;br /&gt;
&lt;br /&gt;
 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt&lt;br /&gt;
 head -n100 1000_names.txt &amp;gt;100_names.txt&lt;br /&gt;
&lt;br /&gt;
And make a new inputs file:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./100_names.txt&amp;quot;}' &amp;gt;inputs_big.json&lt;br /&gt;
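&lt;br /&gt;
For inputs files with more than one key, writing the JSON from a script avoids shell-quoting mistakes. A small sketch that produces the same file:&lt;br /&gt;
&lt;br /&gt;
```python
import json

# Build the same inputs file programmatically. Keys are the
# workflow name, a dot, and the input name, as described earlier.
inputs = {'hello_caller.who': './100_names.txt'}
with open('inputs_big.json', 'w') as f:
    json.dump(inputs, f)
```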
&lt;br /&gt;
Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.&lt;br /&gt;
&lt;br /&gt;
To run against the Slurm cluster, we need to use the &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the &amp;lt;code&amp;gt;--caching false&amp;lt;/code&amp;gt; option. We will add the &amp;lt;code&amp;gt;--batchLogsDir&amp;lt;/code&amp;gt; option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the &amp;lt;code&amp;gt;-m&amp;lt;/code&amp;gt; option to save the output JSON to a file instead of printing it.&lt;br /&gt;
&lt;br /&gt;
Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the &amp;lt;code&amp;gt;TOIL_SLURM_ARGS&amp;lt;/code&amp;gt; environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in.&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json&lt;br /&gt;
&lt;br /&gt;
This will tick for a while, but eventually you should end up with 100 greeting files in the &amp;lt;code&amp;gt;slurm_run&amp;lt;/code&amp;gt; directory.&lt;br /&gt;
&lt;br /&gt;
=Writing your own workflow=&lt;br /&gt;
&lt;br /&gt;
In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. &lt;br /&gt;
&lt;br /&gt;
==Writing the file==&lt;br /&gt;
&lt;br /&gt;
===Version===&lt;br /&gt;
&lt;br /&gt;
All WDL files need to start with a &amp;lt;code&amp;gt;version&amp;lt;/code&amp;gt; statement (unless they are very old &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; files). Toil supports &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt;, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; and 1.0.&lt;br /&gt;
&lt;br /&gt;
So let's start a new WDL 1.0 workflow. Open up a file named &amp;lt;code&amp;gt;fizzbuzz.wdl&amp;lt;/code&amp;gt; and start with a version statement:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
&lt;br /&gt;
===Workflow Block===&lt;br /&gt;
&lt;br /&gt;
Then, add an empty &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; named &amp;lt;code&amp;gt;FizzBuzz&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Input Block===&lt;br /&gt;
&lt;br /&gt;
Workflows usually need some kind of user input, so let's give our workflow an &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; section.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Notice that each input has a type, a name, and an optional default value. If the type ends in &amp;lt;code&amp;gt;?&amp;lt;/code&amp;gt;, the value is optional, and it may be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.&lt;br /&gt;
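&lt;br /&gt;
If it helps, the input section behaves roughly like a Python function signature (an analogy only, not how WDL is implemented): inputs with defaults can be omitted, and optional &amp;lt;code&amp;gt;?&amp;lt;/code&amp;gt; inputs are like parameters that default to &amp;lt;code&amp;gt;None&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
```python
# Analogy: WDL inputs as a function signature. item_count has no
# default and is not optional, so the caller must supply it;
# fizzbuzz_override is optional, so it may stay None ("null").
def fizzbuzz_inputs(item_count, to_fizz=3, to_buzz=5, fizzbuzz_override=None):
    return {'item_count': item_count, 'to_fizz': to_fizz,
            'to_buzz': to_buzz, 'fizzbuzz_override': fizzbuzz_override}

print(fizzbuzz_inputs(20))
```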
&lt;br /&gt;
===Body===&lt;br /&gt;
&lt;br /&gt;
Now we'll start on the body of the workflow, to be inserted just after the inputs section. &lt;br /&gt;
&lt;br /&gt;
The first thing we're going to need to do is create an array of all the numbers up to the &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;. We can do this by calling the WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; function, and assigning the result to an &amp;lt;code&amp;gt;Array[Int]&amp;lt;/code&amp;gt; variable.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
&lt;br /&gt;
WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.&lt;br /&gt;
&lt;br /&gt;
===Scattering===&lt;br /&gt;
&lt;br /&gt;
Once we create an array of all the numbers, we can use a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt; to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a &amp;lt;code&amp;gt;map()&amp;lt;/code&amp;gt; in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; starts at 0.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
     Int one_based = i + 1&lt;br /&gt;
 }&lt;br /&gt;
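&lt;br /&gt;
In Python terms, the scatter above behaves like this sketch (each variable declared in the body is gathered into a list of per-iteration results):&lt;br /&gt;
&lt;br /&gt;
```python
# The scatter body runs once per input value (conceptually in
# parallel) and each declared variable becomes an array outside.
item_count = 10
numbers = list(range(item_count))     # Array[Int] numbers = range(item_count)
one_based = [i + 1 for i in numbers]  # Int one_based = i + 1, gathered
print(one_based)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```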
&lt;br /&gt;
===Conditionals===&lt;br /&gt;
&lt;br /&gt;
Inside the body of the scatter, we are going to put some conditionals to determine if we should produce &amp;lt;code&amp;gt;&amp;quot;Fizz&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;Buzz&amp;quot;&amp;lt;/code&amp;gt;, or &amp;lt;code&amp;gt;&amp;quot;FizzBuzz&amp;quot;&amp;lt;/code&amp;gt;. To support our &amp;lt;code&amp;gt;fizzbuzz_override&amp;lt;/code&amp;gt;, we use an array of it and a default value, and use the WDL &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; function to find the first non-null value in that array.&lt;br /&gt;
&lt;br /&gt;
Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; at the end and take advantage of variables from un-executed conditionals being &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Note that WDL supports conditional ''expressions'' with a &amp;lt;code&amp;gt;then&amp;lt;/code&amp;gt; and an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt;, but conditional ''statements'' only have a body, not an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt; branch. If you need an else you will have to check the negated condition.&lt;br /&gt;
&lt;br /&gt;
So first, let's handle the special cases.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Calling Tasks===&lt;br /&gt;
&lt;br /&gt;
Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a &amp;lt;code&amp;gt;${}&amp;lt;/code&amp;gt; substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a &amp;lt;code&amp;gt;stringify_number&amp;lt;/code&amp;gt; task, to be written later.&lt;br /&gt;
&lt;br /&gt;
To call a task (or another workflow), we use a &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt; statement and give it some inputs. Then we can fish the output values out of the task with &amp;lt;code&amp;gt;.&amp;lt;/code&amp;gt; access. We only make the call for normal numbers, when we don't produce a noise instead.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
        call stringify_number {&lt;br /&gt;
            input:&lt;br /&gt;
                the_number = one_based&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
 }&lt;br /&gt;
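&lt;br /&gt;
To see what this computes, here is the scatter body's decision logic sketched in Python. This is a stand-in, not WDL: &amp;lt;code&amp;gt;None&amp;lt;/code&amp;gt; plays the role of &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;, and the &amp;lt;code&amp;gt;stringify_number&amp;lt;/code&amp;gt; task is modeled as a plain &amp;lt;code&amp;gt;str()&amp;lt;/code&amp;gt; conversion.&lt;br /&gt;
&lt;br /&gt;
```python
# Sketch of the scatter body's logic; None stands in for WDL's
# null, and select_first() is "first non-None value".
def select_first(values):
    return next(v for v in values if v is not None)

def fizzbuzz_word(one_based, to_fizz=3, to_buzz=5, fizzbuzz_override=None):
    fizz = 'Fizz' if one_based % to_fizz == 0 else None
    buzz = 'Buzz' if one_based % to_buzz == 0 else None
    fizzbuzz = None
    if fizz is not None and buzz is not None:
        fizzbuzz = select_first([fizzbuzz_override, 'FizzBuzz'])
    # stringify_number is only called for normal numbers
    stringified = str(one_based) if fizz is None and buzz is None else None
    return select_first([fizzbuzz, fizz, buzz, stringified])

print([fizzbuzz_word(n) for n in range(1, 16)])
```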
&lt;br /&gt;
We can put the code into the workflow now, and set about writing the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         &lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Writing Tasks===&lt;br /&gt;
&lt;br /&gt;
Our task should go after the workflow in the file. It looks a lot like a workflow except it uses &amp;lt;code&amp;gt;task&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're going to want it to take in an integer &amp;lt;code&amp;gt;the_number&amp;lt;/code&amp;gt;, and we're going to want it to output a string &amp;lt;code&amp;gt;the_string&amp;lt;/code&amp;gt;. So let's fill that in in &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; sections.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     # ???&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now, unlike workflows, tasks can have a &amp;lt;code&amp;gt;command&amp;lt;/code&amp;gt; section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use &amp;lt;code&amp;gt;~{}&amp;lt;/code&amp;gt;, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now we need to capture the result of the command script. The WDL &amp;lt;code&amp;gt;stdout()&amp;lt;/code&amp;gt; returns a WDL &amp;lt;code&amp;gt;File&amp;lt;/code&amp;gt; containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
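&lt;br /&gt;
Conceptually, &amp;lt;code&amp;gt;read_string(stdout())&amp;lt;/code&amp;gt; captures the command's standard output and strips the trailing newline, much like this hypothetical Python sketch of the same task:&lt;br /&gt;
&lt;br /&gt;
```python
import subprocess

the_number = 7
# Run the same echo command the task would run and capture its
# standard output, then strip the trailing newline, which is
# what WDL's read_string() does for us.
result = subprocess.run(['echo', str(the_number)],
                        capture_output=True, text=True, check=True)
the_string = result.stdout.rstrip('\n')
print(the_string)  # 7
```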
&lt;br /&gt;
We're also going to add a &amp;lt;code&amp;gt;runtime&amp;lt;/code&amp;gt; section to our task, to specify resource requirements, and tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;disks&amp;lt;/code&amp;gt; section is a little unusual; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a &amp;lt;code&amp;gt;local-disk&amp;lt;/code&amp;gt; of a certain number of gigabytes, optionally suggesting that it be &amp;lt;code&amp;gt;SSD&amp;lt;/code&amp;gt; storage.&lt;br /&gt;
&lt;br /&gt;
Then we can put our task into our WDL file:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Output Block===&lt;br /&gt;
&lt;br /&gt;
Now the only thing missing is a workflow-level &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section. Technically, in WDL 1.0 you aren't supposed to need one, but you do need it in 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to write one. We need to collect all the strings that came out of the different tasks in our scatter into an &amp;lt;code&amp;gt;Array[String]&amp;lt;/code&amp;gt;. We'll add the &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section at the end of the &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; block, above the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
     output {&lt;br /&gt;
        Array[String] fizzbuzz_results = result&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Because the &amp;lt;code&amp;gt;result&amp;lt;/code&amp;gt; variable is defined inside a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt;, when we reference it outside the scatter we see it as being an array.&lt;br /&gt;
&lt;br /&gt;
==Running the Workflow==&lt;br /&gt;
&lt;br /&gt;
Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;FizzBuzz.item_count&amp;quot;: 20}' &amp;gt;fizzbuzz.json&lt;br /&gt;
&lt;br /&gt;
Then run it on the cluster with Toil:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
Or locally:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
=Debugging Workflows=&lt;br /&gt;
Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.&lt;br /&gt;
&lt;br /&gt;
==Restarting the Workflow==&lt;br /&gt;
&lt;br /&gt;
If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have fixed, and you ran the workflow with &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; set manually to a directory that persists between attempts, you can add &amp;lt;code&amp;gt;--restart&amp;lt;/code&amp;gt; to your workflow command and make Toil try again. It will pick up from where it left off and rerun any failed tasks and then the rest of the workflow.&lt;br /&gt;
&lt;br /&gt;
This ''will not'' pick up any changes to your WDL source code files; those are read once at the beginning and not re-read on restart.&lt;br /&gt;
&lt;br /&gt;
If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques.&lt;br /&gt;
&lt;br /&gt;
==Debugging Options==&lt;br /&gt;
When debugging a workflow, make sure to run the workflow with &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt;, to set the log level to &amp;lt;code&amp;gt;DEBUG&amp;lt;/code&amp;gt;, and with &amp;lt;code&amp;gt;--jobStore /some/path/to/a/shared/directory/it/can/create&amp;lt;/code&amp;gt; so that the stored files shipped between jobs are in a place you can access them.&lt;br /&gt;
&lt;br /&gt;
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:&lt;br /&gt;
&lt;br /&gt;
 =========&amp;gt;&lt;br /&gt;
         Toil job log is here&lt;br /&gt;
 &amp;lt;=========&lt;br /&gt;
&lt;br /&gt;
Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this.&lt;br /&gt;
&lt;br /&gt;
==Reading the Log==&lt;br /&gt;
&lt;br /&gt;
When a WDL workflow fails, you are likely to see a message like this:&lt;br /&gt;
&lt;br /&gt;
 WDL.runtime.error.CommandFailed: task command failed with exit status 1&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism&lt;br /&gt;
&lt;br /&gt;
This means that the command line specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code. That happens either when the command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error.&lt;br /&gt;
&lt;br /&gt;
Go up higher in the log until you find lines that look like:&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows:&lt;br /&gt;
&lt;br /&gt;
And&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows:&lt;br /&gt;
&lt;br /&gt;
These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.&lt;br /&gt;
&lt;br /&gt;
If you would like individual task logs to be saved separately for later reference, you can use the &amp;lt;code&amp;gt;--writeLogs&amp;lt;/code&amp;gt; option to specify a directory to store them. For more information, see [https://toil.readthedocs.io/en/latest/wdl/running.html#managing-workflow-logs the Toil documentation of workflow task logs].&lt;br /&gt;
&lt;br /&gt;
==Reproducing Problems==&lt;br /&gt;
&lt;br /&gt;
When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this.&lt;br /&gt;
&lt;br /&gt;
=== Automatically Fetching Input Files ===&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;toil debug-job&amp;lt;/code&amp;gt; command has a &amp;lt;code&amp;gt;--retrieveTaskDirectory&amp;lt;/code&amp;gt; option that lets you dump out a directory with all the files that a failing WDL task would use. You can use it like:&lt;br /&gt;
&lt;br /&gt;
 toil debug-job ./jobstore WDLTaskJob --retrieveTaskDirectory dumpdir&lt;br /&gt;
&lt;br /&gt;
If there are multiple failing tasks, you might need to replace &amp;lt;code&amp;gt;WDLTaskJob&amp;lt;/code&amp;gt; with the name of one of the failing jobs. See [https://toil.readthedocs.io/en/latest/running/debugging.html#fetching-job-inputs the Toil documentation on retrieving files] for more on how to use this command.&lt;br /&gt;
&lt;br /&gt;
=== Manually Finding Input Files ===&lt;br /&gt;
&lt;br /&gt;
If you can't use &amp;lt;code&amp;gt;toil debug-job&amp;lt;/code&amp;gt;, you might need to manually dig through the job store for files. In the log of your failing Toil task, look for lines like this:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&amp;lt;/code&amp;gt; part is a Toil file ID, and it is a relative path from your &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; value to where the file is stored on disk. So if you ran the workflow with &amp;lt;code&amp;gt;--jobStore /private/groups/patenlab/anovak/jobstore&amp;lt;/code&amp;gt;, you would look for this file at:&lt;br /&gt;
&lt;br /&gt;
 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&lt;br /&gt;
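&lt;br /&gt;
In other words, the on-disk location is just the job store path joined with the file ID from the log, as in this sketch (paths taken from the example above):&lt;br /&gt;
&lt;br /&gt;
```python
import os.path

# Join the --jobStore path with the file ID from the log to get
# the on-disk location of the stored file.
job_store = '/private/groups/patenlab/anovak/jobstore'
file_id = ('files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/'
           'file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam')
full_path = os.path.join(job_store, file_id)
print(full_path)
```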
&lt;br /&gt;
==More Ways of Finding Files==&lt;br /&gt;
&lt;br /&gt;
Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the &amp;lt;code&amp;gt;find&amp;lt;/code&amp;gt; command to try and find the files by name. For example, if you want to look at &amp;lt;code&amp;gt;Sample.bam&amp;lt;/code&amp;gt;, you can look for it like this:&lt;br /&gt;
&lt;br /&gt;
 find /path/to/the/jobstore -name &amp;quot;Sample.bam&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
You can take the &amp;lt;code&amp;gt;toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt; URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this:&lt;br /&gt;
&lt;br /&gt;
 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
Then you can take the part after the last colon, &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt;, and that is the path relative to the job store where this file can be found.&lt;br /&gt;
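&lt;br /&gt;
If you would rather not use a website, you can do the URL-decoding at the command line instead. This is just a sketch using the Python standard library, with the example URI from above:&lt;br /&gt;
&lt;br /&gt;
```shell
# URL-decode the toilfile URI from the log (using Python's standard library),
# then keep only the part after the last colon, which is the path
# relative to the job store.
uri='toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam'
decoded="$(python3 -c 'import sys, urllib.parse; print(urllib.parse.unquote(sys.argv[1]))' "$uri")"
# Strip everything up to and including the last colon.
echo "${decoded##*:}"
```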
&lt;br /&gt;
==Using Development Versions of Toil==&lt;br /&gt;
&lt;br /&gt;
Sometimes, bugs will be fixed in the development version of Toil, but not released yet. To try the current development version of Toil, you can install it like this:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
If you want to use a particular branch or commit, like &amp;lt;code&amp;gt;aaa451b320fc115b3563ced25cb501301cf86f90&amp;lt;/code&amp;gt;, you can do:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
==Frequently Asked Questions==&lt;br /&gt;
&lt;br /&gt;
===I am getting warnings about &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt;===&lt;br /&gt;
You may be seeing warnings that &amp;lt;code&amp;gt;XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This happens because Slurm is providing Toil with an &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt; environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it.&lt;br /&gt;
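&lt;br /&gt;
You can check for the problem yourself; this little check (just an illustration, not part of Toil) reports whether your session's &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt; actually exists:&lt;br /&gt;
&lt;br /&gt;
```shell
# Report whether XDG_RUNTIME_DIR is set but points at a directory that
# does not exist, which is the out-of-spec situation Toil warns about.
if [ -n "${XDG_RUNTIME_DIR:-}" ]; then
    if [ ! -d "$XDG_RUNTIME_DIR" ]; then
        echo "XDG_RUNTIME_DIR=$XDG_RUNTIME_DIR does not exist"
    else
        echo "XDG_RUNTIME_DIR exists"
    fi
else
    echo "XDG_RUNTIME_DIR is not set"
fi
```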
&lt;br /&gt;
===Toil said it was &amp;lt;code&amp;gt;Redirecting logging&amp;lt;/code&amp;gt; somewhere, but I can't find that file!===&lt;br /&gt;
The Toil worker process for each job will say that it is &amp;lt;code&amp;gt;Redirecting logging to /data/tmp/somewhere/worker_log.txt&amp;lt;/code&amp;gt;, and when running in single machine mode these messages go to the main Toil log.&lt;br /&gt;
&lt;br /&gt;
The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt; option to Toil.&lt;br /&gt;
&lt;br /&gt;
If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.&lt;br /&gt;
&lt;br /&gt;
=Additional WDL resources=&lt;br /&gt;
&lt;br /&gt;
For more information on writing and running WDL workflows, see:&lt;br /&gt;
&lt;br /&gt;
* [https://docs.openwdl.org/en/stable/ The WDL documentation]&lt;br /&gt;
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The &amp;quot;Learn WDL&amp;quot; video course on YouTube]&lt;br /&gt;
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=506</id>
		<title>Phoenix WDL Tutorial</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=506"/>
		<updated>2024-07-16T20:16:39Z</updated>

		<summary type="html">&lt;p&gt;Anovak: /* Reproducing Problems */ Explain new debug-job features&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Tutorial: Getting Started with WDL Workflows on Phoenix'''&lt;br /&gt;
&lt;br /&gt;
Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you.&lt;br /&gt;
&lt;br /&gt;
Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.&lt;br /&gt;
&lt;br /&gt;
This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.&lt;br /&gt;
&lt;br /&gt;
=Phoenix Cluster Setup=&lt;br /&gt;
&lt;br /&gt;
Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.&lt;br /&gt;
&lt;br /&gt;
==Getting VPN access==&lt;br /&gt;
&lt;br /&gt;
We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.&lt;br /&gt;
&lt;br /&gt;
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.&lt;br /&gt;
&lt;br /&gt;
==Connecting to Phoenix==&lt;br /&gt;
&lt;br /&gt;
Once you have VPN access, you can connect to any of the machines with access to the Phoenix cluster. These interactive nodes are fairly large machines that can do some work locally, but you will still want to run larger workflows on the actual cluster. For this tutorial, we will use &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt; as our login node.&lt;br /&gt;
&lt;br /&gt;
To connect to the cluster:&lt;br /&gt;
&lt;br /&gt;
1. Connect to the VPN.&lt;br /&gt;
2. SSH to &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt;. At the command line, run:&lt;br /&gt;
&lt;br /&gt;
 ssh emerald.prism&lt;br /&gt;
&lt;br /&gt;
If your username on the cluster (say, &amp;lt;code&amp;gt;flastname&amp;lt;/code&amp;gt;) is different than your username on your computer (which might be &amp;lt;code&amp;gt;firstname&amp;lt;/code&amp;gt;), you might instead have to run:&lt;br /&gt;
&lt;br /&gt;
 ssh flastname@emerald.prism&lt;br /&gt;
&lt;br /&gt;
The first time you connect, you will see a message like:&lt;br /&gt;
&lt;br /&gt;
 The authenticity of host 'emerald.prism (10.50.1.67)' can't be established.&lt;br /&gt;
 ED25519 key fingerprint is SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI.&lt;br /&gt;
 This key is not known by any other names.&lt;br /&gt;
 Are you sure you want to continue connecting (yes/no/[fingerprint])?&lt;br /&gt;
&lt;br /&gt;
This is your computer asking you to help it decide if it is talking to the genuine &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt;, and not an imposter. You will want to make sure that the &amp;quot;key fingerprint&amp;quot; is indeed &amp;lt;code&amp;gt;SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI&amp;lt;/code&amp;gt;. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type &amp;lt;code&amp;gt;yes&amp;lt;/code&amp;gt; to accept and remember that the server is who it says it is.&lt;br /&gt;
&lt;br /&gt;
==Installing Toil with WDL support==&lt;br /&gt;
&lt;br /&gt;
Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
If you also want to use AWS S3 &amp;lt;code&amp;gt;s3://&amp;lt;/code&amp;gt; and/or Google &amp;lt;code&amp;gt;gs://&amp;lt;/code&amp;gt; URLs for data, you will need to also install Toil with the &amp;lt;code&amp;gt;aws&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;google&amp;lt;/code&amp;gt; extras, respectively:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
This will install Toil in the &amp;lt;code&amp;gt;.local&amp;lt;/code&amp;gt; directory inside your home directory, which we write as &amp;lt;code&amp;gt;~/.local&amp;lt;/code&amp;gt;. The program to run WDL workflows, &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, will be at &amp;lt;code&amp;gt;~/.local/bin/toil-wdl-runner&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
By default, the command interpreter '''will not''' look there, so if you type &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run:&lt;br /&gt;
&lt;br /&gt;
 echo 'export PATH=&amp;quot;${HOME}/.local/bin:${PATH}&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in''' to restart bash and pick up the change.&lt;br /&gt;
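&lt;br /&gt;
If you don't want to log out and back in just yet, you can apply the same &amp;lt;code&amp;gt;PATH&amp;lt;/code&amp;gt; change to your current shell (a sketch of what the &amp;lt;code&amp;gt;~/.bashrc&amp;lt;/code&amp;gt; line does on every future login):&lt;br /&gt;
&lt;br /&gt;
```shell
# Prepend ~/.local/bin to PATH for the current shell only; the line
# added to ~/.bashrc does the same thing for future logins.
export PATH="${HOME}/.local/bin:${PATH}"
# Show the first PATH entry to confirm the change took effect.
echo "${PATH%%:*}"
```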
&lt;br /&gt;
To make sure it worked, you can run:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --help&lt;br /&gt;
&lt;br /&gt;
If everything worked correctly, it will print a long list of the various option flags that the &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; command supports.&lt;br /&gt;
&lt;br /&gt;
If you ever want to upgrade Toil to a new release, you can repeat the &amp;lt;code&amp;gt;pip&amp;lt;/code&amp;gt; command above.&lt;br /&gt;
&lt;br /&gt;
==Configuring your Phoenix Environment==&lt;br /&gt;
&lt;br /&gt;
'''Do not try to store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. Usually you would end up with &amp;lt;code&amp;gt;/private/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Remember this path; we will need it later.&lt;br /&gt;
&lt;br /&gt;
==Configuring Toil for Phoenix==&lt;br /&gt;
&lt;br /&gt;
Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, so we should configure Toil to use it for caching the Docker container images used for running workflow steps. Since these files can be large, and the home directory quota is only 30 GB, we can't keep them in your home directory; we will need to use the &amp;lt;code&amp;gt;/private/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt; directory you created earlier.&lt;br /&gt;
&lt;br /&gt;
Record that directory in your &amp;lt;code&amp;gt;~/.bashrc&amp;lt;/code&amp;gt; file by editing this command to use your actual path and then running it:&lt;br /&gt;
&lt;br /&gt;
 echo 'BIG_DATA_DIR=/private/groups/YOURGROUPNAME/YOURUSERNAME' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
Then use these commands to make sure that Toil knows where it ought to put its caches:&lt;br /&gt;
&lt;br /&gt;
 echo 'export SINGULARITY_CACHEDIR=&amp;quot;${BIG_DATA_DIR}/.singularity/cache&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE=&amp;quot;${BIG_DATA_DIR}/.cache/miniwdl&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in again''', to apply the changes.&lt;br /&gt;
&lt;br /&gt;
If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day].&lt;br /&gt;
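&lt;br /&gt;
It can also help to create the cache directories up front, so nothing races to create them on the first run. This is a sketch; on Phoenix, &amp;lt;code&amp;gt;BIG_DATA_DIR&amp;lt;/code&amp;gt; would already be set from your &amp;lt;code&amp;gt;~/.bashrc&amp;lt;/code&amp;gt;, and the fallback path here is only a stand-in so the commands run anywhere:&lt;br /&gt;
&lt;br /&gt;
```shell
# Create the Singularity and MiniWDL cache directories ahead of time.
# On Phoenix, BIG_DATA_DIR should already be set in your ~/.bashrc;
# the fallback here is a stand-in so the sketch is self-contained.
BIG_DATA_DIR="${BIG_DATA_DIR:-$HOME/big-data-example}"
mkdir -p "${BIG_DATA_DIR}/.singularity/cache"
mkdir -p "${BIG_DATA_DIR}/.cache/miniwdl"
ls -d "${BIG_DATA_DIR}/.singularity/cache" "${BIG_DATA_DIR}/.cache/miniwdl"
```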
&lt;br /&gt;
'''If you get errors about mutexes, lock files, or other weird problems with Singularity''', try moving these directories to inside &amp;lt;code&amp;gt;/data/tmp&amp;lt;/code&amp;gt; on the individual nodes, or unsetting them and letting Toil use its defaults (and exhaust our Docker pull limits). [https://github.com/DataBiosphere/toil/issues/4654 It is not clear that &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt; actually implements the necessary file locking correctly.]&lt;br /&gt;
&lt;br /&gt;
=Running an existing workflow=&lt;br /&gt;
&lt;br /&gt;
First, let's use &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].&lt;br /&gt;
&lt;br /&gt;
To start, go to your user directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;, and make a directory to work in.&lt;br /&gt;
&lt;br /&gt;
 cd /private/groups/YOURGROUPNAME/YOURUSERNAME&lt;br /&gt;
 mkdir workflow-test&lt;br /&gt;
 cd workflow-test&lt;br /&gt;
&lt;br /&gt;
Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.&lt;br /&gt;
&lt;br /&gt;
 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl&lt;br /&gt;
&lt;br /&gt;
==Preparing an input file==&lt;br /&gt;
&lt;br /&gt;
Near the top of the WDL file, there's a section like this:&lt;br /&gt;
&lt;br /&gt;
 workflow hello_caller {&lt;br /&gt;
     input {&lt;br /&gt;
         File who&lt;br /&gt;
     }&lt;br /&gt;
&lt;br /&gt;
This means that there is a workflow named &amp;lt;code&amp;gt;hello_caller&amp;lt;/code&amp;gt; in this file, and it takes as input a file variable named &amp;lt;code&amp;gt;who&amp;lt;/code&amp;gt;. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.&lt;br /&gt;
&lt;br /&gt;
So first, we have to make that list of names. Let's make it in &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 echo &amp;quot;Mridula Resurrección&amp;quot; &amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Gershom Šarlota&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Ritchie Ravi&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
&lt;br /&gt;
Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt; that references it by relative path, like this:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./names.txt&amp;quot;}' &amp;gt;inputs.json&lt;br /&gt;
&lt;br /&gt;
Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].&lt;br /&gt;
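&lt;br /&gt;
A stray quote or comma in the inputs file will stop the workflow before it starts, so it can be worth checking that the JSON parses. Here is one quick way, a sketch using only the Python standard library:&lt;br /&gt;
&lt;br /&gt;
```shell
# Write the inputs file and confirm it is valid JSON.
echo '{"hello_caller.who": "./names.txt"}' >inputs.json
# json.tool exits nonzero (and prints an error) if the file is not valid JSON.
python3 -m json.tool inputs.json
```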
&lt;br /&gt;
==Testing at small scale on a single machine==&lt;br /&gt;
&lt;br /&gt;
We are now ready to run the workflow!&lt;br /&gt;
&lt;br /&gt;
You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:&lt;br /&gt;
&lt;br /&gt;
 srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i&lt;br /&gt;
&lt;br /&gt;
This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In your new shell, run this Toil command:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner self_test.wdl inputs.json -o local_run&lt;br /&gt;
&lt;br /&gt;
This will, by default, use the &amp;lt;code&amp;gt;single_machine&amp;lt;/code&amp;gt; Toil &amp;quot;batch system&amp;quot; to run all of the workflow's tasks locally. Output will be sent to a new directory named &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
This will print a lot of logging to standard error, and to standard output it will print:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;hello_caller.message_files&amp;quot;: [&amp;quot;local_run/Mridula Resurrecci\u00f3n.txt&amp;quot;, &amp;quot;local_run/Gershom \u0160arlota.txt&amp;quot;, &amp;quot;local_run/Ritchie Ravi.txt&amp;quot;], &amp;quot;hello_caller.messages&amp;quot;: [&amp;quot;Hello, Mridula Resurrecci\u00f3n!&amp;quot;, &amp;quot;Hello, Gershom \u0160arlota!&amp;quot;, &amp;quot;Hello, Ritchie Ravi!&amp;quot;]}&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt; directory will contain the described text files (with Unicode escape sequences like &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt; replaced by their corresponding characters), each containing a greeting for the corresponding person.&lt;br /&gt;
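&lt;br /&gt;
If you want to see what one of those escape sequences stands for, you can decode it yourself (a quick illustration, not part of the workflow):&lt;br /&gt;
&lt;br /&gt;
```shell
# Python interprets the same \u00XX escapes that appear in the JSON output.
# Prints: Mridula Resurrección
python3 -c 'print("Mridula Resurrecci\u00f3n")'
```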
&lt;br /&gt;
To leave your interactive Slurm session and return to the head node, use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Running at larger scale==&lt;br /&gt;
&lt;br /&gt;
Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people!&lt;br /&gt;
&lt;br /&gt;
Go get this handy list of people and cut it to length:&lt;br /&gt;
&lt;br /&gt;
 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt&lt;br /&gt;
 head -n100 1000_names.txt &amp;gt;100_names.txt&lt;br /&gt;
&lt;br /&gt;
And make a new inputs file:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./100_names.txt&amp;quot;}' &amp;gt;inputs_big.json&lt;br /&gt;
&lt;br /&gt;
Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.&lt;br /&gt;
&lt;br /&gt;
To run against the Slurm cluster, we need to use the &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the &amp;lt;code&amp;gt;--caching false&amp;lt;/code&amp;gt; option. We will add the &amp;lt;code&amp;gt;--batchLogsDir&amp;lt;/code&amp;gt; option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the &amp;lt;code&amp;gt;-m&amp;lt;/code&amp;gt; option to save the output JSON to a file instead of printing it.&lt;br /&gt;
&lt;br /&gt;
Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the &amp;lt;code&amp;gt;TOIL_SLURM_ARGS&amp;lt;/code&amp;gt; environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in.&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json&lt;br /&gt;
&lt;br /&gt;
This will tick for a while, but eventually you should end up with 100 greeting files in the &amp;lt;code&amp;gt;slurm_run&amp;lt;/code&amp;gt; directory.&lt;br /&gt;
&lt;br /&gt;
=Writing your own workflow=&lt;br /&gt;
&lt;br /&gt;
In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. &lt;br /&gt;
&lt;br /&gt;
==Writing the file==&lt;br /&gt;
&lt;br /&gt;
===Version===&lt;br /&gt;
&lt;br /&gt;
All WDL files need to start with a &amp;lt;code&amp;gt;version&amp;lt;/code&amp;gt; statement (unless they are very old &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; files). Toil supports &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt;, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; and 1.0.&lt;br /&gt;
&lt;br /&gt;
So let's start a new WDL 1.0 workflow. Open up a file named &amp;lt;code&amp;gt;fizzbuzz.wdl&amp;lt;/code&amp;gt; and start with a version statement:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
&lt;br /&gt;
===Workflow Block===&lt;br /&gt;
&lt;br /&gt;
Then, add an empty &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; named &amp;lt;code&amp;gt;FizzBuzz&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Input Block===&lt;br /&gt;
&lt;br /&gt;
Workflows usually need some kind of user input, so let's give our workflow an &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; section.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Notice that each input has a type, a name, and an optional default value. If the type ends in &amp;lt;code&amp;gt;?&amp;lt;/code&amp;gt;, the value is optional, and it may be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.&lt;br /&gt;
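&lt;br /&gt;
For example, a hypothetical inputs file for this workflow only has to provide &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;; the defaulted and optional inputs can be set or omitted as you like:&lt;br /&gt;
&lt;br /&gt;
```shell
# item_count is required; to_fizz and to_buzz fall back to their defaults;
# fizzbuzz_override is optional and is set here just for illustration.
echo '{"FizzBuzz.item_count": 20, "FizzBuzz.fizzbuzz_override": "Fizz Buzz"}' >fizzbuzz_inputs.json
python3 -m json.tool fizzbuzz_inputs.json
```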
&lt;br /&gt;
===Body===&lt;br /&gt;
&lt;br /&gt;
Now we'll start on the body of the workflow, to be inserted just after the inputs section. &lt;br /&gt;
&lt;br /&gt;
The first thing we're going to need to do is create an array of all the numbers up to the &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;. We can do this by calling the WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; function, and assigning the result to an &amp;lt;code&amp;gt;Array[Int]&amp;lt;/code&amp;gt; variable.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
&lt;br /&gt;
WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.&lt;br /&gt;
&lt;br /&gt;
===Scattering===&lt;br /&gt;
&lt;br /&gt;
Once we create an array of all the numbers, we can use a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt; to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a &amp;lt;code&amp;gt;map()&amp;lt;/code&amp;gt; in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; starts at 0.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
     Int one_based = i + 1&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Conditionals===&lt;br /&gt;
&lt;br /&gt;
Inside the body of the scatter, we are going to put some conditionals to determine if we should produce &amp;lt;code&amp;gt;&amp;quot;Fizz&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;Buzz&amp;quot;&amp;lt;/code&amp;gt;, or &amp;lt;code&amp;gt;&amp;quot;FizzBuzz&amp;quot;&amp;lt;/code&amp;gt;. To support our &amp;lt;code&amp;gt;fizzbuzz_override&amp;lt;/code&amp;gt;, we use an array of it and a default value, and use the WDL &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; function to find the first non-null value in that array.&lt;br /&gt;
&lt;br /&gt;
Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; at the end and take advantage of variables from un-executed conditionals being &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Note that WDL supports conditional ''expressions'' with a &amp;lt;code&amp;gt;then&amp;lt;/code&amp;gt; and an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt;, but conditional ''statements'' only have a body, not an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt; branch. If you need an else you will have to check the negated condition.&lt;br /&gt;
&lt;br /&gt;
So first, let's handle the special cases.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
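&lt;br /&gt;
If the nested conditionals are hard to follow, here is the same decision logic written as ordinary Bash, purely for comparison (Bash has &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt; branches, and runs sequentially instead of as parallel jobs):&lt;br /&gt;
&lt;br /&gt;
```shell
# The same FizzBuzz decisions as the WDL scatter body above, in plain Bash.
to_fizz=3
to_buzz=5
for i in $(seq 1 15); do
    # Both remainders are zero exactly when their sum is zero.
    if [ $(( (i % to_fizz) + (i % to_buzz) )) -eq 0 ]; then
        echo "FizzBuzz"
    elif [ $((i % to_fizz)) -eq 0 ]; then
        echo "Fizz"
    elif [ $((i % to_buzz)) -eq 0 ]; then
        echo "Buzz"
    else
        # Just a normal number.
        echo "$i"
    fi
done
```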
&lt;br /&gt;
===Calling Tasks===&lt;br /&gt;
&lt;br /&gt;
Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a &amp;lt;code&amp;gt;${}&amp;lt;/code&amp;gt; substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a &amp;lt;code&amp;gt;stringify_number&amp;lt;/code&amp;gt; task, to be written later.&lt;br /&gt;
&lt;br /&gt;
To call a task (or another workflow), we use a &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt; statement and give it some inputs. Then we can fish the output values out of the task with &amp;lt;code&amp;gt;.&amp;lt;/code&amp;gt; access. Like other variables declared inside a conditional, a call's outputs are &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt; outside the conditional if the call did not actually run.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
        call stringify_number {&lt;br /&gt;
            input:&lt;br /&gt;
                the_number = one_based&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We can put the code into the workflow now, and set about writing the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         &lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Writing Tasks===&lt;br /&gt;
&lt;br /&gt;
Our task should go after the workflow in the file. It looks a lot like a workflow except it uses &amp;lt;code&amp;gt;task&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're going to want it to take in an integer &amp;lt;code&amp;gt;the_number&amp;lt;/code&amp;gt;, and we're going to want it to output a string &amp;lt;code&amp;gt;the_string&amp;lt;/code&amp;gt;. So let's fill that in in &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; sections.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     # ???&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now, unlike workflows, tasks can have a &amp;lt;code&amp;gt;command&amp;lt;/code&amp;gt; section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use &amp;lt;code&amp;gt;~{}&amp;lt;/code&amp;gt;, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now we need to capture the result of the command script. The WDL &amp;lt;code&amp;gt;stdout()&amp;lt;/code&amp;gt; returns a WDL &amp;lt;code&amp;gt;File&amp;lt;/code&amp;gt; containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
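&lt;br /&gt;
The trailing-newline stripping matters here: &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; ends its output with a newline, which would otherwise end up inside the string. A rough Python sketch of the behavior (an illustration, not Toil's actual implementation):&lt;br /&gt;
&lt;br /&gt;
```python
def read_string(text: str) -> str:
    # Like WDL's read_string(): drop trailing line endings,
    # such as the final newline that echo appends.
    return text.rstrip("\r\n")

assert read_string("42\n") == "42"  # the newline from echo is removed
assert read_string("42") == "42"    # already-clean strings pass through
```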
&lt;br /&gt;
We're also going to want to add a &amp;lt;code&amp;gt;runtime&amp;lt;/code&amp;gt; section to our task, to specify resource requirements. We'll also tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;disks&amp;lt;/code&amp;gt; section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a &amp;lt;code&amp;gt;local-disk&amp;lt;/code&amp;gt; of a certain number of gigabytes, optionally hinting that it should be &amp;lt;code&amp;gt;SSD&amp;lt;/code&amp;gt; storage.&lt;br /&gt;
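&lt;br /&gt;
For illustration, such a string can be pulled apart like this (a hypothetical sketch in Python, not Toil's actual parser):&lt;br /&gt;
&lt;br /&gt;
```python
def parse_disks(spec: str):
    # Split a Cromwell-style disks string like "local-disk 1 SSD"
    # into its mount name, size in gigabytes, and disk type.
    parts = spec.split()
    name = parts[0]
    size_gb = int(parts[1])
    disk_type = parts[2] if len(parts) > 2 else "HDD"
    return name, size_gb, disk_type

assert parse_disks("local-disk 1 SSD") == ("local-disk", 1, "SSD")
```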
&lt;br /&gt;
Then we can put our task into our WDL file:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Output Block===&lt;br /&gt;
&lt;br /&gt;
Now the only thing missing is a workflow-level &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section. Technically, WDL 1.0 doesn't require one, but WDL 1.1 does, and Toil doesn't yet deliver your outputs anywhere if you don't have one, so we're going to write one. We need to collect all the strings that came out of the different tasks in our scatter into an &amp;lt;code&amp;gt;Array[String]&amp;lt;/code&amp;gt;. We'll add the &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section at the end of the &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; block, above the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
     output {&lt;br /&gt;
        Array[String] fizzbuzz_results = result&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Because the &amp;lt;code&amp;gt;result&amp;lt;/code&amp;gt; variable is defined inside a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt;, when we reference it outside the scatter we see it as being an array.&lt;br /&gt;
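&lt;br /&gt;
As an analogy only (in reality Toil runs each scatter iteration as its own job), the workflow's logic can be written out in plain Python, which also shows how &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; picks the first value in the list that is actually defined:&lt;br /&gt;
&lt;br /&gt;
```python
def select_first(values):
    # Like WDL's select_first(): return the first non-None value.
    for v in values:
        if v is not None:
            return v
    raise ValueError("no defined value in list")

def fizzbuzz(item_count, to_fizz=3, to_buzz=5, fizzbuzz_override=None):
    results = []
    for i in range(item_count):           # the scatter
        one_based = i + 1
        fizz = "Fizz" if one_based % to_fizz == 0 else None
        buzz = "Buzz" if one_based % to_buzz == 0 else None
        fb = None
        if fizz is not None and buzz is not None:
            fb = select_first([fizzbuzz_override, "FizzBuzz"])
        stringified = None
        if fizz is None and buzz is None:
            stringified = str(one_based)  # stands in for stringify_number
        results.append(select_first([fb, fizz, buzz, stringified]))
    return results

assert fizzbuzz(5) == ["1", "2", "Fizz", "4", "Buzz"]
```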
&lt;br /&gt;
==Running the Workflow==&lt;br /&gt;
&lt;br /&gt;
Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;FizzBuzz.item_count&amp;quot;: 20}' &amp;gt;fizzbuzz.json&lt;br /&gt;
&lt;br /&gt;
Then run it on the cluster with Toil:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
Or locally:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
=Debugging Workflows=&lt;br /&gt;
Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.&lt;br /&gt;
&lt;br /&gt;
==Restarting the Workflow==&lt;br /&gt;
&lt;br /&gt;
If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have fixed, and you ran the workflow with &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; set manually to a directory that persists between attempts, you can add &amp;lt;code&amp;gt;--restart&amp;lt;/code&amp;gt; to your workflow command and make Toil try again. It will pick up from where it left off and rerun any failed tasks and then the rest of the workflow.&lt;br /&gt;
&lt;br /&gt;
This ''will not'' pick up any changes to your WDL source code files; those are read once at the beginning and not re-read on restart.&lt;br /&gt;
&lt;br /&gt;
If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques.&lt;br /&gt;
&lt;br /&gt;
==Debugging Options==&lt;br /&gt;
When debugging a workflow, make sure to run the workflow with &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt;, to set the log level to &amp;lt;code&amp;gt;DEBUG&amp;lt;/code&amp;gt;, and with &amp;lt;code&amp;gt;--jobStore /some/path/to/a/shared/directory/it/can/create&amp;lt;/code&amp;gt; so that the stored files shipped between jobs are in a place you can access them.&lt;br /&gt;
&lt;br /&gt;
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:&lt;br /&gt;
&lt;br /&gt;
 =========&amp;gt;&lt;br /&gt;
         Toil job log is here&lt;br /&gt;
 &amp;lt;=========&lt;br /&gt;
&lt;br /&gt;
Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this.&lt;br /&gt;
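&lt;br /&gt;
If you need to pull those embedded job logs back out of a large main log, a short script can do it. A sketch in Python, assuming the marker lines appear exactly as shown above:&lt;br /&gt;
&lt;br /&gt;
```python
def extract_job_logs(log_text: str):
    # Collect the lines between the "=========>" and "<========="
    # markers that Toil uses to bracket each embedded job log.
    logs, current = [], None
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped == "=========>":
            current = []
        elif stripped == "<=========":
            if current is not None:
                logs.append("\n".join(current))
            current = None
        elif current is not None:
            current.append(stripped)
    return logs

sample = "junk\n=========>\n\tToil job log is here\n<=========\nmore junk"
assert extract_job_logs(sample) == ["Toil job log is here"]
```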
&lt;br /&gt;
==Reading the Log==&lt;br /&gt;
&lt;br /&gt;
When a WDL workflow fails, you are likely to see a message like this:&lt;br /&gt;
&lt;br /&gt;
 WDL.runtime.error.CommandFailed: task command failed with exit status 1&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism&lt;br /&gt;
&lt;br /&gt;
This means that the command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit status. That happens either when the command is written wrong, or when the tool you are trying to run detects and reports an error.&lt;br /&gt;
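&lt;br /&gt;
If exit statuses are unfamiliar, this Python snippet demonstrates what &amp;quot;failing (i.e. nonzero)&amp;quot; means, using the standard POSIX &amp;lt;code&amp;gt;true&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;false&amp;lt;/code&amp;gt; commands:&lt;br /&gt;
&lt;br /&gt;
```python
import subprocess

# An exit status of 0 means success; anything nonzero is a failure,
# and WDL treats any nonzero status as a task failure.
ok = subprocess.run(["true"])
bad = subprocess.run(["false"])
assert ok.returncode == 0
assert bad.returncode != 0
```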
&lt;br /&gt;
Go up higher in the log until you find lines that look like:&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows:&lt;br /&gt;
&lt;br /&gt;
And&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows:&lt;br /&gt;
&lt;br /&gt;
These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.&lt;br /&gt;
&lt;br /&gt;
==Reproducing Problems==&lt;br /&gt;
&lt;br /&gt;
When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this.&lt;br /&gt;
&lt;br /&gt;
=== Automatically Fetching Input Files ===&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;toil debug-job&amp;lt;/code&amp;gt; command has a &amp;lt;code&amp;gt;--retrieveTaskDirectory&amp;lt;/code&amp;gt; option that lets you dump out a directory with all the files that a failing WDL task would use. You can use it like:&lt;br /&gt;
&lt;br /&gt;
 toil debug-job ./jobstore WDLTaskJob --retrieveTaskDirectory dumpdir&lt;br /&gt;
&lt;br /&gt;
If there are multiple failing tasks, you might need to replace &amp;lt;code&amp;gt;WDLTaskJob&amp;lt;/code&amp;gt; with the name of one of the failing jobs. See [https://toil.readthedocs.io/en/latest/running/debugging.html#fetching-job-inputs the Toil documentation on retrieving files] for more on how to use this command.&lt;br /&gt;
&lt;br /&gt;
=== Manually Finding Input Files ===&lt;br /&gt;
&lt;br /&gt;
If you can't use &amp;lt;code&amp;gt;toil debug-job&amp;lt;/code&amp;gt;, you might need to manually dig through the job store for files. In the log of your failing Toil task, look for lines like this:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&amp;lt;/code&amp;gt; part is a Toil file ID, and it is a relative path from your &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; value to where the file is stored on disk. So if you ran the workflow with &amp;lt;code&amp;gt;--jobStore /private/groups/patenlab/anovak/jobstore&amp;lt;/code&amp;gt;, you would look for this file at:&lt;br /&gt;
&lt;br /&gt;
 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&lt;br /&gt;
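&lt;br /&gt;
In other words, the full path is just the &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; value joined with the file ID, as in this small sketch:&lt;br /&gt;
&lt;br /&gt;
```python
import os.path

def file_in_job_store(job_store: str, file_id: str) -> str:
    # The file ID from the log is a path relative to the job store.
    return os.path.join(job_store, file_id)

path = file_in_job_store(
    "/private/groups/patenlab/anovak/jobstore",
    "files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/"
    "file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam")
assert path.endswith("/jobstore/files/for-job/kind-WDLTaskJob/"
                     "instance-b4c5x6hq/"
                     "file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam")
```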
&lt;br /&gt;
==More Ways of Finding Files==&lt;br /&gt;
&lt;br /&gt;
Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the &amp;lt;code&amp;gt;find&amp;lt;/code&amp;gt; command to try to find the files by name. For example, if you want to look at &amp;lt;code&amp;gt;Sample.bam&amp;lt;/code&amp;gt;, you can look for it like this:&lt;br /&gt;
&lt;br /&gt;
 find /path/to/the/jobstore -name &amp;quot;Sample.bam&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
You can take the &amp;lt;code&amp;gt;toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt; URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this:&lt;br /&gt;
&lt;br /&gt;
 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
Then you can take the part after the last colon, &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt;, and that is the path relative to the job store where this file can be found.&lt;br /&gt;
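&lt;br /&gt;
Those two steps (URL-decode, then take the part after the last colon) are easy to script. A sketch, assuming the &amp;lt;code&amp;gt;toilfile:&amp;lt;/code&amp;gt; URI shape shown above:&lt;br /&gt;
&lt;br /&gt;
```python
from urllib.parse import unquote

def toilfile_to_relative_path(uri: str) -> str:
    # URL-decode the URI, then keep the part after the last colon,
    # which is the path relative to the job store.
    decoded = unquote(uri)
    return decoded.rsplit(":", 1)[1]

uri = ("toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob"
       "%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351"
       "%2FSample.chr14.bam/Sample.chr14.bam")
assert toilfile_to_relative_path(uri).startswith("files/for-job/")
```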
&lt;br /&gt;
==Using Development Versions of Toil==&lt;br /&gt;
&lt;br /&gt;
Sometimes, bugs will be fixed in the development version of Toil, but not released yet. To try the current development version of Toil, you can install it like this:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
If you want to use a particular branch or commit, like &amp;lt;code&amp;gt;aaa451b320fc115b3563ced25cb501301cf86f90&amp;lt;/code&amp;gt;, you can do:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
==Frequently Asked Questions==&lt;br /&gt;
&lt;br /&gt;
===I am getting warnings about &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt;===&lt;br /&gt;
You may be seeing warnings that &amp;lt;code&amp;gt;XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This happens because Slurm is providing Toil with an &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt; environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it.&lt;br /&gt;
&lt;br /&gt;
===Toil said it was &amp;lt;code&amp;gt;Redirecting logging&amp;lt;/code&amp;gt; somewhere, but I can't find that file!===&lt;br /&gt;
The Toil worker process for each job will say that it is &amp;lt;code&amp;gt;Redirecting logging to /data/tmp/somewhere/worker_log.txt&amp;lt;/code&amp;gt;, and when running in single machine mode these messages go to the main Toil log.&lt;br /&gt;
&lt;br /&gt;
The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt; option to Toil.&lt;br /&gt;
&lt;br /&gt;
If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.&lt;br /&gt;
&lt;br /&gt;
=Additional WDL resources=&lt;br /&gt;
&lt;br /&gt;
For more information on writing and running WDL workflows, see:&lt;br /&gt;
&lt;br /&gt;
* [https://docs.openwdl.org/en/stable/ The WDL documentation]&lt;br /&gt;
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The &amp;quot;Learn WDL&amp;quot; video course on YouTube]&lt;br /&gt;
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=505</id>
		<title>Phoenix WDL Tutorial</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=505"/>
		<updated>2024-07-16T20:06:55Z</updated>

		<summary type="html">&lt;p&gt;Anovak: /* Connecting to Phoenix */ Use emerald login node&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Tutorial: Getting Started with WDL Workflows on Phoenix'''&lt;br /&gt;
&lt;br /&gt;
Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you.&lt;br /&gt;
&lt;br /&gt;
Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.&lt;br /&gt;
&lt;br /&gt;
This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.&lt;br /&gt;
&lt;br /&gt;
=Phoenix Cluster Setup=&lt;br /&gt;
&lt;br /&gt;
Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.&lt;br /&gt;
&lt;br /&gt;
==Getting VPN access==&lt;br /&gt;
&lt;br /&gt;
We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.&lt;br /&gt;
&lt;br /&gt;
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.&lt;br /&gt;
&lt;br /&gt;
==Connecting to Phoenix==&lt;br /&gt;
&lt;br /&gt;
Once you have VPN access, you can connect to any of the machines with access to the Phoenix cluster. These interactive nodes are fairly large machines that can do some work locally, but you will still want to run larger workflows on the actual cluster. For this tutorial, we will use &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt; as our login node.&lt;br /&gt;
&lt;br /&gt;
To connect to the cluster:&lt;br /&gt;
&lt;br /&gt;
1. Connect to the VPN.&lt;br /&gt;
2. SSH to &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt;. At the command line, run:&lt;br /&gt;
&lt;br /&gt;
 ssh emerald.prism&lt;br /&gt;
&lt;br /&gt;
If your username on the cluster (say, &amp;lt;code&amp;gt;flastname&amp;lt;/code&amp;gt;) is different than your username on your computer (which might be &amp;lt;code&amp;gt;firstname&amp;lt;/code&amp;gt;), you might instead have to run:&lt;br /&gt;
&lt;br /&gt;
 ssh flastname@emerald.prism&lt;br /&gt;
&lt;br /&gt;
The first time you connect, you will see a message like:&lt;br /&gt;
&lt;br /&gt;
 The authenticity of host 'emerald.prism (10.50.1.67)' can't be established.&lt;br /&gt;
 ED25519 key fingerprint is SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI.&lt;br /&gt;
 This key is not known by any other names.&lt;br /&gt;
 Are you sure you want to continue connecting (yes/no/[fingerprint])?&lt;br /&gt;
&lt;br /&gt;
This is your computer asking you to help it decide if it is talking to the genuine &amp;lt;code&amp;gt;emerald.prism&amp;lt;/code&amp;gt;, and not an imposter. You will want to make sure that the &amp;quot;key fingerprint&amp;quot; is indeed &amp;lt;code&amp;gt;SHA256:8hJQShO6jhrym9UVyMldKsKOnOFtWRChgjK5cZNhkAI&amp;lt;/code&amp;gt;. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type &amp;lt;code&amp;gt;yes&amp;lt;/code&amp;gt; to accept and remember that the server is who it says it is.&lt;br /&gt;
&lt;br /&gt;
==Installing Toil with WDL support==&lt;br /&gt;
&lt;br /&gt;
Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
If you also want to use AWS S3 &amp;lt;code&amp;gt;s3://&amp;lt;/code&amp;gt; and/or Google &amp;lt;code&amp;gt;gs://&amp;lt;/code&amp;gt; URLs for data, you will need to also install Toil with the &amp;lt;code&amp;gt;aws&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;google&amp;lt;/code&amp;gt; extras, respectively:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
This will install Toil in the &amp;lt;code&amp;gt;.local&amp;lt;/code&amp;gt; directory inside your home directory, which we write as &amp;lt;code&amp;gt;~/.local&amp;lt;/code&amp;gt;. The program to run WDL workflows, &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, will be at &amp;lt;code&amp;gt;~/.local/bin/toil-wdl-runner&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
By default, the command interpreter ''will not'' look there, so if you type &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run:&lt;br /&gt;
&lt;br /&gt;
 echo 'export PATH=&amp;quot;${HOME}/.local/bin:${PATH}&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in''', to restart bash and pick up the change.&lt;br /&gt;
&lt;br /&gt;
To make sure it worked, you can run:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --help&lt;br /&gt;
&lt;br /&gt;
If everything worked correctly, it will print a long list of the various option flags that the &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; command supports.&lt;br /&gt;
&lt;br /&gt;
If you ever want to upgrade Toil to a new release, you can repeat the &amp;lt;code&amp;gt;pip&amp;lt;/code&amp;gt; command above.&lt;br /&gt;
&lt;br /&gt;
==Configuring your Phoenix Environment==&lt;br /&gt;
&lt;br /&gt;
'''Do not try and store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. Usually you would end up with &amp;lt;code&amp;gt;/private/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Remember this path; we will need it later.&lt;br /&gt;
&lt;br /&gt;
==Configuring Toil for Phoenix==&lt;br /&gt;
&lt;br /&gt;
Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. However, since these files can be large, and the home directory quota is only 30 GB, we can't keep these in your home directory. We will need to use the &amp;lt;code&amp;gt;/private/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt; directory you created earlier.&lt;br /&gt;
&lt;br /&gt;
Make sure that directory is recorded in your &amp;lt;code&amp;gt;~/.bashrc&amp;lt;/code&amp;gt; file by editing this command to use your own path and then running it:&lt;br /&gt;
&lt;br /&gt;
 echo 'BIG_DATA_DIR=/private/groups/YOURGROUPNAME/YOURUSERNAME' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
Then use these commands to make sure that Toil knows where it ought to put its caches:&lt;br /&gt;
&lt;br /&gt;
 echo 'export SINGULARITY_CACHEDIR=&amp;quot;${BIG_DATA_DIR}/.singularity/cache&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE=&amp;quot;${BIG_DATA_DIR}/.cache/miniwdl&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in again''', to apply the changes.&lt;br /&gt;
&lt;br /&gt;
If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day].&lt;br /&gt;
&lt;br /&gt;
'''If you get errors about mutexes, lock files, or other weird problems with Singularity''', try moving these directories to inside &amp;lt;code&amp;gt;/data/tmp&amp;lt;/code&amp;gt; on the individual nodes, or unsetting them and letting Toil use its defaults (and exhaust our Docker pull limits). [https://github.com/DataBiosphere/toil/issues/4654 It is not clear that &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt; actually implements the necessary file locking correctly.]&lt;br /&gt;
&lt;br /&gt;
=Running an existing workflow=&lt;br /&gt;
&lt;br /&gt;
First, let's use &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].&lt;br /&gt;
&lt;br /&gt;
First, go to your user directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;, and make a directory to work in.&lt;br /&gt;
&lt;br /&gt;
 cd /private/groups/YOURGROUPNAME/YOURUSERNAME&lt;br /&gt;
 mkdir workflow-test&lt;br /&gt;
 cd workflow-test&lt;br /&gt;
&lt;br /&gt;
Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.&lt;br /&gt;
&lt;br /&gt;
 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl&lt;br /&gt;
&lt;br /&gt;
==Preparing an input file==&lt;br /&gt;
&lt;br /&gt;
Near the top of the WDL file, there's a section like this:&lt;br /&gt;
&lt;br /&gt;
 workflow hello_caller {&lt;br /&gt;
     input {&lt;br /&gt;
         File who&lt;br /&gt;
     }&lt;br /&gt;
&lt;br /&gt;
This means that there is a workflow named &amp;lt;code&amp;gt;hello_caller&amp;lt;/code&amp;gt; in this file, and it takes as input a file variable named &amp;lt;code&amp;gt;who&amp;lt;/code&amp;gt;. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.&lt;br /&gt;
&lt;br /&gt;
So first, we have to make that list of names. Let's make it in &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 echo &amp;quot;Mridula Resurrección&amp;quot; &amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Gershom Šarlota&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Ritchie Ravi&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
&lt;br /&gt;
Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt; that references it by relative path, like this:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./names.txt&amp;quot;}' &amp;gt;inputs.json&lt;br /&gt;
&lt;br /&gt;
Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].&lt;br /&gt;
&lt;br /&gt;
==Testing at small scale on a single machine==&lt;br /&gt;
&lt;br /&gt;
We are now ready to run the workflow!&lt;br /&gt;
&lt;br /&gt;
You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:&lt;br /&gt;
&lt;br /&gt;
 srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i&lt;br /&gt;
&lt;br /&gt;
This will start a new shell on a worker node that can run for up to 2 hours.&lt;br /&gt;
&lt;br /&gt;
In your new shell, run this Toil command:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner self_test.wdl inputs.json -o local_run&lt;br /&gt;
&lt;br /&gt;
This will, by default, use the &amp;lt;code&amp;gt;single_machine&amp;lt;/code&amp;gt; Toil &amp;quot;batch system&amp;quot; to run all of the workflow's tasks locally. Output will be sent to a new directory named &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
This will print a lot of logging to standard error, and to standard output it will print:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;hello_caller.message_files&amp;quot;: [&amp;quot;local_run/Mridula Resurrecci\u00f3n.txt&amp;quot;, &amp;quot;local_run/Gershom \u0160arlota.txt&amp;quot;, &amp;quot;local_run/Ritchie Ravi.txt&amp;quot;], &amp;quot;hello_caller.messages&amp;quot;: [&amp;quot;Hello, Mridula Resurrecci\u00f3n!&amp;quot;, &amp;quot;Hello, Gershom \u0160arlota!&amp;quot;, &amp;quot;Hello, Ritchie Ravi!&amp;quot;]}&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt; directory will contain the described text files (with Unicode escape sequences like &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt; replaced by their corresponding characters), each containing a greeting for the corresponding person.&lt;br /&gt;
&lt;br /&gt;
To leave your interactive Slurm session and return to the head node, use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Running at larger scale==&lt;br /&gt;
&lt;br /&gt;
Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people!&lt;br /&gt;
&lt;br /&gt;
Go get this handy list of people and cut it to length:&lt;br /&gt;
&lt;br /&gt;
 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt&lt;br /&gt;
 head -n100 1000_names.txt &amp;gt;100_names.txt&lt;br /&gt;
&lt;br /&gt;
And make a new inputs file:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./100_names.txt&amp;quot;}' &amp;gt;inputs_big.json&lt;br /&gt;
&lt;br /&gt;
Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.&lt;br /&gt;
&lt;br /&gt;
To run against the Slurm cluster, we need to use the &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the &amp;lt;code&amp;gt;--caching false&amp;lt;/code&amp;gt; option. We will add the &amp;lt;code&amp;gt;--batchLogsDir&amp;lt;/code&amp;gt; option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the &amp;lt;code&amp;gt;-m&amp;lt;/code&amp;gt; option to save the output JSON to a file instead of printing it.&lt;br /&gt;
&lt;br /&gt;
Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the &amp;lt;code&amp;gt;TOIL_SLURM_ARGS&amp;lt;/code&amp;gt; environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in.&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json&lt;br /&gt;
&lt;br /&gt;
This will tick for a while, but eventually you should end up with 100 greeting files in the &amp;lt;code&amp;gt;slurm_run&amp;lt;/code&amp;gt; directory.&lt;br /&gt;
&lt;br /&gt;
=Writing your own workflow=&lt;br /&gt;
&lt;br /&gt;
In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. &lt;br /&gt;
&lt;br /&gt;
==Writing the file==&lt;br /&gt;
&lt;br /&gt;
===Version===&lt;br /&gt;
&lt;br /&gt;
All WDL files need to start with a &amp;lt;code&amp;gt;version&amp;lt;/code&amp;gt; statement (unless they are very old &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; files). Toil supports &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt;, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; and 1.0.&lt;br /&gt;
&lt;br /&gt;
So let's start a new WDL 1.0 workflow. Open up a file named &amp;lt;code&amp;gt;fizzbuzz.wdl&amp;lt;/code&amp;gt; and start with a version statement:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
&lt;br /&gt;
===Workflow Block===&lt;br /&gt;
&lt;br /&gt;
Then, add an empty &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; named &amp;lt;code&amp;gt;FizzBuzz&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Input Block===&lt;br /&gt;
&lt;br /&gt;
Workflows usually need some kind of user input, so let's give our workflow an &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; section.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Notice that each input has a type, a name, and an optional default value. If the type ends in &amp;lt;code&amp;gt;?&amp;lt;/code&amp;gt;, the value is optional, and it may be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.&lt;br /&gt;
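&lt;br /&gt;
For example (a hypothetical inputs file, not needed for this tutorial), only &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt; ''must'' be set, but the defaulted and optional inputs can be overridden too:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;FizzBuzz.item_count&amp;quot;: 15, &amp;quot;FizzBuzz.to_fizz&amp;quot;: 7, &amp;quot;FizzBuzz.fizzbuzz_override&amp;quot;: &amp;quot;Fizz Buzz!&amp;quot;}&lt;br /&gt;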
&lt;br /&gt;
===Body===&lt;br /&gt;
&lt;br /&gt;
Now we'll start on the body of the workflow, to be inserted just after the inputs section. &lt;br /&gt;
&lt;br /&gt;
The first thing we're going to need to do is create an array of all the numbers up to the &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;. We can do this by calling the WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; function, and assigning the result to an &amp;lt;code&amp;gt;Array[Int]&amp;lt;/code&amp;gt; variable.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
&lt;br /&gt;
WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.&lt;br /&gt;
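&lt;br /&gt;
For example (illustrative snippets, not part of our workflow), a couple of common standard library calls:&lt;br /&gt;
&lt;br /&gt;
 # length() counts the elements of an array: here, 3&lt;br /&gt;
 Int how_many = length([1, 4, 9])&lt;br /&gt;
 # basename() strips the directory part from a path&lt;br /&gt;
 String leaf = basename(&amp;quot;/private/groups/example/Sample.bam&amp;quot;)&lt;br /&gt;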
&lt;br /&gt;
===Scattering===&lt;br /&gt;
&lt;br /&gt;
Once we create an array of all the numbers, we can use a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt; to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a &amp;lt;code&amp;gt;map()&amp;lt;/code&amp;gt; in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; starts at 0.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
     Int one_based = i + 1&lt;br /&gt;
 }&lt;br /&gt;
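&lt;br /&gt;
If you know Python, the scatter above can be sketched as a list comprehension (an analogy only; WDL runs the scatter iterations in parallel, and the &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt; value here is made up):&lt;br /&gt;

```python
# Python analogy for the WDL scatter above: transform each element
# of the input array independently and collect the results.
item_count = 5                        # example value for the workflow input
numbers = range(item_count)           # like WDL range(item_count): 0 through 4
one_based = [i + 1 for i in numbers]  # like the scatter body, run per element
print(one_based)                      # [1, 2, 3, 4, 5]
```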
&lt;br /&gt;
===Conditionals===&lt;br /&gt;
&lt;br /&gt;
Inside the body of the scatter, we are going to put some conditionals to determine if we should produce &amp;lt;code&amp;gt;&amp;quot;Fizz&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;Buzz&amp;quot;&amp;lt;/code&amp;gt;, or &amp;lt;code&amp;gt;&amp;quot;FizzBuzz&amp;quot;&amp;lt;/code&amp;gt;. To support our &amp;lt;code&amp;gt;fizzbuzz_override&amp;lt;/code&amp;gt;, we put it and a default value into an array, and use the WDL &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; function to find the first non-null value in that array.&lt;br /&gt;
&lt;br /&gt;
Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; at the end and take advantage of variables from un-executed conditionals being &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Note that WDL supports conditional ''expressions'' with a &amp;lt;code&amp;gt;then&amp;lt;/code&amp;gt; and an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt;, but conditional ''statements'' only have a body, not an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt; branch. If you need an else you will have to check the negated condition.&lt;br /&gt;
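&lt;br /&gt;
For example (illustrative, not part of our workflow), a conditional ''expression'' with both branches looks like this:&lt;br /&gt;
&lt;br /&gt;
 String parity = if (one_based % 2 == 0) then &amp;quot;even&amp;quot; else &amp;quot;odd&amp;quot;&lt;br /&gt;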
&lt;br /&gt;
So first, let's handle the special cases.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Calling Tasks===&lt;br /&gt;
&lt;br /&gt;
Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a &amp;lt;code&amp;gt;${}&amp;lt;/code&amp;gt; substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a &amp;lt;code&amp;gt;stringify_number&amp;lt;/code&amp;gt; task, to be written later.&lt;br /&gt;
&lt;br /&gt;
To call a task (or another workflow), we use a &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt; statement and give it some inputs. Then we can fish the task's output values out with &amp;lt;code&amp;gt;.&amp;lt;/code&amp;gt; access, but only when the call actually ran (here, only for the numbers where we don't make a noise instead).&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
        call stringify_number {&lt;br /&gt;
            input:&lt;br /&gt;
                the_number = one_based&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We can put the code into the workflow now, and set about writing the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         &lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Writing Tasks===&lt;br /&gt;
&lt;br /&gt;
Our task should go after the workflow in the file. It looks a lot like a workflow except it uses &amp;lt;code&amp;gt;task&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're going to want it to take in an integer &amp;lt;code&amp;gt;the_number&amp;lt;/code&amp;gt;, and we're going to want it to output a string &amp;lt;code&amp;gt;the_string&amp;lt;/code&amp;gt;. So let's fill that in in &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; sections.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     # ???&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now, unlike workflows, tasks can have a &amp;lt;code&amp;gt;command&amp;lt;/code&amp;gt; section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use &amp;lt;code&amp;gt;~{}&amp;lt;/code&amp;gt; (Bash-like substitution, but with a tilde) to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now we need to capture the result of the command script. The WDL &amp;lt;code&amp;gt;stdout()&amp;lt;/code&amp;gt; function returns a WDL &amp;lt;code&amp;gt;File&amp;lt;/code&amp;gt; containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're also going to want to add a &amp;lt;code&amp;gt;runtime&amp;lt;/code&amp;gt; section to our task, to specify resource requirements. We're going to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;disks&amp;lt;/code&amp;gt; section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a &amp;lt;code&amp;gt;local-disk&amp;lt;/code&amp;gt; of a certain number of gigabytes, which may suggest that it be &amp;lt;code&amp;gt;SSD&amp;lt;/code&amp;gt; storage.&lt;br /&gt;
&lt;br /&gt;
Then we can put our task into our WDL file:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Output Block===&lt;br /&gt;
&lt;br /&gt;
Now the only thing missing is a workflow-level &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect all the strings that came out of the different tasks in our scatter into an &amp;lt;code&amp;gt;Array[String]&amp;lt;/code&amp;gt;. We'll add the &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section at the end of the &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; block, above the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
     output {&lt;br /&gt;
        Array[String] fizzbuzz_results = result&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Because the &amp;lt;code&amp;gt;result&amp;lt;/code&amp;gt; variable is defined inside a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt;, when we reference it outside the scatter we see it as being an array.&lt;br /&gt;
&lt;br /&gt;
==Running the Workflow==&lt;br /&gt;
&lt;br /&gt;
Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;FizzBuzz.item_count&amp;quot;: 20}' &amp;gt;fizzbuzz.json&lt;br /&gt;
&lt;br /&gt;
Then run it on the cluster with Toil:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
Or locally:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
=Debugging Workflows=&lt;br /&gt;
Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.&lt;br /&gt;
&lt;br /&gt;
==Restarting the Workflow==&lt;br /&gt;
&lt;br /&gt;
If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have fixed, and you ran the workflow with &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; set manually to a directory that persists between attempts, you can add &amp;lt;code&amp;gt;--restart&amp;lt;/code&amp;gt; to your workflow command and make Toil try again. It will pick up from where it left off and rerun any failed tasks and then the rest of the workflow.&lt;br /&gt;
&lt;br /&gt;
This ''will not'' pick up any changes to your WDL source code files; those are read once at the beginning and not re-read on restart.&lt;br /&gt;
&lt;br /&gt;
If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques.&lt;br /&gt;
&lt;br /&gt;
==Debugging Options==&lt;br /&gt;
When debugging a workflow, make sure to run the workflow with &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt;, to set the log level to &amp;lt;code&amp;gt;DEBUG&amp;lt;/code&amp;gt;, and with &amp;lt;code&amp;gt;--jobStore /some/path/to/a/shared/directory/it/can/create&amp;lt;/code&amp;gt; so that the stored files shipped between jobs are in a place you can access them.&lt;br /&gt;
&lt;br /&gt;
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:&lt;br /&gt;
&lt;br /&gt;
 =========&amp;gt;&lt;br /&gt;
         Toil job log is here&lt;br /&gt;
 &amp;lt;=========&lt;br /&gt;
&lt;br /&gt;
Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this.&lt;br /&gt;
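&lt;br /&gt;
If the debug log is long, you can pull out just those per-job sections with a short &amp;lt;code&amp;gt;awk&amp;lt;/code&amp;gt; filter. This is a sketch: &amp;lt;code&amp;gt;workflow.log&amp;lt;/code&amp;gt; is a hypothetical saved copy of the Toil log, which the example creates itself for demonstration.&lt;br /&gt;

```shell
# Extract the per-job log sections between the =========> and <========= markers.
# workflow.log stands in for a saved copy of the Toil debug log.
cat > workflow.log <<'EOF'
unrelated logging
=========>
        Toil job log is here
<=========
more unrelated logging
EOF
# Print every line from an opening marker through the next closing marker.
awk '/=========>/{inside=1} inside{print} /<=========/{inside=0}' workflow.log
```

The same pattern works on a real log piped in on standard input.&lt;br /&gt;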
&lt;br /&gt;
==Reading the Log==&lt;br /&gt;
&lt;br /&gt;
When a WDL workflow fails, you are likely to see a message like this:&lt;br /&gt;
&lt;br /&gt;
 WDL.runtime.error.CommandFailed: task command failed with exit status 1&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism&lt;br /&gt;
&lt;br /&gt;
This means that the command line specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code. That happens either when the command is written wrong, or when the tool you are trying to run detects and reports an error.&lt;br /&gt;
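&lt;br /&gt;
You can see how exit statuses behave in any Bash shell; &amp;lt;code&amp;gt;false&amp;lt;/code&amp;gt; is a standard utility that always fails:&lt;br /&gt;

```shell
# A command reports failure with a nonzero exit status, captured in $?.
# 'false' always exits with status 1; 'true' always exits with status 0.
status=0
false || status=$?
echo "false exited with status $status"
```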
&lt;br /&gt;
Go up higher in the log until you find lines that look like:&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows:&lt;br /&gt;
&lt;br /&gt;
And&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows:&lt;br /&gt;
&lt;br /&gt;
These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.&lt;br /&gt;
&lt;br /&gt;
==Reproducing Problems==&lt;br /&gt;
&lt;br /&gt;
When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. In the log of your failing Toil task, look for lines like this:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&amp;lt;/code&amp;gt; part is a Toil file ID, and it is a relative path from your &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; value to where the file is stored on disk. So if you ran the workflow with &amp;lt;code&amp;gt;--jobStore /private/groups/patenlab/anovak/jobstore&amp;lt;/code&amp;gt;, you would look for this file at:&lt;br /&gt;
&lt;br /&gt;
 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&lt;br /&gt;
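&lt;br /&gt;
In other words, the on-disk location is just your job store directory joined with the file ID. As a sketch, reusing the hypothetical job store path from above, you could copy the input file out to experiment with it:&lt;br /&gt;
&lt;br /&gt;
 # Hypothetical job store; substitute your own --jobStore value&lt;br /&gt;
 JOBSTORE=/private/groups/patenlab/anovak/jobstore&lt;br /&gt;
 FILE_ID=files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&lt;br /&gt;
 # Copy the file somewhere you can safely poke at it&lt;br /&gt;
 cp &amp;quot;${JOBSTORE}/${FILE_ID}&amp;quot; ./Sample.bam&lt;br /&gt;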
&lt;br /&gt;
==More Ways of Finding Files==&lt;br /&gt;
&lt;br /&gt;
Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the &amp;lt;code&amp;gt;find&amp;lt;/code&amp;gt; command to try and find the files by name. For example, if you want to look at &amp;lt;code&amp;gt;Sample.bam&amp;lt;/code&amp;gt;, you can look for it like this:&lt;br /&gt;
&lt;br /&gt;
 find /path/to/the/jobstore -name &amp;quot;Sample.bam&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
You can take the &amp;lt;code&amp;gt;toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt; URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this:&lt;br /&gt;
&lt;br /&gt;
 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
Then you can take the part after the last colon, &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt;, and that is the path relative to the job store where this file can be found.&lt;br /&gt;
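&lt;br /&gt;
If you would rather not use a website, the same decoding can be done at the command line with Python's standard library. This sketch decodes the URI from the example above and prints the part after the last colon:&lt;br /&gt;
&lt;br /&gt;
 python3 -c 'import sys, urllib.parse; print(urllib.parse.unquote(sys.argv[1]).rsplit(&amp;quot;:&amp;quot;, 1)[-1])' 'toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam'&lt;br /&gt;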
&lt;br /&gt;
==Using Development Versions of Toil==&lt;br /&gt;
&lt;br /&gt;
Sometimes, bugs will be fixed in the development version of Toil, but not released yet. To try the current development version of Toil, you can install it like this:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
If you want to use a particular branch or commit, like &amp;lt;code&amp;gt;aaa451b320fc115b3563ced25cb501301cf86f90&amp;lt;/code&amp;gt;, you can do:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
==Frequently Asked Questions==&lt;br /&gt;
&lt;br /&gt;
===I am getting warnings about &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt;===&lt;br /&gt;
You may be seeing warnings that &amp;lt;code&amp;gt;XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This happens because Slurm is providing Toil with an &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt; environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it.&lt;br /&gt;
&lt;br /&gt;
===Toil said it was &amp;lt;code&amp;gt;Redirecting logging&amp;lt;/code&amp;gt; somewhere, but I can't find that file!===&lt;br /&gt;
The Toil worker process for each job will say that it is &amp;lt;code&amp;gt;Redirecting logging to /data/tmp/somewhere/worker_log.txt&amp;lt;/code&amp;gt;, and when running in single machine mode these messages go to the main Toil log.&lt;br /&gt;
&lt;br /&gt;
The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt; option to Toil.&lt;br /&gt;
&lt;br /&gt;
If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.&lt;br /&gt;
&lt;br /&gt;
=Additional WDL resources=&lt;br /&gt;
&lt;br /&gt;
For more information on writing and running WDL workflows, see:&lt;br /&gt;
&lt;br /&gt;
* [https://docs.openwdl.org/en/stable/ The WDL documentation]&lt;br /&gt;
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The &amp;quot;Learn WDL&amp;quot; video course on YouTube]&lt;br /&gt;
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=GPU_Resources&amp;diff=503</id>
		<title>GPU Resources</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=GPU_Resources&amp;diff=503"/>
		<updated>2024-06-28T16:32:15Z</updated>

		<summary type="html">&lt;p&gt;Anovak: Add partition and time&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;When submitting jobs, you can ask for GPUs in one of two ways.  One is:&lt;br /&gt;
&lt;br /&gt;
 #SBATCH --partition=gpu&lt;br /&gt;
 #SBATCH --gres=gpu:1&lt;br /&gt;
&lt;br /&gt;
That will ask for 1 GPU generically on a node with a free GPU.  This request is more specific:&lt;br /&gt;
&lt;br /&gt;
 #SBATCH --partition=gpu&lt;br /&gt;
 #SBATCH --gres=gpu:A5500:3&lt;br /&gt;
&lt;br /&gt;
That requests 3 A5500 GPUs '''only'''.&lt;br /&gt;
&lt;br /&gt;
We have several GPU types on the cluster which may fit your specific needs:&lt;br /&gt;
&lt;br /&gt;
 nVidia RTX A5500           : 24GB RAM&lt;br /&gt;
 nVidia A100                : 80GB RAM&lt;br /&gt;
&lt;br /&gt;
For the most part, Slurm takes care of making sure that each job only sees and uses the GPUs assigned to it. Within the job, '''CUDA_VISIBLE_DEVICES''' will be set in the environment, but Slurm re-numbers the GPUs assigned to each job, so the variable will always list your requested number of GPUs, starting at 0. If you need access to the &amp;quot;real&amp;quot; GPU numbers (to log or to pass along to Docker), they are available in the '''SLURM_JOB_GPUS''' (for '''sbatch''') or '''SLURM_STEP_GPUS''' (for '''srun''') environment variable.&lt;br /&gt;
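&lt;br /&gt;
As a sketch, a minimal batch script that just reports both numberings might look like this (the echo labels are made up, and the &amp;lt;code&amp;gt;#SBATCH&amp;lt;/code&amp;gt; values are example settings):&lt;br /&gt;
&lt;br /&gt;
 #!/bin/bash&lt;br /&gt;
 #SBATCH --partition=gpu&lt;br /&gt;
 #SBATCH --gres=gpu:2&lt;br /&gt;
 #SBATCH --time=00:05:00&lt;br /&gt;
 # Always renumbered to start at 0 within the job, e.g. &amp;quot;0,1&amp;quot;&lt;br /&gt;
 echo &amp;quot;Renumbered GPUs: ${CUDA_VISIBLE_DEVICES}&amp;quot;&lt;br /&gt;
 # The physical GPU indices on the node, e.g. &amp;quot;2,3&amp;quot;&lt;br /&gt;
 echo &amp;quot;Physical GPUs: ${SLURM_JOB_GPUS}&amp;quot;&lt;br /&gt;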
&lt;br /&gt;
==Running GPU Workloads==&lt;br /&gt;
&lt;br /&gt;
To actually use an nVidia GPU, you need to run a program that uses the CUDA API. There are a few ways to obtain such a program.&lt;br /&gt;
&lt;br /&gt;
===Prebuilt CUDA Applications===&lt;br /&gt;
&lt;br /&gt;
The Slurm cluster nodes have the nVidia drivers installed, as well as basic CUDA tools like nvidia-smi.&lt;br /&gt;
&lt;br /&gt;
Some projects, such as tensorflow, may ship pre-built binaries that can use CUDA. You should be able to run these binaries directly, if you download them.&lt;br /&gt;
&lt;br /&gt;
===Building CUDA Applications===&lt;br /&gt;
&lt;br /&gt;
The cluster nodes do not have the full CUDA Toolkit. In particular, they do not have the '''nvcc''' CUDA-enabled compiler. If you want to compile applications that use CUDA, you will need to install the development environment yourself for your user.&lt;br /&gt;
&lt;br /&gt;
Once you have '''nvcc''' available to your user, building CUDA applications should work. To run them, you will have to submit them as jobs, because the head node does not have a GPU.&lt;br /&gt;
&lt;br /&gt;
===Containerized GPU Workloads===&lt;br /&gt;
&lt;br /&gt;
Instead of directly installing binaries, or installing and using the CUDA Toolkit, it is often easiest to use containers to download a prebuilt GPU workload and all of its libraries and dependencies. There are a few options for running containerized GPU workloads on the cluster.&lt;br /&gt;
&lt;br /&gt;
====Running Containers in Singularity====&lt;br /&gt;
&lt;br /&gt;
You can run containers on the cluster using Singularity, and give them access to the GPUs that Slurm has selected using the '''--nv''' option. For example:&lt;br /&gt;
&lt;br /&gt;
 singularity pull docker://tensorflow/tensorflow:latest-gpu&lt;br /&gt;
 srun -c 8 --mem 10G --partition=gpu --time=00:20:00 --gres=gpu:1 singularity run --nv docker://tensorflow/tensorflow:latest-gpu python -c 'from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())'&lt;br /&gt;
&lt;br /&gt;
This will produce output showing that the Tensorflow container is indeed able to talk to one GPU:&lt;br /&gt;
&lt;br /&gt;
 INFO:    Using cached SIF image&lt;br /&gt;
 2023-05-15 11:36:33.110850: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.&lt;br /&gt;
 To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.&lt;br /&gt;
 2023-05-15 11:36:38.799035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /device:GPU:0 with 22244 MB memory:  -&amp;gt; device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute  capability: 8.6&lt;br /&gt;
 [name: &amp;quot;/device:CPU:0&amp;quot;&lt;br /&gt;
 device_type: &amp;quot;CPU&amp;quot;&lt;br /&gt;
 memory_limit: 268435456&lt;br /&gt;
 locality {&lt;br /&gt;
 }&lt;br /&gt;
 incarnation: 8527638019084870106&lt;br /&gt;
 xla_global_id: -1&lt;br /&gt;
 , name: &amp;quot;/device:GPU:0&amp;quot;&lt;br /&gt;
 device_type: &amp;quot;GPU&amp;quot;&lt;br /&gt;
 memory_limit: 23324655616&lt;br /&gt;
 locality {&lt;br /&gt;
   bus_id: 1&lt;br /&gt;
   links {&lt;br /&gt;
   }&lt;br /&gt;
 }&lt;br /&gt;
 incarnation: 1860154623440434360&lt;br /&gt;
 physical_device_desc: &amp;quot;device: 0, name: NVIDIA RTX A5500, pci bus id: 0000:03:00.0, compute capability: 8.6&amp;quot;&lt;br /&gt;
 xla_global_id: 416903419&lt;br /&gt;
 ]&lt;br /&gt;
&lt;br /&gt;
Slurm's confinement of the job to the correct set of GPUs also applies inside the Singularity container; there is no need to specifically direct Singularity to use the right GPUs unless you are doing something unusual.&lt;br /&gt;
&lt;br /&gt;
====Running Containers in Slurm====&lt;br /&gt;
&lt;br /&gt;
Slurm itself also supports a '''--container''' option for jobs, which allows a whole job to be run inside a container. If you are able to [https://slurm.schedmd.com/containers.html convert your container to OCI Bundle format], you can pass it directly to Slurm instead of using Singularity from inside the job. However, Docker-compatible image specifiers can't be given to Slurm, only paths to OCI bundles on disk.&lt;br /&gt;
&lt;br /&gt;
Stnad-alone tools to download a Docker image from Docker Hub in OCI bundle format ('''skopeo''' and '''umoci''') are not yet installed on the cluster. But the method using the '''docker''' command should work.&lt;br /&gt;
&lt;br /&gt;
Slurm containers ''should'' have access to their assigned GPUs, but it is not clear if tools like '''nvidia-smi''' are injected into the container, as they would be with Singularity or the nVidia Container Runtime.&lt;br /&gt;
&lt;br /&gt;
====Running Containers in Docker====&lt;br /&gt;
&lt;br /&gt;
You might be used to running containers with Docker, or containerized GPU workloads with the nVidia Container Runtime or Toolkit. Docker is installed on all the nodes and the daemon is running; if the '''docker''' command does not work for you, ask cluster-admin to add you to the right groups.&lt;br /&gt;
&lt;br /&gt;
The '''nvidia''' runtime is set up and will automatically be used.&lt;br /&gt;
&lt;br /&gt;
While Slurm configures each Slurm job with a cgroup that directs it to the correct GPUs, '''using Docker to run another container escapes Slurm's confinement''', and using '''--gpus=1''' will ''always'' use the ''first'' GPU in the system, whether that GPU is assigned to your job or not. When using Docker, you ''must'' consult the '''SLURM_JOB_GPUS''' (for '''sbatch''') or '''SLURM_STEP_GPUS''' (for '''srun''') environment variable and pass that along to your container. You should also impose limits on all other resources used by your Docker container, so that your whole job stays within the resources allocated by Slurm's scheduler. (TODO: find out how cgroups handles oversubscription between a Docker container and the Slurm container that launched it).&lt;br /&gt;
&lt;br /&gt;
An example of a working command is:&lt;br /&gt;
&lt;br /&gt;
 srun -c 1 --mem 4G --partition=gpu --time=00:20:00 --gres=gpu:2 bash -c 'docker run --rm --gpus=\&amp;quot;device=$SLURM_STEP_GPUS\&amp;quot; nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi'&lt;br /&gt;
&lt;br /&gt;
Note that the double-quotes are included in the argument to '''--gpus''' as seen by the Docker client, and that '''bash''' and single-quotes are used to ensure that '''$SLURM_STEP_GPUS''' is evaluated within the job itself, and not on the head node.&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Slurm_Queues_(Partitions)_and_Resource_Management&amp;diff=474</id>
		<title>Slurm Queues (Partitions) and Resource Management</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Slurm_Queues_(Partitions)_and_Resource_Management&amp;diff=474"/>
		<updated>2024-03-25T20:29:53Z</updated>

		<summary type="html">&lt;p&gt;Anovak: /* My job is not running but I want it to be running! */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Partitions ==&lt;br /&gt;
&lt;br /&gt;
Due to heterogeneous workloads and different batch requirements, we have implemented partitions in Slurm, which are similar to queues.&lt;br /&gt;
&lt;br /&gt;
Each partition has different default and maximum walltime limits (aka &amp;quot;runtime&amp;quot; limits).  You will need to select a partition to launch your jobs in based on what kind of jobs they are and how long they are expected to run.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; &lt;br /&gt;
|- style=&amp;quot;font-weight:bold;&amp;quot;&lt;br /&gt;
! Partition Name&lt;br /&gt;
! Default Walltime Limit&lt;br /&gt;
! Maximum Walltime Limit&lt;br /&gt;
! style=&amp;quot;border-color:inherit;&amp;quot; | Default Partition?&lt;br /&gt;
! Job Priority&lt;br /&gt;
! Maximum Nodes Utilized&lt;br /&gt;
|-&lt;br /&gt;
| short&lt;br /&gt;
| 10 minutes&lt;br /&gt;
| 1 hour&lt;br /&gt;
| style=&amp;quot;border-color:inherit;&amp;quot; | Yes&lt;br /&gt;
| Normal&lt;br /&gt;
| All&lt;br /&gt;
|-&lt;br /&gt;
| medium&lt;br /&gt;
| 1 hour&lt;br /&gt;
| 12 hours&lt;br /&gt;
| style=&amp;quot;border-color:inherit;&amp;quot; | No&lt;br /&gt;
| Normal&lt;br /&gt;
| 15&lt;br /&gt;
|-&lt;br /&gt;
| long&lt;br /&gt;
| 12 hours&lt;br /&gt;
| 7 days&lt;br /&gt;
| style=&amp;quot;border-color:inherit;&amp;quot; | No&lt;br /&gt;
| Normal&lt;br /&gt;
| 10&lt;br /&gt;
|-&lt;br /&gt;
| high_priority&lt;br /&gt;
| 10 minutes&lt;br /&gt;
| 7 days&lt;br /&gt;
| style=&amp;quot;border-color:inherit;&amp;quot; | No&lt;br /&gt;
| High&lt;br /&gt;
| All&lt;br /&gt;
|-&lt;br /&gt;
| gpu&lt;br /&gt;
| 10 minutes&lt;br /&gt;
| 7 days&lt;br /&gt;
| No&lt;br /&gt;
| Normal&lt;br /&gt;
| 6&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
If you do not specify a partition for your job (with e.g. &amp;lt;code&amp;gt;--partition=medium&amp;lt;/code&amp;gt;), it will automatically be assigned to the &amp;quot;short&amp;quot; partition by default.  If you do not specify a walltime value in your job submission script (with e.g. &amp;lt;code&amp;gt;--time=00:30:00&amp;lt;/code&amp;gt;), it will inherit the &amp;quot;Default Walltime Limit&amp;quot; of the partition it is assigned.  It is therefore a very good idea to specify both the partition and a walltime limit for every job you submit.&lt;br /&gt;
&lt;br /&gt;
This all means that it is very important to '''TEST''' your jobs before running many of them!  Submit one job and note how many resources it takes (RAM, CPU) and how long it takes to run.  Then when you submit many of those jobs, you can correctly specify the number of CPU cores your job needs, how much RAM it needs (pad it by about 20% just in case), and how much time it needs to run (pad it by about 40% to account for environmental factors like disk IO load and CPU context switching load).&lt;br /&gt;
&lt;br /&gt;
You can test your jobs by running one job via '''srun''' with fairly high CPU, RAM, and walltime limits (just so it isn't killed due to default limits), then noting how many resources it consumed once it finishes.&lt;br /&gt;
&lt;br /&gt;
'''Example'''&lt;br /&gt;
&lt;br /&gt;
 seff 769059&lt;br /&gt;
&lt;br /&gt;
'''Output'''&lt;br /&gt;
&lt;br /&gt;
 Job ID: 769059&lt;br /&gt;
 Cluster: phoenix&lt;br /&gt;
 User/Group: &amp;lt;user-name&amp;gt;/&amp;lt;group-name&amp;gt;&lt;br /&gt;
 State: COMPLETED (exit code 0)&lt;br /&gt;
 Nodes: 1&lt;br /&gt;
 Cores per node: 16&lt;br /&gt;
 CPU Utilized: 00:00:01&lt;br /&gt;
 CPU Efficiency: 0.11% of 00:15:28 core-walltime&lt;br /&gt;
 Job Wall-clock time: 00:00:58&lt;br /&gt;
 Memory Utilized: 4.79 MB&lt;br /&gt;
 Memory Efficiency: 4.79% of 100.00 MB&lt;br /&gt;
&lt;br /&gt;
So if I needed to run like 1000 of these jobs, and they were all similar, I would select the &amp;quot;short&amp;quot; partition, 1 CPU core, maybe specify 8MB RAM, and maybe 90 seconds walltime limit.  Note how I padded the RAM and walltime a bit to account for unexpected variable cluster conditions.&lt;br /&gt;
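&lt;br /&gt;
Those choices translate into &amp;lt;code&amp;gt;#SBATCH&amp;lt;/code&amp;gt; directives like these (a sketch; substitute the numbers from your own test job):&lt;br /&gt;
&lt;br /&gt;
 #SBATCH --partition=short&lt;br /&gt;
 #SBATCH --cpus-per-task=1&lt;br /&gt;
 #SBATCH --mem=8M&lt;br /&gt;
 #SBATCH --time=00:01:30&lt;br /&gt;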
&lt;br /&gt;
== '''high_priority''' Partition Notes ==&lt;br /&gt;
&lt;br /&gt;
The &amp;quot;high_priority&amp;quot; partition is special in that its jobs have the highest priority of all jobs on the cluster, and it will push all other jobs aside in an effort to finish jobs in that partition as fast as possible.  It is only available for emergency or mission-critical batches that need to be completed unexpectedly fast.  Access to this partition is granted only on a per-request basis, and is temporary until your batch finishes.  Email '''cluster-admin@soe.ucsc.edu''' if you need access to the high_priority queue, making your case for why it is necessary.&lt;br /&gt;
&lt;br /&gt;
== My job is not running but I want it to be running ==&lt;br /&gt;
&lt;br /&gt;
Even if your job is in the high-priority partition, that doesn't mean that the cluster will drop everything and run it immediately. Because we don't have pre-emption set up, high priority jobs still have to wait for currently-running jobs to finish, as well as for other high-priority jobs. And since, as noted above, jobs can be allowed to run for up to 7 days each, it is physically possible for even the highest-priority job in the whole cluster to not start for a whole week.&lt;br /&gt;
&lt;br /&gt;
Here is a [https://docs-research-it.berkeley.edu/services/high-performance-computing/user-guide/running-your-jobs/why-job-not-run/ good resource from Berkeley] about understanding and debugging Slurm job scheduling. Basically, Slurm uses the wall-clock limits of running jobs, and of jobs in the queue, to make a plan to start each job on some node at some time in the future. If jobs finish early, other jobs can start sooner than scheduled, and if there is space around higher-priority jobs, lower-priority jobs can be filled in.&lt;br /&gt;
&lt;br /&gt;
If you want to know when Slurm plans to run your job, and why that is not right now, you can use the &amp;lt;code&amp;gt;--start&amp;lt;/code&amp;gt; option for the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command:&lt;br /&gt;
&lt;br /&gt;
    $ squeue -j 1719584 --start&lt;br /&gt;
                 JOBID PARTITION     NAME     USER ST          START_TIME  NODES SCHEDNODES           NODELIST(REASON)&lt;br /&gt;
               1719584     short snakemak flastnam PD 2024-01-22T10:20:00      1 phoenix-00           (Priority)&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;START_TIME&amp;lt;/code&amp;gt; column is the time by which Slurm is sure it will be able to start your job if no higher-priority jobs come in first, and the &amp;lt;code&amp;gt;NODELIST(REASON)&amp;lt;/code&amp;gt; column shows the nodes the job is running on, or the reason it is not running now, in parentheses. In this case, the job is not running because higher-priority jobs are in the way.&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=473</id>
		<title>Phoenix WDL Tutorial</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=473"/>
		<updated>2024-03-21T16:23:01Z</updated>

		<summary type="html">&lt;p&gt;Anovak: /* Preparing an input file */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Tutorial: Getting Started with WDL Workflows on Phoenix'''&lt;br /&gt;
&lt;br /&gt;
Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like the Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements and which pieces depend on which other pieces, making your analyses reproducible by people other than you.&lt;br /&gt;
&lt;br /&gt;
Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.&lt;br /&gt;
&lt;br /&gt;
This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.&lt;br /&gt;
&lt;br /&gt;
=Phoenix Cluster Setup=&lt;br /&gt;
&lt;br /&gt;
Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.&lt;br /&gt;
&lt;br /&gt;
==Getting VPN access==&lt;br /&gt;
&lt;br /&gt;
We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.&lt;br /&gt;
&lt;br /&gt;
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.&lt;br /&gt;
&lt;br /&gt;
==Connecting to Phoenix==&lt;br /&gt;
&lt;br /&gt;
Once you have VPN access, you can connect to the &amp;quot;head node&amp;quot; of the Phoenix cluster. This node is where everyone logs in, but you should '''not''' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs.&lt;br /&gt;
&lt;br /&gt;
To connect to the head node:&lt;br /&gt;
&lt;br /&gt;
1. Connect to the VPN.&lt;br /&gt;
2. SSH to &amp;lt;code&amp;gt;phoenix.prism&amp;lt;/code&amp;gt;. At the command line, run:&lt;br /&gt;
&lt;br /&gt;
 ssh phoenix.prism&lt;br /&gt;
&lt;br /&gt;
If your username on the cluster (say, &amp;lt;code&amp;gt;flastname&amp;lt;/code&amp;gt;) is different than your username on your computer (which might be &amp;lt;code&amp;gt;firstname&amp;lt;/code&amp;gt;), you might instead have to run:&lt;br /&gt;
&lt;br /&gt;
 ssh flastname@phoenix.prism&lt;br /&gt;
&lt;br /&gt;
The first time you connect, you will see a message like:&lt;br /&gt;
&lt;br /&gt;
 The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established.&lt;br /&gt;
 ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI.&lt;br /&gt;
 This key is not known by any other names&lt;br /&gt;
 Are you sure you want to continue connecting (yes/no/[fingerprint])?&lt;br /&gt;
&lt;br /&gt;
This is your computer asking you to help it decide if it is talking to the genuine &amp;lt;code&amp;gt;phoenix.prism&amp;lt;/code&amp;gt;, and not an imposter. You will want to make sure that the &amp;quot;key fingerprint&amp;quot; is indeed &amp;lt;code&amp;gt;SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI&amp;lt;/code&amp;gt;. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type &amp;lt;code&amp;gt;yes&amp;lt;/code&amp;gt; to accept and remember that the server is who it says it is.&lt;br /&gt;
&lt;br /&gt;
==Installing Toil with WDL support==&lt;br /&gt;
&lt;br /&gt;
Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
If you also want to use AWS S3 &amp;lt;code&amp;gt;s3://&amp;lt;/code&amp;gt; and/or Google &amp;lt;code&amp;gt;gs://&amp;lt;/code&amp;gt; URLs for data, you will need to also install Toil with the &amp;lt;code&amp;gt;aws&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;google&amp;lt;/code&amp;gt; extras, respectively:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
This will install Toil in the &amp;lt;code&amp;gt;.local&amp;lt;/code&amp;gt; directory inside your home directory, which we write as &amp;lt;code&amp;gt;~/.local&amp;lt;/code&amp;gt;. The program to run WDL workflows, &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, will be at &amp;lt;code&amp;gt;~/.local/bin/toil-wdl-runner&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
By default, the command interpreter ''will not'' look there, so if you type &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run:&lt;br /&gt;
&lt;br /&gt;
 echo 'export PATH=&amp;quot;${HOME}/.local/bin:${PATH}&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in''', to restart bash and pick up the change.&lt;br /&gt;
&lt;br /&gt;
To make sure it worked, you can run:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --help&lt;br /&gt;
&lt;br /&gt;
If everything worked correctly, it will print a long list of the various option flags that the &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; command supports.&lt;br /&gt;
&lt;br /&gt;
If you ever want to upgrade Toil to a new release, you can repeat the &amp;lt;code&amp;gt;pip&amp;lt;/code&amp;gt; command above.&lt;br /&gt;
&lt;br /&gt;
==Configuring your Phoenix Environment==&lt;br /&gt;
&lt;br /&gt;
'''Do not try and store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. Usually you would end up with &amp;lt;code&amp;gt;/private/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Remember this path; we will need it later.&lt;br /&gt;
&lt;br /&gt;
==Configuring Toil for Phoenix==&lt;br /&gt;
&lt;br /&gt;
Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. On the Phoenix cluster, however, we do have a shared filesystem, so we should configure Toil to use it for caching the Docker container images used for running workflow steps. Since these files can be large, and the home directory quota is only 30 GB, we can't keep them in your home directory; we will need to use the &amp;lt;code&amp;gt;/private/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt; directory you created earlier.&lt;br /&gt;
&lt;br /&gt;
Record that directory in your &amp;lt;code&amp;gt;~/.bashrc&amp;lt;/code&amp;gt; file by editing this command to use your own path and then running it:&lt;br /&gt;
&lt;br /&gt;
 echo 'BIG_DATA_DIR=/private/groups/YOURGROUPNAME/YOURUSERNAME' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
Then use these commands to make sure that Toil knows where it ought to put its caches:&lt;br /&gt;
&lt;br /&gt;
 echo 'export SINGULARITY_CACHEDIR=&amp;quot;${BIG_DATA_DIR}/.singularity/cache&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE=&amp;quot;${BIG_DATA_DIR}/.cache/miniwdl&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in again''', to apply the changes.&lt;br /&gt;
&lt;br /&gt;
If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day].&lt;br /&gt;
&lt;br /&gt;
'''If you get errors about mutexes, lock files, or other weird problems with Singularity''', try moving these directories to inside &amp;lt;code&amp;gt;/data/tmp&amp;lt;/code&amp;gt; on the individual nodes, or unsetting them and letting Toil use its defaults (and exhaust our Docker pull limits). [https://github.com/DataBiosphere/toil/issues/4654 It is not clear that &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt; actually implements the necessary file locking correctly.]&lt;br /&gt;
&lt;br /&gt;
=Running an existing workflow=&lt;br /&gt;
&lt;br /&gt;
First, let's use &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].&lt;br /&gt;
&lt;br /&gt;
First, go to your user directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;, and make a directory to work in.&lt;br /&gt;
&lt;br /&gt;
 cd /private/groups/YOURGROUPNAME/YOURUSERNAME&lt;br /&gt;
 mkdir workflow-test&lt;br /&gt;
 cd workflow-test&lt;br /&gt;
&lt;br /&gt;
Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.&lt;br /&gt;
&lt;br /&gt;
 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl&lt;br /&gt;
&lt;br /&gt;
==Preparing an input file==&lt;br /&gt;
&lt;br /&gt;
Near the top of the WDL file, there's a section like this:&lt;br /&gt;
&lt;br /&gt;
 workflow hello_caller {&lt;br /&gt;
     input {&lt;br /&gt;
         File who&lt;br /&gt;
     }&lt;br /&gt;
&lt;br /&gt;
This means that there is a workflow named &amp;lt;code&amp;gt;hello_caller&amp;lt;/code&amp;gt; in this file, and it takes as input a file variable named &amp;lt;code&amp;gt;who&amp;lt;/code&amp;gt;. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.&lt;br /&gt;
&lt;br /&gt;
So first, we have to make that list of names. Let's make it in &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 echo &amp;quot;Mridula Resurrección&amp;quot; &amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Gershom Šarlota&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Ritchie Ravi&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
&lt;br /&gt;
Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt; that references it by relative path, like this:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./names.txt&amp;quot;}' &amp;gt;inputs.json&lt;br /&gt;
&lt;br /&gt;
Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].&lt;br /&gt;
&lt;br /&gt;
==Testing at small scale on a single machine==&lt;br /&gt;
&lt;br /&gt;
We are now ready to run the workflow!&lt;br /&gt;
&lt;br /&gt;
You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:&lt;br /&gt;
&lt;br /&gt;
 srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i&lt;br /&gt;
&lt;br /&gt;
This will start a new shell on a worker node that can run for up to 2 hours.&lt;br /&gt;
&lt;br /&gt;
In your new shell, run this Toil command:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner self_test.wdl inputs.json -o local_run&lt;br /&gt;
&lt;br /&gt;
This will, by default, use the &amp;lt;code&amp;gt;single_machine&amp;lt;/code&amp;gt; Toil &amp;quot;batch system&amp;quot; to run all of the workflow's tasks locally. Output will be sent to a new directory named &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
This will print a lot of logging to standard error, and to standard output it will print:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;hello_caller.message_files&amp;quot;: [&amp;quot;local_run/Mridula Resurrecci\u00f3n.txt&amp;quot;, &amp;quot;local_run/Gershom \u0160arlota.txt&amp;quot;, &amp;quot;local_run/Ritchie Ravi.txt&amp;quot;], &amp;quot;hello_caller.messages&amp;quot;: [&amp;quot;Hello, Mridula Resurrecci\u00f3n!&amp;quot;, &amp;quot;Hello, Gershom \u0160arlota!&amp;quot;, &amp;quot;Hello, Ritchie Ravi!&amp;quot;]}&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt; directory will contain the described text files (with Unicode escape sequences like &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt; replaced by their corresponding characters), each containing a greeting for the corresponding person.&lt;br /&gt;
&lt;br /&gt;
To leave your interactive Slurm session and return to the head node, use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Running at larger scale==&lt;br /&gt;
&lt;br /&gt;
Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people!&lt;br /&gt;
&lt;br /&gt;
Go get this handy list of people and cut it to length:&lt;br /&gt;
&lt;br /&gt;
 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt&lt;br /&gt;
 head -n100 1000_names.txt &amp;gt;100_names.txt&lt;br /&gt;
&lt;br /&gt;
And make a new inputs file:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./100_names.txt&amp;quot;}' &amp;gt;inputs_big.json&lt;br /&gt;
&lt;br /&gt;
Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.&lt;br /&gt;
&lt;br /&gt;
To run against the Slurm cluster, we need to use the &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the &amp;lt;code&amp;gt;--caching false&amp;lt;/code&amp;gt; option. We will add the &amp;lt;code&amp;gt;--batchLogsDir&amp;lt;/code&amp;gt; option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the &amp;lt;code&amp;gt;-m&amp;lt;/code&amp;gt; option to save the output JSON to a file instead of printing it.&lt;br /&gt;
&lt;br /&gt;
Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the &amp;lt;code&amp;gt;TOIL_SLURM_ARGS&amp;lt;/code&amp;gt; environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in.&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json&lt;br /&gt;
&lt;br /&gt;
This will tick for a while, but eventually you should end up with 100 greeting files in the &amp;lt;code&amp;gt;slurm_run&amp;lt;/code&amp;gt; directory.&lt;br /&gt;
&lt;br /&gt;
=Writing your own workflow=&lt;br /&gt;
&lt;br /&gt;
In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. &lt;br /&gt;
&lt;br /&gt;
==Writing the file==&lt;br /&gt;
&lt;br /&gt;
===Version===&lt;br /&gt;
&lt;br /&gt;
All WDL files need to start with a &amp;lt;code&amp;gt;version&amp;lt;/code&amp;gt; statement (unless they are very old &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; files). Toil supports &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt;, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; and 1.0.&lt;br /&gt;
&lt;br /&gt;
So let's start a new WDL 1.0 workflow. Open up a file named &amp;lt;code&amp;gt;fizzbuzz.wdl&amp;lt;/code&amp;gt; and start with a version statement:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
&lt;br /&gt;
===Workflow Block===&lt;br /&gt;
&lt;br /&gt;
Then, add an empty &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; named &amp;lt;code&amp;gt;FizzBuzz&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Input Block===&lt;br /&gt;
&lt;br /&gt;
Workflows usually need some kind of user input, so let's give our workflow an &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; section.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Notice that each input has a type, a name, and an optional default value. If the type ends in &amp;lt;code&amp;gt;?&amp;lt;/code&amp;gt;, the value is optional, and it may be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.&lt;br /&gt;
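&lt;br /&gt;
For example, an inputs file for this workflow could override one of the defaults and set the optional input. (These particular values are just for illustration; only &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt; is actually required.)&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;FizzBuzz.item_count&amp;quot;: 15, &amp;quot;FizzBuzz.to_fizz&amp;quot;: 4, &amp;quot;FizzBuzz.fizzbuzz_override&amp;quot;: &amp;quot;Fizz Buzz!&amp;quot;}&lt;br /&gt;
&lt;br /&gt;
If &amp;lt;code&amp;gt;fizzbuzz_override&amp;lt;/code&amp;gt; is left out entirely, it stays &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;.&lt;br /&gt;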
&lt;br /&gt;
===Body===&lt;br /&gt;
&lt;br /&gt;
Now we'll start on the body of the workflow, to be inserted just after the inputs section. &lt;br /&gt;
&lt;br /&gt;
The first thing we're going to need to do is create an array of all the numbers up to the &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;. We can do this by calling the WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; function, and assigning the result to an &amp;lt;code&amp;gt;Array[Int]&amp;lt;/code&amp;gt; variable.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
&lt;br /&gt;
WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.&lt;br /&gt;
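&lt;br /&gt;
For instance, the standard library's &amp;lt;code&amp;gt;length()&amp;lt;/code&amp;gt; function counts the items in an array (this declaration is just an illustration, not part of our workflow):&lt;br /&gt;
&lt;br /&gt;
 Int how_many = length(numbers)&lt;br /&gt;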
&lt;br /&gt;
===Scattering===&lt;br /&gt;
&lt;br /&gt;
Once we create an array of all the numbers, we can use a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt; to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a &amp;lt;code&amp;gt;map()&amp;lt;/code&amp;gt; in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; starts at 0.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
     Int one_based = i + 1&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Conditionals===&lt;br /&gt;
&lt;br /&gt;
Inside the body of the scatter, we are going to put some conditionals to determine if we should produce &amp;lt;code&amp;gt;&amp;quot;Fizz&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;Buzz&amp;quot;&amp;lt;/code&amp;gt;, or &amp;lt;code&amp;gt;&amp;quot;FizzBuzz&amp;quot;&amp;lt;/code&amp;gt;. To support our &amp;lt;code&amp;gt;fizzbuzz_override&amp;lt;/code&amp;gt;, we use an array of it and a default value, and use the WDL &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; function to find the first non-null value in that array.&lt;br /&gt;
&lt;br /&gt;
Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; at the end and take advantage of variables from un-executed conditionals being &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Note that WDL supports conditional ''expressions'' with a &amp;lt;code&amp;gt;then&amp;lt;/code&amp;gt; and an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt;, but conditional ''statements'' only have a body, not an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt; branch. If you need an else you will have to check the negated condition.&lt;br /&gt;
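&lt;br /&gt;
A conditional ''expression'' looks like this (again, just an illustration, not part of our workflow):&lt;br /&gt;
&lt;br /&gt;
 String parity = if (one_based % 2 == 0) then &amp;quot;even&amp;quot; else &amp;quot;odd&amp;quot;&lt;br /&gt;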
&lt;br /&gt;
So first, let's handle the special cases.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Calling Tasks===&lt;br /&gt;
&lt;br /&gt;
Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a &amp;lt;code&amp;gt;${}&amp;lt;/code&amp;gt; substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a &amp;lt;code&amp;gt;stringify_number&amp;lt;/code&amp;gt; task, to be written later.&lt;br /&gt;
&lt;br /&gt;
To call a task (or another workflow), we use a &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt; statement and give it some inputs. Then we can fish the output values out of the task with &amp;lt;code&amp;gt;.&amp;lt;/code&amp;gt; access. We only make the call for the normal numbers; for iterations where it didn't run, its outputs are &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
        call stringify_number {&lt;br /&gt;
            input:&lt;br /&gt;
                the_number = one_based&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We can put the code into the workflow now, and set about writing the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         &lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Writing Tasks===&lt;br /&gt;
&lt;br /&gt;
Our task should go after the workflow in the file. It looks a lot like a workflow except it uses &amp;lt;code&amp;gt;task&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're going to want it to take in an integer &amp;lt;code&amp;gt;the_number&amp;lt;/code&amp;gt;, and we're going to want it to output a string &amp;lt;code&amp;gt;the_string&amp;lt;/code&amp;gt;. So let's fill that in in &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; sections.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     # ???&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now, unlike workflows, tasks can have a &amp;lt;code&amp;gt;command&amp;lt;/code&amp;gt; section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use &amp;lt;code&amp;gt;~{}&amp;lt;/code&amp;gt;, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now we need to capture the result of the command script. The WDL &amp;lt;code&amp;gt;stdout()&amp;lt;/code&amp;gt; returns a WDL &amp;lt;code&amp;gt;File&amp;lt;/code&amp;gt; containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're also going to want to add a &amp;lt;code&amp;gt;runtime&amp;lt;/code&amp;gt; section to our task, to specify resource requirements. We'll also tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;disks&amp;lt;/code&amp;gt; section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that request a &amp;lt;code&amp;gt;local-disk&amp;lt;/code&amp;gt; of a certain number of gigabytes, optionally hinting that it should be &amp;lt;code&amp;gt;SSD&amp;lt;/code&amp;gt; storage.&lt;br /&gt;
&lt;br /&gt;
Then we can put our task into our WDL file:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Output Block===&lt;br /&gt;
&lt;br /&gt;
Now the only thing missing is a workflow-level &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section. Technically you aren't supposed to need one in WDL 1.0, but WDL 1.1 requires it, and without one Toil doesn't actually deliver your outputs anywhere yet, so we're going to add one. We need to collect all the strings that came out of the different tasks in our scatter into an &amp;lt;code&amp;gt;Array[String]&amp;lt;/code&amp;gt;. We'll add the &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section at the end of the &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; section, above the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
     output {&lt;br /&gt;
        Array[String] fizzbuzz_results = result&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Because the &amp;lt;code&amp;gt;result&amp;lt;/code&amp;gt; variable is defined inside a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt;, when we reference it outside the scatter we see it as being an array.&lt;br /&gt;
&lt;br /&gt;
==Running the Workflow==&lt;br /&gt;
&lt;br /&gt;
Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;FizzBuzz.item_count&amp;quot;: 20}' &amp;gt;fizzbuzz.json&lt;br /&gt;
&lt;br /&gt;
Then run it on the cluster with Toil:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
Or locally:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
=Debugging Workflows=&lt;br /&gt;
Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.&lt;br /&gt;
&lt;br /&gt;
==Restarting the Workflow==&lt;br /&gt;
&lt;br /&gt;
If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have fixed, and you ran the workflow with &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; set manually to a directory that persists between attempts, you can add &amp;lt;code&amp;gt;--restart&amp;lt;/code&amp;gt; to your workflow command and make Toil try again. It will pick up from where it left off and rerun any failed tasks and then the rest of the workflow.&lt;br /&gt;
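&lt;br /&gt;
For example, to retry the FizzBuzz run from above after a transient failure, repeat the same command with &amp;lt;code&amp;gt;--restart&amp;lt;/code&amp;gt; added:&lt;br /&gt;
&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json --restart&lt;br /&gt;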
&lt;br /&gt;
This ''will not'' pick up any changes to your WDL source files; they are read once at the beginning and not re-read on restart.&lt;br /&gt;
&lt;br /&gt;
If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques.&lt;br /&gt;
&lt;br /&gt;
==Debugging Options==&lt;br /&gt;
When debugging a workflow, make sure to run the workflow with &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt;, to set the log level to &amp;lt;code&amp;gt;DEBUG&amp;lt;/code&amp;gt;, and with &amp;lt;code&amp;gt;--jobStore /some/path/to/a/shared/directory/it/can/create&amp;lt;/code&amp;gt; so that the stored files shipped between jobs are in a place you can access them.&lt;br /&gt;
&lt;br /&gt;
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:&lt;br /&gt;
&lt;br /&gt;
 =========&amp;gt;&lt;br /&gt;
         Toil job log is here&lt;br /&gt;
 &amp;lt;=========&lt;br /&gt;
&lt;br /&gt;
Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this.&lt;br /&gt;
&lt;br /&gt;
==Reading the Log==&lt;br /&gt;
&lt;br /&gt;
When a WDL workflow fails, you are likely to see a message like this:&lt;br /&gt;
&lt;br /&gt;
 WDL.runtime.error.CommandFailed: task command failed with exit status 1&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism&lt;br /&gt;
&lt;br /&gt;
This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit status. That happens either when the command is written incorrectly, or when the tool you are running detects and reports an error itself.&lt;br /&gt;
&lt;br /&gt;
Go up higher in the log until you find lines that look like:&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows:&lt;br /&gt;
&lt;br /&gt;
And&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows:&lt;br /&gt;
&lt;br /&gt;
These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.&lt;br /&gt;
&lt;br /&gt;
==Reproducing Problems==&lt;br /&gt;
&lt;br /&gt;
When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that reproduces the problem. In addition to getting the standard output and standard error logs as described above, you may also need your tool's input files. In the log of your failing Toil task, look for lines like this:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&amp;lt;/code&amp;gt; part is a Toil file ID, and it is a relative path from your &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; value to where the file is stored on disk. So if you ran the workflow with &amp;lt;code&amp;gt;--jobStore /private/groups/patenlab/anovak/jobstore&amp;lt;/code&amp;gt;, you would look for this file at:&lt;br /&gt;
&lt;br /&gt;
 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&lt;br /&gt;
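&lt;br /&gt;
In other words, the full path is just your &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; value joined to the file ID:&lt;br /&gt;
&lt;br /&gt;
```shell
# Example --jobStore value from above; substitute your own
jobstore=/private/groups/patenlab/anovak/jobstore
file_id=files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam
echo "$jobstore/$file_id"
```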
&lt;br /&gt;
==More Ways of Finding Files==&lt;br /&gt;
&lt;br /&gt;
Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the &amp;lt;code&amp;gt;find&amp;lt;/code&amp;gt; command to try and find the files by name. For example, if you want to look at &amp;lt;code&amp;gt;Sample.bam&amp;lt;/code&amp;gt;, you can look for it like this:&lt;br /&gt;
&lt;br /&gt;
 find /path/to/the/jobstore -name &amp;quot;Sample.bam&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
You can take the &amp;lt;code&amp;gt;toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt; URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this:&lt;br /&gt;
&lt;br /&gt;
 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
Then you can take the part after the last colon, &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt;, and that is the path relative to the job store where this file can be found.&lt;br /&gt;
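&lt;br /&gt;
You can also do the decoding at the command line with Python's standard &amp;lt;code&amp;gt;urllib.parse.unquote&amp;lt;/code&amp;gt; function, and strip everything up to the last colon with shell parameter expansion:&lt;br /&gt;
&lt;br /&gt;
```shell
uri='toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam'
decoded=$(python3 -c 'import sys, urllib.parse; print(urllib.parse.unquote(sys.argv[1]))' "$uri")
# Keep only the part after the last colon: the job-store-relative path
echo "${decoded##*:}"
```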
&lt;br /&gt;
==Using Development Versions of Toil==&lt;br /&gt;
&lt;br /&gt;
Sometimes, bugs will be fixed in the development version of Toil, but not released yet. To try the current development version of Toil, you can install it like this:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
If you want to use a particular branch or commit, like &amp;lt;code&amp;gt;aaa451b320fc115b3563ced25cb501301cf86f90&amp;lt;/code&amp;gt;, you can do:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
==Frequently Asked Questions==&lt;br /&gt;
&lt;br /&gt;
===I am getting warnings about &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt;===&lt;br /&gt;
You may be seeing warnings that &amp;lt;code&amp;gt;XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This happens because Slurm is providing Toil with an &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt; environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it.&lt;br /&gt;
&lt;br /&gt;
===Toil said it was &amp;lt;code&amp;gt;Redirecting logging&amp;lt;/code&amp;gt; somewhere, but I can't find that file!===&lt;br /&gt;
The Toil worker process for each job will say that it is &amp;lt;code&amp;gt;Redirecting logging to /data/tmp/somewhere/worker_log.txt&amp;lt;/code&amp;gt;, and when running in single machine mode these messages go to the main Toil log.&lt;br /&gt;
&lt;br /&gt;
The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt; option to Toil.&lt;br /&gt;
&lt;br /&gt;
If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.&lt;br /&gt;
&lt;br /&gt;
=Additional WDL resources=&lt;br /&gt;
&lt;br /&gt;
For more information on writing and running WDL workflows, see:&lt;br /&gt;
&lt;br /&gt;
* [https://docs.openwdl.org/en/stable/ The WDL documentation]&lt;br /&gt;
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The &amp;quot;Learn WDL&amp;quot; video course on YouTube]&lt;br /&gt;
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Cluster_Etiquette&amp;diff=471</id>
		<title>Cluster Etiquette</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Cluster_Etiquette&amp;diff=471"/>
		<updated>2024-03-06T17:24:37Z</updated>

		<summary type="html">&lt;p&gt;Anovak: Link to the storage visualization&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;When running jobs on the cluster, you must be very aware of how those jobs will affect other users.&lt;br /&gt;
&lt;br /&gt;
1: Always test your job by running one first.  Just one.  Note how much RAM, how many CPU cores, and how much time it takes to run.  Then, when you submit 50 or 100 of those, you can specify limits in your Slurm batch file on how long each job may run and how much RAM it may use.  That way, Slurm can stop jobs that inadvertently run too long or use too many resources.&lt;br /&gt;
&lt;br /&gt;
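For example, if one test job finished in about 40 minutes using 2 cores and 4 GB of RAM, the limits in your batch file might look like this (these values are illustrative; use your own measurements):&lt;br /&gt;
&lt;br /&gt;
```shell
#SBATCH --time=01:00:00
#SBATCH --mem=4G
#SBATCH --cpus-per-task=2
```
&lt;br /&gt;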
2: Don't run too many jobs at once if they use a lot of disk I/O.  If every job reads in a 100GB file, and you launch 20 of them at the same time, you could bring down the file server serving /private/groups.  Run only maybe 5 at once in that case, or introduce a random delay at the start of your jobs.  You can limit your concurrent jobs by specifying something like this in your job batch file:&lt;br /&gt;
&lt;br /&gt;
 #SBATCH --array=[1-279]%10&lt;br /&gt;
 &lt;br /&gt;
 inputList=$1&lt;br /&gt;
 &lt;br /&gt;
 input=$(sed -n &amp;quot;$SLURM_ARRAY_TASK_ID&amp;quot;p $inputList)&lt;br /&gt;
 &lt;br /&gt;
 some_command $input&lt;br /&gt;
&lt;br /&gt;
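One way to introduce that random delay is to start the job script with a short sleep, so that simultaneously-launched jobs don't all hit the file server at once (the 30-second bound here is just an example):&lt;br /&gt;
&lt;br /&gt;
```shell
# Sleep a random number of seconds (0-29) before the real work starts,
# spreading the initial burst of disk I/O across jobs
sleep $((RANDOM % 30))
```
&lt;br /&gt;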
3: Don't use too much storage. Use http://logserv.gi.ucsc.edu/cgi-bin/private-groups.cgi to look at how your storage use is divided among your directories, and clean up large chunks of data that you do not need.&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Slurm_Tips_for_Toil&amp;diff=470</id>
		<title>Slurm Tips for Toil</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Slurm_Tips_for_Toil&amp;diff=470"/>
		<updated>2024-02-16T19:08:32Z</updated>

		<summary type="html">&lt;p&gt;Anovak: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Here are some tips for running Toil workflows on the Phoenix Slurm cluster. Mostly you might want to run WDL workflows, but you can use some of these for other workflows like Cactus. You can also consult [https://github.com/DataBiosphere/toil/blob/master/docs/wdl/running.rst the Toil documentation on WDL workflows].&lt;br /&gt;
&lt;br /&gt;
* Install Toil with WDL support with:&lt;br /&gt;
&lt;br /&gt;
 pip3 install --upgrade toil[wdl]&lt;br /&gt;
&lt;br /&gt;
To use a development version of Toil, you can install from source instead:&lt;br /&gt;
&lt;br /&gt;
 pip3 install git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl]&lt;br /&gt;
&lt;br /&gt;
Or for a particular branch:&lt;br /&gt;
&lt;br /&gt;
 pip3 install git+https://github.com/DataBiosphere/toil.git@issues/123-abc#egg=toil[wdl]&lt;br /&gt;
&lt;br /&gt;
* You will then need to make sure your '''~/.local/bin''' directory is on your PATH. Open up your '''~/.bashrc''' file and add:&lt;br /&gt;
&lt;br /&gt;
 export PATH=$PATH:$HOME/.local/bin&lt;br /&gt;
&lt;br /&gt;
Then make sure to log out and back in again.&lt;br /&gt;
&lt;br /&gt;
* For Toil options, you will want '''--batchSystem slurm''' to make it use Slurm and '''--batchLogsDir ./logs''' (or some other location on a shared filesystem) for the Slurm logs to not get lost.&lt;br /&gt;
&lt;br /&gt;
* You may be able to speed up your workflow with '''--caching true''', to cache data on nodes to be shared among multiple simultaneous tasks.&lt;br /&gt;
&lt;br /&gt;
* If using '''toil-wdl-runner''', you might want to add '''--jobStore ./jobStore''' to make sure the job store is in a defined, shared location so that you can use '''--restart''' later.&lt;br /&gt;
&lt;br /&gt;
* If using '''toil-wdl-runner''', you will want to set the '''SINGULARITY_CACHEDIR''' and '''MINIWDL__SINGULARITY__IMAGE_CACHE''' environment variables for your workflow to locations on shared storage, and possibly to the default cache locations in your home directory. Otherwise Toil will set them to node-local directories, and thus re-download images for each workflow run on each cluster node. To avoid this, you could, for example, run the following before your workflow, or add it to your '''~/.bashrc''':&lt;br /&gt;
&lt;br /&gt;
 export SINGULARITY_CACHEDIR=$HOME/.singularity/cache&lt;br /&gt;
 export MINIWDL__SINGULARITY__IMAGE_CACHE=$HOME/.cache/miniwdl&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=469</id>
		<title>Phoenix WDL Tutorial</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=469"/>
		<updated>2024-02-13T23:01:15Z</updated>

		<summary type="html">&lt;p&gt;Anovak: /* Debugging Workflows */ Explain restarting&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Tutorial: Getting Started with WDL Workflows on Phoenix'''&lt;br /&gt;
&lt;br /&gt;
Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you.&lt;br /&gt;
&lt;br /&gt;
Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.&lt;br /&gt;
&lt;br /&gt;
This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.&lt;br /&gt;
&lt;br /&gt;
=Phoenix Cluster Setup=&lt;br /&gt;
&lt;br /&gt;
Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.&lt;br /&gt;
&lt;br /&gt;
==Getting VPN access==&lt;br /&gt;
&lt;br /&gt;
We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.&lt;br /&gt;
&lt;br /&gt;
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.&lt;br /&gt;
&lt;br /&gt;
==Connecting to Phoenix==&lt;br /&gt;
&lt;br /&gt;
Once you have VPN access, you can connect to the &amp;quot;head node&amp;quot; of the Phoenix cluster. This node is where everyone logs in, but you should '''not''' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs.&lt;br /&gt;
&lt;br /&gt;
To connect to the head node:&lt;br /&gt;
&lt;br /&gt;
1. Connect to the VPN.&lt;br /&gt;
2. SSH to &amp;lt;code&amp;gt;phoenix.prism&amp;lt;/code&amp;gt;. At the command line, run:&lt;br /&gt;
&lt;br /&gt;
 ssh phoenix.prism&lt;br /&gt;
&lt;br /&gt;
If your username on the cluster (say, &amp;lt;code&amp;gt;flastname&amp;lt;/code&amp;gt;) is different than your username on your computer (which might be &amp;lt;code&amp;gt;firstname&amp;lt;/code&amp;gt;), you might instead have to run:&lt;br /&gt;
&lt;br /&gt;
 ssh flastname@phoenix.prism&lt;br /&gt;
&lt;br /&gt;
The first time you connect, you will see a message like:&lt;br /&gt;
&lt;br /&gt;
 The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established.&lt;br /&gt;
 ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI.&lt;br /&gt;
 This key is not known by any other names&lt;br /&gt;
 Are you sure you want to continue connecting (yes/no/[fingerprint])?&lt;br /&gt;
&lt;br /&gt;
This is your computer asking you to help it decide if it is talking to the genuine &amp;lt;code&amp;gt;phoenix.prism&amp;lt;/code&amp;gt;, and not an imposter. You will want to make sure that the &amp;quot;key fingerprint&amp;quot; is indeed &amp;lt;code&amp;gt;SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI&amp;lt;/code&amp;gt;. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type &amp;lt;code&amp;gt;yes&amp;lt;/code&amp;gt; to accept and remember that the server is who it says it is.&lt;br /&gt;
&lt;br /&gt;
==Installing Toil with WDL support==&lt;br /&gt;
&lt;br /&gt;
Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
If you also want to use AWS S3 &amp;lt;code&amp;gt;s3://&amp;lt;/code&amp;gt; and/or Google &amp;lt;code&amp;gt;gs://&amp;lt;/code&amp;gt; URLs for data, you will need to also install Toil with the &amp;lt;code&amp;gt;aws&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;google&amp;lt;/code&amp;gt; extras, respectively:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
This will install Toil in the &amp;lt;code&amp;gt;.local&amp;lt;/code&amp;gt; directory inside your home directory, which we write as &amp;lt;code&amp;gt;~/.local&amp;lt;/code&amp;gt;. The program to run WDL workflows, &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, will be at &amp;lt;code&amp;gt;~/.local/bin/toil-wdl-runner&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
By default, the command interpreter '''will not''' look there, so if you type &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run:&lt;br /&gt;
&lt;br /&gt;
 echo 'export PATH=&amp;quot;${HOME}/.local/bin:${PATH}&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in''', to restart bash and pick up the change.&lt;br /&gt;
&lt;br /&gt;
To make sure it worked, you can run:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --help&lt;br /&gt;
&lt;br /&gt;
If everything worked correctly, it will print a long list of the various option flags that the &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; command supports.&lt;br /&gt;
&lt;br /&gt;
If you ever want to upgrade Toil to a new release, you can repeat the &amp;lt;code&amp;gt;pip&amp;lt;/code&amp;gt; command above.&lt;br /&gt;
&lt;br /&gt;
==Configuring your Phoenix Environment==&lt;br /&gt;
&lt;br /&gt;
'''Do not try and store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. Usually you would end up with &amp;lt;code&amp;gt;/private/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Remember this path; we will need it later.&lt;br /&gt;
&lt;br /&gt;
==Configuring Toil for Phoenix==&lt;br /&gt;
&lt;br /&gt;
Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we do have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. Since these image files can be large, and the home directory quota is only 30 GB, we can't keep them in your home directory. We will need to use the &amp;lt;code&amp;gt;/private/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt; directory you created earlier.&lt;br /&gt;
&lt;br /&gt;
Record that directory's path in your &amp;lt;code&amp;gt;~/.bashrc&amp;lt;/code&amp;gt; file by editing this command to use your own path, and then running it:&lt;br /&gt;
&lt;br /&gt;
 echo 'BIG_DATA_DIR=/private/groups/YOURGROUPNAME/YOURUSERNAME' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
Then use these commands to make sure that Toil knows where it ought to put its caches:&lt;br /&gt;
&lt;br /&gt;
 echo 'export SINGULARITY_CACHEDIR=&amp;quot;${BIG_DATA_DIR}/.singularity/cache&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE=&amp;quot;${BIG_DATA_DIR}/.cache/miniwdl&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in again''', to apply the changes.&lt;br /&gt;
&lt;br /&gt;
If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day].&lt;br /&gt;
&lt;br /&gt;
'''If you get errors about mutexes, lock files, or other weird problems with Singularity''', try moving these directories to inside &amp;lt;code&amp;gt;/data/tmp&amp;lt;/code&amp;gt; on the individual nodes, or unsetting them and letting Toil use its defaults (and exhaust our Docker pull limits). [https://github.com/DataBiosphere/toil/issues/4654 It is not clear that &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt; actually implements the necessary file locking correctly.]&lt;br /&gt;
&lt;br /&gt;
=Running an existing workflow=&lt;br /&gt;
&lt;br /&gt;
First, let's use &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].&lt;br /&gt;
&lt;br /&gt;
To start, go to your user directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;, and make a directory to work in.&lt;br /&gt;
&lt;br /&gt;
 cd /private/groups/YOURGROUPNAME/YOURUSERNAME&lt;br /&gt;
 mkdir workflow-test&lt;br /&gt;
 cd workflow-test&lt;br /&gt;
&lt;br /&gt;
Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.&lt;br /&gt;
&lt;br /&gt;
 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl&lt;br /&gt;
&lt;br /&gt;
==Preparing an input file==&lt;br /&gt;
&lt;br /&gt;
Near the top of the WDL file, there's a section like this:&lt;br /&gt;
&lt;br /&gt;
 workflow hello_caller {&lt;br /&gt;
     input {&lt;br /&gt;
         File who&lt;br /&gt;
     }&lt;br /&gt;
&lt;br /&gt;
This means that there is a workflow named &amp;lt;code&amp;gt;hello_caller&amp;lt;/code&amp;gt; in this file, and it takes as input a file variable named &amp;lt;code&amp;gt;who&amp;lt;/code&amp;gt;. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.&lt;br /&gt;
&lt;br /&gt;
So first, we have to make that list of names. Let's make it in &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 echo &amp;quot;Mridula Resurrección&amp;quot; &amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Gershom Šarlota&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Ritchie Ravi&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
&lt;br /&gt;
Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt; that references it by relative path, like this:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./names.txt&amp;quot;}' &amp;gt;inputs.json&lt;br /&gt;
&lt;br /&gt;
Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].&lt;br /&gt;
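&lt;br /&gt;
A stray comma or quote in the inputs file will make the run fail before it starts, so it can be worth checking that the file is valid JSON, for example with Python's built-in &amp;lt;code&amp;gt;json.tool&amp;lt;/code&amp;gt; module:&lt;br /&gt;
&lt;br /&gt;
```shell
printf '%s\n' '{"hello_caller.who": "./names.txt"}' > inputs.json
# Pretty-prints the file if it parses, or exits with an error if it does not
python3 -m json.tool inputs.json
```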
&lt;br /&gt;
==Testing at small scale on a single machine==&lt;br /&gt;
&lt;br /&gt;
We are now ready to run the workflow!&lt;br /&gt;
&lt;br /&gt;
You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:&lt;br /&gt;
&lt;br /&gt;
 srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i&lt;br /&gt;
&lt;br /&gt;
This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In your new shell, run this Toil command:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner self_test.wdl inputs.json -o local_run&lt;br /&gt;
&lt;br /&gt;
This will, by default, use the &amp;lt;code&amp;gt;single_machine&amp;lt;/code&amp;gt; Toil &amp;quot;batch system&amp;quot; to run all of the workflow's tasks locally. Output will be sent to a new directory named &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
This will print a lot of logging to standard error, and to standard output it will print:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;hello_caller.message_files&amp;quot;: [&amp;quot;local_run/Mridula Resurrecci\u00f3n.txt&amp;quot;, &amp;quot;local_run/Gershom \u0160arlota.txt&amp;quot;, &amp;quot;local_run/Ritchie Ravi.txt&amp;quot;], &amp;quot;hello_caller.messages&amp;quot;: [&amp;quot;Hello, Mridula Resurrecci\u00f3n!&amp;quot;, &amp;quot;Hello, Gershom \u0160arlota!&amp;quot;, &amp;quot;Hello, Ritchie Ravi!&amp;quot;]}&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt; directory will contain the described text files (with Unicode escape sequences like &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt; replaced by their corresponding characters), each containing a greeting for the corresponding person.&lt;br /&gt;
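&lt;br /&gt;
Those escape sequences are plain JSON string notation, not anything Toil-specific; any JSON parser turns them back into the original characters:&lt;br /&gt;
&lt;br /&gt;
```shell
# \u00f3 in a JSON string is just an escaped ó
python3 -c 'import json, sys; print(json.loads(sys.argv[1]))' '"Resurrecci\u00f3n"'
```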
&lt;br /&gt;
To leave your interactive Slurm session and return to the head node, use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Running at larger scale==&lt;br /&gt;
&lt;br /&gt;
Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people!&lt;br /&gt;
&lt;br /&gt;
Go get this handy list of people and cut it to length:&lt;br /&gt;
&lt;br /&gt;
 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt&lt;br /&gt;
 head -n100 1000_names.txt &amp;gt;100_names.txt&lt;br /&gt;
&lt;br /&gt;
And make a new inputs file:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./100_names.txt&amp;quot;}' &amp;gt;inputs_big.json&lt;br /&gt;
&lt;br /&gt;
Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.&lt;br /&gt;
&lt;br /&gt;
To run against the Slurm cluster, we need to use the &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the &amp;lt;code&amp;gt;--caching false&amp;lt;/code&amp;gt; option. We will add the &amp;lt;code&amp;gt;--batchLogsDir&amp;lt;/code&amp;gt; option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the &amp;lt;code&amp;gt;-m&amp;lt;/code&amp;gt; option to save the output JSON to a file instead of printing it.&lt;br /&gt;
&lt;br /&gt;
Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the &amp;lt;code&amp;gt;TOIL_SLURM_ARGS&amp;lt;/code&amp;gt; environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in.&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json&lt;br /&gt;
&lt;br /&gt;
This will tick for a while, but eventually you should end up with 100 greeting files in the &amp;lt;code&amp;gt;slurm_run&amp;lt;/code&amp;gt; directory.&lt;br /&gt;
&lt;br /&gt;
=Writing your own workflow=&lt;br /&gt;
&lt;br /&gt;
In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. &lt;br /&gt;
&lt;br /&gt;
==Writing the file==&lt;br /&gt;
&lt;br /&gt;
===Version===&lt;br /&gt;
&lt;br /&gt;
All WDL files need to start with a &amp;lt;code&amp;gt;version&amp;lt;/code&amp;gt; statement (unless they are very old &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; files). Toil supports &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt;, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; and 1.0.&lt;br /&gt;
&lt;br /&gt;
So let's start a new WDL 1.0 workflow. Open up a file named &amp;lt;code&amp;gt;fizzbuzz.wdl&amp;lt;/code&amp;gt; and start with a version statement:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
&lt;br /&gt;
===Workflow Block===&lt;br /&gt;
&lt;br /&gt;
Then, add an empty &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; named &amp;lt;code&amp;gt;FizzBuzz&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Input Block===&lt;br /&gt;
&lt;br /&gt;
Workflows usually need some kind of user input, so let's give our workflow an &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; section.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Notice that each input has a type, a name, and an optional default value. If the type ends in &amp;lt;code&amp;gt;?&amp;lt;/code&amp;gt;, the value is optional, and it may be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.&lt;br /&gt;
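&lt;br /&gt;
For example, an inputs file for this workflow only has to provide &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;; the other inputs may be included to override their defaults. The values here are made up:&lt;br /&gt;
&lt;br /&gt;
```shell
printf '%s\n' '{"FizzBuzz.item_count": 15, "FizzBuzz.fizzbuzz_override": "FB!"}' > fizzbuzz_inputs.json
python3 -m json.tool fizzbuzz_inputs.json
```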
&lt;br /&gt;
===Body===&lt;br /&gt;
&lt;br /&gt;
Now we'll start on the body of the workflow, to be inserted just after the inputs section. &lt;br /&gt;
&lt;br /&gt;
The first thing we're going to need to do is create an array of all the numbers up to the &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;. We can do this by calling the WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; function, and assigning the result to an &amp;lt;code&amp;gt;Array[Int]&amp;lt;/code&amp;gt; variable.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
&lt;br /&gt;
WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.&lt;br /&gt;
&lt;br /&gt;
===Scattering===&lt;br /&gt;
&lt;br /&gt;
Once we create an array of all the numbers, we can use a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt; to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a &amp;lt;code&amp;gt;map()&amp;lt;/code&amp;gt; in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; starts at 0.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
     Int one_based = i + 1&lt;br /&gt;
 }&lt;br /&gt;
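&lt;br /&gt;
As a rough Python analogy, the scatter collects each iteration's &amp;lt;code&amp;gt;one_based&amp;lt;/code&amp;gt; value into an array, much like a list comprehension does:&lt;br /&gt;
&lt;br /&gt;
```shell
# With item_count = 5, the scatter's one_based values are 1 through 5
python3 -c 'item_count = 5; print([i + 1 for i in range(item_count)])'
```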
&lt;br /&gt;
===Conditionals===&lt;br /&gt;
&lt;br /&gt;
Inside the body of the scatter, we are going to put some conditionals to determine if we should produce &amp;lt;code&amp;gt;&amp;quot;Fizz&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;Buzz&amp;quot;&amp;lt;/code&amp;gt;, or &amp;lt;code&amp;gt;&amp;quot;FizzBuzz&amp;quot;&amp;lt;/code&amp;gt;. To support our &amp;lt;code&amp;gt;fizzbuzz_override&amp;lt;/code&amp;gt;, we put it and a default value into an array, and use the WDL &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; function to find the first non-null value in that array.&lt;br /&gt;
&lt;br /&gt;
Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; at the end and take advantage of variables from un-executed conditionals being &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. &lt;br /&gt;
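&lt;br /&gt;
The behavior of &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; can be sketched in Python (a hypothetical re-implementation for illustration, not Toil's actual code):&lt;br /&gt;
&lt;br /&gt;
```python
def select_first(values):
    # Mimics WDL select_first(): return the first non-null value,
    # or fail if every value in the array is null.
    for v in values:
        if v is not None:
            return v
    raise ValueError("all values were null")
```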
&lt;br /&gt;
Note that WDL supports conditional ''expressions'' with a &amp;lt;code&amp;gt;then&amp;lt;/code&amp;gt; and an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt;, but conditional ''statements'' only have a body, not an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt; branch. If you need an else you will have to check the negated condition.&lt;br /&gt;
&lt;br /&gt;
So first, let's handle the special cases.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Calling Tasks===&lt;br /&gt;
&lt;br /&gt;
Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a &amp;lt;code&amp;gt;${}&amp;lt;/code&amp;gt; substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a &amp;lt;code&amp;gt;stringify_number&amp;lt;/code&amp;gt; task, to be written later.&lt;br /&gt;
&lt;br /&gt;
To call a task (or another workflow), we use a &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt; statement and give it some inputs. Then we can fish the output values out of the task with &amp;lt;code&amp;gt;.&amp;lt;/code&amp;gt; access; for numbers where we made a noise instead, the call didn't run and its outputs are &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
        call stringify_number {&lt;br /&gt;
            input:&lt;br /&gt;
                the_number = one_based&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We can put the code into the workflow now, and set about writing the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         &lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Writing Tasks===&lt;br /&gt;
&lt;br /&gt;
Our task should go after the workflow in the file. It looks a lot like a workflow except it uses &amp;lt;code&amp;gt;task&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're going to want it to take in an integer &amp;lt;code&amp;gt;the_number&amp;lt;/code&amp;gt;, and we're going to want it to output a string &amp;lt;code&amp;gt;the_string&amp;lt;/code&amp;gt;. So let's fill that in in &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; sections.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     # ???&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now, unlike workflows, tasks can have a &amp;lt;code&amp;gt;command&amp;lt;/code&amp;gt; section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use &amp;lt;code&amp;gt;~{}&amp;lt;/code&amp;gt;, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now we need to capture the result of the command script. The WDL &amp;lt;code&amp;gt;stdout()&amp;lt;/code&amp;gt; returns a WDL &amp;lt;code&amp;gt;File&amp;lt;/code&amp;gt; containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
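&lt;br /&gt;
Outside of WDL, this capture step behaves roughly like the following Python sketch (an analogy using the subprocess module, not Toil internals):&lt;br /&gt;
&lt;br /&gt;
```python
import subprocess

# Rough analogy (not Toil internals): run the command script, capture its
# standard output, and strip the trailing newline like WDL read_string().
the_number = 7
result = subprocess.run(["echo", str(the_number)], capture_output=True, text=True)
the_string = result.stdout.rstrip("\n")
```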
&lt;br /&gt;
We're also going to want to add a &amp;lt;code&amp;gt;runtime&amp;lt;/code&amp;gt; section to our task, to specify resource requirements. We'll tell it to run in a Docker container too, to make sure that absolutely nothing can go wrong with our delicate &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;disks&amp;lt;/code&amp;gt; section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a &amp;lt;code&amp;gt;local-disk&amp;lt;/code&amp;gt; of a certain number of gigabytes, with an optional hint (here &amp;lt;code&amp;gt;SSD&amp;lt;/code&amp;gt;) about what kind of storage to use.&lt;br /&gt;
&lt;br /&gt;
Then we can put our task into our WDL file:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Output Block===&lt;br /&gt;
&lt;br /&gt;
Now the only thing missing is a workflow-level &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section. Technically, in WDL 1.0 you aren't supposed to need this, but it is required in WDL 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an &amp;lt;code&amp;gt;Array[String]&amp;lt;/code&amp;gt;. We'll add the &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section at the end of the &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; section, above the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
     output {&lt;br /&gt;
        Array[String] fizzbuzz_results = result&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Because the &amp;lt;code&amp;gt;result&amp;lt;/code&amp;gt; variable is defined inside a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt;, when we reference it outside the scatter we see it as being an array.&lt;br /&gt;
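&lt;br /&gt;
As a sanity check on what the workflow should produce, here is the same FizzBuzz logic as plain Python (an illustration of the expected results, not how Toil runs anything):&lt;br /&gt;
&lt;br /&gt;
```python
def fizzbuzz(item_count, to_fizz=3, to_buzz=5, fizzbuzz_override=None):
    # Mirror the workflow: one result string per number from 1 to item_count.
    results = []
    for i in range(item_count):
        n = i + 1
        if n % to_fizz == 0 and n % to_buzz == 0:
            results.append(fizzbuzz_override or "FizzBuzz")
        elif n % to_fizz == 0:
            results.append("Fizz")
        elif n % to_buzz == 0:
            results.append("Buzz")
        else:
            results.append(str(n))  # the stringify_number task's job
    return results
```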
&lt;br /&gt;
==Running the Workflow==&lt;br /&gt;
&lt;br /&gt;
Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;FizzBuzz.item_count&amp;quot;: 20}' &amp;gt;fizzbuzz.json&lt;br /&gt;
&lt;br /&gt;
Then run it on the cluster with Toil:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
Or locally:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
=Debugging Workflows=&lt;br /&gt;
Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.&lt;br /&gt;
&lt;br /&gt;
==Restarting the Workflow==&lt;br /&gt;
&lt;br /&gt;
If you think your workflow failed from a transient problem (such as a Docker image not being available) that you have fixed, and you ran the workflow with &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; set manually to a directory that persists between attempts, you can add &amp;lt;code&amp;gt;--restart&amp;lt;/code&amp;gt; to your workflow command and make Toil try again. It will pick up from where it left off and rerun any failed tasks and then the rest of the workflow.&lt;br /&gt;
&lt;br /&gt;
This ''will not'' pick up any changes to your WDL source code files; those are read once at the beginning and not re-read on restart.&lt;br /&gt;
&lt;br /&gt;
If restarting the workflow doesn't help, you may need to move on to more advanced debugging techniques.&lt;br /&gt;
&lt;br /&gt;
==Debugging Options==&lt;br /&gt;
When debugging a workflow, make sure to run the workflow with &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt;, to set the log level to &amp;lt;code&amp;gt;DEBUG&amp;lt;/code&amp;gt;, and with &amp;lt;code&amp;gt;--jobStore /some/path/to/a/shared/directory/it/can/create&amp;lt;/code&amp;gt; so that the stored files shipped between jobs are in a place you can access them.&lt;br /&gt;
&lt;br /&gt;
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:&lt;br /&gt;
&lt;br /&gt;
 =========&amp;gt;&lt;br /&gt;
         Toil job log is here&lt;br /&gt;
 &amp;lt;=========&lt;br /&gt;
&lt;br /&gt;
Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this.&lt;br /&gt;
&lt;br /&gt;
==Reading the Log==&lt;br /&gt;
&lt;br /&gt;
When a WDL workflow fails, you are likely to see a message like this:&lt;br /&gt;
&lt;br /&gt;
 WDL.runtime.error.CommandFailed: task command failed with exit status 1&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism&lt;br /&gt;
&lt;br /&gt;
This means that the command line specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit status. This happens either when the command is written wrong, or when the tool you are trying to run detects and reports an error.&lt;br /&gt;
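&lt;br /&gt;
To make the exit-status convention concrete, here is a small Python illustration (using the standard &amp;lt;code&amp;gt;true&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;false&amp;lt;/code&amp;gt; commands; this is not Toil's own check):&lt;br /&gt;
&lt;br /&gt;
```python
import subprocess

# Exit status 0 means success; any nonzero status is what Toil
# reports as a failed task command.
ok = subprocess.run(["true"])    # a command that succeeds
bad = subprocess.run(["false"])  # a command that fails
```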
&lt;br /&gt;
Go up higher in the log until you find lines that look like:&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows:&lt;br /&gt;
&lt;br /&gt;
And&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows:&lt;br /&gt;
&lt;br /&gt;
These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.&lt;br /&gt;
&lt;br /&gt;
==Reproducing Problems==&lt;br /&gt;
&lt;br /&gt;
When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. In the log of your failing Toil task, look for lines like this:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&amp;lt;/code&amp;gt; part is a Toil file ID, and it is a relative path from your &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; value to where the file is stored on disk. So if you ran the workflow with &amp;lt;code&amp;gt;--jobStore /private/groups/patenlab/anovak/jobstore&amp;lt;/code&amp;gt;, you would look for this file at:&lt;br /&gt;
&lt;br /&gt;
 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&lt;br /&gt;
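&lt;br /&gt;
In other words, the full path is just the job store path joined with the relative file ID; in Python (using the example paths from above):&lt;br /&gt;
&lt;br /&gt;
```python
import os.path

# Join the --jobStore path with the relative Toil file ID from the log.
# Paths are copied from the example above, for illustration only.
job_store = "/private/groups/patenlab/anovak/jobstore"
file_id = ("files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/"
           "file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam")
full_path = os.path.join(job_store, file_id)
```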
&lt;br /&gt;
==More Ways of Finding Files==&lt;br /&gt;
&lt;br /&gt;
Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the &amp;lt;code&amp;gt;find&amp;lt;/code&amp;gt; command to try and find the files by name. For example, if you want to look at &amp;lt;code&amp;gt;Sample.bam&amp;lt;/code&amp;gt;, you can look for it like this:&lt;br /&gt;
&lt;br /&gt;
 find /path/to/the/jobstore -name &amp;quot;Sample.bam&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
You can take the &amp;lt;code&amp;gt;toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt; URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this:&lt;br /&gt;
&lt;br /&gt;
 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
Then you can take the part after the last colon, &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt;, and that is the path relative to the job store where this file can be found.&lt;br /&gt;
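&lt;br /&gt;
These decode-and-split steps can also be scripted; a small Python sketch using only the standard library (the URI is the example from above):&lt;br /&gt;
&lt;br /&gt;
```python
from urllib.parse import unquote

# Decode the toilfile: URI from the log and pull out the
# job-store-relative path after the last colon.
uri = ("toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob"
       "%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351"
       "%2FSample.chr14.bam/Sample.chr14.bam")
decoded = unquote(uri)
relative_path = decoded.rsplit(":", 1)[1]
```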
&lt;br /&gt;
==Using Development Versions of Toil==&lt;br /&gt;
&lt;br /&gt;
Sometimes, bugs will be fixed in the development version of Toil, but not released yet. To try the current development version of Toil, you can install it like this:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
If you want to use a particular branch or commit, like &amp;lt;code&amp;gt;aaa451b320fc115b3563ced25cb501301cf86f90&amp;lt;/code&amp;gt;, you can do:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
==Frequently Asked Questions==&lt;br /&gt;
&lt;br /&gt;
===I am getting warnings about &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt;===&lt;br /&gt;
You may be seeing warnings that &amp;lt;code&amp;gt;XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This happens because Slurm is providing Toil with an &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt; environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it.&lt;br /&gt;
&lt;br /&gt;
===Toil said it was &amp;lt;code&amp;gt;Redirecting logging&amp;lt;/code&amp;gt; somewhere, but I can't find that file!===&lt;br /&gt;
The Toil worker process for each job will say that it is &amp;lt;code&amp;gt;Redirecting logging to /data/tmp/somewhere/worker_log.txt&amp;lt;/code&amp;gt;, and when running in single machine mode these messages go to the main Toil log.&lt;br /&gt;
&lt;br /&gt;
The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt; option to Toil.&lt;br /&gt;
&lt;br /&gt;
If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.&lt;br /&gt;
&lt;br /&gt;
=Additional WDL resources=&lt;br /&gt;
&lt;br /&gt;
For more information on writing and running WDL workflows, see:&lt;br /&gt;
&lt;br /&gt;
* [https://docs.openwdl.org/en/stable/ The WDL documentation]&lt;br /&gt;
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The &amp;quot;Learn WDL&amp;quot; video course on YouTube]&lt;br /&gt;
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=468</id>
		<title>Phoenix WDL Tutorial</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=468"/>
		<updated>2024-02-06T15:09:09Z</updated>

		<summary type="html">&lt;p&gt;Anovak: /* Reading the Log */ Update with new examples of the new log logging format.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Tutorial: Getting Started with WDL Workflows on Phoenix'''&lt;br /&gt;
&lt;br /&gt;
Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like the Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you.&lt;br /&gt;
&lt;br /&gt;
Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.&lt;br /&gt;
&lt;br /&gt;
This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.&lt;br /&gt;
&lt;br /&gt;
=Phoenix Cluster Setup=&lt;br /&gt;
&lt;br /&gt;
Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.&lt;br /&gt;
&lt;br /&gt;
==Getting VPN access==&lt;br /&gt;
&lt;br /&gt;
We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.&lt;br /&gt;
&lt;br /&gt;
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.&lt;br /&gt;
&lt;br /&gt;
==Connecting to Phoenix==&lt;br /&gt;
&lt;br /&gt;
Once you have VPN access, you can connect to the &amp;quot;head node&amp;quot; of the Phoenix cluster. This node is where everyone logs in, but you should '''not''' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs.&lt;br /&gt;
&lt;br /&gt;
To connect to the head node:&lt;br /&gt;
&lt;br /&gt;
1. Connect to the VPN.&lt;br /&gt;
2. SSH to &amp;lt;code&amp;gt;phoenix.prism&amp;lt;/code&amp;gt;. At the command line, run:&lt;br /&gt;
&lt;br /&gt;
 ssh phoenix.prism&lt;br /&gt;
&lt;br /&gt;
If your username on the cluster (say, &amp;lt;code&amp;gt;flastname&amp;lt;/code&amp;gt;) is different than your username on your computer (which might be &amp;lt;code&amp;gt;firstname&amp;lt;/code&amp;gt;), you might instead have to run:&lt;br /&gt;
&lt;br /&gt;
 ssh flastname@phoenix.prism&lt;br /&gt;
&lt;br /&gt;
The first time you connect, you will see a message like:&lt;br /&gt;
&lt;br /&gt;
 The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established.&lt;br /&gt;
 ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI.&lt;br /&gt;
 This key is not known by any other names&lt;br /&gt;
 Are you sure you want to continue connecting (yes/no/[fingerprint])?&lt;br /&gt;
&lt;br /&gt;
This is your computer asking you to help it decide if it is talking to the genuine &amp;lt;code&amp;gt;phoenix.prism&amp;lt;/code&amp;gt;, and not an imposter. You will want to make sure that the &amp;quot;key fingerprint&amp;quot; is indeed &amp;lt;code&amp;gt;SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI&amp;lt;/code&amp;gt;. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type &amp;lt;code&amp;gt;yes&amp;lt;/code&amp;gt; to accept and remember that the server is who it says it is.&lt;br /&gt;
&lt;br /&gt;
==Installing Toil with WDL support==&lt;br /&gt;
&lt;br /&gt;
Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
If you also want to use AWS S3 &amp;lt;code&amp;gt;s3://&amp;lt;/code&amp;gt; and/or Google &amp;lt;code&amp;gt;gs://&amp;lt;/code&amp;gt; URLs for data, you will need to also install Toil with the &amp;lt;code&amp;gt;aws&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;google&amp;lt;/code&amp;gt; extras, respectively:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
This will install Toil in the &amp;lt;code&amp;gt;.local&amp;lt;/code&amp;gt; directory inside your home directory, which we write as &amp;lt;code&amp;gt;~/.local&amp;lt;/code&amp;gt;. The program to run WDL workflows, &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, will be at &amp;lt;code&amp;gt;~/.local/bin/toil-wdl-runner&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
By default, the command interpreter '''will not''' look there, so if you type &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run:&lt;br /&gt;
&lt;br /&gt;
 echo 'export PATH=&amp;quot;${HOME}/.local/bin:${PATH}&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in''', to restart bash and pick up the change.&lt;br /&gt;
&lt;br /&gt;
To make sure it worked, you can run:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --help&lt;br /&gt;
&lt;br /&gt;
If everything worked correctly, it will print a long list of the various option flags that the &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; command supports.&lt;br /&gt;
&lt;br /&gt;
If you ever want to upgrade Toil to a new release, you can repeat the &amp;lt;code&amp;gt;pip&amp;lt;/code&amp;gt; command above.&lt;br /&gt;
&lt;br /&gt;
==Configuring your Phoenix Environment==&lt;br /&gt;
&lt;br /&gt;
'''Do not try to store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. Usually you would end up with &amp;lt;code&amp;gt;/private/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Remember this path; we will need it later.&lt;br /&gt;
&lt;br /&gt;
==Configuring Toil for Phoenix==&lt;br /&gt;
&lt;br /&gt;
Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. On the Phoenix cluster, however, we do have a shared filesystem, so we should configure Toil to use it for caching the Docker container images used for running workflow steps. Since these files can be large, and the home directory quota is only 30 GB, we can't keep them in your home directory. We will need to use the &amp;lt;code&amp;gt;/private/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt; directory you created earlier.&lt;br /&gt;
&lt;br /&gt;
Record that directory in your &amp;lt;code&amp;gt;~/.bashrc&amp;lt;/code&amp;gt; file by editing this command to use your actual group and user names, and then running it:&lt;br /&gt;
&lt;br /&gt;
 echo 'BIG_DATA_DIR=/private/groups/YOURGROUPNAME/YOURUSERNAME' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
Then use these commands to make sure that Toil knows where it ought to put its caches:&lt;br /&gt;
&lt;br /&gt;
 echo 'export SINGULARITY_CACHEDIR=&amp;quot;${BIG_DATA_DIR}/.singularity/cache&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE=&amp;quot;${BIG_DATA_DIR}/.cache/miniwdl&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in again''', to apply the changes.&lt;br /&gt;
&lt;br /&gt;
If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day].&lt;br /&gt;
&lt;br /&gt;
'''If you get errors about mutexes, lock files, or other weird problems with Singularity''', try moving these directories to inside &amp;lt;code&amp;gt;/data/tmp&amp;lt;/code&amp;gt; on the individual nodes, or unsetting them and letting Toil use its defaults (and exhaust our Docker pull limits). [https://github.com/DataBiosphere/toil/issues/4654 It is not clear that &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt; actually implements the necessary file locking correctly.]&lt;br /&gt;
&lt;br /&gt;
=Running an existing workflow=&lt;br /&gt;
&lt;br /&gt;
First, let's use &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].&lt;br /&gt;
&lt;br /&gt;
To start, go to your user directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;, and make a directory to work in.&lt;br /&gt;
&lt;br /&gt;
 cd /private/groups/YOURGROUPNAME/YOURUSERNAME&lt;br /&gt;
 mkdir workflow-test&lt;br /&gt;
 cd workflow-test&lt;br /&gt;
&lt;br /&gt;
Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.&lt;br /&gt;
&lt;br /&gt;
 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl&lt;br /&gt;
&lt;br /&gt;
==Preparing an input file==&lt;br /&gt;
&lt;br /&gt;
Near the top of the WDL file, there's a section like this:&lt;br /&gt;
&lt;br /&gt;
 workflow hello_caller {&lt;br /&gt;
     input {&lt;br /&gt;
         File who&lt;br /&gt;
     }&lt;br /&gt;
&lt;br /&gt;
This means that there is a workflow named &amp;lt;code&amp;gt;hello_caller&amp;lt;/code&amp;gt; in this file, and it takes as input a file variable named &amp;lt;code&amp;gt;who&amp;lt;/code&amp;gt;. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.&lt;br /&gt;
&lt;br /&gt;
So first, we have to make that list of names. Let's put it in &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 echo &amp;quot;Mridula Resurrección&amp;quot; &amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Gershom Šarlota&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Ritchie Ravi&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
&lt;br /&gt;
Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt; that references it by relative path, like this:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./names.txt&amp;quot;}' &amp;gt;inputs.json&lt;br /&gt;
&lt;br /&gt;
Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].&lt;br /&gt;
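If you are scripting your runs, it can be handy to generate the inputs file rather than hand-write it. A minimal sketch in Python (using only the standard &amp;lt;code&amp;gt;json&amp;lt;/code&amp;gt; module; the key and path match the example above):&lt;br /&gt;
&lt;br /&gt;
```python
import json

# Inputs mapping: keys are "WORKFLOWNAME.INPUTNAME".
# File inputs take a path (relative to the inputs file's location) or a URL.
inputs = {"hello_caller.who": "./names.txt"}

# Serialize to the JSON text that goes into inputs.json.
inputs_json = json.dumps(inputs)
print(inputs_json)  # {"hello_caller.who": "./names.txt"}
```
&lt;br /&gt;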
&lt;br /&gt;
==Testing at small scale on a single machine==&lt;br /&gt;
&lt;br /&gt;
We are now ready to run the workflow!&lt;br /&gt;
&lt;br /&gt;
You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:&lt;br /&gt;
&lt;br /&gt;
 srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i&lt;br /&gt;
&lt;br /&gt;
This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In your new shell, run this Toil command:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner self_test.wdl inputs.json -o local_run&lt;br /&gt;
&lt;br /&gt;
This will, by default, use the &amp;lt;code&amp;gt;single_machine&amp;lt;/code&amp;gt; Toil &amp;quot;batch system&amp;quot; to run all of the workflow's tasks locally. Output will be sent to a new directory named &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
This will print a lot of logging to standard error, and to standard output it will print:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;hello_caller.message_files&amp;quot;: [&amp;quot;local_run/Mridula Resurrecci\u00f3n.txt&amp;quot;, &amp;quot;local_run/Gershom \u0160arlota.txt&amp;quot;, &amp;quot;local_run/Ritchie Ravi.txt&amp;quot;], &amp;quot;hello_caller.messages&amp;quot;: [&amp;quot;Hello, Mridula Resurrecci\u00f3n!&amp;quot;, &amp;quot;Hello, Gershom \u0160arlota!&amp;quot;, &amp;quot;Hello, Ritchie Ravi!&amp;quot;]}&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt; directory will contain the described text files (with Unicode escape sequences like &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt; replaced by their corresponding characters), each containing a greeting for the corresponding person.&lt;br /&gt;
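Those &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt; sequences are just ASCII-safe JSON escapes; any JSON parser turns them back into the original characters. A quick Python illustration:&lt;br /&gt;
&lt;br /&gt;
```python
import json

# JSON may escape non-ASCII characters as \uXXXX sequences;
# parsing the JSON restores the real characters.
escaped = '"local_run/Mridula Resurrecci\\u00f3n.txt"'
decoded = json.loads(escaped)
print(decoded)  # local_run/Mridula Resurrección.txt
```
&lt;br /&gt;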
&lt;br /&gt;
To leave your interactive Slurm session and return to the head node, use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Running at larger scale==&lt;br /&gt;
&lt;br /&gt;
Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people!&lt;br /&gt;
&lt;br /&gt;
Go get this handy list of people and cut it to length:&lt;br /&gt;
&lt;br /&gt;
 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt&lt;br /&gt;
 head -n100 1000_names.txt &amp;gt;100_names.txt&lt;br /&gt;
&lt;br /&gt;
And make a new inputs file:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./100_names.txt&amp;quot;}' &amp;gt;inputs_big.json&lt;br /&gt;
&lt;br /&gt;
Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.&lt;br /&gt;
&lt;br /&gt;
To run against the Slurm cluster, we need to use the &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the &amp;lt;code&amp;gt;--caching false&amp;lt;/code&amp;gt; option. We will add the &amp;lt;code&amp;gt;--batchLogsDir&amp;lt;/code&amp;gt; option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the &amp;lt;code&amp;gt;-m&amp;lt;/code&amp;gt; option to save the output JSON to a file instead of printing it.&lt;br /&gt;
&lt;br /&gt;
Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the &amp;lt;code&amp;gt;TOIL_SLURM_ARGS&amp;lt;/code&amp;gt; environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in.&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json&lt;br /&gt;
&lt;br /&gt;
This will tick for a while, but eventually you should end up with 100 greeting files in the &amp;lt;code&amp;gt;slurm_run&amp;lt;/code&amp;gt; directory.&lt;br /&gt;
&lt;br /&gt;
=Writing your own workflow=&lt;br /&gt;
&lt;br /&gt;
In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. &lt;br /&gt;
&lt;br /&gt;
==Writing the file==&lt;br /&gt;
&lt;br /&gt;
===Version===&lt;br /&gt;
&lt;br /&gt;
All WDL files need to start with a &amp;lt;code&amp;gt;version&amp;lt;/code&amp;gt; statement (unless they are very old &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; files). Toil supports &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt;, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; and 1.0.&lt;br /&gt;
&lt;br /&gt;
So let's start a new WDL 1.0 workflow. Open up a file named &amp;lt;code&amp;gt;fizzbuzz.wdl&amp;lt;/code&amp;gt; and start with a version statement:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
&lt;br /&gt;
===Workflow Block===&lt;br /&gt;
&lt;br /&gt;
Then, add an empty &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; named &amp;lt;code&amp;gt;FizzBuzz&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Input Block===&lt;br /&gt;
&lt;br /&gt;
Workflows usually need some kind of user input, so let's give our workflow an &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; section.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Notice that each input has a type, a name, and an optional default value. If the type ends in &amp;lt;code&amp;gt;?&amp;lt;/code&amp;gt;, the value is optional, and it may be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.&lt;br /&gt;
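As an illustration (with made-up values), here are two inputs files for this workflow, sketched in Python: only &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt; must be provided, defaults can be overridden, and the optional input can be set or left out:&lt;br /&gt;
&lt;br /&gt;
```python
import json

# item_count has no default, so it is required; to_fizz and to_buzz fall
# back to 3 and 5 when omitted, and fizzbuzz_override stays null (None).
minimal = {"FizzBuzz.item_count": 20}

# Defaults and optional inputs can be set the same way.
# The override string "Zap!" is just a made-up example value.
customized = {
    "FizzBuzz.item_count": 20,
    "FizzBuzz.to_buzz": 7,
    "FizzBuzz.fizzbuzz_override": "Zap!",
}

print(json.dumps(minimal))
print(json.dumps(customized))
```
&lt;br /&gt;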
&lt;br /&gt;
===Body===&lt;br /&gt;
&lt;br /&gt;
Now we'll start on the body of the workflow, to be inserted just after the inputs section. &lt;br /&gt;
&lt;br /&gt;
The first thing we're going to need to do is create an array of all the numbers up to the &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;. We can do this by calling the WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; function, and assigning the result to an &amp;lt;code&amp;gt;Array[Int]&amp;lt;/code&amp;gt; variable.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
&lt;br /&gt;
WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.&lt;br /&gt;
&lt;br /&gt;
===Scattering===&lt;br /&gt;
&lt;br /&gt;
Once we create an array of all the numbers, we can use a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt; to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a &amp;lt;code&amp;gt;map()&amp;lt;/code&amp;gt; in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; starts at 0.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
     Int one_based = i + 1&lt;br /&gt;
 }&lt;br /&gt;
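As a rough Python analogy (not how Toil actually executes it), the &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; call and the scatter behave like:&lt;br /&gt;
&lt;br /&gt;
```python
item_count = 5

# WDL range(item_count) is like Python's range(): 0 up to item_count - 1.
numbers = list(range(item_count))

# The scatter body runs once per element, in parallel; here it makes
# each number 1-based.
one_based = [i + 1 for i in numbers]

print(numbers)    # [0, 1, 2, 3, 4]
print(one_based)  # [1, 2, 3, 4, 5]
```
&lt;br /&gt;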
&lt;br /&gt;
===Conditionals===&lt;br /&gt;
&lt;br /&gt;
Inside the body of the scatter, we are going to put some conditionals to determine if we should produce &amp;lt;code&amp;gt;&amp;quot;Fizz&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;Buzz&amp;quot;&amp;lt;/code&amp;gt;, or &amp;lt;code&amp;gt;&amp;quot;FizzBuzz&amp;quot;&amp;lt;/code&amp;gt;. To support our &amp;lt;code&amp;gt;fizzbuzz_override&amp;lt;/code&amp;gt;, we use an array of it and a default value, and use the WDL &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; function to find the first non-null value in that array.&lt;br /&gt;
&lt;br /&gt;
Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; at the end and take advantage of variables from un-executed conditionals being &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Note that WDL supports conditional ''expressions'' with a &amp;lt;code&amp;gt;then&amp;lt;/code&amp;gt; and an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt;, but conditional ''statements'' only have a body, not an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt; branch. If you need an else you will have to check the negated condition.&lt;br /&gt;
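As a Python analogy, &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; just returns the first non-null value in its argument array:&lt;br /&gt;
&lt;br /&gt;
```python
def select_first(values):
    # Analogue of WDL select_first(): first value that is not None (null).
    for value in values:
        if value is not None:
            return value
    raise ValueError("all values were null")

fizzbuzz_override = None  # the optional input was not set
print(select_first([fizzbuzz_override, "FizzBuzz"]))  # FizzBuzz
```
&lt;br /&gt;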
&lt;br /&gt;
So first, let's handle the special cases.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Calling Tasks===&lt;br /&gt;
&lt;br /&gt;
Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a &amp;lt;code&amp;gt;${}&amp;lt;/code&amp;gt; substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a &amp;lt;code&amp;gt;stringify_number&amp;lt;/code&amp;gt; task, to be written later.&lt;br /&gt;
&lt;br /&gt;
To call a task (or another workflow), we use a &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt; statement and give it some inputs. Then we can fish the output values out of the task with &amp;lt;code&amp;gt;.&amp;lt;/code&amp;gt; access; like variables in un-executed conditionals, the outputs of a call that did not run are &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
        call stringify_number {&lt;br /&gt;
            input:&lt;br /&gt;
                the_number = one_based&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We can put the code into the workflow now, and set about writing the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         &lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Writing Tasks===&lt;br /&gt;
&lt;br /&gt;
Our task should go after the workflow in the file. It looks a lot like a workflow except it uses &amp;lt;code&amp;gt;task&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're going to want it to take in an integer &amp;lt;code&amp;gt;the_number&amp;lt;/code&amp;gt;, and we're going to want it to output a string &amp;lt;code&amp;gt;the_string&amp;lt;/code&amp;gt;. So let's fill that in in &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; sections.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     # ???&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now, unlike workflows, tasks can have a &amp;lt;code&amp;gt;command&amp;lt;/code&amp;gt; section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use &amp;lt;code&amp;gt;~{}&amp;lt;/code&amp;gt;, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now we need to capture the result of the command script. The WDL &amp;lt;code&amp;gt;stdout()&amp;lt;/code&amp;gt; returns a WDL &amp;lt;code&amp;gt;File&amp;lt;/code&amp;gt; containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).&lt;br /&gt;
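The same capture-and-strip pattern can be sketched in plain Python with &amp;lt;code&amp;gt;subprocess&amp;lt;/code&amp;gt; (assuming a standard &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command is available):&lt;br /&gt;
&lt;br /&gt;
```python
import subprocess

# Run the command and capture its standard output, as the task runner does.
completed = subprocess.run(
    ["echo", "7"], capture_output=True, text=True, check=True
)

# Like WDL read_string(stdout()), strip the trailing newline echo adds.
the_string = completed.stdout.rstrip("\n")
print(repr(the_string))  # '7'
```
&lt;br /&gt;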
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're also going to want to add a &amp;lt;code&amp;gt;runtime&amp;lt;/code&amp;gt; section to our task, to specify resource requirements. We're going to tell it to run in a Docker container, too, to make sure that absolutely nothing can go wrong with our delicate &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;disks&amp;lt;/code&amp;gt; section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a &amp;lt;code&amp;gt;local-disk&amp;lt;/code&amp;gt; of a certain number of gigabytes, which may suggest that it be &amp;lt;code&amp;gt;SSD&amp;lt;/code&amp;gt; storage.&lt;br /&gt;
&lt;br /&gt;
Then we can put our task into our WDL file:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Output Block===&lt;br /&gt;
&lt;br /&gt;
Now the only thing missing is a workflow-level &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section. Technically, in WDL 1.0 you aren't supposed to need one, but you do need it in 1.1, and without one Toil doesn't actually send your outputs anywhere yet, so we're going to add one. We need to collect all the strings that came out of the different tasks in our scatter into an &amp;lt;code&amp;gt;Array[String]&amp;lt;/code&amp;gt;. We'll add the &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section at the end of the &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; section, above the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
     output {&lt;br /&gt;
          Array[String] fizzbuzz_results = result&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Because the &amp;lt;code&amp;gt;result&amp;lt;/code&amp;gt; variable is defined inside a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt;, when we reference it outside the scatter we see it as being an array.&lt;br /&gt;
&lt;br /&gt;
==Running the Workflow==&lt;br /&gt;
&lt;br /&gt;
Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;FizzBuzz.item_count&amp;quot;: 20}' &amp;gt;fizzbuzz.json&lt;br /&gt;
&lt;br /&gt;
Then run it on the cluster with Toil:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
Or locally:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
=Debugging Workflows=&lt;br /&gt;
Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.&lt;br /&gt;
&lt;br /&gt;
==Debugging Options==&lt;br /&gt;
When debugging a workflow, make sure to run it with &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt;, to set the log level to &amp;lt;code&amp;gt;DEBUG&amp;lt;/code&amp;gt;, and with &amp;lt;code&amp;gt;--jobStore /some/path/to/a/shared/directory/it/can/create&amp;lt;/code&amp;gt;, so that the stored files shipped between jobs are in a place where you can access them.&lt;br /&gt;
&lt;br /&gt;
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:&lt;br /&gt;
&lt;br /&gt;
 =========&amp;gt;&lt;br /&gt;
         Toil job log is here&lt;br /&gt;
 &amp;lt;=========&lt;br /&gt;
&lt;br /&gt;
Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this.&lt;br /&gt;
&lt;br /&gt;
==Reading the Log==&lt;br /&gt;
&lt;br /&gt;
When a WDL workflow fails, you are likely to see a message like this:&lt;br /&gt;
&lt;br /&gt;
 WDL.runtime.error.CommandFailed: task command failed with exit status 1&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism&lt;br /&gt;
&lt;br /&gt;
This means that the command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code. This happens either when the command is written wrong, or when the tool you are trying to run detects and reports an error.&lt;br /&gt;
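&lt;br /&gt;
You can reproduce this failure mode with a plain Bash sketch; the &amp;lt;code&amp;gt;my_tool&amp;lt;/code&amp;gt; function here is hypothetical, standing in for a real tool that detects a problem:&lt;br /&gt;
&lt;br /&gt;
```shell
# Hypothetical stand-in for a tool whose error detection reports a problem.
my_tool() {
    echo 'error: bad input detected'
    return 1
}

status=0
my_tool || status=$?
# A nonzero status like this is what Toil reports as a failed task command.
echo "my_tool exited with status ${status}"
```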
&lt;br /&gt;
Go up higher in the log until you find lines that look like:&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stderr follows:&lt;br /&gt;
&lt;br /&gt;
And&lt;br /&gt;
&lt;br /&gt;
 [2024-01-16T20:12:19-0500] [Thread-3 (statsAndLoggingAggregator)] [I] [toil.statsAndLogging] hello_caller.0.hello.stdout follows:&lt;br /&gt;
&lt;br /&gt;
These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.&lt;br /&gt;
&lt;br /&gt;
==Reproducing Problems==&lt;br /&gt;
&lt;br /&gt;
When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need your tool's input files in order to do this. In the log of your failing Toil task, look for lines like this:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&amp;lt;/code&amp;gt; part is a Toil file ID, and it is a relative path from your &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; value to where the file is stored on disk. So if you ran the workflow with &amp;lt;code&amp;gt;--jobStore /private/groups/patenlab/anovak/jobstore&amp;lt;/code&amp;gt;, you would look for this file at:&lt;br /&gt;
&lt;br /&gt;
 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&lt;br /&gt;
&lt;br /&gt;
==More Ways of Finding Files==&lt;br /&gt;
&lt;br /&gt;
Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the &amp;lt;code&amp;gt;find&amp;lt;/code&amp;gt; command to try and find the files by name. For example, if you want to look at &amp;lt;code&amp;gt;Sample.bam&amp;lt;/code&amp;gt;, you can look for it like this:&lt;br /&gt;
&lt;br /&gt;
 find /path/to/the/jobstore -name &amp;quot;Sample.bam&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
You can take the &amp;lt;code&amp;gt;toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt; URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this:&lt;br /&gt;
&lt;br /&gt;
 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
Then you can take the part after the last colon, &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt;, and that is the path relative to the job store where this file can be found.&lt;br /&gt;
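&lt;br /&gt;
If you prefer, the decoding and splitting can be done at the command line; this sketch assumes &amp;lt;code&amp;gt;python3&amp;lt;/code&amp;gt; is available and reuses the example URI from above:&lt;br /&gt;
&lt;br /&gt;
```shell
URI='toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam'
# URL-decode the URI: %3A becomes a colon and %2F becomes a slash.
DECODED=$(python3 -c 'import sys, urllib.parse; print(urllib.parse.unquote(sys.argv[1]))' "$URI")
# The job-store-relative path is everything after the last colon.
echo "${DECODED##*:}"
```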
&lt;br /&gt;
==Using Development Versions of Toil==&lt;br /&gt;
&lt;br /&gt;
Sometimes, bugs will be fixed in the development version of Toil, but not released yet. To try the current development version of Toil, you can install it like this:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
If you want to use a particular branch or commit, like &amp;lt;code&amp;gt;aaa451b320fc115b3563ced25cb501301cf86f90&amp;lt;/code&amp;gt;, you can do:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
==Frequently Asked Questions==&lt;br /&gt;
&lt;br /&gt;
===I am getting warnings about &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt;===&lt;br /&gt;
You may be seeing warnings that &amp;lt;code&amp;gt;XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This happens because Slurm is providing Toil with an &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt; environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it.&lt;br /&gt;
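&lt;br /&gt;
You can check the condition yourself with a small Bash sketch (the &amp;lt;code&amp;gt;check_xdg&amp;lt;/code&amp;gt; helper is hypothetical, just for illustration):&lt;br /&gt;
&lt;br /&gt;
```shell
# Report whether a directory (like the one XDG_RUNTIME_DIR points at) exists.
check_xdg() {
    if [ -d "$1" ]; then
        echo 'exists'
    else
        echo 'missing'
    fi
}

# Check the value Slurm handed to this shell, if any.
check_xdg "${XDG_RUNTIME_DIR:-/nonexistent}"
```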
&lt;br /&gt;
===Toil said it was &amp;lt;code&amp;gt;Redirecting logging&amp;lt;/code&amp;gt; somewhere, but I can't find that file!===&lt;br /&gt;
The Toil worker process for each job will say that it is &amp;lt;code&amp;gt;Redirecting logging to /data/tmp/somewhere/worker_log.txt&amp;lt;/code&amp;gt;, and when running in single machine mode these messages go to the main Toil log.&lt;br /&gt;
&lt;br /&gt;
The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt; option to Toil.&lt;br /&gt;
&lt;br /&gt;
If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.&lt;br /&gt;
&lt;br /&gt;
=Additional WDL resources=&lt;br /&gt;
&lt;br /&gt;
For more information on writing and running WDL workflows, see:&lt;br /&gt;
&lt;br /&gt;
* [https://docs.openwdl.org/en/stable/ The WDL documentation]&lt;br /&gt;
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The &amp;quot;Learn WDL&amp;quot; video course on YouTube]&lt;br /&gt;
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=467</id>
		<title>Phoenix WDL Tutorial</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=467"/>
		<updated>2024-01-31T15:28:10Z</updated>

		<summary type="html">&lt;p&gt;Anovak: Turn of caching as is demanded&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Tutorial: Getting Started with WDL Workflows on Phoenix'''&lt;br /&gt;
&lt;br /&gt;
Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you.&lt;br /&gt;
&lt;br /&gt;
Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.&lt;br /&gt;
&lt;br /&gt;
This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.&lt;br /&gt;
&lt;br /&gt;
=Phoenix Cluster Setup=&lt;br /&gt;
&lt;br /&gt;
Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.&lt;br /&gt;
&lt;br /&gt;
==Getting VPN access==&lt;br /&gt;
&lt;br /&gt;
We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.&lt;br /&gt;
&lt;br /&gt;
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.&lt;br /&gt;
&lt;br /&gt;
==Connecting to Phoenix==&lt;br /&gt;
&lt;br /&gt;
Once you have VPN access, you can connect to the &amp;quot;head node&amp;quot; of the Phoenix cluster. This node is where everyone logs in, but you should '''not''' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs.&lt;br /&gt;
&lt;br /&gt;
To connect to the head node:&lt;br /&gt;
&lt;br /&gt;
1. Connect to the VPN.&lt;br /&gt;
2. SSH to &amp;lt;code&amp;gt;phoenix.prism&amp;lt;/code&amp;gt;. At the command line, run:&lt;br /&gt;
&lt;br /&gt;
 ssh phoenix.prism&lt;br /&gt;
&lt;br /&gt;
If your username on the cluster (say, &amp;lt;code&amp;gt;flastname&amp;lt;/code&amp;gt;) is different from your username on your computer (which might be &amp;lt;code&amp;gt;firstname&amp;lt;/code&amp;gt;), you might instead have to run:&lt;br /&gt;
&lt;br /&gt;
 ssh flastname@phoenix.prism&lt;br /&gt;
&lt;br /&gt;
The first time you connect, you will see a message like:&lt;br /&gt;
&lt;br /&gt;
 The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established.&lt;br /&gt;
 ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI.&lt;br /&gt;
 This key is not known by any other names&lt;br /&gt;
 Are you sure you want to continue connecting (yes/no/[fingerprint])?&lt;br /&gt;
&lt;br /&gt;
This is your computer asking you to help it decide if it is talking to the genuine &amp;lt;code&amp;gt;phoenix.prism&amp;lt;/code&amp;gt;, and not an imposter. You will want to make sure that the &amp;quot;key fingerprint&amp;quot; is indeed &amp;lt;code&amp;gt;SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI&amp;lt;/code&amp;gt;. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type &amp;lt;code&amp;gt;yes&amp;lt;/code&amp;gt; to accept and remember that the server is who it says it is.&lt;br /&gt;
&lt;br /&gt;
==Installing Toil with WDL support==&lt;br /&gt;
&lt;br /&gt;
Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
If you also want to use AWS S3 &amp;lt;code&amp;gt;s3://&amp;lt;/code&amp;gt; and/or Google &amp;lt;code&amp;gt;gs://&amp;lt;/code&amp;gt; URLs for data, you will need to also install Toil with the &amp;lt;code&amp;gt;aws&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;google&amp;lt;/code&amp;gt; extras, respectively:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
This will install Toil in the &amp;lt;code&amp;gt;.local&amp;lt;/code&amp;gt; directory inside your home directory, which we write as &amp;lt;code&amp;gt;~/.local&amp;lt;/code&amp;gt;. The program to run WDL workflows, &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, will be at &amp;lt;code&amp;gt;~/.local/bin/toil-wdl-runner&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
By default, the command interpreter '''will not''' look there, so if you type &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run:&lt;br /&gt;
&lt;br /&gt;
 echo 'export PATH=&amp;quot;${HOME}/.local/bin:${PATH}&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in''', to restart bash and pick up the change.&lt;br /&gt;
&lt;br /&gt;
To make sure it worked, you can run:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --help&lt;br /&gt;
&lt;br /&gt;
If everything worked correctly, it will print a long list of the various option flags that the &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; command supports.&lt;br /&gt;
&lt;br /&gt;
If you ever want to upgrade Toil to a new release, you can repeat the &amp;lt;code&amp;gt;pip&amp;lt;/code&amp;gt; command above.&lt;br /&gt;
&lt;br /&gt;
==Configuring your Phoenix Environment==&lt;br /&gt;
&lt;br /&gt;
'''Do not try to store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. Usually you would end up with &amp;lt;code&amp;gt;/private/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Remember this path; we will need it later.&lt;br /&gt;
&lt;br /&gt;
==Configuring Toil for Phoenix==&lt;br /&gt;
&lt;br /&gt;
Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, so we should configure Toil to use it for caching the Docker container images used for running workflow steps. Since these files can be large, and the home directory quota is only 30 GB, we can't keep them in your home directory. We will need to use the &amp;lt;code&amp;gt;/private/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt; directory you created earlier.&lt;br /&gt;
&lt;br /&gt;
Record that directory in your &amp;lt;code&amp;gt;~/.bashrc&amp;lt;/code&amp;gt; file by editing this command to use your actual group and user names, and then running it:&lt;br /&gt;
&lt;br /&gt;
 echo 'BIG_DATA_DIR=/private/groups/YOURGROUPNAME/YOURUSERNAME' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
Then use these commands to make sure that Toil knows where it ought to put its caches:&lt;br /&gt;
&lt;br /&gt;
 echo 'export SINGULARITY_CACHEDIR=&amp;quot;${BIG_DATA_DIR}/.singularity/cache&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE=&amp;quot;${BIG_DATA_DIR}/.cache/miniwdl&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in again''', to apply the changes.&lt;br /&gt;
&lt;br /&gt;
If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day].&lt;br /&gt;
&lt;br /&gt;
'''If you get errors about mutexes, lock files, or other weird problems with Singularity''', try moving these directories to inside &amp;lt;code&amp;gt;/data/tmp&amp;lt;/code&amp;gt; on the individual nodes, or unsetting them and letting Toil use its defaults (and exhaust our Docker pull limits). [https://github.com/DataBiosphere/toil/issues/4654 It is not clear that &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt; actually implements the necessary file locking correctly.]&lt;br /&gt;
&lt;br /&gt;
=Running an existing workflow=&lt;br /&gt;
&lt;br /&gt;
First, let's use &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].&lt;br /&gt;
&lt;br /&gt;
First, go to your user directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;, and make a directory to work in.&lt;br /&gt;
&lt;br /&gt;
 cd /private/groups/YOURGROUPNAME/YOURUSERNAME&lt;br /&gt;
 mkdir workflow-test&lt;br /&gt;
 cd workflow-test&lt;br /&gt;
&lt;br /&gt;
Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.&lt;br /&gt;
&lt;br /&gt;
 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl&lt;br /&gt;
&lt;br /&gt;
==Preparing an input file==&lt;br /&gt;
&lt;br /&gt;
Near the top of the WDL file, there's a section like this:&lt;br /&gt;
&lt;br /&gt;
 workflow hello_caller {&lt;br /&gt;
     input {&lt;br /&gt;
         File who&lt;br /&gt;
     }&lt;br /&gt;
&lt;br /&gt;
This means that there is a workflow named &amp;lt;code&amp;gt;hello_caller&amp;lt;/code&amp;gt; in this file, and it takes as input a file variable named &amp;lt;code&amp;gt;who&amp;lt;/code&amp;gt;. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.&lt;br /&gt;
&lt;br /&gt;
So first, we have to make that list of names. Let's make it in &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 echo &amp;quot;Mridula Resurrección&amp;quot; &amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Gershom Šarlota&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Ritchie Ravi&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
&lt;br /&gt;
Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt; that references it by relative path, like this:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./names.txt&amp;quot;}' &amp;gt;inputs.json&lt;br /&gt;
&lt;br /&gt;
Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].&lt;br /&gt;
&lt;br /&gt;
==Testing at small scale on a single machine==&lt;br /&gt;
&lt;br /&gt;
We are now ready to run the workflow!&lt;br /&gt;
&lt;br /&gt;
You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:&lt;br /&gt;
&lt;br /&gt;
 srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i&lt;br /&gt;
&lt;br /&gt;
This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In your new shell, run this Toil command:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner self_test.wdl inputs.json -o local_run&lt;br /&gt;
&lt;br /&gt;
This will, by default, use the &amp;lt;code&amp;gt;single_machine&amp;lt;/code&amp;gt; Toil &amp;quot;batch system&amp;quot; to run all of the workflow's tasks locally. Output will be sent to a new directory named &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
This will print a lot of logging to standard error, and to standard output it will print:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;hello_caller.message_files&amp;quot;: [&amp;quot;local_run/Mridula Resurrecci\u00f3n.txt&amp;quot;, &amp;quot;local_run/Gershom \u0160arlota.txt&amp;quot;, &amp;quot;local_run/Ritchie Ravi.txt&amp;quot;], &amp;quot;hello_caller.messages&amp;quot;: [&amp;quot;Hello, Mridula Resurrecci\u00f3n!&amp;quot;, &amp;quot;Hello, Gershom \u0160arlota!&amp;quot;, &amp;quot;Hello, Ritchie Ravi!&amp;quot;]}&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt; directory will contain the described text files (with Unicode escape sequences like &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt; replaced by their corresponding characters), each containing a greeting for the corresponding person.&lt;br /&gt;
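&lt;br /&gt;
Those escape sequences are ordinary JSON string encoding; for example, you can decode one with &amp;lt;code&amp;gt;python3&amp;lt;/code&amp;gt; (assuming it is installed):&lt;br /&gt;
&lt;br /&gt;
```shell
# JSON uses \uXXXX escapes for non-ASCII characters; Python can decode them.
python3 -c 'print("Mridula Resurrecci\u00f3n")'
```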
&lt;br /&gt;
To leave your interactive Slurm session and return to the head node, use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Running at larger scale==&lt;br /&gt;
&lt;br /&gt;
Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people!&lt;br /&gt;
&lt;br /&gt;
Go get this handy list of people and cut it to length:&lt;br /&gt;
&lt;br /&gt;
 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt&lt;br /&gt;
 head -n100 1000_names.txt &amp;gt;100_names.txt&lt;br /&gt;
&lt;br /&gt;
And make a new inputs file:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./100_names.txt&amp;quot;}' &amp;gt;inputs_big.json&lt;br /&gt;
&lt;br /&gt;
Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.&lt;br /&gt;
&lt;br /&gt;
To run against the Slurm cluster, we need to use the &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. Until Toil [https://github.com/DataBiosphere/toil/issues/4775 gets support for data file caching on Slurm], we will also need the &amp;lt;code&amp;gt;--caching false&amp;lt;/code&amp;gt; option. We will add the &amp;lt;code&amp;gt;--batchLogsDir&amp;lt;/code&amp;gt; option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the &amp;lt;code&amp;gt;-m&amp;lt;/code&amp;gt; option to save the output JSON to a file instead of printing it.&lt;br /&gt;
&lt;br /&gt;
Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the &amp;lt;code&amp;gt;TOIL_SLURM_ARGS&amp;lt;/code&amp;gt; environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in.&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --caching false --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json&lt;br /&gt;
&lt;br /&gt;
This will tick for a while, but eventually you should end up with 100 greeting files in the &amp;lt;code&amp;gt;slurm_run&amp;lt;/code&amp;gt; directory.&lt;br /&gt;
&lt;br /&gt;
=Writing your own workflow=&lt;br /&gt;
&lt;br /&gt;
In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. &lt;br /&gt;
&lt;br /&gt;
==Writing the file==&lt;br /&gt;
&lt;br /&gt;
===Version===&lt;br /&gt;
&lt;br /&gt;
All WDL files need to start with a &amp;lt;code&amp;gt;version&amp;lt;/code&amp;gt; statement (unless they are very old &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; files). Toil supports &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt;, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; and 1.0.&lt;br /&gt;
&lt;br /&gt;
So let's start a new WDL 1.0 workflow. Open up a file named &amp;lt;code&amp;gt;fizzbuzz.wdl&amp;lt;/code&amp;gt; and start with a version statement:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
&lt;br /&gt;
===Workflow Block===&lt;br /&gt;
&lt;br /&gt;
Then, add an empty &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; named &amp;lt;code&amp;gt;FizzBuzz&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Input Block===&lt;br /&gt;
&lt;br /&gt;
Workflows usually need some kind of user input, so let's give our workflow an &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; section.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Notice that each input has a type, a name, and an optional default value. If the type ends in &amp;lt;code&amp;gt;?&amp;lt;/code&amp;gt;, the value is optional, and it may be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.&lt;br /&gt;
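&lt;br /&gt;
With these declarations, only &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt; has to be given in the inputs file; the others have defaults or are optional. A minimal inputs file for this workflow could look like:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;FizzBuzz.item_count&amp;quot;: 15}&lt;br /&gt;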
&lt;br /&gt;
===Body===&lt;br /&gt;
&lt;br /&gt;
Now we'll start on the body of the workflow, to be inserted just after the inputs section. &lt;br /&gt;
&lt;br /&gt;
The first thing we're going to need to do is create an array of all the numbers up to the &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;. We can do this by calling the WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; function, and assigning the result to an &amp;lt;code&amp;gt;Array[Int]&amp;lt;/code&amp;gt; variable.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
&lt;br /&gt;
WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.&lt;br /&gt;
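&lt;br /&gt;
For example, since counting starts at zero:&lt;br /&gt;
&lt;br /&gt;
 # range(3) evaluates to [0, 1, 2]&lt;br /&gt;
 Array[Int] small = range(3)&lt;br /&gt;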
&lt;br /&gt;
===Scattering===&lt;br /&gt;
&lt;br /&gt;
Once we create an array of all the numbers, we can use a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt; to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a &amp;lt;code&amp;gt;map()&amp;lt;/code&amp;gt; in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; starts at 0.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
     Int one_based = i + 1&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Conditionals===&lt;br /&gt;
&lt;br /&gt;
Inside the body of the scatter, we are going to put some conditionals to determine if we should produce &amp;lt;code&amp;gt;&amp;quot;Fizz&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;Buzz&amp;quot;&amp;lt;/code&amp;gt;, or &amp;lt;code&amp;gt;&amp;quot;FizzBuzz&amp;quot;&amp;lt;/code&amp;gt;. To support our &amp;lt;code&amp;gt;fizzbuzz_override&amp;lt;/code&amp;gt;, we use an array of it and a default value, and use the WDL &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; function to find the first non-null value in that array.&lt;br /&gt;
&lt;br /&gt;
Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; at the end and take advantage of variables from un-executed conditionals being &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Note that WDL supports conditional ''expressions'' with a &amp;lt;code&amp;gt;then&amp;lt;/code&amp;gt; and an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt;, but conditional ''statements'' only have a body, not an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt; branch. If you need an else you will have to check the negated condition.&lt;br /&gt;
&lt;br /&gt;
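Here is a rough Python sketch of what &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; does, with &amp;lt;code&amp;gt;None&amp;lt;/code&amp;gt; standing in for WDL &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt; (an illustration, not Toil's actual implementation):&lt;br /&gt;
&lt;br /&gt;
```python
def select_first(values):
    # Like WDL select_first(): return the first non-null value,
    # failing if every value is null.
    for value in values:
        if value is not None:
            return value
    raise ValueError('all values were null')

# With no override set, the hardcoded default wins:
print(select_first([None, 'FizzBuzz']))    # FizzBuzz
# An override, when present, comes earlier in the list and is selected:
print(select_first(['Jazz', 'FizzBuzz']))  # Jazz
```
&lt;br /&gt;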
So first, let's handle the special cases.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Calling Tasks===&lt;br /&gt;
&lt;br /&gt;
Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a &amp;lt;code&amp;gt;${}&amp;lt;/code&amp;gt; substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a &amp;lt;code&amp;gt;stringify_number&amp;lt;/code&amp;gt; task, to be written later.&lt;br /&gt;
&lt;br /&gt;
To call a task (or another workflow), we use a &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt; statement and give it some inputs. Then we can fish the output values out of the task with &amp;lt;code&amp;gt;.&amp;lt;/code&amp;gt; access. But the call only runs for numbers where we didn't make a noise instead, so for the other numbers its outputs will be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
        call stringify_number {&lt;br /&gt;
            input:&lt;br /&gt;
                the_number = one_based&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We can put the code into the workflow now, and set about writing the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         &lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Writing Tasks===&lt;br /&gt;
&lt;br /&gt;
Our task should go after the workflow in the file. It looks a lot like a workflow except it uses &amp;lt;code&amp;gt;task&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're going to want it to take in an integer &amp;lt;code&amp;gt;the_number&amp;lt;/code&amp;gt;, and we're going to want it to output a string &amp;lt;code&amp;gt;the_string&amp;lt;/code&amp;gt;. So let's fill those in with &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; sections.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     # ???&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now, unlike workflows, tasks can have a &amp;lt;code&amp;gt;command&amp;lt;/code&amp;gt; section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use &amp;lt;code&amp;gt;~{}&amp;lt;/code&amp;gt;, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now we need to capture the result of the command script. The WDL &amp;lt;code&amp;gt;stdout()&amp;lt;/code&amp;gt; returns a WDL &amp;lt;code&amp;gt;File&amp;lt;/code&amp;gt; containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
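The newline trimming that &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; does can be sketched in Python (a hypothetical helper illustrating the behavior, not Toil's implementation):&lt;br /&gt;
&lt;br /&gt;
```python
import os
import tempfile

def read_string(path):
    # Like WDL read_string(): read a whole file as one string,
    # dropping any trailing newlines.
    with open(path) as handle:
        return handle.read().rstrip('\n')

# Simulate what `echo 42` would leave on standard output:
with tempfile.NamedTemporaryFile('w', delete=False) as handle:
    handle.write('42\n')
print(read_string(handle.name))  # 42
os.remove(handle.name)
```
&lt;br /&gt;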
We're also going to want to add a &amp;lt;code&amp;gt;runtime&amp;lt;/code&amp;gt; section to our task, to specify resource requirements. We'll also tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;disks&amp;lt;/code&amp;gt; entry is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a &amp;lt;code&amp;gt;local-disk&amp;lt;/code&amp;gt; of a certain number of gigabytes, optionally suggesting that it be &amp;lt;code&amp;gt;SSD&amp;lt;/code&amp;gt; storage.&lt;br /&gt;
&lt;br /&gt;
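As a rough sketch of how such a string breaks down (a hypothetical parser for illustration, not code from Toil or Cromwell):&lt;br /&gt;
&lt;br /&gt;
```python
def parse_disks(spec):
    # Split a Cromwell-style disks string like 'local-disk 1 SSD'
    # into its mount name, size in gigabytes, and storage type.
    mount, size_gb, disk_type = spec.split()
    return mount, int(size_gb), disk_type

print(parse_disks('local-disk 1 SSD'))  # ('local-disk', 1, 'SSD')
```
&lt;br /&gt;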
Then we can put our task into our WDL file:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Output Block===&lt;br /&gt;
&lt;br /&gt;
Now the only thing missing is a workflow-level &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section. Technically, in WDL 1.0 you aren't supposed to need one, but WDL 1.1 requires it, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to write one. We need to collect all the strings that came out of the different tasks in our scatter into an &amp;lt;code&amp;gt;Array[String]&amp;lt;/code&amp;gt;. We'll add the &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section at the end of the &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; section, above the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
     output {&lt;br /&gt;
        Array[String] fizzbuzz_results = result&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Because the &amp;lt;code&amp;gt;result&amp;lt;/code&amp;gt; variable is defined inside a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt;, when we reference it outside the scatter we see it as being an array.&lt;br /&gt;
&lt;br /&gt;
==Running the Workflow==&lt;br /&gt;
&lt;br /&gt;
Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;FizzBuzz.item_count&amp;quot;: 20}' &amp;gt;fizzbuzz.json&lt;br /&gt;
&lt;br /&gt;
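Once the inputs get more complicated than one key, it can be easier to generate that JSON than to quote it by hand in the shell. A Python sketch (note that every key is namespaced by the workflow name; the override value here is made up):&lt;br /&gt;
&lt;br /&gt;
```python
import json

# Keys are namespaced as WorkflowName.input_name; optional inputs and
# inputs with defaults can simply be left out.
inputs = {
    'FizzBuzz.item_count': 20,
    'FizzBuzz.fizzbuzz_override': 'FizzBuzzBoss',  # made-up override value
}
with open('fizzbuzz.json', 'w') as out:
    json.dump(inputs, out)
print(open('fizzbuzz.json').read())
```
&lt;br /&gt;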
Then run it on the cluster with Toil:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --caching false --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
Or locally:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
=Debugging Workflows=&lt;br /&gt;
Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.&lt;br /&gt;
&lt;br /&gt;
==Debugging Options==&lt;br /&gt;
When debugging a workflow, make sure to run it with &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt;, to set the log level to &amp;lt;code&amp;gt;DEBUG&amp;lt;/code&amp;gt;, and with &amp;lt;code&amp;gt;--jobStore /some/path/to/a/shared/directory/it/can/create&amp;lt;/code&amp;gt;, so that the stored files shipped between jobs end up somewhere you can access them.&lt;br /&gt;
&lt;br /&gt;
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:&lt;br /&gt;
&lt;br /&gt;
 =========&amp;gt;&lt;br /&gt;
         Toil job log is here&lt;br /&gt;
 &amp;lt;=========&lt;br /&gt;
&lt;br /&gt;
Normally, only the logs of failing jobs and the output of commands run from WDL are reproduced like this.&lt;br /&gt;
&lt;br /&gt;
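If you need to pull those per-job logs out of a big main log programmatically, something like this Python sketch (a hypothetical helper, not part of Toil) works on the marker format above:&lt;br /&gt;
&lt;br /&gt;
```python
def job_logs(log_text):
    # Collect the lines between the start and end markers shown above.
    start_marker = '=' * 9 + '>'    # nine equals signs, then a greater-than
    end_marker = chr(60) + '=' * 9  # a less-than (chr(60)), then nine equals signs
    blocks, current = [], None
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped == start_marker:
            current = []
        elif stripped == end_marker and current is not None:
            blocks.append('\n'.join(current))
            current = None
        elif current is not None:
            current.append(stripped)
    return blocks

example = ' =========>\n\tToil job log is here\n' + chr(60) + '=========\n'
print(job_logs(example))  # ['Toil job log is here']
```
&lt;br /&gt;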
==Reading the Log==&lt;br /&gt;
&lt;br /&gt;
When a WDL workflow fails, you are likely to see a message like this:&lt;br /&gt;
&lt;br /&gt;
 WDL.runtime.error.CommandFailed: task command failed with exit status 1&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism&lt;br /&gt;
&lt;br /&gt;
This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code. That happens either when the command itself is written wrong, or when the tool you are trying to run detects a problem and reports an error.&lt;br /&gt;
&lt;br /&gt;
Go up higher in the log until you find lines that look like:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt:&lt;br /&gt;
&lt;br /&gt;
And&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt:&lt;br /&gt;
&lt;br /&gt;
These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.&lt;br /&gt;
&lt;br /&gt;
==Reproducing Problems==&lt;br /&gt;
&lt;br /&gt;
When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. In the log of your failing Toil task, look for lines like this:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&amp;lt;/code&amp;gt; part is a Toil file ID, and it is a relative path from your &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; value to where the file is stored on disk. So if you ran the workflow with &amp;lt;code&amp;gt;--jobStore /private/groups/patenlab/anovak/jobstore&amp;lt;/code&amp;gt;, you would look for this file at:&lt;br /&gt;
&lt;br /&gt;
 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&lt;br /&gt;
&lt;br /&gt;
==More Ways of Finding Files==&lt;br /&gt;
&lt;br /&gt;
Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the &amp;lt;code&amp;gt;find&amp;lt;/code&amp;gt; command to try and find the files by name. For example, if you want to look at &amp;lt;code&amp;gt;Sample.bam&amp;lt;/code&amp;gt;, you can look for it like this:&lt;br /&gt;
&lt;br /&gt;
 find /path/to/the/jobstore -name &amp;quot;Sample.bam&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
You can take the &amp;lt;code&amp;gt;toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt; URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this:&lt;br /&gt;
&lt;br /&gt;
 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
Then you can take the part after the last colon, &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt;, and that is the path relative to the job store where this file can be found.&lt;br /&gt;
&lt;br /&gt;
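You don't need a website for the URL-decoding step; Python's standard library can do it. For example, with the URI from the log line above:&lt;br /&gt;
&lt;br /&gt;
```python
from urllib.parse import unquote

# The toilfile: URI exactly as it appeared in the log line above.
uri = ('toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob'
       '%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351'
       '%2FSample.chr14.bam/Sample.chr14.bam')
decoded = unquote(uri)
# Everything after the last colon is the path relative to the job store.
relative_path = decoded.rsplit(':', 1)[1]
print(relative_path)  # the job-store-relative path, ending in Sample.chr14.bam
```
&lt;br /&gt;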
==Using Development Versions of Toil==&lt;br /&gt;
&lt;br /&gt;
Sometimes, bugs will be fixed in the development version of Toil, but not released yet. To try the current development version of Toil, you can install it like this:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
If you want to use a particular branch or commit, like &amp;lt;code&amp;gt;aaa451b320fc115b3563ced25cb501301cf86f90&amp;lt;/code&amp;gt;, you can do:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
==Frequently Asked Questions==&lt;br /&gt;
&lt;br /&gt;
===I am getting warnings about &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt;===&lt;br /&gt;
You may be seeing warnings that &amp;lt;code&amp;gt;XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This happens because Slurm is providing Toil with an &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt; environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it.&lt;br /&gt;
&lt;br /&gt;
===Toil said it was &amp;lt;code&amp;gt;Redirecting logging&amp;lt;/code&amp;gt; somewhere, but I can't find that file!===&lt;br /&gt;
The Toil worker process for each job will say that it is &amp;lt;code&amp;gt;Redirecting logging to /data/tmp/somewhere/worker_log.txt&amp;lt;/code&amp;gt;, and when running in single machine mode these messages go to the main Toil log.&lt;br /&gt;
&lt;br /&gt;
The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt; option to Toil.&lt;br /&gt;
&lt;br /&gt;
If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.&lt;br /&gt;
&lt;br /&gt;
=Additional WDL resources=&lt;br /&gt;
&lt;br /&gt;
For more information on writing and running WDL workflows, see:&lt;br /&gt;
&lt;br /&gt;
* [https://docs.openwdl.org/en/stable/ The WDL documentation]&lt;br /&gt;
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The &amp;quot;Learn WDL&amp;quot; video course on YouTube]&lt;br /&gt;
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=466</id>
		<title>Phoenix WDL Tutorial</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=466"/>
		<updated>2024-01-31T15:24:57Z</updated>

		<summary type="html">&lt;p&gt;Anovak: Fix groups directory path&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Tutorial: Getting Started with WDL Workflows on Phoenix'''&lt;br /&gt;
&lt;br /&gt;
Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like the Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you.&lt;br /&gt;
&lt;br /&gt;
Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.&lt;br /&gt;
&lt;br /&gt;
This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.&lt;br /&gt;
&lt;br /&gt;
=Phoenix Cluster Setup=&lt;br /&gt;
&lt;br /&gt;
Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.&lt;br /&gt;
&lt;br /&gt;
==Getting VPN access==&lt;br /&gt;
&lt;br /&gt;
We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.&lt;br /&gt;
&lt;br /&gt;
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.&lt;br /&gt;
&lt;br /&gt;
==Connecting to Phoenix==&lt;br /&gt;
&lt;br /&gt;
Once you have VPN access, you can connect to the &amp;quot;head node&amp;quot; of the Phoenix cluster. This node is where everyone logs in, but you should ''not'' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs.&lt;br /&gt;
&lt;br /&gt;
To connect to the head node:&lt;br /&gt;
&lt;br /&gt;
1. Connect to the VPN.&lt;br /&gt;
2. SSH to &amp;lt;code&amp;gt;phoenix.prism&amp;lt;/code&amp;gt;. At the command line, run:&lt;br /&gt;
&lt;br /&gt;
 ssh phoenix.prism&lt;br /&gt;
&lt;br /&gt;
If your username on the cluster (say, &amp;lt;code&amp;gt;flastname&amp;lt;/code&amp;gt;) is different than your username on your computer (which might be &amp;lt;code&amp;gt;firstname&amp;lt;/code&amp;gt;), you might instead have to run:&lt;br /&gt;
&lt;br /&gt;
 ssh flastname@phoenix.prism&lt;br /&gt;
&lt;br /&gt;
The first time you connect, you will see a message like:&lt;br /&gt;
&lt;br /&gt;
 The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established.&lt;br /&gt;
 ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI.&lt;br /&gt;
 This key is not known by any other names&lt;br /&gt;
 Are you sure you want to continue connecting (yes/no/[fingerprint])?&lt;br /&gt;
&lt;br /&gt;
This is your computer asking you to help it decide if it is talking to the genuine &amp;lt;code&amp;gt;phoenix.prism&amp;lt;/code&amp;gt;, and not an imposter. You will want to make sure that the &amp;quot;key fingerprint&amp;quot; is indeed &amp;lt;code&amp;gt;SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI&amp;lt;/code&amp;gt;. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type &amp;lt;code&amp;gt;yes&amp;lt;/code&amp;gt; to accept and remember that the server is who it says it is.&lt;br /&gt;
&lt;br /&gt;
==Installing Toil with WDL support==&lt;br /&gt;
&lt;br /&gt;
Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
If you also want to use AWS S3 &amp;lt;code&amp;gt;s3://&amp;lt;/code&amp;gt; and/or Google &amp;lt;code&amp;gt;gs://&amp;lt;/code&amp;gt; URLs for data, you will also need to install Toil with the &amp;lt;code&amp;gt;aws&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;google&amp;lt;/code&amp;gt; extras, respectively:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
This will install Toil in the &amp;lt;code&amp;gt;.local&amp;lt;/code&amp;gt; directory inside your home directory, which we write as &amp;lt;code&amp;gt;~/.local&amp;lt;/code&amp;gt;. The program to run WDL workflows, &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, will be at &amp;lt;code&amp;gt;~/.local/bin/toil-wdl-runner&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
By default, the command interpreter ''will not'' look there, so if you type &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, it will complain that the command is not found. To fix this, configure the command interpreter (bash) to look where Toil is installed by running:&lt;br /&gt;
&lt;br /&gt;
 echo 'export PATH=&amp;quot;${HOME}/.local/bin:${PATH}&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in''', to restart bash and pick up the change.&lt;br /&gt;
&lt;br /&gt;
To make sure it worked, you can run:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --help&lt;br /&gt;
&lt;br /&gt;
If everything worked correctly, it will print a long list of the various option flags that the &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; command supports.&lt;br /&gt;
&lt;br /&gt;
If you ever want to upgrade Toil to a new release, you can repeat the &amp;lt;code&amp;gt;pip&amp;lt;/code&amp;gt; command above.&lt;br /&gt;
&lt;br /&gt;
==Configuring your Phoenix Environment==&lt;br /&gt;
&lt;br /&gt;
'''Do not try to store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should live in a directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;. Usually you would end up with &amp;lt;code&amp;gt;/private/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Remember this path; we will need it later.&lt;br /&gt;
&lt;br /&gt;
==Configuring Toil for Phoenix==&lt;br /&gt;
&lt;br /&gt;
Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. On the Phoenix cluster, however, we do have a shared filesystem, so we should configure Toil to use it for caching the Docker container images used for running workflow steps. Since these images can be large, and the home directory quota is only 30 GB, we can't keep them in your home directory. We will need to use the &amp;lt;code&amp;gt;/private/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt; directory you created earlier.&lt;br /&gt;
&lt;br /&gt;
Record that directory in your &amp;lt;code&amp;gt;~/.bashrc&amp;lt;/code&amp;gt; file by running this command (after replacing &amp;lt;code&amp;gt;YOURGROUPNAME&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;YOURUSERNAME&amp;lt;/code&amp;gt; with your actual group and user names):&lt;br /&gt;
&lt;br /&gt;
 echo 'BIG_DATA_DIR=/private/groups/YOURGROUPNAME/YOURUSERNAME' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
Then use these commands to make sure that Toil knows where it ought to put its caches:&lt;br /&gt;
&lt;br /&gt;
 echo 'export SINGULARITY_CACHEDIR=&amp;quot;${BIG_DATA_DIR}/.singularity/cache&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE=&amp;quot;${BIG_DATA_DIR}/.cache/miniwdl&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in again''', to apply the changes.&lt;br /&gt;
&lt;br /&gt;
If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day].&lt;br /&gt;
&lt;br /&gt;
'''If you get errors about mutexes, lock files, or other weird problems with Singularity''', try moving these directories to inside &amp;lt;code&amp;gt;/data/tmp&amp;lt;/code&amp;gt; on the individual nodes, or unsetting them and letting Toil use its defaults (and exhaust our Docker pull limits). [https://github.com/DataBiosphere/toil/issues/4654 It is not clear that &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt; actually implements the necessary file locking correctly.]&lt;br /&gt;
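&lt;br /&gt;
For example, one way to move the caches to node-local scratch space (assuming a per-user directory under &amp;lt;code&amp;gt;/data/tmp&amp;lt;/code&amp;gt; is acceptable on your nodes; this exact layout is just a sketch) would be:&lt;br /&gt;
&lt;br /&gt;
 export SINGULARITY_CACHEDIR=&amp;quot;/data/tmp/${USER}/.singularity/cache&amp;quot;&lt;br /&gt;
 export MINIWDL__SINGULARITY__IMAGE_CACHE=&amp;quot;/data/tmp/${USER}/.cache/miniwdl&amp;quot;&lt;br /&gt;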
&lt;br /&gt;
=Running an existing workflow=&lt;br /&gt;
&lt;br /&gt;
First, let's use &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].&lt;br /&gt;
&lt;br /&gt;
To start, go to your user directory under &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt;, and make a directory to work in.&lt;br /&gt;
&lt;br /&gt;
 cd /private/groups/YOURGROUPNAME/YOURUSERNAME&lt;br /&gt;
 mkdir workflow-test&lt;br /&gt;
 cd workflow-test&lt;br /&gt;
&lt;br /&gt;
Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.&lt;br /&gt;
&lt;br /&gt;
 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl&lt;br /&gt;
&lt;br /&gt;
==Preparing an input file==&lt;br /&gt;
&lt;br /&gt;
Near the top of the WDL file, there's a section like this:&lt;br /&gt;
&lt;br /&gt;
 workflow hello_caller {&lt;br /&gt;
     input {&lt;br /&gt;
         File who&lt;br /&gt;
     }&lt;br /&gt;
&lt;br /&gt;
This means that there is a workflow named &amp;lt;code&amp;gt;hello_caller&amp;lt;/code&amp;gt; in this file, and it takes as input a file variable named &amp;lt;code&amp;gt;who&amp;lt;/code&amp;gt;. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.&lt;br /&gt;
&lt;br /&gt;
So first, we have to make that list of names. Let's make it in &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 echo &amp;quot;Mridula Resurrección&amp;quot; &amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Gershom Šarlota&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Ritchie Ravi&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
&lt;br /&gt;
Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt; that references it by relative path, like this:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./names.txt&amp;quot;}' &amp;gt;inputs.json&lt;br /&gt;
&lt;br /&gt;
Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].&lt;br /&gt;
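&lt;br /&gt;
For comparison, these hypothetical inputs files would point the same input at an absolute path or at a URL (the URL here is illustrative, not a real file):&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;hello_caller.who&amp;quot;: &amp;quot;/private/groups/YOURGROUPNAME/YOURUSERNAME/workflow-test/names.txt&amp;quot;}&lt;br /&gt;
 {&amp;quot;hello_caller.who&amp;quot;: &amp;quot;https://example.com/names.txt&amp;quot;}&lt;br /&gt;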
&lt;br /&gt;
==Testing at small scale on a single machine==&lt;br /&gt;
&lt;br /&gt;
We are now ready to run the workflow!&lt;br /&gt;
&lt;br /&gt;
You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:&lt;br /&gt;
&lt;br /&gt;
 srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i&lt;br /&gt;
&lt;br /&gt;
This will start a new shell on a worker node that can run for up to 2 hours.&lt;br /&gt;
&lt;br /&gt;
In your new shell, run this Toil command:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner self_test.wdl inputs.json -o local_run&lt;br /&gt;
&lt;br /&gt;
This will, by default, use the &amp;lt;code&amp;gt;single_machine&amp;lt;/code&amp;gt; Toil &amp;quot;batch system&amp;quot; to run all of the workflow's tasks locally. Output will be sent to a new directory named &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
This will print a lot of logging to standard error, and to standard output it will print:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;hello_caller.message_files&amp;quot;: [&amp;quot;local_run/Mridula Resurrecci\u00f3n.txt&amp;quot;, &amp;quot;local_run/Gershom \u0160arlota.txt&amp;quot;, &amp;quot;local_run/Ritchie Ravi.txt&amp;quot;], &amp;quot;hello_caller.messages&amp;quot;: [&amp;quot;Hello, Mridula Resurrecci\u00f3n!&amp;quot;, &amp;quot;Hello, Gershom \u0160arlota!&amp;quot;, &amp;quot;Hello, Ritchie Ravi!&amp;quot;]}&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt; directory will contain the described text files (with Unicode escape sequences like &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt; replaced by their corresponding characters), each containing a greeting for the corresponding person.&lt;br /&gt;
&lt;br /&gt;
To leave your interactive Slurm session and return to the head node, use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Running at larger scale==&lt;br /&gt;
&lt;br /&gt;
Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people!&lt;br /&gt;
&lt;br /&gt;
Go get this handy list of people and cut it to length:&lt;br /&gt;
&lt;br /&gt;
 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt&lt;br /&gt;
 head -n100 1000_names.txt &amp;gt;100_names.txt&lt;br /&gt;
&lt;br /&gt;
And make a new inputs file:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./100_names.txt&amp;quot;}' &amp;gt;inputs_big.json&lt;br /&gt;
&lt;br /&gt;
Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.&lt;br /&gt;
&lt;br /&gt;
To run against the Slurm cluster, we need to use the &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the &amp;lt;code&amp;gt;--batchLogsDir&amp;lt;/code&amp;gt; option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the &amp;lt;code&amp;gt;-m&amp;lt;/code&amp;gt; option to save the output JSON to a file instead of printing it.&lt;br /&gt;
&lt;br /&gt;
Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the &amp;lt;code&amp;gt;TOIL_SLURM_ARGS&amp;lt;/code&amp;gt; environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in.&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json&lt;br /&gt;
&lt;br /&gt;
This will tick for a while, but eventually you should end up with 100 greeting files in the &amp;lt;code&amp;gt;slurm_run&amp;lt;/code&amp;gt; directory.&lt;br /&gt;
&lt;br /&gt;
=Writing your own workflow=&lt;br /&gt;
&lt;br /&gt;
In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. &lt;br /&gt;
&lt;br /&gt;
==Writing the file==&lt;br /&gt;
&lt;br /&gt;
===Version===&lt;br /&gt;
&lt;br /&gt;
All WDL files need to start with a &amp;lt;code&amp;gt;version&amp;lt;/code&amp;gt; statement (unless they are very old &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; files). Toil supports &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt;, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; and 1.0.&lt;br /&gt;
&lt;br /&gt;
So let's start a new WDL 1.0 workflow. Open up a file named &amp;lt;code&amp;gt;fizzbuzz.wdl&amp;lt;/code&amp;gt; and start with a version statement:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
&lt;br /&gt;
===Workflow Block===&lt;br /&gt;
&lt;br /&gt;
Then, add an empty &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; named &amp;lt;code&amp;gt;FizzBuzz&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Input Block===&lt;br /&gt;
&lt;br /&gt;
Workflows usually need some kind of user input, so let's give our workflow an &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; section.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Notice that each input has a type, a name, and an optional default value. If the type ends in &amp;lt;code&amp;gt;?&amp;lt;/code&amp;gt;, the value is optional, and it may be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.&lt;br /&gt;
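&lt;br /&gt;
For example, an inputs file for this workflow could set just the required input, or could also override a default and fill in the optional value (these particular values are just illustrations):&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;FizzBuzz.item_count&amp;quot;: 15}&lt;br /&gt;
 {&amp;quot;FizzBuzz.item_count&amp;quot;: 15, &amp;quot;FizzBuzz.to_fizz&amp;quot;: 7, &amp;quot;FizzBuzz.fizzbuzz_override&amp;quot;: &amp;quot;FizzBuzz!&amp;quot;}&lt;br /&gt;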
&lt;br /&gt;
===Body===&lt;br /&gt;
&lt;br /&gt;
Now we'll start on the body of the workflow, to be inserted just after the inputs section. &lt;br /&gt;
&lt;br /&gt;
The first thing we're going to need to do is create an array of all the numbers up to the &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;. We can do this by calling the WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; function, and assigning the result to an &amp;lt;code&amp;gt;Array[Int]&amp;lt;/code&amp;gt; variable.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
&lt;br /&gt;
WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.&lt;br /&gt;
&lt;br /&gt;
===Scattering===&lt;br /&gt;
&lt;br /&gt;
Once we create an array of all the numbers, we can use a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt; to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a &amp;lt;code&amp;gt;map()&amp;lt;/code&amp;gt; in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; starts at 0.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
     Int one_based = i + 1&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Conditionals===&lt;br /&gt;
&lt;br /&gt;
Inside the body of the scatter, we are going to put some conditionals to determine if we should produce &amp;lt;code&amp;gt;&amp;quot;Fizz&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;Buzz&amp;quot;&amp;lt;/code&amp;gt;, or &amp;lt;code&amp;gt;&amp;quot;FizzBuzz&amp;quot;&amp;lt;/code&amp;gt;. To support our &amp;lt;code&amp;gt;fizzbuzz_override&amp;lt;/code&amp;gt;, we use an array of it and a default value, and use the WDL &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; function to find the first non-null value in that array.&lt;br /&gt;
&lt;br /&gt;
Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; at the end and take advantage of variables from un-executed conditionals being &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Note that WDL supports conditional ''expressions'' with a &amp;lt;code&amp;gt;then&amp;lt;/code&amp;gt; and an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt;, but conditional ''statements'' only have a body, not an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt; branch. If you need an else you will have to check the negated condition.&lt;br /&gt;
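&lt;br /&gt;
For example, this is a conditional ''expression'', with both a &amp;lt;code&amp;gt;then&amp;lt;/code&amp;gt; and an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt; (the variable name here is just an illustration):&lt;br /&gt;
&lt;br /&gt;
 String parity = if (one_based % 2 == 0) then &amp;quot;even&amp;quot; else &amp;quot;odd&amp;quot;&lt;br /&gt;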
&lt;br /&gt;
So first, let's handle the special cases.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Calling Tasks===&lt;br /&gt;
&lt;br /&gt;
Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a &amp;lt;code&amp;gt;${}&amp;lt;/code&amp;gt; substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a &amp;lt;code&amp;gt;stringify_number&amp;lt;/code&amp;gt; task, to be written later.&lt;br /&gt;
&lt;br /&gt;
To call a task (or another workflow), we use a &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt; statement and give it some inputs. Then we can fish the output values out of the task with &amp;lt;code&amp;gt;.&amp;lt;/code&amp;gt; access.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
        call stringify_number {&lt;br /&gt;
            input:&lt;br /&gt;
                the_number = one_based&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We can put the code into the workflow now, and set about writing the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         &lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Writing Tasks===&lt;br /&gt;
&lt;br /&gt;
Our task should go after the workflow in the file. It looks a lot like a workflow except it uses &amp;lt;code&amp;gt;task&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're going to want it to take in an integer &amp;lt;code&amp;gt;the_number&amp;lt;/code&amp;gt;, and we're going to want it to output a string &amp;lt;code&amp;gt;the_string&amp;lt;/code&amp;gt;. So let's fill that in in &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; sections.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     # ???&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now, unlike workflows, tasks can have a &amp;lt;code&amp;gt;command&amp;lt;/code&amp;gt; section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use &amp;lt;code&amp;gt;~{}&amp;lt;/code&amp;gt;, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now we need to capture the result of the command script. The WDL &amp;lt;code&amp;gt;stdout()&amp;lt;/code&amp;gt; returns a WDL &amp;lt;code&amp;gt;File&amp;lt;/code&amp;gt; containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're also going to add a &amp;lt;code&amp;gt;runtime&amp;lt;/code&amp;gt; section to our task, to specify resource requirements, and to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;disks&amp;lt;/code&amp;gt; section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that request a &amp;lt;code&amp;gt;local-disk&amp;lt;/code&amp;gt; of a certain number of gigabytes, with an optional hint that it should be &amp;lt;code&amp;gt;SSD&amp;lt;/code&amp;gt; storage.&lt;br /&gt;
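&lt;br /&gt;
For example, a task that needs more scratch space could ask for 20 gigabytes of SSD-backed local disk like this (the size here is just an illustration):&lt;br /&gt;
&lt;br /&gt;
 runtime {&lt;br /&gt;
     disks: &amp;quot;local-disk 20 SSD&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;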
&lt;br /&gt;
Then we can put our task into our WDL file:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Output Block===&lt;br /&gt;
&lt;br /&gt;
Now the only thing missing is a workflow-level &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section. Technically, WDL 1.0 isn't supposed to require one, but WDL 1.1 does, and Toil won't actually deliver your outputs anywhere if you don't have one, so we're going to write one. We need to collect all the strings that came out of the different tasks in our scatter into an &amp;lt;code&amp;gt;Array[String]&amp;lt;/code&amp;gt;. We'll add the &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section at the end of the &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; block, above the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
     output {&lt;br /&gt;
        Array[String] fizzbuzz_results = result&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Because the &amp;lt;code&amp;gt;result&amp;lt;/code&amp;gt; variable is defined inside a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt;, referencing it outside the scatter gives an array, with one element per scatter iteration.&lt;br /&gt;
&lt;br /&gt;
==Running the Workflow==&lt;br /&gt;
&lt;br /&gt;
Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;FizzBuzz.item_count&amp;quot;: 20}' &amp;gt;fizzbuzz.json&lt;br /&gt;
&lt;br /&gt;
Then run it on the cluster with Toil:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
Or locally:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
=Debugging Workflows=&lt;br /&gt;
Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.&lt;br /&gt;
&lt;br /&gt;
==Debugging Options==&lt;br /&gt;
When debugging a workflow, make sure to run it with &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt;, to set the log level to &amp;lt;code&amp;gt;DEBUG&amp;lt;/code&amp;gt;, and with &amp;lt;code&amp;gt;--jobStore /some/path/to/a/shared/directory/it/can/create&amp;lt;/code&amp;gt;, so that the stored files shipped between jobs are somewhere you can access them.&lt;br /&gt;
&lt;br /&gt;
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:&lt;br /&gt;
&lt;br /&gt;
 =========&amp;gt;&lt;br /&gt;
         Toil job log is here&lt;br /&gt;
 &amp;lt;=========&lt;br /&gt;
&lt;br /&gt;
Normally, only the logs of failing jobs are reproduced like this.&lt;br /&gt;
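&lt;br /&gt;
If the main log is long, you can pull out just the job logs between those markers with a one-liner. This is a rough sketch, assuming you saved the workflow's output to a file named &amp;lt;code&amp;gt;toil.log&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
```shell
# Print only the lines between the '=========' job log markers in a
# saved Toil log. Each marker line toggles printing on or off.
awk '/=========/ { flag = !flag; next } flag' toil.log
```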
&lt;br /&gt;
==Reading the Log==&lt;br /&gt;
&lt;br /&gt;
When a WDL workflow fails, you are likely to see a message like this:&lt;br /&gt;
&lt;br /&gt;
 WDL.runtime.error.CommandFailed: task command failed with exit status 1&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism&lt;br /&gt;
&lt;br /&gt;
This means that the command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code. This happens either when the command is written wrong, or when the error-detection code in the tool you are trying to run detects and reports a problem.&lt;br /&gt;
&lt;br /&gt;
Go up higher in the log until you find lines that look like:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt:&lt;br /&gt;
&lt;br /&gt;
And&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt:&lt;br /&gt;
&lt;br /&gt;
These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.&lt;br /&gt;
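&lt;br /&gt;
To jump straight to those paths in a large log, you can grep for them. A sketch, again assuming the log was saved as &amp;lt;code&amp;gt;toil.log&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
```shell
# Extract just the stderr/stdout file paths that Toil logged.
# The character class stops before the trailing ':' on each log line.
grep -Eo 'Standard (error|output) at [^ :]*' toil.log
```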
&lt;br /&gt;
==Reproducing Problems==&lt;br /&gt;
&lt;br /&gt;
When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that reproduces the problem. In addition to getting the standard output and standard error logs as described above, you may also need your tool's input files in order to do this. In the log of your failing Toil task, look for lines like this:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&amp;lt;/code&amp;gt; part is a Toil file ID, and it is a relative path from your &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; value to where the file is stored on disk. So if you ran the workflow with &amp;lt;code&amp;gt;--jobStore /private/groups/patenlab/anovak/jobstore&amp;lt;/code&amp;gt;, you would look for this file at:&lt;br /&gt;
&lt;br /&gt;
 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&lt;br /&gt;
&lt;br /&gt;
==More Ways of Finding Files==&lt;br /&gt;
&lt;br /&gt;
Sometimes a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the &amp;lt;code&amp;gt;find&amp;lt;/code&amp;gt; command to try to find the files by name. For example, if you want to look at &amp;lt;code&amp;gt;Sample.bam&amp;lt;/code&amp;gt;, you can look for it like this:&lt;br /&gt;
&lt;br /&gt;
 find /path/to/the/jobstore -name &amp;quot;Sample.bam&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
You can take the &amp;lt;code&amp;gt;toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt; URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this:&lt;br /&gt;
&lt;br /&gt;
 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
Then you can take the part after the last colon, &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt;, and that is the path relative to the job store where this file can be found.&lt;br /&gt;
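&lt;br /&gt;
If you would rather not paste URIs into a website, the same decoding can be done at the command line. A sketch using Python's standard library, with the example URI from above:&lt;br /&gt;
&lt;br /&gt;
```shell
# URL-decode a toilfile: URI and keep the part after the last colon,
# which is the path relative to the job store.
URI='toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam'
DECODED=$(python3 -c 'import sys, urllib.parse; print(urllib.parse.unquote(sys.argv[1]))' "$URI")
echo "${DECODED##*:}"
```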
&lt;br /&gt;
==Using Development Versions of Toil==&lt;br /&gt;
&lt;br /&gt;
Sometimes, bugs will be fixed in the development version of Toil, but not released yet. To try the current development version of Toil, you can install it like this:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
If you want to use a particular branch or commit, like &amp;lt;code&amp;gt;aaa451b320fc115b3563ced25cb501301cf86f90&amp;lt;/code&amp;gt;, you can do:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
==Frequently Asked Questions==&lt;br /&gt;
&lt;br /&gt;
===I am getting warnings about &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt;===&lt;br /&gt;
You may be seeing warnings that &amp;lt;code&amp;gt;XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This happens because Slurm is providing Toil with an &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt; environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it.&lt;br /&gt;
&lt;br /&gt;
===Toil said it was &amp;lt;code&amp;gt;Redirecting logging&amp;lt;/code&amp;gt; somewhere, but I can't find that file!===&lt;br /&gt;
The Toil worker process for each job will say that it is &amp;lt;code&amp;gt;Redirecting logging to /data/tmp/somewhere/worker_log.txt&amp;lt;/code&amp;gt;, and when running in single machine mode these messages go to the main Toil log.&lt;br /&gt;
&lt;br /&gt;
The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt; option to Toil.&lt;br /&gt;
&lt;br /&gt;
If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.&lt;br /&gt;
&lt;br /&gt;
=Additional WDL resources=&lt;br /&gt;
&lt;br /&gt;
For more information on writing and running WDL workflows, see:&lt;br /&gt;
&lt;br /&gt;
* [https://docs.openwdl.org/en/stable/ The WDL documentation]&lt;br /&gt;
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The &amp;quot;Learn WDL&amp;quot; video course on YouTube]&lt;br /&gt;
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Slurm_Queues_(Partitions)_and_Resource_Management&amp;diff=456</id>
		<title>Slurm Queues (Partitions) and Resource Management</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Slurm_Queues_(Partitions)_and_Resource_Management&amp;diff=456"/>
		<updated>2024-01-22T16:44:35Z</updated>

		<summary type="html">&lt;p&gt;Anovak: Document how the priority system works and how to make the scheduler account for its choices&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Partitions ==&lt;br /&gt;
&lt;br /&gt;
Due to heterogeneous workloads and different batch requirements, we have implemented partitions in Slurm, which are similar to queues.&lt;br /&gt;
&lt;br /&gt;
Each partition has different default and maximum walltime limits (aka &amp;quot;runtime&amp;quot; limits).  You will need to select a partition to launch your jobs in based on what kind of jobs they are and how long they are expected to run.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; &lt;br /&gt;
|- style=&amp;quot;font-weight:bold;&amp;quot;&lt;br /&gt;
! Partition Name&lt;br /&gt;
! Default Walltime Limit&lt;br /&gt;
! Maximum Walltime Limit&lt;br /&gt;
! style=&amp;quot;border-color:inherit;&amp;quot; | Default Partition?&lt;br /&gt;
! Job Priority&lt;br /&gt;
! Maximum Nodes Utilized&lt;br /&gt;
|-&lt;br /&gt;
| short&lt;br /&gt;
| 10 minutes&lt;br /&gt;
| 1 hour&lt;br /&gt;
| style=&amp;quot;border-color:inherit;&amp;quot; | Yes&lt;br /&gt;
| Normal&lt;br /&gt;
| All&lt;br /&gt;
|-&lt;br /&gt;
| medium&lt;br /&gt;
| 1 hour&lt;br /&gt;
| 12 hours&lt;br /&gt;
| style=&amp;quot;border-color:inherit;&amp;quot; | No&lt;br /&gt;
| Normal&lt;br /&gt;
| 15&lt;br /&gt;
|-&lt;br /&gt;
| long&lt;br /&gt;
| 12 hours&lt;br /&gt;
| 7 days&lt;br /&gt;
| style=&amp;quot;border-color:inherit;&amp;quot; | No&lt;br /&gt;
| Normal&lt;br /&gt;
| 10&lt;br /&gt;
|-&lt;br /&gt;
| high_priority&lt;br /&gt;
| 10 minutes&lt;br /&gt;
| 7 days&lt;br /&gt;
| style=&amp;quot;border-color:inherit;&amp;quot; | No&lt;br /&gt;
| High&lt;br /&gt;
| All&amp;lt;br /&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| gpu&lt;br /&gt;
| 10 minutes&lt;br /&gt;
| 7 days&lt;br /&gt;
| No&lt;br /&gt;
| Normal&lt;br /&gt;
| 6&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
If you do not specify a partition to run your job in (with e.g. &amp;lt;code&amp;gt;--partition=medium&amp;lt;/code&amp;gt;), it will automatically be assigned to the &amp;quot;short&amp;quot; partition by default.  If you do not specify a walltime value in your job submission script (with e.g. &amp;lt;code&amp;gt;--time=00:30:00&amp;lt;/code&amp;gt;), it will inherit the &amp;quot;Default Walltime Limit&amp;quot; of the partition it is assigned.  Therefore, it is a very good idea to specify both the partition your job will go in and a walltime limit; otherwise, your jobs will inherit the defaults in the chart above.&lt;br /&gt;
&lt;br /&gt;
This all means that it is very important to '''TEST''' your jobs before running many of them!  Submit one job and note how many resources it takes (RAM, CPU) and how long it takes to run.  Then, when you submit many of those jobs, you can correctly specify the number of CPU cores your job needs, how much RAM it needs (pad it by about 20% just in case), and how much time it needs to run (pad it by about 40% to account for environmental factors like disk I/O load and CPU context-switching load).&lt;br /&gt;
&lt;br /&gt;
You can test your jobs by running one job via '''srun''' with fairly high CPU, RAM, and walltime limits (just so it isn't killed due to default limits), then noting how many resources it consumed once it finishes.&lt;br /&gt;
&lt;br /&gt;
'''Example'''&lt;br /&gt;
&lt;br /&gt;
 seff 769059&lt;br /&gt;
&lt;br /&gt;
'''Output'''&lt;br /&gt;
&lt;br /&gt;
 Job ID: 769059&lt;br /&gt;
 Cluster: phoenix&lt;br /&gt;
 User/Group: &amp;lt;user-name&amp;gt;/&amp;lt;group-name&amp;gt;&lt;br /&gt;
 State: COMPLETED (exit code 0)&lt;br /&gt;
 Nodes: 1&lt;br /&gt;
 Cores per node: 16&lt;br /&gt;
 CPU Utilized: 00:00:01&lt;br /&gt;
 CPU Efficiency: 0.11% of 00:15:28 core-walltime&lt;br /&gt;
 Job Wall-clock time: 00:00:58&lt;br /&gt;
 Memory Utilized: 4.79 MB&lt;br /&gt;
 Memory Efficiency: 4.79% of 100.00 MB&lt;br /&gt;
&lt;br /&gt;
So if I needed to run about 1000 similar jobs like this one, I would select the &amp;quot;short&amp;quot; partition, 1 CPU core, maybe 8MB of RAM, and maybe a 90-second walltime limit.  Note how I padded the RAM and walltime a bit to account for unexpectedly variable cluster conditions.&lt;br /&gt;
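&lt;br /&gt;
That padding arithmetic can be sketched in shell, using the measured numbers from the &amp;lt;code&amp;gt;seff&amp;lt;/code&amp;gt; output above, rounded up to whole units (the values chosen in the paragraph above round up even further, for extra safety):&lt;br /&gt;
&lt;br /&gt;
```shell
# Measured by seff: about 4.79 MB of memory and 58 s of wall-clock time.
MEM_MB=5
WALL_SEC=58
# Pad memory by ~20% and walltime by ~40%, rounding up with integer math.
PAD_MEM=$(( (MEM_MB * 120 + 99) / 100 ))
PAD_SEC=$(( (WALL_SEC * 140 + 99) / 100 ))
printf -- '--mem=%dM --time=00:%02d:%02d\n' "$PAD_MEM" $(( PAD_SEC / 60 )) $(( PAD_SEC % 60 ))
```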
&lt;br /&gt;
== '''high_priority''' Partition Notes ==&lt;br /&gt;
&lt;br /&gt;
The &amp;quot;high_priority&amp;quot; partition is special in that its jobs have the highest priority of all jobs on the cluster and will push all other jobs aside in an effort to finish as fast as possible.  It is only available for emergency or mission-critical batches that unexpectedly need to be completed very quickly.  Access to this partition is granted only on a per-request basis, and is temporary until your batch finishes.  Email '''cluster-admin@soe.ucsc.edu''' if you need access to the high_priority queue, and make your case for why it is necessary.&lt;br /&gt;
&lt;br /&gt;
== My job is not running but I want it to be running! ==&lt;br /&gt;
&lt;br /&gt;
Even if your job is in the high-priority partition, that doesn't mean that the cluster will drop everything and run it immediately. Because we don't have pre-emption set up, high priority jobs still have to wait for currently-running jobs to finish, as well as for other high-priority jobs. And since, as noted above, jobs can be allowed to run for up to 7 days each, it is physically possible for even the highest-priority job in the whole cluster to not start for a whole week.&lt;br /&gt;
&lt;br /&gt;
Here is a [https://docs-research-it.berkeley.edu/services/high-performance-computing/user-guide/running-your-jobs/why-job-not-run/ good resource from Berkeley] about understanding and debugging Slurm job scheduling. Basically, Slurm uses the wall-clock limits of running jobs, and of jobs in the queue, to make a plan to start each job on some node at some time in the future. If jobs finish early, other jobs can start sooner than scheduled, and if there is space around higher-priority jobs, lower-priority jobs can be filled in.&lt;br /&gt;
&lt;br /&gt;
If you want to know when Slurm plans to run your job, and why that is not right now, you can use the &amp;lt;code&amp;gt;--start&amp;lt;/code&amp;gt; option for the &amp;lt;code&amp;gt;squeue&amp;lt;/code&amp;gt; command:&lt;br /&gt;
&lt;br /&gt;
    $ squeue -j 1719584 --start&lt;br /&gt;
                 JOBID PARTITION     NAME     USER ST          START_TIME  NODES SCHEDNODES           NODELIST(REASON)&lt;br /&gt;
               1719584     short snakemak flastnam PD 2024-01-22T10:20:00      1 phoenix-00           (Priority)&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;START_TIME&amp;lt;/code&amp;gt; column is the time by which Slurm is sure it will be able to start your job if no higher-priority jobs come in first, and the &amp;lt;code&amp;gt;NODELIST(REASON)&amp;lt;/code&amp;gt; column shows the nodes the job is running on, or the reason it is not running now, in parentheses. In this case, the job is not running because higher-priority jobs are in the way.&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Slurm_Tips_for_vg&amp;diff=455</id>
		<title>Slurm Tips for vg</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Slurm_Tips_for_vg&amp;diff=455"/>
		<updated>2024-01-05T15:35:25Z</updated>

		<summary type="html">&lt;p&gt;Anovak: /* Setting Up */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page explains how to set up a development environment for [https://github.com/vgteam/vg vg] on the Phoenix cluster.&lt;br /&gt;
&lt;br /&gt;
==Setting Up==&lt;br /&gt;
&lt;br /&gt;
1. After connecting to the VPN, connect to the cluster head node:&lt;br /&gt;
&lt;br /&gt;
 ssh phoenix.prism&lt;br /&gt;
&lt;br /&gt;
This node is relatively small, so you shouldn't run real work on it, but it is the place you need to be to submit Slurm jobs.&lt;br /&gt;
&lt;br /&gt;
2. Make yourself a user directory under '''/private/groups''', which is where large data must be stored. For example, if you are in the Paten lab:&lt;br /&gt;
&lt;br /&gt;
 mkdir /private/groups/patenlab/$USER&lt;br /&gt;
&lt;br /&gt;
3. (Optional) Link it into your home directory, so it is easy to keep your repos on that storage. The '''/private/groups''' storage may be faster than the home directory storage.&lt;br /&gt;
&lt;br /&gt;
 mkdir -p /private/groups/patenlab/$USER/workspace&lt;br /&gt;
 ln -s /private/groups/patenlab/$USER/workspace ~/workspace&lt;br /&gt;
&lt;br /&gt;
4. Make sure you have an SSH key created and added to GitHub.&lt;br /&gt;
&lt;br /&gt;
 cat ~/.ssh/id_ed25519.pub || (ssh-keygen -t ed25519 &amp;amp;&amp;amp; cat  ~/.ssh/id_ed25519.pub)&lt;br /&gt;
 # Paste into https://github.com/settings/ssh/new&lt;br /&gt;
&lt;br /&gt;
5. Make a place to put your clone, and clone vg:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p ~/workspace&lt;br /&gt;
 cd ~/workspace&lt;br /&gt;
 git clone --recursive git@github.com:vgteam/vg.git&lt;br /&gt;
 cd vg&lt;br /&gt;
&lt;br /&gt;
6. vg's dependencies should already be installed on the cluster nodes. If any of them seem to be missing, tell cluster-admin@soe.ucsc.edu to install them.&lt;br /&gt;
&lt;br /&gt;
7. Build vg as a Slurm job. This will send the build out to the cluster as a 64-core, 80G memory job, and keep the output logs in your terminal.&lt;br /&gt;
&lt;br /&gt;
 srun -c 64 --mem=80G --time=00:30:00 make -j64&lt;br /&gt;
&lt;br /&gt;
This will leave your vg binary at '''~/workspace/vg/bin/vg'''.&lt;br /&gt;
&lt;br /&gt;
==Misc Tips==&lt;br /&gt;
&lt;br /&gt;
* If you want an interactive session with appreciable resources, you can schedule one with '''srun'''. For example, to get 16 cores and 120G memory all for you, run:&lt;br /&gt;
&lt;br /&gt;
 srun -c 16 --mem 120G --time=08:00:00 --partition=medium --pty bash -i&lt;br /&gt;
&lt;br /&gt;
* To send out a job without making a script file for it, use '''sbatch --wrap &amp;quot;your command here&amp;quot;'''.&lt;br /&gt;
&lt;br /&gt;
* Any options you would put on '''#SBATCH''' lines in a batch script can also be passed directly to '''sbatch''' on the command line.&lt;br /&gt;
&lt;br /&gt;
* You can use [https://github.com/CLIP-HPC/SlurmCommander#readme Slurm Commander] to watch the state of the cluster with the '''scom''' command.&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Slurm_Queues_(Partitions)_and_Resource_Management&amp;diff=454</id>
		<title>Slurm Queues (Partitions) and Resource Management</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Slurm_Queues_(Partitions)_and_Resource_Management&amp;diff=454"/>
		<updated>2024-01-05T15:12:56Z</updated>

		<summary type="html">&lt;p&gt;Anovak: /* Partitions */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;== Partitions ==&lt;br /&gt;
&lt;br /&gt;
Due to heterogeneous workloads and different batch requirements, we have implemented partitions in Slurm, which are similar to queues.&lt;br /&gt;
&lt;br /&gt;
Each partition has different default and maximum walltime limits (aka &amp;quot;runtime&amp;quot; limits).  You will need to select a partition to launch your jobs in based on what kind of jobs they are and how long they are expected to run.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot; &lt;br /&gt;
|- style=&amp;quot;font-weight:bold;&amp;quot;&lt;br /&gt;
! Partition Name&lt;br /&gt;
! Default Walltime Limit&lt;br /&gt;
! Maximum Walltime Limit&lt;br /&gt;
! style=&amp;quot;border-color:inherit;&amp;quot; | Default Partition?&lt;br /&gt;
! Job Priority&lt;br /&gt;
! Maximum Nodes Utilized&lt;br /&gt;
|-&lt;br /&gt;
| short&lt;br /&gt;
| 10 minutes&lt;br /&gt;
| 1 hour&lt;br /&gt;
| style=&amp;quot;border-color:inherit;&amp;quot; | Yes&lt;br /&gt;
| Normal&lt;br /&gt;
| All&lt;br /&gt;
|-&lt;br /&gt;
| medium&lt;br /&gt;
| 1 hour&lt;br /&gt;
| 12 hours&lt;br /&gt;
| style=&amp;quot;border-color:inherit;&amp;quot; | No&lt;br /&gt;
| Normal&lt;br /&gt;
| 15&lt;br /&gt;
|-&lt;br /&gt;
| long&lt;br /&gt;
| 12 hours&lt;br /&gt;
| 7 days&lt;br /&gt;
| style=&amp;quot;border-color:inherit;&amp;quot; | No&lt;br /&gt;
| Normal&lt;br /&gt;
| 10&lt;br /&gt;
|-&lt;br /&gt;
| high_priority&lt;br /&gt;
| 10 minutes&lt;br /&gt;
| 7 days&lt;br /&gt;
| style=&amp;quot;border-color:inherit;&amp;quot; | No&lt;br /&gt;
| High&lt;br /&gt;
| All&amp;lt;br /&amp;gt;&lt;br /&gt;
|-&lt;br /&gt;
| gpu&lt;br /&gt;
| 10 minutes&lt;br /&gt;
| 7 days&lt;br /&gt;
| No&lt;br /&gt;
| Normal&lt;br /&gt;
| 6&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
If you do not specify a partition to run your job in (with e.g. &amp;lt;code&amp;gt;--partition=medium&amp;lt;/code&amp;gt;), it will automatically be assigned to the &amp;quot;short&amp;quot; partition by default.  If you do not specify a walltime value in your job submission script (with e.g. &amp;lt;code&amp;gt;--time=00:30:00&amp;lt;/code&amp;gt;), it will inherit the &amp;quot;Default Walltime Limit&amp;quot; of the partition it is assigned.  Therefore, it is a very good idea to specify both the partition your job will go in and a walltime limit; otherwise, your jobs will inherit the defaults in the chart above.&lt;br /&gt;
&lt;br /&gt;
This all means that it is very important to '''TEST''' your jobs before running many of them!  Submit one job and note how many resources it takes (RAM, CPU) and how long it takes to run.  Then, when you submit many of those jobs, you can correctly specify the number of CPU cores your job needs, how much RAM it needs (pad it by about 20% just in case), and how much time it needs to run (pad it by about 40% to account for environmental factors like disk I/O load and CPU context-switching load).&lt;br /&gt;
&lt;br /&gt;
You can test your jobs by running one job via '''srun''' with fairly high CPU, RAM, and walltime limits (just so it isn't killed due to default limits), then noting how many resources it consumed once it finishes.&lt;br /&gt;
&lt;br /&gt;
'''Example'''&lt;br /&gt;
&lt;br /&gt;
 seff 769059&lt;br /&gt;
&lt;br /&gt;
'''Output'''&lt;br /&gt;
&lt;br /&gt;
 Job ID: 769059&lt;br /&gt;
 Cluster: phoenix&lt;br /&gt;
 User/Group: &amp;lt;user-name&amp;gt;/&amp;lt;group-name&amp;gt;&lt;br /&gt;
 State: COMPLETED (exit code 0)&lt;br /&gt;
 Nodes: 1&lt;br /&gt;
 Cores per node: 16&lt;br /&gt;
 CPU Utilized: 00:00:01&lt;br /&gt;
 CPU Efficiency: 0.11% of 00:15:28 core-walltime&lt;br /&gt;
 Job Wall-clock time: 00:00:58&lt;br /&gt;
 Memory Utilized: 4.79 MB&lt;br /&gt;
 Memory Efficiency: 4.79% of 100.00 MB&lt;br /&gt;
&lt;br /&gt;
So if I needed to run about 1000 similar jobs like this one, I would select the &amp;quot;short&amp;quot; partition, 1 CPU core, maybe 8MB of RAM, and maybe a 90-second walltime limit.  Note how I padded the RAM and walltime a bit to account for unexpectedly variable cluster conditions.&lt;br /&gt;
&lt;br /&gt;
== '''high_priority''' Partition Notes ==&lt;br /&gt;
&lt;br /&gt;
The &amp;quot;high_priority&amp;quot; partition is special in that its jobs have the highest priority of all jobs on the cluster and will push all other jobs aside in an effort to finish as fast as possible.  It is only available for emergency or mission-critical batches that unexpectedly need to be completed very quickly.  Access to this partition is granted only on a per-request basis, and is temporary until your batch finishes.  Email '''cluster-admin@soe.ucsc.edu''' if you need access to the high_priority queue, and make your case for why it is necessary.&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Slurm_Tips_for_vg&amp;diff=453</id>
		<title>Slurm Tips for vg</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Slurm_Tips_for_vg&amp;diff=453"/>
		<updated>2024-01-05T15:11:59Z</updated>

		<summary type="html">&lt;p&gt;Anovak: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;This page explains how to set up a development environment for [https://github.com/vgteam/vg vg] on the Phoenix cluster.&lt;br /&gt;
&lt;br /&gt;
==Setting Up==&lt;br /&gt;
&lt;br /&gt;
1. After connecting to the VPN, connect to the cluster head node:&lt;br /&gt;
&lt;br /&gt;
 ssh phoenix.prism&lt;br /&gt;
&lt;br /&gt;
This node is relatively small, so you shouldn't run real work on it, but it is the place you need to be to submit Slurm jobs.&lt;br /&gt;
&lt;br /&gt;
2. Make yourself a user directory under '''/private/groups''', which is where large data must be stored. For example, if you are in the Paten lab:&lt;br /&gt;
&lt;br /&gt;
 mkdir /private/groups/patenlab/$USER&lt;br /&gt;
&lt;br /&gt;
3. (Optional) Link it into your home directory, so it is easy to keep your repos on that storage. The '''/private/groups''' storage may be faster than the home directory storage.&lt;br /&gt;
&lt;br /&gt;
 mkdir -p /private/groups/patenlab/$USER/workspace&lt;br /&gt;
 ln -s /private/groups/patenlab/$USER/workspace ~/workspace&lt;br /&gt;
&lt;br /&gt;
4. Make sure you have an SSH key created and added to GitHub.&lt;br /&gt;
&lt;br /&gt;
 cat ~/.ssh/id_ed25519.pub || (ssh-keygen -t ed25519 &amp;amp;&amp;amp; cat ~/.ssh/id_ed25519.pub)&lt;br /&gt;
 # Paste into https://github.com/settings/ssh/new&lt;br /&gt;
&lt;br /&gt;
5. Make a place to put your clone, and clone vg:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p ~/workspace&lt;br /&gt;
 cd ~/workspace&lt;br /&gt;
 git clone --recursive git@github.com:vgteam/vg.git&lt;br /&gt;
 cd vg&lt;br /&gt;
&lt;br /&gt;
6. vg's dependencies should already be installed on the cluster nodes. If any of them seem to be missing, tell cluster-admin@soe.ucsc.edu to install them.&lt;br /&gt;
&lt;br /&gt;
7. Build vg as a Slurm job. This will send the build out to the cluster as a 64-core, 80G memory job, and keep the output logs in your terminal.&lt;br /&gt;
&lt;br /&gt;
 srun -c 64 --mem=80G --time=00:30:00 make -j64&lt;br /&gt;
&lt;br /&gt;
This will leave your vg binary at '''~/workspace/vg/bin/vg'''.&lt;br /&gt;
&lt;br /&gt;
==Misc Tips==&lt;br /&gt;
&lt;br /&gt;
* If you want an interactive session with appreciable resources, you can schedule one with '''srun'''. For example, to get 16 cores and 120G memory all for you, run:&lt;br /&gt;
&lt;br /&gt;
 srun -c 16 --mem 120G --time=08:00:00 --partition=medium --pty bash -i&lt;br /&gt;
&lt;br /&gt;
* To send out a job without making a script file for it, use '''sbatch --wrap &amp;quot;your command here&amp;quot;'''.&lt;br /&gt;
&lt;br /&gt;
* Any options you would put on '''#SBATCH''' lines in a batch script can also be passed directly to '''sbatch''' on the command line.&lt;br /&gt;
&lt;br /&gt;
* You can use [https://github.com/CLIP-HPC/SlurmCommander#readme Slurm Commander] to watch the state of the cluster with the '''scom''' command.&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=452</id>
		<title>Phoenix WDL Tutorial</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=452"/>
		<updated>2023-12-07T23:00:48Z</updated>

		<summary type="html">&lt;p&gt;Anovak: Show the safer export syntax&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Tutorial: Getting Started with WDL Workflows on Phoenix'''&lt;br /&gt;
&lt;br /&gt;
Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like the Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements and which pieces depend on which other pieces, making your analyses reproducible by people other than you.&lt;br /&gt;
&lt;br /&gt;
Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.&lt;br /&gt;
&lt;br /&gt;
This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.&lt;br /&gt;
&lt;br /&gt;
=Phoenix Cluster Setup=&lt;br /&gt;
&lt;br /&gt;
Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.&lt;br /&gt;
&lt;br /&gt;
==Getting VPN access==&lt;br /&gt;
&lt;br /&gt;
We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.&lt;br /&gt;
&lt;br /&gt;
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.&lt;br /&gt;
&lt;br /&gt;
==Connecting to Phoenix==&lt;br /&gt;
&lt;br /&gt;
Once you have VPN access, you can connect to the &amp;quot;head node&amp;quot; of the Phoenix cluster. This node is where everyone logs in, but you should '''not''' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs.&lt;br /&gt;
&lt;br /&gt;
To connect to the head node:&lt;br /&gt;
&lt;br /&gt;
1. Connect to the VPN.&lt;br /&gt;
2. SSH to &amp;lt;code&amp;gt;phoenix.prism&amp;lt;/code&amp;gt;. At the command line, run:&lt;br /&gt;
&lt;br /&gt;
 ssh phoenix.prism&lt;br /&gt;
&lt;br /&gt;
If your username on the cluster (say, &amp;lt;code&amp;gt;flastname&amp;lt;/code&amp;gt;) is different from your username on your computer (which might be &amp;lt;code&amp;gt;firstname&amp;lt;/code&amp;gt;), you might instead have to run:&lt;br /&gt;
&lt;br /&gt;
 ssh flastname@phoenix.prism&lt;br /&gt;
&lt;br /&gt;
The first time you connect, you will see a message like:&lt;br /&gt;
&lt;br /&gt;
 The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established.&lt;br /&gt;
 ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI.&lt;br /&gt;
 This key is not known by any other names&lt;br /&gt;
 Are you sure you want to continue connecting (yes/no/[fingerprint])?&lt;br /&gt;
&lt;br /&gt;
This is your computer asking you to help it decide if it is talking to the genuine &amp;lt;code&amp;gt;phoenix.prism&amp;lt;/code&amp;gt;, and not an imposter. You will want to make sure that the &amp;quot;key fingerprint&amp;quot; is indeed &amp;lt;code&amp;gt;SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI&amp;lt;/code&amp;gt;. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type &amp;lt;code&amp;gt;yes&amp;lt;/code&amp;gt; to accept and remember that the server is who it says it is.&lt;br /&gt;
&lt;br /&gt;
==Installing Toil with WDL support==&lt;br /&gt;
&lt;br /&gt;
Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
If you also want to use AWS S3 &amp;lt;code&amp;gt;s3://&amp;lt;/code&amp;gt; and/or Google &amp;lt;code&amp;gt;gs://&amp;lt;/code&amp;gt; URLs for data, you will need to also install Toil with the &amp;lt;code&amp;gt;aws&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;google&amp;lt;/code&amp;gt; extras, respectively:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
This will install Toil in the &amp;lt;code&amp;gt;.local&amp;lt;/code&amp;gt; directory inside your home directory, which we write as &amp;lt;code&amp;gt;~/.local&amp;lt;/code&amp;gt;. The program to run WDL workflows, &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, will be at &amp;lt;code&amp;gt;~/.local/bin/toil-wdl-runner&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
By default, the command interpreter '''will not''' look there, so if you type &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run:&lt;br /&gt;
&lt;br /&gt;
 echo 'export PATH=&amp;quot;${HOME}/.local/bin:${PATH}&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in''', to restart bash and pick up the change.&lt;br /&gt;
&lt;br /&gt;
To make sure it worked, you can run:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --help&lt;br /&gt;
&lt;br /&gt;
If everything worked correctly, it will print a long list of the various option flags that the &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; command supports.&lt;br /&gt;
&lt;br /&gt;
If you ever want to upgrade Toil to a new release, you can repeat the &amp;lt;code&amp;gt;pip&amp;lt;/code&amp;gt; command above.&lt;br /&gt;
&lt;br /&gt;
==Configuring your Phoenix Environment==&lt;br /&gt;
&lt;br /&gt;
'''Do not try to store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under &amp;lt;code&amp;gt;/public/groups&amp;lt;/code&amp;gt;. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under &amp;lt;code&amp;gt;/public/groups&amp;lt;/code&amp;gt;. Usually you would end up with &amp;lt;code&amp;gt;/public/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Remember this path; we will need it later.&lt;br /&gt;
&lt;br /&gt;
==Configuring Toil for Phoenix==&lt;br /&gt;
&lt;br /&gt;
Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. On the Phoenix cluster, however, we do have a shared filesystem, so we should configure Toil to use it for caching the Docker container images used for running workflow steps. Since these files can be large, and the home directory quota is only 30 GB, we can't keep them in your home directory. We will need to use the &amp;lt;code&amp;gt;/public/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt; directory you created earlier.&lt;br /&gt;
&lt;br /&gt;
Make that directory available in your &amp;lt;code&amp;gt;~/.bashrc&amp;lt;/code&amp;gt; file by editing this command to use your actual path and then running it:&lt;br /&gt;
&lt;br /&gt;
 echo 'BIG_DATA_DIR=/public/groups/YOURGROUPNAME/YOURUSERNAME' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
Then use these commands to make sure that Toil knows where it ought to put its caches:&lt;br /&gt;
&lt;br /&gt;
 echo 'export SINGULARITY_CACHEDIR=&amp;quot;${BIG_DATA_DIR}/.singularity/cache&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE=&amp;quot;${BIG_DATA_DIR}/.cache/miniwdl&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in again''', to apply the changes.&lt;br /&gt;
&lt;br /&gt;
If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day].&lt;br /&gt;
&lt;br /&gt;
'''If you get errors about mutexes, lock files, or other weird problems with Singularity''', try moving these directories to inside &amp;lt;code&amp;gt;/data/tmp&amp;lt;/code&amp;gt; on the individual nodes, or unsetting them and letting Toil use its defaults (and exhaust our Docker pull limits). [https://github.com/DataBiosphere/toil/issues/4654 It is not clear that &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt; actually implements the necessary file locking correctly.]&lt;br /&gt;
&lt;br /&gt;
=Running an existing workflow=&lt;br /&gt;
&lt;br /&gt;
First, let's use &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].&lt;br /&gt;
&lt;br /&gt;
To start, go to your user directory under &amp;lt;code&amp;gt;/public/groups&amp;lt;/code&amp;gt;, and make a directory to work in.&lt;br /&gt;
&lt;br /&gt;
 cd /public/groups/YOURGROUPNAME/YOURUSERNAME&lt;br /&gt;
 mkdir workflow-test&lt;br /&gt;
 cd workflow-test&lt;br /&gt;
&lt;br /&gt;
Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.&lt;br /&gt;
&lt;br /&gt;
 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl&lt;br /&gt;
&lt;br /&gt;
==Preparing an input file==&lt;br /&gt;
&lt;br /&gt;
Near the top of the WDL file, there's a section like this:&lt;br /&gt;
&lt;br /&gt;
 workflow hello_caller {&lt;br /&gt;
     input {&lt;br /&gt;
         File who&lt;br /&gt;
     }&lt;br /&gt;
&lt;br /&gt;
This means that there is a workflow named &amp;lt;code&amp;gt;hello_caller&amp;lt;/code&amp;gt; in this file, and it takes as input a file variable named &amp;lt;code&amp;gt;who&amp;lt;/code&amp;gt;. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.&lt;br /&gt;
&lt;br /&gt;
So first, we have to make that list of names. Let's make it in &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 echo &amp;quot;Mridula Resurrección&amp;quot; &amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Gershom Šarlota&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Ritchie Ravi&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
&lt;br /&gt;
Then, we need to create an *inputs file*, which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt; that references it by relative path, like this:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./names.txt&amp;quot;}' &amp;gt;inputs.json&lt;br /&gt;
&lt;br /&gt;
Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].&lt;br /&gt;
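&lt;br /&gt;
If you prefer to generate the inputs file from a script (handy once a workflow takes many inputs), a JSON library will get the quoting right for you. Here is a minimal Python sketch; the workflow and input names are the ones from &amp;lt;code&amp;gt;self_test.wdl&amp;lt;/code&amp;gt; above:&lt;br /&gt;
&lt;br /&gt;
```python
import json

# The inputs file maps "<workflow name>.<input name>" to a value.
# Building it with a JSON library avoids shell quoting mistakes.
inputs = {"hello_caller.who": "./names.txt"}
inputs_json = json.dumps(inputs)
print(inputs_json)  # {"hello_caller.who": "./names.txt"}
```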
&lt;br /&gt;
==Testing at small scale on a single machine==&lt;br /&gt;
&lt;br /&gt;
We are now ready to run the workflow!&lt;br /&gt;
&lt;br /&gt;
You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:&lt;br /&gt;
&lt;br /&gt;
 srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i&lt;br /&gt;
&lt;br /&gt;
This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In your new shell, run this Toil command:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner self_test.wdl inputs.json -o local_run&lt;br /&gt;
&lt;br /&gt;
This will, by default, use the &amp;lt;code&amp;gt;single_machine&amp;lt;/code&amp;gt; Toil &amp;quot;batch system&amp;quot; to run all of the workflow's tasks locally. Output will be sent to a new directory named &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
This will print a lot of logging to standard error, and to standard output it will print:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;hello_caller.message_files&amp;quot;: [&amp;quot;local_run/Mridula Resurrecci\u00f3n.txt&amp;quot;, &amp;quot;local_run/Gershom \u0160arlota.txt&amp;quot;, &amp;quot;local_run/Ritchie Ravi.txt&amp;quot;], &amp;quot;hello_caller.messages&amp;quot;: [&amp;quot;Hello, Mridula Resurrecci\u00f3n!&amp;quot;, &amp;quot;Hello, Gershom \u0160arlota!&amp;quot;, &amp;quot;Hello, Ritchie Ravi!&amp;quot;]}&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt; directory will contain the described text files (with Unicode escape sequences like &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt; replaced by their corresponding characters), each containing a greeting for the corresponding person.&lt;br /&gt;
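&lt;br /&gt;
Those escape sequences are plain JSON; any JSON parser will decode them back to the original characters. For example, in Python, using one of the strings from the output above:&lt;br /&gt;
&lt;br /&gt;
```python
import json

# The runner escapes non-ASCII characters in its JSON output;
# parsing the JSON recovers the real characters.
line = '{"hello_caller.messages": ["Hello, Mridula Resurrecci\\u00f3n!"]}'
outputs = json.loads(line)
assert outputs["hello_caller.messages"][0] == "Hello, Mridula Resurrección!"
```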
&lt;br /&gt;
To leave your interactive Slurm session and return to the head node, use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Running at larger scale==&lt;br /&gt;
&lt;br /&gt;
Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people!&lt;br /&gt;
&lt;br /&gt;
Go get this handy list of people and cut it to length:&lt;br /&gt;
&lt;br /&gt;
 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt&lt;br /&gt;
 head -n100 1000_names.txt &amp;gt;100_names.txt&lt;br /&gt;
&lt;br /&gt;
And make a new inputs file:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./100_names.txt&amp;quot;}' &amp;gt;inputs_big.json&lt;br /&gt;
&lt;br /&gt;
Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.&lt;br /&gt;
&lt;br /&gt;
To run against the Slurm cluster, we need to use the &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the &amp;lt;code&amp;gt;--batchLogsDir&amp;lt;/code&amp;gt; option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the &amp;lt;code&amp;gt;-m&amp;lt;/code&amp;gt; option to save the output JSON to a file instead of printing it.&lt;br /&gt;
&lt;br /&gt;
Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the &amp;lt;code&amp;gt;TOIL_SLURM_ARGS&amp;lt;/code&amp;gt; environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in.&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json&lt;br /&gt;
&lt;br /&gt;
This will tick for a while, but eventually you should end up with 100 greeting files in the &amp;lt;code&amp;gt;slurm_run&amp;lt;/code&amp;gt; directory.&lt;br /&gt;
&lt;br /&gt;
=Writing your own workflow=&lt;br /&gt;
&lt;br /&gt;
In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. &lt;br /&gt;
&lt;br /&gt;
==Writing the file==&lt;br /&gt;
&lt;br /&gt;
===Version===&lt;br /&gt;
&lt;br /&gt;
All WDL files need to start with a &amp;lt;code&amp;gt;version&amp;lt;/code&amp;gt; statement (unless they are very old &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; files). Toil supports &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt;, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; and 1.0.&lt;br /&gt;
&lt;br /&gt;
So let's start a new WDL 1.0 workflow. Open up a file named &amp;lt;code&amp;gt;fizzbuzz.wdl&amp;lt;/code&amp;gt; and start with a version statement:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
&lt;br /&gt;
===Workflow Block===&lt;br /&gt;
&lt;br /&gt;
Then, add an empty &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; named &amp;lt;code&amp;gt;FizzBuzz&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Input Block===&lt;br /&gt;
&lt;br /&gt;
Workflows usually need some kind of user input, so let's give our workflow an &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; section.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Notice that each input has a type, a name, and an optional default value. If the type ends in &amp;lt;code&amp;gt;?&amp;lt;/code&amp;gt;, the value is optional, and it may be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.&lt;br /&gt;
&lt;br /&gt;
===Body===&lt;br /&gt;
&lt;br /&gt;
Now we'll start on the body of the workflow, to be inserted just after the inputs section. &lt;br /&gt;
&lt;br /&gt;
The first thing we're going to need to do is create an array of all the numbers up to the &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;. We can do this by calling the WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; function, and assigning the result to an &amp;lt;code&amp;gt;Array[Int]&amp;lt;/code&amp;gt; variable.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
&lt;br /&gt;
WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.&lt;br /&gt;
&lt;br /&gt;
===Scattering===&lt;br /&gt;
&lt;br /&gt;
Once we create an array of all the numbers, we can use a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt; to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a &amp;lt;code&amp;gt;map()&amp;lt;/code&amp;gt; in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; starts at 0.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
     Int one_based = i + 1&lt;br /&gt;
 }&lt;br /&gt;
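&lt;br /&gt;
If it helps, here is a rough Python model of what the scatter above computes (Python stands in for WDL here; &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt; is set to 5 purely for illustration):&lt;br /&gt;
&lt;br /&gt;
```python
# WDL's range(n), like Python's range(n), yields the n integers 0..n-1.
item_count = 5  # illustrative value; in WDL this is a workflow input
numbers = list(range(item_count))
assert numbers == [0, 1, 2, 3, 4]

# The scatter body runs once per element, like a (parallel) map();
# the per-iteration declarations are gathered into an array afterwards.
one_based = [i + 1 for i in numbers]
assert one_based == [1, 2, 3, 4, 5]
```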
&lt;br /&gt;
===Conditionals===&lt;br /&gt;
&lt;br /&gt;
Inside the body of the scatter, we are going to put some conditionals to determine if we should produce &amp;lt;code&amp;gt;&amp;quot;Fizz&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;Buzz&amp;quot;&amp;lt;/code&amp;gt;, or &amp;lt;code&amp;gt;&amp;quot;FizzBuzz&amp;quot;&amp;lt;/code&amp;gt;. To support our &amp;lt;code&amp;gt;fizzbuzz_override&amp;lt;/code&amp;gt;, we use an array of it and a default value, and use the WDL &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; function to find the first non-null value in that array.&lt;br /&gt;
&lt;br /&gt;
Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; at the end and take advantage of variables from un-executed conditionals being &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Note that WDL supports conditional ''expressions'' with a &amp;lt;code&amp;gt;then&amp;lt;/code&amp;gt; and an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt;, but conditional ''statements'' only have a body, not an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt; branch. If you need an else you will have to check the negated condition.&lt;br /&gt;
&lt;br /&gt;
So first, let's handle the special cases.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
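&lt;br /&gt;
To see how the null-tracking works, here is a rough Python model of one scatter iteration, with &amp;lt;code&amp;gt;None&amp;lt;/code&amp;gt; standing in for WDL's &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt; and a small helper mimicking WDL's &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
```python
def select_first(values):
    # Model of WDL's select_first(): the first non-null value wins.
    for v in values:
        if v is not None:
            return v
    raise ValueError("all values were null")

def one_iteration(one_based, to_fizz=3, to_buzz=5, fizzbuzz_override=None):
    # Variables declared in un-executed conditionals stay null (None here).
    fizz = "Fizz" if one_based % to_fizz == 0 else None
    buzz = "Buzz" if one_based % to_buzz == 0 else None
    fizzbuzz = None
    if fizz is not None and buzz is not None:
        fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
    # str() stands in for the stringify_number task used in the workflow.
    return select_first([fizzbuzz, fizz, buzz, str(one_based)])

assert [one_iteration(n) for n in range(1, 6)] == ["1", "2", "Fizz", "4", "Buzz"]
assert one_iteration(15) == "FizzBuzz"
assert one_iteration(15, fizzbuzz_override="Quux") == "Quux"
```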
&lt;br /&gt;
===Calling Tasks===&lt;br /&gt;
&lt;br /&gt;
Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a &amp;lt;code&amp;gt;${}&amp;lt;/code&amp;gt; substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a &amp;lt;code&amp;gt;stringify_number&amp;lt;/code&amp;gt; task, to be written later.&lt;br /&gt;
&lt;br /&gt;
To call a task (or another workflow), we use a &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt; statement and give it some inputs. Then we can fish the output values out of the task with &amp;lt;code&amp;gt;.&amp;lt;/code&amp;gt; access. (Because our call sits inside a conditional, its output will be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt; for iterations where the call didn't run.)&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
        call stringify_number {&lt;br /&gt;
            input:&lt;br /&gt;
                the_number = one_based&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We can put the code into the workflow now, and set about writing the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         &lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Writing Tasks===&lt;br /&gt;
&lt;br /&gt;
Our task should go after the workflow in the file. It looks a lot like a workflow except it uses &amp;lt;code&amp;gt;task&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're going to want it to take in an integer &amp;lt;code&amp;gt;the_number&amp;lt;/code&amp;gt;, and we're going to want it to output a string &amp;lt;code&amp;gt;the_string&amp;lt;/code&amp;gt;. So let's fill that in in &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; sections.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     # ???&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now, unlike workflows, tasks can have a &amp;lt;code&amp;gt;command&amp;lt;/code&amp;gt; section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use &amp;lt;code&amp;gt;~{}&amp;lt;/code&amp;gt;, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now we need to capture the result of the command script. The WDL &amp;lt;code&amp;gt;stdout()&amp;lt;/code&amp;gt; returns a WDL &amp;lt;code&amp;gt;File&amp;lt;/code&amp;gt; containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
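&lt;br /&gt;
The command plus the output section behave like capturing a command's standard output in a script. A rough Python equivalent of what this task does (assuming a Unix-like system where &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; is available):&lt;br /&gt;
&lt;br /&gt;
```python
import subprocess

# Run the task's command and capture standard output, then strip the
# trailing newline the way WDL's read_string(stdout()) does.
the_number = 7
completed = subprocess.run(
    ["echo", str(the_number)], capture_output=True, text=True, check=True
)
the_string = completed.stdout.rstrip("\n")
assert the_string == "7"
```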
&lt;br /&gt;
We're also going to want to add a &amp;lt;code&amp;gt;runtime&amp;lt;/code&amp;gt; section to our task, to specify resource requirements. We'll tell it to run in a Docker container as well, to make sure that absolutely nothing can go wrong with our delicate &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;disks&amp;lt;/code&amp;gt; section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a &amp;lt;code&amp;gt;local-disk&amp;lt;/code&amp;gt; of a certain number of gigabytes, optionally suggesting that it should be &amp;lt;code&amp;gt;SSD&amp;lt;/code&amp;gt; storage.&lt;br /&gt;
&lt;br /&gt;
Then we can put our task into our WDL file:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Output Block===&lt;br /&gt;
&lt;br /&gt;
Now the only thing missing is a workflow-level &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in WDL 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect all the strings that came out of the different tasks in our scatter into an &amp;lt;code&amp;gt;Array[String]&amp;lt;/code&amp;gt;. We'll add the &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section at the end of the &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; section, above the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
     output {&lt;br /&gt;
         Array[String] fizzbuzz_results = result&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Because the &amp;lt;code&amp;gt;result&amp;lt;/code&amp;gt; variable is defined inside a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt;, when we reference it outside the scatter we see it as being an array.&lt;br /&gt;
&lt;br /&gt;
==Running the Workflow==&lt;br /&gt;
&lt;br /&gt;
Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;FizzBuzz.item_count&amp;quot;: 20}' &amp;gt;fizzbuzz.json&lt;br /&gt;
&lt;br /&gt;
Then run it on the cluster with Toil:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 export TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot;&lt;br /&gt;
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
Or locally:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
=Debugging Workflows=&lt;br /&gt;
Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.&lt;br /&gt;
&lt;br /&gt;
==Debugging Options==&lt;br /&gt;
When debugging a workflow, make sure to run the workflow with &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt;, to set the log level to &amp;lt;code&amp;gt;DEBUG&amp;lt;/code&amp;gt;, and with &amp;lt;code&amp;gt;--jobStore /some/path/to/a/shared/directory/it/can/create&amp;lt;/code&amp;gt; so that the stored files shipped between jobs are in a place you can access them.&lt;br /&gt;
&lt;br /&gt;
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:&lt;br /&gt;
&lt;br /&gt;
 =========&amp;gt;&lt;br /&gt;
         Toil job log is here&lt;br /&gt;
 &amp;lt;=========&lt;br /&gt;
&lt;br /&gt;
Normally, only the logs of failing jobs are reproduced like this.&lt;br /&gt;
&lt;br /&gt;
==Reading the Log==&lt;br /&gt;
&lt;br /&gt;
When a WDL workflow fails, you are likely to see a message like this:&lt;br /&gt;
&lt;br /&gt;
 WDL.runtime.error.CommandFailed: task command failed with exit status 1&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism&lt;br /&gt;
&lt;br /&gt;
This means that the command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code. That happens either when the command is written wrong, or when the tool you are trying to run detects a problem and reports an error.&lt;br /&gt;
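&lt;br /&gt;
As a quick illustration outside of any workflow, the first command to fail under &amp;lt;code&amp;gt;set -e&amp;lt;/code&amp;gt; (which WDL task commands typically use) aborts the script, and its nonzero exit status is what gets reported:&lt;br /&gt;
&lt;br /&gt;
```shell
# Demonstration only: 'false' exits with status 1, so with set -e the
# 'echo after' line never runs, and the whole script reports status 1.
bash -c 'set -e; echo before; false; echo after' || echo script failed with status $?
```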
&lt;br /&gt;
Go up higher in the log until you find lines that look like:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt:&lt;br /&gt;
&lt;br /&gt;
And&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt:&lt;br /&gt;
&lt;br /&gt;
These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.&lt;br /&gt;
&lt;br /&gt;
==Reproducing Problems==&lt;br /&gt;
&lt;br /&gt;
When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that reproduces the problem. In addition to the standard output and standard error logs described above, you may also need your tool's input files in order to do this. In the log of your failing Toil task, look for lines like this:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&amp;lt;/code&amp;gt; part is a Toil file ID, and it is a relative path from your &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; value to where the file is stored on disk. So if you ran the workflow with &amp;lt;code&amp;gt;--jobStore /private/groups/patenlab/anovak/jobstore&amp;lt;/code&amp;gt;, you would look for this file at:&lt;br /&gt;
&lt;br /&gt;
 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&lt;br /&gt;
&lt;br /&gt;
==More Ways of Finding Files==&lt;br /&gt;
&lt;br /&gt;
Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the &amp;lt;code&amp;gt;find&amp;lt;/code&amp;gt; command to try and find the files by name. For example, if you want to look at &amp;lt;code&amp;gt;Sample.bam&amp;lt;/code&amp;gt;, you can look for it like this:&lt;br /&gt;
&lt;br /&gt;
 find /path/to/the/jobstore -name &amp;quot;Sample.bam&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
You can take the &amp;lt;code&amp;gt;toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt; URI and URL-decode it with, for example, [https://www.urldecoder.io/ urldecoder.io], getting this:&lt;br /&gt;
&lt;br /&gt;
 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
Then you can take the part after the last colon, &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt;, and that is the path relative to the job store where this file can be found.&lt;br /&gt;
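&lt;br /&gt;
If you would rather script the decoding, here is a small Python sketch (standard library only; the URI is the illustrative one from the log line above) that recovers the job-store-relative path:&lt;br /&gt;
&lt;br /&gt;
```python
from urllib.parse import unquote

# The toilfile URI as it appears in the debug log (illustrative value).
uri = 'toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam'

# URL-decode the URI, then keep everything after the last colon: that is
# the path relative to the --jobStore directory.
relative_path = unquote(uri).rsplit(':', 1)[-1]
print(relative_path)
```
&lt;br /&gt;
The printed path, joined onto your &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; directory, is where the file lives on disk.&lt;br /&gt;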
&lt;br /&gt;
==Using Development Versions of Toil==&lt;br /&gt;
&lt;br /&gt;
Sometimes, bugs will be fixed in the development version of Toil, but not released yet. To try the current development version of Toil, you can install it like this:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
If you want to use a particular branch or commit, like &amp;lt;code&amp;gt;aaa451b320fc115b3563ced25cb501301cf86f90&amp;lt;/code&amp;gt;, you can do:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
==Frequently Asked Questions==&lt;br /&gt;
&lt;br /&gt;
===I am getting warnings about &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt;===&lt;br /&gt;
You may be seeing warnings that &amp;lt;code&amp;gt;XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This happens because Slurm is providing Toil with an &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt; environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it.&lt;br /&gt;
&lt;br /&gt;
===Toil said it was &amp;lt;code&amp;gt;Redirecting logging&amp;lt;/code&amp;gt; somewhere, but I can't find that file!===&lt;br /&gt;
The Toil worker process for each job will say that it is &amp;lt;code&amp;gt;Redirecting logging to /data/tmp/somewhere/worker_log.txt&amp;lt;/code&amp;gt;, and when running in single machine mode these messages go to the main Toil log.&lt;br /&gt;
&lt;br /&gt;
The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt; option to Toil.&lt;br /&gt;
&lt;br /&gt;
If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.&lt;br /&gt;
&lt;br /&gt;
=Additional WDL resources=&lt;br /&gt;
&lt;br /&gt;
For more information on writing and running WDL workflows, see:&lt;br /&gt;
&lt;br /&gt;
* [https://docs.openwdl.org/en/stable/ The WDL documentation]&lt;br /&gt;
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The &amp;quot;Learn WDL&amp;quot; video course on YouTube]&lt;br /&gt;
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=449</id>
		<title>Phoenix WDL Tutorial</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=449"/>
		<updated>2023-12-01T16:14:21Z</updated>

		<summary type="html">&lt;p&gt;Anovak: /* Testing at small scale single-machine */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Tutorial: Getting Started with WDL Workflows on Phoenix'''&lt;br /&gt;
&lt;br /&gt;
Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like the Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you.&lt;br /&gt;
&lt;br /&gt;
Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.&lt;br /&gt;
&lt;br /&gt;
This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.&lt;br /&gt;
&lt;br /&gt;
=Phoenix Cluster Setup=&lt;br /&gt;
&lt;br /&gt;
Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.&lt;br /&gt;
&lt;br /&gt;
==Getting VPN access==&lt;br /&gt;
&lt;br /&gt;
We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.&lt;br /&gt;
&lt;br /&gt;
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.&lt;br /&gt;
&lt;br /&gt;
==Connecting to Phoenix==&lt;br /&gt;
&lt;br /&gt;
Once you have VPN access, you can connect to the &amp;quot;head node&amp;quot; of the Phoenix cluster. This node is where everyone logs in, but you should '''not''' run actual work on it; it exists only to give you access to the files on the cluster and to the commands that control cluster jobs.&lt;br /&gt;
&lt;br /&gt;
To connect to the head node:&lt;br /&gt;
&lt;br /&gt;
1. Connect to the VPN.&lt;br /&gt;
2. SSH to &amp;lt;code&amp;gt;phoenix.prism&amp;lt;/code&amp;gt;. At the command line, run:&lt;br /&gt;
&lt;br /&gt;
 ssh phoenix.prism&lt;br /&gt;
&lt;br /&gt;
If your username on the cluster (say, &amp;lt;code&amp;gt;flastname&amp;lt;/code&amp;gt;) is different than your username on your computer (which might be &amp;lt;code&amp;gt;firstname&amp;lt;/code&amp;gt;), you might instead have to run:&lt;br /&gt;
&lt;br /&gt;
 ssh flastname@phoenix.prism&lt;br /&gt;
&lt;br /&gt;
The first time you connect, you will see a message like:&lt;br /&gt;
&lt;br /&gt;
 The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established.&lt;br /&gt;
 ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI.&lt;br /&gt;
 This key is not known by any other names&lt;br /&gt;
 Are you sure you want to continue connecting (yes/no/[fingerprint])?&lt;br /&gt;
&lt;br /&gt;
This is your computer asking you to help it decide if it is talking to the genuine &amp;lt;code&amp;gt;phoenix.prism&amp;lt;/code&amp;gt;, and not an imposter. You will want to make sure that the &amp;quot;key fingerprint&amp;quot; is indeed &amp;lt;code&amp;gt;SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI&amp;lt;/code&amp;gt;. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type &amp;lt;code&amp;gt;yes&amp;lt;/code&amp;gt; to accept and remember that the server is who it says it is.&lt;br /&gt;
&lt;br /&gt;
==Installing Toil with WDL support==&lt;br /&gt;
&lt;br /&gt;
Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
If you also want to use AWS S3 &amp;lt;code&amp;gt;s3://&amp;lt;/code&amp;gt; and/or Google &amp;lt;code&amp;gt;gs://&amp;lt;/code&amp;gt; URLs for data, you will need to also install Toil with the &amp;lt;code&amp;gt;aws&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;google&amp;lt;/code&amp;gt; extras, respectively:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
This will install Toil in the &amp;lt;code&amp;gt;.local&amp;lt;/code&amp;gt; directory inside your home directory, which we write as &amp;lt;code&amp;gt;~/.local&amp;lt;/code&amp;gt;. The program to run WDL workflows, &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, will be at &amp;lt;code&amp;gt;~/.local/bin/toil-wdl-runner&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
By default, the command interpreter '''will not''' look there, so if you type &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run:&lt;br /&gt;
&lt;br /&gt;
 echo 'export PATH=&amp;quot;${HOME}/.local/bin:${PATH}&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in''', to restart bash and pick up the change.&lt;br /&gt;
&lt;br /&gt;
To make sure it worked, you can run:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --help&lt;br /&gt;
&lt;br /&gt;
If everything worked correctly, it will print a long list of the various option flags that the &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; command supports.&lt;br /&gt;
&lt;br /&gt;
If you ever want to upgrade Toil to a new release, you can repeat the &amp;lt;code&amp;gt;pip&amp;lt;/code&amp;gt; command above.&lt;br /&gt;
&lt;br /&gt;
==Configuring your Phoenix Environment==&lt;br /&gt;
&lt;br /&gt;
'''Do not try to store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under &amp;lt;code&amp;gt;/public/groups&amp;lt;/code&amp;gt;. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under &amp;lt;code&amp;gt;/public/groups&amp;lt;/code&amp;gt;. Usually you would end up with &amp;lt;code&amp;gt;/public/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Remember this path; we will need it later.&lt;br /&gt;
&lt;br /&gt;
==Configuring Toil for Phoenix==&lt;br /&gt;
&lt;br /&gt;
Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. On the Phoenix cluster, however, we do have a shared filesystem, so we should configure Toil to use it for caching the Docker container images used for running workflow steps. Since these images can be large, and the home directory quota is only 30 GB, we can't keep them in your home directory; we will use the &amp;lt;code&amp;gt;/public/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt; directory you created earlier.&lt;br /&gt;
&lt;br /&gt;
Make sure that that directory is available in your &amp;lt;code&amp;gt;~/.bashrc&amp;lt;/code&amp;gt; file by editing and running this command:&lt;br /&gt;
&lt;br /&gt;
 echo 'BIG_DATA_DIR=/public/groups/YOURGROUPNAME/YOURUSERNAME' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
Then use these commands to make sure that Toil knows where it ought to put its caches:&lt;br /&gt;
&lt;br /&gt;
 echo 'export SINGULARITY_CACHEDIR=&amp;quot;${BIG_DATA_DIR}/.singularity/cache&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE=&amp;quot;${BIG_DATA_DIR}/.cache/miniwdl&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in again''', to apply the changes.&lt;br /&gt;
&lt;br /&gt;
If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day].&lt;br /&gt;
&lt;br /&gt;
'''If you get errors about mutexes, lock files, or other weird problems with Singularity''', try moving these directories to inside &amp;lt;code&amp;gt;/data/tmp&amp;lt;/code&amp;gt; on the individual nodes, or unsetting them and letting Toil use its defaults (and exhaust our Docker pull limits). [https://github.com/DataBiosphere/toil/issues/4654 It is not clear that &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt; actually implements the necessary file locking correctly.]&lt;br /&gt;
&lt;br /&gt;
=Running an existing workflow=&lt;br /&gt;
&lt;br /&gt;
First, let's use &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].&lt;br /&gt;
&lt;br /&gt;
First, go to your user directory under &amp;lt;code&amp;gt;/public/groups&amp;lt;/code&amp;gt;, and make a directory to work in.&lt;br /&gt;
&lt;br /&gt;
 cd /public/groups/YOURGROUPNAME/YOURUSERNAME&lt;br /&gt;
 mkdir workflow-test&lt;br /&gt;
 cd workflow-test&lt;br /&gt;
&lt;br /&gt;
Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.&lt;br /&gt;
&lt;br /&gt;
 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl&lt;br /&gt;
&lt;br /&gt;
==Preparing an input file==&lt;br /&gt;
&lt;br /&gt;
Near the top of the WDL file, there's a section like this:&lt;br /&gt;
&lt;br /&gt;
 workflow hello_caller {&lt;br /&gt;
     input {&lt;br /&gt;
         File who&lt;br /&gt;
     }&lt;br /&gt;
&lt;br /&gt;
This means that there is a workflow named &amp;lt;code&amp;gt;hello_caller&amp;lt;/code&amp;gt; in this file, and it takes as input a file variable named &amp;lt;code&amp;gt;who&amp;lt;/code&amp;gt;. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.&lt;br /&gt;
&lt;br /&gt;
So first, we have to make that list of names. Let's make it in &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 echo &amp;quot;Mridula Resurrección&amp;quot; &amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Gershom Šarlota&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Ritchie Ravi&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
&lt;br /&gt;
Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt; that references it by relative path, like this:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./names.txt&amp;quot;}' &amp;gt;inputs.json&lt;br /&gt;
&lt;br /&gt;
Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].&lt;br /&gt;
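&lt;br /&gt;
For a single input, &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; is fine, but as workflows grow it can be easier to generate the inputs file. A small Python sketch that writes the same file:&lt;br /&gt;
&lt;br /&gt;
```python
import json

# Build the same inputs mapping as the echo command above and write it
# out as JSON; each key is the workflow name, a dot, then the input name.
inputs = {'hello_caller.who': './names.txt'}
with open('inputs.json', 'w') as f:
    json.dump(inputs, f)

print(open('inputs.json').read())
```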
&lt;br /&gt;
==Testing at small scale single-machine==&lt;br /&gt;
&lt;br /&gt;
We are now ready to run the workflow!&lt;br /&gt;
&lt;br /&gt;
You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:&lt;br /&gt;
&lt;br /&gt;
 srun -c 2 --mem 8G --time=02:00:00 --partition=medium --pty bash -i&lt;br /&gt;
&lt;br /&gt;
This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In your new shell, run this Toil command:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner self_test.wdl inputs.json -o local_run&lt;br /&gt;
&lt;br /&gt;
This will, by default, use the &amp;lt;code&amp;gt;single_machine&amp;lt;/code&amp;gt; Toil &amp;quot;batch system&amp;quot; to run all of the workflow's tasks locally. Output will be sent to a new directory named &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
This will print a lot of logging to standard error, and to standard output it will print:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;hello_caller.message_files&amp;quot;: [&amp;quot;local_run/Mridula Resurrecci\u00f3n.txt&amp;quot;, &amp;quot;local_run/Gershom \u0160arlota.txt&amp;quot;, &amp;quot;local_run/Ritchie Ravi.txt&amp;quot;], &amp;quot;hello_caller.messages&amp;quot;: [&amp;quot;Hello, Mridula Resurrecci\u00f3n!&amp;quot;, &amp;quot;Hello, Gershom \u0160arlota!&amp;quot;, &amp;quot;Hello, Ritchie Ravi!&amp;quot;]}&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt; directory will contain the described text files (with Unicode escape sequences like &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt; replaced by their corresponding characters), each containing a greeting for the corresponding person.&lt;br /&gt;
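&lt;br /&gt;
Those &amp;lt;code&amp;gt;\uXXXX&amp;lt;/code&amp;gt; sequences are ordinary JSON escapes. If you parse the output JSON, for example with Python's standard library (on a trimmed-down version of the output above), you get the real characters back:&lt;br /&gt;
&lt;br /&gt;
```python
import json

# A trimmed-down version of the workflow's output JSON; \u0160 is the
# JSON escape for the first character of the surname.
output = '{"hello_caller.messages": ["Hello, Gershom \\u0160arlota!"]}'
messages = json.loads(output)['hello_caller.messages']
print(messages[0])
```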
&lt;br /&gt;
To leave your interactive Slurm session and return to the head node, use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Running at larger scale==&lt;br /&gt;
&lt;br /&gt;
Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people!&lt;br /&gt;
&lt;br /&gt;
Go get this handy list of people and cut it to length:&lt;br /&gt;
&lt;br /&gt;
 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt&lt;br /&gt;
 head -n100 1000_names.txt &amp;gt;100_names.txt&lt;br /&gt;
&lt;br /&gt;
And make a new inputs file:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./100_names.txt&amp;quot;}' &amp;gt;inputs_big.json&lt;br /&gt;
&lt;br /&gt;
Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.&lt;br /&gt;
&lt;br /&gt;
To run against the Slurm cluster, we need to use the &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the &amp;lt;code&amp;gt;--batchLogsDir&amp;lt;/code&amp;gt; option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the &amp;lt;code&amp;gt;-m&amp;lt;/code&amp;gt; option to save the output JSON to a file instead of printing it.&lt;br /&gt;
&lt;br /&gt;
Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the &amp;lt;code&amp;gt;TOIL_SLURM_ARGS&amp;lt;/code&amp;gt; environment variable to tell Toil how long jobs should be allowed to run for (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in.&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot; toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json&lt;br /&gt;
&lt;br /&gt;
This will tick for a while, but eventually you should end up with 100 greeting files in the &amp;lt;code&amp;gt;slurm_run&amp;lt;/code&amp;gt; directory.&lt;br /&gt;
&lt;br /&gt;
=Writing your own workflow=&lt;br /&gt;
&lt;br /&gt;
In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. &lt;br /&gt;
&lt;br /&gt;
==Writing the file==&lt;br /&gt;
&lt;br /&gt;
===Version===&lt;br /&gt;
&lt;br /&gt;
All WDL files need to start with a &amp;lt;code&amp;gt;version&amp;lt;/code&amp;gt; statement (unless they are very old &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; files). Toil supports &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt;, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; and 1.0.&lt;br /&gt;
&lt;br /&gt;
So let's start a new WDL 1.0 workflow. Open up a file named &amp;lt;code&amp;gt;fizzbuzz.wdl&amp;lt;/code&amp;gt; and start with a version statement:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
&lt;br /&gt;
===Workflow Block===&lt;br /&gt;
&lt;br /&gt;
Then, add an empty &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; named &amp;lt;code&amp;gt;FizzBuzz&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Input Block===&lt;br /&gt;
&lt;br /&gt;
Workflows usually need some kind of user input, so let's give our workflow an &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; section.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Notice that each input has a type, a name, and an optional default value. If the type ends in &amp;lt;code&amp;gt;?&amp;lt;/code&amp;gt;, the value is optional, and it may be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.&lt;br /&gt;
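&lt;br /&gt;
For example, a valid inputs file for this workflow only has to provide &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;; this Python sketch (with illustrative values) also overrides one default and sets the optional input:&lt;br /&gt;
&lt;br /&gt;
```python
import json

# item_count is required; to_buzz overrides its default of 5; the
# optional fizzbuzz_override could just as well be omitted entirely.
inputs = {
    'FizzBuzz.item_count': 15,
    'FizzBuzz.to_buzz': 7,
    'FizzBuzz.fizzbuzz_override': 'FizzBuzz!',
}
print(json.dumps(inputs))
```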
&lt;br /&gt;
===Body===&lt;br /&gt;
&lt;br /&gt;
Now we'll start on the body of the workflow, to be inserted just after the inputs section. &lt;br /&gt;
&lt;br /&gt;
The first thing we're going to need to do is create an array of all the numbers up to the &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;. We can do this by calling the WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; function, and assigning the result to an &amp;lt;code&amp;gt;Array[Int]&amp;lt;/code&amp;gt; variable.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
&lt;br /&gt;
WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.&lt;br /&gt;
&lt;br /&gt;
===Scattering===&lt;br /&gt;
&lt;br /&gt;
Once we create an array of all the numbers, we can use a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt; to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a &amp;lt;code&amp;gt;map()&amp;lt;/code&amp;gt; in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; starts at 0.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
     Int one_based = i + 1&lt;br /&gt;
 }&lt;br /&gt;
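&lt;br /&gt;
If it helps, the scatter above behaves like a Python list comprehension over the input array (this analogy is ours; WDL itself is not Python):&lt;br /&gt;
&lt;br /&gt;
```python
# Python analogy for the WDL scatter: the body runs once per element,
# and the per-element results are gathered into a new array.
item_count = 5
numbers = list(range(item_count))     # WDL: range(item_count)
one_based = [i + 1 for i in numbers]  # the scatter body
print(one_based)
```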
&lt;br /&gt;
===Conditionals===&lt;br /&gt;
&lt;br /&gt;
Inside the body of the scatter, we are going to put some conditionals to determine if we should produce &amp;lt;code&amp;gt;&amp;quot;Fizz&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;Buzz&amp;quot;&amp;lt;/code&amp;gt;, or &amp;lt;code&amp;gt;&amp;quot;FizzBuzz&amp;quot;&amp;lt;/code&amp;gt;. To support our &amp;lt;code&amp;gt;fizzbuzz_override&amp;lt;/code&amp;gt;, we use an array of it and a default value, and use the WDL &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; function to find the first non-null value in that array.&lt;br /&gt;
&lt;br /&gt;
Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; at the end and take advantage of variables from un-executed conditionals being &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Note that WDL supports conditional ''expressions'' with a &amp;lt;code&amp;gt;then&amp;lt;/code&amp;gt; and an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt;, but conditional ''statements'' only have a body, not an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt; branch. If you need an else you will have to check the negated condition.&lt;br /&gt;
&lt;br /&gt;
So first, let's handle the special cases.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
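&lt;br /&gt;
To see why this will work, here is the &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; idea sketched in Python, modeling WDL's null as &amp;lt;code&amp;gt;None&amp;lt;/code&amp;gt; (an illustrative analogy, not WDL itself):&lt;br /&gt;
&lt;br /&gt;
```python
# select_first() returns the first non-null value in the array.
def select_first(values):
    return next(v for v in values if v is not None)

fizzbuzz_override = None  # the optional input was left unset
print(select_first([fizzbuzz_override, "FizzBuzz"]))
```
&lt;br /&gt;
If the user had set the override, it would win, because it comes first in the array.&lt;br /&gt;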
&lt;br /&gt;
===Calling Tasks===&lt;br /&gt;
&lt;br /&gt;
Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a &amp;lt;code&amp;gt;${}&amp;lt;/code&amp;gt; substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a &amp;lt;code&amp;gt;stringify_number&amp;lt;/code&amp;gt; task, to be written later.&lt;br /&gt;
&lt;br /&gt;
To call a task (or another workflow), we use a &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt; statement and give it some inputs. Then we can fish the output values out of the task with &amp;lt;code&amp;gt;.&amp;lt;/code&amp;gt; access. We only make the call for the normal numbers, when we don't produce a noise instead.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
        call stringify_number {&lt;br /&gt;
            input:&lt;br /&gt;
                the_number = one_based&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
 }&lt;br /&gt;
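&lt;br /&gt;
As a sanity check on the logic, here is the whole scatter body translated into plain Python (an analogy only; &amp;lt;code&amp;gt;str()&amp;lt;/code&amp;gt; stands in for the &amp;lt;code&amp;gt;stringify_number&amp;lt;/code&amp;gt; task we still need to write):&lt;br /&gt;
&lt;br /&gt;
```python
# Python analogy of the FizzBuzz workflow body: noises beat plain
# numbers, and the combined noise beats the individual ones.
def fizzbuzz(item_count, to_fizz=3, to_buzz=5, fizzbuzz_override=None):
    results = []
    for i in range(item_count):
        one_based = i + 1
        fizz = "Fizz" if one_based % to_fizz == 0 else None
        buzz = "Buzz" if one_based % to_buzz == 0 else None
        both = fizz is not None and buzz is not None
        combined = (fizzbuzz_override or "FizzBuzz") if both else None
        # select_first() equivalent: the first non-None value wins
        results.append(next(v for v in (combined, fizz, buzz, str(one_based))
                            if v is not None))
    return results

print(fizzbuzz(15))
```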
&lt;br /&gt;
We can put the code into the workflow now, and set about writing the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         &lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Writing Tasks===&lt;br /&gt;
&lt;br /&gt;
Our task should go after the workflow in the file. It looks a lot like a workflow except it uses &amp;lt;code&amp;gt;task&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're going to want it to take in an integer &amp;lt;code&amp;gt;the_number&amp;lt;/code&amp;gt;, and we're going to want it to output a string &amp;lt;code&amp;gt;the_string&amp;lt;/code&amp;gt;. So let's fill that in in &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; sections.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     # ???&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now, unlike workflows, tasks can have a &amp;lt;code&amp;gt;command&amp;lt;/code&amp;gt; section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use &amp;lt;code&amp;gt;~{}&amp;lt;/code&amp;gt; (Bash-like substitution, but with a tilde) to place WDL variables into your command script. So let's add a command that will echo the number back so we can see it as a string.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now we need to capture the result of the command script. The WDL &amp;lt;code&amp;gt;stdout()&amp;lt;/code&amp;gt; returns a WDL &amp;lt;code&amp;gt;File&amp;lt;/code&amp;gt; containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
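&lt;br /&gt;
The capture-and-strip behavior is easy to model in Python (an analogy; the real &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; takes a WDL &amp;lt;code&amp;gt;File&amp;lt;/code&amp;gt;):&lt;br /&gt;
&lt;br /&gt;
```python
# read_string() reads the file contents and drops trailing line endings.
def read_string(text):
    return text.rstrip("\r\n")

print(read_string("42\n"))  # the echoed number, with the newline removed
```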
&lt;br /&gt;
We're also going to want to add a &amp;lt;code&amp;gt;runtime&amp;lt;/code&amp;gt; section to our task, to specify resource requirements, and we'll tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;disks&amp;lt;/code&amp;gt; section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a &amp;lt;code&amp;gt;local-disk&amp;lt;/code&amp;gt; of a certain number of gigabytes, which may suggest that it be &amp;lt;code&amp;gt;SSD&amp;lt;/code&amp;gt; storage.&lt;br /&gt;
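&lt;br /&gt;
For example, here is a hypothetical sketch (not Toil's actual parsing code) of how such a string breaks down into its three fields:&lt;br /&gt;
&lt;br /&gt;
```python
# Break a Cromwell-style disks string into mount point, size in
# gigabytes, and disk type. Hypothetical illustration only.
def parse_disks(spec):
    mount, size_gb, disk_type = spec.split()
    return mount, int(size_gb), disk_type

print(parse_disks("local-disk 1 SSD"))
```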
&lt;br /&gt;
Then we can put our task into our WDL file:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Output Block===&lt;br /&gt;
&lt;br /&gt;
Now the only thing missing is a workflow-level &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section. Technically, WDL 1.0 doesn't require one, but WDL 1.1 does, and Toil won't actually deliver your outputs anywhere yet unless you have one, so we're going to write one. We need to collect all the strings that came out of the different tasks in our scatter into an &amp;lt;code&amp;gt;Array[String]&amp;lt;/code&amp;gt;. We'll add the &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section at the end of the &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; section, above the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
     output {&lt;br /&gt;
        Array[String] fizzbuzz_results = result&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Because the &amp;lt;code&amp;gt;result&amp;lt;/code&amp;gt; variable is defined inside a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt;, when we reference it outside the scatter we see it as being an array.&lt;br /&gt;
&lt;br /&gt;
==Running the Workflow==&lt;br /&gt;
&lt;br /&gt;
Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;FizzBuzz.item_count&amp;quot;: 20}' &amp;gt;fizzbuzz.json&lt;br /&gt;
&lt;br /&gt;
Then run it on the cluster with Toil:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot; toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
Or locally:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
=Debugging Workflows=&lt;br /&gt;
Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.&lt;br /&gt;
&lt;br /&gt;
==Debugging Options==&lt;br /&gt;
When debugging a workflow, make sure to run the workflow with &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt;, to set the log level to &amp;lt;code&amp;gt;DEBUG&amp;lt;/code&amp;gt;, and with &amp;lt;code&amp;gt;--jobStore /some/path/to/a/shared/directory/it/can/create&amp;lt;/code&amp;gt; so that the stored files shipped between jobs are in a place you can access them.&lt;br /&gt;
&lt;br /&gt;
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:&lt;br /&gt;
&lt;br /&gt;
 =========&amp;gt;&lt;br /&gt;
         Toil job log is here&lt;br /&gt;
 &amp;lt;=========&lt;br /&gt;
&lt;br /&gt;
Normally, only the logs of failing jobs are reproduced like this.&lt;br /&gt;
&lt;br /&gt;
==Reading the Log==&lt;br /&gt;
&lt;br /&gt;
When a WDL workflow fails, you are likely to see a message like this:&lt;br /&gt;
&lt;br /&gt;
 WDL.runtime.error.CommandFailed: task command failed with exit status 1&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism&lt;br /&gt;
&lt;br /&gt;
This means that the command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit status. This happens either when the command itself is written wrong, or when the tool you are trying to run detects and reports an error.&lt;br /&gt;
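&lt;br /&gt;
If exit statuses are unfamiliar, this illustration (not Toil code) shows the convention: zero means success, and anything else is treated as a failed task command.&lt;br /&gt;
&lt;br /&gt;
```python
import subprocess
import sys

# Run a child process that deliberately exits with status 1, the way a
# failing tool inside a WDL task command would.
proc = subprocess.run([sys.executable, "-c", "raise SystemExit(1)"])
print(proc.returncode)
```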
&lt;br /&gt;
Go up higher in the log until you find lines that look like:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt:&lt;br /&gt;
&lt;br /&gt;
And&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt:&lt;br /&gt;
&lt;br /&gt;
These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.&lt;br /&gt;
&lt;br /&gt;
==Reproducing Problems==&lt;br /&gt;
&lt;br /&gt;
When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. In the log of your failing Toil task, look for lines like this:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&amp;lt;/code&amp;gt; part is a Toil file ID, and it is a relative path from your &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; value to where the file is stored on disk. So if you ran the workflow with &amp;lt;code&amp;gt;--jobStore /private/groups/patenlab/anovak/jobstore&amp;lt;/code&amp;gt;, you would look for this file at:&lt;br /&gt;
&lt;br /&gt;
 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&lt;br /&gt;
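&lt;br /&gt;
The path arithmetic here is just joining the job store directory and the logged file ID, for example:&lt;br /&gt;
&lt;br /&gt;
```python
import os.path

# The file ID from the log is a path relative to the --jobStore value.
job_store = "/private/groups/patenlab/anovak/jobstore"
file_id = ("files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/"
           "file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam")
print(os.path.join(job_store, file_id))
```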
&lt;br /&gt;
==More Ways of Finding Files==&lt;br /&gt;
&lt;br /&gt;
Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the &amp;lt;code&amp;gt;find&amp;lt;/code&amp;gt; command to try and find the files by name. For example, if you want to look at &amp;lt;code&amp;gt;Sample.bam&amp;lt;/code&amp;gt;, you can look for it like this:&lt;br /&gt;
&lt;br /&gt;
 find /path/to/the/jobstore -name &amp;quot;Sample.bam&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
You can take the &amp;lt;code&amp;gt;toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt; URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this:&lt;br /&gt;
&lt;br /&gt;
 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
Then you can take the part after the last colon, &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt;, and that is the path relative to the job store where this file can be found.&lt;br /&gt;
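&lt;br /&gt;
Both decoding steps can also be done in Python, using the example URI from above:&lt;br /&gt;
&lt;br /&gt;
```python
from urllib.parse import unquote

# URL-decode the toilfile: URI, then keep everything after the last
# colon to get the job-store-relative path.
uri = ("toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob"
       "%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351"
       "%2FSample.chr14.bam/Sample.chr14.bam")
path = unquote(uri).rsplit(":", 1)[-1]
print(path)
```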
&lt;br /&gt;
==Using Development Versions of Toil==&lt;br /&gt;
&lt;br /&gt;
Sometimes, bugs will be fixed in the development version of Toil, but not released yet. To try the current development version of Toil, you can install it like this:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
If you want to use a particular branch or commit, like &amp;lt;code&amp;gt;aaa451b320fc115b3563ced25cb501301cf86f90&amp;lt;/code&amp;gt;, you can do:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
==Frequently Asked Questions==&lt;br /&gt;
&lt;br /&gt;
===I am getting warnings about &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt;===&lt;br /&gt;
You may be seeing warnings that &amp;lt;code&amp;gt;XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This happens because Slurm is providing Toil with an &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt; environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it.&lt;br /&gt;
&lt;br /&gt;
===Toil said it was &amp;lt;code&amp;gt;Redirecting logging&amp;lt;/code&amp;gt; somewhere, but I can't find that file!===&lt;br /&gt;
The Toil worker process for each job will say that it is &amp;lt;code&amp;gt;Redirecting logging to /data/tmp/somewhere/worker_log.txt&amp;lt;/code&amp;gt;, and when running in single machine mode these messages go to the main Toil log.&lt;br /&gt;
&lt;br /&gt;
The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt; option to Toil.&lt;br /&gt;
&lt;br /&gt;
If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.&lt;br /&gt;
&lt;br /&gt;
=Additional WDL resources=&lt;br /&gt;
&lt;br /&gt;
For more information on writing and running WDL workflows, see:&lt;br /&gt;
&lt;br /&gt;
* [https://docs.openwdl.org/en/stable/ The WDL documentation]&lt;br /&gt;
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The &amp;quot;Learn WDL&amp;quot; video course on YouTube]&lt;br /&gt;
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=448</id>
		<title>Phoenix WDL Tutorial</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=448"/>
		<updated>2023-12-01T16:10:49Z</updated>

		<summary type="html">&lt;p&gt;Anovak: Show using the partitions&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Tutorial: Getting Started with WDL Workflows on Phoenix'''&lt;br /&gt;
&lt;br /&gt;
Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like the Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you.&lt;br /&gt;
&lt;br /&gt;
Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.&lt;br /&gt;
&lt;br /&gt;
This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.&lt;br /&gt;
&lt;br /&gt;
=Phoenix Cluster Setup=&lt;br /&gt;
&lt;br /&gt;
Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.&lt;br /&gt;
&lt;br /&gt;
==Getting VPN access==&lt;br /&gt;
&lt;br /&gt;
We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.&lt;br /&gt;
&lt;br /&gt;
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.&lt;br /&gt;
&lt;br /&gt;
==Connecting to Phoenix==&lt;br /&gt;
&lt;br /&gt;
Once you have VPN access, you can connect to the &amp;quot;head node&amp;quot; of the Phoenix cluster. This node is where everyone logs in, but you should ''not'' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs.&lt;br /&gt;
&lt;br /&gt;
To connect to the head node:&lt;br /&gt;
&lt;br /&gt;
1. Connect to the VPN.&lt;br /&gt;
2. SSH to &amp;lt;code&amp;gt;phoenix.prism&amp;lt;/code&amp;gt;. At the command line, run:&lt;br /&gt;
&lt;br /&gt;
 ssh phoenix.prism&lt;br /&gt;
&lt;br /&gt;
If your username on the cluster (say, &amp;lt;code&amp;gt;flastname&amp;lt;/code&amp;gt;) is different from your username on your computer (which might be &amp;lt;code&amp;gt;firstname&amp;lt;/code&amp;gt;), you might instead have to run:&lt;br /&gt;
&lt;br /&gt;
 ssh flastname@phoenix.prism&lt;br /&gt;
&lt;br /&gt;
The first time you connect, you will see a message like:&lt;br /&gt;
&lt;br /&gt;
 The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established.&lt;br /&gt;
 ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI.&lt;br /&gt;
 This key is not known by any other names&lt;br /&gt;
 Are you sure you want to continue connecting (yes/no/[fingerprint])?&lt;br /&gt;
&lt;br /&gt;
This is your computer asking you to help it decide if it is talking to the genuine &amp;lt;code&amp;gt;phoenix.prism&amp;lt;/code&amp;gt;, and not an imposter. You will want to make sure that the &amp;quot;key fingerprint&amp;quot; is indeed &amp;lt;code&amp;gt;SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI&amp;lt;/code&amp;gt;. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type &amp;lt;code&amp;gt;yes&amp;lt;/code&amp;gt; to accept and remember that the server is who it says it is.&lt;br /&gt;
&lt;br /&gt;
==Installing Toil with WDL support==&lt;br /&gt;
&lt;br /&gt;
Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
If you also want to use AWS S3 &amp;lt;code&amp;gt;s3://&amp;lt;/code&amp;gt; and/or Google &amp;lt;code&amp;gt;gs://&amp;lt;/code&amp;gt; URLs for data, you will need to also install Toil with the &amp;lt;code&amp;gt;aws&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;google&amp;lt;/code&amp;gt; extras, respectively:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
This will install Toil in the &amp;lt;code&amp;gt;.local&amp;lt;/code&amp;gt; directory inside your home directory, which we write as &amp;lt;code&amp;gt;~/.local&amp;lt;/code&amp;gt;. The program to run WDL workflows, &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, will be at &amp;lt;code&amp;gt;~/.local/bin/toil-wdl-runner&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
By default, the command interpreter ''will not'' look there, so if you type &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run:&lt;br /&gt;
&lt;br /&gt;
 echo 'export PATH=&amp;quot;${HOME}/.local/bin:${PATH}&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in''', to restart bash and pick up the change.&lt;br /&gt;
&lt;br /&gt;
To make sure it worked, you can run:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --help&lt;br /&gt;
&lt;br /&gt;
If everything worked correctly, it will print a long list of the various option flags that the &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; command supports.&lt;br /&gt;
&lt;br /&gt;
If you ever want to upgrade Toil to a new release, you can repeat the &amp;lt;code&amp;gt;pip&amp;lt;/code&amp;gt; command above.&lt;br /&gt;
&lt;br /&gt;
==Configuring your Phoenix Environment==&lt;br /&gt;
&lt;br /&gt;
'''Do not try to store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should live in a directory under &amp;lt;code&amp;gt;/public/groups&amp;lt;/code&amp;gt;. You will probably need to email the admins to be added to a group so you can create a directory to work in somewhere under &amp;lt;code&amp;gt;/public/groups&amp;lt;/code&amp;gt;. Usually you would end up with &amp;lt;code&amp;gt;/public/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Remember this path; we will need it later.&lt;br /&gt;
&lt;br /&gt;
==Configuring Toil for Phoenix==&lt;br /&gt;
&lt;br /&gt;
Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. On the Phoenix cluster, however, we do have a shared filesystem, so we should configure Toil to use it for caching the Docker container images used for running workflow steps. Since these images can be large, and the home directory quota is only 30 GB, the caches can't live in your home directory. We will need to use the &amp;lt;code&amp;gt;/public/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt; directory you created earlier.&lt;br /&gt;
&lt;br /&gt;
Make that directory available as a variable in your &amp;lt;code&amp;gt;~/.bashrc&amp;lt;/code&amp;gt; file by editing this command to use your path and then running it:&lt;br /&gt;
&lt;br /&gt;
 echo 'BIG_DATA_DIR=/public/groups/YOURGROUPNAME/YOURUSERNAME' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
Then use these commands to make sure that Toil knows where it ought to put its caches:&lt;br /&gt;
&lt;br /&gt;
 echo 'export SINGULARITY_CACHEDIR=&amp;quot;${BIG_DATA_DIR}/.singularity/cache&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE=&amp;quot;${BIG_DATA_DIR}/.cache/miniwdl&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in again''', to apply the changes.&lt;br /&gt;
&lt;br /&gt;
If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day].&lt;br /&gt;
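&lt;br /&gt;
Note that order matters here: &amp;lt;code&amp;gt;BIG_DATA_DIR&amp;lt;/code&amp;gt; must be assigned before the &amp;lt;code&amp;gt;export&amp;lt;/code&amp;gt; lines that expand it. If you want to sanity-check the pattern without touching your real &amp;lt;code&amp;gt;~/.bashrc&amp;lt;/code&amp;gt;, you can sketch it against a throwaway file:&lt;br /&gt;
&lt;br /&gt;
```shell
# Demonstration only: write the same lines to a temporary file and
# source it, so your real ~/.bashrc is left alone.
rc=$(mktemp)
echo 'BIG_DATA_DIR=/tmp/bigdata' >> "$rc"
echo 'export SINGULARITY_CACHEDIR="${BIG_DATA_DIR}/.singularity/cache"' >> "$rc"
echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE="${BIG_DATA_DIR}/.cache/miniwdl"' >> "$rc"
. "$rc"
# The exports expand relative to BIG_DATA_DIR as intended.
echo "$SINGULARITY_CACHEDIR"
rm "$rc"
```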
&lt;br /&gt;
'''If you get errors about mutexes, lock files, or other weird problems with Singularity''', try moving these directories to inside &amp;lt;code&amp;gt;/data/tmp&amp;lt;/code&amp;gt; on the individual nodes, or unsetting them and letting Toil use its defaults (and exhaust our Docker pull limits). [https://github.com/DataBiosphere/toil/issues/4654 It is not clear that &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt; actually implements the necessary file locking correctly.]&lt;br /&gt;
&lt;br /&gt;
=Running an existing workflow=&lt;br /&gt;
&lt;br /&gt;
First, let's use &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].&lt;br /&gt;
&lt;br /&gt;
Go to your user directory under &amp;lt;code&amp;gt;/public/groups&amp;lt;/code&amp;gt;, and make a directory to work in.&lt;br /&gt;
&lt;br /&gt;
 cd /public/groups/YOURGROUPNAME/YOURUSERNAME&lt;br /&gt;
 mkdir workflow-test&lt;br /&gt;
 cd workflow-test&lt;br /&gt;
&lt;br /&gt;
Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.&lt;br /&gt;
&lt;br /&gt;
 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl&lt;br /&gt;
&lt;br /&gt;
==Preparing an input file==&lt;br /&gt;
&lt;br /&gt;
Near the top of the WDL file, there's a section like this:&lt;br /&gt;
&lt;br /&gt;
 workflow hello_caller {&lt;br /&gt;
     input {&lt;br /&gt;
         File who&lt;br /&gt;
     }&lt;br /&gt;
&lt;br /&gt;
This means that there is a workflow named &amp;lt;code&amp;gt;hello_caller&amp;lt;/code&amp;gt; in this file, and it takes as input a file variable named &amp;lt;code&amp;gt;who&amp;lt;/code&amp;gt;. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.&lt;br /&gt;
&lt;br /&gt;
So first, we have to make that list of names. Let's make it in &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 echo &amp;quot;Mridula Resurrección&amp;quot; &amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Gershom Šarlota&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Ritchie Ravi&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
&lt;br /&gt;
Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt; that references it by relative path, like this:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./names.txt&amp;quot;}' &amp;gt;inputs.json&lt;br /&gt;
&lt;br /&gt;
Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].&lt;br /&gt;
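&lt;br /&gt;
A typo in the inputs file will only surface once the workflow starts, so it can be worth checking that the JSON at least parses. A small sketch (it recreates the file so it stands alone):&lt;br /&gt;
&lt;br /&gt;
```shell
# Recreate the inputs file (same content as above) and confirm it parses.
echo '{"hello_caller.who": "./names.txt"}' > inputs.json
python3 -m json.tool inputs.json
```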
&lt;br /&gt;
==Testing at small scale on a single machine==&lt;br /&gt;
&lt;br /&gt;
We are now ready to run the workflow!&lt;br /&gt;
&lt;br /&gt;
You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:&lt;br /&gt;
&lt;br /&gt;
 srun -c 2 --mem=4G --time=02:00:00 --partition=medium --pty bash -i&lt;br /&gt;
&lt;br /&gt;
This will start a new shell that can run for 2 hours; to leave it and go back to the head node you can use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In your new shell, run this Toil command:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner self_test.wdl inputs.json -o local_run&lt;br /&gt;
&lt;br /&gt;
This will, by default, use the &amp;lt;code&amp;gt;single_machine&amp;lt;/code&amp;gt; Toil &amp;quot;batch system&amp;quot; to run all of the workflow's tasks locally. Output will be sent to a new directory named &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
This will print a lot of logging to standard error, and to standard output it will print:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;hello_caller.message_files&amp;quot;: [&amp;quot;local_run/Mridula Resurrecci\u00f3n.txt&amp;quot;, &amp;quot;local_run/Gershom \u0160arlota.txt&amp;quot;, &amp;quot;local_run/Ritchie Ravi.txt&amp;quot;], &amp;quot;hello_caller.messages&amp;quot;: [&amp;quot;Hello, Mridula Resurrecci\u00f3n!&amp;quot;, &amp;quot;Hello, Gershom \u0160arlota!&amp;quot;, &amp;quot;Hello, Ritchie Ravi!&amp;quot;]}&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt; directory will contain the described text files (with Unicode escape sequences like &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt; replaced by their corresponding characters), each containing a greeting for the corresponding person.&lt;br /&gt;
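&lt;br /&gt;
Those &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt;-style sequences are standard JSON escapes for non-ASCII characters, not mangled output. You can convince yourself that they decode to the expected characters:&lt;br /&gt;
&lt;br /&gt;
```shell
# JSON \uXXXX escapes decode to ordinary Unicode characters.
python3 -c 'import json; print(json.loads("\"Resurrecci\\u00f3n\""))'
```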
&lt;br /&gt;
To leave your interactive Slurm session and return to the head node, use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Running at larger scale==&lt;br /&gt;
&lt;br /&gt;
Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people!&lt;br /&gt;
&lt;br /&gt;
Go get this handy list of people and cut it to length:&lt;br /&gt;
&lt;br /&gt;
 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt&lt;br /&gt;
 head -n100 1000_names.txt &amp;gt;100_names.txt&lt;br /&gt;
&lt;br /&gt;
And make a new inputs file:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./100_names.txt&amp;quot;}' &amp;gt;inputs_big.json&lt;br /&gt;
&lt;br /&gt;
Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.&lt;br /&gt;
&lt;br /&gt;
To run against the Slurm cluster, we need to use the &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the &amp;lt;code&amp;gt;--batchLogsDir&amp;lt;/code&amp;gt; option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the &amp;lt;code&amp;gt;-m&amp;lt;/code&amp;gt; option to save the output JSON to a file instead of printing it.&lt;br /&gt;
&lt;br /&gt;
Additionally, since [https://github.com/DataBiosphere/toil/issues/4686 Toil can't manage Slurm partitions itself], we will use the &amp;lt;code&amp;gt;TOIL_SLURM_ARGS&amp;lt;/code&amp;gt; environment variable to tell Toil how long jobs should be allowed to run (2 hours) and what [[Slurm Queues (Partitions) and Resource Management | partition]] they should go in.&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot; toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json&lt;br /&gt;
&lt;br /&gt;
This will tick for a while, but eventually you should end up with 100 greeting files in the &amp;lt;code&amp;gt;slurm_run&amp;lt;/code&amp;gt; directory.&lt;br /&gt;
&lt;br /&gt;
=Writing your own workflow=&lt;br /&gt;
&lt;br /&gt;
In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. &lt;br /&gt;
&lt;br /&gt;
==Writing the file==&lt;br /&gt;
&lt;br /&gt;
===Version===&lt;br /&gt;
&lt;br /&gt;
All WDL files need to start with a &amp;lt;code&amp;gt;version&amp;lt;/code&amp;gt; statement (unless they are very old &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; files). Toil supports &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt;, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; and 1.0.&lt;br /&gt;
&lt;br /&gt;
So let's start a new WDL 1.0 workflow. Open up a file named &amp;lt;code&amp;gt;fizzbuzz.wdl&amp;lt;/code&amp;gt; and start with a version statement:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
&lt;br /&gt;
===Workflow Block===&lt;br /&gt;
&lt;br /&gt;
Then, add an empty &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; named &amp;lt;code&amp;gt;FizzBuzz&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Input Block===&lt;br /&gt;
&lt;br /&gt;
Workflows usually need some kind of user input, so let's give our workflow an &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; section.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Notice that each input has a type, a name, and an optional default value. If the type ends in &amp;lt;code&amp;gt;?&amp;lt;/code&amp;gt;, the value is optional, and it may be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.&lt;br /&gt;
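&lt;br /&gt;
For example, a complete inputs file for this workflow could override the defaults and supply the optional input; only &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt; is actually required. (The file name and values here are just for illustration.)&lt;br /&gt;
&lt;br /&gt;
```shell
# Hypothetical inputs file exercising every input, including the optional one.
cat > fizzbuzz_custom.json <<'EOF'
{
  "FizzBuzz.item_count": 15,
  "FizzBuzz.to_fizz": 3,
  "FizzBuzz.to_buzz": 5,
  "FizzBuzz.fizzbuzz_override": "FizzBuzz!"
}
EOF
python3 -m json.tool fizzbuzz_custom.json
```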
&lt;br /&gt;
===Body===&lt;br /&gt;
&lt;br /&gt;
Now we'll start on the body of the workflow, to be inserted just after the inputs section. &lt;br /&gt;
&lt;br /&gt;
The first thing we're going to need to do is create an array of all the numbers up to the &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;. We can do this by calling the WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; function, and assigning the result to an &amp;lt;code&amp;gt;Array[Int]&amp;lt;/code&amp;gt; variable.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
&lt;br /&gt;
WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.&lt;br /&gt;
&lt;br /&gt;
===Scattering===&lt;br /&gt;
&lt;br /&gt;
Once we create an array of all the numbers, we can use a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt; to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a &amp;lt;code&amp;gt;map()&amp;lt;/code&amp;gt; in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; starts at 0.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
     Int one_based = i + 1&lt;br /&gt;
 }&lt;br /&gt;
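&lt;br /&gt;
If the scatter-as-map analogy helps, the increment above corresponds to this Python one-liner (run from the shell):&lt;br /&gt;
&lt;br /&gt;
```shell
# range(5) is 0-based; adding 1 gives the 1-based numbers FizzBuzz wants.
python3 -c 'print([i + 1 for i in range(5)])'
```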
&lt;br /&gt;
===Conditionals===&lt;br /&gt;
&lt;br /&gt;
Inside the body of the scatter, we are going to put some conditionals to determine if we should produce &amp;lt;code&amp;gt;&amp;quot;Fizz&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;Buzz&amp;quot;&amp;lt;/code&amp;gt;, or &amp;lt;code&amp;gt;&amp;quot;FizzBuzz&amp;quot;&amp;lt;/code&amp;gt;. To support our &amp;lt;code&amp;gt;fizzbuzz_override&amp;lt;/code&amp;gt;, we put it in an array alongside a default value, and use the WDL &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; function to find the first non-null value in that array.&lt;br /&gt;
&lt;br /&gt;
Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; at the end and take advantage of variables from un-executed conditionals being &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Note that WDL supports conditional ''expressions'' with a &amp;lt;code&amp;gt;then&amp;lt;/code&amp;gt; and an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt;, but conditional ''statements'' only have a body, not an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt; branch. If you need an else you will have to check the negated condition.&lt;br /&gt;
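&lt;br /&gt;
The behavior of &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; is just first-non-null selection, which you can mimic in Python from the shell:&lt;br /&gt;
&lt;br /&gt;
```shell
# select_first() returns the first non-null entry of its argument array.
python3 -c 'vals = [None, "Fizz", None]; print(next(v for v in vals if v is not None))'
```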
&lt;br /&gt;
So first, let's handle the special cases.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
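&lt;br /&gt;
Before wiring this into tasks, it can help to check the arithmetic in plain Bash; this sketch mirrors the same modulo logic with the default divisors of 3 and 5:&lt;br /&gt;
&lt;br /&gt;
```shell
# Same special-case logic as the WDL scatter, for numbers 1..15.
for i in $(seq 1 15); do
    out=""
    [ $((i % 3)) -eq 0 ] && out="Fizz"
    [ $((i % 5)) -eq 0 ] && out="${out}Buzz"
    # Fall back to the number itself when neither divisor matched.
    echo "${out:-$i}"
done
```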
&lt;br /&gt;
===Calling Tasks===&lt;br /&gt;
&lt;br /&gt;
Now, for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a &amp;lt;code&amp;gt;${}&amp;lt;/code&amp;gt; substitution syntax in quoted strings anywhere, not just in task commands. Toil technically supports this too, but it's not in the spec, and this tutorial needs an excuse for you to call a task. So we're going to insert a call to a &amp;lt;code&amp;gt;stringify_number&amp;lt;/code&amp;gt; task, to be written later.&lt;br /&gt;
&lt;br /&gt;
To call a task (or another workflow), we use a &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt; statement and give it some inputs. Then we can fish the output values out of the task with &amp;lt;code&amp;gt;.&amp;lt;/code&amp;gt; access.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
        call stringify_number {&lt;br /&gt;
            input:&lt;br /&gt;
                the_number = one_based&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We can put the code into the workflow now, and set about writing the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         &lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Writing Tasks===&lt;br /&gt;
&lt;br /&gt;
Our task should go after the workflow in the file. It looks a lot like a workflow except it uses &amp;lt;code&amp;gt;task&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're going to want it to take in an integer &amp;lt;code&amp;gt;the_number&amp;lt;/code&amp;gt;, and we're going to want it to output a string &amp;lt;code&amp;gt;the_string&amp;lt;/code&amp;gt;. So let's fill that in in &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; sections.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     # ???&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now, unlike workflows, tasks can have a &amp;lt;code&amp;gt;command&amp;lt;/code&amp;gt; section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use &amp;lt;code&amp;gt;~{}&amp;lt;/code&amp;gt;, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now we need to capture the result of the command script. The WDL &amp;lt;code&amp;gt;stdout()&amp;lt;/code&amp;gt; returns a WDL &amp;lt;code&amp;gt;File&amp;lt;/code&amp;gt; containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).&lt;br /&gt;
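&lt;br /&gt;
That trailing-newline trimming is the same convention Bash command substitution uses, so &amp;lt;code&amp;gt;read_string(stdout())&amp;lt;/code&amp;gt; round-trips an &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt;ed value cleanly:&lt;br /&gt;
&lt;br /&gt;
```shell
# Bash command substitution also strips trailing newlines, like read_string().
result="$(echo 42)"
echo "[$result]"
```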
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're also going to want to add a &amp;lt;code&amp;gt;runtime&amp;lt;/code&amp;gt; section to our task, to specify resource requirements. We'll tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;disks&amp;lt;/code&amp;gt; section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a &amp;lt;code&amp;gt;local-disk&amp;lt;/code&amp;gt; of a certain number of gigabytes, optionally hinting that it should be &amp;lt;code&amp;gt;SSD&amp;lt;/code&amp;gt; storage.&lt;br /&gt;
&lt;br /&gt;
Then we can put our task into our WDL file:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Output Block===&lt;br /&gt;
&lt;br /&gt;
Now the only thing missing is a workflow-level &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section. Technically, in WDL 1.0 you aren't supposed to need one, but WDL 1.1 requires it, and Toil doesn't actually send your outputs anywhere if you don't have one, so we're going to write one. We need to collect all the strings that came out of the different tasks in our scatter into an &amp;lt;code&amp;gt;Array[String]&amp;lt;/code&amp;gt;. We'll add the &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section at the end of the &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; block, above the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
     output {&lt;br /&gt;
          Array[String] fizzbuzz_results = result&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Because the &amp;lt;code&amp;gt;result&amp;lt;/code&amp;gt; variable is defined inside a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt;, when we reference it outside the scatter we see it as being an array.&lt;br /&gt;
&lt;br /&gt;
==Running the Workflow==&lt;br /&gt;
&lt;br /&gt;
Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;FizzBuzz.item_count&amp;quot;: 20}' &amp;gt;fizzbuzz.json&lt;br /&gt;
&lt;br /&gt;
Then run it on the cluster with Toil:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 TOIL_SLURM_ARGS=&amp;quot;--time=02:00:00 --partition=medium&amp;quot; toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
Or locally:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
=Debugging Workflows=&lt;br /&gt;
Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.&lt;br /&gt;
&lt;br /&gt;
==Debugging Options==&lt;br /&gt;
When debugging a workflow, make sure to run the workflow with &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt;, to set the log level to &amp;lt;code&amp;gt;DEBUG&amp;lt;/code&amp;gt;, and with &amp;lt;code&amp;gt;--jobStore /some/path/to/a/shared/directory/it/can/create&amp;lt;/code&amp;gt; so that the stored files shipped between jobs are in a place you can access them.&lt;br /&gt;
&lt;br /&gt;
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:&lt;br /&gt;
&lt;br /&gt;
 =========&amp;gt;&lt;br /&gt;
         Toil job log is here&lt;br /&gt;
 &amp;lt;=========&lt;br /&gt;
&lt;br /&gt;
Normally, only the logs of failing jobs are reproduced like this.&lt;br /&gt;
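If you saved the main Toil log to a file, you can pull out just those per-job sections with &amp;lt;code&amp;gt;sed&amp;lt;/code&amp;gt;. This is a sketch that assumes the markers appear at the start of their own lines, exactly as shown above:&lt;br /&gt;
&lt;br /&gt;
```shell
# Write a miniature stand-in log to demonstrate; a real run would produce
# the log file itself (here called toil.log).
cat > toil.log <<'EOF'
[2023-07-16T16:23:54-0700] some unrelated line
=========>
        Toil job log is here
<=========
another unrelated line
EOF
# Print only the lines between the job-log markers (inclusive).
sed -n '/=========>/,/<=========/p' toil.log
```
&lt;br /&gt;
This prints the marker lines and everything between them, and skips the rest of the log.&lt;br /&gt;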
&lt;br /&gt;
==Reading the Log==&lt;br /&gt;
&lt;br /&gt;
When a WDL workflow fails, you are likely to see a message like this:&lt;br /&gt;
&lt;br /&gt;
 WDL.runtime.error.CommandFailed: task command failed with exit status 1&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism&lt;br /&gt;
&lt;br /&gt;
This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code. That can happen either because the command is written wrong, or because the error detection code in the tool you are trying to run detected and reported an error.&lt;br /&gt;
&lt;br /&gt;
Go up higher in the log until you find lines that look like:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt:&lt;br /&gt;
&lt;br /&gt;
And&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt:&lt;br /&gt;
&lt;br /&gt;
These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.&lt;br /&gt;
&lt;br /&gt;
==Reproducing Problems==&lt;br /&gt;
&lt;br /&gt;
When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need your tool's input files. In the log of your failing Toil task, look for lines like this:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&amp;lt;/code&amp;gt; part is a Toil file ID, and it is a relative path from your &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; value to where the file is stored on disk. So if you ran the workflow with &amp;lt;code&amp;gt;--jobStore /private/groups/patenlab/anovak/jobstore&amp;lt;/code&amp;gt;, you would look for this file at:&lt;br /&gt;
&lt;br /&gt;
 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&lt;br /&gt;
&lt;br /&gt;
==More Ways of Finding Files==&lt;br /&gt;
&lt;br /&gt;
Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the &amp;lt;code&amp;gt;find&amp;lt;/code&amp;gt; command to try and find the files by name. For example, if you want to look at &amp;lt;code&amp;gt;Sample.bam&amp;lt;/code&amp;gt;, you can look for it like this:&lt;br /&gt;
&lt;br /&gt;
 find /path/to/the/jobstore -name &amp;quot;Sample.bam&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
You can take the &amp;lt;code&amp;gt;toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt; URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this:&lt;br /&gt;
&lt;br /&gt;
 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
Then you can take the part after the last colon, &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt;, and that is the path relative to the job store where this file can be found.&lt;br /&gt;
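If you'd rather not paste paths into a website, the same decoding can be done at the command line with Python's standard library, and the part after the last colon taken with shell parameter expansion:&lt;br /&gt;
&lt;br /&gt;
```shell
# The toilfile: URI from the log above, still URL-encoded.
uri='toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam'
# URL-decode it with Python's urllib.parse.unquote.
decoded=$(python3 -c 'import sys, urllib.parse; print(urllib.parse.unquote(sys.argv[1]))' "$uri")
# Keep everything after the last colon: the path relative to the job store.
echo "${decoded##*:}"
```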
&lt;br /&gt;
==Using Development Versions of Toil==&lt;br /&gt;
&lt;br /&gt;
Sometimes, bugs will be fixed in the development version of Toil, but not released yet. To try the current development version of Toil, you can install it like this:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
If you want to use a particular branch or commit, like &amp;lt;code&amp;gt;aaa451b320fc115b3563ced25cb501301cf86f90&amp;lt;/code&amp;gt;, you can do:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
==Frequently Asked Questions==&lt;br /&gt;
&lt;br /&gt;
===I am getting warnings about &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt;===&lt;br /&gt;
You may be seeing warnings that &amp;lt;code&amp;gt;XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This happens because Slurm is providing Toil with an &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt; environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it.&lt;br /&gt;
&lt;br /&gt;
===Toil said it was &amp;lt;code&amp;gt;Redirecting logging&amp;lt;/code&amp;gt; somewhere, but I can't find that file!===&lt;br /&gt;
The Toil worker process for each job will say that it is &amp;lt;code&amp;gt;Redirecting logging to /data/tmp/somewhere/worker_log.txt&amp;lt;/code&amp;gt;, and when running in single machine mode these messages go to the main Toil log.&lt;br /&gt;
&lt;br /&gt;
The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt; option to Toil.&lt;br /&gt;
&lt;br /&gt;
If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.&lt;br /&gt;
&lt;br /&gt;
=Additional WDL resources=&lt;br /&gt;
&lt;br /&gt;
For more information on writing and running WDL workflows, see:&lt;br /&gt;
&lt;br /&gt;
* [https://docs.openwdl.org/en/stable/ The WDL documentation]&lt;br /&gt;
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The &amp;quot;Learn WDL&amp;quot; video course on YouTube]&lt;br /&gt;
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=447</id>
		<title>Phoenix WDL Tutorial</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=447"/>
		<updated>2023-12-01T16:02:52Z</updated>

		<summary type="html">&lt;p&gt;Anovak: /* Configuring Toil for Phoenix */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Tutorial: Getting Started with WDL Workflows on Phoenix'''&lt;br /&gt;
&lt;br /&gt;
Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like Workflow Description language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you.&lt;br /&gt;
&lt;br /&gt;
Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.&lt;br /&gt;
&lt;br /&gt;
This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.&lt;br /&gt;
&lt;br /&gt;
=Phoenix Cluster Setup=&lt;br /&gt;
&lt;br /&gt;
Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.&lt;br /&gt;
&lt;br /&gt;
==Getting VPN access==&lt;br /&gt;
&lt;br /&gt;
We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.&lt;br /&gt;
&lt;br /&gt;
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.&lt;br /&gt;
&lt;br /&gt;
==Connecting to Phoenix==&lt;br /&gt;
&lt;br /&gt;
Once you have VPN access, you can connect to the &amp;quot;head node&amp;quot; of the Phoenix cluster. This node is where everyone logs in, but you should '''not''' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs.&lt;br /&gt;
&lt;br /&gt;
To connect to the head node:&lt;br /&gt;
&lt;br /&gt;
1. Connect to the VPN.&lt;br /&gt;
2. SSH to &amp;lt;code&amp;gt;phoenix.prism&amp;lt;/code&amp;gt;. At the command line, run:&lt;br /&gt;
&lt;br /&gt;
 ssh phoenix.prism&lt;br /&gt;
&lt;br /&gt;
If your username on the cluster (say, &amp;lt;code&amp;gt;flastname&amp;lt;/code&amp;gt;) is different than your username on your computer (which might be &amp;lt;code&amp;gt;firstname&amp;lt;/code&amp;gt;), you might instead have to run:&lt;br /&gt;
&lt;br /&gt;
 ssh flastname@phoenix.prism&lt;br /&gt;
&lt;br /&gt;
The first time you connect, you will see a message like:&lt;br /&gt;
&lt;br /&gt;
 The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established.&lt;br /&gt;
 ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI.&lt;br /&gt;
 This key is not known by any other names&lt;br /&gt;
 Are you sure you want to continue connecting (yes/no/[fingerprint])?&lt;br /&gt;
&lt;br /&gt;
This is your computer asking you to help it decide if it is talking to the genuine &amp;lt;code&amp;gt;phoenix.prism&amp;lt;/code&amp;gt;, and not an imposter. You will want to make sure that the &amp;quot;key fingerprint&amp;quot; is indeed &amp;lt;code&amp;gt;SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI&amp;lt;/code&amp;gt;. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type &amp;lt;code&amp;gt;yes&amp;lt;/code&amp;gt; to accept and remember that the server is who it says it is.&lt;br /&gt;
&lt;br /&gt;
==Installing Toil with WDL support==&lt;br /&gt;
&lt;br /&gt;
Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
If you also want to use AWS S3 &amp;lt;code&amp;gt;s3://&amp;lt;/code&amp;gt; and/or Google &amp;lt;code&amp;gt;gs://&amp;lt;/code&amp;gt; URLs for data, you will need to also install Toil with the &amp;lt;code&amp;gt;aws&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;google&amp;lt;/code&amp;gt; extras, respectively:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
This will install Toil in the &amp;lt;code&amp;gt;.local&amp;lt;/code&amp;gt; directory inside your home directory, which we write as &amp;lt;code&amp;gt;~/.local&amp;lt;/code&amp;gt;. The program to run WDL workflows, &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, will be at &amp;lt;code&amp;gt;~/.local/bin/toil-wdl-runner&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
By default, the command interpreter '''will not''' look there, so if you type &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run:&lt;br /&gt;
&lt;br /&gt;
 echo 'export PATH=&amp;quot;${HOME}/.local/bin:${PATH}&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in''', to restart bash and pick up the change.&lt;br /&gt;
&lt;br /&gt;
To make sure it worked, you can run:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --help&lt;br /&gt;
&lt;br /&gt;
If everything worked correctly, it will print a long list of the various option flags that the &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; command supports.&lt;br /&gt;
&lt;br /&gt;
If you ever want to upgrade Toil to a new release, you can repeat the &amp;lt;code&amp;gt;pip&amp;lt;/code&amp;gt; command above.&lt;br /&gt;
&lt;br /&gt;
==Configuring your Phoenix Environment==&lt;br /&gt;
&lt;br /&gt;
'''Do not try and store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under &amp;lt;code&amp;gt;/public/groups&amp;lt;/code&amp;gt;. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under &amp;lt;code&amp;gt;/public/groups&amp;lt;/code&amp;gt;. Usually you would end up with &amp;lt;code&amp;gt;/public/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Remember this path; we will need it later.&lt;br /&gt;
&lt;br /&gt;
==Configuring Toil for Phoenix==&lt;br /&gt;
&lt;br /&gt;
Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. On the Phoenix cluster, though, we do have a shared filesystem, and we should configure Toil to use it for caching the Docker container images used for running workflow steps. Since these files can be large, and the home directory quota is only 30 GB, we can't keep them in your home directory. We will need to use the &amp;lt;code&amp;gt;/public/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt; directory you created earlier.&lt;br /&gt;
&lt;br /&gt;
Make that directory available in your &amp;lt;code&amp;gt;~/.bashrc&amp;lt;/code&amp;gt; file by editing this command to use your own path and then running it:&lt;br /&gt;
&lt;br /&gt;
 echo 'BIG_DATA_DIR=/public/groups/YOURGROUPNAME/YOURUSERNAME' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
Then use these commands to make sure that Toil knows where it ought to put its caches:&lt;br /&gt;
&lt;br /&gt;
 echo 'export SINGULARITY_CACHEDIR=&amp;quot;${BIG_DATA_DIR}/.singularity/cache&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE=&amp;quot;${BIG_DATA_DIR}/.cache/miniwdl&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in again''', to apply the changes.&lt;br /&gt;
&lt;br /&gt;
If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day].&lt;br /&gt;
&lt;br /&gt;
'''If you get errors about mutexes, lock files, or other weird problems with Singularity''', try moving these directories to inside &amp;lt;code&amp;gt;/data/tmp&amp;lt;/code&amp;gt; on the individual nodes, or unsetting them and letting Toil use its defaults (and exhaust our Docker pull limits). [https://github.com/DataBiosphere/toil/issues/4654 It is not clear that &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt; actually implements the necessary file locking correctly.]&lt;br /&gt;
&lt;br /&gt;
=Running an existing workflow=&lt;br /&gt;
&lt;br /&gt;
First, let's use &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].&lt;br /&gt;
&lt;br /&gt;
First, go to your user directory under &amp;lt;code&amp;gt;/public/groups&amp;lt;/code&amp;gt;, and make a directory to work in.&lt;br /&gt;
&lt;br /&gt;
 cd /public/groups/YOURGROUPNAME/YOURUSERNAME&lt;br /&gt;
 mkdir workflow-test&lt;br /&gt;
 cd workflow-test&lt;br /&gt;
&lt;br /&gt;
Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.&lt;br /&gt;
&lt;br /&gt;
 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl&lt;br /&gt;
&lt;br /&gt;
==Preparing an input file==&lt;br /&gt;
&lt;br /&gt;
Near the top of the WDL file, there's a section like this:&lt;br /&gt;
&lt;br /&gt;
 workflow hello_caller {&lt;br /&gt;
     input {&lt;br /&gt;
         File who&lt;br /&gt;
     }&lt;br /&gt;
&lt;br /&gt;
This means that there is a workflow named &amp;lt;code&amp;gt;hello_caller&amp;lt;/code&amp;gt; in this file, and it takes as input a file variable named &amp;lt;code&amp;gt;who&amp;lt;/code&amp;gt;. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.&lt;br /&gt;
&lt;br /&gt;
So first, we have to make that list of names. Let's make it in &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 echo &amp;quot;Mridula Resurrección&amp;quot; &amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Gershom Šarlota&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Ritchie Ravi&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
&lt;br /&gt;
Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt; that references it by relative path, like this:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./names.txt&amp;quot;}' &amp;gt;inputs.json&lt;br /&gt;
&lt;br /&gt;
Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].&lt;br /&gt;
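JSON is strict about quoting and commas, so it can save a failed start to check the inputs file before launching a run; Python's built-in &amp;lt;code&amp;gt;json.tool&amp;lt;/code&amp;gt; module pretty-prints valid JSON and exits with an error message otherwise:&lt;br /&gt;
&lt;br /&gt;
```shell
# Recreate the inputs file from above, then validate it.
echo '{"hello_caller.who": "./names.txt"}' > inputs.json
# Exits nonzero with a syntax error message if the JSON is malformed.
python3 -m json.tool inputs.json
```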
&lt;br /&gt;
==Testing at small scale on a single machine==&lt;br /&gt;
&lt;br /&gt;
We are now ready to run the workflow!&lt;br /&gt;
&lt;br /&gt;
You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:&lt;br /&gt;
&lt;br /&gt;
 srun -c 2 --mem 8G --pty bash -i&lt;br /&gt;
&lt;br /&gt;
This will start a new shell; to leave it and go back to the head node you can use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In your new shell, run this Toil command:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner self_test.wdl inputs.json -o local_run&lt;br /&gt;
&lt;br /&gt;
This will, by default, use the &amp;lt;code&amp;gt;single_machine&amp;lt;/code&amp;gt; Toil &amp;quot;batch system&amp;quot; to run all of the workflow's tasks locally. Output will be sent to a new directory named &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
This will print a lot of logging to standard error, and to standard output it will print:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;hello_caller.message_files&amp;quot;: [&amp;quot;local_run/Mridula Resurrecci\u00f3n.txt&amp;quot;, &amp;quot;local_run/Gershom \u0160arlota.txt&amp;quot;, &amp;quot;local_run/Ritchie Ravi.txt&amp;quot;], &amp;quot;hello_caller.messages&amp;quot;: [&amp;quot;Hello, Mridula Resurrecci\u00f3n!&amp;quot;, &amp;quot;Hello, Gershom \u0160arlota!&amp;quot;, &amp;quot;Hello, Ritchie Ravi!&amp;quot;]}&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt; directory will contain the described text files (with Unicode escape sequences like &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt; replaced by their corresponding characters), each containing a greeting for the corresponding person.&lt;br /&gt;
&lt;br /&gt;
To leave your interactive Slurm session and return to the head node, use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Running at larger scale==&lt;br /&gt;
&lt;br /&gt;
Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people!&lt;br /&gt;
&lt;br /&gt;
Go get this handy list of people and cut it to length:&lt;br /&gt;
&lt;br /&gt;
 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt&lt;br /&gt;
 head -n100 1000_names.txt &amp;gt;100_names.txt&lt;br /&gt;
&lt;br /&gt;
And make a new inputs file:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./100_names.txt&amp;quot;}' &amp;gt;inputs_big.json&lt;br /&gt;
&lt;br /&gt;
Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.&lt;br /&gt;
&lt;br /&gt;
To run against the Slurm cluster, we need to use the &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the &amp;lt;code&amp;gt;--batchLogsDir&amp;lt;/code&amp;gt; option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the &amp;lt;code&amp;gt;-m&amp;lt;/code&amp;gt; option to save the output JSON to a file instead of printing it.&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json&lt;br /&gt;
&lt;br /&gt;
This will tick for a while, but eventually you should end up with 100 greeting files in the &amp;lt;code&amp;gt;slurm_run&amp;lt;/code&amp;gt; directory.&lt;br /&gt;
&lt;br /&gt;
=Writing your own workflow=&lt;br /&gt;
&lt;br /&gt;
In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. &lt;br /&gt;
&lt;br /&gt;
==Writing the file==&lt;br /&gt;
&lt;br /&gt;
===Version===&lt;br /&gt;
&lt;br /&gt;
All WDL files need to start with a &amp;lt;code&amp;gt;version&amp;lt;/code&amp;gt; statement (unless they are very old &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; files). Toil supports &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt;, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; and 1.0.&lt;br /&gt;
&lt;br /&gt;
So let's start a new WDL 1.0 workflow. Open up a file named &amp;lt;code&amp;gt;fizzbuzz.wdl&amp;lt;/code&amp;gt; and start with a version statement:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
&lt;br /&gt;
===Workflow Block===&lt;br /&gt;
&lt;br /&gt;
Then, add an empty &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; named &amp;lt;code&amp;gt;FizzBuzz&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Input Block===&lt;br /&gt;
&lt;br /&gt;
Workflows usually need some kind of user input, so let's give our workflow an &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; section.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Notice that each input has a type, a name, and an optional default value. If the type ends in &amp;lt;code&amp;gt;?&amp;lt;/code&amp;gt;, the value is optional, and it may be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.&lt;br /&gt;
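For example, an inputs file for this workflow only ''has'' to set &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;, but it may also override the defaulted and optional inputs (the particular values here are just for illustration):&lt;br /&gt;
&lt;br /&gt;
```shell
# item_count is required; to_fizz has a default and fizzbuzz_override is
# optional, so both may be omitted or, as here, set explicitly.
echo '{"FizzBuzz.item_count": 20, "FizzBuzz.to_fizz": 4, "FizzBuzz.fizzbuzz_override": "FB!"}' > fizzbuzz_custom.json
# Pretty-print to confirm the file is valid JSON.
python3 -m json.tool fizzbuzz_custom.json
```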
&lt;br /&gt;
===Body===&lt;br /&gt;
&lt;br /&gt;
Now we'll start on the body of the workflow, to be inserted just after the inputs section. &lt;br /&gt;
&lt;br /&gt;
The first thing we're going to need to do is create an array of all the numbers up to the &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;. We can do this by calling the WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; function, and assigning the result to an &amp;lt;code&amp;gt;Array[Int]&amp;lt;/code&amp;gt; variable.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
&lt;br /&gt;
WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.&lt;br /&gt;
&lt;br /&gt;
===Scattering===&lt;br /&gt;
&lt;br /&gt;
Once we create an array of all the numbers, we can use a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt; to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a &amp;lt;code&amp;gt;map()&amp;lt;/code&amp;gt; in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; starts at 0.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
     Int one_based = i + 1&lt;br /&gt;
 }&lt;br /&gt;
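As a rough analogy (ordinary serial shell here, whereas a scatter may run its iterations in parallel), the scatter collects one result per input element into an array:&lt;br /&gt;
&lt;br /&gt;
```shell
# Shell sketch of the scatter: range(item_count) is 0..item_count-1,
# and the body (i + 1) produces one array element per input number.
item_count=5
one_based=()
for i in $(seq 0 $((item_count - 1))); do
    one_based+=("$((i + 1))")
done
echo "${one_based[@]}"  # 1 2 3 4 5
```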
&lt;br /&gt;
===Conditionals===&lt;br /&gt;
&lt;br /&gt;
Inside the body of the scatter, we are going to put some conditionals to determine if we should produce &amp;lt;code&amp;gt;&amp;quot;Fizz&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;Buzz&amp;quot;&amp;lt;/code&amp;gt;, or &amp;lt;code&amp;gt;&amp;quot;FizzBuzz&amp;quot;&amp;lt;/code&amp;gt;. To support our &amp;lt;code&amp;gt;fizzbuzz_override&amp;lt;/code&amp;gt;, we put it and a default value into an array, and use the WDL &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; function to find the first non-null value in that array.&lt;br /&gt;
&lt;br /&gt;
Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; at the end and take advantage of variables from un-executed conditionals being &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Note that WDL supports conditional ''expressions'' with a &amp;lt;code&amp;gt;then&amp;lt;/code&amp;gt; and an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt;, but conditional ''statements'' only have a body, not an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt; branch. If you need an else you will have to check the negated condition.&lt;br /&gt;
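The behavior of &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; can be sketched in Python (an analogy only; WDL is not Python): it scans an array and returns the first non-null entry.&lt;br /&gt;

```python
# Python analogy for WDL's select_first(): return the first non-null value.
def select_first(values):
    for value in values:
        if value is not None:
            return value
    raise ValueError("select_first: all values were null")

# With no override set, the default string wins; with one set, it wins instead.
print(select_first([None, "FizzBuzz"]))          # FizzBuzz
print(select_first(["Fizz Buzz!", "FizzBuzz"]))  # Fizz Buzz!
```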
&lt;br /&gt;
So first, let's handle the special cases.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Calling Tasks===&lt;br /&gt;
&lt;br /&gt;
Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a &amp;lt;code&amp;gt;${}&amp;lt;/code&amp;gt; substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a &amp;lt;code&amp;gt;stringify_number&amp;lt;/code&amp;gt; task, to be written later.&lt;br /&gt;
&lt;br /&gt;
To call a task (or another workflow), we use a &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt; statement and give it some inputs. Then we can fish the output values out of the task with &amp;lt;code&amp;gt;.&amp;lt;/code&amp;gt; access, but only for the numbers where we don't make a noise instead.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
        call stringify_number {&lt;br /&gt;
            input:&lt;br /&gt;
                the_number = one_based&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We can put the code into the workflow now, and set about writing the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         &lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Writing Tasks===&lt;br /&gt;
&lt;br /&gt;
Our task should go after the workflow in the file. It looks a lot like a workflow except it uses &amp;lt;code&amp;gt;task&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're going to want it to take in an integer &amp;lt;code&amp;gt;the_number&amp;lt;/code&amp;gt;, and we're going to want it to output a string &amp;lt;code&amp;gt;the_string&amp;lt;/code&amp;gt;. So let's fill that in in &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; sections.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     # ???&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now, unlike workflows, tasks can have a &amp;lt;code&amp;gt;command&amp;lt;/code&amp;gt; section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use &amp;lt;code&amp;gt;~{}&amp;lt;/code&amp;gt;, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.&lt;br /&gt;
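As a rough analogy (Python string templating, not WDL itself), &amp;lt;code&amp;gt;~{}&amp;lt;/code&amp;gt; substitution pastes the value of a WDL variable into the command text before it runs:&lt;br /&gt;

```python
# Python analogy only: WDL's ~{} substitution fills task variables into
# the command script, much like $-style string templating here.
from string import Template

the_number = 7  # hypothetical value for the task's the_number input
script = Template("echo $the_number").substitute(the_number=the_number)
print(script)  # echo 7
```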
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now we need to capture the result of the command script. The WDL &amp;lt;code&amp;gt;stdout()&amp;lt;/code&amp;gt; returns a WDL &amp;lt;code&amp;gt;File&amp;lt;/code&amp;gt; containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).&lt;br /&gt;
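In Python terms (an analogy, not the real WDL implementation), &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; behaves like reading the file's text and dropping trailing newlines:&lt;br /&gt;

```python
# Python analogy for WDL's read_string(): take the file contents and
# strip trailing newlines, so echo's "7\n" becomes the string "7".
def read_string(contents):
    return contents.rstrip("\r\n")

print(read_string("7\n"))  # 7
```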
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're also going to want to add a &amp;lt;code&amp;gt;runtime&amp;lt;/code&amp;gt; section to our task, to specify resource requirements. We'll also tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;disks&amp;lt;/code&amp;gt; section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a &amp;lt;code&amp;gt;local-disk&amp;lt;/code&amp;gt; of a certain number of gigabytes, optionally suggesting that it be &amp;lt;code&amp;gt;SSD&amp;lt;/code&amp;gt; storage.&lt;br /&gt;
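To make the format concrete, here is a hypothetical Python sketch (not Toil's actual parser) of how such a string breaks down:&lt;br /&gt;

```python
# Hypothetical sketch (not Toil's real code): a Cromwell-style disks
# string is a name, a size in gigabytes, and a disk type.
def parse_disks(spec):
    name, size_gb, disk_type = spec.split()
    return {"name": name, "size_gb": int(size_gb), "type": disk_type}

print(parse_disks("local-disk 1 SSD"))
```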
&lt;br /&gt;
Then we can put our task into our WDL file:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Output Block===&lt;br /&gt;
&lt;br /&gt;
Now the only thing missing is a workflow-level &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an &amp;lt;code&amp;gt;Array[String]&amp;lt;/code&amp;gt;. We'll add the &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section at the end of the &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; section, above the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
     output {&lt;br /&gt;
        Array[String] fizzbuzz_results = result&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Because the &amp;lt;code&amp;gt;result&amp;lt;/code&amp;gt; variable is defined inside a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt;, when we reference it outside the scatter we see it as being an array.&lt;br /&gt;
&lt;br /&gt;
==Running the Workflow==&lt;br /&gt;
&lt;br /&gt;
Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;FizzBuzz.item_count&amp;quot;: 20}' &amp;gt;fizzbuzz.json&lt;br /&gt;
&lt;br /&gt;
Then run it on the cluster with Toil:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
Or locally:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
=Debugging Workflows=&lt;br /&gt;
Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.&lt;br /&gt;
&lt;br /&gt;
==Debugging Options==&lt;br /&gt;
When debugging a workflow, make sure to run the workflow with &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt;, to set the log level to &amp;lt;code&amp;gt;DEBUG&amp;lt;/code&amp;gt;, and with &amp;lt;code&amp;gt;--jobStore /some/path/to/a/shared/directory/it/can/create&amp;lt;/code&amp;gt; so that the stored files shipped between jobs are in a place you can access them.&lt;br /&gt;
&lt;br /&gt;
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:&lt;br /&gt;
&lt;br /&gt;
 =========&amp;gt;&lt;br /&gt;
         Toil job log is here&lt;br /&gt;
 &amp;lt;=========&lt;br /&gt;
&lt;br /&gt;
Normally, only the logs of failing jobs are reproduced like this.&lt;br /&gt;
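If you want to pull those per-job sections out of a big debug log programmatically, a small Python sketch could look like this (the marker strings are the ones shown above; everything else here is an assumption):&lt;br /&gt;

```python
# Sketch: extract per-job log sections from a debug-level Toil log,
# using the =========> / <========= markers shown above.
def job_logs(log_text):
    logs, current = [], None
    for line in log_text.splitlines():
        if line.strip() == "=========>":
            current = []                      # a job log section starts
        elif line.strip() == "<=========":
            if current is not None:
                logs.append("\n".join(current))  # section ends; keep it
            current = None
        elif current is not None:
            current.append(line)
    return logs

sample = "=========>\nToil job log is here\n<=========\n"
print(job_logs(sample))  # ['Toil job log is here']
```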
&lt;br /&gt;
==Reading the Log==&lt;br /&gt;
&lt;br /&gt;
When a WDL workflow fails, you are likely to see a message like this:&lt;br /&gt;
&lt;br /&gt;
 WDL.runtime.error.CommandFailed: task command failed with exit status 1&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism&lt;br /&gt;
&lt;br /&gt;
This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code, which will happen when either the command line command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error.&lt;br /&gt;
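The exit status convention itself is easy to see from Python (an illustration of the Unix convention, unrelated to Toil internals): zero means success, and anything else means failure.&lt;br /&gt;

```python
# Illustration of the exit status convention Toil is reporting on:
# 0 means success, any nonzero value (like 1 here) means failure.
import subprocess
import sys

ok = subprocess.run([sys.executable, "-c", "raise SystemExit(0)"])
failed = subprocess.run([sys.executable, "-c", "raise SystemExit(1)"])
print(ok.returncode, failed.returncode)  # 0 1
```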
&lt;br /&gt;
Go up higher in the log until you find lines that look like:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt:&lt;br /&gt;
&lt;br /&gt;
And&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt:&lt;br /&gt;
&lt;br /&gt;
These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.&lt;br /&gt;
&lt;br /&gt;
==Reproducing Problems==&lt;br /&gt;
&lt;br /&gt;
When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. In the log of your failing Toil task, look for lines like this:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&amp;lt;/code&amp;gt; part is a Toil file ID, and it is a relative path from your &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; value to where the file is stored on disk. So if you ran the workflow with &amp;lt;code&amp;gt;--jobStore /private/groups/patenlab/anovak/jobstore&amp;lt;/code&amp;gt;, you would look for this file at:&lt;br /&gt;
&lt;br /&gt;
 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&lt;br /&gt;
&lt;br /&gt;
==More Ways of Finding Files==&lt;br /&gt;
&lt;br /&gt;
Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the &amp;lt;code&amp;gt;find&amp;lt;/code&amp;gt; command to try and find the files by name. For example, if you want to look at &amp;lt;code&amp;gt;Sample.bam&amp;lt;/code&amp;gt;, you can look for it like this:&lt;br /&gt;
&lt;br /&gt;
 find /path/to/the/jobstore -name &amp;quot;Sample.bam&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
You can take the &amp;lt;code&amp;gt;toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt; URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this:&lt;br /&gt;
&lt;br /&gt;
 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
Then you can take the part after the last colon, &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt;, and that is the path relative to the job store where this file can be found.&lt;br /&gt;
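Those two steps (URL-decode the URI, then take everything after the last colon) can also be done with a short Python snippet instead of a web decoder; the URI below is the one from the log line above:&lt;br /&gt;

```python
# Decode a toilfile: URI and recover the job-store-relative path:
# URL-decode it, then take the part after the last colon.
from urllib.parse import unquote

uri = ("toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob"
       "%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351"
       "%2FSample.chr14.bam/Sample.chr14.bam")
decoded = unquote(uri)
relative_path = decoded.rsplit(":", 1)[-1]
print(relative_path)
```

Joining that printed path onto your &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; directory gives the on-disk location of the file.&lt;br /&gt;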
&lt;br /&gt;
==Using Development Versions of Toil==&lt;br /&gt;
&lt;br /&gt;
Sometimes, bugs will be fixed in the development version of Toil, but not released yet. To try the current development version of Toil, you can install it like this:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
If you want to use a particular branch or commit, like &amp;lt;code&amp;gt;aaa451b320fc115b3563ced25cb501301cf86f90&amp;lt;/code&amp;gt;, you can do:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
==Frequently Asked Questions==&lt;br /&gt;
&lt;br /&gt;
===I am getting warnings about &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt;===&lt;br /&gt;
You may be seeing warnings that &amp;lt;code&amp;gt;XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This happens because Slurm is providing Toil with an &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt; environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it.&lt;br /&gt;
&lt;br /&gt;
===Toil said it was &amp;lt;code&amp;gt;Redirecting logging&amp;lt;/code&amp;gt; somewhere, but I can't find that file!===&lt;br /&gt;
The Toil worker process for each job will say that it is &amp;lt;code&amp;gt;Redirecting logging to /data/tmp/somewhere/worker_log.txt&amp;lt;/code&amp;gt;, and when running in single machine mode these messages go to the main Toil log.&lt;br /&gt;
&lt;br /&gt;
The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt; option to Toil.&lt;br /&gt;
&lt;br /&gt;
If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.&lt;br /&gt;
&lt;br /&gt;
=Additional WDL resources=&lt;br /&gt;
&lt;br /&gt;
For more information on writing and running WDL workflows, see:&lt;br /&gt;
&lt;br /&gt;
* [https://docs.openwdl.org/en/stable/ The WDL documentation]&lt;br /&gt;
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The &amp;quot;Learn WDL&amp;quot; video course on YouTube]&lt;br /&gt;
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Slurm_Tips_for_Toil&amp;diff=441</id>
		<title>Slurm Tips for Toil</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Slurm_Tips_for_Toil&amp;diff=441"/>
		<updated>2023-11-20T21:54:12Z</updated>

		<summary type="html">&lt;p&gt;Anovak: Show how to install with extras and a branch&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;Here are some tips for running Toil workflows on the Phoenix Slurm cluster. Mostly you might want to run WDL workflows, but you can use some of these for other workflows like Cactus. You can also consult [https://github.com/DataBiosphere/toil/blob/master/docs/running/wdl.rst#running-wdl-with-toil the Toil documentation on WDL workflows].&lt;br /&gt;
&lt;br /&gt;
* Install Toil with WDL support with:&lt;br /&gt;
&lt;br /&gt;
 pip3 install --upgrade toil[wdl]&lt;br /&gt;
&lt;br /&gt;
To use a development version of Toil, you can install from source instead:&lt;br /&gt;
&lt;br /&gt;
 pip3 install git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl]&lt;br /&gt;
&lt;br /&gt;
Or for a particular branch:&lt;br /&gt;
&lt;br /&gt;
 pip3 install git+https://github.com/DataBiosphere/toil.git@issues/123-abc#egg=toil[wdl]&lt;br /&gt;
&lt;br /&gt;
* You will then need to make sure your '''~/.local/bin''' directory is on your PATH. Open up your '''~/.bashrc''' file and add:&lt;br /&gt;
&lt;br /&gt;
 export PATH=$PATH:$HOME/.local/bin&lt;br /&gt;
&lt;br /&gt;
Then make sure to log out and back in again.&lt;br /&gt;
&lt;br /&gt;
* For Toil options, you will want '''--batchSystem slurm''' to make it use Slurm and '''--batchLogsDir ./logs''' (or some other location on a shared filesystem) for the Slurm logs to not get lost.&lt;br /&gt;
&lt;br /&gt;
* You may be able to speed up your workflow with '''--caching true''', to cache data on nodes to be shared among multiple simultaneous tasks.&lt;br /&gt;
&lt;br /&gt;
* If using '''toil-wdl-runner''', you might want to add '''--jobStore ./jobStore''' to make sure the job store is in a defined, shared location so that you can use '''--restart''' later.&lt;br /&gt;
&lt;br /&gt;
* If using '''toil-wdl-runner''', you will want to set the '''SINGULARITY_CACHEDIR''' and '''MINIWDL__SINGULARITY__IMAGE_CACHE''' environment variables for your workflow to locations on shared storage, possibly the default cache locations in your home directory. Otherwise Toil will set them to node-local directories, and thus re-download images for each workflow run, and for each cluster node. To avoid this, before your run or in your '''~/.bashrc''', you could:&lt;br /&gt;
&lt;br /&gt;
 export SINGULARITY_CACHEDIR=$HOME/.singularity/cache&lt;br /&gt;
 export MINIWDL__SINGULARITY__IMAGE_CACHE=$HOME/.cache/miniwdl&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=434</id>
		<title>Phoenix WDL Tutorial</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=434"/>
		<updated>2023-11-09T22:12:25Z</updated>

		<summary type="html">&lt;p&gt;Anovak: /* Configuring Toil for Phoenix */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Tutorial: Getting Started with WDL Workflows on Phoenix'''&lt;br /&gt;
&lt;br /&gt;
Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like the Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you.&lt;br /&gt;
&lt;br /&gt;
Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.&lt;br /&gt;
&lt;br /&gt;
This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.&lt;br /&gt;
&lt;br /&gt;
=Phoenix Cluster Setup=&lt;br /&gt;
&lt;br /&gt;
Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.&lt;br /&gt;
&lt;br /&gt;
==Getting VPN access==&lt;br /&gt;
&lt;br /&gt;
We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.&lt;br /&gt;
&lt;br /&gt;
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.&lt;br /&gt;
&lt;br /&gt;
==Connecting to Phoenix==&lt;br /&gt;
&lt;br /&gt;
Once you have VPN access, you can connect to the &amp;quot;head node&amp;quot; of the Phoenix cluster. This node is where everyone logs in, but you should '''not''' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs.&lt;br /&gt;
&lt;br /&gt;
To connect to the head node:&lt;br /&gt;
&lt;br /&gt;
1. Connect to the VPN.&lt;br /&gt;
2. SSH to &amp;lt;code&amp;gt;phoenix.prism&amp;lt;/code&amp;gt;. At the command line, run:&lt;br /&gt;
&lt;br /&gt;
 ssh phoenix.prism&lt;br /&gt;
&lt;br /&gt;
If your username on the cluster (say, &amp;lt;code&amp;gt;flastname&amp;lt;/code&amp;gt;) is different than your username on your computer (which might be &amp;lt;code&amp;gt;firstname&amp;lt;/code&amp;gt;), you might instead have to run:&lt;br /&gt;
&lt;br /&gt;
 ssh flastname@phoenix.prism&lt;br /&gt;
&lt;br /&gt;
The first time you connect, you will see a message like:&lt;br /&gt;
&lt;br /&gt;
 The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established.&lt;br /&gt;
 ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI.&lt;br /&gt;
 This key is not known by any other names&lt;br /&gt;
 Are you sure you want to continue connecting (yes/no/[fingerprint])?&lt;br /&gt;
&lt;br /&gt;
This is your computer asking you to help it decide if it is talking to the genuine &amp;lt;code&amp;gt;phoenix.prism&amp;lt;/code&amp;gt;, and not an imposter. You will want to make sure that the &amp;quot;key fingerprint&amp;quot; is indeed &amp;lt;code&amp;gt;SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI&amp;lt;/code&amp;gt;. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type &amp;lt;code&amp;gt;yes&amp;lt;/code&amp;gt; to accept and remember that the server is who it says it is.&lt;br /&gt;
&lt;br /&gt;
==Installing Toil with WDL support==&lt;br /&gt;
&lt;br /&gt;
Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
If you also want to use AWS S3 &amp;lt;code&amp;gt;s3://&amp;lt;/code&amp;gt; and/or Google &amp;lt;code&amp;gt;gs://&amp;lt;/code&amp;gt; URLs for data, you will need to also install Toil with the &amp;lt;code&amp;gt;aws&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;google&amp;lt;/code&amp;gt; extras, respectively:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
This will install Toil in the &amp;lt;code&amp;gt;.local&amp;lt;/code&amp;gt; directory inside your home directory, which we write as &amp;lt;code&amp;gt;~/.local&amp;lt;/code&amp;gt;. The program to run WDL workflows, &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, will be at &amp;lt;code&amp;gt;~/.local/bin/toil-wdl-runner&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
By default, the command interpreter *will not* look there, so if you type &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run:&lt;br /&gt;
&lt;br /&gt;
 echo 'export PATH=&amp;quot;${HOME}/.local/bin:${PATH}&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in''', to restart bash and pick up the change.&lt;br /&gt;
&lt;br /&gt;
To make sure it worked, you can run:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --help&lt;br /&gt;
&lt;br /&gt;
If everything worked correctly, it will print a long list of the various option flags that the &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; command supports.&lt;br /&gt;
&lt;br /&gt;
If you ever want to upgrade Toil to a new release, you can repeat the &amp;lt;code&amp;gt;pip&amp;lt;/code&amp;gt; command above.&lt;br /&gt;
&lt;br /&gt;
==Configuring your Phoenix Environment==&lt;br /&gt;
&lt;br /&gt;
'''Do not try to store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under &amp;lt;code&amp;gt;/public/groups&amp;lt;/code&amp;gt;. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under &amp;lt;code&amp;gt;/public/groups&amp;lt;/code&amp;gt;. Usually you would end up with &amp;lt;code&amp;gt;/public/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Remember this path; we will need it later.&lt;br /&gt;
&lt;br /&gt;
==Configuring Toil for Phoenix==&lt;br /&gt;
&lt;br /&gt;
Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, the Phoenix cluster does have a shared filesystem, so we should configure Toil to use it for caching the Docker container images used to run workflow steps. Since these images can be large, and the home directory quota is only 30 GB, we can't keep them in your home directory; we will need to use the &amp;lt;code&amp;gt;/public/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt; directory you created earlier.&lt;br /&gt;
&lt;br /&gt;
Make sure that directory is recorded in your &amp;lt;code&amp;gt;~/.bashrc&amp;lt;/code&amp;gt; file by running this command (after editing it to use your actual path):&lt;br /&gt;
&lt;br /&gt;
 echo 'BIG_DATA_DIR=/public/groups/YOURGROUPNAME/YOURUSERNAME' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
Then, after logging out and in again, use these commands to make sure that Toil knows where it ought to put its caches:&lt;br /&gt;
&lt;br /&gt;
 echo 'export SINGULARITY_CACHEDIR=&amp;quot;${BIG_DATA_DIR}/.singularity/cache&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE=&amp;quot;${BIG_DATA_DIR}/.cache/miniwdl&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in again''', to apply the changes.&lt;br /&gt;
&lt;br /&gt;
If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day].&lt;br /&gt;
&lt;br /&gt;
'''If you get errors about mutexes, lock files, or other weird problems with Singularity''', try moving these directories to inside &amp;lt;code&amp;gt;/data/tmp&amp;lt;/code&amp;gt; on the individual nodes, or unsetting them and letting Toil use its defaults (and exhaust our Docker pull limits). [https://github.com/DataBiosphere/toil/issues/4654 It is not clear that &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt; actually implements the necessary file locking correctly.]&lt;br /&gt;
&lt;br /&gt;
=Running an existing workflow=&lt;br /&gt;
&lt;br /&gt;
First, let's use &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].&lt;br /&gt;
&lt;br /&gt;
To begin, go to your user directory under &amp;lt;code&amp;gt;/public/groups&amp;lt;/code&amp;gt;, and make a directory to work in.&lt;br /&gt;
&lt;br /&gt;
 cd /public/groups/YOURGROUPNAME/YOURUSERNAME&lt;br /&gt;
 mkdir workflow-test&lt;br /&gt;
 cd workflow-test&lt;br /&gt;
&lt;br /&gt;
Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.&lt;br /&gt;
&lt;br /&gt;
 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl&lt;br /&gt;
&lt;br /&gt;
==Preparing an input file==&lt;br /&gt;
&lt;br /&gt;
Near the top of the WDL file, there's a section like this:&lt;br /&gt;
&lt;br /&gt;
 workflow hello_caller {&lt;br /&gt;
     input {&lt;br /&gt;
         File who&lt;br /&gt;
     }&lt;br /&gt;
&lt;br /&gt;
This means that there is a workflow named &amp;lt;code&amp;gt;hello_caller&amp;lt;/code&amp;gt; in this file, and it takes as input a file variable named &amp;lt;code&amp;gt;who&amp;lt;/code&amp;gt;. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.&lt;br /&gt;
&lt;br /&gt;
So first, we have to make that list of names. Let's make it in &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 echo &amp;quot;Mridula Resurrección&amp;quot; &amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Gershom Šarlota&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Ritchie Ravi&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
&lt;br /&gt;
Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt; that references it by relative path, like this:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./names.txt&amp;quot;}' &amp;gt;inputs.json&lt;br /&gt;
&lt;br /&gt;
Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].&lt;br /&gt;
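&lt;br /&gt;
If you'd rather generate the inputs file programmatically, the same structure is easy to build with any JSON library. Here is a minimal sketch in Python (the key and path match the example above):&lt;br /&gt;
&lt;br /&gt;
```python
import json

# Keys are "workflow name" + "." + "input name"; file values are paths
# relative to the inputs file, absolute paths, or URLs.
inputs = {"hello_caller.who": "./names.txt"}
text = json.dumps(inputs)
print(text)  # {"hello_caller.who": "./names.txt"}
```
&lt;br /&gt;
Writing this string to a file gives the same &amp;lt;code&amp;gt;inputs.json&amp;lt;/code&amp;gt; as the &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command above.&lt;br /&gt;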
&lt;br /&gt;
==Testing at small scale on a single machine==&lt;br /&gt;
&lt;br /&gt;
We are now ready to run the workflow!&lt;br /&gt;
&lt;br /&gt;
You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:&lt;br /&gt;
&lt;br /&gt;
 srun -c 2 --mem 8G --pty bash -i&lt;br /&gt;
&lt;br /&gt;
This will start a new shell; to leave it and go back to the head node you can use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In your new shell, run this Toil command:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner self_test.wdl inputs.json -o local_run&lt;br /&gt;
&lt;br /&gt;
This will, by default, use the &amp;lt;code&amp;gt;single_machine&amp;lt;/code&amp;gt; Toil &amp;quot;batch system&amp;quot; to run all of the workflow's tasks locally. Output will be sent to a new directory named &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
This will print a lot of logging to standard error, and to standard output it will print:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;hello_caller.message_files&amp;quot;: [&amp;quot;local_run/Mridula Resurrecci\u00f3n.txt&amp;quot;, &amp;quot;local_run/Gershom \u0160arlota.txt&amp;quot;, &amp;quot;local_run/Ritchie Ravi.txt&amp;quot;], &amp;quot;hello_caller.messages&amp;quot;: [&amp;quot;Hello, Mridula Resurrecci\u00f3n!&amp;quot;, &amp;quot;Hello, Gershom \u0160arlota!&amp;quot;, &amp;quot;Hello, Ritchie Ravi!&amp;quot;]}&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt; directory will contain the described text files (with Unicode escape sequences like &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt; replaced by their corresponding characters), each containing a greeting for the corresponding person.&lt;br /&gt;
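&lt;br /&gt;
Those &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt;-style sequences are ordinary JSON string escapes; any JSON parser turns them back into the original characters. For example, in Python:&lt;br /&gt;
&lt;br /&gt;
```python
import json

# JSON may escape non-ASCII characters as \uXXXX sequences;
# parsing the JSON restores the real characters.
name = json.loads('"Mridula Resurrecci\\u00f3n"')
print(name)  # Mridula Resurrección
```
&lt;br /&gt;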
&lt;br /&gt;
To leave your interactive Slurm session and return to the head node, use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Running at larger scale==&lt;br /&gt;
&lt;br /&gt;
Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people!&lt;br /&gt;
&lt;br /&gt;
Go get this handy list of people and cut it to length:&lt;br /&gt;
&lt;br /&gt;
 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt&lt;br /&gt;
 head -n100 1000_names.txt &amp;gt;100_names.txt&lt;br /&gt;
&lt;br /&gt;
And make a new inputs file:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./100_names.txt&amp;quot;}' &amp;gt;inputs_big.json&lt;br /&gt;
&lt;br /&gt;
Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.&lt;br /&gt;
&lt;br /&gt;
To run against the Slurm cluster, we need to use the &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; option to point Toil at a directory it can create on the shared filesystem, where it will store information that the cluster nodes can read. We will add the &amp;lt;code&amp;gt;--batchLogsDir&amp;lt;/code&amp;gt; option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the &amp;lt;code&amp;gt;-m&amp;lt;/code&amp;gt; option to save the output JSON to a file instead of printing it.&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json&lt;br /&gt;
&lt;br /&gt;
This will tick for a while, but eventually you should end up with 100 greeting files in the &amp;lt;code&amp;gt;slurm_run&amp;lt;/code&amp;gt; directory.&lt;br /&gt;
&lt;br /&gt;
=Writing your own workflow=&lt;br /&gt;
&lt;br /&gt;
In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. &lt;br /&gt;
&lt;br /&gt;
==Writing the file==&lt;br /&gt;
&lt;br /&gt;
===Version===&lt;br /&gt;
&lt;br /&gt;
All WDL files need to start with a &amp;lt;code&amp;gt;version&amp;lt;/code&amp;gt; statement (unless they are very old &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; files). Toil supports &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt;, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; and 1.0.&lt;br /&gt;
&lt;br /&gt;
So let's start a new WDL 1.0 workflow. Open up a file named &amp;lt;code&amp;gt;fizzbuzz.wdl&amp;lt;/code&amp;gt; and start with a version statement:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
&lt;br /&gt;
===Workflow Block===&lt;br /&gt;
&lt;br /&gt;
Then, add an empty &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; named &amp;lt;code&amp;gt;FizzBuzz&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Input Block===&lt;br /&gt;
&lt;br /&gt;
Workflows usually need some kind of user input, so let's give our workflow an &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; section.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Notice that each input has a type, a name, and an optional default value. If the type ends in &amp;lt;code&amp;gt;?&amp;lt;/code&amp;gt;, the value is optional, and it may be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.&lt;br /&gt;
&lt;br /&gt;
===Body===&lt;br /&gt;
&lt;br /&gt;
Now we'll start on the body of the workflow, to be inserted just after the inputs section. &lt;br /&gt;
&lt;br /&gt;
The first thing we're going to need to do is create an array of all the numbers up to the &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;. We can do this by calling the WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; function, and assigning the result to an &amp;lt;code&amp;gt;Array[Int]&amp;lt;/code&amp;gt; variable.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
&lt;br /&gt;
WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.&lt;br /&gt;
&lt;br /&gt;
===Scattering===&lt;br /&gt;
&lt;br /&gt;
Once we create an array of all the numbers, we can use a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt; to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a &amp;lt;code&amp;gt;map()&amp;lt;/code&amp;gt; in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; starts at 0.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
     Int one_based = i + 1&lt;br /&gt;
 }&lt;br /&gt;
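&lt;br /&gt;
If the scatter-as-&amp;lt;code&amp;gt;map()&amp;lt;/code&amp;gt; analogy helps, here is the same increment step written as plain Python (an analogy only; Toil runs the real scatter bodies as separate jobs):&lt;br /&gt;
&lt;br /&gt;
```python
item_count = 5
numbers = list(range(item_count))  # like WDL range(): [0, 1, 2, 3, 4]

# The scatter body runs once per element; each variable declared
# inside is gathered into an array when seen from outside.
one_based = [i + 1 for i in numbers]
print(one_based)  # [1, 2, 3, 4, 5]
```
&lt;br /&gt;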
&lt;br /&gt;
===Conditionals===&lt;br /&gt;
&lt;br /&gt;
Inside the body of the scatter, we are going to put some conditionals to determine if we should produce &amp;lt;code&amp;gt;&amp;quot;Fizz&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;Buzz&amp;quot;&amp;lt;/code&amp;gt;, or &amp;lt;code&amp;gt;&amp;quot;FizzBuzz&amp;quot;&amp;lt;/code&amp;gt;. To support our &amp;lt;code&amp;gt;fizzbuzz_override&amp;lt;/code&amp;gt;, we put it in an array together with a default value, and use the WDL &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; function to find the first non-null value in that array.&lt;br /&gt;
&lt;br /&gt;
Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; at the end and take advantage of variables from un-executed conditionals being &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Note that WDL supports conditional ''expressions'' with a &amp;lt;code&amp;gt;then&amp;lt;/code&amp;gt; and an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt;, but conditional ''statements'' only have a body, not an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt; branch. If you need an else you will have to check the negated condition.&lt;br /&gt;
&lt;br /&gt;
So first, let's handle the special cases.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
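&lt;br /&gt;
To see what these conditionals do to the variables, here is a Python sketch of the same logic (the names &amp;lt;code&amp;gt;word_for&amp;lt;/code&amp;gt; and this &amp;lt;code&amp;gt;select_first&amp;lt;/code&amp;gt; are illustrative Python, not real WDL execution): a variable from a conditional branch that did not run is null (here &amp;lt;code&amp;gt;None&amp;lt;/code&amp;gt;), and &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; takes the first non-null value.&lt;br /&gt;
&lt;br /&gt;
```python
def select_first(values):
    # Like WDL select_first(): returns the first non-null value.
    return next(v for v in values if v is not None)

def word_for(one_based, to_fizz=3, to_buzz=5, fizzbuzz_override=None):
    # Variables from un-executed conditional branches stay None (WDL null).
    fizz = "Fizz" if one_based % to_fizz == 0 else None
    buzz = "Buzz" if one_based % to_buzz == 0 else None
    fizzbuzz = None
    if fizz is not None and buzz is not None:
        fizzbuzz = select_first([fizzbuzz_override, "FizzBuzz"])
    # The normal-number case is handled by a task in the real workflow.
    return select_first([fizzbuzz, fizz, buzz, str(one_based)])

print([word_for(n) for n in range(1, 6)])  # ['1', '2', 'Fizz', '4', 'Buzz']
```
&lt;br /&gt;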
&lt;br /&gt;
===Calling Tasks===&lt;br /&gt;
&lt;br /&gt;
Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a &amp;lt;code&amp;gt;${}&amp;lt;/code&amp;gt; substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a &amp;lt;code&amp;gt;stringify_number&amp;lt;/code&amp;gt; task, to be written later.&lt;br /&gt;
&lt;br /&gt;
To call a task (or another workflow), we use a &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt; statement and give it some inputs. Then we can fish the output values out of the task with &amp;lt;code&amp;gt;.&amp;lt;/code&amp;gt; access. Because the call happens inside a conditional, its output is null for the numbers where we make a noise instead.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
        call stringify_number {&lt;br /&gt;
            input:&lt;br /&gt;
                the_number = one_based&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We can put the code into the workflow now, and set about writing the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         &lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Writing Tasks===&lt;br /&gt;
&lt;br /&gt;
Our task should go after the workflow in the file. It looks a lot like a workflow except it uses &amp;lt;code&amp;gt;task&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're going to want it to take in an integer &amp;lt;code&amp;gt;the_number&amp;lt;/code&amp;gt;, and we're going to want it to output a string &amp;lt;code&amp;gt;the_string&amp;lt;/code&amp;gt;. So let's fill that in in &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; sections.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     # ???&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now, unlike workflows, tasks can have a &amp;lt;code&amp;gt;command&amp;lt;/code&amp;gt; section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use &amp;lt;code&amp;gt;~{}&amp;lt;/code&amp;gt;, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now we need to capture the result of the command script. The WDL &amp;lt;code&amp;gt;stdout()&amp;lt;/code&amp;gt; returns a WDL &amp;lt;code&amp;gt;File&amp;lt;/code&amp;gt; containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
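&lt;br /&gt;
In other words, &amp;lt;code&amp;gt;read_string(stdout())&amp;lt;/code&amp;gt; behaves roughly like reading back the captured output and stripping the trailing newline that &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; adds. A Python analogy, with a made-up captured value:&lt;br /&gt;
&lt;br /&gt;
```python
captured_stdout = "42\n"  # echo always appends a newline

# read_string() strips trailing end-of-line characters.
the_string = captured_stdout.rstrip("\r\n")
print(repr(the_string))  # '42'
```
&lt;br /&gt;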
&lt;br /&gt;
We also want to add a &amp;lt;code&amp;gt;runtime&amp;lt;/code&amp;gt; section to our task, to specify resource requirements, and to tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;disks&amp;lt;/code&amp;gt; section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a &amp;lt;code&amp;gt;local-disk&amp;lt;/code&amp;gt; of a certain number of gigabytes, which may suggest that it be &amp;lt;code&amp;gt;SSD&amp;lt;/code&amp;gt; storage.&lt;br /&gt;
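&lt;br /&gt;
Concretely, the string has three space-separated fields: the disk name (or mount point), the size in gigabytes, and the disk type. A hypothetical parse, just to show the fields:&lt;br /&gt;
&lt;br /&gt;
```python
spec = "local-disk 1 SSD"

# Fields: disk name/mount point, size in GB, HDD or SSD.
name, size_gb, disk_type = spec.split()
size_gb = int(size_gb)
print(name, size_gb, disk_type)  # local-disk 1 SSD
```
&lt;br /&gt;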
&lt;br /&gt;
Then we can put our task into our WDL file:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Output Block===&lt;br /&gt;
&lt;br /&gt;
Now the only thing missing is a workflow-level &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1, and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an &amp;lt;code&amp;gt;Array[String]&amp;lt;/code&amp;gt;. We'll add the &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section at the end of the &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; section, above the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
     output {&lt;br /&gt;
         Array[String] fizzbuzz_results = result&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Because the &amp;lt;code&amp;gt;result&amp;lt;/code&amp;gt; variable is defined inside a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt;, when we reference it outside the scatter we see it as being an array.&lt;br /&gt;
&lt;br /&gt;
==Running the Workflow==&lt;br /&gt;
&lt;br /&gt;
Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;FizzBuzz.item_count&amp;quot;: 20}' &amp;gt;fizzbuzz.json&lt;br /&gt;
&lt;br /&gt;
Then run it on the cluster with Toil:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
Or locally:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
=Debugging Workflows=&lt;br /&gt;
Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.&lt;br /&gt;
&lt;br /&gt;
==Debugging Options==&lt;br /&gt;
When debugging a workflow, make sure to run it with &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt;, to set the log level to &amp;lt;code&amp;gt;DEBUG&amp;lt;/code&amp;gt;, and with &amp;lt;code&amp;gt;--jobStore /some/path/to/a/shared/directory/it/can/create&amp;lt;/code&amp;gt;, so that the files shipped between jobs are stored somewhere you can access them.&lt;br /&gt;
&lt;br /&gt;
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:&lt;br /&gt;
&lt;br /&gt;
 =========&amp;gt;&lt;br /&gt;
         Toil job log is here&lt;br /&gt;
 &amp;lt;=========&lt;br /&gt;
&lt;br /&gt;
Normally, only the logs of failing jobs are reproduced like this.&lt;br /&gt;
&lt;br /&gt;
==Reading the Log==&lt;br /&gt;
&lt;br /&gt;
When a WDL workflow fails, you are likely to see a message like this:&lt;br /&gt;
&lt;br /&gt;
 WDL.runtime.error.CommandFailed: task command failed with exit status 1&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism&lt;br /&gt;
&lt;br /&gt;
This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code. That happens either when the command itself is written wrong, or when the error detection code in the tool you are trying to run detects and reports a problem.&lt;br /&gt;
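&lt;br /&gt;
As a quick illustration, you can see a command's exit status in the shell by checking the special &amp;lt;code&amp;gt;$?&amp;lt;/code&amp;gt; variable right after the command runs; zero means success, and anything else counts as failure:&lt;br /&gt;
&lt;br /&gt;
 true&lt;br /&gt;
 echo $?  # prints 0: success&lt;br /&gt;
 false&lt;br /&gt;
 echo $?  # prints 1: failure&lt;br /&gt;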
&lt;br /&gt;
Go up higher in the log until you find lines that look like:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt:&lt;br /&gt;
&lt;br /&gt;
And&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt:&lt;br /&gt;
&lt;br /&gt;
These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.&lt;br /&gt;
&lt;br /&gt;
==Reproducing Problems==&lt;br /&gt;
&lt;br /&gt;
When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. In the log of your failing Toil task, look for lines like this:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&amp;lt;/code&amp;gt; part is a Toil file ID, and it is a relative path from your &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; value to where the file is stored on disk. So if you ran the workflow with &amp;lt;code&amp;gt;--jobStore /private/groups/patenlab/anovak/jobstore&amp;lt;/code&amp;gt;, you would look for this file at:&lt;br /&gt;
&lt;br /&gt;
 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&lt;br /&gt;
&lt;br /&gt;
==More Ways of Finding Files==&lt;br /&gt;
&lt;br /&gt;
Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the &amp;lt;code&amp;gt;find&amp;lt;/code&amp;gt; command to try and find the files by name. For example, if you want to look at &amp;lt;code&amp;gt;Sample.bam&amp;lt;/code&amp;gt;, you can look for it like this:&lt;br /&gt;
&lt;br /&gt;
 find /path/to/the/jobstore -name &amp;quot;Sample.bam&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
You can take the &amp;lt;code&amp;gt;toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt; URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this:&lt;br /&gt;
&lt;br /&gt;
 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
Then you can take the part after the last colon, &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt;, and that is the path relative to the job store where this file can be found.&lt;br /&gt;
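&lt;br /&gt;
If you'd rather not paste the URI into a web page, you can also URL-decode it at the command line with Python's standard library, using the same example URI:&lt;br /&gt;
&lt;br /&gt;
 python3 -c 'import urllib.parse, sys; print(urllib.parse.unquote(sys.argv[1]))' 'toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam'&lt;br /&gt;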
&lt;br /&gt;
==Using Development Versions of Toil==&lt;br /&gt;
&lt;br /&gt;
Sometimes, bugs will be fixed in the development version of Toil, but not released yet. To try the current development version of Toil, you can install it like this:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
If you want to use a particular branch or commit, like &amp;lt;code&amp;gt;aaa451b320fc115b3563ced25cb501301cf86f90&amp;lt;/code&amp;gt;, you can do:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
==Frequently Asked Questions==&lt;br /&gt;
&lt;br /&gt;
===I am getting warnings about &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt;===&lt;br /&gt;
You may be seeing warnings that &amp;lt;code&amp;gt;XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This happens because Slurm is providing Toil with an &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt; environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it.&lt;br /&gt;
&lt;br /&gt;
===Toil said it was &amp;lt;code&amp;gt;Redirecting logging&amp;lt;/code&amp;gt; somewhere, but I can't find that file!===&lt;br /&gt;
The Toil worker process for each job will say that it is &amp;lt;code&amp;gt;Redirecting logging to /data/tmp/somewhere/worker_log.txt&amp;lt;/code&amp;gt;, and when running in single machine mode these messages go to the main Toil log.&lt;br /&gt;
&lt;br /&gt;
The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt; option to Toil.&lt;br /&gt;
&lt;br /&gt;
If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.&lt;br /&gt;
&lt;br /&gt;
=Additional WDL resources=&lt;br /&gt;
&lt;br /&gt;
For more information on writing and running WDL workflows, see:&lt;br /&gt;
&lt;br /&gt;
* [https://docs.openwdl.org/en/stable/ The WDL documentation]&lt;br /&gt;
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The &amp;quot;Learn WDL&amp;quot; video course on YouTube]&lt;br /&gt;
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=433</id>
		<title>Phoenix WDL Tutorial</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=433"/>
		<updated>2023-10-27T22:32:00Z</updated>

		<summary type="html">&lt;p&gt;Anovak: /* Configuring Toil for Phoenix */ Complain about file locking&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Tutorial: Getting Started with WDL Workflows on Phoenix'''&lt;br /&gt;
&lt;br /&gt;
Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like the Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you.&lt;br /&gt;
&lt;br /&gt;
Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.&lt;br /&gt;
&lt;br /&gt;
This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.&lt;br /&gt;
&lt;br /&gt;
=Phoenix Cluster Setup=&lt;br /&gt;
&lt;br /&gt;
Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.&lt;br /&gt;
&lt;br /&gt;
==Getting VPN access==&lt;br /&gt;
&lt;br /&gt;
We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.&lt;br /&gt;
&lt;br /&gt;
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.&lt;br /&gt;
&lt;br /&gt;
==Connecting to Phoenix==&lt;br /&gt;
&lt;br /&gt;
Once you have VPN access, you can connect to the &amp;quot;head node&amp;quot; of the Phoenix cluster. This node is where everyone logs in, but you should '''not''' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs.&lt;br /&gt;
&lt;br /&gt;
To connect to the head node:&lt;br /&gt;
&lt;br /&gt;
1. Connect to the VPN.&lt;br /&gt;
2. SSH to &amp;lt;code&amp;gt;phoenix.prism&amp;lt;/code&amp;gt;. At the command line, run:&lt;br /&gt;
&lt;br /&gt;
 ssh phoenix.prism&lt;br /&gt;
&lt;br /&gt;
If your username on the cluster (say, &amp;lt;code&amp;gt;flastname&amp;lt;/code&amp;gt;) is different than your username on your computer (which might be &amp;lt;code&amp;gt;firstname&amp;lt;/code&amp;gt;), you might instead have to run:&lt;br /&gt;
&lt;br /&gt;
 ssh flastname@phoenix.prism&lt;br /&gt;
&lt;br /&gt;
The first time you connect, you will see a message like:&lt;br /&gt;
&lt;br /&gt;
 The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established.&lt;br /&gt;
 ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI.&lt;br /&gt;
 This key is not known by any other names&lt;br /&gt;
 Are you sure you want to continue connecting (yes/no/[fingerprint])?&lt;br /&gt;
&lt;br /&gt;
This is your computer asking you to help it decide if it is talking to the genuine &amp;lt;code&amp;gt;phoenix.prism&amp;lt;/code&amp;gt;, and not an imposter. You will want to make sure that the &amp;quot;key fingerprint&amp;quot; is indeed &amp;lt;code&amp;gt;SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI&amp;lt;/code&amp;gt;. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type &amp;lt;code&amp;gt;yes&amp;lt;/code&amp;gt; to accept and remember that the server is who it says it is.&lt;br /&gt;
&lt;br /&gt;
==Installing Toil with WDL support==&lt;br /&gt;
&lt;br /&gt;
Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
If you also want to use AWS S3 &amp;lt;code&amp;gt;s3://&amp;lt;/code&amp;gt; and/or Google &amp;lt;code&amp;gt;gs://&amp;lt;/code&amp;gt; URLs for data, you will need to also install Toil with the &amp;lt;code&amp;gt;aws&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;google&amp;lt;/code&amp;gt; extras, respectively:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
This will install Toil in the &amp;lt;code&amp;gt;.local&amp;lt;/code&amp;gt; directory inside your home directory, which we write as &amp;lt;code&amp;gt;~/.local&amp;lt;/code&amp;gt;. The program to run WDL workflows, &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, will be at &amp;lt;code&amp;gt;~/.local/bin/toil-wdl-runner&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
By default, the command interpreter '''will not''' look there, so if you type &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run:&lt;br /&gt;
&lt;br /&gt;
 echo 'export PATH=&amp;quot;${HOME}/.local/bin:${PATH}&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in''', to restart bash and pick up the change.&lt;br /&gt;
&lt;br /&gt;
To make sure it worked, you can run:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --help&lt;br /&gt;
&lt;br /&gt;
If everything worked correctly, it will print a long list of the various option flags that the &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; command supports.&lt;br /&gt;
&lt;br /&gt;
If you ever want to upgrade Toil to a new release, you can repeat the &amp;lt;code&amp;gt;pip&amp;lt;/code&amp;gt; command above.&lt;br /&gt;
&lt;br /&gt;
==Configuring your Phoenix Environment==&lt;br /&gt;
&lt;br /&gt;
'''Do not try and store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under &amp;lt;code&amp;gt;/public/groups&amp;lt;/code&amp;gt;. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under &amp;lt;code&amp;gt;/public/groups&amp;lt;/code&amp;gt;. Usually you would end up with &amp;lt;code&amp;gt;/public/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Remember this path; we will need it later.&lt;br /&gt;
&lt;br /&gt;
==Configuring Toil for Phoenix==&lt;br /&gt;
&lt;br /&gt;
Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. On the Phoenix cluster, however, we do have a shared filesystem, and we should configure Toil to use it for caching the Docker container images used for running workflow steps. Since these images can be large, and the home directory quota is only 30 GB, we can't keep them in your home directory. We will need to use the &amp;lt;code&amp;gt;/public/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt; directory you created earlier.&lt;br /&gt;
&lt;br /&gt;
Record that directory in your &amp;lt;code&amp;gt;~/.bashrc&amp;lt;/code&amp;gt; file by editing this command to use your actual group and user names, and then running it:&lt;br /&gt;
&lt;br /&gt;
 echo 'BIG_DATA_DIR=/public/groups/YOURGROUPNAME/YOURUSERNAME' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
Then use these commands to make sure that Toil knows where it ought to put its caches:&lt;br /&gt;
&lt;br /&gt;
 echo 'export SINGULARITY_CACHEDIR=&amp;quot;${BIG_DATA_DIR}/.singularity/cache&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE=&amp;quot;${BIG_DATA_DIR}/.cache/miniwdl&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in again''', to apply the changes.&lt;br /&gt;
&lt;br /&gt;
If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day].&lt;br /&gt;
&lt;br /&gt;
'''If you get errors about mutexes, lock files, or other weird problems with Singularity''', try moving these directories to inside &amp;lt;code&amp;gt;/data/tmp&amp;lt;/code&amp;gt; on the individual nodes, or unsetting them and letting Toil use its defaults (and exhaust our Docker pull limits). [https://github.com/DataBiosphere/toil/issues/4654 It is not clear that &amp;lt;code&amp;gt;/private/groups&amp;lt;/code&amp;gt; actually implements the necessary file locking correctly.]&lt;br /&gt;
&lt;br /&gt;
=Running an existing workflow=&lt;br /&gt;
&lt;br /&gt;
First, let's use &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].&lt;br /&gt;
&lt;br /&gt;
First, go to your user directory under &amp;lt;code&amp;gt;/public/groups&amp;lt;/code&amp;gt;, and make a directory to work in.&lt;br /&gt;
&lt;br /&gt;
 cd /public/groups/YOURGROUPNAME/YOURUSERNAME&lt;br /&gt;
 mkdir workflow-test&lt;br /&gt;
 cd workflow-test&lt;br /&gt;
&lt;br /&gt;
Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.&lt;br /&gt;
&lt;br /&gt;
 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl&lt;br /&gt;
&lt;br /&gt;
==Preparing an input file==&lt;br /&gt;
&lt;br /&gt;
Near the top of the WDL file, there's a section like this:&lt;br /&gt;
&lt;br /&gt;
 workflow hello_caller {&lt;br /&gt;
     input {&lt;br /&gt;
         File who&lt;br /&gt;
     }&lt;br /&gt;
&lt;br /&gt;
This means that there is a workflow named &amp;lt;code&amp;gt;hello_caller&amp;lt;/code&amp;gt; in this file, and it takes as input a file variable named &amp;lt;code&amp;gt;who&amp;lt;/code&amp;gt;. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.&lt;br /&gt;
&lt;br /&gt;
So first, we have to make that list of names. Let's make it in &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 echo &amp;quot;Mridula Resurrección&amp;quot; &amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Gershom Šarlota&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Ritchie Ravi&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
&lt;br /&gt;
Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt; that references it by relative path, like this:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./names.txt&amp;quot;}' &amp;gt;inputs.json&lt;br /&gt;
&lt;br /&gt;
Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].&lt;br /&gt;
&lt;br /&gt;
==Testing at small scale on a single machine==&lt;br /&gt;
&lt;br /&gt;
We are now ready to run the workflow!&lt;br /&gt;
&lt;br /&gt;
You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:&lt;br /&gt;
&lt;br /&gt;
 srun -c 2 --mem 8G --pty bash -i&lt;br /&gt;
&lt;br /&gt;
This will start a new shell; to leave it and go back to the head node you can use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In your new shell, run this Toil command:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner self_test.wdl inputs.json -o local_run&lt;br /&gt;
&lt;br /&gt;
This will, by default, use the &amp;lt;code&amp;gt;single_machine&amp;lt;/code&amp;gt; Toil &amp;quot;batch system&amp;quot; to run all of the workflow's tasks locally. Output will be sent to a new directory named &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
This will print a lot of logging to standard error, and to standard output it will print:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;hello_caller.message_files&amp;quot;: [&amp;quot;local_run/Mridula Resurrecci\u00f3n.txt&amp;quot;, &amp;quot;local_run/Gershom \u0160arlota.txt&amp;quot;, &amp;quot;local_run/Ritchie Ravi.txt&amp;quot;], &amp;quot;hello_caller.messages&amp;quot;: [&amp;quot;Hello, Mridula Resurrecci\u00f3n!&amp;quot;, &amp;quot;Hello, Gershom \u0160arlota!&amp;quot;, &amp;quot;Hello, Ritchie Ravi!&amp;quot;]}&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt; directory will contain the described text files (with Unicode escape sequences like &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt; replaced by their corresponding characters), each containing a greeting for the corresponding person.&lt;br /&gt;
&lt;br /&gt;
To leave your interactive Slurm session and return to the head node, use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Running at larger scale==&lt;br /&gt;
&lt;br /&gt;
Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people!&lt;br /&gt;
&lt;br /&gt;
Go get this handy list of people and cut it to length:&lt;br /&gt;
&lt;br /&gt;
 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt&lt;br /&gt;
 head -n100 1000_names.txt &amp;gt;100_names.txt&lt;br /&gt;
&lt;br /&gt;
And make a new inputs file:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./100_names.txt&amp;quot;}' &amp;gt;inputs_big.json&lt;br /&gt;
&lt;br /&gt;
Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.&lt;br /&gt;
&lt;br /&gt;
To run against the Slurm cluster, we need to use the &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the &amp;lt;code&amp;gt;--batchLogsDir&amp;lt;/code&amp;gt; option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the &amp;lt;code&amp;gt;-m&amp;lt;/code&amp;gt; option to save the output JSON to a file instead of printing it.&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json&lt;br /&gt;
&lt;br /&gt;
This will tick for a while, but eventually you should end up with 100 greeting files in the &amp;lt;code&amp;gt;slurm_run&amp;lt;/code&amp;gt; directory.&lt;br /&gt;
&lt;br /&gt;
=Writing your own workflow=&lt;br /&gt;
&lt;br /&gt;
In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. &lt;br /&gt;
&lt;br /&gt;
==Writing the file==&lt;br /&gt;
&lt;br /&gt;
===Version===&lt;br /&gt;
&lt;br /&gt;
All WDL files need to start with a &amp;lt;code&amp;gt;version&amp;lt;/code&amp;gt; statement (unless they are very old &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; files). Toil supports &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt;, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; and 1.0.&lt;br /&gt;
&lt;br /&gt;
So let's start a new WDL 1.0 workflow. Open up a file named &amp;lt;code&amp;gt;fizzbuzz.wdl&amp;lt;/code&amp;gt; and start with a version statement:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
&lt;br /&gt;
===Workflow Block===&lt;br /&gt;
&lt;br /&gt;
Then, add an empty &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; named &amp;lt;code&amp;gt;FizzBuzz&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Input Block===&lt;br /&gt;
&lt;br /&gt;
Workflows usually need some kind of user input, so let's give our workflow an &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; section.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Notice that each input has a type, a name, and an optional default value. If the type ends in &amp;lt;code&amp;gt;?&amp;lt;/code&amp;gt;, the value is optional, and it may be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.&lt;br /&gt;
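&lt;br /&gt;
For example, an inputs file for this workflow must set &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;, but it can also override a default or fill in the optional input. The particular values here are just for illustration:&lt;br /&gt;
&lt;br /&gt;
 {&lt;br /&gt;
     &amp;quot;FizzBuzz.item_count&amp;quot;: 15,&lt;br /&gt;
     &amp;quot;FizzBuzz.to_fizz&amp;quot;: 4,&lt;br /&gt;
     &amp;quot;FizzBuzz.fizzbuzz_override&amp;quot;: &amp;quot;Fizz Buzz&amp;quot;&lt;br /&gt;
 }&lt;br /&gt;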
&lt;br /&gt;
===Body===&lt;br /&gt;
&lt;br /&gt;
Now we'll start on the body of the workflow, to be inserted just after the inputs section. &lt;br /&gt;
&lt;br /&gt;
The first thing we're going to need to do is create an array of all the numbers up to the &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;. We can do this by calling the WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; function, and assigning the result to an &amp;lt;code&amp;gt;Array[Int]&amp;lt;/code&amp;gt; variable.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
&lt;br /&gt;
WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.&lt;br /&gt;
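&lt;br /&gt;
For instance, the standard library's &amp;lt;code&amp;gt;length()&amp;lt;/code&amp;gt; function counts the elements in an array (the &amp;lt;code&amp;gt;how_many&amp;lt;/code&amp;gt; variable here is just an illustration, not part of our workflow):&lt;br /&gt;
&lt;br /&gt;
 Int how_many = length(numbers)&lt;br /&gt;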
&lt;br /&gt;
===Scattering===&lt;br /&gt;
&lt;br /&gt;
Once we create an array of all the numbers, we can use a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt; to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a &amp;lt;code&amp;gt;map()&amp;lt;/code&amp;gt; in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; starts at 0.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
     Int one_based = i + 1&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Conditionals===&lt;br /&gt;
&lt;br /&gt;
Inside the body of the scatter, we are going to put some conditionals to determine if we should produce &amp;lt;code&amp;gt;&amp;quot;Fizz&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;Buzz&amp;quot;&amp;lt;/code&amp;gt;, or &amp;lt;code&amp;gt;&amp;quot;FizzBuzz&amp;quot;&amp;lt;/code&amp;gt;. To support our &amp;lt;code&amp;gt;fizzbuzz_override&amp;lt;/code&amp;gt;, we put it in an array along with a default value, and use the WDL &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; function to pick the first non-null value in that array.&lt;br /&gt;
&lt;br /&gt;
Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; at the end and take advantage of variables from un-executed conditionals being &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Note that WDL supports conditional ''expressions'' with a &amp;lt;code&amp;gt;then&amp;lt;/code&amp;gt; and an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt;, but conditional ''statements'' only have a body, not an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt; branch. If you need an else you will have to check the negated condition.&lt;br /&gt;
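&lt;br /&gt;
In Python terms, &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; scans its argument array and returns the first non-null value. The sketch below models that behavior, with &amp;lt;code&amp;gt;None&amp;lt;/code&amp;gt; standing in for WDL's null and &amp;quot;FuzzBizz&amp;quot; as a made-up example override value:&lt;br /&gt;

```python
# Model of WDL's select_first(): return the first non-null value,
# or fail if every value is null.
def select_first(values):
    for v in values:
        if v is not None:
            return v
    raise ValueError("all values were null")

# With fizzbuzz_override unset (null in WDL), the default wins:
print(select_first([None, "FizzBuzz"]))        # FizzBuzz
# With an override provided, it takes precedence:
print(select_first(["FuzzBizz", "FizzBuzz"]))  # FuzzBizz
```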
&lt;br /&gt;
So first, let's handle the special cases.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Calling Tasks===&lt;br /&gt;
&lt;br /&gt;
Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a &amp;lt;code&amp;gt;${}&amp;lt;/code&amp;gt; substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a &amp;lt;code&amp;gt;stringify_number&amp;lt;/code&amp;gt; task, to be written later.&lt;br /&gt;
&lt;br /&gt;
To call a task (or another workflow), we use a &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt; statement and give it some inputs. Then we can fish the output values out of the task with &amp;lt;code&amp;gt;.&amp;lt;/code&amp;gt; access, but only for the numbers where we don't produce a noise instead.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
        call stringify_number {&lt;br /&gt;
            input:&lt;br /&gt;
                the_number = one_based&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
 }&lt;br /&gt;
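&lt;br /&gt;
To sanity-check the branching, here is a hedged Python model of one scatter iteration: variables from un-taken conditionals stay &amp;lt;code&amp;gt;None&amp;lt;/code&amp;gt; (null), Python's &amp;lt;code&amp;gt;str()&amp;lt;/code&amp;gt; stands in for the &amp;lt;code&amp;gt;stringify_number&amp;lt;/code&amp;gt; task we will write later, and the first non-null value wins, as in &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt;:&lt;br /&gt;

```python
# Python model of the scatter body: un-executed conditional branches
# leave their variables as None, and the first non-None value wins,
# mirroring the WDL select_first() call above.
def fizzbuzz_result(one_based, to_fizz=3, to_buzz=5, fizzbuzz_override=None):
    fizz = "Fizz" if one_based % to_fizz == 0 else None
    buzz = "Buzz" if one_based % to_buzz == 0 else None
    fizzbuzz = None
    if fizz is not None and buzz is not None:
        fizzbuzz = fizzbuzz_override or "FizzBuzz"
    stringified = str(one_based)  # stands in for the stringify_number task
    return next(v for v in [fizzbuzz, fizz, buzz, stringified] if v is not None)

print([fizzbuzz_result(n) for n in range(1, 16)])
# ['1', '2', 'Fizz', '4', 'Buzz', 'Fizz', '7', '8', 'Fizz', 'Buzz',
#  '11', 'Fizz', '13', '14', 'FizzBuzz']
```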
&lt;br /&gt;
We can put the code into the workflow now, and set about writing the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         &lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Writing Tasks===&lt;br /&gt;
&lt;br /&gt;
Our task should go after the workflow in the file. It looks a lot like a workflow except it uses &amp;lt;code&amp;gt;task&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're going to want it to take in an integer &amp;lt;code&amp;gt;the_number&amp;lt;/code&amp;gt;, and we're going to want it to output a string &amp;lt;code&amp;gt;the_string&amp;lt;/code&amp;gt;. So let's fill that in in &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; sections.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     # ???&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now, unlike workflows, tasks can have a &amp;lt;code&amp;gt;command&amp;lt;/code&amp;gt; section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use &amp;lt;code&amp;gt;~{}&amp;lt;/code&amp;gt;, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now we need to capture the result of the command script. The WDL &amp;lt;code&amp;gt;stdout()&amp;lt;/code&amp;gt; returns a WDL &amp;lt;code&amp;gt;File&amp;lt;/code&amp;gt; containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
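&lt;br /&gt;
The newline handling matters here: &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; appends a trailing newline, and &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; strips it. A small Python model of that round trip (just a sketch of the semantics):&lt;br /&gt;

```python
# Model of read_string(stdout()): read a file's contents and strip
# trailing newlines, so `echo 7` yields the string "7".
import tempfile

with tempfile.NamedTemporaryFile("w+", delete=False) as f:
    f.write("7\n")  # what `echo ~{the_number}` would print
    path = f.name

with open(path) as f:
    the_string = f.read().rstrip("\r\n")

print(repr(the_string))  # '7'
```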
&lt;br /&gt;
We're also going to add a &amp;lt;code&amp;gt;runtime&amp;lt;/code&amp;gt; section to our task, to specify resource requirements. We'll also tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;disks&amp;lt;/code&amp;gt; section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a &amp;lt;code&amp;gt;local-disk&amp;lt;/code&amp;gt; of a certain number of gigabytes, optionally hinting that it should be &amp;lt;code&amp;gt;SSD&amp;lt;/code&amp;gt; storage.&lt;br /&gt;
&lt;br /&gt;
Then we can put our task into our WDL file:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Output Block===&lt;br /&gt;
&lt;br /&gt;
Now the only thing missing is a workflow-level &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1 and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an &amp;lt;code&amp;gt;Array[String]&amp;lt;/code&amp;gt;. We'll add the &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section at the end of the &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; section, above the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
     output {&lt;br /&gt;
         Array[String] fizzbuzz_results = result&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Because the &amp;lt;code&amp;gt;result&amp;lt;/code&amp;gt; variable is defined inside a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt;, when we reference it outside the scatter we see it as being an array.&lt;br /&gt;
&lt;br /&gt;
==Running the Workflow==&lt;br /&gt;
&lt;br /&gt;
Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;FizzBuzz.item_count&amp;quot;: 20}' &amp;gt;fizzbuzz.json&lt;br /&gt;
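&lt;br /&gt;
If you prefer, the same inputs file can be written from Python; note that input names are namespaced by the workflow name (&amp;lt;code&amp;gt;FizzBuzz&amp;lt;/code&amp;gt; here). This is just an equivalent way to produce the JSON above:&lt;br /&gt;

```python
# Write the same inputs file as the echo command above; input names
# are prefixed with the workflow name.
import json

inputs = {"FizzBuzz.item_count": 20}
with open("fizzbuzz.json", "w") as f:
    json.dump(inputs, f)

print(open("fizzbuzz.json").read())  # {"FizzBuzz.item_count": 20}
```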
&lt;br /&gt;
Then run it on the cluster with Toil:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
Or locally:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
=Debugging Workflows=&lt;br /&gt;
Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.&lt;br /&gt;
&lt;br /&gt;
==Debugging Options==&lt;br /&gt;
When debugging a workflow, make sure to run the workflow with &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt;, to set the log level to &amp;lt;code&amp;gt;DEBUG&amp;lt;/code&amp;gt;, and with &amp;lt;code&amp;gt;--jobStore /some/path/to/a/shared/directory/it/can/create&amp;lt;/code&amp;gt; so that the stored files shipped between jobs are in a place you can access them.&lt;br /&gt;
&lt;br /&gt;
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:&lt;br /&gt;
&lt;br /&gt;
 =========&amp;gt;&lt;br /&gt;
         Toil job log is here&lt;br /&gt;
 &amp;lt;=========&lt;br /&gt;
&lt;br /&gt;
Normally, only the logs of failing jobs are reproduced like this.&lt;br /&gt;
&lt;br /&gt;
==Reading the Log==&lt;br /&gt;
&lt;br /&gt;
When a WDL workflow fails, you are likely to see a message like this:&lt;br /&gt;
&lt;br /&gt;
 WDL.runtime.error.CommandFailed: task command failed with exit status 1&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism&lt;br /&gt;
&lt;br /&gt;
This means that the command line command specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code. This happens either when the command is written wrong, or when the tool you are trying to run detects and reports an error.&lt;br /&gt;
&lt;br /&gt;
Go up higher in the log until you find lines that look like:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt:&lt;br /&gt;
&lt;br /&gt;
And&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt:&lt;br /&gt;
&lt;br /&gt;
These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.&lt;br /&gt;
&lt;br /&gt;
==Reproducing Problems==&lt;br /&gt;
&lt;br /&gt;
When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. In the log of your failing Toil task, look for lines like this:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&amp;lt;/code&amp;gt; part is a Toil file ID, and it is a relative path from your &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; value to where the file is stored on disk. So if you ran the workflow with &amp;lt;code&amp;gt;--jobStore /private/groups/patenlab/anovak/jobstore&amp;lt;/code&amp;gt;, you would look for this file at:&lt;br /&gt;
&lt;br /&gt;
 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&lt;br /&gt;
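&lt;br /&gt;
Since the file ID is just a path relative to the job store, locating the file on disk is a simple path join. A small Python illustration using the example values above:&lt;br /&gt;

```python
# The Toil file ID is a path relative to the --jobStore directory;
# joining them gives the on-disk location of the stored file.
import os

job_store = "/private/groups/patenlab/anovak/jobstore"
file_id = ("files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/"
           "file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam")
print(os.path.join(job_store, file_id))
```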
&lt;br /&gt;
==More Ways of Finding Files==&lt;br /&gt;
&lt;br /&gt;
Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the &amp;lt;code&amp;gt;find&amp;lt;/code&amp;gt; command to try and find the files by name. For example, if you want to look at &amp;lt;code&amp;gt;Sample.bam&amp;lt;/code&amp;gt;, you can look for it like this:&lt;br /&gt;
&lt;br /&gt;
 find /path/to/the/jobstore -name &amp;quot;Sample.bam&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
You can take the &amp;lt;code&amp;gt;toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt; URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this:&lt;br /&gt;
&lt;br /&gt;
 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
Then you can take the part after the last colon, &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt;, and that is the path relative to the job store where this file can be found.&lt;br /&gt;
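&lt;br /&gt;
This decoding can also be done with the Python standard library, using the URI from the example log line above:&lt;br /&gt;

```python
# URL-decode a toilfile: URI from the log; the part after the last
# colon is the path relative to the job store.
from urllib.parse import unquote

uri = ("toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob"
       "%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351"
       "%2FSample.chr14.bam/Sample.chr14.bam")
decoded = unquote(uri)
relative_path = decoded.rsplit(":", 1)[-1]
print(relative_path)
```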
&lt;br /&gt;
==Using Development Versions of Toil==&lt;br /&gt;
&lt;br /&gt;
Sometimes, bugs will be fixed in the development version of Toil, but not released yet. To try the current development version of Toil, you can install it like this:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
If you want to use a particular branch or commit, like &amp;lt;code&amp;gt;aaa451b320fc115b3563ced25cb501301cf86f90&amp;lt;/code&amp;gt;, you can do:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
==Frequently Asked Questions==&lt;br /&gt;
&lt;br /&gt;
===I am getting warnings about &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt;===&lt;br /&gt;
You may be seeing warnings that &amp;lt;code&amp;gt;XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This happens because Slurm is providing Toil with an &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt; environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it.&lt;br /&gt;
&lt;br /&gt;
===Toil said it was &amp;lt;code&amp;gt;Redirecting logging&amp;lt;/code&amp;gt; somewhere, but I can't find that file!===&lt;br /&gt;
The Toil worker process for each job will say that it is &amp;lt;code&amp;gt;Redirecting logging to /data/tmp/somewhere/worker_log.txt&amp;lt;/code&amp;gt;, and when running in single machine mode these messages go to the main Toil log.&lt;br /&gt;
&lt;br /&gt;
The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt; option to Toil.&lt;br /&gt;
&lt;br /&gt;
If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.&lt;br /&gt;
&lt;br /&gt;
=Additional WDL resources=&lt;br /&gt;
&lt;br /&gt;
For more information on writing and running WDL workflows, see:&lt;br /&gt;
&lt;br /&gt;
* [https://docs.openwdl.org/en/stable/ The WDL documentation]&lt;br /&gt;
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The &amp;quot;Learn WDL&amp;quot; video course on YouTube]&lt;br /&gt;
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=426</id>
		<title>Phoenix WDL Tutorial</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=426"/>
		<updated>2023-10-23T20:24:46Z</updated>

		<summary type="html">&lt;p&gt;Anovak: Cache Singularity stuff outside the size-limited home directories.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Tutorial: Getting Started with WDL Workflows on Phoenix'''&lt;br /&gt;
&lt;br /&gt;
Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like the Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you.&lt;br /&gt;
&lt;br /&gt;
Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.&lt;br /&gt;
&lt;br /&gt;
This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.&lt;br /&gt;
&lt;br /&gt;
=Phoenix Cluster Setup=&lt;br /&gt;
&lt;br /&gt;
Before we begin, you will need a computer to work at, which you are able to install software on, and the ability to connect to other machines over SSH.&lt;br /&gt;
&lt;br /&gt;
==Getting VPN access==&lt;br /&gt;
&lt;br /&gt;
We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.&lt;br /&gt;
&lt;br /&gt;
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.&lt;br /&gt;
&lt;br /&gt;
==Connecting to Phoenix==&lt;br /&gt;
&lt;br /&gt;
Once you have VPN access, you can connect to the &amp;quot;head node&amp;quot; of the Phoenix cluster. This node is where everyone logs in, but you should '''not''' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs.&lt;br /&gt;
&lt;br /&gt;
To connect to the head node:&lt;br /&gt;
&lt;br /&gt;
1. Connect to the VPN.&lt;br /&gt;
2. SSH to &amp;lt;code&amp;gt;phoenix.prism&amp;lt;/code&amp;gt;. At the command line, run:&lt;br /&gt;
&lt;br /&gt;
 ssh phoenix.prism&lt;br /&gt;
&lt;br /&gt;
If your username on the cluster (say, &amp;lt;code&amp;gt;flastname&amp;lt;/code&amp;gt;) is different from your username on your computer (which might be &amp;lt;code&amp;gt;firstname&amp;lt;/code&amp;gt;), you might instead have to run:&lt;br /&gt;
&lt;br /&gt;
 ssh flastname@phoenix.prism&lt;br /&gt;
&lt;br /&gt;
The first time you connect, you will see a message like:&lt;br /&gt;
&lt;br /&gt;
 The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established.&lt;br /&gt;
 ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI.&lt;br /&gt;
 This key is not known by any other names&lt;br /&gt;
 Are you sure you want to continue connecting (yes/no/[fingerprint])?&lt;br /&gt;
&lt;br /&gt;
This is your computer asking you to help it decide if it is talking to the genuine &amp;lt;code&amp;gt;phoenix.prism&amp;lt;/code&amp;gt;, and not an imposter. You will want to make sure that the &amp;quot;key fingerprint&amp;quot; is indeed &amp;lt;code&amp;gt;SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI&amp;lt;/code&amp;gt;. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type &amp;lt;code&amp;gt;yes&amp;lt;/code&amp;gt; to accept and remember that the server is who it says it is.&lt;br /&gt;
&lt;br /&gt;
==Installing Toil with WDL support==&lt;br /&gt;
&lt;br /&gt;
Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
If you also want to use AWS S3 &amp;lt;code&amp;gt;s3://&amp;lt;/code&amp;gt; and/or Google &amp;lt;code&amp;gt;gs://&amp;lt;/code&amp;gt; URLs for data, you will need to also install Toil with the &amp;lt;code&amp;gt;aws&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;google&amp;lt;/code&amp;gt; extras, respectively:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
This will install Toil in the &amp;lt;code&amp;gt;.local&amp;lt;/code&amp;gt; directory inside your home directory, which we write as &amp;lt;code&amp;gt;~/.local&amp;lt;/code&amp;gt;. The program to run WDL workflows, &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, will be at &amp;lt;code&amp;gt;~/.local/bin/toil-wdl-runner&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
By default, the command interpreter '''will not''' look there, so if you type &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run:&lt;br /&gt;
&lt;br /&gt;
 echo 'export PATH=&amp;quot;${HOME}/.local/bin:${PATH}&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in''', to restart bash and pick up the change.&lt;br /&gt;
&lt;br /&gt;
To make sure it worked, you can run:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --help&lt;br /&gt;
&lt;br /&gt;
If everything worked correctly, it will print a long list of the various option flags that the &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; command supports.&lt;br /&gt;
&lt;br /&gt;
If you ever want to upgrade Toil to a new release, you can repeat the &amp;lt;code&amp;gt;pip&amp;lt;/code&amp;gt; command above.&lt;br /&gt;
&lt;br /&gt;
==Configuring your Phoenix Environment==&lt;br /&gt;
&lt;br /&gt;
'''Do not try and store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under &amp;lt;code&amp;gt;/public/groups&amp;lt;/code&amp;gt;. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under &amp;lt;code&amp;gt;/public/groups&amp;lt;/code&amp;gt;. Usually you would end up with &amp;lt;code&amp;gt;/public/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Remember this path; we will need it later.&lt;br /&gt;
&lt;br /&gt;
==Configuring Toil for Phoenix==&lt;br /&gt;
&lt;br /&gt;
Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. However, since these files can be large, and the home directory quota is only 30 GB, we can't keep these in your home directory. We will need to use the &amp;lt;code&amp;gt;/public/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt; directory you created earlier.&lt;br /&gt;
&lt;br /&gt;
Make that directory available in your &amp;lt;code&amp;gt;~/.bashrc&amp;lt;/code&amp;gt; file by editing this command to use your actual path, and then running it:&lt;br /&gt;
&lt;br /&gt;
 echo 'BIG_DATA_DIR=/public/groups/YOURGROUPNAME/YOURUSERNAME' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
Then use these commands to make sure that Toil knows where it ought to put its caches:&lt;br /&gt;
&lt;br /&gt;
 echo 'export SINGULARITY_CACHEDIR=&amp;quot;${BIG_DATA_DIR}/.singularity/cache&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE=&amp;quot;${BIG_DATA_DIR}/.cache/miniwdl&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in again''', to apply the changes.&lt;br /&gt;
&lt;br /&gt;
If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day], so it is important not to skip this step.&lt;br /&gt;
&lt;br /&gt;
=Running an existing workflow=&lt;br /&gt;
&lt;br /&gt;
First, let's use &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].&lt;br /&gt;
&lt;br /&gt;
First, go to your user directory under &amp;lt;code&amp;gt;/public/groups&amp;lt;/code&amp;gt;, and make a directory to work in.&lt;br /&gt;
&lt;br /&gt;
 cd /public/groups/YOURGROUPNAME/YOURUSERNAME&lt;br /&gt;
 mkdir workflow-test&lt;br /&gt;
 cd workflow-test&lt;br /&gt;
&lt;br /&gt;
Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.&lt;br /&gt;
&lt;br /&gt;
 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl&lt;br /&gt;
&lt;br /&gt;
==Preparing an input file==&lt;br /&gt;
&lt;br /&gt;
Near the top of the WDL file, there's a section like this:&lt;br /&gt;
&lt;br /&gt;
 workflow hello_caller {&lt;br /&gt;
     input {&lt;br /&gt;
         File who&lt;br /&gt;
     }&lt;br /&gt;
&lt;br /&gt;
This means that there is a workflow named &amp;lt;code&amp;gt;hello_caller&amp;lt;/code&amp;gt; in this file, and it takes as input a file variable named &amp;lt;code&amp;gt;who&amp;lt;/code&amp;gt;. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.&lt;br /&gt;
&lt;br /&gt;
So first, we have to make that list of names. Let's make it in &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
 echo &amp;quot;Mridula Resurrección&amp;quot; &amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Gershom Šarlota&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Ritchie Ravi&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
&lt;br /&gt;
Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt; that references it by relative path, like this:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./names.txt&amp;quot;}' &amp;gt;inputs.json&lt;br /&gt;
&lt;br /&gt;
Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].&lt;br /&gt;
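&lt;br /&gt;
Since the inputs file is plain JSON, you can also generate it with a short script rather than the &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command; here is a minimal Python sketch that writes the same file (same filenames as above):&lt;br /&gt;
&lt;br /&gt;
```python
import json

# Key: workflow name, a dot, then the input name.
# Value: the file path, relative to the inputs file's location.
inputs = {"hello_caller.who": "./names.txt"}

with open("inputs.json", "w") as f:
    json.dump(inputs, f)

# Read it back to confirm it is valid JSON.
with open("inputs.json") as f:
    print(json.load(f)["hello_caller.who"])
```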
&lt;br /&gt;
==Testing at small scale on a single machine==&lt;br /&gt;
&lt;br /&gt;
We are now ready to run the workflow!&lt;br /&gt;
&lt;br /&gt;
You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:&lt;br /&gt;
&lt;br /&gt;
 srun -c 2 --mem 8G --pty bash -i&lt;br /&gt;
&lt;br /&gt;
This will start a new shell; to leave it and go back to the head node you can use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In your new shell, run this Toil command:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner self_test.wdl inputs.json -o local_run&lt;br /&gt;
&lt;br /&gt;
This will, by default, use the &amp;lt;code&amp;gt;single_machine&amp;lt;/code&amp;gt; Toil &amp;quot;batch system&amp;quot; to run all of the workflow's tasks locally. Output will be sent to a new directory named &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
This will print a lot of logging to standard error, and to standard output it will print:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;hello_caller.message_files&amp;quot;: [&amp;quot;local_run/Mridula Resurrecci\u00f3n.txt&amp;quot;, &amp;quot;local_run/Gershom \u0160arlota.txt&amp;quot;, &amp;quot;local_run/Ritchie Ravi.txt&amp;quot;], &amp;quot;hello_caller.messages&amp;quot;: [&amp;quot;Hello, Mridula Resurrecci\u00f3n!&amp;quot;, &amp;quot;Hello, Gershom \u0160arlota!&amp;quot;, &amp;quot;Hello, Ritchie Ravi!&amp;quot;]}&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt; directory will contain the described text files (with Unicode escape sequences like &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt; replaced by their corresponding characters), each containing a greeting for the corresponding person.&lt;br /&gt;
&lt;br /&gt;
To leave your interactive Slurm session and return to the head node, use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Running at larger scale==&lt;br /&gt;
&lt;br /&gt;
Back on the head node, let's prepare to run a larger run. Greeting 3 people isn't cool, let's greet one hundred people!&lt;br /&gt;
&lt;br /&gt;
Go get this handy list of people and cut it to length:&lt;br /&gt;
&lt;br /&gt;
 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt&lt;br /&gt;
 head -n100 1000_names.txt &amp;gt;100_names.txt&lt;br /&gt;
&lt;br /&gt;
And make a new inputs file:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./100_names.txt&amp;quot;}' &amp;gt;inputs_big.json&lt;br /&gt;
&lt;br /&gt;
Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.&lt;br /&gt;
&lt;br /&gt;
To run against the Slurm cluster, we need to use the &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the &amp;lt;code&amp;gt;--batchLogsDir&amp;lt;/code&amp;gt; option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the &amp;lt;code&amp;gt;-m&amp;lt;/code&amp;gt; option to save the output JSON to a file instead of printing it.&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json&lt;br /&gt;
&lt;br /&gt;
This will tick for a while, but eventually you should end up with 100 greeting files in the &amp;lt;code&amp;gt;slurm_run&amp;lt;/code&amp;gt; directory.&lt;br /&gt;
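&lt;br /&gt;
Because we used &amp;lt;code&amp;gt;-m&amp;lt;/code&amp;gt;, the workflow's output bindings also end up in &amp;lt;code&amp;gt;slurm_run.json&amp;lt;/code&amp;gt;, which you can inspect programmatically. A small Python sketch (the sample data here is a stand-in written by the script itself, not a real run's output):&lt;br /&gt;
&lt;br /&gt;
```python
import json

def load_outputs(path):
    # Read the JSON file written by toil-wdl-runner's -m option.
    with open(path) as f:
        return json.load(f)

# Stand-in for the file a real run would write:
sample = {"hello_caller.messages": ["Hello, Ritchie Ravi!"]}
with open("slurm_run.json", "w") as f:
    json.dump(sample, f)

outputs = load_outputs("slurm_run.json")
for key, value in outputs.items():
    print(key, value)
```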
&lt;br /&gt;
=Writing your own workflow=&lt;br /&gt;
&lt;br /&gt;
In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. &lt;br /&gt;
&lt;br /&gt;
==Writing the file==&lt;br /&gt;
&lt;br /&gt;
===Version===&lt;br /&gt;
&lt;br /&gt;
All WDL files need to start with a &amp;lt;code&amp;gt;version&amp;lt;/code&amp;gt; statement (unless they are very old &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; files). Toil supports &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt;, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; and 1.0.&lt;br /&gt;
&lt;br /&gt;
So let's start a new WDL 1.0 workflow. Open up a file named &amp;lt;code&amp;gt;fizzbuzz.wdl&amp;lt;/code&amp;gt; and start with a version statement:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
&lt;br /&gt;
===Workflow Block===&lt;br /&gt;
&lt;br /&gt;
Then, add an empty &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; named &amp;lt;code&amp;gt;FizzBuzz&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Input Block===&lt;br /&gt;
&lt;br /&gt;
Workflows usually need some kind of user input, so let's give our workflow an &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; section.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Notice that each input has a type, a name, and an optional default value. If the type ends in &amp;lt;code&amp;gt;?&amp;lt;/code&amp;gt;, the value is optional, and it may be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.&lt;br /&gt;
&lt;br /&gt;
===Body===&lt;br /&gt;
&lt;br /&gt;
Now we'll start on the body of the workflow, to be inserted just after the inputs section. &lt;br /&gt;
&lt;br /&gt;
The first thing we're going to need to do is create an array of all the numbers up to the &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;. We can do this by calling the WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; function, and assigning the result to an &amp;lt;code&amp;gt;Array[Int]&amp;lt;/code&amp;gt; variable.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
&lt;br /&gt;
WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.&lt;br /&gt;
&lt;br /&gt;
===Scattering===&lt;br /&gt;
&lt;br /&gt;
Once we create an array of all the numbers, we can use a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt; to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a &amp;lt;code&amp;gt;map()&amp;lt;/code&amp;gt; in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; starts at 0.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
     Int one_based = i + 1&lt;br /&gt;
 }&lt;br /&gt;
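&lt;br /&gt;
As an analogy only (Toil actually runs scatter iterations as separate, parallel jobs), the scatter above behaves like a Python list comprehension, with per-iteration values gathered into an array:&lt;br /&gt;
&lt;br /&gt;
```python
item_count = 5

# WDL: Array[Int] numbers = range(item_count)
numbers = list(range(item_count))

# WDL scatter body: Int one_based = i + 1
# Outside the scatter, one_based is seen as an Array[Int].
one_based = [i + 1 for i in numbers]

print(one_based)  # [1, 2, 3, 4, 5]
```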
&lt;br /&gt;
===Conditionals===&lt;br /&gt;
&lt;br /&gt;
Inside the body of the scatter, we are going to put some conditionals to determine if we should produce &amp;lt;code&amp;gt;&amp;quot;Fizz&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;Buzz&amp;quot;&amp;lt;/code&amp;gt;, or &amp;lt;code&amp;gt;&amp;quot;FizzBuzz&amp;quot;&amp;lt;/code&amp;gt;. To support our &amp;lt;code&amp;gt;fizzbuzz_override&amp;lt;/code&amp;gt;, we put it and a default value into an array, and use the WDL &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; function to pick the first non-null value from that array.&lt;br /&gt;
&lt;br /&gt;
Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; at the end and take advantage of variables from un-executed conditionals being &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Note that WDL supports conditional ''expressions'' with a &amp;lt;code&amp;gt;then&amp;lt;/code&amp;gt; and an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt;, but conditional ''statements'' only have a body, not an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt; branch. If you need an else you will have to check the negated condition.&lt;br /&gt;
&lt;br /&gt;
So first, let's handle the special cases.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
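&lt;br /&gt;
To see how the untaken-branch-is-null pattern fits together, here is a rough Python model of the same logic, with &amp;lt;code&amp;gt;None&amp;lt;/code&amp;gt; standing in for WDL null and &amp;lt;code&amp;gt;next()&amp;lt;/code&amp;gt; playing the role of &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; (an analogy, not how Toil evaluates WDL):&lt;br /&gt;
&lt;br /&gt;
```python
def fizzbuzz_word(one_based, to_fizz=3, to_buzz=5, fizzbuzz_override=None):
    # Variables declared in conditionals that did not run are null (None here).
    fizz = "Fizz" if one_based % to_fizz == 0 else None
    fizzbuzz = None
    if fizz is not None and one_based % to_buzz == 0:
        # select_first([fizzbuzz_override, "FizzBuzz"])
        fizzbuzz = "FizzBuzz" if fizzbuzz_override is None else fizzbuzz_override
    buzz = "Buzz" if one_based % to_buzz == 0 else None
    # select_first(): take the first non-null value.
    return next(v for v in [fizzbuzz, fizz, buzz, str(one_based)] if v is not None)

print([fizzbuzz_word(n) for n in range(1, 16)])
```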
&lt;br /&gt;
===Calling Tasks===&lt;br /&gt;
&lt;br /&gt;
Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a &amp;lt;code&amp;gt;${}&amp;lt;/code&amp;gt; substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a &amp;lt;code&amp;gt;stringify_number&amp;lt;/code&amp;gt; task, to be written later.&lt;br /&gt;
&lt;br /&gt;
To call a task (or another workflow), we use a &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt; statement and give it some inputs. Then we can fish the output values out of the task with &amp;lt;code&amp;gt;.&amp;lt;/code&amp;gt; access, for the iterations where the task actually ran (that is, where we didn't produce a noise instead).&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
        call stringify_number {&lt;br /&gt;
            input:&lt;br /&gt;
                the_number = one_based&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We can put the code into the workflow now, and set about writing the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         &lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Writing Tasks===&lt;br /&gt;
&lt;br /&gt;
Our task should go after the workflow in the file. It looks a lot like a workflow except it uses &amp;lt;code&amp;gt;task&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're going to want it to take in an integer &amp;lt;code&amp;gt;the_number&amp;lt;/code&amp;gt;, and we're going to want it to output a string &amp;lt;code&amp;gt;the_string&amp;lt;/code&amp;gt;. So let's fill that in in &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; sections.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     # ???&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now, unlike workflows, tasks can have a &amp;lt;code&amp;gt;command&amp;lt;/code&amp;gt; section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use &amp;lt;code&amp;gt;~{}&amp;lt;/code&amp;gt;, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now we need to capture the result of the command script. The WDL &amp;lt;code&amp;gt;stdout()&amp;lt;/code&amp;gt; returns a WDL &amp;lt;code&amp;gt;File&amp;lt;/code&amp;gt; containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
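&lt;br /&gt;
The capture step can be modeled roughly like this: &amp;lt;code&amp;gt;stdout()&amp;lt;/code&amp;gt; hands back the command's output as a file, and &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; reads it and drops the trailing newline, much like &amp;lt;code&amp;gt;rstrip&amp;lt;/code&amp;gt; in this Python sketch:&lt;br /&gt;
&lt;br /&gt;
```python
def read_string(captured: str) -> str:
    # WDL read_string() removes trailing line endings from the file contents.
    return captured.rstrip("\r\n")

# What our task's `echo ~{the_number}` would produce for the_number = 42:
print(read_string("42\n"))  # 42
```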
&lt;br /&gt;
We're also going to want to add a &amp;lt;code&amp;gt;runtime&amp;lt;/code&amp;gt; section to our task, to specify resource requirements. We'll also tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;disks&amp;lt;/code&amp;gt; section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a &amp;lt;code&amp;gt;local-disk&amp;lt;/code&amp;gt; of a certain number of gigabytes, which may suggest that it be &amp;lt;code&amp;gt;SSD&amp;lt;/code&amp;gt; storage.&lt;br /&gt;
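&lt;br /&gt;
Interpreting that string: three space-separated fields, the literal &amp;lt;code&amp;gt;local-disk&amp;lt;/code&amp;gt;, a size in gigabytes, and a storage type. A small Python parser sketch of that convention (the dictionary keys here are just for illustration):&lt;br /&gt;
&lt;br /&gt;
```python
def parse_disks(spec):
    # Cromwell-style: "local-disk <size in GB> <storage type>"
    name, size_gb, disk_type = spec.split()
    return {"name": name, "size_gb": int(size_gb), "type": disk_type}

print(parse_disks("local-disk 1 SSD"))
```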
&lt;br /&gt;
Then we can put our task into our WDL file:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Output Block===&lt;br /&gt;
&lt;br /&gt;
Now the only thing missing is a workflow-level &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section. Technically, in WDL 1.0 you aren't supposed to need this, but you do need it in 1.1 and Toil doesn't actually send your outputs anywhere yet if you don't have one, so we're going to make one. We need to collect together all the strings that came out of the different tasks in our scatter into an &amp;lt;code&amp;gt;Array[String]&amp;lt;/code&amp;gt;. We'll add the &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section at the end of the &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; section, above the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
          String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
     output {&lt;br /&gt;
        Array[String] fizzbuzz_results = result&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Because the &amp;lt;code&amp;gt;result&amp;lt;/code&amp;gt; variable is defined inside a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt;, when we reference it outside the scatter we see it as being an array.&lt;br /&gt;
&lt;br /&gt;
==Running the Workflow==&lt;br /&gt;
&lt;br /&gt;
Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;FizzBuzz.item_count&amp;quot;: 20}' &amp;gt;fizzbuzz.json&lt;br /&gt;
&lt;br /&gt;
Then run it on the cluster with Toil:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
Or locally:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
=Debugging Workflows=&lt;br /&gt;
Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.&lt;br /&gt;
&lt;br /&gt;
==Debugging Options==&lt;br /&gt;
When debugging a workflow, make sure to run the workflow with &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt;, to set the log level to &amp;lt;code&amp;gt;DEBUG&amp;lt;/code&amp;gt;, and with &amp;lt;code&amp;gt;--jobStore /some/path/to/a/shared/directory/it/can/create&amp;lt;/code&amp;gt; so that the stored files shipped between jobs are in a place you can access them.&lt;br /&gt;
&lt;br /&gt;
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:&lt;br /&gt;
&lt;br /&gt;
 =========&amp;gt;&lt;br /&gt;
         Toil job log is here&lt;br /&gt;
 &amp;lt;=========&lt;br /&gt;
&lt;br /&gt;
Normally, only the logs of failing jobs are reproduced like this.&lt;br /&gt;
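With debug logging on, a long run can embed many such blocks, and you can pull them all out of a saved log with &amp;lt;code&amp;gt;sed&amp;lt;/code&amp;gt;. A minimal sketch, assuming you captured the main Toil log to a file (the file name &amp;lt;code&amp;gt;toil.log&amp;lt;/code&amp;gt; and its contents here are made up):&lt;br /&gt;

```shell
# Stand-in for a saved Toil log, so the extraction command below has
# something to work on; a real run would produce this with something
# like `toil-wdl-runner ... 2> toil.log`.
printf '%s\n' 'workflow chatter' '=========>' 'Toil job log is here' '<=========' 'more chatter' > toil.log

# Print only the sections between the job-log markers:
sed -n '/=========>/,/<=========/p' toil.log
```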
&lt;br /&gt;
==Reading the Log==&lt;br /&gt;
&lt;br /&gt;
When a WDL workflow fails, you are likely to see a message like this:&lt;br /&gt;
&lt;br /&gt;
 WDL.runtime.error.CommandFailed: task command failed with exit status 1&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism&lt;br /&gt;
&lt;br /&gt;
This means that the command line specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code. This happens either when the command is written incorrectly, or when the error detection code in the tool you are trying to run detects and reports a problem.&lt;br /&gt;
&lt;br /&gt;
Go up higher in the log until you find lines that look like:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt:&lt;br /&gt;
&lt;br /&gt;
And&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt:&lt;br /&gt;
&lt;br /&gt;
These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.&lt;br /&gt;
&lt;br /&gt;
==Reproducing Problems==&lt;br /&gt;
&lt;br /&gt;
When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need your tool's input files in order to do this. In the log of your failing Toil task, look for lines like this:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&amp;lt;/code&amp;gt; part is a Toil file ID, and it is a relative path from your &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; value to where the file is stored on disk. So if you ran the workflow with &amp;lt;code&amp;gt;--jobStore /private/groups/patenlab/anovak/jobstore&amp;lt;/code&amp;gt;, you would look for this file at:&lt;br /&gt;
&lt;br /&gt;
 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&lt;br /&gt;
&lt;br /&gt;
==More Ways of Finding Files==&lt;br /&gt;
&lt;br /&gt;
Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the &amp;lt;code&amp;gt;find&amp;lt;/code&amp;gt; command to try and find the files by name. For example, if you want to look at &amp;lt;code&amp;gt;Sample.bam&amp;lt;/code&amp;gt;, you can look for it like this:&lt;br /&gt;
&lt;br /&gt;
 find /path/to/the/jobstore -name &amp;quot;Sample.bam&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
You can take the &amp;lt;code&amp;gt;toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt; URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this:&lt;br /&gt;
&lt;br /&gt;
 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
Then you can take the part after the last colon, &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt;, and that is the path relative to the job store where this file can be found.&lt;br /&gt;
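If you'd rather script the decoding than use a web page, the &amp;lt;code&amp;gt;python3&amp;lt;/code&amp;gt; standard library can do it. A sketch using the example URI from above (assuming &amp;lt;code&amp;gt;python3&amp;lt;/code&amp;gt; is available on your node):&lt;br /&gt;

```shell
# URL-decode a toilfile: URI and keep the part after the last colon,
# which is the path relative to the job store.
uri='toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam'
decoded=$(python3 -c 'import sys, urllib.parse; print(urllib.parse.unquote(sys.argv[1]))' "$uri")
echo "${decoded##*:}"
```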
&lt;br /&gt;
==Using Development Versions of Toil==&lt;br /&gt;
&lt;br /&gt;
Sometimes, bugs will be fixed in the development version of Toil, but not released yet. To try the current development version of Toil, you can install it like this:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
If you want to use a particular branch or commit, like &amp;lt;code&amp;gt;aaa451b320fc115b3563ced25cb501301cf86f90&amp;lt;/code&amp;gt;, you can do:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
==Frequently Asked Questions==&lt;br /&gt;
&lt;br /&gt;
===I am getting warnings about &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt;===&lt;br /&gt;
You may be seeing warnings that &amp;lt;code&amp;gt;XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This happens because Slurm is providing Toil with an &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt; environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it.&lt;br /&gt;
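You can reproduce the check Toil performs yourself. This sketch points the variable at a deliberately made-up path to show the failure mode:&lt;br /&gt;

```shell
# Simulate the misconfiguration: point XDG_RUNTIME_DIR at a made-up,
# nonexistent path, the way the buggy Slurm configuration does.
XDG_RUNTIME_DIR=/nonexistent/run/user/12345
if [ ! -d "$XDG_RUNTIME_DIR" ]; then
    echo "XDG_RUNTIME_DIR is set to nonexistent directory $XDG_RUNTIME_DIR"
fi
```

The warning is harmless: Toil works around the bad value rather than failing.&lt;br /&gt;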
&lt;br /&gt;
===Toil said it was &amp;lt;code&amp;gt;Redirecting logging&amp;lt;/code&amp;gt; somewhere, but I can't find that file!===&lt;br /&gt;
The Toil worker process for each job will say that it is &amp;lt;code&amp;gt;Redirecting logging to /data/tmp/somewhere/worker_log.txt&amp;lt;/code&amp;gt;, and when running in single machine mode these messages go to the main Toil log.&lt;br /&gt;
&lt;br /&gt;
The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt; option to Toil.&lt;br /&gt;
&lt;br /&gt;
If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.&lt;br /&gt;
&lt;br /&gt;
=Additional WDL resources=&lt;br /&gt;
&lt;br /&gt;
For more information on writing and running WDL workflows, see:&lt;br /&gt;
&lt;br /&gt;
* [https://docs.openwdl.org/en/stable/ The WDL documentation]&lt;br /&gt;
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The &amp;quot;Learn WDL&amp;quot; video course on YouTube]&lt;br /&gt;
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
	<entry>
		<id>http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=425</id>
		<title>Phoenix WDL Tutorial</title>
		<link rel="alternate" type="text/html" href="http://giwiki.gi.ucsc.edu/index.php?title=Phoenix_WDL_Tutorial&amp;diff=425"/>
		<updated>2023-10-20T21:21:46Z</updated>

		<summary type="html">&lt;p&gt;Anovak: Remind people where the data needs to live.&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''Tutorial: Getting Started with WDL Workflows on Phoenix'''&lt;br /&gt;
&lt;br /&gt;
Instead of giant shell scripts that only work on one grad student's laptop, modern, reusable bioinformatics experiments should be written as workflows, in a language like the Workflow Description Language (WDL). Workflows succinctly describe their own execution requirements, and which pieces depend on which other pieces, making your analyses reproducible by people other than you.&lt;br /&gt;
&lt;br /&gt;
Workflows are also easily scaled up and down: you can develop and test your workflow on a small test data set on one machine, and then run it on real data on the cluster without having to worry about whether the right tasks will run in the right order.&lt;br /&gt;
&lt;br /&gt;
This tutorial will help you get started writing and running workflows. The '''Phoenix Cluster Setup''' section is specifically for the UC Santa Cruz Genomics Institute's Phoenix Slurm cluster. The other sections are broadly applicable to other environments. By the end, you will be able to run workflows on Slurm with [https://toil.readthedocs.io/en/latest/ Toil], write your own workflows in WDL, and debug workflows when something goes wrong.&lt;br /&gt;
&lt;br /&gt;
=Phoenix Cluster Setup=&lt;br /&gt;
&lt;br /&gt;
Before we begin, you will need a computer you can install software on, and the ability to connect to other machines over SSH.&lt;br /&gt;
&lt;br /&gt;
==Getting VPN access==&lt;br /&gt;
&lt;br /&gt;
We are going to work on the Phoenix cluster, but this cluster is kept behind the Prism firewall, where all of our controlled-access data lives. So, to get access to the cluster, you need to get access to the VPN (Virtual Private Network) system that we use to allow people through the firewall.&lt;br /&gt;
&lt;br /&gt;
To get VPN access, follow the instructions at https://giwiki.gi.ucsc.edu/index.php/Requirement_for_users_to_get_GI_VPN_access. Note that this process involves making a one-on-one appointment with one of our admins to help you set up your VPN client, so make sure to do it in advance of when you need to use the cluster.&lt;br /&gt;
&lt;br /&gt;
==Connecting to Phoenix==&lt;br /&gt;
&lt;br /&gt;
Once you have VPN access, you can connect to the &amp;quot;head node&amp;quot; of the Phoenix cluster. This node is where everyone logs in, but you should '''not''' run actual work on this node; it exists only to give you access to the files on the cluster and to the commands to control cluster jobs.&lt;br /&gt;
&lt;br /&gt;
To connect to the head node:&lt;br /&gt;
&lt;br /&gt;
1. Connect to the VPN.&lt;br /&gt;
2. SSH to &amp;lt;code&amp;gt;phoenix.prism&amp;lt;/code&amp;gt;. At the command line, run:&lt;br /&gt;
&lt;br /&gt;
 ssh phoenix.prism&lt;br /&gt;
&lt;br /&gt;
If your username on the cluster (say, &amp;lt;code&amp;gt;flastname&amp;lt;/code&amp;gt;) is different from your username on your computer (which might be &amp;lt;code&amp;gt;firstname&amp;lt;/code&amp;gt;), you might instead have to run:&lt;br /&gt;
&lt;br /&gt;
 ssh flastname@phoenix.prism&lt;br /&gt;
&lt;br /&gt;
The first time you connect, you will see a message like:&lt;br /&gt;
&lt;br /&gt;
 The authenticity of host 'phoenix.prism (10.50.1.66)' can't be established.&lt;br /&gt;
 ED25519 key fingerprint is SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI.&lt;br /&gt;
 This key is not known by any other names&lt;br /&gt;
 Are you sure you want to continue connecting (yes/no/[fingerprint])?&lt;br /&gt;
&lt;br /&gt;
This is your computer asking you to help it decide if it is talking to the genuine &amp;lt;code&amp;gt;phoenix.prism&amp;lt;/code&amp;gt;, and not an imposter. You will want to make sure that the &amp;quot;key fingerprint&amp;quot; is indeed &amp;lt;code&amp;gt;SHA256:SUgdBXgsWwUJXxAz/BpGzlGFLOsFtZzeqQ3kzdl3iuI&amp;lt;/code&amp;gt;. If it is not, someone (probably the GI sysadmins, but possibly a cabal of hackers) has replaced the head node, and you should verify that this was supposed to happen. If the fingerprints do match, type &amp;lt;code&amp;gt;yes&amp;lt;/code&amp;gt; to accept and remember that the server is who it says it is.&lt;br /&gt;
&lt;br /&gt;
==Installing Toil with WDL support==&lt;br /&gt;
&lt;br /&gt;
Once you are on the head node, you can install Toil, a program for running workflows. When installing, you need to specify that you want WDL support. To do this, you can run:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl]'&lt;br /&gt;
&lt;br /&gt;
If you also want to use AWS S3 &amp;lt;code&amp;gt;s3://&amp;lt;/code&amp;gt; and/or Google &amp;lt;code&amp;gt;gs://&amp;lt;/code&amp;gt; URLs for data, you will need to also install Toil with the &amp;lt;code&amp;gt;aws&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;google&amp;lt;/code&amp;gt; extras, respectively:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
This will install Toil in the &amp;lt;code&amp;gt;.local&amp;lt;/code&amp;gt; directory inside your home directory, which we write as &amp;lt;code&amp;gt;~/.local&amp;lt;/code&amp;gt;. The program to run WDL workflows, &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, will be at &amp;lt;code&amp;gt;~/.local/bin/toil-wdl-runner&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
By default, the command interpreter '''will not''' look there, so if you type &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt;, it will complain that the command is not found. To fix this, you need to configure the command interpreter (bash) to look where Toil is installed. To do this, run:&lt;br /&gt;
&lt;br /&gt;
 echo 'export PATH=&amp;quot;${HOME}/.local/bin:${PATH}&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in''', to restart bash and pick up the change.&lt;br /&gt;
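If you want the change in your current shell right away (new logins are already covered by the &amp;lt;code&amp;gt;.bashrc&amp;lt;/code&amp;gt; line above), you can also apply it by hand:&lt;br /&gt;

```shell
# Put Toil's install directory on the search path for this shell session only.
export PATH="${HOME}/.local/bin:${PATH}"

# The directory should now be the first entry on the search path:
echo "$PATH" | cut -d: -f1
```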
&lt;br /&gt;
To make sure it worked, you can run:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner --help&lt;br /&gt;
&lt;br /&gt;
If everything worked correctly, it will print a long list of the various option flags that the &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; command supports.&lt;br /&gt;
&lt;br /&gt;
If you ever want to upgrade Toil to a new release, you can repeat the &amp;lt;code&amp;gt;pip&amp;lt;/code&amp;gt; command above.&lt;br /&gt;
&lt;br /&gt;
==Configuring Toil for Phoenix==&lt;br /&gt;
&lt;br /&gt;
Toil is set up to work in a large number of different environments, and doesn't necessarily rely on the existence of things like a shared cluster filesystem. However, on the Phoenix cluster, we have a shared filesystem, and so we should configure Toil to use it for caching the Docker container images used for running workflow steps. So, use these commands to make sure that Toil knows where it ought to put its caches:&lt;br /&gt;
&lt;br /&gt;
 echo 'export SINGULARITY_CACHEDIR=&amp;quot;${HOME}/.singularity/cache&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
 echo 'export MINIWDL__SINGULARITY__IMAGE_CACHE=&amp;quot;${HOME}/.cache/miniwdl&amp;quot;' &amp;gt;&amp;gt;~/.bashrc&lt;br /&gt;
&lt;br /&gt;
After that, '''log out and log back in again''', to apply the changes.&lt;br /&gt;
&lt;br /&gt;
If you don't do this, Toil will re-download each container image, on each node, for each run of each workflow. That wastes a lot of time, and can exhaust the [https://docs.docker.com/docker-hub/download-rate-limit/#whats-the-download-rate-limit-on-docker-hub limits on how many containers you are allowed to download each day], so it is important not to skip this step.&lt;br /&gt;
&lt;br /&gt;
==Configuring your Phoenix Environment==&lt;br /&gt;
&lt;br /&gt;
'''Do not store data in your home directory on Phoenix!''' The home directories are meant for code and programs. Any data worth running a workflow on should be in a directory under &amp;lt;code&amp;gt;/public/groups&amp;lt;/code&amp;gt;. You will probably need to email the admins to get added to a group so you can create a directory to work in somewhere under &amp;lt;code&amp;gt;/public/groups&amp;lt;/code&amp;gt;. Usually you would end up with &amp;lt;code&amp;gt;/public/groups/YOURGROUPNAME/YOURUSERNAME&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
=Running an existing workflow=&lt;br /&gt;
&lt;br /&gt;
First, let's use &amp;lt;code&amp;gt;toil-wdl-runner&amp;lt;/code&amp;gt; to run an existing demonstration workflow. We're going to use the MiniWDL self-test workflow, from the [https://github.com/chanzuckerberg/miniwdl#readme MiniWDL project].&lt;br /&gt;
&lt;br /&gt;
To start, go to your user directory under &amp;lt;code&amp;gt;/public/groups&amp;lt;/code&amp;gt;, and make a directory to work in.&lt;br /&gt;
&lt;br /&gt;
 cd /public/groups/YOURGROUPNAME/YOURUSERNAME&lt;br /&gt;
 mkdir workflow-test&lt;br /&gt;
 cd workflow-test&lt;br /&gt;
&lt;br /&gt;
Next, download the workflow. While Toil can run workflows directly from a URL, your commands will be shorter if the workflow is available locally.&lt;br /&gt;
&lt;br /&gt;
 wget https://raw.githubusercontent.com/DataBiosphere/toil/d686daca091849e681d2f3f3a349001ca83d2e3e/src/toil/test/wdl/miniwdl_self_test/self_test.wdl&lt;br /&gt;
&lt;br /&gt;
==Preparing an input file==&lt;br /&gt;
&lt;br /&gt;
Near the top of the WDL file, there's a section like this:&lt;br /&gt;
&lt;br /&gt;
 workflow hello_caller {&lt;br /&gt;
     input {&lt;br /&gt;
         File who&lt;br /&gt;
     }&lt;br /&gt;
&lt;br /&gt;
This means that there is a workflow named &amp;lt;code&amp;gt;hello_caller&amp;lt;/code&amp;gt; in this file, and it takes as input a file variable named &amp;lt;code&amp;gt;who&amp;lt;/code&amp;gt;. For this particular workflow, the file is supposed to have a list of names, one per line, and the workflow is going to greet each one.&lt;br /&gt;
&lt;br /&gt;
So first, we have to make that list of names. Let's make it in &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
 echo &amp;quot;Mridula Resurrección&amp;quot; &amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Gershom Šarlota&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
 echo &amp;quot;Ritchie Ravi&amp;quot; &amp;gt;&amp;gt;names.txt&lt;br /&gt;
&lt;br /&gt;
Then, we need to create an ''inputs file'', which is a JSON (JavaScript Object Notation) file describing what value to use for each input when running the workflow. (You can also reach down into the workflow and override individual task settings, but for now we'll just set the inputs.) So, make another file next to &amp;lt;code&amp;gt;names.txt&amp;lt;/code&amp;gt; that references it by relative path, like this:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./names.txt&amp;quot;}' &amp;gt;inputs.json&lt;br /&gt;
&lt;br /&gt;
Note that, for a key, we're using the workflow name, a dot, and then the input name. For a value, we're using a quoted string of the filename, relative to the location of the inputs file. Absolute paths and URLs will also work for files; more information on the input file syntax is in [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md#json-input-format the JSON Input Format section of the WDL specification].&lt;br /&gt;
&lt;br /&gt;
==Testing at small scale on a single machine==&lt;br /&gt;
&lt;br /&gt;
We are now ready to run the workflow!&lt;br /&gt;
&lt;br /&gt;
You don't want to run workflows on the head node. So, use Slurm to get an interactive session on one of the cluster's worker nodes, by running:&lt;br /&gt;
&lt;br /&gt;
 srun -c 2 --mem 8G --pty bash -i&lt;br /&gt;
&lt;br /&gt;
This will start a new shell; to leave it and go back to the head node you can use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
In your new shell, run this Toil command:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner self_test.wdl inputs.json -o local_run&lt;br /&gt;
&lt;br /&gt;
This will, by default, use the &amp;lt;code&amp;gt;single_machine&amp;lt;/code&amp;gt; Toil &amp;quot;batch system&amp;quot; to run all of the workflow's tasks locally. Output will be sent to a new directory named &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
This will print a lot of logging to standard error, and to standard output it will print:&lt;br /&gt;
&lt;br /&gt;
 {&amp;quot;hello_caller.message_files&amp;quot;: [&amp;quot;local_run/Mridula Resurrecci\u00f3n.txt&amp;quot;, &amp;quot;local_run/Gershom \u0160arlota.txt&amp;quot;, &amp;quot;local_run/Ritchie Ravi.txt&amp;quot;], &amp;quot;hello_caller.messages&amp;quot;: [&amp;quot;Hello, Mridula Resurrecci\u00f3n!&amp;quot;, &amp;quot;Hello, Gershom \u0160arlota!&amp;quot;, &amp;quot;Hello, Ritchie Ravi!&amp;quot;]}&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;local_run&amp;lt;/code&amp;gt; directory will contain the described text files (with Unicode escape sequences like &amp;lt;code&amp;gt;\u00f3&amp;lt;/code&amp;gt; replaced by their corresponding characters), each containing a greeting for the corresponding person.&lt;br /&gt;
&lt;br /&gt;
To leave your interactive Slurm session and return to the head node, use &amp;lt;code&amp;gt;exit&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
==Running at larger scale==&lt;br /&gt;
&lt;br /&gt;
Back on the head node, let's prepare a larger run. Greeting 3 people isn't cool; let's greet one hundred people!&lt;br /&gt;
&lt;br /&gt;
Go get this handy list of people and cut it to length:&lt;br /&gt;
&lt;br /&gt;
 wget https://gist.githubusercontent.com/smsohan/ae142977b5099dba03f6e0d909108e97/raw/f6e319b1a0f6a0f87f93f73b3acd24795361aeba/1000_names.txt&lt;br /&gt;
 head -n100 1000_names.txt &amp;gt;100_names.txt&lt;br /&gt;
&lt;br /&gt;
And make a new inputs file:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;hello_caller.who&amp;quot;: &amp;quot;./100_names.txt&amp;quot;}' &amp;gt;inputs_big.json&lt;br /&gt;
&lt;br /&gt;
Now, we will run the same workflow, but with the new inputs, and against the Slurm cluster.&lt;br /&gt;
&lt;br /&gt;
To run against the Slurm cluster, we need to use the &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; option to point Toil to a shared directory it can create where it can store information that the cluster nodes can read. We will add the &amp;lt;code&amp;gt;--batchLogsDir&amp;lt;/code&amp;gt; option to tell Toil to store the logs from the individual Slurm jobs in a folder on the shared filesystem. We'll also use the &amp;lt;code&amp;gt;-m&amp;lt;/code&amp;gt; option to save the output JSON to a file instead of printing it.&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 toil-wdl-runner --jobStore ./big_store --batchSystem slurm --batchLogsDir ./logs self_test.wdl inputs_big.json -o slurm_run -m slurm_run.json&lt;br /&gt;
&lt;br /&gt;
This will tick for a while, but eventually you should end up with 100 greeting files in the &amp;lt;code&amp;gt;slurm_run&amp;lt;/code&amp;gt; directory.&lt;br /&gt;
&lt;br /&gt;
=Writing your own workflow=&lt;br /&gt;
&lt;br /&gt;
In addition to running existing workflows, you probably want to be able to write your own. This part of the tutorial will walk you through writing a workflow. We're going to write a workflow for [https://en.wikipedia.org/wiki/Fizz_buzz Fizz Buzz]. &lt;br /&gt;
&lt;br /&gt;
==Writing the file==&lt;br /&gt;
&lt;br /&gt;
===Version===&lt;br /&gt;
&lt;br /&gt;
All WDL files need to start with a &amp;lt;code&amp;gt;version&amp;lt;/code&amp;gt; statement (unless they are very old &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; files). Toil supports &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt;, WDL 1.0, and WDL 1.1, while Cromwell (another popular WDL runner used on Terra) supports only &amp;lt;code&amp;gt;draft-2&amp;lt;/code&amp;gt; and 1.0.&lt;br /&gt;
&lt;br /&gt;
So let's start a new WDL 1.0 workflow. Open up a file named &amp;lt;code&amp;gt;fizzbuzz.wdl&amp;lt;/code&amp;gt; and start with a version statement:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
&lt;br /&gt;
===Workflow Block===&lt;br /&gt;
&lt;br /&gt;
Then, add an empty &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; named &amp;lt;code&amp;gt;FizzBuzz&amp;lt;/code&amp;gt;.&lt;br /&gt;
 &lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Input Block===&lt;br /&gt;
&lt;br /&gt;
Workflows usually need some kind of user input, so let's give our workflow an &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; section.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Notice that each input has a type, a name, and an optional default value. If the type ends in &amp;lt;code&amp;gt;?&amp;lt;/code&amp;gt;, the value is optional, and it may be &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. If an input is ''not'' optional, and there is no default value, then the user's inputs file ''must'' specify a value for it in order for the workflow to run.&lt;br /&gt;
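This means an inputs file for this workflow only has to set &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;, but it may also override the defaulted and optional inputs. A sketch, with arbitrary example values for the overrides:&lt;br /&gt;

```shell
# Inputs file overriding the defaults; only FizzBuzz.item_count is required.
# The override values here are arbitrary examples.
cat >fizzbuzz_custom.json <<'EOF'
{"FizzBuzz.item_count": 15, "FizzBuzz.to_fizz": 4, "FizzBuzz.fizzbuzz_override": "FizzBang"}
EOF

# Check that the file is valid JSON:
python3 -m json.tool fizzbuzz_custom.json
```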
&lt;br /&gt;
===Body===&lt;br /&gt;
&lt;br /&gt;
Now we'll start on the body of the workflow, to be inserted just after the inputs section. &lt;br /&gt;
&lt;br /&gt;
The first thing we're going to need to do is create an array of all the numbers up to the &amp;lt;code&amp;gt;item_count&amp;lt;/code&amp;gt;. We can do this by calling the WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; function, and assigning the result to an &amp;lt;code&amp;gt;Array[Int]&amp;lt;/code&amp;gt; variable.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
&lt;br /&gt;
WDL 1.0 has [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#standard-library a wide variety of functions in its standard library], and WDL 1.1 has even more.&lt;br /&gt;
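Note that &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; counts from zero: &amp;lt;code&amp;gt;range(5)&amp;lt;/code&amp;gt; is &amp;lt;code&amp;gt;[0, 1, 2, 3, 4]&amp;lt;/code&amp;gt;. If you want to eyeball that, a rough command-line analogue (this is &amp;lt;code&amp;gt;seq&amp;lt;/code&amp;gt;, not WDL):&lt;br /&gt;

```shell
# WDL's range(5) yields the integers 0 through 4; seq can mimic that.
seq 0 4
```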
&lt;br /&gt;
===Scattering===&lt;br /&gt;
&lt;br /&gt;
Once we create an array of all the numbers, we can use a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt; to operate on each. WDL does not have loops; instead it has scatters, which work a bit like a &amp;lt;code&amp;gt;map()&amp;lt;/code&amp;gt; in Python. The body of the scatter runs for each value in the input array, all in parallel. We're going to increment all the numbers, since FizzBuzz starts at 1 but WDL &amp;lt;code&amp;gt;range()&amp;lt;/code&amp;gt; starts at 0.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
     Int one_based = i + 1&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Conditionals===&lt;br /&gt;
&lt;br /&gt;
Inside the body of the scatter, we are going to put some conditionals to determine if we should produce &amp;lt;code&amp;gt;&amp;quot;Fizz&amp;quot;&amp;lt;/code&amp;gt;, &amp;lt;code&amp;gt;&amp;quot;Buzz&amp;quot;&amp;lt;/code&amp;gt;, or &amp;lt;code&amp;gt;&amp;quot;FizzBuzz&amp;quot;&amp;lt;/code&amp;gt;. To support our &amp;lt;code&amp;gt;fizzbuzz_override&amp;lt;/code&amp;gt;, we use an array of it and a default value, and use the WDL &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; function to find the first non-null value in that array.&lt;br /&gt;
&lt;br /&gt;
Each execution of a scatter is allowed to declare variables, and outside the scatter those variables are combined into arrays of all the results. But each variable can be declared only ''once'' in the scatter, even with conditionals. So we're going to use &amp;lt;code&amp;gt;select_first()&amp;lt;/code&amp;gt; at the end and take advantage of variables from un-executed conditionals being &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;. &lt;br /&gt;
&lt;br /&gt;
Note that WDL supports conditional ''expressions'' with a &amp;lt;code&amp;gt;then&amp;lt;/code&amp;gt; and an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt;, but conditional ''statements'' only have a body, not an &amp;lt;code&amp;gt;else&amp;lt;/code&amp;gt; branch. If you need an else you will have to check the negated condition.&lt;br /&gt;
&lt;br /&gt;
So first, let's handle the special cases.&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
    }&lt;br /&gt;
 }&lt;br /&gt;
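Before wiring this into the full workflow, it can help to sanity-check the branching outside WDL. Here is the same logic as a throwaway shell loop (purely illustrative, using the default 3 and 5):&lt;br /&gt;

```shell
# The same FizzBuzz branching as the WDL conditionals above, for the
# default to_fizz=3 and to_buzz=5, over the first 15 one-based numbers.
to_fizz=3
to_buzz=5
for one_based in $(seq 1 15); do
    if [ $((one_based % to_fizz)) -eq 0 ] && [ $((one_based % to_buzz)) -eq 0 ]; then
        echo "FizzBuzz"
    elif [ $((one_based % to_fizz)) -eq 0 ]; then
        echo "Fizz"
    elif [ $((one_based % to_buzz)) -eq 0 ]; then
        echo "Buzz"
    else
        echo "$one_based"
    fi
done
```

The last line printed should be &amp;lt;code&amp;gt;FizzBuzz&amp;lt;/code&amp;gt;, since 15 is a multiple of both 3 and 5.&lt;br /&gt;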
&lt;br /&gt;
===Calling Tasks===&lt;br /&gt;
&lt;br /&gt;
Now for the normal numbers, we need to convert our number into a string. In WDL 1.1, and in WDL 1.0 on Cromwell, you can use a &amp;lt;code&amp;gt;${}&amp;lt;/code&amp;gt; substitution syntax in quoted strings anywhere, not just in command line commands. Toil technically will support this too, but it's not in the spec, and the tutorial needs an excuse for you to call a task. So we're going to insert a call to a &amp;lt;code&amp;gt;stringify_number&amp;lt;/code&amp;gt; task, to be written later.&lt;br /&gt;
&lt;br /&gt;
To call a task (or another workflow), we use a &amp;lt;code&amp;gt;call&amp;lt;/code&amp;gt; statement and give it some inputs. Then we can fish the output values out of the task with &amp;lt;code&amp;gt;.&amp;lt;/code&amp;gt; access, but only when we didn't make a noise instead; for the fizzy and buzzy numbers, the call never runs and its outputs are &amp;lt;code&amp;gt;null&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 Array[Int] numbers = range(item_count)&lt;br /&gt;
 scatter (i in numbers) {&lt;br /&gt;
    Int one_based = i + 1&lt;br /&gt;
    if (one_based % to_fizz == 0) {&lt;br /&gt;
        String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
        if (one_based % to_buzz == 0) {&lt;br /&gt;
            String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_buzz == 0) {&lt;br /&gt;
        String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
    }&lt;br /&gt;
    if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
        # Just a normal number.&lt;br /&gt;
        call stringify_number {&lt;br /&gt;
            input:&lt;br /&gt;
                the_number = one_based&lt;br /&gt;
        }&lt;br /&gt;
    }&lt;br /&gt;
    String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We can put the code into the workflow now, and set about writing the task.&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         &lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Writing Tasks===&lt;br /&gt;
&lt;br /&gt;
Our task should go after the workflow in the file. It looks a lot like a workflow except it uses &amp;lt;code&amp;gt;task&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're going to want it to take in an integer &amp;lt;code&amp;gt;the_number&amp;lt;/code&amp;gt;, and we're going to want it to output a string &amp;lt;code&amp;gt;the_string&amp;lt;/code&amp;gt;. So let's fill that in in &amp;lt;code&amp;gt;input&amp;lt;/code&amp;gt; and &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; sections.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     # ???&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now, unlike workflows, tasks can have a &amp;lt;code&amp;gt;command&amp;lt;/code&amp;gt; section, which gives a command to run. This section is now usually set off with triple angle brackets, and inside it you can use &amp;lt;code&amp;gt;~{}&amp;lt;/code&amp;gt;, that is, Bash-like substitution but with a tilde, to place WDL variables into your command script. So let's add a command that will echo back the number so we can see it as a string.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string # = ???&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Now we need to capture the result of the command script. The WDL &amp;lt;code&amp;gt;stdout()&amp;lt;/code&amp;gt; returns a WDL &amp;lt;code&amp;gt;File&amp;lt;/code&amp;gt; containing the standard output printed by the task's command. We want to read that back into a string, which we can do with the WDL &amp;lt;code&amp;gt;read_string()&amp;lt;/code&amp;gt; function (which also [https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md#string-read_stringstringfile removes trailing newlines]).&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
        # This is a Bash script.&lt;br /&gt;
        # So we should do good Bash script things like stop on errors&lt;br /&gt;
        set -e&lt;br /&gt;
        # Now print our number as a string&lt;br /&gt;
        echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
We're also going to want to add a &amp;lt;code&amp;gt;runtime&amp;lt;/code&amp;gt; section to our task, to specify resource requirements. We'll also tell it to run in a Docker container, to make sure that absolutely nothing can go wrong with our delicate &amp;lt;code&amp;gt;echo&amp;lt;/code&amp;gt; command. In a real workflow, you probably want to set up optional inputs for all the tasks to let you control the resource requirements, but here we will just hardcode them.&lt;br /&gt;
&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;disks&amp;lt;/code&amp;gt; section is a little weird; it isn't in the WDL spec, but Toil supports Cromwell-style strings that ask for a &amp;lt;code&amp;gt;local-disk&amp;lt;/code&amp;gt; of a certain number of gigabytes, optionally marked as &amp;lt;code&amp;gt;SSD&amp;lt;/code&amp;gt; storage.&lt;br /&gt;
&lt;br /&gt;
Then we can put our task into our WDL file:&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
===Output Block===&lt;br /&gt;
&lt;br /&gt;
Now the only thing missing is a workflow-level &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section. Technically, in WDL 1.0 you aren't supposed to need one, but you do need one in WDL 1.1, and without one Toil doesn't actually send your outputs anywhere yet, so we're going to write one. We need to collect all the strings that came out of the different tasks in our scatter into an &amp;lt;code&amp;gt;Array[String]&amp;lt;/code&amp;gt;. We'll add the &amp;lt;code&amp;gt;output&amp;lt;/code&amp;gt; section at the end of the &amp;lt;code&amp;gt;workflow&amp;lt;/code&amp;gt; section, above the task.&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
 version 1.0&lt;br /&gt;
 workflow FizzBuzz {&lt;br /&gt;
     input {&lt;br /&gt;
         # How many FizzBuzz numbers do we want to make?&lt;br /&gt;
         Int item_count&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Fizz&amp;quot;&lt;br /&gt;
         Int to_fizz = 3&lt;br /&gt;
         # Every multiple of this number, we produce &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         Int to_buzz = 5&lt;br /&gt;
         # Optional replacement for the string to print when a multiple of both&lt;br /&gt;
         String? fizzbuzz_override&lt;br /&gt;
     }&lt;br /&gt;
     Array[Int] numbers = range(item_count)&lt;br /&gt;
     scatter (i in numbers) {&lt;br /&gt;
         Int one_based = i + 1&lt;br /&gt;
         if (one_based % to_fizz == 0) {&lt;br /&gt;
             String fizz = &amp;quot;Fizz&amp;quot;&lt;br /&gt;
             if (one_based % to_buzz == 0) {&lt;br /&gt;
                 String fizzbuzz = select_first([fizzbuzz_override, &amp;quot;FizzBuzz&amp;quot;])&lt;br /&gt;
             }&lt;br /&gt;
          }&lt;br /&gt;
         if (one_based % to_buzz == 0) {&lt;br /&gt;
             String buzz = &amp;quot;Buzz&amp;quot;&lt;br /&gt;
         }&lt;br /&gt;
         if (one_based % to_fizz != 0 &amp;amp;&amp;amp; one_based % to_buzz != 0) {&lt;br /&gt;
             # Just a normal number.&lt;br /&gt;
             call stringify_number {&lt;br /&gt;
                 input:&lt;br /&gt;
                     the_number = one_based&lt;br /&gt;
             }&lt;br /&gt;
         }&lt;br /&gt;
         String result = select_first([fizzbuzz, fizz, buzz, stringify_number.the_string])&lt;br /&gt;
     }&lt;br /&gt;
     output {&lt;br /&gt;
         Array[String] fizzbuzz_results = result&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
 task stringify_number {&lt;br /&gt;
     input {&lt;br /&gt;
         Int the_number&lt;br /&gt;
     }&lt;br /&gt;
     command &amp;lt;&amp;lt;&amp;lt;&lt;br /&gt;
         # This is a Bash script.&lt;br /&gt;
         # So we should do good Bash script things like stop on errors&lt;br /&gt;
         set -e&lt;br /&gt;
         # Now print our number as a string&lt;br /&gt;
         echo ~{the_number}&lt;br /&gt;
     &amp;gt;&amp;gt;&amp;gt;&lt;br /&gt;
     output {&lt;br /&gt;
         String the_string = read_string(stdout())&lt;br /&gt;
     }&lt;br /&gt;
     runtime {&lt;br /&gt;
         cpu: 1&lt;br /&gt;
         memory: &amp;quot;0.5 GB&amp;quot;&lt;br /&gt;
         disks: &amp;quot;local-disk 1 SSD&amp;quot;&lt;br /&gt;
         docker: &amp;quot;ubuntu:22.04&amp;quot;&lt;br /&gt;
     }&lt;br /&gt;
 }&lt;br /&gt;
&lt;br /&gt;
Because the &amp;lt;code&amp;gt;result&amp;lt;/code&amp;gt; variable is defined inside a &amp;lt;code&amp;gt;scatter&amp;lt;/code&amp;gt;, when we reference it outside the scatter we see it as being an array.&lt;br /&gt;
&lt;br /&gt;
==Running the Workflow==&lt;br /&gt;
&lt;br /&gt;
Now all that remains is to run the workflow! As before, make an inputs file to specify the workflow inputs:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;FizzBuzz.item_count&amp;quot;: 20}' &amp;gt;fizzbuzz.json&lt;br /&gt;
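&lt;br /&gt;
If you want to exercise the other inputs, the same file can set them too. The input names here come from the workflow above; the values are just examples:&lt;br /&gt;
&lt;br /&gt;
 echo '{&amp;quot;FizzBuzz.item_count&amp;quot;: 20, &amp;quot;FizzBuzz.to_fizz&amp;quot;: 4, &amp;quot;FizzBuzz.fizzbuzz_override&amp;quot;: &amp;quot;FizzBuzz!&amp;quot;}' &amp;gt;fizzbuzz.json&lt;br /&gt;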
&lt;br /&gt;
Then run it on the cluster with Toil:&lt;br /&gt;
&lt;br /&gt;
 mkdir -p logs&lt;br /&gt;
 toil-wdl-runner --jobStore ./fizzbuzz_store --batchSystem slurm --batchLogsDir ./logs fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
Or locally:&lt;br /&gt;
&lt;br /&gt;
 toil-wdl-runner fizzbuzz.wdl fizzbuzz.json -o fizzbuzz_out -m fizzbuzz_out.json&lt;br /&gt;
&lt;br /&gt;
=Debugging Workflows=&lt;br /&gt;
Sometimes, your workflow won't work. Try these ideas for figuring out what is going wrong.&lt;br /&gt;
&lt;br /&gt;
==Debugging Options==&lt;br /&gt;
When debugging a workflow, make sure to run the workflow with &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt;, to set the log level to &amp;lt;code&amp;gt;DEBUG&amp;lt;/code&amp;gt;, and with &amp;lt;code&amp;gt;--jobStore /some/path/to/a/shared/directory/it/can/create&amp;lt;/code&amp;gt; so that the stored files shipped between jobs are in a place you can access them.&lt;br /&gt;
&lt;br /&gt;
When debug logging is on, the log from every Toil job is inserted in the main Toil log between these markers:&lt;br /&gt;
&lt;br /&gt;
 =========&amp;gt;&lt;br /&gt;
         Toil job log is here&lt;br /&gt;
 &amp;lt;=========&lt;br /&gt;
&lt;br /&gt;
Normally, only the logs of failing jobs are reproduced like this.&lt;br /&gt;
&lt;br /&gt;
==Reading the Log==&lt;br /&gt;
&lt;br /&gt;
When a WDL workflow fails, you are likely to see a message like this:&lt;br /&gt;
&lt;br /&gt;
 WDL.runtime.error.CommandFailed: task command failed with exit status 1&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host phoenix-15.prism&lt;br /&gt;
&lt;br /&gt;
This means that the command line specified by one of your WDL tasks exited with a failing (i.e. nonzero) exit code. That happens either when the command is written wrong, or when the error detection code in the tool you are trying to run detects and reports an error.&lt;br /&gt;
&lt;br /&gt;
Go up higher in the log until you find lines that look like:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard error at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stderr.txt:&lt;br /&gt;
&lt;br /&gt;
And&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [I] [toil.wdl.wdltoil] Standard output at /data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/stdout.txt:&lt;br /&gt;
&lt;br /&gt;
These will be followed by the standard error and standard output log data from the task's command. There may be useful information (such as an error message from the underlying tool) in there.&lt;br /&gt;
&lt;br /&gt;
==Reproducing Problems==&lt;br /&gt;
&lt;br /&gt;
When trying to fix a failing step, it is useful to be able to run a command outside of Toil or WDL that might reproduce the problem. In addition to getting the standard output and standard error logs as described above, you may also need input files for your tool in order to do this. In the log of your failing Toil task, look for lines like this:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Failed job accessed files:&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/no-job/file-4f886176ab8344baaf17dc72fc445445/toplog.sh' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmprwhi6h3q/toplog.sh'&lt;br /&gt;
 [2023-07-16T16:23:54-0700] [MainThread] [W] [toil.fileStores.abstractFileStore] Downloaded file 'files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam' to path '/data/tmp/c3d51c0611b9511da167528976fef714/9b0e/467f/tmpjyksfoko/Sample.bam'&lt;br /&gt;
 ...&lt;br /&gt;
&lt;br /&gt;
The &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&amp;lt;/code&amp;gt; part is a Toil file ID, and it is a relative path from your &amp;lt;code&amp;gt;--jobStore&amp;lt;/code&amp;gt; value to where the file is stored on disk. So if you ran the workflow with &amp;lt;code&amp;gt;--jobStore /private/groups/patenlab/anovak/jobstore&amp;lt;/code&amp;gt;, you would look for this file at:&lt;br /&gt;
&lt;br /&gt;
 /private/groups/patenlab/anovak/jobstore/files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-1bb5d92ae8f3413eb82fe8ef88686bf6/Sample.bam&lt;br /&gt;
&lt;br /&gt;
==More Ways of Finding Files==&lt;br /&gt;
&lt;br /&gt;
Sometimes, a step might not fail, but you still might want to see the files it is using as input. If you have the job store path, you can use the &amp;lt;code&amp;gt;find&amp;lt;/code&amp;gt; command to try and find the files by name. For example, if you want to look at &amp;lt;code&amp;gt;Sample.bam&amp;lt;/code&amp;gt;, you can look for it like this:&lt;br /&gt;
&lt;br /&gt;
 find /path/to/the/jobstore -name &amp;quot;Sample.bam&amp;quot;&lt;br /&gt;
&lt;br /&gt;
If you want to find files that were ''uploaded'' from a job, look for lines like this in the job's log:&lt;br /&gt;
&lt;br /&gt;
 [2023-07-16T15:58:39-0700] [MainThread] [D] [toil.wdl.wdltoil] Virtualized /data/tmp/2846b6012e3e5535add03b363950dd78/cb23/197c/work/bamPerChrs/Sample.chr14.bam as WDL file toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
You can take the &amp;lt;code&amp;gt;toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351%2FSample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt; URI and URL-decode it with, for example, [https://www.urldecoder.io/], getting this:&lt;br /&gt;
&lt;br /&gt;
 toilfile:2703483274:0:files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&lt;br /&gt;
&lt;br /&gt;
Then you can take the part after the last colon, &amp;lt;code&amp;gt;files/for-job/kind-WDLTaskJob/instance-b4c5x6hq/file-c4e4f1b16ddf4c2ab92c2868421f3351/Sample.chr14.bam/Sample.chr14.bam&amp;lt;/code&amp;gt;, and that is the path relative to the job store where this file can be found.&lt;br /&gt;
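&lt;br /&gt;
If you have a lot of these URIs, you can do the decoding and path extraction with a few lines of Python instead of a web tool. This is just a quick sketch; the URI is the example from the log line above:&lt;br /&gt;

```python
from urllib.parse import unquote

# Example toilfile URI copied from the Toil debug log line above.
uri = ("toilfile:2703483274%3A0%3Afiles%2Ffor-job%2Fkind-WDLTaskJob"
       "%2Finstance-b4c5x6hq%2Ffile-c4e4f1b16ddf4c2ab92c2868421f3351"
       "%2FSample.chr14.bam/Sample.chr14.bam")

# URL-decode the percent-escapes to recover the colon-separated fields.
decoded = unquote(uri)

# The job-store-relative path is everything after the last colon.
relative_path = decoded.rsplit(":", 1)[-1]
print(relative_path)
```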
&lt;br /&gt;
==Using Development Versions of Toil==&lt;br /&gt;
&lt;br /&gt;
Sometimes, bugs will be fixed in the development version of Toil, but not released yet. To try the current development version of Toil, you can install it like this:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
If you want to use a particular branch or commit, like &amp;lt;code&amp;gt;aaa451b320fc115b3563ced25cb501301cf86f90&amp;lt;/code&amp;gt;, you can do:&lt;br /&gt;
&lt;br /&gt;
 pip install --upgrade --user 'git+https://github.com/DataBiosphere/toil.git@aaa451b320fc115b3563ced25cb501301cf86f90#egg=toil[wdl,aws,google]'&lt;br /&gt;
&lt;br /&gt;
==Frequently Asked Questions==&lt;br /&gt;
&lt;br /&gt;
===I am getting warnings about &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt;===&lt;br /&gt;
You may be seeing warnings that &amp;lt;code&amp;gt;XDG_RUNTIME_DIR is set to nonexistent directory /run/user/$UID; your environment may be out of spec!&amp;lt;/code&amp;gt;&lt;br /&gt;
&lt;br /&gt;
This happens because Slurm is providing Toil with an &amp;lt;code&amp;gt;XDG_RUNTIME_DIR&amp;lt;/code&amp;gt; environment variable that points to a directory that doesn't exist, which the XDG spec says it shouldn't be doing. This is a known bug in the GI Slurm configuration, and Toil is letting you know that it is working around it.&lt;br /&gt;
&lt;br /&gt;
===Toil said it was &amp;lt;code&amp;gt;Redirecting logging&amp;lt;/code&amp;gt; somewhere, but I can't find that file!===&lt;br /&gt;
The Toil worker process for each job will say that it is &amp;lt;code&amp;gt;Redirecting logging to /data/tmp/somewhere/worker_log.txt&amp;lt;/code&amp;gt;, and when running in single machine mode these messages go to the main Toil log.&lt;br /&gt;
&lt;br /&gt;
The Toil worker logs are automatically cleaned up when the worker finishes. If you want to see the individual worker logs in the Toil log, use the &amp;lt;code&amp;gt;--logDebug&amp;lt;/code&amp;gt; option to Toil.&lt;br /&gt;
&lt;br /&gt;
If you are looking for the log for a worker process that did not finish (i.e. that crashed), make sure to look on the machine that the worker actually ran on, not on the head node.&lt;br /&gt;
&lt;br /&gt;
=Additional WDL resources=&lt;br /&gt;
&lt;br /&gt;
For more information on writing and running WDL workflows, see:&lt;br /&gt;
&lt;br /&gt;
* [https://docs.openwdl.org/en/stable/ The WDL documentation]&lt;br /&gt;
* [https://www.youtube.com/playlist?list=PL4Q4HssKcxYv5syJKUKRrD8Fbd-_CnxTM The &amp;quot;Learn WDL&amp;quot; video course on YouTube]&lt;br /&gt;
* [https://github.com/openwdl/wdl/blob/main/versions/1.1/SPEC.md The WDL 1.1 language specification]&lt;/div&gt;</summary>
		<author><name>Anovak</name></author>
	</entry>
</feed>