Running Nextflow on AWS Batch

September 24, 2018

This blog post was created out of my desire as a bioinformatician to run Nextflow on AWS Batch and thereby provide myself, colleagues and collaborators with an easy way to batch process large numbers of samples without needing an HPC environment or another large computing resource such as an expensive workstation with multiple cores and lots of RAM. I have a few devops skills but they are limited, and I would definitely not class myself as a Linux sysadmin. Therefore the thought of getting AWS Batch up and running, with the potential to run up unforeseen bills if not careful, was daunting.

Thankfully Nextflow has amazing resources and a helpful community. A combination of these web pages was invaluable in guiding me through this complex process:

In addition, the Nextflow Gitter channel is a friendly place, and I can't thank Paolo Di Tommaso (@PaoloDiTommaso) and Alex Peltzer (@alex_peltzer) enough for their advice and patience in the face of some real beginner questions.

In many ways this post follows the steps of Alex's post very closely, but I documented it as I went along since there were a few bits I missed when following it as a new user.

Creating an IAM user

First off, it is worth creating a new AWS user that has limited permissions and is not your root user.

  • Browse to https://console.aws.amazon.com/iam
  • Add a group and, for the moment, do not attach any permission policies to it
  • Add a user and give them programmatic access
  • Now give the group some permissions. For access to AWS Batch, attach the AWSBatchFullAccess policy. For S3 you can give them either
    • full S3 permissions, or
    • S3 permissions limited to selected buckets
  • I attached both the full S3 policy and a limited-access S3 policy I created to my test group; in a real situation only one of these would be required. (A rough AWS CLI equivalent of these steps is sketched after this list.)
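
If you prefer the command line, the same setup can be sketched with the AWS CLI. The group and user names below are placeholders of my own choosing, and in practice you would swap the full-access S3 policy for one limited to your buckets:

# Create a group and attach the AWS Batch policy (names are placeholders)
aws iam create-group --group-name nextflow-batch-group
aws iam attach-group-policy --group-name nextflow-batch-group \
  --policy-arn arn:aws:iam::aws:policy/AWSBatchFullAccess

# Full S3 access; attach a bucket-limited policy instead if you prefer
aws iam attach-group-policy --group-name nextflow-batch-group \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess

# Create the user, add them to the group and generate programmatic access keys
aws iam create-user --user-name nextflow-batch-user
aws iam add-user-to-group --group-name nextflow-batch-group --user-name nextflow-batch-user
aws iam create-access-key --user-name nextflow-batch-user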

Creating Roles for AWS Batch

You will need some specific roles to be configured. These are used by the AWS Batch queue to create new EC2 instances on demand and, in particular, to make use of the cheaper Spot prices available.

Under the same IAM portal:

  • Click on Roles on the left hand side.

  • Create a role called ecsInstanceRole and under the type of trusted entity choose AWS service, selecting the EC2 service, and then EC2 again as the use case

  • Click the ‘Next: Permissions’ button and give the role the following policies

    • AmazonS3FullAccess
    • AmazonEC2ContainerServiceforEC2Role
  • Create another role called AmazonEC2SpotFleet by clicking Create role; under the type of trusted entity choose AWS service, selecting EC2 followed by EC2 Spot Fleet Role as the use case. A single policy will be available: AmazonEC2SpotFleetTaggingRole. Create the role with this policy.

  • Now click on Roles in the left hand navigation, select the AmazonEC2SpotFleet role you have just created and attach the following additional policies to it if not already present (you can use the search box to filter):

    • AmazonEC2SpotFleetRole
    • AmazonEC2SpotFleetAutoscaleRole
    • AmazonEC2SpotFleetTaggingRole
  • After creation the role should have all three Spot Fleet policies attached (a CLI sketch of the role setup follows below).
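
For reference, a rough AWS CLI version of the ecsInstanceRole setup is shown below. The trust policy is the standard EC2 one, and the managed policy ARNs are as I understand them, so double-check them in the console before relying on this:

# Trust policy allowing EC2 instances to assume the role
cat > ec2-trust.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole" }
  ]
}
EOF

# Create ecsInstanceRole and attach the two policies listed above
aws iam create-role --role-name ecsInstanceRole \
  --assume-role-policy-document file://ec2-trust.json
aws iam attach-role-policy --role-name ecsInstanceRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam attach-role-policy --role-name ecsInstanceRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role

# Unlike the console, the CLI does not create the instance profile for you;
# AWS Batch launches instances via an instance profile wrapping this role
aws iam create-instance-profile --instance-profile-name ecsInstanceRole
aws iam add-role-to-instance-profile --instance-profile-name ecsInstanceRole \
  --role-name ecsInstanceRole

# The AmazonEC2SpotFleet role can be created the same way, trusting the
# 'spotfleet.amazonaws.com' service and attaching the three Spot Fleet policies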

Create a custom AMI image

Use the instructions here https://www.nextflow.io/docs/latest/awscloud.html#custom-ami to make a custom AMI. However, if using the Amazon ECS-optimized Linux image as a base, the user and corresponding file paths should be ec2-user. The resulting image will then have both the AWS CLI and Docker pre-installed.
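
For reference, the gist of those instructions at the time of writing is to launch an instance from the ECS-optimized AMI and install the AWS CLI into the ec2-user home directory via Miniconda, so that the path matches the executor.awscli setting used later. Treat the URL and paths below as a sketch that may have changed since:

# Run on an instance launched from the Amazon ECS-optimized AMI, as ec2-user
sudo yum install -y bzip2 wget
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -f -p /home/ec2-user/miniconda
/home/ec2-user/miniconda/bin/conda install -c conda-forge -y awscli
rm Miniconda3-latest-Linux-x86_64.sh

# The path must match executor.awscli in nextflow.config (see below)
/home/ec2-user/miniconda/bin/aws --version

Once done, create an image from the instance in the EC2 console and note the resulting AMI ID for use in the compute environment below.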

Creating an AWS Batch compute environment and queue

Browse to https://console.aws.amazon.com/batch. If this is your first time using AWS Batch there will be a ‘Get started’ button. Click this and then the ‘Skip wizard’ button.

Compute Environment

  • Click on ‘Compute Environments’ in the left hand navigation and then the ‘Create environment’ button.
    • Select Managed
    • Give the compute environment a name
    • Service role: If it exists select ‘AWSBatchServiceRole’ else ‘Create new role’
    • Instance role: Select the ecsInstanceRole you created earlier.
    • There is no need to select an EC2 key pair
    • Provisioning model: choose Spot
    • Maximum price: I found 30% worked OK
    • Allowed instance types: I preferred to specify an instance type rather than use the default ‘optimal’. I chose m4.large. Instance types can be found here and the pricing here. Spot prices of between 30 and 40% of the on-demand price are usually provisioned.
    • Specify vCPUs, selecting a minimum of 0 so that when no jobs are running all EC2 instances are shut down and no charges incurred. I selected
    • Minimum vCPUs 0
    • Desired vCPUs 2
    • Maximum vCPUs 8
    • Enable user-specified AMI ID ✓
    • Select the AMI image ID previously made, e.g. ami-003731153957a6e06
    • Leave networking with the default values
    • Click Create (a CLI equivalent is sketched below)
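
The console is the easiest route, but for completeness here is a hedged AWS CLI sketch of the same compute environment. The environment name, account ID, subnets and security group are placeholders, and the compute-resources field names are from the Batch API as I understand it, so verify them against the current documentation:

aws batch create-compute-environment \
  --compute-environment-name nextflow-spot-env \
  --type MANAGED \
  --state ENABLED \
  --service-role arn:aws:iam::<ACCOUNT_ID>:role/service-role/AWSBatchServiceRole \
  --compute-resources '{
    "type": "SPOT",
    "bidPercentage": 30,
    "minvCpus": 0,
    "desiredvCpus": 2,
    "maxvCpus": 8,
    "instanceTypes": ["m4.large"],
    "imageId": "ami-003731153957a6e06",
    "instanceRole": "ecsInstanceRole",
    "spotIamFleetRole": "arn:aws:iam::<ACCOUNT_ID>:role/AmazonEC2SpotFleet",
    "subnets": ["subnet-xxxxxxxx"],
    "securityGroupIds": ["sg-xxxxxxxx"]
  }'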

Compute Queue

  • Click on ‘Job Queues’ and then the ‘Create Queue’ button
    • Give it a name
    • Set priority to 1
    • Select the previously created compute environment (a CLI equivalent follows below)
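
The equivalent AWS CLI call (assuming the compute environment name used in the sketch above) would look something like this:

aws batch create-job-queue \
  --job-queue-name test-queue \
  --state ENABLED \
  --priority 1 \
  --compute-environment-order order=1,computeEnvironment=nextflow-spot-env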

Running Nextflow on AWS Batch

Configuration

The nextflow.config should contain the following directives as a minimum. Alter the values as appropriate, paying particular attention to those in the aws block and the queue, memory and container settings in the process block. The critical configuration is to set the process executor to awsbatch, to give the name of the queue you created in the earlier step, and to supply the accessKey and secretKey for the IAM user made in the first step. The keys can be found by going to the IAM section of the AWS console, clicking on the user you created, and creating an access key under the ‘Security credentials’ tab. The secret key will be shown this one time only, so keep it somewhere safe or be prepared to create a new one. NEVER share the secret key with anyone, publish it in a blog post or put it in a public git repo. Therefore, if you want to share the nextflow.config file (e.g. in a git repo), you can instead define these three variables in your environment:

  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • AWS_DEFAULT_REGION
// Credentials and region for the IAM user created earlier.
// Omit accessKey/secretKey if supplying them via environment variables.
aws {
  accessKey = '<ACCESS KEY>'
  secretKey = '<SECRET KEY>'
  region = 'eu-west-2'
}

// Path to the AWS CLI installed in the custom AMI
executor.awscli = '/home/ec2-user/miniconda/bin/aws'

process {
  executor = 'awsbatch'           // run tasks on AWS Batch
  queue = 'test-queue'            // the job queue created above
  memory = '3.5G'
  container = '<DOCKER HUB NAME>' // Docker image used to run each process
}
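
For example, with the accessKey, secretKey and region lines removed from the aws block, the credentials can instead be supplied from the shell that launches Nextflow (placeholder values shown):

export AWS_ACCESS_KEY_ID=<ACCESS KEY>
export AWS_SECRET_ACCESS_KEY=<SECRET KEY>
export AWS_DEFAULT_REGION=eu-west-2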

Parameterisation of the Nextflow command

The working directory must be an S3 path. For large input and output files I would also recommend using S3 buckets. For instance, I define params called input_dir and output_dir (to be used by publishDir directives) as S3 paths. I often use a single S3 bucket with sub-paths within it for each ‘run’ of a workflow. If you define input files and publishDir output files as local paths, these will be uploaded to and downloaded from AWS as required; for large files this may not be desirable.

nextflow run example.nf \
  -resume \
  -work-dir s3://nextflow-batch-test/workdir \
  --input_dir s3://nextflow-batch-test/fastqs \
  --output_dir s3://nextflow-batch-test/output \
  --other-workflow-param some_value

I have probably missed some steps, so please feel free to get in touch via Twitter (@bioinformant). Once again, big thanks to Paolo and the Nextflow team, and to Alex.

