Computational Genomics Kubernetes Installation

From UCSC Genomics Institute Computing Infrastructure Information


The Computational Genomics Group has a Kubernetes Cluster running on several large instances in AWS. The current cluster makeup includes three worker nodes, each with the following specs:

* 96 CPU cores (3.1 GHz)
* 384 GB RAM
* 3.3 TB Local NVMe Flash Storage
* 25 Gb/s Network Interface 

Getting Authorized to Connect

If you require access to this Kubernetes cluster, contact Benedict Paten to ask for permission to use it, then forward that permission via email to:

cluster-admin@soe.ucsc.edu

Let us know which group you are with and we can authorize you to use the cluster in the correct namespace.

Authenticating to Kubernetes

We will authorize (authz) you to use the cluster on the server side, but you will also need to authenticate (authn) using your '@ucsc.edu' email address and a unique JSON Web Token (JWT). These credentials are installed in ~/.kube/config on whatever machine you are using to connect to the cluster.

To authenticate and get your base kubernetes configuration, go to the URL below, which will ask you to authenticate to Google. Use your '@ucsc.edu' email address as the login. It will then ask you to authenticate via CruzID Gold if your web browser doesn't already have the authentication token cached:

https://cg-kube-auth.gi.ucsc.edu

Once you authenticate (via username/password and 2-factor auth for CruzID Gold), it will pass you back to the 'https://cg-kube-auth.gi.ucsc.edu' website and it should confirm authentication on the top with a message saying "Successfully Authenticated". If you see any errors in red, but are sure you typed in your password and 2-factor auth correctly, click on the above link again (https://cg-kube-auth.gi.ucsc.edu) and authenticate a second time, which should work. There is a quirk where the web token doesn't always pass back to us correctly on the first try.

Upon success, you will be able to click the blue "Download Config File" button to download your initial kubernetes config file. Copy this file to your home directory as ~/.kube/config. Follow the directions on the web page to insert your "namespace:" line as directed. We will let you know which namespace to use.
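
For reference, the "namespace:" line goes inside the "context:" section of ~/.kube/config. A minimal sketch of that stanza (the cluster, user, and context names here are illustrative; keep the ones from your downloaded config, and use the namespace we assign you):

contexts:
- context:
    cluster: cg-kube-cluster
    user: youruser@ucsc.edu
    namespace: your-namespace
  name: cg-kube-context

Alternatively, once the file is in place, you can set the namespace on your current context with:

kubectl config set-context --current --namespace=your-namespace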

Testing Connectivity

Once your ~/.kube/config file is set up correctly, you should be able to connect to the cluster. All our shared servers here at the Genomics Institute have the 'kubectl' command installed on them, but if you are coming from somewhere else, just make sure the "kubectl" utility is installed on that machine.

A quick test should go as follows:

$ kubectl get nodes
NAME          STATUS   ROLES    AGE   VERSION
k1.kube       Ready    <none>   13h   v1.15.3
k2.kube       Ready    <none>   13h   v1.15.3
k3.kube       Ready    <none>   13h   v1.15.3
master.kube   Ready    master   13h   v1.15.3
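
If that works, you can also check that your namespace is set correctly; this should list the pods in your assigned namespace (or report that none are found yet):

kubectl get pods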

Running Pods and Jobs with Requests and Limits

When running jobs and pods on kubernetes, you will always want to specify "requests" and "limits" on resources; otherwise your pods will be stuck with the default limits, which are tiny (to protect against runaway pods). You should always have an idea of how much of each resource your jobs will consume, and not request much more than that, so that you don't hog all the resources. It also prevents your job from "running away" unexpectedly and chewing up more resources than expected.
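
If a pod seems stuck in Pending or keeps getting killed, you can see the requests and limits it actually received, along with recent scheduler events (pod name illustrative):

kubectl describe pod my-pod-name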

Here is a good example of a job file that specifies limits:

job.yml

apiVersion: batch/v1
kind: Job
metadata:
  name: $USER-$TS
spec:
  backoffLimit: 0
  ttlSecondsAfterFinished: 30
  template:
    spec:
      containers:
      - name: magic
        image: robcurrie/ubuntu
        imagePullPolicy: Always
        resources:
          requests:
            cpu: "1"
            memory: "2G"
            ephemeral-storage: "2G"
          limits:
            cpu: "1"
            memory: "2G"
            ephemeral-storage: "2G"
        command: ["/bin/bash", "-c"]
        args: ['for i in {1..100}; do echo "$i: $(date)"; sleep 1; done']
      restartPolicy: Never
      priorityClassName: medium-priority

Please note that the "requests" and "limits" values should be the same. You might think you could set the limit higher than the request, but in reality they need to match for the pod to stay within the resource envelope the scheduler planned for it. If you set the limit higher than the request, you risk the pod using more memory than the scheduler expects, and the node can start OOM-killing random other colocated pods, which is very, very bad for the cluster. If you omit the "requests" section altogether, the limit values will be used for it, so if you use only one, use "limits".

Also note the "priorityClassName" line. Available values are:

high-priority
medium-priority
low-priority

This affects how quickly your jobs move up the queue when there are a lot of queued jobs. Always use "medium-priority" as the default unless you specifically know you need it higher or lower. Higher-priority jobs will always go in front of lower-priority jobs.
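
You can list the priority classes defined on the cluster to confirm these values:

kubectl get priorityclasses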

NOTE: Jobs and pods that completed over 72 hours ago but have not been cleaned up will be automatically removed by the garbage collector. Most jobs will have the "ttlSecondsAfterFinished" configuration item in them, so they will automatically be cleaned up after that time expires. Leaving old pods and jobs around pins the disk space they were using for as long as they remain, so it's good to get rid of them as soon as they are done, unless you are debugging a failure or something like that.

Jobs that have been running for over 72 hours will not be deleted; only jobs that exited more than 72 hours ago are removed.
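
To clean up your own finished jobs without waiting for the TTL or the garbage collector (job name illustrative):

kubectl get jobs
kubectl delete job my-old-job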

A lot of other good information, including examples and some "How To" documentation, can be found on Rob Currie's GitHub page:

https://github.com/rcurrie/kubernetes

Inlining Jobs in Shell and Shell in Jobs

When interactively developing on Kubernetes, it can be useful to have a shell command you can copy and paste to run a Kubernetes job, rather than having to create YAML files on disk. Similarly, it can be useful to have shell scripting inline in your Kubernetes job definitions, rather than having to bake your experimental script into a Docker container. Here's an example that does both, putting the YAML inside a heredoc and putting the script to run in the container inside a multiline YAML string. We precede this with a command to delete the job, so you can modify your script and re-paste it to replace a failed or failing job. We also make sure to mount the AWS credentials in the container, so that the aws command will be able to access S3 if you install it.

kubectl delete job username-job
kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: username-job
spec:
  ttlSecondsAfterFinished: 1000
  template:
    spec:
      containers:
      - name: main
        imagePullPolicy: Always
        image: ubuntu:18.04
        command:
        - /bin/bash
        - -c
        - |
          set -e
          DEBIAN_FRONTEND=noninteractive apt-get update
          DEBIAN_FRONTEND=noninteractive apt-get install -y awscli cowsay
          cowsay "Listing files"
          aws s3 ls s3://vg-k8s/
        volumeMounts:
        - mountPath: /tmp
          name: scratch-volume
        - mountPath: /root/.aws
          name: s3-credentials
        resources:
          limits:
            cpu: 1
            memory: "4Gi"
            ephemeral-storage: "10Gi"
      restartPolicy: Never
      volumes:
      - name: scratch-volume
        emptyDir: {}
      - name: s3-credentials
        secret:
          secretName: shared-s3-credentials
  backoffLimit: 0
EOF

Make sure to replace "username-job" with a unique job name that includes your username.
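
Once the job is submitted, you can find its pod and follow its output (using the same job name; Jobs automatically label their pods with "job-name"):

kubectl get pods -l job-name=username-job
kubectl logs -f job/username-job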

View the Cluster's Current Activity

One quick way to check the cluster's utilization is to do:

kubectl top nodes

NAME          CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
k1.kube       1815m        1%     1191Mi          0%        
k2.kube       51837m       53%    46507Mi         12%       
k3.kube       1458m        1%     61270Mi         15%       
master.kube   111m         5%     1024Mi          46%

That means the worker nodes, k1, k2 and k3, are using minimal memory; k2 is using 53% of its CPU, but there is still lots of room open for new jobs. Ignore the master node, as it only handles cluster management and doesn't run jobs or pods for users.
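
You can also break consumption down by pod within your namespace:

kubectl top pods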

Another good way to get a lot of details about the current state of the cluster is through the Kubernetes Dashboard:

https://cgl-k8s-dashboard.gi.ucsc.edu/

Select the "token" login method, and paste in this (long) token:

eyJhbGciOiJSUzI1NiIsImtpZCI6IiJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJrdWJlcm5ldGVzLWRhc2hib2FyZC10b2tlbi0ycDY4cCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6ImE5ZGI2Y2I0LWMyYWYtNDM3My04ZmM2LWE4YWYwYTBmNGRkNCIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDprdWJlcm5ldGVzLWRhc2hib2FyZDprdWJlcm5ldGVzLWRhc2hib2FyZCJ9.ZVQjryG_ksfvReIfq4Frb6M4sE6OVDOXnFy9Aii-h3mrpdHRE6bgjdAvSGZ0jJSIUEz5GgPBQ0lCwhyZocivHHr4zTrNxMkOFZhPDnpvF6RVIDWTkqmH9Dg6qmro0gTJP75oKBpt7dFN2pW4zvqOAzqPmh7qxfoVusN8X6U13YirMFEf65-aGL-_FFNBsEzvjkC-BgXWbtk3YZc8CJL7xtvlKLyE6u6jC9Qx0SWnwzkALlxmzo_yYTDKpIrWiQGEqzLQOxKml-H0kSYLDX-t4sTivXp4vCw_ruoqwIpLnnQAC7q3ZtSTxHIrxbB7n_M8gfhpXtwprbPav-XmBk1xaQ

The dashboard is read-only, so you won't be able to edit anything; it's mostly for seeing what's going on and where.

You can also check current resource consumption with our Ganglia cluster monitoring tool:

https://ganglia.gi.ucsc.edu/

That website requires a username and password:

username: genecats
password: KiloKluster

That's mostly for keeping the script kiddies and bots from banging on it.

Once you get in, you should see a drop-down menu near the top left of the screen near "Genomics Institute Grid". From the drop-down menu, select "CG Kubernetes Cluster". It will take you to a page detailing the current resource usage and activity on the nodes. This can be useful for seeing whether anyone else is using the whole cluster, or just to get an idea of how many resources are available for the batch of jobs you want to assign to the cluster.

Profiling with Perf

You can use Linux's "perf" to profile your code on the Kubernetes cluster. You need to obtain a "perf" binary that matches the version of the kernel the Kubernetes hosts are running, which most likely does not correspond to any version of "perf" available in the Ubuntu repositories; the example job below downloads a binary previously uploaded to S3. Also, the Kubernetes hosts have Non-Uniform Memory Access (NUMA): some physical memory is "closer" to some physical cores than to other physical cores. The system is divided into NUMA nodes, each containing some cores and some memory. Memory access from a node to its own memory is significantly faster than access to other nodes' memory. For consistent profiling, it is important to restrict your application to a single NUMA node if possible, with "numactl", so that all memory accesses are local. If you don't do this, your application's performance will vary arbitrarily depending on whether and when its threads are scheduled on the different NUMA nodes of the system.
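
To see how many NUMA nodes a machine has, and which CPUs and memory belong to each, you can run this inside your container once numactl is installed:

numactl --hardware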

apiVersion: batch/v1
kind: Job
metadata:
  name: username-profiling
spec:
  ttlSecondsAfterFinished: 1000
  template:
    metadata: # Apply a label saying that we use NUMA node 0
      labels:
        usesnuma0: "Yes"
    spec:
      affinity: # Say that we should not schedule on the same node as any other pod with that label
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: usesnuma0
                operator: In
                values:
                - "Yes"
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: main
        imagePullPolicy: Always
        image: ubuntu:18.04
        command:
        - /bin/bash
        - -c
        - |
          set -e
          DEBIAN_FRONTEND=noninteractive apt-get update
          DEBIAN_FRONTEND=noninteractive apt-get install -y awscli numactl
          # Use this particular perf binary that matches the hosts' kernels
          # If it is missing or outdated, get a new one from Erich or cluster-admin
          aws s3 cp s3://vg-k8s/users/adamnovak/projects/test/perf /usr/bin/perf
          chmod +x /usr/bin/perf
          # Do your work with perf here.
          # Use numactl to limit your code to NUMA node 0 for consistent memory access times
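          # For example (assuming a hypothetical binary /tmp/mytool):
          #   numactl --cpunodebind=0 --membind=0 perf record -g -o /tmp/perf.data /tmp/mytool
          #   perf report -i /tmp/perf.data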
        volumeMounts:
        - mountPath: /tmp
          name: scratch-volume
        - mountPath: /root/.aws
          name: s3-credentials
        resources:
          limits:
            cpu: 24 # One NUMA node on our machines is 24 cores.
            memory: "150Gi"
            ephemeral-storage: "400Gi"
      restartPolicy: Never
      volumes:
      - name: scratch-volume
        emptyDir: {}
      - name: s3-credentials
        secret:
          secretName: shared-s3-credentials
  backoffLimit: 0
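
Save the YAML above as, say, profiling-job.yml, replace "username-profiling" with a name that includes your username, and launch it:

kubectl apply -f profiling-job.yml
kubectl logs -f job/username-profiling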