Undiagnosed Disease Project Kubernetes Installation
The Undiagnosed Disease Project (UDP) has a Kubernetes Cluster running on one large GPU server. The current cluster makeup includes one worker node with the following specs:
* 72 CPU cores (3.1 GHz)
* 384 GB RAM
* 3.2 TB Local NVMe Flash Storage
* 4 NVIDIA GPUs
* 25 Gb/s Network Interface
Getting Authorized to Connect
If you require access to this kubernetes cluster, contact Benedict Paten asking for permission to use it, then pass on that permission via email to:
cluster-admin@soe.ucsc.edu
Let us know which group you are with and we can authorize you to use the cluster in the correct namespace.
NOTE: You need GI VPN Access to access this kubernetes installation.
Authenticating to Kubernetes
We will authorize (authz) you to use the cluster on the server side, but you will also need to authenticate (authn) using your '@ucsc.edu' email address and a unique JSON Web Token (JWT). These credentials are installed in ~/.kube/config on whichever machine you connect to the cluster from.
To authenticate and get your base kubernetes configuration, go to this URL (below), which will ask you to authenticate to Google. Use your '@ucsc.edu' email address as the login. It will then ask you to authenticate via CruzID Gold if your web browser doesn't already have the authentication token cached:
https://udp-kube-auth.gi.ucsc.edu
Once you authenticate (via username/password and 2-factor auth for CruzID Gold), you will be passed back to the 'https://udp-kube-auth.gi.ucsc.edu' website, which should confirm authentication at the top with a message saying "Successfully Authenticated". If you see any errors in red but are sure you typed in your password and 2-factor auth correctly, click on the above link again (https://udp-kube-auth.gi.ucsc.edu) and authenticate a second time, which should work. There is a quirk where the web token doesn't always pass back to us correctly on the first try.
Upon success, you will be able to click the blue "Download Config File" button, which contains your initial kubernetes config file. Copy this file to your home directory as ~/.kube/config. Follow the directions on the web page to insert your "namespace:" line as directed. We will let you know which namespace to use.
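The copy step above can be sketched as a small shell helper. This is only an illustration: the function name install_kubeconfig and the default download location ~/Downloads/config are assumptions, so adjust them to wherever your browser saved the file. It keeps a backup of any existing config and restricts permissions, since the file holds your token.

```shell
# Install a downloaded kubeconfig as ~/.kube/config.
# The function name and the default source path are assumptions for illustration.
install_kubeconfig() {
    src="${1:-$HOME/Downloads/config}"
    dest_dir="${2:-$HOME/.kube}"

    if [ ! -f "$src" ]; then
        echo "no config file found at $src" >&2
        return 1
    fi

    mkdir -p "$dest_dir"
    # Keep a copy of any previous config rather than overwriting it silently.
    if [ -f "$dest_dir/config" ]; then
        cp "$dest_dir/config" "$dest_dir/config.bak"
    fi
    cp "$src" "$dest_dir/config"
    # The file contains your token, so make it readable only by you.
    chmod 600 "$dest_dir/config"
}
```

After running something like install_kubeconfig ~/Downloads/config, you still need to edit ~/.kube/config by hand to add the "namespace:" line as the web page directs.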
Testing Connectivity
Once your ~/.kube/config file is set up correctly, you should be able to connect to the cluster. All our shared servers here at the Genomics Institute have the 'kubectl' command installed on them, but if you are coming from somewhere else, just make sure the "kubectl" utility is installed on that machine.
A quick test should go as follows:
$ kubectl get nodes
NAME          STATUS   ROLES    AGE   VERSION
k1.kube       Ready    <none>   13h   v1.15.3
k2.kube       Ready    <none>   13h   v1.15.3
k3.kube       Ready    <none>   13h   v1.15.3
master.kube   Ready    master   13h   v1.15.3
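If the node listing works but you are unsure whether your namespace line took effect, a further check along these lines may help. It is a sketch only: check_access is a hypothetical helper name, while the kubectl subcommands it calls (config view and auth can-i) are standard.

```shell
# Show which namespace kubectl will use and whether you may create jobs there.
# check_access is a hypothetical helper name; the kubectl subcommands are standard.
check_access() {
    ns=$(kubectl config view --minify --output 'jsonpath={..namespace}')
    if [ -z "$ns" ]; then
        echo "no namespace set in the current context" >&2
        return 1
    fi
    echo "namespace: $ns"
    # Prints "yes" or "no" depending on your authorization in that namespace.
    kubectl auth can-i create jobs --namespace "$ns"
}
```

Run check_access after sourcing the function. An empty namespace usually means the "namespace:" line is missing from ~/.kube/config.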
Running Pods and Jobs with Requests and Limits
When running jobs and pods on kubernetes, you will always want to specify "requests" and "limits" on resources, otherwise your pods will get stuck with the default limits, which are tiny (to protect against runaway pods). You should always have an idea of how many resources your jobs will consume, and not ask for much more than that, in order not to hog all the resources. Setting limits also prevents your job from "running away" unexpectedly and chewing up more resources than expected.
Here is a good example of a job file that specifies limits:
job.yml

apiVersion: batch/v1
kind: Job
metadata:
  name: $USER-$TS
spec:
  backoffLimit: 0
  ttlSecondsAfterFinished: 30
  template:
    spec:
      containers:
      - name: magic
        image: robcurrie/ubuntu
        imagePullPolicy: Always
        resources:
          requests:
            cpu: "1"
            memory: "2G"
            ephemeral-storage: "2G"
          limits:
            cpu: "1"
            memory: "2G"
            ephemeral-storage: "3G"
        command: ["/bin/bash", "-c"]
        args: ['for i in {1..100}; do echo "$i: $(date)"; i=$((i+1)); sleep 1; done']
      restartPolicy: Never
Please note that the "requests" and "limits" values for a resource should be the same. You would think that you could set the limit higher than the request, but in reality they need to match in order for the pod to stay within the kubernetes resource limit bubble. The scheduler places pods based on their requests, so if you set the limit higher than the request, you risk the pod using more memory than the scheduler accounted for, and the node can start killing off random other colocated pods by way of OOM, which is very, very bad for the cluster.
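The job file above names the job $USER-$TS, and kubectl does not expand shell variables itself, so the placeholders have to be filled in before the file is submitted. One way to do that is sketched below. The helper name render_job is hypothetical, and treating TS as a timestamp is an assumption based on the name.

```shell
# Fill in the $USER and $TS placeholders of job.yml and print the result.
# render_job is a hypothetical helper; treating TS as a timestamp is an assumption.
render_job() {
    manifest="${1:-job.yml}"
    ts="${TS:-$(date +%Y%m%d-%H%M%S)}"
    # kubectl does not expand shell variables, so substitute them beforehand.
    # [$] matches a literal dollar sign.
    sed -e "s/[$]USER/$USER/g" -e "s/[$]TS/$ts/g" "$manifest"
}

# Submit the rendered manifest (uncomment to run against the cluster):
# render_job job.yml | kubectl apply -f -
```

Only these two placeholders are replaced, so the shell variables inside the container's args (such as $i and $(date)) are left alone and still run inside the pod.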
NOTE: Jobs and pods that completed more than 72 hours ago but have not been cleaned up will be automatically removed by the garbage collector. Most jobs will have the "ttlSecondsAfterFinished" configuration item in them, so they will be automatically cleaned up after that time expires. Leaving old pods and jobs around pins the disk space they were using for as long as they remain, so it's good to get rid of them as soon as they are done, unless you are debugging a failure or something like that.
Jobs that run for more than 72 hours will not be deleted; only jobs that exited more than 72 hours ago are removed.
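If you would rather clean up by hand than wait for the garbage collector, something like the following sketch may be useful. The helper name cleanup_finished_jobs is hypothetical. The selector status.successful=1 matches jobs with exactly one successful completion, which fits single-pod jobs like the example job file; failed jobs are left in place so you can still debug them.

```shell
# List jobs in your namespace, then delete the ones that finished successfully.
# cleanup_finished_jobs is a hypothetical helper name.
cleanup_finished_jobs() {
    # Show what exists first so you can see what is about to be removed.
    kubectl get jobs
    # Deleting a job also removes the pods it created.
    kubectl delete jobs --field-selector status.successful=1
}
```

To remove one specific job instead, kubectl delete job followed by the job name does the same for that job only.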
A lot of other good information can be viewed on Rob Currie's github page, which includes examples and some "How To" documentation:
https://github.com/rcurrie/kubernetes
View the Cluster's Current Activity
One quick way to check the cluster's utilization is to run:
$ kubectl top nodes
NAME             CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
udp-k8s-1        1815m        1%     1191Mi          0%
udp-k8s-master   111m         5%     1024Mi          46%
Ignore the master node as that one only handles cluster management and doesn't run jobs or pods for users.
Another good way to get a lot of details about the current state of the cluster is through the Kubernetes Dashboard:
https://cgl-k8s-dashboard.gi.ucsc.edu/
Select the "token" login method, and paste in this (long) token:
eyJhbGciOiJSUzI1NiIsImtpZCI6IiJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJrdWJlcm5ldGVzLWRhc2hib2FyZC10b2tlbi0ycDY4cCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50Lm5hbWUiOiJrdWJlcm5ldGVzLWRhc2hib2FyZCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6ImE5ZGI2Y2I0LWMyYWYtNDM3My04ZmM2LWE4YWYwYTBmNGRkNCIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDprdWJlcm5ldGVzLWRhc2hib2FyZDprdWJlcm5ldGVzLWRhc2hib2FyZCJ9.ZVQjryG_ksfvReIfq4Frb6M4sE6OVDOXnFy9Aii-h3mrpdHRE6bgjdAvSGZ0jJSIUEz5GgPBQ0lCwhyZocivHHr4zTrNxMkOFZhPDnpvF6RVIDWTkqmH9Dg6qmro0gTJP75oKBpt7dFN2pW4zvqOAzqPmh7qxfoVusN8X6U13YirMFEf65-aGL-_FFNBsEzvjkC-BgXWbtk3YZc8CJL7xtvlKLyE6u6jC9Qx0SWnwzkALlxmzo_yYTDKpIrWiQGEqzLQOxKml-H0kSYLDX-t4sTivXp4vCw_ruoqwIpLnnQAC7q3ZtSTxHIrxbB7n_M8gfhpXtwprbPav-XmBk1xaQ
The dashboard is read-only, so you won't be able to edit anything. It's mostly for seeing what's going on and where.