Hoffman2 Happy Hours: Running interactive & batch jobs¶

Raffaella D'Auria, PhD¶

Today Learning Outcomes¶

  • the Hoffman2 Cluster: system overview recap
  • how to follow this presentation in a terminal or in a jupyter notebook
  • working interactively on the Hoffman2 Cluster
  • where to look for applications already available on Hoffman2
  • running non-interactive work-flows (batch jobs)

[image: H2Cluster.png]

The cluster at your fingertips¶

In [ ]:
# What type of compute nodes?

qhost -F arch | tail -n +4 | xargs -l2 | grep -v ^sge | awk '{print $12,$3}'  | awk -F = '{print $2}' | sort | uniq -c | awk 'BEGIN {print "CPU-type\t\t# nodes\t\t#cores/node\t# tot. cores"} {SUM_NODES +=$1; SUM_CORES +=$1*$3; {printf "%-16s %8d\t %8d\t\t %8d\n", $2,$1,$3,$1*$3}} END {print "TOTALS\t\t\t"SUM_NODES"\t\t-\t\t\t"SUM_CORES}'
In [ ]:
# Same breakdown, including memory per node:
qhost -F arch | tail -n +4 | xargs -l2 | grep -v ^sge | awk '{print $12,$3,$8}'  | awk -F = '{print $2}' | sort | uniq -c | awk 'BEGIN {print "CPU-type\t\t# nodes\t\t#cores/node\t# tot. cores\t\tmemory/core (GB)\ttot memory (GB)"} {SUM_NODES +=$1; SUM_CORES +=$1*$3; SUM_MEM +=$4; {printf "%-16s %8d\t %8d\t\t %8d\t\t %.3f\t\t\t %.3f\n", $2,$1,$3,$1*$3,$4/$3,$4}} END {print "TOTALS\t\t\t"SUM_NODES"\t\t-\t\t\t"SUM_CORES"\t\t -\t\t\t"SUM_MEM}'
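The pipelines above boil down to a count-and-summarize pattern: sort the per-node records, count duplicates with `uniq -c`, and format the result with `awk`. A minimal sketch of that pattern on mock `qhost`-like data (the CPU names and core counts below are invented for illustration):

```shell
# Count nodes per CPU type from mock "arch cores" records
# (the sample data is hypothetical, not real qhost output):
printf 'intel-gold-6240 36\nintel-E5-2670 16\nintel-gold-6240 36\n' \
  | sort | uniq -c \
  | awk '{printf "%-18s %d nodes, %d cores/node\n", $2, $1, $3}'
```

This prints one line per CPU type with its node count and cores per node; the real pipelines add header rows and running totals on top of the same idea.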

How to follow this presentation¶

In this presentation we assume that:

  • you already have an account on the Hoffman2 Cluster
  • on your local computer you have access to a terminal and an SSH client or you have installed a remote desktop to connect to the cluster

If you will be running this presentation as a jupyter notebook:

  • you should have python installed

NOTE:¶

  • commands that you can cut and paste into your terminal may be preceded by the $ character, which indicates the terminal prompt and should not be included

How to run this presentation on a terminal¶

If you use a terminal and SSH to connect to the Hoffman2 Cluster:

  1. open a terminal on your local computer and SSH into the Hoffman2 Cluster with the command (substitute joebruin w/ your Hoffman2 user name):

    ssh joebruin@hoffman2.idre.ucla.edu

  2. when applicable cut and paste the commands from the slides omitting the $ character which is included to indicate the unix (or terminal) prompt

A summary of the commands is also available as a text file in:

/u/project/systems/PUBLIC_SHARED/dauria/F2023-INTRO-TO-H2/running-jobs.txt

How to run this presentation from a remote desktop¶

If you use a remote desktop (NoMachine or X2Go) to connect to the Hoffman2 Cluster:

  1. start a new connection (or reconnect to an existing suspended connection)
  2. open a terminal on the remote desktop

How to run this presentation as a jupyter notebook:¶

This presentation is a jupyter notebook; if you so choose, you can run it by following these steps:

  1. open a terminal on your local computer
  2. download the python script h2jupynb with the command:

    $ curl -O https://raw.githubusercontent.com/rdauria/jupyter-notebook/main/h2jupynb

    or:

    $ wget https://raw.githubusercontent.com/rdauria/jupyter-notebook/main/h2jupynb

How to run this presentation as a jupyter notebook - cont'd:¶

  1. run the script; for example, if your Hoffman2 Cluster account is joebruin (substitute joebruin w/ your user name):

    $ python h2jupynb -u joebruin -t 2 -m 5

    or:

    $ python3 h2jupynb -u joebruin -t 2 -m 5

How to run this presentation as a jupyter notebook - cont'd II:¶

  1. when the jupyter notebook interface opens in your local browser, click on the New button and select Terminal
  2. when the terminal opens in your browser, issue the command (do not include the $ sign):

$ cp /u/project/systems/PUBLIC_SHARED/dauria/F2023-INTRO-TO-H2/H2HH-jobs.ipynb ./

  3. navigate back to the Jupyter Notebook homepage and search for and double click on:

    H2HH-jobs.ipynb

  4. this presentation should open as a notebook

On which resources will your job run?¶

https://www.hoffman2.idre.ucla.edu/Using-H2/Computing/Computing.html#computational-resources-on-the-hoffman2-cluster

Highp vs shared vs campus resources:¶

  • highp refers to the use of group-owned compute nodes
    • users can run jobs for up to 14 days
    • only for users in groups that own resources
  • shared refers to the use of temporarily unused group-owned compute nodes
    • users can run jobs for up to 24 hours
    • only for users in groups that own resources
  • campus refers to compute nodes owned by OARC/IDRE and made available to the UCLA community
    • users can run jobs for up to 24 hours
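The run-time limits above determine where a job can go. A tiny sketch of that decision (the 24-hour and 14-day limits come from the list above; the variable name is arbitrary):

```shell
# Decide where a job can run from its requested runtime in hours
# (limits taken from the highp/shared/campus list above):
h_rt_hours=72
if [ "$h_rt_hours" -le 24 ]; then
  echo "fits on campus, shared, or highp resources"
else
  echo "longer than 24 hours: must request highp (group-owned) nodes"
fi
# → longer than 24 hours: must request highp (group-owned) nodes
```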

What computational resources do I have access to?¶

Open a terminal on the Hoffman2 Cluster and issue:

$ myresources

if the first line of your output contains:

User joebruin is in the following resource group(s): campus

you do NOT have access to group-owned compute nodes and can run only for up to 24 hours on nodes owned by OARC/IDRE

if the first line of your output contains:

User joebruin is in the following resource group(s): gobruins evebruin

you have access to the nodes purchased by the groups gobruins and evebruin; you can run for up to 24 hours on shared queues and for up to 14 days when requesting to run on owned resources (highp mode)
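If you want to use the group list in a script, you can strip everything up to the colon. A sketch on the sample line shown above (this parsing approach is just one possibility, not an official tool):

```shell
# Extract the resource groups from a (sample) myresources first line:
line='User joebruin is in the following resource group(s): campus'
groups=${line#*: }   # drop everything through ": "
if [ "$groups" = "campus" ]; then
  echo "campus-only access: jobs limited to 24 hours"
else
  echo "group-owned access via: $groups"
fi
# → campus-only access: jobs limited to 24 hours
```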

In [ ]:
# Do I have access to highp resources?
# if you are running this presentation as a jupyter notebook you can test your resources by running this cell:

myresources -u rdtest

Do I have access to highp resources?¶

To find out, paste into a terminal connected to the cluster the command (omitting the $ character indicative of the unix prompt):

$ myresources

what do you see?

Working interactively on the Hoffman2 Cluster¶

Any work that will use substantial computational resources should be run on compute nodes and not on the login nodes.

To get an interactive session on one core of a compute node, from a terminal issue the following command (omitting the $ character indicative of the unix prompt):

$ qrsh

What happens?

(To terminate your interactive session, after the prompt returns, type: Control + d or logout)

Working interactively on the Hoffman2 Cluster (Cont'd)¶

Customizing your interactive session. To request:

  • a specific runtime of, for example, 12 hours, use:

    $ qrsh -l h_rt=12:00:00

  • a specific amount of memory, for example 4GB, use:

    $ qrsh -l h_data=4G

  • an entire node in exclusive mode (i.e., all of its cores and memory):

    $ qrsh -l exclusive

  • a session on group-owned nodes (check first if you have access with the command myresources):

    $ qrsh -l highp

  • access to a GPU card:

    $ qrsh -l gpu,cuda=1

See also: https://www.hoffman2.idre.ucla.edu/Using-H2/Computing/Computing.html#examples-of-how-to-request-resources

Working interactively on the Hoffman2 Cluster - multiple cores¶

Customizing your interactive session. To request multiple computing cores:

  • from the same node (server); for example, to request 8 cores:

    $ qrsh -pe shared 8

  • across multiple nodes (servers); for example, to request 42 cores:

    $ qrsh -pe dc* 42

See also: https://www.hoffman2.idre.ucla.edu/Using-H2/Computing/Computing.html#requesting-multiple-cores

Working interactively on the Hoffman2 Cluster - Examples¶

Putting it all together, here are a few examples:

  • To request an interactive session for 1 hour with 4GB per core and 6 cores on the same node:

    $ qrsh -l h_rt=1:00:00,h_data=4G -pe shared 6

  • To request an interactive session for 2 hours with 3GB per core and 48 cores across multiple nodes:

    $ qrsh -l h_rt=2:00:00,h_data=3G -pe dc* 48
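Note that h_data is requested per core, so the total memory reserved by a multi-core session is h_data times the number of cores. A quick check of the arithmetic for the first example above:

```shell
# h_data is per core: the first example (6 cores, 4 GB each)
# reserves 6 x 4 = 24 GB in total:
cores=6; h_data_gb=4
echo "total memory requested: $((cores * h_data_gb)) GB"
# → total memory requested: 24 GB
```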

What GPU cards are available and how to request them¶

https://www.hoffman2.idre.ucla.edu/Using-H2/Computing/Computing.html#gpu-access

GPU cards available to all Hoffman2 users:

GPU type     Compute capability   No. of CUDA cores   Global memory size
A100         8.0                  6912                80 GB
V100         7.0                  5120                32 GB
RTX2080Ti    7.5                  4352                11 GB
P4           6.1                  2560                8 GB

What GPU cards are available and how to request them (Cont'd)¶

https://www.hoffman2.idre.ucla.edu/Using-H2/Computing/Computing.html#gpu-access

Scheduler options to request specific GPU cards:

GPU type     Scheduler options
A100         -l gpu,A100,cuda=1
V100         -l gpu,V100,cuda=1
RTX2080Ti    -l gpu,RTX2080Ti,cuda=1
P4           -l gpu,P4,cuda=1

E.g., to request a session on a specific GPU card, issue at the command prompt:

$ qrsh -l gpu,P4,cuda=1,h_rt=3:00:00

NOTE: GPU cards are a hot commodity and you may need to wait for a while!

How to check the status of GPU nodes¶

To see all the CUDA GPU nodes (you may not have access to all) and their running jobs, issue at the command line:

$ qhost -l cuda.0.name=* -q -j

In [ ]:
qhost -l cuda.0.name=* -q -j

What applications/software is already available on Hoffman2?¶

Refer to: https://www.hoffman2.idre.ucla.edu/Using-H2/Software/Software.html

[image: Screenshot 2023-07-26 at 9.34.40 AM.png]

Apps available via modules¶

To see what applications are available in the current hierarchy, at a terminal connected to Hoffman2 issue the command:

$ module av # press enter to scroll down; press q to exit the view

To look for a specific software, for example R, issue the command:

$ modules_lookup -m R

In [ ]:
## Most centrally installed apps are available via `modulefiles`
## (if you are running this presentation as a jupyter notebook execute this cell):

module av --no-pager
In [ ]:
## Most centrally installed apps are available via `modulefiles`; to look for a specific software use `modules_lookup`
## (if you are running this presentation as a jupyter notebook execute this cell):

modules_lookup
In [ ]:
## To look for a specific application, say, R
## (if you are running this presentation as a jupyter notebook execute this cell
## or paste the command in your terminal):

modules_lookup -m R
In [ ]:
## Check which R (if any) is currently available in your environment
## (if you are running this presentation as a jupyter notebook execute this cell
## or paste the command in your terminal):

which R
In [ ]:
## Load an application in your environment - continued:
## (if you are running this presentation as a jupyter notebook execute this cell
## or paste the commands in your terminal):
which R
module load gcc/10.2.0; module load R/4.3.0
which R

Submitting non interactive (batch) jobs¶

What is a batch job?¶

  • a workflow that can be executed following a recipe (without user intervention)

Why?¶

  • instead of waiting for your interactive session to start, batch jobs start whenever resources become available
  • you can be notified by email when the job starts, finishes, or errors out
  • your job does not depend on the persistence of your network connection
  • after submitting batch jobs you can close the connection to the cluster
  • you can submit many batch jobs and they will all execute as resources become available

How?¶

  • you will generally use a submission script (a job recipe) in conjunction with the command: qsub
  • a submission script sets the job environment and contains the sequence of commands needed to run the job
  • a submission script (may) contain instructions for the scheduler (requests for resources, etc.)
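The bullets above can be made concrete with a minimal submission-script sketch (modeled on common SGE scripts; the actual submit_job.sh used in the hands-on below may differ in its details):

```shell
#!/bin/bash
# Minimal SGE submission-script sketch: the #$ lines are comments to
# bash but are read by the scheduler as resource requests.
#$ -cwd                        # run the job from the submission directory
#$ -o joblog.$JOB_ID           # write scheduler output to joblog.<job id>
#$ -j y                        # merge stderr into stdout
#$ -l h_rt=1:00:00,h_data=2G   # request 1 hour of runtime and 2 GB per core
#$ -m bea                      # email when the job begins, ends, or aborts

# the sequence of commands that does the actual work:
echo "job running on $(hostname)"
```

Save it to a file (e.g. my_job.sh) and submit it with `qsub my_job.sh`.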

Hands-on submitting a job¶

In [ ]:
## Submitting non interactive (batch) jobs

# create a time-stamped directory, cd to it and copy in it the submission script: 
# /u/local/apps/submit_scripts/submit_job.sh 
timestamp=`date "+%F"` 
mkdir -p $HOME/H2HH_$timestamp; cd $HOME/H2HH_$timestamp; pwd
if [ ! -f "submit_job.sh" ]; then 
   cp /u/local/apps/submit_scripts/submit_job.sh ./submit_job.sh
else 
   echo "File: submit_job.sh already present"; 
fi 

# check that the submission script has been copied in the current directory:
ls -l submit_job.sh
In [ ]:
# now submit the job: 
qsub submit_job.sh 

# is my job running? 
myjobs 

# save the job ID number into the variable $JOB_ID for later use:
JOB_ID=`myjobs | grep submit_job | awk '{print $1}'`

# echo the JOB_ID:
echo "JOB_ID=$JOB_ID"
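The JOB_ID extraction above is just grep plus awk over the myjobs listing. Here is the same pattern on a mock one-line listing (the line format and values are invented for illustration):

```shell
# Grab the first field (the job ID) of the line naming our job
# (mock myjobs-style line; real output has more columns):
line='735108 0.50000 submit_job joebruin qw 11/22/2023'
JOB_ID=$(echo "$line" | grep submit_job | awk '{print $1}')
echo "JOB_ID=$JOB_ID"
# → JOB_ID=735108
```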

When will my job start?¶

Very many jobs are constantly running on the cluster... how many?

In [ ]:
#first four jobs queuing (status "p" pending):

qstat -s p | head -n 6
In [ ]:
# tot. no. of jobs currently pending

qstat -s p | grep qw | wc -l
In [ ]:
#Let's count the total number of compute cores requested using some handy command line expressions:

count=0; qstat -s p | grep qw | awk -v count=$count '{count=count+$8} END {print "Total no. of cores requested: "count}'
In [ ]:
#first four jobs running (status "r" running):

qstat -s r | head -n 6
In [ ]:
# tot. no. of jobs running

qstat -s r | grep " r " | wc -l
In [ ]:
#Let's count the total number of compute cores in use by running jobs using some handy command line expressions: 

count=0; qstat -s r | grep @ | awk -v count=$count '{count=count+$9} END {print "Total no. of cores in use: "count}'
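The core-counting cells above all use the same awk accumulation pattern: add a field to a running total on every line, then print it in the END block. On a mock two-job qstat snippet (sample values invented) it looks like:

```shell
# Sum a slot column across lines: awk accumulates into count and
# prints the total in the END block (field 6 holds the mock slot count):
printf '123 0.5 jobA user qw 8\n124 0.5 jobB user qw 4\n' \
  | awk '{count += $6} END {print "Total no. of cores requested: " count}'
# → Total no. of cores requested: 12
```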

Anatomy of a submission script¶

In [ ]:
# let's take a look at the submission script:

cat submit_job.sh
In [ ]:
# let's take a look at the joblog file: 

cat joblog.${JOB_ID}

Where to find sample submission scripts¶

Under: https://www.hoffman2.idre.ucla.edu/Using-H2/Software/Software.html

Look for a specific software and navigate to the Batch use tab:

[image: Screenshot 2023-07-26 at 9.49.19 AM.png]

  • Paste the script (use copy button) in a file on Hoffman2

[image: Screenshot 2023-07-26 at 9.53.32 AM.png]

  • use, for example, the nano editor:

    $ nano stata_submit.sh

  • paste the script into the editor, edit as needed, then save and exit (Control + x)

  • submit the job w/:

    $ qsub stata_submit.sh

In [ ]:
# or look in:

ls /u/local/apps/submit_scripts

Submitting R jobs the easy way¶

In [ ]:
# Submit R jobs with R_job_submitter.sh:  

# create temporary directory in your $SCRATCH and change directory to it:
if [ ! -d $SCRATCH/R_tests ]; then mkdir $SCRATCH/R_tests; fi; cd $SCRATCH/R_tests 

# copy the R file R-benchmark-25.R:
if [ ! -f R-benchmark-25.R ]; then cp /u/local/apps/submit_scripts/R/R-benchmark-25.R ./;fi 

# submit the R script R-benchmark-25.R to the queues using R_job_submitter.sh:
/u/local/apps/submit_scripts/R_job_submitter.sh -n R-benchmark-25.R  -m 1 -t 1 -s 4 -v 4.0.2 -nts 
JOB_ID2=`myjobs | grep R-benchmar | awk '{print $1}'`

# echo JOB_ID:
echo "JOB_ID=$JOB_ID2"
In [ ]:
# check the submission status of the job(s):

myjobs
In [ ]:
# check if output has been generated:

ls -ltr
In [31]:
# check the submission script generated by `/u/local/apps/submit_scripts/R_job_submitter.sh`:

cat R-benchmark-25.cmd
In [32]:
# let's check the joblog file (one of the last two files in the list above): 

cat R-benchmark-25.joblog.$JOB_ID2
 
Job R-benchmark-25, ID no. 735109 started on:   n1020
Job R-benchmark-25, ID no. 735109 started at:   Wed Nov 22 11:55:23 PST 2023
 
Loading R/4.0.2
  Loading requirement: intel/.2019.2 curl/8.4.0
Currently Loaded Modulefiles:
 1) intel/.2019.2 <aL>   2) curl/8.4.0 <aL>   3) R/4.0.2  

Key:
<module-tag>  <aL>=auto-loaded  

R CMD BATCH --no-save --no-restore  R-benchmark-25.R R-benchmark-25.out.735109
real 28.07
user 32.88
sys 2.31
 
Job R-benchmark-25, ID no. 735109 finished at:  Wed Nov 22 11:55:51 PST 2023
 
In [ ]:
# let's check the output file: 

cat  R-benchmark-25.out.$JOB_ID2

Alternatively get a sample submission script from the H2Docs¶

  • visit https://www.hoffman2.idre.ucla.edu/Using-H2/Software/Software.html#r
  • click on the tab "Batch use"
  • navigate to the R submission script
  • click on the "copy button" shown on the top right corner of the submission script
  • paste it in a new file