Author: Jerome Lauret, Modified: Alexandr Prozorov


flowchart TD
    subgraph SL7["📦 SL7 Container"]
        Analysis["Your Analysis"]
    end
    subgraph Alma["🎯 New OS (AlmaLinux 9)"]
        SL7
    end
    Alma -. Job submission .-> HTCondor[/"📤 Job Scheduler (HTCondor)<br><i>submit from Alma9</i>"/]
    HTCondor -- Execution in container --> Analysis

    style Analysis stroke-width:1px,stroke-dasharray: 1



Summary

  • Update ~/.cshrc and ~/.login for NFSv4:
      setenv GROUP_DIR /star/nfs4/AFS/star/group
    
  • Use the new a9 starsub nodes (not the old rcas60xx ones) when you ssh: starsub03 - starsub07

  • Use SL7 container via Singularity for job submission:

    Add the following to your scheduler XML, inside <job> </job>:

    <shell>singularity exec -e -B /direct -B /star -B /afs -B /gpfs -B /sdcc/lustre02 /cvmfs/star.sdcc.bnl.gov/containers/rhic_sl7.sif</shell>
    
  • Test your script inside the container:
      singularity exec -e -B /direct -B /star -B /afs -B /gpfs -B /sdcc/lustre02 /cvmfs/star.sdcc.bnl.gov/containers/rhic_sl7.sif csh
    
    • In the container:
      cons # or your compile command
      root 'myMacro.C("test_file")' # verify that it works (the quotes protect the parentheses from csh)
      exit # leave the container before submitting to HTCondor
      
  • Submit OUTSIDE the container, directly from Alma 9.

  • Instead of star-submit/star-submit-template → use star-submit-beta/star-submit-template-beta

⚙️ Transition from SL7 to Alma 9

The facility CPUs are being upgraded from Scientific Linux 7 (SL7) to a newer operating system called Alma 9 (a9).
At this stage, more than two thirds of the farm has been converted to a9; therefore, to keep large numbers of jobs flowing, you are encouraged to follow the instructions herein (this is also why jobs submitted from SL7 stay pending for so long).

On a9, AFS is no longer supported, which means we also needed a replacement for AFS. That replacement is NFS version 4.

Below are instructions to bridge this gap while we transition (both the underlying OS and the file system).


⚠️ Notes on Code Compatibility

In this transition period, STAR will continue to use a code base built on SL7 (we do NOT yet have native a9 support for our code). This means you will need to assemble and compile your code as usual, using an rcas node.
Here is a quick recipe for running on a9 inside SL7 containers.


🔧 Environment Setup

First, make sure your login is modified as follows:

  • Instead of having something like:
    setenv GROUP_DIR /afs/rhic.bnl.gov/star/group
    

    replace it with:

    setenv GROUP_DIR /star/nfs4/AFS/star/group
    
  • You need to modify BOTH your $HOME/.cshrc and $HOME/.login.

  • This will take care of using NFSv4 instead of AFS. On the rcas nodes, things should then work as usual (if not, please do not revert to AFS but report any issues). A one-shot way to patch both files is sketched after this list.

  • NB: For now, we are testing ONLY official STAR libraries - please do NOT use other libraries, private or otherwise.
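If you prefer to patch both files in one go, here is a minimal sketch (a convenience, not an official recipe: it assumes the AFS path appears verbatim in both files, and it keeps a .bak-afs backup of each):

    # Sketch: swap the AFS group dir for the NFSv4 one in both startup files
    foreach f ( ~/.cshrc ~/.login )
        cp $f $f.bak-afs
        sed -i 's|/afs/rhic.bnl.gov/star/group|/star/nfs4/AFS/star/group|' $f
    end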

🖥️ Submit Nodes

The submit nodes for launching jobs on Alma 9 are named starsub0X, where X is a number from 1 to 7 (e.g. starsub03).

  • However, we ask that you use 03 or a higher number for now.
    Reason: starsub03 and above run Condor version 24.0.4 2025-02-02 BuildID: 784178, while 01/02 are still at 23.9.6 2024-08-08 BuildID: 748275 PackageID: 23.9.6-1 (adjustments were made for the latest version). You can verify this yourself as shown after this list.

  • starsub0X nodes are Alma 9 nodes. The STAR software is not yet available on them, but you will be able to submit to the a9 farm from there.
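To double-check which HTCondor version a given node runs, the standard condor_version command can be used after logging in; on starsub03 and above, expect output along these lines:

    starsub03% condor_version
    $CondorVersion: 24.0.4 2025-02-02 BuildID: 784178 ... $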


📤 Submitting from A9 (starsub0X)

If you are a STAR Scheduler user:

Please use star-submit-beta and/or star-submit-template-beta.
All you need to do is add the following line to your XML:

<shell>singularity exec -e -B /direct -B /star -B /afs -B /gpfs -B /sdcc/lustre02 /cvmfs/star.sdcc.bnl.gov/containers/rhic_sl7.sif</shell>

Example:

<?xml version="1.0" encoding="utf-8" ?> 
<job>
 <command>root4star -q -b StHbtDiHadron.C\(1000000,100,-1,\"\",\"$FILELIST\"\)</command>
 <stdout URL="file:/star/u/carcassi/scheduler/out/$JOBID.out" />
 <input URL="file:/star/data21/reco/productionCentral/FullField/P02gc/2001/312/st_physics_2312011_raw_0017.MuDst.root" />
</job>

Change it to:

<?xml version="1.0" encoding="utf-8" ?> 
<job>
 <shell>singularity exec -e -B /direct -B /star -B /afs -B /gpfs -B /sdcc/lustre02 /cvmfs/star.sdcc.bnl.gov/containers/rhic_sl7.sif</shell> <!-- added line -->
 <command>root4star -q -b StHbtDiHadron.C\(1000000,100,-1,\"\",\"$FILELIST\"\)</command>
 <stdout URL="file:/star/u/carcassi/scheduler/out/$JOBID.out" />
 <input URL="file:/star/data21/reco/productionCentral/FullField/P02gc/2001/312/st_physics_2312011_raw_0017.MuDst.root" />
</job>
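Submission itself then works as before, only with the -beta command; assuming the job description above is saved as myjob.xml (a hypothetical file name):

    star-submit-beta myjob.xml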

If you are not a STAR scheduler user:

Make sure that, whatever you do to submit jobs, you execute your shell script inside the container.
In Condor land, this may mean adjusting your JDL to read as follows:

Arguments = "singularity exec -e -B /direct -B /star -B /afs -B /gpfs -B /sdcc/lustre02 /cvmfs/star.sdcc.bnl.gov/containers/rhic_sl7.sif /blabla/where-my-csh-script-is.csh"

Instead of:

Arguments = /blabla/where-my-csh-script-is.csh

Adapt this as needed.
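For context, a complete submit file might then look like this minimal sketch (illustrative only: it assumes singularity is at /usr/bin/singularity on the worker nodes, and the Output/Error/Log names are placeholders):

    # Sketch of a vanilla-universe JDL running a csh script inside the SL7 container
    Universe   = vanilla
    Executable = /usr/bin/singularity
    Arguments  = "exec -e -B /direct -B /star -B /afs -B /gpfs -B /sdcc/lustre02 /cvmfs/star.sdcc.bnl.gov/containers/rhic_sl7.sif /blabla/where-my-csh-script-is.csh"
    Output     = job.$(Cluster).$(Process).out
    Error      = job.$(Cluster).$(Process).err
    Log        = job.$(Cluster).log
    Queue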


🧪 Testing a Job Interactively

NOTE: Before submitting, you may want to test ONE job interactively to make sure it works.

  • Remember that on starsub0X you are on Alma 9; therefore, as indicated earlier, our code is not yet natively supported there.

  • Therefore, you will need to start a shell like this:
    singularity exec -e -B /direct -B /star -B /afs -B /gpfs -B /sdcc/lustre02 /cvmfs/star.sdcc.bnl.gov/containers/rhic_sl7.sif csh
    
  • This will start an SL7 shell on an Alma 9 node.

  • From that shell, you can execute one of the generated .csh scripts and verify that all goes according to plan:
    cons            # compile as usual
    root my_macro.C # quick test of your macro
    exit            # leave the container
    
  • If this runs, you are ready to submit outside the Singularity shell: exit the container and send jobs to HTCondor from the pure Alma 9 node, not from inside the SL7 container.
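The same check can also be run non-interactively in one shot, which is handy for quick sanity tests (a sketch; my_macro.C stands in for your own macro):

    singularity exec -e -B /direct -B /star -B /afs -B /gpfs -B /sdcc/lustre02 \
        /cvmfs/star.sdcc.bnl.gov/containers/rhic_sl7.sif \
        csh -c 'cons && root -b -q my_macro.C'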

⚠️ Possible Issues

  • There have been reports of issues with the 32-bit version of ROOT/CInt - if you encounter an issue, please try the 64-bit environment:
    setup 64b
  • If your code uses SIMD instructions, there may be a need to restrict jobs to a specific CPU micro-architecture.
    We currently do not have a flag in the STAR scheduler for this, but:

    requirements = (Microarch >= "x86_64-v4")
    

    would limit jobs to one kind of node with specific SIMD instructions (see the sketch after this list for a way to survey what the farm offers).

  • From the production test, we have evidence of a slowdown when more jobs are running concurrently.
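To survey which micro-architecture levels the farm currently advertises (and thus how restrictive such a requirement would be), something along these lines from a submit node should work, using HTCondor's autoformat output:

    condor_status -af Microarch | sort | uniq -c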