Installation and Example Training
This section provides instructions for downloading, configuring, and running the ML research platform on the Frontier system at Oak Ridge National Laboratory.
STEP 1: Downloading the ML Research Platform
The ML research platform is a Python project managed with Git. To obtain the source code, clone the repository and switch to the frontier branch using the following commands:
git clone https://github.com/grnydawn/miles-credit.git
cd miles-credit
git checkout frontier
Note
You can also clone the repository over SSH using git@github.com:grnydawn/miles-credit.git.
STEP 2: Creating a Python Virtual Environment and Installing Packages
Run the following make commands in the top-level directory of the repository:
make venv
make install
Note
The make commands are provided for easy installation. Please refer to the Makefile for details about the commands.
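After installation, it can help to confirm that the Python interpreter you are running belongs to the new environment. The following is a minimal sketch, assuming that make venv creates a standard venv-style environment and that you have activated it (both are assumptions; check the Makefile). The function name in_virtualenv is hypothetical and not part of the repository.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("inside a virtual environment:", in_virtualenv())
```

Note that this check does not detect conda environments, where sys.prefix and sys.base_prefix are normally identical.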
STEP 3: Training a Model on a Frontier Node
Note
This section explains how to run the model on an interactive Frontier node rather than in batch mode.
First, open frontier_xformer.yml in the config subdirectory and set the output path (save_loc) to a location where you have write permission, as shown below.
# the location to save your workspace, it will have
# (1) pbs script, (2) a copy of this config, (3) model weights, (4) training_log.csv
# if save_loc does not exist, it will be created automatically
save_loc: '/ccs/home/grnydawn/scrfrontier/data/credit-output/example/'
seed: 1000 # random seed
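Because save_loc is created automatically when it does not exist, what matters is whether the nearest existing parent directory is writable. The sketch below checks this before you start a job. The helper names and the example path are hypothetical; the script reads no configuration file and only inspects the path you pass to it.

```python
import os
from pathlib import Path


def nearest_existing_parent(path: str) -> Path:
    """Walk up from `path` until a filesystem entry that already exists is found."""
    candidate = Path(path).expanduser().resolve()
    while not candidate.exists():
        candidate = candidate.parent
    return candidate


def can_create_or_write(path: str) -> bool:
    """True if `path` is a writable directory or could be created under one."""
    parent = nearest_existing_parent(path)
    # Write permission allows creating entries; execute permission allows entering.
    return parent.is_dir() and os.access(parent, os.W_OK | os.X_OK)


if __name__ == "__main__":
    # Placeholder path: replace with the value you put in save_loc.
    save_loc = "~/credit-output/example/"
    print(save_loc, "usable:", can_create_or_write(save_loc))
```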
Next, obtain an allocation on an interactive Frontier computing node.
salloc -A <ACCOUNT> -J inter -t 2:00:00 -q debug -N 1
Once the allocation is granted, retrieve the name of the node with the hostname command.
hostname
Finally, run the following commands to execute the CrossFormer model on the node.
export MASTER_ADDR=<HOSTNAME>
export MASTER_PORT=29500
cd <CREDIT_REPO>
make train_xformer_srun
Note
<HOSTNAME> is the same name displayed when running the hostname command above.
<CREDIT_REPO> is the top-level directory of the Git repository.
The make commands are provided for easy execution of training. Please refer to the Makefile for details about the commands.
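The two exported variables tell the distributed launcher where the rendezvous process listens. The sketch below is one way to generate the export lines on the allocated node and to check that the chosen port is not already in use. The function names are hypothetical and not part of the repository, and the sketch assumes that the name returned by the hostname command is the address other processes should use.

```python
import socket


def port_is_free(port: int, host: str = "") -> bool:
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def rendezvous_env(port: int = 29500) -> dict:
    """Build MASTER_ADDR and MASTER_PORT using this node's hostname."""
    return {"MASTER_ADDR": socket.gethostname(), "MASTER_PORT": str(port)}


if __name__ == "__main__":
    env = rendezvous_env()
    if not port_is_free(int(env["MASTER_PORT"])):
        print("warning: port", env["MASTER_PORT"], "appears to be in use")
    for key, value in env.items():
        print(f"export {key}={value}")
```

If the port is reported as busy, choosing a different MASTER_PORT value before running the make target is a reasonable first step.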
Note
The training configuration is specified in <CREDIT_REPO>/config/frontier_xformer.yml.
The input data files used in the example training are located under /lustre/orion/atm112/world-shared/grnydawn/data/CREDIT. They are a subset of the files available at https://app.globus.org/file-manager/collections/2fc90d8f-10b7-44e1-a6a5-cf844112822e/overview
To download the data for this example yourself, fetch the files listed below and adjust the data file paths in <CREDIT_REPO>/config/frontier_xformer.yml accordingly.
All_2010_staged.mean.LatLonLev.nc
All_2010_staged.std.LatLonLev.nc
SixHourly_TOTAL_2017-01-01_2017-12-31_staged.zarr
SixHourly_TOTAL_2018-01-01_2018-12-31_staged.zarr
SixHourly_TOTAL_2019-01-01_2019-12-31_staged.zarr
SixHourly_TOTAL_2020-01-01_2020-12-31_staged.zarr
SixHourly_TOTAL_2021-01-01_2021-12-31_staged.zarr
static_operation_ERA5_zhght.nc
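Before starting a training run against your own copy of the data, you may want to confirm that every file in the list above is present. The sketch below assumes that all entries sit directly under a single directory, which is an assumption about your download layout and not something the repository requires. The function name missing_inputs is hypothetical.

```python
import sys
from pathlib import Path

# File and Zarr store names taken from the list above.
EXPECTED = [
    "All_2010_staged.mean.LatLonLev.nc",
    "All_2010_staged.std.LatLonLev.nc",
    "SixHourly_TOTAL_2017-01-01_2017-12-31_staged.zarr",
    "SixHourly_TOTAL_2018-01-01_2018-12-31_staged.zarr",
    "SixHourly_TOTAL_2019-01-01_2019-12-31_staged.zarr",
    "SixHourly_TOTAL_2020-01-01_2020-12-31_staged.zarr",
    "SixHourly_TOTAL_2021-01-01_2021-12-31_staged.zarr",
    "static_operation_ERA5_zhght.nc",
]


def missing_inputs(data_dir: str) -> list:
    """Return the expected names that do not exist under `data_dir`."""
    root = Path(data_dir)
    # Zarr stores are directories, so exists() is used instead of is_file().
    return [name for name in EXPECTED if not (root / name).exists()]


if __name__ == "__main__":
    data_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    missing = missing_inputs(data_dir)
    for name in missing:
        print("missing:", name)
    print(f"{len(EXPECTED) - len(missing)} of {len(EXPECTED)} inputs found in {data_dir}")
```

The check only tests for presence; it does not verify that the files are complete or readable.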
Once all of the steps above have completed successfully, training progress is displayed on the screen with progress bars.