.. _installation:

Installation and Example Training
===================================

This section provides instructions for downloading, configuring, and running the ML research platform on **the Frontier system at Oak Ridge National Laboratory**.

STEP 1: Downloading the ML Research Platform
---------------------------------------------

The ML research platform is a Python project managed with Git. To obtain the source code, clone the repository and switch to the **frontier** branch using the following commands:

.. code-block:: bash

    git clone https://github.com/grnydawn/miles-credit.git
    cd miles-credit
    git checkout frontier

.. note::
    You can also clone the repository over SSH using **git@github.com:grnydawn/miles-credit.git**.

STEP 2: Creating a Python Virtual Environment and Installing Packages
----------------------------------------------------------------------

Run the following make commands in the top-level directory of the repository:

.. code-block:: bash

    make venv
    make install

.. note::
    The make commands are provided for easy installation. Please refer to the **Makefile** for details about the commands.

STEP 3: Training a Model on a Frontier Node
--------------------------------------------

.. note::
    This section explains how to run the model on an **interactive Frontier node** rather than in batch mode.

First, in **frontier_xformer.yml** in the **config** sub-directory, update the output path (`save_loc`) to a location where you have write permission, as shown below.

.. code-block:: none

    # the location to save your workspace, it will have
    # (1) pbs script, (2) a copy of this config, (3) model weights, (4) training_log.csv
    # if save_loc does not exist, it will be created automatically
    save_loc: '/ccs/home/grnydawn/scrfrontier/data/credit-output/example/'
    seed: 1000 # random seed

Next, obtain an allocation on an interactive Frontier compute node.

.. code-block:: bash

    salloc -A -J inter -t 2:00:00 -q debug -N 1

Next, retrieve the name of the node using the **hostname** command.

.. code-block:: bash

    hostname

Finally, run the following commands to train the CrossFormer model on the node.

.. code-block:: bash

    export MASTER_ADDR=
    export MASTER_PORT=29500
    cd
    make train_xformer_srun

.. note::
    * The value of **MASTER_ADDR** is the node name displayed by the **hostname** command above.
    * The target of the **cd** command is the top-level directory of the Git repository.
    * The **make** commands are provided for easy execution of training. Please refer to the **Makefile** for details about the commands.

.. note::
    * The training configuration is specified in **config/frontier_xformer.yml** in the repository.
    * The input data files used in the example training are located under **/lustre/orion/atm112/world-shared/grnydawn/data/CREDIT**. They are a subset of the files available from https://app.globus.org/file-manager/collections/2fc90d8f-10b7-44e1-a6a5-cf844112822e/overview
    * If you are interested in downloading the files used for this example training, please download the files listed below and adjust the data file paths in **config/frontier_xformer.yml**.

      * All_2010_staged.mean.LatLonLev.nc
      * All_2010_staged.std.LatLonLev.nc
      * SixHourly_TOTAL_2017-01-01_2017-12-31_staged.zarr
      * SixHourly_TOTAL_2018-01-01_2018-12-31_staged.zarr
      * SixHourly_TOTAL_2019-01-01_2019-12-31_staged.zarr
      * SixHourly_TOTAL_2020-01-01_2020-12-31_staged.zarr
      * SixHourly_TOTAL_2021-01-01_2021-12-31_staged.zarr
      * static_operation_ERA5_zhght.nc

Once all the above steps are completed successfully, the training progress will be displayed on the screen with progress bar indicators.
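
The environment setup in STEP 3 can be sketched as a small pre-flight script to run on the allocated node before starting training. This is only an illustration, not part of the repository: the ``SAVE_LOC`` variable and its default path are hypothetical stand-ins for the ``save_loc`` value in your configuration file, and the script assumes that the output of **hostname** is the address the training processes should use, as described above.

.. code-block:: bash

    #!/usr/bin/env bash
    # Hypothetical pre-flight sketch; not provided by the repository.
    set -eu

    # Assumption: SAVE_LOC mirrors the save_loc entry in your config file.
    # The default below is a placeholder path, not a real Frontier location.
    SAVE_LOC="${SAVE_LOC:-${TMPDIR:-/tmp}/credit-output/example}"

    # Create the directory if needed and confirm that it is writable.
    mkdir -p "$SAVE_LOC"
    if [ ! -w "$SAVE_LOC" ]; then
        echo "save_loc is not writable: $SAVE_LOC" >&2
        exit 1
    fi

    # Use the current node name as the master address, with the port from STEP 3.
    MASTER_ADDR="$(hostname)"
    MASTER_PORT=29500
    export MASTER_ADDR MASTER_PORT

    echo "save_loc=$SAVE_LOC MASTER_ADDR=$MASTER_ADDR MASTER_PORT=$MASTER_PORT"

After the script reports the values, change to the top-level directory of the repository and run ``make train_xformer_srun`` as shown in STEP 3. If you source the script instead of executing it, the exported variables remain available in your interactive shell.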