🛸Migrate your tasks to Coretex

Intro

In this walkthrough you will learn how to adapt your existing project so it can be executed inside the Coretex platform.

These instructions assume you have already worked with Projects, Tasks and Datasets so steps for creating them will be skipped.

For this tutorial, we have chosen nanoGPT, a popular GitHub repository by Andrej Karpathy. It is an implementation of the well-known Transformer architecture for character-level text prediction, trained on a corpus of text taken from the Tiny Shakespeare dataset.

Usually the first step would be to upload the training data to Coretex. In this case, however, the repository already contains a single-file script that downloads and prepares the data, so we will skip the data upload step and simply run that script when our task runs.

Task structure

Every Coretex Task must have at least these three files at the root level:

task.yaml - a YAML-formatted file containing basic information about the task and a list of task parameters
main.py - the Python file used as the entry point for running your task
requirements.txt and/or environment.yaml - the file(s) listing the dependencies of your task
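
As a quick local check before running anything, the layout above can be verified with a few lines of standard-library Python. This is only a sketch: `missing_task_files` is a hypothetical helper and not part of the coretex library, and accepting both the `environment.yaml` and `environment.yml` spellings is an assumption.

```python
from pathlib import Path

# Files the list above names as mandatory at the task root.
REQUIRED_FILES = ("task.yaml", "main.py")
# At least one dependency file must exist (both spellings accepted as an assumption).
DEPENDENCY_FILES = ("requirements.txt", "environment.yaml", "environment.yml")


def missing_task_files(task_root: str) -> list[str]:
    """Return the names of required files that are missing from task_root."""
    root = Path(task_root)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    if not any((root / name).is_file() for name in DEPENDENCY_FILES):
        missing.append("requirements.txt or environment.yaml")
    return missing
```

Calling `missing_task_files(".")` from the task root returns an empty list when the layout is complete.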

Step 1 - Create `task.yaml`

Usually, the task.yaml file would contain a dataset parameter, representing the ID of the dataset our task would use. It would look something like this:

param_groups:
  - name: inputs
    params:
      - name: dataset
        description: Dataset id that is used for fetching dataset from coretex.
        value: 4116
        data_type: dataset
        required: true

Since we will download the .txt file with training data on each run, we can remove the dataset parameter from our task.yaml. If you want to use your own data instead, upload it as a Dataset, leave the dataset parameter in, and later, when running the job, simply choose your uploaded Dataset as the value of that parameter.
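
To make the nesting of the parameter block concrete, the sketch below mirrors the task.yaml snippet as a plain Python dictionary and looks a parameter up by name. It only illustrates how `param_groups` and `params` are nested; `find_param` is a hypothetical helper, and this is not a description of how Coretex itself reads task.yaml.

```python
from typing import Optional

# The task.yaml snippet from this step, written as the equivalent Python structure.
task_config = {
    "param_groups": [
        {
            "name": "inputs",
            "params": [
                {
                    "name": "dataset",
                    "description": "Dataset id that is used for fetching dataset from coretex.",
                    "value": 4116,
                    "data_type": "dataset",
                    "required": True,
                }
            ],
        }
    ]
}


def find_param(config: dict, name: str) -> Optional[dict]:
    """Return the first parameter entry with the given name, or None if absent."""
    for group in config.get("param_groups", []):
        for param in group.get("params", []):
            if param.get("name") == name:
                return param
    return None
```

With the dataset parameter removed from task.yaml, the same lookup would simply return `None`.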

Step 2 - Define the virtual environment

Currently there are two primary virtual environment managers supported: venv and Conda.

Every Coretex task must have at least a requirements.txt file that lists the coretex Python library as a dependency, regardless of the virtual environment manager you are planning to use.

To use Python's venv virtual environment manager you need to list your dependencies in requirements.txt in the root of your task.

To use the Conda virtual environment manager you need to add an environment.yml file to the root of your task.

If you have both environment.yaml and requirements.txt files in your task, Coretex will assume a Conda environment and use requirements.txt only if you explicitly linked it in your environment.yaml with the -r flag.
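
The selection rule described in this step can be summarized in a few lines. This is a sketch of the rule as stated in the text, not Coretex's actual implementation; `select_environment_manager` is a hypothetical name, and treating `environment.yml` and `environment.yaml` as equivalent is an assumption.

```python
def select_environment_manager(task_files: set[str]) -> str:
    """Return "conda" or "venv" for the given set of file names at the task root."""
    if "requirements.txt" not in task_files:
        # The text requires requirements.txt (with the coretex dependency) in every task.
        raise ValueError("requirements.txt is required in every Coretex task")
    if task_files & {"environment.yaml", "environment.yml"}:
        # A Conda environment file takes precedence when both files are present.
        return "conda"
    return "venv"
```

For example, `select_environment_manager({"requirements.txt", "environment.yaml"})` returns `"conda"`, while a task with only requirements.txt resolves to `"venv"`.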

Step 3 - Run entry point

Let's create a main.py file as the main entry point for our run. This file should have the following structure:

from coretex import CustomDataset, ExecutingExperiment
from coretex.project import initializeProject

def main(experiment: ExecutingExperiment[CustomDataset]):
    # ... experiment code
    pass

if __name__ == "__main__":
    initializeProject(main)

Functionality to specify a custom Python file as an entry script is coming soon.

It's important to define a main function that accepts a single argument of the generic type ExecutingExperiment[DatasetType], and to call initializeProject() with this main function passed in as an argument. This sets up all of the plumbing necessary to manage and track the run execution.

You can replace CustomDataset with any class derived from the Dataset class if you are using any of the supported Tasks for a more comfortable development experience.

Looking at nanoGPT's repo structure, we can see that the training code lives in the train.py file, so we'll wrap it in a function called run() that we can import into main.py and invoke.

Apart from that, we need to run the script that downloads the Tiny Shakespeare dataset and prepares the data for training. The script is located in data/shakespeare_char/prepare.py, so we will wrap the contents of that file in a function called prepare(), which we can import into main.py and invoke before training.
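
Wrapping a script in a function is a mechanical change: the statements that used to run at module level move into the function body, and a `__main__` guard keeps the file runnable on its own. The sketch below shows the pattern on a toy stand-in. The function bodies, the `build_vocab` helper and the default text are hypothetical and are not nanoGPT's actual preparation code.

```python
# Sketch of the shape of data/shakespeare_char/prepare.py after wrapping.
# The bodies are a toy stand-in, not the real download and tokenization logic.

def build_vocab(text: str) -> dict[str, int]:
    """Map every distinct character to an integer id (character-level encoding)."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}


def prepare(text: str = "to be or not to be") -> list[int]:
    """Code that used to run at import time now runs only when prepare() is called."""
    vocab = build_vocab(text)
    return [vocab[ch] for ch in text]


if __name__ == "__main__":
    # Keeps `python prepare.py` working the same way as before the refactor.
    prepare()
```

Because nothing executes at import time any more, `from data.shakespeare_char.prepare import prepare` in main.py has no side effects until `prepare()` is actually called. The same pattern applies to wrapping train.py in `run()`.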

Our main.py file will look like this:

from coretex import CustomDataset, ExecutingExperiment
from coretex.project import initializeProject

from train import run
from data.shakespeare_char.prepare import prepare

def main(experiment: ExecutingExperiment[CustomDataset]):
    prepare() # download and prepare data
    run() # train the model

if __name__ == "__main__":
    initializeProject(main)

At this point scheduling the run in a queue will allow the platform to manage its execution automatically. Regardless of the execution node chosen, Coretex first creates the virtual environment, installs all of the dependencies and runs the entry script, providing it with the execution context in the experiment object.

Step 4 - Model upload

Since our run produces a model, we first need to create a temporary folder where the model can be stored until the upload to Coretex is finished. We can do this in train.py using the FolderManager class, replacing the value of the destination path variable out_dir:

from coretex.folder_management import FolderManager
out_dir = FolderManager.instance().createTempFolder("model")

The only thing left to do is to write a helper function for uploading the model into the Coretex Model registry. We first get the current run, then create an empty Model object attached to this run and finally upload the model file to this new Model object.

def saveModel(modelPath: str) -> None:
    experiment: ExecutingExperiment[CustomDataset] = ExecutingExperiment.current()
    model = Model.createModel(experiment.name, experiment.id, 0, {})
    model.upload(modelPath)

The updated main.py file looks like this:

from coretex import CustomDataset, ExecutingExperiment, Model
from coretex.project import initializeProject
from coretex.folder_management import FolderManager

from train import run
from data.shakespeare_char.prepare import prepare

def saveModel(modelPath: str) -> None:
    # Create a Model attached to the current run and upload the model files to it
    experiment: ExecutingExperiment[CustomDataset] = ExecutingExperiment.current()
    model = Model.createModel(experiment.name, experiment.id, 0, {})
    model.upload(modelPath)

def main(experiment: ExecutingExperiment[CustomDataset]):
    modelPath = FolderManager.instance().createTempFolder("model")
    prepare() # download and prepare data
    run() # train the model
    saveModel(modelPath)

if __name__ == "__main__":
    initializeProject(main)

Step 5 - Logging

One purpose of the initializeProject() function is to initialize the Python logger to work in the Coretex ecosystem.

If you are using the standard Python logger in your project there is no need to change anything in your code. All logs will be displayed in the Run Console in Coretex.

Please be aware that standard Python print calls and other logging methods will not be visible on the Coretex platform automatically. Make sure you replace all your print statements with calls to Python's logger.
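
Replacing a print call with the standard logger usually takes one line per call site. The sketch below uses only Python's built-in `logging` module; `report_progress` and its message format are hypothetical and only illustrate the substitution.

```python
import logging

# Module-level logger, the usual convention for the standard logging module.
logger = logging.getLogger(__name__)


def report_progress(step: int, loss: float) -> None:
    """Log training progress instead of printing it."""
    # Before: print(f"step {step}: loss {loss:.4f}")
    logger.info("step %d: loss %.4f", step, loss)


if __name__ == "__main__":
    # Only needed when running this sketch on its own. In a Coretex task,
    # initializeProject() initializes the logger, as described above.
    logging.basicConfig(level=logging.INFO)
    report_progress(0, 4.2)
```

Passing the values as arguments rather than pre-formatting the string lets the logging module skip the formatting work when the message level is disabled.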

Run task

Running python main.py in terminal and providing appropriate command line arguments will start the run.

An example of a complete terminal command:

python main.py --username <your-email> --password ****** --projectId 123

The expected output should look something like this:

[MLService] Login successful
INFO: Experiment execution started
[MLService] Downloading dataset: [==============================] - 100% - Finished
INFO: found vocab_size = 65 (inside ~/.coretex/samples/177015/meta.pkl)
Initializing a new model from scratch
WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0
WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0
WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0
WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0
WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0
WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0
number of parameters: 10.65M
using fused AdamW: False

You can do the same thing through the Coretex web platform by importing the task into a Project and then starting the run through the web UI.

Congratulations!

If everything went well, you should see the run permanently stored in your account when you navigate to the Runs or Models tab in the left-hand menu bar.
