Write Code Locally and Run It on GPUs in Seconds¶
Stop Paying for Idle GPUs: Serverless Training with Modal¶
So let's face it: if you are doing anything with deep learning, GPUs are a must.
They are expensive, and the infrastructure is hard to set up. Most of the time you're coding while the GPU sits idle, and since deep learning scripts rarely work on the first go, you end up paying for a lot of dead uptime.
This was a problem I faced as someone who is "GPU poor". I didn't want to spend money on GPUs while I was coding or doing work that didn't leverage GPU compute. Even tasks like downloading data and models or transforming data don't need a GPU, yet you still end up doing them on one.
It gets worse on cloud providers, where you also have to worry about infrastructure. You set up a VM with a GPU attached, then choose a machine image that is often poorly documented. If it isn't set up properly, you have to install CUDA and the rest of the stack from scratch, and when even that doesn't work, most of the time you resort to a Docker container with the right installations.
And if you start doing multi-GPU training, that's one more burden. Some GPU images don't even ship NCCL for communication between GPUs, so you have to be careful about that as well.
So if you just want to spin up a GPU and run something, it's a lot of effort. Providers like Runpod, Vast AI, and others make it easier.
I run a research lab called CognitiveLab [cognitivelab.in], where we do a bunch of model training, synthetic data generation, RL runs, and more. We wanted something that was easy to use, easy to train with, and flexible enough that we weren't constrained by it.
When I looked for a solution where I could write my code locally on my machine and run it on a GPU, I stumbled across this beautiful platform called Modal. It's been a year since I started using it, and it's been a blessing.
I will be covering the following:
- How to handle datasets on Modal efficiently, including creating and managing volumes for seamless data access.
- Writing training scripts using libraries like Unsloth and Axolotl to fine-tune models with minimal effort.
- Evaluating trained models with automated metrics to ensure performance and reliability.
- Serving these models in a scalable and high-throughput manner using vLLM for real-world applications.
By the end of this, you'll have a clear understanding of how to write and experiment with training scripts locally and run them on GPUs as quickly as possible using Modal.
I will mainly cover SFT examples, but if you're interested, I'll write a follow-up blog on setting this up for RL, with training and reward environments running on different GPUs.
Inspiration¶
Thinking Machines, the startup founded by ex-OpenAI CTO Mira Murati, recently launched Tinker, which allows developers to write training loops in Python on their laptops and run them on distributed GPUs.
Check out their announcement: Thinking Machines on X.
This is every developer's dream! However, I have been using Modal to achieve something similar for quite some time now.
PS: From the looks of it, their API is much more sophisticated. They have implemented several optimizations under the hood, such as efficient batching. Here is a tweet that dives deeper into the details.
> This is really cool, but you can do the same thing using @modal with just a few more lines of code.
>
> You write the training loops on your local machine and then run them on any number of GPUs, and it's way more flexible:
>
> - It works with existing repos
> - You only pay for the time… https://t.co/IVOalMvFVC
>
> — Adithya S K (@adithya_s_k) October 3, 2025
A lot of you were interested in how to do this, so here we go.
Ok, what is Modal?¶
You've probably come across the term serverless GPUs.
Let's just say Modal is a GPU platform that does serverless GPUs right, and it has one of the best developer experiences out there.
If you are doing anything in Python, whether training models, deploying them, writing servers, or building agentic systems, Modal can be used.
As per the official Modal website:

> AI infrastructure that developers love. Run inference, train, batch process with sub-second cold start, instant auto-scaling, and a developer experience that feels local.

And that's 100% factual.
Fun fact: Even Lovable uses Modal for running their sandbox.
Getting Started¶
First, all you have to do is install the Modal package:
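```bash
pip install modal
```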
and then authenticate:
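```bash
modal setup
```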
You can also authenticate non-interactively with API tokens (grab the token ID and secret from your Modal dashboard):
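```bash
modal token set --token-id <your-token-id> --token-secret <your-token-secret>
```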
This is all you need to set up Modal.
Core Concepts¶
With Modal, you always start by creating an App, an Image, and Volumes.
App - Creating an App is pretty simple (a minimal sketch; the app name is arbitrary):
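```python
import modal

app = modal.App("finetune-example")  # groups all the functions and resources below
```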
Volumes - Then we can create or use existing volumes.
You can think of volumes as a storage file system where you can store anything like model weights, datasets, scores, and more.
If you want something to persist, put it in a volume. The best part is that a single function can mount multiple volumes at different routes; you can have a volume for model weights at the `/model` path and one for the dataset at the `/dataset` path.
Something like this:

```python
dataset_volume = modal.Volume.from_name("dataset-volume", create_if_missing=True)
model_volume = modal.Volume.from_name("model-volume", create_if_missing=True)
```
Then you write the mapping of mount paths to volumes that will be passed into functions (a sketch; the dict name is just illustrative):
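```python
VOLUME_CONFIG = {
    "/dataset": dataset_volume,
    "/model": model_volume,
}
```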
This is just to illustrate how you can attach volumes to any function. This gives us awesome power.
You can download datasets, process them all on CPU instances, and when it comes time to train, just attach the same volume and use it, which makes life much easier.
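Here is a minimal sketch of that pattern, assuming the `app`, volumes, and `VOLUME_CONFIG` defined above (the function bodies are placeholders):

```python
@app.function(volumes={"/dataset": dataset_volume})  # no gpu argument: runs on CPU
def download_data():
    # ... download and preprocess the dataset into /dataset ...
    dataset_volume.commit()  # persist writes so other functions can see them

@app.function(gpu="A100", volumes=VOLUME_CONFIG)
def train():
    # ... read the prepared dataset from /dataset, write weights to /model ...
    pass
```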
I generally create a volume for a single experiment or training run so that I have everything consolidated that can be used across the project.
Images - The next thing is images.
This is the most important part. Defining an image can be tricky at first, but once it's done, you don't have to worry about it again; the initial build can take some time, though.
It's very important to get right. Refer to the Modal Image docs to see all the ways to create an image.
Here is a sample training image:
```python
train_image = (
    modal.Image.debian_slim(python_version="3.11")
    .uv_pip_install(
        "accelerate==1.9.0",
        "datasets==3.6.0",
        "hf-transfer==0.1.9",
        "huggingface_hub==0.34.2",
        "peft==0.16.0",
        "transformers==4.54.0",
        "trl==0.19.1",
        "unsloth[cu128-torch270]==2025.7.8",
        "unsloth_zoo==2025.7.10",
        "wandb==0.21.0",
    )
    .env({"HF_HOME": "/model_cache"})
)
```
So the base image uses Debian slim with Python 3.11, and then you install all the packages using uv. You also set the `HF_HOME` environment variable so that everything Hugging Face downloads is cached under `/model_cache` and you won't have to download it again and again. This is a good starting image.
Pro tip: I have a set of images that you can use for anything training-related: images for serving LLMs with vLLM and SGLang, and for training with Unsloth, MS Swift, and more. I will share these and go deeper into how to create images the right way later.
Functions - There is one more piece: Functions.
You can take basically any Python function and make it run on a GPU or CPU on Modal. All you have to do is add a decorator:
```python
@app.function(
    image=image,
    gpu="A100",  # request a GPU by name; omit this argument for CPU-only functions
    secrets=[modal.Secret.from_dotenv()],  # local .env variables
    volumes={"/data": volume},
    timeout=3600,  # 1 hour timeout
)
def any_python_function():
    # Your code here
    pass
```
This is where you define which image the function uses, which GPU (if any) to request, what secrets to pass in, which volumes to attach, and the timeout (there is an overall cap of 24 hours).
To see all the properties of a function, refer to the Modal Function docs.
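To actually run it, call the function with `.remote()` from a local entrypoint; a minimal sketch:

```python
@app.local_entrypoint()
def main():
    any_python_function.remote()  # executes on Modal's infrastructure, logs stream locally
```

Then `modal run your_script.py` from your terminal builds the image (cached after the first run), attaches the volumes, and runs the function in the cloud.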
Now the basics are out of the way. Let's do some training, fine-tuning, evaluation, and serving.
Tutorials¶
I have created comprehensive tutorials for each training approach:
1. Training NanoGPT on Modal¶
Learn how to take an existing codebase (Andrej Karpathy's nanoGPT) and run it on Modal's serverless GPUs with minimal modifications. Perfect for beginners to understand:
- How to copy local repositories into Modal containers (see the sketch after the table)
- Data preparation, training, and sampling pipelines
- Managing persistent storage with Modal volumes
- Running existing Python projects on remote GPUs
| Level | GPU Required | Time |
|---|---|---|
| Beginner | 1× A100-40GB (or T4/L40S for testing) | 30 mins - 2 hours |
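For the repo-copying part, the core trick is baking your local checkout into the image. A minimal sketch, assuming a local `nanoGPT/` checkout next to your script (the dependency list and paths are illustrative, not nanoGPT's full requirements):

```python
import modal

nanogpt_image = (
    modal.Image.debian_slim(python_version="3.11")
    .uv_pip_install("torch", "numpy", "tiktoken")
    .add_local_dir("nanoGPT", remote_path="/root/nanoGPT")  # copy the local repo into the container
)
```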
2. Fine-tuning Gemma 3-4B with Unsloth¶
End-to-end vision-language model training and deployment
A production-grade pipeline covering the complete ML workflow from data to deployment. You'll learn:
- Fine-tuning vision-language models with LoRA
- Optimized single-GPU training with Unsloth
- Model evaluation with automated metrics
- Serving with vLLM for high-throughput inference (see the sketch after the table)
- Auto-scaling deployment strategies
| Level | GPU Required | Time |
|---|---|---|
| Intermediate | 1× A100-80GB (or L40S) | 3-6 hours (full pipeline) |
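For the serving piece, a common pattern on Modal is to wrap the vLLM server in a web server function. A rough sketch, assuming a `vllm_image` with vLLM installed and the fine-tuned weights saved to the `/model` volume (the checkpoint path is illustrative):

```python
import subprocess
import modal

@app.function(image=vllm_image, gpu="A100-80GB", volumes={"/model": model_volume})
@modal.web_server(port=8000, startup_timeout=600)
def serve():
    # Launch an OpenAI-compatible vLLM server against the fine-tuned checkpoint.
    subprocess.Popen(
        "vllm serve /model/gemma-3-4b-finetuned --port 8000",
        shell=True,
    )
```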
3. Multi-GPU Training with Axolotl¶
Distributed training for large models (Llama 8B - 70B+)
Advanced distributed training techniques for massive models. You'll learn:
- Multi-GPU training with Accelerate and DeepSpeed
- YAML-based configuration for reproducibility
- Dataset preprocessing for large-scale training
- Scaling from 2 to 8 GPUs
- Cost optimization strategies for expensive training runs
| Level | GPU Required | Time |
|---|---|---|
| Advanced | 2-8× A100-80GB | 4-12 hours (depends on model size) |
This is multi-GPU training, and you can run all types of parallelism (data, tensor, pipeline, FSDP) using Modal as well. For the sake of simplicity, I used Accelerate, but you can go all out with up to 8 GPUs per container. I have mainly been using Modal for multi-GPU training with a maximum of 8 GPUs; I haven't done multi-node training yet (it should be possible with sandboxes, but the setup might be a bit complex).
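For reference, requesting multiple GPUs is just a count suffix on the `gpu` argument, and inside the container you launch Accelerate as usual. A rough sketch, assuming the `train_image` and `VOLUME_CONFIG` from earlier (the Axolotl config path is illustrative):

```python
import subprocess

@app.function(image=train_image, gpu="A100-80GB:4", volumes=VOLUME_CONFIG, timeout=43200)
def train_multi_gpu():
    # Accelerate spawns one worker process per visible GPU inside the container.
    subprocess.run(
        ["accelerate", "launch", "-m", "axolotl.cli.train", "/dataset/config.yaml"],
        check=True,
    )
```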
I think these three examples will give you a good enough picture to replicate the process across many other setups.
Final thoughts¶
For anyone working with AI models, infrastructure is crucial to get right: it's expensive and takes a lot of time to set up, especially for individual researchers and small labs, who will find it hard to set up and manage on their own.
With Modal, infrastructure becomes as easy as writing a Python script: run it on CPU/GPU, deploy it, scale it.
In this post, I went over how to use Modal mainly for running training, eval, and serving scripts for LLMs, but you can do a lot more with it.
Fun fact: Gitvizz runs all of its backend code on Modal. I have been running it for 4 months, it has cost me less than $4, and it scales really well. After using Modal, I completely stopped using Kubernetes and the like for smaller projects.
Need Help?¶
If your organization needs help using Modal optimally, we at CognitiveLab can help you set it up and manage it for you.
Reach out to us through our website or DM me on Twitter: @adithya_s_k