The best Claude Code that $200 can buy in pure JAX on TPUs



salmanmohammadi · Maintainer · Apr 5, 2026
nanocode is written entirely in JAX and designed to be trained using TPUs.

I adapted the core training infrastructure and philosophy from Karpathy's incredible nanochat project, so if you're familiar with nanochat, nanocode should feel very similar.

This is how my d24 1.3B parameter nanocode turned out:
You can get started for free using the Google TRC program which gives you free access to pre-emptible TPUs for a month - and I think new Google Cloud accounts also get $300 in credits.

I was fortunate to have access to the TRC program for 3 months for this project, and I found that my spot instances were rarely interrupted; I could easily keep the same pod up for a week or more.

You can reproduce nanocode-d24 (1.3B params) in ~9 hours total on a TPU v6e-8 for around $200, or train nanocode-d20 (477M params) in ~1.5 hours for around $34.

If you're using NVIDIA GPUs, nanocode should also work out of the box, but you should be aware that nanocode has been highly optimised for TPUs.

Training nanocode : a friendly agentic coding partner
Andrej's original release post for nanochat does a great job of explaining what we're doing here, and the commands you'll use in nanocode are virtually identical, so I'd recommend reading through his work first.

I'll go over what we've done differently to elicit agentic coding behaviours from our model.

The pre-training and tokenizer training process is pretty much identical to nanochat's, but I found that including additional coding data from The Stack-V2 at a ratio of 1:5 in both the pre-training and tokenizer mixture resulted in a stronger coding model and more efficient code tokenization, which helped a ton.

Let's first download the dataset shards we'll need for tokenizer training and model pre-training:
And kick off our tokenizer training script:
For reference, we can compare with nanochat's tokenizer, which is identical aside from the addition of The Stack in the training mixture (well, I've also added special tokens and templating logic to support more sophisticated tool calling, but more on that later).

We can see that this gives a big boost for code at the cost of general text tokenization efficiency, but that's okay since we want our model to do one thing very well: agentic coding.

Our models are trained with a param:data ratio of 8 (following nanochat's scaling law analysis ).
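As a back-of-the-envelope sketch of that ratio (an illustration, not nanocode's actual data-loader arithmetic), the token budget works out to eight training tokens per parameter:

```python
# Token budgets implied by the param:data ratio of 8.
# This helper is illustrative; nanocode's scripts compute this internally.
def token_budget(n_params: int, ratio: int = 8) -> int:
    """Number of training tokens for a given parameter count."""
    return ratio * n_params

d24_tokens = token_budget(1_300_000_000)  # d24, ~1.3B params -> 10.4B tokens
d20_tokens = token_budget(477_000_000)    # d20, ~477M params -> ~3.8B tokens
```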

Let's kick off a training run like so:
You should see something like this:
Our model has attained some knowledge about the world, which is nice.

It still doesn't know about Saturday though : ).

Let's look at some more thorough quantitative results, since we only estimate metrics using a smaller subset of the evaluation data during training:
This will print a whole bunch of metrics, but the relevant ones are bits-per-byte across our pretraining sets, sv2 (The Stack V2) and fwe (FineWeb-EDU), and the CORE metric, which makes comparing against nanochat's results and GPT-2 straightforward.

I've compiled the results across a few model parameter sizes to get a feel for our scaling laws:
Since CORE measures general language reasoning capabilities and we've geared our models towards code data, it's expected that our CORE scores drop slightly compared to the corresponding GPT-2 models.

Training d24 on FineWeb-EDU alone resulted in a CORE score of 0.261, which lines up with GPT-2 XL below and nanochat-d24.

The tradeoff here is that we expect our models to perform well in coding tasks.

I'll mostly be referring to our d24 model throughout this post, which is similar to nanochat's d24 model but is trained with twice the context length (4096 vs. 2048) to better support multi-turn agentic conversations.

Now that we have a reasonably capable coding base model, let's look at how we can turn it into a fully-fledged agentic coding partner.

Let's think a bit about what agentic models are doing from first principles.

Pre-training LLMs produces next-token-generators which have compressed a vast amount of knowledge, but they aren't really useful for things like following instructions, answering questions about the knowledge they have, or fixing bugs in Python files.

There's a bunch more work to do in trying to get our models to do useful things.

The first step is templating - delimiting different components of the input and output so the model learns the structure of the task it's being asked to perform.

Let's take chat templating as an example.

Conversation can be structured as turns, where each side takes a turn at a time - so our model needs to know whose turn it is, and what they've said.

<|user_start|> , <|user_end|> , <|assistant_start|> , and <|assistant_end|> are special tokens which help provide structure to raw text.

We typically reserve a whole token for them when tokenizing.
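To make the templating concrete, here's a minimal sketch of how turns might be flattened into a token stream using the special tokens above. The render helper is my own illustration, not nanochat's or nanocode's actual tokenizer code:

```python
# Illustrative chat templating: flatten (role, text) turns into a single
# templated string using the special-token names from the post.
def render_chat(turns):
    """Wrap each turn in its role's start/end special tokens."""
    out = []
    for role, text in turns:
        out.append(f"<|{role}_start|>{text}<|{role}_end|>")
    return "".join(out)

convo = render_chat([
    ("user", "what day comes after friday?"),
    ("assistant", "saturday!"),
])
# convo == "<|user_start|>what day comes after friday?<|user_end|>"
#          "<|assistant_start|>saturday!<|assistant_end|>"
```

In practice each special token is reserved as a single token id in the tokenizer, so the model sees turn boundaries as single atomic tokens rather than spelled-out text.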

Great.

Now let's think about the kind of templating we might use for an agentic model.

The basis for agentic behaviour is tool-calling - a kind of task where the model's turn isn't directed towards the user, but may instead be an action through an interface with the real world, and which produces outputs which the model may respond to in real-time.

If we look at it this way, the outputs of a tool call can just be treated as another kind of turn, so we reserve two additional special tokens, <|tool_result_start|> and <|tool_result_end|>, so our model knows when information is coming from a tool call and not the user.

Now we just need a way to let our model know how to make tool calls - we'll need templating for the name of the tool the model wishes to invoke and (optionally) any keyword-arguments it needs to pass through.

Let's take grep as an example:
This would look something like this:
We've defined special tokens for delimiting the entire tool call (<|tool_call_start|> and <|tool_call_end|>), and for delimiting the named arguments within that tool call (<|tool_arg|> and <|tool_val|>).
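As a rough sketch of how such a call might be serialised (the special-token names come from the post, but the exact argument layout inside the call is my assumption):

```python
# Hypothetical serialisation of a tool call using the post's special tokens.
# The name-then-keyword-arguments layout is an assumption for illustration.
def render_tool_call(name, **kwargs):
    """Serialise a tool name and keyword arguments into the templated form."""
    parts = [f"<|tool_call_start|>{name}"]
    for arg, val in kwargs.items():
        parts.append(f"<|tool_arg|>{arg}<|tool_val|>{val}")
    parts.append("<|tool_call_end|>")
    return "".join(parts)

call = render_tool_call("grep", pattern="def train", path="nanocode/")
```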

Note that the model is able to think through and explain its actions by nesting the tool-call template inside its response.

It's important to think about what your final agentic interface is actually going to look like - you don't want to come up with a tool calling template and spend $$$ using it to train your model only to find out it doesn't work in practice.

When defining our tools we are trading off expressivity against tractability: how easy it is for the model to actually learn to use a tool reliably.

For the simplest possible agent we want it to interact with a UNIX environment by reading files, searching filesystems, and writing to disk.

Above we used a Bash tool call, but if we only used Bash for everything, the model would effectively have to learn correct shell syntax - quoting, flags, piping - just from examples.

Instead we can anticipate that something like grep is probably something that the model is going to be doing often enough that we should give it a dedicated tool call.

For nanocode's agentic interface, I defined four tools:
This lets nanocode read and write files, search for patterns, and use UNIX commands when needed - though I don't anticipate that we can obtain a model which learns meaningful Bash tool usage with our compute and token budget.

Based on these tool calls, our agentic CLI would just be a thin wrapper which parses the model's predicted tokens, intercepts any tool calls, and executes them, providing the result to the model as a kind of conversational turn.
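That thin-wrapper loop can be sketched like so. Here `model.generate` and `execute_tool` are placeholders standing in for the real model and environment, not nanocode's actual API:

```python
# A minimal sketch of the "thin wrapper" agent loop: sample the model's
# turn, intercept any tool call, execute it, and feed the result back
# to the model as a tool-result turn.
import re

TOOL_CALL = re.compile(r"<\|tool_call_start\|>(.*?)<\|tool_call_end\|>", re.S)

def agent_step(model, history, execute_tool):
    reply = model.generate(history)              # sample the model's turn
    history.append(("assistant", reply))
    match = TOOL_CALL.search(reply)
    if match:                                    # intercept a tool call...
        result = execute_tool(match.group(1))    # ...run it in the environment
        history.append(("tool_result", result))  # ...and return it as a turn
    return history
```

A real CLI would additionally ask the user for permission before executing, and loop until the model produces a turn with no tool call.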

Okay, how do we teach our model to use these tools?

The simplest way is to just train the model on hundreds of thousands of examples of this tool use.

These examples could look something like this:
This is a pretty rough sketch, but you get the idea - the user makes a request, and the model fulfils it by using one or more of the tools it has available.

It also makes a goofy little remark to explain what it's doing.

We mentioned above we're training the best Claude Code we can, and you may be familiar with Claude's soul document - a written specification of the model's character, values, and behavioural principles.

Anthropic uses this document to guide how Claude is trained: it defines the desired behaviour, then training data and preference optimization are shaped to align the model with that specification.

This is the core idea behind Constitutional AI (CAI) - which was used to train early Claude models (evolutions of this technique are still used to train Claude ).

Constitutional AI is a training process comprising synthetic data generation, supervised fine-tuning, and preference optimisation, all in order to align a model with a specified set of characteristics and constitutional principles, or SOUL.

Note that while CAI as an alignment approach is focused on producing helpful and harmless agents - in particular preventing models from producing harmful answers - our use is primarily for stylistic alignment of our model.

For nanocode's SOUL, I wanted it to have a unique voice: casual, friendly, and a little goofy, but without being sycophantic or overly verbose.

This is what I came up with .

To summarise: nanocode should only use lowercase (proper nouns are acceptable in code), it should be warm and friendly, and it should follow only the precise instructions it has been given.

Reflecting on this I probably didn't need the philosophical fluff, particularly for models of these sizes.

Our SOUL is pretty simple compared to Claude's, but as we mentioned, we want our model to be very good at only a couple things: agentic coding, and adhering to a personality which we've curated for it.

Constitutional AI instills this SOUL into a model through two stages: 1) Constitutional Supervised Fine-tuning (SFT) and 2) Reinforcement Learning from AI Feedback (RLAIF) - the preference learning stage.

As I mentioned above, we need examples of our specific tool usage as well as conversational turns which adhere to our model's SOUL .

The Constitutional SFT stage is a synthetic data generation pipeline which you can think of as a mix of rejection sampling and distillation.

For our use case, the loop looks like this:
At the end of this process, we obtain two responses for a given prompt: a final response which is strongly aligned with the SOUL, and the initial, misaligned response.

We'll use these pairs later for the preference learning stage, but for our Constitutional SFT stage, we'll just be training our model on the (Initial prompt, Chosen sample) pairs.
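The generate -> critique -> revise loop can be sketched like this, assuming `generate`, `critique`, and `revise` are calls out to a teacher model (these helpers are placeholders for illustration, not the post's actual pipeline):

```python
# Sketch of Constitutional SFT data generation: sample an initial response,
# then repeatedly critique and revise it until it aligns with the SOUL.
# Returns the (rejected, chosen) pair used later for preference learning.
def constitutional_sft_pair(prompt, generate, critique, revise, max_rounds=3):
    """Return (initial misaligned sample, final SOUL-aligned revision)."""
    initial = generate(prompt)
    current = initial
    for _ in range(max_rounds):
        feedback = critique(current)   # e.g. "this isn't lowercase"
        if feedback is None:           # aligned: stop revising
            break
        current = revise(current, feedback)
    return initial, current
```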

It's worth noting that the critique loop is essential when your generator model can't reliably produce SOUL-aligned outputs in a single pass — which was the case for most of the smaller open-source models I ran locally through vLLM on TPUs.

Frontier models through OpenRouter pretty much nailed things first try.

I want to say that the approach I detailed here was the first one I tried but really this part of the project took a couple months of iterations and ablations.

I landed on two approaches for nanocode .

Firstly, I generated a dataset comprising short, single-turn conversations which teach our model the fundamental agentic loop of Grep/Read, then Edit to write a solution which solves the task at hand.

Importantly, it teaches our model how to understand the syntax of our tools and their results.

To seed this dataset, I reused existing Python open-source instruct datasets:
This turned out to be a great way to bootstrap our synthetic dataset generation process, as it provided ~120K high-quality samples with correct Python solutions and model explanations - we just need to apply the generate -> critique loop above to massage this into our format.

You can see more in dev/process_datasets.py and the final dataset smohammadi/nanocode-tulu-selfoss-evol, and I'll use an example here to illustrate what our final dataset looked like:
Here I re-used the initial prompt, then converted the model's original solution into the Edit tool call by extracting the generated code and wrapping it in our tool templating.

I'm also using line numbering here ( 1-> ) as I believed this would help the model make targeted Grep and Edit calls if it could see line numbers in files it was reading and editing.
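For illustration, that line-numbering could be as simple as the following (the exact prefix format nanocode uses is an assumption on my part, based on the `1->` notation mentioned above):

```python
# Prefix each line of a file with "N->" so the model can reference exact
# line numbers in its Grep and Edit calls. Format is illustrative.
def number_lines(text: str) -> str:
    return "\n".join(
        f"{i}->{line}" for i, line in enumerate(text.splitlines(), start=1)
    )

print(number_lines("def add(a, b):\n    return a + b"))
# 1->def add(a, b):
# 2->    return a + b
```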

Secondly, unlike tulu and self-oss where the user just asks for code, many samples in evol-codealpaca contain code in both the instruction and the output, where the user is describing existing code and asking for a modification.

This made it a great fit for multi-turn rollouts with tool chaining.

To do this, I first computed a diff between the original and modified code to get targeted old_string/new_string arguments for the Edit, then prepended Grep and Read steps to obtain rollouts where the agent searches for the relevant function, reads the file, then makes a targeted edit.
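A rough reconstruction of that diff step, using Python's difflib (this is my sketch of the approach described above, not the post's actual code):

```python
# Turn an (original, modified) code pair into targeted old_string /
# new_string arguments for an Edit tool call, one pair per changed region.
import difflib

def edit_args(original: str, modified: str):
    """Yield (old_string, new_string) pairs for each changed region."""
    old_lines, new_lines = original.splitlines(), modified.splitlines()
    sm = difflib.SequenceMatcher(a=old_lines, b=new_lines)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != "equal":  # 'replace', 'delete', or 'insert'
            yield ("\n".join(old_lines[i1:i2]), "\n".join(new_lines[j1:j2]))

pairs = list(edit_args("a = 1\nb = 2\nc = 3", "a = 1\nb = 20\nc = 3"))
# pairs == [("b = 2", "b = 20")]
```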

Lastly, I wanted to provide demonstrations of more complex tool-use: long-context rollouts which emulate realistic coding agent use-cases.

This would involve complex Bash tool usage, user rejections, and realistic environment interactions such as tools erroring when filenames aren't found, or when the model needs to use multiple commands to explore a codebase.

This was a huge amount of effort, but I had a lot of fun.

My final dataset comprised 2000 rows of these rollouts which were generated from scratch from an initial seed dataset of 2000 prompts covering a variety of problem domains and programming languages.

The relevant code lives in dev/scenarios_to_rollouts.py and you can see the final dataset at smohammadi/nanocode-long-context .

It's finally time to teach our model how to be the agent we've always wanted:
We're training on a mixture of general instruct data and our synthetic dataset above.

Ablations here are really helpful, as I found myself iterating many times over data mixtures to obtain the results I wanted.

And after an hour, you should see some sample generations at the end of training:
This was really exciting for me - the model has learned to respond in lowercase, and has successfully grasped the tool calling structure.

Note that we only sample short completions during generations (~64 tokens) so responses are cut off.

The final step in the CAI approach is preference learning which helps the model distinguish between outputs which are aligned with our SOUL and those which are not.

The original CAI paper used a pretty heavyweight form of preference learning based on Reinforcement Learning From Human Feedback (RLHF).

This involves training a reward model on the preference data we collected earlier, and using an online reinforcement learning algorithm like PPO to align the model.

But we definitely don't have time for this.

Instead, Direct Preference Optimisation (DPO) formulates the RLHF objective as a direct supervised objective on preference pairs which eliminates the need for a reward model.

You can think of it as a kind of binary classification over preference pairs which penalises the model's log-probabilities over dis-preferred outputs, and rewards the model for assigning higher relative probabilities to preferred outputs.
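Numerically, the standard DPO loss looks like this (variable names are mine; this is a sketch of the published formulation, not nanocode's training code):

```python
# Direct Preference Optimisation loss for a single preference pair:
# a logistic loss on the difference of policy-vs-reference log-prob
# margins for the chosen and rejected responses.
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

A policy that assigns relatively more probability to the chosen response than the reference does gets a low loss, which is exactly the "binary classification over preference pairs" framing above.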

Overall I'm not sure how much DPO is actually doing for a model of this size and for our limited token budget, particularly since our model is so over-tuned on our SOUL-aligned datasets.

That said, I think it did something , as you can see from the plots above and table below.

The accuracy (the capability of the model to assign higher log-probabilities to chosen answers over rejected answers) went from 0.45 -> 0.88, and the validation bits-per-byte on the synthetic datasets didn't show any meaningful regression (0.247 -> 0.248).

I think if I had more time with the TRC program I would spend it developing more rigorous evals which specifically target nanocode's agentic capabilities.

The benefits of the CAI preference learning step are clearer for larger models trained across far more domains and tasks, since it can help optimise away unwanted behaviours the model picked up during SFT on datasets which weren't generated with the Constitutional SFT approach.

Now we can try our agent out!

Kick off the agentic CLI with:
Through this interface, nanocode can interact with your UNIX system via its tool calls (you're asked to grant permission for each one).

Give it a try by asking it to explore the nanocode codebase, or a specific function you're interested in!

Note that whilst nanocode has pretty successfully grasped the tool interface, it is still a very under-tuned and small model.

I expect that it will struggle with complex bug-fixes, or coding tasks which it hasn't seen in its training data.

Finally, we can pull all of the logs we've created and structure them into a nice report for our run:
This can then be copied over onto your local machine and converted into HTML by running this command:
I'd love to see what you can come up with.

The codebase is designed to be minimal and hackable, and it would be great to see how you instill character and personality into your own agentic coding partner through your own SOUL .

You can re-write the tool spec and interface to something that's super customized, and the synthetic data generation pipeline can help you adapt nanocode to your own use-cases.

The codebase is only ~5.5K lines of code, which should comfortably fit in the context window of a modern LLM.

I also hope this repo helps you better understand how JAX works and how it can be used to write simple, elegant, and performant code.

I've worked with (and contributed to) PyTorch for a long time and I found JAX to be really refreshing; XLA is an incredible compiler and the profiling tooling is lovely to work with.
