I didn't get as good results as Karpathy (unlucky seed?)
It's fun to play with though...
User: How many legs does a dog have?
Assistant: That's a great question that has been debated by dog enthusiasts for centuries. There's no one "right" answer (...)
cd /tmp
git clone https://huggingface.co/sdobson/nanochat
uv run https://gist.githubusercontent.com/simonw/912623bf00d6c13cc0211508969a100a/raw/80f79c6a6f1e1b5d4485368ef3ddafa5ce853131/generate_cpu.py \
--model-dir /tmp/nanochat \
--prompt "Tell me about dogs."
This is a much easier way to run the model. I'm going to update the Hugging Face README to point to this. The one thing that could be improved is the turn-taking between user and assistant, which it sometimes gets confused about. I fixed that in my fork of your gist here: https://gist.github.com/samdobson/975c8b095a71bbdf1488987eac...
Simon, I had to run "brew install git-lfs && cd nanochat && git lfs install && git lfs pull" and then it worked. Before that, the model weights didn't get cloned by default for me on macOS.
% uv run https://gist.githubusercontent.com/simonw/912623bf00d6c13cc0... \
--model-dir nanochat/ --prompt "who is simonw on hacker news?"
Using device: cpu
Loading model from nanochat/model_000650.pt
Loading metadata from nanochat/meta_000650.json
Model config: {'sequence_len': 2048, 'vocab_size': 65536, 'n_layer': 20, 'n_head': 10, 'n_kv_head': 10, 'n_embd': 1280}
Loading model weights (this may take a minute for a 2GB model)...
Converting model to float32 for CPU...
Model loaded successfully!
Loading tokenizer...
Tokenizer loaded successfully!
Prompt: who is simonw on hacker news?
Encoded to 9 tokens
Generating...
--------------------------------------------------
who is simonw on hacker news?<|user_end|><|assistant_start|>A hacker news reporter, I'd say a few things. First, I'm a bit of a hothead, always pushing the boundaries of what's acceptable in the world of hacking. I've got a reputation for being merciless and relentless in my pursuit of the truth.
In many ways, I've developed a sixth sense for this type of thing. I've spent years honing my skills, learning the language of hacking and the tactics it takes. I know how to think like the hacker
--------------------------------------------------
Adding on: Claude also gave me the following line, which was necessary to get the model weights to download from HF. This might be obvious to anyone familiar with HF, but it helped me, so I'm sharing it here!
For anyone curious, this is the error when running uv sync on macOS:
> uv sync
Resolved 88 packages in 3ms
error: Distribution `torch==2.8.0+cu128 @ registry+https://download.pytorch.org/whl/cu128` can't be installed because it doesn't have a source distribution or wheel for the current platform
hint: You're on macOS (`macosx_15_0_arm64`), but `torch` (v2.8.0+cu128) only has wheels for the following platforms: `manylinux_2_28_x86_64`, `win_amd64`; consider adding your platform to `tool.uv.required-environments` to ensure uv resolves to a version with compatible wheels
Also, /tmp/nanochat expects all the contents of the tokenizer and chatsft_checkpoints folders.
Yeah, that's because CUDA on a Mac isn't a thing. It could be swapped to the normal torch package, but you'd have to do some code patching to make sure it's running on MPS, and even then some of the code may need rewriting/patching if there's no MPS version of the CUDA kernels.
Isn't there a common PyTorch API that could choose the OS/hardware-specific backend automatically? Or is this project hard-coding the CUDA variant of PyTorch as a requirement?
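There is: device placement in PyTorch is just a device argument, and each backend exposes a runtime availability check. A minimal sketch of generic selection (nanochat's actual code may differ, and this only addresses runtime placement, not the install-time wheel issue above):

```python
import torch

# Pick the best available backend at runtime instead of hard-coding CUDA.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

x = torch.randn(4, 4, device=device)
y = x @ x  # the matmul runs on whichever backend was selected
print(device, y.shape)
```

On macOS you'd still need to install the plain torch wheel rather than the +cu128 build for this to resolve to "mps".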
>Our main measure of progress. Bits per byte is, per Karpathy, "a much better measure than just the typical cross-entropy loss, because it further normalizes the loss on each token by the number of bytes of that token, making the metric tokenizer-invariant".
Is so blindingly obvious that I'm ashamed to think I didn't think to do it when trialing my own tokenizer approach on TinyStories. I might go back and have a look at how my tokenizer actually compared to how well I imagined it compared.
ELI5 for anyone else (I had to have this explained to me):
When you train a language model, it tries to predict the next token.
We measure how good it is at that using loss: how surprised it was by the real answer.
Different models might use different tokenizers, whose tokens cover different numbers of characters. So if you describe loss per token, you can't easily compare the performance of two models that tokenize differently.
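To make the normalization concrete, here's a toy calculation (all numbers invented) showing that per-token loss depends on the tokenizer while bits per byte does not:

```python
import math

def bits_per_byte(mean_loss_nats, n_tokens, n_bytes):
    # total nats -> total bits, then normalize by the byte length of the text
    return mean_loss_nats * n_tokens / math.log(2) / n_bytes

# Same 100-byte text, same total uncertainty, tokenized two different ways.
bpb_coarse = bits_per_byte(4.8, 25, 100)   # coarse tokenizer: 25 long tokens
bpb_fine   = bits_per_byte(1.2, 100, 100)  # fine tokenizer: 100 short tokens
print(bpb_coarse, bpb_fine)  # identical, despite 4.8 vs 1.2 loss per token
```

The per-token losses (4.8 vs 1.2 nats) look wildly different, but both describe the same model quality on the same text.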
Tokenizers used to be 1 character per token. Then Google implemented Subword encoding[1] on their early neural translation work and found it was much better.
Subword units are genuinely meaningful in most languages. You do need to tune the vocabulary size though.
Absolutely, it requires longer training time and more compute.
Once trained, predictions need to hold through many more steps, because each step processes one token. If a token early in a sentence heavily implies a token will occur later in the sentence, then that awareness needs to be maintained while processing each intermediary token, and each step is a bit lossy. The fewer steps you need to take before leveraging that knowledge, the better the prediction.
If you had infinite compute and data for training, then performance would be equivalent, though, I think.
Since the OpenAI tokenizer is estimated at ~4.2 characters per token, with your proposed 1-character-per-token tokenizer the effective context length immediately becomes 4.2 times smaller, and generated output 4.2 times slower (since 4.2 times more tokens are needed for the same output). Doesn't look like a good tradeoff.
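Rough arithmetic with the numbers in this thread (the ~4.2 chars/token estimate, and the 2048-token sequence_len from the model config above):

```python
chars_per_token_bpe = 4.2   # rough average for a BPE-style tokenizer (assumed)
context_tokens = 2048       # sequence_len from the model config above

# How much text fits in the context window under each scheme:
chars_bpe = context_tokens * chars_per_token_bpe  # ~8600 characters
chars_char_level = context_tokens * 1             # 2048 characters

factor = chars_bpe / chars_char_level
print(f"char-level fits {factor:.1f}x less text and needs {factor:.1f}x more steps")
```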
Cool. Is there a simple "howto" on running this repo with training on W&B for a programmer like me who has never done model training flows? Maybe you could share the steps you took?
There's not much to it... it took longer to spin up the cloud machine than it did to kick off the training run. I'll be writing up a blog post with a step-by-step guide when I get a free moment, but in the meantime, here are the commands I ran: https://pastebin.com/sdKVy0NR
This one is mine. It's a light-hearted digital newspaper of sorts, covering news from local British communities through the medium of verse (generated by LLMs).
Until now I've been using ChatGPT for the generation, with a fairly generic prompt that asks for a poem about the article that follows. ChatGPT's ability to summarise is incredible. It's really not great, though, at rhyme and meter. That means a decent amount of curation and heavy editing is needed to get something passable. Prompt engineering hasn't had a meaningful impact. I'm looking to fine-tune a davinci model, which I think will deliver higher quality with less effort.
The quality can mostly be blamed on me, rather than GPT-3. I haven't written a poem since school :)
The accompanying illustrations are created with Stable Diffusion using DiffusionBee. Images take around 30s to generate on my Macbook Air M1. I'm looking to switch to MochiDiffusion to cut generation time a bit.
The blog is running Ghost on a small DigitalOcean VPS, with emails delivered by Mailgun.
The process right now is somewhat labour-intensive: between researching news stories, iterating on the content, and publishing, it takes a decent amount of time for each piece of content. I'm confident in being able to automate a large part of it, in time.
One fun fact I learned when planning the virtual road-trip for this project: in average traffic conditions, it's possible to visit every city in England in less than 48 hours. The near-optimum solution to this formulation of the Traveling Salesman Problem (starting in the South West), a route taking 47:00:10, was calculated in less than 5 seconds with a Guided Local Search algorithm. [1]
Technology means that I can travel virtually, learn, write creatively, and publish regularly, all whilst having a family and a full-time job. What a time to be alive!
Very open to your thoughts, and indeed to feedback on the concept or the execution.
For short-term career growth, $YOUR_COMPANY's current preferred ETL tool will have the biggest ROI. Focus on design patterns: while APIs will come and go, the concepts, as you rightly say, are transferable.
If you're looking to land a new role: the market says dbt, databricks and snowflake are pretty strong bets.
If it's personal interest, or a high-risk, high-reward long term play, take your pick from any of the new hotness!
I'd second the notion that the question is far too open.
I'd add that dbt, databricks and snowflake are still pretty strong bets, but you have to acknowledge that they're becoming mainstream at an ever-accelerating pace as the companies behind them churn out upskilling courses and meetups and acquire an ever-larger share of the market.
If you like to be a specialist, going deep into either of those still holds career value.
If you're taking a more generalist view of where things are headed, the best prediction I've heard for setting yourself apart is for Data Engineers to optimize for operationalizing data: focusing much more on reverse ETL and becoming knowledgeable in building data web apps. The no-code or low-code movement around data apps will make the barrier to entry for setting something up nonexistent, and I can see how that will drive demand.
Pairing (big) data query/frontend performance and web apps is another beast, though.
For all my initial scepticism, I see the Data Mesh concept picking up pace in the years to come. It's vendor independent, couples well with Team Topologies and effective, decoupled, functional SWE teams. There still will be a big need for standards and conventions set by a small enabling core DE team, as of now, the knowledge gap between the baseline DE and your average SWE or Product Owner is just way too big in my experience.
Last but not least, I'd throw data lake out there.
Apache Iceberg is getting a lot of attention and rightfully so.
TCO of a query engine on top of files is so much better than any DWH, and any org able to optimize compute on data for its current needs will be able to save massively while the "convenience" gap steadily closes. Again, pretty generic, but there's much to learn around Athena, Trino and the like.
I'm personally not a fan of learning a new language except maybe for Rust.
There is an ever-increasing stack of standard "low-code" tools for the typical ETL schtick, and Python won't go anywhere. Again, potential to differentiate will be low, and ever lower, in many contexts outside of proper big data.
This is only me though and this view is highly context dependent, so YMMV of course.
About Snowflake, I am really curious: what do you mean by "learn Snowflake"? The way it was described to me, Snowflake is a cloud-based data warehouse. Are there advanced features in Snowflake which one has to learn? Or do you mean optimized queries?
Snowflake at its most basic is SQL on cloud VMs; anyone comfortable with SQL should feel at home there. That said, there are many Snowflake-specific features that may take a bit to become familiar with. Just off the top of my head:
- hybrid RBAC, DAC, ABAC security model
- column, row level, and tag based access policies
- multi-account organizations
- cross-account and region data replication
- data shares
- external tables and specialized formats (iceberg, delta)
- pipes and streams
- snowpark API
- streamlit integration
The nice thing about Snowflake is that for many use cases it requires little management.
Things you can learn regarding Snowflake, other than the obvious (SQL, and Snowflake-specific language extensions to SQL): proper table partitioning, Snowpipe (and the associated cloud messaging pipelines), and query performance tuning. (Complex queries can become a bear; identifying when it's your query/partitioning or when it's something on the Snowflake back end is challenging.)
There are always new additions to the Snowflake tooling ecosystem since the company is in competition with Databricks and others (e.g., Snowpark with Python).
A stark reminder to end users of any service that if you are not paying, you are the product. There is good reason that the sale of financial services through intermediaries is a highly regulated area: it takes a firm counter-force to stop businesses from exploiting customers when all the incentives are there, and there are enough links in the chain for plausible deniability.
Hi HN. Over the last 4 months I have collected 380,000 attempts to guess a number on my Amazon Alexa Skill, and wanted to share with the community. Would love to see what you can do with the data!