I didn't get as good results as Karpathy (unlucky seed?)
It's fun to play with though...
User: How many legs does a dog have?
Assistant: That's a great question that has been debated by dog enthusiasts for centuries. There's no one "right" answer (...)
cd /tmp
git clone https://huggingface.co/sdobson/nanochat
uv run https://gist.githubusercontent.com/simonw/912623bf00d6c13cc0211508969a100a/raw/80f79c6a6f1e1b5d4485368ef3ddafa5ce853131/generate_cpu.py \
--model-dir /tmp/nanochat \
--prompt "Tell me about dogs."
This is a much easier way to run the model. I'm going to update the Hugging Face README to point to this. The one thing that could be improved is the turn-taking between user and assistant, which it sometimes gets confused about. I fixed that in my fork of your gist here: https://gist.github.com/samdobson/975c8b095a71bbdf1488987eac...
Simon, I had to run "brew install git-lfs && cd nanochat && git lfs install && git lfs pull" and then it worked. Before that, the model weights didn't get cloned by default for me on macOS.
% uv run https://gist.githubusercontent.com/simonw/912623bf00d6c13cc0... \
--model-dir nanochat/ --prompt "who is simonw on hacker news?"
Using device: cpu
Loading model from nanochat/model_000650.pt
Loading metadata from nanochat/meta_000650.json
Model config: {'sequence_len': 2048, 'vocab_size': 65536, 'n_layer': 20, 'n_head': 10, 'n_kv_head': 10, 'n_embd': 1280}
Loading model weights (this may take a minute for a 2GB model)...
Converting model to float32 for CPU...
Model loaded successfully!
Loading tokenizer...
Tokenizer loaded successfully!
Prompt: who is simonw on hacker news?
Encoded to 9 tokens
Generating...
--------------------------------------------------
who is simonw on hacker news?<|user_end|><|assistant_start|>A hacker news reporter, I'd say a few things. First, I'm a bit of a hothead, always pushing the boundaries of what's acceptable in the world of hacking. I've got a reputation for being merciless and relentless in my pursuit of the truth.
In many ways, I've developed a sixth sense for this type of thing. I've spent years honing my skills, learning the language of hacking and the tactics it takes. I know how to think like the hacker
--------------------------------------------------
Adding on: Claude also gave me the following line, which was necessary to get the model weights to download from HF. This might be obvious to anyone familiar with HF, but it helped me, so I'm sharing it here!
For anyone curious, this is the error when running uv sync on macOS:
> uv sync
Resolved 88 packages in 3ms
error: Distribution `torch==2.8.0+cu128 @ registry+https://download.pytorch.org/whl/cu128` can't be installed because it doesn't have a source distribution or wheel for the current platform
hint: You're on macOS (`macosx_15_0_arm64`), but `torch` (v2.8.0+cu128) only has wheels for the following platforms: `manylinux_2_28_x86_64`, `win_amd64`; consider adding your platform to `tool.uv.required-environments` to ensure uv resolves to a version with compatible wheels
Also, /tmp/nanochat expects all the contents of the tokenizer and chatsft_checkpoints folders.
Yeah, that's because CUDA on a Mac isn't a thing. It could be swapped to the normal torch package, but you'd have to do some code patching to make sure it's running on MPS, and even then some of the code may need rewriting/patching if there's no MPS version of the CUDA kernels.
Isn't there a common PyTorch API that could choose the OS/hardware-specific backend automatically? Or is this project hard-coding the CUDA variant of PyTorch as a requirement?
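There is: device placement in PyTorch is just a device argument, and each backend exposes a runtime availability check. A minimal sketch of generic selection (nanochat's actual code may differ, and this only addresses runtime placement, not the install-time wheel issue above):

```python
import torch

# Pick the best available backend at runtime instead of hard-coding CUDA.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

x = torch.randn(4, 4, device=device)
y = x @ x  # the matmul runs on whichever backend was selected
print(device, y.shape)
```

On macOS you'd still need to install the plain torch wheel rather than the +cu128 build for this to resolve to "mps".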
>Our main measure of progress. Bits per byte is, per Karpathy, "a much better measure than just the typical cross-entropy loss, because it further normalizes the loss on each token by the number of bytes of that token, making the metric tokenizer-invariant".
Is so blindingly obvious that I'm ashamed to think I didn't think to do it when trialing my own tokenizer approach on TinyStories. I might go back and have a look at how my tokenizer actually compared to how well I imagined it compared.
ELI5 for anyone else (I had to have this explained to me):
When you train a language model, it tries to predict the next token.
We measure how good it is at that using loss: how surprised it was by the real answer.
Different models might use different tokenizers, whose tokens cover different numbers of characters. So if you describe loss per token, you can't easily compare the performance of two models that tokenize differently.
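To make the normalization concrete, here's a toy calculation (all numbers invented) showing that per-token loss depends on the tokenizer while bits per byte does not:

```python
import math

def bits_per_byte(mean_loss_nats, n_tokens, n_bytes):
    # total nats -> total bits, then normalize by the byte length of the text
    return mean_loss_nats * n_tokens / math.log(2) / n_bytes

# Same 100-byte text, same total uncertainty, tokenized two different ways.
bpb_coarse = bits_per_byte(4.8, 25, 100)   # coarse tokenizer: 25 long tokens
bpb_fine   = bits_per_byte(1.2, 100, 100)  # fine tokenizer: 100 short tokens
print(bpb_coarse, bpb_fine)  # identical, despite 4.8 vs 1.2 loss per token
```

The per-token losses (4.8 vs 1.2 nats) look wildly different, but both describe the same model quality on the same text.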
Tokenizers used to be 1 character per token. Then Google implemented Subword encoding[1] on their early neural translation work and found it was much better.
Subword units are genuinely meaningful in most languages. You do need to tune the vocabulary size though.
Absolutely, it requires longer training time and more compute.
Once trained, predictions need to hold through many more steps, because each step processes one token. If a token early in a sentence heavily implies a token will occur later in the sentence, then that awareness needs to be maintained while processing each intermediary token, and each step is a bit lossy. The fewer steps you need to take before leveraging that knowledge, the better the prediction.
If you had infinite compute and data for training, then performance would be equivalent, though, I think.
Since the OpenAI tokenizer is estimated at ~4.2 characters per token, with your proposed 1-character-per-token tokenizer the effective context length immediately becomes 4.2 times smaller, and generated output 4.2 times slower (since 4.2 times more tokens are needed for the same output). Doesn't look like a good tradeoff.
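Rough arithmetic with the numbers in this thread (the ~4.2 chars/token estimate, and the 2048-token sequence_len from the model config above):

```python
chars_per_token_bpe = 4.2   # rough average for a BPE-style tokenizer (assumed)
context_tokens = 2048       # sequence_len from the model config above

# How much text fits in the context window under each scheme:
chars_bpe = context_tokens * chars_per_token_bpe  # ~8600 characters
chars_char_level = context_tokens * 1             # 2048 characters

factor = chars_bpe / chars_char_level
print(f"char-level fits {factor:.1f}x less text and needs {factor:.1f}x more steps")
```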
Cool. Is there a simple "howto" on running this repo with training on W&B for a programmer like me who has never done model training flows? Maybe you could share the steps you took?
There's not much to it... it took longer to spin up the cloud machine than it did to kick off the training run. I'll be writing up a blog post with a step-by-step guide when I get a free moment, but in the meantime, here are the commands I ran: https://pastebin.com/sdKVy0NR
This one is mine. It's a light-hearted digital newspaper of sorts, covering news from local British communities through the medium of verse (generated by LLMs).
Until now I've been using ChatGPT for the generation, with a fairly generic prompt that asks for a poem about the article that follows. ChatGPT's ability to summarise is incredible. It's really not great, though, at rhyme and meter. That means a decent amount of curation and heavy editing is needed to get something passable. Prompt engineering hasn't had a meaningful impact. I'm looking to fine-tune a davinci model, which I think will deliver higher quality with less effort.
The quality can mostly be blamed on me, rather than GPT-3. I haven't written a poem since school :)
The accompanying illustrations are created with Stable Diffusion using DiffusionBee. Images take around 30s to generate on my Macbook Air M1. I'm looking to switch to MochiDiffusion to cut generation time a bit.
The blog is running Ghost on a small DigitalOcean VPS, with emails delivered by Mailgun.
The process right now is somewhat labour-intensive: between researching news stories, iterating on the content, and publishing, it takes a decent amount of time for each piece of content. I'm confident in being able to automate a large part of it, in time.
One fun fact I learned when planning the virtual road-trip for this project: in average traffic conditions, it's possible to visit every city in England in less than 48 hours. The near-optimum solution to this formulation of the Traveling Salesman Problem (starting in the South West), a route taking 47:00:10, was calculated in less than 5 seconds with a Guided Local Search algorithm. [1]
Technology means that I can travel virtually, learn, write creatively, and publish regularly, all whilst having a family and a full-time job. What a time to be alive!
Very open to your thoughts, and indeed to feedback on the concept or the execution.
For short-term career growth, $YOUR_COMPANY's current preferred ETL tool will have the biggest ROI. Focus on design patterns: while APIs will come and go, the concepts, as you rightly say, are transferable.
If you're looking to land a new role: the market says dbt, databricks and snowflake are pretty strong bets.
If it's personal interest, or a high-risk, high-reward long term play, take your pick from any of the new hotness!
I'd second the notion that the question is far too open.
I'd add that dbt, databricks and snowflake are still pretty strong bets, but you have to acknowledge that they're becoming mainstream at an ever-accelerating pace as the companies behind them churn out upskilling courses and meetups and acquire an ever-larger share of the market.
If you like to be a specialist, going deep into either of those still holds career value.
If you're taking a more generalist view of where things are headed, the best prediction I've heard for setting yourself apart is for Data Engineers to optimize for operationalizing data: focusing much more on reverse ETL and becoming knowledgeable in building data web apps. The no-code or low-code movement around data apps will make the barrier to entry for setting something up nonexistent, and I can see how that will drive demand.
Pairing (big) data query/frontend performance and web apps is another beast, though.
For all my initial scepticism, I see the Data Mesh concept picking up pace in the years to come. It's vendor independent, couples well with Team Topologies and effective, decoupled, functional SWE teams. There still will be a big need for standards and conventions set by a small enabling core DE team, as of now, the knowledge gap between the baseline DE and your average SWE or Product Owner is just way too big in my experience.
Last but not least, I'd throw data lake out there.
Apache Iceberg is getting a lot of attention and rightfully so.
TCO of a query engine on top of files is so much better than any DWH, and any org able to optimize compute on data for its current needs will be able to save massively while the "convenience" gap steadily closes. Again, pretty generic, but there's much to learn around Athena, Trino and the like.
I'm personally not a fan of learning a new language except maybe for Rust.
There is an ever-increasing stack of standard "low-code" tools for the typical ETL schtick, and Python won't go anywhere. Again, potential to differentiate will be low, and ever lower, in many contexts outside of proper big data.
This is only me though and this view is highly context dependent, so YMMV of course.
About Snowflake, I am really curious: what do you mean by "learn Snowflake"? The way it was described to me, Snowflake is a cloud-based data warehouse. Are there advanced features in Snowflake which one has to learn? Or do you mean optimized queries?
Snowflake at its most basic is SQL on cloud VMs; anyone comfortable with SQL should feel at home there. That said, there are many Snowflake-specific features that may take a bit to become familiar with. Just off the top of my head:
- hybrid RBAC, DAC, ABAC security model
- column, row level, and tag based access policies
- multi-account organizations
- cross-account and region data replication
- data shares
- external tables and specialized formats (iceberg, delta)
- pipes and streams
- snowpark API
- streamlit integration
The nice thing about Snowflake is that for many use cases it requires little management.
Things you can learn regarding Snowflake, other than the obvious (SQL, and Snowflake-specific language extensions to SQL): proper table partitioning, Snowpipe (and the associated cloud messaging pipelines), and query performance tuning. (Complex queries can become a bear; identifying when it's your query/partitioning or when it's something on the Snowflake back end is challenging.)
There are always new additions to the Snowflake tooling ecosystem since the company is in competition with Databricks and others (e.g., Snowpark with Python).
A stark reminder to end users of any service that if you are not paying, you are the product. There is good reason that the sale of financial services through intermediaries is a highly regulated area: it takes a firm counter-force to stop businesses from exploiting customers when all the incentives are there, and there are enough links in the chain for plausible deniability.
Hi HN. Over the last 4 months I have collected 380,000 attempts to guess a number on my Amazon Alexa Skill, and wanted to share with the community. Would love to see what you can do with the data!