In software, as in most things, most learning is done by doing. Creating a clone of something from scratch is a really useful learning technique and has the benefit of a defined specification.
It doesn't really matter if the job is ever completed or if the program is ever used. The important part is the doing, not the result.
I confess we did similar things at uni 30 years ago, although we mostly re-implemented arcade games, not SQL databases. Databases are likely more useful to understand.
All programmers have to go through 3 stages:
A) I program to prove to myself I can
B) I program to prove to others I can
C) I know I can write this, you know I can write this, but it's cheaper and easier to use this existing, available code to do the job.
I like C) programmers in business because they understand that the goal of the business is making money, not generating code. A guy who -could- write SQLite, but who understands why -using- SQLite would be better, is enormously valuable.
(Not to mention his SQL chops are probably pretty good by then)
> but it's cheaper and easier to use this existing, available code to do the job.
What often happens is that the existing, available, sometimes even popular and well regarded code has lots of issues that you may only find after you deploy the thing. Then, you need to decide what tack to take. Can you engage upstream? Do you end up maintaining third party code yourself to fix the issues? Working around quirky bugs? You may even find yourself so frustrated with all of the above that you wind up abandoning the dependency and writing something new in-house.
But honestly, getting back to the topic, SQLite is not one of those things. SQLite is a very solid library.
Yes, I agree not every bit of available code is perfect. A lot of it is crap. Choosing an existing library is worth taking some time and research over.
Inevitably some libraries will need replacing. In some cases we've ended up rolling our own. But in these cases the time spent with the discarded library is not wasted. Rather it gives us a chance to understand the domain better, to understand our needs better, and ultimately to have better criteria for making our choices.
Things like documentation, support, bug fixing, source code, and so on have all become important parts of the evaluation process.
Interestingly, our internal libraries are often less well documented (if at all), the author is sometimes no longer around, and they usually have all kinds of quirks we work around. This after taking longer, and costing more, written from an unclear spec by people not terribly familiar with the domain.
And yes, we've gone the nuclear route of ripping out and rewriting, and that is expensive, but no more expensive than defaulting to "write it ourselves", and indeed usually cheaper because we have a better idea of what we want.
I think one of the things you learn on your way to C) is when it makes sense to roll your own. It is difficult, and maybe something that actually happens after C).
It seems to be very hard to make that call in time. In the past, I feel like every time we chose to implement something on our own, the existing solution would have been better in hindsight, and every time we said "ok, that existing solution looks fine" it turned out to be a dumpster fire a year down the line. But obviously the times where you made the right call just don't stick out that much, because stuff "just worked", and also, in case you think you made the wrong decision, the alternative could've been even worse. But still, I guess everybody knows that feeling? :-)
An extension of C) is that this not only applies to open source. I was in a startup where the new employees decided to throw away working stable services to write their own version and they wasted a lot of time for little benefit. In the meantime the startup failed to deliver the necessary NEW functionality and we lost a lot of potential customers and sales people who were tired of waiting.
Working with other people's code is one of the most valuable skills. Be willing to live with existing code for a while and understand why it was implemented that way. Only then can you know what you should change.
I think I've been through A, B and C... and finally got back to A :D
My problem is that quite a few libraries we use eventually stop being sufficient (because of bugs, performance issues, a too-limited use case, a too-wide use case, etc.), and then we have to either contribute to them, with the politics involved, which can be more work than the code changes, or just bite the bullet and write things ourselves. For that reason, I find it really good to be capable enough to write things from scratch when that makes sense.
Another good one is Handmade Hero, for creating a game from the absolute beginning. It's extremely lengthy, but very information-dense, and it's only $15 USD.
Wow, you weren’t kidding. I can’t imagine how anyone could join in and get caught up to the present day state of the code.
> ”How long is the series expected to last?”
> ”The series started on November 17th, 2014. Sessions are limited to two hours or less per day so it remains manageable for people who practice coding along with the series at home. There is no prescribed length for the series. It will continue until all the game and engine code is complete.”
Even just watching the first ~20 episodes dramatically altered how I approach coding, for the better. At least, when combined with his blog posts, e.g. https://caseymuratori.com/blog_0015 (you can skip the snarky intro)
I clicked around, read the FAQ and skimmed some videos, but I could find no description, nor images of the actual game. The best I could find was in a 2hr episode numbered in the 600's, about 20 wireframe boxes being moved around.
Is there any description of the game you're going to invest all this time to build?
And, if those wireframe boxes really are the result of 1200 hrs of coding - count me out.
This seems serious, and a very nice introduction/primer on how databases work under the hood, thanks for sharing!
One minor nitpick after glancing quickly at the code: there's pointer arithmetic with 'void *' pointers [1], which is a GCC extension and not portable [2]. It should be rewritten using the proper casts, in my opinion.
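For illustration, a minimal sketch of the kind of rewrite meant here (the function and parameter names are made up, not taken from the tutorial's code):

#include <stddef.h>

/* GNU C extension: arithmetic directly on a void* (GCC treats it like char*) */
void *slot_gnu(void *page, size_t offset) {
    return page + offset;
}

/* Portable ISO C: cast to char* before doing the arithmetic */
void *slot_portable(void *page, size_t offset) {
    return (char *)page + offset;
}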
$ gcc -c test.c
test.c: In function ‘foo’:
test.c:3:12: warning: return makes pointer from integer without a cast [-Wint-conversion]
return x;
^
The translation unit has a constraint violation, for which ISO C requires a diagnostic. The compiler supplied one.
ISO C doesn't say that translation must be terminated.
ISO C doesn't define different classes of diagnostic like "error" or "warning".
I didn't tell GCC to be an ISO C compiler. The above diagnostic tells us that while conversions from integers to pointers happen implicitly, the GNU C dialect, like ISO C, wants that diagnosed. You have to go out of your way to modify the dialect not to have that diagnosed, like:
$ gcc -Wno-int-conversion -c test.c
$
That's not GNU C any more, but a tweaked sub-dialect of GNU C with a diagnostic requirement removed.
Yes, I agree: especially for code intended to teach something else, it's not a good idea to include GCCisms.
I didn't find any section about compilation set-up (compiler, warning levels, flags, etc.) in the OP, but I didn't spend a lot of time looking. Perhaps it would be a good addition, too, to specify which dialect of C is used.
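For example, pinning the dialect and warning level up front would catch this sort of thing early (a hypothetical invocation, not from the OP; db.c is just a placeholder file name, and as far as I recall -pedantic is what makes GCC diagnose extensions like the void* arithmetic mentioned above):

$ gcc -std=c11 -pedantic -Wall -Wextra -c db.c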
I haven't programmed in plain C (as opposed to C++) in decades, but I can't believe it is allowed...
... but gcc on godbolt only raises a warning and still happily compiles it. o_O
> Several operators convert operand values from one type to another automatically. This subclause specifies the result required from such an implicit conversion, as well as those that result from a cast operation (an explicit conversion). The list in 6.3.1.8 summarizes the conversions performed by most ordinary operators; it is supplemented as required by the discussion of each operator in 6.5.
Whether a conversion is allowed in a certain context should be specified by the wording for that given context.
> If the expression has a type different from the return type of the function in which it appears, the value is converted as if by assignment to an object having the return type of the function.
(emphasis mine)
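A tiny illustration of that wording, as I read it (my own example, not taken from the standard):

/* The int value is converted to double "as if by assignment",
   i.e. as if by: double result = n; */
double as_double(int n) {
    return n;
}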
However I can't make sense of the footnote:
> The return statement is not an assignment. The overlap restriction of subclause 6.5.16.1 does not apply to the case of function return. The representation of floating-point values may have wider range or precision than implied by the type; a cast may be used to remove this extra range and precision.
Why doesn't the overlap restriction apply, if return is "as if by assignment"? Makes no sense to me.
I wonder how far we are from auto-generating code which manages to pass these tests and maybe produce same results on random queries compared to SQLite3.
Two years ago, I’d have said 50 years. That estimate is now blown.
Doesn't SQL have decades' worth of optimizations? I remember something from my databases class where the professor mentioned how it's to the point now where nobody really understands every aspect of what goes on between the query and the machine instructions.
That’s probably true but building a clone that people understand, even if slower, is extremely valuable. Understanding an existing technology deeply is a gateway to innovation.
That said, SQLite is so remarkably stable, light as a feather, and incredibly battle-tested that even a hypothetically superior alternative would face an immense uphill battle to reach the same value prop.
From the link, it appears that it was developed by a single person in 2000, who then brought on board 2 other people as contributors in 2002 and after.
Building is a great way to learn for sure but exact cloning gets boring for me. I always end up needing to introduce some kind of twist to the original idea to keep the project interesting beyond a certain point
From Part 3, does malloc(PAGE_SIZE) actually guarantee a single full page is allocated or would the allocation be positioned arbitrarily and stretch across two pages?
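As far as I know, malloc only promises alignment suitable for any fundamental type, not page alignment, so the block can indeed straddle an OS page boundary. If page alignment were actually needed, a sketch along these lines would do it (assumes POSIX; the 4096 here is an assumption, not necessarily the tutorial's PAGE_SIZE):

#include <stdlib.h>

#define PAGE_SIZE 4096  /* assumed OS page size */

/* Returns a PAGE_SIZE-byte block aligned to PAGE_SIZE, so it sits in
   exactly one aligned page-sized slot, or NULL on failure. */
void *alloc_aligned_page(void) {
    void *page = NULL;
    if (posix_memalign(&page, PAGE_SIZE, PAGE_SIZE) != 0) {
        return NULL;
    }
    return page;
}

Whether that actually matters here is another question, since the database's page size and the OS page size don't have to coincide anyway.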
How unfortunate that the SQLite virtual machine has opcodes to directly manipulate B-Trees. The opcode documentation[1] is even written in terms of B-Trees. This means that it would be too complicated to replace the storage engine with something else, necessitating separate products such as DuckDB.
I would love to see the equivalent of Crafting Interpreters for SQLite, including the two-part structure with a simple (but slow) Java implementation in Part 1 followed by a more-complicated (but fast) rewrite in C in Part 2.
Did a version of this myself using Python, was a great learning experience.
I got hooked on the whole “build your own X” format after watching a pluralsight course called “Learn by building Redis with Ruby” by Xavier Shay - highly recommend.
Good, hopefully this will be the second implementation that the W3C standards committee will accept, so SQLite is an approved web standard instead of the disappointing thing we got.
Even as an oldschool C programmer, I would be hesitant to do something like this in C. Why not C++, where you can stay close to the metal but enjoy a few niceties?
This would be a good chance to outline and fix some of the worst design decisions of the various SQLite hacks and security issues.
No stack, so no proper subfunction model, such as triggers or functions calling other functions.
Simple remote queries can be made destructive, due to the extremely hackish design.
The horrible, insecure by default text search engine.
Unfortunately, this guy treats SQLite as a role model of good design. Maybe he will find out when it's too late.
https://github.com/rurban/hardsqlite