In software, as in most things, most learning is done by doing. Creating a clone of something from scratch is a really useful learning technique and has the benefit of a defined specification.
It doesn't really matter if the job is ever completed or if the program is ever used. The important part is the doing, not the result.
I confess we did similar things at uni 30 years ago, although we mostly re-implemented arcade games, not SQL databases. Databases are likely more useful to understand.
All programmers have to go through 3 stages:
A) I program to prove to myself I can
B) I program to prove to others I can
C) I know I can write this, you know I can write this, but it's cheaper and easier to use this existing, available code to do the job.
I like C) programmers in business because they understand that the goal of the business is making money, not generating code. A guy who -could- write SQLite, but who understands why -using- SQLite would be better, is enormously valuable.
(Not to mention his SQL chops are probably pretty good by then)
> but it's cheaper and easier to use this existing, available code to do the job.
What often happens is that the existing, available, sometimes even popular and well regarded code has lots of issues that you may only find after you deploy the thing. Then, you need to decide what tack to take. Can you engage upstream? Do you end up maintaining third party code yourself to fix the issues? Working around quirky bugs? You may even find yourself so frustrated with all of the above that you wind up abandoning the dependency and writing something new in-house.
But honestly, getting back to the topic, SQLite is not one of those things. SQLite is a very solid library.
Yes, I agree not every bit of available code is perfect. A lot of it is crap. Choosing an existing library is worth taking some time and research over.
Inevitably some libraries will need replacing. In some cases we've ended up rolling our own. But in these cases the time spent with the discarded library is not wasted. Rather it gives us a chance to understand the domain better, to understand our needs better, and ultimately to have better criteria for making our choices.
Things like documentation, support, bug fixing, source code, and so on have all become important parts of the evaluation process.
Interestingly, our internal libraries are often less well documented (if at all), the author is sometimes no longer around, and they usually have all kinds of quirks we work around. This after taking longer, and costing more, written from an unclear spec by people not terribly familiar with the domain.
And yes, we've gone the nuclear route of ripping out and rewriting, and that is expensive, but no more expensive than defaulting to "write it ourselves", and indeed usually cheaper because we have a better idea of what we want.
I think one of the things you learn on your way to C) is when it makes sense to roll your own. It is difficult, and maybe something that actually happens after C).
It seems to be very hard to make that call in time. In the past, I feel like every time we chose to implement something on our own, the existing solution would have been better in hindsight, and every time we said "ok, that existing solution looks fine" it turned out to be a dumpster fire a year down the line. But obviously the times where you made the right call just don't stick out that much, because stuff "just worked", and also, in case you think you made the wrong decision, the alternative could've been even worse. But still, I guess everybody knows that feeling? :-)
An extension of C) is that this not only applies to open source. I was in a startup where the new employees decided to throw away working stable services to write their own version and they wasted a lot of time for little benefit. In the meantime the startup failed to deliver the necessary NEW functionality and we lost a lot of potential customers and sales people who were tired of waiting.
Working with other people's code is one of the most valuable skills. Be willing to live with existing code for a while and understand why it was implemented that way. Only then can you know what you should change.
I think I've been through A, B and C... and finally got back to A :D
My problem is that quite a few libraries we use eventually stop being sufficient (because of bugs, performance issues, a too-limited use case, a too-wide use case, etc.), and then we have to either contribute to them, with the politics involved, which can be more work than the code changes, or just bite the bullet and write things ourselves. For that reason, I find it really good to be capable enough to write things from scratch when that makes sense.
Another good one is Handmade Hero, for creating a game from the absolute beginning. It's extremely lengthy, but very information-dense, and it's only $15 USD.
Wow, you weren’t kidding. I can’t imagine how anyone could join in and get caught up to the present day state of the code.
> ”How long is the series expected to last?”
> ”The series started on November 17th, 2014. Sessions are limited to two hours or less per day so it remains manageable for people who practice coding along with the series at home. There is no prescribed length for the series. It will continue until all the game and engine code is complete.”
Even just watching the first ~20 episodes dramatically altered how I approach coding, for the better. At least, when combined with his blog posts, e.g. https://caseymuratori.com/blog_0015 (you can skip the snarky intro)
I clicked around, read the FAQ and skimmed some videos, but I could find no description, nor images of the actual game. The best I could find was in a 2hr episode numbered in the 600's, about 20 wireframe boxes being moved around.
Is there any description of the game you're going to invest all this time to build?
And, if those wireframe boxes really are the result of 1200 hrs of coding - count me out.
This seems serious, and a very nice introduction/primer on how databases work under the hood, thanks for sharing!
One minor nitpick after glancing quickly at the code: there's pointer arithmetic with 'void *' pointers [1], which is a GCC extension and not portable [2]. It should be rewritten using the proper casts, in my opinion.
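For illustration, a minimal sketch of the kind of rewrite meant here (the function and parameter names are made up, not taken from the tutorial's code):

#include <stddef.h>

/* GNU C extension: arithmetic directly on a void* (GCC treats it like char*) */
void *slot_gnu(void *page, size_t offset) {
    return page + offset;
}

/* Portable ISO C: cast to char* before doing the arithmetic */
void *slot_portable(void *page, size_t offset) {
    return (char *)page + offset;
}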
$ gcc -c test.c
test.c: In function ‘foo’:
test.c:3:12: warning: return makes pointer from integer without a cast [-Wint-conversion]
return x;
^
The translation unit has a constraint violation, for which ISO C requires a diagnostic. The compiler supplied one.
ISO C doesn't say that translation must be terminated.
ISO C doesn't define different classes of diagnostic like "error" or "warning".
I didn't tell GCC to be an ISO C compiler. The above diagnostic tells us that while conversions from integers to pointers happen implicitly, the GNU C dialect, like ISO C, wants that diagnosed. You have to go out of your way to modify the dialect not to have that diagnosed, like:
$ gcc -Wno-int-conversion -c test.c
$
That's not GNU C any more, but a tweaked sub-dialect of GNU C with a diagnostic requirement removed.
Yes, I agree: especially for code intended to teach something else, it's not a good idea to include GCCisms.
I didn't find any section about compilation set-up (compiler, warning levels, flags, etc.) in the OP, but I didn't spend a lot of time looking. Perhaps it would be a good addition, too, to specify which dialect of C is used.
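For example, pinning the dialect and warning level up front would catch this sort of thing early (a hypothetical invocation, not from the OP; db.c is just a placeholder file name, and as far as I recall -pedantic is what makes GCC diagnose extensions like the void* arithmetic mentioned above):

$ gcc -std=c11 -pedantic -Wall -Wextra -c db.c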
I haven't programmed in plain C (as opposed to C++) in decades, but I can't believe it is allowed...
... but gcc on godbolt only raises a warning and still happily compiles it. o_O
> Several operators convert operand values from one type to another automatically. This subclause specifies the result required from such an implicit conversion, as well as those that result from a cast operation (an explicit conversion). The list in 6.3.1.8 summarizes the conversions performed by most ordinary operators; it is supplemented as required by the discussion of each operator in 6.5.
Whether a conversion is allowed in a certain context should be specified by the wording for that given context.
> If the expression has a type different from the return type of the function in which it appears, the value is converted as if by assignment to an object having the return type of the function.
(emphasis mine)
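A tiny illustration of that wording, as I read it (my own example, not taken from the standard):

/* The int value is converted to double "as if by assignment",
   i.e. as if by: double result = n; */
double as_double(int n) {
    return n;
}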
However I can't make sense of the footnote:
> The return statement is not an assignment. The overlap restriction of subclause 6.5.16.1 does not apply to the case of function return. The representation of floating-point values may have wider range or precision than implied by the type; a cast may be used to remove this extra range and precision.
Why doesn't the overlap restriction apply, if return is "as if by assignment"? Makes no sense to me.
I wonder how far we are from auto-generating code which manages to pass these tests and maybe produce same results on random queries compared to SQLite3.
Two years ago, I’d have said 50 years. That estimate is now blown.
Doesn't SQL have decades' worth of optimizations? I remember something from my databases class where the professor mentioned how it's to the point now where nobody really understands every aspect of what goes on between the query and the machine instructions.
That’s probably true but building a clone that people understand, even if slower, is extremely valuable. Understanding an existing technology deeply is a gateway to innovation.
That said, SQLite is so remarkably stable, light as a feather, and incredibly battle-tested that even a hypothetically superior alternative would face an immense uphill battle to reach the same value prop.
From the link, it appears that it was developed by a single person in 2000, who then brought on board 2 other people as contributors in 2002 and after.
Building is a great way to learn for sure but exact cloning gets boring for me. I always end up needing to introduce some kind of twist to the original idea to keep the project interesting beyond a certain point
From Part 3, does malloc(PAGE_SIZE) actually guarantee a single full page is allocated or would the allocation be positioned arbitrarily and stretch across two pages?
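As far as I know, malloc only promises alignment suitable for any fundamental type, not page alignment, so the block can indeed straddle an OS page boundary. If page alignment were actually needed, a sketch along these lines would do it (assumes POSIX; the 4096 here is an assumption, not necessarily the tutorial's PAGE_SIZE):

#include <stdlib.h>

#define PAGE_SIZE 4096  /* assumed OS page size */

/* Returns a PAGE_SIZE-byte block aligned to PAGE_SIZE, so it sits in
   exactly one aligned page-sized slot, or NULL on failure. */
void *alloc_aligned_page(void) {
    void *page = NULL;
    if (posix_memalign(&page, PAGE_SIZE, PAGE_SIZE) != 0) {
        return NULL;
    }
    return page;
}

Whether that actually matters here is another question, since the database's page size and the OS page size don't have to coincide anyway.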
How unfortunate that the SQLite virtual machine has opcodes to directly manipulate B-Trees. The opcode documentation[1] is even written in terms of B-Trees. This means that it would be too complicated to replace the storage engine with something else, necessitating separate products such as DuckDB.
I would love to see the equivalent of Crafting Interpreters for SQLite, including the two-part structure with a simple (but slow) Java implementation in Part 1 followed by a more-complicated (but fast) rewrite in C in Part 2.
Did a version of this myself using Python, was a great learning experience.
I got hooked on the whole “build your own X” format after watching a pluralsight course called “Learn by building Redis with Ruby” by Xavier Shay - highly recommend.
Good, hopefully this will be the second implementation that the W3C standards committee will accept, so SQLite is an approved web standard instead of the disappointing thing we got.
Even as an oldschool C programmer, I would be hesitant to do something like this in C. Why not C++, where you can stay close to the metal but enjoy a few niceties?
This would be a good chance to outline and fix some of the worst design decisions of the various SQLite hacks and security issues.
No stack, so no proper subfunction model, such as triggers or functions calling other functions.
Simple remote queries can be made destructive, due to the extremely hackish design.
The horrible, insecure by default text search engine.
Unfortunately, this guy treats SQLite as a role model of good design. Maybe he will find out when it's too late.
https://github.com/rurban/hardsqlite