Author here. I'll be blunt and repeat a prediction I made 3 years ago or so:
C is finished if it doesn't address the buffer overflow problem, and this proposal is a simple, easy, backwards compatible way to do it. It is simply too expensive to deal with buffer overflow bugs anymore.
This one addition will revolutionize C programming like adding function prototypes did.
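For anyone who hasn't read the article, the proposal in a nutshell (only the `[..]` form is new syntax; everything else is today's C, and the function names are just placeholders):

    #include <stddef.h>

    /* Today: the parameter decays to a pointer, so the length must travel
       separately and nothing ties the two together. */
    void fill_today(size_t len, char *a);

    /* Proposed (not valid C today; this is the article's syntax): the
       parameter keeps its array-ness and the compiler passes a
       (pointer, length) pair under the hood, so a[i] can be bounds checked.

       void fill_proposed(char a[..]);
    */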
More usefully, if you want people to use D instead, what is stopping them, what reasons do they give? How can these be mitigated, cos while I like C I sure would like something better.
Is the number of devices that C is running on a good metric? I think the number of developers or number of projects using C makes more sense to track. And then get the market share. If there are 10x more devices today than 10 years ago then I'd sure hope C was running on more devices but anything less than 10x means C is losing ground.
Thank you for the new knowledge that "eso" is pronounced similar to "iso" in some dialects of English, I didn't know that.
However, the word "isoteric" is more correctly spelled (in non-phonetic spelling) as esoteric. The prefix "eso-" means "inside" in Greek, as in "esothermic", or "esophagus". The prefix "iso-" means "equal", as in "isomorphism", "isosceles", "isometric", etc.
Many apologies - I was not being sarcastic and I'm sorry that this is how my comment came across. As chongli says I'm a native speaker of Greek and I really didn't know how "eso" is pronounced by native English speakers. I've lived for 15 years in the UK and I'm still surprised to hear how people pronounce the more obscure words in their language (some of which come from Greek).
Oops, sorry, I apologise for the mistake. I have seen too much bad behaviour on the 'net, so naturally I assumed the worst. It's a valuable lesson at the modest cost of a few karma points. (I guess I violated HN guidelines too, there. Good thing I don't have the power to downvote yet. I might have done so, and never discovered my mistake.)
You didn't cause the downvote. Really. But in any case, there are more important things in the world than HN karma. And keep up your Greek lessons. That's one language I'd love to learn, if only I had the time. But I understand it is fiendishly difficult for non-native speakers. (Source: Greek to Me: Adventures of the Comma Queen by Mary Norris.)
The same is likely true of Fortran, since Fortran code is shipped around with several Python data science libraries and included in R. Does that mean Fortran is a thriving language, or does it just mean Fortran was used a long time ago to write some important libraries that are now hard to get rid of?
Except Fortran 2018 is quite modern: it supports modules, generics, and even OOP, and has first-class support on CUDA alongside C++, whereas C18 has hardly changed since C89 besides some cosmetic stuff, and it is as secure as it was when it was used to rewrite UNIX in the early 70s.
It could be argued, though, that less usage means fewer stakeholders to convince of the need for specific changes to the language, which helps the language evolve faster. (I have only cursory knowledge of what's happening in C and none about what's happening in Pascal nowadays; I'm just pointing out that being a smaller community might ironically help the language).
Lol C programmer since 1994 here... just started a Rust project with the specific goal of learning Rust. The project itself is just to scratch my own itch. I’d honestly probably write it in Python if I didn’t want to see what Rust was all about :)
C doesn't "run on hardware" .. unless you're talking interpreted C. Of course compiled machine code is running on more hardware but that's just a truism.
The question is: are people using C to program this hardware more? Or are people gravitating towards safer compiled languages (Rust?). That's a valid question, even if the answer is "no, C's usage is only increasing."
C has been "losing ground" not because of random per peeves of those who never wrote a line of code in C but because since C's last standard update there have been other programming languages that offer developers something of value so that the trade-off between using C or any alternative starts to make technical sense.
It also helps that C's standardization proceeds in ways that feel somewhat between sabotage and utter neglect.
Meanwhile, C is still the absolute best binary interop language devised by mankind.
> C has been "losing ground" not because of random pet peeves of those who never wrote a line of code in C
This is not a random pet peeve, and WalterBright is as far as you can get from someone "who never wrote a line of code in C". This is the cause of numerous security bugs in the past and currently, and the reason most C material written in the 70s/80s is unsafe to be used today (mostly due to usage of strlen/etc vs strnlen/etc).
A question: since your company also makes a C/C++ compiler (and the repo has very :), have you considered adding this addition to it, as an experimental feature, perhaps to demonstrate its usefulness to other developers and standard bodies? (Although, now that I think of it, D itself might serve the same purpose)
I don't see much point in it. I've proposed this change to C in front of several audiences, and it never received any traction. If you want to experiment with it, you can use DasBetterC, i.e. running the D compiler with the `-betterC` switch, which enables programs to be built requiring only the C Standard Library.
Fair warning - once you get accustomed to DasBetterC, you're not likely to want to go back to C :-)
> If you want to experiment with it, you can use DasBetterC, i.e. running the D compiler with the `-betterC`
I've been meaning to experiment with DasBetterC for a while, and I have a C project I've been wanting to migrate to something with proper strings (it's a converter for some binary file formats, but now I want it to import some obscure text formats too). Maybe that's the push I needed :)
After 20 minutes and about 250 out of 2098 lines converted, the error messages are very good and give very nice hints about what to change, I must say I prefer them to Rust's verbose messages.
DasBetterC's trial-by-fire was when I used it to convert DMD's backend from C to D.
I'm sure you already know this, but the trick to translating is to resist the urge to refactor and fix bugs while you're at it. Convert files one at a time, and after each one, run the test suite.
Only after it's all converted and passing the test suite can refactoring and bug fixing be considered.
I don't get why it hasn't gotten traction. When I read it, it was immediately obvious to me that this would be extremely helpful. I want it yesterday, and so should everyone.
Pro tip: Google the name of the person before responding to them, it can help avoid the taste of foot in your mouth which you are currently experiencing.
I’m new here, so this seems like a valid criticism to me — but judging by the number of downvotes, it may not be. Can someone explain why this comment is incorrect?
Perhaps because so many of us know Walter from his work and his history here on HN? Sometimes you have to just trust that someone is who we all say they are.
What argument from authority is being made by anyone?
The GP decided, out of the blue, to accuse the author of never having written a line of C code in his life. That's kind of inappropriate in any context, IMO, but just downright laughable when the author is well-known for singlehandedly writing several compilers and a whole new language.
He never said it explicitly; he was just making a general statement. Not that it matters whether he did or didn't: there are a lot of things wrong with C, and it will most likely eventually disappear, but not for the reasons outlined in this article. That's what he was saying.
Well, he dismissed Bright’s argument as a random pet peeve from people who haven’t written a line of code in C before, so yes, I do think he said it explicitly.
> Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith.
This is one of HN's comment guidelines. If you're not sure that someone is who you think they are, you can just ask, e.g.: "Hey, are you Walter Bright who did X and Y?"
What's the implication here? I only know one COBOL developer but they seem to be doing quite well for themselves, making over $400k a year for something like 15 hours of work a week.
> Meanwhile, C is still the absolute best binary interop language devised by mankind.
You're mistaking the "C" ABI for the C language. The so-called C ABI should actually be called the UNIX-derived ABI, as (i) C doesn't define an ABI and (ii) C can perfectly well produce binaries using another ABI (such as e.g. the "Pascal" one, common on the DOS platform).
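A small illustration (the attribute spellings are GCC/Clang on 32-bit x86; DOS-era compilers spelled the same idea with `pascal`/`cdecl` keywords, and the function names here are made up):

    /* The C language doesn't pin down a calling convention; the toolchain does.
       On 32-bit x86, GCC and Clang let you pick one per function: */
    int __attribute__((cdecl))   add_cdecl(int a, int b)   { return a + b; }
    int __attribute__((stdcall)) add_stdcall(int a, int b) { return a + b; }

    /* Old DOS compilers wrote it as:  int pascal add_pascal(int a, int b); */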
Maybe people are voting this down because they think it's directed at Walter Bright in particular, but I think there is actually some truth in the harsh comment.
Nothing about Walter Bright in this statement, but some of the harshest criticisms from others I have seen of C are not from expert practitioners in C.
People who are experts and also critics seem to have a more practical, realistic, nuanced critique, that understands history and challenges to adoption, admits that the long history and difficulty of replacing C isn't exactly for no reason.
That's the way I interpreted it, because it's true. A lot of the criticisms are misdirected ones, made by people that haven't used C except when forced to for a few assignments in school, C++ jockeys that think C is the 30-years-out-of-date version of C that C++ supports, and people that haven't used it at all for anything real.
I also agree that what the standard committee has been doing for the last 20 years amounts to willful sabotage.
What about between c99 and c18? Is there anything you can think of? I think the _s() functions, advertised as security features, are a weak effort. Anything else come to mind?
Nothing really; if anything, VLAs have proven such a mistake that Google led an effort to remove all instances of VLA use from the Linux kernel.
Also, the number of UB descriptions has only increased and is now well over 200.
Annex K was badly managed, a weak effort as you say, given that pointer and size were still handled separately, and in the end instead of coming up with a better design with struct based handles, like sds, everything was dropped.
ISO C drafts are freely available; I recommend that everyone who thinks they know C from some book, or has only read K&R, actually read them.
> some of the harshest criticisms from others I have seen of C are not from expert practitioners in C.
But were they expert practitioners of C in the past? My experience is that most of the harshest criticisms of C come from former C experts who moved on to other languages because it became clear to them that C would never be fixed - Walter Bright included.
Yes I know, and for clarity I appreciate your work and insight, and frequently enjoy your comments here.
My point was that people were mistaking the comment for an attack on you, which I don't think was necessarily intended or needs to be without it being a valid point about a different set of critics.
One of the niceties of C is that I can get anything done pretty damn quickly, without anything getting in my way. The syntax is extremely simple, too, vs. Rust. Rust is close to Perl when it comes to syntax; full of symbols. Too implicit for me. I want to look at the code, and I want to understand what the heck is going on, even if it is written by someone else. I usually do, with C. Rust? Not so much, and believe me, I tried. I would not like to call myself an idiot, either. :)
A viable C replacement is Ada, although it is not for people who dislike the "code is documentation" bit.
> One of the niceties of C is that I can get anything done pretty damn quickly, without anything getting in my way.
not saying you're necessarily wrong (c is definitely simpler than rust), but I think most people would write a similar comment for whatever language they feel most comfortable with. I write c++ most days. even though it's the most verbose and complicated language I've ever used, I can still probably get stuff done way faster than in another language I happen to pick up.
if I'm just throwing something together really fast, I do mostly use the old school C functions though. scanf is way nicer than streams.
Agreed - in that I write it in Perl if I need it to just work right now. Perl is still, to me, the most useful programming language for completing a generic program in the smallest amount of time. (Part of that is due to CPAN, and part to the gazillion built-in features of Perl 5)
I agree. I’m an older programmer (almost 61) and I’ve been using Perl for a long time - maybe 20 or 25 years. I reach for it when a job feels like a bit too much for a bash (1) script.
I don’t go out of my way to teach or suggest Perl to my younger colleagues. I don’t know why that is. They don’t usually reach for Python, which I think would probably be their best choice.
Maybe I’m just a curmudgeon ... and by the way, can you please stay off my lawn?
One way Rust is just a different tool for a different job is that it really doesn't optimize for knowing everything that's happening inside someone else's code. It's a great example of the difference between procedural and functional programming, where in Rust you mostly just care what your function args are and what it returns.
Nothing wrong with learning multiple languages, of course. C was my first professional language, and I spend most of my days in rust now. No shortage of rust devs who are big on C too.
You might like zig. It's still pre 1.0, but I feel like it really has that "get out of your way" feel of C with a ton of safety. If you write tests, you can get tested memory safety, too.
Zig has a LOT of good stuff going for it but one of my pet peeves is how arsey the linter gets about formatting. No tabs (They might begrudgingly fix that one), no multiline comment/string support (and before anyone tries to correct me on the strings front, you look at that syntax and tell me it isn't an intentional joke), you must use these specific line endings (That one was actually fixed in master recently iirc)
The syntax is also currently REALLY unstable. As in: The hello world has changed almost every major version. Hopefully that too will be squashed with 1.0
To be fair though, Zig is probably the least egregious and most flexible of the modern "C killers". I can see it's really trying to innovate low-level programming. I really like its flexible malloc system and support for dynamic linking at runtime. Its compile-time code execution is excellent too.
The fact that they're actually trying to support obscure platforms like the z80 is a good indicator that they're staying true to C's "code anywhere" mantra. That's why I'm mostly focusing on linting issues of all things.
Ah. If anything, the whole "rust as better than C for everything" thing is starting to hurt its reputation, regardless of veracity. People get so focused on its use in perf-critical applications they ignore its other strengths. eg I've never replaced C code with it, but we redid our whole PHP backend as a rust app, because it's great for rigidly defined business rules too.
Young programmers seem to prefer learning Rust than C. Generational replacement will take care of making Rust prevalent, no matter what existing programmers think.
Do they? Or are they just told that they should be using Rust? Even putting aside outdated curriculum, Rust is a fairly involved language to teach to a new programmer.
Indeed. What languages are most common in universities anyways? Haskell? Java? C++? OCaml? Which ones are the most common?
Maybe he meant outside of the education system. I think the reason for that would be hype and peer pressure, and the feeling of novelty, with a hint of FOMO. I do not see any languages being pushed/hyped as hard as Rust.
I investigated this in Florida. I checked Florida Tech, Embry-Riddle, and a half dozen state universities. One was Java, two or three were C++ (really C with cout and maybe vector), and all the rest were plain C.
I can't imagine universities actually caring to teach new programmers Rust... it's an overly complex language that most professors themselves would steer far away from because they know there's more to programming than following trends.
(We learned C++ in university in New York which was basically C with occasional help from C++'s standard library).
My much younger brother is currently enrolled at a major university in comp sci. His coursework is primarily in Java but with certain classes in C and other languages.
My guess is that those four see more use than Rust does, with Java and C++ forming the base of a typical undergraduate curriculum alongside Python and C, and with Haskell and OCaml showing up in classes where the concepts behind them are typically introduced. (FWIW: my college experience was C++, C, and Scala for the required courses.)
I believe we will come back to C eventually. Kind of irrelevant but: I learnt C by writing mods for ioquake3 forks. :) Fun times. I was about 13 years old.
For every young programmer learning Rust there are probably 10000 learning C.
C is still the only language you can count on anyone with a programming-related education to have knowledge of. (That doesn't translate into being able to program is C, but still.)
I love the move to d/rust/zig/nim/... but there are other issues too. Ecosystem of libraries, stabilisation of common patterns (futures and Tokio issues are still out there), platform compatibilities, industry support for moving away from known solutions, and many other issues. Even if we all suddenly knew Rust perfectly tomorrow, there are other issues in the way.
futures is definitely the big one for me. Getting all concurrency fully on async/awaits is amazing, e.g. actix-web awaiting an endpoint calling juniper for graphql, with async resolver methods making async calls against my DB, without needing to spawn a single thread, is lovely. Still doesn't work as smoothly as that, even though on paper it ought to. Getting close though.
One of the key areas C is used that Rust cannot be used easily is in limited embedded devices. That looks like it'll be the case for at least 10 years and probably much longer than that.
The embedded Rust ecosystem is actually pretty vibrant these days. The number and quality of #![no_std] crates is improving rapidly.
The main limitation is going to be whether your MCU is supported by LLVM. If you're targeting ARM, RISC-V, MSP430 or Xtensa you might get further than you'd expect.
There's a hell of a lot of tooling built around C though. So no matter what happens, I don't think we'll ever be rid of it.
Many, if not most, small MCUs can also run C++ these days just as well as C. If you want to write simpler and more robust applications, it's easily achievable.
> C code is being replaced by Rust fast. The only limit is how quickly programmers can become good at Rust. It's already happening.
I think Rust has been very quickly fading into obscurity. What Rust has brought to the table is nearly the same as what 100+ other programming languages brought in their attempts to "fix C."
Footnote: With Ada/SPARK being much more battle-tested and ATS being a much more flexible & complete solution. Though I wouldn't exactly recommend ATS in terms of learning curve.
It isn't. You can tell how much a language is used by the inverse of the number of blogposts about it. People who have jobs don't have the time to write about how they would solve problems using that language, because they already are and have better things to do in their free time.
It is. Since 2012 C has dropped from about 4% of Github code to 3%. Meanwhile Rust has climbed from 0 to about 1%. That on its own doesn't prove that people are moving from C to Rust but it's not a risky guess. I did see some data on language transitions a while ago but can't find it now unfortunately (why is browser history search still so shit?).
Because Google wants people to search using the Google search engine, and see ads while they're at it. That would happen less often if Chrome were properly capable of searching the history.
Opera did full text history search a decade ago, but that browser doesn't exist anymore.
Also, history is expected not to be tied to the device in the cloud era. What more history data could you scalably sync other than URLs and page titles?
The number of people employed as python programmers is vastly smaller than the number of people who can write hello world in python, which is what the majority of blog-spam is about.
There's a lot of truth in it, but for a slightly different reason. Almost everyone I work with knows or writes C regularly. They're usually very senior people, and not once have I ever heard them talk about a blog, let alone write one. So there is this large group of C practitioners out there that simply don't know or care about all these other things happening around them. To some degree, it doesn't matter, since there are plenty of jobs doing this.
It is simply impossible for C to be finished or to vanish. It is one of the most tested and rock-solid pillars of the programming world. Developers have already mastered how to handle the issues you mentioned in the article. They are not big enough to discard C over.
I mean it in the sense of starting new development of a major new project with it. Of course, C will be around a very long time, like COBOL and FORTRAN.
I would wager the percentage of embedded projects that are picking C has decreased in the past 10 years. I have no direct evidence of that, but I think it's a likely guess.
I totally agree that C shepherds you into pointers. I also think C shepherds you into writing everything from scratch. Most of all I think it's a self-perpetuating cycle of "there's no system for X (e.g. sized arrays, packaging system, classes) so everyone makes their own, and now all other code feels slightly incompatible with all other code."
C could have gained the safety of function prototypes without them. Note that the function bodies and call sites had the type information. It could have been propagated through the compiler and assembler to the linker, which would then check for compatibility.
In some ways it would have worked much better. Header files can easily be wrong. The object files being linked are what really matter.
So, suppose we decided to implement this today, on a GNU toolchain. At the call site, we'd determine the parameter types based on the conventional promotions. This info gets put into the assembly output with an assembler directive. The assembler sees that, then encodes it in an ELF section. It might get a new section called ".calltype" or it is an extension to the symbol table or it involves DWARF. Similar information is produced for the function body. The linker comes along, compares the two, and accepts or rejects as appropriate.
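The class of bug this would catch looks like this: two translation units that link cleanly today (the file names and functions are just for the example):

    /* a.c */
    double scale(double x) { return 2.0 * x; }

    /* b.c -- a stale, hand-written declaration; the linker only matches the
       name "scale", so this links fine and the call is undefined behavior. */
    double scale(int x);
    double twice(void) { return scale(21); }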
Yes. I proposed a method without mangling, but I suppose there isn't any reason why C couldn't use mangled names.
It also isn't a requirement that C++ use mangled names. Other ways of carrying the type information are possible. I like the idea of a reference to DWARF debug info, which C++ is already using to support stack unwinding for exceptions.
> It also isn't a requirement that C++ use mangled names.
Overloading requires the type information to be part of the symbol "name" (ie, whatever is used for symbol lookup and linking) wether that is a mangled string or more complex data structure.
> In my experience with language design, a little bit of syntactic sugar can have transformative results.
Promises/async functions in JS and C# do absolutely nothing that you couldn't do without them. But they've had a structural effect on the average developer's ability to write scalable code.
It is fundamentally like the 80286 segments, but with all sorts of usability troubles solved. The 80286 segments were impractical because there were a small number available and because the OS couldn't safely hand over direct control. Every little segment adjustment required calling the OS.
You bring this up often and I ask you to modify your wording about every time I see it. Android will not require these extensions, no hardware ships with it yet. Android says they will support it.
And I give you the official wording of Google every time.
> Google is committed to supporting MTE throughout the Android software stack. We are working with select Arm System On Chip (SoC) partners to test MTE support and look forward to wider deployment of MTE in the Android software and hardware ecosystem. Based on the current data points, MTE provides tremendous benefits at acceptable performance costs. We are considering MTE as a possible foundational requirement for certain tiers of Android devices.
> Starting in Android 11, for 64-bit processes, all heap allocations have an implementation defined tag set in the top byte of the pointer on devices with kernel support for ARM Top-byte Ignore (TBI). Any application that modifies this tag is terminated when the tag is checked during deallocation. This is necessary for future hardware with ARM Memory Tagging Extension (MTE) support.
>....
> This will disable the Pointer Tagging feature for your application. Please note that this does not address the underlying code health problem. This escape hatch will disappear in future versions of Android, because issues of this nature will be incompatible with MTE
I am not sure if I have made myself clear, because I have no issues with the Google documents on this and I believe they are very clear: this feature is optional! Optional optional optional, only on hardware that supports it will Google implement these things because they literally cannot use it otherwise. Your wording has always implied that this is a requirement to run Android 11 and it is not, and that is what I am asking you to change. Like, what’s wrong with being accurate and saying “Google is implementing support for this in Android 11”? “This feature may be used to classify Android devices”?
Well, it could also solve the problem of sizeof(array) not working inside the function.
More specifically, at the moment, it evaluates to the size of the pointer itself, which is useless. On the other hand:

    static void
    foo(int a[..])
    {
        for (size_t i = 0; i < (sizeof a / sizeof(int)); i++)
        {
            // ...
        }
    }
... would be very useful, as it's the same syntax you can already use inside the function where the array is declared, which makes refactoring code into separate functions easier, as you don't have to replace instances of sizeof with your new size_t parameter name.
The only thing I'd like to see is compatibility with the static keyword; so that you can declare it as a sized-array but still indicate a compile-time minimum number of array elements. At the moment, in C99, this does not compile without serious diagnostics which would immediately highlight the problem:
    #include <stdio.h>

    static void
    foo(int a[static 4])
    {
        for (size_t i = 0; i < 4; i++)
            printf("%d\n", a[i]);
    }

    int
    main(void)
    {
        int a[] = { 1, 2, 3 };
        foo(a); // Passing an array with 3 elements to a function that requires at least 4 elements
        foo(NULL); // Passing no array to a function that requires an array with at least 4 elements
        return 0;
    }
demo.c:14:3: warning: array argument is too small; contains 3 elements, callee requires at least 4 [-Warray-bounds]
demo.c:15:3: warning: null passed to a callee that requires a non-null argument [-Wnonnull]
It is not just for the size argument. The array becomes an abstract data type whose bounds are consulted when the array is indexed. That's the key.
Yes, it's not a lot of effort to manually add a size_t argument. But it is far too tedious and error-prone to expect a programmer to add all the bounds checks. Being able to effortlessly tell the compiler "please check for me" is the huge win.
The second huge win is that the array is type-checked. So if you pass it to another function, the compiler enforces that it must again be passed with the size included. You don't get that by manually adding a size argument.
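Roughly, the sugar would lower to something like this (the struct and names are invented for illustration, not the actual proposal's ABI):

    #include <assert.h>
    #include <stddef.h>

    /* What `int a[..]` would carry around: */
    struct int_slice { int *ptr; size_t length; };

    /* What `a[i]` would compile to with checks enabled: */
    static int slice_index(struct int_slice a, size_t i)
    {
        assert(i < a.length);   /* the bounds check the compiler inserts */
        return a.ptr[i];
    }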
Bounds checking is a solved problem in C. The challenge is proving that your program _cannot_ go out of bounds. That is very much not a solved problem in C.
I'm curious what you'd want the automatica bounds checking to do?
Just terminating the program won't be much better than OOB memory access in many cases.
Continuing but discarding OOB writes/use a dummy for OOB reads could lead to much worse behavior.
An exception (or setting errno since this is C) would need that exception to be handled somewhere in a sensible way in which case you could just as easily add manual bounds checking.
I may be wrong, but something that you and I recognize as syntactic sugar may not be recognized as such by other, less experienced programmers.
So those programmers might just use the sugared approach and avoid the problem of writing past the end of an array, without ever knowing how tedious and/or difficult debugging such problems can be. They might sort of never even realize that they dodged a bullet simply due to some sugar.
And this assumes that I have the space to waste a native type on every array. So if I'm using a 10-length array, I need to provision a native 32- or 64-bit value for "10".
In embedded system this wouldn’t happen. At least not mine, I’m running up on limits all over the place even being careful with bitfields and appropriately sized types.
He’s right of course that foo(array[]) is converted to a pointer but that’s why I think you should always use array as a pointer so YOU know not to rely on its automatic protections.
I get the point; but I just don’t see C making this change.
So... don't use array syntax in the function prototype and definition? The proposal doesn't PROHIBIT passing a pointer, it would just offer an option to pass a fat array.
I think you mean fat pointer. And yeah, that's nice for people that don't care that their 4-bit array has 64 bits of native type reserved... I think other people would care.
So, I go back to the idea that it seems unlikely this would ever be an official C change.
Pascal (Delphi and FreePascal incarnations) does that just fine with strings and dynamic arrays, and in a way that is compatible with C. Just friggin' steal it and be done.
I hold out hope that people will discover, as I did, that the Real real problem with the C language is the C library and the diglossia. Maybe that's a different language[1], but it's one you get with just a bit of #define and discipline.
For example: If you had just put the length before the array buffer, you could've saved a stall in almost every use. That's a problem with out-of-order processing that's hard to fix. Maybe your compiler will get sufficiently smart, or maybe CPUs will collude with the memory controller (or something else amazing will happen), but those things are really hard. However we fixed it ourselves; we didn't need anyone to do it for us, because (due to laziness or luck) C gave us enough of the tools we needed to do what we needed to do.
I think that's a bigger deal than buffer overflows, as unpopular an opinion as that is.
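For the curious, the layout I mean is just a length header sitting in front of the data. A bare-bones sketch, with made-up names:

    #include <stdlib.h>

    /* Length-prefixed buffer: the length sits right next to the data,
       so reading it rarely costs a separate cache miss. */
    struct buf {
        size_t len;
        char   data[];          /* C99 flexible array member */
    };

    static struct buf *buf_new(size_t len)
    {
        struct buf *b = malloc(sizeof *b + len);
        if (b)
            b->len = len;
        return b;
    }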
> C is finished if it doesn't address the buffer overflow problem
Assuming that it is so: when? It seems that C—despite all its shortcomings—remains a very popular language in some problem domains. For some platforms it seems like it's really the only performant HLL option.
The real troubles are undefined behavior and aliasing. Buffer overflows are just a well known gimmick of the language that is more or less controllable with some discipline. Aliasing is hell. You cannot even use a global variable safely!
Isn't much of the undefined behavior in C that people love to complain about intentionally left in the standard for the purpose of optimization? Similarly, bounds checks are necessary in insecure contexts (ie most places) but you probably don't want them slowing down (for example) an MD simulation.
Edit: But to be clear, C really ought to have first class arrays. If you truly don't want bounds checks in a specific scenario for some arcane reason, you could still explicitly pass a raw pointer and index on that. (The same as you would in any sane systems language.)
UB has nothing to do with optimization. It's about working around the differences between all the platforms that a C program might need to be compiled for. UB covers things like the layout of a signed integer (might be two's complement, or it might not). It's about letting the platform or compiler dictate what the program does in the rare case where the program does something that might result in different behavior on different compilers and platforms.
Note that I'm using "platform" to refer to the CPU instruction set.
Signed integer overflow could have been marked as implementation-defined rather than undefined behavior. That would have meant that compiling a program with overflows on most systems would produce the same results, but compiling it for the occasional rare sign-and-magnitude machine would produce slightly different results. However, they didn't do this. Instead, they said that it's undefined behavior, which means that any program that overflows integers has no guarantees about its behavior at all - it could crash right away, generate the correct result 99 out of 100 times, or the compiler could outright reject the program.
A good example of this is calling functions with the wrong parameter types. UB in C, but practically allowed by every compiler. No machine would care if you do this... until WASM came along and suddenly every function call is checked at module instantiation time for exactly this behavior. This is because all WASM embedders are fundamentally optimizing compilers. And what is the mother of all optimizations? Inlining: the process of copypasting code from a function into wherever it is called. If a function is being called with the wrong arguments, how do you practically do that? You can't.
It is meaningless to talk about UB without also talking about optimizations. If you do not optimize code, then you do not have UB. You have behavior that is defined by something - if not the language spec, then the implementation of that spec, or a particular version of a compiler. There are plenty of systems with undocumented behavior that is nonetheless still defined, deterministic, and accessible. Saying that something is UB goes one step beyond that: it is saying that regardless of your mental model of the underlying machine, the language does not work that way, and the optimizer is free to delete or misinterpret any code that relies on UB.
> UB has nothing to do with optimization. It's about working around the differences between all the platforms that a C program might need to be compiled for.
That's what it used to mean. But at some point compiler people decided that since UB means literally "anything can happen", they can make optimizers optimize the shit out of the code assuming that UB can't be there.
C code that used to work 20 years ago, because the UB in it resulted in some weird but non-catastrophic behavior, doesn't work at all when compiled with modern compilers.
Other commenters already responded to this, but I thought I'd link an article I came across a while back that gives a concrete and easy to understand example of how UB can be leveraged for optimization by modern compilers. (https://devblogs.microsoft.com/oldnewthing/20140627-00/?p=63...)
Sorry? I’m unsure what you mean here, because there are plenty of ways to use globals in ways I would call “safe”: no undefined behavior, correct output, …
I was not talking about this, but about aliasing a variable on the same translation unit.
int x = 7;
void f() { /* do things using x */ }
void insidious_function(int *p) { *p = 3; }
now, inside f you cannot be sure that x equals 7, even if you never write into it. You may call some functions, that in turn call the insidious function that receives the address of x as a parameter. There's no way to be sure that the value of x is not changed, just by looking at your code.
I'm not sure whom this proposal is aimed at exactly.
Any production-quality C code will already use a (pointer + count) combo when passing arrays to a function, which is something that will still be needed under your proposal because the vast majority of arrays is dynamically sized. So unless all arrays in C are given the fat pointer treatment, I don't really see how what you suggest would make much of a difference. That is, if fat pointers are made the first class language construct, then, yes, that can be useful... though I disagree if it's not done, it will cause a demise of C.
pointer + size does not really fix anything, as you are relying on the programmer to correctly keep track of the size. I'm not even sure what alternative this improves upon. even more error-prone null value marking the end? praying the array will be big enough (looking at you, gets!)?
unless you have a team of incredibly diligent coders, people are going to read past the end of bare arrays over and over again. one specific mistake I keep seeing is where people misinterpret the meaning of a variable named `size`. is it the number of elements or the size in bytes? who knows, but it's probably UB either way if you're wrong.
> misinterpret the meaning of a variable named `size`
Quite right. I use, and highly recommend, the convention that `size` is for number of bytes, `length` is for number of elements, and `capacity` for the allocated number of elements.
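Spelled out as a struct, just to illustrate the naming (a throwaway sketch, nothing more):

    #include <stddef.h>

    struct vec {
        int   *ptr;
        size_t length;     /* number of elements in use    */
        size_t capacity;   /* number of elements allocated */
    };
    /* size in bytes would be capacity * sizeof *ptr */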
you could do that, if you were using a library that provided/understood that struct. not sure how common this is; I work with c++ much more than c.
the problem with this approach is that you are still relying on the application programmer to provide the correct size at the beginning and not to mess it up by directly accessing the struct member later. private/public does not really exist in c, so it is a lot harder to enforce invariants within an object. the library could make the struct layout a private implementation detail (ie, not fully define the struct in the header provided to the client and take a pointer to the struct as arguments in the API) to at least discourage this. you could combine this approach with a my_array_struct_init function that returns a pointer to an empty array object. this is a common approach taken in c libraries (eg, libcurl) where the author really doesn't want you messing with their structs.
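something along these lines, with made-up names (the header exposes only the handle, the .c file owns the layout):

    /* my_array.h */
    #include <stddef.h>
    typedef struct my_array my_array;          /* opaque handle */

    my_array *my_array_new(size_t length);
    size_t    my_array_length(const my_array *a);
    int       my_array_get(const my_array *a, size_t i);   /* can bounds-check */
    void      my_array_free(my_array *a);

    /* my_array.c -- the only place the layout exists */
    struct my_array {
        size_t length;
        int    data[];
    };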
- Statically check the code, since the static analysis tool knows for certain which value is the size and can check that you're using it correctly.
- Initialize the size correctly, since you don't have to enter it twice, or more crucially, remember to change it twice (or create a #define in another part of the file, name it, and document it)
You also make an excellent point yourself about the meaning of 'size'. If this was standardized, it would be the same everywhere, minimizing the risk of ambiguity.
"relying on the programmer to correctly keep track of the size"
I don't interpret Walter's suggestion that way. Of course, I might be wrong. Since the compiler must know the size of the array at the time it's declared, my thinking is that the compiler is smart enough to pass the size without the programmer having to even think about it.
> C is finished if it doesn't address the buffer overflow problem, and this proposal is a simple, easy, backwards compatible way to do it.
Is this really "simple, easy, backwards compatible"?
I think Rust kind of counterexamples this.
While Rust can throw around slices [] (effectively runtime length), throwing around [u8; 8] and [u8; 9] (compile time length) to the same function gets nasty.
Perhaps all the constexpr work in Rust will make this a lot easier.
Sorry, I misspoke. But this wasn't meant to be about Rust.
My point was that: if you dump slices/fat pointers into C, how does it help?
Slices/fat pointers and the corresponding checks are a runtime thing, and that's absolutely anathema to a lot of C programmers. If it's not anathema to the C programmer, they probably aren't in C anyway.
So, now you need a way so that compiled slices/fat pointers mean something, and I'm not convinced that doesn't have a lot of ramifications that are being glossed over.
The checks can be at compile time, given how good the data flow analysis is in the compiler. The rest are at runtime, presumably coming with a compiler switch to turn them off.
Most people using D leave the checks on.
Even without the checks, however, I can vouch that implicitly carrying around the length of the array with the array pointer is a vast improvement in the clarity of the written code.
I don't see a future where C survives, not only because of memory corruption bugs (although that's a pretty big one), but also for usability: the lack of package manager, common build system, good documentation, good standard library, etc. are just too much to compete with any modern system language.
I've been seeing those exact words for decades now, and C is still going strong. Every few years a new language comes along, someone writes something in it that was written in C before, someone might even write a basic OS in it, and after a few years that language is almost forgotten, a new one is here, and again someone is writing something in it. But in the end, we still use C for the things we used it for 10, 20, or for some of us even 30 years ago.
Usage of C in new projects has fallen dramatically in recent decades. It used to be the case that C was considered a general-purpose programming language and applications such as Evolution were written in it. Today big applications in C are increasingly rare, and Rust is only accelerating this trend - nobody wants to have buffer overflows anymore.
> lack of package manager, common build system, good documentation.
This is where C is superior to virtually every other language. It has K&R to start with [1], a wealth of examples to progress from there, man pages, autotools, cmake, static and shared libraries.
> good standard library.
It should have hash tables at least, but it isn't bad.
[1] Which is still the best language book ever written (yes, it has some anti patterns, you unlearn them quickly).
Huh? In what way are C's books, documentation or build system superior to those found in other languages? Most languages have plenty of good books written about them. And plenty of code examples online. I can't speak for other languages but I find MDN (Javascript) and the rust docs consistently better than C's man pages. Ruby's documentation is great too.
As for build systems, autotools is a hilarious clown car of a disaster. You write a script (for autoconf) to generate a huge, slow script (configure) to generate a makefile to finally invoke gcc? It is shockingly convoluted. It seems more like a code generation art project than something people should use. CMake papers over it about as well as it can, but I think cmake is (maybe by necessity) more complex than some other entire programming languages. In comparison, in rust "cargo build" will build my project correctly on any platform, any time, with usually no effort on my part beyond writing the names and versions of my dependencies in a file.
And as for package management, C is stuck in the 80s. It limps by, but it doesn’t have a package manager as we know them today. There’s no cargo, gems, npm, etc equivalent. Apt is no solution if you want your software to work on multiple distros (which all have their own ideas about versioning). Let alone writing software that builds on windows, Mac and Linux.
So no, C is not superior to other modern languages in its docs, build system or package manager. It is vastly inferior. I still love it. But we’ve gotten much, much better at making tooling in the last few decades. And sadly that innovation hasn’t been ported back to C.
Sure, it has not been standardized, it is not part of the standard library, so what? Did the world stop? I mean, practically speaking, who cares? Implement it, or find libraries that did. There are plenty. I posted this one because it exists for an OS; OpenBSD, since 1999. Plus, AFAIK ohash is portable enough. It consists of 2 files, and you can compile it with -std=c89. Only the bounded attribute is ignored.
If you want I could have brought up hcreate, hdestroy, and hsearch:
> The functions hcreate(), hsearch(), and hdestroy() are from SVr4, and are described in POSIX.1-2001 and POSIX.1-2008.
I use stb myself, so I have no qualms with that. The point is rather that GP was discussing praise for C’s standard library, and even the most portable single-file include-only dependency remains just that: an external dependency that isn’t part of the C standard library (and no, posix isn’t C).
Does it make much of a difference though? Take the hyped Rust for example. Most useful stuff is in crates, i.e. an external dependency. No one seems to have a problem with that.
Personally I do not mind using libraries typically installed by the Linux distribution's package manager anyways.
If the question is whether or not I think the C standard library could be improved, then yes, I would say it could, but I do not want it to have a hash table and all sorts of stuff like that, because there are lots and lots of ways to implement them, and they might not suit my needs. C is great, because you can build it from the ground up (if you want to) to make it specifically for your use case. It gives you the building blocks. I believe I have a comment regarding this somewhere, that I like C because it does not implement stuff for you that is in some ways "generalized", which is often a bad thing. This is my problem with "it should have hash tables at least". You cannot implement it in such a way that it suits everyone's needs.
Rust not having a good standard library is a huge problem. This increases the risk of a rust codebase due to the high number of third party dependencies.
I only said that "I do not mind using libraries typically installed by the Linux distribution's package manager", which was in respect to C.
As far as Rust goes, yes, I do not like that crates are full of one-liners, and so forth. It shares the same problems that npm has. I ran cargo build on many Rust projects before. No way.
What do you call Linux distros' package managers then? I mean, in distributions like Debian you can even download a package's source code with apt-get.
>What do you call linux distro's package managers then?
If you want to count them as package managers, they're by far the worst ones of all the well known languages (with some notable exceptions e.g. guix's and nixos's).
They're not portable between distributions or even different versions of the same distribution (!), since it's non-trivial to install older versions of libraries (or, hell, different versions of the same library at the same time). Not to mention that it's a very manual and tedious process in comparison to all the other language-specific package managers. 'Dependency hell' is a problem virtually limited to distro package managers (and languages like C and C++ that depend on them).
Getting older, unmaintained C programs to run on Linux is an incredibly frustrating experience and I think a perfect demonstration of how the current distro package manager approach is wholly insufficient.
> If you want to count them as package managers, they're by far the worst ones of all the well known languages (with some notable exceptions e.g. guix's and nixos's).
They have the only feature I care about: cross-language dependency management.
Unless you are suggesting to reimplement everything in each language and then make users install ten different XML parsers, SSL implementations, etc. just because of not-implemented-in-my-favorite-language syndrome.
Those are features which make C flexible on mainstream platforms and also usable for so many other platforms where other languages just don't/won't work.
I don't see C being in much worse shape than C++ with respect to build system and package manager. It's slow going, but progress seems to be happening there.
Are you saying both are doomed? Or is there some scenario where C++ survives without C?
I think both are, long term (think FORTRAN where it’s not particularly popular but a lot of existing code is maintained and not rewritten).
C++ is actually in a slightly better spot ironically because it’s harder to integrate with. If you have a C program you can pretty easily start replacing parts with Rust. You can’t do the same with C++ which insulates it better in that sense.
Reports of Fortran's death (latest standard 2018) are greatly exaggerated (much like C). It's receded to a niche, but it's still a very important niche (numerical, HPC). Hopefully, the development of a new Fortran front end for LLVM (from PGI/Nvidia?) pans out, as this would fill a gap in LLVM's offerings, and provide more competition for ifort and gfortran.
I definitely like the "lack of package manager, common build system". For me, having those is a negative for a language like rust.
You see, my OS already comes with those, and I expect to use them. I have the Debian package system: dpkg, apt, aptitude, and so on. It's a big mess when other software tries to steal that role. I have the traditional build systems and more: make, cmake, autoconf, scons, and so on. If I'm building a large project with multiple languages, I'm going to use one of those tools. If a language wants to fight me on that, I'm not interested in that language.
Thanks to LLVM and GCC you can happily write embedded code in a higher level language, but the vendors don't bother supporting it because a lot of embedded coding isn't really what we would call software (no tests etc.)
Toolchains are one side, but garbage collection and big standard libraries are also a big reason. Anything with under a MB of RAM has a choice of several modern languages, but it is still basically just C, C++, Rust, Lua or MicroPython.
People have been looking at that for years now. I’m convinced it’s going to happen one day. It might not be Rust, but it’s going to happen that we will have different models for writing these kinds of things
Perhaps a combination of a language like Zig (a 1:1 replacement for situations where you really do want a lot of manual low-level control) and higher-level languages like Rust eating into more and more of the use cases.
"a pair consisting of a pointer to the start of the array, and a size_t of the array dimension"
No, that still doesn't fix the ABI. It's syntactic sugar. It is most definitely not passing an array.
Passing an array means exactly that, no more and no less. For example, suppose this is your array:
double foo[100][100];
The size is 80000 bytes. That is exactly how much data needs to be copied onto the stack, no more and no less.
Getting the array dimensions is secondary. It would be nice to have them work. They could automatically get names. They could get size checks, so a function might declare itself compatible with a certain range of sizes. That's all a bonus, of much lower importance than the actual ability to pass an array.
The inability to pass an array impacts numerous other languages because they use the C ABI. If you can't put those 80000 bytes on the stack in C, then you can't do it in any language. The whole software ecosystem is thus impoverished.
Are you non-jokingly suggesting copying the entire array to and from the stack each time, as an alternative to the OP's proposal (and as a default best-practice)?
Yes. If you don't really want to pass an array then don't do that. The language shouldn't get in the way when somebody wants to pass an array.
Take the address, and pass a pointer, if that is what you want to do.
Maybe I want the callee to be able to modify the array without affecting the caller. Maybe I'm even telling the linker to put that array in ROM, but I want a writable copy in the callee.
Whatever... I have my reasons. The language shouldn't block me.
I realize exactly how inefficient it would be. If it hurts, don't do that.
I'm the kind of person who optimizes with assembly, counts cache misses, counts TLB misses, and pays attention to pipeline stalls. I definitely understand the performance implications, and I definitely wouldn't be passing arrays around all the time.
That said, I want the ability. I want the language to let me do what I want, and on rare occasions I want to pass an array. Let me pass an array.
Ok, but you didn't just say this should be possible, you said it should be the default best-practice. Even if it were useful in a handful of cases, this would be a terrible default way of doing things.
I didn't say it should be the default, but yes it should be. It is for structs.
We can have giant structs. I've seen some over a megabyte in size. The default is that the callee gets a copy. (depending on the ABI it could be in the "wrong" stack frame, but it is a distinct copy)
Are we having huge problems with structs being passed by value? I don't think so. Normal people pass pointers, except when they actually want to pass by value. It works fine.
I have frequently seen beginners struggle with C arrays and pointers. Part of the trouble is that you can't pass an array. You can try, but the compiler quietly substitutes different code. It's a source of confusion, generating incorrect mental models of what is going on.
Beginners don't struggle in Java, JavaScript, Python, C#, Ruby, and the dozen (at least) other languages that exclusively pass arrays by reference.
But all of this is way off-topic from the OP: the original point was, "Passing a pointer and length separately is error-prone; there should be a way to easily package the two together and this should be the default pattern for 90% of cases." Then you came in and said "No, instead C should support this totally orthogonal side-case that's a bad idea 90% of the time but has some niche uses." It's not a bad suggestion in itself, necessarily, but it's totally unrelated to the original proposal, much less an alternative to it.
The original claim was that the proposal would be C really passing arrays. In the article it says:
"the inability to pass an array to a function as an array, even if it is declared to be an array. C will silently convert the array to be a pointer, and will rewrite the function declaration so it is semantically a pointer"
...and later, referring to the new syntax:
"an array is passed"
In no way is it so. It has nothing to do with passing arrays. It's passing a fat pointer, which is different.
The weird thing is that the compiler obviously knows how to copy an array, because you can pass a copy of a struct that contains an array. I have a vague impression that early versions of C couldn't pass either structs or arrays, only scalars and pointers.
> While [the first edition of K&R] foreshadowed the newer approach to structures, only after it was published did the language support assigning them, passing them to and from functions, [...]
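For example, this copies the whole 80000 bytes today, no special support needed (names made up):

    /* Wrapping the array in a struct makes the compiler copy it by value. */
    struct grid { double m[100][100]; };

    void touch(struct grid g)        /* g is the callee's own 80000-byte copy */
    {
        g.m[0][0] = 1.0;             /* the caller's data is unaffected */
    }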
Yes, but is that happening? I think the answer is no. The mere ability to pass huge structures does not cause programmers to do that.
I believe the same would be true if the language allowed passing arrays. Programmers would not generally pass them around. There is no need for the language to protect us from this by failing to implement the ability to pass arrays.
I think I agree with that position. I seem to have gotten the impression that you were doing this and not seeing a performance impact, which has been wrong in my experience.
I love how so many people here argue with the Walter Bright about technical aspects of C.
I have been a member of many programming languages communities, and every language has its own culture. C was always a language for the arrogant. "The real programmers" that can handle their memory, not afraid to work with pointers and that can get their code right.
I've been there, done that for many years, and became more humble with time. In a way I still love the brutal simplicity and low-level nature of C, but I would use it only if absolutely can't use any other language for technical reasons, and I would be really, really cautious.
> Oh, how they dare argue with WALTER BRIGHT. The hubris!
With all due respect, if someone says something invalid, the fact that they have authority on the subject does not mean that we should agree.
As far as I understand the article (and I'm not the great Walter Bright, so I may be wrong), the author states that "void foo(char a[..])" is better syntax than "void foo(size_t s, char a[])" but does not provide any arguments for it. Furthermore, when discussing "C's Biggest Mistake", the author initially fails to mention that there has already been an attempt to fix the array-to-pointer-decay issue.
So, yeah, the author may be right that this has been C's biggest mistake. I don't know whether that is true or not; I do not have his experience. It is certainly true that this mistake would rank high on any list of C's mistakes. Still, the initial "sleight of hand" move followed by an unsubstantiated argument leads to a post of quality similar to a twitter post. Maybe even worse, since, you know, it's posted somewhere other than twitter, so we are actually talking about it as if it were something serious.
On a somewhat related note, I've always wished for something like `explicit` that prevents assigning different typedefs for the same underlying type to each other. Like suppose I have two types, WorldVec (vector in worldspace) and ViewVec (vector in view/screenspace). Under the hood they are both typedefs for float[3], so I can freely assign them back and forth, but any vector operation that mixes the types would almost always be a bug, since they are in different spaces. Would be cool to get this functionality out of the humble typedef.
This has always bugged me as well. I've generally solved this by wrapping things in a struct. Type checking will use the (incompatible) wrappers and a modern compiler should optimize them away. To avoid strict aliasing violations when converting between equivalent wrapped types you can use a union and employ a function to hide the verbosity.
I have no idea if this is the "right" way to do things, but it seems to work.
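To make the idea concrete, here is a minimal sketch of that struct-wrapping approach (the type and function names are invented for illustration):

    /* Distinct wrapper types around the same underlying float[3]. */
    typedef struct { float v[3]; } WorldVec;
    typedef struct { float v[3]; } ViewVec;

    WorldVec world_add(WorldVec a, WorldVec b) {
        WorldVec r;
        for (int i = 0; i < 3; i++) r.v[i] = a.v[i] + b.v[i];
        return r;
    }

    /* Explicit conversion via a union, so the verbosity stays in one place. */
    WorldVec view_to_world(ViewVec s) {
        union { ViewVec in; WorldVec out; } u = { .in = s };
        return u.out;
    }

    void example(WorldVec w, ViewVec s) {
        WorldVec ok = world_add(w, w);                          /* fine */
        /* world_add(w, s); */                                  /* compile error: incompatible types */
        WorldVec converted = world_add(w, view_to_world(s));    /* deliberate, and visible in the code */
        (void)ok; (void)converted;
    }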
That's what Microsoft did from some version of their build tools on: all the HANDLEs etc. used to just be typedef'd as void *, and now each one is a pointer to its own dummy struct (HANDLE__ *). Seems to be a good solution.
C is considered strictly typed, but that's only when compared to the likes of JS, Python, and co. What you're talking about is incredibly important, and using it in other, truly strongly typed languages has opened my eyes to just how much compile-time safety a language can really provide, practically free of cost.
you can certainly argue it should have been a default warning from the beginning, but mistakes were made and it's a bit too late to change that. people don't like when old, battle-tested code suddenly starts spewing a new warning everywhere after a compiler update.
in reality, most production build systems at least use -Wall (or their compiler's equivalent) and possibly also have a list of specific warnings turned on/off for different parts of the code. it would be nice to have some saner defaults, but it just doesn't matter that much.
But that's because Rust designs for zero cost abstractions in release mode. Overflow checking is not zero cost, so it's only enabled in debug mode, which is the maximum safety possible here while keeping it zero cost at (release) runtime. Without dependent types or something similar I don't know if it would be possible to check bounds.
I actually believe the way C works is great: it is as simple as it could be, and it gives you the power to do exactly what you want, the way you want it.
I would certainly hate it if they removed normal pointers from C (and I would continue using them, btw). It would be like removing s-expressions from Lisp.
I believe that the solution to this "mistake" is just not using C directly: use other languages to write the C code for you, or use C primitives that are 100% well tested.
That is what we do: our C primitives/libraries/modules are written and tested by Lisp and our own language.
It is then very easy to use that code in Python or C++, Swift or whatever, as libraries or modules.
The proposal does not suggest removing normal pointers from C. D has these dynamic arrays, and normal pointers too. Although one sees pointers used less and less in D code.
Do you mean std::extent? [0] You can do the same in C if you define a macro that uses sizeof [1]
This doesn't diminish the advantage of std::array, though, as it embeds the size of the array into the object, unlike when a raw array is passed and 'decays' to a pointer.
I mean like you can pass it through a function by using the “array syntax” when defining the parameter and making the size a template parameter. Like so:
template <size_t N>
void foo(int (&bar)[N])   // take the array by reference so N can be deduced from the argument
And this gives you the size without an additional size parameter as you’d usually need in C (of course with the limitation that the parameter now has to be a compile-time sized array).
I don't think it is a mistake in language design. In the 90s, memory was a scarce resource, and it still is in the microcontroller world, where "only" a few kilobytes of RAM are available. There are performance-critical paths where passing a size_t is just unnecessary.
The actual mistake is not passing a size_t as a user. This is one kind of "premature optimization". We can safely say the language design doesn't encourage the user to write safe code, and successor languages do.
Don't get me wrong: I just want to make the point that C itself is not the thing to blame. It's people using computers who write the million dollar bugs.
The #1 undetected bug problem with C programs is buffer overflows. Experience shows it is extremely difficult to verify that arbitrary C code doesn't have buffer overflows in it. Assistance from the core language design can improve things a great deal.
D allows passing both raw pointers as parameters and pointer/length pairs. It's up to the user to choose. In practice, people have simply moved away from using raw pointers into buffers.
As for performance, in C to determine the length of a string one uses strlen(). Over and over and over again on the same string. This can be a major performance problem, even not considering the memory cache effects. When I look at speeding up C code, often the first nuggets of gold is reviewing all the explicit and implicit uses of strlen(). (Implicit uses are functions like strcat()). It's also the first place I look for bugs when reviewing C code - anytime you see a sequence of strlen, strcat, strcpy, it's often broken (typically in neglecting somewhere to account for the extra 0 byte).
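A made-up but representative example of the kind of broken sequence described above, with the usual off-by-one on the terminating 0 byte:

    #include <stdlib.h>
    #include <string.h>

    char *join(const char *a, const char *b) {       /* error handling omitted */
        char *buf = malloc(strlen(a) + strlen(b));   /* bug: no room for the 0 byte */
        strcpy(buf, a);                              /* re-scans a */
        strcat(buf, b);                              /* re-scans a again, then copies b */
        return buf;                                  /* the terminator lands out of bounds */
    }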
All of this I agree with. In a better world 'arrays' would have been added in the 1980s. The argument about memory limitations is spurious, since if you're writing good code you always pass a pointer and the length. Always, no exceptions.
Yeah, and all the string functions should have been marked as deprecated with C89 and fully removed with C99.
Yes, you can always pass a pointer and a length explicitly. And that's what the "safe" versions of e.g. string functions do. But it's still incumbent on you as the programmer to use them properly. It would still be beneficial to have a compiler mode where all that was done for you automatically and it was impossible to have a buffer overrun.
There is a difference between passing a length as a function argument, and actually storing string lengths alongside the strings in memory.
It's not unheard of to have millions of tiny little (< 10 character) strings, and not storing lengths alongside them can shave off a sizeable portion of space requirements.
Anytime a program has a special case like that, it makes sense to craft an optimal data structure for it.
Also, consider that the terminating 0 byte isn't just one byte. There's also the alignment of what malloc returns, which may be 4 or 8 or even 16 bytes.
malloc also has internal bookkeeping overhead, typically 16-32 bytes per allocation.
Which is why one should never allocate a single (short) string from a generic allocator. Instead, one allocates a big chunk upfront (e.g. 4K bytes or more) and carves strings out of it, using a simple index that points to the first unused byte.
In this way, the overhead of allocating a string is truly only the terminating zero byte - no alignment constraints. This scheme is easy to implement as long as strings don't need to be freed individually.
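A minimal sketch of that scheme (names are mine, not from the comment): carve strings out of one big block, and only free the whole pool at once.

    #include <stddef.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct { char *mem; size_t used, cap; } StrPool;

    int pool_init(StrPool *p, size_t cap) {
        p->mem = malloc(cap);
        p->used = 0;
        p->cap = cap;
        return p->mem != NULL;
    }

    /* Copy a string into the pool; overhead is just the terminating 0 byte. */
    const char *pool_copy(StrPool *p, const char *s) {
        size_t n = strlen(s) + 1;
        if (p->used + n > p->cap) return NULL;   /* out of space (no growth in this sketch) */
        char *dst = memcpy(p->mem + p->used, s, n);
        p->used += n;
        return dst;
    }

    void pool_free(StrPool *p) { free(p->mem); p->mem = NULL; p->used = p->cap = 0; }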
I must be a really really good programmer, since I rarely see the need to use strlen().
For one, strings are just chunks of memory like other arrays. So for almost any string that is not a literal in the source code, you just store offset/length as needed, like for any other array. I have sizeable projects (on the order of 10K lines) that have maybe 0 or 1 instances of strlen() in the code.
Very often though, strings are bounded and pretty short (since they are meant for human consumption), and in these cases, using strlen() when looking up a string is often sensible, since persisting the length might require more memory than the string itself.
Another case is when you're scanning the string from left to right anyway, so you just "stream" through it until you find the terminating 0. That's how printf() and friends work (they take a formatting string), and arguably this scheme works just fine.
Btw, the length of a string literal is (sizeof "Hello" - 1). You can also initialize char arrays using string literals and have the size available:
static const char name[] = "Foobar";
static const int nameLen = sizeof name - 1;
Look how many C library functions implicitly call strlen() internally.
Also, look at functions like sscanf(). It can be orders of magnitude slower than fscanf() because every invocation calls strlen while fscanf incrementally reads from the current file pointer. I don't know why sscanf doesn't also work incrementally but the implementations I've tested don't do that.
The main point here being: if strings had a size_t size plus data, that would change an O(n) scan to an O(1) length lookup, and that would have huge performance gains throughout the C library, not to mention your own code as well.
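For illustration, this is the kind of loop that quietly turns quadratic when each sscanf() call re-scans the whole remaining string internally (hypothetical code, not from the post):

    #include <stdio.h>

    long sum_ints(const char *buf) {
        long total = 0;
        int value, consumed;
        /* Each call may do an internal strlen over the rest of buf,
           so for a long input this loop costs O(n^2) instead of O(n). */
        while (sscanf(buf, "%d%n", &value, &consumed) == 1) {
            total += value;
            buf += consumed;    /* advance by hand; sscanf keeps no position */
        }
        return total;
    }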
I checked musl libc, and the way it implements sscanf is by calling vsscanf with a custom FILE stream. That file stream is implemented using __string_read(), which does indeed call strnlen().
I figure it would be possible for musl to implement that stream using a function that scans for NUL and copies at the same time, but maybe that's not an improvement in the end.
It would be much simpler of course if sscanf() would take the length of the input string as an additional argument. But actually, I don't really care.
Because, does this even matter? Using sscanf() is far from ideal anyway. The stdio functions are not what you use if you're going for performance. Their conversions are probably not the fastest (being quite featureful), and they are even locale dependent which is a huge mess!
Heck, when we're going for performance to a degree where a strlen() matters (bear in mind that we have to read the input at least once anyway, so the waste is definitely bounded) we should certainly not be parsing text at all. That is much more wasteful in comparison.
Much if not most of libc is there to give you a portable base to (comparatively) quickly get your project up and running, and simply to keep old software going, but it's certainly not there to help you achieve peak performance.
Interestingly, I assume you meant to end your post with "pointer to char", not "char" itself, but the asterisk is the italics formatting character on HN, so it has italicized things. The funny thing is that it has italicized the "reply" button (as well as an empty i-tag after "char").
You must pass a size_t somewhere, surely? Otherwise you have no idea how long the array is. This is about doing it properly rather than relying on yourself at 9AM to get it right every time.
Then don't return or expect arrays. Return and expect Pointers of a given type.
That's not going away in the article's proposal. It's being complemented by an array syntax that makes the current size (in memory) of a non-static data structure known upfront.
In C you can declare pointers to arrays; the syntax is just somewhat strange. You can even declare a pointer to a variable-sized array with C99, e.g.:
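(The snippet appears to have been lost; presumably something along these lines, using C99's variably modified types:)

    #include <stddef.h>

    void fill(size_t n, int (*p)[n]) {    /* p is a pointer to an array of n ints */
        for (size_t i = 0; i < n; i++)
            (*p)[i] = (int)i;
    }

    void caller(void) {
        int a[4];
        fill(4, &a);     /* &a has type int (*)[4] */
    }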
Yes. As I've long said, C does not have arrays. It has pointers and notations for initializing memory, but it does not have arrays.
There are actually two problems here. One is the absence of bounds checking. The other, which is related but technically orthogonal, is the hole in the type system: an array is not an object. It's been true since Unix v7 that you can pass or return a struct by value, but you can't pass an array by value unless you wrap it in a struct.
The type system also makes no distinction between a pointer to a single object and a pointer into an array of objects. I've worked on static analysis tools that try to find potential buffer overflows, and this turns out to be a surprisingly big problem. One has to do a global dataflow analysis just to discover which pointer variables could ever point at array elements.
It would help if you define what you mean by "object". This term has a definition in the C standard by which arrays are unquestionably objects. It is true that you cannot pass or return them by value, but that does not mean they are not objects, that means the ability to pass or return things by value is not a property of objects.
About the singular vs array question: this is true but just a special case of the absence of bounds checking, is it not? C's approach of allowing e.g. "int i;" to be addressed as if it were an array of length 1, allowing construction not just of &i but also &i+1 as pointer values, is valid and sometimes useful, but you have to make sure you never access *(&i+1). That's the same problem as how given "int a[2];", accessing a[2] is not valid, as far as I can see.
What GP means is "there are no array expressions in C's syntax". You can't copy them or assign them, and I'm not sure that they are part of any kind of "type system" in C (although some compilers have a notion of array type internally, which is obvious from their error messages).
That's just not true, there are array expressions, and arrays are part of the type system. I give him more credit than that. (The "you can't copy them" is technically untrue as well but that is just because of sloppy wording.)
You're right, I forgot about compound literals, which were introduced in C99 and which are a rarely used feature.
But I guess my statement that "C doesn't have array expressions" was true before the advent of C99. And that's also why array decay made even more sense back then. (It still makes a lot of sense today IMO).
I didn't mean compound literals, I do not really see how they change things here, I meant that there are a few cases where arrays don't decay to pointers, and supporting those requires compilers to make arrays a part of the type system. Example: given "int a[3];", how else would you compute (&a+1) ?
To me, the question is what is actually "the type system" and what is "the allocation system" or whatever else aspect of implementing a compiler.
So determining the type of "&a" is not an issue, it's just one case in determining the type of a C expression (look up the object "a", is it an array? The type of the expression is a pointer to the array element type).
This is not a special case, at least not more special than how to determine the type of the expression "a", or "1".
Are you aware that given int a[3];, (&a+1) and (a+3) denote the same address? If you are, how can that possibly work if as you suggest, &a and a are indistinguishable to the compiler, that they are both seen as a pointer to int?
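A small example of what the compiler has to keep track of (my own snippet, not from the thread):

    #include <assert.h>

    int main(void) {
        int a[3];
        /* a decays to int*, so a + 3 steps over three ints;
           &a has type int (*)[3], so &a + 1 steps over one whole int[3]. */
        assert((void *)(a + 3) == (void *)(&a + 1));
        return 0;
    }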
I’ve been writing C code for 30 years. What I like about it is that code from back then is pretty much similar to what you’d write today, so there’s limited need to chase the train. Sure, it has security issues, but if you do things ‘right’ it’s still untouchable, imo. Most other languages are heavy and slow compared to C. A lot are not even backwards compatible (Python) or suffer from being kitchen sinks (C++). My own opinion is that Python is the only other ‘must know’ language, as you can do RAD. For speed you resort to C. There are so many toy languages in between with some nice features (coroutines, native JSON support, etc.). But lots are just meh. All personal opinions.
I think the stack for the next 30 years is going to be zig for low level code, something on the erlang vm for networks and distributed systems, and Julia for numerical computation.
It's just too hard to "do things right" in C. Even people who have been "doing things right" for decades make mistakes.
This was probably the most confusing thing about C when I first started learning programming back in the day. When you call a function you pass the value, except for arrays, which get converted to a pointer. It was explained to me back then that the reason was that copying the whole array was inefficient, so it was better to pass the reference.
Minor nitpick: there is no allocation going on here - you’re simply reserving a fixed-size buffer on the stack (assuming the array is local to a function).
I would call that a stack allocation, but yes, they are slightly different. In my mind it's a feature that arrays allocated on the stack and on the heap can interchangeably be given as an argument to a function.
Unfortunately working with dynamically-allocated buffers is still a thing; adding array syntax just favors one form of size specification without solving the other case.
Although C is usable on many types of hardware, interesting things could be done, e.g. on desktop OSes if certain hardware-specific extensions were made.
One of the things I wish we could do with our now-absurdly-large pointers (64-bit) is to reserve a handful of bits for other information such as the size. Sure it means we can’t store anything at location 2^64-1 but it wasn’t that long ago we only had 32-bit pointers and the 33rd bit is twice as many addresses all by itself so I think we can lose a few.
For example, if all allocations were rounded up to buckets of a certain size, the precise byte count would not need to be encoded in the pointer (just the number of buckets, requiring fewer bits). There could be a couple bits to give pointers a type for other interesting scenarios, e.g. perhaps a pointer identified as an “immediate value” that isn’t actually allocated at all, and it is “dereferenced” by treating its “address” as the “stored” value. There could even be a couple of bits to track use of common allocators (it would be so nice to simply know that a pointer was allocated by "malloc" vs. "new" for example).
In high-level languages, then, the syntax change would be not to identify arrays specifically but pointers with encodings that are “complete” (e.g. "char const complete*" or something), covering both stack arrays and dynamic buffers.
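As a rough sketch of the bit-packing idea above (purely hypothetical; it assumes user-space addresses fit in the low 48 bits, which is true on today's x86-64/AArch64 but is not something C guarantees):

    #include <assert.h>
    #include <stdint.h>
    #include <stdlib.h>

    #define TAG_SHIFT 48   /* stash a small "bucket count" above bit 47 */

    static uintptr_t pack(void *p, unsigned buckets) {
        return (uintptr_t)p | ((uintptr_t)buckets << TAG_SHIFT);
    }

    static void *unpack_ptr(uintptr_t t) {
        return (void *)(t & (((uintptr_t)1 << TAG_SHIFT) - 1));
    }

    static unsigned unpack_buckets(uintptr_t t) {
        return (unsigned)(t >> TAG_SHIFT);
    }

    int main(void) {
        void *p = malloc(64);
        uintptr_t t = pack(p, 4);       /* e.g. 4 buckets of 16 bytes */
        assert(unpack_ptr(t) == p);
        assert(unpack_buckets(t) == 4);
        free(p);
    }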
> Unfortunately working with dynamically-allocated buffers is still a thing; adding array syntax just favors one form of size specification without solving the other case.
All that is needed is a mechanism for forming a fat pointer from a pointer and a length. In D this looks like:
int* p = cast(int*)malloc(length * int.sizeof);
if (!p) fatalError();
int[] a = p[0 .. length];
...
int x = a[length + 1]; // runtime error: buffer overflow
In C, this could be done via a macro with no additional core language changes.
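A rough sketch of what that could look like in plain C (the names span_t, SPAN_FROM and span_at are invented here, not part of any proposal):

    #include <stddef.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { int *ptr; size_t len; } span_t;

    #define SPAN_FROM(p, n) ((span_t){ (p), (n) })

    /* Bounds-checked element access: abort instead of corrupting memory. */
    static int *span_at(span_t s, size_t i) {
        if (i >= s.len) {
            fprintf(stderr, "buffer overflow: index %zu, length %zu\n", i, s.len);
            abort();
        }
        return &s.ptr[i];
    }

    int main(void) {
        size_t length = 10;
        int *p = malloc(length * sizeof(int));
        if (!p) return 1;
        span_t a = SPAN_FROM(p, length);
        *span_at(a, 3) = 42;                 /* fine */
        int x = *span_at(a, length + 1);     /* aborts at runtime */
        (void)x;
        free(p);
        return 0;
    }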
Something I see as wrong with C outside of the context of the Linux kernel is mostly something wrong with us the developers. We are far too content to live in filth.
In addition to unsafe/irregular buffer handling, I also constantly see poor data structure choice, presumably due to a lack of default choice of library. It is very common to see code scanning linked lists when they should be doing map look ups. (And often even the linked list operations are ad-hoc and repeated for every type of struct with an embedded next/prev pointer.) Everyone always defaults to linked listing it up because they never have to pay the up-front cost of finding a library or investing in re-inventing the wheel. I think this is also why you see so much sketchy buffer code - no one has bothered investing in safer buffer abstractions.
Perhaps some of this is caused by the difficulty of taking on dependencies in a portable way. (CMake/Autotools can make this better, but it is a far cry from NPM.)
It's a shame that Niklaus Wirth doesn't get more respect and attention. But some of his ideas were just a little ahead of their time - I think the length prefix for a Pascal string was just a single byte (so strings topped out at 255 characters), which was a bit too limiting.
Wirth's Pascal didn't have a string type. You could have a fixed-length array of CHAR, but you couldn't have _fewer_ than 16 characters in a 16-character array, and you couldn't pass a 16-character array to a function with a 256-character-array parameter. Only the magic ‘functions’ built in to the language like write() could accept strings of different lengths. Since this made the language worse than FORTRAN at handling text, most implementations added some sort of string handling.
C's pointers wouldn't be an issue if the world had used the Intel iAPX 432 processor instead of the 8086. The iAPX 432 included bounds checking for every array in hardware (among many other features), so it was impossible to make an out-of-bounds access.
Unfortunately the iAPX 432 was delayed, so Intel introduced the 8086 as a stopgap processor, and computers have been using the x86 architecture ever since. It's interesting to think that if history had gone a bit differently, whole classes of security problems would not exist.
> so it was impossible to make an out-of-bounds access.
That's a strong claim for something as vaguely defined as "the length of an array."
What if I allocate a huge array of memory and emulate another processor using that memory? What if I subpartition an array? What if I overallocate an array to avoid reallocations?
C-the-language has no concept of size-tagged arrays at runtime, and I guess it's baked in deeply due to the various guarantees made about sizeof(array), &array[0], and the ability to cast &array[0] back to the original array. The iAPX hardware would have gone unused.
I don't see a mistake here, certainly not a "biggest mistake".
This is C, not C++. Keep it simple.
Here, the idea is that there is no special type for "pointer+size" (what the author proposes as an array). Ok, let's add one and see the implications.
- How do I get the size, the number of elements? A "sizeof" like operator?
- Can I resize the array? If yes, how? If no, why?
- What happens if I overflow? Undefined behavior?
- A memcpy-like would be an obvious function to implement, what happens if sizes differ?
- What is the relationship between static arrays (ex: int a[5]) and "pointer+size" arrays? Are these completely different types? Is there an implicit cast between the two?
- About casting, how can I go from a separate pointer and size to an array and vice versa? If it is possible at all.
- What if I do a bit of pointer magic to access the internal representation of the array? Probably undefined behavior.
It is much more complex than "just add array[..]", I expect more tradeoffs, more undefined behaviors (C wouldn't be C without them). Adding complexity to the language can actually make things worse.
As for zero-terminated strings, they have advantages and drawbacks. They are preferred by the C library, but you can do pointer+size if you want by using mem* instead of str* , or %.*s instead of %s in printf (not sure about this one).
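For what it's worth, the %.*s trick does work; a small example of the pointer+size style with mem* and printf:

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const char buf[] = "hello, world";
        size_t len = 5;                       /* refer to "hello" without a terminator */

        printf("%.*s\n", (int)len, buf);      /* prints exactly len bytes */

        char dst[16];
        memcpy(dst, buf, len);                /* mem* functions take explicit lengths */
        dst[len] = '\0';
        puts(dst);
        return 0;
    }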
Ziglang is basically this plus getting rid of the preprocessor and almost all UB. It's extremely clean, quite safe, and in many ways far far simpler than C.
Here are its answers:
- How do I get the size, the number of elements?
Builtin .len field operator. @sizeOf works too.
- Can I resize the array? If yes, how? If no, why?
No. Array lengths are comptime known; there is something called a slice which is runtime known and bounds checked in safe releases.
- What happens if I overflow? Undefined behavior?
Panic, in safe releases. UB in dangerous releases (small or fast)
- A memcpy-like would be an obvious function to implement, what happens if sizes differ?
Bounds checked at runtime for safe releases.
- What is the relationship between static arrays (ex: int a[5]) and "pointer+size" arrays? Are these completely different types? Is there an implicit cast between the two?
Yes, and yes. Arrays can implicitly be converted to slices at compile time; slicing into a slice with compile-time known index width yields an array.
- About casting, how can I go from a separate pointer and size to an array and vice versa? If it is possible at all.
There is an escape hatch function for this.
- What if I do a bit of pointer magic to access the internal representation of the array? Probably undefined behavior.
It's defined, but unsafe.
I haven't really worked much in Zig (it's not my daily driver), but I think it says something that all of these questions have answers, that the answers are sensible, and that they are very easy to remember.
I agree that missing bounds checks are the biggest problem, along with the unhealthy attitude of dismissing Annex K for political reasons. My Safe C library has only found adoption with the big players.
But even weirder is the total lack of a proper string library. Nobody but Microsoft uses wchar_t, and they are the only ones with the proper wchar_t size; everybody else got it wrong with size 4. But nowadays it should be clear that u8 is the only way forward, and C++ has now even adopted char8_t for it. Yet they all still ignore the unicode problems with an overly simplistic, glorified null-terminated memory buffer library. These are not strings anymore nowadays. Strings have multiple representations of characters in unicode; strings need the unicode version to be exposed, which changes every year. They need a proper fold-case and normalization API, otherwise you cannot compare them, so you cannot search for strings. Grep would be happy to find unicode strings, but it still cannot. coreutils still cannot do unicode in 2020.
Also the complete lack of security, esp. with names, i.e. identifiers, such as pathnames. Most filesystems just ignore security: spoofing, bidi changes, mixed scripts, as if this problem does not exist at all. Strings are not normalized, not properly fold-cased.
The locale mess: it still relies on global runtime state, which is not reentrant and not compile-time optimizable, as opposed to the _l variants or simply a new u8 API. It's a huge mess.
gcc cannot do compile-time constexpr checks, only clang can, leading to up to 200x faster libc code. gcc cannot do user-defined warnings or errors.
glibc, FreeBSD libc, musl, none of it fixes anything.
It's probably worth remembering the history of C - it didn't appear fully formed, lots of stuff evolved as people used it - for example in V6 Unix += used to be =+.
In particular you used to use structures in a weird way: field names within structures lived in their own global name space, and any pointer could be used with any structure. There weren't unions yet, so this was used to good effect in Unix kernel drivers (there was a standard buffer queue header you could add your own stuff to the end of).
I think this was kind of descended from the BCPL/Bliss world view where explicit pointers were a relatively new thing in languages and their typing was pretty simple (there was a limit to the number of indirections allowed) - fully orthogonal typing systems were only just becoming a thing then.
Also I suspect that the idea that a[i] was the same as a+i was an idea with legs, this is still legal C:
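(The snippet seems to have been lost in formatting; presumably it was the classic demonstration that indexing is just pointer addition:)

    void demo(void) {
        int a[10];
        a[3] = 1;    /* a[3] is defined as *(a + 3) */
        3[a] = 1;    /* so 3[a] is *(3 + a): the same element, and still legal C */
    }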
Stupid question, but how do I access the size of an array using this fancy new declaration if it were to be added? It doesn't seem like any sugar is there to provide "range based for loops."
Wait, I would just use `sizeof`, but then I'm still doing pointer math?
The C standard doesn’t forbid fat pointers. You could have a compiler that implements fat pointers (and crashes on out-of-bounds access attempts) without violating the standard in any way, since out-of-bounds accesses are undefined behavior.
To me, C is a powerful language that is weakened by 2 things:
1- Trying to find the proper style and methods to write a few lines of code.
The reason is that C is an old language that kept changing. Thus, you can read a book, yet still have someone tell you "you shouldn't do it that way".
2- Compilers made C into different flavors.
The Microsoft C compiler provides scanf_s, with the old scanf being deprecated. On the other hand, gcc takes a different approach and doesn't have the scanf_s that Microsoft has. This can be so annoying to use.
The proposed change has another important aspect to it: it would help standardize a way of passing arrays across language boundaries. Currently when using FFI, you often have to decompose fat pointers into pointer and length before calling a foreign function, or compose a fat pointer from a pointer and length in extern functions. With this, languages like Rust, D, etc. could just pass arrays directly.
I for one would love to see this proposal become a reality.
Personally I like it the way it is. If you want to copy an array when making a function call, you can define a struct with an array in it, and pass the structure.
If C did pass array lengths it still wouldn't matter since C doesn't (and in my opinion shouldn't) check for overflows.
I don’t think it is possible, not without changing some parts of C’s specification. At the very least you’d need to be able to somehow encode the length of the buffer in the pointer to it. (There is no semantic difference between a pointer to a simple, fixed-length variable and a pointer to an array.)
You can keep a list of every allocation and every time the code does a memory read/write the implementation can look up to see if the pointer is within a valid allocation space. I have implemented some things like this: https://www.youtube.com/watch?v=pvkn9Xz-xks
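A toy sketch of that idea (illustrative only; real tools such as AddressSanitizer or Valgrind do this with shadow memory and compiler or binary instrumentation):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_ALLOCS 1024
    static struct { void *base; size_t size; } allocs[MAX_ALLOCS];
    static size_t nallocs;

    static void *tracked_malloc(size_t size) {
        void *p = malloc(size);
        if (p && nallocs < MAX_ALLOCS) {
            allocs[nallocs].base = p;
            allocs[nallocs].size = size;
            nallocs++;
        }
        return p;
    }

    /* Does [p, p+len) lie entirely inside some recorded allocation? */
    static bool valid_access(const void *p, size_t len) {
        uintptr_t q = (uintptr_t)p;
        for (size_t i = 0; i < nallocs; i++) {
            uintptr_t base = (uintptr_t)allocs[i].base;
            if (q >= base && q + len <= base + allocs[i].size)
                return true;
        }
        return false;
    }

    int main(void) {
        char *buf = tracked_malloc(16);
        printf("%d\n", valid_access(buf + 8, 8));   /* 1: in bounds */
        printf("%d\n", valid_access(buf + 8, 16));  /* 0: would overflow */
        free(buf);
        return 0;
    }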
That is "modern" C++ is frequently being used as "C with classes and basic safety".
I wonder if a better C can be made by just stripping-down the bloated C++ and introducing the "unsafe" keyword for dangerous features like directly using arrays, etc.
- A concise syntax for declaring, accessing and mutating them. Dealing with lists is such a common thing in programming languages, that it's simply crazy to not have a proper syntax for them.
- Generics/templating, so that you can use concrete types instead of 'void *'. Having that prevents mistakes and also tends to make code self-documenting.
"If it's such a good idea to use a safety belt and airbags, why do we need special devices for it? Why can't I just use a piece of rope I had in a drawer and some leftover balloons from my previous birthday party?"
There are lots of places you can make C's syntax more concise. What do you get for changing the syntax here? Why is it worth it?
void* was merely an example. Yes, make it typed when you use it in your C code. Also use accessor functions to wrap array indexing so you can switch a macro for bounds checking on and off; absolutely do that. Does new syntax change anything if you do these things?
I don't much care for the seatbelt analogy there: replace your working, properly fitted seatbelts with our red ones because they're easier to see? These kinds of analogies always break down, especially car analogies for programming, and yes, I use them too.
It’s not a “mistake”. This article is complaining about a misinterpretation of C’s functionality. Arrays are not “real” data structures in C: there’s no such thing. The array-ish syntax that’s available is just some syntactic sugar on top of pointers. You could say that having the sugar at all is a mistake. Or that C is incomplete without first-class array types. This is a cute hack, but at this point (far more than 10 years ago) it’s probably better to move on to Rust if you don’t like this aspect of C, rather than proposing to hack the language.
> Arrays are not “real” data structures in C: there’s no such thing.
I assure you, there are arrays in C.
int a[100]; // `a` is an array, not a pointer
int* p; // `p` is a pointer, not an array
a = p; // error, array is not a pointer
> The array-ish syntax that’s available is just a some syntactic sugar on top of pointers.
Sorry, this is incorrect. In some circumstances, C will implicitly convert an array to a pointer, which is what the article is about, but don't mistake a conversion with identity.
>"What mistake has caused more grief, more bugs, more workarounds, more endless hours consumed, etc., than any other? Many people would say null pointers. I don’t agree.
Conflating pointers with arrays.
I don’t mean them using the same syntax, or the implicit conversion of arrays to pointers. I mean the inability to pass an array to a function as an array, even if it is declared to be an array. C will silently convert the array to be a pointer, and will rewrite the function declaration so it is semantically a pointer:
[...]
This seemingly innocuous convenience feature is the root of endless evil. It means that once arrays leave the scope in which they are defined, they become pointers, and lose the information which gives the extent of the array — the array dimension. What are the consequences of losing this information?
An alternative must be used.
For strings, it’s the whole reason for the 0 terminator.
For other arrays, it is inferred programmatically from the context. Naturally, every situation is different, and so an endless array (!) of bugs ensues.
The trainwreck just unfolds in slow motion from there.
The galaxy of C string functions, from the unsafe strcpy() to sprintf() onwards, is a direct result. There are various attempts at fixing this, such as the Safe C Library. Then there are all the buffer overflows, because functions handed a pointer have no idea what the limits are, and no array bounds checking is possible."
PDS: The root of all of this -- is that C, being a low-level, close-to-the-hardware, designed in the 1970's programming language (some in academia pejoratively call it a "glorified assembler"), was not designed with a proper string storage class as we know them in programming languages today; instead, arrays of characters were substituted for this purpose, and those arrays were not implemented containing total size (length) and dimensionality information.
Basically an array in C -- is a set of contiguous memory, which has a starting address (the pointer passed), and a stated element size that the compiler knows about, but not the length (aka, total size, element count, etc.) of that array, nor its dimensionality.
Observation: C's arrays need length information signalled in an out-of-band fashion (that is, this information cannot exist as a zero (0) -- somewhere in the array).
The irony of all of this is that C was invented at AT&T, and AT&T for the longest time had difficulty with phreakers exploiting 2600hz signals to gain access to its long distance trunk lines, from which they could call to anywhere in AT&T's system for free.
But, that's what the engineering error of in-band signaling generates...
C, by using arrays to implement strings, and letting the zero terminator (information about string length) exist in the memory space of the string, made exactly the same engineering mistake -- just in software -- and that is the mistake of in-band signaling.
Now, that being said, hindsight is 2020, and it couldn't be expected that Dennis Ritchie, who invented C in the 1970's would have foreseen the consequences of that engineering "mistake" (AKA, "act which generated quite the education for a future populace". <g>).
Such is the price of being an innovator.
On the one hand, he advanced computer technology greatly -- far beyond the technology advancements created by most of his contemporaries of his day...
On the other, that advancement gave us this highly educational engineering "mistake" -- that we can all learn from!
Such is the price of being an innovator -- and pressing the "bleeding edge" of what is possible...
Humanity could not be advanced without such innovators, and the occasional future flaws (and the wisdom that comes from examining them in hindsight!) that their innovations generate...
That's the thing, they're not a type - they're just an array of char with an unspecified length, indistinguishable from any other array of char or pointer to char. Only the convention of ending them with a null character makes them usable at all.