It's not C/C++. C/C++ is not a language. The project's goal is to write a very simple UNIX-based operating system in C++ (does anyone else smell a contradiction?). The code is predominantly C++.
> write a very simple UNIX-based operating system in C++ (does anyone else smell a contradiction?)
I'm going to use this as an opportunity to champion C++ for systems programming despite the fact that it wasn't really your argument. I feel obligated to do this because I agreed with you for a long time on this but have changed my mind over the past year or two.
The argument against C++ as a systems programming language has always been that it gives you too much freedom to hide complexity, but I've written systems code in both C and C++ professionally and it's been my experience that the extra burden of having to write everydamnthing over again in C (or use a relatively poorly tested container library, as compared to something like the STL) is not worth the "obvious performance characteristics."
I'd further contend that in non-trivial software all perf characteristics are non-trivial, so even though a + b might be doing all kinds of crazy things in C++, you're much more likely to run into problems invoking store_value(), which might interact with a number of underlying libraries and system calls. At the end of the day you're hopelessly screwed without profiling, and the perf utility works just as well on C++ programs as it does on C programs.
Finally, the majority of code in my experience is either slow path or at least not the bottleneck anyway. You can almost always optimize a systems program by making smarter system calls (or hardware invocations), while optimizing CPU is rarely worth it. Even if you're CPU bound, it's likely that it's in doing something like SSL, wherein you're almost definitely wrong to try to jump into a project like OpenSSL with micro-optimizations. With that in mind, why not allow yourself to move faster by creating a std::vector? When you use the proper conventions, it's generally as fast anyway.
Just because people can (and have, a lot) misused C++ and created way too much complexity with it doesn't mean that it's not the preferred language in the hands of someone who knows what they're doing. (And yes, it's much harder to learn how C++ works in a meaningful way to avoid these pitfalls, but if it's your job, learn it anyway).
Intrusive data structures were the thing I missed most from C, but boost::intrusive satisfies my desires when it's absolutely necessary.
In general, I'm anti-boost, but I think it's a personal bias and we use the hell out of it at work to great effect. The one thing I'll give boost::intrusive over sys/queue.h is that the type system helps you a lot more to catch issues and the common case is a bit simpler (a struct that exists in a single linked list and a single hash table, for example).
I've been burned many times by something like the following:
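Roughly this shape, with hypothetical types - the point is that the classic unchecked cast-down compiles no matter which container you name:

    #include <cstddef>

    struct list_node { list_node* prev; list_node* next; };

    // The classic unchecked cast-down: pure pointer arithmetic; nothing
    // verifies that the node actually lives inside the type you name.
    #define CONTAINER_OF(ptr, Type, member) \
        reinterpret_cast<Type*>(reinterpret_cast<char*>(ptr) - offsetof(Type, member))

    struct Connection { int fd; list_node idle_link; };
    struct Request    { list_node age_link; int id; };

    void on_timeout(list_node* n) {
        // n actually came off a Connection's idle list, but this compiles
        // cleanly and silently reads garbage at runtime:
        Request* r = CONTAINER_OF(n, Request, age_link);
        (void)r;
    }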
In C, you can use a bit of macro trickery to get the same safety, e.g: by defining a 0-sized array of the container type alongside the intrusive node. Then, macros that do the "cast down" from the intrusive node to the container element can also do a type-comparison (using some trickery) between the anchor's array and the container type.
I've seen it implemented, but everyone uses the typical non-type-safe one anyway :)
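For comparison, a minimal C++ sketch of the same idea, where a template parameter plays the role of the zero-length-array type tag (all names here are made up):

    #include <cstddef>

    struct list_node { list_node* prev; list_node* next; };

    // Node tagged at compile time with its owning type; the tag costs no
    // storage, just like the 0-sized array in the C trick.
    template <typename Owner>
    struct typed_node : list_node {};

    // "Cast down" from a node to its container. Passing a node that was
    // declared inside a different Owner fails to compile.
    template <typename Owner, std::size_t Offset>
    Owner* owner_of(typed_node<Owner>* n) {
        return reinterpret_cast<Owner*>(reinterpret_cast<char*>(n) - Offset);
    }

    struct foo {
        int id;
        typed_node<foo> lru_entry;  // can only ever be cast back to foo
    };

    // foo* f = owner_of<foo, offsetof(foo, lru_entry)>(&x.lru_entry);  // ok
    // owner_of<bar, ...>(&x.lru_entry) would be a compile-time error.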
I don't think I've ever had such bugs in a lot of code though, since I tend to wrap the "containerof" call with little functions like foo_of_lru_entry and foo_of_hash_entry. A bit of boilerplate for each data structure, but worth it.
The rest of the data structures in the STL follow the same (anti-)pattern.
They all either take ownership of your data, or point at your data unidirectionally.
This means that, given a pointer to your data, you cannot, for example, delete it from multiple data structures it is contained within without going from those structures' roots to re-find the pointers to your data.
Whereas with intrusive data structures (the way it is done in the Linux kernel and other "advanced" C projects), you can easily embed your structure in multiple data structures, such that you can do very quick deletion or modification of the data without re-finding it.
If you really think that, you've missed CS 101. Or maybe std::list is the only linked list implementation you've seen. In that case, I agree, one should never use std::list.
Linux's list.h is extremely useful, and for a wide variety of circumstances, is the most efficient way to manage your data.
Ok, I'll elaborate, especially since my view is at odds with your statement "for a wide variety of circumstances".
As I see it, the only use case where linked lists are superior to other types of lists, like, perhaps, ArrayList in Java or vector/deque in C++, is if the following conditions are met:
1: you care about ordering - often you don't care about ordering, and in that case there is no need for a linked list, because you can achieve O(1) insertion & removal then too: insertion can always be at the end, and removal can be a "swap with end element, remove end element" operation (a sketch follows this list).
2: inserting and/or removing from the middle of the list is a common operation - if it is not a common operation, then the added cost of doing so with a vector may still be outweighed by a vector's other advantages.
3: you do not require random access - lists do not provide random access, and lookup is O(n). At the expense of additional implementation complexity and more memory overhead, you could reduce lookup to O(log n) by using a skip-list.
4: you do not iterate through the list often - if you do, you are likely going to blow the cache and mess up prefetching due to poor cache locality. Iterating through an array-based data structure can be much faster in this case.
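For point 1, the swap trick is tiny - a minimal sketch:

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Unordered O(1) erase: overwrite the doomed element with the last one,
    // then pop the tail. Order is not preserved, which is the whole point.
    // (i must be a valid index into v.)
    template <typename T>
    void swap_remove(std::vector<T>& v, std::size_t i) {
        if (i != v.size() - 1)
            v[i] = std::move(v.back());
        v.pop_back();
    }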
I would say that for a list to make sense, you MUST have 1 and 2 and probably should have 3. 4 is optional, but if true, should make you consider if there might not be a more suitable data structure. In my own personal experience, this is rare. In fact, in my own personal experience, usually, code either does not require 1 or requires 1 but not 2 - either way, lists are not the appropriate data structure in those cases.
Basically, the short version is that they have very poor cache locality, a narrow (IMHO) use case where other data structures don't have superior performance, and they take up more memory per node than a lot of other types of lists.
You linked to a Stack Overflow question in another comment, saying that its answers attribute std::list's flaws to linked lists as a whole. The biggest issue they seemed to mention, though, was cache locality - I fail to see how intrusive linked lists solve this. The only solution would be to preallocate nodes in consecutive memory locations, but you still take a hit as the links take up memory (whereas in array-based lists you do not need to store links for each node), and if you need to insert/delete in the middle (why are you using linked lists if this isn't the case?) then you end up jumping around the preallocated nodes anyway, and after a while you will lose any cache-friendliness you may have had.
Maybe you can elaborate what you meant?
To end with an appeal to authority ;-) I'll quote tptacek[1]:
> C programmers are trained to use linked lists. They are the first variable-length containers most programmers come into contact with and so C programmers tend to be imprinted on them like ducklings. Linked lists are a poor general purpose container.
EDIT: I guess I missed an anti-condition: if you don't care about performance, then use whatever models your problem best, which may well be a linked list (though the same is true for std::list).
This is a strawman; it isn't a useful case for lists, since he needs to scan the list to find the position to add/remove to/from.
If he used an intrusive list, removal would already be faster in the list than in the vector for relatively small N's.
What I learn from this video is that Stroustrup is also misguided about linked lists. He thinks you always need to do a linear search to do useful operations. The whole point of lists is the O(1) operations, not the O(N) operations.
Indeed, std::list is the culprit here: "We shape our tools and then our tools shape us". std::list users become unaware that list operations are usable on elements without a linear search first.
Even O(1) operations on lists can break cache locality. Since you're chaining using pointers, all bets that your cells are nearby are off. With vectors, they always are.
That's assuming your list elements are in neighboring cache lines. (Also, how it compares to vectors in real-life performance, as opposed to time complexity, really depends on how often your vector has to grow for your workload; often a reasonable size can be chosen up front to minimize this, for example.) But your point is of course valid.
Though I have seen vector-like structure implementations that combine arrays and linked lists, so that when you run out of space, it allocates a new array and links it to the end of the old array. I've implemented memory pooling systems like this myself. It works well, because as long as you know which bucket/array the element is in, you have random access, you also have good linear scanning performance, and extending the size doesn't require memmoves.
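A minimal sketch of that idea (simplified; a real implementation would also keep per-chunk counts or an index so you can find the right bucket):

    #include <cstddef>
    #include <memory>

    // Fixed-size arrays chained together: growing links a new chunk instead
    // of reallocating and copying, and scans within a chunk stay cache-friendly.
    template <typename T, std::size_t ChunkSize = 256>
    class chunked_list {
        struct chunk {
            T items[ChunkSize];
            std::size_t used = 0;
            std::unique_ptr<chunk> next;
        };
        std::unique_ptr<chunk> head = std::make_unique<chunk>();
        chunk* tail = head.get();

    public:
        void push_back(const T& v) {
            if (tail->used == ChunkSize) {  // out of space: link a new array
                tail->next = std::make_unique<chunk>();
                tail = tail->next.get();
            }
            tail->items[tail->used++] = v;
        }
    };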
I'm not saying that linked lists have no uses (my original comment about "never" was not meant entirely seriously - never is much too strong a word) - they certainly do - but I am saying that I feel they have much more limited use cases than most people seem to think.
I agree that intrusive linked lists have many advantages that non-intrusive linked lists do not have. The main advantage of intrusive data structures is, as you said, that membership in multiple structures lives in the object itself, allowing constant-time deletion from all structures.
Replying to your other comment:
> Deleting from the middle of the list is an extremely common requirement, when you have objects that need to be enumerable in some order[s] and sometimes deleted.
I don't know what work you do in C or C++ (in other languages, I happily use their list data types without caring whether they use linked lists under the hood, because chances are, if I cared about absolute performance, I'd have been using C or C++ to begin with), but in my own work, requiring both order and common insertion/deletion in the middle has been rare - or the elements and lists were so small that a linear scan and copy was fast. But I guess it depends on your work; your logic here certainly sounds reasonable to me.
> When you need indexing, use a vector. When you need quick add/remove, use lists. When you need both, use both.
Sure, you could use, say, a vector of elements that are themselves nodes in an intrusive linked list. Using a random access data structure to index another data structure is something I've done myself. I suppose I'm mostly arguing against the naive use of linked lists that seems to be rampant, because I think that most uses do not do all the interesting things you mention, in which case, IMHO, my points still stand and linked lists have a narrow use case.
> You don't need to jump around. Say you have a request object you found via a hash table. After handling it you decide to destroy it. You now want to delete it from various lists it is in. You can do it on all lists in O(1) for each list. This is relatively cache-local (only need 2 extra random accesses, potential misses, for each list).
See, this is the kind of thing that I was looking for when I asked you to elaborate. You're not really using the list as a convenient growable-array replacement (as you say yourself, you don't think of them as containers); you're using them as a secondary structure to maintain order in cases where middle removal or insertion is important (or maybe the other data structures are the secondary ones... whatever). I think I better understand what you meant all along now, and your reasoning makes sense to me. I can definitely see the use of this, and I agree that the problems of linked lists basically boil down to the following: they're seen as containers.
I think in that case, everything I said is totally true. I was definitely somewhat wrong about intrusive lists in the cases where some combination of these are true: another structure is used when lookup is needed, middle insertion/deletion is common, copying is expensive, order is important, elements are members of multiple lists where membership is commonly modified together.
> I regularly have my data objects within a hash table, a list, and a search tree simultaneously and efficiently, with no dynamic allocation to sustain any of that.
This sounds good - I don't think most people really do this though. The problem isn't just std::list (or std:: containers in general). Your original reply to me was "If you really think that, you've missed CS 101" but perhaps you missed CS 101, because in CS 101 they (in mine, and in any course notes and data structure books I've read) teach the common non-intrusive linked-list-as-container data structure, for which my points are completely valid. They're not teaching your use cases, or if they are, it's certainly not nearly as common.
EDIT: I wanted to add that all of this really depends on what you are storing to begin with. If you are only storing an integer or two, the overhead of the links will kill you and your cache efficiency. If on the other hand you are storing lots in each node, then a few extra pointers aren't going to do much. Also, if you're only storing a few bytes (or a few dozen), copying isn't all that expensive, and then whether lists or vectors are faster depends on access patterns, like I said in my previous comment. A lot of, e.g., my Python lists are lists of numbers. In high-performance C++ code, a lot of my vectors simply contained ids, numbers, points (2d or 3d), or similarly simple data.
It's nice to be able to have productive discussion on the internet!
We can agree on almost all points, I think.
The CS 101 bit you mention is a good point: intrusiveness matters for asymptotics in CS, and it should be taught in academia, not just in (very small portions of) industry.
By the way, when you mention "a vector whose elements are linked list heads", a much more typical scenario is an array/vector whose elements contain one or more list heads.
You were wondering about the kinds of software I was working on that needs this stuff: high-performance storage controllers. We need to maintain a lot of concurrency, handle a lot of kinds of failures, etc. We often require the same objects (e.g: an ongoing I/O request) to be looked up in various ways, associated with failure domains, timed out, etc. So we want it organized by many different data structures, and we need the O(1) of deleting it from the various structures when an I/O request dies.
We also shun dynamic allocations, aside from relatively rare memory pools for large structures that contain the various allocations we need. Intrusive style allows us so many nice things within this style:
* Avoiding the runtime costs of dynamic allocations
* Avoiding the memory costs of extra indirections incurred by dynamic allocations and STL (non-intrusive) style
* Avoiding handling out-of-memory errors in every single code path: there are almost no allocations anywhere, almost all functions become "void" error-free functions!
* Having optimal asymptotics for our operations
* Having reusable generic data structures in C without templates (we dislike C++)
Compared to all this, the STL stuff is just horrible. STL is widely considered to be a superb library, but I find it horrid.
> 1: you care about ordering - often you don't care about ordering and in that case, there is no need for a linked list because you can achieve O(1) insertion & removal then too: insertion can always be at the end, removal can be a "swap with end element, remove end element" operation.
Whether you care about ordering or not, linked lists work great. Adding to an array is only amortized O(1). It is worst-case O(N). Why pay O(N) for the worst case when you can pay a cheap O(1) always?
> 2: inserting and/or removing from the middle of the list is a common operation, if it is not a common operation, then the added cost of doing so with a vector may still be outweighed by a vectors other advantages
Deleting from the middle of the list is an extremely common requirement, when you have objects that need to be enumerable in some order[s] and sometimes deleted.
> 3: you do not require random access - lists do not provide random access and lookup is O(n). At the expense of additional complexity in implementation and more memory overhead, you could reduce lookup to, OTOH, O(log n) by using a skip-list
Here is a false dichotomy dictated by the STL. With intrusive lists, you can have both your vector and lists of various orders. There is no contradiction.
When you need indexing, use a vector. When you need quick add/remove, use lists. When you need both, use both.
> 4: you do not iterate through the list often - if you do, you are likely going to blow the cache and mess up prefetching due to poor cache locality. Iterating through an array-based data structure can be much faster in this case.
Yes, if you want quick (repeated) enumeration, put it in a vector. Again, this doesn't contradict also putting it in lists.
> The biggest issue they seemed to mention, though, was cache locality - I fail to see how intrusive linked lists solve this.
Cache locality is based on your use pattern. When enumeration isn't your common operation, vectors don't have better locality than lists. For example, a sorted list of requests where you may want to time out the oldest one will only refer to the head and tail of the whole list. Cache locality is great.
> then you end up jumping around the preallocated nodes anyway and after a while will lose any cache-friendliness you may have had.
You don't need to jump around. Say you have a request object you found via a hash table. After handling it you decide to destroy it. You now want to delete it from various lists it is in. You can do it on all lists in O(1) for each list. This is relatively cache-local (only need 2 extra random accesses, potential misses, for each list).
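A sketch of that deletion path, with hand-rolled circular nodes (the field names are made up):

    // Self-linked circular node: unlinking is four pointer writes, with no
    // traversal and no allocator traffic.
    struct list_node {
        list_node* prev = this;
        list_node* next = this;
        void unlink() {
            prev->next = next;
            next->prev = prev;
            prev = next = this;
        }
    };

    struct request {
        list_node by_age;   // chronological list, for timeouts
        list_node on_wire;  // requests currently on the physical wire
    };

    // r was found via a hash table; tearing it out of both lists is O(1) each.
    void retire(request* r) {
        r->by_age.unlink();
        r->on_wire.unlink();
        // ... recycle r ...
    }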
> C programmers are trained to use linked lists. They are the first variable-length containers most programmers come into contact with and so C programmers tend to be imprinted on them like ducklings. Linked lists are a poor general purpose container.
I think the false premise here, espoused by the STL, is that a data structure is a "container" at all. Your data can be "contained" by anything (the stack, the heap, a vector, ...). The data structures involved (linked lists, hash tables, search trees) all do not contain the data. They organize the data.
When using vectors as containers STL-style, you cannot really have your data organized by multiple data structures efficiently.
I regularly have my data objects within a hash table, a list, and a search tree simultaneously and efficiently, with no dynamic allocation to sustain any of that.
STL cannot do this because of the data-structure as "container" philosophy.
That answer is attributing the badness to linked lists, whereas it is fully std::list's. Linked lists are very useful, but it's hard to see that when their canonical (and almost only) implementation is std::list, which is indeed almost entirely useless.
There are two ways to use std::list:

* With an std::list::iterator in each of your data nodes that represents its own position in the list (this is called the "intrusive style")
* Without an std::list::iterator in each of your data nodes
If you use the (more common) latter form: whenever you have a reference to your own object, you cannot do any of the linked list operations without an O(N) penalty to go and re-find your element in the list!
i.e: Say you have a list of requests, and a timeout callback pops up with a pointer to your request; you cannot use that request pointer to do an O(1) deletion from the list. This kind of operation is what lists are for, and std::list canonically cannot do it.
Any other operation you might want to do relating to the list, given such a pointer, is equally impossible (e.g: create a new request that is immediately between the old request and its next).
All this assumes your object is within just one list. If it is within 2 lists, you have to use an extra indirection: std::list<request*>, which makes the problem worse. Even if you do find your request via one of the lists, there is no way to get an iterator for the other list. That means you cannot do any of the list operations on the other list. Again: this is the canonical thing lists were designed for, and std::list cannot do it.
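In code, the failure mode looks something like this (hypothetical names):

    #include <algorithm>
    #include <list>

    struct request { int id; };

    std::list<request*> by_age;

    // With no stored iterator, turning a request* back into a list position
    // is a linear scan before the "O(1)" erase can even begin:
    void on_timeout(request* r) {
        auto it = std::find(by_age.begin(), by_age.end(), r);  // O(N)
        if (it != by_age.end())
            by_age.erase(it);
    }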
Say that to solve this, you use the intrusive style with std::list. i.e: for every list this object is a member of, you hold an iterator inside your object.
Now the onus is on you to maintain these iterators, in addition to performing the common list operations. i.e: If you add an element, you need to both call std::list::insert and update the stored iterator.
Additionally, instead of paying with 2 pointers for each node, as an ordinary doubly-linked list should cost, you have to pay with an extra pointer or two (depending on how the iterator is implemented)!
If you use multiple lists with std::list<request*>, you pay with yet another extra pointer!
So if your data structure is within 2 doubly-linked lists, instead of paying the ideal 4 pointers per item, you pay those 4 + 1 (indirection) + 2 (for the two iterators). 75% memory overhead, ruining your cache lines.
The code will be a mess too, due to the duplicate maintenance.
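Concretely, this intrusive style with std::list ends up looking something like the following sketch (names made up):

    #include <list>

    struct request {
        // Each request carries its own position in every list it belongs to.
        std::list<request*>::iterator by_age_pos;
        std::list<request*>::iterator on_wire_pos;
    };

    // Insertion is two steps per list: the duplicate maintenance described above.
    void track(std::list<request*>& by_age, std::list<request*>& on_wire, request* r) {
        r->by_age_pos  = by_age.insert(by_age.end(), r);
        r->on_wire_pos = on_wire.insert(on_wire.end(), r);
    }

    // Erasing via the stored iterators is O(1), but every insert above
    // heap-allocated a node and every erase here frees one.
    void retire(std::list<request*>& by_age, std::list<request*>& on_wire, request* r) {
        by_age.erase(r->by_age_pos);
        on_wire.erase(r->on_wire_pos);
    }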
Use the intrusive approach exclusively, so that you don't need to call list::insert in addition to maintaining the two iterators; you only maintain the "iterators" (now called "list heads").
EDIT: almost completely forgot that std::list also uses new to dynamically allocate its nodes. This effectively means the cost of adding to these lists is many times greater than a simple list_add function. Even if you supply your own allocator, this is unnecessarily expensive, and by default it means that list::insert, like erase, is nowhere near the cheap O(1) pointer surgery it ought to be.
> whenever you have a reference to your own object, you cannot do any of the linked list operations without an O(N) penalty to go and re-find your element in the list!
Can you show an example of when you actually need to do this? Because when I need to do something like this it usually means that some container other than list is more fitting for the problem.
I gave an example: a request that is in multiple linked lists, e.g: one in chronological order for quick timing out of the oldest requests, and one of active requests waiting on the physical wire.
Now the timeout elapsed, so you have a pointer to a request that needs to be destroyed.
In that case, you typically use the very cheap O(1) list_del on each of the lists it's in. In STL style you pay O(N) for each list it is in. Or you conclude lists are worthless and another structure should be used. But no other structure would give you the incredibly cheap O(1) add and delete you get from lists.
Indeed, coming from Prolog/Erlang-style "most everything can be represented as a tail-call with a linked-list accumulator" programming, I'm very confused about what operations the GP is talking about. Adding/removing nodes at a position other than the head? Lookup by value? If you need these, you should be using a different data structure.
I used to put C/C++ in my curriculum until I realized that my C++ code was getting a lot less "C-like" each year. Now I list them separately, which also allows for a nice explanation whenever I'm asked about it in a job interview.
You are correct. I usually see this in CVs, and it usually means that the person is proficient in C++ but knows a little C as well.
But C/C++ can be a language. It could be code written in C++ that looks more like C than C++. Some people e.g. use character arrays instead of the string class, avoid the STL as much as possible, don't use the object-oriented aspects of the language, etc. The end result uses some C++ libraries and compiles on a C++ compiler but resembles C more than anything. I think it would be appropriate to call it C/C++.
I mostly agree with your observation regarding CVs and what seeing this in one tells about the person who wrote it. Although I like to think that most proficient C++ programmers respect the differences between the two languages and prefer to call C++ what it is, and mention C separately if it needs to be mentioned.
I don't think the style of your code should change the name of the language you claim to use. If it doesn't pass through a C compiler, it's not C or "C/anything".
Now it is quite possible to use the common subset of C and C++ to write code that conforms to both language specifications and compiles with a compiler for either language. If some project deliberately does this, I don't have an issue with calling it C/C++ to highlight the fact. This however really is quite rare from what I've seen, even if I can name some projects that do make such code (the Opus audio codec is one example).
C/C++ still isn't a language though, so I would very much prefer to call it C in a case like this.
A different scenario entirely would be a project with parts clearly written in two (or more) different languages, which could happen to be C and C++...
I was looking at an analysis of id Tech's Doom 3: BFG engine yesterday, and it's best described as this. The author described it as "C, with classes". Quite interesting, so I went and asked some game devs I know; apparently this is pretty standard!