This looks incredibly cool, I love the speed and minimalism. It doesn't seem as configurable and feature-packed as vifm[1], so I'll probably stick with that, but it certainly fills a good niche.
> nnn vs. ncdu memory usage in disk usage analyzer mode (400K files on disk):
I assume that's because nnn isn't keeping the entire directory structure in memory, which means that browsing to a subdirectory involves rescanning the entire directory. That's a fair trade-off, but an unfair comparison.
1. https://vifm.info/ - I'm surprised this hasn't been mentioned yet in this thread.
Author of `nnn` here. Thanks for the appreciation.
Please let us know what you like in vifm that isn't available in `nnn` and we will consider the features. However, I must say the last thing we want to see in `nnn` is feature bloat. You can extend it as you wish through scripts; `nnn` supports any number of them.
No, `nnn` scans the complete dir tree, otherwise du won't work. It rescans because data on disk keeps changing, e.g. on our server where 17 people run VMs. It could easily be changed to a static snapshot, but the rescan is very fast and users specifically asked for it. The memory usage is less because `nnn` uses much less memory for all operations in general.
I appreciate your work, but you're not being very honest with your claims.
nnn is not keeping information about 400K files in memory in that benchmark. As a result, the rescan is necessary when changing directory. The rescan may be fast in many cases and in some cases it may even be what you'd want, but I can also name many cases where you certainly won't want it (large NFS mounts being one example).
Sorry for the pedantry. I spent a fair amount of time optimizing ncdu's memory usage, so I tend to have an opinion on this topic. :)
I think we are saying the same thing in different lingo. I am trying to say, you do not need to store it if you can have fast rescans.
As for memory usage: if you store the sizes of every file, you need 400K * 8 bytes = ~3 MB.
Now `ncdu` uses ~60 MB and `nnn` uses ~3.5 MB. How do you justify that huge gap?
> but you're not being very honest with your claims
No, I am completely honest within the limits of my technical understanding. Your tool uses 57 MB extra, which would be considerable on a Raspberry Pi model B. To an end user, it's not important how a tool shows the du of `/`; what's important is whether the tool is reasonable or not. I don't know how `ncdu` manages the memory within, I took a snapshot of the memory usage at `/`.
In fact, now I have questions about your very first line beginning with `This looks incredibly cool`, followed by comparisons with different utilities in a negative light. (I must be a fool for only realizing it now; I should have seen it coming.)
And I'm saying you can't have fast rescans in all cases - it very much depends on the filesystem and directory structure.
I'm not trying to downplay nnn - I meant it when I said it's a cool project! I'm saying each project has its strengths and weaknesses, but your marketing doesn't reflect that (or I missed it).
ncdu's memory usage is definitely its weak point - that's not news to me - but it's because I chose the other side of the trade-off: No rescans. If you're curious where that 60MB goes to, it's 400K of 'struct dir's: https://g.blicky.net/ncdu.git/tree/src/global.h?id=d95c65b0#...
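For a rough sense of why a resident tree costs that much per entry, here is an illustrative node layout - field names are mine, not ncdu's actual struct (see the linked global.h for the real one):

    #include <stdint.h>

    /* Illustrative only, not ncdu's actual definition: a node that keeps the
     * whole tree resident needs tree pointers, sizes, counters and the name,
     * so each of the 400K entries easily costs on the order of 100+ bytes
     * once allocator overhead is included. */
    struct dir_node {
        struct dir_node *parent;
        struct dir_node *next;   /* next sibling */
        struct dir_node *sub;    /* first child */
        int64_t size;            /* apparent size */
        int64_t asize;           /* allocated size (disk blocks) */
        int64_t items;           /* number of entries below this node */
        unsigned char flags;
        char name[];             /* flexible array member, allocated with the node */
    };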
You're being very snarky considering how quick you were to start the debate, boasting in your GitHub README about how much better optimised your tool was.
I grow rather tired of comparisons where one tool tries to make itself look better than another based purely on a solitary arbitrary metric like memory usage. It's not a fair benchmark and really it's just an excuse to make yourself look better by bad-mouthing someone else's code.
What's to say the other tools haven't employed the same algorithms you vaguely stipulated you had (I say "vaguely" because you don't even state which highly optimised algorithms you've used)? Have you read the source code of the other projects you've commented on to check they're not doing the same - or even better? Because your README is written like you're arguing that other projects are not optimised.
What’s to say that the larger memory usage (which is still peanuts compared to most file managers) isn’t because the developer optimised performance over literally just a few extra KB of system memory? Their tool might run circles around yours for FUSE file systems, network mounted volumes, external storage devices with lower IOPS, etc.
But no, instead you get snarky when the author of one of the tools you were so ready to dismiss as being worse starts making more detailed points about practical, real-world usage beyond indexing files on an SSD.
It wasn't a debate. You asked, I answered. And if you read carefully, there is _not_a_single_comment_ on the quality of a single other utility in the README. We recorded what we saw and I have shared the reason why.
I am not going to respond any further and would appreciate it if you refrain from getting personal with "not being completely honest", "being very snarky" etc. Please don't judge me by the project page of a utility which is a work of several contributors. That's all.
Your readme has a performance section, and that section focuses on nnn vs two other tools. You only benchmark memory usage under normal circumstances (i.e. no other performance metric, no other file systems or device types, etc.). Then you have a whole other page dedicated to "why is nnn so much smaller", which is directly linked to from the performance comparisons. There's no other way to take that: you're directly comparing nnn to other tools and objectively saying it's better.
So with that in mind, I think the developers of the other tools are totally within their rights to challenge you on your claims.
Edit: the "multiple contributors" point you made is also rather dishonest. It's your personal GitHub account for a project you chiefly contribute to, and the documents in question were created and edited by yourself (according to git blame). Yes, nnn has other contributors too, but it was you who wrote and published the claims being questioned.
> totally within their rights to challenge you on your claims
Yes, and within the limits of common courtesy.
The other utility does only one thing - report disk usage - so there's not much to compare. The dev did mention that `ncdu's memory usage is definitely its weak point`.
> no other performance metric, no other file systems or device types
because lstat64() is at the core of the performance of the feature we are comparing here, and with the same number of files on the same storage device the number of accesses is exactly the same. The only metric that differentiates the utilities is memory usage.
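To make that concrete, a du-mode scan in any of these tools boils down to something like the sketch below (not nnn's actual code, just an illustration that the lstat work is identical for the same tree):

    #define _XOPEN_SOURCE 700
    #include <ftw.h>
    #include <stdio.h>
    #include <sys/stat.h>

    /* Rough sketch of a du-style scan (not nnn's actual code): walk the tree
     * and sum the blocks reported for each entry. The number of stat calls
     * is fixed by the number of files, whichever tool issues them. */
    static unsigned long long total_blocks;

    static int add_entry(const char *path, const struct stat *sb,
                         int type, struct FTW *ftwbuf) {
        (void)path; (void)type; (void)ftwbuf;
        total_blocks += sb->st_blocks;   /* 512-byte blocks actually allocated */
        return 0;                        /* keep walking */
    }

    int main(int argc, char **argv) {
        const char *root = argc > 1 ? argv[1] : ".";
        if (nftw(root, add_entry, 64, FTW_PHYS) == -1) {  /* FTW_PHYS: don't follow symlinks */
            perror("nftw");
            return 1;
        }
        printf("%llu KB\n", total_blocks / 2);
        return 0;
    }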
> Edit: the "multiple contributors" point you made is also rather dishonest.
Not really, I prefer to edit the readme myself because I want to keep the documentation clean. You will see features contributed by other devs for which I have written the docs from readme to manpage. Regarding the metrics, sometimes I have taken the data and sometimes I have requested someone else to collect it. Or doesn't that count as contribution?
What I actually care about most in a file manager is how it performs on mounts with low IOPS and how gracefully it handles timeouts and other exceptions.
RAM is cheap and any file manager will be snappy on an SSD. But edge cases are where most file managers fall apart yet are situations where you might need to depend on your tools the most.
However, now that I understand the point of this project was purely to optimise for memory usage, I can better understand the arguments you were making.
> or doesn't that count as contribution?
Not in this case, no. You published it, so you’re still ultimately accountable for it.
You cannot request figures then play the “nothing to do with me guv’” card when someone queries the benchmarks that you subsequently published. At best it comes across as an unlikely story; at worst you’re still complicit.
This is the wrong mindset. RAM is only cheap if you don't use it. As soon as you go just 1 byte over the maximum RAM it turns into the most precious resource of the entire computer.
If an app uses more memory than another then it is not better because RAM is cheap. It is better because it provides more or higher quality features at a reasonable cost of increased memory usage. But at the same time it is also worse for people who do not need those features.
Here is an example: when I launch the graphical file manager Nautilus it consumes roughly 26 MB of RAM showing my home folder, but when I go to my "Pictures" folder it suddenly shoots up to 300MB. There is a reason for that and it is not "RAM is cheap"; if that were the case it would always use 300MB regardless of what I do with it (Electron apps are a major offender here). Nautilus consumes that much RAM because it has more features, like displaying 1000 thumbnails of all those pictures.
Now this feature would get in my way if I set up something like a Raspberry Pi Zero to take a photo every hour. Nautilus will crash because it needs too much memory to display the thumbnails.
I agree from an idealistic point of view (I've often made the same argument myself with regards to non-native applications and self-indulgent GUIs), but you're missing the context of the argument here.
We're not talking about burning through hundreds of megabytes (nor even gigs) on a pretty UI that adds no extra function; we are talking about only a few megabytes to save stressing low bandwidth endpoints.
It isn't 1990 any more, sacrificing a few megabytes of RAM in favour of greater stability is very much a worthwhile gain in my opinion. Hence why I say RAM is cheap - we don't need to cut every corner just to save a few bytes here and there.
In the example we were discussing, the idle RAM usage was higher because it doesn't hammer low-bandwidth endpoints with frequent rescans. Caching is actually a pretty good use for spare RAM - your kernel does it too. So we are not talking about anything out of the ordinary here, and we're certainly not talking about wasting RAM for the sake of wasting RAM. We're talking about trading some cheap RAM for the sake of potentially expensive (computationally speaking) access times.
However I do feel I need to reiterate a point I've made throughout this discussion: there is no right or wrong way; it's just a matter of compromise. E.g. the goals for an embedded system would be different from the goals for a CLI tool on a developer's laptop.
> then play the “nothing to do with me guv’” card when someone queries the benchmarks that you subsequently published
No, you are cooking things up. I did respond as per my understanding. My statement was very clear - "_Please don't judge me_ by the project page of a utility which is a work of several contributors."
I had problems with the _personal remarks_. And I am not surprised you chose to ignore that and describe it as if my problem were with someone challenging the benchmarks.
I am yet to come across figures that can challenge the current one.
I have no idea how people convince themselves to contribute to open source and/or participate regularly in online discourse. It’s basically working really hard for the easiest-to-offend and least-likely-to-appreciate-anything-you-do people on earth, so they can throw shade at everything you do, and then get mad because you did something different than they would have done.
FWIW, I appreciate everything you all are doing, even the stuff I’m not using right now. You all don’t hear it enough, and you’re certainly not getting paid enough for what you’re doing.
It's actually not that hard to do so without talking trash about other projects. In fact I find those kinds of comparisons are often the laziest and least researched ways of promoting a particular project, as anyone who's spent any time dissecting other people's work in detail (not just running 'ps' in another TTY) will usually gain an appreciation for the hard work and design decisions that have gone into competing solutions.
But that’s just the opinion of one guy who has been contributing to the open source community for the last two decades. ;)
Cleverly fabricated and blown out of proportion. Yes, 2 decades of rich experience sure teaches that!
To clear the context for others, you are talking about a list of performance numbers here, and as I said, I am yet to come across figures that can challenge the current one.
Actually what he said was he knew memory was a problem but it is designed for a different use case. The key part being the bit you’ve conveniently left off.
Your comments come across as very arrogant (particularly considering you've not reviewed the other projects' source code) and intentionally deceptive too. While I'm normally the first to defend open source developers - because I know first hand how thankless it can be at times - you're doing exactly to others what you ask not to be done to yourself. Which is not fair at all.
Anyhow, this is all a sorry distraction from what should have been the real point of this HN submission, and I'm angry at myself for allowing myself to get dragged into this whole sorry affair. So I'm going to log off HN for the rest of the day to avoid any further temptation.
I do sincerely wish you the best with this and your other projects. It should be the code that speaks for itself rather than any other politics, and that's not been the case here for either your project or the others it has been compared against. But let's move on and get back to writing code :)
> Actually what he said was he knew memory was a problem but it is designed for a different use case. The key part being the bit you’ve conveniently left off.
No, read the other thread where I said his use case can be satisfied at much less memory usage. There's nothing arrogant in suggesting there's a more efficient way to achieve something.
Yes, let's get back to work. Loads of feature requests to cater to, thanks to this share. :)
It's about how and when they get used. E.g. if you're running on a lower-performing mount then you might want to rely on caches more than frequent rescans. What I would often do on performance-critical routines running against network mounts was store a tree of inodes and timestamps and only rescan a directory if the parent's timestamp changed. It meant I'd miss any subsequent metadata changes to individual files (mtime, size, etc.) but I would capture any new files being created, which is the main thing I cared about, so that was the trade-off I was willing to make.
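A skeleton of that scheme (names are mine, not from any particular tool, and it assumes directory mtimes are reliable on the mount in question):

    #include <stdbool.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    /* One cached node per directory: rescan only when the directory's own
     * mtime has changed. New/removed entries bump the parent's mtime, so
     * they are picked up; metadata-only changes to existing files are
     * deliberately missed - that's the trade-off described above. */
    struct dir_cache {
        ino_t  inode;
        time_t mtime;        /* mtime recorded at the last scan */
        /* ... cached entry list would live here ... */
    };

    static bool needs_rescan(const char *path, struct dir_cache *c) {
        struct stat sb;
        if (lstat(path, &sb) == -1)
            return true;                 /* can't stat: play safe and rescan */
        if (sb.st_ino != c->inode || sb.st_mtime != c->mtime) {
            c->inode = sb.st_ino;
            c->mtime = sb.st_mtime;
            return true;                 /* directory changed since last scan */
        }
        return false;                    /* cache still treated as fresh */
    }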
There’s no right or wrong answer here though. Just different designs for different problems. Which was also the point the other developer was making when he was talking about his memory usage.
The processor cache plays an important role which you are ignoring.
External storage devices: most of the time they are write-back and even equipped with read-ahead. Yes, I know there are some exceptions but if you are write-through non-read-ahead you _chose_ to be slow in your feedback already and this discussion doesn't even apply.
Network mounts: cache coherency rules apply to CIFS as well. And again, if you _choose_ to ignore/disable, you are OK to be slow and this discussion does not apply.
If `nnn` takes n secs the first time, another utility will take around the same time on the first startup (from a cold boot).
Now the next scans, where you go into subdirs, would be much faster even in `nnn` due to the locality of the cached information about the files (try it out). The CPU cache already does an excellent job here. And if you go up, both `nnn` and the other utility would rescan.
> point the other developer was making
Yes, he was saying - my memory usage may be 15 times higher because of storing all filenames (in a static snapshot!!!), but you are dishonest if you show the numbers from `top` output without first reading my code to educate yourself about my utility.
I'm not sure what you mean by processor cache here. The processor wouldn't cache file system meta data. Kernels will, but that is largely dependent on the respective file system driver (e.g. none of the hobby projects I've written in FUSE had any caching).
Different write modes on external hardware also confuse the issue, because you still have slower bus speeds (e.g. USB 2.0 on an older memory stick) to and from the external device than you might have with dedicated internal hardware.
> The processor wouldn’t cache file system meta data
What _file system meta data_? The processor doesn't care what data it is! I am talking about the hw control plane and you are still reaching for pure software constructs like the kernel and its drivers.
All the CPU cares about is the latest data fetched by a program. The CPU cache exists to store program instructions and _data_ (no matter where it comes from) used repeatedly in the operation of programs, or information that the CPU is likely to need next. If the data is still available in the cacheline and isn't invalidated, the CPU won't fetch it from an external unit (so a bus request is not even made). _Any_ data coming to the CPU sits in an Ln cache, source notwithstanding. External memory is accessed in case of cache misses. However, the metadata these utilities fetch is very small, so the probability of a miss is greatly reduced. Moreover, your hypothetical utility also banks on the assumption that this data won't change and that it won't have to issue too many rescans to remain performant.
It's the same thing you see when you copy a 2 GB file from a slow storage to SSD and the first time it's slow but the next time it's way faster.
You can see it for yourself. Run `nnn` on any external disk you have (preferably with a lot of data), navigate to the mountpoint, press `^J` (1. notice the time taken), move to a subdir, come back to the mountpoint again (2. notice the time taken). You would see what I mean.
> none of the hobby projects I’ve written in FUSE had any caching
On a side note (and though not much relevant here), all serious drivers (e.g. those from Tuxera, btrfs) maintain a buffer cache (https://www.tldp.org/LDP/sag/html/buffer-cache.html). They always boost performance. If our Ln misses, this is where we would get the metadata from and _hopefully_ not from the disk, which is the worst case.
Yeah, that's the kernel caching that (as I described), not some hardware-specific thing the CPU is doing. I'm not disagreeing with you that L1 and L2 cache does exist on the CPU (and L3 in some instances), but it is much too small to hold the kind of data you're suggesting. It's really the kernel's freeable memory (or the file system driver - e.g. in the case of ZFS) where file system data - including file contents - will be cached. The CPU cache is far too valuable for application data to fill it with file system meta data (in my personal opinion; you might override that behaviour in nnn but I'd hope not).
However regardless of where that cache is, it’s volatile, it’s freeable. And thus you cannot guarantee it will be there when you need it. Particularly on systems with less system memory (remember that’s one of your target demographics).
If you wanted though, you could easily check if a particular file is cached and if not, perform the rescan. I can't recall the APIs to do that off-hand, but it would be platform-specific and not all-encompassing (e.g. the ZFS cache is separate from the kernel's cache on Linux and FreeBSD), so it wouldn't be particularly reliable nor portable. Plus, depending on the syscall used, you might find it's more overhead than an actual refresh in all but a rare subset of edge cases. As an aside, this is why I would build my own cache. Sure, it cost more RAM, but it was cheaper in the long run - fewer syscalls, less kernel/user-space memory swapping, easier to track what is in cache and what is not, etc. But obviously the cost is that I lose precision with regard to when a cache goes stale.
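For what it's worth, one such check on Linux is mmap() plus mincore(2). A hedged sketch follows, and it rather proves the point above: it costs extra syscalls, it's Linux-specific, and it's blind to caches the kernel doesn't manage (e.g. the ZFS ARC).

    #define _DEFAULT_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Linux-specific sketch: count how many pages of a file are currently
     * resident in the page cache. Not portable, needs several syscalls, and
     * says nothing about caches outside the kernel's page cache. */
    int main(int argc, char **argv) {
        if (argc != 2) return 1;
        int fd = open(argv[1], O_RDONLY);
        struct stat sb;
        if (fd == -1 || fstat(fd, &sb) == -1 || sb.st_size == 0) return 1;

        void *map = mmap(NULL, sb.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) return 1;

        long pagesz = sysconf(_SC_PAGESIZE);
        size_t npages = ((size_t)sb.st_size + pagesz - 1) / pagesz;
        unsigned char *vec = malloc(npages);
        if (!vec || mincore(map, sb.st_size, vec) == -1) return 1;

        size_t resident = 0;
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1;      /* low bit set = page is in core */
        printf("%zu of %zu pages cached\n", resident, npages);
        return 0;
    }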
While on the topic of stale caches, I've actually run into issues on SunOS (or was it Linux? I used a lot of platforms back then) where contents on an NFS volume could be updated on one host but various other connected machines wouldn't see the updated file without doing a 'touch' on that file from those machines. So stale caches are something that can affect the host as well.
> but it is much too small to hold the kind of data you’re suggesting
My toy has a 3 MB L1 cache, 8 MB L2. You'll notice from the same figures that the resident memory usage of `nnn` was 3616 KB. And in the default start (not du) right now the figure is:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
23034 vaio 20 0 14692 3256 2472 S 0.0 0.0 0:00.00 nnn -c 1 -i
So it required around 350 KB for `du`. Oh yes, it can be cached at will.
> check if a particular file is cached and if not
I am a userland utility. I don't add deps on fs drivers (and maintain redundant code... I discourage code bloat as much as feature bloat) when standard APIs are available to read, write, stat.
> However regardless of where that cache is, it’s volatile, it’s freeable.
Yes!!! The worst case! And the worst case for you is where all dirs have changed. There you issue a rescan, `nnn` issues a rescan. `nnn` is accurate because it doesn't omit independent files for _personal_ optimization preferences. You do. You show _stale_, _wrong_ info in the worst case.
3MB is not enough to store the kind of data you're talking about. In your example you were copying files from an SSD to a removable device; what happens if that file is 4MB big? Or 20MB?
Some program executables are multiple megabytes in size these days, your CPU cache is much more valuable storing that than it is storing a random 20MB XLSX file you happened to copy half an hour earlier. ;)
> You do. You show wrong info in the worst case.
Yes, I'd repeatedly said that myself. However I was presenting a case study rather than giving a lecture on how your tool should be written. Remember when I said "there is no right or wrong answer"? (You have ignored my point that you cannot guarantee the kernel's / fs driver's cache isn't also stale, though. That does happen in some rare edge cases.)
My point is you shouldn't be so certain that your method is the best, because there are plenty of scenarios where accuracy and low memory usage are more costly (e.g. the fastest file-copying tools are those that buffer data - costing more RAM during usage). Plus let's not forget that syscalls are expensive too. But that's the trade-off you take and that is what is advantageous for the purpose you're targeting nnn for.
So there really isn't a right or wrong answer here at all, and memory usage is in fact only a very small part of the equation (which is what the other project's author was trying to say).
> Some program executables are multiple megabytes in size these days
I have shared the figures from my system right now in my earlier comment. The memory required for the meta info is extremely cacheable.
And yes, let's call it a day now. In fact, I am happy we had a very informative discussion towards the end. This is what I expected from the other dev and you from the very beginning. I am always ready for a productive, non-abusive technical discussion. And if I find someday that `ncdu` takes comparable or less memory than `nnn` I will update that data verbatim as well. But I wouldn't take an uncalled-for shitstorm in public forums from strangers lying low.
But that's what you're storing in the system memory, not CPU cache. OK, let's say the kernel does cache recently accessed memory in the L2 cache; you're still competing with everything else - particularly in a multi-user or multi-threaded system. So that is not going to stay in L2 for long.
You’re making so many assumptions about the host platform and they simply don’t stack up for the majority of cases - it’s just that most systems are fast enough to hide your incorrect assumptions.
Also your understanding of how file system caching works is completely off. Granted that’s not something most normal engineers need to worry about, but given the claims you’re making about nnn, I would suggest you spend a little time researching the fundamentals here.
Edit:
> I am happy we had a very informative discussion towards the end. This is what I expected from the other dev
That was exactly what the other dev was doing. You just weren't ready to listen to him - which is why I started making comments about your "snarky" replies. ;)
But you're not proving that the data you are expecting to be in L2 actually ends up in L2, let alone that it persists long enough to be recalled from L2 when you do your rescans. Your test is crude and thus doesn't prove anything. Which is what we have all been telling you right from the start!!!!
You cannot just throw numbers up and say "look, I'm right" if the test is flawed from the outset.
But yes, it’s probably better we do call it a day.
Maybe you shouldn't store the file path but just the name, and a parent pointer. That brings you down to 8 bytes of size + a parent pointer + a short string. For the string you can use offsets into a string pool (a memory chunk containing zero-terminated strings).
So I think 50 bytes per file is easy to accomplish if (name/parent/path) + size is all you want to cache. For a speed-up, I would add another 4- or 8-byte index to map each directory to its first child.
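A sketch of that layout (illustrative, not taken from either tool): roughly 20-24 bytes of fixed data per entry plus the name bytes in a shared pool.

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative layout for the scheme above (not taken from either tool):
     * per entry, an 8-byte size, a parent index, an offset into a shared
     * pool of zero-terminated names, and an optional first-child index. */
    struct entry {
        uint64_t size;         /* file size in bytes */
        uint32_t parent;       /* index of the parent entry */
        uint32_t name_off;     /* offset of the name in the string pool */
        uint32_t first_child;  /* optional: index of the first child */
    };

    struct snapshot {
        struct entry *entries; /* one per file/directory */
        char         *pool;    /* all names, back to back, zero-terminated */
        size_t        n_entries;
        size_t        pool_len;
    };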
I can think of at least 3 possible algorithms to use much much less memory even with a static snapshot of the complete subtree. And all of them are broken because the filesystem is supposed to stay online and change. It's realistically useful to scan an external disk to find the largest file etc., but not accurate on a live server, a desktop with several ongoing downloads, video multiplexing etc.
Earlier in the thread you suggested that it's hard to justify ncdu using 60MB while it takes only 3.5MB to store 400K * 8-byte numbers. The number you came up with is just silly and overlooks the actual complexity of the problem.
Given that you are making an implicit judgement about the other program, don't be sloppy with your estimates.
I'm not. You can, and I'm sure eventually you will arrive very close to the approximation.
I'd been a big-time fan of `ncdu` for years and even wrote about it in an e-journal once. Maybe that's why the sharp adjectives became more difficult to digest. Anyway, good luck!
There was a similar talk[1] at FOSDEM, where the speaker describes how, as an experiment, he replaces a full ELK stack plus other monitoring tools with PostgreSQL. He even goes as far as implementing a minimal logstash equivalent (i.e. log parsing) into the database itself.
It wasn't a "we do this at scale" talk, but I'd love to see more experiments like it.
For the impatient: Skip to 17 minutes into the video, where he describes the previous architecture and what parts are replaced with Postgres.
> It wasn't a "we do this at scale" talk, but I'd love to see more experiments like it.
Well, I will be conducting such an experiment in the near future. From the ELK stack, I never used Logstash in the first place and used Fluentd instead (and now I'm using a mixture of my own data forwarder and Fluentd as a hub). I'm planning mainly to replace Elasticsearch, and will probably settle on a command line client for reading, searching, and analyzing (I dislike writing web UIs).

All this because I'm tired of elastic.co. I can't upgrade my Elasticsearch 1.7.5 to the newest version, because then I would need to upgrade this small(ish) 4MB Kibana 3.x to a monstrosity that weighs more than the whole Elasticsearch engine itself, for no good reason at all. And now that I'm stuck with ES 1.x, it's only somewhat stable; it can hang for no apparent reason at unpredictable intervals, sometimes three times per week, and sometimes working with no problem for two months. And to add insult to injury, processing logs with grep and awk (because I store the logs in flat files as well as in ES) is often faster than letting ES do the job. I only keep ES around because Kibana gives a nice search interface and ES provides a declarative query language, which is easier to use than building an awk program.
> He even goes as far as implementing a minimal logstash equivalent (i.e. log parsing) into the database itself.
As for parsing logs, I would stay away from the database. Logs should be parsed earlier and be available for machine processing as a stream of structured messages. I have implemented such a thing using Rainer Gerhards' liblognorm and I'm very happy with the results, to the point that I derive some monitoring metrics and have been collecting inventory from logs.
> I can't upgrade my Elasticsearch 1.7.5 to the newest version, because then I would need to upgrade this small(ish) 4MB Kibana 3.x to a monstrosity that weighs more than the whole Elasticsearch engine
...is that really a good reason to reinvent this whole solution, though? You're basically saying you're going to spend the time to replace your entire log storage/analysis system because you object to the disk size of Kibana. (Which, without knowing your platform specifically, looks like it safely sits under 100 megs).
The rest of your complaints seem to stem from not having upgraded Elasticsearch, aside from possibly hitting query scenarios that continue to be slower than grep after the upgrade.
Maybe I'm misunderstanding your explanation, but if I'm not this sounds like a lot of effort to save yourself tens of megs of disk space.
> ...is that really a good reason to reinvent this whole solution, though?
The system being dependency-heavy and pulling in an operationally awful stack (Node)? Yes, this alone is enough of a reason for me. And I haven't yet mentioned other important reasons, like memory requirements and processing speed (less than satisfactory), elasticity of processing (ES is mostly a query-based tool, and whatever pre-defined aggregations it has, it's too constrained a paradigm for processing streams of logs), and me wanting to take a shot at log storage, because our industry doesn't actually have any open source alternative to Elasticsearch.
> Kibana. (Which, without knowing your platform specifically, looks like it safely sits under 100 megs).
Close, but missed. It's 130MB unpacked.
> Maybe I'm misunderstanding your explanation, but if I'm not this sounds like a lot of effort to save yourself tens of megs of disk space.
I'm fed up with the outlook of the whole thing. Here, ridiculous disk space for what the thing does; there, slower-than-grep search speed; elsewhere, something that barely keeps up with the rate I'm throwing data at it (a single ES instance should not lose its breath under just hundreds of megabytes per day); an upgrade that didn't make things faster or less memory-consuming, but failed to accept my data stream (I was ready to patch Kibana 3.x for ES 5.x, but then I got bitten twice in surprising, undocumented ways and gave up, because I lost my trust that it won't bite me again).
Sorry, but no, I don't see Elasticsearch as a state-of-the-art product. I would gladly see some competition for log storage, but all our industry has now is SaaS or paid software. I'm unhappy with this setting and that's why I want to write my own tool.
The problem you run into is "we need some more information that is in the logs but we didn't think to parse before." Here PL/Perl is awesome because you can write a function, index the output, and then query against the function output.
That's one reason I always store the full source data in the db.
> The problem you run into is "we need some more information that is in the logs but we didn't think to parse before."
Agreed, though with liblognorm rules you just shove every variable field into a JSON field and that mostly does the job. And as for the case you were talking about, logs with no matching rules: liblognorm reports all unparsed logs, and my logdevourer sends them along with the properly parsed logs, so no data is actually omitted.
Oh yes it is. The rule syntax is nice and is a big improvement over the regexps that are popular with almost every other log parser out there, but the best thing is that if your rules fail, liblognorm reports precisely which part of the log could not be consumed, not just the fact that none of the rules matched.
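For readers who haven't seen it, a v1-style liblognorm rule looks roughly like the line below - the exact field-type names vary between versions, so treat this as a sketch rather than copy-paste material:

    # matched fields come out as JSON; lines matching no rule are reported as unparsed
    rule=sshd,auth: Accepted %authmethod:word% for %username:word% from %fromip:ipv4% port %port:number% ssh2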
Liblognorm has only one major user: rsyslog, for which it was written, but at some point I thought that it would be nice to have a separate daemon that only parses logs, so I wrote logdevourer (https://github.com/korbank/logdevourer).
I'm surprised that even the fastest implementation needs 20+ms to search a 1000-element array. I would expect even a linear search to finish within 1 or 2ms with such a small data set. How large are the array elements? How were the times measured?
EDIT: Oh the time measured is the total of 1,000,000 random lookups? Nevermind my confusion then, that would certainly explain it.
Technologies: C, Perl, PostgreSQL, Apache, lots of Linux stuff
Résumé/CV: Mail me for a full CV; check out my OSS projects in the meantime: http://dev.yorhel.nl/
Email: contact@yorhel.nl
Recently finished my Master's degree in Embedded Systems, and still figuring out the exact direction I'm heading in. Experienced (non-professionally) as an (embedded) software dev, back-end web dev, and Linux sysadmin (websites/email). I have a broad interest in technology and am a quick learner, so I'm open to any offer that is relevant to the above listed technologies. :)
The bit that invokes youtube-dl is a Lua plugin. It checks if the URL starts with http:// or https://, and if so, invokes youtube-dl -J on the command line to dump JSON information about the video.
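For anyone who wants to poke at the same data by hand, the underlying invocation is just the following (the URL here is a placeholder):

    youtube-dl -J 'https://example.com/some-video' > info.json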
SEEKING WORK - Remote or near Enschede, Netherlands
Available for any C (not C++, sorry) development where performance and efficiency matter. Love working with low level stuff, embedded systems, network protocols, security, and making it all work together.
It's new, yes. I wrote it about two months ago because I needed a parser and found the existing solutions too bloated.
> It doesn't allocate or buffer, so when it says that it verifies proper nesting, I assume that doesn't mean verifying that start tags and end tags have matching names. I think that would be impossible.
It does match start tags and end tags, and in order to do so it does, indeed, need a small buffer. However, the only thing that needs to be stored in this buffer is a stack of element names, and that rarely needs much space - a simple fixed-size buffer is sufficient. It can be safely allocated on the stack if the application wants to. See the documentation for yxml_init() for more details. The "mostly-bufferless" part of the homepage refers to the lack of a complicated input buffer and the avoidance of complex buffer management for output where possible.
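A minimal usage sketch, assuming the function and constant names I remember from the yxml documentation (double-check against yxml.h before relying on them):

    #include <stdio.h>
    #include "yxml.h"   /* ship yxml.c/yxml.h with your project */

    /* Sketch: parse stdin one byte at a time and print element names.
     * The fixed `stack` buffer is the one discussed above - it only has
     * to hold the names of the currently open elements. */
    int main(void) {
        static char stack[512];
        yxml_t x;
        yxml_init(&x, stack, sizeof(stack));

        int c;
        while ((c = getchar()) != EOF) {
            yxml_ret_t r = yxml_parse(&x, c);
            if (r < 0)                      /* negative return = parse error */
                return 1;
            if (r == YXML_ELEMSTART)
                printf("element: %s\n", x.elem);
        }
        return yxml_eof(&x) < 0 ? 1 : 0;    /* document must end cleanly */
    }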
> It looks like you pass the parse function a character at a time. To be fast, you'd want to somehow encourage that to get inlined.
The function is quite large. You can probably get it inlined when put inside a header file and when it's only called in a single place, and you may indeed see a performance boost. I haven't measured how significant such an improvement may be.
However, if performance is the primary goal (it wasn't for yxml), a completely different design would probably be better. Yxml needs to manage an explicit state object for each input character. For each new character it has to look at the state to know how it's handled, handle it, and then update the state for the next character. I was honestly surprised to see that yxml performed quite well in my benchmarks.
A more efficient design is probably the goto-based state machine approach that, e.g. ragel[1] can output. The reason I didn't go for that approach is that it complicates buffering. You either need to have the full file mapped into memory (not possible when reading from a socket or from a compressed stream), or you need to implement your own buffering and provide hooks so that the application can fill the buffer when it is depleted. The latter approach is used by most other XML parsers.
I was interested in yours specifically because it's the only one I've seen that doesn't buffer (almost). You might as well drop the last little buffer and be completely bufferless. Then it will never cause an error or block, and you'll have no dependencies. If you want to support validation, make a higher-level interface to this and do it there. It will be just as easy.
Hmm? The stack buffer in yxml will never cause parsing to block, and there's no added dependencies on... anything? I don't think that buffer causes any problems even on a size-restricted microcontroller that doesn't have malloc(). As long as you can find ~512 bytes or so of free memory you can parse a lot of files.
The only situation in which that buffer would cause an error is when the application used too small a buffer, or when the document is far too deeply nested or has extremely long element/property names. Both the maximum nesting level and the name lengths should, IMO, be limited in the parser in order to protect against malicious documents. Most parsers have separate settings for that; yxml simplifies that by letting the application control the size of a buffer.
The stack buffer in yxml is also used to make the API a bit easier to use. With the buffer I can pass element/property names as a single zero-terminated C string to the application, without it I would have to use the same mechanism as used for attribute values and element contents, and that mechanism isn't all that easy to use. (This is the one case where I chose convenience over simplicity, but I kinda wanted the validation anyway so that wasn't really a problem)
Ah. I have not yet looked too far into how yxml works. I never came up with a perfectly satisfactory solution to the zero-buffer problem myself, but you've hit on a lot of the things that make it a hassle either way. It's almost impossible to do a usable xml parser under the assumption that it will not buffer and it will not be guaranteed access to more than one character at a time. I started developing one like that based on a goto-driven state machine, but I stopped working on it, because the interface was going to be too inconvenient.
What I meant about blocking was blocking on malloc. It sounds like you're expecting the caller to take care of allocation?
Agree with the article; Linda is great, both the name and the project. It was one of the more interesting topics of a Distributed Systems course that I followed, but I never ended up using it (or its concepts) in practice due to the performance overhead.
Please do use autotools! You only need two files: configure.ac and Makefile.am, both at the top-level of your project. The autogenerated stuff can be ignored, you don't have to learn M4 to use autoconf. And (if you are a bit careful, but that's not too hard) you'll have many nice features such as out-of-source builds, proper feature checking, amazing portability, 'make distcheck', and acceptable cross-compiling (still hard to get right, but the alternatives tend to be even harder). Don't switch to another build system purely based on the idea that it's "more elegant".
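To illustrate how little is needed, here's a sketch of a hello-world setup (project and file names are placeholders):

    configure.ac:

        AC_INIT([myapp], [0.1])
        AM_INIT_AUTOMAKE([foreign])
        AC_PROG_CC
        AC_CONFIG_FILES([Makefile])
        AC_OUTPUT

    Makefile.am:

        bin_PROGRAMS = myapp
        myapp_SOURCES = main.c

Run "autoreconf -fi", then the usual "./configure && make"; out-of-source builds and "make distcheck" come along for free.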
Spend some hours figuring this out... Oh, I need to install an old version of auto*!
How do I get the old one but keep the new one around? Spend another 30 minutes to figure that out.
$ ./autogen.sh
checking for build system type...
^C
No damn, I wanted to generate configure, not run it! How do I clean up the mess it made just now?
$ make clean
$ make distclean
$ ./configure --prefix=$HOME/my_app ...
$ make -j9 install
...
install: no such file or directory blabla.la
WTF!?!?!
Spend an hour or so googling this mess. Ah, it's a parallel-make bug.
$ make install
HOLY SHIT, IT INSTALLED!!!
Let's submit this fix upstream. No problem, use diff.
WTF IS ALL THIS MESS IN THE DIFF I NEVER TOUCHED?!!??!
I know you're going to say I should be using the VCS checkout in the first place, which would hopefully be configured to ignore the autogenerated files. But as a user, or distribution maintainer, most of the time the bug you find is with a specific, packaged version of the software, and it may be quite an effort to figure out how to get the exact same version from the VCS server.
> $ ./autogen.sh
> checking for build system type...
> ^C
>
> No damn, I wanted to generate configure, not run it! How do I clean up the mess it made just now?
I always run "./autogen.sh --help" for exactly that reason; then if I see --help output from configure, I know that autogen.sh "helpfully" ran configure for me.
You can also usually just run "autoreconf -v -f -i" directly, unless the package has done something unusual that it needs to take extra steps in autogen for.
Practically, you need an autogen.sh script too, because no one knows the command switches and the order you need to call aclocal, autoconf and automake in to generate the configure script and Makefile.in files. That's enough for a minimal project. For something larger with subdirectories for src/, docs/, tests/ etc., you need to use one Makefile.am file in each directory and tie it together with the SUBDIRS variable. When you need to do something slightly out of the ordinary (in autotools' world, "ordinary" means compile C, generate man pages and install), like generating doxygen or javadoc documentation, offering an uninstall option or building binaries that are not supposed to be installed, then you have to learn M4 and its stupid syntax.
Also, autotools' amazing features aren't all that. Builds out of the source dir? Eh... this is 2012 and it doesn't impress anymore. Both waf and SCons can do it no problem. Autotools projects, on the other hand, seem to always put the object files and linked library in the source directory. Sometimes the built library is put in a hidden directory for some reason I don't understand. Possibly it has something to do with libtool, another kludge on top of autotools one would rather do without. Since modern build tools do not pollute the source directories you basically get make distcheck for free. Waf has it built in, for example, and it can easily be extended to also upload your tarballs to your distribution server, send announcement emails and what have you.
I don't want to toot my own horn, but I only check in Makefile.am and configure.ac. Here's a simple example I made a while back: https://github.com/ggreer/fsevents-tools. To build it, just run autogen.sh.
For a "real" project, https://github.com/ggreer/the_silver_searcher, I have the same set-up: Makefile.am and configure.ac with no generated code in the repo. It works fine as long as you have pkg-config, and most people do. It builds and runs on Solaris, OS X, FreeBSD, and any Linux distro you like.
Although I use autotools the "right way", I'm not a fan of it at all. There are multiple levels of generated files. configure is generated from configure.ac (plus aclocal.m4, which is itself generated from it). Makefile is generated from configure and Makefile.in, and Makefile.in is generated from Makefile.am. There are other files in the mix as well: config.h, config.h.in, aclocal.m4, and various symlinks to autotools scripts (compile, depcomp, install-sh, and missing) get generated. It's insane.
There are other problems. Minor versions of autotools can behave completely differently. AM_SILENT_RULES didn't exist before automake 1.11, so instead of printing out minimal text during a build, scripts using that macro fail outright on older versions. Another example: 1.9 requires AM_PROG_CC_C_O to compile anything useful but 1.10 doesn't. What's crazy is that 1.9 actually spits out an error message telling you to add AM_PROG_CC_C_O to your configure.ac. It makes no sense.
A system this complicated can't be pruned without breaking backwards compatibility. For autotools, that's not feasible. There are too many projects that would need to be fixed. The next-best solution is to use something else for new projects and let autotools fade gently into history.