Engineers later confessed that system resets had occurred during pre-flight tests. They put these down to a hardware glitch.
This doesn't really ring true to me. Sounds like the sort of programmers who are inclined to blame "cosmic rays" for odd behavior that they don't understand.
I would think on a mission as expensive as Pathfinder, and knowing that once the hardware is launched you will never have physical access to it again, ANY anomalous behavior would have been tracked down to root causes during testing.
Pathfinder was not all that expensive by space mission standards. The total budget was only about $200M or so, which was an order of magnitude less than typical prior missions.
Also, while I was not personally involved in Pathfinder, I was in the research program that led up to it. We were doing things that had never been done before, including the first use of an off-the-shelf commercial operating system on a space vehicle. We were under tremendous time and budget pressure, and so was Pathfinder. The idea that an intermittent system reset could be written off as a hardware problem is entirely plausible.
A system reset might be easier to overlook than other errors. When we test our gear, something like a 2 dB loss (out of 30 dB initial transmit power) anywhere within our operating range (-40 degrees to +85 degrees) would likely result in another board spin (and a 15-20 day delay for that particular product), but a NIC reset during a 72-hour soak (where we cycle the temperature up and down multiple times) isn't something we would get stressed over. That's the entire idea behind having a watchdog: if something looks amiss, start clean.
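That "if something looks amiss, start clean" contract can be sketched pretty compactly. This is a minimal, illustrative software watchdog (not anyone's actual firmware), with an injectable clock so it can be tested deterministically:

```python
import time

class Watchdog:
    """Minimal software watchdog: the supervised loop must 'kick' it
    periodically; if no kick arrives within the timeout, the watchdog
    reports expiry (in real hardware this would pull a reset line)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # injectable clock makes tests deterministic
        self.last_kick = clock()

    def kick(self):
        # Called by the healthy main loop on every iteration.
        self.last_kick = self.clock()

    def expired(self):
        return self.clock() - self.last_kick > self.timeout_s
```

The point is that anything that wedges the main loop, whether a deadlock, a hung peripheral, or a priority inversion, trips the same timer, which is exactly why a few spurious trips in test can look like "just" a glitch.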
It may have also been the case in the Pathfinder era that a "Reset" was acceptable behavior.
Also - depending on the nature of the watchdog reset - they may not have been able to capture a core dump - in which case, if there were only a few resets, it would be really hard to track down on a tight schedule.
> Sounds like the sort of programmers who are inclined to blame "cosmic rays" for odd behavior that they don't understand.
In fairness, they go out of their way to harden their computers against cosmic rays. I am sure that they make a decision about what an acceptable amount of cosmic-ray-induced error is, and design the system knowing that it will happen. Having said that, I agree that having unexplained problems on the ground should get an explanation before launch. Even if it was a hardware glitch, they should either have told the hardware people that there was a glitch, or the level of error was within the designed-for range and they should not have been surprised when it happened in space.
I didn't mean they were blaming cosmic rays. I've worked with completely terrestrial programmers who are prone to blaming "cosmic rays" or "compiler bugs" for their own mistakes... and the claim that the Pathfinder engineers chalked up the reset to a "hardware glitch" reminded me of those sorts of rationalizations.
In further fairness, Pathfinder's name was also its job description. It was built on a relative shoestring and in a way, fulfilled its most important function the moment Sojourner rolled onto Martian soil. It was basically a combination prototype/advance scout.
I remember reading in an article some time back that software for these sorts of systems isn't written the way we do it: come to the office, fire up an IDE, and code. Rather, by and large, the software is first written on paper, and its possible side effects and correctness are thoroughly verified first.
I know of an embedded programmer from the '80s/'90s era. He did heavy embedded programming in both assembly and C. I recently happened to run into his journals/notes at his home. It turns out that even for things as simple as a telephone answering machine or an ECG machine, they would work things out heavily on paper first, prove the design's correctness, and only once they had it all figured out go to an editor. It was so well organized on paper: he had all the tests figured out to their most extreme edge cases, and he had everything on paper.
I did a small project with him and it was fun to watch him work. It felt like the bulk of the work happened on paper first and then on the real tool. There was hardly any room for error; bugs were minimal. And beyond all that, working on paper has one huge benefit: there is no dangerous distraction called the 'web browser'.
The problem is that it is too easy to turn back off. Editing the hosts file on XP is sort of a pain, so it works for overcoming impulsive browsing.
Well, sure it is right now, because we are not able to do so yet. But there is a lot of research in computer science trying to make that statement less true! :)
Correct me if I'm wrong, but I find the article rather biased against the iterative/agile way of writing software. Don't get me wrong, I believe there's a time and place for the over-2500-pages-of-specs-for-6000-LoC kind of process, but the article seems to be saying that quick iterations are stone-age/how little children do it, and that the way the on-board shuttle software group wrote their software is the "perfect" way.
None of us are trawling /new and upvoting anything, sadly. Great posts regularly don't get a single click the first time through the HN mill. To be on the front page an article needs 1) a clickbait title, 2) good content, 3) luck, and 4) to be posted when the Americans are awake.
I've been using OCC (optimistic concurrency control) to avoid this problem with my apps, and I couldn't be happier. It mostly avoids issues with priority inversion. (Not entirely, if using pre-emptive scheduling and long-running high-priority tasks.) It doesn't matter if something is stuck, looping, or crashes: it still won't bring down the whole system. Plus, it offers great parallel performance as long as resources aren't too widely shared. I got sick and tired of administering (and creating) systems as a DevOp when I was using traditional locks, with all the problems those caused, on top of poor performance.
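For anyone who hasn't used OCC: the core pattern is to take a versioned snapshot, compute without holding any lock, and commit only if nothing changed underneath you, retrying otherwise. A minimal sketch of the idea (the names here are illustrative, not any particular library's API):

```python
import threading

class VersionedCell:
    """Illustrative optimistic-concurrency cell: readers take a
    versioned snapshot, compute lock-free, and commit only if the
    version is unchanged; otherwise the caller retries. Since no
    lock is held during the computation, a slow or stuck updater
    never blocks others the way a held mutex would."""

    def __init__(self, value):
        self._value = value
        self._version = 0
        self._commit_lock = threading.Lock()  # held only for the instant of commit

    def read(self):
        return self._value, self._version

    def try_commit(self, expected_version, new_value):
        with self._commit_lock:
            if self._version != expected_version:
                return False        # lost the race: caller re-reads and retries
            self._value = new_value
            self._version += 1
            return True

def update(cell, fn):
    """Retry loop: recompute fn from a fresh snapshot until the commit wins."""
    while True:
        value, version = cell.read()
        if cell.try_commit(version, fn(value)):
            return
```

The trade-off is wasted recomputation under heavy contention, which is why it shines when resources aren't too widely shared, as the parent says.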
Thank you for this. This is almost a meta-story about how "politicians"/"managers" are always ruining everything that is good and true, the meaning of which they are genetically incapable of grasping.
Yeah, this has been a common topic in any embedded job interview I've had since the 90s, and not just on VxWorks jobs. The Pathfinder case study is pretty well known.
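For anyone who hasn't hit it in an interview yet: the scenario is that a low-priority task holds a mutex the high-priority task needs, and a medium-priority task preempts the low one, starving the high one indefinitely. Priority inheritance fixes it by letting the mutex holder borrow the blocked task's priority. A toy single-CPU tick scheduler makes the difference concrete (purely illustrative; this is not the VxWorks mechanism):

```python
def simulate(priority_inheritance):
    """Toy tick-based scheduler: each tick runs the highest-priority
    runnable task. 'low' holds the mutex for 3 ticks of work, 'med'
    has 5 ticks of CPU-only work, 'high' needs the mutex for 1 tick."""
    base_prio = {'low': 1, 'med': 2, 'high': 3}
    remaining = {'low': 3, 'med': 5, 'high': 1}
    mutex_holder = 'low'            # low grabbed the mutex first
    timeline = []

    def effective(task):
        prio = base_prio[task]
        # Priority inheritance: while a higher task waits on the mutex,
        # the holder borrows that task's priority.
        if priority_inheritance and task == mutex_holder:
            prio = max(prio, base_prio['high'])
        return prio

    while remaining['high'] > 0:
        # 'high' is not runnable while someone else holds the mutex.
        runnable = [t for t, r in remaining.items()
                    if r > 0 and not (t == 'high' and mutex_holder != 'high')]
        task = max(runnable, key=effective)
        timeline.append(task)
        remaining[task] -= 1
        if task == mutex_holder and remaining[task] == 0:
            mutex_holder = 'high'   # mutex passes to the waiting high task
    return timeline
```

Without inheritance, the high-priority task finishes on tick 9, after all of med's unrelated work; with inheritance, it finishes on tick 4. On Pathfinder, the real-world equivalent of 'high' missing its deadline was what tripped the watchdog.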