What really happened to the software on the Mars Pathfinder spacecraft?

ams6110 · on July 5, 2013

Engineers later confessed that system resets had occurred during pre-flight tests. They put these down to a hardware glitch

This doesn't really ring true to me. Sounds like the sort of programmers who are inclined to blame "cosmic rays" for odd behavior that they don't understand.

I would think on a mission as expensive as the pathfinder, and knowing that once the hardware is launched you will never have physical access to it again, ANY anomalous behavior would have been tracked down to root causes during testing.

lisper · on July 5, 2013

Pathfinder was not all that expensive by space mission standards. The total budget was only about $200M or so, which was an order of magnitude less than typical prior missions.

Also, while I was not personally involved in Pathfinder, I was in the research program that led up to it. We were doing things that had never been done before, including the first use of an off-the-shelf commercial operating system on a space vehicle. We were under tremendous time and budget pressure, and so was Pathfinder. The idea that an intermittent system reset could be written off as a hardware problem is entirely plausible.

ErsatzVerkehr · on July 5, 2013

A hardware problem is still a problem, though... Why would it be "written off"?

lisper · on July 5, 2013

Because it would have appeared to be a problem with the test rig, not the flight hardware.

ghshephard · on July 5, 2013

A system reset might be easier to overlook than other errors. When we test our gear, something like a 2 dB loss (out of 30 dB initial transmit power) anywhere within our operating range (-40 degrees to +85 degrees) would likely result in another board spin (and a 15-20 day delay in the that particular product), but, a NIC reset during a 72 hour soak (where we cycle the temperature up and down multiple times), isn't something we would get stressed over. That's the entire idea behind having a watchdog - if something looks amiss, start clean.

It may have also been the case in the Pathfinder era that a "Reset" was acceptable behavior.

Also - depending on the nature of the watchdog reset - they may not have been able to capture a core dump - in which case, if there only a few resets - really hard to track down on a tight schedule.

gizmo686 · on July 5, 2013

>\Sounds like the sort of programmers who are inclined to blame "cosmic rays" for odd behavior that they don't understand.

In fairness, they go out of their way to harden they computers against cosmic rays. I am sure that they make a decision about what an acceptable amount of cosmic ray induced error is, and design the system knowing that it will happen. Having said that, I agree that having unexplained problems on the ground should get an explanation before being launched. Even if it was a hardware glitch, they should either have told the hardware people that there was a glitch, or the level of error was within the designed for range and they should not have been suprised when it happened in space.

ams6110 · on July 5, 2013

I didn't mean they were blaming cosmic rays. I've worked with completely terrestrial programmers who are prone to blaming "cosmic rays" or "compiler bugs" for their own mistakes... and the claim that the Pathfinder engineers chalked up the reset to a "hardware glitch" reminded me of those sorts of rationalizations.

nknighthb · on July 5, 2013

In further fairness, Pathfinder's name was also its job description. It was built on a relative shoestring and in a way, fulfilled its most important function the moment Sojourner rolled onto Martian soil. It was basically a combination prototype/advance scout.

mturmon · on July 6, 2013

Absolutely correct. More context: Pathfinder was done during the Goldin era of better, faster, cheaper.

kamaal · on July 5, 2013

It was the same thought I had too.

I remember reading it in a article some time back. Software for this sort of systems isn't written like we do. Come to office, fire up IDE and code. Rather by and large software is first written on paper and the possible side effects and correctness is first verified thoroughly.

I know of a embedded programmer from 80's/90's era. He has done embedded programming both in assembly and C heavily. I recently happen to run into his journals/notes at his home. Turns out even for things as simple as a telephone answering machine or a ECG machine. They would heavily work things out on the paper first. Prove its correctness, once they've got it all figured out then they go out and do it on an editor. I mean it was so well organized on paper, he had all the tests figured out to there most extreme edge cases and he had everything on paper.

I did a small project with him and it was fun to watch him work. It felt like bulk of the work happened on paper first and then on the real tool. There was hardly any room for error, bugs were very minimal. And beyond all, working on paper has its biggest benefit. There is no dangerous distraction called the 'web browser'.

DigitalJack · on July 5, 2013

That fricken web browser. I had to resort to sticking hacker news and a few others in my hosts file pointing back to 127.0.0.1. (On xp)

outworlder · on July 5, 2013

Turn on the noprocrast flag.

DigitalJack · on July 6, 2013

The problem is that it is too easy to turn back off. Editing the hosts file on xp is sort of a pain, so it works for overcoming the impulsive browsing.

javert · on July 5, 2013

> Prove its correctness

That's probably an overstatement :P

alemhnan · on July 5, 2013

well, sure it is right now because we are not able to do so yet. But there is a lot of research in computer science trying to do made that statement less over! :)

javert · on July 5, 2013

Yeah, but AFAIK they have been researching that topic for like 30 years with very little progress overall.

taspeotis · on July 5, 2013

Seeing this article reminds me of http://www.fastcompany.com/28121/they-write-right-stuff

jon2512chua · on July 5, 2013

Interesting read.

Though correct me if I'm wrong, but I find the article rather biased against iterative/agile way of writing software. Don't get me wrong, I believe there's a time and place for over-2500-pages-of-specs-for-6000-LoC kinda process, but the article seems to be saying that quick iterations are of the stone age/how little children do it and that the way the on-board shuttle group software team wrote their software is the "perfect" way.

taspeotis · on July 5, 2013

It's been a while since I've read the article, but I skimmed over it and didn't really get that vibe from it.

I felt it was more "these guys are extreme outliers in terms of software quality, but also extreme outliers in terms of process."

kotnik · on July 5, 2013

For a (much) more detailed story, go here: https://www.cs.duke.edu/~carla/mars.html

mturmon · on July 6, 2013

Thanks for this. It gets in to details about how e fix was deployed and says more about why it was not caught during testing.

mdturnerphys · on July 5, 2013

Did anyone else notice the "?reposting" in the URL? It looks like someone else submitted this 8 hours earlier (http://news.ycombinator.com/item?id=5991503).

willvarfar · on July 5, 2013

none of us are trawling /new and upvoting anything, sadly. Great posts regularly don't get a single click the first time through the HN mill. To be on the front page an article needs to be 1) clickbait title, 2) good content, 3) lucky, 4) posted when the Americans are awake.

Oh well.

lucaspiller · on July 5, 2013

The last bit is the main thing I think. The original has exactly the same title and original poster has a lot higher karma.

Sami_Lehtinen · on July 5, 2013

I've been using OCC to avoid this problem with my apps, and I couldn't be more happy. It mostly voids issues with priority inversion. (Not exactly, if using pre-emtive scheduling and long running high priority tasks.) It doesn't matter if something is stuck, looping, crashes or so, it still won't bring down the whole system. + Offers great parallel performance as long as resources aren't too widely shared. I got sick'n'tired administering (and creating) systems as DevOp, when I were using traditional locks and all the problems which those caused, even on top of poor performance.

raverbashing · on July 5, 2013

What's OCC?

taspeotis · on July 5, 2013

Took quite a bit of Googling, but it looks like it's optimistic concurrency control.

dschiptsov · on July 5, 2013

There were rumors that older generations of spacecrafts fly on lisp systems. Nevertheless, you see, they had C-interpreter (REPL) there.))

dandrews · on July 5, 2013

You're likely thinking of this essay: http://www.flownet.com/gat/jpl-lisp.html

lisper · on July 5, 2013

Or this: http://www.youtube.com/watch?v=_gZK0tW8EhQ

dschiptsov · on July 6, 2013

Thank you for this. This is almost a meta-story, how "politicians"/"managers" are always ruining everything that is good and true, meaning of which they genetically incapable to grasp.

pjmlp · on July 5, 2013

REPLs are very good for interactive development, regardless of the language.

comatose_kid · on July 5, 2013

I did some embedded sw on vx works based systems back in the day. Any embedded engineer we interviewed had to know what priority inversion was.

joezydeco · on July 5, 2013

Yeah, this has been a common topic in any embedded job interview I've had since the 90s, and not just on VxWorks jobs. The Pathfinder case study is pretty well known.

damian2000 · on July 5, 2013

TL;DR concurrency is hard