Floating point error is the least of my worries (2011) (johndcook.com)
77 points by rbanffy on Feb 21, 2018 | 44 comments


In a funny way, reminds me of football: they'll bring out the chains to precisely measure the yardage... to a spot that was arbitrarily chosen by the ref. The refs do the best they can, but there's often no way they've spotted the ball exactly where the play died. The whole ceremony with the chains is ridiculous.

See also: economics.


Yes, I agree, and have thought about this a bit.

It seems like it effectively reduces to a coin flip choice.


I think that is the intent. If it's really close, then the system turns it into a random outcome with no apparent bias.


As someone who has programmed alternatives to IEEE FP, I would say that this is not the correct perspective, in this sense: floating point error can crop up in places you would never expect, and when it does, the magnitude can be very difficult to predict (and can be far worse than any approximation or modeling error).

Here's a good example. Try solving this set of linear equations by hand (it's nearly trivial):

0.25510582x + 0.52746197y = 0.79981812

0.80143857x + 1.65707065y = 2.51270273

The answer you should get is: {-1, 2}, btw, which should be fairly easy to verify by hand.

Now try doing it with a computer.
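For anyone who wants to try, here is a minimal NumPy sketch (the exact digits you get back depend on your platform's LAPACK, and you may even get a near-singular warning, but the answer is generally nowhere near (-1, 2)):

  import numpy as np

  A = np.array([[0.25510582, 0.52746197],
                [0.80143857, 1.65707065]])
  b = np.array([0.79981812, 2.51270273])

  print(np.linalg.solve(A, b))      # generally nowhere near [-1, 2]
  print(A @ np.array([-1.0, 2.0]))  # ...even though [-1, 2] reproduces b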


This example shows a hazard of floating-point, but it does not show that the FP error is worse than modeling error.

Here, the problem arises because the equational system is ill-conditioned. Yet you present the system as if the coefficients are known exactly.

In reality, this ill-conditioned model can arise for two reasons:

+ The equations represent a physical system that is ill-conditioned -- a slight change in the system can produce a huge change in its output. So where did those coefficients, with their 8 digits, come from? A minor error measuring the coefficients will throw off your "correct" solution. In this case, modeling error is every bit as important as floating-point error.

+ The physical system is not poorly conditioned, but the model is. In this case, by definition, modeling error overwhelms floating-point error.

So the essay still advances the correct perspective -- if you are modeling an ill-conditioned system, you had better spend a lot of time worrying about the effects of modeling errors, before you get hung up worrying about FP errors. And if you know the system is ill-conditioned, then poor conditioning in the model is one of the things that an experienced analyst will be prepared for.


> In reality, this ill-conditioned model can arise for two reasons

> + The equations represent a physical system that is ill-conditioned

> + The physical system is not poorly conditioned, but the model is

In my experience, there are other ways this kind of problem can come up that have nothing to do with the model or the system being ill-conditioned. I'll give one example below, but I expect there are more I'm not thinking of.

One that I recall encountering is, say you have a model, and you compute the time 't' at which some event occurs (say, the time at which a condition x < h is violated) in some part of your model. Once you find this 't', you plug it back into your full model and evaluate the model at that time. Now you try to compute the time of the next event and repeat. But there's no guarantee that x = h (or even x >= h) at this time 't' here, even though that's the condition you solved for... and if x < h again, you can get stuck here re-evaluating the same event forever. And notice this had nothing to do with the system being ill-conditioned; it was purely due to FP round-off making x just a tiny bit smaller than h.

Now, there are two ways to fix this example. The usual hack (once you notice the implicit x = h assumption, which is only obvious because I just told it to you, not when you're actually coding) is to just set a tolerance parameter for |x - h| being small... but introducing tolerances doesn't always solve the problem, and it can introduce more messes of its own, since tolerances become quite difficult to reason about and deal with in the presence of feedback (they can cancel or compound themselves, for example), and you don't want to have to keep figuring out reasonable tolerances for every condition. On the other hand, there's another (nicer) solution: simply force x = h here, because that's literally the condition you solved for. The trouble is, this is a very unnatural approach, and hence very, very far from obvious when you're coding it for the first time. You have to introduce a mechanism for forcing the state of one part of your model externally, while the rest is computed internally. And you have to explicitly invoke this mechanism. It's something that makes sense after the fact, but it's very difficult to foresee, because it makes no sense when you're thinking about the problem purely mathematically (i.e. in \mathbb{R}), and yet it can bite you just when you think your code is fine after the switch to finite precision (as it has bitten me).
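For concreteness, here is a tiny sketch of the pitfall with made-up dynamics x(t) = x0*exp(-t) and a threshold h (the names and numbers are illustrative only, not from any particular simulator):

  import math

  x0, h = 1.0, 0.37
  t_event = math.log(x0 / h)            # crossing time of x(t) = h, exact in real arithmetic
  x_at_event = x0 * math.exp(-t_event)  # re-evaluate the model at t_event

  print(x_at_event == h)   # often False: round-off leaves x a hair off h
  print(x_at_event - h)    # a few ulps, possibly negative, i.e. x < h again

  # The tolerance hack: treat |x - h| < tol as equality (tol is hard to choose).
  # The nicer fix described above: overwrite the state with the value the
  # event condition says it must have.
  x_at_event = h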

My main point is, I wouldn't assume you can predict when these sorts of problems come up. They're not always due to poor conditioning or due to poor modeling, or poor coding for that matter. They can be products of reasonable, competent work and still crop up when you really don't anticipate them, no matter how "nicely-behaved" the inputs to your solver are.


> And notice this had nothing to do with the system being ill-conditioned; it was purely due to FP round-off making x just a tiny bit smaller than h.

This doesn't sound right: the comparison function x(t) >= h is extremely ill-conditioned at the point t where x(t)=h, having condition number Inf.


That is not the system being ill-conditioned. The system would be well-conditioned. That's just a comparison that is assumed to hold true inside the simulator because it's trying to advance its time to that of the next event. It's internal to the simulation algorithm.


I think it is: your code ends up taking different paths based on the result of a comparison, which is ill-conditioned. The underlying system might be well-conditioned, but the question you are asking about it is not.


I'm not disputing whether there is an ill-conditioning present here; I am saying that it is not the system that is ill-conditioned or otherwise somehow particularly pathological, and that this is obviously not due to a modeling error. Please consider re-reading my comments since my point is something other than what you are disputing -- you seem to be disputing a point that I was never making.


It is not the system that is ill-conditioned or pathological, nor is it modelling error, I agree. It is the comparison x(t) >= h that is ill-conditioned, and you'd be wrong to rely on its result (much like catastrophic cancellation, it is a bad idea to compare approximately equal numbers with errors in them, no matter where the errors come from—the errors could come from measurements, and the same logic would apply). We might have different ideas about what in your example is specified as given (with full precision) and what is specified approximately? But I do not think I misunderstood anything you wrote, please consider re-reading my comments too.


If you try to solve a linear system with a condition number of 3.73e16, it's not really correct to say that the problem lies with floating-point errors, because the real problem is that you tried to solve an extremely ill-conditioned problem. The wrong answer you get is backward stable, in fact—it is the exact solution of a nearby perturbed system. One way to think of it is that the phrase "the answer you should get" is somewhat ambiguous. I don't think your example goes against the post's main points.
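A quick NumPy check of both claims (a sketch; the exact numbers vary slightly by platform):

  import numpy as np

  A = np.array([[0.25510582, 0.52746197],
                [0.80143857, 1.65707065]])
  b = np.array([0.79981812, 2.51270273])

  print(np.linalg.cond(A))          # on the order of 1e16: extremely ill-conditioned
  x = np.linalg.solve(A, b)
  print(np.linalg.norm(A @ x - b))  # tiny residual: x exactly solves a nearby system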


> the real problem is that you tried to solve an extremely ill-conditioned problem

This makes it sound like you're saying just not to solve problems when they're too annoying, but I somehow don't think that's what you're trying to say.


In a way that's technically true, because solving ill-conditioned problems with noisy data is difficult and the results are often meaningless. But you almost always have some choice about which of a multitude of subproblems you try to solve to reach your actual goal. It's like the difference between executing an algorithm and designing an algorithm. So instead of solving ill-conditioned subproblems, you spend your effort getting to well-conditioned subproblems instead.

I think a common example of an ill-conditioned problem that's too annoying to solve is weather prediction past 2 weeks in advance, because it's a chaotic system, so I don't think what you're saying is wrong.


That's only a place you would "never expect" because you've carefully hidden the gotcha. Stated less obtusely, you're finding the intersection between two lines which are parallel to JUST above the precision of an IEEE double. This is a cute example, not a real world situation. In the real world, precision loss in this circumstance is acceptable and any real system that accepted a "parallel" condition like this would have a check for it.

For a similar situation: atan2() exists because of a coverage issue in the problem area and not a bug in the idea of a tangent or a float.


I would think that atan2() exists because atan() doesn't invert tan() correctly for some purposes -- if you're trying to find the angle associated with a point then atan(y/x) is just the wrong answer.

The fact that it avoids the division by 0 is a convenient side effect of also being more correct.
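A concrete case of the quadrant problem:

  import math

  # The point (-1, 1) is at angle 3*pi/4, but y/x = -1 looks the same as the
  # fourth-quadrant point (1, -1), so atan(y/x) lands in the wrong quadrant.
  x, y = -1.0, 1.0
  print(math.atan(y / x))   # -0.785... (-pi/4), wrong quadrant
  print(math.atan2(y, x))   #  2.356... (3*pi/4), correct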

(I'm just guessing; I don't know the actual history of it.)


The point was more that problems that need to use atan() in a context where the line may point near the Y axis don't flip out over "floating point precision", they just use atan2().

Problems like the one in the grandparent that need to intersect nearly-parallel lines (or subspaces) use special techniques better suited to the problem and don't just feed it to a linear solver.


How does that contradict what the parent was saying? Parent said problems like this can come up where you least expect it, and you rebutted that "this is not a real world situation" because "any real system that accepted a 'parallel' condition like this would have to check for it"? If anything you're confirming the point, not refuting it.


This is a carefully constructed (malformed) problem presented in an obfuscated way. It didn't "just happen" that those ~20-bit numbers produce slopes that differ only after ~57 bits, come on. If you're going to do that kind of trickery, you can come up with an equivalent that blows up any fixed precision limit and/or explodes the representation size for any arbitrary-precision computation system.

It's a parlor trick, not a serious problem.


in MATLAB:

  >> A = [0.25510582 0.52746197; 0.80143857 1.65707065];
  >> B = [0.79981812; 2.51270273];
  >> X = A\B
  X =
     0.272534072955269
     1.384541698505404
  >> A * X
  ans =
     0.799818120000000
     2.512702730000000
  >> A * [-1;2]
  ans =
     0.799818120000000
     2.512702730000000


Yeah, floating point error is the least of your worries... until it's not. But usually, it is.


Wolfram has no problems with it.


Other commentators have already shown that this is easily solvable with systems like MATLAB, Octave, or Wolfram. I will add, though, that I believe that the potential pitfalls of the above are still more due to modeling error than floating point.

Even if we don't relate the above system to a physical problem, the issue is that asking for a solution to `Ax=b` is generally an ill-defined problem. We always have to ask three questions:

1. Does a solution exist?

2. Does a unique solution exist?

3. If a solution exists, but it's not unique, then which solution do we want?

For a system `Ax=b`, I contend the first one is generally the hardest to answer. One way to view (1) is to ask whether there exists an `x` such that `|| Ax-b || < eps`. Here, we have a floating point question, but I don't think it's a floating point error issue. Simply, since we're working with floating point numbers, we have limited accuracy and have to define what we believe zero to be. Depending on the scaling, `eps` could be really big or really small, and knowing the correct scaling is a modeling issue. Now, if we don't believe that there exists an `x` such that `|| Ax-b || < eps`, then we can solve something like a minimum norm problem and let `x=argmin || Ax - b ||`, or we could do something else. Again, this is a modeling issue and it depends on what we want.

Now, if we believe a solution exists and is unique, then we could just factorize the system with something like an LU. However, even here, we have a modeling issue because we have to decide what we mean by unique in floating point arithmetic. Again, this is not a floating point error issue, it's really an issue of what we want to define zero to be in floating point arithmetic. One way to define uniqueness is to look at the Hessian of the least-squares objective, `J(x) = 0.5 || Ax -b ||^2`, which is just `A'A`. If the smallest eigenvalue is smaller than `eps`, then we're rank deficient and the optimization problem `argmin J(x)` does not have a unique solution. Now, this relates to the condition number. Since `A'A` is symmetric, `cond(A)^2=cond(A'A)=|lambda_max(A'A)|/|lambda_min(A'A)|`. Basically, we have the scaling issue again. If the largest eigenvalue of `A'A` is much larger than the smallest, then we should probably just call the smallest eigenvalue effectively zero because we're limited by floating point arithmetic in our ability to correctly model differences.

This leads us to (3). If we have a solution, but it's not unique, then we need to pick which solution we want. Generally, this becomes the minimum norm solution, `x=argmin{ ||x|| : Ax = b }`. There are a lot of ways to calculate this, but using a QR factorization is my favorite. For the parent problem, this solution is `[0.594352357167031;1.228894994185025]`, which is precisely what MATLAB/Octave gives for `A\b`. The solution `[-1;2]` satisfies `Ax=b`, but it's not minimal norm. I guess we could call it the aesthetically pleasing solution.
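For what it's worth, here is a NumPy sketch of that choice (lstsq with the default cutoff treats the tiny second singular value as zero and returns the minimum-norm solution; the exact digits depend on the cutoff and your LAPACK):

  import numpy as np

  A = np.array([[0.25510582, 0.52746197],
                [0.80143857, 1.65707065]])
  b = np.array([0.79981812, 2.51270273])

  x, residuals, rank, sv = np.linalg.lstsq(A, b, rcond=None)
  print(rank)  # typically 1: the tiny singular value falls below the cutoff
  print(x)     # roughly [0.5944, 1.2289], the minimum-norm solution
  print(np.linalg.norm(x), np.linalg.norm([-1.0, 2.0]))  # ~1.37 vs ~2.24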

Anyway, again, all of this relates to floating point, but I contend it's not a floating point error issue, but more an issue of how we model zero in floating point arithmetic. This relates to the scaling of the problem, but once we settle on that, then we can run through questions 1-3 to find the solution we want.


As he says, if you understand the nature of floating point, it should be fine. But having seen this misunderstood over and over and over again on Stack Overflow, I'd say many don't.

While we are on floating point, I really liked this article that was on HN a while back https://lemire.me/blog/2017/02/28/how-many-floating-point-nu...


He is right, but misses the big issue with floating point, which is that, as far as I recall, it is neither commutative nor associative. Hence, you can't do algebra and then check equalities.

Most capable programmers just don't think to use a == b for floating points, because it isn't going to work a lot of the time, but for someone who isn't aware of that minefield they can wander in and get some very stupid bugs.


Floating point is commutative but it is not associative. Unfortunately, associativity is the more useful feature (you can do parallel reductions on noncommutative but associative operations, for example).
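A quick illustration of both halves of that:

  # Floating-point addition is commutative but not associative.
  a, b, c = 0.1, 0.2, 0.3
  print(a + b == b + a)              # True: commutativity holds
  print((a + b) + c == a + (b + c))  # False: 0.6000000000000001 vs 0.6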


> Most capable programmers just don't think to use a == b for floating points, because it isn't going to work a lot of the time, but for someone who isn't aware of that minefield they can wander in and get some very stupid bugs.

Well, there are two separate issues here:

1) The default for programming should be decimal floating point instead of binary floating point. Suddenly, your strange "a == b" for floating point works just fine because 0.1 is exactly representable. Thus, addition, subtraction, and multiplication behave just like integer arithmetic (see the short illustration after point 2). Computers are now fast enough that any language in which you don't explicitly ask for a floating point type should default to decimal. This covers 99% of the silly cases for beginners, and benefits people doing actual monetary calculations.

2) Programmers need to be able to specify when they want binary floating point for speed along with all the weirdness.
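On point (1), a quick illustration of what decimal buys for the simple cases (using Python's decimal module as a stand-in for a language-level default):

  from decimal import Decimal

  print(0.1 + 0.2 == 0.3)                                   # False with binary floats
  print(Decimal('0.1') + Decimal('0.2') == Decimal('0.3'))  # True: 0.1 is exact in decimal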


Decimal floating point is really no better than binary floating point, and suggesting that people should use it instead suggests that you don't actually understand what floating point is really about.

If you're dealing with money, for example, you'll want to use fixed-point, not any kind of floating point. Floating point expands its range by effectively representing numbers uniformly across the log space. If you need to deal with calculations that contain both 1e-12 and 1e12, you really want that logarithmic property. But money has a fixed scale: you don't need to represent any numbers between 0 and 0.01¢ (fractional cents are useful, but I think 1/100 is the smallest you'd need). A 64-bit binary floating number at that scale can only precisely store every number up to a tad over $900 billion, and even decimal floating point can only scooch that up to a tad under $1 trillion. A 64-bit integer in fixed point could handle $1.8 quadrillion precisely.

Decimal floating point doesn't fix rounding errors. It fixes the superficial rounding error that comes from 0.1 being a terminating decimal in base 10 but not in base 2, but that's rarely the problem that actually bites. It doesn't fix the fact that subtraction and division have the potential to destroy significant digits, and it doesn't fix the inability to accurately mix large and small numbers--both of which cause many more problems in practice, particularly when you try to do things like invert a matrix.
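To make that concrete: decimal floating point is still floating point, so absorption and cancellation behave the same way (here with Python's decimal module and its default 28-digit context):

  from decimal import Decimal

  tiny = Decimal('1e-40')
  print((Decimal(1) + tiny) - Decimal(1))  # 0E-...: the tiny term was absorbed
  print((1.0 + 1e-40) - 1.0)               # 0.0: binary floats fail the same way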


The default for programming should be arbitrary precision rational numbers. Think the rational numbers in Scheme or Haskell.
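In Python terms (the fractions module plays the same role as Scheme/Haskell rationals): arithmetic is exact, but the representations can grow without bound, which is the usual objection.

  from fractions import Fraction

  print(Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))  # True, exactly

  x = Fraction(1)
  for _ in range(12):
      x = (x + 2 / x) / 2            # Newton's iteration for sqrt(2), kept exact
  print(x.denominator.bit_length())  # already thousands of bits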


> The default for programming should be arbitrary precision rational numbers.

And all is well – until you learn about e and pi and realize that your arbitrary-precision rational numbers are no better than floating point.


I find that the typical developer’s typical work doesn’t require irrational numbers at all. Of course, those specifically doing numeric computing would be best served with traditional floating point, but I truly think most don't. Maybe my experience is biased, then.


The typical developer's typical work mostly requires integers, and when they need reals, 64-bit floats are nearly always sufficient. I would even say that arbitrary-size integers are needed more often than true rationals.


I am struggling to see how this argument applies to things requiring serious numerical analysis, such as finding eigenvalues and eigenvectors of a 500th-order matrix, or some other complex control system with feedback and many parts.

Great pains need to be taken for calculations of this sort. One test: why would you sort floating point numbers in an array before taking their sum?
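One answer to that test question, as a sketch: when the running sum is large, small terms get absorbed, so the order of summation changes the result, and summing small-magnitude terms first (or using a compensated sum) helps.

  import math

  xs = [1e16] + [0.5] * 10_000 + [-1e16]

  print(sum(xs))                   # 0.0    -- every 0.5 is absorbed by the 1e16
  print(sum(sorted(xs, key=abs)))  # 5000.0 -- small terms summed first survive
  print(math.fsum(xs))             # 5000.0 -- correctly rounded sum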


OP has never played KSP - had the kraken rip whole colony ships apart because of oscillatory wobbles coming in and out of physics modes :(


As far as I can tell, that supports, rather than contradicts, the post's point about modelling, fp, and approximation errors. It's a little too easy to ascribe KSP's bugs to floating-point errors; its physics engine just isn't trustworthy enough. How do you really know that the oscillations are not just due to it solving a stiff system of ODEs with explicit Euler, or something perfectly predictable like that?


Thus the simulation resembles reality, where such oscillations arise out of natural imperfections ;)


I just ran into this problem with an HR colleague who uses Excel.

Real financial data managed to produce three floats (with only 2 decimal places, as usual in finance) where a+b-c produced -0 (actually something like -0.1336e-20).

The result was a cascade of mismatching results in further calculations that were looking for a 0 balance.

It’s a known and identified "feature" with a correct workaround, which is rounding (for finance), and a horrible evil twin, "formatting it away and praying", which can lead to a future, hard-to-diagnose mess for unsuspecting users.

So I agree that floating point error is not so bad from a mathematical standpoint, but from a technical standpoint let us all remember that for most of humankind 3/2 = 1.5 (and not 1).
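Illustrative only (not the colleague's actual figures), but the mechanism looks like this, along with the rounding workaround:

  a, b, c = 0.10, 0.20, 0.30

  print(a + b - c)            # about 5.55e-17, not 0.0
  print(a + b - c == 0)       # False: downstream checks for a zero balance break
  print(round(a + b - c, 2))  # 0.0: round to cents before comparing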


> with a correct workaround, which is rounding (for finance), and a horrible evil twin, "formatting it away and praying"

Are you suggesting the “precision as displayed” setting in Excel is that evil twin? Why?


Because financial data (in our use case) have a fixed 2 decimal format. In almost every other context you could impose this constraint by somehow defining a data type and be done with it.


Floating point is a trade-off which you just have to consciously choose. Sometimes you need more precision; sometimes even less: in my current project, I encode horizontal object position (the x and z coordinates) in 2 bytes, while the y coordinate (vertical) and all 3 Euler angles fit into 1 byte each. It's far from precise, but it looks OK from the point of view of our top-down camera, and all movement updates of players and NPCs fit into half of a pessimistic IP packet (about 400 bytes) per frame.
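For anyone curious what that kind of encoding looks like, a rough sketch; the world bounds and the 16-bit field below are assumptions for illustration, not the actual values from the project:

  # Quantize a coordinate into 16 bits; precision is (range / 65535) per step.
  WORLD_MIN, WORLD_MAX = -512.0, 512.0   # assumed playable range on one axis

  def encode_u16(value):
      t = (value - WORLD_MIN) / (WORLD_MAX - WORLD_MIN)
      return max(0, min(65535, round(t * 65535)))

  def decode_u16(q):
      return WORLD_MIN + (q / 65535) * (WORLD_MAX - WORLD_MIN)

  x = 123.456
  print(decode_u16(encode_u16(x)))   # ~123.46, off by at most about 0.008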


Floats are used in applications other than simulations and modelling. Probably they're mostly used in applications other than simulations and modelling. Modelling errors are nowhere to be found, but floats can still be troublesome. I once spent a day trying to determine why a filter criterion that stated {numberA} + {numberB} >= 0.3 wasn't being met. The answer, as you can guess, was that the system wanted to be doing decimal arithmetic, but was implemented (by me) using floats.
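The numbers below are invented (the real values and threshold were different), but the mechanism was the same:

  from decimal import Decimal

  a, b, threshold = 0.1, 0.7, 0.8
  print(a + b)               # 0.7999999999999999
  print(a + b >= threshold)  # False: the filter criterion is "not met"

  # With decimal arithmetic the same filter behaves as intended.
  print(Decimal('0.1') + Decimal('0.7') >= Decimal('0.8'))  # True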


And I take exception to your lack of worry about floating point errors (FPE).

Trefethen's statement was made only in response to the outrageous mischaracterization of numerical analysts, by non-numerical-analysts, as "people who are mainly concerned about errors in their calculations". Trefethen wrote a paper debunking the claim that numerical analysis is all or mostly about error analysis.

Now back to FPE. Let me give you a few arguments for why FPE is a major headache:

- Consider a random collinear triplet of points on a 2D plane. What I mean by that is three (x, y) pairs (x1, y1), (x2, y2), (x3, y3) such that they make a straight line. Do you know that if you generate such triplets with random angles, 99.999% of your triplets would not be collinear if you're using FP? Why? Because FP representation is not exact except for a hopefully small number of scenarios. E.g., if your lines are perfectly vertical (x1 == x2 == x3), perfectly horizontal (y1 == y2 == y3), or a perfect diagonal (x1 == y1, x2 == y2, x3 == y3), you'd have collinearity; otherwise it's a toss-up. For any other lines, you could try to be very careful in picking x and y values, but then you've probably compromised some other property of your points depending on the application (and btw, when you decide to carefully pick x and y values, welcome to the world of 'being concerned about FP' already). This is the well-known problem of 'robustness' in computer graphics, and something CG researchers have to spend a lot of time on, "beyond" their work of actually having created a correct algorithm, and "beyond" actually having implemented a performant version of it. (A small sketch after this list makes the point concrete.)

- You compare floating-point error magnitudes with modeling error magnitudes. I'm afraid that's a silly comparison. When you're working as a numerical analyst, the modeling domain is the primary part of your research activity. You're expected to spend time looking at modeling and its associated errors. FPE may have much smaller magnitudes, but when it shows up it is a serious headache and a distraction, and it forces you to drop down to systems programming and deal with the IEEE standard and how you can circumvent its negative effects. That's like saying "I'm not concerned about leaks in the fluid systems of my car because they're insignificant compared to my worries about driving the car in dangerous conditions like snowy weather, steep hills, and rough terrain".

- Also, by the way, good luck creating a reproducible system primarily based on FP if it has to run on different architectures. Various microprocessors (Intel x86, x86_64, ARM, POWER, etc.) have implemented their own slightly differing ways of doing single- and double-precision operations like addition, multiplication, trig functions, and so on. Your FP numbers are pretty much guaranteed to differ in their 10th or 11th decimal places. You might think this is not a big deal. But then make sure your system is not running for hours or days (and definitely make sure your system is not for critical applications like medical or space), or you will encounter these errors accumulating and results diverging from each other in more and more significant decimal places.
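Here is the sketch promised in the first bullet (a toy test, not production CG code): build three points that lie on one line by construction, then ask whether their double-precision representations pass an exact collinearity test.

  import math
  import random

  random.seed(0)
  trials, failures = 10_000, 0
  for _ in range(trials):
      theta = random.uniform(0.0, math.pi)        # random line direction
      dx, dy = math.cos(theta), math.sin(theta)
      x1, y1 = random.uniform(-1, 1), random.uniform(-1, 1)
      t2, t3 = random.uniform(-1, 1), random.uniform(-1, 1)
      x2, y2 = x1 + t2 * dx, y1 + t2 * dy          # nominally on the same line
      x3, y3 = x1 + t3 * dx, y1 + t3 * dy
      cross = (x2 - x1) * (y3 - y1) - (y2 - y1) * (x3 - x1)
      if cross != 0.0:                             # exact collinearity check fails
          failures += 1

  print(failures, "of", trials, "'collinear' triplets fail the exact test")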

So to summarize, if you're not worried about FPE, I'm afraid it gives a strong impression that you're only getting started in FP-based work (hands-on numerical analysis, computational science, stats/ML, etc.) and have not acquired enough experience to realize its importance.

edit: I see you have a PhD in applied math based on non-linear PDEs, and a postdoc. So my question to you is, did your PhD involve implementing and running numerical algorithms (meaning writing/compiling/testing actual numerical code)? If the answer is yes then I'm extremely surprised that you don't worry about FPE.


He's not saying that he's not worried about FPE. What he's saying is that misunderstanding the physical system being modeled is far more likely to cause problems, and that it's much harder to spot than an FPE. So your mathematical model requires much more scrutiny for that reason. Basically if you don't spend much time questioning your assumptions about the world your system exists in, then don't even bother accounting for FPE because it's going to be overwhelmed by errors that you actually chose to introduce by your modeling assumptions.


In view of some of what you wrote above, I'm compelled to note that Cook is generally well-informed and his posts are on target - as is this one. I think if you read through some of his posts you'd agree. Some of the ad hominem comments you made above are regrettable and baseless.

(About my own experience with floating point: I think I wrote my first Cholesky solver in K&R C in about 1986, while taking a course based on Golub and van Loan. This was before IEEE was sensibly implemented and you could observe significant FP differences across various Unix platforms.)



