Conservation of Intent: why A/B tests aren’t as effective as they look (andrewchen.co)
108 points by dedalus on July 3, 2018 | 57 comments


A/B tests tell you about short term gains, but don't tell you about long term issues you may be accumulating due to things like dark patterns, clickbait headlines, shoddy article topics and more. A/B tests don't take into account the loss of prestige or reputation that the options give.

I've seen this repeatedly with ArsTechnica, which has devolved into so much political and clickbait material that I don't even really visit anymore. Yes, I'm guilty myself of clicking on those articles when I do visit, but at a certain point I've found that Ars doesn't have the news I'm after, so I turn elsewhere and now Ars has one less viewer.


I think what you say is practically spot on.

Yet I'd like to add that I don't think testing frameworks (at this point it would be misleading to call them strictly A/B testing) HAVE to reflect only short-term gains. It's hard, but possible, to come up with (proxy) metrics that hold that kind of short-term optimization in check.

Just to pick one fairly obvious example for e-commerce: you can track returns/customer service contacts as a health metric or even explicitly assign a value to them to fold them into the primary metric you're optimizing.

These kinds of safeguards typically need longer recording periods, so it's potentially a lot of data science and engineering effort to build tools that can handle the long-term data collection and analysis. But it's not impossible. It's just something people love to pretend is not a problem.
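As a sketch, the e-commerce guardrail above can be as simple as subtracting assumed long-term costs from the revenue you would otherwise optimize. The cost figures and field names here are made up for illustration:

```python
# Hypothetical guardrail metric: fold return and customer-service costs
# into the value you optimize, instead of raw conversions alone.
RETURN_COST = 15.0   # assumed average cost of handling one return
CONTACT_COST = 5.0   # assumed average cost of one support contact

def adjusted_value(orders):
    """Net cohort value: gross revenue minus the long-term costs
    that a naive conversion metric ignores."""
    total = 0.0
    for order in orders:
        total += order["revenue"]
        total -= RETURN_COST * order.get("returns", 0)
        total -= CONTACT_COST * order.get("contacts", 0)
    return total
```

Comparing `adjusted_value` across test cohorts penalizes a variation that wins on conversions but generates more returns and support load.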


> I've seen this repeatedly with ArsTechnica

Can you expand? I am an Ars reader, and I too find it frustrating compared to what it used to be. I wish there were more in-depth technical articles; I find it too light on details, written for a non-tech audience. I really, really wish it had solid technical content, since it's what got me reading it.

But I don't characterise it as click-bait (it seems clear) nor as especially political, except in so far as politics intersects technology and climate. In both those it seems balanced or erring towards freedom.


Politics and climate and freedom are interesting, but they're not what makes Ars Technica interesting.


So true about the danger of focusing on short-term gains. I think it is a problem of picking the wrong things to optimize and then sampling on the wrong entities. Conversion rate is the most commonly used target, but it is short-sighted. In the big picture a company is really just trying to maximize profit.

Note that you can't use traditional A/B testing tools on a value like total profit. You need an average if you are going to run an A/B test. For conversion rates your average is conversions per user; in this case you sample on the denominator -- the users. What do you sample on if you are testing total profit per quarter?

This is the ultimate conundrum for websites that use A/B testing. They don't understand how they might actually get more new/returning users by simply improving the user experience. Instead, they run marketing campaigns to drive growth and then squeeze every dollar out of the customers when they arrive. That is not a good long-term strategy. I have reviewed hundreds of A/B tests over the years, and the only ones that I think succeeded were the ones that focused on improving the user experience.


User experience analysis is always going to require a human mind to synthesize the material. The problem is that automation eventually gets noticed by users, thus automation implies to a user 'the programmers/whoever believes my mind and preferences can be automated'.

That's always going to throw a wrench into automation. So yes, it's hard work that requires real thinking with a clear focus on the users and how the users feel about the software. Software needs to care about users if it is dependent on users for its business model!


What's the alternative to A/B testing? Everywhere I've worked, design teams hate A/B testing, but suggest "trust us we know what we're doing" as the alternative.

A/B/n testing lets you explore a search space. There are issues that can come up from that, but IMO it's a better option than this Dilbert comic[0].

[0] http://dilbert.com/strip/2014-10-27


A/B testing is a great tool for incrementally improving upon a product design given the correct business strategy. Just don't let the limitations of A/B testing drive how you choose your business strategy. The strategy is ultimately about maximizing total profit, and not necessarily short term profit per user.


How do you create a discipline around making good decisions for business strategy? I've seen "highest paid person" win, I've seen "Most vocal" win, I've seen "5 versus 1" win.

The testing advocates I've worked with have an attitude of "I don't know what will work - that's why we're testing". I have not seen much of that attitude from most other individuals/teams involved in decision making -- they have preferred to say "this will work", and get angry when I say, "compared to what? how do you know?"

edit - fwiw, I totally agree with understanding the limitations of testing. You have to know what it can be good for. My argument is that it's actually better for long term strategy than most other decision making processes I have experienced, which usually boils down to "gut based".


Listen to people about their complaints and preferences as intellectual equals.


Sure, but not everyone's opinion is equally valid. I've worked with a number of design teams who confuse their opinion (an educated opinion, which I respect) as a fact. Facts beat opinions -- if a short-term optimization will be a long term loser, then prove it.

Unfortunately, it's not really possible to prove/disprove an opinion on what a design element will do.

Additionally, optimization testing allows you to try out different options. Going with someone's (educated) gut feel is a worse decision than exploring the option space.

Ideally, you'd test everything. In the real world, resource constraints apply, so at issue is how to establish an internal discipline where facts trump opinions, and opinions can be tested then either discarded or confirmed.


I mean the users of software. Instead of trying to predict what is the best design, create interfaces that make it easy to aggregate that data, and take that information at face value.

When it comes to design, there often aren't facts (one may even suggest facts don't exist when it comes to design; study the history of art if you want to contest that). Otherwise, listen to users like they are humans, and see whether it matches the designers' advice. Simple.


I think a better approach might be Bayesian optimization: each variation would be a dimension, and the algorithm would find the (almost) optimal possibility, instead of A/B testing, which finds a local maximum.


This has less to do with the use of A/B testing than it does the experimental design.

What are they testing for in their metrics? Is it just click rates? Well that’s a patently losing strategy in the long term.

What if whether or not you came back over time is what was being measured? Would that break some cardinal rule that all A/B tests must be done within a short time period?

Researchers have been doing A/B testing for centuries; until now they've just called it having a control variable and an independent variable. Experimental design is a big deal, and something that is taught to every graduate student researcher, but I don't see the tech blogs giving experimental design enough focus -- at least not in their click-bait headlines. Sometimes people don't design good experiments and then blame the methods.


Bingo! "A/B testing" aka completely randomized design: https://www.itl.nist.gov/div898/handbook/pri/section3/pri331...

Mountains can be moved with randomized control trials.


Not entirely true. You can set up your experiment to be user-keyed (as it should be) and run it for a long time, or run the holdback for a long time.

Something as basic as mod(hash(experiment_group_id or layer_id, user_id), 1000) would give you a user-stable experiment distribution whose long-term impact you can analyze.
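A minimal sketch of that bucketing scheme in Python (md5 stands in for whatever hash you prefer, and the 50/50 split is an assumption):

```python
import hashlib

def bucket(experiment_id, user_id, buckets=1000):
    """Deterministically map a user to one of `buckets` slots for a given
    experiment. The same user always lands in the same slot, so cohorts
    stay stable for long-term analysis."""
    digest = hashlib.md5(f"{experiment_id}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % buckets

def variant(experiment_id, user_id):
    # Assumed 50/50 split: slots 0-499 are control, 500-999 are treatment.
    return "control" if bucket(experiment_id, user_id) < 500 else "treatment"
```

Because assignment depends only on the experiment and user IDs, you can recompute a user's cohort months later without storing any assignment table.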


Long-term A/B testing has major problems in the modern world where users can communicate with each other (both onsite and out of band) and can use multiple accounts / devices / browsers


Communicating users are less of a concern; cross-device _is_ a real concern. In some cases it's minimized with user logins, though.


Winning the game but losing the meta-game.


That's so true!

Too many companies just optimize for short term instead of long term.

Yet, on the other hand, it's completely natural: not every founder wants to stay at their company forever. Not all companies have the objective of staying alive. It's much more exciting and thrilling to chase that hockey-stick growth and IPO/buyout.


The last point you mentioned is why I stopped visiting Ars. I can even live with the articles themselves being slanted, but the comments section seemed even more heavily slanted.


I could not disagree with this more. I remember vividly having this "low-intent" vs "high-intent" debate at Thumbtack, when we rolled out changes that A/B tests showed increased conversion (by a lot), but some people in the company thought the changes were ugly and "off-brand" and argued they brought in the wrong type of customers. So we ran the test again that we knew raised conversion by a lot, and then followed the 2 cohorts of customers and watched their behavior. The control group vs the 10% more from whatever the test was that increased conversion. They behaved exactly the same. They came back again at the same rates. They made the same amount of profit (per customer). Their response rates to emails were the same. They closed jobs at the same rates. As far as we could tell they were identical.

I have to admit I was a little surprised too, but for our business it didn't seem this "high-intent" vs "low-intent" distinction existed. And with that out of the way we continued to optimize conversion rates, and our revenue continued to go up.

Every company is different so I don't want to generalize too much, but if somebody tells me they ran an A/B test that said some key flow went up 10%, but afterwards the traffic/revenue/whatever didn't go up 10%, I think the most likely candidate is bad test design. Humans are really good at rigging A/B tests to produce wrong results in their favor. I guarantee every single company that isn't maniacal about A/B testing does at least one of the following:

- Uses a tool to grade A/B tests that isn't statistically sound

- Lets people check tests too often and allows them to stop the test when it hits a good result

- Runs a test with a lot of similar variations and cherry-picks the best one

- Doesn't plan for enough traffic to detect the percentage of change their test is likely to produce

All of these create the potential for the perceived gains of the A/B test not matching up with real-world results.

I'm not saying the distinction between "low-intent" and "high-intent" customers doesn't exist, but it is fairly easy to test for. Do that test for your business and see if that distinction exists. But don't use it as some magical explanation for why your A/B tests aren't producing the results you want as this article suggests.


> Lets people check tests too often and allows them to stop the test when it hits a good result

I admit to attempting to be guilty of this in the past and being stopped by our analytics team (in the sense that they took the time to patiently explain to me why what I was doing was statistically unsound). It's not obvious, IMO.


I completely agree. We went through all of this stuff as our A/B testing was evolving at Thumbtack. They seem so simple, but the deeper you get you realize they are only simple if you don't "cheat", which everybody does. And you'll only realize you are cheating in the first place if you have somebody who is maniacal about testing.

If you create a culture where positive A/B tests are lauded (which is good!), then you create a lot of people who want A/B tests to finish in positive ways. For those people, it doesn't really matter if their A/B test actually improves things, only if it looks like it does. This isn't nefarious, this is just human nature, but it creates a lot of creativity and energy directed at making A/B tests win. We'd have people run a test where their new variation got off to a bad start, then say "oh, it was a bug", then fix some irrelevant thing and start over just to reset the counters. I was guilty of it sometimes. You get excited about your tests and want them to win. That is why it is critical either to have gatekeepers like an analytics team to keep you honest, or to have a really specific protocol for how your company runs tests and to only consider results of tests that followed the protocol.


Meh.

There are statistics for the purpose of uncovering Truth, and statistics for the purpose of making a business decision. The difference is that when we talk about Truth, a small error is still an error. When we make business decisions, it is fine to make a decision that is probably right and that we know isn't far wrong.

Here is a perfectly valid test procedure that illustrates the difference. Decide the most time you would be willing to spend to get a test result. Multiply that by current conversion rates to get N, the number of conversions that you expect to see by the end of the test.

Start running the test with two variations. Stop at any point if one variation is at least sqrt(N) conversions ahead of the other. Stop at N if there is no clear winner and go with whoever is ahead, even by a hair.

Here are features of this test procedure.

o You always make a decision.

o Running a test has a known fixed cost. You know how long it takes. And a bad idea will cost you no more than sqrt(N) conversions to test.

o The results are very simple and easy to understand.

o Your answers are usually right.

o Your bad decisions are not very bad. If the true conversion rate for one version is better by 1/sqrt(N), you've got a 95% chance of making the right choice. You will probably never make a mistake as big as 2/sqrt(N).

The result is a test procedure that is a horrible approach for doing science, but an excellent tool for improving a business. You'll never find it in a statistics class. And I'm sure it would horrify your analytics team.
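A rough simulation of the procedure above (the conversion rates and N are made-up numbers, and visitors are simulated in pairs, one per variation):

```python
import math
import random

def run_test(p_a, p_b, n, seed=0):
    """Sketch of the stopping rule described above: stop early once one
    variation leads by sqrt(N) conversions, otherwise pick whoever is
    ahead when N total conversions have been seen (ties go to A)."""
    rng = random.Random(seed)
    lead = math.sqrt(n)
    conv_a = conv_b = 0
    while conv_a + conv_b < n:
        if rng.random() < p_a:   # one simulated visitor sees A
            conv_a += 1
        if rng.random() < p_b:   # one simulated visitor sees B
            conv_b += 1
        if abs(conv_a - conv_b) >= lead:
            break
    return "A" if conv_a >= conv_b else "B"
```

When one variation is genuinely much better, the sqrt(N) lead is usually reached long before the N-conversion budget runs out, so the bad idea costs a bounded amount.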


I mean, I think the only reason it would horrify your analytics team is that if you want to do something that sophisticated, you may as well just use a better multi-armed bandit algorithm like Thompson sampling or UCB1 (which is very, very similar to what you've described, although more formalized).

So I think it's wrong to say that you'd never find it in a stats class.
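For reference, Thompson sampling is only a few lines. This is a toy sketch: the conversion rates are invented, and Beta(1,1) priors are assumed:

```python
import random

def thompson_choose(stats, rng):
    """Sample each arm's conversion rate from its Beta posterior
    (successes + 1, failures + 1) and play the arm with the largest draw."""
    draws = {arm: rng.betavariate(s + 1, f + 1) for arm, (s, f) in stats.items()}
    return max(draws, key=draws.get)

# Toy loop: arm "b" truly converts better, so it earns most of the traffic.
rng = random.Random(42)
true_rates = {"a": 0.05, "b": 0.10}
stats = {"a": (0, 0), "b": (0, 0)}
for _ in range(5000):
    arm = thompson_choose(stats, rng)
    s, f = stats[arm]
    if rng.random() < true_rates[arm]:
        stats[arm] = (s + 1, f)
    else:
        stats[arm] = (s, f + 1)
```

Unlike a fixed 50/50 A/B split, the allocation shifts toward the better arm as evidence accumulates, which is exactly the explore/exploit trade-off discussed below.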


Bandits and A/B tests are meant for solving very different problems.

In bandits there is a clear explore/exploit trade-off. There is no such trade-off in the A/B formulation, although it does get used in scenarios that have such trade-offs.

If I can pull the lever only a finite and small number of times, there is a strong incentive to use a bandit: I want to pull the wrong lever as few times as possible. On the other hand, if I am given an unlimited number of pulls, I can afford to pull the wrong one many more times (still finitely many) for the sake of 'knowledge', knowing well that I would have infinitely many opportunities to exploit that knowledge.


And for a business there always is such a trade-off. In an A/B test meant to improve customer conversion, in a perfect world you'd use the superior variation on everyone, converting the maximum number of customers. That saves you money.

In other words, the opportunity cost of putting someone in the wrong group creates such a trade-off. You can pull the lever as many times as you want, but each one potentially costs you money. It's textbook bandits.


To a large extent I agree with you.

Differences creep in when there is ambiguity and judgement involved in what metric the org wants to optimize. This is fairly common. Typically, in these situations it's the PMs who make the final call. There, the goal of the experiment protocol is to glean as much knowledge as possible and present it to the PM. The thinking is: if that comes at the cost of exposing some customers to bad choices, so be it.


Truth isn't that different from business. Both are statistical. Making a decision that is not statistically valid is bad business, as you could be making things worse as often as better.


I think you might be missing btilly's point. He is saying that the testing protocol should factor in the cost of being wrong. These costs might be drastically different for a business and for a scientific pursuit.

If A is ahead of B by a hair and my flawed protocol chooses B, the cost to the business might not be that high. But the same protocol might not be a good one if the cost of making that mistake is very high. The probabilities of the errors remain the same for the two scenarios; the expected costs are different.


Yeah, classical A/B testing only works if you specify the duration of time that the experiment will run in advance and stick to it. There are ways that you can continuously monitor how an experiment is performing (Google "sequential Bayesian analysis") but the math is quite a bit more complex and nuanced.

Honestly, for most startups a simple multi-armed bandit approach is probably the way to go. Don't worry about statistical significance; just throw some "lite" reinforcement learning on top of your product's aesthetics and enjoy the incremental profit. (Caveat: do not apply MAB to major product changes.)


> Honestly, for most startups a simple multi-armed bandit approach is probably the way to go.

Doesn't this mean maintaining all the variants forever?


That is actually one of the biggest contributing factors to the replication crisis in science, lots of scientists have been making this error for decades. Very not obvious.


Bless you for listening. I find statistics to be one of those subjects that can be very counter intuitive and hard to grok, which leads to people just shutting down when the topic comes up. I have a hard enough time convincing people that just the use of statistical average is inferior to median in nearly every context they care about in business, let alone why stopping a test early invalidates it.


Well, you can arrange the test so checking often is not a problem. It may even be the optimal way.

But yes, statistics is not obvious at all.


So damn true. It's so difficult to get statistically significant A/B test results, and the popular tools actively lead you astray. It's rare enough for a startup to have the traffic to meaningfully get results on their homepage before the heat death of the universe, let alone random landing or in-app pages.

I'd recommend anyone reading this who does or wants to do A/B tests read: https://www.evanmiller.org/how-not-to-run-an-ab-test.html

It was one of the trickiest lessons I had to learn. Once you realize you need to set your sample size in advance, you actually have to do the math to figure out the traffic you'll need. That's when reality hits you.
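To see why the traffic requirement hits so hard, here is the standard two-proportion sample-size approximation in pure stdlib Python (the 2% base rate and 5% relative lift are illustrative numbers):

```python
from statistics import NormalDist

def sample_size_per_variant(base_rate, rel_lift, alpha=0.05, power=0.8):
    """Approximate visitors needed per variant for a two-sided z-test
    comparing two conversion rates. Decide this BEFORE the test starts."""
    p1 = base_rate
    p2 = base_rate * (1 + rel_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96
    z_power = NormalDist().inv_cdf(power)           # ~0.84
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = variance * ((z_alpha + z_power) / (p2 - p1)) ** 2
    return int(n) + 1

# Detecting a 5% relative lift on a 2% conversion rate takes
# hundreds of thousands of visitors per variant.
n = sample_size_per_variant(0.02, 0.05)
```

The required N grows with the inverse square of the effect size, which is why hunting for tiny lifts on low-traffic pages is hopeless.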


Yup. The best thing for running an A/B test you can do, by far, is setting up the rules in advance. Our protocol was something like this:

This test is going to have 2 variations, looking for a 5% increase of the conversion rate that is currently X%, and to do so the test will run for Y iterations (based on the company standards for significance and power).

If the test shows a >= 5% increase, it wins and we use it. Yay! If not, we assume it is no different than the baseline, record the results, and discard it. You are welcome to peek at the results all you want, but no tests are stopped early and no decisions are made until the test reaches the set number of iterations. This isn't the only good or valid A/B testing protocol, but it does force people to consider the costs of their tests in advance (in terms of time and iterations required), which I think had a positive effect on the type of tests people ran.

Just having the calculator discourages people from running tests with tons of variations looking for tiny increases (like the famous "try different colors on your submit button"), because just looking at the iterations required it becomes obvious to everybody that those types of tests cannot show any meaningful results for your average website.


Well said. The problem is, most companies don't really build up a model for conversions, ignore sample sizes, and don't fit their A/B test results against the model; they just blindly look for the higher percentage. I guess this is the problem if you don't have any statistics-aware engineers.


I think it's always good to attach A/B tests and other web analytics to individuals so you can go back and do this type of segmentation and analysis. A/B test software like VWO is great, but when it comes to analytics and measurement, looking at real people (assuming you have an identifier for them) is much better than looking at pixel'd representations of a person.

Attaching A/B test variations, sourcing data (UTMs), and other data to actual customer/user records in your database is a great start to understanding your results better. That allows you to do things like suss out "intent" (essentially conversion rate), "quality" (essentially LTV), etc. down the road and really understand the value of your tests and acquisition channels.

I'm always surprised when companies with more than enough resources to do this don't do this.


Did you read the article? The real point has less to do with intent than you realize.

Most of the discussion has to do with having healthy skepticism for a vendor making exaggerated claims; and ensuring that you retain customers.


The title is misleading relative to the article's content. Surely, as the author points out, sometimes A/B tests can be misleading especially if you ignore longer term cohort analysis, etc.

But often times, if you fix an obviously broken part of your funnel, particularly in the early acquisition stages, you're fixing things that are universally lifting the amount of people who ultimately are able to engage with your brand and product to the point where they can even form intent. The reality is most people are only willing to give you a tiny bit of their time during their first one or two engagements with your brand, so at that stage you're trying to sell them on your product, and build intent. A/B testing helps reduce the friction needed to get them through the core of your sales pitch.

It's easy to come up with a thought experiment that shows A/B testing can sometimes be as simple as you'd imagine: just break the site. Your conversion drops to 0%, now split test the fix. Like magic, your control stays at 0% and your variant returns to normal. Nothing about "intent" in this scenario, this is pure friction resolution. Just a thought experiment, but shows that surely there are plenty of places where pure A/B testing and removing friction is a net positive without any fretting over this "conservation of intent" issue.


Slightly relevant but useful: use mediation modeling (https://eng.uber.com/mediation-modeling/)


This is a nice read. BTW, in case folks get confused, moderators are different from mediators. http://psych.wisc.edu/henriques/mediator.html


What the article largely discusses seems to be a problem with the metrics one chooses as a proxy.

Let's say you're actually trying to optimize total transaction value on the site or total number of transactions or something like the overall fraction of users with at least one transaction within a certain window of time. Then - as the article rightly observes - getting users not to bounce on a particular page is a TERRIBLE proxy to what you're optimizing for. If that's not clear to you, you have no business running A/B tests without supervision.

Source: co-designed one iteration of the experimentation framework for Booking.com many years ago. Indirectly managed the team of much more qualified people that took it a world further.


One of my coworkers is a trained particle physicist and he informs me he almost never sees properly designed experiments used by our A/B testers. The result is that the testers almost always find what they are looking to find.


I think the idea of "high intent" is the same fallacy as the notion of "affordable." We say something is "affordable" because we have "enough" money to buy it, but that's not how people make decisions in aggregate.

The reason economists talk about opportunity cost is because people are constantly optimizing decisions based on new information. (Humans may not deal with prices and numbers very well, but they're pretty well evolved to break time into chunks and work out plans to solve problems.)

If you talk to an individual, they might say "I can't afford it," or you may talk to someone who didn't click through and they might say, "I was just browsing." The fallacy behind both is you're creating archetypes and assuming they represent the modes of the population.

And even if you talk to the individuals you based those archetypes on, there is a whole history behind how they arrived at "I can't afford it." Those changing circumstances are why the aggregate behavior doesn't show some arbitrary level of "affordability," and instead you see a smooth curve of consumer demand.

And the opportunity cost of continuing to view a web page will not have neatly quantized levels of intent, but rather individuals have a broad array of competing interests.


> You ship an experiment that’s +10% in your conversion funnel. Then your revenue/installs/whatever goes up by +10% right? Wrong :( Turns out usually it goes up a little bit, or maybe not at all.

Never mind "The difference between high- and low-intent users"; this could be explained in terms of regression toward the mean, a phenomenon mentioned in neither the article nor the discussion here.

Have 1000 students do an IQ test. Pick the top 20 students. Have them do another IQ test next week. Their mean score second time round will almost certainly be lower than their mean score first time round. The reason they made the top 20 the first time round was a combination of having a high true IQ, and being lucky on the day. Second time round, they aren't 'defined to be lucky', as it were.

It's the reason movie sequels tend to be worse than the original. The reason the sequel was made was that the original movie was far more successful than the average movie, on account of both unusually skillful creators, and unusually good luck. Second time round, you can't count on the luck component again.
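The IQ example is easy to simulate; the population size and noise level below are arbitrary choices:

```python
import random

rng = random.Random(0)

def iq_test(true_iq):
    # Measured score = true ability plus day-to-day luck.
    return true_iq + rng.gauss(0, 10)

true_iqs = [rng.gauss(100, 15) for _ in range(1000)]
first_round = sorted(((iq_test(t), t) for t in true_iqs), reverse=True)
top20 = first_round[:20]

mean_first = sum(score for score, _ in top20) / 20
mean_second = sum(iq_test(t) for _, t in top20) / 20
# mean_second almost always comes in lower: the luck that helped
# select the top 20 doesn't repeat on the retest.
```

The same mechanism applies to an A/B winner: the variant was selected partly because it was lucky during the test window, so its post-launch performance regresses toward its true (smaller) effect.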


Totally true. I've lost count of people from the web/startupy scene A/B testing their companies/business units into insolvency.


Frustration is not linear. Film at 11.


Wait, so you’re telling me the laziest form of scientific analysis, the A/B test, doesn’t produce accurate results? Colour me shocked.

A/B tests routinely leave out important observations, have way too small a scope, uncontrolled populations, I could go on... they run the gamut of anti-patterns.


What the software industry calls an "A/B test" is what scientists call a "randomized controlled trial",* and it's generally considered to be the best type of experiment you can do.

The fact that people routinely screw up their experiments in the ways you list and others isn't an indictment of the A/B test methodology.

* Well actually the term gets used to mean a lot of things but insofar as you can give it a meaningful definition, it means randomized controlled trial.


This. It has always seemed pretty nuts to me the slippage between the sort of idealized multi-armed bandit mechanics of A/B testing (the theoretically sound basis) and the actual real-world situations with enormous hypothesis spaces and gnarly sampling problems. But I guess even finding local minima / maxima is useful?


....and criticising statistics is the laziest kind of scientific criticism....

A/B tests are fine. They work. They allow inference of causality. They are easy to understand, and can be fun to run. They get you 90% of wherever you want to go, and such over-the-top criticism just seems like badly executed pretentiousness.


You’re partly right - a scientific endeavor to figure out the color of a button would be over-the-top, because it’s not that important.

But to the article’s point, if you’re running banking software or something, your users don’t give a shit what the button colours are; they will slog through whatever you develop because they need to get stuff done.

A/B tests are a small tool that sometimes get taken too far or used in the wrong context.


'Taken too far or used in the wrong context' - sure.

In the language of this article, banking app users are mostly all 'high intent'. But that doesn't mean you can't evaluate criteria other than users who completed the workflow to determine what is a design improvement. You can still measure time to completion, how long it took the user from entering the workflow to completing a given task as a measure, and play with the design. Optimize the things users are doing the most, that sort of thing. A/B testing can help you there. It's not the be all end all; you still need sound UX design to figure out what designs to test out, but it can give you measurable data as to what works, rather than just UX gut feeling, or purely lab based results which don't reflect reality.


> a scientific endeavor to figure out the color of a button would be over-the-top, because it’s not that important.

Not saying A/B testing shades of blue like Google reportedly does is anything useful, but picking a different button color altogether reportedly can make a difference.

Even if it's 1% more sign-ups each month, that can translate to a few more potential clients per month - and more word of mouth. If you think that is too trivial a difference to matter, think about what 1% more interest would mean over your career as you compound interest for your retirement.
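The compounding claim is quick to check with hypothetical numbers:

```python
# Hypothetical: $10,000 left invested over a 40-year career,
# at 7% vs 8% annual return.
principal = 10_000
years = 40
at_7 = principal * 1.07 ** years
at_8 = principal * 1.08 ** years
ratio = at_8 / at_7  # one extra point compounds to roughly 45% more money
```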



