It apparently requires a browser extension. Is that actually better? I'd rather direct my customers to a third-party site I have a relationship with than ask them to install a plugin.
(Disclaimer: I have no product or customers to worry about, just musing)
I have a relative who loves to tell the story of a school building's clocks that would show the wrong time every Monday morning. Turned out their janitor's new vacuum cleaner was putting interference on the power line and disrupting the master clock system's signal enough to require a manual reset. Problem went away after replacing the vacuum.
The problem with Bountysource is that the bounties are quite low for the amount of work involved. Working a week consulting on the customization of open source software for a business will usually net far more than the largest bounty on there. (And those large bounties are for significant, highly specialized tasks.)
I think the bounty model could work well, but on every project where I've seen it implemented, it has acted more like a tip mechanism for the core team than a way to attract professional freelancers to contribute, because of the low value of the bounties.
A simple hack is to run an A-A-B-B test instead of an A-B test. Rather than splitting 50-50, use a 25-25-25-25 split. Once A1 and A2 agree (and likewise B1 and B2), you know the data has settled enough to be statistically meaningful and you can compare A to B. Depending on the dataset, this could happen in minutes or weeks.
To explain this in a different way, let's use a simplified example:
Suppose I have a website with a "Click Me" button that's green in color. I want to increase clicks and think to myself, "perhaps if it was a red button instead of a green button, more people would click!" To test this, I would run an A-B test along the lines of:
  import random
  color = 'red' if random.random() < 0.5 else 'green'  # 50-50 split between red and green
In theory, I just push this code and track the number of clicks on the red button versus the green button and then pick the best. But in practice, when I push the code, there might be 5 clicks on green and none on red in the first hour. Maybe green is better? Maybe I didn't wait long enough? Okay, let's wait longer. A few hours later, there are now 10 clicks on red and only 6 clicks on green. Okay, so red is better? Let's wait even longer. A week later, there are 5,000 clicks on red and 4,500 clicks on green. That seems like enough data that I can draw a conclusion about red vs. green. But is there a better way?
This is where A-A-B-B testing can help. Let's start by looking at just the A-A part of the test. If I split my audience into two groups (green1 and green2) and show them both green buttons, the results should be identical because both buttons are green. If I check back in an hour and the "green1" and the "green2" groups are off by 20%, then I have a large margin of error and need to wait longer. If I check back in 6 hours and they're off by 10%, then I need to wait longer. If I check back in a day and green1 and green2 are only off by 1% then that means we've probably waited long enough and my margin of error is around 1%. I can now add green1+green2 and compare it to red1+red2 groups and see if there's a clear winner (e.g. red is 5% better). And this only took a day instead of a week!
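To make that concrete, here's a rough Python sketch of what the bucketing and the "have the two green buckets converged?" check could look like. The bucket names, the in-memory counters, and the 1% tolerance are all illustrative assumptions, not anything prescribed above:

  import random

  # Hypothetical in-memory counters; a real setup would persist these somewhere.
  buckets = {name: {"views": 0, "clicks": 0}
             for name in ("green1", "green2", "red1", "red2")}

  def assign_bucket():
      # 25-25-25-25 split: each visitor lands in exactly one bucket.
      return random.choice(list(buckets))

  def record_view(name):
      buckets[name]["views"] += 1

  def record_click(name):
      buckets[name]["clicks"] += 1

  def rate(name):
      b = buckets[name]
      return b["clicks"] / b["views"] if b["views"] else 0.0

  def pooled_rate(name1, name2):
      views = buckets[name1]["views"] + buckets[name2]["views"]
      clicks = buckets[name1]["clicks"] + buckets[name2]["clicks"]
      return clicks / views if views else 0.0

  def ready_to_compare(tolerance=0.01):
      # The A-A / B-B check: the two identical buckets should agree within
      # the tolerance before the A vs. B comparison is trusted.
      return (abs(rate("green1") - rate("green2")) < tolerance and
              abs(rate("red1") - rate("red2")) < tolerance)

  def compare():
      # Only meaningful once ready_to_compare() returns True.
      return pooled_rate("green1", "green2"), pooled_rate("red1", "red2")

The convergence check is deliberately crude (it just thresholds the gap between the two identical buckets); a proper statistical test would be more rigorous, but it mirrors the intuition above.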
Using four buckets like that instead of two will improve your confidence in the results, but it will also double the required sample size / testing duration. You could just as easily use two buckets and wait twice as long to achieve the same effect.
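Rough sketch of where that factor of two can come from, assuming the usual square-root behaviour of the standard error of a proportion (the conversion rate and visitor count below are made up purely for illustration):

  import math

  p, n = 0.05, 100_000   # assumed conversion rate and total visitors (made up)

  se_half    = math.sqrt(p * (1 - p) / (n / 2))   # per-arm error with a 50-50 split
  se_quarter = math.sqrt(p * (1 - p) / (n / 4))   # per-bucket error with a 25-25-25-25 split

  print(se_quarter / se_half)   # ~1.414, i.e. sqrt(2)

Halving the traffic per bucket makes each bucket's estimate about sqrt(2) noisier, so getting each of A1/A2 as precise as a single 50% arm takes roughly twice the total traffic. The pooled A1+A2 vs. B1+B2 comparison itself should have about the same power as a plain 50-50 split; the extra traffic is what it costs to make the A1-vs-A2 agreement check tight.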
Can you explain why? I'm struggling with the math behind the whole thing as it is, but intuitively this sounds like a very clever hack. I don't see why it would double the experiment time if, effectively, people are still only seeing either the A or the B variant.
That comment is brilliant, thanks for contributing it.
You'll probably have to ensure it applies sequentially too, at least to be sure the As and Bs are stable in how they're matched, but it seems to me an elegant solution to the problem (not that I'm a statistician, though).
This is better than stopping as soon as you get a statistically significant finding, which is nearly always the wrong thing to do. Do you have any math behind this?
I believe it lets you compensate for the possibility that, say, all of your conversions might be coming from the bottom 1% of your users. Segmenting A into A1/A2 therefore insulates your interpretation of the results for A from being as heavily skewed.
Yes, but in your A/B test you shouldn't be splitting users into a first half and a second half. Each visitor should be randomly assigned, so that should mitigate the problem you mentioned.
https://salt.bountysource.com/teams/crystal-lang