When I see analytics in industry I normally see a whole lotta graphs, big reports for the monthly board meeting, and so on. The way the industry is moving is basically replicating what people used to do with "small data" on the systems for handling "big data". Witness the surge of interest in SQL-on-Hadoop type systems.
I believe a large amount of this is crap.
Only actions lead to improvements. The tighter your observe/act loop, the more chances to win you have in any given timeframe. Computers can act a heck of a lot faster than humans. Computers can also do a better job of balancing risk than humans in many situations.
If I was a consultant I would definitely be selling my report-generating ability. If I was a company that really wanted to win big, I would be investing in automated decision-making systems.
(Ob plug: My startup, Myna http://mynaweb.com, is one small realisation of this idea.)
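To make the "tight observe/act loop" point concrete, here is a minimal epsilon-greedy bandit sketch in Python -- an illustration with invented conversion rates, not a description of Myna's actual algorithm:

    import random

    # Toy epsilon-greedy bandit: the computer runs the observe/act loop itself,
    # shifting traffic toward whichever variant is converting best.
    # The "true" conversion rates below are invented for the simulation.
    TRUE_RATES = {"A": 0.04, "B": 0.05, "C": 0.06}
    EPSILON = 0.1  # fraction of traffic reserved for exploration

    shows = {arm: 0 for arm in TRUE_RATES}
    wins = {arm: 0 for arm in TRUE_RATES}

    def choose():
        """Explore with probability EPSILON, otherwise exploit the best-looking arm."""
        if random.random() < EPSILON:
            return random.choice(list(TRUE_RATES))
        return max(TRUE_RATES, key=lambda arm: wins[arm] / shows[arm] if shows[arm] else 0.0)

    for _ in range(100_000):
        arm = choose()                                  # act
        converted = random.random() < TRUE_RATES[arm]   # observe
        shows[arm] += 1
        wins[arm] += converted

    for arm in TRUE_RATES:
        print(arm, shows[arm], round(wins[arm] / shows[arm], 4))

Most of the traffic ends up on the best variant without anyone reading a report.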
Yeah, there is a lot of business within report generators, I think. The classical report generators take ages and may even put strain on the DB servers. Writing this stuff completely from scratch, on the other hand, is a lot of work, so it takes time to see novel results. (I speak from experience ;))
Sorry to take it off-topic, but I've never understood how you're supposed to customize your pages based on a synchronous call to a third party API though.
From what I can see it adds quite a lot of latency and introduces a single point of failure, which I'm sure isn't acceptable in most cases.
Big data is all the rage, but as an enterprise biz dev guy I hear a lot of talk about it and not a lot of real doing.
The article states accurately that people can store huge amounts of data. What I don't think it explored sufficiently is how companies can make sense of this.
Big data gives companies the ability to find new insights into relationships between data points they had no idea existed. From what I am seeing, though, people are approaching the issue in the same (wrong) way - they are running the same reports on larger data sets, or running new reports to confirm/disprove areas where they suspect relationships will exist.
What big data needs is a generally available product (read: $$ thousands per year, not hundreds of thousands or millions) that allows companies to send in their data sets, with it proactively mining them to find new trends across their entirety.
I don't think this product exists yet - and if it does, please tell me and I'll rapidly call them for a job - and that is exactly where the big money will be.
The closest thing I can think of is OpenRefine. It doesn't directly find trends, but it does do machine learning to help you clean up datasets. You can also create a scatterplot matrix and do faceted search, which lets you quickly explore data and relationships between dimensions. It's an open source program (previously Google Refine).
For health care, finance, enterprises and government, data quality is a huge issue. Many machine learning and statistical methods just don't work or produce misleading results with messy data.
You might also be interested in recline.js (http://okfnlabs.org/recline/), a fork of Google Refine, initially created by @maxogden, now maintained by the Open Knowledge Foundation (okfn).
How their data processing features differ from OpenRefine is not clear to me, though...
You have quite a few players that do this, challenging traditional "corporate business intelligence" solutions with more affordable, innovative and powerful alternatives.
>>"The problem of big data has been solved. We know how to gather data and store it."
Nope. Far from it. We are still learning to gather data and store it well. This is a complex problem. The author is underestimating the difficulty of having a large number of disparate people collect data, and the variety of formats that produces.
Exactly this. At SnowPlow (https://github.com/snowplow/snowplow) we would love to spend more time downstream at the analysis phase (doing ML etc), but we still have to spend a ton of time working upstream on collection, storage, enrichment etc.
A lot of this work is defining, testing and documenting standard protocols, data models etc (see https://github.com/snowplow/snowplow/wiki/SnowPlow-technical... if you're interested). And this is just for eventstream analytics, working with our own data formats - ingesting and mapping third-party formats (e.g. Omniture, MailChimp, MixPanel etc) is another lot of work that needs doing... So a solved problem? Not so much.
It depends on how you define "analytics." The problem continues to be the same: There's too much data, and we can't make sense of it. It seems obvious, but this simple concept has huge ramifications.
Different ways of approaching making use of that data are huge swings: look at how other search engines tried to make sense of the web vs. how Google did it. Look at how the dozens of analytics companies tried to make use of your web analytics vs. how Omniture did it.
In a more modern example, look at how much content is generated every day on social networks, or how much data healthcare facilities have resting in different silos that is completely unusable to them. There's a lot more that needs to be done.
Although the headline is stating the obvious, this article touches on some good points that companies so far have completely missed.
"But now that companies can receive and store this data, everything from logs, to usage tracking, location coordinates, patterns and so on, the next step is to make sense of all of this."
Shameless plug, but I'm working on those problems now and it is an extremely rewarding area to be in. If you're interested in working in that area or learning more about it feel free to ping me.
>He says that data can be gathered showing how many people see a particular painting or share it online, and thus reach conclusions on how successful an artist is.
Wouldn't this just lead to art becoming a new form of spam when people try to game this system? Not to mention that not all art is painting or can be easily shared.
Thanks for this article! As someone who is just getting into the field out of University, I love to absorb as much as possible on this topic and read different points of view.
Before you can sell me a wrench, you have to let me know what nuts it will turn, and I have to have some such nuts to turn.
For big data, what nuts does it turn? I'm still waiting to hear just what nuts people want turned, especially those for which 'big data' is essential.
It's easy enough to find cases where analysis has been stopped due to far too little data or far too little ability to handle more data. A classic example is R. Bellman's "curse of dimensionality", especially for his work in dynamic programming -- i.e., best decision making over time under uncertainty (with a flavor quite different from the uses of dynamic programming in some computer science algorithms).
Broadly, for the curse of dimensionality, we can start with the set of real numbers R, a positive integer n, and the real n-dimensional space R^n, that is, just the set of all n-tuples of real numbers. Then as n starts to grow, it takes 'big data' to start to 'fill', say, the n-dimensional 'cube' with each side 100 units long, [0, 100]^n. So, if we want to describe something in such a cube and want a lot of accuracy, then we can start with 1 MB of data and start multiplying by factors of 1000 over and over. We can zip past a warehouse full of 4 TB disk drives in a big hurry.
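To put rough numbers on that (the resolution and per-cell storage here are just illustrative), a quick sketch:

    # Cells needed to cover the cube [0, 100]^n at 1-unit resolution, and the
    # storage if each cell holds nothing but an 8-byte count.
    CELLS_PER_SIDE = 100
    BYTES_PER_CELL = 8

    for n in (1, 2, 3, 5, 8, 10):
        cells = CELLS_PER_SIDE ** n
        print(f"n={n:2d}  cells={cells:.2e}  bytes={cells * BYTES_PER_CELL:.2e}")

Already at n = 10 the count is on the order of 10^20 cells, which is why the warehouse of 4 TB drives disappears so quickly.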
Here is a general situation in 'analysis': We are looking for the value of some variable Y. Since we don't know Y, we can say we are looking for the value of a random variable. Or we can look for its distribution. For input, maybe we have many pairs (x, y) where x is in R^n and y is in R. Then, maybe we are told that in our case we have some X in R^n and want the corresponding Y or its distribution.
Well, essentially there is one, just one, way to answer this -- one to rule all the rest. May I have the envelope, please? Yes, here it is (drum roll): simple, plain old cross-tabulation. Why? Because cross-tabulation is just the discrete version of the joint distribution, from which the classic Radon-Nikodym theorem (Rudin, 'Real and Complex Analysis') says that the best answer we can get (non-linear least squares) is the conditional distribution or its expectation, the conditional expectation E[Y|X], which is the best non-linear least squares approximation of Y as a function of X; taking an average from a cross-tabulation is the discrete approximation of this. For a good approximation over a lot of values of X, we can suck up 'big data' and ask for many factors of thousands of times more. The conditional distribution of Y given X, P(Y <= y|X), is addressed similarly. So, net, I agree that there can be some good uses for big data.
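For a concrete toy version of that "average within a cell" idea (one-dimensional X, synthetic data, an arbitrary cell width -- just a sketch), a few lines of Python:

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic, deliberately non-linear relationship: Y = sin(X) + noise.
    x = rng.uniform(0.0, 10.0, 100_000)
    y = np.sin(x) + rng.normal(0.0, 0.3, x.size)

    # "Cross tabulate": cut X into cells and average Y within each cell.
    bins = np.linspace(0.0, 10.0, 51)            # 50 cells of width 0.2
    cell = np.digitize(x, bins) - 1
    cond_mean = np.array([y[cell == i].mean() for i in range(len(bins) - 1)])

    # The cell averages track E[Y|X] = sin(x) at the cell centers.
    centers = (bins[:-1] + bins[1:]) / 2
    print(round(float(np.max(np.abs(cond_mean - np.sin(centers)))), 3))  # small, and it shrinks with more data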
Still, before proposing an answer and picking the tools, let's hear the real questions. Okay?
Why? For one, as general as cross tabulation is, it commonly requires so much data that even realistic versions of 'big data' are way too small. So, typically we use methods other than just cross tabulation to make better use of our limited TBs of data, and to select such methods we really need to hear the question first.
Before I select a wrench, I want to look at the nut. Is this point too much to ask in the discussion of 'big data'?
I will end with one more: Suppose we want to estimate E[Y] by taking an average of n 'samples'. Under the usual assumptions, the standard deviation of our estimate goes down like 1 over the square root of n. So, to get the standard deviation 10 times smaller, we need n to be 100 times bigger. So, roughly for each additional significant digit we want in the estimate, we need another 100 times as much data. Once we start asking for more than, say, five more significant digits, we are way up on a parabola in the amount of data we need. Net, if we want really accurate estimates, then even big data has to struggle. So, really, we accept the law of diminishing returns and just use medium or small data.
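A quick simulation of that square-root law (the distribution and sample sizes here are arbitrary):

    import numpy as np

    rng = np.random.default_rng(2)

    # Standard deviation of the sample mean versus n: each 100x increase in n
    # buys roughly one more decimal digit of accuracy in the estimate of E[Y].
    for n in (100, 10_000, 1_000_000):
        means = [rng.normal(0.0, 1.0, n).mean() for _ in range(500)]
        print(n, round(float(np.std(means)), 4))   # roughly 0.1, 0.01, 0.001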
I'm not really sure about the main point of this comment, and I don't do 'big data' even though I dropped the buzzword into a recent annual report, but I thought the main focus was on "discovery" of reliable relationships in the data.
The statistically interesting aspects come from a large number of variables, not observations.
Edit ... And so I think the comment's focus on the curse of dimensionality and "small data" is misplaced.
The main "point" is that, (1) we can do some things with 'small' or 'medium' data, (2) with 'big' data we can do more, (3) but due to the curse of dimensionality, the 1 over the square root of n problem, etc., even big data cannot be big enough to be, net, much different than small or medium data.
Your point about what is 'big' about 'big' data is not really many more observations but many more variables. Okay. Of course there are problems trolling for causality or even just reliable relationships; we may find something in our net that is not real. But, if from big data we select and work with just a few variables, then with just those variables we are back to small or medium data.
So, the shortest 'point' is that so far it is not so clear just what is new and good that really needs big data. So I asked: for the big data wrench, what nut does it turn that needs turning? I'm not saying that there is no such nut; instead, so far I'm not hearing what the nut is and not seeing it on my own. So, I'm asking the big data people: what are the problems they want big data to solve? This goes to the OP's claim about the importance of analysis of big data. I agree that the analysis is crucial, but without more idea of just what the problems -- the nuts -- are, we need more before we can get excited about the opportunity for valuable analysis.
The basic approach to empirical modeling is pretty labor intensive and ad hoc. It is feasible for small to medium datasets but can't scale to massive datasets -- in particular where there are lots of potential covariates. For those datasets to be analyzed, the modeling process needs a lot more automation. So I think[1] that's what big-data research is trying to solve most of all, and that seems to be what various companies are trying to sell.
Obviously, none of this stuff is useful if it finds meaningless relationships; that's always true when people are looking at empirical data and is not unique to big-data. There is a lot of research on dealing with that exact issue in this setting. An old paper that looks at this stuff from an economics/finance perspective is here:
I suspect that the popularity and faddish nature of "big data" right now comes from hope that those automation procedures will make it conceptually easier to do data analysis, but I don't really know if it will (and I have some doubts).
[1] I am not a big-data person, so anyone more knowledgeable should jump in and correct me.
I get concerned about usages of big data that concentrate on "'discovery' of reliable relationships in the data."
The problem being that as we add more data, the chances of spurious relationships increase dramatically, and the human brain is incredibly good at finding a causal explanation for those relationships, even if none exists. This can quickly turn Big Data into a noise-generating rabbit-hole, leading us down blind alleys, and wasting our time.
I love that it keeps getting easier to test our hypotheses, but a search that begins without a logical and reasoned hypothesis is a dangerous beast.
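To put a number on how easily "relationships" appear in pure noise (everything here is synthetic and the sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(3)

    # One target and thousands of candidate predictors -- all pure noise.
    n_rows, n_cols = 200, 5_000
    y = rng.normal(size=n_rows)
    X = rng.normal(size=(n_rows, n_cols))

    # Correlation of y with every column; none of these relationships is real.
    yc = (y - y.mean()) / y.std()
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    corr = Xc.T @ yc / n_rows

    print(round(float(np.abs(corr).max()), 3))  # the "best" correlation out of pure noise, typically around 0.3

Scan enough unrelated columns and something will always look related.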
If we have enough data and keep testing hypotheses long enough and keep fitting long enough, then we have a good chance of finding a hypothesis we can't reject and a fit that looks good, even though we are looking at junk.
So, divide the data in half, fit to the first half, and test the fit on the second half. And if the fit fails on the second half, then what? Return to the first half, fit again, and then test on the second half again? Now we want some more data to test the most recent fit.
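A minimal sketch of that fit-on-one-half, test-on-the-other idea (synthetic data; the polynomial degrees are just stand-ins for increasingly flexible models):

    import numpy as np

    rng = np.random.default_rng(4)

    # Noisy linear data: y = 2x + noise.  We "search" over polynomial degrees
    # on the first half and check each candidate on the held-out second half.
    x = rng.uniform(-1.0, 1.0, 30)
    y = 2.0 * x + rng.normal(0.0, 0.5, x.size)

    x_fit, y_fit = x[:15], y[:15]
    x_test, y_test = x[15:], y[15:]

    for degree in (1, 3, 10):
        coeffs = np.polyfit(x_fit, y_fit, degree)
        fit_err = np.mean((np.polyval(coeffs, x_fit) - y_fit) ** 2)
        test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        print(degree, round(float(fit_err), 3), round(float(test_err), 3))
    # The flexible fits look great on the first half and fall apart on the
    # second -- and once the second half has been used to pick a degree, it
    # too is "spent", which is exactly the "now we want some more data" problem.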
That's one thing that's interesting about this stuff from a statistics perspective, how you can draw conclusions that are reliable even after some sort of search process. See, for example, the research by Joe Romano and Michael Wolf (and their coauthors) on stuff like "family-wise error rate".
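Romano and Wolf's stepdown procedures are more powerful than this, but as a simpler stand-in, plain Bonferroni control of the family-wise error rate already shows the idea (synthetic data where every null hypothesis is true):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)

    # 1,000 two-sample t-tests in which the null hypothesis is true every time.
    m, n = 1_000, 50
    pvals = np.array([
        stats.ttest_ind(rng.normal(size=n), rng.normal(size=n)).pvalue
        for _ in range(m)
    ])

    alpha = 0.05
    print("naive 'discoveries':     ", int((pvals < alpha).sum()))      # around 50 false positives
    print("Bonferroni 'discoveries':", int((pvals < alpha / m).sum()))  # almost always 0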
Some people (like you) understand that reliability and validity aren't just "p < 0.05", but that's far from universal understanding. I've seen intelligent people accept and reject hypotheses with woefully inadequate evidence, and I've also seen wild hypotheses built on the backs of strong but meaningless correlations.
Dangerous beasts can be useful, but they must be treated with due care.
The books by Howard Wainer, Edward Tufte, and Bill Cleveland are good starting points; it also depends a lot on your particular interests. Andrew Gelman's blog [1] is very good if you're at all interested in non-experimental data and/or poli sci applications.
> Any stories of inadequate examples or meaningless correlations?
A customer's marketing group was tying visitor data to geodemographic data. They put together a database with tons of variables, went searching, and found a multiple regression with a Pearson coefficient of 0.8+, a low p, decided to rewrite personas, and started devising new tactics based on the discovery.
Fortunately, they briefed the CEO and the CEO said that the dimensions in question (I honestly don't remember what they were) didn't make intuitive sense, and demanded more details before supporting such a major shift in tactics. More research was done, and this time somebody remembered that this was a product where the customers aren't the users, so they need to be treated separately. And it turned out the original analysis (done without fancy analytics) was very close to correct.
If the CEO hadn't been engaged during that meeting, they would've thrown away good tactics on a simple mistake. The regression was "reliable" by most statistical measures, but it was noise.
A similar example holds for validity, where I saw a team make wonderfully accurate promotion response models, but they only measured to the first "conversion" instead of measuring LTV. And after several months of the new campaign, it turned out that the new customers had much higher churn, so they weren't nearly as valuable as the original customers.
> Care to elaborate on how to be more sure of reliability and validity?
I'm not a statistician or an actuary. I'm a guy who took four stat classes during undergrad. I know just enough to know that I don't know that much.
Disclaimer aside: my biggest rules of thumb are to make sure that you're measuring the thing you want to measure (not a substitute), to make sure the statistical methods you're using are appropriate for the data you're collecting, and to make sure you understand the segmentation of your market.
So those are some pretty bad decisions coming from statistical analysis; I wonder if you think that those people (the marketing group in particular) would make good decisions generally? It seems like some people are hell-bent on making bad decisions regardless of the tools available to them.
But, yeah, you hand some people a spreadsheet with numbers in it and their critical thinking ability just evaporates.
As an aside, that's not what I meant by "reliable" earlier (and, to be really specific, I agree that low p-values do not ensure reliability even w/out the other problems introduced by that particular model search).
I think what I got is: there is some correlation between x and y that makes us money; to find Y there is an approximation, cross-tabulation, which is really simple; and if the exponential cost of accuracy is so great it will defeat us, then we should just live with fast, simple analysis on smaller sets.
An alternative is to use multidimensional visualization techniques like parallel coordinates. This is a viable way to answer many questions and get a sense of the data, even if you're only working with a sample.
Questions like: Is this dimension ever relevant? Where are there strange outliers? How densely populated is the data? Do these dimensions have a relationship worth investigating? Of course you can answer most of these questions directly with code that computes the answer, but the advantage of high-dimensional visualization is it lets your visual system notice patterns which lead to questions you didn't think to ask.
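For instance, a minimal sketch using pandas' built-in parallel-coordinates plot (the column names and segments here are invented):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from pandas.plotting import parallel_coordinates

    rng = np.random.default_rng(6)

    # A small synthetic sample: a few numeric dimensions plus a class label.
    n = 300
    df = pd.DataFrame({
        "latency_ms": rng.gamma(2.0, 50.0, n),
        "requests": rng.poisson(20, n),
        "errors": rng.poisson(2, n),
        "spend": rng.normal(100.0, 30.0, n),
        "segment": rng.choice(["a", "b", "c"], n),
    })

    # Rescale each numeric column to [0, 1] so the axes are comparable.
    numeric = df.columns.drop("segment")
    df[numeric] = (df[numeric] - df[numeric].min()) / (df[numeric].max() - df[numeric].min())

    parallel_coordinates(df, class_column="segment", alpha=0.3)
    plt.show()

Each row becomes one polyline across the axes, so clusters, outliers, and candidate relationships between dimensions tend to jump out visually.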
"some correlation between x and y that makes us money"
Well, there is a relationship between X and Y but it might not be an actual 'correlation'. Why? Because the relationship might be non-linear.
For an easier explanation, consider just real random X and Y. That is, they are just real numbers (but 'random variables').
Okay, time out, what is a 'random variable'? It's general: Go in a lab, measure a number. That's a random variable. The specific number we measured is one of others that we 'might have' measured (observed). So, we did one 'trial' among many other trials we might have done. With the modern view (modern since Kolmogorov's paper in 1933), all we ever observe in all the universe is just some one trial; still we imagine that there are other trials we might have observed.
Useless? Not really: It may be that we have other measurements we can make, that is, other random variables. Maybe we have reason to believe that these other random variables have the same distribution (from all those other trials we never see) but are independent of Y. So, we can use these other random variables to say things about Y. That's the 'modern' setup. Can any such mathematical objects actually exist? Yes. Classic measure theory provides the foundation. That's much of what Kolmogorov wrote in 1933.
Back to X and Y. Okay, go into Excel and get a 3D graph. Get lots of pairs (x,y) and plot them on the X-Y plane. Then up the Z axis, build a histogram. Have a lot of data so that we get a nice, accurate histogram. So, the histogram will look like a mountain range. That's the joint probability density of X and Y if we scale the mountain range to have total volume 1.
Okay, we are told that we just measured X and want to know about Y. So, we go to our Excel 3D graph, find X on its axis (too many X's in this explanation -- there is better notation but it would take more explanation), and then see what the histogram says about Y. So, at X we get the distribution of Y 'given' X (after we scale to 1). That's the 'conditional density distribution' of Y given X. The cumulative version is written P(Y <= y|X) and read as the 'conditional probability that Y is <= y given the value of X'.
Well, with that conditional distribution, either the density version or the cumulative version, we can find the expectation of Y given X, and that is written E[Y|X]. Then there is a real-valued function of a real variable f such that E[Y|X] = f(X). With all the assumptions and details made clear, that is the classic Radon-Nikodym theorem. There is a famous proof by von Neumann -- i.e., he took it seriously -- in Rudin's book, and there the theorem is a big thing; in modern probability and stochastic processes, it's a huge thing. E.g., Markov means that the past and future are conditionally independent given the present. A martingale is one where the present is the same as the conditional expectation of the future given the past. Every stochastic process is the sum of a martingale and a predictable process. Every martingale that doesn't run off to infinity has to converge to a point. Get very deep into modern mathematical finance and you get a lot on conditional expectations, Markov processes, and martingales. As in the classic Halmos-Savage paper, sufficient statistics is just an application of the Radon-Nikodym theorem. E.g., for Gaussian data, the sample mean and sample standard deviation are sufficient statistics, which means that anything that can be done with all the data can also be done just as well starting with just the sample mean and sample standard deviation -- so long, a lot of 'big data'! Order statistics are always sufficient. Net, conditional expectation from the Radon-Nikodym theorem is a big thing.
So, we can use f(X) = E[Y|X] to approximate Y. Maybe we'd like the least squares approximation. Coming right up: Let our approximating function be g. Then our error is Y - g(X). Then our squared error is (Y - g(X))^2. Then on average our squared error is E[(Y - g(X))^2], and we want to pick g, non-linear, to minimize that. Presto: The g we want is just f. That is, g(X) = f(X) = E[Y|X] minimizes E[(Y - g(X))^2], and that's the very best we can do with any function, including non-linear, of X. So, if we want to approximate Y, the very best (least squares) use we can make of X is just f(X). Moreover, our estimate of Y is 'unbiased', that is, E[Y] = E[f(X)] = E[E[Y|X]]. So, we have the best non-linear, minimum variance, unbiased estimator of Y. Could we ask for more?
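In symbols, the claim comes down to the usual decomposition (orthogonality of Y - E[Y|X] to any function of X, plus the tower property E[E[Y|X]] = E[Y], assuming finite second moments):

    \mathbb{E}\big[(Y - g(X))^2\big]
      = \mathbb{E}\big[(Y - \mathbb{E}[Y \mid X])^2\big]
        + \mathbb{E}\big[(\mathbb{E}[Y \mid X] - g(X))^2\big]
      \;\ge\; \mathbb{E}\big[(Y - \mathbb{E}[Y \mid X])^2\big],

with equality exactly when g(X) = E[Y|X] (almost surely), since the cross term vanishes.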
Then for f(X), for the discrete approximation, we can get that just from that 3D graph in Excel that we developed from just cross tabulation.
So, net, if we have enough data, then cross-tabulation is the discrete approximation of the best estimation we can do, that is, cross tabulation is the one statistical method to rule them all, if we have enough data. So, this is a reason for 'big data'.
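As a rough code version of that Excel picture (synthetic data, arbitrary bin counts -- just a sketch), here is the joint histogram and the conditional distribution of Y given that X falls in one cell:

    import numpy as np

    rng = np.random.default_rng(1)

    # Synthetic pairs (x, y) with a non-linear relationship: Y = X^2 + noise.
    x = rng.normal(0.0, 1.0, 500_000)
    y = x ** 2 + rng.normal(0.0, 0.5, x.size)

    # The joint histogram: the "mountain range" over the X-Y plane.
    H, xedges, yedges = np.histogram2d(x, y, bins=(40, 60), density=True)

    # Condition on X landing in one cell: take that row and rescale it to sum to 1.
    i = 25                                   # pick an X cell with plenty of data
    cond = H[i] / H[i].sum()                 # discrete conditional density of Y given that cell
    ycenters = (yedges[:-1] + yedges[1:]) / 2
    cond_mean = (cond * ycenters).sum()      # discrete E[Y | X in cell i]

    xcenter = (xedges[i] + xedges[i + 1]) / 2
    print(round(float(xcenter), 3), round(float(cond_mean), 3), round(float(xcenter ** 2), 3))  # cond_mean lands near xcenter^2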
"exponential cost"? Well, for the average I mentioned, the 100 times n, etc., the cost is just quadratic, but it still gets too large too fast.
If all we really want to consider is just correlation, then we can use linear statistical methods that are basically a perpendicular projection where we get to use the Pythagorean theorem.
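In symbols, writing the data matrix as A (hypothetical notation here, to avoid a clash with the random variable X above) and assuming A has full column rank, ordinary least squares is the projection

    \hat{y} = A (A^\top A)^{-1} A^\top y,
    \qquad
    \|y\|^2 = \|\hat{y}\|^2 + \|y - \hat{y}\|^2,

which is the Pythagorean theorem applied to y, its projection onto the column space of A, and the residual.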