Hi. I'm an employee of Microsoft, which is why I have to be so vague about them. I was an employee of Powerset.com, which implemented a natural language search engine of non-trivial size. We were recently purchased by Microsoft. I've worked closely on Powerset's search engine infrastructure and with their linguistic packages for just about 2 years now, and I've started to understand some of what Bing is doing in the last 6 months.
The basics of search engine ranking are well understood, but compare those basics to the results that Google and Bing pull up and you start to see some serious discrepancies from naive rankers. This is because query type strongly influences ranking decisions, and there is implicit knowledge that STRONGLY affects ranking. A trivial example is the geoip data of a querier, which can be used to aid ranking for queries like "presidential scandals". But trending news headlines might also be an input to your ranking algorithm. A very good ranker like the ones Google and Bing employ is a complex beast, with special inputs, heuristics, and secret tricks that make it as good as it is.
I confess, I am predisposed to be very irritated at your post. You made a toy engine (and no, holding 100's of millions of pages in a keyword index really isn't a huge deal these days), a toy ranker, and now you're an expert on the state of the art? But before I can get too angry I have to admit that when I left mog.com to go to powerset.com, totally ignorant about all but the basics of search, I too had a similar opinion. I figured we'd just use some variant of LSA for relevance and be done with it. Boy, was I wrong. So I can't get too angry at you about it.
Ok, so I'll take your word for it then, you are obviously the expert in the field.
I did not mean to step on your toes. I hope that I'm as well informed about the subject as an 'outsider' can be. If there is anything you can reveal about the real reasons for these discrepancies, consider me all ears.
I'm not above wanting to learn about this stuff (that's why I'm on HN in the first place). My perspective to date (based on my own effort and whatever I could read up on that is publicly accessible) was that the results are not the hard thing; the spam is where the real problems lie.
"Fudge factors" does not sound encouraging, by the way. Possibly you are proving the original poster's point here in some unintended way ;)
EDIT: it would be nice if you could state clearly that you are not aware of any direct effort on the part of Microsoft that influences the search results in a way that either promotes Microsoft and its products and/or changes the results when they are critical of Microsoft, including the blacklisting of critical pages. I think that would go a very long way toward laying these rumours to rest. It's Microsoft's trust image that comes to the surface here, and it seems that it is not very high. Chinese walls between search and the rest of the company would have been the way to go here.
I am not aware of any direct effort on the part of Microsoft to influence the search results in a way that deliberately attempts to obscure negative press about Microsoft's products. I would not be surprised if some security-related things were in fact concealed, but I can see how that might be considered unreasonable in some circles. In any event, I'm certainly not aware of any generic policy in this regard outside of age-related filtering.
Now, I am a low man on the totem pole working in a satellite office, so I wouldn't necessarily know. If I did know of such a scenario, it'd be a firing-level violation of my NDA to talk about it here, but it'd also be a violation of the ethics guidelines to lie about it publicly, so I probably wouldn't be posting here at all about this subject if I knew anything like that.
If I did discover Microsoft doing this, I would probably resign. But I don't believe they're doing it. Microsoft is very serious about this Bing project. I've met and talked with a lot of the people who manage the product, and they're serious, talented people who clearly understand (and have directly said to us) that a search engine is about results and people trusting those results. It would be incredibly risky to deliberately filter things in Bing and risk discovery at the formative stage of Bing-as-a-brand's reputation.
PS. "Fudge factors" are just some things like saying, "Wikipedia and c2.com are awesome, give them a nice boost." It's pretty clear that Google loves Wikipedia even more than what we know about its ranking algorithm's major features would suggest. Once upon a time all wikis scored very highly in Google because of the way PageRank and link text worked, but they've since reduced that effect, it seems. Wikipedia's ranking never really went down.
PPS. I really can't talk about specific features of Bing or Powerset's ranking algorithm. One reason for this is my NDA. The other reason is that they're not my primary domain of expertise, so I'd feel uncomfortable lecturing about them instead of their actual architects, who are often the unsung heroes of a search team.
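For readers wondering what a "fudge factor" of the kind described in the PS might look like, here is a minimal sketch: a hand-maintained table of per-site multipliers applied on top of whatever score the ranker already produced. The table `SITE_BOOST`, the function `boosted_score`, and the multiplier values are all hypothetical, chosen only to illustrate the idea, and say nothing about any real engine.

```python
from urllib.parse import urlparse

# Hypothetical hand-tuned multipliers for sites an editor trusts.
SITE_BOOST = {
    "en.wikipedia.org": 1.5,
    "c2.com": 1.3,
}


def boosted_score(url, base_score):
    """Multiply the ranker's base score by a per-site fudge factor.

    Sites not listed in SITE_BOOST keep their score unchanged.
    """
    host = urlparse(url).hostname or ""
    return base_score * SITE_BOOST.get(host, 1.0)


print(boosted_score("https://en.wikipedia.org/wiki/PageRank", 2.0))
print(boosted_score("http://example.com/pagerank", 2.0))
```

A table like this is blunt, but it is cheap to maintain and easy to reason about, which is presumably why such adjustments tend to survive alongside more principled ranking features.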
Thanks, that is really appreciated, nice to see you being such an upstanding guy!
It's funny how absolutely crucial search is and how much we depend on it, yet how little we actually know about what goes on inside. I think that is part of what drives these wild goose chases based on limited querying; if there were more transparency, they would not take hold. At the same time, the spammers would waste no time or effort trying to exploit such knowledge, so out of necessity it needs to be under wraps.
Or maybe a 'many eyeballs' approach here would help too.