Here's my (completely unsubstantiated) theory. It happened literally the day after crossing $500 and 50K views in AdSense. I'm guessing one of those was a trigger for manual review by some contractor, perhaps overseas. They looked at my sites for three seconds, found them to be cookie-cutter, and decided to blacklist the account. I get the impression they shoot first, ask questions later. I didn't feel like dealing with it all or starting over, so I just moved on to other things.
I talked to someone from Google at I/O who should know, and he claimed they don't play "Whack-a-mole" with websites. Instead, they tweak their ranking algorithm to punish the behavior they see in sites they don't want ranked.
That was probably me. We have two sides to the webspam team at Google: engineering and manual. We definitely prefer to write algorithms so that we avoid dealing with individual websites--the idea is that you strive to fix the root cause of an issue, not to tackle specific sites. However, if we see a website that violates our guidelines and that gets past the algorithms, we are willing to take manual action. Where possible, we use the output of the manual team not only to reduce spam itself, but to train the next iteration of algorithms.
For example, one of the big issues in blackhat spam this past year was illegally hacked sites. Our algorithms weren't doing the best job on hacked sites, so the manual team kept an eye out for hacked sites to remove them (and often to alert the website owners that they'd been hacked). The data generated by the manual team helped us build and deploy multiple new algorithms to detect hacked sites, leading to a 90% reduction in the number of hacked sites showing up in Google's search results in the past few months. That decrease in hacked spam in turn frees up the manual team to tackle the next bleeding-edge technique the spammers use.
I suspect every major search engine uses similar approaches: try to stop the majority of spam with algorithms, but be willing to take action in the meantime while engineers work to improve the algorithms.
Great to know. Out of curiosity, in this particular case, did you find supposed violations on each site, or did you blacklist all of them based on a few?
It varies from case to case depending on a lot of factors like severity, impact on users, etc. In the particular case above, to find out what might have happened, I just picked a few domains at random and dug into their history to find the autogenerated pages with tons of typos on each.
I could post more examples from the other domains, but my point is that this is the sort of thing that users dislike and complain about. If you were a blogger and saw pages like this ranking for your name or your site's name, you probably wouldn't be happy either. From looking at a few domains, I don't think that we overgeneralized from a few pages in this case.
I know that you've moved on and the domains are shut down now. And I'm not trying to be cantankerous. I'm just trying to say that from our point of view there are good reasons to take action on sites like this, so that users don't complain to us.
So, basically what you're saying is I went wrong with the typos? I got really excited by my algo and was overzealous about adding it. I believe I did take it off the sites I issued re-inclusion requests for, but they never got re-included and I never got any messages back (to my knowledge). Also, the typos were not on every one of those domains.
Each site actually took a long time to make. Each one involved either generating a data set from scratch or piecing together and parsing other large data sets. For this one in particular, I was crawling the web for feed discovery and was planning to add things like grouping the best posts by category, etc.
Yeah, I would love to know about some others, e.g. japanese2englishdictionary.com, idnscan.com, serverslist.com. Also, did you actually get any complaints about this, or was it triggered by some other threshold? On a side note, I still get requests about exposing some of this data, i.e. sites behind IP addresses or lists of domains matching some criteria. In any case, thanks for the info!
I can understand the need to take action. I just think it could have been handled better. If typos were the problem, I would have removed them immediately if someone had told me, and that could have been automated. In retrospect, it seems pretty obvious, but it wasn't at the time.
The typos were definitely going overboard. I can understand the appeal of "I've got this great tool--what can I do with it?" But we get a lot of complaints about typo spam, so that's a sensitive issue. I definitely would have done less of that.
There's also a class of folks we call navigation spammers who try to show up for tons of domain name queries. I can give you some history to provide context. In the old days, when you searched for [myspace.com] we'd show a single result as if someone had done the query [info:myspace.com]. The problem is that people would misspell it and do the query [mypsace.com], and then we'd end up either showing no result or (usually) showing a low-quality typo-squatting url. So we made url queries be a string search, so [myspace.com] would return 10 results. That way if someone misspelled the query, they might get the exact-match bad url at #1, but they'd probably get the right answer somewhere else in the top 10. Overall, the change was a big win, because 10% of our queries are misspelled. But if you're showing 10 results for url queries, now there's an opportunity for spammers to SEO for url queries and get the dregs of traffic from the #2 to #10 positions. Now we're getting closer to present-day, so I'll just say we've made algorithmic changes to reduce the impact of that.
But you were hitting a bunch of different factors: tons of typos, specifically for misspelled url queries, autogenerated content, lots of different domain names that looked to have a fair amount of overlap (expireddomainscan.com, registereddomainscan.com, refundeddomainscan.com, etc.). If you were doing this again, I'd recommend fewer domain names and putting more UI/value-add work on the individual domains.
Matt Cutts often tells us this, but he speaks specifically from a search webspam perspective. I suspect the AdSense team has different rules and probably can (and does) "play Whack-a-mole" when appropriate.
Interesting, but in this case it wasn't that the sites were ranked lower; it was that one day I had 25 domains indexed fine and the next day they were nowhere to be found, i.e. not in the index at all. And I had plenty of other domains (not in that AdSense account) that were still ranked fine.
I don't think that was the issue. The fact is that if you've got dozens of websites, each of which has lists of domains/IPs like http://www.mattcutts.com/images/verypopularwebsites-com.png , that is the sort of thing that users complain about and don't want showing up when they do a search. Especially if sites have autogenerated boilerplate content for each one of those links.
I mean, if you're auto-generating a page that has this text: "Elcorillord.com
Thanks Matt. I was building a business off of these domains and realized that Google rankings were the biggest wildcard, and really didn't want any trouble. So I read the Webmaster guidelines closely and often and didn't think I was violating them.
However, I realize some were closer to the line, and I should have focused on being less cookie-cutter and more useful on the domains that were really better (further along). I had always intended to come back and work more on each, but wanted to get placeholders up quickly because it takes a while to get backlinks and get indexed.
I guess I'm saying I had hoped I would at least be contacted with a warning and told what was found objectionable before being totally blacklisted with no reason given. I would also have hoped that each site would be addressed individually. If I had been contacted and you had said, hey, you need to remove these misspellings from these sites, I would have done it immediately.
Here are some comments on the above, though. Again, from my point of view these weren't violating the guidelines, because the pages were useful to users and there were no hidden tricks going on.
First off, there were actually many categories of sites; domains was just one of them. Others were sports stats, definitions, language, medical, and addresses. I was modeling each site I made on other sites that had had great Google rankings for years. I had hoped to eventually improve the UX on those sites and get similar rankings. For domains, I'm talking mainly about who.is and domaintools.com.
Each domain had a static site index, and that's what you linked to above in the screenshot. The extensive ones weren't really meant to be browsed; they existed just so search engines could find the pages (this was before I knew about sitemaps). It's no different from any of the other static sitemaps out there, e.g. http://who.is/whois_index/index.php, and most of them looked better than the screenshot.
That one in particular came from the code for the streetsandzips site, which was a big tag cloud. I was trying to find ways to make the static site index better, and that was one of them. It looks better when the fonts are different sizes :). I had intended to size the entries on that site based on traffic numbers, so Google, Facebook, etc. would be really big. On the streetsandzips site, the bigger cities are bigger.
In fact, I believe I evolved the sites so that those (site index) pages had noindex,follow on them, so that they wouldn't come up in search results. I also added a search engine (Google Custom Search) on each page. I don't remember whether I had gotten to the tag cloud sizes on this particular domain by the time of the blacklisting.
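For what it's worth, here's a minimal sketch (not my actual code, and the URLs are just placeholders) of the two mechanisms I'm referring to: an XML sitemap that search engines can discover pages through, plus the robots meta tag that keeps an index page itself out of the results while still letting crawlers follow its links.

    # Illustrative only: build a sitemap.xml for a list of pages.
    from xml.sax.saxutils import escape

    def build_sitemap(urls):
        entries = "\n".join(f"  <url><loc>{escape(u)}</loc></url>" for u in urls)
        return (
            '<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            f"{entries}\n"
            "</urlset>"
        )

    # The equivalent intent for a human-browsable index page is the tag
    #   <meta name="robots" content="noindex,follow">
    print(build_sitemap([
        "http://example-domainscan.com/whois/example.com",
        "http://example-domainscan.com/whois/example.org",
    ]))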
As for the misspellings, I did mess around with those, but not on all sites, and I believe that by the time the sites were blacklisted, the misspellings had been removed from most of the domains, if not all.
Common-misspelling and typo generation, as you know, is a tool that people provide to those who buy domains. I built it for that purpose, wanted to see how many people were actually searching for this stuff, and so added it to some of the domains. Turns out a lot of people do. I didn't just tack it onto the footer or cloak it or whatever; I put it in to serve a purpose people actually ask for, i.e. finding common misspellings and typos.
Additionally, from the user's perspective, if they got to this page by typing in one of those misspellings, they got a big link to the official site at the top and then more information about that site, e.g. SiteAdvisor rating, traffic, etc. So it was essentially functioning as a one-click "Did you mean X?".
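To give a concrete sense of what a tool like that does, here's a rough sketch of a few obvious typo classes for a domain name. This is just an illustration, not the actual algorithm I used; the function name and approach are made up for the example.

    # Illustrative sketch: generate a few classes of common typos for a domain.
    def common_typos(domain: str) -> set:
        name, dot, tld = domain.partition(".")
        typos = set()
        for i in range(len(name)):
            # dropped letter, e.g. "mypace.com"
            typos.add(name[:i] + name[i + 1:] + dot + tld)
            # doubled letter, e.g. "myyspace.com"
            typos.add(name[:i] + name[i] * 2 + name[i + 1:] + dot + tld)
            # swapped adjacent letters, e.g. "mypsace.com"
            if i < len(name) - 1:
                swapped = name[:i] + name[i + 1] + name[i] + name[i + 2:]
                typos.add(swapped + dot + tld)
        typos.discard(domain)
        return typos

    print(sorted(common_typos("myspace.com"))[:10])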
I'm happy to answer more questions about it. But it is pretty clear that it was still shoot first, ask questions later. No one ever contacted me about anything. I wasn't trying to hide anything from Google. It was all in my personal AdSense account.
I can understand banning sites from a search engine's perspective. But given that I already had a relationship with Google, I expected to be contacted. In fact, at one point I had a call with an AdSense guy from Google who was trying to help me better optimize my sites for Google! He looked at them and had no issues with them, so I thought I was fine.
Also, IIRC I submitted at least one re-inclusion request after being banned, and never heard back on that either. Before submitting that request I did a top-to-bottom review and tried to remove anything even close to the line, including, I believe, the misspellings.
From what I've seen Google doesn't contact people :) My guess is they also have a policy of not sharing reasons for getting blacklisted, to ensure they're not giving spammers an easy way to fix their website.
They claim they respond to all "site reconsideration" requests. I had to file one once; they did respond, but with a very uninformative and unhelpful response.
Yeah, in retrospect I should have taken it slower and not gotten as close to the line in the first place. It's totally my fault, and I'm not bitter. As you can tell from the OP, I've had a lot of failure, and I similarly learned from this one.
The tricky part is that the math works out something along the lines of there being ~200,000,000 domains and ~20,000 Google employees. At a simplistic level that works out to 10,000 domains per Google employee. Which means that even if Google stopped doing everything else and everyone at Google spent all their time talking to webmasters, they'd each have to answer 10,000 people's questions about rankings, how to make their site, whether they have ranking issues, etc. That's oversimplifying somewhat because there are lots of parked domains, but not by too much--you'd be surprised how many people want to talk about their parked domains and why they aren't ranked the way they want. My team is vastly smaller than the number of Google employees, of course. And our first order of business has to be worrying about what users see when they search; talking to webmasters is a secondary priority.
The net effect is that we haven't found a way to talk 1:1 with every webmaster, and I'm not sure whether that's possible. The story of webmaster communication for the last few years at Google has been trying to improve scalability of the info. The earliest Google webmaster communicator ("GoogleGuy") answered questions on a webmaster forum. In 2005 I started a blog, which has the advantage of permalinks for posts like http://www.mattcutts.com/blog/seo-mistakes-autogenerated-doo... . We tried doing live webmaster chats, but that would only reach 400-500 webmasters at a time.
The most scalable thing I've found so far is making videos. Here's a video that came out last month about the dangers of autogenerating pages for example: http://www.youtube.com/watch?v=A8bgpWtVHo4 . We're at almost 300 videos now, and we're getting closer to 3M total views on our webmaster video channel. The hope is that this additional guidance helps people self-identify what can cause issues to avoid or to correct them without needing to talk to Google.
The other big tool that has been helpful is http://google.com/webmasters/ . That provides tools to identify the common errors/mistakes that webmasters make (crawl errors, 404 pages, canonicalization, robots.txt issues, identifying hacked sites using the "Fetch as Googlebot" feature, etc.). That helps with many of the straightforward issues, but of course it doesn't solve the issue with "sheer number of webmasters who have ranking questions vs. number of Googlers." If anyone has suggestions on how to tackle communication with webmasters in a more scalable way, I'd appreciate feedback on how to do better on that.
I think the most scalable thing you could do would be to generate a lot more useful automated warnings via all registered channels, and then have a process, outlined somewhere, covering timelines and how to correct issues. I think the biggest hassle from the user's perspective is that it all feels like a black hole and a black box.
I understand the argument for keeping it a black box, but it doesn't need to be as much of a black hole. For example, in this case the following could have happened:
1) Site triggers some alarm for violating something.
2) Just the offending site(s) get strongly penalized.
3) Automatic messages go out via Google Webmaster Tools, Analytics, AdSense, and Gmail -- wherever the sites are registered. In my case, it would have been all of the above.
4) The messages indicate the nature of the violation and that there is a penalty in effect.
5) There is a link to click on if you think you've corrected the errors.
6) If you click it, your site is automatically re-checked after y days, and you get another message saying whether it passed.
7) If not corrected, the site stays penalized, or there is an escalating series of penalties up to full blacklisting.
That's all automated, i.e. scalable (a rough sketch of what that loop might look like is below). I understand there are some tricky bits around how much to reveal about why things were penalized and whatnot, but I think those could be worked around usefully.
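Here's a purely hypothetical sketch of that warn/fix/re-check loop. Every name in it (the penalty levels, the 14-day window standing in for "y days", the message channels) is invented for illustration; obviously nothing here reflects how Google's real systems work.

    # Hypothetical sketch of the proposed automated warn -> fix -> re-check loop.
    from dataclasses import dataclass, field
    from datetime import date, timedelta

    RECHECK_DAYS = 14  # stands in for the "y days" in step 6

    @dataclass
    class Site:
        domain: str
        channels: list                     # e.g. Webmaster Tools, Analytics, AdSense, Gmail
        violations: list = field(default_factory=list)
        penalty_level: int = 0             # 0 = none ... 3 = full blacklist
        recheck_on: date = None

    def notify(site, message):
        for channel in site.channels:      # steps 3/4: message every registered channel
            print(f"[{channel}] {site.domain}: {message}")

    def detect(site, violations):
        site.violations = violations       # step 1: an alarm fires
        if violations:
            site.penalty_level = 1         # step 2: penalize just this site
            notify(site, "Penalty applied for: " + ", ".join(violations))

    def request_recheck(site):
        # step 5: the owner clicks "I think I've fixed this"
        site.recheck_on = date.today() + timedelta(days=RECHECK_DAYS)
        notify(site, f"Re-check scheduled for {site.recheck_on}")

    def run_recheck(site, still_violating):
        # steps 6/7: the automatic re-check either lifts or escalates the penalty
        if not still_violating:
            site.penalty_level = 0
            notify(site, "Re-check passed; penalty lifted.")
        else:
            site.penalty_level = min(site.penalty_level + 1, 3)
            notify(site, f"Violations remain; penalty now level {site.penalty_level}.")

    # Example walk-through with one of the domains mentioned above
    site = Site("expireddomainscan.com", ["Webmaster Tools", "AdSense"])
    detect(site, ["autogenerated typo pages"])
    request_recheck(site)
    run_recheck(site, still_violating=False)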
Google attempts to determine whether you deserve a warning; the goal is to notify honest folks, without notifying real "bad guy" spammers that they've been caught. Naturally, the algorithm gets it wrong sometimes... detecting wrongdoing is easier than detecting intent.
Were you blacklisted from Google Search, Google AdSense, or both? AdSense's blacklist policy is totally separate from Search's; AdSense blacklists people on suspicion of wrongdoing (guilty unless proven innocent).
First, Matt Cutts is awesome. I don't think this can be disputed.
Second, we're doing our part to spread what we're learning about running a top 500 website (Stack Overflow) with the community, in the form of http://webmasters.stackexchange.com
Do we make mistakes? You bet we do. Just the other day I accidentally disallowed all questions on Stack Overflow from being spidered in robots.txt. That... was... not a good day.
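For anyone curious, that class of mistake is easy to make and easy to test for. Here's a small sanity check, illustrative only -- the Disallow rule and URLs aren't our actual config -- that would flag a robots.txt change blocking question pages:

    # Illustrative sanity check for an over-broad robots.txt rule.
    from urllib.robotparser import RobotFileParser

    robots_txt = """
    User-agent: *
    Disallow: /questions
    Allow: /
    """.strip()

    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # Pages we always expect crawlers to be able to reach (paths are made up).
    must_be_crawlable = [
        "https://stackoverflow.com/questions/12345/example-question",
        "https://stackoverflow.com/",
    ]

    for url in must_be_crawlable:
        if not parser.can_fetch("Googlebot", url):
            print(f"WARNING: {url} is blocked by robots.txt")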
The videos are great. Also, webmaster central has a ton of great info.
However, I interact with a lot of customers who seem put off by Webmaster Central. It comes across as a very outdated interface. I understand it's important to be clear and concise when explaining these issues, but if you look around the web 2.0 world at people providing similar information, the contrast is harsh.
Put simply, Webmaster Central is small text with a dark appearance and little or no graphics. In my experience and testing, this fosters a mentality of "this is too complex." Users encounter long, wordy pages with no graphics and convince themselves it's beyond them before they even begin to read.
Making a page lighter and throwing in a few visual aids goes a long way toward curbing this issue, as well as making the information easier to understand and more fun to read.
It seems like a small thing, but it becomes overwhelming at scale when you consider that most people who encounter a page like this and dismiss it at a glance then start looking for an "email us" link or a contact phone number.
This is my experience, anyway. Your results may vary.
Morr, thanks for the feedback--I'm talking to that team in 45 minutes, and I'll pass on the advice. The webmaster console has evolved a lot through the years, but I'd be the first to admit that it's a lot of info on a relatively small number of pages. Lately the philosophy has moved more toward "Let's try to set up the tools to solve the most common questions or problems that come up." That could work better than a passive panel of information that doesn't tell you what to do about all the info you see.
The idea is still percolating, but I think it's got a lot of potential.