We recently had several sites that had spam code injected into them due to misconfigured file permissions. It was a very targeted attack and only revealed itself to the Google search bot when it visited the site. Normal site visitors simply saw the regular web page, but when the Google bot came, the script gave it a version of the page that included text about on-line pharmaceuticals. So, the only place you could see the spam was in the Google cache. Since the cache is what is used to show the descriptive text under a link in the search results, we had spam text under our links, which tend to be at the top of the search results.
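To make the cloaking concrete, here is a minimal Python sketch of the technique (the page contents and the detection logic are my assumptions; the actual injected script was removed, not published): the compromised server checks the visitor's User-Agent and serves the spam variant only when Googlebot comes calling.

```python
# A minimal sketch of the cloaking trick described above. The page
# contents and the detection test are made up for illustration.
# The server inspects the User-Agent header and serves spam only to
# Google's crawler, so regular visitors, and the site owner, see a
# clean page. Only the Google cache ever holds the spam variant.

CLEAN_PAGE = "<html><body>Our normal content</body></html>"
SPAM_PAGE = "<html><body>Our normal content ... plus pharma spam</body></html>"

def render_page(user_agent):
    # Crude bot detection, which is all an attack like this needs.
    if "googlebot" in (user_agent or "").lower():
        return SPAM_PAGE
    return CLEAN_PAGE

# An ordinary browser gets the clean page:
print(render_page("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))
# Googlebot gets the spam page, which is what lands in the cache:
print(render_page("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
```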
Once we were notified, we quickly addressed the problem, removed the offending script and secured the file. Then the fun began. I went looking for ways to have the Google cache flushed to remove the offending text. I registered the sites with Google’s Webmaster Tools. This process supposedly connects me to the sites and establishes me as a legitimate owner of them. I then went to the remove-URL option (buried under Site configuration, Crawler access). No luck there: that only lets you remove a particular path from the cache; you can’t use it for your base URL. If you try to put the base URL in there, it wants to remove the entire site from Google’s search index, not just the cache. Eventually I found the remove option for the top-level URL, put in my requests and waited. About two hours later, almost all the requests were denied because, according to Google, the content was on a live site and could not be removed:
The content you submitted for cache removal appears on a live page.
As you may know, information in our search results is actually located on publicly available webpages. Even if we removed this page from our index, the content in question would still be available on the web.
To remove this information from our search results and from the web, you’ll need to contact the webmaster of the site in question. Once the webmaster makes the change, you can submit a request to remove the cached copy or simply wait for our search results to reflect this change the next time we crawl the page.
Yeah, that’s what I did, Google.
So, I e-mailed Google’s security and abuse addresses and got an automated response from security which basically told me that unless I had identified a breach in a Google product I could forget about ever hearing from them. The abuse team didn’t even bother with an auto-reply.
So we were now in this really interesting position. The sites themselves were fine; in fact, they had always appeared fine to site visitors. Google had used cached versions of our content to drive traffic and ads on its site, and its cache was the only place where our on-line identity was compromised in any way, yet there seemed to be no way to get Google to stop using that content. In essence, we no longer owned our site content; the Google cache did, and the cache was now misrepresenting our sites with no way to have it changed.
24 hours later, one of the sites, which was submitted exactly like the sites that were denied, was actually updated and the cache removed. So much for consistency in the Google tools, apparently. Thinking I could capitalize on that small victory, I resubmitted all the sites using the same format as the successful one (which was the same thing I did the first time). No luck: within an hour, all the requests were denied. Why the one succeeded is still beyond me.
In the interim, I added:

<meta name="googlebot" content="noarchive">

in the header of the sites. That instructs Googlebot not to cache our pages. But with no way to force a re-crawl, we have to wait for the bot to decide when it’s time to update the cache. That tag is now going into all our sites going forward (and probably should have been there all along).
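As a sanity check after deploying the tag, something like the quick Python sketch below can confirm a page is actually serving the directive (the URL is a placeholder, and the substring test is deliberately crude). Google also honors the same noarchive value sent as an X-Robots-Tag HTTP response header, which the check covers as well.

```python
# A rough deployment check: fetch a page and report whether a noarchive
# directive is present, either in an X-Robots-Tag response header or
# somewhere in the HTML. A crude substring test, but fine for a quick
# sanity check after rolling the tag out.
import urllib.request

def has_noarchive(url):
    with urllib.request.urlopen(url) as resp:
        header = resp.headers.get("X-Robots-Tag", "")
        body = resp.read().decode("utf-8", errors="replace")
    return "noarchive" in header.lower() or "noarchive" in body.lower()

# Placeholder URL; point this at one of your own pages.
print(has_noarchive("http://example.com/"))
```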
I know Google’s motto is supposedly “don’t be evil,” but they sure can bring out some evil feelings when you try to control your own content that has been co-opted by their tools.
After some waiting, it does appear that many of the sites have now cleared from the Google cache and the meta tag has prevented re-caching. It took a visit from the Googlebot for this to happen. However, I have one site that Google refuses to remove from the cache when I submit a request because, according to Google, it has already been removed from the cache. A Google search reveals that it is, in fact, still being cached.
Google gets credit for providing a suite of tools for webmasters and site owners to interact with the giant faceless entity that is a search index. However, I remain unimpressed with the consistency of the tools and with site owners’ actual ability to effect change in cached site information on Google. I think that if Google is going to cache our content, it should do a much better job of respecting our ability to have that content modified or removed. The current process has the appearance of transparency, but its effectiveness is rather murky.