We recently had several sites that had spam code injected into them due to a misconfigured file permission. It was a very targeted attack that only revealed itself to the Google search bot when it visited the site. Normal site visitors simply saw the regular web page, but when the Google bot came, the script gave it a version of the page that included text about on-line pharmaceuticals. So, the only place you could see the spam was in the Google cache. Since the cache is what is used to show the descriptive text under a link in the search results, we had spam text under our links, which tend to be at the top of the search results.
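For anyone who wants to check their own sites for this kind of bot-only injection, here is a minimal sketch in Python. It is not the tooling we used. It assumes the injected script decides what to serve based on the User-Agent header alone, and the `SPAM_TERMS` list and the function names are made up for illustration. A script that verifies the real Googlebot by IP address would not be fooled by this.

```python
import urllib.request

# A typical desktop browser User-Agent and Googlebot's published User-Agent.
BROWSER_UA = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/120.0 Safari/537.36")
GOOGLEBOT_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                "+http://www.google.com/bot.html)")

# Hypothetical keyword list; adjust it to whatever spam you are hunting for.
SPAM_TERMS = ("pharmacy", "viagra", "cialis")


def fetch(url, user_agent, timeout=10):
    """Download a page while presenting the given User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def cloaked_terms(browser_html, bot_html, terms=SPAM_TERMS):
    """Return the terms that appear only in the copy served to the bot."""
    browser_text = browser_html.lower()
    bot_text = bot_html.lower()
    return [t for t in terms if t in bot_text and t not in browser_text]


def check_site(url):
    """Fetch the page as a browser and as Googlebot, then compare the two."""
    return cloaked_terms(fetch(url, BROWSER_UA), fetch(url, GOOGLEBOT_UA))
```

Calling `check_site` with one of your own URLs returns an empty list when both copies look alike. Anything else in the list is a term that showed up only in the version handed to the bot, which is worth a closer look.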

Once we were notified, we quickly addressed the problem, removed the offending script and secured the file. Then the fun began. I went looking for ways to have the Google cache flushed to remove the offending text. I registered the sites with Google’s webmaster tools. This process supposedly connects me to the sites and establishes me as a legitimate owner of them. I then went to the remove URL option (buried under site configuration, crawler access). No luck there: that only lets you remove a particular path from the cache, and you can’t use it for your base URL. If you try to put the base URL in there, it wants to remove the entire site from Google’s search index, not just the cache. Eventually I found the remove option for the top-level URL, put in my requests and waited. About 2 hours later almost all the requests were denied because, according to Google, each was a live site and could not be removed:

The content you submitted for cache removal appears on a live page.

As you may know, information in our search results is actually located on publicly available webpages. Even if we removed this page from our index, the content in question would still be available on the web.

To remove this information from our search results and from the web, you’ll need to contact the webmaster of the site in question. Once the webmaster makes the change, you can submit a request to remove the cached copy or simply wait for our search results to reflect this change the next time we crawl the page.

Yeah, that’s what I did, Google.

So, I e-mailed security@google.com and abuse@google.com and got an automated response from security which basically told me that unless I had identified a breach in a Google product I could forget about ever hearing from them. The abuse team didn’t even bother with an auto-reply.

So we were now in this really interesting position. The sites themselves were fine; in fact, they had always appeared fine to the site visitors. Google had used cached versions of our content to drive traffic and ads on its site, and its cache was the only place where our on-line identity was compromised in any way, yet there seemed to be no way to get Google to stop using it. In essence, we no longer owned our site content; the Google cache did, and the cache was now misrepresenting our sites with no way to have it changed.

24 hours later, one of the sites, which was submitted exactly like the sites that were denied, was actually updated and the cache removed. So much for consistency in the Google tools, apparently. Thinking I could capitalize on that small victory, I resubmitted all the sites using the same format as the successful one (which was the same thing I did the first time). No luck: within an hour all the requests were denied. Why the one succeeded is still beyond me.

In the interim, I added:

<meta name="robots" content="nosnippet,noarchive">

in the head of each site’s pages. Those are instructions telling Googlebot not to show a snippet or keep a cached copy of our pages. But with no way to force a recrawl, we have to wait for the bot to decide when it’s time to update the cache. That tag is now going into all our sites going forward (and probably should have been there all along).
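Since the tag only helps if it is actually present and well-formed on every page, a quick check is handy. The following is a rough sketch using Python’s standard `html.parser`. The class and function names are my own invention, and it only looks at meta tags named `robots` or `googlebot`.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> or <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name in ("robots", "googlebot"):
            content = attributes.get("content") or ""
            # Directives are comma-separated, e.g. "nosnippet,noarchive".
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def robots_directives(html):
    """Return the set of robots directives found in the given HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


def blocks_caching(html):
    """True if the page asks crawlers not to keep a cached copy."""
    return "noarchive" in robots_directives(html)
```

Feed it the HTML of a page and `blocks_caching` tells you whether the `noarchive` directive was found. One thing to watch for: the attribute values need plain straight quotes. If a blog editor or word processor swaps in typographic quotes, they become part of the attribute value rather than delimiters, and the tag will most likely not be recognized.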

I know Google’s motto is supposedly “don’t be evil,” but they sure can bring out some evil feelings when you try to control your own content that has been co-opted by their tools.

Update:

After some waiting it does appear that many of the sites now have cleared from the Google cache and the meta tag has prevented the caching. It took a visit from the Googlebot in order for this to happen. However, I have one site that Google refuses to remove from the cache when I submit a request, because according to Google it’s already been removed from the cache:

Screenshot from the Google webmaster tools

However, a Google search reveals that it, in fact, is still being cached:

Google search results

Google gets credit for providing a suite of tools for webmasters and site owners to interact with the giant faceless entity that is a search index. However, I remain unimpressed with the consistency of the tools and with how little site owners can actually do to change cached site info on Google. I think that if Google is going to cache our content, it should do a much better job of respecting our ability to have that content modified or removed. The current process has the appearance of transparency, but its effectiveness is rather murky.

9 Responses to “When your content is no longer your own”

  1. Michael says:

    Unfortunately, once this happens the only thing you can really do is wait. Googlebot will eventually crawl the site again and will either cache the new data without the hijacked code or, if requested, remove the cache.

  2. Świetlik says:

    Totally agree with Michael. In this kind of situation we can only wait. We can of course request removal of the cache, but usually it doesn’t work – we need to wait. I had the same problem with my sites, for example artoza or Aero7. Btw. great article!
