Webmaster Tools randomly reporting a massive increase in 404s (apparently from old sitemaps)

Well, I’m stumped. Several months back, we launched a totally new website, replacing a legacy system that was pretty messy. Part of the mess was many, many pages that really didn’t need to exist or be crawled by Google: a lot of duplicate and shell data that resulted in extra URLs being crawled and indexed. With the site transition, we of course broke some of these URLs, but that didn’t seem to be much of a concern. I blocked the ones I knew should be blocked in robots.txt, 301 redirected as much duplicate data as I could (still an ongoing process), and simply returned 404 for anything else that should never have been there.
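
For illustration, here is a minimal sketch of that 301-vs-404 split as a Flask app (the routes and URL names are made up; a real site would presumably do this at the web server or framework level):

```python
# Hedged sketch: send known duplicates to their canonical page with a 301,
# and return 404 for shell/junk URLs that should never have existed.
from flask import Flask, abort, redirect

app = Flask(__name__)

# Hypothetical map of legacy duplicate URLs to their canonical replacements.
LEGACY_REDIRECTS = {
    "/old/widget-123": "/products/widget-123",
    "/old/widget-123-copy": "/products/widget-123",
}

@app.route("/old/<path:slug>")
def legacy(slug):
    target = LEGACY_REDIRECTS.get(f"/old/{slug}")
    if target:
        return redirect(target, code=301)  # duplicate data: permanent redirect
    abort(404)  # shell data: let it die with a 404
```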

For the last 3 months, I’ve been monitoring the 404s Google reports in Webmaster Tools, and while we had a few thousand due to the gradual removal of shell and duplicate data, I wasn’t too concerned. I’ve been generating updated sitemaps for Google multiple times a week with any changed URLs. Then, about a week ago, Webmaster Tools started to report a massive increase in 404s, somewhere around 30,000 new 404s a day (making it impossible for me to keep up). My updated sitemaps don’t even contain 30,000 URLs. The 404s are indeed for incorrect URLs: URLs that haven’t existed for months and haven’t been in a sitemap for just as long. It’s as if Google decided to randomly use a sitemap from many months ago; I have no other idea why it would suddenly crawl a URL for data that hasn’t existed in months and is definitely not linked anywhere (although Webmaster Tools claims it’s linked in the sitemap… which it is not).

Does anyone have an explanation for this? I even got an automated message from Webmaster Tools this morning reporting that it has seen a significant increase in 404s on my site. I’m not quite sure how concerned I should really be about this…

Answer

Are the 404 errors all from Googlebot, or are they from real users? If the former, you might be right that Google has used an old sitemap, or it might be re-crawling old URLs to check that they are indeed still invalid. Who knows exactly how the bot works, but it generally does the right thing: your 404 pages won’t appear in the search results, so who cares?
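
If you want to confirm it really is Googlebot (user-agent strings can be faked), Google documents a reverse-DNS check. A minimal sketch, where the sample IP is just an address from a published Googlebot range, not from your logs:

```python
# Hedged sketch: verify a client IP is genuine Googlebot via reverse DNS,
# then confirm the hostname resolves forward to the same IP.
import socket

def is_googlebot(ip: str) -> bool:
    try:
        host = socket.gethostbyaddr(ip)[0]  # reverse lookup: IP -> hostname
    except socket.herror:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return socket.gethostbyname(host) == ip  # forward-confirm the hostname
    except socket.gaierror:
        return False

print(is_googlebot("66.249.66.1"))  # True only if DNS agrees it is Googlebot
```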

If they are real users, you should look into where they came from using the Referer header. Hopefully you can then find the source of the problem. The Referer header is sometimes blank, but for a sample this large I’d expect quite a lot of data to be available.
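
One way to do that in bulk: scan the access log for 404 responses and tally the Referer values, skipping Googlebot hits. A minimal sketch assuming a combined-format log at a hypothetical path:

```python
# Hedged sketch: count Referer headers on non-Googlebot 404s in an
# Apache/nginx "combined" format access log.
import re
from collections import Counter

LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "(?P<req>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

referers = Counter()
with open("access.log") as log:  # hypothetical log location
    for line in log:
        m = LOG_LINE.match(line)
        if m and m.group("status") == "404" and "Googlebot" not in m.group("agent"):
            ref = m.group("referer")
            referers["(blank)" if ref in ("-", "") else ref] += 1

# The most common referrers point at whatever is linking to the dead URLs.
for ref, count in referers.most_common(20):
    print(count, ref)
```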

Attribution
Source: Link, Question Author: jhdavids8, Answer Author: mjaggard