Google Indexing Scrapers First?
Yesterday I published a guest post from Abhijeet Mukherjee titled Do You Know Your Visitors? 5 Points to Consider. A couple of hours later Abhijeet messaged me on Gtalk to let me know that Google was not indexing my backlinks to his blog, but rather the link from a scraper site that had copied part of the post.
This made me curious, so I went to check for myself. The first thing I wanted to know was whether my post had already been indexed by Google. I copied one sentence from the post and searched for it in Google, in quotation marks to find only exact matches. The result was pretty surprising: Google had already indexed 2 scraper sites, but my original post was not in its index yet, as the image below illustrates:
I repeated the search query today, and my post now shows in the first position. Regardless, I find it pretty weird that Google would index scraped material first and only afterwards the original source. The same thing was happening with the indexing of the backlinks.
Does anyone know what could be causing this flaw?
42 Responses to “Google Indexing Scrapers First?”
Which sounds suspiciously like what someone said above
SEO and WordPress Design
The reason is simple. Scraper sites scrape several sites, which means they update several times a day, compared to a regular blog which updates usually once a day.
The more frequently a site is updated the more frequently it is crawled by the search engines.
And also the fact that your site is ranked above scraper sites doesn’t mean that Google considers you the original source. For example Technorati blog pages rank usually higher than the original posts.
It happens every day, and it has a logical explanation. The scrapers “produce” much more content than a normal blog. To the Google spider, a scraper site looks like it is updated more frequently, which is why the spiders send its pages to the index first.
But when the 2 (or more) copies are evaluated by the algorithm, Google determines which is the original one. So the copy has only short-lived success in the SERPs.
Not John Chow
I am sure it is not worth your time to chase after the scrapers. Google should be dealing with them by lowering their pagerank.
Logic would suggest that if there are more incoming links to the scraper sites, then Google’s bots will find these sites before yours. Then when they finally do get to yours, they acknowledge yours as the original and list it as such.
Looks like you might have some influence on Google because I just checked and DBT is listed in the number 1 spot where it should be. Maybe it just takes some time for things to work their way through.
Still I’m sure it ticks you off that others are copying your work for their gain but in the end I think it all works out.
What Michael Clark says is correct. The search bot crawling the scraper site is (most likely) not the same bot that crawls dailyblogtips. Upon the first crawl, the scraper site had original content not seen elsewhere, which ranked it high. And then, as soon as the dailyblogtips bot came by, it saw the same content and sent you to the top as being the authority and more trusted site.
It happens all the time with press releases. Scrapers get high rankings for the first few minutes to hours and by the end of the day we are up top and they can’t even be found anymore.
It is something the scraper sites take advantage of, otherwise they wouldn’t be around, but does anyone have a better method for Google to use?
I’ve been noticing this for a while. Another issue along the same lines: republishing posts on places like Zimbio will bump your post out of its position in the SERPs for a few days and replace it with the Zimbio post. And this is after Google has already indexed your post.
Daniel, this has been a major problem with Google for at least a year, maybe longer. Scrapers routinely out-rank original content when doing a text snippet search, and it is not just an “initial” ranking problem; it can and will happen when your content has been live on the net for literally years.
I’ve had articles that were published years ago be out-ranked by scrapers. Luckily, you have a PR7 site, so this effect will be much less on you. However, PR5 and below sites that produce original content are constantly pummeled by this. AND their listing in the SERPs is often hidden below the “for more results…” link at the very bottom. It is very disheartening.
It appears that Google has simply given up fighting scrapers and decided to just index everything and let their algo decide who goes on top with no regard to who originally produced the text.
Unfortunately, Google has lost the battle with scrapers.
Daniel, read comment #13 from Abhijeet…that’s why I had mentioned Digg. He was talking about a scraper blog doing well even though the MakeUseOf post was the one that was Dugg…
wow scrapers work fast
If this is the case then their system is definitely flawed. I can’t see why indexing a scraper site first would be beneficial at all to the person who originally wrote the post. I say Google needs to fix this; since they are the top search engine, this shouldn’t be happening.
@Life is Colourful, could be the case.
@Michael Clark, yes I would also assume they have hundreds of bots going around. I will check the server logs to see if I can identify when they came yesterday.
@Aseem, your first point is true to some extent, but what does it have to do with this post 🙂 ?
I have that footer plugin as well; right now I am using it to display sponsor and partner messages.
@banji, if they are scraping only 20% or less it would be hard to take them down, because this could be considered fair use.
If they are scraping 100% of the content though just send a DMCA.
Maybe a bit off topic, but what do you want to do about those scraper sites? I’m asking because there’s one scraping mine in exactly the same way.
Firstly, I’ve heard getting tons of backlinks from Digg is not necessarily a great thing anymore…I remember reading somewhere that Google puts less weight on that kind of super build up of links in such a short period.
Secondly, everyone should be using the RSS footer plugin, which was actually recommended by someone at Google I think. It puts a link in your RSS feed after each post stating the original source of that post. Scrapers usually grab the entire feed content, so the link will more than likely be in their post also.
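For anyone wondering what such a footer amounts to, here is a minimal sketch in Python of the idea described above: append an attribution link to each feed item before it goes out. This is not the plugin's actual code, and the function name `add_source_footer` and its parameters are made up for illustration.

```python
from html import escape


def add_source_footer(item_html, post_title, post_url, site_name):
    """Return the feed item's HTML with an attribution link appended.

    Hypothetical helper: it only illustrates the idea of an RSS footer,
    it is not how any particular plugin is implemented.
    """
    footer = (
        '<p>This post, <a href="{url}">{title}</a>, '
        "was originally published on {site}.</p>"
    ).format(
        url=escape(post_url, quote=True),  # safe inside the href attribute
        title=escape(post_title),
        site=escape(site_name),
    )
    return item_html + footer
```

A scraper that republishes the full feed item would then carry the link back to the original along with the copied text, as long as it does not strip links.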
Tea Party Girl
Well that explains why my latest back-link is from a “AP” story about Myanmar…
Definitely sounds very backwards to me. Sorry I cannot shed any light on it for you, but I’m glad it didn’t take longer to get fixed and get you in the #1 spot.
Doesn’t Google have hundreds (thousands?) of Googlebots running around the web? It’s very likely that even though your site had been crawled by a Googlebot, that Googlebot’s data hadn’t yet been absorbed by the larger Google search engine. This is a disadvantage of a distributed system like Google.
I’d be curious: check your web server log to see what time Google accessed the article you’re writing about.
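One rough way to do that check is a small Python sketch like the one below. It assumes the server writes the common Apache/Nginx "combined" log format, and the function name `googlebot_hits` is made up here. It also trusts the user-agent string, which anyone can fake, so treat the result as a hint rather than proof that the real Googlebot visited.

```python
import re
from datetime import datetime

# Combined Log Format (assumed):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines, path):
    """Return sorted timestamps of requests for `path` whose
    user-agent mentions Googlebot. Unparseable lines are skipped."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue
        if match.group("path") == path and "Googlebot" in match.group("agent"):
            # Timestamps look like 10/May/2008:14:05:12 +0000
            hits.append(
                datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
            )
    return sorted(hits)
```

Passing the open log file and the post's path (for example `googlebot_hits(open("access.log"), "/my-post/")`, where both names are placeholders) gives the visit times, which can then be compared with when the post went live.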
I’m pretty sure having a sitemap and utilizing update notifications would increase the speed at which Google indexes your site. With the notification, Google would know about the post prior to the scrapers hitting it, putting you at the top quicker…
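As a rough illustration of the sitemap half of that suggestion, here is a short Python sketch that builds a sitemap in the sitemaps.org XML format with a last-modified date per URL. The function name `build_sitemap` is made up, and how (or whether) a search engine gets notified about the file is left out; whether this actually beats the scrapers to the index is the commenter's guess, not something the code shows.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML from (url, lastmod) pairs.

    `lastmod` is expected as a W3C date string such as "2008-05-10".
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, lastmod in entries:
        node = ET.SubElement(urlset, "url")
        ET.SubElement(node, "loc").text = url
        ET.SubElement(node, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")
```

Regenerating the file whenever a post is published keeps the `lastmod` value of the new URL current.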
Daniel, you are so right. Same thing’s happened to me too. Sites that are copying my content are being indexed before mine. This is really weird. Google should do something about it.
Life is Colourful
That might be because of some weird mistake in the algorithm. Most of the time, Google takes a little while to work out which source is reliable, and the scraper site must have come out on top in terms of update frequency. So it’s easy to guess that Google sometimes gives importance to sites that are updated with high posting frequency [which is possible with scraping] even though they are scrapers.
Daniel, yes, I understand your point. Now it could be that G-algo is slightly flawed and even a small aggregator site can at least temporarily beat its source.
Scraper sites are still a relatively new phenomenon. As with blogs, which took Google quite a while to get under control, this sort of attention arbitrage will certainly play out in time.
…Although, I still get many Google Alerts including links to pure spam blogs. Now that I come to think of it the share of spam might have been growing slightly…
I’d thought of domain authority (DBT is), but then the previous comment was the only answer I could find 😉
I’d be curious to know if anyone else has a better explanation. I’m subscribing to the comment feed.
@Sumesh, crawling rate should also be related to overall domain trust and backlinks right?
And how would a scraper site have ‘popular sites’ linking to them? It’s more likely the other way around 😉
The first reason I can think of is that scrapers post several articles every day, and on each visit, if Googlebot sees more content, it’ll gradually increase its crawling rate. Since DBT (1 post/day) probably posts less than that scraper, Google may have indexed it later.
Jarkko, if that is the case I think their system is flawed.
I assume it would be OK for that to happen if Techmeme got indexed before a small site, but having a PR4 scraper site being indexed before a PR7 which is also updated daily is over the edge.
Google probably labels the scraper site as a mid-ranking directory/news site. Frequently updating information streams require faster indexing than company or personal sites. When initial trust has been established (Blogstrings shows PR4 for me), new info is just immediately included without analyzing the relationship between source and distributor.
This shouldn’t be anything surprising; many companies that publish a press release on their own site and distribute it via relevant services are often out-ranked by bigger aggregator sites.
This can’t be a sustainable trend though; after some time the web would be flooded with RSS-circulating sites and nobody would create new content. This can’t happen, can it?
Omition, that should not be the case. Daily Blog Tips has around 300,000 backlinks as counted by Yahoo.
I believe Google lists sites by the order of the links coming into that site and by the popularity of those sites that link to the site in question. Therefore, if the scraper site has more “popular sites” linking to it, then it will probably be listed first.
That Google Daniel…! As usual or maybe not…
For example, I searched the name of a quite famous person and the first results I got were some stupid nicknames on Hi5, and only after 5 or 6 listings did I get what I was looking for!