My client has an ASP.NET MVC web application that also has a WordPress blog in a subfolder.
https://www.example.com/
https://www.example.com/wordpress
The WordPress site is loaded with some social sharing links that I do not want crawlers to index. For example:
https://www.example.com/wordpress/some-post/?share=pinterest
First thing: should there be a robots.txt in the / folder and also one in the /wordpress folder? Or just a single one in the / folder? I've tried both without any success.
In my robots.txt file I've included the following:
User-agent: Googlebot
Disallow: ?share=pinterest$
I've also tried several variations like:
Disallow: /wordpress/*/?share=pinterest
No matter what rule I have in robots.txt, I'm not able to get crawlers to stop trying to index these social sharing links. The plugin that creates these sharing links is also marking them "nofollow noindex noreferrer", but since they are all internal links, it causes issues by blocking internal "link juice".
How do I form a rule to disallow crawlers from indexing any link on this site that ends with ?share=pinterest?
Should both sites have a robots.txt, or only one in the main/root folder?
robots.txt should only be at the root of the domain. https://example.com/robots.txt is the correct URL for your robots.txt file. Any robots.txt file in a subdirectory will be ignored.
By default, robots.txt rules are all "starts with" rules. Only a few major bots such as Googlebot support wildcards in Disallow: rules. If you use wildcards, the rules will be obeyed by the major search engines but ignored by most less sophisticated bots.
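To illustrate the difference (the paths below are hypothetical examples, not rules you necessarily need):

```
User-agent: *
# "Starts with" rule: blocks /wordpress/private and everything under it
Disallow: /wordpress/private

# Wildcard rule (honored by Googlebot and other major bots only):
# blocks any URL one level under /wordpress/ with a ?share= query string
Disallow: /wordpress/*/?share=
```

The first rule works everywhere because it is a plain prefix match; the second depends on the crawler supporting `*` in Disallow: rules.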
Using nofollow on those links isn't really going to affect your internal link juice. Those links are all going to be external redirects that will either pass PageRank out of your site, or, if you block that PageRank somehow, it will evaporate. Neither external linking nor PageRank evaporation hurts the SEO of the rest of your site, so from an SEO perspective it doesn't really matter what you do. You can allow those links to be crawled, use nofollow on them, or disallow them in robots.txt. It won't change how the rest of your site is ranked.
robots.txt also has the disadvantage that search engines occasionally index disallowed pages. robots.txt blocks crawling, but it doesn't always prevent indexing. If any of those URLs get external links, Google may index the URL with the anchor text of the links it finds pointing to them.
If you really want to hide the social sharing links from search engine bots, you should handle the functionality with onclick events. Something like:
<a onclick="pinterestShare()">Share on Pinterest</a>
Where pinterestShare is a JavaScript function that uses location.href to set the URL of the page to the Pinterest share URL for the current page.
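A minimal sketch of such a handler (the function names are placeholders, and the Pinterest endpoint shown is the commonly used public share URL, not something your plugin necessarily uses):

```javascript
// Hypothetical helper: build the Pinterest share URL for a given page URL.
// Pinterest's public share endpoint takes the target page in the `url` parameter.
function pinterestShareUrl(pageUrl) {
  return "https://www.pinterest.com/pin/create/button/?url=" +
         encodeURIComponent(pageUrl);
}

// onclick handler: navigate to the share page for the current page.
// Because there is no href on the <a> element, bots see no crawlable
// share link at all, so there is nothing for them to index.
function pinterestShare() {
  location.href = pinterestShareUrl(location.href);
}
```

Since the share URL only exists in JavaScript, no Disallow: rule is needed for it.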
To directly answer your question about robots.txt, this rule is correct:
User-agent: *
Disallow: /wordpress/*/?share=pinterest
You can use Google's robots.txt testing tool to verify that it blocks your URL.
You have to wait 24 hours after making robots.txt changes before bots start obeying the new rules, because bots often cache your old robots.txt for a day.
You may have to wait weeks for new results to show in your webmaster tools and search console accounts. Search engines won't report new results until they get around to re-crawling pages, realize the requests are blocked, and that information makes it back to their webmaster information portals.