seo content-management-system sitemap google-search-console

Multilingual site with some pages marked as 'Duplicate without user-selected canonical', probably related to the domain being served both with and without www


I am working on a CMS, and in our latest test for a new multilingual site I can see that some of the pages are marked in Google Search Console as "Duplicate without user-selected canonical".

For example, the following URL is marked as "Duplicate without user-selected canonical":

https://www.gotomdz.com/en/place/detail/c5ac334a-6d5e-473b-a46f-70e81c559cf6/cerro-de-la-gloria

Now in my sitemap.xml for this page I have:

<url>
    <loc>https://www.gotomdz.com/en/place/detail/c5ac334a-6d5e-473b-a46f-70e81c559cf6/cerro-de-la-gloria</loc>
    <lastmod>2023-10-03</lastmod>
    <xhtml:link rel="alternate" hreflang="en" href="https://www.gotomdz.com/en/place/detail/c5ac334a-6d5e-473b-a46f-70e81c559cf6/cerro-de-la-gloria"/>
    <xhtml:link rel="alternate" hreflang="es" href="https://www.gotomdz.com/es/place/detail/c5ac334a-6d5e-473b-a46f-70e81c559cf6/cerro-de-la-gloria"/>
    <xhtml:link rel="alternate" hreflang="pt" href="https://www.gotomdz.com/pt/place/detail/c5ac334a-6d5e-473b-a46f-70e81c559cf6/cerro-de-la-gloria"/>
</url>

As you can see, I am letting Google know that there are multiple language versions of the same page. The content is almost the same but fully translated, so my goal is to have all of these pages indexed.

Looking at Google's documentation, I can read the following:

There are three ways to indicate multiple language/locale versions of a page to Google:

  • HTML
  • HTTP Headers
  • Sitemap

The three methods are equivalent from Google's perspective and you can choose the method that's the most convenient for your site. While you can use all three methods at the same time, there's no benefit in Search

So, I think my sitemap.xml is enough, right?

Now, about "rel=canonical": I don't think it applies in my case, since these are different pages. I do not use "rel=canonical" anywhere on the site.

I am afraid that not all the content is indexed.

Now, looking at the indexed content:

The last URL is wrong. It is an old link, and I have to find a way to remove it in Search Console.

Besides this, I can see that ALL pages are indexed.

But, as you can see,

https://gotomdz.com/en/place/detail/c5ac334a-6d5e-473b-a46f-70e81c559cf6/cerro-de-la-gloria

is indexed, but the following is not indexed:

https://www.gotomdz.com/en/place/detail/c5ac334a-6d5e-473b-a46f-70e81c559cf6/cerro-de-la-gloria

I am not sure how to handle this issue.

In the sitemap.xml I am using the full site URL (https://www.gotomdz.com) instead of the domain without www (https://gotomdz.com).

Do I also have to add the "https://gotomdz.com" URLs to the sitemap.xml and mark the "https://www.gotomdz.com" ones with "rel=canonical"? What do you think?

Thank you


Solution

    1. www / non-www duplication

    It seems that your entire website is duplicated on both the www and the non-www host. You do not want both versions to be crawled and indexed: Google spends crawl effort on duplicates and may pick the host you did not intend as the canonical one, which is what happened with the page in your example. Instead, you need to pick one host and 301-redirect the other to it using a global redirect rule (see Redirect non-www to www in .htaccess for guidance on how to achieve this on Apache).
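If the redirect has to live in the CMS itself rather than in the web server configuration, the same global rule can be sketched in application code. This is only an illustration: it assumes you pick the www host as the preferred one, and `CANONICAL_HOST`, `ALTERNATE_HOSTS` and `redirect_target` are names made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: the www host is the preferred one; swap the two if you choose non-www.
CANONICAL_HOST = "www.gotomdz.com"
ALTERNATE_HOSTS = {"gotomdz.com"}


def redirect_target(url):
    """Return the URL to 301-redirect to, or None if no redirect is needed."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host not in ALTERNATE_HOSTS:
        return None  # already on the preferred host, or not one of our hosts
    # Only the host changes; scheme, path, query string and fragment are kept.
    return urlunsplit(
        (parts.scheme, CANONICAL_HOST, parts.path, parts.query, parts.fragment)
    )
```

The request handler would call `redirect_target` first and, when it returns a URL, answer with status 301 and a `Location` header set to that URL. Keeping the path and query string intact means every duplicate page maps to its exact counterpart instead of the home page.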

    Canonical tags are not the solution in this case, because they should only be used in cases where duplicate pages cannot be redirected.

    2. XML sitemap

    Regarding your sitemap.xml file, you should only specify the URLs belonging to the version you chose (www or non-www). The other version should never appear anywhere on your website to avoid feeding duplicate URLs to Google.

    It is not possible to tag URLs in an XML sitemap with rel="canonical", because by definition an XML sitemap should only list canonical URLs that you want indexed. You should never list duplicate URLs in an XML sitemap.
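A quick way to confirm that the sitemap only references the chosen host is to scan it for any other host. The sketch below is an assumption-laden helper, not part of any Google tooling: `non_canonical_urls` is a hypothetical name, and it expects the sitemap to use the standard sitemap and XHTML namespaces, as in the snippet from the question.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Standard namespaces used by sitemaps with hreflang annotations.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}


def non_canonical_urls(sitemap_xml, canonical_host):
    """Return every <loc> or hreflang href whose host is not canonical_host."""
    root = ET.fromstring(sitemap_xml)
    # Collect the page URLs first, then the alternate-language links.
    urls = [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS) if loc.text]
    urls += [link.get("href", "") for link in root.findall("sm:url/xhtml:link", NS)]
    return [u for u in urls if urlsplit(u).hostname != canonical_host]
```

An empty result for `non_canonical_urls(xml_text, "www.gotomdz.com")` means every `<loc>` and every `hreflang` alternate points at the preferred host.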

    3. Handling old URL patterns

    It seems you have updated the URL pattern of the "place" route from /{language}/places/detail/{uuid}/{place-slug} to /{language}/place/detail/{uuid}/{place-slug}. For example, https://gotomdz.com/en/places/detail/c5ac334a-6d5e-473b-a46f-70e81c559cf6/cerro-de-la-gloria is now https://gotomdz.com/en/place/detail/c5ac334a-6d5e-473b-a46f-70e81c559cf6/cerro-de-la-gloria.

    This wouldn't be a problem if Google hadn't already crawled and indexed the old URLs, in which case you could just let them end up as 404 errors. But if Google has already indexed your old URLs, you should make sure to set up a 301-redirect rule from the old URL pattern to the new one so Google can update its index and forget about the old URLs (and remove them from Google Search Console error reports).

    Make sure you also update all your internal links to point to the new URLs, and never again link to the old ones.
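The mapping from the old pattern to the new one is mechanical, so it can be expressed as a single rule. Here is a minimal sketch of that rule, assuming two-letter lowercase language codes and a standard 36-character UUID; `OLD_PLACE_PATH` and `new_place_path` are hypothetical names, and the actual routes in your CMS may differ.

```python
import re

# Old pattern: /{language}/places/detail/{uuid}/{place-slug} (plural "places").
OLD_PLACE_PATH = re.compile(
    r"^/(?P<lang>[a-z]{2})/places/detail/(?P<rest>[0-9a-f-]{36}/[^/]+)/?$"
)


def new_place_path(path):
    """Return the new path to 301-redirect to, or None if path is not an old URL."""
    match = OLD_PLACE_PATH.match(path)
    if match is None:
        return None
    # Only the route segment changes; language, UUID and slug are preserved.
    return f"/{match.group('lang')}/place/detail/{match.group('rest')}"
```

Whatever serves the old URLs would respond with a 301 and the returned path whenever `new_place_path` yields a value, and fall through to normal routing otherwise.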

    4. Handling 404 errors

    Finally, the old URL discussed above is currently displaying a "404 error" message, but the HTTP status code for this page is 200 "OK".

    You should fix this by making sure that your web server returns an HTTP 404 status code whenever it displays the "404 error" message. This will ensure that invalid URLs are properly identified and removed from Google's index.
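To make the difference concrete, here is a tiny WSGI sketch in which the error page and the status code always travel together. It is not how your CMS is built: `KNOWN_PATHS` and `app` are placeholders for whatever routing table and handler you actually have.

```python
# Hypothetical routing table; a real CMS would look the path up in its database.
KNOWN_PATHS = {
    "/en/place/detail/c5ac334a-6d5e-473b-a46f-70e81c559cf6/cerro-de-la-gloria",
}


def app(environ, start_response):
    """Serve known pages with 200 and everything else with a real 404 status."""
    path = environ.get("PATH_INFO", "/")
    if path in KNOWN_PATHS:
        status, body = "200 OK", b"<h1>Cerro de la Gloria</h1>"
    else:
        # The "404 error" page must be sent with a 404 status line, not 200.
        status, body = "404 Not Found", b"<h1>404 error</h1>"
    start_response(status, [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

The point of the sketch is the `else` branch: a page that says "404 error" but is delivered with 200 "OK" looks like valid content to a crawler.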

    Hope this helps!