I'm struggling to figure out how we should structure canonical URLs in a marketplace built with Next.js.
We have a sitemap structure for a marketplace where users can filter collections like so:
/collections/{category}/{subCategory}/{type} + query params for additional filtering
/brand/{brand} + query params for additional filtering
/room/{room} + query params for additional filtering
/style/{style} + query params for additional filtering
/location/{location} + query params for additional filtering
The query params can be a combination of any of the other filters so that:
/collections/seating/chairs?brand=pottery-barn
**has the same content as**
/brand/pottery-barn?category=seating&subcategory=chairs
or
/style/mcm?brand=knoll&category=seating
**has the same content as**
/brand/knoll?category=seating&style=mcm
**has the same content as**
/collections/seating?brand=knoll&style=mcm
I'd love to know what the best practices are here. Should these pages keep separate, self-referencing canonicals even though they serve the same content, or should I consolidate them under a single canonical URL to potentially improve SEO?
Pages that serve the same content are considered duplicates by Google, which can be harmful: it wastes Googlebot's time by forcing it to crawl multiple versions of the same content.

Canonical tags help Google understand which of these duplicate URLs you prefer to have indexed, but they will not stop Googlebot from wasting time crawling the duplicates.

That is why it is much preferable to avoid exposing the same content across multiple URLs in the first place, and to use canonical tags only to mitigate duplication that cannot be avoided.

In your case, I understand that these URLs are necessary to let users filter Product List Pages regardless of which part of the catalog they navigated from, so the duplicate URLs are a functional requirement. However, you can still prevent Googlebot from discovering and crawling them by following these recommendations:
1. **Do not link to the filtered URLs with `<a>` tags**, so that Googlebot never discovers them. Only use client-side JavaScript interactions to send the user to these URLs, as in the sketch below.
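   A minimal sketch of such a filter control, assuming the Next.js App Router; the component and prop names are hypothetical:

   ```tsx
   // Hypothetical filter control: it renders a <button>, not an <a href>,
   // and navigates with router.push(), so there is no link for Googlebot
   // to discover.
   'use client';

   import { usePathname, useRouter, useSearchParams } from 'next/navigation';

   export function FilterOption({ name, value }: { name: string; value: string }) {
     const router = useRouter();
     const pathname = usePathname();
     const searchParams = useSearchParams();

     const applyFilter = () => {
       const params = new URLSearchParams(searchParams?.toString());
       params.set(name, value); // e.g. brand=pottery-barn
       router.push(`${pathname}?${params.toString()}`);
     };

     return <button onClick={applyFilter}>{value}</button>;
   }
   ```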
2. **Use your `/robots.txt` file to prevent Googlebot from crawling these URLs**, with the following rules:

   ```
   Disallow: *?brand=*
   Disallow: *&brand=*
   Disallow: *?category=*
   Disallow: *&category=*
   Disallow: *?location=*
   Disallow: *&location=*
   Disallow: *?room=*
   Disallow: *&room=*
   Disallow: *?style=*
   Disallow: *&style=*
   Disallow: *?subcategory=*
   Disallow: *&subcategory=*
   ```
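   If you are on Next.js 13.3+, these rules can also be generated from code with the `robots.ts` metadata route; a sketch, assuming the same six filter parameters:

   ```ts
   // app/robots.ts — emits one *?param=* and one *&param=* rule per filter.
   import type { MetadataRoute } from 'next';

   const FILTER_PARAMS = ['brand', 'category', 'location', 'room', 'style', 'subcategory'];

   export default function robots(): MetadataRoute.Robots {
     return {
       rules: {
         userAgent: '*',
         disallow: FILTER_PARAMS.flatMap((p) => [`*?${p}=*`, `*&${p}=*`]),
       },
     };
   }
   ```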
3. **Keep regular `<a>` links for the clean, path-based URLs only**, which remain the indexable set:

   ```
   /collections/{category}
   /collections/{category}/{subCategory}
   /collections/{category}/{subCategory}/{type}
   /brand/{brand}
   /room/{room}
   /style/{style}
   /location/{location}
   ```
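   These clean URLs can be linked with plain `next/link`, which renders a real `<a href>` for Googlebot to follow; a trivial sketch (the component name is hypothetical):

   ```tsx
   // Clean, path-based URLs get real <a href> links so they stay crawlable.
   import Link from 'next/link';

   export function BrandLink({ brand }: { brand: string }) {
     return <Link href={`/brand/${brand}`}>{brand}</Link>;
   }
   ```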
4. **Add canonical tags as a safety net**, so that any parameterized URL that does get crawled is consolidated into its clean version:

   - `/collections/seating/chairs` should canonical to itself, i.e. `/collections/seating/chairs`
   - `/collections/seating/chairs?foo=bar` should canonical to `/collections/seating/chairs`
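   A minimal sketch with the App Router's `generateMetadata`, assuming a hypothetical catch-all route at `app/collections/[...slug]/page.tsx`; the canonical is built from path segments only, so every query-parameter variant consolidates to the clean URL:

   ```ts
   // app/collections/[...slug]/page.tsx — the route shape is an assumption.
   import type { Metadata } from 'next';

   type Props = { params: { slug: string[] } };

   export async function generateMetadata({ params }: Props): Promise<Metadata> {
     return {
       alternates: {
         // e.g. /collections/seating/chairs, with no query string.
         canonical: `/collections/${params.slug.join('/')}`,
       },
     };
   }
   ```

   Note that a relative `canonical` like this is resolved against `metadataBase`, which needs to be set in your root layout's metadata.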
⚠️ It is worth noting that the above recommendations are tailored to the specific case described in the question; they are in no way a general recommendation to block discovery and crawling of URLs containing query string parameters.