I'm struggling to figure out how we should structure canonical URLs in a marketplace built with Next.js.
We have a sitemap structure for a marketplace where users can filter collections like so:
/collections/{category}/{subCategory}/{type} + query params for additional filtering
/brand/{brand} + query params for additional filtering
/room/{room} + query params for additional filtering
/style/{style} + query params for additional filtering
/location/{location} + query params for additional filtering
The query params can be a combination of any of the other filters so that:
/collections/seating/chairs?brand=pottery-barn
**has the same content as**
/brand/pottery-barn?category=seating&subcategory=chairs
or
/style/mcm?brand=knoll&category=seating
**has the same content as**
/brand/knoll?category=seating&style=mcm
**has the same content as**
/collections/seating?brand=knoll&style=mcm
I'd love to know what the best practices are here. Should these pages keep separate, self-referencing canonicals even though they serve the same content, or should I consolidate them under a single canonical URL to potentially improve SEO?
Pages that serve the same content are considered duplicates by Google, which can be harmful: it wastes Googlebot's time by forcing it to crawl multiple versions of the same content.

Canonical tags help Google understand which of these duplicate URLs you prefer to have indexed, but they will not stop Googlebot from wasting time crawling the duplicates.

That is why it is much preferable to avoid exposing the same content across multiple URLs in the first place, and to use canonical tags only to mitigate duplication that cannot be avoided.

In your case, I understand that these URLs are necessary to let users filter Product List Pages regardless of which part of the catalog they navigated from, so the duplicate URLs are a functional requirement. However, you can still prevent Googlebot from discovering and crawling them by following these recommendations:
1. **Do not link to the filtered URLs with `<a>` tags**, so that Googlebot never discovers them. Only use client-side JavaScript interactions to send the user to these URLs, as in the sketch below.
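   A minimal sketch of such a filter control, assuming the Next.js App Router; the component and prop names are hypothetical:

   ```tsx
   // Hypothetical filter control: it renders a <button>, not an <a href>,
   // and navigates with router.push(), so there is no link for Googlebot
   // to discover.
   'use client';

   import { usePathname, useRouter, useSearchParams } from 'next/navigation';

   export function FilterOption({ name, value }: { name: string; value: string }) {
     const router = useRouter();
     const pathname = usePathname();
     const searchParams = useSearchParams();

     const applyFilter = () => {
       const params = new URLSearchParams(searchParams?.toString());
       params.set(name, value); // e.g. brand=pottery-barn
       router.push(`${pathname}?${params.toString()}`);
     };

     return <button onClick={applyFilter}>{value}</button>;
   }
   ```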
2. **Use your `/robots.txt` file to prevent Googlebot from crawling these URLs**, with the following rules:

   ```
   Disallow: *?brand=*
   Disallow: *&brand=*
   Disallow: *?category=*
   Disallow: *&category=*
   Disallow: *?location=*
   Disallow: *&location=*
   Disallow: *?room=*
   Disallow: *&room=*
   Disallow: *?style=*
   Disallow: *&style=*
   Disallow: *?subcategory=*
   Disallow: *&subcategory=*
   ```
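   If you are on Next.js 13.3+, these rules can also be generated from code with the `robots.ts` metadata route; a sketch, assuming the same six filter parameters:

   ```ts
   // app/robots.ts — emits one *?param=* and one *&param=* rule per filter.
   import type { MetadataRoute } from 'next';

   const FILTER_PARAMS = ['brand', 'category', 'location', 'room', 'style', 'subcategory'];

   export default function robots(): MetadataRoute.Robots {
     return {
       rules: {
         userAgent: '*',
         disallow: FILTER_PARAMS.flatMap((p) => [`*?${p}=*`, `*&${p}=*`]),
       },
     };
   }
   ```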
3. **Keep regular `<a>` links for the clean, path-based URLs only**, which remain the indexable set:

   ```
   /collections/{category}
   /collections/{category}/{subCategory}
   /collections/{category}/{subCategory}/{type}
   /brand/{brand}
   /room/{room}
   /style/{style}
   /location/{location}
   ```
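   These clean URLs can be linked with plain `next/link`, which renders a real `<a href>` for Googlebot to follow; a trivial sketch (the component name is hypothetical):

   ```tsx
   // Clean, path-based URLs get real <a href> links so they stay crawlable.
   import Link from 'next/link';

   export function BrandLink({ brand }: { brand: string }) {
     return <Link href={`/brand/${brand}`}>{brand}</Link>;
   }
   ```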
4. **Add canonical tags as a safety net**, so that any parameterized URL that does get crawled is consolidated into its clean version:

   - `/collections/seating/chairs` should canonical to itself, i.e. `/collections/seating/chairs`
   - `/collections/seating/chairs?foo=bar` should canonical to `/collections/seating/chairs`
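   A minimal sketch with the App Router's `generateMetadata`, assuming a hypothetical catch-all route at `app/collections/[...slug]/page.tsx`; the canonical is built from path segments only, so every query-parameter variant consolidates to the clean URL:

   ```ts
   // app/collections/[...slug]/page.tsx — the route shape is an assumption.
   import type { Metadata } from 'next';

   type Props = { params: { slug: string[] } };

   export async function generateMetadata({ params }: Props): Promise<Metadata> {
     return {
       alternates: {
         // e.g. /collections/seating/chairs, with no query string.
         canonical: `/collections/${params.slug.join('/')}`,
       },
     };
   }
   ```

   Note that a relative `canonical` like this is resolved against `metadataBase`, which needs to be set in your root layout's metadata.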
⚠️ It is worth noting that the above recommendations are tailored to the specific case described in the question; they are in no way a general recommendation to block discovery and crawling of URLs containing query string parameters.