This is a very fundamental and perhaps silly question. I have read that, to keep the relevance assessment effort manageable in TREC competitions (reference), the top-ranked documents returned by the participating systems are pooled to create the set of documents for relevance assessment. However, my doubt is this:
Suppose the majority of the systems use a common model, or similar models with roughly the same parameters; for example, several systems might use LSA with the rank reduced to 100, 120, 150, 105, and so on. Then there are two problems. One, merging such results might not really surface the documents relevant to each query, because the returned documents may overlap heavily. Two, the documents chosen for assessment are biased towards the models used by the participating systems, so the relevance judgements will not really be method-agnostic.
I know I am missing something here, and if anyone could guide me towards the missing link it would be really helpful!
You are correct. Pooling has its own problems, and we have to live with them.
There are, however, ways of making the pooling process less biased towards a specific set of retrieval models.
Using a set of diverse retrieval models and different retrieval settings (e.g. using the title alone, or the title and the description, as queries) often helps reduce the overlap in the retrieved sets of documents. Overlap isn't always a bad thing either: retrieving the same document in multiple lists (corresponding to different settings or retrieval models) may actually reinforce the case for including that document in the pool.
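As a rough sketch of how a pool can be built from several diverse runs while using the overlap itself as a signal (the run format below, a plain ranked list of document IDs per run, is a simplification of actual TREC run files, which also carry topic, rank and score columns):

```python
from collections import Counter

def pool_with_votes(runs, depth=100):
    """Merge the top-`depth` documents of several runs into one pool,
    recording how many runs retrieved each document."""
    votes = Counter()
    for ranking in runs:
        for doc_id in ranking[:depth]:
            votes[doc_id] += 1  # each run contributes one "vote" for the document
    # The pool is simply votes.keys(); the counts can serve as a rough
    # reinforcement signal for documents retrieved by multiple settings.
    return votes

# Toy example: three small runs for one topic, pooled at depth 3
runs = [
    ["d1", "d2", "d3", "d7"],
    ["d2", "d1", "d4", "d8"],
    ["d5", "d2", "d1", "d9"],
]
votes = pool_with_votes(runs, depth=3)
print(sorted(votes.items()))
# [('d1', 3), ('d2', 3), ('d3', 1), ('d4', 1), ('d5', 1)]
```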
Another approach followed in TREC was to encourage participants to submit manually post-processed runs, so that the documents shown to the assessors undergo some kind of manual filtering rather than being the output of purely automated algorithms.
While it is true that the top-retrieved set is a function of a specific retrieval model, the idea behind pooling is that, with sufficient depth (say depth-100), it is highly unlikely that a truly relevant document would fail to be retrieved within the top 100 of every retrieval model. So the more settings (models and query formulation strategies) one uses, and the greater the depth, the lower the probability of missing a truly relevant document.
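To make that intuition concrete, here is a back-of-the-envelope calculation; the per-run miss probability and the independence assumption are purely illustrative, not real TREC statistics:

```python
# Purely illustrative numbers: suppose each run, independently, fails to place a
# given truly relevant document in its top-`depth` with probability p_miss.
p_miss = 0.3   # assumed per-run miss probability (made-up figure)
n_runs = 10    # number of pooled runs/settings

p_missed_by_pool = p_miss ** n_runs
print(f"P(relevant document missing from the pool) ~ {p_missed_by_pool:.1e}")
# ~ 5.9e-06 -- the probability shrinks rapidly as more runs (and depth) are added
```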
However, it is certainly possible to extend the assessment pool when a new retrieval model has characteristics completely different from the models used to construct the initial pool.
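A minimal sketch of how such an extension could look, assuming the existing pool is just a set of already-judged document IDs: only the top-ranked documents of the new run that fall outside the existing pool would go to the assessors for additional judging.

```python
def documents_to_judge(existing_pool, new_run, depth=100):
    """Return the top-`depth` documents of a new run that are not already
    covered by the existing assessment pool (i.e. not yet judged)."""
    return [doc_id for doc_id in new_run[:depth] if doc_id not in existing_pool]

existing_pool = {"d1", "d2", "d3", "d4", "d5"}     # documents already judged
new_run = ["d1", "d6", "d2", "d10", "d7"]          # ranking from the new model
print(documents_to_judge(existing_pool, new_run, depth=3))   # ['d6']
```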