python, matlab, information-retrieval, precision-recall, average-precision

MAP@k computation


According to the wiki, ml_metrics at Kaggle, and this answer (Confusion about (Mean) Average Precision), mean average precision at k (MAP@k, computed over the top-k elements of the answer) should be the mean of the average precisions at k, where average precision at k is computed as:

$$\mathrm{AP@}k = \frac{1}{\min(k,\ \text{number of relevant documents})} \sum_{i=1}^{k} P(i)\,\mathrm{rel}(i)$$

Where: P(i) is the precision at cut-off i in the list; rel(i) is an indicator function equaling 1 if the item at rank i is a relevant document, zero otherwise.

The denominator min(k, number of relevant documents) is the maximum possible number of relevant entries in the answer.
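A minimal sketch of this definition in plain Python may make it concrete. It assumes binary relevance flags for the ranked list and a known total number of relevant documents per query; the names `average_precision_at_k`, `mean_average_precision_at_k`, `ranked_rel` and `n_relevant` are illustrative and are not taken from any of the implementations referenced here.

```python
def average_precision_at_k(ranked_rel, n_relevant, k):
    """AP@k with the min(k, n_relevant) denominator.

    ranked_rel: list of 0/1 flags; ranked_rel[i] is 1 if the item at rank i+1 is relevant.
    n_relevant: total number of relevant documents for the query (assumed known).
    """
    if n_relevant == 0 or k <= 0:
        return 0.0
    hits = 0
    score = 0.0
    for i, rel in enumerate(ranked_rel[:k], start=1):
        if rel:
            hits += 1
            score += hits / i  # P(i), accumulated only at relevant ranks
    return score / min(k, n_relevant)


def mean_average_precision_at_k(queries, k):
    """queries: list of (ranked_rel, n_relevant) pairs, one pair per query."""
    return sum(average_precision_at_k(r, n, k) for r, n in queries) / len(queries)
```

For instance, `average_precision_at_k([1, 1, 0, 0, 0], 1000, 5)` divides the accumulated precision of 2 by min(5, 1000) = 5, so a list that finds only 2 of 1000 relevant documents is not scored as perfect.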

Is this understanding correct?

Is MAP@k always less than MAP computed over the whole ranked list?

My concern is that this is not how MAP@k is computed in many works.

Typically, the denominator is not min(k, number of relevant documents) but the number of relevant documents in the top-k. This approach gives a higher value of MAP@k.
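The effect of the two denominators can be checked on a toy example. The sketch below is an illustration only: the helper `ap_at_k` and the sample values (a top-5 result with 2 hits, 10 relevant documents in total) are made up for the comparison and do not come from any of the papers.

```python
def ap_at_k(ranked_rel, k, denominator):
    """Sum of P(i) over relevant ranks in the top-k, divided by the given denominator."""
    hits, score = 0, 0.0
    for i, rel in enumerate(ranked_rel[:k], start=1):
        if rel:
            hits += 1
            score += hits / i
    return score / denominator if denominator else 0.0


ranked_rel = [1, 0, 1, 0, 0]  # hypothetical relevance flags of a top-5 result
n_relevant = 10               # hypothetical total number of relevant documents
k = 5

# Denominator as in the definition: min(k, number of relevant documents).
standard = ap_at_k(ranked_rel, k, min(k, n_relevant))
# Denominator as in the code discussed here: number of relevant documents in the top-k.
in_top_k = ap_at_k(ranked_rel, k, sum(ranked_rel[:k]))

print(standard, in_top_k)
```

Here the first variant gives roughly 0.33 and the second roughly 0.83. Because the number of hits in the top-k can never exceed min(k, number of relevant documents), the second variant is never smaller than the first.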


"HashNet: Deep Learning to Hash by Continuation" (ICCV 2017)

Code: https://github.com/thuml/HashNet/blob/master/pytorch/src/test.py#L42-L51

    for i in range(query_num):
        label = validation_labels[i, :]
        label[label == 0] = -1
        idx = ids[:, i]
        imatch = np.sum(database_labels[idx[0:R], :] == label, axis=1) > 0
        relevant_num = np.sum(imatch)
        Lx = np.cumsum(imatch)
        Px = Lx.astype(float) / np.arange(1, R+1, 1)
        if relevant_num != 0:
            APx.append(np.sum(Px * imatch) / relevant_num)

Here relevant_num is not min(k, number of relevant documents) but the number of relevant documents in the returned top-R result, which is not the same as either the total number of relevant documents or k.

Am I reading the code wrong?


"Deep Visual-Semantic Quantization for Efficient Image Retrieval" (CVPR 2017)

Code: https://github.com/caoyue10/cvpr17-dvsq/blob/master/util.py#L155-L178

def get_mAPs_by_feature(self, database, query):
    ips = np.dot(query.output, database.output.T)
    #norms = np.sqrt(np.dot(np.reshape(np.sum(query.output ** 2, 1), [query.n_samples, 1]), np.reshape(np.sum(database.output ** 2, 1), [1, database.n_samples])))
    #self.all_rel = ips / norms
    self.all_rel = ips
    ids = np.argsort(-self.all_rel, 1)
    APx = []
    query_labels = query.label
    database_labels = database.label
    print "#calc mAPs# calculating mAPs"
    bar = ProgressBar(total=self.all_rel.shape[0])
    for i in xrange(self.all_rel.shape[0]):
        label = query_labels[i, :]
        label[label == 0] = -1
        idx = ids[i, :]
        imatch = np.sum(database_labels[idx[0: self.R], :] == label, 1) > 0
        rel = np.sum(imatch)
        Lx = np.cumsum(imatch)
        Px = Lx.astype(float) / np.arange(1, self.R+1, 1)
        if rel != 0:
            APx.append(np.sum(Px * imatch) / rel)
        bar.move()
    print "mAPs: ", np.mean(np.array(APx))
    return np.mean(np.array(APx))

Here the denominator is rel, computed as np.sum(imatch), where imatch is a binary vector indicating whether each entry is relevant. The problem is that imatch covers only the first R entries: imatch = np.sum(database_labels[idx[0: self.R], :] == label, 1) > 0. So np.sum(imatch) gives the number of relevant entries in the returned list of size R, not min(R, number of relevant entries). Note that the values of R used in the paper are smaller than the number of entries in the database.


Deep Learning of Binary Hash Codes for Fast Image Retrieval (CVPR 2015)

Code: https://github.com/kevinlin311tw/caffe-cvprw15/blob/master/analysis/precision.m#L30-L55

    buffer_yes = zeros(K,1);
    buffer_total = zeros(K,1);
    total_relevant = 0;
    
    for j = 1:K
        retrieval_label = trn_label(y2(j));
        
        if (query_label==retrieval_label)
            buffer_yes(j,1) = 1;
            total_relevant = total_relevant + 1;
        end
        buffer_total(j,1) = 1;
    end
    
    % compute precision
    P = cumsum(buffer_yes) ./ Ns';
    
    if (sum(buffer_yes) == 0)
        AP(i) = 0;
    else
        AP(i) = sum(P.*buffer_yes) / sum(buffer_yes);
    end

Here the denominator is sum(buffer_yes), which is the number of relevant documents in the returned list of size k, not min(k, number of relevant documents).


"Supervised Learning of Semantics-Preserving Deep Hashing" (TPAMI 2017)

Code: https://github.com/kevinlin311tw/Caffe-DeepBinaryCode/blob/master/analysis/precision.m

The code is the same as in the previous paper.


Learning Compact Binary Descriptors with Unsupervised Deep Neural Networks (CVPR 2016)

Same code: https://github.com/kevinlin311tw/cvpr16-deepbit/blob/master/analysis/precision.m#L32-L55



Am I missing something? Is the code in the papers above correct? Why does it not match https://github.com/benhamner/Metrics/blob/master/Python/ml_metrics/average_precision.py#L25-L39 ?


Update

I found this closed issue, which refers to the same problem: https://github.com/thuml/HashNet/issues/2

Is the following claim correct?

AP is a ranking metric. If the top 2 retrievals in the ranked list are relevant (and only the top 2), AP is 100%. You're talking about Recall, which in this case is indeed 0.2%.

As I understand it, if we treat AP as the area under the precision-recall curve, the claim above is not correct.


P.S. I was unsure whether this should go on Cross Validated or Stack Overflow. If you think it belongs on Cross Validated, I don't mind. My reasoning was that it is not a theoretical question but an implementation question that refers to actual code.


Solution

  • You are completely right, and well done for finding this. Given the similarity of the code, my guess is that there is one source bug, and that paper after paper copied the bad implementation without examining it closely.

    The "akturtle" issue raiser is completely right too; I was going to give the same example. I'm not sure "kunhe" understood the argument: of course recall matters when computing average precision.

    Yes, the bug should inflate the numbers. I just hope that the ranked lists are long enough, and the methods reasonable enough, that they achieve 100% recall within the ranked list, in which case the bug would not affect the results.

    Unfortunately, it is hard for reviewers to catch this, as one typically doesn't review the code of papers. It is worth contacting the authors to get them to update the code, update their papers with the correct numbers, or at least stop making the mistake in future work. If you are planning to write a paper comparing different methods, you could point out the problem and report the correct numbers (potentially alongside the ones with the bug, to allow apples-to-apples comparisons).

    To answer your side-question:

    Is MAP@k always less than MAP computed over the whole ranked list?

    Not necessarily. MAP@k essentially computes MAP while normalizing for the case where you cannot do any better given only k retrievals. For example, consider a returned ranked list with relevances 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1, and assume there are 6 relevant documents in total. MAP is about 63% here, while MAP@3 = 100% because you cannot do better than retrieving 1 1 1. This is unrelated to the bug you discovered, though: with their bug, the computed MAP@k is guaranteed to be at least as large as the true MAP@k.
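The example can be checked numerically with a short sketch. The helper `average_precision` is a hypothetical name, and the list is written as three ones, fourteen zeros and three ones (20 entries), matching the relevances given in the example.

```python
def average_precision(ranked_rel, denominator):
    """Sum of P(i) over the relevant ranks of ranked_rel, divided by denominator."""
    hits, score = 0, 0.0
    for i, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            score += hits / i
    return score / denominator


# Relevance flags from the example: 1 1 1, then zeros, then 1 1 1.
ranked_rel = [1, 1, 1] + [0] * 14 + [1, 1, 1]
n_relevant = 6  # total number of relevant documents, as assumed in the example

# AP over the whole ranked list, normalized by all relevant documents.
ap_full = average_precision(ranked_rel, n_relevant)

# AP@3: only the top 3 entries, normalized by min(k, n_relevant).
k = 3
ap_at_3 = average_precision(ranked_rel[:k], min(k, n_relevant))

print(ap_full, ap_at_3)
```

With a single query, MAP equals AP, so the full-list value stays well below 1 while the top-3 value is exactly 1, which shows that MAP@k can exceed MAP over the whole list.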