python rdkit cheminformatics

Why does the on-bit count of Morgan fingerprints for my molecules not plateau as the number of bits increases?


I am new to cheminformatics and to programming in general. My first project uses Morgan fingerprints as descriptors/features for an ML model, and I cannot seem to find a suitable bit vector size. My analysis involves finding the maximum number of on bits for a given fingerprint length.

First, I define a function that takes a list of SMILES and a bit vector length, generates the fingerprint for every SMILES, and returns the maximum number of on bits across all the generated fingerprints.

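import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def max_on_bits(smiles_list, nbits):
    # Number of on bits in the fingerprint of each molecule
    on_bits = []

    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        # Feature-based, chirality-aware Morgan fingerprint of radius 3,
        # folded to nbits
        fp = AllChem.GetMorganFingerprintAsBitVect(mol,
                                                   3,
                                                   nBits=nbits,
                                                   useChirality=True,
                                                   useFeatures=True)

        # Count the bits that are set
        num_on_bits = sum(np.array(fp))
        on_bits.append(num_on_bits)

    # Highest on-bit count over all molecules
    return max(on_bits)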

Then I call the function in a loop over a range of bit vector sizes to find the maximum occupancy:

nbits_range = range(1024, 20483, 256)
occupancy = []
for nbits in nbits_range:
    occupancy.append(max_on_bits(smiles_list, nbits))

Finally, I plot the results:

import matplotlib.pyplot as plt

plt.plot(nbits_range, occupancy)
plt.xlabel("Bit Vector Size")
plt.ylabel("Maximum Number of bits occupied")
plt.grid()
plt.show()

The resulting curve does not plateau; instead it oscillates. Is there something wrong with my code, or is this inherent to the fingerprints? [Plot]


Solution

  • When a Morgan fingerprint is generated, integer identifiers are first computed for all non-redundant environments, and these integers are then folded (usually with a modulo operation) to populate the fingerprint. If two integers fold to the same index, you have a bit collision, and the on-bit count ends up lower than the number of unique environments. Which identifiers collide depends on the vector length, so the on-bit count can go down as well as up when the length changes. The chance of a collision is pretty high, which is what your plot reflects. If you want the count of unique identifiers, you can skip the folding altogether and use the environment identifiers directly; the chance of a collision between 32-bit integers is far lower.

    In general, Morgan fingerprints are pretty sparse, so depending on your purpose the fingerprint does not have to be that long. Lengths of 1024, 2048 or 4096 will often perform quite similarly in tasks such as retrieving highly similar compounds from a set.

    Most likely, your curve will actually converge, but only at a much higher fingerprint length. PS: your code will run much faster if you use fp.GetNumOnBits() instead of sum(np.array(fp)), because a lot of time is wasted on unnecessarily converting the RDKit fingerprint into a NumPy array. See below for an example, using the SMILES for rapamycin:

    [Image: example using the SMILES for rapamycin]
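
The folding step described in this answer can also be illustrated without RDKit. The sketch below is not the rapamycin example from the image. It assumes that folding is a plain modulo, and it uses 150 random 32-bit integers as hypothetical stand-ins for environment identifiers; `folded_on_bits` is a made-up helper name, not an RDKit function.

```python
import random


def folded_on_bits(identifiers, nbits):
    """Count distinct bit positions after folding identifiers with modulo."""
    return len({ident % nbits for ident in identifiers})


# Hypothetical stand-ins for environment identifiers:
# 150 distinct random 32-bit integers (not real Morgan identifiers).
rng = random.Random(0)
identifiers = set()
while len(identifiers) < 150:
    identifiers.add(rng.getrandbits(32))

# Same style of sweep as in the question: 1024 to 20480 in steps of 256.
sizes = range(1024, 20481, 256)
counts = [folded_on_bits(identifiers, nbits) for nbits in sizes]

print("unique identifiers:", len(identifiers))
print("min / max folded on-bit count:", min(counts), "/", max(counts))

# A modulus larger than any 32-bit value leaves the identifiers unchanged,
# so no two distinct identifiers can share a position.
print("on bits without folding:", folded_on_bits(identifiers, 2**32))
```

Each size maps the same identifiers to a different set of positions, so the folded count would be expected to wander just below the number of unique identifiers rather than rise steadily. In RDKit, the unfolded identifiers should be obtainable from the sparse fingerprint, for example via `AllChem.GetMorganFingerprint(mol, 3, useChirality=True, useFeatures=True).GetNonzeroElements()`; check this against the documentation for your RDKit version.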