Alright, so this question might be a little weird, so first let me give you a short background. I am using spintax to generate large blocks of text from a set of optional phrases. I call the spin inside a loop over the range 0 to 10, so it creates multiple strings, each of them different:
```python
for i in range(0, 10):
    L.append(spintax.spin(
        " ----<h1>{" + Title + " - {køb online|sammenlign {priser|online supermarkederne}} via x.dk|Få din " + y + "\
leveret til døren og spar penge via x.dk|Køb din " + y + " online og spar penge via x.dk }\
\n \
----<h2>{{Få adgang til|vælg fra} {et stort|Danmarks største} {udvalg} af} " + y + "<h2>\
\n \
{Når|Hvis} du {besøger|handler ind gennem|benytter|køber ind via|køber dine varer via}\
x.dk, {er det {vigtigt|væsentligt} at forstå|skal du huske|skal du vide}"))
    L2.append(df['ID'][index])

df2 = pd.DataFrame(np.column_stack([L, L2]), columns=['Text', 'ID'])
```
Right, so this is an example of what my code looks like. `L` is a list that holds the generated text and `L2` is a list of IDs (I won't explain what's up with that list, as it's off-topic). My `df2` DataFrame will therefore look like this:
```
Index  Text                                      Id
0      <h1>Få din Mælk & Fløde leveret til       4169
       døren og spar penge via...
1      <h1>Mælk & Fløde - køb online via x.dk    4169
....
12     <h1>Få din Yoghurt leveret til døren      4178
       og spar penge via
....
```
So at this point, there are 10 text strings for every Id. I need to bring these down to 1 per Id, and here my issues start. I want to make sure that these text strings all differ from one another to some extent: from the 10 strings per Id, I need to choose the 1 that differs most from the strings chosen for the other Ids.

Hopefully, that kinda makes sense...

As a summary, if you got lost on the way: is there any way to compare the similarity between text strings, and to choose the string that is the most different out of all of them?
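For what "similarity between strings" can mean in practice, here is a minimal sketch using only the standard library's `difflib.SequenceMatcher` (the sample texts are shortened versions of the generated strings above; this is one possible approach, not necessarily the best for longer texts):

```python
import difflib

def similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; 1.0 means the strings are identical
    return difflib.SequenceMatcher(None, a, b).ratio()

texts = [
    "Få din Mælk & Fløde leveret til døren og spar penge via x.dk",
    "Mælk & Fløde - køb online via x.dk",
    "Fløde leveret til døren og spar penge via x.dk",
]

# Score each string by its total similarity to the others;
# the string with the lowest score is the most different one
scores = [sum(similarity(t, u) for u in texts if u is not t) for t in texts]
most_different = texts[scores.index(min(scores))]
print(most_different)
```

Here the first and third strings share a long common phrase, so the second one comes out as the most different.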
In the data below, the `Text` at `Index` 0 & 2 and the `Text` at `Index` 4 & 5 are the most similar within each unique `Id`, since they contain text from each other. The least similar within each `Id` are therefore `Index` 1 & 3.

To find the least similar `Text`, we can use TF-IDF to encode each `Text` into a numeric vector. We then compute the euclidean distance between each pair of rows within each group, take the mean distance for each row, and treat the row with the largest mean as the least similar. Finally, we grab the index with the largest mean for each group of `Id`s.
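The per-row reduction at the heart of this can be shown on a toy distance matrix (a minimal sketch with made-up 2-D points, independent of the real data):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Three 2-D points; the third is far away from the tight pair
vecs = np.array([[0.0, 0.0],
                 [0.1, 0.0],
                 [5.0, 5.0]])

# Pairwise euclidean distances: shape (3, 3), zeros on the diagonal
D = cdist(vecs, vecs)

# Mean distance of each point to all points; the outlier gets the largest mean
least_similar = D.mean(axis=0).argmax()
print(least_similar)  # → 2
```

Because the matrix is symmetric, `mean(axis=0)` and `mean(axis=1)` give the same result; the `argmax` picks the row that is, on average, farthest from the rest.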
Data:
| Index | Text | Id |
|-------|------------------------------------------------------------|------|
| 0 | Få din Mælk & Fløde leveret til døren og spar penge via... | 4169 |
| 1 | Mælk & Fløde - køb online via x.dk | 4169 |
| 2 | Fløde leveret til døren og spar penge via... | 4169 |
| 3 | Få din Mælk & Fløde leveret til døren og spar penge via... | 4170 |
| 4 | Mælk & Fløde - køb online via x.dk | 4170 |
| 5 | køb online via x.dk | 4170 |
In:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.spatial.distance import cdist

df = pd.read_clipboard()
df.columns = df.columns.str.strip()

# Encode each Text into a TF-IDF vector and append the features to the frame
v = TfidfVectorizer()
X = v.fit_transform(df['Text'])
df = df.join(pd.DataFrame(X.toarray()))

# Within each Id group, pick the row whose mean distance to the others is largest
group = df.groupby('Id', as_index=False)
df = group.apply(lambda x: x.iloc[cdist(x.iloc[:, 3:].values, x.iloc[:, 3:].values).mean(axis=0).argmax()])
df[['Index', 'Text', 'Id']]
```
Out:
| | Index | Text | Id |
|---|-------|------------------------------------------------------------|------|
| 0 | 1 | Mælk & Fløde - køb online via x.dk | 4169 |
| 1 | 3 | Få din Mælk & Fløde leveret til døren og spar penge via... | 4170 |