Alright, so this question might be a little weird, so first let me give you a short background. I am using spintax to generate large blocks of text from a set of optional phrases. I call the spin inside a loop over the range 0 to 10, so it creates multiple strings, each of them different:
```python
for i in range(0, 10):
    L.append(spintax.spin(
        " ----<h1>{" + Title + " - {køb online|sammenlign {priser|online supermarkederne}} via x.dk|Få din " + y + "\
leveret til døren og spar penge via x.dk|Køb din " + y + " online og spar penge via x.dk }\
\n \
----<h2>{{Få adgang til|vælg fra} {et stort|Danmarks største} {udvalg} af} " + y + "<h2>\
\n \
{Når|Hvis} du {besøger|handler ind gennem|benytter|køber ind via|køber dine varer via}\
x.dk, {er det {vigtigt|væsentligt} at forstå|skal du huske|skal du vide}"))
    L2.append(df['ID'][index])

df2 = pd.DataFrame(np.column_stack([L, L2]), columns=['Text', 'ID'])
```
Right, so this is an example of what my code looks like. `L` is a list that holds the generated text and `L2` is a list of IDs (I won't explain what's up with that list, as it's off-topic). My `df2` DataFrame will therefore look like this:
```
Index  Text                                      Id
0      <h1>Få din Mælk & Fløde leveret til       4169
       døren og spar penge via...
1      <h1>Mælk & Fløde - køb online via x.dk    4169
....
12     <h1>Få din Yoghurt leveret til døren      4178
       og spar penge via
....
```
So at this point, there are 10 text strings for every Id. I need to bring these down to 1 per Id, and here my issues start. I want to make sure that these text strings all differ from one another to some extent: from the 10 strings per Id, I need to choose the 1 that differs most from the strings chosen for the other Ids.

Hopefully, that kinda makes sense...

As a summary, if you got lost on the way: is there any way to compare the similarity between text strings, and to choose the string that is the most different out of all of them?
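For what "similarity between strings" can mean in practice, here is a minimal sketch using only the standard library's `difflib.SequenceMatcher` (the sample texts are shortened versions of the generated strings above; this is one possible approach, not necessarily the best for longer texts):

```python
import difflib

def similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; 1.0 means the strings are identical
    return difflib.SequenceMatcher(None, a, b).ratio()

texts = [
    "Få din Mælk & Fløde leveret til døren og spar penge via x.dk",
    "Mælk & Fløde - køb online via x.dk",
    "Fløde leveret til døren og spar penge via x.dk",
]

# Score each string by its total similarity to the others;
# the string with the lowest score is the most different one
scores = [sum(similarity(t, u) for u in texts if u is not t) for t in texts]
most_different = texts[scores.index(min(scores))]
print(most_different)
```

Here the first and third strings share a long common phrase, so the second one comes out as the most different.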
In the data below, the `Text` at `Index` 0 & 2 and the `Text` at `Index` 4 & 5 are the most similar within each unique `Id`, since they contain text from each other. The least similar within each `Id` are therefore `Index` 1 & 3.

To find the least similar `Text`, we can use TF-IDF to encode each `Text` into a numeric vector. We then compute the euclidean distance between each pair of rows within each group, take the mean distance for each row, and treat the row with the largest mean as the least similar. Finally, we grab the index with the largest mean for each group of `Id`s.
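The per-row reduction at the heart of this can be shown on a toy distance matrix (a minimal sketch with made-up 2-D points, independent of the real data):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Three 2-D points; the third is far away from the tight pair
vecs = np.array([[0.0, 0.0],
                 [0.1, 0.0],
                 [5.0, 5.0]])

# Pairwise euclidean distances: shape (3, 3), zeros on the diagonal
D = cdist(vecs, vecs)

# Mean distance of each point to all points; the outlier gets the largest mean
least_similar = D.mean(axis=0).argmax()
print(least_similar)  # → 2
```

Because the matrix is symmetric, `mean(axis=0)` and `mean(axis=1)` give the same result; the `argmax` picks the row that is, on average, farthest from the rest.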
Data:
| Index | Text | Id |
|-------|------------------------------------------------------------|------|
| 0 | Få din Mælk & Fløde leveret til døren og spar penge via... | 4169 |
| 1 | Mælk & Fløde - køb online via x.dk | 4169 |
| 2 | Fløde leveret til døren og spar penge via... | 4169 |
| 3 | Få din Mælk & Fløde leveret til døren og spar penge via... | 4170 |
| 4 | Mælk & Fløde - køb online via x.dk | 4170 |
| 5 | køb online via x.dk | 4170 |
In:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.spatial.distance import cdist

df = pd.read_clipboard()
df.columns = df.columns.str.strip()

# Encode each Text into a TF-IDF vector and append the features to the frame
v = TfidfVectorizer()
X = v.fit_transform(df['Text'])
df = df.join(pd.DataFrame(X.toarray()))

# Within each Id group, pick the row whose mean distance to the others is largest
group = df.groupby('Id', as_index=False)
df = group.apply(lambda x: x.iloc[cdist(x.iloc[:, 3:].values, x.iloc[:, 3:].values).mean(axis=0).argmax()])
df[['Index', 'Text', 'Id']]
```
Out:
| | Index | Text | Id |
|---|-------|------------------------------------------------------------|------|
| 0 | 1 | Mælk & Fløde - køb online via x.dk | 4169 |
| 1 | 3 | Få din Mælk & Fløde leveret til døren og spar penge via... | 4170 |