I have a collection of texts from some authors. Each author has a unique signature or link that occurs in all of their texts.
Example for Author1:
$texts=['sdsadsad daSDA DDASd asd aSD Sd dA SD ASD sadasdasds sadasd @jhsad.sadas.com sdsdADSA sada', 'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf', 'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl @jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl'];
Expected output for Author1 is:
@jhsad.sadas.com
Example for Author2:
$texts=['This is some random string representative of non-signature text. This is the *author\'s* signature.', 'Different message body text. This is the *author\'s* signature. This is an afterthought that expresses that a signature is not always at the end.', 'Finally, this is unwanted stuff. This is the *author\'s* signature.'];
Expected output for Author2 is:
This is the *author's* signature.
Pay particular notice to the fact there there are no reliable identifying characters (or positions) that signify the start or end of the signature. It could be a url, a Twitter mention, any kind of plain text, etc. of any length containing any sequence of characters that occurs at the start, end, or middle of the string.
I am seeking a method that will extract the longest substring that exists in all $text
elements for a single author.
It is expected, for the sake of this task, that all authors WILL have a signature substring that exists in every post/text.
IDEA: I'm thinking of converting words to vectors and finding similarity between each texts. We can use cosine similarity to find the signatures. I think the solution must be some thing like this idea.
mickmackusa's commented code captures the essence of what is desired, but I would like to see if there are other ways to achieve the desired result.
Here is my thinking:
overlaps
).0
index.Code: (Demo)
$posts['Author1'] = ['sdsadsad daSDA DDASd asd aSD Sd dA SD ASD sadasdasds sadasd
@jhsad.sadas.com sdsdADSA sada',
'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl'];
$posts['Author2'] = ['This is some random string representative of non-signature text.
This is the
*author\'s* signature.',
'Different message body text. This is the
*author\'s* signature.
This is an afterthought that expresses that a signature is not always at the end.',
'Finally, this is unwanted stuff. This is the
*author\'s* signature.'];
foreach ($posts as $author => $texts) {
echo "Author: $author\n";
usort($texts, function($a, $b) {
return strlen($a) <=> strlen($b); // sort ASC by strlen; mb_strlen probably isn't advantageous
});
var_export($texts);
echo "\n";
foreach ($texts as $index => $string) {
if (!$index) {
$overlaps = preg_split('/\s+/', $string, 0, PREG_SPLIT_NO_EMPTY); // declare with all non-white-space substrings from first text
} else {
$overlaps = array_intersect($overlaps, preg_split('/\s+/', $string, 0, PREG_SPLIT_NO_EMPTY)); // filter word bank using narrowing number of words
}
}
var_export($overlaps);
echo "\n";
// batch consecutive substrings
$group = null;
$consecutives = []; // clear previous iteration's data
foreach ($overlaps as $i => $word) {
if ($group === null || $i - $last > 1) {
$group = $i;
}
$last = $i;
$consecutives[$group][] = $word;
}
var_export($consecutives);
echo "\n";
foreach($consecutives as $words){
// match potential signatures in first text for measurement:
if (preg_match_all('/\Q' . implode('\E\s+\Q', $words) . '\E/', $texts[0], $out)) { // make alternatives characters literal using \Q & \E
$potential_signatures = $out[0];
}
}
usort($potential_signatures, function($a,$b){
return strlen($b) <=> strlen($a); // sort DESC by strlen; mb_strlen probably isn't advantageous
});
echo "Assumed Signature: {$potential_signatures[0]}\n\n";
}
Output:
Author: Author1
array (
0 => 'sdsadsad daSDA DDASd asd aSD Sd dA SD ASD sadasdasds sadasd
@jhsad.sadas.com sdsdADSA sada',
1 => 'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl',
2 => 'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
)
array (
11 => '@jhsad.sadas.com',
)
array (
11 =>
array (
0 => '@jhsad.sadas.com',
),
)
Assumed Signature: @jhsad.sadas.com
Author: Author2
array (
0 => 'Finally, this is unwanted stuff. This is the
*author\'s* signature.',
1 => 'This is some random string representative of non-signature text.
This is the
*author\'s* signature.',
2 => 'Different message body text. This is the
*author\'s* signature.
This is an afterthought that expresses that a signature is not always at the end.',
)
array (
2 => 'is',
5 => 'This',
6 => 'is',
7 => 'the',
8 => '*author\'s*',
9 => 'signature.',
)
array (
2 =>
array (
0 => 'is',
),
5 =>
array (
0 => 'This',
1 => 'is',
2 => 'the',
3 => '*author\'s*',
4 => 'signature.',
),
)
Assumed Signature: This is the
*author's* signature.