I'm making a Chrome extension that finds lyrics for a song on YouTube. The lyrics are stored in text files in the extension folder, and the files are named `Name of song.txt`. What I'm trying to do is match the search term, which is the title of the YouTube video, against the file names. For example, the search term can look like any of these:
searchTerm = "Artist name - Song name - Audio"
searchTerm = "Song name - Audio"
searchTerm = "Song name - Artist name - Audio"
searchTerm = "Song name" but with letters `š`, `ć`, `č`, `ž` instead of letters `s`, `c` and `z`
searchTerm = "Baby, I love you" (but my file is named Baby I love you (without comma))
etc.
Which algorithm should I use to get the best matches? An example of a bad match with both Levenshtein distance and Jaccard similarity is the song name Kancone with the search term Miki Jevremović - Kancone. When I try to match that, I get the song Ako jednom vidis Mariju as the result.
Basically, the song name (and so the file name) will often be a substring of the YouTube video title (the search term), not necessarily an exact substring but close. I can't rely on the assumption that the file name appears in the title as-is, or with only small differences such as removed commas, apostrophes, hyphens etc. There is no standard delimiter or anything like that. The title might also contain spelling mistakes and similar problems.
Since this is a kill-my-time little project, I don't need some mega complicated algorithm or hardcore stuff, but everything is appreciated.
PS
If it's needed, I can provide code for algorithms I used.
The following assumes that JavaScript is used and that the file names are in an array. Details are in the comments of the example.
// For demonstration purposes.
const log = (data, unformatted = false) => {
  if (unformatted) return console.log(data);
  console.log(JSON.stringify(data));
};
// An array of 11 text file names
const fileNames = ["Café au Lait.txt", "Rêverie d'Amour.txt", "Fleur de Lis.txt", "Cliché d'Amour.txt", "L'Étoile Brillante.txt", "Déjà Vu.txt", "Pâtisserie Délicieuse.txt", "L'amour Infini.txt", "Château de Rêves.txt", "Ballet Romantique.txt", "Baby, I Love You.txt"];
/**
* Removes accents, and anything that's not a letter,
* a number, or a dot (case insensitive).
* @param {string} string - A string
* @return {string} - A clean string
*/
const clean = string => {
  return string
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .replace(/[^a-z0-9.]/gi, "");
};
/**
* Compares a given array of file names (fileNames)
* vs a given string delimited by " - " to find
* any matches.
* @param {array} fileNames - An array of file names
* @param {string} findTerms - A string to search for
* @return {array} - An array of matches
*/
const searchFiles = (fileNames, findTerms) => {
  // Clean and lowercase the file names and search terms so the
  // comparison is case insensitive. Terms that are empty after
  // cleaning are dropped, otherwise they would match every file.
  const files = fileNames.map(name => clean(name).toLowerCase());
  const terms = findTerms
    .split(" - ")
    .map(term => clean(term).toLowerCase())
    .filter(term => term.length > 0);
  /**
   * If a file name contains any of the search
   * terms, return the index of the file name
   * otherwise return an empty array which gets
   * flattened into nothing.
   */
  const indices = files.flatMap((file, index) => {
    // A plain substring test is used instead of a RegExp so that
    // a "." in a term is not treated as a wildcard.
    return terms.some(term => file.includes(term)) ? index : [];
  });
  /**
   * Return an array of file names from fileNames
   * array consisting of the file names that
   * correspond to the index numbers of indices
   * array.
   */
  return fileNames.filter((name, index) => {
    return indices.includes(index);
  });
};
log("Artist - de - Audio", true);
log(searchFiles(fileNames, "Artist - de - Audio"));
log(" ", true);
log("ballet - Artist - Audio", true);
log(searchFiles(fileNames, "ballet - Artist - Audio"));
log(" ", true);
log("Dé - Audio", true);
log(searchFiles(fileNames, "Dé - Audio"));
log(" ", true);
log("d'Amour", true);
log(searchFiles(fileNames, "d'Amour"));
log(" ", true);
log(".txt", true);
log(searchFiles(fileNames, ".txt"));
log(" ", true);
log("baby i love you", true);
log(searchFiles(fileNames, "baby i love you"));
log(" ", true);
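The example above returns every file that contains any one of the terms, so a title may produce several hits. If a single best match is wanted, one option is to rank the files instead. The following is only a sketch under assumptions: `normalizeWords`, `score` and `bestMatch` are hypothetical helpers that are not part of the code above, the score is simply the fraction of a file name's words that also appear in the title, and the `minScore` cut-off of 0.5 is an arbitrary choice. It does not handle spelling mistakes inside a word.

```javascript
// Hypothetical helper: strip accents, lowercase, and split into words.
// Anything that is not a letter or digit acts as a separator.
const normalizeWords = text =>
  text
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter(word => word.length > 0);

// Hypothetical helper: fraction of the file name's words that
// also occur somewhere in the video title (0 to 1).
const score = (fileName, title) => {
  const nameWords = normalizeWords(fileName.replace(/\.txt$/i, ""));
  if (nameWords.length === 0) return 0;
  const titleWords = new Set(normalizeWords(title));
  const hits = nameWords.filter(word => titleWords.has(word)).length;
  return hits / nameWords.length;
};

// Hypothetical helper: return the highest scoring file, or null
// if no file reaches minScore (an assumed, arbitrary threshold).
const bestMatch = (fileNames, title, minScore = 0.5) => {
  let best = null;
  for (const name of fileNames) {
    const value = score(name, title);
    if (value >= minScore && (best === null || value > best.score)) {
      best = { name, score: value };
    }
  }
  return best;
};

const songs = ["Ako jednom vidis Mariju.txt", "Kancone.txt", "Baby I love you.txt"];
console.log(bestMatch(songs, "Miki Jevremović - Kancone"));
// → { name: 'Kancone.txt', score: 1 }
console.log(bestMatch(songs, "Baby, I love you - Audio"));
// → { name: 'Baby I love you.txt', score: 1 }
```

Because the score is measured against the words of the file name rather than the words of the title, extra parts of the title such as the artist name or "Audio" do not lower it, which is what goes wrong when whole strings are compared with an edit distance.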