python, string, nlp, nltk, e-commerce

Shorten product titles to a specific length using Python NLP libraries


I have a collection of products, and for each one I need a product name shorter than 40 characters. The input product name is a string column whose values are longer than 40 characters, so I need to shorten them. I could use plain string methods, but then some product names could end up nonsensical. For example, an input name could be 'Cut Resistant Gloves, Size 8, Grey/Black - 12 per DZ' (52 characters). How could I get from this to something like 'Resistant Size 8 Grey/Black Gloves' (34 characters)? Thanks in advance.
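
To illustrate the problem with plain string methods, here is a minimal sketch of word-boundary truncation. The helper name `truncate_at_word` is hypothetical and only for illustration; it keeps whole words from the left while the result stays shorter than the limit.

```python
def truncate_at_word(text: str, max_len: int = 40) -> str:
    """Keep whole words from the left while the result stays shorter than max_len."""
    kept = []
    for word in text.split():
        candidate = " ".join(kept + [word])
        if len(candidate) >= max_len:
            break  # adding this word would reach or exceed the limit
        kept.append(word)
    return " ".join(kept)


name = "Cut Resistant Gloves, Size 8, Grey/Black - 12 per DZ"
print(truncate_at_word(name))  # Cut Resistant Gloves, Size 8,
```

The result fits the limit but keeps a dangling comma and drops the colour, which is the kind of name I want to avoid.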

I would like to end up with a new column in my data frame containing these new product names, each shorter than 40 characters.


Solution

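  • You can adapt the logic below to your requirements. It tags each token with spaCy, collects adjectives, nouns, and size/volume information, and rebuilds a shorter name from them: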

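    import pandas as pd
    import spacy
    
    # Requires the model: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
    
    # Example input; in practice this is one value from the product-name column
    product_name = "Cut Resistant Gloves, Size 8, Grey/Black - 12 per DZ"
    doc = nlp(product_name)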
    
    shortened_tokens = []
    noun_tokens = []
    adjective_tokens = []
    size_tokens = []
    
    # Iterate over tokens and identify size/volume information, nouns, and adjectives.
    # Size keywords are checked first; otherwise a word like "Size" may be tagged
    # as a noun and never reach the size branch.
    for token in doc:
        if token.lower_ in ["size", "vol", "volume"]:
            size_tokens.append(token.text)
        elif token.pos_ == "NUM" and token.head.text.lower() in ["size", "vol", "volume"]:
            size_tokens.append(token.text)
        elif token.pos_ == "NOUN":
            noun_tokens.append(token.text)
        elif token.pos_ == "ADJ":
            adjective_tokens.append(token.text)
    
    # Example limits (assumed values; tune them for your data)
    Max_Adj_count = 2   # max number of adjectives permissible
    Max_noun_count = 2  # max number of nouns permissible
    
    # Determine the number of adjectives and nouns to include
    num_adjectives = min(len(adjective_tokens), Max_Adj_count)
    num_nouns = min(len(noun_tokens), Max_noun_count)
    
    # Construct the shortened name: adjectives, then size info, then nouns
    shortened_tokens.extend(adjective_tokens[:num_adjectives])
    shortened_tokens.extend(size_tokens[:2])  # keyword plus its value, e.g. "Size 8"
    shortened_tokens.extend(noun_tokens[:num_nouns])
    
    
    shortened_name = " ".join(shortened_tokens)
    
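    # If the shortened name is not shorter than 40 characters, drop trailing words
    # until it fits, so the cut always falls on a word boundary
    words = shortened_name.split()
    while len(words) > 1 and len(" ".join(words)) >= 40:
        words.pop()
    shortened_name = " ".join(words)[:39]  # hard cut only if a single word is too long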
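
To get a value for every product, as the question asks, the per-name logic can be wrapped in a function. The sketch below is one possible arrangement, not the only one: `shorten_name`, `demo_tagger`, and `DEMO_TAGS` are hypothetical names, and the part-of-speech tags are hand-written stand-ins so the example runs without spaCy. Real spaCy tags may differ from the ones assumed here.

```python
from typing import Callable, List, Tuple

Tagged = List[Tuple[str, str]]  # (token text, coarse POS tag such as "NOUN")
SIZE_WORDS = {"size", "vol", "volume"}


def shorten_name(name: str, tagger: Callable[[str], Tagged],
                 max_len: int = 40, max_adj: int = 2, max_nouns: int = 2) -> str:
    """Rebuild a name as adjectives + size info + nouns, shorter than max_len."""
    adjectives, sizes, nouns = [], [], []
    prev_was_size_word = False
    for text, pos in tagger(name):
        if text.lower() in SIZE_WORDS:
            sizes.append(text)
        elif pos == "NUM" and prev_was_size_word:
            sizes.append(text)  # the value right after "Size"/"Vol"/"Volume"
        elif pos == "ADJ":
            adjectives.append(text)
        elif pos == "NOUN":
            nouns.append(text)
        prev_was_size_word = text.lower() in SIZE_WORDS
    words = adjectives[:max_adj] + sizes[:2] + nouns[:max_nouns]
    # Drop trailing words until the name is strictly shorter than max_len.
    while len(words) > 1 and len(" ".join(words)) >= max_len:
        words.pop()
    return " ".join(words)[: max_len - 1]  # hard cut only for one oversized word


# ASSUMPTION: hand-written tags standing in for spaCy's token.pos_ values.
DEMO_TAGS = {"Cut": "VERB", "Resistant": "ADJ", "Gloves": "NOUN", "Size": "NOUN",
             "8": "NUM", "Grey/Black": "ADJ", "12": "NUM", "per": "ADP", "DZ": "PROPN"}


def demo_tagger(name: str) -> Tagged:
    """Very rough tokenizer plus lookup tagger, for demonstration only."""
    tokens = name.replace(",", " ").replace(" - ", " ").split()
    return [(tok, DEMO_TAGS.get(tok, "X")) for tok in tokens]


names = ["Cut Resistant Gloves, Size 8, Grey/Black - 12 per DZ"]
short_names = [shorten_name(n, demo_tagger) for n in names]
print(short_names)  # ['Resistant Grey/Black Size 8 Gloves']
```

With spaCy, the tagger could be `lambda s: [(t.text, t.pos_) for t in nlp(s)]`, and with pandas the new column could be filled with something like `df["short_name"] = df["product_name"].apply(lambda s: shorten_name(s, tagger))`, where both column names are placeholders for your own.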