How to derive attributes/labels from short plain text descriptions? (NER, LLM, ?)
I have short product descriptions that I’d like to transform into structured attributes.
Example:
Input:
“La Lecciaia Cabernet Sauvignon 2017 – Red – 750ml”
Output:
Year = 2017
Color = Red
Weight = 750
Weight Unit = ml
If everything was in this format it would be trivial to write a regular expression and be done with it, but there are many different formats and nuances. It is increasingly cumbersome to hard-code logic for each format. Trying to create a generic solution I immediately run into issues with a “basic” approach:
There are several different data providers, and each has its own format. For the example above, another provider might use “(Red) 2017 La Lecciaia Cabernet Sauvignon 750 ML”. Even for a given provider, there may be multiple formats and they may change over time. Formats are not always strictly followed.
There are many ways of expressing particular components. As an example, Weight might be expressed as any one of these: “1.5L”, “1 1/2 Liters”, “1500ml”, etc.
Parts of the description may be confused for target components. There may be a white wine from a brand called “Red Head Vineyard”. A weight of “2000 ml” may be confused for a year, etc. I’m only using wine examples here to keep things simple for a general audience, but my product domain has the same conceptual issues.
I’d consider this a “nice to have”, but it would be useful to parse out even more detail, e.g. the algorithm being smart enough to know that “La Lecciaia” is the brand and “Cabernet Sauvignon” is the grape variety. I assume this would take more up-front work and be harder to get right, but if there’s a straightforward method of doing it, that would be good to know about.
I’d like to develop a general-purpose function that can accept a description from any format. I have little experience with NLP/Artificial Intelligence but suspect there are useful tools/algos I can leverage. I have 1,000+ example records that I could potentially use to train a model. Something that can run locally would be preferred but not absolutely necessary.
I’m not looking for a specific implementation but for guidance from anyone who’s worked on a similar problem. Open to hybrid approaches where some additional logic or manual oversight could account for initial inaccuracies.
Appreciate any insight into approaches or suggested learning resources.
I've looked online for information, but many approaches involve a significant amount of up-front work and it's unclear whether they'll work in practice.
An LLM would work nicely for this. I've done similar tasks before and it worked well with minimal training. Just keep in mind that none of the statistical methods (NLP / LLM / NER) will ever be 100% accurate, but for practical purposes I find LLMs to be more accurate than a custom soup of regular expressions.
For your task I would use a framework like LangChain and the following prompt (note that you might need to work on the prompt a bit; this is just an example). When run with a model it will produce XML output, which is trivial to parse. You can modify the prompt to produce other output formats, but personally I find XML works very well for me.
You are an AI language model designed to parse wine bottle descriptions into structured data. You will be given a wine bottle description, and your task is to extract the following components:
- **Year**: The vintage year of the wine.
- **Color**: The color of the wine (e.g., Red, White, Rosé).
- **Weight**: The volume of the wine bottle expressed as a number (e.g., 750, 1500).
- **Weight Unit**: The unit of measurement for the weight (e.g., ml, mL, L, Liters).
- **Brand**: The brand or producer of the wine.
- **Grape Variety**: The variety of grape used (e.g., Cabernet Sauvignon, Merlot).
**Instructions:**
- Wine descriptions may come in various formats and may include additional or confusing information. Carefully analyze the description to accurately extract the components.
- Be cautious of potential ambiguities. For example:
- A brand name may include words like "Red" or "White" (e.g., "Red Head Vineyard") which should not be confused with the wine color.
- Large numbers may represent weight (e.g., "1500 ml") rather than a year.
- **Do not assume information not present in the description.** If a component is missing, you may leave the corresponding tag empty or omit it.
**Output Format:**
Provide the extracted information in XML format, using the following structure:
<Wine>
<Year>{{Year}}</Year>
<Color>{{Color}}</Color>
<Weight>{{Weight}}</Weight>
<WeightUnit>{{WeightUnit}}</WeightUnit>
<Brand>{{Brand}}</Brand>
<GrapeVariety>{{GrapeVariety}}</GrapeVariety>
</Wine>
**Examples:**
1. **Input:**
`La Lecciaia Cabernet Sauvignon 2017 – Red – 750ml`
**Output:**
```xml
<Wine>
<Year>2017</Year>
<Color>Red</Color>
<Weight>750</Weight>
<WeightUnit>ml</WeightUnit>
<Brand>La Lecciaia</Brand>
<GrapeVariety>Cabernet Sauvignon</GrapeVariety>
</Wine>
```
2. **Input:**
`Red Head Vineyard Chardonnay 2020 1.5L`
**Output:**
<Wine>
<Year>2020</Year>
<Color></Color>
<Weight>1.5</Weight>
<WeightUnit>L</WeightUnit>
<Brand>Red Head Vineyard</Brand>
<GrapeVariety>Chardonnay</GrapeVariety>
</Wine>
**Task:**
Given the following wine description, extract the components and provide the output in XML format as specified.
{wine_description}
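To illustrate the "trivial to parse" part, here is a minimal sketch using only Python's standard library. It assumes the model's reply contains one `<Wine>` block in the format above, possibly surrounded by extra text or a code fence. The helper name `parse_wine_xml` and the `FIELDS` list are my own for illustration, not part of any framework.

```python
import re
import xml.etree.ElementTree as ET

# Tag names taken from the prompt's output format.
FIELDS = ["Year", "Color", "Weight", "WeightUnit", "Brand", "GrapeVariety"]


def parse_wine_xml(llm_output: str) -> dict:
    """Pull the <Wine> element out of raw model output and return a dict.

    Tags that are missing or empty are mapped to None.
    """
    # The model may wrap the XML in prose or a code fence, so locate it first.
    match = re.search(r"<Wine>.*?</Wine>", llm_output, flags=re.DOTALL)
    if match is None:
        raise ValueError("no <Wine> element found in model output")
    root = ET.fromstring(match.group(0))

    result = {}
    for field in FIELDS:
        text = root.findtext(field)  # None when the tag is absent
        text = text.strip() if text else ""
        result[field] = text or None
    return result


sample = """Here is the result:
<Wine>
<Year>2020</Year>
<Color></Color>
<Weight>1.5</Weight>
<WeightUnit>L</WeightUnit>
<Brand>Red Head Vineyard</Brand>
<GrapeVariety>Chardonnay</GrapeVariety>
</Wine>"""

print(parse_wine_xml(sample))
```

One caveat: if the model emits an unescaped `&` or `<` inside a value (for example in a brand name), `ET.fromstring` raises `ParseError`, so you may want to catch that and route the record to manual review.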
Keep in mind that LLMs are not cheap to run. But for this task, given the ambiguity of the domain, an LLM is most likely the best choice. For this particular task it would cost about 1/1000 of a penny per label using the OpenAI service. You might find a cheaper model or provider. However, when working with LLMs it is very important to ensure accuracy first, then optimize for cost.
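Since you already have 1,000+ labeled records, one way to check accuracy before worrying about cost is to score each field separately. The sketch below is only an illustration: `field_accuracy` is a hypothetical helper, and `naive_extract` is a deliberately crude stand-in for whatever extractor you end up with (your LLM call plus parsing).

```python
import re
from collections import Counter


def field_accuracy(examples, extract):
    """Return per-field accuracy of `extract` over labeled examples.

    examples: list of (description, expected_dict) pairs.
    extract:  callable taking a description and returning a dict.
    """
    correct = Counter()
    total = Counter()
    for description, expected in examples:
        predicted = extract(description)
        for field, want in expected.items():
            total[field] += 1
            if predicted.get(field) == want:
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}


def naive_extract(description):
    # Stand-in extractor (assumption): swap in the real LLM-based one.
    year = re.search(r"\b(19|20)\d{2}\b", description)
    return {
        "Year": year.group(0) if year else None,
        "Color": "Red" if "Red" in description else None,
    }


examples = [
    ("La Lecciaia Cabernet Sauvignon 2017 – Red – 750ml",
     {"Year": "2017", "Color": "Red"}),
    ("Red Head Vineyard Chardonnay 2020 1.5L",
     {"Year": "2020", "Color": None}),
]

print(field_accuracy(examples, naive_extract))
```

The stand-in gets the color wrong on the "Red Head Vineyard" record, which is exactly the kind of per-field error this breakdown is meant to surface. Records where a field disagrees can then go to manual review.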
The whole thing will probably take 1-2 hours to build for an intermediate LLM developer. If you are still learning, it may take longer. But this is a perfect project for learning about LLMs.