I have started my master's thesis at a food company. They start with a few ingredients, mix them, heat them, and so on until they finally get candy. But there is a problem: when producing the same candy, the PLC-controlled machines do not always run smoothly and do not always give the same result. The company suspects the cause is the fruit ingredient, whose properties (viscosity, etc.) are not always exactly the same. They measure the features of the ingredients before they are used in production, and they also measure all process parameters (pressure, temperature, brix, etc.). All of this is stored. My thesis is to examine this data with machine learning models to extract more information, but I have run into some problems. The first problem is that I do not actually have a classification. There is no such thing as 'good candy' and 'bad candy'. The second problem is that I do not really have output parameters. I have the brix value, but that's it. The last question is: the ingredients are input features for my model, but are the process features also inputs? Or should I just leave them out?
Thank you very much for the help!
The first problem is that I do not actually have a classification. There is no such thing as 'good candy' and 'bad candy'.
How does the company decide whether a batch is acceptable or not? You need to find out which criteria they use to label candy as 'good' or 'bad'. If there are no labels at all, you will have to look at unsupervised learning techniques such as cluster analysis or factor analysis.
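As a rough illustration of the cluster analysis idea, here is a minimal sketch using k-means from scikit-learn. Everything in it is an assumption made for the example: the data is synthetic, the three columns (viscosity, pressure, temperature) are hypothetical stand-ins for the stored measurements, and both k-means and the choice of two clusters are just one possible option, not something the company data necessarily supports.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the stored measurements (NOT real data).
# Hypothetical columns: fruit viscosity, pressure, temperature.
rng = np.random.default_rng(0)
smooth_runs = rng.normal(loc=[10.0, 2.0, 120.0], scale=[0.5, 0.1, 1.0], size=(50, 3))
rough_runs = rng.normal(loc=[14.0, 2.6, 126.0], scale=[0.5, 0.1, 1.0], size=(50, 3))
X = np.vstack([smooth_runs, rough_runs])

# Standardize so that no single variable dominates the distance measure.
X_scaled = StandardScaler().fit_transform(X)

# Group the production runs into two clusters (the number 2 is an assumption).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X_scaled)

# Summarize each cluster in the original units so it can be discussed
# with the process engineers.
for k in range(2):
    members = X[cluster_ids == k]
    print(f"cluster {k}: {len(members)} runs, mean = {members.mean(axis=0).round(2)}")
```

The clusters themselves carry no meaning such as 'good' or 'bad'. They only become useful once someone at the company can say what distinguishes the runs in one group from the runs in the other.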
The second problem is that I do not really have output parameters. I have the brix value, but that's it.
Depending on your task, you will have to decide what your target values are. For classification, the target would be the label of the candy, i.e. 'good' or 'bad'. For regression, you need something continuous (e.g. the brix value, if it is relevant to your goal). For unsupervised learning, you do not need an output variable.
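To make the regression case concrete, here is a minimal sketch that fits an ordinary least-squares model with brix as the continuous target. The data is synthetic, the input names and the linear relationship used to generate brix are invented purely for illustration, and plain linear regression is only one possible model choice.

```python
import numpy as np

# Synthetic stand-in data (NOT real measurements).
rng = np.random.default_rng(1)
n_runs = 200
viscosity = rng.normal(12.0, 1.0, n_runs)      # hypothetical ingredient feature
temperature = rng.normal(120.0, 2.0, n_runs)   # hypothetical process parameter

# Invented relationship, used only so the example has something to learn.
brix = 40.0 + 1.5 * viscosity + 0.2 * temperature + rng.normal(0.0, 0.3, n_runs)

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n_runs), viscosity, temperature])

# Hold out the last 50 runs to check how well the model generalizes.
train, test = slice(0, 150), slice(150, None)
coef, *_ = np.linalg.lstsq(X[train], brix[train], rcond=None)

# Coefficient of determination (R^2) on the held-out runs.
pred = X[test] @ coef
ss_res = np.sum((brix[test] - pred) ** 2)
ss_tot = np.sum((brix[test] - brix[test].mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("intercept, viscosity, temperature coefficients:", coef.round(3))
print("held-out R^2:", round(float(r2), 3))
```

With real data, a low held-out R^2 would suggest that the measured inputs do not explain brix well, which is useful information for the thesis in itself.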
The last question is: the ingredients are input features for my model, but are the process features also inputs? Or should I just leave them out?
Look at all the variables you have and decide which ones hold valuable information about whether the candy is 'good' or 'bad'. This requires domain knowledge that you need to gather: ask the people at the company, who should be able to tell you what is important and what is not. You can also look at the statistics of all parameters. Identify the parameters that correlate with the quality of the candy. Parameters that show little variation (e.g. a temperature that is always constant) can be neglected.
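The statistical screening described above could look roughly like the following sketch: drop parameters that barely vary, then rank the rest by their correlation with a target. The data is synthetic, the parameter names are hypothetical, `target` stands in for whatever quality measure the company can provide, and the threshold `min_rel_std` is an arbitrary value chosen for the example.

```python
import numpy as np

# Synthetic stand-in data (NOT real measurements).
rng = np.random.default_rng(2)
n_runs = 200
data = {
    "viscosity": rng.normal(12.0, 1.0, n_runs),
    "pressure": rng.normal(2.0, 0.2, n_runs),
    "temperature": np.full(n_runs, 120.0),  # always constant
}
# Hypothetical quality measure, here driven by viscosity only.
target = 2.0 * data["viscosity"] + rng.normal(0.0, 0.5, n_runs)

# Arbitrary threshold on the relative standard deviation (std / |mean|).
# This assumes every parameter has a non-zero mean.
min_rel_std = 0.01

kept = {}
for name, values in data.items():
    rel_std = values.std() / abs(values.mean())
    if rel_std < min_rel_std:
        print(f"dropping {name}: almost no variation")
        continue
    # Pearson correlation between this parameter and the target.
    kept[name] = float(np.corrcoef(values, target)[0, 1])

# Rank the remaining parameters by the strength of their correlation.
ranked = sorted(kept.items(), key=lambda item: abs(item[1]), reverse=True)
for name, corr in ranked:
    print(f"{name}: correlation with target = {corr:.2f}")
```

Pearson correlation only captures linear relationships, so a parameter with a low value here may still matter. The ranking is a starting point for a conversation with the domain experts, not a replacement for it.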