I created R&R training data from ground truth and noticed that each ground truth question produced 10 records of training data, regardless of the number of candidate answers it has in the ground truth.
Does only the number of ground truth questions affect the size of the R&R training data? I would like to know because there is a size limit on training data.
noticed that each ground truth question produced 10 records of training data, regardless of the number of candidate answers it has in the ground truth
If you are using the python train.py utility to prepare the training data for R&R, the number of candidate answers per question is controlled by the optional -r (--rows) argument, which specifies the number of answer results that the query returns. The default value is 10, which is what you are seeing.
Similarly, if you are calling the /fcselect API directly to generate the training data, you can use the optional rows parameter to specify the number of candidate answers for which features are generated. Again, the default is 10.
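As a rough illustration of overriding that default, here is a minimal sketch that only builds an /fcselect request URL with an explicit rows value (it does not send the request). The host, cluster ID, and collection name are placeholders, and the query parameters other than rows (q, gt, generateHeader, returnRSInput, wt) are my recollection of what train.py sends, so treat them as assumptions and check them against the script you are using.

```python
from urllib.parse import urlencode


def build_fcselect_url(base_url, cluster_id, collection, question, relevance, rows=10):
    """Build an /fcselect URL; `rows` sets how many candidate answers are returned.

    Every parameter name except `rows` is an assumption, not confirmed here.
    """
    params = {
        "q": question,             # the ground truth question
        "gt": relevance,           # relevance labels for that question (assumed name)
        "rows": rows,              # number of candidate answers to generate features for
        "generateHeader": "true",  # assumed: ask for a header row of feature names
        "returnRSInput": "true",   # assumed: return ranker-ready training rows
        "wt": "json",
    }
    path = f"{base_url}/solr_clusters/{cluster_id}/solr/{collection}/fcselect"
    return f"{path}?{urlencode(params)}"


# Placeholder host and IDs; substitute your own service URL and credentials.
url = build_fcselect_url(
    "https://example.invalid/retrieve-and-rank/api/v1",
    "CLUSTER_ID",
    "COLLECTION_NAME",
    "what is drag",
    "doc1,3",
    rows=30,
)
print(url)
```

With rows=30, each question can contribute up to 30 training records instead of 10.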
If you can afford to do so, it is generally better to override this default and experiment with higher values, as that gives the ranker more room to learn and re-rank answers. The RnR web tooling uses a default of 30.
Does only the number of ground truth questions affect the size of the R&R training data?
No, the size of the training data is proportional to all three: (1) the number of queries, (2) the number of candidate answers per query, and (3) the number of features (columns). The number of features is, in turn, proportional to the number of fields in the schema that are marked for feature generation (in the default schema, these are the fields of type watson_text_en).
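To check this against the size limit, the proportionality can be sketched as simple arithmetic. The function names and the example numbers (500 questions, 40 feature columns) are made up for illustration. The result is an upper bound, assuming a query can return fewer candidates than the rows value when the collection has few matching documents.

```python
def estimate_training_records(num_questions, rows_per_question=10):
    """Upper bound on training records: one record per returned candidate answer."""
    return num_questions * rows_per_question


def estimate_training_cells(num_questions, rows_per_question, num_features):
    """Upper bound on feature values: records times feature columns."""
    return estimate_training_records(num_questions, rows_per_question) * num_features


# Hypothetical numbers, for illustration only.
questions = 500
features = 40

for rows in (10, 30):
    records = estimate_training_records(questions, rows)
    cells = estimate_training_cells(questions, rows, features)
    print(f"rows={rows}: up to {records} records, {cells} feature values")
```

Tripling rows from 10 to 30 triples the record count for the same ground truth, so the rows value matters as much as the number of questions when you are close to the limit.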