I'm working with a dataset where each sample contains both numeric and text data, so I use several methods to build the training feature matrix. For each sample, I construct a vector representation from 3 parts.
Doc2Vec vector representation for paragraph text: I use the gensim implementation of Paragraph Vector to encode each text into a 100-D vector of floats in [-5, 5].
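For reference, a minimal sketch of that step with gensim (the corpus, tokenization, and all hyperparameters except vector_size=100 are placeholder assumptions):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical two-document corpus; in practice these are the paragraph texts.
texts = ["some paragraph text for the first sample",
         "a different paragraph for the second sample"]
corpus = [TaggedDocument(words=text.lower().split(), tags=[i])
          for i, text in enumerate(texts)]

# vector_size=100 matches the 100-D representation described above;
# min_count and epochs are placeholder choices.
model = Doc2Vec(corpus, vector_size=100, min_count=1, epochs=20)

# One 100-D float vector per sample.
doc_vectors = [model.infer_vector(doc.words) for doc in corpus]
```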
One-hot encoded vector for text labels: Each sample in the dataset has zero or more text labels. I collect all of the unique labels used in the dataset and encode each sample's labels into a binary array containing only 0s and 1s. For example, if the complete set of labels is [Python, Java, JavaScript, C++] and a sample has the labels Python and Java, the resulting vector will be [1, 1, 0, 0].
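This multi-label encoding can also be produced with scikit-learn's MultiLabelBinarizer; a sketch reusing the example label set above (not necessarily how the original vectors were built):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Fix the column order so it matches the label set from the example.
mlb = MultiLabelBinarizer(classes=["Python", "Java", "JavaScript", "C++"])

# A sample carrying the labels Python and Java.
encoded = mlb.fit_transform([["Python", "Java"]])
print(encoded)  # [[1 1 0 0]]
```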
Numeric data & categorical data: The categorical features are encoded into binary columns and appended alongside the raw numeric features.
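A sketch of this last part plus the final concatenation, assuming the categorical features are one-hot encoded (the column values here are placeholders):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical raw columns for two samples.
categorical = np.array([["linux"], ["windows"]])    # one categorical feature
numeric = np.array([[235.0, 11.5, 333.0],
                    [233.0, 22.0, 333.0]])          # raw numeric features

# sparse_output=False gives a dense array we can hstack directly
# (the parameter is named sparse on scikit-learn < 1.2).
cat_encoded = OneHotEncoder(sparse_output=False).fit_transform(categorical)

# In the full pipeline, the Doc2Vec and label vectors from the steps
# above would be concatenated here as well.
features = np.hstack([cat_encoded, numeric])
```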
The resulting feature matrix looks something like this:
[
[-1.02, 1.33, 2.35, -0.48, ... -4.11, 1, 0, 1, 1, 0, 0, ..., 1, 0, 235, 11.5, 333],
[-0.22, 3.03, 1.95, -0.48, ... -4.11, 0, 1, 1, 1, 0, 0, ..., 0, 0, 233, 22, 333],
[-2.07, -1.33, -2.35, -0.48, ... -4.11, 1, 1, 0, 1, 1, 0, ..., 1, 1, 102, 13, 333],
[-4.32, 4.33, 1.75, -0.48, ... -4.11, 0, 0, 0, 1, 0, 1, ..., 1, 0, 98, 8, 333],
]
Should I apply any standardization or normalization to the dataset? If so, should I do it before or after concatenating the different feature parts?
I'm using scikit-learn, and the main algorithm I'll be using is Gradient Boosting.
Yes, you should process the feature groups separately: apply standardization or normalization only to the original numeric features; don't apply it to the Doc2Vec, one-hot encoded, or encoded categorical features. The binary columns are already on a common [0, 1] scale, and rescaling them would only distort their meaning.
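One way to express this in scikit-learn after concatenation is a ColumnTransformer that scales only the numeric columns and passes everything else through. A minimal sketch, assuming the raw numeric features occupy the last three columns (the data here is random placeholder input):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Placeholder matrix standing in for the concatenated features.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(4, 5)),          # stand-in for Doc2Vec columns
               rng.integers(0, 2, size=(4, 6)),  # stand-in for one-hot columns
               rng.normal(50, 10, size=(4, 3))]) # stand-in for raw numeric columns

n_numeric = 3  # placeholder: how many raw numeric columns you have
numeric_cols = list(range(X.shape[1] - n_numeric, X.shape[1]))

preprocess = ColumnTransformer(
    [("scale", StandardScaler(), numeric_cols)],
    remainder="passthrough",  # leave Doc2Vec / one-hot columns untouched
)
X_scaled = preprocess.fit_transform(X)
```

Note that ColumnTransformer moves the transformed columns to the front of its output, ahead of the passthrough columns. If you want to keep the original column order, scaling the numeric block before concatenating is equivalent, since the scaler only ever sees those columns either way.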