keras, sentiment-analysis, word-embedding, tensorflow-hub, language-model

TensorFlow Hub NNLM word embedding with sentiment140 data gives input shape error


I am using the TensorFlow Hub word embedding "https://tfhub.dev/google/nnlm-en-dim128/2" for sentiment analysis of the Kaggle "sentiment140" dataset.

Dataset (Kaggle "sentiment140"): https://www.kaggle.com/kazanova/sentiment140

TensorFlow Hub module: https://tfhub.dev/google/nnlm-en-dim128/2

I am using a Keras Sequential model. When I fit the model, it raises this ValueError:

ValueError: Python inputs incompatible with input_signature:
      inputs: (
        Tensor("IteratorGetNext:0", shape=(None, 128), dtype=float32))
      input_signature: (
        TensorSpec(shape=(None,), dtype=tf.string, name=None))

My code:

import pandas as pd
import tensorflow as tf
from sklearn.model_selection import  train_test_split
import seaborn as sns
import tensorflow_hub as hub
from tensorflow.keras import Sequential
import keras

tweet_df = pd.read_csv("training.1600000.processed.noemoticon.csv", names=['polarity', 'id', 'date', 'query', 'user', 'text'],encoding='latin-1')

tweet_df.info()

tweet_df.head()

"""#### 2.) Data Visualization"""

tweet_df['polarity'] = tweet_df['polarity'].replace(to_replace=4,value=1)

### Print two tweets from each class

print("Tweet Polarity Negative class 0 :\n", tweet_df[tweet_df['polarity']==0]['text'].head(2) )

print("\n\nTweet Polarity Positive class 1 :\n", tweet_df['text'][tweet_df['polarity']==1].head(2) )

class_dist = tweet_df['polarity'].value_counts().rename_axis('Class Label').reset_index(name='Tweets')
#class_dist = class_dist['Class Label'].replace({0:'Negative',1:'Positve'})
class_dist

## Bar graph of Distribution of Classes
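# Map each numeric label to its name so the legend does not depend on row order
class_dist['class'] = class_dist['Class Label'].map({0: 'Negative', 1: 'Positive'})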
sns.set_theme(style='whitegrid')
sns.barplot(x='Class Label', y='Tweets', hue='class', data= class_dist)

### Train and test split 
X = tweet_df.iloc[:,5]
y = tweet_df.iloc[:,0]
X_train, X_test,y_train, y_test = train_test_split(X,y,random_state=5, test_size=0.2)

print("Training shape of X and y : ", X_train.shape ,y_train.shape)
print("Testing shape of X and y : ", X_test.shape ,y_test.shape)

"""#### 3.) Data Pre-processing"""

embed = hub.load("https://tfhub.dev/google/nnlm-en-dim128/2")
X_train_embed = embed(X_train)

y_train = tf.keras.utils.to_categorical(y_train,2)

X_train_embed.shape


X_sample = X_train_embed[:1000]
y_sample = y_train[:1000]
y_sample = tf.keras.utils.to_categorical(y_sample,2)


"""#### 4.) Model Building"""

hub_layer = hub.KerasLayer('https://tfhub.dev/google/nnlm-en-dim128/2',input_shape=[],dtype=tf.string,trainable=False)

model = Sequential()
model.add(hub_layer)
model.add(keras.layers.Dense(128, 'relu', name ='layer_1'))
model.add(keras.layers.Dense(64, 'relu', name = 'layer_2'))
model.add(keras.layers.Dense(2, activation='sigmoid', name='output'))

model.compile(optimizer='adam',loss= 'BinaryCrossentropy',  #'categorical_crossentropy' ,
              metrics=['accuracy'] )

NN_model = model.fit(X_sample, y_sample, epochs=20, validation_split=0.1, verbose=1)

Input shape:

X_sample.shape

TensorShape([1000, 128])

y_sample.shape

(1000, 2, 2)

X_sample

<tf.Tensor: shape=(1000, 128), dtype=float32, numpy=
array([[ 0.10381411,  0.07044576, -0.0282673 , ...,  0.08205549,
0.15822364, -0.10019408],
[-0.03332436, -0.00529242,  0.20348714, ..., -0.14174528,
0.05178985, -0.12599435],
[ 0.2461916 , -0.03084931,  0.05861813, ...,  0.07956063,
-0.03579932,  0.07493019],
[ 0.4102695 ,  0.15445013,  0.19045362, ...,  0.12681636,
0.12362286, -0.03969387],
[-0.0144283 , -0.05236297,  0.04851832, ...,  0.05562773,
0.01529189,  0.12605236],
[ 0.29280087,  0.05795274, -0.11779188, ..., -0.01890504,
0.02824693, -0.13629636]], dtype=float32)>

Solution

  • As described on https://tfhub.dev/google/nnlm-en-dim128/2, the model expects a vector of strings as input. You are effectively applying the embedding twice, since you are executing

    embed = hub.load("https://tfhub.dev/google/nnlm-en-dim128/2")
    X_train_embed = embed(X_train)  # (n, 128) float matrix
    

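    and then passing that embedding to model, which expects raw strings because its first layer is the NNLM KerasLayer. That is what the ValueError reports: the input is a float32 tensor of shape (None, 128), while the layer's input_signature requires a tf.string tensor of shape (None,).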

    I'd suggest removing embed and X_train_embed and calling model.fit directly on the raw strings in X_train, for example:

    import numpy as np
    model.fit(np.array(["Lyx is cool", "Lyx is not cool"]), np.array([1, 0]), epochs=20, validation_split=0.1, verbose=1)
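
A separate issue will probably surface once the input is fixed. The question reports y_sample.shape as (1000, 2, 2), which is consistent with the labels being one-hot encoded twice: to_categorical is applied to y_train and then again to y_sample. The sketch below illustrates the effect on a tiny label list. Note that one_hot and shape are hypothetical pure-Python helpers written for this illustration, not the Keras functions themselves; one_hot only mimics the way one-hot encoding appends a class axis to whatever it is given.

```python
def one_hot(labels, num_classes):
    """Hypothetical stand-in for one-hot encoding: appends a class axis."""
    if isinstance(labels, list):
        # Recurse into nested lists so every scalar gets its own one-hot vector
        return [one_hot(item, num_classes) for item in labels]
    return [1.0 if i == int(labels) else 0.0 for i in range(num_classes)]


def shape(nested):
    """Hypothetical helper: shape of a rectangular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


y = [0, 1, 1, 0]          # integer class labels, as in the polarity column
once = one_hot(y, 2)      # encoded once: one row of 2 values per sample
twice = one_hot(once, 2)  # encoded again: every 0.0/1.0 becomes its own vector

print(shape(once))   # (4, 2)
print(shape(twice))  # (4, 2, 2)
```

Under that reading, dropping the second to_categorical call (the one applied to y_sample) should leave labels of shape (1000, 2), which matches the Dense(2) output layer.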