I am using the TensorFlow Hub word embedding "https://tfhub.dev/google/nnlm-en-dim128/2" for sentiment analysis of the Kaggle "sentiment140" dataset.
Dataset: Kaggle "sentiment140", https://www.kaggle.com/kazanova/sentiment140. TensorFlow Hub module: https://tfhub.dev/google/nnlm-en-dim128/2
I am using a Keras Sequential model. When I fit the model, it raises a ValueError:
ValueError: Python inputs incompatible with input_signature:
inputs: (
Tensor("IteratorGetNext:0", shape=(None, 128), dtype=float32))
input_signature: (
TensorSpec(shape=(None,), dtype=tf.string, name=None))
My code:
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
import seaborn as sns
import tensorflow_hub as hub
from tensorflow.keras import Sequential
import keras
tweet_df = pd.read_csv("training.1600000.processed.noemoticon.csv", names=['polarity', 'id', 'date', 'query', 'user', 'text'],encoding='latin-1')
tweet_df.info()
tweet_df.head()
"""#### 2.) Data Visualization"""
tweet_df['polarity'] = tweet_df['polarity'].replace(to_replace=4,value=1)
### Print two tweets from each class
print("Movie Review Polarity Negative class 0 :\n", tweet_df[tweet_df['polarity']==0]['text'].head(2) )
print("\n\nMovie Review Polarity Positive class 1 :\n", tweet_df['text'][tweet_df['polarity']==1].head(2) )
class_dist = tweet_df['polarity'].value_counts().rename_axis('Class Label').reset_index(name='Tweets')
#class_dist = class_dist['Class Label'].replace({0:'Negative',1:'Positve'})
class_dist
## Bar graph of Distribution of Classes
class_dist['class'] = ['Positive','Negative']
sns.set_theme(style='whitegrid')
sns.barplot(x='Class Label', y='Tweets', hue='class', data= class_dist)
### Train and test split
X = tweet_df.iloc[:,5]
y = tweet_df.iloc[:,0]
X_train, X_test,y_train, y_test = train_test_split(X,y,random_state=5, test_size=0.2)
print("Training shape of X and y : ", X_train.shape ,y_train.shape)
print("Testing shape of X and y : ", X_test.shape ,y_test.shape)
"""#### 3.) Data Pre-processing"""
embed = hub.load("https://tfhub.dev/google/nnlm-en-dim128/2")
X_train_embed = embed(X_train)
y_train = tf.keras.utils.to_categorical(y_train,2)
X_train_embed.shape
X_sample = X_train_embed[:1000]
y_sample = y_train[:1000]
y_sample = tf.keras.utils.to_categorical(y_sample,2)
"""#### 4.) Model Building"""
hub_layer = hub.KerasLayer('https://tfhub.dev/google/nnlm-en-dim128/2',input_shape=[],dtype=tf.string,trainable=False)
model = Sequential()
model.add(hub_layer)
model.add(keras.layers.Dense(128, 'relu', name ='layer_1'))
model.add(keras.layers.Dense(64, 'relu', name = 'layer_2'))
model.add(keras.layers.Dense(2, activation='sigmoid', name='output'))
model.compile(optimizer='adam',loss= 'BinaryCrossentropy', #'categorical_crossentropy' ,
metrics=['accuracy'] )
NN_model = model.fit(X_sample, y_sample, epochs=20, validation_split=0.1, verbose=1)
Input shape:
X_sample.shape
TensorShape([1000, 128])
y_sample.shape
(1000, 2, 2)
X_sample
<tf.Tensor: shape=(1000, 128), dtype=float32, numpy=
array([[ 0.10381411, 0.07044576, -0.0282673 , ..., 0.08205549,
0.15822364, -0.10019408],
[-0.03332436, -0.00529242, 0.20348714, ..., -0.14174528,
0.05178985, -0.12599435],
[ 0.2461916 , -0.03084931, 0.05861813, ..., 0.07956063,
-0.03579932, 0.07493019],
[ 0.4102695 , 0.15445013, 0.19045362, ..., 0.12681636,
0.12362286, -0.03969387],
[-0.0144283 , -0.05236297, 0.04851832, ..., 0.05562773,
0.01529189, 0.12605236],
[ 0.29280087, 0.05795274, -0.11779188, ..., -0.01890504,
0.02824693, -0.13629636]], dtype=float32)>
As described on https://tfhub.dev/google/nnlm-en-dim128/2, the model expects a vector of strings as input. You are effectively applying the embedding twice, because you first execute
embed = hub.load("https://tfhub.dev/google/nnlm-en-dim128/2")
X_train_embed = embed(X_train) # (n, 128) float matrix
and then pass that embedding to model, which itself takes strings as input because it starts with the NNLM KerasLayer.
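To see where the two halves of the error message come from (a float32 tensor of shape (None, 128) on one side, a tf.string signature on the other), here is a minimal sketch that reproduces the same kind of mismatch without downloading anything. Note that `fake_embed` is a hypothetical stand-in, not the real NNLM module: it only mimics the "batch of strings in, (n, 128) floats out" contract. Depending on the TensorFlow version the second call raises either ValueError or TypeError, so the sketch catches both.

```python
import tensorflow as tf


# Hypothetical stand-in for the NNLM module (assumption, not the real model):
# it accepts a 1-D batch of strings and returns an (n, 128) float32 matrix of zeros.
@tf.function(input_signature=[tf.TensorSpec(shape=(None,), dtype=tf.string)])
def fake_embed(sentences):
    batch_size = tf.shape(sentences)[0]
    return tf.zeros([batch_size, 128], dtype=tf.float32)


strings = tf.constant(["Lyx is cool", "Lyx is not cool"])

# First call: strings match the declared signature, so this works.
vectors = fake_embed(strings)
print(vectors.shape)

# Second call: feeding the float matrix back in violates the string signature.
mismatch_error = None
try:
    fake_embed(vectors)
except (ValueError, TypeError) as err:
    mismatch_error = err

print(type(mismatch_error).__name__)
```

This is the same situation as fitting the Sequential model on X_sample: the first layer wants strings, but it receives vectors that have already been embedded.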
I'd propose removing embed and X_train_embed and just calling model.fit with X_train (the raw tweet strings):
model.fit(np.array(["Lyx is cool", "Lyx is not cool"]), np.array([1, 0]), epochs=20, validation_split=0.1, verbose=1)
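One more thing that may be worth checking once the inputs are strings: the question reports y_sample.shape as (1000, 2, 2), which is what you get when to_categorical is applied twice (once to y_train and again to y_sample). Below is a small sketch with made-up labels, assuming the intent is one one-hot row per tweet to match the 2-unit output layer.

```python
import numpy as np
import tensorflow as tf

# Made-up 0/1 polarity labels standing in for y_train (illustrative only).
labels = np.array([0, 1, 1, 0])

# One pass gives one one-hot row per label.
one_hot = tf.keras.utils.to_categorical(labels, 2)
print(one_hot.shape)

# A second pass one-hot encodes every entry again and adds an extra axis.
twice = tf.keras.utils.to_categorical(one_hot, 2)
print(twice.shape)
```

Dropping the second to_categorical call should leave the labels with one row of length 2 per sample.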