In this notebook, we will use autoencoders to do stock sentiment analysis. An autoencoder consists of an encoder and a decoder model: the encoder compresses the data and the decoder decompresses it. Once the autoencoder neural network is trained, the encoder on its own can be used to generate features for training a different machine learning model.
For stock sentiment analysis, we will first use the encoder for feature extraction and then use those features to train a machine learning model that classifies the stock tweets. To learn more about Autoencoders check out the following link...
https://www.nbshare.io/notebook/86916405/Understanding-Autoencoders-With-Examples/
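Before loading the data, here is a minimal, self-contained sketch of the idea. The layer sizes below are made up for illustration and are unrelated to the network we build later in this notebook.
# Toy example: an autoencoder compresses the input to a small "code" and reconstructs it
from keras.layers import Input, Dense
from keras.models import Model

inp = Input(shape=(1000,))                      # original 1000-dimensional input
code = Dense(16, activation='relu')(inp)        # encoder: compress to 16 numbers
out = Dense(1000, activation='sigmoid')(code)   # decoder: reconstruct the input

toy_autoencoder = Model(inp, out)               # trained to reproduce its own input
toy_encoder = Model(inp, code)                  # after training, keep only this part
# toy_encoder.predict(X) would then give compressed features for a downstream classifier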
Let us import the necessary packages.
# importing necessary lib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# reading tweets data
df = pd.read_csv('/content/stocktwits (2).csv')
df.head()
Let us remove the unnecessary features - ticker, followers and created_at from our dataset.
df = df.drop(['ticker', 'followers', 'created_at'], axis=1)
df.head()
# class counts
df['sentiment'].value_counts()
If you observe the above results, our dataset is imbalanced: there are far more Bullish tweets than Bearish tweets. We need to balance the data.
# Sentiment encoding
# Encoding Bullish with 0 and Bearish with 1
dict = {'Bullish': 0, 'Bearish': 1}

# Mapping the dictionary to the 'sentiment' column
df['Class'] = df['sentiment'].map(dict)
df.head()
Let us remove the 'sentiment' feature since we have already encoded it in the 'Class' column.
df = df.drop(['sentiment'], axis=1)
To balance the dataset, in the next few lines of code I take the same number of samples from the Bullish class as there are in the Bearish class.
Bearish = df[df['Class'] == 1]
Bullish = df[df['Class'] == 0].sample(4887)
# combining the down-sampled majority class with the minority class
df = pd.concat([Bullish, Bearish]).reset_index(drop=True)
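A side note on the hard-coded 4887: it is meant to match the number of Bearish tweets, so that both classes end up with the same count. If you prefer not to hard-code that number (and want reproducible sampling), the two cells above could equivalently be written as follows; the random_state is my addition and was not in the original code.
# Equivalent down-sampling without a hard-coded count (alternative to the two cells above)
Bearish = df[df['Class'] == 1]
Bullish = df[df['Class'] == 0].sample(len(Bearish), random_state=42)
df = pd.concat([Bullish, Bearish]).reset_index(drop=True)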
Let us check how our dataframe looks now.
df.head()
Let us count both classes to make sure each class has the same number of records.
# balanced class
df['Class'].value_counts()
df.message
Now we need to convert the tweets (text) into vector form.
To convert text into vector form, we first need to clean it. Cleaning means removing special characters, lowercasing, removing numerals, stemming, etc.
For text preprocessing I am using the NLTK library.
import nltk
nltk.download('stopwords')
import re
# I am using PorterStemmer for stemming
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
corpus = []
for i in range(0, len(df)):
    review = re.sub('[^a-zA-Z]', ' ', df['message'][i])
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if not word in stopwords.words('english')]
    review = ' '.join(review)
    corpus.append(review)
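As a quick sanity check (not part of the original notebook), you can compare a raw tweet with its cleaned, stemmed version:
# Compare the first raw message with its cleaned version
print(df['message'][0])
print(corpus[0])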
To convert the words into vectors, I am using TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer
# I am using 1 to 3 ngram combinations
tfidf = TfidfVectorizer(max_features=10000, ngram_range=(1, 3))
tfidf_word = tfidf.fit_transform(corpus).toarray()
tfidf_class = df['Class']
tfidf_word
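If you are curious which n-grams the vectorizer actually kept (again, not in the original notebook), the fitted vectorizer exposes them; the method name below assumes scikit-learn 1.0 or newer.
# Peek at a few of the 1- to 3-gram features kept by TF-IDF
print(tfidf.get_feature_names_out()[:10])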
# importing necessary lib
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns
from keras.layers import Input, Dense
from keras.models import Model, Sequential
from keras import regularizers
tfidf_class
To make the data suitable for the auto-encoder, I am using MinMaxScaler.
X_scaled = MinMaxScaler().fit_transform(tfidf_word)
X_bulli_scaled = X_scaled[tfidf_class == 0]
X_bearish_scaled = X_scaled[tfidf_class == 1]
tfidf_word.shape
I am using a standard auto-encoder network.
For the encoder and decoder layers I am using the 'tanh' activation function.
For the bottleneck and output layers I am using 'relu' activation.
I am using an L1 regularizer in the encoder. To learn more about regularization check here.
# Building the Input Layer
input_layer = Input(shape=(tfidf_word.shape[1],))

# Building the Encoder network
encoded = Dense(100, activation='tanh', activity_regularizer=regularizers.l1(10e-5))(input_layer)
encoded = Dense(50, activation='tanh', activity_regularizer=regularizers.l1(10e-5))(encoded)
encoded = Dense(25, activation='tanh', activity_regularizer=regularizers.l1(10e-5))(encoded)
encoded = Dense(12, activation='tanh', activity_regularizer=regularizers.l1(10e-5))(encoded)
encoded = Dense(6, activation='relu')(encoded)

# Building the Decoder network
decoded = Dense(12, activation='tanh')(encoded)
decoded = Dense(25, activation='tanh')(decoded)
decoded = Dense(50, activation='tanh')(decoded)
decoded = Dense(100, activation='tanh')(decoded)

# Building the Output Layer
output_layer = Dense(tfidf_word.shape[1], activation='relu')(decoded)
import tensorflow as tf
For training I am using the 'Adam' optimizer and 'BinaryCrossentropy' loss.
# Defining the parameters of the Auto-encoder network
autoencoder = Model(input_layer, output_layer)
autoencoder.compile(optimizer="Adam", loss=tf.keras.losses.BinaryCrossentropy())

# Training the Auto-encoder network
autoencoder.fit(X_bulli_scaled, X_bearish_scaled, batch_size=16, epochs=100, shuffle=True, validation_split=0.20)
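The fit call above returns (and, in Keras, also attaches to the model) a History object. Although the original notebook does not plot it, a quick look at the training and validation loss is a useful convergence check:
# Plot training vs. validation loss recorded during autoencoder training
history = autoencoder.history.history   # History object attached to the model by fit()
plt.plot(history['loss'], label='train loss')
plt.plot(history['val_loss'], label='validation loss')
plt.xlabel('epoch')
plt.ylabel('binary cross-entropy')
plt.legend()
plt.show()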
After training the neural network, we discard the decoder since we are only interested in the encoder and bottleneck layers.
In the code below, autoencoder.layers[0] is the input layer, autoencoder.layers[1] through autoencoder.layers[4] are the encoder layers, and autoencoder.layers[5] is the bottleneck layer. We now build our feature-extraction model from the input, encoder, and bottleneck layers.
hidden_representation = Sequential()
hidden_representation.add(autoencoder.layers[0])
hidden_representation.add(autoencoder.layers[1])
hidden_representation.add(autoencoder.layers[2])
hidden_representation.add(autoencoder.layers[3])
hidden_representation.add(autoencoder.layers[4])
hidden_representation.add(autoencoder.layers[5])   # bottleneck layer
# Separating the points encoded by the Auto-encoder as bulli_hidden_scaled and bearish_hidden_scaled
bulli_hidden_scaled = hidden_representation.predict(X_bulli_scaled)
bearish_hidden_scaled = hidden_representation.predict(X_bearish_scaled)
Let us combine the encoded data into a single table.
encoded_X = np.append(bulli_hidden_scaled, bearish_hidden_scaled, axis=0)
y_bulli = np.zeros(bulli_hidden_scaled.shape[0])       # class 0
y_bearish = np.ones(bearish_hidden_scaled.shape[0])    # class 1
encoded_y = np.append(y_bulli, y_bearish)
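A quick shape check (my addition) confirms that every tweet now has one low-dimensional encoded vector and one 0/1 label:
# Each row of encoded_X is the bottleneck representation of one tweet
print(encoded_X.shape)
print(encoded_y.shape)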
Now we have the encoded data from the autoencoder. This is nothing but feature extraction from the input data using the autoencoder.
We can use these extracted features to train machine learning models.
# splitting the encoded data into train and test
X_train_encoded, X_test_encoded, y_train_encoded, y_test_encoded = train_test_split(encoded_X, encoded_y, test_size=0.2)
lrclf = LogisticRegression()
lrclf.fit(X_train_encoded, y_train_encoded)

# Storing the predictions of the linear model
y_pred_lrclf = lrclf.predict(X_test_encoded)

# Evaluating the performance of the linear model
print('Accuracy : ' + str(accuracy_score(y_test_encoded, y_pred_lrclf)))
# Building the SVM model
svmclf = SVC()
svmclf.fit(X_train_encoded, y_train_encoded)

# Storing the predictions of the non-linear model
y_pred_svmclf = svmclf.predict(X_test_encoded)

# Evaluating the performance of the non-linear model
print('Accuracy : ' + str(accuracy_score(y_test_encoded, y_pred_svmclf)))
from sklearn.ensemble import RandomForestClassifier
# Building the rf model
rfclf = RandomForestClassifier()
rfclf.fit(X_train_encoded, y_train_encoded)

# Storing the predictions of the non-linear model
y_pred_rfclf = rfclf.predict(X_test_encoded)

# Evaluating the performance of the non-linear model
print('Accuracy : ' + str(accuracy_score(y_test_encoded, y_pred_rfclf)))
import xgboost as xgb
# XGBoost classifier
xgb_clf = xgb.XGBClassifier()
xgb_clf.fit(X_train_encoded, y_train_encoded)
y_pred_xgclf = xgb_clf.predict(X_test_encoded)
print('Accuracy : ' + str(accuracy_score(y_test_encoded, y_pred_xgclf)))
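To make the comparison below easier to read, here is a small summary (my addition) that collects the four test accuracies computed above in one place:
# Summarize the test accuracies of the four classifiers trained above
results = {
    'Logistic Regression': accuracy_score(y_test_encoded, y_pred_lrclf),
    'SVM': accuracy_score(y_test_encoded, y_pred_svmclf),
    'Random Forest': accuracy_score(y_test_encoded, y_pred_rfclf),
    'XGBoost': accuracy_score(y_test_encoded, y_pred_xgclf),
}
print(pd.Series(results).sort_values(ascending=False))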
If you observe the above accuracies by model, Random Forest gives the best accuracy on the test data, so we can tune the Random Forest classifier to improve it further.
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestClassifier()

# Random search of parameters, using 3-fold cross validation,
# searching across 25 different combinations, and using all available cores
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid, n_iter=25, cv=3, verbose=2, random_state=42, n_jobs=-1)

# Fit the random search model
rf_random.fit(X_train_encoded, y_train_encoded)
rf_random.best_params_
But these are probably not the best hyperparameters; I used only 25 iterations. We can increase the number of iterations to search for better hyperparameters.
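As a follow-up (not shown in the original notebook), the best model found by the search can be evaluated on the held-out test set and compared against the untuned Random Forest above:
# Evaluate the tuned Random Forest found by RandomizedSearchCV on the test set
best_rf = rf_random.best_estimator_
y_pred_best_rf = best_rf.predict(X_test_encoded)
print('Accuracy : ' + str(accuracy_score(y_test_encoded, y_pred_best_rf)))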