Stock Sentiment Analysis Using Autoencoders

In this notebook, we will use autoencoders to do stock sentiment analysis. Autoencoder consists of encoder and decoder models. Encoders compress the data and decoders decompress it. Once you train an autoencoder neural network, the encoder can be used to train a different machine learning model.

For stock sentiment analysis, we will first use encoder for the feature extraction and then use these features to train a machine learning model to classify the stock tweets. To learn more about Autoencoders check out the following link...

https://www.nbshare.io/notebook/86916405/Understanding-Autoencoders-With-Examples/

Stock Tweets Data

Let us import the necessary packages.

# importing necessary lib importpandasaspdimportnumpyasnpimportmatplotlib.pyplotaspltimportseabornassns

# reading tweets datadf=pd.read_csv('/content/stocktwits (2).csv')

df.head()

Let us remove the unnecessary features - ticker, followers and created_at from our dataset.

df=df.drop(['ticker','followers','created_at'],axis=1)

df.head()

# class countsdf['sentiment'].value_counts()

Bullish    26485
Bearish     4887
Name: sentiment, dtype: int64

If you observe the above results.Our data set is imabalanced. The number of Bullish tweets are way more than the Bearish tweets. We need to balance the data.

# Sentiment encoding # Encoding Bullish with 0 and Bearish with 1 dict={'Bullish':0,'Bearish':1}# Mapping dictionary to Is_Response featuredf['Class']=df['sentiment'].map(dict)df.head()

Let us remove the 'sentiment' feature since we have already encoded it in the 'class' column.

df=df.drop(['sentiment'],axis=1)

To make our dataset balanced, in the next few lines of code, I am taking same number of samples from Bullish class as we have in Bearish class.

Bearish=df[df['Class']==1]Bullish=df[df['Class']==0].sample(4887)

# appending sample records of majority class to minority classdf=Bullish.append(Bearish).reset_index(drop=True)

Let us check how our dataframe looks now.

df.head()

Let us do count of both the classes to make sure count of each class is same.

# balanced class df['Class'].value_counts()

1    4887
0    4887
Name: Class, dtype: int64

df.message

0       Options  Live Trading with a small Ass account...
1                     $UPS your crazy if you sold at open
2       If $EQIX is at $680, this stock with the bigge...
3       $WMT just getting hit on the no stimulus deal....
4       $AMZN I&#39;m playing the catalyst stocks with...
                              ...                        
9769    SmartOptions® Unusual Activity Alert\n(Delayed...
9770                                            $VNO ouch
9771                                             $VNO dog
9772    $ZION I wanted to buy into this but I had an u...
9773    $ZOM Point of Care, rapid tests from $IDXX and...
Name: message, Length: 9774, dtype: object

Stock Tweets Text to Vector Form

Now we need to convert the tweets(text) into vector form.

To convert text into vector form, first we need to clean the text, Cleaning means removing special characters, lowercasing , remvoing numericals,stemming etc

For text preprocessing I am using NLTK lib.

importnltknltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.

True

importre

# I am using porterstemmer for stemming fromnltk.corpusimportstopwordsfromnltk.stem.porterimportPorterStemmerps=PorterStemmer()corpus=[]foriinrange(0,len(df)):review=re.sub('[^a-zA-Z]',' ',df['message'][i])review=review.lower()review=review.split()review=[ps.stem(word)forwordinreviewifnotwordinstopwords.words('english')]review=' '.join(review)corpus.append(review)

To convert words into vector I am using TF-IDF.

fromsklearn.feature_extraction.textimportTfidfVectorizerfromsklearn.feature_extraction.textimportTfidfVectorizer

# I am using 1 to 3 ngram combinationstfidf=TfidfVectorizer(max_features=10000,ngram_range=(1,3))tfidf_word=tfidf.fit_transform(corpus).toarray()tfidf_class=df['Class']

tfidf_word

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.20443663,
        0.        ]])

# importing necessary lib importpandasaspdimportnumpyasnpfromsklearn.model_selectionimporttrain_test_splitfromsklearn.linear_modelimportLogisticRegressionfromsklearn.svmimportSVCfromsklearn.metricsimportaccuracy_scorefromsklearn.preprocessingimportMinMaxScalerfromsklearn.manifoldimportTSNEimportmatplotlib.pyplotaspltimportseabornassnsfromkeras.layersimportInput,Densefromkeras.modelsimportModel,Sequentialfromkerasimportregularizers

tfidf_class

0       0
1       0
2       0
3       0
4       0
       ..
9769    1
9770    1
9771    1
9772    1
9773    1
Name: Class, Length: 9774, dtype: int64

Scaling the data

To make the data suitable for the auto-encoder, I am using MinMaxScaler.

X_scaled=MinMaxScaler().fit_transform(tfidf_word)X_bulli_scaled=X_scaled[tfidf_class==0]X_bearish_scaled=X_scaled[tfidf_class==1]

tfidf_word.shape

(9774, 10000)

Building the Autoencoder neural network

I am using standard auto-encoder network.

For encoder and decoder I am using 'tanh' activation function.

For bottle neck and output layers I am using 'relu' activation.

I am using L1 regularizer in Encoder. To learn more about regularlization check here.

# Building the Input Layerinput_layer=Input(shape=(tfidf_word.shape[1],))# Building the Encoder networkencoded=Dense(100,activation='tanh',activity_regularizer=regularizers.l1(10e-5))(input_layer)encoded=Dense(50,activation='tanh',activity_regularizer=regularizers.l1(10e-5))(encoded)encoded=Dense(25,activation='tanh',activity_regularizer=regularizers.l1(10e-5))(encoded)encoded=Dense(12,activation='tanh',activity_regularizer=regularizers.l1(10e-5))(encoded)encoded=Dense(6,activation='relu')(encoded)# Building the Decoder networkdecoded=Dense(12,activation='tanh')(encoded)decoded=Dense(25,activation='tanh')(decoded)decoded=Dense(50,activation='tanh')(decoded)decoded=Dense(100,activation='tanh')(decoded)# Building the Output Layeroutput_layer=Dense(tfidf_word.shape[1],activation='relu')(decoded)

Training Autoencoder

importtensorflowastf

For training I am using 'Adam' Optimizer and 'BinaryCrossentropy' Loss.

# Defining the parameters of the Auto-encoder networkautoencoder=Model(input_layer,output_layer)autoencoder.compile(optimizer="Adam",loss=tf.keras.losses.BinaryCrossentropy())# Training the Auto-encoder networkautoencoder.fit(X_bulli_scaled,X_bearish_scaled,batch_size=16,epochs=100,shuffle=True,validation_split=0.20)

After training the neural network, we discard the decoder since we are only interested in Encoder and bottle neck layers.

In the below code, autoencoder.layers[0] means first layer which is encoder layer. Similarly autoencoder.layers[4] means bottle neck layer. Now we will create our model with encoder and bottle neck layers.

hidden_representation=Sequential()hidden_representation.add(autoencoder.layers[0])hidden_representation.add(autoencoder.layers[1])hidden_representation.add(autoencoder.layers[2])hidden_representation.add(autoencoder.layers[3])hidden_representation.add(autoencoder.layers[4])

Encoding Data

# Separating the points encoded by the Auto-encoder as bulli_hidden_scaled and bearish_hidden_scaledbulli_hidden_scaled=hidden_representation.predict(X_bulli_scaled)bearish_hidden_scaled=hidden_representation.predict(X_bearish_scaled)

Let us combine the encoded data in to a single table.

encoded_X=np.append(bulli_hidden_scaled,bearish_hidden_scaled,axis=0)y_bulli=np.zeros(bulli_hidden_scaled.shape[0])# class 0y_bearish=np.ones(bearish_hidden_scaled.shape[0])# class 1encoded_y=np.append(y_bulli,y_bearish)

Now we have encoded data from auto encoder. This is nothing but feature extraction from input data using auto encoder.

Train Machine Learning Model

We can use these extracted features to train machine learning models.

# splitting the encoded data into train and test X_train_encoded,X_test_encoded,y_train_encoded,y_test_encoded=train_test_split(encoded_X,encoded_y,test_size=0.2)

Logistic Regreession

lrclf=LogisticRegression()lrclf.fit(X_train_encoded,y_train_encoded)# Storing the predictions of the linear modely_pred_lrclf=lrclf.predict(X_test_encoded)# Evaluating the performance of the linear modelprint('Accuracy : '+str(accuracy_score(y_test_encoded,y_pred_lrclf)))

Accuracy : 0.620460358056266

SVM

# Building the SVM modelsvmclf=SVC()svmclf.fit(X_train_encoded,y_train_encoded)# Storing the predictions of the non-linear modely_pred_svmclf=svmclf.predict(X_test_encoded)# Evaluating the performance of the non-linear modelprint('Accuracy : '+str(accuracy_score(y_test_encoded,y_pred_svmclf)))

Accuracy : 0.6649616368286445

RandomForest

fromsklearn.ensembleimportRandomForestClassifier

# Building the rf modelrfclf=RandomForestClassifier()rfclf.fit(X_train_encoded,y_train_encoded)# Storing the predictions of the non-linear modely_pred_rfclf=rfclf.predict(X_test_encoded)# Evaluating the performance of the non-linear modelprint('Accuracy : '+str(accuracy_score(y_test_encoded,y_pred_rfclf)))

Accuracy : 0.7631713554987213

Xgbosst Classifier

importxgboostasxgb

#xgbosst classifier xgb_clf=xgb.XGBClassifier()xgb_clf.fit(X_train_encoded,y_train_encoded)y_pred_xgclf=xgb_clf.predict(X_test_encoded)print('Accuracy : '+str(accuracy_score(y_test_encoded,y_pred_xgclf)))

Accuracy : 0.7089514066496164

If you observe the above accuracy's by model. Randomforest is giving good accuracy on test data. So we can tune the RFclassifier to get better accuracy.

Hyperparamter Optimization

fromsklearn.model_selectionimportRandomizedSearchCV# Number of trees in random forestn_estimators=[int(x)forxinnp.linspace(start=200,stop=2000,num=10)]# Number of features to consider at every splitmax_features=['auto','sqrt']# Maximum number of levels in treemax_depth=[int(x)forxinnp.linspace(10,110,num=11)]max_depth.append(None)# Minimum number of samples required to split a nodemin_samples_split=[2,5,10]# Minimum number of samples required at each leaf nodemin_samples_leaf=[1,2,4]# Method of selecting samples for training each treebootstrap=[True,False]# Create the random gridrandom_grid={'n_estimators':n_estimators,'max_features':max_features,'max_depth':max_depth,'min_samples_split':min_samples_split,'min_samples_leaf':min_samples_leaf,'bootstrap':bootstrap}

# Use the random grid to search for best hyperparameters# First create the base model to tunerf=RandomForestClassifier()# Random search of parameters, using 3 fold cross validation, # search across 100 different combinations, and use all available coresrf_random=RandomizedSearchCV(estimator=rf,param_distributions=random_grid,n_iter=25,cv=3,verbose=2,random_state=42)# Fit the random search modelrf_random.fit(X_train_encoded,y_train_encoded)

rf_random.best_params_

{'bootstrap': True,
 'max_depth': 30,
 'max_features': 'sqrt',
 'min_samples_leaf': 1,
 'min_samples_split': 10,
 'n_estimators': 1000}

But these are probably not the best hyperparameters, I used only 25 iterations. We can increase the iterations further to find the best hyperparameters.

	ticker	message	sentiment	followers	created_at
0	atvi	$ATVI brutal selloff here today... really dumb...	Bullish	14	2020-10-02T22:19:36.000Z
1	atvi	$ATVI $80 around next week!	Bullish	31	2020-10-02T21:50:19.000Z
2	atvi	$ATVI Jefferies says that the delay is a &quot...	Bullish	83	2020-10-02T21:19:06.000Z
3	atvi	$ATVI I’ve seen this twice before, and both ti...	Bullish	5	2020-10-02T20:48:42.000Z
4	atvi	$ATVI acting like a game has never been pushed...	Bullish	1	2020-10-02T19:14:56.000Z

John Ludhi/nbshare.io: Stock Sentiment Analysis Using Autoencoders

Stock Sentiment Analysis Using Autoencoders

Stock Tweets Data

Stock Tweets Text to Vector Form

Scaling the data

Building the Autoencoder neural network

Training Autoencoder

Encoding Data

Train Machine Learning Model

Logistic Regreession

SVM

RandomForest

Xgbosst Classifier

Hyperparamter Optimization

Trending Articles

Avril Lavigne – Let Go (20th Anniversary Edition) [iTunes Plus M4A]

Practice Sheet of Right form of verbs for HSC Students

GTA 5 PPSSPP Zip File Download For Android Mediafire 382 MB

A/L Technology Stream – Subject combinations, Syllabuses and Teacher guides

[NSW] UnMetal [RUS/ENG][NSP] [562MB]

School playground abuse and assault convictions against solicitor...

Re: Error UA_400_EB000U0410

3 Extremely pleasurable sex positions for slim women

Black Angus Grilled Artichokes

23-11-2015 – Priyamana Thozhi

Stalker hid in bushes leaving his ex 'terrified'

Download New Album: Wizkid – Morayo (Full Album)

Thread: Unmatched: The Witcher – Steel and Silver:: Rules:: Ciri Ongoin...

Throw Back: Samini — Where My Baby Dey (Prod by Kaywa)

Nalgonda District Police Office Mobile Numbers List in Telangana State

An Outfit Farewell: Chicago Mobster, Cicero Crew Button Man Paul Spano Passes...

Moondru Mudichu 20-07-2016 – Polimer tv Serial

Roland VS SOUND Canvas VA v1.1.1 READ NFO-R2R

Rema & Selena Gomez – Calm Down – Single [iTunes Plus M4A]

Shatta Wale – Allo ft. Kwaw Kese (Prod. by Willis Beatz)