
John Ludhi/nbshare.io: Stock Sentiment Analysis Using Autoencoders


Stock Sentiment Analysis Using Autoencoders

In this notebook, we will use autoencoders to do stock sentiment analysis. An autoencoder consists of an encoder and a decoder: the encoder compresses the data and the decoder decompresses it. Once the autoencoder neural network is trained, the encoder can be reused to generate features for training a different machine learning model.

For stock sentiment analysis, we will first use the encoder for feature extraction and then use these features to train a machine learning model to classify the stock tweets. To learn more about autoencoders, check out the following link...

https://www.nbshare.io/notebook/86916405/Understanding-Autoencoders-With-Examples/
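
For orientation, here is a minimal Keras sketch of the encoder/decoder idea (toy layer sizes chosen only for illustration; the full network used on the tweets is built later in this notebook):

# Minimal autoencoder sketch (toy sizes), assuming TensorFlow/Keras is installed
from keras.layers import Input, Dense
from keras.models import Model

inputs = Input(shape=(100,))                          # original feature vector
encoded = Dense(16, activation='relu')(inputs)        # encoder compresses to 16 dims
decoded = Dense(100, activation='sigmoid')(encoded)   # decoder reconstructs the input

autoencoder = Model(inputs, decoded)   # trained to reproduce its own input
encoder = Model(inputs, encoded)       # reused afterwards for feature extraction
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')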

Stock Tweets Data

Let us import the necessary packages.

In [1]:
# importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
# reading the tweets data
df = pd.read_csv('/content/stocktwits (2).csv')
In [3]:
df.head()
Out[3]:
  ticker                                             message sentiment  followers                created_at
0   atvi  $ATVI brutal selloff here today... really dumb...   Bullish         14  2020-10-02T22:19:36.000Z
1   atvi                         $ATVI $80 around next week!   Bullish         31  2020-10-02T21:50:19.000Z
2   atvi       $ATVI Jefferies says that the delay is a "...   Bullish         83  2020-10-02T21:19:06.000Z
3   atvi  $ATVI I’ve seen this twice before, and both ti...   Bullish          5  2020-10-02T20:48:42.000Z
4   atvi  $ATVI acting like a game has never been pushed...   Bullish          1  2020-10-02T19:14:56.000Z

Let us remove the unnecessary features - ticker, followers and created_at from our dataset.

In [4]:
df = df.drop(['ticker', 'followers', 'created_at'], axis=1)
In [5]:
df.head()
Out[5]:
                                             message sentiment
0  $ATVI brutal selloff here today... really dumb...   Bullish
1                        $ATVI $80 around next week!   Bullish
2      $ATVI Jefferies says that the delay is a "...   Bullish
3  $ATVI I’ve seen this twice before, and both ti...   Bullish
4  $ATVI acting like a game has never been pushed...   Bullish
In [6]:
# class counts
df['sentiment'].value_counts()
Out[6]:
Bullish    26485
Bearish     4887
Name: sentiment, dtype: int64

As the above results show, our dataset is imbalanced: there are far more Bullish tweets than Bearish tweets. We need to balance the data.
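
Before balancing, the imbalance is easy to visualize with the seaborn import from the first cell (a minimal sketch using the column names in this dataframe):

# Bar chart of Bullish vs Bearish tweet counts
sns.countplot(x='sentiment', data=df)
plt.title('Class distribution before balancing')
plt.show()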

In [7]:
# Sentiment encoding: Bullish -> 0, Bearish -> 1
sentiment_map = {'Bullish': 0, 'Bearish': 1}

# Mapping the dictionary onto the 'sentiment' column to create the 'Class' column
df['Class'] = df['sentiment'].map(sentiment_map)
df.head()
Out[7]:
                                             message sentiment  Class
0  $ATVI brutal selloff here today... really dumb...   Bullish      0
1                        $ATVI $80 around next week!   Bullish      0
2      $ATVI Jefferies says that the delay is a "...   Bullish      0
3  $ATVI I’ve seen this twice before, and both ti...   Bullish      0
4  $ATVI acting like a game has never been pushed...   Bullish      0

Let us remove the 'sentiment' feature since we have already encoded it in the 'Class' column.

In [8]:
df = df.drop(['sentiment'], axis=1)

To balance the dataset, in the next few lines of code I take the same number of samples from the Bullish class as we have in the Bearish class.

In [9]:
Bearish = df[df['Class'] == 1]
Bullish = df[df['Class'] == 0].sample(4887)
In [10]:
# combining the down-sampled majority class with the minority class
df = Bullish.append(Bearish).reset_index(drop=True)

Let us check how our dataframe looks now.

In [11]:
df.head()
Out[11]:
                                             message  Class
0  Options  Live Trading with a small Ass account...      0
1                $UPS your crazy if you sold at open      0
2  If $EQIX is at $680, this stock with the bigge...      0
3  $WMT just getting hit on the no stimulus deal....      0
4      $AMZN I'm playing the catalyst stocks with...      0

Let us count both classes to make sure each class has the same number of samples.

In [12]:
# balanced class counts
df['Class'].value_counts()
Out[12]:
1    4887
0    4887
Name: Class, dtype: int64
In [13]:
df.message
Out[13]:
0       Options  Live Trading with a small Ass account...
1                     $UPS your crazy if you sold at open
2       If $EQIX is at $680, this stock with the bigge...
3       $WMT just getting hit on the no stimulus deal....
4       $AMZN I'm playing the catalyst stocks with...
                              ...                        
9769    SmartOptions® Unusual Activity Alert\n(Delayed...
9770                                            $VNO ouch
9771                                             $VNO dog
9772    $ZION I wanted to buy into this but I had an u...
9773    $ZOM Point of Care, rapid tests from $IDXX and...
Name: message, Length: 9774, dtype: object

Stock Tweets Text to Vector Form

Now we need to convert the tweets (text) into vector form.

To convert text into vectors, we first need to clean it. Cleaning means removing special characters and numbers, lowercasing, stemming, etc.

For text preprocessing I am using the NLTK library.

In [14]:
import nltk
nltk.download('stopwords')
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
Out[14]:
True
In [15]:
import re
In [16]:
# I am using PorterStemmer for stemming
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
corpus = []
for i in range(0, len(df)):
    review = re.sub('[^a-zA-Z]', ' ', df['message'][i])  # keep letters only
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if word not in stopwords.words('english')]
    review = ' '.join(review)
    corpus.append(review)

To convert the words into vectors, I am using TF-IDF.

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
In [19]:
# I am using 1- to 3-gram combinations
tfidf = TfidfVectorizer(max_features=10000, ngram_range=(1, 3))
tfidf_word = tfidf.fit_transform(corpus).toarray()
tfidf_class = df['Class']
In [20]:
tfidf_word
Out[20]:
array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.20443663,
        0.        ]])
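
To sanity-check the vectorizer, you can peek at a few of the learned n-grams and the matrix shape (the method name depends on your scikit-learn version; newer releases use get_feature_names_out):

# Shape of the TF-IDF matrix and a few of the learned n-grams
print(tfidf_word.shape)                    # (number of tweets, 10000)
print(tfidf.get_feature_names_out()[:10])  # use tfidf.get_feature_names() on older sklearn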
In [21]:
# importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import seaborn as sns
from keras.layers import Input, Dense
from keras.models import Model, Sequential
from keras import regularizers
In [22]:
tfidf_class
Out[22]:
0       0
1       0
2       0
3       0
4       0
       ..
9769    1
9770    1
9771    1
9772    1
9773    1
Name: Class, Length: 9774, dtype: int64

Scaling the data

To make the data suitable for the auto-encoder, I am using MinMaxScaler.

In [23]:
X_scaled = MinMaxScaler().fit_transform(tfidf_word)
X_bulli_scaled = X_scaled[tfidf_class == 0]
X_bearish_scaled = X_scaled[tfidf_class == 1]
In [25]:
tfidf_word.shape
Out[25]:
(9774, 10000)

Building the Autoencoder neural network

I am using a standard autoencoder network.

For the encoder and decoder layers I am using the 'tanh' activation function.

For the bottleneck and output layers I am using the 'relu' activation.

I am using an L1 activity regularizer in the encoder. To learn more about regularization check here.
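
For reference, the L1 activity regularizer adds a penalty proportional to the absolute value of the layer's activations to the training loss. A small NumPy illustration of that term (dummy activations; this is a conceptual sketch, not Keras internals):

# Dummy activations from one encoder layer for a single batch
activations = np.array([0.3, -0.7, 0.0, 1.2])

# L1 activity penalty with the coefficient used in the encoder layers below
l1_coeff = 10e-5
penalty = l1_coeff * np.sum(np.abs(activations))
print(penalty)  # this term is added to the reconstruction loss during training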

In [26]:
# Building the input layer
input_layer = Input(shape=(tfidf_word.shape[1],))

# Building the encoder network
encoded = Dense(100, activation='tanh', activity_regularizer=regularizers.l1(10e-5))(input_layer)
encoded = Dense(50, activation='tanh', activity_regularizer=regularizers.l1(10e-5))(encoded)
encoded = Dense(25, activation='tanh', activity_regularizer=regularizers.l1(10e-5))(encoded)
encoded = Dense(12, activation='tanh', activity_regularizer=regularizers.l1(10e-5))(encoded)
encoded = Dense(6, activation='relu')(encoded)

# Building the decoder network
decoded = Dense(12, activation='tanh')(encoded)
decoded = Dense(25, activation='tanh')(decoded)
decoded = Dense(50, activation='tanh')(decoded)
decoded = Dense(100, activation='tanh')(decoded)

# Building the output layer
output_layer = Dense(tfidf_word.shape[1], activation='relu')(decoded)

Training Autoencoder

In [27]:
import tensorflow as tf

For training I am using the 'Adam' optimizer and the 'BinaryCrossentropy' loss.

In [ ]:
# Defining the autoencoder network
autoencoder = Model(input_layer, output_layer)
autoencoder.compile(optimizer="Adam", loss=tf.keras.losses.BinaryCrossentropy())

# Training the autoencoder network
autoencoder.fit(X_bulli_scaled, X_bearish_scaled,
                batch_size=16, epochs=100,
                shuffle=True, validation_split=0.20)

After training the neural network, we discard the decoder since we are only interested in the encoder layers.

In the code below, autoencoder.layers[0] is the input layer and autoencoder.layers[1] through autoencoder.layers[4] are the encoder layers. Stacking them gives us a model that maps a tweet vector to its compressed representation.

In [29]:
hidden_representation = Sequential()
hidden_representation.add(autoencoder.layers[0])
hidden_representation.add(autoencoder.layers[1])
hidden_representation.add(autoencoder.layers[2])
hidden_representation.add(autoencoder.layers[3])
hidden_representation.add(autoencoder.layers[4])
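
To confirm what the stacked model produces, you can inspect it (a quick check, assuming the cells above have run):

# The final layer's units give the dimensionality of the extracted features
hidden_representation.summary()
print(hidden_representation.predict(X_bulli_scaled[:5]).shape)  # (5, encoded feature size)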

Encoding Data

In [30]:
# Separating the points encoded by the autoencoder into bulli_hidden_scaled and bearish_hidden_scaled
bulli_hidden_scaled = hidden_representation.predict(X_bulli_scaled)
bearish_hidden_scaled = hidden_representation.predict(X_bearish_scaled)

Let us combine the encoded data into a single table.

In [31]:
encoded_X = np.append(bulli_hidden_scaled, bearish_hidden_scaled, axis=0)
y_bulli = np.zeros(bulli_hidden_scaled.shape[0])      # class 0
y_bearish = np.ones(bearish_hidden_scaled.shape[0])   # class 1
encoded_y = np.append(y_bulli, y_bearish)

Now we have the encoded data from the autoencoder. This is nothing but feature extraction from the input data using the autoencoder.
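
A quick check that the features and labels line up (the second dimension depends on the encoder output size):

# Sanity check on the extracted features and labels
print(encoded_X.shape)  # (number of tweets, encoded feature dimension)
print(encoded_y.shape)  # one label per tweet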

Train Machine Learning Model

We can use these extracted features to train machine learning models.

In [32]:
# splitting the encoded data into train and test sets
X_train_encoded, X_test_encoded, y_train_encoded, y_test_encoded = train_test_split(
    encoded_X, encoded_y, test_size=0.2)

Logistic Regression

In [33]:
lrclf = LogisticRegression()
lrclf.fit(X_train_encoded, y_train_encoded)

# Storing the predictions of the linear model
y_pred_lrclf = lrclf.predict(X_test_encoded)

# Evaluating the performance of the linear model
print('Accuracy : ' + str(accuracy_score(y_test_encoded, y_pred_lrclf)))
Accuracy : 0.620460358056266

SVM

In [34]:
# Building the SVM model
svmclf = SVC()
svmclf.fit(X_train_encoded, y_train_encoded)

# Storing the predictions of the non-linear model
y_pred_svmclf = svmclf.predict(X_test_encoded)

# Evaluating the performance of the non-linear model
print('Accuracy : ' + str(accuracy_score(y_test_encoded, y_pred_svmclf)))
Accuracy : 0.6649616368286445

RandomForest

In [35]:
from sklearn.ensemble import RandomForestClassifier
In [36]:
# Building the random forest model
rfclf = RandomForestClassifier()
rfclf.fit(X_train_encoded, y_train_encoded)

# Storing the predictions of the non-linear model
y_pred_rfclf = rfclf.predict(X_test_encoded)

# Evaluating the performance of the non-linear model
print('Accuracy : ' + str(accuracy_score(y_test_encoded, y_pred_rfclf)))
Accuracy : 0.7631713554987213
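
Accuracy alone can hide class-wise errors; here is a short sketch of a fuller report for the random forest predictions (variable names from the cells above):

from sklearn.metrics import classification_report, confusion_matrix

# Per-class precision/recall and the confusion matrix for the random forest
print(classification_report(y_test_encoded, y_pred_rfclf, target_names=['Bullish', 'Bearish']))
print(confusion_matrix(y_test_encoded, y_pred_rfclf))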

XGBoost Classifier

In [37]:
import xgboost as xgb
In [38]:
# XGBoost classifier
xgb_clf = xgb.XGBClassifier()
xgb_clf.fit(X_train_encoded, y_train_encoded)
y_pred_xgclf = xgb_clf.predict(X_test_encoded)
print('Accuracy : ' + str(accuracy_score(y_test_encoded, y_pred_xgclf)))
Accuracy : 0.7089514066496164

If you compare the accuracies above, the random forest gives the best accuracy on the test data, so we can tune the RandomForestClassifier to try to improve it further.

Hyperparameter Optimization

In [39]:
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
In [ ]:
# Use the random grid to search for the best hyperparameters
# First create the base model to tune
rf = RandomForestClassifier()

# Random search of parameters, using 3-fold cross validation,
# searching across 25 different combinations
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                               n_iter=25, cv=3, verbose=2, random_state=42)

# Fit the random search model
rf_random.fit(X_train_encoded, y_train_encoded)
In [46]:
rf_random.best_params_
Out[46]:
{'bootstrap': True,
 'max_depth': 30,
 'max_features': 'sqrt',
 'min_samples_leaf': 1,
 'min_samples_split': 10,
 'n_estimators': 1000}

These are probably not the best hyperparameters, since I used only 25 iterations. We can increase the number of iterations to search for better ones.
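
Once the search finishes, the tuned model can be evaluated on the held-out encoded test set (a sketch, using the variables from the cells above):

# Evaluate the best estimator found by the randomized search
best_rf = rf_random.best_estimator_
y_pred_best = best_rf.predict(X_test_encoded)
print('Accuracy : ' + str(accuracy_score(y_test_encoded, y_pred_best)))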

