John Ludhi/nbshare.io: Tweet Sentiment Analysis Using LSTM With PyTorch


Tweet Sentiment Analysis Using LSTM With PyTorch

We will go through a common case study (sentiment analysis) to explore many techniques and patterns in Natural Language Processing.

Overview:

  • Imports and Data Loading
  • Data Preprocessing
    • Null Value Removal
    • Class Balance
  • Tokenization
  • Embeddings
  • LSTM Model Building
  • Setup and Training
  • Evaluation

Imports and Data Loading

In [81]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import numpy as np
import pandas as pd
import re
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import nltk
from nltk.tokenize import word_tokenize
import matplotlib.pyplot as plt
In [4]:
nltk.download('punkt')
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
Out[4]:
True

This dataset can be found on Github in this repo: https://github.com/ajayshewale/Sentiment-Analysis-of-Text-Data-Tweets-

It is a sentiment analysis dataset comprised of 2 files:

  • train.csv, 5971 tweets
  • test.csv, 4000 tweets

The tweets are labeled as:

  • Positive
  • Neutral
  • Negative

Other datasets have different or more labels, but the same concepts apply to preprocessing and training. Download the files and store them locally.

In [7]:
train_path = "train.csv"
test_path = "test.csv"

Before working with PyTorch, make sure to set the device. This line of code selects a GPU if available.

In [8]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device
Out[8]:
device(type='cuda')

Since the data is stored in csv files, we can use the pandas function .read_csv() to parse both train and test files:

In [9]:
train_df=pd.read_csv(train_path)
In [10]:
test_df=pd.read_csv(test_path)

Data Preprocessing

Null Value Removal

After parsing the files, it is important to analyze the text to understand the preprocessing steps you will take.

In [11]:
train_df
Out[11]:
      Id                  Category  Tweet
0     635769805279248384  negative  Not Available
1     635930169241374720  neutral   IOS 9 App Transport Security. Mm need to check...
2     635950258682523648  neutral   Mar if you have an iOS device, you should down...
3     636030803433009153  negative  @jimmie_vanagon my phone does not run on lates...
4     636100906224848896  positive  Not sure how to start your publication on iOS?...
...   ...                 ...       ...
5965  639016598477651968  neutral   @YouAreMyArsenal Wouldn't surprise me if we en...
5966  640276909633486849  neutral   Rib injury for Zlatan against Russia is a big ...
5967  640296841725235200  neutral   Noooooo! I was hoping to see Zlatan being Zlat...
5968  641017384908779520  neutral   Not Available
5969  641395811474128896  neutral   Not Available

5970 rows × 3 columns

Preprocessing means cleaning the data of inconsistent, useless, or noisy information. So, we first look for things to remove.

  • We can see a few tweets that are "Not Available", and they will not help train our model.
  • Also, the column "Id" is not useful in machine learning, since the ID of a tweet does not affect its sentiment.
  • We may not see any in the sample displayed, but there may be null values (NaN) in the columns. Pandas has a function .dropna() that drops null values.
In [12]:
train_df = train_df.drop(columns=["Id"])
train_df = train_df.dropna()
train_df = train_df[train_df['Tweet'] != "Not Available"]
train_df
Out[12]:
      Category  Tweet
1     neutral   IOS 9 App Transport Security. Mm need to check...
2     neutral   Mar if you have an iOS device, you should down...
3     negative  @jimmie_vanagon my phone does not run on lates...
4     positive  Not sure how to start your publication on iOS?...
5     neutral   Two Dollar Tuesday is here with Forklift 2, Qu...
...   ...       ...
5963  positive  Ok ed let's do this, Zlatan, greizmann and Lap...
5964  neutral   Goal level: Zlatan 90k by Friday? = Posting e...
5965  neutral   @YouAreMyArsenal Wouldn't surprise me if we en...
5966  neutral   Rib injury for Zlatan against Russia is a big ...
5967  neutral   Noooooo! I was hoping to see Zlatan being Zlat...

5422 rows × 2 columns

So far so good, let us take a look at the test set:

In [13]:
test_df
Out[13]:
      Id            Category
0     6.289494e+17  dear @Microsoft the newOoffice for Mac is grea...
1     6.289766e+17  @Microsoft how about you make a system that do...
2     6.290232e+17  Not Available
3     6.291792e+17  Not Available
4     6.291863e+17  If I make a game as a #windows10 Universal App...
...   ...           ...
9963  NaN           NaN
9964  NaN           NaN
9965  NaN           NaN
9966  NaN           NaN
9967  NaN           NaN

9968 rows × 2 columns

It turns out that the test set unfortunately has no Category column. Thus, it will not be very useful for us. However, we can do some preprocessing on it for practice:

  • The tweets column is wrongly named "Category", so we rename it:
In [14]:
test_df = test_df.rename(columns={"Category": "Tweet"})

Then, we apply the same steps as we did on the train set.

In [15]:
test_df = test_df.drop(columns=["Id"])
test_df = test_df.dropna()
test_df = test_df[test_df['Tweet'] != "Not Available"]
test_df
Out[15]:
      Tweet
0     dear @Microsoft the newOoffice for Mac is grea...
1     @Microsoft how about you make a system that do...
4     If I make a game as a #windows10 Universal App...
5     Microsoft, I may not prefer your gaming branch...
6     @MikeWolf1980 @Microsoft I will be downgrading...
...   ...
3994  Anybody with a Steak & Shake or IHOP move ...
3995  I am assembling an epic Pancake Posse for an I...
3996  do you work at Ihop tomorrow @carlysunshine_
3997  23 Aug 00;30 #771NAS Rescue193 returned from T...
3999  IOS 9 App Transport Security. Mm need to check...

3640 rows × 1 columns

Class Imbalance

Next, since this is a classification task, we must make sure that the classes are balanced in terms of number of instances. Otherwise, any model we train will be skewed and less accurate.

First, we find the counts of each class:

In [16]:
train_df['Category'].value_counts()
Out[16]:
positive    2599
neutral     1953
negative     869
Tweet          1
Name: Category, dtype: int64

Ideally, supervised datasets have balanced classes. However, as seen in this dataset, the positive and neutral tweets far outnumber the negative tweets. There are several ways to fix an imbalance problem:

  • Oversampling
  • Undersampling
  • Hybrid approaches
  • Augmentation

Oversampling

To re-adjust the class balance, in oversampling you duplicate some tweets in the minority classes until you have a similar number of tweets for each class. For example, we would duplicate the negative set roughly 3 times to reach about 2600 negative tweets, and do the same for the neutral tweets. By doing so, all classes end up with around 2600 tweets.
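For illustration, here is a minimal oversampling sketch with pandas (an illustrative example, not part of the original notebook): each class is sampled with replacement up to the size of the largest class.

def oversample(df, label_col="Category", random_state=0):
    max_count = df[label_col].value_counts().max()
    parts = []
    for label, group in df.groupby(label_col):
        # Sampling with replacement duplicates rows of smaller classes up to max_count
        parts.append(group.sample(max_count, replace=True, random_state=random_state))
    # Shuffle the combined frame so classes are not grouped together
    return pd.concat(parts).sample(frac=1, random_state=random_state)

# Hypothetical usage: oversampled_df = oversample(train_df)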

Undersampling

In undersampling, instead of increasing the number of tweets in the minority classes, you decrease the number of tweets in the majority classes. You do so simply by deleting tweets in the majority classes randomly until you have 869 tweets in all classes.

Hybrid Approaches

Both oversampling and undersampling can be a bit extreme. One can do a mixture of both by determining a final number of tweets that is between the minimum and the maximum. For instance, we can select 2000 as the final tweet count. Then, we delete ~600 positive tweets, keep neutral tweets the same, and duplicate the negative tweets by a factor of ~2.3. This way we end up with ~2000 tweets in each class.
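A sketch of the hybrid idea (again an illustrative assumption, not the approach used below): pick a target count and resample every class to it, undersampling classes above the target and oversampling those below it.

def resample_to_target(df, target=2000, label_col="Category", random_state=0):
    parts = []
    for label, group in df.groupby(label_col):
        # Classes larger than the target are undersampled, smaller ones oversampled
        replace = len(group) < target
        parts.append(group.sample(target, replace=replace, random_state=random_state))
    return pd.concat(parts)

# Hypothetical usage: hybrid_df = resample_to_target(train_df, target=2000)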

Augmentation

Augmentation is more complex than the other approaches. In augmentation, you use the existing minority-class tweets to generate new, synthetic ones. By doing so, you can increase the number of negative and neutral tweets until all classes reach 2600.

It is a relatively new concept, but you can find more about it in the papers listed here: https://paperswithcode.com/task/text-augmentation/codeless

For our purpose, we undersample the positive and neutral classes until we have 869 tweets in each class. We do the undersampling manually in this exercise, but there is a Python library called imblearn that can perform under/oversampling, as sketched below.
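For reference, this is roughly how imblearn's RandomUnderSampler could do the same job (a sketch, assuming imblearn is installed; we stick with the manual approach below):

from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=0)
# The sampler expects a 2-D feature array and a 1-D label array
X = train_df[["Tweet"]].values
y = train_df["Category"].values
X_res, y_res = rus.fit_resample(X, y)
# Each class in y_res now has as many samples as the smallest class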

In [17]:
remove_pos = 2599 - 869
remove_neut = 1953 - 869

neg_df = train_df[train_df["Category"] == "negative"]
pos_df = train_df[train_df["Category"] == "positive"]
neut_df = train_df[train_df["Category"] == "neutral"]

pos_drop_indices = np.random.choice(pos_df.index, remove_pos, replace=False)
neut_drop_indices = np.random.choice(neut_df.index, remove_neut, replace=False)

pos_undersampled = pos_df.drop(pos_drop_indices)
neut_undersampled = neut_df.drop(neut_drop_indices)
In [18]:
pos_undersampled
Out[18]:
      Category  Tweet
10    positive  Parkrun app for iOS downloaded Where have you ...
16    positive  Five Great Free Apps and Games for iOS - Augus...
18    positive  See news through the eyes of real people &...
19    positive  Siri knows all about #Apple's iOS event on the...
22    positive  @Yurt try beat mp3 it may be on android i have...
...   ...       ...
5924  positive  Zlatan Ibrahimovich. @zlatan_ibra9 Gracious Le...
5932  positive  Scenes when Benzema walks out of tunnel tomorr...
5939  positive  7 more days till we start the campaign that wi...
5940  positive  The VP of France's refereeing union Laurent Ug...
5947  positive  @DaveEllis11 @klavierstuk but if Zlatan is ava...

869 rows × 2 columns

After undersampling both neutral and positive classes, we join them all together again:

In [19]:
balanced_train_df = pd.concat([neg_df, pos_undersampled, neut_undersampled])
In [20]:
balanced_train_df["Category"].value_counts()
Out[20]:
neutral     869
negative    869
positive    869
Name: Category, dtype: int64

As shown, the value counts have been adjusted.

Moving forward, since we do not have a labeled test set, we split the train set into train and test sets with an 85:15 ratio.

In [21]:
train_clean_df, test_clean_df = train_test_split(balanced_train_df, test_size=0.15)
In [22]:
train_clean_df
Out[22]:
      Category  Tweet
2818  positive  Early release bc Obama will be at the College ...
1505  neutral   April 17, 1986 Madonna at the At Close Range p...
620   negative  "Joe Biden may join Bernie Sanders in the Demo...
3367  positive  @LaurenceWHolmes What do you mean, Laurence? T...
19    positive  Siri knows all about #Apple's iOS event on the...
...   ...       ...
1738  positive  Warm up those vocals, Castro! @KAMELLE is lead...
990   positive  Best Jerseys this season (not in order, can't ...
4391  neutral   "I've never been shy or secretive about the fa...
4753  neutral   Not for nothing is their motto TGIF - 'Thank G...
1838  positive  MAGICAL MARCH - With 48 goals in 42 official m...

2215 rows × 2 columns

In [23]:
test_clean_df
Out[23]:
      Category  Tweet
705   positive  Khakis and Jurassic Park shirt for tomorrow. ...
1482  neutral   May our old mini van and Lexus rest in peace. ...
5307  negative  There's a simple solution, just deport all the...
3377  negative  Rick Perry was going to go on Are You Smarter ...
3932  positive  Snoop Dogg was one of the stars to support Ma...
...   ...       ...
4972  neutral   Tristram 'more Tory than the Tories' Hunt seem...
2859  negative  Mark Levin Market Crash: It's Not China-It's B...
3536  negative  Someone may want to let Sarah Palin know that ...
2367  negative  The LAST thing we need is more corn and more M...
5099  neutral   Hahaha dead. Trump talks about the real issues...

392 rows × 2 columns

Since the dataset is small, we can transfer it into Python lists for further manipulation. If the data were large, it would be preferable to keep using pandas until you create the batch iterator (DataLoader in PyTorch).

In [24]:
train_set = list(train_clean_df.to_records(index=False))
test_set = list(test_clean_df.to_records(index=False))
In [25]:
train_set[:10]
Out[25]:
[('positive', 'Early release bc Obama will be at the College across the street from my high school tomorrow. Nice.'),
 ('neutral', 'April 17, 1986 Madonna at the At Close Range premiere http://t.co/Lw4T3AplZF'),
 ('negative', '"Joe Biden may join Bernie Sanders in the Democrat primary... I thought the Democrats were opposed to fossil fools!" ~ Emily Zanotti,'),
 ('positive', '@LaurenceWHolmes What do you mean, Laurence? The Dudleys, Ric Flair, and Sting were on Raw Monday. Taker wrestled Sunday. It IS the 90s.'),
 ('positive', "Siri knows all about #Apple's iOS event on the 9th. #GiveUsAHint http://t.co/sHmTw46ELR"),
 ('negative', ".@SenTedCruz @realDonaldTrump @SenTomCotton   We don't want Obama dumping them in the USA!   https://t.co/obxcmVydfh"),
 ('neutral', 'YouTube Gaming Launches Tomorrow with iOS and Android Apps to Go Head-to-Head with Twitch http://t.co/yZOATToeJC #ios #game'),
 ('neutral', "@Omsondafivenine @Footy_Jokes this is the truth my friend while messi might win the 5th ballon d or people would say Ronaldo didn't win it"),
 ('neutral', "Michelle Obama's waiting in the Master Bedroom Chelsea Clinton's waiting in the Lincoln Bedroom WHICH ROOM 1st @Sadieisonfire @REALFITFINLAY"),
 ('positive', 'The very best thing about Monday Night Raw was the Nintendo #MarioMaker commericial. We still want the games @WWE @2K @WWENetwork. #WiiU')]

We can observe that some tweets end with links. Moreover, many tweets contain Twitter mentions (@someone). Neither is useful for determining the sentiment of a tweet, so it is better to remove them before proceeding:

In [26]:
def remove_links_mentions(tweet):
    link_re_pattern = r"https?:\/\/t.co/[\w]+"
    mention_re_pattern = r"@\w+"
    tweet = re.sub(link_re_pattern, "", tweet)
    tweet = re.sub(mention_re_pattern, "", tweet)
    return tweet.lower()
In [27]:
remove_links_mentions('...and Jeb Bush is third in the polls and losing donors. Be fair and balance...@karlrove @FoxNews. https://t.co/Ka2km3bua6')
Out[27]:
'...and jeb bush is third in the polls and losing donors. be fair and balance... . '

As shown, regex removes such strings easily. Finally, notice that we lowercased all tweets in the function. The simple reason is that to a computer, case differences matter: the words "word" and "Word" are as different as any other pair of words, although to us they are the same. To improve training, it is better to lowercase all words.

Tokenization

Finally, using word_tokenize() from the NLTK library, we can split each sentence into tokens: words, punctuation marks, and other indivisible language units.

In [28]:
train_set = [(label, word_tokenize(remove_links_mentions(tweet))) for label, tweet in train_set]
train_set[:3]
Out[28]:
[('positive',
  ['early',
   'release',
   'bc',
   'obama',
   'will',
   'be',
   'at',
   'the',
   'college',
   'across',
   'the',
   'street',
   'from',
   'my',
   'high',
   'school',
   'tomorrow',
   '.',
   'nice',
   '.']),
 ('neutral',
  ['april',
   '17',
   ',',
   '1986',
   'madonna',
   'at',
   'the',
   'at',
   'close',
   'range',
   'premiere']),
 ('negative',
  ['``',
   'joe',
   'biden',
   'may',
   'join',
   'bernie',
   'sanders',
   'in',
   'the',
   'democrat',
   'primary',
   '...',
   'i',
   'thought',
   'the',
   'democrats',
   'were',
   'opposed',
   'to',
   'fossil',
   'fools',
   '!',
   "''",
   '~',
   'emily',
   'zanotti',
   ','])]
In [29]:
test_set = [(label, word_tokenize(remove_links_mentions(tweet))) for label, tweet in test_set]
test_set[:3]
Out[29]:
[('positive',
  ['khakis',
   'and',
   'jurassic',
   'park',
   'shirt',
   'for',
   'tomorrow',
   '.',
   'i',
   "'m",
   'gon',
   'na',
   'look',
   'hot',
   'on',
   'the',
   'first',
   'day',
   'of',
   'school',
   '.',
   'literally',
   '...',
   'we',
   "'re",
   'experiencing',
   'a',
   'heat',
   'wave',
   '.']),
 ('neutral',
  ['may',
   'our',
   'old',
   'mini',
   'van',
   'and',
   'lexus',
   'rest',
   'in',
   'peace',
   '.',
   'and',
   'hello',
   'brand',
   'new',
   'cars',
   ':',
   'd',
   'still',
   'miss',
   'the',
   'lexus',
   'a',
   'lot',
   'though',
   ':',
   "'",
   '(']),
 ('negative',
  ['there',
   "'s",
   'a',
   'simple',
   'solution',
   ',',
   'just',
   'deport',
   'all',
   'the',
   'far',
   'right',
   'wing',
   'tory',
   '&',
   'amp',
   ';',
   'ukip',
   'voting',
   'cocksuckers',
   '!'])]

Next, we create the "vocabulary" of the corpus. In NLP projects, the vocabulary is just a mapping of each word to a unique ID. Since models cannot process text the way we do, we must convert the words into numerical form.

By creating this mapping, one can write a sentence with numbers. For instance, if the vocab is as follows:

{"i": 0,
 "the: 1,
 "ate": 2,
 "pizza": 3
}

We can say "I ate the pizza" by saying [0, 2, 1, 3].

This is an oversimplified explanation of encoding, but the general idea is the same.
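A tiny illustration of this idea in Python (a toy vocabulary, separate from the pipeline below):

vocab = {"i": 0, "the": 1, "ate": 2, "pizza": 3}

sentence = "i ate the pizza"
encoded = [vocab[word] for word in sentence.split()]
print(encoded)  # [0, 2, 1, 3]

# Decoding simply reverses the mapping
id2word = {idx: word for word, idx in vocab.items()}
print(" ".join(id2word[i] for i in encoded))  # i ate the pizza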

In this exercise, we create a list of unique words (set-like) and use that list and its indices to build the mapping dictionary.

As shown below, the list starts with the 3 tokens "<PAD>", "<SOS>", and "<EOS>".

Since we will input fixed-size text to the model, we will have to pad some tweets to increase their length. The token for padding is <PAD>.

<SOS> and <EOS> are short for "start of sentence" and "end of sentence" respectively. They are used to mark the beginning and end of each sentence in order to train the model. As will be shown, they are inserted at the beginning and end of every tweet.

In [30]:
index2word = ["<PAD>", "<SOS>", "<EOS>"]

for ds in [train_set, test_set]:
    for label, tweet in ds:
        for token in tweet:
            if token not in index2word:
                index2word.append(token)
In [31]:
index2word[10]
Out[31]:
'the'
In [32]:
word2index = {token: idx for idx, token in enumerate(index2word)}
In [33]:
word2index["the"]
Out[33]:
10

As shown, index2word and word2index act as our vocabulary which can be used to encode all tweets.

In [34]:
def label_map(label):
    if label == "negative":
        return 0
    elif label == "neutral":
        return 1
    else:  # positive
        return 2

Also, we cannot leave the labels in text form, so we encode them as 0, 1, and 2 for negative, neutral, and positive respectively.

To pad, we must select a sequence length. This length should cover the majority of tweets. Typically, length measurements are performed to find the ideal sequence length (a quick check is sketched below), but since our data consists of tweets from 2012, we know they cannot be too long, so we can set the length to 32 tokens.
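As a quick check (a sketch, not part of the original notebook), we can look at the distribution of token counts in the tokenized train_set built above:

lengths = [len(tweet) for _, tweet in train_set]
print("max length:", max(lengths))
print("mean length:", np.mean(lengths))
print("95th percentile:", np.percentile(lengths, 95))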

In [35]:
seq_length=32

Then, we perform padding and truncation: padding when a tweet is shorter than 32 tokens, truncation when it is longer. In the same encoding step, we also insert the PAD, SOS, and EOS tokens.

In [36]:
def encode_and_pad(tweet, length):
    sos = [word2index["<SOS>"]]
    eos = [word2index["<EOS>"]]
    pad = [word2index["<PAD>"]]

    if len(tweet) < length - 2:  # -2 for SOS and EOS
        n_pads = length - 2 - len(tweet)
        encoded = [word2index[w] for w in tweet]
        return sos + encoded + eos + pad * n_pads
    else:  # tweet is longer than possible; truncating
        encoded = [word2index[w] for w in tweet]
        truncated = encoded[:length - 2]
        return sos + truncated + eos

Encoding both train and test sets:

In [37]:
train_encoded = [(encode_and_pad(tweet, seq_length), label_map(label)) for label, tweet in train_set]
In [38]:
test_encoded = [(encode_and_pad(tweet, seq_length), label_map(label)) for label, tweet in test_set]

This is what 3 tweets look like after encoding:

In [39]:
for i in train_encoded[:3]:
    print(i)
([1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 10, 13, 14, 15, 16, 17, 18, 19, 20, 19, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 2)
([1, 21, 22, 23, 24, 25, 9, 10, 9, 26, 27, 28, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 1)
([1, 29, 30, 31, 32, 33, 34, 35, 36, 10, 37, 38, 39, 40, 41, 10, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 23, 2, 0, 0, 0], 0)

Notice that the sequences always begin with 1, which stands for SOS, and end with 2, which is EOS. If a tweet is shorter than 32 tokens, it is padded with 0's, the ID of the PAD token. Notice also that the labels are now numerical.
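As a sanity check (a small sketch, not part of the original notebook), we can decode an encoded tweet back into tokens using index2word:

encoded_tweet, label = train_encoded[0]
decoded = [index2word[idx] for idx in encoded_tweet]
print(decoded)  # ['<SOS>', 'early', 'release', ..., '<EOS>', '<PAD>', '<PAD>', ...]
print(label)    # 2, i.e. positive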

Now, the data is preprocessed and encoded. It is time to create our PyTorch Datasets and DataLoaders:

In [40]:
batch_size = 50

train_x = np.array([tweet for tweet, label in train_encoded])
train_y = np.array([label for tweet, label in train_encoded])
test_x = np.array([tweet for tweet, label in test_encoded])
test_y = np.array([label for tweet, label in test_encoded])

train_ds = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
test_ds = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))

train_dl = DataLoader(train_ds, shuffle=True, batch_size=batch_size, drop_last=True)
test_dl = DataLoader(test_ds, shuffle=True, batch_size=batch_size, drop_last=True)

Notice the parameter drop_last=True. This handles the case where the final batch has fewer than 50 elements. Such an incomplete batch would cause dimension errors when fed into the model, because the hidden state is sized for full batches. By setting this parameter to True, we simply drop that final batch.

PyTorch LSTM Model Building

Building LSTMs is very simple in PyTorch. As with a simple feed-forward neural network, we extend nn.Module, create the layers in the initializer, and write a forward() method.

In the initialization, we create an embeddings layer first.

Embeddings are used for improving the representation of the text. This Wikipedia article explains embeddings well: https://en.wikipedia.org/wiki/Word_embedding#:~:text=In%20natural%20language%20processing%20.

In short, instead of feeding sentences as simple encoded sequences (for example [0, 1, 2], etc. as seen in the pizza example), we can improve the representation of every token.

Word embeddings are vectors that represent each word, instead of the single number used in the pizza example.

Why does a vector help? Vectors allow you to highlight the similarities between words. For instance, we can give the words "food" and "pizza" similar vectors since the 2 words are related. This makes it easier for the model to "understand" the text.

In PyTorch, the embedding is a simple layer that we only need to feed our data into. The vectors are initialized randomly for every word and then adjusted during training, which means the embeddings are trainable parameters of this network.
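To make this concrete, here is a small standalone sketch (toy sizes, randomly initialized weights, not part of the tutorial's pipeline) showing that nn.Embedding is just a trainable lookup table from token IDs to vectors:

import torch
import torch.nn as nn
import torch.nn.functional as F

emb = nn.Embedding(num_embeddings=10, embedding_dim=4)  # 10-word vocab, 4-dimensional vectors

token_ids = torch.tensor([[1, 5, 3]])  # a "sentence" of three token IDs
vectors = emb(token_ids)
print(vectors.shape)  # torch.Size([1, 3, 4])

# Similarity between two word vectors can be measured with cosine similarity.
# At initialization this is random; it only becomes meaningful after training.
print(F.cosine_similarity(emb.weight[1], emb.weight[5], dim=0).item())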

An alternative to random initialization is to use pre-trained vectors. Big AI labs at Google, Facebook, and Stanford have created pre-trained embeddings that you can simply download and use; they are called word2vec, fastText, and GloVe respectively.

This is a good example of how to use pre-trained embeddings such as word2vec in the Embedding layer of PyTorch: https://medium.com/@martinpella/how-to-use-pre-trained-word-embeddings-in-pytorch-71ca59249f76
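For completeness, here is a minimal sketch of how pre-trained vectors could be plugged into PyTorch's Embedding layer. The weight matrix below is a random placeholder standing in for real GloVe/word2vec vectors aligned with index2word; loading the actual vectors is covered in the linked article.

# Hypothetical: row i of pretrained_weights would hold the pre-trained vector for index2word[i]
pretrained_weights = np.random.rand(len(index2word), 100).astype("float32")

embedding = nn.Embedding.from_pretrained(
    torch.from_numpy(pretrained_weights),
    freeze=False,   # set to True to keep the vectors fixed during training
    padding_idx=0,
)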

In [41]:
class BiLSTM_SentimentAnalysis(torch.nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, dropout):
        super().__init__()

        # The embedding layer takes the vocab size and the embedding size as input
        # The embedding size is up to you to decide, but common sizes are between 50 and 100.
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)

        # The LSTM layer takes in the embedding size and the hidden vector size.
        # The hidden dimension is up to you to decide, but common values are 32, 64, 128
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)

        # We use dropout before the final layer to improve with regularization
        self.dropout = nn.Dropout(dropout)

        # The fully-connected layer takes in the hidden dim of the LSTM and
        # outputs a 3x1 vector of the class scores.
        self.fc = nn.Linear(hidden_dim, 3)

    def forward(self, x, hidden):
        """
        The forward method takes in the input and the previous hidden state
        """
        # The input is transformed to embeddings by passing it to the embedding layer
        embs = self.embedding(x)

        # The embedded inputs are fed to the LSTM alongside the previous hidden state
        out, hidden = self.lstm(embs, hidden)

        # Dropout is applied to the output and fed to the FC layer
        out = self.dropout(out)
        out = self.fc(out)

        # We extract the scores for the final time step since it is the one that matters
        out = out[:, -1]
        return out, hidden

    def init_hidden(self):
        return (torch.zeros(1, batch_size, 32), torch.zeros(1, batch_size, 32))

Finally, as seen, we have an init_hidden() method. The reason we need this method is that at the beginning of the sequence, there are no hidden states.

The LSTM takes in initial hidden states of zeros at the first time-step, so we initialize them using this method.

Now, we initialize the model and move it to device as follows:

Setup and Training

In [113]:
model = BiLSTM_SentimentAnalysis(len(word2index), 64, 32, 0.2)
model = model.to(device)

Next, we create the criterion and optimizer used for training:

In [114]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

Then we train the model for 50 epochs:

In [115]:
epochs = 50
losses = []
for e in range(epochs):
    h0, c0 = model.init_hidden()

    h0 = h0.to(device)
    c0 = c0.to(device)

    for batch_idx, batch in enumerate(train_dl):
        input = batch[0].to(device)
        target = batch[1].to(device)

        optimizer.zero_grad()
        with torch.set_grad_enabled(True):
            out, hidden = model(input, (h0, c0))
            loss = criterion(out, target)
            loss.backward()
            optimizer.step()
        losses.append(loss.item())

We plot the loss at each batch to make sure that the model is learning:

In [116]:
plt.plot(losses)
Out[116]:
[<matplotlib.lines.Line2D at 0x7f03a2c1bbd0>]

As shown, the losses decrease steadily and then level off, which means that the model has learned what it can from the data.

To test the model, we run the same loop over the test set and extract the accuracy:

Evaluation

In [117]:
batch_acc = []
for batch_idx, batch in enumerate(test_dl):
    input = batch[0].to(device)
    target = batch[1].to(device)

    optimizer.zero_grad()
    with torch.set_grad_enabled(False):
        out, hidden = model(input, (h0, c0))
        _, preds = torch.max(out, 1)
        preds = preds.to("cpu").tolist()
        batch_acc.append(accuracy_score(preds, target.tolist()))

sum(batch_acc) / len(batch_acc)
Out[117]:
0.4628571428571428

While this is generally a low accuracy, it is not insignificant. If the model did not learn, we would expect an accuracy of ~33%, which is random selection.

However, given that the dataset is noisy and not very robust, this is close to the best performance a simple LSTM can achieve on it.

According to the Github repo, the author was able to achieve an accuracy of ~50% using XGBoost.
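To see where the model struggles, a per-class breakdown is more informative than overall accuracy. Below is a hedged sketch (not from the original notebook) that collects all test predictions and prints scikit-learn's classification_report, reusing the model, test_dl, h0, and c0 from above:

from sklearn.metrics import classification_report

all_preds, all_targets = [], []
with torch.no_grad():
    for batch in test_dl:
        input = batch[0].to(device)
        target = batch[1].to(device)
        out, _ = model(input, (h0, c0))
        all_preds.extend(out.argmax(dim=1).cpu().tolist())
        all_targets.extend(target.cpu().tolist())

print(classification_report(all_targets, all_preds,
                            target_names=["negative", "neutral", "positive"]))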

Conclusion

In this tutorial, we created a simple LSTM classifier for sentiment analysis. Along the way, we learned many NLP techniques used in real NLP projects. While the accuracy was not as high as accuracies for other datasets, we can conclude that the model learned what it could from the data, as shown by the loss.

