Once you understand the basics of Python, familiarizing yourself with its most popular packages will not only boost your mastery of the language but also rapidly increase your versatility. In this tutorial, you’ll explore the amazing capabilities of the Natural Language Toolkit (NLTK) for processing and analyzing text, from basic functions to sentiment analysis powered by machine learning!
Sentiment analysis can help you determine the ratio of positive to negative engagements about a specific topic. You can analyze bodies of text, such as comments, tweets, and product reviews, to obtain insights from your audience. In this tutorial, you’ll learn the important features of NLTK for processing text data and the different approaches you can use to perform sentiment analysis on your data.
By the end of this tutorial, you’ll be ready to:
- Split and filter text data in preparation for analysis
- Analyze word frequency
- Find concordance and collocations using different methods
- Perform quick sentiment analysis with NLTK’s built-in classifier
- Define features for custom classification
- Use and compare classifiers for sentiment analysis with NLTK
Getting Started With NLTK
The NLTK library contains various utilities that allow you to effectively manipulate and analyze linguistic data. Among its advanced features are text classifiers that you can use for many kinds of classification, including sentiment analysis.
Sentiment analysis is the practice of using algorithms to classify various samples of related text into overall positive and negative categories. With NLTK, you can employ these algorithms through powerful built-in machine learning operations to obtain insights from linguistic data.
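As a quick preview of where this tutorial is headed, here’s a minimal sketch of scoring a sentence with NLTK’s built-in VADER classifier. It assumes you’ve already downloaded the `vader_lexicon` resource, which is covered in the next section:

```python
# Minimal sketch of sentiment scoring with NLTK's built-in VADER
# classifier; assumes the "vader_lexicon" resource is already downloaded.
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

# polarity_scores() returns "neg", "neu", and "pos" ratios plus a
# normalized "compound" score between -1 (most negative) and 1 (most positive).
scores = sia.polarity_scores("Python is my favorite programming language!")
print(scores)
```

For a simple positive-or-negative call, the `compound` score is often all you need.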
Installing and Importing
You’ll begin by installing some prerequisites, including NLTK itself as well as specific resources you’ll need throughout this tutorial.
First, use `pip` to install NLTK:

```shell
$ python3 -m pip install nltk
```
While this will install the NLTK module, you’ll still need to obtain a few additional resources. Some of them are text samples, and others are data models that certain NLTK functions require.
To get the resources you’ll need, use `nltk.download()`:

```python
import nltk

nltk.download()
```
NLTK will display a download manager showing all available and installed resources. Here are the ones you’ll need to download for this tutorial:
- `names`: A list of common English names compiled by Mark Kantrowitz
- `stopwords`: A list of really common words, like articles, pronouns, prepositions, and conjunctions
- `state_union`: A sample of transcribed State of the Union addresses by different US presidents, compiled by Kathleen Ahrens
- `twitter_samples`: A list of social media phrases posted to Twitter
- `movie_reviews`: Two thousand movie reviews categorized by Bo Pang and Lillian Lee
- `averaged_perceptron_tagger`: A data model that NLTK uses to categorize words into their parts of speech
- `vader_lexicon`: A scored list of words and jargon that NLTK references when performing sentiment analysis, created by C.J. Hutto and Eric Gilbert
- `punkt`: A data model created by Jan Strunk that NLTK uses to split full texts into word lists
Note: Throughout this tutorial, you’ll find many references to the word corpus and its plural form, corpora. A corpus is a large collection of related text samples. In the context of NLTK, corpora are compiled with features for natural language processing (NLP), such as categories and numerical scores for particular features.
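Once the downloads finish, you can spot-check a corpus by loading it from `nltk.corpus`. This check isn’t a step from the tutorial, just a quick way to confirm the resources are in place:

```python
import nltk

# Each downloaded corpus becomes available under nltk.corpus.
stop_words = nltk.corpus.stopwords.words("english")  # e.g. "the", "and", "of"
names = nltk.corpus.names.words()

print(len(stop_words))  # number of English stop words in the list
print(names[:5])        # first few names in the corpus
```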
A quick way to download specific resources directly from the console is to pass a list to `nltk.download()`:

```python
>>> import nltk
>>> nltk.download([
...     "names",
...     "stopwords",
...     "state_union",
...     "twitter_samples",
...     "movie_reviews",
...     "averaged_perceptron_tagger",
...     "vader_lexicon",
...     "punkt",
... ])
[nltk_data] Downloading package names to /home/user/nltk_data...
[nltk_data]   Unzipping corpora/names.zip.
[nltk_data] Downloading package stopwords to /home/user/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package state_union to
[nltk_data]     /home/user/nltk_data...
[nltk_data]   Unzipping corpora/state_union.zip.
[nltk_data] Downloading package twitter_samples to
[nltk_data]     /home/user/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /home/user/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/user/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/user/nltk_data...
[nltk_data] Downloading package punkt to /home/user/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
True
```
This will tell NLTK to find and download each resource based on its identifier.
Should NLTK require additional resources that you haven’t installed, you’ll see a helpful `LookupError` with details and instructions to download the resource:

```python
>>> import nltk
>>> w = nltk.corpus.shakespeare.words()
...
LookupError:
**********************************************************************
  Resource shakespeare not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('shakespeare')
...
```
The `LookupError` specifies which resource is necessary for the requested operation along with instructions to download it using its identifier.
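In a script, you may not want to download resources unconditionally on every run. One common pattern, though not part of this tutorial, is to catch the `LookupError` that `nltk.data.find()` raises and fetch only what’s missing. The helper name below is hypothetical:

```python
import nltk

def ensure_nltk_resource(resource_id, resource_path):
    """Hypothetical helper: download resource_id only if nothing is
    already installed at resource_path (e.g. "corpora/stopwords")."""
    try:
        nltk.data.find(resource_path)
    except LookupError:
        nltk.download(resource_id)

ensure_nltk_resource("stopwords", "corpora/stopwords")
ensure_nltk_resource("punkt", "tokenizers/punkt")
```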
Compiling Data