Stack Abuse: Text Classification with BERT Tokenizer and TF 2.0 in Python

January 28, 2020, 5:46 am

≫ Next: Real Python: Python Modules and Packages: An Introduction

≪ Previous: PyCharm: PyCharm 2020.1 EAP starts now

This is the 23rd article in my series of articles on Python for NLP. In the previous article of this series, I explained how to perform neural machine translation using seq2seq architecture with Python's Keras library for deep learning.

In this article we will study BERT, which stands for Bidirectional Encoder Representations from Transformers and its application to text classification. BERT is a text representation technique like Word Embeddings. If you have no idea of how word embeddings work, take a look at my article on word embeddings.

Like word embeddings, BERT is also a text representation technique which is a fusion of variety of state-of-the-art deep learning algorithms, such as bidirectional encoder LSTM and Transformers. BERT was developed by researchers at Google in 2018 and has been proven to be state-of-the-art for a variety of natural language processing tasks such text classification, text summarization, text generation, etc. Just recently, Google announced that BERT is being used as a core part of their search algorithm to better understand queries.

In this article we will not go into the mathematical details of how BERT is implemented, as there are plenty of resources already available online. Rather we will see how to perform text classification using the BERT Tokenizer. In this article you will see how the BERT Tokenizer can be used to create text classification model. In the next article I will explain how the BERT Tokenizer, along with BERT embedding layer, can be used to create even more efficient NLP models.

Note: All the scripts in this article have been tested using Google Colab environment, with Python runtime set to GPU.

The Dataset

The dataset used in this article can be downloaded from this Kaggle link.

If you download the dataset and extract the compressed file, you will see a CSV file. The file contains 50,000 records and two columns: review and sentiment. The review column contains text for the review and the sentiment column contains sentiment for the review. The sentiment column can have two values i.e. "positive" and "negative" which makes our problem a binary classification problem.

We have previously performed sentimental analysis of this dataset in a previous article where we achieved maximum accuracy of 92% on the training set via word a embedding technique and convolutional neural network. On the test set the maximum accuracy achieved was 85.40% using the word embedding and single LSTM with 128 nodes. Let's see if we can get better accuracy using BERT representation.

Installing and Importing Required Libraries

Before you can go and use the BERT text representation, you need to install BERT for TensorFlow 2.0. Execute the following pip commands on your terminal to install BERT for TensorFlow 2.0.

!pip install bert-for-tf2
!pip install sentencepiece

Next, you need to make sure that you are running TensorFlow 2.0. Google Colab, by default, doesn't run your script on TensorFlow 2.0. Therefore, to make sure that you are running your script via TensorFlow 2.0, execute the following script:

try:
    %tensorflow_version 2.x
except Exception:
    pass
import tensorflow as tf

import tensorflow_hub as hub

from tensorflow.keras import layers
import bert

In the above script, in addition to TensorFlow 2.0, we also import tensorflow_hub, which basically is a place where you can find all the prebuilt and pretrained models developed in TensorFlow. We will be importing and using a built-in BERT model from TF hub. Finally, if in the output you see the following output, you are good to go:

TensorFlow 2.x selected.

Importing and Preprocessing the Dataset

The following script imports the dataset using the read_csv() method of the Pandas dataframe. The script also prints the shape of the dataset.

movie_reviews = pd.read_csv("/content/drive/My Drive/Colab Datasets/IMDB Dataset.csv")

movie_reviews.isnull().values.any()

movie_reviews.shape

Output

(50000, 2)

The output shows that our dataset has 50,000 rows and 2 columns.

Next, we will preprocess our data to remove any punctuations and special characters. To do so, we will define a function that takes as input a raw text review and returns the corresponding cleaned text review.

def preprocess_text(sen):
    # Removing html tags
    sentence = remove_tags(sen)

    # Remove punctuations and numbers
    sentence = re.sub('[^a-zA-Z]', ' ', sentence)

    # Single character removal
    sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)

    # Removing multiple spaces
    sentence = re.sub(r'\s+', ' ', sentence)

    return sentence

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    return TAG_RE.sub('', text)

The following script cleans all the text reviews:

reviews = []
sentences = list(movie_reviews['review'])
for sen in sentences:
    reviews.append(preprocess_text(sen))

Our dataset contains two columns, as can be verified from the following script:

print(movie_reviews.columns.values)

Output:

['review' 'sentiment']

The review column contains text while the sentiment column contains sentiments. The sentiments column contains values in the form of text. The following script displays unique values in the sentiment column:

movie_reviews.sentiment.unique()

Output:

array(['positive', 'negative'], dtype=object)

You can see that the sentiment column contains two unique values i.e. positive and negative. Deep learning algorithms work with numbers. Since we have only two unique values in the output, we can convert them into 1 and 0. The following script replaces positive sentiment by 1 and the negative sentiment by 0.

y = movie_reviews['sentiment']

y = np.array(list(map(lambda x: 1 if x=="positive" else 0, y)))

Now the reviews variable contain text reviews while the y variable contains the corresponding labels. Let's randomly print a review.

print(reviews[10])

Output:

Phil the Alien is one of those quirky films where the humour is based around the oddness of everything rather than actual punchlines At first it was very odd and pretty funny but as the movie progressed didn find the jokes or oddness funny anymore Its low budget film thats never problem in itself there were some pretty interesting characters but eventually just lost interest imagine this film would appeal to stoner who is currently partaking For something similar but better try Brother from another planet

It clearly looks like a negative review. Let's just confirm it by printing the corresponding label value:

print(y[10])

Output:

The output 0 confirms that it is a negative review. We have now preprocessed our data and we are now ready to create BERT representations from our text data.

Creating a BERT Tokenizer

In order to use BERT text embeddings as input to train text classification model, we need to tokenize our text reviews. Tokenization refers to dividing a sentence into individual words. To tokenize our text, we will be using the BERT tokenizer. Look at the following script:

BertTokenizer = bert.bert_tokenization.FullTokenizer
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
                            trainable=False)
vocabulary_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
to_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = BertTokenizer(vocabulary_file, to_lower_case)

In the script above we first create an object of the FullTokenizer class from the bert.bert_tokenization module. Next, we create a BERT embedding layer by importing the BERT model from hub.KerasLayer. The trainable parameter is set to False, which means that we will not be training the BERT embedding. In the next line, we create a BERT vocabulary file in the form a numpy array. We then set the text to lowercase and finally we pass our vocabulary_file and to_lower_case variables to the BertTokenizer object.

It is pertinent to mention that in this article, we will only be using BERT Tokenizer. In the next article we will use BERT Embeddings along with tokenizer.

Let's now see if our BERT tokenizer is actually working. To do so, we will tokenize a random sentence, as shown below:

tokenizer.tokenize("don't be so judgmental")

Output:

['don', "'", 't', 'be', 'so', 'judgment', '##al']

You can see that the text has been successfully tokenized. You can also get the ids of the tokens using the convert_tokens_to_ids() of the tokenizer object. Look at the following script:

tokenizer.convert_tokens_to_ids(tokenizer.tokenize("dont be so judgmental"))

Output:

[2123, 2102, 2022, 2061, 8689, 2389]

Now will define a function that accepts a single text review and returns the ids of the tokenized words in the review. Execute the following script:

def tokenize_reviews(text_reviews):
    return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text_reviews))

And execute the following script to actually tokenize all the reviews in the input dataset:

tokenized_reviews = [tokenize_reviews(review) for review in reviews]

Prerparing Data For Training

The reviews in our dataset have varying lengths. Some reviews are very small while others are very long. To train the model, the input sentences should be of equal length. To create sentences of equal length, one way is to pad the shorter sentences by 0s. However, this can result in a sparse matrix contain large number of 0s. The other way is to pad sentences within each batch. Since we will be training the model in batches, we can pad the sentences within the training batch locally depending upon the length of the longest sentence. To do so, we first need to find the length of each sentence.

The following script creates a list of lists where each sublist contains tokenized review, the label of the review and the length of the review:

reviews_with_len = [[review, y[i], len(review)]
                 for i, review in enumerate(tokenized_reviews)]

In our dataset, the first half of the reviews are positive while the last half contains negative reviews. Therefore, in order to have both positive and negative reviews in the training batches we need to shuffle the reviews. The following script shuffles the data randomly:

random.shuffle(reviews_with_len)

Once the data is shuffled, we will sort the data by the length of the reviews. To do so, we will use the sort() function of the list and will tell it that we want to sort the list with respect to the third item in the sublist i.e. the length of the review.

reviews_with_len.sort(key=lambda x: x[2])

Once the reviews are sorted by length, we can remove the length attribute from all the reviews. Execute the following script to do so:

sorted_reviews_labels = [(review_lab[0], review_lab[1]) for review_lab in reviews_with_len]

Once the reviews are sorted we will convert thed dataset so that it can be used to train TensorFlow 2.0 models. Run the following code to convert the sorted dataset into a TensorFlow 2.0-compliant input dataset shape.

processed_dataset = tf.data.Dataset.from_generator(lambda: sorted_reviews_labels, output_types=(tf.int32, tf.int32))

Finally, we can now pad our dataset for each batch. The batch size we are going to use is 32 which means that after processing 32 reviews, the weights of the neural network will be updated. To pad the reviews locally with respect to batches, execute the following:

BATCH_SIZE = 32
batched_dataset = processed_dataset.padded_batch(BATCH_SIZE, padded_shapes=((None, ), ()))

Let's print the first batch and see how padding has been applied to it:

next(iter(batched_dataset))

Output:

(<tf.Tensor: shape=(32, 21), dtype=int32, numpy=
 array([[ 2054,  5896,  2054,  2466,  2054,  6752,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
        [ 3078,  5436,  3078,  3257,  3532,  7613,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
        [ 3191,  1996,  2338,  5293,  1996,  3185,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
        [ 2062, 23873,  3993,  2062, 11259,  2172,  2172,  2062, 14888,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
        [ 1045,  2876,  9278,  2023,  2028,  2130,  2006,  7922, 12635,
          2305,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
      ......
      
        [ 7244,  2092,  2856, 10828,  1997, 10904,  2402,  2472,  3135,
          2293,  2466,  2007, 10958,  8428, 10102,  1999,  1996,  4281,
          4276,  3773,     0],
        [ 2005,  5760,  7788,  4393,  8808,  2498,  2064, 12826,  2000,
          1996, 11056,  3152,  3811, 16755,  2169,  1998,  2296,  2028,
          1997,  2068,     0],
        [ 2307,  3185,  2926,  1996,  2189,  3802,  2696,  2508,  2012,
          2197,  2023,  8847,  6702,  2043,  2017,  2031,  2633,  2179,
          2008,  2569,  2619],
        [ 2028,  1997,  1996,  4569, 15580,  2102,  5691,  2081,  1999,
          3522,  2086,  2204, 23191,  5436,  1998, 11813,  6370,  2191,
          2023,  2028,  4438],
        [ 2023,  3185,  2097,  2467,  2022,  5934,  1998,  3185,  4438,
          2004,  2146,  2004,  2045,  2024,  2145,  2111,  2040,  6170,
          3153,  1998,  2552]], dtype=int32)>,
 <tf.Tensor: shape=(32,), dtype=int32, numpy=
 array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1,
        1, 1, 0, 1, 0, 1, 1, 0, 1, 1], dtype=int32)>)

The above output shows the first five and last five padded reviews. From the last five reviews, you can see that the total number of words in the largest sentence were 21. Therefore, in the first five reviews the 0s are added at the end of the sentences so that their total length is also 21. The padding for the next batch will be different depending upon the size of the largest sentence in the batch.

Once we have applied padding to our dataset, the next step is to divide the dataset into test and training sets. We can do that with the help of following code:

TOTAL_BATCHES = math.ceil(len(sorted_reviews_labels) / BATCH_SIZE)
TEST_BATCHES = TOTAL_BATCHES // 10
batched_dataset.shuffle(TOTAL_BATCHES)
test_data = batched_dataset.take(TEST_BATCHES)
train_data = batched_dataset.skip(TEST_BATCHES)

In the code above we first find the total number of batches by dividing the total records by 32. Next, 10% of the data is left aside for testing. To do so, we use the take() method of batched_dataset() object to store 10% of the data in the test_data variable. The remaining data is stored in the train_data object for training using the skip() method.

The dataset has been prepared and now we are ready to create our text classification model.

Creating the Model

Now we are all set to create our model. To do so, we will create a class named TEXT_MODEL that inherits from the tf.keras.Model class. Inside the class we will define our model layers. Our model will consist of three convolutional neural network layers. You can use LSTM layers instead and can also increase or decrease the number of layers. I have copied the number and types of layers from SuperDataScience's Google colab notebook and this architecture seems to work quite well for the IMDB Movie reviews dataset as well.

Let's now create out model class:

class TEXT_MODEL(tf.keras.Model):
    
    def __init__(self,
                 vocabulary_size,
                 embedding_dimensions=128,
                 cnn_filters=50,
                 dnn_units=512,
                 model_output_classes=2,
                 dropout_rate=0.1,
                 training=False,
                 name="text_model"):
        super(TEXT_MODEL, self).__init__(name=name)
        
        self.embedding = layers.Embedding(vocabulary_size,
                                          embedding_dimensions)
        self.cnn_layer1 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=2,
                                        padding="valid",
                                        activation="relu")
        self.cnn_layer2 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=3,
                                        padding="valid",
                                        activation="relu")
        self.cnn_layer3 = layers.Conv1D(filters=cnn_filters,
                                        kernel_size=4,
                                        padding="valid",
                                        activation="relu")
        self.pool = layers.GlobalMaxPool1D()
        
        self.dense_1 = layers.Dense(units=dnn_units, activation="relu")
        self.dropout = layers.Dropout(rate=dropout_rate)
        if model_output_classes == 2:
            self.last_dense = layers.Dense(units=1,
                                           activation="sigmoid")
        else:
            self.last_dense = layers.Dense(units=model_output_classes,
                                           activation="softmax")
    
    def call(self, inputs, training):
        l = self.embedding(inputs)
        l_1 = self.cnn_layer1(l) 
        l_1 = self.pool(l_1) 
        l_2 = self.cnn_layer2(l) 
        l_2 = self.pool(l_2)
        l_3 = self.cnn_layer3(l)
        l_3 = self.pool(l_3) 
        
        concatenated = tf.concat([l_1, l_2, l_3], axis=-1) # (batch_size, 3 * cnn_filters)
        concatenated = self.dense_1(concatenated)
        concatenated = self.dropout(concatenated, training)
        model_output = self.last_dense(concatenated)
        
        return model_output

The above script is pretty straightforward. In the constructor of the class, we initialze some attributes with default values. These values will be replaced later on by the values passed when the object of the TEXT_MODEL class is created.

Next, three convolutional neural network layers have been initialized with the kernel or filter values of 2, 3, and 4, respectively. Again, you can change the filter sizes if you want.

Next, inside the call() function, global max pooling is applied to the output of each of the convolutional neural network layer. Finally, the three convolutional neural network layers are concatenated together and their output is fed to the first densely connected neural network. The second densely connected neural network is used to predict the output sentiment since it only contains 2 classes. In case you have more classes in the output, you can updated the output_classes variable accordingly.

Let's now define the values for the hyper parameters of our model.

VOCAB_LENGTH = len(tokenizer.vocab)
EMB_DIM = 200
CNN_FILTERS = 100
DNN_UNITS = 256
OUTPUT_CLASSES = 2

DROPOUT_RATE = 0.2

NB_EPOCHS = 5

Next, we need to create an object of the TEXT_MODEL class and pass the hyper paramters values that we defined in the last step to the constructor of the TEXT_MODEL class.

text_model = TEXT_MODEL(vocabulary_size=VOCAB_LENGTH,
                        embedding_dimensions=EMB_DIM,
                        cnn_filters=CNN_FILTERS,
                        dnn_units=DNN_UNITS,
                        model_output_classes=OUTPUT_CLASSES,
                        dropout_rate=DROPOUT_RATE)

Before we can actually train the model we need to compile it. The following script compiles the model:

if OUTPUT_CLASSES == 2:
    text_model.compile(loss="binary_crossentropy",
                       optimizer="adam",
                       metrics=["accuracy"])
else:
    text_model.compile(loss="sparse_categorical_crossentropy",
                       optimizer="adam",
                       metrics=["sparse_categorical_accuracy"])

Finally to train our model, we can use the fit method of the model class.

text_model.fit(train_data, epochs=NB_EPOCHS)

Here is the result after 5 epochs:

Epoch 1/5
1407/1407 [==============================] - 381s 271ms/step - loss: 0.3037 - accuracy: 0.8661
Epoch 2/5
1407/1407 [==============================] - 381s 271ms/step - loss: 0.1341 - accuracy: 0.9521
Epoch 3/5
1407/1407 [==============================] - 383s 272ms/step - loss: 0.0732 - accuracy: 0.9742
Epoch 4/5
1407/1407 [==============================] - 381s 271ms/step - loss: 0.0376 - accuracy: 0.9865
Epoch 5/5
1407/1407 [==============================] - 383s 272ms/step - loss: 0.0193 - accuracy: 0.9931
<tensorflow.python.keras.callbacks.History at 0x7f5f65690048>

You can see that we got an accuracy of 99.31% on the training set.

Let's now evaluate our model's performance on the test set:

results = text_model.evaluate(test_dataset)
print(results)

Output:

156/Unknown - 4s 28ms/step - loss: 0.4428 - accuracy: 0.8926[0.442786190037926, 0.8926282]

From the output, we can see that we got an accuracy of 89.26% on the test set.

Conclusion

In this article you saw how we can use BERT Tokenizer to create word embeddings that can be used to perform text classification. We performed sentimental analysis of IMDB movie reviews and achieved an accuracy of 89.26% on the test set. In this article we did not use BERT embeddings, we only used BERT Tokenizer to tokenize the words. In the next article, you will see how BERT Tokenizer along with BERT Embeddings can be used to perform text classification.

↧

Real Python: Python Modules and Packages: An Introduction

January 28, 2020, 6:00 am

≫ Next: PyCharm: Webinar Recording: “Advanced Debugging in PyCharm”

≪ Previous: Stack Abuse: Text Classification with BERT Tokenizer and TF 2.0 in Python

In this course, you’ll learn about Python modules and Python packages, two mechanisms that facilitate modular programming.

Modular programming is the process of breaking a large, unwieldy programming task into separate, smaller, more manageable subtasks or modules. Individual modules can then be put together like building blocks to create a larger application.

[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]

↧

PyCharm: Webinar Recording: “Advanced Debugging in PyCharm”

January 28, 2020, 8:00 am

≫ Next: Artem Rys: Why you should use typing for all your Python projects

≪ Previous: Real Python: Python Modules and Packages: An Introduction

Last week we held a special webinar for “Advanced Debugging in PyCharm”. Special how? In person, in the St. Petersburg office, with the two PyCharm team members in charge of the debugger, and a huge webinar audience. The recording is now available.

In this webinar, Liza Shashkova covered a long list of intermediate debugger tips and features, done in the context of writing a Tetris game using Arcade. Quite a number of really useful features, including some that even the pros might not know about.

Andrey Lisin did one section on remote debugging in PyCharm Professional, followed by a series of slides on the architecture of debuggers. This came from an internal talk Liza and he gave to the team when we were planning upcoming features.

Liza has a repo for her part and Andrey’s material is also available.

We wound up with a big turnout of attendees with a bunch of good questions: just what we were hoping for.

↧

Artem Rys: Why you should use typing for all your Python projects

January 28, 2020, 10:00 am

≫ Next: Go Deh: Three guys on math

≪ Previous: PyCharm: Webinar Recording: “Advanced Debugging in PyCharm”

I am using typing at my work project and now trying to use it for my personal or freelance projects only if it is not a one-time script.

Continue reading on python4you »

↧

Go Deh: Three guys on math

January 28, 2020, 10:16 am

≫ Next: Ian Ozsvald: Another Successful Data Science Projects course completed

≪ Previous: Artem Rys: Why you should use typing for all your Python projects

Coder:

(In Jeans and T-shirt, next to a cup of coffee) I look down on him (Indicates Excel'r) because I write proper programs.

Excel'r:

(Trousers, shirt, no tie) I look up to him (Coder) because he writes proper programs; but I look down on him (Hand-Calculater) because he has no graphics. I have a GUI

Hand-Calculater:

(Student) I know my place. I look up to them both. But I don't look up to him (Excel'r) as much as I look up to him (Coder), because he writes apps and games.

Coder:

I do write apps and games, but I have a higher barrier to entry. So sometimes I look up (bends knees, does so) to him (Excel'r).

Excel'r:

I still look up to him (Coder) because although I have easy access, I am vulgar. But I am not as vulgar as him (Hand-Calculater) so I still look down on him (Hand-Calculater).

Hand-Calculater:

I know my place. I look up to them both; but while I am poor, I am honest, industrious and trustworthy. Had I the inclination, I could look down on them. But I don't.

Excel'r:

We all know our place, but what do we get out of it?

Coder:

I get a feeling of superiority over them.

Excel'r:

I get a feeling of inferiority from him, (Coder), but a feeling of superiority over him (Hand-Calculater).

Hand-Calculater:

I get RSI.

Original Sketch:

"The Frost Report - I know my place"

↧

Ian Ozsvald: Another Successful Data Science Projects course completed

January 28, 2020, 10:37 am

≫ Next: PyCoder’s Weekly: Issue #405 (Jan. 28, 2020)

≪ Previous: Go Deh: Three guys on math

A week back I ran the 4th iteration of my 1 day Successful Data Science Projects course. We covered:

How to write a Project Specification including a strong Definition of Done
How to derisk a new dataset quickly using Pandas Profiling, Seaborn and dabl
Building interactive data tools using Altair to identify trends and outliers (for quick discussion and diagnosis with business colleagues)
A long discussion on best practice for designing and running projects to increase their odds of a good outcome
Several diagnosis scenarios for prioritisation and valuation of potential projects

One of the lovely outcomes in the training slack is that new tools get shared by the attendees – I particularly liked Streamlit which was shared as an alternative to my Jupyter Widgets + sklearn estimator demo (which shows a way to hand-build an estimator under Widget control with GUI plots for interactive diagnosis). I’m going to look into integrating this in a future iteration of this course. Here’s my happy room of students:

If you’re interested in attending my future courses then make sure you’re on my low-volume training announce list (and/or you might want my more frequent Thoughts & Jobs email list). My upcoming Software Engineering for Data Scientists course has a seat left, if that’s sold out do contact me to be on the reserve list. If you’d like to get a discount code for future courses please complete my research survey for my 2020 courses.

Ian is a Chief Interim Data Scientist via his Mor Consulting. Sign-up for Data Science tutorials in London and to hear about his data science thoughts and jobs. He lives in London, is walked by his high energy Springer Spaniel and is a consumer of fine coffees.

The post Another Successful Data Science Projects course completed appeared first on Entrepreneurial Geekiness.

↧

PyCoder’s Weekly: Issue #405 (Jan. 28, 2020)

January 28, 2020, 11:30 am

≫ Next: Continuum Analytics Blog: Introducing Anaconda Team Edition: Secure Open-Source Data Science for the Enterprise

≪ Previous: Ian Ozsvald: Another Successful Data Science Projects course completed

#405 – JANUARY 28, 2020
View in Browser »

Python GUI Programming With Tkinter

In this article, you’ll learn the basics of GUI programming with Tkinter, the de-facto Python GUI framework. Master GUI programming concepts such as widgets, geometry managers, and event handlers. Then, put it all together by building two applications: a temperature converter and a text editor.
REAL PYTHON

Pythonic Code Review [2016]

“In this article I’ll focus on my personal experience in authoring and reviewing Python code from both psychological and technical perspectives. And I’ll do so keeping in mind the ultimate goal of striking a balance between code reviews being enjoyable and technically fruitful.”
ILYA ETINGOF

Profile and Optimize Python Apps Performance with Blackfire.io

You can’t improve what you can’t measure. Profile and understand Python code’s behaviour and performance. Build faster applications. Blackfire.io is now available as Public Beta. Signup, install and find optimizations in minutes →
BLACKFIREsponsor

pip 20.0 Released

Default to doing a user install (as if --user was passed) when the main site-packages directory is not writeable and user site-packages are enabled, cache wheels built from Git requirements, and more.
PYPA.IO

Python 3.9 Compatibility Changes

“With the EoL of Python 2 being in line with development of Python 3.9 there were changes made to Python 3.9 that broke a lot of packages since many deprecation warnings became errors.”
KARTHIKEYAN SINGARAVELAN

Quick-And-Dirty Guide on How to Install Packages for Python

“If you just want to start poking at Python and want to avoid the pitfalls to installing packages globally, it only takes 3 steps to do the right thing.”
BRETT CANNON

Python 3.9.0a3 Now Available for Testing

Changelog at the link.
PYTHONINSIDER.BLOGSPOT.COM

Python Jobs

Articles & Tutorials

Understand Django: URLs Lead the Way

How does a Django site know where to send requests? You have to tell it! In this article you’ll look at URLs and how to let your users get to the right place.
MATT LAYMAN• Shared by Matt Layman

RIP Pipenv: Tried Too Hard. Do What You Need With `pip-tools`

An opinionated look at Pipenv and its future as a Python packaging tool. More about pip-tools here.
NICK TIMKOVICH

Learn the Skills You Need to Land a Job in Data Science, Guaranteed

As a student in Springboard’s Data Science Career Track, you’ll work one-on-one with an expert data science mentor to complete real-world projects, build your portfolio, and gain the skills necessary to get hired. Springboard’s team will work with you from the start to help you land your dream data science role. Learn more →
SPRINGBOARDsponsor

Python Modules and Packages: An Introduction

In this course, you’ll explore Python modules and Python packages, two mechanisms that facilitate modular programming. See how to write and import modules so you can optimize the structure of your own programs and make them more maintainable.
REAL PYTHONcourse

Using Markdown to Create Responsive HTML Emails

This article describes how to use Python to transform a Markdown text file into a response HTML email and static page on a Pelican blog.
CHRIS MOFFITT

A Tiny Python Called Snek

Snek is a version of Python targeting embedded processors developed by Keith Packard.
JAKE EDGE

How to Port From Python 2 to Python 3

MACIEJ URBAŃSKI

Serverless Python Deployments With Github Actions

IAN WHITESTONE

Opinionated Coding Guidelines and Best Practices in Python

GITHUB.COM/REDNAFI

Create Animated Images Using Python

YONG CUI

Projects & Code

yakutils: Toolbox of Python 3 Helper Functions

GITHUB.COM/NFICANO

quick: Qt5 Based GUI Generator for Click

GITHUB.COM/SZSDK

rich: Rich Text and Beautiful Formatting in the Terminal

GITHUB.COM/WILLMCGUGAN

text_grapher: 3D Graphics Rendered as Text

GITHUB.COM/FLETCHGRAHAM

ScalaPy: Use Python Libraries From the Comfort of Scala

SCALAPY.DEV

gitfilesplit: Split One File Into Several, Preserving Git History

GITHUB.COM/IDLESIGN• Shared by pythonz

PythonMinefield: Game Written Using the `arcade` Library

GITHUB.COM/MYRMICA-HABILIS

Events

Pravega Hackathon 2020

February 1 to February 3, 2020
PRAVEGA.ORG

FOSDEM 2020: Python Dev Room

February 1 to February 2, 2020
FOSDEM.ORG

PyDelhi User Group Meetup

February 1, 2020
MEETUP.COM

Melbourne Python Users Group, Australia

February 3, 2020
J.MP

Dominican Republic Python User Group

February 4, 2020
PYTHON.DO

Heidelberg Python Meetup

February 5, 2020
MEETUP.COM

PyRana General Body Meeting

February 5, 2020
PYRANA.ORG

Happy Pythoning!
This was PyCoder’s Weekly Issue #405.
View in Browser »

[ Subscribe to 🐍 PyCoder’s Weekly 💌 – Get the best Python news, articles, and tutorials delivered to your inbox once a week >> Click here to learn more ]

↧

Continuum Analytics Blog: Introducing Anaconda Team Edition: Secure Open-Source Data Science for the Enterprise

January 28, 2020, 1:39 pm

≫ Next: PyPy Development: Leysin Winter sprint 2020: Feb 29 - March 8th

≪ Previous: PyCoder’s Weekly: Issue #405 (Jan. 28, 2020)

I’m very excited to announce a new addition to Anaconda’s product line — Anaconda Team Edition! For the last few years, Anaconda has offered two products: our free Anaconda Distribution, meant for individual practitioners, and…

The post Introducing Anaconda Team Edition: Secure Open-Source Data Science for the Enterprise appeared first on Anaconda.

↧

PyPy Development: Leysin Winter sprint 2020: Feb 29 - March 8th

January 29, 2020, 5:46 am

≫ Next: Real Python: Python '!=' Is Not 'is not': Comparing Objects in Python

≪ Previous: Continuum Analytics Blog: Introducing Anaconda Team Edition: Secure Open-Source Data Science for the Enterprise

The next PyPy sprint will be in Leysin, Switzerland, for the fourteenth time. This is a fully public sprint: newcomers and topics other than those proposed below are welcome.

Goals and topics of the sprint

The list of topics is open. For reference, we would like to work at least partially on the following topics:

HPy
Python 3.7 support (buildbot status)

As usual, the main side goal is to have fun in winter sports :-) We can take a day off (for ski or anything else).

Times and accomodation

The sprint will occur for one week starting on Saturday, the 29th of February, to Sunday, the 8th of March 2020 (dates were pushed back one day!) It will occur in Les Airelles, a different bed-and-breakfast place from the traditional one in Leysin. It is a nice old house at the top of the village.

We have a 4- or 5-people room as well as up to three double-rooms. Please register early! These rooms are not booked for the sprint in advance, and might be already taken if you end up announcing yourself late. (But it is of course always possible to book at a different place in Leysin.)

For more information, see our repository or write to me directly at armin.rigo@gmail.com.

↧

Real Python: Python '!=' Is Not 'is not': Comparing Objects in Python

January 29, 2020, 6:00 am

≫ Next: PyCharm: Webinar: “Security Checks for Python Code” with Anthony Shaw

≪ Previous: PyPy Development: Leysin Winter sprint 2020: Feb 29 - March 8th

There’s a subtle difference between the Python identity operator (is) and the equality operator (==). Your code can run fine when you use the Python is operator to compare numbers, until it suddenly doesn’t. You might have heard somewhere that the Python is operator is faster than the == operator, or you may feel that it looks more Pythonic. However, it’s crucial to keep in mind that these operators don’t behave quite the same.

The == operator compares the value or equality of two objects, whereas the Python is operator checks whether two variables point to the same object in memory. In the vast majority of cases, this means you should use the equality operators == and !=, except when you’re comparing to None.

In this tutorial, you’ll learn:

What the difference is between object equality and identity
When to use equality and identity operators to compare objects
What these Python operators do under the hood
Why using is and is not to compare values leads to unexpected behavior
How to write a custom __eq__() class method to define equality operator behavior

Python Pit Stop: This tutorial is a quick and practical way to find the info you need, so you’ll be back to your project in no time!

Free Bonus:Click here to get a Python Cheat Sheet and learn the basics of Python 3, like working with data types, dictionaries, lists, and Python functions.

Comparing Identity With the Python is and is not Operators

The Python is and is not operators compare the identity of two objects. In CPython, this is their memory address. Everything in Python is an object, and each object is stored at a specific memory location. The Python is and is not operators check whether two variables refer to the same object in memory.

Note: Keep in mind that objects with the same value are usually stored at separate memory addresses.

You can use id() to check the identity of an object:

>>>

>>> help(id)Help on built-in function id in module builtins:id(obj, /)    Return the identity of an object.    This is guaranteed to be unique among simultaneously existing objects.    (CPython uses the object's memory address.)>>> id(id)2570892442576

The last line shows the memory address where the built-in function id itself is stored.

There are some common cases where objects with the same value will have the same id by default. For example, the numbers -5 to 256 are interned in CPython. Each number is stored at a singular and fixed place in memory, which saves memory for commonly-used integers.

You can use sys.intern() to intern strings for performance. This function allows you to compare their memory addresses rather than comparing the strings character-by-character:

>>>

>>> fromsysimportintern>>> a='hello world'>>> b='hello world'>>> aisbFalse>>> id(a)1603648396784>>> id(b)1603648426160>>> a=intern(a)>>> b=intern(b)>>> aisbTrue>>> id(a)1603648396784>>> id(b)1603648396784

The variables a and b initially point to two different objects in memory, as shown by their different IDs. When you intern them, you ensure that a and b point to the same object in memory. Any new string with the value 'hello world' will now be created at a new memory location, but when you intern this new string, you make sure that it points to the same memory address as the first 'hello world' that you interned.

Note: Even though the memory address of an object is unique at any given time, it varies between runs of the same code, and depends on the version of CPython and the machine on which it runs.

Other objects that are interned by default are None, True, False, and simple strings. Keep in mind that most of the time, different objects with the same value will be stored at separate memory addresses. This means you should not use the Python is operator to compare values.

When Only Some Integers Are Interned

Behind the scenes, Python interns objects with commonly-used values (for example, the integers -5 to 256) to save memory. The following bit of code shows you how only some integers have a fixed memory address:

>>>

>>> a=256>>> b=256>>> aisbTrue>>> id(a)1638894624>>> id(b)1638894624>>> a=257>>> b=257>>> aisbFalse>>> id(a)2570926051952>>> id(b)2570926051984

Initially, a and b point to the same interned object in memory, but when their values are outside the range of common integers (ranging from -5 to 256), they’re stored at separate memory addresses.

When Multiple Variables Point to the Same Object

When you use the assignment operator (=) to make one variable equal to the other, you make these variables point to the same object in memory. This may lead to unexpected behavior for mutable objects:

>>>

>>> a=[1,2,3]>>> b=a>>> a[1, 2, 3]>>> b[1, 2, 3]>>> a.append(4)>>> a[1, 2, 3, 4]>>> b[1, 2, 3, 4]>>> id(a)2570926056520>>> id(b)2570926056520

What just happened? You add a new element to a, but now b contains this element too! Well, in the line where b = a, you set b to point to the same memory address as a, so that both variables now refer to the same object.

If you define these lists independently of each other, then they’re stored at different memory addresses and behave independently:

>>>

>>> a=[1,2,3]>>> b=[1,2,3]>>> aisbFalse>>> id(a)2356388925576>>> id(b)2356388952648

Because a and b now refer to different objects in memory, changing one doesn’t affect the other.

Comparing Equality With the Python == and != Operators

Recall that objects with the same value are often stored at separate memory addresses. Use the equality operators == and != if you want to check whether or not two objects have the same value, regardless of where they’re stored in memory. In the vast majority of cases, this is what you want to do.

When Object Copy Is Equal but Not Identical

In the example below, you set b to be a copy of a (which is a mutable object, such as a list or a dictionary). Both variables will have the same value, but each will be stored at a different memory address:

>>>

>>> a=[1,2,3]>>> b=a.copy()>>> a[1, 2, 3]>>> b[1, 2, 3]>>> a==bTrue>>> aisbFalse>>> id(a)2570926058312>>> id(b)2570926057736

a and b are now stored at different memory addresses, so a is b will no longer return True. However, a == b returns True because both objects have the same value.

How Comparing by Equality Works

The magic of the equality operator == happens in the __eq__() class method of the object to the left of the == sign.

Note: This is the case unless the object on the right is a subclass of the object on the left. For more information, check the official documentation.

This is a magic class method that’s called whenever an instance of this class is compared against another object. If this method is not implemented, then == compares the memory addresses of the two objects by default.

As an exercise, make a SillyString class that inherits from str and implement __eq__() to compare whether the length of this string is the same as the length of the other object:

classSillyString(str):# This method gets called when using == on the objectdef__eq__(self,other):print(f'comparing {self} to {other}')# Return True if self and other have the same lengthreturnlen(self)==len(other)

Now, a SillyString 'hello world' should be equal to the string 'world hello', and even to any other object with the same length:

>>>

>>> # Compare two strings>>> 'hello world'=='world hello'False>>> # Compare a string with a SillyString>>> 'hello world'==SillyString('world hello')comparing world hello to hello worldTrue>>> # Compare a SillyString with a list>>> SillyString('hello world')==[1,2,3,4,5,6,7,8,9,10,11]comparing hello world to [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]True

This is, of course, silly behavior for an object that otherwise behaves as a string, but it does illustrate what happens when you compare two objects using ==. The != operator gives the inverse response of this unless a specific __ne__() class method is implemented.

The example above also clearly shows you why it is good practice to use the Python is operator for comparing with None, instead of the == operator. Not only is it faster since it compares memory addresses, but it’s also safer because it doesn’t depend on the logic of any __eq__() class methods.

Comparing the Python Comparison Operators

As a rule of thumb, you should always use the equality operators == and !=, except when you’re comparing to None:

Use the Python == and != operators to compare object equality. Here, you’re generally comparing the value of two objects. This is what you need if you want to compare whether or not two objects have the same contents, and you don’t care about where they’re stored in memory.
Use the Python is and is not operators when you want to compare object identity. Here, you’re comparing whether or not two variables point to the same object in memory. The main use case for these operators is when you’re comparing to None. It’s faster and safer to compare to None by memory address than it is by using class methods.

Variables with the same value are often stored at separate memory addresses. This means that you should use == and != to compare their values and use the Python is and is not operators only when you want to check whether two variables point to the same memory address.

Conclusion

In this tutorial, you’ve learned that == and !=compare the value of two objects, whereas the Python is and is not operators compare whether two variables refer to the same object in memory. If you keep this distinction in mind, then you should be able to prevent unexpected behavior in your code.

If you want to read more about the wonderful world of object interning and the Python is operator, then check out Why you should almost never use “is” in Python. You could also have a look at how you can use sys.intern() to optimize memory usage and comparison times for strings, although the chances are that Python already automatically handles this for you behind-the-scenes.

Now that you’ve learned what the equality and identity operators do under the hood, you can try writing your own __eq__() class methods, which define how instances of this class are compared when using the == operator. Go and apply your newfound knowledge of these Python comparison operators!

[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]

↧

PyCharm: Webinar: “Security Checks for Python Code” with Anthony Shaw

January 29, 2020, 7:43 am

≫ Next: Sumana Harihareswara - Cogito, Ergo Sumana: MOSS Video, BSSw Honorable Mention, and The Maintainership Book I Am Writing

≪ Previous: Real Python: Python '!=' Is Not 'is not': Comparing Objects in Python

Software has security issues, Python is software, so how do Python developers avoid common traps? In this webinar, Anthony Shaw discusses the topic of security vulnerabilities, how code quality tools can help, and demonstrates the PyCharm plugin he wrote to let the IDE help.

– Wednesday, February 19th
– 5:00 PM – 6:00 PM CET (11:00 AM – 12:00 PM EST)
– Register here
– Aimed at intermediate Python developers

Speaker

Anthony Shaw is a Python researcher from Australia. He publishes articles about Python, software, and automation to over 1 million readers annually. Anthony is an open-source software advocate, Fellow of the Python Software Foundation, and a member of the Apache Software Foundation.

↧

Sumana Harihareswara - Cogito, Ergo Sumana: MOSS Video, BSSw Honorable Mention, and The Maintainership Book I Am Writing

January 29, 2020, 9:12 am

≫ Next: testmon: Hidden test dependencies

≪ Previous: PyCharm: Webinar: “Security Checks for Python Code” with Anthony Shaw

Video

Mozilla interviewed me about the Python Package Index (PyPI), a USD$170,000 Mozilla Open Source Support award I helped the Python Software Foundation get in 2017, and how we used that money to revamp PyPI and drive it forward in 2017 and 2018.

From that interview, they condensed a video (2 minutes, 14 seconds) featuring, for instance, slo-mo footage of me making air quotes. Their tweet calls me "a driving force behind" PyPI, and given how many people were working on it way before I was, that's quite a compliment!

I will put a transcript in the comments of this blog post.

(Please note that they massively condensed this video from 30+ minutes of interview. In the video, I say, "the site got popular before the code got good". In the interview, I did not just say that without acknowledging the tremendous effort of past volunteers who worked on the previous iteration of PyPI and kept the site going through massive infrastructure challenges, but that's been edited (for brevity, I assume).)

This video is the first in a series meant to encourage people to apply for MOSS funding. I mentioned MOSS in my grants roundup last month. If you want to figure out whether to apply for MOSS funding for your open source software project, and you need help, ping me for a free 20-minute chat or phone call and I can give you some quick advice. (Offer limited in case literally a hundred people contact me, which is unlikely.)

BSSw

The Better Scientific Software (BSSw) Fellowship Program"gives recognition and funding to leaders and advocates of high-quality scientific software."I'm one of three Honorable Mentions for 2020.

The main goal of the BSSw Fellowship program is to foster and promote practices, processes, and tools to improve developer productivity and software sustainability of scientific code. We also anticipate accumulating a growing community of BSSw Fellowship alums who can serve as leaders, mentors, and consultants to increase the visibility of those involved in scientific software production and sustainability in the pursuit of scientific discovery.

That's why I'll be at the Exascale Computing Project Annual Meeting next week in Houston, so if you're there, I hope to meet you. In particular I'd like to meet the leaders of open source projects who want help streamlining contribution processes, growing more maintainers, managing communications with stakeholders, participating in internship projects like Google Summer of Code and Outreachy, expediting releases, and getting more out of hackathons. My consulting firm provides these services, and at ECPAM I can give you some free advice.

Book

And here's the project I'm working on -- why I received this honor.

In 2020, I am writing the first draft of a book teaching the skills open source software maintainers need, aimed at those working scientists and other contributors who have never managed public-facing projects before.

More than developer time, maintainership -- coordination, leadership, and management -- is a bottleneck in software sustainability. The lack of skilled managers is a huge blocker to the sustainability of Free/Libre Open Source Software (FLOSS) infrastructure.

Many FLOSS project maintainers lack management experience and skill. This textbook/self-help guide for new and current maintainers of existing projects ("brownfield projects") will focus on teaching specific project management skills in the context of FLOSS. This will provide scalable guidance, enabling existing FLOSS contributors to become more effective maintainers.

Existing "how to run a FLOSS project" documentation (such as Karl Fogel's Producing Open Source Software) addresses fresh-start "greenfield" projects rather than more common "brownfield", and doesn't teach specific project management skills (e.g., getting to know a team, creating roadmaps, running asynchronous meetings, managing budgets, and writing email memos). Existing educational pathways for scientists and developers (The Carpentries, internships and code schools) don't cover FLOSS-specific management skills.

So I'm writing a sequel to Karl's book -- with his blessing -- and I'm excited to see how I can more scalably share the lessons I've learned in more than a decade of leading open source projects.

I don't yet have a full outline, a publisher, or a length in mind. I'll be posting more here as I grow my plans. Thanks to BSSw and all my colleagues and friends who have encouraged me.

↧

testmon: Hidden test dependencies

January 29, 2020, 9:37 am

≫ Next: Wingware: Wing Python IDE 7.2.1 - January 28, 2020

≪ Previous: Sumana Harihareswara - Cogito, Ergo Sumana: MOSS Video, BSSw Honorable Mention, and The Maintainership Book I Am Writing

hidden test dependencies

Tests should be independent, isolated and repeatable. When they are, it's easy to run just one of them, run all of them in parallel or use pytest-testmon. But we don't live in an ideal world and many times we end up with a test suite with unwanted hidden test dependencies. In this article I am describing a couple of tips and tricks which allow us to find and fix the problems.

Continue reading: Hidden test dependencies

↧

Wingware: Wing Python IDE 7.2.1 - January 28, 2020

January 27, 2020, 5:00 pm

≫ Next: Erik Marsja: Random Forests (and Extremely) in Python with scikit-learn

≪ Previous: testmon: Hidden test dependencies

Wing 7.2.1 fixes debug process group termination, avoids failures seen when pasting some Python code, prevents crashing in vi browse mode when the first line of the file is blank, and fixes some other usability issues.

Download Wing 7.2.1 Now:Wing Pro | Wing Personal | Wing 101 | Compare Products

What's New in Wing 7.2

Auto-Reformatting with Black and YAPF (Wing Pro)

Wing 7.2 adds support for Black and YAPF for code reformatting, in addition to the previously available built-in autopep8 reformatting. To use Black or YAPF, they must first be installed into your Python with pip, conda, or other package manager. Reformatting options are available from the Source>Reformatting menu group, and automatic reformatting may be configured in the Editor>Auto-reformatting preferences group.

See Auto-Reformatting for details.

Improved Support for Virtualenv

Wing 7.2 improves support for virtualenv by allowing the command that activates the environment to be entered in the Python Executable in Project Properties, Launch Configurations, and when creating new projects. The New Project dialog now also includes the option to create a new virtualenv along with the new project, optionally specifying packages to install.

See Using Wing with Virtualenv for details.

Support for Anaconda Environments

Similarly, Wing 7.2 adds support for Anaconda environments, so the condaactivate command can be entered when configuring the Python Executable and the New Project dialog supports using an existing Anaconda environment or creating a new one along with the project.

See Using Wing with Anaconda for details.

And More

Wing 7.2 also makes it easier to debug modules with python-m, simplifies manual configuration of remote debugging, allows using a command line for the configured PythonExecutable, and fixes a number of usability issues.

For details see the change log.

For a complete list of new features in Wing 7, see What's New in Wing 7.

Try Wing 7.2 Now!

Wing 7.2 is an exciting new step for Wingware's Python IDE product line. Find out how Wing 7.2 can turbocharge your Python development by trying it today.

Downloads:Wing Pro | Wing Personal | Wing 101 | Compare Products

See Upgrading for details on upgrading from Wing 6 and earlier, and Migrating from Older Versions for a list of compatibility notes.

↧

Erik Marsja: Random Forests (and Extremely) in Python with scikit-learn

January 29, 2020, 11:48 pm

≫ Next: Wingware: Wing Python IDE 7.2.1 - January 29, 2020

≪ Previous: Wingware: Wing Python IDE 7.2.1 - January 28, 2020

The post Random Forests (and Extremely) in Python with scikit-learn appeared first on Erik Marsja.

In this guest post, you will learn by example how to do two popular machine learning techniques called random forest and extremely random forests. In fact, this post is an excerpt (adapted to the blog format) from the forthcoming Artificial Intelligence with Python – Second Edition: Your Complete Guide to Building Intelligent Apps using Python 3.x and TensorFlow 2. Now, before you will learn how to carry out random forests in Python with scikit-learn, you will find some brief information about the book.

Artificial Intelligence with Python – Second Edition

The new edition of this book, which will guide you to artificial intelligence with Python, is now updated to Python 3.x and TensorFlow 2. Furthermore, it has new chapters that, besides random forests, cover recurrent neural networks, artificial intelligence and Big Data, fundamental use cases, chatbots, and more. Finally, artificial Intelligence with Python – Second Edition is written by two experts in the field of artificial intelligence; Alberto Artasanches and Pratek Joshi (more information about the authors can be found towards the end of the post).

Now, in the next section of this post, you will learn what random forests and extremely random forests are. After that, there’s a code example on how to set up a script to do these types of classification with Python and scikit-learn learn.

What are Random Forests and Extremely Random Forests?

A random forest is an instance of ensemble learning where individual models are constructed using decision trees. This ensemble of decision trees is then used to predict the output value. We use a random subset of training data to construct each decision tree. This will ensure diversity among various decision trees. In the first section, we discussed that one of the most important attributes when building good ensemble learning models is that we ensure that there is diversity among individual models.

Advantages of Random Forests

One of the advantages of random forests is that they do not overfit. Overfitting is a frequent problem in machine learning. Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when learning a target function. By constructing a diverse set of decision trees using various random subsets, we ensure that the model does not overfit the training data. During the construction of the tree, the nodes are split successively, and the best thresholds are chosen to reduce the entropy at each level. This split doesn’t consider all the features in the input dataset. Instead, it chooses the best split among the random subset of the features that are under consideration. Adding this randomness tends to increase the bias of the random forest, but the variance decreases because of averaging. Hence, we end up with a robust model.

Extremely Random Forests

Extremely random forests take randomness to the next level. Along with taking a random subset of features, the thresholds are chosen randomly as well. These randomly generated thresholds are chosen as the splitting rules, which reduce the variance of the model even further. Hence, the decision boundaries obtained using extremely random forests tend to be smoother than the ones obtained using random forests. Some implementations of extremely random forest algorithms also enable better parallelization and can scale better.

Building Random Forest and Extremely Random Forest Classifiers

Let’s see how we can build a classifier based on random forests and extremely random forests. The way to construct both classifiers is very similar, so an input flag is used to specify which classifier needs to be built.

Create a new Python file and import the following packages:

import argparse 

import numpy as np 

import matplotlib.pyplot as plt 

from sklearn.metrics import classification_report 

from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier  

from sklearn.model_selection import train_test_split 

from sklearn.metrics import classification_report 

 

from utilities import visualize_classifier

Constructing a Random Forest Classifier in Python

Define an argument parser for Python so that we can take the classifier type as an input parameter. Depending on this parameter, we can construct a random forest classifier or an extremely random forest classifier:

# Argument parser 

def build_arg_parser(): 

    parser = argparse.ArgumentParser(description='Classify data using \ 

            Ensemble Learning techniques') 

    parser.add_argument('--classifier-type', dest='classifier_type', 

            required=True, choices=['rf', 'erf'], help="Type of classifier \ 

                    to use; can be either 'rf' or 'erf'") 

    return parser

Define the main function and parse the input arguments:

code class="lang-py">if __name__=='__main__': 

    # Parse the input arguments 

    args = build_arg_parser().parse_args()  

    classifier_type = args.classifier_type

In this random forest in Python example, data is loaded from the data_random_forests.txt file. Each line in this file contains comma-separated values. The first two values correspond to the input data and the last value corresponds to the target label. We have three distinct classes in this dataset. Let’s load the data from that file:

    # Load input data 

    input_file = 'data_random_forests.txt' 

    data = np.loadtxt(input_file, delimiter=',') 

    X, y = data[:, :-1], data[:, -1]

A side note, on this blog there are many guides, and tutorials, on how to import data with Python. In some cases, the data may be stored in CSV or Excel files. Here are two posts by the author of this blog if you need to import data from other formats:

Separate the input data into three classes:

    # Separate input data into three classes based on labels  

    class_0 = np.array(X[y==0]) 

    class_1 = np.array(X[y==1]) 

    class_2 = np.array(X[y==2])

Let’s visualize the input data:

    # Visualize input data 

    plt.figure() 

    plt.scatter(class_0[:, 0], class_0[:, 1], s=75, facecolors='white', 

                    edgecolors='black', linewidth=1, marker='s') 

    plt.scatter(class_1[:, 0], class_1[:, 1], s=75, facecolors='white', 

                    edgecolors='black', linewidth=1, marker='o') 

    plt.scatter(class_2[:, 0], class_2[:, 1], s=75, facecolors='white', 

                    edgecolors='black', linewidth=1, marker='^') 

    plt.title('Input data')

Split the data into training and testing datasets:

    # Split data into training and testing datasets 

    X_train, X_test, y_train, y_test = train_test_split( 

            X, y, test_size=0.25, random_state=5)

Define the parameters to be used when we construct the classifier. The n_estimators parameter refers to the number of trees that will be constructed. The max_depth parameter refers to the maximum number of levels in each tree. The random_state parameter refers to the seed value of the random number generator needed to initialize the random forest classifier algorithm:

    # Ensemble Learning classifier 

    params = {'n_estimators': 100, 'max_depth': 4, 'random_state': 0}

Depending on the input parameter, we either construct a random forest classifier or an extremely random forest classifier:

    if classifier_type == 'rf': 

        classifier = RandomForestClassifier(**params) 

    else: 

        classifier = ExtraTreesClassifier(**params)

Visualize a Random Forest Classifier in Python

Train and visualize the classifier:

    classifier.fit(X_train, y_train) 

    visualize_classifier(classifier, X_train, y_train, 'Training dataset')

Compute the output based on the test dataset and visualize it:

    y_test_pred = classifier.predict(X_test) 

    visualize_classifier(classifier, X_test, y_test, 'Test dataset')

Evaluate the performance of the classifier by printing the classification report:

 # Evaluate classifier performance 

    class_names = ['Class-0', 'Class-1', 'Class-2'] 

    print("\n" + "#"*40) 

    print("\nClassifier performance on training dataset\n") 

    print(classification_report(y_train, classifier.predict(X_train), target_names=class_names)) 

    print("#"*40 + "\n") 

     

    print("#"*40) 

    print("\nClassifier performance on test dataset\n") 

    print(classification_report(y_test, y_test_pred, target_names=class_names)) 

    print("#"*40 + "\n")

Running a Random Forest Classifier in Python

If you were to save the code in the file random_forests.py file. Let’s run the code with the random forest classifier using the rf flag in the input argument. Run the following command:

$ python3 random_forests.py --classifier-type rf

You will see a few figures pop up. The first screenshot is the input data:

Visualization of input data

In the preceding screenshot, the three classes are being represented by squares, circles, and triangles. We see that there is a lot of overlap between classes, but that should be fine for now. The second screenshot shows the classifier boundaries:

Classifier boundaries on the test dataset

Extremely Random Forest in Python

Now let’s run the code with the extremely random forest classifier by using the erf flag in the input argument. Run the following command:

$ python3 random_forests.py --classifier-type erf

You will see a few figures pop up. We already know what the input data looks like. The second screenshot shows the classifier boundaries:

Classifier boundaries on the test dataset

If you compare the preceding screenshot with the boundaries obtained from the random forest classifier, you will see that these boundaries are smoother. The reason is that extremely random forests have more freedom during the training process to come up with good decision trees, hence they usually produce better boundaries.

Summary: Random Forests in Python

Now, to summarize, in this post, you have learned about the forthcoming new edition of the book Artificial Intelligence with Python. To fully get all you can from this book you need basic Python programming experience and awareness of machine learning concepts and techniques.

However, the most important part of this blog post, and maybe the take-home message, was to learn how to do random forests in Python using scikit-learn.

Authors Biography

Alberto Artasanchez is a data scientist with over 25 years of consulting experience with Fortune 500 companies as well as startups. He has an extensive background in artificial intelligence and advanced algorithms. Mr. Artasanchez holds 8 AWS certifications including the Big Data Specialty and the Machine Learning Specialty certifications. He is an AWS Ambassador and publishes frequently in a variety of data science blogs. He is often tapped as a speaker on topics ranging from Data Science, Big Data and Analytics, underwriting optimization and fraud detection. Alberto has a strong and extensive track record of designing and building end-to-end machine learning platforms at scale. He graduated with a Master of Science degree from Wayne State University and a Bachelor of Arts degree from Kalamazoo College. Alberto is particularly interested in using Artificial Intelligence to build Data Lakes at scale. Finally, he is married to his lovely wife Karen and is addicted to CrossFit.

Prateek Joshi is the founder of Plutoshift and a published author of 9 books on Artificial Intelligence. He has been featured on Forbes 30 Under 30, NBC, Bloomberg, CNBC, TechCrunch, and The Business Journals. He has been an invited speaker at conferences such as TEDx, Global Big Data Conference, Machine Learning Developers Conference, and Silicon Valley Deep Learning. His tech blog (www.prateekjoshi.com) has received more than 2M+ page views from 200+ countries and has 7,500+ followers. Apart from Artificial Intelligence, some of the topics that excite him are number theory, cryptography, and quantum computing. His greater goal is to make Artificial Intelligence accessible to everyone so that it can impact billions of people around the world.

The post Random Forests (and Extremely) in Python with scikit-learn appeared first on Erik Marsja.

↧

Wingware: Wing Python IDE 7.2.1 - January 29, 2020

January 28, 2020, 5:00 pm

≫ Next: Continuum Analytics Blog: We’ve Reached a Milestone: pandas 1.0 Is Here

≪ Previous: Erik Marsja: Random Forests (and Extremely) in Python with scikit-learn

Download Wing 7.2.1 Now:Wing Pro | Wing Personal | Wing 101 | Compare Products

What's New in Wing 7.2

Auto-Reformatting with Black and YAPF (Wing Pro)

See Auto-Reformatting for details.

Improved Support for Virtualenv

See Using Wing with Virtualenv for details.

Support for Anaconda Environments

See Using Wing with Anaconda for details.

And More

For details see the change log.

For a complete list of new features in Wing 7, see What's New in Wing 7.

Try Wing 7.2 Now!

Wing 7.2 is an exciting new step for Wingware's Python IDE product line. Find out how Wing 7.2 can turbocharge your Python development by trying it today.

Downloads:Wing Pro | Wing Personal | Wing 101 | Compare Products

See Upgrading for details on upgrading from Wing 6 and earlier, and Migrating from Older Versions for a list of compatibility notes.

↧

Continuum Analytics Blog: We’ve Reached a Milestone: pandas 1.0 Is Here

January 30, 2020, 7:07 am

≫ Next: Daniel Roy Greenfeld: Feed Generator

≪ Previous: Wingware: Wing Python IDE 7.2.1 - January 29, 2020

Today the pandas project announced the release of pandas 1.0.0. For more on what’s changed, read through the extensive release notes. We’re particularly excited about Numba-accelerated window operations and the new nullable boolean and string…

The post We’ve Reached a Milestone: pandas 1.0 Is Here appeared first on Anaconda.

↧

Daniel Roy Greenfeld: Feed Generator

January 30, 2020, 11:48 am

≫ Next: Talk Python to Me: #249 Capture the Staff of Pythonic Knowledge in TwilioQuest

≪ Previous: Continuum Analytics Blog: We’ve Reached a Milestone: pandas 1.0 Is Here

Late last year I changed my blog engine yet again. I've been happy with it so far, with the exception of XML feeds. The tooling I chose doesn't have good support for feeds, certainly not with the filtering I need. Specifically, I need to have a python feed, a family feed, and so on. As much as I love my wife and daughter, non-technical posts about them probably don't belong on places where this post will show up.

After trying to work within the framework of the blog engine (Vuepress), I got tired of fighting abstraction and gave up. My blog wouldn't have an XML feed.

Solution

Last night I decided to go around the problem. In 30 minutes I coded up a solution, a Python script that bypasses the Vuepress abstraction. You can see it below:

"""
generate_feed.py

Usage:

    python generate_feed.py TAGHERE

Note:

    Works with Python 3.8, untested otherwise.
"""
from glob import glob
import sys

try:
    from feedgen.feed import FeedGenerator
    from yaml import safe_load
    from markdown2 import Markdown
except ImportError:
    print("You need to install pyyaml, feedgen, and markdown2")
    sys.exit(1)


if __name__ == "__main__":

    try:
        tag = sys.argv[1]
    except IndexError:
        print('Add a tag argument such as "python"')
        sys.exit(1)

    # TODO - convert to argument
    YEARS = [
        "2020",
    ]

    markdowner = Markdown()

    fg = FeedGenerator()
    fg.id("https://daniel.roygreenfeld.com/")
    fg.title("pydanny")
    fg.author(
        {
            "name": "Daniel Roy Greenfeld",
            "email": "daniel.roy.greenfeld@roygreenfeld.com",
        }
    )
    fg.link(href="https://daniel.roygreenfeld.com", rel="alternate")
    fg.logo("https://daniel.roygreenfeld.com/images/personalPhoto.png")
    fg.subtitle("Inside the Head of Daniel Roy Greenfeld")
    fg.link(href=f"https://daniel.roygreenfeld.com/atom.{tag}.xml", rel="self")
    fg.language("en")

    years = [f"_posts/posts{x}/*.md" for x in YEARS]
    years.sort()
    years.reverse()

    def read_post(filename):
        with open(filename) as f:
            raw = f.read()[3:]

        config = safe_load(raw[: raw.index("---")])
        content = raw[raw.index("---") + 3 :]

        return config, content

    feed = []

    for year in years:
        posts = glob(year)
        posts.sort()
        posts.reverse()
        for post in posts:
            config, content = read_post(post)
            if tag not in config["tags"]:
                continue

            # add the metadata
            print(config["title"])
            entry = fg.add_entry()
            entry.id(f'https://daniel.roygreenfeld.com/{config["slug"]}.html')
            entry.title(config["title"])
            entry.description(config["description"])
            entry.pubDate(config["date"])

            # Add the content
            content = markdowner.convert(content)
            entry.content(content, type="html")

    print(fg.atom_str(pretty=True))
    fg.atom_file(f".vuepress/public/feeds/{tag}.atom.xml")

You call this on my blog for all python tagged content by running it thus:

python generate_feed.py python

The result validates per W3C and should work everywhere. Yeah!

Summary

This is what I've always enjoyed about Python. In a very short time I can throw together a script that makes my life better.

↧

Talk Python to Me: #249 Capture the Staff of Pythonic Knowledge in TwilioQuest

January 30, 2020, 12:00 am

≫ Next: Test and Code: 99: Software Maintenance and Chess

≪ Previous: Daniel Roy Greenfeld: Feed Generator

Are you learning or helping someone else learn Python, why not make a game out of it? TwilioQuest is a game that doesn't treat you with kid-gloves while teaching you Python. Using your editor of choice, write code on your machine, and still play the game to solve Python challenges.

↧

Test and Code: 99: Software Maintenance and Chess

January 30, 2020, 3:45 pm

≫ Next: Weekly Python StackOverflow Report: (ccxiii) stackoverflow python report

≪ Previous: Talk Python to Me: #249 Capture the Staff of Pythonic Knowledge in TwilioQuest

I play a form of group chess that has some interesting analogies to software development and maintenance of existing systems. This episode explains group chess and explores a few of those analogies.