Wingware: Wing Python IDE 7.2 - January 20, 2020

January 19, 2020, 5:00 pm

Wing 7.2 adds auto-formatting with Black and YAPF, expanded support for virtualenv, support for Anaconda environments, easier debugging of modules launched with python -m, simplified manually configured remote debugging, and other improvements.

Download Wing 7.2 Now: Wing Pro | Wing Personal | Wing 101 | Compare Products

What's New in Wing 7.2

Auto-Reformatting with Black and YAPF (Wing Pro)

Wing 7.2 adds support for Black and YAPF for code reformatting, in addition to the previously available built-in autopep8 reformatting. To use Black or YAPF, you must first install them into your Python with pip, conda, or another package manager. Reformatting options are available from the Source > Reformatting menu group, and automatic reformatting may be configured in the Editor > Auto-reformatting preferences group.

See Auto-Reformatting for details.

Improved Support for Virtualenv

Wing 7.2 improves support for virtualenv by allowing the command that activates the environment to be entered in the Python Executable in Project Properties, Launch Configurations, and when creating new projects. The New Project dialog now also includes the option to create a new virtualenv along with the new project, optionally specifying packages to install.

See Using Wing with Virtualenv for details.

Support for Anaconda Environments

Similarly, Wing 7.2 adds support for Anaconda environments, so the conda activate command can be entered when configuring the Python Executable, and the New Project dialog supports using an existing Anaconda environment or creating a new one along with the project.

See Using Wing with Anaconda for details.

And More

Wing 7.2 also makes it easier to debug modules with python -m, simplifies manual configuration of remote debugging, allows using a command line for the configured Python Executable, and fixes a number of usability issues.

For details see the change log.

For a complete list of new features in Wing 7, see What's New in Wing 7.

Try Wing 7.2 Now!

Wing 7.2 is an exciting new step for Wingware's Python IDE product line. Find out how Wing 7.2 can turbocharge your Python development by trying it today.

Downloads: Wing Pro | Wing Personal | Wing 101 | Compare Products

See Upgrading for details on upgrading from Wing 6 and earlier, and Migrating from Older Versions for a list of compatibility notes.

↧

eGenix.com: Python Meeting Düsseldorf - 2020-01-22

January 19, 2020, 11:00 pm

The following announcement, translated from the German original, is for a regional user group meeting in Düsseldorf, Germany.

Announcement

The next Python Meeting Düsseldorf will take place on the following date:

January 22, 2020, 6:00 pm
Room 1, 2nd floor (2. OG), Bürgerhaus Stadtteilzentrum Bilk
Düsseldorfer Arcaden, Bachstr. 145, 40217 Düsseldorf

Program

Talks Already Registered

Christian Hetmann
        "pipenv"

Jens Diemer
         "Micropython Sonoff Switch"

Klaus Bremer
        "FritzConnection"

Klaus Bremer
        "PyCon DE"

Additional talks are welcome. If you are interested, please contact info@pyddf.de.

Start Time and Location

We meet at 6:00 pm at the Bürgerhaus in the Düsseldorfer Arcaden.

The Bürgerhaus shares its entrance with the swimming pool and is located on the side of the Düsseldorfer Arcaden where the underground parking garage entrance is.

Above the entrance there is a large "Schwimm’ in Bilk" logo. Behind the door, turn directly left to the two elevators and ride up to the 2nd floor. The entrance to Room 1 is directly on the left as you leave the elevator.

>>> Entrance in Google Street View

Introduction

The Python Meeting Düsseldorf is a regular event in Düsseldorf aimed at Python enthusiasts from the region.

Our PyDDF YouTube channel, where we publish videos of the talks after the meetings, gives a good overview of the talks.

The meeting is organized by eGenix.com GmbH, Langenfeld, in cooperation with Clark Consulting & Research, Düsseldorf.

Program

The Python Meeting Düsseldorf uses a mix of (lightning) talks and open discussion.

Talks can be registered in advance or brought up spontaneously during the meeting. A projector with XGA resolution is available.

To register a (lightning) talk, please send an informal email to info@pyddf.de

Contribution to Costs

The Python Meeting Düsseldorf is organized by Python users for Python users.

Since the meeting room, projector, internet access, and drinks cost money, we ask participants for a contribution of EUR 10.00 incl. 19% VAT. School and university students pay EUR 5.00 incl. 19% VAT.

We ask all participants to bring the amount in cash.

Registration

Since we only have seating for about 20 people, we ask you to register by email. Registering does not create any obligation, but it does make planning easier for us.

To register for the meeting, please use Meetup or send an informal email to info@pyddf.de

Further Information

Further information is available on the meeting's website:

http://pyddf.de/

Have fun!

Marc-Andre Lemburg, eGenix.com

↧

James Bennett: Having some fun with Python

January 20, 2020, 1:05 am

The other day on a Slack I hang out in, someone posted an amusing line of Python code:

port="{port}:{port}".format(port=port)

If it’s not clear after the inevitable Swedish-chef-muppet impression has run through your mind, this string-formatting operation will replace the contents of port with a string containing two copies of whatever was in port, separated by a colon. So if port was "foo", now it will …
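A quick sketch of that behavior, assuming port starts out as the string "foo" as in the example above. The f-string at the end is an equivalent spelling added here for comparison and is not part of the quoted code:

```python
# Assumed starting value, taken from the "foo" example in the text.
port = "foo"

# The quoted line: both {port} placeholders are filled from the keyword
# argument, and the result is bound to the same name again.
port = "{port}:{port}".format(port=port)
print(port)

# Equivalent f-string spelling (an illustration, not from the quoted code).
original = "foo"
same_result = f"{original}:{original}"
```

The name on the left of the assignment and the keyword argument on the right only happen to share a spelling; the format string sees whatever value was passed as port.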

Read full entry

↧

Real Python: Scientific Python: Using SciPy for Optimization

January 20, 2020, 6:00 am

When you want to do scientific work in Python, the first library you can turn to is SciPy. As you’ll see in this tutorial, SciPy is not just a library, but a whole ecosystem of libraries that work together to help you accomplish complicated scientific tasks quickly and reliably.

In this tutorial, you’ll learn how to:

Find information about all the things you can do with SciPy
Install SciPy on your computer
Use SciPy to cluster a dataset by several variables
Use SciPy to find the optimum of a function

Let’s dive into the wonderful world of SciPy!

Differentiating SciPy the Ecosystem and SciPy the Library

When you want to use Python for scientific computing tasks, there are several libraries that you’ll probably be advised to use, including:

NumPy
SciPy
Matplotlib
IPython
SymPy
Pandas

Collectively, these libraries make up the SciPy ecosystem and are designed to work together. Many of them rely directly on NumPy arrays to do computations. This tutorial expects that you have some familiarity with creating NumPy arrays and operating on them.

Note: If you need a quick primer or refresher on NumPy, then you can check out these tutorials:

In this tutorial, you’ll learn about the SciPy library, one of the core components of the SciPy ecosystem. The SciPy library is the fundamental library for scientific computing in Python. It provides many efficient and user-friendly interfaces for tasks such as numerical integration, optimization, signal processing, linear algebra, and more.

Understanding SciPy Modules

The SciPy library is composed of a number of modules that separate the library into distinct functional units. If you want to learn about the different modules that SciPy includes, then you can run help() on scipy, as shown below:

>>> import scipy
>>> help(scipy)

This produces some help output for the entire SciPy library, a portion of which is shown below:

Subpackages
-----------

Using any of these subpackages requires an explicit import.  For example,
``import scipy.cluster``.

::

 cluster                      --- Vector Quantization / Kmeans
 fft                          --- Discrete Fourier transforms
 fftpack                      --- Legacy discrete Fourier transforms
 integrate                    --- Integration routines
...

This code block shows the Subpackages portion of the help output, which is a list of all of the available modules within SciPy that you can use for calculations.

Note the text at the top of the section that states, "Using any of these subpackages requires an explicit import." When you want to use functionality from a module in SciPy, you need to import the module that you want to use specifically. You’ll see some examples of this a little later in the tutorial, and guidelines for importing libraries from SciPy are shown in the SciPy documentation.
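As a minimal sketch of what an explicit import looks like, assuming SciPy is already installed, the snippet below imports two subpackages in the two common styles and checks that they are usable:

```python
import scipy

# Subpackages are imported explicitly, either as attributes of scipy ...
import scipy.cluster
# ... or by name with the "from" form.
from scipy import optimize

# After the explicit imports, the subpackages and their functions
# are reachable through these names.
print(scipy.cluster.__name__)
print(callable(optimize.minimize_scalar))
```

Both styles load the same module objects; which one you pick is mostly a matter of how long you want the names in your code to be.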

Once you decide which module you want to use, you can check out the SciPy API reference, which contains all of the details on each module in SciPy. If you’re looking for something with a little more exposition, then the SciPy Lecture Notes are a great resource to go in-depth on many of the SciPy modules.

Later in this tutorial, you’ll learn about cluster and optimize, which are two of the modules in the SciPy library. But first, you’ll need to install SciPy on your computer.

Installing SciPy on Your Computer

As with most Python packages, there are two main ways to install SciPy on your computer: Anaconda and pip.

Here, you’ll learn how to use both of these approaches to install the library. SciPy’s only direct dependency is the NumPy package. Either installation method will automatically install NumPy in addition to SciPy, if necessary.

Anaconda

Anaconda is a popular distribution of Python, mainly because it includes pre-built versions of the most popular scientific Python packages for Windows, macOS, and Linux. If you don’t have Python installed on your computer at all yet, then Anaconda is a great option to get started with. Anaconda comes pre-installed with SciPy and its required dependencies, so once you’ve installed Anaconda, you don’t need to do anything else!

You can download and install Anaconda from their downloads page. Make sure to download the most recent Python 3 release. Once you have the installer on your computer, you can follow the default setup procedure for an application, depending on your platform.

Note: Make sure to install Anaconda in a directory that does not require administrator permissions to modify. This is the default setting in the installer.

If you already have Anaconda installed, but you want to install or update SciPy, then you can do that, too. Open up a terminal application on macOS or Linux, or the Anaconda Prompt on Windows, and type one of the following lines of code:

$ conda install scipy
$ conda update scipy

You should use the first line if you need to install SciPy or the second line if you just want to update SciPy. To make sure SciPy is installed, run Python in your terminal and try to import SciPy:

>>> import scipy
>>> print(scipy.__file__)
/.../lib/python3.7/site-packages/scipy/__init__.py

In this code, you’ve imported scipy and printed the location of the file from where scipy is loaded. The example above is for macOS. Your computer will probably show a different location. Now you have SciPy installed on your computer ready for use. You can skip ahead to the next section to get started using SciPy!

Pip

If you already have a version of Python installed that isn’t Anaconda, or you don’t want to use Anaconda, then you’ll be using pip to install SciPy. To learn more about what pip is, check out What Is Pip? A Guide for New Pythonistas.

Note: pip installs packages using a format called wheels. In the wheel format, code is compiled before it’s sent to your computer. This is nearly the same approach that Anaconda takes, although wheel format files are slightly different from the Anaconda format, and the two are not interchangeable.

To install SciPy using pip, open up your terminal application, and type the following line of code:

$ python -m pip install -U scipy

The code will install SciPy if it isn’t already installed, or upgrade SciPy if it is installed. To make sure SciPy is installed, run Python in your terminal and try to import SciPy:

>>> import scipy
>>> print(scipy.__file__)
/.../lib/python3.7/site-packages/scipy/__init__.py

In this code, you’ve imported scipy and printed the location of the file from where scipy is loaded. The example above is for macOS using pyenv. Your computer will probably show a different location. Now you have SciPy installed on your computer. Let’s see how you can use SciPy to solve a couple of problems you might encounter!

Using the Cluster Module in SciPy

Clustering is a popular technique to categorize data by associating it into groups. The SciPy library includes an implementation of the k-means clustering algorithm as well as several hierarchical clustering algorithms. In this example, you’ll be using the k-means algorithm in scipy.cluster.vq, where vq stands for vector quantization.

First, you should take a look at the dataset you’ll be using for this example. The dataset consists of 4827 real and 747 spam text (or SMS) messages. The raw dataset can be found on the UCI Machine Learning Repository or the authors’ web page.

Note: The data was collected by Tiago A. Almeida and José María Gómez Hidalgo and published in an article titled “Contributions to the Study of SMS Spam Filtering: New Collection and Results” in the Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG‘11) hosted in Mountain View, CA, USA in 2011.

In the dataset, each message has one of two labels:

ham for legitimate messages
spam for spam messages

The full text message is associated with each label. When you scan through the data, you might notice that spam messages tend to have a lot of numeric digits in them. They often include a phone number or prize winnings. Let’s predict whether or not a message is spam based on the number of digits in the message. To do this, you’ll cluster the data into three groups based on the number of digits that appear in the message:

Not spam: Messages with the smallest number of digits are predicted not to be spam.
Unknown: Messages with an intermediate number of digits are unknown and need to be processed by more advanced algorithms.
Spam: Messages with the highest number of digits are predicted to be spam.

Let’s get started with clustering the text messages. First, you should import the libraries you’ll use in this example:

from pathlib import Path
import numpy as np
from scipy.cluster.vq import whiten, kmeans, vq

You can see that you’re importing three functions from scipy.cluster.vq. Each of these functions accepts a NumPy array as input. These arrays should have the features of the dataset in the columns and the observations in the rows.

A feature is a variable of interest, while an observation is created each time you record each feature. In this example, there are 5,574 observations, or individual messages, in the dataset. In addition, you’ll see that there are two features:

The number of digits in a text message
The number of times that number of digits appears in the whole dataset

Next, you should load the data file from the UCI database. The data comes as a text file, where the class of the message is separated from the message by a tab character, and each message is on its own line. You should read the data into a list using pathlib.Path:

data = Path("SMSSpamCollection").read_text()
data = data.strip()
data = data.split("\n")

In this code, you use pathlib.Path.read_text() to read the file into a string. Then, you use .strip() to remove any leading or trailing whitespace and split the string into a list of lines with .split().
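The same read, strip, and split steps can be tried without downloading the dataset. The sketch below writes two made-up records to a temporary file (the file name and messages are hypothetical) in the same label, tab, message layout and reads them back:

```python
import tempfile
from pathlib import Path

# Two made-up records in the same label<TAB>message layout as the dataset.
sample = "ham\tSee you at 5\nspam\tWin 1000 now, call 0800\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"  # hypothetical file name
    path.write_text(sample)

    text = path.read_text()           # whole file as one string
    lines = text.strip().split("\n")  # drop the final newline, then split

# Each line still holds the label and the message, separated by a tab.
records = [tuple(line.split("\t")) for line in lines]
print(records)
```

Without the .strip() call, the trailing newline would leave an empty string at the end of the list, which would later fail to split into a label and a message.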

Next, you can start analyzing the data. You need to count the number of digits that appear in each text message. Python includes collections.Counter in the standard library to collect counts of objects in a dictionary-like structure. However, since all of the functions in scipy.cluster.vq expect NumPy arrays as input, you can’t use collections.Counter for this example. Instead, you use a NumPy array and implement the counts manually.

Again, you’re interested in the number of digits in a given SMS message, and how many SMS messages have that number of digits. First, you should create a NumPy array that associates the number of digits in a given message with the result of the message, whether it was ham or spam:

digit_counts = np.empty((len(data), 2), dtype=int)

In this code, you’re creating an uninitialized NumPy array, digit_counts, which has two columns and 5,574 rows. np.empty() allocates the memory without filling it, so the array holds arbitrary values until you assign to it. The number of rows is equal to the number of messages in the dataset. You’ll be using digit_counts to associate the number of digits in the message with whether or not the message was spam.

You should create the array before entering the loop, so you don’t have to allocate new memory as your array expands. This improves the efficiency of your code. Next, you should process the data to record the number of digits and the status of the message:

 8 for i, line in enumerate(data):
 9     case, message = line.split("\t")
10     num_digits = sum(c.isdigit() for c in message)
11     digit_counts[i, 0] = 0 if case == "ham" else 1
12     digit_counts[i, 1] = num_digits

Here’s a line-by-line breakdown of how this code works:

Line 8: Loop over data. You use enumerate() to put the value from the list in line and create an index i for this list. To learn more about enumerate(), check out Use enumerate() to Keep a Running Index.
Line 9: Split the line on the tab character to create case and message. case is a string that says whether the message is ham or spam, while message is a string with the text of the message.
Line 10: Calculate the number of digits in the message by using the sum() of a generator expression. In the expression, you check each character in the message using isdigit(), which returns True if the character is a digit and False otherwise. sum() then treats each True result as a 1 and each False as a 0. So, the result of sum() on this expression is the number of characters for which isdigit() returned True.
Line 11: Assign values into digit_counts. You assign the first column of row i to be 0 if the message was legitimate (ham) or 1 if the message was spam.
Line 12: Assign values into digit_counts. You assign the second column of row i to be the number of digits in the message.
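The digit-counting step on line 10 can be tried on its own. This is a small sketch using a made-up message that is not taken from the dataset:

```python
# Hypothetical message, not taken from the dataset.
message = "Call 0800 123 now"

# Each c.isdigit() is True or False, one flag per character.
flags = [c.isdigit() for c in message]

# sum() adds the flags up, counting True as 1 and False as 0.
num_digits = sum(c.isdigit() for c in message)

print(num_digits)  # 7: four digits in "0800" plus three in "123"
```

The list in flags is only there to make the intermediate values visible; the generator expression passed to sum() produces the same flags without storing them.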

Now you have a NumPy array that contains the number of digits in each message. However, you want to apply the clustering algorithm to an array that has the number of messages with a certain number of digits. In other words, you need to create an array where the first column has the number of digits in a message, and the second column is the number of messages that have that number of digits. Check out the code below:

unique_counts = np.unique(digit_counts[:, 1], return_counts=True)

np.unique() takes an array as the first argument and returns another array with the unique elements from the argument. It also takes several optional arguments. Here, you use return_counts=True to instruct np.unique() to also return an array with the number of times each unique element is present in the input array. These two outputs are returned as a tuple that you store in unique_counts.
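To see the shape of that return value, here is a short sketch on a made-up array of six digit counts (the values are invented for illustration):

```python
import numpy as np

# Made-up digit counts for six messages.
digits = np.array([0, 0, 1, 3, 3, 3])

# With return_counts=True, np.unique() returns a tuple of two arrays:
# the sorted unique values and how often each one occurs.
values, counts = np.unique(digits, return_counts=True)

print(values)  # [0 1 3]
print(counts)  # [2 1 3]
```

The two arrays line up position by position, so values[i] occurs counts[i] times in the input.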

Next, you need to transform unique_counts into a shape that’s suitable for clustering:

unique_counts = np.transpose(np.vstack(unique_counts))

You combine the two 1xN outputs from np.unique() into one 2xN array using np.vstack(), and then transpose them into an Nx2 array. This format is what you’ll use in the clustering functions. Each row in unique_counts now has two elements:

The number of digits in a message
The number of messages that had that number of digits

A subset of the output from these two operations is shown below:

[[   0 4110]
 [   1  486]
 [   2  160]
 ...
 [  40    4]
 [  41    2]
 [  47    1]]

In the dataset, there are 4110 messages that have no digits, 486 that have 1 digit, and so on. Now, you should apply the k-means clustering algorithm to this array:

whitened_counts = whiten(unique_counts)
codebook, _ = kmeans(whitened_counts, 3)

You use whiten() to normalize each feature to have unit variance, which improves the results from kmeans(). Then, kmeans() takes the whitened data and the number of clusters to create as arguments. In this example, you want to create 3 clusters, for definitely ham, definitely spam, and unknown. kmeans() returns two values:

An array with three rows and two columns representing the centroids of each group: The kmeans() algorithm calculates the optimal location of the centroid of each cluster by minimizing the distance from the observations to each centroid. This array is assigned to codebook.
The mean Euclidean distance from the observations to the centroids: You won’t need that value for the rest of this example, so you can assign it to _.

Next, you should determine which cluster each observation belongs to by using vq():

codes, _ = vq(whitened_counts, codebook)

vq() assigns codes from the codebook to each observation. It returns two values:

The first value is an array of the same length as unique_counts, where the value of each element is an integer representing which cluster that observation is assigned to. Since you used three clusters in this example, each observation is assigned to cluster 0, 1, or 2.
The second value is an array of the Euclidean distance between each observation and its centroid.
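The following sketch shows how whiten() and vq() behave on a tiny made-up array. To keep the result deterministic, the codebook here is written by hand from group means rather than computed with kmeans(), which is an assumption of this illustration and not what the tutorial does:

```python
import numpy as np
from scipy.cluster.vq import whiten, vq

# Made-up observations: four rows, two features.
obs = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [9.0, 90.0],
                [10.0, 100.0]])

# whiten() divides each column by its standard deviation.
whitened = whiten(obs)
same = np.allclose(whitened, obs / obs.std(axis=0))

# A hand-written codebook with two centroids, one per group of rows.
# It is expressed in the same whitened units as the observations.
codebook = np.array([whitened[:2].mean(axis=0),
                     whitened[2:].mean(axis=0)])

# vq() returns the index of the nearest centroid and the distance to it.
codes, distances = vq(whitened, codebook)
print(same, codes)
```

Because the centroids live in whitened units, the observations passed to vq() should be whitened in the same way; mixing raw observations with a whitened codebook compares values on different scales.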

Now that you have the data clustered, you should use it to make predictions about the SMS messages. You can inspect the counts to determine at how many digits the clustering algorithm drew the line between definitely ham and unknown, and between unknown and definitely spam:

print(unique_counts[codes == 0][-1])
print(unique_counts[codes == 1][-1])
print(unique_counts[codes == 2][-1])

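In this code, each line selects the rows in unique_counts to which vq() assigned one of the codes 0, 1, or 2. Since that operation returns an array ordered by number of digits, you take the last row to find the highest number of digits assigned to each group. The print() calls don’t emit the group names themselves; the names in the output are there to show which cluster corresponds to which group, and the cluster numbering can differ between runs. The output is shown below: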

definitely spam [47  1]
definitely ham [   0 4110]
unknown [20 18]

In this output, you see that the definitely ham messages are the messages with zero digits in the message, the unknown messages are everything between 1 and 20 digits, and definitely spam messages are everything from 21 to 47 digits, which is the maximum number of digits in your dataset.

Now, you should check how accurate your predictions are on this dataset. First, create some masks for digit_counts so you can easily grab the ham or spam status of the messages:

digits = digit_counts[:, 1]
predicted_hams = digits == 0
predicted_spams = digits > 20
predicted_unknowns = np.logical_and(digits > 0, digits <= 20)

In this code, you’re creating the predicted_hams mask, where there are no digits in a message. Then, you create the predicted_spams mask for all messages with more than 20 digits. Finally, the messages in the middle are predicted_unknowns.
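Here is a small sketch of how these Boolean masks behave, using six made-up digit counts instead of the real digit_counts array:

```python
import numpy as np

# Made-up digit counts for six messages.
digits = np.array([0, 5, 25, 0, 20, 21])

predicted_hams = digits == 0
predicted_spams = digits > 20
predicted_unknowns = np.logical_and(digits > 0, digits <= 20)

# Summing a Boolean mask counts its True entries.
group_sizes = (
    int(predicted_hams.sum()),
    int(predicted_unknowns.sum()),
    int(predicted_spams.sum()),
)
print(group_sizes)

# Indexing with a mask keeps only the rows where the mask is True.
print(digits[predicted_spams])
```

The three conditions are mutually exclusive and cover every non-negative count, so each message lands in exactly one group.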

Next, apply these masks to the actual digit counts to retrieve the predictions:

spam_cluster = digit_counts[predicted_spams]
ham_cluster = digit_counts[predicted_hams]
unk_cluster = digit_counts[predicted_unknowns]

Here, you’re applying the masks you created in the last code block to the digit_counts array. This creates three new arrays with only the messages that have been clustered into each group. Finally, you can see how many of each message type have fallen into each cluster:

print("hams:", np.unique(ham_cluster[:, 0], return_counts=True))
print("spams:", np.unique(spam_cluster[:, 0], return_counts=True))
print("unknowns:", np.unique(unk_cluster[:, 0], return_counts=True))

This code prints the counts of each unique value from the clusters. Remember that 0 means a message was ham and 1 means the message was spam. The results are shown below:

hams: (array([0, 1]), array([4071,   39]))
spams: (array([0, 1]), array([  1, 232]))
unknowns: (array([0, 1]), array([755, 476]))

From this output, you can see that 4110 messages fell into the definitely ham group, of which 4071 were actually ham and only 39 were spam. Conversely, of the 233 messages that fell into the definitely spam group, only 1 was actually ham and the rest were spam.
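Those counts can be turned into rates with a little arithmetic. The sketch below uses only the numbers reported in the output; the variable names are chosen here for illustration:

```python
# Counts copied from the reported output: (actual ham, actual spam) per group.
ham_group = (4071, 39)
spam_group = (1, 232)
unknown_group = (755, 476)

total = sum(ham_group) + sum(spam_group) + sum(unknown_group)

ham_precision = ham_group[0] / sum(ham_group)      # share of true ham
spam_precision = spam_group[1] / sum(spam_group)   # share of true spam
unknown_share = sum(unknown_group) / total         # left unclassified

print(f"{ham_precision:.1%} {spam_precision:.1%} {unknown_share:.1%}")
# 99.1% 99.6% 22.1%
```

So both confident groups are right about 99% of the time, while roughly a fifth of the 5,574 messages remain in the unknown group.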

Of course, over 1200 messages fell into the unknown category, so some more advanced analysis would be needed to classify those messages. You might want to look into something like natural language processing to help improve the accuracy of your prediction, and you can use Python and Keras to help out.

Using the Optimize Module in SciPy

When you need to optimize the input parameters for a function, scipy.optimize contains a number of useful methods for optimizing different kinds of functions:

minimize_scalar() and minimize() to minimize a function of one variable and many variables, respectively
curve_fit() to fit a function to a set of data
root_scalar() and root() to find the zeros of a function of one variable and many variables, respectively
linprog() to minimize a linear objective function with linear inequality and equality constraints

In practice, all of these functions are performing optimization of one sort or another. In this section, you’ll learn about the two minimization functions, minimize_scalar() and minimize().

Minimizing a Function With One Variable

A mathematical function that accepts one number and results in one output is called a scalar function. It’s usually contrasted with multivariate functions, which accept multiple numbers as input and may also produce multiple numbers as output. You’ll see an example of optimizing multivariate functions in the next section.

For this section, your scalar function will be a quartic polynomial, and your objective is to find the minimum value of the function. The function is y = 3x⁴ - 2x + 1. The function is plotted in the image below for a range of x from 0 to 1:

In the figure, you can see that there’s a minimum value of this function at approximately x = 0.55. You can use minimize_scalar() to determine the exact x and y coordinates of the minimum. First, import minimize_scalar() from scipy.optimize. Then, you need to define the objective function to be minimized:

from scipy.optimize import minimize_scalar

def objective_function(x):
    return 3 * x ** 4 - 2 * x + 1

objective_function() takes the input x and applies the necessary mathematical operations to it, then returns the result. In the function definition, you can use any mathematical functions you want. The only limit is that the function must return a single number at the end.

Next, use minimize_scalar() to find the minimum value of this function. minimize_scalar() has only one required input, which is the name of the objective function definition:

res = minimize_scalar(objective_function)

The output of minimize_scalar() is an instance of OptimizeResult. This class collects together many of the relevant details from the optimizer’s run, including whether or not the optimization was successful and, if successful, what the final result was. The output of minimize_scalar() for this function is shown below:

     fun: 0.17451818777634331
    nfev: 16
     nit: 12
 success: True
       x: 0.5503212087491959

These results are all attributes of OptimizeResult. success is a Boolean value indicating whether or not the optimization completed successfully. If the optimization was successful, then fun is the value of the objective function at the optimal value x. You can see from the output that, as expected, the optimal value for this function was near x = 0.55.
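The reported optimum can be cross-checked by hand: setting the derivative 12x³ - 2 to zero gives x = (1/6)^(1/3). A short sketch of that check, using only the standard library:

```python
# Objective from the text: y = 3x^4 - 2x + 1.
def objective_function(x):
    return 3 * x ** 4 - 2 * x + 1

# dy/dx = 12x^3 - 2 = 0  =>  x = (1/6)^(1/3)
x_min = (1 / 6) ** (1 / 3)
y_min = objective_function(x_min)

print(x_min, y_min)
```

Both values agree with the x and fun attributes shown above to many decimal places, which suggests the optimizer converged to the true minimum.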

Note: As you may know, not every function has a minimum. For instance, try and see what happens if your objective function is y = x³. For minimize_scalar(), objective functions with no minimum often result in an OverflowError because the optimizer eventually tries a number that is too big to be calculated by the computer.

On the opposite side of functions with no minimum are functions that have several minima. In these cases, minimize_scalar() is not guaranteed to find the global minimum of the function. However, minimize_scalar() has a method keyword argument that you can specify to control the solver that’s used for the optimization. The SciPy library has three built-in methods for scalar minimization:

brent is an implementation of Brent’s algorithm. This method is the default.
golden is an implementation of the golden-section search. The documentation notes that Brent’s method is usually better.
bounded is a bounded implementation of Brent’s algorithm. It’s useful to limit the search region when the minimum is in a known range.

When method is either brent or golden, minimize_scalar() takes another argument called bracket. This is a sequence of two or three elements that provide an initial guess for the bounds of the region with the minimum. However, these solvers do not guarantee that the minimum found will be within this range.

On the other hand, when method is bounded, minimize_scalar() takes another argument called bounds. This is a sequence of two elements that strictly bound the search region for the minimum. Try out the bounded method with the function y = x⁴ - x². This function is plotted in the figure below:

Using the previous example code, you can redefine objective_function() like so:

def objective_function(x):
    return x ** 4 - x ** 2

First, try the default brent method:

res = minimize_scalar(objective_function)

In this code, you didn’t pass a value for method, so minimize_scalar() used the brent method by default. The output is this:

     fun: -0.24999999999999994
    nfev: 15
     nit: 11
 success: True
       x: 0.7071067853059209

You can see that the optimization was successful. It found the optimum near x = 0.707 and y = -1/4. If you solved for the minimum of the equation analytically, then you’d find the minimum at x = 1/√2, which is extremely close to the answer found by the minimization function. However, what if you wanted to find the symmetric minimum at x = -1/√2? You might try providing the bracket argument to the brent method:

res = minimize_scalar(objective_function, bracket=(-1, 0))

In this code, you provide the sequence (-1, 0) to bracket to start the search in the region between -1 and 0. You expect there to be a minimum in this region since the objective function is symmetric about the y-axis. However, even with bracket, the brent method still returns the minimum at x = +1/√2. To find the minimum at x = -1/√2, you can use the bounded method with bounds:

res = minimize_scalar(objective_function, method='bounded', bounds=(-1, 0))

In this code, you add method and bounds as arguments to minimize_scalar(), and you set bounds to be from -1 to 0. The output of this method is as follows:

     fun: -0.24999999999998732
 message: 'Solution found.'
    nfev: 10
  status: 0
 success: True
       x: -0.707106701474177

As expected, the minimum was found at x = -1/√2. Note the additional output from this method, which includes a message attribute in res. This field is often used for more detailed output from some of the minimization solvers.
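As a sketch of how bounds steers the solver, the snippet below runs the bounded method once on each half of the x-axis and compares both results with the analytic value 1/√2. The variable names left, right, and expected_x are chosen here for illustration:

```python
import math
from scipy.optimize import minimize_scalar

def objective_function(x):
    return x ** 4 - x ** 2

# Restrict the search to each half of the x-axis in turn.
left = minimize_scalar(objective_function, method="bounded", bounds=(-1, 0))
right = minimize_scalar(objective_function, method="bounded", bounds=(0, 1))

# Analytically, both minima sit at |x| = 1/sqrt(2) with y = -1/4.
expected_x = 1 / math.sqrt(2)
print(left.x, right.x, expected_x)
```

Since the function has exactly one minimum inside each interval, the bounded search has no way to wander over to the other one.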

Minimizing a Function With Many Variables

scipy.optimize also includes the more general minimize(). This function minimizes a scalar objective function of one or more input variables and offers more sophisticated optimization algorithms to handle this. In addition, minimize() can handle constraints on the solution to your problem. You can specify three types of constraints:

LinearConstraint: The solution is constrained by taking the inner product of the solution x values with a user-input array and comparing the result to a lower and upper bound.
NonlinearConstraint: The solution is constrained by applying a user-supplied function to the solution x values and comparing the return value with a lower and upper bound.
Bounds: The solution x values are constrained to lie between a lower and upper bound.

When you use these constraints, it can limit the specific choice of optimization method that you’re able to use, since not all of the available methods support constraints in this way.

Let’s try a demonstration on how to use minimize(). Imagine you’re a stockbroker who’s interested in maximizing the total income from the sale of a fixed number of your stocks. You have identified a particular set of buyers, and for each buyer, you know the price they’ll pay and how much cash they have on hand.

You can phrase this problem as a constrained optimization problem. The objective function is that you want to maximize your income. However, minimize() finds the minimum value of a function, so you’ll need to multiply your objective function by -1 to find the x-values that produce the largest negative number.

There is one constraint on the problem, which is that the sum of the total shares purchased by the buyers does not exceed the number of shares you have on hand. There are also bounds on each of the solution variables because each buyer has an upper bound of cash available, and a lower bound of zero. Negative solution x-values mean that you’d be paying the buyers!

Try out the code below to solve this problem. First, import the modules you need and then set variables to determine the number of buyers in the market and the number of shares you want to sell:

 1 import numpy as np
 2 from scipy.optimize import minimize, LinearConstraint
 3
 4 n_buyers = 10
 5 n_shares = 15

In this code, you import numpy, minimize(), and LinearConstraint from scipy.optimize. Then, you set a market of 10 buyers who’ll be buying 15 shares in total from you.

Next, create arrays to store the price that each buyer pays, the maximum amount they can afford to spend, and the maximum number of shares each buyer can afford, given the first two arrays. For this example, you can use random number generation in np.random to generate the arrays:

 6 np.random.seed(10)
 7 prices = np.random.random(n_buyers)
 8 money_available = np.random.randint(1, 4, n_buyers)

In this code, you set the seed for NumPy’s random number generators. This function makes sure that each time you run this code, you’ll get the same set of random numbers. It’s here to make sure that your output is the same as the tutorial for comparison.

In line 7, you generate the array of prices the buyers will pay. np.random.random() creates an array of random numbers on the half-open interval [0, 1). The number of elements in the array is determined by the value of the argument, which in this case is the number of buyers.

In line 8, you generate an array of integers on the half-open interval [1, 4), again with the size of the number of buyers. This array represents the total cash each buyer has available. Now, you need to compute the maximum number of shares each buyer can purchase:

 9 n_shares_per_buyer = money_available / prices
10 print(prices, money_available, n_shares_per_buyer, sep="\n")

In line 9, you take the ratio of the money_available with prices to determine the maximum number of shares each buyer can purchase. Finally, you print each of these arrays separated by a newline. The output is shown below:

[0.77132064 0.02075195 0.63364823 0.74880388 0.49850701 0.22479665
 0.19806286 0.76053071 0.16911084 0.08833981]
[1 1 1 3 1 3 3 2 1 1]
[ 1.29647768 48.18824404  1.57816269  4.00638948  2.00598984 13.34539487
 15.14670609  2.62974258  5.91328161 11.3199242 ]

The first row is the array of prices, which are floating-point numbers between 0 and 1. This row is followed by the maximum cash available, in integers from 1 to 3. Finally, you see the number of shares each buyer can purchase.

Now, you need to create the constraints and bounds for the solver. The constraint is that the sum of the total purchased shares can’t exceed the total number of shares available. This is a constraint rather than a bound because it involves more than one of the solution variables.

To represent this mathematically, you could say that x[0] + x[1] + ... + x[n-1] = n_shares, where n is the total number of buyers. More succinctly, you could take the dot or inner product of a vector of ones with the solution values, and constrain that to be equal to n_shares. Remember that LinearConstraint takes the dot product of the input array with the solution values and compares it to the lower and upper bound. You can use this to set up the constraint on n_shares:

11 constraint = LinearConstraint(np.ones(n_buyers), lb=n_shares, ub=n_shares)

In this code, you create an array of ones with the length n_buyers and pass it as the first argument to LinearConstraint. Since LinearConstraint takes the dot product of the solution vector with this argument, it’ll result in the sum of the purchased shares.

This result is then constrained to lie between the other two arguments:

The lower bound lb
The upper bound ub

Since lb = ub = n_shares, this is an equality constraint because the sum of the values must be equal to both lb and ub. If lb were different from ub, then it would be an inequality constraint.
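The check that such a constraint expresses can be sketched in plain Python. The helper name satisfies_linear_constraint, the three-buyer example values, and the tolerance are all hypothetical and only illustrate the comparison lb <= a · x <= ub; this is not how the solver evaluates the constraint internally.

```python
def satisfies_linear_constraint(coefficients, x, lb, ub, tol=1e-9):
    """Return True if lb <= coefficients . x <= ub, within a small tolerance."""
    dot = sum(c * v for c, v in zip(coefficients, x))
    return lb - tol <= dot <= ub + tol


ones = [1.0, 1.0, 1.0]  # one coefficient per (hypothetical) buyer
n_shares = 15

# Equality form (lb == ub): the shares must sum to exactly n_shares.
print(satisfies_linear_constraint(ones, [5.0, 4.0, 6.0], n_shares, n_shares))
print(satisfies_linear_constraint(ones, [5.0, 4.0, 5.0], n_shares, n_shares))

# Inequality form (lb < ub): any total between 0 and n_shares is acceptable.
print(satisfies_linear_constraint(ones, [5.0, 4.0, 5.0], 0, n_shares))
```

With a vector of ones as the coefficients, the dot product reduces to the plain sum of the solution values, which is why this form can express a limit on the total number of shares.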

Next, create the bounds for the solution variable. The bounds limit the number of shares purchased to be 0 on the lower side and n_shares_per_buyer on the upper side. The format that minimize() expects for the bounds is a sequence of tuples of lower and upper bounds:

12 bounds = [(0, n) for n in n_shares_per_buyer]

In this code, you use a comprehension to generate a list of tuples for each buyer. The last step before you run the optimization is to define the objective function. Recall that you’re trying to maximize your income. Equivalently, you want to make the negative of your income as large a negative number as possible.

The income that you generate from each sale is the price that the buyer pays multiplied by the number of shares they’re buying. Mathematically, you could write this as prices[0]*x[0] + prices[1]*x[1] + ... + prices[n-1]*x[n-1], where n is again the total number of buyers.

Once again, you can represent this more succinctly with the inner product, or x.dot(prices). This means that your objective function should take the current solution values x and the array of prices as arguments:

13 def objective_function(x, prices):
14     return -x.dot(prices)

In this code, you define objective_function() to take two arguments. Then you take the dot product of x with prices and return the negative of that value. Remember that you have to return the negative because you’re trying to make that number as small as possible, or as close to negative infinity as possible. Finally, you can call minimize():

15 res = minimize(
16     objective_function,
17     x0=10 * np.random.random(n_buyers),
18     args=(prices,),
19     constraints=constraint,
20     bounds=bounds,
21 )

In this code, res is an instance of OptimizeResult, just like with minimize_scalar(). As you’ll see, there are many of the same fields, even though the problem is quite different. In the call to minimize(), you pass five arguments:

objective_function: The first positional argument must be the function that you’re optimizing.
x0: The next argument is an initial guess for the values of the solution. In this case, you’re just providing a random array of values between 0 and 10, with the length of n_buyers. For some algorithms or some problems, choosing an appropriate initial guess may be important. However, for this example, it doesn’t seem too important.
args: The next argument is a tuple of other arguments that are necessary to be passed into the objective function. minimize() will always pass the current value of the solution x into the objective function, so this argument serves as a place to collect any other input necessary. In this example, you need to pass prices to objective_function(), so that goes here.
constraints: The next argument is a sequence of constraints on the problem. You’re passing the constraint you generated earlier on the number of available shares.
bounds: The last argument is the sequence of bounds on the solution variables that you generated earlier.

Once the solver runs, you should inspect res by printing it:

     fun: -8.783020157087366
     jac: array([-0.77132058, -0.02075195, -0.63364816, -0.74880385,
        -0.4985069, -0.22479665, -0.19806278, -0.76053071, -0.16911077,
        -0.08833981])
 message: 'Optimization terminated successfully.'
    nfev: 204
     nit: 17
    njev: 17
  status: 0
 success: True
       x: array([1.29647768e+00, 3.94665456e-13, 1.57816269e+00, 4.00638948e+00,
       2.00598984e+00, 3.48323773e+00, 5.55111512e-14, 2.62974258e+00,
       5.37143977e-14, 1.34606983e-13])

In this output, you can see message and status indicating the final state of the optimization. For this optimizer, a status of 0 means the optimization terminated successfully, which you can also see in the message. Since the optimization was successful, fun shows the value of the objective function at the optimized solution values. You’ll make an income of $8.78 from this sale.

You can see the values of x that optimize the function in res.x. In this case, the result is that you should sell about 1.3 shares to the first buyer, zero to the second buyer, 1.6 to the third buyer, 4.0 to the fourth, and so on.

You should also check and make sure that the constraints and bounds that you set are satisfied. You can do this with the following code:

22 print("The total number of shares is:", sum(res.x))
23 print("Leftover money for each buyer:", money_available - res.x * prices)

In this code, you print the sum of the shares purchased by each buyer, which should be equal to n_shares. Then, you print the difference between each buyer’s cash on hand and the amount they spent. Each of these values should be non-negative, allowing for floating-point round-off. The output from these checks is shown below:

The total number of shares is: 15.0
The amount each buyer has leftover is: [4.78506124e-14 1.00000000e+00
 4.95159469e-14 9.99200722e-14 5.06261699e-14 2.21697984e+00 3.00000000e+00
 9.76996262e-14 1.00000000e+00 1.00000000e+00]
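These checks can also be reproduced without rerunning the solver, using only the rounded values printed earlier in this section. The sketch below copies the printed prices, money_available, and res.x arrays into plain lists; the tolerance TOL is an assumption that allows for the roughly 8-digit rounding of the printed numbers.

```python
prices = [0.77132064, 0.02075195, 0.63364823, 0.74880388, 0.49850701,
          0.22479665, 0.19806286, 0.76053071, 0.16911084, 0.08833981]
money_available = [1, 1, 1, 3, 1, 3, 3, 2, 1, 1]
x = [1.29647768e+00, 3.94665456e-13, 1.57816269e+00, 4.00638948e+00,
     2.00598984e+00, 3.48323773e+00, 5.55111512e-14, 2.62974258e+00,
     5.37143977e-14, 1.34606983e-13]

TOL = 1e-6  # assumed tolerance; the printed values are rounded

total_shares = sum(x)
income = sum(p * s for p, s in zip(prices, x))
leftover = [m - p * s for m, p, s in zip(money_available, prices, x)]

# The equality constraint: all 15 shares are sold.
shares_ok = abs(total_shares - 15) < TOL
# The bounds: no negative purchases and no buyer spends more than they have.
bounds_ok = all(s >= -TOL for s in x) and all(left >= -TOL for left in leftover)

print(f"total shares: {total_shares:.6f}")
print(f"income: {income:.2f}")
print("constraint satisfied:", shares_ok)
print("bounds satisfied:", bounds_ok)
```

Comparing with a tolerance rather than exact equality matters here, because several buyers spend essentially all of their money and the leftovers are on the order of 1e-14.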

As you can see, all of the constraints and bounds on the solution were satisfied. Now you should try changing the problem so that the solver can’t find a solution. Change n_shares to a value of 1000, so that you’re trying to sell 1000 shares to these same buyers. When you run minimize(), you’ll find that the result is as shown below:

     fun: nan
     jac: array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])
 message: 'Iteration limit exceeded'
    nfev: 2160
     nit: 101
    njev: 100
  status: 9
 success: False
       x: array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])

Notice that the status attribute now has a value of 9, and the message states that the iteration limit has been exceeded. There’s no way to sell 1000 shares given the amount of money each buyer has and the number of buyers in the market. However, rather than raising an error, minimize() still returns an OptimizeResult instance. You need to make sure to check the status code before proceeding with further calculations.
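One way to make that check hard to forget is a small guard that refuses to hand back the solution unless the solver reported success. This is a minimal sketch: require_success is a hypothetical helper, and the failed object is a stand-in built with SimpleNamespace to mimic the fields shown above, not the output of a real solver run.

```python
from types import SimpleNamespace


def require_success(res):
    """Return res.x if the solver reported success, otherwise raise RuntimeError."""
    if not res.success:
        raise RuntimeError(f"optimization failed (status {res.status}): {res.message}")
    return res.x


# Stand-in mimicking the failed result shown above (assumption, not solver output).
failed = SimpleNamespace(success=False, status=9,
                         message="Iteration limit exceeded", x=None)

try:
    require_success(failed)
except RuntimeError as err:
    print(err)
```

Because the guard only reads the success, status, message, and x attributes, it should work the same way on any result object that exposes those fields.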

Conclusion

In this tutorial, you learned about the SciPy ecosystem and how that differs from the SciPy library. You read about some of the modules available in SciPy and learned how to install SciPy using Anaconda or pip. Then, you focused on some examples that use the clustering and optimization functionality in SciPy.

In the clustering example, you developed an algorithm to sort spam text messages from legitimate messages. Using kmeans(), you found that messages with more than about 20 digits are extremely likely to be spam!

In the optimization example, you first found the minimum value in a mathematically clear function with only one variable. Then, you solved the more complex problem of maximizing your income from selling stocks. Using minimize(), you found the optimal number of stocks to sell to a group of buyers and earned an income of $8.78!

SciPy is a huge library, with many more modules to dive into. With the knowledge you have now, you’re well equipped to start exploring!

↧

Podcast.__init__: Building A Business On Building Data Driven Businesses

January 20, 2020, 8:01 am

In order for an organization to be data driven they need easy access to their data and a simple way of sharing it. Arik Fraimovich built Redash as a way to address that need by connecting to any data source and building attractive dashboards on top of them. In this episode he shares the origin story of the project, his experiences running a business based on open source, and the challenges of working with data effectively.

Summary

Announcements

Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to pythonpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host as usual is Tobias Macey and today I’m interviewing Arik Fraimovich about Redash, an open source business intelligence platform that helps you make sense of your data.

Interview

Introductions
How did you get introduced to Python?
Can you start by describing what Redash is and its origin story?
- What are the primary ways that it is used?
- The business intelligence market is quite mature and has many commercial and open source projects to choose from. What are the aspects of Redash that have allowed you to be successful?
- What would you consider to be your closest competitors?
What was your background with data before starting on Redash?
- What are some of the most notable lessons that you have learned about business intelligence since starting the project?
- How has the landscape for business intelligence and data analysis changed since you began the project?
Beyond just accessing data, Redash focuses on enabling visualization of the results. What types of visualizations do you support and how do you support users in choosing the most effective ways to represent the information?
What are some of the common challenges that your users and customers encounter when communicating with data?
One of the critical aspects of enabling data access in an organization is the ability to collaborate on asking and answering questions. How do you approach that challenge in Redash?
How is Redash implemented and how has the overall design and architecture evolved since you first started working on it?
- How do you manage the complexity of supporting so many different data sources?
- If you were to start over today, what would you do differently?
Beyond the code of Redash, you also have a business around providing it as a hosted service. What are some of the most interesting, challenging, or unexpected lessons that you have learned in the process of building and growing that service?
How do you approach the direction and governance of the open source project and balance that against the wants and needs of the community?
What are some of the most interesting, innovative, or unexpected ways that you have seen Redash used?
When is Redash the wrong platform to use?
What do you have planned for the future of the Redash business and project?

Keep In Touch

arikfr on GitHub
Website
@arikfr on Twitter

Picks

Tobias
- Data Engineering Podcast
Arik
- Peewee ORM
- Amazon ECS

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, the Data Engineering Podcast for the latest on modern data management.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@podcastinit.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at pythonpodcast.com/chat

Links

The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA

↧

Catalin George Festila: Python 3.7.5 : Django security issues - part 002.

January 20, 2020, 3:57 am

The project can be found at this GitHub project. Let's start with my default project and activate the env:

[mythcat@desk ~]$ cd projects/
[mythcat@desk projects]$ cd django/
[mythcat@desk django]$ source env/bin/activate

Let's install this python module:

(env) [mythcat@desk django]$ pip3 install django-axes --user

Make these changes in settings.py:

(env) [mythcat@desk django]$ cd mysite/
(env) [

↧

Python Circle: Improving python code performance by using lru_cache decorator

January 20, 2020, 11:46 am

Store the results of repeated Python function calls in a cache. This post covers improving Python code performance with the lru_cache decorator, caching the results of Python functions, and memoization in Python.
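As a minimal sketch of the idea (not taken from the linked post), the standard library's functools.lru_cache can memoize a recursive function so that each distinct argument is computed only once. The fib function is a hypothetical example chosen because its naive form recomputes the same subproblems many times.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fib(n):
    """Return the n-th Fibonacci number; repeated subproblems come from the cache."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)


print(fib(30))           # 832040
print(fib.cache_info())  # hit/miss statistics kept by lru_cache
```

Passing maxsize=None makes the cache unbounded. A finite maxsize evicts the least recently used entries instead, which is usually the safer choice when the set of possible arguments is large.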

↧

Test and Code: 98: pytest-testmon - selects tests affected by changed files and methods - Tibor Arpas

January 21, 2020, 12:00 am

pytest-testmon is a pytest plugin which selects and executes only tests you need to run. It does this by collecting dependencies between tests and all executed code (internally using Coverage.py) and comparing the dependencies against changes. testmon updates its database on each test execution, so it works independently of version control.

In this episode, I talk with testmon creator Tibor Arpas about testmon, about its use and how it works.

Special Guest: Tibor Arpas.

Python Circle: Solving Python Error- KeyError: 'key_name'

January 21, 2020, 5:46 am

This post covers solving KeyError in Python: how to handle KeyError with Python dictionaries, how to safely access and delete dictionary keys, and how to use try/except with KeyError.
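The three approaches named above can be sketched with built-in dictionary behavior only. The settings dictionary and its keys are hypothetical, and the snippet is an illustration rather than code from the linked post.

```python
settings = {"host": "localhost"}  # hypothetical dictionary for illustration

# 1. dict.get returns a default instead of raising KeyError.
port = settings.get("port", 8080)

# 2. try/except handles the missing key explicitly.
try:
    user = settings["user"]
except KeyError:
    user = "anonymous"

# 3. pop with a default deletes a key safely even when it is absent.
removed = settings.pop("password", None)

print(port, user, removed)
```

Which form fits best depends on whether a missing key is an expected case (get or pop with a default) or a genuine error worth handling separately (try/except).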

↧

Real Python: Basic Data Types in Python

January 21, 2020, 6:00 am

In this step-by-step course, you’ll dig into the basic data types that are built into Python.

By the end of this course:

You’ll learn about several basic numeric, string, and Boolean types that are built into Python.
You’ll see what objects of these types look like and how you can represent them.
You’ll get an overview of Python’s built-in functions, which are pre-written chunks of code that you can call to do useful things.

↧

PyCoder’s Weekly: Issue #404 (Jan. 21, 2020)

January 21, 2020, 11:30 am

Comparing Python, Go, and C++ on the N-Queens Problem (PDF)

“Python currently is the dominant language in the field of Machine Learning but is often criticized for being slow to perform certain tasks. In this report, we use the well-known N-queens puzzle as a benchmark to show that once compiled using the Numba compiler it becomes competitive with C++ and Go in terms of execution speed while still allowing for very fast prototyping.”
PASCAL FUA, KRZYSZTOF LIS

Codemodding Python: Unittest Asserts to Python Asserts

Large codebases require continued maintenance, but it is time-consuming and cumbersome to change portions of code scattered around many files. This article shows how to write codemods to refactor Python code using its Abstract Syntax Tree—gaining far more granular control than basic regex and search-replace.
HANS-WILHELM WARLO

Automate & Standardize Code Reviews for Python

Take the hassle out of code reviews - Codacy flags errors automatically, directly from your Git workflow. Customize standards on coverage, duplication, complexity & style violations. Use in the cloud or on your servers for 30 different languages. Get started for free →
CODACY sponsor

Arcade: A Primer on the Python Game Framework

In this step-by-step tutorial, you’ll learn how to use arcade, a modern Python framework for crafting games with compelling graphics and sound. Object-oriented and built for Python 3.6 and up, arcade provides you a modern set of tools for crafting great Python game experiences.
REAL PYTHON

Creating a Simple Python Pip Repository

“I wanted the simplest (i.e. most lightweight) possible repository capable of serving packages in such a way as that Python’s pip would be able to install them.”
JAN-PIET MENS

Writing a Polyglot Script

Python and Ruby have somewhat similar syntaxes. Could you come up with a program that’s valid in both languages?
NKANAEV.GITHUB.IO

Python Jobs

Articles & Tutorials

Effectively Using Matplotlib

“Now that I have taken the time to learn some of these tools and how to use them with matplotlib, I have started to see matplotlib as an indispensable tool. This post will show how I use matplotlib and provide some recommendations for users getting started”
CHRIS MOFFITT

Scientific Python: Using SciPy for Optimization

Learn about the SciPy ecosystem and how it differs from the SciPy library. You’ll learn how to install SciPy using Anaconda or pip and see some of its modules. Then, you’ll focus on examples that use the clustering and optimization functionality in SciPy.
REAL PYTHON

Python 2 EOL: Survey Results

ActiveState surveyed Python users on how they’ve been preparing for Python 2 End of Life. Find out how your plans stack up, and which pitfalls to avoid. Get the survey results →
ACTIVESTATE sponsor

Basic Data Types in Python

In this course, you’ll learn the basic data types that are built into Python, like numbers, strings, and Booleans. You’ll also get an overview of Python’s built-in functions.
REAL PYTHON video

Unexpected Results With `open()` and CPython

“Misusing Python’s open() and the interaction of CPython’s GC and UNIX semantics can lead to unexpected results.”
JAVIER HONDUVILLA COTO

Having Some Fun With Python

Writing obfuscated code for fun and…great learning experiences! ;-)
JAMES BENNETT

Monads for Mortals (in Python)

Implementing and testing the identity monad in Python.
KNOWSUCHAGENCY.COM

Measure and Improve Python Code Performance with Blackfire.io

Profile in development, test/staging, and production—with no overhead for end users! Blackfire supports any Python version from 2.7.x and 3.x. Find bottlenecks in wall-time, I/O, CPU, memory, HTTP requests and SQL queries.
BLACKFIRE sponsor

Using LLVM and Arrow to JIT and Evaluate Pandas Expressions

CHRISTIAN PERONE

5 Refactoring Tips to Improve Your Python Code

NICK THAPEN

Filtering Pandas DataFrames With Multiple Conditions

VINAY BABU

On Writing a Safe `repr()`

IONEL CRISTIAN MĂRIEȘ

Projects & Code

voice-flight-following: Announce Nearby Cities Based on Aircraft Position in FSX and P3D

GITHUB.COM/JFAYRE

pypistats: PyPI Download Stats

PYPISTATS.ORG

Typer: Build CLIs With Python Type Hints

TIANGOLO.COM

sql2json: Run a SQL Query and Convert Result to JSON

GITHUB.COM/FSISTEMAS

python-hunter: Flexible Code Tracing Toolkit

GITHUB.COM/IONELMC

CrossHair: SMT-assisted Testing for Python

GITHUB.COM/PSCHANELY • Shared by Phillip Schanely

openpilot: Open Source Driver Assistance System

GITHUB.COM/COMMAAI

JustCause: Compare Methods for Causal Inference

JUSTCAUSE.READTHEDOCS.IO • Shared by Florian Wilhelm

Speck: Line-Art Image Renderer Using Matplotlib

GITHUB.COM/LUCASHADFIELD • Shared by Lucas Hadfield

Events

Python Meeting Düsseldorf

January 23, 2020
EGENIX.COM

PythOnRio Meetup

January 25, 2020
PYTHON.ORG.BR

Python Ho Monthly Meetup

January 26, 2020
EVENTBRITE.COM

Inland Empire Pyladies (CA, USA)

January 27, 2020
MEETUP.COM

Python Sheffield

January 28, 2020
GOOGLE.COM

Heidelberg Python Meetup

January 29, 2020
MEETUP.COM

PiterPy Breakfast

January 29, 2020
TIMEPAD.RU

SPb Python Drinkup

January 30, 2020
MEETUP.COM

PyCascades (10% Discount)

February 8th & 9th in Portland, OR. Get a 10% discount on your ticket courtesy of PyCoder’s with this link.

Happy Pythoning!
This was PyCoder’s Weekly Issue #404.

↧

Catalin George Festila: Python 3.7.5 : Use Django Formsets.

January 21, 2020, 4:20 am

Django Formsets manage the complexity of multiple copies of a form in a view. This simplifies the task of creating a formset for a form that handles multiple instances of a model. Let's start with my old project:

[mythcat@desk ~]$ cd projects/
[mythcat@desk projects]$ cd django/
[mythcat@desk django]$ source env/bin/activate

Into models.py I add these classes:

#create Inline Form with book and

↧

Brett Cannon: A quick-and-dirty guide on how to install packages for Python

January 21, 2020, 2:18 pm

When people start learning Python, they often will come across a package they want to try and it will usually start with "just pip install it!" The problem with that advice is it's a very simplistic view of how to manage packages and can actually lead to problems down the road. And while there is a tutorial on installing packages at packaging.python.org, it might be a bit intimidating for some if they are just looking to quickly get up and going.

If you just want to start poking at Python and want to avoid the pitfalls of installing packages globally, it only takes 3 steps to do the right thing.

Summary

Create a virtual environment, e.g. python3.8 -m venv .venv (substitute py -3.8 for python3.8 if necessary)
Activate the virtual environment, e.g. source .venv/bin/activate.fish (assuming you are using the fish shell)
Install the packages you want, e.g. python -m pip install --upgrade pip if you wanted to install the latest version of pip (which you probably do)

Do note that there is a fancier version of step 1 explained below. This post also covers versions of step 2 for other shells.

Details

Step 1: Create a virtual environment

The first step is creating a virtual environment. You want this because you want to isolate what you install from your global Python installation. Thanks to Python being widely used to run operating systems now, you can actually break your OS if you install directly into your global interpreter. So please, always use some form of isolated environment (I will be using virtual environments because they are built into Python itself, they are lightweight, and VS Code has great support for them 😁).

I will also say that when creating your virtual environment you should always do it by specifying the specific version of Python that you want. You will notice below that I use commands like python3.9 or python3.8 and not simply python3. That's to make sure you are getting the version of Python you want.

Also, don't commit your virtual environment to your version control system. If you want to protect yourself against doing this then the commands below will create a .venv directory where you can put a .gitignore file that contains nothing more than * (if you are using UNIX you can do this with echo "*" > .venv/.gitignore after you create your virtual environment).

Lastly, below you will notice me using the --prompt flag. It's totally optional and can be left out on any OS. All it does is make your shell prompt a little more informative for step 2 later on by giving your virtual environment the base name of your current directory, so it's a nice touch but if you don't want to bother with the flag it won't affect anything.

Warning for Debian/Ubuntu users

If you are using a Debian-based OS (e.g. Ubuntu) and you are using the OS-installed version of Python, make sure to install python3-venv via apt. If you installed from python.org or any other means then you will very likely have what you need.

Note for Windows users

Below I use py as that is what comes with the python.org installer. If for some reason it isn't installed then don't worry and substitute e.g. py -3.8 with python3.8.

Python 3.9 and newer

As of this writing Python 3.9 is not out yet, but it will be, come October 2020, and hopefully this blog post will still be around by then, so I cover a neat new feature that's coming which makes --prompt easier to use.

UNIX

python3.9 -m venv --prompt . .venv

Windows

py -3.9 -m venv --prompt . .venv

Python 3.8 and older

fish shell

python3.8 -m venv --prompt (basename $PWD) .venv

bash shell

python3.8 -m venv --prompt `basename $PWD` .venv

PowerShell

py -3.8 -m venv --prompt "$(Split-Path -Leaf -Path (Get-Location).Path)" .venv
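The same step can be scripted from Python with the standard library's venv module, which is what the python -m venv command runs. This is only a sketch under a few assumptions: create_ignored_venv is a hypothetical helper, the directory name is used as the prompt to mimic the --prompt examples above, and with_pip=False is chosen to keep the demo fast (the command-line form installs pip by default).

```python
import tempfile
import venv
from pathlib import Path


def create_ignored_venv(project_dir):
    """Create project_dir/.venv and put a .gitignore containing '*' inside it."""
    env_dir = Path(project_dir) / ".venv"
    # Use the project directory's base name as the prompt, like the --prompt examples.
    builder = venv.EnvBuilder(with_pip=False, prompt=Path(project_dir).name)
    builder.create(env_dir)
    # Ignore everything in the environment so it is never committed by accident.
    (env_dir / ".gitignore").write_text("*\n")
    return env_dir


with tempfile.TemporaryDirectory() as project:  # throwaway directory for the demo
    env = create_ignored_venv(project)
    print((env / "pyvenv.cfg").exists())
    print((env / ".gitignore").read_text().strip())
```

Running the sketch leaves nothing behind, because the temporary project directory is deleted when the with block exits. In a real project you would pass your own project path instead.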

Step 2: Activate your virtual environment

Activating your virtual environment makes it so that when you type python it will point to the interpreter in your virtual environment (as well as making any tools that get installed by any packages you install later available). This step is completely optional but it is rather handy. If you used --prompt in step 1 then your shell prompt will say the directory name that contains the virtual environment to remind you what python points at.

If at any point you want to turn off the activation of your shell you can run deactivate (closing/quitting your shell also works as activations are not permanent).

fish shell

source .venv/bin/activate.fish

bash shell

source .venv/bin/activate

PowerShell

You will notice that there are two commands here. The first command only needs to be run once and it lowers the security level of PowerShell to allow PowerShell scripts which are signed but whose signing key isn't installed in the OS to be run (the activation script is signed by the Python Software Foundation as of Python 3.8). If you don't run this command you will get an error. It's no big deal and you can still go and change the execution policy as needed.

Set-ExecutionPolicy RemoteSigned
.venv\Scripts\Activate.ps1
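
To confirm what python points at after activation, a small check like the following can help. It is a sketch rather than part of the steps themselves, and the function name in_virtual_environment is made up for illustration.

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    # In a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the Python it was created from.
    return sys.prefix != sys.base_prefix


print(sys.executable)
print("virtual environment:", in_virtual_environment())
```

Run it with plain python after activating; the printed executable path should sit inside your .venv directory.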

Step 3: Install your package(s)

If you have seen instructions to "just" run pip to install something, you may notice my suggestion below is a little different. I have an entire blog post on why, but the short answer is you want to make sure you are installing into your virtual environment and not your global installation of Python by accident (and the whole reason the first two steps of this blog post exist).

I am also using pip as the example project to install with an extra --upgrade flag so that you upgrade pip in your virtual environment. Chances are there's a new version of pip available compared to what gets installed by default and thus this is not only illustrative but also does something useful for you. Plus it's an easy way to make sure your virtual environment is working as expected.

If you activated your environment

python -m pip install --upgrade pip

If you didn't activate your environment

UNIX

.venv/bin/python -m pip install --upgrade pip

Windows

.venv\Scripts\python -m pip install --upgrade pip
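
The two unactivated variants differ only in where the interpreter lives inside the environment. Here is a hedged sketch of how a script could pick the right path and then run pip through it; venv_python and pip_install are hypothetical helper names, and the environment is assumed to live in .venv.

```python
import os
import subprocess
from pathlib import Path


def venv_python(env_dir=".venv"):
    """Return the path to the interpreter inside a virtual environment."""
    env_dir = Path(env_dir)
    if os.name == "nt":
        # Windows keeps executables under Scripts
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def pip_install(*packages, env_dir=".venv", upgrade=False):
    """Run `python -m pip install` with the environment's own interpreter."""
    cmd = [str(venv_python(env_dir)), "-m", "pip", "install"]
    if upgrade:
        cmd.append("--upgrade")
    cmd.extend(packages)
    return subprocess.run(cmd, check=True)
```

With that in place, pip_install("pip", upgrade=True) mirrors the commands above without relying on activation.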

That's it!

Following the steps above you ended up with:

A virtual environment which isolated what you installed from your global installation of Python
A shell that's activated for the virtual environment to make it a little easier to use Python
An updated installation of pip

If at any point you don't want your virtual environment anymore you can simply delete the .venv directory that the environment was kept in. And if you want to use a newer version of Python later on, just delete the .venv directory and create a new virtual environment (they are meant to be throw-away things). And remember not to commit your virtual environment to source control.

Please note there are ways to properly manage your package dependencies long-term if this is not a throw-away virtual environment just for experimentation. That topic deserves its own blog post, so I am not even going to attempt to delve into it here. Just know that if you are doing work that is meant to persist you will want to use something that will help you keep track of what packages your code depends on.

I also fully admit that this blog post is opinionated. It's short and bare-bones on purpose using my preferred way to quickly get up and going. Just be aware that there are tools which can help automate these 3 steps if you want. There are even alternative isolation environments other than virtual ones that might fit your needs better. We are rather lucky in the Python community to have so many passionate users that if something feels awkward then someone has probably come up with a solution that might improve the experience for you.

↧

Python Bytes: #165 Ranges as dictionary keys - oh my!

January 21, 2020, 12:00 am


↧

Mike Driscoll: PyDev of the Week: Sebastián Ramírez

January 19, 2020, 10:05 pm


This week we welcome Sebastián Ramírez (@tiangolo) as our PyDev of the Week! Sebastián is the creator of the FastAPI Python web framework. He maintains his own website/blog, which you should check out if you have some free time; his open source projects are listed there too. You can also see what projects he is contributing to over on GitHub.

Let’s take a few moments to get to know Sebastián better!

Sebastián Ramírez

Can you tell us a little about yourself (hobbies, education, etc):

Hey! I’m Sebastián Ramírez, I’m from Colombia, and currently living in Berlin, Germany.

I was “homeschooled” from a young age; there wasn’t even a term for it, as it wasn’t common. I didn’t go to school or university; I studied everything at home. At about 14 (I think) I started fiddling with video editing and visual effects, some music production, and then graphic design to help with my parents’ business.

Then I thought that building a website should be almost the same …soon I realized I had to learn some of those scary “programming languages”. HTML, CSS, and JavaScript (“but!!! HTML and CSS are not…” I know, I know). But soon I was able to write a very short text, in a text file, and use it to make a browser show a button, that when clicked would show a pop-up saying “Hello world!”… I was so proud and excited about it, I guess it was a huge “I maked these” moment for me. I still feel that rush, that excitement from time to time. That’s what makes me keep loving code.

I also like to play videogames and watch movies, but many times I end up just coding in my free time too. I’m boring like that…

Why did you start using Python?

At some point, I was taking several (too many) courses on Coursera, edX, and Udacity. I knew mainly frontend vanilla JavaScript (Node.js was just starting), so I did all the exercises for the Cryptography, Algorithms, and other courses with JavaScript running in a browser, it sounds a bit crazy now.

Then I took Andrew Ng’s ML course on Coursera, it used Octave (kinda Matlab) and it taught me enough Octave/Matlab for the course, and also that learning a new language was not so terrible. But then an AI course from Berkeley/edX required Python… so I took the Python crash course that was embedded (it was just like one page). And I went into the AI course with that. I loved the course, and with it, I started to love Python. I had to read a lot of Python docs, tutorials, StackOverflow, etc. just to be able to keep the pace, but I loved it. After that, I took an MIT/edX Python course and several others.

And I just kept learning and loving Python more and more.

What other programming languages do you know and which is your favorite?

I’m quite fond of JavaScript as it was my first language. I have also used some compile-to-JS languages like CoffeeScript, TypeScript. I have also ended up doing quite some Bash for Linux and Docker.

I really like TypeScript, and now I almost never do plain JS without TS, I love having autocompletion everywhere and type checks for free. I naturally got super excited when optional type hints for Python were released as a Christmas gift in 2016. And 2 years later FastAPI came to be, heavily based on them.

What projects are you working on now?

I put a lot of my free time to FastAPI and sibling projects, and also some of the other open source tools I’ve built.

Right now I’m working for Explosion AI. They are the creators of spaCy, the open source, industrial-strength, Natural Language Processing package.

At work, I’m currently on the team building the teams version of Prodigy, a commercial tool for radically efficient machine teaching, using Active (Machine) Learning.

But as open source is very important for the company (because they’re awesome like that), I also devote part of my working time to FastAPI and family.

Which Python libraries are your favorite (core or 3rd party)?

Core, I would say typing, as it’s relatively new and deserves more attention. I think not many people know that those optional type hints are what power autocompletion and automatic type checks for errors in editors. Most developers love those features, but few know that type hints are what power them.

3rd party, I think naturally Starlette and Pydantic, as they power FastAPI.

But I think Pydantic also deserves a lot more attention, even outside of FastAPI. It’s an amazing library, really easy to use, and saves a lot of time debugging, validating, documenting, and parsing data. It’s also great for managing application settings and just moving data around in an app. Imagine using deeply nested dicts and lists of values, but not having to remember what is what everywhere (“did I write ‘username’ or ‘user_name’ as the key in the other function?” ), just having autocomplete for everything and automatic error checks (type checks).

I recently built a GitHub action to help me manage issues, and most of the work ended up being done automatically by Pydantic. It also works great for data science, cleaning and structuring data.

This list could probably grow a lot, but some highlights:

* Dev utils: Poetry or Pipenv, Black, Isort, Flake8, Autoflake8, Mypy, Pytest, Pytest-cov

* For docs: Mkdocs with Mkdocs-material and Markdown-include

* Others: Cookiecutter, Requests or HTTPX, Uvicorn

* Data Science/Processing, ML: Keras with TensorFlow or PyTorch, Numpy, PyAV, Pandas, Numba, and of course, spaCy and Prodigy

Is there anything else you’d like to say?

I love the Python community, I think it’s a friendly ecosystem and I would like all of us to help it be even more welcoming, friendly, and inclusive. I think we all can help in achieving that.

New developers: don’t be shy, you can help too. Updating documentation of a new tool you are learning is a great start.

Maintainers: help us build a friendly ecosystem, it’s difficult for a new developer to come and try to help. Please be nice.

—————————————————————–

Here are a couple of others that you can answer if you want to, but if you don’t have the time, that’s ok:

How did your project, FastAPI, come about?

I had spent years finding the right tools and plug-ins (even testing other languages with their frameworks) to build APIs.

I wanted to have automatic docs; data validation, serialization, and documentation; I wanted it to use open standards like OpenAPI, JSON Schema, and OAuth2; I wanted it to be independent of other things, like database and ORM, etc.

I had somewhat achieved it with some components from several places, but it was difficult to use and somewhat brittle, as there were a lot of components and plug-ins, and I had to somehow make them interact well together.

I also discovered that with types, as in TypeScript, it was possible to have autocompletion and checks for many errors (type checks). But then Python added optional type hints!

And after searching everywhere for a framework that used them and did all that, and finding that it didn’t exist yet, I used all the great ideas brought by previous tools with some of mine to integrate all those features in a single package.

I also wanted to provide a development experience as pleasant as possible, with as small/simple code as possible, while having great performance (powered by the awesome tools underneath, Starlette and Pydantic).

What top three things did you learn while creating the package?

First, that it was possible. I thought building a package that others found useful was reserved for some olympian-semi-god coders. It turns out that if there’s something to solve, and you solve it, and you help others use it to solve the same thing, that’s all that is needed.

Second, I learned a lot about how Python interacts with the web. FastAPI uses the new standard ASGI (the spiritual successor to WSGI), and I learned a lot about it, especially by reading the beautiful and clean code of Starlette.

Third, I learned a lot about how Python works underneath by adding features to Pydantic. To be able to provide all its awesome features and the great simplicity while using it, its own internal code has to be, naturally, very complex. I even learned about undocumented features of Python’s internal typing parsing, that are needed to make everything work.

But I don’t think that a new developer needs to learn the last 2 things, the first one is the most important one. And as I was able to build FastAPI using the great tools and ideas provided by others, I hope FastAPI can provide a simple and easy way for others to build their ideas.

Do you have any advice for other aspiring package creators?

Write docs for your package. It doesn’t exist completely if it’s not well documented. And write them from the point of view of a new user, not of your own.

Also, building and publishing a new package is now extremely easy. Use Flit or Poetry if your project is simple enough to use them (i.e. pure Python, you are not building with Cython extensions, etc).

The post PyDev of the Week: Sebastián Ramírez appeared first on The Mouse Vs. The Python.

↧

EuroPython: EuroPython 2020: Pre-launch Website Ready

January 22, 2020, 2:59 am


In the last couple of weeks we have put together a pre-launch site for EuroPython 2020, which has all the information around the event, as we currently know and can share with you.

https://ep2020.europython.eu/

The main website will go online in early March, and we plan to open the CFP and ticket sales around that time as well. It will use the same URL, so you can keep this bookmarked.

Some additional updates:

We have signed the venue agreement, so if you have waited with your flight and hotel bookings for the final confirmation of the conference dates, you can now go ahead. Looking at the hotel booking websites, it’s probably a good idea to book early.
We will now enter negotiations with the booth builder and finalize the sponsorship packages. As last year, we’ll again offer an early bird 10% discount for sponsors who sign up in the first few weeks after we’ve published the finalized packages on our blog.

Enjoy,
–
EuroPython 2020 Team
https://ep2020.europython.eu/

↧

PyCharm: PyCharm 2019.3.2

January 22, 2020, 4:42 am


We’ve been taking some time to polish PyCharm further, so be sure to update to the newest version! You can get it from within PyCharm (Help | Check for Updates), using JetBrains Toolbox, or by downloading the new version from our website.

Improved in PyCharm

An issue where PyCharm’s debugger would ignore breakpoints in certain conditions has been resolved
Running code on remote interpreters on FreeBSD with elevated privileges now works as expected
There are many small differences between SQL dialects, and we’re always working hard to make sure that our database tooling gets them all right. Fixed in this version are: \gset for PostgreSQL, MEMBER OF for MySQL, and more. Open the Database tool window in PyCharm Professional Edition, and let us know if everything works right for your database!
The Node.JS debugger will now correctly stop at breakpoints after editing the JavaScript code while running [Pro only]

And many more small fixes, see our release notes for details.

Getting the New Version

You can update PyCharm by choosing Help | Check for Updates (or PyCharm | Check for Updates on macOS) in the IDE. PyCharm will be able to patch itself to the new version; there should no longer be a need to run the full installer.

If you’re on Ubuntu 16.04 or later, or any other Linux distribution that supports snap, you should not need to upgrade manually, you’ll automatically receive the new version.

↧

Stack Abuse: Ensemble/Voting Classification in Python with Scikit-Learn

January 22, 2020, 5:36 am


Introduction

Ensemble classification models can be powerful machine learning tools capable of achieving excellent performance and generalizing well to new, unseen datasets.

The value of an ensemble classifier is that, in joining together the predictions of multiple classifiers, it can correct for errors made by any individual classifier, leading to better accuracy overall. Let's take a look at the different ensemble classification methods and see how these classifiers can be implemented in Scikit-Learn.

What are Ensemble Models in Machine Learning?


Ensemble models combine several learning algorithms into a single predictor. In this sense, an ensemble is a meta-algorithm rather than an algorithm itself. Ensemble learning methods are valuable because they can improve the performance of a predictive model.

Ensemble learning methods work off of the idea that tying the predictions of multiple classifiers together will lead to better performance by either improving prediction accuracy or reducing aspects like bias and variance.

In general, an ensemble model falls into one of two categories: sequential approaches and parallel approaches.

A sequential ensemble model operates by having the base learners/models generated in sequence. Sequential ensemble methods are typically used to try and increase overall performance, as the ensemble model can compensate for inaccurate predictions by re-weighting the examples that were previously misclassified. A notable example of this is AdaBoost.

Parallel methods, as you may be able to guess, rely on creating and training the base learners in parallel. They aim to reduce the error rate by training many models independently and averaging the results together. A notable example of a parallel method is the Random Forest Classifier.

Another way of thinking about this is as a distinction between homogeneous and heterogeneous learners. While most ensemble learning methods use homogeneous base learners (many learners of the same type), some ensemble methods use heterogeneous learners (different learning algorithms joined together).

To recap:

Sequential models try to increase performance by re-weighting examples, and models are generated in sequence.
Parallel models work by averaging results together after training many models at the same time.

We'll now cover different methods of employing these models to solve machine learning classification problems.

Different Ensemble Classification Methods

Bagging


Bagging, also known as bootstrap aggregating, is an ensemble method that aims to reduce the variance of estimates by averaging multiple estimates together. Bagging creates bootstrap samples (random subsets drawn with replacement) from the main dataset, and each learner is trained on its own sample.

In order for the predictions of the different classifiers to be aggregated, either an averaging is used for regression, or a voting approach is used for classification (based on the decision of the majority).

One example of a bagging classification method is the Random Forests Classifier. In the case of the random forests classifier, all the individual trees are trained on a different sample of the dataset.

Each tree is also trained using random selections of features. When the results are averaged together, the overall variance decreases and the model performs better as a result.
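
To make the bootstrap-then-vote idea concrete, here is a small pure-Python sketch. It is not how Scikit-Learn implements bagging: the base learner is a toy 1-nearest-neighbour rule, the data is made up, and all function names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bootstrap sample)."""
    return [rng.choice(rows) for _ in rows]


def nearest_label(sample, x):
    """Toy base learner: label of the closest training point (1-NN)."""
    return min(sample, key=lambda row: abs(row[0] - x))[1]


def bagged_predict(rows, x, n_estimators=25, seed=0):
    """Majority vote over base learners, each built on its own bootstrap sample."""
    rng = random.Random(seed)
    votes = [nearest_label(bootstrap_sample(rows, rng), x)
             for _ in range(n_estimators)]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical 1-D data: (feature, label) pairs in two well-separated groups.
rows = [(float(v), 0) for v in range(5)] + [(float(v), 1) for v in range(10, 15)]
print(bagged_predict(rows, 1.0), bagged_predict(rows, 13.0))
```

Each base learner sees a slightly different dataset, so individual mistakes tend to be outvoted by the rest of the ensemble.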

Boosting

Boosting algorithms are capable of taking weak, underperforming models and converting them into strong models. The idea behind boosting algorithms is that you assign many weak learning models to the datasets, and then the weights for misclassified examples are tweaked during subsequent rounds of learning.

The predictions of the classifiers are aggregated and then the final predictions are made through a weighted sum (in the case of regressions), or a weighted majority vote (in the case of classification).

AdaBoost is one example of a boosting classifier method, as is Gradient Boosting, which was derived from the aforementioned algorithm.

If you'd like to read more about Gradient Boosting and the theory behind it, we've already covered that in a previous article.
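
The re-weighting step described above can be sketched with the update rule of classic (discrete) AdaBoost. Treat this as an illustration under that assumption: Scikit-Learn's AdaBoostClassifier differs in its details, and adaboost_round is a hypothetical helper name.

```python
import math


def adaboost_round(weights, correct):
    """One re-weighting step of classic (discrete) AdaBoost.

    weights: current example weights, summing to 1
    correct: booleans, True where the weak learner was right
    Assumes the weighted error is strictly between 0 and 1.
    Returns (alpha, new_weights).
    """
    err = sum(w for w, ok in zip(weights, correct) if not ok)
    # alpha is the weak learner's say in the final weighted vote
    alpha = 0.5 * math.log((1 - err) / err)
    # Shrink weights of correct examples, grow weights of mistakes
    scaled = [w * math.exp(-alpha if ok else alpha)
              for w, ok in zip(weights, correct)]
    total = sum(scaled)
    return alpha, [w / total for w in scaled]


alpha, new_weights = adaboost_round([0.2] * 5, [True, True, True, True, False])
print(round(alpha, 3), [round(w, 3) for w in new_weights])
```

With five equally weighted examples and one mistake, the misclassified example's weight grows from 0.2 to 0.5, so the next weak learner is pushed to get it right.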

Stacking


Stacking algorithms are an ensemble learning method that combines the decision of different regression or classification algorithms. The component models are trained on the entire training dataset. After these component models are trained, a meta-model is assembled from the different models and then it's trained on the outputs of the component models. This approach typically creates a heterogeneous ensemble because the component models are usually different algorithms.

Example Implementations

Now that we've explored different methods we can use to create ensemble models, let's take a look at how we could implement a classifier using the different methods.

Though, before we can take a look at different ways of implementing ensemble classifiers, we need to select a dataset to use and do some preprocessing of the dataset.

We'll be using the Titanic dataset, which can be downloaded here. Let's do some preprocessing of the data in order to get rid of missing values and scale the data to a uniform range. Then we can go about setting up the ensemble classifiers.

Data Preprocessing

To begin with, we'll start by importing all functions we need from their respective libraries. We'll be using Pandas and Numpy to load and transform the data, as well as the LabelEncoder and StandardScaler tools.

We'll also need the machine learning metrics and the train_test_split function. Finally, we'll need the classifiers we want to use:

import pandas as pd
import numpy as np
import warnings

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score, f1_score, log_loss
from sklearn.model_selection import train_test_split, KFold, cross_val_score

from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, ExtraTreesClassifier

We'll start by loading in the training and testing data and then creating a function to check for the presence of any null values:

training_data = pd.read_csv("train.csv")
testing_data = pd.read_csv("test.csv")

def get_nulls(training, testing):
    print("Training Data:")
    print(pd.isnull(training).sum())
    print("Testing Data:")
    print(pd.isnull(testing).sum())

get_nulls(training_data, testing_data)

As it happens, there are a lot of missing values in the Age and Cabin categories.

Training Data:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
Testing Data:
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

We're going to start by dropping the columns that are unlikely to be useful: Cabin, Ticket, and Name. The Cabin column has far too many missing values, the Ticket column comprises too many categories to be useful, and names are unlikely to help predict survivors.

After that we will need to impute some missing values. When we do so, we must account for the fact that the Age distribution is slightly right-skewed (young ages are slightly more prominent than older ages). We'll impute with the median rather than the mean, because large outliers pull the mean away from the center of the data:

# Drop the cabin column, as there are too many missing values
# Drop the ticket numbers too, as there are too many categories
# Drop names as they won't really help predict survivors

training_data.drop(labels=['Cabin', 'Ticket', 'Name'], axis=1, inplace=True)
testing_data.drop(labels=['Cabin', 'Ticket', 'Name'], axis=1, inplace=True)

# Taking the mean/average value would be impacted by the skew
# so we should use the median value to impute missing values.
# Assign the result back rather than calling fillna(inplace=True) on a
# selected column, which is deprecated in recent pandas versions.

training_data["Age"] = training_data["Age"].fillna(training_data["Age"].median())
testing_data["Age"] = testing_data["Age"].fillna(testing_data["Age"].median())
training_data["Embarked"] = training_data["Embarked"].fillna("S")
testing_data["Fare"] = testing_data["Fare"].fillna(testing_data["Fare"].median())

get_nulls(training_data, testing_data)

Now we can see there are no more missing values:

Training Data:
PassengerId    0
Survived       0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           0
Embarked       0
dtype: int64
Testing Data:
PassengerId    0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           0
Embarked       0
dtype: int64
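
The choice of the median over the mean for imputation is easy to see on a tiny example. The ages below are made up for illustration and are not taken from the Titanic data.

```python
from statistics import mean, median

# Hypothetical ages with missing entries (None) and one large outlier.
ages = [22, 24, 25, None, 27, 29, None, 80]

known = [a for a in ages if a is not None]
fill_value = median(known)  # 26.0, while the outlier pulls the mean up to 34.5
imputed = [fill_value if a is None else a for a in ages]

print(mean(known), median(known))
print(imputed)
```

The single outlier drags the mean well above most of the observed ages, while the median stays near the bulk of the data.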

We're now going to need to encode the non-numerical data. Let's set up a LabelEncoder and fit it on the Sex feature and then transform the data with the encoder. We'll then replace the values in the Sex feature with those that have been encoded and then do the same for the Embarked feature.

Finally, let's scale the data using the StandardScaler, so there aren't huge fluctuations in values.

encoder_1 = LabelEncoder()
# Fit the encoder on the data
encoder_1.fit(training_data["Sex"])

# Transform and replace training data
training_sex_encoded = encoder_1.transform(training_data["Sex"])
training_data["Sex"] = training_sex_encoded
test_sex_encoded = encoder_1.transform(testing_data["Sex"])
testing_data["Sex"] = test_sex_encoded

encoder_2 = LabelEncoder()
encoder_2.fit(training_data["Embarked"])

training_embarked_encoded = encoder_2.transform(training_data["Embarked"])
training_data["Embarked"] = training_embarked_encoded
testing_embarked_encoded = encoder_2.transform(testing_data["Embarked"])
testing_data["Embarked"] = testing_embarked_encoded

# StandardScaler expects 2D input, so reshape each column to (n_samples, 1)
ages_train = np.array(training_data["Age"]).reshape(-1, 1)
fares_train = np.array(training_data["Fare"]).reshape(-1, 1)
ages_test = np.array(testing_data["Age"]).reshape(-1, 1)
fares_test = np.array(testing_data["Fare"]).reshape(-1, 1)

# Use one scaler per feature, fit on the training data only,
# then reuse the fitted scaler to transform the test data
age_scaler = StandardScaler()
fare_scaler = StandardScaler()

training_data["Age"] = age_scaler.fit_transform(ages_train).ravel()
training_data["Fare"] = fare_scaler.fit_transform(fares_train).ravel()
testing_data["Age"] = age_scaler.transform(ages_test).ravel()
testing_data["Fare"] = fare_scaler.transform(fares_test).ravel()
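
What standard scaling does can be sketched in plain Python: subtract the mean and divide by the population standard deviation, which is what StandardScaler computes. The fares below are made up, and fit_standardizer and standardize are hypothetical helper names. The point to notice is that the statistics come from the training values and are then reused for the test values.

```python
from statistics import mean, pstdev


def fit_standardizer(values):
    """Learn the mean and population standard deviation of the training values."""
    return mean(values), pstdev(values)


def standardize(values, mu, sigma):
    """Apply z = (x - mean) / std using statistics learned elsewhere."""
    return [(v - mu) / sigma for v in values]


# Hypothetical fares: statistics come from the training split only.
train_fares = [7.25, 8.05, 13.0, 26.55, 71.28]
test_fares = [7.9, 30.0]

mu, sigma = fit_standardizer(train_fares)
train_scaled = standardize(train_fares, mu, sigma)
test_scaled = standardize(test_fares, mu, sigma)  # reuse, do not refit

print(train_scaled)
print(test_scaled)
```

Refitting on the test values would put the two splits on different scales, which is why the fitted statistics are carried over.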

Now that our data has been preprocessed, we can select our features and labels and then use the train_test_split function to divide our training data into training and validation sets:

# Now to select our training/testing data
X_features = training_data.drop(labels=['PassengerId', 'Survived'], axis=1)
y_labels = training_data['Survived']

print(X_features.head(5))

# Hold out 10% of the training data as a validation set

X_train, X_val, y_train, y_val = train_test_split(X_features, y_labels, test_size=0.1, random_state=27)

We're now ready to start implementing ensemble classification methods.

Simple Averaging Approach

Before we get into the big three ensemble methods we covered earlier, let's cover a very quick and easy method of using an ensemble approach: averaging predictions. We simply add the different predicted values of our chosen classifiers together and then divide by the total number of classifiers, using floor division to get a whole value. Note that with 0/1 labels, floor division yields 1 only when all of the classifiers predict 1, so this behaves like a unanimous vote rather than a rounded average.

In this test case we'll be using logistic regression, a Decision Tree Classifier, and the Support Vector Classifier. We fit the classifiers on the data and then save the predictions as variables. Then we simply add the predictions together and divide:

LogReg_clf = LogisticRegression()
DTree_clf = DecisionTreeClassifier()
SVC_clf = SVC()

LogReg_clf.fit(X_train, y_train)
DTree_clf.fit(X_train, y_train)
SVC_clf.fit(X_train, y_train)

LogReg_pred = LogReg_clf.predict(X_val)
DTree_pred = DTree_clf.predict(X_val)
SVC_pred = SVC_clf.predict(X_val)

averaged_preds = (LogReg_pred + DTree_pred + SVC_pred)//3
acc = accuracy_score(y_val, averaged_preds)
print(acc)

Here's the accuracy we got from this method:

0.8444444444444444
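
To see what floor division does to 0/1 predictions, compare it with a plain majority vote on a few made-up predictions. This is only an illustrative sketch; the function names and the prediction lists are hypothetical.

```python
def floor_average(*preds):
    """Element-wise sum of 0/1 predictions, floor-divided by the number of models."""
    return [sum(votes) // len(votes) for votes in zip(*preds)]


def majority_vote(*preds):
    """Element-wise majority of 0/1 predictions (assumes an odd number of models)."""
    return [int(2 * sum(votes) > len(votes)) for votes in zip(*preds)]


# Hypothetical predictions from three classifiers on four examples.
a = [1, 1, 0, 0]
b = [1, 1, 1, 0]
c = [1, 0, 0, 0]

print(floor_average(a, b, c))  # [1, 0, 0, 0]
print(majority_vote(a, b, c))  # [1, 1, 0, 0]
```

The second example is where the two differ: two of three classifiers predict 1, but floor division still returns 0.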

Voting/Stacking Classification Example

When it comes to creating a stacking/voting classifier, Scikit-Learn provides us with some handy functions that we can use to accomplish this.

The VotingClassifier takes in a list of different estimators as arguments and a voting method. The hard voting method uses the predicted labels and a majority rules system, while the soft voting method predicts a label based on the argmax/largest predicted value of the sum of the predicted probabilities.

After we provide the desired classifiers, we need to fit the resulting ensemble classifier object. We can then get predictions and use accuracy metrics:

voting_clf = VotingClassifier(estimators=[('SVC', SVC_clf), ('DTree', DTree_clf), ('LogReg', LogReg_clf)], voting='hard')
voting_clf.fit(X_train, y_train)
preds = voting_clf.predict(X_val)
acc = accuracy_score(y_val, preds)
l_loss = log_loss(y_val, preds)
f1 = f1_score(y_val, preds)

print("Accuracy is: " + str(acc))
print("Log Loss is: " + str(l_loss))
print("F1 Score is: " + str(f1))

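Here's what the metrics have to say about the VotingClassifier's performance. Keep in mind that the log loss here is computed from hard 0/1 predictions rather than predicted probabilities, so it mostly reflects the error rate: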

Accuracy is: 0.8888888888888888
Log Loss is: 3.8376684749044165
F1 Score is: 0.8484848484848486
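
The difference between hard and soft voting is easiest to see when the classifiers disagree in confidence. The sketch below uses made-up probabilities for a single binary example; hard_vote and soft_vote are hypothetical helpers, not Scikit-Learn functions.

```python
def hard_vote(probas, threshold=0.5):
    """Each model casts a 0/1 vote; the majority label wins."""
    votes = [int(p >= threshold) for p in probas]
    return int(2 * sum(votes) > len(votes))


def soft_vote(probas, threshold=0.5):
    """Average the predicted probabilities, then threshold the mean."""
    return int(sum(probas) / len(probas) >= threshold)


# Hypothetical P(class 1) from three models for a single example.
probas = [0.45, 0.45, 0.90]
print(hard_vote(probas), soft_vote(probas))  # 0 1
```

Two lukewarm "no" votes win under hard voting, while one confident "yes" tips the average under soft voting. If you try voting='soft' with the classifiers above, note that SVC only exposes predicted probabilities when it is created with probability=True.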

Bagging Classification Example

Here's how we can implement bagging classification with Scikit-Learn. Sklearn's BaggingClassifier takes in a chosen classification model as well as the number of estimators that you want to use - you can use a model like Logistic Regression or Decision Trees.

Sklearn also provides the RandomForestClassifier and the ExtraTreesClassifier, both of which are ensembles of randomized decision trees. These classifiers can also be used alongside the K-folds cross-validation tool.

We'll compare several different bagging classification approaches here, printing out the mean results of the K-fold cross validation score:

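# The base estimator is passed positionally because its keyword was renamed
# from base_estimator to estimator in newer scikit-learn releases
logreg_bagging_model = BaggingClassifier(LogReg_clf, n_estimators=50, random_state=12)
dtree_bagging_model = BaggingClassifier(DTree_clf, n_estimators=50, random_state=12)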
random_forest = RandomForestClassifier(n_estimators=100, random_state=12)
extra_trees = ExtraTreesClassifier(n_estimators=100, random_state=12)

def bagging_ensemble(model):
    # random_state only matters when shuffle=True; newer scikit-learn raises an error otherwise
    k_folds = KFold(n_splits=20)
    results = cross_val_score(model, X_train, y_train, cv=k_folds)
    print(results.mean())

bagging_ensemble(logreg_bagging_model)
bagging_ensemble(dtree_bagging_model)
bagging_ensemble(random_forest)
bagging_ensemble(extra_trees)

Here are the results we got from the classifiers:

0.7865853658536585
0.8102439024390244
0.8002439024390245
0.7902439024390244
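
For intuition about what K-fold cross-validation does with the data, here is a plain-Python sketch of contiguous, unshuffled folds. It mirrors the idea rather than Scikit-Learn's code, and kfold_indices is a hypothetical name.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional sample.
        stop = start + base + (1 if fold < extra else 0)
        val_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, val_idx
        start = stop


for train_idx, val_idx in kfold_indices(10, 3):
    print(val_idx)
```

Every sample lands in exactly one validation fold, and the score printed for each model above is the mean over the folds.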

Boosting Classification Example

Finally, we'll take a look at how to use a boosting classification method. As mentioned, there's a separate article on the topic of Gradient Boosting you can read here.

Scikit-Learn has a built-in AdaBoost classifier, whose n_estimators argument sets the maximum number of weak learners to train. We can try using a for loop to see how the classification performance changes at different values, and we can also combine it with the K-Folds cross validation tool:

# random_state only matters when shuffle=True; newer scikit-learn raises an error otherwise
k_folds = KFold(n_splits=20)

num_estimators = [20, 40, 60, 80, 100]

for i in num_estimators:
    ada_boost = AdaBoostClassifier(n_estimators=i, random_state=12)
    results = cross_val_score(ada_boost, X_train, y_train, cv=k_folds)
    print("Results for {} estimators:".format(i))
    print(results.mean())

Here are the results we got:

Results for 20 estimators:
0.8015243902439024
Results for 40 estimators:
0.8052743902439025
Results for 60 estimators:
0.8053048780487805
Results for 80 estimators:
0.8040243902439024
Results for 100 estimators:
0.8027743902439024

Summing Up

We've covered the ideas behind three different ensemble classification techniques: voting/stacking, bagging, and boosting.

Scikit-Learn allows you to easily create instances of the different ensemble classifiers. These ensemble objects can be combined with other Scikit-Learn tools like K-Folds cross validation.

If you'd like to learn more about appropriate uses for ensemble classifiers, and the theories behind them, I suggest checking out the links found here or here.

↧

Python Circle: Using IF ELSE condition in Django template

January 22, 2020, 5:46 am