Channel: Planet Python

Real Python: Python String Formatting Tutorial


Remember the Zen of Python and how there should be “one obvious way to do something in Python”? You might scratch your head when you find out that there are four major ways to do string formatting in Python.

In this tutorial, you’ll learn the four main approaches to string formatting in Python, as well as their strengths and weaknesses. You’ll also get a simple rule of thumb for how to pick the best general purpose string formatting approach in your own programs.

Let’s jump right in, as we’ve got a lot to cover. In order to have a simple toy example for experimentation, let’s assume you’ve got the following variables (or constants, really) to work with:

>>> errno = 50159747054
>>> name = 'Bob'

Based on these variables, you’d like to generate an output string containing a simple error message:

'Hey Bob, there is a 0xbadc0ffee error!'

That error could really spoil a dev’s Monday morning… But we’re here to discuss string formatting. So let’s get to work.

#1 “Old Style” String Formatting (% operator)

Strings in Python have a unique built-in operation that can be accessed with the % operator. This lets you do simple positional formatting very easily. If you’ve ever worked with a printf-style function in C, you’ll recognize how this works instantly. Here’s a simple example:

>>> 'Hello, %s' % name
'Hello, Bob'

I’m using the %s format specifier here to tell Python where to substitute the value of name, represented as a string.

There are other format specifiers available that let you control the output format. For example, it’s possible to convert numbers to hexadecimal notation or add whitespace padding to generate nicely formatted tables and reports. (See Python Docs: “printf-style String Formatting”.)
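For instance, here's a small sketch of padding and hex conversion with %-style specifiers (the names and widths below are illustrative, not from the tutorial):

```python
# Illustrative %-style specifiers: left-justify a string in a
# 10-character field and zero-pad a hex number to 8 digits
# (the values here are made up):
for label, code in [('Bob', 255), ('Alice', 48879)]:
    print('%-10s 0x%08x' % (label, code))
```

The `-` flag left-aligns, the number sets the field width, and `0` pads with zeros, which is enough to line up simple tabular output.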

Here, you can use the %x format specifier to convert an int value to a string and to represent it as a hexadecimal number:

>>> '%x' % errno
'badc0ffee'

The “old style” string formatting syntax changes slightly if you want to make multiple substitutions in a single string. Because the % operator takes only one argument, you need to wrap the right-hand side in a tuple, like so:

>>> 'Hey %s, there is a 0x%x error!' % (name, errno)
'Hey Bob, there is a 0xbadc0ffee error!'

It’s also possible to refer to variable substitutions by name in your format string, if you pass a mapping to the % operator:

>>> 'Hey %(name)s, there is a 0x%(errno)x error!' % {
...     "name": name, "errno": errno}
'Hey Bob, there is a 0xbadc0ffee error!'

This makes your format strings easier to maintain and easier to modify in the future. You don’t have to worry about making sure the order you’re passing in the values matches up with the order in which the values are referenced in the format string. Of course, the downside is that this technique requires a little more typing.

I’m sure you’ve been wondering why this printf-style formatting is called “old style” string formatting. It was technically superseded by “new style” formatting in Python 3, which we’re going to talk about next.

#2 “New Style” String Formatting (str.format)

Python 3 introduced a new way to do string formatting that was also later back-ported to Python 2.7. This “new style” string formatting gets rid of the %-operator special syntax and makes the syntax for string formatting more regular. Formatting is now handled by calling .format() on a string object.

You can use format() to do simple positional formatting, just like you could with “old style” formatting:

>>> 'Hello, {}'.format(name)
'Hello, Bob'

Or, you can refer to your variable substitutions by name and use them in any order you want. This is quite a powerful feature as it allows for re-arranging the order of display without changing the arguments passed to format():

>>> 'Hey {name}, there is a 0x{errno:x} error!'.format(
...     name=name, errno=errno)
'Hey Bob, there is a 0xbadc0ffee error!'

This also shows that the syntax to format an int variable as a hexadecimal string has changed. Now you need to pass a format spec by adding a :x suffix. The format string syntax has become more powerful without complicating the simpler use cases. It pays off to read up on this string formatting mini-language in the Python documentation.
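As a taste of that mini-language, here is a quick sketch of a few specs (the fill characters and widths are arbitrary examples, not from the tutorial):

```python
# A few illustrative format specs: alignment, fill characters,
# and prefixed hex conversion (all values here are made up):
print('{:>10}'.format('right'))  # right-align in a 10-char field
print('{:*^12}'.format('mid'))   # center with '*' as the fill char
print('{:#06x}'.format(255))     # 0x-prefixed hex, zero-padded to width 6
```

Everything after the `:` in a replacement field is the format spec, so the same suffixes work for positional and named fields alike.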

In Python 3, this “new style” string formatting is to be preferred over %-style formatting. While “old style” formatting has been de-emphasized, it has not been deprecated. It is still supported in the latest versions of Python. According to this discussion on the Python dev email list and this issue on the Python dev bug tracker, %-formatting is going to stick around for a long time to come.

Still, the official Python 3 documentation doesn’t exactly recommend “old style” formatting or speak too fondly of it:

“The formatting operations described here exhibit a variety of quirks that lead to a number of common errors (such as failing to display tuples and dictionaries correctly). Using the newer formatted string literals or the str.format() interface helps avoid these errors. These alternatives also provide more powerful, flexible and extensible approaches to formatting text.” (Source)

This is why I’d personally try to stick with str.format for new code moving forward. Starting with Python 3.6, there’s yet another way to format your strings. I’ll tell you all about it in the next section.

#3 String Interpolation / f-Strings (Python 3.6+)

Python 3.6 added a new string formatting approach called formatted string literals or “f-strings”. This new way of formatting strings lets you use embedded Python expressions inside string constants. Here’s a simple example to give you a feel for the feature:

>>> f'Hello, {name}!'
'Hello, Bob!'

As you can see, this prefixes the string constant with the letter “f”—hence the name “f-strings.” This new formatting syntax is powerful. Because you can embed arbitrary Python expressions, you can even do inline arithmetic with it. Check out this example:

>>> a = 5
>>> b = 10
>>> f'Five plus ten is {a + b} and not {2 * (a + b)}.'
'Five plus ten is 15 and not 30.'

Formatted string literals are a Python parser feature that converts f-strings into a series of string constants and expressions. They then get joined up to build the final string.

Imagine you had the following greet() function that contains an f-string:

>>> def greet(name, question):
...     return f"Hello, {name}! How's it {question}?"
...
>>> greet('Bob', 'going')
"Hello, Bob! How's it going?"

When you disassemble the function and inspect what’s going on behind the scenes, you’ll see that the f-string in the function gets transformed into something similar to the following:

>>> def greet(name, question):
...     return "Hello, " + name + "! How's it " + question + "?"

The real implementation is slightly faster than that because it uses the BUILD_STRING opcode as an optimization. But functionally they’re the same:

>>> import dis
>>> dis.dis(greet)
  2           0 LOAD_CONST               1 ('Hello, ')
              2 LOAD_FAST                0 (name)
              4 FORMAT_VALUE             0
              6 LOAD_CONST               2 ("! How's it ")
              8 LOAD_FAST                1 (question)
             10 FORMAT_VALUE             0
             12 LOAD_CONST               3 ('?')
             14 BUILD_STRING             5
             16 RETURN_VALUE

Formatted string literals also support the existing format string syntax of the str.format() method. That allows you to solve the same formatting problems we've discussed in the previous two sections:

>>> f"Hey {name}, there's a {errno:#x} error!"
"Hey Bob, there's a 0xbadc0ffee error!"

Python’s new formatted string literals are similar to JavaScript’s Template Literals added in ES2015. I think they’re quite a nice addition to Python, and I’ve already started using them in my day to day (Python 3) work. You can learn more about formatted string literals in our in-depth Python f-strings tutorial.

#4 Template Strings (Standard Library)

Here’s one more tool for string formatting in Python: template strings. It’s a simpler and less powerful mechanism, but in some cases this might be exactly what you’re looking for.

Let’s take a look at a simple greeting example:

>>> from string import Template
>>> t = Template('Hey, $name!')
>>> t.substitute(name=name)
'Hey, Bob!'

You see here that we need to import the Template class from Python’s built-in string module. Template strings are not a core language feature but they’re supplied by the string module in the standard library.

Another difference is that template strings don’t allow format specifiers. So in order to get the previous error string example to work, you’ll need to manually transform the int error number into a hex-string:

>>> templ_string = 'Hey $name, there is a $error error!'
>>> Template(templ_string).substitute(
...     name=name, error=hex(errno))
'Hey Bob, there is a 0xbadc0ffee error!'

That worked great.
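As a side note, Template also provides a safe_substitute() method, which leaves unknown placeholders untouched instead of raising an exception; that can be handy when some values get filled in later:

```python
from string import Template

# safe_substitute() tolerates a missing value; $error stays as-is
# instead of triggering a KeyError:
t = Template('Hey $name, there is a $error error!')
print(t.safe_substitute(name='Bob'))
```

The plain substitute() call would raise a KeyError here because no value was supplied for $error.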

So when should you use template strings in your Python programs? In my opinion, the best time to use template strings is when you’re handling formatted strings generated by users of your program. Due to their reduced complexity, template strings are a safer choice.

The more complex formatting mini-languages of the other string formatting techniques might introduce security vulnerabilities to your programs. For example, it’s possible for format strings to access arbitrary variables in your program.

That means, if a malicious user can supply a format string, they can potentially leak secret keys and other sensitive information! Here’s a simple proof of concept of how this attack might be used against your code:

>>> # This is our super secret key:
>>> SECRET = 'this-is-a-secret'

>>> class Error:
...     def __init__(self):
...         pass

>>> # A malicious user can craft a format string that
>>> # can read data from the global namespace:
>>> user_input = '{error.__init__.__globals__[SECRET]}'

>>> # This allows them to exfiltrate sensitive information,
>>> # like the secret key:
>>> err = Error()
>>> user_input.format(error=err)
'this-is-a-secret'

See how a hypothetical attacker was able to extract our secret string by accessing the __globals__ dictionary from a malicious format string? Scary, huh? Template strings close this attack vector. This makes them a safer choice if you’re handling format strings generated from user input:

>>> user_input = '${error.__init__.__globals__[SECRET]}'
>>> Template(user_input).substitute(error=err)
ValueError: "Invalid placeholder in string: line 1, col 1"

Which String Formatting Method Should You Use?

I totally get that having so much choice for how to format your strings in Python can feel very confusing. This is an excellent cue to bust out this handy flowchart infographic I’ve put together for you:

Python String Formatting Flowchart (Click to Tweet)

This flowchart is based on the rule of thumb that I apply when I’m writing Python:

Python String Formatting Rule of Thumb: If your format strings are user-supplied, use Template Strings (#4) to avoid security issues. Otherwise, use Literal String Interpolation/f-Strings (#3) if you’re on Python 3.6+, and “New Style” str.format (#2) if you’re not.

Key Takeaways

  • Perhaps surprisingly, there’s more than one way to handle string formatting in Python.
  • Each method has its individual pros and cons. Your use case will influence which method you should use.
  • If you’re having trouble deciding which string formatting method to use, try our Python String Formatting Rule of Thumb.



PyCharm: PyCharm 2018.2 EAP 6


What’s the best way to celebrate your independence? Fireworks and hot dogs go great with a hot new EAP. Get PyCharm 2018.2 EAP 6 now from the JetBrains website!

New in PyCharm

  • Setup.py support has been improved: both tests_require and install_requires handling have been fixed
  • Overloads (defined with typing.overload) are now copied to the subclass when overriding a method that defines them.
  • SQL code style improvements: MySQL line comments are now correctly identified, and it’s now possible to right-align statements on the second word.
  • And more: read the release notes here

If you haven’t tried it yet: PyCharm 2018.2 comes with support for Pipenv. This EAP already allows you to manage your requirements with a Pipfile, try it now!

Interested?

Download this EAP from our website. Alternatively, you can use the JetBrains Toolbox App to stay up to date throughout the entire EAP.

If you’re on Ubuntu 16.04 or later, you can use snap to get PyCharm EAP, and stay up to date. You can find the installation instructions on our website.

PyCharm 2018.2 is in development during the EAP phase; therefore, not all new features are available yet. More features will be added in the coming weeks. As PyCharm 2018.2 is pre-release software, it is not as stable as the release versions. Furthermore, we may decide to change and/or drop certain features as the EAP progresses.

All EAP versions will ship with a built-in EAP license, which means that these versions are free to use for 30 days after the day that they are built. As EAPs are released weekly, you’ll be able to use PyCharm Professional Edition EAP for free for the duration of the EAP program, as long as you upgrade at least once every 30 days.

Python Bytes: #85 Visually debugging your Jupyter notebook

Python Software Foundation: Ophidia in Urbe - PyLondinium Arrives

Latin scholars will tell you that “Ophidia in Urbe,” the tag line for PyLondinium (London, June 8-10), is Latin for “Snakes in the City”.

The snakes, of course, are Pythonic and “the city” is the City, the banking district of London, specifically Bloomberg’s new European headquarters, just across the way from the Bank of England. It’s a beautiful building and it contains the carefully excavated and reconstructed remains of a 3rd century Roman temple to Mithras. Ergo (as those Romans would say) the need for a Latin tagline.

But what’s really distinctive about PyLondinium is the whole idea behind it. PyLondinium was intended to be a small conference that 1) offered great talks, 2) had a very affordable ticket price, and 3) raised a reasonable amount of money for the benefit of the PSF and its programs around the world. And all of this in London, one of the more expensive cities in the world. 

With the London and UK Python community at hand, getting great talks was the easy part. Keeping prices low and still raising money for the cause was a harder problem.

Founder and chair of the conference, Mario Corchero, had an answer to that problem. One of several Bloomberg employees also involved in the Python community, Mario was also the chair of last year’s PyCon España (and co-chair of the PyCon Charlas track), and several other Spanish employees of Bloomberg London had been on the PyConES organizing team. The inspiration of Mario and his team was to combine their own organizing experience with Bloomberg’s sponsorship, which provided the venue and the food. 

The result was a strong first time conference - selling 270 tickets, with 2 days of talks preceded by a day with a dateutils sprint, a PyLadies tutorial, and a Trans*Code hackday, in the heart of London, all for a standard ticket price of only £35. Even better, to support diversity anyone attending the PyLadies or Trans*Code events (both free) also got a free ticket to the main conference if they wanted. Feedback from attendees was overwhelmingly positive, and PyLondinium looks poised to build on that success in the future. 

And what about raising money for the PSF? Yes, PyLondinium did a great job with that as well, sending $14,000 to the PSF to support Pythonic communities and activities around the world. 

Thank you from the PSF, and well done, you!

Python Software Foundation: Python Software Foundation Fellow Members for Q2 2018

We are happy to announce our 2018 2nd Quarter Python Software Foundation Fellow Members:

Anthony Shaw
Twitter, GitHub, Website
Christian Barra
Twitter, GitHub, Website
Jeff Reback
Twitter, GitHub
Joris Van den Bossche
Twitter, GitHub, Website
Katie McLaughlin
Twitter, GitHub, Website
Marc Garcia
Twitter, LinkedIn, GitHub
Rizky Ariestiyansyah
Twitter, GitHub
Tom Augspurger
Website
Wes McKinney
Twitter, GitHub, Website
Yury Selivanov
Twitter, GitHub, Website

Congratulations! Thank you for your continued contributions. We have added you to our Fellow roster online.

The above members have contributed to the Python ecosystem by maintaining popular libraries, organizing Python events, hosting Python meet ups, teaching classes, contributing to CPython, and overall being great mentors in our community. Each of them continues to help make Python more accessible around the world. To learn more about the new Fellow members, check out their links above.

If you would like to nominate someone to be a PSF Fellow, please send a description of their Python accomplishments and their email address to psf-fellow at python.org. Here is the nomination review schedule for 2018:

  • Q3: July to the end of September (01/07 - 30/09) Cut-off for quarter three will be August 20. New fellows will be announced before end of September. 
  • Q4: October to the end of December (01/10 - 31/12) Cut-off for quarter four will be November 20. New fellows will be announced before December 31. 

We are looking for a few more voting members to join the Work Group to help review nominations. If you are a PSF Fellow and would like to join, please write to psf-fellow at python.org.

The Digital Cat: Useful pytest command line options


I recently gave a workshop on "TDD in Python with pytest", where I developed a very simple Python project together with the attendees following a TDD approach. It's a good way to introduce TDD, I think. I wrote each test together with the attendees, and then I left them the task of writing the Python code that passes the test. This way I could show TDD in action, introducing pytest features like the pytest.raises context manager or the use of assert as they became useful for the actual tests.

This is the approach that I follow in some of my posts on TDD here on the blog, for example A simple example of Python OOP development (with TDD) and A game of tokens: write an interpreter in Python with TDD - Part 1.

Part of the workshop was dedicated to pytest command line options and in general to what pytest can do as a testing framework. Unfortunately there was no time to go through this part, so I promised some of the attendees to give them a written version of it. This post is the fulfilment of that promise.

Please remember to import pytest before using functions, decorators or attributes prefixed by pytest., as I will not repeat it in each example.

Run single tests

If you want to run only a specific test you can provide its name on the pytest command line

$ pytest -svv tests/test_calc.py::test_addition

which, for example, runs the test_addition test inside the tests/test_calc.py file. You can also specify just the file name to run all the tests contained there

$ pytest -svv tests/test_calc.py

Skipping tests

Sometimes it is useful to skip tests. The reason might be that some new code broke too many tests, and we want to face them one at a time, or that a specific feature had to be temporarily disabled. In all those cases the pytest.mark.skip decorator is your friend. Remember that a decorator is something that changes the way the decorated function works (for the skilled reader: it's a function wrapper). Assuming we are working on a tests/test_calc.py file the code might be

@pytest.mark.skip
def test_addition():
    [...]

The result on the command line will be (after running py.test -svv)

tests/test_calc.py::test_addition SKIPPED

Skipping with a reason

The previous solution is good for a temporary skip, but if the test has to remain deactivated for a long time it's better to annotate a specific reason for the exclusion. In my experience 1 day is enough to forget small details like this, so my advice is to always put a well-written reason on skipped tests. To add it you can use the reason attribute of the skip decorator

@pytest.mark.skip(reason="Addition has been deactivated because of issue #123")
def test_addition():
    [...]

Remember to add the -rs option to your command line to see the reason behind skipped tests. So after running py.test -svv -rs we will get something like

tests/test_calc.py::test_addition SKIPPED
[...]
============================= short test summary info =============================
SKIP [1] tests/test_calc.py:5: Addition has been deactivated because of issue #123
====================== 12 passed, 1 skipped in 0.02 seconds =======================

Skipping tests conditionally

Well, most of the time we will skip tests not for a stable reason, but according to some other condition that we can retrieve from the system, like the Python version, or maybe the region in which a server is running. The decorator that we need to use in that case is skipif, which accepts a condition (a boolean value) and a reason

import os

@pytest.mark.skipif(
    os.environ.get('AWS_REGION') == 'us-west-2',  # .get() avoids a KeyError when AWS_REGION is unset
    reason="Addition has been deactivated in us-west-2 because of issue #234")
def test_addition():
    [...]

With this code running AWS_REGION=eu-west-1 py.test -svv -rs will run the test_addition test, while running AWS_REGION=us-west-2 py.test -svv -rs will skip it. The environment variable AWS_REGION set in the previous command lines is an example that simulates the presence of the variable in the system.

Run tests by name

You can selectively run tests by name using -k. This option accepts Python expressions that try to match the name of the test with the provided values. So

$ pytest -svv -k "test_addition"

will run all the tests whose name contains the substring 'addition', like test_addition, test_addition_multiple_inputs, and test_complex_addition. A more complex expression could be, for example,

$ pytest -svv -k "test_addition and not complex"

which will run both test_addition and test_addition_multiple_inputs but not test_complex_addition.

Tagging tests

Tests can be tagged or labelled using pytest.mark, and the tag can be used to run or skip sets of tests. Let's say that we identify a set of very slow tests that we don't want to run continuously.

@pytest.mark.slow
def test_addition():
    [...]

def test_subtraction():
    [...]

@pytest.mark.slow
def test_multiplication():
    [...]

In the above example test_addition and test_multiplication have been decorated with pytest.mark.slow, which tells pytest to label them with the slow identifier. At this point we can run all the tests with a given tag using the -m option

$ pytest -svv -m slow

Tests can be tagged multiple times

@pytest.mark.complex
@pytest.mark.slow
def test_addition():
    [...]

In this case the test will be run both by pytest -svv -m slow and by pytest -svv -m complex.

The -m option supports complex expressions like

$ pytest -svv -m 'not slow'

which runs all the tests that are not tagged with slow, or

$ pytest -svv -m 'mac or linux'

which runs all the tests tagged with mac and all the tests tagged with linux. Pay attention that -m expressions refer to the tags of each single test, so slow and complex will run only those tests that are tagged both with slow and with complex, and not all the tests marked with the first and all the tests marked with the second.

Adding a command line option

You can add custom command line options to pytest with the pytest_addoption and pytest_runtest_setup hooks, which let you manage the command line parser and the setup for each test.

Let's say, for example, that we want to add a --runslow option that runs all the tests marked with slow. First, create the file tests/conftest.py, which is a file that pytest imports before running your tests, and use the pytest_addoption hook

def pytest_addoption(parser):
    parser.addoption("--runslow", action="store_true", help="run slow tests")

The command line parser configuration will be stored into the config attribute of the setup of each test. Thus we can use the pytest_runtest_setup hook that runs before each test

def pytest_runtest_setup(item):
    if 'slow' in item.keywords and not item.config.getvalue("runslow"):
        pytest.skip("need --runslow option to run")

Here item is the single test, so item.keywords is the set of tags attached to the test, and item.config is the configuration after the parser has run on the command line. So the previous code matches every test that is decorated with @pytest.mark.slow, but only when the --runslow option has not been specified on the command line. When both conditions are satisfied, the pytest.skip function runs, which skips the current test and records the given string as the reason.

Coverage

Coverage is a measure of the percentage of code lines that are "hit" when running tests. Basically, the idea is to discover whether there are parts of the code that are not run during the tests, and are thus untested.

If you follow a strict TDD methodology your coverage will always be 100%, because the only code you will write is the code you need to pass the tests. But please, please, please keep this in mind: not everything is easily tested, and in front of some complex parts of the code you always have to ask yourself "Is it worth it?"

Is it worth spending 3 days writing a test for a feature? Well, if a failure in the new code means a huge financial loss for your company, yes. If you are writing a tool for yourself, and the code you are writing is not dangerous at all, maybe not. With all the shades of grey between these two black-and-white extremes.

So, don't become a slave to the coverage index. A coverage of more than 90% is heaven, and being over 80% is perfectly fine. I would say that, except for specific corner cases, being under 80% means that you are not really following a TDD methodology. So maybe go and review your workflow.

Anyway, pytest gives you a nice way to report the coverage using the coverage program. Just install pytest-cov with

$ pip install pytest-cov

and run pytest with

$ pytest -svv --cov=<name> --cov-report=term

where <name> is the name of the Python module you are testing (actually the path where the code you are testing is). This gives you a nice report with the percentage of covered code file by file.

$ py.test -svv --cov=mypymodule --cov-report=term

----------- coverage: platform linux, python 3.6.5-final-0 -----------
Name                     Stmts   Miss  Cover
--------------------------------------------
mypymodule/__init__.py       3      0   100%
mypymodule/calc.py          23      0   100%
--------------------------------------------
TOTAL                       26      0   100%

You may also use the term-missing report instead of just term, which lists the code blocks that are not covered

$ py.test -svv --cov=mypymodule --cov-report=term-missing

----------- coverage: platform linux, python 3.6.5-final-0 -----------
Name                     Stmts   Miss  Cover   Missing
------------------------------------------------------
mypymodule/__init__.py       3      0   100%
mypymodule/calc.py          23      2    91%   6, 11
------------------------------------------------------
TOTAL                       26      2    92%

Here I commented some of the tests to force the coverage percentage to drop. As you can see the report tells us that lines 6 and 11 of the mypymodule/calc.py file are not covered by any test.

Feedback

Feel free to use the blog Google+ page to comment the post. Feel free to reach me on Twitter if you have questions. The GitHub issues page is the best place to submit corrections.

Will McGugan: Compress WebSocket streams with Lomond 0.3.2


I've recently released version 0.3.2 of Lomond, a WebSocket client for Python with a focus on correctness and ease-of-use.

The major feature of the 0.3 release is per-message compression, which allows for text and binary to be sent in compressed form.

Here's a modified version of the Bitcoin price ticker example which enables compression. This spews every trade made on the GDax platform:

[Embedded code example: Using the per-message compression extension to the WebSocket spec with Lomond.]

That's a lot of data coming through the WebSocket, which fortunately compresses very well.

Also in this release are a number of optimisations to reduce CPU and memory usage.

Lomond is fast approaching a 1.0 release, but is already very stable. It is part of the open source software that powers Dataplicity.

Eli Bendersky: Elegant Python code for a Markov chain text generator


While preparing the post on minimal char-based RNNs, I coded a simple Markov chain text generator to serve as a comparison for the quality of the RNN model. That code turned out to be concise and quite elegant (IMHO!), so it seemed like I should write a few words about it.

It's so short I'm just going to paste it here in its entirety, but this link should have it in a Python file with some extra debugging information for tinkering, along with a sample input file.

from collections import defaultdict, Counter
import random
import sys

# This is the length of the "state" the current character is predicted from.
# For Markov chains with memory, this is the "order" of the chain. For n-grams,
# n is STATE_LEN+1 since it includes the predicted character as well.
STATE_LEN = 4

data = sys.stdin.read()
model = defaultdict(Counter)

print('Learning model...')
for i in range(len(data) - STATE_LEN):
    state = data[i:i + STATE_LEN]
    next = data[i + STATE_LEN]
    model[state][next] += 1

print('Sampling...')
state = random.choice(list(model))
out = list(state)
for i in range(400):
    out.extend(random.choices(list(model[state]), model[state].values()))
    state = state[1:] + out[-1]
print(''.join(out))

Without going into too much detail, a Markov chain is a model describing the probabilities of events based on the current state only (without having to recall all past states). It's very easy to implement and "train".

In the code shown above, the most important part to grok is the data structure model. It's a dictionary mapping a string state to the probabilities of characters following this state. The size of that string is configurable, but let's just assume it's 4 for the rest of the discussion. This is the order of the Markov chain. For every string seen in the input, we look at the character following it and increment a counter for that character; the end result is a dictionary mapping the alphabet to integers. For example, we may find that for the state "foob", 'a' appeared 75 times right after it, 'b' appeared 25 times, 'e' 44 times and so on.
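To make the structure of model concrete, here is a tiny run of the same counting logic on a made-up corpus (the string below is a stand-in, not from the post):

```python
from collections import defaultdict, Counter

STATE_LEN = 4
data = 'foobar foobaz foobar'  # hypothetical toy corpus

model = defaultdict(Counter)
for i in range(len(data) - STATE_LEN):
    state = data[i:i + STATE_LEN]
    model[state][data[i + STATE_LEN]] += 1

# In this corpus the state 'foob' is always followed by 'a':
print(model['foob'])  # Counter({'a': 3})
```

With a real input text the counters would hold many characters per state, with the counts reflecting how often each one follows that 4-character window.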

The learning process is simply sliding a "window" of 4 characters over the input, recording these appearances:

Markov chain sliding window diagram

The learning loop is extremely concise; this is made possible by the right choice of Python data structures. First, we use a defaultdict for the model itself; this lets us avoid existence checks or try/except blocks for states that don't appear in the model at all.

Second, the objects contained inside model are of type Counter, which is a subclass of dict with some special sauce. In its most basic usage, a counter is meant to store an integer count for its keys - exactly what we need here. So a lot of power is packed into this simple statement:

model[state][next] += 1

If you try to rewrite it with model being a dict of dicts, it will become much more complicated to keep track of the corner cases.
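For comparison, here is a sketch of what that single statement expands into with a plain dict of dicts (the input string is just a stand-in):

```python
STATE_LEN = 4
data = 'foobar foobaz'  # hypothetical input
model = {}

for i in range(len(data) - STATE_LEN):
    state = data[i:i + STATE_LEN]
    next_char = data[i + STATE_LEN]
    # All the bookkeeping that defaultdict(Counter) hides:
    if state not in model:
        model[state] = {}
    if next_char not in model[state]:
        model[state][next_char] = 0
    model[state][next_char] += 1
```

Three lines of checks replace a single increment, and every one of them is a place to get a corner case wrong.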

With the learning loop completed, we have in model every 4-letter string encountered in the text, mapped to its Counter of occurrences for the character immediately following it. We're ready to generate text, or "sample from the model".

We start by picking a random state that was seen in the training text. Then, we loop for an arbitrary bound and at every step we randomly select the following character, and update the current state. The following character is selected using weighted random selection - precisely the right idiom here, as we already have in each counter the "weights" - the more often some char was observed after a given state, the higher the chance to select it for sampling will be.

Starting with Python 3.6, the standard library has random.choices to implement weighted random selection. Before Python 3.6 we'd have to write that function on our own (Counter has the most_common() method that would make it easier to write an efficient version).
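A sketch of such a sampling loop, assuming Python 3.6+ for random.choices (the helper name and the early-exit behavior for unseen states are my own choices, not necessarily the article's):

```python
import random
from collections import Counter

def sample(model, length=100):
    """Generate text by repeated weighted random selection from model."""
    state = random.choice(list(model))      # random starting state
    out = list(state)
    for _ in range(length):
        counter = model.get(state)
        if not counter:
            break                           # unseen state: stop early
        # The observed counts act directly as selection weights.
        char = random.choices(list(counter), weights=list(counter.values()))[0]
        out.append(char)
        state = state[1:] + char            # slide the state window
    return ''.join(out)

# Tiny deterministic model: from 'abab' the only successor is 'a'.
tiny = {'abab': Counter({'a': 1})}
```

With the tiny model above, sampling starts at 'abab', emits 'a', and then stops because the new state 'baba' was never seen, producing 'ababa'.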


NumFOCUS: NumFOCUS 2018 Google Summer of Code, Part 4 (Final)

Will McGugan: Idiomatic usage of Python assignment expressions (PEP 572)


Most PEPs (Python Enhancement Proposals) tend to go under the radar of most developers, but PEP 572 has caused a lot of controversy in the Python community, with some developers expressing an intense dislike for the new syntax (and that is sugar-coating it).

Personally, I don't think it deserves the vitriol it received, and I'm looking forward to its addition in Python 3.8.

So what are assignment expressions? Currently in Python, assignments have to be statements, so you can do foo = bar(), but you can't do that assignment from within an if or while statement (for example), which is why the following is a syntax error:

if foo = bar():
    print(f'foo is {foo}')
else:
    print('foo was zero')

That's often been considered a plus point for Python, because a common error in languages that support this syntax is confusing an assignment = for a comparison ==, which too often leads to code that runs but produces unexpected results.

New Assignment Operator

The PEP introduces a new operator := which assigns and returns a value. Note that it doesn't replace = as the assignment operator; it is an entirely new operator that has a different use case.

Let's take a look at a use case for assignment expressions. We have two file-like objects, src and dst, and we want to copy the data from one to the other a chunk at a time. You might write a loop such as the following to do the copy:

chunk = src.read(CHUNK_SIZE)
while chunk:
    dst.write(chunk)
    chunk = src.read(CHUNK_SIZE)

This code reads and writes data a chunk at a time until an empty string is read which indicates that the end of the file has been reached.

I think the above code is readable enough, but there is something awkward about having two identical calls to read. We could avoid that double read with a loop and a break such as this:

while True:
    chunk = src.read(CHUNK_SIZE)
    if not chunk:
        break
        dst.write(chunk)

This works fine, but it's a little clumsy and I don't think it expresses the intent of the code particularly well; five lines seems excessive to accomplish something that feels trivial.

Yet another solution would be to use the lesser known second parameter to iter, which calls a callable until a sentinel value is found. We could implement the read / write loop as follows:

for chunk in iter(lambda: src.read(CHUNK_SIZE) or None, None):
    dst.write(chunk)

This is pretty much how PyFilesystem copies data from one file to another.

I think the iter version expresses intent quite well, but I wouldn't say it is particularly readable. Even at only two lines, if I was scanning that code, I would have to pause to figure those lines out.

Writing this loop with the assignment operator is also two lines:

while chunk := src.read(CHUNK_SIZE):
    dst.write(chunk)

The assignment to chunk happens within the while loop expression, which allows us to read the data and also decide whether to exit the loop.

I think this version expresses intent the best. The first line is a nice idiom for read chunks until empty, which I think developers could easily learn to use and recognise.
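On Python 3.8 and later this idiom runs as-is; here is a self-contained sketch using in-memory streams (the copy helper and the CHUNK_SIZE value are illustrative):

```python
import io

CHUNK_SIZE = 4

def copy(src, dst):
    # Read chunks until an empty read signals end-of-file.
    while chunk := src.read(CHUNK_SIZE):
        dst.write(chunk)

src = io.BytesIO(b'hello walrus')
dst = io.BytesIO()
copy(src, dst)
```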

I don't see this syntax causing much confusion because in nearly every situation, a regular assignment is best. It's true that you could create some difficult to understand code with this feature (especially in list / generator expressions), but that's true of most language constructs. Developers will always be faced with balancing expressiveness and brevity for clarity.

Other examples

The following tweet has links to other places where assignment expressions could be used to simplify common constructs:

FYI @VictorStinner opened some pull requests to CPython for __demonstration purposes__ how PEP 572 assignment expression can be used to make CPython library code more readable/shorter: https://t.co/Gc5IwdqwD1 https://t.co/IxkNMJvo9G https://t.co/eNPKiRgeI6 pic.twitter.com/v6noX5iV6i

— Squeaky (@squeaky_pl) July 5, 2018

Conclusion

I hope that devs will give this syntax another chance. Some developers on the Reddit thread have suggested that they would ban assignment expressions in their code. Hopefully, by the time Python 3.8 has become more mainstream and assignment expression idioms are more common, they will reconsider.

Test and Code: 43: Kelsey Hightower - End to End & Integration Testing


I first heard Kelsey speak during his 2017 PyCon keynote.
He's an amazing speaker, and I knew right then I wanted to hear more about what he does and hear more of his story.

We discuss testing, of course, but we take it further and discuss:

  • tests for large systems, like kubernetes
  • Testing in real world scenarios with all the configuration and everything
  • Becoming a complete engineer by thinking about the end to end flow from the user's perspective
  • Learning from other roles, and the value of roles to allow focus and expertise
  • We even get into Chaos Engineering and testing live systems.

Special Guest: Kelsey Hightower.

Sponsored By:

  • PyCharm: We have a special offer for you: any time before September 1, you can get an Individual PyCharm Professional 3-month subscription for free! (http://testandcode.com/pycharm)

Fabio Zadrozny: PyDev 6.4.3 (code formatter standalone, debugger improvements and f-strings handling)

The latest version of PyDev is now out...

Major changes in this release include:

1. Being able to use the PyDev code formatter as a standalone tool.

To use it, install it with pip install pydevf (the command line is provided as a Python library which will call the actual formatter from PyDev -- see the README at https://github.com/fabioz/PyDev.Formatter for more details on how to use it).

The goal of the PyDev formatter is to stay as close as possible to the original structure of the code while fixing many common issues (so, it won't re-indent based on line width, but it will fix things such as a missing space after a comma, a missing space at the start of a comment, blank lines between methods and classes, etc).

2. Improvements to the debugger, such as:
  • Thread creation is notified as threads are created instead of synchronized afterwards.
  • Frame evaluation support is now disabled by default, as it made the debugger much slower in some cases.
  • Fixed case where breakpoint was missed if an exception was raised in a given line.
  • Properly break on unhandled exceptions on threads.
  • Add missing import which affected repl with IPython.
  • Fix for case where breakpoints could be missed.

As a note, the debugger improvements have been sponsored by Microsoft, which is in the process of using the PyDev Debugger as the core of ptvsd, the Python debugger package used by Python in Visual Studio and the Python Extension for Visual Studio Code (note that it's still marked as experimental there as it's in the process of being integrated into ptvsd).

 It's really nice to see pydevd being used in more IDEs in the Python world! 😉

Besides those, there are some bugfixes in handling f-strings and sending the contents of the current line to the console (through F2).

Also, a new major version of LiClipse (5.0) is now also available (see: http://www.liclipse.com/download.html for how to get it). It includes the latest PyDev and a new major Eclipse release (4.8 - Photon).

Talk Python to Me: #168 10 Python security holes and how to plug them

Do you write Python software that uses the network, opens files, or accepts user input? Of course you do! That's what almost all software does. But these actions can let bad actors exploit mistakes and oversights we've made to compromise our systems.

Made With Mu: Rainbow Pride Traffic Lights


Caitlinsdad has created colourful Rainbow Pride Traffic Lights with Mu and CircuitPython. This is great timing since Pride month just finished and today tens of thousands of people are taking part in the Pride in London Parade. Thanks to Adafruit for the heads-up!

Caitlinsdad explains:

So I have an Adafruit Circuit Playground Express board to play with. This project seems like a good way to learn Circuit Python programming and to get more familiar with the Mu Editor and REPL function.

Thanks to some epic cardboard crafting, Neopixels, a healthy dollop of paint and experimentation with timings in the CircuitPython code, the results are a customizable rainbow light-show:


The detailed write-up can be found on Caitlinsdad’s area of Instructables and the source code is in this gist.

Weekly Python StackOverflow Report: (cxxxiii) stackoverflow python report


Techiediaries - Django: Pipenv Tutorial for Django Developers


Pipenv is the new officially recommended packaging tool for Python, similar to modern package managers like NPM (Node.js) or Composer (PHP). Pipenv solves common problems most Python developers encounter in the typical workflow of using pip with virtualenv or venv.

This tutorial will teach you how to install Pipenv on Linux and Windows, how to use Pipenv to manage Python dependencies, and how to use the traditional tools, such as Pip and virtualenv, alongside Pipenv.

Pipenv vs. Pip vs. Virtualenv vs. Venv

When working with Python projects you usually use a requirements.txt and Pip to install the packages automatically (pip install -r requirements.txt) on your development machine or later on your production machine.

An example requirements.txt looks like:

Django
distribute
dj-database-url
psycopg2
wsgiref

Each line contains a dependency that Pip will install, either globally or in a virtual environment if it's activated.

We didn't specify the required versions, so Pip will install the latest versions of the listed packages. Now, imagine that you are using this requirements file for a production environment after a while in development. This may present some problems if the newer versions have breaking changes.

You can solve the issue by adding versions (or pinning requirements) so Pip will install the same versions in production:

Django==1.4.1
distribute==0.6.27
dj-database-url==0.2.1
psycopg2==2.4.5
wsgiref==0.1.2

You may think this is going to solve your issues, but is that true? Not always: even if your direct requirements are pinned, these packages have dependencies of their own, which Pip will also install. If those versions are not pinned, you end up installing the latest versions of your dependencies' dependencies, which may have breaking changes.

Now, how do you replicate the same environment in your production environment? You can use pip freeze > requirements.txt, which will produce a requirements file with all dependencies and the exact versions used in your development environment. Now, doesn't that solve our earlier issues?

In one way, yes, but this results in other issues.

By pinning your project's dependencies, you make sure your project doesn't break when it's deployed to a production machine. But now that you've pinned the exact version of every dependency your project uses, you also need to update these versions manually when necessary, especially to patch any discovered security issues that don't introduce breaking changes. This is not always convenient.

Here comes Pipenv! Pipenv relieves you from manually updating the versions of sub-dependencies while at the same time giving you deterministic versions of your project's dependencies. So you can have the latest versions of dependencies and sub-dependencies as long as they don't introduce any breaking changes.

Multiple Projects with Different Versions of Dependencies

Python is usually installed system-wide. In most cases, when you are just using Python to run some tools, you will be fine with this setup. But developers working with multiple Python projects, each typically using different versions of packages, will have a hard time switching between projects. The solution is using virtual environments (virtualenv for Python 2 or venv for Python 3), which provide isolated environments with their own Python binaries and dependencies.

But if the solution already exists, what does Pipenv provide?

Pipenv includes built-in support for virtual environments, so once you've installed Pipenv, you don't need to install virtualenv or venv, which in many cases are a headache for developers.

Pipenv also allows you to choose Python 2 or Python 3 for your virtual environment using a switch.

Pipenv vs Pip Dependency Resolver

Pip itself doesn't provide dependency resolution to avoid conflicts when multiple packages require different versions of the same dependency, so you have to explicitly specify the range of wanted versions in requirements.txt. For example:

packageC >=1.0,<=2.0
packageA
packageB 

Here packageA requires a version >= 1.0 of packageC, and packageB needs a version <= 2.0 of packageC. Without specifying the range, Pip may fail to install a version that satisfies both.

You can refer to this open issue for more information.

Pipenv is smart enough to figure out the versions of the sub-dependencies that meet the requirements without you explicitly specifying them.

Getting Started with Pipenv with Django Example

Now that we've seen the issues Pipenv solves, let's see how to get started using Pipenv to create a virtual environment for a Django project and install its dependencies. This tutorial is part of a series of tutorials on using Django with modern front-end frameworks and libraries such as Angular, React and Vue.

Installing Pipenv in Linux

First, you'll need to install pipenv using pip:

$ pip install pipenv

This will install pipenv system-wide. After installing pipenv you can stop using pip directly and start using pipenv, which uses pip and virtualenv or venv behind the scenes.

In my system (Ubuntu 16) I got this error:

Could not install packages due to an EnvironmentError: [Errno 13] Permission denied: '/usr/bin/easy_install' Consider using the --user option or check the permissions.

This is because I'm installing pipenv to a system-wide folder that I don't have permission to write to.

There are generally three options to solve this type of error:

  • Use a virtual environment to install the package (not recommended for our situation, but recommended for most other cases).
  • Install the package to the user folder:
    python -m pip install --user <package>
  • Use sudo to install to the system folder (not recommended):
    sudo python -m pip install <package>

I'm using the second option:

python -m pip install --user pipenv

Installing Pipenv in Windows

You can install pipenv in Windows 10 using Power Shell by following these instructions:

First, start by running Windows Power Shell as Administrator

Next, run the following command:

pip install pipenv

You need to have pip installed on your Windows system.

Next, run the following command, replacing <USERNAME> with your own user name:

set PATH=%PATH%;c:\users\<USERNAME>\appdata\local\programs\python\python36-32\Scripts

You can then start using pipenv easily from your Power Shell.

Pipenv makes use of two files that replace requirements.txt: Pipfile and Pipfile.lock (the latter is responsible for producing deterministic builds).

Let's start by spawning a shell with a virtual environment where we can do all the work related to our current project. Run the following command from your terminal:

$ pipenv shell --three

This will create a virtual environment in the default location (usually under the home folder) where all created virtual environments are kept, along with a Pipfile for the project.

The --three option allows us to specify the version of Python. In this case we want Python 3. For Python 2, use --two. If you don't specify the version, the default one will be used.

You can also provide a specific version like 3.6 with the --python option. For example --python 3.6.

You'll get something similar to this output:

Creating a virtualenv for this project...
Pipfile: /home/ahmed/Desktop/djangoreactdemo/backend/Pipfile
Using /usr/bin/python3.5m (3.5.2) to create virtualenv...
⠋Running virtualenv with interpreter /usr/bin/python3.5m
Using base prefix '/usr'
New python executable in /home/ahmed/.local/share/virtualenvs/backend-mJ9anpjL/bin/python3.5m
Also creating executable in /home/ahmed/.local/share/virtualenvs/backend-mJ9anpjL/bin/python
Installing setuptools, pip, wheel...done.
Setting project for backend-mJ9anpjL to /home/ahmed/Desktop/djangoreactdemo/backend

Virtualenv location: /home/ahmed/.local/share/virtualenvs/backend-mJ9anpjL
Launching subshell in virtual environment…

Also the virtual environment will be activated.

Pipenv tutorial

Next, we can install our dependencies. Let's start with Django

$ pipenv install django

This will install the latest version of Django.

Next, let's install Django REST framework

$ pipenv install djangorestframework

Finally let's install django-cors-headers package for easily enabling CORS in our Django project

$ pipenv install django-cors-headers

After installing the required packages for our project we can inspect Pipfile. This is the content:

[[source]]
url = "https://pypi.python.org/simple"
verify_ssl = true
name = "pypi"

[dev-packages]

[packages]
django = "*"
djangorestframework = "*"
django-cors-headers = "*"

[requires]
python_version = "3.5"

The Pipfile uses TOML syntax and contains different sections:

  • [dev-packages] that contains packages required for development only,
  • [packages] for development and production packages,
  • [requires] for other requirements like the version of Python.

Create a Django Project

Now navigate to where you want to create your Django project and run the following command to generate a new project named backend:

$ django-admin.py startproject backend

This will create a project with this structure

.
├── backend
│   ├── __init__.py
│   ├── settings.py
│   ├── urls.py
│   └── wsgi.py
└── manage.py

You can migrate your database with:

$ python manage.py migrate

This will create a db.sqlite3 file inside your project's root folder, and you'll get similar output:

Operations to perform:
  Apply all migrations: admin, auth, contenttypes, sessions
Running migrations:
  Applying contenttypes.0001_initial... OK
  Applying auth.0001_initial... OK
  Applying admin.0001_initial... OK
  Applying admin.0002_logentry_remove_auto_add... OK
  Applying contenttypes.0002_remove_content_type_name... OK
  Applying auth.0002_alter_permission_name_max_length... OK
  Applying auth.0003_alter_user_email_max_length... OK
  Applying auth.0004_alter_user_username_opts... OK
  Applying auth.0005_alter_user_last_login_null... OK
  Applying auth.0006_require_contenttypes_0002... OK
  Applying auth.0007_alter_validators_add_error_messages... OK
  Applying auth.0008_alter_user_username_max_length... OK
  Applying auth.0009_alter_user_last_name_max_length... OK
  Applying sessions.0001_initial... OK

Pipenv tutorial

Finally serve your project using the following command:

$ python manage.py runserver

Now, you can visit your web application from http://127.0.0.1:8000/:

Pipenv tutorial with Django

Using Pipenv with Existing Projects

For most cases, we'll be using an existing Django project from our front-end tutorials so you'll need to clone a project from GitHub which uses pipenv. In this case, you only need to spawn a shell and install packages from Pipfile or Pipfile.lock using the following command:

$ pipenv install --dev

This will use Pipfile.lock to install packages.

Pipenv tutorial with django

Conclusion

Pipenv is now the officially recommended packaging tool for Python. If you are still using pip, virtualenv or venv for your Python projects, then I recommend making the switch. Migrating an existing project to use Pipenv is very straightforward. You can follow the same steps in this tutorial for legacy projects that use a requirements.txt and run pipenv install, which will automatically detect the requirements.txt file and convert it to a Pipfile.

Continuum Analytics Blog: Scalable Machine Learning with Dask—Your Questions Answered!


Building powerful machine learning models often requires more computing power than a laptop can provide. Although it’s fairly easy to provision compute instances in the cloud these days, all the computing power in the world won’t help you if your machine learning library cannot scale. Unfortunately, popular libraries like scikit-learn, XGBoost, and TensorFlow don’t offer …

The post Scalable Machine Learning with Dask—Your Questions Answered! appeared first on Anaconda.

Erik Marsja: A Basic Pandas Dataframe Tutorial for Beginners


In this Pandas tutorial we will learn how to work with Pandas dataframes. More specifically, we will learn how to read and write Excel (i.e., xlsx) and CSV files using Pandas.

We will also learn how to add a column to Pandas dataframe object, and how to remove a column. Finally, we will also learn how to subset and group our dataframe.

What is Pandas Dataframe?

Pandas is a library for enabling data analysis in Python. It’s very easy to use and quite similar to the programming language R’s data frames. It’s open source and free.

When working with datasets from real experiments, we need a way to group data of differing types. For instance, in psychology research we often use different data types. If you have experience doing data analysis with SPSS, you are probably familiar with some of them (e.g., categorical, ordinal, continuous).

Imagine that we have collected data in an experiment in which we were interested in how images of kittens and puppies affected the mood in the subjects and compared it to neutral images.

After each image, randomly presented on a computer screen, the subjects were to rate their mood on a scale.

Then the data might look like this:

Condition  Mood Rating  Subject Number  Trial Number
Puppy      7            1               1
Kitten     6            1               2
Puppy      7            1               4
Neutral    6            1               5
Puppy      6            12              9
Neutral    6            12              10

This is generally what a dataframe is. Obviously, working with Pandas dataframe will make working with our data easier. See here for more extensive information.

Pandas Create Dataframe

In Psychology, the most common methods to collect data are questionnaires, experiment software (e.g., PsychoPy, OpenSesame), and observations.

When using digital applications for both questionnaires and experiment software we will, of course, also get our data in a digital file format (e.g., Excel spreadsheets and Comma-separated, CSV, files).

Output of Pandas dataframe head

If the dataset is quite small it is possible to create a dataframe directly using Python and Pandas:

import pandas as pd

# Create some variables
trials = [1, 2, 3, 4, 5, 6]
subj_id = [1]*6
group = ['Control']*6
condition = ['Affect']*3 + ['Neutral']*3

# Create a dictionary
data = {'Condition':condition, 'Subject_ID':subj_id, 
        'Trial':trials, 'Group':group}

# Create the dataframe
df = pd.DataFrame(data)
df.head()

Entering data by hand when datasets are large is, however, very time-consuming and not recommended. Below you will learn how to read Excel spreadsheets and CSV files in Python and Pandas.

Loading Data Using Pandas

As mentioned above, large dataframes are usually read into a dataframe from a file. Here we will learn how to use the Pandas read_excel and read_csv methods to load data into a dataframe. There are a lot of datasets available for practicing with Pandas dataframes. In the examples below we will use some of the R datasets that can be found here.

Working with Excel Spreadsheets Using Pandas

Spreadsheets can quickly be loaded into a Pandas dataframe and you can, of course, also write a spreadsheet from a dataframe. This section will cover how to do this.

Reading Excel Files Using Pandas read_excel

One way to read a dataset into Python is using the method read_excel, which has many arguments.

pd.read_excel(io, sheet_name=0, header=0)

io is the Excel file containing the data. It should be a string and can be a path to a locally stored file as well as a URL.

sheet_name can be a string for the specific sheet we want to load, or an integer for a zero-indexed sheet position. If we specify None, all sheets are read (returned as a dictionary of dataframes).

header can be an integer or a list of integers. The default is 0, and the integer represents the row that holds the column names. Pass None if you don't have column names in your Excel file.

Excel file written with to_excel

See the read_excel documentation if you want to learn about the other arguments.

Pandas Read Excel Example

Here’s a working example on how to use Pandas read_excel:

import pandas as pd

# Load a XLSX file from a URL
xlsx_source = 'http://ww2.amstat.org/publications' \
              '/jse/v20n3/delzell/conflictdata.xlsx'
# Reading the excel file to a dataframe. 
# Note, there's only one sheet in the example file
df = pd.read_excel(xlsx_source, sheet_name='conflictdata')
df.head()

In the example above we are reading an Excel file ('conflictdata.xlsx'). The dataset has only one sheet, but for clarity we added the 'conflictdata' sheet name as an argument. That is, sheet_name was not strictly necessary in this case.

The last line may be familiar to R users and is printing the first X lines of the dataframe:

First 5 rows from an Excel file loaded into a Pandas dataframe

As you may have noticed we did not use the header argument when we read the Excel file above. If we set header to None we’ll get digits as column names. This, unfortunately, makes working with the Pandas dataframe a bit annoying.

Luckily, we can pass a list of column names as an argument. Finally, as the example xlsx file contains column names we skip the first row using skiprows. Note, skiprows can be used to skip more than one row. Just add a list with the row numbers that are to be skipped.

Here's another example of how to read an Excel file using Python and Pandas:

import pandas as pd

xlsx_source = 'http://ww2.amstat.org/publications' \
              '/jse/v20n3/delzell/conflictdata.xlsx'

# Creating a list of column names
col_names = ['Col' + str(i) for i in range (1, 17)]

# Reading the excel file
df = pd.read_excel(xlsx_source, sheet_name='conflictdata', 
                   header=None, names=col_names, skiprows=[0])
df.head()

Writing Excel Files Using Pandas to_excel

We can also save a new xlsx (or overwrite the old, if we like) using Pandas to_excel method. For instance, say we made some changes to the data (e.g., aggregated data, changed the names of factors or columns) and we collaborate with other researchers. Now we don’t want to send them the old Excel file.

df.to_excel(excel_writer, sheet_name='Sheet1', index=False)

excel_writer can be a string (your file name) or an ExcelWriter object.

sheet_name should be a string with the sheet name. Default is ‘Sheet1’.

index should be a boolean (i.e., True or False). Typically, we don't want to write the index as a new column of numbers. The default is True.

Pandas dataframe to Excel example:

df.to_excel('newfilename.xlsx', sheet_name='NewColNames', index=False)

That was pretty simple; we have now written a new Excel file (xlsx) to the same directory as your Python script.

Working with CSV Files Using Pandas

Now we continue with a more common way to store data, at least in Psychology research: CSV files. We will learn how to use Python and Pandas to load CSV files into dataframes.

pd.read_csv(filepath_or_buffer, sep=',')

filepath_or_buffer is the name of the file to read from. It can be relative to the directory that your Python script is in, or absolute. It can also be a URL. What is important is that it is a string. Don't worry, we will go through this with an example later.

sep is the delimiter to use. The most common delimiter in a CSV file is the comma (","), and it's what separates the columns in the file. If you don't know the delimiter, you can set sep to None and the Python parsing engine will detect it.
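For example, a small made-up in-memory file with semicolon delimiters (note that delimiter sniffing requires the Python parsing engine, selected with engine='python'):

```python
import io
import pandas as pd

# Semicolon-delimited data in an in-memory buffer.
raw = io.StringIO('a;b\n1;2\n3;4')

# sep=None tells the Python parsing engine to sniff the delimiter.
df = pd.read_csv(raw, sep=None, engine='python')
```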

Have a look at the read_csv documentation if you want to learn about the other arguments.

It’s easy to read a csv file in Python Pandas. Here’s a working example on how to use Pandas read_csv:

import pandas as pd

df = pd.read_csv('https://vincentarelbundock.github.io/' \
                 'Rdatasets/csv/psych/Tucker.csv', sep=',')
df.head()

Writing CSV Files Using Pandas to_csv

There are of course occasions when you may want to save your dataframe to csv. This is, of course, also possible with Pandas. We just use the Pandas dataframe to_csv method:

df.to_csv(path_or_buf, sep=',', index=False)

Pandas Dataframe to CSV Example:

df.to_csv('newfilename.csv', sep=';', index=False)

It was simple to export Pandas dataframe to a CSV file, right? Note, we used semicolon as separator. In some countries (e.g., Sweden) comma is used as decimal separator. Thus, this file can now be opened using Excel if we ever want to do that.

Here’s a video tutorial for reading and writing csv files using Pandas:

Now we have learned how to read and write Excel and CSV files using Pandas read_excel, to_excel, and read_csv, to_csv methods. The next section of this Pandas tutorial will continue with how to work with Pandas dataframe.

Working with Pandas Dataframe

Now that we know how to read and write Excel and CSV files using Python and Pandas we continue with working with Pandas Dataframes. We start off with basics: head and tail.

head enables us to print the first x rows. As explained earlier, by default we see the first 5 rows, but we can, of course, look at more or fewer rows:

import pandas as pd

df = pd.read_csv('https://vincentarelbundock.github.io/' \
'Rdatasets/csv/carData/Wong.csv', sep=',')
df.head(4)

Using tail, on the other hand, will print the x last rows of the dataframe:

df.tail(4)

Each column, or variable, in a Pandas dataframe has a unique name. We can extract variables by means of the dataframe name and the column name. This can be done using the dot sign:

piq = df.piq
piq[0:4]

We can also use the [ ] notation to extract columns. For example, df.piq and df[‘piq’] is equal:

df.piq is the same as df['piq']

Furthermore, if we pass a list we can select more than one of the variables in a dataframe. For example, we get the two columns "piq" and "viq" (['piq', 'viq']) as a dataframe like this:

pviq = df[['piq', 'viq']]
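To try this without downloading the example dataset, here is a tiny made-up dataframe showing both selection styles:

```python
import pandas as pd

df = pd.DataFrame({'piq': [86, 90], 'viq': [94, 100], 'age': [20, 30]})

col = df['piq']            # single brackets give a Series
pviq = df[['piq', 'viq']]  # a list of names gives a sub-dataframe
```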

How to Add a Column to Pandas Dataframe

We can also create a new variable within a Pandas dataframe, by naming it and assigning it a value. For instance, in the dataset we working here we have two variables “piq” (imathematical IQ) and “viq” (verbal IQ). We may want to calculate a mean IQ score by adding “piq” and “viq” together and then divide it by 2.

We can calculate this and add it to the df dataframe quite easily:

df['iq'] = (df['piq'] + df['viq'])/2

Alternatively, we can calculate this using the mean() method. Here we use the argument axis=1 so that we get the row means:

df['iq'] = df[['piq', 'viq']].mean(axis=1)
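As a quick sanity check, the two approaches should produce identical columns. A minimal sketch using a small, made-up dataframe (not the Wong data):

```python
import pandas as pd

# Small, made-up example data (not the Wong dataset)
df = pd.DataFrame({'piq': [100, 90, 110], 'viq': [80, 100, 120]})

# Arithmetic on the columns vs. the mean() method with axis=1
iq_arithmetic = (df['piq'] + df['viq']) / 2
iq_mean = df[['piq', 'viq']].mean(axis=1)

print(iq_arithmetic.equals(iq_mean))  # → True
```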

Sometimes we may want to just add a column to a dataframe without doing any calculation. It’s done in a similar way:

df['NewCol'] = ''

Remove Columns From a Dataframe

Other times we may also want to drop columns from a Pandas dataframe. For instance, the column in df named 'Unnamed: 0' is quite unnecessary to keep.

Removing columns can be done using drop. In this example we pass a list to drop both the 'NewCol' and the 'Unnamed: 0' columns. If we only wanted to remove one column from the Pandas dataframe, we'd pass a string (e.g., 'NewCol').

df.drop(['NewCol', 'Unnamed: 0'], axis=1, inplace=True)

Note that to drop columns, and not rows, the axis argument is set to 1, and to make the changes to the dataframe itself we set inplace to True.

The above calculations are great examples for when you may want to save your dataframe as a CSV file.

How to Subset Pandas Dataframe

There are many methods for selecting rows of a dataframe. One simple method is by using query. This method is similar to the function subset in R.

Here's an example in which we subset the dataframe where "piq" is greater than 80:

df_piq = df.query('piq > 80')
df_piq.head(4)

Selected rows of a Pandas Dataframe

We can also subset a dataframe using boolean indexing. For example, here we select the rows where sex is "Male":

df_males = df[df['sex'] == 'Male']

The next subsetting example shows how to filter the dataframe with multiple criteria. In this case, we select observations from df where sex is "Male" and iq is greater than 80. Note that the ampersand "&" is the preferred AND operator in Pandas.

df_male80 = df.query('iq > 80 & sex == "Male"')
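The same multi-criteria filter can also be written with boolean masks instead of query. A minimal sketch on made-up data (only the column names are borrowed from the examples above):

```python
import pandas as pd

# Made-up example data with the same column names as above
df = pd.DataFrame({'iq': [75, 95, 85], 'sex': ['Male', 'Male', 'Female']})

via_query = df.query('iq > 80 & sex == "Male"')
# Equivalent boolean-mask version; the parentheses around each
# condition are required because & binds tighter than comparisons
via_mask = df[(df['iq'] > 80) & (df['sex'] == 'Male')]

print(via_query.equals(via_mask))  # → True
```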

It's also possible to use the OR operator. In the following example, we filter the Pandas dataframe for rows that have a value of age greater than or equal to 40, or age less than 14. Furthermore, we select only the columns 'piq' and 'viq'.

df.query('age >= 40 | age < 14')[['piq', 'viq']].head()

Pandas Dataframe Filtered using the OR operator and 2 columns selected

Random Sampling Rows From a Dataframe

Using the sample method it's also possible to draw random samples from the dataframe. In the example below we draw a random sample of 25 observations (n=25) from the Pandas dataframe.

df_random = df.sample(n=25)
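For reproducible sampling, sample also accepts a random_state seed. A minimal sketch on a made-up dataframe:

```python
import pandas as pd

# Made-up dataframe with 100 rows
df = pd.DataFrame({'x': range(100)})

# The same seed gives the same 25 rows on every run
s1 = df.sample(n=25, random_state=42)
s2 = df.sample(n=25, random_state=42)

print(len(s1), s1.equals(s2))  # → 25 True
```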

How to Group Data using Pandas Dataframe

Now we have learned how to read Excel and CSV files into a Pandas dataframe, how to add and remove columns, and how to subset the created dataframe.

Although subsets may come in handy, there is no need for them when you just want to look at specific groups in the data.

Pandas has a method for grouping the data which can come in handy: groupby. It is especially useful if you want to summarize your data.

As an example, we may, based on theory, hypothesize that there's a difference between men and women. Thus, in the first example we are going to group the data by sex and get the mean age, piq, and viq.

df_sex = df.groupby('sex')
df_sex[['age', 'piq', 'viq']].mean()

Summary Statistics (i.e., mean) grouped by Sex

If we were to fully test our hypothesis we would need to apply hypothesis testing; see the posts on this blog about carrying out between-subjects analysis of variance using Python.

In the next example we are going to use Pandas describe on our grouped dataframe. Using describe we will get a table with descriptive statistics (e.g., count, mean, standard deviation) of the added column ‘iq’.

df_sex[['iq']].describe()
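If you only want a few specific statistics rather than the full describe table, groupby also supports agg. A minimal sketch on made-up data (column names borrowed from the examples above):

```python
import pandas as pd

# Made-up example data, not the Wong dataset
df = pd.DataFrame({'sex': ['Male', 'Female', 'Male', 'Female'],
                   'iq': [90.0, 100.0, 110.0, 120.0]})

# One row per group, one column per requested statistic
summary = df.groupby('sex')['iq'].agg(['mean', 'std', 'count'])
print(summary)
```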

More about summary statistics using Python and Pandas can be read in the post Descriptive Statistics using Python.

Summary of What We’ve Learned

  • Working with CSV and Excel files using Pandas
    • Pandas read_excel & to_excel
    • Pandas read_csv & to_csv
  • Working with Pandas Dataframe
    • Add a column to a dataframe
    • Remove a column from a dataframe
    • Subsetting a dataframe
    • Grouping a dataframe

There are, of course, many more things that we can do with Pandas dataframes. We usually want to explore our data with more descriptive statistics and visualizations. Make sure to check back here for more basic guides and in-depth guides on working with Pandas dataframes. These guides will include how to visualize data and how to carry out parametric statistics.

Leave a comment below if you have any requests or suggestions on what should be covered next!

The post A Basic Pandas Dataframe Tutorial for Beginners appeared first on Erik Marsja.

Techiediaries - Django: Create New Django Project (Django 1.11 Example)


Create new Django 1.11 example project

Django is a Python based framework which offers developers all they need to create web apps and websites in a clean, rapid and pragmatic way.

How to create a Django project is the first thing a beginner Django developer asks, so let's see how to quickly create a new Django 1.11 project from scratch.

This tutorial is updated to use Pipenv, the officially recommended Python packaging tool.

Tutorial Requirements for Creating a New Django Project

To follow this tutorial, you need basic Python experience and to know how to work with Linux bash commands or the Windows command prompt (if you are using Windows for development).

Setting Up a Development Environment Before Creating a New Django Project

Before you can create a new Django project you need to have a development environment ready with the following requirements:

  • Python 2.7.x or 3.4.x
  • cUrl

If you are using a Linux/macOS system you should have Python already installed.

You can check the version of your installed Python by running the following command in your terminal:

$ python --version

If you don't have Python installed head over to Python download page and grab the installer for your system.

Pip is a Python package manager used to easily install Python packages and their dependencies. You can install pip using curl utility:

$ curl "https://bootstrap.pypa.io/get-pip.py" -o "get-pip.py"
$ python get-pip.py

You can then verify if it's successfully installed by running:

$ pip -V

Setting Up a Development Environment with Pipenv

Let's now see how to set up our development environment using pipenv, which automatically creates a virtual environment and abstracts pip for installing dependencies.

First, you'll need to install pipenv using pip:

$ python -m pip install --user pipenv

This will install pipenv for the current user. After installing pipenv you can now stop using pip and start using pipenv which uses pip and virtualenv or venv behind the curtains.

Pipenv makes use of two additional files that replace requirements.txt which are Pipfile and Pipfile.lock (the file responsible for producing deterministic builds).

Let's start by spawning a shell with a virtual environment where we can do all the work related to our current project. Run the following command from your terminal:

$ pipenv shell --two

This will create a virtual environment with Python 2.7. You will see a similar output in your terminal:

Creating a virtualenv for this project...
Pipfile: /home/ahmed/Desktop/django11example/Pipfile
Using /usr/bin/python2 (2.7.12) to create virtualenv...
⠋Running virtualenv with interpreter /usr/bin/python2
New python executable in /home/ahmed/.local/share/virtualenvs/django11example-Jpiac6qK/bin/python2
Also creating executable in /home/ahmed/.local/share/virtualenvs/django11example-Jpiac6qK/bin/python
Installing setuptools, pip, wheel...done.
Setting project for django11example-Jpiac6qK to /home/ahmed/Desktop/django11example

Virtualenv location: /home/ahmed/.local/share/virtualenvs/django11example-Jpiac6qK
Launching subshell in virtual environment…
 . /home/ahmed/.local/share/virtualenvs/django11example-Jpiac6qK/bin/activate

Also the virtual environment will be activated.


Inside your current project, you'll have a Pipfile created. This is the content of this file:

[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[dev-packages]

[packages]

[requires]
python_version = "2.7"

Using Virtualenv (Skip if Using Pipenv)

If you are using pipenv, you can skip this section as pipenv automatically manages a virtual environment for your project.

Virtualenv is a tool which allows you to create virtual Python environments. It's a common practice to create a virtual environment for each Python project to prevent different versions of the same packages to conflict with each other when switching between projects.

To install Virtualenv, run the following command:

$ pip install virtualenv

After installing all development requirements, it's time to create a virtual development environment using virtualenv:

$ mkdir django111-project
$ cd django111-project
$ virtualenv env

Next, make sure to activate the environment with:

$ source env/bin/activate

In case you want to deactivate the environment, simply run:

deactivate

Installing Django with Pip (Skip if Using Pipenv)

Now we are ready to install Django:

$ pip install django

This will install the latest version of Django.

At the time of writing, the latest Django version compatible with Python 2.7 is 1.11.

Installing Django 1.11 with Pipenv

If you are using Pipenv, you can install packages using pipenv install, which uses pip behind the curtains. Let's install Django:

$ pipenv install django

This will install the latest version of Django and add it to your Pipfile.


Create New Django Project

After setting up the development environment, creating a new virtual environment, and installing the latest version of Django, you can create a Django project by running:

$ django-admin.py startproject django111project

This will create a project with the following directory structure:

├── django111project
│   ├── __init__.py
│   ├── settings.py
│   ├── urls.py
│   └── wsgi.py
└── manage.py

Next navigate inside your project:

$ cd django111project 

And create your SQLite database using:

$ python manage.py migrate 

This will create a SQLite database, which is the default option for Django projects, but you can also use any other advanced database system such as MySQL or PostgreSQL.

SQLite comes pre-installed with Python so we are going to use it for this simple project.
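If you later switch to PostgreSQL, for instance, the DATABASES setting in settings.py would look something like this sketch (the database name, user, and password below are placeholders, and the psycopg2 driver must be installed first):

```python
# settings.py (sketch) -- requires `pip install psycopg2`
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'mydb',        # placeholder database name
        'USER': 'myuser',      # placeholder user
        'PASSWORD': 'secret',  # placeholder password
        'HOST': 'localhost',
        'PORT': '5432',
    }
}
```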

You can, then, run a local development server with:

$ python manage.py runserver

You should be able to visit your web application at http://127.0.0.1:8000/

Create a Django Application

A Django application is a collection of files used to separate logical units of your Django project for the sake of organization.

Before implementing your project features, it's better to first create a Django application for each feature. For example:

$ python manage.py startapp authentication 

The project directory structure looks like:

├── authentication
│   ├── admin.py
│   ├── apps.py
│   ├── __init__.py
│   ├── migrations
│   │   └── __init__.py
│   ├── models.py
│   ├── tests.py
│   └── views.py
├── db.sqlite3
├── django111project
│   ├── __init__.py
│   ├── __init__.pyc
│   ├── settings.py
│   ├── settings.pyc
│   ├── urls.py
│   ├── urls.pyc
│   ├── wsgi.py
│   └── wsgi.pyc
└── manage.py

After creating an application, you need to add some configuration in settings.py. Open settings.py, locate the INSTALLED_APPS setting, and add your application:

INSTALLED_APPS = (
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.sites',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'django.contrib.admin',
    'authentication',
)

Next, you need to create the migration files (supposing you have added Django models to the authentication application):

$ python manage.py makemigrations     
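For makemigrations to have anything to do, the authentication app needs at least one model. Here is a hypothetical sketch (the Profile model and its fields are made up for illustration) of what authentication/models.py might contain:

```python
# authentication/models.py (hypothetical example model)
from django.db import models


class Profile(models.Model):
    username = models.CharField(max_length=150, unique=True)
    bio = models.TextField(blank=True)
    created_at = models.DateTimeField(auto_now_add=True)

    def __str__(self):
        return self.username
```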

Then run the migrations to create the actual database tables:

$ python manage.py migrate

You should get similar output:

Operations to perform:
  Apply all migrations: admin, auth, contenttypes, sessions
Running migrations:
  Applying contenttypes.0001_initial... OK
  Applying auth.0001_initial... OK
  Applying admin.0001_initial... OK
  Applying admin.0002_logentry_remove_auto_add... OK
  Applying contenttypes.0002_remove_content_type_name... OK
  Applying auth.0002_alter_permission_name_max_length... OK
  Applying auth.0003_alter_user_email_max_length... OK
  Applying auth.0004_alter_user_username_opts... OK
  Applying auth.0005_alter_user_last_login_null... OK
  Applying auth.0006_require_contenttypes_0002... OK
  Applying auth.0007_alter_validators_add_error_messages... OK
  Applying auth.0008_alter_user_username_max_length... OK
  Applying sessions.0001_initial... OK

You can serve your project using:

$ python manage.py runserver

You should get similar output:

Performing system checks...

System check identified no issues (0 silenced).
July 08, 2018 - 22:06:57
Django version 1.11.14, using settings 'django111project.settings'
Starting development server at http://127.0.0.1:8000/
Quit the server with CONTROL-C.

When you visit http://127.0.0.1:8000/ you should see the Django welcome page.

Conclusion

We have seen how to create a new Django project after setting up the development environment and installing Django.

Mike Driscoll: PyDev of the Week: Ryan Kirkbride


This week we welcome Ryan Kirkbride (@ryankirkbride26) as our PyDev of the Week! Ryan is the creator of FoxDot, a library for live coding music in an interactive Python editor. You can see what projects Ryan is a part of by going to his Github page. Let’s take a few moments to get to know Ryan!

Can you tell us a little about yourself (hobbies, education, etc):

I’m currently doing a PhD at the University of Leeds in the School of Music researching collaborative tools for live coding music. Live coding is basically interactive programming for creating music or visuals and probably my favourite thing to do right now. There’s a growing scene called “Algorave” where live coders get together to make music for people to dance to and they’re a lot of fun to perform at.

Why did you start using Python?

I started using Python during my first year of university when I was studying computer science. It’s such a great language for beginners but there’s also so much to learn as you use it more and more.

What other programming languages do you know and which is your favorite?

At university I did some projects using Javascript, Java and C++, but Python remained my favourite to use throughout the course of my degree. I just really liked the simplicity of the syntax and how easily I could express my ideas.

What projects are you working on now?

At the moment I’m working on my live coding environment, FoxDot, which is basically a library for live coding music that comes with an interactive Python editor. It doesn’t actually make any sound itself but triggers synths and samples loaded in a program called SuperCollider. FoxDot isn’t the only library out there for live coding music but it might be the only one using Python. It’s very heavily inspired by a Haskell environment called TidalCycles and the SuperCollider program I mentioned above.

I’m also working on a real-time collaborative editor for live coding as part of my PhD. Kind of like a Google docs editor for live coding music. I started writing it in Python to just use it with FoxDot but it can also be used with TidalCycles and SuperCollider and I’d like to add a popular Ruby-based language called Sonic-Pi.

Which Python libraries are your favorite (core or 3rd party)?

This is a tough one! I had a lot of fun with the Natural Language Toolkit (nltk) module writing a program for predicting movie box office takings based on reviews, but I haven’t used it in some time. I’ve recently started to use “functools” a lot, although I haven’t actually delved that deeply into it yet. Combining functions to manipulate patterns of notes and rhythms is a really important part of algorithmic music and functools helps me do that using things like partials etc.

How did you end up creating/working on the Foxdot project?

I tried live coding using TidalCycles, which is based in Haskell – a functional programming language, but I wasn’t always able to express myself in the way I wanted to. It’s a wonderfully elegant way to create music but I was quite rooted in my object-oriented programming ways and decided to have a go at creating my own version in Python. Using the standard library’s “exec” function it’s really easy to take a string of text and run code and I just went from there!

What have you learned from working on an open source project?

It’s really made me work harder at documenting my code! It started off as a completely personal project and there were no docstrings and barely any comments but as other people started to use and contribute to it I started to appreciate how useful good documentation is. When other contributors add or change something, it makes it much easier if there is clear information in the code.

Do you have any tips for people who would like to contribute to or start their own open source projects?

Set up some automatic documentation generator, e.g. sphinx and read-the-docs, from the start. It’s very hard to go back and do it once the project is well underway.

Is there anything else you’d like to say?

Just that programming doesn’t always have to be goal-oriented and that it can be purely creative. Live coding is a really good creative outlet that lets you express yourself using skills you’ve developed as a programmer.

Thanks for doing the interview!
