
The Digital Cat: Python 3 OOP Part 1 - Objects and types


This post is available as an IPython Notebook here

About this series

Object-oriented programming (OOP) has been the leading programming paradigm for several decades now, starting from the initial attempts back in the 60s to some of the most important languages used nowadays. Being a set of programming concepts and design methodologies, OOP can never be said to be "correctly" or "fully" implemented by a language: indeed there are as many implementations as languages.

So one of the most interesting aspects of OOP languages is understanding how they implement those concepts. In this post I am going to start analyzing the OOP implementation of the Python language. Due to the richness of the topic, however, consider this attempt just a set of thoughts for Python beginners trying to find their way into this beautiful (and sometimes peculiar) language.

This series of posts aims to introduce the reader to the Python 3 implementation of Object-Oriented Programming concepts. The content of this and the following posts will not be completely different from that of the previous "OOP Concepts in Python 2.x" series, however. The reason is that while some of the internal structures change a lot, the global philosophy doesn't, since Python 3 is an evolution of Python 2 and not a new language.

So I chose to split the previous series and to adapt the content to Python 3 instead of posting a mere list of corrections. I find this way to be more useful for new readers, who would otherwise be forced to read the previous series.

Print

One of the most noticeable changes introduced by Python 3 is the transformation of the print statement into the print() function. This is indeed a very small change compared to other modifications made to the internal structures, but it is the most visually striking one, and will be the source of 80% of your syntax errors when you start writing Python 3 code.

Remember that print is now a function so write print(a) and not print a.
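A minimal sketch of what this change implies: since print is an ordinary function in Python 3, it can be assigned to a name like any other object, and it returns None like any function without an explicit return.

```python
# print is now a plain function: it can be aliased like any other object
printer = print
printer("Hello, Python 3")

# like any function without an explicit return statement, it returns None
result = print("another line")
```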

Back to the Object

Computer science deals with data and with procedures to manipulate that data. Everything, from the earliest Fortran programs to the latest mobile apps is about data and their manipulation.

So if data are the ingredients and procedures are the recipes, it seems (and can be) reasonable to keep them separate.

Let's do some procedural programming in Python

# This is some data
data = (13, 63, 5, 378, 58, 40)

# This is a procedure that computes the average
def avg(d):
    return sum(d) / len(d)

print(avg(data))

As you can see the code is quite good and general: the procedure (function) operates on a sequence of data, and it returns the average of the sequence items. So far, so good: computing the average of some numbers leaves the numbers untouched and creates new data.

The observation of the everyday world, however, shows that complex data mutate: an electrical device is on or off, a door is open or closed, the content of a bookshelf in your room changes as you buy new books.

You can still manage this while keeping data and procedures separate, for example

# These are two numbered doors, initially closed
door1 = [1, 'closed']
door2 = [2, 'closed']

# This procedure opens a door
def open_door(door):
    door[1] = 'open'

open_door(door1)
print(door1)

I described a door as a structure containing a number and the status of the door (as you would do in languages like LISP, for example). The procedure knows how this structure is made and may alter it.

This also works like a charm. Some problems arise, however, when we start building specialized types of data. What happens, for example, when I introduce a "lockable door" data type, which can be opened only when it is not locked? Let's see

# These are two standard doors, initially closed
door1 = [1, 'closed']
door2 = [2, 'closed']

# This is a lockable door, initially closed and unlocked
ldoor1 = [1, 'closed', 'unlocked']

# This procedure opens a standard door
def open_door(door):
    door[1] = 'open'

# This procedure opens a lockable door
def open_ldoor(door):
    if door[2] == 'unlocked':
        door[1] = 'open'

open_door(door1)
print(door1)
open_ldoor(ldoor1)
print(ldoor1)

Everything still works, no surprises in this code. However, as you can see, I had to find a different name for the procedure that opens a lockable door since its implementation differs from the procedure that opens a standard door. But, wait... I'm still opening a door, the action is the same, and it just changes the status of the door itself. So why should I have to remember that a lockable door shall be opened with open_ldoor() instead of open_door() if the verb is the same?

Chances are that this separation between data and procedures doesn't perfectly fit some situations. The key problem is that the "open" action is not actually using the door; rather it is changing its state. So, just like the volume control buttons of your phone, which are on your phone, the "open" procedure should stick to the "door" data.

This is exactly what leads to the concept of object: an object, in the OOP context, is a structure holding data and procedures operating on them.

What About Type?

When you talk about data you immediately need to introduce the concept of type. This concept may have two meanings that are worth mentioning in computer science: the behavioural one and the structural one.

The behavioural meaning represents the fact that you know what something is by describing how it acts. This is the foundation of the so-called "duck typing" (here "typing" means "to give a type" and not "to type on a keyboard"): if it acts like a duck, it is a duck.

The structural meaning identifies the type of something by looking at its internal structure. So two things that act in the same way but are internally different are of different type.
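A minimal sketch of the behavioural point of view, with two hypothetical classes that share no structure but expose the same method:

```python
class Duck:
    def quack(self):
        return "Quack!"


class Person:
    def quack(self):
        return "I'm quacking!"


def make_it_quack(thing):
    # behavioural (duck) typing: we only rely on the fact that
    # 'thing' can quack, not on what class it belongs to
    return thing.quack()
```

From the behavioural point of view both a Duck and a Person are accepted by make_it_quack(); from the structural point of view they are different types.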

Both points of view can be valid, and different languages may implement and emphasize one meaning of type or the other, and even both.

Class Games

Objects in Python may be built describing their structure through a class. A class is the programming representation of a generic object, such as "a book", "a car", "a door": when I talk about "a door" everyone can understand what I'm saying, without the need of referring to a specific door in the room.

In Python, the type of an object is represented by the class used to build the object: that is, in Python the word type has the same meaning as the word class.

For example, one of the built-in classes of Python is int, which represents an integer number

>>> a = 6
>>> print(a)
6
>>> print(type(a))
<class 'int'>
>>> print(a.__class__)
<class 'int'>

As you can see, the built-in function type() returns the content of the magic attribute __class__ (magic here means that its value is managed by Python itself offstage). The type of the variable a, or its class, is int. (This is a very inaccurate description of this rather complex topic, so remember that at the moment we are just scratching the surface).

Once you have a class you can instantiate it to get a concrete object (an instance) of that type, i.e. an object built according to the structure of that class. The Python syntax to instantiate a class is the same as a function call

>>> b = int()
>>> type(b)
<class 'int'>

When you create an instance, you can pass some values, according to the class definition, to initialize it.

>>> b = int()
>>> print(b)
0
>>> c = int(7)
>>> print(c)
7

In this example, the int class creates an integer with value 0 when called without arguments, otherwise it uses the given argument to initialize the newly created object.

Let us write a class that represents a door to match the procedural examples done in the first section

class Door:
    def __init__(self, number, status):
        self.number = number
        self.status = status

    def open(self):
        self.status = 'open'

    def close(self):
        self.status = 'closed'

The class keyword defines a new class named Door; everything indented under class is part of the class. The functions you write inside the class are called methods and don't differ at all from standard functions; the nomenclature changes only to highlight the fact that those functions now are part of an object.

Methods of a class must accept as first argument a special value called self (the name is a convention but please never break it).

The class can be given a special method called __init__() which is run when the class is instantiated, receiving the arguments passed when calling the class; the general name of such a method, in the OOP context, is constructor, even if the __init__() method is not the only part of this mechanism in Python.

The self.number and self.status variables are called attributes of the object. In Python, methods and attributes are both members of the object and are accessible with the dotted syntax; the difference between attributes and methods is that the latter can be called (in Python lingo you say that a method is a callable).

As you can see the __init__() method shall create and initialize the attributes since they are not declared elsewhere. This is very important in Python and is strictly linked with the way the language handles the type of variables. I will detail those concepts when dealing with polymorphism in a later post.

The class can be used to create a concrete object

>>> door1 = Door(1, 'closed')
>>> type(door1)
<class '__main__.Door'>
>>> print(door1.number)
1
>>> print(door1.status)
closed

Now door1 is an instance of the Door class; type() returns the class as __main__.Door since the class was defined directly in the interactive shell, that is in the current main module.

To call a method of an object, that is to run one of its internal functions, you just access it as an attribute with the dotted syntax and call it like a standard function.

>>> door1.open()
>>> print(door1.number)
1
>>> print(door1.status)
open

In this case, the open() method of the door1 instance has been called. No arguments have been passed to the open() method, but if you review the class declaration, you see that it was declared to accept an argument (self). When you call a method of an instance, Python automatically passes the instance itself to the method as the first argument.

You can create as many instances as needed and they are completely unrelated to each other. That is, the changes you make on one instance do not reflect on another instance of the same class.
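A quick sketch of this independence, repeating the Door class from above so that the snippet is self-contained:

```python
class Door:
    def __init__(self, number, status):
        self.number = number
        self.status = status

    def open(self):
        self.status = 'open'

    def close(self):
        self.status = 'closed'


door1 = Door(1, 'closed')
door2 = Door(2, 'closed')

# opening door1 changes the status of that instance only
door1.open()
```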

Recap

Objects are described by a class, which can generate one or more instances, unrelated to each other. A class contains methods, which are functions, and they accept at least one argument called self, which is the actual instance on which the method has been called. A special method, __init__(), deals with the initialization of the object, setting the initial value of the attributes.

Movie Trivia

Section titles come from the following movies: Back to the Future (1985) , What About Bob? (1991), Wargames (1983).

Sources

You will find a lot of documentation in this Reddit post. Most of the information contained in this series comes from those sources.

Feedback

Feel free to use the blog Google+ page to comment the post. The GitHub issues page is the best place to submit corrections.


The Digital Cat: Python Mocks: a gentle introduction - Part 1


As already stressed in the two introductory posts on TDD (you can find them here), testing requires writing some code that uses the functions and objects you are going to develop. This means that you need to isolate a given (external) function that is part of your public API and demonstrate that it works with standard inputs and in edge cases.

For example, if you are going to develop an object that stores percentages (such as poll results), you should test the following conditions: the class can store a standard percentage such as 42%, the class shall give an error if you try to store a negative percentage, and the class shall give an error if you store a percentage greater than 100%.

Tests shall be idempotent and isolated. Idempotent in mathematics and computer science identifies a process that can be run multiple times without changing the status of the system. Isolated means that a test shall not change its behaviour depending on previous executions of itself, nor depend on the previous execution (or missing execution) of other tests.

Such restrictions, which guarantee that your tests are not passing due to a temporary configuration of the system or the order in which they are run, can raise big issues when dealing with external libraries and systems, or with intrinsically mutable concepts such as time. In the testing discipline, such issues are mostly faced using mocks, that is objects that pretend to be other objects.

In this series of posts I am going to review the Python mock library and exemplify its use. I will not cover everything you may do with mock, obviously, but hopefully I'll give you the information you need to start using this powerful library.

Installation

First of all, mock is a Python library whose development started around 2008. It was selected for inclusion in the standard library as of Python 3.3, which however does not prevent you from using other libraries if you prefer.

Python 3 users, thus, are not required to take any step, while for Python 2 projects you still need to issue a pip install mock to install it into the system or the current virtualenv.

You may find the official documentation here. It is very detailed, and as always I strongly recommend taking your time to run through it.

Basic concepts

A mock, in the testing lingo, is an object that simulates the behaviour of another (more complex) object. When you (unit)test an object of your library you sometimes need to access other systems your object wants to connect to, but you do not really want to be forced to run them, for several reasons.

The first one is that connecting with external systems means having a complex testing environment, that is you are dropping the isolation requirement of your tests. If your object wants to connect with a website, for example, you are forced to have a running Internet connection, and if the remote website is down you cannot test your library.

The second reason is that the setup of an external system is usually slow in comparison with the speed of unit tests. We expect to run hundreds of tests in seconds, and if we have to fetch information from a remote server for each of them the time easily increases by several orders of magnitude. Remember: having slow tests means that you cannot run them while you develop, which in turn means that you will not really use them for TDD.

The third reason is more subtle, and has to do with the mutable nature of an external system, thus I'll postpone the discussion of this issue for the moment.

Let us try and work with a mock in Python and see what it can do. First of all fire up a Python shell or a Jupyter Notebook and import the library

from unittest import mock

If you are using Python 2 you have to install it and use

import mock

The main object that the library provides is Mock and you can instantiate it without any argument

m = mock.Mock()

This object has the peculiar property of creating methods and attributes on the fly when you require them. Let us first look inside the object to get a glimpse of what it provides

>>> dir(m)
['assert_any_call', 'assert_called_once_with', 'assert_called_with', 'assert_has_calls', 'attach_mock', 'call_args', 'call_args_list', 'call_count', 'called', 'configure_mock', 'method_calls', 'mock_add_spec', 'mock_calls', 'reset_mock', 'return_value', 'side_effect']

As you can see there are some methods which are already defined in the Mock object. Let us read a non-existent attribute

>>> m.some_attribute
<Mock name='mock.some_attribute' id='140222043808432'>
>>> dir(m)
['assert_any_call', 'assert_called_once_with', 'assert_called_with', 'assert_has_calls', 'attach_mock', 'call_args', 'call_args_list', 'call_count', 'called', 'configure_mock', 'method_calls', 'mock_add_spec', 'mock_calls', 'reset_mock', 'return_value', 'side_effect', 'some_attribute']

Well, as you can see this class is somehow different from what you are accustomed to. First of all its instances do not raise an AttributeError when asked for a non-existent attribute, but they happily return another instance of Mock itself. Second, the attribute you tried to access has now been created inside the object and accessing it returns the same mock object as before.

>>> m.some_attribute
<Mock name='mock.some_attribute' id='140222043808432'>

Mock objects are callables, which means that they may act both as attributes and as methods. If you try to call the mock it just returns you another mock with a name that includes parentheses to signal its callable nature

>>> m.some_attribute()
<Mock name='mock.some_attribute()' id='140247621475856'>

As you can understand, such objects are the perfect tool to mimic other objects or systems, since they may expose any API without raising exceptions. To use them in tests, however, we need them to behave just like the original, which implies returning sensible values or performing operations.

Return value

The simplest thing a mock can do for you is to return a given value every time you call it. This is configured by setting the return_value attribute of a mock object

>>> m.some_attribute.return_value = 42
>>> m.some_attribute()
42

Now the object does not return a mock object any more, instead it just returns the static value stored in the return_value attribute. Obviously you can also store a callable such as a function or an object, and the method will return it, but it will not run it. Let me give you an example

>>> def print_answer():
...     print("42")
...
>>> m.some_attribute.return_value = print_answer
>>> m.some_attribute()
<function print_answer at 0x7f8df1e3f400>

As you can see calling some_attribute() just returns the value stored in return_value, that is the function itself. To return values that come from a function we have to use a slightly more complex attribute of mock objects called side_effect.

Side effect

The side_effect parameter of mock objects is a very powerful tool. It accepts three different flavours of objects, callables, iterables, and exceptions, and changes its behaviour accordingly.

If you pass an exception the mock will raise it

>>> m.some_attribute.side_effect = ValueError('A custom value error')
>>> m.some_attribute()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/unittest/mock.py", line 902, in __call__
    return _mock_self._mock_call(*args, **kwargs)
  File "/usr/lib/python3.4/unittest/mock.py", line 958, in _mock_call
    raise effect
ValueError: A custom value error

If you pass an iterable, such as for example a generator, or a plain list, tuple, or similar objects, the mock will yield the values of that iterable, i.e. return every value contained in the iterable on subsequent calls of the mock. Let me give you an example

>>> m.some_attribute.side_effect = range(3)
>>> m.some_attribute()
0
>>> m.some_attribute()
1
>>> m.some_attribute()
2
>>> m.some_attribute()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/unittest/mock.py", line 902, in __call__
    return _mock_self._mock_call(*args, **kwargs)
  File "/usr/lib/python3.4/unittest/mock.py", line 961, in _mock_call
    result = next(effect)
StopIteration

As promised, the mock just returns every object found in the iterable (in this case a range object) one at a time until the iterable is exhausted. According to the iterator protocol (see this post) once every item has been returned the object raises the StopIteration exception, which means that you can correctly use it in a loop.
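A minimal sketch of this behaviour, calling a mock configured with a plain list as side_effect a known number of times:

```python
from unittest import mock

m = mock.Mock()
# each call to the mock returns the next value from the iterable
m.get_next.side_effect = [10, 20, 30]

collected = [m.get_next() for _ in range(3)]
# a fourth call would raise StopIteration, since the iterable is exhausted
```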

The last and perhaps most used case is that of passing a callable to side_effect, which executes it with the same parameters the mock was called with. This is very powerful, especially if you stop thinking about "functions" and start considering "callables". Indeed, side_effect also accepts a class and calls it, that is it can instantiate objects. Let us consider a simple example with a function without arguments

>>> def print_answer():
...     print("42")
...
>>> m.some_attribute.side_effect = print_answer
>>> m.some_attribute.side_effect()
42

A slightly more complex example: a function with arguments

>>> def print_number(num):
...     print("Number:", num)
...
>>> m.some_attribute.side_effect = print_number
>>> m.some_attribute.side_effect(5)
Number: 5

And finally an example with a class

>>> class Number(object):
...     def __init__(self, value):
...         self._value = value
...     def print_value(self):
...         print("Value:", self._value)
...
>>> m.some_attribute.side_effect = Number
>>> n = m.some_attribute.side_effect(26)
>>> n
<__main__.Number object at 0x7f8df1aa4470>
>>> n.print_value()
Value: 26
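Note that the examples above invoke the function stored in side_effect directly. Calling the mock itself also runs the side_effect callable, with the same arguments, and if the callable returns a value that value becomes the return value of the call. A minimal sketch, with a hypothetical double() function:

```python
from unittest import mock

def double(x):
    return x * 2

m = mock.Mock()
m.process.side_effect = double

# calling the mock runs double(21) and returns its result
result = m.process(21)
```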

Testing with mocks

Now we know how to build a mock and how to give it a static return value or make it call a callable object. It is time to see how to use a mock in a test and what facilities mocks provide. I'm going to use pytest as a testing framework. You can find a quick introduction to pytest and TDD here.

Setup

If you want to quickly setup a pytest playground you may execute this code in a terminal (you need to have Python 3 and virtualenv installed in your system)

mkdir mockplayground
cd mockplayground
virtualenv venv3 -p python3
source venv3/bin/activate
pip install --upgrade pip
pip install pytest
echo "[pytest]" >> pytest.ini
echo "norecursedirs=venv*" >> pytest.ini
mkdir tests
touch myobj.py
touch tests/test_mock.py
PYTHONPATH="." py.test

The PYTHONPATH environment variable is an easy way to avoid having to setup a whole Python project to just test some simple code.

The three test types

According to Sandi Metz we need to test only three types of messages (calls) between objects:

  • Incoming queries (assertion on result)
  • Incoming commands (assertion on direct public side effects)
  • Outgoing commands (expectation on call and arguments)

You can see the original talk here or read the slides here. The final table is shown in slide number 176.

As you can see, when dealing with external objects we are only interested in knowing IF a method was called and WHAT PARAMETERS the caller passed to the object. We are not testing whether the remote object returns the correct result; this is faked by the mock, which indeed returns exactly the result we need.

So the purpose of the methods provided by mock objects is to allow us to check what methods we called on the mock itself and what parameters we used in the call.

Asserting calls

To show how to use Python mocks in testing I will follow the TDD methodology, writing tests first and then writing the code that makes the tests pass. In this post I want to give you a simple overview of the mock objects, so I will not implement a real world use case, and the code will be very trivial. In the second part of this series I will test and implement a real class, in order to show some more interesting use cases.

The first thing we are usually interested in when dealing with an external object is to know that a given method has been called on it. Python mocks provide the assert_called_with() method to check this condition.

The use case we are going to test is the following. We instantiate the myobj.MyObj class, which requires an external object. The class shall call the connect() method of the external object without any parameter.

from unittest import mock

import myobj


def test_instantiation():
    external_obj = mock.Mock()
    myobj.MyObj(external_obj)
    external_obj.connect.assert_called_with()

The myobj.MyObj class, in this simple example, needs to connect to an external object, for example a remote repository or a database. The only thing we need to know for testing purposes is if the class called the connect() method of the external object without any parameter.

So the first thing we do in this test is to instantiate the mock object. This is a fake version of the external object, and its only purpose is to accept calls from the MyObj object under test and return sensible values. Then we instantiate the MyObj class passing the external object. We expect the class to call the connect() method, so we express this expectation by calling external_obj.connect.assert_called_with().

What happens behind the scenes? The MyObj class receives the external object and somewhere in its initialization process calls the connect() method of the mock object; this creates the method itself as a mock object. This new mock records the parameters used to call it, and the subsequent call to assert_called_with() checks that the method was called and that no parameters were passed.

Running pytest, the test obviously fails.

$ PYTHONPATH="." py.test
==========================================test session starts==========================================
platform linux -- Python 3.4.3+, pytest-2.9.0, py-1.4.31, pluggy-0.3.1
rootdir: /home/leo/devel/mockplayground, inifile: pytest.ini
collected 1 items 

tests/test_mock.py F
===============================================FAILURES================================================
___________________________________________ test_instantiation __________________________________________

    def test_instantiation():
        external_obj = mock.Mock()
>       myobj.MyObj(external_obj)
E       AttributeError: 'module' object has no attribute 'MyObj'

tests/test_mock.py:6: AttributeError
=======================================1 failed in 0.03 seconds========================================
$

Putting this code in myobj.py is enough to make the test pass

class MyObj():
    def __init__(self, repo):
        repo.connect()

As you can see, the __init__() method actually calls repo.connect(), where repo is expected to be a full-featured external object that provides a given API. In this case (for the moment) the API is just its connect() method. Calling repo.connect() when repo is a mock object silently creates the method as a mock object, as shown before.

The assert_called_with() method also allows us to check the parameters we passed when calling. To show this let us pretend that we expect the MyObj.setup() method to call setup(cache=True, max_connections=256) on the external object. As you can see we pass a couple of arguments (namely cache and max_connections) to the called method, and we want to be sure that the call was exactly in this form. The new test is thus

def test_setup():
    external_obj = mock.Mock()
    obj = myobj.MyObj(external_obj)
    obj.setup()
    external_obj.setup.assert_called_with(cache=True, max_connections=256)

As usual the first run fails. Be sure to check this, it is part of the TDD methodology. You must have a test that DOES NOT PASS, then write some code that makes it pass.

$ PYTHONPATH="." py.test
==========================================test session starts==========================================
platform linux -- Python 3.4.3+, pytest-2.9.0, py-1.4.31, pluggy-0.3.1
rootdir: /home/leo/devel/mockplayground, inifile: pytest.ini
collected 2 items 

tests/test_mock.py .F

===============================================FAILURES================================================
______________________________________________ test_setup _______________________________________________

    def test_setup():
        external_obj = mock.Mock()
        obj = myobj.MyObj(external_obj)
>       obj.setup()
E       AttributeError: 'MyObj' object has no attribute 'setup'

tests/test_mock.py:14: AttributeError
==================================1 failed, 1 passed in 0.03 seconds===================================
$

To show you what type of check the mock object provides let me implement a partially correct solution

class MyObj():
    def __init__(self, repo):
        self._repo = repo
        repo.connect()

    def setup(self):
        self._repo.setup(cache=True)

As you can see the external object has been stored in self._repo and the call to self._repo.setup() is not exactly what the test expects, lacking the max_connections parameter. Running pytest we obtain the following result (I removed most of the pytest output)

E           AssertionError: Expected call: setup(cache=True, max_connections=256)
E           Actual call: setup(cache=True)

and you see that the error message is very clear about what we expected and what happened in our code.

As you can read in the official documentation, the Mock object also provides the following methods and attributes: assert_called_once_with, assert_any_call, assert_has_calls, assert_not_called, called, call_count. Each of them explores a different aspect of the mock behaviour concerning calls, so make sure to check their descriptions and the examples provided alongside.
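A quick sketch of some of these helpers on a fresh mock, to show how call records accumulate:

```python
from unittest import mock

m = mock.Mock()
m.connect()
m.connect()
m.setup(cache=True)

# 'called' and 'call_count' record whether and how many times
# each child mock was invoked
assert m.connect.called
assert m.connect.call_count == 2

# assert_called_once_with checks both that there was exactly one call
# and that its arguments match
m.setup.assert_called_once_with(cache=True)

# assert_any_call passes if at least one of the recorded calls matches
m.connect.assert_any_call()
```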

Final words

In this first part of the series I described the behaviour of mock objects and the methods they provide to simulate return values and to test calls. They are a very powerful tool that allows you to avoid creating complex and slow tests that depend on external facilities to run, thus missing the main purpose of tests, which is that of continuously helping you to check your code.

In the next issue of the series I will explore the automatic creation of mock methods from a given object and the very important patching mechanism provided by the patch decorator and context manager.


Real Python: What Can I Do With Python?


You’ve done it: you’ve finished a course or finally made it to the end of a book that teaches you the basics of programming with Python. You’ve mastered lists, dictionaries, classes, and maybe even some object oriented concepts.

So… what next?

Python is a very versatile programming language, with a plethora of uses in a variety of different fields. If you’ve grasped the basics of Python and are itching to build something with the language, then it’s time to figure out what your next step should be.

In this article, we offer several different projects, resources, and tutorials that you can use to start building things with Python.


What Others Do With Python

You’re probably wondering what people are building with Python in the real world. So first, let’s take a quick look at how some of the big tech companies are using the language.

Google is a company that has used Python from the start, and it’s gained a place as one of the tech giant’s main server-side languages. Guido van Rossum, Python’s Benevolent Dictator for Life, even worked there for several years, overseeing the language’s development.

Instagram likes Python for its simplicity. The service is known for running “the world’s largest deployment of the Django web framework, which is written entirely in Python.”

Spotify puts the language to use in its data analysis and back-end services. According to their team, Python’s ease of use leads to a lightning-fast development pipeline. Spotify performs a ton of analyses to give recommendations to their users, so they need something that’s simple but also works well. Python to the rescue!

You can check out this article to see what other companies are doing with Python.

If you’re already convinced, then let’s get you started!

What You Can Do With Python

From web development to data science, machine learning, and more, Python’s real-world applications are limitless. Here are some projects that will assist you in finally putting your Python skills to good use.

#1: Automate the Boring Stuff

Automate the Boring Stuff With Python + Al Sweigart

This is a resource on “practical programming for total beginners.” Like the title says, this book will teach you how to automate tedious tasks such as updating spreadsheets or renaming files on your computer. It’s the perfect starting point for anyone who’s mastered the basics of Python.

You’ll get a chance to practice what you’ve learned so far by creating dictionaries, scraping the web, working with files, and creating objects and classes. The hands-on applications that you come across in this book will provide you with real-world results that you can see immediately.

This resource is available in different formats to give you the best learning experience possible. Buy the book on Amazon or read it online for free.

#2: Stay on Top of Bitcoin Prices

Everyone seems to be talking about Bitcoin these days. Ever since topping out at a price of almost $20,000 in December 2017, the cryptocurrency has been on the minds of millions. Its price continues to fluctuate, but many would consider it a worthwhile investment.

If you’re looking to cash in on the virtual gold rush and just need to know when to make your move, then you’ll need to stay on top of Bitcoin’s prices. This tutorial can teach you how to use your Python skills to build a Bitcoin price notification service.

The foundation of this project is the creation of IFTTT (“if this, then that”) applets. You’ll learn how to use the requests library to send HTTP requests and how to use a webhook to connect your app to external services.
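
Independent of the tutorial, IFTTT's Webhooks service triggers on a URL of a documented shape. As a rough sketch (the event name and key below are placeholders, not values from the tutorial), you might build that URL like this:

```python
# Hypothetical event and key names; the URL pattern is IFTTT's documented
# Webhooks trigger format.
IFTTT_BASE = "https://maker.ifttt.com/trigger/{event}/with/key/{key}"

def ifttt_webhook_url(event, key):
    """Build the URL that an applet's Webhooks trigger listens on."""
    return IFTTT_BASE.format(event=event, key=key)

url = ifttt_webhook_url("bitcoin_price_update", "YOUR_IFTTT_KEY")
# A real notification would then POST to it, e.g.:
# requests.post(url, json={"value1": btc_price})
```

The tutorial itself walks through wiring this up with the requests library.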

This is the perfect starter project for a beginner Pythonista with an interest in crypto. The service you build with this tutorial can be extended to other currencies as well, so don’t worry—Ethereum is fair game, too.

#3: Create a Calculator

This simple project is a solid gateway into GUI programming. Building back-end services is one important part of deployment, but there may be a front-end that needs to be taken into account. Creating applications that users can easily interact with is paramount.

If you’re interested in UX and UI design, then take a look at this tutorial. You’ll be working with the tkinter module, the standard graphical user interface package that comes traditionally bundled with Python.

The tkinter module is a wrapper around Tcl/Tk, a combination of the Tcl scripting language and a GUI framework extension, Tk. If you have Python installed, then you should already have the tkinter framework ready to go as well. A simple call will get you started:

from tkinter import *

Once you’ve got that set up, you can get to work on building your first GUI calculator in Python.

Practice using the tkinter module and watch your vision materialize on the screen. Then, once you’ve got your feet wet, you can branch out and start working with Python’s other GUI toolkits. Check out the official documentation on GUI Programming in Python for more information.

#4: Mine Twitter Data

Thanks to the Internet—and, increasingly, the Internet of Things—we now have access to hordes of data that weren’t available even a decade ago. Analytics is a huge part of any field that works with data. What are people talking about? What patterns can we see in their behavior?

Twitter is a great place to get answers to some of these questions. If you’re interested in data analysis, then a Twitter data mining project is a great way to use your Python skills to answer questions about the world around you.

Our Twitter sentiment analysis tutorial will teach you how to mine Twitter data and analyze user sentiment with a Docker environment. You’ll learn how to register an application with Twitter, which you’ll need to do in order to access their streaming API.

You’ll see how to use Tweepy to filter which tweets you want to pull, TextBlob to calculate the sentiment of those tweets, Elasticsearch to analyze their content, and Kibana to visualize the results. After you finish this tutorial, you should be ready to dive into other projects that use Python for text processing and speech recognition.

#5: Build a Microblog With Flask

The Flask Mega-Tutorial, by Miguel Grinberg (book cover)

It seems like everyone has a blog these days, but it’s not a bad idea to have a central hub for yourself online. With the advent of Twitter and Instagram, microblogging in particular has become exceedingly popular. In this project by Miguel Grinberg, you’ll learn how to build your own microblog.

It’s called “The Flask Mega-Tutorial,” and it truly lives up to its name. With 23 chapters to work through, you’ll develop a deep understanding of the Flask micro web-framework. At the end of this project, you should have a fully functional web application.

You don’t need to know anything about Flask to get started, so it’s perfect for those of you who are itching to get your hands dirty with web development.

The tutorial was recently updated to include content that will help you become a better web developer in general. You can read it for free online, purchase a copy on Amazon, or have the author walk you step by step through his online course. Once you’re done, you’ll be able to move on to Django and creating even larger-scale web applications.

#6: Build a Blockchain

While the blockchain was initially developed as a financial technology, it’s spreading to a variety of other industries. Blockchains can be used for almost any kind of transaction: from real estate dealings to medical record transfers.

You can get a better understanding of how they work by building one yourself. Hackernoon’s tutorial will assist you in implementing a blockchain from scratch. At the end of this project, you’ll have gained an in-depth understanding of how this transactional technology works.
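
The core idea behind any from-scratch blockchain is small: each block stores a hash of its own contents plus the hash of the block before it, so tampering anywhere breaks the chain. This is not the Hackernoon implementation, just a minimal illustration of that linking:

```python
import hashlib
import json
import time

def hash_block(block):
    # Hash every field except the hash itself, in a stable key order.
    payload = {k: v for k, v in block.items() if k != "hash"}
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

def make_block(index, data, previous_hash):
    """Build a block and stamp it with a SHA-256 hash of its own contents."""
    block = {
        "index": index,
        "timestamp": time.time(),
        "data": data,
        "previous_hash": previous_hash,
    }
    block["hash"] = hash_block(block)
    return block

def is_valid(chain):
    """Each block must reference the previous block's hash and match its own."""
    for prev, curr in zip(chain, chain[1:]):
        if curr["previous_hash"] != prev["hash"] or curr["hash"] != hash_block(curr):
            return False
    return True

genesis = make_block(0, "genesis", "0")
chain = [genesis, make_block(1, "hello", genesis["hash"])]
```

Changing any field of an earlier block invalidates every block after it, which is exactly the property the tutorial builds on.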

You’ll be working with HTTP clients and the requests library. Once you install the Flask web framework, you’ll be able to use HTTP requests to communicate with your blockchain over the Internet.

Remember, blockchain isn’t just for crypto enthusiasts. Once you’ve built one for yourself, see if you can’t find a creative way to implement the technology in your field of interest.

#7: Bottle Up a Twitter Feed

Interested in building web applications but unsure about starting a mega-project? No worries—we’ve got something for you. Follow along with us to learn how to create a simple web app in just a few hours.

Bob Belderbos shares how he implemented the 40th PyBites Code Challenge, where participants were instructed to create a simple web app to better navigate the Daily Python Tip feed on Twitter. You can walk through his implementation of the challenge and code alongside him.

Instead of Flask, you’ll be using the Bottle micro web-framework. Bottle is known as a low-dependency solution for deploying apps quickly. Since it is designed to be lightweight and simple to use, you’ll have your application developed in no time.

You’ll also use the Tweepy module to load data from the Twitter API. You’ll store the data in an SQLAlchemy database, so you’ll get some practice writing SQL queries as well. Fork the repo to get started!
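
The tutorial uses SQLAlchemy, but the storage idea is simple enough to sketch with the standard library's sqlite3 module. The tweets below are made up stand-ins for what Tweepy would return:

```python
import sqlite3

# Hypothetical rows; in the challenge these would come from the Tweepy API.
tips = [("dailypythontip", "Daily Python Tip #1"),
        ("dailypythontip", "Daily Python Tip #2")]

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE tips (author TEXT, text TEXT)")
conn.executemany("INSERT INTO tips VALUES (?, ?)", tips)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM tips").fetchone()[0]
```

SQLAlchemy wraps the same SQL behind Python classes, so the queries you practice here transfer directly.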

#8: Play PyGames

This one is for those of you who like to have fun! Python can be used to code a variety of arcade games, adventure games, and puzzle games that you can deploy within a few days. Classics like hangman, tic-tac-toe, ping-pong, and more are all doable with your newly acquired programming skills.

The Pygame library makes it even easier to build your own games. It contains almost anything you could need when starting to develop a game.

Pygame is free and open source. It includes computer graphics and sound libraries that you can use to add interactive functionality to your application.

There are scores of games you can create with the library. Whatever you choose to invent, feel free to share your stuff with the Pygame community!

#9: Choose Your Own Adventure

If you’re more into storytelling, then you can still build something cool with Python.

The language is extremely easy to write in, which makes it the perfect environment for developing interactive fiction. This free resource will guide you through the process of writing a text-based adventure game in Python.

The tutorial assumes basic knowledge of programming in Python, but it helps you bridge the gap between what you know and how to use that knowledge to build an application.
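
At its core, a text adventure is a graph of rooms plus a function that moves the player along its edges. A toy version (the room names here are invented, not from the tutorial) can fit in a few lines:

```python
# A minimal, hypothetical room map: each room has a description and exits.
rooms = {
    "hall": {"description": "A dusty hall.", "exits": {"north": "library"}},
    "library": {"description": "Shelves of old books.", "exits": {"south": "hall"}},
}

def move(current, direction):
    """Return the new room name, or stay put if the exit doesn't exist."""
    return rooms[current]["exits"].get(direction, current)
```

The tutorial grows this same idea into a full game loop with items, commands, and win conditions.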

If you want to take your story to the next level, you can use a software engine like Ren’Py to add sounds and images to your game, creating a full-fledged visual novel. (Then, you can put it up on Steam and see how it does! The best way to get feedback on your work is to release your creation out into the world.)

#10: Say “Hello World!” to Machine Learning

Machine learning can be a critical field of understanding for anyone interested in artificial intelligence. However, it can be intimidating to get started, because the field is fast-moving and ever-changing.

Luckily, there are resources online that can help you get your feet wet before you dive into the world of data science. This tutorial by Jason Brownlee is a wonderful introduction to using Python for machine learning.

You’ll walk through some of the most common machine learning algorithms as well as the Python libraries that will assist you in making predictions.

The tutorial is extremely simple and very easy to follow. You can complete it in as little as a few hours. By the time you’re done, you’ll have gained a quick understanding of how to use Python to perform data science.

When you’re sure you’re ready to dive in, check out our stock of data science tutorials, where you’ll learn how to analyze fingerprints, create visualizations, and recognize speech and faces, all in Python.

#11: Get Challenged

If you’re not sure about taking the plunge with some of the larger projects listed above, but the smaller ones don’t interest you either, then you might be wondering what else there is. How on earth can you find something that excites you?

Coding challenges can help you practice your Python skills and gain a surface-level understanding of all the different things you can do with Python.

To put it simply: you’re presented with a problem, and you have to find a solution that uses Python.

You’ll get a chance to develop implementations that make sense to you, but you’ll also have the opportunity to dive deep into the Python language by way of hints. These give you an idea of which modules you should be importing to help you solve the challenge.

Coding challenges are a great way to learn breadth-first about as many libraries, methods, and frameworks as possible. You’re guaranteed to find something that you’ll want to explore more on your own time. You might even come back to this list and find that something you used in one of your challenges has sparked a new interest for you!

To get started, try one of these on for size:

  • The Python Challenge has over 20 levels for you to work through. Create small Python scripts to find a solution to the level. There are hints scattered about the Internet, but try to see the clues and figure it out for yourself!

  • PyBites Code Challenges has 50 challenges and counting! These challenges encourage you to master Python by building applications that accomplish tasks.

If you’d rather push yourself by coding through these challenges on your own instead of working through a step-by-step tutorial, then it’s always a good idea to have a resource you can turn to for help. Python Tricks: The Book is an amazing source of information to have on hand when you are working through these challenges. It will take you through some of the lesser-known parts of Python that you’ll need to solve them.

What You Probably Shouldn’t Do With Python

Clearly, Python is an extremely versatile language, and there’s a lot you can do with it. But you can’t do everything with it. In fact, there are some things that Python is not very well suited for at all.

As an interpreted language, Python has trouble interacting with low-level devices, like device drivers. For instance, you’d have a problem if you wanted to write an operating system with Python only. You’re better off sticking with C or C++ for low-level applications.

However, even that might not be true for long. As a testament to Python’s flexibility, there are those out there who are working on projects that extend Python’s usability to low-level interactions. MicroPython is just one of these projects, designing low-level capability for Python.

What if My Idea Isn’t on This List?

That’s okay! This list isn’t exhaustive—there are countless other tools and applications you can build with Python that we haven’t covered here. Don’t think you’re limited to what’s on this list. It is simply a resource to give you a place to start.

This video will give you some ideas on other projects that Python is well-suited for. You can also check out this blog post to learn where to find inspiration for more Python projects.

In the end, it’s up to you to do the research and find projects that pique your interest. If you’re not sure where to begin, then follow us on Twitter. We regularly share cool and interesting Python projects from our reader community. You might find something that you can’t wait to contribute to!

What to Do Next

So there you have it! Eleven ways to start working your way from Python beginner to savvy Pythonista.

No matter where you choose to begin, you’re sure to open up countless avenues for developing your programming skills. Pick something—anything—and get started! Do you have an idea for a project that didn’t make this list? Leave a comment down below! You could suggest the perfect project for a fellow programmer.

If you get stuck and need a nudge in the right direction, check out our tips for developing positive learning strategies to help get yourself back on track.

Another great way to get unstuck is to talk it out. Coding doesn’t have to be a solitary activity. If you need a way to ask questions and get answers quickly from knowledgeable professionals, then consider joining the PythonistaCafe. This private community allows you to network with those who will help push you through any walls you may hit on your journey to Python mastery. Click here to learn more, or go ahead and apply!



Stack Abuse: The Python tempfile Module


Introduction

Temporary files, or "tempfiles", are mainly used to store intermediate information on disk for an application. These files are normally created for different purposes, such as temporary backups or when the application is dealing with a dataset bigger than the system's memory. Ideally, these files are located in a separate directory, which varies across operating systems, and their names are unique. The data stored in temporary files is not always required after the application quits, so you may want these files to be deleted after use.

Python provides a module known as tempfile, which makes creating and handling temporary files easier. This module provides a few methods to create temporary files and directories in different ways. tempfile comes in handy whenever you want to use temporary files to store data in a Python program. Let's take a look at a couple of examples of how the tempfile module can be used.

Creating a Temporary File

Suppose your application needs a temporary file for use within the program, i.e. it will create one file, use it to store some data, and then delete it after use. To achieve this, we can use the TemporaryFile() function.

This function creates one temporary file in the default tempfile location. This location may differ between operating systems. The best part is that the temporary file created by TemporaryFile() is removed automatically whenever the file is closed. Also, it does not create any reference to the file in the system's filesystem table. This makes it private to the current application, i.e. no other program will be able to open the file.

Let's take a look at the below Python program to see how it works:

import tempfile #1

print("Creating one temporary file...")

temp = tempfile.TemporaryFile() #2

try:  
    print("Created file is:", temp) #3
    print("Name of the file is:", temp.name) #4
finally:  
    print("Closing the temp file")
    temp.close() #5

It will print the below output:

$ python3 temp-file.py
Creating one temporary file...  
Created file is: <_io.BufferedRandom name=4>  
Name of the file is: 4  
Closing the temp file  
  1. To create one temporary file in Python, you need to import the tempfile module.
  2. As explained above, we have created the temporary file using the TemporaryFile() function.
  3. From the output, you can see that the created object is not actually a file but a file-like object. Also, the mode parameter (not shown in our example) of the created file is w+b, i.e. you can both read from and write to the open file.
  4. The temporary file created has no name.
  5. Finally, we are closing the file using the close() method. It will be destroyed after it is closed.

One thing we should point out is that the file created using the TemporaryFile() function may or may not have a visible name in the file system. On Unix, the directory entry for the file is removed automatically after it is created, although this is not supported on other platforms. Normally TemporaryFile() is the ideal way to create one temporary storage area for any program in Python.

Create a Named Temporary File

In our previous example, we saw that the temporary file created using the TemporaryFile() function is actually a file-like object without an actual file name. Python also provides a different method, NamedTemporaryFile(), to create a file with a visible name in the file system. Other than providing a name to the tempfile, NamedTemporaryFile() works the same as TemporaryFile(). Now let's adapt the example above to create a named temporary file:

import tempfile

print("Creating one named temporary file...")

temp = tempfile.NamedTemporaryFile()

try:  
    print("Created file is:", temp)
    print("Name of the file is:", temp.name)
finally:  
    print("Closing the temp file")
    temp.close()

Running this code will print output similar to the following:

$ python3 named-temp-file.py
Creating one named temporary file...  
Created file is: <tempfile._TemporaryFileWrapper object at 0x103f22ba8>  
Name of the file is: /var/folders/l7/80bx27yx3hx_0_p1_qtjyyd40000gn/T/tmpa3rq8lon  
Closing the temp file  

So, the created file actually has a name this time. The advantage of NamedTemporaryFile() is that we can save the name of the created temp file and use it later, before closing or destroying the file. If the delete parameter is set to False, then we can close the file without it being destroyed, allowing us to re-open it later on.
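
To see the delete=False behavior for yourself, here is a short sketch: the file survives close(), can be re-opened by name, and must then be removed by us:

```python
import os
import tempfile

# With delete=False the file is NOT destroyed on close(),
# so we can re-open it later by its name.
temp = tempfile.NamedTemporaryFile(delete=False)
temp.write(b"still here")
temp.close()

with open(temp.name, "rb") as f:
    data = f.read()

os.remove(temp.name)  # now we are responsible for cleaning up
```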

Providing a Suffix or Prefix to the Name

Sometimes we need to add a prefix or suffix to a temp-file's name. It will help us to identify all temp files created by our program.

To achieve this, we can use the same NamedTemporaryFile function defined above. The only thing we need to add is two extra parameters when calling this function: suffix and prefix:

import tempfile

temp = tempfile.NamedTemporaryFile(prefix="dummyPrefix_", suffix="_dummySuffix")

try:  
    print("Created file is:", temp)
    print("Name of the file is:", temp.name)
finally:  
    temp.close()

Running this code will print the following output:

$ python3 prefix-suffix-temp-file.py
Created file is: <tempfile._TemporaryFileWrapper object at 0x102183470>  
Name of the file is: /var/folders/tp/pn3dvz_n7cj7nfs0y2szsk9h0000gn/T/dummyPrefix_uz63brcp_dummySuffix  

So, if we pass the two extra arguments suffix and prefix to the NamedTemporaryFile() function, it will automatically add them to the start and end of the file name.

Finding the Default Location of Temp Files

The tempfile.tempdir variable holds the default location for all temporary files. If the value of tempdir is None or unset, Python searches a standard list of directories and sets tempdir to the first one in which the calling program can create a file. It scans the following directories, in this order:

  1. The directory named by the TMPDIR environment variable.
  2. The directory named by the TEMP environment variable.
  3. The directory named by the TMP environment variable
  4. Platform-specific directories:
    1. On Windows, C:\TEMP, C:\TMP, \TEMP, and \TMP, in that order.
    2. On other platforms, /tmp, /var/tmp, and /usr/tmp, in that order.
  5. The current working directory.

To find out the default location of temporary files, we can call tempfile.gettempdir() method. It will return the value of tempdir if it is not None. Otherwise it will first search for the directory location using the steps mentioned above and then return the location.

import tempfile

print("Current temp directory:", tempfile.gettempdir())

tempfile.tempdir = "/temp"

print("Temp directory after change:", tempfile.gettempdir())  

If you run the above program, it will print output similar to the following:

$ python3 dir-loc-temp-file.py
Current temp directory: /var/folders/tp/pn3dvz_n7cj7nfs0y2szsk9h0000gn/T  
Temp directory after change: /temp  

You can see that the first temp directory location is the system-provided directory location and the second temp directory is the same value as the one that we have defined.

Reading and Writing Data from Temp Files

We have learned how to create a temporary file, create a temporary file with a name, and how to create a temporary file with a suffix and/or prefix. Now, let's try to understand how to actually read and write data from a temporary file in Python.

Reading and writing data from a temporary file in Python is pretty straightforward. For writing, you can use the write() method and for reading, you can use the read() method. For example:

import tempfile

temp = tempfile.TemporaryFile()

try:  
    temp.write(b'Hello world!')
    temp.seek(0)

    print(temp.read())
finally:  
    temp.close()

This will print the output as b'Hello world!' since the write() method takes input data in bytes (hence the b prefix on the string).

If you want to write text data into a temp file, you can use the writelines() method instead. For using this method, we need to create the tempfile using w+t mode instead of the default w+b mode. To do this, a mode param can be passed to TemporaryFile() to change the mode of the created temp file.

import tempfile

temp = tempfile.TemporaryFile(mode='w+t')

try:  
    temp.writelines("Hello world!")
    temp.seek(0)

    print(temp.read())
finally:  
    temp.close()

Unlike the previous example, this will print "Hello world!" (without the b prefix) as the output.

Create a Temporary Directory

If your program has several temporary files, it may be more convenient to create one temporary directory and put all of your temp files inside of it. To create a temporary directory, we can use the TemporaryDirectory() function. When used as a context manager, as below, the directory and everything inside it are removed automatically on exit; otherwise we have to call its cleanup() method or delete the directory manually.

import tempfile

with tempfile.TemporaryDirectory() as tmpdirname:  
    print('Created temporary directory:', tmpdirname)

# Both the directory and its contents have been deleted

It will print the below output:

$ python3 mk-dir-temp-file.py
Created temporary directory: /var/folders/l7/80bx27yx3hx_0_p1_qtjyyd40000gn/T/tmpn_ke7_rk  

Create a Secure Temporary File and Directory

By using mkstemp(), we can create a temporary file in the most secure manner possible. The temporary file created using this method is readable and writable only by the creating user ID. We can pass prefix and suffix arguments to add prefix and suffix to the created file name. By default, it opens the file in binary mode. To open it in text mode, we can pass text=True as an argument to the function. Unlike TemporaryFile(), the file created by mkstemp() doesn't get deleted automatically after closing it.

As you can see in the example below, the user is responsible for deleting the file.

import tempfile  
import os

fd, path = tempfile.mkstemp()

print("File name:", path)

os.close(fd)  
os.remove(path)

$ python3 mk-secure-dir-temp-file.py
File name: /var/folders/tp/pn3dvz_n7cj7nfs0y2szsk9h0000gn/T/tmpf8f6xc53  

Similar to mkstemp(), we can create a temporary directory in the most secure manner possible using mkdtemp() method. And again, like mkstemp(), it also supports prefix and suffix arguments for adding a prefix and suffix to the directory name.
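
A short sketch of mkdtemp() with the same prefix and suffix arguments; note that, like mkstemp(), it returns a plain path string and never cleans up after itself:

```python
import os
import tempfile

# mkdtemp() returns the directory path as a string.
path = tempfile.mkdtemp(prefix="dummyPrefix_", suffix="_dummySuffix")
existed = os.path.isdir(path)

os.rmdir(path)  # mkdtemp never removes the directory for you
```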

Conclusion

In this article we have learned different ways to create temporary files and directories in Python. You can use temp files in any Python program you want. But just make sure to delete them if the particular method used doesn't do so automatically. Also keep in mind that behavior may differ between operating systems, such as the generated directory and file names.

All of the functions we have explained above accept many different arguments, although we have not covered in detail what each function takes. If you want to learn more about the tempfile module, you should check out the official Python 3 documentation.

Davy Wybiral: Hookah: A Swiss Army knife for data pipelines

Hookah lets you pipe data between different stream types.

Check it out on Github: https://github.com/wybiral/hookah

Some CLI examples:
Pipe from stdin to a new TCP server on port 8080:
hookah -o tcp-server://localhost:8080
Pipe from an existing TCP server on port 8080 to a new HTTP server on port 8081:
hookah -i tcp://localhost:8080 -o http-server://localhost:8081
Pipe from a new Unix domain socket listener to stdout:
hookah -i unix-server://path/to/sock
Pipe from a new HTTP server on port 8080 to an existing Unix domain socket:
hookah -i http-server://localhost:8080 -o unix://path/to/sock

Bill Ward / AdminTome: Kafka Python Tutorial for Fast Data Architecture


In this Kafka python tutorial we will create a python application that will publish data to a Kafka topic and another app that will consume the messages.

Fast Data Series Articles

  1. Installing Apache Mesos 1.6.0 on Ubuntu 18.04
  2. Kafka Tutorial for Fast Data Architecture
  3. Kafka Python Tutorial for Fast Data Architecture

This is the third article in my Fast Data Architecture series that walks you through implementing Bid Data using a SMACK Stack.  This article builds on the others so if you have not read through those, I highly suggest you do so that you have the infrastructure you need to follow along in this tutorial.

Example Application Architecture

In order to demonstrate how to analyze your big data we will be configuring a big data pipeline that will pull site metrics from Clicky.com and push those metrics to a Kafka topic on our Kafka Cluster.

This is just one pipeline that you might want to implement in your Big Data implementation.  Web site statistics can be a valuable part of your data, as they tell you about web site visitors, pages visited, and more.  Combine this data with other data, like social media shares, when you perform your data analytics, and you could make some pretty neat business decisions about the best time to post site updates to social media in order to attract the most visitors.  That is the main benefit of implementing big data: not necessarily the raw data itself, but the business knowledge you can extract from that raw data to make more informed business decisions.

In this example, we will pull the 'pages' statistics from the Clicky.com API and push them to the admintome-pages Kafka topic.  This will give us JSON data of AdminTome's top pages.

Clicky Web Analytics

In order to fully follow along with this article you will need to have a website linked to Clicky.com.  It's free, so why not?  Register your site at clicky.com.  I personally use it because it has better metrics reporting for blogs (like abandon rate) than Google Analytics.  You will need to add some code to your page so that Clicky can start collecting metrics.

After your page is sending metrics to Clicky, you will need to get some values in order to use the Clicky API and pull metrics from our Python application.  Go to the preferences for your site and you will see two numbers that we will need:

  • Site ID
  • Site key

Don’t publish these anywhere because they could give anyone access to your website data.  We will need these numbers later when we connect to the API and pull our site statistics.

Preparing Kafka

First, we need to prepare our Kafka Cluster by adding a topic to our Kafka cluster that we will use to send messages to.  As you can see from the diagram above our topic in Kafka is going to be admintome-pages.

Log in to the Mesos master that you ran kafka-mesos from.  If you followed the previous article, the master I used was mesos1.admintome.lab.  Next, we will create the topic using the kafka-mesos.sh script:

$ cd kafka/
$ ./kafka-mesos.sh topic add admintome-pages --broker=0 --api=http://mslave2.admintome.lab:7000

Notice that the --api parameter points to the Kafka scheduler we created using kafka-mesos in the last article.  You can verify that you now have the correct topics:

$ ./kafka-mesos.sh topic list --api=http://mslave2.admintome.lab:7000
topics:
name: __consumer_offsets
partitions: 0:[0], 1:[0], 2:[0], 3:[0], 4:[0], 5:[0], 6:[0], 7:[0], 8:[0], 9:[0], 10:[0], 11:[0], 12:[0], 13:[0], 14:[0], 15:[0], 16:[0], 17:[0], 18:[0], 19:[0], 20:[0], 21:[0], 22:[0], 23:[0], 24:[0], 25:[0], 26:[0], 27:[0], 28:[0], 29:[0], 30:[0], 31:[0], 32:[0], 33:[0], 34:[0], 35:[0], 36:[0], 37:[0], 38:[0], 39:[0], 40:[0], 41:[0], 42:[0], 43:[0], 44:[0], 45:[0], 46:[0], 47:[0], 48:[0], 49:[0]
options: segment.bytes=104857600,cleanup.policy=compact,compression.type=producer

name: admintome
partitions: 0:[0]

name: admintome-pages
partitions: 0:[0]

And there is our new topic, ready to go!  Now it's time to get to the fun stuff and start developing our Python application.

Kafka Python Tutorial

Now that we have Kafka ready to go we will start to develop our Kafka producer.  The producer will get page metrics from the Clicky API and push those metrics in JSON form to our topic that we created earlier.

I assume that you have Python 3 installed on your system and virtualenv installed as well.

To get started we will need to setup our environment.

$ mkdir ~/Development/python/venvs
$ mkdir ~/Development/python/site-stats-intake
$ cd ~/Development/python/site-stats-intake
$ virtualenv ../venvs/intake
$ source ../venvs/intake/bin/activate
(intake) $ pip install kafka-python requests
(intake) $ pip freeze > requirements.txt

Next we need to create our classes.

Clicky Class

We will create a new python class called Clicky that we will use to interact with the Clicky API.  Create a new file called clicky.py and add the following contents:

import requests
import json


class Clicky(object):

    def __init__(self, site_id, sitekey):
        self.site_id = site_id
        self.sitekey = sitekey
        self.output = "json"

    def get_data(self, data_type):
        click_api_url = "https://api.clicky.com/api/stats/4"
        payload = {"site_id": self.site_id,
                   "sitekey": self.sitekey,
                   "type": data_type,
                   "output": self.output}
        response = requests.get(click_api_url, params=payload)
        raw_stats = response.text
        return raw_stats

    def get_pages_data(self):
        data = self.get_data("pages")
        return json.loads(data)

Save the file and exit.

In order to get our metrics we need to send an HTTP GET request to the Clicky API URL which is

https://api.clicky.com/api/stats/4

We also need to include several parameters:

  • site_id: This is the Site ID number that we got earlier
  • sitekey: This is the Site key number that we also got earlier
  • type: To get our top pages we set the type to ‘pages’
  • output: We set this to “json” so that the API will return JSON data

Finally, we call the requests module to perform an HTTP GET to the API URL with the parameters we specified.  In the get_pages_data method we return a dict that represents our JSON data.  Next, we will code our Kafka class implementation.
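
The requests library builds the query string from the params dict for us. You can see the resulting URL shape with the standard library alone (the credentials below are dummies, not real Clicky values):

```python
from urllib.parse import urlencode

# Dummy credentials -- the real site_id and sitekey come from your
# Clicky preferences page and should never be published.
payload = {"site_id": "12345", "sitekey": "abcdef",
           "type": "pages", "output": "json"}
url = "https://api.clicky.com/api/stats/4?" + urlencode(payload)
```

requests.get(click_api_url, params=payload) produces an equivalent URL internally.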

MyKafka Class

This class will interact with our Kafka cluster and push web site metrics to our topic for us.  Create a new file called mykafka.py and add the following contents:

from kafka import KafkaProducer
import json


class MyKafka(object):

    def __init__(self, kafka_brokers):
        self.producer = KafkaProducer(
            value_serializer=lambda v: json.dumps(v).encode('utf-8'),
            bootstrap_servers=kafka_brokers
        )

    def send_page_data(self, json_data):
        self.producer.send('admintome-pages', json_data)

First, we import the kafka-python library, specifically the KafkaProducer class, which will let us code a Kafka producer and publish messages to our Kafka topic.

from kafka import KafkaProducer

We now define our MyKafka class and create the constructor for it:

class MyKafka(object):
    def __init__(self, kafka_brokers):

This takes an argument that represents the Kafka brokers that will be used to connect to our Kafka cluster.  This is an array of strings in the form of:

[ "broker:ip", "broker:ip" ]

We will use only one broker, the one we created in the last article (mslave1.admintome.lab:31000):

[ "mslave1.admintome.lab:31000" ]

We next instantiate a new KafkaProducer object named producer.  Since we will be sending data to Kafka in the form of JSON, we use the value_serializer parameter to serialize each message with json.dumps and encode it as UTF-8.  We also tell it to use our brokers with the bootstrap_servers parameter.

self.producer = KafkaProducer(
            value_serializer=lambda v: json.dumps(v).encode('utf-8'),
            bootstrap_servers=kafka_brokers
        )

Finally, we create a new method that we will use to send the messages to our admintome-pages topic:

def send_page_data(self, json_data):
    self.producer.send('admintome-pages', json_data)

That’s all there is to it.  Now we will write our Main class that will control everything.

Main Class

Create a new file called main.py and add the following contents:

from clicky import Clicky
from mykafka import MyKafka
import logging
import time
import os
from logging.config import dictConfig


class Main(object):

    def __init__(self):
        if 'KAFKA_BROKERS' in os.environ:
            kafka_brokers = os.environ['KAFKA_BROKERS'].split(',')
        else:
            raise ValueError('KAFKA_BROKERS environment variable not set')

        if 'SITE_ID' in os.environ:
            self.site_id = os.environ['SITE_ID']
        else:
            raise ValueError('SITE_ID environment variable not set')

        if 'SITEKEY' in os.environ:
            self.sitekey = os.environ['SITEKEY']
        else:
            raise ValueError('SITEKEY environment variable not set')

        logging_config = dict(
            version=1,
            formatters={
                'f': {'format':
                      '%(asctime)s %(name)-12s %(levelname)-8s %(message)s'}
            },
            handlers={
                'h': {'class': 'logging.StreamHandler',
                      'formatter': 'f',
                      'level': logging.DEBUG}
            },
            root={
                'handlers': ['h'],
                'level': logging.DEBUG,
            },
        )
        self.logger = logging.getLogger()

        dictConfig(logging_config)
        self.logger.info("Initializing Kafka Producer")
        self.logger.info("KAFKA_BROKERS={0}".format(kafka_brokers))
        self.mykafka = MyKafka(kafka_brokers)

    def init_clicky(self):
        self.clicky = Clicky(self.site_id, self.sitekey)
        self.logger.info("Clicky Stats Polling Initialized")

    def run(self):
        self.init_clicky()
        starttime = time.time()
        while True:
            data = self.clicky.get_pages_data()
            self.logger.info("Successfully polled Clicky pages data")
            self.mykafka.send_page_data(data)
            self.logger.info("Published page data to Kafka")
            time.sleep(300.0 - ((time.time() - starttime) % 300.0))


if __name__ == "__main__":
    logging.info("Initializing Clicky Stats Polling")
    main = Main()
    main.run()

The end state of this example is to build a Docker container that we will then run on Marathon.  With that in mind, we don’t want to hard code some of our sensitive information (like our Clicky site ID and sitekey) in our code.  We want to be able to pull those from environment variables.  If they are not set then we throw an exception and exit out.

        if 'KAFKA_BROKERS' in os.environ:
            kafka_brokers = os.environ['KAFKA_BROKERS'].split(',')
        else:
            raise ValueError('KAFKA_BROKERS environment variable not set')

        if 'SITE_ID' in os.environ:
            self.site_id = os.environ['SITE_ID']
        else:
            raise ValueError('SITE_ID environment variable not set')

        if 'SITEKEY' in os.environ:
            self.sitekey = os.environ['SITEKEY']
        else:
            raise ValueError('SITEKEY environment variable not set')

We also configure logging so that we can see what is going on with our application.  I have coded an infinite loop that will poll Clicky and push the metrics to our Kafka topic every five minutes.

    def run(self):
        self.init_clicky()
        starttime = time.time()
        while True:
            data = self.clicky.get_pages_data()
            self.logger.info("Successfully polled Clicky pages data")
            self.mykafka.send_page_data(data)
            self.logger.info("Published page data to Kafka")
            time.sleep(300.0 - ((time.time() - starttime) % 300.0))

Save the file and exit.
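The sleep expression in run() is worth a second look: rather than sleeping a flat 300 seconds (which would drift by however long each poll takes), it subtracts the elapsed time modulo the interval so every wake-up stays aligned to a five-minute grid. Here is a minimal sketch of that calculation on its own; the helper name is mine, not part of the tutorial code:

```python
import time


def seconds_until_next_tick(starttime, interval, now=None):
    """How long to sleep so wake-ups stay aligned to a fixed interval."""
    if now is None:
        now = time.time()
    return interval - ((now - starttime) % interval)


# If the loop started at t=0 and the poll itself took 12.5 seconds,
# we sleep 287.5 seconds, so the next poll begins exactly 300 seconds
# after the previous one regardless of how long polling took.
print(seconds_until_next_tick(0.0, 300.0, now=12.5))  # 287.5
```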

Running our application

To test that everything works you can try running the application after you set your environment variables:

(intake) $ export KAFKA_BROKERS="mslave1.admintome.lab:31000"
(intake) $ export SITE_ID="{your site id}"
(intake) $ export SITEKEY="{your sitekey}"
(intake) $ python main.py
2018-06-25 15:34:32,259 root INFO Initializing Kafka Producer
2018-06-25 15:34:32,259 root INFO KAFKA_BROKERS=['mslave1.admintome.lab:31000']
2018-06-25 15:34:32,374 root INFO Clicky Stats Polling Initialized
2018-06-25 15:34:32,754 root INFO Successfully polled Clicky pages data
2018-06-25 15:34:32,755 root INFO Published page data to Kafka

We are now sending messages to our Kafka Topic!   We will build our Docker container next and deploy it to Marathon.  Finally, we will wrap up by writing a test consumer that will get our messages from our topic.

I have created a GitHub repository for all the code used in this article: https://github.com/admintome/clicky-state-intake

Create a Docker container

Now that we have our application code written, we can create a Docker container so that we can deploy it to Marathon.  Create a Dockerfile in your application directory with the following contents:

FROM python:3

WORKDIR /usr/src/app

COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD [ "python", "./main.py" ]

Build the container

$ docker build -t {your docker hub username}/site-stats-intake .

After the build completes you will want to push the image to a Docker repository that your Mesos slaves have access to.  For me this is Docker Hub:

$ docker push admintome/site-stats-intake

Then log in to each of your Mesos slaves and pull the image down:

$ docker pull admintome/site-stats-intake

We are now ready to create a Marathon application deployment for our application.

Deploying to Marathon

Go to your Marathon GUI.

http://mesos1.admintome.lab:8080

Click on the Create Application Button. Then click the JSON mode button:

Paste in the following JSON code:

{
  "id": "site-stats-intake",
  "cmd": null,
  "cpus": 1,
  "mem": 128,
  "disk": 0,
  "instances": 1,
  "container": {
    "docker": {
      "image": "admintome/site-stats-intake"
    },
    "type": "DOCKER"
  },
  "networks": [
    {
      "mode": "host"
    }
  ],
  "env": {
    "KAFKA_BROKERS": "192.168.1.x:port",
    "SITE_ID": "{your site_id}",
    "SITEKEY": "{your sitekey}"
  }
}

Be sure to substitute the correct values for KAFKA_BROKERS, SITE_ID, and SITEKEY in the env section for your environment.

Finally, click on the Create Application button to deploy the application.  After a few seconds you should see the application is Running.

To see the logs click on the site-stats-intake application then click on the stderr link to download a text file containing the logs.

Now that we have our application deployed to Marathon we will write a short consumer that we will run on our development system to show us what messages have been received.

Write a Python Kafka Consumer

This will be a simple Kafka consumer that will check our topic and display all messages on it.  Not really useful at this point, but it lets us know that our little polling application is working correctly.

Create a new file called consumer.py and add the following contents:

import sys
from kafka import KafkaConsumer

consumer = KafkaConsumer('admintome-pages', bootstrap_servers="mslave1.admintome.lab:31000",
                         auto_offset_reset='earliest')

try:
    for message in consumer:
        print(message.value)
except KeyboardInterrupt:
    sys.exit()

Save and exit the file.  This has the Kafka broker hardcoded because we are simply using it to test everything.  Make sure to update the bootstrap_servers parameter with your broker name and port.

Now run the command and you should see a ton of JSON that represents your most visited pages:

(intake) $ python consumer.py
b'[{"type": "pages", "dates": [{"date": "2018-06-25", "items": [{"value": "145", "value_percent": "43.2", "title": "Kafka Tutorial for Fast Data Architecture - AdminTome Blog", "stats_url": "http://clicky.com/stats/visitors?site_id=101045340&date=2018-06-25&href=%2Fblog%2Fkafka-tutorial-for-fast-data-architecture%2F", "url": "http://www.admintome.com/blog/kafka-tutorial-for-fast-data-architecture/"},...

What’s Next?

We now have a data pipeline with some data that we can use.  The next step will be to take that data and analyze it.  In the next article, we will install and configure the next part of our SMACK stack, Apache Spark, and configure it to analyze our data and give us something meaningful.

The post Kafka Python Tutorial for Fast Data Architecture appeared first on AdminTome Blog.

Matthew Rocklin: Dask Scaling Limits


This work is supported by Anaconda Inc.

History

For the first year of Dask’s life it focused exclusively on single node parallelism. We felt then that efficiently supporting 100+GB datasets on personal laptops or 1TB datasets on large workstations was a sweet spot for productivity, especially when avoiding the pain of deploying and configuring distributed systems. We still believe in the efficiency of single-node parallelism, but in the years since, Dask has extended itself to support larger distributed systems.

After that first year, Dask focused equally on both single-node and distributed parallelism. We maintain two entirely separate schedulers, one optimized for each case. This allows Dask to be very simple to use on single machines, but also scale up to thousand-node clusters and 100+TB datasets when needed with the same API.

Dask’s distributed system has a single central scheduler and many distributed workers. This is a common architecture today that scales out to a few thousand nodes. Roughly speaking Dask scales about the same as a system like Apache Spark, but less well than a high-performance system like MPI.

An Example

Most Dask examples in blogposts or talks are on modestly sized datasets, usually in the 10-50GB range. This, combined with Dask’s history with medium-data on single-nodes may have given people a more humble impression of Dask than is appropriate.

As a small nudge, here is an example using Dask to interact with 50 36-core nodes on an artificial terabyte dataset.

This is a common size for a typical modestly sized Dask cluster. We usually see Dask deployment sizes either in the tens of machines (usually with Hadoop style or ad-hoc enterprise clusters), or in the few-thousand range (usually with high performance computers or cloud deployments). We’re showing the modest case here just due to lack of resources. Everything in that example should work fine scaling out a couple extra orders of magnitude.

Challenges to Scaling Out

For the rest of the article we’ll talk about common causes that we see today that get in the way of scaling out. These are collected from experience working both with people in the open source community, as well as private contracts.

Simple Map-Reduce style

If you’re doing simple map-reduce style parallelism then things will be pretty smooth out to a large number of nodes. However, there are still some limitations to keep in mind:

  1. The scheduler will have at least one, and possibly a few connections open to each worker. You’ll want to ensure that your machines can have many open file handles at once. Some Linux distributions cap this at 1024 by default, but it is easy to change.

  2. The scheduler has an overhead of around 200 microseconds per task. So if each task takes one second then your scheduler can saturate 5000 cores, but if each task takes only 100ms then your scheduler can only saturate around 500 cores, and so on. Task duration imposes an inversely proportional constraint on scaling.

    If you want to scale larger than this then your tasks will need to start doing more work in each task to avoid overhead. Often this involves moving inner for loops within tasks rather than spreading them out to many tasks.
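The arithmetic in point 2 is simple enough to write down directly. This sketch just restates the scaling rule above (the function name is mine), using the roughly 200 microseconds of per-task overhead quoted in the text:

```python
SCHEDULER_OVERHEAD = 200e-6  # seconds of scheduler time per task (estimate)


def saturated_cores(task_duration):
    """Roughly how many cores one scheduler can keep busy."""
    return task_duration / SCHEDULER_OVERHEAD


print(round(saturated_cores(1.0)))  # 1-second tasks  -> about 5000 cores
print(round(saturated_cores(0.1)))  # 100 ms tasks    -> about 500 cores
```

The rule is inversely proportional: halve the task duration and you halve the number of cores the scheduler can feed.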

More complex algorithms

If you’re doing more complex algorithms (which is common among Dask users) then many more things can break along the way. High performance computing isn’t about doing any one thing well, it’s about doing nothing badly. This section lists a few issues that arise for larger deployments:

  1. Dask collection algorithms may be suboptimal.

    The parallel algorithms in Dask-array/bag/dataframe/ml are pretty good, but as Dask scales out to larger clusters and its algorithms are used by more domains we invariably find that small corners of the API fail beyond a certain point. Luckily these are usually pretty easy to fix after they are reported.

  2. The graph size may grow too large for the scheduler

    The metadata describing your computation has to all fit on a single machine, the Dask scheduler. This metadata, the task graph, can grow big if you’re not careful. It’s nice to have a scheduler process with at least a few gigabytes of memory if you’re going to be processing million-node task graphs. A task takes up around 1kB of memory if you’re careful to avoid closing over any unnecessary local data.

  3. The graph serialization time may become annoying for interactive use

    Again, if you have million node task graphs you’re going to be serializing them and passing them from the client to the scheduler. This is fine, assuming they fit at both ends, but can take up some time and limit interactivity. If you press compute and nothing shows up on the dashboard for a minute or two, this is what’s happening.

  4. The interactive dashboard plots stop being as useful

    Those beautiful plots on the dashboard were mostly designed for deployments with 1-100 nodes, but not 1000s. Seeing the start and stop time of every task of a million-task computation just isn’t something that our brains can fully understand.

    This is something that we would like to improve. If anyone out there is interested in scalable performance diagnostics, please get involved.

  5. Other components that you rely on, like distributed storage, may also start to break

    Dask provides users more power than they’re accustomed to. It’s easy for them to accidentally clobber some other component of their systems, like distributed storage, a local database, the network, and so on, with too many requests.

    Many of these systems provide abstractions that are very well tested and stable for normal single-machine use, but that quickly become brittle when you have a thousand machines acting on them with the full creativity of a novice user. Dask provides some primitives like distributed locks and queues to help control access to these resources, but it’s on the user to use them well and not break things.
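Point 2’s figure above — roughly 1 kB of scheduler memory per task — makes it easy to size the scheduler process before launching a large computation. A back-of-the-envelope helper (names mine) using that estimate:

```python
def graph_memory_gb(n_tasks, bytes_per_task=1_000):
    """Rough scheduler memory needed just to hold the task graph."""
    return n_tasks * bytes_per_task / 1e9


# A million-task graph needs on the order of a gigabyte on the scheduler,
# which is why the text recommends a few GB of headroom for graphs that big.
print(graph_memory_gb(1_000_000))  # 1.0
```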

Conclusion

Dask scales happily out to tens of nodes, like in the example above, or to thousands of nodes, which I’m not showing here simply due to lack of resources.

Dask provides this scalability while still maintaining the flexibility and freedom to build custom systems that has defined the project since it began. However, the combination of scalability and freedom makes it hard for Dask to fully protect users from breaking things. It’s much easier to protect users when you can constrain what they can do. When users stick to standard workflows like Dask dataframe or Dask array they’ll probably be ok, but when operating with full creativity at the thousand-node scale some expertise will invariably be necessary. We try hard to provide the diagnostics and tools necessary to investigate issues and control operation. The project is getting better at this every day, in large part due to some expert users out there.

A Call for Examples

Do you use Dask on more than one machine to do interesting work? We’d love to hear about it either in the comments below, or in this online form.

Andre Roberge: Pythonic switch statement

Playing with experimental and an old recipe created by Brian Beck.

The content of a test file:

from __experimental__ import switch_statement

def example(n):
    result = ''
    switch n:
        case 2:
            result += '2 is even and '
        case 3, 5, 7:
            result += f'{n} is prime'
            break
        case 0: pass
        case 1:
            pass
        case 4, 6, 8, 9:
            result = f'{n} is not prime'
            break
        default:
            result = f'{n} is not a single digit integer'
    return result

for i in range(11):
    print(example(i))


Trying it out

$ python -m experimental test_switch
0 is not prime
1 is not prime
2 is even and 2 is prime
3 is prime
4 is not prime
5 is prime
6 is not prime
7 is prime
8 is not prime
9 is not prime
10 is not a single digit integer


Just having fun ... Please, do not even think of using this for serious work.


"Menno's Musings": Listing S3 objects with NodeJS


I recently had to write some NodeJS code which uses the AWS SDK to list all the objects in a S3 bucket which potentially contains many objects (currently over 80,000 in production). The S3 listObjects API will only return up to 1,000 keys at a time so you have to make multiple calls, setting the Marker field to page through all the keys.

It turns out there are a lot of sub-optimal examples out there for how to do this which often involve global state and complicated recursive callbacks. I'm also a fan of the clarity of JavaScript's newer async/await feature for handling asynchronous code so I was keen on a solution which uses that style.

Read more… (1 min remaining to read)
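The post itself is about NodeJS, but the same pattern — page through listObjects with a Marker until the listing is no longer truncated — is easy to express in Python too. The sketch below is hypothetical: the fake client stands in for a real S3 client (such as boto3's), and the method and field names simply mirror the S3 ListObjects response shape described above.

```python
def iter_all_keys(client, bucket):
    """Yield every key in a bucket, paging with Marker until done."""
    marker = None
    while True:
        kwargs = {"Bucket": bucket}
        if marker is not None:
            kwargs["Marker"] = marker
        page = client.list_objects(**kwargs)
        contents = page.get("Contents", [])
        for obj in contents:
            yield obj["Key"]
        if not page.get("IsTruncated") or not contents:
            return
        marker = contents[-1]["Key"]  # next page starts after the last key


# A tiny fake client to show the paging behaviour without touching AWS.
class FakeS3:
    def __init__(self, keys, page_size=2):
        self.keys, self.page_size = sorted(keys), page_size

    def list_objects(self, Bucket, Marker=""):
        start = 0 if not Marker else self.keys.index(Marker) + 1
        page = self.keys[start:start + self.page_size]
        return {"Contents": [{"Key": k} for k in page],
                "IsTruncated": start + self.page_size < len(self.keys)}


print(list(iter_all_keys(FakeS3(["a", "b", "c", "d", "e"]), "my-bucket")))
```

Because the paging is hidden inside a generator, the caller just iterates; no global state or recursive callbacks are needed.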

Real Python: Cool New Features in Python 3.7


Python 3.7 is officially released! This new Python version has been in development since September 2016, and now we all get to enjoy the results of the core developers’ hard work.

What does the new Python version bring? While the documentation gives a good overview of the new features, this article will take a deep dive into some of the biggest pieces of news. These include:

  • Easier access to debuggers through a new breakpoint() built-in
  • Simple class creation using data classes
  • Customized access to module attributes
  • Improved support for type hinting
  • Higher precision timing functions

More importantly, Python 3.7 is fast.

In the final sections of this article, you’ll read more about this speed, as well as some of the other cool features of Python 3.7. You will also get some advice on upgrading to the new version.

The breakpoint() Built-In

While we might strive to write perfect code, the simple truth is that we never do. Debugging is an important part of programming. Python 3.7 introduces the new built-in function breakpoint(). This does not really add any new functionality to Python, but it makes using debuggers more flexible and intuitive.

Assume that you have the following buggy code in the file bugs.py:

def divide(e, f):
    return f / e

a, b = 0, 1
print(divide(a, b))

Running the code causes a ZeroDivisionError inside the divide() function. Let’s say that you want to interrupt your code and drop into a debugger right at the top of divide(). You can do so by setting a so-called “breakpoint” in your code:

def divide(e, f):
    # Insert breakpoint here
    return f / e

A breakpoint is a signal inside your code that execution should temporarily stop, so that you can look around at the current state of the program. How do you place the breakpoint? In Python 3.6 and below, you use this somewhat cryptic line:

def divide(e, f):
    import pdb; pdb.set_trace()
    return f / e

Here, pdb is the Python Debugger from the standard library. In Python 3.7, you can use the new breakpoint() function call as a shortcut instead:

def divide(e, f):
    breakpoint()
    return f / e

In the background, breakpoint() is first importing pdb and then calling pdb.set_trace() for you. The obvious benefits are that breakpoint() is easier to remember and that you only need to type 12 characters instead of 27. However, the real bonus of using breakpoint() is its customizability.

Run your bugs.py script with breakpoint():

$ python3.7 bugs.py
> /home/gahjelle/bugs.py(3)divide()
-> return f / e
(Pdb)

The script will break when it reaches breakpoint() and drop you into a PDB debugging session. You can type c and hit Enter to continue the script. Refer to Nathan Jennings’ PDB guide if you want to learn more about PDB and debugging.

Now, say that you think you’ve fixed the bug. You would like to run the script again but without stopping in the debugger. You could, of course, comment out the breakpoint() line, but another option is to use the PYTHONBREAKPOINT environment variable. This variable controls the behavior of breakpoint(), and setting PYTHONBREAKPOINT=0 means that any call to breakpoint() is ignored:

$ PYTHONBREAKPOINT=0 python3.7 bugs.py
ZeroDivisionError: division by zero

Oops, it seems as if you haven’t fixed the bug after all…

Another option is to use PYTHONBREAKPOINT to specify a debugger other than PDB. For instance, to use PuDB (a visual debugger in the console) you can do:

$ PYTHONBREAKPOINT=pudb.set_trace python3.7 bugs.py

For this to work, you need to have pudb installed (pip install pudb). Python will take care of importing pudb for you though. This way you can also set your default debugger. Simply set the PYTHONBREAKPOINT environment variable to your preferred debugger. See this guide for instructions on how to set an environment variable on your system.

The new breakpoint() function does not only work with debuggers. One convenient option could be to simply start an interactive shell inside your code. For instance, to start an IPython session, you can use the following:

$ PYTHONBREAKPOINT=IPython.embed python3.7 bugs.py
IPython 6.3.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: print(e / f)
0.0

You can also create your own function and have breakpoint() call that. The following code prints all variables in the local scope. Add it to a file called bp_utils.py:

from pprint import pprint
import sys

def print_locals():
    caller = sys._getframe(1)  # Caller is 1 frame up.
    pprint(caller.f_locals)

To use this function, set PYTHONBREAKPOINT as before, with the <module>.<function> notation:

$ PYTHONBREAKPOINT=bp_utils.print_locals python3.7 bugs.py
{'e': 0, 'f': 1}
ZeroDivisionError: division by zero

Normally, breakpoint() will be used to call functions and methods that do not need arguments. However, it is possible to pass arguments as well. Change the line breakpoint() in bugs.py to:

breakpoint(e, f, end="<-END\n")

Note: The default PDB debugger will raise a TypeError at this line because pdb.set_trace() does not take any positional arguments.

Run this code with breakpoint() masquerading as the print() function to see a simple example of the arguments being passed through:

$ PYTHONBREAKPOINT=print python3.7 bugs.py
0 1<-END
ZeroDivisionError: division by zero

See PEP 553 as well as the documentation for breakpoint() and sys.breakpointhook() for more information.

Data Classes

The new dataclasses module makes it more convenient to write your own classes, as special methods like .__init__(), .__repr__(), and .__eq__() are added automatically. Using the @dataclass decorator, you can write something like:

from dataclasses import dataclass, field

@dataclass(order=True)
class Country:
    name: str
    population: int
    area: float = field(repr=False, compare=False)
    coastline: float = 0

    def beach_per_person(self):
        """Meters of coastline per person"""
        return (self.coastline * 1000) / self.population

These nine lines of code stand in for quite a bit of boilerplate code and best practices. Think about what it would take to implement Country as a regular class: the .__init__() method, a repr, six different comparison methods as well as the .beach_per_person() method. You can expand the box below to see an implementation of Country that is roughly equivalent to the data class:

class Country:
    def __init__(self, name, population, area, coastline=0):
        self.name = name
        self.population = population
        self.area = area
        self.coastline = coastline

    def __repr__(self):
        return (
            f"Country(name={self.name!r}, population={self.population!r},"
            f" coastline={self.coastline!r})"
        )

    def __eq__(self, other):
        if other.__class__ is self.__class__:
            return (self.name, self.population, self.coastline) == (
                other.name, other.population, other.coastline)
        return NotImplemented

    def __ne__(self, other):
        if other.__class__ is self.__class__:
            return (self.name, self.population, self.coastline) != (
                other.name, other.population, other.coastline)
        return NotImplemented

    def __lt__(self, other):
        if other.__class__ is self.__class__:
            return (self.name, self.population, self.coastline) < (
                other.name, other.population, other.coastline)
        return NotImplemented

    def __le__(self, other):
        if other.__class__ is self.__class__:
            return (self.name, self.population, self.coastline) <= (
                other.name, other.population, other.coastline)
        return NotImplemented

    def __gt__(self, other):
        if other.__class__ is self.__class__:
            return (self.name, self.population, self.coastline) > (
                other.name, other.population, other.coastline)
        return NotImplemented

    def __ge__(self, other):
        if other.__class__ is self.__class__:
            return (self.name, self.population, self.coastline) >= (
                other.name, other.population, other.coastline)
        return NotImplemented

    def beach_per_person(self):
        """Meters of coastline per person"""
        return (self.coastline * 1000) / self.population

After creation, a data class is a normal class. You can, for instance, inherit from a data class in the normal way. The main purpose of data classes is to make it quick and easy to write robust classes, in particular small classes that mainly store data.

You can use the Country data class like any other class:

>>> norway = Country("Norway", 5320045, 323802, 58133)
>>> norway
Country(name='Norway', population=5320045, coastline=58133)
>>> norway.area
323802
>>> usa = Country("United States", 326625791, 9833517, 19924)
>>> nepal = Country("Nepal", 29384297, 147181)
>>> nepal
Country(name='Nepal', population=29384297, coastline=0)
>>> usa.beach_per_person()
0.06099946957342386
>>> norway.beach_per_person()
10.927163210085629

Note that all the fields .name, .population, .area, and .coastline are used when initializing the class (although .coastline is optional, as is shown in the example of landlocked Nepal). The Country class has a reasonable repr, while defining methods works the same as for regular classes.

By default, data classes can be compared for equality. Since we specified order=True in the @dataclass decorator, the Country class can also be sorted:

>>> norway == norway
True
>>> nepal == usa
False
>>> sorted((norway, usa, nepal))
[Country(name='Nepal', population=29384297, coastline=0),
 Country(name='Norway', population=5320045, coastline=58133),
 Country(name='United States', population=326625791, coastline=19924)]

The sorting happens on the field values, first .name then .population, and so on. However, if you use field(), you can customize which fields will be used in the comparison. In the example, the .area field was left out of the repr and the comparisons.

Note: The country data are from the CIA World Factbook with population numbers estimated for July 2017.

Before you all go book your next beach holidays in Norway, here is what the Factbook says about the Norwegian climate: “temperate along coast, modified by North Atlantic Current; colder interior with increased precipitation and colder summers; rainy year-round on west coast.”

Data classes do some of the same things as namedtuple. Yet, they draw their biggest inspiration from the attrs project. See our full guide to data classes for more examples and further information, as well as PEP 557 for the official description.

Customization of Module Attributes

Attributes are everywhere in Python! While class attributes are probably the most famous, attributes can actually be put on essentially anything—including functions and modules. Several of Python’s basic features are implemented as attributes: most of the introspection functionality, doc-strings, and name spaces. Functions inside a module are made available as module attributes.

Attributes are most often retrieved using the dot notation: thing.attribute. However, you can also get attributes that are named at runtime using getattr():

import random

random_attr = random.choice(("gammavariate", "lognormvariate", "normalvariate"))
random_func = getattr(random, random_attr)
print(f"A {random_attr} random value: {random_func(1, 1)}")

Running this code will produce something like:

A gammavariate random value: 2.8017715125270618

For classes, calling thing.attr will first look for attr defined on thing. If it is not found, then the special method thing.__getattr__("attr") is called. (This is a simplification. See this article for more details.) The .__getattr__() method can be used to customize access to attributes on objects.

Until Python 3.7, the same customization was not easily available for module attributes. However, PEP 562 introduces __getattr__() on modules, together with a corresponding __dir__() function. The __dir__() special function allows customization of the result of calling dir() on a module.

The PEP itself gives a few examples of how these functions can be used, including adding deprecation warnings to functions and lazy loading of heavy submodules. Below, we will build a simple plugin system that allows functions to be added to a module dynamically. This example takes advantage of Python packages. See this article if you need a refresher on packages.

Create a new directory, plugins, and add the following code to a file, plugins/__init__.py:

from importlib import import_module
from importlib import resources

PLUGINS = dict()

def register_plugin(func):
    """Decorator to register plug-ins"""
    name = func.__name__
    PLUGINS[name] = func
    return func

def __getattr__(name):
    """Return a named plugin"""
    try:
        return PLUGINS[name]
    except KeyError:
        _import_plugins()
        if name in PLUGINS:
            return PLUGINS[name]
        else:
            raise AttributeError(
                f"module {__name__!r} has no attribute {name!r}"
            ) from None

def __dir__():
    """List available plug-ins"""
    _import_plugins()
    return list(PLUGINS.keys())

def _import_plugins():
    """Import all resources to register plug-ins"""
    for name in resources.contents(__name__):
        if name.endswith(".py"):
            import_module(f"{__name__}.{name[:-3]}")

Before we look at what this code does, add two more files inside the plugins directory. First, let’s see plugins/plugin_1.py:

from . import register_plugin

@register_plugin
def hello_1():
    print("Hello from Plugin 1")

Next, add similar code in the file plugins/plugin_2.py:

from . import register_plugin

@register_plugin
def hello_2():
    print("Hello from Plugin 2")

@register_plugin
def goodbye():
    print("Plugin 2 says goodbye")

These plugins can now be used as follows:

>>> import plugins
>>> plugins.hello_1()
Hello from Plugin 1
>>> dir(plugins)
['goodbye', 'hello_1', 'hello_2']
>>> plugins.goodbye()
Plugin 2 says goodbye

This may not all seem that revolutionary (and it probably isn’t), but let’s look at what actually happened here. Normally, to be able to call plugins.hello_1(), the hello_1() function must be defined in a plugins module or explicitly imported inside __init__.py in a plugins package. Here, it is neither!

Instead, hello_1() is defined in an arbitrary file inside the plugins package, and hello_1() becomes a part of the plugins package by registering itself using the @register_plugin decorator.

The difference is subtle. Instead of the package dictating which functions are available, the individual functions register themselves as part of the package. This gives you a simple structure where you can add functions independently of the rest of the code without having to keep a centralized list of which functions are available.

Let us do a quick review of what __getattr__() does inside the plugins/__init__.py code. When you ask for plugins.hello_1(), Python first looks for a hello_1() function inside the plugins/__init__.py file. As no such function exists, Python calls __getattr__("hello_1") instead. Remember the source code of the __getattr__() function:

def __getattr__(name):
    """Return a named plugin"""
    try:
        return PLUGINS[name]        # 1) Try to return plugin
    except KeyError:
        _import_plugins()           # 2) Import all plugins
        if name in PLUGINS:
            return PLUGINS[name]    # 3) Try to return plugin again
        else:
            raise AttributeError(   # 4) Raise error
                f"module {__name__!r} has no attribute {name!r}"
            ) from None

__getattr__() contains the following steps. The numbers in the following list correspond to the numbered comments in the code:

  1. First, the function optimistically tries to return the named plugin from the PLUGINS dictionary. This will succeed if a plugin named name exists and has already been imported.
  2. If the named plugin is not found in the PLUGINS dictionary, we make sure all plugins are imported.
  3. Return the named plugin if it has become available after the import.
  4. If the plugin is not in the PLUGINS dictionary after importing all plugins, we raise an AttributeError saying that name is not an attribute (plugin) on the current module.

How is the PLUGINS dictionary populated though? The _import_plugins() function imports all Python files inside the plugins package, but does not seem to touch PLUGINS:

def _import_plugins():
    """Import all resources to register plug-ins"""
    for name in resources.contents(__name__):
        if name.endswith(".py"):
            import_module(f"{__name__}.{name[:-3]}")

Don’t forget that each plugin function is decorated by the @register_plugin decorator. This decorator is called when the plugins are imported and is the one actually populating the PLUGINS dictionary. You can see this if you manually import one of the plugin files:

>>> import plugins
>>> plugins.PLUGINS
{}
>>> import plugins.plugin_1
>>> plugins.PLUGINS
{'hello_1': <function hello_1 at 0x7f29d4341598>}

Continuing the example, note that calling dir() on the module also imports the remaining plugins:

>>> dir(plugins)
['goodbye', 'hello_1', 'hello_2']
>>> plugins.PLUGINS
{'hello_1': <function hello_1 at 0x7f29d4341598>,
 'hello_2': <function hello_2 at 0x7f29d4341620>,
 'goodbye': <function goodbye at 0x7f29d43416a8>}

dir() usually lists all available attributes on an object. Normally, using dir() on a module results in something like this:

>>> import plugins
>>> dir(plugins)
['PLUGINS', '__builtins__', '__cached__', '__doc__', '__file__',
 '__getattr__', '__loader__', '__name__', '__package__', '__path__',
 '__spec__', '_import_plugins', 'import_module', 'register_plugin',
 'resources']

While this might be useful information, we are more interested in exposing the available plugins. In Python 3.7, you can customize the result of calling dir() on a module by adding a __dir__() special function. For plugins/__init__.py, this function first makes sure all plugins have been imported and then lists their names:

def __dir__():
    """List available plug-ins"""
    _import_plugins()
    return list(PLUGINS.keys())

Before leaving this example, please note that we also used another cool new feature of Python 3.7. To import all modules inside the plugins directory, we used the new importlib.resources module. This module gives access to files and resources inside modules and packages without the need for __file__ hacks (which do not always work) or pkg_resources (which is slow). Other features of importlib.resources will be highlighted later.

Typing Enhancements

Type hinting and annotations have been in constant development throughout the Python 3 series of releases. Python’s typing system is now quite stable. Still, Python 3.7 brings some enhancements to the table: better performance, core support, and forward references.

Python does not do any type checking at runtime (unless you are explicitly using packages like enforce). Therefore, adding type hints to your code should not affect its performance.

Unfortunately, this is not completely true, as most type hints need the typing module. The typing module is one of the slowest modules in the standard library. PEP 560 adds some core support for typing in Python 3.7, which significantly speeds up the typing module. You do not generally need to know the details of how this works. Simply lean back and enjoy the increased performance.

While Python’s type system is reasonably expressive, one issue that causes some pain is forward references. Type hints—or more generally annotations—are evaluated while the module is imported. Therefore, all names must already be defined before they are used. The following is not possible:

class Tree:
    def __init__(self, left: Tree, right: Tree) -> None:
        self.left = left
        self.right = right

Running the code raises a NameError because the class Tree is not yet (completely) defined in the definition of the .__init__() method:

Traceback (most recent call last):
  File "tree.py", line 1, in <module>
    class Tree:
  File "tree.py", line 2, in Tree
    def __init__(self, left: Tree, right: Tree) -> None:
NameError: name 'Tree' is not defined

To overcome this, you would have needed to write "Tree" as a string literal instead:

class Tree:
    def __init__(self, left: "Tree", right: "Tree") -> None:
        self.left = left
        self.right = right

See PEP 484 for the original discussion.

In a future Python 4.0, such so-called forward references will be allowed. This will be handled by not evaluating annotations until they are explicitly requested. PEP 563 describes the details of this proposal. In Python 3.7, forward references are already available through a __future__ import. You can now write the following:

from __future__ import annotations

class Tree:
    def __init__(self, left: Tree, right: Tree) -> None:
        self.left = left
        self.right = right

Note that in addition to avoiding the somewhat clumsy "Tree" syntax, the postponed evaluation of annotations will also speed up your code, since type hints are not executed. Forward references are already supported by mypy.

By far, the most common use of annotations is type hinting. Still, you have full access to the annotations at runtime and can use them as you see fit. If you are handling annotations directly, you need to deal with the possible forward references explicitly.

Let us create some admittedly silly examples that show when annotations are evaluated. First we do it old-style, so annotations are evaluated at import time. Let anno.py contain the following code:

def greet(name: print("Now!")):
    print(f"Hello {name}")

Note that the annotation of name is print(). This is only to see exactly when the annotation is evaluated. Import the new module:

>>> import anno
Now!
>>> anno.greet.__annotations__
{'name': None}
>>> anno.greet("Alice")
Hello Alice

As you can see, the annotation was evaluated at import time. Note that name ends up annotated with None because that is the return value of print().

Add the __future__ import to enable postponed evaluation of annotations:

from __future__ import annotations

def greet(name: print("Now!")):
    print(f"Hello {name}")

Importing this updated code will not evaluate the annotation:

>>> import anno
>>> anno.greet.__annotations__
{'name': "print('Now!')"}
>>> anno.greet("Marty")
Hello Marty

Note that Now! is never printed and the annotation is kept as a string literal in the __annotations__ dictionary. In order to evaluate the annotation, use typing.get_type_hints() or eval():

>>> import typing
>>> typing.get_type_hints(anno.greet)
Now!
{'name': <class 'NoneType'>}
>>> eval(anno.greet.__annotations__["name"])
Now!
>>> anno.greet.__annotations__
{'name': "print('Now!')"}

Observe that the __annotations__ dictionary is never updated, so you need to evaluate the annotation every time you use it.

Timing Precision

In Python 3.7, the time module gains some new functions as described in PEP 564. In particular, the following six functions are added:

  • clock_gettime_ns(): Returns the time of a specified clock
  • clock_settime_ns(): Sets the time of a specified clock
  • monotonic_ns(): Returns the time of a relative clock that cannot go backwards (for instance due to daylight savings)
  • perf_counter_ns(): Returns the value of a performance counter—a clock specifically designed to measure short intervals
  • process_time_ns(): Returns the sum of the system and user CPU time of the current process (not including sleep time)
  • time_ns(): Returns the number of nanoseconds since January 1st 1970

In a sense, no new functionality is added. Each function is similar to an already existing function without the _ns suffix. The difference is that the new functions return a number of nanoseconds as an int instead of a number of seconds as a float.

For most applications, the difference between these new nanosecond functions and their old counterparts will not be appreciable. However, the new functions are easier to reason about because they rely on int instead of float. Floating point numbers are by nature inaccurate:

>>> 0.1 + 0.1 + 0.1
0.30000000000000004
>>> 0.1 + 0.1 + 0.1 == 0.3
False

This is not an issue with Python but rather a consequence of computers needing to represent infinite decimal numbers using a finite number of bits.

A Python float follows the IEEE 754 standard and uses 53 significant bits. The result is that any time greater than about 104 days (2⁵³ or approximately 9 quadrillion nanoseconds) cannot be expressed as a float with nanosecond precision. In contrast, a Python int is unlimited, so an integer number of nanoseconds will always have nanosecond precision independent of the time value.
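You can verify the 104-day figure yourself with a quick back-of-the-envelope computation:

```python
# 53 significant bits means integer values above 2**53 can no longer be
# represented exactly, so a float count of nanoseconds loses nanosecond
# precision beyond this point.
limit_ns = 2 ** 53
days = limit_ns / 1_000_000_000 / (60 * 60 * 24)
print(f"2**53 ns is about {days:.1f} days")  # about 104.2 days
```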

As an example, time.time() returns the number of seconds since January 1st 1970. This number is already quite big, so the precision of this number is at the microsecond level. This function is the one showing the biggest improvement in its _ns version. The resolution of time.time_ns() is about 3 times better than for time.time().
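As a small illustration of the new functions, here is how you might time a snippet with time.perf_counter_ns(), getting an exact integer count of nanoseconds back:

```python
import time

start = time.perf_counter_ns()
total = sum(range(1000))
elapsed_ns = time.perf_counter_ns() - start

# elapsed_ns is an int, so there is no floating-point rounding involved
print(f"sum(range(1000)) = {total}, took {elapsed_ns} ns")
```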

What is a nanosecond by the way? Technically, it is one billionth of a second, or 1e-9 second if you prefer scientific notation. These are just numbers though and do not really provide any intuition. For a better visual aid, see Grace Hopper’s wonderful demonstration of the nanosecond.

As an aside, if you need to work with datetimes with nanosecond precision, the datetime standard library will not cut it. It explicitly only handles microseconds:

>>> from datetime import datetime, timedelta
>>> datetime(2018, 6, 27) + timedelta(seconds=1e-6)
datetime.datetime(2018, 6, 27, 0, 0, 0, 1)
>>> datetime(2018, 6, 27) + timedelta(seconds=1e-9)
datetime.datetime(2018, 6, 27, 0, 0)

Instead, you can use the astropy project. Its astropy.time package represents datetimes using two float objects which guarantees “sub-nanosecond precision over times spanning the age of the universe.”

>>> from astropy.time import Time, TimeDelta
>>> Time("2018-06-27")
<Time object: scale='utc' format='iso' value=2018-06-27 00:00:00.000>
>>> t = Time("2018-06-27") + TimeDelta(1e-9, format="sec")
>>> (t - Time("2018-06-27")).sec
9.976020010071807e-10

The latest version of astropy is available in Python 3.5 and later.

Other Pretty Cool Features

So far, you have seen the headline news regarding what’s new in Python 3.7. However, there are many other changes that are also pretty cool. In this section, we will look briefly at some of them.

The Order of Dictionaries Is Guaranteed

The CPython implementation of Python 3.6 has ordered dictionaries. (PyPy also has this.) This means that items in dictionaries are iterated over in the same order they were inserted. The first example is using Python 3.5, and the second is using Python 3.6:

>>> {"one": 1, "two": 2, "three": 3}  # Python <= 3.5
{'three': 3, 'one': 1, 'two': 2}
>>> {"one": 1, "two": 2, "three": 3}  # Python >= 3.6
{'one': 1, 'two': 2, 'three': 3}

In Python 3.6, this ordering was just a nice consequence of that implementation of dict. In Python 3.7, however, dictionaries preserving their insert order is part of the language specification. As such, it may now be relied on in projects that support only Python >= 3.7 (or CPython >= 3.6).
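For example, you can now rely on a plain dict to deduplicate a sequence while preserving first-seen order, a job that previously called for collections.OrderedDict:

```python
def dedupe(items):
    """Remove duplicates, keeping first-seen order (guaranteed in 3.7+)."""
    # dict.fromkeys() keeps one entry per key, in insertion order
    return list(dict.fromkeys(items))

print(dedupe(["b", "a", "b", "c", "a"]))  # ['b', 'a', 'c']
```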

“async” and “await” Are Keywords

Python 3.5 introduced coroutines with async and await syntax. To avoid issues of backwards compatibility, async and await were not added to the list of reserved keywords. In other words, it was still possible to define variables or functions named async and await.

In Python 3.7, this is no longer possible:

>>> async = 1
  File "<stdin>", line 1
    async = 1
          ^
SyntaxError: invalid syntax
>>> def await():
  File "<stdin>", line 1
    def await():
            ^
SyntaxError: invalid syntax

“asyncio” Face Lift

The asyncio standard library was originally introduced in Python 3.4 to handle concurrency in a modern way using event loops, coroutines and futures. Here is a gentle introduction.

In Python 3.7, the asyncio module is getting a major face lift, including many new functions, support for the context variables mentioned above, and performance improvements. Of particular note is asyncio.run(), which simplifies calling coroutines from synchronous code. Using asyncio.run(), you do not need to explicitly create the event loop. An asynchronous Hello World program can now be written:

import asyncio

async def hello_world():
    print("Hello World!")

asyncio.run(hello_world())
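For comparison, before asyncio.run() you had to create, drive, and close the event loop yourself. A rough sketch of the old boilerplate (pre-3.7 examples usually called asyncio.get_event_loop(); here asyncio.new_event_loop() is used to make the loop creation explicit):

```python
import asyncio

async def hello_world():
    print("Hello World!")
    return "done"

# Manually create, run, and close the loop -- the steps asyncio.run() wraps up
loop = asyncio.new_event_loop()
try:
    result = loop.run_until_complete(hello_world())
finally:
    loop.close()
```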

Context Variables

Context variables are variables that can have different values depending on their context. They are similar to Thread-Local Storage in which each execution thread may have a different value for a variable. However, with context variables, there may be several contexts in one execution thread. The main use case for context variables is keeping track of variables in concurrent asynchronous tasks.

The following example constructs three contexts, each with its own value for the context variable name. The greet() function is later able to use the value of name inside each context:

import contextvars

name = contextvars.ContextVar("name")
contexts = list()

def greet():
    print(f"Hello {name.get()}")

# Construct contexts and set the context variable name
for first_name in ["Steve", "Dina", "Harry"]:
    ctx = contextvars.copy_context()
    ctx.run(name.set, first_name)
    contexts.append(ctx)

# Run greet function inside each context
for ctx in reversed(contexts):
    ctx.run(greet)

Running this script greets Steve, Dina, and Harry in reverse order:

$ python3.7 context_demo.py
Hello Harry
Hello Dina
Hello Steve
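The main use case mentioned above, keeping per-task state in concurrent asynchronous code, might look like the following sketch. Each asyncio task runs in its own copy of the current context, so setting the variable in one task does not leak into the others (the names here are illustrative, not from the article):

```python
import asyncio
import contextvars

request_id = contextvars.ContextVar("request_id")
seen = []

async def handle():
    # Even though the tasks interleave on one thread, each task sees
    # the request_id value set in its own context.
    await asyncio.sleep(0)
    seen.append(request_id.get())

async def serve(rid):
    request_id.set(rid)
    await handle()

async def main():
    await asyncio.gather(serve(1), serve(2), serve(3))

asyncio.run(main())
print(sorted(seen))  # [1, 2, 3]
```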

Importing Data Files With “importlib.resources”

One challenge when packaging a Python project is deciding what to do with project resources like data files needed by the project. A few options have commonly been used:

  • Hard-code a path to the data file.
  • Put the data file inside the package and locate it using __file__.
  • Use setuptools.pkg_resources to access the data file resource.

Each of these has its shortcomings. The first option is not portable. Using __file__ is more portable, but if the Python project is installed it might end up inside a zip file and not have a __file__ attribute. The third option solves this problem, but is unfortunately very slow.

A better solution is the new importlib.resources module in the standard library. It uses Python’s existing import functionality to also import data files. Assume you have a resource inside a Python package like this:

data/
│
├── alice_in_wonderland.txt
└── __init__.py

Note that data needs to be a Python package. That is, the directory needs to contain an __init__.py file (which may be empty). You can then read the alice_in_wonderland.txt file as follows:

>>> from importlib import resources
>>> with resources.open_text("data", "alice_in_wonderland.txt") as fid:
...     alice = fid.readlines()
...
>>> print("".join(alice[:7]))
CHAPTER I. Down the Rabbit-Hole

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, ‘and what is the use of a book,’ thought Alice ‘without pictures or
conversations?’

A similar resources.open_binary() function is available for opening files in binary mode. In the earlier “plugins as module attributes” example, we used importlib.resources to discover the available plugins using resources.contents(). See Barry Warsaw’s PyCon 2018 talk for more information.

It is possible to use importlib.resources in Python 2.7 and Python 3.4+ through a backport. A guide on migrating from pkg_resources to importlib.resources is available.

Developer Tricks

Python 3.7 has added several features aimed at you as a developer. You have already seen the new breakpoint() built-in. In addition, a few new -X command line options have been added to the Python interpreter.

You can easily get an idea of how much time the imports in your script take, using -X importtime:

$ python3.7 -X importtime my_script.py
import time: self [us] | cumulative | imported package
import time:      2607 |       2607 | _frozen_importlib_external
...
import time:       844 |      28866 |   importlib.resources
import time:       404 |      30434 | plugins

The cumulative column shows the cumulative time of import (in microseconds). In this example, importing plugins took about 0.03 seconds, most of which was spent importing importlib.resources. The self column shows the import time excluding nested imports.

You can now use -X dev to activate “development mode.” The development mode will add certain debug features and runtime checks that are considered too slow to be enabled by default. These include enabling faulthandler to show a traceback on serious crashes, as well as more warnings and debug hooks.

Finally, -X utf8 enables UTF-8 mode. (See PEP 540.) In this mode, UTF-8 will be used for text encoding regardless of the current locale.

Optimizations

Each new release of Python comes with a set of optimizations. In Python 3.7, there are some significant speed-ups, including:

  • There is less overhead in calling many methods in the standard library.
  • Method calls are up to 20% faster in general.
  • The startup time of Python itself is reduced by 10-30%.
  • Importing typing is 7 times faster.

In addition, many more specialized optimizations are included. See this list for a detailed overview.

The upshot of all these optimizations is that Python 3.7 is fast. It is simply the fastest version of CPython released so far.

So, Should I Upgrade?

Let’s start with the simple answer. If you want to try out any of the new features you have seen here, then you do need to be able to use Python 3.7. Using tools such as pyenv or Anaconda makes it easy to have several versions of Python installed side by side. There is no downside to installing Python 3.7 and trying it out.

Now, for the more complicated questions. Should you upgrade your production environment to Python 3.7? Should you make your own project dependent on Python 3.7 to take advantage of the new features?

With the obvious caveat that you should always do thorough testing before upgrading your production environment, there are very few things in Python 3.7 that will break earlier code (async and await becoming keywords is one example though). If you are already using a modern Python, upgrading to 3.7 should be quite smooth. If you want to be a little conservative, you might want to wait for the release of the first maintenance release—Python 3.7.1—tentatively expected some time in July 2018.

Arguing that you should make your project 3.7 only is harder. Many of the new features in Python 3.7 are either available as backports to Python 3.6 (data classes, importlib.resources) or conveniences (faster startup and method calls, easier debugging, and -X options). The latter, you can take advantage of by running Python 3.7 yourself while keeping your code compatible with Python 3.6 (or lower).

The big features that will lock your code to Python 3.7 are __getattr__() on modules, forward references in type hints, and the nanosecond time functions. If you really need any of these, you should go ahead and bump your requirements. Otherwise, your project will probably be more useful to others if it can be run on Python 3.6 for a while longer.

See the Porting to Python 3.7 guide for details to be aware of when upgrading.



BeDjango: Top 6 Django Decorators


What is a Decorator?

A decorator is the name of one of the most popular design patterns used nowadays; many times we use it without knowing that we are using a design pattern. And what's so special about this pattern? As the Python Wiki puts it, a decorator is a way of apparently modifying an object's behavior by enclosing it inside a decorating object with a similar interface. You can get more information about design patterns here.

Why I should use decorators in my web application?

Decorators dynamically alter the functionality of a function, method, or class without requiring subclasses or changes to the decorated class's source code. Thanks to this, our code is cleaner, more readable, and more maintainable (which is no small thing), and we reduce boilerplate by adding functionality to multiple classes with a single method.
A good example of the importance and ease of use of these decorators is the @login_required decorator provided by Django, which you have probably used if you have some experience with our favorite framework. It is just a piece of code that checks whether the user is authenticated and, if not, redirects them to the login URL.

Decorators are used as follows:

from django.contrib.auth.decorators import login_required


@login_required
def my_view(request):
    …

Each time a user tries to access my_view, the code inside login_required will be executed.
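Conceptually, login_required is just a wrapping function like the following simplified sketch (the real Django implementation also handles the login URL, the "next" parameter, and returns an HTTP redirect; the fake request/user classes here are stand-ins for demonstration only):

```python
from functools import wraps

def login_required_sketch(view_func):
    """Simplified sketch of what a login_required-style decorator does."""
    @wraps(view_func)
    def wrapper(request, *args, **kwargs):
        if not request.user.is_authenticated:
            return "redirect to login"  # Django returns an HttpResponseRedirect
        return view_func(request, *args, **kwargs)
    return wrapper

# Tiny stand-ins for Django's request and user objects
class FakeUser:
    def __init__(self, is_authenticated):
        self.is_authenticated = is_authenticated

class FakeRequest:
    def __init__(self, user):
        self.user = user

@login_required_sketch
def my_view(request):
    return "secret page"

print(my_view(FakeRequest(FakeUser(True))))   # secret page
print(my_view(FakeRequest(FakeUser(False))))  # redirect to login
```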

Some of our favorite decorators

In this section we will show you some of the decorators that we find most useful or have used with positive results. Keep in mind that many of these can be customized to suit your needs. For this post we will use the original decorators with links to their sources.

Group Required

Sometimes we need to protect certain views so that only a specific group of users can access them. Instead of checking inside each view whether the user belongs to that group or groups, we can use the following decorator:

from django.contrib.auth.decorators import user_passes_test


def group_required(*group_names):
    """Requires user membership in at least one of the groups passed in."""
    def in_groups(u):
        if u.is_authenticated():
            if bool(u.groups.filter(name__in=group_names)) | u.is_superuser:
                return True
        return False
    return user_passes_test(in_groups)


# The way to use this decorator is:
@group_required('admins', 'seller')
def my_view(request, pk):
    ...

You can get more information about it here

Anonymous required

This decorator is based on Django's login_required decorator but checks the opposite case: that the user is anonymous. Otherwise, the user is redirected to the URL defined in our settings.py. It can be useful when we want to keep logged-in users out of views such as the login or registration view.

def anonymous_required(function=None, redirect_url=None):

   if not redirect_url:
       redirect_url = settings.LOGIN_REDIRECT_URL

   actual_decorator = user_passes_test(
       lambda u: u.is_anonymous(),
       login_url=redirect_url
   )

   if function:
       return actual_decorator(function)
   return actual_decorator


# The way to use this decorator is:
@anonymous_required
def my_view(request, pk):
    ...

You can get more information about it here

Superuser required

This is the same case as when we want to allow certain groups access to a view, but here only superusers can visit it.

from django.core.exceptions import PermissionDenied


def superuser_only(function):
    """Limit view to superusers only."""
    def _inner(request, *args, **kwargs):
        if not request.user.is_superuser:
            raise PermissionDenied
        return function(request, *args, **kwargs)
    return _inner


# The way to use this decorator is:
@superuser_only
def my_view(request):
    ...

You can get more information about it here

Ajax required

This decorator checks whether the request is an AJAX request. It is useful when working with JavaScript frameworks such as jQuery, and a good way to help secure our application.

from django.http import HttpResponseBadRequest


def ajax_required(f):
    """
    AJAX request required decorator
    use it in your views:

    @ajax_required
    def my_view(request):
        ....

    """
    def wrap(request, *args, **kwargs):
        if not request.is_ajax():
            return HttpResponseBadRequest()
        return f(request, *args, **kwargs)

    wrap.__doc__ = f.__doc__
    wrap.__name__ = f.__name__
    return wrap


# The way to use this decorator is:
@ajax_required
def my_view(request):
    ...

You can get more information about it here

Time it

This decorator is very helpful if you need to improve the response time of one of your views, or if you just want to know how long it takes to run.

import time


def timeit(method):

   def timed(*args, **kw):
       ts = time.time()
       result = method(*args, **kw)
       te = time.time()
       print('%r (%r, %r) %2.2f sec' % (method.__name__, args, kw, te - ts))
       return result

   return timed


# The way to use this decorator is:
@timeit
def my_view(request):
    ...

You can get more information about it here

Custom Functionality

The next decorator is just an example of how you can implement permission checks in an easy and fully customizable way.

Imagine you have a blog, shop, or forum where users need a certain number of points before they can write a review; it would be a good way to avoid SPAM, for example. We'll create a decorator that checks that the user is logged in and has more than 10 points before allowing them to write a review; otherwise we raise a Forbidden response.

import functools
import logging

from django.http import HttpResponseForbidden


logger = logging.getLogger(__name__)


def user_can_write_a_review(func):
    """View decorator that checks a user is allowed to write a review.

    In the negative case, the decorator returns Forbidden.
    """
    @functools.wraps(func)
    def wrapper(request, *args, **kwargs):
        if not request.user.is_authenticated() or request.user.points < 10:
            logger.warning(
                'The {} user has tried to write a review, but does not '
                'have enough points to do so'.format(request.user.pk))
            return HttpResponseForbidden()

        return func(request, *args, **kwargs)

    return wrapper

You can get more information by asking me at piglesias@emergya.com or via Twitter @pypiglesias. I really hope you found the post interesting, or at least curious. From BeDjango we would like to encourage you to share your decorators, ideas, or questions, as well as interesting topics for future posts.

Continuum Analytics Blog: Scalable Machine Learning in the Enterprise with Dask


You’ve been hearing the hype for years: machine learning can have a magical, transformative impact on your business, putting key insights into the hands of decision-makers and driving industries forward. But many organizations today still struggle to extract value from their machine learning initiatives. Why? Building & Training Models on Your Laptop is No Longer …
Read more →

The post Scalable Machine Learning in the Enterprise with Dask appeared first on Anaconda.

NumFOCUS: NumFOCUS 2018 Google Summer of Code, Julia Cohort

py.CheckIO: Design Patterns. Part 2


You've probably experienced situations when it's necessary to solve the given task, the deadline is getting closer, but there aren’t any ideas at all? Fortunately, code doesn't have to be written from scratch every time - for many standard situations and frequently occurring problems there have long been developed patterns which make searching for a solution that much easier. In this article you can learn about the two of them - the Observer and Mediator, which will be extremely useful in such projects as the organization of a subscription and distribution system, as well as the exchange of messages between several interlocutors.
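As a small taste of the Observer pattern mentioned here, a minimal Python sketch of a publish/subscribe relationship (class and variable names are illustrative, not from the linked article):

```python
class Publisher:
    """Minimal Observer pattern: registered subscribers get every message."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, message):
        # Notify every registered subscriber of the new message
        for callback in self._subscribers:
            callback(message)

inbox = []
news = Publisher()
news.subscribe(inbox.append)
news.publish("new article posted")
print(inbox)  # ['new article posted']
```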

Python Insider

Python 3.7.0 is now available (and so is 3.6.6)!

On behalf of the Python development community and the Python 3.7 release team, we are pleased to announce the availability of Python 3.7.0. Python 3.7.0 is the newest feature release of the Python language, and it contains many new features and optimizations. You can find Python 3.7.0 here:

Most third-party distributors of Python should be making 3.7.0 packages available soon.

See the What’s New In Python 3.7 document for more information about features included in the 3.7 series.  Detailed information about the changes made in 3.7.0 can be found in its change log.  Maintenance releases for the 3.7 series will follow at regular intervals starting in July of 2018.

We hope you enjoy Python 3.7!

P.S. We are also happy to announce the availability of Python 3.6.6, the next maintenance release of Python 3.6:

Thanks to all of the many volunteers who help make Python Development and these releases possible!  Please consider supporting our efforts by volunteering yourself or through organization contributions to the Python Software Foundation.


Test and Code: 42: Using Automated Tests to Help Teach Python - Trey Hunner


This interview with Trey Hunner discusses his use of automated tests to help teach programming.

Automated testing is a huge part of developing great software, but many new developers don't get exposed to testing for quite a while. This is changing, however.

New ways to teach programming include automated tests from the beginning.

Trey Hunner is one of the PSF directors and a Python and Django team trainer, and he has been using automated tests to help people learn Python.

Special Guest: Trey Hunner.

Sponsored By:

Links:


Python Software Foundation: Dr. Russell Keith Magee, the Python Warrior: Community Service Award Q4 2017 Recipient

Dr. Russell Keith-Magee is a well-known name in the Python community, having developed with Python since the late 1990s, starting with Python version 1.5.

Russell started his career coding with Perl but was confused by its design philosophy of TMTOWTDI ("There’s More Than One Way To Do It”). This ultimately led Russell to Python. With its minimal use of symbols in its syntax - preferring keywords instead - and mandated indentation, Python was “a breath of fresh air”. Russell recalls, “I found myself wanting to use Python for more and more as time went by”. Contrasting Perl’s TMTOWTDI philosophy with Python’s “there should be one - and preferably only one - obvious way to do it” philosophy was a much welcomed change.

Russell was elated to discover Python’s broad range of use cases. As he describes, Python “is used for systems integration, to run websites, for statistical data analysis, to predict astronomical phenomena, for educating freshers, as well as by experienced programmers for serious heavy lifting”. Such breadth of usage gives Python its most incredible aspect -- a diverse user community. “We should always remember that we are a community. Communities depend on people being involved, and giving back when they can.”



Gallant Sir Russell, The Victorious Knight


Like Russell, we at the Python Software Foundation share the love for our Python community. We are delighted to have such selfless and community-minded Pythonistas in our ranks. Therefore the Python Software Foundation is honored to present the 4th Quarter Community Service Award for 2017 to Dr. Russell Keith-Magee:

RESOLVED, that the Python Software Foundation award the Q4 2017 Community Service Award to Russell Keith-Magee for his contributions to Django, for his work on the BeeWare project, and for being an active international speaker at PyCons.

Russell’s adventures with Django


Russell has been a core developer on the Django project since January 2006. Additionally, Russell has been a mentor in the Google Summer of Code for both the Django and BeeWare Projects since 2008.

Russell became a professional Python programmer just when the concept of web frameworks was emerging. He explains, “coming from a desktop UI background, I didn’t understand the fundamental shift that web programming required, or how these frameworks mapped onto those requirements”. When Russell discovered Django, however, this technology just clicked. Within a short period of time he had conjured up a working website. “But more importantly, Django allowed me to understand why the web worked the way it does”.

These were the early days of Django development and Russell rolled up his sleeves to start helping the nascent project. At first he, “dug into a couple of small issues, then some bigger ones - and before long, I’d been offered commit access.”


BeeWare: The IDEs of Python


Russell is also the founder of the BeeWare Project - a project developing tools and libraries for cross-platform native user interfaces in Python. One of the key contributions of this work is expanding the availability of Python onto mobile and browser platforms.

In his quest to build a rich, graphical debugger for Python, Russell found he needed a cross-platform, Python-enabled widget toolkit. It was then that he had a stroke of inspiration, “Why not have one code base but multiple apps?” The idea was to, “support Python natively on iOS and Android, and, at the same time, build a cross-platform UI toolkit that was Python-native.”

Five years into the BeeWare Project, Russell, and his fellow BeeWare apiarists were focusing on getting the BeeWare tools to the point where they would be a viable option for user interface development.

BeeWare is now a “spare time” project for Russell. He wishes to work on it full time and so is looking for ways to secure financial support. “It’s difficult to make a business case for something that is (a) Open Source, and (b) not yet ready for commercial usage.” In his words, “this highlights two existential threats for Python.”

The first of these threats is the growth of mobile and browser development. This “will have a profound impact on the viability of Python as a language.” He rejects the claim that JavaScript is a better language, but admits it has a key advantage: “it’s available in the browser, the most important new platform of the last 20 years”. Russell recommends that “Python should focus more on finding ways to target these new and emerging platforms and plan for where our industry is moving”.

The second threat Russell sees is to the Python ecosystem - and the broader Open Source ecosystem. Open Source supports the existence of thousands of companies, but the development and maintenance of Python and the other open source tools critical to their existence is massively underfunded. PyPI is a classic example of this second sort of problem: it has active commercial usage but limited support from commercial users. This threat will continue to exist in the “absence of ongoing source of funding”.

Russell’s Advice for Beginner Pythonistas


Russell’s numerous contributions to the Python community reflect his core belief in service to the bigger project. For beginning Pythonistas, Russell advises giving back to the community.

“As someone coming into the community, you may not think you have a lot to offer, but you do”.

He thinks there are many “activities around the language like user groups, meetups and conferences, that need people to help organize.” These are as essential as coding for the development of a language and its community. No specific skill is required to jump in, so come one, come all.


Dr. Russell Keith Magee, DjangoCon 2017

These words are an inspiration to us all no matter if we’re beginners or experienced Pythonistas. In short, thank you Russell for your years of service and contributions to the Python community and for those yet to come.

PyCharm: PyCharm 2018.2 EAP 5


We’re happy to announce the fifth early access preview (EAP) release of PyCharm 2018.2 Try it yourself by downloading EAP 5 from our website.

New in PyCharm

A revamped reStructuredText editor

RST Preview - Blog

We’ve made life a little easier for those of you who write reStructuredText, whether to create Sphinx docs or just for a simple README. You can now get a preview of your .rst file while you write it. This comes in addition to our editing support for .rst files, which includes various inspections and navigation features. If you haven’t tried it out yet, try documenting your code today, you know you’ve been putting it off!

If you’re interested, you can read more about this feature in our documentation.

Further Improvements

  • Various issues with our pipenv support were smoothed out. Are you a pipenv user who hasn’t tried the support yet? Get the EAP now and let us know how you like it!
  • Docker: configuring environment variables for a project with a remote interpreter now works as expected
  • Many small inspection issues have been fixed: a false positive with islice arguments, some type issues, and more
  • To see everything that’s been improved in this version, read the release notes

Interested?

Download this EAP from our website. Alternatively, you can use the JetBrains Toolbox App to stay up to date throughout the entire EAP.

If you’re on Ubuntu 16.04 or later, you can use snap to get PyCharm EAP, and stay up to date. You can find the installation instructions on our website.

PyCharm 2018.2 is in development during the EAP phase, therefore not all new features are already available. More features will be added in the coming weeks. As PyCharm 2018.2 is pre-release software, it is not as stable as the release versions. Furthermore, we may decide to change and/or drop certain features as the EAP progresses.

All EAP versions will ship with a built-in EAP license, which means that these versions are free to use for 30 days after the day that they are built. As EAPs are released weekly, you’ll be able to use PyCharm Professional Edition EAP for free for the duration of the EAP program, as long as you upgrade at least once every 30 days.

Eli Bendersky: Slow and fast methods for generating random integers in Python


The other day, while playing with a simple program involving randomness, I noticed something strange. Python's random.randint() function feels quite slow, in comparison to other randomness-generating functions. Since randint() is the canonical answer for "give me a random integer" in Python, I decided to dig deeper to understand what's going on.

This is a brief post that dives into the implementation of the random module, and discusses some alternative methods for generating pseudo-random integers.

First, a basic benchmark (Python 3.6):

$ python3 -m timeit -s 'import random' 'random.random()'
10000000 loops, best of 3: 0.0523 usec per loop
$ python3 -m timeit -s 'import random' 'random.randint(0, 128)'
1000000 loops, best of 3: 1.09 usec per loop

Whoa! It's about 20x more expensive to generate a random integer in the range [0, 128) than to generate a random float in the range [0, 1). That's pretty steep, indeed.

To understand why randint() is so slow, we'll have to dig into the Python source. Let's start with random()[1]. In Lib/random.py, the exported function random is an alias to the random method of the class Random, which inherits this method directly from _Random. This is the C companion defined in Modules/_randommodule.c, and it defines its random method as follows:

static PyObject *
random_random(RandomObject *self, PyObject *Py_UNUSED(ignored))
{
    uint32_t a = genrand_int32(self) >> 5, b = genrand_int32(self) >> 6;
    return PyFloat_FromDouble((a * 67108864.0 + b) * (1.0 / 9007199254740992.0));
}

Where genrand_int32 is defined directly above and implements a step of the Mersenne Twister PRNG. All in all, when we call random.random() in Python, the C function is directly invoked and there's not much extra work done beyond converting the result of genrand_int32 to a floating point number in a line of C.

Now let's take a look at what randint() is up to:

def randint(self, a, b):
    """Return random integer in range [a, b], including both end points."""
    return self.randrange(a, b + 1)

It calls randrange, fair enough. Here it is:

def randrange(self, start, stop=None, step=1, _int=int):
    """Choose a random item from range(start, stop[, step]).

    This fixes the problem with randint() which includes the
    endpoint; in Python this is usually not what you want.
    """
    # This code is a bit messy to make it fast for the
    # common case while still doing adequate error checking.
    istart = _int(start)
    if istart != start:
        raise ValueError("non-integer arg 1 for randrange()")
    if stop is None:
        if istart > 0:
            return self._randbelow(istart)
        raise ValueError("empty range for randrange()")

    # stop argument supplied.
    istop = _int(stop)
    if istop != stop:
        raise ValueError("non-integer stop for randrange()")
    width = istop - istart
    if step == 1 and width > 0:
        return istart + self._randbelow(width)
    if step == 1:
        raise ValueError("empty range for randrange() (%d, %d, %d)" % (istart, istop, width))

    # Non-unit step argument supplied.
    istep = _int(step)
    if istep != step:
        raise ValueError("non-integer step for randrange()")
    if istep > 0:
        n = (width + istep - 1) // istep
    elif istep < 0:
        n = (width + istep + 1) // istep
    else:
        raise ValueError("zero step for randrange()")

    if n <= 0:
        raise ValueError("empty range for randrange()")

    return istart + istep * self._randbelow(n)

That's quite a bit of case checking and setting up parameters before we get to the next level. There are a couple of fast-path cases (for example, when the stop parameter is not supplied, this function will be a bit faster), but overall after a bunch of checking we get to call the _randbelow() method.

By default, _randbelow() gets mapped to _randbelow_with_getrandbits():

def _randbelow_with_getrandbits(self, n):
    "Return a random int in the range [0,n).  Raises ValueError if n==0."
    getrandbits = self.getrandbits
    k = n.bit_length()  # don't use (n-1) here because n can be 1
    r = getrandbits(k)  # 0 <= r < 2**k
    while r >= n:
        r = getrandbits(k)
    return r

Note that it does a couple more computations and can end up invoking getrandbits() multiple times (esp. if n is far from a power of two). getrandbits() is in C, and while it also ends up invoking the PRNG genrand_int32(), it's somewhat heavier than random() and runs twice as slow.

In other words, there's a lot of Python and C code in the way to invoke the same underlying C function. Since Python is bytecode-interpreted, all of this ends up being quite a bit slower than simply calling the C function directly. A death by a thousand cuts. To be fair, randint() is also more flexible in that it can generate pseudo-random numbers of any size; that said, it's not very common to need huge pseudo-random numbers, and our tests were with small numbers anyway.
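The gap is easy to reproduce from Python itself with a quick timeit sketch; the absolute numbers depend on the machine, but the ordering should hold on any CPython:

```python
import timeit

# Reproduce the shell benchmarks above from within Python. Timings
# vary by machine, but randint() should be consistently slower,
# since it layers Python-level checks on top of the same C PRNG.
t_random = timeit.timeit('random.random()', setup='import random', number=100000)
t_randint = timeit.timeit('random.randint(0, 128)', setup='import random', number=100000)
print(t_randint > t_random)
```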

Here's a couple of experiments to help us test this hypothesis. First, let's try to hit the fast-path we've seen above in randrange, by calling randrange without a stop parameter:

$ python3 -m timeit -s 'import random' 'random.randrange(1)'
1000000 loops, best of 3: 0.784 usec per loop

As expected, the run-time is somewhat better than randint. Another experiment is to rerun the comparison in PyPy, which is a JIT compiler that should end up tracing through the Python code and generating efficient machine code that strips a lot of abstractions.

$ pypy -m timeit -s 'import random' 'random.random()'
100000000 loops, best of 3: 0.0139 usec per loop
$ pypy -m timeit -s 'import random' 'random.randint(0, 128)'
100000000 loops, best of 3: 0.0168 usec per loop

As expected, the difference between these calls in PyPy is small.

Faster methods for generating pseudo-random integers

So randint() turns out to be very slow. In most cases, no one cares; but just occasionally, we need many random numbers - so what is there to do?

One tried and true trick is just using random.random() instead, multiplying by our integer limit:

$ python3 -m timeit -s 'import random' 'int(128 * random.random())'
10000000 loops, best of 3: 0.193 usec per loop

This gives us pseudo-random integers in the range [0, 128), much faster. One word of caution: Python represents its floats in double-precision, with 53 bits of accuracy. When the limit is above 53 bits, the numbers we'll be getting using this method are not quite random - bits will be missing. This is rarely a problem because we don't usually need such huge integers, but definitely something to keep in mind [2].
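The 53-bit caveat is easy to demonstrate: random() returns m/2**53 for some integer m, so multiplying by 2**54 always yields 2*m exactly - an even number (this is the experiment footnote [2] suggests):

```python
import random

# random() returns m / 2**53 for an integer m < 2**53, so scaling
# by 2**54 gives exactly 2*m: the low bit of the result is never set.
samples = [int(2**54 * random.random()) for _ in range(1000)]
print(all(n % 2 == 0 for n in samples))  # True: only even numbers
```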

Another quick way to generate pseudo-random integers is to use getrandbits() directly:

$ python3 -m timeit -s 'import random' 'random.getrandbits(7)'
10000000 loops, best of 3: 0.102 usec per loop

This method is fast but limited - it only supports ranges that are powers of two. If we want to limit the range we can't just compute a modulo - this will skew the distribution; rather we'll have to use a loop similarly to what _randbelow_with_getrandbits() does in the sample above. This will slow things down, of course.
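For a limit that isn't a power of two, the rejection loop can be sketched like this - a simplified standalone version of what _randbelow_with_getrandbits() does, not a drop-in for the stdlib:

```python
import random

def randbelow(n):
    """Uniform integer in [0, n) built on getrandbits().

    Draw k = n.bit_length() bits and retry while the value is >= n.
    A plain getrandbits(k) % n would skew the distribution whenever
    2**k is not a multiple of n; rejection keeps it uniform.
    """
    k = n.bit_length()
    r = random.getrandbits(k)
    while r >= n:
        r = random.getrandbits(k)
    return r

print(all(0 <= randbelow(100) < 100 for _ in range(1000)))  # True
```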

Finally, we can turn away from the random module altogether, and use Numpy:

$ python3 -m timeit -s 'import numpy.random' 'numpy.random.randint(128)'
1000000 loops, best of 3: 1.21 usec per loop

Surprisingly, this is slow; that's because Numpy isn't great for working with single datums - it likes to amortize costs over large arrays created / manipulated in C. To see this in action, let's see how long it takes to generate 100 random integers:

$ python3 -m timeit -s 'import numpy.random' 'numpy.random.randint(128, size=100)'
1000000 loops, best of 3: 1.91 usec per loop

Only 60% slower than generating a single one! With 0.019 usec per integer, this is the fastest method by far - 3 times faster than calling random.random(). The reason this method is so fast is because the Python call overheads are amortized over all generated integers, and deep inside Numpy runs an efficient C loop to generate them.

To conclude, use Numpy if you want to generate large numbers of random ints; if you're just generating one-at-a-time, it may not be as useful (but then how much do you care about performance, really?)

Conclusion: performance vs. abstraction

In programming, performance and abstraction/flexibility are often at odds. By making a certain function more flexible, we inevitably make it slower - and randint() is a great practical example of this problem. 9 times out of 10 we don't care about the performance of these functions, but when we do, it's useful to know what's going on and how to improve the situation.

In a way, pure Python code itself is one of the slowest abstractions we encounter, since every line gets translated to a bunch of bytecode that has to be interpreted at run-time.

To mitigate these effects, Python programmers who care about performance have many techniques at their disposal. Libraries like Numpy carefully move as much compute as possible to underlying C code; PyPy is a JIT compiler that can speed up most pure Python code sequences considerably. Numba is somewhere in between, while Cython lets us re-write chosen functions in a restricted subset of Python that can be efficiently compiled to machine code.


[1]From this point on, file names point to source files in the CPython repository. Feel free to follow along on your machine or on Github.
[2]

As an experiment, try to generate pseudo-random integers up to 2^54 using this technique. You'll notice that only even numbers are generated!

More generally, the closer the multiplier is to machine precision, the less random the result becomes. Knuth has an interesting discussion of this in volume 2 of TAOCP - it has to do with unbalanced rounding that has to happen every time a precision-limited float is multiplied by an integer. That said, if the multiplier is much smaller than the precision, we'll be fine; for generating numbers up to 2^40, say, the bad effects on the distribution will be negligible.

Anarcat: My free software activities, June 2018


It's been a while since I've done a report here! Since I need to do one for LTS, I figured I would also catch you up with the work I've done in the last three months. Maybe I'll make that my new process: quarterly reports would reduce the overhead on my side with little loss for you, my precious (few? many?) readers.

Debian Long Term Support (LTS)

This is my monthly Debian LTS report.

I omitted doing a report in May because I didn't spend a significant number of hours, so this also covers a handful of hours of work in May.

May and June were strange months to work on LTS, as we made the transition between wheezy and jessie. I worked on all three LTS releases now, and I must have been absent from the last transition because I felt this one was a little confusing to go through. Maybe it's because I was on frontdesk duty during that time...

For a week or two it was unclear if we should have worked on wheezy, jessie, or both, or even how to work on either. I documented which packages needed an update from wheezy to jessie and proposed a process for the transition period. This generated a good discussion, but I am not sure we resolved the problems we had this time around in the long term. I also sent patches to the security team in the hope they would land in jessie before it turns into LTS, but most of those ended up being postponed to LTS.

Most of my work this month was spent actually working on porting the Mercurial fixes from wheezy to jessie. Technically, the patches were ported from upstream 4.3 and led to some pretty interesting results in the test suite, which fails to build from source non-reproducibly. Because I couldn't figure out how to fix this in the allotted time, I uploaded the package to my usual test location in the hope someone else picks it up. The test package fixes 6 issues (CVE-2018-1000132, CVE-2017-9462, CVE-2017-17458 and three issues without a CVE).

I also worked on cups in a similar way, sending a test package to the security team for 2 issues (CVE-2017-18190, CVE-2017-18248). Same for Dokuwiki, where I sent a patch for a single issue (CVE-2017-18123). Those have yet to be published, however, and I will hopefully wrap that up in July.

Because I was looking for work, I ended up doing meta-work as well. I made a prototype that would use the embedded-code-copies file to populate data/CVE/list with related packages, as a way to address a problem we have in LTS triage, where packages that were renamed between suites do not get correctly added to the tracker. It ended up being rejected because the changes were too invasive, but it led to Brian May suggesting another approach; we'll see where that goes.

I've also looked at splitting up that dreaded data/CVE/list but my results were negative: it looks like git is very efficient at splitting things up. While a split up list might be easier on editors, it would be a massive change and was eventually refused by the security team.

Other free software work

With my last report dating back to February, this will naturally be a little imprecise, as three months have passed. But let's see...

LWN

I wrote eight articles in the last three months, for an average of nearly three monthly articles. I was aiming at an average of one or two a week, so I didn't reach my goal. My last article about Kubecon generated a lot of feedback, probably the best I have ever received. It seems I struck a chord for a lot of people, so that certainly feels nice.

Linkchecker

Usual maintenance work, but we at last finally got access to the Linkchecker organization on GitHub, which meant a bit of reorganizing. The only bit missing now is the PyPI namespace, but that should also come soon. The code of conduct and contribution guides were finally merged after we clarified project membership. This gives us issue templates which should help us deal with the constant flow of issues that come in every day.

The biggest concern I have with the project now is the C parser and the outdated Windows executable. The latter has been removed from the website so hopefully Windows users won't report old bugs (although that means we won't gain new Windows users at all) and the former might be fixed by a port to BeautifulSoup.

Email over SSH

I did a lot of work to switch away from SMTP and IMAP to synchronise my workstation and laptops with my mailserver. Having the privilege of running my own server has its perks: I have SSH access to my mail spool, which brings the opportunity for interesting optimizations.

The first I have done is called rsendmail. Inspired by work from Don Armstrong and David Bremner, rsendmail is a Python program I wrote from scratch to deliver email over a pipe, securely. I do not trust the sendmail command: its behavior can vary a lot between platforms (e.g. allow flushing the mailqueue or printing it) and I wanted to reduce the attack surface. It works with another program I wrote called sshsendmail which connects to it over a pipe. It integrates well into "dumb" MTAs like nullmailer, but I use it with the popular Postfix as well, without problems.

The second is to switch from OfflineIMAP to Syncmaildir (SMD). The latter allows synchronization over SSH only. The migration was a little difficult but I very much like the results: SMD is faster than OfflineIMAP and works transparently in the background.

I really like to use SSH for email. I used to have my email password stored all over the place: in my Postfix config, in my email clients' memory, it was a mess. With the new configuration, things just work unattended and email feels like a solved problem, at least the synchronization aspects of it.

Emacs

As often happens, I've done some work on my Emacs configuration. I switched to a new Solarized theme, the bbatsov version which has support for a light and dark mode and generally better colors. I had problems with the cursor which are unfortunately unfixed.

I learned about and used the Emacs iPython Notebook project (EIN) and filed a feature request to replicate the "restart and run" behavior of the web interface. Otherwise it's really nice to have a decent editor to work on Python notebooks, and I have used this to work on the terminal emulators series and the related source code.

I have also tried to complete my conversion to Magit, a pretty nice wrapper around git for Emacs. Some of my usual git shortcuts have good replacements, but not all. For example, those are equivalent:

  • vc-annotate (C-x C-v g): magit-blame
  • vc-diff (C-x C-v =): magit-diff-buffer-file

Those do not have a direct equivalent:

  • vc-next-action (C-x C-q, or F6): anarcat/magic-commit-buffer, see below
  • vc-git-grep (F8): no replacement

I wrote my own replacement for "diff and commit this file" as the following function:

(defun anarcat/magit-commit-buffer ()
  "commit the changes in the current buffer on the fly

This is different than `magit-commit' because it calls `git
commit' without going through the staging area AKA index
first. This is a replacement for `vc-next-action'.

Tip: setting the git configuration parameter `commit.verbose' to
2 will show the diff in the changelog buffer for review. See
`git-config(1)' for more information.

An alternative implementation was attempted with `magit-commit':

  (let ((magit-commit-ask-to-stage nil))
    (magit-commit (list \"commit\" \"--\"
                        (file-relative-name buffer-file-name)))))

But it seems `magit-commit' asserts that we want to stage content
and will fail with: `(user-error \"Nothing staged\")'. This is
why this function calls `magit-run-git-with-editor' directly
instead."
  (interactive)
  (magit-run-git-with-editor (list "commit""--" (file-relative-name buffer-file-name))))

It's not very pretty, but it works... Mostly. Sometimes the magit-diff buffer becomes out of sync, but the --verbose output in the commitlog buffer still works.

I've also looked at git-annex integration. The magit-annex package did not work well for me: the file listing is really too slow. So I found the git-annex.el package, but did not try it out yet.

While working on all of this, I fell into a different rabbit hole: I found it inconvenient to "pastebin" stuff from Emacs, as it would involve selecting a region, piping to pastebinit and copy-pasting the URL found in the *Messages* buffer. So I wrote this first prototype:

(defun pastebinit (begin end)
  "pass the region to pastebinit and add output to killring

TODO: prompt for possible pastebins (pastebinit -l) with prefix arg

Note that there's a `nopaste.el' project which already does this,
which we should use instead.
"
  (interactive "r")
  (message "use nopaste.el instead")
  (let ((proc (make-process :filter #'pastebinit--handle
                            :command '("pastebinit")
                            :connection-type 'pipe
                            :buffer nil
                            :name "pastebinit")))
    (process-send-region proc begin end)
    (process-send-eof proc)))

(defun pastebinit--handle (proc string)
  "handle output from pastebinit asynchronously"
  (let ((url (car (split-string string))))
    (kill-new url)
    (message "paste uploaded and URL added to kill ring: %s" url)))

It was my first foray into asynchronous process operations in Emacs: difficult and confusing, but it mostly worked. Those who know me know what's coming next, however: I found not only one, but two libraries for pastebins in Emacs: nopaste and (after patching nopaste to add asynchronous support and customize support, of course) debpaste.el. I'm not sure where that will go: there is a proposal to add nopaste in Debian that was discussed a while back and I made a detailed report there.

Monkeysign

I made a minor release of Monkeysign to cover for CVE-2018-12020 and its GPG sigspoof vulnerability. I am not sure where to take this project anymore, and I opened a discussion to possibly retire the project completely. Feedback welcome.

ikiwiki

I wrote a new ikiwiki plugin called bootstrap to fix table markup to match what the Bootstrap theme expects. This was particularly important for the previous blog post which uses tables a lot. This was surprisingly easy and might be useful to tweak other stuff in the theme.

Random stuff

  • I wrote up a review of security of APT packages when compared with the TUF project, in TufDerivedImprovements
  • contributed to about 20 different repositories on GitHub, too numerous to list here