Channel: Planet Python

PyCharm: PyCharm 2020.1 EAP 7


We have a new Early Access Program (EAP) version of PyCharm that can now be downloaded from our website.

This EAP has a lot of important bug fixes, some new features, and a few usability improvements, all of which make PyCharm that much better to work with.

New in PyCharm

Command-line docker run options

When you run your application for the very first time, PyCharm automatically creates a temporary Run/Debug configuration. You can modify it to specify or alter the default parameters and save it as a permanent Run/Debug configuration. Now the Docker container settings in the Python Run/Debug configurations are aligned with the Docker run options.

Runconfig settings

And the dialog for the file settings has been cleaned up to remove any redundant settings.
docker file settings

Better UX for configuring project interpreter

If your project was previously configured with an interpreter that is not currently available, PyCharm shows a warning and provides two options: select an interpreter that fits the previous configuration or configure another Python interpreter:

interpreter message

Note: when you open a project configured for an outdated version of the Python interpreter, the following message appears:

end of life notification

Fixed in this Version

  • PyCharm resolves all imports properly and correctly recognizes all parts of the import and namespace packages.
  • The ‘Sort imported names in “from” imports’ setting is now respected when adding imports via the inspection popup, so imports are sorted by name rather than by the order in which you added them.
  • For PyCharm Professional users, the SQL database issues with “Preview update” have been fixed: it now works when the table is introspected and when using an alias. The issue with importing CSV/TXT into an SQL database without importing an id value has also been resolved.
  • There is all this and more. You can find the details in the release notes.

Interested?

Download this EAP from our website. Alternatively, you can use the JetBrains Toolbox App to stay up to date throughout the entire EAP.
If you’re on Ubuntu 16.04 or later, you can use snap to get PyCharm EAP and stay up to date. You can find the installation instructions on our website.


Moshe Zadka: Or else:


This was originally sent to my newsletter. I send one e-mail, always about Python, every other Sunday. If this blog post interests you, consider subscribing.

The underappreciated else keyword in Python has three distinct uses.

if/else

On an if statement, else will contain code that runs if the condition is false.

if anonymize:
    print("Hello world")
else:
    print("Hello, name")

This is probably the least surprising use.

loop/else

The easiest to explain is while/else: it works the same as if/else, and runs when the condition is false.

However, it does not run if the loop was broken out of using break or an exception: it serves as something that runs on normal loop termination.
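A short sketch of while/else in action (a hypothetical helper, not from the original post): trial division, where the else clause fires only when the loop ends without finding a divisor:

```python
def classify(n):
    """Return "prime" or "composite" for n >= 2, using while/else."""
    divisor = 2
    while divisor * divisor <= n:
        if n % divisor == 0:
            break  # found a divisor: the else clause is skipped
        divisor += 1
    else:
        return "prime"  # loop ended normally: no divisor found
    return "composite"
```

For example, `classify(17)` returns `"prime"` and `classify(15)` returns `"composite"`.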

for/else functions in the same way: it runs on normal loop termination, and not if the loop was broken out of using a break.

For example, searching for an odd element in a list:

for x in numbers:
    if x % 2 == 1:
        print("Found", x)
        break
else:
    print("No odd found")

This is a powerful way to avoid sentinel values.
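For contrast, here is a sketch of the same search without for/else; a sentinel value (None, in this hypothetical helper) has to stand in for "not found":

```python
def find_odd(numbers):
    found = None  # sentinel meaning "no odd element yet"
    for x in numbers:
        if x % 2 == 1:
            found = x
            break
    if found is None:
        return "No odd found"
    return found
```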

try/except/else

When writing code that might raise an exception, we want to be able to catch it -- but we want to avoid catching unanticipated exceptions. This means we want to protect as little code with try as possible, but still have some code that runs only in the normal path.

try:
    before, after = things
except ValueError:
    part1 = things[0]
    part2 = 0
    after = 0
else:
    part1, part2 = before

This means that if things does not have two items, this is a valid case we can recover from. However, if it does have two items, the first one must also have two items. If this is not the case, this snippet will raise ValueError.
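Wrapped in a hypothetical helper (the function name and return shape are mine, not the original post's), the point becomes concrete: the unpacking of before lives in else so that a ValueError from a malformed before still propagates, instead of being swallowed by the except clause:

```python
def split(things):
    try:
        before, after = things
    except ValueError:
        # things did not have exactly two items: a valid, recoverable case
        return things[0], 0, 0
    else:
        # only runs when the try body succeeded; a bad `before` still raises
        part1, part2 = before
        return part1, part2, after
```

So `split([("a", "b"), "rest"])` gives `("a", "b", "rest")`, while `split(["only"])` takes the recovery path and gives `("only", 0, 0)`.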

"Coder's Cat": Encapsulation in Python


Encapsulation is an essential aspect of Object Oriented Programming.

Let’s explain encapsulation in plain words: information hiding. This means hiding the internal interface and attributes from the external world.

The benefit of information hiding is reducing system complexity and increasing robustness.

Why? Because encapsulation limits the interdependencies between software components. Suppose we create a module. Our users can only interact with it through its public APIs; they don’t care about the internals of the module. Even when the internal implementation details change, the user’s code doesn’t need a corresponding change.

To implement encapsulation, we need to learn how to define and use private attributes and private functions.

Enough theory; let’s talk about how to do this in Python.

Python is an interpreted programming language and implements weak encapsulation. Weak encapsulation means it is performed by convention rather than being enforced by the language. So there are some differences with Java or C++.

Protected attribute and method

If you have read some Python code, you will often find attribute names prefixed with an underscore. Let’s write a simple class:

class Base(object):

    def __init__(self):
        self.name = "hello"
        self._protected_name = "hello_again"

    def _protected_print(self):
        print("called _protected_print")

b = Base()
print(b.name)
print(b._protected_name)

b._protected_name = "new name"
print(b._protected_name)

b._protected_print()

The output will be:

hello
hello_again
new name
called _protected_print

From the result, an attribute or method with a single leading underscore behaves the same as a normal one.

So, why do we need to add a leading underscore to an attribute?

The leading underscore is a warning for developers: please be careful with this attribute or method, and don’t use it outside of the declaring class!

pylint will report this kind of code smell with a protected-access warning (W0212).

Another benefit of the leading underscore is that it keeps internal functions from being pulled in when the module is wildcard-imported. Let’s have a look at this code:

# foo module: foo.py
def func_a():
    print("func_a called!")

def _func_b():
    print("func_b called!")

Then if we use wildcard import in another part of code:

from foo import *

func_a()
_func_b()

We will encounter an error, because the wildcard import does not bring in names with a leading underscore:

    func_a called!
    Traceback (most recent call last):
      ...
    NameError: name '_func_b' is not defined

By the way, wildcard imports are another code smell in Python, and we should avoid them in practice.

Private attribute and method

In traditional OOP languages, why can’t private attributes and methods be accessed from a derived class?

Because it helps with information hiding. Suppose the parent class declares an attribute named mood, and a derived class redeclares another attribute with the same name. The new attribute overrides the parent’s and is likely to introduce a bug.

So, how to use the private attribute in Python?

The answer is to add a double underscore prefix to the attribute or method name. Let’s run this code snippet:

class Base(object):

    def __private(self):
        print("private value in Base")

    def _protected(self):
        print("protected value in Base")

    def public(self):
        print("public value in Base")
        self.__private()
        self._protected()

class Derived(Base):
    def __private(self):
        print("derived private")

    def _protected(self):
        print("derived protected")

d = Derived()
d.public()

The output will be:

public value in Base
private value in Base
derived protected

We call the public method on a derived object, which invokes the public method defined in the Base class. Note that because __private is a private method, only the class that defines it can call it, so there is no naming conflict between the two __private methods.

If we add another line of code:

d.__private()

It will trigger an AttributeError:

    AttributeError: 'Derived' object has no attribute '__private'

Why?

Let’s print all the attributes of the object; we find a method named _Base__private (and, from the derived class, _Derived__private):

    >>> [name for name in dir(d) if 'private' in name]
    ['_Base__private', '_Derived__private']

This is the name mangling that the Python interpreter applies: because the class name is prefixed onto the attribute name, private methods are carefully protected from being overridden in a derived class.

Again, this means we can use d._Base__private() to call the private method from outside. Remember, privacy is not enforced!
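To see the mangling directly, here is a minimal sketch mirroring the Base/Derived classes above:

```python
class Base(object):
    def __private(self):
        return "private value in Base"

class Derived(Base):
    def __private(self):
        return "derived private"

d = Derived()
# Inside each class body, __private was rewritten by the interpreter,
# so both mangled names coexist and neither overrides the other:
mangled = [name for name in dir(d) if "__private" in name]
assert mangled == ["_Base__private", "_Derived__private"]
assert d._Base__private() == "private value in Base"
```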

The post Encapsulation in Python appeared first on CodersCat.

NumFOCUS: PyData COVID-19 Response


The safety and well-being of our community are extremely important to us. We have therefore decided to postpone all PyData conferences scheduled to take place through the end of June: PyData Miami, PyData London, and PyData Amsterdam. We have been closely monitoring the situation and believe this is the best action to take based on the […]

The post PyData COVID-19 Response appeared first on NumFOCUS.

Python Anywhere: How to use shared in-browser consoles to cooperate while working remotely.


One of the challenges of remote work is when you need to work together on one thing.

Our in-browser consoles are one of the core features of our service. Almost since the beginning, PythonAnywhere has been able to share consoles -- you enter the name of another user or an email address, and they get an email telling them how to log in and view your Python (or Bash, or IPython) console. If you use an email address, the person you invite doesn't have to be a registered PythonAnywhere user.

When you open a console, there is a button on the top right that says "Share with others"...

...which pops up a dialog box when you click on it.

You can invite one or many people; some of them can have full access, some read-only.

In a shared console, you share not only what you see but also the cursor -- the people it's shared with can also type into it. That can be prevented by sharing your console in "read-only" mode. You can later revoke the sharing using the same button and dialog box.

Python Bytes: #172 Floating high above the web with Helium

Sponsored by DigitalOcean: pythonbytes.fm/digitalocean

Michael #1: Python in Production (Hynek) -- https://hynek.me/articles/python-in-production/

  • Hynek feels a key part is missing from the public Python discourse and would like to help change that.
  • He was listening to a podcast (https://runninginproduction.com/podcast/10-scholarpack-runs-10-percent-of-the-uks-primary-schools-and-gets-huge-traffic) about running Python services in production.
  • He disagreed with some of the choices they made, which acutely reminded him of what he's been missing from the public Python discourse in recent years.
  • And yet, despite the fact that the details weren't relevant to him, the mindsets, thought processes, and stories around it captivated him, and he happily listened to it on his vacation.
  • Python conferences used to be a lot more like this: startups and established companies alike talked about running Python in production, lessons learned, and so on (Instagram, and to a certain degree Spotify, being notable exceptions).
  • An offer: he would like to encourage people who do interesting things with Python to run websites or web and network services to tell us about it at PyCons, meetups, and in blogs.
  • Dan Bader and Michael covered this back on Talk Python, episode 215 (https://talkpython.fm/episodes/show/215/the-software-powering-talk-python-courses-and-podcast).

Brian #2: How to cheat at unit tests with pytest and Black (https://simonwillison.net/2020/Feb/11/cheating-at-unit-tests-pytest-black/)

  • Simon Willison
  • Premise: "In pure test-driven development you write the tests first, and don't start on the implementation until you've watched them fail."
  • Too slow, so... "cheat":
    • Write a pytest test that calls the function you are working on and compares the return value to something obviously wrong.
    • When it fails, copy the actual output and paste it into your test.
    • Now it should pass.
    • Run black to reformat the huge return value into something manageable.
  • Brian's comments:
    • That's turning exploratory and manual testing into automated regression tests, not cheating.
    • There is no "pure test-driven development"; we still can't agree on what a unit is or whether mocks are good or evil.

Michael #3: Goodbye Microservices: From 100s of problem children to 1 superstar (https://segment.com/blog/goodbye-microservices/)

  • Retrospective by Alexandra Noonan.
  • JavaScript, but the lessons are cross-language.
  • Microservices is the architecture du jour.
  • Segment adopted it as a best practice early on (https://segment.com/blog/why-microservices/), which served them well in some cases and, as you'll soon learn, not so well in others.
  • Microservices is a service-oriented software architecture in which server-side applications are constructed by combining many single-purpose, low-footprint network services.
  • Touted benefits are improved modularity, reduced testing burden, better functional composition, environmental isolation, and development team autonomy.
  • Instead of enabling them to move faster, the small team found themselves mired in exploding complexity. Essential benefits of this architecture became burdens. As their velocity plummeted, their defect rate exploded.
  • Her post is the story of how they took a step back and embraced an approach that aligned well with their product requirements and the needs of the team.

Brian #4: Helium (https://github.com/mherrmann/helium)

Michael #5: uncertainties package (https://pythonhosted.org/uncertainties/)

  • From Tim Head on an upcoming Talk Python Binder episode.
  • Do you know how uncertainty flows through calculations?
  • Example:

        Jane needs to calculate the volume of her pool, so that she knows how
        much water she'll need to fill it. She measures the length, width,
        and height:

            length L = 5.56 +/- 0.14 meters = 5.56 m +/- 2.5%
            width  W = 3.12 +/- 0.08 meters = 3.12 m +/- 2.6%
            depth  D = 2.94 +/- 0.11 meters = 2.94 m +/- 3.7%

    One can find the percentage uncertainty in the result by adding together
    the percentage uncertainties in each individual measurement:

            percentage uncertainty in volume
                = (percentage uncertainty in L)
                + (percentage uncertainty in W)
                + (percentage uncertainty in D)
                = 2.5% + 2.6% + 3.7%
                = 8.8%

  • We don't want to deal with these manually, so we use the uncertainties package.
  • Example of using the library:

        >>> from uncertainties import ufloat
        >>> from uncertainties.umath import sin
        >>> x = ufloat(1, 0.1)  # x = 1+/-0.1
        >>> print(2*x)
        2.00+/-0.20
        >>> sin(2*x)
        0.9092974268256817+/-0.08322936730942848

Brian #6: Personalize your python prompt (https://arpitbhayani.me/blogs/python-prompts)

  • Arpit Bhayani
  • Those three >>> in the interactive Python prompt: you can muck with them by changing sys.ps1.
  • Fun.
  • But you can also implement dynamic behavior by creating a class and putting code in its __str__ method. Very clever.
  • Note to self: task for the day: reproduce the Windows command prompt with a directory listing and slashes in the other direction.

Extras:

Michael:

  • Now that Python for Absolute Beginners is out, starting on a new course: Hybrid Data-Driven + CMS web apps.

Joke: A Python Editor Limerick (via Alexander A.)

CODING ENVIRONMENT, IN THREE PARTS:

To this day, some prefer BBEdit.
VSCode is now getting some credit.
Vim and Emacs are fine;
so are Atom and Sublime.
Doesn't matter much, if you don't let it.

But wait! Let's not forget IDEs!
Using PyCharm sure is a breeze!
Komodo, Eclipse, and IDEA;
CLion is my panacea,
and XCode leaves me at ease.

But Jupyter Notebook is also legit!
Data scientists must prefer it.
In the browser, you code;
results are then showed.
But good luck when you try to use git.

Ned Batchelder: Functional strategies in Python


I got into a debate about Python’s support for functional programming (FP) with a friend. One of the challenging parts was listening to him say, “Python is broken” a number of times.

Python is not broken. It’s just not a great language for writing pure functional programs. Python seemed broken to my friend in exactly the same way that a hammer seems broken to someone trying to turn a screw with it.

I understand his frustration. Once you have fully embraced the FP mindset, it is difficult to understand why people would write programs any other way.

I have not fully embraced the FP mindset. But that doesn’t mean that I can’t apply some FP lessons to my Python programs.

In discussions about how FP and Python relate, I think too much attention is paid to the tactics. For example, some people say, “no need for map/­filter/­lambda, use list comprehensions.” Not only does this put off FP people because they’re being told to abandon the tools they are used to, but it gives the impression that list com­pre­hensions are somehow at odds with FP constructs, or are exact replacements.

Rather than focus on the tactics, the important ideas to take from FP are strategies, including:

  • Write small functions with no side-effects
  • Don’t change existing data, make new data
  • Combine functions to make larger functions

These strategies all lead to modularized code, free from mysterious action at a distance. The code is easier to reason about, debug, and extend.
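As a small, hypothetical illustration of those strategies in Python:

```python
def only_even(nums):
    # Small function, no side effects: builds new data, leaves nums alone.
    return [n for n in nums if n % 2 == 0]

def doubled(nums):
    return [n * 2 for n in nums]

def doubled_evens(nums):
    # Combine small functions to make a larger one.
    return doubled(only_even(nums))
```

Each piece is trivially testable on its own, and the input list is never mutated: `doubled_evens([1, 2, 3, 4])` gives `[4, 8]` while the original list is unchanged.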

Of course, languages that are built from the ground up with these ideas in mind will have great tools to support them. They have tactics like:

  • Immutable data structures
  • Rich libraries of higher-order functions
  • Good support for recursion

Functional languages like Scheme, Clojure, Haskell, and Scala have these tools built-in. They are of course going to be way better for writing Functional programs than Python is.

FP people look at Python, see none of these tools, and conclude that Python can’t be used for functional programming. As I said before, Python is not a great language for writing purely functional programs. But it’s not a lost cause.

Even without those FP tools in Python, we can keep the FP strategies in mind. Although list comprehensions are presented as the alternative to FP tools, they help with the FP strategies, because they force you to make new data instead of mutating existing data.

If other FP professionals are like my friend, they are probably saying to themselves, “Ned, you just don’t get it.” Perhaps that is true, how would I know? That doesn’t mean I can’t improve my Python programs by thinking Functionally. I’m only just dipping my toes in the water so far, but I want to do more.

For more thoughts about this:

  • Gary Bernhardt: Boundaries, a PyCon talk that discusses Functional Core and Imperative Shell.
  • If you want more Functional tools, there are third-party Python packages like:
    • pyrsistent, providing immutable data structures
    • pydash, providing functional tools
    • fnc, providing functional tools

Weekly Python StackOverflow Report: (ccxix) stackoverflow python report


Quansight Labs Blog: Documentation as a way to build Community


As a long time user and participant in open source communities, I've always known that documentation is far from being a solved problem. At least, that's the impression we get from many developers: "writing docs is boring"; "it's a chore, nobody likes to do it". I have come to realize I'm one of those rare people who likes to write both code and documentation.

Nobody will argue against documentation. For an open-source software project, documentation is clearly the public face of the project. The docs influence how people interact with the software and with the community. They set the tone about inclusiveness, how people communicate, and what users and contributors can do. Looking at the results of a “NumPy Tutorial” search on any search engine also gives an idea of the demand for this kind of content - it is even possible to find documentation about how to read the NumPy documentation!

I started working at Quansight in January, doing work related to the NumPy CZI grant. As a former professor of mathematics, this seemed like an interesting project, both because of its potential impact on the NumPy (and larger) community and because of its relevance to me, as I love writing educational material and documentation. Having official high-level documentation written with up-to-date content and techniques will certainly mean more users (and developers/contributors) get involved in the NumPy community.

So, if everybody agrees on its importance, why is it so hard to write good documentation?


Talk Python to Me: #255 Talking to cars with Python

Modern cars have become mobile computer systems with many small computers running millions of lines of code. On this episode, we plug a little Python into those data streams.

You'll meet Shea Newton, a Python developer who has worked on autonomous cars and is currently at ActiveState.

Links from the show:

  • Shea on Twitter: https://twitter.com/shnewto/
  • Video presentation of PDX talk: https://www.youtube.com/watch?v=r1QgGO23ob4
  • Shea's source for PDX Python talk: https://github.com/shnewto/can-we-talk
  • DonkeyCar: https://www.donkeycar.com/
  • Roomba Programming: https://github.com/NickWaterton/Roomba980-Python

Sponsors:

  • Datadog: https://talkpython.fm/datadog
  • Clubhouse: https://talkpython.fm/clubhouse
  • Talk Python Training: https://talkpython.fm/training

Reuven Lerner: Announcing: Free, weekly “Python for non-programmers” workshop


This is a tough time for the world. Wherever you live, you have likely been affected by covid-19, the coronavirus that has been making its way to every country, city, and town.

Many countries, companies, and individuals are now restricted to their homes.  This can be frustrating in many ways.  Moreover, I’m not alone in believing that we’re about to see some very troubled times for the world economy.

I’ve been trying to decide what I can do, as a Python instructor, to help people in these trying times.  And after some thought, I’ve decided to offer a free, weekly live workshop for people with little or no Python programming experience.

This workshop will be run as a live Zoom session on Fridays, with me teaching Python programming from the ground up.  This is similar to the “Python for non-programmers” course that I’ve been giving to Fortune 500 companies for the last few years — although it’ll be broken up over many weeks, and will hopefully have many more participants than I’ve ever had before.

This is an experiment, and I’m asking for your help in letting people know about it. If you, or someone you know, wants to spend an hour a week learning programming basics, then I invite you to join me in my “Python for non-programmers” workshop. 

I’ll ask questions, I’ll give exercises, and I’ll take questions. And I’ll tell lots of bad jokes, too. But in the end, I hope that you’ll learn, gain some skills, and have some fun not thinking about the news.

Not only will this workshop be completely free of charge, but I’ll be sharing each week’s recordings, as well. So if you cannot attend, or if you want to catch up on old videos, or even check out the Jupyter notebooks I use in each class, then don’t worry — all of that will be available to you, for as long as you want.

Sounds good?  I sure hope so!  You can join here:

Join my free, weekly “Python for non-programmers” workshop

When you join, I’ll send a welcome message.  And then I’ll send you connection instructions on Wednesday or Thursday.

Questions? Just contact me, via e-mail (reuven@lerner.co.il) or on Twitter (@reuvenmlerner).

The post Announcing: Free, weekly “Python for non-programmers” workshop appeared first on Reuven Lerner.

Paweł Fertyk: WebRTC: a working example


Recently I had to use WebRTC for a simple project. The technology itself has many advantages and is being developed as an open standard, without the need for any plugins. However, I was quite new to WebRTC and had some problems getting my head around the basic concepts, as well as creating a working solution. There are many tutorials available, but most of them are incomplete, obsolete, or forced me to use third-party services (e.g. Google Firebase) that only made the whole process more complicated to set up and more difficult to understand.

I decided to put together the information from all those resources and create a simple, working example of a WebRTC application. It does not require any third party services, unless you want to use it over a public network (in which case owning a server would really help). I hope it will provide a good starting point for everyone who is interested in exploring WebRTC.

This is not going to be a full tutorial of the WebRTC technology. You can find plenty of tutorials and detailed explanations all over the internet, for example here. You can also check the WebRTC API, if you want more information. This post is just going to show you one possible working example of WebRTC and explain how it works.

General description

Full source code of this example is available on GitHub. The program consists of three parts:

  • web application
  • signaling server
  • TURN server

The web application is very simple: one HTML file and one JavaScript file (plus one dependency: socket.io.js, which is included in the repository). It is designed to work with only two clients (two web browsers or two tabs of the same browser). Once you open it in your browser (tested on Firefox 74), it will ask for permission to use your camera and microphone. Once the permission is granted, the video and audio from each of the tabs will be streamed to the other one.

WebRTC application in action

Note: you might experience some problems if you try to access the same camera from both tabs. In my tests on my machine, I used two different devices: a built-in laptop camera and a USB webcam.

The signaling server is used by WebRTC applications to exchange information required to create a direct connection between peers. You can choose any technology you want for this. This example uses websockets (python-socketio on the backend and socket.io-client on the frontend).

The TURN server is required if you want to use this example over a public network. The process is described further in this post. For local network testing you will not need it.

Signaling

The signaling server is written in Python3 and looks like this:

from aiohttp import web
import socketio

ROOM = 'room'

sio = socketio.AsyncServer(cors_allowed_origins='*')
app = web.Application()
sio.attach(app)

@sio.event
async def connect(sid, environ):
    print('Connected', sid)
    await sio.emit('ready', room=ROOM, skip_sid=sid)
    sio.enter_room(sid, ROOM)

@sio.event
def disconnect(sid):
    sio.leave_room(sid, ROOM)
    print('Disconnected', sid)

@sio.event
async def data(sid, data):
    print('Message from {}: {}'.format(sid, data))
    await sio.emit('data', data, room=ROOM, skip_sid=sid)

if __name__ == '__main__':
    web.run_app(app, port=9999)

Every client joins the same room. Before entering the room, a ready event is sent to all clients currently in the room. That means that the first websocket connection will not get any message (the room is empty), but when the second connection is established, the first one will receive a ready event, signaling that there are at least two clients in the room and the WebRTC connection can start. Other than that, this server will forward any data (data event) that is sent by one websocket to the other one.

Setup is quite simple:

cd signaling
pip install aiohttp python-socketio
python server.py

This will start the signaling server at localhost:9999.

WebRTC

The simplified process of using WebRTC in this example looks like this:

  • both clients obtain their local media streams
  • once the stream is obtained, each client connects to the signaling server
  • once the second client connects, the first one receives a ready event, which means that the WebRTC connection can be negotiated
  • the first client creates an RTCPeerConnection object and sends an offer to the second client
  • the second client receives the offer, creates an RTCPeerConnection object, and sends an answer
  • more information is also exchanged, like ICE candidates
  • once the connection is negotiated, a callback for receiving a remote stream is called, and that stream is used as a source of the video element.

If you want to run this example on localhost, the signaling server and the web application are all you need. The main part of the HTML file is a single video element (whose source is going to be set later by the script):

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>WebRTC working example</title>
</head>
<body>
    <video id="remoteStream" autoplay playsinline></video>
    <script src="socket.io.js"></script>
    <script src="main.js"></script>
</body>
</html>

JavaScript part is a bit more complicated, and I'll explain it step by step. First, there are the config variables:

// Config variables
const SIGNALING_SERVER_URL = 'http://localhost:9999';
const PC_CONFIG = {};

For localhost PC_CONFIG can stay empty, and SIGNALING_SERVER_URL should point to the signaling server you've started in the previous step.

Next, we have the signaling methods:

let socket = io(SIGNALING_SERVER_URL, { autoConnect: false });

socket.on('data', (data) => {
  console.log('Data received: ', data);
  handleSignalingData(data);
});

socket.on('ready', () => {
  console.log('Ready');
  createPeerConnection();
  sendOffer();
});

let sendData = (data) => {
  socket.emit('data', data);
};

In this example, we want to connect to the signaling server only after we obtain the local media stream, so we need to set { autoConnect: false }. Other than that, we have a sendData method that emits a data event, and we react to the data event by handling the incoming information appropriately (more about it later). Also, receiving a ready event means that both clients have obtained their local media streams and have connected to the signaling server, so we can create a connection on our side and negotiate an offer with the remote side.

Next, we have the WebRTC related variables:

let pc;
let localStream;
let remoteStreamElement = document.querySelector('#remoteStream');

The pc will hold our peer connection, localStream is the stream we obtain from the browser, and remoteStreamElement is the video element that we will use to display the remote stream.

To get the media stream from the browser, we will use getLocalStream method:

let getLocalStream = () => {
  navigator.mediaDevices.getUserMedia({ audio: true, video: true })
    .then((stream) => {
      console.log('Stream found');
      localStream = stream;
      // Connect only after making sure that the local stream is available
      socket.connect();
    })
    .catch(error => {
      console.error('Stream not found: ', error);
    });
}

As you can see, we are going to connect to the signaling server only after the stream (audio and video) is obtained. Please note that all of the WebRTC related types and variables (like navigator, RTCPeerConnection, etc.) are provided by the browser, and do not require you to install anything.

Creating a peer connection is relatively easy:

let createPeerConnection = () => {
  try {
    pc = new RTCPeerConnection(PC_CONFIG);
    pc.onicecandidate = onIceCandidate;
    pc.onaddstream = onAddStream;
    pc.addStream(localStream);
    console.log('PeerConnection created');
  } catch (error) {
    console.error('PeerConnection failed: ', error);
  }
};

The two callbacks we are going to use are onicecandidate (called when the remote side sends us an ICE candidate), and onaddstream (called after the remote side adds its local media stream to its peer connection).

Next we have the offer and answer logic:

let sendOffer = () => {
  console.log('Send offer');
  pc.createOffer().then(
    setAndSendLocalDescription,
    (error) => { console.error('Send offer failed: ', error); }
  );
};

let sendAnswer = () => {
  console.log('Send answer');
  pc.createAnswer().then(
    setAndSendLocalDescription,
    (error) => { console.error('Send answer failed: ', error); }
  );
};

let setAndSendLocalDescription = (sessionDescription) => {
  pc.setLocalDescription(sessionDescription);
  console.log('Local description set');
  sendData(sessionDescription);
};

The details of WebRTC offer-answer negotiation are not a part of this post (please check the WebRTC documentation if you want to know more about the process). It's enough to know that one side sends an offer, the other reacts to it by sending an answer, and both sides use the description for their corresponding peer connections.

The WebRTC callbacks look like this:

let onIceCandidate = (event) => {
  if (event.candidate) {
    console.log('ICE candidate');
    sendData({
      type: 'candidate',
      candidate: event.candidate
    });
  }
};

let onAddStream = (event) => {
  console.log('Add stream');
  remoteStreamElement.srcObject = event.stream;
};

Received ICE candidates are sent to the other client, and when the other client sets the media stream, we react by using it as a source for our video element.

The last method is used to handle incoming data:

let handleSignalingData = (data) => {
  switch (data.type) {
    case 'offer':
      createPeerConnection();
      pc.setRemoteDescription(new RTCSessionDescription(data));
      sendAnswer();
      break;
    case 'answer':
      pc.setRemoteDescription(new RTCSessionDescription(data));
      break;
    case 'candidate':
      pc.addIceCandidate(new RTCIceCandidate(data.candidate));
      break;
  }
};

When we receive an offer, we create our own peer connection (the remote one is ready at that point). Then, we set the remote description and send an answer. When we receive the answer, we just set the remote description of our peer connection. Finally, when an ICE candidate is sent by the other client, we add it to our peer connection.

And finally, to actually start the WebRTC connection, we just need to call getLocalStream:

// Start connection
getLocalStream();

Running on localhost

If you started the signaling server in the previous step, you just need to host the HTML and JavaScript files, for example like this:

cd web
python -m http.server 7000

Then, open two tabs in your browser (or in two different browsers), and enter localhost:7000. As mentioned before, it is best to have two cameras available for this example to work. If everything goes well, you should see one video feed in each of the tabs.
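As a side note, the same static file server can be started from Python code instead of the command line; this is just a programmatic sketch of the `python -m http.server` one-liner above (the `serve_directory` helper name is my own):

```python
import threading
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler

def serve_directory(directory, port=7000):
    """Serve `directory` over HTTP in a background thread,
    equivalent to running `python -m http.server` inside it."""
    handler = partial(SimpleHTTPRequestHandler, directory=directory)
    server = HTTPServer(("", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server  # call server.shutdown() to stop it

# Example: serve_directory("web") makes localhost:7000 available,
# just like the shell command above.
```

Passing `port=0` lets the OS pick a free ephemeral port, which is handy if 7000 is already taken.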

Beyond localhost

You might be tempted to use this example on two different computers in your local network, replacing localhost with your machine's IP address, e.g. 192.168.0.11. You will quickly notice that it doesn't work, and your browser claims that navigator is undefined.

That happens because WebRTC is designed to be secure. That means that in order to work it needs a secure context. Simply put: all of the resources (in our case the HTTP server and the signaling server) have to be hosted either on localhost, or over HTTPS. If the context is not secure, navigator will be undefined, and you will not be allowed to access user media. If you want to test this example on different machines, using localhost is obviously not an option. Setting up certificates is not a part of this post, and not an easy task at all. If you just want to quickly check this example on two different computers, you can use a simple trick. Instead of hosting the resources over HTTPS, you can enable insecure context in Firefox. Go to about:config, accept the risk, and change the values of these two variables to true:

  • media.devices.insecure.enabled
  • media.getusermedia.insecure.enabled

It should look like this:

Firefox insecure context enabled

Now you should be able to access the web application on two different computers, and the WebRTC connection should be properly established.

Going global

You can use this example over a public network, but it's going to require a bit more work. First, you need to set up a TURN server. Simply put, TURN servers are used to discover WebRTC peers over a public network. Unfortunately, for this step you will need a publicly visible server. The good news is that once you have your own server, the setup process is quite easy (at least for a Ubuntu-based OS):

sudo apt install coturn
turnserver -a -o -v -n --no-dtls --no-tls -u username:credential

This will start a TURN server using port 3478. The flags mean:

  • -a: use the long-term credential mechanism
  • -o: start process as daemon (detach from current shell)
  • -v: 'Moderate' verbose mode
  • -n: do not use configuration file, take all parameters from the command line only
  • --no-dtls: do not start DTLS client listeners
  • --no-tls: do not start TLS client listeners
  • -u: user account, in form 'username:password', for long-term credentials

Next, you need to change the peer connection configuration a bit. Edit main.js, replacing {PUBLIC_IP} with an actual IP of your server:

const TURN_SERVER_URL = '{PUBLIC_IP}:3478';
const TURN_SERVER_USERNAME = 'username';
const TURN_SERVER_CREDENTIAL = 'credential';
const PC_CONFIG = {
  iceServers: [
    {
      urls: 'turn:' + TURN_SERVER_URL + '?transport=tcp',
      username: TURN_SERVER_USERNAME,
      credential: TURN_SERVER_CREDENTIAL
    },
    {
      urls: 'turn:' + TURN_SERVER_URL + '?transport=udp',
      username: TURN_SERVER_USERNAME,
      credential: TURN_SERVER_CREDENTIAL
    }
  ]
};

Of course, you will also have to host your signaling server and the web application itself on a public IP, and you need to change SIGNALING_SERVER_URL appropriately. Once that is done, this example should work for any two machines connected to the internet.

Conclusion

This is just one of the examples of what you can do with WebRTC. The technology is not limited to audio and video, it can be used to exchange any data. I hope this post will help you get started and work on your own ideas. And, of course, if you have any questions or find any errors, don't hesitate to contact me!

Anarcat: Remote presence tools for social distancing


As a technologist, I've been wondering how I can help people with the rapidly spreading coronavirus pandemic. With the world entering the "exponential stage" (e.g. Canada, the USA and basically all of Europe), everyone should take precautions and practice Social Distancing (and not dumbfuckery). But this doesn't mean we should dig ourselves in a hole in our basement: we can still talk to each other on the internet, and there are great, and free, tools available to do this. As part of my work as a sysadmin, I've had to answer questions about this a few times and I figured it was useful to share this more publicly.

Just say hi using whatever

First off, feel free to use the normal tools you normally use: Signal, Facetime, Skype, and Discord can be fine to connect with your folks, and since it doesn't take much to make someone's day please do use those tools to call your close ones and say "hi". People, especially your older folks, will feel alone and maybe scared in those crazy times. Every little bit you can do will help, even if it's just a normal phone call, an impromptu balcony fanfare, a remote workout class, or just a sing-along from your balcony, anything goes.

But if those tools don't work well for some reason, or you want to try something new, or someone doesn't have an iPad, or it's too dang cold to go on your balcony, you should know there are other alternatives that you can use.

Jitsi

We've been suggesting our folks use a tool called "Jitsi". Jitsi is a free software platform to host audio/video conferences. It has a web app which means anyone with a web browser can join a session. It can also do "screen sharing" if you need to work together on a project.

There are many "instances", but here's a subset I know about:

You can connect to those with your web browser directly. If your web browser doesn't work, try switching to another (e.g. if Firefox doesn't work, try Chrome and vice-versa). There are also desktop and mobile apps (F-Droid, Google Play, Apple Store) that will work better than just using your browser.

Jitsi should scale for small meetings up to a dozen people or more...

Mumble

... but beyond that, you might have trouble doing a full video-conference with a lot of people anyways. If you need to have a large conference with a lot of people, or if you have bandwidth and reliability problems with Jitsi, you can also try Mumble.

Mumble is an audio-only conferencing service, similar to Discord or Teamspeak, but made with free software. It requires users to install an app but there are clients for every platform out there (F-Droid, Google Play, Apple Store). Mumble is harder to set up, but is much more efficient in terms of bandwidth and latency. In other words, it will just scale and sound better.

Mumble ships with a list of known servers, but you can also connect to those trusted ones:

Live streaming

If for some reason those tools still don't scale, you might have a bigger problem on your hands. If your audience is over 100 people, you will not be able to all join in the same conference together. And besides, maybe you just want to broadcast some news and do not need audio or video feedback from the audience. In this case, you need "live streaming".

Here, proprietary services are Twitch, Livestream.com and Youtube. But the community also provides alternatives to those. This is more complicated to setup, but just to get you started, I'll link to:

For either of those tools, you need an app on your desktop. The Mayfirst instructions use OBS Studio for this, but it might be possible to hotwire VLC to stream video from your computer as well.

Text chat

When all else fails, text should go through. Slack, Twitter and Facebook are the best known alternatives here, obviously. I would warn against spending too much time on those, as they can foment harmful rumors and can spread bullshit like a virus on any given day. The situation does not make that any better. But it can be a good way to keep in touch with your loved ones.

But if you want to have a large meeting with a crazy number of people, text can actually accomplish wonders. Internet Relay Chat, also known as "IRC" (and which oldies might have experienced for a bit as mIRC), is, incredibly, still alive at the venerable age of 30 years old. It is mainly used by free software projects, but can be used by anyone. Here are some networks you can try:

Those are all web interfaces to the IRC networks, but there is also a plenitude of IRC apps you can install on your desktop if you want the full experience.

Common recommendations

Regardless of the tools you pick, audio and video streaming is a technical challenge. A lot of things happen under the hood when you pick up your phone and dial a number, and when using a desktop it can be difficult to get everything "just right".

Some advice:

  1. get a good microphone and headset: good audio really makes a difference in how pleasing the experience will be, both for you and your peers. good hardware will reduce echo, feedback and other audio problems. (see also my audio docs)

  2. check your audio/video setup before joining the meeting, ideally with another participant on the same platform you will use

  3. find a quiet place to meet: even a good microphone will pick up noises from the environment, so if you reduce those up front, everything will sound better. if you do live streaming and want a high quality recording, consider setting up a smaller room for recording. (tip: i heard of at least one journalist hiding in a closet full of clothes to make recordings, as it dampens the sound!)

  4. mute your microphone when you are not speaking (spacebar in Jitsi, follow the "audio wizard" in Mumble)

If you have questions or need help, feel free to ask! Comment on this blog or just drop me an email (see contact), I'd be happy to answer your questions.

Other ideas

Inevitably, when I write a post like this, someone will write something like "I can't believe you did not mention APL!" Here's a list of tools I have not mentioned here, deliberately or because I forgot:

  • Nextcloud Talk - needs access to a special server, but can be used for small meetings
  • Jabber/XMPP - yes, I know, XMPP can do everything and it's magic. but I've given up a while back, and I don't think setting up audio conferences with multiple people is easy enough to make the cut here
  • Signal - signal is great. i use it every day. it's the primary way I do long distance, international voice calls for free, and the only way I do video-conferencing with family and friends at all. but it's one to one only, and the group (text) chat kind of sucks

Also, all the tools I recommend above are made of free software, which means they can be self-hosted. If things go bad and all those services stop existing, it should be possible for you to run your own instance.

Let me know if I forgot anything, but in a friendly way. And stay safe out there.

Mike Driscoll: PyDev of the Week: Jessica Garson


This week we welcome Jessica Garson (@jessicagarson) as our PyDev of the Week! Jessica is a developer advocate at Twitter. She also teaches Python at New York University. You can see some of what she’s up to over on Github. Let’s spend some time getting to know her better!

Can you tell us a little about yourself (hobbies, education, etc):

I’m currently a Developer Advocate at Twitter, where I work to make sure developers have good experiences using the Twitter API. What that means is that I write example code, speak at conferences and create blog posts. I also make noise music with Python and perform regularly in the New York area under the artist name, Messica Arson. Before working in technology, I worked on political campaigns.

Why did you start using Python?

I started learning how to code on my own in 2010, which proved to be very difficult. I was working at a political data consulting company, and all of the backend code was written in Perl so I started reading a book on Perl. A coworker saw my book and pulled me aside and mentioned that if he were learning how to code today, he’d learn Python. Shortly thereafter, I found a community group in Washington, DC called Hear me Code which was free beginner-friendly classes for women by women.

What other programming languages do you know and which is your favorite?

I’ve been growing my skills in JavaScript lately. I’m excited to learn more about TensorFlow.js. In the past year, I’ve grown my skills in R quite a bit as well. I also make music sometimes using Ruby and Haskell.

What projects are you working on now?

I’ve been excited about the R package reticulate, which allows you to run your favorite Python package in R. I recently figured out how to run the Python package search-tweets-python inside of R.

Which Python libraries are your favorite (core or 3rd party)?

FoxDot, the Python wrapper for SuperCollider that allows me to make weird danceable noise music with Python. I perform monthly or so in the NYC area and to think that I do that with my favorite programming language is exciting. I’d like to record a new album sometime this year.

How did you get involved in organizing tech conferences/meetups?

I started teaching people how to code pretty shortly after I started learning myself. The environment that I got started in was so supportive. I was so proud of what I was building that I wanted to share what I was creating with others so I started speaking at meetups and conferences locally. I realized at some point that I could also run and curate these events too.

I see you used to teach Python at NYU. Any exciting experiences you would like to share?

Teaching was an incredibly empowering experience. It was so exciting to watch my students learn and grow during our time together. Recently I ran into a former student at PyGotham. It was great to see someone who I taught who is now teaching Python classes on machine learning.

Thanks for doing the interview, Jessica!

The post PyDev of the Week: Jessica Garson appeared first on The Mouse Vs. The Python.

The Digital Cat: Public key cryptography: RSA keys


I bet you have created an RSA key pair at least once, usually because you needed to connect to GitHub and wanted to avoid typing your password every time. You diligently followed the documentation on how to create SSH keys and after a couple of minutes your setup was complete.

But do you know what you actually did?

Do you know what the ~/.ssh/id_rsa file really contains? Why did ssh create two files with such a different format? Did you notice that one file begins with ssh-rsa, while the other begins with -----BEGIN RSA PRIVATE KEY-----? Have you noticed that sometimes the header of the second file misses the RSA part and just says BEGIN PRIVATE KEY?

I believe that a minimum level of knowledge regarding the various formats of RSA keys is mandatory for every developer nowadays, not to mention the importance of understanding them deeply if you want to pursue a career in the infrastructure management world.

RSA algorithm and key pairs

Since the invention of public-key cryptography, various systems have been devised to create the key pair. One of the first ones is RSA, the creation of three brilliant cryptographers, that dates back to 1977. The story of RSA is quite interesting, as it was first invented by an English mathematician, Clifford Cocks, who was however forced to keep it secret by the British intelligence office he was working for.

Keeping in mind that RSA is not a synonym for public-key cryptography but only one of the possible implementations, I wanted to write a post on it because it is still, more than 40 years after its publication, one of the most widespread algorithms. In particular it is the standard algorithm used to generate SSH key pairs, and since nowadays every developer has their public key on GitHub, BitBucket, or similar systems, we may arguably say that RSA is pretty ubiquitous.

I will not cover the internals of the RSA algorithm in this article, however. If you are interested in the gory details of the mathematical framework you may find plenty of resources both on the Internet and in textbooks. The theory behind it is not trivial, but it is definitely worth the time if you want to be serious about the mathematical part of cryptography.

In this article I will instead explore two ways to create RSA key pairs and the formats used to store them. Applied cryptography is, like many other topics in computer science, a moving target, and the tools change often. Sometimes it is pretty easy to find out how to do something (StackOverflow helps), but less easy to get a clear picture of what is going on.

All the examples shown in this post use a 2048-bit RSA key created for this purpose, so all the numbers you see come from a real example. The key has obviously been trashed after I wrote the article.

The PEM format

Let's start the discussion about key pairs with the format used to store them. Nowadays the most widely accepted storage format is called PEM (Privacy-enhanced Electronic Mail). As the name suggests, this format was initially created for e-mail encryption but later became a general format to store cryptographic data like keys and certificates. It is described in RFC 7468 ("Textual Encodings of PKIX, PKCS, and CMS Structures").

An example private key in PEM format is the following

-----BEGIN PRIVATE KEY-----
MIIEvgIBADANBgkqhkiG9w0BAQEFAASCBKgwggSkAgEAAoIBAQCy9f0/nwkXESzk
L4v4ftZ24VJYvkQ/Nt6vsLab3iSWtJXqrRsBythCcbAU6W95OGxjbTSFFtp0poqM
cPuogocMR7QhjY9JGG3fcnJ7nYDCGRHD4zfG5Af/tHwvJ2ew0WTYoemvlfZIG/jZ
7fsuOQSyUpJoxGAlb6/QpnfSmJjxCx0VEoppWDn8CO3VhOgzVhWx0dcne+ZcUy3K
kt3HBQN0hosRfqkVSRTvkpK4RD8TaW5PrVDe1r2Q5ab37TO+Ls4xxt16QlPubNxW
eH3dHVzXdmFAItuH0DuyLyMoW1oxZ6+NrKu+pAAERxM303gejFzKDqXid5m1EOTv
k4xhyqYNAgMBAAECggEBALJCVQAKagOQGCczNTlRHk9MIbpDy7cr8KUQYNThcZCs
UKhxxXUDmGaW1838uA0HJu/i1226Vd/cBCXgZMx1OBADXGoPl6o3qznnxiFbweWV
Ex0MN4LloRITtZ9CoQZ/jPQ8U4mS1r79HeP2KTzhjswRc8Tn1t1zYq1zI+eiGLX/
sPJF63ljJ8yHST7dE0I07V87FKTE2SN0WX9kptPLLBDwzS1X6Z9YyNKPIEnRQzzE
vWdwF60b3RyDz7j7foyP3PC0+3fee4KFdJzt+/1oePf3kwBz8PQq3cuoOF1+0Fzf
yqKiunV2AXI6liAf7MwuZcZeFPZfHTTW7N/j+FQBgAECgYEA4dFjib9u/3rkT2Vx
Bu2ByBpItfs1b4PdSiKehlS9wDZxa72dRt/RSYEyVFBUlYrKXP2nCdl8yMap6SA9
Bfe51F5oWhml9YJn/LF/z1ArMs/tuUyupY7l9j66XzPQmUbIZSEyNEQQ09ZYdIvK
4lbySJbCqa2TQNPIOSZS2o7XNG0CgYEAyuFVybOkVGtfw89MyA1TnVMcQGusXtgo
GOl3tJb59hTO+xF547+/qyK8p/iOu4ybEyeucBEyQt/whmNwtsdngtvVDb4f7psz
Frmqx7q7fPoKnvJsPJds9i2o9B7+BlRY3HwcvKePsctP96pQ0RbOFkCVak6J6t9S
k/qhOiNJ9CECgYEAvDuTMk5tku54g6o2ZiTyir9GHtOwviz3+AUViTn4FdIMB1g+
UsbcqN3V+ywe5ayUdKFHbNFqz92x4k7qLyBJObocWAaLLTQvxBadSE02RRvHuC8w
YXbVP8cYCaWiWzICdzINrD2UnVBN2ZBxZOw+970btN6oIWCnxOOqKt7oip0CgYAp
Fekhp9enoPcL2HdcLBa6zZHzGdsWef/ky6MKV2jXhO9FuQxOKw7NwYMjIRsGsDrX
bjnNSC49jMxQ6uJwoYE85vgGiHI/B/8YoxEK0a4WaSytc7qnqqLOWADXL0+SSJKW
VCwdqHFZOCtBpKQpM80YhIu9s5oKjp9SiHcOJwdbAQKBgDq047hBqyNFFb8KjS5A
+26VOJcC2DRHTprYSRJNxsHTQnONTnUQJl32t0TrqkqIp5lTRr7vBH2wJM6LKk45
I7BWY4mUirC7sDGHl3DaFPRBiut1rpg0kSKi2VNRF7Bb75OKEhGjvm6IKVe8Kl8d
5cpQwm9C7go4OiorY0DVLho2
-----END PRIVATE KEY-----

Basically, you can tell you are dealing with a PEM format from the typical header and footer that identify the content. While the hyphens and the two BEGIN and END words are always present, the PRIVATE KEY part describes the content and can change if the PEM file contains something different from a key, for example an X.509 certificate for SSL.

The PEM format specifies that the body of the content (the part between the header and the footer) is encoded using Base64.
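In practice this means that recovering the raw binary content of a PEM file is just a matter of dropping the header and footer and Base64-decoding the rest. A small Python sketch (the `pem_body` name is mine, not a standard API):

```python
import base64

def pem_body(pem_text):
    """Return the raw binary bytes of a PEM block: everything
    between the header and footer, Base64-decoded."""
    lines = pem_text.strip().splitlines()
    body = "".join(l.strip() for l in lines if not l.startswith("-----"))
    return base64.b64decode(body)

# The result is binary data, which is why opening it in a text
# editor shows garbage: it needs a dedicated parser to be interpreted.
```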

If the private key has been encrypted with a password the header and the footer are different

-----BEGIN ENCRYPTED PRIVATE KEY-----
MIIFHzBJBgkqhkiG9w0BBQ0wPDAbBgkqhkiG9w0BBQwwDgQIf75rXIakuSICAggA
MB0GCWCGSAFlAwQBKgQQf8HMdJ9FZJjwHkMQjkNA3gSCBNClWB7cJ5f8ThrQtmoA
t2WQCvEWTY9nRYwaTnL1SmXyuMDFrX5CWEuVFh/Zj77KB9jhBJaHw2XtFXxF8bV7
F10u93ih/n0S5QwN9CSPDhRp2kD5lIWB8WVG+VgtncqDrAfJRmpuPmzpjMJBxE2r
MvWJG5beMCS25qD0mAxihtbriqFoCtEygQ7vsSfeQpaBQvT5pKLOVaVgwFTFTf+7
cgqB8/UKKmPXSM4GMJ9VNAvUx0mAxI9MnUFlBWimK76OAzdlO9Si99R8OiRRS10x
AO1AwWSDHGWpbckK0g9K7wLgAgOw8LLVUJh67o9Mfg58DP9Ca0ZdPPVo0C7oavBD
NFlUsKqmSfqfgOAm4qGJ7GB3KgWGFdz+yexNLRLN63hE6qACAuQ1oLmwoorE8toh
MhT3c6IxnVWlYNXJkkb5iV9e8E2X/xzibvwv+CJJ9ulCU8uS7gp0rjlCKFwt/8d4
g3Cef/JWn9nI9YwRLNShJeQOe8hZkkLXHefUhBa2o2++C5C6mgWvuYLK6a0zfCMY
WCqjKKvDQfuxwDbeM03jJ97Je6dXy7rtJvJd10vYvpIVtHnNSdg1evpSiaAmWt4C
X5/AzbHNvwTIEvILfOtYvxLB/RdWqr1/VXuH4dJF6AYtHfQHjXetmL/fDA86Bqf6
Eb+uDr+PPuH4qw1tfJBdTSOOJzhhPqdT4ERYnOvfNxTKzsKYZT+kWvWXe9zyO13W
C0eceVi4rBjKpKpKecKDgFJGZ1u7jS0OW3FDIOfm/osu9z25g5CVIpuWU3JquWib
GatHET9wIEg7LRqC/i65q6tCnd9azevKtiur1I0tuh05iwP5kZ8drIzaGdObuvK1
/pbEPnj1ZcRlAZ34jnG841xvf4vofrOE+hGTNF5HypOCvO/8Lms3aB6NletIvHBE
99ynQyF9TAgSAFAumOws+qnRcnfVOF5lzIEE2pmeMVMqi5s7TT4hlhOuCbyfEFU8
xOXxNazT+0o7urIYOc77vA1LsWrk+9dAfm43CbBZvYav/gMoBc5fsLgAUAm1lkt5
5Hjaf+iMIN0v7aEKDrNDOtyQr13YdyuEClzXxeMtlhU+QfErpQHvH0jE4gywEgz7
tvVGwrbiLgg0y537+kg0/rS3N0eI94GhY0q/nR/QFObbN0nmoIYVVSGtufJx1r9v
YEVZA7HZE9pjnun1ylE1/SoYc/816rjBUcW5CCbkMDIz1LsFPr2SkQeHTNzK3/9J
Kny1lerfA+TA/hUyZ1KJjxuao+rJkH2fJ25qs3r6NP+PPbq3sAl1TPGhMCnNaFdo
YQWDDwz26ZR2ywfsquqLXMwnIEeUI/hQTng9ZxLkJMY22rQSA9nsdvR8S1b0U8Qu
ViYEjCTMWF8HEFFO721MlkTgchzq6fiF+9ZydCpVUJWolcfw1OgUvvTSI7Eyhelb
7fc1fTVFeEMsHrtjpu8dg+IaCNraBzv5QZx6MYW7SSoTVp8mJoPnzYbsZs9nHJGX
iQOFmO/sIryOoeJlpOCGT55yU74yRXrBsYZyLz0P9K1FDQS6l9W33BqmF9vSXujs
kSByq8v1OU0IqidnMmZtTDSRlpQL/oadqQnsA6jiWyMznuUEU8tfgUALE4DKRq8P
wBLKVfMiwcWAbl121M2DCLj9/g==
-----END ENCRYPTED PRIVATE KEY-----

When the PEM format is used to store cryptographic keys the body of the content is in a format called PKCS #8. Initially a standard created by a private company (RSA Laboratories), it became a de facto standard and has been described in various RFCs, most notably RFC 5208 ("Public-Key Cryptography Standards (PKCS) #8: Private-Key Information Syntax Specification Version 1.2").

The PKCS #8 format describes the content using the ASN.1 (Abstract Syntax Notation One) description language and the corresponding DER (Distinguished Encoding Rules) to serialize the resulting structure. This means that Base64-decoding the content will return some binary content that can be processed only by an ASN.1 parser.

Let me visually recap the structure

-----BEGIN label-----
+--------------------------- Base64 ---------------------------+
|                                                              |
| PKCS #8 content:                                             |
| ASN.1 language serialized with DER                           |
|                                                              |
+--------------------------------------------------------------+
-----END label-----

Please note that, due to the structure of the underlying ASN.1 structure, every PEM body starts with the MII characters.
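That MII prefix is a direct consequence of DER: a SEQUENCE whose content is longer than 255 bytes (and shorter than 16384) is encoded as the bytes `0x30 0x82` followed by two length bytes, and Base64 maps that prefix to the characters MII. A quick check in Python:

```python
import base64

# DER header of a SEQUENCE with a two-byte length field:
# 0x30 (SEQUENCE tag), 0x82 (long form, 2 length bytes), then the length.
for length in (0x0100, 0x04BE, 0x3FFF):
    header = bytes([0x30, 0x82, length >> 8, length & 0xFF])
    encoded = base64.b64encode(header).decode()
    # As long as the high length byte stays below 0x40, the Base64
    # encoding of the header starts with "MII".
    assert encoded.startswith("MII"), encoded
```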

OpenSSL and ASN.1

OpenSSL can directly decode a key in PEM format and show the underlying ASN.1 structure with the asn1parse module

$ openssl asn1parse -inform pem -in private.pem
    0:d=0  hl=4 l=1214 cons: SEQUENCE          
    4:d=1  hl=2 l=   1 prim: INTEGER           :00
    7:d=1  hl=2 l=  13 cons: SEQUENCE          
    9:d=2  hl=2 l=   9 prim: OBJECT            :rsaEncryption
   20:d=2  hl=2 l=   0 prim: NULL              
   22:d=1  hl=4 l=1192 prim: OCTET STRING      [HEX DUMP]:308204A40201000282010100B2F5FD3F9F0917112
   CE42F8BF87ED676E15258BE443F36DEAFB0B69BDE2496B495EAAD1B01CAD84271B014E96F79386C636D348516DA74A68
   A8C70FBA882870C47B4218D8F49186DDF72727B9D80C21911C3E337C6E407FFB47C2F2767B0D164D8A1E9AF95F6481BF
   8D9EDFB2E3904B2529268C460256FAFD0A677D29898F10B1D15128A695839FC08EDD584E8335615B1D1D7277BE65C532
   DCA92DDC7050374868B117EA9154914EF9292B8443F13696E4FAD50DED6BD90E5A6F7ED33BE2ECE31C6DD7A4253EE6CD
   C56787DDD1D5CD776614022DB87D03BB22F23285B5A3167AF8DACABBEA40004471337D3781E8C5CCA0EA5E27799B510E
   4EF938C61CAA60D02030100010282010100B24255000A6A03901827333539511E4F4C21BA43CBB72BF0A51060D4E1719
   0AC50A871C57503986696D7CDFCB80D0726EFE2D76DBA55DFDC0425E064CC753810035C6A0F97AA37AB39E7C6215BC1E
   595131D0C3782E5A11213B59F42A1067F8CF43C538992D6BEFD1DE3F6293CE18ECC1173C4E7D6DD7362AD7323E7A218B
   5FFB0F245EB796327CC87493EDD134234ED5F3B14A4C4D92374597F64A6D3CB2C10F0CD2D57E99F58C8D28F2049D1433
   CC4BD677017AD1BDD1C83CFB8FB7E8C8FDCF0B4FB77DE7B8285749CEDFBFD6878F7F7930073F0F42ADDCBA8385D7ED05
   CDFCAA2A2BA757601723A96201FECCC2E65C65E14F65F1D34D6ECDFE3F85401800102818100E1D16389BF6EFF7AE44F6
   57106ED81C81A48B5FB356F83DD4A229E8654BDC036716BBD9D46DFD1498132545054958ACA5CFDA709D97CC8C6A9E92
   03D05F7B9D45E685A19A5F58267FCB17FCF502B32CFEDB94CAEA58EE5F63EBA5F33D09946C8652132344410D3D658748
   BCAE256F24896C2A9AD9340D3C8392652DA8ED7346D02818100CAE155C9B3A4546B5FC3CF4CC80D539D531C406BAC5ED
   82818E977B496F9F614CEFB1179E3BFBFAB22BCA7F88EBB8C9B1327AE70113242DFF0866370B6C76782DBD50DBE1FEE9
   B3316B9AAC7BABB7CFA0A9EF26C3C976CF62DA8F41EFE065458DC7C1CBCA78FB1CB4FF7AA50D116CE1640956A4E89EAD
   F5293FAA13A2349F42102818100BC3B93324E6D92EE7883AA366624F28ABF461ED3B0BE2CF7F805158939F815D20C075
   83E52C6DCA8DDD5FB2C1EE5AC9474A1476CD16ACFDDB1E24EEA2F204939BA1C58068B2D342FC4169D484D36451BC7B82
   F306176D53FC71809A5A25B320277320DAC3D949D504DD9907164EC3EF7BD1BB4DEA82160A7C4E3AA2ADEE88A9D02818
   02915E921A7D7A7A0F70BD8775C2C16BACD91F319DB1679FFE4CBA30A5768D784EF45B90C4E2B0ECDC18323211B06B03
   AD76E39CD482E3D8CCC50EAE270A1813CE6F80688723F07FF18A3110AD1AE16692CAD73BAA7AAA2CE5800D72F4F92489
   296542C1DA87159382B41A4A42933CD18848BBDB39A0A8E9F5288770E27075B010281803AB4E3B841AB234515BF0A8D2
   E40FB6E95389702D834474E9AD849124DC6C1D342738D4E7510265DF6B744EBAA4A88A7995346BEEF047DB024CE8B2A4
   E3923B0566389948AB0BBB031879770DA14F4418AEB75AE98349122A2D9535117B05BEF938A1211A3BE6E882957BC2A5
   F1DE5CA50C26F42EE0A383A2A2B6340D52E1A36

What you see in the code snippet is the private key in ASN.1 format. Remember that DER is only used to go from the text representation of ASN.1 to binary data, so we don't see it unless we decode the Base64 content into a file and open it with a binary editor.

Note that the ASN.1 structure contains the type of the object (rsaEncryption, in this case). You can further decode the OCTET STRING field, which is the actual key, by specifying its offset

$ openssl asn1parse -inform pem -in private.pem -strparse 22
    0:d=0  hl=4 l=1188 cons: SEQUENCE          
    4:d=1  hl=2 l=   1 prim: INTEGER           :00
    7:d=1  hl=4 l= 257 prim: INTEGER           :B2F5FD3F9F0917112CE42F8BF87ED676E15258BE443F36DEAFB
    0B69BDE2496B495EAAD1B01CAD84271B014E96F79386C636D348516DA74A68A8C70FBA882870C47B4218D8F49186DDF
    72727B9D80C21911C3E337C6E407FFB47C2F2767B0D164D8A1E9AF95F6481BF8D9EDFB2E3904B2529268C460256FAFD
    0A677D29898F10B1D15128A695839FC08EDD584E8335615B1D1D7277BE65C532DCA92DDC7050374868B117EA9154914
    EF9292B8443F13696E4FAD50DED6BD90E5A6F7ED33BE2ECE31C6DD7A4253EE6CDC56787DDD1D5CD776614022DB87D03
    BB22F23285B5A3167AF8DACABBEA40004471337D3781E8C5CCA0EA5E27799B510E4EF938C61CAA60D
  268:d=1  hl=2 l=   3 prim: INTEGER           :010001
  273:d=1  hl=4 l= 257 prim: INTEGER           :B24255000A6A03901827333539511E4F4C21BA43CBB72BF0A51
    060D4E17190AC50A871C57503986696D7CDFCB80D0726EFE2D76DBA55DFDC0425E064CC753810035C6A0F97AA37AB39
    E7C6215BC1E595131D0C3782E5A11213B59F42A1067F8CF43C538992D6BEFD1DE3F6293CE18ECC1173C4E7D6DD7362A
    D7323E7A218B5FFB0F245EB796327CC87493EDD134234ED5F3B14A4C4D92374597F64A6D3CB2C10F0CD2D57E99F58C8
    D28F2049D1433CC4BD677017AD1BDD1C83CFB8FB7E8C8FDCF0B4FB77DE7B8285749CEDFBFD6878F7F7930073F0F42AD
    DCBA8385D7ED05CDFCAA2A2BA757601723A96201FECCC2E65C65E14F65F1D34D6ECDFE3F854018001
  534:d=1  hl=3 l= 129 prim: INTEGER           :E1D16389BF6EFF7AE44F657106ED81C81A48B5FB356F83DD4A2
    29E8654BDC036716BBD9D46DFD1498132545054958ACA5CFDA709D97CC8C6A9E9203D05F7B9D45E685A19A5F58267FC
    B17FCF502B32CFEDB94CAEA58EE5F63EBA5F33D09946C8652132344410D3D658748BCAE256F24896C2A9AD9340D3C83
    92652DA8ED7346D
  666:d=1  hl=3 l= 129 prim: INTEGER           :CAE155C9B3A4546B5FC3CF4CC80D539D531C406BAC5ED82818E
    977B496F9F614CEFB1179E3BFBFAB22BCA7F88EBB8C9B1327AE70113242DFF0866370B6C76782DBD50DBE1FEE9B3316
    B9AAC7BABB7CFA0A9EF26C3C976CF62DA8F41EFE065458DC7C1CBCA78FB1CB4FF7AA50D116CE1640956A4E89EADF529
    3FAA13A2349F421
  798:d=1  hl=3 l= 129 prim: INTEGER           :BC3B93324E6D92EE7883AA366624F28ABF461ED3B0BE2CF7F80
    5158939F815D20C07583E52C6DCA8DDD5FB2C1EE5AC9474A1476CD16ACFDDB1E24EEA2F204939BA1C58068B2D342FC4
    169D484D36451BC7B82F306176D53FC71809A5A25B320277320DAC3D949D504DD9907164EC3EF7BD1BB4DEA82160A7C
    4E3AA2ADEE88A9D
  930:d=1hl=3l=128 prim: INTEGER           :2915E921A7D7A7A0F70BD8775C2C16BACD91F319DB1679FFE4C
    BA30A5768D784EF45B90C4E2B0ECDC18323211B06B03AD76E39CD482E3D8CCC50EAE270A1813CE6F80688723F07FF18
    A3110AD1AE16692CAD73BAA7AAA2CE5800D72F4F92489296542C1DA87159382B41A4A42933CD18848BBDB39A0A8E9F5
    288770E27075B01
1061:d=1hl=3l=128 prim: INTEGER           :3AB4E3B841AB234515BF0A8D2E40FB6E95389702D834474E9AD8
    49124DC6C1D342738D4E7510265DF6B744EBAA4A88A7995346BEEF047DB024CE8B2A4E3923B0566389948AB0BBB0318
    79770DA14F4418AEB75AE98349122A2D9535117B05BEF938A1211A3BE6E882957BC2A5F1DE5CA50C26F42EE0A383A2A
    2B6340D52E1A36

Since this is an RSA key, the fields represent specific components of the algorithm. In order, we find the modulus n = pq, the public exponent e, the private exponent d, the two prime numbers p and q, and the values d_p, d_q, and q_inv (used for the Chinese remainder theorem speed-up).
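These relations are easy to check with a toy key. The sketch below uses deliberately tiny primes (never usable for real cryptography) just to show how the fields fit together:

```python
# Toy RSA parameters -- deliberately tiny, for illustration only
p, q = 61, 53
n = p * q                            # modulus
e = 17                               # public exponent
d = pow(e, -1, (p - 1) * (q - 1))    # private exponent (pow with -1 needs Python >= 3.8)

# CRT speed-up values, matching the last three fields of the dump
d_p = d % (p - 1)
d_q = d % (q - 1)
q_inv = pow(q, -1, p)

# Round-trip a message to confirm the parameters are consistent
m = 42
c = pow(m, e, n)          # encrypt with the public key
assert pow(c, d, n) == m  # decrypt with the private key
```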

If the key has been encrypted, there are fields with information about the cipher, and the OCTET STRING fields cannot be parsed further because of the encryption.

$ openssl asn1parse -inform pem -in private-enc.pem
    0:d=0hl=4l=1311 cons: SEQUENCE          
    4:d=1hl=2l=73 cons: SEQUENCE          
    6:d=2hl=2l=9 prim: OBJECT            :PBES2
   17:d=2hl=2l=60 cons: SEQUENCE          
   19:d=3hl=2l=27 cons: SEQUENCE          
   21:d=4hl=2l=9 prim: OBJECT            :PBKDF2
   32:d=4hl=2l=14 cons: SEQUENCE          
   34:d=5hl=2l=8 prim: OCTET STRING      [HEX DUMP]:7FBE6B5C86A4B922
   44:d=5hl=2l=2 prim: INTEGER           :0800
   48:d=3hl=2l=29 cons: SEQUENCE          
   50:d=4hl=2l=9 prim: OBJECT            :aes-256-cbc
   61:d=4hl=2l=16 prim: OCTET STRING      [HEX DUMP]:7FC1CC749F456498F01E43108E4340DE
   79:d=1hl=4l=1232 prim: OCTET STRING      [HEX DUMP]:A5581EDC2797FC4E1AD0B66A00B765900AF1164D8
   F67458C1A4E72F54A65F2B8C0C5AD7E42584B95161FD98FBECA07D8E1049687C365ED157C45F1B57B175D2EF778A1FE7
   D12E50C0DF4248F0E1469DA40F9948581F16546F9582D9DCA83AC07C9466A6E3E6CE98CC241C44DAB32F5891B96DE302
   4B6E6A0F4980C6286D6EB8AA1680AD132810EEFB127DE42968142F4F9A4A2CE55A560C054C54DFFBB720A81F3F50A2A6
   3D748CE06309F55340BD4C74980C48F4C9D41650568A62BBE8E0337653BD4A2F7D47C3A24514B5D3100ED40C164831C6
   5A96DC90AD20F4AEF02E00203B0F0B2D550987AEE8F4C7E0E7C0CFF426B465D3CF568D02EE86AF043345954B0AAA649F
   A9F80E026E2A189EC60772A058615DCFEC9EC4D2D12CDEB7844EAA00202E435A0B9B0A28AC4F2DA213214F773A2319D5
   5A560D5C99246F9895F5EF04D97FF1CE26EFC2FF82249F6E94253CB92EE0A74AE3942285C2DFFC77883709E7FF2569FD
   9C8F58C112CD4A125E40E7BC8599242D71DE7D48416B6A36FBE0B90BA9A05AFB982CAE9AD337C2318582AA328ABC341F
   BB1C036DE334DE327DEC97BA757CBBAED26F25DD74BD8BE9215B479CD49D8357AFA5289A0265ADE025F9FC0CDB1CDBF0
   4C812F20B7CEB58BF12C1FD1756AABD7F557B87E1D245E8062D1DF4078D77AD98BFDF0C0F3A06A7FA11BFAE0EBF8F3EE
   1F8AB0D6D7C905D4D238E2738613EA753E044589CEBDF3714CACEC298653FA45AF5977BDCF23B5DD60B479C7958B8AC1
   8CAA4AA4A79C283805246675BBB8D2D0E5B714320E7E6FE8B2EF73DB9839095229B9653726AB9689B19AB47113F70204
   83B2D1A82FE2EB9ABAB429DDF5ACDEBCAB62BABD48D2DBA1D398B03F9919F1DAC8CDA19D39BBAF2B5FE96C43E78F565C
   465019DF88E71BCE35C6F7F8BE87EB384FA1193345E47CA9382BCEFFC2E6B37681E8D95EB48BC7044F7DCA743217D4C0
   81200502E98EC2CFAA9D17277D5385E65CC8104DA999E31532A8B9B3B4D3E219613AE09BC9F10553CC4E5F135ACD3FB4
   A3BBAB21839CEFBBC0D4BB16AE4FBD7407E6E3709B059BD86AFFE032805CE5FB0B8005009B5964B79E478DA7FE88C20D
   D2FEDA10A0EB3433ADC90AF5DD8772B840A5CD7C5E32D96153E41F12BA501EF1F48C4E20CB0120CFBB6F546C2B6E22E0
   834CB9DFBFA4834FEB4B7374788F781A1634ABF9D1FD014E6DB3749E6A086155521ADB9F271D6BF6F60455903B1D913D
   A639EE9F5CA5135FD2A1873FF35EAB8C151C5B90826E4303233D4BB053EBD929107874CDCCADFFF492A7CB595EADF03E
   4C0FE15326752898F1B9AA3EAC9907D9F276E6AB37AFA34FF8F3DBAB7B009754CF1A13029CD6857686105830F0CF6E99
   476CB07ECAAEA8B5CCC2720479423F8504E783D6712E424C636DAB41203D9EC76F47C4B56F453C42E5626048C24CC585
   F0710514EEF6D4C9644E0721CEAE9F885FBD672742A555095A895C7F0D4E814BEF4D223B13285E95BEDF7357D3545784
   32C1EBB63A6EF1D83E21A08DADA073BF9419C7A3185BB492A13569F262683E7CD86EC66CF671C919789038598EFEC22B
   C8EA1E265A4E0864F9E7253BE32457AC1B186722F3D0FF4AD450D04BA97D5B7DC1AA617DBD25EE8EC912072ABCBF5394
   D08AA276732666D4C349196940BFE869DA909EC03A8E25B23339EE50453CB5F81400B1380CA46AF0FC012CA55F322C1C
   5806E5D76D4CD8308B8FDFE

OpenSSL and RSA keys

Another way to look into a private key with OpenSSL is to use the rsa module. While the asn1parse module is a generic ASN.1 parser, the rsa module knows the structure of an RSA key and can properly output the field names.

$ openssl rsa -in private.pem -noout -text
Private-Key: (2048 bit)
modulus:
    00:b2:f5:fd:3f:9f:09:17:11:2c:e4:2f:8b:f8:7e:
    d6:76:e1:52:58:be:44:3f:36:de:af:b0:b6:9b:de:
    24:96:b4:95:ea:ad:1b:01:ca:d8:42:71:b0:14:e9:
    6f:79:38:6c:63:6d:34:85:16:da:74:a6:8a:8c:70:
    fb:a8:82:87:0c:47:b4:21:8d:8f:49:18:6d:df:72:
    72:7b:9d:80:c2:19:11:c3:e3:37:c6:e4:07:ff:b4:
    7c:2f:27:67:b0:d1:64:d8:a1:e9:af:95:f6:48:1b:
    f8:d9:ed:fb:2e:39:04:b2:52:92:68:c4:60:25:6f:
    af:d0:a6:77:d2:98:98:f1:0b:1d:15:12:8a:69:58:
    39:fc:08:ed:d5:84:e8:33:56:15:b1:d1:d7:27:7b:
    e6:5c:53:2d:ca:92:dd:c7:05:03:74:86:8b:11:7e:
    a9:15:49:14:ef:92:92:b8:44:3f:13:69:6e:4f:ad:
    50:de:d6:bd:90:e5:a6:f7:ed:33:be:2e:ce:31:c6:
    dd:7a:42:53:ee:6c:dc:56:78:7d:dd:1d:5c:d7:76:
    61:40:22:db:87:d0:3b:b2:2f:23:28:5b:5a:31:67:
    af:8d:ac:ab:be:a4:00:04:47:13:37:d3:78:1e:8c:
    5c:ca:0e:a5:e2:77:99:b5:10:e4:ef:93:8c:61:ca:
    a6:0d
publicExponent: 65537 (0x10001)
privateExponent:
    00:b2:42:55:00:0a:6a:03:90:18:27:33:35:39:51:
    1e:4f:4c:21:ba:43:cb:b7:2b:f0:a5:10:60:d4:e1:
    71:90:ac:50:a8:71:c5:75:03:98:66:96:d7:cd:fc:
    b8:0d:07:26:ef:e2:d7:6d:ba:55:df:dc:04:25:e0:
    64:cc:75:38:10:03:5c:6a:0f:97:aa:37:ab:39:e7:
    c6:21:5b:c1:e5:95:13:1d:0c:37:82:e5:a1:12:13:
    b5:9f:42:a1:06:7f:8c:f4:3c:53:89:92:d6:be:fd:
    1d:e3:f6:29:3c:e1:8e:cc:11:73:c4:e7:d6:dd:73:
    62:ad:73:23:e7:a2:18:b5:ff:b0:f2:45:eb:79:63:
    27:cc:87:49:3e:dd:13:42:34:ed:5f:3b:14:a4:c4:
    d9:23:74:59:7f:64:a6:d3:cb:2c:10:f0:cd:2d:57:
    e9:9f:58:c8:d2:8f:20:49:d1:43:3c:c4:bd:67:70:
    17:ad:1b:dd:1c:83:cf:b8:fb:7e:8c:8f:dc:f0:b4:
    fb:77:de:7b:82:85:74:9c:ed:fb:fd:68:78:f7:f7:
    93:00:73:f0:f4:2a:dd:cb:a8:38:5d:7e:d0:5c:df:
    ca:a2:a2:ba:75:76:01:72:3a:96:20:1f:ec:cc:2e:
    65:c6:5e:14:f6:5f:1d:34:d6:ec:df:e3:f8:54:01:
    80:01
prime1:
    00:e1:d1:63:89:bf:6e:ff:7a:e4:4f:65:71:06:ed:
    81:c8:1a:48:b5:fb:35:6f:83:dd:4a:22:9e:86:54:
    bd:c0:36:71:6b:bd:9d:46:df:d1:49:81:32:54:50:
    54:95:8a:ca:5c:fd:a7:09:d9:7c:c8:c6:a9:e9:20:
    3d:05:f7:b9:d4:5e:68:5a:19:a5:f5:82:67:fc:b1:
    7f:cf:50:2b:32:cf:ed:b9:4c:ae:a5:8e:e5:f6:3e:
    ba:5f:33:d0:99:46:c8:65:21:32:34:44:10:d3:d6:
    58:74:8b:ca:e2:56:f2:48:96:c2:a9:ad:93:40:d3:
    c8:39:26:52:da:8e:d7:34:6d
prime2:
    00:ca:e1:55:c9:b3:a4:54:6b:5f:c3:cf:4c:c8:0d:
    53:9d:53:1c:40:6b:ac:5e:d8:28:18:e9:77:b4:96:
    f9:f6:14:ce:fb:11:79:e3:bf:bf:ab:22:bc:a7:f8:
    8e:bb:8c:9b:13:27:ae:70:11:32:42:df:f0:86:63:
    70:b6:c7:67:82:db:d5:0d:be:1f:ee:9b:33:16:b9:
    aa:c7:ba:bb:7c:fa:0a:9e:f2:6c:3c:97:6c:f6:2d:
    a8:f4:1e:fe:06:54:58:dc:7c:1c:bc:a7:8f:b1:cb:
    4f:f7:aa:50:d1:16:ce:16:40:95:6a:4e:89:ea:df:
    52:93:fa:a1:3a:23:49:f4:21
exponent1:
    00:bc:3b:93:32:4e:6d:92:ee:78:83:aa:36:66:24:
    f2:8a:bf:46:1e:d3:b0:be:2c:f7:f8:05:15:89:39:
    f8:15:d2:0c:07:58:3e:52:c6:dc:a8:dd:d5:fb:2c:
    1e:e5:ac:94:74:a1:47:6c:d1:6a:cf:dd:b1:e2:4e:
    ea:2f:20:49:39:ba:1c:58:06:8b:2d:34:2f:c4:16:
    9d:48:4d:36:45:1b:c7:b8:2f:30:61:76:d5:3f:c7:
    18:09:a5:a2:5b:32:02:77:32:0d:ac:3d:94:9d:50:
    4d:d9:90:71:64:ec:3e:f7:bd:1b:b4:de:a8:21:60:
    a7:c4:e3:aa:2a:de:e8:8a:9d
exponent2:
    29:15:e9:21:a7:d7:a7:a0:f7:0b:d8:77:5c:2c:16:
    ba:cd:91:f3:19:db:16:79:ff:e4:cb:a3:0a:57:68:
    d7:84:ef:45:b9:0c:4e:2b:0e:cd:c1:83:23:21:1b:
    06:b0:3a:d7:6e:39:cd:48:2e:3d:8c:cc:50:ea:e2:
    70:a1:81:3c:e6:f8:06:88:72:3f:07:ff:18:a3:11:
    0a:d1:ae:16:69:2c:ad:73:ba:a7:aa:a2:ce:58:00:
    d7:2f:4f:92:48:92:96:54:2c:1d:a8:71:59:38:2b:
    41:a4:a4:29:33:cd:18:84:8b:bd:b3:9a:0a:8e:9f:
    52:88:77:0e:27:07:5b:01
coefficient:
    3a:b4:e3:b8:41:ab:23:45:15:bf:0a:8d:2e:40:fb:
    6e:95:38:97:02:d8:34:47:4e:9a:d8:49:12:4d:c6:
    c1:d3:42:73:8d:4e:75:10:26:5d:f6:b7:44:eb:aa:
    4a:88:a7:99:53:46:be:ef:04:7d:b0:24:ce:8b:2a:
    4e:39:23:b0:56:63:89:94:8a:b0:bb:b0:31:87:97:
    70:da:14:f4:41:8a:eb:75:ae:98:34:91:22:a2:d9:
    53:51:17:b0:5b:ef:93:8a:12:11:a3:be:6e:88:29:
    57:bc:2a:5f:1d:e5:ca:50:c2:6f:42:ee:0a:38:3a:
    2a:2b:63:40:d5:2e:1a:36

The fields are the same ones we found in the ASN.1 structure, but in this representation we have a better view of the specific values of the RSA key. You can compare the two outputs and see that the values of the fields are the same.

If you want to learn something about RSA try to investigate the historical reasons behind the choice of 65537 as a common public exponent (as you can see in the publicExponent section here).

PKCS #8 vs PKCS #1

The first version of the PKCS standard (PKCS #1) was specifically tailored to contain an RSA key. Its ASN.1 definition can be found in RFC 8017 ("PKCS #1: RSA Cryptography Specifications Version 2.2")

RSAPublicKey ::= SEQUENCE {
    modulus           INTEGER,  -- n
    publicExponent    INTEGER   -- e
}

RSAPrivateKey ::= SEQUENCE {
    version           Version,
    modulus           INTEGER,  -- n
    publicExponent    INTEGER,  -- e
    privateExponent   INTEGER,  -- d
    prime1            INTEGER,  -- p
    prime2            INTEGER,  -- q
    exponent1         INTEGER,  -- d mod (p-1)
    exponent2         INTEGER,  -- d mod (q-1)
    coefficient       INTEGER,  -- (inverse of q) mod p
    otherPrimeInfos   OtherPrimeInfos OPTIONAL
}

Subsequently, as the need to describe new types of algorithms increased, the PKCS #8 standard was developed. This can contain different types of keys, and defines a specific field for the algorithm identifier. Its ASN.1 definition can be found in RFC 5958 ("Asymmetric Key Packages")

OneAsymmetricKey ::= SEQUENCE {
     version                   Version,
     privateKeyAlgorithm       PrivateKeyAlgorithmIdentifier,
     privateKey                PrivateKey,
     attributes            [0] Attributes OPTIONAL,
     ...,
     [[2: publicKey        [1] PublicKey OPTIONAL ]],
     ...
   }

PrivateKey ::= OCTET STRING
                     -- Content varies based on type of key. The
                     -- algorithm identifier dictates the format of
                     -- the key.

The definition of the PrivateKey field for the RSA algorithm is the same used in PKCS #1.

If the PEM format uses PKCS #8 its header and footer are

-----BEGIN PRIVATE KEY-----
[...]
-----END PRIVATE KEY-----

If it uses PKCS #1, however, there has to be an external identification of the algorithm, so the header and footer are

-----BEGIN RSA PRIVATE KEY-----
[...]
-----END RSA PRIVATE KEY-----

The structure of PKCS #8 is the reason why we had to parse the field at offset 22 to access the RSA parameters when using the asn1parse module of OpenSSL. If you are parsing a PKCS #1 key in PEM format you don't need this second step.

Private and public key

In the RSA algorithm the public key is built from the modulus and the public exponent, which means that we can always derive the public key from the private key. OpenSSL can easily do this with the rsa module, producing the public key in PEM format

$ openssl rsa -in private.pem -pubout
writing RSA key
-----BEGIN PUBLIC KEY-----
MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAsvX9P58JFxEs5C+L+H7W
duFSWL5EPzber7C2m94klrSV6q0bAcrYQnGwFOlveThsY200hRbadKaKjHD7qIKH
DEe0IY2PSRht33Jye52AwhkRw+M3xuQH/7R8LydnsNFk2KHpr5X2SBv42e37LjkE
slKSaMRgJW+v0KZ30piY8QsdFRKKaVg5/Ajt1YToM1YVsdHXJ3vmXFMtypLdxwUD
dIaLEX6pFUkU75KSuEQ/E2luT61Q3ta9kOWm9+0zvi7OMcbdekJT7mzcVnh93R1c
13ZhQCLbh9A7si8jKFtaMWevjayrvqQABEcTN9N4Hoxcyg6l4neZtRDk75OMYcqm
DQIDAQAB
-----END PUBLIC KEY-----

You can dump the information in the public key specifying the -pubin flag

$ openssl rsa -in public.pem -noout -text -pubin
Public-Key: (2048 bit)
Modulus:
    00:b2:f5:fd:3f:9f:09:17:11:2c:e4:2f:8b:f8:7e:
    d6:76:e1:52:58:be:44:3f:36:de:af:b0:b6:9b:de:
    24:96:b4:95:ea:ad:1b:01:ca:d8:42:71:b0:14:e9:
    6f:79:38:6c:63:6d:34:85:16:da:74:a6:8a:8c:70:
    fb:a8:82:87:0c:47:b4:21:8d:8f:49:18:6d:df:72:
    72:7b:9d:80:c2:19:11:c3:e3:37:c6:e4:07:ff:b4:
    7c:2f:27:67:b0:d1:64:d8:a1:e9:af:95:f6:48:1b:
    f8:d9:ed:fb:2e:39:04:b2:52:92:68:c4:60:25:6f:
    af:d0:a6:77:d2:98:98:f1:0b:1d:15:12:8a:69:58:
    39:fc:08:ed:d5:84:e8:33:56:15:b1:d1:d7:27:7b:
    e6:5c:53:2d:ca:92:dd:c7:05:03:74:86:8b:11:7e:
    a9:15:49:14:ef:92:92:b8:44:3f:13:69:6e:4f:ad:
    50:de:d6:bd:90:e5:a6:f7:ed:33:be:2e:ce:31:c6:
    dd:7a:42:53:ee:6c:dc:56:78:7d:dd:1d:5c:d7:76:
    61:40:22:db:87:d0:3b:b2:2f:23:28:5b:5a:31:67:
    af:8d:ac:ab:be:a4:00:04:47:13:37:d3:78:1e:8c:
    5c:ca:0e:a5:e2:77:99:b5:10:e4:ef:93:8c:61:ca:
    a6:0d
Exponent: 65537 (0x10001)

Generating key pairs with OpenSSL

If you want to generate an RSA private key you can do it with OpenSSL

$ openssl genpkey -algorithm RSA -out private.pem -pkeyopt rsa_keygen_bits:2048
......................................................................+++
..........+++ 

Since OpenSSL is a collection of modules, we specify genpkey to generate a private key. The -algorithm option specifies which algorithm we want to use to generate the key (RSA in this case), -out specifies the name of the output file, and -pkeyopt allows us to set the value for specific key options, in this case the length of the RSA key in bits.

If you want an encrypted key you can generate one specifying the cipher (for example -aes-256-cbc)

$ openssl genpkey -algorithm RSA -out private-enc.pem -aes-256-cbc -pkeyopt rsa_keygen_bits:2048
...........................+++
..........+++
Enter PEM pass phrase:
Verifying - Enter PEM pass phrase:

You can see the list of supported ciphers with openssl list-cipher-algorithms. In both cases you can then extract the public key with the method shown previously. OpenSSL private keys are created using PKCS #8, so unencrypted keys will be in the form

-----BEGIN PRIVATE KEY-----
[...]
-----END PRIVATE KEY-----

and encrypted ones in the form

-----BEGIN ENCRYPTED PRIVATE KEY-----
[...]
-----END ENCRYPTED PRIVATE KEY-----

Generating key pairs with OpenSSH

Another tool that you can use to generate key pairs is ssh-keygen, which is included in the SSH suite and is specifically used to create and manage SSH keys. Since SSH keys are standard asymmetric keys, we can use the tool to create keys for other purposes as well.

To create a key pair just run

ssh-keygen -t rsa -b 2048 -f key

The -t option specifies the key generation algorithm (RSA in this case), while the -b option specifies the length of the key in bits.

The -f option sets the name of the output file. If not present, ssh-keygen will ask the name of the file, offering to save it to the default file ~/.ssh/id_rsa. The tool always asks for a password to encrypt the key, but you are allowed to enter an empty one to skip the encryption.

This tool creates two files. One is the private key file, named as requested, and the second is the public key file, named like the private key one but with a .pub extension.

OpenSSH private keys are generated using the PKCS #1 format, so the key will be in the form

-----BEGIN RSA PRIVATE KEY-----
[...]
-----END RSA PRIVATE KEY-----

The OpenSSH public key format

The public key saved by ssh-keygen is written in the so-called SSH format, which is not a standard in the cryptography world. Its structure is <algorithm> <key> <comment>, where the <key> part of the format is encoded with Base64.

For example

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCy9f0/nwkXESzkL4v4ftZ24VJYvkQ/Nt6vsLab3iSWtJXqrRsBythCcbAU6W95OGxjbTSFFtp0poqMcPuogocMR7QhjY9JGG3fcnJ7nYDCGRHD4zfG5Af/tHwvJ2ew0WTYoemvlfZIG/jZ7fsuOQSyUpJoxGAlb6/QpnfSmJjxCx0VEoppWDn8CO3VhOgzVhWx0dcne+ZcUy3Kkt3HBQN0hosRfqkVSRTvkpK4RD8TaW5PrVDe1r2Q5ab37TO+Ls4xxt16QlPubNxWeH3dHVzXdmFAItuH0DuyLyMoW1oxZ6+NrKu+pAAERxM303gejFzKDqXid5m1EOTvk4xhyqYN user@host

To manually decode the central part of the key you can run the following code

cat key.pub | cut -d " " -f2 | base64 -d | hexdump -ve '/1 "%02x "' -e '2/8 "\n"'

which in the previous case outputs something like

00 00 00 07 73 73 68 2d 72 73 61 00 00 00 03 01
00 01 00 00 01 01 00 b2 f5 fd 3f 9f 09 17 11 2c
e4 2f 8b f8 7e d6 76 e1 52 58 be 44 3f 36 de af
b0 b6 9b de 24 96 b4 95 ea ad 1b 01 ca d8 42 71
b0 14 e9 6f 79 38 6c 63 6d 34 85 16 da 74 a6 8a
8c 70 fb a8 82 87 0c 47 b4 21 8d 8f 49 18 6d df
72 72 7b 9d 80 c2 19 11 c3 e3 37 c6 e4 07 ff b4
7c 2f 27 67 b0 d1 64 d8 a1 e9 af 95 f6 48 1b f8
d9 ed fb 2e 39 04 b2 52 92 68 c4 60 25 6f af d0
a6 77 d2 98 98 f1 0b 1d 15 12 8a 69 58 39 fc 08
ed d5 84 e8 33 56 15 b1 d1 d7 27 7b e6 5c 53 2d
ca 92 dd c7 05 03 74 86 8b 11 7e a9 15 49 14 ef
92 92 b8 44 3f 13 69 6e 4f ad 50 de d6 bd 90 e5
a6 f7 ed 33 be 2e ce 31 c6 dd 7a 42 53 ee 6c dc
56 78 7d dd 1d 5c d7 76 61 40 22 db 87 d0 3b b2
2f 23 28 5b 5a 31 67 af 8d ac ab be a4 00 04 47
13 37 d3 78 1e 8c 5c ca 0e a5 e2 77 99 b5 10 e4
ef 93 8c 61 ca a6 0d

The structure of this binary file is pretty simple, and is described in two different RFCs. RFC 4253 ("SSH Transport Layer Protocol") states in section 6.6 that

The "ssh-rsa" key format has the following specific encoding:

      string    "ssh-rsa"
      mpint     e
      mpint     n

while the definition of the string and mpint types can be found in RFC 4251 ("SSH Protocol Architecture"), section 5

string
    [...] They are stored as a uint32 containing its length (number of bytes that follow) and zero (= empty string) or more bytes that are the value of the string. Terminating null characters are not used. [...]

mpint
    Represents multiple precision integers in two's complement format, stored as a string, 8 bits per byte, MSB first. [...]

This means that the above sequence of bytes is interpreted as 4 bytes of length (32 bits of the uint32 type) followed by that number of bytes of content.

(4 bytes)   00 00 00 07          = 7
(7 bytes)   73 73 68 2d 72 73 61 = "ssh-rsa" (US-ASCII)
(4 bytes)   00 00 00 03          = 3
(3 bytes)   01 00 01             = 65537 (a common value for the RSA exponent)
(4 bytes)   00 00 01 01          = 257
(257 bytes) 00 b2 .. ca a6 0d    = The key modulus

Please note that since we created a key of 2048 bits we should have a modulus of 256 bytes. Instead this key uses 257 bytes prefixing the number with a 00 byte to avoid it being interpreted as negative (two's complement format).
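This sign-byte rule is easy to reproduce in Python. The value below is made up just to have a modulus-sized number with its top bit set, which is always true for a real 2048-bit modulus:

```python
# A 2048-bit modulus always has its top bit set, so a plain 256-byte
# big-endian encoding would be read as negative in two's complement.
# Prepending one 0x00 byte (257 bytes total) keeps it positive.
n = (1 << 2047) | 0x1234  # illustrative value with the top bit set

raw = n.to_bytes(256, "big")
assert int.from_bytes(raw, "big", signed=True) < 0  # misread as negative

padded = b"\x00" + raw
assert int.from_bytes(padded, "big", signed=True) == n  # correct

print(len(padded))  # prints: 257
```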

The structure shown above is the reason why all RSA public SSH keys start with the same 12 characters AAAAB3NzaC1y. This string, decoded from Base64, gives the initial 9 bytes 00 00 00 07 73 73 68 2d 72 (Base64 characters are not a one-to-one mapping of the source bytes). If the exponent is the standard 65537 the key starts with AAAAB3NzaC1yc2EAAAADAQAB, which decodes to the first 18 bytes 00 00 00 07 73 73 68 2d 72 73 61 00 00 00 03 01 00 01.
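The length-prefixed layout makes the blob easy to parse programmatically. A minimal sketch (the helper name parse_ssh_rsa and the hand-built sample blob are made up for the example; a real key's modulus field would be 257 bytes, not 1):

```python
import base64
import struct

def parse_ssh_rsa(b64_blob):
    """Split an ssh-rsa public key blob into its length-prefixed fields."""
    data = base64.b64decode(b64_blob)
    fields = []
    while data:
        # Each field is a big-endian uint32 length followed by that many bytes
        length = struct.unpack(">I", data[:4])[0]
        fields.append(data[4:4 + length])
        data = data[4 + length:]
    return fields

# Build a tiny blob by hand: algorithm name, e = 65537, a fake 1-byte modulus
blob = b"".join(struct.pack(">I", len(f)) + f
                for f in (b"ssh-rsa", b"\x01\x00\x01", b"\x00"))

algo, e, n = parse_ssh_rsa(base64.b64encode(blob))
print(algo.decode(), int.from_bytes(e, "big"))  # prints: ssh-rsa 65537
```

The same function applied to the <key> part of a real .pub file returns the algorithm string, the exponent, and the modulus shown in the byte-by-byte breakdown above.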

Converting between PEM and OpenSSH format

We often need to convert files created with one tool to a different format, so this is a list of the most common conversions you might need. I prefer to consider the key format instead of the source tool, but I give a short description of the reason why you should want to perform the conversion.

PEM/PKCS#1 to PEM/PKCS#8

This is useful to convert OpenSSH private keys to a newer format.

openssl pkcs8 -topk8 -inform PEM -outform PEM -in pkcs1.pem -out pkcs8.pem

OpenSSH public to PEM/PKCS#8

To convert public OpenSSH keys in a proper PEM format (prints to stdout)

ssh-keygen -e -f public.pub -m PKCS8

This is easy to remember because -e stands for export.

PEM/PKCS#8 to OpenSSH public

If you need to use in SSH a key pair created with another system

ssh-keygen -i -f public.pem -m PKCS8

This is easy to remember because -i stands for import.

Reading RSA keys in Python

In Python you can use the pycrypto package to access a PEM file containing an RSA key with the RSA.importKey function. Now you can hopefully understand the documentation that says

externKey (string) - The RSA key to import, encoded as a string.

An RSA public key can be in any of the following formats:
    * X.509 subjectPublicKeyInfo DER SEQUENCE (binary or PEM encoding)
    * PKCS#1 RSAPublicKey DER SEQUENCE (binary or PEM encoding)
    * OpenSSH (textual public key only)

An RSA private key can be in any of the following formats:
    * PKCS#1 RSAPrivateKey DER SEQUENCE (binary or PEM encoding)
    * PKCS#8 PrivateKeyInfo DER SEQUENCE (binary or PEM encoding)
    * OpenSSH (textual public key only)

For details about the PEM encoding, see RFC1421/RFC1423.

In case of PEM encoding, the private key can be encrypted with DES or 3TDES
according to a certain pass phrase. Only OpenSSL-compatible pass phrases are
supported.

In practice what you can do with a private.pem file is

from Crypto.PublicKey import RSA

f = open('private.pem', 'r')
key = RSA.importKey(f.read())

and the key variable will contain an instance of _RSAobj (not a very pythonic name, to be honest). This instance contains the RSA parameters as attributes as stated in the documentation

modulus = key.n
public_exponent = key.e
private_exponent = key.d
first_prime_number = key.p
second_prime_number = key.q
q_inv_crt = key.u

Final words

I keep finding on StackOverflow (and on other boards) messages of users that are confused by RSA keys, the output of the various tools, and by the subtle but important differences between the formats, so I hope this post helped you to get a better understanding of the matter.

Feedback

Feel free to reach me on Twitter if you have questions. The GitHub issues page is the best place to submit corrections.


Mike Driscoll: PyDev of the Week: Jessica Garson


This week we welcome Jessica Garson (@jessicagarson) as our PyDev of the Week! Jessica is a developer advocate at Twitter. She also teaches Python at New York University. You can see some of what she’s up to over on Github. Let’s spend some time getting to know her better!

Can you tell us a little about yourself (hobbies, education, etc):

I’m currently a Developer Advocate at Twitter, where I work to make sure developers have good experiences using the Twitter API. What that means is that I write example code, speak at conferences and create blog posts. I also make noise music with Python and perform regularly in the New York area under the artist name, Messica Arson. Before working in technology, I worked on political campaigns.

Why did you start using Python?

I started learning how to code on my own in 2010, which proved to be very difficult. I was working at a political data consulting company, and all of the backend code was written in Perl so I started reading a book on Perl. A coworker saw my book and pulled me aside and mentioned that if he were learning how to code today, he’d learn Python. Shortly thereafter, I found a community group in Washington, DC called Hear me Code which was free beginner-friendly classes for women by women.

What other programming languages do you know and which is your favorite?

I’ve been growing my skills in JavaScript lately. I’m excited to learn more about TensorFlow.js. In the past year, I’ve grown my skills in R quite a bit as well. I also make music sometimes using Ruby and Haskell.

What projects are you working on now?

I’ve been excited about the R package, reticulate which allows you to run your favorite Python package in R. I recently figured out how to run the Python package for search-tweets-python inside of R.

Which Python libraries are your favorite (core or 3rd party)?

FoxDot, the Python wrapper for SuperCollider that allows me to make weird danceable noise music with Python. I perform monthly or so in the NYC area and to think that I do that with my favorite programming language is exciting. I’d like to record a new album sometime this year.

How did you get involved in organizing tech conferences/meetups?

I started teaching people how to code pretty shortly after I started learning myself. The environment that I got started in was so supportive. I was so proud of what I was building that I wanted to share what I was creating with others so I started speaking at meetups and conferences locally. I realized at some point that I could also run and curate these events too.

I see you used to teach Python at NYU. Any exciting experiences you would like to share?

Teaching was an incredibly empowering experience. It was so exciting to watch my students learn and grow during our time together. Recently I ran into a former student at PyGotham. It was great to see someone who I taught who is now teaching Python classes on machine learning.

Thanks for doing the interview, Jessica!

The post PyDev of the Week: Jessica Garson appeared first on The Mouse Vs. The Python.

Stack Abuse: Default Arguments in Python Functions


Functions in Python are used to implement logic that you want to execute repeatedly at different places in your code. You can pass data to these functions via function arguments. In addition to passing arguments to functions via a function call, you can also set default argument values in Python functions. These default values are assigned to function arguments if you do not explicitly pass a parameter value to the given argument. Parameters are the values actually passed to function arguments.

In this article, you will see how to use default arguments in Python functions. But first, we will see how to define a function in Python and how to explicitly pass values to function arguments.

Function without Arguments

Let's define a very simple Python function without any arguments:

def my_function():
    print("This is a function without arguments")

The above script defines a function, my_function, which doesn't accept any arguments and simply prints a string.

The following script shows how you'd actually call the my_function() function:

my_function()

In the output, you should see a simple statement printed to the screen by the my_function() function:

This is a function without arguments

Function with Explicit Arguments

Let's now define a simple Python function where we have to pass multiple values for the function arguments. If you do not specify values for all the function arguments, you will see an error.

Here is the function we'll be using as an example:

def func_args(integer1, integer2):
    result = integer1 + integer2
    return result

In the code above we create a function, func_args(), with two arguments integer1 and integer2. The function adds the values passed in the two arguments and returns the result to the function caller.

Let's try calling the above function with two arguments:

result = func_args(10, 20)
print(result)

The above script calls the func_args() method with two parameter values, i.e. 10 and 20. In the output, you should see the sum of these two values, i.e. 30.

Let's now try to call the func_args() method without passing values for the arguments:

result = func_args()
print(result)

In the output, you should see the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-13-3449c8e5e188> in <module>
----> 1 result = func_args()
      2 print(result)

TypeError: func_args() missing 2 required positional arguments: 'integer1' and 'integer2'

The error is quite clear, the function call to func_args() is missing the 2 required positional arguments, integer1 and integer2. The error basically tells us that we need to pass values for the integer1 and integer2 arguments via the function call.

Let's now pass a value for one of the arguments and see what happens:

result = func_args(10)
print(result)

Now in the output, you should again see the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-640ec7b786e1> in <module>
----> 1 result = func_args(10)
      2 print(result)

TypeError: func_args() missing 1 required positional argument: 'integer2'

The difference here is that the error now tells us that the value for one of the positional arguments, i.e. integer2, is missing. This means that without any default argument values set, you have to pass values explicitly for all the function arguments, otherwise an error will be thrown.

What if you want your function to execute with or without the argument values in the function call? This is where default arguments in Python functions come in to play.

Function with Default Arguments

Default arguments in Python functions are those arguments that take default values if no explicit values are passed to these arguments from the function call. Let's define a function with one default argument.

def find_square(integer1=2):
    result = integer1 * integer1
    return result

The above script defines a function find_square() with one default argument i.e. integer1. The default value for the integer1 argument is set to 2. If you call the find_square() method with a value for the integer1 argument, the find_square() function will return the square of that value.

Otherwise, if you do not pass any value for the integer1 argument of the find_square() function, you will see that the default value, i.e. 2, will be assigned to integer1, and the function will return the square of 2, i.e. 4.

Let's first call the find_square() method with the argument value of 10:

result = find_square(10)
print(result)

Output:

100

When you execute the above script, the value 10 overwrites the default value of 2 for the integer1 argument of the find_square() function, and the function returns 100, which is the square of 10.

Now we will call the find_square() function without any value for the integer1 argument. In this case, you will see that the find_square() function returns 4, since in the absence of an explicit value the default of 2 is used for integer1, as shown below:

result = find_square()
print(result)

Output:

4

A Python function can have multiple default arguments as well. For instance, in the following script, the function adds the integer values passed to its arguments. If no integer values are passed to the function, the default arguments take the values 2 and 4 respectively, as shown below:

def add_ints(integer1=2, integer2=4):
    result = integer1 + integer2
    return result

Let's first call the add_ints() function without any parameters:

result = add_ints()
print(result)

Output:

6

Since we did not pass any values for the function arguments, the default argument values, i.e 2 and 4, have been added together.

Let's now pass two of our own values to the add_ints() function:

result = add_ints(4, 8)
print(result)

Output:

12

As expected, 4 and 8 were added together to return 12.

A Python function can have both normal (explicit) and default arguments at the same time. Let's create a function take_power(). The first argument to the function is a normal argument while the second argument is a default argument with a value of 2. The function returns the result of the value in the first argument raised to the power of value in the second argument.

def take_power(integer1, integer2=2):
    result = 1
    for i in range(integer2):
        result = result * integer1

    return result

Let's first pass only a single argument:

result = take_power(3)
print(result)

Output:

9

In the script above, 3 has been passed as a value to the integer1 argument of the take_power() function. No value has been provided for the default argument integer2. Hence, the default value of 2 will be used to take the power of 3 and you will see 9 in the output.

Let's now pass two values to the take_power() function.

result = take_power(3, 4)
print(result)

In the output, you will see 3 raised to the fourth power, i.e. 81.

It's important to note that parameters with default arguments cannot be followed by parameters with no default argument. Take the following function for example:

def take_power(integer1=2, integer2):
    result = 1
    for i in range(integer2):
        result = result * integer1

    return result

Merely defining this function results in an error, since a parameter with a default value (integer1) is followed by one without (integer2); the call below is never even reached:

result = take_power(3, 4)
print(result)

Executing this code results in the following error:

  File "<ipython-input-14-640ec7b786e1>", line 1
    def take_power(integer1=2, integer2):
                               ^
SyntaxError: non-default argument follows default argument

Catalin George Festila: Python 3.5.2 : Detect motion and save images with opencv.

This script is simple to use with a webcam or to parse a video file. The main goal of the script is to detect the difference between successive frames of a video or webcam output. The first frame of our video file contains no motion, just background, and the absolute difference is then computed against it. There is no need to process the large, raw images straight from the video stream and this is the reason

Real Python: How to Do a Binary Search in Python


Binary search is a classic algorithm in computer science. It often comes up in programming contests and technical interviews. Implementing binary search turns out to be a challenging task, even when you understand the concept. Unless you’re curious or have a specific assignment, you should always leverage existing libraries to do a binary search in Python or any other language.

In this tutorial, you’ll learn how to:

  • Use the bisect module to do a binary search in Python
  • Implement a binary search in Python both recursively and iteratively
  • Recognize and fix defects in a binary search Python implementation
  • Analyze the time-space complexity of the binary search algorithm
  • Search even faster than binary search

This tutorial assumes you’re a student or an intermediate programmer with an interest in algorithms and data structures. At the very least, you should be familiar with Python’s built-in data types, such as lists and tuples. In addition, some familiarity with recursion, classes, data classes, and lambdas will help you better understand the concepts you’ll see in this tutorial.

Below you’ll find a link to the sample code you’ll see throughout this tutorial, which requires Python 3.7 or later to run:

Get Sample Code: Click here to get the sample code you'll use to learn about binary search in Python in this tutorial.

Benchmarking

In the next section of this tutorial, you’ll be using a subset of the Internet Movie Database (IMDb) to benchmark the performance of a few search algorithms. This dataset is free of charge for personal and non-commercial use. It’s distributed as a bunch of compressed tab-separated values (TSV) files, which get daily updates.

To make your life easier, you can use a Python script included in the sample code. It’ll automatically fetch the relevant file from IMDb, decompress it, and extract the interesting pieces:

$ python download_imdb.py
Fetching data from IMDb...
Created "names.txt" and "sorted_names.txt"

Be warned that this will download and extract approximately 600 MB of data, as well as produce two additional files, which are about half of that in size. The download, as well as the processing of this data, might take a minute or two to complete.

Download IMDb

To manually obtain the data, navigate your web browser to https://datasets.imdbws.com/ and grab the file called name.basics.tsv.gz, which contains the records of actors, directors, writers, and so on. When you decompress the file, you’ll see the following content:

nconst     primaryName      birthYear  deathYear  (...)
nm0000001  Fred Astaire     1899       1987       (...)
nm0000002  Lauren Bacall    1924       2014       (...)
nm0000003  Brigitte Bardot  1934       \N         (...)
nm0000004  John Belushi     1949       1982       (...)

It has a header with the column names in the first line, followed by data records in each of the subsequent lines. Each record contains a unique identifier, a full name, birth year, and a few other attributes. These are all delimited with a tab character.

There are millions of records, so don’t try to open the file with a regular text editor to avoid crashing your computer. Even specialized software such as spreadsheets can have problems opening it. Instead, you might take advantage of the high-performance data grid viewer included in JupyterLab, for example.

Read Tab-Separated Values

There are a few ways to parse a TSV file. For example, you can read it with Pandas, use a dedicated application, or leverage a few command-line tools. However, it’s recommended that you use the hassle-free Python script included in the sample code.

Note: As a rule of thumb, you should avoid parsing files manually because you might overlook edge cases. For example, in one of the fields, the delimiting tab character could be used literally inside quotation marks, which would break the number of columns. Whenever possible, try to find a relevant module in the standard library or a trustworthy third-party one.
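For example, the standard-library csv module can parse tab-delimited data once you set the delimiter. Here's a minimal sketch that reads an in-memory sample laid out like the name.basics.tsv file shown above (the sample content is made up for illustration):

```python
import csv
import io

# A tiny sample in the IMDb name.basics.tsv layout (tab-delimited).
sample = "nconst\tprimaryName\tbirthYear\nnm0000001\tFred Astaire\t1899\n"

with io.StringIO(sample) as tsv_file:
    reader = csv.DictReader(tsv_file, delimiter="\t")
    names = [row["primaryName"] for row in reader]

print(names)  # ['Fred Astaire']
```

With a real file, you'd pass an open file object instead of the io.StringIO wrapper.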

Ultimately, you want to end up with two text files at your disposal:

  1. names.txt
  2. sorted_names.txt

One will contain a list of names obtained by cutting out the second column from the original TSV file:

Fred Astaire
Lauren Bacall
Brigitte Bardot
John Belushi
Ingmar Bergman
...

The second one will be the sorted version of this.

Once both files are ready, you can load them into Python using this function:

def load_names(path):
    with open(path) as text_file:
        return text_file.read().splitlines()

names = load_names('names.txt')
sorted_names = load_names('sorted_names.txt')

This code returns a list of names pulled from the given file. Note that calling .splitlines() on the resulting string removes the trailing newline character from each line. As an alternative, you could call text_file.readlines(), but that would keep the unwanted newlines.

Measure the Execution Time

To evaluate the performance of a particular algorithm, you can measure its execution time against the IMDb dataset. This is usually done with the help of the built-in time or timeit modules, which are useful for timing a block of code.

You could also define a custom decorator to time a function if you wanted to. The sample code provided uses time.perf_counter_ns(), introduced in Python 3.7, because it offers high precision in nanoseconds.
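One possible shape for such a decorator (a sketch, not the one shipped with the sample code) wraps a function call between two readings of time.perf_counter_ns():

```python
import functools
import time

def timed(function):
    """Print how long each call takes, in nanoseconds."""
    @functools.wraps(function)
    def wrapper(*args, **kwargs):
        start = time.perf_counter_ns()
        try:
            return function(*args, **kwargs)
        finally:
            elapsed = time.perf_counter_ns() - start
            print(f"{function.__name__}() took {elapsed} ns")
    return wrapper

@timed
def slow_sum(numbers):
    return sum(numbers)

slow_sum(range(1_000))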

Understanding Search Algorithms

Searching is ubiquitous and lies at the heart of computer science. You probably did several web searches today alone, but have you ever wondered what searching really means?

Search algorithms take many different forms.

In this tutorial, you’ll learn about searching for an element in a sorted list of items, like a phone book. When you search for such an element, you might be asking one of the following questions:

Question          Answer
Is it there?      Yes
Where is it?      On the 42nd page
Which one is it?  A person named John Doe

The answer to the first question tells you whether an element is present in the collection. It always holds either true or false. The second answer is the location of an element within the collection, which may be unavailable if that element was missing. Finally, the third answer is the element itself, or a lack of it.

Note: Sometimes there might be more than one correct answer due to duplicate or similar items. For example, if you have a few contacts with the same name, then they will all fit your search criteria. At other times, there might only be an approximate answer or no answer at all.

In the most common case, you’ll be searching by value, which compares elements in the collection against the exact one you provide as a reference. In other words, your search criteria are the entire element, such as a number, a string, or an object like a person. Even the tiniest difference between the two compared elements won’t result in a match.

On the other hand, you can be more granular with your search criteria by choosing some property of an element, such as a person’s last name. This is called searching by key because you pick one or more attributes to compare. Before you dive into binary search in Python, let’s take a quick look at other search algorithms to get a bigger picture and understand how they work.

How might you look for something in your backpack? You might just dig your hand into it, pick an item at random, and see if it’s the one you wanted. If you’re out of luck, then you put the item back, rinse, and repeat. This example is a good way to understand random search, which is one of the least efficient search algorithms. The inefficiency of this approach stems from the fact that you’re running the risk of picking the same wrong thing multiple times.

Note: Funnily enough, this strategy could be the most efficient one, in theory, if you were very lucky or had a small number of elements in the collection.

The fundamental principle of this algorithm can be expressed with the following snippet of Python code:

import random

def find(elements, value):
    while True:
        random_element = random.choice(elements)
        if random_element == value:
            return random_element

The function loops until some element chosen at random matches the value given as input. However, this isn’t very useful because the function returns either None implicitly or the same value it already received in a parameter. You can find the full implementation in the sample code available for download at the link below:

Get Sample Code:Click here to get the sample code you'll use to learn about binary search in Python in this tutorial.
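The companion helpers contains() and find_index() used in the session below can be sketched along similar lines. These are hypothetical versions, not the downloadable ones; each index is tried at most once, so the functions also terminate when the value is missing:

```python
import random

def find_index(elements, value, key=None):
    """Randomly probe indices until a match is found; None if missing."""
    indices = list(range(len(elements)))
    random.shuffle(indices)  # visit indices in random order, each once
    for index in indices:
        element = elements[index] if key is None else key(elements[index])
        if element == value:
            return index
    return None

def contains(elements, value):
    return find_index(elements, value) is not None

fruits = ['orange', 'plum', 'banana', 'apple']
print(contains(fruits, 'banana'))    # True
print(find_index(fruits, 'banana'))  # 2
```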

For microscopic datasets, the random search algorithm appears to be doing its job reasonably fast:

>>> from search.random import *  # Sample code to download
>>> fruits = ['orange', 'plum', 'banana', 'apple']
>>> contains(fruits, 'banana')
True
>>> find_index(fruits, 'banana')
2
>>> find(fruits, key=len, value=4)
'plum'

However, imagine having to search like that through millions of elements! Here’s a quick rundown of a performance test that was done against the IMDb dataset:

Search Term    Element Index  Best Time  Average Time  Worst Time
Fred Astaire   0              0.74s      21.69s        43.16s
Alicia Monica  4,500,000      1.02s      26.17s        66.34s
Baoyin Liu     9,500,000      0.11s      17.41s        51.03s
missing        N/A            5m 16s     5m 40s        5m 54s

Unique elements at different memory locations were specifically chosen to avoid bias. Each term was searched for ten times to account for the randomness of the algorithm and other factors such as garbage collection or system processes running in the background.

Note: If you’d like to conduct this experiment yourself, then refer back to the instructions in the introduction to this tutorial. To measure the performance of your code, you can use the built-in time and timeit modules, or you can time functions with a custom decorator.

The algorithm has a non-deterministic performance. While the average time to find an element doesn’t depend on its whereabouts, the best and worst times are two to three orders of magnitude apart. It also suffers from inconsistent behavior. Consider having a collection of elements containing some duplicates. Because the algorithm picks elements at random, it’ll inevitably return different copies upon subsequent runs.

How can you improve on this? One way to address both issues at once is by using a linear search.

When you’re deciding what to have for lunch, you may be looking around the menu chaotically until something catches your eye. Alternatively, you can take a more systematic approach by scanning the menu from top to bottom and scrutinizing every item in a sequence. That’s linear search in a nutshell. To implement it in Python, you could enumerate() elements to keep track of the current element’s index:

def find_index(elements, value):
    for index, element in enumerate(elements):
        if element == value:
            return index

The function loops over a collection of elements in a predefined and consistent order. It stops when the element is found, or when there are no more elements to check. This strategy guarantees that no element is visited more than once because you’re traversing them in order by index.

Let’s see how well linear search copes with the IMDb dataset you used before:

Search Term    Element Index  Best Time  Average Time  Worst Time
Fred Astaire   0              491ns      1.17µs        6.1µs
Alicia Monica  4,500,000      0.37s      0.38s         0.39s
Baoyin Liu     9,500,000      0.77s      0.79s         0.82s
missing        N/A            0.79s      0.81s         0.83s

There’s hardly any variance in the lookup time of an individual element. The average time is virtually the same as the best and the worst one. Since the elements are always browsed in the same order, the number of comparisons required to find the same element doesn’t change.

However, the lookup time grows with the increasing index of an element in the collection. The further the element is from the beginning of the list, the more comparisons have to run. In the worst case, when an element is missing, the whole collection has to be checked to give a definite answer.

When you project experimental data onto a plot and connect the dots, then you’ll immediately see the relationship between element location and the time it takes to find it:

Linear Search Performance

All samples lie on a straight line and can be described by a linear function, which is where the name of the algorithm comes from. You can assume that, on average, the time required to find any element using a linear search is proportional to the number of elements in the collection. Linear search simply doesn't scale well as the amount of data to search increases.

For example, biometric scanners available at some airports wouldn’t recognize passengers in a matter of seconds, had they been implemented using linear search. On the other hand, the linear search algorithm may be a good choice for smaller datasets, because it doesn’t require preprocessing the data. In such a case, the benefits of preprocessing wouldn’t pay back its cost.

Python already ships with linear search, so there’s no point in writing it yourself. The list data structure, for example, exposes a method that will return the index of an element or raise an exception otherwise:

>>> fruits = ['orange', 'plum', 'banana', 'apple']
>>> fruits.index('banana')
2
>>> fruits.index('blueberry')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: 'blueberry' is not in list

This can also tell you if the element is present in the collection, but a more Pythonic way would involve using the versatile in operator:

>>> 'banana' in fruits
True
>>> 'blueberry' in fruits
False

It’s worth noting that despite using linear search under the hood, these built-in functions and operators will blow your implementation out of the water. That’s because they were written in pure C, which compiles to native machine code. The standard Python interpreter is no match for it, no matter how hard you try.

A quick test with the timeit module reveals that the Python implementation might run almost ten times slower than the equivalent native one:

>>> import timeit
>>> from search.linear import contains
>>> fruits = ['orange', 'plum', 'banana', 'apple']
>>> timeit.timeit(lambda: contains(fruits, 'blueberry'))
1.8904765040024358
>>> timeit.timeit(lambda: 'blueberry' in fruits)
0.22473459799948614

However, for sufficiently large datasets, even the native code will hit its limits, and the only solution will be to rethink the algorithm.

Note: The in operator doesn't always do a linear search. When you use it on a set, for example, it does a hash-based search instead. The operator can work with any iterable, including tuple, list, set, dict, and str. You can even support your custom classes with it by implementing the magic method .__contains__() to define the underlying logic.

In real-life scenarios, the linear search algorithm should usually be avoided. For example, there was a time I wasn’t able to register my cat at the vet clinic because their system kept crashing. The doctor told me he must eventually upgrade his computer because adding more records into the database makes it run slower and slower.

I remember thinking to myself at that point that the person who wrote that software clearly didn’t know about the binary search algorithm!

The word binary is generally associated with the number 2. In this context, it refers to dividing a collection of elements into two halves and throwing away one of them at each step of the algorithm. This can dramatically reduce the number of comparisons required to find an element. But there’s a catch—elements in the collection must be sorted first.

The idea behind it resembles the steps for finding a page in a book. At first, you typically open the book to a completely random page or at least one that’s close to where you think your desired page might be.

Occasionally, you’ll be fortunate enough to find that page on the first try. However, if the page number is too low, then you know the page must be to the right. If you overshoot on the next try, and the current page number is higher than the page you’re looking for, then you know for sure that it must be somewhere in between.

You repeat the process, but rather than choosing a page at random, you check the page located right in the middle of that new range. This minimizes the number of tries. A similar approach can be used in the number guessing game. If you haven’t heard of that game, then you can look it up on the Internet to get a plethora of examples implemented in Python.

Note: Sometimes, if the values are uniformly distributed, you can calculate the middle index with linear interpolation rather than taking the average. This variation of the algorithm will require even fewer steps.

The page numbers that restrict the range of pages to search through are known as the lower bound and the upper bound. In binary search, you commonly start with the first page as the lower bound and the last page as the upper bound. You must update both bounds as you go. For example, if the page you turn to is lower than the one you’re looking for, then that’s your new lower bound.

Let’s say you were looking for a strawberry in a collection of fruits sorted in ascending order by size:

Fruits In Ascending Order Of Their Size

On the first attempt, the element in the middle happens to be a lemon. Since it’s bigger than a strawberry, you can discard all elements to the right, including the lemon. You’ll move the upper bound to a new position and update the middle index:

Fruits In Ascending Order Of Their Size

Now, you’re left with only half of the fruits you began with. The current middle element is indeed the strawberry you were looking for, which concludes the search. If it wasn’t, then you’d just update the bounds accordingly and continue until they pass each other. For example, looking for a missing plum, which would go between the strawberry and a kiwi, will end with the following result:

Fruits In Ascending Order Of Their Size

Notice there weren’t that many comparisons that had to be made in order to find the desired element. That’s the magic of binary search. Even if you’re dealing with a million elements, you’d only require at most a handful of checks. This number won’t exceed the logarithm base two of the total number of elements due to halving. In other words, the number of remaining elements is reduced by half at each step.

This is possible because the elements are already sorted by size. However, if you wanted to find fruits by another key, such as a color, then you’d have to sort the entire collection once again. To avoid the costly overhead of sorting, you might try to compute different views of the same collection in advance. This is somewhat similar to creating a database index.

Consider what happens if you add, delete or update an element in a collection. For a binary search to continue working, you’d need to maintain the proper sort order. This can be done with the bisect module, which you’ll read about in the upcoming section.

You’ll see how to implement the binary search algorithm in Python later on in this tutorial. For now, let’s confront it with the IMDb dataset. Notice that there are different people to search for than before. That’s because the dataset must be sorted for binary search, which reorders the elements. The new elements are located roughly at the same indices as before, to keep the measurements comparable:

Search Term         Element Index  Average Time  Comparisons
(…) Berendse        0              6.52µs        23
Jonathan Samuangte  4,499,997      6.99µs        24
Yorgos Rahmatoulin  9,500,001      6.5µs         23
missing             N/A            7.2µs         23

The answers are nearly instantaneous. In the average case, it takes only a few microseconds for the binary search to find one element among all nine million! Other than that, the number of comparisons for the chosen elements remains almost constant, which coincides with the following formula:

The Formula For The Number Of Comparisons

Finding most elements will require the highest number of comparisons, which can be derived from a logarithm of the size of the collection. Conversely, there’s just one element in the middle that can be found on the first try with one comparison.
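Assuming the usual worst-case bound of floor(log2(n)) + 1 comparisons, the dataset's nine million names work out to 24, which lines up with the measurements above:

```python
import math

def max_comparisons(n):
    # Worst-case comparisons for binary search over n sorted elements.
    return math.floor(math.log2(n)) + 1

print(max_comparisons(9_000_000))  # 24
print(max_comparisons(1_000_000))  # 20
```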

Binary search is a great example of a divide-and-conquer technique, which partitions one problem into a bunch of smaller problems of the same kind. The individual solutions are then combined to form the final answer. Another well-known example of this technique is the quicksort algorithm.

Note: Don’t confuse divide-and-conquer with dynamic programming, which is a somewhat similar technique.

Unlike other search algorithms, binary search can be used beyond just searching. For example, it allows for set membership testing, finding the largest or smallest value, finding the nearest neighbor of the target value, performing range queries, and more.

If speed is a top priority, then binary search is not always the best choice. There are even faster algorithms that can take advantage of hash-based data structures. However, those algorithms require a lot of additional memory, whereas binary search offers a good space-time tradeoff.

To search faster, you need to narrow down the problem space. Binary search achieves that goal by halving the number of candidates at each step. That means that even if you have one million elements, it takes at most twenty comparisons to determine if the element is present, provided that all elements are sorted.

The fastest way to search is to know where to find what you’re looking for. If you knew the exact memory location of an element, then you’d access it directly without the need for searching in the first place. Mapping an element or (more commonly) one of its keys to the element location in memory is referred to as hashing.

You can think of hashing not as searching for the specific element, but instead computing the index based on the element itself. That's the job of a hash function, which needs to hold certain mathematical properties: it should be deterministic and spread values uniformly.

At the same time, it shouldn’t be too computationally expensive, or else its cost would outweigh the gains. A hash function is also used for data integrity verification as well as in cryptography.

A data structure that uses this concept to map keys into values is called a map, a hash table, a dictionary, or an associative array.

Note: Python has two built-in data structures, namely set and dict, which rely on the hash function to find elements. While a set hashes its elements, a dict uses the hash function against element keys. To find out exactly how a dict is implemented in Python, check out Raymond Hettinger’s conference talk on Modern Python Dictionaries.

Another way to visualize hashing is to imagine so-called buckets of similar elements grouped under their respective keys. For example, you may be harvesting fruits into different buckets based on color:

Fruits Grouped By Color

The coconut and a kiwi fruit go to the bucket labeled brown, while an apple ends up in a bucket with the red label, and so on. This allows you to glance through a fraction of the elements quickly. Ideally, you want to have only one fruit in each bucket. Otherwise, you get what’s known as a collision, which leads to extra work.

Note: The buckets, as well as their contents, are typically in no particular order.
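The bucket idea can be sketched with a plain dictionary that groups elements under a computed key (the fruit-to-color mapping here is a made-up stand-in for a real key function):

```python
from collections import defaultdict

fruits_by_color = {
    'apple': 'red', 'coconut': 'brown', 'kiwi': 'brown', 'cherry': 'red',
}

buckets = defaultdict(list)
for fruit, color in fruits_by_color.items():
    buckets[color].append(fruit)  # the color key selects the bucket

print(buckets['brown'])  # ['coconut', 'kiwi']
```

Two fruits landing in the same bucket is exactly the collision case described above: you then have to scan that (hopefully short) bucket.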

Let’s put the names from the IMDb dataset into a dictionary, so that each name becomes a key, and the corresponding value becomes its index:

>>> from benchmark import load_names  # Sample code to download
>>> names = load_names('names.txt')
>>> index_by_name = {
...     name: index for index, name in enumerate(names)
... }

After loading textual names into a flat list, you can enumerate() it inside a dictionary comprehension to create the mapping. Now, checking the element’s presence as well as getting its index is straightforward:

>>> 'Guido van Rossum' in index_by_name
False
>>> 'Arnold Schwarzenegger' in index_by_name
True
>>> index_by_name['Arnold Schwarzenegger']
215

Thanks to the hash function used behind the scenes, you don’t have to implement any search at all!

Here’s how the hash-based search algorithm performs against the IMDb dataset:

Search Term    Element Index  Best Time  Average Time  Worst Time
Fred Astaire   0              0.18µs     0.4µs         1.9µs
Alicia Monica  4,500,000      0.17µs     0.4µs         2.4µs
Baoyin Liu     9,500,000      0.17µs     0.4µs         2.6µs
missing        N/A            0.19µs     0.4µs         1.7µs

Not only is the average time an order of magnitude faster than the already fast binary search Python implementation, but the speed is also sustained across all elements regardless of where they are.

The price for that gain is approximately 0.5 GB of more memory consumed by the Python process, slower load time, and the need to keep that additional data consistent with dictionary contents. In turn, the lookup is very quick, while updates and insertions are slightly slower when compared to a list.

Another constraint that dictionaries impose on their keys is that they must be hashable, and their hash values can’t change over time. You can check if a particular data type is hashable in Python by calling hash() on it:

>>> key = ['round', 'juicy']
>>> hash(key)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

Mutable collections—such as a list, set, and dict—aren’t hashable. In practice, dictionary keys should be immutable because their hash value often depends on some attributes of the key. If a mutable collection was hashable and could be used as a key, then its hash value would be different every time the contents changed. Consider what would happen if a particular fruit changed color due to ripening. You’d be looking for it in the wrong bucket!

The hash function has many other uses. For example, it’s used in cryptography to avoid storing passwords in plain text form, as well as for data integrity verification.

Using the bisect Module

Binary search in Python can be performed using the built-in bisect module, which also helps with preserving a list in sorted order. It’s based on the bisection method for finding roots of functions. This module comes with six functions divided into two categories:

Find Index      Insert Element
bisect()        insort()
bisect_left()   insort_left()
bisect_right()  insort_right()

These functions allow you to either find an index of an element or add a new element in the right position. Those in the first row are just aliases for bisect_right() and insort_right(), respectively. In reality, you’re dealing with only four functions.

Note: It’s your responsibility to sort the list before passing it to one of the functions. If the elements aren’t sorted, then you’ll most likely get incorrect results.

Without further ado, let’s see the bisect module in action.

Finding an Element

To find the index of an existing element in a sorted list, use bisect_left():

>>> import bisect
>>> sorted_fruits = ['apple', 'banana', 'orange', 'plum']
>>> bisect.bisect_left(sorted_fruits, 'banana')
1

The output tells you that a banana is the second fruit on the list because it was found at index 1. However, if an element was missing, then you’d still get its expected position:

>>> bisect.bisect_left(sorted_fruits, 'apricot')
1
>>> bisect.bisect_left(sorted_fruits, 'watermelon')
4

Even though these fruits aren’t on the list yet, you can get an idea of where to put them. For example, an apricot should come between the apple and the banana, whereas a watermelon should become the last element. You’ll know if an element was found by evaluating two conditions:

  1. Is the index within the size of the list?

  2. Is the value of the element the desired one?

This can be translated to a universal function for finding elements by value:

def find_index(elements, value):
    index = bisect.bisect_left(elements, value)
    if index < len(elements) and elements[index] == value:
        return index

When there’s a match, the function will return the corresponding element index. Otherwise, it’ll return None implicitly.

To search by key, you have to maintain a separate list of keys. Since this incurs an additional cost, it’s worthwhile to calculate the keys upfront and reuse them as much as possible. You can define a helper class to be able to search by different keys without introducing much code duplication:

class SearchBy:
    def __init__(self, key, elements):
        self.elements_by_key = sorted([(key(x), x) for x in elements])
        self.keys = [x[0] for x in self.elements_by_key]

The key is a function passed as the first parameter to __init__(). Once you have it, you make a sorted list of key-value pairs to be able to retrieve an element from its key at a later time. Representing pairs with tuples guarantees that the first element of each pair will be sorted. In the next step, you extract the keys to make a flat list that’s suitable for your binary search Python implementation.

Then there’s the actual method for finding elements by key:

class SearchBy:
    def __init__(self, key, elements):
        ...

    def find(self, value):
        index = bisect.bisect_left(self.keys, value)
        if index < len(self.keys) and self.keys[index] == value:
            return self.elements_by_key[index][1]

This code bisects the list of sorted keys to get the index of an element by key. If such a key exists, then its index can be used to get the corresponding pair from the previously computed list of key-value pairs. The second element of that pair is the desired value.
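For example, you can search a list of fruits by their length using len as the key (the class body is repeated here so the snippet runs on its own):

```python
import bisect

class SearchBy:
    # Repeated from the class shown above so this snippet is self-contained.
    def __init__(self, key, elements):
        self.elements_by_key = sorted([(key(x), x) for x in elements])
        self.keys = [x[0] for x in self.elements_by_key]

    def find(self, value):
        index = bisect.bisect_left(self.keys, value)
        if index < len(self.keys) and self.keys[index] == value:
            return self.elements_by_key[index][1]

fruits = ['orange', 'plum', 'banana', 'apple']
by_length = SearchBy(len, fruits)
print(by_length.find(4))   # 'plum'
print(by_length.find(10))  # None
```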

Note: This is just an illustrative example. You’ll be better off using the recommended recipe, which is mentioned in the official documentation.

If you had multiple bananas, then bisect_left() would return the leftmost instance:

>>> sorted_fruits = [
...     'apple',
...     'banana', 'banana', 'banana',
...     'orange',
...     'plum'
... ]
>>> bisect.bisect_left(sorted_fruits, 'banana')
1

Predictably, to get the rightmost banana, you’d need to call bisect_right() or its bisect() alias. However, those two functions return one index further from the actual rightmost banana, which is useful for finding the insertion point of a new element:

>>> bisect.bisect_right(sorted_fruits, 'banana')
4
>>> bisect.bisect(sorted_fruits, 'banana')
4
>>> sorted_fruits[4]
'orange'

When you combine the code, you can see how many bananas you have:

>>> l = bisect.bisect_left(sorted_fruits, 'banana')
>>> r = bisect.bisect_right(sorted_fruits, 'banana')
>>> r - l
3

If an element were missing, then both bisect_left() and bisect_right() would return the same index yielding zero bananas.

Inserting a New Element

Another practical application of the bisect module is maintaining the order of elements in an already sorted list. After all, you wouldn’t want to sort the whole list every time you had to insert something into it. In most cases, all three functions can be used interchangeably:

>>> import bisect
>>> sorted_fruits = ['apple', 'banana', 'orange']
>>> bisect.insort(sorted_fruits, 'apricot')
>>> bisect.insort_left(sorted_fruits, 'watermelon')
>>> bisect.insort_right(sorted_fruits, 'plum')
>>> sorted_fruits
['apple', 'apricot', 'banana', 'orange', 'plum', 'watermelon']

You won’t see any difference until there are duplicates in your list. But even then, it won’t become apparent as long as those duplicates are simple values. Adding another banana to the left will have the same effect as adding it to the right.

To notice the difference, you need a data type whose objects can have unique identities despite having equal values. Let’s define a Person type using the @dataclass decorator, which was introduced in Python 3.7:

from dataclasses import dataclass, field

@dataclass
class Person:
    name: str
    number: int = field(compare=False)

    def __repr__(self):
        return f'{self.name}({self.number})'

A person has a name and an arbitrary number assigned to it. By excluding the number field from the equality test, you make two people equal even if they have different values of that attribute:

>>> p1 = Person('John', 1)
>>> p2 = Person('John', 2)
>>> p1 == p2
True

On the other hand, those two variables refer to completely separate entities, which allows you to make a distinction between them:

>>> p1 is p2
False
>>> p1
John(1)
>>> p2
John(2)

The variables p1 and p2 are indeed different objects.

Note that instances of a data class aren’t comparable by default, which prevents you from using the bisection algorithm on them:

>>> alice, bob = Person('Alice', 1), Person('Bob', 1)
>>> alice < bob
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'Person' and 'Person'

Python doesn’t know how to order alice and bob, because they’re objects of a custom class. Traditionally, you’d implement the magic method .__lt__() in your class, which stands for less than, to tell the interpreter how to compare such elements. However, the @dataclass decorator accepts a few optional Boolean flags. One of them is order, which results in an automatic generation of the magic methods for comparison when set to True:

@dataclass(order=True)
class Person:
    ...

In turn, this allows you to compare two people and decide which one comes first:

>>> alice < bob
True
>>> bob < alice
False

Finally, you can take advantage of the name and number properties to observe where various functions insert new people to the list:

>>> sorted_people = [Person('John', 1)]
>>> bisect.insort_left(sorted_people, Person('John', 2))
>>> bisect.insort_right(sorted_people, Person('John', 3))
>>> sorted_people
[John(2), John(1), John(3)]

The numbers in parentheses after the names indicate the insertion order. In the beginning, there was just one John, who got the number 1. Then, you added its duplicate to the left, and later one more to the right.

Implementing Binary Search in Python

Keep in mind that you probably shouldn’t implement the algorithm unless you have a strong reason to. You’ll save time and won’t need to reinvent the wheel. The chances are that the library code is mature, already tested by real users in a production environment, and has extensive functionality delivered by multiple contributors.

That said, there are times when it makes sense to roll up your sleeves and do it yourself. Your company might have a policy banning certain open source libraries due to licensing or security matters. Maybe you can’t afford another dependency due to memory or network bandwidth constraints. Lastly, writing code yourself might be a great learning tool!

You can implement most algorithms in two ways:

  1. Iteratively
  2. Recursively

However, there are exceptions to that rule. One notable example is the Ackermann function, which grows so fast that it can’t be expressed with primitive recursion alone.

Before you go any further, make sure that you have a good grasp of the binary search algorithm. You can refer to an earlier part of this tutorial for a quick refresher.

Iteratively

The iterative version of the algorithm involves a loop, which will repeat some steps until the stopping condition is met. Let’s begin by implementing a function that will search elements by value and return their index:

def find_index(elements, value):
    ...

You’re going to reuse this function later.

Assuming that all elements are sorted, you can set the lower and the upper boundaries at the opposite ends of the sequence:

def find_index(elements, value):
    left, right = 0, len(elements) - 1

Now, you want to identify the middle element to see if it has the desired value. Calculating the middle index can be done by taking the average of both boundaries:

def find_index(elements, value):
    left, right = 0, len(elements) - 1
    middle = (left + right) // 2

Notice how integer division helps to handle both an odd and an even number of elements in the bounded range by flooring the result. Depending on how you’re going to update the boundaries and define the stopping condition, you could also use a ceiling function.

Next, you either finish or split the sequence in two and continue searching in one of the resultant halves:

def find_index(elements, value):
    left, right = 0, len(elements) - 1
    middle = (left + right) // 2

    if elements[middle] == value:
        return middle

    if elements[middle] < value:
        left = middle + 1
    elif elements[middle] > value:
        right = middle - 1

If the element in the middle was a match, then you return its index. Otherwise, if it was too small, then you need to move the lower boundary up. If it was too big, then you need to move the upper boundary down.

To keep going, you have to enclose most of the steps in a loop, which will stop when the lower boundary overtakes the upper one:

def find_index(elements, value):
    left, right = 0, len(elements) - 1
    while left <= right:
        middle = (left + right) // 2

        if elements[middle] == value:
            return middle

        if elements[middle] < value:
            left = middle + 1
        elif elements[middle] > value:
            right = middle - 1

In other words, you want to iterate as long as the lower boundary is below or equal to the upper one. Otherwise, there was no match, and the function returns None implicitly.
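A quick sanity check of the finished loop, with the function repeated here so the snippet runs on its own:

```python
def find_index(elements, value):
    left, right = 0, len(elements) - 1
    while left <= right:
        middle = (left + right) // 2
        if elements[middle] == value:
            return middle
        if elements[middle] < value:
            left = middle + 1
        elif elements[middle] > value:
            right = middle - 1

sorted_fruits = ['apple', 'banana', 'orange', 'plum']
print(find_index(sorted_fruits, 'orange'))  # 2
print(find_index(sorted_fruits, 'grape'))   # None
```

A match returns its index, while a missing element falls through the loop and returns None implicitly.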

Searching by key boils down to looking at an object’s attributes instead of its literal value. A key could be the number of characters in a fruit’s name, for example. You can adapt find_index() to accept and use a key parameter:

def find_index(elements, value, key):
    left, right = 0, len(elements) - 1
    while left <= right:
        middle = (left + right) // 2
        middle_element = key(elements[middle])

        if middle_element == value:
            return middle

        if middle_element < value:
            left = middle + 1
        elif middle_element > value:
            right = middle - 1

However, you must also remember to sort the list using the same key that you’re going to search with:

>>> fruits = ['orange', 'plum', 'watermelon', 'apple']
>>> fruits.sort(key=len)
>>> fruits
['plum', 'apple', 'orange', 'watermelon']
>>> fruits[find_index(fruits, key=len, value=10)]
'watermelon'
>>> print(find_index(fruits, key=len, value=3))
None

In the example above, watermelon was chosen because its name is precisely ten characters long, while no fruits on the list have names made up of three letters.

That’s great, but at the same time, you’ve just lost the ability to search by value. To remedy this, you could assign the key a default value of None and then check if it was given or not. However, in a more streamlined solution, you’d always want to call the key. By default, it would be an identity function returning the element itself:

def identity(element):
    return element

def find_index(elements, value, key=identity):
    ...

Alternatively, you might define the identity function inline with an anonymous lambda expression:

def find_index(elements, value, key=lambda x: x):
    ...

find_index() answers only one question: “Where is it?” There are still two others: “Is it there?” and “What is it?” To answer those two, you can build on top of it:

def find_index(elements, value, key):
    ...

def contains(elements, value, key=identity):
    return find_index(elements, value, key) is not None

def find(elements, value, key=identity):
    index = find_index(elements, value, key)
    return None if index is None else elements[index]

With these three functions, you can tell almost everything about an element. However, you still haven’t addressed duplicates in your implementation. What if you had a collection of people, and some of them shared a common name or surname? For example, there might be a Smith family or a few guys going by the name of John among the people:

people = [
    Person('Bob', 'Williams'),
    Person('John', 'Doe'),
    Person('Paul', 'Brown'),
    Person('Alice', 'Smith'),
    Person('John', 'Smith'),
]

To model the Person type, you can modify a data class defined earlier:

from dataclasses import dataclass

@dataclass(order=True)
class Person:
    name: str
    surname: str

Notice the use of the order argument to enable automatic generation of magic methods for comparing instances of the class by all fields. Alternatively, you might prefer to take advantage of a named tuple, which has a shorter syntax:

from collections import namedtuple

Person = namedtuple('Person', 'name surname')

Both definitions are fine and interchangeable. Each person has a name and a surname attribute. To sort and search by one of them, you can conveniently define the key function with an attrgetter() available in the built-in operator module:

>>> from operator import attrgetter
>>> by_surname = attrgetter('surname')
>>> people.sort(key=by_surname)
>>> people
[Person(name='Paul', surname='Brown'), Person(name='John', surname='Doe'), Person(name='Alice', surname='Smith'), Person(name='John', surname='Smith'), Person(name='Bob', surname='Williams')]

Notice how people are now sorted by surname in ascending order. There’s John Smith and Alice Smith, but binary searching for the Smith surname currently gives you only one arbitrary result:

>>> find(people, key=by_surname, value='Smith')
Person(name='Alice', surname='Smith')

To mimic the features of the bisect module shown before, you can write your own version of bisect_left() and bisect_right(). Before finding the leftmost instance of a duplicate element, you want to determine if there’s such an element at all:

def find_leftmost_index(elements, value, key=identity):
    index = find_index(elements, value, key)
    if index is not None:
        ...
    return index

If some index has been found, then you can look to the left and keep moving until you come across an element with a different key or there are no more elements:

def find_leftmost_index(elements, value, key=identity):
    index = find_index(elements, value, key)
    if index is not None:
        while index >= 0 and key(elements[index]) == value:
            index -= 1
        index += 1
    return index

Once you go past the leftmost element, you need to move the index back by one position to the right.

Finding the rightmost instance is quite similar, but you need to flip the conditions:

def find_rightmost_index(elements, value, key=identity):
    index = find_index(elements, value, key)
    if index is not None:
        while index < len(elements) and key(elements[index]) == value:
            index += 1
        index -= 1
    return index

Instead of going left, now you’re going to the right until the end of the list. Using both functions allows you to find all occurrences of duplicate items:

def find_all_indices(elements, value, key=identity):
    left = find_leftmost_index(elements, value, key)
    right = find_rightmost_index(elements, value, key)
    # Compare against None explicitly: a valid leftmost index of 0 is falsy.
    if left is not None and right is not None:
        return set(range(left, right + 1))
    return set()

This function always returns a set. If the element isn’t found, then the set will be empty. If the element is unique, then the set will be made up of only a single index. Otherwise, there will be multiple indices in the set.
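Here’s a self-contained run of those helpers against duplicate keys, with the earlier building blocks repeated so the snippet works on its own (the membership test uses `is not None` so that a leftmost match at index 0 isn’t treated as falsy):

```python
def identity(element):
    return element

def find_index(elements, value, key=identity):
    left, right = 0, len(elements) - 1
    while left <= right:
        middle = (left + right) // 2
        middle_element = key(elements[middle])
        if middle_element == value:
            return middle
        if middle_element < value:
            left = middle + 1
        else:
            right = middle - 1

def find_leftmost_index(elements, value, key=identity):
    index = find_index(elements, value, key)
    if index is not None:
        while index >= 0 and key(elements[index]) == value:
            index -= 1
        index += 1
    return index

def find_rightmost_index(elements, value, key=identity):
    index = find_index(elements, value, key)
    if index is not None:
        while index < len(elements) and key(elements[index]) == value:
            index += 1
        index -= 1
    return index

def find_all_indices(elements, value, key=identity):
    left = find_leftmost_index(elements, value, key)
    right = find_rightmost_index(elements, value, key)
    if left is not None and right is not None:
        return set(range(left, right + 1))
    return set()

fruits = ['plum', 'apple', 'melon', 'orange']  # sorted by length: 4, 5, 5, 6
print(find_all_indices(fruits, 5, key=len))    # {1, 2}
```

Two fruit names are exactly five characters long, so searching by length finds both of their indices.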

To wrap up, you can define even more abstract functions to complete your binary search Python library:

def find_leftmost(elements, value, key=identity):
    index = find_leftmost_index(elements, value, key)
    return None if index is None else elements[index]

def find_rightmost(elements, value, key=identity):
    index = find_rightmost_index(elements, value, key)
    return None if index is None else elements[index]

def find_all(elements, value, key=identity):
    return {elements[i] for i in find_all_indices(elements, value, key)}

Not only does this allow you to pinpoint the exact location of elements on the list, but also to retrieve those elements. You’re able to ask very specific questions:

Is it there?    Where is it?             What is it?
contains()      find_index()             find()
                find_leftmost_index()    find_leftmost()
                find_rightmost_index()   find_rightmost()
                find_all_indices()       find_all()

The complete code of this binary search Python library can be found at the link below:

Get Sample Code:Click here to get the sample code you'll use to learn about binary search in Python in this tutorial.

Recursively

For the sake of simplicity, you’re only going to consider the recursive version of contains(), which tells you if an element was found.

Note: My favorite definition of recursion was given in an episode of the Fun Fun Function series about functional programming in JavaScript:

“Recursion is when a function calls itself until it doesn’t.”

— Mattias Petter Johansson

The most straightforward approach would be to take the iterative version of binary search and use the slicing operator to chop the list:

def contains(elements, value):
    left, right = 0, len(elements) - 1
    if left <= right:
        middle = (left + right) // 2

        if elements[middle] == value:
            return True

        if elements[middle] < value:
            return contains(elements[middle + 1:], value)
        elif elements[middle] > value:
            return contains(elements[:middle], value)
    return False

Instead of looping, you check the condition once and sometimes call the same function on a smaller list. What could go wrong with that? Well, it turns out that slicing generates copies of element references, which can have noticeable memory and computational overhead.

To avoid copying, you might reuse the same list but pass different boundaries into the function whenever necessary:

def contains(elements, value, left, right):
    if left <= right:
        middle = (left + right) // 2

        if elements[middle] == value:
            return True

        if elements[middle] < value:
            return contains(elements, value, middle + 1, right)
        elif elements[middle] > value:
            return contains(elements, value, left, middle - 1)
    return False

The downside is that every time you want to call that function, you have to pass initial boundaries, making sure they’re correct:

>>> sorted_fruits = ['apple', 'banana', 'orange', 'plum']
>>> contains(sorted_fruits, 'apple', 0, len(sorted_fruits) - 1)
True

If you were to make a mistake, then it would potentially not find that element. You can improve this by using default function arguments or by introducing a helper function that delegates to the recursive one:

def contains(elements, value):
    return recursive(elements, value, 0, len(elements) - 1)

def recursive(elements, value, left, right):
    ...

Going further, you might prefer to nest one function in another to hide the technical details and to take advantage of variable reuse from outer scope:

def contains(elements, value):
    def recursive(left, right):
        if left <= right:
            middle = (left + right) // 2
            if elements[middle] == value:
                return True
            if elements[middle] < value:
                return recursive(middle + 1, right)
            elif elements[middle] > value:
                return recursive(left, middle - 1)
        return False
    return recursive(0, len(elements) - 1)

The recursive() inner function can access both elements and value parameters even though they’re defined in the enclosing scope. The life cycle and visibility of variables in Python is dictated by the so-called LEGB rule, which tells the interpreter to look for symbols in the following order:

  1. Local scope
  2. Enclosing scope
  3. Global scope
  4. Built-in symbols

This allows variables that are defined in outer scope to be accessed from within nested blocks of code.
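A minimal illustration of the rule, using hypothetical names chosen just for this example:

```python
x = 'global'

def outer():
    x = 'enclosing'

    def inner():
        # No local x is defined here, so the lookup falls back to the
        # enclosing scope of outer() before ever reaching the global one.
        return x

    return inner()

print(outer())  # enclosing
```

The inner function sees the enclosing x, not the global one, exactly as the LEGB order predicts.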

The choice between an iterative and a recursive implementation is often the net result of performance considerations, convenience, as well as personal taste. However, there are also certain risks involved with recursion, which is one of the subjects of the next section.

Covering Tricky Details

Here’s what the author of The Art of Computer Programming has to say about implementing the binary search algorithm:

“Although the basic idea of binary search is comparatively straightforward, the details can be surprisingly tricky, and many good programmers have done it wrong the first few times they tried.”

— Donald Knuth

If that doesn’t deter you enough from the idea of writing the algorithm yourself, then maybe this will. The standard library in Java had a subtle bug in their implementation of binary search, which remained undiscovered for a decade! But the bug itself traces its roots much earlier than that.

Note: I once fell victim to the binary search algorithm during a technical screening. There were a couple of coding puzzles to solve, including a binary search one. Guess which one I failed to complete? Yeah.

The following list isn’t exhaustive, but it also leaves out the obvious mistakes, like forgetting to sort the list in the first place.

Integer Overflow

This is the Java bug that was just mentioned. If you recall, the binary search Python algorithm inspects the middle element of a bounded range in a sorted collection. But how is that middle element chosen exactly? Usually, you take the average of the lower and upper boundary to find the middle index:

middle = (left + right) // 2

This method of calculating the average works just fine in the overwhelming majority of cases. However, once the collection of elements becomes sufficiently large, the sum of both boundaries won’t fit the integer data type. It’ll be larger than the maximum value allowed for integers.

Some programming languages might raise an error in such situations, which would immediately stop program execution. Unfortunately, that’s not always the case. For example, Java silently ignores this problem, letting the value wrap around and become a seemingly random number. You’ll only find out about the problem if the resulting number happens to be negative, which makes the subsequent array access throw an IndexOutOfBoundsException.

Here’s an example that demonstrates this behavior in jshell, which is kind of like an interactive interpreter for Java:

jshell> var a = Integer.MAX_VALUE
a ==> 2147483647

jshell> a + 1
$2 ==> -2147483648

A safer way to find the middle index could be calculating the offset first and then adding it to the lower boundary:

middle = left + (right - left) // 2

Even if both values are maxed out, the sum in the formula above will never be. There are a few more ways, but the good news is that you don’t need to worry about any of these, because Python is free from the integer overflow error. There’s no upper limit on how big integers can be other than your memory:

>>> 2147483647**7
210624582650556372047028295576838759252690170086892944262392971263

However, there’s a catch. When you call functions from a library, that code might be subject to the C language constraints and still cause an overflow. There are plenty of libraries based on the C language in Python. You could even build your own C extension module or load a dynamically-linked library into Python using ctypes.

Stack Overflow

The stack overflow problem may, theoretically, concern the recursive implementation of binary search. Most programming languages impose a limit on the number of nested function calls. Each call is associated with a return address stored on a stack. In Python, the default limit is a few thousand levels of such calls:

>>> import sys
>>> sys.getrecursionlimit()
3000

This won’t be enough for a lot of recursive functions. However, it’s very unlikely that a binary search in Python would ever need more due to its logarithmic nature. You’d need a collection of two to the power of three thousand elements. That’s a number with over nine hundred digits!
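You can estimate the maximum recursion depth that binary search would need for a given collection size. Since the range is halved on every call, the depth is roughly the floored base-two logarithm plus one, and even a quintillion elements stay far below the default limit:

```python
import math

# Binary search halves the search range on each call, so the worst-case
# depth for n elements is about floor(log2(n)) + 1.
for n in (10**3, 10**6, 10**9, 10**18):
    print(n, math.floor(math.log2(n)) + 1)
```

A billion elements need only about thirty levels of recursion, which is nowhere near the three thousand allowed by default.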

Nevertheless, it’s still possible for the infinite recursion error to arise if the stopping condition is stated incorrectly due to a bug. In such a case, the infinite recursion will eventually cause a stack overflow.

Note: The stack overflow error is very common among languages with manual memory management. People would often google those errors to see if someone else already had similar issues, which gave the name to a popular Q&A site for programmers.

You can temporarily lift or decrease the recursion limit to simulate a stack overflow error. Note that the effective limit will be smaller because of the functions that the Python runtime environment has to call:

>>> def countup(limit, n=1):
...     print(n)
...     if n < limit:
...         countup(limit, n + 1)
...
>>> import sys
>>> sys.setrecursionlimit(7)  # Actual limit is 3
>>> countup(10)
1
2
3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 4, in countup
  File "<stdin>", line 4, in countup
  File "<stdin>", line 2, in countup
RecursionError: maximum recursion depth exceeded while calling a Python object

The recursive function was called three times before saturating the stack. The remaining four calls must have been made by the interactive interpreter. If you run that same code in PyCharm or an alternative Python shell, then you might get a different result.

Duplicate Elements

You’re aware of the possibility of having duplicate elements in the list and you know how to deal with them. This is just to emphasize that a conventional binary search in Python might not produce deterministic results. Depending on how the list was sorted or how many elements it has, you’ll get a different answer:

>>> from search.binary import *
>>> sorted_fruits = ['apple', 'banana', 'banana', 'orange']
>>> find_index(sorted_fruits, 'banana')
1
>>> sorted_fruits.append('plum')
>>> find_index(sorted_fruits, 'banana')
2

There are two bananas on the list. At first, the call to find_index() returns the left one. However, adding a completely unrelated element at the end of the list makes the same call give you a different banana.

The same principle, known as algorithm stability, applies to sorting algorithms. Some are stable, meaning they don’t change the relative positions of equivalent elements. Others don’t make such guarantees. If you ever need to sort elements by multiple criteria, then you should always start from the least significant key to retain stability.
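For example, Python’s list.sort() is stable, so sorting by the least significant key first and the most significant key last yields a list ordered by both criteria:

```python
from collections import namedtuple
from operator import attrgetter

Person = namedtuple('Person', 'name surname')
people = [
    Person('John', 'Smith'),
    Person('Alice', 'Smith'),
    Person('John', 'Doe'),
]

# Sort by the least significant key first, then by the most significant one.
# Because the sort is stable, people sharing a surname stay ordered by name.
people.sort(key=attrgetter('name'))
people.sort(key=attrgetter('surname'))
print(people)
```

After the second sort, the two Smiths remain in alphabetical order by name, because the stable sort preserved the relative order established by the first pass.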

Floating-Point Rounding

So far you’ve only searched for fruits or people, but what about numbers? They should be no different, right? Let’s make a list of floating-point numbers at 0.1 increments using a list comprehension:

>>> sorted_numbers = [0.1 * i for i in range(1, 4)]

The list should contain the numbers one-tenth, two-tenths, and three-tenths. Surprisingly, only two of those three numbers can be found:

>>> from search.binary import contains
>>> contains(sorted_numbers, 0.1)
True
>>> contains(sorted_numbers, 0.2)
True
>>> contains(sorted_numbers, 0.3)
False

This isn’t a problem strictly related to binary search in Python, as the built-in linear search is consistent with it:

>>> 0.1 in sorted_numbers
True
>>> 0.2 in sorted_numbers
True
>>> 0.3 in sorted_numbers
False

It’s not even a problem related to Python but rather to how floating-point numbers are represented in computer memory. This is defined by the IEEE 754 standard for floating-point arithmetic. Without going into much detail, some decimal numbers don’t have a finite representation in binary form. Because of limited memory, those numbers get rounded, causing a floating-point rounding error.

Note: If you require maximum precision, then steer away from floating-point numbers. They’re great for engineering purposes. However, for monetary operations, you don’t want rounding errors to accumulate. It’s recommended to scale down all prices and amounts to the smallest unit, such as cents or pennies, and treat them as integers.

Alternatively, many programming languages have support for fixed-point numbers, such as the decimal type in Python. This puts you in control of when and how rounding is taking place.
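For instance, the decimal module stores values in base ten, so one-tenth is represented exactly, whereas the float literal isn’t:

```python
from decimal import Decimal

# Binary floats can't represent 0.1 exactly, so the error accumulates.
print(0.1 + 0.1 + 0.1 == 0.3)  # False

# Decimal arithmetic in base ten keeps the sum exact.
print(Decimal('0.1') + Decimal('0.1') + Decimal('0.1') == Decimal('0.3'))  # True
```

Note that the Decimal values are constructed from strings; building them from float literals would bake the binary rounding error right in.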

If you do need to work with floating-point numbers, then you should replace exact matching with an approximate comparison. Let’s consider two variables with slightly different values:

>>> a = 0.3
>>> b = 0.1 * 3
>>> b
0.30000000000000004
>>> a == b
False

Regular comparison gives a negative result, although both values are nearly identical. Fortunately, Python comes with a function that will test if two values are close to each other within some small neighborhood:

>>> import math
>>> math.isclose(a, b)
True

That neighborhood, which is the maximum distance between the values, can be adjusted if needed:

>>> math.isclose(a, b, rel_tol=1e-16)
False

You can use that function to do a binary search in Python in the following way:

import math

def find_index(elements, value):
    left, right = 0, len(elements) - 1
    while left <= right:
        middle = (left + right) // 2

        if math.isclose(elements[middle], value):
            return middle

        if elements[middle] < value:
            left = middle + 1
        elif elements[middle] > value:
            right = middle - 1

On the other hand, this implementation of binary search in Python is specific to floating-point numbers only. You couldn’t use it to search for anything else without getting an error.

The following section contains no code, just a few math concepts.

In computing, you can optimize the performance of pretty much any algorithm at the expense of increased memory use. For instance, you saw that a hash-based search of the IMDb dataset required an extra 0.5 GB of memory to achieve unparalleled speed.

Conversely, to save bandwidth, you’d compress a video stream before sending it over the network, increasing the amount of work to be done. This phenomenon is known as the space-time tradeoff and is useful in evaluating an algorithm’s complexity.

Time-Space Complexity

The computational complexity is a relative measure of how many resources an algorithm needs to do its job. The resources include computation time as well as the amount of memory it uses. Comparing the complexity of various algorithms allows you to make an informed decision about which is better in a given situation.

Note: Algorithms that don’t need to allocate more memory than their input data already consumes are called in-place, or in-situ, algorithms. This results in mutating the original data, which sometimes may have unwanted side-effects.

You looked at a few search algorithms and their average performance against a large dataset. It’s clear from those measurements that a binary search is faster than a linear search. You can even tell by what factor.

However, if you took the same measurements in a different environment, you’d probably get slightly or perhaps entirely different results. There are invisible factors at play that can be influencing your test. Besides, such measurements aren’t always feasible. So, how can you compare time complexities quickly and objectively?

The first step is to break down the algorithm into smaller pieces and find the one that is doing the most work. It’s likely going to be some elementary operation that gets called a lot and consistently takes about the same time to run. For search algorithms, such an operation might be the comparison of two elements.

Having established that, you can now analyze the algorithm. To find the time complexity, you want to describe the relationship between the number of elementary operations executed versus the size of the input. Formally, such a relationship is a mathematical function. However, you’re not interested in looking for its exact algebraic formula but rather estimating its overall shape.

There are a few well-known classes of functions that most algorithms fit in. Once you classify an algorithm according to one of them, you can put it on a scale:

Common Classes of Time Complexity

These classes tell you how the number of elementary operations increases with the growing size of the input. They are, from left to right:

  • Constant
  • Logarithmic
  • Linear
  • Quasilinear
  • Quadratic
  • Exponential
  • Factorial

This can give you an idea about the performance of the algorithm you’re considering. A constant complexity, regardless of the input size, is the most desired one. A logarithmic complexity is still pretty good, indicating a divide-and-conquer technique at use. The further to the right on this scale, the worse the complexity of the algorithm, because it has more work to do.

When you’re talking about the time complexity, what you typically mean is the asymptotic complexity, which describes the behavior under very large data sets. This simplifies the function formula by eliminating all terms and coefficients but the one that grows at the fastest rate (for example, n squared).

However, a single function doesn’t provide enough information to compare two algorithms accurately. The time complexity may vary depending on the volume of data. For example, the binary search algorithm is like a turbocharged engine, which builds pressure before it’s ready to deliver power. On the other hand, the linear search algorithm is fast from the start but quickly reaches its peak power and ultimately loses the race:

Time Complexity of Linear Search and Binary Search

In terms of speed, the binary search algorithm starts to overtake the linear search when there’s a certain number of elements in the collection. For smaller collections, a linear search might be a better choice.

Note: The same algorithm may have different optimistic, pessimistic, and average time complexities. For example, in the best-case scenario, a linear search algorithm will find the element at the first index, after running just one comparison.

On the other end of the spectrum, it’ll have to compare a reference value to all elements in the collection. In practice, you want to know the pessimistic complexity of an algorithm.

There are a few mathematical notations of the asymptotic complexity, which are used to compare algorithms. By far the most popular one is the Big-O notation.

The Big-O Notation

The Big-O notation represents the worst-case scenario of asymptotic complexity. Although this might sound rather intimidating, you don’t need to know the formal definition. Intuitively, it’s a very rough measure of the rate of growth at the tail of the function that describes the complexity. You pronounce it as “big-oh” of something:

The Big-O Notation

That “something” is usually a function of data size or just the digit “one” that stands for a constant. For example, the linear search algorithm has a time complexity of O(n), while a hash-based search has O(1) complexity.

Note: When you say that some algorithm has complexity O(f(n)), where n is the size of the input data, then it means that the function f(n) is an upper bound of the graph of that complexity. In other words, the actual complexity of that algorithm won’t grow faster than f(n) multiplied by some constant, when n approaches infinity.

In real-life, the Big-O notation is used less formally as both an upper and a lower bound. This is useful for the classification and comparison of algorithms without having to worry about the exact function formulas.

You’ll estimate the asymptotic time complexity of binary search by determining the number of comparisons in the worst-case scenario—when an element is missing—as a function of input size. You can approach this problem in three different ways:

  1. Tabular
  2. Graphical
  3. Analytical

The tabular method is about collecting empirical data, putting it in a table, and trying to guess the formula by eyeballing sampled values:

Number of Elements    Number of Comparisons
0                     0
1                     1
2                     2
3                     2
4                     3
5                     3
6                     3
7                     3
8                     4

The number of comparisons grows as you increase the number of elements in the collection, but the rate of growth is slower than if it was a linear function. That’s an indication of a good algorithm that can scale with data.
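Values like those in the table can be sampled by instrumenting the iterative implementation with a counter. Searching for a value greater than every element forces the worst case; this counting variant is purely illustrative and isn’t part of the library:

```python
def find_index_counting(elements, value):
    """Iterative binary search instrumented with a comparison counter."""
    comparisons = 0
    left, right = 0, len(elements) - 1
    while left <= right:
        middle = (left + right) // 2
        comparisons += 1  # count each probe of the middle element
        if elements[middle] == value:
            return middle, comparisons
        if elements[middle] < value:
            left = middle + 1
        else:
            right = middle - 1
    return None, comparisons

# Searching for a value bigger than every element forces the worst case.
for n in range(9):
    _, count = find_index_counting(list(range(n)), n)
    print(n, count)
```

Running this reproduces the 0, 1, 2, 2, 3, 3, 3, 3, 4 progression sampled in the table.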

If that doesn’t help you, you can try the graphical method, which visualizes the sampled data by drawing a graph:

Empirical Data of Binary Search

The data points seem to overlay with a curve, but you don’t have enough information to provide a conclusive answer. It could be a polynomial, whose graph turns up and down for larger inputs.

Taking the analytical approach, you can choose some relationship and look for patterns. For example, you might study how the number of elements shrinks in each step of the algorithm:

Comparison    Number of Elements
-             n
1st           n/2
2nd           n/4
3rd           n/8
k-th          n/2^k

In the beginning, you start with the whole collection of n elements. After the first comparison, you’re left with only half of them. Next, you have a quarter, and so on. The pattern that arises from this observation is that after the k-th comparison, there are n/2^k elements left. Variable k is the expected number of elementary operations.
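You can reproduce this shrinking pattern with a few lines of Python. This quick sketch uses integer halving to show how many elements remain after each comparison:

```python
def remaining_elements(n):
    """Return the number of elements left after each successive comparison."""
    sizes = []
    while n > 0:
        n //= 2  # each comparison discards roughly half of the elements
        sizes.append(n)
    return sizes

print(remaining_elements(16))  # five comparisons empty the collection
```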

After all k comparisons, there will be no more elements left. However, when you take one step back, that is k - 1, there will be exactly one element left. This gives you a convenient equation:

n/2^(k - 1) = 1

Multiply both sides of the equation by the denominator, then take the logarithm base two of the result, and move the remaining constant to the right. You’ve just found the formula for the binary search complexity, which is on the order of O(log(n)).
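You can check the derived formula, k = log2(n) + 1, against the sampled values from earlier. In this sketch, `math.floor` accounts for collection sizes that aren’t exact powers of two:

```python
import math

def predicted_comparisons(n):
    """Worst-case comparisons predicted by the formula k = log2(n) + 1."""
    if n == 0:
        return 0
    return math.floor(math.log2(n)) + 1

for size in (1, 2, 3, 4, 8, 1_000_000):
    print(size, predicted_comparisons(size))
```

The predictions match the tabulated counts, and they show why binary search stays fast: even a million elements need only twenty comparisons.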

Conclusion

Now you know the binary search algorithm inside and out. You can flawlessly implement it yourself, or take advantage of the standard library in Python. Having tapped into the concept of time-space complexity, you’re able to choose the best search algorithm for the given situation.

Now you can:

  • Use the bisect module to do a binary search in Python
  • Implement binary search in Python recursively and iteratively
  • Recognize and fix defects in a binary search Python implementation
  • Analyze the time-space complexity of the binary search algorithm
  • Search even faster than binary search
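For instance, the first bullet point might look like this in practice. This is a minimal sketch built on the standard library’s `bisect_left`; the `find_index` name is just for illustration:

```python
from bisect import bisect_left

def find_index(elements, value):
    """Return the leftmost index of value in a sorted list, or None if absent."""
    index = bisect_left(elements, value)
    if index < len(elements) and elements[index] == value:
        return index
    return None

print(find_index([1, 3, 5, 7], 5))  # found at index 2
print(find_index([1, 3, 5, 7], 4))  # missing, prints None
```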

With all this knowledge, you’ll rock your programming interview! Whether or not binary search is the optimal solution to a particular problem, you now have the tools to figure that out on your own. You don’t need a computer science degree to do so.

You can grab all of the code you’ve seen in this tutorial at the link below:

Get Sample Code: Click here to get the sample code you'll use to learn about binary search in Python in this tutorial.



James Bennett: Against service layers in Django


Recently I’ve seen posts and questions pop up in a few places about a sort of “enterprise” Django style guide that’s been getting attention. There are a number of things I disagree with in that guide, but the big one, and the one people have mostly been asking about, is the recommendation to add a “service layer” to Django applications. The short version of my opinion on this is: it’s probably not what you want …

Read full entry
