
Mike Driscoll: PyDev of the Week: Sebastian Steins


This week we welcome Sebastian Steins (@sebastiansteins) as our PyDev of the Week! Sebastian is the creator of the Pythonic News website. You can find out more about Sebastian by checking out what he’s been up to over on Github. Let’s take a few moments to get to know him better!

Sebastian Steins

Can you tell us a little about yourself (hobbies, education, etc):

I am a software developer from Germany and live close to the Dutch and Belgian border. The internet emerged when I was in school. I have always been fascinated by computers and wanted to learn to program. Unfortunately, this was not so easy at the time, and I did not have teachers who could have supported me in that matter. It changed, however, when I got my first modem. The internet opened a whole new world for me, and I started to learn HTML, Perl and later, PHP. I built CGI scripts and small web apps back then, and it was really fun. Eventually, I took programming as my career path, although I sometimes struggled with that decision. Besides my degree in computer science, I also attended lectures on economics and had a few positions in the finance sector early in my career. Now, I enjoy coaching teams of great software engineers in architecture matters and try to pass my knowledge on to junior devs.

When I’m not in front of a computer, I like to ride my road bike, learn new stuff from audiobooks and would never say no to a night out in a good restaurant.

Why did you start using Python?

I started using Python when I needed a replacement for PHP, so it was very early on, in the very early days of the Python 2.0 release. I immediately liked it, because it was basically like writing pseudocode. This is what I still love about being able to “talk to a computer”: expressing ideas and seeing results very quickly. Meanwhile, other languages have kept up and are just as expressive as Python. However, Python has become a little bit of my home base ever since.

What other programming languages do you know and which is your favorite?

I have worked on different projects with many different programming languages like Java, C#, C and JavaScript.

What projects are you working on now?

I am working as a freelance consultant in software engineering. Besides that, I teach introductory courses over at smartninja.de. Recently I started a small project called Pythonic News, which is basically a Hacker News clone for the Python community.

Which Python libraries are your favorite (core or 3rd party)?

There are so many great libraries in the Python ecosystem. This is another point which makes programming in Python so enjoyable. Most of these libraries are very Pythonic in a way that ideas can be expressed very concisely. I really like the Django framework, but there are also smaller packages like requests which I use quite frequently.

Bigger ones, such as numpy and pandas, really prove that Python is so versatile that there is hardly a problem which cannot be solved with it.

What is the origin of your Pythonic News site?

I’ve built the https://news.python.sc site as a Django project. During the creation, I tried to use as many features of the Django framework as possible. This is because I created it initially as an example app for a Python training and wanted to showcase different ways of achieving things. This is also why not everything in the codebase can be considered best practice. The goal was to show what could be done with Django, including downsides of particular approaches. For example, I used model inheritance on the core database objects, which certainly is not the best choice performance-wise.

What have you learned creating the project?

Just for fun, I posted the project to Reddit and Hacker News before it was even complete, because I wanted to see how it would do, and I got so much positive feedback. I learned that there are still places on the web which feel like the “good old days”. People just like to talk about the topics they care about. The early-2000s aesthetics of the site turned out to be a good fit for the audience. There was not a single issue with spam or offensive behaviour. There still is “an internet” which is happening outside the walled gardens of the big tech companies.

This made me very happy!

Furthermore, strangers reacted on GitHub in a way I would never have expected. That made me decide to create a making-of tutorial series from the project to teach more people about Python. I will publish it on my site when it’s ready.

Do you have any words of wisdom for other content creators?

Just start and let the world know about your ideas and your creation. Even if it’s not perfect, you’ll find people that care. If it’s a Python project you created, of course, you can submit it to the “Show PN” section of Pythonic News.

Thanks for doing the interview, Sebastian!

The post PyDev of the Week: Sebastian Steins appeared first on The Mouse Vs. The Python.


Stein Magnus Jodal: pathlib and paths with arbitrary bytes


The pathlib module was added to the standard library in Python 3.4, and is one of the many nice improvements that Python 3 has gained over the past decade. In three weeks, Python 3.5 will be the oldest version of Python that still receives security patches. This means that the presence of pathlib can soon be taken for granted on all Python installations, and the quest towards replacing os.path can begin for real.

In this post I’ll have a look at how pathlib can be used to handle file names with arbitrary bytes, as this is valid on most file systems.

Introduction to pathlib

pathlib provides an API for working with file system paths that works consistently across platforms and handles most corner cases well. If you replace all of your os.path usage with pathlib, you’ll probably have more readable code with fewer bugs.

Here are a few examples to get a feel for the pathlib API.

The entry point to the pathlib API is usually the Path class:

>>> from pathlib import Path

pathlib can replace the os.path module:

>>> p = Path("hello.txt")
>>> p
PosixPath('hello.txt')
>>> p.name
'hello.txt'
>>> p.parent
PosixPath('.')
>>> p.exists()
True
>>> p.is_file()
True

It has convenient shortcuts for getting the contents of a file, or writing to it:

>>> p.read_bytes()
b'Hello, world!\n'
>>> p.read_text()
'Hello, world!\n'

As well as the low level APIs:

>>> with p.open() as fh:
...     fh.seek(7)
...     print(fh.read(5))
...
7
world

It has replacements for os.rename() and friends:

>>> q = p.with_suffix(".md")
>>> q
PosixPath('hello.md')
>>> p.rename(q)
>>> p.exists()
False
>>> q.exists()
True

You can concatenate paths with a forward slash, which might be controversial, but reads a lot better than os.path.join(a, b) once you accept it:

>>> d = p / ".."
>>> d
PosixPath('hello.txt/..')
>>> d = d.resolve()
>>> d
PosixPath('/home/jodal')
>>> d.is_dir()
True

When working with directories, you no longer need os.walk() or glob.glob():

>>> len(list(d.iterdir()))
112
>>> list(d.glob("*.md"))
[PosixPath('/home/jodal/hello.md')]

Most APIs in the standard library, and many third party libraries, have learned to accept os.PathLike in addition to bytes and strings as file paths, which means that you can use your Path instances where you used to use strings.
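For example, the built-in open() accepts any os.PathLike, so a Path can be passed directly. A quick sketch (the file name and temporary directory are my own choices):

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    p = Path(tmp) / "example.txt"
    p.write_text("Hello, world!\n")
    # open() accepts any os.PathLike, so no str(p) conversion is needed:
    with open(p) as fh:
        print(fh.read(), end="")  # Hello, world!
```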

Path, text, and bytes

If you interface with an API that doesn’t accept Path instances, the convention is to convert the Path instance to a plain old string using str() or bytes():

>>> str(p)
'hello.txt'
>>> bytes(p)
b'hello.txt'

pathlib uses Unicode strings, a.k.a. text, instead of bytes wherever possible. In fact, pathlib requires you to instantiate Path using a Unicode string:

>>> Path(b'/tmp/foo')
Traceback (most recent call last):
  ...
TypeError: argument should be a str object or an os.PathLike object returning str, not <class 'bytes'>
>>> Path(b'/tmp/foo'.decode())
PosixPath('/tmp/foo')

If you provide pathlib with a non-ASCII text string and ask it to render the path as a URI, it will helpfully percent-encode it:

>>> p = Path('/tmp/æ')
>>> p
PosixPath('/tmp/æ')
>>> p.as_uri()
'file:///tmp/%C3%A6'

In the as_uri() output, we can see that æ became %C3%A6. pathlib first encoded the character æ to bytes using the UTF-8 encoding, resulting in the two bytes 0xC3 and 0xA6. It then percent-encoded the bytes that were not safe for use in URIs, resulting in %C3%A6.
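The same two-step transformation can be reproduced with the standard library; a sketch, using urllib.parse.quote for the percent-encoding step:

```python
from urllib.parse import quote

# Step 1: the character æ encodes to the two UTF-8 bytes 0xC3 and 0xA6.
assert "æ".encode("utf-8") == b"\xc3\xa6"

# Step 2: quote() percent-encodes the bytes that are unsafe in URIs
# ('/' is kept as-is by default).
print(quote("/tmp/æ"))  # /tmp/%C3%A6
```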

Arbitrary bytes

Since pathlib implicitly uses UTF-8 for encoding, how well does it handle file systems that contain non-UTF-8 directory and file names?

The NTFS file system only allows file names that can be represented as UTF-16, but most other file systems accept almost any sequence of bytes as a file name.

Let’s test with a couple of worst case bytes:

  • the null byte, 0x00 or \x00 when escaped in Python, and
  • the first half of a two-byte UTF-8 codepoint, 0xC3 or \xC3 in Python.
>>> name = b'/tmp/ab\x00cd\xC3'
>>> name
b'/tmp/ab\x00cd\xc3'

As this file name cannot be decoded as UTF-8, we can’t even construct a Path instance:

>>> p = Path(name)
Traceback (most recent call last):
  ...
TypeError: argument should be a str object or an os.PathLike object returning str, not <class 'bytes'>
>>> p = Path(name.decode())
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 10: unexpected end of data

What if we handle the decoding errors by replacing the offending bytes?

>>> p = Path(name.decode(errors="replace"))

With the replace error handler, we manage to create a Path instance, but we’ve lost some information by replacing 0xC3 with the Unicode replacement character, U+FFFD, here encoded into UTF-8 as 0xEFBFBD:

>>> p
PosixPath('/tmp/ab\x00cd�')
>>> p.as_uri()
'file:///tmp/ab%00cd%EF%BF%BD'
>>> str(p)
'/tmp/ab\x00cd�'
>>> str(p).encode()
b'/tmp/ab\x00cd\xef\xbf\xbd'

A Path that cannot reconstruct the bytes it originated from is not useful, as we cannot use it to find and access the file represented by the Path instance.

>>> str(p).encode() == name
False

We need a way to convert arbitrary bytes to a Path instance without losing any information, so that given only the Path instance we can later reconstruct the exact same bytes and access our file.

Surrogate escape

Luckily, PEP 383 introduced just that in Python 3.1, released more than a decade ago: the surrogateescape error handler.

The surrogateescape error handler is not available in Python 2.7, so users of the pathlib2 backport are probably out of luck. With three weeks left until Python 2’s end-of-life, you probably have other things to worry about.

Using the surrogateescape error handler we get the following behavior from pathlib:

>>> name
b'/tmp/ab\x00cd\xc3'
>>> name.decode(errors="surrogateescape")
'/tmp/ab\x00cd\udcc3'

\udcc3 in the decoded string is a surrogate escape code point for the 0xC3 byte.
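The escape scheme maps each offending byte 0xXX to the lone surrogate code point U+DCXX, so the mapping is trivially reversible. A quick sketch:

```python
bad = b"\xc3"

# Decoding maps the byte 0xC3 to the surrogate 0xDC00 + 0xC3 = U+DCC3.
text = bad.decode("utf-8", errors="surrogateescape")
assert text == "\udcc3"
assert ord(text) == 0xDC00 + 0xC3

# Encoding with the same error handler restores the original byte exactly.
assert text.encode("utf-8", errors="surrogateescape") == bad
```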

Passing the text string to pathlib, we get a Path like usual, and with str(p) we get the same text string back again, with the surrogate escape code point:

>>> p = Path(name.decode(errors="surrogateescape"))
>>> p
PosixPath('/tmp/ab\x00cd\udcc3')
>>> str(p)
'/tmp/ab\x00cd\udcc3'

Let’s try encoding this back to bytes and compare it to the bytes we started with:

>>> str(p).encode() == name
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 10: surrogates not allowed

That failed because str.encode() does not allow encoding of surrogates by default. When we create a Unicode string by decoding with the surrogateescape error handler, we must also use the surrogateescape error handler to encode the text back to bytes:

>>> str(p).encode(errors="surrogateescape") == name
True

Interestingly, when rendering the path as an URI, pathlib correctly converts the surrogate to the original byte, 0xC3, percent-encoded as %C3.

>>> p.as_uri()
'file:///tmp/ab%00cd%C3'

In fact, pathlib is able to correctly encode paths with surrogates to bytes itself:

>>> bytes(p)
b'/tmp/ab\x00cd\xc3'
>>> bytes(p) == name
True

os.fsencode()

The reason pathlib correctly converts the path to bytes is that its __bytes__() implementation uses os.fsencode() to encode a Path to bytes.

os.fsencode() explicitly encodes using the surrogateescape handler, except on Windows, where it uses the strict handler to not allow any file names that cannot be used on NTFS. It lets bytes pass through unchanged.

>>> import os
>>> os.fsencode("/tmp/æ")
b'/tmp/\xc3\xa6'
>>> os.fsencode(b"/tmp/ab\x00cd\xc3")
b'/tmp/ab\x00cd\xc3'

os.fsencode() has a sibling in os.fsdecode(), which does the opposite. It decodes using the surrogateescape handler, except on Windows, where it uses the strict handler. It lets text, not bytes, pass through unchanged.

>>> os.fsdecode(b"/tmp/ab\x00cd\xc3")
'/tmp/ab\x00cd\udcc3'
>>> os.fsdecode("/tmp/æ")
'/tmp/æ'

With this, we can put together some recommendations.

To Path and back again

To create a Path instance that can hold a path with arbitrary byte encoding, use Path(os.fsdecode(...)):

>>> p1 = Path(os.fsdecode(b'/tmp/ab\x00cd\xc3'))
>>> p2 = Path(os.fsdecode('/tmp/æ'))
>>> p1
PosixPath('/tmp/ab\x00cd\udcc3')
>>> p2
PosixPath('/tmp/æ')

To get the exact bytes from a Path instance, use bytes(path):

>>> bytes(p1)
b'/tmp/ab\x00cd\xc3'
>>> bytes(p2)
b'/tmp/\xc3\xa6'

Presentation

When it comes to presenting the path to an end user, in a user interface or a log file, there is no perfect solution.

Using URIs yields a consistent representation of the file name, irrespective of its validity as UTF-8. The URIs are true to the actual bytes on the file system, and do not contain any surrogates.

>>> p1.as_uri()
'file:///tmp/ab%00cd%C3'
>>> p2.as_uri()
'file:///tmp/%C3%A6'

Why you should not use str(path)

However, for anything that is valid UTF-8 but not valid ASCII, which includes most languages spoken by humans, the result of str(path) is a lot more readable than URIs.

For paths which are valid UTF-8, str(path) yields the cleanest result:

>>> str(p2)
'/tmp/æ'

But for names with bytes that are invalid UTF-8, the use of surrogates, which is an implementation detail, leaks through:

>>> str(p1)
'/tmp/ab\x00cd\udcc3'

You might think that since invalid UTF-8 isn’t that common, we can live with the surrogates leaking through in those rare cases?

Printing a text string involves encoding it to bytes before it is streamed to your output, e.g. a terminal or a file. As encoding text with surrogates isn’t allowed by default, print(str(path)) can crash your application:

>>> print(str(p2))
/tmp/æ
>>> print(str(p1))
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc3' in position 10: surrogates not allowed

It is possible to go down this path, but then you need to always use the Python repr() of the string, instead of printing the string itself:

>>> print(repr(str(p1)))
'/tmp/ab\x00cd\udcc3'

Lesser of the evils

So, with my current knowledge of possible solutions, my recommendation for consistent presentation of paths with arbitrary bytes in user interfaces, including logging and console output, is to use either repr(path) or path.as_uri(), with a preference for path.as_uri():

>>> print(repr(p1))
PosixPath('/tmp/ab\x00cd\udcc3')
>>> print(p1.as_uri())
file:///tmp/ab%00cd%C3

Summary

Let’s repeat what we’ve learned here:

  • pathlib provides a nice and portable API to work with file system paths in Python.
  • With the help of the surrogateescape error handler, pathlib can handle paths with arbitrary bytes.
  • os.fsencode() and os.fsdecode() are convenient ways to use the surrogateescape error handler, with the added bonus of correct behavior on Windows too.
  • Using str(path) to present Path objects in user interfaces might crash your application. The alternatives, repr(path) and path.as_uri(), are not perfect either, but at least they don’t crash and burn.
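The recommendations above can be folded into a small helper. This is a sketch of my own (the function name and the repr() fallback for relative paths are not from the article):

```python
import os
from pathlib import Path

def display_path(path: Path) -> str:
    """Render a path for logs or user interfaces without risking UnicodeEncodeError."""
    try:
        # Absolute paths: a consistent, surrogate-free representation.
        return path.as_uri()
    except ValueError:
        # Relative paths cannot be expressed as file:// URIs; fall back to repr().
        return repr(str(path))

print(display_path(Path(os.fsdecode(b"/tmp/ab\x00cd\xc3"))))  # file:///tmp/ab%00cd%C3
print(display_path(Path("hello.txt")))                        # 'hello.txt'
```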

Cheat sheet

>>> import os
>>> from pathlib import Path

>>> # To create paths:
>>> p = Path(os.fsdecode(b'/tmp/ab\x00cd\xc3'))
>>> p
PosixPath('/tmp/ab\x00cd\udcc3')

>>> # To use with APIs not supporting `os.PathLike`:
>>> bytes(p)
b'/tmp/ab\x00cd\xc3'

>>> # To show to users:
>>> print(p.as_uri())
file:///tmp/ab%00cd%C3

Real Python: Variables in Python


If you want to write code that is more complex, then your program will need data that can change as program execution proceeds.

Here’s what you’ll learn in this course:

  • How every item of data in a Python program can be described by the abstract term object
  • How to manipulate objects using symbolic names called variables


PyCoder’s Weekly: Issue #398 (Dec. 10, 2019)


#398 – DECEMBER 10, 2019



MicroPython: An Intro to Programming Hardware in Python

Are you interested in the Internet of Things, home automation, and connected devices? In this tutorial, you’ll learn about MicroPython and the world of electronics hardware. You’ll set up your board, write your code, and deploy a MicroPython project to your own device.
REAL PYTHON

PEP 591: Adding a final Qualifier to Typing

This PEP proposes a final qualifier to be added to the typing module—in the form of a final decorator and a Final type annotation—to serve three related purposes: Declaring that a method should not be overridden, declaring that a class should not be subclassed, declaring that a variable or attribute should not be reassigned.
PYTHON.ORG
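The three uses the PEP describes can be sketched as follows (assuming Python 3.8's typing module; the names here are illustrative, and enforcement happens in the type checker, not at runtime):

```python
from typing import Final, final

MAX_RETRIES: Final = 3          # type checkers flag any reassignment

class BaseHandler:
    @final
    def handle(self) -> None:   # type checkers flag overrides in subclasses
        print("handling")

@final
class Settings:                 # type checkers flag any subclassing
    pass
```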

Monitor Your Python Environment With Datadog APM and Distributed Tracing


Use detailed flame graphs to identify bottlenecks and latency, and correlate log and trace data for individual requests. Plus, the Datadog APM’s tracing client auto-instruments popular frameworks and libraries like Flask, Tornado, Django, and more. Try it free with a 14-day trial of Datadog →
DATADOG sponsor

Doing Python Configuration Right

“Let’s talk about configuring Python applications, specifically the kind that might live in multiple environments – dev, stage, production, etc.”
MICHAEL WHALEN

8 Great pytest Plugins

“One of the best aspects of the popular Python testing tool is its robust ecosystem. Here are eight of the best pytest plugins available.”
OPENSOURCE.COM • Shared by Python Bytes FM

Dismissing Python Garbage Collection at Instagram

How Instagram’s Django backend runs with GC disabled and gets a 10% performance gain.
INSTAGRAM ENGINEERING BLOG

Python Code Style & Pythonic Idioms

A review of general concepts, language idioms, and Pythonic coding conventions.
PYTHON-GUIDE.ORG

Monads Aren’t as Hard as You Think

Monads explained with Python code examples.
YING WANG

Python Jobs

Senior Python Engineer (Munich, Germany)

Stylight GmbH

Senior Python/Django Developer (Eindhoven, Netherlands)

Sendcloud

Django Full Stack Web Developer (Austin, TX)

Zeitcode

Contract Python / RaspPi / EPICS (Remote)

D-Pace Inc

More Python Jobs >>>

Articles & Tutorials

Inflationary Constant Factors and Why Python is Faster Than C++ (PDF)

Why constant-factor differences in algorithm complexity do matter in practice.
MEHRDAD NIKNAMI

Beautiful Soup: Build a Web Scraper With Python

In this tutorial, you’ll walk through the main steps of the web scraping process. You’ll learn how to write a script that uses Python’s requests library to scrape data from a website. You’ll also use Beautiful Soup to extract the specific pieces of information that you’re interested in.
REAL PYTHON

See How the Top BI Platforms Compare to Mode Analytics


This month, Mode was named a leader in BI and analytics by G2 Crowd. G2 aggregated customer reviews for all the top BI platforms. See which companies do best in ease of setup, ease of admin, future direction, and more →
MODE ANALYTICS sponsor

Multithreading vs Multiprocessing in Python

A discussion of misconceptions about multithreading in Python.
AMINE BAATOUT

Variables in Python

Learn how every item of data in a Python program can be described by the abstract term “object,” and how to manipulate objects using symbolic names called “variables.”
REAL PYTHON video

3 Packages to Build a Spell Checker in Python

“Learn what packages can work as a spell checker in Python. We’ll discuss pyspellchecker, TextBlob, and autocorrect for performing this task.”
THEAUTOMATIC.NET

OCR in Python with Tesseract, OpenCV and Pytesseract

Get started with Tesseract and OpenCV for OCR in Python: preprocessing, deep learning OCR, text extraction and limitations.
FILIP ZELIC

Using Machine Learning to Learn How to Compress

shrynk is a Python package that uses machine learning to compress your Pandas DataFrame (or Python dictionaries).
VKS.AI • Shared by Pascal van Kooten

Installing & Updating Packages in Python

Learn how to install, use, and update Python packages using pip, conda, and Anaconda Navigator.
ERIK MARSJA

Python Strings and Character Data (Interactive Quiz)

Test your understanding of Python strings and character data.
REAL PYTHON

Measure and Improve Python Code Performance With Blackfire.io

Profile in development, test/staging, and production, with no overhead for end users! Blackfire supports any Python version from 2.7.x and 3.x. Find bottlenecks in wall-time, I/O, CPU, memory, HTTP requests, and SQL queries.
BLACKFIRE sponsor

Projects & Code

Events

Python Miami

December 14 to December 15, 2019
PYTHONDEVELOPERSMIAMI.COM

PiterPy Breakfast

December 18, 2019
TIMEPAD.RU

Python Northwest

December 19, 2019
PYNW.ORG.UK

PyLadies Dublin

December 19, 2019
PYLADIES.COM

MadPUG

December 19, 2019
MEETUP.COM


Happy Pythoning!
This was PyCoder’s Weekly Issue #398.




IslandT: Python Positional-only parameters


I have downloaded Python 3.8 and started to play around with its latest features. In this article, we will look at the positional-only parameter syntax: a / in a function signature indicates that the parameters before it must be specified positionally and cannot be passed as keyword arguments, while the parameters after the / may be passed either positionally or by keyword. For example,

def f(a, b, /, c, d):
    print(a, b, c, d)
f(10, 20, 30, d=40)
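Passing one of the positional-only parameters by keyword raises a TypeError. A quick sketch (assuming Python 3.8+):

```python
def f(a, b, /, c, d):
    print(a, b, c, d)

f(10, 20, 30, d=40)         # fine: a and b are passed positionally

try:
    f(a=10, b=20, c=30, d=40)   # a and b may not be passed by keyword
except TypeError as exc:
    print(exc)
```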

The example below prints the sum of all arguments passed to the function, whether positional or keyword.

import math  

def e(a):
    return a * a

def f(a, b, /, **kwargs):
    sum = a + b 
    for num in kwargs:
        sum += kwargs[num]
    print(sum)

f(2, 3, c=40, d=e(10), e=math.sin(60)) # output 144.695...

The syntax was contributed by Pablo Galindo. Do you think it is useful? Leave your comment below this post.

Stack Abuse: Encoding and Decoding Base64 Strings in Python


Introduction

Have you ever received a PDF or an image file from someone via email, only to see strange characters when you open it? This can happen if your email server was only designed to handle text data. Files with binary data, bytes that represent non-text information like images, can easily be corrupted when transferred to and processed by text-only systems.

Base64 encoding allows us to convert bytes containing binary or text data to ASCII characters. By encoding our data, we improve the chances of it being processed correctly by various systems.

In this tutorial, we'll learn how Base64 encoding and decoding work, and how they can be used. We will then use Python to Base64 encode and decode both text and binary data.

What is Base64 Encoding?

Base64 encoding is a type of conversion of bytes into ASCII characters. In mathematics, the base of a number system refers to how many different characters represent numbers. The name of this encoding comes directly from the mathematical definition of bases - we have 64 characters that represent numbers.

The Base64 character set contains:

  • 26 uppercase letters
  • 26 lowercase letters
  • 10 numbers
  • the two symbols + and / (some implementations may use different characters)

When the computer converts Base64 characters to binary, each Base64 character represents 6 bits of information.
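The 6-bit figure follows directly from the alphabet size, and it also explains why Base64 output is about a third larger than its input; a quick check of the arithmetic:

```python
# 64 characters in the alphabet means log2(64) = 6 bits per character.
assert 2 ** 6 == 64

# Three 8-bit bytes carry the same 24 bits as four 6-bit Base64 characters,
# so every 3 input bytes become 4 output characters.
assert 3 * 8 == 4 * 6
```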

Note: This is not an encryption algorithm, and should not be used for security purposes.

Now that we know what Base64 encoding is and how it is represented on a computer, let's look deeper into how it works.

How Does Base64 Encoding Work?

We will illustrate how Base64 encoding works by converting text data, as it is easier to follow than the various binary formats. If we were to Base64 encode a string, we would follow these steps:

  1. Take the ASCII value of each character in the string
  2. Calculate the 8-bit binary equivalent of the ASCII values
  3. Convert the 8-bit chunks into chunks of 6 bits by simply re-grouping the digits
  4. Convert the 6-bit binary groups to their respective decimal values.
  5. Using a base64 encoding table, assign the respective base64 character for each decimal value.

Let's see how it works by converting the string "Python" to a Base64 string.

The ASCII values of the characters P, y, t, h, o, n are 80, 121, 116, 104, 111, 110 respectively. We can represent these ASCII values in 8-bit binary as follows:

01010000 01111001 01110100 01101000 01101111 01101110

Recall that Base64 characters only represent 6 bits of data. We now re-group the 8-bit binary sequences into chunks of 6 bits. The resultant binary will look like this:

010100 000111 100101 110100 011010 000110 111101 101110

Note: Sometimes we are not able to group the data into sequences of 6 bits. If that occurs, we have to pad the sequence.

With our data in groups of 6 bits, we can obtain the decimal value for each group. Using our last result, we get the following decimal values:

20 7 37 52 26 6 61 46

Finally, we will convert these decimals into the appropriate Base64 character using the Base64 conversion table:

Base64 Encoding Table

As you can see, the value 20 corresponds to the letter U. Then we look at 7 and observe it's mapped to H. Continuing this lookup for all decimal values, we can determine that "Python" is represented as UHl0aG9u when Base64 encoded. You can verify this result with an online converter.

To Base64 encode a string, we convert it to binary sequences, then to decimal sequences, and finally use a lookup table to get a string of ASCII characters. With that deeper understanding of how it works, let's look at why we would Base64 encode our data.
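The five steps above can be sketched directly in Python without the base64 module (a minimal illustration for padding-free input like "Python"; names are my own):

```python
import string

# Step 5's lookup table: A-Z, a-z, 0-9, '+', '/'
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def b64_manual(text: str) -> str:
    # Steps 1-2: ASCII values as one long bit string.
    bits = "".join(f"{ord(ch):08b}" for ch in text)
    # Step 3: regroup into 6-bit chunks (6-byte input needs no padding).
    chunks = [bits[i:i + 6] for i in range(0, len(bits), 6)]
    # Steps 4-5: decimal value of each chunk, then table lookup.
    return "".join(ALPHABET[int(chunk, 2)] for chunk in chunks)

print(b64_manual("Python"))  # UHl0aG9u
```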

Why use Base64 Encoding?

In computers, all data of different types are transmitted as 1s and 0s. However, some communication channels and applications are not able to understand all the bits it receives. This is because the meaning of a sequence of 1s and 0s is dependent on the type of data it represents. For example, 10110001 must be processed differently if it represents a letter or an image.

To work around this limitation, you can encode your data to text, improving the chances of it being transmitted and processed correctly. Base64 is a popular method to get binary data into ASCII characters, which is widely understood by the majority of networks and applications.

A common real-world scenario where Base64 encoding is heavily used is mail servers. They were originally built to handle text data, but we also expect them to send images and other media with a message. In those cases, your media data is Base64 encoded when it is sent, and then Base64 decoded when it is received, so an application can use it. For example, an image embedded in the HTML might look like this:

<img src="data:image/png;base64,aVRBOw0AKg1mL9...">

Understanding that data sometimes needs to be sent as text so it won't be corrupted, let's look at how we can use Python to Base64 encode and decode data.

Encoding Strings with Python

Python 3 provides a base64 module that allows us to easily encode and decode information. We first convert the string into a bytes-like object. Once converted, we can use the base64 module to encode it.

In a new file encoding_text.py, enter the following:

import base64

message = "Python is fun"
message_bytes = message.encode('ascii')
base64_bytes = base64.b64encode(message_bytes)
base64_message = base64_bytes.decode('ascii')

print(base64_message)

In the code above, we first imported the base64 module. The message variable stores our input string to be encoded. We convert that to a bytes-like object using the string's encode method and store it in message_bytes. We then Base64 encode message_bytes and store the result in base64_bytes using the base64.b64encode method. We finally get the string representation of the Base64 conversion by decoding the base64_bytes as ASCII.

Note: Be sure to use the same encoding format when converting from string to bytes, and from bytes to string. This prevents data corruption.

Running this file would provide the following output:

$ python3 encoding_text.py
UHl0aG9uIGlzIGZ1bg==

Now let's see how we can decode a Base64 string to its raw representation.

Decoding Strings with Python

Decoding a Base64 string is essentially a reverse of the encoding process. We decode the Base64 string into bytes of unencoded data. We then convert the bytes-like object into a string.

In a new file called decoding_text.py, write the following code:

import base64

base64_message = 'UHl0aG9uIGlzIGZ1bg=='
base64_bytes = base64_message.encode('ascii')
message_bytes = base64.b64decode(base64_bytes)
message = message_bytes.decode('ascii')

print(message)

Once again, we need the base64 module imported. We then encode our message into a bytes-like object with encode('ascii'). We continue by calling the base64.b64decode method to decode the base64_bytes into our message_bytes variable. Finally, we decode message_bytes into the string object message, so it becomes human-readable.

Run this file to see the following output:

$ python3 decoding_text.py
Python is fun

Now that we can encode and decode string data, let's try to encode binary data.

Encoding Binary Data with Python

As we mentioned previously, Base64 encoding is primarily used to represent binary data as text. In Python, we need to read the binary file, and Base64 encode its bytes so we can generate its encoded string.

Let's see how we can encode this image:

Python logo

Create a new file encoding_binary.py and add the following:

import base64

with open('logo.png', 'rb') as binary_file:
    binary_file_data = binary_file.read()
    base64_encoded_data = base64.b64encode(binary_file_data)
    base64_message = base64_encoded_data.decode('utf-8')

    print(base64_message)

Let's go over the code snippet above. We open the file using open('logo.png', 'rb'). Note how we passed the 'rb' argument along with the file path - this tells Python that we are reading a binary file. Without 'rb', Python would assume we are reading a text file.

We then use the read() method to get all the data in the file into the binary_file_data variable. Similar to how we treated strings, we Base64 encoded the bytes with base64.b64encode and then used the decode('utf-8') on base64_encoded_data to get the Base64 encoded data using human-readable characters.

Executing the code will produce output similar to:

$ python3 encoding_binary.py
iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAACXBIWXMAAAsTAAALEwEAmpwYAAAB1klEQVQ4jY2TTUhUURTHf+fy/HrjhNEX2KRGiyIXg8xgSURuokXLxFW0qDTaSQupkHirthK0qF0WQQQR0UCbwCQyw8KCiDbShEYLJQdmpsk3895p4aSv92ass7pcfv/zP+fcc4U6kXKe2pTY3tjSUHjtnFgB0VqchC/SY8/293S23f+6VEj9KKwCoPDNIJdmr598GOZNJKNWTic7tqb27WwNuuwGvVWrAit84fsmMzE1P1+1TiKMVKvYUjdBvzPZXCwXzyhyWNBgVYkgrIow09VJMznpyebWE+Tdn9cEroBSc1JVPS+6moh5Xyjj65vEgBxafGzWetTh+rr1eE/c/TMYg8hlAOvI6JP4KmwLgJ4qD0TIbliTB+sunjkbeLekKsZ6Zc8V027aBRoBRHVoduDiSypmGFG7CrcBEyDHA0ZNfNphC0D6amYa6ANw3YbWD4Pn3oIc+EdL36V3od0A+MaMAXmA8x2Zyn+IQeQeBDfRcUw3B+2PxwZ/EdtTDpCPQLMh9TKx0k3pXipEVlknsf5KoNzGyOe1sz8nvYtTQT6yyvTjIaxsmHGB9pFx4n3jIEfDePQvCIrnn0J4B/gA5J4XcRfu4JZuRAw3C51OtOjM3l2bMb8Br5eXCsT/w/EAAAAASUVORK5CYII=

Your output may vary depending on the image you've chosen to encode.

Now that we know how to Base64 encode binary data in Python, let's move on to Base64 decoding binary data.

Decoding Binary Data with Python

Base64 decoding binary is similar to Base64 decoding text data. The key difference is that after we Base64 decode the string, we save the data as a binary file instead of a string.

Let's see how to Base64 decode binary data in practice by creating a new file called decoding_binary.py. Type the following code into the Python file:

import base64

base64_img = 'iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAACXBIWXMAAAsTAAA' \
            'LEwEAmpwYAAAB1klEQVQ4jY2TTUhUURTHf+fy/HrjhNEX2KRGiyIXg8xgSURuokX' \
            'LxFW0qDTaSQupkHirthK0qF0WQQQR0UCbwCQyw8KCiDbShEYLJQdmpsk3895p4aS' \
            'v92ass7pcfv/zP+fcc4U6kXKe2pTY3tjSUHjtnFgB0VqchC/SY8/293S23f+6VEj' \
            '9KKwCoPDNIJdmr598GOZNJKNWTic7tqb27WwNuuwGvVWrAit84fsmMzE1P1+1TiK' \
            'MVKvYUjdBvzPZXCwXzyhyWNBgVYkgrIow09VJMznpyebWE+Tdn9cEroBSc1JVPS+' \
            '6moh5Xyjj65vEgBxafGzWetTh+rr1eE/c/TMYg8hlAOvI6JP4KmwLgJ4qD0TIbli' \
            'TB+sunjkbeLekKsZ6Zc8V027aBRoBRHVoduDiSypmGFG7CrcBEyDHA0ZNfNphC0D' \
            '6amYa6ANw3YbWD4Pn3oIc+EdL36V3od0A+MaMAXmA8x2Zyn+IQeQeBDfRcUw3B+2' \
            'PxwZ/EdtTDpCPQLMh9TKx0k3pXipEVlknsf5KoNzGyOe1sz8nvYtTQT6yyvTjIax' \
            'smHGB9pFx4n3jIEfDePQvCIrnn0J4B/gA5J4XcRfu4JZuRAw3C51OtOjM3l2bMb8' \
            'Br5eXCsT/w/EAAAAASUVORK5CYII='

base64_img_bytes = base64_img.encode('utf-8')
with open('decoded_image.png', 'wb') as file_to_save:
    decoded_image_data = base64.decodebytes(base64_img_bytes)
    file_to_save.write(decoded_image_data)

In the above code, we first convert our Base64 string data into a bytes-like object that can be decoded. When you are Base64 decoding a binary file, you must know the type of data that is being decoded. For example, this data is only valid as a PNG file and not an MP3 file, as it encodes an image.

Once the destination file is open, we Base64 decode the data with base64.decodebytes, a different method from base64.b64decode that was used with strings. This method should be used to decode binary data. Finally, we write the decoded data to a file.

In the same directory that you executed decoding_binary.py, you would now see a new decoded_image.png file that contains the original image that was encoded earlier.
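Because encoding and decoding are exact inverses, a round trip always reproduces the original bytes. Here is a small self-contained sketch using arbitrary in-memory data rather than the image file:

```python
import base64

original = bytes(range(16))            # arbitrary binary data
encoded = base64.b64encode(original)   # raw bytes -> Base64 bytes
decoded = base64.decodebytes(encoded)  # Base64 bytes -> raw bytes

# The round trip reproduces the input exactly.
assert decoded == original
print(encoded.decode('ascii'))
```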

Conclusion

Base64 encoding is a popular technique to convert data in different binary formats to a string of ASCII characters. This is useful when transmitting data to networks or applications that cannot process raw binary data but would readily handle text.

With Python, we can use the base64 module to Base64 encode and decode text and binary data.

What applications would you use to encode and decode Base64 data?

Talk Python to Me: #242 Your education will be live-streamed

Online education has certainly gone mainstream. Developers and companies have finally gotten comfortable taking online courses. Sometimes these are recorded, self-paced courses like we have at Talk Python Training. Other times, they are more like live events in webcast format.

Real Python: Data Engineer Interview Questions With Python


Going to interviews can be a time-consuming and tiring process, and technical interviews can be even more stressful! This tutorial is aimed to prepare you for some common questions you’ll encounter during your data engineer interview. You’ll learn how to answer questions about databases, Python, and SQL.

By the end of this tutorial, you’ll be able to:

  • Understand common data engineer interview questions
  • Distinguish between relational and non-relational databases
  • Set up databases using Python
  • Use Python for querying data


Becoming a Data Engineer

The data engineering role can be a vast and varied one. You’ll need to have a working knowledge of multiple technologies and concepts. Data engineers are flexible in their thinking. As a result, they can be proficient in multiple topics, such as databases, software development, DevOps, and big data.

What Does a Data Engineer Do?

Given its varied skill set, a data engineering role can span many different job descriptions. A data engineer can be responsible for database design, schema design, and creating multiple database solutions. This work might also involve a Database Administrator.

As a data engineer, you might act as a bridge between the database and the data science teams. In that case, you’ll be responsible for data cleaning and preparation, as well. If big data is involved, then it’s your job to come up with an efficient solution for that data. This work can overlap with the DevOps role.

You’ll also need to make efficient data queries for reporting and analysis. You might need to interact with multiple databases or write stored procedures. For many solutions, like high-traffic websites or services, there may be more than one database present. In these cases, the data engineer is responsible for setting up the databases, maintaining them, and transferring data between them.

How Can Python Help Data Engineers?

Python is known for being the Swiss Army knife of programming languages. It’s especially useful in data science, backend systems, and server-side scripting. That’s because Python has strong typing, simple syntax, and an abundance of third-party libraries to use. Pandas, SciPy, TensorFlow, SQLAlchemy, and NumPy are some of the most widely used libraries in production across different industries.

Most importantly, Python decreases development time, which means fewer expenses for companies. For a data engineer, most code execution is database-bound, not CPU-bound. Because of this, it makes sense to capitalize on Python’s simplicity, even at the cost of slower performance when compared to compiled languages such as C# and Java.

Answering Data Engineer Interview Questions

Now that you know what your role might consist of, it’s time to learn how to answer some data engineer interview questions! While there’s a lot of ground to cover, you’ll see practical Python examples throughout the tutorial to guide you along the way.

Questions on Relational Databases

Databases are one of the most crucial components in a system. Without them, there can be no state and no history. While you may not have considered database design to be a priority, know that it can have a significant impact on how quickly your page loads. In the past few years, large corporations have introduced several new tools and techniques:

  • NoSQL
  • Cache databases
  • Graph databases
  • NoSQL support in SQL databases

These and other techniques were invented to try and increase the speed at which databases process requests. You’ll likely need to talk about these concepts in your data engineer interview, so let’s go over some questions!

Q1: Relational vs Non-Relational Databases

A relational database is one where data is stored in the form of a table. Each table has a schema, which is the columns and types a record is required to have. Each schema must have at least one primary key that uniquely identifies that record. In other words, there are no duplicate rows in your database. Moreover, each table can be related to other tables using foreign keys.

One important aspect of relational databases is that a change in a schema must be applied to all records. This can sometimes cause breakages and big headaches during migrations. Non-relational databases tackle things in a different way. They are inherently schema-less, which means that records can be saved with different schemas and with a different, nested structure. Records can still have primary keys, but a change in the schema is done on an entry-by-entry basis.

You would need to perform a speed comparison test based on the type of function being performed. You can choose INSERT, UPDATE, DELETE, or another function. Schema design, indices, the number of aggregations, and the number of records will also affect this analysis, so you’ll need to test thoroughly. You’ll learn more about how to do this later on.

Databases also differ in scalability. A non-relational database may be less of a headache to distribute. That’s because a collection of related records can be easily stored on a particular node. On the other hand, relational databases require more thought and usually make use of a master-slave system.

A SQLite Example

Now that you’ve answered what relational databases are, it’s time to dig into some Python! SQLite is a convenient database that you can use on your local machine. The database is a single file, which makes it ideal for prototyping purposes. First, import the required Python library and create a new database:

import sqlite3

db = sqlite3.connect(':memory:')  # Using an in-memory database
cur = db.cursor()

You’re now connected to an in-memory database and have your cursor object ready to go.

Next, you’ll create the following three tables:

  1. Customer: This table will contain a primary key as well as the customer’s first and last names.
  2. Items: This table will contain a primary key, the item name, and the item price.
  3. Items Bought: This table will contain an order number, date, and price. It will also connect to the primary keys in the Items and Customer tables.

Now that you have an idea of what your tables will look like, you can go ahead and create them:

cur.execute('''CREATE TABLE IF NOT EXISTS Customer (
                id integer PRIMARY KEY,
                firstname varchar(255),
                lastname varchar(255))''')
cur.execute('''CREATE TABLE IF NOT EXISTS Item (
                id integer PRIMARY KEY,
                title varchar(255),
                price decimal)''')
cur.execute('''CREATE TABLE IF NOT EXISTS BoughtItem (
                ordernumber integer PRIMARY KEY,
                customerid integer,
                itemid integer,
                price decimal,
                CONSTRAINT customerid
                    FOREIGN KEY (customerid) REFERENCES Customer(id),
                CONSTRAINT itemid
                    FOREIGN KEY (itemid) REFERENCES Item(id))''')

You’ve passed a query to cur.execute() to create your three tables.

The last step is to populate your tables with data:

cur.execute('''INSERT INTO Customer(firstname, lastname)
               VALUES ('Bob', 'Adams'),
                      ('Amy', 'Smith'),
                      ('Rob', 'Bennet');''')
cur.execute('''INSERT INTO Item(title, price)
               VALUES ('USB', 10.2),
                      ('Mouse', 12.23),
                      ('Monitor', 199.99);''')
cur.execute('''INSERT INTO BoughtItem(customerid, itemid, price)
               VALUES (1, 1, 10.2),
                      (1, 2, 12.23),
                      (1, 3, 199.99),
                      (2, 3, 180.00),
                      (3, 2, 11.23);''')  # Discounted price

Now that there are a few records in each table, you can use this data to answer a few more data engineer interview questions.

Q2: SQL Aggregation Functions

Aggregation functions are those that perform a mathematical operation on a result set. Some examples include AVG, COUNT, MIN, MAX, and SUM. Often, you’ll need GROUP BY and HAVING clauses to complement these aggregations. One useful aggregation function is AVG, which you can use to compute the mean of a given result set:

>>> cur.execute('''SELECT itemid, AVG(price) FROM BoughtItem GROUP BY itemid''')
>>> print(cur.fetchall())
[(1, 10.2), (2, 11.73), (3, 189.995)]

Here, you’ve retrieved the average price for each of the items bought in your database. You can see that the item with an itemid of 1 has an average price of $10.20.

To make the above output easier to understand, you can display the item name instead of the itemid:

>>> cur.execute('''SELECT item.title, AVG(boughtitem.price) FROM BoughtItem as boughtitem
...                INNER JOIN Item as item on (item.id = boughtitem.itemid)
...                GROUP BY boughtitem.itemid''')
>>> print(cur.fetchall())
[('USB', 10.2), ('Mouse', 11.73), ('Monitor', 189.995)]

Now, you see more easily that the item with an average price of $10.20 is the USB.

Another useful aggregation is SUM. You can use this function to display the total amount of money that each customer spent:

>>> cur.execute('''SELECT customer.firstname, SUM(boughtitem.price) FROM BoughtItem as boughtitem
...                INNER JOIN Customer as customer on (customer.id = boughtitem.customerid)
...                GROUP BY customer.firstname''')
>>> print(cur.fetchall())
[('Amy', 180), ('Bob', 222.42000000000002), ('Rob', 11.23)]

In total, the customer named Amy spent about $180, while Rob spent only $11.23!

If your interviewer likes databases, then you might want to brush up on nested queries, join types, and the steps a relational database takes to perform your query.
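As a small refresher, here is a self-contained sketch of a nested (sub)query, using a copy of the Item table from earlier. The query and data are illustrative, not from the original article:

```python
import sqlite3

db = sqlite3.connect(':memory:')
cur = db.cursor()
cur.execute('''CREATE TABLE Item (
                id integer PRIMARY KEY,
                title varchar(255),
                price decimal)''')
cur.execute('''INSERT INTO Item(title, price)
               VALUES ('USB', 10.2),
                      ('Mouse', 12.23),
                      ('Monitor', 199.99);''')

# Nested query: find every item priced above the average item price.
cur.execute('''SELECT title FROM Item
               WHERE price > (SELECT AVG(price) FROM Item)''')
print(cur.fetchall())  # [('Monitor',)]
```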

Q3: Speeding Up SQL Queries

Speed depends on various factors, but is mostly affected by how many of each of the following are present:

  • Joins
  • Aggregations
  • Traversals
  • Records

The greater the number of joins, the higher the complexity and the larger the number of traversals in tables. Multiple joins are quite expensive to perform on several thousands of records involving several tables because the database also needs to cache the intermediate result! At this point, you might start to think about how to increase your memory size.

Speed is also affected by whether or not there are indices present in the database. Indices are extremely important and allow you to quickly search through a table and find a match for some column specified in the query.

Indices sort the records at the cost of higher insert time, as well as some storage. Multiple columns can be combined to create a single index. For example, the columns date and price might be combined because your query depends on both conditions.
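A composite index on date and price can be sketched in SQLite as follows; the table and index names here are hypothetical:

```python
import sqlite3

db = sqlite3.connect(':memory:')
cur = db.cursor()
cur.execute('''CREATE TABLE Sale (
                id integer PRIMARY KEY,
                date text,
                price decimal)''')
# A composite index covering both columns the query filters on.
cur.execute('CREATE INDEX idx_sale_date_price ON Sale (date, price)')

# SQLite can now use the index for queries filtering on date and price.
cur.execute('''EXPLAIN QUERY PLAN SELECT * FROM Sale
               WHERE date = '2019-01-01' AND price > 100''')
print(cur.fetchall())
```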

Q4: Debugging SQL Queries

Most databases include an EXPLAIN QUERY PLAN that describes the steps the database takes to execute the query. For SQLite, you can enable this functionality by adding EXPLAIN QUERY PLAN in front of a SELECT statement:

>>> cur.execute('''EXPLAIN QUERY PLAN SELECT customer.firstname, item.title,
...                item.price, boughtitem.price FROM BoughtItem as boughtitem
...                INNER JOIN Customer as customer on (customer.id = boughtitem.customerid)
...                INNER JOIN Item as item on (item.id = boughtitem.itemid)''')
>>> print(cur.fetchall())
[(4, 0, 0, 'SCAN TABLE BoughtItem AS boughtitem'), (6, 0, 0, 'SEARCH TABLE Customer AS customer USING INTEGER PRIMARY KEY (rowid=?)'), (9, 0, 0, 'SEARCH TABLE Item AS item USING INTEGER PRIMARY KEY (rowid=?)')]

This query tries to list the first name, item title, original price, and bought price for all the bought items.

Here’s what the query plan itself looks like:

SCAN TABLE BoughtItem AS boughtitem
SEARCH TABLE Customer AS customer USING INTEGER PRIMARY KEY (rowid=?)
SEARCH TABLE Item AS item USING INTEGER PRIMARY KEY (rowid=?)

Note that the fetch statement in your Python code only returns the explanation, not the results. That’s because EXPLAIN QUERY PLAN is not intended to be used in production.

Questions on Non-Relational Databases

In the previous section, you laid out the differences between relational and non-relational databases and used SQLite with Python. Now you’re going to focus on NoSQL. Your goal is to highlight its strengths, differences, and use cases.

A MongoDB Example

You’ll use the same data as before, but this time your database will be MongoDB. This NoSQL database is document-based and scales very well. First things first, you’ll need to install the required Python library:

$ pip install pymongo

You also might want to install the MongoDB Compass Community. It includes a local IDE that’s perfect for visualizing the database. With it, you can see the created records, create triggers, and act as visual admin for the database.

Note: To run the code in this section, you’ll need a running database server. To learn more about how to set it up, check out Introduction to MongoDB and Python.

Here’s how you create the database and insert some data:

import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017/")

# Note: This database is not created until it is populated by some data
db = client["example_database"]
customers = db["customers"]
items = db["items"]

customers_data = [
    {"firstname": "Bob", "lastname": "Adams"},
    {"firstname": "Amy", "lastname": "Smith"},
    {"firstname": "Rob", "lastname": "Bennet"},
]
items_data = [
    {"title": "USB", "price": 10.2},
    {"title": "Mouse", "price": 12.23},
    {"title": "Monitor", "price": 199.99},
]

customers.insert_many(customers_data)
items.insert_many(items_data)

As you might have noticed, MongoDB stores data records in collections, which are the equivalent to a list of dictionaries in Python. In practice, MongoDB stores BSON documents.

Q5: Querying Data With MongoDB

Let’s try to replicate the BoughtItem table first, as you did in SQL. To do this, you must append a new field to a customer. MongoDB’s documentation specifies that the $set keyword operator can be used to update a record without having to write all the existing fields:

# Just add "boughtitems" to the customer where the firstname is Bob
bob = customers.update_many(
    {"firstname": "Bob"},
    {
        "$set": {
            "boughtitems": [
                {
                    "title": "USB",
                    "price": 10.2,
                    "currency": "EUR",
                    "notes": "Customer wants it delivered via FedEx",
                    "original_item_id": 1,
                }
            ]
        },
    },
)

Notice how you added additional fields to the customer without explicitly defining the schema beforehand. Nifty!

In fact, you can update another customer with a slightly altered schema:

amy = customers.update_many(
    {"firstname": "Amy"},
    {
        "$set": {
            "boughtitems": [
                {
                    "title": "Monitor",
                    "price": 199.99,
                    "original_item_id": 3,
                    "discounted": False,
                }
            ]
        },
    },
)
print(type(amy))  # pymongo.results.UpdateResult

Similar to SQL, document-based databases also allow queries and aggregations to be executed. However, the functionality can differ both syntactically and in the underlying execution. In fact, you might have noticed that MongoDB reserves the $ character to specify some command or aggregation on the records, such as $group. You can learn more about this behavior in the official docs.

You can perform queries just like you did in SQL. To start, you can create an index:

>>>
customers.create_index([("name", pymongo.DESCENDING)])

This is optional, but it speeds up queries that require name lookups.

Then, you can retrieve the customer names sorted in ascending order:

>>>
items = customers.find().sort("name", pymongo.ASCENDING)

You can also iterate through and print the bought items:

>>> for item in items:
...     print(item.get('boughtitems'))
None
[{'title': 'Monitor', 'price': 199.99, 'original_item_id': 3, 'discounted': False}]
[{'title': 'USB', 'price': 10.2, 'currency': 'EUR', 'notes': 'Customer wants it delivered via FedEx', 'original_item_id': 1}]

You can even retrieve a list of unique names in the database:

>>> customers.distinct("firstname")
['Bob', 'Amy', 'Rob']

Now that you know the names of the customers in your database, you can create a query to retrieve information about them:

>>> for i in customers.find({"$or": [{'firstname': 'Bob'}, {'firstname': 'Amy'}]},
...                         {'firstname': 1, 'boughtitems': 1, '_id': 0}):
...     print(i)
{'firstname': 'Bob', 'boughtitems': [{'title': 'USB', 'price': 10.2, 'currency': 'EUR', 'notes': 'Customer wants it delivered via FedEx', 'original_item_id': 1}]}
{'firstname': 'Amy', 'boughtitems': [{'title': 'Monitor', 'price': 199.99, 'original_item_id': 3, 'discounted': False}]}

Here’s the equivalent SQL query:

SELECT firstname, boughtitems
FROM customers
WHERE firstname IN ('Bob', 'Amy')

Note that even though the syntax may differ only slightly, there’s a drastic difference in the way queries are executed underneath the hood. This is to be expected because of the different query structures and use cases between SQL and NoSQL databases.

Q6: NoSQL vs SQL

If you have a constantly changing schema, such as financial regulatory information, then NoSQL can modify the records and nest related information. Imagine the number of joins you’d have to do in SQL if you had eight orders of nesting! However, this situation is more common than you would think.

Now, what if you want to run reports, extract information on that financial data, and infer conclusions? In this case, you need to run complex queries, and SQL tends to be faster in this respect.

Note: SQL databases, particularly PostgreSQL, have also released a feature that allows queryable JSON data to be inserted as part of a record. While this can combine the best of both worlds, speed may be of concern.

It’s faster to query unstructured data from a NoSQL database than it is to query JSON fields from a JSON-type column in PostgreSQL. You can always do a speed comparison test for a definitive answer.

Nonetheless, this feature might reduce the need for an additional database. Sometimes, pickled or serialized objects are stored in records in the form of binary types, and then de-serialized on read.

Speed isn’t the only metric, though. You’ll also want to take into account things like transactions, atomicity, durability, and scalability. Transactions are important in financial applications, and such features take precedence.

Since there’s a wide range of databases, each with its own features, it’s the data engineer’s job to make an informed decision on which database to use in each application. For more information, you can read up on ACID properties relating to database transactions.

You may also be asked what other databases you know of in your data engineer interview. There are several other relevant databases that are used by many companies:

  • Elasticsearch is highly efficient in text search. It leverages its document-based database to create a powerful search tool.
  • Newt DB combines ZODB and the PostgreSQL JSONB feature to create a Python-friendly NoSQL database.
  • InfluxDB is used in time-series applications to store events.

The list goes on, but this illustrates how a wide variety of available databases all cater to their niche industry.

Questions on Cache Databases

Cache databases hold frequently accessed data. They live alongside the main SQL and NoSQL databases. Their aim is to alleviate load and serve requests faster.

A Redis Example

You’ve covered SQL and NoSQL databases for long-term storage solutions, but what about faster, more immediate storage? How can a data engineer change how fast data is retrieved from a database?

Typical web-applications retrieve commonly-used data, like a user’s profile or name, very often. If all of the data is contained in one database, then the number of hits the database server gets is going to be over the top and unnecessary. As such, a faster, more immediate storage solution is needed.

While this reduces server load, it also creates two headaches for the data engineer, backend team, and DevOps team. First, you’ll now need some database that has a faster read time than your main SQL or NoSQL database. However, the contents of both databases must eventually match. (Welcome to the problem of state consistency between databases! Enjoy.)

The second headache is that DevOps now needs to worry about scalability, redundancy, and so on for the new cache database. In the next section, you’ll dive into issues like these with the help of Redis.

Q7: How to Use Cache Databases

You may have gotten enough information from the introduction to answer this question! A cache database is a fast storage solution used to store short-lived, structured, or unstructured data. It can be partitioned and scaled according to your needs, but it’s typically much smaller in size than your main database. Because of this, your cache database can reside in memory, allowing you to bypass the need to read from a disk.

Note: If you’ve ever used dictionaries in Python, then Redis follows the same structure. It’s a key-value store, where you can SET and GET data just like a Python dict.

When a request comes in, you first check the cache database, then the main database. This way, you can prevent any unnecessary and repetitive requests from reaching the main database’s server. Since a cache database has a lower read time, you also benefit from a performance increase!
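This cache-aside pattern can be sketched in pure Python, using a dict with expiry timestamps as a stand-in for Redis (all names here are hypothetical):

```python
import time

cache = {}  # stand-in for Redis: {key: (value, expires_at)}

def get_user_name(user_id):
    # 1. Check the cache database first.
    entry = cache.get(user_id)
    if entry is not None and entry[1] > time.time():
        return entry[0]
    # 2. Cache miss: read from the main database (stubbed out here).
    name = 'Bob'
    # 3. Store the value with a 60-minute expiry, like Redis SETEX.
    cache[user_id] = (name, time.time() + 3600)
    return name

print(get_user_name(42))  # first call hits the "main database"
print(get_user_name(42))  # second call is served from the cache
```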

You can use pip to install the required library:

$ pip install redis

Now, consider a request to get the user’s name from their ID:

import redis
from datetime import timedelta

# In a real web application, configuration is obtained from settings or utils
r = redis.Redis()

# Assume this is a getter handling a request
def get_name(request, *args, **kwargs):
    id = request.get('id')
    if id in r:
        return r.get(id)  # Assume that we have an {id: name} store
    else:
        # Get data from the main DB here, assume we already did it
        name = 'Bob'
        # Set the value in the cache database, with an expiration time
        r.setex(id, timedelta(minutes=60), value=name)
        return name

This code checks if the name is in Redis using the id key. If not, then the name is set with an expiration time, which you use because the cache is short-lived.

Now, what if your interviewer asks you what’s wrong with this code? Your response should be that there’s no exception handling! Databases can have many problems, like dropped connections, so it’s always a good idea to try and catch those exceptions.

Questions on Design Patterns and ETL Concepts

In large applications, you’ll often use more than one type of database. In fact, it’s possible to use PostgreSQL, MongoDB, and Redis all within just one application! One challenging problem is dealing with state changes between databases, which exposes the developer to issues of consistency. Consider the following scenario:

  1. A value in Database #1 is updated.
  2. That same value in Database #2 is kept the same (not updated).
  3. A query is run on Database #2.

Now, you’ve got yourself an inconsistent and outdated result! The results returned from the second database won’t reflect the updated value in the first one. This can happen with any two databases, but it’s especially common when the main database is a NoSQL database, and information is transformed into SQL for query purposes.

Databases may have background workers to tackle such problems. These workers extract data from one database, transform it in some way, and load it into the target database. When you’re converting from a NoSQL database to a SQL one, the Extract, transform, load (ETL) process takes the following steps:

  1. Extract: There is a MongoDB trigger whenever a record is created, updated, and so on. A callback function is called asynchronously on a separate thread.
  2. Transform: Parts of the record are extracted, normalized, and put into the correct data structure (or row) to be inserted into SQL.
  3. Load: The SQL database is updated in batches, or as a single record for high volume writes.

This workflow is quite common in financial, gaming, and reporting applications. In these cases, the constantly-changing schema requires a NoSQL database, but reporting, analysis, and aggregations require a SQL database.
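The three steps above can be sketched as a toy ETL pass, using a list of dicts as a hypothetical stand-in for the MongoDB source and an in-memory SQLite database as the target:

```python
import sqlite3

# Extract: hypothetical documents pulled from the NoSQL source.
nosql_records = [
    {"firstname": "Bob", "boughtitems": [{"title": "USB", "price": 10.2}]},
    {"firstname": "Amy", "boughtitems": [{"title": "Monitor", "price": 199.99}]},
]

def transform(record):
    # Flatten each nested document into SQL-style rows.
    return [(record["firstname"], item["title"], item["price"])
            for item in record["boughtitems"]]

rows = [row for record in nosql_records for row in transform(record)]

# Load: insert the transformed rows into the SQL target in one batch.
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE BoughtItem (firstname text, title text, price real)')
db.executemany('INSERT INTO BoughtItem VALUES (?, ?, ?)', rows)

print(rows)  # [('Bob', 'USB', 10.2), ('Amy', 'Monitor', 199.99)]
```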

Q8: ETL Challenges

There are several challenging concepts in ETL, including the following:

  • Big data
  • Stateful problems
  • Asynchronous workers
  • Type-matching

The list goes on! However, since the steps in the ETL process are well-defined and logical, the data and backend engineers will typically worry more about performance and availability rather than implementation.

If your application is writing thousands of records per second to MongoDB, then your ETL worker needs to keep up with transforming, loading, and delivering the data to the user in the requested form. Speed and latency can become an issue, so these workers are typically written in fast languages. You can use compiled code for the transform step to speed things up, as this part is usually CPU-bound.

Note: Multi-processing and separation of workers are other solutions that you might want to consider.

If you’re dealing with a lot of CPU-intensive functions, then you might want to check out Numba. This library compiles functions to make them faster on execution. Best of all, this is easily implemented in Python, though there are some limitations on what functions can be used in these compiled functions.
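For illustration, here is a minimal sketch of a CPU-bound loop decorated with Numba's `@njit`; the try/except fallback keeps the snippet runnable even when Numba isn't installed:

```python
# Sketch: compiling a CPU-bound function with Numba's @njit.
# Falls back to plain Python if Numba isn't available.
try:
    from numba import njit  # JIT-compiles the decorated function
except ImportError:
    def njit(func):
        return func  # no-op fallback: run as ordinary Python

@njit
def dot(xs, ys):
    # Simple numeric loop: the kind of code Numba compiles well.
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return total

print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```

In practice you'd pass NumPy arrays rather than Python lists, since Numba is fastest on typed array data.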

Q9: Design Patterns in Big Data

Imagine Amazon needs to create a recommender system to suggest suitable products to users. The data science team needs data, and lots of it! They go to you, the data engineer, and ask you to create a separate staging data warehouse. That’s where they’ll clean up and transform the data.

You might be shocked to receive such a request. When you have terabytes of data, you’ll need multiple machines to handle all of that information. A database aggregation function can be a very complex operation. How can you query, aggregate, and make use of relatively big data in an efficient way?

MapReduce, originally introduced by Google and popularized by Apache Hadoop, follows the map, shuffle, reduce workflow. The idea is to map different data on separate machines, also called clusters. Then, you can perform work on the data, grouped by a key, and finally, aggregate the data in the final stage.


This workflow is still used today, but it’s been fading recently in favor of Spark. The design pattern, however, forms the basis of most big data workflows and is a highly intriguing concept. You can read more on MapReduce at IBM Analytics.
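The map, shuffle, reduce workflow can be sketched with a toy word count, the classic MapReduce example:

```python
from collections import defaultdict

# Toy MapReduce word count: map -> shuffle (group by key) -> reduce.
documents = ["big data", "big ideas", "data pipelines"]

# Map: emit (key, 1) pairs, conceptually on separate machines.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all emitted values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: aggregate each group into the final result.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'ideas': 1, 'pipelines': 1}
```

In a real cluster the map and reduce steps run on different machines and the shuffle moves data over the network; here everything happens in one process purely to show the shape of the workflow.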

Q10: Common Aspects of the ETL Process and Big Data Workflows

You might think this a rather odd question, but it’s simply a check of your computer science knowledge, as well as your overall design knowledge and experience.

Both workflows follow the Producer-Consumer pattern. A worker (the Producer) produces data of some kind and outputs it to a pipeline. This pipeline can take many forms, including network messages and triggers. After the Producer outputs the data, the Consumer consumes and makes use of it. These workers typically work in an asynchronous manner and are executed in separate processes.

You can liken the Producer to the extract and transform steps of the ETL process. Similarly, in big data, the mapper can be seen as the Producer, while the reducer is effectively the Consumer. This separation of concerns is extremely important and effective in the development and architecture design of applications.
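A minimal sketch of the pattern, using Python's thread-safe `queue.Queue` as the pipeline between a Producer and a Consumer thread:

```python
import queue
import threading

# Minimal Producer-Consumer: the pipeline is a thread-safe queue.
pipeline = queue.Queue()
results = []

def producer():
    # "Extract and transform", then hand off to the pipeline.
    for record in ["extracted-1", "extracted-2", "extracted-3"]:
        pipeline.put(record.upper())
    pipeline.put(None)  # sentinel: no more data

def consumer():
    # "Load": consume items until the sentinel arrives.
    while True:
        item = pipeline.get()
        if item is None:
            break
        results.append(item)

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # ['EXTRACTED-1', 'EXTRACTED-2', 'EXTRACTED-3']
```

In production these workers would typically run in separate processes or on separate machines, with a message broker in place of the in-memory queue.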

Conclusion

Congratulations! You’ve covered a lot of ground and answered several data engineer interview questions. You now understand a bit more about the many different hats a data engineer can wear, as well as what your responsibilities are with respect to databases, design, and workflow.

Armed with this knowledge, you can now:

  • Use Python with SQL, NoSQL, and cache databases
  • Use Python in ETL and query applications
  • Plan projects ahead of time, keeping design and workflow in mind

While interview questions can be varied, you’ve been exposed to multiple topics and learned to think outside the box in many different areas of computer science. Now you’re ready to have an awesome interview!


[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]

Continuum Analytics Blog: 8 AI Predictions for 2020: Business Leaders & Researchers Weigh In

Codementor: The easiest way to deploy Django application

I show you how to deploy your Django application to Heroku.

Mike Driscoll: PyDev of the Week: Sebastian Steins


This week we welcome Sebastian Steins (@sebastiansteins) as our PyDev of the Week! Sebastian is the creator of the Pythonic News website. You can find out more about Sebastian by checking out what he’s been up to over on Github. Let’s take a few moments to get to know him better!

Sebastian Steins

Can you tell us a little about yourself (hobbies, education, etc):

I am a software developer from Germany and live close to the Dutch and Belgian border. The internet emerged when I was in school. I have always been fascinated by computers and wanted to learn to program. Unfortunately, this was not so easy at the time, and I did not have teachers who could have supported me in that matter. It changed, however, when I got my first modem. The internet opened a whole new world for me, and I started to learn HTML, Perl and later, PHP. I built CGI scripts and small web apps back then, and it was really fun. Eventually, I took programming as my career path, although I sometimes struggled with that decision. Besides my degree in computer science, I also heard lectures on economics and had a few positions in the finance sector early in my career. Now, I enjoy coaching teams of great software engineers in architecture matters and try to pass my knowledge to junior devs.

When I’m not in front of a computer, I like to ride my road bike, learn new stuff from audiobooks and would never say no to a night out in a good restaurant.

Why did you start using Python?

I started using Python when I needed a replacement for PHP, so it was very early on. It was in the very early days of the Python 2.0 release. I immediately liked it, because it was basically like writing pseudocode. This is what I still love about being able to “talk to a computer”: Expressing ideas and see results very quickly. Meanwhile, other languages have kept up and are equally expressive as Python. However, Python has become a little bit of my home base ever since.

What other programming languages do you know and which is your favorite?

I worked in different projects with many different programming languages like Java, C#, C and JavaScript.

What projects are you working on now?

I am working as a freelance consultant in software engineering. Besides that, I teach introductory courses over at smartninja.de. Recently I started a small project called Pythonic News, which is basically a Hacker News clone for the Python community.

Which Python libraries are your favorite (core or 3rd party)?

There are so many great libraries in the Python ecosystem. This is another point which makes programming in Python so enjoyable. Most of these libraries are very Pythonic in a way that ideas can be expressed very concisely. I really like the Django framework, but there are also smaller packages like requests which I use quite frequently.

Bigger ones, such as numpy and pandas, really prove that Python is so versatile that there is hardly a problem which cannot be solved with it.

What is the origin of your Pythonic News site?

I’ve built the https://news.python.sc site as a Django project. During the creation, I tried to use as many features of the Django framework as possible. This is because I created it initially as an example app for a Python training and wanted to showcase different ways of achieving things. This is also why not everything in the codebase can be considered best practice. The goal was to show what could be done with Django, including downsides of particular approaches. For example, I used model inheritance on the core database objects, which certainly is not the best choice performance-wise.

What have you learned creating the project?

Just for fun, I posted the project to Reddit and Hacker News. I got so much positive feedback that I wanted to see how it did before it was even complete. I learned that there are still places on the web which feel like the “good old days”. People just like to talk about the topics they care about. The early-2000s esthetics of the site turned out to be a good fit for the audience. There was not a single issue with spam or offensive behaviour. There still is “an internet” which is happening outside the walled gardens of the big tech companies.

This made me very happy!

Furthermore, I saw strangers react on GitHub in a way I would never have expected. That made me think that I will create a making-of tutorial series from that project to teach more people about Python. I will publish it on my site when it’s ready.

Do you have any words of wisdom for other content creators?

Just start and let the world know about your ideas and your creation. Even if it’s not perfect, you’ll find people that care. If it’s a Python project you created, of course, you can submit it to the “Show PN” section of Pythonic News.

Thanks for doing the interview, Sebastian!

The post PyDev of the Week: Sebastian Steins appeared first on The Mouse Vs. The Python.

Catalin George Festila: Python 3.7.5 : The Pygal python package.

Today's tutorial aims to get data from a URL and display it with the Pygal python package. I believe that global warming is a very important topic for human evolution. You can read more about this topic on this website. About this python package you can learn more at the official website. [mythcat@desk ~]$ pip3 install Pygal --user Collecting Pygal ... Installing collected packages: Pygal

Python Insider: Python 3.7.6rc1 and 3.6.10rc1 are now available for testing

Python 3.7.6rc1 and 3.6.10rc1 are now available. 3.7.6rc1 is the release preview of the next maintenance release of Python 3.7;  3.6.10rc1 is the release preview of the next security-fix release of Python 3.6. Assuming no critical problems are found prior to 2019-12-18, no code changes are planned between these release candidates and the final releases. These release candidates are intended to give you the opportunity to test the new security and bug fixes in 3.7.6 and security fixes in 3.6.10. While we strive to not introduce any incompatibilities in new maintenance and security releases, we encourage you to test your projects and report issues found to bugs.python.org as soon as possible. Please keep in mind that these are preview releases and, thus, their use is not recommended for production environments.

You can find the release files, a link to their changelogs, and more information here:


Stack Abuse: Merge Sort in Python


Introduction

Merge Sort is one of the most famous sorting algorithms. If you're studying Computer Science, Merge Sort, alongside Quick Sort, is likely the first efficient, general-purpose sorting algorithm you have heard of. It is also a classic example of the divide-and-conquer category of algorithms.

Merge Sort

The way Merge Sort works is:

An initial array is divided into two roughly equal parts. If the array has an odd number of elements, one of those "halves" is larger than the other by one element.

The subarrays are divided over and over again into halves until you end up with arrays that have only one element each.

Then you combine the pairs of one-element arrays into two-element arrays, sorting them in the process. Then these sorted pairs are merged into four-element arrays, and so on, until you end up with the initial array sorted.

Here's a visualization of Merge Sort:

[Image: Merge Sort visualization]

As you can see, the fact that the array couldn't be divided into equal halves isn't a problem, the 3 just "waits" until the sorting begins.

There are two main ways we can implement the Merge Sort algorithm. One is the top-down approach, like in the example above, which is how Merge Sort is most often introduced.

The other approach, i.e. bottom-up, works in the opposite direction, without recursion (works iteratively) - if our array has N elements we divide it into N subarrays of one element and sort pairs of adjacent one-element arrays, then sort the adjacent pairs of two-element arrays and so on.

Note: The bottom-up approach provides an interesting optimization, which we'll discuss later. We'll be implementing the top-down approach since it's simpler and more intuitive, coupled with the fact that there's no real difference in time complexity between them without specific optimizations.
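For comparison, here is a compact, self-contained sketch of the bottom-up approach: it iteratively merges adjacent runs of width 1, 2, 4, and so on, until one sorted run covers the whole array:

```python
def merge_sort_bottom_up(array):
    """Iterative (bottom-up) Merge Sort sketch: merge adjacent runs of
    width 1, 2, 4, ... until the whole array is a single sorted run."""
    n = len(array)
    width = 1
    while width < n:
        for left in range(0, n, 2 * width):
            middle = min(left + width, n)
            right = min(left + 2 * width, n)
            # Merge the two adjacent runs array[left:middle] and array[middle:right]
            merged, i, j = [], left, middle
            while i < middle and j < right:
                if array[i] <= array[j]:
                    merged.append(array[i]); i += 1
                else:
                    merged.append(array[j]); j += 1
            merged.extend(array[i:middle])  # leftover elements, if any
            merged.extend(array[j:right])
            array[left:right] = merged
        width *= 2
    return array

print(merge_sort_bottom_up([33, 42, 9, 37, 8, 47, 5]))  # [5, 8, 9, 33, 37, 42, 47]
```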

The main part of both these approaches is how we combine (merge) the two smaller arrays into a larger array. This is done fairly intuitively: let's say we examine the last step in our previous example. We have the arrays:

  • A: 2 4 7 8

  • B: 1 3 11

  • sorted: empty

The first thing we do is look at the first element of both arrays. We find the one that's smaller, in our case that's 1, so that's the first element of our sorted array, and we move forward in the B array:

  • A: 2 4 7 8

  • B: 1 3 11

  • sorted: 1

Then we look at the next pair of elements 2 and 3; 2 is smaller so we put it in our sorted array and move forward in array A. Of course, we don't move forward in array B and we keep our pointer at 3 for future comparisons:

  • A: 2 4 7 8

  • B: 1 3 11

  • sorted: 1 2

Using the same logic we move through the rest and end up with an array of {1, 2, 3, 4, 7, 8, 11}.

The two special cases that can occur are:

  • Both subarrays have the same element. We can move forward in either one and add the element to the sorted array. We could technically move forward in both arrays and add both elements to the sorted array, but this would require special behavior whenever we encounter the same element in both arrays.
  • We "run out" of elements in one subarray. For example, we have an array with {1, 2, 3} and an array with {9, 10, 11}. Clearly we'll go through all the elements in the first array without moving forward even once in the second. Whenever we run out of elements in a subarray, we simply append the remaining elements of the other, one after the other.

Keep in mind that we can sort however we want - this example sorts integers in ascending order but we can just as easily sort in descending order, or sort custom objects.

Implementation

We'll be implementing Merge Sort on two types of collections - on arrays of integers (typically used to introduce sorting) and on custom objects (a more practical and realistic scenario).

We'll implement the Merge Sort algorithm using the top-down approach. The algorithm doesn't look very "pretty" and can be confusing, so we'll go through each step in detail.

Sorting Arrays

Let's start with the easy part. The base idea of the algorithm is to divide (sub)arrays into halves and sort them recursively. We want to keep doing this as much as possible, i.e. until we end up with subarrays that have only one element:

def merge_sort(array, left_index, right_index):
    if left_index >= right_index:
        return

    middle = (left_index + right_index)//2
    merge_sort(array, left_index, middle)
    merge_sort(array, middle + 1, right_index)
    merge(array, left_index, right_index, middle)

By calling the merge method last, we make sure that all the divisions will happen before we start the sorting. We use the // operator to be explicit about the fact that we want integer values for our indices.

The next step is the actual merging part through a few steps and scenarios:

  • Create copies of our arrays. The first array being the subarray from [left_index,..,middle] and the second from [middle+1,...,right_index]
  • We go through both copies (keeping track of pointers in both arrays), picking the smaller of the two elements we're currently looking at, and add them to our sorted array. We move forward in whichever array we've chosen the element from, and forward in the sorted array regardless.
  • If we run out of elements in one of our copies - simply add the remaining elements in the other copy to the sorted array.

With our requirements laid out, let's go ahead and define a merge() function:

def merge(array, left_index, right_index, middle):
    # Make copies of both arrays we're trying to merge

    # The second parameter is non-inclusive, so we have to increase by 1
    left_copy = array[left_index:middle + 1]
    right_copy = array[middle+1:right_index+1]

    # Initial values for variables that we use to keep
    # track of where we are in each array
    left_copy_index = 0
    right_copy_index = 0
    sorted_index = left_index

    # Go through both copies until we run out of elements in one
    while left_copy_index < len(left_copy) and right_copy_index < len(right_copy):

        # If our left_copy has the smaller element, put it in the sorted
        # part and then move forward in left_copy (by increasing the pointer)
        if left_copy[left_copy_index] <= right_copy[right_copy_index]:
            array[sorted_index] = left_copy[left_copy_index]
            left_copy_index = left_copy_index + 1
        # Opposite from above
        else:
            array[sorted_index] = right_copy[right_copy_index]
            right_copy_index = right_copy_index + 1

        # Regardless of where we got our element from
        # move forward in the sorted part
        sorted_index = sorted_index + 1

    # We ran out of elements either in left_copy or right_copy
    # so we will go through the remaining elements and add them
    while left_copy_index < len(left_copy):
        array[sorted_index] = left_copy[left_copy_index]
        left_copy_index = left_copy_index + 1
        sorted_index = sorted_index + 1

    while right_copy_index < len(right_copy):
        array[sorted_index] = right_copy[right_copy_index]
        right_copy_index = right_copy_index + 1
        sorted_index = sorted_index + 1

Now let's test our program out:

array = [33, 42, 9, 37, 8, 47, 5, 29, 49, 31, 4, 48, 16, 22, 26]
merge_sort(array, 0, len(array) -1)
print(array)

And the output is:

[4, 5, 8, 9, 16, 22, 26, 29, 31, 33, 37, 42, 47, 48, 49]

Sorting Custom Objects

Now that we have the basic algorithm down we can take a look at how to sort custom classes. We can override the __eq__, __le__, __ge__ and other operators as needed for this.

This lets us use the same algorithm as above but limits us to only one way of sorting our custom objects, which in most cases isn't what we want. A better idea is to make the algorithm itself more versatile, and pass a comparison function to it instead.

First we'll implement a custom class, Car and add a few fields to it:

class Car:
    def __init__(self, make, model, year):
        self.make = make
        self.model = model
        self.year = year

    def __str__(self):
        return str.format("Make: {}, Model: {}, Year: {}", self.make, self.model, self.year)

Then we'll make a few changes to our Merge Sort methods. The easiest way to achieve what we want is by using lambda functions. You can see that we only added an extra parameter, changed the method calls accordingly, and altered one other line of code to make this algorithm a lot more versatile:

def merge(array, left_index, right_index, middle, comparison_function):
    left_copy = array[left_index:middle + 1]
    right_copy = array[middle+1:right_index+1]

    left_copy_index = 0
    right_copy_index = 0
    sorted_index = left_index

    while left_copy_index < len(left_copy) and right_copy_index < len(right_copy):

        # We use the comparison_function instead of a simple comparison operator
        if comparison_function(left_copy[left_copy_index], right_copy[right_copy_index]):
            array[sorted_index] = left_copy[left_copy_index]
            left_copy_index = left_copy_index + 1
        else:
            array[sorted_index] = right_copy[right_copy_index]
            right_copy_index = right_copy_index + 1

        sorted_index = sorted_index + 1

    while left_copy_index < len(left_copy):
        array[sorted_index] = left_copy[left_copy_index]
        left_copy_index = left_copy_index + 1
        sorted_index = sorted_index + 1

    while right_copy_index < len(right_copy):
        array[sorted_index] = right_copy[right_copy_index]
        right_copy_index = right_copy_index + 1
        sorted_index = sorted_index + 1


def merge_sort(array, left_index, right_index, comparison_function):
    if left_index >= right_index:
        return

    middle = (left_index + right_index)//2
    merge_sort(array, left_index, middle, comparison_function)
    merge_sort(array, middle + 1, right_index, comparison_function)
    merge(array, left_index, right_index, middle, comparison_function)

Let's test out our modified algorithm on a few Car instances:

car1 = Car("Alfa Romeo", "33 SportWagon", 1988)
car2 = Car("Chevrolet", "Cruze Hatchback", 2011)
car3 = Car("Corvette", "C6 Couple", 2004)
car4 = Car("Cadillac", "Seville Sedan", 1995)

array = [car1, car2, car3, car4]

merge_sort(array, 0, len(array) -1, lambda carA, carB: carA.year < carB.year)

print("Cars sorted by year:")
for car in array:
    print(car)

print()
merge_sort(array, 0, len(array) -1, lambda carA, carB: carA.make < carB.make)
print("Cars sorted by make:")
for car in array:
    print(car)

We get the output:

Cars sorted by year:
Make: Alfa Romeo, Model: 33 SportWagon, Year: 1988
Make: Cadillac, Model: Seville Sedan, Year: 1995
Make: Corvette, Model: C6 Couple, Year: 2004
Make: Chevrolet, Model: Cruze Hatchback, Year: 2011

Cars sorted by make:
Make: Alfa Romeo, Model: 33 SportWagon, Year: 1988
Make: Cadillac, Model: Seville Sedan, Year: 1995
Make: Chevrolet, Model: Cruze Hatchback, Year: 2011
Make: Corvette, Model: C6 Couple, Year: 2004

Optimization

Let's elaborate on the difference between top-down and bottom-up Merge Sort now. Bottom-up works like the second half of the top-down approach: instead of recursively calling the sort on halved subarrays, we iteratively sort adjacent subarrays.

One thing we can do to improve this algorithm is to consider sorted chunks instead of single elements before breaking the array down.

What this means is that, given an array such as {4, 8, 7, 2, 11, 1, 3}, instead of breaking it down into {4}, {8}, {7}, {2}, {11}, {1}, {3} - it's divided into subarrays which are already sorted: {4,8}, {7}, {2,11}, {1,3} - and these are then merged.

With real life data we often have a lot of these already sorted subarrays that can noticeably shorten the execution time of Merge Sort.

Another thing to consider with Merge Sort, particularly the top-down version is multi-threading. Merge Sort is convenient for this since each half can be sorted independently of its pair. The only thing that we need to make sure of is that we're done sorting each half before we merge them.

Merge Sort is, however, relatively inefficient (in both time and space) when it comes to smaller arrays, and is often optimized by stopping when we reach an array of ~7 elements and calling Insertion Sort to sort those instead, rather than going all the way down to one-element arrays, before merging into a larger array.

This is because Insertion Sort works really well with small and/or nearly sorted arrays.
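The hybrid described above can be sketched as follows; the cutoff value of 7 is illustrative, and the merge step is inlined to keep the example self-contained:

```python
CUTOFF = 7  # illustrative threshold: below this size, Insertion Sort tends to win

def insertion_sort(array, left, right):
    """Sort array[left..right] (inclusive) in place with Insertion Sort."""
    for i in range(left + 1, right + 1):
        key = array[i]
        j = i - 1
        while j >= left and array[j] > key:
            array[j + 1] = array[j]
            j -= 1
        array[j + 1] = key

def hybrid_merge_sort(array, left, right):
    if right - left + 1 <= CUTOFF:
        insertion_sort(array, left, right)  # small subarray: skip the recursion
        return
    middle = (left + right) // 2
    hybrid_merge_sort(array, left, middle)
    hybrid_merge_sort(array, middle + 1, right)
    # Merge the two sorted halves using temporary copies
    left_copy = array[left:middle + 1]
    right_copy = array[middle + 1:right + 1]
    i = j = 0
    k = left
    while i < len(left_copy) and j < len(right_copy):
        if left_copy[i] <= right_copy[j]:
            array[k] = left_copy[i]; i += 1
        else:
            array[k] = right_copy[j]; j += 1
        k += 1
    array[k:right + 1] = left_copy[i:] + right_copy[j:]

data = [33, 42, 9, 37, 8, 47, 5, 29, 49, 31, 4]
hybrid_merge_sort(data, 0, len(data) - 1)
print(data)  # [4, 5, 8, 9, 29, 31, 33, 37, 42, 47, 49]
```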

Conclusion

Merge Sort is an efficient, general-purpose sorting algorithm. Its main advantage is the reliable runtime of the algorithm and its efficiency when sorting large arrays. Unlike Quick Sort, it doesn't depend on any unfortunate decisions that lead to bad runtimes.

One of the main drawbacks is the additional memory that Merge Sort uses to store the temporary copies of arrays before merging them. However, Merge Sort is an excellent, intuitive example to introduce future Software Engineers to the divide-and-conquer approach to creating algorithms.

We've implemented Merge Sort both on simple integer arrays and on custom objects via a lambda function used for comparison. In the end, possible optimizations for both approaches were briefly discussed.

testmon: How to set-up and use py.test in Pycharm


We asked our friend Miro to try using Pytest with PyCharm. Miro has background in test automation, Python, DevOps and metal music. Since he was new to the set-up he would be the ideal person to write a beginner's guide. This is his take.

I've been using Vim and terminal as a weapon of choice for years. I've had a good time with it, however, more and more people ask me why I'm using this setup. And honestly, I don't know the answer.

I'm aware that things can be done more efficiently and an IDE can help with a lot of things. I guess that my weak spot is the unit tests and testing my code in general. I'm not running my tests when on the coding spree, I'm breaking lots of stuff, and only when I think I'm finished, I'll do the fixing and make everything running green again.

Well, I would like to change that. And I'm also curious about trying out new ways of doing things. The obvious choice for programming in Python is PyCharm. It's a nice IDE that supports many features I like and, most importantly, it can help with testing. PyCharm can easily integrate with popular test frameworks and run the tests for me.

In this article, we'll take a look at using py.test in PyCharm.

Example project

For the sake of this article, I've prepared a small project called MasterMind. It's a simple, console-based game just for learning purposes.

Most of you are probably familiar with the rules. One person thinks of a number and the second person must guess this number. In this case, you'll be playing against the computer. It'll generate a random number that you have to guess.

You can download the project from here.

Development environment

To start with a clean slate, I'll be using a virtual machine with a fresh install of macOS Mojave, Python 3 (installed with the Homebrew package manager), and PyCharm CE 2019.1.

I would like to briefly describe the MasterMind project structure.

/mastermind <--------------  top-level dir
  /src <-------------------  source codes / modules
    /mastermind <----------  main python module
      __init__.py
      __main__.py <--------  entry point (startup of game)
      exceptions.py <------  contains game exceptions
      game.py <------------  game logic
      heading.txt <--------  heading at the beginning of the game
      utils.py <-----------  various helpers
  /tests <-----------------  pytests
    test_mastermind.py <---  test suite
  setup.py <---------------  package setup script

Please notice that I'm using an src folder to separate top-level Python modules from the main directory. This is used to protect from accidental imports and other unwanted effects. If you're interested in more details, check out this blog.

I'm bringing this up because we'll have to do some additional steps in PyCharm configuration.

Pycharm Configuration

So here we are, I'm firing up PyCharm and opening the mastermind folder as a new project. If you're following this tutorial, you can extract mastermind.zip and open the mastermind directory in PyCharm. This will become the root folder of our project in PyCharm.

We'll have to configure PyCharm to use the Python 3 interpreter that we've installed with brew install python3. You can find this setting under Preferences -> Project Interpreter. If there are no interpreters at all, we'll have to add one. This will also create a virtualenv.

Since we are using a clean virtual environment, we'll have to install the py.test package in order to run the tests. Just stay on the interpreter preferences page and install packages from there. Right under the selected Python interpreter is a list of installed packages (so far just setuptools and pip). To install new packages, click the + sign and search for the pytest package. Press Install Package to install it. It will be added only to our project's virtualenv.

However, if you prefer the console, you can install Python packages using the pip command. Just open the console by pressing Terminal at the bottom left.

open terminal

Install pytest with the command pip install pytest. This has the same effect as installing through the PyCharm interface. In the terminal you should see something like this:

(venv)$ pip install pytest
Collecting pytest
  Downloading https://files.pythonhosted.org/packages/5d/c3/54f607bc9817fd284073ac68e99123f86616f431f9d29a855474b7cf00eb/pytest-4.4.1-py2.py3-none-any.whl (223kB)
     |████████████████████████████████| 225kB 1.7MB/s 

...[MORE PIP OUTPUT]...

Requirement already satisfied: setuptools in ./venv/lib/python3.7/site-packages (from pytest) (41.0.1)
Installing collected packages: pluggy, py, attrs, atomicwrites, more-itertools, six, pytest
Successfully installed atomicwrites-1.3.0 attrs-19.1.0 more-itertools-7.0.0 pluggy-0.9.0 py-1.8.0 pytest-4.4.1 six-1.12.0

Now we can configure pytest as the default test runner in PyCharm. You can find this setting under Preferences -> Tools -> Python Integrated Tools. Set Default test runner to pytest.

default test runner
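The console equivalent of this default is a small pytest configuration file. Assuming a pytest.ini at the project root (not part of the original project), the standard testpaths option tells pytest where to collect tests from, keeping terminal runs consistent with the IDE:

```ini
[pytest]
testpaths = tests
```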

Running App / Tests

So far, we've been playing with PyCharm and its configuration. Now we should check that the app works before proceeding further. MasterMind can be executed as a module; in the terminal, we would run this command to start the game. Please note that the current working directory must be the src directory.

(venv)$ cd src
(venv)$ python -m mastermind

When you execute a module like that, Python knows to go to the __main__.py file, which contains the main function, and execute it.
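As a sketch of what that entry point can look like (the real MasterMind __main__.py will differ; the main function here is illustrative):

```python
# src/mastermind/__main__.py -- run via `python -m mastermind`
def main():
    print("Welcome to MasterMind!")
    # ...the game loop would start here...

if __name__ == "__main__":
    main()
```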

So how do we do this in PyCharm? We'll need to create a Run Configuration.

In this configuration, we'll do basically the same thing as in the terminal: execute the module called mastermind. In the terminal, we also cd'd (changed directory) to src, which contains our mastermind module. We have to configure this in PyCharm as well; otherwise, it would not know where to find our module. You can do so by right-clicking the src directory and marking it as Sources Root.

Now, it's time to run our tests for MasterMind. To create a test configuration, we'll let the IDE set it up for us. Just right-click the tests directory in the Project tool window and select Run 'pytest in tests'.

This does two things: it runs the tests and creates a run configuration. Test results are displayed in the Test Runner tab.

test tab

Also, note the test configuration in the top right corner. We can save this configuration as our default test run.

Useful features

There are a couple of features that are useful for everyday workflow.

Execute Single Test

When you're writing new tests for your app, PyCharm makes it really easy to execute a single test, e.g. the one you've just finished writing. You can do so by pressing the green play button next to the test definition.

run single test
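For illustration, such a test could look like the following; check_guess is a hypothetical helper, not necessarily part of MasterMind's real API:

```python
# A hypothetical guess-checking helper with a pytest-style test.
def check_guess(secret, guess):
    """Classify a guess relative to the secret number."""
    if guess == secret:
        return "correct"
    return "too low" if guess < secret else "too high"

def test_check_guess():
    assert check_guess(42, 42) == "correct"
    assert check_guess(42, 7) == "too low"
    assert check_guess(42, 99) == "too high"
```

PyCharm shows the green play button next to test_check_guess; from the terminal, the same single test can be run with pytest tests/test_mastermind.py::test_check_guess.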

Of course, that's not the end of it. PyCharm lets you execute a whole class (containing multiple tests) or a single file (a test suite). You can select these options from the right-click context menu.

From the official documentation:

In many cases, you can initiate a testing session from a context menu. For this purpose, the Run and Debug commands are provided in certain context menus. For example, these commands are available for a test class, directory, or a package in the Project Tool Window. They are also available for a test class or method you are currently working on in the editor.

Auto Test Rerun

I like to focus on programming and not worry about the tests (until it's too late :P ). So let's make PyCharm do the hard work for us. In the Test Runner tab, we can toggle an automatic rerun of the tests on every code change. Pretty neat, right?

automatic rerun on tests

So, for example, if I introduce an error, I'll know immediately. You've probably heard that fixing defects later is much more expensive than fixing them early. This gives you rapid feedback!

There are a couple of disadvantages, IMHO. It's great to have rapid feedback; however, PyCharm will trigger the tests on each file change, even if you just add a new line. Moreover, PyCharm autosaves your changes, which leads to even more repeated test runs.

You could set up automatic reruns for a subgroup of the tests (a different run configuration). For example, if you are working on a particular feature, you can set up a smaller test suite (covering only what's needed) and enable auto-test just for that configuration. However, you have to do this manually, and remember to change it back when you're finished. Good luck with that!

Final thoughts

In this article, we've tried the integration of py.test and PyCharm; it was fairly easy and straightforward. After setting up an automatic rerun of the tests, we could focus on programming and get rapid feedback from our test cases.

Another nice feature is executing the tests that are currently being developed. We can write the tests and see if everything works as expected, just with the press of a button.

By the way, when I was setting up PyCharm, there was an option to enable the IdeaVim plugin. I'll have to give it a try and see how well it works. Nevertheless, it's a pleasant surprise for vim users such as myself.

I'll keep using PyCharm and py.test for a couple of weeks before I draw any conclusions, but I have to say I'm quite comfortable with it even after a couple of days. If you hesitate to try a new IDE or tool, just go for it. Maybe you'll find something that suits you better.

testmon: Determining affected tests


Automatically determining affected tests sounds too good to be true. Python developers rightfully have a suspicious attitude towards any tool that tries to be too clever about their source code. Code completion and symbol searching don't need to be 100% reliable, but messing with the test suite execution? This page explains what testmon tries to achieve and what it does not.

There are no heuristics involved. testmon works from these pretty solid assumptions:

  • the coverage.py library can reliably determine the executed and unexecuted lines of code of a tested program
  • even in a dynamic language, an unexecuted line of code doesn't influence the outcome of executing the surrounding code
  • a method body which is not reached/executed by a specific test cannot influence the outcome of that test
  • the lines which are executed can have so many side effects that we don't try to determine their real dependencies; we re-execute dependent tests on even the smallest change

E.g. having test_s.py:

1  def add(a, b):
2      return a + b
3
4  def subtract(a, b):
5      return a - b
6
7  def test_add():
8      assert add(1, 2) == 3

If you run coverage run -m pytest test_s.py::test_add you'll get:

1  > def add(a, b):
2  >     return a + b
3
4  > def subtract(a, b):
5  !     return a - b
6
7  > def test_add():
8  >     assert add(1, 2) == 3

Now you can change the unexecuted line ! return a - b to nuclear_bomb.explode() and it still won't affect running test_s.py::test_add.
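The "executed code" signal that these assumptions rest on can be sketched with the standard library alone. This toy version records which functions are called via sys.settrace instead of line-level coverage (testmon itself relies on coverage.py):

```python
import sys

def executed_functions(func):
    """Record the names of all functions called while func executes."""
    called = set()

    def tracer(frame, event, arg):
        if event == "call":
            called.add(frame.f_code.co_name)
        return tracer

    sys.settrace(tracer)
    try:
        func()
    finally:
        sys.settrace(None)
    return called

def add(a, b):
    return a + b

def subtract(a, b):
    return a - b  # never reached by test_add

def test_add():
    assert add(1, 2) == 3

deps = executed_functions(test_add)
# deps contains "add" but not "subtract": by testmon's assumption,
# changing subtract cannot affect test_add's outcome.
```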

Implementation details

How does testmon approach processing the source code and determining the dependencies? It splits the code into blocks. Blocks can have holes, which are denoted by a placeholder (the "transformed_into_block" token). Each block also has a start and an end (line numbers, 1-based, closed interval).

The above code is transformed into 4 blocks:

Block 1: 1-8 (start-end)

def add(a, b):
    transformed_into_block
def subtract(a, b):
    transformed_into_block
def test_add():
    transformed_into_block

Block 2: 2-2

return a + b

Block 3: 5-5

return a - b

Block 4: 8-8

assert add(1, 2) == 3

After running the test with coverage analysis and parsing the source code, testmon determines which blocks test_s.py::test_add depends on. In our example, it's Blocks 1, 2, and 4 (and not Block 3). testmon doesn't store the whole code of a block, just a checksum of it. Block 3 can be changed to anything; as long as Blocks 1, 2, and 4 stay the same, the execution path for test_s.py::test_add and its outcome will stay the same.
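A simplified version of this bookkeeping can be sketched with ast and hashlib. This illustrates the idea only; it is not testmon's actual implementation (real blocks also cover module-level code, not just function bodies):

```python
import ast
import hashlib

SOURCE = '''\
def add(a, b):
    return a + b

def subtract(a, b):
    return a - b

def test_add():
    assert add(1, 2) == 3
'''

def block_checksums(source):
    """Checksum each function body, roughly like testmon's blocks."""
    tree = ast.parse(source)
    sums = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            segment = ast.get_source_segment(source, node)
            sums[node.name] = hashlib.md5(segment.encode()).hexdigest()
    return sums

before = block_checksums(SOURCE)
after = block_checksums(SOURCE.replace("a - b", "a * b"))
# Only subtract's checksum changes; add and test_add are untouched,
# so only tests depending on subtract would need to be re-run.
```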

The limits and reliability of this method are pretty much the same as the limits of coverage.py (things that cause trouble).

testmon: New in testmon 1.0.0


Testmon in editor

Significant portions of testmon have been rewritten for v1.0.1. Although the UI is mostly the same, there are some significant differences.

End of Python 2.7 support

Testmon requires Python 3.6 or higher and pytest 5 or higher.

No subprocess measurement

--testmon-singleprocess was removed (because single-process measurement is the only option now).

It's not easy to automate the setup of subprocess measurement, and therefore to automate testing it. We're not sure it worked outside of our environment, or whether anybody used it. If you miss subprocess measurement, please let us know.

Renames

  • --testmon-off has been renamed to --no-testmon.
  • The variable run_variant_expression, which can be specified in the pytest configuration file to distinguish different environments, was renamed to environment_expression, as that better describes its meaning.

Playing nice with -m, -k and all other selectors

Old versions of testmon got confused if you deselected some tests through means other than testmon itself. If there are bugs in this, they will be squashed with priority.

Quickest tests first

Testmon reorders the test files according to their tests-per-second average so that the quickest tests go first; the order within a test file or test class is not changed.

New algorithm

Last but not least, we developed a new algorithm and database schema for selecting tests affected by changes. It will allow us to add new functionality and continue to improve testmon. If there are not many changes, determining the affected tests should take hundreds of milliseconds at most. If you have to wait for testmon to find out nothing has changed, it's a bug; please report it. Whitespace and comments are now also taken into account when detecting changes. We hope to improve this in the future and make them insignificant.

Anwesha Das: Circuit Python at PyConf Hyderabad


Introduction

Coding in/with hardware has become my biggest stress buster ever since I was introduced to it by John at PyCon Pune 2017. Coding with hardware provides a real-life interaction with the code you write. It flourishes creativity, and I can learn something new while doing it. Now I look for occasions that offer me a chance to code in/with hardware. It gives me the chance to escape the muggle world.

Diwali and Circuit Python

Diwali is the festival of flowers, food, and lights. So why not take Diwali as an opportunity to create some magic? Since 2017 I have tried to light my house with hardware that I operate through code. This year at PyCon US, all the participants got an Adafruit Circuit Playground Express. For the 2019 Diwali, I chose the Circuit Playground Express as my wand, and the spell is abracadabra, Circuit Python.

Circuit Python in PyConf Hyderabad

A week before, I got a call from the organizers of PyConf Hyderabad 2019 asking if I wanted to deliver a talk there. Initially, I thought of a talk titled "Limiting the legal risk of your open source project", which had been selected for PyCon India 2019; unfortunately, I could not deliver it there due to my grandfather's demise. I went for it because the talk was ready. But an afterthought invoked an idea: why shouldn't I give a talk on Circuit Python and share what I did with it during Diwali? The organizers liked it too. (Of course, a talk about hardware is more interesting than a legal talk at a Python conference, right?) In hindsight, it meant a lot of work within 6 days, but I dove in.

The day of the talk

My morning on the day of the conference started with a lovely surprise: my work had been featured in a blog post by Adafruit. It is a fantastic feeling when your work gets recognition from the organization you admire the most. Thank you, Adafruit.

Mine was the 4th talk of the day, the one right before lunch. It was my first talk on/about hardware projects, and my first talk in a year and a half, after my battle with depression. So, in a word, I was nervous.

The talk

I started with why programming with/on hardware is essential for me, then moved on to what Circuit Python and the Circuit Playground Express are. I use Mu as my editor for all my hardware projects, and this was no exception. I opened up the editor and started to code. Then came the time to showcase my work/projects with Circuit Python.

In the first example, one_led_red.py, I turned the first NeoPixel of the CPX red.

from adafruit_circuitplayground.express import cpx

cpx.pixels.brightness = 0.5

# COLOUR

RED = (255, 0, 0)
YELLOW = (255, 150, 0)
GREEN = (0, 255, 0)
CYAN = (0, 255, 255)
BLUE = (0, 0, 255)
PURPLE = (180, 0, 255)
WHITE = (255, 255, 255)
MAROON = (128,0,0)
VIOLET = (128, 0, 128)
TEAL = (0,128,128)
OFF = (0, 0, 0)


while True:
    cpx.pixels[0] = RED

The CPX has 10 individually addressable NeoPixel LEDs. We import the adafruit_circuitplayground.express module as cpx so that it becomes easier to type, and we can access those 10 NeoPixels on the board as cpx.pixels. We can also set the brightness of the pixels; instead of full, I am setting it to half, as the lights are really bright. Then I define a few colours.

I want my light to stay lit forever, hence while True, and I set the first NeoPixel's color to red.

all_led_rgb.py

One LED is a bit dull; let us light up the room. Here we will light up all the NeoPixels of the CPX. With cpx.pixels.fill I fill in the colors red, green, and blue, with a 1-second gap to see the colors change.

from adafruit_circuitplayground.express import cpx
import time

cpx.pixels.brightness = 0.5

#COLOUR

RED = (255, 0, 0)
YELLOW = (255, 150, 0)
GREEN = (0, 255, 0)
CYAN = (0, 255, 255)
BLUE = (0, 0, 255)
PURPLE = (180, 0, 255)
WHITE = (255, 255, 255)
MAROON = (128,0,0)
VIOLET = (128, 0, 128)
TEAL = (0,128,128)
OFF = (0, 0, 0)


while True:
    cpx.pixels.fill(RED)
    time.sleep(1)
    cpx.pixels.fill(GREEN)
    time.sleep(1)
    cpx.pixels.fill(BLUE)
    time.sleep(1)

For the third example, I showed the code where I was:

  • switching on all the NeoPixels on the CPX one by one,
  • then switching them off in the same order.

import time
from adafruit_circuitplayground.express import cpx
from random import randint

cpx.pixels.brightness = 0.5

#  COLOUR

RED = (255, 0, 0)
YELLOW = (255, 150, 0)
GREEN = (0, 255, 0)
CYAN = (0, 255, 255)
BLUE = (0, 0, 255)
PURPLE = (180, 0, 255)
WHITE = (255, 255, 255)
MAROON = (128,0,0)
VIOLET =(128,0,128)
TEAL = (0,128,128)
OFF = (0, 0, 0)

COLOURS = [RED, YELLOW, GREEN, CYAN, BLUE, PURPLE, WHITE, MAROON, VIOLET, TEAL]

i = 0
complete = False

while True:
    if complete:
        c = (0,0,0)
    else:
        r = randint(0,9)
        c = COLOURS[r]
    cpx.pixels[i] = c
    time.sleep(1)
    i += 1
    if i == 10:
        i = 0
        if complete:
            complete = False
        else:
            complete = True

Next was the code I used to light my Diwali Kandel. I needed the CPX to stay lit continuously, so I wrote this code.

import time
from adafruit_circuitplayground.express import cpx
from random import randint

cpx.pixels.brightness = 0.5

# COLOUR

RED = (255, 0, 0)
YELLOW = (255, 150, 0)
GREEN = (0, 255, 0)
CYAN = (0, 255, 255)
BLUE = (0, 0, 255)
PURPLE = (180, 0, 255)
WHITE = (255, 255, 255)
MAROON = (128,0,0)
VIOLET =(128,0,128)
TEAL = (0,128,128)
OFF = (0, 0, 0)

COLOURS= [RED, YELLOW, GREEN, CYAN, BLUE, PURPLE, WHITE, MAROON, VIOLET, TEAL]

i = 0


while True:
    r = randint(0,9)
    c = COLOURS[r]
    cpx.pixels[i] = c
    time.sleep(1)
    i += 1
    if i == 10:
        i = 0

It is customary to play a game during Diwali. I have replaced card games with a guessing game on the CPX.

import time
from adafruit_circuitplayground.express import cpx
from random import randint

cpx.pixels.brightness = 0.5

# COLOUR

RED = (255, 0, 0)
YELLOW = (255, 150, 0)
GREEN = (0, 255, 0)
CYAN = (0, 255, 255)
BLUE = (0, 0, 255)
PURPLE = (180, 0, 255)
WHITE = (255, 255, 255)
MAROON = (128,0,0)
VIOLET =(128,0,128)
TEAL = (0,128,128)
OFF = (0, 0, 0)

COLOURS = [RED, YELLOW, GREEN, CYAN, BLUE, PURPLE, WHITE, MAROON, VIOLET, TEAL]

i = 0


while True:
    if cpx.button_a:
        r = randint(0,9)
        c = COLOURS[r]
        cpx.pixels[i] = c
        time.sleep(0.5)
        i += 1
    elif cpx.button_b:
        cpx.pixels.fill( (0,0,0))
        i = 0

    if i == 10:
        i = 0

What is Diwali without music? Circuit Python and Adafruit have options for that too. I demonstrated the Adafruit NeoTrellis M4 for creating some beats.
It is a backlit keypad driver system. We can use the Adafruit NeoTrellis with

  • Python or
  • CircuitPython, and the
  • CircuitPython Trellis module, provided by Adafruit.

The Adafruit module enables one to write Python code that controls the NeoPixels and reads button presses on a single Trellis board, or on a matrix of up to eight Trellis boards.

This board can be used with any

  • CircuitPython microcontroller board or
  • with a computer that has GPIO and Python.

Coloring the NeoPixel strip

At PyCon US 2019, while discussing my projects from the previous Diwali with Nina, she got excited and gifted me a NeoPixel strip. And as the lady commanded, I lit my Diwali rangoli with the NeoPixel strip, connecting it to the CPX with alligator clips:

  • GND - black
  • A1 (data) - white
  • VOUT - red

and ran the code:

import time
from adafruit_circuitplayground.express import cpx
import board
import neopixel
pixels = neopixel.NeoPixel(board.D6, 30)
red = (255,0,0)
green = (0,255,0)
blue = (0,0,255)


while True:
    for i in range(0,30):
        pixels[i] = red
        time.sleep(0.1)
    for i in range(29,-1, -1):
        pixels[i] = green
        time.sleep(0.1)
        

Finally, it was time to light the Diwali diyas. I placed 3 diyas (colored by my toddler), fixed the NeoPixel strip by taping it onto the diyas, and then ran this code:

import time
from adafruit_circuitplayground.express import cpx
import board
import neopixel
pixels = neopixel.NeoPixel(board.D6, 30)
red = (255,0,0)
green = (0,255,0)
blue = (0,0,255)

cpx.pixels[1] = red
cpx.pixels[2] = red
cpx.pixels[9] = green
cpx.pixels[10]= green
cpx.pixels[18] = red
cpx.pixels[19]= red
cpx.pixels[27] = green
cpx.pixels[26]= green
time.sleep(30)

I did not get to show this code during the talk itself due to the time constraint.

The slides:

My slides are public at https://slides.com/dascommunity/my-diwali-with-circuit-python#/

My Gratitude:

I would like to express my deepest gratitude to Nina Zakharenko, Kattni, and Carol Willing for the hardware; Scott Shawcroft for guiding me into Circuit Python; Nicholas Tollervey for giving us Mu; and John Hawley for dragging me into hardware. Moreover, thank you everyone for helping me, supporting me, standing by me, and inspiring me when I broke down.

Conclusion

I really enjoyed giving the talk on/with/in Circuit Python. I will be here, coding simple, fun, useless, and creative stuff.
