Channel: Planet Python

Gocept Weblog: Saltlabs Sprint: Zope and Plone sprint in a new location


After Earl Zope II has now nearly relocated to the Python 3 wonderland, gocept will move to a new headquarters in the coming months. This is the right time to celebrate with a new sprint, as we now have even more space for sprinters. The new location is called the “Saltlabs”, a place for IT companies in Halle (Saale), Germany.

Sprint information

  • Date: Monday, 1st until Friday, 5th of October 2018
  • Location: Leipziger Str. 70, Halle (Saale), Germany

Sprint topics

This sprint has three main topics:

Create a final Zope 4 release

Before releasing a final version of Zope 4 we want to resolve at least 40 issues: some bugs have to be fixed, some functions have to be polished, and documentation has to be written or reviewed. There is also the re-brush of the ZMI using Bootstrap, which should be completed beforehand, as it modernizes the ZMI and allows for easier customisation, but might also be backwards incompatible with certain test suites. There is an Etherpad to write down ideas, tasks, wishes and work proposals which are not currently covered by the issue tracker.

Port Plone to Python 3

The following tasks are currently open and can be fixed at the sprint:

  • successfully run all Plone tests and even the robotframework tests on Python 3
  • Zope 4 lost its WebDAV support: find or create a replacement
  • document the WSGI setup and test it in a production ready environment
  • port as many add-ons as possible to Python 3 (e.g. Mosaic and Easyform)
  • work on the migration of ZODB contents (Data.fs) to Python 3
  • improve the test setup with tox
  • start to support Python 3.7

Polish Plone 5.2

The upcoming Plone 5.2 release would appreciate some love and care in the following areas:

  • new navigation with dropdown and better performance
  • Barceloneta theme: ease the customisation and improve responsiveness
  • parallelise the tests so they run faster
  • remove Archetypes and other obsolete packages

See also the list of topics on plone.org for this sprint.

Organisational Remarks

In order to coordinate the participation for this sprint, we ask you to join us on Meetup. We can then coordinate the catering and requirements for space.

As this sprint will run longer than usual (five days), it is also possible to join for only part of the week. As October 3rd is a national holiday in Germany, we are trying to organise a social event for those who are interested in having a small break.

For a better overview, please indicate your participation also on this doodle poll.

 


Bhishan Bhandari: Python filter() built-in


The filter() built-in takes a function and an iterable and returns an iterator that yields only those values from the iterable for which the function returns a truthy value. What makes this possible is the equal status of every […]
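For illustration (this snippet is not from the original post, which is truncated above), a minimal example of filter() keeping only the values for which the supplied function returns a truthy result:

numbers = [0, 1, 2, 3, 4, 5, 6]

# Keep only the even numbers; filter() returns a lazy iterator.
evens = filter(lambda n: n % 2 == 0, numbers)
print(list(evens))  # [0, 2, 4, 6]

# Passing None as the function keeps only the truthy items.
print(list(filter(None, [0, '', 'python', None, 42])))  # ['python', 42]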

The post Python filter() built-in appeared first on The Tara Nights.

PyCharm: PyCharm 2018.2.1 Out Now


The latest version of PyCharm is now available: get PyCharm 2018.2.1 from our website now.

New in This Version

  • An issue on Linux where part of the window wouldn’t be redrawn correctly has been resolved
  • A performance problem that affected Django users with a MacBook Pro with a touch bar has been fixed
  • And more: read the release notes

Any Comments?

If you have questions, remarks, or complaints, please reach out to us! You can ask our support team for assistance, create tickets in our issue tracker, follow us on Twitter, or just leave a comment on this blog post.

Real Python: Python Community Interview With Mike Driscoll


Welcome to the first in a series of interviews with members of the Python community.

If you don’t already know me, my name is Ricky, and I’m the Community Manager here at Real Python. I’m a relatively new developer, and I’ve been part of the Python community since January, 2017, when I first learned Python.

Prior to that, I mainly dabbled in other languages (C++, PHP, and C#) for fun. It was only after I fell in love with Python that I decided to become a “serious” developer. When I’m not working on Real Python projects, I make websites for local businesses.

This week, I’m talking to Mike Driscoll of Mouse Vs Python fame. As a long-time Python advocate and teacher, Mike shares his story of how he came to be a Python developer and an author. He also shares his plans for the future, as well as insight into how he would use a time machine…

Let’s get started.

Ricky: I’d like to start by learning how you got into programming, and how you came to love Python?


Mike: I decided to be some kind of computer programmer when I went to college. I started out in computer science and then somehow ended up with an MIS degree due to some confusing advice I received long ago from a professor. Anyway, this was back right before the internet bubble burst, so there were no jobs in tech when I graduated. After working as the sole member of an I.T. team at an auction house, I was hired by the local government to be a software developer.

The boss at that place loved Python, and I was required to learn it because that was what all new development would be done in. Trial by fire! It was a stressful couple of months of turning Kixtart code into Python code for our login scripts. I also was challenged to find a way to create desktop user interfaces in Python so we could migrate away from these truly awful VBA applications that were created on top of MS Office.

Between my boss loving Python and me having so much fun learning it and using it on the job, I ended up loving it too. We made GUIs with wxPython, reports with ReportLab, web applications with TurboGears, and much more with just vanilla Python.

Ricky: You’ve been writing on your blog, Mouse Vs Python, for over 10 years now. How have you kept so consistent and motivated to write each week?

Mike: I’m not always consistent. There have been some gaps where I didn’t write much at all. There was a year where I had stopped writing for the most part for several months. But I noticed that my readership had actually grown while I was taking a break. I actually found that really motivating because there were so many people reading old posts, and I wanted my blog to continue to stay fresh.

Also, my readers have always been pretty supportive of my blog. Because of their support, I have been committed to writing on the blog whenever I can or at least jot down some ideas for later.

Ricky: You’ve also authored five books to date, with Python Interviews: Discussions with Python Experts being released earlier this year. Having spoken with so many highly prominent developers in the Python community, what tips or wisdom have you personally taken away from the book that have helped you develop (either professionally or personally)?

Mike: I really enjoyed speaking with the developers while working on the Python Interviews book. They were quite helpful in fleshing out the history of Python and PyCon USA as well as the Python Software Foundation.

I learned about where some of the core developers think Python might go in the future and also why it was designed the way it was in the past. For example, I hadn’t realized that the reason Python didn’t have Unicode support built-in at the beginning was that Python actually pre-dates Unicode by several months.

I think one of the lessons learned is how big data science and education are for Python right now. A lot of people I interviewed talked about those topics, and it was fun to see Python’s reach continue to grow.

Ricky: I’ve noticed you’ve started creating YouTube videos again for your Python 101 series. What made you decide to start creating video content again?

Mike: The Python 101 screencast was something I put together as an offshoot of the Python 101 book. While a lot of publishers say that video content is growing in popularity, my experience has been the opposite. My screencast series never had a lot of takers, so I decided to just share it with my readers on YouTube. I will be posting most or all of the series there and probably discontinue it as a product that I sell.

I think I need more experience creating video training, so I also plan to do more videos on other topics in Python and see how they are received. It’s always fun to try out other methods of engagement with my audience.

Ricky: Not only do you do so much for the online community, but you also founded and run your local Python user group. What advice would you give to someone (like me) who might be looking to go to their first local user group meeting?

Mike: Pyowa, the local Python group that I founded, now has several organizers, which is really nice. But back to your question. If you want to go to a group, the first thing to do is to find out where and if one exists near you. Most groups are listed on the Python wiki.

Next, you need to look up their website or Meetup and see what their next meeting is about. Most of the meetings I have been to in Iowa have some form of social time at the beginning, or end, or both. Then they have a talk of some sort or some other activity like mob programming or lightning talks. The main thing is to come prepared to talk and learn about Python. Most of the time, you will find that the local user groups are just as welcoming as the people who attend PyCon are.

Ricky: If you could go back in time, what would you change about Python? Is there something you wish the language could do? Or maybe there’s something you’d like to remove from the language, instead?

Mike: I wish Guido had been able to convince Google’s Android engineering department to include Python as one of the languages used natively in Android. As it is, we currently don’t have much in the way of writing applications for mobile besides Toga and Kivy. I think both of these libraries are pretty neat, but Toga is still pretty beta, especially on Android, and Kivy doesn’t look native on anything that it runs on.

Ricky: I love celebrating the wins in life, big and small. What has been your proudest Python moment so far?

Mike: Personally, I am proud of writing about Python in book and blog form and having so many readers who have found my ramblings helpful. I am also proud to know so many great people in the community who will help each other in many meaningful ways. It’s like having a network of friends that you haven’t even necessarily met. I find this unique to the Python community.

Ricky: I’m curious to know what other hobbies and interests you have, aside from Python? Any you’d like to share and/or plug?

Mike: Most of my spare time is spent playing with my three-year-old daughter. However, I also enjoy photography. It can be challenging to get the shot you want, but digital photography also makes it a lot easier since you can get instant feedback and adjust if you messed it up, assuming your subject is willing.

 


If you’d like to follow Mike’s blog or check out any of his books, head over to his website. You can also message Mike to say “Hi” on Twitter and YouTube.

Is there someone you’d like us to interview in the community? Leave their name below, and they just might be next.



PyCharm: PyCharm and pytest-bdd


Last week we published a blog post on the new pytest fixture support in PyCharm 2018.2. This feature was a big part of the 2018.2 work and proved to be a hit. But it wasn’t the only notable pytest work: 2018.2 also delivered support for behavior testing with pytest-bdd, a project that provides behavior testing with pytest as the test framework. Our What’s New in 2018.2 video gave a quick glimpse of this support.

Let’s rewrite the tutorial from last week, showing the pytest-bdd support instead of simply working with fixtures. Behavior-driven-development (BDD) support is a part of PyCharm Professional Edition, previously covered in a blog post about using Behave and an older blog post about BDD. This tutorial puts pytest-bdd to use.

Want the finished code? It’s available in a GitHub repo.

What and Why for BDD

If you’re new to BDD, it can appear a bit strange. You write a programming-free specification, known as a “feature”, that describes how the software is supposed to behave. You then implement that specification with test code.

This two-step process shines in several workflows:

BDD can seem like overhead. But if you find yourself lost in your test code, the feature spec can help you step back from the details and follow the high-level intent. Even better, when you come back to your code months later, it is much easier to get up-to-date. Finally, if you feel committed to the “Given/When/Then” test style, pytest-bdd moves that commitment from “dating” to “marriage”.

Setup

We’ll use the same example application as the previous blog post.

To follow along at home, make sure you have Python 3.7 (our example uses dataclasses) and Pipenv installed. Clone the repo at https://github.com/pauleveritt/laxleague and make sure Pipenv has created an interpreter for you. (You can do both of those steps from within PyCharm.) Make sure Pipenv has installed pytest and pytest-bdd into your interpreter.

Then, open the directory in PyCharm and make sure you have set the Python Integrated Tools -> Default Test Runner to pytest, as done in the previous blog post.

This time, though, we also need to set Languages & Frameworks -> BDD -> Preferred BDD framework to pytest-bdd. This will enable many of the features discussed in this tutorial. More information is available in the PyCharm Help on pytest-bdd.

Now you’re ready to follow the material below. Right-click on the tests directory and choose Run ‘pytest in tests’. If all the tests pass correctly, you’re set up.

Let’s Make a Feature

What parts of our project deliver business value? We’re going to take these requirements, as “features”, and write them in feature files. These features are specified using a subset of the Gherkin language.

Right click on the tests folder and select New, then choose Gherkin feature file.
Name the file games. Note: Some like to group BDD tests under tests/features.

Here’s the default file contents generated by PyCharm:

# Created by pauleveritt at 8/6/18
Feature: #Enter feature name here
 # Enter feature description here

 Scenario: # Enter scenario name here
   # Enter steps here

Let’s change the feature file’s specification to say the following:

Feature: Games
 laxleague games are between two teams and have a score resulting
 in a winner or a tie.

 Scenario: Determine The Winner
   Given a home team of Blue
   And a visiting team of Red
   And a game between them
   When the score is 10 for Blue to 5 for Red
   Then Blue is the winner

This Scenario shows the basics of BDD:

  • Each Feature can have multiple Scenarios
  • Each Scenario has multiple steps
  • The steps revolve around the “given/when/then” approach to test specification
  • Step types can be continued with the And keyword
  • Given specifies inputs
  • When specifies the logic being tested
  • Then specifies the result

We might have a number of other scenarios relating to games. Specifically, the “winner” when the score is tied. For now, this provides enough to implement.

As we type this feature in, we see some of the features PyCharm Professional provides in its Gherkin support:

  • Syntax highlighting of Gherkin keywords
  • Reformat Code fixes indentation
  • Autocomplete on keywords
  • Warnings on unimplemented steps

Implement The Steps

If you run the tests, you’ll see….no difference. The games.feature file isn’t a test: it’s a specification. We need to implement, in test code, each of the scenarios and scenario steps.

PyCharm can help on this. As mentioned above, PyCharm warns you about an “Undefined step reference”:

If you Alt-Enter on the warnings, PyCharm will offer to either Create step definition or Create all steps definition. Since 2018.2 does the latter in an unexpected way (note: it’s being worked on), let’s choose the former, and provide a File name: of test_games_feature:

Here’s what the generated test file test_games_feature.py looks like:

from pytest_bdd import scenario, given, when, then


@given("a home team of Blue")
def step_impl():
   raise NotImplementedError(u'STEP: Given a home team of Blue')

These are just stubs, of course, which we’ll have to come back and name/implement. We could implement the other steps by hand. Let’s instead let PyCharm continue generating the stubs, into the same file.

2018.2 doesn’t generate the scenario, which is what actually triggers the running of the test. Let’s provide the scenario, as well as implement each step, resulting in the following for the test_games_feature.py implementation:

from pytest_bdd import scenario, given, when, then

from laxleague.games import Game
from laxleague.teams import Team


@scenario('games.feature', 'Determine The Winner')
def test_games_feature():
   pass


@given('a home team of Blue')
def blue():
   return Team('Blue')


@given('a visiting team of Red')
def red():
   return Team('Red')


@given('a game between them')
def game_red_blue(blue, red):
   return Game(home_team=blue, visitor_team=red)


@when('the score is 10 for Blue to 5 for Red')
def record_score(game_red_blue):
   game_red_blue.record_score(10, 5)


@then('Blue is the winner')
def winner(game_red_blue):
   assert 'Blue' == game_red_blue.winner.name

To see this in action, let’s run the test then take a look at what pytest-bdd is doing.

Run the Tests

You’ve already run your regular pytest tests, with fixtures and the like. What extra does it take to also run your pytest-bdd tests? Nothing! Just run your tests:

Your pytest-bdd tests show up just like any other test.

Let’s take a look at some things pytest-bdd is doing:

  1. The @scenario decorated function test_games_feature is the only function in the file with the test_ prefix. That means this file has only one test. And guess what? The function itself doesn’t do anything. It’s just a marker.
  2. We need two teams, so we implement the two Given steps by making a Blue and a Red team.
  3. We also need a game between these two teams. This is the third step in our games.feature scenario. Note that this function takes two arguments. As it turns out, pytest-bdd steps are pytest fixtures which can be injected into the functions for each step. (In fact, any pytest fixture can be injected.)
  4. Now that we have a game setup, we use @when to run the logic being tested by recording a score.
  5. Did our logic work? We use @then to do our assertion. This is where our test passes or fails.

It’s an interesting approach. It’s verbose, but it clearly delineates the “Given/When/Then” triad of good test cases. Plus, you can read the almost-human-language Gherkin file to quickly understand what behavior is being tested.

Test Parameters

You might have noticed that the feature file specified a score for the game but it was ignored in the implemented tests. pytest-bdd has a feature where you can extract parameters from strings. It has several parsing schemes. We’ll use the simplest, starting by adding parsers to the import from pytest_bdd:

from pytest_bdd import scenario, given, when, then, parsers

Note: You could also do this the productive way by using the symbol and letting PyCharm generate the import for you.

The feature file says this:

When the score is 10 for Blue to 5 for Red

Let’s change our @when decorator to parse out the score:

@when(parsers.parse('the score is {home:d} for Blue to {visitor:d} for Red'))

When we do so, PyCharm warns us that Not all arguments.... were used. You can type them in, but PyCharm knows how to do it. Hit Alt-Enter and accept the first item:

After changing record_score to use these values, our function looks like this:

@when(parsers.parse('the score is {home:d} for Blue to {visitor:d} for Red'))
def record_score(game_red_blue, home, visitor):
   game_red_blue.record_score(home, visitor)

As we experiment with lots of combinations, moving this to the feature file is very helpful, particularly for test engineers.

Want more? Gherkin and pytest-bdd support “Scenario Outlines” where you can batch parameterize your inputs and outputs, just like we saw previously with pytest.mark.parametrize.
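For comparison only, here is a rough sketch of the same checks written with plain pytest.mark.parametrize rather than a Scenario Outline, reusing the Game and Team classes from the repo (the Red-wins case is an assumption about how laxleague decides the winner):

import pytest

from laxleague.games import Game
from laxleague.teams import Team


@pytest.mark.parametrize("home_score, visitor_score, expected_winner", [
    (10, 5, 'Blue'),   # home team wins
    (3, 7, 'Red'),     # visiting team wins (assumed behavior)
])
def test_winner_parametrized(home_score, visitor_score, expected_winner):
    # Build the same data as the BDD steps, but inline.
    game = Game(home_team=Team('Blue'), visitor_team=Team('Red'))
    game.record_score(home_score, visitor_score)
    assert expected_winner == game.winner.name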

Use pytest Fixtures and Features

As noted above, pytest-bdd treats scenario steps as pytest fixtures. But any pytest fixture can be used, which makes pytest-bdd a powerful approach for BDD.

For example, imagine we have an indirection where there are different types of Game implementations. We’d like to put the choice of which class is used behind a fixture, in conftest.py:

@pytest.fixture
def game_type():
   return Game

We can then use this fixture in the @given step when we instantiate a Game:

@given('a game between them')
def game_red_blue(game_type, blue, red):
   return game_type(home_team=blue, visitor_team=red)
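To see why that indirection is handy, consider a rough sketch (the OvertimeGame class below is purely hypothetical and not part of the laxleague project): a test module can override the game_type fixture, and every step that depends on it picks up the alternative implementation without changing the steps themselves.

import pytest

from laxleague.games import Game


class OvertimeGame(Game):
    """Hypothetical Game variant, used only to illustrate the fixture indirection."""


@pytest.fixture
def game_type():
    # Redefining the fixture here overrides the conftest.py version
    # for the tests in this module only.
    return OvertimeGame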

Productive BDD

Prior to 2018.2, PyCharm provided BDD for Behave, but nothing for pytest-bdd. Admittedly, there are still loose ends being worked on. Still, what’s available now is quite productive, with IDE features brought to bear when getting in the pytest-bdd flow.

Warnings

Forgot to implement a step? Typo in your @scenario text identifier? PyCharm flags these with the (configurable) warning and error infrastructure.

Autocomplete

Obviously we autocomplete keywords from Gherkin and pytest-bdd. And as expected, we autocomplete function names as fixture parameters. But we also autocomplete strings from the feature file…for example, to fix the kind of error just mentioned for the scenario identifier. Also, parameters parsed out of strings can autocomplete.

Quick Info

Wonder what is a particular symbol and where it came from? You can use Quick Info or on-hover type information with Cmd-hover to see more information:

Conclusion

Doing behavior-driven development is a very different approach to testing. With pytest-bdd, you can keep much of what you know and use from the very-active pytest ecosystem when doing BDD. PyCharm Professional 2018.2 provides a “visual testing” frontend to keep you in a productive BDD flow.

Bill Ward / AdminTome: Data Pipeline: Send logs from Kafka to Cassandra


In this post, I will outline how I created a big data pipeline for my web server logs using Apache Kafka, Python, and Apache Cassandra.

In past articles I described how to install and configure Apache Kafka and Apache Cassandra.  I assume that you already have a Kafka broker running with a topic of www_logs and a production ready Cassandra cluster running.  If you don’t then please follow the articles mentioned in order to follow along with this tutorial.

In this post, we will tie them together to create a big data pipeline that will take web server logs and push them to an Apache Cassandra based data sink.

This will give us the opportunity to go through our logs using CQL queries (Cassandra’s SQL-like language) and opens up other possibilities, like applying machine learning to predict whether there is an issue with our site.

Here is the basic diagram of what we are going to configure:

data pipeline example using kafka, python, and cassandra

Let's see how we start the pipeline by pushing log data to our Kafka topic.

Pushing logs to our data pipeline

The Apache web server logs to /var/log/apache2.  For this tutorial, we will work with the Apache access logs, which show requests to the web server.  Here is an example:

108.162.245.143 - - [08/Aug/2018:17:44:40 +0000] "GET /blog/terraform-taint-tip/ HTTP/1.0" 200 31281 "-""Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Log files are simply text files where each line is an entry in the log file.

In order to easily read our logs from a Python application that we will write later, we will want to convert these log lines into JSON data and add a few more fields.

Here is what our JSON will look like:

{
  "log": {
    "source": "",
    "type": "",
    "datetime": "",
    "log": ""
  }
}

The source field is going to be the hostname of our web server.  The type field is going to let us know what type of logs we are sending.  In this case it will be ‘www_access’ since we are going to send Apache access logs.  The datetime field will hold the timestamp value of when the log was created.  Finally, the log field will contain the entire line of text representing the log entry.

I created a sample python application that takes these logs and forwards them to kafka.  You can find it on GitHub at admintome/logs2kafka.  Let’s look at the forwarder.py file in more detail:

import time
import datetime
import socket
import json
from mykafka import MyKafka


def parse_log_line(line):
    strptime = datetime.datetime.strptime
    hostname = socket.gethostname()
    time = line.split(' ')[3][1::]
    entry = {}
    entry['datetime'] = strptime(
        time, "%d/%b/%Y:%H:%M:%S").strftime("%Y-%m-%d %H:%M")
    entry['source'] = "{}".format(hostname)
    entry['type'] = "www_access"
    entry['log'] = "'{}'".format(line.rstrip())
    return entry


def show_entry(entry):
    temp = ",".join([
        entry['datetime'],
        entry['source'],
        entry['type'],
        entry['log']
    ])
    log_entry = {'log': entry}
    temp = json.dumps(log_entry)
    print("{}".format(temp))
    return temp


def follow(syslog_file):
    syslog_file.seek(0, 2)
    pubsub = MyKafka(["mslave2.admintome.lab:31000"])
    while True:
        line = syslog_file.readline()
        if not line:
            time.sleep(0.1)
            continue
        else:
            entry = parse_log_line(line)
            if not entry:
                continue
            json_entry = show_entry(entry)
            pubsub.send_page_data(json_entry, 'www_logs')


f = open("/var/log/apache2/access.log", "rt")
follow(f)

The first thing we do is open the log file /var/log/apache2/access.log for reading.  We then pass that file to our follow() function where our application will follow the log file much like  tail -f /var/log/apache2/access.log would.

If the follow function detects that a new line exists in the log it converts it to JSON using the parse_log_line() function.  It then uses the send_page_data() function of MyKafka to push the JSON message to the www_logs topic.

Here is the MyKafka.py python file:

from kafka import KafkaProducer
import json


class MyKafka(object):

    def __init__(self, kafka_brokers):
        self.producer = KafkaProducer(
            value_serializer=lambda v: json.dumps(v).encode('utf-8'),
            bootstrap_servers=kafka_brokers
        )

    def send_page_data(self, json_data, topic):
        result = self.producer.send(topic, key=b'log', value=json_data)
        print("kafka send result: {}".format(result.get()))

This simply calls KafkaProducer to send our JSON as a key/value pair where the key is the string ‘log’ and the value is our JSON.

Now that we have our log data being pushed to Kafka we need to write a consumer in python to pull messages off the topic and save them as a row in a Cassandra table.

But first we should prepare Cassandra by creating a Keyspace and a table to hold our log data.

Preparing Cassandra

In order to save our data to Cassandra we need to first create a Keyspace in our Cassandra cluster.  Remember that a keyspace is how we tell Cassandra a replication strategy for any tables attached to our keyspace.

Let’s start up CQLSH.

$ bin/cqlsh cass1.admintome.lab
Connected to AdminTome Cluster at cass1.admintome.lab:9042.
[cqlsh 5.0.1 | Cassandra 3.11.3 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> 

Now run the following query to create our keyspace.

CREATE KEYSPACE admintome WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '3'}  AND durable_writes = true;

Now run this query to create our logs table.

CREATE TABLE admintome.logs (
    log_source text,
    log_type text,
    log_id timeuuid,
    log text,
    log_datetime text,
    PRIMARY KEY ((log_source, log_type), log_id)
) WITH CLUSTERING ORDER BY (log_id DESC);

Essentially, we are storing time series data which represents our log file information.

You can see that we have a column for source, type, datetime, and log that match our JSON from the previous section.

We also have another column, log_id, of type timeuuid.  This creates a unique UUID from the current timestamp when we insert a record into this table.

Cassandra stores one row per partition.  A partition in Cassandra is identified by the PRIMARY KEY.  In this example, our PK is a COMPOSITE PRIMARY KEY where we use both the log_source and the log_type values as a primary key.

So for our example, we are going to create a single partition in Cassandra identified by the partition key ('www2', 'www_access').  The hostname of my web server is www2, so that is what log_source is set to.

We also set the Clustering Key to log_id.  These are guaranteed unique keys so we will be able to have multiple rows in our partition.

If I lost you there, don't worry; it took me a couple of days and many headaches to understand it fully.  I will be writing another article soon detailing why the data is modeled in this fashion for Cassandra.

Now that we have our Cassandra keyspace and table ready to go, we need to write our Python consumer to pull the JSON data from our Kafka topic and insert that data into our table as a new row.

Python Consumer Application

I have posted the source code to the kafka2cassandra python application on GitHub at admintome/kafka2cassandra.

We use the same Kafka Python module that we used in our producer code above, but this time we will use KafkaConsumer to pull messages off of our topic.  We then use the Python Cassandra driver module from DataStax to insert a row into our table.

Here is the code for the poller.py file:

import sys
from kafka import KafkaConsumer
import json
from cassandra.cluster import Cluster

consumer = KafkaConsumer(
    'www_logs', bootstrap_servers="mslave2.admintome.lab:31000")

cluster = Cluster(['192.168.1.47'])
session = cluster.connect('admintome')

# start the loop
try:
    for message in consumer:
        entry = json.loads(json.loads(message.value))['log']
        print("Entry: {} Source: {} Type: {}".format(
            entry['datetime'],
            entry['source'],
            entry['type']))
        print("Log: {}".format(entry['log']))
        print("--------------------------------------------------")
        session.execute(
            """
INSERT INTO logs (log_source, log_type, log_datetime, log_id, log)
VALUES (%s, %s, %s, now(), %s)
""",
            (entry['source'],
             entry['type'],
             entry['datetime'],
             entry['log']))
except KeyboardInterrupt:
    sys.exit()

This is a simple loop where we use KafkaConsumer to pull messages off the Kafka topic.  One oddity: I only got a proper Python dictionary when I called json.loads() twice on the value returned from KafkaConsumer.  Looking back at the producer code, the likely reason is that show_entry() already serializes the entry with json.dumps(), and the KafkaProducer's value_serializer then calls json.dumps() on that string again, so the message arrives double-encoded.

If you find a different explanation, please post it in the comments; I would love to know.
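A minimal, self-contained sketch of that apparent double encoding, based on the producer and consumer code shown above: the value goes through json.dumps() twice on the way in, so it needs json.loads() twice on the way out.

import json

entry = {"log": {"source": "www2", "type": "www_access"}}

# show_entry() already serializes the dict to a JSON string...
json_entry = json.dumps(entry)
# ...and the KafkaProducer's value_serializer serializes that string again.
wire_value = json.dumps(json_entry).encode('utf-8')

# One json.loads() only recovers the inner JSON string; a second call
# is needed to get the dictionary back.
inner = json.loads(wire_value)   # -> str
decoded = json.loads(inner)      # -> dict
assert decoded == entry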

We also create a connection to our Cassandra cluster and connect to our admintome keyspace with these two lines:

cluster = Cluster(['192.168.1.47'])
session = cluster.connect('admintome')

We then insert our JSON data (which is now stored in the entry dict) to our logs table in Cassandra.

"""
INSERT INTO logs (log_source, log_type, log_datetime, log_id, log)
VALUES (%s, %s, %s, now(), %s)
""",
            (entry['source'],
             entry['type'],
             entry['datetime'],
             entry['log']))

With a Cassandra INSERT you list the columns you are writing explicitly.  In the VALUES section, notice that we are using the now() CQL function to create a timeuuid value from the current timestamp.

Deploying our consumer to Kubernetes

We want this consumer to always be running so we are going to use Kubernetes to deploy a docker container that runs this script for us.

You don’t have to complete this section to continue.  We already have a fully running pipeline.

We can use this Dockerfile to build our docker container.

FROM python:3

WORKDIR /usr/src/app

COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

CMD [ "python", "-u", "./poller.py" ]

Notice in the last line we tell it to run python -u?  The -u flag tells Python to run unbuffered, which lets us see the output of our Python application running in the Docker container as it is produced.

Build the docker container and push it to your docker registry that your Kubernetes cluster is using.

Now create a kafka2cassandra.yaml file on your Kubernetes management system (the system you have kubectl installed on to manage your Kubernetes cluster) and add these contents:

apiVersion: v1
kind: Pod
metadata:
  name: kafka2cassandra
spec:
  containers:
    - name: kafka2cassandra
      image: admintome/kafka2cassandra
      stdin: true
      tty: true

Make sure to update the image parameter with the actual image location that you pushed your docker container to.

Also notice that we set stdin and tty to true in our Pod definition.  This is so we can see the text logging from our python script from Kubernetes correctly.

Now deploy the pod with

$ kubectl create -f kafka2cassandra.yaml

You should see the pod start successfully, and if you check the logs you will see that it is pulling messages off of our Kafka topic and pushing the data to our Cassandra table.

Now it’s time to query Cassandra for our log data.

Cassandra Queries

Now that we have our data being sent to Cassandra we can run some queries on the data.

Start up CQLSH.

$ bin/cqlsh cass1.admintome.lab
Connected to AdminTome Cluster at cass1.admintome.lab:9042.
[cqlsh 5.0.1 | Cassandra 3.11.3 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh>

Run the following query to count the number of rows we have so far.  Keep in mind this is assuming you have some web requests that have been processed already.

select count(*) from admintome.logs where log_source = 'www2' and log_type = 'www_access';

We should get a response back:

@ Row 1
-------+----
 count | 23

You can view the logs with this query:

select dateOf(log_id), log from admintome.logs where log_source = 'www2' and log_type = 'www_access' limit 5;

@ Row 1
-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 system.dateof(log_id) | 2018-08-08 04:35:20.493000+0000
 log                   | '172.69.70.168 - - [08/Aug/2018:04:35:20 +0000] "GET /blog/installing-puppet-enterprise-2017-3-agents/ HTTP/1.0" 200 35250 "http://www.admintome.com/blog/tag/puppet/""Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"'

@ Row 2

Keep in mind that that query is fine when you don't have many rows, but due to the way Cassandra stores data it can cause serious performance issues if you try to run it on a large data set.

A better way is to limit your query to a set time period like this query:

cqlsh> select dateOf(log_id), log from admintome.logs where log_source = 'www2' and log_type = 'www_access' and log_id >= maxTimeuuid('2018-08-08 04:30+0000') and log_id < minTimeuuid('2018-08-08 04:40+0000') limit 5;

@ Row 1
-----------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 system.dateof(log_id) | 2018-08-08 04:35:20.493000+0000
 log                   | '172.69.70.168 - - [08/Aug/2018:04:35:20 +0000] "GET /blog/installing-puppet-enterprise-2017-3-agents/ HTTP/1.0" 200 35250 "http://www.admintome.com/blog/tag/puppet/""Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"'

Conclusion

We now have a complete data pipeline: Apache access logs are pushed in JSON form to a Kafka topic, and a Python application consumes the messages and inserts the data into Cassandra for long-term storage and data analysis.

There is much more that can be done with this pipeline to make it more robust.  For example, once we start to add in more logs like the Apache Error log or other logs, we will need to create a Kafka consumer group and run more than one consumer container and split our Kafka topic into partitions.

I hope you have enjoyed this post.

If you did then please share it on social media and comment below, I would love to hear from you.

 

The post Data Pipeline: Send logs from Kafka to Cassandra appeared first on AdminTome Blog.

Codementor: Better with Python: Collections

A quick introduction to a feature of the Python standard library.

py.CheckIO: Data science & data analysis most effective libraries


Statistics say that a modern person in one day receives the same amount of information that a typical city resident in the 19th century received in the course of a year. It’s obvious that for human brain it’s becoming harder and harder to cope with such data flows. And that’s what makes the Data Science field so popular - the ability to use the computer’s resources to speed up the processing of information. In this article you’ll be able to read about the most popular and efficient Python libraries that are used in this area.


Djangostars: How to Build a Unique Technology for Your Fintech Product with Python


Fintech is a maze. It’s a thrilling and extremely complex industry for software development. There are state level regulations, integrations with different services and institutions, bank API connections, etc. to deal with. Another challenge is the high level of trust from the end users required to run finance, mortgages, investments and such. These, in turn, require the highest level of security, functionality, and correspondence with requirements.

What I’m trying to say is that the more unique the software is, the higher it’s valued. Without a properly working and trustworthy software, any financial venture will die down and lose worth. People need financial technology that will last, and I’m going to tell you how we achieved this with Python/Django technological stack while developing fintech products. It’s especially pleasant to say after Python has become the world’s most popular coding language.

Fintech: The Importance of Being Unique

In the world of finance, there are two streams that still coexist. On one hand, there are the millennials who stride gloriously into the future while mastering contactless payments, using on-line banking and all kinds of digital financing services. In an effort to avoid old school bureaucracy, they build their lives in a way that no generation before them did.

On the other hand, there are the good old traditional financial services. This is a hell of a machine, hundreds of years old, that you can't stop that easily. Even if it acknowledges the effect that new technology has on finance, it still sees it as neither a threat nor a worthy competitor.

An attitude like this is especially typical of the most developed countries, such as the G7, which have most of the money. Most of the old money, I might add. As well as most of the people who are ready to operate it and the most highly technological startups. However, the thing is, their financial system is so old and hard-shelled that it's not always ready to change.

Deloitte backs this up with their statistics for 2017, which show exactly how the G7 sees and uses financial technology as opposed to the rest of the world. Deloitte researchers note:

“Surprisingly, with regard to mobile payments, 40 percent of executives from the United States expect little to no impact to their industry. With the caveat that the sample size is relatively small, 7 out of the 17 US banks (41 percent) saw little to no impact from mobile wallets and other payment technologies, vs. 14 out of 36 (37 percent) of the nonbanks.”

Meanwhile, developing countries have a number of black holes in the financial sector that allow space for growth. These black holes are slowly but surely being taken over by fintech. By doing so, it gives people in these countries more opportunities, like working with developed countries and getting paid easily and securely. Fintech removes financial borders, and that's one of my favorite things about it.

Usage of emerging technologies: G7 vs rest of the world (ROW)


No matter how skeptical the G7 is towards fintech, technology continues changing finance. One of the reasons is that technology is more flexible and is able to adapt to new users’ needs, such as the needs and demands of the millennials. With their new habits, their high digital sensitivity, and digital presence, this generation feels the need to be productive every waking moment, can’t afford to waste time, travels a lot, and values financial freedom no matter where they are.

Confirming my thoughts, the Wall Street Journal, for example, says that the ease of payments attracts people that are comfortable with technology and have a busy lifestyle. Users of mobile payments mostly have higher education, work full time, are predominantly male, and are also very active financially. For instance, compared to nonusers of mobile financial apps, they are more likely to have bank accounts, retirement accounts, and/or own homes. Equally, they are no stranger to auto loans and mortgages.

Statistics that the WSJ uses in the article basically show that mobile payment users are more financially active, use a variety of financial products, and earn more than non-users. At the same time, they’re more careless with their expenses, get into debt, and even take money from their retirement accounts. This is why experts expect a whole new niche in fintech – simple tools that will help millennials manage their money better. Millennials who use mobile payments are reported to have a greater risk of financial distress and mismanagement, despite higher incomes and education levels.


In the era of digital disruption, finance has to be especially sensitive towards new customer demands. Will they use your service when it becomes more common and necessary? Can you create a product now that can grow and develop to serve millennials when they grow up and start earning big money? The same as the generation at which the current financial system is aimed. This is especially important when it comes to branches like mortgages, investments, and wealth management.

As I said above, I can’t stress enough how important it is to offer a unique technology that is custom tailored to fit customer needs. So far, it’s impossible to avoid integration with traditional financial and state institutions. You have to make sure your cooperation runs smoothly and that they deem you reliable enough to choose you as a partner; to choose your technology and not someone else’s, or worse, create a technology of their own.

We realized the importance of top-notch technology when Moneypark, formerly a start-up client of ours and now Switzerland's largest technology-based mortgage intermediary, acquired Defferrard & Lanz and became part of Helvetia. All this occurred because their technological solution and business approach was the most convincing.

How Is Python Used In Finance & Fintech

So where do you get technology robust enough to withstand the stress of worldwide financial perturbations, but flexible enough to follow all the new changes and customers' needs? We chose to use the Django framework with Python, and we continue discovering its power. We're not trying to say that Python is the savior and the silver bullet, but we do know for sure what advantages Python has for finance.

1. Python/Django stack takes you to market a lot quicker. It’s simple: this combo lets you build an MVP quickly, which increases your chances to find your product/market fit.

One of the advantages fintech has over traditional banking services is its ability to change quickly, adapt to customer demands, and offer additional services and improvements in accordance to the customers’ wishes. To do so, you have to be able to get to market quick, toughen up against real life problems, constantly improve, and grow. This is the only way fintech will be able to compete and/or collaborate with traditional banking and finance.

The technology must be flexible and offer solid ground for numerous additional services. Obviously an MVP is important, but the complexity of projects doesn't always allow it to be developed fast. However, the Python/Django framework combo takes the needs of an MVP into account and lets you save some time. They basically work like Lego – you don't need to develop small things like authorization or user management tools from scratch. You just take whatever you need from the Python libraries (NumPy, SciPy, scikit-learn, statsmodels, pandas, Matplotlib, Seaborn, etc.) and build an MVP.

Another big advantage that Django gives you at the MVP stage is a simple admin panel or CRM – it’s built-in; you just have to set it up for your product. Of course, at the MVP stage, the product isn’t complete, but you can test and easily finish it, as it’s very flexible.

After the MVP is done, this tech stack lets you adapt parts of the code. This means that after you validate the MVP, you can either easily change some lines of code or write new ones, if this is required for the product to function flawlessly.

Read: What you need to consider before building a fintech product

Millennials are people who are used to living in a fast-paced world. They feel like they have to be productive every waking moment, and this is what they expect from everyone else and from the services they use. There’s no time for error. Maximum transparency and high-quality service are critical for them, and I don’t think they’ll be letting it go anytime soon.

Let’s say that no matter how much I love Uber, as soon as they make a mistake as little as searching for a driver for too long, I get very annoyed. I’m sure we all expect and deserve better than this. I can’t even begin to describe the panic that takes over people if heaven forbid, Slack crashes.

This is why customer development is so important – a whole generation depends on it. Consequently, the sooner you get your product to market, the quicker you collect feedback and the faster you’ll make improvements. Python programming in finance allows you to do this with your hands behind your back.

2. Python is the language of Mathematicians and Economists. Fintech obviously can’t exist without these two groups, and most of the time they use – wait for it – Python to calculate their algorithms and formulas. While R and Matlab are less common among economists, Python became the most useful programming language for finance, as well as the programming “lingua franca” of data science. Because economists use it to make their calculations, of course it makes them easier to integrate with a Python based product. However, the presence of and communication with the technical partner is nevertheless important because sometimes even pieces of code that are written in the same language are hard to integrate.

3. Python has simple syntax which is easier for collaboration. Becoming the “lingua franca”, in my opinion, was just a matter of time. Thanks to its simplicity and easy-to-understand syntax, Python is very legible and everyone can learn it. Python creator Guido van Rossum describes it as a “high-level programming language, and its core design philosophy is all about code readability and a syntax which allows programmers to express concepts in a few lines of code.”

Not only is it easy to understand for technical specialists, it is for clients as well. As you can imagine, people involved in the development process from both sides have different levels of technical understanding. With Python, engineers can explain the code much easier, and clients can better understand how the development is progressing.

As The Economist says about Python:

“The language’s two main advantages are its simplicity and flexibility. Its straightforward syntax and use of indented spaces make it easy to learn, read and share. Its avid practitioners, known as Pythonistas, have uploaded 145,000 custom-built software packages to an online repository. These cover everything from game development to astronomy, and can be installed and inserted into a Python program in a matter of seconds.”

Which brings us to the next point.

4. Python has open libraries, including those for API integration. Open libraries help develop the product and analyze large amounts of data in the shortest amount of time, as you don't have to build your tools from scratch. This can save a lot of time and money, which is especially valuable while building an MVP.

As I mentioned, fintech products require a lot of integrations with third parties. Python libraries make integrating your product with other systems through different APIs a lot easier. In finance, APIs can help you collect and analyze the required data about users, real estate, and organizations. For instance, in the UK you can get people's credit history via an API, which is required to proceed with further financial operations. In the online mortgage industry, you also check real estate data, and you always need to verify someone's identity, which is much easier to do with an API. By using and combining different libraries/packages, you can get the data or filter it in one click without having to develop new tools for that.

Django Stars, for instance, uses the Django REST Framework to build APIs or to integrate with external ones, as well as Celery to queue and distribute tasks.
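As a rough illustration of the kind of building block this refers to (all names, the URL pattern, and the broker address below are hypothetical, not taken from any Django Stars project), a Celery task can fetch third-party data asynchronously so a web request never blocks on an external API:

import requests
from celery import Celery

app = Celery('fintech', broker='redis://localhost:6379/0')  # assumed broker URL


@app.task(bind=True, max_retries=3)
def fetch_credit_report(self, applicant_id, api_url):
    """Call a hypothetical third-party credit API and return its JSON payload."""
    try:
        response = requests.get(f"{api_url}/reports/{applicant_id}", timeout=10)
        response.raise_for_status()
        return response.json()
    except requests.RequestException as exc:
        # Retry with exponential backoff if the external service hiccups.
        raise self.retry(exc=exc, countdown=2 ** self.request.retries)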

5. Python hype is real. Python will continue developing, giving access to more and more specialists, which is good because we're guaranteed to have enough people to develop and maintain our products in the future. According to the HackerRank 2018 Developer Skills Report, Python is the second language coders are going to learn next and is among the top 3 languages in financial services and other progressive industries.


“Python wins the heart of developers across all ages, according to our Love-Hate index. Python is also the most popular language that developers want to learn overall, and a significant share already knows it.” HackerRank

Python can be used for all kinds of purposes, from traditional ones like web development to cutting edge, like AI. It’s versatile – it has over 125,000 third-party Python libraries. It’s the go-to language for data analysis, which makes it attractive for non-technical fields like business, and the best programming language for financial analysis.

Again, I’m not trying to sell you Python because it’s the only language that can save the world. I’m only speaking from my own experience because I saw what wonders Python can do when applied within the Django Framework.

The world of fintech is demanding – your product has to be trustworthy, 150% secure, and functional. Adhering to state regulations, dealing with integration with services, institutions, and bank API connections should all be built to last to support the new generations of millennials who are taking over the future. To get to the top and be among the ones who are disrupting the financial market, you need to be unique, efficient, user-oriented, and open for the future. That’s what Python is about.

NumFOCUS: Announcing Julia 1.0

Continuum Analytics Blog: Deploying Machine Learning Models is Hard, But It Doesn’t Have to Be

Stack Abuse: Association Rule Mining via Apriori Algorithm in Python


Association rule mining is a technique to identify underlying relations between different items. Take the example of a supermarket where customers can buy a variety of items. Usually, there is a pattern in what the customers buy. For instance, mothers with babies buy baby products such as milk and diapers. Damsels may buy makeup items whereas bachelors may buy beers and chips, etc. In short, transactions involve a pattern. More profit can be generated if the relationships between the items purchased in different transactions can be identified.

For instance, if items A and B are bought together frequently, then several steps can be taken to increase the profit. For example:

  1. A and B can be placed together so that when a customer buys one of the products he doesn't have to go far away to buy the other product.
  2. People who buy one of the products can be targeted through an advertisement campaign to buy the other.
  3. Collective discounts can be offered on these products if the customer buys both of them.
  4. Both A and B can be packaged together.

The process of identifying associations between products is called association rule mining.

Apriori Algorithm for Association Rule Mining

Different statistical algorithms have been developed to implement association rule mining, and Apriori is one such algorithm. In this article we will study the theory behind the Apriori algorithm and will later implement Apriori algorithm in Python.

Theory of Apriori Algorithm

There are three major components of Apriori algorithm:

  • Support
  • Confidence
  • Lift

We will explain these three concepts with the help of an example.

Suppose we have a record of 1 thousand customer transactions, and we want to find the Support, Confidence, and Lift for two items e.g. burgers and ketchup. Out of one thousand transactions, 100 contain ketchup while 150 contain a burger. Out of 150 transactions where a burger is purchased, 50 transactions contain ketchup as well. Using this data, we want to find the support, confidence, and lift.

Support

Support refers to the default popularity of an item and can be calculated by finding the number of transactions containing a particular item divided by the total number of transactions. Suppose we want to find support for item B. This can be calculated as:

Support(B) = (Transactions containing (B))/(Total Transactions)  

For instance if out of 1000 transactions, 100 transactions contain Ketchup then the support for item Ketchup can be calculated as:

Support(Ketchup) = (Transactions containing Ketchup)/(Total Transactions)

Support(Ketchup) = 100/1000  
                 = 10%
Confidence

Confidence refers to the likelihood that an item B is also bought if item A is bought. It can be calculated by finding the number of transactions where A and B are bought together, divided by total number of transactions where A is bought. Mathematically, it can be represented as:

Confidence(A→B) = (Transactions containing both (A and B))/(Transactions containing A)  

Coming back to our problem, we had 50 transactions where Burger and Ketchup were bought together, while burgers were bought in 150 transactions. The likelihood of buying ketchup when a burger is bought can then be represented as the confidence of Burger → Ketchup and mathematically written as:

Confidence(Burger→Ketchup) = (Transactions containing both (Burger and Ketchup))/(Transactions containing Burger)

Confidence(Burger→Ketchup) = 50/150  
                           = 33.3%

You may notice that this is similar to what you'd see in the Naive Bayes Algorithm, however, the two algorithms are meant for different types of problems.

Lift

Lift(A→B) refers to the increase in the ratio of the sale of B when A is sold. Lift(A→B) can be calculated by dividing Confidence(A→B) by Support(B). Mathematically it can be represented as:

Lift(A→B) = (Confidence (A→B))/(Support (B))  

Coming back to our Burger and Ketchup problem, the Lift(Burger -> Ketchup) can be calculated as:

Lift(Burger→Ketchup) = (Confidence (Burger→Ketchup))/(Support (Ketchup))

Lift(Burger→Ketchup) = 33.3/10  
                     = 3.33

Lift basically tells us that the likelihood of buying a Burger and Ketchup together is 3.33 times higher than the likelihood of just buying ketchup. A Lift of 1 means there is no association between products A and B. A Lift greater than 1 means products A and B are more likely to be bought together. Finally, a Lift of less than 1 means the two products are unlikely to be bought together.
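
To make these numbers concrete, here is a small sketch (my own, not part of the article's later code) that computes all three metrics for the burger-and-ketchup example using plain Python 3 arithmetic:

# Toy numbers from the example above: 1000 transactions,
# 150 contain a burger, 100 contain ketchup, 50 contain both.
total_transactions = 1000
burger_count = 150
ketchup_count = 100
both_count = 50

support_ketchup = ketchup_count / total_transactions                 # 0.10
confidence_burger_ketchup = both_count / burger_count                # ~0.333
lift_burger_ketchup = confidence_burger_ketchup / support_ketchup    # ~3.33

print("Support(Ketchup):", support_ketchup)
print("Confidence(Burger -> Ketchup):", confidence_burger_ketchup)
print("Lift(Burger -> Ketchup):", lift_burger_ketchup)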

Steps Involved in Apriori Algorithm

For large sets of data, there can be hundreds of items across hundreds of thousands of transactions. The Apriori algorithm tries to extract rules for each possible combination of items. For instance, Lift can be calculated for item 1 and item 2, item 1 and item 3, item 1 and item 4, then item 2 and item 3, item 2 and item 4, and then for larger combinations, e.g. item 1, item 2 and item 3; similarly item 1, item 2 and item 4, and so on.

As you can see from the above example, this process can be extremely slow due to the number of combinations. To speed up the process, we need to perform the following steps (a rough code sketch of these steps follows the list):

  1. Set a minimum value for support and confidence. This means that we are only interested in finding rules for items that have a certain default existence (i.e. support) and a minimum value for co-occurrence with other items (i.e. confidence).
  2. Extract all the subsets having a higher support value than the minimum threshold.
  3. Select all the rules from those subsets with a confidence value higher than the minimum threshold.
  4. Order the rules in descending order of Lift.
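
As a rough illustration of steps 2 to 4 (only an illustration; the apyori library used later does all of this for us), the filtering and ordering could be sketched like this, where support(), confidence() and lift() are assumed helper functions that compute the metrics defined above from the transaction list:

MIN_SUPPORT = 0.005
MIN_CONFIDENCE = 0.2

def filter_and_rank(candidate_rules, transactions):
    # candidate_rules is an iterable of (antecedent, consequent) item sets
    kept = []
    for antecedent, consequent in candidate_rules:
        # step 2: drop itemsets that occur too rarely
        if support(antecedent | consequent, transactions) < MIN_SUPPORT:
            continue
        # step 3: drop rules whose confidence is too low
        conf = confidence(antecedent, consequent, transactions)
        if conf < MIN_CONFIDENCE:
            continue
        kept.append((antecedent, consequent, lift(antecedent, consequent, transactions)))
    # step 4: strongest associations first
    return sorted(kept, key=lambda rule: rule[2], reverse=True)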

Implementing Apriori Algorithm with Python

Enough of theory, now is the time to see the Apriori algorithm in action. In this section we will use the Apriori algorithm to find rules that describe associations between different products given 7500 transactions over the course of a week at a French retail store. The dataset can be downloaded from the following link:

https://drive.google.com/file/d/1y5DYn0dGoSbC22xowBq2d4po6h1JxcTQ/view?usp=sharing

Another interesting point is that we do not need to write the script to calculate support, confidence, and lift for all the possible combinations of items. We will use an off-the-shelf library where all of the code has already been implemented.

The library I'm referring to is apyori and the source can be found here. I suggest you download and install the library in the default path for your Python libraries before proceeding.

Note: All the scripts in this article have been executed using Spyder IDE for Python.

Follow these steps to implement Apriori algorithm in Python:

Import the Libraries

The first step, as always, is to import the required libraries. Execute the following script to do so:

import numpy as np  
import matplotlib.pyplot as plt  
import pandas as pd  
from apyori import apriori  

In the script above we import pandas, numpy, pyplot, and apriori libraries.

Importing the Dataset

Now let's import the dataset and see what we're working with. Download the dataset and place it in the "Datasets" folder of the "D" drive (or change the code below to match the path of the file on your computer) and execute the following script:

store_data = pd.read_csv('D:\\Datasets\\store_data.csv')  

Let's call the head() function to see how the dataset looks:

store_data.head()  

Store data preview with header

A snippet of the dataset is shown in the above screenshot. If you look carefully at the data, you can see that the header is actually the first transaction. Each row corresponds to a transaction and each column corresponds to an item purchased in that specific transaction. NaN tells us that the item represented by the column was not purchased in that specific transaction.

This dataset has no header row, but by default the pd.read_csv function treats the first row as the header. To get around this problem, pass the header=None option to pd.read_csv, as shown below:

store_data = pd.read_csv('D:\\Datasets\\store_data.csv', header=None)  

Now execute the head() function:

store_data.head()  

In the updated output you will see that the first line is now treated as a record instead of as the header, as shown below:

Store data preview without header

Now we will use the Apriori algorithm to find out which items are commonly sold together, so that the store owner can place the related items together, or advertise them together, in order to increase profit.

Data Preprocessing

The Apriori library we are going to use requires our dataset to be in the form of a list of lists, where the whole dataset is a big list and each transaction in the dataset is an inner list within the outer big list. Currently we have data in the form of a pandas dataframe. To convert our pandas dataframe into a list of lists, execute the following script:

records = []  
for i in range(0, 7501):  # the dataset contains 7501 rows (transactions)
    records.append([str(store_data.values[i,j]) for j in range(0, 20)])  # up to 20 items per transaction
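
The hard-coded 7501 and 20 simply match the dimensions of this particular CSV file. If you would rather not carry the 'nan' placeholders from shorter transactions into the rules, a slightly more defensive variant (a sketch, not required for the rest of the article) looks like this:

records = []
for _, row in store_data.iterrows():
    # keep only the items that were actually purchased in this transaction
    records.append([str(item) for item in row if pd.notnull(item)])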
Applying Apriori

The next step is to apply the Apriori algorithm on the dataset. To do so, we can use the apriori class that we imported from the apyori library.

The apriori class requires some parameter values to work. The first parameter is the list of lists that you want to extract rules from. The second parameter is min_support, which is used to select the items with support values greater than the value specified by the parameter. Next, the min_confidence parameter keeps only those rules that have a confidence greater than the threshold specified by the parameter. Similarly, the min_lift parameter specifies the minimum lift value for the shortlisted rules. Finally, the min_length parameter specifies the minimum number of items that you want in your rules.

Let's suppose that we want rules for only those items that are purchased at least 5 times a day, or 7 x 5 = 35 times in one week, since our dataset is for a one-week time period. The support for those items can be calculated as 35/7500 = 0.0045. The minimum confidence for the rules is 20% or 0.2. Similarly, we specify the value for lift as 3 and finally min_length is 2 since we want at least two products in our rules. These values are mostly just arbitrarily chosen, so you can play with these values and see what difference it makes in the rules you get back out.

Execute the following script:

association_rules = apriori(records, min_support=0.0045, min_confidence=0.2, min_lift=3, min_length=2)  
association_results = list(association_rules)  

In the second line here we convert the rules found by the apriori class into a list since it is easier to view the results in this form.

Viewing the Results

Let's first find the total number of rules mined by the apriori class. Execute the following script:

print(len(association_results))  

The script above should return 48. Each item corresponds to one rule.

Let's print the first item in the association_results list to see the first rule. Execute the following script:

print(association_results[0])  

The output should look like this:

RelationRecord(items=frozenset({'light cream', 'chicken'}), support=0.004532728969470737, ordered_statistics=[OrderedStatistic(items_base=frozenset({'light cream'}), items_add=frozenset({'chicken'}), confidence=0.29059829059829057, lift=4.84395061728395)])  

Each item in the list is a RelationRecord containing three fields. The first field, items, shows the grocery items in the rule.

For instance, from the first item we can see that light cream and chicken are commonly bought together. This makes sense, since people who purchase light cream are careful about what they eat and hence are more likely to buy chicken (white meat) instead of red meat such as beef. Or it could mean that light cream is commonly used in recipes for chicken.

The support value for the first rule is 0.0045. This number is calculated by dividing the number of transactions containing both light cream and chicken by the total number of transactions. The confidence level for the rule is 0.2905, which shows that out of all the transactions that contain light cream, 29.05% also contain chicken. Finally, the lift of 4.84 tells us that chicken is 4.84 times more likely to be bought by customers who buy light cream, compared to the default likelihood of the sale of chicken.
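
Because RelationRecord and its OrderedStatistic entries behave like namedtuples, you can also pull these numbers out by field name instead of by index. This is just a small aside, not part of the original script:

first_rule = association_results[0]
stat = first_rule.ordered_statistics[0]
print(list(stat.items_base), "->", list(stat.items_add))
print("Support:", first_rule.support)
print("Confidence:", stat.confidence)
print("Lift:", stat.lift)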

The following script displays the rule, the support, the confidence, and the lift for each rule more clearly:

for item in association_results:

    # The first element of the RelationRecord is the frozenset of items in the rule
    pair = item[0]
    items = [x for x in pair]
    print("Rule: " + items[0] + " -> " + items[1])

    # The second element is the support for the rule
    print("Support: " + str(item[1]))

    # Confidence and lift are the third and fourth entries of the first
    # OrderedStatistic in the third element
    print("Confidence: " + str(item[2][0][2]))
    print("Lift: " + str(item[2][0][3]))
    print("=====================================")

If you execute the above script, you will see all the rules returned by the apriori class. The first four rules returned by the apriori class look like this:

Rule: light cream -> chicken  
Support: 0.004532728969470737  
Confidence: 0.29059829059829057  
Lift: 4.84395061728395  
=====================================
Rule: mushroom cream sauce -> escalope  
Support: 0.005732568990801126  
Confidence: 0.3006993006993007  
Lift: 3.790832696715049  
=====================================
Rule: escalope -> pasta  
Support: 0.005865884548726837  
Confidence: 0.3728813559322034  
Lift: 4.700811850163794  
=====================================
Rule: ground beef -> herb & pepper  
Support: 0.015997866951073192  
Confidence: 0.3234501347708895  
Lift: 3.2919938411349285  
=====================================

We have already discussed the first rule. Let's now discuss the second rule. The second rule states that mushroom cream sauce and escalope are bought together frequently. The support for the rule is 0.0057. The confidence for this rule is 0.3006, which means that out of all the transactions containing mushroom cream sauce, 30.06% are likely to contain escalope as well. Finally, the lift of 3.79 shows that escalope is 3.79 times more likely to be bought by customers who buy mushroom cream sauce, compared to its default sale.
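
If you would rather see all 48 rules in one sortable table than in a long printout, one option (my addition, assuming pandas is still imported as pd from earlier) is to collect them into a DataFrame and sort by lift:

rules_summary = pd.DataFrame([
    {
        "rule": ", ".join(r.ordered_statistics[0].items_base)
                + " -> "
                + ", ".join(r.ordered_statistics[0].items_add),
        "support": r.support,
        "confidence": r.ordered_statistics[0].confidence,
        "lift": r.ordered_statistics[0].lift,
    }
    for r in association_results
])
print(rules_summary.sort_values("lift", ascending=False).head())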

Conclusion

Association rule mining algorithms such as Apriori are very useful for finding simple associations between our data items. They are easy to implement and highly explainable. However, for more advanced insights, such as those used by Google or Amazon, more complex techniques, such as recommender systems, are used. Still, as you can see, this method is a very simple way to get basic associations if that's all your use case needs.

Hynek Schlawack: Hardening Your Web Server’s SSL Ciphers

There are many wordy articles on configuring your web server’s TLS ciphers. This is not one of them. Instead I will share a configuration which is both compatible enough for today’s needs and scores a straight “A” on Qualys’s SSL Server Test.

NumFOCUS: Endorsements for the 2018 NumFOCUS Election

Not Invented Here: CPython vs PyPy Memory Usage

If you have lots of "small" objects in a Python program (objects which have few instance attributes), you may find that the object overhead starts to become considerable. The common wisdom says that to reduce this in CPython you need to re-define the classes to use __slots__, eliminating the attribute dictionary. But this comes with the downsides of limiting flexibility and eliminating the use of class defaults. Would it surprise you to learn that PyPy can significantly, and without any effort by the programmer, reduce that overhead automatically?

Let's take a look.

Contrary to advice, instead of starting at the very beginning, we'll jump right to the end. The following graph shows the peak memory usage of the example program we'll be talking about in this post across seven different Python implementations: PyPy2 v6.0, PyPy3 v6.0, CPython 2.7.15, 3.4.9, 3.5.6, 3.6.6, and 3.7.0 [1].

For regular objects ("Point3D"), PyPy needs less than 700MB to create 10,000,000 of them, whereas CPython 2.7 needs almost 3.5 GB, and CPython 3.x needs between 1.5 and 2.1 GB [6]. Moving to __slots__ ("Point3DSlot") brings the CPython overhead closer to (but still higher than) that of PyPy. In particular, note that the PyPy memory usage is essentially the same whether or not slots are used.

[Bar chart: peak Memory Usage (in MB) for Point3D, Point3DSlot, and Point3DSlot Uncached Integers across PyPy2, PyPy3, CPython 2.7, 3.4, 3.5, 3.6, and 3.7.]

The third group of data is the same as the second group, except instead of using small integers that should be in the CPython internal integer object cache [7], I used larger numbers that shouldn't be cached. This is just an interesting data point showing the allocation of three times as many objects, and won't be discussed further.

What Is Being Measured

In the script I used to produce these numbers [2], I'm using the excellent psutil library's Process.memory_info to record the "unique set size" ("the memory which is unique to a process and which would be freed if the process was terminated right now") before and then after allocating a large number of objects.

def check(klass, x, y, z):
    before = get_memory().uss
    inst = klass(0, x, y, z)
    print("Size of", type(inst).__name__, sizeof_object(inst))
    del inst
    print("Count      AbsoluteUsage     Delta")
    print("=======================================")
    for count in 100, 1000, 10000, 100000, 1000000, 10000000:
        l = [None] * count
        for i in range(count):
            l[i] = klass(i, x, y, z)
        after = get_memory().uss
        print("%9d" % count,
              format_memory(after - global_starting_memory.uss),
              format_memory(after - before))
        l = None
        print()

This gives us a fairly accurate idea of how much memory the processes needed to allocate from the operating system to be able to create all the objects we asked for. (get_memory is a helper function that runs the garbage collector to be sure we have the most stable numbers.)
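
The full script is in the gist linked in the footnotes; purely as a sketch (names and details are my guesses, not copied from the original), such helpers might look like this:

import gc

import psutil


def get_memory():
    # Run the collector a few times so anything Python has freed is
    # actually reflected in the numbers we read back from the OS.
    for _ in range(3):
        gc.collect()
    # memory_full_info() includes the "uss" (unique set size) field.
    return psutil.Process().memory_full_info()


def format_memory(nbytes):
    # Render a byte count as megabytes, matching the tables below.
    return "%14.2f" % (nbytes / (1024.0 * 1024.0))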

What Is Not Being Measured

In this example output from a run of PyPy, the AbsoluteUsage is the total growth from when the program started, while the Delta is the growth just within this function.

Memory for Point3D(1, 2, 3)
Count      AbsoluteUsage     Delta
=======================================
      100           0.02           0.02
     1000           0.03           0.03
    10000           0.51           0.51
   100000           7.20           7.20
  1000000          69.11          69.11
 10000000         691.90         691.90

This was the first of the test runs within this particular process. The second test run within this process reports higher absolute deltas since the beginning of the program, although the overall deltas are smaller. This indicates how much memory the program has allocated from the operating system but not returned to it, even though it may technically be free from the standpoint of the Python runtime; this accounts for things like internal caches, or in PyPy's case, jitted code.

Memory for Point3DSlot(1, 2, 3)
Size of Point3DSlot -1
Count      AbsoluteUsage     Delta
=======================================
      100          86.09           0.00
     1000          86.12           0.03
    10000          86.56           0.46
   100000          87.33           1.23
  1000000         138.70          52.60
 10000000         692.05         605.95

Although I captured the data, this post is not about the startup or initial memory allocation of the various interpreters, nor about how much can easily be shared between forked processes, nor about how much memory is returned to the operating system while the process is still running. We're only talking about the memory size needed to allocate a given number of objects, e.g., the Delta column.

Object Internals

To understand what's happening, let's look at the two types of objects we're comparing:

class Point3D(object):
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

class Point3DSlot(object):
    __slots__ = ('x', 'y', 'z')

    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

These are both small classes with three instance attributes. One is a standard, default, object, and one specifies its instance attributes in __slots__.

Objects with Dictionaries

Standard objects, like Point3D, have a special attribute, __dict__, which is a normal Python dictionary object used to hold all the instance attributes for the object. We previously looked at how __getattribute__ can be used to customize all attribute reads for an object; likewise, __setattr__ can customize all attribute writes. The default __getattribute__ and __setattr__ that a class inherits from object behave something as if they were written to access the __dict__:

class Object:

    def __getattribute__(self, name):
        if name in self.__dict__:
            return self.__dict__[name]
        return getattr(type(self), name)

    def __setattr__(self, name, value):
        self.__dict__[name] = value

One advantage of having a __dict__ underlying an object is the flexibility it provides: you don't have to pre-declare your attributes for every object, and any object can have any attribute, so it facilitates subclasses adding new attributes, or even other libraries adding new, specialized, attributes to implement caching of expensive computed properties.

One disadvantage is that a __dict__ is a generic Python dictionary, not specialized at all [3], and as such it has overhead.

On CPython, we can ask the interpreter how much memory any given object uses with sys.getsizeof. On my machine under a 64-bit CPython 2.7.15, a bare object takes 16 bytes, while a trivial subclass takes a full 64 bytes (due to the overhead of being tracked by the garbage collector):

>>> import sys
>>> sys.getsizeof(object())
16
>>> class TrivialSubclass(object):
...     pass
...
>>> sys.getsizeof(TrivialSubclass())
64

An empty dict occupies 280 bytes:

>>> sys.getsizeof({})
280

And so when you combine the size of the trivial subclass, with the size of its __dict__ you arrive at a minimum object size of 344 bytes:

>>> sys.getsizeof(TrivialSubclass().__dict__)
280
>>> sys.getsizeof(TrivialSubclass()) + sys.getsizeof(TrivialSubclass().__dict__)
344

A fully occupied Point3D object is also 344 bytes:

>>> pd = Point3D(1, 2, 3)
>>> sys.getsizeof(pd) + sys.getsizeof(pd.__dict__)
344

Because of the way dictionaries are implemented [8], there's always a little spare room for extra attributes. We don't find a jump in size until we've added three more attributes:

>>> pd.a = 1
>>> sys.getsizeof(pd) + sys.getsizeof(pd.__dict__)
344
>>> pd.b = 1
>>> sys.getsizeof(pd) + sys.getsizeof(pd.__dict__)
344
>>> pd.c = 1
>>> sys.getsizeof(pd) + sys.getsizeof(pd.__dict__)
1112

Note

These values can change quite a bit across Python versions, typically improving over time. In CPython 3.4 and 3.5, getsizeof({}) returns 288, while it returns 240 in both 3.6 and 3.7. In addition, getsizeof(pd.__dict__) returns 96 and 112 [4]. The answer to getsizeof(pd) is 56 in all four versions.

Objects With Slots

Objects with a __slots__ declaration, like Point3DSlot do not have a __dict__ by default. The documentation notes that this can be a space savings. Indeed, on CPython 2.7, a Point3DSlot has a size of only 72 bytes, only one full pointer larger than a trivial subclass (when we do not factor in the __dict__):

>>> pds = Point3DSlot(1, 2, 3)
>>> sys.getsizeof(pds)
72

If they don't have an instance dictionary, where do they store their attributes? And why, if Point3DSlot has three defined attributes, is it only one pointer larger than Point3D?

Slots, like @property, @classmethod and @staticmethod, are implemented using descriptors. For our purpose, descriptors are a way to extend the workings of __getattribute__ and friends. A descriptor is an object whose type implements a __get__ method, and when that object is found in a type's dictionary, it is called instead of checking the __dict__. Something like this [5]:

class Object:

    def __getattribute__(self, name):
        if (name in dir(type(self))
                and hasattr(getattr(type(self), name), '__get__')):
            return getattr(type(self), name).__get__(self, type(self))
        if name in self.__dict__:
            return self.__dict__[name]
        return getattr(type(self), name)

    def __setattr__(self, name, value):
        if (name in dir(type(self))
                and hasattr(getattr(type(self), name), '__set__')):
            getattr(type(self), name).__set__(self, type(self))
            return
        self.__dict__[name] = value

When the class statement (indeed, when the type metaclass) finds __slots__ in the class body (the class dictionary), it takes special steps. Most importantly, it creates a descriptor for each mentioned slot and places it in the class's __dict__. So our Point3DSlot class gets three such descriptors:

>>> dict(Point3DSlot.__dict__)
{'__doc__': None,
 '__init__': <function __main__.__init__>,
 '__module__': '__main__',
 '__slots__': ('x', 'y', 'z'),
 'x': <member 'x' of 'Point3DSlot' objects>,
 'y': <member 'y' of 'Point3DSlot' objects>,
 'z': <member 'z' of 'Point3DSlot' objects>}
>>> pds.x
1
>>> Point3DSlot.x.__get__
<method-wrapper '__get__' of member_descriptor object at 0x10b6fc2d8>
>>> Point3DSlot.x.__get__(pds, Point3DSlot)
1

Variable Storage

We've established how we can access these magic, hidden slotted attributes (through the descriptor protocol). (We've also established why we can't have defaults for slotted attributes in the class.) But we still haven't found out where they are stored. If they're not in a dictionary, where are they?

The answer is that they're stored directly in the object itself. Every type has a member called tp_basicsize, exposed to Python as __basicsize__. When the interpreter allocates an object, it allocates __basicsize__ bytes for it (every object has a minimum basic size, the size of object). The type metaclass arranges for __basicsize__ to be big enough to hold (a pointer to) each of the slotted attributes, which are kept in memory immediately after the data for the basic object. The descriptor for each attribute, then, just does some pointer arithmetic off of self to read and write the value. In a way, it's very similar to how collections.namedtuple works, except using pointers instead of indices.

That may be hard to follow, so here's an example.

The basic size of object exactly matches the reported size of its instances:

>>> object.__basicsize__
16
>>> sys.getsizeof(object())
16

We get the same when we create an object that cannot have any instance variables, and hence does not need to be tracked by the garbage collector:

>>> class NoSlots(object):
...     __slots__ = ()
...
>>> NoSlots.__basicsize__
16
>>> sys.getsizeof(NoSlots())
16

When we add one slot to an object, its basic size increases by one pointer (8 bytes), and because this object needs to be tracked by the garbage collector, getsizeof reports some extra overhead:

>>> class OneSlot(object):
...     __slots__ = ('a',)
...
>>> OneSlot.__basicsize__
24
>>> sys.getsizeof(OneSlot())
56

The basic size for an object with 3 slots is 16 (the size of object) + 3 pointers, or 40. What's the basic size for an object that has a __dict__?

>>> Point3DSlot.__basicsize__
40
>>> Point3D.__basicsize__
32

Hmm, it's 16 + 2 pointers. What could those two pointers be? Documentation to the rescue:

__slots__ allow us to explicitly declare data members (like properties) and deny the creation of __dict__ and __weakref__ (unless explicitly declared in __slots__...)

So those two pointers are for __dict__ and __weakref__, things that standard objects get automatically, but which we have to opt-in to if we want them with __slots__. Thus, an object with three slots is one pointer size bigger than a standard object.
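
A quick way to see that trade-off in practice (my example, not from the original post): a slotted class rejects attributes it doesn't know about unless you explicitly ask for a __dict__ back.

class NoDict(object):
    __slots__ = ('x',)

class WithDict(object):
    __slots__ = ('x', '__dict__')

nd, wd = NoDict(), WithDict()
nd.x = wd.x = 1
wd.extra = 2          # fine: we opted back in to a __dict__
try:
    nd.extra = 2      # no __dict__ and no 'extra' slot
except AttributeError as exc:
    print(exc)        # 'NoDict' object has no attribute 'extra'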

How PyPy Does Better

By now we should understand why the memory usage dropped significantly when we added __slots__ to our objects on CPython (although that comes with a cost). That leaves the question: how does PyPy get such good memory performance with a __dict__ that __slots__ doesn't even matter?

Earlier I wrote that the __dict__ of an instance is just a standard dictionary, not specialized at all. That's basically true on CPython, but it's not at all true on PyPy. PyPy basically fakes __dict__ by using __slots__ for all objects.

A given set of attributes (such as our "x", "y", "z" attributes for Point3DSlot) is called a "map". Each instance refers to its map, which tells PyPy how to efficiently access a given attribute. When an attribute is added or deleted, a new map is created (or re-used from an existing object; objects of completely unrelated types, but having common attributes can share the same maps) and assigned to the object, re-arranging things as needed. It's as if __slots__ was assigned to each instance, with descriptors added and removed for the instance on the fly.
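
Here is a very loose sketch of the idea in plain Python (purely illustrative and mine, nothing like PyPy's actual implementation): objects with the same attribute layout share one "map", and each instance only stores its values in a compact array.

class Map(object):
    # Shared by every instance that has exactly these attributes.
    def __init__(self, attrs):
        self.index = {name: i for i, name in enumerate(attrs)}

POINT_MAP = Map(('x', 'y', 'z'))

class MappedInstance(object):
    __slots__ = ('map', 'storage')

    def __init__(self, map_, *values):
        self.map = map_
        self.storage = list(values)   # compact storage, no per-instance dict

    def get(self, name):
        return self.storage[self.map.index[name]]

p = MappedInstance(POINT_MAP, 1, 2, 3)
print(p.get('y'))   # 2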

If the program ever directly accesses an instance's __dict__, PyPy creates a thin wrapper object that operates on the object's map.

So for a program that has many similar-looking objects, even if they are unrelated, PyPy's approach can save a lot of memory. On the other hand, if the program creates objects with a very diverse set of attributes and frequently accesses __dict__ directly, it's theoretically possible that PyPy could use more memory than CPython.

You can read more about this approach in this PyPy blog post.

Footnotes

[1]All 64-bit builds, all tested on macOS. The results on Linux were very similar.
[2]Available at this gist.
[3]In CPython. But I'm getting ahead of myself.
[4]The CPython dict implementation was completely overhauled in CPython 3.6. And based on the sizes of {} versus pd.__dict__ we can see some sort of specialization for instance dictionaries, at least in terms of their fill factor.
[5]This is very rough, and actually inaccurate in some small but important details. Refer to the documentation for the full protocol.
[6]No, I'm not totally sure why Python 3.7 is such an outlier and uses more memory than the other Python 3.x versions.
[7]See PyLong_FromLong
[8]With a particular desired load factor.

Mike C. Fletcher: TTFQuery 2.0.0b1 Up on PyPI

TTFQuery has a new release up. This release has a bunch of small breaking changes in it; specifically, the command line demonstration tools now work differently. It is also now Python 3 ready (i.e. one more package should now be out of the way to get OpenGLContext running under Python 3) and finally has its own (basic) test suite instead of relying on OpenGLContext to exercise it. It's also got a bit more documentation. You should be able to pull it with:

pip3 install 'ttfquery==2.0.0b1'

or, if you want to run the test suite (expect it to fail under Win32 or Mac, as I don't have those):

pip install tox
bzr branch lp:ttfquery
cd ttfquery
tox

Enjoy.

Codementor: Understanding Python Dataclasses — Part 1

Python 3.7 introduces new dataclasses. In this post I discuss how they work, and how you can adapt them to your use case.
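
For readers who haven't seen the feature yet, a minimal example (mine, not taken from the linked post) looks like this:

from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float

p = Point(1.0, 2.0)
print(p)            # Point(x=1.0, y=2.0)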

PyCharm: PyCharm 2018.2.2 RC

PyCharm 2018.2.2 Release Candidate is now available, with some small improvements. Get it now from our Confluence page

New in This Version

  • Some improvements to our pipenv support: if the pipfile specifies packages which aren’t compatible with your computer, they will no longer be suggested. Also, if you choose to create a pipenv for a project you’ve already opened, the project’s dependencies will now automatically be installed. This matches the behavior of pipenv on the command-line.
  • A regression where virtualenvs weren’t automatically detected has been resolved.
  • Some issues in version control support were ironed out: when you right-click a commit to reword it (change its commit message), in some cases PyCharm wasn't able to identify the new hash of the commit correctly; this has been cleared up.
  • And much more, see the release notes for details.

Interested?

Download the RC from our confluence page

If you’re on Ubuntu 16.04 or later, you can use snap to get PyCharm RC versions, and stay up to date. You can find the installation instructions on our website.

The release candidate (RC) is not an early access program (EAP) build, and does not bundle an EAP license. If you get PyCharm Professional Edition RC, you will either need a currently active PyCharm subscription, or you will receive a 30-day free trial.

Artem Golubin: How Python saves memory when storing strings

Since Python 3, the str type uses Unicode representation. Unicode strings can take up to 4 bytes per character depending on the encoding, which sometimes can be expensive from a memory perspective.

To reduce memory consumption and improve performance, Python uses three kinds of internal representations for Unicode strings:

  • 1 byte per char (Latin-1 encoding)
  • 2 bytes per char (UCS-2 encoding)
  • 4 bytes per char (UCS-4 encoding)

When programming in Python all strings behave the same, and most of the time we don't notice any difference. However, the difference can be very remarkable and sometimes unexpected when working with large amounts of text.

To see the difference in internal representations, we can use the sys.getsizeof function, which returns the size of an object in bytes:

>>> import sys
>>> string = 'hello'
>>> sys.getsizeof(string)
54
>>> # 1-byte encoding
>>>
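
To see the three representations in action, you can compare strings whose widest character needs 1, 2, or 4 bytes. The exact byte counts vary a little between CPython versions, so this snippet (my addition, not part of the excerpt) just prints whatever your interpreter reports:

import sys

samples = [
    'hello',    # ASCII / Latin-1: 1 byte per character
    'héllo',    # still Latin-1:   1 byte per character
    'hēllo',    # needs UCS-2:     2 bytes per character
    'h🐍llo',    # needs UCS-4:     4 bytes per character
]
for s in samples:
    print(repr(s), sys.getsizeof(s))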

Peter Bengtsson: Quick dog-piling (aka stampeding herd) URL stresstest

Whenever you want to quickly bombard a URL with some concurrent traffic, you can use this:

import random
import time

import requests
import concurrent.futures


def _get_size(url):
    sleep = random.random() / 10
    # print("sleep", sleep)
    time.sleep(sleep)
    r = requests.get(url)
    # print(r.status_code)
    assert len(r.text)
    return len(r.text)


def run(url, times=10):
    sizes = []
    futures = []
    with concurrent.futures.ThreadPoolExecutor() as executor:
        for _ in range(times):
            futures.append(executor.submit(_get_size, url))
        for future in concurrent.futures.as_completed(futures):
            sizes.append(future.result())
    return sizes


if __name__ == "__main__":
    import sys
    print(run(sys.argv[1]))

It's really basic but it works wonderfully. It starts 10 concurrent threads that all hit the same URL at almost the same time.
I've been using this to stress test a local Django server while testing some atomic writes with the file system.
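
Assuming you save the snippet as something like dogpile.py (the file name is mine, not the author's), you would point it at the URL you want to hammer:

python dogpile.py http://localhost:8000/some/page/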
