Pandas is a foundational library for analytics, data processing, and data science. It's a huge project with tons of optionality and depth.
This tutorial will cover some lesser-used but idiomatic Pandas capabilities that lend your code better readability, versatility, and speed, à la a BuzzFeed listicle.
If you feel comfortable with the core concepts of Python's Pandas library, hopefully you'll find a trick or two in this article that you haven't stumbled across previously. (If you're just starting out with the library, 10 Minutes to Pandas is a good place to start.)
Note: The examples in this article are tested with Pandas version 0.23.2 and Python 3.6.6. However, they should also be valid in older versions.
1. Configure Options and Settings at Interpreter Startup

You may have run across Pandas' rich options and settings system before.
It's a huge productivity saver to set customized Pandas options at interpreter startup, especially if you work in a scripting environment. You can use pd.set_option() to configure to your heart's content with a Python or IPython startup file.
The options use a dot notation such as pd.set_option('display.max_colwidth', 25)
, which lends itself well to a nested dictionary of options:
import pandas as pd

def start():
    options = {
        'display': {
            'max_columns': None,
            'max_colwidth': 25,
            'expand_frame_repr': False,  # Don't wrap to multiple pages
            'max_rows': 14,
            'max_seq_items': 50,         # Max length of printed sequence
            'precision': 4,
            'show_dimensions': False
        },
        'mode': {
            'chained_assignment': None   # Controls SettingWithCopyWarning
        }
    }

    for category, option in options.items():
        for op, value in option.items():
            pd.set_option(f'{category}.{op}', value)  # Python 3.6+

if __name__ == '__main__':
    start()
    del start  # Clean up namespace in the interpreter
If you launch an interpreter session, youโll see that everything in the startup script has been executed, and Pandas is imported for you automatically with your suite of options:
>>> pd.__name__
'pandas'
>>> pd.get_option('display.max_rows')
14
Let's use some data on abalone hosted by the UCI Machine Learning Repository to demonstrate the formatting that was set in the startup file. The data will truncate at 14 rows with 4 digits of precision for floats:
>>> url = ('https://archive.ics.uci.edu/ml/'
...        'machine-learning-databases/abalone/abalone.data')
>>> cols = ['sex', 'length', 'diam', 'height', 'weight', 'rings']
>>> abalone = pd.read_csv(url, usecols=[0, 1, 2, 3, 4, 8], names=cols)

>>> abalone
     sex  length   diam  height  weight  rings
0      M   0.455  0.365   0.095  0.5140     15
1      M   0.350  0.265   0.090  0.2255      7
2      F   0.530  0.420   0.135  0.6770      9
3      M   0.440  0.365   0.125  0.5160     10
4      I   0.330  0.255   0.080  0.2050      7
5      I   0.425  0.300   0.095  0.3515      8
6      F   0.530  0.415   0.150  0.7775     20
...   ..     ...    ...     ...     ...    ...
4170   M   0.550  0.430   0.130  0.8395     10
4171   M   0.560  0.430   0.155  0.8675      8
4172   F   0.565  0.450   0.165  0.8870     11
4173   M   0.590  0.440   0.135  0.9660     10
4174   M   0.600  0.475   0.205  1.1760      9
4175   F   0.625  0.485   0.150  1.0945     10
4176   M   0.710  0.555   0.195  1.9485     12
You'll see this dataset pop up in other examples later as well.
2. Make Toy Data Structures With Pandas' Testing Module
Hidden way down in Pandas' testing
module are a number of convenient functions for quickly building quasi-realistic Series and DataFrames:
>>> import pandas.util.testing as tm
>>> tm.N, tm.K = 15, 3  # Module-level default rows/columns

>>> import numpy as np
>>> np.random.seed(444)

>>> tm.makeTimeDataFrame(freq='M').head()
                 A       B       C
2000-01-31  0.3574 -0.8804  0.2669
2000-02-29  0.3775  0.1526 -0.4803
2000-03-31  1.3823  0.2503  0.3008
2000-04-30  1.1755  0.0785 -0.1791
2000-05-31 -0.9393 -0.9039  1.1837

>>> tm.makeDataFrame().head()
                 A       B       C
nTLGGTiRHF -0.6228  0.6459  0.1251
WPBRn9jtsR -0.3187 -0.8091  1.1501
7B3wWfvuDA -1.9872 -1.0795  0.2987
yJ0BTjehH1  0.8802  0.7403 -1.2154
0luaYUYvy1 -0.9320  1.2912 -0.2907
There are around 30 of these, and you can see the full list by calling dir()
on the module object. Here are a few:
>>> [i for i in dir(tm) if i.startswith('make')]
['makeBoolIndex',
 'makeCategoricalIndex',
 'makeCustomDataframe',
 'makeCustomIndex',
 # ...,
 'makeTimeSeries',
 'makeTimedeltaIndex',
 'makeUIntIndex',
 'makeUnicodeIndex']
These can be useful for benchmarking, testing assertions, and experimenting with Pandas methods that you are less familiar with.
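For instance, here's a small sketch of using one of these factories in a quick test with pandas.testing.assert_frame_equal, which raises an AssertionError if the two frames differ (the toy variable name is arbitrary):

>>> from pandas.testing import assert_frame_equal
>>> toy = tm.makeTimeDataFrame(freq='D')
>>> # Transposing twice should round-trip cleanly; this raises if it doesn't
>>> assert_frame_equal(toy, toy.T.T)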
3. Take Advantage of Accessor Methods
Perhaps you've heard of the term accessor, which is somewhat like a getter (although getters and setters are used infrequently in Python). For our purposes here, you can think of a Pandas accessor as a property that serves as an interface to additional methods.
Pandas Series have three of them:
>>> pd.Series._accessors
{'cat', 'str', 'dt'}
Yes, that definition above is a mouthful, so let's take a look at a few examples before discussing the internals.
.cat
is for categorical data, .str
is for string (object) data, and .dt
is for datetime-like data. Let's start off with .str
: imagine that you have some raw city/state/ZIP data as a single field within a Pandas Series.
Pandas string methods are vectorized, meaning that they operate on the entire array without an explicit for-loop:
>>> addr = pd.Series([
...     'Washington, D.C. 20003',
...     'Brooklyn, NY 11211-1755',
...     'Omaha, NE 68154',
...     'Pittsburgh, PA 15211'
... ])

>>> addr.str.upper()
0     WASHINGTON, D.C. 20003
1    BROOKLYN, NY 11211-1755
2            OMAHA, NE 68154
3       PITTSBURGH, PA 15211
dtype: object

>>> addr.str.count(r'\d')  # 5 or 9-digit zip?
0    5
1    9
2    5
3    5
dtype: int64
For a more involved example, let's say that you want to separate out the three city/state/ZIP components neatly into DataFrame fields.
You can pass a regular expression to .str.extract()
to "extract" parts of each cell in the Series. In .str.extract()
, .str
is the accessor, and .str.extract()
is an accessor method:
>>> regex = (r'(?P<city>[A-Za-z ]+), '      # One or more letters
...          r'(?P<state>[A-Z]{2}) '        # 2 capital letters
...          r'(?P<zip>\d{5}(?:-\d{4})?)')  # Optional 4-digit extension

>>> addr.str.replace('.', '').str.extract(regex)
         city state         zip
0  Washington    DC       20003
1    Brooklyn    NY  11211-1755
2       Omaha    NE       68154
3  Pittsburgh    PA       15211
This also illustrates what is known as method-chaining, where .str.extract(regex)
is called on the result of addr.str.replace('.', '')
, which cleans up use of periods to get a nice 2-character state abbreviation.
It's helpful to know a tiny bit about how these accessor methods work as a motivating reason for why you should use them in the first place, rather than something like addr.apply(re.findall, ...).
Each accessor is itself a bona fide Python class:
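>>> # (Class/module paths shown here are from Pandas 0.23; they may differ in other versions)
>>> pd.Series.str
<class 'pandas.core.strings.StringMethods'>

>>> pd.Series.dt
<class 'pandas.core.indexes.accessors.CombinedDatetimelikeProperties'>

>>> pd.Series.cat
<class 'pandas.core.arrays.categorical.CategoricalAccessor'>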
These standalone classes are then "attached" to the Series class using a CachedAccessor
. It is when the classes are wrapped in CachedAccessor
that a bit of magic happens.
CachedAccessor
is inspired by a "cached property" design: a property is only computed once per instance and then replaced by an ordinary attribute. It does this by overloading the .__get__()
method, which is part of Python's descriptor protocol.
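To give a flavor of the pattern, here is a simplified sketch, not Pandas' actual implementation (the class name is made up), of a cached accessor built on the descriptor protocol:

class CachedAccessorSketch:
    """Simplified stand-in for Pandas' CachedAccessor."""

    def __init__(self, name, accessor_cls):
        self._name = name                  # Attribute name, e.g. 'str'
        self._accessor_cls = accessor_cls  # Class that implements the accessor methods

    def __get__(self, obj, cls):
        if obj is None:
            # Accessed on the class itself (like pd.Series.str): return the accessor class
            return self._accessor_cls
        accessor = self._accessor_cls(obj)             # Build the accessor for this instance
        object.__setattr__(obj, self._name, accessor)  # Cache it, shadowing the descriptor
        return accessor                                # Later lookups hit the cached attribute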
The second accessor, .dt
, is for datetime-like data. It technically belongs to Pandas' DatetimeIndex
, and if called on a Series, it is converted to a DatetimeIndex
first:
>>> daterng = pd.Series(pd.date_range('2017', periods=9, freq='Q'))
>>> daterng
0   2017-03-31
1   2017-06-30
2   2017-09-30
3   2017-12-31
4   2018-03-31
5   2018-06-30
6   2018-09-30
7   2018-12-31
8   2019-03-31
dtype: datetime64[ns]

>>> daterng.dt.day_name()
0      Friday
1      Friday
2    Saturday
3      Sunday
4    Saturday
5    Saturday
6      Sunday
7      Monday
8      Sunday
dtype: object

>>> # Second-half of year only
>>> daterng[daterng.dt.quarter > 2]
2   2017-09-30
3   2017-12-31
6   2018-09-30
7   2018-12-31
dtype: datetime64[ns]

>>> daterng[daterng.dt.is_year_end]
3   2017-12-31
7   2018-12-31
dtype: datetime64[ns]
The third accessor, .cat
is for Categorical data only, which you'll see shortly in its own section.
4. Create a DatetimeIndex From Component Columns
Speaking of datetime-like data, as in daterng
above, it's possible to create a Pandas DatetimeIndex
from multiple component columns that together form a date or datetime:
>>> from itertools import product
>>> datecols = ['year', 'month', 'day']

>>> df = pd.DataFrame(list(product([2017, 2016], [1, 2], [1, 2, 3])),
...                   columns=datecols)
>>> df['data'] = np.random.randn(len(df))
>>> df
    year  month  day    data
0   2017      1    1 -0.0767
1   2017      1    2 -1.2798
2   2017      1    3  0.4032
3   2017      2    1  1.2377
4   2017      2    2 -0.2060
5   2017      2    3  0.6187
6   2016      1    1  2.3786
7   2016      1    2 -0.4730
8   2016      1    3 -2.1505
9   2016      2    1 -0.6340
10  2016      2    2  0.7964
11  2016      2    3  0.0005

>>> df.index = pd.to_datetime(df[datecols])
>>> df.head()
            year  month  day    data
2017-01-01  2017      1    1 -0.0767
2017-01-02  2017      1    2 -1.2798
2017-01-03  2017      1    3  0.4032
2017-02-01  2017      2    1  1.2377
2017-02-02  2017      2    2 -0.2060
Finally, you can drop the old individual columns and convert to a Series:
>>> df = df.drop(datecols, axis=1).squeeze()
>>> df.head()
2017-01-01   -0.0767
2017-01-02   -1.2798
2017-01-03    0.4032
2017-02-01    1.2377
2017-02-02   -0.2060
Name: data, dtype: float64

>>> df.index.dtype_str
'datetime64[ns]'
The intuition behind passing a DataFrame is that a DataFrame resembles a Python dictionary where the column names are keys, and the individual columns (Series) are the dictionary values. That's why pd.to_datetime(df[datecols].to_dict(orient='list'))
would also work in this case. This mirrors the construction of Python's datetime.datetime
, where you pass keyword arguments such as datetime.datetime(year=2000, month=1, day=15, hour=10)
.
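For a quick illustration of the dictionary form (the values here are arbitrary):

>>> pd.to_datetime({'year': [2017, 2016], 'month': [1, 2], 'day': [15, 28]})
0   2017-01-15
1   2016-02-28
dtype: datetime64[ns]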
5. Use Categorical Data to Save on Time and Space
One powerful Pandas feature is its Categorical
dtype.
Even if you're not always working with gigabytes of data in RAM, you've probably run into cases where straightforward operations on a large DataFrame seem to hang up for more than a few seconds.
Pandas object
dtype is often a great candidate for conversion to category data. (object
is a container for Python str
, heterogeneous data types, or "other" types.) Strings occupy a significant amount of space in memory:
>>> colors = pd.Series([
...     'periwinkle',
...     'mint green',
...     'burnt orange',
...     'periwinkle',
...     'burnt orange',
...     'rose',
...     'rose',
...     'mint green',
...     'rose',
...     'navy'
... ])

>>> import sys
>>> colors.apply(sys.getsizeof)
0    59
1    59
2    61
3    59
4    61
5    53
6    53
7    59
8    53
9    53
dtype: int64
Note: I used sys.getsizeof()
to show the memory occupied by each individual value in the Series. Keep in mind these are Python objects that have some overhead in the first place. (sys.getsizeof('')
will return 49 bytes.)
There is also colors.memory_usage()
, which sums up the memory usage and relies on the .nbytes
attribute of the underlying NumPy array. Don't get too bogged down in these details: what is important is relative memory usage that results from type conversion, as you'll see next.
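For a rough sense of the difference (these figures assume a 64-bit build of Python, where each element of the underlying object array is an 8-byte pointer):

>>> colors.values.nbytes  # 10 pointers x 8 bytes, ignoring the strings themselves
80
>>> colors.memory_usage(index=False, deep=True)  # Pointers plus the string objects they reference
650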
Now, what if we could take the unique colors above and map each to a less space-hogging integer? Here is a naive implementation of that:
>>> mapper = {v: k for k, v in enumerate(colors.unique())}
>>> mapper
{'periwinkle': 0, 'mint green': 1, 'burnt orange': 2, 'rose': 3, 'navy': 4}

>>> as_int = colors.map(mapper)
>>> as_int
0    0
1    1
2    2
3    0
4    2
5    3
6    3
7    1
8    3
9    4
dtype: int64

>>> as_int.apply(sys.getsizeof)
0    24
1    28
2    28
3    24
4    28
5    28
6    28
7    28
8    28
9    28
dtype: int64
Note: Another way to do this same thing is with Pandas' pd.factorize(colors)
:
>>> pd.factorize(colors)[0]
array([0, 1, 2, 0, 2, 3, 3, 1, 3, 4])
Either way, you are encoding the object as an enumerated type (categorical variable).
You'll notice immediately that memory usage is just about cut in half compared to when the full strings are used with object
dtype.
Earlier in the section on accessors, I mentioned the .cat
(categorical) accessor. The above with mapper
is a rough illustration of what is happening internally with Pandas' Categorical
dtype:
"The memory usage of a Categorical
is proportional to the number of categories plus the length of the data. In contrast, an object
dtype is a constant times the length of the data." (Source)
In colors
above, you have a ratio of 2 values for every unique value (category):
>>> len(colors) / colors.nunique()
2.0
As a result, the memory savings from converting to Categorical
is good, but not great:
>>> # Not a huge space-saver to encode as Categorical
>>> colors.memory_usage(index=False, deep=True)
650
>>> colors.astype('category').memory_usage(index=False, deep=True)
495
However, if you blow out the proportion above, with a lot of data and few unique values (think about data on demographics or alphabetic test scores), the reduction in memory required is over 10 times:
>>> manycolors = colors.repeat(10)
>>> len(manycolors) / manycolors.nunique()  # Much greater than 2.0x
20.0

>>> manycolors.memory_usage(index=False, deep=True)
6500
>>> manycolors.astype('category').memory_usage(index=False, deep=True)
585
A bonus is that computational efficiency gets a boost too: for categorical Series
, the string operations are performed on the .cat.categories
attribute rather than on each original element of the Series
.
In other words, the operation is done once per unique category, and the results are mapped back to the values. Categorical data has a .cat
accessor that is a window into attributes and methods for manipulating the categories:
>>> ccolors = colors.astype('category')
>>> ccolors.cat.categories
Index(['burnt orange', 'mint green', 'navy', 'periwinkle', 'rose'], dtype='object')
In fact, you can reproduce something similar to the example above that you did manually:
>>> ccolors.cat.codes
0    3
1    1
2    0
3    3
4    0
5    4
6    4
7    1
8    4
9    2
dtype: int8
All that you need to do to exactly mimic the earlier manual output is to reorder the codes:
>>> ccolors.cat.reorder_categories(mapper).cat.codes
0    0
1    1
2    2
3    0
4    2
5    3
6    3
7    1
8    3
9    4
dtype: int8
Notice that the dtype is NumPy's int8, an 8-bit signed integer that can take on values from -128 to 127. (Only a single byte is needed to represent a value in memory. 64-bit signed ints
would be overkill in terms of memory usage.) Our rough-hewn example resulted in int64
data by default, whereas Pandas is smart enough to downcast categorical data to the smallest numerical dtype possible.
Most of the attributes for .cat
are related to viewing and manipulating the underlying categories themselves:
>>> [i for i in dir(ccolors.cat) if not i.startswith('_')]
['add_categories',
 'as_ordered',
 'as_unordered',
 'categories',
 'codes',
 'ordered',
 'remove_categories',
 'remove_unused_categories',
 'rename_categories',
 'reorder_categories',
 'set_categories']
There are a few caveats, though. Categorical data is generally less flexible. For instance, if you're inserting previously unseen values, you need to add each new value to a .categories
container first:
>>> ccolors.iloc[5] = 'a new color'
# ...
ValueError: Cannot setitem on a Categorical with a new category,
set the categories first

>>> ccolors = ccolors.cat.add_categories(['a new color'])
>>> ccolors.iloc[5] = 'a new color'  # No more ValueError
If you plan to be setting values or reshaping data rather than deriving new computations, Categorical
types may be less nimble.
6. Introspect Groupby Objects via Iteration
When you call df.groupby('x')
, the resulting Pandas groupby
objects can be a bit opaque. This object is lazily instantiated and doesn't have any meaningful representation on its own.
You can demonstrate with the abalone dataset from example 1:
>>> abalone['ring_quartile'] = pd.qcut(abalone.rings, q=4, labels=range(1, 5))
>>> grouped = abalone.groupby('ring_quartile')

>>> grouped
<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x11c1169b0>
Alright, now you have a groupby
object, but what is this thing, and how do I see it?
Before you call something like grouped.apply(func)
, you can take advantage of the fact that groupby
objects are iterable:
>>> help(grouped.__iter__)

        Groupby iterator

        Returns
        -------
        Generator yielding sequence of (name, subsetted object)
        for each group
Each "thing" yielded by grouped.__iter__()
is a tuple of (name, subsetted object)
, where name
is the value of the column on which you're grouping, and subsetted object
is a DataFrame that is a subset of the original DataFrame based on whatever grouping condition you specify. That is, the data gets chunked by group:
>>> for idx, frame in grouped:
...     print(f'Ring quartile: {idx}')
...     print('-' * 16)
...     print(frame.nlargest(3, 'weight'), end='\n\n')
...
Ring quartile: 1
----------------
     sex  length   diam  height  weight  rings ring_quartile
2619   M   0.690  0.540   0.185  1.7100      8             1
1044   M   0.690  0.525   0.175  1.7005      8             1
1026   M   0.645  0.520   0.175  1.5610      8             1

Ring quartile: 2
----------------
     sex  length  diam  height  weight  rings ring_quartile
2811   M   0.725  0.57   0.190  2.3305      9             2
1426   F   0.745  0.57   0.215  2.2500      9             2
1821   F   0.720  0.55   0.195  2.0730      9             2

Ring quartile: 3
----------------
     sex  length  diam  height  weight  rings ring_quartile
1209   F   0.780  0.63   0.215   2.657     11             3
1051   F   0.735  0.60   0.220   2.555     11             3
3715   M   0.780  0.60   0.210   2.548     11             3

Ring quartile: 4
----------------
     sex  length   diam  height  weight  rings ring_quartile
891    M   0.730  0.595    0.23  2.8255     17             4
1763   M   0.775  0.630    0.25  2.7795     12             4
165    M   0.725  0.570    0.19  2.5500     14             4
Relatedly, a groupby
object also has .groups
and a group-getter, .get_group()
:
>>> grouped.groups.keys()
dict_keys([1, 2, 3, 4])

>>> grouped.get_group(2).head()
   sex  length   diam  height  weight  rings ring_quartile
2    F   0.530  0.420   0.135  0.6770      9             2
8    M   0.475  0.370   0.125  0.5095      9             2
19   M   0.450  0.320   0.100  0.3810      9             2
23   F   0.550  0.415   0.135  0.7635      9             2
39   M   0.355  0.290   0.090  0.3275      9             2
This can help you be a little more confident that the operation you're performing is the one you want:
>>> grouped['height', 'weight'].agg(['mean', 'median'])
               height         weight
                 mean median    mean  median
ring_quartile
1              0.1066  0.105  0.4324  0.3685
2              0.1427  0.145  0.8520  0.8440
3              0.1572  0.155  1.0669  1.0645
4              0.1648  0.165  1.1149  1.0655
No matter what calculation you perform on grouped
, be it a single Pandas method or custom-built function, each of these "sub-frames" is passed one-by-one as an argument to that callable. This is where the term "split-apply-combine" comes from: break the data up by groups, perform a per-group calculation, and recombine in some aggregated fashion.
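As a quick sketch of that flow, each quartile's sub-frame gets handed to the callable in turn, and the per-group results are glued back together (the variable name here is made up):

# One aggregated value comes back per ring quartile
heaviest_per_quartile = grouped.apply(lambda frame: frame['weight'].max())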
If you're having trouble visualizing exactly what the groups will actually look like, simply iterating over them and printing a few can be tremendously useful.
7. Use This Mapping Trick for Membership Binning
Let's say that you have a Series and a corresponding "mapping table" where each value belongs to a multi-member group, or to no groups at all:
>>> countries = pd.Series([
...     'United States',
...     'Canada',
...     'Mexico',
...     'Belgium',
...     'United Kingdom',
...     'Thailand'
... ])

>>> groups = {
...     'North America': ('United States', 'Canada', 'Mexico', 'Greenland'),
...     'Europe': ('France', 'Germany', 'United Kingdom', 'Belgium')
... }
In other words, you need to map countries
to the following result:
0    North America
1    North America
2    North America
3           Europe
4           Europe
5            other
dtype: object
What you need here is a function similar to Pandas' pd.cut()
, but for binning based on categorical membership. You can use pd.Series.map()
, which you already saw in example #5, to mimic this:
from typing import Any

def membership_map(s: pd.Series, groups: dict,
                   fillvalue: Any=-1) -> pd.Series:
    # Reverse & expand the dictionary key-value pairs
    groups = {x: k for k, v in groups.items() for x in v}
    return s.map(groups).fillna(fillvalue)
This should be significantly faster than a nested Python loop through groups
for each country in countries
.
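For comparison, a loop-based equivalent might look something like this (just a sketch; the helper name is hypothetical):

def membership_map_loop(s: pd.Series, groups: dict,
                        fillvalue: Any=-1) -> pd.Series:
    # For each value, scan every group until a match turns up
    def find_group(value):
        for group, members in groups.items():
            if value in members:
                return group
        return fillvalue
    return s.apply(find_group)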
Here's a test drive:
>>> membership_map(countries, groups, fillvalue='other')
0    North America
1    North America
2    North America
3           Europe
4           Europe
5            other
dtype: object
Let's break down what's going on here. (Sidenote: this is a great place to step into a function's scope with Python's debugger, pdb
, to inspect what variables are local to the function.)
The objective is to map each group in groups
to an integer. However, Series.map()
will not recognize 'ab'
: it needs the broken-out version with each character from each group mapped to an integer. This is what the dictionary comprehension is doing:
>>> groups = dict(enumerate(('ab', 'cd', 'xyz')))
>>> {x: k for k, v in groups.items() for x in v}
{'a': 0, 'b': 0, 'c': 1, 'd': 1, 'x': 2, 'y': 2, 'z': 2}
This dictionary can be passed to s.map()
to map or "translate" its values to their corresponding group indices.
8. Understand How Pandas Uses Boolean Operators
You may be familiar with Python's operator precedence, where and, not, and or have lower precedence than comparison operators such as <, <=, >, >=, !=, and ==. Consider the two statements below, where < and > have higher precedence than the and operator:
>>> # Evaluates to "False and True"
>>> 4 < 3 and 5 > 4
False

>>> # Evaluates to 4 < 5 > 4
>>> 4 < (3 and 5) > 4
True
Note: It's not specifically Pandas-related, but 3 and 5
evaluates to 5
because of short-circuit evaluation:
"The return value of a short-circuit operator is the last evaluated argument." (Source)
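For example:

>>> 3 and 5  # Both operands are truthy, so the last one evaluated is returned
5
>>> 0 and 5  # 0 is falsy, so evaluation stops and 0 is returned
0
>>> 0 or 5   # With `or`, the first truthy operand wins
5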
Pandas (and NumPy, on which Pandas is built) does not use and
, or
, or not
. Instead, it uses &
, |
, and ~
, respectively, which are normal, bona fide Python bitwise operators.
These operators are not "invented" by Pandas. Rather, &
, |
, and ~
are valid Python built-in operators that have higher (rather than lower) precedence than arithmetic operators. (Pandas overrides dunder methods like .__ror__()
that map to the |
operator.) To sacrifice some detail, you can think of "bitwise" as "elementwise" as it relates to Pandas and NumPy:
>>> pd.Series([True, True, False]) & pd.Series([True, False, False])
0     True
1    False
2    False
dtype: bool
It pays to understand this concept in full. Let's say that you have a range-like Series:
>>> s = pd.Series(range(10))
I would guess that you may have seen this exception raised at some point:
>>> s % 2 == 0 & s > 3
ValueError: The truth value of a Series is ambiguous.
Use a.empty, a.bool(), a.item(), a.any() or a.all().
What's happening here? It's helpful to incrementally bind the expression with parentheses, spelling out how Python expands this expression step by step:
s % 2 == 0 & s > 3                      # Same as above, original expression
(s % 2) == 0 & s > 3                    # Modulo is most tightly binding here
(s % 2) == (0 & s) > 3                  # Bitwise-and is second-most-binding
(s % 2) == (0 & s) and (0 & s) > 3      # Expand the statement
((s % 2) == (0 & s)) and ((0 & s) > 3)  # The `and` operator is least-binding
The expression s % 2 == 0 & s > 3
is equivalent to (or gets treated as) ((s % 2) == (0 & s)) and ((0 & s) > 3)
. This expansion comes from Python's chained comparisons, where x < y <= z is equivalent to x < y and y <= z.
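In plain Python, for example:

>>> x = 4
>>> 3 < x <= 5            # Chained comparison
True
>>> (3 < x) and (x <= 5)  # What it expands to
True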
Okay, now stop there, and let's bring this back to Pandas-speak. You have two Pandas Series that we'll call left
and right
:
>>> left = (s % 2) == (0 & s)
>>> right = (0 & s) > 3
>>> left and right  # This will raise the same ValueError
You know that a statement of the form left and right
is truth-value testing both left
and right
, as in the following:
>>> bool(left) and bool(right)
The problem is that Pandas developers intentionally don't establish a truth-value (truthiness) for an entire Series. Is a Series True or False? Who knows? The result is ambiguous:
>>> bool(s)
ValueError: The truth value of a Series is ambiguous.
Use a.empty, a.bool(), a.item(), a.any() or a.all().
The only comparison that makes sense is an elementwise comparison. That's why, when you combine comparisons with the bitwise operators, you'll need parentheses:
>>> (s % 2 == 0) & (s > 3)
0    False
1    False
2    False
3    False
4     True
5    False
6     True
7    False
8     True
9    False
dtype: bool
In short, if you see the ValueError
above pop up with boolean indexing, the first thing you should probably look to do is sprinkle in some needed parentheses.
9. Load Data From the Clipboard
It's a common situation to need to transfer data from a place like Excel or Sublime Text to a Pandas data structure. Ideally, you want to do this without going through the intermediate step of saving the data to a file and afterwards reading in the file to Pandas.
You can load in DataFrames from your computer's clipboard data buffer with pd.read_clipboard()
. Its keyword arguments are passed on to pd.read_table()
.
This allows you to copy structured text directly to a DataFrame or Series. Say the data is sitting in a spreadsheet such as Excel; its plain-text representation (for example, in a text editor) would look like this:
a b c d
0 1 inf 1/1/00
2 7.389056099 N/A 5-Jan-13
4 54.59815003 nan 7/24/18
6 403.4287935 None NaT
Simply highlight and copy the plain text above, and call pd.read_clipboard()
:
>>> df = pd.read_clipboard(na_values=[None], parse_dates=['d'])
>>> df
   a         b    c          d
0  0    1.0000  inf 2000-01-01
1  2    7.3891  NaN 2013-01-05
2  4   54.5982  NaN 2018-07-24
3  6  403.4288  NaN        NaT

>>> df.dtypes
a             int64
b           float64
c           float64
d    datetime64[ns]
dtype: object
10. Write Pandas Objects Directly to Compressed Format

This one's short and sweet to round out the list. As of Pandas version 0.21.0, you can write Pandas objects directly to gzip, bz2, zip, or xz compression, rather than stashing the uncompressed file in memory and converting it. Here's an example using the abalone
data from trick #1:
abalone.to_json('df.json.gz', orient='records',
                lines=True, compression='gzip')
In this case, the size difference is 11.6x:
>>> import os.path
>>> abalone.to_json('df.json', orient='records', lines=True)
>>> os.path.getsize('df.json') / os.path.getsize('df.json.gz')
11.603035760226396
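To read the compressed file back in, you can pass the same compression argument to pd.read_json() (the roundtrip name below is just for illustration):

>>> roundtrip = pd.read_json('df.json.gz', orient='records',
...                          lines=True, compression='gzip')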
Want to Add to This List? Let Us Know
Hopefully, you were able to pick up a couple of useful tricks from this list to lend your Pandas code better readability, versatility, and performance.
If you have something up your sleeve that's not covered here, please leave a suggestion in the comments or as a GitHub Gist. We will gladly add to this list and give credit where it's due.