
Mike Driscoll: PyDev of the Week: Aisha Bello


This week we welcome Aisha Bello (@AishaXBello) as our PyDev of the Week! Aisha is the founder of PyLadies Nigeria and is passionate about STEM in developing countries. She is also an organizer for DjangoGirls in Africa. Aisha has gone around the world speaking about Python at EuroPython, DjangoCon, Python Brasil and the PyData conferences. Let's take a few moments to get to know her better!

Can you tell us a little about yourself (hobbies, education, etc):

Currently I work as a Virtual Systems Engineer in the Data Center and Virtualisation practice for Cisco, Nigeria. I completed a Masters in Information Technology from Cardiff Metropolitan, where I worked on a Data Science project for the hospitality industry. I am very passionate about women empowerment and tech education in developing countries.

When I am not working you would catch me watching a movie, going for a gym class or exploring new places.

Why did you start using Python?

I started using Python when I attended a DjangoGirls event at EuroPython back in 2015. That was my very first encounter with Python, and I remember asking myself in the midst of the workshop, "Where have you been all my life?", the 'you' being 'Python'. This is coming from a girl who gave up on programming as a whole because she couldn't quite grasp the concepts of the languages she had learned in school. Back then I only wanted to learn enough Java or C to finish a school project or just pass the exam. I was totally convinced that programming was for the first-class and genius students, definitely not me. Now when I look back, it wasn't the concepts that were too complex to understand; the way the other languages interpreted those concepts was complicated, at least for me.

How did you get started with DjangoGirls and PyLadies?

I attended my first DjangoGirls event in Bilbao and stayed for the entire conference. I remember going back home so fired up to bring such an amazing opportunity to women in Nigeria, to show them that they too could code; that being a makeup artist or fashion designer wasn't all that we could be good for, like some of us are conditioned to believe. From there on the Python Nigeria community was born. After the DjangoGirls event I was looking for a way to help the ladies that attended continue their learning, and PyLadies Nigeria was born.

What other programming languages do you know and which is your favorite?

While in school I played around "textbook style" with C, C++ and Java, and I have done a bit of JavaScript too in the past. When I had just finished my undergrad I started a makeup and hair business and built a website for it using good ol' HTML & CSS. I would say Python is still my favorite. It was so easy and much less complicated to start out building things with Python, but most importantly, echoing the words of a wise man, Brett Cannon: I came for the language but stayed for the community. I can't even begin to explain how powerful that is.

What projects are you working on now?

One of the projects I am most proud of is the Python community in Nigeria, how and where we started and how far we have come. We went from one DjangoGirls workshop in 2016 to a community of 1200 people, with over 50 events happening all over the country in the form of DjangoGirls, PyLadies, PyData and now Python Nigeria meetups. I am most excited about the Python conference coming up in September, where over 250 Pythonistas will be under the same roof sharing and talking about Python in different domains.

Funny thing is, it's been almost a year since I bought a personal domain name where I was going to build my site from scratch and also start writing again, breaking down networking concepts for developers, inspired by my PyData talk last year. I guess we'll see 🙂

Which Python framework or libraries are your favorite (core or 3rd party)?

Not sure I would use the word favorite. I started Python by learning Django and went on to do my first data science ML project with scikit-learn. And now that I deal with a lot of APIs, I can't do without the Requests library.

Is there anything else you’d like to say?

If I could say one thing to my younger self it would be “Never underestimate yourself. You are smart enough, good enough, You are enough. Keep pushing, keep climbing and never give up”

Thanks for doing the interview, Aisha!


Tryton News: Newsletter September 2018



This is the last month before the long-term release 5.0, and many ongoing developments have finally landed in Tryton.

Changes for the user

Style default button

The default buttons in the desktop client now use the suggested-action style.
Tryton login window

Use of message dialog

All the message dialogs of the desktop client now use the GTK MessageDialog. This unifies the behavior with other GTK applications.

Better dialog size

Dialogs no longer use the same size as their parent window, which confused some users. Instead, we let GTK compute the natural size of the dialog depending on its content. This requires deactivating the scrolled window, so wizards that display a full form should be run in a tab instead of a dialog.

New icons

We replaced the Tango icons, many of which were missing for the business modules, with a subset of the Material Icons. We also took the opportunity to curate and rationalize the list of default icons.


New party name on address

Sometimes it is necessary to store on a party an address that is not the real address of the party, so the party's name cannot be used as the name on the mailbox. We therefore added an optional "Party Name" on the address which, if filled in, is used to format the address instead of the name of the linked party; the latter is then used as the attention name.

Fall-back email for dunning

When an email must be sent for dunning, the party may not have an email address configured. In this case, we can now configure a fall-back address to which the email will be sent. It may be, for example, the address of a secretary who will forward the dunning to the party through the proper channel.
The type of email address can also be configured at the dunning level, so that, for instance, dunning emails are sent to the invoice email address.

Asset subscribed

The sale_subscription_asset module extends subscriptions to store which assets are rented.
The user can configure a subscription service to require an asset by defining which lots are available. Tryton also displays the lots currently available for each service.
On a subscription line for such a service, it is possible to reserve an asset by setting its lot number. The lot becomes required to run the subscription.
Tryton ensures that a lot can only be rented once at the same time.

Apply factor in search bar

Some numerical fields can have a factor applied for display. A common case is percentage fields, which are stored as numbers between 0 and 1 but displayed with a factor of 100. Now, searching on such a field is performed on the value with the factor applied. This provides a better experience for the user.

Right to left support on web client

The web client now supports right-to-left languages, like the desktop client does.

Final state for dunning

When a dunning reaches the last level of the procedure, its state is changed to final instead of disappearing into the list of done dunnings. This still allows keeping track of the dunnings that require a manual procedure.

Support chart of account evolution

We replaced the simple active/inactive checkbox by a date period on accounts and tax codes.
It is not allowed to create a move using an account at a date outside the period.
The accounting reports do not show out-of-period accounts or tax codes if the report is run for a date outside the period.
Moreover, when an account has an end date defined, we can configure a replacement account to use. In such a case, operational documents like sales or purchases will use the replacement account automatically. This avoids having to update all referential data that still refers to the old account.

Interface with Chorus Pro

A set of new modules has been added which allow sending invoices to Chorus Pro. Chorus Pro is the mandatory platform for sending electronic invoices to the French administration.
Currently the supported format is the Cross-Industry-Invoice (aka 16B-CII) from UN/CEFACT.

Removal of default accounts on journal

We found that this design was not optimal because in some cases it required creating more journals than needed. So we replaced them with write-off methods and payment methods. And for statements, the statement journal now has an account field.

Spanish tax report

Tryton now automatically generates the files for tax reporting, which can be imported directly on the tax authority's website.

Icons in input

We simplified the web design of inputs that have buttons. An input now has primary and/or secondary icons inside its border. This unclutters the interface and integrates better with the Bootstrap theme following Material Design.

Attachment drop down

We reworked the attachment action to be a drop-down instead of opening a pop-up. This makes it faster to open attachments and to add new ones. It is still possible to open the pop-up window with the management entry to change or delete them.

Changes for the developer

Remove buttons

Buttons which depend on fields to which the user has no access are now automatically removed from the view.

Support of timestamp field on client

Until now, such fields used the default field implementation and had no widget associated. But the default field has an empty string as its default value, which is not valid for a Timestamp. So now we manage Timestamp fields as DateTime fields, but with a representation that includes microseconds.

Use sqlite3.backup

Since Python 3.7, the sqlite3 module has a backup method. This allows us to remove the dependency on sqlitebck, previously used for the database cache feature when running tests.
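
For reference, a minimal sketch of that API (the file name is illustrative):

import sqlite3

# Copy an on-disk database into an in-memory one using the
# Connection.backup method available since Python 3.7.
source = sqlite3.connect('cache.sqlite')
target = sqlite3.connect(':memory:')
with target:
    source.backup(target)
source.close()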

Clean migration < 3.0

We removed the migration code for versions older than 3.0.
If there are still users of such old versions, they must upgrade as soon as possible to a version prior to 5.0 to be able to migrate later. The rule going forward will be to remove migration code for versions more than 2 major releases in the past. This forces users to migrate at least once every 10 years.

Add Timedelta to PYSON

PYSON lacked an object to represent a timedelta value. As such a value can be used in a domain on a field of the same type, it is now supported.

Remove unique constraint on attachment

Since the removal of *DAV it was no longer useful to have a unique constraint on attachments, so we removed the constraint.

Use uWSGI in docker images

The images published by Tryton from series 4.6 onwards now use uWSGI as the default server. This replaces the default server from Werkzeug, which is not considered production-ready.

Use UUID on timesheet lines

In order to prevent Chronos (the web extension for timesheets) from creating duplicate lines when network response times are bad, we added a unique UUID field on the lines.

Use passlib to check password

We replaced the custom password check by passlib, a generic library for password hashing. A configuration file for passlib can be set; otherwise the default schemes are bcrypt (if installed) or pbkdf2_sha512. When the configuration changes, passwords are updated to the new scheme at the user's next log-in.
The migration from older versions is done automatically when the user logs in for the first time.
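
As an illustration, this is roughly how passlib is used for such a scheme; the configuration shown is an assumption, not necessarily the one Tryton ships with:

from passlib.context import CryptContext

# Hash with the preferred scheme; verify accepts any configured scheme.
ctx = CryptContext(schemes=['bcrypt', 'pbkdf2_sha512'], deprecated='auto')
hashed = ctx.hash('secret')
assert ctx.verify('secret', hashed)

# needs_update reports when a hash uses a deprecated scheme and
# should be re-computed, e.g. on the user's next successful log-in.
if ctx.needs_update(hashed):
    hashed = ctx.hash('secret')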

New session management

Double session timeout has been implemented. A session now expires after 30 days, but some operations like posting an invoice or approving a payment require a fresh session. A fresh session is one which has had no request interruption longer than 5 minutes since its creation.
When a user changes his password, all his active sessions are invalidated. This prevents an attacker who has stolen the password from keeping a session active after the password change.

Fully configure Tryton with environment variables

In order to simplify the configuration of the Docker image, Tryton will parse environment variables that follow the syntax TRYTOND_<SECTION>__<NAME>. Such values are set before the configuration files are loaded.
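
For example, assuming the standard database section with its uri option, a value could be provided like this before starting the server (the URI itself is illustrative):

$ export TRYTOND_DATABASE__URI=postgresql://tryton@localhost:5432/
$ trytond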

Load custom CSS and Javascript

The web client will try to load custom.css and custom.js by default. This allows customizing it by simply serving those files.

Real-time notification

We added a bus to Tryton. It allows the server to send messages to the client, using long polling as the push mechanism.
The first usage is the ability to send notifications, which are short messages with a priority. The web client displays them using Web Notifications and the desktop client using GNotification, which unfortunately is not yet implemented on Windows or macOS.

Improved ModelStorage.copy

The copy method has been extended to allow more flexibility over the result of the copy.
The default dictionary accepts a callable as a value. It will be called for each copied record with a dictionary of the copied values, and it must return the new value for the new record.
The default dictionary also supports dotted notation for Many2One keys. In such a case, the value is used as the default dictionary for copying the pointed record.
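
As a sketch of what the callable form enables (the model and field names here are hypothetical):

# Compute a new value for the hypothetical 'code' field from each
# record's copied values; Tryton calls this once per copied record.
def default_code(values):
    return values['code'] + '-copy'

new_records = MyModel.copy(records, default={'code': default_code})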


Codementor: We don't need a ternary operator

Why the ternary operator is messy and how we could get rid of it.

Julien Danjou: High-Performance in Python with Zero-Copy and the Buffer Protocol


Whatever your programs are doing, they often have to deal with vast amounts of data. This data is usually represented and manipulated in the form of strings. However, handling such a large quantity of input in strings can be very inefficient once you start manipulating the data by copying, slicing, and modifying it. Why?

Let's consider a small program which reads a large file of binary data and copies it partially into another file. To examine the memory usage of this program, we will use memory_profiler (https://pypi.python.org/pypi/memory_profiler), an excellent Python package that allows us to see the memory usage of a program line by line.

@profile
def read_random():
    with open("/dev/urandom", "rb") as source:
        content = source.read(1024 * 10000)
        content_to_write = content[1024:]
    print("Content length: %d, content to write length %d" %
          (len(content), len(content_to_write)))
    with open("/dev/null", "wb") as target:
        target.write(content_to_write)

if __name__ == '__main__':
    read_random()

Running the above program using memory_profiler produces the following:

$ python -m memory_profiler memoryview/copy.py
Content length: 10240000, content to write length 10238976
Filename: memoryview/copy.py

Mem usage    Increment   Line Contents
======================================
                         @profile
 9.883 MB     0.000 MB   def read_random():
 9.887 MB     0.004 MB       with open("/dev/urandom", "rb") as source:
19.656 MB     9.770 MB           content = source.read(1024 * 10000)
29.422 MB     9.766 MB           content_to_write = content[1024:]
29.422 MB     0.000 MB       print("Content length: %d, content to write length %d" %
29.434 MB     0.012 MB             (len(content), len(content_to_write)))
29.434 MB     0.000 MB       with open("/dev/null", "wb") as target:
29.434 MB     0.000 MB           target.write(content_to_write)

The call to source.read reads 10 MB from /dev/urandom. Python needs to allocate around 10 MB of memory to store this data as a string. The instruction on the line just after, content[1024:], copies the entire block of data minus the first KB — allocating 10 more megabytes.

What's interesting here is to notice that the memory usage of the program increased by about 10 MB when building the variable content_to_write. The slice operator is copying the entirety of content, minus the first KB, into a new string object.

When dealing with extensive data, performing this kind of operation on large byte arrays is going to be a disaster. If you have ever written C code, you know that using memcpy() has a significant cost, both in terms of memory usage and in terms of general performance: copying memory is slow.

However, as a C programmer, you also know that strings are arrays of characters and that nothing stops you from looking at only part of this array without copying it, through the use of basic pointer arithmetic – assuming that the entire string is in a contiguous memory area.

This is possible in Python using objects which implement the buffer protocol. The buffer protocol is defined in PEP 3118 (http://www.python.org/dev/peps/pep-3118/), which explains the C API used to provide this protocol to various types, such as strings.

When an object implements this protocol, you can use the memoryview class constructor on it to build a new memoryview object that references the original object memory.

>>> s = b"abcdefgh"
>>> view = memoryview(s)
>>> view[1]
98
>>> limited = view[1:3]
>>> limited
<memory at 0x7fca18b8d460>
>>> bytes(view[1:3])
b'bc'

Note: 98 is the ASCII code for the letter b.

In the example above, we use the fact that the memoryview object's slice operator itself returns a memoryview object. That means it does not copy any data but merely references a particular slice of it.

The graph below illustrates what happens:

[diagram: a memoryview slice referencing the original object's memory instead of copying it]

Therefore, it is possible to rewrite the program above in a more efficient manner. We need to reference the data that we want to write using a memoryview object, rather than allocating a new string.

@profile
def read_random():
    with open("/dev/urandom", "rb") as source:
        content = source.read(1024 * 10000)
        content_to_write = memoryview(content)[1024:]
    print("Content length: %d, content to write length %d" %
          (len(content), len(content_to_write)))
    with open("/dev/null", "wb") as target:
        target.write(content_to_write)

if __name__ == '__main__':
    read_random()

Let's run the program above with the memory profiler:

$ python -m memory_profiler memoryview/copy-memoryview.py
Content length: 10240000, content to write length 10238976
Filename: memoryview/copy-memoryview.py

Mem usage    Increment   Line Contents
======================================
                         @profile
 9.887 MB     0.000 MB   def read_random():
 9.891 MB     0.004 MB       with open("/dev/urandom", "rb") as source:
19.660 MB     9.770 MB           content = source.read(1024 * 10000)
19.660 MB     0.000 MB           content_to_write = memoryview(content)[1024:]
19.660 MB     0.000 MB       print("Content length: %d, content to write length %d" %
19.672 MB     0.012 MB             (len(content), len(content_to_write)))
19.672 MB     0.000 MB       with open("/dev/null", "wb") as target:
19.672 MB     0.000 MB           target.write(content_to_write)

In that case, the source.read call still allocates 10 MB of memory to read the content of the file. However, when using memoryview to refer to the offset content, no more memory is allocated.

This version of the program ends up allocating 50% less memory than the original version!

This kind of trick is especially useful when dealing with sockets. When sending data over a socket, all the data might not be sent in a single call.

import socket
s = socket.socket(…)
s.connect(…)
# Build a bytes object with more than 100 millions times the letter `a`
data = b"a" * (1024 * 100000)
while data:
    sent = s.send(data)
    # Remove the first `sent` bytes sent
    data = data[sent:]

Using a mechanism as implemented above, the program copies the data over and over until the socket has sent everything. By using memoryview, it is possible to achieve the same functionality with zero-copy, and therefore higher performance:

import socket
s = socket.socket(…)
s.connect(…)
# Build a bytes object with more than 100 millions times the letter `a`
data = b"a" * (1024 * 100000)
mv = memoryview(data)
while mv:
    sent = s.send(mv)
    # Build a new memoryview object pointing to the data which remains to be sent
    mv = mv[sent:]

As this won't copy anything, it won't use any more memory than the 100 MB
initially needed for the data variable.

So far we've used memoryview objects to write data efficiently, but the same method can also be used to read data. Most I/O operations in Python know how to deal with objects implementing the buffer protocol. They can read from it, but also write to it. In this case, we don't need memoryview objects – we can ask an I/O function to write into our pre-allocated object:

>>> ba = bytearray(8)
>>> ba
bytearray(b'\x00\x00\x00\x00\x00\x00\x00\x00')
>>> with open("/dev/urandom", "rb") as source:
...     source.readinto(ba)
... 
8
>>> ba
bytearray(b'`m.z\x8d\x0fp\xa1')

With such techniques, it's easy to pre-allocate a buffer (as you would do in C to mitigate the number of calls to malloc()) and fill it at your convenience.

Using memoryview, you can even place data at any point in the memory area:

>>> ba = bytearray(8)
>>> # Reference the _bytearray_ from offset 4 to its end
>>> ba_at_4 = memoryview(ba)[4:]
>>> with open("/dev/urandom", "rb") as source:
... # Write the content of /dev/urandom from offset 4 to the end of the
... # bytearray, effectively reading 4 bytes only
...     source.readinto(ba_at_4)
... 
4
>>> ba
bytearray(b'\x00\x00\x00\x00\x0b\x19\xae\xb2')

The buffer protocol is fundamental to achieving low memory overhead and great performance. As Python hides all the memory allocations, developers tend to forget what happens under the hood, at a high cost for the speed of their programs!

It's also good to know that both the objects in the array module and the functions in the struct module handle the buffer protocol correctly, and can therefore perform efficiently when targeting zero-copy.
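
For instance, struct.pack_into serializes values directly into a pre-allocated bytearray, without creating an intermediate bytes object:

import struct

buf = bytearray(8)

# Write two little-endian unsigned 32-bit integers at offset 0,
# directly into the pre-allocated buffer.
struct.pack_into("<II", buf, 0, 1, 2)
print(buf)  # bytearray(b'\x01\x00\x00\x00\x02\x00\x00\x00')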

Zato Blog: Connecting Zato clusters with WebSockets and publish/subscribe queues


Since version 3.0, it is possible to directly connect Zato clusters and exchange messages as though remote services were running in a local instance. This makes it an ideal choice for environments split into multiple parts.

Introduction

The reasons to have more than one cluster, each with one or more servers, may vary:

  • For HA and performance, environments may be broken out geographically into a setup with one cluster per continent or a region of the world
  • CPU-extensive operations may be carried out in one cluster with another making use of the results the former produces to offer a set of APIs
  • For legal reasons, it may not be allowed to run all integration services in one cluster, using the same hardware and software infrastructure

The new feature in Zato 3.0 which allows for efficient communication between clusters is WebSocket connections - one of the clusters creates a channel through which other clusters may invoke its services via their outgoing connections.

WebSockets (WSX for short) have essentially no overhead in practice, yet they can be used for bi-directional communication, hence they are a great choice for such scenarios.

From a Zato programmer's perspective, all the communication details are hidden and a couple of lines of code suffice to invoke services or receive messages from remote clusters, for instance:

# Obtain a handle to a remote connection
with self.out.wsx.get('My Connection').conn.client() as client:

    # Invoke a remote service - expects a Python dict on input
    # and returns a Python dict on response. All the serialization
    # and network connectivity is handled automatically.
    response = client.invoke(msg)

Architecture and configuration

  • Each cluster which is to become a recipient of messages from other clusters needs to have a new WebSocket channel created with service helpers.web-sockets-gateway mounted on it. A security definition should also be attached as required.

  • Each cluster that should invoke another one needs to have an outgoing WebSocket connection created - make sure the Is remote end Zato checkbox is on and that credentials are provided, if required by the other side.

  • If the cluster with an outgoing connection is interested in receiving publish/subscribe messages, all topics it wants to subscribe to should be listed, one per line. Make sure the cluster with a channel has a correct pub/sub endpoint configured for that channel.

  • The cluster which establishes the connection (here, cluster1) may also want to subscribe to events of interest via hooks services - more about it below.

  • Once an outgoing connection is created, internal tasks will start on cluster1 to establish a remote connection to server2. If successful, authentication will take place automatically. Finally, if configured, a hook service will fire to let cluster1 know that a new connection was established. Afterwards, cluster1 may start to invoke remote services.

  • There are no other steps involved, at this point everything is configured and ready to be used.



From a programmer's perspective

  • To invoke remote Zato services, programmers use WebSocket outgoing connection methods - providing a dictionary of input data to the invocation and receiving a dictionary of data on output. Note that the invocation is synchronous; your service is blocked until the remote cluster responds.
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function, unicode_literals

from zato.server.service import Service

class MyService(Service):
    def handle(self):

        # Message to send - needs to be a dictionary with the name
        # of the service to invoke as well as its input data, if any is required.
        # In this case, we are invoking an echo service
        # which writes back to output anything it receives on input.
        msg = {
            'service': 'zato.helpers.echo',
            'request': {
                'elem1': 'value1',
                'elem2': 'value2',
            }
        }

        # Name of the connection to send messages through
        conn_name = 'My WSX Outconn'

        # Obtain a client from the connection pool
        with self.out.wsx.get(conn_name).conn.client() as client:

            # Send the message and read its response
            response = client.send(msg)

            # Or, client.invoke can be used with Zato WebSocket connections,
            # this method is an alias to client.send
            response = client.invoke(msg)

            # Log the response received
            self.logger.info('Response is `%s`', response.data)

INFO - Response is `{u'elem2': u'value2', u'elem1': u'value1'}`
  • To receive messages, hook services are used. There are three events for which hooks can be triggered - they can be handled by different services or the same one, it is up to users:
  1. Upon connecting to a remote cluster, including reconnects (on_connect)
  2. Upon receiving messages from remote clusters (on_message)
  3. Once a connection to the remote cluster is shut down (on_close)
  • The on_message hook can be combined with publish/subscribe topics and queues - each time the remote cluster (the one with a WSX channel) publishes a message that the local cluster (the one with a WSX outgoing connection) is interested in, the on_message hook will be called to handle it, in this manner making it possible for remote clusters to deliver messages to clusters subscribing to topics.

  • Each hook is just a Zato service with a specific SimpleIO signature, as in the on_message example below:

# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function, unicode_literals

from zato.server.service import Opaque, Service

class OnMessageHook(Service):
    class SimpleIO:
        input_optional = (Opaque('ctx'),)

    def handle(self):

        # Object describing incoming data
        ctx = self.request.input.ctx

        # Message type
        msg_type = ctx.type

        # Data received
        data = ctx.data

        # Log message type
        self.logger.info('Msg type: `%s`', msg_type)

        # Log actual data
        self.logger.info('Data received: `%s`', data.data)

        # Log metadata - ID and timestamp
        self.logger.info('Meta: `%s` `%s`', data.id, data.timestamp)

Now, we can use web-admin to publish a test message and confirm that the on_message service receives it:


In the on_message service's server logs:

INFO - Msg type: `message`
INFO - Data received: `[
  {u'delivery_count': 0,
   u'msg_id': u'zpsme26726911ffbe8cba2cca278',
   u'expiration_time_iso': u'2086-09-21T14:03:05.285470',
   u'topic_name': u'/customer/new',
   u'pub_time_iso': u'2018-09-03T10:48:58.285470',
   u'priority': 5,
   u'expiration': 2147483647000,
   u'has_gd': True,
   u'data': u'This is a sample message',
   u'sub_key': u'zpsk.websockets.6ef529f7cab64a71d8bd2878',
   u'mime_type': u'text/plain',
   u'size': 24}
  ]`
INFO - `6fd296ecf78493a3a0ce7570` `2018-09-03T10:49:00.540024`


Stack Abuse: Beginner's Tutorial on the Pandas Python Library


Pandas is an open source Python package that provides numerous tools for data analysis. The package comes with several data structures that can be used for many different data manipulation tasks. It also has a variety of methods that can be invoked for data analysis, which comes in handy when working on data science and machine learning problems in Python.

Advantages of Using Pandas

The following are some of the advantages of the Pandas library:

  1. It can present data in a way that is suitable for data analysis via its Series and DataFrame data structures.
  2. The package contains multiple methods for convenient data filtering.
  3. Pandas has a variety of utilities to perform Input/Output operations in a seamless manner. It can read data from a variety of formats such as CSV, TSV, MS Excel, etc.

Installing Pandas

The standard Python distribution does not come with the Pandas module. To use this 3rd party module, you must install it.

The nice thing about Python is that it comes bundled with a tool called pip that can be used for the installation of Pandas. To do the installation, you need to run the following command:

$ pip install pandas

If you have installed Anaconda on your system, just run the following command to install Pandas:

$ conda install pandas

It is highly recommended that you install the latest version of the Pandas package. However, if you want to install an older version you can specify it by running the conda install command as follows:

$ conda install pandas=0.23.4

Pandas Data Structures

Pandas has two main data structures for data storage:

  1. Series
  2. DataFrame

Series

A series is similar to a one-dimensional array. It can store data of any type. The values of a Pandas Series are mutable, but the size of a Series is fixed and cannot be changed.

The first element in the series is assigned the index 0, while the last element is at index N-1, where N is the total number of elements in the series.

To create a Pandas Series, we must first import the Pandas package via Python's import command:

import pandas as pd  

To create the Series, we invoke the pd.Series() method and pass an array, as shown below:

series1 = pd.Series([1,2,3,4])  

Next, run the print statement to display the contents of the Series:

print(series1)  

Output:

0    1  
1    2  
2    3  
3    4  
dtype: int64  

You can see that we have two columns, the first one with numbers starting from index 0 and the second one with the elements that were added to the series.

The first column denotes the indexes for the elements.

However, in some environments you may get an error when you try to display the Series. A common cause is that the standard output stream has been redirected, so Pandas cannot query it for display information.

You can solve the error by executing the code as follows:

import pandas as pd  
import sys

sys.stdout = sys.__stdout__

series1 = pd.Series([1,2,3,4])  
print(series1)  

A Series may also be created from a numpy array. Let us create a numpy array then convert it into a Pandas Series:

import pandas as pd  
import numpy as np  
import sys

sys.stdout = sys.__stdout__

fruits = np.array(['apple','orange','mango','pear'])  
series2 = pd.Series(fruits)  
print(series2)  

Output:

0     apple  
1    orange  
2     mango  
3      pear  
dtype: object  

We start by importing the necessary libraries, including numpy. Next, we called the numpy's array() function to create an array of fruits. We then use Pandas Series() function and pass it the array that we want to convert into a series. Finally, we call the print() function to display the Series.

DataFrame

The Pandas DataFrame can be seen as a table. It organizes data into rows and columns, making it a two-dimensional data structure. The columns can be of different types, and the size of the DataFrame is mutable, and hence can be modified.

To create a DataFrame, you can choose to start from scratch or convert other data structures like Numpy arrays into a DataFrame. Here is how you can create a DataFrame from scratch:

import pandas as pd  
df = pd.DataFrame({  
    "Column1": [1, 4, 8, 7, 9],
    "Column2": ['a', 'column', 'with', 'a', 'string'],
    "Column3": [1.23, 23.5, 45.6, 32.1234, 89.453],
    "Column4": [True, False, True, False, True]
})
print(df)  

Output:

   Column1 Column2  Column3  Column4
0        1       a   1.2300     True  
1        4  column  23.5000    False  
2        8    with  45.6000     True  
3        7       a  32.1234    False  
4        9  string  89.4530     True  

In this example we have created a DataFrame named df. The first column of the DataFrame has integer values. The second column has a string, the third column has floating point values, while the fourth column has boolean values.

The statement print(df) will display the contents of the DataFrame to us via the console, allowing us to inspect and verify its contents.

However, when displaying the DataFrame, you may have noticed that there is an additional column at the start of the table, with its elements beginning at 0. This column is created automatically and it marks the indexes of the rows.

To create a DataFrame, we must invoke the pd.DataFrame() method as shown in the above example.

It is possible for us to create a DataFrame from a list or even a set of lists. We only have to call the pd.DataFrame() method and then pass it the list variable as its only argument.

Consider the following example:

import pandas as pd  
mylist = [4, 8, 12, 16, 20]  
df = pd.DataFrame(mylist)  
print(df)  

Output:

  0
0   4  
1   8  
2  12  
3  16  
4  20  

In this example we created a list named mylist with a sequence of 5 integers. We then called the DataFrame() method and passed the name of the list to it as the argument. This is where the conversion of the list to a DataFrame happened.

We have then printed out the contents of the DataFrame. The DataFrame has a default column showing indexes, with the first element being at index 0 and the last one at index N-1, where N is the total number of elements in the DataFrame.

Here is another example:

import pandas as pd  
items = [['Phone', 2000], ['TV', 1500], ['Radio', 800]]  
df = pd.DataFrame(items, columns=['Item', 'Price'], dtype=float)  
print(df)  

Output:

  Item   Price
0  Phone  2000.0  
1     TV  1500.0  
2  Radio   800.0  

Here we have created a list named items with a set of 3 items. For each item, we have a name and price. The list is then passed to the DataFrame() method in order to convert it into a DataFrame object.

In this example the names of the columns for the DataFrame have been specified as well. The numeric values have also been converted into floating point values since we specified the dtype argument as "float".

To get a summary of this item's data, we can call the describe() function on the DataFrame variable, that is, df:

df.describe()  

Output:

      Price
count     3.000000  
mean   1433.333333  
std     602.771377  
min     800.000000  
25%    1150.000000  
50%    1500.000000  
75%    1750.000000  
max    2000.000000  

The describe() function returns some common statistical details of the data, including the mean, standard deviation, minimum element, maximum element, and some other details. This is a great way to get a snapshot of the data you're working with if the dataset is relatively unknown to you. It could also be a good way to quickly compare two separate datasets of similar data.

Importing Data

Often times you'll need to use Pandas to analyze data that is stored in an Excel file or in a CSV file. This requires you to open and import the data from such sources into Pandas.

Luckily, Pandas provides us with numerous methods that we can use to load the data from such sources into a Pandas DataFrame.

Importing CSV Data

A CSV file, which stands for comma-separated values, is simply a text file with values separated by commas (,). Since this is a very well-known and often-used standard, we can use Pandas to read CSV files either in whole or in part.

For this example we will create a CSV file named cars.csv. The file should have the following data:

Number,Type,Capacity  
SSD,Premio,1800  
KCN,Fielder,1500  
USG,Benz,2200  
TCH,BMW,2000  
KBQ,Range,3500  
TBD,Premio,1800  
KCP,Benz,2200  
USD,Fielder,1500  
UGB,BMW,2000  
TBG,Range,3200  

You can copy the data and paste in a text editor like Notepad, and then save it with the name cars.csv in the same directory as your Python scripts.

Pandas provides us with a method named read_csv that can be used for reading CSV values into a Pandas DataFrame. The method takes the path to the CSV file as the argument.

The following code is what we'll use to help us read the cars.csv file:

import pandas as pd  
data = pd.read_csv('cars.csv')  
print(data)  

Output:

 Number     Type  Capacity
0    SSD   Premio      1800  
1    KCN  Fielder      1500  
2    USG     Benz      2200  
3    TCH      BMW      2000  
4    KBQ    Range      3500  
5    TBD   Premio      1800  
6    KCP     Benz      2200  
7    USD  Fielder      1500  
8    UGB      BMW      2000  
9    TBG    Range      3200  

In my case, I saved the CSV file in the same directory as the Python script, hence I simply passed the name of the file to the read_csv method and it knew to check the current working directory.

If you have saved your file in a different path, ensure you pass the correct path as the argument to the method. This can either be a relative path, like "../cars.csv", or an absolute path like "/Users/nicholas/data/cars.csv".

In some cases, you may have thousands of rows in your dataset. In such a case, it would be more helpful to you to print only the first few rows on the console rather than printing all the rows.

This can be done by calling the head() method on the DataFrame as shown below:

data.head()  

For our data above, the above command returns only the first 5 rows of the dataset, allowing you to inspect a small sample of the data. This is shown below:

Output:

  Number     Type  Capacity
0    SSD   Premio      1800  
1    KCN  Fielder      1500  
2    USG     Benz      2200  
3    TCH      BMW      2000  
4    KBQ    Range      3500  

The loc() method is a nice utility that helps us read only certain rows of a specific column in the dataset, as demonstrated in the following example:

import pandas as pd  
data = pd.read_csv('cars.csv')

print (data.loc[[0, 4, 7], ['Type']])  

Output:

 Type
0   Premio  
4    Range  
7  Fielder  

Here we used the loc() method to only read the elements at indexes 0, 4, and 7 of the Type column.

At times we may need to read only certain columns and not others. This can be done using the loc() method as well, as shown below in this example:

import pandas as pd  
data = pd.read_csv('cars.csv')

print (data.loc[:, ['Type', 'Capacity']])  

Output:

Type  Capacity  
0   Premio      1800  
1  Fielder      1500  
2     Benz      2200  
3      BMW      2000  
4    Range      3500  
5   Premio      1800  
6     Benz      2200  
7  Fielder      1500  
8      BMW      2000  
9    Range      3200  

Here we used the loc() method to read all rows (the : part) of only two of our columns from the dataset, that is, the Type and Capacity columns, as specified in the argument.

Importing Excel Data

In addition to the read_csv method, Pandas also has the read_excel function that can be used for reading Excel data into a Pandas DataFrame. In this example, we will use an Excel file named workers.xlsx with details of workers in a company.

The following code can be used to load the contents of the Excel file into a Pandas DataFrame:

import pandas as pd  
data = pd.read_excel('workers.xlsx')  
print (data)  

Output:

  ID    Name      Dept  Salary
0   1    John       ICT    3000  
1   2    Kate   Finance    2500  
2   3  Joseph        HR    3500  
3   4  George       ICT    2500  
4   5    Lucy     Legal    3200  
5   6   David   Library    2000  
6   7   James        HR    2000  
7   8   Alice  Security    1500  
8   9   Bosco   Kitchen    1000  
9  10    Mike       ICT    3300  

After calling the read_excel function we then passed the name of the file as the argument, which read_excel used to open/load the file and then parse the data. The print() function then helps us display the contents of the DataFrame, as we've done in past examples.

And just like with our CSV example, this function can be combined with the loc() method to help us read specific rows and columns from the Excel file.

For example:

import pandas as pd  
data = pd.read_excel('workers.xlsx')

print (data.loc[[1,4,7],['Name','Salary']])  

Output:

Name  Salary  
1   Kate    2500  
4   Lucy    3200  
7  Alice    1500  

We have used the loc() method to retrieve the Name and Salary values of the elements at indexes 1, 4, and 7.

Pandas also allows us to read from two Excel sheets simultaneously. Suppose our previous data is in Sheet1, and we have some other data in Sheet2 of the same Excel file. The following code shows how we can read from the two sheets simultaneously:

import pandas as pd  
with pd.ExcelFile('workers.xlsx') as x:  
    s1 = pd.read_excel(x, 'Sheet1')
    s2 = pd.read_excel(x, 'Sheet2')

print("Sheet 1:")  
print (s1)  
print("")  
print("Sheet 2:")  
print (s2)  

Output:

Sheet 1:  
   ID    Name      Dept  Salary
0   1    John       ICT    3000  
1   2    Kate   Finance    2500  
2   3  Joseph        HR    3500  
3   4  George       ICT    2500  
4   5    Lucy     Legal    3200  
5   6   David   Library    2000  
6   7   James        HR    2000  
7   8   Alice  Security    1500  
8   9   Bosco   Kitchen    1000  
9  10    Mike       ICT    3300

Sheet 2:  
   ID    Name  Age  Retire
0   1    John   55    2023  
1   2    Kate   45    2033  
2   3  Joseph   55    2023  
3   4  George   35    2043  
4   5    Lucy   42    2036  
5   6   David   50    2028  
6   7   James   30    2048  
7   8   Alice   24    2054  
8   9   Bosco   33    2045  
9  10    Mike   35    2043  

What happened is that we combined the read_excel() function with the ExcelFile wrapper class. The variable x was created by calling the wrapper class inside Python's with statement, which we use to temporarily open the file.

From the ExcelFile variable x, we have created two more variables, s1 and s2 to represent the contents that were read from the different sheets.

We then used print statements to view the contents of the two sheets in the console. The blank print statement, print(""), is only used to print a blank line between our sheet data.

Data Wrangling

Data wrangling is the process of processing data to prepare it for use in the next step. Examples of data wrangling processes include merging, grouping, and concatenation. This kind of manipulation is often needed in data science to get your data in to a form that works well with whatever analysis or algorithms that you're going to put it through.

Merging

The Pandas library allows us to join DataFrame objects via the merge() function. Let us create two DataFrames and demonstrate how to merge them.

Here is the first DataFrame, df1:

import pandas as pd

d = {  
    'subject_id': ['1', '2', '3', '4', '5'],
    'student_name': ['John', 'Emily', 'Kate', 'Joseph', 'Dennis']
}
df1 = pd.DataFrame(d, columns=['subject_id', 'student_name'])  
print(df1)  

Output:

subject_id student_name  
0          1         John  
1          2        Emily  
2          3         Kate  
3          4       Joseph  
4          5       Dennis  

Here is the code to create the second DataFrame, df2:

import pandas as pd

data = {  
    'subject_id': ['4', '5', '6', '7', '8'],
    'student_name': ['Brian', 'William', 'Lilian', 'Grace', 'Caleb']
}
df2 = pd.DataFrame(data, columns=['subject_id', 'student_name'])  
print(df2)  

Output:

subject_id student_name  
0          4        Brian  
1          5      William  
2          6       Lilian  
3          7        Grace  
4          8        Caleb  

We now need to merge the two DataFrames, that is, df1 and df2 along the values of subject_id. We simply call the merge() function as shown below:

pd.merge(df1, df2, on='subject_id')  

Output:

subject_id student_name_x student_name_y  
0          4         Joseph          Brian  
1          5         Dennis        William  

Merging returns the rows from both DataFrames that share the same value in the column you are merging on.

There are many other ways to use the pd.merge function that we won't be covering in this article, such as what data should be merged, how it should be merged, if it should be sorted, etc. For more information, check out the official documentation on the merge function.
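
As one quick illustration, the how argument controls which rows are kept; an outer merge keeps the rows of both DataFrames and fills the gaps with NaN:

pd.merge(df1, df2, on='subject_id', how='outer')

Output:

  subject_id student_name_x student_name_y
0          1           John            NaN
1          2          Emily            NaN
2          3           Kate            NaN
3          4         Joseph          Brian
4          5         Dennis        William
5          6            NaN         Lilian
6          7            NaN          Grace
7          8            NaN          Caleb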

Grouping

Grouping is the process of putting data into various categories. Here is a simple example:

# import pandas library
import pandas as pd

raw = {  
    'Name': ['John', 'John', 'Grace', 'Grace', 'Benjamin', 'Benjamin', 'Benjamin',
        'Benjamin', 'John', 'Alex', 'Alex', 'Alex'],
    'Position': [2, 1, 1, 4, 2, 4, 3, 1, 3, 2, 4, 3],
    'Year': [2009, 2010, 2009, 2010, 2010, 2010, 2011, 2012, 2011, 2013, 2013, 2012],
    'Marks':[408, 398, 422, 376, 401, 380, 396, 388, 356, 402, 368, 378]
}
df = pd.DataFrame(raw)

group = df.groupby('Year')  
print(group.get_group(2010))  

Output:

   Marks      Name  Position  Year
1    398      John         1  2010  
3    376     Grace         4  2010  
5    380  Benjamin         4  2010  

In this simple example, we have grouped the data by year, which in this case was 2010. We could have also grouped by any of the other columns, like "Name", "Position", etc.
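
Instead of retrieving a single group, we can also aggregate across all groups at once. For example, here is one way to compute the mean marks per player over all years:

print(df.groupby('Name')['Marks'].mean())

Output:

Name
Alex        382.666667
Benjamin    391.250000
Grace       399.000000
John        387.333333
Name: Marks, dtype: float64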

Concatenation

Concatenation of data, which basically means to add one set of data to another, can be done by calling the concat() function.

Let us demonstrate how to concatenate DataFrames using our two previous Dataframes, that is, df1 and df2, each with two columns, "subject_id" and "student_name":

print(pd.concat([df1, df2]))  

Output:

subject_id student_name  
0          1         John  
1          2        Emily  
2          3         Kate  
3          4       Joseph  
4          5       Dennis  
0          4        Brian  
1          5      William  
2          6       Lilian  
3          7        Grace  
4          8        Caleb  

Descriptive Statistics

As I briefly showed earlier, when we use the describe() function we get the descriptive statistics for numerical columns, but the character columns are excluded.

Let's first create a DataFrame showing student names and their scores in Math and English:

import pandas as pd

data = {  
    'Name': ['John', 'Alice', 'Joseph', 'Alex'],
    'English': [64, 78, 68, 58],
    'Maths': [76, 54, 72, 64]
}

df = pd.DataFrame(data)  
print(df)  

Output:

 English  Maths    Name
0       64     76    John  
1       78     54   Alice  
2       68     72  Joseph  
3       58     64    Alex  

We only have to call the describe() function on the DataFrame and get the various measures like the mean, standard deviation, median, maximum element, minimum element, etc:

df.describe()  

Output:

   English      Maths
count   4.000000   4.000000  
mean   67.000000  66.500000  
std     8.406347   9.712535  
min    58.000000  54.000000  
25%    62.500000  61.500000  
50%    66.000000  68.000000  
75%    70.500000  73.000000  
max    78.000000  76.000000  

As you can see, the describe() method completely ignored the "Name" column since it is not numerical, which is what we want. This simplifies things for the caller since you don't need to worry about removing non-numerical columns before calculating the numerical stats you want.

Conclusion

Pandas is an extremely useful Python library, particularly for data science. Various Pandas functionalities make data preprocessing extremely simple. This article provided a brief introduction to the main functionalities of the library, with working examples of all its major utilities. To get the most out of Pandas, I would suggest you practice the examples in this article and also test the library with your own datasets. Happy Coding!

Ned Batchelder: Coverage.py 5.0a2: SQLite storage


The next alpha of Coverage.py 5.0 is ready: 5.0a2. The big change is that instead of using a JSON-like file for storing the collected data, we now use a SQLite database. This is in preparation for new features down the road.

In theory, everything works as it did before. I need you to tell me whether that’s true or not. Please test this alpha release. Let me know what you find.

If you try it, and it works, let me know! Email is good.

If you see a problem, do this:

  • First create a reproducible scenario, something that I can recreate.
  • Try running that scenario with the environment variable COVERAGE_STORAGE=json defined, which will use the old JSON storage format (see the sample invocation after this list). It will be very helpful to know if the results change.
  • Write up the issue on GitHub. Please provide as many details as you can.
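
For the second step, a one-off run with the variable set could look like this (the test command is illustrative):

$ COVERAGE_STORAGE=json coverage run -m pytest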

The biggest change in behavior is that the data file is now created earlier than before. If you are running tests simultaneously, this might mean that you need parallel=true where you didn’t before. Keep an eye out for that.

Some other notes about these changes:

  • For now, the old JSON storage code is still in place. It can be enabled with a COVERAGE_STORAGE=json environment variable.
  • But I would rather not keep that code around forever. One of the things I’m trying to find out with this alpha is whether there’s any reason I will need to keep it around.
  • The database schema is still in flux, and will need to change to support future features. I’m not sure whether to make the schema part of the public interface to coverage.py or not. I want people to be able to experiment with the collected data, but I also want to be able to change it in the future if I need to.

Please test the code, and let me know what you find.

Real Python: Structuring Python Programs


You have now covered Python variables, operators, and data types in depth, and you’ve seen quite a bit of example code. Up to now, the code has consisted of short individual statements, simply assigning objects to variables or displaying values.

But you want to do more than just define data and display it! Let’s start arranging code into more complex groupings.

Here’s what you’ll learn in this tutorial: You’ll dig deeper into Python lexical structure. You’ll learn about the syntactic elements that comprise statements, the basic units that make up a Python program. This will prepare you for the next few tutorials covering control structures, constructs that direct program flow among different groups of code.

Python Statements

Statements are the basic units of instruction that the Python interpreter parses and processes. In general, the interpreter executes statements sequentially, one after the next as it encounters them. (You will see in the next tutorial on conditional statements that it is possible to alter this behavior.)

In a REPL session, statements are executed as they are typed in, until the interpreter is terminated. When you execute a script file, the interpreter reads statements from the file and executes them until end-of-file is encountered.

Python programs are typically organized with one statement per line. In other words, each statement occupies a single line, with the end of the statement delimited by the newline character that marks the end of the line. The majority of the examples so far in this tutorial series have followed this pattern:

>>> print('Hello, World!')
Hello, World!
>>> x = [1, 2, 3]
>>> print(x[1:2])
[2]

Note: In many of the REPL examples you have seen, a statement has often simply consisted of an expression typed directly at the >>> prompt, for which the interpreter dutifully displays the value:

>>> 'foobar'[2:5]
'oba'

Remember that this only works interactively, not from a script file. In a script file, a literal or expression that appears as a solitary statement like the above will not cause output to the console. In fact, it won’t do anything useful at all. Python will simply waste CPU time calculating the value of the expression, and then throw it away.

Line Continuation

Suppose a single statement in your Python code is especially long. For example, you may have an assignment statement with many terms:

>>> person1_age = 42
>>> person2_age = 16
>>> person3_age = 71
>>> someone_is_of_working_age = (person1_age >= 18 and person1_age <= 65) or (person2_age >= 18 and person2_age <= 65) or (person3_age >= 18 and person3_age <= 65)
>>> someone_is_of_working_age
True

Or perhaps you are defining a lengthy nested list:

>>> a = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15], [16, 17, 18, 19, 20], [21, 22, 23, 24, 25]]
>>> a
[[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15], [16, 17, 18, 19, 20], [21, 22, 23, 24, 25]]

You’ll notice that these statements are too long to fit in your browser window, and the browser is forced to render the code blocks with horizontal scroll bars. You may find that irritating. (You have our apologies—these examples are presented that way to make the point. It won’t happen again.)

It is equally frustrating when lengthy statements like these are contained in a script file. Most editors can be configured to wrap text, so that the ends of long lines are at least visible and don’t disappear out the right edge of the editor window. But the wrapping doesn’t necessarily occur in logical locations that enhance readability:

[screenshot: long lines wrapped by the editor at arbitrary points]

Excessively long lines of code are generally considered poor practice. In fact, there is an official Style Guide for Python Code put forth by the Python Software Foundation, and one of its stipulations is that the maximum line length in Python code should be 79 characters.

Note: The Style Guide for Python Code is also referred to as PEP 8. PEP stands for Python Enhancement Proposal. PEPs are documents that contain details about features, standards, design issues, general guidelines, and information relating to Python. For more information, see the Python Software Foundation Index of PEPs.

As code becomes more complex, statements will on occasion unavoidably grow long. To maintain readability, you should break them up into parts across several lines. But you can’t just split a statement whenever and wherever you like. Unless told otherwise, the interpreter assumes that a newline character terminates a statement. If the statement isn’t syntactically correct at that point, an exception is raised:

>>> someone_is_of_working_age = person1_age >= 18 and person1_age <= 65 or
SyntaxError: invalid syntax

In Python code, a statement can be continued from one line to the next in two different ways: implicit and explicit line continuation.

Implicit Line Continuation

This is the more straightforward technique for line continuation, and the one that is preferred according to PEP 8.

Any statement containing opening parentheses ('('), brackets ('['), or curly braces ('{') is presumed to be incomplete until all matching parentheses, brackets, and braces have been encountered. Until then, the statement can be implicitly continued across lines without raising an error.

For example, the nested list definition from above can be made much more readable using implicit line continuation because of the open brackets:

>>> a = [
...     [1, 2, 3, 4, 5],
...     [6, 7, 8, 9, 10],
...     [11, 12, 13, 14, 15],
...     [16, 17, 18, 19, 20],
...     [21, 22, 23, 24, 25]
... ]
>>> a
[[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15],
[16, 17, 18, 19, 20], [21, 22, 23, 24, 25]]

A long expression can also be continued across multiple lines by wrapping it in grouping parentheses. PEP 8 explicitly advocates using parentheses in this manner when appropriate:

>>> someone_is_of_working_age = (
...     (person1_age >= 18 and person1_age <= 65)
...     or (person2_age >= 18 and person2_age <= 65)
...     or (person3_age >= 18 and person3_age <= 65)
... )
>>> someone_is_of_working_age
True

If you need to continue a statement across multiple lines, it is usually possible to use implicit line continuation to do so. This is because parentheses, brackets, and curly braces appear so frequently in Python syntax:

Parentheses

  • Expression grouping

    >>> x = (
    ...     1 + 2
    ...     + 3 + 4
    ...     + 5 + 6
    ... )
    >>> x
    21
  • Function call

    >>> print(
    ...     'foo',
    ...     'bar',
    ...     'baz'
    ... )
    foo bar baz
  • Method call

    >>> 'abc'.center(
    ...     9,
    ...     '-'
    ... )
    '---abc---'
  • Tuple definition

    >>> t = (
    ...     'a', 'b',
    ...     'c', 'd'
    ... )

Curly Braces

  • Dictionary definition

    >>> d = {
    ...     'a': 1,
    ...     'b': 2
    ... }
  • Set definition

    >>> x1 = {
    ...     'foo',
    ...     'bar',
    ...     'baz'
    ... }

Square Brackets

  • List definition

    >>> a = [
    ...     'foo', 'bar',
    ...     'baz', 'qux'
    ... ]
  • Indexing

    >>> a[
    ...     1
    ... ]
    'bar'
  • Slicing

    >>> a[
    ...     1:2
    ... ]
    ['bar']
  • Dictionary key reference

    >>> d[
    ...     'b'
    ... ]
    2

Note: Just because something is syntactically allowed, it doesn’t mean you should do it. Some of the examples above would not typically be recommended. Splitting indexing, slicing, or dictionary key reference across lines, in particular, would be unusual. But you can consider it if you can make a good argument that it enhances readability.

Remember that if there are multiple parentheses, brackets, or curly braces, then implicit line continuation is in effect until they are all closed:

>>> a = [
...     [
...         ['foo', 'bar'],
...         [1, 2, 3]
...     ],
...     {1, 3, 5},
...     {
...         'a': 1,
...         'b': 2
...     }
... ]
>>> a
[[['foo', 'bar'], [1, 2, 3]], {1, 3, 5}, {'a': 1, 'b': 2}]

Note how line continuation and judicious use of indentation can be used to clarify the nested structure of the list.

Explicit Line Continuation

In cases where implicit line continuation is not readily available or practicable, there is another option. This is referred to as explicit line continuation or explicit line joining.

Ordinarily, a newline character (which you get when you press Enter on your keyboard) indicates the end of a line. If the statement is not complete by that point, Python will raise a SyntaxError exception:

>>> s =
  File "<stdin>", line 1
    s =
       ^
SyntaxError: invalid syntax

>>> x = 1 + 2 +
  File "<stdin>", line 1
    x = 1 + 2 +
               ^
SyntaxError: invalid syntax

To indicate explicit line continuation, you can specify a backslash (\) character as the final character on the line. In that case, Python ignores the following newline, and the statement is effectively continued on the next line:

>>> s = \
... 'Hello, World!'
>>> s
'Hello, World!'

>>> x = 1 + 2 \
...     + 3 + 4 \
...     + 5 + 6
>>> x
21

Note that the backslash character must be the last character on the line. Not even whitespace is allowed after it:

>>> # You can't see it, but there is a space character following the \ here:
>>> s = \
  File "<stdin>", line 1
    s = \
         ^
SyntaxError: unexpected character after line continuation character

Again, PEP 8 recommends using explicit line continuation only when implicit line continuation is not feasible.

Multiple Statements Per Line

Multiple statements may occur on one line, if they are separated by a semicolon (;) character:

>>> x = 1; y = 2; z = 3
>>> print(x); print(y); print(z)
1
2
3

Stylistically, this is generally frowned upon, and PEP 8 expressly discourages it. There might be situations where it improves readability, but it usually doesn’t. In fact, it often isn’t necessary. The following statements are functionally equivalent to the example above, but would be considered more typical Python code:

>>> x, y, z = 1, 2, 3
>>> print(x, y, z, sep='\n')
1
2
3

The term Pythonic refers to code that adheres to generally accepted common guidelines for readability and “best” use of idiomatic Python. When someone says code is not Pythonic, they are implying that it does not express the programmer’s intent as well as might otherwise be done in Python. Thus, the code is probably not as readable as it could be to someone who is fluent in Python.

If you find your code has multiple statements on a line, there is probably a more Pythonic way to write it. But again, if you think it’s appropriate or enhances readability, you should feel free to do it.

Comments

In Python, the hash character (#) signifies a comment. The interpreter will ignore everything from the hash character through the end of that line:

>>> a = ['foo', 'bar', 'baz']  # I am a comment.
>>> a
['foo', 'bar', 'baz']

If the first non-whitespace character on the line is a hash, the entire line is effectively ignored:

>>> # I am a comment.
>>> # I am too.

Naturally, a hash character inside a string literal is protected, and does not indicate a comment:

>>> a = 'foobar # I am *not* a comment.'
>>> a
'foobar # I am *not* a comment.'

A comment is just ignored, so what purpose does it serve? Comments give you a way to attach explanatory detail to your code:

>>> # Calculate and display the area of a circle.
>>> pi = 3.1415926536
>>> r = 12.35
>>> area = pi * (r ** 2)
>>> print('The area of a circle with radius', r, 'is', area)
The area of a circle with radius 12.35 is 479.163565508706

Up to now, your Python coding has consisted mostly of short, isolated REPL sessions. In that setting, the need for comments is pretty minimal. Eventually, you will develop larger applications contained across multiple script files, and comments will become increasingly important.

Good commenting makes the intent of your code clear at a glance when someone else reads it, or even when you yourself read it. Ideally, you should strive to write code that is as clear, concise, and self-explanatory as possible. But there will be times that you will make design or implementation decisions that are not readily obvious from the code itself. That is where commenting comes in. Good code explains how; good comments explain why.

Comments can be included within implicit line continuation:

>>> x = (1 + 2          # I am a comment.
...      + 3 + 4        # Me too.
...      + 5 + 6)
>>> x
21

>>> a = [
...     'foo', 'bar',   # Me three.
...     'baz', 'qux'
... ]
>>> a
['foo', 'bar', 'baz', 'qux']

But recall that explicit line continuation requires the backslash character to be the last character on the line. Thus, a comment can’t follow afterward:

>>> x = 1 + 2 + \   # I wish to be a comment, but I'm not.
SyntaxError: unexpected character after line continuation character

What if you want to add a comment that is several lines long? Many programming languages provide a syntax for multiline comments (also called block comments). For example, in C and Java, comments are delimited by the tokens /* and */. The text contained within those delimiters can span multiple lines:

/*
[This is not Python!]

Initialize the value for radius of circle.

Then calculate the area of the circle
and display the result to the console.
*/

Python doesn’t explicitly provide anything analogous to this for creating multiline block comments. To create a block comment, you would usually just begin each line with a hash character:

>>> # Initialize value for radius of circle.
>>> #
>>> # Then calculate the area of the circle
>>> # and display the result to the console.
>>> pi = 3.1415926536
>>> r = 12.35
>>> area = pi * (r ** 2)
>>> print('The area of a circle with radius', r, 'is', area)
The area of a circle with radius 12.35 is 479.163565508706

However, for code in a script file, there is technically an alternative.

You saw above that when the interpreter parses code in a script file, it ignores a string literal (or any literal, for that matter) if it appears as a statement by itself. More precisely, a literal isn't ignored entirely: the interpreter sees it and parses it, but doesn't do anything with it. Thus, a string literal on a line by itself can serve as a comment. Since a triple-quoted string can span multiple lines, it can effectively function as a multiline comment.

Consider this script file foo.py:

"""Initialize value for radius of circle.Then calculate the area of the circleand display the result to the console."""pi=3.1415926536r=12.35area=pi*(r**2)print('The area of a circle with radius',r,'is',area)

When this script is run, the output appears as follows:

C:\Users\john\Documents\Python\doc>python foo.py
The area of a circle with radius 12.35 is 479.163565508706

The triple-quoted string is not displayed and doesn’t change the way the script executes in any way. It effectively constitutes a multiline block comment.

Although this works (and was once put forth as a Python programming tip by Guido himself), PEP 8 actually recommends against it. The reason for this appears to be because of a special Python construct called the docstring. A docstring is a special comment at the beginning of a user-defined function that documents the function’s behavior. Docstrings are typically specified as triple-quoted string comments, so PEP 8 recommends that other block comments in Python code be designated the usual way, with a hash character at the start of each line.
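As a quick preview (user-defined functions are covered in an upcoming tutorial), here is a minimal sketch of a docstring in action:

def circle_area(r):
    """Return the area of a circle with radius r.

    This triple-quoted string is a docstring: it documents the
    function and is displayed by help(circle_area).
    """
    pi = 3.1415926536
    return pi * (r ** 2)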

However, as you are developing code, if you want a quick and dirty way to comment out a section of code temporarily for experimentation, you may find it convenient to wrap the code in triple quotes.

Further Reading: You will learn more about docstrings in the upcoming tutorial on functions in Python.

For more information on commenting and documenting Python code, including docstrings, see Documenting Python Code: A Complete Guide.

Whitespace

When parsing code, the Python interpreter breaks the input up into tokens. Informally, tokens are just the language elements that you have seen so far: identifiers, keywords, literals, and operators.

Typically, what separates tokens from one another is whitespace: blank characters that provide empty space to improve readability. The most common whitespace characters are the following:

Character   ASCII Code   Literal Expression
space       32 (0x20)    ' '
tab         9 (0x9)      '\t'
newline     10 (0xa)     '\n'

There are other somewhat outdated ASCII whitespace characters such as carriage return and form feed, as well as some very esoteric Unicode characters that provide whitespace. But for present purposes, whitespace usually means a space, tab, or newline.

Whitespace is mostly ignored, and mostly not required, by the Python interpreter. When it is clear where one token ends and the next one starts, whitespace can be omitted. This is usually the case when special non-alphanumeric characters are involved:

>>> x=3;y=12
>>> x+y
15
>>> (x==3)and(x<y)
True
>>> a=['foo','bar','baz']
>>> a
['foo', 'bar', 'baz']
>>> d={'foo':3,'bar':4}
>>> d
{'foo': 3, 'bar': 4}
>>> x,y,z='foo',14,21.1
>>> (x,y,z)
('foo', 14, 21.1)
>>> z='foo'"bar"'baz'#Comment
>>> z
'foobarbaz'

Every one of the statements above has no whitespace at all, and the interpreter handles them all fine. That’s not to say that you should write them that way though. Judicious use of whitespace almost always enhances readability, and your code should typically include some. Compare the following code fragments:

>>> value1=100
>>> value2=200
>>> v=(value1>=0)and(value1<value2)

>>> value1 = 100
>>> value2 = 200
>>> v = (value1 >= 0) and (value1 < value2)

Most people would likely find that the added whitespace in the second example makes it easier to read. On the other hand, you could probably find a few who would prefer the first example. To some extent, it is a matter of personal preference. But there are standards for whitespace in expressions and statements put forth in PEP 8, and you should strongly consider adhering to them as much as possible.

Note: You can juxtapose string literals, with or without whitespace:

>>> s="foo"'bar''''baz'''>>> s'foobarbaz'>>> s='foo'"bar"'''baz'''>>> s'foobarbaz'

The effect is concatenation, exactly as though you had used the + operator.

In Python, whitespace is generally only required when it is necessary to distinguish one token from the next. This is most common when one or both tokens are an identifier or keyword.

For example, in the following case, whitespace is needed to separate the identifier s from the keyword in:

>>> s = 'bar'
>>> s in ['foo', 'bar', 'baz']
True

>>> sin['foo','bar','baz']
Traceback (most recent call last):
  File "<pyshell#25>", line 1, in <module>
    sin['foo','bar','baz']
NameError: name 'sin' is not defined

Here is an example where whitespace is required to distinguish between the identifier y and the numeric constant 20:

>>> y is 20
False

>>> y is20
SyntaxError: invalid syntax

In this example, whitespace is needed between two keywords:

>>> 'qux' not in ['foo', 'bar', 'baz']
True

>>> 'qux' notin ['foo', 'bar', 'baz']
SyntaxError: invalid syntax

Running identifiers or keywords together fools the interpreter into thinking you are referring to a different token than you intended: sin, is20, and notin, in the examples above.

All this tends to be rather academic because it isn’t something you’ll likely need to think about much. Instances where whitespace is necessary tend to be intuitive, and you’ll probably just do it by second nature.

You should use whitespace where it isn’t strictly necessary as well to enhance readability. Ideally, you should follow the guidelines in PEP 8.

Deep Dive: Fortran and Whitespace

The earliest versions of Fortran, one of the first programming languages created, were designed so that all whitespace was completely ignored. Whitespace characters could be optionally included or omitted virtually anywhere—between identifiers and reserved words, and even in the middle of identifiers and reserved words.

For example, if your Fortran code contained a variable named total, any of the following would be a valid statement to assign it the value 50:

total = 50
to tal = 50
t o t a l=5 0

This was meant as a convenience, but in retrospect it is widely regarded as overkill. It often resulted in code that was difficult to read. Worse yet, it potentially led to code that did not execute correctly.

Consider this tale from NASA in the 1960s. A Mission Control Center orbit computation program written in Fortran was supposed to contain the following line of code:

DO 10 I = 1,100

In the Fortran dialect used by NASA at that time, the code shown introduces a loop, a construct that executes a body of code repeatedly. (You will learn about loops in Python in two future tutorials on definite and indefinite iteration).

Unfortunately, this line of code ended up in the program instead:

DO 10 I = 1.100

If you have a difficult time seeing the difference, don't feel too bad. It took the NASA programmer a couple of weeks to notice that there was a period between 1 and 100 instead of a comma. Because the Fortran compiler ignored whitespace, DO 10 I was taken to be a variable name, and the statement DO 10 I = 1.100 resulted in assigning 1.100 to a variable called DO10I instead of introducing a loop.

Some versions of the story claim that a Mercury rocket was lost because of this error, but that is evidently a myth. It did apparently cause inaccurate data for some time, though, before the programmer spotted the error.

Virtually all modern programming languages have chosen not to go this far with ignoring whitespace.

Whitespace as Indentation

There is one more important situation in which whitespace is significant in Python code. Indentation—whitespace that appears to the left of the first token on a line—has very special meaning.

In most interpreted languages, leading whitespace before statements is ignored. For example, consider this Windows Command Prompt session:

C:\Users\john>echo foo
foo

C:\Users\john>    echo foo
foo

Note: In a Command Prompt window, the echo command displays its arguments to the console, like the print() function in Python. Similar behavior can be observed from a terminal window in macOS or Linux.

In the second statement, four space characters are inserted to the left of the echo command. But the result is the same. The interpreter ignores the leading whitespace and executes the same command, echo foo, just as it does when the leading whitespace is absent.

Now try more or less the same thing with the Python interpreter:

>>> print('foo')
foo
>>>     print('foo')
SyntaxError: unexpected indent

Say what? Unexpected indent? The leading whitespace before the second print() statement causes a SyntaxError exception!

In Python, indentation is not ignored. Leading whitespace is used to compute a line’s indentation level, which in turn is used to determine grouping of statements. As yet, you have not needed to group statements, but that will change in the next tutorial with the introduction of control structures.

Until then, be aware that leading whitespace matters.
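As a small preview (the if statement itself is covered in the next tutorial), here is a sketch of how indentation groups statements:

x = 5
if x > 3:
    print('x is bigger than 3')      # indented: grouped inside the if-block
    print('still inside the block')  # same indentation, same block
print('always printed')              # back at the left margin: outside the block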

Conclusion

This tutorial introduced you to Python program lexical structure. You learned what constitutes a valid Python statement and how to use implicit and explicit line continuation to write a statement that spans multiple lines. You also learned about commenting Python code, and about use of whitespace to enhance readability.

Next, you will learn how to group statements into more complex decision-making constructs using conditional statements.




Made With Mu: Tinkology with Les


Friend of Mu and community supporter extraordinaire Les Pounder writes a blog containing lots of quick, fun and beginner-friendly projects. He calls this sort of thing “tinkology” – a term I really like. Playful tinkering is such a wonderful way to learn and relax. Can you guess what editor he uses? Nope, it’s not EMACS or VI. ;-)

What I love about Les’s blogging is its consistency: there’s always something new and it’s always really imaginative and fun. Les is a great source of ideas for lessons, projects or learning activities for beginner programmers. For instance, on most Tuesdays he writes a “Tooling Tuesday” entry to his blog and many of these have a Python slant to them. Other related series of posts include “Micro:bit Monday” and “Friday Fun”.

Mu related highlights include several posts, any of which could form the basis of a cool project, set of lessons or after school code-club activity. So, here's to Les, his blog and fantastically imaginative projects.

Thank you Les!

Fabio Zadrozny: PyDev 6.5.0 (#region code folding)

PyDev 6.5.0 is now available for download.

There are some nice features and fixes available in this release:
  • #region / #endregion comments can now be used by the code-folding engine (a short sketch appears after this list).
  • An action to easily switch the default interpreter is now available (default binding: Ctrl+Shift+Alt+I -- note that it must be executed with an opened editor).
  • It's possible to create local imports from global imports (use Ctrl+1 on the name of a given global import and select "Move import to local scope(s)" -- although note that the global import needs to be manually deleted later on).
  • The interactive interpreter now has scroll-lock.
  • The debugger is much more responsive!
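As a quick illustration of the new folding markers, here is a minimal, hypothetical sketch of how #region / #endregion comments might delimit foldable blocks in a Python file:

#region Configuration
HOST = 'localhost'
PORT = 8080
#endregion

#region Helpers
def make_url(path):
    return 'http://%s:%s/%s' % (HOST, PORT, path)
#endregion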
See: http://www.pydev.org for more details.

Podcast.__init__: Infection Monkey Vulnerability Scanner with Daniel Goldberg


Summary

How secure are your servers? The best way to be sure that your systems aren’t being compromised is to do it yourself. In this episode Daniel Goldberg explains how you can use his project Infection Monkey to run a scan of your infrastructure to find and fix the vulnerabilities that can be taken advantage of. He also discusses his reasons for building it in Python, how it compares to other security scanners, and how you can get involved to keep making it better.

Preface

  • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
  • When you’re ready to launch your next app you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to scale up. Go to podcastinit.com/linode to get a $20 credit and launch a new server in under a minute.
  • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com
  • To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
  • Join the community in the new Zulip chat workspace at podcastinit.com/chat
  • Your host as usual is Tobias Macey and today I’m interviewing Daniel Goldberg about Infection Monkey, an open source system breach simulation tool for evaluating the security of your network

Interview

  • Introductions
  • How did you get introduced to Python?
  • What is infection monkey and what was the reason for building it?
    • What was the reasoning for building it in Python?
    • If you were to start over today what would you do differently?
  • Penetration testing is typically an endeavor that requires a significant amount of knowledge and experience of security practices. What have been some of the most difficult aspects of building an automated vulnerability testing system?
    • How does a deployed instance keep up to date with recent exploits and attack vectors?
  • How does Infection Monkey compare to other tools such as Nessus and Nexpose?
  • What are some examples of the types of vulnerabilities that can be discovered by Infection Monkey?
  • What kinds of information can Infection Monkey discover during a scan?
    • How does that information get reported to the user?
    • How much security experience is necessary to understand and address the findings in a given report generated from a scan?
  • What techniques do you use to ensure that the simulated compromises can be safely reverted?
  • What are some aspects of network security and system vulnerabilities that Infection Monkey is unable to detect and/or analyze?
  • For someone who is interested in using Infection Monkey what are the steps involved in getting it set up?
    • What is the workflow for running a scan?
    • Is Infection Monkey intended to be run continuously, or only with the interaction of an operator?
  • What are your plans for the future of Infection Monkey?

Keep In Touch

Picks

Links

The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA

Codementor: The Magic Behind Python Generator Functions

https://cdn-images-1.medium.com/max/1600/1*6PpBwJOQIYho7GMjijh0Gw.jpeg Photo by Kaique Rocha (https://www.pexels.com/u/kaiquestr/) Generator Functions are one of the coolest features of the...

Codementor: A Closer Look At How Python f-strings Work

https://cdn-images-1.medium.com/max/1600/1*Ww1GIXkTV2XdF_hYKyiWfA.jpeg Photo by Fancycrave (https://www.pexels.com/photo/accessory-bobbin-close-up-clothing-355148/) PEP 498...

Michael Foord: Interview on Podcast.__init__


Podcast.__init__

I recently had the great pleasure of having a conversation with Tobias Macey, who produces the Podcast.init podcast. We spent almost an hour talking about testing, the Python community and the mock library amongst other things.

The podcast is now available to listen to:

  • Michael Foord On Testing, Mock, TDD, And The Python Community - Episode 171

    Michael Foord has been working on building and testing software in Python for over a decade. One of his most notable and widely used contributions to the community is the Mock library, which has been incorporated into the standard library. In this episode he explains how he got involved in the community, why testing has been such a strong focus throughout his career, the uses and hazards of mocked objects, and how he is transitioning to freelancing full time.

A selection of the questions Tobias asked include:

  • How did you get introduced to Python?
  • One of the main threads in your career appears to be software testing. What aspects of testing do you find so interesting and how did you first get exposed to that aspect of building software?
    • How has the language and ecosystem support for testing evolved over the course of your career?
    • What are some of the areas that you find it to still be lacking?
  • Mock is one of your projects that has been widely adopted and ultimately incorporated into the standard library. What was your reason for starting it in the first place?
    • Mocking can be a controversial topic. What are your current thoughts on how and when to use mocks, stubs, and fixtures?
  • How do you view the state of the art for testing in Python as it compares to other languages that you have worked in?
  • You were fairly early in the move to supporting Python 2 and 3 in a single project with Mock. How has that overall experience changed in the intervening years since Python 2.4 and 3.2?
  • What are some of the notable evolutions in Python and the software industry that you have experienced over your career?
  • You recently transitioned to acting as a software trainer and consultant full time. Where are you focusing your energy currently and what are your grand plans for the future?

Python Celery - Weekly Celery Tutorials and How-tos: Celery task exceptions and automatic retries


Handling Celery task failures in a consistent and predictable way is a prerequisite to building a resilient asynchronous system. In this blog post you will learn how to handle Celery task errors and automatically retry failed tasks.

To handle exceptions or not?

Assume we have a Celery task that fetches some data from an external API via an HTTP GET request. We want our code to respond predictably to any potential failure such as connection issues, request throttling or unexpected server responses. But what precisely does that mean?

import requests

@app.task(bind=True)
def fetch_data(self):
    url = 'https://www.quandl.com/api/v3/datasets/WIKI/FB/data.json'
    response = requests.get(url)
    if not response.ok:
        raise Exception(f'GET {url} returned unexpected response code: {response.status_code}')
    return response.json()

Here, we actually do not handle errors at all. Very much the opposite, even. Either the GET request throws an exception somewhere along the way, which we happily let bubble up, or we throw an exception ourselves when we do not receive a 2xx response status code or receive an invalid JSON response body.

Auto-retry failed tasks

The idea is to not catch any exceptions and let Celery deal with it. Our responsibilities are:

  • ensure that exceptions bubble up so that our task fails
  • instruct Celery to do something (or nothing) with a failed task

When you register a Celery task via the decorator, you can tell Celery what to do with the task in case of a failure. autoretry_for allows you to specify a list of exception types you want to retry for. retry_kwargs lets you specify additional arguments such as max_retries (the maximum number of retries) and countdown (the delay between retries). Check out the docs for a full list of arguments. In the following example, Celery retries up to five times with a two second delay in between retries:

@app.task(bind=True, autoretry_for=(Exception,), retry_kwargs={'max_retries': 5, 'countdown': 2})
def fetch_data(self):
    url = 'https://www.quandl.com/api/v3/datasets/WIKI/FB/data.json'
    response = requests.get(url)
    if not response.ok:
        raise Exception(f'GET {url} returned unexpected response code: {response.status_code}')
    return response.json()

Alternatively, you can retry following the rules of exponential backoff via the retry_backoff argument (retry_jitter is used to introduce randomness into exponential backoff delays, to prevent all tasks from being executed simultaneously; it's set to False here, but you probably want it set to True in a production environment). In this example, the first retry happens after 2s, the following after 4s, the third one after 8s, etc.:

@app.task(bind=True, autoretry_for=(Exception,), retry_backoff=2, retry_kwargs={'max_retries': 5}, retry_jitter=False)
def fetch_data(self):
    url = 'https://www.quandl.com/api/v3/datasets/WIKI/FB/data.json'
    response = requests.get(url)
    if not response.ok:
        raise Exception(f'GET {url} returned unexpected response code: {response.status_code}')
    return response.json()

Conclusion

There is not much secret to exception handling in Celery other than allowing exceptions to happen and using the Celery configuration to deal with them. Coupled with an atomic task design (in the example above, the json would be passed via a Celery chain to a second task that writes the json to the database), this makes for a really powerful, reusable and predictable design.
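As a sketch of that chained design, reusing the Celery app and fetch_data task from above (save_to_database is a hypothetical persistence helper, not part of the original example):

from celery import chain

@app.task(bind=True, autoretry_for=(Exception,), retry_kwargs={'max_retries': 5})
def store_data(self, data):
    save_to_database(data)  # hypothetical helper that writes the json to the database

# fetch_data's return value is passed as the first argument to store_data
chain(fetch_data.s(), store_data.s()).apply_async()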


Itamar Turner Trauring: Stabbing yourself with a fork() in a multiprocessing.Pool full of sharks


It’s time for another deep-dive into Python brokenness and the pain that is POSIX system programming, this time with exciting and not very convincing shark-themed metaphors! Most of what you’ll learn isn’t really Python-specific, so stick around regardless and enjoy the sharks.

Let’s set the metaphorical scene: you’re swimming in a pool full of sharks. (The sharks are a metaphor for processes.)

Next, you take a fork. (The fork is a metaphor for fork().)

You stab yourself with the fork. Stab stab stab. Blood starts seeping out, the sharks start circling, and pretty soon you find yourself—dead(locked) in the water!

In this journey through space and time you will encounter:

  • A mysterious failure wherein Python’s multiprocessing.Pool deadlocks, mysteriously.
  • The root of the mystery: fork().
  • A conundrum wherein fork() copying everything is a problem, and fork() not copying everything is also a problem.
  • Some bandaids that won’t stop the bleeding.
  • The solution that will keep your code from being eaten by sharks.

Let’s begin!

Read more...

Stack Abuse: Text Summarization with NLTK in Python


Introduction

Text Summarization with NLTK in Python

As I write this article, 1,907,223,370 websites are active on the internet and 2,722,460 emails are being sent per second. This is an unbelievably huge amount of data. It is impossible for a user to get insights from such huge volumes of data. Furthermore, a large portion of this data is either redundant or doesn't contain much useful information. The most efficient way to get access to the most important parts of the data, without having to sift through redundant and insignificant data, is to summarize the data in a way that it contains non-redundant and useful information only. The data can be in any form such as audio, video, images, and text. In this article, we will see how we can use automatic text summarization techniques to summarize text data.

Text summarization is a subdomain of Natural Language Processing (NLP) that deals with extracting summaries from huge chunks of texts. There are two main types of techniques used for text summarization: NLP-based techniques and deep learning-based techniques. In this article, we will see a simple NLP-based technique for text summarization. We will not use any machine learning library in this article. Rather we will simply use Python's NLTK library for summarizing Wikipedia articles.

Text Summarization Steps

I will explain the steps involved in text summarization using NLP techniques with the help of an example.

The following is a paragraph from one of the famous speeches by Denzel Washington at the 48th NAACP Image Awards:

So, keep working. Keep striving. Never give up. Fall down seven times, get up eight. Ease is a greater threat to progress than hardship. Ease is a greater threat to progress than hardship. So, keep moving, keep growing, keep learning. See you at work.

We can see from the paragraph above that he is basically motivating others to work hard and never give up. To summarize the above paragraph using NLP-based techniques we need to follow a set of steps, which will be described in the following sections.

Convert Paragraphs to Sentences

We first need to convert the whole paragraph into sentences. The most common way of converting paragraphs to sentences is to split the paragraph whenever a period is encountered. So if we split the paragraph under discussion into sentences, we get the following sentences:

  1. So, keep working
  2. Keep striving
  3. Never give up
  4. Fall down seven times, get up eight
  5. Ease is a greater threat to progress than hardship
  6. Ease is a greater threat to progress than hardship
  7. So, keep moving, keep growing, keep learning
  8. See you at work
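This splitting step can also be done programmatically. Here is a minimal sketch using NLTK's sentence tokenizer, which we will meet again later in this article:

import nltk
nltk.download('punkt')  # the sentence tokenizer model; only needed once

paragraph = ("So, keep working. Keep striving. Never give up. "
             "Fall down seven times, get up eight. "
             "Ease is a greater threat to progress than hardship. "
             "Ease is a greater threat to progress than hardship. "
             "So, keep moving, keep growing, keep learning. See you at work.")

print(nltk.sent_tokenize(paragraph))
# ['So, keep working.', 'Keep striving.', 'Never give up.', ...]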

Text Preprocessing

After converting the paragraph to sentences, we need to remove all the special characters, stop words and numbers from all the sentences. After preprocessing, we get the following sentences:

  1. keep working
  2. keep striving
  3. never give
  4. fall seven time get eight
  5. ease greater threat progress hardship
  6. ease greater threat progress hardship
  7. keep moving keep growing keep learning
  8. see work

Tokenizing the Sentences

We need to tokenize all the sentences to get all the words that exist in the sentences. After tokenizing the sentences, we get a list of the following words:

['keep',
 'working',
 'keep',
 'striving',
 'never',
 'give',
 'fall',
 'seven',
 'time',
 'get',
 'eight',
 'ease',
 'greater',
 'threat',
 'progress',
 'hardship',
 'ease',
 'greater',
 'threat',
 'progress',
 'hardship',
 'keep',
 'moving',
 'keep',
 'growing',
 'keep',
 'learning',
 'see',
 'work']

Find Weighted Frequency of Occurrence

Next we need to find the weighted frequency of occurrences of all the words. We can find the weighted frequency of each word by dividing its frequency by the frequency of the most occurring word. The following table contains the weighted frequencies for each word:

Word        Frequency   Weighted Frequency
ease        2           0.40
eight       1           0.20
fall        1           0.20
get         1           0.20
give        1           0.20
greater     2           0.40
growing     1           0.20
hardship    2           0.40
keep        5           1.00
learning    1           0.20
moving      1           0.20
never       1           0.20
progress    2           0.40
see         1           0.20
seven       1           0.20
striving    1           0.20
threat      2           0.40
time        1           0.20
work        1           0.20
working     1           0.20

Since the word "keep" has the highest frequency of 5, the weighted frequency of each word has been calculated by dividing its number of occurrences by 5.
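To make the arithmetic concrete, here is a small sketch that reproduces these weighted frequencies from the token list above:

from collections import Counter

words = ['keep', 'working', 'keep', 'striving', 'never', 'give',
         'fall', 'seven', 'time', 'get', 'eight', 'ease', 'greater',
         'threat', 'progress', 'hardship', 'ease', 'greater', 'threat',
         'progress', 'hardship', 'keep', 'moving', 'keep', 'growing',
         'keep', 'learning', 'see', 'work']

counts = Counter(words)
max_count = max(counts.values())  # 5, for 'keep'

weighted = {word: count / max_count for word, count in counts.items()}
print(weighted['keep'], weighted['ease'])  # 1.0 0.4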

Replace Words by Weighted Frequency in Original Sentences

The next step is to plug the weighted frequencies in place of the corresponding words in the original sentences and find their sum. It is important to mention that the weighted frequency of the words removed during preprocessing (stop words, punctuation, digits, etc.) will be zero and therefore does not need to be added, as shown below:

Sentence                                             Sum of Weighted Frequencies
So, keep working                                     1.00 + 0.20 = 1.20
Keep striving                                        1.00 + 0.20 = 1.20
Never give up                                        0.20 + 0.20 = 0.40
Fall down seven times, get up eight                  0.20 + 0.20 + 0.20 + 0.20 + 0.20 = 1.00
Ease is a greater threat to progress than hardship   0.40 + 0.40 + 0.40 + 0.40 + 0.40 = 2.00
Ease is a greater threat to progress than hardship   0.40 + 0.40 + 0.40 + 0.40 + 0.40 = 2.00
So, keep moving, keep growing, keep learning         1.00 + 0.20 + 1.00 + 0.20 + 1.00 + 0.20 = 3.60
See you at work                                      0.20 + 0.20 = 0.40

Sort Sentences in Descending Order of Sum

The final step is to sort the sentences in descending order of their sum. The sentences with the highest scores summarize the text. For instance, look at the sentence with the highest sum of weighted frequencies:

So, keep moving, keep growing, keep learning

You can easily judge what the paragraph is all about. Similarly, you can add the sentence with the second highest sum of weighted frequencies to have a more informative summary. Take a look at the following sentences:

So, keep moving, keep growing, keep learning. Ease is a greater threat to progress than hardship.

These two sentences give a pretty good summarization of what was said in the paragraph.

Summarizing Wikipedia Articles

Now we know how the process of text summarization works using a very simple NLP technique. In this section, we will use Python's NLTK library to summarize a Wikipedia article.

Fetching Articles from Wikipedia

Before we can summarize Wikipedia articles, we need to fetch them from the web. To do so we will use a couple of libraries. The first library that we need to download is Beautiful Soup, which is a very useful Python utility for web scraping. Execute the following command at the command prompt to download the Beautiful Soup utility:

$ pip install beautifulsoup4

Another important library that we need to parse XML and HTML is the lxml library. Execute the following command at the command prompt to download lxml:

$ pip install lxml

Now let's write some Python code to scrape data from the web. The article we are going to scrape is the Wikipedia article on Artificial Intelligence. Execute the following script:

import bs4 as bs  
import urllib.request  
import re

scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Artificial_intelligence')  
article = scraped_data.read()

parsed_article = bs.BeautifulSoup(article,'lxml')

paragraphs = parsed_article.find_all('p')

article_text = ""

for p in paragraphs:  
    article_text += p.text

In the script above we first import the important libraries required for scraping the data from the web. We then use the urlopen function from the urllib.request utility to scrape the data. Next, we need to call the read function on the object returned by the urlopen function in order to read the data. To parse the data, we use a BeautifulSoup object and pass it the scraped data object, i.e. article, and the lxml parser.

In Wikipedia articles, all the text for the article is enclosed inside the <p> tags. To retrieve the text we need to call the find_all function on the object returned by BeautifulSoup. The tag name is passed as a parameter to the function. The find_all function returns all the paragraphs in the article in the form of a list. All the paragraphs are then combined to recreate the article.

Once the article is scraped, we need to do some preprocessing.

Preprocessing

The first preprocessing step is to remove references from the article. In Wikipedia, references are enclosed in square brackets. The following script removes the square brackets and replaces the resulting multiple spaces with a single space. Take a look at the script below:

# Removing Square Brackets and Extra Spaces
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)  
article_text = re.sub(r'\s+', ' ', article_text)  

The article_text object contains text without brackets. However, we do not want to remove anything else from the article since this is the original article. We will not remove other numbers, punctuation marks and special characters from this text since we will use this text to create summaries and weighted word frequencies will be replaced in this article.

To clean the text and calculate weighted frequencies, we will create another object. Take a look at the following script:

# Removing special characters and digits
formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text )  
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)  

Now we have two objects: article_text, which contains the original article, and formatted_article_text, which contains the formatted article. We will use formatted_article_text to create weighted frequency histograms for the words, and we will then score the sentences of article_text using those weighted frequencies.

Converting Text To Sentences

At this point we have preprocessed the data. Next, we need to tokenize the article into sentences. We will use the article_text object for tokenizing the article into sentences since it contains full stops. The formatted_article_text does not contain any punctuation and therefore cannot be converted into sentences using the full stop as a parameter.

The following script performs sentence tokenization:

import nltk
nltk.download('punkt')  # the sentence tokenizer model; only needed once

sentence_list = nltk.sent_tokenize(article_text)

Find Weighted Frequency of Occurrence

To find the frequency of occurrence of each word, we use the formatted_article_text variable. We used this variable to find the frequency of occurrence since it doesn't contain punctuation, digits, or other special characters. Take a look at the following script:

nltk.download('stopwords')  # the stop word list; only needed once
stopwords = nltk.corpus.stopwords.words('english')

word_frequencies = {}  
for word in nltk.word_tokenize(formatted_article_text):  
    if word not in stopwords:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1

In the script above, we first store all the English stop words from the nltk library into a stopwords variable. Next, we loop through all the words in the tokenized article and check whether each one is a stop word. If not, we check whether the word already exists in the word_frequencies dictionary. If the word is encountered for the first time, it is added to the dictionary as a key and its value is set to 1. Otherwise, if the word already exists in the dictionary, its value is simply incremented by 1.

Finally, to find the weighted frequency, we can simply divide the number of occurrences of each word by the frequency of the most occurring word, as shown below:

maximum_frequency = max(word_frequencies.values())

for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word]/maximum_frequency)

Calculating Sentence Scores

We have now calculated the weighted frequencies for all the words. Now is the time to calculate the scores for each sentence by adding weighted frequencies of the words that occur in that particular sentence. The following script calculates sentence scores:

sentence_scores = {}  
for sent in sentence_list:  
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]

In the script above, we first create an empty sentence_scores dictionary. The keys of this dictionary will be the sentences themselves and the values will be the corresponding scores of the sentences. Next, we loop through each sentence in the sentence_list and tokenize the sentence into words.

We then check if the word exists in the word_frequencies dictionary. This check is performed since we created the sentence_list list from the article_text object; on the other hand, the word frequencies were calculated using the formatted_article_text object, which doesn't contain any stop words, numbers, etc.

We do not want very long sentences in the summary, therefore, we calculate the score for only sentences with less than 30 words (although you can tweak this parameter for your own use-case). Next, we check whether the sentence exists in the sentence_scores dictionary or not. If the sentence doesn't exist, we add it to the sentence_scores dictionary as a key and assign it the weighted frequency of the first word in the sentence, as its value. On the contrary, if the sentence exists in the dictionary, we simply add the weighted frequency of the word to the existing value.

Getting the Summary

Now we have the sentence_scores dictionary that contains sentences with their corresponding score. To summarize the article, we can take the top N sentences with the highest scores. The following script retrieves the top 7 sentences and prints them on the screen.

import heapq  
summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)

summary = ' '.join(summary_sentences)  
print(summary)  

In the script above, we use the heapq library and call its nlargest function to retrieve the top 7 sentences with the highest scores.

The output summary looks like this:

Artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans and other animals. Many tools are used in AI, including versions of search and mathematical optimization, artificial neural networks, and methods based on statistics, probability and economics. The traditional problems (or goals) of AI research include reasoning, knowledge representation, planning, learning, natural language processing, perception and the ability to move and manipulate objects. When access to digital computers became possible in the middle 1950s, AI research began to explore the possibility that human intelligence could be reduced to symbol manipulation. One proposal to deal with this is to ensure that the first generally intelligent AI is 'Friendly AI', and will then be able to control subsequently developed AIs. Nowadays, the vast majority of current AI researchers work instead on tractable "narrow AI" applications (such as medical diagnosis or automobile navigation). Machine learning, a fundamental concept of AI research since the field's inception, is the study of computer algorithms that improve automatically through experience.

Remember, since Wikipedia articles are updated frequently, you might get different results depending upon the time of execution of the script.

Conclusion

This article explains the process of text summarization with the help of the Python NLTK library. The process of scraping articles using the BeautifulSoup library has also been briefly covered. I recommend that you scrape any other article from Wikipedia and see whether you can get a good summary of it or not.

Stack Abuse: File Handling in Python


Introduction

It is an unwritten consensus that Python is one of the best starting programming languages to learn as a novice. It is extremely versatile, easy to read/analyze, and quite pleasant to the eye. The Python programming language is highly scalable and is widely considered as one of the best toolboxes to build tools and utilities that you may want to use for diverse reasons.

This article briefly covers how Python handles one of the most important components of any operating system: its files and directories. Fortunately, Python has built-in functions to create and manipulate files, whether flat files or text files. The io module is the default module for accessing files, so we will not need to import any external library for general IO operations.

The key tools for file handling in Python are the open(), close(), read(), and write() functions, plus the append mode for adding to existing files.

Opening Files with open()

This function returns a file object, often called a "handle", which is used to read from and write to a file. The arguments that the function can receive are as follows:

open(filename, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)  

Normally, only the filename and mode parameters are needed, while the others are implicitly set to their default values.

The following code snippet shows how this function can be used:

file_example = open ("TestingText.txt")  

This will open a text file called "TestingText" in read-only mode. Take note that only the filename parameter was specified; this is because "read" is the default mode for the open function.

The access modes available for the open() function are as follows (a short sketch contrasting a few of them appears after the list):

  • r: Opens the file in read-only mode. Starts reading from the beginning of the file and is the default mode for the open() function.
  • rb: Opens the file as read-only in binary format and starts reading from the beginning of the file. While binary format can be used for different purposes, it is usually used when dealing with things like images, videos, etc.
  • r+: Opens a file for reading and writing, placing the pointer at the beginning of the file.
  • w: Opens in write-only mode. The pointer is placed at the beginning of the file and this will overwrite any existing file with the same name. It will create a new file if one with the same name doesn't exist.
  • wb: Opens a write-only file in binary mode.
  • w+: Opens a file for writing and reading.
  • wb+: Opens a file for writing and reading in binary mode.
  • a: Opens a file for appending new information to it. The pointer is placed at the end of the file. A new file is created if one with the same name doesn't exist.
  • ab: Opens a file for appending in binary mode.
  • a+: Opens a file for both appending and reading.
  • ab+: Opens a file for both appending and reading in binary mode.
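Here is a minimal sketch contrasting a few of these modes (it assumes a text file called notes.txt already exists next to the script):

f = open("notes.txt", "r")    # read-only; the pointer starts at the beginning
contents = f.read()
f.close()

f = open("notes.txt", "a")    # append; the pointer starts at the end, nothing is overwritten
f.write("one more line\n")
f.close()

f = open("notes.txt", "rb")   # binary; read() returns bytes rather than str
first_bytes = f.read(4)
f.close()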

If the target file doesn't exist in the same directory as the Python file being executed, we need to pass the full path of the file to the open() function, as shown in the following code snippet:

file_example = open ("F:\\Directory\\AnotherDirectory\\tests\\TestingText.txt")  

Note: Something to keep in mind is to always make sure that both the file name and the path given are correct. If either is incorrect or doesn't exist, a FileNotFoundError will be thrown, which then needs to be caught and handled by your program to prevent it from crashing.

To avoid this issue, as a best practice, errors can be caught with a try-except-finally block to handle the exception as shown below.

try:
    file_example = open("F:\\Directory\\AnotherDirectory\\tests\\TestingText.txt")
except IOError:
    print("An error was found. Either path is incorrect or file doesn't exist!")
finally:
    print("Terminating process!")

Reading from Files with read()

Python contains 3 functions to read files: read(), readline(), and readlines(). The last two functions are merely helper functions that make reading certain types of files easier.

For the examples that will be used, "TestingText.txt" contains the following text:

Hello, world! Python is the way to coding awesomeness.

If you don't believe me, try it on your own.

Come, you will enjoy the Dark Side. We have cookies!  

The read method is used as follows:

file_example = open("F:\\Directory\\AnotherDirectory\\tests\\TestingText.txt", "r")

print(file_example.read())

The output will be as follows:

Hello, world! Python is the way to coding awesomeness.

If you don't believe me, try it on your own.

Come, you will enjoy the Dark Side. We have cookies!  

Note: Special characters may not be read correctly using the read function. To read special characters correctly, you can pass the encoding parameter to the open() function and set its value to utf8 as shown below:

file_example = open("F:\\Directory\\AnotherDirectory\\tests\\TestingText.txt", "r", encoding="utf8")

Also, the function read(), as well as the helper-function readline(), can receive a number as a parameter that will determine the number of bytes to read from the file. In the case of a text file, this will be the number of characters returned.

file_example = open("F:\\Directory\\AnotherDirectory\\tests\\TestingText.txt", "r")

print(file_example.read(8))

The output will be as follows:

Hello, w  

The helper-function readline() behaves in a similar manner, but instead of returning the whole text, it will return a single line.

file_example = open("F:\\Directory\\AnotherDirectory\\tests\\TestingText.txt", "r")

print(file_example.readline())
print(file_example.readline(5))

In the script above, the first print() statement outputs the first line of the file. Because readline() keeps the trailing newline and print() adds a newline of its own, a blank line appears after it. The second print() statement then reads the first 5 characters of the next line, as shown in the output:

Hello, world! Python is the way to coding awesomeness.

If yo  

Finally, the helper-function readlines() reads all of the text and splits it into lines for easy reading. Take a look at the following example:

print(file_example.readlines())  

The output from this code will be:

Hello, world! Python is the way to coding awesomeness. If you don't believe me, try it on your own. Come, you will enjoy the Dark Side. We have cookies!

Keep in mind that the readlines() function is considered to be much slower and more inefficient than the read() function, without many benefits. One good alternative is to loop over the file object directly, which is much smoother and faster:

file_example = open("F:\\Directory\\AnotherDirectory\\tests\\TestingText.txt", "r")

for line in file_example:
    print(line)

Note that the loop consumes the file as it reads: each iteration pulls the next line from the buffer, so every line is returned only once.

Writing to Files with write()

When using this function, any information inside an existing file with the same name will be overwritten. Its behavior is similar to the read() function, but it inserts information rather than reading it.

file_example2 = open("F:\\Directory\\AnotherDirectory\\tests\\TestingTextTwo.txt", "w")

file_example2.write("This is a test. Enjoy it!\n")  # '\n' inserts a line break

file_example2.write("Another thing to know is doing it slowly.\n")

file_example2.write("One by one. Yay!")

If several lines need to be written, the writelines() helper function can be used instead:

listOfThingsToSay = ["I like things like: \n", "Ice Cream\n", "Fruits\n", "Movies\n", "Anime\n", "Naps\n", "Jerky\n"]

file_example2 = open("F:\\Directory\\AnotherDirectory\\tests\\TestingTextTwo.txt", "w")

file_example2.writelines(listOfThingsToSay)  

Note: to be able to read back what was just written, the mode needs to be set to w+, which allows reading as well as writing.

Adding to Files in Append Mode

Appending behaves much like writing. However, instead of overwriting the file, opening it in append mode adds new content to the end of the existing file.

If a text file named "TestingTextThree" contains the following information:

Some essential things are missing in life and should not be avoided.  

In order to append new text, the following code could be used:

listOfThingsToDo = ["You need at least to: \n", "Eat fried Ice Cream\n", "Go to Disney\n", "Travel to the moon\n", "Cook Pineapple Pizza\n", "Dance Salsa\n"]

file_example3 = open("F:\\Directory\\AnotherDirectory\\tests\\TestingTextThree.txt", "a+")

file_example3.writelines(listOfThingsToDo)

file_example3.seek(0)  # move the pointer back to the start before reading

for newline in file_example3:
    print(newline)

The output will be as follows:

Some essential things are missing in life and should not be avoided.

You need at least to:

Eat fried Ice Cream

Go to Disney

Travel to the moon

Cook Pineapple Pizza

Dance Salsa  

Closing Opened Files with close()

The close() function clears the memory buffer and closes the file. This means that we'll no longer be able to read from the file, and we'll have to re-open it if we want to read from it again. Also, some operating systems, such as Windows, treat opened files as locked, so it is important to clean up after yourself within your code.

Using the previously used sample code, this function is used as follows:

file_example = open("F:\\Directory\\AnotherDirectory\\tests\\TestingText.txt", "r")

print(file_example.read())

file_example.close()
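Note: although not used in the examples above, the standard with statement is a common alternative that closes the file for you automatically, even if an exception occurs:

with open("F:\\Directory\\AnotherDirectory\\tests\\TestingText.txt", "r") as file_example:
    print(file_example.read())
# the file is closed automatically here; no explicit close() call is needed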

Conclusion

Python is one of the most robust programming languages and is also one of the most widely used. It is easy to implement as well as analyze, and its versatility makes it a perfect starting point for programming novices.

In regards to file handling, Python has easy-to-use functions with rapid response times and relatively resilient error handling methods, so the development and debugging processes are much more pain-free than in other languages when it comes to working with files.

Continuum Analytics Blog: Anaconda Welcomes Maggie Key as SVP of Customer Success


Former VP of Accruent joins executive team to build out and embed customer success program within Anaconda AUSTIN, Texas – September 4, 2018 – Anaconda, Inc., the most popular Python data science platform provider with 2.5 million downloads per month, today announced the addition of Maggie Key to its executive team as SVP of Customer Success. …
Read more →

The post Anaconda Welcomes Maggie Key as SVP of Customer Success appeared first on Anaconda.

James Bennett: django-registration 3.0


Today I’m pleased to announce the release of django-registration 3.0. This is a pretty big update, and one that’s been coming for a while, so I want to take a moment to go briefly through the changes (if you want the full version, you can check out the upgrade guide in the documentation).

This also marks the retirement of the 2.x release series of django-registration; 2.5.2 is on PyPI, and I intend ...

Read full entry
