ListenData: Importing CSV File in Python

This tutorial explains how to read a CSV file in Python with pandas. It outlines many examples of loading a CSV file into Python. pandas is an awesome package for data manipulation and includes various functions to load and import data from various formats. In this post, we will see how to load comma-separated files with several use cases.

Load Package

You have to load the required package, i.e. pandas. Run the following command to load it.
import pandas as pd
Create Sample Data for Import

The program below creates a sample data frame which can be used further for demonstration.
dt = {'ID': [11, 12, 13, 14, 15],
            'first_name': ['David', 'Jamie', 'Steve', 'Stevart', 'John'],
            'company': ['Aon', 'TCS', 'Google', 'RBS', '.'],
            'salary': [74, 76, 96, 71, 78]}
mydt = pd.DataFrame(dt, columns = ['ID', 'first_name', 'company', 'salary'])
The sample data looks like below - 
Sample Data
Save data as CSV in the working directory

The following command tells python to write data in CSV format.
mydt.to_csv('workingfile.csv', index=False)
Example 1 : Read CSV file with header row

This is the basic syntax of the read_csv() function. You just need to mention the filename.
mydata  = pd.read_csv("workingfile.csv")
Example 2 : Read CSV file without header row
mydata0  = pd.read_csv("workingfile.csv", header = None)
If you specify "header = None", pandas assigns a series of numbers starting from 0 to (number of columns - 1) as column names. See the output shown below -
Output
Example 3 : Specify missing values

The na_values= option is used to treat certain values as blank / missing values.
mydata00  = pd.read_csv("workingfile.csv", na_values=['.'])
Set Missing Values

Example 4 : Set Index Column
mydata01  = pd.read_csv("workingfile.csv", index_col ='ID')
Python : Setting Index Column
As you can see in the above image, the column ID has been set as index column.

Example 5 : Read CSV File from URL

You can directly read data from the CSV file that is stored on a web link.
mydata02  = pd.read_csv("http://winterolympicsmedals.com/medals.csv")

Example 6 : Skip First 5 Rows While Importing CSV
mydata03  = pd.read_csv("http://winterolympicsmedals.com/medals.csv", skiprows=5)
It reads data from the 6th row onward (the 6th row is treated as the header row).

Example 7 : Skip Last 5 Rows While Importing CSV
mydata04  = pd.read_csv("http://winterolympicsmedals.com/medals.csv", skipfooter=5, engine='python')
In the above code, we are excluding the bottom 5 rows using the skipfooter= parameter (it requires the python parser engine).

Example 8 : Read only first 5 rows
mydata05  = pd.read_csv("http://winterolympicsmedals.com/medals.csv", nrows=5)
Using the nrows= option, you can load only the top N rows.

Example 9 : Interpreting "," as thousands separator
mydata06 = pd.read_csv("http://winterolympicsmedals.com/medals.csv", thousands=",")
Example 10 : Read only specific columns
mydata07 = pd.read_csv("http://winterolympicsmedals.com/medals.csv", usecols=(1,5,7))
The above code reads only the columns at positions 1, 5 and 7, i.e. the second, sixth and eighth columns, since column positions start at 0.

Example 11 : Read some rows and columns
mydata08 = pd.read_csv("http://winterolympicsmedals.com/medals.csv", usecols=(1,5,7),nrows=5)
In the above command, we have combined the usecols= and nrows= options. It reads only the first 5 rows of the selected columns.

Example 12 : Read file with semi colon delimiter
mydata09 = pd.read_csv("file_path", sep = ';')
Using the sep= parameter of the read_csv() function, you can import a file with a semi-colon delimiter.



ListenData: Importing Data into Python

This tutorial explains various methods to read data into Python. Data can be in any of the popular formats - CSV, TXT, XLS/XLSX (Excel), sas7bdat (SAS), Stata, Rdata (R) etc. Loading data into the Python environment is the first step in analyzing data.
Import Data into Python
While importing external files, we need to check the following points (a sketch covering them follows the list) -
  1. Check whether header row exists or not
  2. Treatment of special values as missing values
  3. Consistent data type in a variable (column)
  4. Date Type variable in consistent date format.
  5. No truncation of rows while reading external data
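A rough sketch of how these checks can map onto read_csv() arguments; the file name and column names here are hypothetical, for illustration only.
import pandas as pd

mydata = pd.read_csv(
    "example.csv",
    header=0,                   # 1. first row contains column names
    na_values=['.', 'NA'],      # 2. treat these special values as missing
    dtype={'salary': 'float'},  # 3. force a consistent data type for a column
    parse_dates=['join_date'],  # 4. parse a date column in a consistent format
)
print(len(mydata))              # 5. verify no rows were truncated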

Install and Load pandas Package

pandas is a powerful data analysis package. It makes data exploration and manipulation easy. It has several functions to read data from various sources.

If you are using Anaconda, pandas must be already installed. You need to load the package by using the following command -
import pandas as pd
If pandas package is not installed, you can install it by running the following code in Ipython Console. If you are using Spyder, you can submit the following code in Ipython console within Spyder.
!pip install pandas
If you are using Anaconda, you can try the following line of code to install pandas -
!conda install pandas
1. Import CSV files

It is important to note that a single backslash does not work when specifying the file path. You need to either change it to a forward slash or add one more backslash, like below.
import pandas as pd
mydata= pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv")
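Alternatively, you can mark the path as a raw string so that backslashes are not treated as escape characters. A small sketch using the same file path:
import pandas as pd

# The leading r makes this a raw string, so single backslashes are kept as-is
mydata = pd.read_csv(r"C:\Users\Deepanshu\Documents\file1.csv")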
If no header (title) in raw data file
mydata1  = pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv", header = None)
You need to include header = None option to tell Python there is no column name (header) in data.

Add Column Names

We can include column names by using names= option.
mydata2  = pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv", header = None, names = ['ID', 'first_name', 'salary'])
The variable names can also be added separately by using the following command.
mydata1.columns = ['ID', 'first_name', 'salary']


2. Import File from URL

You don't need to perform additional steps to fetch data from a URL. Simply put the URL in the read_csv() function (applicable only for CSV files accessible via a URL).
mydata  = pd.read_csv("http://winterolympicsmedals.com/medals.csv")

3. Read Text File 

We can use read_table() function to pull data from text file. We can also use read_csv() with sep= "\t" to read data from tab-separated file.
mydata = pd.read_table("C:\\Users\\Deepanshu\\Desktop\\example2.txt")
mydata  = pd.read_csv("C:\\Users\\Deepanshu\\Desktop\\example2.txt", sep ="\t")

4. Read Excel File

The read_excel() function can be used to import excel data into Python.
mydata = pd.read_excel("https://www.eia.gov/dnav/pet/hist_xls/RBRTEd.xls", sheet_name="Data 1", skiprows=2)
If you do not specify the name of the sheet in the sheet_name= option, it takes the first sheet by default. (In older pandas versions this parameter was called sheetname.)

5. Read delimited file

Suppose you need to import a file that is separated with white spaces.
mydata2 = pd.read_table("http://www.ssc.wisc.edu/~bhansen/econometrics/invest.dat", sep="\s+", header = None)
To include variable names, use the names= option like below -
mydata3 = pd.read_table("http://www.ssc.wisc.edu/~bhansen/econometrics/invest.dat", sep="\s+", names=['a', 'b', 'c', 'd'])
6. Read SAS File

We can import SAS data file by using read_sas() function.
mydata4 = pd.read_sas('cars.sas7bdat')

7. Read Stata File

We can load Stata data file via read_stata() function.
mydata41 = pd.read_stata('cars.dta')
8. Import R Data File

Using the pyreadr package, you can load .RData and .Rds format files, which in general contain R data frames. You can install this package using the command below -
pip install pyreadr
With the use of read_r( ) function, we can import R data format files.
import pyreadr
result = pyreadr.read_r('C:/Users/sampledata.RData')
print(result.keys()) # let's check what objects we got
df1 = result["df1"] # extract the pandas data frame for object df1
Similarly, you can read .Rds formatted file.
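A minimal sketch for an .Rds file, assuming a hypothetical file path. pyreadr returns a dictionary, and since an .Rds file holds a single unnamed object, it is stored under the key None:
import pyreadr

result = pyreadr.read_r('C:/Users/sampledata.Rds')  # hypothetical path
df2 = result[None]  # .Rds files store one unnamed object, available under the key None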
 
9. Read SQL Table

We can extract a table from a SQL database (e.g. SQLite, Teradata, SQL Server). See the program below, which queries a SQLite database -
import sqlite3
import pandas as pd

conn = sqlite3.connect('C:/Users/Deepanshu/Downloads/flight.db')
query = "SELECT * FROM flight;"
results = pd.read_sql(query, con=conn)
print(results.head())

10. Read sample of rows and columns

By specifying nrows= and usecols=, you can fetch specified number of rows and columns.
mydata7  = pd.read_csv("http://winterolympicsmedals.com/medals.csv", nrows=5, usecols=(1,5,7))
nrows=5 means only the first 5 rows are imported and usecols= specifies the column positions you want to import.

11. Skip rows while importing

Suppose you want to skip the first 5 rows and read data from the 6th row onward (the 6th row would be treated as the header row).
mydata8  = pd.read_csv("http://winterolympicsmedals.com/medals.csv", skiprows=5)
12. Specify values as missing values

By including the na_values= option, you can specify values to be treated as missing. In this case, we are telling pandas to consider dot (.) as a missing value.
mydata9  = pd.read_csv("workingfile.csv", na_values=['.'])

ListenData: Install Python Package

Python is one of the most popular programming languages for data science and analytics. It is widely used for a variety of tasks in startups and many multi-national organizations. The beauty of this programming language is that it is open source, which means it is available for free and has a very active community of developers across the world. Python developers share their solutions in the form of packages or modules with other Python users. This tutorial explains various ways to install a Python package.

Ways to Install Python Package


Method 1 : If Anaconda is already installed on your System

Anaconda is a data science platform which comes with popular Python packages pre-installed and a powerful IDE (Spyder) that has a user-friendly interface for writing Python scripts.

If Anaconda is installed on your system (laptop), click on Anaconda Prompt as shown in the image below.

Anaconda Prompt

To install a python package or module, enter the code below in Anaconda Prompt -
pip install package-name
Install Python Package using PIP Windows

Method 2 : NO Need of Anaconda


1. Open RUN box using shortcut Windows Key + R

2. Enter cmd in the RUN box
Command Prompt

Once you press OK, it will show command prompt screen.



3. Search for the folder named Scripts where the pip executable is stored.

Scripts Folder

4. In command prompt, type cd <file location of Scripts folder>

cd refers to change directory.

For example, folder location is C:\Users\DELL\Python37\Scripts so you need to enter the following line in command prompt :
cd C:\Users\DELL\Python37\Scripts 

Change Directory

5. Type pip install package-name

Install Package via PIP command prompt


Method 3 : Install Python Package from IPython console

Make sure to use ! before pip when you enter the command below in IPython console window. Otherwise it would return syntax error.
!pip install package_name
The ! prefix tells Python to run a shell command.


Syntax Error : Installing Package using PIP

Some users face the error "SyntaxError: invalid syntax" while installing packages. To work around this issue, run the command below in the command prompt -
python -m pip install package-name
python -m pip tells Python to locate the pip module and run it as a script.

Install Specific Versions of Python Package
python -m pip install Packagename==1.3     # specific version
python -m pip install "Packagename>=1.3"  # version greater than or equal to 1.3

How to load or import package or module

Once a package is installed, the next step is to start using it. In other words, you need to import the package after installing it. There are several ways to load a package or module in Python:

1. import math loads the module math. Then you can use any function defined in math module using math.function. Refer the example below -
import math
math.sqrt(4)

2. from math import * loads the module math. Now we don't need to specify the module to use functions of this module.
from math import *
sqrt(4)

3. from math import sqrt, cos imports the selected functions of the module math.

4.import math as m imports the math module under the alias m.
m.sqrt(4)

Other Useful Commands
Description                            Command
To uninstall a package                 pip uninstall package
To upgrade a package                   pip install --upgrade package
To search a package                    pip search "package-name"
To check all the installed packages    pip list

ListenData: Python for Data Science : Learn in 3 Days

This tutorial helps you to learn Data Science with Python with examples. Python is an open source language and it is widely used as a high-level programming language for general-purpose programming. It has gained high popularity in data science world. As data science domain is rising these days, IBM recently predicted demand for data science professionals would rise by more than 25% by 2020. In the PyPL Popularity of Programming language index, Python scored second rank with a 14 percent share. In advanced analytics and predictive analytics market, it is ranked among top 3 programming languages for advanced analytics.
Data Science with Python Tutorial

Table of Contents
  1. Getting Started with Python
  2. Data Structures and Conditional Statements
  3. Python Libraries
  4. Data Manipulation using Pandas
  5. Data Science with Python

Python 2.7 vs 3.6

Google yields thousands of articles on this topic. Some bloggers are opposed to and some are in favor of 2.7. If you filter your search criteria and look only for recent articles (late 2016 onwards), you would see that the majority of bloggers are in favor of Python 3.6. See the following reasons to support Python 3.6.

1. The official end-of-life date for Python 2.7 is the year 2020. Afterwards there will be no support from the community. It does not make sense to start learning 2.7 today.

2. Python 3.6 supports 95% of top 360 python packages and almost 100% of top packages for data science.

What's new in Python 3.6

It is cleaner and faster. It is a language for the future. It fixed major issues with the Python 2 series. Python 3 was first released in 2008, and robust versions of the Python 3 series have been released for 9 years.

Key Takeaway
You should go for Python 3.6. In terms of learning Python, there are no major differences between Python 2.7 and 3.6. It is not too difficult to move between the two with a few adjustments. Your focus should be on learning Python as a language.

Python for Data Science : Introduction

Python is widely used and very popular for a variety of software engineering tasks such as website development, cloud architecture, back-end development etc. It is equally popular in the data science world. In the advanced analytics world, there have been several debates on R vs. Python. There are some areas, such as the number of libraries for statistical analysis, where R wins over Python, but Python is catching up very fast. With the popularity of big data and data science, Python has become the first programming language of data scientists.

There are several reasons to learn Python. Some of them are as follows -
  1. Python runs well in automating various steps of a predictive model. 
  2. Python has awesome robust libraries for machine learning, natural language processing, deep learning, big data and artificial Intelligence. 
  3. Python wins over R when it comes to deploying machine learning models in production.
  4. It can be easily integrated with big data frameworks such as Spark and Hadoop.
  5. Python has a great online community support.
Do you know these sites are developed in Python?
  1. YouTube
  2. Instagram
  3. Reddit
  4. Dropbox
  5. Disqus

How to Install Python

There are two ways to download and install Python
  1. Download Anaconda. It comes with Python software along with preinstalled popular libraries.
  2. Download Python from its official website. You have to manually install libraries.

Recommended: Go for the first option and download Anaconda. It saves a lot of time in learning and coding Python.

Coding Environments

Anaconda comes with two popular IDEs:
  1. Jupyter (Ipython) Notebook
  2. Spyder
Spyder. It is like RStudio for Python. It provides an environment in which writing Python code is user-friendly. If you are a SAS user, you can think of it as SAS Enterprise Guide / SAS Studio. It comes with a syntax editor where you can write programs. It has a console to check each and every line of code. Under the 'Variable explorer', you can access your created data files and functions. I highly recommend Spyder!
Spyder - Python Coding Environment
Jupyter (Ipython) Notebook

Jupyter is comparable to R Markdown. It is useful when you need to present your work to others or when you need to create a step-by-step project report, as it can combine code, output, words, and graphics.

Spyder Shortcut Keys

The following is a list of some useful Spyder shortcut keys which make you more productive.
  1. Press F5 to run the entire script
  2. Press F9 to run selection or line 
  3. Press Ctrl + 1 to comment / uncomment
  4. Go to front of function and then press Ctrl + I to see documentation of the function
  5. Run %reset -f to clean workspace
  6. Ctrl + Left click on object to see source code 
  7. Ctrl+Enter executes the current cell.
  8. Shift+Enter executes the current cell and advances the cursor to the next cell

List of arithmetic operators with examples

Arithmetic Operator    Operation             Example
+                      Addition              10 + 2 = 12
-                      Subtraction           10 - 2 = 8
*                      Multiplication        10 * 2 = 20
/                      Division              10 / 2 = 5.0
%                      Modulus (Remainder)   10 % 3 = 1
**                     Power                 10 ** 2 = 100
//                     Floor Division        17 // 3 = 5
(x + (d-1)) // d       Ceiling               (17 + (3-1)) // 3 = 6
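A small sketch of floor division and the ceiling-via-floor-division trick from the table, alongside the standard-library equivalent:
x, d = 17, 3
print(x // d)               # floor division: 5
print((x + (d - 1)) // d)   # ceiling without importing math: 6

import math
print(math.ceil(x / d))     # 6, the same result using the standard library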

Basic Programs

Example 1
#Basics
x = 10
y = 3
print("10 divided by 3 is", x/y)
print("remainder after 10 divided by 3 is", x%y)
Result :
10 divided by 3 is 3.3333333333333335
remainder after 10 divided by 3 is 1

Example 2
x = 100
x > 80 and x <=95
x > 35 or x < 60
x > 80 and x <=95
Out[45]: False
x > 35 or x < 60
Out[46]: True

Comparison & Logical Operator    Description                              Example
>                                Greater than                             5 > 3 returns True
<                                Less than                                5 < 3 returns False
>=                               Greater than or equal to                 5 >= 3 returns True
<=                               Less than or equal to                    5 <= 3 returns False
==                               Equal to                                 5 == 3 returns False
!=                               Not equal to                             5 != 3 returns True
and                              Check both the conditions                x > 18 and x <= 35
or                               If at least one condition holds True     x > 35 or x < 60
not                              Opposite of condition                    not(x > 7)

Assignment Operators

Assignment operators are used to assign a value to a variable. For example, x += 25 means x = x + 25.
x = 100
y = 10
x += y
print(x)
110
In this case, x+=y implies x=x+y which is x = 100 + 10.
Similarly, you can use x -= y, x *= y and x /= y.

Python Data Structure

In every programming language, it is important to understand the data structures. Following are some data structures used in Python.

1. List

It is a sequence of multiple values. It allows us to store different types of data such as integer, float, string etc. See the examples of lists below. The first is an integer list containing only integers. The second is a string list containing only string values. The third is a mixed list containing integer, string and float values.
  1. x = [1, 2, 3, 4, 5]
  2. y = ['A', 'O', 'G', 'M']
  3. z = ['A', 4, 5.1, 'M']
Get List Item

We can extract list items using indexes. The index starts from 0 and ends at (number of elements - 1).
x = [1, 2, 3, 4, 5]
x[0]
x[1]
x[4]
x[-1]
x[-2]
x[0]
Out[68]: 1

x[1]
Out[69]: 2

x[4]
Out[70]: 5

x[-1]
Out[71]: 5

x[-2]
Out[72]: 4

x[0] picks first element from list. Negative sign tells Python to search list item from right to left. x[-1] selects the last element from list.

You can select multiple elements from a list using the following method
x[:3] returns [1, 2, 3]
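A few more slicing patterns on the same list, as a quick sketch:
x = [1, 2, 3, 4, 5]
print(x[:3])    # [1, 2, 3] - first three elements
print(x[1:4])   # [2, 3, 4] - index 1 up to (but excluding) index 4
print(x[-2:])   # [4, 5]    - last two elements
print(x[::2])   # [1, 3, 5] - every second element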

2. Tuple

A tuple is similar to a list in the sense that it is a sequence of elements. The differences between a list and a tuple are as follows -
  1. A tuple cannot be changed once constructed, whereas a list can be modified.
  2. A tuple is created by placing comma-separated values inside parentheses ( ), whereas a list is created inside square brackets [ ].
Examples
K = (1,2,3)
State = ('Delhi','Maharashtra','Karnataka')
Perform for loop on Tuple
for i in State:
    print(i)
Delhi
Maharashtra
Karnataka
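A minimal sketch of the immutability difference: assigning to a tuple element raises a TypeError, while the same assignment on a list works.
K = (1, 2, 3)
L = [1, 2, 3]

L[0] = 99         # works: lists are mutable
try:
    K[0] = 99     # fails: tuples cannot be changed once constructed
except TypeError as e:
    print("Error:", e)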
Functions

Like print(), you can create your own custom functions, also called user-defined functions. They help you automate repetitive tasks and call reusable code in an easier way.

Rules to define a function
  1. A function starts with the def keyword followed by the function name and ( )
  2. The function body starts after a colon (:) and is indented
  3. The keyword return exits the function and gives back the value of the expression that follows it.
def sum_fun(a, b):
    result = a + b
    return result 
z = sum_fun(10, 15)
Result : z = 25

Suppose you want Python to assume 0 as the default value if no value is specified for parameter b.
def sum_fun(a, b=0):
    result = a + b
    return result
z = sum_fun(10)
In the above function, b defaults to 0 if no value is provided. This does not mean that only 0 can be used; you can still call it as z = sum_fun(10, 15).

Conditional Statements (if else)

Conditional statements are commonly used in coding. These are IF-ELSE statements. They can be read as: "if a condition holds true, then execute something; else execute something else".

Note : The if and else statements end with a colon :

Example
k = 27
if k%5 == 0:
  print('Multiple of 5')
else:
  print('Not a Multiple of 5')
Result : Not a Multiple of 5
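When there are more than two branches, an elif clause can be added. A small sketch:
k = 27
if k % 15 == 0:
    print('Multiple of both 3 and 5')
elif k % 3 == 0:
    print('Multiple of 3')
else:
    print('Neither a multiple of 3 nor 5')
Result : Multiple of 3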

Popular python packages for Data Analysis & Visualization

Some of the leading packages in Python along with equivalent libraries in R are as follows-
  1. pandas. For data manipulation and data wrangling. A collection of functions to understand and explore data. It is the counterpart of the dplyr and reshape2 packages in R.
  2. NumPy. For numerical computing. It's a package for efficient array computations. It allows us to do some operations on an entire column or table in one line. It is roughly approximate to Rcpp package in R which eliminates the limitation of slow speed in R. Numpy Tutorial
  3. Scipy.  For mathematical and scientific functions such as integration, interpolation, signal processing, linear algebra, statistics, etc. It is built on Numpy.
  4. Scikit-learn. A collection of machine learning algorithms. It is built on Numpy and Scipy. It can perform all the techniques that can be done in R using glm, knn, randomForest, rpart, e1071 packages.
  5. Matplotlib. For data visualization. It's a leading package for graphics in Python. It is equivalent to ggplot2 package in R.
  6. Statsmodels. For statistical and predictive modeling. It includes various functions to explore data and generate descriptive and predictive analytics. It allows users to run descriptive statistics, methods to impute missing values, statistical tests and take table output to HTML format.
  7. pandasql.  It allows SQL users to write SQL queries in Python. It is very helpful for people who love writing SQL queries to manipulate data. It is equivalent to the sqldf package in R.
Most of the above packages come preinstalled with Anaconda / Spyder.
    Comparison of Python and R Packages by Data Mining Task

    Task                   Python Package               R Package
    IDE                    Rodeo / Spyder               RStudio
    Data Manipulation      pandas                       dplyr and reshape2
    Machine Learning       Scikit-learn                 glm, knn, randomForest, rpart, e1071
    Data Visualization     ggplot + seaborn + bokeh     ggplot2
    Character Functions    Built-In Functions           stringr
    Reproducibility        Jupyter                      Knitr
    SQL Queries            pandasql                     sqldf
    Working with Dates     datetime                     lubridate
    Web Scraping           beautifulsoup                rvest

    Popular Python Commands

    The commands below would help you to install and update new and existing packages. Let's say, you want to install / uninstall pandas package.

    Run these commands from IPython console window. Don't forget to add ! before pip otherwise it would return syntax error.

    Install Package
    !pip install pandas

    Uninstall Package
    !pip uninstall pandas

    Show Information about Installed Package
    !pip show pandas

    List of Installed Packages
    !pip list

    Upgrade a package
    !pip install --upgrade pandas

      How to import a package

      There are multiple ways to import a package in Python. It is important to understand the difference between these styles.

      1. import pandas as pd
      It imports the package pandas under the alias pd. The DataFrame function in the pandas package is then called as pd.DataFrame.

      2. import pandas
      It imports the package without using an alias, but here the DataFrame function is called with the full package name, pandas.DataFrame.

      3. from pandas import *
      It imports the whole package and the DataFrame function is executed simply by typing DataFrame. This sometimes creates confusion when the same function name exists in more than one package.

      Pandas Data Structures : Series and DataFrame

      In pandas package, there are two data structures - series and dataframe. These structures are explained below in detail -
      1. Series is a one-dimensional array. You can access individual elements of a series using position. It's similar to vector in R.
      In the example below, we are generating 5 random values.
      import pandas as pd
      import numpy as np
      s1 = pd.Series(np.random.randn(5))
      s1
      0   -2.412015
      1   -0.451752
      2    1.174207
      3    0.766348
      4   -0.361815
      dtype: float64

      Extract first and second value

      You can get a particular element of a series using index value. See the examples below -

      s1[0]
      -2.412015
      s1[1]
      -0.451752
      s1[:3]
      0   -2.412015
      1   -0.451752
      2    1.174207

      2. DataFrame

      It is equivalent to data.frame in R. It is a 2-dimensional data structure that can store data of different types such as characters, integers, floating point values and factors. Those who are well-conversant with MS Excel can think of a data frame as an Excel spreadsheet.

      Comparison of Data Type in Python and Pandas

      The following table shows how Python and pandas package stores data.

      Data Type                                Pandas        Standard Python
      For character variable                   object        string
      For categorical variable                 category      -
      For numeric variable without decimals    int64         int
      For numeric variable with decimals       float64       float
      For date time variables                  datetime64    -

      Important Pandas Functions

      The table below shows a comparison of pandas functions with R functions for various data wrangling and manipulation tasks. It will help you memorize pandas functions. It's very handy information for programmers who are new to Python, and it includes solutions for most of the frequently used data exploration tasks.

      Function                              R                              Python (pandas package)
      Installing a package                  install.packages('name')      !pip install name
      Loading a package                     library(name)                  import name as other_name
      Checking working directory            getwd()                        import os; os.getcwd()
      Setting working directory             setwd()                        os.chdir()
      List files in a directory             dir()                          os.listdir()
      Remove an object                      rm('name')                     del object
      Select Variables                      select(df, x1, x2)             df[['x1', 'x2']]
      Drop Variables                        select(df, -(x1:x2))           df.drop(['x1', 'x2'], axis = 1)
      Filter Data                           filter(df, x1 >= 100)          df.query('x1 >= 100')
      Structure of a DataFrame              str(df)                        df.info()
      Summarize dataframe                   summary(df)                    df.describe()
      Get row names of dataframe "df"       rownames(df)                   df.index
      Get column names                      colnames(df)                   df.columns
      View Top N rows                       head(df, N)                    df.head(N)
      View Bottom N rows                    tail(df, N)                    df.tail(N)
      Get dimension of data frame           dim(df)                        df.shape
      Get number of rows                    nrow(df)                       df.shape[0]
      Get number of columns                 ncol(df)                       df.shape[1]
      Length of data frame                  length(df)                     len(df)
      Get random 3 rows from dataframe      sample_n(df, 3)                df.sample(n=3)
      Get random 10% rows                   sample_frac(df, 0.1)           df.sample(frac=0.1)
      Check Missing Values                  is.na(df$x)                    pd.isnull(df.x)
      Sorting                               arrange(df, x1, x2)            df.sort_values(['x1', 'x2'])
      Rename Variables                      rename(df, newvar = x1)        df.rename(columns={'x1': 'newvar'})
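
      A quick sketch exercising a few of the pandas calls from the table on a small, hypothetical dataframe:
      import pandas as pd

      df = pd.DataFrame({'x1': [120, 80, 150], 'x2': ['a', 'b', 'c']})  # illustrative data

      df.info()                       # structure of the dataframe (str(df) in R)
      print(df.shape)                 # dimensions (dim(df) in R)
      print(df.query('x1 >= 100'))    # filter rows (filter(df, x1 >= 100) in R)
      print(df.sort_values(['x1']))   # sorting (arrange(df, x1) in R)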


      Data Manipulation with pandas - Examples

      1. Import Required Packages

      You can import required packages using import statement. In the syntax below, we are asking Python to import numpy and pandas package. The 'as' is used to alias package name.
      import numpy as np
      import pandas as pd

      2. Build DataFrame

      We can build dataframe using DataFrame() function of pandas package.
      mydata = {'productcode': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
              'sales': [1010, 1025.2, 1404.2, 1251.7, 1160, 1604.8],
              'cost' : [1020, 1625.2, 1204, 1003.7, 1020, 1124]}
      df = pd.DataFrame(mydata)
       In this dataframe, we have three variables - productcode, sales, cost.
      Sample DataFrame

      To import data from CSV file


      You can use read_csv() function from pandas package to get data into python from CSV file.
      mydata= pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv")
      Make sure you use double backslash when specifying path of CSV file. Alternatively, you can use forward slash to mention file path inside read_csv() function.

      Detailed Tutorial : Import Data in Python

      3. To see number of rows and columns

      You can run the command below to find out number of rows and columns.
      df.shape
       Result : (6, 3). It means 6 rows and 3 columns.

      4. To view first 3 rows

      The df.head(N) function can be used to check out the first N rows.
      df.head(3)
           cost productcode   sales
      0  1020.0           AA  1010.0
      1  1625.2           AA  1025.2
      2  1204.0           AA  1404.2

      5. Select or Drop Variables

      To keep a single variable, you can write in any of the following three methods -
      df.productcode
      df["productcode"]
      df.loc[: , "productcode"]
      To select variable by column position, you can use df.iloc function. In the example below, we are selecting second column. Column Index starts from 0. Hence, 1 refers to second column.
      df.iloc[: , 1]
      We can keep multiple variables by specifying the desired variables inside [ ]. Also, we can make use of the df.loc[ ] indexer.
      df[["productcode", "cost"]]
      df.loc[ : , ["productcode", "cost"]]

      Drop Variable

      We can remove variables by using df.drop() function. See the example below -
      df2 = df.drop(['sales'], axis = 1)

      6. To summarize data frame

      To summarize or explore data, you can submit the command below.
      df.describe()
                     cost        sales
      count      6.000000      6.00000
      mean    1166.150000   1242.65000
      std      237.926793    230.46669
      min     1003.700000   1010.00000
      25%     1020.000000   1058.90000
      50%     1072.000000   1205.85000
      75%     1184.000000   1366.07500
      max     1625.200000   1604.80000

      To summarise all the character variables, you can use the following script.
      df.describe(include=['O'])
      Similarly, you can use df.describe(include=['float64']) to view summary of all the numeric variables with decimals.

      To select only a particular variable, you can write the following code -
      df.productcode.describe()
      OR
      df["productcode"].describe()
      count      6
      unique     2
      top       BB
      freq       3
      Name: productcode, dtype: object

      7. To calculate summary statistics

      We can manually find out summary statistics such as count, mean, median by using commands below
      df.sales.mean()
      df.sales.median()
      df.sales.count()
      df.sales.min()
      df.sales.max()

      8. Filter Data

      Suppose you are asked to apply condition - productcode is equal to "AA" and sales greater than or equal to 1250.
      df1 = df[(df.productcode == "AA") & (df.sales >= 1250)]
      It can also be written like :
      df1 = df.query('(productcode == "AA") & (sales >= 1250)')
      In the second query, we do not need to specify DataFrame along with variable name.

      9. Sort Data

      In the code below, we are arranging the data in ascending order by sales.
      df.sort_values(['sales'])

      10.  Group By : Summary by Grouping Variable

      Like SQL GROUP BY, you want to summarize continuous variable by classification variable. In this case, we are calculating average sale and cost by product code.
      df.groupby(df.productcode).mean()
                          cost        sales
      productcode
      AA           1283.066667  1146.466667
      BB           1049.233333  1338.833333
      Instead of summarising multiple variables, you can run it for a single variable, i.e. sales. Submit the following script.
      df["sales"].groupby(df.productcode).mean()

      11. Define Categorical Variable

      Let's create a classification variable - id which contains only 3 unique values - 1/2/3.
      df0 = pd.DataFrame({'id': [1, 1, 2, 3, 1, 2, 2]})
      Let's define it as a categorical variable.
      We can use the astype() function to make id a categorical variable.
      df0.id = df0["id"].astype('category')
      Summarize this classification variable to check descriptive statistics.
      df0.describe()
              id
      count    7
      unique   3
      top      2
      freq     3

      Frequency Distribution

      You can calculate the frequency distribution of a categorical variable. It is one of the methods to explore a categorical variable.
      df['productcode'].value_counts()
      BB    3
      AA    3

      12. Generate Histogram

      A histogram is one of the methods to check the distribution of a continuous variable. In the figure shown below, there are two values for the variable 'sales' in the range 1000-1100. In the remaining intervals, there is only a single value. In this case, there are only 5 values. If you have a large dataset, you can plot a histogram to identify outliers in a continuous variable.
      df['sales'].hist()
      Histogram

      13. BoxPlot

      A boxplot is a method to visualize a continuous or numeric variable. It shows the minimum, Q1, Q2 (median), Q3 and maximum values, along with the IQR, in a single graph.
      df.boxplot(column='sales')
      BoxPlot

      Detailed Tutorial :Data Analysis with Pandas Tutorial

      Data Science using Python - Examples

      In this section, we cover how to perform data mining and machine learning algorithms with Python. sklearn is the most frequently used library for running data mining and machine learning algorithms. We will also cover statsmodels library for regression techniques. statsmodels library generates formattable output which can be used further in project report and presentation.

      1. Install the required libraries

      Import the following libraries before reading or exploring data
      #Import required libraries
      import pandas as pd
      import statsmodels.api as sm
      import numpy as np

      2. Download and import data into Python

      With the use of the pandas library, we can easily get data from the web into Python.
      # Read data from web
      df = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
      Variable   Type          Description
      gre        Continuous    Graduate Record Exam score
      gpa        Continuous    Grade Point Average
      rank       Categorical   Prestige of the undergraduate institution
      admit      Binary        Admission in graduate school

      The binary variable admit is a target variable.

      3. Explore Data

      Let's explore data. We'll answer the following questions -
      1. How many rows and columns in the data file?
      2. What are the distribution of variables?
      3. Check if any outlier(s)
      4. If outlier(s), treat them
      5. Check if any missing value(s)
      6. Impute Missing values (if any)
      # See no. of rows and columns
      df.shape
      Result : 400 rows and 4 columns

      In the code below, we rename the variable rank to 'position' as rank clashes with the rank() method in pandas.
      # rename rank column
      df = df.rename(columns={'rank': 'position'}) 
      Summarize and plot all the columns.
      # Summarize
      df.describe()
      # plot all of the columns
      df.hist()
      Categorical variable Analysis

      It is important to check the frequency distribution of a categorical variable. It helps to answer the question of whether the data is skewed.
      # Summarize
      df.position.value_counts(ascending=True)
      1     61
      4     67
      3    121
      2    151

      Generating Crosstab 

      By looking at the cross tabulation report, we can check whether we have enough events against each unique value of the categorical variable.
      pd.crosstab(df['admit'], df['position'])
      position   1   2   3   4
      admit
      0         28  97  93  55
      1         33  54  28  12

      Number of Missing Values

      We can write a simple loop to figure out the number of blank values in all variables in a dataset.
      for i in list(df.columns) :
          k = sum(pd.isnull(df[i]))
          print(i, k)
      In this case, there are no missing values in the dataset.
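
      As an aside, pandas can produce the same per-column count of missing values in a single call:
      # Count of missing values in each column
      df.isnull().sum()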

      4. Logistic Regression Model

      Logistic regression is a special type of regression where the target variable is categorical in nature and the independent variables can be discrete or continuous. In this post, we will demonstrate only binary logistic regression, where the target variable takes only binary values. Unlike linear regression, a logistic regression model returns the probability of the target variable. It assumes a binomial distribution of the dependent variable; in other words, it belongs to the binomial family.

      In python, we can write R-style model formula y ~ x1 + x2 + x3 using  patsy and statsmodels libraries. In the formula, we need to define variable 'position' as a categorical variable by mentioning it inside capital C(). You can also define reference category using reference= option.
      #Reference Category
      from patsy import dmatrices, Treatment
      y, X = dmatrices('admit ~ gre + gpa + C(position, Treatment(reference=4))', df, return_type = 'dataframe')
      It returns two datasets - X and y. The dataset 'y' contains variable admit which is a target variable. The other dataset 'X' contains Intercept (constant value), dummy variables for Treatment, gre and gpa. Since 4 is set as a reference category, it will be 0 against all the three dummy variables. See sample below -
      P  P_1 P_2 P_3
      3 0 0 1
      3 0 0 1
      1 1 0 0
      4 0 0 0
      4 0 0 0
      2 0 1 0


      Split Data into two parts

      80% of data goes to training dataset which is used for building model and 20% goes to test dataset which would be used for validating the model.
      from sklearn.model_selection import train_test_split
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

      Build Logistic Regression Model

      By default, the regression without formula style does not include intercept. To include it, we already have added intercept in X_train which would be used as a predictor.
      #Fit Logit model
      logit = sm.Logit(y_train, X_train)
      result = logit.fit()

      #Summary of Logistic regression model
      result.summary()
      result.params
                                Logit Regression Results
      ==============================================================================
      Dep. Variable:            admit     No. Observations:            320
      Model:                    Logit     Df Residuals:                315
      Method:                   MLE       Df Model:                    4
      Date:          Sat, 20 May 2017     Pseudo R-squ.:               0.03399
      Time:                  19:57:24     Log-Likelihood:              -193.49
      converged:                 True     LL-Null:                     -200.30
                                          LLR p-value:                 0.008627
      =======================================================================================
                              coef    std err        z     P>|z|    [95.0% Conf. Int.]
      ---------------------------------------------------------------------------------------
      C(position)[T.1]      1.4933      0.440    3.392     0.001       0.630     2.356
      C(position)[T.2]      0.6771      0.373    1.813     0.070      -0.055     1.409
      C(position)[T.3]      0.1071      0.410    0.261     0.794      -0.696     0.910
      gre                   0.0005      0.001    0.442     0.659      -0.002     0.003
      gpa                  -0.4613      0.214   -2.152     0.031      -0.881    -0.041
      =======================================================================================

      Confusion Matrix and Odd Ratio

      The odds ratio is the exponential of the parameter estimates.
      #Confusion Matrix
      result.pred_table()
      #Odd Ratio
      np.exp(result.params)

      Prediction on Test Data
      In this step, we take estimates of logit model which was built on training data and then later apply it into test data.
      #prediction on test data
      y_pred = result.predict(X_test)

      Calculate Area under Curve (ROC)
      # AUC on test data
      from sklearn.metrics import roc_curve, auc
      false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
      auc(false_positive_rate, true_positive_rate)
      Result : AUC = 0.6763

      Calculate Accuracy Score
      from sklearn.metrics import accuracy_score
      accuracy_score([ 1 if p > 0.5 else 0 for p in y_pred ], y_test)

      Decision Tree Model

      Decision trees can have a continuous or categorical target variable. When it is continuous, the tree is called a regression tree, and when it is categorical, it is called a classification tree. At each step, the algorithm selects the variable that best splits the set of values. There are several criteria to find the best split, such as Gini, Entropy, C4.5 and Chi-Square. Decision trees have several advantages: they are simple to use and easy to understand, require very few data preparation steps, can handle mixed data (both categorical and continuous variables), and are very fast.

      #Drop Intercept from predictors for tree algorithms
      X_train = X_train.drop(['Intercept'], axis = 1)
      X_test = X_test.drop(['Intercept'], axis = 1)

      #Decision Tree
      from sklearn.tree import DecisionTreeClassifier
      model_tree = DecisionTreeClassifier(max_depth=7)

      #Fit the model:
      model_tree.fit(X_train,y_train)

      #Make predictions on test set
      predictions_tree = model_tree.predict_proba(X_test)

      #AUC
      false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, predictions_tree[:,1])
      auc(false_positive_rate, true_positive_rate)
      Result : AUC = 0.664

      Important Note
      Feature engineering plays an important role in building predictive models. In the above case, we have not performed variable selection. We can also select best parameters by using grid search fine tuning technique.

      Random Forest Model

      A decision tree has the limitation of overfitting, which means it does not generalize patterns well and is very sensitive to small changes in the training data. To overcome this problem, random forest comes into the picture. It grows a large number of trees on randomized samples of the data and selects a random subset of variables for each tree. It is a more robust algorithm than a single decision tree. It is one of the most popular machine learning algorithms, commonly used in data science competitions, where it is consistently ranked among the top algorithms. It has become a part of every data science toolkit.

      #Random Forest
      from sklearn.ensemble import RandomForestClassifier
      model_rf = RandomForestClassifier(n_estimators=100, max_depth=7)

      #Fit the model:
      target = y_train['admit']
      model_rf.fit(X_train,target)

      #Make predictions on test set
      predictions_rf = model_rf.predict_proba(X_test)

      #AUC
      false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, predictions_rf[:,1])
      auc(false_positive_rate, true_positive_rate)

      #Variable Importance
      importances = pd.Series(model_rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
      print(importances)
      importances.plot.bar()

      Result : AUC = 0.6974

      Grid Search - Hyper Parameters Tuning

      The sklearn library makes hyper-parameter tuning very easy. It is a strategy to select the best parameters for an algorithm. In scikit-learn, hyper-parameters are passed as arguments to the constructor of the estimator classes, for example max_features in RandomForestClassifier or alpha for Lasso.

      from sklearn.model_selection import GridSearchCV
      rf = RandomForestClassifier()
      target = y_train['admit']

      param_grid = {
      'n_estimators': [100, 200, 300],
      'max_features': ['sqrt', 3, 4]
      }

      CV_rfc = GridSearchCV(estimator=rf , param_grid=param_grid, cv= 5, scoring='roc_auc')
      CV_rfc.fit(X_train,target)

      #Parameters with Scores (cv_results_ replaces the older grid_scores_ attribute)
      CV_rfc.cv_results_

      #Best Parameters
      CV_rfc.best_params_
      CV_rfc.best_estimator_

      #Make predictions on test set
      predictions_rf = CV_rfc.predict_proba(X_test)

      #AUC
      false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, predictions_rf[:,1])
      auc(false_positive_rate, true_positive_rate)

      Cross Validation
      # Cross Validation
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_predict,cross_val_score
      target = y['admit']
      prediction_logit = cross_val_predict(LogisticRegression(), X, target, cv=10, method='predict_proba')
      #AUC
      cross_val_score(LogisticRegression(fit_intercept = False), X, target, cv=10, scoring='roc_auc')

      Data Mining : PreProcessing Steps

      1.  The machine learning package sklearn requires all categorical variables in numeric form. Hence, we need to convert all character/categorical variables to be numeric. This can be accomplished using the following script. In sklearn,  there is already a function for this step.

      from sklearn.preprocessing import LabelEncoder

      def ConverttoNumeric(df):
          cols = list(df.select_dtypes(include=['category', 'object']))
          le = LabelEncoder()
          for i in cols:
              try:
                  df[i] = le.fit_transform(df[i])
              except:
                  print('Error in Variable :' + i)
          return df

      ConverttoNumeric(df)
      Encoding

      2. Create Dummy Variables

      Suppose you want to convert categorical variables into dummy variables. It is different from the previous example as it creates dummy variables instead of converting the variable to numeric codes.
      productcode_dummy = pd.get_dummies(df["productcode"])
      df2 = pd.concat([df, productcode_dummy], axis=1)

      The output looks like below -
         AA  BB
      0   1   0
      1   1   0
      2   1   0
      3   0   1
      4   0   1
      5   0   1

      Create k-1 Categories

      To avoid multi-collinearity, you can set one of the categories as the reference category and drop it while creating dummy variables. In the script below, we are dropping the first category.
      productcode_dummy = pd.get_dummies(df["productcode"], prefix='pcode', drop_first=True)
      df2 = pd.concat([df, productcode_dummy], axis=1)

      3. Impute Missing Values

      Imputing missing values is an important step of predictive modeling. In many algorithms, if missing values are not filled, the complete row is removed. If the data contains a lot of missing values, this can lead to huge data loss. There are multiple ways to impute missing values; some of the common techniques are to replace a missing value with the mean, median or zero. It makes sense to replace a missing value with 0 when 0 is itself meaningful, for example a flag indicating whether a customer holds a credit card product.

      Fill missing values of a particular variable
      # fill missing values with 0
      df['var1'] = df['var1'].fillna(0)
      # fill missing values with mean
      df['var1'] = df['var1'].fillna(df['var1'].mean())

      Apply imputation to the whole dataset
      from sklearn.preprocessing import Imputer

      # Set an imputer object
      # (in newer scikit-learn versions, Imputer has been replaced by sklearn.impute.SimpleImputer)
      mean_imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)

      # Train the imputor
      mean_imputer = mean_imputer.fit(df)

      # Apply imputation
      df_new = mean_imputer.transform(df.values)

      4. Outlier Treatment

      There are many ways to handle or treat outliers (or extreme values). Some of the methods are as follows -
      1. Cap extreme values at the 95th / 99th percentile, depending on the distribution (see the sketch after the log-transformation code below)
      2. Apply a log transformation to the variable. See below the implementation of a log transformation in Python.
      import numpy as np
      df['var1'] = np.log(df['var1'])
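
      For the first method (capping), a hedged sketch using the 99th percentile of the hypothetical variable var1:
      # Cap extreme values of var1 at its 99th percentile
      upper_cap = df['var1'].quantile(0.99)
      df['var1'] = df['var1'].clip(upper=upper_cap)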

      5. Standardization

      In some algorithms, it is required to standardize variables before running the actual algorithm. Standardization refers to the process of rescaling a variable so that it has zero mean and unit variance (a standard deviation of one).

      #load dataset
      from sklearn.datasets import load_boston
      dataset = load_boston()
      predictors = dataset.data
      target = dataset.target
      df = pd.DataFrame(predictors, columns = dataset.feature_names)

      #Apply Standardization
      from sklearn.preprocessing import StandardScaler
      k = StandardScaler()
      df2 = k.fit_transform(df)


      Next Steps

      Practice, practice and practice. Download free public data sets from the Kaggle / UCLA websites and try to play around with the data, generate insights from it with the pandas package and build statistical models using the sklearn package. I hope you find this tutorial helpful. I tried to cover all the important topics which a beginner must know about Python. Once you complete this tutorial, you can confidently say that you know how to program in Python and can implement machine learning algorithms using the sklearn package.

      ListenData: NumPy Tutorial with Exercises

       NumPy (acronym for 'Numerical Python' or 'Numeric Python') is one of the most essential packages for speedy mathematical computation on arrays and matrices in Python. It is also quite useful for dealing with multi-dimensional data, is a blessing for integrating C, C++ and FORTRAN tools, and provides numerous functions for Fourier transforms (FT) and linear algebra.

      Python : Numpy Tutorial

      Why NumPy instead of lists?

       One might wonder why one should prefer NumPy arrays when lists can hold the same data. If that question rings a bell, the following reasons may convince you:
       1. NumPy arrays have contiguous memory allocation, so the same data stored as a list requires more space than an array.
       2. They are faster to work with and hence more efficient than lists.
       3. They are more convenient to deal with.

        NumPy vs. Pandas

        Pandas is built on top of NumPy. In other words, NumPy is required for pandas to work, so pandas is not an alternative to NumPy. Instead, pandas offers additional methods and provides a more streamlined way of working with numerical and tabular data in Python.

        Importing numpy
        Firstly you need to import the numpy library. Importing numpy can be done by running the following command:
        import numpy as np
        It is common practice to import numpy with the alias 'np'. If the alias is not provided, then to access functions from numpy we have to write numpy.function; the alias lets us write np.function instead. Some of the common functions of numpy are listed below -

        Function    Task
        array       Create a numpy array
        ndim        Dimension of the array
        shape       Size of the array (number of rows and columns)
        size        Total number of elements in the array
        dtype       Type of elements in the array, i.e. int64, character
        reshape     Reshapes the array without changing the original array
        resize      Reshapes the array and also changes the original array
        arange      Create a sequence of numbers in an array
        itemsize    Size in bytes of each item
        diag        Create a diagonal matrix
        vstack      Stack arrays vertically
        hstack      Stack arrays horizontally
        1D array
        Using numpy an array is created by using np.array:
        a = np.array([15,25,14,78,96])
        a
        print(a)
        a
        Output: array([15, 25, 14, 78, 96])

        print(a)
        Output: [15 25 14 78 96]
        Notice that square brackets are required inside np.array; omitting them raises an error. To print the array we can use print(a).

        Changing the datatype
        np.array( ) has an additional parameter of dtype through which one can define whether the elements are integers or floating points or complex numbers.
        a.dtype
        a = np.array([15,25,14,78,96],dtype = "float")
        a
        a.dtype
        Initially datatype of 'a' was 'int32' which on modifying becomes 'float64'.

        1. int32 refers to a number without a decimal point. '32' means the number can be between -2147483648 and 2147483647. Similarly, int16 implies the number can be in the range -32768 to 32767.
        2. float64 refers to a number with decimal places.


        Creating the sequence of numbers
        If you want to create a sequence of numbers then using np.arange, we can get our sequence. To get the sequence of numbers from 20 to 29 we run the following command.
        b = np.arange(start = 20,stop = 30, step = 1)
        b
        array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29])
        In np.arange the end point is always excluded.

        np.arange provides an option of step which defines the difference between 2 consecutive numbers. If step is not provided then it takes the value 1 by default.

        Suppose we want to create an arithmetic progression with initial term 20 and common difference 2, upto 30; 30 being excluded.
        c = np.arange(20,30,2) #30 is excluded.
        c
        array([20, 22, 24, 26, 28])
        Take care that in np.arange( ) the stop argument is always excluded.

        Indexing in arrays
        It is important to note that Python indexing starts from 0. The syntax of indexing is as follows -
        1. x[start:end:step]: Elements in array x start through the end (but the end is excluded), default step value is 1.
        2. x[start:end] : Elements in array x start through the end (but the end is excluded)
        3. x[start:] : Elements start through the end
        4. x[:end] : Elements from the beginning through the end (but the end is excluded)

        If we want to extract 3rd element we write the index as 2 as it starts from 0.
        x = np.arange(10)
        x[2]
        x[2:5]
        x[::2]
        x[1::2]
        x
        Output: [0 1 2 3 4 5 6 7 8 9]

        x[2]
        Output: 2

        x[2:5]
        Output: array([2, 3, 4])

        x[::2]
        Output: array([0, 2, 4, 6, 8])

        x[1::2]
        Output: array([1, 3, 5, 7, 9])

        Note that in x[2:5] elements starting from 2nd index up to 5th index(exclusive) are selected.
        If we want to set every third element, from the start up to index 7 (excluding 7), to the value 123, we write:
        x[:7:3] = 123
        x
         array([123,   1,   2, 123,   4,   5, 123,   7,   8,   9])
        To reverse a given array we write:
        x = np.arange(10)
        x[ : :-1] # reversed x
        array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
        Note that the above command does not modify the original array.

        Reshaping the arrays
        To reshape the array we can use reshape( ).
        f = np.arange(101,113)
        f.reshape(3,4)
        f
         array([101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112])

        Note that reshape() does not alter the shape of the original array. Thus to modify the original array we can use resize( )
        f.resize(3,4)
        f
        array([[101, 102, 103, 104],
        [105, 106, 107, 108],
        [109, 110, 111, 112]])

        If a dimension is given as -1 in a reshaping, the other dimension is automatically calculated, provided that the total number of elements in the array is divisible by the given dimension.
        f.reshape(3,-1)
        array([[101, 102, 103, 104],
        [105, 106, 107, 108],
        [109, 110, 111, 112]])

        In the above code we only directed that we will have 3 rows. Python automatically calculates the number of elements in other dimension i.e. 4 columns.

        Missing Data
        The missing data is represented by NaN (acronym for Not a Number). You can use the command np.nan
        val = np.array([15,10, np.nan, 3, 2, 5, 6, 4])
        val.sum()
        Out: nan
        To ignore missing values, you can use np.nansum(val) which returns 45

        To check whether an array contains missing values, you can use the function np.isnan( ).
        np.isnan(val)


        2D arrays
        A 2D array in numpy can be created in the following manner:
        g = np.array([(10,20,30),(40,50,60)])
        #Alternatively
        g = np.array([[10,20,30],[40,50,60]])
        g
        The dimension, total number of elements and shape can be ascertained by ndim, size and shape respectively:
        g.ndim
        g.size
        g.shape
        g.ndim
        Output: 2

        g.size
        Output: 6

        g.shape
        Output: (2, 3)

        Creating some usual matrices
        numpy provides the utility to create some usual matrices which are commonly used for linear algebra.
        To create a matrix of all zeros of 2 rows and 4 columns we can use np.zeros( ):
        np.zeros( (2,4) )
        array([[ 0.,  0.,  0.,  0.],
        [ 0., 0., 0., 0.]])
        Here the dtype can also be specified. For a zero matrix the default dtype is 'float'. To change it to integer we write 'dtype = np.int16'
        np.zeros([2,4],dtype=np.int16)
        array([[0, 0, 0, 0],
        [0, 0, 0, 0]], dtype=int16)
np.empty( ) creates an array of the given shape without initialising its entries, so it contains whatever arbitrary values happen to be in memory (it does not return random numbers between 0 and 1).
        np.empty( (2,3) )
        array([[  2.16443571e-312,   2.20687562e-312,   2.24931554e-312],
        [ 2.29175545e-312, 2.33419537e-312, 2.37663529e-312]])
Note: The results may vary every time you run np.empty( ).
To create a matrix of all ones we write np.ones( ). We can create a 3 * 3 matrix of all ones by:
        np.ones([3,3])
        array([[ 1.,  1.,  1.],
        [ 1., 1., 1.],
        [ 1., 1., 1.]])
        To create a diagonal matrix we can write np.diag( ). To create a diagonal matrix where the diagonal elements are 14,15,16 and 17 we write:
        np.diag([14,15,16,17])
        array([[14,  0,  0,  0],
        [ 0, 15, 0, 0],
        [ 0, 0, 16, 0],
        [ 0, 0, 0, 17]])
        To create an identity matrix we can use np.eye( ) .
        np.eye(5,dtype = "int")
        array([[1, 0, 0, 0, 0],
        [0, 1, 0, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 0, 1, 0],
        [0, 0, 0, 0, 1]])
        By default the datatype in np.eye( ) is 'float' thus we write dtype = "int" to convert it to integers.

        Reshaping 2D arrays
        To get a flattened 1D array we can use ravel( )
        g = np.array([(10,20,30),(40,50,60)])
        g.ravel()
         array([10, 20, 30, 40, 50, 60])
        To change the shape of 2D array we can use reshape. Writing -1 will calculate the other dimension automatically and does not modify the original array.
        g.reshape(3,-1) # returns the array with a modified shape
        #It does not modify the original array
        g.shape
         (2, 3)
        Similar to 1D arrays, using resize( ) will modify the shape in the original array.
        g.resize((3,2))
        g #resize modifies the original array
        array([[10, 20],
        [30, 40],
        [50, 60]])

        Time for some matrix algebra
        Let us create some arrays A,b and B and they will be used for this section:
        A = np.array([[2,0,1],[4,3,8],[7,6,9]])
        b = np.array([1,101,14])
        B = np.array([[10,20,30],[40,50,60],[70,80,90]])
        In order to get the transpose, trace and inverse we use A.transpose( ) , np.trace( ) and np.linalg.inv( ) respectively.
        A.T #transpose
        A.transpose() #transpose
        np.trace(A) # trace
        np.linalg.inv(A) #Inverse
        A.transpose()  #transpose
        Output:
        array([[2, 4, 7],
        [0, 3, 6],
        [1, 8, 9]])

        np.trace(A) # trace
        Output: 14

        np.linalg.inv(A) #Inverse
        Output:
        array([[ 0.53846154, -0.15384615, 0.07692308],
        [-0.51282051, -0.28205128, 0.30769231],
        [-0.07692308, 0.30769231, -0.15384615]])
        Note that transpose does not modify the original array.

        Matrix addition and subtraction can be done in the usual way:
        A+B
        A-B
        A+B
        Output:
        array([[12, 20, 31],
        [44, 53, 68],
        [77, 86, 99]])

        A-B
        Output:
        array([[ -8, -20, -29],
        [-36, -47, -52],
        [-63, -74, -81]])
        Matrix multiplication of A and B can be accomplished by A.dot(B). Where A will be the 1st matrix on the left hand side and B will be the second matrix on the right side.
        A.dot(B)
        array([[  90,  120,  150],
        [ 720, 870, 1020],
        [ 940, 1160, 1380]])
        To solve the system of linear equations: Ax = b we use np.linalg.solve( )
        np.linalg.solve(A,b)
        array([-13.92307692, -24.69230769,  28.84615385])
The eigenvalues and eigenvectors can be calculated using np.linalg.eig( )
        np.linalg.eig(A)
        (array([ 14.0874236 ,   1.62072127,  -1.70814487]),
        array([[-0.06599631, -0.78226966, -0.14996331],
        [-0.59939873, 0.54774477, -0.81748379],
        [-0.7977253 , 0.29669824, 0.55608566]]))
The first array contains the eigenvalues and the second is the matrix of eigenvectors, where each column is the eigenvector corresponding to the respective eigenvalue.

        Some Mathematics functions

        We can have various trigonometric functions like sin, cosine etc. using numpy:
        B = np.array([[0,-20,36],[40,50,1]])
        np.sin(B)
        array([[ 0.        , -0.91294525, -0.99177885],
        [ 0.74511316, -0.26237485, 0.84147098]])
The result is a matrix with sin( ) applied to every element.
To raise every element to a power we use **
        B**2
        array([[   0,  400, 1296],
        [1600, 2500, 1]], dtype=int32)
        We get the matrix of the square of all elements of B.
To check whether a condition is satisfied by the elements of a matrix we write the criterion. For instance, to check if the elements of B are more than 25 we write:
        B>25
        array([[False, False,  True],
        [ True, True, False]], dtype=bool)
        We get a matrix of Booleans where True indicates that the corresponding element is greater than 25 and False indicates that the condition is not satisfied.
        In a similar manner np.absolute, np.sqrt and np.exp return the matrices of absolute numbers, square roots and exponentials respectively.
        np.absolute(B)
        np.sqrt(B)
        np.exp(B)
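A minimal sketch of the first of these, using the same B as above; note that np.sqrt( ) returns nan (and raises a RuntimeWarning) for the negative entry -20:
np.absolute(B)
Output:
array([[ 0, 20, 36],
[40, 50,  1]])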
        Now we consider a matrix A of shape 3*3:
        A = np.arange(1,10).reshape(3,3)
        A
        array([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]])
        To find the sum, minimum, maximum, mean, standard deviation and variance respectively we use the following commands:
        A.sum()
        A.min()
        A.max()
        A.mean()
        A.std() #Standard deviation
        A.var() #Variance
        A.sum()
        Output: 45

        A.min()
        Output: 1

        A.max()
        Output: 9

        A.mean()
        Output: 5.0

        A.std() #Standard deviation
        Output: 2.5819888974716112

        A.var()
        Output: 6.666666666666667
        In order to obtain the index of the minimum and maximum elements we use argmin( ) and argmax( ) respectively.
        A.argmin()
        A.argmax()
        A.argmin()
        Output: 0

        A.argmax()
        Output: 8
        If we wish to find the above statistics for each row or column then we need to specify the axis:
        A.sum(axis=0)
        A.mean(axis = 0)
        A.std(axis = 0)
        A.argmin(axis = 0)
        A.sum(axis=0)                 # sum of each column, it will move in downward direction
        Output: array([12, 15, 18])

        A.mean(axis = 0)
        Output: array([ 4., 5., 6.])

        A.std(axis = 0)
        Output: array([ 2.44948974, 2.44948974, 2.44948974])

        A.argmin(axis = 0)
        Output: array([0, 0, 0], dtype=int64)
By setting axis = 0, calculations move in the downward direction, i.e. we get the statistics for each column. To find the minimum and the index of the maximum element for each row, we need to move in the rightward direction, so we write axis = 1:
        A.min(axis=1)
        A.argmax(axis = 1)
        A.min(axis=1)                  # min of each row, it will move in rightwise direction
        Output: array([1, 4, 7])

        A.argmax(axis = 1)
        Output: array([2, 2, 2], dtype=int64)
        To find the cumulative sum along each row we use cumsum( )
        A.cumsum(axis=1)
        array([[ 1,  3,  6],
        [ 4, 9, 15],
        [ 7, 15, 24]], dtype=int32)

        Creating 3D arrays
        Numpy also provides the facility to create 3D arrays. A 3D array can be created as:
        X = np.array( [[[ 1, 2,3],
        [ 4, 5, 6]],
        [[7,8,9],
        [10,11,12]]])
        X.shape
        X.ndim
        X.size
X contains two 2D arrays, thus the shape is (2, 2, 3). The total number of elements is 12.
        To calculate the sum along a particular axis we use the axis parameter as follows:
        X.sum(axis = 0)
        X.sum(axis = 1)
        X.sum(axis = 2)
        X.sum(axis = 0)
        Output:
        array([[ 8, 10, 12],
        [14, 16, 18]])

        X.sum(axis = 1)
        Output:
        array([[ 5, 7, 9],
        [17, 19, 21]])

        X.sum(axis = 2)
        Output:
        array([[ 6, 15],
        [24, 33]])
        axis = 0 returns the sum of the corresponding elements of each 2D array. axis = 1 returns the sum of elements in each column in each matrix while axis = 2 returns the sum of each row in each matrix.
        X.ravel()
         array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])
ravel( ) flattens all the elements into a single 1D array.
        Consider a 3D array:
        X = np.array( [[[ 1, 2,3],
        [ 4, 5, 6]],
        [[7,8,9],
        [10,11,12]]])
        To extract the 2nd matrix we write:
        X[1,...] # same as X[1,:,:] or X[1]
        array([[ 7,  8,  9],
        [10, 11, 12]])
        Remember python indexing starts from 0 that is why we wrote 1 to extract the 2nd 2D array.
        To extract the first element from all the rows we write:
        X[...,0] # same as X[:,:,0]
        array([[ 1,  4],
        [ 7, 10]])


        Find out position of elements that satisfy a given condition
        a = np.array([8, 3, 7, 0, 4, 2, 5, 2])
        np.where(a > 4)
(array([0, 2, 6]),)
np.where returns a tuple of arrays giving the positions where the elements of 'a' are greater than 4.
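np.where( ) can also take replacement values as its 2nd and 3rd arguments (the same form is used later in the pandas part of this tutorial); a minimal sketch:
np.where(a > 4, 1, 0) #1 where the condition holds, 0 elsewhere
Output: array([1, 0, 1, 0, 0, 0, 1, 0])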

        Indexing with Arrays of Indices
        Consider a 1D array.
        x = np.arange(11,35,2)
        x
        array([11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33])
        We form a 1D array i which subsets the elements of x as follows:
        i = np.array( [0,1,5,3,7,9 ] )
        x[i]
        array([11, 13, 21, 17, 25, 29])
        In a similar manner we create a 2D array j of indices to subset x.
        j = np.array( [ [ 0, 1], [ 6, 2 ] ] )
        x[j]
        array([[11, 13],
        [23, 15]])
        Similarly we can create both i and j as 2D arrays of indices for x
        x = np.arange(15).reshape(3,5)
        x
        i = np.array( [ [0,1], # indices for the first dim
        [2,0] ] )
        j = np.array( [ [1,1], # indices for the second dim
        [2,0] ] )
To pick, for each position, the row index from i and the column index from j, we write:
        x[i,j] # i and j must have equal shape
        array([[ 1,  6],
        [12, 0]])
        To extract ith index from 3rd column we write:
        x[i,2]
        array([[ 2,  7],
        [12, 2]])
If for each row we want to pick the column indices given by j, we write:
        x[:,j]
        array([[[ 1,  1],
        [ 2, 0]],

        [[ 6, 6],
        [ 7, 5]],

        [[11, 11],
        [12, 10]]])
For each row in turn, the elements at the column indices given by j are selected: first for the 1st row, then the 2nd row, then the 3rd row.

        You can also use indexing with arrays to assign the values:
        x = np.arange(10)
        x
        x[[4,5,8,1,2]] = 0
        x
        array([0, 0, 0, 3, 0, 0, 6, 7, 0, 9])
        0 is assigned to 4th, 5th, 8th, 1st and 2nd indices of x.
        When the list of indices contains repetitions then it assigns the last value to that index:
        x = np.arange(10)
        x
        x[[4,4,2,3]] = [100,200,300,400]
        x
        array([  0,   1, 300, 400, 200,   5,   6,   7,   8,   9])
        Notice that for the 5th element(i.e. 4th index) the value assigned is 200, not 100.
Caution: if the += operator is used with repeated indices, the increment is applied only once per index, not once per repetition.
        x = np.arange(10)
        x[[1,1,1,7,7]]+=1
        x
         array([0, 2, 2, 3, 4, 5, 6, 8, 8, 9])
Although indices 1 and 7 are repeated, they are incremented only once.
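If you actually need the increment applied once per occurrence, np.add.at( ) handles repeated indices; a minimal sketch:
x = np.arange(10)
np.add.at(x, [1,1,1,7,7], 1) #index 1 is incremented 3 times, index 7 twice
x
array([0, 4, 2, 3, 4, 5, 6, 9, 8, 9])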

        Indexing with Boolean Arrays
We create a 2D array and store our condition in b. Wherever the condition holds, b contains True; otherwise it contains False.
        a = np.arange(12).reshape(3,4)
        b = a > 4
        b
        array([[False, False, False, False],
        [False, True, True, True],
        [ True, True, True, True]], dtype=bool)
        Note that 'b' is a Boolean with same shape as that of 'a'.
        To select the elements from 'a' which adhere to condition 'b' we write:
        a[b]
        array([ 5,  6,  7,  8,  9, 10, 11])
a[b] returns a 1D array containing the selected elements; 'a' itself is unchanged.
        This property can be very useful in assignments:
        a[b] = 0
        a
        array([[0, 1, 2, 3],
        [4, 0, 0, 0],
        [0, 0, 0, 0]])
        All elements of 'a' higher than 4 become 0
        As done in integer indexing we can use indexing via Booleans:
        Let x be the original matrix and 'y' and 'z' be the arrays of Booleans to select the rows and columns.
        x = np.arange(15).reshape(3,5)
        y = np.array([True,True,False]) # first dim selection
        z = np.array([True,True,False,True,False]) # second dim selection
        We write the x[y,:] which will select only those rows where y is True.
        x[y,:] # selecting rows
        x[y] # same thing
        Writing x[:,z] will select only those columns where z is True.
        x[:,z] # selecting columns
        x[y,:]                                   # selecting rows
        Output:
        array([[0, 1, 2, 3, 4],
        [5, 6, 7, 8, 9]])

        x[y] # same thing
        Output:
        array([[0, 1, 2, 3, 4],
        [5, 6, 7, 8, 9]])

        x[:,z] # selecting columns
        Output:
        array([[ 0, 1, 3],
        [ 5, 6, 8],
        [10, 11, 13]])

        Statistics on Pandas DataFrame

        Let's create dummy data frame for illustration :
        np.random.seed(234)
        mydata = pd.DataFrame({"x1" : np.random.randint(low=1, high=100, size=10),
        "x2" : range(10)
        })

        1. Calculate mean of each column of data frame
        np.mean(mydata)
        2. Calculate median of each column of data frame
        np.median(mydata, axis=0)
        axis = 0 means the median function would be run on each column. axis = 1 implies the function to be run on each row.
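The same statistics can also be obtained with pandas' own methods; a minimal sketch:
mydata.mean() #mean of each column
mydata.median() #median of each column
mydata.mean(axis = 1) #mean of each row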

        Stacking various arrays
        Let us consider 2 arrays A and B:
        A = np.array([[10,20,30],[40,50,60]])
        B = np.array([[100,200,300],[400,500,600]])
        To join them vertically we use np.vstack( ).
        np.vstack((A,B)) #Stacking vertically
        array([[ 10,  20,  30],
        [ 40, 50, 60],
        [100, 200, 300],
        [400, 500, 600]])
        To join them horizontally we use np.hstack( ).
        np.hstack((A,B)) #Stacking horizontally
        array([[ 10,  20,  30, 100, 200, 300],
        [ 40, 50, 60, 400, 500, 600]])
newaxis helps in transforming a 1D array into a 2D column vector.
        from numpy import newaxis
        a = np.array([4.,1.])
        b = np.array([2.,8.])
        a[:,newaxis]
        array([[ 4.],
        [ 1.]])
The function np.column_stack( ) stacks 1D arrays as columns into a 2D array. For 2D inputs (such as the column vectors created with newaxis above) it is equivalent to hstack:
        np.column_stack((a[:,newaxis],b[:,newaxis]))
        np.hstack((a[:,newaxis],b[:,newaxis])) # same as column_stack
        np.column_stack((a[:,newaxis],b[:,newaxis]))
        Output:
        array([[ 4., 2.],
        [ 1., 8.]])

        np.hstack((a[:,newaxis],b[:,newaxis]))
        Output:
        array([[ 4., 2.],
        [ 1., 8.]])

        Splitting the arrays
        Consider an array 'z' of 15 elements:
        z = np.arange(1,16)
        Using np.hsplit( ) one can split the arrays
        np.hsplit(z,5) # Split a into 5 arrays
        [array([1, 2, 3]),
        array([4, 5, 6]),
        array([7, 8, 9]),
        array([10, 11, 12]),
        array([13, 14, 15])]
It splits 'z' into 5 arrays of equal length.
        On passing 2 elements we get:
        np.hsplit(z,(3,5))
        [array([1, 2, 3]),
        array([4, 5]),
        array([ 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])]
        It splits 'z' after the third and the fifth element.
        For 2D arrays np.hsplit( ) works as follows:
        A = np.arange(1,31).reshape(3,10)
        A
        np.hsplit(A,5) # Split a into 5 arrays
        [array([[ 1,  2],
        [11, 12],
        [21, 22]]), array([[ 3, 4],
        [13, 14],
        [23, 24]]), array([[ 5, 6],
        [15, 16],
        [25, 26]]), array([[ 7, 8],
        [17, 18],
        [27, 28]]), array([[ 9, 10],
        [19, 20],
        [29, 30]])]
In the above command A is split into 5 arrays of the same shape.
        To split after the third and the fifth column we write:
        np.hsplit(A,(3,5))
        [array([[ 1,  2,  3],
        [11, 12, 13],
        [21, 22, 23]]), array([[ 4, 5],
        [14, 15],
        [24, 25]]), array([[ 6, 7, 8, 9, 10],
        [16, 17, 18, 19, 20],
        [26, 27, 28, 29, 30]])]
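There is an analogous np.vsplit( ) for splitting along rows; a minimal sketch using the same A:
np.vsplit(A,3) # Split A into 3 arrays row-wise
[array([[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10]]),
array([[11, 12, 13, 14, 15, 16, 17, 18, 19, 20]]),
array([[21, 22, 23, 24, 25, 26, 27, 28, 29, 30]])]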

        Copying
        Consider an array x
        x = np.arange(1,16)
We assign y = x and then check whether y is x; the result is True because both names refer to the same object.
        y = x
        y is x
        Let us change the shape of y
        y.shape = 3,5
Note that it also alters the shape of x, since y and x are the same object:
        x.shape
        (3, 5)

        Creating a view of the data
        Let us store z as a view of x by:
        z = x.view()
        z is x
        False
        Thus z is not x.
        Changing the shape of z
        z.shape = 5,3
Changing the shape of the view does not alter the shape of x:
        x.shape
        (3, 5)
        Changing an element in z
        z[0,0] = 1234
Note that the corresponding value in x also gets altered:
        x
        array([[1234,    2,    3,    4,    5],
        [ 6, 7, 8, 9, 10],
        [ 11, 12, 13, 14, 15]])
Thus changing the shape of a view does not affect the original data, but changing the values in a view does affect the original data.


        Creating a copy of the data:
        Now let us create z as a copy of x:
        z = x.copy()
        Note that z is not x
        z is x
        Changing the value in z
        z[0,0] = 9999
        No alterations are made in x.
        x
        array([[1234,    2,    3,    4,    5],
        [ 6, 7, 8, 9, 10],
        [ 11, 12, 13, 14, 15]])
Pandas may sometimes raise a 'SettingWithCopyWarning' because it cannot tell whether a new dataframe or array (created as a subset of another dataframe or array) is a view or a copy. In such situations you should state explicitly whether you want a copy or a view, otherwise the results may not be what you expect.
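A minimal sketch of making the intent explicit with .copy( ) when taking a subset (the dataframe df and the columns 'col' and 'flag' here are hypothetical, for illustration only):
sub = df[df['col'] > 0].copy() #explicit copy, safe to modify
sub['flag'] = 1 #no SettingWithCopyWarning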

        Exercises : Numpy


        1. How to extract even numbers from array?

        arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
        Desired Output :array([0, 2, 4, 6, 8])

Solution:
        arr[arr % 2 == 0]

        2. How to find out the position where elements of x and y are same

        x = np.array([5,6,7,8,3,4])
        y = np.array([5,3,4,5,2,4])
Desired Output : (array([0, 5]),)

Solution:
        np.where(x == y)

3. How to standardize values so that they lie between 0 and 1

k = np.array([5,3,4,5,2,4])
Hint : (k - min(k)) / (max(k) - min(k))

Solution:
        kmax, kmin = k.max(), k.min()
        k_new = (k - kmin)/(kmax - kmin)

        4. How to calculate the percentile scores of an array

        p = np.array([15,10, 3,2,5,6,4])

Solution:
        np.percentile(p, q=[5, 95])

        5. Print the number of missing values in an array

        p = np.array([5,10, np.nan, 3, 2, 5, 6, np.nan])

Solution:
        print("Number of missing values =", np.isnan(p).sum())

        ListenData: Pandas Python Tutorial - Learn by Examples

Pandas, one of the most popular packages in Python, is widely used for data manipulation. It is a very powerful and versatile package which makes data cleaning and wrangling much easier and more pleasant.

The Pandas library is a great contribution to the Python community and it makes Python one of the top programming languages for data science and analytics. It has become the first choice of data analysts and scientists for data analysis and manipulation.

        Data Analysis with Python : Pandas Step by Step Guide

        Why pandas?
It has many functions which are essential for data handling. In short, it can perform the following tasks for you -
        1. Create a structured data set similar to R's data frame and Excel spreadsheet.
        2. Reading data from various sources such as CSV, TXT, XLSX, SQL database, R etc.
        3. Selecting particular rows or columns from data set
        4. Arranging data in ascending or descending order
        5. Filtering data based on some conditions
        6. Summarizing data by classification variable
        7. Reshape data into wide or long format
        8. Time series analysis
        9. Merging and concatenating two datasets
        10. Iterate over the rows of dataset
        11. Writing or Exporting data in CSV or Excel format

        Datasets:

        In this tutorial we will use two datasets: 'income' and 'iris'.
        1. 'income' data : This data contains the income of various states from 2002 to 2015. The dataset contains 51 observations and 16 variables. Download link
2. 'iris' data: It comprises 150 observations with 5 variables. We have 3 species of flowers (50 flowers for each species), and for each flower the sepal length and width and the petal length and width are given. Download link


        Important pandas functions to remember

        The following is a list of common tasks along with pandas functions.
Utility : Function
Extract Column Names : df.columns
Select first 2 rows : df.iloc[:2]
Select first 2 columns : df.iloc[:,:2]
Select columns by name : df.loc[:,["col1","col2"]]
Select random no. of rows : df.sample(n = 10)
Select fraction of random rows : df.sample(frac = 0.2)
Rename the variables : df.rename( )
Selecting a column as index : df.set_index( )
Removing rows or columns : df.drop( )
Sorting values : df.sort_values( )
Grouping variables : df.groupby( )
Filtering : df.query( )
Finding the missing values : df.isnull( )
Dropping the missing values : df.dropna( )
Removing the duplicates : df.drop_duplicates( )
Creating dummies : pd.get_dummies( )
Ranking : df.rank( )
Cumulative sum : df.cumsum( )
Quantiles : df.quantile( )
Selecting numeric variables : df.select_dtypes( )
Concatenating two dataframes : pd.concat()
Merging on basis of common variable : pd.merge( )

        Importing pandas library

You need to import or load the pandas library first in order to use it. "Importing a library" means loading it into memory so that you can use it. Run the following code to import the pandas library:
        import pandas as pd
        The "pd" is an alias or abbreviation which will be used as a shortcut to access or call pandas functions. To access the functions from pandas library, you just need to type pd.function instead of  pandas.function every time you need to apply it.

        Importing Dataset

        To read or import data from CSV file, you can use read_csv() function. In the function, you need to specify the file location of your CSV file.
        income = pd.read_csv("C:\\Users\\Hp\\Python\\Basics\\income.csv")
         Index       State    Y2002    Y2003    Y2004    Y2005    Y2006    Y2007  \
        0 A Alabama 1296530 1317711 1118631 1492583 1107408 1440134
        1 A Alaska 1170302 1960378 1818085 1447852 1861639 1465841
        2 A Arizona 1742027 1968140 1377583 1782199 1102568 1109382
        3 A Arkansas 1485531 1994927 1119299 1947979 1669191 1801213
        4 C California 1685349 1675807 1889570 1480280 1735069 1812546

        Y2008 Y2009 Y2010 Y2011 Y2012 Y2013 Y2014 Y2015
        0 1945229 1944173 1237582 1440756 1186741 1852841 1558906 1916661
        1 1551826 1436541 1629616 1230866 1512804 1985302 1580394 1979143
        2 1752886 1554330 1300521 1130709 1907284 1363279 1525866 1647724
        3 1188104 1628980 1669295 1928238 1216675 1591896 1360959 1329341
        4 1487315 1663809 1624509 1639670 1921845 1156536 1388461 1644607

        Get Variable Names

By using the income.columns command, you can fetch the names of variables of a data frame.
        Index(['Index', 'State', 'Y2002', 'Y2003', 'Y2004', 'Y2005', 'Y2006', 'Y2007',
        'Y2008', 'Y2009', 'Y2010', 'Y2011', 'Y2012', 'Y2013', 'Y2014', 'Y2015'],
        dtype='object')
        income.columns[0:2] returns first two column names 'Index', 'State'. In python, indexing starts from 0.

        Knowing the Variable types

        You can use the dataFrameName.dtypes command to extract the information of types of variables stored in the data frame.
        income.dtypes 
        Index    object
        State object
        Y2002 int64
        Y2003 int64
        Y2004 int64
        Y2005 int64
        Y2006 int64
        Y2007 int64
        Y2008 int64
        Y2009 int64
        Y2010 int64
        Y2011 int64
        Y2012 int64
        Y2013 int64
        Y2014 int64
        Y2015 int64
        dtype: object

        Here 'object' means strings or character variables. 'int64' refers to numeric variables (without decimals).

        To see the variable type of one variable (let's say "State") instead of all the variables, you can use the command below -
        income['State'].dtypes
        It returns dtype('O'). In this case, 'O' refers to object i.e. type of variable as character.

        Changing the data types

        Y2008 is an integer. Suppose we want to convert it to float (numeric variable with decimals) we can write:
        income.Y2008 = income.Y2008.astype(float)
        income.dtypes
        Index     object
        State object
        Y2002 int64
        Y2003 int64
        Y2004 int64
        Y2005 int64
        Y2006 int64
        Y2007 int64
        Y2008 float64
        Y2009 int64
        Y2010 int64
        Y2011 int64
        Y2012 int64
        Y2013 int64
        Y2014 int64
        Y2015 int64
        dtype: object

        To view the dimensions or shape of the data
        income.shape
         (51, 16)

        51 is the number of rows and 16 is the number of columns.

        You can also use shape[0] to see the number of rows (similar to nrow() in R) and shape[1] for number of columns (similar to ncol() in R). 
        income.shape[0]
        income.shape[1]

        To view only some of the rows

        By default head( ) shows first 5 rows. If we want to see a specific number of rows we can mention it in the parenthesis. Similarly tail( ) function shows last 5 rows by default.
        income.head()
        income.head(2) #shows first 2 rows.
        income.tail()
        income.tail(2) #shows last 2 rows

        Alternatively, any of the following commands can be used to fetch first five rows.
        income[0:5]
        income.iloc[0:5]

        Define Categorical Variable

Like the factor() function in R, we can create a categorical variable in Python using the "category" dtype.
        s = pd.Series([1,2,3,1,2], dtype="category")
        s
        0    1
        1 2
        2 3
        3 1
        4 2
        dtype: category
        Categories (3, int64): [1, 2, 3]

        Extract Unique Values

        The unique() function shows the unique levels or categories in the dataset.
        income.Index.unique()
        array(['A', 'C', 'D', ..., 'U', 'V', 'W'], dtype=object)


        The nunique( ) shows the number of unique values.
        income.Index.nunique()
It returns 19 as the Index column contains 19 distinct values.

        Generate Cross Tab

        pd.crosstab( ) is used to create a bivariate frequency distribution. Here the bivariate frequency distribution is between Index and State columns.
        pd.crosstab(income.Index,income.State)

        Creating a frequency distribution

        income.Index selects the 'Index' column of 'income' dataset and value_counts( ) creates a frequency distribution. By default ascending = False i.e. it will show the 'Index' having the maximum frequency on the top.
        income.Index.value_counts(ascending = True)
        F    1
        G 1
        U 1
        L 1
        H 1
        P 1
        R 1
        D 2
        T 2
        S 2
        V 2
        K 2
        O 3
        C 3
        I 4
        W 4
        A 4
        M 8
        N 8
        Name: Index, dtype: int64

        To draw the samples
income.sample( ) is used to draw random samples from the dataset, containing all the columns. Here n = 5 means we need 5 rows and frac = 0.1 means we need 10 percent of the rows as the sample.
        income.sample(n = 5)
        income.sample(frac = 0.1)
        Selecting only a few of the columns
To select only specific columns we use either the loc[ ] or iloc[ ] functions. The rows or columns to be selected are passed as lists. "Index":"Y2008" denotes that all the columns from Index to Y2008 are to be selected.

        Syntax of df.loc[  ]
        df.loc[row_index , column_index]
        income.loc[:,["Index","State","Y2008"]]
        income.loc[0:2,["Index","State","Y2008"]]  #Selecting rows with Index label 0 to 2 & columns
        income.loc[:,"Index":"Y2008"]  #Selecting consecutive columns
        #In the above command both Index and Y2008 are included.
        income.iloc[:,0:5]  #Columns from 1 to 5 are included. 6th column not included
        Difference between loc and iloc

loc selects rows (or columns) with particular labels from the index, whereas iloc selects rows (or columns) at particular positions in the index, so it only takes integers.
        x = pd.DataFrame({"var1" : np.arange(1,20,2)}, index=[9,8,7,6,10, 1, 2, 3, 4, 5])
            var1
        9 1
        8 3
        7 5
        6 7
        10 9
        1 11
        2 13
        3 15
        4 17
        5 19

        iloc Code

        x.iloc[:3]

        Output:
        var1
        9 1
        8 3
        7 5

        loc code

        x.loc[:3]

        Output:
        var1
        9 1
        8 3
        7 5
        6 7
        10 9
        1 11
        2 13
        3 15
        You can also use the following syntax to select specific variables.
        income[["Index","State","Y2008"]]

        Renaming the variables
        We create a dataframe 'data' for information of people and their respective zodiac signs.
        data = pd.DataFrame({"A" : ["John","Mary","Julia","Kenny","Henry"], "B" : ["Libra","Capricorn","Aries","Scorpio","Aquarius"]})
        data 
               A          B
        0 John Libra
        1 Mary Capricorn
        2 Julia Aries
        3 Kenny Scorpio
        4 Henry Aquarius
        If all the columns are to be renamed then we can use data.columns and assign the list of new column names.
        #Renaming all the variables.
        data.columns = ['Names','Zodiac Signs']

           Names Zodiac Signs
        0 John Libra
        1 Mary Capricorn
        2 Julia Aries
        3 Kenny Scorpio
        4 Henry Aquarius
        If only some of the variables are to be renamed then we can use rename( ) function where the new names are passed in the form of a dictionary.
        #Renaming only some of the variables.
        data.rename(columns = {"Names":"Cust_Name"},inplace = True)
          Cust_Name Zodiac Signs
        0 John Libra
        1 Mary Capricorn
        2 Julia Aries
        3 Kenny Scorpio
        4 Henry Aquarius
        By default in pandas inplace = False which means that no changes are made in the original dataset. Thus if we wish to alter the original dataset we need to define inplace = True.

        Suppose we want to replace only a particular character in the list of the column names then we can use str.replace( ) function. For example, renaming the variables which contain "Y" as "Year"
        income.columns = income.columns.str.replace('Y' , 'Year ')
        income.columns
        Index(['Index', 'State', 'Year 2002', 'Year 2003', 'Year 2004', 'Year 2005',
        'Year 2006', 'Year 2007', 'Year 2008', 'Year 2009', 'Year 2010',
        'Year 2011', 'Year 2012', 'Year 2013', 'Year 2014', 'Year 2015'],
        dtype='object')

        Setting one column in the data frame as the index
Using set_index("column name") we can set that column as the index, and the column is removed from the data.
        income.set_index("Index",inplace = True)
        income.head()
        #Note that the indices have changed and Index column is now no more a column
        income.columns
        income.reset_index(inplace = True)
        income.head()
reset_index( ) restores the default integer index.

        Removing the columns and rows
        To drop a column we use drop( ) where the first argument is a list of columns to be removed.

By default axis = 0, which means rows are removed. To remove a column we need to set axis = 1.
        income.drop('Index',axis = 1)

        #Alternatively
        income.drop("Index",axis = "columns")
        income.drop(['Index','State'],axis = 1)
        income.drop(0,axis = 0)
        income.drop(0,axis = "index")
        income.drop([0,1,2,3],axis = 0)
Also inplace = False by default, thus no alterations are made in the original dataset. axis = "columns" and axis = "index" mean that a column or a row (index) should be removed, respectively.

        Sorting the data
        To sort the data sort_values( ) function is deployed. By default inplace = False and ascending = True.
        income.sort_values("State",ascending = False)
        income.sort_values("State",ascending = False,inplace = True)
        income.Y2006.sort_values() 
The Index column has duplicate values, thus we first sort the dataframe by Index and then, within each Index, sort the values by Y2002:
        income.sort_values(["Index","Y2002"]) 

        Create new variables
        Using eval( ) arithmetic operations on various columns can be carried out in a dataset.
        income["difference"] = income.Y2008-income.Y2009

        #Alternatively
        income["difference2"] = income.eval("Y2008 - Y2009")
        income.head()
          Index       State    Y2002    Y2003    Y2004    Y2005    Y2006    Y2007  \
        0 A Alabama 1296530 1317711 1118631 1492583 1107408 1440134
        1 A Alaska 1170302 1960378 1818085 1447852 1861639 1465841
        2 A Arizona 1742027 1968140 1377583 1782199 1102568 1109382
        3 A Arkansas 1485531 1994927 1119299 1947979 1669191 1801213
        4 C California 1685349 1675807 1889570 1480280 1735069 1812546

        Y2008 Y2009 Y2010 Y2011 Y2012 Y2013 Y2014 Y2015 \
        0 1945229.0 1944173 1237582 1440756 1186741 1852841 1558906 1916661
        1 1551826.0 1436541 1629616 1230866 1512804 1985302 1580394 1979143
        2 1752886.0 1554330 1300521 1130709 1907284 1363279 1525866 1647724
        3 1188104.0 1628980 1669295 1928238 1216675 1591896 1360959 1329341
        4 1487315.0 1663809 1624509 1639670 1921845 1156536 1388461 1644607

        difference difference2
        0 1056.0 1056.0
        1 115285.0 115285.0
        2 198556.0 198556.0
        3 -440876.0 -440876.0
        4 -176494.0 -176494.0

        income.ratio = income.Y2008/income.Y2009
The above command does not create a new column (it only sets an attribute on the dataframe object); thus to create new columns we need to use square brackets.
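A minimal sketch of the square-bracket form for the same ratio column:
income["ratio"] = income.Y2008 / income.Y2009
income.head()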
We can also use the assign( ) function, but it does not change the original data as there is no inplace parameter; hence we need to save the result in a new dataset.
        data = income.assign(ratio = (income.Y2008 / income.Y2009))
        data.head()

        Finding Descriptive Statistics
        describe( ) is used to find some statistics like mean,minimum, quartiles etc. for numeric variables.
        income.describe() #for numeric variables
                      Y2002         Y2003         Y2004         Y2005         Y2006  \
        count 5.100000e+01 5.100000e+01 5.100000e+01 5.100000e+01 5.100000e+01
        mean 1.566034e+06 1.509193e+06 1.540555e+06 1.522064e+06 1.530969e+06
        std 2.464425e+05 2.641092e+05 2.813872e+05 2.671748e+05 2.505603e+05
        min 1.111437e+06 1.110625e+06 1.118631e+06 1.122030e+06 1.102568e+06
        25% 1.374180e+06 1.292390e+06 1.268292e+06 1.267340e+06 1.337236e+06
        50% 1.584734e+06 1.485909e+06 1.522230e+06 1.480280e+06 1.531641e+06
        75% 1.776054e+06 1.686698e+06 1.808109e+06 1.778170e+06 1.732259e+06
        max 1.983285e+06 1.994927e+06 1.979395e+06 1.990062e+06 1.985692e+06

        Y2007 Y2008 Y2009 Y2010 Y2011 \
        count 5.100000e+01 5.100000e+01 5.100000e+01 5.100000e+01 5.100000e+01
        mean 1.553219e+06 1.538398e+06 1.658519e+06 1.504108e+06 1.574968e+06
        std 2.539575e+05 2.958132e+05 2.361854e+05 2.400771e+05 2.657216e+05
        min 1.109382e+06 1.112765e+06 1.116168e+06 1.103794e+06 1.116203e+06
        25% 1.322419e+06 1.254244e+06 1.553958e+06 1.328439e+06 1.371730e+06
        50% 1.563062e+06 1.545621e+06 1.658551e+06 1.498662e+06 1.575533e+06
        75% 1.780589e+06 1.779538e+06 1.857746e+06 1.639186e+06 1.807766e+06
        max 1.983568e+06 1.990431e+06 1.993136e+06 1.999102e+06 1.992996e+06

        Y2012 Y2013 Y2014 Y2015
        count 5.100000e+01 5.100000e+01 5.100000e+01 5.100000e+01
        mean 1.591135e+06 1.530078e+06 1.583360e+06 1.588297e+06
        std 2.837675e+05 2.827299e+05 2.601554e+05 2.743807e+05
        min 1.108281e+06 1.100990e+06 1.110394e+06 1.110655e+06
        25% 1.360654e+06 1.285738e+06 1.385703e+06 1.372523e+06
        50% 1.643855e+06 1.531212e+06 1.580394e+06 1.627508e+06
        75% 1.866322e+06 1.725377e+06 1.791594e+06 1.848316e+06
        max 1.988270e+06 1.994022e+06 1.990412e+06 1.996005e+06
For character or string variables, you can write include = ['object']. It will return the total count, the number of unique values, the most frequently occurring string and its frequency.
        income.describe(include = ['object'])  #Only for strings / objects
        To find out specific descriptive statistics of each column of data frame
        income.mean()
        income.median()
        income.agg(["mean","median"])

        Mean, median, maximum and minimum can be obtained for a particular column(s) as:
        income.Y2008.mean()
        income.Y2008.median()
        income.Y2008.min()
        income.loc[:,["Y2002","Y2008"]].max()

        Groupby function
        To group the data by a categorical variable we use groupby( ) function and hence we can do the operations on each category.
        income.groupby("Index").Y2008.min()
        income.groupby("Index")["Y2008","Y2010"].max()
The agg( ) function is used to apply multiple summary functions to a given variable.
        income.groupby("Index").Y2002.agg(["count","min","max","mean"])
        income.groupby("Index")["Y2002","Y2003"].agg(["count","min","max","mean"])
        The following command finds minimum and maximum values for Y2002 and only mean for Y2003
        income.groupby("Index").agg({"Y2002": ["min","max"],"Y2003" : "mean"})
                  Y2002                 Y2003
        min max mean
        Index
        A 1170302 1742027 1810289.000
        C 1343824 1685349 1595708.000
        D 1111437 1330403 1631207.000
        F 1964626 1964626 1468852.000
        G 1929009 1929009 1541565.000
        H 1461570 1461570 1200280.000
        I 1353210 1776918 1536164.500
        K 1509054 1813878 1369773.000
        L 1584734 1584734 1110625.000
        M 1221316 1983285 1535717.625
        N 1395149 1885081 1382499.625
        O 1173918 1802132 1569934.000
        P 1320191 1320191 1446723.000
        R 1501744 1501744 1942942.000
        S 1159037 1631522 1477072.000
        T 1520591 1811867 1398343.000
        U 1771096 1771096 1195861.000
        V 1134317 1146902 1498122.500
        W 1677347 1977749 1521118.500

        Filtering
        To filter only those rows which have Index as "A" we write:
        income[income.Index == "A"]

        #Alternatively
        income.loc[income.Index == "A",:]
          Index     State    Y2002    Y2003    Y2004    Y2005    Y2006    Y2007  \
        0 A Alabama 1296530 1317711 1118631 1492583 1107408 1440134
        1 A Alaska 1170302 1960378 1818085 1447852 1861639 1465841
        2 A Arizona 1742027 1968140 1377583 1782199 1102568 1109382
        3 A Arkansas 1485531 1994927 1119299 1947979 1669191 1801213

        Y2008 Y2009 Y2010 Y2011 Y2012 Y2013 Y2014 Y2015
        0 1945229 1944173 1237582 1440756 1186741 1852841 1558906 1916661
        1 1551826 1436541 1629616 1230866 1512804 1985302 1580394 1979143
        2 1752886 1554330 1300521 1130709 1907284 1363279 1525866 1647724
        3 1188104 1628980 1669295 1928238 1216675 1591896 1360959 1329341
        To select the States having Index as "A":
        income.loc[income.Index == "A","State"]
        income.loc[income.Index == "A",:].State
        To filter the rows with Index as "A" and income for 2002 > 1500000"
        income.loc[(income.Index == "A") & (income.Y2002 > 1500000),:]
        To filter the rows with index either "A" or "W", we can use isin( ) function:
        income.loc[(income.Index == "A") | (income.Index == "W"),:]

        #Alternatively.
        income.loc[income.Index.isin(["A","W"]),:]
           Index          State    Y2002    Y2003    Y2004    Y2005    Y2006    Y2007  \
        0 A Alabama 1296530 1317711 1118631 1492583 1107408 1440134
        1 A Alaska 1170302 1960378 1818085 1447852 1861639 1465841
        2 A Arizona 1742027 1968140 1377583 1782199 1102568 1109382
        3 A Arkansas 1485531 1994927 1119299 1947979 1669191 1801213
        47 W Washington 1977749 1687136 1199490 1163092 1334864 1621989
        48 W West Virginia 1677347 1380662 1176100 1888948 1922085 1740826
        49 W Wisconsin 1788920 1518578 1289663 1436888 1251678 1721874
        50 W Wyoming 1775190 1498098 1198212 1881688 1750527 1523124

        Y2008 Y2009 Y2010 Y2011 Y2012 Y2013 Y2014 Y2015
        0 1945229 1944173 1237582 1440756 1186741 1852841 1558906 1916661
        1 1551826 1436541 1629616 1230866 1512804 1985302 1580394 1979143
        2 1752886 1554330 1300521 1130709 1907284 1363279 1525866 1647724
        3 1188104 1628980 1669295 1928238 1216675 1591896 1360959 1329341
        47 1545621 1555554 1179331 1150089 1775787 1273834 1387428 1377341
        48 1238174 1539322 1539603 1872519 1462137 1683127 1204344 1198791
        49 1980167 1901394 1648755 1940943 1729177 1510119 1701650 1846238
        50 1587602 1504455 1282142 1881814 1673668 1994022 1204029 1853858
        Alternatively we can use query( ) function and write our filtering criteria:
        income.query('Y2002>1700000 & Y2003 > 1500000')

        Dealing with missing values
        We create a new dataframe named 'crops' and to create a NaN value we use np.nan by importing numpy.
        import numpy as np
        mydata = {'Crop': ['Rice', 'Wheat', 'Barley', 'Maize'],
                'Yield': [1010, 1025.2, 1404.2, 1251.7],
                'cost' : [102, np.nan, 20, 68]}
        crops = pd.DataFrame(mydata)
        crops
        isnull( ) returns True and notnull( ) returns False if the value is NaN.
        crops.isnull()  #same as is.na in R
        crops.notnull()  #opposite of previous command.
        crops.isnull().sum()  #No. of missing values.
crops.cost.isnull() first selects the 'cost' column from the dataframe and then returns a logical vector with isnull( )

        crops[crops.cost.isnull()] #shows the rows with NAs.
        crops[crops.cost.isnull()].Crop #shows the rows with NAs in crops.Crop
        crops[crops.cost.notnull()].Crop #shows the rows without NAs in crops.Crop
To drop all the rows which have missing values in any column we use dropna(how = "any"). By default inplace = False. how = "all" means drop a row only if all the elements in that row are missing.

        crops.dropna(how = "any").shape
        crops.dropna(how = "all").shape  
        To remove NaNs if any of 'Yield' or'cost' are missing we use the subset parameter and pass a list:
        crops.dropna(subset = ['Yield',"cost"],how = 'any').shape
        crops.dropna(subset = ['Yield',"cost"],how = 'all').shape
Replacing the missing values in the 'cost' column with "UNKNOWN":
        crops['cost'].fillna(value = "UNKNOWN",inplace = True)
        crops
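If you prefer to keep the column numeric, a common alternative (a minimal sketch, applied to the original crops dataframe before the "UNKNOWN" fill) is to fill with a statistic such as the mean:
crops['cost'].fillna(value = crops['cost'].mean(), inplace = True)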

        Dealing with duplicates
        We create a new dataframe comprising of items and their respective prices.
        data = pd.DataFrame({"Items" : ["TV","Washing Machine","Mobile","TV","TV","Washing Machine"], "Price" : [10000,50000,20000,10000,10000,40000]})
        data
                     Items  Price
        0 TV 10000
        1 Washing Machine 50000
        2 Mobile 20000
        3 TV 10000
        4 TV 10000
        5 Washing Machine 40000
duplicated( ) returns a logical vector which is True whenever a duplicate is encountered.
        data.loc[data.duplicated(),:]
        data.loc[data.duplicated(keep = "first"),:]
By default keep = 'first' i.e. the first occurrence is considered a unique value and its repetitions are considered as duplicates.
If keep = "last" the last occurrence is considered a unique value and all its repetitions are considered as duplicates.
        data.loc[data.duplicated(keep = "last"),:] #last entries are not there,indices have changed.
        If keep = "False" then it considers all the occurences of the repeated observations as duplicates.
        data.loc[data.duplicated(keep = False),:]  #all the duplicates, including unique are shown.
To drop the duplicates, drop_duplicates( ) is used, with inplace = False by default; keep = 'first', 'last' or False have the same meanings as in duplicated( ).
        data.drop_duplicates(keep = "first")
        data.drop_duplicates(keep = "last")
        data.drop_duplicates(keep = False,inplace = True)  #by default inplace = False
        data

        Creating dummies
        Now we will consider the iris dataset
        iris = pd.read_csv("C:\\Users\\Hp\\Desktop\\work\\Python\\Basics\\pandas\\iris.csv")
        iris.head()
           Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
        0 5.1 3.5 1.4 0.2 setosa
        1 4.9 3.0 1.4 0.2 setosa
        2 4.7 3.2 1.3 0.2 setosa
        3 4.6 3.1 1.5 0.2 setosa
        4 5.0 3.6 1.4 0.2 setosa
The map( ) function is used to match the values and replace them, automatically creating a new series.
        iris["setosa"] = iris.Species.map({"setosa" : 1,"versicolor":0, "virginica" : 0})
        iris.head()
To create dummies get_dummies( ) is used. The prefix = "Species" argument adds the prefix 'Species' to the names of the new columns created.
        pd.get_dummies(iris.Species,prefix = "Species")
        pd.get_dummies(iris.Species,prefix = "Species").iloc[:,0:1]  #1 is not included
        species_dummies = pd.get_dummies(iris.Species,prefix = "Species").iloc[:,0:]
        With concat( ) function we can join multiple series or dataframes. axis = 1 denotes that they should be joined columnwise.
        iris = pd.concat([iris,species_dummies],axis = 1)
        iris.head()
           Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species  \
        0 5.1 3.5 1.4 0.2 setosa
        1 4.9 3.0 1.4 0.2 setosa
        2 4.7 3.2 1.3 0.2 setosa
        3 4.6 3.1 1.5 0.2 setosa
        4 5.0 3.6 1.4 0.2 setosa

        Species_setosa Species_versicolor Species_virginica
        0 1 0 0
        1 1 0 0
        2 1 0 0
        3 1 0 0
        4 1 0 0
It is usual that for a variable with 'n' categories we create 'n-1' dummies; thus, to drop the first dummy column we write drop_first = True.
        pd.get_dummies(iris,columns = ["Species"],drop_first = True).head()

        Ranking
         To create a dataframe of all the ranks we use rank( )
        iris.rank() 
        Ranking by a specific variable
        Suppose we want to rank the Sepal.Length for different species in ascending order:
        iris['Rank'] = iris.sort_values(['Sepal.Length'], ascending=[True]).groupby(['Species']).cumcount() + 1
        iris.head( )

        #Alternatively
        iris['Rank2'] = iris['Sepal.Length'].groupby(iris["Species"]).rank(ascending=1)
        iris.head()

        Calculating the Cumulative sum
        Using cumsum( ) function we can obtain the cumulative sum
        iris['cum_sum'] = iris["Sepal.Length"].cumsum()
        iris.head()
        Cumulative sum by a variable
        To find the cumulative sum of sepal lengths for different species we use groupby( ) and then use cumsum( )
        iris["cumsum2"] = iris.groupby(["Species"])["Sepal.Length"].cumsum()
        iris.head()

        Calculating the percentiles.
        Various quantiles can be obtained by using quantile( )
        iris.quantile(0.5)
        iris.quantile([0.1,0.2,0.5])
        iris.quantile(0.55)

        if else in Python
        We create a new dataframe of students' name and their respective zodiac signs.
        students = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                                 'Zodiac Signs': ['Aquarius','Libra','Gemini','Pisces','Virgo']})
def name(row):
    if row["Names"] in ["John","Henry"]:
        return "yes"
    else:
        return "no"

        students['flag'] = students.apply(name, axis=1)
        students
        Functions in python are defined using the block keyword def , followed with the function's name as the block's name. apply( ) function applies function along rows or columns of dataframe.

Note : If using a simple 'if else' we need to take care of the indentation. Python does not use curly braces for loops and if-else blocks.

        Output
              Names Zodiac Signs flag
        0 John Aquarius yes
        1 Mary Libra no
        2 Henry Gemini yes
        3 Augustus Pisces no
        4 Kenny Virgo no

Alternatively, by importing numpy we can use np.where. The first argument is the condition to be evaluated, the 2nd argument is the value if the condition is True and the last argument defines the value if the condition is False.
        import numpy as np
        students['flag'] = np.where(students['Names'].isin(['John','Henry']), 'yes', 'no')
        students

        Multiple Conditions : If Else-if Else
def mname(row):
    if row["Names"] == "John" and row["Zodiac Signs"] == "Aquarius" :
        return "yellow"
    elif row["Names"] == "Mary" and row["Zodiac Signs"] == "Libra" :
        return "blue"
    elif row["Zodiac Signs"] == "Pisces" :
        return "blue"
    else:
        return "black"

        students['color'] = students.apply(mname, axis=1)
        students

We create a list of conditions and their respective values if evaluated True, and use np.select, where default is the value used if all the conditions are False.
        conditions = [
            (students['Names'] == 'John') & (students['Zodiac Signs'] == 'Aquarius'),
            (students['Names'] == 'Mary') & (students['Zodiac Signs'] == 'Libra'),
            (students['Zodiac Signs'] == 'Pisces')]
        choices = ['yellow', 'blue', 'purple']
        students['color'] = np.select(conditions, choices, default='black')
        students
              Names Zodiac Signs flag   color
        0 John Aquarius yes yellow
        1 Mary Libra no blue
        2 Henry Gemini yes black
        3 Augustus Pisces no purple
        4 Kenny Virgo no black

        Select numeric or categorical columns only
        To include numeric columns we use select_dtypes( ) 
        data1 = iris.select_dtypes(include=[np.number])
        data1.head()
The _get_numeric_data( ) method also provides a utility to select only the numeric columns.
        data3 = iris._get_numeric_data()
        data3.head(3)
           Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  cum_sum  cumsum2
        0 5.1 3.5 1.4 0.2 5.1 5.1
        1 4.9 3.0 1.4 0.2 10.0 10.0
        2 4.7 3.2 1.3 0.2 14.7 14.7
        For selecting categorical variables
        data4 = iris.select_dtypes(include = ['object'])
        data4.head(2)
         Species
        0 setosa
        1 setosa

        Concatenating
        We create 2 dataframes containing the details of the students:
        students = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                                 'Zodiac Signs': ['Aquarius','Libra','Gemini','Pisces','Virgo']})
        students2 = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                                  'Marks' : [50,81,98,25,35]})
         using pd.concat( ) function we can join the 2 dataframes:
        data = pd.concat([students,students2])  #by default axis = 0
           Marks     Names Zodiac Signs
        0 NaN John Aquarius
        1 NaN Mary Libra
        2 NaN Henry Gemini
        3 NaN Augustus Pisces
        4 NaN Kenny Virgo
        0 50.0 John NaN
        1 81.0 Mary NaN
        2 98.0 Henry NaN
        3 25.0 Augustus NaN
        4 35.0 Kenny NaN
By default axis = 0, thus the new dataframe will be added row-wise. If a column is not present in one of the dataframes, NaNs are created. To join column-wise we set axis = 1:
        data = pd.concat([students,students2],axis = 1)
        data
              Names Zodiac Signs  Marks     Names
        0 John Aquarius 50 John
        1 Mary Libra 81 Mary
        2 Henry Gemini 98 Henry
        3 Augustus Pisces 25 Augustus
        4 Kenny Virgo 35 Kenny
        Using append function we can join the dataframes row-wise
        students.append(students2)  #for rows
        Alternatively we can create a dictionary of the two data frames and can use pd.concat to join the dataframes row wise
        classes = {'x': students, 'y': students2}
         result = pd.concat(classes)
        result 
             Marks     Names Zodiac Signs
        x 0 NaN John Aquarius
        1 NaN Mary Libra
        2 NaN Henry Gemini
        3 NaN Augustus Pisces
        4 NaN Kenny Virgo
        y 0 50.0 John NaN
        1 81.0 Mary NaN
        2 98.0 Henry NaN
        3 25.0 Augustus NaN
        4 35.0 Kenny NaN

        Merging or joining on the basis of common variable.
        We take 2 dataframes with different number of observations:
        students = pd.DataFrame({'Names': ['John','Mary','Henry','Maria'],
                                 'Zodiac Signs': ['Aquarius','Libra','Gemini','Capricorn']})
        students2 = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                                  'Marks' : [50,81,98,25,35]})
Using pd.merge we can join the two dataframes. on = 'Names' denotes that the common variable on the basis of which the dataframes are to be combined is 'Names'.
        result = pd.merge(students, students2, on='Names')  #it only takes intersections
        result
           Names Zodiac Signs  Marks
        0 John Aquarius 50
        1 Mary Libra 81
        2 Henry Gemini 98
         By default how = "inner" thus it takes only the common elements in both the dataframes. If you want all the elements in both the dataframes set how = "outer"
result = pd.merge(students, students2, on='Names',how = "outer")  #it takes the union of both
        result
              Names Zodiac Signs  Marks
        0 John Aquarius 50.0
        1 Mary Libra 81.0
        2 Henry Gemini 98.0
        3 Maria Capricorn NaN
        4 Augustus NaN 25.0
        5 Kenny NaN 35.0
To keep all the rows from the left dataframe (and only the matching rows from the right) set how = 'left':
        result = pd.merge(students, students2, on='Names',how = "left")
        result
           Names Zodiac Signs  Marks
        0 John Aquarius 50.0
        1 Mary Libra 81.0
        2 Henry Gemini 98.0
        3 Maria Capricorn NaN
Similarly, how = 'right' keeps all the rows from the right dataframe (and only the matching rows from the left).
        result = pd.merge(students, students2, on='Names',how = "right",indicator = True)
        result
              Names Zodiac Signs  Marks      _merge
        0 John Aquarius 50 both
        1 Mary Libra 81 both
        2 Henry Gemini 98 both
        3 Augustus NaN 25 right_only
        4 Kenny NaN 35 right_only
indicator = True creates a column indicating whether each row is present in both the dataframes or only in the left or right dataframe.

        ListenData: Loops in Python explained with examples

This tutorial covers various ways to execute loops in Python. Loops are an important concept in any programming language; they perform iterations, i.e. run specific code repeatedly until a certain condition is reached.

        1. For Loop

        Like the R and C programming languages, Python supports the for loop. It is one of the most commonly used loop constructs for automating repetitive tasks.

        How does a for loop work?

        Suppose you are asked to print the sequence of numbers from 1 to 9, incrementing by 2.
        for i in range(1,10,2):
            print(i)
        Output
        1
        3
        5
        7
        9
        range(1,10,2) starts at 1 and ends at 9 (10 is excluded), incrementing by 2.

        Iteration over a list
        This section covers how to run a for loop over a list.
        mylist = [30,21,33,42,53,64,71,86,97,10]
        for i in mylist:
            print(i)
        Output
        30
        21
        33
        42
        53
        64
        71
        86
        97
        10

        Suppose you need to select every 3rd value of the list.
        for i in mylist[::3]:
            print(i)
        Output
        30
        42
        71
        10
        mylist[::3] is equivalent to mylist[0::3], which follows the syntax list[start:stop:step].
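        The start, stop and step values can be varied independently. A small illustrative sketch (the index values used here are just examples):
        print(mylist[1::3])   # [21, 53, 86] - start at index 1, take every 3rd value
        print(mylist[::-1])   # a reversed copy of the list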

        Python Loop Explained with Examples

        Example 1 : Create a new list with only the items from the list that are between 0 and 10
        l1 = [100, 1, 10, 2, 3, 5, 8, 13, 21, 34, 55, 98]

        new = []  # Blank list
        for i in l1:
            if i > 0 and i <= 10:
                new.append(i)

        new
        Output: [1, 10, 2, 3, 5, 8]
        It can also be done with the numpy package by converting the list to a numpy array. See the code below.
        import numpy as np
        k = np.array(l1)
        new = k[np.where((k > 0) & (k <= 10))]
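        The same filtering can also be written as a list comprehension, which is the usual idiom for this loop-and-append pattern; a sketch equivalent to the for loop above:
        new = [i for i in l1 if i > 0 and i <= 10]
        print(new)   # [1, 10, 2, 3, 5, 8]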

        Example 2 : Check which letters (a-z) appear in a string

        Suppose you have a string named k and you want to check which letters of the alphabet exist in the string k.
        k = "deepanshu"

        import string
        for n in string.ascii_lowercase:
            if n in k:
                print(n + ' exists in ' + k)
            else:
                print(n + ' does not exist in ' + k)
        string.ascii_lowercase returns 'abcdefghijklmnopqrstuvwxyz'.
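        If you only need the letters that do occur in the string (without the 'does not exist' messages), a set intersection is a compact alternative; a quick sketch:
        import string
        present = sorted(set(k) & set(string.ascii_lowercase))
        print(present)   # ['a', 'd', 'e', 'h', 'n', 'p', 's', 'u'] for k = "deepanshu"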

        Practical Examples : for in loop in Python

        Create a sample pandas data frame for illustrative purposes.
        import pandas as pd
        import numpy as np
        np.random.seed(234)
        df = pd.DataFrame({"x1" : np.random.randint(low=1, high=100, size=10),
                           "Month1" : np.random.normal(size=10),
                           "Month2" : np.random.normal(size=10),
                           "Month3" : np.random.normal(size=10),
                           "price" : range(10)
                           })

        df
        1. Multiply each month column by 1.2
        for i in range(1,4):
            print(df["Month"+str(i)]*1.2)
        range(1,4) returns 1, 2 and 3. The str( ) function is used to convert to string. "Month" + str(1) means Month1.
        2. Store computed columns in a new data frame
        import pandas as pd
        newDF = pd.DataFrame()
        for i in range(1,4):
            data = pd.DataFrame(df["Month"+str(i)]*1.2)
            newDF = pd.concat([newDF,data], axis=1)
        pd.DataFrame( ) is used to create a blank data frame. The concat() function from the pandas package is used to concatenate two data frames.
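        For this particular task the loop is not strictly required; pandas can multiply several columns at once. A sketch of a vectorized alternative, assuming the column names Month1, Month2 and Month3 as above:
        newDF = df[["Month1", "Month2", "Month3"]] * 1.2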

        3. If the value of x1 >= 50, multiply each month column by price. Otherwise keep the month value as it is.
        import pandas as pd
        import numpy as np
        for i in range(1,4):
            df['newcol'+str(i)] = np.where(df['x1'] >= 50,
                                           df['Month'+str(i)] * df['price'],
                                           df['Month'+str(i)])
        In this example, we are adding new columns named newcol1, newcol2 and newcol3. np.where(condition, value_if_condition_met, value_if_condition_not_met) is used to construct an IF-ELSE statement.

        4. Filter data frame by each unique value of a column and store it in a separate data frame
        mydata = pd.DataFrame({"X1" : ["A","A","B","B","C"]})

        for name in mydata.X1.unique():
            temp = pd.DataFrame(mydata[mydata.X1 == name])
            exec('{} = temp'.format(name))
        The unique( ) function is used to calculate the distinct values of a variable. The exec( ) function is used for dynamic execution of a Python program. See the usage of the string format( ) function below -
        s= "Your Input"
        "i am {}".format(s)

        Output: 'i am Your Input'
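        Using exec( ) to create variables dynamically works, but it is usually cleaner to keep the filtered dataframes in a dictionary keyed by the group value; a sketch of that alternative:
        frames = {name: mydata[mydata.X1 == name] for name in mydata.X1.unique()}
        frames["A"]   # the rows where X1 == "A"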

        Loop Control Statements

        Loop control statements change the execution of a loop from its normal sequence of iterations.

        Python supports the following control statements.
        1. Continue statement
        2. Break statement

        Continue Statement
        When the continue statement is executed, it skips the rest of the code inside the loop for the current iteration and continues with the next iteration.
        In the code below, we are preventing the letters a and d from being printed.
        for n in "abcdef":
        if n =="a" or n =="d":
        continue
        print("letter :", n)
        letter : b
        letter : c
        letter : e
        letter : f
        Break Statement
        When the break statement runs, it stops the loop immediately.
        In this program, the loop stops executing as soon as n is c.
        for n in "abcdef":
        if n =="c" or n =="d":
        break
        print("letter :", n)
        letter : a
        letter : b

        for loop with else clause

        Using an else clause with a for loop is not common in the Python developer community.
        The else clause executes after the loop completes normally, i.e. when the loop did not encounter a break statement.
        The program below finds factors for the numbers from 2 to 9. The else clause prints the numbers which have no factors and are therefore prime numbers:

        for k in range(2, 10):
            for y in range(2, k):
                if k % y == 0:
                    print(k, '=', y, '*', round(k/y))
                    break
            else:
                print(k, 'is a prime number')
        2 is a prime number
        3 is a prime number
        4 = 2 * 2
        5 is a prime number
        6 = 2 * 3
        7 is a prime number
        8 = 2 * 4
        9 = 3 * 3

        While Loop

        A while loop is used to execute code repeatedly as long as a condition holds. When the condition becomes false, the line immediately after the loop is executed.
        i = 1
        while i < 10:
            print(i)
            i += 2  # means i = i + 2
            print("new i :", i)
        Output:
        1
        new i : 3
        3
        new i : 5
        5
        new i : 7
        7
        new i : 9
        9
        new i : 11

        While Loop with If-Else Statement

        An if-else statement can be used inside a while loop. See the program below -

        counter = 1
        while counter <= 5:
            if counter < 2:
                print("Less than 2")
            elif counter > 4:
                print("Greater than 4")
            else:
                print(">= 2 and <=4")
            counter += 1
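        Tracing the loop, counter runs from 1 to 5: the value 1 hits the first branch, 2 to 4 hit the else branch, and 5 hits the elif branch, so the program prints:
        Less than 2
        >= 2 and <=4
        >= 2 and <=4
        >= 2 and <=4
        Greater than 4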

        Talk Python to Me: #208 Packaging, Making the most of PyCon, and more

        Are you going to PyCon (or a similar conference)? Join me and Kenneth Reitz as we discuss how to make the most of PyCon and what makes it special for each of us.

        The Code Bits: Getting started with Raspberry Pi and Python


        Hello there! So you just got a shiny new Raspberry Pi. Well done and congrats! In this tutorial, we are going to look at how to set up your Raspberry Pi and get it up and running.

        What you need to get started

        First and foremost we have to ensure that we have all the required items to get started. Here is a short list of items that are absolutely required to get the ball rolling.

        things_needed

        Optional

        You can also get these as a bundle (buy from Amazon), or if you have some of the components lying around, you could selectively get the rest as needed.

        Getting your OS ready

        Raspberry Pi is a miniature computer and, just like your laptop or desktop PC, it needs to run an Operating System. Since Raspberry Pi runs an ARM-based processor (if you haven’t heard of ARM processors, they are low-power processors commonly found in mobile phones and tablets), we need to use an OS that supports it. Luckily, both the Raspberry Pi Foundation and many Linux community members have created operating systems that you can choose from to run on your board! In this tutorial, we are going to use the NOOBS distribution, which you can find here.

        Once you go to the above link, choose NOOBS; you will be given two download options, so select the full NOOBS download. This is a large file and it might take some time. Be patient!

        In the meanwhile, we need another piece of software that helps us format and transfer our OS to our microSD card. You can find the software here.

        Once the SDFormatter tool is installed, and the OS image is downloaded, we are ready to go!

        The first step is to unzip the NOOBS file that you have downloaded. You will get a NOOBS_v3_0_0 folder or something similar depending on the OS version. Open up the SDFormatter tool and insert your microSD card into your PC. You can either use a USB microSD adapter like this or use the one that’s in your laptop / PC.

        On the SDFormatter tool, select the drive that corresponds to your SD card. Make sure this is correct!

        SDFormatter

        IMPORTANT: Please back up any data before formatting!

        Once the formatting is complete, you can copy the contents of the NOOBS folder (which should look something similar to what is shown below) onto your formatted SD card.

        Noobs_contents

        Connecting all pieces and booting up!

        All right, we’re almost there! Now we need to connect our Raspberry Pi to our peripherals and boot it up for the first time.

        Here, I have used a USB keyboard, a Wireless Logitech mouse that has a wireless USB adapter. I have plugged in a USB Wi-Fi adapter to connect to my network and an HDMI cable to connect to my monitor. The overall setup after setting up everything including power looks something like this.

        fully_connected

        Setting up the OS

        Once we have the setup ready, let’s connect the board to the peripherals and power and boot it up. Initially, you will see a dialog box to select the operating system.

        Select Raspbian Full and click the install button. The installation dialog will follow afterward. It will copy the files and install the OS.

        Once the OS is installed, the system will reboot and you will be welcomed with the new desktop! Next, you will be prompted to set up your account password, keyboard profile, and timezone as well as connect to a Wi-Fi endpoint. That is pretty much it!

        Hello world in Python

        Fire up the terminal by going to the menu button on the top left and selecting the terminal. Open up a python shell by entering

        python
        

        This opens up a Python shell. Say hello world from your shiny new Raspberry Pi! (The parenthesized form below works in both Python 2 and Python 3.)

        print("Hello world!")

        Wrap up

        So that’s it! We’ve made it to the end and we have a Raspberry Pi up and running! Woohoo! Now is the fun part, which is all the fun things we could do with it. I’ll be doing some interesting projects that you could follow along on this blog going forward. Make sure to subscribe to thecodebits.com to receive updates. See you all soon!

        The Code Bits: Flask Project for Beginners: Inspirational Quotes


        In this project, we will create a web application that displays a random inspirational quote.

        The goal of this project is to learn about application factory and how to create templates in Flask.

        This is the second part of the “Getting started with Flask series”. In the first part, we learned how to create a basic Hello World application using Flask and run it locally.

        Installation and Setup

        First, let us install the dependencies and set up our project directory. Basically, what we need to do is:
        • Install Python. We will be using Python3 here.
        • Create a directory for our project, say “1_inspirational_quotes”, and go inside the directory.
          mkdir 1_inspirational_quotes
          cd 1_inspirational_quotes
        • We will create a virtual environment for our project where we will install Flask and any other dependencies. So go ahead and create a virtual environment and activate it.
          python3 -m venv venv
          . venv/bin/activate
        • Finally install Flask.
          pip3 install flask
        If you need more instructions on installation, refer to Flask installation guide or Getting started with Flask.

        Set up the Application Factory

        Now that we have set up our project directory, the next thing that we need to do is to create the Flask application, which is nothing but an instance of the Flask class.

        We could create the Flask instance globally, the way we did in Getting Started with Flask: Hello World. However, in this example, we will create it within a function.

        Application Factory is the term used to refer to the method inside which we will create our application (Flask instance). All sorts of configuration and setup required for the application will also be done within the application factory. Finally it will return the Flask application instance.

        So let us create a directory ‘quotes’ and add an __init__.py file to it. This will cause the directory to be treated as a Python package.

        mkdir quotes
        cd quotes
        touch __init__.py
        

        Then let us define our application factory in this file.

        –> 1_inspirational_quotes/quotes/__init__.py

        from flask import Flask
        
        def create_app():
            """
            create_app is the application factory.
            """
            # Create the app.
            app = Flask(__name__)
        
            # Add a route.
            @app.route('/')
            def home():
                return 'Hello there! :)'
        
            # Return the app.
            return app
        

        Notes:

        • The method create_app is the application factory.
        • Within the application factory, we created a Flask instance, app, which is nothing but our application. Note that __name__ refers to the package name here, i.e., quotes. This is what will be used as our application name.
        • Then we created a placeholder method, home, which will serve the content for our app page. For now, it just returns some string which will get displayed on our browser when we run the application.
        • The decorator, @app.route, links the URL (/) to the method, home.

        Run the basic application

        In order to make sure that everything is set up correctly, let us run the application and see if it is working.

        First, let us set the FLASK_APP environment variable to be our application package name. This basically tells Flask which application to run.

        export FLASK_APP=quotes

        We will also set the environment variable FLASK_ENV to development so that:

        1. debug mode is turned on and the debugger is activated.
        2. the server will be restarted whenever we make a code change. We can make modifications to our code and simply refresh the browser to see the changes in effect.
        export FLASK_ENV=development

        Note: If you are on Windows, use set  instead of export.
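        For example, in a Windows command prompt the two commands above would look something like this (PowerShell uses a different $env: syntax):

        set FLASK_APP=quotes
        set FLASK_ENV=development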

        Now we are ready to run the application. So go ahead and run it using the flask command. You should see an output similar to the following.

        flask run
         * Serving Flask app "quotes" (lazy loading)
         * Environment: development
         * Debug mode: on
         * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
         * Restarting with stat
         * Debugger is active!
         * Debugger PIN: 150-101-403
        

        Note: Make sure that you are running the command from the ‘1_inspirational_quotes’ directory and not ‘quotes’. Otherwise, you will see the error “flask.cli.NoAppException: Could not import “quotes.quotes”.”

        To see the app in action, go to http://127.0.0.1:5000/ on your browser. You should see our message displayed in it as shown in the following image.

        Awesome! Now let us start building our quotes app.

        Add a template

        Currently, our app just displays the string, “Hello there! :)” to the user. In this section, we will learn how to create a template that shows a random inspirational quote.

        Return HTML content from the application factory

        The simplest way to achieve this is to return the HTML code as a string instead of our hello world string as shown below:

        —> 1_inspirational_quotes/quotes/__init__.py

        from flask import Flask
        
        def create_app():
            """
            create_app is the application factory.
            """
            # Create the app.
            app = Flask(__name__)
        
            # Add a route.
            @app.route('/')
            def home():
                return '''
        <html>
        <body>
          I find that the harder I work, the more luck I seem to have. – Thomas Jefferson
        </body>
        </html>
        '''
        
            # Return the app.
            return app
        

        Now if you go to http://127.0.0.1:5000/, you should see the quote displayed on the screen:

        Even though this works perfectly fine, this is not the best approach to serve HTML content for our application. First of all, the code does not look clean. Second, as our application grows, modifying and maintaining the template within the application factory will be tedious. So we need to isolate our template from the application factory.

        Create a static HTML template file

        A template is a file that contains static data as well as placeholders for dynamic data. In this section, we will just be creating static HTML template that displays a quote to our user. In a later section, we will see how to make it dynamic.

        Within the quotes directory, let us add a directory to keep our templates and move our quotes template to a separate HTML file.

        mkdir templates
        touch templates/quotes.html
        

        Note that our template is stored within a directory named templates under the application directory, quotes. This is where Flask expects its templates by default.

        –>1_inspirational_quotes/quotes/templates/quotes.html

        <!doctype html>
        <html>
        <body>
          I find that the harder I work, the more luck I seem to have. – Thomas Jefferson
        </body>
        </html>
        

        Register the template with the application factory

        Now we need to modify our application factory such that this HTML file is served when users visit our web page.

        —> 1_inspirational_quotes/quotes/__init__.py

        from flask import Flask, render_template
        
        def create_app():
            """
            create_app is the application factory.
            """
            # Create the app.
            app = Flask(__name__)
        
            # Add a route.
            @app.route('/')
            def home():
                return render_template('quotes.html')
        
            # Return the app.
            return app
        

        Note how we introduced the method, render_template(). In this case, it takes our HTML file name and returns its contents. Later on, when we learn about serving dynamic content, we will learn more about rendering and how Flask uses Jinja for template rendering.

        Now if we go to http://127.0.0.1:5000/, we should see the quote displayed on the screen just as we saw earlier.

        Update the template to render quotes dynamically using Jinja

        Now that we have learned how to create a template and register it with the application factory, let us see how we can serve content dynamically.

        Right now our app just displays the same quote every time someone visits. Our goal is to dynamically update the quote by selecting one randomly from a set of quotes.

        First, let us go ahead and create a list of quotes. To keep things simple, we will be adding it in memory within the application factory. In a later post, we will explore how to use databases with Flask.

        —> 1_inspirational_quotes/quotes/__init__.py

        from flask import Flask, render_template
        import random
        
        def create_app():
            """
            create_app is the application factory.
            """
            # Create the app.
            app = Flask(__name__)
        
            # Add a route.
            @app.route('/')
            def home():
                sample_quotes = [
                    "I find that the harder I work, the more luck I seem to have. – Thomas Jefferson",
                    "Success is the sum of small efforts, repeated day in and day out. – Robert Collier",
                    "There are no shortcuts to any place worth going. – Beverly Sills",
                    "The only place where success comes before work is in the dictionary. – Vidal Sassoon",
                    "You don’t drown by falling in the water; you drown by staying there. – Ed Cole"
                ]
        
                # Select a random quote.
                selected_quote = random.choice(sample_quotes)
        
                # Pass the selected quote to the template.
                return render_template('quotes.html', quote=selected_quote)
        
            # Return the app.
            return app
        

        As you can see, now we are passing an additional parameter, quote, to the render_template function. Flask uses Jinja to render dynamic content in the template. With this change, the variable, quote, becomes available in the template, quotes.html. Now let us see how we can update the template file to make use of this variable.

        –>1_inspirational_quotes/quotes/templates/quotes.html

        <!doctype html>
        <html>
        <body>
          {{ quote }}
        </body>
        </html>
        

        Here, {{..}} is the delimiter used by Jinja to denote expressions which will be evaluated and rendered in the final HTML document.

        Now if we go to http://127.0.0.1:5000/ and keep refreshing the page, we should see a different random quote selected from the list every time. A demo is shown below:

        Add a stylesheet

        As of now, our app works, but it looks very plain. So now we will see how to add a simple stylesheet to it.

        In Flask, just like templates are expected to be in the templates directory by default, static files like CSS stylesheets are expected to be in the static directory within the application folder.

        So go ahead and create the directory and add a CSS file style.css to it.

        mkdir static
        touch static/style.css
        

        —>1_inspirational_quotes/quotes/static/style.css

        body {
          background-color: black;
          background-image: url("background.jpg");
          background-size:cover;
        }
        .quote_div {
          text-align: center;
          color: white;
          font-size: 30px;
          padding: 25px 5px;
          margin: 15% auto auto;
        }
        

        You can also add a background image and keep it under the static directory as shown above.

        Now let us modify the template to use the stylesheet.

        –>1_inspirational_quotes/quotes/templates/quotes.html

        <!doctype html>
        <html>
        <head>
          <link rel="stylesheet" href="{{ url_for('static', filename='style.css') }}">
        </head>
        <body>
          <div class="quote_div">
            {{ quote }}
          </div>
        </body>
        </html>
        

        Now if we go to http://127.0.0.1:5000/, we will see a nicer app! A demo:

        Conclusion

        In case you want to browse through the code or download and try it out, the Github link for this project is here.

        In this post, we learned how to create a basic Flask application that serves dynamic data using a Jinja template.

        For more advanced lessons with projects, stay tuned and subscribe to our blog!

        Mike Driscoll: PyDev of the Week: Dane Hillard


        This week we welcome Dane Hillard (@easyaspython) as our PyDev of the Week! Dane is the author of Practices of the Python Pro, an upcoming book from Manning. He is also a blogger and web developer. Let’s take some time to get to know Dane!

        Can you tell us a little about yourself (hobbies, education, etc):

        I’m a creative type, so many of my interests are in art and music. I’ve been a competitive ballroom dancer, and I’m a published musician and photographer. I’m proud of those accomplishments, but I’m driven to do most of this stuff for personal fulfillment more than anything! I enjoy sharing and discussing what I learn with others, too. When I have some time my next project is to start exploring foodways, which is this idea of exploring food and its cultural impact through written history. I’ve loved cooking (and food in general) for a long time and I want to get to know its origins better, which I think is something this generation is demanding more from industries as a whole. Should be fun!

        Why did you start using Python?

        I like using my computer engineering skills to build stuff not just for work, but for myself. I had written a website for my photography business in PHP way back in the day, but I wasn’t using a framework of any kind and the application code was mixed with the front-end code in a way that was hard to manage. I decided to try out a framework, and after using (and disliking) Java Spring for a while I gave Django a try. The rest is history! I started using Python for a few work-related things at the time and saw that it adapted well to many different types of tasks, so I kept rolling with it.

        What other programming languages do you know and which is your favorite?

        I use JavaScript fairly regularly, though it wasn’t until jQuery gave way to reactive paradigms that I really started enjoying it. We’re using React and Vue frequently now and I like it quite a bit for client-side development. I’ve also used Ruby in the past, which I find to be quite Python-like in certain ways. I think I still like Python best, but it’s easy to stick with what you know, right? I wouldn’t mind learning some Rust or Go soon! My original background is mainly in C and C++ but I can barely manage the memory in my own head so I don’t like telling a computer how to manage its memory when I can avoid it, but all these languages have their place.

        What projects are you working on now?

        At ITHAKA we’ve been managing an open source Python REST client, apiron, for a while now. We just released a feature where I got to explore some metaprogramming, which was stellar. It ended up reducing boilerplate people have to write, which is also stellar. I also built a new website as a bit of a portfolio and to centralize some of my online presence. It’s written in Vue, but was my first chance to explore vue-router and couple of other libraries, along with a headless CMS for blogging.

        The biggest amount of my free time definitely goes to thinking about and writing the book I’m working on, which introduces people new to software development to some concepts important in collaborative software, in the context of Python. I’m hoping it will help people just graduating, switching disciplines, or who want to augment their work with software! The book is in early access and I’m chugging away on new chapters as we speak.

        Which Python libraries are your favorite (core or 3rd party)?

        The requests library is one of the more ubiquitous libraries, and it’s what we built apiron on top of. I’ve started using pytest a bit in place of Python’s built-in unittest, and I like the ways it simplifies existing tests while also providing tooling for doing more complex things with fixtures. There’s a great package, zappa, for deploying Django apps (or anything WSGI-based, I believe) to AWS Lambda. Look into that if you’re spending too much on an EC2 instance! For image manipulation, Pillow is great. One that I’d like to try out more soon is surprise, which helps you make recommendation systems akin to what Netflix or Hulu uses to recommend movies. Too many others to name here!

        How did you come to author a book?

        I don’t know how it works for most authors, but in my case the publisher, Manning, reached out to me—probably after seeing the blog posts I’ve written online. Presented with the opportunity, it was difficult to figure out if I really felt ready or qualified to do a book, which I still ask myself often if I’m being honest. I try to frame it to myself as an opportunity to help others, so even if I don’t produce something perfect I hope that I’ll still be able to say I did that much!

        What challenges did you have writing the book and how did you overcome them?

        Finding time and balancing it with other priorities is the primary struggle for me, as I imagine it is for many authors. The uncertainty I mentioned earlier is another one. Something that surprised me was how easy it is to use overloaded terms in the context of programming; many concepts have similar names and many English words can be ambiguous for untrained readers! My editor fortunately keeps these at bay, but I slip up often! Teaching is hard. The best way I’ve found to mitigate issues like this is to automate where I can.

        Is there anything else you’d like to say?

        If you’re out there thinking about getting into programming or writing a book or anything really, and you’re fortunate to have the means to do so, get to it! I’ve found that I don’t know how I feel about something until I really examine it, flip a few switches, find out how it works under the hood. Sometimes you’ll find you don’t like something as much as you thought, but maybe it uncovers tangentially-related things you want to explore. The most important part is getting started!

        Thanks for doing the interview, Dane!

        Ram Rachum: PySnooper: Never use print for debugging again

        PyCharm: Interview: Dan Tofan for this week’s data science webinar


        In the past few years, Python has made a big push into data science and PyCharm has as well. Years ago we added Jupyter Notebook integration, then 2017.3 introduced Scientific Mode for workflows that felt more like an IDE. In 2019.1 we re-invented our Jupyter support to also be more like a professional tool.

        PyCharm and data science are thus a hot topic. Dan Tofan very recently published a Pluralsight course on using PyCharm for data science and we invited him for a webinar next week.

        To help set the stage, below is an interview with Dan.

        • Thursday, April 25
        • 7PM GMT+3, 9AM Pacific
        • Register here
        • Aimed at new and intermediate data scientists

        webinar-05-2

        Let’s start with the key point: what does PyCharm bring to data scientists?

        PyCharm brings a productivity boost to data scientists, by helping them explore data, debug Python code, write better Python code, and understand Python code faster. As a PyCharm user, I experienced and benefited from these productivity boosters, which I distilled into my first Pluralsight course, so that data scientists can make the most out of PyCharm in their activities.

        For the webinar: who is it for and what can people expect you to cover?

        If you are a data scientist who dabbled with PyCharm, then this webinar is for you. I will cover PyCharm’s most relevant features to data science: the scientific mode and the completely rewritten Jupyter support. I will show how these features interplay with other PyCharm features, such as refactoring code from Jupyter cells. I will use easy-to-understand code examples with popular data science libraries.

        Now, back to the start: tell us a little about yourself.

        Currently, I am a senior backend developer for Dimensions, a research data platform that uses data science and links data on a total of over 140 million publications, grants, patents and clinical trials. I’ve always been curious, which led me to do my PhD studies at the University of Groningen (Netherlands) and learn more about statistics and data analysis.

        Do Python data scientists feel like programmers first and data scientists second, or the reverse?

        In my opinion, data science is a melting pot of skills from three complementing backgrounds: programmers, statisticians and business analysts. At the start of your data science journey, you are going to rely on the skills from your main background, and – as your skills expand – you are going to feel more and more like a data scientist.

        Your course has a bunch of sections on software development practices and IDE tips. How important are these practices to “professional” data science?

        As part of the melting pot, programmers bring a lot of value with their experiences ranging from software development practices to IDE tips. Data scientists from a programming background are already familiar with most of these, and those from other backgrounds benefit immensely.

        Think of a code base that starts to grow: how do you write better code? How do you refactor the code? How can a new team member understand that code faster? These are some of the questions that my course helps with.

        The course also covers three major facilities in PyCharm Professional: Scientific Mode, Jupyter support, and the Database tool. How do these fit in?

        All of them are data centric, so they are very relevant to data scientists. These facilities are integrated nicely with other PyCharm capabilities such as debugging and refactoring. Overall, after watching the course and getting familiar with these capabilities, data scientists get a nice productivity boost.

        This webinar is good timing. You just released the course and we just re-invented our Jupyter support. What do you think of the new, IDE-centric Jupyter integration?

        I think the new Jupyter integration is an excellent step in the right direction, because you can use both Jupyter and PyCharm features such as debugging and code completion. Joel Grus gave an insightful and entertaining talk about Jupyter limitations at JupyterCon 2018. I think the new Jupyter integration in PyCharm can eventually help solve some Jupyter pain points raised by Joel, such as hidden state.

        What’s one big problem or pain point in Jupyter that could benefit from new ideas or tooling?

        Reproducibility is problematic with Jupyter and it is important for data science. For example, it’s easy to share a notebook on GitHub, then someone else tries to run it and gets different results. Perhaps the solution is a mix of discipline and better tools.

        Real Python: A Beginner’s Guide to the Python time Module


        The Python time module provides many ways of representing time in code, such as objects, numbers, and strings. It also provides functionality other than representing time, like waiting during code execution and measuring the efficiency of your code.

        This article will walk you through the most commonly used functions and objects in time.

        By the end of this article, you’ll be able to:

        • Understand core concepts at the heart of working with dates and times, such as epochs, time zones, and daylight savings time
        • Represent time in code using floats, tuples, and struct_time
        • Convert between different time representations
        • Suspend thread execution
        • Measure code performance using perf_counter()

        You’ll start by learning how you can use a floating point number to represent time.

        Free Bonus: Click here to get our free Python Cheat Sheet that shows you the basics of Python 3, like working with data types, dictionaries, lists, and Python functions.

        Dealing With Python Time Using Seconds

        One of the ways you can manage the concept of Python time in your application is by using a floating point number that represents the number of seconds that have passed since the beginning of an era—that is, since a certain starting point.

        Let’s dive deeper into what that means, why it’s useful, and how you can use it to implement logic, based on Python time, in your application.

        The Epoch

        You learned in the previous section that you can manage Python time with a floating point number representing elapsed time since the beginning of an era.

        Merriam-Webster defines an era as:

        • A fixed point in time from which a series of years is reckoned
        • A system of chronological notation computed from a given date as basis

        The important concept to grasp here is that, when dealing with Python time, you’re considering a period of time identified by a starting point. In computing, you call this starting point the epoch.

        The epoch, then, is the starting point against which you can measure the passage of time.

        For example, if you define the epoch to be midnight on January 1, 1970 UTC—the epoch as defined on Windows and most UNIX systems—then you can represent midnight on January 2, 1970 UTC as 86400 seconds since the epoch.

        This is because there are 60 seconds in a minute, 60 minutes in an hour, and 24 hours in a day. January 2, 1970 UTC is only one day after the epoch, so you can apply basic math to arrive at that result:

        >>>
        >>> 60 * 60 * 24
        86400

        It is also important to note that you can still represent time before the epoch. The number of seconds would just be negative.

        For example, you would represent midnight on December 31, 1969 UTC (using an epoch of January 1, 1970) as -86400 seconds.

        While January 1, 1970 UTC is a common epoch, it is not the only epoch used in computing. In fact, different operating systems, filesystems, and APIs sometimes use different epochs.

        As you saw before, UNIX systems define the epoch as January 1, 1970. The Win32 API, on the other hand, defines the epoch as January 1, 1601.

        You can use time.gmtime() to determine your system’s epoch:

        >>>
        >>> import time
        >>> time.gmtime(0)
        time.struct_time(tm_year=1970, tm_mon=1, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=3, tm_yday=1, tm_isdst=0)

        You’ll learn about gmtime() and struct_time throughout the course of this article. For now, just know that you can use time to discover the epoch using this function.

        Now that you understand more about how to measure time in seconds using an epoch, let’s take a look at Python’s time module to see what functions it offers that help you do so.

        Python Time in Seconds as a Floating Point Number

        First, time.time() returns the number of seconds that have passed since the epoch. The return value is a floating point number to account for fractional seconds:

        >>>
        >>> from time import time
        >>> time()
        1551143536.9323719

        The number you get on your machine may be very different because the reference point considered to be the epoch may be very different.

        Further Reading: Python 3.7 introduced time_ns(), which returns an integer value representing the same elapsed time since the epoch, but in nanoseconds rather than seconds.
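        A minimal sketch of time_ns() (it requires Python 3.7 or later, and the exact value on your machine will differ; the one below is roughly the nanosecond version of the float shown earlier):

        >>> import time
        >>> time.time_ns()
        1551143536932371900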

        Measuring time in seconds is useful for a number of reasons:

        • You can use a float to calculate the difference between two points in time.
        • A float is easily serializable, meaning that it can be stored for data transfer and come out intact on the other side.

        Sometimes, however, you may want to see the current time represented as a string. To do so, you can pass the number of seconds you get from time() into time.ctime().

        Python Time in Seconds as a String Representing Local Time

        As you saw before, you may want to convert the Python time, represented as the number of elapsed seconds since the epoch, to a string. You can do so using ctime():

        >>>
        >>> from time import time, ctime
        >>> t = time()
        >>> ctime(t)
        'Mon Feb 25 19:11:59 2019'

        Here, you’ve recorded the current time in seconds into the variable t, then passed t as an argument to ctime(), which returns a string representation of that same time.

        Technical Detail: The argument, representing seconds since the epoch, is optional according to the ctime() definition. If you don’t pass an argument, then ctime() uses the return value of time() by default. So, you could simplify the example above:

        >>>
        >>> from time import ctime
        >>> ctime()
        'Mon Feb 25 19:11:59 2019'

        The string representation of time, also known as a timestamp, returned by ctime() is formatted with the following structure:

        1. Day of the week: Mon (Monday)
        2. Month of the year: Feb (February)
        3. Day of the month: 25
        4. Hours, minutes, and seconds using the 24-hour clock notation: 19:11:59
        5. Year: 2019

        The previous example displays the timestamp of a particular moment captured from a computer in the South Central region of the United States. But, let’s say you live in Sydney, Australia, and you executed the same command at the same instant.

        Instead of the above output, you’d see the following:

        >>>
        >>> from time import time, ctime
        >>> t = time()
        >>> ctime(t)
        'Tue Feb 26 12:11:59 2019'

        Notice that the day of week, day of month, and hour portions of the timestamp are different than the first example.

        These outputs are different because the timestamp returned by ctime() depends on your geographical location.

        Note: While the concept of time zones is relative to your physical location, you can modify this in your computer’s settings without actually relocating.

        The representation of time dependent on your physical location is called local time and makes use of a concept called time zones.

        Note: Since local time is related to your locale, timestamps often account for locale-specific details such as the order of the elements in the string and translations of the day and month abbreviations. ctime() ignores these details.

        Let’s dig a little deeper into the notion of time zones so that you can better understand Python time representations.

        Understanding Time Zones

        A time zone is a region of the world that conforms to a standardized time. Time zones are defined by their offset from Coordinated Universal Time (UTC) and, potentially, the inclusion of daylight savings time (which we’ll cover in more detail later in this article).

        Fun Fact: If you’re a native English speaker, you might be wondering why the abbreviation for “Coordinated Universal Time” is UTC rather than the more obvious CUT. However, if you’re a native French speaker, you would call it “Temps Universel Coordonné,” which suggests a different abbreviation: TUC.

        Ultimately, the International Telecommunication Union and the International Astronomical Union compromised on UTC as the official abbreviation so that, regardless of language, the abbreviation would be the same.

        UTC and Time Zones

        UTC is the time standard against which all the world’s timekeeping is synchronized (or coordinated). It is not, itself, a time zone but rather a transcendent standard that defines what time zones are.

        UTC time is precisely measured using astronomical time, referring to the Earth’s rotation, and atomic clocks.

        Time zones are then defined by their offset from UTC. For example, in North and South America, the Central Time Zone (CT) is behind UTC by five or six hours and, therefore, uses the notation UTC-5:00 or UTC-6:00.

        Sydney, Australia, on the other hand, belongs to the Australian Eastern Time Zone (AET), which is ten or eleven hours ahead of UTC (UTC+10:00 or UTC+11:00).

        This difference (UTC-6:00 to UTC+10:00) is the reason for the variance you observed in the two outputs from ctime() in the previous examples:

        • Central Time (CT): 'Mon Feb 25 19:11:59 2019'
        • Australian Eastern Time (AET): 'Tue Feb 26 12:11:59 2019'

        These times are exactly sixteen hours apart, which is consistent with the time zone offsets mentioned above.

        You may be wondering why CT can be either five or six hours behind UTC or why AET can be ten or eleven hours ahead. The reason for this is that some areas around the world, including parts of these time zones, observe daylight savings time.

        Daylight Savings Time

        Summer months generally experience more daylight hours than winter months. Because of this, some areas observe daylight savings time (DST) during the spring and summer to make better use of those daylight hours.

        For places that observe DST, their clocks will jump ahead one hour at the beginning of spring (effectively losing an hour). Then, in the fall, the clocks will be reset to standard time.

        The letters S and D represent standard time and daylight savings time in time zone notation:

        • Central Standard Time (CST)
        • Australian Eastern Daylight Time (AEDT)

        When you represent times as timestamps in local time, it is always important to consider whether DST is applicable or not.

        ctime() accounts for daylight savings time. So, the output difference listed previously would be more accurate as the following:

        • Central Standard Time (CST): 'Mon Feb 25 19:11:59 2019'
        • Australian Eastern Daylight Time (AEDT): 'Tue Feb 26 12:11:59 2019'

        Dealing With Python Time Using Data Structures

        Now that you have a firm grasp on many fundamental concepts of time including epochs, time zones, and UTC, let’s take a look at more ways to represent time using the Python time module.

        Python Time as a Tuple

        Instead of using a number to represent Python time, you can use another primitive data structure: a tuple.

        The tuple allows you to manage time a little more easily by abstracting some of the data and making it more readable.

        When you represent time as a tuple, each element in your tuple corresponds to a specific element of time:

        1. Year
        2. Month as an integer, ranging between 1 (January) and 12 (December)
        3. Day of the month
        4. Hour as an integer, ranging between 0 (12 A.M.) and 23 (11 P.M.)
        5. Minute
        6. Second
        7. Day of the week as an integer, ranging between 0 (Monday) and 6 (Sunday)
        8. Day of the year
        9. Daylight savings time as an integer with the following values:
          • 1 is daylight savings time.
          • 0 is standard time.
          • -1 is unknown.

        Using the methods you’ve already learned, you can represent the same Python time in two different ways:

        >>>
        >>> from time import time, ctime
        >>> t = time()
        >>> t
        1551186415.360564
        >>> ctime(t)
        'Tue Feb 26 07:06:55 2019'
        >>> time_tuple = (2019, 2, 26, 7, 6, 55, 1, 57, 0)

        In this case, both t and time_tuple represent the same time, but the tuple provides a more readable interface for working with time components.

        Technical Detail: Actually, if you look at the Python time represented by time_tuple in seconds (which you’ll see how to do later in this article), you’ll see that it resolves to 1551186415.0 rather than 1551186415.360564.

        This is because the tuple doesn’t have a way to represent fractional seconds.

        While the tuple provides a more manageable interface for working with Python time, there is an even better object: struct_time.

        Python Time as an Object

        The problem with the tuple construct is that it still looks like a bunch of numbers, even though it’s better organized than a single, seconds-based number.

        struct_time provides a solution to this by utilizing NamedTuple, from Python’s collections module, to associate the tuple’s sequence of numbers with useful identifiers:

        >>>
        >>> from time import struct_time
        >>> time_tuple = (2019, 2, 26, 7, 6, 55, 1, 57, 0)
        >>> time_obj = struct_time(time_tuple)
        >>> time_obj
        time.struct_time(tm_year=2019, tm_mon=2, tm_mday=26, tm_hour=7, tm_min=6, tm_sec=55, tm_wday=1, tm_yday=57, tm_isdst=0)

        Technical Detail: If you’re coming from another language, the terms struct and object might be in opposition to one another.

        In Python, there is no data type called struct. Instead, everything is an object.

        However, the name struct_time is derived from the C-based time library where the data type is actually a struct.

        In fact, Python’s time module, which is implemented in C, uses this struct directly by including the header file time.h.

        Now, you can access specific elements of time_obj using the attribute’s name rather than an index:

        >>>
        >>> day_of_year = time_obj.tm_yday
        >>> day_of_year
        57
        >>> day_of_month = time_obj.tm_mday
        >>> day_of_month
        26

        Beyond the readability and usability of struct_time, it is also important to know because it is the return type of many of the functions in the Python time module.

        Converting Python Time in Seconds to an Object

        Now that you’ve seen the three primary ways of working with Python time, you’ll learn how to convert between the different time data types.

        Converting between time data types is dependent on whether the time is in UTC or local time.

        Coordinated Universal Time (UTC)

        The epoch uses UTC for its definition rather than a time zone. Therefore, the seconds elapsed since the epoch is not variable depending on your geographical location.

        However, the same cannot be said of struct_time. The object representation of Python time may or may not take your time zone into account.

        There are two ways to convert a float representing seconds to a struct_time:

        1. UTC
        2. Local time

        To convert a Python time float to a UTC-based struct_time, the Python time module provides a function called gmtime().

        You’ve seen gmtime() used once before in this article:

        >>>
        >>> import time
        >>> time.gmtime(0)
        time.struct_time(tm_year=1970, tm_mon=1, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=3, tm_yday=1, tm_isdst=0)

        You used this call to discover your system’s epoch. Now, you have a better foundation for understanding what’s actually happening here.

        gmtime() converts the number of elapsed seconds since the epoch to a struct_time in UTC. In this case, you’ve passed 0 as the number of seconds, meaning you’re trying to find the epoch, itself, in UTC.

        Note: Notice the attribute tm_isdst is set to 0. This attribute represents whether the time zone is using daylight savings time. UTC never subscribes to DST, so that flag will always be 0 when using gmtime().

        As you saw before, struct_time cannot represent fractional seconds, so gmtime() ignores the fractional seconds in the argument:

        >>>
        >>> import time
        >>> time.gmtime(1.99)
        time.struct_time(tm_year=1970, tm_mon=1, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=1, tm_wday=3, tm_yday=1, tm_isdst=0)

        Notice that even though the number of seconds you passed was very close to 2, the .99 fractional seconds were simply ignored, as shown by tm_sec=1.

        The secs parameter for gmtime() is optional, meaning you can call gmtime() with no arguments. Doing so will provide the current time in UTC:

        >>>
        >>> import time
        >>> time.gmtime()
        time.struct_time(tm_year=2019, tm_mon=2, tm_mday=28, tm_hour=12, tm_min=57, tm_sec=24, tm_wday=3, tm_yday=59, tm_isdst=0)

        Interestingly, there is no inverse for this function within time. Instead, you’ll have to look in Python’s calendar module for a function named timegm():

        >>>
        >>> import calendar
        >>> import time
        >>> time.gmtime()
        time.struct_time(tm_year=2019, tm_mon=2, tm_mday=28, tm_hour=13, tm_min=23, tm_sec=12, tm_wday=3, tm_yday=59, tm_isdst=0)
        >>> calendar.timegm(time.gmtime())
        1551360204

        timegm() takes a tuple (or struct_time, since it is a subclass of tuple) and returns the corresponding number of seconds since the epoch.

        Historical Context: If you’re interested in why timegm() is not in time, you can view the discussion in Python Issue 6280.

        In short, it was originally added to calendar because time closely follows C’s time library (defined in time.h), which contains no matching function. The above-mentioned issue proposed the idea of moving or copying timegm() into time.

        However, with advances to the datetime library, inconsistencies in the patched implementation of time.timegm(), and a question of how to then handle calendar.timegm(), the maintainers declined the patch, encouraging the use of datetime instead.

        Working with UTC is valuable in programming because it’s a standard. You don’t have to worry about DST, time zone, or locale information.

        That said, there are plenty of cases when you’d want to use local time. Next, you’ll see how to convert from seconds to local time so that you can do just that.

        Local Time

        In your application, you may need to work with local time rather than UTC. Python’s time module provides a function for getting local time from the number of seconds elapsed since the epoch called localtime().

        The signature of localtime() is similar to gmtime() in that it takes an optional secs argument, which it uses to build a struct_time using your local time zone:

        >>>
        >>> import time
        >>> time.time()
        1551448206.86196
        >>> time.localtime(1551448206.86196)
        time.struct_time(tm_year=2019, tm_mon=3, tm_mday=1, tm_hour=7, tm_min=50, tm_sec=6, tm_wday=4, tm_yday=60, tm_isdst=0)

        Notice that tm_isdst=0. Since DST matters with local time, tm_isdst will change between 0 and 1 depending on whether or not DST is applicable for the given time. Since tm_isdst=0, DST is not applicable for March 1, 2019.

        In the United States in 2019, daylight savings time begins on March 10. So, to test if the DST flag will change correctly, you need to add 9 days’ worth of seconds to the secs argument.

        To compute this, you take the number of seconds in a day (86,400) and multiply that by 9 days:

        >>>
        >>> new_secs = 1551448206.86196 + (86400 * 9)
        >>> time.localtime(new_secs)
        time.struct_time(tm_year=2019, tm_mon=3, tm_mday=10, tm_hour=8, tm_min=50, tm_sec=6, tm_wday=6, tm_yday=69, tm_isdst=1)

        Now, you’ll see that the struct_time shows the date to be March 10, 2019 with tm_isdst=1. Also, notice that tm_hour has also jumped ahead, to 8 instead of 7 in the previous example, because of daylight savings time.

        Since Python 3.3, struct_time has also included two attributes that are useful in determining the time zone of the struct_time:

        1. tm_zone
        2. tm_gmtoff

        At first, these attributes were platform dependent, but they have been available on all platforms since Python 3.6.

        First, tm_zone stores the local time zone:

        >>>
        >>> import time
        >>> current_local = time.localtime()
        >>> current_local.tm_zone
        'CST'

        Here, you can see that localtime() returns a struct_time with the time zone set to CST (Central Standard Time).

        As you saw before, you can also tell the time zone based on two pieces of information, the UTC offset and DST (if applicable):

        >>>
        >>> import time
        >>> current_local = time.localtime()
        >>> current_local.tm_gmtoff
        -21600
        >>> current_local.tm_isdst
        0

        In this case, you can see that current_local is 21600 seconds behind GMT, which stands for Greenwich Mean Time. GMT is the time zone with no UTC offset: UTC±00:00.

        21600 seconds divided by seconds per hour (3,600) means that current_local time is GMT-06:00 (or UTC-06:00).

        You can use the GMT offset plus the DST status to deduce that current_local is UTC-06:00 at standard time, which corresponds to the Central standard time zone.
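        You can do that arithmetic directly on the struct_time attributes; a small sketch that turns the offset into hours (the -6.0 assumes the same CST local zone as above):

        >>> current_local.tm_gmtoff / 3600
        -6.0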

        Like gmtime(), you can ignore the secs argument when calling localtime(), and it will return the current local time in a struct_time:

        >>>
        >>> import time
        >>> time.localtime()
        time.struct_time(tm_year=2019, tm_mon=3, tm_mday=1, tm_hour=8, tm_min=34, tm_sec=28, tm_wday=4, tm_yday=60, tm_isdst=0)

        Unlike gmtime(), the inverse function of localtime() does exist in the Python time module. Let’s take a look at how that works.

        Converting a Local Time Object to Seconds

        You’ve already seen how to convert a UTC time object to seconds using calendar.timegm(). To convert local time to seconds, you’ll use mktime().

        mktime() requires you to pass a parameter called t that takes the form of either a normal 9-tuple or a struct_time object representing local time:

        >>> import time
        >>> time_tuple = (2019, 3, 10, 8, 50, 6, 6, 69, 1)
        >>> time.mktime(time_tuple)
        1552225806.0
        >>> time_struct = time.struct_time(time_tuple)
        >>> time.mktime(time_struct)
        1552225806.0

        It’s important to keep in mind that t must be a tuple representing local time, not UTC:

        >>> from time import gmtime, mktime

        >>> # 1
        >>> current_utc = gmtime()
        >>> current_utc
        time.struct_time(tm_year=2019, tm_mon=3, tm_mday=1, tm_hour=14, tm_min=51, tm_sec=19, tm_wday=4, tm_yday=60, tm_isdst=0)

        >>> # 2
        >>> current_utc_secs = mktime(current_utc)
        >>> current_utc_secs
        1551473479.0

        >>> # 3
        >>> gmtime(current_utc_secs)
        time.struct_time(tm_year=2019, tm_mon=3, tm_mday=1, tm_hour=20, tm_min=51, tm_sec=19, tm_wday=4, tm_yday=60, tm_isdst=0)

        Note: For this example, assume that the local time is March 1, 2019 08:51:19 CST.

        This example shows why it’s important to use mktime() with local time, rather than UTC:

        1. gmtime() with no argument returns a struct_time using UTC. current_utc shows March 1, 2019 14:51:19 UTC. This is accurate because CST is UTC-06:00, so UTC should be 6 hours ahead of local time.

        2. mktime() tries to return the number of seconds, expecting local time, but you passed current_utc instead. So, instead of understanding that current_utc is UTC time, it assumes you meant March 1, 2019 14:51:19 CST.

        3. gmtime() is then used to convert those seconds back into UTC, which results in an inconsistency. The time is now March 1, 2019 20:51:19 UTC. The reason for this discrepancy is the fact that mktime() expected local time. So, the conversion back to UTC adds another 6 hours to local time.

        Working with time zones is notoriously difficult, so it’s important to set yourself up for success by understanding the differences between UTC and local time and the Python time functions that deal with each.
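        As a quick sanity check, here's a minimal sketch of the matching round trip for UTC: pass the UTC struct_time to calendar.timegm() instead of mktime(), and the 6-hour drift disappears:

        >>> import calendar, time
        >>> current_utc = time.gmtime()
        >>> secs = calendar.timegm(current_utc)   # timegm() interprets the struct_time as UTC
        >>> time.gmtime(secs) == current_utc      # round-trips with no offset error
        True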

        Converting a Python Time Object to a String

        While working with tuples is fun and all, sometimes it’s best to work with strings.

        String representations of time, also known as timestamps, help make times more readable and can be especially useful for building intuitive user interfaces.

        There are two Python time functions that you use for converting a time.struct_time object to a string:

        1. asctime()
        2. strftime()

        You’ll begin by learning about asctime().

        asctime()

        You use asctime() for converting a time tuple or struct_time to a timestamp:

        >>> import time
        >>> time.asctime(time.gmtime())
        'Fri Mar  1 18:42:08 2019'
        >>> time.asctime(time.localtime())
        'Fri Mar  1 12:42:15 2019'

        Both gmtime() and localtime() return struct_time instances, for UTC and local time respectively.

        You can use asctime() to convert either struct_time to a timestamp. asctime() works similarly to ctime(), which you learned about earlier in this article, except instead of passing a floating point number, you pass a tuple. Even the timestamp format is the same between the two functions.
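        To make the parallel concrete, here's a small sketch; the timestamps shown are only illustrative, since the output depends on the moment you run it:

        >>> import time
        >>> secs = time.time()
        >>> time.ctime(secs)                    # takes a float of seconds since the epoch
        'Fri Mar  1 13:02:01 2019'
        >>> time.asctime(time.localtime(secs))  # takes a struct_time; same timestamp format
        'Fri Mar  1 13:02:01 2019'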

        As with ctime(), the parameter for asctime() is optional. If you do not pass a time object to asctime(), then it will use the current local time:

        >>> import time
        >>> time.asctime()
        'Fri Mar  1 12:56:07 2019'

        As with ctime(), it also ignores locale information.

        One of the biggest drawbacks of asctime() is its format inflexibility. strftime() solves this problem by allowing you to format your timestamps.

        strftime()

        You may find yourself in a position where the string format from ctime() and asctime() isn’t satisfactory for your application. Instead, you may want to format your strings in a way that’s more meaningful to your users.

        One example of this is if you would like to display your time in a string that takes locale information into account.

        To format strings, given a struct_time or Python time tuple, you use strftime(), which stands for “string format time.”

        strftime() takes two arguments:

        1. format specifies the order and form of the time elements in your string.
        2. t is an optional time tuple.

        To format a string, you use directives. Directives are character sequences that begin with a % that specify a particular time element, such as:

        • %d: Day of the month
        • %m: Month of the year
        • %Y: Year

        For example, you can output the date in your local time using the ISO 8601 standard like this:

        >>> import time
        >>> time.strftime('%Y-%m-%d', time.localtime())
        '2019-03-01'

        Further Reading: While representing dates using Python time is completely valid and acceptable, you should also consider using Python’s datetime module, which provides shortcuts and a more robust framework for working with dates and times together.

        For example, you can simplify outputting a date in the ISO 8601 format using datetime:

        >>> from datetime import date
        >>> date(year=2019, month=3, day=1).isoformat()
        '2019-03-01'

        As you saw before, a great benefit of using strftime() over asctime() is its ability to render timestamps that make use of locale-specific information.

        For example, if you want to represent the date and time in a locale-sensitive way, you can’t use asctime():

        >>> from time import asctime
        >>> asctime()
        'Sat Mar  2 15:21:14 2019'
        >>> import locale
        >>> locale.setlocale(locale.LC_TIME, 'zh_HK')  # Chinese - Hong Kong
        'zh_HK'
        >>> asctime()
        'Sat Mar  2 15:58:49 2019'

        Notice that even after programmatically changing your locale, asctime() still returns the date and time in the same format as before.

        Technical Detail: LC_TIME is the locale category for date and time formatting. The locale argument 'zh_HK' may be different, depending on your system.

        When you use strftime(), however, you’ll see that it accounts for locale:

        >>> from time import strftime, localtime
        >>> strftime('%c', localtime())
        'Sat Mar  2 15:23:20 2019'
        >>> import locale
        >>> locale.setlocale(locale.LC_TIME, 'zh_HK')  # Chinese - Hong Kong
        'zh_HK'
        >>> strftime('%c', localtime())
        '六  3/ 2 15:58:12 2019'

        Here, you have successfully utilized the locale information because you used strftime().

        Note: %c is the directive for locale-appropriate date and time.

        If the time tuple is not passed to the parameter t, then strftime() will use the result of localtime() by default. So, you could simplify the examples above by removing the optional second argument:

        >>> from time import strftime
        >>> strftime('The current local datetime is: %c')
        'The current local datetime is: Fri Mar  1 23:18:32 2019'

        Here, you’ve used the default time instead of passing your own as an argument. Also, notice that the format argument can consist of text other than formatting directives.

        Further Reading: Check out this thorough list of directives available to strftime().

        The Python time module also includes the inverse operation of converting a timestamp back into a struct_time object.

        Converting a Python Time String to an Object

        When you’re working with date and time related strings, it can be very valuable to convert the timestamp to a time object.

        To convert a time string to a struct_time, you use strptime(), which stands for “string parse time”:

        >>> from time import strptime
        >>> strptime('2019-03-01', '%Y-%m-%d')
        time.struct_time(tm_year=2019, tm_mon=3, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=4, tm_yday=60, tm_isdst=-1)

        The first argument to strptime() must be the timestamp you wish to convert. The second argument is the format that the timestamp is in.

        The format parameter is optional and defaults to '%a %b %d %H:%M:%S %Y'. Therefore, if you have a timestamp in that format, you don’t need to pass it as an argument:

        >>> strptime('Fri Mar 01 23:38:40 2019')
        time.struct_time(tm_year=2019, tm_mon=3, tm_mday=1, tm_hour=23, tm_min=38, tm_sec=40, tm_wday=4, tm_yday=60, tm_isdst=-1)

        Since a struct_time has 9 key date and time components, strptime() must provide reasonable defaults for the components it can’t parse from the string.

        In the previous examples, tm_isdst=-1. This means that strptime() can’t determine from the timestamp alone whether it represents daylight savings time or not.
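        For example, you can hand the parsed struct_time straight to mktime(), which treats tm_isdst=-1 as "unknown" and works out the DST status itself. This is a sketch that assumes the CST time zone used throughout this article, so the exact number of seconds will differ on your machine:

        >>> from time import strptime, mktime
        >>> parsed = strptime('2019-03-01', '%Y-%m-%d')
        >>> parsed.tm_isdst
        -1
        >>> mktime(parsed)   # seconds for midnight local time; value shown assumes CST (UTC-06:00)
        1551420000.0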

        Now you know how to work with Python times and dates using the time module in a variety of ways. However, there are other uses for time outside of simply creating time objects, getting Python time strings, and using seconds elapsed since the epoch.

        Suspending Execution

        One really useful Python time function is sleep(), which suspends the thread’s execution for a specified amount of time.

        For example, you can suspend your program’s execution for 10 seconds like this:

        >>> from time import sleep, strftime
        >>> strftime('%c')
        'Fri Mar  1 23:49:26 2019'
        >>> sleep(10)
        >>> strftime('%c')
        'Fri Mar  1 23:49:36 2019'

        Your program will print the first formatted datetime string, then pause for 10 seconds, and finally print the second formatted datetime string.

        You can also pass fractional seconds to sleep():

        >>> from time import sleep
        >>> sleep(0.5)

        sleep() is useful for testing or making your program wait for any reason, but you must be careful not to halt your production code unless you have good reason to do so.

        Before Python 3.5, a signal sent to your process could interrupt sleep(). However, in 3.5 and later, sleep() will always suspend execution for at least the specified amount of time, even if the process receives a signal.

        sleep() is just one Python time function that can help you test your programs and make them more robust.

        Measuring Performance

        You can use time to measure the performance of your program.

        The way you do this is to use perf_counter(), which, as the name suggests, provides a high-resolution performance counter for measuring short durations of time.

        To use perf_counter(), you place a counter before your code begins execution as well as after your code’s execution completes:

        >>> import time
        >>> from time import perf_counter
        >>> def longrunning_function():
        ...     for i in range(1, 11):
        ...         time.sleep(i / i ** 2)
        ...
        >>> start = perf_counter()
        >>> longrunning_function()
        >>> end = perf_counter()
        >>> execution_time = (end - start)
        >>> execution_time
        8.201258441999926

        First, start captures the moment before you call the function. end captures the moment after the function returns. The function’s total execution time took (end - start) seconds.

        Technical Detail: Python 3.7 introduced perf_counter_ns(), which works the same as perf_counter(), but uses nanoseconds instead of seconds.
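        Here's a minimal sketch of the nanosecond variant; the elapsed value shown is illustrative and will differ on every run:

        >>> from time import perf_counter_ns, sleep
        >>> start = perf_counter_ns()
        >>> sleep(0.1)
        >>> perf_counter_ns() - start   # elapsed time as an integer number of nanoseconds
        100231659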

        perf_counter() (or perf_counter_ns()) is the most precise way to measure the performance of your code using one execution. However, if you’re trying to accurately gauge the performance of a code snippet, I recommend using the Python timeit module.

        timeit specializes in running code many times to get a more accurate performance analysis and helps you to avoid oversimplifying your time measurement as well as other common pitfalls.
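        As a rough sketch of what that looks like (the statement and the timing shown are only illustrative):

        >>> import timeit
        >>> # Time 10,000 calls to strftime() and report the total elapsed seconds
        >>> timeit.timeit("time.strftime('%Y-%m-%d')", setup='import time', number=10_000)
        0.021755918999972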

        Conclusion

        Congratulations! You now have a great foundation for working with dates and times in Python.

        Now, you’re able to:

        • Use a floating point number, representing seconds elapsed since the epoch, to deal with time
        • Manage time using tuples and struct_time objects
        • Convert between seconds, tuples, and timestamp strings
        • Suspend the execution of a Python thread
        • Measure performance using perf_counter()

        On top of all that, you’ve learned some fundamental concepts surrounding date and time, such as:

        • Epochs
        • UTC
        • Time zones
        • Daylight savings time

        Now, it’s time for you to apply your newfound knowledge of Python time in your real world applications!

        Further Reading

        If you want to continue learning more about using dates and times in Python, take a look at the following modules:

        • datetime: A more robust date and time module in Python’s standard library
        • timeit: A module for measuring the performance of code snippets
        • astropy: Higher precision datetimes used in astronomy


        Codementor: Variable references in Python


        Podcast.__init__: Exploring Indico: A Full Featured Event Management Platform


        Summary

        Managing an event is rife with inherent complexity that scales as you move from scheduling a meeting to organizing a conference. Indico is a platform built at CERN to handle their efforts to organize events such as the Computing in High Energy Physics (CHEP) conference, and now it has grown to manage booking of meeting rooms. In this episode Adrian Mönnich, core developer on the Indico project, explains how it is architected to facilitate this use case, how it has evolved since its first incarnation two decades ago, and what he has learned while working on it. The Indico platform is definitely a feature rich and mature platform that is worth considering if you are responsible for organizing a conference or need a room booking system for your office.

        Announcements

        • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
        • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
        • Bots and automation are taking over whole categories of online interaction. Discover.bot is an online community designed to serve as a platform-agnostic digital space for bot developers and enthusiasts of all skill levels to learn from one another, share their stories, and move the conversation forward together. They regularly publish guides and resources to help you learn about topics such as bot development, using them for business, and the latest in chatbot news. For newcomers to the space they have the Beginners Guide To Bots that will teach you the basics of how bots work, what they can do, and where they are developed and published. To help you choose the right framework and avoid the confusion about which NLU features and platform APIs you will need they have compiled a list of the major options and how they compare. Go to pythonpodcast.com/discoverbot today to get started and thank them for their support of the show.
        • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
        • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com
        • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
        • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
        • Your host as usual is Tobias Macey and today I’m interviewing Adrian Mönnich about Indico, the effortless open-source tool for event organisation, archival and collaboration

        Interview

        • Introductions
        • How did you get introduced to Python?
        • Can you start by describing what Indico is and how the project got started?
          • What are some other projects which target a similar use case and what were they lacking that led to Indico being necessary?
        • Can you talk through an example workflow for setting up and managing an event in Indico?
          • How does the lifecycle change when working with larger events, such as PyCon?
        • Can you describe how Indico is architected and how its design has evolved since it was first built?
          • What are some of the most complex or challenging portions of Indico to implement and maintain?
        • There are a lot of areas for exercising constraint resolution algorithms. Can you talk through some of the business logic of how that operates?
        • Most of Indico is highly configurable and flexible. How do you approach managing sane defaults to prevent users getting overwhelmed when onboarding?
          • What is your approach to testing given how complex the project is?
        • What are some of the most interesting or unexpected ways that you have seen Indico used?
        • What are some of the most interesting/unexpected lessons that you have learned in the process of building Indico?
        • What do you have planned for the future of the project?

        Keep In Touch

        Picks

        Links

        The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA

        NumFOCUS: NumFOCUS Projects to Apply for Inaugural Google Season of Docs

        The Code Bits: Printing star patterns in Python: One line tricks!


        In this post, we will see how to print some of the common star patterns using Python 3 with one line of code!

        How to print a half-pyramid pattern in Python?

        >>> n = 5
        >>> print('\n'.join('*' * i for i in range(1, n+1)))
        *
        **
        ***
        ****
        *****
        >>> print('\n'.join('* ' * i for i in range(1, n+1)))
        *
        * *
        * * *
        * * * *
        * * * * *
        

        How to print a rotated half-pyramid pattern in Python?

        >>> n = 5
        >>> print('\n'.join(' ' * (n-i) + '*' * (i) for i in range(1, n+1)))
            *
           **
          ***
         ****
        *****
        >>> print('\n'.join('  ' * (n-i) + '* ' * (i) for i in range(1, n+1)))
                *
              * *
            * * *
          * * * *
        * * * * *
        

        How to print an inverted half-pyramid pattern in Python?

        >>> n = 5
        >>> print('\n'.join('*' * (n-i) for i in range(n)))
        *****
        ****
        ***
        **
        *
        >>> print('\n'.join('* ' * (n-i) for i in range(n)))
        * * * * *
        * * * *
        * * *
        * *
        *
        

        How to print an inverted and rotated half-pyramid pattern in Python?

        >>> n = 5
        >>> print('\n'.join(' ' * i + '*' * (n-i) for i in range(n)))
        *****
         ****
          ***
           **
            *
        >>> print('\n'.join('  ' * i + '* ' * (n-i) for i in range(n)))
        * * * * *
          * * * *
            * * *
              * *
                *
        

        How to print a full triangle pyramid pattern in Python?

        >>> n = 5
        >>> print('\n'.join(' ' * (n-i) + '* ' * i for i in range(1, n+1)))
            *
           * *
          * * *
         * * * *
        * * * * *
        >>> print('\n'.join(' ' * (n-1-i) + '*' * ((i*2)+1) for i in range(n)))
            *
           ***
          *****
         *******
        *********
        

        How to print an inverted full triangle pyramid pattern in Python?

        >>> n = 5
        >>> print('\n'.join(' ' * (n-i) + '* ' * i for i in range(n, 0, -1)))
        * * * * *
         * * * *
          * * *
           * *
            *
        >>> print('\n'.join(' ' * (n-i) + '*' * ((i*2)-1) for i in range(n, 0, -1)))
        *********
         *******
          *****
           ***
            *
        

        Catalin George Festila: Testing firebase with Python 3.7.3 .

        The tutorial for today covers using the Firebase service with Python version 3.7.3. As you know, Firebase offers multiple free and paid services. To use them from the Python programming language, we need to use the pip utility to install the required modules. If your installation requires other Python modules, then you will need to install them in the same way. C:\Python373>pip install

        PyCon: Welcome Capital One: Python Software Foundation Principal Sponsor



        A big welcome and thank you to Capital One for joining the PSF as a Principal sponsor!

        Capital One is also a PyCon 2019 Principal sponsor and is excited to share a few things with attendees, including a deeper look at their intelligent virtual assistant, Eno. Eno’s NLP models were built in-house with Python. Eno is a key component of the customer experience at Capital One, proactively looking out for customers and their money. Eno notifies customers about unusual transactions or duplicate charges, helping to spot fraud in its tracks. It also sends bill reminders and makes paying your bill as easy as sending a text or emoji; plus, its new virtual card capabilities let customers shop online without using their real credit card number.

        The benefits they’ve seen by developing Eno with Python are numerous: fast time to market, the ability to prototype and iterate quickly, ease of integration with machine learning frameworks, and extensive support for everything they need (like Kafka and Redis). Plus, they see faster performance using Python's asynchronous I/O.

        For Capital One, sponsoring important industry conferences like PyCon brings a lot of benefits, like recruiting and brand awareness, but they’re here first and foremost for the community. By sponsoring PyCon, they feel they’re helping support, strengthen, and engage with the Python community.

        Capital One sees the future of banking as real-time, data-driven, and enabled by machine learning and data science -- and Python plays a big role in that. They have embedded machine learning across the entire enterprise, from call center operations to back-office processes, fraud, internal operations, the customer experience, and much more. To them, machine learning not only creates efficiency and scale on a level not possible before, but it also helps give their customers greater protection, security, confidence, and control of their finances.

        Python has been and will continue to be critical to advances in machine learning and data science, so they see a lot of exciting innovation, growth, and potential for the Python community.  They hope to share back with the community some of their own insights, best practices, and broader work with Python.

        As an open source first organization, Capital One has been working in the open source space for several years -- consuming and contributing code, as well as releasing their own projects. One example of an open source project they’ll be showcasing at PyCon is Cloud Custodian. Cloud Custodian is a tool built with Python to allow users to easily define rules to enable a well-managed cloud infrastructure in the enterprise. It’s both secure and cost-optimized and consolidates many of the ad-hoc scripts organizations have into a lightweight and flexible tool, with unified metrics and reporting.

        They also developed a Javascript project called Hygieia, a single, configurable dashboard that visualizes the health of an entire software delivery pipeline. All their open source projects are on GitHub and their Python projects can be found here.

        According to the Python Software Foundation and JetBrains’ 2018 Python Developers Survey, using Python for machine learning grew 7 percentage points since 2017, which is incredible. Machine learning experienced faster growth than Web development, which increased by only 2 percentage points compared to the previous year. Capital One is increasingly focused on using machine learning across the enterprise. One recent Python-based project is work they’ve done in Explainable AI. Their team created a technique called Global Attribution Mapping (GAM), which is capable of explaining neural network predictions across subpopulations. This approach surfaces subpopulations with their most representative explanations, allowing them to inspect global model behavior and making it easier to generate global explanations based on local attributions. You can learn more from the open source tool they developed for GAM, as well as a recent whitepaper that provides more details.

        Be sure to stop by their booth, #303, and get even more details about how they’re using Python.


