ListenData: Importing CSV File in Python

This tutorial explains how to read a CSV file in Python with pandas. It outlines many examples of loading a CSV file into Python. pandas is an awesome package for data manipulation and includes various functions to load and import data from various formats. In this post, we will see how to load comma-separated files with several use cases.

Load Package

You have to load the required package, i.e. pandas. Run the following command to load it.
import pandas as pd
Create Sample Data for Import

The program below creates a sample data frame which can be used further for demonstration.
dt = {'ID': [11, 12, 13, 14, 15],
            'first_name': ['David', 'Jamie', 'Steve', 'Stevart', 'John'],
            'company': ['Aon', 'TCS', 'Google', 'RBS', '.'],
            'salary': [74, 76, 96, 71, 78]}
mydt = pd.DataFrame(dt, columns = ['ID', 'first_name', 'company', 'salary'])
The sample data looks like below - 
Sample Data
Save data as CSV in the working directory

The following command tells python to write data in CSV format.
mydt.to_csv('workingfile.csv', index=False)
Example 1 : Read CSV file with header row

This is the basic syntax of the read_csv() function. You just need to mention the filename.
mydata  = pd.read_csv("workingfile.csv")
Example 2 : Read CSV file without header row
mydata0  = pd.read_csv("workingfile.csv", header = None)
If you specify "header = None", pandas assigns a series of numbers starting from 0 to (number of columns - 1) as column names. See the output shown below -
Output
Example 3 : Specify missing values

The na_values= option is used to treat certain values as blank / missing values.
mydata00  = pd.read_csv("workingfile.csv", na_values=['.'])
Set Missing Values

Example 4 : Set Index Column
mydata01  = pd.read_csv("workingfile.csv", index_col ='ID')
Python : Setting Index Column
As you can see in the above image, the column ID has been set as index column.

Example 5 : Read CSV File from URL

You can directly read data from the CSV file that is stored on a web link.
mydata02  = pd.read_csv("http://winterolympicsmedals.com/medals.csv")

Example 6 : Skip First 5 Rows While Importing CSV
mydata03  = pd.read_csv("http://winterolympicsmedals.com/medals.csv", skiprows=5)
It reads data from the 6th row onward (the 6th row is treated as the header row).

Example 7 : Skip Last 5 Rows While Importing CSV
mydata04  = pd.read_csv("http://winterolympicsmedals.com/medals.csv", skipfooter=5, engine='python')
In the above code, we are excluding the bottom 5 rows using the skipfooter= parameter (it requires the python parser engine).

Example 8 : Read only first 5 rows
mydata05  = pd.read_csv("http://winterolympicsmedals.com/medals.csv", nrows=5)
Using the nrows= option, you can load only the top N rows.

Example 9 : Interpreting "," as thousands separator
mydata06 = pd.read_csv("http://winterolympicsmedals.com/medals.csv", thousands=",")
Example 10 : Read only specific columns
mydata07 = pd.read_csv("http://winterolympicsmedals.com/medals.csv", usecols=(1,5,7))
The above code reads only the columns at positions 1, 5 and 7, i.e. the second, sixth and eighth columns, since column positions start at 0.

Example 11 : Read some rows and columns
mydata08 = pd.read_csv("http://winterolympicsmedals.com/medals.csv", usecols=(1,5,7),nrows=5)
In the above command, we have combined the usecols= and nrows= options. It reads only the first 5 rows of the selected columns.

Example 12 : Read file with semi colon delimiter
mydata09 = pd.read_csv("file_path", sep = ';')
Using the sep= parameter of the read_csv() function, you can import a file with a semi-colon delimiter.



ListenData: Importing Data into Python

This tutorial explains various methods to read data into Python. Data can be in any of the popular formats - CSV, TXT, XLS/XLSX (Excel), sas7bdat (SAS), Stata, Rdata (R) etc. Loading data into the Python environment is the first step in analyzing data.
Import Data into Python
While importing external files, we need to check the following points (a sketch covering them follows the list) -
  1. Check whether header row exists or not
  2. Treatment of special values as missing values
  3. Consistent data type in a variable (column)
  4. Date Type variable in consistent date format.
  5. No truncation of rows while reading external data
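A rough sketch of how these checks can map onto read_csv() arguments; the file name and column names here are hypothetical, for illustration only.
import pandas as pd

mydata = pd.read_csv(
    "example.csv",
    header=0,                   # 1. first row contains column names
    na_values=['.', 'NA'],      # 2. treat these special values as missing
    dtype={'salary': 'float'},  # 3. force a consistent data type for a column
    parse_dates=['join_date'],  # 4. parse a date column in a consistent format
)
print(len(mydata))              # 5. verify no rows were truncated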

Install and Load pandas Package

pandas is a powerful data analysis package. It makes data exploration and manipulation easy. It has several functions to read data from various sources.

If you are using Anaconda, pandas must be already installed. You need to load the package by using the following command -
import pandas as pd
If pandas package is not installed, you can install it by running the following code in Ipython Console. If you are using Spyder, you can submit the following code in Ipython console within Spyder.
!pip install pandas
If you are using Anaconda, you can try the following line of code to install pandas -
!conda install pandas
1. Import CSV files

It is important to note that a single backslash does not work when specifying the file path. You need to either change it to a forward slash or add one more backslash, like below.
import pandas as pd
mydata= pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv")
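Alternatively, you can mark the path as a raw string so that backslashes are not treated as escape characters. A small sketch using the same file path:
import pandas as pd

# The leading r makes this a raw string, so single backslashes are kept as-is
mydata = pd.read_csv(r"C:\Users\Deepanshu\Documents\file1.csv")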
If no header (title) in raw data file
mydata1  = pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv", header = None)
You need to include header = None option to tell Python there is no column name (header) in data.

Add Column Names

We can include column names by using names= option.
mydata2  = pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv", header = None, names = ['ID', 'first_name', 'salary'])
The variable names can also be added separately by using the following command.
mydata1.columns = ['ID', 'first_name', 'salary']


2. Import File from URL

You don't need to perform additional steps to fetch data from a URL. Simply put the URL in the read_csv() function (applicable only for CSV files accessible via a URL).
mydata  = pd.read_csv("http://winterolympicsmedals.com/medals.csv")

3. Read Text File 

We can use read_table() function to pull data from text file. We can also use read_csv() with sep= "\t" to read data from tab-separated file.
mydata = pd.read_table("C:\\Users\\Deepanshu\\Desktop\\example2.txt")
mydata  = pd.read_csv("C:\\Users\\Deepanshu\\Desktop\\example2.txt", sep ="\t")

4. Read Excel File

The read_excel() function can be used to import excel data into Python.
mydata = pd.read_excel("https://www.eia.gov/dnav/pet/hist_xls/RBRTEd.xls", sheet_name="Data 1", skiprows=2)
If you do not specify the name of the sheet in the sheet_name= option, it takes the first sheet by default. (In older pandas versions this parameter was called sheetname.)

5. Read delimited file

Suppose you need to import a file that is separated with white spaces.
mydata2 = pd.read_table("http://www.ssc.wisc.edu/~bhansen/econometrics/invest.dat", sep="\s+", header = None)
To include variable names, use the names= option like below -
mydata3 = pd.read_table("http://www.ssc.wisc.edu/~bhansen/econometrics/invest.dat", sep="\s+", names=['a', 'b', 'c', 'd'])
6. Read SAS File

We can import SAS data file by using read_sas() function.
mydata4 = pd.read_sas('cars.sas7bdat')

7. Read Stata File

We can load Stata data file via read_stata() function.
mydata41 = pd.read_stata('cars.dta')
8. Import R Data File

Using the pyreadr package, you can load .RData and .Rds format files, which in general contain R data frames. You can install this package using the command below -
pip install pyreadr
With the use of read_r( ) function, we can import R data format files.
import pyreadr
result = pyreadr.read_r('C:/Users/sampledata.RData')
print(result.keys()) # let's check what objects we got
df1 = result["df1"] # extract the pandas data frame for object df1
Similarly, you can read .Rds formatted file.
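A minimal sketch for an .Rds file, assuming a hypothetical file path. pyreadr returns a dictionary, and since an .Rds file holds a single unnamed object, it is stored under the key None:
import pyreadr

result = pyreadr.read_r('C:/Users/sampledata.Rds')  # hypothetical path
df2 = result[None]  # .Rds files store one unnamed object, available under the key None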
 
9. Read SQL Table

We can extract a table from a SQL database (e.g. SQLite, Teradata, SQL Server). See the program below, which queries a SQLite database -
import sqlite3
import pandas as pd

conn = sqlite3.connect('C:/Users/Deepanshu/Downloads/flight.db')
query = "SELECT * FROM flight;"
results = pd.read_sql(query, con=conn)
print(results.head())

10. Read sample of rows and columns

By specifying nrows= and usecols=, you can fetch specified number of rows and columns.
mydata7  = pd.read_csv("http://winterolympicsmedals.com/medals.csv", nrows=5, usecols=(1,5,7))
nrows=5 means only the first 5 rows are imported and usecols= specifies the column positions you want to import.

11. Skip rows while importing

Suppose you want to skip the first 5 rows and read data from the 6th row onward (the 6th row would be treated as the header row).
mydata8  = pd.read_csv("http://winterolympicsmedals.com/medals.csv", skiprows=5)
12. Specify values as missing values

By including the na_values= option, you can specify values to be treated as missing. In this case, we are telling pandas to consider dot (.) as a missing value.
mydata9  = pd.read_csv("workingfile.csv", na_values=['.'])

ListenData: Install Python Package

Python is one of the most popular programming languages for data science and analytics. It is widely used for a variety of tasks in startups and many multi-national organizations. The beauty of this programming language is that it is open source, which means it is available for free and has a very active community of developers across the world. Python developers share their solutions in the form of packages or modules with other Python users. This tutorial explains various ways to install a Python package.

Ways to Install Python Package


Method 1 : If Anaconda is already installed on your System

Anaconda is a data science platform which comes with popular Python packages pre-installed and a powerful IDE (Spyder) that has a user-friendly interface for writing Python scripts.

If Anaconda is installed on your system (laptop), click on Anaconda Prompt as shown in the image below.

Anaconda Prompt

To install a python package or module, enter the code below in Anaconda Prompt -
pip install package-name
Install Python Package using PIP Windows

Method 2 : NO Need of Anaconda


1. Open RUN box using shortcut Windows Key + R

2. Enter cmd in the RUN box
Command Prompt

Once you press OK, it will show command prompt screen.



3. Search for the folder named Scripts where the pip executable is stored.

Scripts Folder

4. In command prompt, type cd <file location of Scripts folder>

cd refers to change directory.

For example, folder location is C:\Users\DELL\Python37\Scripts so you need to enter the following line in command prompt :
cd C:\Users\DELL\Python37\Scripts 

Change Directory

5. Type pip install package-name

Install Package via PIP command prompt


Method 3 : Install Python Package from IPython console

Make sure to use ! before pip when you enter the command below in IPython console window. Otherwise it would return syntax error.
!pip install package_name
The ! prefix tells Python to run a shell command.


Syntax Error : Installing Package using PIP

Some users face the error "SyntaxError: invalid syntax" while installing packages. To work around this issue, run the command below in the command prompt -
python -m pip install package-name
python -m pip tells Python to locate the pip module and run it as a script.

Install Specific Versions of Python Package
python -m pip install Packagename==1.3     # specific version
python -m pip install "Packagename>=1.3"  # version greater than or equal to 1.3

How to load or import package or module

Once a package is installed, the next step is to start using it. In other words, you need to import the package after installing it. There are several ways to load a package or module in Python:

1. import math loads the module math. Then you can use any function defined in math module using math.function. Refer the example below -
import math
math.sqrt(4)

2. from math import * loads the module math. Now we don't need to specify the module to use functions of this module.
from math import *
sqrt(4)

3. from math import sqrt, cos imports the selected functions of the module math.

4.import math as m imports the math module under the alias m.
m.sqrt(4)

Other Useful Commands
Description                            Command
To uninstall a package                 pip uninstall package
To upgrade a package                   pip install --upgrade package
To search a package                    pip search "package-name"
To check all the installed packages    pip list

ListenData: Python for Data Science : Learn in 3 Days

This tutorial helps you to learn Data Science with Python with examples. Python is an open source language and it is widely used as a high-level programming language for general-purpose programming. It has gained high popularity in data science world. As data science domain is rising these days, IBM recently predicted demand for data science professionals would rise by more than 25% by 2020. In the PyPL Popularity of Programming language index, Python scored second rank with a 14 percent share. In advanced analytics and predictive analytics market, it is ranked among top 3 programming languages for advanced analytics.
Data Science with Python Tutorial

Table of Contents
  1. Getting Started with Python
  2. Data Structures and Conditional Statements
  3. Python Libraries
  4. Data Manipulation using Pandas
  5. Data Science with Python

Python 2.7 vs 3.6

Google yields thousands of articles on this topic. Some bloggers are opposed to and some are in favor of 2.7. If you filter your search criteria and look only for recent articles (late 2016 onwards), you would see that the majority of bloggers are in favor of Python 3.6. See the following reasons to support Python 3.6.

1. The official end-of-life date for Python 2.7 is the year 2020. Afterwards there will be no support from the community. It does not make sense to start learning 2.7 today.

2. Python 3.6 supports 95% of top 360 python packages and almost 100% of top packages for data science.

What's new in Python 3.6

It is cleaner and faster. It is a language for the future. It fixed major issues with the Python 2 series. Python 3 was first released in 2008, and robust versions of the Python 3 series have been released for 9 years.

Key Takeaway
You should go for Python 3.6. In terms of learning Python, there are no major differences between Python 2.7 and 3.6. It is not too difficult to move between the two with a few adjustments. Your focus should be on learning Python as a language.

Python for Data Science : Introduction

Python is widely used and very popular for a variety of software engineering tasks such as website development, cloud architecture, back-end development etc. It is equally popular in the data science world. In the advanced analytics world, there have been several debates on R vs. Python. There are some areas, such as the number of libraries for statistical analysis, where R wins over Python, but Python is catching up very fast. With the popularity of big data and data science, Python has become the first programming language of data scientists.

There are several reasons to learn Python. Some of them are as follows -
  1. Python runs well in automating various steps of a predictive model. 
  2. Python has awesome robust libraries for machine learning, natural language processing, deep learning, big data and artificial Intelligence. 
  3. Python wins over R when it comes to deploying machine learning models in production.
  4. It can be easily integrated with big data frameworks such as Spark and Hadoop.
  5. Python has a great online community support.
Do you know these sites are developed in Python?
  1. YouTube
  2. Instagram
  3. Reddit
  4. Dropbox
  5. Disqus

How to Install Python

There are two ways to download and install Python
  1. Download Anaconda. It comes with Python software along with preinstalled popular libraries.
  2. Download Python from its official website. You have to manually install libraries.

Recommended: Go for the first option and download Anaconda. It saves a lot of time in learning and coding Python.

Coding Environments

Anaconda comes with two popular IDEs:
  1. Jupyter (Ipython) Notebook
  2. Spyder
Spyder. It is like RStudio for Python. It provides an environment in which writing Python code is user-friendly. If you are a SAS user, you can think of it as SAS Enterprise Guide / SAS Studio. It comes with a syntax editor where you can write programs. It has a console to check each and every line of code. Under the 'Variable explorer', you can access your created data files and functions. I highly recommend Spyder!
Spyder - Python Coding Environment
Jupyter (Ipython) Notebook

Jupyter is comparable to R Markdown. It is useful when you need to present your work to others or when you need to create a step-by-step project report, as it can combine code, output, words, and graphics.

Spyder Shortcut Keys

The following is a list of some useful Spyder shortcut keys which make you more productive.
  1. Press F5 to run the entire script
  2. Press F9 to run selection or line 
  3. Press Ctrl + 1 to comment / uncomment
  4. Go to front of function and then press Ctrl + I to see documentation of the function
  5. Run %reset -f to clean workspace
  6. Ctrl + Left click on object to see source code 
  7. Ctrl+Enter executes the current cell.
  8. Shift+Enter executes the current cell and advances the cursor to the next cell

List of arithmetic operators with examples

Arithmetic Operator    Operation             Example
+                      Addition              10 + 2 = 12
-                      Subtraction           10 - 2 = 8
*                      Multiplication        10 * 2 = 20
/                      Division              10 / 2 = 5.0
%                      Modulus (Remainder)   10 % 3 = 1
**                     Power                 10 ** 2 = 100
//                     Floor Division        17 // 3 = 5
(x + (d-1)) // d       Ceiling               (17 + (3-1)) // 3 = 6
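A small sketch of floor division and the ceiling-via-floor-division trick from the table, alongside the standard-library equivalent:
x, d = 17, 3
print(x // d)               # floor division: 5
print((x + (d - 1)) // d)   # ceiling without importing math: 6

import math
print(math.ceil(x / d))     # 6, the same result using the standard library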

Basic Programs

Example 1
#Basics
x = 10
y = 3
print("10 divided by 3 is", x/y)
print("remainder after 10 divided by 3 is", x%y)
Result :
10 divided by 3 is 3.3333333333333335
remainder after 10 divided by 3 is 1

Example 2
x = 100
x > 80 and x <=95
x > 35 or x < 60
x > 80 and x <=95
Out[45]: False
x > 35 or x < 60
Out[46]: True

Comparison & Logical Operator    Description                              Example
>                                Greater than                             5 > 3 returns True
<                                Less than                                5 < 3 returns False
>=                               Greater than or equal to                 5 >= 3 returns True
<=                               Less than or equal to                    5 <= 3 returns False
==                               Equal to                                 5 == 3 returns False
!=                               Not equal to                             5 != 3 returns True
and                              Check both the conditions                x > 18 and x <= 35
or                               If at least one condition holds True     x > 35 or x < 60
not                              Opposite of condition                    not(x > 7)

Assignment Operators

Assignment operators are used to assign a value to a variable. For example, x += 25 means x = x + 25.
x = 100
y = 10
x += y
print(x)
110
In this case, x+=y implies x=x+y which is x = 100 + 10.
Similarly, you can use x -= y, x *= y and x /= y.

Python Data Structure

In every programming language, it is important to understand the data structures. Following are some data structures used in Python.

1. List

It is a sequence of multiple values. It allows us to store different types of data such as integer, float, string etc. See the examples of lists below. The first is an integer list containing only integers. The second is a string list containing only string values. The third is a mixed list containing integer, string and float values.
  1. x = [1, 2, 3, 4, 5]
  2. y = ['A', 'O', 'G', 'M']
  3. z = ['A', 4, 5.1, 'M']
Get List Item

We can extract list items using indexes. The index starts from 0 and ends at (number of elements - 1).
x = [1, 2, 3, 4, 5]
x[0]
x[1]
x[4]
x[-1]
x[-2]
x[0]
Out[68]: 1

x[1]
Out[69]: 2

x[4]
Out[70]: 5

x[-1]
Out[71]: 5

x[-2]
Out[72]: 4

x[0] picks first element from list. Negative sign tells Python to search list item from right to left. x[-1] selects the last element from list.

You can select multiple elements from a list using the following method
x[:3] returns [1, 2, 3]
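A few more slicing patterns on the same list, as a quick sketch:
x = [1, 2, 3, 4, 5]
print(x[:3])    # [1, 2, 3] - first three elements
print(x[1:4])   # [2, 3, 4] - index 1 up to (but excluding) index 4
print(x[-2:])   # [4, 5]    - last two elements
print(x[::2])   # [1, 3, 5] - every second element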

2. Tuple

A tuple is similar to a list in the sense that it is a sequence of elements. The differences between a list and a tuple are as follows -
  1. A tuple cannot be changed once constructed, whereas a list can be modified.
  2. A tuple is created by placing comma-separated values inside parentheses ( ), whereas a list is created inside square brackets [ ].
Examples
K = (1,2,3)
State = ('Delhi','Maharashtra','Karnataka')
Perform for loop on Tuple
for i in State:
    print(i)
Delhi
Maharashtra
Karnataka
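A minimal sketch of the immutability difference: assigning to a tuple element raises a TypeError, while the same assignment on a list works.
K = (1, 2, 3)
L = [1, 2, 3]

L[0] = 99         # works: lists are mutable
try:
    K[0] = 99     # fails: tuples cannot be changed once constructed
except TypeError as e:
    print("Error:", e)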
Functions

Like print(), you can create your own custom functions, also called user-defined functions. They help you automate repetitive tasks and call reusable code in an easier way.

Rules to define a function
  1. A function starts with the def keyword followed by the function name and ( )
  2. The function body starts after a colon (:) and is indented
  3. The keyword return exits the function and gives back the value of the expression that follows it.
def sum_fun(a, b):
    result = a + b
    return result 
z = sum_fun(10, 15)
Result : z = 25

Suppose you want Python to assume 0 as the default value if no value is specified for parameter b.
def sum_fun(a, b=0):
    result = a + b
    return result
z = sum_fun(10)
In the above function, b defaults to 0 if no value is provided. This does not mean that only 0 can be used; you can still call it as z = sum_fun(10, 15).

Conditional Statements (if else)

Conditional statements are commonly used in coding. These are IF-ELSE statements. They can be read as: "if a condition holds true, then execute something; else execute something else".

Note : The if and else statements end with a colon :

Example
k = 27
if k%5 == 0:
  print('Multiple of 5')
else:
  print('Not a Multiple of 5')
Result : Not a Multiple of 5
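When there are more than two branches, an elif clause can be added. A small sketch:
k = 27
if k % 15 == 0:
    print('Multiple of both 3 and 5')
elif k % 3 == 0:
    print('Multiple of 3')
else:
    print('Neither a multiple of 3 nor 5')
Result : Multiple of 3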

Popular python packages for Data Analysis & Visualization

Some of the leading packages in Python along with equivalent libraries in R are as follows-
  1. pandas. For data manipulation and data wrangling. A collection of functions to understand and explore data. It is the counterpart of the dplyr and reshape2 packages in R.
  2. NumPy. For numerical computing. It's a package for efficient array computations. It allows us to do some operations on an entire column or table in one line. It is roughly approximate to Rcpp package in R which eliminates the limitation of slow speed in R. Numpy Tutorial
  3. Scipy.  For mathematical and scientific functions such as integration, interpolation, signal processing, linear algebra, statistics, etc. It is built on Numpy.
  4. Scikit-learn. A collection of machine learning algorithms. It is built on Numpy and Scipy. It can perform all the techniques that can be done in R using glm, knn, randomForest, rpart, e1071 packages.
  5. Matplotlib. For data visualization. It's a leading package for graphics in Python. It is equivalent to ggplot2 package in R.
  6. Statsmodels. For statistical and predictive modeling. It includes various functions to explore data and generate descriptive and predictive analytics. It allows users to run descriptive statistics, methods to impute missing values, statistical tests and take table output to HTML format.
  7. pandasql.  It allows SQL users to write SQL queries in Python. It is very helpful for people who love writing SQL queries to manipulate data. It is equivalent to the sqldf package in R.
Most of the above packages come preinstalled with Anaconda / Spyder.
    Comparison of Python and R Packages by Data Mining Task

    Task                   Python Package               R Package
    IDE                    Rodeo / Spyder               RStudio
    Data Manipulation      pandas                       dplyr and reshape2
    Machine Learning       Scikit-learn                 glm, knn, randomForest, rpart, e1071
    Data Visualization     ggplot + seaborn + bokeh     ggplot2
    Character Functions    Built-In Functions           stringr
    Reproducibility        Jupyter                      Knitr
    SQL Queries            pandasql                     sqldf
    Working with Dates     datetime                     lubridate
    Web Scraping           beautifulsoup                rvest

    Popular Python Commands

    The commands below would help you to install and update new and existing packages. Let's say, you want to install / uninstall pandas package.

    Run these commands from IPython console window. Don't forget to add ! before pip otherwise it would return syntax error.

    Install Package
    !pip install pandas

    Uninstall Package
    !pip uninstall pandas

    Show Information about Installed Package
    !pip show pandas

    List of Installed Packages
    !pip list

    Upgrade a package
    !pip install --upgrade pandas

      How to import a package

      There are multiple ways to import a package in Python. It is important to understand the difference between these styles.

      1. import pandas as pd
      It imports the package pandas under the alias pd. The DataFrame function in the pandas package is then called as pd.DataFrame.

      2. import pandas
      It imports the package without using an alias, but here the DataFrame function is called with the full package name, pandas.DataFrame.

      3. from pandas import *
      It imports the whole package and the DataFrame function is executed simply by typing DataFrame. This sometimes creates confusion when the same function name exists in more than one package.

      Pandas Data Structures : Series and DataFrame

      In pandas package, there are two data structures - series and dataframe. These structures are explained below in detail -
      1. Series is a one-dimensional array. You can access individual elements of a series using position. It's similar to vector in R.
      In the example below, we are generating 5 random values.
      import pandas as pd
      import numpy as np
      s1 = pd.Series(np.random.randn(5))
      s1
      0   -2.412015
      1   -0.451752
      2    1.174207
      3    0.766348
      4   -0.361815
      dtype: float64

      Extract first and second value

      You can get a particular element of a series using index value. See the examples below -

      s1[0]
      -2.412015
      s1[1]
      -0.451752
      s1[:3]
      0   -2.412015
      1   -0.451752
      2    1.174207

      2. DataFrame

      It is equivalent to data.frame in R. It is a 2-dimensional data structure that can store data of different types such as characters, integers, floating point values and factors. Those who are well-conversant with MS Excel can think of a data frame as an Excel spreadsheet.

      Comparison of Data Type in Python and Pandas

      The following table shows how Python and pandas package stores data.

      Data Type                                Pandas        Standard Python
      For character variable                   object        string
      For categorical variable                 category      -
      For numeric variable without decimals    int64         int
      For numeric variable with decimals       float64       float
      For date time variables                  datetime64    -

      Important Pandas Functions

      The table below shows a comparison of pandas functions with R functions for various data wrangling and manipulation tasks. It will help you memorize pandas functions. It's very handy information for programmers who are new to Python, and it includes solutions for most of the frequently used data exploration tasks.

      Function                              R                              Python (pandas package)
      Installing a package                  install.packages('name')      !pip install name
      Loading a package                     library(name)                  import name as other_name
      Checking working directory            getwd()                        import os; os.getcwd()
      Setting working directory             setwd()                        os.chdir()
      List files in a directory             dir()                          os.listdir()
      Remove an object                      rm('name')                     del object
      Select Variables                      select(df, x1, x2)             df[['x1', 'x2']]
      Drop Variables                        select(df, -(x1:x2))           df.drop(['x1', 'x2'], axis = 1)
      Filter Data                           filter(df, x1 >= 100)          df.query('x1 >= 100')
      Structure of a DataFrame              str(df)                        df.info()
      Summarize dataframe                   summary(df)                    df.describe()
      Get row names of dataframe "df"       rownames(df)                   df.index
      Get column names                      colnames(df)                   df.columns
      View Top N rows                       head(df, N)                    df.head(N)
      View Bottom N rows                    tail(df, N)                    df.tail(N)
      Get dimension of data frame           dim(df)                        df.shape
      Get number of rows                    nrow(df)                       df.shape[0]
      Get number of columns                 ncol(df)                       df.shape[1]
      Length of data frame                  length(df)                     len(df)
      Get random 3 rows from dataframe      sample_n(df, 3)                df.sample(n=3)
      Get random 10% rows                   sample_frac(df, 0.1)           df.sample(frac=0.1)
      Check Missing Values                  is.na(df$x)                    pd.isnull(df.x)
      Sorting                               arrange(df, x1, x2)            df.sort_values(['x1', 'x2'])
      Rename Variables                      rename(df, newvar = x1)        df.rename(columns={'x1': 'newvar'})
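
      A quick sketch exercising a few of the pandas calls from the table on a small, hypothetical dataframe:
      import pandas as pd

      df = pd.DataFrame({'x1': [120, 80, 150], 'x2': ['a', 'b', 'c']})  # illustrative data

      df.info()                       # structure of the dataframe (str(df) in R)
      print(df.shape)                 # dimensions (dim(df) in R)
      print(df.query('x1 >= 100'))    # filter rows (filter(df, x1 >= 100) in R)
      print(df.sort_values(['x1']))   # sorting (arrange(df, x1) in R)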


      Data Manipulation with pandas - Examples

      1. Import Required Packages

      You can import required packages using import statement. In the syntax below, we are asking Python to import numpy and pandas package. The 'as' is used to alias package name.
      import numpy as np
      import pandas as pd

      2. Build DataFrame

      We can build dataframe using DataFrame() function of pandas package.
      mydata = {'productcode': ['AA', 'AA', 'AA', 'BB', 'BB', 'BB'],
              'sales': [1010, 1025.2, 1404.2, 1251.7, 1160, 1604.8],
              'cost' : [1020, 1625.2, 1204, 1003.7, 1020, 1124]}
      df = pd.DataFrame(mydata)
       In this dataframe, we have three variables - productcode, sales, cost.
      Sample DataFrame

      To import data from CSV file


      You can use read_csv() function from pandas package to get data into python from CSV file.
      mydata= pd.read_csv("C:\\Users\\Deepanshu\\Documents\\file1.csv")
      Make sure you use double backslash when specifying path of CSV file. Alternatively, you can use forward slash to mention file path inside read_csv() function.

      Detailed Tutorial : Import Data in Python

      3. To see number of rows and columns

      You can run the command below to find out number of rows and columns.
      df.shape
       Result : (6, 3). It means 6 rows and 3 columns.

      4. To view first 3 rows

      The df.head(N) function can be used to check out the first N rows.
      df.head(3)
           cost productcode   sales
      0  1020.0           AA  1010.0
      1  1625.2           AA  1025.2
      2  1204.0           AA  1404.2

      5. Select or Drop Variables

      To keep a single variable, you can write in any of the following three methods -
      df.productcode
      df["productcode"]
      df.loc[: , "productcode"]
      To select variable by column position, you can use df.iloc function. In the example below, we are selecting second column. Column Index starts from 0. Hence, 1 refers to second column.
      df.iloc[: , 1]
      We can keep multiple variables by specifying the desired variables inside [ ]. Also, we can make use of the df.loc[ ] indexer.
      df[["productcode", "cost"]]
      df.loc[ : , ["productcode", "cost"]]

      Drop Variable

      We can remove variables by using df.drop() function. See the example below -
      df2 = df.drop(['sales'], axis = 1)

      6. To summarize data frame

      To summarize or explore data, you can submit the command below.
      df.describe()
                     cost        sales
      count      6.000000      6.00000
      mean    1166.150000   1242.65000
      std      237.926793    230.46669
      min     1003.700000   1010.00000
      25%     1020.000000   1058.90000
      50%     1072.000000   1205.85000
      75%     1184.000000   1366.07500
      max     1625.200000   1604.80000

      To summarise all the character variables, you can use the following script.
      df.describe(include=['O'])
      Similarly, you can use df.describe(include=['float64']) to view summary of all the numeric variables with decimals.

      To select only a particular variable, you can write the following code -
      df.productcode.describe()
      OR
      df["productcode"].describe()
      count      6
      unique     2
      top       BB
      freq       3
      Name: productcode, dtype: object

      7. To calculate summary statistics

      We can manually find out summary statistics such as count, mean, median by using commands below
      df.sales.mean()
      df.sales.median()
      df.sales.count()
      df.sales.min()
      df.sales.max()

      8. Filter Data

      Suppose you are asked to apply condition - productcode is equal to "AA" and sales greater than or equal to 1250.
      df1 = df[(df.productcode == "AA") & (df.sales >= 1250)]
      It can also be written like :
      df1 = df.query('(productcode == "AA") & (sales >= 1250)')
      In the second query, we do not need to specify DataFrame along with variable name.

      9. Sort Data

      In the code below, we are arranging the data in ascending order by sales.
      df.sort_values(['sales'])

      10.  Group By : Summary by Grouping Variable

      Like SQL GROUP BY, you want to summarize continuous variable by classification variable. In this case, we are calculating average sale and cost by product code.
      df.groupby(df.productcode).mean()
                          cost        sales
      productcode
      AA           1283.066667  1146.466667
      BB           1049.233333  1338.833333
      Instead of summarising multiple variables, you can run it for a single variable, i.e. sales. Submit the following script.
      df["sales"].groupby(df.productcode).mean()

      11. Define Categorical Variable

      Let's create a classification variable - id which contains only 3 unique values - 1/2/3.
      df0 = pd.DataFrame({'id': [1, 1, 2, 3, 1, 2, 2]})
      Let's define it as a categorical variable.
      We can use the astype() function to make id a categorical variable.
      df0.id = df0["id"].astype('category')
      Summarize this classification variable to check descriptive statistics.
      df0.describe()
              id
      count    7
      unique   3
      top      2
      freq     3

      Frequency Distribution

      You can calculate the frequency distribution of a categorical variable. It is one of the methods to explore a categorical variable.
      df['productcode'].value_counts()
      BB    3
      AA    3

      12. Generate Histogram

      A histogram is one of the methods to check the distribution of a continuous variable. In the figure shown below, there are two values for the variable 'sales' in the range 1000-1100. In the remaining intervals, there is only a single value. In this case, there are only 5 values. If you have a large dataset, you can plot a histogram to identify outliers in a continuous variable.
      df['sales'].hist()
      Histogram

      13. BoxPlot

      A boxplot is a method to visualize a continuous or numeric variable. It shows the minimum, Q1, Q2 (median), Q3 and maximum values, along with the IQR, in a single graph.
      df.boxplot(column='sales')
      BoxPlot

      Detailed Tutorial :Data Analysis with Pandas Tutorial

      Data Science using Python - Examples

      In this section, we cover how to perform data mining and machine learning algorithms with Python. sklearn is the most frequently used library for running data mining and machine learning algorithms. We will also cover statsmodels library for regression techniques. statsmodels library generates formattable output which can be used further in project report and presentation.

      1. Install the required libraries

      Import the following libraries before reading or exploring data
      #Import required libraries
      import pandas as pd
      import statsmodels.api as sm
      import numpy as np

      2. Download and import data into Python

      With the use of the pandas library, we can easily get data from the web into Python.
      # Read data from web
      df = pd.read_csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
      Variable   Type          Description
      gre        Continuous    Graduate Record Exam score
      gpa        Continuous    Grade Point Average
      rank       Categorical   Prestige of the undergraduate institution
      admit      Binary        Admission in graduate school

      The binary variable admit is a target variable.

      3. Explore Data

      Let's explore data. We'll answer the following questions -
      1. How many rows and columns in the data file?
      2. What are the distribution of variables?
      3. Check if any outlier(s)
      4. If outlier(s), treat them
      5. Check if any missing value(s)
      6. Impute Missing values (if any)
      # See no. of rows and columns
      df.shape
      Result : 400 rows and 4 columns

      In the code below, we rename the variable rank to 'position' as rank clashes with the rank() method in pandas.
      # rename rank column
      df = df.rename(columns={'rank': 'position'}) 
      Summarize and plot all the columns.
      # Summarize
      df.describe()
      # plot all of the columns
      df.hist()
      Categorical variable Analysis

      It is important to check the frequency distribution of a categorical variable. It helps to answer the question of whether the data is skewed.
      # Summarize
      df.position.value_counts(ascending=True)
      1     61
      4     67
      3    121
      2    151

      Generating Crosstab 

      By looking at the cross tabulation report, we can check whether we have enough events against each unique value of the categorical variable.
      pd.crosstab(df['admit'], df['position'])
      position   1   2   3   4
      admit
      0         28  97  93  55
      1         33  54  28  12

      Number of Missing Values

      We can write a simple loop to figure out the number of blank values in all variables in a dataset.
      for i in list(df.columns) :
          k = sum(pd.isnull(df[i]))
          print(i, k)
      In this case, there are no missing values in the dataset.
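
      As an aside, pandas can produce the same per-column count of missing values in a single call:
      # Count of missing values in each column
      df.isnull().sum()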

      4. Logistic Regression Model

      Logistic regression is a special type of regression where the target variable is categorical in nature and the independent variables can be discrete or continuous. In this post, we will demonstrate only binary logistic regression, where the target variable takes only binary values. Unlike linear regression, a logistic regression model returns the probability of the target variable. It assumes a binomial distribution of the dependent variable; in other words, it belongs to the binomial family.

      In python, we can write R-style model formula y ~ x1 + x2 + x3 using  patsy and statsmodels libraries. In the formula, we need to define variable 'position' as a categorical variable by mentioning it inside capital C(). You can also define reference category using reference= option.
      #Reference Category
      from patsy import dmatrices, Treatment
      y, X = dmatrices('admit ~ gre + gpa + C(position, Treatment(reference=4))', df, return_type = 'dataframe')
      It returns two datasets - X and y. The dataset 'y' contains variable admit which is a target variable. The other dataset 'X' contains Intercept (constant value), dummy variables for Treatment, gre and gpa. Since 4 is set as a reference category, it will be 0 against all the three dummy variables. See sample below -
      P  P_1 P_2 P_3
      3 0 0 1
      3 0 0 1
      1 1 0 0
      4 0 0 0
      4 0 0 0
      2 0 1 0


      Split Data into two parts

      80% of data goes to training dataset which is used for building model and 20% goes to test dataset which would be used for validating the model.
      from sklearn.model_selection import train_test_split
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

      Build Logistic Regression Model

      By default, the regression without formula style does not include intercept. To include it, we already have added intercept in X_train which would be used as a predictor.
      #Fit Logit model
      logit = sm.Logit(y_train, X_train)
      result = logit.fit()

      #Summary of Logistic regression model
      result.summary()
      result.params
                                Logit Regression Results
      ==============================================================================
      Dep. Variable:            admit     No. Observations:            320
      Model:                    Logit     Df Residuals:                315
      Method:                   MLE       Df Model:                    4
      Date:          Sat, 20 May 2017     Pseudo R-squ.:               0.03399
      Time:                  19:57:24     Log-Likelihood:              -193.49
      converged:                 True     LL-Null:                     -200.30
                                          LLR p-value:                 0.008627
      =======================================================================================
                              coef    std err        z     P>|z|    [95.0% Conf. Int.]
      ---------------------------------------------------------------------------------------
      C(position)[T.1]      1.4933      0.440    3.392     0.001       0.630     2.356
      C(position)[T.2]      0.6771      0.373    1.813     0.070      -0.055     1.409
      C(position)[T.3]      0.1071      0.410    0.261     0.794      -0.696     0.910
      gre                   0.0005      0.001    0.442     0.659      -0.002     0.003
      gpa                  -0.4613      0.214   -2.152     0.031      -0.881    -0.041
      =======================================================================================

      Confusion Matrix and Odd Ratio

      The odds ratio is the exponential of the parameter estimates.
      #Confusion Matrix
      result.pred_table()
      #Odd Ratio
      np.exp(result.params)

      Prediction on Test Data
      In this step, we take estimates of logit model which was built on training data and then later apply it into test data.
      #prediction on test data
      y_pred = result.predict(X_test)

      Calculate Area under Curve (ROC)
      # AUC on test data
      from sklearn.metrics import roc_curve, auc
      false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
      auc(false_positive_rate, true_positive_rate)
      Result : AUC = 0.6763

      Calculate Accuracy Score
      from sklearn.metrics import accuracy_score
      accuracy_score([ 1 if p > 0.5 else 0 for p in y_pred ], y_test)

      Decision Tree Model

      Decision trees can have a continuous or categorical target variable. When it is continuous, the tree is called a regression tree, and when it is categorical, it is called a classification tree. At each step, the algorithm selects the variable that best splits the set of values. There are several criteria to find the best split, such as Gini, Entropy, C4.5 and Chi-Square. Decision trees have several advantages: they are simple to use and easy to understand, require very few data preparation steps, can handle mixed data (both categorical and continuous variables), and are very fast.

      #Drop Intercept from predictors for tree algorithms
      X_train = X_train.drop(['Intercept'], axis = 1)
      X_test = X_test.drop(['Intercept'], axis = 1)

      #Decision Tree
      from sklearn.tree import DecisionTreeClassifier
      model_tree = DecisionTreeClassifier(max_depth=7)

      #Fit the model:
      model_tree.fit(X_train,y_train)

      #Make predictions on test set
      predictions_tree = model_tree.predict_proba(X_test)

      #AUC
      false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, predictions_tree[:,1])
      auc(false_positive_rate, true_positive_rate)
      Result : AUC = 0.664

      Important Note
      Feature engineering plays an important role in building predictive models. In the above case, we have not performed variable selection. We can also select best parameters by using grid search fine tuning technique.

      Random Forest Model

      A decision tree has the limitation of overfitting, which means it does not generalize patterns well and is very sensitive to small changes in the training data. To overcome this problem, random forest comes into the picture. It grows a large number of trees on randomized samples of the data and selects a random subset of variables for each tree. It is a more robust algorithm than a single decision tree. It is one of the most popular machine learning algorithms, commonly used in data science competitions, where it is consistently ranked among the top algorithms. It has become a part of every data science toolkit.

      #Random Forest
      from sklearn.ensemble import RandomForestClassifier
      model_rf = RandomForestClassifier(n_estimators=100, max_depth=7)

      #Fit the model:
      target = y_train['admit']
      model_rf.fit(X_train,target)

      #Make predictions on test set
      predictions_rf = model_rf.predict_proba(X_test)

      #AUC
      false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, predictions_rf[:,1])
      auc(false_positive_rate, true_positive_rate)

      #Variable Importance
      importances = pd.Series(model_rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
      print(importances)
      importances.plot.bar()

      Result : AUC = 0.6974

      Grid Search - Hyper Parameters Tuning

      The sklearn library makes hyper-parameter tuning very easy. It is a strategy to select the best parameters for an algorithm. In scikit-learn, hyper-parameters are passed as arguments to the constructor of the estimator classes, for example max_features in RandomForestClassifier or alpha for Lasso.

      from sklearn.model_selection import GridSearchCV
      rf = RandomForestClassifier()
      target = y_train['admit']

      param_grid = {
      'n_estimators': [100, 200, 300],
      'max_features': ['sqrt', 3, 4]
      }

      CV_rfc = GridSearchCV(estimator=rf , param_grid=param_grid, cv= 5, scoring='roc_auc')
      CV_rfc.fit(X_train,target)

      #Parameters with Scores (cv_results_ replaces the older grid_scores_ attribute)
      CV_rfc.cv_results_

      #Best Parameters
      CV_rfc.best_params_
      CV_rfc.best_estimator_

      #Make predictions on test set
      predictions_rf = CV_rfc.predict_proba(X_test)

      #AUC
      false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, predictions_rf[:,1])
      auc(false_positive_rate, true_positive_rate)

      Cross Validation
      # Cross Validation
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_predict,cross_val_score
      target = y['admit']
      prediction_logit = cross_val_predict(LogisticRegression(), X, target, cv=10, method='predict_proba')
      #AUC
      cross_val_score(LogisticRegression(fit_intercept = False), X, target, cv=10, scoring='roc_auc')

      Data Mining : PreProcessing Steps

      1.  The machine learning package sklearn requires all categorical variables in numeric form. Hence, we need to convert all character/categorical variables to be numeric. This can be accomplished using the following script. In sklearn,  there is already a function for this step.

      from sklearn.preprocessing import LabelEncoder

      def ConverttoNumeric(df):
          cols = list(df.select_dtypes(include=['category', 'object']))
          le = LabelEncoder()
          for i in cols:
              try:
                  df[i] = le.fit_transform(df[i])
              except:
                  print('Error in Variable :' + i)
          return df

      ConverttoNumeric(df)
      Encoding

      2. Create Dummy Variables

      Suppose you want to convert categorical variables into dummy variables. It is different from the previous example as it creates dummy variables instead of converting the variable to numeric codes.
      productcode_dummy = pd.get_dummies(df["productcode"])
      df2 = pd.concat([df, productcode_dummy], axis=1)

      The output looks like below -
         AA  BB
      0   1   0
      1   1   0
      2   1   0
      3   0   1
      4   0   1
      5   0   1

      Create k-1 Categories

      To avoid multi-collinearity, you can set one of the categories as the reference category and drop it while creating dummy variables. In the script below, we are dropping the first category.
      productcode_dummy = pd.get_dummies(df["productcode"], prefix='pcode', drop_first=True)
      df2 = pd.concat([df, productcode_dummy], axis=1)

      3. Impute Missing Values

      Imputing missing values is an important step of predictive modeling. In many algorithms, if missing values are not filled, the complete row is removed. If the data contains a lot of missing values, this can lead to huge data loss. There are multiple ways to impute missing values; some of the common techniques are to replace a missing value with the mean, median or zero. It makes sense to replace a missing value with 0 when 0 is itself meaningful, for example a flag indicating whether a customer holds a credit card product.

      Fill missing values of a particular variable
      # fill missing values with 0
      df['var1'] = df['var1'].fillna(0)
      # fill missing values with mean
      df['var1'] = df['var1'].fillna(df['var1'].mean())

      Apply imputation to the whole dataset
      from sklearn.preprocessing import Imputer

      # Set an imputer object
      # (in newer scikit-learn versions, Imputer has been replaced by sklearn.impute.SimpleImputer)
      mean_imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)

      # Train the imputor
      mean_imputer = mean_imputer.fit(df)

      # Apply imputation
      df_new = mean_imputer.transform(df.values)

      4. Outlier Treatment

      There are many ways to handle or treat outliers (or extreme values). Some of the methods are as follows -
      1. Cap extreme values at the 95th / 99th percentile, depending on the distribution (see the sketch after the log-transformation code below)
      2. Apply a log transformation to the variable. See below the implementation of a log transformation in Python.
      import numpy as np
      df['var1'] = np.log(df['var1'])
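
      For the first method (capping), a hedged sketch using the 99th percentile of the hypothetical variable var1:
      # Cap extreme values of var1 at its 99th percentile
      upper_cap = df['var1'].quantile(0.99)
      df['var1'] = df['var1'].clip(upper=upper_cap)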

      5. Standardization

      In some algorithms, it is required to standardize variables before running the actual algorithm. Standardization refers to the process of rescaling a variable so that it has zero mean and unit variance (a standard deviation of one).

      #load dataset
      from sklearn.datasets import load_boston
      dataset = load_boston()
      predictors = dataset.data
      target = dataset.target
      df = pd.DataFrame(predictors, columns = dataset.feature_names)

      #Apply Standardization
      from sklearn.preprocessing import StandardScaler
      k = StandardScaler()
      df2 = k.fit_transform(df)


      Next Steps

      Practice, practice and practice. Download free public data sets from the Kaggle / UCLA websites and try to play around with the data, generate insights from it with the pandas package and build statistical models using the sklearn package. I hope you find this tutorial helpful. I tried to cover all the important topics which a beginner must know about Python. Once you complete this tutorial, you can confidently say that you know how to program in Python and can implement machine learning algorithms using the sklearn package.

      ListenData: NumPy Tutorial with Exercises

       NumPy (acronym for 'Numerical Python' or 'Numeric Python') is one of the most essential packages for speedy mathematical computation on arrays and matrices in Python. It is also quite useful for dealing with multi-dimensional data, is a blessing for integrating C, C++ and FORTRAN tools, and provides numerous functions for Fourier transforms (FT) and linear algebra.

      Python : Numpy Tutorial

      Why NumPy instead of lists?

       One might wonder why one should prefer NumPy arrays when lists can hold the same data. If that question rings a bell, the following reasons may convince you:
       1. NumPy arrays have contiguous memory allocation, so the same data stored as a list requires more space than an array.
       2. They are faster to work with and hence more efficient than lists.
       3. They are more convenient to deal with.

        NumPy vs. Pandas

        Pandas is built on top of NumPy. In other words, NumPy is required for pandas to work, so pandas is not an alternative to NumPy. Instead, pandas offers additional methods and provides a more streamlined way of working with numerical and tabular data in Python.

        Importing numpy
        Firstly you need to import the numpy library. Importing numpy can be done by running the following command:
        import numpy as np
        It is common practice to import numpy with the alias 'np'. If the alias is not provided, then to access functions from numpy we have to write numpy.function; the alias lets us write np.function instead. Some of the common functions of numpy are listed below -

        Function    Task
        array       Create a numpy array
        ndim        Dimension of the array
        shape       Size of the array (number of rows and columns)
        size        Total number of elements in the array
        dtype       Type of elements in the array, i.e. int64, character
        reshape     Reshapes the array without changing the original array
        resize      Reshapes the array and also changes the original array
        arange      Create a sequence of numbers in an array
        itemsize    Size in bytes of each item
        diag        Create a diagonal matrix
        vstack      Stack arrays vertically
        hstack      Stack arrays horizontally
        1D array
        Using numpy an array is created by using np.array:
        a = np.array([15,25,14,78,96])
        a
        print(a)
        a
        Output: array([15, 25, 14, 78, 96])

        print(a)
        Output: [15 25 14 78 96]
        Notice that square brackets are required inside np.array; omitting them raises an error. To print the array we can use print(a).

        Changing the datatype
        np.array( ) has an additional parameter of dtype through which one can define whether the elements are integers or floating points or complex numbers.
        a.dtype
        a = np.array([15,25,14,78,96],dtype = "float")
        a
        a.dtype
        Initially datatype of 'a' was 'int32' which on modifying becomes 'float64'.

        1. int32 refers to a number without a decimal point. '32' means the number can be between -2147483648 and 2147483647. Similarly, int16 implies the number can be in the range -32768 to 32767.
        2. float64 refers to a number with decimal places.


        Creating the sequence of numbers
        If you want to create a sequence of numbers then using np.arange, we can get our sequence. To get the sequence of numbers from 20 to 29 we run the following command.
        b = np.arange(start = 20,stop = 30, step = 1)
        b
        array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29])
        In np.arange the end point is always excluded.

        np.arange provides an option of step which defines the difference between 2 consecutive numbers. If step is not provided then it takes the value 1 by default.

        Suppose we want to create an arithmetic progression with initial term 20 and common difference 2, upto 30; 30 being excluded.
        c = np.arange(20,30,2) #30 is excluded.
        c
        array([20, 22, 24, 26, 28])
        Take care that in np.arange( ) the stop argument is always excluded.

        Indexing in arrays
        It is important to note that Python indexing starts from 0. The syntax of indexing is as follows -
        1. x[start:end:step]: Elements in array x start through the end (but the end is excluded), default step value is 1.
        2. x[start:end] : Elements in array x start through the end (but the end is excluded)
        3. x[start:] : Elements start through the end
        4. x[:end] : Elements from the beginning through the end (but the end is excluded)

        If we want to extract 3rd element we write the index as 2 as it starts from 0.
        x = np.arange(10)
        x[2]
        x[2:5]
        x[::2]
        x[1::2]
        x
        Output: [0 1 2 3 4 5 6 7 8 9]

        x[2]
        Output: 2

        x[2:5]
        Output: array([2, 3, 4])

        x[::2]
        Output: array([0, 2, 4, 6, 8])

        x[1::2]
        Output: array([1, 3, 5, 7, 9])

        Note that in x[2:5] elements starting from 2nd index up to 5th index(exclusive) are selected.
        If we want to set every third element, from the start up to index 7 (excluding 7), to the value 123, we write:
        x[:7:3] = 123
        x
         array([123,   1,   2, 123,   4,   5, 123,   7,   8,   9])
        To reverse a given array we write:
        x = np.arange(10)
        x[ : :-1] # reversed x
        array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
        Note that the above command does not modify the original array.

        Reshaping the arrays
        To reshape the array we can use reshape( ).
        f = np.arange(101,113)
        f.reshape(3,4)
        f
         array([101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112])

        Note that reshape() does not alter the shape of the original array. Thus to modify the original array we can use resize( )
        f.resize(3,4)
        f
        array([[101, 102, 103, 104],
        [105, 106, 107, 108],
        [109, 110, 111, 112]])

        If a dimension is given as -1 in a reshaping, the other dimension is automatically calculated, provided that the total number of elements in the array is divisible by the given dimension.
        f.reshape(3,-1)
        array([[101, 102, 103, 104],
        [105, 106, 107, 108],
        [109, 110, 111, 112]])

        In the above code we only directed that we will have 3 rows. Python automatically calculates the number of elements in other dimension i.e. 4 columns.

        Missing Data
        The missing data is represented by NaN (acronym for Not a Number). You can use the command np.nan
        val = np.array([15,10, np.nan, 3, 2, 5, 6, 4])
        val.sum()
        Out: nan
        To ignore missing values, you can use np.nansum(val) which returns 45

        To check whether an array contains missing values, you can use the function np.isnan( ).
        np.isnan(val)


        2D arrays
        A 2D array in numpy can be created in the following manner:
        g = np.array([(10,20,30),(40,50,60)])
        #Alternatively
        g = np.array([[10,20,30],[40,50,60]])
        g
        The dimension, total number of elements and shape can be ascertained by ndim, size and shape respectively:
        g.ndim
        g.size
        g.shape
        g.ndim
        Output: 2

        g.size
        Output: 6

        g.shape
        Output: (2, 3)

        Creating some usual matrices
        numpy provides the utility to create some usual matrices which are commonly used for linear algebra.
        To create a matrix of all zeros of 2 rows and 4 columns we can use np.zeros( ):
        np.zeros( (2,4) )
        array([[ 0.,  0.,  0.,  0.],
        [ 0., 0., 0., 0.]])
        Here the dtype can also be specified. For a zero matrix the default dtype is 'float'. To change it to integer we write 'dtype = np.int16'
        np.zeros([2,4],dtype=np.int16)
        array([[0, 0, 0, 0],
        [0, 0, 0, 0]], dtype=int16)
np.empty( ) creates an array of the given shape without initialising its entries, so it contains whatever arbitrary values happen to be in memory (it does not return random numbers between 0 and 1).
        np.empty( (2,3) )
        array([[  2.16443571e-312,   2.20687562e-312,   2.24931554e-312],
        [ 2.29175545e-312, 2.33419537e-312, 2.37663529e-312]])
Note: The results may vary every time you run np.empty( ).
To create a matrix of all ones we write np.ones( ). We can create a 3 * 3 matrix of all ones by:
        np.ones([3,3])
        array([[ 1.,  1.,  1.],
        [ 1., 1., 1.],
        [ 1., 1., 1.]])
        To create a diagonal matrix we can write np.diag( ). To create a diagonal matrix where the diagonal elements are 14,15,16 and 17 we write:
        np.diag([14,15,16,17])
        array([[14,  0,  0,  0],
        [ 0, 15, 0, 0],
        [ 0, 0, 16, 0],
        [ 0, 0, 0, 17]])
        To create an identity matrix we can use np.eye( ) .
        np.eye(5,dtype = "int")
        array([[1, 0, 0, 0, 0],
        [0, 1, 0, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 0, 1, 0],
        [0, 0, 0, 0, 1]])
        By default the datatype in np.eye( ) is 'float' thus we write dtype = "int" to convert it to integers.

        Reshaping 2D arrays
        To get a flattened 1D array we can use ravel( )
        g = np.array([(10,20,30),(40,50,60)])
        g.ravel()
         array([10, 20, 30, 40, 50, 60])
        To change the shape of 2D array we can use reshape. Writing -1 will calculate the other dimension automatically and does not modify the original array.
        g.reshape(3,-1) # returns the array with a modified shape
        #It does not modify the original array
        g.shape
         (2, 3)
        Similar to 1D arrays, using resize( ) will modify the shape in the original array.
        g.resize((3,2))
        g #resize modifies the original array
        array([[10, 20],
        [30, 40],
        [50, 60]])

        Time for some matrix algebra
        Let us create some arrays A,b and B and they will be used for this section:
        A = np.array([[2,0,1],[4,3,8],[7,6,9]])
        b = np.array([1,101,14])
        B = np.array([[10,20,30],[40,50,60],[70,80,90]])
        In order to get the transpose, trace and inverse we use A.transpose( ) , np.trace( ) and np.linalg.inv( ) respectively.
        A.T #transpose
        A.transpose() #transpose
        np.trace(A) # trace
        np.linalg.inv(A) #Inverse
        A.transpose()  #transpose
        Output:
        array([[2, 4, 7],
        [0, 3, 6],
        [1, 8, 9]])

        np.trace(A) # trace
        Output: 14

        np.linalg.inv(A) #Inverse
        Output:
        array([[ 0.53846154, -0.15384615, 0.07692308],
        [-0.51282051, -0.28205128, 0.30769231],
        [-0.07692308, 0.30769231, -0.15384615]])
        Note that transpose does not modify the original array.

        Matrix addition and subtraction can be done in the usual way:
        A+B
        A-B
        A+B
        Output:
        array([[12, 20, 31],
        [44, 53, 68],
        [77, 86, 99]])

        A-B
        Output:
        array([[ -8, -20, -29],
        [-36, -47, -52],
        [-63, -74, -81]])
        Matrix multiplication of A and B can be accomplished by A.dot(B). Where A will be the 1st matrix on the left hand side and B will be the second matrix on the right side.
        A.dot(B)
        array([[  90,  120,  150],
        [ 720, 870, 1020],
        [ 940, 1160, 1380]])
        To solve the system of linear equations: Ax = b we use np.linalg.solve( )
        np.linalg.solve(A,b)
        array([-13.92307692, -24.69230769,  28.84615385])
The eigenvalues and eigenvectors can be calculated using np.linalg.eig( )
        np.linalg.eig(A)
        (array([ 14.0874236 ,   1.62072127,  -1.70814487]),
        array([[-0.06599631, -0.78226966, -0.14996331],
        [-0.59939873, 0.54774477, -0.81748379],
        [-0.7977253 , 0.29669824, 0.55608566]]))
The first array contains the eigenvalues and the second is the matrix of eigenvectors, where each column is the eigenvector corresponding to the respective eigenvalue.

        Some Mathematics functions

        We can have various trigonometric functions like sin, cosine etc. using numpy:
        B = np.array([[0,-20,36],[40,50,1]])
        np.sin(B)
        array([[ 0.        , -0.91294525, -0.99177885],
        [ 0.74511316, -0.26237485, 0.84147098]])
The result is a matrix with sin( ) applied to every element.
To raise every element to a power we use **
        B**2
        array([[   0,  400, 1296],
        [1600, 2500, 1]], dtype=int32)
        We get the matrix of the square of all elements of B.
To check whether a condition is satisfied by the elements of a matrix we write the criterion. For instance, to check if the elements of B are more than 25 we write:
        B>25
        array([[False, False,  True],
        [ True, True, False]], dtype=bool)
        We get a matrix of Booleans where True indicates that the corresponding element is greater than 25 and False indicates that the condition is not satisfied.
        In a similar manner np.absolute, np.sqrt and np.exp return the matrices of absolute numbers, square roots and exponentials respectively.
        np.absolute(B)
        np.sqrt(B)
        np.exp(B)
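A minimal sketch of the first of these, using the same B as above; note that np.sqrt( ) returns nan (and raises a RuntimeWarning) for the negative entry -20:
np.absolute(B)
Output:
array([[ 0, 20, 36],
[40, 50,  1]])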
        Now we consider a matrix A of shape 3*3:
        A = np.arange(1,10).reshape(3,3)
        A
        array([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]])
        To find the sum, minimum, maximum, mean, standard deviation and variance respectively we use the following commands:
        A.sum()
        A.min()
        A.max()
        A.mean()
        A.std() #Standard deviation
        A.var() #Variance
        A.sum()
        Output: 45

        A.min()
        Output: 1

        A.max()
        Output: 9

        A.mean()
        Output: 5.0

        A.std() #Standard deviation
        Output: 2.5819888974716112

        A.var()
        Output: 6.666666666666667
        In order to obtain the index of the minimum and maximum elements we use argmin( ) and argmax( ) respectively.
        A.argmin()
        A.argmax()
        A.argmin()
        Output: 0

        A.argmax()
        Output: 8
        If we wish to find the above statistics for each row or column then we need to specify the axis:
        A.sum(axis=0)
        A.mean(axis = 0)
        A.std(axis = 0)
        A.argmin(axis = 0)
        A.sum(axis=0)                 # sum of each column, it will move in downward direction
        Output: array([12, 15, 18])

        A.mean(axis = 0)
        Output: array([ 4., 5., 6.])

        A.std(axis = 0)
        Output: array([ 2.44948974, 2.44948974, 2.44948974])

        A.argmin(axis = 0)
        Output: array([0, 0, 0], dtype=int64)
By setting axis = 0, calculations move in the downward direction, i.e. we get the statistics for each column. To find the minimum and the index of the maximum element for each row, we need to move in the rightward direction, so we write axis = 1:
        A.min(axis=1)
        A.argmax(axis = 1)
        A.min(axis=1)                  # min of each row, it will move in rightwise direction
        Output: array([1, 4, 7])

        A.argmax(axis = 1)
        Output: array([2, 2, 2], dtype=int64)
        To find the cumulative sum along each row we use cumsum( )
        A.cumsum(axis=1)
        array([[ 1,  3,  6],
        [ 4, 9, 15],
        [ 7, 15, 24]], dtype=int32)

        Creating 3D arrays
        Numpy also provides the facility to create 3D arrays. A 3D array can be created as:
        X = np.array( [[[ 1, 2,3],
        [ 4, 5, 6]],
        [[7,8,9],
        [10,11,12]]])
        X.shape
        X.ndim
        X.size
X contains two 2D arrays, thus the shape is (2, 2, 3). The total number of elements is 12.
        To calculate the sum along a particular axis we use the axis parameter as follows:
        X.sum(axis = 0)
        X.sum(axis = 1)
        X.sum(axis = 2)
        X.sum(axis = 0)
        Output:
        array([[ 8, 10, 12],
        [14, 16, 18]])

        X.sum(axis = 1)
        Output:
        array([[ 5, 7, 9],
        [17, 19, 21]])

        X.sum(axis = 2)
        Output:
        array([[ 6, 15],
        [24, 33]])
        axis = 0 returns the sum of the corresponding elements of each 2D array. axis = 1 returns the sum of elements in each column in each matrix while axis = 2 returns the sum of each row in each matrix.
        X.ravel()
         array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12])
ravel( ) flattens all the elements into a single 1D array.
        Consider a 3D array:
        X = np.array( [[[ 1, 2,3],
        [ 4, 5, 6]],
        [[7,8,9],
        [10,11,12]]])
        To extract the 2nd matrix we write:
        X[1,...] # same as X[1,:,:] or X[1]
        array([[ 7,  8,  9],
        [10, 11, 12]])
        Remember python indexing starts from 0 that is why we wrote 1 to extract the 2nd 2D array.
        To extract the first element from all the rows we write:
        X[...,0] # same as X[:,:,0]
        array([[ 1,  4],
        [ 7, 10]])


        Find out position of elements that satisfy a given condition
        a = np.array([8, 3, 7, 0, 4, 2, 5, 2])
        np.where(a > 4)
(array([0, 2, 6]),)
np.where returns a tuple of arrays giving the positions where the elements of 'a' are greater than 4.
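np.where( ) can also take replacement values as its 2nd and 3rd arguments (the same form is used later in the pandas part of this tutorial); a minimal sketch:
np.where(a > 4, 1, 0) #1 where the condition holds, 0 elsewhere
Output: array([1, 0, 1, 0, 0, 0, 1, 0])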

        Indexing with Arrays of Indices
        Consider a 1D array.
        x = np.arange(11,35,2)
        x
        array([11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33])
        We form a 1D array i which subsets the elements of x as follows:
        i = np.array( [0,1,5,3,7,9 ] )
        x[i]
        array([11, 13, 21, 17, 25, 29])
        In a similar manner we create a 2D array j of indices to subset x.
        j = np.array( [ [ 0, 1], [ 6, 2 ] ] )
        x[j]
        array([[11, 13],
        [23, 15]])
        Similarly we can create both i and j as 2D arrays of indices for x
        x = np.arange(15).reshape(3,5)
        x
        i = np.array( [ [0,1], # indices for the first dim
        [2,0] ] )
        j = np.array( [ [1,1], # indices for the second dim
        [2,0] ] )
To pick, for each position, the row index from i and the column index from j, we write:
        x[i,j] # i and j must have equal shape
        array([[ 1,  6],
        [12, 0]])
        To extract ith index from 3rd column we write:
        x[i,2]
        array([[ 2,  7],
        [12, 2]])
If for each row we want to pick the column indices given by j, we write:
        x[:,j]
        array([[[ 1,  1],
        [ 2, 0]],

        [[ 6, 6],
        [ 7, 5]],

        [[11, 11],
        [12, 10]]])
For each row in turn, the elements at the column indices given by j are selected: first for the 1st row, then the 2nd row, then the 3rd row.

        You can also use indexing with arrays to assign the values:
        x = np.arange(10)
        x
        x[[4,5,8,1,2]] = 0
        x
        array([0, 0, 0, 3, 0, 0, 6, 7, 0, 9])
        0 is assigned to 4th, 5th, 8th, 1st and 2nd indices of x.
        When the list of indices contains repetitions then it assigns the last value to that index:
        x = np.arange(10)
        x
        x[[4,4,2,3]] = [100,200,300,400]
        x
        array([  0,   1, 300, 400, 200,   5,   6,   7,   8,   9])
        Notice that for the 5th element(i.e. 4th index) the value assigned is 200, not 100.
Caution: if the += operator is used with repeated indices, the increment is applied only once per index, not once per repetition.
        x = np.arange(10)
        x[[1,1,1,7,7]]+=1
        x
         array([0, 2, 2, 3, 4, 5, 6, 8, 8, 9])
Although indices 1 and 7 are repeated, they are incremented only once.
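If you actually need the increment applied once per occurrence, np.add.at( ) handles repeated indices; a minimal sketch:
x = np.arange(10)
np.add.at(x, [1,1,1,7,7], 1) #index 1 is incremented 3 times, index 7 twice
x
array([0, 4, 2, 3, 4, 5, 6, 9, 8, 9])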

        Indexing with Boolean Arrays
We create a 2D array and store our condition in b. Wherever the condition holds, b contains True; otherwise it contains False.
        a = np.arange(12).reshape(3,4)
        b = a > 4
        b
        array([[False, False, False, False],
        [False, True, True, True],
        [ True, True, True, True]], dtype=bool)
        Note that 'b' is a Boolean with same shape as that of 'a'.
        To select the elements from 'a' which adhere to condition 'b' we write:
        a[b]
        array([ 5,  6,  7,  8,  9, 10, 11])
a[b] returns a 1D array containing the selected elements; 'a' itself is unchanged.
        This property can be very useful in assignments:
        a[b] = 0
        a
        array([[0, 1, 2, 3],
        [4, 0, 0, 0],
        [0, 0, 0, 0]])
        All elements of 'a' higher than 4 become 0
        As done in integer indexing we can use indexing via Booleans:
        Let x be the original matrix and 'y' and 'z' be the arrays of Booleans to select the rows and columns.
        x = np.arange(15).reshape(3,5)
        y = np.array([True,True,False]) # first dim selection
        z = np.array([True,True,False,True,False]) # second dim selection
        We write the x[y,:] which will select only those rows where y is True.
        x[y,:] # selecting rows
        x[y] # same thing
        Writing x[:,z] will select only those columns where z is True.
        x[:,z] # selecting columns
        x[y,:]                                   # selecting rows
        Output:
        array([[0, 1, 2, 3, 4],
        [5, 6, 7, 8, 9]])

        x[y] # same thing
        Output:
        array([[0, 1, 2, 3, 4],
        [5, 6, 7, 8, 9]])

        x[:,z] # selecting columns
        Output:
        array([[ 0, 1, 3],
        [ 5, 6, 8],
        [10, 11, 13]])

        Statistics on Pandas DataFrame

        Let's create dummy data frame for illustration :
        np.random.seed(234)
        mydata = pd.DataFrame({"x1" : np.random.randint(low=1, high=100, size=10),
        "x2" : range(10)
        })

        1. Calculate mean of each column of data frame
        np.mean(mydata)
        2. Calculate median of each column of data frame
        np.median(mydata, axis=0)
        axis = 0 means the median function would be run on each column. axis = 1 implies the function to be run on each row.
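The same statistics can also be obtained with pandas' own methods; a minimal sketch:
mydata.mean() #mean of each column
mydata.median() #median of each column
mydata.mean(axis = 1) #mean of each row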

        Stacking various arrays
        Let us consider 2 arrays A and B:
        A = np.array([[10,20,30],[40,50,60]])
        B = np.array([[100,200,300],[400,500,600]])
        To join them vertically we use np.vstack( ).
        np.vstack((A,B)) #Stacking vertically
        array([[ 10,  20,  30],
        [ 40, 50, 60],
        [100, 200, 300],
        [400, 500, 600]])
        To join them horizontally we use np.hstack( ).
        np.hstack((A,B)) #Stacking horizontally
        array([[ 10,  20,  30, 100, 200, 300],
        [ 40, 50, 60, 400, 500, 600]])
newaxis helps in transforming a 1D array into a 2D column vector.
        from numpy import newaxis
        a = np.array([4.,1.])
        b = np.array([2.,8.])
        a[:,newaxis]
        array([[ 4.],
        [ 1.]])
The function np.column_stack( ) stacks 1D arrays as columns into a 2D array. For 2D inputs (such as the column vectors created with newaxis above) it is equivalent to hstack:
        np.column_stack((a[:,newaxis],b[:,newaxis]))
        np.hstack((a[:,newaxis],b[:,newaxis])) # same as column_stack
        np.column_stack((a[:,newaxis],b[:,newaxis]))
        Output:
        array([[ 4., 2.],
        [ 1., 8.]])

        np.hstack((a[:,newaxis],b[:,newaxis]))
        Output:
        array([[ 4., 2.],
        [ 1., 8.]])

        Splitting the arrays
        Consider an array 'z' of 15 elements:
        z = np.arange(1,16)
        Using np.hsplit( ) one can split the arrays
        np.hsplit(z,5) # Split a into 5 arrays
        [array([1, 2, 3]),
        array([4, 5, 6]),
        array([7, 8, 9]),
        array([10, 11, 12]),
        array([13, 14, 15])]
It splits 'z' into 5 arrays of equal length.
        On passing 2 elements we get:
        np.hsplit(z,(3,5))
        [array([1, 2, 3]),
        array([4, 5]),
        array([ 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])]
        It splits 'z' after the third and the fifth element.
        For 2D arrays np.hsplit( ) works as follows:
        A = np.arange(1,31).reshape(3,10)
        A
        np.hsplit(A,5) # Split a into 5 arrays
        [array([[ 1,  2],
        [11, 12],
        [21, 22]]), array([[ 3, 4],
        [13, 14],
        [23, 24]]), array([[ 5, 6],
        [15, 16],
        [25, 26]]), array([[ 7, 8],
        [17, 18],
        [27, 28]]), array([[ 9, 10],
        [19, 20],
        [29, 30]])]
In the above command A is split into 5 arrays of the same shape.
        To split after the third and the fifth column we write:
        np.hsplit(A,(3,5))
        [array([[ 1,  2,  3],
        [11, 12, 13],
        [21, 22, 23]]), array([[ 4, 5],
        [14, 15],
        [24, 25]]), array([[ 6, 7, 8, 9, 10],
        [16, 17, 18, 19, 20],
        [26, 27, 28, 29, 30]])]
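There is an analogous np.vsplit( ) for splitting along rows; a minimal sketch using the same A:
np.vsplit(A,3) # Split A into 3 arrays row-wise
[array([[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10]]),
array([[11, 12, 13, 14, 15, 16, 17, 18, 19, 20]]),
array([[21, 22, 23, 24, 25, 26, 27, 28, 29, 30]])]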

        Copying
        Consider an array x
        x = np.arange(1,16)
We assign y = x and then check whether y is x; the result is True because both names refer to the same object.
        y = x
        y is x
        Let us change the shape of y
        y.shape = 3,5
Note that it also alters the shape of x, since y and x are the same object:
        x.shape
        (3, 5)

        Creating a view of the data
        Let us store z as a view of x by:
        z = x.view()
        z is x
        False
        Thus z is not x.
        Changing the shape of z
        z.shape = 5,3
Changing the shape of the view does not alter the shape of x:
        x.shape
        (3, 5)
        Changing an element in z
        z[0,0] = 1234
Note that the corresponding value in x also gets altered:
        x
        array([[1234,    2,    3,    4,    5],
        [ 6, 7, 8, 9, 10],
        [ 11, 12, 13, 14, 15]])
Thus changing the shape of a view does not affect the original data, but changing the values in a view does affect the original data.


        Creating a copy of the data:
        Now let us create z as a copy of x:
        z = x.copy()
        Note that z is not x
        z is x
        Changing the value in z
        z[0,0] = 9999
        No alterations are made in x.
        x
        array([[1234,    2,    3,    4,    5],
        [ 6, 7, 8, 9, 10],
        [ 11, 12, 13, 14, 15]])
Pandas may sometimes raise a 'SettingWithCopyWarning' because it cannot tell whether a new dataframe or array (created as a subset of another dataframe or array) is a view or a copy. In such situations you should state explicitly whether you want a copy or a view, otherwise the results may not be what you expect.
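A minimal sketch of making the intent explicit with .copy( ) when taking a subset (the dataframe df and the columns 'col' and 'flag' here are hypothetical, for illustration only):
sub = df[df['col'] > 0].copy() #explicit copy, safe to modify
sub['flag'] = 1 #no SettingWithCopyWarning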

        Exercises : Numpy


        1. How to extract even numbers from array?

        arr = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
        Desired Output :array([0, 2, 4, 6, 8])

Solution:
        arr[arr % 2 == 0]

        2. How to find out the position where elements of x and y are same

        x = np.array([5,6,7,8,3,4])
        y = np.array([5,3,4,5,2,4])
Desired Output : (array([0, 5]),)

Solution:
        np.where(x == y)

3. How to standardize values so that they lie between 0 and 1

k = np.array([5,3,4,5,2,4])
Hint : (k - min(k)) / (max(k) - min(k))

Solution:
        kmax, kmin = k.max(), k.min()
        k_new = (k - kmin)/(kmax - kmin)

        4. How to calculate the percentile scores of an array

        p = np.array([15,10, 3,2,5,6,4])

Solution:
        np.percentile(p, q=[5, 95])

        5. Print the number of missing values in an array

        p = np.array([5,10, np.nan, 3, 2, 5, 6, np.nan])

Solution:
        print("Number of missing values =", np.isnan(p).sum())

        ListenData: Pandas Python Tutorial - Learn by Examples

Pandas, one of the most popular packages in Python, is widely used for data manipulation. It is a very powerful and versatile package which makes data cleaning and wrangling much easier and more pleasant.

The Pandas library is a great contribution to the Python community and it makes Python one of the top programming languages for data science and analytics. It has become the first choice of data analysts and scientists for data analysis and manipulation.

        Data Analysis with Python : Pandas Step by Step Guide

        Why pandas?
It has many functions which are essential for data handling. In short, it can perform the following tasks for you -
        1. Create a structured data set similar to R's data frame and Excel spreadsheet.
        2. Reading data from various sources such as CSV, TXT, XLSX, SQL database, R etc.
        3. Selecting particular rows or columns from data set
        4. Arranging data in ascending or descending order
        5. Filtering data based on some conditions
        6. Summarizing data by classification variable
        7. Reshape data into wide or long format
        8. Time series analysis
        9. Merging and concatenating two datasets
        10. Iterate over the rows of dataset
        11. Writing or Exporting data in CSV or Excel format

        Datasets:

        In this tutorial we will use two datasets: 'income' and 'iris'.
        1. 'income' data : This data contains the income of various states from 2002 to 2015. The dataset contains 51 observations and 16 variables. Download link
2. 'iris' data: It comprises 150 observations with 5 variables. We have 3 species of flowers (50 flowers for each species), and for each flower the sepal length and width and the petal length and width are given. Download link


        Important pandas functions to remember

        The following is a list of common tasks along with pandas functions.
Utility : Function
Extract Column Names : df.columns
Select first 2 rows : df.iloc[:2]
Select first 2 columns : df.iloc[:,:2]
Select columns by name : df.loc[:,["col1","col2"]]
Select random no. of rows : df.sample(n = 10)
Select fraction of random rows : df.sample(frac = 0.2)
Rename the variables : df.rename( )
Selecting a column as index : df.set_index( )
Removing rows or columns : df.drop( )
Sorting values : df.sort_values( )
Grouping variables : df.groupby( )
Filtering : df.query( )
Finding the missing values : df.isnull( )
Dropping the missing values : df.dropna( )
Removing the duplicates : df.drop_duplicates( )
Creating dummies : pd.get_dummies( )
Ranking : df.rank( )
Cumulative sum : df.cumsum( )
Quantiles : df.quantile( )
Selecting numeric variables : df.select_dtypes( )
Concatenating two dataframes : pd.concat()
Merging on basis of common variable : pd.merge( )

        Importing pandas library

You need to import or load the pandas library first in order to use it. "Importing a library" means loading it into memory so that you can use it. Run the following code to import the pandas library:
        import pandas as pd
        The "pd" is an alias or abbreviation which will be used as a shortcut to access or call pandas functions. To access the functions from pandas library, you just need to type pd.function instead of  pandas.function every time you need to apply it.

        Importing Dataset

        To read or import data from CSV file, you can use read_csv() function. In the function, you need to specify the file location of your CSV file.
        income = pd.read_csv("C:\\Users\\Hp\\Python\\Basics\\income.csv")
         Index       State    Y2002    Y2003    Y2004    Y2005    Y2006    Y2007  \
        0 A Alabama 1296530 1317711 1118631 1492583 1107408 1440134
        1 A Alaska 1170302 1960378 1818085 1447852 1861639 1465841
        2 A Arizona 1742027 1968140 1377583 1782199 1102568 1109382
        3 A Arkansas 1485531 1994927 1119299 1947979 1669191 1801213
        4 C California 1685349 1675807 1889570 1480280 1735069 1812546

        Y2008 Y2009 Y2010 Y2011 Y2012 Y2013 Y2014 Y2015
        0 1945229 1944173 1237582 1440756 1186741 1852841 1558906 1916661
        1 1551826 1436541 1629616 1230866 1512804 1985302 1580394 1979143
        2 1752886 1554330 1300521 1130709 1907284 1363279 1525866 1647724
        3 1188104 1628980 1669295 1928238 1216675 1591896 1360959 1329341
        4 1487315 1663809 1624509 1639670 1921845 1156536 1388461 1644607

        Get Variable Names

By using the income.columns command, you can fetch the names of variables of a data frame.
        Index(['Index', 'State', 'Y2002', 'Y2003', 'Y2004', 'Y2005', 'Y2006', 'Y2007',
        'Y2008', 'Y2009', 'Y2010', 'Y2011', 'Y2012', 'Y2013', 'Y2014', 'Y2015'],
        dtype='object')
        income.columns[0:2] returns first two column names 'Index', 'State'. In python, indexing starts from 0.

        Knowing the Variable types

        You can use the dataFrameName.dtypes command to extract the information of types of variables stored in the data frame.
        income.dtypes 
        Index    object
        State object
        Y2002 int64
        Y2003 int64
        Y2004 int64
        Y2005 int64
        Y2006 int64
        Y2007 int64
        Y2008 int64
        Y2009 int64
        Y2010 int64
        Y2011 int64
        Y2012 int64
        Y2013 int64
        Y2014 int64
        Y2015 int64
        dtype: object

        Here 'object' means strings or character variables. 'int64' refers to numeric variables (without decimals).

        To see the variable type of one variable (let's say "State") instead of all the variables, you can use the command below -
        income['State'].dtypes
        It returns dtype('O'). In this case, 'O' refers to object i.e. type of variable as character.

        Changing the data types

        Y2008 is an integer. Suppose we want to convert it to float (numeric variable with decimals) we can write:
        income.Y2008 = income.Y2008.astype(float)
        income.dtypes
        Index     object
        State object
        Y2002 int64
        Y2003 int64
        Y2004 int64
        Y2005 int64
        Y2006 int64
        Y2007 int64
        Y2008 float64
        Y2009 int64
        Y2010 int64
        Y2011 int64
        Y2012 int64
        Y2013 int64
        Y2014 int64
        Y2015 int64
        dtype: object

        To view the dimensions or shape of the data
        income.shape
         (51, 16)

        51 is the number of rows and 16 is the number of columns.

        You can also use shape[0] to see the number of rows (similar to nrow() in R) and shape[1] for number of columns (similar to ncol() in R). 
        income.shape[0]
        income.shape[1]

        To view only some of the rows

        By default head( ) shows first 5 rows. If we want to see a specific number of rows we can mention it in the parenthesis. Similarly tail( ) function shows last 5 rows by default.
        income.head()
        income.head(2) #shows first 2 rows.
        income.tail()
        income.tail(2) #shows last 2 rows

        Alternatively, any of the following commands can be used to fetch first five rows.
        income[0:5]
        income.iloc[0:5]

        Define Categorical Variable

Like the factor() function in R, we can create a categorical variable in Python using the "category" dtype.
        s = pd.Series([1,2,3,1,2], dtype="category")
        s
        0    1
        1 2
        2 3
        3 1
        4 2
        dtype: category
        Categories (3, int64): [1, 2, 3]

        Extract Unique Values

        The unique() function shows the unique levels or categories in the dataset.
        income.Index.unique()
        array(['A', 'C', 'D', ..., 'U', 'V', 'W'], dtype=object)


        The nunique( ) shows the number of unique values.
        income.Index.nunique()
It returns 19 as the Index column contains 19 distinct values.

        Generate Cross Tab

        pd.crosstab( ) is used to create a bivariate frequency distribution. Here the bivariate frequency distribution is between Index and State columns.
        pd.crosstab(income.Index,income.State)

        Creating a frequency distribution

        income.Index selects the 'Index' column of 'income' dataset and value_counts( ) creates a frequency distribution. By default ascending = False i.e. it will show the 'Index' having the maximum frequency on the top.
        income.Index.value_counts(ascending = True)
        F    1
        G 1
        U 1
        L 1
        H 1
        P 1
        R 1
        D 2
        T 2
        S 2
        V 2
        K 2
        O 3
        C 3
        I 4
        W 4
        A 4
        M 8
        N 8
        Name: Index, dtype: int64

        To draw the samples
income.sample( ) is used to draw random samples from the dataset, containing all the columns. Here n = 5 means we need 5 rows and frac = 0.1 means we need 10 percent of the rows as the sample.
        income.sample(n = 5)
        income.sample(frac = 0.1)
        Selecting only a few of the columns
To select only specific columns we use either the loc[ ] or iloc[ ] functions. The rows or columns to be selected are passed as lists. "Index":"Y2008" denotes that all the columns from Index to Y2008 are to be selected.

        Syntax of df.loc[  ]
        df.loc[row_index , column_index]
        income.loc[:,["Index","State","Y2008"]]
        income.loc[0:2,["Index","State","Y2008"]]  #Selecting rows with Index label 0 to 2 & columns
        income.loc[:,"Index":"Y2008"]  #Selecting consecutive columns
        #In the above command both Index and Y2008 are included.
        income.iloc[:,0:5]  #Columns from 1 to 5 are included. 6th column not included
        Difference between loc and iloc

loc selects rows (or columns) with particular labels from the index, whereas iloc selects rows (or columns) at particular positions in the index, so it only takes integers.
        x = pd.DataFrame({"var1" : np.arange(1,20,2)}, index=[9,8,7,6,10, 1, 2, 3, 4, 5])
            var1
        9 1
        8 3
        7 5
        6 7
        10 9
        1 11
        2 13
        3 15
        4 17
        5 19

        iloc Code

        x.iloc[:3]

        Output:
        var1
        9 1
        8 3
        7 5

        loc code

        x.loc[:3]

        Output:
        var1
        9 1
        8 3
        7 5
        6 7
        10 9
        1 11
        2 13
        3 15
        You can also use the following syntax to select specific variables.
        income[["Index","State","Y2008"]]

        Renaming the variables
        We create a dataframe 'data' for information of people and their respective zodiac signs.
        data = pd.DataFrame({"A" : ["John","Mary","Julia","Kenny","Henry"], "B" : ["Libra","Capricorn","Aries","Scorpio","Aquarius"]})
        data 
               A          B
        0 John Libra
        1 Mary Capricorn
        2 Julia Aries
        3 Kenny Scorpio
        4 Henry Aquarius
        If all the columns are to be renamed then we can use data.columns and assign the list of new column names.
        #Renaming all the variables.
        data.columns = ['Names','Zodiac Signs']

           Names Zodiac Signs
        0 John Libra
        1 Mary Capricorn
        2 Julia Aries
        3 Kenny Scorpio
        4 Henry Aquarius
        If only some of the variables are to be renamed then we can use rename( ) function where the new names are passed in the form of a dictionary.
        #Renaming only some of the variables.
        data.rename(columns = {"Names":"Cust_Name"},inplace = True)
          Cust_Name Zodiac Signs
        0 John Libra
        1 Mary Capricorn
        2 Julia Aries
        3 Kenny Scorpio
        4 Henry Aquarius
        By default in pandas inplace = False which means that no changes are made in the original dataset. Thus if we wish to alter the original dataset we need to define inplace = True.

        Suppose we want to replace only a particular character in the list of the column names then we can use str.replace( ) function. For example, renaming the variables which contain "Y" as "Year"
        income.columns = income.columns.str.replace('Y' , 'Year ')
        income.columns
        Index(['Index', 'State', 'Year 2002', 'Year 2003', 'Year 2004', 'Year 2005',
        'Year 2006', 'Year 2007', 'Year 2008', 'Year 2009', 'Year 2010',
        'Year 2011', 'Year 2012', 'Year 2013', 'Year 2014', 'Year 2015'],
        dtype='object')

        Setting one column in the data frame as the index
Using set_index("column name") we can set that column as the index, and the column is removed from the data.
        income.set_index("Index",inplace = True)
        income.head()
        #Note that the indices have changed and Index column is now no more a column
        income.columns
        income.reset_index(inplace = True)
        income.head()
reset_index( ) restores the default integer index.

        Removing the columns and rows
        To drop a column we use drop( ) where the first argument is a list of columns to be removed.

By default axis = 0, which means rows are removed. To remove a column we need to set axis = 1.
        income.drop('Index',axis = 1)

        #Alternatively
        income.drop("Index",axis = "columns")
        income.drop(['Index','State'],axis = 1)
        income.drop(0,axis = 0)
        income.drop(0,axis = "index")
        income.drop([0,1,2,3],axis = 0)
Also inplace = False by default, thus no alterations are made in the original dataset. axis = "columns" and axis = "index" mean that a column or a row (index) should be removed, respectively.

        Sorting the data
        To sort the data sort_values( ) function is deployed. By default inplace = False and ascending = True.
        income.sort_values("State",ascending = False)
        income.sort_values("State",ascending = False,inplace = True)
        income.Y2006.sort_values() 
The Index column has duplicate values, thus we first sort the dataframe by Index and then, within each Index, sort the values by Y2002:
        income.sort_values(["Index","Y2002"]) 

        Create new variables
        Using eval( ) arithmetic operations on various columns can be carried out in a dataset.
        income["difference"] = income.Y2008-income.Y2009

        #Alternatively
        income["difference2"] = income.eval("Y2008 - Y2009")
        income.head()
          Index       State    Y2002    Y2003    Y2004    Y2005    Y2006    Y2007  \
        0 A Alabama 1296530 1317711 1118631 1492583 1107408 1440134
        1 A Alaska 1170302 1960378 1818085 1447852 1861639 1465841
        2 A Arizona 1742027 1968140 1377583 1782199 1102568 1109382
        3 A Arkansas 1485531 1994927 1119299 1947979 1669191 1801213
        4 C California 1685349 1675807 1889570 1480280 1735069 1812546

        Y2008 Y2009 Y2010 Y2011 Y2012 Y2013 Y2014 Y2015 \
        0 1945229.0 1944173 1237582 1440756 1186741 1852841 1558906 1916661
        1 1551826.0 1436541 1629616 1230866 1512804 1985302 1580394 1979143
        2 1752886.0 1554330 1300521 1130709 1907284 1363279 1525866 1647724
        3 1188104.0 1628980 1669295 1928238 1216675 1591896 1360959 1329341
        4 1487315.0 1663809 1624509 1639670 1921845 1156536 1388461 1644607

        difference difference2
        0 1056.0 1056.0
        1 115285.0 115285.0
        2 198556.0 198556.0
        3 -440876.0 -440876.0
        4 -176494.0 -176494.0

        income.ratio = income.Y2008/income.Y2009
The above command does not create a new column (it only sets an attribute on the dataframe object); thus to create new columns we need to use square brackets.
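A minimal sketch of the square-bracket form for the same ratio column:
income["ratio"] = income.Y2008 / income.Y2009
income.head()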
We can also use the assign( ) function, but it does not change the original data as there is no inplace parameter; hence we need to save the result in a new dataset.
        data = income.assign(ratio = (income.Y2008 / income.Y2009))
        data.head()

        Finding Descriptive Statistics
        describe( ) is used to find some statistics like mean,minimum, quartiles etc. for numeric variables.
        income.describe() #for numeric variables
                      Y2002         Y2003         Y2004         Y2005         Y2006  \
        count 5.100000e+01 5.100000e+01 5.100000e+01 5.100000e+01 5.100000e+01
        mean 1.566034e+06 1.509193e+06 1.540555e+06 1.522064e+06 1.530969e+06
        std 2.464425e+05 2.641092e+05 2.813872e+05 2.671748e+05 2.505603e+05
        min 1.111437e+06 1.110625e+06 1.118631e+06 1.122030e+06 1.102568e+06
        25% 1.374180e+06 1.292390e+06 1.268292e+06 1.267340e+06 1.337236e+06
        50% 1.584734e+06 1.485909e+06 1.522230e+06 1.480280e+06 1.531641e+06
        75% 1.776054e+06 1.686698e+06 1.808109e+06 1.778170e+06 1.732259e+06
        max 1.983285e+06 1.994927e+06 1.979395e+06 1.990062e+06 1.985692e+06

        Y2007 Y2008 Y2009 Y2010 Y2011 \
        count 5.100000e+01 5.100000e+01 5.100000e+01 5.100000e+01 5.100000e+01
        mean 1.553219e+06 1.538398e+06 1.658519e+06 1.504108e+06 1.574968e+06
        std 2.539575e+05 2.958132e+05 2.361854e+05 2.400771e+05 2.657216e+05
        min 1.109382e+06 1.112765e+06 1.116168e+06 1.103794e+06 1.116203e+06
        25% 1.322419e+06 1.254244e+06 1.553958e+06 1.328439e+06 1.371730e+06
        50% 1.563062e+06 1.545621e+06 1.658551e+06 1.498662e+06 1.575533e+06
        75% 1.780589e+06 1.779538e+06 1.857746e+06 1.639186e+06 1.807766e+06
        max 1.983568e+06 1.990431e+06 1.993136e+06 1.999102e+06 1.992996e+06

        Y2012 Y2013 Y2014 Y2015
        count 5.100000e+01 5.100000e+01 5.100000e+01 5.100000e+01
        mean 1.591135e+06 1.530078e+06 1.583360e+06 1.588297e+06
        std 2.837675e+05 2.827299e+05 2.601554e+05 2.743807e+05
        min 1.108281e+06 1.100990e+06 1.110394e+06 1.110655e+06
        25% 1.360654e+06 1.285738e+06 1.385703e+06 1.372523e+06
        50% 1.643855e+06 1.531212e+06 1.580394e+06 1.627508e+06
        75% 1.866322e+06 1.725377e+06 1.791594e+06 1.848316e+06
        max 1.988270e+06 1.994022e+06 1.990412e+06 1.996005e+06
For character or string variables, you can write include = ['object']. It will return the total count, the number of unique values, the most frequently occurring string and its frequency.
        income.describe(include = ['object'])  #Only for strings / objects
        To find out specific descriptive statistics of each column of data frame
        income.mean()
        income.median()
        income.agg(["mean","median"])

        Mean, median, maximum and minimum can be obtained for a particular column(s) as:
        income.Y2008.mean()
        income.Y2008.median()
        income.Y2008.min()
        income.loc[:,["Y2002","Y2008"]].max()

        Groupby function
        To group the data by a categorical variable we use groupby( ) function and hence we can do the operations on each category.
        income.groupby("Index").Y2008.min()
        income.groupby("Index")["Y2008","Y2010"].max()
The agg( ) function is used to apply multiple summary functions to a given variable.
        income.groupby("Index").Y2002.agg(["count","min","max","mean"])
        income.groupby("Index")["Y2002","Y2003"].agg(["count","min","max","mean"])
        The following command finds minimum and maximum values for Y2002 and only mean for Y2003
        income.groupby("Index").agg({"Y2002": ["min","max"],"Y2003" : "mean"})
                  Y2002                 Y2003
        min max mean
        Index
        A 1170302 1742027 1810289.000
        C 1343824 1685349 1595708.000
        D 1111437 1330403 1631207.000
        F 1964626 1964626 1468852.000
        G 1929009 1929009 1541565.000
        H 1461570 1461570 1200280.000
        I 1353210 1776918 1536164.500
        K 1509054 1813878 1369773.000
        L 1584734 1584734 1110625.000
        M 1221316 1983285 1535717.625
        N 1395149 1885081 1382499.625
        O 1173918 1802132 1569934.000
        P 1320191 1320191 1446723.000
        R 1501744 1501744 1942942.000
        S 1159037 1631522 1477072.000
        T 1520591 1811867 1398343.000
        U 1771096 1771096 1195861.000
        V 1134317 1146902 1498122.500
        W 1677347 1977749 1521118.500

        Filtering
        To filter only those rows which have Index as "A" we write:
        income[income.Index == "A"]

        #Alternatively
        income.loc[income.Index == "A",:]
          Index     State    Y2002    Y2003    Y2004    Y2005    Y2006    Y2007  \
        0 A Alabama 1296530 1317711 1118631 1492583 1107408 1440134
        1 A Alaska 1170302 1960378 1818085 1447852 1861639 1465841
        2 A Arizona 1742027 1968140 1377583 1782199 1102568 1109382
        3 A Arkansas 1485531 1994927 1119299 1947979 1669191 1801213

        Y2008 Y2009 Y2010 Y2011 Y2012 Y2013 Y2014 Y2015
        0 1945229 1944173 1237582 1440756 1186741 1852841 1558906 1916661
        1 1551826 1436541 1629616 1230866 1512804 1985302 1580394 1979143
        2 1752886 1554330 1300521 1130709 1907284 1363279 1525866 1647724
        3 1188104 1628980 1669295 1928238 1216675 1591896 1360959 1329341
        To select the States having Index as "A":
        income.loc[income.Index == "A","State"]
        income.loc[income.Index == "A",:].State
        To filter the rows with Index as "A" and income for 2002 > 1500000"
        income.loc[(income.Index == "A") & (income.Y2002 > 1500000),:]
        To filter the rows with index either "A" or "W", we can use isin( ) function:
        income.loc[(income.Index == "A") | (income.Index == "W"),:]

        #Alternatively.
        income.loc[income.Index.isin(["A","W"]),:]
           Index          State    Y2002    Y2003    Y2004    Y2005    Y2006    Y2007  \
        0 A Alabama 1296530 1317711 1118631 1492583 1107408 1440134
        1 A Alaska 1170302 1960378 1818085 1447852 1861639 1465841
        2 A Arizona 1742027 1968140 1377583 1782199 1102568 1109382
        3 A Arkansas 1485531 1994927 1119299 1947979 1669191 1801213
        47 W Washington 1977749 1687136 1199490 1163092 1334864 1621989
        48 W West Virginia 1677347 1380662 1176100 1888948 1922085 1740826
        49 W Wisconsin 1788920 1518578 1289663 1436888 1251678 1721874
        50 W Wyoming 1775190 1498098 1198212 1881688 1750527 1523124

        Y2008 Y2009 Y2010 Y2011 Y2012 Y2013 Y2014 Y2015
        0 1945229 1944173 1237582 1440756 1186741 1852841 1558906 1916661
        1 1551826 1436541 1629616 1230866 1512804 1985302 1580394 1979143
        2 1752886 1554330 1300521 1130709 1907284 1363279 1525866 1647724
        3 1188104 1628980 1669295 1928238 1216675 1591896 1360959 1329341
        47 1545621 1555554 1179331 1150089 1775787 1273834 1387428 1377341
        48 1238174 1539322 1539603 1872519 1462137 1683127 1204344 1198791
        49 1980167 1901394 1648755 1940943 1729177 1510119 1701650 1846238
        50 1587602 1504455 1282142 1881814 1673668 1994022 1204029 1853858
        Alternatively we can use query( ) function and write our filtering criteria:
        income.query('Y2002>1700000 & Y2003 > 1500000')

        Dealing with missing values
        We create a new dataframe named 'crops' and to create a NaN value we use np.nan by importing numpy.
        import numpy as np
        mydata = {'Crop': ['Rice', 'Wheat', 'Barley', 'Maize'],
                'Yield': [1010, 1025.2, 1404.2, 1251.7],
                'cost' : [102, np.nan, 20, 68]}
        crops = pd.DataFrame(mydata)
        crops
        isnull( ) returns True and notnull( ) returns False if the value is NaN.
        crops.isnull()  #same as is.na in R
        crops.notnull()  #opposite of previous command.
        crops.isnull().sum()  #No. of missing values.
crops.cost.isnull() first selects the 'cost' column from the dataframe and then returns a logical vector with isnull( )

        crops[crops.cost.isnull()] #shows the rows with NAs.
        crops[crops.cost.isnull()].Crop #shows the rows with NAs in crops.Crop
        crops[crops.cost.notnull()].Crop #shows the rows without NAs in crops.Crop
To drop all the rows which have missing values in any column we use dropna(how = "any"). By default inplace = False. how = "all" means drop a row only if all the elements in that row are missing.

        crops.dropna(how = "any").shape
        crops.dropna(how = "all").shape  
        To remove NaNs if any of 'Yield' or'cost' are missing we use the subset parameter and pass a list:
        crops.dropna(subset = ['Yield',"cost"],how = 'any').shape
        crops.dropna(subset = ['Yield',"cost"],how = 'all').shape
Replacing the missing values in the 'cost' column with "UNKNOWN":
        crops['cost'].fillna(value = "UNKNOWN",inplace = True)
        crops
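If you prefer to keep the column numeric, a common alternative (a minimal sketch, applied to the original crops dataframe before the "UNKNOWN" fill) is to fill with a statistic such as the mean:
crops['cost'].fillna(value = crops['cost'].mean(), inplace = True)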

        Dealing with duplicates
        We create a new dataframe comprising of items and their respective prices.
        data = pd.DataFrame({"Items" : ["TV","Washing Machine","Mobile","TV","TV","Washing Machine"], "Price" : [10000,50000,20000,10000,10000,40000]})
        data
                     Items  Price
        0 TV 10000
        1 Washing Machine 50000
        2 Mobile 20000
        3 TV 10000
        4 TV 10000
        5 Washing Machine 40000
duplicated( ) returns a logical vector which is True whenever a duplicate is encountered.
        data.loc[data.duplicated(),:]
        data.loc[data.duplicated(keep = "first"),:]
By default keep = 'first' i.e. the first occurrence is considered a unique value and its repetitions are considered as duplicates.
If keep = "last" the last occurrence is considered a unique value and all its repetitions are considered as duplicates.
        data.loc[data.duplicated(keep = "last"),:] #last entries are not there,indices have changed.
        If keep = "False" then it considers all the occurences of the repeated observations as duplicates.
        data.loc[data.duplicated(keep = False),:]  #all the duplicates, including unique are shown.
To drop the duplicates, drop_duplicates( ) is used, with inplace = False by default; keep = 'first', 'last' or False have the same meanings as in duplicated( ).
        data.drop_duplicates(keep = "first")
        data.drop_duplicates(keep = "last")
        data.drop_duplicates(keep = False,inplace = True)  #by default inplace = False
        data

        Creating dummies
        Now we will consider the iris dataset
        iris = pd.read_csv("C:\\Users\\Hp\\Desktop\\work\\Python\\Basics\\pandas\\iris.csv")
        iris.head()
           Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
        0 5.1 3.5 1.4 0.2 setosa
        1 4.9 3.0 1.4 0.2 setosa
        2 4.7 3.2 1.3 0.2 setosa
        3 4.6 3.1 1.5 0.2 setosa
        4 5.0 3.6 1.4 0.2 setosa
The map( ) function is used to match the values and replace them, automatically creating a new series.
        iris["setosa"] = iris.Species.map({"setosa" : 1,"versicolor":0, "virginica" : 0})
        iris.head()
To create dummies get_dummies( ) is used. The prefix = "Species" argument adds the prefix 'Species' to the names of the new columns created.
        pd.get_dummies(iris.Species,prefix = "Species")
        pd.get_dummies(iris.Species,prefix = "Species").iloc[:,0:1]  #1 is not included
        species_dummies = pd.get_dummies(iris.Species,prefix = "Species").iloc[:,0:]
        With concat( ) function we can join multiple series or dataframes. axis = 1 denotes that they should be joined columnwise.
        iris = pd.concat([iris,species_dummies],axis = 1)
        iris.head()
           Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species  \
        0 5.1 3.5 1.4 0.2 setosa
        1 4.9 3.0 1.4 0.2 setosa
        2 4.7 3.2 1.3 0.2 setosa
        3 4.6 3.1 1.5 0.2 setosa
        4 5.0 3.6 1.4 0.2 setosa

        Species_setosa Species_versicolor Species_virginica
        0 1 0 0
        1 1 0 0
        2 1 0 0
        3 1 0 0
        4 1 0 0
It is usual that for a variable with 'n' categories we create 'n-1' dummies; thus, to drop the first dummy column we write drop_first = True.
        pd.get_dummies(iris,columns = ["Species"],drop_first = True).head()

        Ranking
         To create a dataframe of all the ranks we use rank( )
        iris.rank() 
        Ranking by a specific variable
        Suppose we want to rank the Sepal.Length for different species in ascending order:
        iris['Rank'] = iris.sort_values(['Sepal.Length'], ascending=[True]).groupby(['Species']).cumcount() + 1
        iris.head( )

        #Alternatively
        iris['Rank2'] = iris['Sepal.Length'].groupby(iris["Species"]).rank(ascending=1)
        iris.head()

        Calculating the Cumulative sum
        Using cumsum( ) function we can obtain the cumulative sum
        iris['cum_sum'] = iris["Sepal.Length"].cumsum()
        iris.head()
        Cumulative sum by a variable
        To find the cumulative sum of sepal lengths for different species we use groupby( ) and then use cumsum( )
        iris["cumsum2"] = iris.groupby(["Species"])["Sepal.Length"].cumsum()
        iris.head()

        Calculating the percentiles.
        Various quantiles can be obtained by using quantile( )
        iris.quantile(0.5)
        iris.quantile([0.1,0.2,0.5])
        iris.quantile(0.55)

        if else in Python
        We create a new dataframe of students' name and their respective zodiac signs.
        students = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                                 'Zodiac Signs': ['Aquarius','Libra','Gemini','Pisces','Virgo']})
def name(row):
    if row["Names"] in ["John","Henry"]:
        return "yes"
    else:
        return "no"

        students['flag'] = students.apply(name, axis=1)
        students
        Functions in python are defined using the block keyword def , followed with the function's name as the block's name. apply( ) function applies function along rows or columns of dataframe.

Note : If using a simple 'if else' we need to take care of the indentation. Python does not use curly braces for loops and if-else blocks.

        Output
              Names Zodiac Signs flag
        0 John Aquarius yes
        1 Mary Libra no
        2 Henry Gemini yes
        3 Augustus Pisces no
        4 Kenny Virgo no

Alternatively, by importing numpy we can use np.where. The first argument is the condition to be evaluated, the 2nd argument is the value if the condition is True and the last argument defines the value if the condition is False.
        import numpy as np
        students['flag'] = np.where(students['Names'].isin(['John','Henry']), 'yes', 'no')
        students

        Multiple Conditions : If Else-if Else
def mname(row):
    if row["Names"] == "John" and row["Zodiac Signs"] == "Aquarius" :
        return "yellow"
    elif row["Names"] == "Mary" and row["Zodiac Signs"] == "Libra" :
        return "blue"
    elif row["Zodiac Signs"] == "Pisces" :
        return "blue"
    else:
        return "black"

        students['color'] = students.apply(mname, axis=1)
        students

We create a list of conditions and their respective values if evaluated True, and use np.select, where default is the value used if all the conditions are False.
        conditions = [
            (students['Names'] == 'John') & (students['Zodiac Signs'] == 'Aquarius'),
            (students['Names'] == 'Mary') & (students['Zodiac Signs'] == 'Libra'),
            (students['Zodiac Signs'] == 'Pisces')]
        choices = ['yellow', 'blue', 'purple']
        students['color'] = np.select(conditions, choices, default='black')
        students
              Names Zodiac Signs flag   color
        0 John Aquarius yes yellow
        1 Mary Libra no blue
        2 Henry Gemini yes black
        3 Augustus Pisces no purple
        4 Kenny Virgo no black

        Select numeric or categorical columns only
        To include numeric columns we use select_dtypes( ) 
        data1 = iris.select_dtypes(include=[np.number])
        data1.head()
The _get_numeric_data( ) method also provides a utility to select only the numeric columns.
        data3 = iris._get_numeric_data()
        data3.head(3)
           Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  cum_sum  cumsum2
        0 5.1 3.5 1.4 0.2 5.1 5.1
        1 4.9 3.0 1.4 0.2 10.0 10.0
        2 4.7 3.2 1.3 0.2 14.7 14.7
        For selecting categorical variables
        data4 = iris.select_dtypes(include = ['object'])
        data4.head(2)
         Species
        0 setosa
        1 setosa

        Concatenating
        We create 2 dataframes containing the details of the students:
        students = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                                 'Zodiac Signs': ['Aquarius','Libra','Gemini','Pisces','Virgo']})
        students2 = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                                  'Marks' : [50,81,98,25,35]})
         using pd.concat( ) function we can join the 2 dataframes:
        data = pd.concat([students,students2])  #by default axis = 0
           Marks     Names Zodiac Signs
        0 NaN John Aquarius
        1 NaN Mary Libra
        2 NaN Henry Gemini
        3 NaN Augustus Pisces
        4 NaN Kenny Virgo
        0 50.0 John NaN
        1 81.0 Mary NaN
        2 98.0 Henry NaN
        3 25.0 Augustus NaN
        4 35.0 Kenny NaN
By default axis = 0, thus the new dataframe will be added row-wise. If a column is not present in one of the dataframes, NaNs are created. To join column-wise we set axis = 1:
        data = pd.concat([students,students2],axis = 1)
        data
              Names Zodiac Signs  Marks     Names
        0 John Aquarius 50 John
        1 Mary Libra 81 Mary
        2 Henry Gemini 98 Henry
        3 Augustus Pisces 25 Augustus
        4 Kenny Virgo 35 Kenny
        Using append function we can join the dataframes row-wise
        students.append(students2)  #for rows
        Alternatively we can create a dictionary of the two data frames and can use pd.concat to join the dataframes row wise
        classes = {'x': students, 'y': students2}
         result = pd.concat(classes)
        result 
             Marks     Names Zodiac Signs
        x 0 NaN John Aquarius
        1 NaN Mary Libra
        2 NaN Henry Gemini
        3 NaN Augustus Pisces
        4 NaN Kenny Virgo
        y 0 50.0 John NaN
        1 81.0 Mary NaN
        2 98.0 Henry NaN
        3 25.0 Augustus NaN
        4 35.0 Kenny NaN

        Merging or joining on the basis of common variable.
        We take 2 dataframes with different number of observations:
        students = pd.DataFrame({'Names': ['John','Mary','Henry','Maria'],
                                 'Zodiac Signs': ['Aquarius','Libra','Gemini','Capricorn']})
        students2 = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                                  'Marks' : [50,81,98,25,35]})
Using pd.merge we can join the two dataframes. on = 'Names' denotes that the common variable on the basis of which the dataframes are to be combined is 'Names'.
        result = pd.merge(students, students2, on='Names')  #it only takes intersections
        result
           Names Zodiac Signs  Marks
        0 John Aquarius 50
        1 Mary Libra 81
        2 Henry Gemini 98
         By default how = "inner" thus it takes only the common elements in both the dataframes. If you want all the elements in both the dataframes set how = "outer"
result = pd.merge(students, students2, on='Names',how = "outer")  #it takes the union of both
        result
              Names Zodiac Signs  Marks
        0 John Aquarius 50.0
        1 Mary Libra 81.0
        2 Henry Gemini 98.0
        3 Maria Capricorn NaN
        4 Augustus NaN 25.0
        5 Kenny NaN 35.0
To keep all the rows from the left dataframe (and only the matching rows from the right) set how = 'left':
        result = pd.merge(students, students2, on='Names',how = "left")
        result
           Names Zodiac Signs  Marks
        0 John Aquarius 50.0
        1 Mary Libra 81.0
        2 Henry Gemini 98.0
        3 Maria Capricorn NaN
Similarly, how = 'right' keeps all the rows from the right dataframe (and only the matching rows from the left).
        result = pd.merge(students, students2, on='Names',how = "right",indicator = True)
        result
              Names Zodiac Signs  Marks      _merge
        0 John Aquarius 50 both
        1 Mary Libra 81 both
        2 Henry Gemini 98 both
        3 Augustus NaN 25 right_only
        4 Kenny NaN 35 right_only
indicator = True creates a column indicating whether each row is present in both the dataframes or only in the left or right dataframe.

        ListenData: Loops in Python explained with examples

This tutorial covers various ways to execute loops in Python. Loops are an important concept in any programming language; they perform iterations, i.e. run specific code repeatedly until a certain condition is reached.

        1. For Loop

        Like the R and C programming languages, Python supports the for loop. It is one of the most commonly used loop constructs for automating repetitive tasks.

        How does a for loop work?

        Suppose you are asked to print the sequence of numbers from 1 to 9, incrementing by 2.
        for i in range(1,10,2):
            print(i)
        Output
        1
        3
        5
        7
        9
        range(1,10,2) starts at 1 and ends at 9 (10 is excluded), incrementing by 2.

        Iteration over a list
        This section covers how to run a for loop over a list.
        mylist = [30,21,33,42,53,64,71,86,97,10]
        for i in mylist:
            print(i)
        Output
        30
        21
        33
        42
        53
        64
        71
        86
        97
        10

        Suppose you need to select every 3rd value of the list.
        for i in mylist[::3]:
            print(i)
        Output
        30
        42
        71
        10
        mylist[::3] is equivalent to mylist[0::3], which follows the syntax list[start:stop:step].
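        The start, stop and step values can be varied independently. A small illustrative sketch (the index values used here are just examples):
        print(mylist[1::3])   # [21, 53, 86] - start at index 1, take every 3rd value
        print(mylist[::-1])   # a reversed copy of the list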

        Python Loop Explained with Examples

        Example 1 : Create a new list with only the items from the list that are between 0 and 10
        l1 = [100, 1, 10, 2, 3, 5, 8, 13, 21, 34, 55, 98]

        new = []  # Blank list
        for i in l1:
            if i > 0 and i <= 10:
                new.append(i)

        new
        Output: [1, 10, 2, 3, 5, 8]
        It can also be done with the numpy package by converting the list to a numpy array. See the code below.
        import numpy as np
        k = np.array(l1)
        new = k[np.where((k > 0) & (k <= 10))]
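        The same filtering can also be written as a list comprehension, which is the usual idiom for this loop-and-append pattern; a sketch equivalent to the for loop above:
        new = [i for i in l1 if i > 0 and i <= 10]
        print(new)   # [1, 10, 2, 3, 5, 8]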

        Example 2 : Check which letters (a-z) appear in a string

        Suppose you have a string named k and you want to check which letters of the alphabet exist in the string k.
        k = "deepanshu"

        import string
        for n in string.ascii_lowercase:
            if n in k:
                print(n + ' exists in ' + k)
            else:
                print(n + ' does not exist in ' + k)
        string.ascii_lowercase returns 'abcdefghijklmnopqrstuvwxyz'.
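        If you only need the letters that do occur in the string (without the 'does not exist' messages), a set intersection is a compact alternative; a quick sketch:
        import string
        present = sorted(set(k) & set(string.ascii_lowercase))
        print(present)   # ['a', 'd', 'e', 'h', 'n', 'p', 's', 'u'] for k = "deepanshu"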

        Practical Examples : for in loop in Python

        Create a sample pandas data frame for illustrative purposes.
        import pandas as pd
        import numpy as np
        np.random.seed(234)
        df = pd.DataFrame({"x1" : np.random.randint(low=1, high=100, size=10),
                           "Month1" : np.random.normal(size=10),
                           "Month2" : np.random.normal(size=10),
                           "Month3" : np.random.normal(size=10),
                           "price" : range(10)
                           })

        df
        1. Multiply each month column by 1.2
        for i in range(1,4):
            print(df["Month"+str(i)]*1.2)
        range(1,4) returns 1, 2 and 3. The str( ) function is used to convert to string. "Month" + str(1) means Month1.
        2. Store computed columns in a new data frame
        import pandas as pd
        newDF = pd.DataFrame()
        for i in range(1,4):
            data = pd.DataFrame(df["Month"+str(i)]*1.2)
            newDF = pd.concat([newDF,data], axis=1)
        pd.DataFrame( ) is used to create a blank data frame. The concat() function from the pandas package is used to concatenate two data frames.
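        For this particular task the loop is not strictly required; pandas can multiply several columns at once. A sketch of a vectorized alternative, assuming the column names Month1, Month2 and Month3 as above:
        newDF = df[["Month1", "Month2", "Month3"]] * 1.2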

        3. If the value of x1 >= 50, multiply each month column by price. Otherwise keep the month value as it is.
        import pandas as pd
        import numpy as np
        for i in range(1,4):
            df['newcol'+str(i)] = np.where(df['x1'] >= 50,
                                           df['Month'+str(i)] * df['price'],
                                           df['Month'+str(i)])
        In this example, we are adding new columns named newcol1, newcol2 and newcol3. np.where(condition, value_if_condition_met, value_if_condition_not_met) is used to construct an IF-ELSE statement.

        4. Filter data frame by each unique value of a column and store it in a separate data frame
        mydata = pd.DataFrame({"X1" : ["A","A","B","B","C"]})

        for name in mydata.X1.unique():
            temp = pd.DataFrame(mydata[mydata.X1 == name])
            exec('{} = temp'.format(name))
        The unique( ) function is used to calculate the distinct values of a variable. The exec( ) function is used for dynamic execution of a Python program. See the usage of the string format( ) function below -
        s= "Your Input"
        "i am {}".format(s)

        Output: 'i am Your Input'
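        Using exec( ) to create variables dynamically works, but it is usually cleaner to keep the filtered dataframes in a dictionary keyed by the group value; a sketch of that alternative:
        frames = {name: mydata[mydata.X1 == name] for name in mydata.X1.unique()}
        frames["A"]   # the rows where X1 == "A"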

        Loop Control Statements

        Loop control statements change the execution of a loop from its normal sequence of iterations.

        Python supports the following control statements.
        1. Continue statement
        2. Break statement

        Continue Statement
        When the continue statement is executed, it skips the rest of the code inside the loop for the current iteration and continues with the next iteration.
        In the code below, we are preventing the letters a and d from being printed.
        for n in "abcdef":
        if n =="a" or n =="d":
        continue
        print("letter :", n)
        letter : b
        letter : c
        letter : e
        letter : f
        Break Statement
        When the break statement runs, it stops the loop immediately.
        In this program, the loop stops executing as soon as n is c.
        for n in "abcdef":
        if n =="c" or n =="d":
        break
        print("letter :", n)
        letter : a
        letter : b

        for loop with else clause

        Using an else clause with a for loop is not common in the Python developer community.
        The else clause executes after the loop completes normally, i.e. when the loop did not encounter a break statement.
        The program below finds factors for the numbers from 2 to 9. The else clause prints the numbers which have no factors and are therefore prime numbers:

        for k in range(2, 10):
            for y in range(2, k):
                if k % y == 0:
                    print(k, '=', y, '*', round(k/y))
                    break
            else:
                print(k, 'is a prime number')
        2 is a prime number
        3 is a prime number
        4 = 2 * 2
        5 is a prime number
        6 = 2 * 3
        7 is a prime number
        8 = 2 * 4
        9 = 3 * 3

        While Loop

        A while loop is used to execute code repeatedly as long as a condition holds. When the condition becomes false, the line immediately after the loop is executed.
        i = 1
        while i < 10:
            print(i)
            i += 2  # means i = i + 2
            print("new i :", i)
        Output:
        1
        new i : 3
        3
        new i : 5
        5
        new i : 7
        7
        new i : 9
        9
        new i : 11

        While Loop with If-Else Statement

        An if-else statement can be used inside a while loop. See the program below -

        counter = 1
        while counter <= 5:
            if counter < 2:
                print("Less than 2")
            elif counter > 4:
                print("Greater than 4")
            else:
                print(">= 2 and <=4")
            counter += 1
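        Tracing the loop, counter runs from 1 to 5: the value 1 hits the first branch, 2 to 4 hit the else branch, and 5 hits the elif branch, so the program prints:
        Less than 2
        >= 2 and <=4
        >= 2 and <=4
        >= 2 and <=4
        Greater than 4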

        Talk Python to Me: #208 Packaging, Making the most of PyCon, and more

        Are you going to PyCon (or a similar conference)? Join me and Kenneth Reitz as we discuss how to make the most of PyCon and what makes it special for each of us.

        The Code Bits: Getting started with Raspberry Pi and Python


        Hello there! So you just got a shiny new Raspberry Pi. Well done and congrats! In this tutorial, we are going to look at how to set up your Raspberry Pi and get it up and running.

        What you need to get started

        First and foremost we have to ensure that we have all the required items to get started. Here is a short list of items that are absolutely required to get the ball rolling.

        things_needed

        Optional

        You can also get these as a bundle (buy from Amazon), or if you have some of the components lying around, you could selectively get the rest as needed.

        Getting your OS ready

        Raspberry Pi is a miniature computer and, just like your laptop or desktop PC, it needs to run an Operating System. Since Raspberry Pi runs an ARM-based processor (if you haven’t heard of ARM processors, they are low-power processors commonly found in mobile phones and tablets), we need to use an OS that supports it. Luckily, both the Raspberry Pi Foundation and many Linux community members have created operating systems that you can choose from to run on your board! In this tutorial, we are going to use the NOOBS distribution, which you can find here.

        Once you go to the above link, choose NOOBS; you will be given two download options, so select the full NOOBS download. This is a large file and it might take some time. Be patient!

        In the meanwhile, we need another piece of software that helps us format and transfer our OS to our microSD card. You can find the software here.

        Once the SDFormatter tool is installed, and the OS image is downloaded, we are ready to go!

        The first step is to unzip the NOOBS file that you have downloaded. You will get a NOOBS_v3_0_0 folder or something similar depending on the OS version. Open up the SDFormatter tool and insert your microSD card into your PC. You can either use a USB microSD adapter like this or use the one that’s in your laptop / PC.

        On the SDFormatter tool, select the drive that corresponds to your SD card. Make sure this is correct!

        SDFormatter

        IMPORTANT: Please back up any data before formatting!

        Once the formatting is complete, you can copy the contents of the NOOBS folder (which should look something similar to what is shown below) onto your formatted SD card.

        Noobs_contents

        Connecting all pieces and booting up!

        All right, we’re almost there! Now we need to connect our Raspberry Pi to our peripherals and boot it up for the first time.

        Here, I have used a USB keyboard, a Wireless Logitech mouse that has a wireless USB adapter. I have plugged in a USB Wi-Fi adapter to connect to my network and an HDMI cable to connect to my monitor. The overall setup after setting up everything including power looks something like this.

        fully_connected

        Setting up the OS

        Once we have the setup ready, let’s connect the board to the peripherals and power and boot it up. Initially, you will see a dialog box to select the operating system.

        Select Raspbian Full and click the install button. The installation dialog will follow afterward. It will copy the files and install the OS.

        Once the OS is installed, the system will reboot and you will be welcomed with the new desktop! Next, you will be prompted to set up your account password, keyboard profile, and timezone as well as connect to a Wi-Fi endpoint. That is pretty much it!

        Hello world in Python

        Fire up the terminal by going to the menu button on the top left and selecting the terminal. Open up a python shell by entering

        python
        

        This opens up a Python shell. Say hello world from your shiny new Raspberry Pi! (The parenthesized form below works in both Python 2 and Python 3.)

        print("Hello world!")

        Wrap up

        So that’s it! We’ve made it to the end and we have a Raspberry Pi up and running! Woohoo! Now is the fun part, which is all the fun things we could do with it. I’ll be doing some interesting projects that you could follow along on this blog going forward. Make sure to subscribe to thecodebits.com to receive updates. See you all soon!

        The Code Bits: Flask Project for Beginners: Inspirational Quotes


        In this project, we will create a web application that displays a random inspirational quote.

        The goal of this project is to learn about application factory and how to create templates in Flask.

        This is the second part of the “Getting started with Flask series”. In the first part, we learned how to create a basic Hello World application using Flask and run it locally.

        Installation and Setup

        First, let us install the dependencies and set up our project directory. Basically, what we need to do is:
        • Install Python. We will be using Python3 here.
        • Create a directory for our project, say “1_inspirational_quotes”, and go inside the directory.
          mkdir 1_inspirational_quotes
          cd 1_inspirational_quotes
        • We will create a virtual environment for our project where we will install Flask and any other dependencies. So go ahead and create a virtual environment and activate it.
          python3 -m venv venv
          . venv/bin/activate
        • Finally install Flask.
          pip3 install flask
        If you need more instructions on installation, refer to Flask installation guide or Getting started with Flask.

        Set up the Application Factory

        Now that we have set up our project directory, the next thing that we need to do is to create the Flask application, which is nothing but an instance of the Flask class.

        We could create the Flask instance globally, the way we did in Getting Started with Flask: Hello World. However, in this example, we will create it within a function.

        Application Factory is the term used to refer to the method inside which we will create our application (Flask instance). All sorts of configuration and setup required for the application will also be done within the application factory. Finally it will return the Flask application instance.

        So let us create a directory ‘quotes’ and add an __init__.py file to it. This will cause the directory to be treated as a Python package.

        mkdir quotes
        cd quotes
        touch __init__.py
        

        Then let us define our application factory in this file.

        –> 1_inspirational_quotes/quotes/__init__.py

        from flask import Flask
        
        def create_app():
            """
            create_app is the application factory.
            """
            # Create the app.
            app = Flask(__name__)
        
            # Add a route.
            @app.route('/')
            def home():
                return 'Hello there! :)'
        
            # Return the app.
            return app
        

        Notes:

        • The method create_app is the application factory.
        • Within the application factory, we created a Flask instance, app, which is nothing but our application. Note that __name__ refers to the package name here, i.e., quotes. This is what will be used as our application name.
        • Then we created a placeholder method, home, which will serve the content for our app page. For now, it just returns some string which will get displayed on our browser when we run the application.
        • The decorator, @app.route, links the URL (/) to the method, home.

        Run the basic application

        In order to make sure that everything is set up correctly, let us run the application and see if it is working.

        First, let us set the FLASK_APP environment variable to be our application package name. This basically tells Flask which application to run.

        export FLASK_APP=quotes

        We will also set the environment variable FLASK_ENV to development so that:

        1. debug mode is turned on and the debugger is activated.
        2. the server will be restarted whenever we make a code change. We can make modifications to our code and simply refresh the browser to see the changes in effect.
        export FLASK_ENV=development

        Note: If you are on Windows, use set  instead of export.
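        For example, in a Windows command prompt the two commands above would look something like this (PowerShell uses a different $env: syntax):

        set FLASK_APP=quotes
        set FLASK_ENV=development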

        Now we are ready to run the application. So go ahead and run it using the flask command. You should see an output similar to the following.

        flask run
         * Serving Flask app "quotes" (lazy loading)
         * Environment: development
         * Debug mode: on
         * Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)
         * Restarting with stat
         * Debugger is active!
         * Debugger PIN: 150-101-403
        

        Note: Make sure that you are running the command from the ‘1_inspirational_quotes’ directory and not ‘quotes’. Otherwise, you will see the error “flask.cli.NoAppException: Could not import “quotes.quotes”.”

        To see the app in action, go to http://127.0.0.1:5000/ on your browser. You should see our message displayed in it as shown in the following image.

        Awesome! Now let us start building our quotes app.

        Add a template

        Currently, our app just displays the string, “Hello there! :)” to the user. In this section, we will learn how to create a template that shows a random inspirational quote.

        Return HTML content from the application factory

        The simplest way to achieve this is to return the HTML code as a string instead of our hello world string as shown below:

        —> 1_inspirational_quotes/quotes/__init__.py

        from flask import Flask
        
        def create_app():
            """
            create_app is the application factory.
            """
            # Create the app.
            app = Flask(__name__)
        
            # Add a route.
            @app.route('/')
            def home():
                return '''
        <html>
        <body>
          I find that the harder I work, the more luck I seem to have. – Thomas Jefferson
        </body>
        </html>
        '''
        
            # Return the app.
            return app
        

        Now if you go to http://127.0.0.1:5000/, you should see the quote displayed on the screen:

        Even though this works perfectly fine, this is not the best approach to serve HTML content for our application. First of all, the code does not look clean. Second, as our application grows, modifying and maintaining the template within the application factory will be tedious. So we need to isolate our template from the application factory.

        Create a static HTML template file

        A template is a file that contains static data as well as placeholders for dynamic data. In this section, we will just be creating static HTML template that displays a quote to our user. In a later section, we will see how to make it dynamic.

        Within the quotes directory, let us add a directory to keep our templates and move our quotes template to a separate HTML file.

        mkdir templates
        touch templates/quotes.html
        

        Note that our template is stored within a directory named templates under the application directory, quotes. This is where Flask expects its templates by default.

        –>1_inspirational_quotes/quotes/templates/quotes.html

        <!doctype html>
        <html>
        <body>
          I find that the harder I work, the more luck I seem to have. – Thomas Jefferson
        </body>
        </html>
        

        Register the template with the application factory

        Now we need to modify our application factory such that this HTML file is served when users visit our web page.

        —> 1_inspirational_quotes/quotes/__init__.py

        from flask import Flask, render_template
        
        def create_app():
            """
            create_app is the application factory.
            """
            # Create the app.
            app = Flask(__name__)
        
            # Add a route.
            @app.route('/')
            def home():
                return render_template('quotes.html')
        
            # Return the app.
            return app
        

        Note how we introduced the method, render_template(). In this case, it takes our HTML file name and returns its contents. Later on, when we learn about serving dynamic content, we will learn more about rendering and how Flask uses Jinja for template rendering.

        Now if we go to http://127.0.0.1:5000/, we should see the quote displayed on the screen just as we saw earlier.

        Update the template to render quotes dynamically using Jinja

        Now that we have learned how to create a template and register it with the application factory, let us see how we can serve content dynamically.

        Right now our app just displays the same quote every time someone visits. Our goal is to dynamically update the quote by selecting one randomly from a set of quotes.

        First, let us go ahead and create a list of quotes. To keep things simple, we will be adding it in memory within the application factory. In a later post, we will explore how to use databases with Flask.

        —> 1_inspirational_quotes/quotes/__init__.py

        from flask import Flask, render_template
        import random
        
        def create_app():
            """
            create_app is the application factory.
            """
            # Create the app.
            app = Flask(__name__)
        
            # Add a route.
            @app.route('/')
            def home():
                sample_quotes = [
                    "I find that the harder I work, the more luck I seem to have. – Thomas Jefferson",
                    "Success is the sum of small efforts, repeated day in and day out. – Robert Collier",
                    "There are no shortcuts to any place worth going. – Beverly Sills",
                    "The only place where success comes before work is in the dictionary. – Vidal Sassoon",
                    "You don’t drown by falling in the water; you drown by staying there. – Ed Cole"
                ]
        
                # Select a random quote.
                selected_quote = random.choice(sample_quotes)
        
                # Pass the selected quote to the template.
                return render_template('quotes.html', quote=selected_quote)
        
            # Return the app.
            return app
        

        As you can see, now we are passing an additional parameter, quote, to the render_template function. Flask uses Jinja to render dynamic content in the template. With this change, the variable, quote, becomes available in the template, quotes.html. Now let us see how we can update the template file to make use of this variable.

        –>1_inspirational_quotes/quotes/templates/quotes.html

        <!doctype html>
        <html>
        <body>
          {{ quote }}
        </body>
        </html>
        

        Here, {{..}} is the delimiter used by Jinja to denote expressions which will be evaluated and rendered in the final HTML document.

        Now if we go to http://127.0.0.1:5000/ and keep refreshing the page, we should see a different random quote selected from the list every time. A demo is shown below:

        Add a stylesheet

        As of now, our app works, but it looks very plain. So now we will see how to add a simple stylesheet to it.

        In Flask, just like templates are expected to be in the templates directory by default, static files like CSS stylesheets are expected to be in the static directory within the application folder.

        So go ahead and create the directory and add a CSS file style.css to it.

        mkdir static
        touch static/style.css
        

        —>1_inspirational_quotes/quotes/static/style.css

        body {
          background-color: black;
          background-image: url("background.jpg");
          background-size:cover;
        }
        .quote_div {
          text-align: center;
          color: white;
          font-size: 30px;
          padding: 25px 5px;
          margin: 15% auto auto;
        }
        

        You can also add a background image and keep it under the static directory as shown above.

        Now let us modify the template to use the stylesheet.

        –>1_inspirational_quotes/quotes/templates/quotes.html

        <!doctype html>
        <html>
        <head>
          <link rel="stylesheet" href="{{ url_for('static', filename='style.css') }}">
        </head>
        <body>
          <div class="quote_div">
            {{ quote }}
          </div>
        </body>
        </html>
        

        Now if we go to http://127.0.0.1:5000/, we will see a nicer app! A demo:

        Conclusion

        In case you want to browse through the code or download and try it out, the Github link for this project is here.

        In this post, we learned how to create a basic Flask application that serves dynamic data using a Jinja template.

        For more advanced lessons with projects, stay tuned and subscribe to our blog!

        Mike Driscoll: PyDev of the Week: Dane Hillard


        This week we welcome Dane Hillard (@easyaspython) as our PyDev of the Week! Dane is the author of Practices of the Python Pro, an upcoming book from Manning. He is also a blogger and web developer. Let’s take some time to get to know Dane!

        Can you tell us a little about yourself (hobbies, education, etc):

        I’m a creative type, so many of my interests are in art and music. I’ve been a competitive ballroom dancer, and I’m a published musician and photographer. I’m proud of those accomplishments, but I’m driven to do most of this stuff for personal fulfillment more than anything! I enjoy sharing and discussing what I learn with others, too. When I have some time my next project is to start exploring foodways, which is this idea of exploring food and its cultural impact through written history. I’ve loved cooking (and food in general) for a long time and I want to get to know its origins better, which I think is something this generation is demanding more from industries as a whole. Should be fun!

        Why did you start using Python?

        I like using my computer engineering skills to build stuff not just for work, but for myself. I had written a website for my photography business in PHP way back in the day, but I wasn’t using a framework of any kind and the application code was mixed with the front-end code in a way that was hard to manage. I decided to try out a framework, and after using (and disliking) Java Spring for a while I gave Django a try. The rest is history! I started using Python for a few work-related things at the time and saw that it adapted well to many different types of tasks, so I kept rolling with it.

        What other programming languages do you know and which is your favorite?

        I use JavaScript fairly regularly, though it wasn’t until jQuery gave way to reactive paradigms that I really started enjoying it. We’re using React and Vue frequently now and I like it quite a bit for client-side development. I’ve also used Ruby in the past, which I find to be quite Python-like in certain ways. I think I still like Python best, but it’s easy to stick with what you know, right? I wouldn’t mind learning some Rust or Go soon! My original background is mainly in C and C++ but I can barely manage the memory in my own head so I don’t like telling a computer how to manage its memory when I can avoid it, but all these languages have their place.

        What projects are you working on now?

        At ITHAKA we’ve been managing an open source Python REST client, apiron, for a while now. We just released a feature where I got to explore some metaprogramming, which was stellar. It ended up reducing boilerplate people have to write, which is also stellar. I also built a new website as a bit of a portfolio and to centralize some of my online presence. It’s written in Vue, but was my first chance to explore vue-router and couple of other libraries, along with a headless CMS for blogging.

        The biggest amount of my free time definitely goes to thinking about and writing the book I’m working on, which introduces people new to software development to some concepts important in collaborative software, in the context of Python. I’m hoping it will help people just graduating, switching disciplines, or who want to augment their work with software! The book is in early access and I’m chugging away on new chapters as we speak.

        Which Python libraries are your favorite (core or 3rd party)?

        The requests library is one of the more ubiquitous libraries, and it’s what we built apiron on top of. I’ve started using pytest a bit in place of Python’s built-in unittest, and I like the ways it simplifies existing tests while also providing tooling for doing more complex things with fixtures. There’s a great package, zappa, for deploying Django apps (or anything WSGI-based, I believe) to AWS Lambda. Look into that if you’re spending too much on an EC2 instance! For image manipulation, Pillow is great. One that I’d like to try out more soon is surprise, which helps you make recommendation systems akin to what Netflix or Hulu uses to recommend movies. Too many others to name here!

        How did you come to author a book?

        I don’t know how it works for most authors, but in my case the publisher, Manning, reached out to me—probably after seeing the blog posts I’ve written online. Presented with the opportunity, it was difficult to figure out if I really felt ready or qualified to do a book, which I still ask myself often if I’m being honest. I try to frame it to myself as an opportunity to help others, so even if I don’t produce something perfect I hope that I’ll still be able to say I did that much!

        What challenges did you have writing the book and how did you overcome them?

        Finding time and balancing it with other priorities is the primary struggle for me, as I imagine it is for many authors. The uncertainty I mentioned earlier is another one. Something that surprised me was how easy it is to use overloaded terms in the context of programming; many concepts have similar names and many English words can be ambiguous for untrained readers! My editor fortunately keeps these at bay, but I slip up often! Teaching is hard. The best way I’ve found to mitigate issues like this is to automate where I can.

        Is there anything else you’d like to say?

        If you’re out there thinking about getting into programming or writing a book or anything really, and you’re fortunate to have the means to do so, get to it! I’ve found that I don’t know how I feel about something until I really examine it, flip a few switches, find out how it works under the hood. Sometimes you’ll find you don’t like something as much as you thought, but maybe it uncovers tangentially-related things you want to explore. The most important part is getting started!

        Thanks for doing the interview, Dane!

        Ram Rachum: PySnooper: Never use print for debugging again

        PyCharm: Interview: Dan Tofan for this week’s data science webinar


        In the past few years, Python has made a big push into data science and PyCharm has as well. Years ago we added Jupyter Notebook integration, then 2017.3 introduced Scientific Mode for workflows that felt more like an IDE. In 2019.1 we re-invented our Jupyter support to also be more like a professional tool.

        PyCharm and data science are thus a hot topic. Dan Tofan very recently published a Pluralsight course on using PyCharm for data science and we invited him for a webinar next week.

        To help set the stage, below is an interview with Dan.

        • Thursday, April 25
        • 7PM GMT+3, 9AM Pacific
        • Register here
        • Aimed at new and intermediate data scientists

        webinar-05-2

        Let’s start with the key point: what does PyCharm bring to data scientists?

        PyCharm brings a productivity boost to data scientists, by helping them explore data, debug Python code, write better Python code, and understand Python code faster. As a PyCharm user, I experienced and benefited from these productivity boosters, which I distilled into my first Pluralsight course, so that data scientists can make the most out of PyCharm in their activities.

        For the webinar: who is it for and what can people expect you to cover?

        If you are a data scientist who dabbled with PyCharm, then this webinar is for you. I will cover PyCharm’s most relevant features to data science: the scientific mode and the completely rewritten Jupyter support. I will show how these features interplay with other PyCharm features, such as refactoring code from Jupyter cells. I will use easy-to-understand code examples with popular data science libraries.

        Now, back to the start: tell us a little about yourself.

        Currently, I am a senior backend developer for Dimensions, a research data platform that uses data science and links data on a total of over 140 million publications, grants, patents and clinical trials. I’ve always been curious, which led me to do my PhD studies at the University of Groningen (Netherlands) and learn more about statistics and data analysis.

        Do Python data scientists feel like programmers first and data scientists second, or the reverse?

        In my opinion, data science is a melting pot of skills from three complementing backgrounds: programmers, statisticians and business analysts. At the start of your data science journey, you are going to rely on the skills from your main background, and – as your skills expand – you are going to feel more and more like a data scientist.

        Your course has a bunch of sections on software development practices and IDE tips. How important are these practices to “professional” data science?

        As part of the melting pot, programmers bring a lot of value with their experiences ranging from software development practices to IDE tips. Data scientists from a programming background are already familiar with most of these, and those from other backgrounds benefit immensely.

        Think of a code base that starts to grow: how do you write better code? How do you refactor the code? How can a new team member understand that code faster? These are some of the questions that my course helps with.

        The course also covers three major facilities in PyCharm Professional: Scientific Mode, Jupyter support, and the Database tool. How do these fit in?

        All of them are data centric, so they are very relevant to data scientists. These facilities are integrated nicely with other PyCharm capabilities such as debugging and refactoring. Overall, after watching the course and getting familiar with these capabilities, data scientists get a nice productivity boost.

        This webinar is good timing. You just released the course and we just re-invented our Jupyter support. What do you think of the new, IDE-centric Jupyter integration?

        I think the new Jupyter integration is an excellent step in the right direction, because you can use both Jupyter and PyCharm features such as debugging and code completion. Joel Grus gave an insightful and entertaining talk about Jupyter limitations at JupyterCon 2018. I think the new Jupyter integration in PyCharm can eventually help solve some Jupyter pain points raised by Joel, such as hidden state.

        What’s one big problem or pain point in Jupyter that could benefit from new ideas or tooling?

        Reproducibility is problematic with Jupyter and it is important for data science. For example, it’s easy to share a notebook on GitHub, then someone else tries to run it and gets different results. Perhaps the solution is a mix of discipline and better tools.

        Real Python: A Beginner’s Guide to the Python time Module


        The Python time module provides many ways of representing time in code, such as objects, numbers, and strings. It also provides functionality other than representing time, like waiting during code execution and measuring the efficiency of your code.

        This article will walk you through the most commonly used functions and objects in time.

        By the end of this article, you’ll be able to:

        • Understand core concepts at the heart of working with dates and times, such as epochs, time zones, and daylight savings time
        • Represent time in code using floats, tuples, and struct_time
        • Convert between different time representations
        • Suspend thread execution
        • Measure code performance using perf_counter()

        You’ll start by learning how you can use a floating point number to represent time.

        Free Bonus: Click here to get our free Python Cheat Sheet that shows you the basics of Python 3, like working with data types, dictionaries, lists, and Python functions.

        Dealing With Python Time Using Seconds

        One of the ways you can manage the concept of Python time in your application is by using a floating point number that represents the number of seconds that have passed since the beginning of an era—that is, since a certain starting point.

        Let’s dive deeper into what that means, why it’s useful, and how you can use it to implement logic, based on Python time, in your application.

        The Epoch

        You learned in the previous section that you can manage Python time with a floating point number representing elapsed time since the beginning of an era.

        Merriam-Webster defines an era as:

        • A fixed point in time from which a series of years is reckoned
        • A system of chronological notation computed from a given date as basis

        The important concept to grasp here is that, when dealing with Python time, you’re considering a period of time identified by a starting point. In computing, you call this starting point the epoch.

        The epoch, then, is the starting point against which you can measure the passage of time.

        For example, if you define the epoch to be midnight on January 1, 1970 UTC—the epoch as defined on Windows and most UNIX systems—then you can represent midnight on January 2, 1970 UTC as 86400 seconds since the epoch.

        This is because there are 60 seconds in a minute, 60 minutes in an hour, and 24 hours in a day. January 2, 1970 UTC is only one day after the epoch, so you can apply basic math to arrive at that result:

        >>>
        >>> 60 * 60 * 24
        86400

        It is also important to note that you can still represent time before the epoch. The number of seconds would just be negative.

        For example, you would represent midnight on December 31, 1969 UTC (using an epoch of January 1, 1970) as -86400 seconds.

        While January 1, 1970 UTC is a common epoch, it is not the only epoch used in computing. In fact, different operating systems, filesystems, and APIs sometimes use different epochs.

        As you saw before, UNIX systems define the epoch as January 1, 1970. The Win32 API, on the other hand, defines the epoch as January 1, 1601.

        You can use time.gmtime() to determine your system’s epoch:

        >>>
        >>> import time
        >>> time.gmtime(0)
        time.struct_time(tm_year=1970, tm_mon=1, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=3, tm_yday=1, tm_isdst=0)

        You’ll learn about gmtime() and struct_time throughout the course of this article. For now, just know that you can use time to discover the epoch using this function.

        Now that you understand more about how to measure time in seconds using an epoch, let’s take a look at Python’s time module to see what functions it offers that help you do so.

        Python Time in Seconds as a Floating Point Number

        First, time.time() returns the number of seconds that have passed since the epoch. The return value is a floating point number to account for fractional seconds:

        >>>
        >>> from time import time
        >>> time()
        1551143536.9323719

        The number you get on your machine may be very different because the reference point considered to be the epoch may be very different.

        Further Reading: Python 3.7 introduced time_ns(), which returns an integer value representing the same elapsed time since the epoch, but in nanoseconds rather than seconds.
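        A minimal sketch of time_ns() (it requires Python 3.7 or later, and the exact value on your machine will differ; the one below is roughly the nanosecond version of the float shown earlier):

        >>> import time
        >>> time.time_ns()
        1551143536932371900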

        Measuring time in seconds is useful for a number of reasons:

        • You can use a float to calculate the difference between two points in time.
        • A float is easily serializable, meaning that it can be stored for data transfer and come out intact on the other side.

        Sometimes, however, you may want to see the current time represented as a string. To do so, you can pass the number of seconds you get from time() into time.ctime().

        Python Time in Seconds as a String Representing Local Time

        As you saw before, you may want to convert the Python time, represented as the number of elapsed seconds since the epoch, to a string. You can do so using ctime():

        >>>
        >>> from time import time, ctime
        >>> t = time()
        >>> ctime(t)
        'Mon Feb 25 19:11:59 2019'

        Here, you’ve recorded the current time in seconds into the variable t, then passed t as an argument to ctime(), which returns a string representation of that same time.

        Technical Detail: The argument, representing seconds since the epoch, is optional according to the ctime() definition. If you don’t pass an argument, then ctime() uses the return value of time() by default. So, you could simplify the example above:

        >>>
        >>> from time import ctime
        >>> ctime()
        'Mon Feb 25 19:11:59 2019'

        The string representation of time, also known as a timestamp, returned by ctime() is formatted with the following structure:

        1. Day of the week: Mon (Monday)
        2. Month of the year: Feb (February)
        3. Day of the month: 25
        4. Hours, minutes, and seconds using the 24-hour clock notation: 19:11:59
        5. Year: 2019

        The previous example displays the timestamp of a particular moment captured from a computer in the South Central region of the United States. But, let’s say you live in Sydney, Australia, and you executed the same command at the same instant.

        Instead of the above output, you’d see the following:

        >>>
        >>> from time import time, ctime
        >>> t = time()
        >>> ctime(t)
        'Tue Feb 26 12:11:59 2019'

        Notice that the day of week, day of month, and hour portions of the timestamp are different than the first example.

        These outputs are different because the timestamp returned by ctime() depends on your geographical location.

        Note: While the concept of time zones is relative to your physical location, you can modify this in your computer’s settings without actually relocating.

        The representation of time dependent on your physical location is called local time and makes use of a concept called time zones.

        Note: Since local time is related to your locale, timestamps often account for locale-specific details such as the order of the elements in the string and translations of the day and month abbreviations. ctime() ignores these details.

        Let’s dig a little deeper into the notion of time zones so that you can better understand Python time representations.

        Understanding Time Zones

        A time zone is a region of the world that conforms to a standardized time. Time zones are defined by their offset from Coordinated Universal Time (UTC) and, potentially, the inclusion of daylight savings time (which we’ll cover in more detail later in this article).

        Fun Fact: If you’re a native English speaker, you might be wondering why the abbreviation for “Coordinated Universal Time” is UTC rather than the more obvious CUT. However, if you’re a native French speaker, you would call it “Temps Universel Coordonné,” which suggests a different abbreviation: TUC.

        Ultimately, the International Telecommunication Union and the International Astronomical Union compromised on UTC as the official abbreviation so that, regardless of language, the abbreviation would be the same.

        UTC and Time Zones

        UTC is the time standard against which all the world’s timekeeping is synchronized (or coordinated). It is not, itself, a time zone but rather a transcendent standard that defines what time zones are.

        UTC time is precisely measured using astronomical time, referring to the Earth’s rotation, and atomic clocks.

        Time zones are then defined by their offset from UTC. For example, in North and South America, the Central Time Zone (CT) is behind UTC by five or six hours and, therefore, uses the notation UTC-5:00 or UTC-6:00.

        Sydney, Australia, on the other hand, belongs to the Australian Eastern Time Zone (AET), which is ten or eleven hours ahead of UTC (UTC+10:00 or UTC+11:00).

        This difference (UTC-6:00 to UTC+10:00) is the reason for the variance you observed in the two outputs from ctime() in the previous examples:

        • Central Time (CT): 'Mon Feb 25 19:11:59 2019'
        • Australian Eastern Time (AET): 'Tue Feb 26 12:11:59 2019'

        These times are exactly sixteen hours apart, which is consistent with the time zone offsets mentioned above.

        You may be wondering why CT can be either five or six hours behind UTC or why AET can be ten or eleven hours ahead. The reason for this is that some areas around the world, including parts of these time zones, observe daylight savings time.

        Daylight Savings Time

        Summer months generally experience more daylight hours than winter months. Because of this, some areas observe daylight savings time (DST) during the spring and summer to make better use of those daylight hours.

        For places that observe DST, their clocks will jump ahead one hour at the beginning of spring (effectively losing an hour). Then, in the fall, the clocks will be reset to standard time.

        The letters S and D represent standard time and daylight savings time in time zone notation:

        • Central Standard Time (CST)
        • Australian Eastern Daylight Time (AEDT)

        When you represent times as timestamps in local time, it is always important to consider whether DST is applicable or not.

        ctime() accounts for daylight savings time. So, the output difference listed previously would be more accurate as the following:

        • Central Standard Time (CST): 'Mon Feb 25 19:11:59 2019'
        • Australian Eastern Daylight Time (AEDT): 'Tue Feb 26 12:11:59 2019'

        Dealing With Python Time Using Data Structures

        Now that you have a firm grasp on many fundamental concepts of time including epochs, time zones, and UTC, let’s take a look at more ways to represent time using the Python time module.

        Python Time as a Tuple

        Instead of using a number to represent Python time, you can use another primitive data structure: a tuple.

        The tuple allows you to manage time a little more easily by abstracting some of the data and making it more readable.

        When you represent time as a tuple, each element in your tuple corresponds to a specific element of time:

        1. Year
        2. Month as an integer, ranging between 1 (January) and 12 (December)
        3. Day of the month
        4. Hour as an integer, ranging between 0 (12 A.M.) and 23 (11 P.M.)
        5. Minute
        6. Second
        7. Day of the week as an integer, ranging between 0 (Monday) and 6 (Sunday)
        8. Day of the year
        9. Daylight savings time as an integer with the following values:
          • 1 is daylight savings time.
          • 0 is standard time.
          • -1 is unknown.

        Using the methods you’ve already learned, you can represent the same Python time in two different ways:

        >>>
        >>> from time import time, ctime
        >>> t = time()
        >>> t
        1551186415.360564
        >>> ctime(t)
        'Tue Feb 26 07:06:55 2019'
        >>> time_tuple = (2019, 2, 26, 7, 6, 55, 1, 57, 0)

        In this case, both t and time_tuple represent the same time, but the tuple provides a more readable interface for working with time components.

        Technical Detail: Actually, if you look at the Python time represented by time_tuple in seconds (which you’ll see how to do later in this article), you’ll see that it resolves to 1551186415.0 rather than 1551186415.360564.

        This is because the tuple doesn’t have a way to represent fractional seconds.

        While the tuple provides a more manageable interface for working with Python time, there is an even better object: struct_time.

        Python Time as an Object

        The problem with the tuple construct is that it still looks like a bunch of numbers, even though it’s better organized than a single, seconds-based number.

        struct_time provides a solution to this by utilizing NamedTuple, from Python’s collections module, to associate the tuple’s sequence of numbers with useful identifiers:

        >>>
        >>> from time import struct_time
        >>> time_tuple = (2019, 2, 26, 7, 6, 55, 1, 57, 0)
        >>> time_obj = struct_time(time_tuple)
        >>> time_obj
        time.struct_time(tm_year=2019, tm_mon=2, tm_mday=26, tm_hour=7, tm_min=6, tm_sec=55, tm_wday=1, tm_yday=57, tm_isdst=0)

        Technical Detail: If you’re coming from another language, the terms struct and object might be in opposition to one another.

        In Python, there is no data type called struct. Instead, everything is an object.

        However, the name struct_time is derived from the C-based time library where the data type is actually a struct.

        In fact, Python’s time module, which is implemented in C, uses this struct directly by including the header file time.h.

        Now, you can access specific elements of time_obj using the attribute’s name rather than an index:

        >>>
        >>> day_of_year = time_obj.tm_yday
        >>> day_of_year
        57
        >>> day_of_month = time_obj.tm_mday
        >>> day_of_month
        26

        Beyond the readability and usability of struct_time, it is also important to know because it is the return type of many of the functions in the Python time module.

        Converting Python Time in Seconds to an Object

        Now that you’ve seen the three primary ways of working with Python time, you’ll learn how to convert between the different time data types.

        Converting between time data types is dependent on whether the time is in UTC or local time.

        Coordinated Universal Time (UTC)

        The epoch uses UTC for its definition rather than a time zone. Therefore, the seconds elapsed since the epoch is not variable depending on your geographical location.

        However, the same cannot be said of struct_time. The object representation of Python time may or may not take your time zone into account.

        There are two ways to convert a float representing seconds to a struct_time:

        1. UTC
        2. Local time

        To convert a Python time float to a UTC-based struct_time, the Python time module provides a function called gmtime().

        You’ve seen gmtime() used once before in this article:

        >>>
        >>> import time
        >>> time.gmtime(0)
        time.struct_time(tm_year=1970, tm_mon=1, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=3, tm_yday=1, tm_isdst=0)

        You used this call to discover your system’s epoch. Now, you have a better foundation for understanding what’s actually happening here.

        gmtime() converts the number of elapsed seconds since the epoch to a struct_time in UTC. In this case, you’ve passed 0 as the number of seconds, meaning you’re trying to find the epoch, itself, in UTC.

        Note: Notice the attribute tm_isdst is set to 0. This attribute represents whether the time zone is using daylight savings time. UTC never subscribes to DST, so that flag will always be 0 when using gmtime().

        As you saw before, struct_time cannot represent fractional seconds, so gmtime() ignores the fractional seconds in the argument:

        >>>
        >>> import time
        >>> time.gmtime(1.99)
        time.struct_time(tm_year=1970, tm_mon=1, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=1, tm_wday=3, tm_yday=1, tm_isdst=0)

        Notice that even though the number of seconds you passed was very close to 2, the .99 fractional seconds were simply ignored, as shown by tm_sec=1.

        The secs parameter for gmtime() is optional, meaning you can call gmtime() with no arguments. Doing so will provide the current time in UTC:

        >>>
        >>> import time
        >>> time.gmtime()
        time.struct_time(tm_year=2019, tm_mon=2, tm_mday=28, tm_hour=12, tm_min=57, tm_sec=24, tm_wday=3, tm_yday=59, tm_isdst=0)

        Interestingly, there is no inverse for this function within time. Instead, you’ll have to look in Python’s calendar module for a function named timegm():

        >>>
        >>> import calendar
        >>> import time
        >>> time.gmtime()
        time.struct_time(tm_year=2019, tm_mon=2, tm_mday=28, tm_hour=13, tm_min=23, tm_sec=12, tm_wday=3, tm_yday=59, tm_isdst=0)
        >>> calendar.timegm(time.gmtime())
        1551360204

        timegm() takes a tuple (or struct_time, since it is a subclass of tuple) and returns the corresponding number of seconds since the epoch.

        Historical Context: If you’re interested in why timegm() is not in time, you can view the discussion in Python Issue 6280.

        In short, it was originally added to calendar because time closely follows C’s time library (defined in time.h), which contains no matching function. The above-mentioned issue proposed the idea of moving or copying timegm() into time.

        However, with advances to the datetime library, inconsistencies in the patched implementation of time.timegm(), and a question of how to then handle calendar.timegm(), the maintainers declined the patch, encouraging the use of datetime instead.

        Working with UTC is valuable in programming because it’s a standard. You don’t have to worry about DST, time zone, or locale information.

        That said, there are plenty of cases when you’d want to use local time. Next, you’ll see how to convert from seconds to local time so that you can do just that.

        Local Time

        In your application, you may need to work with local time rather than UTC. Python’s time module provides a function for getting local time from the number of seconds elapsed since the epoch called localtime().

        The signature of localtime() is similar to gmtime() in that it takes an optional secs argument, which it uses to build a struct_time using your local time zone:

        >>>
        >>> import time
        >>> time.time()
        1551448206.86196
        >>> time.localtime(1551448206.86196)
        time.struct_time(tm_year=2019, tm_mon=3, tm_mday=1, tm_hour=7, tm_min=50, tm_sec=6, tm_wday=4, tm_yday=60, tm_isdst=0)

        Notice that tm_isdst=0. Since DST matters with local time, tm_isdst will change between 0 and 1 depending on whether or not DST is applicable for the given time. Since tm_isdst=0, DST is not applicable for March 1, 2019.

        In the United States in 2019, daylight savings time begins on March 10. So, to test if the DST flag will change correctly, you need to add 9 days’ worth of seconds to the secs argument.

        To compute this, you take the number of seconds in a day (86,400) and multiply that by 9 days:

        >>>
        >>> new_secs = 1551448206.86196 + (86400 * 9)
        >>> time.localtime(new_secs)
        time.struct_time(tm_year=2019, tm_mon=3, tm_mday=10, tm_hour=8, tm_min=50, tm_sec=6, tm_wday=6, tm_yday=69, tm_isdst=1)

        Now, you’ll see that the struct_time shows the date to be March 10, 2019 with tm_isdst=1. Also, notice that tm_hour has also jumped ahead, to 8 instead of 7 in the previous example, because of daylight savings time.

        Since Python 3.3, struct_time has also included two attributes that are useful in determining the time zone of the struct_time:

        1. tm_zone
        2. tm_gmtoff

        At first, these attributes were platform dependent, but they have been available on all platforms since Python 3.6.

        First, tm_zone stores the local time zone:

        >>>
        >>> import time
        >>> current_local = time.localtime()
        >>> current_local.tm_zone
        'CST'

        Here, you can see that localtime() returns a struct_time with the time zone set to CST (Central Standard Time).

        As you saw before, you can also tell the time zone based on two pieces of information, the UTC offset and DST (if applicable):

        >>>
        >>> import time
        >>> current_local = time.localtime()
        >>> current_local.tm_gmtoff
        -21600
        >>> current_local.tm_isdst
        0

        In this case, you can see that current_local is 21600 seconds behind GMT, which stands for Greenwich Mean Time. GMT is the time zone with no UTC offset: UTC±00:00.

        21600 seconds divided by seconds per hour (3,600) means that current_local time is GMT-06:00 (or UTC-06:00).

        You can use the GMT offset plus the DST status to deduce that current_local is UTC-06:00 at standard time, which corresponds to the Central standard time zone.
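        You can do that arithmetic directly on the struct_time attributes; a small sketch that turns the offset into hours (the -6.0 assumes the same CST local zone as above):

        >>> current_local.tm_gmtoff / 3600
        -6.0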

        Like gmtime(), you can ignore the secs argument when calling localtime(), and it will return the current local time in a struct_time:

        >>>
        >>> import time
        >>> time.localtime()
        time.struct_time(tm_year=2019, tm_mon=3, tm_mday=1, tm_hour=8, tm_min=34, tm_sec=28, tm_wday=4, tm_yday=60, tm_isdst=0)

        Unlike gmtime(), the inverse function of localtime() does exist in the Python time module. Let’s take a look at how that works.

        Converting a Local Time Object to Seconds

        You’ve already seen how to convert a UTC time object to seconds using calendar.timegm(). To convert local time to seconds, you’ll use mktime().

        mktime() requires you to pass a parameter called t that takes the form of either a normal 9-tuple or a struct_time object representing local time:

        >>> import time
        >>> time_tuple = (2019, 3, 10, 8, 50, 6, 6, 69, 1)
        >>> time.mktime(time_tuple)
        1552225806.0
        >>> time_struct = time.struct_time(time_tuple)
        >>> time.mktime(time_struct)
        1552225806.0

        It’s important to keep in mind that t must be a tuple representing local time, not UTC:

        >>> from time import gmtime, mktime

        >>> # 1
        >>> current_utc = gmtime()
        >>> current_utc
        time.struct_time(tm_year=2019, tm_mon=3, tm_mday=1, tm_hour=14, tm_min=51, tm_sec=19, tm_wday=4, tm_yday=60, tm_isdst=0)

        >>> # 2
        >>> current_utc_secs = mktime(current_utc)
        >>> current_utc_secs
        1551473479.0

        >>> # 3
        >>> gmtime(current_utc_secs)
        time.struct_time(tm_year=2019, tm_mon=3, tm_mday=1, tm_hour=20, tm_min=51, tm_sec=19, tm_wday=4, tm_yday=60, tm_isdst=0)

        Note: For this example, assume that the local time is March 1, 2019 08:51:19 CST.

        This example shows why it’s important to use mktime() with local time, rather than UTC:

        1. gmtime() with no argument returns a struct_time using UTC. current_utc shows March 1, 2019 14:51:19 UTC. This is accurate because CST is UTC-06:00, so UTC should be 6 hours ahead of local time.

        2. mktime() tries to return the number of seconds, expecting local time, but you passed current_utc instead. So, instead of understanding that current_utc is UTC time, it assumes you meant March 1, 2019 14:51:19 CST.

        3. gmtime() is then used to convert those seconds back into UTC, which results in an inconsistency. The time is now March 1, 2019 20:51:19 UTC. The reason for this discrepancy is the fact that mktime() expected local time. So, the conversion back to UTC adds another 6 hours to local time.

        Working with time zones is notoriously difficult, so it’s important to set yourself up for success by understanding the differences between UTC and local time and the Python time functions that deal with each.
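        As a quick sanity check, here's a minimal sketch of the matching round trip for UTC: pass the UTC struct_time to calendar.timegm() instead of mktime(), and the 6-hour drift disappears:

        >>> import calendar, time
        >>> current_utc = time.gmtime()
        >>> secs = calendar.timegm(current_utc)   # timegm() interprets the struct_time as UTC
        >>> time.gmtime(secs) == current_utc      # round-trips with no offset error
        True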

        Converting a Python Time Object to a String

        While working with tuples is fun and all, sometimes it’s best to work with strings.

        String representations of time, also known as timestamps, help make times more readable and can be especially useful for building intuitive user interfaces.

        There are two Python time functions that you use for converting a time.struct_time object to a string:

        1. asctime()
        2. strftime()

        You’ll begin by learning about asctime().

        asctime()

        You use asctime() for converting a time tuple or struct_time to a timestamp:

        >>> import time
        >>> time.asctime(time.gmtime())
        'Fri Mar  1 18:42:08 2019'
        >>> time.asctime(time.localtime())
        'Fri Mar  1 12:42:15 2019'

        Both gmtime() and localtime() return struct_time instances, for UTC and local time respectively.

        You can use asctime() to convert either struct_time to a timestamp. asctime() works similarly to ctime(), which you learned about earlier in this article, except instead of passing a floating point number, you pass a tuple. Even the timestamp format is the same between the two functions.
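        To make the parallel concrete, here's a small sketch; the timestamps shown are only illustrative, since the output depends on the moment you run it:

        >>> import time
        >>> secs = time.time()
        >>> time.ctime(secs)                    # takes a float of seconds since the epoch
        'Fri Mar  1 13:02:01 2019'
        >>> time.asctime(time.localtime(secs))  # takes a struct_time; same timestamp format
        'Fri Mar  1 13:02:01 2019'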

        As with ctime(), the parameter for asctime() is optional. If you do not pass a time object to asctime(), then it will use the current local time:

        >>> import time
        >>> time.asctime()
        'Fri Mar  1 12:56:07 2019'

        As with ctime(), it also ignores locale information.

        One of the biggest drawbacks of asctime() is its format inflexibility. strftime() solves this problem by allowing you to format your timestamps.

        strftime()

        You may find yourself in a position where the string format from ctime() and asctime() isn’t satisfactory for your application. Instead, you may want to format your strings in a way that’s more meaningful to your users.

        One example of this is if you would like to display your time in a string that takes locale information into account.

        To format strings, given a struct_time or Python time tuple, you use strftime(), which stands for “string format time.”

        strftime() takes two arguments:

        1. format specifies the order and form of the time elements in your string.
        2. t is an optional time tuple.

        To format a string, you use directives. Directives are character sequences that begin with a % that specify a particular time element, such as:

        • %d: Day of the month
        • %m: Month of the year
        • %Y: Year

        For example, you can output the date in your local time using the ISO 8601 standard like this:

        >>> import time
        >>> time.strftime('%Y-%m-%d', time.localtime())
        '2019-03-01'

        Further Reading: While representing dates using Python time is completely valid and acceptable, you should also consider using Python’s datetime module, which provides shortcuts and a more robust framework for working with dates and times together.

        For example, you can simplify outputting a date in the ISO 8601 format using datetime:

        >>> from datetime import date
        >>> date(year=2019, month=3, day=1).isoformat()
        '2019-03-01'

        As you saw before, a great benefit of using strftime() over asctime() is its ability to render timestamps that make use of locale-specific information.

        For example, if you want to represent the date and time in a locale-sensitive way, you can’t use asctime():

        >>> from time import asctime
        >>> asctime()
        'Sat Mar  2 15:21:14 2019'
        >>> import locale
        >>> locale.setlocale(locale.LC_TIME, 'zh_HK')  # Chinese - Hong Kong
        'zh_HK'
        >>> asctime()
        'Sat Mar  2 15:58:49 2019'

        Notice that even after programmatically changing your locale, asctime() still returns the date and time in the same format as before.

        Technical Detail: LC_TIME is the locale category for date and time formatting. The locale argument 'zh_HK' may be different, depending on your system.

        When you use strftime(), however, you’ll see that it accounts for locale:

        >>> from time import strftime, localtime
        >>> strftime('%c', localtime())
        'Sat Mar  2 15:23:20 2019'
        >>> import locale
        >>> locale.setlocale(locale.LC_TIME, 'zh_HK')  # Chinese - Hong Kong
        'zh_HK'
        >>> strftime('%c', localtime())
        '六  3/ 2 15:58:12 2019'

        Here, you have successfully utilized the locale information because you used strftime().

        Note: %c is the directive for locale-appropriate date and time.

        If the time tuple is not passed to the parameter t, then strftime() will use the result of localtime() by default. So, you could simplify the examples above by removing the optional second argument:

        >>> from time import strftime
        >>> strftime('The current local datetime is: %c')
        'The current local datetime is: Fri Mar  1 23:18:32 2019'

        Here, you’ve used the default time instead of passing your own as an argument. Also, notice that the format argument can consist of text other than formatting directives.

        Further Reading: Check out this thorough list of directives available to strftime().

        The Python time module also includes the inverse operation of converting a timestamp back into a struct_time object.

        Converting a Python Time String to an Object

        When you’re working with date and time related strings, it can be very valuable to convert the timestamp to a time object.

        To convert a time string to a struct_time, you use strptime(), which stands for “string parse time”:

        >>> from time import strptime
        >>> strptime('2019-03-01', '%Y-%m-%d')
        time.struct_time(tm_year=2019, tm_mon=3, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=4, tm_yday=60, tm_isdst=-1)

        The first argument to strptime() must be the timestamp you wish to convert. The second argument is the format that the timestamp is in.

        The format parameter is optional and defaults to '%a %b %d %H:%M:%S %Y'. Therefore, if you have a timestamp in that format, you don’t need to pass it as an argument:

        >>> strptime('Fri Mar 01 23:38:40 2019')
        time.struct_time(tm_year=2019, tm_mon=3, tm_mday=1, tm_hour=23, tm_min=38, tm_sec=40, tm_wday=4, tm_yday=60, tm_isdst=-1)

        Since a struct_time has 9 key date and time components, strptime() must provide reasonable defaults for the components it can’t parse from the string.

        In the previous examples, tm_isdst=-1. This means that strptime() can’t determine from the timestamp alone whether it represents daylight savings time or not.
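        For example, you can hand the parsed struct_time straight to mktime(), which treats tm_isdst=-1 as "unknown" and works out the DST status itself. This is a sketch that assumes the CST time zone used throughout this article, so the exact number of seconds will differ on your machine:

        >>> from time import strptime, mktime
        >>> parsed = strptime('2019-03-01', '%Y-%m-%d')
        >>> parsed.tm_isdst
        -1
        >>> mktime(parsed)   # seconds for midnight local time; value shown assumes CST (UTC-06:00)
        1551420000.0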

        Now you know how to work with Python times and dates using the time module in a variety of ways. However, there are other uses for time outside of simply creating time objects, getting Python time strings, and using seconds elapsed since the epoch.

        Suspending Execution

        One really useful Python time function is sleep(), which suspends the thread’s execution for a specified amount of time.

        For example, you can suspend your program’s execution for 10 seconds like this:

        >>> from time import sleep, strftime
        >>> strftime('%c')
        'Fri Mar  1 23:49:26 2019'
        >>> sleep(10)
        >>> strftime('%c')
        'Fri Mar  1 23:49:36 2019'

        Your program will print the first formatted datetime string, then pause for 10 seconds, and finally print the second formatted datetime string.

        You can also pass fractional seconds to sleep():

        >>> from time import sleep
        >>> sleep(0.5)

        sleep() is useful for testing or making your program wait for any reason, but you must be careful not to halt your production code unless you have good reason to do so.

        Before Python 3.5, a signal sent to your process could interrupt sleep(). However, in 3.5 and later, sleep() will always suspend execution for at least the specified amount of time, even if the process receives a signal.

        sleep() is just one Python time function that can help you test your programs and make them more robust.

        Measuring Performance

        You can use time to measure the performance of your program.

        The way you do this is to use perf_counter(), which, as the name suggests, provides a high-resolution performance counter for measuring short durations of time.

        To use perf_counter(), you place a counter before your code begins execution as well as after your code’s execution completes:

        >>> import time
        >>> from time import perf_counter
        >>> def longrunning_function():
        ...     for i in range(1, 11):
        ...         time.sleep(i / i ** 2)
        ...
        >>> start = perf_counter()
        >>> longrunning_function()
        >>> end = perf_counter()
        >>> execution_time = (end - start)
        >>> execution_time
        8.201258441999926

        First, start captures the moment before you call the function. end captures the moment after the function returns. The function’s total execution time took (end - start) seconds.

        Technical Detail: Python 3.7 introduced perf_counter_ns(), which works the same as perf_counter(), but uses nanoseconds instead of seconds.
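        Here's a minimal sketch of the nanosecond variant; the elapsed value shown is illustrative and will differ on every run:

        >>> from time import perf_counter_ns, sleep
        >>> start = perf_counter_ns()
        >>> sleep(0.1)
        >>> perf_counter_ns() - start   # elapsed time as an integer number of nanoseconds
        100231659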

        perf_counter() (or perf_counter_ns()) is the most precise way to measure the performance of your code using one execution. However, if you’re trying to accurately gauge the performance of a code snippet, I recommend using the Python timeit module.

        timeit specializes in running code many times to get a more accurate performance analysis and helps you to avoid oversimplifying your time measurement as well as other common pitfalls.
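        As a rough sketch of what that looks like (the statement and the timing shown are only illustrative):

        >>> import timeit
        >>> # Time 10,000 calls to strftime() and report the total elapsed seconds
        >>> timeit.timeit("time.strftime('%Y-%m-%d')", setup='import time', number=10_000)
        0.021755918999972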

        Conclusion

        Congratulations! You now have a great foundation for working with dates and times in Python.

        Now, you’re able to:

        • Use a floating point number, representing seconds elapsed since the epoch, to deal with time
        • Manage time using tuples and struct_time objects
        • Convert between seconds, tuples, and timestamp strings
        • Suspend the execution of a Python thread
        • Measure performance using perf_counter()

        On top of all that, you’ve learned some fundamental concepts surrounding date and time, such as:

        • Epochs
        • UTC
        • Time zones
        • Daylight savings time

        Now, it’s time for you to apply your newfound knowledge of Python time in your real world applications!

        Further Reading

        If you want to continue learning more about using dates and times in Python, take a look at the following modules:

        • datetime: A more robust date and time module in Python’s standard library
        • timeit: A module for measuring the performance of code snippets
        • astropy: Higher precision datetimes used in astronomy


        Codementor: Variable references in Python


        Podcast.__init__: Exploring Indico: A Full Featured Event Management Platform


        Summary

        Managing an event is rife with inherent complexity that scales as you move from scheduling a meeting to organizing a conference. Indico is a platform built at CERN to handle their efforts to organize events such as the Computing in High Energy Physics (CHEP) conference, and now it has grown to manage booking of meeting rooms. In this episode Adrian Mönnich, core developer on the Indico project, explains how it is architected to facilitate this use case, how it has evolved since its first incarnation two decades ago, and what he has learned while working on it. The Indico platform is definitely a feature rich and mature platform that is worth considering if you are responsible for organizing a conference or need a room booking system for your office.

        Announcements

        • Hello and welcome to Podcast.__init__, the podcast about Python and the people who make it great.
        • When you’re ready to launch your next app or want to try a project you hear about on the show, you’ll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 Gbit/s private networking, scalable shared block storage, node balancers, and a 40 Gbit/s public network, all controlled by a brand new API you’ve got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
        • Bots and automation are taking over whole categories of online interaction. Discover.bot is an online community designed to serve as a platform-agnostic digital space for bot developers and enthusiasts of all skill levels to learn from one another, share their stories, and move the conversation forward together. They regularly publish guides and resources to help you learn about topics such as bot development, using them for business, and the latest in chatbot news. For newcomers to the space they have the Beginners Guide To Bots that will teach you the basics of how bots work, what they can do, and where they are developed and published. To help you choose the right framework and avoid the confusion about which NLU features and platform APIs you will need they have compiled a list of the major options and how they compare. Go to pythonpodcast.com/discoverbot today to get started and thank them for their support of the show.
        • You listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to pythonpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
        • Visit the site to subscribe to the show, sign up for the newsletter, and read the show notes. And if you have any questions, comments, or suggestions I would love to hear them. You can reach me on Twitter at @Podcast__init__ or email hosts@podcastinit.com
        • To help other people find the show please leave a review on iTunes and tell your friends and co-workers
        • Join the community in the new Zulip chat workspace at pythonpodcast.com/chat
        • Your host as usual is Tobias Macey and today I’m interviewing Adrian Mönnich about Indico, the effortless open-source tool for event organisation, archival and collaboration

        Interview

        • Introductions
        • How did you get introduced to Python?
        • Can you start by describing what Indico is and how the project got started?
          • What are some other projects which target a similar use case and what were they lacking that led to Indico being necessary?
        • Can you talk through an example workflow for setting up and managing an event in Indico?
          • How does the lifecycle change when working with larger events, such as PyCon?
        • Can you describe how Indico is architected and how its design has evolved since it was first built?
          • What are some of the most complex or challenging portions of Indico to implement and maintain?
        • There are a lot of areas for exercising constraint resolution algorithms. Can you talk through some of the business logic of how that operates?
        • Most of Indico is highly configurable and flexible. How do you approach managing sane defaults to prevent users getting overwhelmed when onboarding?
          • What is your approach to testing given how complex the project is?
        • What are some of the most interesting or unexpected ways that you have seen Indico used?
        • What are some of the most interesting/unexpected lessons that you have learned in the process of building Indico?
        • What do you have planned for the future of the project?

        Keep In Touch

        Picks

        Links

        The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA

        NumFOCUS: NumFOCUS Projects to Apply for Inaugural Google Season of Docs

        The Code Bits: Printing star patterns in Python: One line tricks!


        In this post, we will see how to print some of the common star patterns using Python 3 with one line of code!

        How to print a half-pyramid pattern in Python?

        >>> n = 5
        >>> print('\n'.join('*' * i for i in range(1, n+1)))
        *
        **
        ***
        ****
        *****
        >>> print('\n'.join('* ' * i for i in range(1, n+1)))
        *
        * *
        * * *
        * * * *
        * * * * *
        

        How to print a rotated half-pyramid pattern in Python?

        >>> n = 5
        >>> print('\n'.join(' ' * (n-i) + '*' * (i) for i in range(1, n+1)))
            *
           **
          ***
         ****
        *****
        >>> print('\n'.join('  ' * (n-i) + '* ' * (i) for i in range(1, n+1)))
                *
              * *
            * * *
          * * * *
        * * * * *
        

        How to print an inverted half-pyramid pattern in Python?

        >>> n = 5
        >>> print('\n'.join('*' * (n-i) for i in range(n)))
        *****
        ****
        ***
        **
        *
        >>> print('\n'.join('* ' * (n-i) for i in range(n)))
        * * * * *
        * * * *
        * * *
        * *
        *
        

        How to print an inverted and rotated half-pyramid pattern in Python?

        >>> n = 5
        >>> print('\n'.join(' ' * i + '*' * (n-i) for i in range(n)))
        *****
         ****
          ***
           **
            *
        >>> print('\n'.join('  ' * i + '* ' * (n-i) for i in range(n)))
        * * * * *
          * * * *
            * * *
              * *
                *
        

        How to print a full triangle pyramid pattern in Python?

        >>> n = 5
        >>> print('\n'.join(' ' * (n-i) + '* ' * i for i in range(1, n+1)))
            *
           * *
          * * *
         * * * *
        * * * * *
        >>> print('\n'.join(' ' * (n-1-i) + '*' * ((i*2)+1) for i in range(n)))
            *
           ***
          *****
         *******
        *********
        

        How to print an inverted full triangle pyramid pattern in Python?

        >>> n = 5
        >>> print('\n'.join(' ' * (n-i) + '* ' * i for i in range(n, 0, -1)))
        * * * * *
         * * * *
          * * *
           * *
            *
        >>> print('\n'.join(' ' * (n-i) + '*' * ((i*2)-1) for i in range(n, 0, -1)))
        *********
         *******
          *****
           ***
            *
        

        Catalin George Festila: Testing firebase with Python 3.7.3 .

        The tutorial for today covers using the Firebase service with Python version 3.7.3. As you know, Firebase offers multiple free and paid services. To use them from the Python programming language, we need to use the pip utility to install the required modules. If your installation requires other Python modules, then you will need to install them in the same way. C:\Python373>pip install

        PyCon: Welcome Capital One: Python Software Foundation Principal Sponsor



        A big welcome and thank you to Capital One for joining the PSF as a Principal sponsor!

        Capital One is also a PyCon 2019 Principal sponsor and is excited to share a few things with attendees, including a deeper look at their intelligent virtual assistant, Eno. Eno’s NLP models were built in-house with Python. Eno is a key component of the customer experience at Capital One, proactively looking out for customers and their money. Eno notifies customers about unusual transactions or duplicate charges, helping to spot fraud in its tracks. It also sends bill reminders and makes paying your bill as easy as sending a text or emoji; plus, its new virtual card capabilities let customers shop online without using their real credit card number.

        The benefits they’ve seen by developing Eno with Python are numerous: fast time to market, the ability to prototype and iterate quickly, ease of integration with machine learning frameworks, and extensive support for everything they need (like Kafka and Redis). Plus, they see faster performance using Python's asynchronous I/O.

        For Capital One, sponsoring important industry conferences like PyCon brings a lot of benefits, like recruiting and brand awareness, but they’re here first and foremost for the community. By sponsoring PyCon, they feel they’re helping support, strengthen, and engage with the Python community.

        Capital One sees the future of banking as real-time, data-driven, and enabled by machine learning and data science -- and Python plays a big role in that. They have embedded machine learning across the entire enterprise, from call center operations to back-office processes, fraud, internal operations, the customer experience, and much more. To them, machine learning not only creates efficiency and scale on a level not possible before, but it also helps give their customers greater protection, security, confidence, and control of their finances.

        Python has been and will continue to be critical to advances in machine learning and data science, so they see a lot of exciting innovation, growth, and potential for the Python community.  They hope to share back with the community some of their own insights, best practices, and broader work with Python.

        As an open source first organization, Capital One has been working in the open source space for several years -- consuming and contributing code, as well as releasing their own projects. One example of an open source project they’ll be showcasing at PyCon is Cloud Custodian. Cloud Custodian is a tool built with Python to allow users to easily define rules to enable a well-managed cloud infrastructure in the enterprise. It’s both secure and cost-optimized and consolidates many of the ad-hoc scripts organizations have into a lightweight and flexible tool, with unified metrics and reporting.

        They also developed a Javascript project called Hygieia, a single, configurable dashboard that visualizes the health of an entire software delivery pipeline. All their open source projects are on GitHub and their Python projects can be found here.

        According to the Python Software Foundation and JetBrains’ 2018 Python Developers Survey, using Python for machine learning grew 7 percentage points since 2017, which is incredible. Machine learning experienced faster growth than Web development, which increased by only 2 percentage points compared to the previous year. Capital One is increasingly focused on using machine learning across the enterprise. One recent Python-based project is work they’ve done in Explainable AI. Their team created a technique called Global Attribution Mapping (GAM), which is capable of explaining neural network predictions across subpopulations. This approach surfaces subpopulations with their most representative explanations, allowing them to inspect global model behavior and making it easier to generate global explanations based on local attributions. You can learn more from the open source tool they developed for GAM, as well as a recent whitepaper that provides more details.

        Be sure to stop by their booth, #303, and get even more details about how they’re using Python.


