Channel: Planet Python

Python for Beginners: Drop Duplicate Rows From a Pandas Dataframe

Pandas dataframes are used to handle tabular data in Python. The data sometimes contains duplicate values which might be undesired. In this article, we will discuss different ways to drop duplicate rows from a pandas dataframe using the drop_duplicates() method.

The drop_duplicates() Method

The drop_duplicates() method is used to drop duplicate rows from a pandas dataframe. It has the following syntax.

DataFrame.drop_duplicates(subset=None, *, keep='first', inplace=False, ignore_index=False)

Here,

  • The subset parameter specifies which columns are compared to decide whether two rows are duplicates. Its default value is None, in which case the values in all the columns are compared. To compare rows by a single column, you can pass the column name to the subset parameter. To compare rows by two or more columns, you can pass a list of column names.
  • The keep parameter decides which of the duplicate rows, if any, are kept in the output dataframe. With keep="first", which is the default, all the duplicate rows except the first occurrence are dropped. With keep="last", all the duplicate rows except the last occurrence are dropped. With keep=False, every row that has a duplicate is dropped.
  • The inplace parameter decides whether we get a new dataframe after the drop operation or modify the original dataframe. When inplace is set to False, which is its default value, the original dataframe isn’t changed and the drop_duplicates() method returns a new dataframe without the duplicates. To alter the original dataframe, you can set inplace to True.
  • The ignore_index parameter decides how the output is indexed. Its default value is False, so the remaining rows keep their original index labels and the index has gaps where rows were dropped. To reset the index to 0, 1, ..., (length of dataframe)-1, you can set ignore_index to True.

After execution, the drop_duplicates() method returns a dataframe if the inplace parameter is set to False. Otherwise, it returns None.
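
The ignore_index parameter is the only one that the examples in this article do not exercise, so here is a minimal sketch of its effect. The small dataframe and its column names are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({"Name": ["Amy", "Bobby", "Amy", "Tina"],
                   "Marks": [88, 50, 88, 82]})

# Default: the surviving rows keep their original index labels.
kept_labels = df.drop_duplicates()

# ignore_index=True relabels the result 0, 1, ..., n-1.
renumbered = df.drop_duplicates(ignore_index=True)

print(list(kept_labels.index))   # [0, 1, 3]
print(list(renumbered.index))    # [0, 1, 2]
```

The row at index 2 repeats the row at index 0, so it is dropped in both cases. Only the index labels of the result differ.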

Drop Duplicate Rows From a Pandas Dataframe

To drop duplicate rows from a pandas dataframe, you can invoke the drop_duplicates() method on the dataframe. After execution, it returns a dataframe containing all the unique rows. You can observe this in the following example.

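import pandas as pd
# Load the sample data from the CSV file
df = pd.read_csv("grade2.csv")
print("The dataframe is:")
print(df)
# Drop repeated rows, keeping the first occurrence of each
df = df.drop_duplicates()
print("After dropping duplicates:")
print(df)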

Output:

The dataframe is:
    Class  Roll        Name  Marks Grade
0       2    27       Harsh     55     C
1       2    23       Clara     78     B
2       3    33        Tina     82     A
3       3    34         Amy     88     A
4       3    15    Prashant     78     B
5       3    27      Aditya     55     C
6       3    34         Amy     88     A
7       3    23  Radheshyam     78     B
8       3    11       Bobby     50     D
9       2    27       Harsh     55     C
10      3    15      Lokesh     88     A
After dropping duplicates:
    Class  Roll        Name  Marks Grade
0       2    27       Harsh     55     C
1       2    23       Clara     78     B
2       3    33        Tina     82     A
3       3    34         Amy     88     A
4       3    15    Prashant     78     B
5       3    27      Aditya     55     C
7       3    23  Radheshyam     78     B
8       3    11       Bobby     50     D
10      3    15      Lokesh     88     A

In the above example, the input dataframe contains the Class, Roll, Name, Marks, and Grade of some students. As you can observe, it contains some duplicate rows: the rows at index 0 and 9 are the same, and so are the rows at index 3 and 6. After execution of the drop_duplicates() method, we get a pandas dataframe in which all the rows are unique. The first occurrence of each duplicated row (index 0 and 3) is kept, and the later copies at index 6 and 9 are dropped. The remaining rows keep their original index labels, which is why 6 and 9 are missing from the output index.
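
If you want to check which rows will be dropped before dropping them, one option is the duplicated() method, which this article does not otherwise cover. The sketch below rebuilds the rows printed above inline, so it runs without the grade2.csv file.

```python
import pandas as pd

# Same rows as the printed input above, built inline so no CSV file is needed.
rows = [
    (2, 27, "Harsh", 55, "C"),
    (2, 23, "Clara", 78, "B"),
    (3, 33, "Tina", 82, "A"),
    (3, 34, "Amy", 88, "A"),
    (3, 15, "Prashant", 78, "B"),
    (3, 27, "Aditya", 55, "C"),
    (3, 34, "Amy", 88, "A"),
    (3, 23, "Radheshyam", 78, "B"),
    (3, 11, "Bobby", 50, "D"),
    (2, 27, "Harsh", 55, "C"),
    (3, 15, "Lokesh", 88, "A"),
]
df = pd.DataFrame(rows, columns=["Class", "Roll", "Name", "Marks", "Grade"])

# duplicated() marks every row that repeats an earlier row (keep="first" by default).
mask = df.duplicated()
print(df[mask].index.tolist())                  # [6, 9]

# drop_duplicates() keeps exactly the rows that are not marked.
print(df.drop_duplicates().equals(df[~mask]))   # True
```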

Drop All Duplicate Rows From a Pandas Dataframe

In the above example, one entry from each set of duplicate rows is preserved. If you want to delete all the duplicate rows from the dataframe, you can set the keep parameter to False in the drop_duplicates() method. After this, all the rows having duplicate values will be deleted. You can observe this in the following example.

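import pandas as pd
# Load the sample data from the CSV file
df = pd.read_csv("grade2.csv")
print("The dataframe is:")
print(df)
# keep=False drops every row that has a duplicate, including the first occurrence
df = df.drop_duplicates(keep=False)
print("After dropping duplicates:")
print(df)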

Output:

The dataframe is:
    Class  Roll        Name  Marks Grade
0       2    27       Harsh     55     C
1       2    23       Clara     78     B
2       3    33        Tina     82     A
3       3    34         Amy     88     A
4       3    15    Prashant     78     B
5       3    27      Aditya     55     C
6       3    34         Amy     88     A
7       3    23  Radheshyam     78     B
8       3    11       Bobby     50     D
9       2    27       Harsh     55     C
10      3    15      Lokesh     88     A
After dropping duplicates:
    Class  Roll        Name  Marks Grade
1       2    23       Clara     78     B
2       3    33        Tina     82     A
4       3    15    Prashant     78     B
5       3    27      Aditya     55     C
7       3    23  Radheshyam     78     B
8       3    11       Bobby     50     D
10      3    15      Lokesh     88     A

In this example, you can observe that rows at index 0 and 9 are the same. Similarly, rows at the index 3 and 6 are the same. When we set the keep parameter to False in the drop_duplicates() method, you can observe that all the rows that have duplicate values i.e. rows at index 0, 3, 6, and 9 are dropped from the input dataframe.
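
To compare the three values of the keep parameter side by side, here is a minimal sketch on a hypothetical three-row dataframe (the data is made up for illustration).

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical.
df = pd.DataFrame({"Name": ["Amy", "Bobby", "Amy"], "Marks": [88, 50, 88]})

first = df.drop_duplicates(keep="first")    # keep the first copy of each row
last = df.drop_duplicates(keep="last")      # keep the last copy of each row
none_kept = df.drop_duplicates(keep=False)  # drop every row that has a copy

print(first.index.tolist())      # [0, 1]
print(last.index.tolist())       # [1, 2]
print(none_kept.index.tolist())  # [1]
```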

Drop Duplicate Rows Inplace From a Pandas Dataframe

By default, the drop_duplicates() method returns a new dataframe. If you want to alter the original dataframe instead of creating a new one, you can set the inplace parameter to True in the drop_duplicates() method as shown below.

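import pandas as pd
# Load the sample data from the CSV file
df = pd.read_csv("grade2.csv")
print("The dataframe is:")
print(df)
# inplace=True modifies df itself; the call returns None, so it is not assigned
df.drop_duplicates(keep=False, inplace=True)
print("After dropping duplicates:")
print(df)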

Output:

The dataframe is:
    Class  Roll        Name  Marks Grade
0       2    27       Harsh     55     C
1       2    23       Clara     78     B
2       3    33        Tina     82     A
3       3    34         Amy     88     A
4       3    15    Prashant     78     B
5       3    27      Aditya     55     C
6       3    34         Amy     88     A
7       3    23  Radheshyam     78     B
8       3    11       Bobby     50     D
9       2    27       Harsh     55     C
10      3    15      Lokesh     88     A
After dropping duplicates:
    Class  Roll        Name  Marks Grade
1       2    23       Clara     78     B
2       3    33        Tina     82     A
4       3    15    Prashant     78     B
5       3    27      Aditya     55     C
7       3    23  Radheshyam     78     B
8       3    11       Bobby     50     D
10      3    15      Lokesh     88     A

In this example, we have set the inplace parameter to True in the drop_duplicates() method. Hence, the drop_duplicates() method modifies the input dataframe instead of creating a new one and returns None. Because the keep parameter is also set to False, every copy of the duplicated rows is removed, as in the previous example.
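
Because the method returns None when inplace is True, assigning its result back to the dataframe variable would replace the dataframe with None. The following sketch, on a hypothetical toy dataframe, shows the return value and the in-place change.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical.
df = pd.DataFrame({"Name": ["Amy", "Bobby", "Amy"], "Marks": [88, 50, 88]})

# With inplace=True the dataframe is changed directly and nothing is returned.
result = df.drop_duplicates(inplace=True)

print(result is None)  # True
print(len(df))         # 2
```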

Drop Rows Having Duplicate Values in Specific Columns

By default, the drop_duplicates() method compares all the columns for similarity to check for duplicate rows. If you want to compare the rows for duplicate values on the basis of specific columns, you can use the subset parameter in the drop_duplicates() method. 

The subset parameter takes a column name or a list of column names as its input argument. After this, the drop_duplicates() method compares the rows only based on the specified columns. You can observe this in the following example.

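import pandas as pd
# Load the sample data from the CSV file
df = pd.read_csv("grade2.csv")
print("The dataframe is:")
print(df)
# Treat rows as duplicates when both Class and Roll match; modify df in place
df.drop_duplicates(subset=["Class", "Roll"], inplace=True)
print("After dropping duplicates:")
print(df)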

Output:

The dataframe is:
    Class  Roll        Name  Marks Grade
0       2    27       Harsh     55     C
1       2    23       Clara     78     B
2       3    33        Tina     82     A
3       3    34         Amy     88     A
4       3    15    Prashant     78     B
5       3    27      Aditya     55     C
6       3    34         Amy     88     A
7       3    23  Radheshyam     78     B
8       3    11       Bobby     50     D
9       2    27       Harsh     55     C
10      3    15      Lokesh     88     A
After dropping duplicates:
   Class  Roll        Name  Marks Grade
0      2    27       Harsh     55     C
1      2    23       Clara     78     B
2      3    33        Tina     82     A
3      3    34         Amy     88     A
4      3    15    Prashant     78     B
5      3    27      Aditya     55     C
7      3    23  Radheshyam     78     B
8      3    11       Bobby     50     D

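In this example, we have passed the Python list ["Class", "Roll"] to the subset parameter in the drop_duplicates() method. Hence, the duplicate rows are decided on the basis of these two columns only. As a result, the rows having the same values in the Class and Roll columns are considered duplicates, and all but the first of them are dropped from the dataframe. For instance, the row at index 10 (Lokesh) is dropped because it has the same Class and Roll as the row at index 4 (Prashant), even though the Name, Marks, and Grade are different.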
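
The subset parameter also accepts a single column name instead of a list. The sketch below uses a hypothetical four-row dataframe to show how the result changes depending on which columns are compared.

```python
import pandas as pd

# Hypothetical toy data: Roll 27 appears in two different classes.
df = pd.DataFrame({
    "Class": [2, 2, 3, 3],
    "Roll": [27, 23, 27, 15],
    "Name": ["Harsh", "Clara", "Aditya", "Prashant"],
})

# Compare on Roll only: Aditya repeats Harsh's roll number and is dropped.
by_roll = df.drop_duplicates(subset="Roll")

# Compare on Class and Roll together: no pair repeats, so nothing is dropped.
by_class_roll = df.drop_duplicates(subset=["Class", "Roll"])

print(by_roll["Name"].tolist())   # ['Harsh', 'Clara', 'Prashant']
print(len(by_class_roll))         # 4
```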

Conclusion

In this article, we have discussed different ways to drop duplicate rows from a dataframe using the drop_duplicates() method.

To know more about the pandas module, you can read this article on how to sort a pandas dataframe. You might also like this article on how to drop columns from a pandas dataframe.

I hope you enjoyed reading this article. Stay tuned for more informative articles.

Happy Learning!

The post Drop Duplicate Rows From a Pandas Dataframe appeared first on PythonForBeginners.com.

