# Data Cleaning: Removing Duplicates

Data is the backbone of most modern applications. But raw data often comes with imperfections and needs to be cleaned before it can be used effectively. One common issue is duplicate data. In this tutorial, we will learn how to handle duplicate data using the Python library, `pandas`.

## Understanding Duplicate Data

Duplicate data refers to rows in a dataset that are exact copies of other rows, or that repeat values in certain key columns. Duplicates can bias analysis and lead to incorrect results, so it is essential to identify and handle them.

## Identifying Duplicates

In pandas, the `duplicated()` function is used to identify duplicate rows. It returns a Boolean Series that is `True` for every row that is a duplicate of an earlier row; the first occurrence is marked `False`.

Here's an example:

```python
import pandas as pd

# Creating a sample dataframe
data = {
    'Name': ['John', 'Anna', 'John', 'Charles', 'Anna'],
    'Age': [28, 25, 28, 20, 25],
    'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles']
}

df = pd.DataFrame(data)

# Identifying duplicates
print(df.duplicated())
```

In the above code, the `duplicated()` function identifies the third and fifth rows as duplicates because the values in these rows are exactly the same as those in the first and second rows, respectively.
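
To see the flagged rows themselves rather than just the Boolean Series, you can use that Series as a boolean mask. Here is a quick sketch using the sample dataframe above:

```python
# Select only the rows flagged as duplicates of earlier rows
duplicate_rows = df[df.duplicated()]
print(duplicate_rows)
```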

## Removing Duplicates

Pandas provides the `drop_duplicates()` function to remove duplicate rows. By default, it removes completely identical rows and keeps the first occurrence of each.

Here's how to use it:

```python
# Removing duplicates (keeps the first occurrence of each row)
df_no_duplicates = df.drop_duplicates()

print(df_no_duplicates)
```
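
The `keep` parameter controls which occurrence survives: `'last'` keeps the final occurrence instead of the first, and `False` drops every row that has a duplicate. A minimal sketch with the same dataframe:

```python
# Keep the last occurrence of each duplicated row instead of the first
print(df.drop_duplicates(keep='last'))

# Drop every row that is duplicated anywhere in the dataframe
print(df.drop_duplicates(keep=False))
```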

If you want to consider only certain columns when identifying duplicates, you can pass their names to the `subset` parameter of both `duplicated()` and `drop_duplicates()`. For example:

```python
# Identifying duplicates based on the 'Name' column
print(df.duplicated(subset='Name'))

# Removing duplicates based on the 'Name' column
df_no_duplicates = df.drop_duplicates(subset='Name')

print(df_no_duplicates)
```

In the above code, the `duplicated()` function flags the third and fifth rows as duplicates because their 'Name' values ('John' and 'Anna') have already appeared in earlier rows.
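
The `subset` parameter also accepts a list of columns when a combination of fields defines uniqueness. A short sketch, again assuming the sample dataframe from above:

```python
# Treat rows as duplicates only when both 'Name' and 'City' repeat together
print(df.drop_duplicates(subset=['Name', 'City']))
```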

## Conclusion

Handling duplicate data is a critical step in data cleaning. In this tutorial, we learned how to identify and remove duplicate data using pandas. Remember, the decision to remove duplicates may depend on the context, and sometimes it might be better to keep the duplicate data. Always understand your data and the implications of removing duplicates before making a decision.
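
As a quick sanity check before dropping anything, it can help to count how often each key repeats. A minimal sketch using the sample dataframe from this tutorial:

```python
# Count occurrences of each name; values greater than 1 indicate repeats
name_counts = df['Name'].value_counts()
print(name_counts[name_counts > 1])
```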