Data Analysis Project

Introduction

In this tutorial, we walk through a data analysis project from start to finish using the pandas library in Python. The project puts into practice the skills you've learned: we will work with a real-world dataset, perform exploratory data analysis (EDA), and draw some conclusions from the results.

Importing the Necessary Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Loading the Dataset

In the first step, we load the dataset using pandas. For this project, we'll use a dataset about Airbnb listings in New York City.

df = pd.read_csv('AB_NYC_2019.csv')
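When only a few columns are needed, read_csv can select them and parse dates while loading. The following is a minimal sketch, not the tutorial's own loading step: the in-memory CSV text, its rows, and the 'last_review' column are assumptions made up for illustration so the example runs without the real file.

```python
import io

import pandas as pd

# Stand-in for the CSV file; the rows and the 'last_review' column are
# illustrative assumptions, not data from AB_NYC_2019.csv.
csv_text = """id,neighbourhood,price,last_review
1,Harlem,80,2019-05-21
2,Midtown,225,
3,Chelsea,300,2019-07-05
"""

listings = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["neighbourhood", "price", "last_review"],  # load only the needed columns
    parse_dates=["last_review"],  # convert date strings to datetimes on load
)

# Empty date fields become NaT (the datetime equivalent of NaN)
print(listings.dtypes)
print(listings)
```

With a real file, the io.StringIO(...) argument would simply be the file path.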

Understanding the Dataset

Before we jump into the analysis, let's understand our dataset. We'll look at the shape, columns, and summary statistics.

# Display the shape of the dataset
print(df.shape)

# Display the columns of the dataset
print(df.columns)

# Display summary statistics
print(df.describe())
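By default, describe() summarizes only numeric columns, so it can help to also look at a few raw rows, the column types, and a summary of the text columns. The sketch below uses a tiny made-up DataFrame; the rows and the 'room_type' column are assumptions for illustration, not values from the real dataset.

```python
import pandas as pd

# Tiny made-up sample; the rows are illustrative, not taken from the real file.
sample = pd.DataFrame({
    "neighbourhood": ["Harlem", "Midtown", "Harlem", "Chelsea"],
    "room_type": ["Private room", "Entire home/apt", "Private room", "Shared room"],
    "price": [80, 225, 95, 300],
})

# Peek at the first rows to see what individual records look like
print(sample.head(2))

# Column data types show which columns describe() will summarize by default
print(sample.dtypes)

# Summarize the non-numeric columns: count, unique, top (most frequent), freq
text_summary = sample.select_dtypes(exclude="number").describe()
print(text_summary)
```

The same three calls can be applied to df directly once the real file is loaded.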

Data Cleaning

This step involves handling missing values, duplicate values, and irrelevant data.

# Check for missing values
print(df.isnull().sum())

# Handle missing values (if any)
# Note: dropna() removes every row that has at least one missing value
df = df.dropna()

# Check for duplicates
print(df.duplicated().sum())

# Remove duplicates (if any)
df = df.drop_duplicates()

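# Drop irrelevant columns
# ('column1' and 'column2' are placeholders: substitute the columns you don't need;
# errors='ignore' skips any name that is not present)
df = df.drop(columns=['column1', 'column2'], errors='ignore')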
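Calling dropna() with no arguments discards a row as soon as any single field is missing, which can throw away a lot of usable data. A more selective approach might look like the sketch below. The sample rows, the 'name' and 'reviews_per_month' columns, and the rule that a missing review rate means zero are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample with missing values and one exact duplicate row
sample = pd.DataFrame({
    "name": ["Cozy loft", None, "Sunny studio", "Sunny studio"],
    "price": [120, 90, 150, 150],
    "reviews_per_month": [1.5, np.nan, np.nan, np.nan],
})

# Share of missing values per column helps decide between dropping and filling
missing_share = sample.isnull().mean()
print(missing_share)

# Fill where a default is defensible (assumption: no reviews means a rate of 0)
cleaned = sample.fillna({"reviews_per_month": 0.0})

# Drop rows only when an essential column is missing
cleaned = cleaned.dropna(subset=["name"])

# Remove exact duplicate rows and renumber the index
cleaned = cleaned.drop_duplicates().reset_index(drop=True)
print(cleaned)
```

Which columns count as essential, and which defaults are reasonable, depends on the questions the analysis is meant to answer.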

Exploratory Data Analysis (EDA)

EDA is a crucial step where we explore and visualize the data to understand the underlying patterns.

# Display the distribution of 'price'
plt.figure(figsize=(6,4))
sns.histplot(df['price'], bins=50, kde=True)
plt.show()

# Display the correlation matrix (numeric columns only; text columns cannot be correlated)
plt.figure(figsize=(10,8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()

Data Analysis

Now, let's answer some questions based on our data.

  1. What is the average price of listings in each neighbourhood?
avg_price_neighbourhood = df.groupby('neighbourhood')['price'].mean()
print(avg_price_neighbourhood)
  2. How many listings are available per neighbourhood?
listings_per_neighbourhood = df['neighbourhood'].value_counts()
print(listings_per_neighbourhood)
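The two questions above can also be answered in a single pass with named aggregation, which puts the listing count next to the average price so that averages based on very few listings are easy to spot. This is a minimal sketch on made-up rows; the output column names (listings, mean_price, median_price) are hypothetical labels chosen for this example.

```python
import pandas as pd

# Made-up sample standing in for the cleaned DataFrame
sample = pd.DataFrame({
    "neighbourhood": ["Harlem", "Harlem", "Midtown", "Midtown", "Midtown", "Chelsea"],
    "price": [80, 100, 200, 250, 300, 180],
})

# One groupby producing count, mean, and median per neighbourhood,
# sorted from most to least expensive on average
summary = (
    sample.groupby("neighbourhood")["price"]
    .agg(listings="count", mean_price="mean", median_price="median")
    .sort_values("mean_price", ascending=False)
)
print(summary)
```

Including the median alongside the mean is a design choice: the median is less affected by a handful of extreme prices.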

Conclusion

In this project, we've gone through the complete data analysis process. We started by loading and understanding the data, then cleaned it, explored it, and finally analyzed it to answer some questions. The pandas library made each of these steps straightforward. Keep practicing with different datasets to hone your pandas skills.

Remember, data analysis is an iterative process. You may need to go back and forth between steps, and that's completely normal. Happy analyzing!