
Using pandas for Data Analysis

Introduction to Pandas

Pandas is a powerful data analysis library built on top of the Python programming language. It provides data scientists and analysts with high-performance, easy-to-use data structures and data analysis tools. The name 'Pandas' is derived from 'panel data', an econometrics term for datasets that include observations over multiple time periods for the same individuals.

What is Pandas?

Pandas is a software library for Python that provides high-performance, easy-to-use data structures and data analysis tools. With Pandas, you can manipulate tables and time series and perform a wide range of operations on them.

Installing Pandas

To install pandas, you can use pip, the package manager used to install and manage software packages written in Python.

pip install pandas

Importing Pandas

Once you have installed pandas, you need to import the library. This is done with the import statement.

import pandas as pd

We import pandas as pd. This means that we can use the alias pd instead of the full name pandas whenever we want to use functions from the pandas library. The alias is a widely used convention, not a Python keyword.
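A quick way to confirm that the installation and import worked is to print the installed version. This is only a minimal check; the version string you see depends on your environment.

```python
import pandas as pd

# pd.__version__ holds the installed pandas version as a string.
installed_version = pd.__version__
print(installed_version)
```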

Pandas Data Structures

Pandas primarily uses two types of data structures:

  1. Series: It is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.
s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
  2. DataFrame: It is a two-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dictionary of Series objects.
data = {'Country': ['Belgium', 'India', 'Brazil'],
        'Capital': ['Brussels', 'New Delhi', 'Brasília'],
        'Population': [11190846, 1303171035, 207847528]}
df = pd.DataFrame(data, columns=['Country', 'Capital', 'Population'])
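The two structures described above can be combined into one runnable sketch. It reuses the same sample values; the variable names value_b, capitals, n_rows, and n_cols are illustrative only.

```python
import pandas as pd

# A Series: one-dimensional values with a label index.
s = pd.Series([3, -5, 7, 4], index=['a', 'b', 'c', 'd'])
value_b = s['b']  # look up a value by its index label

# A DataFrame: a table whose columns can have different types.
data = {'Country': ['Belgium', 'India', 'Brazil'],
        'Capital': ['Brussels', 'New Delhi', 'Brasília'],
        'Population': [11190846, 1303171035, 207847528]}
df = pd.DataFrame(data, columns=['Country', 'Capital', 'Population'])

capitals = df['Capital']    # selecting one column returns a Series
n_rows, n_cols = df.shape   # (number of rows, number of columns)

print(value_b)
print(capitals)
print(n_rows, n_cols)
```

Selecting a single column of a DataFrame gives back a Series, which is why a DataFrame can be thought of as a dictionary of Series objects.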

Reading Data

Pandas can read data from various sources, such as CSV files, Excel workbooks, and SQL databases. Here is an example of reading a CSV file.

df = pd.read_csv('filename.csv')
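To try read_csv without a file on disk, the sketch below parses CSV text from an in-memory buffer. The csv_text contents are made-up sample data for illustration; with a real file you would pass its path instead.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for a file on disk.
csv_text = (
    "Country,Capital,Population\n"
    "Belgium,Brussels,11190846\n"
    "India,New Delhi,1303171035\n"
)

# read_csv accepts a path or any file-like object, such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.dtypes)  # column types are inferred from the data
```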

Writing Data

Pandas can also write data to various destinations, such as CSV files, Excel workbooks, and SQL databases. Here is an example of writing a DataFrame to a CSV file.

df.to_csv('filename.csv')
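By default, to_csv also writes the row index as an extra first column. The sketch below passes index=False to leave it out, then reads the text back to check the round trip. It writes to an in-memory buffer rather than a file so it can run anywhere; the sample data is illustrative.

```python
import io

import pandas as pd

df = pd.DataFrame({'Country': ['Belgium', 'Brazil'],
                   'Capital': ['Brussels', 'Brasília']})

# Write to an in-memory buffer; index=False omits the row index column.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read the text back to confirm the data survived the round trip.
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(restored)
```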

Data Exploration

Pandas provides several methods for getting a quick overview of the data.

  • To check the first few records of the DataFrame, use the head() method (five rows by default).
  • To check the last few records of the DataFrame, use the tail() method (five rows by default).
  • To get summary statistics for the numerical columns in a DataFrame, use the describe() method.
df.head()
df.tail()
df.describe()
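The three methods above can be tried on a small made-up DataFrame. The column names id and score are assumptions chosen for this sketch.

```python
import pandas as pd

# Ten rows of illustrative data.
df = pd.DataFrame({'id': range(1, 11),
                   'score': [10.0 * i for i in range(1, 11)]})

first = df.head()        # first 5 rows by default
last3 = df.tail(3)       # pass a number to change how many rows you get
summary = df.describe()  # count, mean, std, min, quartiles, max per numeric column

print(first)
print(last3)
print(summary)
```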

Data Cleaning

Pandas provides several methods to clean the data.

  • To check for null values in the DataFrame, use the isnull() method. It returns a DataFrame of True/False values marking the missing entries.
  • To fill the null values in the DataFrame, use the fillna() method.
  • To drop the rows where at least one element is missing, use the dropna() method.
df.isnull()
df.fillna(value)  # value is a placeholder for the replacement, e.g. 0
df.dropna()
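Here is a small sketch of the cleaning methods above on made-up data with one missing value. Note that fillna() and dropna() return new DataFrames by default and leave the original unchanged.

```python
import pandas as pd

# Illustrative data: one score is missing (NaN).
df = pd.DataFrame({'name': ['a', 'b', 'c'],
                   'score': [1.0, float('nan'), 3.0]})

# isnull() marks missing entries; summing counts them per column.
missing_per_column = df.isnull().sum()

# Fill missing scores with 0.0; a dict sets a fill value per column.
filled = df.fillna({'score': 0.0})

# Drop every row that contains at least one missing value.
dropped = df.dropna()

print(missing_per_column)
print(filled)
print(dropped)
```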

Data Manipulation

Pandas provides various methods to manipulate the data.

  • To select a single column, put the column name inside square brackets after the DataFrame. The result is a Series.
  • To select multiple columns, put a list of column names inside the brackets. The result is a DataFrame.
  • To filter records, put a Boolean condition inside the brackets. Only the rows where the condition is True are kept.
  • To apply a function to every value in a column, use the apply() method.
df['column_name']
df[['column1', 'column2']]
df[df['column_name'] > 50]
df['column_name'].apply(lambda x: x*2)
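The four operations above can be run on a small made-up DataFrame. The column names name, score, and age are assumptions for this sketch.

```python
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b', 'c'],
                   'score': [40, 60, 80],
                   'age': [20, 30, 40]})

one_column = df['score']             # single column -> Series
two_columns = df[['name', 'score']]  # list of names -> DataFrame
high = df[df['score'] > 50]          # keep rows where the condition is True

# apply() calls the function once per value in the column.
doubled = df['score'].apply(lambda x: x * 2)

# For simple arithmetic, operating on the whole column gives the same result.
vectorized = df['score'] * 2

print(high)
print(doubled)
```

For simple arithmetic like this, the whole-column form is usually preferred because it avoids calling a Python function for every value. apply() is most useful when the logic cannot be expressed with built-in column operations.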

This is a basic introduction to data analysis with pandas. Pandas offers many more features, such as merging and joining DataFrames, reshaping data, and building pivot tables, which you can explore as you progress further. Happy learning!