
What is Pandas

Pandas is a Python library for data manipulation and analysis, originally developed by Wes McKinney. It's built on the NumPy package, and its key data structure is the DataFrame. DataFrames allow you to store and manipulate tabular data in rows of observations and columns of variables.

There are several ways to create a DataFrame. One way is to use a dictionary.

import pandas as pd

data = {
'apples': [3, 2, 0, 1],
'oranges': [0, 3, 7, 2]
}

purchases = pd.DataFrame(data)

print(purchases)

This will output:

   apples  oranges
0       3        0
1       2        3
2       0        7
3       1        2

The DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dictionary of Series objects. It is generally the most commonly used pandas object.

Pandas DataFrames are great for representing real data: rows correspond to instances (examples, observations, etc.), and columns correspond to features of these instances.
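To make the "dictionary of Series" picture concrete, here is a minimal sketch. The 'customer' column and its names are hypothetical values added for illustration; they are not part of the purchases data above.

```python
import pandas as pd

# Hypothetical data: one integer column and one text column.
purchases = pd.DataFrame({
    'apples': [3, 2, 0, 1],
    'customer': ['June', 'Robert', 'Lily', 'David'],
})

# Selecting a single column gives back a Series.
apples = purchases['apples']
print(type(apples).__name__)

# Each column carries its own dtype, so types can differ across columns.
print(purchases.dtypes)
```

Each row is one observation (a purchase) and each column is one feature of it, which is the layout described above.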

Reading data from CSVs

With CSV files all you need is a single line to load in the data:

df = pd.read_csv('purchases.csv')
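Since purchases.csv is not shown here, the sketch below reads from an in-memory string instead, assuming a file laid out with a 'customer' column followed by the fruit counts. The column name and the customer names are assumptions made for illustration.

```python
import io

import pandas as pd

# Stand-in for the contents of a CSV file (hypothetical layout).
csv_text = """customer,apples,oranges
June,3,0
Robert,2,3
Lily,0,7
David,1,2
"""

# read_csv accepts a path or any file-like object.
# index_col tells pandas which column to use as the row index.
df = pd.read_csv(io.StringIO(csv_text), index_col='customer')

print(df)
```

With a real file you would pass the path in place of the StringIO object, as in the one-liner above.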

Reading data from SQL databases

If you're working with data from a SQL database, first establish a connection using an appropriate Python library, then pass a query to pandas. Here we'll use SQLite to demonstrate.

import sqlite3

con = sqlite3.connect("database.db")

df = pd.read_sql_query("SELECT * FROM purchases", con)

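Both read_csv and read_sql_query accept an index_col argument, so if the table has a column named 'index' we could pass index_col='index' when loading. We can also set the index after the fact. Note that set_index returns a new DataFrame rather than modifying df in place, so assign the result:

df = df.set_index('index')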
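For a version that runs without an existing database.db, the following sketch builds a small purchases table in an in-memory SQLite database first. The table layout and the 'customer' column are assumptions made for illustration.

```python
import sqlite3

import pandas as pd

# An in-memory database stands in for database.db.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE purchases (customer TEXT, apples INTEGER, oranges INTEGER)"
)
con.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [("June", 3, 0), ("Robert", 2, 3), ("Lily", 0, 7), ("David", 1, 2)],
)
con.commit()

# Run the query and load the result into a DataFrame.
df = pd.read_sql_query("SELECT * FROM purchases", con)

# set_index returns a new DataFrame, so the result is reassigned.
df = df.set_index("customer")
con.close()

print(df)
```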

Converting back to a CSV, JSON, or SQL

After cleaning your data, you'll usually want to save it in the format of your choice. Pandas provides writer methods that mirror the readers used above:

df.to_csv('new_purchases.csv')

df.to_json('new_purchases.json')

df.to_sql('new_purchases', con)

Where df is the DataFrame you want to save.
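One way to check that a save worked is to read the data back. The sketch below does a round trip through CSV text and through an in-memory SQLite table, so it leaves no files behind. Passing index=False and if_exists='replace' are choices made for this example, not requirements.

```python
import io
import sqlite3

import pandas as pd

df = pd.DataFrame({'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]})

# With no path argument, to_csv returns the CSV text instead of writing a file.
# index=False leaves the row index out of the output.
csv_text = df.to_csv(index=False)
restored = pd.read_csv(io.StringIO(csv_text))

# to_sql writes the DataFrame to a table; if_exists='replace' overwrites
# any existing table with the same name.
con = sqlite3.connect(':memory:')
df.to_sql('new_purchases', con, index=False, if_exists='replace')
from_sql = pd.read_sql_query('SELECT * FROM new_purchases', con)
con.close()

print(restored)
print(from_sql)
```

By default to_csv and to_sql also write the row index; whether you want that depends on whether the index carries meaningful labels.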

Basic DataFrame operations

Just like with NumPy ndarrays, we can perform element-wise arithmetic on DataFrames and their columns. Here we create a new column by adding the 'apples' column to the 'oranges' column, row by row:

df['total_fruit'] = df['apples'] + df['oranges']
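As a self-contained sketch, the same operation on the purchases data from earlier looks like this. The 'apples_doubled' column is a hypothetical extra, added to show that a scalar is applied to every element.

```python
import pandas as pd

df = pd.DataFrame({'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]})

# Column + column: values are added row by row.
df['total_fruit'] = df['apples'] + df['oranges']

# Column * scalar: the scalar is applied to every element.
df['apples_doubled'] = df['apples'] * 2

print(df)
```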

DataFrame slicing, selecting, extracting

It's also easy to select and extract data from DataFrames, and there are several ways to do it. We can extract by column name, by index label (row name), and by integer position. For instance, to select all the 'apples' values:

apples = df['apples']

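We can also use the loc and iloc indexers to select rows: loc selects by index label and iloc by integer position. For example, df.loc[1] returns all column values for the row whose index label is 1; with the default integer index, df.iloc[1] returns the same row.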
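The difference between the two is easier to see when the index holds labels rather than the default integers. In this sketch the customer names used as the index are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {'apples': [3, 2, 0, 1], 'oranges': [0, 3, 7, 2]},
    index=['June', 'Robert', 'Lily', 'David'],
)

# loc selects by label, iloc by integer position.
by_label = df.loc['Robert']
by_position = df.iloc[1]

# Label slices with loc include the end label.
subset = df.loc['Robert':'Lily', ['apples']]

# Position slices with iloc exclude the end position, like Python lists.
first_two = df.iloc[0:2]

print(by_label)
print(subset)
print(first_two)
```

Here df.loc['Robert'] and df.iloc[1] refer to the same row, because 'Robert' is the label at position 1.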

Conclusion

We've only just scratched the surface of what pandas can do. As we delve deeper into the library, we'll learn about more complex operations such as merging, grouping, and reshaping data. Pandas is an extensive library with a lot of functionality, but these basics should be enough to get you off the ground. Happy data wrangling!