[…] is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.
Open your Terminal:
Let’s have a quick tour!
Fundamental package for high-performance data manipulation with Python
👉 NumPy Cheat Sheet to print/bookmark
The key concept NumPy introduces is the N-dimensional Array (ndarray)
Characteristics of the ndarray:
import numpy as np
# Let's build a 2D-array from a list of lists
data_list = [
    [ 0,  1,  2,  3,  4],
    [10, 11, 12, 13, 14],
    [20, 21, 22, 23, 24],
    [30, 31, 32, 33, 34],
    [40, 41, 42, 43, 44],
]
data_np = np.array(data_list)
data_np
array([[ 0, 1, 2, 3, 4],
[10, 11, 12, 13, 14],
[20, 21, 22, 23, 24],
[30, 31, 32, 33, 34],
[40, 41, 42, 43, 44]])
ndarray[start:stop:step]
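Here is a small sketch of the start:stop:step syntax, applied to the same 5x5 array as above. The variable names are only illustrative.

```python
import numpy as np

# Rebuild the 5x5 array from the example above (value = 10 * row + column)
data_np = np.array([[10 * row + col for col in range(5)] for row in range(5)])

first_rows = data_np[0:2]       # rows 0 and 1 (stop is excluded)
every_other = data_np[::2]      # rows 0, 2 and 4 (step of 2)
sub_block = data_np[1:3, 2:5]   # rows 1-2 and columns 2-4
last_column = data_np[:, -1]    # every row, last column, as a 1D array

print(sub_block)
print(last_column)
```

Slices are separated by a comma, one per axis: rows first, then columns.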
Let’s compute the sum, row by row (8 additions), to create a 1D-vector
Axes 🤯
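To make the axis idea concrete, here is a minimal sketch using the 5x5 array built earlier (not necessarily the array the slide illustrated). With axis=1 the sum runs along each row, with axis=0 it runs down each column.

```python
import numpy as np

# Same 5x5 array as in the earlier example
data_np = np.array([[10 * row + col for col in range(5)] for row in range(5)])

row_sums = data_np.sum(axis=1)   # one value per row: collapses the columns
col_sums = data_np.sum(axis=0)   # one value per column: collapses the rows
total = data_np.sum()            # no axis: a single scalar

print(row_sums)   # [ 10  60 110 160 210]
print(col_sums)   # [100 105 110 115 120]
print(total)      # 550
```

A handy way to remember it: the axis you pass is the one that disappears from the result's shape.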
Boolean Indexing 🔥
Build a boolean mask from an ndarray.
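A minimal sketch of a boolean mask, again on the 5x5 array from earlier. The threshold of 30 is an arbitrary choice for illustration.

```python
import numpy as np

# Same 5x5 array as in the earlier example
data_np = np.array([[10 * row + col for col in range(5)] for row in range(5)])

# Comparing an ndarray to a scalar yields an ndarray of booleans, same shape
mask = data_np > 30

# Indexing with the mask keeps only the elements where the mask is True.
# The result is always a flat 1D array.
selected = data_np[mask]

print(mask.shape)
print(selected)
```

Masks can be combined with & (and), | (or) and ~ (not), with each condition wrapped in parentheses.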
👉 Pandas builds on NumPy to solve these problems
[…] is an open source library providing high-performance easy-to-use data structures and data analysis tools for Python.
👉 Pandas cheat sheet to print/bookmark
Series: Pandas’ equivalent of NumPy’s 1D-array (the two share many methods). It has an additional index and supports multiple data types.
DataFrame: Pandas’ equivalent of a NumPy 2D-array. It has additional labels on both axes (rows and columns) and supports multiple data types.
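As a quick sketch of both structures, the example below builds a Series and a DataFrame by hand. The labels and values are made up for illustration.

```python
import pandas as pd

# A Series: 1D values plus an index of labels
scores = pd.Series([1.5, 2.5, 3.5], index=['a', 'b', 'c'])
print(scores['b'])     # access by label: 2.5
print(scores.sum())    # NumPy-style method: 7.5

# A DataFrame: 2D, with labels on rows (index) and columns,
# and each column free to have its own data type
table = pd.DataFrame(
    {'name': ['x', 'y', 'z'], 'value': [10, 20, 30]},
    index=['r1', 'r2', 'r3'],
)
print(table.dtypes)

# Each column of a DataFrame is itself a Series
print(type(table['value']))
```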
Let’s start a new notebook to explore the following dataset: Countries of the World.
You can have a look at it in this Gist and download it with:
This is how notebooks typically start:
In a new cell:
Go ahead and load the CSV into a countries_df DataFrame:
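Loading could look like the sketch below. Since the actual file path is not given here, the example reads from an in-memory stand-in with fictional countries. The Region and Population column names are assumptions about the dataset.

```python
import io
import pandas as pd

# Hypothetical stand-in for the downloaded file (fictional countries, toy values)
csv_text = """Country,Region,Population
Freedonia,NORTH,1200000000
Sylvania,NORTH,5000000
Genovia,SOUTH,80000000
"""

# With a real file you would pass its path instead,
# e.g. pd.read_csv('path/to/your_file.csv')
countries_df = pd.read_csv(io.StringIO(csv_text))

print(countries_df)
```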
Here are some utility methods to call on a fresh DataFrame:
Replace .shape with:
You can also do:
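The exact calls from the slides are not reproduced here. As a sketch, these are commonly used inspection methods and attributes, run on a toy DataFrame with fictional countries and assumed column names.

```python
import pandas as pd

# Toy data: fictional countries, illustrative values, assumed column names
countries_df = pd.DataFrame({
    'Country': ['Freedonia', 'Sylvania', 'Genovia', 'Latveria'],
    'Region': ['NORTH', 'NORTH', 'SOUTH', 'SOUTH'],
    'Population': [1_200_000_000, 5_000_000, 80_000_000, 1_500_000_000],
})

print(countries_df.head(2))      # first rows
print(countries_df.shape)        # (number of rows, number of columns)
print(countries_df.dtypes)       # data type of each column
countries_df.info()              # index, columns, non-null counts, memory
print(countries_df.describe())   # summary statistics of numeric columns
```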
You can manipulate a DataFrame in the same way you query a relational database’s table.
Use the [] syntax to get one or many columns:
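A minimal sketch of column selection, on a toy DataFrame with fictional countries and assumed column names. A single name returns a Series, while a list of names returns a DataFrame.

```python
import pandas as pd

# Toy data: fictional countries, illustrative values, assumed column names
countries_df = pd.DataFrame({
    'Country': ['Freedonia', 'Sylvania', 'Genovia'],
    'Region': ['NORTH', 'NORTH', 'SOUTH'],
    'Population': [1_200_000_000, 5_000_000, 80_000_000],
})

one_column = countries_df['Population']                # a Series
many_columns = countries_df[['Country', 'Population']]  # a DataFrame (note the double brackets)

print(type(one_column))
print(type(many_columns))
```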
👉 After the lecture, read this Stack Overflow Q&A thread
🤔 What are the countries with more than one billion inhabitants?
Pure Python (naive) implementation:
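A sketch of what the naive loop could look like, next to the boolean-indexing version, on toy data with fictional countries. The Population column name is an assumption.

```python
import pandas as pd

# Toy data: fictional countries, illustrative values, assumed column names
countries_df = pd.DataFrame({
    'Country': ['Freedonia', 'Sylvania', 'Genovia', 'Latveria'],
    'Population': [1_200_000_000, 5_000_000, 80_000_000, 1_500_000_000],
})

# Naive version: loop over the rows one by one in pure Python
big_naive = []
for _, row in countries_df.iterrows():
    if row['Population'] > 1_000_000_000:
        big_naive.append(row['Country'])

# Pandas version: build a boolean mask, then index with it
mask = countries_df['Population'] > 1_000_000_000
big_df = countries_df[mask]

print(big_naive)
print(big_df)
```

Both give the same countries. The mask version avoids the Python-level loop, which usually matters on larger DataFrames.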
🤔 What are the countries of the American continent?
🤔 What are the countries of Europe?
We can use pandas.Series.isin()
But why are there no results?
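One plausible cause, sketched below on toy data, is stray whitespace in the values. Here the fictional country names carry a trailing space, so an exact-match test such as isin() finds nothing until the strings are stripped.

```python
import pandas as pd

# Toy data: fictional countries whose names carry a trailing space (assumption)
countries_df = pd.DataFrame({
    'Country': ['Freedonia ', 'Sylvania ', 'Genovia '],
    'Population': [1_200_000_000, 5_000_000, 80_000_000],
})

wanted = ['Freedonia', 'Genovia']

# isin() compares exact values: 'Freedonia ' is not 'Freedonia'
before = countries_df[countries_df['Country'].isin(wanted)]
print(len(before))   # 0

# Stripping the whitespace first makes the comparison succeed
clean_names = countries_df['Country'].str.strip()
after = countries_df[clean_names.isin(wanted)]
print(len(after))    # 2
```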
countries_df['Country'] = countries_df['Country'].map(str.strip)
countries_df.set_index('Country', inplace=True)
The index is no longer a sequence of integers, but instead the countries’ names!
We now can do something like this:
loc vs iloc
Note the difference between loc and iloc: loc selects rows and columns by label, while iloc selects them by integer position 😉
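A small side-by-side sketch of both accessors, on a toy DataFrame indexed by fictional country names. The Region and Population columns are assumptions.

```python
import pandas as pd

# Toy data: fictional countries, illustrative values, assumed column names
countries_df = pd.DataFrame({
    'Country': ['Freedonia', 'Sylvania', 'Genovia'],
    'Region': ['NORTH', 'NORTH', 'SOUTH'],
    'Population': [1_200_000_000, 5_000_000, 80_000_000],
}).set_index('Country')

# loc: labels for rows and columns
by_label = countries_df.loc['Genovia', 'Population']

# iloc: integer positions for rows and columns (row 0, column 1)
by_position = countries_df.iloc[0, 1]

# Slices differ too: loc includes the stop label, iloc excludes the stop position
label_slice = countries_df.loc['Freedonia':'Genovia']   # 3 rows
position_slice = countries_df.iloc[0:2]                 # 2 rows

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```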
We can sort by the index with pandas.DataFrame.sort_index:
We can sort by specific columns with pandas.DataFrame.sort_values:
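Both sorts can be sketched on a toy DataFrame indexed by fictional country names. The Population column is an assumption.

```python
import pandas as pd

# Toy data: fictional countries, illustrative values, assumed column names
countries_df = pd.DataFrame({
    'Country': ['Freedonia', 'Sylvania', 'Genovia', 'Latveria'],
    'Population': [1_200_000_000, 5_000_000, 80_000_000, 1_500_000_000],
}).set_index('Country')

# Sort rows alphabetically by their index labels
by_name = countries_df.sort_index()

# Sort rows by a column, largest first
by_population = countries_df.sort_values('Population', ascending=False)

print(by_name)
print(by_population)
```

Both methods return a new DataFrame and leave countries_df untouched.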
Very close to GROUP BY in SQL; it’s a 3-step process:
Split: a DataFrame is split into groups, depending on chosen keys
Apply: an aggregation function (sum, mean, etc.) is applied to each group
Combine: results from the previous operations are merged (i.e. reduced) into one new DataFrame
🤔 Which region of the world is the most populated?
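The split / apply / combine steps can be sketched as follows, on toy data with fictional countries. The Region and Population column names are assumptions about the dataset.

```python
import pandas as pd

# Toy data: fictional countries, illustrative values, assumed column names
countries_df = pd.DataFrame({
    'Country': ['Freedonia', 'Sylvania', 'Genovia', 'Latveria'],
    'Region': ['NORTH', 'NORTH', 'SOUTH', 'SOUTH'],
    'Population': [1_200_000_000, 5_000_000, 80_000_000, 1_500_000_000],
})

# Split by Region, apply sum to each group's Population, combine into one Series
population_by_region = countries_df.groupby('Region')['Population'].sum()

# Rank the regions, and pick the label of the largest one
ranked = population_by_region.sort_values(ascending=False)
most_populated = population_by_region.idxmax()

print(ranked)
print(most_populated)
```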
pandas.DataFrame.resample() is primarily used for time series data: it groups rows into regular time intervals so that they can be aggregated.
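A minimal sketch of resampling, using a made-up daily series that is aggregated into 2-day bins. The dates and values are purely illustrative.

```python
import pandas as pd

# Six daily observations (toy values) indexed by date
dates = pd.date_range('2024-01-01', periods=6, freq='D')
daily = pd.Series([1, 2, 3, 4, 5, 6], index=dates)

# Group the days into 2-day bins and sum each bin
two_day = daily.resample('2D').sum()

print(two_day)
```

It works like a groupby whose keys are time intervals, so it needs a datetime-like index (or a datetime column passed through the on parameter).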
pandas.DataFrame.stack() moves the column levels into the index. It returns a new DataFrame or Series with a multi-level index.
df_a = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
print(df_a, '\n')
df_s = df_a.stack()
print(df_s, '\n')
Randomly sample n observations.
import pandas as pd
# Create a dictionary of students
students = {
'Name': ['Lisa', 'Kate', 'Ben', 'Kim', 'Josh',
'Alex', 'Evan', 'Greg', 'Sam', 'Ella','Ahmed','Joe','Mark'],
'ID': ['001', '002', '003', '004', '005', '006',
'007', '008', '009', '010','011','012','013'],
'Grade': ['A', 'A', 'C', 'B', 'B', 'B', 'C',
'A', 'A', 'A','A', 'C', 'B'],
'Category': [2, 3, 1, 3, 2, 3, 3, 1, 2, 1,3, 2, 3]
}
# Create dataframe from students dictionary
df = pd.DataFrame(students)
# view the dataframe
df
df.sample(5)
Random sample by group (within group)
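One way to sample within each group is to chain groupby() with sample(), as sketched below on a trimmed copy of the students data. This assumes pandas 1.1 or later, where sampling on a groupby object is available. The random_state value is arbitrary and only there to make the draw repeatable.

```python
import pandas as pd

# Trimmed copy of the students data from above
students = {
    'Name': ['Lisa', 'Kate', 'Ben', 'Kim', 'Josh', 'Evan'],
    'Grade': ['A', 'A', 'C', 'B', 'B', 'C'],
}
df = pd.DataFrame(students)

# Draw one random student from each Grade group
sampled = df.groupby('Grade').sample(n=1, random_state=0)

print(sampled)
```

Use frac instead of n to draw a proportion of each group rather than a fixed count.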