Data Management

Plan of the Module

  1. Data Analysis
  2. Data Sourcing
  3. Data Visualization

Jupyter

[…] is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.

👉 Jupyter.org

Open your Terminal:

cd ~/some/where
jupyter notebook

Let’s have a quick tour!

NumPy

Fundamental package for high-performance data manipulation with Python

👉 NumPy.org

👉 NumPy Cheat Sheet to print/bookmark

The key concept NumPy introduces is the N-dimensional Array (ndarray)

Characteristics of the ndarray:

  • it is multidimensional
  • data is homogeneous (a single dtype)
  • it has a fixed size defined upon creation

import numpy as np # canonical import

list_ = [[1, 2, 3], [4, 5, 6]]
print(type(list_))
<class 'list'>
list_ # list of lists
[[1, 2, 3], [4, 5, 6]]

array = np.array([[1, 2, 3], [4, 5, 6]]) 
print(type(array))
<class 'numpy.ndarray'>
array # ndarray
array([[1, 2, 3],
       [4, 5, 6]])

# Key attributes of ndarrays

print('array.ndim: ', array.ndim)
array.ndim:  2
print('array.shape:', array.shape)
array.shape: (2, 3)
print('array.size: ', array.size)
array.size:  6
print('array.dtype:', array.dtype)
array.dtype: int64

Data Selection 😎


# Let's build a 2D-array from a list of lists
data_list = [
    [ 0,  1,  2,  3,  4],
    [10, 11, 12, 13, 14],
    [20, 21, 22, 23, 24],
    [30, 31, 32, 33, 34],
    [40, 41, 42, 43, 44],
]

data_np = np.array(data_list)
data_np
array([[ 0,  1,  2,  3,  4],
       [10, 11, 12, 13, 14],
       [20, 21, 22, 23, 24],
       [30, 31, 32, 33, 34],
       [40, 41, 42, 43, 44]])

Select 1D slice (row)


# Pure Python
data_list[2][1:4]
[21, 22, 23]

# NumPy
data_np[2, 1:4] # data_np[row(s), column(s)]
array([21, 22, 23])

Select 1D slice (column)


# Pure Python
selection = []

for index, row in enumerate(data_list):
    if index > 0:
        selection.append(row[4])

selection # we could also have used a list comprehension for fewer lines
[14, 24, 34, 44]

# NumPy
data_np[1:, 4] # '1:' means from row 1 until the end
array([14, 24, 34, 44])

General Syntax for Slicing

ndarray[start:stop:step]


array = np.arange(0, 10)
array
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
array[1:7:2]
array([1, 3, 5])

Vectorized Operations ⚡️

Let’s compute the sum, row by row (8 additions), to create a 1D-vector


my_list = [
    [6, 5],
    [1, 3],
    [5, 6],
    [1, 4],
    [3, 7],
    [5, 8],
    [3, 5],
    [8, 4],
]


# Python way
sums = []

for row in my_list:
    sums.append(row[0] + row[1]) # standard integer "+" operator

sums
[11, 4, 11, 5, 10, 13, 8, 12]


# The NumPy Way
array = np.array(my_list)

my_sum = array[:, 0] + array[:, 1] # vectorial "+" operator
my_sum
array([11,  4, 11,  5, 10, 13,  8, 12])

Axes 🤯

2D Example


np.sum(array, axis=0)  # eq. to array[0, :] + array[1, :] + array[2, :] + ...
array([32, 42])
np.sum(array, axis=1)  # eq. to array[:, 0] + array[:, 1] + ...
array([11,  4, 11,  5, 10, 13,  8, 12])

How much faster is NumPy? ⚡️


# 2D-array of shape (10,000, 10,000) with random floats in the interval [0, 1]. That's 100M numbers!
array = np.random.rand(10000, 10000)
array_list = array.tolist()
%%time
total = 0

for row in array_list:
    for number in row:
        total += number

round(total, 2)
50003174.53
%%time
round(np.sum(array), 2)
50003174.53

Boolean Indexing 🔥

Build a boolean mask from an ndarray.
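
A minimal sketch, reusing the data_np array from the Data Selection slides: comparing the array to a scalar gives a boolean ndarray of the same shape, which can then be used to index the array.

mask = data_np > 30 # element-wise comparison returns a boolean ndarray of shape (5, 5)
data_np[mask]       # keeps only the elements where the mask is True
array([31, 32, 33, 34, 40, 41, 42, 43, 44])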

Limitations of NumPy

  1. Lack of support for column names
  2. Only one data type per ndarray
  3. Some useful data processing methods are missing

👉 Pandas builds on NumPy to solve these problems

Introduction to Pandas

[…] is an open source library providing high-performance easy-to-use data structures and data analysis tools for Python.

👉 Pandas.pydata.org

👉 Pandas cheat sheet to print/bookmark

Pandas Series

Pandas’ equivalent to NumPy’s 1D-array (both accept the same methods):

  • it has an additional index
  • it supports multiple data types

import pandas as pd # canonical import

series = pd.Series(data=[1, 2, 'three'], index=['id1', 'id2', 'id3'])
series = pd.Series({'id1': 1, 'id2': 2, 'id3': 'three'}) # equivalent construction from a dict

series
id1        1
id2        2
id3    three
dtype: object

Pandas DataFrames

Pandas’ equivalent of a NumPy 2D-array:

  • it has additional labels on both axes (rows and columns)
  • it supports multiple data types

import pandas as pd

df = pd.DataFrame(
    [[4, 7, 10],
    [5, 8, 11],
    [6, 9, 12]],
    index=['row_1', 'row_2', 'row_3'],
    columns=['col_a', 'col_b', 'col_c']
)

df
       col_a  col_b  col_c
row_1      4      7     10
row_2      5      8     11
row_3      6      9     12

A DataFrame is a dictionary of Series

apples  = pd.Series(data=[1, 2, 3], index=['id1', 'id2', 'id3'])
oranges = pd.Series(data=[4, 5, 6], index=['id1', 'id2', 'id3'])

dict_ = {
    'apples': apples,
    'oranges': oranges,
}

pd.DataFrame(dict_)
     apples  oranges
id1       1        4
id2       2        5
id3       3        6

Exploratory Data Analysis (EDA)

Let’s start a new notebook to explore the following dataset: Countries of the World.

You can have a look at it in this Gist and download it with:

curl -s -L https://lew.ag/countries-dataset > countries.csv
head -n 3 countries.csv

This is how notebooks typically start:

import numpy as np
import pandas as pd

Notebook Superpowers

In a new cell:

pd.read<TAB>
pd.read_csv<SHIFT+TAB> # (up to four times)

Go ahead and load the CSV into a countries_df DataFrame:

file = 'countries.csv' # path relative to your notebook
countries_df = pd.read_csv(file, decimal=',')

Get a Quick Sense of the Data

Here are some utility methods to call on a fresh DataFrame:

countries_df.shape # => Tuple: Dims of the DataFrame

Replace .shape with .dtypes, .columns, .describe() or .info() to inspect other aspects of the DataFrame.

You can also count the missing values per column:

countries_df.isnull().sum()

Get a Quick Look

countries_df.head()

countries_df.tail()

Same logic as SQL!

You can manipulate a DataFrame in the same way you query a relational database’s table.

👉 Pandas documentation: comparison with SQL
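
For instance, a minimal sketch of a SQL query and its Pandas counterpart using Boolean Indexing (covered below), assuming a SQL table countries with the same columns as countries_df:

# SQL: SELECT * FROM countries WHERE "Population" > 100000000
countries_df[countries_df['Population'] > 100_000_000]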

Reading Columns

Use the [] syntax to get one or many columns:

countries_df['Country']

type(countries_df['Country']) # => pandas.core.series.Series
countries_df[['Country', 'Region']]

type(countries_df[['Country']]) # => pandas.core.frame.DataFrame

Group of Rows/Columns

countries_df.loc[0:5, ['Country', 'Region']] # from row index 0 to 5 (included)

👉 After the lecture, read this Stackoverflow Q&A thread

Selection based on multiple conditions

countries_df.where((countries_df['Population'] < 1447700) \
    & (countries_df['Net migration'] < 0))

countries_df.loc[(countries_df['Population'] > 1_000_000) \
    & ~countries_df['Literacy (%)'].isnull()]

Boolean Indexing with Pandas

🤔 What are the countries with more than one billion inhabitants?

Pure Python (naive) implementation:

big_countries = []

for index, country in countries_df.iterrows():
    if country['Population'] > 1_000_000_000:
        big_countries.append(country)

pd.DataFrame(big_countries)

In Pandas, this is a one-liner with Boolean Indexing:

countries_df[countries_df['Population'] > 1_000_000_000]

🤔 What are the countries of the American continent?

american = countries_df['Region'].str.contains('AMER')
countries_df[american]

🤔 What are the countries of Europe?

We can use pandas.Series.isin()

countries_df[countries_df['Region'].isin(['WESTERN EUROPE', 'EASTERN EUROPE'])]

But why are there no results?

countries_df['Region'].unique()

We need to clean up first:

countries_df['Region'] = countries_df['Region'].str.strip()

If we want to answer the inverse question, we can use the bitwise operator ~:

countries_df[~countries_df['Region'] \
    .isin(['WESTERN EUROPE', 'EASTERN EUROPE'])]

Re-Indexing

countries_df['Country'] = countries_df['Country'].map(str.strip)
countries_df.set_index('Country', inplace=True)

The index is no longer a sequence of integers, but instead the countries’ names!

We now can do something like this:

# Get region names and population from France to Germany
countries_df.loc['France':'Germany', ['Region', 'Population']]

loc vs iloc

Note the difference between loc and iloc: loc selects rows and columns by label, while iloc selects them by integer position 😉
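
A quick sketch on the re-indexed countries_df:

countries_df.loc['France', ['Region', 'Population']] # by labels: row 'France', two named columns
countries_df.iloc[0:3, 0:2]                          # by integer positions: first 3 rows, first 2 columns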

Sorting

We can sort by the index with pandas.DataFrame.sort_index:

countries_df.sort_index(ascending=False)

We can sort by specific columns with pandas.DataFrame.sort_values:

countries_df.sort_values(by='Population', ascending=False)
# Makes sure NaNs are shown at the top
countries_df.sort_values(by='GDP ($ per capita)', na_position='first')

Grouping

Very close to GROUP BY in SQL; it’s a 3-step process:

  1. Split: a DataFrame is split into groups, depending on chosen keys

  2. Apply: an aggregative function (sum, mean, etc.) is applied to each group

  3. Combine: results from the previous operations are merged (i.e. reduced) into one new DataFrame

🤔 Which region of the world is the most populated?

regions = countries_df.groupby('Region')

regions[['Population', 'Area (sq. mi.)']].sum()
        
regions[['Population', 'Area (sq. mi.)']].sum() \
    .sort_values('Population', ascending=False)

Resample

The pandas DataFrame.resample() method is primarily used for time series data (it needs a datetime-like index). Common frequency aliases:

  • W : weekly frequency
  • M : month end frequency
  • SM : semi-month end frequency (15th and end of month)
  • Q : quarter end frequency

# Resample the time series by month and
# compute the mean closing price of each month
monthly_resampled_data = df.close.resample('M').mean()
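
The df above is assumed to already be a time series with a close column; here is a self-contained sketch with a DatetimeIndex:

dates = pd.date_range('2021-01-01', periods=90, freq='D') # 90 consecutive days
df = pd.DataFrame({'close': np.random.rand(90) * 100}, index=dates)

df.close.resample('M').mean() # one mean closing price per month end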

Stack & Pivot

Pandas stack() moves the column labels into the innermost level of the row index. It returns a new Series or DataFrame with a multi-level index.

df_a = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
print(df_a, '\n')

df_s = df_a.stack()
print(df_s, '\n')
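
The slide title also mentions pivot; here is a minimal sketch of DataFrame.pivot(), which reshapes in the other direction (long format to wide format):

df_long = pd.DataFrame({
    'date': ['d1', 'd1', 'd2', 'd2'],
    'city': ['Paris', 'Lyon', 'Paris', 'Lyon'],
    'temp': [5, 7, 6, 8],
})

df_long.pivot(index='date', columns='city', values='temp') # one row per date, one column per city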

map - apply

# MAP (for Series)

series.map(function)     # apply a function element-wise
series.map(mapping_dict) # map each value to another through a dict

# APPLY (for DataFrame)

df.apply(lambda col: col.max(), axis=0) # default axis: the function receives each column

df.apply(lambda row: row['A'] + row['B'], axis=1) # the function receives each row
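
A concrete sketch, with a small Series and the df_a DataFrame from the Stack & Pivot slide:

grades = pd.Series(['a', 'b', 'a'])
grades.map(str.upper)        # element-wise function -> A, B, A
grades.map({'a': 1, 'b': 2}) # dict mapping -> 1, 2, 1

df_a.apply(lambda col: col.max(), axis=0)           # max of each column -> A: 3, B: 4
df_a.apply(lambda row: row['A'] + row['B'], axis=1) # sum per row -> 3, 7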

Replace

# 'data' is a DataFrame with an 'Alley' column: replace its missing values with the string "NoAlley"
data.Alley.replace(np.nan, "NoAlley", inplace=True)
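
A self-contained sketch of the same idea:

data = pd.DataFrame({'Alley': ['Paved', np.nan, 'Gravel', np.nan]})
data['Alley'] = data['Alley'].replace(np.nan, 'NoAlley') # same result, without inplace
data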

Sampling & Stratified sampling

Randomly sample n observations with DataFrame.sample().

import pandas as pd
  
# Create a dictionary of students
students = {
    'Name': ['Lisa', 'Kate', 'Ben', 'Kim', 'Josh', 'Alex', 'Evan',
             'Greg', 'Sam', 'Ella', 'Ahmed', 'Joe', 'Mark'],
    'ID': ['001', '002', '003', '004', '005', '006', '007',
           '008', '009', '010', '011', '012', '013'],
    'Grade': ['A', 'A', 'C', 'B', 'B', 'B', 'C',
              'A', 'A', 'A', 'A', 'C', 'B'],
    'Category': [2, 3, 1, 3, 2, 3, 3, 1, 2, 1, 3, 2, 3],
}
  
# Create dataframe from students dictionary
df = pd.DataFrame(students)
  
# view the dataframe
df

df.sample(5)

Random sample by group (within group)

df.groupby("Category").sample(2)

df.groupby("Category").sample(frac=0.2, random_state=1)

Plotting

%matplotlib inline
import matplotlib

gdp = 'GDP ($ per capita)'

top_ten_countries_df = countries_df[[gdp]] \
    .sort_values(gdp, ascending=False) \
    .head(10)

#top_ten_countries_df
top_ten_countries_df.plot(kind="bar")