This notebook was created by Jean de Dieu Nyandwi out of love for the machine learning community. For any feedback, errors, or suggestions, he can be reached by email (johnjw7084 at gmail dot com), Twitter, or LinkedIn.
Data Manipulation with Pandas¶
In this lab, you will learn how to manipulate data with Pandas. Here is an overview:
1. Basics of Pandas for data manipulation¶
A. Series and DataFrames¶
Both Series and DataFrames are Pandas data structures.
A Series is like a one-dimensional NumPy array with axis labels.
A DataFrame is like a two-dimensional NumPy array with labels on both rows and columns.
Working with NumPy, we saw that it mainly supports numeric data. Pandas, on the other hand, supports a whole range of data types, from numeric to strings and beyond.
Since we are working in a Python notebook, we do not need to install Pandas; we only have to import it.
import pandas as pd
# importing numpy and pandas
import numpy as np
import pandas as pd
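To illustrate the point about data types, a single DataFrame can hold string and numeric columns side by side. A minimal sketch (the data here is made up):
# A hypothetical dataframe mixing strings, floats, and integers
mixed = pd.DataFrame({'city': ['Kigali', 'Delhi'],
                      'population_m': [1.2, 32.9],
                      'rank': [1, 2]})
# Each column keeps its own dtype: object, float64, int64
mixed.dtypes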
Creating Series¶
Series can be created from a Python list, dictionary, and NumPy array.
# Creating the series from a Python list
num_list = [1,2,3,4,5]
pd.Series(num_list)
0    1
1    2
2    3
3    4
4    5
dtype: int64
week_days = ['Mon','Tues','Wed','Thur','Fri']
pd.Series(week_days, index=["a", "b", "c", "d", "e"])
a     Mon
b    Tues
c     Wed
d    Thur
e     Fri
dtype: object
Note the data types int64 and object.
# Creating the Series from dictionary
countries_code = { 1:"United States",
91:"India",
49:"Germany",
86:"China",
250:"Rwanda"}
pd.Series(countries_code)
1      United States
91             India
49           Germany
86             China
250           Rwanda
dtype: object
d = {1:'a', 2:'b', 3:'c', 4:'d'}
pd.Series(d)
1    a
2    b
3    c
4    d
dtype: object
# Creating the Series from NumPy array
# We provide the list of indexes
# If we don't provide indexes, the default indexes are integers starting from 0, 1, 2, ...
arr = np.array ([1, 2, 3, 4, 5])
pd.Series(arr)
0    1
1    2
2    3
3    4
4    5
dtype: int64
pd.Series(arr, index=['a', 'b', 'c', 'd', 'e'])
a    1
b    2
c    3
d    4
e    5
dtype: int64
Creating DataFrames¶
DataFrames are the most used Pandas data structure. A DataFrame can be created from a dictionary, a 2D array, or a Series.
# Creating DataFrame from a dictionary
countries = {'Name': ['USA', 'India', 'German', 'Rwanda'],
'Codes':[1, 91, 49, 250] }
pd.DataFrame(countries)
Name | Codes | |
---|---|---|
0 | USA | 1 |
1 | India | 91 |
2 | German | 49 |
3 | Rwanda | 250 |
# Creating a dataframe from a 2D array
# You pass the list of columns
array_2d = np.array ([[1,2,3], [4,5,6], [7,8,9]])
pd.DataFrame(array_2d, columns = ['column 1', 'column 2', 'column 3'])
column 1 | column 2 | column 3 | |
---|---|---|---|
0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 |
2 | 7 | 8 | 9 |
# Creating a dataframe from Pandas series
# Pass the columns in a list
countries_code = { "United States": 1,
"India": 91,
"Germany": 49,
"China": 86,
"Rwanda":250}
pd_series = pd.Series(countries_code)
df = pd.DataFrame(pd_series, columns = ['Codes'])
df
Codes | |
---|---|
United States | 1 |
India | 91 |
Germany | 49 |
China | 86 |
Rwanda | 250 |
# Adding a column
# The population numbers here are made up
df ['Population'] = [100, 450, 575, 5885, 533]
df
Codes | Population | |
---|---|---|
United States | 1 | 100 |
India | 91 | 450 |
Germany | 49 | 575 |
China | 86 | 5885 |
Rwanda | 250 | 533 |
# Removing a column
# Note: drop() returns a new dataframe; df itself is unchanged unless you reassign the result
df.drop('Population', axis=1)
Codes | |
---|---|
United States | 1 |
India | 91 |
Germany | 49 |
China | 86 |
Rwanda | 250 |
df.columns
Index(['Codes', 'Population'], dtype='object')
df.keys
<bound method NDFrame.keys of Codes Population United States 1 100 India 91 450 Germany 49 575 China 86 5885 Rwanda 250 533>
df.index
Index(['United States', 'India', 'Germany', 'China', 'Rwanda'], dtype='object')
B. Data Indexing, Selection and Iteration¶
Indexing and selection work on both Series and DataFrames.
Because a DataFrame is made up of Series, let's focus on how to select data in a DataFrame.
# Creating DataFrame from a dictionary
countries = {'Name': ['USA', 'India', 'German', 'Rwanda'],
'Codes':[1, 91, 49, 250] }
df = pd.DataFrame(countries, index=['a', 'b', 'c', 'd'])
df
Name | Codes | |
---|---|---|
a | USA | 1 |
b | India | 91 |
c | German | 49 |
d | Rwanda | 250 |
df['Name']
a       USA
b     India
c    German
d    Rwanda
Name: Name, dtype: object
df.Name
a       USA
b     India
c    German
d    Rwanda
Name: Name, dtype: object
df ['Codes']
a      1
b     91
c     49
d    250
Name: Codes, dtype: int64
## When you pass a list of column names, only those columns are selected
df [['Name', 'Codes']]
Name | Codes | |
---|---|---|
a | USA | 1 |
b | India | 91 |
c | German | 49 |
d | Rwanda | 250 |
# This will return the first two rows
df [0:2]
Name | Codes | |
---|---|---|
a | USA | 1 |
b | India | 91 |
You can also use loc to select data by label index and iloc to select by the default integer index (i.e., by the position of the row).
df.loc['a']
Name     USA
Codes      1
Name: a, dtype: object
df.loc['b':'d']
Name | Codes | |
---|---|---|
b | India | 91 |
c | German | 49 |
d | Rwanda | 250 |
df [:'b']
Name | Codes | |
---|---|---|
a | USA | 1 |
b | India | 91 |
df.iloc[2]
Name     German
Codes        49
Name: c, dtype: object
df.iloc[1:3]
Name | Codes | |
---|---|---|
b | India | 91 |
c | German | 49 |
df.iloc[2:]
Name | Codes | |
---|---|---|
c | German | 49 |
d | Rwanda | 250 |
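loc and iloc also accept a second argument to select columns at the same time as rows. A quick sketch using the df defined above:
# Select a single cell by row label and column name
df.loc['a', 'Name']
# Select rows 'a' through 'b', but only the 'Codes' column
df.loc['a':'b', 'Codes']
# Select by integer position: row 0, column 1
df.iloc[0, 1]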
Conditional Selection¶
df
Name | Codes | |
---|---|---|
a | USA | 1 |
b | India | 91 |
c | German | 49 |
d | Rwanda | 250 |
#Let's select a country with code 49
df [df['Codes'] ==49 ]
Name | Codes | |
---|---|---|
c | German | 49 |
df [df['Codes'] < 250 ]
Name | Codes | |
---|---|---|
a | USA | 1 |
b | India | 91 |
c | German | 49 |
df [df['Name'] =='USA' ]
Name | Codes | |
---|---|---|
a | USA | 1 |
# You can use & (and) and | (or) to combine multiple conditions
# df[(condition 1) & (condition 2)]
df [(df['Codes'] == 91 ) & (df['Name'] == 'India') ]
Name | Codes | |
---|---|---|
b | India | 91 |
You can also use isin() and where() to select data in a Series or DataFrame.
# isin() returns True where a value in the dataframe appears in the provided list, and False otherwise
sample_codes_names=[1,3,250, 'USA', 'India', 'England']
df.isin(sample_codes_names)
Name | Codes | |
---|---|---|
a | True | True |
b | True | False |
c | False | False |
d | False | True |
As you can see, it returned True wherever a country code or name was found, and False otherwise. You can use a dictionary to match the search by column: each key must be a column name and the values to look for are passed as a list.
sample_codes_names = {'Codes':[1,3,250], 'Name':['USA', 'India', 'England']}
df.isin(sample_codes_names)
Name | Codes | |
---|---|---|
a | True | True |
b | True | False |
c | False | False |
d | False | True |
df2 = pd.DataFrame(np.array ([[1,2,3], [4,5,6], [7,8,9]]),
columns = ['column 1', 'column 2', 'column 3'])
df2
column 1 | column 2 | column 3 | |
---|---|---|---|
0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 |
2 | 7 | 8 | 9 |
df2.isin([0,3,4,5,7])
column 1 | column 2 | column 3 | |
---|---|---|---|
0 | False | False | True |
1 | True | True | False |
2 | True | False | False |
df2 [df2 > 4]
column 1 | column 2 | column 3 | |
---|---|---|---|
0 | NaN | NaN | NaN |
1 | NaN | 5.0 | 6.0 |
2 | 7.0 | 8.0 | 9.0 |
df2.where(df2 > 4)
column 1 | column 2 | column 3 | |
---|---|---|---|
0 | NaN | NaN | NaN |
1 | NaN | 5.0 | 6.0 |
2 | 7.0 | 8.0 | 9.0 |
where allows you to replace the values that don't meet the provided condition with another value. So, if we run df2.where(df2 > 4, 0) as follows, all values that are not greater than 4 will be replaced by 0.
df2.where(df2 > 4, 0)
column 1 | column 2 | column 3 | |
---|---|---|---|
0 | 0 | 0 | 0 |
1 | 0 | 5 | 6 |
2 | 7 | 8 | 9 |
Similarly, we can achieve the above by...
df2 [df2 <= 4] = 0
df2
column 1 | column 2 | column 3 | |
---|---|---|---|
0 | 0 | 0 | 0 |
1 | 0 | 5 | 6 |
2 | 7 | 8 | 9 |
Iteration¶
* df.items(): iterate over (column name, Series) pairs.
* df.iteritems(): iterate over (column name, Series) pairs; an older alias of items() that is deprecated in recent Pandas versions.
* df.iterrows(): iterate over DataFrame rows as (index, Series) pairs.
* df.itertuples(index=True, name='Pandas'): iterate over DataFrame rows as namedtuples.
# Iterate over (column name, Series) pairs.
for col_name, content in df2.items():
print(col_name)
print(content)
column 1
0    0
1    0
2    7
Name: column 1, dtype: int64
column 2
0    0
1    5
2    8
Name: column 2, dtype: int64
column 3
0    0
1    6
2    9
Name: column 3, dtype: int64
# Iterate over (column name, Series) pairs.
# Same as df.items()
for col_name, content in df2.iteritems():
print(col_name)
print(content)
column 1
0    0
1    0
2    7
Name: column 1, dtype: int64
column 2
0    0
1    5
2    8
Name: column 2, dtype: int64
column 3
0    0
1    6
2    9
Name: column 3, dtype: int64
# Iterate over DataFrame rows as (index, Series) pairs
for row in df2.iterrows():
print(row)
(0, column 1    0
column 2    0
column 3    0
Name: 0, dtype: int64)
(1, column 1    0
column 2    5
column 3    6
Name: 1, dtype: int64)
(2, column 1    7
column 2    8
column 3    9
Name: 2, dtype: int64)
# Iterate over DataFrame rows as namedtuples
for row in df2.itertuples():
print(row)
Pandas(Index=0, _1=0, _2=0, _3=0)
Pandas(Index=1, _1=0, _2=5, _3=6)
Pandas(Index=2, _1=7, _2=8, _3=9)
Note: Thanks to Prit Kalariya for contributing the iteration part!
C. Dealing with Missing data¶
Real-world datasets are messy and often contain missing values. Pandas represents missing values as NaN by default; NaN stands for Not a Number.
Missing values can either be ignored, dropped, or filled.
# Creating a dataframe
df3 = pd.DataFrame(np.array ([[1,2,3], [4,np.nan,6], [7,np.nan,np.nan]]),
columns = ['column 1', 'column 2', 'column 3'])
Checking Missing values¶
# Recognizing the missing values
df3.isnull()
column 1 | column 2 | column 3 | |
---|---|---|---|
0 | False | False | False |
1 | False | True | False |
2 | False | True | True |
# Counting the number of missing values in each feature
df3.isnull().sum()
column 1    0
column 2    2
column 3    1
dtype: int64
# Recognizing non-missing values
df3.notna()
column 1 | column 2 | column 3 | |
---|---|---|---|
0 | True | True | True |
1 | True | False | True |
2 | True | False | False |
df3.notna().sum()
column 1    3
column 2    1
column 3    2
dtype: int64
Removing the missing values¶
## Dropping missing values
df3.dropna()
column 1 | column 2 | column 3 | |
---|---|---|---|
0 | 1.0 | 2.0 | 3.0 |
Rows 1 and 2 are deleted because dropna() removes every row that contains at least one missing value.
# you can drop NaNs in specific column(s)
df3['column 3'].dropna()
0    3.0
1    6.0
Name: column 3, dtype: float64
# You can also drop by axis
# axis=1 drops all columns that contain NaNs
# df3.dropna(axis='columns') is the same
df3.dropna(axis=1)
column 1 | |
---|---|
0 | 1.0 |
1 | 4.0 |
2 | 7.0 |
# axis=0 drops all rows that contain NaNs
# df3.dropna(axis='rows') is the same
df3.dropna(axis=0)
column 1 | column 2 | column 3 | |
---|---|---|---|
0 | 1.0 | 2.0 | 3.0 |
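dropna() also takes parameters that make it more selective; for example, subset restricts the check to certain columns, and thresh keeps rows with at least a given number of non-missing values. A small sketch with df3 from above:
# Drop only the rows where 'column 2' is missing
df3.dropna(subset=['column 2'])

# Keep rows that have at least 2 non-missing values
df3.dropna(thresh=2)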
Filling the missing values¶
# Filling Missing values
df3.fillna(10)
column 1 | column 2 | column 3 | |
---|---|---|---|
0 | 1.0 | 2.0 | 3.0 |
1 | 4.0 | 10.0 | 6.0 |
2 | 7.0 | 10.0 | 10.0 |
df3.fillna('fillme')
column 1 | column 2 | column 3 | |
---|---|---|---|
0 | 1.0 | 2.0 | 3.0 |
1 | 4.0 | fillme | 6.0 |
2 | 7.0 | fillme | fillme |
# You can forward fill (ffill) or backward fill (bfill),
# i.e., fill a missing value with the previous or the next value
df3.fillna(method='ffill')
column 1 | column 2 | column 3 | |
---|---|---|---|
0 | 1.0 | 2.0 | 3.0 |
1 | 4.0 | 2.0 | 6.0 |
2 | 7.0 | 2.0 | 6.0 |
# Column 2 is not filled because the values below its NaNs are also NaN, and the NaNs in the last row have nothing after them to backfill from
df3.fillna(method='bfill')
column 1 | column 2 | column 3 | |
---|---|---|---|
0 | 1.0 | 2.0 | 3.0 |
1 | 4.0 | NaN | 6.0 |
2 | 7.0 | NaN | NaN |
# If we change the axis to columns, the NaN at row 1, column 2 is backfilled with the 6 from column 3
df3.fillna(method='bfill', axis='columns')
column 1 | column 2 | column 3 | |
---|---|---|---|
0 | 1.0 | 2.0 | 3.0 |
1 | 4.0 | 6.0 | 6.0 |
2 | 7.0 | NaN | NaN |
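Besides constants and ffill/bfill, another common approach is to fill each column's missing values with a statistic of that column, such as its mean. A minimal sketch with df3:
# Fill the NaNs in each column with that column's mean
df3.fillna(df3.mean())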
D. More Operations and Functions¶
This section shows some of the most useful Pandas functions.
df4 = pd.DataFrame({'Product Name':['Shirt','Boot','Bag'],
'Order Number':[45,56,64],
'Total Quantity':[10,5,9]},
columns = ['Product Name', 'Order Number', 'Total Quantity'])
Retrieving basic info about the Dataframe¶
# Return a summary about the dataframe
df4.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Product Name    3 non-null      object
 1   Order Number    3 non-null      int64
 2   Total Quantity  3 non-null      int64
dtypes: int64(2), object(1)
memory usage: 200.0+ bytes
# Return dataframe columns
df4.columns
Index(['Product Name', 'Order Number', 'Total Quantity'], dtype='object')
# df4.keys is the method itself; calling df4.keys() returns the columns
df4.keys
<bound method NDFrame.keys of Product Name Order Number Total Quantity 0 Shirt 45 10 1 Boot 56 5 2 Bag 64 9>
# Return the head of the dataframe; most useful when you have a long frame
# Choose how many rows you want with head(n)
df4.head(1)
Product Name | Order Number | Total Quantity | |
---|---|---|---|
0 | Shirt | 45 | 10 |
# Return the tail of the dataframe
df4.tail(1)
Product Name | Order Number | Total Quantity | |
---|---|---|---|
2 | Bag | 64 | 9 |
# Return NumPy array of the dataframe
df4.values
array([['Shirt', 45, 10], ['Boot', 56, 5], ['Bag', 64, 9]], dtype=object)
# Return the size or number of elements in a dataframe
df4.size
9
# Return the shape
df4.shape
(3, 3)
# Return the length of the dataframe/the number of rows in a dataframe
df4.shape[0]
3
# Return the length of the dataframe/the number of columns in a dataframe
df4.shape[1]
3
Unique Values¶
# Return unique values in a given column
df4['Product Name'].unique()
array(['Shirt', 'Boot', 'Bag'], dtype=object)
# Return a number of unique values
df4['Product Name'].nunique()
3
# Counting the occurrence of each value in a column
df4['Product Name'].value_counts()
Shirt    1
Boot     1
Bag      1
Name: Product Name, dtype: int64
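value_counts() can also report proportions instead of raw counts. A quick sketch:
# Proportion of each product name rather than raw counts
df4['Product Name'].value_counts(normalize=True)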
Applying a Function to Dataframe¶
# Square the total quantity of each product
def square_quantity(x):
    return x * x

df4['Total Quantity'].apply(square_quantity)
0    100
1     25
2     81
Name: Total Quantity, dtype: int64
# You can also apply an anonymous function to a dataframe
# Squaring each value in dataframe
df5 = pd.DataFrame([[1,2], [4,5]], columns=['col1', 'col2'])
df5.applymap(lambda x: x**2)
col1 | col2 | |
---|---|---|
0 | 1 | 4 |
1 | 16 | 25 |
Sorting values in dataframe¶
# Sort the df4 by the order number
df4.sort_values(['Order Number'])
Product Name | Order Number | Total Quantity | |
---|---|---|---|
0 | Shirt | 45 | 10 |
1 | Boot | 56 | 5 |
2 | Bag | 64 | 9 |
df4.sort_values(['Order Number'], ascending=False)
Product Name | Order Number | Total Quantity | |
---|---|---|---|
2 | Bag | 64 | 9 |
1 | Boot | 56 | 5 |
0 | Shirt | 45 | 10 |
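You can also sort by several columns at once, or by the index. A short sketch:
# Sort by Total Quantity first, then by Order Number to break ties
df4.sort_values(['Total Quantity', 'Order Number'])

# Sort by the row index
df4.sort_index()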
E. Aggregation Methods¶
df4
Product Name | Order Number | Total Quantity | |
---|---|---|---|
0 | Shirt | 45 | 10 |
1 | Boot | 56 | 5 |
2 | Bag | 64 | 9 |
# summary statistics
df4.describe()
Order Number | Total Quantity | |
---|---|---|
count | 3.000000 | 3.000000 |
mean | 55.000000 | 8.000000 |
std | 9.539392 | 2.645751 |
min | 45.000000 | 5.000000 |
25% | 50.500000 | 7.000000 |
50% | 56.000000 | 9.000000 |
75% | 60.000000 | 9.500000 |
max | 64.000000 | 10.000000 |
df4.describe().transpose()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
Order Number | 3.0 | 55.0 | 9.539392 | 45.0 | 50.5 | 56.0 | 60.0 | 64.0 |
Total Quantity | 3.0 | 8.0 | 2.645751 | 5.0 | 7.0 | 9.0 | 9.5 | 10.0 |
# Mode of a column
# The mode is the most frequently occurring value; here every value occurs once, so all are returned
df4['Total Quantity'].mode()
0     5
1     9
2    10
dtype: int64
# The maximum value
df4['Total Quantity'].max()
10
# The minimum value
df4['Total Quantity'].min()
5
# The mean
df4['Total Quantity'].mean()
8.0
# The median value in a dataframe
df4['Total Quantity'].median()
9.0
# Standard deviation
df4['Total Quantity'].std()
2.6457513110645907
# Variance
df4['Total Quantity'].var()
7.0
# Sum of all values in a column
df4['Total Quantity'].sum()
24
# Product of all values in dataframe
df4['Total Quantity'].prod()
450
F. Groupby¶
Group by involves splitting data into groups, applying a function to each group, and combining the results.
df4 = pd.DataFrame({'Product Name':['Shirt','Boot','Bag', 'Ankle', 'Pullover', 'Boot', 'Ankle', 'Tshirt', 'Shirt'],
'Order Number':[45,56,64, 34, 67, 56, 34, 89, 45],
'Total Quantity':[10,5,9, 11, 11, 8, 14, 23, 10]},
columns = ['Product Name', 'Order Number', 'Total Quantity'])
df4
Product Name | Order Number | Total Quantity | |
---|---|---|---|
0 | Shirt | 45 | 10 |
1 | Boot | 56 | 5 |
2 | Bag | 64 | 9 |
3 | Ankle | 34 | 11 |
4 | Pullover | 67 | 11 |
5 | Boot | 56 | 8 |
6 | Ankle | 34 | 14 |
7 | Tshirt | 89 | 23 |
8 | Shirt | 45 | 10 |
# Let's group the df by product name
df4.groupby('Product Name').mean()
Order Number | Total Quantity | |
---|---|---|
Product Name | ||
Ankle | 34.0 | 12.5 |
Bag | 64.0 | 9.0 |
Boot | 56.0 | 6.5 |
Pullover | 67.0 | 11.0 |
Shirt | 45.0 | 10.0 |
Tshirt | 89.0 | 23.0 |
df4.groupby('Product Name').sum()
Order Number | Total Quantity | |
---|---|---|
Product Name | ||
Ankle | 68 | 25 |
Bag | 64 | 9 |
Boot | 112 | 13 |
Pullover | 67 | 11 |
Shirt | 90 | 20 |
Tshirt | 89 | 23 |
df4.groupby('Product Name').min()
Order Number | Total Quantity | |
---|---|---|
Product Name | ||
Ankle | 34 | 11 |
Bag | 64 | 9 |
Boot | 56 | 5 |
Pullover | 67 | 11 |
Shirt | 45 | 10 |
Tshirt | 89 | 23 |
df4.groupby('Product Name').max()
Order Number | Total Quantity | |
---|---|---|
Product Name | ||
Ankle | 34 | 14 |
Bag | 64 | 9 |
Boot | 56 | 8 |
Pullover | 67 | 11 |
Shirt | 45 | 10 |
Tshirt | 89 | 23 |
df4.groupby(['Product Name', 'Order Number']).max()
Total Quantity | ||
---|---|---|
Product Name | Order Number | |
Ankle | 34 | 14 |
Bag | 64 | 9 |
Boot | 56 | 8 |
Pullover | 67 | 11 |
Shirt | 45 | 10 |
Tshirt | 89 | 23 |
df4.groupby(['Product Name', 'Order Number']).sum()
Total Quantity | ||
---|---|---|
Product Name | Order Number | |
Ankle | 34 | 25 |
Bag | 64 | 9 |
Boot | 56 | 13 |
Pullover | 67 | 11 |
Shirt | 45 | 20 |
Tshirt | 89 | 23 |
You can also use aggregate() (or its alias agg()) after groupby.
df4.groupby('Product Name').aggregate(['min', 'max', 'sum'])
Order Number | Total Quantity | |||||
---|---|---|---|---|---|---|
min | max | sum | min | max | sum | |
Product Name | ||||||
Ankle | 34 | 34 | 68 | 11 | 14 | 25 |
Bag | 64 | 64 | 64 | 9 | 9 | 9 |
Boot | 56 | 56 | 112 | 5 | 8 | 13 |
Pullover | 67 | 67 | 67 | 11 | 11 | 11 |
Shirt | 45 | 45 | 90 | 10 | 10 | 20 |
Tshirt | 89 | 89 | 89 | 23 | 23 | 23 |
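aggregate() also accepts a dictionary, which lets you apply a different function to each column. A sketch:
# Different aggregations per column
df4.groupby('Product Name').agg({'Order Number': 'min', 'Total Quantity': 'sum'})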
G. Concatenating, Merging, and Joining¶
# Creating dataframes
df1 = pd.DataFrame({'Col1':['A','B','C'],
'Col2':[1,2,3]},
index=['a','b','c'])
df2 = pd.DataFrame({'Col1':['D','E','F'],
'Col2':[4,5,6]},
index=['d','e','f'])
df3 = pd.DataFrame({'Col1':['G','I','J'],
'Col2':[7,8,9]},
index=['g', 'i','j'])
df1
Col1 | Col2 | |
---|---|---|
a | A | 1 |
b | B | 2 |
c | C | 3 |
df2
Col1 | Col2 | |
---|---|---|
d | D | 4 |
e | E | 5 |
f | F | 6 |
df3
Col1 | Col2 | |
---|---|---|
g | G | 7 |
i | I | 8 |
j | J | 9 |
# Concatenating: Adding one dataset to another
pd.concat([df1, df2, df3])
Col1 | Col2 | |
---|---|---|
a | A | 1 |
b | B | 2 |
c | C | 3 |
d | D | 4 |
e | E | 5 |
f | F | 6 |
g | G | 7 |
i | I | 8 |
j | J | 9 |
The default axis is 0. This is how the combined dataframe will look if we change the axis to 1.
pd.concat([df1, df2, df3], axis=1)
Col1 | Col2 | Col1 | Col2 | Col1 | Col2 | |
---|---|---|---|---|---|---|
a | A | 1.0 | NaN | NaN | NaN | NaN |
b | B | 2.0 | NaN | NaN | NaN | NaN |
c | C | 3.0 | NaN | NaN | NaN | NaN |
d | NaN | NaN | D | 4.0 | NaN | NaN |
e | NaN | NaN | E | 5.0 | NaN | NaN |
f | NaN | NaN | F | 6.0 | NaN | NaN |
g | NaN | NaN | NaN | NaN | G | 7.0 |
i | NaN | NaN | NaN | NaN | I | 8.0 |
j | NaN | NaN | NaN | NaN | J | 9.0 |
# We can also use append(), though it is deprecated in newer Pandas versions in favor of pd.concat()
df1.append([df2, df3])
Col1 | Col2 | |
---|---|---|
a | A | 1 |
b | B | 2 |
c | C | 3 |
d | D | 4 |
e | E | 5 |
f | F | 6 |
g | G | 7 |
i | I | 8 |
j | J | 9 |
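Since append() is deprecated in newer Pandas versions, pd.concat() is the recommended way to get the same result:
# Equivalent to df1.append([df2, df3]) in recent Pandas versions
pd.concat([df1, df2, df3])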
Merging¶
If you have worked with SQL, what pd.merge() does may be familiar. It links data from different sources (different features), and you have control over the structure of the combined dataset.
Pandas merge method (how) : SQL join name : Description
* left : LEFT OUTER JOIN : Use keys or columns from left frame only
* right : RIGHT OUTER JOIN : Use keys or columns from right frame only
* outer : FULL OUTER JOIN : Use union of keys or columns from both frames
* inner : INNER JOIN : Use intersection of keys or columns from both frames
df1 = pd.DataFrame({'Name': ['Joe', 'Joshua', 'Jeanne', 'David'],
'Role': ['Manager', 'Developer', 'Engineer', 'Scientist']})
df2 = pd.DataFrame({'Name': ['David', 'Joshua', 'Joe', 'Jeanne'],
'Year Hired': [2018, 2017, 2020, 2018]})
df3 = pd.DataFrame({'Name': ['David', 'Joshua', 'Joe', 'Jeanne'],
'No of Leaves': [15, 3, 10, 12]})
df1
Name | Role | |
---|---|---|
0 | Joe | Manager |
1 | Joshua | Developer |
2 | Jeanne | Engineer |
3 | David | Scientist |
df2
Name | Year Hired | |
---|---|---|
0 | David | 2018 |
1 | Joshua | 2017 |
2 | Joe | 2020 |
3 | Jeanne | 2018 |
pd.merge(df1, df2)
Name | Role | Year Hired | |
---|---|---|---|
0 | Joe | Manager | 2020 |
1 | Joshua | Developer | 2017 |
2 | Jeanne | Engineer | 2018 |
3 | David | Scientist | 2018 |
## Let's merge with Name as the key
pd.merge(df1, df2, how='inner', on="Name")
Name | Role | Year Hired | |
---|---|---|---|
0 | Joe | Manager | 2020 |
1 | Joshua | Developer | 2017 |
2 | Jeanne | Engineer | 2018 |
3 | David | Scientist | 2018 |
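If the key column has different names in the two frames, you can pass left_on and right_on instead of on. A sketch with made-up frames:
# Hypothetical frames whose key columns have different names
employees = pd.DataFrame({'Employee': ['Joe', 'Jeanne'], 'Role': ['Manager', 'Engineer']})
hires = pd.DataFrame({'Name': ['Joe', 'Jeanne'], 'Year Hired': [2020, 2018]})

pd.merge(employees, hires, left_on='Employee', right_on='Name')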
df1 = pd.DataFrame({'col1': ['K0', 'K0', 'K1', 'K2'],
'col2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'col1': ['K0', 'K1', 'K1', 'K2'],
'col2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
df1
col1 | col2 | A | B | |
---|---|---|---|---|
0 | K0 | K0 | A0 | B0 |
1 | K0 | K1 | A1 | B1 |
2 | K1 | K0 | A2 | B2 |
3 | K2 | K1 | A3 | B3 |
df2
col1 | col2 | C | D | |
---|---|---|---|---|
0 | K0 | K0 | C0 | D0 |
1 | K1 | K0 | C1 | D1 |
2 | K1 | K0 | C2 | D2 |
3 | K2 | K0 | C3 | D3 |
pd.merge(df1, df2, how='inner', on=['col1', 'col2'])
col1 | col2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | K0 | A2 | B2 | C1 | D1 |
2 | K1 | K0 | A2 | B2 | C2 | D2 |
pd.merge(df1, df2, how='outer', on=['col1', 'col2'])
col1 | col2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K0 | K1 | A1 | B1 | NaN | NaN |
2 | K1 | K0 | A2 | B2 | C1 | D1 |
3 | K1 | K0 | A2 | B2 | C2 | D2 |
4 | K2 | K1 | A3 | B3 | NaN | NaN |
5 | K2 | K0 | NaN | NaN | C3 | D3 |
pd.merge(df1, df2, how='left')
col1 | col2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K0 | K1 | A1 | B1 | NaN | NaN |
2 | K1 | K0 | A2 | B2 | C1 | D1 |
3 | K1 | K0 | A2 | B2 | C2 | D2 |
4 | K2 | K1 | A3 | B3 | NaN | NaN |
pd.merge(df1, df2, how='right')
col1 | col2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | K0 | A2 | B2 | C1 | D1 |
2 | K1 | K0 | A2 | B2 | C2 | D2 |
3 | K2 | K0 | NaN | NaN | C3 | D3 |
Joining¶
Joining is a simple way to combine the columns of two dataframes, aligning rows by their (possibly different) indexes.
df1 = pd.DataFrame({'Col1': ['A', 'B', 'C'],
'Col2': [11, 12, 13]},
index=['a', 'b', 'c'])
df2 = pd.DataFrame({'Col3': ['D', 'E', 'F'],
'Col4': [14, 14, 16]},
index=['a', 'c', 'd'])
df1
Col1 | Col2 | |
---|---|---|
a | A | 11 |
b | B | 12 |
c | C | 13 |
df2
Col3 | Col4 | |
---|---|---|
a | D | 14 |
c | E | 14 |
d | F | 16 |
df1.join(df2)
Col1 | Col2 | Col3 | Col4 | |
---|---|---|---|---|
a | A | 11 | D | 14.0 |
b | B | 12 | NaN | NaN |
c | C | 13 | E | 14.0 |
df2.join(df1)
Col3 | Col4 | Col1 | Col2 | |
---|---|---|---|---|
a | D | 14 | A | 11.0 |
c | E | 14 | C | 13.0 |
d | F | 16 | NaN | NaN |
You can see that with df.join(), the data is aligned on the indexes.
df1.join(df2, how='inner')
Col1 | Col2 | Col3 | Col4 | |
---|---|---|---|---|
a | A | 11 | D | 14 |
c | C | 13 | E | 14 |
df1.join(df2, how='outer')
Col1 | Col2 | Col3 | Col4 | |
---|---|---|---|---|
a | A | 11.0 | D | 14.0 |
b | B | 12.0 | NaN | NaN |
c | C | 13.0 | E | 14.0 |
d | NaN | NaN | F | 16.0 |
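If the two frames share a column name, join() needs lsuffix/rsuffix to disambiguate the overlapping columns. A small sketch with made-up frames:
# Hypothetical frames that both have a column named 'Col1'
left = pd.DataFrame({'Col1': ['A', 'B']}, index=['a', 'b'])
right = pd.DataFrame({'Col1': ['X', 'Y']}, index=['a', 'b'])

left.join(right, lsuffix='_left', rsuffix='_right')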
Learn more about merging, joining, and concatenating Pandas DataFrames in the Pandas user guide.
H. Beyond Dataframes: Working with CSV and Excel¶
In this last section of the Pandas fundamentals, we will see how to read real-world data in different formats: CSV and Excel.
CSV and Excel¶
Let's use the California housing dataset.
# Let's download the data
!curl -O https://raw.githubusercontent.com/nyandwi/public_datasets/master/housing.csv
% Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 1390k 100 1390k 0 0 409k 0 0:00:03 0:00:03 --:--:-- 409k
data = pd.read_csv('housing.csv')
data.head()
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | |
---|---|---|---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
type(data)
pandas.core.frame.DataFrame
## Exporting dataframe back to csv
data.to_csv('housing_dataset', index=False)
If you look in the folder sidebar, you can see the exported file housing_dataset.
## Exporting CSV to Excel
data.to_excel('housing_excel.xlsx', index=False)
--------------------------------------------------------------------------- ModuleNotFoundError Traceback (most recent call last) /var/folders/9x/fscj3yx566q3y3y1kf5yh9m40000gn/T/ipykernel_1348/1131967869.py in <module> 1 ## Exporting CSV to Excel 2 ----> 3 data.to_excel('housing_excel.xlsx', index=False) ~/miniforge3/envs/TensorPro/lib/python3.9/site-packages/pandas/core/generic.py in to_excel(self, excel_writer, sheet_name, na_rep, float_format, columns, header, index, index_label, startrow, startcol, engine, merge_cells, encoding, inf_rep, verbose, freeze_panes, storage_options) 2282 inf_rep=inf_rep, 2283 ) -> 2284 formatter.write( 2285 excel_writer, 2286 sheet_name=sheet_name, ~/miniforge3/envs/TensorPro/lib/python3.9/site-packages/pandas/io/formats/excel.py in write(self, writer, sheet_name, startrow, startcol, freeze_panes, engine, storage_options) 832 # error: Cannot instantiate abstract class 'ExcelWriter' with abstract 833 # attributes 'engine', 'save', 'supported_extensions' and 'write_cells' --> 834 writer = ExcelWriter( # type: ignore[abstract] 835 writer, engine=engine, storage_options=storage_options 836 ) ~/miniforge3/envs/TensorPro/lib/python3.9/site-packages/pandas/io/excel/_openpyxl.py in __init__(self, path, engine, date_format, datetime_format, mode, storage_options, if_sheet_exists, engine_kwargs, **kwargs) 46 ): 47 # Use the openpyxl module as the Excel writer. ---> 48 from openpyxl.workbook import Workbook 49 50 engine_kwargs = combine_kwargs(engine_kwargs, kwargs) ModuleNotFoundError: No module named 'openpyxl'
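The export fails because Pandas relies on the openpyxl package to write .xlsx files and it is not installed in this environment. Installing it and re-running the export should fix the error (assuming the notebook can run shell commands):
# Install the missing Excel engine, then export again
!pip install openpyxl
data.to_excel('housing_excel.xlsx', index=False)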
## Reading the Excel file back
excel_data = pd.read_excel('housing_excel.xlsx')
excel_data.head()
Real World: Exploratory Data Analysis (EDA)¶
Everything above covered the basics. Let's apply some of these techniques to a real-world dataset, Red Wine Quality.
!curl -O https://raw.githubusercontent.com/nyandwi/public_datasets/master/winequality-red.csv
wine_data = pd.read_csv('winequality-red.csv')
# Displaying the head of the dataset
wine_data.head()
# Displaying the tail of the dataset
wine_data.tail()
# Displaying summary statistics
wine_data.describe().transpose()
# Displaying quick information about the dataset
wine_data.info()
# Checking missing values
wine_data.isnull().sum()
# Wine quality ranges from 0 to 10; the higher the quality value, the better the wine
wine_data['quality'].value_counts()
wine_data.groupby(['fixed acidity', 'volatile acidity', 'citric acid']).sum()
wine_data.groupby(['free sulfur dioxide', 'total sulfur dioxide']).sum()
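As one more quick exploration, we can check which features correlate most with wine quality (a sketch; every column in this dataset is numeric, so corr() works directly):
# Correlation of each feature with quality, from most to least correlated
wine_data.corr()['quality'].sort_values(ascending=False)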
This is the end of the lab on using Pandas to manipulate data. A lot of these things will make more sense when we start preparing data for machine learning models in the next notebooks.