This notebook was created by Jean de Dieu Nyandwi out of love for the machine learning community. For any feedback, errors, or suggestions, he can be reached by email (johnjw7084 at gmail dot com), Twitter, or LinkedIn.
Data Manipulation with Pandas¶
In this lab, you will learn how to manipulate data with Pandas. Here is an overview:
1. Basics of Pandas for data manipulation¶
A. Series and DataFrames¶
Both Series and DataFrames are Pandas data structures.
A Series is like a one-dimensional NumPy array with axis labels.
A DataFrame is like a two-dimensional NumPy array with labels on both rows and columns.
Working with NumPy, we saw that it mainly supports numeric data. Pandas, on the other hand, supports a whole range of data types, from numeric to strings and beyond.
Since we are working in a Python notebook, we do not need to install Pandas; we only have to import it.
import pandas as pd
# importing numpy and pandas
import numpy as np
import pandas as pd
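To illustrate the point about data types, a single DataFrame can hold string and numeric columns side by side. A minimal sketch (the data here is made up):
# A hypothetical dataframe mixing strings, floats, and integers
mixed = pd.DataFrame({'city': ['Kigali', 'Delhi'],
                      'population_m': [1.2, 32.9],
                      'rank': [1, 2]})
# Each column keeps its own dtype: object, float64, int64
mixed.dtypes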
Creating Series¶
Series can be created from a Python list, dictionary, and NumPy array.
# Creating the series from a Python list
num_list = [1,2,3,4,5]
pd.Series(num_list)
0    1
1    2
2    3
3    4
4    5
dtype: int64
week_days = ['Mon','Tues','Wed','Thur','Fri']
pd.Series(week_days, index=["a", "b", "c", "d", "e"])
a     Mon
b    Tues
c     Wed
d    Thur
e     Fri
dtype: object
Note the data types int64 and object.
# Creating the Series from dictionary
countries_code = { 1:"United States",
91:"India",
49:"Germany",
86:"China",
250:"Rwanda"}
pd.Series(countries_code)
1      United States
91             India
49           Germany
86             China
250           Rwanda
dtype: object
d = {1:'a', 2:'b', 3:'c', 4:'d'}
pd.Series(d)
1    a
2    b
3    c
4    d
dtype: object
# Creating the Series from NumPy array
# We provide the list of indexes
# If we don't provide indexes, the default indexes are integers starting from 0, 1, 2, ...
arr = np.array ([1, 2, 3, 4, 5])
pd.Series(arr)
0    1
1    2
2    3
3    4
4    5
dtype: int64
pd.Series(arr, index=['a', 'b', 'c', 'd', 'e'])
a    1
b    2
c    3
d    4
e    5
dtype: int64
Creating DataFrames¶
DataFrames are the most used Pandas data structure. A DataFrame can be created from a dictionary, a 2D array, or a Series.
# Creating DataFrame from a dictionary
countries = {'Name': ['USA', 'India', 'German', 'Rwanda'],
'Codes':[1, 91, 49, 250] }
pd.DataFrame(countries)
Name | Codes | |
---|---|---|
0 | USA | 1 |
1 | India | 91 |
2 | German | 49 |
3 | Rwanda | 250 |
# Creating a dataframe from a 2D array
# You pass the list of columns
array_2d = np.array ([[1,2,3], [4,5,6], [7,8,9]])
pd.DataFrame(array_2d, columns = ['column 1', 'column 2', 'column 3'])
column 1 | column 2 | column 3 | |
---|---|---|---|
0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 |
2 | 7 | 8 | 9 |
# Creating a dataframe from Pandas series
# Pass the columns in a list
countries_code = { "United States": 1,
"India": 91,
"Germany": 49,
"China": 86,
"Rwanda":250}
pd_series = pd.Series(countries_code)
df = pd.DataFrame(pd_series, columns = ['Codes'])
df
Codes | |
---|---|
United States | 1 |
India | 91 |
Germany | 49 |
China | 86 |
Rwanda | 250 |
# Adding a column
# The population numbers here are made up
df ['Population'] = [100, 450, 575, 5885, 533]
df
Codes | Population | |
---|---|---|
United States | 1 | 100 |
India | 91 | 450 |
Germany | 49 | 575 |
China | 86 | 5885 |
Rwanda | 250 | 533 |
# Removing a column
# Note: drop() returns a new dataframe; df itself is unchanged unless you reassign the result
df.drop('Population', axis=1)
Codes | |
---|---|
United States | 1 |
India | 91 |
Germany | 49 |
China | 86 |
Rwanda | 250 |
df.columns
Index(['Codes', 'Population'], dtype='object')
df.keys
<bound method NDFrame.keys of Codes Population United States 1 100 India 91 450 Germany 49 575 China 86 5885 Rwanda 250 533>
df.index
Index(['United States', 'India', 'Germany', 'China', 'Rwanda'], dtype='object')
B. Data Indexing, Selection and Iteration¶
Indexing and selection work on both Series and DataFrames.
Because a DataFrame is made up of Series, let's focus on how to select data in a DataFrame.
# Creating DataFrame from a dictionary
countries = {'Name': ['USA', 'India', 'German', 'Rwanda'],
'Codes':[1, 91, 49, 250] }
df = pd.DataFrame(countries, index=['a', 'b', 'c', 'd'])
df
Name | Codes | |
---|---|---|
a | USA | 1 |
b | India | 91 |
c | German | 49 |
d | Rwanda | 250 |
df['Name']
a       USA
b     India
c    German
d    Rwanda
Name: Name, dtype: object
df.Name
a       USA
b     India
c    German
d    Rwanda
Name: Name, dtype: object
df ['Codes']
a      1
b     91
c     49
d    250
Name: Codes, dtype: int64
## When you pass a list of column names, only those columns are selected
df [['Name', 'Codes']]
Name | Codes | |
---|---|---|
a | USA | 1 |
b | India | 91 |
c | German | 49 |
d | Rwanda | 250 |
# This will return the first two rows
df [0:2]
Name | Codes | |
---|---|---|
a | USA | 1 |
b | India | 91 |
You can also use loc to select data by label index and iloc to select by the default integer index (i.e., by the position of the row).
df.loc['a']
Name     USA
Codes      1
Name: a, dtype: object
df.loc['b':'d']
Name | Codes | |
---|---|---|
b | India | 91 |
c | German | 49 |
d | Rwanda | 250 |
df [:'b']
Name | Codes | |
---|---|---|
a | USA | 1 |
b | India | 91 |
df.iloc[2]
Name     German
Codes        49
Name: c, dtype: object
df.iloc[1:3]
Name | Codes | |
---|---|---|
b | India | 91 |
c | German | 49 |
df.iloc[2:]
Name | Codes | |
---|---|---|
c | German | 49 |
d | Rwanda | 250 |
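loc and iloc also accept a second argument to select columns at the same time as rows. A quick sketch using the df defined above:
# Select a single cell by row label and column name
df.loc['a', 'Name']
# Select rows 'a' through 'b', but only the 'Codes' column
df.loc['a':'b', 'Codes']
# Select by integer position: row 0, column 1
df.iloc[0, 1]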
Conditional Selection¶
df
Name | Codes | |
---|---|---|
a | USA | 1 |
b | India | 91 |
c | German | 49 |
d | Rwanda | 250 |
#Let's select a country with code 49
df [df['Codes'] ==49 ]
Name | Codes | |
---|---|---|
c | German | 49 |
df [df['Codes'] < 250 ]
Name | Codes | |
---|---|---|
a | USA | 1 |
b | India | 91 |
c | German | 49 |
df [df['Name'] =='USA' ]
Name | Codes | |
---|---|---|
a | USA | 1 |
# You can use & (and) and | (or) to combine multiple conditions
# df[(condition 1) & (condition 2)]
df [(df['Codes'] == 91 ) & (df['Name'] == 'India') ]
Name | Codes | |
---|---|---|
b | India | 91 |
You can also use isin() and where() to select data in a Series or DataFrame.
# isin() returns True where a value in the dataframe appears in the provided list, and False otherwise
sample_codes_names=[1,3,250, 'USA', 'India', 'England']
df.isin(sample_codes_names)
Name | Codes | |
---|---|---|
a | True | True |
b | True | False |
c | False | False |
d | False | True |
As you can see, it returned True wherever a country code or name was found, and False otherwise. You can use a dictionary to match the search by column: each key must be a column name and the values to look for are passed as a list.
sample_codes_names = {'Codes':[1,3,250], 'Name':['USA', 'India', 'England']}
df.isin(sample_codes_names)
Name | Codes | |
---|---|---|
a | True | True |
b | True | False |
c | False | False |
d | False | True |
df2 = pd.DataFrame(np.array ([[1,2,3], [4,5,6], [7,8,9]]),
columns = ['column 1', 'column 2', 'column 3'])
df2
column 1 | column 2 | column 3 | |
---|---|---|---|
0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 |
2 | 7 | 8 | 9 |
df2.isin([0,3,4,5,7])
column 1 | column 2 | column 3 | |
---|---|---|---|
0 | False | False | True |
1 | True | True | False |
2 | True | False | False |
df2 [df2 > 4]
column 1 | column 2 | column 3 | |
---|---|---|---|
0 | NaN | NaN | NaN |
1 | NaN | 5.0 | 6.0 |
2 | 7.0 | 8.0 | 9.0 |
df2.where(df2 > 4)
column 1 | column 2 | column 3 | |
---|---|---|---|
0 | NaN | NaN | NaN |
1 | NaN | 5.0 | 6.0 |
2 | 7.0 | 8.0 | 9.0 |
where allows you to replace the values that don't meet the provided condition with another value. So, if we run df2.where(df2 > 4, 0) as follows, all values that are not greater than 4 will be replaced by 0.
df2.where(df2 > 4, 0)
column 1 | column 2 | column 3 | |
---|---|---|---|
0 | 0 | 0 | 0 |
1 | 0 | 5 | 6 |
2 | 7 | 8 | 9 |
Similarly, we can achieve the above by...
df2 [df2 <= 4] = 0
df2
column 1 | column 2 | column 3 | |
---|---|---|---|
0 | 0 | 0 | 0 |
1 | 0 | 5 | 6 |
2 | 7 | 8 | 9 |
Iteration¶
* df.items(): iterate over (column name, Series) pairs.
* df.iteritems(): iterate over (column name, Series) pairs; an older alias of items() that is deprecated in recent Pandas versions.
* df.iterrows(): iterate over DataFrame rows as (index, Series) pairs.
* df.itertuples(index=True, name='Pandas'): iterate over DataFrame rows as namedtuples.
# Iterate over (column name, Series) pairs.
for col_name, content in df2.items():
print(col_name)
print(content)
column 1
0    0
1    0
2    7
Name: column 1, dtype: int64
column 2
0    0
1    5
2    8
Name: column 2, dtype: int64
column 3
0    0
1    6
2    9
Name: column 3, dtype: int64
# Iterate over (column name, Series) pairs.
# Same as df.items()
for col_name, content in df2.iteritems():
print(col_name)
print(content)
column 1
0    0
1    0
2    7
Name: column 1, dtype: int64
column 2
0    0
1    5
2    8
Name: column 2, dtype: int64
column 3
0    0
1    6
2    9
Name: column 3, dtype: int64
# Iterate over DataFrame rows as (index, Series) pairs
for row in df2.iterrows():
print(row)
(0, column 1    0
column 2    0
column 3    0
Name: 0, dtype: int64)
(1, column 1    0
column 2    5
column 3    6
Name: 1, dtype: int64)
(2, column 1    7
column 2    8
column 3    9
Name: 2, dtype: int64)
# Iterate over DataFrame rows as namedtuples
for row in df2.itertuples():
print(row)
Pandas(Index=0, _1=0, _2=0, _3=0)
Pandas(Index=1, _1=0, _2=5, _3=6)
Pandas(Index=2, _1=7, _2=8, _3=9)
Note: Thanks to Prit Kalariya for contributing the iteration part!
C. Dealing with Missing data¶
Real-world datasets are messy and often contain missing values. Pandas represents missing values as NaN by default; NaN stands for Not a Number.
Missing values can either be ignored, dropped, or filled.
# Creating a dataframe
df3 = pd.DataFrame(np.array ([[1,2,3], [4,np.nan,6], [7,np.nan,np.nan]]),
columns = ['column 1', 'column 2', 'column 3'])
Checking Missing values¶
# Recognizing the missing values
df3.isnull()
column 1 | column 2 | column 3 | |
---|---|---|---|
0 | False | False | False |
1 | False | True | False |
2 | False | True | True |
# Counting the number of missing values in each feature
df3.isnull().sum()
column 1    0
column 2    2
column 3    1
dtype: int64
# Recognizing non-missing values
df3.notna()
column 1 | column 2 | column 3 | |
---|---|---|---|
0 | True | True | True |
1 | True | False | True |
2 | True | False | False |
df3.notna().sum()
column 1    3
column 2    1
column 3    2
dtype: int64
Removing the missing values¶
## Dropping missing values
df3.dropna()
column 1 | column 2 | column 3 | |
---|---|---|---|
0 | 1.0 | 2.0 | 3.0 |
Rows 1 and 2 are deleted because dropna() removes every row that contains at least one missing value.
# you can drop NaNs in specific column(s)
df3['column 3'].dropna()
0    3.0
1    6.0
Name: column 3, dtype: float64
# You can also drop by axis
# axis=1 drops all columns that contain NaNs
# df3.dropna(axis='columns') is the same
df3.dropna(axis=1)
column 1 | |
---|---|
0 | 1.0 |
1 | 4.0 |
2 | 7.0 |
# axis=0 drops all rows that contain NaNs
# df3.dropna(axis='rows') is the same
df3.dropna(axis=0)
column 1 | column 2 | column 3 | |
---|---|---|---|
0 | 1.0 | 2.0 | 3.0 |
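dropna() also takes parameters that make it more selective; for example, subset restricts the check to certain columns, and thresh keeps rows with at least a given number of non-missing values. A small sketch with df3 from above:
# Drop only the rows where 'column 2' is missing
df3.dropna(subset=['column 2'])

# Keep rows that have at least 2 non-missing values
df3.dropna(thresh=2)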
Filling the missing values¶
# Filling Missing values
df3.fillna(10)
column 1 | column 2 | column 3 | |
---|---|---|---|
0 | 1.0 | 2.0 | 3.0 |
1 | 4.0 | 10.0 | 6.0 |
2 | 7.0 | 10.0 | 10.0 |
df3.fillna('fillme')
column 1 | column 2 | column 3 | |
---|---|---|---|
0 | 1.0 | 2.0 | 3.0 |
1 | 4.0 | fillme | 6.0 |
2 | 7.0 | fillme | fillme |
# You can forward fill (ffill) or backward fill (bfill),
# i.e., fill a missing value with the previous or the next value
df3.fillna(method='ffill')
column 1 | column 2 | column 3 | |
---|---|---|---|
0 | 1.0 | 2.0 | 3.0 |
1 | 4.0 | 2.0 | 6.0 |
2 | 7.0 | 2.0 | 6.0 |
# Column 2 is not filled because the values below its NaNs are also NaN, and the NaNs in the last row have nothing after them to backfill from
df3.fillna(method='bfill')
column 1 | column 2 | column 3 | |
---|---|---|---|
0 | 1.0 | 2.0 | 3.0 |
1 | 4.0 | NaN | 6.0 |
2 | 7.0 | NaN | NaN |
# If we change the axis to columns, the NaN at row 1, column 2 is backfilled with the 6 from column 3
df3.fillna(method='bfill', axis='columns')
column 1 | column 2 | column 3 | |
---|---|---|---|
0 | 1.0 | 2.0 | 3.0 |
1 | 4.0 | 6.0 | 6.0 |
2 | 7.0 | NaN | NaN |
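Besides constants and ffill/bfill, another common approach is to fill each column's missing values with a statistic of that column, such as its mean. A minimal sketch with df3:
# Fill the NaNs in each column with that column's mean
df3.fillna(df3.mean())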
D. More Operations and Functions¶
This section shows some of the most useful Pandas functions.
df4 = pd.DataFrame({'Product Name':['Shirt','Boot','Bag'],
'Order Number':[45,56,64],
'Total Quantity':[10,5,9]},
columns = ['Product Name', 'Order Number', 'Total Quantity'])
Retrieving basic info about the Dataframe¶
# Return a summary about the dataframe
df4.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Product Name    3 non-null      object
 1   Order Number    3 non-null      int64
 2   Total Quantity  3 non-null      int64
dtypes: int64(2), object(1)
memory usage: 200.0+ bytes
# Return dataframe columns
df4.columns
Index(['Product Name', 'Order Number', 'Total Quantity'], dtype='object')
# df4.keys is the method itself; calling df4.keys() returns the columns
df4.keys
<bound method NDFrame.keys of Product Name Order Number Total Quantity 0 Shirt 45 10 1 Boot 56 5 2 Bag 64 9>
# Return the head of the dataframe; most useful when you have a long frame
# Choose how many rows you want with head(n)
df4.head(1)
Product Name | Order Number | Total Quantity | |
---|---|---|---|
0 | Shirt | 45 | 10 |
# Return the tail of the dataframe
df4.tail(1)
Product Name | Order Number | Total Quantity | |
---|---|---|---|
2 | Bag | 64 | 9 |
# Return NumPy array of the dataframe
df4.values
array([['Shirt', 45, 10], ['Boot', 56, 5], ['Bag', 64, 9]], dtype=object)
# Return the size or number of elements in a dataframe
df4.size
9
# Return the shape
df4.shape
(3, 3)
# Return the length of the dataframe/the number of rows in a dataframe
df4.shape[0]
3
# Return the length of the dataframe/the number of columns in a dataframe
df4.shape[1]
3
Unique Values¶
# Return unique values in a given column
df4['Product Name'].unique()
array(['Shirt', 'Boot', 'Bag'], dtype=object)
# Return a number of unique values
df4['Product Name'].nunique()
3
# Counting the occurrence of each value in a column
df4['Product Name'].value_counts()
Shirt    1
Boot     1
Bag      1
Name: Product Name, dtype: int64
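value_counts() can also report proportions instead of raw counts. A quick sketch:
# Proportion of each product name rather than raw counts
df4['Product Name'].value_counts(normalize=True)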
Applying a Function to Dataframe¶
# Square the total quantity of each product
def square_quantity(x):
    return x * x

df4['Total Quantity'].apply(square_quantity)
0    100
1     25
2     81
Name: Total Quantity, dtype: int64
# You can also apply an anonymous function to a dataframe
# Squaring each value in dataframe
df5 = pd.DataFrame([[1,2], [4,5]], columns=['col1', 'col2'])
df5.applymap(lambda x: x**2)
col1 | col2 | |
---|---|---|
0 | 1 | 4 |
1 | 16 | 25 |
Sorting values in dataframe¶
# Sort the df4 by the order number
df4.sort_values(['Order Number'])
Product Name | Order Number | Total Quantity | |
---|---|---|---|
0 | Shirt | 45 | 10 |
1 | Boot | 56 | 5 |
2 | Bag | 64 | 9 |
df4.sort_values(['Order Number'], ascending=False)
Product Name | Order Number | Total Quantity | |
---|---|---|---|
2 | Bag | 64 | 9 |
1 | Boot | 56 | 5 |
0 | Shirt | 45 | 10 |
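You can also sort by several columns at once, or by the index. A short sketch:
# Sort by Total Quantity first, then by Order Number to break ties
df4.sort_values(['Total Quantity', 'Order Number'])

# Sort by the row index
df4.sort_index()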
E. Aggregation Methods¶
df4
Product Name | Order Number | Total Quantity | |
---|---|---|---|
0 | Shirt | 45 | 10 |
1 | Boot | 56 | 5 |
2 | Bag | 64 | 9 |
# summary statistics
df4.describe()
Order Number | Total Quantity | |
---|---|---|
count | 3.000000 | 3.000000 |
mean | 55.000000 | 8.000000 |
std | 9.539392 | 2.645751 |
min | 45.000000 | 5.000000 |
25% | 50.500000 | 7.000000 |
50% | 56.000000 | 9.000000 |
75% | 60.000000 | 9.500000 |
max | 64.000000 | 10.000000 |
df4.describe().transpose()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
Order Number | 3.0 | 55.0 | 9.539392 | 45.0 | 50.5 | 56.0 | 60.0 | 64.0 |
Total Quantity | 3.0 | 8.0 | 2.645751 | 5.0 | 7.0 | 9.0 | 9.5 | 10.0 |
# Mode of a column
# The mode is the most frequently occurring value; here every value occurs once, so all are returned
df4['Total Quantity'].mode()
0     5
1     9
2    10
dtype: int64
# The maximum value
df4['Total Quantity'].max()
10
# The minimum value
df4['Total Quantity'].min()
5
# The mean
df4['Total Quantity'].mean()
8.0
# The median value in a dataframe
df4['Total Quantity'].median()
9.0
# Standard deviation
df4['Total Quantity'].std()
2.6457513110645907
# Variance
df4['Total Quantity'].var()
7.0
# Sum of all values in a column
df4['Total Quantity'].sum()
24
# Product of all values in dataframe
df4['Total Quantity'].prod()
450
F. Groupby¶
Group by involves splitting data into groups, applying a function to each group, and combining the results.
df4 = pd.DataFrame({'Product Name':['Shirt','Boot','Bag', 'Ankle', 'Pullover', 'Boot', 'Ankle', 'Tshirt', 'Shirt'],
'Order Number':[45,56,64, 34, 67, 56, 34, 89, 45],
'Total Quantity':[10,5,9, 11, 11, 8, 14, 23, 10]},
columns = ['Product Name', 'Order Number', 'Total Quantity'])
df4
Product Name | Order Number | Total Quantity | |
---|---|---|---|
0 | Shirt | 45 | 10 |
1 | Boot | 56 | 5 |
2 | Bag | 64 | 9 |
3 | Ankle | 34 | 11 |
4 | Pullover | 67 | 11 |
5 | Boot | 56 | 8 |
6 | Ankle | 34 | 14 |
7 | Tshirt | 89 | 23 |
8 | Shirt | 45 | 10 |
# Let's group the df by product name
df4.groupby('Product Name').mean()
Order Number | Total Quantity | |
---|---|---|
Product Name | ||
Ankle | 34.0 | 12.5 |
Bag | 64.0 | 9.0 |
Boot | 56.0 | 6.5 |
Pullover | 67.0 | 11.0 |
Shirt | 45.0 | 10.0 |
Tshirt | 89.0 | 23.0 |
df4.groupby('Product Name').sum()
Order Number | Total Quantity | |
---|---|---|
Product Name | ||
Ankle | 68 | 25 |
Bag | 64 | 9 |
Boot | 112 | 13 |
Pullover | 67 | 11 |
Shirt | 90 | 20 |
Tshirt | 89 | 23 |
df4.groupby('Product Name').min()
Order Number | Total Quantity | |
---|---|---|
Product Name | ||
Ankle | 34 | 11 |
Bag | 64 | 9 |
Boot | 56 | 5 |
Pullover | 67 | 11 |
Shirt | 45 | 10 |
Tshirt | 89 | 23 |
df4.groupby('Product Name').max()
Order Number | Total Quantity | |
---|---|---|
Product Name | ||
Ankle | 34 | 14 |
Bag | 64 | 9 |
Boot | 56 | 8 |
Pullover | 67 | 11 |
Shirt | 45 | 10 |
Tshirt | 89 | 23 |
df4.groupby(['Product Name', 'Order Number']).max()
Total Quantity | ||
---|---|---|
Product Name | Order Number | |
Ankle | 34 | 14 |
Bag | 64 | 9 |
Boot | 56 | 8 |
Pullover | 67 | 11 |
Shirt | 45 | 10 |
Tshirt | 89 | 23 |
df4.groupby(['Product Name', 'Order Number']).sum()
Total Quantity | ||
---|---|---|
Product Name | Order Number | |
Ankle | 34 | 25 |
Bag | 64 | 9 |
Boot | 56 | 13 |
Pullover | 67 | 11 |
Shirt | 45 | 20 |
Tshirt | 89 | 23 |
You can also use aggregate() (or its alias agg()) after groupby.
df4.groupby('Product Name').aggregate(['min', 'max', 'sum'])
Order Number | Total Quantity | |||||
---|---|---|---|---|---|---|
min | max | sum | min | max | sum | |
Product Name | ||||||
Ankle | 34 | 34 | 68 | 11 | 14 | 25 |
Bag | 64 | 64 | 64 | 9 | 9 | 9 |
Boot | 56 | 56 | 112 | 5 | 8 | 13 |
Pullover | 67 | 67 | 67 | 11 | 11 | 11 |
Shirt | 45 | 45 | 90 | 10 | 10 | 20 |
Tshirt | 89 | 89 | 89 | 23 | 23 | 23 |
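aggregate() also accepts a dictionary, which lets you apply a different function to each column. A sketch:
# Different aggregations per column
df4.groupby('Product Name').agg({'Order Number': 'min', 'Total Quantity': 'sum'})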
G. Concatenating, Merging, and Joining¶
# Creating dataframes
df1 = pd.DataFrame({'Col1':['A','B','C'],
'Col2':[1,2,3]},
index=['a','b','c'])
df2 = pd.DataFrame({'Col1':['D','E','F'],
'Col2':[4,5,6]},
index=['d','e','f'])
df3 = pd.DataFrame({'Col1':['G','I','J'],
'Col2':[7,8,9]},
index=['g', 'i','j'])
df1
Col1 | Col2 | |
---|---|---|
a | A | 1 |
b | B | 2 |
c | C | 3 |
df2
Col1 | Col2 | |
---|---|---|
d | D | 4 |
e | E | 5 |
f | F | 6 |
df3
Col1 | Col2 | |
---|---|---|
g | G | 7 |
i | I | 8 |
j | J | 9 |
# Concatenating: Adding one dataset to another
pd.concat([df1, df2, df3])
Col1 | Col2 | |
---|---|---|
a | A | 1 |
b | B | 2 |
c | C | 3 |
d | D | 4 |
e | E | 5 |
f | F | 6 |
g | G | 7 |
i | I | 8 |
j | J | 9 |
The default axis is 0. This is how the combined dataframe will look if we change the axis to 1.
pd.concat([df1, df2, df3], axis=1)
Col1 | Col2 | Col1 | Col2 | Col1 | Col2 | |
---|---|---|---|---|---|---|
a | A | 1.0 | NaN | NaN | NaN | NaN |
b | B | 2.0 | NaN | NaN | NaN | NaN |
c | C | 3.0 | NaN | NaN | NaN | NaN |
d | NaN | NaN | D | 4.0 | NaN | NaN |
e | NaN | NaN | E | 5.0 | NaN | NaN |
f | NaN | NaN | F | 6.0 | NaN | NaN |
g | NaN | NaN | NaN | NaN | G | 7.0 |
i | NaN | NaN | NaN | NaN | I | 8.0 |
j | NaN | NaN | NaN | NaN | J | 9.0 |
# We can also use append(), though it is deprecated in newer Pandas versions in favor of pd.concat()
df1.append([df2, df3])
Col1 | Col2 | |
---|---|---|
a | A | 1 |
b | B | 2 |
c | C | 3 |
d | D | 4 |
e | E | 5 |
f | F | 6 |
g | G | 7 |
i | I | 8 |
j | J | 9 |
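Since append() is deprecated in newer Pandas versions, pd.concat() is the recommended way to get the same result:
# Equivalent to df1.append([df2, df3]) in recent Pandas versions
pd.concat([df1, df2, df3])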
Merging¶
If you have worked with SQL, what pd.merge() does may be familiar. It links data from different sources (different features), and you have control over the structure of the combined dataset.
Pandas merge method (how) : SQL join name : Description
* left : LEFT OUTER JOIN : Use keys or columns from left frame only
* right : RIGHT OUTER JOIN : Use keys or columns from right frame only
* outer : FULL OUTER JOIN : Use union of keys or columns from both frames
* inner : INNER JOIN : Use intersection of keys or columns from both frames
df1 = pd.DataFrame({'Name': ['Joe', 'Joshua', 'Jeanne', 'David'],
'Role': ['Manager', 'Developer', 'Engineer', 'Scientist']})
df2 = pd.DataFrame({'Name': ['David', 'Joshua', 'Joe', 'Jeanne'],
'Year Hired': [2018, 2017, 2020, 2018]})
df3 = pd.DataFrame({'Name': ['David', 'Joshua', 'Joe', 'Jeanne'],
'No of Leaves': [15, 3, 10, 12]})
df1
Name | Role | |
---|---|---|
0 | Joe | Manager |
1 | Joshua | Developer |
2 | Jeanne | Engineer |
3 | David | Scientist |
df2
Name | Year Hired | |
---|---|---|
0 | David | 2018 |
1 | Joshua | 2017 |
2 | Joe | 2020 |
3 | Jeanne | 2018 |
pd.merge(df1, df2)
Name | Role | Year Hired | |
---|---|---|---|
0 | Joe | Manager | 2020 |
1 | Joshua | Developer | 2017 |
2 | Jeanne | Engineer | 2018 |
3 | David | Scientist | 2018 |
## Let's merge with Name as the key
pd.merge(df1, df2, how='inner', on="Name")
Name | Role | Year Hired | |
---|---|---|---|
0 | Joe | Manager | 2020 |
1 | Joshua | Developer | 2017 |
2 | Jeanne | Engineer | 2018 |
3 | David | Scientist | 2018 |
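If the key column has different names in the two frames, you can pass left_on and right_on instead of on. A sketch with made-up frames:
# Hypothetical frames whose key columns have different names
employees = pd.DataFrame({'Employee': ['Joe', 'Jeanne'], 'Role': ['Manager', 'Engineer']})
hires = pd.DataFrame({'Name': ['Joe', 'Jeanne'], 'Year Hired': [2020, 2018]})

pd.merge(employees, hires, left_on='Employee', right_on='Name')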
df1 = pd.DataFrame({'col1': ['K0', 'K0', 'K1', 'K2'],
'col2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'col1': ['K0', 'K1', 'K1', 'K2'],
'col2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
df1
col1 | col2 | A | B | |
---|---|---|---|---|
0 | K0 | K0 | A0 | B0 |
1 | K0 | K1 | A1 | B1 |
2 | K1 | K0 | A2 | B2 |
3 | K2 | K1 | A3 | B3 |
df2
col1 | col2 | C | D | |
---|---|---|---|---|
0 | K0 | K0 | C0 | D0 |
1 | K1 | K0 | C1 | D1 |
2 | K1 | K0 | C2 | D2 |
3 | K2 | K0 | C3 | D3 |
pd.merge(df1, df2, how='inner', on=['col1', 'col2'])
col1 | col2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | K0 | A2 | B2 | C1 | D1 |
2 | K1 | K0 | A2 | B2 | C2 | D2 |
pd.merge(df1, df2, how='outer', on=['col1', 'col2'])
col1 | col2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K0 | K1 | A1 | B1 | NaN | NaN |
2 | K1 | K0 | A2 | B2 | C1 | D1 |
3 | K1 | K0 | A2 | B2 | C2 | D2 |
4 | K2 | K1 | A3 | B3 | NaN | NaN |
5 | K2 | K0 | NaN | NaN | C3 | D3 |
pd.merge(df1, df2, how='left')
col1 | col2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K0 | K1 | A1 | B1 | NaN | NaN |
2 | K1 | K0 | A2 | B2 | C1 | D1 |
3 | K1 | K0 | A2 | B2 | C2 | D2 |
4 | K2 | K1 | A3 | B3 | NaN | NaN |
pd.merge(df1, df2, how='right')
col1 | col2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | K0 | A2 | B2 | C1 | D1 |
2 | K1 | K0 | A2 | B2 | C2 | D2 |
3 | K2 | K0 | NaN | NaN | C3 | D3 |
Joining¶
Joining is a simple way to combine the columns of two dataframes, aligning rows by their (possibly different) indexes.
df1 = pd.DataFrame({'Col1': ['A', 'B', 'C'],
'Col2': [11, 12, 13]},
index=['a', 'b', 'c'])
df2 = pd.DataFrame({'Col3': ['D', 'E', 'F'],
'Col4': [14, 14, 16]},
index=['a', 'c', 'd'])
df1
Col1 | Col2 | |
---|---|---|
a | A | 11 |
b | B | 12 |
c | C | 13 |
df2
Col3 | Col4 | |
---|---|---|
a | D | 14 |
c | E | 14 |
d | F | 16 |
df1.join(df2)
Col1 | Col2 | Col3 | Col4 | |
---|---|---|---|---|
a | A | 11 | D | 14.0 |
b | B | 12 | NaN | NaN |
c | C | 13 | E | 14.0 |
df2.join(df1)
Col3 | Col4 | Col1 | Col2 | |
---|---|---|---|---|
a | D | 14 | A | 11.0 |
c | E | 14 | C | 13.0 |
d | F | 16 | NaN | NaN |
You can see that with df.join(), the data is aligned on the indexes.
df1.join(df2, how='inner')
Col1 | Col2 | Col3 | Col4 | |
---|---|---|---|---|
a | A | 11 | D | 14 |
c | C | 13 | E | 14 |
df1.join(df2, how='outer')
Col1 | Col2 | Col3 | Col4 | |
---|---|---|---|---|
a | A | 11.0 | D | 14.0 |
b | B | 12.0 | NaN | NaN |
c | C | 13.0 | E | 14.0 |
d | NaN | NaN | F | 16.0 |
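If the two frames share a column name, join() needs lsuffix/rsuffix to disambiguate the overlapping columns. A small sketch with made-up frames:
# Hypothetical frames that both have a column named 'Col1'
left = pd.DataFrame({'Col1': ['A', 'B']}, index=['a', 'b'])
right = pd.DataFrame({'Col1': ['X', 'Y']}, index=['a', 'b'])

left.join(right, lsuffix='_left', rsuffix='_right')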
Learn more about merging, joining, and concatenating Pandas DataFrames in the Pandas user guide.
H. Beyond Dataframes: Working with CSV and Excel¶
In this last section of the Pandas fundamentals, we will see how to read real-world data in different formats: CSV and Excel.
CSV and Excel¶
Let's use the California housing dataset.
# Let's download the data
!curl -O https://raw.githubusercontent.com/nyandwi/public_datasets/master/housing.csv
% Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 1390k 100 1390k 0 0 409k 0 0:00:03 0:00:03 --:--:-- 409k
data = pd.read_csv('housing.csv')
data.head()
longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | |
---|---|---|---|---|---|---|---|---|---|---|
0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
type(data)
pandas.core.frame.DataFrame
## Exporting dataframe back to csv
data.to_csv('housing_dataset', index=False)
If you look in the folder sidebar, you can see the exported file housing_dataset.
## Exporting CSV to Excel
data.to_excel('housing_excel.xlsx', index=False)
--------------------------------------------------------------------------- ModuleNotFoundError Traceback (most recent call last) /var/folders/9x/fscj3yx566q3y3y1kf5yh9m40000gn/T/ipykernel_1348/1131967869.py in <module> 1 ## Exporting CSV to Excel 2 ----> 3 data.to_excel('housing_excel.xlsx', index=False) ~/miniforge3/envs/TensorPro/lib/python3.9/site-packages/pandas/core/generic.py in to_excel(self, excel_writer, sheet_name, na_rep, float_format, columns, header, index, index_label, startrow, startcol, engine, merge_cells, encoding, inf_rep, verbose, freeze_panes, storage_options) 2282 inf_rep=inf_rep, 2283 ) -> 2284 formatter.write( 2285 excel_writer, 2286 sheet_name=sheet_name, ~/miniforge3/envs/TensorPro/lib/python3.9/site-packages/pandas/io/formats/excel.py in write(self, writer, sheet_name, startrow, startcol, freeze_panes, engine, storage_options) 832 # error: Cannot instantiate abstract class 'ExcelWriter' with abstract 833 # attributes 'engine', 'save', 'supported_extensions' and 'write_cells' --> 834 writer = ExcelWriter( # type: ignore[abstract] 835 writer, engine=engine, storage_options=storage_options 836 ) ~/miniforge3/envs/TensorPro/lib/python3.9/site-packages/pandas/io/excel/_openpyxl.py in __init__(self, path, engine, date_format, datetime_format, mode, storage_options, if_sheet_exists, engine_kwargs, **kwargs) 46 ): 47 # Use the openpyxl module as the Excel writer. ---> 48 from openpyxl.workbook import Workbook 49 50 engine_kwargs = combine_kwargs(engine_kwargs, kwargs) ModuleNotFoundError: No module named 'openpyxl'
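The export fails because Pandas relies on the openpyxl package to write .xlsx files and it is not installed in this environment. Installing it and re-running the export should fix the error (assuming the notebook can run shell commands):
# Install the missing Excel engine, then export again
!pip install openpyxl
data.to_excel('housing_excel.xlsx', index=False)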
## Reading the Excel file back
excel_data = pd.read_excel('housing_excel.xlsx')
excel_data.head()
Real World: Exploratory Data Analysis (EDA)¶
Everything above covered the basics. Let's apply some of these techniques to a real-world dataset, Red Wine Quality.
!curl -O https://raw.githubusercontent.com/nyandwi/public_datasets/master/winequality-red.csv
wine_data = pd.read_csv('winequality-red.csv')
# Displaying the head of the dataset
wine_data.head()
# Displaying the tail of the dataset
wine_data.tail()
# Displaying summary statistics
wine_data.describe().transpose()
# Displaying quick information about the dataset
wine_data.info()
# Checking missing values
wine_data.isnull().sum()
# Wine quality ranges from 0 to 10; the higher the quality value, the better the wine
wine_data['quality'].value_counts()
wine_data.groupby(['fixed acidity', 'volatile acidity', 'citric acid']).sum()
wine_data.groupby(['free sulfur dioxide', 'total sulfur dioxide']).sum()
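As one more quick exploration, we can check which features correlate most with wine quality (a sketch; every column in this dataset is numeric, so corr() works directly):
# Correlation of each feature with quality, from most to least correlated
wine_data.corr()['quality'].sort_values(ascending=False)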
This is the end of the lab on using Pandas to manipulate data. A lot of these things will make more sense when we start preparing data for machine learning models in the next notebooks.