This notebook was created by Jean de Dieu Nyandwi for the love of machine learning community. For any feedback, errors or suggestion, he can be reached on email (johnjw7084 at gmail dot com), Twitter, or LinkedIn.
Pandas for Data Visualization¶
Pandas, which we used for data analysis and manipulation, can also visualize data with very little code.
To step back a bit, Matplotlib is the primary visualization library in Python, and both Seaborn and the Pandas plotting methods are built on top of it.
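Because Pandas plotting sits on top of Matplotlib, the object returned by plot() is an ordinary Matplotlib Axes that can be customized further. Here is a minimal sketch of that idea; the DataFrame `demo` and its column names are made up for illustration and are not part of the datasets used in this notebook.

```python
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.axes import Axes

# Small made-up DataFrame, for illustration only
demo = pd.DataFrame({'x_squared': [0, 1, 4, 9], 'x_doubled': [0, 2, 4, 6]})

# DataFrame.plot() draws one line per numeric column and returns a Matplotlib Axes
ax = demo.plot()

# Any Matplotlib Axes method can then be used to customize the figure
ax.set_title('Pandas plot customized with Matplotlib')
ax.set_xlabel('row index')
ax.set_ylabel('value')

# Confirm that the returned object is a Matplotlib Axes
print(isinstance(ax, Axes))
```

The same pattern applies to the other plot kinds shown later: keep the returned Axes and adjust titles, labels or limits on it.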
1. Imports and Loading datasets¶
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Loading dataset
titanic = sns.load_dataset('titanic')
tips = sns.load_dataset('tips')
titanic.head()
| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
tips.head()
| | total_bill | tip | sex | smoker | day | time | size |
---|---|---|---|---|---|---|---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
# Checking if the dataset is a Pandas DataFrame
type(tips)
pandas.core.frame.DataFrame
2. Basic Plots¶
tips[['tip', 'total_bill']].plot()
<AxesSubplot:>
titanic['age'].hist()
<AxesSubplot:>
We can change the style of a plot with plt.style.use('style_name') to create beautiful visualizations.
plt.style.use('ggplot')
# The 'ggplot' style mimics the look of ggplot, a popular plotting library for the R language
titanic['age'].hist()
<AxesSubplot:>
# Note: on Matplotlib 3.6 and later, this style is named 'seaborn-v0_8-talk'
plt.style.use('seaborn-talk')
titanic['age'].hist()
<AxesSubplot:>
plt.style.use('dark_background')
titanic['age'].hist()
<AxesSubplot:>
plt.style.use('grayscale')
titanic['age'].hist()
<AxesSubplot:>
plt.style.use('fivethirtyeight')
titanic['age'].hist()
<AxesSubplot:>
There are many more style sheets worth checking out if you are interested in creating attractive visualizations.
Learn more about style sheets.
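Style names differ between Matplotlib versions, so it can be safer to list what is installed than to assume a name exists. The sketch below prints the available styles and applies one temporarily with plt.style.context(); the `ages` values are made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# List every style sheet shipped with the installed Matplotlib version
available_styles = sorted(plt.style.available)
print(available_styles)

# Made-up values, for illustration only
ages = pd.Series([22, 38, 26, 35, 35, 54, 2, 27, 14, 4])

# plt.style.context() applies a style only inside the with-block,
# so the global style set by plt.style.use() is left untouched
with plt.style.context('ggplot'):
    ax = ages.hist()
```

Using the context manager is a reasonable choice when only one figure should look different from the rest of the notebook.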
3. More Plots¶
We can use plot() to create more plot types. These are the types covered in this notebook:
- Bar plot
- Histogram
- Box plots
- Area plots
- Kernel Density estimation plots (KDE)
- Scatter plots
- Hexagonal bin plots
- Pie charts
A. Bar Plot¶
# Note: on Matplotlib 3.6 and later, this style is named 'seaborn-v0_8-dark'
plt.style.use('seaborn-dark')
# First 20 rows of total_bill (the first rows, not the 20 largest values)
top_20 = tips['total_bill'][0:20]
top_20.plot(kind='bar')
# Same as
# top_20.plot.bar()
<AxesSubplot:>
We can also plot stacked bar plots by setting stacked to True.
top_30_rows = tips[0:30]
top_30_rows.plot(kind='bar',stacked=True)
<AxesSubplot:>
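To make the effect of stacked=True concrete, here is a small sketch on a made-up DataFrame (`orders` and its values are illustrative assumptions). Each column is drawn on top of the previous one, so the second segment starts where the first one ends.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up bills and tips, for illustration only
orders = pd.DataFrame({'bill': [10.0, 20.0, 15.0], 'tip': [1.0, 3.0, 2.0]},
                      index=['order 1', 'order 2', 'order 3'])

# With stacked=True, each column is drawn on top of the previous one,
# so the full bar height is the row total (bill + tip)
ax = orders.plot(kind='bar', stacked=True)

# Pandas draws the bars column by column, so the last three rectangles
# are the 'tip' segments; each one starts where the 'bill' segment ends
tip_segment_bottoms = [patch.get_y() for patch in ax.patches[3:]]
print(tip_segment_bottoms)
```

Stacking is most readable when the columns share a unit and their sum is meaningful, as with a bill and its tip.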
Use kind='barh' (or .plot.barh()) to create horizontal bar charts.
first_30_pasesengers = titanic[0:30]
first_30_pasesengers.plot(kind='barh',stacked=True)
<AxesSubplot:>
B. Histogram¶
titanic['age'].plot(kind='hist')
<AxesSubplot:ylabel='Frequency'>
first_30_pasesengers.plot(kind='hist',stacked=True, bins=20)
<AxesSubplot:ylabel='Frequency'>
first_30_pasesengers.plot(kind='hist',stacked=True, bins=20, orientation='horizontal')
<AxesSubplot:xlabel='Frequency'>
You can also create histograms easily with dataframe.hist(), which we saw at the beginning.
tips['size'].hist()
<AxesSubplot:>
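The bins argument used above controls how finely the values are grouped. As a rough sketch on made-up data (`party_sizes` is an illustrative assumption), the bar heights of a histogram always add up to the number of observations, whatever the number of bins.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up party sizes, for illustration only
party_sizes = pd.Series([2, 3, 3, 2, 4, 4, 2, 6, 1, 2])

# bins controls how many equal-width intervals the value range is split into
ax = party_sizes.plot(kind='hist', bins=5)

# Each bar's height is the number of values in its interval,
# so the heights add up to the number of observations
bar_heights = [patch.get_height() for patch in ax.patches]
print(sum(bar_heights) == len(party_sizes))
```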
C. Box Plots¶
top_30_rows.plot(kind='box')
<AxesSubplot:>
# You can also use dataframe.boxplot()
top_30_rows.boxplot()
<AxesSubplot:>
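A box plot is a visual summary of the quartiles of each column. The sketch below, on a made-up column (`tips_demo` is an illustrative assumption), computes the same quartiles numerically so they can be compared with the drawn box.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up tips, for illustration only
tips_demo = pd.DataFrame({'tip': [1.0, 2.0, 3.0, 4.0, 5.0]})

# The box spans the first to the third quartile,
# and the line inside the box marks the median
quartiles = tips_demo['tip'].quantile([0.25, 0.5, 0.75])
print(quartiles)

ax = tips_demo.plot(kind='box')
```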
D. Area Plots¶
size_top_bill = tips[['size','tip', 'total_bill']]
size_top_bill.plot(kind='area')
<AxesSubplot:>
By default, an area plot is stacked, but you can disable this with stacked=False.
# Only displaying the first 30 rows for clarity
size_top_bill[0:30].plot(kind='area', stacked=False)
<AxesSubplot:>
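To compare the two modes directly, the following sketch draws a stacked and an unstacked area plot side by side by passing Matplotlib axes through the ax argument. The DataFrame `demo` is made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up values, for illustration only
demo = pd.DataFrame({'tip': [1, 3, 2, 4], 'size': [2, 2, 3, 4]})

# One figure with two side-by-side axes
fig, (ax_stacked, ax_overlaid) = plt.subplots(1, 2, figsize=(10, 4))

# Default: areas are stacked, so the top edge is the running total of the columns
demo.plot(kind='area', ax=ax_stacked, title='stacked=True (default)')

# stacked=False: each area starts from zero and the areas overlap
demo.plot(kind='area', stacked=False, ax=ax_overlaid, title='stacked=False')
```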
E. Kernel Density estimation plots (KDE)¶
# Note: KDE plots require SciPy to be installed
titanic['age'].plot.kde()
<AxesSubplot:ylabel='Density'>
F. Scatter Plots¶
tips.plot.scatter(x='tip', y='total_bill')
<AxesSubplot:xlabel='tip', ylabel='total_bill'>
tips.plot.scatter(x='tip', y='total_bill', color='Blue')
<AxesSubplot:xlabel='tip', ylabel='total_bill'>
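Besides a single fixed color, a scatter plot can color each point by the value of another column. This is a minimal sketch using the c and colormap arguments of plot.scatter(); the DataFrame `demo` holds made-up values for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up values, for illustration only
demo = pd.DataFrame({
    'tip': [1.0, 1.7, 3.5, 3.3, 3.6],
    'total_bill': [17.0, 10.3, 21.0, 23.7, 24.6],
    'size': [2, 3, 3, 2, 4],
})

# c names the column that drives the point colors;
# colormap maps those values to colors
ax = demo.plot.scatter(x='tip', y='total_bill', c='size', colormap='viridis')
```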
G. Hexagonal Bin Plots¶
tips.plot.hexbin(x='tip', y='total_bill',gridsize=30)
<AxesSubplot:xlabel='tip', ylabel='total_bill'>
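Hexagonal bin plots are most useful when there are too many points for a scatter plot to stay readable. The sketch below uses synthetic data (the random values and the relationship between the columns are assumptions made for illustration) and a smaller gridsize to show coarser bins.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic, correlated data generated for illustration only
rng = np.random.default_rng(0)
tip = rng.uniform(1, 10, size=500)
demo = pd.DataFrame({'tip': tip,
                     'total_bill': 5 * tip + rng.normal(0, 5, size=500)})

# gridsize is the number of hexagons along the x-axis:
# smaller values give larger, coarser bins
ax = demo.plot.hexbin(x='tip', y='total_bill', gridsize=10)
```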
H. Pie Plots¶
df = pd.DataFrame({'qty': [10, 20, 30],
                   'sales': [200, 700, 500]},
                  index=['Apple', 'Orange', 'Lemon'])
df.plot(kind='pie', y='qty')
<AxesSubplot:ylabel='qty'>
# subplots=True draws one pie chart per column
df.plot(kind='pie', subplots=True);
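Pie charts are often built from category counts rather than from a hand-made table. As a sketch, assuming a made-up categorical Series `days`, value_counts() produces the counts and autopct (a Matplotlib argument passed through by Pandas) labels each slice with its percentage.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up categorical column, for illustration only
days = pd.Series(['Sun', 'Sat', 'Sun', 'Thur', 'Sat', 'Sun'])

# value_counts() turns the categories into counts, one pie slice per category
day_counts = days.value_counts()

# autopct is passed through to Matplotlib and labels each slice with its percentage
ax = day_counts.plot(kind='pie', autopct='%1.1f%%')
```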