Data Wrangling and Visualization of Students' Performance

Oyinkansola Awosan
6 min read · Mar 28, 2021


Data wrangling, also known as data cleaning or data munging, is perhaps the most important step in transforming raw data into a functional form that can be used for model building, data analysis, and visualization.

The aim of data wrangling is to make raw data ready for analysis: it covers all the work done on the data before the actual analysis starts.

Usually, this entails restructuring, cleaning, and preprocessing the data. According to Express Analytics, data analysts spend about 80% of their time on data wrangling rather than on the actual analysis, which shows just how important this step is in the science and analysis of data.

I recently performed data wrangling and analysis on a dataset of Students Performance in Exams, which was obtained from Kaggle.

I started by importing the necessary libraries and packages needed for the task as seen below.

import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
%matplotlib inline

NumPy is imported here as it is used to perform numerical operations in Python.

Pandas is one of the most widely used Python packages and is very useful for data wrangling, as it makes importing and analyzing data easier.

Seaborn is a data visualization library, and is imported here for the analysis and visualization that will be done as this article progresses.

Pyplot is the plotting module of the Matplotlib package.

Using the Pandas library, I imported the file containing the dataset. You can read your dataset directly from the site; however, I downloaded the archive containing the dataset and imported it from my system folder as seen below. Note that read_csv can open a CSV straight out of a zip archive that contains a single file.

import requests
from io import StringIO
# requests and StringIO are only needed when fetching the CSV from a URL;
# read_csv can read a CSV directly from a local zip archive
df = pd.read_csv(r"C:\Users\user\Downloads\archive.zip")
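The imports of requests and StringIO suggest the alternative route of reading the file straight from the site. A minimal sketch of that pattern follows; a real run would call requests.get(url) and pass resp.text (and Kaggle downloads require authentication), so a tiny inline CSV stands in for the downloaded text here:

```python
from io import StringIO

import pandas as pd

# In a real run: resp = requests.get(url); csv_text = resp.text
# A small inline CSV stands in for the downloaded content.
csv_text = "gender,math score\nfemale,72\nmale,69\n"
df = pd.read_csv(StringIO(csv_text))
```

read_csv treats the StringIO object exactly like a file handle, so the rest of the workflow is unchanged.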

Now, to see what the dataset looks like and the features it contains, I called the data set as

df

and got the output below:

Features of the data set

This displays the first and last 5 rows.

Now, to specifically view the columns we have, I input this,

df.columns

and got this:

All the columns in the dataset

To know the shape of the dataframe,

df.shape

with an output of

Shape of dataset

This shows that the dataset contains 1,000 rows and 8 columns.

To get information about the columns we have,

df.info()

with output of

Data frame information

I then generated a description of the numeric part of the data frame using

df.describe()

Description of data frame

This gave me the statistical summary of the numerical columns.

After doing this, I wanted to know if there were any duplicates, so I could find and remove them.

Dataframe duplicates

These two images show that there are no duplicates in the data frame.
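The duplicate check itself only appears in the images above; it was most likely a call to df.duplicated(). A minimal sketch of that check on a hypothetical toy frame standing in for the exam dataset:

```python
import pandas as pd

# Toy frame standing in for the exam dataset (values are made up)
df = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "math_score": [72, 69, 90],
})

# duplicated() flags rows that repeat an earlier row in full
dup_count = df.duplicated().sum()

# drop_duplicates() would remove such rows if any existed
clean = df.drop_duplicates()
```

Since no row repeats another in full, dup_count is 0 and drop_duplicates leaves the frame unchanged, matching what the images show for the real dataset.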

Now, I wanted to know the number of unique values in each column, so I put in

df.nunique()

and got this:

Output for df.nunique()

What this tells me is that there are only 2 types of gender in the data set, 5 types of ethnicity, 6 levels of parental education and so on.

To get the number of missing values in the dataset,

df.isnull().sum()

and got:

Missing values output

This tells me that there are no missing values in the data set, so I don’t have to go further on that.

I wanted to change the names of some of the columns to something shorter and simpler, so I input this:

df.rename(columns={'race/ethnicity':'ethnicity','parental level of education':'parents_education'},inplace=True);
df.rename(columns=lambda x:x.strip().replace(' ','_'),inplace=True)

after which I called the columns and got this:

New names of columns

If you observe the code and image above carefully, you will notice that some columns now have shorter, simpler names.
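To see why the two-step rename works, here it is applied to a toy frame carrying the original Kaggle column names (the frame itself is just an illustration):

```python
import pandas as pd

# Empty frame carrying the original Kaggle column names
df = pd.DataFrame(columns=["race/ethnicity",
                           "parental level of education",
                           "math score"])

# Step 1: rename the two long columns explicitly via a mapping
df.rename(columns={"race/ethnicity": "ethnicity",
                   "parental level of education": "parents_education"},
          inplace=True)

# Step 2: a callable is applied to every remaining column name,
# trimming whitespace and replacing spaces with underscores
df.rename(columns=lambda x: x.strip().replace(" ", "_"), inplace=True)
```

After both steps the columns read ethnicity, parents_education, and math_score: the explicit mapping handles the awkward names, and the lambda normalizes everything else.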

I then removed columns that I felt were irrelevant to the analysis of this dataset.

to_remove = ['lunch',
             'parents_education',
             'test_preparation_course']
df.drop(to_remove, axis=1, inplace=True)

I then wanted to view the new dataset without the removed columns, so I ran

df.head()

and got the output:

First five rows of new data set

I then began to sort the data set according to the columns, which I also used for the visualization later.

Ethnicity and Gender sorting
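The sorting shown in the image was presumably done with sort_values; a sketch of that step on a hypothetical toy frame:

```python
import pandas as pd

# Toy frame with the two object columns (values are made up)
df = pd.DataFrame({
    "gender": ["male", "female", "female"],
    "ethnicity": ["group B", "group A", "group C"],
})

# Sorting by an object column orders the rows alphabetically by its values
by_gender = df.sort_values("gender")
by_ethnicity = df.sort_values("ethnicity")
```

Each call returns a new frame ordered by the chosen column, which is the form the later visualizations work from.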

The two columns sorted above are objects. For the integer, i.e. numerical, columns, because the dataset represents 1,000 students, I sorted each column and picked a sample to visualize. See below.

Math score sorting

Here, after sorting according to math_score, I went ahead to create a new variable ‘maths’ which takes just a sample from the ‘math_score’ column.

I then sorted it, to get it ready for data visualization.
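The exact sampling code is only visible in the image; one plausible version uses Series.sample, where the sample size of 5 and the fixed seed below are my assumptions for illustration:

```python
import pandas as pd

# Short stand-in for the 1,000-row math_score column (values made up)
scores = pd.Series([72, 69, 90, 47, 76, 71, 88, 40, 64, 38],
                   name="math_score")

# Take a reproducible random sample, then sort it ready for plotting
maths = scores.sample(n=5, random_state=42).sort_values()
```

Fixing random_state makes the sample reproducible between runs, which keeps the resulting plot stable.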

I repeated this process for the other two numerical columns, using ‘readings’ to take a sample from ‘reading_score’ as seen below:

Reading scores sorting

I also repeated this for ‘writing_score’ using ‘writings’.

Writing scores sorting

After this, I went ahead to visualize the sorted data, which is where the data visualization libraries/packages come into play.

I first visualized according to gender as seen below,

labels=df['gender'].value_counts().index
values=df['gender'].value_counts().values
colors = ["skyblue", "darkblue"]
plt.figure(figsize=(10,8))
plt.pie(values, labels=labels, autopct='%1.1f%%', startangle=90, colors=colors)
plt.title('Gender')
plt.show()

and got this output,

Gender distribution visualization

Then I went ahead to visualize according to ethnicity, inputting this block of code,

base_color=sb.color_palette()[6]
freq=df['ethnicity'].value_counts()
gen_order=freq.index
sb.countplot(data=df,x='ethnicity',color=base_color,order=gen_order);

then got this:

Data visualization of ethnicity

I then went ahead to visualize that of the maths scores

base_color=sb.color_palette()[8]
freq=maths.value_counts()
gen_order=freq.index
sb.countplot(data=df,x=maths,color=base_color,order=gen_order)
plt.xticks(rotation=45)

and got this output:

Maths scores data visualization

Moving on, I repeated the process for the reading scores

base_color=sb.color_palette()[5]
freq=readings.value_counts()
gen_order=freq.index
sb.countplot(data=df,x=readings,color=base_color,order=gen_order)
plt.xticks(rotation=45)

with an output of:

Reading scores visualization

Lastly, I did that of the writing scores.

base_color=sb.color_palette()[1]
freq=writings.value_counts()
gen_order=freq.index
sb.countplot(data=df,x=writings,color=base_color,order=gen_order)
plt.xticks(rotation=45)

with the image below as the output.

Writing scores visualization

I hope you enjoyed reading this article or found it helpful. You can access the notebook containing the entire code here.

Please feel free to drop your questions or suggestions as comments or you can reach out to me here.

Written by Oyinkansola Awosan

Technical Writer, Open Source Enthusiast, Machine Learning & Site Reliability Engineer
