Data Wrangling and Visualization of Students' Performance
Data wrangling, also known as data cleaning or data munging, is perhaps the most important step in transforming raw data into a functional form that can be used for model building, data analysis, and visualization.
The aim of data wrangling is to make raw data ready for analysis: it covers all the work done on the data before the actual analysis starts.
Usually, this entails restructuring, cleaning, and preprocessing the data. According to Express Analytics, data analysts spend about 80% of their time on data wrangling rather than on the actual analysis. This shows just how important the step is in data science and analysis.
I recently performed data wrangling and analysis on a dataset of Students Performance in Exams, obtained from Kaggle.
I started by importing the necessary libraries and packages needed for the task as seen below.
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
%matplotlib inline
NumPy is imported as it is used to perform mathematical operations in Python.
Pandas is a Python package that is very useful for data wrangling, as it makes importing and analyzing data much easier.
Seaborn is a data visualization library, imported here for the data analysis and visualization that will be done as this article progresses.
Pyplot is a plotting module in the Matplotlib package.
Using the Pandas library, I imported the CSV file containing the dataset. You can load the dataset directly from the site; however, I downloaded the file and imported it from my system folder, as seen below.
import requests
from io import StringIO  # requests/StringIO are only needed when loading from a URL

df = pd.read_csv(r"C:\Users\user\Downloads\archive.zip")  # pandas can read a CSV straight out of a zip archive
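For completeness, here is a sketch of the URL-based alternative mentioned above, which is what the requests/StringIO imports would be used for. The URL is a placeholder, since real Kaggle downloads usually require authentication:
url = "https://example.com/StudentsPerformance.csv"  # hypothetical URL, not the actual Kaggle link
response = requests.get(url)                         # fetch the raw CSV text over HTTP
df = pd.read_csv(StringIO(response.text))            # parse the in-memory text as a CSV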
Now, to see what the dataset looks like and the features it contains, I called the data set as
df
and got the output below:
This displays the first and last 5 rows.
Now, to specifically view the columns we have, I input this,
df.columns
and got this:
To know the shape of the dataframe,
df.shape
with this outcome:
This shows that the dataset contains 1000 rows and 8 columns.
To get information about the columns we have,
df.info()
with this output:
I then went ahead to get a description of the numeric parts of the data frame using
df.describe()
This gave me a statistical summary of the numerical columns: count, mean, standard deviation, minimum, quartiles, and maximum.
After doing this, I wanted to know if there were any duplicates, so I could find and remove them.
These two images show that there are no duplicates in the data frame.
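The check itself appears only in the images; a minimal version using pandas' built-in methods would be:
df.duplicated().sum()   # number of fully duplicated rows; 0 for this data set
df[df.duplicated()]     # the duplicated rows themselves; an empty frame here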
Now, I wanted to know the number of unique values in each column, so I put in
df.nunique()
and got this:
What this tells me is that there are only 2 genders in the data set, 5 ethnicity groups, 6 levels of parental education, and so on.
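If you also want to see what those categories actually are, value_counts() lists each one with its frequency. A quick sketch, using the column names as they still are at this point (they get renamed further down):
df['race/ethnicity'].value_counts()              # the 5 ethnicity groups and their counts
df['parental level of education'].value_counts() # the 6 education levels and their counts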
To get the number of missing values in the dataset,
df.isnull().sum()
and got:
This tells me that there are no missing values in the data set, so I don’t have to go further on that.
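Had there been missing values, the usual next step would be something like the sketch below; it is not needed here, and is included only for completeness:
df.dropna()                             # drop rows containing any missing value
df.fillna(df.mean(numeric_only=True))   # or fill numeric gaps with each column's mean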
I wanted to change the names of some of the columns to something shorter and easier, so I input this:
df.rename(columns={'race/ethnicity':'ethnicity','parental level of education':'parents_education'}, inplace=True)
df.rename(columns=lambda x: x.strip().replace(' ', '_'), inplace=True)  # replace spaces with underscores in all column names
after which I called the columns and got this:
If you observe the code and image above carefully, you will notice that some columns now have shorter, simpler names.
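For reference, after both renames the column list should read roughly as follows; this is inferred from the rename calls above rather than copied from an actual run:
df.columns
# Index(['gender', 'ethnicity', 'parents_education', 'lunch',
#        'test_preparation_course', 'math_score', 'reading_score',
#        'writing_score'], dtype='object')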
I then moved on to remove columns that I felt were irrelevant to the analysis of this data set.
to_remove = ['lunch',
             'parents_education',
             'test_preparation_course']

df.drop(to_remove, inplace=True, axis=1)
I then wanted to view the new data set without the dropped columns, so I ran
df.head()
and got the output:
I then began to sort the data set according to its columns, which I also used for the visualizations later.
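The sorting itself is shown in the images; assuming the two categorical columns in question are gender and ethnicity, it would look something like this:
df.sort_values(by='gender')     # returns a sorted copy; assign it if you want to keep the result
df.sort_values(by='ethnicity')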
The two columns sorted above are objects. For the numerical (integer) columns, because the data set represents 1000 students, I sorted and then picked a sample to visualize. See below.
Here, after sorting according to math_score, I went ahead to create a new variable, 'maths', which takes just a sample from the 'math_score' column.
I then sorted it to get it ready for data visualization.
I repeated this process for the other two numerical columns, using 'readings' to take a sample from 'reading_score' and 'writings' to take one from 'writing_score'.
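Since this part of the code also lives in the images, here is a sketch of the sampling step. The sample size of 100 and the random_state are my own assumptions, not necessarily what the original notebook used:
maths = df['math_score'].sample(100, random_state=0).sort_values()      # sample, then sort for plotting
readings = df['reading_score'].sample(100, random_state=0).sort_values()
writings = df['writing_score'].sample(100, random_state=0).sort_values()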
After this, I went ahead to visualize the sorted data, which is where the data visualization libraries/packages come into play.
I first visualized according to gender, as seen below:
labels = df['gender'].value_counts().index    # the gender categories
values = df['gender'].value_counts().values   # how many students fall in each
colors = ["skyblue", "darkblue"]
plt.figure(figsize=(10, 8))
plt.pie(values, labels=labels, autopct='%1.1f%%', startangle=90, colors=colors)  # autopct adds percentage labels
plt.title('Gender')
plt.show()
and got this output,
Then I went ahead to visualize according to ethnicity, inputting this block of code,
base_color = sb.color_palette()[6]    # a single color from the default palette
freq = df['ethnicity'].value_counts()
gen_order = freq.index                # order the bars from most to least frequent
sb.countplot(data=df, x='ethnicity', color=base_color, order=gen_order);
and got this:
I then went ahead to visualize the maths scores:
base_color = sb.color_palette()[8]
freq = maths.value_counts()
gen_order = freq.index                                    # order the bars by frequency
sb.countplot(x=maths, color=base_color, order=gen_order)  # 'maths' is the sampled math_score series
plt.xticks(rotation=45)                                   # rotate the score labels for readability
and got this output:
Moving on, I repeated the process for the reading scores
base_color = sb.color_palette()[5]
freq = readings.value_counts()
gen_order = freq.index                                       # order the bars by frequency
sb.countplot(x=readings, color=base_color, order=gen_order)  # 'readings' is the sampled reading_score series
plt.xticks(rotation=45)
with this output:
Lastly, I did the same for the writing scores.
base_color = sb.color_palette()[1]
freq = writings.value_counts()
gen_order = freq.index                                       # order the bars by frequency
sb.countplot(x=writings, color=base_color, order=gen_order)  # 'writings' is the sampled writing_score series
plt.xticks(rotation=45)
with the image below as the output.
I hope you enjoyed reading this article or found it helpful. You can access the notebook containing the entire code here.
Please feel free to drop your questions or suggestions as comments or you can reach out to me here.