Intro To Data Analytics For Everyone Part 3
We hope you have enjoyed the two previous parts (part 1, part 2) of this intro to data analysis and data science.
In the next few sections we will look at different data sets, ask questions, and mark possible stopping points where you can think about what you might do or conclude from the data and visualizations.
We have been busy with some recent projects so we are a little behind on this post!
In this first part, we will be using this HR data set from Kaggle.com. It is already aggregated and clean…
Anyone out there who is a data engineer or data scientist knows that data does not come neatly packaged. Instead, there are usually a myriad of problems like missing data, duplicated data, garbage data, etc.
Getting a clean data feed is not easy. It requires involving subject matter experts and data teams working together to ensure everyone involved is speaking the same language.
There is plenty of normalizing, testing, and recording to ensure the data feeds remain consistent while projects are starting up.
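As a rough illustration of the problems mentioned above (missing, duplicated, and garbage data), here is a minimal sketch of a first-pass check in pandas. The data frame and its values are invented for this example, and the `employee_id` column is an assumption that is not part of the HR data set used in this post.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only (not the Kaggle HR data).
raw = pd.DataFrame({
    "employee_id": [1, 2, 2, 3, 4],
    "satisfaction_level": [0.38, 0.80, 0.80, np.nan, 0.11],
    "average_monthly_hours": [157, 262, 262, 272, -5],  # -5 is a garbage value
})

missing_per_column = raw.isna().sum()          # missing values per column
duplicate_rows = int(raw.duplicated().sum())   # fully duplicated rows
negative_hours = int((raw["average_monthly_hours"] < 0).sum())  # impossible values

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print("rows with negative hours:", negative_hours)

# One possible cleanup: drop exact duplicates and rows that fail the range check.
clean = raw.drop_duplicates()
clean = clean[clean["average_monthly_hours"] >= 0]
```

What counts as garbage depends on the feed, which is why the subject matter experts need to be in the room when these rules are written.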
Cleaning data remains a large part of data professionals’ workdays. Figure 1 below is an aggregation of several of our team members’ days at past jobs. New tools make certain portions of these jobs easier. Nevertheless, some of the work is unavoidable.
A Typical Data Scientist’s Work Day
Enough Chit Chat, Back To Data Analysis
If you recall from our previous part, we showed that there was already some correlation between satisfaction level and an employee leaving. Is that the only thing that affects an employee leaving?
Correlation Matrix For HR Data Set
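A correlation matrix like the one in figure 2 can be produced with pandas' `DataFrame.corr`. The sketch below uses a handful of invented rows, and the `number_project` column name is an assumption about how the data set is labeled. It only shows the mechanics, not the real correlations.

```python
import pandas as pd

# Toy data, invented for illustration; the real figure uses the full HR data set.
hr = pd.DataFrame({
    "satisfaction_level":    [0.9, 0.8, 0.1, 0.4, 0.7, 0.2],
    "number_project":        [3, 4, 7, 2, 5, 6],
    "average_monthly_hours": [160, 180, 290, 140, 250, 270],
    "left":                  [0, 0, 1, 1, 0, 1],
})

corr = hr.corr()  # pairwise Pearson correlation of the numeric columns

# Rank the other columns by how strongly they move with `left`.
corr_with_left = corr["left"].drop("left").sort_values()
print(corr_with_left)
```

Sorting the `left` column of the matrix is a quick way to see which fields deserve a closer look first.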
If that is the end of the analysis, then a data analyst could go to their boss and simply say: “Satisfaction level plays a role in your employees leaving! You should go boost morale. That will fix all your problems!”
Now give me a promotion?
Maybe…but that doesn’t really tell the data analyst’s boss why employees are leaving.
Good data scientists should have more questions! And a lot of the time, those questions drill deeper rather than move laterally. See figure 3 below.
Ask Why!
Normalizing Data
First, we took all of the numeric data points and normalized them. This is the process of taking a range of numbers and rescaling it linearly between the minimum and maximum values in the data set.
What do we mean? Imagine there was one data set ranging from 1–1000 and another ranging from 1–10. The disparity between these ranges could confuse some algorithms, because the field with the larger numbers can dominate purely because of its scale. Thus, we normalize them.
Instead of 1–1000 and 1–10, we want to limit the range to 0–1. This means you make the maximum value 1 and the minimum value 0, and then draw a line between the two on which each of the numbers in between will fit. So for 1–10, 5 will be about .44, and for 1–1000, 500 will be about .5. This puts both fields on the same scale.
How do you normalize data?
Here is a quick one-liner that will normalize your data in Python, where x is a pandas Series or NumPy array:
normalized = (x - min(x)) / (max(x) - min(x))
Mathematically speaking, it looks like this:
$$x_{\text{norm}} = \frac{x - \min(x)}{\max(x) - \min(x)}$$
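To make the formula concrete, here is a small sketch that wraps it in a reusable function and runs it on the two example ranges from above. The function name `min_max_normalize` and the guard for constant columns are our own additions for illustration.

```python
import pandas as pd

def min_max_normalize(x: pd.Series) -> pd.Series:
    """Rescale a numeric Series linearly so its minimum maps to 0 and its maximum to 1."""
    span = x.max() - x.min()
    if span == 0:
        # A constant column has no spread; return zeros rather than divide by zero.
        return pd.Series(0.0, index=x.index)
    return (x - x.min()) / span

small = pd.Series([1, 5, 10])
large = pd.Series([1, 500, 1000])

print(min_max_normalize(small).round(3).tolist())  # [0.0, 0.444, 1.0]
print(min_max_normalize(large).round(3).tolist())  # [0.0, 0.499, 1.0]
```

After rescaling, the midpoints of both ranges land close to each other even though the raw numbers differ by two orders of magnitude.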
Loading Data Analytics Libraries
When performing data analytics with Python, these are the typical libraries you will need.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Marking The Employees That Left
We decided to look for clusters based on who is leaving and who is staying. Below is one of those figures that we output using 2 of the fields. In figure 4 the red represents the employees that have left and the blue represents the employees that have stayed.
What do you see? There do seem to be 3 clusters, right?
There is a group of employees with high satisfaction levels and high average monthly hours. These might be the high performers, but why did they leave? If it weren’t for them leaving, most of the employees leaving would be the ones that have low satisfaction levels. That would make a lot of sense.
It would also make the earlier conclusion to simply boost morale correct.
Creating A Basic Scatter Plot With Color
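X = hr_data_s[['average_monthly_hours', 'satisfaction_level']]  # the two fields we plot
y = hr_data_s.left  # assumed to be 1 for employees who left and 0 for those who stayed
scatter = plt.scatter(X['average_monthly_hours'], X['satisfaction_level'], c=y, cmap=plt.cm.coolwarm)
plt.xlabel('Average Monthly Hours')
plt.ylabel('Satisfaction Levels')
plt.title('Average Monthly Hours by Satisfaction Levels')
# a bare plt.legend() finds no labeled artists, so build the legend from the colored points
handles, _ = scatter.legend_elements()
plt.legend(handles, ['Stayed', 'Left'])
plt.show()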
As part of this company, an analyst should also ask what the ROI is of keeping each of these types of employees.
We don’t have enough information to figure out which employees leaving are the most valuable. However, common sense says the group in the top right corner of figure 4 might be. They seem to be very valuable employees based purely on the metrics we are currently looking at.
Is that true though?
We urge you to look back at figure 3 in this post. Remember the “Ask Why” figure!
It is hard to say which group is the most valuable from this data alone. The group that seems to have a high value could be the employees that are asked to perform just the right amount of work so they excel. They could also be employees that have the lowest salary and are the cheapest to replace.
This is why it is important in analysis to continue to ask why! Why are there 3 separate groups? Is each of them different? How and why?
Looking at figure 5, there also seem to be very few employees with high salaries that leave. The employees that leave in this figure are mostly low and medium salary employees. This makes a little more sense.
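A quick way to check the salary observation numerically is a cross-tabulation of salary band against the `left` flag. This is a minimal sketch on invented rows. The `salary` column name and its low/medium/high values are assumptions about how the data set is labeled.

```python
import pandas as pd

# Toy data, invented for illustration only.
hr = pd.DataFrame({
    "salary": ["low", "low", "medium", "high", "medium", "low", "high", "medium"],
    "left":   [1, 0, 1, 0, 0, 1, 0, 0],
})

# Share of stayers (0) and leavers (1) within each salary band; each row sums to 1.
leave_rate = pd.crosstab(hr["salary"], hr["left"], normalize="index")
print(leave_rate)
```

Normalizing by row matters here: raw counts would mostly reflect how many employees sit in each band, not how likely each band is to leave.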
Why Do Employees Leave?
In this section we will break down some of the differences between the groups listed above.
Let’s start by figuring out the total number of employees in the “Employees that have left” group.
Out of 14999 total employees, 3571 have left.
From here, we will check the size of each group in the table below. We have labeled each of the groups based on the characteristics we have seen thus far.
The group names below are just our personal hypotheses about why we see each of these different groups. It might not be a bad idea to simply label them group 1, 2, 3, etc. to avoid confirmation bias.
- The group with high evaluation scores and high hours is the “High Performers”
- The group with low hours and low evaluation scores is the “Poor Performers”
- The group with high hours and low evaluation scores is the “Overworked”
- Then there is an outlier group for everyone who did not fit neatly into these clusters
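One possible way to attach these labels is with simple threshold rules, sketched below. The cut-off values, the toy rows, and the "Outlier" label are hypothetical choices for illustration, not the values used for the figures in this post. The group names follow the ones used in the plotting code later in the post, and "low evaluation" for the overworked group is read here as "below the high cut-off".

```python
import numpy as np
import pandas as pd

# Toy rows, invented for illustration; values are already normalized to 0-1.
hr = pd.DataFrame({
    "last_evaluation":       [0.95, 0.90, 0.20, 0.15, 0.40, 0.55, 0.80],
    "average_monthly_hours": [0.80, 0.75, 0.20, 0.25, 0.90, 0.50, 0.30],
    "left":                  [1, 1, 1, 1, 1, 1, 0],
})

leavers = hr[hr["left"] == 1].copy()

# Hypothetical cut-offs chosen only for this sketch.
HIGH, LOW = 0.7, 0.3

conditions = [
    (leavers["last_evaluation"] >= HIGH) & (leavers["average_monthly_hours"] >= HIGH),
    (leavers["last_evaluation"] <= LOW) & (leavers["average_monthly_hours"] <= LOW),
    (leavers["last_evaluation"] < HIGH) & (leavers["average_monthly_hours"] >= HIGH),
]
labels = ["High Performers", "Poor Performers", "Overworked"]

# Rows matching none of the conditions fall into the catch-all group.
leavers["cluster"] = np.select(conditions, labels, default="Outlier")
group_counts = leavers["cluster"].value_counts()
print(group_counts)
```

Hand-written rules like these keep the labels easy to explain, at the cost of having to justify every threshold.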
Breakdown Of Employees That Left
Breaking it down, the groups add up to 3571. This is great!
As any type of data specialist, it should be a goal to create checks like this along the way to make sure your numbers continue to add up. There is always the chance that data might suddenly disappear or grow, especially in SQL, where a join or filter can silently drop or duplicate rows.
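A check like this can be written directly into the analysis so it fails loudly when the numbers stop adding up. The sketch below uses invented rows and assumes a `cluster` column holding the group labels.

```python
import pandas as pd

# Toy data, invented for illustration only.
hr = pd.DataFrame({
    "left":    [1, 1, 1, 1, 0, 0],
    "cluster": ["High Performers", "Overworked", "Poor Performers", "Outlier", None, None],
})

total_left = int(hr["left"].sum())
group_sizes = hr.loc[hr["left"] == 1, "cluster"].value_counts()

# Fail loudly if a filter or join silently dropped or duplicated rows.
assert group_sizes.sum() == total_left, (
    f"groups sum to {group_sizes.sum()}, expected {total_left}"
)
print(group_sizes)
```

Note that `value_counts` skips missing labels, so an employee who left but never received a group would trip the assertion.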
Data Analysis Is About The Why
Remember, this is as much about data analysis as it is about data science. We could easily throw this data set into an algorithm and let the computer do the thinking for us. There are plenty of libraries in Python and R that will happily do that for you.
It is important to develop intuition about data and not rely fully on premade algorithms.
So let’s look at just the people who left and try to see if we could create a story to tell a manager.
Our next few figures are going to be distributions of the number of projects, average monthly hours, and evaluation scores.
This is great because, unlike an average that can hide the actual spread, a distribution lets us see how the employees are actually spread out.
Number Of Project Distribution
Creating A Distribution Plot
# sns.distplot is deprecated since seaborn 0.11; newer versions recommend sns.histplot or sns.displot
sns.distplot(hr_data_s['last_evaluation'][hr_data_s.cluster == 'Poor Performers'], label='Poor Performers')
sns.distplot(hr_data_s['last_evaluation'][hr_data_s.cluster == 'High Performers'], label='High Performers')
sns.distplot(hr_data_s['last_evaluation'][hr_data_s.cluster == 'Overworked'], label='Overworked')
plt.title('Last Evaluation Distribution')
plt.legend()
plt.show()
Looking at figure 7, we see that the overworked employees were taking on a substantially higher number of projects. That is strange. Why would the overworked employees, who have lower satisfaction levels, be getting more projects? That doesn’t make sense…does it?
Last Evaluation Distribution
Then we take a look at figure 8 and see that there is a large portion of high performers who have amazing evaluations. If we were to find the area underneath that curve (integrals…oh no) from where the final peak of the high performers starts, around .9, to where it ends at 1, we would guess that about 20% of them have higher scores than the “Overworked” group.
Of course, the poor performers at this point neither score well during evaluations nor get a lot of projects.
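The area-under-the-curve estimate above does not actually require an integral. The area under a density curve between .9 and 1 is approximately the share of employees whose score falls in that range, so counting works. Here is a minimal sketch on invented scores, assuming a `cluster` column with the group labels.

```python
import pandas as pd

# Toy data, invented for illustration only.
leavers = pd.DataFrame({
    "cluster": ["High Performers"] * 5 + ["Overworked"] * 5,
    "last_evaluation": [0.95, 0.85, 0.92, 0.80, 0.88,
                        0.85, 0.80, 0.78, 0.89, 0.83],
})

# Flag scores of .9 or higher, then take the mean of the flag per group.
# The mean of a True/False column is the fraction of True values.
share_top = (
    leavers.assign(top=leavers["last_evaluation"] >= 0.9)
           .groupby("cluster")["top"]
           .mean()
)
print(share_top)
```

Running the same count on the real data would replace the eyeballed "about 20%" with an exact figure.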
Average Monthly Hours Distribution
Now, just to round it all out, we brought in hours again in figure 9. We already know thanks to figure 2 that average monthly hours and projects are positively correlated. So figure 9 should not be a surprise.
This is still good to check, because the “High Performers” and “Overworked” groups could have kept the same hours distributions as in figure 9 while swapping their project distributions in figure 7.
Then, one might make the assumption that the “Overworked” group was just slow, because they would be doing fewer projects per hour than the “High Performers”.
That is not the case!
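One way to test the "just slow" assumption is to compare projects per hour across groups. The sketch below uses invented numbers, and the `number_project` and `cluster` column names are assumptions, so it only demonstrates the comparison and not the real result.

```python
import pandas as pd

# Toy data, invented for illustration only.
leavers = pd.DataFrame({
    "cluster": ["High Performers", "High Performers", "Overworked", "Overworked"],
    "number_project": [4, 5, 6, 7],
    "average_monthly_hours": [240, 260, 280, 300],
})

per_group = leavers.groupby("cluster")[["number_project", "average_monthly_hours"]].mean()

# Projects carried per 100 monthly hours, as a rough pace measure.
per_group["projects_per_100_hours"] = (
    100 * per_group["number_project"] / per_group["average_monthly_hours"]
)
print(per_group)
```

A ratio like this is crude, since projects differ in size, but it is enough to rule out the simplest explanation.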
Group and Job Type Breakdown Pivot
Finally, for this section, we have figure 10. It isn’t elegant, but the purpose is to show how each job type breaks down in each group.
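A pivot like this can be built with `pandas.crosstab`. The sketch below is an illustration on invented rows. The `job_type` column name and its values are hypothetical stand-ins for whatever field holds the job type in your copy of the data.

```python
import pandas as pd

# Toy data, invented for illustration only.
leavers = pd.DataFrame({
    "job_type": ["sales", "sales", "technical", "support", "technical", "sales"],
    "cluster":  ["Overworked", "High Performers", "Overworked",
                 "Poor Performers", "High Performers", "Poor Performers"],
})

# Head count per job type and group; margins=True adds row and column totals.
counts = pd.crosstab(leavers["job_type"], leavers["cluster"], margins=True)
print(counts)
```

The totals in the "All" row and column double as another check that no employees went missing along the way.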
It is odd if you actually look at salaries: the “High Performers” on average have lower salaries than both the “Poor Performers” and the “Overworked” groups.
In addition, now that we have pulled out all these groups, we can see how low the satisfaction levels really are for each group.
The “Overworked” are often close to .01 (normalized)…this is very different compared to even the “Poor Performers”.
What gives!
So, based on this information, what would you conclude thus far?
What facts have you already noted?
Pretend you are Sherlock Holmes for a second. Look at the current clues:
- We see 3 distinct groups
- Each group has a different spread of hours spent at work
- There seems to be low satisfaction in two groups
- A group with lower evaluation scores than the high performers also has a high number of projects. Why were they trusted?
- What other characteristics did you notice?
Now, as an analyst or data scientist, it is your job to come up with a conclusion and an actionable set of steps!
Again, we want you to think of a conclusion. Before proceeding, feel free to respond below with what your thoughts are right now. Your conclusion should be able to create actionable steps for a manager or your team. That is key.
The more you drill into the problem and search for the root cause, the better your chance of coming up with a clear solution and explaining to a manager why you are seeing the trends you are seeing.
Why Not Just Throw This Into An Algorithm Off The Bat?
If you throw this data set into an algorithm like a decision tree, it will actually give you a pretty accurate output. However, now try explaining that to your manager.
We don’t think it has ever worked well to tell a manager “The algorithm proved it.”
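For reference, this is roughly what throwing the data at a decision tree looks like with scikit-learn. The rows are invented and the choice of two features and `max_depth=2` is arbitrary, so treat it as a sketch of the mechanics only. Printing the learned rules with `export_text` at least gives you something closer to a story than a bare accuracy score.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data, invented for illustration only.
hr = pd.DataFrame({
    "satisfaction_level":    [0.90, 0.80, 0.10, 0.40, 0.70, 0.09, 0.85, 0.38],
    "average_monthly_hours": [160, 180, 290, 140, 200, 300, 250, 150],
    "left":                  [0, 0, 1, 1, 0, 1, 0, 1],
})

features = ["satisfaction_level", "average_monthly_hours"]
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(hr[features], hr["left"])

# export_text prints the learned splits as nested if/else style rules.
rules = export_text(tree, feature_names=features)
print(rules)
```

Even with readable rules, the tree only tells you where the splits are, not why employees on one side of a split are leaving.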
Be A Data Storyteller
As an analyst, it is important to be able to tell the data’s story. Some of the figures above could begin to tell that story! Then you could pair that with a decision tree or a logistic regression algorithm.
Then it would be more than just a bunch of abstracted math that you don’t even understand…
We talked about this in part 1. You must be a good communicator, a “data storyteller”. Don’t bury the why!
Some institutions make it feel like all you need to do is run data through algorithms and then preach the results as gospel.
We believe that in order to get real manager buy-in, there needs to be a solid report that walks them through how you came to your final conclusion. Simply going from data exploration to conclusion doesn’t prove your point, nor is it scientific.
Theories in scientific fields require multiple research papers proving the same conclusion over and over and over again!
Data analysts should also have evidence that they are right beyond some algorithm. That way, when their manager asks how they know, they can answer!
For further reading on data science and analytics, here are some great articles:
Intro To Data Analysis For Everyone Part 1
Predicting Budgeting With Kafka Streaming Analytics
How To Apply Data Science To Real Business Problems