How To Apply Data Science To Real Business Problems
Data science and statistics are not magic. They won’t fix all of a company’s problems on their own. However, they are useful tools to help companies make more accurate decisions and automate the repetitive work and choices their teams need to make.
Machine learning and data science get referenced a lot when people talk about natural language processing, image recognition, and chatbots.
However, they can also be applied to help managers make decisions, predict future revenues, segment markets, produce better content, and diagnose patients more effectively.
Below, we are going to discuss some case examples of statistics and applied data science algorithms that can help your business and team produce more accurate results.
This doesn’t require complex Hadoop clusters and cloud analytics. Not that those aren’t amazing. But before we jump too far down the rabbit hole of technology and hype, let’s get the basics going first!
We are going to give examples from e-commerce and medical operations.
Our team focuses mostly on practical and applied data science, so we are reaching into our past experiences to show you some awesome but easy-to-apply ways you can use statistics today to start making better decisions.
The methods below are typically only a small piece of a larger system. We believe getting these small pieces and details right is required to start building systems that are accurate and effective.
Some of these statistical methods won’t even require heavy programming or technical expertise. However, these same basic techniques can be applied on a much larger scale in software to automate decisions.
We point this out because we know it requires a lot of effort to implement an algorithm.
Teams have to plan properly how they are going to integrate databases, business logic, algorithms, and new policies to ensure projects succeed.
That takes a large amount of resources and time.
Fraud Detection and Bayes Theorem
Whether you are an insurance company that deals with medical, property, or vehicle claims, insurance fraud is a major problem.
How do insurers solve it? They have to either set up an algorithm or have auditors manually go through claims and decide whether each one is fraudulent.
Believe it or not, a large handful of insurance companies still do this manually (in this case, we are counting pulling a data feed from a database and filtering in Excel as manual).
Insurance providers will put a lot of effort into auditing, for at least two reasons:
- To get money back
- To attempt to deter future insurance fraud claims
The issue is that this takes up valuable resource hours, and if claims are incorrectly identified, it can cost more in salaries and other resources than is recouped from the fraudulent claims.
In this example, an Insurance provider named Itena has created an algorithm to help increase the speed at which their team can handle claims.
So how does Itena know if the algorithm they have developed is worth it?
Well, what if you had a mathematical theorem to analyze your algorithm!
Woah! That is so “meta”.
Just think of it as a confidence check that can later help them calculate whether it is worth the cost to invest in the algorithm.
In this case, they are going to use Bayes Theorem.
What Bayes Theorem is great at is providing statistical backing for how accurate the information you are given actually is.
How much can they trust their algorithm? It will more than likely produce both false positives and false negatives, so how accurate is it really?
Let’s say that Itena’s data science team knows that 2% of the claims received are fraudulent.
They calculate that they could save $1 million if they correctly identify all of the fraudulent claims!
Woah! That gets all of the executives signing off on this! I mean, it is $1 million! Even if the company nets $100 million, this is a great save.
Awesome!
So the Itena data science team develops an algorithm to detect fraudulent claims.
It isn’t perfect, but it is a great start!
But how accurate is it really? Sure, they know that the algorithm catches 85% of the truly fraudulent claims.
As good data scientists, and scientists in general, they also know you have to check for errors as well, like false positives.
For them to go to management and say that this test is 85% accurate would be incorrect! Look at the normal claims as well.
The algorithm also flags 4% of normal claims as fraudulent, a 4% false positive rate. How does that come into play?
This is when they bring in our old friend Bayes Theorem.
Bayes Theorem is great for measuring how much they should trust a test.
Algorithms, like medical tests, can detect things that don’t exist.
Just like some medical exams can return false positives for cancer, algorithms can return false positives for fraudulent claims, for whether you should get a loan, and for whether you should get a discount when you visit Amazon.com.
We know the data science team could possibly save the company $1 million. However, they are also going to cost Itena money in resource hours, so they need to make sure they return more than they cost!
So how accurate is this algorithm?
They might be surprised to find out how inaccurate it actually is!
Here is the scratch math:
P(Fraud Claim) = 2%
P(Positively Flagged As Fraud(PFF) | True Fraud Claim) = 85%
P(Not Fraud) = 98%
P(Flagged Fraud Claim | True Not Fraud) = 4%
P(True Fraud | Flagged As Fraud) =
P(Fraud Claim) * P(Positively Flagged As Fraud | True Fraud Claim) /
[ P(Fraud Claim) * P(Positively Flagged As Fraud | True Fraud Claim) + P(Not Fraud) * P(Flagged Fraud Claim | True Not Fraud) ]
= (0.02 * 0.85) / (0.02 * 0.85 + 0.98 * 0.04) ≈ 30%
So only about 30% of the claims the algorithm flags are actually fraudulent.
Why?
It is due to the fact that only 2% of claims are fraudulent. That means that although the algorithm is 85% accurate on fraudulent claims, that 85% only applies to 2% of the data.
It also incorrectly flags 4% of the other 98% of the data.
That leaves a much bigger pool of false positives than true positives.
It is kind of interesting when you really sit and think about it.
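If you want to check the arithmetic yourself, here is a minimal sketch of that calculation in base R, using the rates from the example above (the variable names are just our own):

# Rates from the Itena example above
p_fraud <- 0.02              # P(Fraud Claim)
p_flag_given_fraud <- 0.85   # P(Flagged As Fraud | True Fraud Claim)
p_not_fraud <- 0.98          # P(Not Fraud)
p_flag_given_normal <- 0.04  # P(Flagged As Fraud | True Not Fraud)

# Bayes Theorem: probability a flagged claim is truly fraudulent
p_fraud_given_flag <- (p_flag_given_fraud * p_fraud) /
  (p_flag_given_fraud * p_fraud + p_flag_given_normal * p_not_fraud)
p_fraud_given_flag  # roughly 0.30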
Now the question is, can they justify the savings? What will it cost the company to look into all of the flagged claims? That really depends on the claim size.
If they are dealing with $7 upcoding in medical claims…maybe not so much.
On the other hand, if it is $10,000 car accident claims, the company will still want to jump on it!
Luckily, we have computers that can run these algorithms quickly, and hopefully an amazing process set up that allows claims to be adjudicated just as fast.
Still, we want to put this into perspective as if the computer were not there.
What if you had the same accuracy but a human performing the task? Let’s say it costs $200 worth of resources to analyze one claim.
If the claim is worth $500, is it worth the time?
What Is The Probability A User Will Buy Your Product
E-commerce is predicted to see over $2 trillion worth of purchases in 2017. Although plenty of people go straight to Amazon or Alibaba, there are plenty of other sites trying to get customers to buy their products.
This involves heavy amounts of cross platform marketing, content marketing, and advertising.
How do companies know if their ads or sites where they promote are actually effective? Are the impressions and engagements they are getting actually turning into real dollars?
How do you start to answer these questions?
Let’s say your company sells kitchen equipment online.
You pay several sites to cross promote your products and e-commerce site. You know the average purchase rate you get from each site as you have been diligent about tracking cookies and keeping a clean database.
You want to know how much money you should invest in future campaigns. Based on current data, you know that 10 people an hour purchase a product if they come from “Site A”. You believe that as long as you have a greater than 80% chance of keeping a rate of at least 6 purchases per hour, you can justify the cost to market there.
How can we figure this out?
Well, we can use the Poisson distribution to help us out.
If you already know that on average 10 people who come from Site A buy products from your site every hour, you can calculate from there.
You can take this information to an even more granular level. That would allow you to utilize a combination of seasonality techniques with the Poisson distribution to predict future revenues and allocate funds more effectively.
That would require a more extensive explanation and also a lot of data. For now, we are going to focus on this first problem.
We will utilize the Poisson cumulative probability function. Essentially, you are just adding up the Poisson probability of every outcome of 6 or more purchases.
If you run it for a lambda of 10, the Poisson cumulative probability function gives you about 93%.
So your company can continue to pay for marketing on Site A!
However!
If it had only been about 7 an hour, then you would only have about a 70% chance of selling at least 6 items an hour, so the program would have to be cut.
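Here is a minimal sketch of that check in R, using base R’s Poisson cumulative probability function (ppois); the helper function name is ours:

# P(at least 6 purchases in an hour) = 1 - P(5 or fewer), for a given hourly rate lambda
p_at_least_6 <- function(lambda) ppois(5, lambda = lambda, lower.tail = FALSE)
p_at_least_6(10)  # ~0.93, above the 80% threshold, so Site A stays in the budget
p_at_least_6(7)   # ~0.70, below the threshold, so that program would be cut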
You can easily set this threshold and start to create an auction type system where your budget is automatically allocated based on a ratio of future returns and probability of purchases!
That sounds like a fun project!
Don’t get us wrong. At the end of the day, data science can be used to create systems that interact with your customers.
However, it can also be used to increase the rate of accurate decision making, and to develop systems with fail-safes that limit the number of simple and complex decisions analysts and management have to make themselves.
Applied Linear Regression
Let’s say you work for a hospital and you noticed the cost of a specific surgery has been going up consistently month over month for the past few years.
You wonder if there might be a linear relationship between the months and cost of surgery.
Step one of your analysis would be to figure out if there was a model that could be built to predict the rising cost of surgery.
There would be a second step, which we are not going to go over, which is to figure out the why!
Data science does not only supply the tools to create models.
It also supplies the tools that allow people to figure out the why! So after you finish with the model, you would want to look into why the cost is rising.
Maybe you would theorize it is increasing salaries, equipment costs, increased complexity of the procedure, etc. That would require more granular data than just price per surgery per month.
If your scatter plot seems to follow a roughly linear pattern, you can start to look at the problem with the concept of linear regression, or even multivariate linear regression, depending on what looks like it will fit (or, even better, which model your automated system detects!).
This is one of the simplest forms of predictions as you are simply trying to create a trend line.
You could estimate this by taking a line from your starting point to your end point. However, that line might not have the best goodness of fit.
Although you can use Excel, Python, R, or just about any other language to find a linear regression model, we wanted to show you how to do it by hand.
It involves a lot of summations, but don’t let that scare you. In the end, the reason most people don’t do it by hand is not because the math is hard.
The math is actually pretty straightforward. However, a lot can go wrong, especially with lots of simple calculations.
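If you would rather see those summations as code than grind through them on paper, here is a minimal sketch in R; the monthly cost numbers are made up purely for illustration:

# Made-up example: month index vs. average cost of the surgery
x <- 1:12
y <- c(5000, 5120, 5260, 5210, 5380, 5440, 5390, 5560, 5620, 5590, 5740, 5810)
n <- length(x)
# Least-squares slope and intercept from the summation formulas
m <- (n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x^2) - sum(x)^2)
b <- mean(y) - m * mean(x)
m  # estimated cost increase per month
b  # estimated starting cost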
Just because you have finished the model does not mean you are done! Your data rarely fits the line perfectly, so it is important to test the validity of your model.
How do you determine the best fit for a trend line or linear regression model?
There are several methods. Below we will discuss the R-Squared error.
What is R-Squared? It is also known as the coefficient of determination.
Most models have one or several methods to calculate the accuracy of a model.
We have ROC curves, AUC, Mean Squared Error, Variance, and so on.
In this case, R-Squared is equal to:
R² = 1 − (Sum of Squared Residuals) / (Total Variation)
The sum of squared residuals measures the gap between each actual data point and the corresponding point on the model line. Let the point on your trend line be equal to ŷ, where ŷ = mx + b. Then:
Sum of Squared Residuals = (y1 − ŷ1)² + … + (yn − ŷn)²
The total variation measures the gap between each actual data point and the average y from the actual data:
Total Variation = (y1 − ȳ)² + … + (yn − ȳ)²
So in this case, we can run linear regression in Excel, R, or Python and get a model that fits the line pretty well.
In this case, based off the data, the R-Squared is 0.93. That really only states the proportion of variance explained by the model.
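As a sketch of how you might do that in R with lm(), using the same made-up surgery-cost numbers as before (so the R-Squared here will not match the 0.93 in this example):

# Made-up data: month index vs. average surgery cost
x <- 1:12
y <- c(5000, 5120, 5260, 5210, 5380, 5440, 5390, 5560, 5620, 5590, 5740, 5810)
# Fit the trend line and pull out R-Squared
fit <- lm(y ~ x)
summary(fit)$r.squared
# The same quantity by hand: 1 - (sum of squared residuals / total variation)
1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)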
The important thing is to remember what the model is being used for! Not every line that fits is correct!
Especially when you start developing more complex models!
This is why data scientists don’t just create models. They also look for the why!
That is one of the biggest changes in the last 20 or so years! We have the power to give context. Before, statistics was limited to numbers.
It could only tell us what had happened and what might happen.
There was very little information on what needed to change! Now data scientists can give the why, and when we know the why we have the ability to actually give strategic advice.
For instance, in this case, you can show your medical directors this simple trend and then analyze the why!
When we go out to help teams, part of our focus is helping them take their analysis to their directors to get approval.
This is the start of being data driven. It requires curiosity and a little bit of entrepreneurial spirit.
With the discovery above, imagine you find out you can reduce surgery costs by $5.37 on average, and the hospital does 100,000 surgeries a year.
You can show that you saved the hospital $537,000 annually. Hopefully you get a raise!
It is still important to remember that “All models are wrong, but some are useful”.
Logistic Regression, A Binary Assessment
Logistic regression, unlike linear regression, has a binary output: typically pass or fail, 1 or 0. Linear regression’s output is continuous, whereas logistic regression’s output is typically a yes or a no.
This is why it is used a lot for business tasks like deciding whether you should give someone a loan, predicting whether a patient has a specific disease, and answering many of the other yes-or-no questions that plague us every day.
Logistic regression also allows for multiple input variables. So even if you have a complex business decision that needs to take several different variables into account, logistic regression can be a great solution.
We can look back to the example of the fraudulent claims. The algorithm that determines whether or not a claim is fraudulent could be a logistic regression model. You might have certain pieces of information that point to whether the claim is fraudulent or not: location, patient information, hospital stats, etc.
Let’s say you wanted to know if a doctor was actually doing open heart surgery and not just billing for it without performing it (it’s a stretch, but go with it).
You might know that the same claim for open heart surgery should also have billed for specific equipment usage like an MRI or Lab work. If this did not occur, there is a good chance it is fraudulent.
You can also engineer some features. You could calculate the probability that a heart surgery from a specific doctor is fraudulent based on past claims and audits, you could count the number of surgeries done per day by each doctor, or you could build anything else your subject matter experts or contextual data support as playing a role in fraudulent claims.
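As a rough sketch of that feature engineering in R, here is a toy claims table; the column names (doctor_id, service_date, is_fraud) and values are purely our own invention:

# Toy claims table; the columns and values are hypothetical
claims <- data.frame(
  doctor_id    = c("D1", "D1", "D1", "D2", "D2", "D3"),
  service_date = as.Date(c("2017-03-01", "2017-03-01", "2017-03-02",
                           "2017-03-01", "2017-03-02", "2017-03-02")),
  is_fraud     = c(0, 1, 0, 0, 0, 1)   # outcome of past audits
)
# Feature 1: number of surgeries billed per doctor per day
claims$surgery_count <- 1
surgeries_per_day <- aggregate(surgery_count ~ doctor_id + service_date, data = claims, FUN = sum)
# Feature 2: historical fraud rate per doctor from past audits
fraud_rate_by_doctor <- aggregate(is_fraud ~ doctor_id, data = claims, FUN = mean)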
This model is much harder to calculate by hand. Luckily, that is why R and Python are amazing languages. Here is the R implementation (glm stands for generalized linear models). It is basically a one-line implementation.
Note: this is a little misleading. Although it seems like a one-line implementation, there is probably a lot of data cleansing and normalizing to do before you ever get to the formula!
model <- glm(formula = FraudulentClaim ~ ., family = binomial(link = "logit"), data = train)
To explain some of this implementation: the formula is set up with the output on the left side. The tilde symbol states that “FraudulentClaim” depends on what comes after it, and in R the period represents all of the variables in the training set except the dependent variable.
If your team believes only a specific set of variables matters, they could use the example below instead of the period.
glm(formula = FraudulentClaim ~ YearsOfExperience + Income + … + x, family = binomial(link = "logit"), data = train)
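Once the model is fit, scoring new claims is also short. This is only a sketch: it assumes the same hypothetical train data frame as above, plus a test data frame with matching columns, and that FraudulentClaim is coded 0/1:

# Fit on the hypothetical training data (every column except FraudulentClaim is a predictor)
model <- glm(formula = FraudulentClaim ~ ., family = binomial(link = "logit"), data = train)
# Predicted probability that each new claim is fraudulent
fraud_prob <- predict(model, newdata = test, type = "response")
# Flag claims above a cutoff your team chooses (0.5 here, purely for illustration)
flagged <- fraud_prob > 0.5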
The mathematical model would look like:
p(fraudulent) = 1 / (1 + e^−(b0 + b1*x1 + b2*x2 + … + bn*xn))
Each of those “b” coefficients goes with another possible variable. The variables could be sex, age, income (all typically normalized!).
This statistical principle could also be used for other hospital tasks like predicting readmissions, diagnostics, and fraudulent claims.
Applied Data Science
These were a few basic case studies where we showed how you could implement some theorems and algorithms in your decision processes.
They are a great start and could be used in much larger projects to help improve your data science practice and your company’s data-driven culture!
With that come a few things we would like to note: some pros and cons of algorithm and data science usage.
Pros
- Focuses on data-driven decisions over politics and gut feelings
- Automates decisions that might be financially and mentally taxing
- Improves consistency and accuracy, and forces teams to draw out their decision processes
- Reduces time spent on tasks
Cons
- If an algorithm is incorrect, the team might overly trust it
In our next post, we hope to focus on some more technical, programming-based implementations and applications! If you have any specific case studies you would like us to explore, please let us know!