Healthcare Fraud Detection With Python
This April, a 1.5 billion dollar Medicare scheme took advantage of hundreds of thousands of seniors in the US. In reality, this is just a small sliver of the billions of dollars that healthcare fraud costs both consumers and insurance providers annually.
Healthcare fraud can come from many different directions. Some people might think of the patient who pretends to be injured, but much of the fraud is actually committed by providers (as in the NYT article).
Providers often have financial incentives for performing unnecessary surgeries or claiming work they never did. This leads to many different flavors of fraud that can all be difficult to detect on a claim-by-claim basis.
For example, fraud from healthcare providers could include:
- Upcoding
- Medically unnecessary procedures
- Kickbacks
- Providing services with nurses and staff that should be provided by doctors
These four methods of fraud are often effective for several reasons. First, there are so many claims that it can be hard for claims processors to catch the fraudulent ones before paying them, so these claims are often paid before the fraud is discovered. Second, the cost of adjudicating a claim can sometimes be greater than the value of the claim itself.
That makes it difficult for insurance providers to rationalize spending money on creating methods to capture these bad behaviors.
This is why, before investing hundreds of thousands of dollars into your first fraud detection system, you should first analyze your claims from multiple directions to get an idea of where fraud could be coming from.
This is where the exploratory data analysis step comes into play.
What Is Exploratory Data Analysis
When you first start to analyze data your goal will be to get a good sense of the data set.
This is known as exploratory data analysis. In particular, if your company follows the O.S.E.M.N. (Randy Lao) data science process, which stands for Obtain, Scrub, Explore, Model and iNterpret, then this is the E step.
Using this process can help give management clarity on your progress.
The purpose of this step is to become familiar with the data as well as to drive future analysis. Generally, this step combines checking data sets for skew, looking for trends, making charts, and so on.
It’s not about structure or process but instead meant to bring out possible insights through a flow state.
How you approach this step depends on how you work best. For instance, our preference is to think of questions we want to answer about the data set and then go about answering said questions.
For example, in our analysis today we will be looking at the Healthcare Fraud data set from Kaggle.com. The data set is focused on fraud and providing insights into which providers are likely to have fraudulent claims.
So our questions will focus on what could support the case of fraud for these providers and why it is worth it for our business partners to invest in our project.
Here are some example questions:
- Does age play a role in which claims are fraudulent?
- Month over month, are there any patterns in when fraud occurs?
- Do fraudulent providers make more per claim than non-fraudulent providers?
- Do fraudulent providers make more per patient than non-fraudulent providers (e.g., per patient per month, or PMPM)?
These questions can help frame and guide our analysis so we don’t spend too much time wandering without a purpose.
What Is The Purpose Of EDA
The purpose of this EDA step is to provide support for later, more in-depth analysis. In a recent post, we discussed the concept of agile data science, not so much as a strict process but as a framework. This is one of those steps where, as you are doing the analysis, you can bring up interesting points using charts and metrics that might help move your business case along.
Note: In this example we have already joined all the data sets together for easy use. You can find the code for that here.
For example, in our questions above we are looking to support the idea that it is worth looking into fraudulent providers.
So let’s look at how they play out. We are running this all in SaturnCloud.io because it is easy to spin up a VM, run this analysis, and share it.
In this first part, we look at age. It’s usually a great place to start because it is a natural place you might see some patterns in the data.
So we can use the histogram function in Pandas to analyze this.
However, once we look at it, the ages seem to break down into a pretty even distribution. This means there is a pretty similar sample across both sets of data.
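A minimal sketch of the age comparison described above. The column names `PotentialFraud` and `Age` and the toy rows are assumptions made for illustration, not the actual joined data set. It tabulates the share of claims per age band for each group, which is the same information the histograms show.

```python
import pandas as pd

# Toy stand-in for the joined claims table; all values are invented.
claims = pd.DataFrame({
    "PotentialFraud": ["Yes", "Yes", "Yes", "No", "No", "No"],
    "Age": [67, 72, 81, 66, 74, 83],
})

def age_distribution(df, bins=(0, 65, 70, 75, 80, 85, 120)):
    """Share of claims in each age band, split by the fraud flag."""
    bands = pd.cut(df["Age"], bins=list(bins), right=False)
    # normalize="columns" makes each fraud group sum to 1,
    # so groups of different sizes can be compared directly.
    return pd.crosstab(bands, df["PotentialFraud"], normalize="columns")

dist = age_distribution(claims)
print(dist)

# With matplotlib installed, the same comparison can be drawn directly:
# claims.hist(column="Age", by="PotentialFraud", bins=20)
```

If the two columns of the table look alike, age is unlikely to separate the two groups on its own.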
The next question we wanted to answer was about spend, so first we looked at spend in general. Initially, when analyzing the gross amount, nothing sticks out, as seen in the charts below.
Overall, the months seem to line up, except that the total amounts month over month seem to be much higher on the fraud side. So we wanted to look into this.
There is a better way to look at this. First, let’s look at the average claim cost per month.
Now we can see that there is a drastic difference in the average cost per claim. Providers that are likely to commit fraud often charge 2x what the non-fraud providers charge (supporting this would require more analysis into what the claims were).
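One way the monthly totals and the average cost per claim could be computed is sketched here. The column names (`ClaimStartDt`, `InscClaimAmtReimbursed`, `PotentialFraud`) are assumptions about the joined table, and the rows are invented toy data.

```python
import pandas as pd

# Toy stand-in for the joined claims table; all values are invented.
claims = pd.DataFrame({
    "PotentialFraud": ["Yes", "Yes", "Yes", "No", "No", "No"],
    "ClaimStartDt": ["2009-01-05", "2009-01-20", "2009-02-03",
                     "2009-01-07", "2009-02-11", "2009-02-25"],
    "InscClaimAmtReimbursed": [4000, 2000, 5000, 1500, 1000, 2000],
})
claims["ClaimStartDt"] = pd.to_datetime(claims["ClaimStartDt"])
claims["ClaimMonth"] = claims["ClaimStartDt"].dt.to_period("M")

# Total paid and average paid per claim, per month and fraud flag.
monthly = (
    claims.groupby(["ClaimMonth", "PotentialFraud"])["InscClaimAmtReimbursed"]
    .agg(total="sum", avg_per_claim="mean")
    .unstack("PotentialFraud")  # one column per fraud flag, side by side
)
print(monthly)
```

Putting the total and the average side by side makes it easier to tell whether a higher total comes from more claims or from more expensive claims.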
Another great metric used in healthcare is PMPM. In the industry this usually stands for per member per month; here we treat it as per patient per month. It is a great way to see how much a patient is costing per month.
So instead of looking at the average claim cost, we will look at the average patient cost per month. Technically, we should calculate this based on whether or not a patient has valid coverage for the month.
However, due to the data set we don’t really have that specific data. So for now we are using the proxy of the patient’s ID. It’s not perfect, but it is what we will use for now as seen in the code below.
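A minimal sketch of that proxy: total paid in a month divided by the number of distinct patient IDs with a claim in that month. The column names (`BeneID`, `ClaimStartDt`, `InscClaimAmtReimbursed`, `PotentialFraud`) are assumptions and the rows are invented toy data.

```python
import pandas as pd

# Toy stand-in for the joined claims table; all values are invented.
claims = pd.DataFrame({
    "PotentialFraud": ["Yes", "Yes", "Yes", "No", "No"],
    "BeneID": ["B1", "B1", "B2", "B3", "B4"],
    "ClaimStartDt": ["2009-01-05", "2009-01-20", "2009-01-22",
                     "2009-01-07", "2009-01-11"],
    "InscClaimAmtReimbursed": [4000, 2000, 3000, 1500, 1000],
})
claims["ClaimMonth"] = pd.to_datetime(claims["ClaimStartDt"]).dt.to_period("M")

monthly = claims.groupby(["ClaimMonth", "PotentialFraud"]).agg(
    total_paid=("InscClaimAmtReimbursed", "sum"),
    patients=("BeneID", "nunique"),  # proxy: patients seen, not members covered
)
monthly["pmpm"] = monthly["total_paid"] / monthly["patients"]
print(monthly)
```

Because only patients with a claim are counted, this proxy will read higher than a PMPM computed over everyone with coverage in the month.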
Looking at this, you will notice that providers that are likely to have fraudulent claims also charge about 2 times as much per patient as the non-fraudulent providers.
Now, why is it important that we have done this exploratory analysis before diving into model development?
The reason is that this provides a solid business case to sell to your stakeholders for why you would like to invest further in this project.
You already have a business reason that would intrigue any business partner. Based on the monthly spend charts, your insurance provider could be saving upwards of 750,000 USD a month, or several million dollars a year, if you were able to crack down on this fraud.
In addition, you are already seeing some tendencies of fraudulent providers.
But you can’t stop analyzing the data just yet.
Always Get More Support
Now, as a data scientist or analyst you will want further supporting evidence to continue down this avenue.
This means bringing in other angles from this data that can further support the point of the providers costing your insurance company far more than is required.
Here are a few ways we can do so.
Let’s start by looking at the overall count of claims each physician had in a year, and at how the breakdown compares between fraudulent and non-fraudulent providers.
What you will notice is that there is a drastic difference in the number of claims handled by physicians at providers where there is a likelihood of fraud vs. our non-fraud physicians.
In addition, physician PHY330576 seems to be handling a much larger number of claims compared to even their peers at the fraudulent providers. This would be worth digging into.
There could be a business reason why this physician submits so many more claims. Perhaps they handle procedures that are very small and easy to do, and it could just be a confounding factor.
Again, hard to say. But, this still further supports the idea that the fraudulent providers are providing or claiming to provide extra services that are not needed.
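The per-physician comparison above could be sketched as follows. The column names (`AttendingPhysician`, `ClaimID`, `PotentialFraud`) are assumptions, and the physician IDs and rows are hypothetical toy data.

```python
import pandas as pd

# Toy stand-in for one year of joined claims; all values are invented.
claims = pd.DataFrame({
    "PotentialFraud": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No"],
    "AttendingPhysician": ["PHY_A", "PHY_A", "PHY_A", "PHY_B",
                           "PHY_C", "PHY_C", "PHY_D"],
    "ClaimID": ["C1", "C2", "C3", "C4", "C5", "C6", "C7"],
})

# Number of distinct claims per physician, keeping the fraud flag.
per_physician = (
    claims.groupby(["PotentialFraud", "AttendingPhysician"])["ClaimID"]
    .nunique()
    .rename("claim_count")
    .reset_index()
)

# Distribution of claim counts for each group, plus the busiest physicians.
summary = per_physician.groupby("PotentialFraud")["claim_count"].describe()
top = per_physician.sort_values("claim_count", ascending=False).head(3)
print(summary)
print(top)
```

Sorting by claim count is a quick way to surface outliers that deserve a closer look.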
Let’s take one last look at this from another angle.
Instead of monthly breakdowns, let’s try analyzing the number of claims a physician handles per day. If we compare daily claim counts per physician between fraudulent and non-fraudulent providers, what do we find?
Looking at the two charts, we can see there are many more cases of 3 or more claims per day among the fraudulent providers than among the non-fraudulent providers.
In addition, when you look into it further, you will find that fraudulent providers have 15% of claims with 3 or more claim IDs in a day, compared to 3% for non-fraudulent providers.
As you can see, the fraudulent providers are submitting many more claims per day.
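A sketch of one way to measure this, under one possible reading of the metric: the share of physician-days that have 3 or more distinct claim IDs. The column names (`AttendingPhysician`, `ClaimStartDt`, `ClaimID`, `PotentialFraud`) are assumptions and the rows are invented toy data.

```python
import pandas as pd

# Toy stand-in for the joined claims table; all values are invented.
claims = pd.DataFrame({
    "PotentialFraud": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No"],
    "AttendingPhysician": ["PHY_A", "PHY_A", "PHY_A", "PHY_A",
                           "PHY_B", "PHY_B", "PHY_B"],
    "ClaimStartDt": ["2009-01-05", "2009-01-05", "2009-01-05", "2009-01-06",
                     "2009-01-05", "2009-01-06", "2009-01-06"],
    "ClaimID": ["C1", "C2", "C3", "C4", "C5", "C6", "C7"],
})

# Distinct claims per physician per day.
daily = (
    claims.groupby(["PotentialFraud", "AttendingPhysician", "ClaimStartDt"])["ClaimID"]
    .nunique()
    .rename("claims_per_day")
    .reset_index()
)

# Share of physician-days with 3 or more claims, per fraud flag.
daily["three_plus"] = daily["claims_per_day"] >= 3
share = daily.groupby("PotentialFraud")["three_plus"].mean()
print(share)
```

The threshold of 3 comes from the charts discussed above. Counting claims instead of physician-days is an equally reasonable definition and would give different percentages.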
This is highly suspect and would be a great place to start analyzing data.
From here your goal as an analyst would be to analyze what types of claims have 3 or more claims per day. This might give you a pattern of behavior.
However, we will stop our analysis for now.
Now, before we go on, we wanted to point out a nifty feature that helped us during our analysis. SaturnCloud.io is automatically integrated with git. This means that if you accidentally change something in your code while working through these questions and don’t remember what it was, you can easily roll back. This came in handy because you’re not even seeing all the charts we developed.
Thus, having the ability to roll back and see if there were snippets of code that made more sense was very helpful!
In The End, Exploratory Analysis Can Save Time And Increase Buy-In
In the original analysis on Kaggle they tried to develop a model right away without really finding a target population.
Here, we have a possible population (physicians that provide 3 or more claims per day) that we might want to target. Now this would again be brought up in a meeting with stakeholders. But it goes to show why EDA is important.
It’s not always about going head first into the model. Sometimes it is about first developing solid support into what populations might be worth looking at.
In this case, there is value in analyzing the cases of 3 or more claims per day, as that seems to be a factor.
From here you would want to see what procedures or diagnoses are included in these cases, as that might provide further information about what is going on.
We do hope this gave you valuable insight into why EDA is important. It helps you get a better understanding about the data while at the same time providing support you can provide your business partners.
If your team needs data consulting help, feel free to contact us! If you would like to read more posts about data science and data engineering, check out the links below!
The Advantages Healthcare Providers Have In Healthcare Analytics
142 Resources for Mastering Coding Interviews
Learning Data Science: Our Top 25 Data Science Courses
The Best And Only Python Tutorial You Will Ever Need To Watch
Dynamically Bulk Inserting CSV Data Into A SQL Server
4 Must Have Skills For Data Scientists
What Is A Data Scientist