Using BigQuery And SaturnCloud To Analyze Medical Data

Using BigQuery And SaturnCloud To Analyze Medical Data

September 17, 2019 Data Science Consulting programming 0
healthcare analytics

Today we wanted to use discuss using cloud tools that are available to everyone to analyze a medical data set.

In particular will be using the Kaggle data set for medicare providers. This has information on diagnosis related groups average costs, hospital locations and interesting facts about providers and their service quality.

These data sets provide you the ability to answer questions about which states have the highest quality of service (based on the metrics they provide), the highest average costs as well as look into demographics of the patient bases.

Since healthcare costs tend to be one of the largest focuses, we are going to look at which states have the largest percentage of costs that are above average. We will be using SaturnCloud.io as a platform for our Jupyter notebook and BigQuery in order to query the data.

Tools We Will Use

SaturnCloud.io

For those unfamiliar with tools like SaturnCloud. SaturnCloud.io allows you to easily spin up a VM that has already has Jupyter notebook installed. But it is really much more than just that.

Using SaturnCloud.io makes it easy to share your work with other teammates without having to deal with the hassle of making sure they have all the correct libraries installed. Also, for those of you who have had the issue of running Jupyter Notebooks locally and run out of memory, it also allows you to spin up VMs with the memory and Ram required and only pay for what you use.

There are also a few other bells and whistles that really mitigate some of the complexities of work typically done from a DevOps perspective. That’s what makes tools like this great. In many ways, until you work in a corporate environment where even sharing a basic Excel document can become cumbersome, it is almost hard to explain to younger data scientists why this is so great!

Big Query

Due to where the data is being stored. We will also need to interact with Big Query. This is a Google Cloud service that allows you to query data. In this case, the medicare data is open to everyone but is stored on Google Cloud.

As like most cloud services, this makes it convenient to manage as long as Google keeps this data public. It is really easy to pull the data and run queries without having to download a csv and load it into your own database.

Before we get started, you can also check out the Jupyter notebook here.

Connecting to Big Query

For step one we will be retrieving the cms_medicare public data set. In order to do so you will need to set up Google API credentials (see instructions here). You will notice that the credentials don’t so much require an API key as much as the .json file that has the key as seen below.

from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
filename="JSON_KEYPATH"
)
    client = bigquery.Client(
    credentials=credentials,
    project=credentials.project_id,
)

medicare = client.dataset('cms_medicare', project='bigquery-public-data')

 

Once you have set up your credentials you can then create a reference to a client. This will be your access point.

To get an understanding about what you are working with, you can actually print the list of tables in the cms_medicare data set. Using the code below you can iterate through the tables so you know what you are working with.

Now once you have pulled the data set, you actually have access to several tables. You can run the print statement above to get a list of all the tables.

 

medicare = client.dataset('cms_medicare', project='bigquery-public-data')
print([x.table_id for x in client.list_tables(medicare)])

 

For this analysis we will be using the inpatient_charges_2014 data set. This has information on the provider, the DRG code (which stands for Diagnosis Related Group) which isn’t 100% on the diagnosis as is an ICD code, but it is what the data set contains. We say that because we will be comparing DRG codes across states. This isn’t really a fair comparison because there could be all sorts of specifics lost in the roll-up that could lead to higher or lower costs. We are doing this for the sake of showing how you could do future work.

Start Your Analysis by Asking A Question

To start this analysis let’s just ask a basic question. Like stated earlier we want to know which states consistently charges more per drg definition than other states.

What States Consistently have Higher Than Average Prices

SELECT
    sum(total_discharges*average_total_payments)/sum(total_discharges) AS avg_cost_drug,
    drg_definition
FROM    `bigquery-public-data.cms_medicare.inpatient_charges_2014`
GROUP BY
Drg_definition

 

In order to do so, we will need to calculate the average price at the granularity of state. This can be done both using pandas as well as SQL (personally, I believe SQL is usually better built for this kind of work). So, thats what we will use. We can write the query like the one below.

 

#!/usr/bin/python
# -*- coding: utf-8 -*-
import pandas

# How many nurrsing facilitins in different States

query1 = """ SELECT
  (
  sum(
    CASE
      WHEN avg_cost_drug < average_total_payments THEN 1
    ELSE
      0
    END
      )*1.0/count(*)) perc_over,
    count(distinct a.drg_definition) total_drg,
    provider_state
FROM 
    (
    SELECT
        sum(total_discharges*average_total_payments)/sum(total_discharges) AS avg_cost_drug,
        drg_definition
    FROM  `bigquery-public-data.cms_medicare.inpatient_charges_2014`
    GROUP BY
    drg_definition 
    ) a
INNER JOIN  `bigquery-public-data.cms_medicare.inpatient_charges_2014` b
ON  a.drg_definition = b.drg_definition
INNER JOIN  `bigquery-public-data.cms_medicare.hospital_general_info` c
ON  b.provider_id = c.provider_id
GROUP BY
provider_state
"""

query_job = client.query(query1)
results = query_job.result()  # Waits for job to complete.

perc_over = []
total_drg = []
provider_state = []

for i in results:
    perc_over.append(i[0])
    total_drg.append(i[1])
    provider_state.append(i[2])
state_df = pd.DataFrame({'perc_over': perc_over,
                        'total_drg': total_drg,
                        'state': provider_state})

 

Using this query as a subquery we can join that both to the hospital_general_info and then back to the original table. This will provide the original cost as well as the state information. This will allow us to compare the cost of drg_definitions as well as group by state.

Then we can use a case statement to calculate what percentages of drg definitions are over the average costs. This can be seen in the query below in the query1 variable.

Using this query you can now figure out which states have the highest percent of drg definitions that charge more than the average. Below you can see us pulling the top few and you will notice that the highest states are Alaska, California and Vermont.

Data Visualization — Mapping Healthcare Costs

This is a great example of somewhere where you can use a simple data visual to distill the information into a concise graphic. Looking at numbers and state codes is hard. It becomes hard to really get what is going on. Instead, let’s use the ploty library to map it out

from plotly.offline import init_notebook_mode, iplot

import plotly.graph_objs as go
import matplotlib.pyplot as plt
import chart_studio.plotly as py
from plotly import tools

import warnings
warnings.filterwarnings("ignore")

import pandas as pd
init_notebook_mode(connected=True)
scl = [[0.0, 'rgb(248,255,206)'],[0.2, 'rgb(203,255,205)'],[0.4, 'rgb(155,255,164)'], [0.6, 'rgb(79,255,178)'],[0.8, 'rgb(15,183,132)'], [1, '#008059']]


data = [ dict(
        type='choropleth',
        colorscale = scl,
        autocolorscale = False,
        locations = state_df.state,
        z = state_df.perc_over,
        locationmode = 'USA-states',
        text = state_df.state,
        marker = dict(
            line = dict (
                color = 'rgb(255,255,255)',
                width = 2
            ) ),
        colorbar = dict(
            title = "Percentage ")
        )
       ]


layout = dict(
        title = 'Percentage Of DRGs That Charge More Than Average By State',
        geo = dict(
            scope='usa',
            projection=dict( type='albers usa' ),
            showlakes = True,
            lakecolor = 'rgb(255, 255, 255)'),
             )

fig = dict( data=data, layout=layout )
iplot( fig, filename='d3-cloropleth-map' )

 

With this map you can better see which states are the biggest culprits of high costs. Instead of being stuck staring at numbers you now have a clear picture of which states charge more than the average.

Again, this isn’t at the granularity of ICD codes so a lot of information is lost but this is a good generalization.

Next Steps And Conclusion And Youtube

You can check out part 1 and 2 on my youtube channel

From here we would look into the various high cost states and see if there were specific DRG codes that were consistently over at the provider level. But for now we will wrap up this post before it gets too long.

If you would like you can check out the Jupyter notebook here. You might even want to pose you own questions about the costs of different medical procedure categories, the level of service at different providers, and other conclusions you can draw using this data set.

If you do happen to use this as a base for some project, let us know about it! We would love to see what questions you ask and what you end up finding out.

Also, if you would like to read more helpful topics on data science and programming, then check out these articles:

Connecting To Big Query Using Jupyter Notebook On SaturnCloud Part 2

Hadoop Vs Relational Databases

How Algorithms Can Become Unethical and Biased

How To Improve Your Data Driven Strategy

How To Develop Robust Algorithms

4 Must Have Skills For Data Scientists

SQL Best Practices — Designing An ETL Video