Time Series Data Modeling for Medical Devices

Posted by Dani Traphagen on Mon 15 February 2016

Medical Devices as IoT

Over several years, Internet of Things (IoT) has exploded into a massive market encompassing a huge variety of industries. I have heard a lot of theories as to why this has happened. Our smart phones have become trusted advisors, aggregating our information from various sources and simplifying our lives. Perhaps it's in our willingness to offer personal information to a database somewhere for added comfort or ease.

There's another reason that IoT has begun dominating the consumer market. It's become easier to meet consumer demand due to decreasing storage costs. As the price per gigabyte of data has decreased, operations at massive scale have become more cost efficient. So, to summarize, what have we got? The technological feasibility to produce a "Meet the Jetsons" lifestyle at costs that make sense for the typical middle class family and a major profit margin to those who are pushing futurism.

IOT Oven

So yeah, it makes sense for your oven to text you, or for your coffee maker to learn more about your caffeine addiction to better serve you a hot cup, or even for your water bottle to tell you how much water you drank today. Have you thought about the possibility of your insulin pump doing the same? How about managing your conditions with an application on your phone? What if this same application wasn't just for you to manage your health? What if it was also sent to an ongoing health profile for your doctor to provide even better and more precise treatment? Now, let's take it up another notch, what if we were able to do machine learning and predictive analytics on the data we accrue to personalize medicine and employ safer medical techniques? This is what's possible when medical devices, drug delivery and health care utilize today's hottest technologies. It starts with managing Time Series data with the right software stack.


Why is Time Series the right data model for this?

Firstly, let's define time series data for the big data world. TL;DR Time series data is a metric crap-ton of data that we expect to receive at high volume and velocity continuously for a given time period. Just as time is linear (until some awesome, freak show science reveals otherwise) our time series data is as well, and we leverage both the cold and hot data we store for various uses. For instance, we can store cold and hot data differently depending what we may want to do with it. We might find we keep what is meaningful at the ready on memory or in our super sleek SSDs, and dump old data in less expensive locations like a designated file system we ETL (Extract, Transform, Load) to. We can get creative with how we store this data, but the main take away here is we want to store what is meaningful given our application and since chemical interactions within the human body are tracked by and dependent upon time, that's what we need to store our data with...and we need to store it smart.

What db should I use for Time Series Data?

Apache Cassandra hands down. Why do I say this? Whenever we're looking at a choice in distributed databases we need to consider the type of distributed problems they solve. Apache Cassandra prioritizes Availability (continuous availability) and Partition tolerance to achieve zero downtime. It also allows for tunable consistency as part of a trade-off, which means that we can get stronger consistency on our data if we pay a performance price. If we choose eventual consistency, we retain the high performance Cassandra is known for and while writes or reads may lag on the freshest update this is only by a matter of milliseconds. This is okay for our purposes because we can plan for this in our application, whereas sluggish operations would be crippling. In addition to the problems Cassandra solves, it also serves as an ideal solution for managing clustered data, which is what we have in spades with time series. But, how does it all connect?

Cassandra has a peer-to-peer architecture where data is distributed around a ring of nodes. Each node owns specific portions of a token range, and the data is replicated amongst the nodes with a tunable replication factor. This has important impacts for data retrieval for a number of reasons. If we want performant reads and our data is distributed around peer nodes, we need to consider our data access patterns ahead of time so we know how to optimize for our reads in our data model design.

We also need to employ a process called denormalization. That means writing the same data more than once by designing a table for each query and inserting the same data into it. Denormalization will help us retrieve data faster and since writes are cheap and as referenced earlier so are disks, it makes sense to denormalize the hell out of our data. A related and historical fact about Cassandra was it was architected to reduce rotational disk seeks, by using an append-only process for writes. This is a reason among many that with Cassandra, writes are cheap operations and gives us some added benefits for time series data.

Since we write data in a sequential fashion with Cassandra, by providing the Primary Key (partition key and any clustering columns, if applicable) and the time bounds required, the data will be retrieved with minimal disk seek overhead. To make important application decisions on how to process and respond to the results of our query, performance is key. For medical devices or drug delivery, this can be crucial, mission critical information that is as important as life and death. We simply can't afford downtime and Cassandra meets these demands.

How to Model Time Series Data:

Now that we know Time Series data is the type of data we want to collect for medical device applications and Cassandra is the right database to store it, we can start to talk about how to actually model this data. I've put together some examples based upon a time series collection of insulin readings that are being gathered on a minute by minute basis. This could become a useful application for Diabetes patients to manage their insulin levels and a great tool for their doctors to understand the patterns of their behavior to offer better care.


Imagine a world where you AND your doctor can interact with your medical device or drug delivery system via a user interface, providing a single source of truth into the current view of your health. That would have drastic effects on the way health care is provided. So, let's start creating a theoretical table based upon this example. Pro tip: If you fancy following along and want to try this for yourself, scroll to the bottom of this page and use the script I've provided.

Creating a Keyspace and Table:

First we need to create a Keyspace for our Table:

CREATE KEYSPACE insulin_pump
WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 }

I've used SimpleStrategy as our Topology Strategy but you would never do that in production if you ever intended to go multi-dc. Read more about Keyspace syntax here.

USE insulin_pump;

This command allows us to use the Keyspace we just created. You will get an error if you do not do this step.

CREATE TABLE insulin_level (
insulin_pump_id text,
event_time timestamp,
insulin_level_mcU_ml int,
PRIMARY KEY (insulin_pump_id,event_time)

We create the table for insulin_level which is partitioned by insulin_pump_id as our partition key and event_time as our clustering column.

Insert Data:

Next let's insert some data into our insulin_level table:

INSERT INTO insulin_level(insulin_pump_id,event_time,insulin_level_mcU_ml)
VALUES ('0908070605','2016-02-01 09:01:00','5');

INSERT INTO insulin_level(insulin_pump_id,event_time,insulin_level_mcU_ml)
VALUES ('0908070605','2016-02-01 09:02:00','6');

INSERT INTO insulin_level(insulin_pump_id,event_time,insulin_level_mcU_ml)
VALUES ('0908070605','2016-02-01 09:03:00','6');

INSERT INTO insulin_level(insulin_pump_id,event_time,insulin_level_mcU_ml)
VALUES ('0908070605','2016-02-01 09:04:00','7');

Let's query what we just added:

SELECT event_time,insulin_level
FROM insulin_level
WHERE insulin_pump_id='0908070605';

Here's our output:

event_time               | insulin_level_mcU_ml
2016-02-01 09:01:00+0000 |      5
2016-02-01 09:02:00+0000 |      6
2016-02-01 09:03:00+0000 |      6
2016-02-01 09:04:00+0000 |      7

Now, let's take a look at a range of data, bucketed by two time points:

SELECT insulin_level_mcU_ml
FROM insulin_level
WHERE insulin_pump_id='0908070605'
AND event_time > '2016-02-01 09:01:00'
AND event_time < '2016-02-01 09:04:00';

Here we are looking at the range of time between February 2nd, 2016 at 9:01 to 9:04 and returning results within those bounds:


Now while a minute by minute reading is helpful for the internals of the insulin pump itself, it isn't particularly informative for the patient or for the doctor involved. Let's see what we can do to have a better overview of the patient's state over the course of the day.

Creating a table for insulin_level_by_day:

CREATE TABLE insulin_level_by_day (
insulin_pump_id text,
date text,
event_time timestamp,
insulin_level_mcU_ml int,
PRIMARY KEY ((insulin_pump_id,date),event_time)

Here our key couples the device id and the date, we call this type of PRIMARY KEY a compound Primary Key. This allows us to retrieve data by a given day when we query this table. I'll show you how to access that too.

Inserting values for insulin_level_by_day:

INSERT INTO insulin_level_by_day(insulin_pump_id,date,event_time,insulin_level_mcU_ml)
VALUES ('0908070605','2016-02-03','2016-02-03 09:01:00','9');

INSERT INTO insulin_level_by_day(insulin_pump_id,date,event_time,insulin_level_mcU_ml)
VALUES ('0908070605','2016-02-03','2016-02-03 09:02:00','10');

INSERT INTO insulin_level_by_day(insulin_pump_id,date,event_time,insulin_level_mcU_ml)
VALUES ('0908070605','2016-02-04','2016-02-04 09:01:00','11');

INSERT INTO insulin_level_by_day(insulin_pump_id,date,event_time,insulin_level_mcU_ml)
VALUES ('0908070605','2016-02-04','2016-02-04 09:02:00','11');

Here's how we get all the results for insulin level, given a particular device and date:

FROM insulin_level_by_day
WHERE insulin_pump_id = '0908070605'
AND date = '2016-02-04';

Then when we run the above, we get our results by day:

insulin_pump_id | date       | event_time               | insulin_level_mcU_ml
     0908070605 | 2016-02-04 | 2016-02-04 09:01:00+0000 |     11
     0908070605 | 2016-02-04 | 2016-02-04 09:02:00+0000 |     11

Pretty stellar huh?


Cassandra and time series go hand and hand. Hopefully, you can now see how powerful the cocktail can be for the health care industry. You can also do more advanced time series data modeling and obtain even more useful information from our data by leveraging Spark for analytics. We can use this awesome open sourced project to view our data differently and pursue a deeper understanding of the interactions we're tracking.

To try this out for yourself:

  1. Download Cassandra and install locally or in a VM. You can also use a pre-loaded Data Modeling VM from DataStax Academy too. Sign up and download the free VM here.
  2. Run the provided CQL script accessible here. and use the queries I've given above to follow along.
  3. Ask questions! Hit me up on twitter with @dtrapezoid
  4. Share this post if it will help out a friend!

As a final note, I'd like to take the time to expressly thank my reviewers for confirming my biohacker insanity: