Announcing Bunsen: FHIR Data with Apache Spark
We’re excited to open source Bunsen, a library to make analyzing FHIR data with Apache Spark simple and scalable. Bunsen encodes FHIR resources directly into Apache Spark’s native data structures. This lets users leverage well-defined FHIR data models directly within Spark SQL.
Here’s a simple query against a table of FHIR observations that produces a table of heart rate values:
spark.sql("""
    select subject.reference person_id,
           effectiveDateTime date_time,
           valueQuantity.value value
    from observations
    where in_valueset(code, 'heart_rate')
""").show()
This prints a table with one row per heart rate observation and three columns: person_id, date_time, and value.
Notice that each field in the above SQL is fully defined by the FHIR Observation model. This is because the table schemas are generated directly from FHIR resource definitions, ensuring these queries exactly match other FHIR-based views of the same data.
Bunsen also provides a collection of helpful functions to make querying data easy. The above query includes the in_valueset user-defined function, allowing users to use code value sets directly in the query. You can see the Bunsen value set documentation for details.
Scalability and Performance
Because Bunsen encodes FHIR resources in Apache Spark’s efficient binary format, we get all of Spark’s scalability and performance advantages. Simple queries across billions of FHIR resources typically return in single-digit seconds on our internal clusters. Arbitrary joins and aggregations of complex datasets scale with your Apache Spark cluster. We take advantage of Spark’s built-in support for Apache Parquet to read and write FHIR data in an efficient columnar format that other systems can read as well.
Spark SQL offers rich query semantics that can now be used directly over FHIR data models. For instance, here is a query that builds a timeseries-like table directly from a collection of observations by simply grouping items by the person and time period. This is just standard Spark SQL wrapped around our simple valueset-based function.
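A minimal sketch of what such a grouping query might look like, reusing the observations table and the in_valueset function from the earlier example. The daily time bucket (taken from the first ten characters of effectiveDateTime, assuming ISO-8601 strings), the choice of aggregates, and the names HEART_RATE_BY_DAY, heart_rate_by_day, and obs_day are illustrative assumptions, not part of Bunsen:

```python
# Sketch only: the daily bucket and the aggregates are illustrative assumptions.
HEART_RATE_BY_DAY = """
select subject.reference person_id,
       substr(effectiveDateTime, 1, 10) obs_day,
       avg(valueQuantity.value) avg_heart_rate,
       min(valueQuantity.value) min_heart_rate,
       max(valueQuantity.value) max_heart_rate
from observations
where in_valueset(code, 'heart_rate')
group by subject.reference, substr(effectiveDateTime, 1, 10)
order by person_id, obs_day
"""


def heart_rate_by_day(spark):
    """Run the query on an existing SparkSession.

    Assumes an `observations` table is available and the `in_valueset`
    function has been registered, as in the earlier example.
    Returns a Spark DataFrame with one row per person and day.
    """
    return spark.sql(HEART_RATE_BY_DAY)
```

Calling heart_rate_by_day(spark).show() would print the grouped rows. A coarser or finer period can be had by changing the substr length, for example 7 characters for a monthly bucket.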
Typical queries may aggregate many other types of data and join to conditions, allergies, or other tables to build a more complete report. All of this can be done interactively over billions of records.
Bunsen uses the HAPI FHIR library to represent data in object form. Java users can convert their HAPI objects to Spark-native structures and back with a few lines of code. Here’s an example:
Users can also leverage Spark’s built-in functionality to write these datasets to tables and query these tables later.
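As a rough sketch of that workflow from Python, the helpers below save a DataFrame as a table and query it again later using standard Spark calls (DataFrameWriter.saveAsTable and spark.sql). The function names save_as_table and count_rows, and the choice to overwrite an existing table, are assumptions made for illustration:

```python
def save_as_table(dataset, table_name):
    """Persist a Spark DataFrame as a table, replacing any existing table
    of the same name (the overwrite mode is an illustrative choice)."""
    dataset.write.mode("overwrite").saveAsTable(table_name)


def count_rows(spark, table_name):
    """Query a previously saved table; this can run in a later session
    as long as it shares the same metastore."""
    row = spark.sql("select count(*) n from {}".format(table_name)).first()
    return row["n"]
```

For example, save_as_table(observations, "heart_rate_observations") followed later by count_rows(spark, "heart_rate_observations"), where the table name is hypothetical.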