Google’s $2 billion acquisition of Fitbit last month has been met with concern from privacy advocates worried about how the tech giant will use personal fitness data. This reaction prompted Google to clarify that the acquisition is "about devices, not data."
The deal has brought to light a larger issue that we all seem to gloss over: Every day, millions of people share seemingly innocuous personal health information with many stakeholders, including employers, insurance companies and healthcare providers, and even post it publicly on the Internet.
This becomes especially concerning at a time when hundreds of clinical studies, some with hundreds of thousands of participants, may request permission to use that same fitness-tracker data to study everything from obesity to COVID-19 symptoms. In the service of public health, many of these datasets are then made publicly available so that other researchers can reproduce the findings or conduct new analyses. But this is not a risk-free situation.
Examples of fine-grained step data shared on public social networks: Garmin Connect platform (left), Fitbit steps shared automatically on Twitter (right).
In a world where "anonymized" study participants can be individually re-identified simply by using a genealogy database, it’s not a huge leap to imagine malicious actors figuring out the true identity of a study participant by triangulating something as simple as their step count.
Consider that fitness data such as step counts is just a sequence of numbers, much like DNA is a sequence of the nucleotides C, G, T and A. As the length of the sequence grows, the likelihood of someone else having exactly that sequence over the same dates decreases exponentially.
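To see why, here is a rough back-of-the-envelope sketch in Python. It is a toy model, not the analysis from any specific study: the number of distinguishable daily step values and the assumption that days are independent are simplifications chosen purely for illustration, and the population size comes from the example in the next paragraph.

```python
# Toy model: each day's step count falls into one of K distinguishable
# buckets, independently across days and people. Under that assumption,
# the expected number of OTHER people who coincidentally share your exact
# d-day sequence in a population of N is roughly N / K**d.

POPULATION = 100_000_000   # 100 million people, as in the example below
BUCKETS_PER_DAY = 30       # assumed number of distinguishable daily values

def expected_coincidental_matches(days: int,
                                  population: int = POPULATION,
                                  buckets: int = BUCKETS_PER_DAY) -> float:
    """Expected count of other people matching an exact step sequence."""
    return population / (buckets ** days)

for d in range(1, 8):
    print(f"{d} day(s): ~{expected_coincidental_matches(d):,.2f} "
          "expected coincidental matches")

# With these illustrative parameters, the expected number of lookalikes
# drops below one at around six days of data.
```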
Just six days of step counts are enough to uniquely identify you among 100 million other people. Step counts are a unique key that can be used to match the weekly step log from your latest tweet to the "anonymized" step count in a research dataset – a dataset that may also list other sensitive information, like a mental health diagnosis. Without a course correction, re-identification attacks on this kind of data will only become easier, as they have for other complex datasets in the past.
Schematic of a re-identification attack based on wearable data. A person with a heart condition decides to participate in a research study that collects physical-activity information through a wearable device, in addition to information about his condition (1). The participant also uses a social network to share the outcomes of his physical activity and set weekly goals (2). At the end of the study, the research data is anonymized and made publicly available (3). A malicious actor can retrieve the anonymized dataset and the data published on the social network and match them on the physical-activity time series (4). The malicious actor can re-identify the study participant and link his social network identity to the medical condition (5).
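To make the matching step concrete, here is a minimal Python sketch of step (4) in the schematic above. All records, handles, pseudonyms and the matching tolerance are fabricated for demonstration; a real attack would operate on far larger datasets and tolerate noise in the measurements.

```python
# "Anonymized" research records: pseudonym -> (daily step counts, sensitive label)
research_data = {
    "participant_0412": ([10231, 8457, 12044, 9312, 11002, 7604], "arrhythmia"),
    "participant_0977": ([4023, 5311, 4980, 6122, 5904, 5207], "none"),
}

# Step counts scraped from public social-media dashboards, tied to real identities
public_profiles = {
    "@jane_runs": [10231, 8457, 12044, 9312, 11002, 7604],
    "@couch_potato": [2031, 1800, 2204, 1904, 2100, 1999],
}

def sequences_match(a, b, tolerance=0):
    """Exact (or near-exact) match of two equal-length step sequences."""
    return len(a) == len(b) and all(abs(x - y) <= tolerance for x, y in zip(a, b))

# Join the two sources on the step-count time series
for handle, public_steps in public_profiles.items():
    for pseudonym, (study_steps, condition) in research_data.items():
        if sequences_match(public_steps, study_steps):
            print(f"{handle} re-identified as {pseudonym}; condition: {condition}")
```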
To reduce these risks, we would ideally see fundamental changes in the business models of companies gathering fitness data. In the meantime, we need to educate research participants about the risks of their wearable data leaking through other channels. If someone enrolling in a study will be using their own personal wearable and is concerned about privacy, researchers should advise them to turn off public dashboards and to unlink other apps that use their data.
Researchers should also make sure that datasets are not naively released into the public domain, but instead limited in use to qualified researchers who pledge to respect participant privacy. To release data without restriction, researchers should demonstrate both meaningful informed consent to the release and true de-identification. (Methods from differential privacy have recently been used by Google for COVID-19-related data releases, as well as by the U.S. Census Bureau for the 2020 Census. However, such methods are still in the research phase for fitness data.)
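For readers curious what such methods look like in principle, below is a minimal Python sketch of the classic Laplace mechanism from differential privacy applied to a single aggregate statistic over toy step data. The query, epsilon value and dataset are illustrative only; protecting full fitness time series this way is, as noted, still an open research problem.

```python
import random

daily_steps = [10231, 4023, 12044, 9312, 2031, 7604, 5311]  # toy dataset

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) noise as the difference of two exponentials."""
    return scale * (random.expovariate(1.0) - random.expovariate(1.0))

def dp_count_over_threshold(steps, threshold=10_000, epsilon=1.0):
    """Count of days above `threshold`, released with epsilon-DP.

    Adding or removing one day changes the true count by at most 1,
    so the query's sensitivity is 1 and the noise scale is 1/epsilon.
    """
    true_count = sum(1 for s in steps if s > threshold)
    return true_count + laplace_noise(1.0 / epsilon)

print(dp_count_over_threshold(daily_steps))  # noisy answer, differs run to run
```

The point of the sketch is the trade-off it makes visible: a single aggregate can be protected with modest noise, whereas releasing each participant's day-by-day sequence would require far more noise to offer comparable guarantees.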
At a broader level, the sensitivity of fitness data goes beyond the risk of re-identification; it creates real risk for every individual with a digital health product. Fitness data contains information about our hearts, our sleep and our lungs – and soon enough, it will contain information about our cognition.
Asking each of us to manage this kind of risk over time as individuals is unacceptable, particularly when the collectors, aggregators and users of this data face no restrictions that force them to consider ethics, data privacy and anti-discrimination.
We need systemic reform through legislation on digital specimens that mirrors the Genetic Information Nondiscrimination Act (GINA). GINA protects Americans’ genetic information from being used against them in health insurance and employment decisions. It’s long past time to create those same protections for fitness data and other digitally captured – and currently unprotected – health data.
Wearables and other sensors can transform how we understand the health of individuals and populations, and they can do it at scale. However, to ensure that we deploy these tools to help improve lives and not harm them, better privacy protections are needed. Immediately.
Luca Foschini (@calimagna) is a co-founder and the chief data scientist of Evidation, a health data analytics company developing new ways of measuring health in everyday life while respecting individual privacy. Foschini's research in digital medicine spans cybersecurity, machine learning and (big) data privacy.