May 30, 2019

Organizing Silos of Patient Information for Rare Disease Research

As described previously, rare disease research is difficult because the data required to do it is scattered across multiple silos. Here we link to a paper describing a method of querying multiple silos without disclosing data from any one of them, which allows silo owners to put otherwise unused data to work while maintaining control of it. This post is technical and geared towards information technologists. Key points from the paper are below.

Mitigating Forgetting in Small Federated Learning Networks

Introduction

Cancer drug development is slow and costly. Just 6.6% of cancer patients currently benefit from existing drugs, and at the current rate of progress it would take more than 200 years to help all existing patients. [20180424] One way to reduce both the time and the cost is to automate the early stages of the drug development pipeline, and several stages of various pipelines now use Deep Learning models to assist in this.

Unfortunately, Deep Learning models require a great deal of data, and most of that data is fragmented and resides behind the paywalls of disparate organizations. Collecting the data into a central repository is difficult due to a variety of competitive, legal, and privacy constraints (such as HIPAA [HIPAA]). What is needed, and what is described here, are Federated Learning mechanisms by which organizations can collaborate while maintaining control over their own data. We describe several such mechanisms that overcome the model “detuning” (called “Catastrophic Forgetting” in the literature) that can arise when Federated Learning is implemented by passing a model from silo to silo in a distributed environment.
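
To make this failure mode concrete, below is a minimal sketch, in plain Python/NumPy, of the sequential pass-the-model-around training pattern that produces the forgetting. Everything in it is illustrative: the names (make_silos, local_sgd), the toy linear least-squares model, and the synthetic data are our assumptions, not code from the paper.

import numpy as np

rng = np.random.default_rng(0)

def make_silos(n_silos=3, n_rows=200, dim=10):
    """Synthetic stand-ins for the organizations' private datasets."""
    true_w = rng.normal(size=dim)
    return [(X, X @ true_w + 0.1 * rng.normal(size=n_rows))
            for X in (rng.normal(size=(n_rows, dim)) for _ in range(n_silos))]

def local_sgd(weights, X, y, lr=0.01, batch_size=32):
    """One epoch of mini-batch SGD on a toy linear least-squares model."""
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        grad = X[idx].T @ (X[idx] @ weights - y[idx]) / len(idx)
        weights = weights - lr * grad
    return weights

# Sequential federation: the complete model visits each silo in turn and
# trains there to completion. By the last silo, the parameters have drifted
# toward the most recently seen data -- the "forgetting" described above.
silos = make_silos()
weights = np.zeros(10)
for X, y in silos:
    for _ in range(20):
        weights = local_sgd(weights, X, y)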

Abstract

We describe a series of "Federated Learning" experiments that create "Deep Learning" models while preserving the privacy of their distributed, siloed datasets. We do this by creating randomized, equal-length mini-batches in each silo at the beginning of each epoch, running Stochastic Gradient Descent locally, then combining the results and looping to the next epoch. Scheduling can be done either peer-to-peer or through a central server. This approach avoids the "forgetting" effect (model detuning) that occurs when a complete model is passed to each silo in succession for training. It is suited to organizations that cannot overtly make their data public, such as pharmaceutical and healthcare organizations that want to jointly create a Deep Learning model using all their datasets without exposing their data (for HIPAA or competitive reasons). Questions can be directed to bill@rarekidneycancer.org.
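
Read as a sketch, this procedure might look like the following, reusing the illustrative local_sgd and make_silos defined above. The simple averaging in the combine step is an assumption on our part; consult the paper for the exact combining rule and for the peer-to-peer versus central-server scheduling details.

# Round-based federation: every epoch, each silo trains locally from the
# same shared weights on its own randomized, equal-length mini-batches;
# only the resulting weights leave the silo.
def federated_epochs(silos, dim=10, epochs=20):
    shared = np.zeros(dim)
    for _ in range(epochs):
        local_results = [local_sgd(shared.copy(), X, y) for X, y in silos]
        shared = np.mean(local_results, axis=0)  # combine step (assumed: simple average)
    return shared

weights = federated_epochs(make_silos())

Because every silo starts each epoch from the same shared weights, no silo's contribution is simply overwritten by a later one, which is how the round-based scheme avoids the sequential forgetting effect.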
