As many of you know, for the last three years I've been trying to help move papillary kidney cancer research forward. To avoid overlapping with "big team/big money" efforts, I've focused on smaller projects: patient support (a website, a clinical trial search tool, encouraging a recommendation paper), data science (supporting Quantum Insights and Alex Feltus' work), citizen science (co-organizing a hackathon), and so on.
During this time, the key constraint I've come across is data access. With the notable exceptions of the NIH's TCGA and George Church's projects (Open Humans, PGP, etc.), every repository I've come across is "deposit only" (and as wonderful as it is, even TCGA has data access limitations). Some organizations cite "HIPAA" as a requirement and use that as a reason to restrict data access. Some require that data be kept private for competitive advantage. Others justifiably require "credit" for their hard work, which they won't get if they simply open their data. Some have "quality" requirements on the resulting work product.
So, as Felix Frayman and I have observed, maybe instead of viewing this as a "Data Processing" problem, we ought to view it as a "Game Theory" problem. Instead of trying to convince everyone to play the openData game of putting their data into a single open data silo, we can propose other forms of collaboration (games) that researchers and silo owners (players) are willing to play.
For example, here's a different game: suppose players simply want some aggregate statistics about the incidence of kidney cancer, where the data is kept private but the model can be public. With the right set of standards, players could create a privateDataPublicModelAggregation game in which they pass an empty spreadsheet to UCSF, who in turn fills it with their aggregated data. Since the data is aggregated, it is naturally anonymized, so UCSF could make that spreadsheet public without exposing their raw records. Clemson could then ask for it and aggregate in their own data. Any published work would credit the players who added to the model, and if a player added garbage, the game could simply roll back to the prior model. At this point, privateDataPublicModelAggregation meets the players' "HIPAA", "credit" and "quality" requirements.
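To make the rules concrete, here is a minimal sketch of one round of the privateDataPublicModelAggregation game. The "spreadsheet" is just a dictionary of aggregate counts; each site merges in its own aggregated (hence anonymized) numbers and is added to the credit list. The function name, the category labels, and all the counts are illustrative stand-ins, not real statistics or a real protocol:

```python
# Sketch of the privateDataPublicModelAggregation game (hypothetical names/data).
import copy

def play_round(model, site_name, site_counts):
    """One turn: a site merges its aggregate counts into the public model.

    Returns a NEW model, so the previous version survives for rollback."""
    new_model = copy.deepcopy(model)
    for category, count in site_counts.items():
        if count < 0:  # crude "garbage" check; a real game needs stronger QC
            raise ValueError(f"invalid count for {category}")
        new_model["counts"][category] = new_model["counts"].get(category, 0) + count
    new_model["credits"].append(site_name)
    return new_model

# Start with an empty public spreadsheet.
model = {"counts": {}, "credits": []}

# UCSF plays first, then Clemson. Order doesn't matter: addition commutes.
model = play_round(model, "UCSF", {"papillary_type1": 40, "papillary_type2": 15})
model = play_round(model, "Clemson", {"papillary_type1": 25, "papillary_type2": 10})

print(model["counts"])   # {'papillary_type1': 65, 'papillary_type2': 25}
print(model["credits"])  # ['UCSF', 'Clemson']

# A garbage play: the game simply keeps (rolls back to) the prior model.
try:
    model = play_round(model, "BadActor", {"papillary_type1": -999})
except ValueError:
    pass  # `model` is untouched; the prior version is still the public one
```

Returning a fresh copy on every turn is what makes rollback trivial: the previous public model is never mutated, so rejecting a bad play costs nothing.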
Now, privateDataPublicModelAggregation is "commutative": we get the same result no matter the order of play. Some games are not. In deep learning, for example, training on Clemson's data and then UCSF's does not give the same result as training on UCSF's and then Clemson's. I've figured out one way around that (for the dataClosedModelOpenDeeplearning game, anyway). privateDataPublicModelAggregation also has a Nash Equilibrium problem: the first player could do all the work, and later players could add nothing but still get credit. Contributing players wouldn't like that. In the dataClosedModelOpenDeeplearning game, however, each party has a "holdout set" that measures whether the model is getting better, and the model is not allowed to be transferred or credited unless it does.
There are issues with this. As James Hsieh commented, "It is a tall order to have hospital do that, I think", so it may make sense to have a centralized tool to aid in this (e.g. QRESP, or, if the goal is engagement, even something like Twitch). Besides monitoring intermediate results, this platform could also monitor global constraints, e.g. publishing game results once some metric was achieved (like Kaggle).
So, in summary, over the last three years I've realized that I had a pretty limited idea of what constitutes "collaboration". By opening that up, maybe we can make some progress on the "data silo" problem and move on to the next constraint, which is probably data cleaning/standardization.