Big Data and Industrial Research in Healthcare
Concluding Remarks
Biomedical data are by definition characterized by high variety and heterogeneity; the diversity of possible measurements directly depends on the many levels at which biology can be investigated. These data-producing biological levels include genomics, transcriptomics and proteomics (on the side of the more traditional omics), tissue- or single-cell specificity, imaging, and the clinical quantification of a large number of parameters, often repeated over time. Connected to the biology there is the phenotypic layer, the ensemble of physiological readouts that impact health status, disease and individual characteristics. Personal clinical parameters are also collected and usually stored in EHRs, along with any other medical treatment information. Traditionally experimental disciplines, i.e., life-science sectors that only a few years ago relied solely on wet-bench work, have been flooded with data, and almost any laboratory technique has been digitized and can be quantified numerically. We are constantly producing and collecting higher volumes of diverse data; thus, besides the 4Vs, it is imperative to add an additional "V": value.
Is there actual value in embracing the big data paradigm? Data are not good data merely because of their size, so big data per se are not a value. The added value is present and perceivable only if simple, comprehensible and possibly new and actionable information can be extracted from the mass of data with reasonable effort.
Information is actionable in research and in the clinic if it allows researchers to form further hypotheses or clinicians to make medical decisions, respectively. In fact, ever-increasing data volume and variety challenge human cognitive capacity, and an excess of raw data cannot be used directly for informed decisions: there is a significant gap between human cognitive capacity and data availability. Decision, or educated guessing, by clinical phenotype is a hallmark of traditional healthcare, but present and future biomedical sciences need to rely on multiple data sources, analyzed and integrated by computational methods and finally summarized into smaller, annotated pieces of rich information, to produce new actionable knowledge that augments decision-making capacity (Figure 1).
Figure 1. Transformation of the decision-making process. The increasing speed of data production, together with data volume and variety, challenges human cognitive capacity. Educated guessing, as the process of inference when information is not at hand, was largely the norm in the past; today the bottleneck of human ability to process information can be bypassed if data are correctly integrated to produce new actionable knowledge, thus augmenting human cognitive capacity. Decision by phenotype, typical of traditional healthcare, tends to be replaced by data-driven decisions that extend the reach of medical action in efficacy or in speed.
In this context, data integration assumes primary importance for biomedical sciences. We can describe at least two different scenarios for integration, horizontal and vertical, which do not necessarily occur in strict separation.
Horizontal integration (across many data from independent sources) applies more commonly to basic research and academia. Public databases and repositories of published datasets are a gold mine for research and can be used to look for correlations, confirm hypotheses or validate one's own results. Integration processes require computational skills that are often the prerogative of computational biologists, bioinformaticians or computer scientists, and usually take up a significant share of research time.
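As a purely illustrative sketch (not drawn from any specific study; identifiers and values below are made up), horizontal integration can be as simple as joining two independently published tables on a shared identifier and asking a first question of the combined data, such as whether two measurements correlate:

```python
# Toy sketch of horizontal integration: two tables that could come from
# independent public repositories are joined on a shared gene identifier.
# All identifiers and values are invented for illustration.
import pandas as pd

# Dataset A: e.g., transcript expression values from one published study.
expr = pd.DataFrame({
    "gene_id": ["ENSG0001", "ENSG0002", "ENSG0003", "ENSG0004"],
    "expression": [12.1, 3.4, 8.7, 0.5],
})

# Dataset B: e.g., protein abundances from an independent study.
prot = pd.DataFrame({
    "gene_id": ["ENSG0002", "ENSG0003", "ENSG0004", "ENSG0005"],
    "abundance": [2.9, 7.5, 1.1, 4.2],
})

# Horizontal integration: align the two sources on the common identifier,
# keeping only genes measured in both studies.
merged = expr.merge(prot, on="gene_id", how="inner")

# A first, simple question to ask of the integrated table:
# do the two measurements correlate across the shared genes?
rho = merged["expression"].corr(merged["abundance"], method="spearman")
print(f"Shared genes: {len(merged)}; Spearman correlation: {rho:.2f}")
```

In practice the hard part is rarely the join itself but the harmonization of identifiers, units and metadata across sources, which is where most of the research time mentioned above is spent.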
Vertical integration (across data produced by a single organization) is particularly important for pharmaceutical companies, which often produce and handle different kinds of data but find it difficult, or even impossible, to work with data outside personal or departmental silos. The difficulties are not only technical; they can also originate from organizational and decisional issues.
Finally, data collections need to be curated and quality-controlled, and then published and shared in a way that allows them to be easily reused and reproduced. The publication of the FAIR principles emphasizes four key aspects that should be a priority for data practices across the scientific community: published data should be Findable, Accessible, Interoperable and Reusable.
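As a hypothetical sketch of what this means in practice (field names and values are placeholders, not a formal metadata standard), a dataset description in the spirit of FAIR could record at least the following:

```python
# Hypothetical, minimal metadata record illustrating the intent of the FAIR
# principles; every field name and value is a placeholder for illustration.
dataset_metadata = {
    # Findable: a persistent, globally unique identifier and a rich description.
    "identifier": "doi:10.xxxx/example-dataset",          # placeholder DOI
    "title": "RNA-seq profiles of an example cohort",
    "keywords": ["transcriptomics", "human", "liver"],
    # Accessible: a clear retrieval protocol and explicit access conditions.
    "access_url": "https://repository.example.org/datasets/12345",
    "access_protocol": "HTTPS",
    "license": "CC-BY-4.0",
    # Interoperable: open formats and shared vocabularies or ontologies.
    "file_format": "text/tab-separated-values",
    "gene_id_namespace": "Ensembl",
    # Reusable: provenance and enough context to reproduce or repurpose.
    "provenance": "Processed with a versioned pipeline; raw data archived alongside.",
    "contact": "data-steward@example.org",
}
```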
In the autumn of 2017 a course at the University of Washington in Seattle, taught by biologist Carl Bergstrom and information scientist Jevin West, filled to capacity within minutes of the syllabus going public: the running title of the course quite irreverently addressed the wrong way to approach information in the era of big data. While the course was designed to teach the ability to recognize falsehood and weakness in many different domains, there were specific lectures on big data and scientific publication bias. One of the messages is that good science should beware of the so-called "big data hubris," the often implicit assumption that big data are a substitute for traditional data collection and analysis. The textbook example is the Google Flu Trends project, which claimed to anticipate seasonal flu outbreaks just by tracking internet users' searches for flu-related terms; this turned out to be a much less reliable predictor of outbreaks than a simple model of local temperatures. The scientific community sometimes slips into the problem of overfitting, a common concern in data analysis: when too many parameters are used to match a particular set of data, the model follows the training data (the dataset used to infer the model) too closely, and there is a risk of inferring models that are largely artificial. A common, yet too often disregarded, clue and warning of possible overfitting is the occurrence of odd correlations, even though it is widely known and accepted that correlation does not imply causation.
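A compact way to see the overfitting risk on synthetic data (unrelated to the flu example; the numbers are generated purely for illustration) is to compare a modest polynomial model with one flexible enough to interpolate the training points exactly:

```python
# Toy illustration of overfitting: a model with many parameters matches the
# training points almost perfectly but is expected to generalize worse than
# a simpler one. The data are synthetic (noisy sine wave), chosen only to
# make the effect visible.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)  # noisy signal

# Hold out every other point to mimic data the model has never seen.
train, test = np.arange(0, 20, 2), np.arange(1, 20, 2)

for degree in (3, 9):  # a modest model vs. one that can interpolate all 10 training points
    coeffs = np.polyfit(x[train], y[train], degree)
    mse_train = np.mean((np.polyval(coeffs, x[train]) - y[train]) ** 2)
    mse_test = np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2)
    print(f"degree {degree}: training MSE {mse_train:.3f}, held-out MSE {mse_test:.3f}")
```

The over-parameterized fit reproduces the training points almost exactly, yet its error on the held-out points is expected to be worse: complexity grows faster than genuine signal, which is precisely the pattern behind big data hubris and spurious correlations.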
In the quantifying era we live in, the dream of many analysts is to reduce every signal to a common metric, which would make signals much easier to integrate and compare. The reduction of physiology to quantitative signals and the ability to measure biological quantities have in some sense allowed a first digitization of the human being: in this way we can ask, and hope to be able to infer, much more about health and disease. But as J. Derrida stated in his 1967 De la grammatologie, "il n'y a pas de hors-texte" (there is no outside-text). In other words, everything we receive is interpreted and, no matter the effort spent peeling off the layers of interpretation, we cannot be connected to an un-interpreted reality. We need models to interpret reality and models to fit reality into something that allows forecasting. Weather, health, biology, behavior and financial markets are all predicted with models relying on data points; the more the points, the better the prediction.
In conclusion, if we want to make biomedical sciences a productive big data science and precision medicine a reality, we certainly need to address the challenges posed by the technicalities of computational methods and by infrastructure scalability, but we will also need to enable real and productive data integration by focusing on data governance, data-sharing policies, curation and standardization.