Introduction
Big Data in Health and Biotechnology
Big data analytics is potentially transformative. Some companies (such as Amazon, Facebook, or Uber) have built their success entirely on big data and its analysis; others, particularly in the telecommunications and financial sectors, have drastically changed their competitive strategies using big data. Industries and institutions of any kind can expect at least one of the following outcomes from the collection, creation, and analysis of big data: improved effectiveness and performance, and thus increased revenue; significantly reduced process costs; and reduced exposure to costly risks such as non-compliance or failures in production or service delivery. The pharmaceutical and biomedical communities are not immune to this process and are facing a data-driven transformation that needs to be actively addressed. Data scientists usually describe "big data" as data having four main characteristics, famously known as the "four Vs": volume, velocity, veracity, and variety.
Volume refers to data at scale, obtained from many sources with a high number of data points. In the biomedical sciences, next-generation sequencing technologies are driving a constant increase in data production: samples, tissues, and patients are sequenced and re-sequenced both in bulk (extracting nucleic acids from whole tissues or cell cultures) and at the single-cell level. Single-cell sequencing is expected to further inflate the amount of data produced since, even though each cell is sequenced at a lower depth, thousands of cells are analyzed for each tissue or patient.
Big data domains are those that store data on the order of petabytes to exabytes. One exabyte equals 1 billion gigabytes, the gigabyte being the scale at which current portable storage cards are measured (our smartphones work with memories of 16 gigabytes on average). Stored volumes are actually much smaller than the volumes produced by the acquisition processes, which globally add up to the order of zettabytes (the actual footprint), because intermediate data are heavily pruned and filtered by quality control and data reduction steps. According to the recorded historical growth rate, DNA sequencing (in number of genomes) is growing almost twice as fast as predicted by Moore's Law: it has doubled every 7 months since the first Illumina genome sequences in 2008.
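To give a feel for what these doubling times imply, the short sketch below compares sequencing-style growth (doubling every 7 months, as stated above) with Moore's-Law-style growth; the 18-month doubling period used for Moore's Law and the 2025 horizon are illustrative assumptions, so this is a rough back-of-the-envelope calculation rather than a figure from the text.

```python
# Minimal sketch: compare sequencing-style growth (doubling every ~7 months)
# with Moore's-Law-style growth (doubling assumed here at ~18 months).
# Starting from an arbitrary baseline of 1 unit in 2008, for illustration only.

def growth_factor(months_elapsed: float, doubling_period_months: float) -> float:
    """Return the multiplicative growth factor after a given number of months."""
    return 2 ** (months_elapsed / doubling_period_months)

months = (2025 - 2008) * 12  # months between 2008 and 2025

sequencing = growth_factor(months, 7)    # genomes: doubling every ~7 months
moore = growth_factor(months, 18)        # hardware: doubling every ~18 months (assumption)

print(f"Sequencing growth factor since 2008: {sequencing:.2e}")
print(f"Moore's Law growth factor since 2008: {moore:.2e}")
print(f"Ratio (sequencing / Moore): {sequencing / moore:.2e}")
```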
These numbers make genomics comparable to other big data domains, such as astronomy, physics, and social media (particularly Twitter and YouTube). Research institutions and consortia are sequencing genomes at an unprecedented pace, collecting genomic data for thousands of individuals, as in the Genomics England project or the Saudi Human Genome Program. The UK project alone will sequence 100,000 human genomes, producing more than 20 petabytes of data, but many other sequencing initiatives on species other than humans will have a huge impact too. In plant research, pushed by agricultural applications, thousands to millions of varieties of crops such as rice are also being sequenced. Moreover, given the central role that genome or exome sequencing plays in personalized medicine, it is reasonable to believe that a significant portion of the world's population will be sequenced, possibly dwarfing even current estimates of a five-orders-of-magnitude growth by 2025 and thus largely exceeding the growth of the other big data domains cited above.
Velocity, the second "V," refers to the infrastructure speed needed to efficiently transfer big files. Extreme data volumes require extreme remedies: for data at the exabyte scale, the traditional four-wheeled method (truck delivery) is still the fastest and most secure option (https://aws.amazon.com/snowmobile/). In fact, uploading 100 petabytes over the internet would take about 30 years with today's high-speed connections. Volume and velocity are not yet a big problem for biomedical data: computing power and storage have dropped in cost, and high-volume data are easily produced and stored within the walls of the same institution. Data are not constantly transferred back and forth over the internet (transfers are usually one-directional, i.e., from service provider to researcher), and larger volumes are usually shipped on physical drives.
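The 30-year figure is easy to reproduce with simple arithmetic. The sketch below estimates the transfer time for 100 petabytes over a network link; the 1 Gbit/s and 10 Gbit/s link speeds are assumed values for a fast institutional connection, not figures taken from the text.

```python
# Minimal sketch: estimate transfer time for 100 petabytes over a network link.
# Link speeds are illustrative assumptions; real-world throughput is usually
# lower once protocol overhead and contention are accounted for.

PETABYTE_BITS = 1e15 * 8          # bits in one petabyte (decimal units)
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def transfer_years(petabytes: float, link_gbit_per_s: float) -> float:
    """Return the transfer time in years at the given link speed."""
    total_bits = petabytes * PETABYTE_BITS
    seconds = total_bits / (link_gbit_per_s * 1e9)
    return seconds / SECONDS_PER_YEAR

print(f"100 PB at 1 Gbit/s:  {transfer_years(100, 1.0):.1f} years")   # ~25 years
print(f"100 PB at 10 Gbit/s: {transfer_years(100, 10.0):.1f} years")  # ~2.5 years
```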
The other two "Vs" are more critical: veracity and variety. Veracity refers to data uncertainty. Biases are intrinsic to genomic sequencing data and arise naturally from error rates, experimental batch effects, and the different statistical models applied. Variety is probably the most impactful characteristic of biomedical data. Data from this domain come in many different forms and are highly heterogeneous; in this respect, big data also means different signals and detection systems from the same source. The heterogeneity of biological data is therefore certainly a challenge, but it is also what makes data integration necessary and interesting, along with the technical possibility of using data to refine data and the consequent possibility of discovering emergent properties and unpredictable results.
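As a concrete, minimal illustration of what such integration can look like in practice, the sketch below joins a hypothetical gene-expression table with equally hypothetical clinical metadata using pandas; all sample identifiers, column names, and values are invented for illustration and do not refer to any resource mentioned in the text.

```python
# Minimal sketch of data integration: joining heterogeneous tables on a shared
# sample identifier. All data here are invented for illustration.
import pandas as pd

# Hypothetical gene-expression measurements (one row per sample).
expression = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3"],
    "GENE_A": [12.1, 8.4, 15.0],
    "GENE_B": [3.3, 4.1, 2.9],
})

# Hypothetical clinical metadata coming from a different source.
clinical = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3"],
    "age": [54, 61, 47],
    "treatment": ["drug", "placebo", "drug"],
})

# Integrate the two sources into a single tidy table keyed by sample_id.
integrated = expression.merge(clinical, on="sample_id", how="inner")
print(integrated)
```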
On the wave of the growing push for reproducible research, awareness of the importance of collecting and organizing data so that they can be quickly stored and easily re-processed is spreading through the biological research domain. Data integration is becoming an independent and horizontal discipline and is generating a wealth of diverse projects and resources; here we review some of them (Table 1).
Table 1. Projects and initiatives.
How Data are Shaping Life Science and Health
Biomedical sciences are constantly evolving, pushed by technological advances; life, health, and disease are investigated in an increasingly quantitative way. Most laboratory equipment produces larger volumes of data than it did in the past, and the data points available in a common lab pile up to quantities that are no longer amenable to traditional processing such as electronic spreadsheets. Researchers often have to deal with many different kinds of data at different stages of research: experimental design, sample collection, data gathering and cleaning, analysis, quantification of experimental results, and finally interpretation. Biologists, technicians, and clinical science professionals interact almost daily with analysts, statisticians, or bioinformaticians, and they need to develop a correct and up-to-date vocabulary and acquire skills that allow them to collaborate efficiently. Without necessarily turning themselves into bioinformaticians, biomedical researchers have to make a cultural shift and embrace a new domain of "infobiology". As a consequence, the many flavors of bioinformatics and computational biology skills are now a must-have in technologically advanced research laboratories and R&D departments: companies and research institutions, as well as single laboratories, should also promote and organize computationally skilled personnel.
Basic and clinical researchers are now well aware of the value of collecting and organizing data in the form of tidy data, i.e., a form that is both easy for humans to read and easy for computers to process. Research as a whole is increasingly computer-assisted, and fairly advanced statistical methods are routinely run on computers, such as machine learning (ML) methods to classify and discriminate samples. ML is the ensemble of methods that give computers the ability to learn without being explicitly programmed, and it is considered a subfield of artificial intelligence. ML can in turn be divided into two subfields that are very useful to biomedical research: supervised learning, used for data classification and regression, and unsupervised learning, which fuels methods for clustering, dimensionality reduction, and recommendation. Such methods are nowadays routinely used in drug discovery and biomarker prioritization.
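The toy sketch below contrasts the two settings on the same data: a supervised classifier trained with known sample labels and an unsupervised clustering that ignores the labels. It uses scikit-learn, and the simulated "expression-like" matrix and labels are assumptions for illustration, not data from the text.

```python
# Minimal sketch contrasting supervised and unsupervised learning on simulated
# expression-like data. Shapes and values are invented for illustration.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Simulate 100 samples x 20 features, with two groups shifted apart.
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 20)),
               rng.normal(1.5, 1.0, size=(50, 20))])
y = np.array([0] * 50 + [1] * 50)  # known labels (e.g., healthy vs. diseased)

# Supervised learning: classify samples using the known labels.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("Supervised test accuracy:", clf.score(X_test, y_test))

# Unsupervised learning: cluster the same samples without using the labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("Cluster sizes:", np.bincount(clusters))
```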
There is no research branch escaping this data-driven transformation. Data are percolating through, and are generated by, both traditionally "wet" experimental disciplines and domains that stemmed from mathematics and computer science. This is clearly observable in the industrial biotechnology sector, where traditionally experimental disciplines such as vaccine development, as well as more computer-driven disciplines such as network-mediated drug discovery, are heavily influenced by the stream of different kinds of data and the possibility of integrating them. Precision medicine, drug design and repurposing, and vaccine development are only three of the many domains of biomedical sciences that have a great deal to gain from a structured, machine-learning-assisted, and integrated use of data, whether small or big.