numbers of compounds. But simply making new
chemicals robotically and looking at data apparently
doesn’t lead to new drugs.6 Rather, effective drug discovery comes from having a theoretical framework
for structure-activity patterns, or models of drug absorption and metabolism that enable the selection of
targets and drug candidates, and inform the design
of new experiments.
Although fields such as drug discovery and genomics build upon explicit models, sometimes the
models can be implicit. Even the Google Flu Trends
service, which attempted to predict the spread of influenza based purely on a big data analysis of internet
searches rather than traditional epidemiological
studies, relied on an implicit model of how the
disease spreads based on location and physical proximity.7 Even when a new technique appears to rely solely on data for its conclusions, there is usually a theoretical underpinning that we might not recognize or that we simply take for granted.
One area we examined in which large datasets and
machine learning appear to have made tremendous
advances over model-based approaches is speech recognition. Traditional speech recognition systems use
probabilistic models for things like determining how
a speaker’s voice varies or measuring acoustic characteristics of the speaking environment.8 A computer
program optimizes parameters by having the speaker
read a sequence of words to “train” the system, and
algorithms then match the incoming sounds to words.
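To make that model-based pipeline concrete, here is a minimal sketch in which each word is represented by a single Gaussian over acoustic feature vectors: the speaker's training readings set each word's parameters, and an incoming sound is matched to the most likely word. The class name, the toy features, and the single-Gaussian simplification are illustrative assumptions on our part; real recognizers use far richer probabilistic models such as hidden Markov models.

```python
# Toy sketch of a model-based recognizer: each word is modeled by a
# Gaussian over acoustic feature vectors (real systems use HMMs with
# many states; this is a deliberate simplification).
import numpy as np

class GaussianWordModel:
    def __init__(self):
        self.params = {}  # word -> (mean vector, variance vector)

    def train(self, labeled_features):
        """labeled_features: (word, feature_vector) pairs collected
        while the speaker reads a known sequence of words."""
        by_word = {}
        for word, feats in labeled_features:
            by_word.setdefault(word, []).append(feats)
        for word, vecs in by_word.items():
            vecs = np.array(vecs)
            self.params[word] = (vecs.mean(axis=0), vecs.var(axis=0) + 1e-6)

    def recognize(self, feats):
        """Return the word whose Gaussian assigns the incoming sound
        the highest log-likelihood."""
        feats = np.asarray(feats)
        def log_lik(word):
            mean, var = self.params[word]
            return -0.5 * np.sum(np.log(2 * np.pi * var) + (feats - mean) ** 2 / var)
        return max(self.params, key=log_lik)

# Hypothetical usage with made-up three-dimensional "acoustic features."
model = GaussianWordModel()
model.train([("yes", [1.0, 0.2, 0.1]), ("yes", [0.9, 0.3, 0.0]),
             ("no",  [0.1, 1.1, 0.9]), ("no",  [0.2, 0.9, 1.0])])
print(model.recognize([0.95, 0.25, 0.05]))  # expected: "yes"
```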
Google’s data-driven approach leverages the recordings of millions of users talking to Google’s
voice search or Android’s voice input service, which
are then fed into machine learning systems. Does
this mean that there is no longer a need for models?
Quite the contrary: these machine learning systems invariably incorporate a model that encodes knowledge about the structure of the data used to train the system. Often researchers use what are known as generative models: The first layer of the machine learning algorithm trains the next layer, applying what it knows.9 So progress comes from more than
just having and using the data.
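One concrete reading of "the first layer trains the next" is greedy layer-wise pretraining. The toy sketch below stacks linear autoencoder layers, each fit on the codes produced by the layer below it (solved exactly with an SVD, since an optimal linear autoencoder spans the top principal components). The layer sizes and random stand-in data are hypothetical, and real systems use nonlinear layers and a final fine-tuning pass; the point is only that the training procedure itself embodies assumptions about the structure of the data.

```python
# Toy sketch of greedy layer-wise pretraining: each layer is a linear
# autoencoder fit on the previous layer's output, so earlier layers
# effectively "train" later ones.
import numpy as np

def fit_layer(X, hidden_dim):
    """Return encoder weights for one layer and the codes it produces."""
    Xc = X - X.mean(axis=0)
    # Top right-singular vectors give the best rank-k linear encoder.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:hidden_dim].T            # (input_dim, hidden_dim)
    return W, Xc @ W

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 32))    # stand-in for raw acoustic features
layers, codes = [], data
for hidden_dim in (16, 8, 4):        # made-up layer sizes
    W, codes = fit_layer(codes, hidden_dim)
    layers.append(W)                 # this layer's codes train the next layer

print([W.shape for W in layers])     # [(32, 16), (16, 8), (8, 4)]
```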
Having a priori hypotheses helps researchers spot
and exploit natural experiments. In a modern factory that collects large-scale data, it’s possible to
exploit natural variation in process conditions without actually comparing different approaches. The
director of manufacturing technology at a large
pharmaceutical company pointed out to us that it is
often impractical to do experiments on actual production batches of drugs: The manufacturers will
not allow it. But by using established theories and
models, managers can design data collection strategies that exploit the natural variation in the data
being acquired — without having to explicitly design and run separate experiments.
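As a sketch of what such a data collection strategy might look like, the example below fits an ordinary least-squares regression to hypothetical batch records, treating the naturally occurring spread in process conditions as the source of variation rather than a designed experiment. The variable names, units, and data are invented for illustration; deciding which conditions to record, and which variation can be trusted, is exactly where the established theories and models come in.

```python
# Illustrative sketch: estimating the effect of process conditions on
# batch yield from observational records, without running a designed
# experiment. All column names and numbers are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
n_batches = 200

# Naturally varying conditions recorded for each production batch.
temperature = rng.normal(70, 2, n_batches)      # degrees C
mix_time    = rng.normal(30, 5, n_batches)      # minutes
yield_pct   = (0.4 * temperature - 0.1 * mix_time
               + rng.normal(0, 1, n_batches))   # unknown "true" process

# Ordinary least squares on the observational data: the fitted
# coefficients estimate each condition's association with yield.
X = np.column_stack([np.ones(n_batches), temperature, mix_time])
coef, *_ = np.linalg.lstsq(X, yield_pct, rcond=None)
print(dict(zip(["intercept", "temperature", "mix_time"], coef.round(2))))
```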
Opportunities to Improve Models
Cautions aside, we believe that combining data-driven research with traditional approaches provides
managers with opportunities to refine or develop
new and more powerful models. Unexpected correlations that arise from mining large datasets can strengthen
existing models or even establish new ones.
Strengthening Existing Models
Weather modeling
and forecasting offer a good example. Modern meteorology is based on dividing the Earth’s atmosphere into
a three-dimensional grid of interconnected cells and
then mathematically modeling how the conditions in
each cell evolve over time. Each meteorological prediction proceeds in two stages. First, during
the data assimilation stage, data collected from radar
and satellites are used as the initial inputs into a physics
model that can be calculated within each cell. For
points in the atmosphere that can’t be measured directly, the models infer temperature, humidity, wind
speed, and other factors from light wavelengths, infrared readings, and thermal radiances acquired by satellites.
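As a toy illustration of the two stages, the sketch below assimilates a handful of "observations" into a one-dimensional grid of cells, filling the unmeasured cells by interpolation, and then advances every cell with a simple diffusion update standing in for the physics model. The grid size, values, and physics are placeholder assumptions, not an actual forecasting system.

```python
# Toy sketch of the two stages on a 1-D grid of cells:
# 1) data assimilation: infer cells without direct measurements, and
# 2) forecast: advance every cell with a simple physics update
#    (diffusion here stands in for the real atmospheric model).
import numpy as np

n_cells = 20
observed_idx  = np.array([0, 5, 9, 14, 19])          # cells with radar/satellite data
observed_temp = np.array([15.0, 17.5, 21.0, 18.0, 16.0])

# Stage 1: data assimilation, here a simple interpolation to fill gaps.
grid = np.interp(np.arange(n_cells), observed_idx, observed_temp)

# Stage 2: forecast, stepping the "physics" forward in time.
def step(field, diffusivity=0.2):
    """One explicit diffusion step: each cell relaxes toward its neighbors."""
    padded = np.pad(field, 1, mode="edge")
    return field + diffusivity * (padded[:-2] - 2 * field + padded[2:])

forecast = grid.copy()
for _ in range(10):
    forecast = step(forecast)

print(grid.round(1))      # assimilated initial state
print(forecast.round(1))  # state after ten model steps
```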