Defining Big Data
Big Data was the buzz phrase of 2017, but in truth, the concept has been around far longer than that. We know what data is - it is the raw information collected from any study, particularly in science. Data science is the study of this data. Big Data takes the concept one step further: it is a data set of such complexity that it would be impossible to process, examine, manipulate and present using traditional methods. The intended results are often so complex (1) that they are difficult to process even using tried and tested electronic methods. It's important to note that the term does not necessarily denote the size of the data set (although sometimes a large volume of data is unavoidable), merely its complexity. Big Data is characterized using five metrics (2):
- Volume: Big Data gathers data every second. We no longer measure this data in gigabytes, not even in terabytes, but the next stages up (by the petabyte, exabyte, zettabyte and yottabyte). To put these in context, 2 petabytes will hold all data from US academic research libraries and 5 exabytes will store every word ever spoken in every human language that ever existed (3)
- Variety: Present technologies make it possible not just to acquire enormous sets of data in a short space of time, but also to acquire data of many different kinds. For any analysis, variety is as important as volume. Much of the world's stored data is unstructured, but Big Data allows for both structured and unstructured information (2)
- Velocity: To be relevant, Big Data must be able to cope with the speed at which data is generated in order to store it and retain the most up-to-date and relevant information. This is useful in most areas but vital in early warning systems ahead of natural disasters (5) or in finance to detect fraudulent activities (2), for example. Velocity covers speed of creation, the speed of capture, and the speed of processing (4)
- Veracity: Arguably the most important metric in science (but, surprisingly, a recent addition) is the need for verifiable data. Veracity concerns a data set's accuracy and integrity, not just of the data but also of the sources that generate it. If there is no trust in the data source, the data itself is virtually useless (5)
- Value: This is more concerned with the results that users extract from Big Data. If we cannot make sense of the data, then it has no value (2), and the exercise of capturing and storing it will have been a waste of time, money and resources
Big Data is here to stay. It has many uses in business such as marketing and finance, for public policy such as crime and urban planning, and healthcare administration and planning such as disease outbreak management and monitoring. It's been useful in sciences that have traditionally always required large sets of data but lacked the methods to process and use them. In genetics and ecology especially, there has always been a disparity between the amount of data they are able to acquire and store, and the processing methods that could allow them to extract the most use from that data.
A History of Big Data
Early Big Data Problems
If we see any attempt at storing, harnessing and making data available for consumption and use as “Big Data”, then it's arguable that the concept goes back into antiquity with the original Great Library at Alexandria (6). It is believed that the facility stored up to half a million scrolls. We could also argue that the Antikythera Mechanism, often described as the world's first computer and used to predict astronomical events years and even decades in advance (7), technically qualifies as Big Data. It shows that you don't need a lot of data to make sense of information. What we have here in antiquity are the two sides of Big Data, from two seemingly completely different concepts - the volume of storage (Great Library) and calculation based on the quality of evidence (Antikythera Mechanism).
In the modern age (since the Enlightenment), Big Data is and has always been inextricably linked to the young science of computing and the much older science of statistics, first used in Bubonic Plague prediction in Renaissance Europe (6). Even before the dawn of modern computing in the 1940s, researchers had begun to experience the problems of the continual and exponential accumulation of data. These included how and where to store such data, how to catalogue and index it, and how to sort the useful from the irrelevant so that the results extracted remain relevant.
This is the problem that faced the US government following the 1880 census. Officials predicted it would take eight years to process all the data from that census and 10 years to process the 1890 data, by which time the next census would be due. They hired a man called Herman Hollerith, who invented a device known as the Hollerith Tabulating Machine (8). It used punch cards to cut the processing time to a matter of months.
20th Century: The Dawn of True Computing
In the 1940s, a technical term arose that remains in common use today: “information explosion” (9). Researchers had been aware of such problems for centuries (see the previous section), but with rapid population growth since the Enlightenment and better standards of health through evidence-based medicine, it was only a matter of time before the volume of data surged. For a variety of reasons, this explosion seems never to have levelled out. Researchers, institutes, governments, even commerce have always sought more data, to make better use of it, and to store it in such a way as to make it useful. The 1920s saw the arrival of magnetic tape storage, while Nikola Tesla theorized the arrival of wireless technology to help store this information (13).
By 1941, the computer was five years old. Alan Turing is credited with inventing the world's first computational machine in 1936 (10). Within a decade, academics were expressing concern about an expansion of data that mirrored the problems expressed in the late 19th century. Computing was able to process more data, and faster, but the same question remained - could the processing power of the computer age ever keep up with the greater demand placed on it by the applications of Big Data? This would plague the burgeoning science until 1965, when the US government established the world's first ever data center. It was set up to store tax information and criminal records (mostly fingerprint information) on magnetic tapes (11).
Four years earlier, Derek Price commented that the number of academic journals and their published work was increasing exponentially, not linearly, and by 1967 the first theory of ADC (Automatic Data Compression) had been compiled as a method of storing such data in future (12). By the 1980s, others were commenting on the potential usefulness of the continued and exponential acquisition of data sets. Many suggested that the data increase was not simply down to population growth and data generation, but that those holding such information did not know how to discard obsolete data or separate “the wheat from the chaff”. This problem would plague almost every organization and body interested in Big Data right through to the end of the century (13), when the internet emerged alongside newly cheap storage.
Big Data in the 21st Century
The story of modern Big Data begins in the year 2000 with growing interest in how much data people produce (6). A seminal paper titled “How Much Information?” (14), begun in the late 1990s, published in 2000 and updated in 2003, found that the world produced about 1.5bn gigabytes of new information each year. Divided up, that makes roughly 250 MB per person. Just one year later the first three items on the list of Five Vs (that would later form the pillars of Big Data) were defined (6). It was also the year we began to see SaaS (Software as a Service), driving towards the Cloud storage we have today. More exciting developments came in 2005 with the emergence of Web 2.0. This meant content produced by and for users of the internet rather than solely by web service providers. We should not underestimate the importance of both of these trends in pushing towards Big Data collection and processing.
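The per-person figure can be reproduced with a quick calculation. This is only a sketch: it assumes the 1.5bn gigabytes is a worldwide annual total, and it takes a world population of roughly 6 billion around the year 2000, which is an assumption added here rather than a number quoted from the study.

```python
# Figure quoted in the text: about 1.5 billion gigabytes produced per year.
TOTAL_GIGABYTES_PER_YEAR = 1.5e9

# Assumption for illustration: world population of roughly 6 billion circa 2000.
ASSUMED_WORLD_POPULATION = 6.0e9

# Decimal convention: 1 gigabyte = 1000 megabytes.
MEGABYTES_PER_GIGABYTE = 1000


def per_person_megabytes(total_gigabytes, population):
    """Split a yearly total (in gigabytes) evenly across a population, in megabytes."""
    return total_gigabytes * MEGABYTES_PER_GIGABYTE / population


share = per_person_megabytes(TOTAL_GIGABYTES_PER_YEAR, ASSUMED_WORLD_POPULATION)
print(f"Approximate share per person: {share:.0f} MB per year")
```

With a different population estimate the share moves proportionally, so the 250 MB figure is best read as an order-of-magnitude guide.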
With the publication of the 2010 report on “How Much Information?”, it was revealed that in 2008, the internet's servers processed an eye-opening 9.57 zettabytes of data. To put this in context, that amounts to 9.57 trillion gigabytes (6). Commerce, too, seemed to be adapting to the connected world, with companies storing 200 terabytes of data each on average. But much of this would not be relevant to the average person for several years. The election campaign of President Barack Obama in 2008 was notable for many reasons; he is credited with being the first candidate to harness the power of the internet, especially social media, in appealing to voters. The campaign was also the first to raise funds through “crowdsourcing”.
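The zettabyte-to-gigabyte conversion quoted above follows directly from the decimal storage prefixes. The helper below is a minimal sketch; the function and table names are illustrative, and it assumes decimal (power-of-ten) units rather than binary ones.

```python
# Decimal storage prefixes expressed as powers of ten (bytes = 10 ** exponent).
DECIMAL_EXPONENTS = {
    "KB": 3, "MB": 6, "GB": 9, "TB": 12,
    "PB": 15, "EB": 18, "ZB": 21, "YB": 24,
}


def convert(value, from_unit, to_unit):
    """Convert a quantity between two decimal storage units."""
    shift = DECIMAL_EXPONENTS[from_unit] - DECIMAL_EXPONENTS[to_unit]
    return value * 10 ** shift


gigabytes = convert(9.57, "ZB", "GB")
print(f"9.57 ZB is about {gigabytes:.3e} GB")
```

Because a zettabyte is twelve powers of ten above a gigabyte, one zettabyte is a trillion gigabytes, which is where the 9.57 trillion figure comes from.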
In 2012, the reelection campaign took using the internet as a tool one step further. Obama's team sought re-election (and won) by harnessing Big Data and Data Analytics (14). The campaign's success gave validation to the science and in 2012, the administration released details of the Big Data Research and Development Initiative (15). Spread across multiple departments and programs, it seeks to improve government decision making in a wide variety of areas, particularly in science and engineering in partnership with the education sector, and in commerce and industry. It was, perhaps, in response to a prediction in 2011 that the latter part of the decade would see a massive skills shortage for people entering Data Science.
Now, in the age of Big Data, the predicted growth has arrived, along with the capacity to hold, store and use it. Recruiters expect the number of openings in these roles to balloon to several million globally by 2020. It is up to the various government agencies and the private sector to prepare for a new decade where Big Data is the norm rather than the exception.
How Do the Environmental Sciences Use Big Data? Real World Case Studies
It should hardly surprise us that government bodies and university research departments all over the world are already using Big Data to aid research and decision-making. Here is a selection of the applied science of Big Data and success stories.
The EPA and Public Health
One of the biggest areas in the US for unifying big data with environmental science is public and environmental health (16). Already, we've seen improvements in the monitoring and mitigation of toxicological issues from industrial chemicals released into the atmosphere. Monitoring has always used tried and tested methods such as localized environmental sampling, but now that we can process such data through computational methods, the results are more accurate, more up to date and faster to produce, with more analytical information to allow experts to make an informed decision. Big Data allows for high throughput (more resources, a longer period of time), combined data sets (bringing together multiple, otherwise seemingly disparate data sets), meta-analysis (studies that compile existing studies to create a more thorough and hopefully more accurate picture), and deeper analysis of the results produced from these studies.
EPA is presently using such data acquired through Big Data Analytics to synthesize more accurate predictions for areas where data either does not exist or is difficult to acquire. Also, researchers can identify gaps in the data and potential vulnerabilities in the system and process of investigation. Overall, this mitigates the problems and enhances data for better decision making for public health concerns. They are now working with NCDS (National Consortium for Data Science) to identify current challenges that they hope to address through big data science (16).
For Geographic Data
Few tools have proven as useful to so many environmental sciences as the map. Maps have served simple cartography for naval navigation and geographic surveying, and they now underpin Geographic Information Systems or GIS (databases of data sets from which we can produce digestible maps and create visually striking imagery for an intended audience). GIS thrives on Big Data. Much of its strength lies in its ability to consolidate, utilize and present statistical data. The more data you have from a geographic area, the better the quality of the output and the more informed the decision making is likely to be. Big Data's biggest contribution (so far) seems to be in spatial analytics, and that's good news for GIS technicians and for those people charged with making decisions based on the outputs of their data.
One example is in disaster and emergency relief (17). As recently as 2017, a researcher showed in a seminal study that it would be possible in future to parse textual references against GIS databases to pinpoint, up to the minute, areas currently suffering from tsunamis, flooding, and earthquakes. This would not have been possible before due to the sheer intensity of the cross-referencing required. Satellite data and aerial imagery have already informed GIS in disaster management, with Hurricane Katrina being one of the first and best-known examples of the technology in use. In future, Big Data will further enhance its efficacy.
Further, the EPA is using geographic data to inform research into public health through the Environmental Quality Index (16). Big Data is informing a number of areas and bringing them together in the most comprehensive analysis of its kind, examining air, water, land, the built environment and socio-economic data (18). It is expected that this information will inform public health decisions and allow for medical research into health disparities relating to child mortality and poverty.
Climate Change and Planetary Monitoring
In 2013, the UK government announced large-scale investment in Big Data infrastructure for science, particularly in the environmental sector. Of particular note to global research was a commitment to maintaining funding for a program called CEMS (Climate and Environmental Monitoring from Space) (19). This allowed for the creation of larger databases to cope with the upcoming Big Data revolution and to allow research partner organizations to work with more data and produce more results. With a specific focus on climate change and planetary monitoring, CEMS storage removed the need to download enormous data sets while reducing the cost of access (20). It provides the tools as well as the data, allowing for greater efficiency, sharing in the academic community, and providing resources once beyond the reach of many institutes due to budgetary restrictions alone. Along with Cloud data, this is now the standard globally for some of the world's top research institutes.
At the same time, one of the UK's top universities announced plans to open a Big Data center for environmental science research and analysis. It intends to bridge the “data gap” between those who research global environmental problems and those charged with making decisions to remedy such issues (21). That's also at the core of the relationship between the US-based Lighthill Risk Network - an insurance representative organization - and the UK's Institute for Environmental Analytics - a data research organization. Working in partnership to see how big data can be applied to a variety of issues in risk management and natural disasters, particularly in light of the increased frequency of erratic and extreme weather, Lighthill is now committed to developing global databases and making the business case for sharing data (22). Such cross-government partnerships, and partnerships between industry and government, are working, as shown by the previously discussed EPA programs and the EU-wide Copernicus Climate Change Service, which recently went live.
Finally, there are immense implications for the uses of Big Data in climate modeling. As early as 2010, NASA was utilizing Big Data capture and storage to build its most accurate climate projection models yet (30). It is estimated the agency stores as much as 32 petabytes of information for modeling purposes. Models thrive on enormous data sets, complex data and accumulated metadata. As far as the sciences are concerned, climate modeling could be the single most important area of academia for Big Data applications.
With an ever-growing global population putting more pressure on resources, agritech is going to have to invest in some important developments. It's projected that barring major disaster, the global population in 2060 will be 9 billion (23, p401) with the highest growth in poorer countries. This means a lot of investment in agricultural systems to cope. One of these is GM technology, expected to help the world's poorest communities grow resilient crops for sustainable food supply and economies. However, GM alone is not going to solve this problem.
Essential resource management plans will need to be put into place to ensure we are making the most of agricultural land and effectively using ground nutrients, limiting deforestation, properly managing water resources and developing new methods of farming that could use even less space than before. In the US, some notable agricultural organizations are already using crowdsourced data in conjunction with remote sensing and publicly available data such as weather forecast information (23, p402). This allows the creation of Big Data sets so domestic farmers can improve land use efficiency, maximizing productivity and revenue. Here, Big Data is used in environmental engineering to inform farmers which crops they should plant this year and even when their machinery is likely to break down. In the first case, the information may be used for crop management (to cope with predicted extreme weather); in the second, to order parts ahead of time so that work is not lost.
This is expected to be even more important in the developing world for people who live in so-called “marginal landscapes” (29). This is where agricultural production is low due to erratic water supply, low precipitation, or particularly acidic or alkaline soils. Some people have had to use such landscapes through little choice; they may be bad choices, but they are still the best available to them. The use of Big Data here is two-fold: first, providing mitigation and management tools for marginal landscapes already in use; second, identifying the best uses for marginal landscapes not already turned over to agriculture (24).
Although genetics is not technically an environmental science, it has many uses beneficial to the environment, from GM technology to gene mapping and the examination of the spread and transmission of infectious diseases in vital food crops such as Panama Disease in bananas (25). It's useful in a wide range of biological sciences. We expect many advances in genetics to come thanks to the advent of Big Data. When the human genome was decoded in the early part of the last decade, the process took over 10 years. Now, with Big Data analytics, the OECD estimates that the exact same process, if carried out for the first time today, would take just 24 hours (26). Faster research into genetic structures means faster identification of and reaction to problematic genes, and faster implementation of mitigation measures.
One of the unexpected benefits of Big Data to any science, but particularly environmental science, is so-called “Citizen Science”. This is the accumulation of data from people in geographic locations all over the world who voluntarily offer information on conditions where they live. It is often beyond the financial and time resources of researchers to investigate all claims directly, so they rely on local people to report such information. This is not new, but the term “citizen science” and the overt public engagement are. Indeed, there are many examples of successful citizen science projects, such as the Christmas Bird Census of 1900 (27), which came long before global communication, cloud storage and mobile technology - arguably the three technologies that have enabled public engagement like no other.
When many people report phenomena, it reduces the possibility of hoax, misinterpretation and fake reporting. While anecdotal evidence is not useful in some areas and, indeed, counterproductive in others, science organizations all over the globe are inviting input from interested amateurs and stoking interest in environmental science. The Christmas Bird Census may have been born out of collective horror at the mass slaughter of native North American birds, but it later raised consciousness of the potential ecological problems of such a “tradition” and of how citizens themselves could help with conservation if engaged in the right way. Even widespread voluntary human drug trials for new pharmaceuticals can be considered “citizen science”, with volunteers from a wide range of lifestyles engaging in experiments and reporting side effects and effects on medical conditions back to researchers (28).
Anthropology & Archaeology
The study of people in the past (and their material remains) may not be the first outlet you might consider for Big Data application, mostly because these disciplines tend to study small groups of individuals on specific sites. However, compiling such data can benefit studies over large areas to determine the spread of technology, cultural evolution, and even track the spread of ancient farming practices such as slash and burn. Accumulated digital data is not new to these two areas. Statistics are and have always been a useful tool in such methods as aerial survey and remote sensing, both of which are profoundly useful to relatively new technologies such as GIS (Geographic Information Systems) (31). In 2017, it was suggested that Big Data could be used to plow through old excavation reports to “data mine” in the hope of extracting new information.
Archaeologists and anthropologists often deal with complex data, comparing site analyses and trying to marry up otherwise seemingly disparate data sets. In theory, Big Data could make large-scale investigations into the affairs of humans in the past much faster, broader and more complex. This should bring improved visualizations, greater computing power and more informed and useful results in cultural studies (32). This can be just as useful in studying modern populations as for societies in the past.
In Environmental Conservation
It was reported in 2014 that Big Data was not yet part of the world of sustainability and environmental conservation (33). Although some applications have proven useful in climate science and climate modelling, Big Data has so far seen little use in land conservation, sustainability and local environmental mitigation. The seminal report did go on to acknowledge a number of essential areas that could (in theory) benefit from the application of Big Data and Big Data Analytics in the future (33, p7). These included:
- Environmental NGOs may use data as evidence for lobbying governments to instigate laws or other measures to protect individual landscapes. As these groups are often at the forefront of advocacy because they are at the forefront of application, they produce the data and could use it in support of their findings
- Third-party specialists and consultants who can accumulate data and provide such information in reports for clients, similar work to the NGOs noted in the first point
- Corporate entities may employ Big Data in two forms: firstly as evidence that they are complying with government regulation pertaining to their industry and sector; secondly to launch investigations into issues to determine the cause of an environmental problem
- International organizations who work in environmental policy research to make recommendations to other international decision-making organizations and lobby groups
- Government bodies in determining policy and bills on environmental regulation and sustainability. At present, the US is working with the Dutch government in ensuring open data policy for Big Data analytics in this area
In Regional Planning
Urban landscapes are often overlooked when discussing environmental sciences. But urban centers are environments too, sometimes with their own ecosystems. They are a curious ecology: they impact the environment and are impacted by it, providing life and work for residents and becoming self-contained ecological islands. Yet studies in urbanism represent some of the best and earliest examples of the application of Big Data. In 2014, a report on China's use of applied statistics and Big Data to examine urban systems and urban-rural planning highlighted the project (begun in the year 2000) as a major success (34). 2014 was also the year China engaged in rapid expansion of the practice. It requires a unification of data between information technologists, geographers, logistics specialists and urban planners.
Big Data can be applied to examine problem areas for traffic (and aid decision making on where to place new roads), crime centers (and where to focus law enforcement resources) and health problems (and to attempt to understand why certain areas experience certain health problems - pollution, poverty, poor access to resources etc). Standard data sets are insufficient and lack depth, and urban planning requires information from disparate sources - demographics, geographic information, resources, employment figures, pollution, health and many more - to understand the complex parts that go into making an urban center function.
Big Data should improve the process of urban planning and resource allocation. In fact, it's already doing so. More recently, studies have shown the usefulness of large data sets in “smart urban planning” (35), and the likely usefulness of the approach in future. It's expected to be both a time saver and a money saver.
What are the Advantages of Using Big Data in Environmental Sciences?
As you can see from the above section, government departments and university research facilities across the world are already using or preparing for big data. Few have made as many strides as the US EPA (Environmental Protection Agency) (16). The advantages are numerous.
Collecting, Sorting, Analyzing, Presenting Quickly
As hinted in the scenarios presented above, Big Data's major advantage is in the capacity to collect masses of data and analyze it quickly; it's a realistic cost and resource saving tool in areas often drastically underfunded and having to cut costs. The storage capacity now exists to collect and collate; the computing power is also affordable to process and manipulate in any way necessary. This is most obvious in climatology, even if the community has been relatively slow to adopt it (36). These two processes alone make Big Data vital for environmental science presentation and accuracy.
How to handle errors in data, reporting, rogue data and anomalous results has been one of the biggest problems facing any science. When sample sizes are too small, anomalous data can be given more importance than it deserves. But studies are often limited by sample size alone due to resource factors. The larger a data set, the more likely a rogue piece of information will fall in significance and not damage the overall result (37). Coupled with the cost and resource saving, environmental studies can, in theory, become larger and more thorough, producing more accurate results.
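The effect of sample size on a rogue value can be illustrated with a toy calculation. The numbers below are invented for illustration and are not taken from the cited study: every reading is 10.0 except for one rogue reading of 1000.0, and the sketch measures how far that single value drags the sample mean as the sample grows.

```python
from statistics import fmean


def mean_shift_from_rogue_value(sample_size, typical=10.0, rogue=1000.0):
    """Return how far one rogue reading moves the mean of an otherwise uniform sample."""
    # Hypothetical sample: (sample_size - 1) typical readings plus one rogue reading.
    values = [typical] * (sample_size - 1) + [rogue]
    return fmean(values) - typical


for n in (10, 100, 1000, 10000):
    shift = mean_shift_from_rogue_value(n)
    print(f"sample size {n:>6}: mean shifts by {shift:.3f}")
```

In this simple setup the shift equals (rogue - typical) / sample size, so each tenfold increase in the sample cuts the distortion by a factor of ten. Real data sets are noisier, and a mean is only one summary statistic, but the same dilution underlies the argument above.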
Better Environmental Management
This applies to urban management as our cities continue to undergo rapid and vast changes in line with changing technology and the demands of residents. In one study, the Norwegian capital of Oslo was able to reduce its energy consumption through the application of Big Data Analytics when examining its energy resources (38). Similarly, in Denver, predictive reporting and risk analysis at the city's Police Departments reduced serious crime by around 30%. Portland, Oregon used a similar system to analyze stoplight changes at intersections in order to manage traffic flow better; a stoplight pattern that was once sufficient can stop being so as a city grows. After just six years, the city had eliminated 157,000 metric tonnes of CO2 emissions.
Better Decision Making
By sheer weight of numbers, Big Data and the analytical tools used in its processing are able to process and analyze more past data than ever before. Previously, this too was limited by resources, but with increased access and availability, Big Data is expected to permit easier presentation and reporting, delivering more confident results and therefore better support for decision makers and policy development professionals. Scientists and government can work together more efficiently in future, not just to react to the environmental problems of today, but to work with greater foresight today and make better decisions for tomorrow.
What are the Current Challenges for Big Data in Environmental Sciences?
Big Data is not expected to be a panacea for all the world's environmental problems or for research or applied science in general. Nor is it designed to be a one-size-fits-all answer. Like any other emerging technology, there are problems and limitations to keep in mind when extolling the virtues of Big Data.
Due to the complexity of so-called Big Data, the method presents a number of other challenges to those who seek to acquire and use it. For instance, the frameworks for each of the following may not always have the capacity required:
- Methods for capturing data
- The capacity for storing it
- Analyzing the data when captured
- Searching, sharing and transferring during the utilization process
- Visualization and querying of data
- Updating the information in line with recent changes
- Data security, privacy issues and the sources of storage
This is especially true where the volume of data quickly outstrips the capacity of present computing technology to perform any of the above functions. Big Data's growth is, and so far has been, exponential. To keep up, hardware in all of the areas above will need to match, if not exceed, the necessary capacities. We must also not underestimate the problems of human error - wrongly entered data, poor processing due to mistakes, and misinterpretation of that data. The information may not lie, but humans can and do make mistakes.
In all this, it's important to remember that some sciences concern data pertaining to humans. Issues will include problems such as cultural sensitivities as in archaeology and anthropology (32). Some critics are concerned that in reducing populations to Big Data information, we reduce their humanity, their individuality. However, with the improvements in disaster response time, applications in climate science, and in the enormous data processes when examining archaeological/anthropological information, it's likely that these human sciences and humanities concerned with the environment will benefit in the long-term.
Also, we must be aware of the legal ramifications of data storage. Here in the US, HIPAA protects a patient's rights to their medical history. The European Union is introducing a set of regulations called GDPR (General Data Protection Regulation) in May 2018. This will affect the USA, especially researchers, scientific institutes and anybody handling Big Data from entities operating in the European Union, or information relating to any citizen living within a member state of the EU and EEA (39). It is understood that the US government is watching closely to see how GDPR functions and how it might adopt such a law in future. It's likely that personal information will receive protection, including deletion at the owner's request, and this will certainly have ramifications for any Big Data stored about people.
Lack of Widespread “Open Access”
Research institutes and businesses are often incredibly protective of their research data, especially where mass profitability is involved. Yet there has been a move in recent decades to call for subscription-free public access to scientific data, known as Open Access. In some disciplines, so few strides have been made in this area that Big Data Analytics is not presently reaching its full potential. Although studies can call on more data and do more with it, a large amount of data that could prove useful in environmental science is still held privately, with limited or no access. Although fear of handing over information to competitors is part of the issue, other problems include a lack of resources to share data or a lack of awareness of how useful Open Access can be (32). Together, Open Access and Big Data have the capacity to be a powerful force in research science, but the latter is being held back by a lack of the former.