Agility and expansion have become the standard rallying cry for technology and data evangelists warning businesses and consumers of impending change or emerging technology. A few decades ago, the common example of technology change was the VHS vs. Betamax format war, which saw the technically inferior VHS format crowned champion while Betamax was relegated to the appendices of technology textbooks.
Today these changes arrive faster and on many more fronts, and this increased pace may be due to how data moves between machines. The "next industrial revolution" will take place in the communication and computing capabilities of machines talking to other machines. Machines are no longer just large mainframes or server farms used to complete business functions: almost any device can act as a computer that creates and sends data to another device for industrial, commercial, government, and consumer applications.
The Internet of Things Is Driving Big Data
Devices connected over a network or the internet can stream data continuously; an electric meter that completes a reading every minute produces 1,440 data points a day. Multiply that by the number of data collection and computing devices entering service and it becomes apparent that Big Data is a response to these volumes. Google Trends shows that interest in "Big Data" has increased tenfold since 2011. As a point of reference, Wayne Balta, IBM Vice President of Environmental Affairs, once declared that "90% of the data in the world today has been created in the last two years." This overwhelming growth leaves data strategists and CIOs looking for new ways to analyze and store quantities of data that exceed the practical limits of traditional relational database management systems, turning instead to scalable solutions that can be expanded on demand.
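The meter arithmetic above scales quickly once a fleet of devices is involved. The sketch below works through that back-of-envelope math; the fleet size and record size are illustrative assumptions, not figures from the article.

```python
# Back-of-envelope arithmetic for streaming meter data. The fleet size
# (one million meters) and record size (100 bytes) are assumed values
# chosen purely for illustration.

READINGS_PER_DAY = 24 * 60          # one reading per minute -> 1440
METERS = 1_000_000                  # assumed fleet size
BYTES_PER_READING = 100             # assumed record size

daily_readings = READINGS_PER_DAY * METERS
daily_bytes = daily_readings * BYTES_PER_READING

print(f"{daily_readings:,} readings/day")          # 1,440,000,000 readings/day
print(f"{daily_bytes / 1e9:.1f} GB/day raw data")  # 144.0 GB/day raw data
```

Even at these modest assumptions, a single application class generates hundreds of gigabytes of raw data per day, which is the growth curve that pushes architects past single-server relational systems.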
Hadoop was created to manage these enormous data volumes, growing from an open-source initiative much as the virtual server industry, with standouts like VMware and Hyper-V, grew out of early server farms. Hadoop is an open-source Java framework maintained by the Apache Software Foundation that allows query and analysis jobs to be customized and orchestrated across commodity hardware. In Hadoop terms, this distribution of analytic computing is referred to as a cluster. The power of the Hadoop cluster is its ability to distribute data and processing jobs from a central scheduler to nodes, where the data and processes run in parallel and complete their tasks independently.
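The map-then-reduce pattern that a Hadoop cluster distributes across nodes can be sketched in a single process. The word-count example below is only an illustration of the pattern, not Hadoop's actual Java API: in a real cluster, the map and reduce functions would be shipped to the nodes holding the data and run in parallel.

```python
from collections import defaultdict

def map_phase(split):
    """Emit (word, 1) pairs for one input split, independently of other splits."""
    return [(word, 1) for word in split.split()]

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values observed for one key."""
    return key, sum(values)

# Two "splits" standing in for blocks of a file stored on different nodes.
splits = ["big data big cluster", "data node data"]
intermediate = [pair for s in splits for pair in map_phase(s)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts)  # {'big': 2, 'data': 3, 'cluster': 1, 'node': 1}
```

Because each split is mapped independently and each key is reduced independently, adding nodes lets the same job run over far more data without changing the program, which is the scalability property the cluster model is built on.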
Unstructured Data Requires a New Processing Model
Growth in data volumes is not only massive; the data is also evolving into multiple formats: unstructured data. The stored data elements to be queried and analyzed are no longer just tabular columns and rows that can be manipulated easily with SQL, where the most difficult issue might be a BLOB or CLOB object embedded as a reference in a table. Today's data formats include e-mails, videos in various formats, RSS news feeds, HTML, human voices, streaming data, source code, and device state information. Imagine the volume of data generated daily by the Mars rover and all the supporting devices collecting data and transmitting it to terrestrial repositories; that is just one data collection service among the hundreds of thousands running all the time here on Earth.
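Handling mixed formats usually means inspecting each record at processing time rather than forcing everything into one table schema up front, an approach often called schema-on-read. The sketch below shows the idea with a minimal, assumed classifier; the record formats and the `classify` helper are illustrative, not part of any Hadoop API.

```python
import json

def classify(record: str) -> str:
    """Guess a raw record's format from its content (illustrative heuristics)."""
    stripped = record.strip()
    if stripped.startswith("{") and stripped.endswith("}"):
        try:
            json.loads(stripped)
            return "json"
        except json.JSONDecodeError:
            pass
    if stripped.lower().startswith(("<html", "<!doctype")):
        return "html"
    return "text"

# Records of three different shapes arriving through the same pipeline.
records = [
    '{"sensor": "temp", "value": -63.2}',
    "<html><body>status page</body></html>",
    "plain log line: heartbeat ok",
]
print([classify(r) for r in records])  # ['json', 'html', 'text']
```

Each format can then be routed to its own parser or analysis job, which is what lets one processing model absorb e-mails, feeds, and device telemetry side by side.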
Addressing Scalability with Structured and Unstructured Data
Hadoop offers a credible scalability advantage for keeping pace with the volume and velocity of Big Data. An organization like NASA can add inexpensive commodity hardware as additional nodes in its cluster. The other real benefit is that Hadoop ships with many pre-built programs for handling these types of unstructured data, and its open-source nature allows a business to customize its data management capabilities to fit requirements for new and emerging data formats.