Wednesday, 1 July 2015


Big Data Moves Toward Real-Time Analysis


The data warehouse, as valuable as it is, is history. The most valuable data will be that which is collected and analyzed during the customer interaction, not the review afterward.
(Image: Max Altana/iStockphoto)
It's clear there's a transformation in enterprise data handling underway. This was evident among the big data aficionados attending the Hadoop Summit, in San Jose, Calif., and the Spark Summit in San Francisco earlier this month. One phase of this transformation is in the scale of the data being accumulated, as valuable "machine data" piles up faster than sawdust in a lumber mill. Another phase, one that's less frequently discussed, is the movement of data toward near real-time use.
Thomas Davenport, writing June 3 in the Wall Street Journal's online CIO Report, said the shift in data architectures from one large relational database server toward the much larger capacities of distributed systems, such as Hadoop, was "transformational." But he left out the element of rapid-fire timing. The digital economy demands not only analysis of prodigious amounts of data, but the ability to process it -- sort out nuggets -- in near real-time.
The data warehouse, as valuable as it is, is history. The most valuable data will be that which is collected and analyzed during the customer interaction, not the review afterward. The analysis that counts is not the results of the last three months, or even the last three days, but the last 30 seconds -- probably less.
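To make the "last 30 seconds" idea concrete, here is a minimal sketch of a sliding time window in plain Python. It is not Spark or Hadoop code. The `SlidingWindow` class, the 30-second span, and the sample events are all assumptions made for illustration.

```python
from collections import deque


class SlidingWindow:
    """Keep only the events seen in the last `span` seconds (hypothetical helper)."""

    def __init__(self, span=30.0):
        self.span = span
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        """Record an event, then drop everything that has aged out of the window."""
        self.events.append((timestamp, value))
        self.total += value
        cutoff = timestamp - self.span
        while self.events and self.events[0][0] <= cutoff:
            _, old_value = self.events.popleft()
            self.total -= old_value

    def average(self):
        """Mean of the values currently inside the window."""
        return self.total / len(self.events) if self.events else 0.0


window = SlidingWindow(span=30.0)
# Illustrative (seconds, order value) pairs arriving during customer interactions.
for ts, amount in [(0, 10.0), (10, 20.0), (25, 30.0), (45, 40.0)]:
    window.add(ts, amount)
```

After the event at second 45 arrives, only the events at seconds 25 and 45 remain in the window, so the average reflects recent activity rather than the whole history.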
In the digital economy, interactions will occur in near real-time. Data analytics will need to be able to keep up. Hadoop and its early implementers, such as Cloudera and Hortonworks, have risen to prominence based on their mastery of scale. They gobble data at a prodigious rate, one that was inconceivable a few years ago.
"We see 50 billion machines connected to the Internet in five to ten years," said Vince Campisi, CIO of GE Software, at the Hadoop Summit. "We see significant convergence of the physical and digital world." The convergence of the physical operation of wind turbines and jet engines with machine data means the physical object gets a virtual counterpart. Its existence is captured as sensor data and stored in the database. When analytics are applied, its existence there can take on a life of its own, and the system can predict when parts will break down and cause real-life operations to grind to a halt.

But Davenport's outline of the transformation was incomplete. It didn't include the element of immediacy, of near real-time results needed as data is analyzed. It's that immediacy element that IBM was acting on as it issued its ringing endorsement of Apache Spark.
Spark is the new kid on the block, an in-memory system that's not exactly unknown, but is still a stranger in data warehouse circles. IBM said it would pour resources into Spark, an Apache Foundation open source project.
"IBM will offer Apache Spark as a service on Bluemix, commit 3,500 researchers to work on Spark-related projects, donate IBM SystemML to the Spark ecosystem, and offer courses to train 1 million data scientists and engineers to use Spark," wrote InformationWeek's William Terdoslavich after IBM's announcement.
Is it wise to focus so much attention and effort on Spark? The big data field is in ferment. There's RethinkDB, an ambitious Redis project, or, for that matter, the commercial in-memory SAP HANA. With so many initiatives underway, was it wise for IBM to announce that Spark is "potentially the most significant open source project of the next decade"?
It's always tempting to ask: Significant to whom? Big data users, who need its speed? Or IBM, which was caught flat-footed by the NoSQL wave? Now IBM is clearly looking for fresh options, and in Spark, it's found one.
Doug Henschen, formerly with InformationWeek and now part of Constellation Research, had this to say in his blog after the IBM endorsement: 
"IBM execs told analysts at the company’s new Spark Technology Center [in San Francisco that] it’s an all-in bet to integrate nearly everything in the analytics portfolio with Spark. Other tech vendors betting on Spark range from Amazon to Zoomdata …"
In addition, IBM executives explained the salient features of Spark that they liked:
  1. The task of data conversion and loading is handled automatically, allowing the Spark user to concentrate on data analysis, not data movement.
  2. Spark is flexible in its data processing capabilities. It's a platform where the task can be distributed, scheduled, and given proper I/O capacity, while the data gets filtered, reduced, and joined as needed.
  3. Its in-memory feature gives it an outlandish speed advantage over classic Hadoop, which relies on MapReduce, a disk-based system. In short, it excels at performance.
  4. It can host SQL queries, machine learning analytics, Spark Streaming data analysis, and analytics written with the recently released SparkR interface for R coming out of Berkeley.
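The filter, reduce, and join steps named in item 2 can be sketched in a few lines. This is plain Python over in-memory lists, not the Spark API, and it makes no attempt at distribution or scheduling. The record layout and the `readings` and `machines` names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical sensor readings: (machine_id, temperature in degrees C).
readings = [
    ("turbine-1", 71.0),
    ("turbine-1", 95.0),
    ("engine-7", 88.0),
    ("engine-7", 102.0),
    ("turbine-2", 65.0),
]

# Hypothetical lookup table: machine_id -> site name.
machines = {"turbine-1": "Site A", "engine-7": "Site B", "turbine-2": "Site A"}


def hot_readings(rows, threshold):
    """Filter: keep only readings above the threshold."""
    return [(mid, temp) for mid, temp in rows if temp > threshold]


def max_by_machine(rows):
    """Reduce: collapse each machine's readings to its maximum."""
    peaks = defaultdict(lambda: float("-inf"))
    for mid, temp in rows:
        peaks[mid] = max(peaks[mid], temp)
    return dict(peaks)


def join_site(peaks, lookup):
    """Join: attach the site name to each machine's peak reading."""
    return {mid: (lookup[mid], temp) for mid, temp in peaks.items() if mid in lookup}


result = join_site(max_by_machine(hot_readings(readings, 80.0)), machines)
```

Here `result` maps each machine that ran hot to its site and peak temperature. In a distributed engine the same three steps would run across partitions of the data, which is the part this sketch leaves out.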
IBM said it would run its own analytics software on top of Spark, including SystemML for machine learning, SPSS, and IBM Streams.
Henschen concluded that the combination of analysis capabilities being built on top of Spark, along with its ability to make use of distributed, in-memory computing, was going to give it an edge in the long run. "By blending machine learning and streaming, for example, you could create a real-time risk-management app," he wrote. What’s more, Spark supports development in Scala, Java, Python, and R, which is another reason the community is growing so quickly.
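Henschen does not spell out what such an app would look like. As one way to picture the blend of streaming and learning, the sketch below scores each incoming value against statistics learned from the values seen before it. It is plain Python rather than Spark code, and the `RunningRiskScorer` class, the z-score threshold, and the sample amounts are assumptions, a toy stand-in for a real model.

```python
import math


class RunningRiskScorer:
    """Hypothetical online scorer: flags values far from the running mean."""

    def __init__(self, threshold=3.0):
        self.threshold = threshold  # z-score above which a value is flagged
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford's method)

    def score(self, value):
        """Return the z-score of `value` against what has been seen so far."""
        if self.count < 2:
            return 0.0  # not enough history to judge
        std = math.sqrt(self.m2 / (self.count - 1))
        return abs(value - self.mean) / std if std > 0 else 0.0

    def update(self, value):
        """Fold one more observation into the running statistics."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    def observe(self, value):
        """Score first, then learn, so each value is judged on prior data only."""
        risky = self.score(value) > self.threshold
        self.update(value)
        return risky


scorer = RunningRiskScorer(threshold=3.0)
# Illustrative stream of transaction amounts; the last one is an outlier.
stream = [100.0, 102.0, 98.0, 101.0, 99.0, 500.0]
flags = [scorer.observe(amount) for amount in stream]
```

Only the final amount is flagged. The point of the sketch is the ordering: the decision is made as each event arrives, and the model keeps updating from the same stream.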
At Spark Summit, Amazon Web Services announced a free Spark service running on Amazon Elastic MapReduce, and IBM announced plans for Spark services on Bluemix (currently in private beta) and SoftLayer. These cloud services will open the floodgates to developers, and IBM’s contributions will surely help to harden the Spark Core for enterprise adoption.