Needle in the Haystack

New kid on the block

The undisputed hottest topic in IT in the recent past has been cloud computing. Then big data stepped in and arguably stole the show last year. If you thought smartphones and tablets were the ones to look out for in 2013, you might be wrong, as there’s likely to be a new kid on the block: the Internet of Things (IoT). Also known as machine-to-machine (M2M), the Internet of Things is all about sensors that can connect lots of formerly mundane objects to the Internet and automatically send their data to IT systems for analysis.

“The Internet of Things is the network of physical objects that contain embedded technology to communicate and sense or interact with their internal states or the external environment.” – Gartner

The Internet of Things is a fast-growing market in which IPv6 will play a central role. The IPv6 address space is large enough that every known object in this world could have its own unique IP address. The objects can be anything from health care monitors to traffic lights to thermostats to trains.
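To put the address-space point in numbers, here is a rough sketch in Python. It is plain arithmetic on the 32-bit and 128-bit address lengths of IPv4 and IPv6, and it uses the upper estimate of 100 trillion objects quoted in this post as an assumption about how many things might need an address.

```python
# Rough comparison of address-space sizes (simple arithmetic, no networking).
IPV4_BITS = 32
IPV6_BITS = 128

ipv4_addresses = 2 ** IPV4_BITS
ipv6_addresses = 2 ** IPV6_BITS

# Assumption: the upper estimate of connected objects quoted in this post.
estimated_objects = 100 * 10 ** 12  # 100 trillion

print(f"IPv4 addresses: {ipv4_addresses:,}")
print(f"IPv6 addresses: {ipv6_addresses:.3e}")
print(f"IPv4 is enough for every object: {ipv4_addresses >= estimated_objects}")
print(f"IPv6 addresses available per object: {ipv6_addresses // estimated_objects:.3e}")
```

IPv4 falls short of that estimate by several orders of magnitude, while IPv6 leaves an enormous number of addresses to spare for each object.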

The Internet of Things would encode 50 to 100 trillion objects, and be able to follow the movement of those objects. The huge amount of data generated by these devices makes one more technology inevitable: “Big Data”. Big data will make it possible to effectively use the data generated by the trillions of objects. Big data in turn requires two more ingredients: “Compute” and “Storage Space,” which is where cloud computing makes its special appearance. The cloud will play the pivotal role in providing the compute and storage required for all these things to work. In a world where the hardware and operating system have become commoditized, the apps are the differentiator and, increasingly, the apps are a viewport into a cloud service driven by machine learning.

Although it is going to take some time before all of these things materialize and mature, one thing is for sure: it is already making us feel as if we are living in a science fiction movie… imagine your toaster, car or house talking back to you.


Big Data Techniques

Most big data techniques have been around for many years. What’s new is their availability to more people, the speed with which they run (so that many variations can be processed), the variety of data they can process (to provide richer and deeper context), and the volume of data they can handle.

The various types of analytics in big data are:

  1. Descriptive analytics
  2. Diagnostic analytics
  3. Predictive analytics
  4. Prescriptive analytics

Descriptive analytics – What happened?

Descriptive analytics aims to provide insight into what has happened. The methods and technologies involved include A/B testing, dashboards, business activity monitoring, complex event processing, content analytics, geospatial analytics, graph analytics, pattern/anomaly detection and clustering/classification.
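As a small illustration of the pattern/anomaly detection idea, here is a minimal sketch in Python. The sensor readings are made-up values and the z-score threshold of 2.0 is an assumption chosen for this toy data; real systems use far more robust methods.

```python
from statistics import mean, stdev

# Hypothetical hourly temperature readings from one sensor (made-up values).
readings = [21.0, 21.4, 20.9, 21.2, 35.0, 21.1, 20.8, 21.3]


def find_anomalies(values, threshold=2.0):
    """Return (index, value) pairs whose z-score exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        # All values are identical, so nothing stands out.
        return []
    return [(i, v) for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Summarize what happened, then point at the reading that stands out.
print(f"mean={mean(readings):.2f}, stdev={stdev(readings):.2f}")
print("anomalies:", find_anomalies(readings))
```

On this toy data the only reading flagged is the 35.0 at index 4, which is the kind of "what happened?" answer descriptive analytics is after.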

Diagnostic analytics – Why did it happen?

Diagnostic analytics focuses on analyzing data to find the causes of an event, and is closely related to root-cause analysis. The methods and technologies include online analytical processing, data mining and interactive visualization.

Predictive analytics – What will happen?

Predictive analytics helps model and forecast what might happen. It involves technologies like crowdsourcing, data mining, forecasting, machine learning and simulations.
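Forecasting can be as simple as fitting a trend line to past observations and extending it. The sketch below does this with ordinary least squares in plain Python; the weekly ticket counts are hypothetical numbers, and a straight line is an assumption that will not suit every data set.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Hypothetical support-ticket counts for five past weeks (made-up values).
weeks = [1, 2, 3, 4, 5]
tickets = [112, 118, 131, 139, 150]

slope, intercept = fit_line(weeks, tickets)

# Extend the fitted trend one week into the future.
next_week = 6
forecast = slope * next_week + intercept
print(f"trend: {slope:.1f} tickets/week, forecast for week {next_week}: {forecast:.1f}")
```

The forecast is only as good as the assumption that the trend continues, which is why predictive analytics talks about what might happen rather than what will.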

Prescriptive analytics – Make it happen

Prescriptive analytics seeks to determine the best solution or outcome among various choices, given the known parameters. The methods and technologies include fuzzy logic, optimization, rules engines and decision analysis.
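A toy version of "the best outcome among various choices, given the known parameters" is a constrained search: discard the choices that do not meet a requirement, then pick the cheapest of the rest. The options, capacities, costs and demand figure below are all hypothetical, and this brute-force search is only a sketch of the optimization idea, not a real decision engine.

```python
# Hypothetical capacity plans with made-up capacities and costs.
options = [
    {"name": "small", "capacity": 100, "cost": 40},
    {"name": "medium", "capacity": 200, "cost": 70},
    {"name": "large", "capacity": 400, "cost": 150},
]


def cheapest_sufficient(candidates, demand):
    """Return the lowest-cost option whose capacity covers the demand, or None."""
    feasible = [o for o in candidates if o["capacity"] >= demand]
    if not feasible:
        return None
    return min(feasible, key=lambda o: o["cost"])


# Known parameter (assumed here): expected demand for the coming week.
expected_demand = 160

choice = cheapest_sufficient(options, expected_demand)
print("recommended plan:", choice["name"] if choice else "none fits")
```

With a demand of 160, the small plan is ruled out and the medium plan is recommended because it is the cheapest one that still fits.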

Rag picker or Gold miner

Data is exploding at an astounding rate. While it took from the dawn of civilization to 2003 to create 5 exabytes of information, we now create that same volume in just two days!

The gold rush has begun, and this time data is the gold mine. Organizations have slowly started to realize the importance of data. It will be some time before things reach fever pitch, but unlike earlier gold rushes, when there was little gold and many miners, it is the other way around now: there is immense data and very few miners.


Finding skilled personnel is one of the major challenges associated with big data analytics. Successful big data analytics initiatives involve close collaboration between IT, business users, and “data scientists” to identify and implement the analytics that will solve the right business problems. A data scientist may sometimes wonder whether they are working on the right data: is the data worth anything, or is it just garbage?

One question that might pop up is “Are you a gold miner or a rag picker?” and the answer is “Depends on what pile you are working on… garbage or gold?”

One of the ‘V’s that characterize big data is ‘Variety’, which mostly refers to data in unstructured formats. Unstructured data is heterogeneous and variable in nature and comes in many formats, including text, document, image, video and more. Unstructured data is growing faster than structured data. According to an IDC study, it will account for 90 percent of all data created in the next decade. As a data analyst, it is important to identify the usefulness of the data that you are working on; sometimes, however, it is better not to restrict yourself. It is the uncertainty that makes it more interesting.

Never forget: “the greatest things in the world are found in the most unusual places.”

Needle in the Haystack

Big data is all about analyzing millions of data points and making meaningful sense out of them. You can compare it to “finding the needles in the haystack”. The trick here is that the haystack is ever growing (Volume), at a fast pace (Velocity). Wikipedia defines hay as “grass, legumes or other herbaceous plants that have been cut, dried, and stored” (Variety). The needles are the useful tools (predictions) that are hidden inside.

While working with our clients on various projects, we see a lot of data being generated and probably stored, but never used. Customer interactions, customer requests (ticketing systems), auto-generated system data, logs and so on: this data is mostly captured and archived, left to decay and then discarded. It’s time we recognized the importance of this data.

The sudden interest in big data is a result of the behavioral changes that we have seen in the recent past. The Internet, which had been a mere static source of information, has now become a medium for immense information exchange due to increased activity on social networks and the availability of mobile Internet. This is the transformation of the Internet from an ugly cocoon to a beautiful butterfly… “It’s mobile now”. It’s also important to note how extreme compute and storage, which were once within the reach of only large enterprises or research establishments, are now available to end users, thanks to cloud computing.

Keep an eye on this space for more about big data.