Big Data is one of the hottest topics, and a Google Trends search on the term shows a tremendous increase in interest since 2011. Big data is often characterized by the three V’s of volume, velocity, and variety, but this framing does not quite capture what separates big data from traditional data management and analysis.
Defining big data by volume and velocity only describes the limitations of current technology. Many years ago, computers had no more than a few hundred MB of storage, and 1GB was considered big at the time, yet it is tiny nowadays. The same applies to velocity when we compare network technology and computing power to determine how much data can be processed and how fast. Volume and velocity are merely technological limits that can be broken through; they are not the nature of big data that differentiates it from “small” data.
Variety is the key point that differentiates big data from small data. Old data had variety too, but we filtered it, kept only the information we were capable of handling in tabular form, and ignored (or simply stored but never used) the information that was difficult to handle, such as textual comments, images, and other binary data. Usable data was typically numbers, or text of a few hundred characters, well structured with clearly defined meaning. Larger and unstructured data, such as long text, images, and sound, were stored as binary blobs and produced no actual value. There were efforts to attach attributes to binary data to give it some value, but the results were very limited. As technology advances, we are starting to extract value from that data: we try to get the sentiment and even the meaning of text, and we can detect faces in pictures, identify who those people are, or identify objects that appear in images. The value hidden in such data is big, and unlocking it makes the meaningful data space much bigger. However, we are still at an early stage of getting value from binary or unstructured data; a picture is worth a thousand words, and we are only getting a few of them now.
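To make the idea of extracting value from unstructured text concrete, here is a minimal sketch of lexicon-based sentiment scoring in Python. The word lists and the counting rule are illustrative assumptions only; real sentiment analysis relies on trained models and far richer features.

```python
# A toy illustration of extracting a value (sentiment) from free text.
# The tiny lexicons and the scoring rule are hypothetical assumptions;
# production systems use trained models, not simple word counting.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "unhappy"}

def sentiment_score(text: str) -> int:
    """Return (# positive words) - (# negative words) in the text."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment_score("I love this product, it is great"))  # 2
print(sentiment_score("terrible quality, very unhappy"))    # -2
```

Even this crude score turns previously ignored text into a number that can sit next to structured fields in a table, which is exactly the shift described above.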
On top of extracting value from old data that we stored but were unable to utilise, there are more and more usable data from new sources and in unstructured formats, especially user-generated content such as forums, blogs, and social network updates. More data are also captured from connected devices, such as mobile phones and fitness monitors, as well as from all kinds of web or digital analytics tracking. People’s lives are digitized, and data are generated and captured every second. Data are no longer captured only through transactions with a predefined purpose and usage; they are more like a journal about the data subject, whose uses may be discovered later.
Another key point of variety is the inferred linkage among data from different sources. In the past, data could only be related within the same domain, with relationships built in during design and capture; efforts to connect data from different sources largely remained inside the transaction database. In big data, data can be dynamically linked together, and the linkage is not built in by any specific vendor during capture but created or inferred by data users after the data have been collected, using information from social networks, various kinds of tracking, and profiling. The linkage itself is data, and the connected data can generate new data exponentially. Multi-source data provide a comprehensive view of the data subject; however, the inferred linkage between data from different sources is error-prone and can result in an incorrect view of the data subject. This is new in the big data era and requires extra effort to keep the error down to an acceptable level.
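As an illustration of how such linkage might be inferred after collection, here is a minimal sketch in Python that links records from two hypothetical sources by fuzzy name matching. The field names, sample records, and similarity threshold are all assumptions for illustration; real entity resolution uses far more robust techniques.

```python
# A minimal sketch of inferred record linkage across two data sources.
# All field names, sample records, and the 0.8 threshold are hypothetical.
from difflib import SequenceMatcher

# Records captured independently, with no shared key between sources.
social_profiles = [
    {"handle": "j.smith", "display_name": "John Smith", "city": "London"},
    {"handle": "amy_w", "display_name": "Amy Wong", "city": "Sydney"},
]
purchase_records = [
    {"customer": "Jon Smith", "city": "London", "item": "running shoes"},
    {"customer": "A. Wong", "city": "Sydney", "item": "fitness monitor"},
]

def name_similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1] via difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def infer_links(profiles, purchases, threshold=0.8):
    """Link records whose cities match and whose names are similar.

    The linkage is inferred, so it is error-prone: a low threshold
    creates false links, a high one misses true ones.
    """
    links = []
    for p in profiles:
        for q in purchases:
            score = name_similarity(p["display_name"], q["customer"])
            if p["city"] == q["city"] and score >= threshold:
                links.append((p["handle"], q["item"], round(score, 2)))
    return links

print(infer_links(social_profiles, purchase_records))
# [('j.smith', 'running shoes', 0.95), ('amy_w', 'fitness monitor', 0.8)]
```

Note that “Jon Smith” and “John Smith” may or may not be the same person; the score quantifies similarity, not identity, which is why inferred views of a data subject need validation.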