Monday, July 1, 2019

Three V’s in Big Data

Big data is a term used for datasets so large and complex that traditional data processing software (think databases, spreadsheets, and traditional statistics packages like SPSS) cannot handle them. The industry talks about big data using three different concepts, called the “Three V’s”: volume, variety, and velocity.



A brief look at what each of these means:

Volume

Volume refers to how big the dataset under consideration is. It can be really, really big, almost hard-to-believe big. For example, Facebook has more users than the population of China, and it holds approximately 250 billion images and 2.5 trillion posts. Now that's a lot of data, or, as we say, big data.

Gartner, one of the world’s leading analyst firms, estimates there will be 22 billion connected devices by 2022. That is 22 billion devices, each producing thousands of pieces of data. Imagine that you are sampling the temperature in your kitchen once a minute for a year. That is over half a million data points. Add humidity to the measurements and now you have over 1 million data points. Multiply that by five rooms and a garage, all with temperature and humidity measurements, and your house is producing more than 6 million pieces of data a year from just one little IoT device per room.
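
To make the arithmetic concrete, here is a quick back-of-the-envelope sketch in Python; the sampling rate and room count are simply the numbers from the example above.

# Back-of-the-envelope count of the home-sensor readings described above.
MINUTES_PER_YEAR = 60 * 24 * 365          # 525,600 samples per measurement per year
measurements_per_room = 2                 # temperature + humidity
rooms = 6                                 # five rooms plus a garage

per_room = MINUTES_PER_YEAR * measurements_per_room
whole_house = per_room * rooms

print(f"One room, one year:    {per_room:,} data points")     # about 1 million
print(f"Whole house, one year: {whole_house:,} data points")  # about 6 million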

Now imagine how many pieces of data your smartphone produces in a day. Location, usage, power levels, and cellular connectivity pour out of your phone and into databases, your apps, and application dashboards constantly.

Sometimes location information is collected and sold even without your consent or opt-in.
Data, data, and more data. Data science is how we make use of it all.

Variety

Photos are a very different data type from temperature, humidity, or location readings. Sometimes they go together and sometimes they don’t. Photos are sophisticated data structures that are hard to interpret and hard for machines to classify. Throw audio recordings on top of that and you have a rather varied set of data types.

Let’s consider voice. You might know that Alexa is very good at translating voice to text but not so good at assigning meaning to that text. One reason is the lack of context; another is the many different ways that people ask for things, make comments, and so on. Imagine, then, Alexa (and Amazon) keeping track of all the queries and then doing data science on them to find out what sorts of things people are asking for and the variety of ways they ask for them. That is a lot of data and a lot of information that can be gathered.
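
As a toy illustration of that kind of analysis, the hypothetical sketch below groups a handful of made-up voice queries by the intent behind them and counts how many distinct phrasings land on each one. A real system would use far more sophisticated natural-language processing, but the basic idea of grouping varied wordings is the same.

from collections import Counter, defaultdict

# Hypothetical, hand-labeled sample of voice queries and the intent behind each.
queries = [
    ("what's the weather like today", "weather"),
    ("is it going to rain", "weather"),
    ("do I need an umbrella", "weather"),
    ("play some jazz", "music"),
    ("put on something relaxing", "music"),
    ("set a timer for ten minutes", "timer"),
]

phrasings = defaultdict(set)
intent_counts = Counter()
for text, intent in queries:
    phrasings[intent].add(text)
    intent_counts[intent] += 1

for intent, count in intent_counts.most_common():
    print(f"{intent}: {count} queries, {len(phrasings[intent])} distinct phrasings")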


Data science has a much better chance of identifying patterns once the voice has been translated to text; text is simply easier to work with. In that translation, however, you lose a lot of information about tone of voice, emphasis, and so on.

Velocity

Velocity refers to how fast the data is changing and how fast it is being added to the pile. Facebook users upload roughly 1 billion pictures a day, so at that rate Facebook is expected to pass a trillion images within the next couple of years. Hence we can say Facebook is a high-velocity dataset.
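
As a rough sanity check on that estimate, here is a small projection using the approximate figures quoted above (round numbers, not official Facebook statistics).

# Rough projection of Facebook's image count from the approximate figures above.
current_images = 250e9        # ~250 billion images today
uploads_per_day = 1e9         # ~1 billion new pictures per day

for years in (1, 2, 3):
    projected = current_images + uploads_per_day * 365 * years
    print(f"after {years} year(s): ~{projected / 1e12:.2f} trillion images")
# At this rate the count passes the one-trillion mark a little after two years.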

A low-velocity dataset (one changing slowly or not at all) might be the set of temperature and humidity readings collected from your house over the last five years. Needless to say, high-velocity datasets call for different techniques than low-velocity ones.
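
One concrete example of that difference: with a low-velocity dataset you can load everything into memory and compute a statistic in a single pass, but a high-velocity stream is usually processed incrementally, updating the result as each reading arrives. Here is a minimal sketch of a running (streaming) mean, using made-up sensor readings.

def running_mean(stream):
    """Update the mean one reading at a time, without storing the whole stream."""
    count, mean = 0, 0.0
    for value in stream:
        count += 1
        mean += (value - mean) / count   # incremental update
        yield mean

# Pretend these readings arrive one at a time from a temperature sensor.
readings = [21.5, 21.7, 22.0, 21.9, 22.3]
for current in running_mean(readings):
    print(f"mean so far: {current:.2f}")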


Data scientists have developed many methods for processing data across variations of the three V’s. The three V’s describe a dataset and give you an idea of the parameters of your particular set of data. The process of gaining insights from data is called data analytics.

In the next posts, we'll focus on gaining knowledge about analytics and on learning how to ask some data analytics questions using Python.

