Big data is the term for a collection of data sets so large and complex that it becomes difficult to store and process them using on-hand database management tools or traditional data processing applications. Big data poses many challenges, including capturing huge volumes of data, storing it, analyzing it, searching it, transferring it, querying it, updating it, and visualizing it.
Big data problems are commonly categorized under four Vs. Following are the four Vs of big data:
Veracity refers to the uncertainty of data. Unstructured data carries a significant amount of veracity, i.e. noise and inconsistency.
The following come under velocity:
● Speed of generation
● Rate of analysis
Velocity is the measure of how fast data is being generated. In today's world the speed of data generation is increasing day by day. Facebook, for example, has generated around 250 billion images, which is too much for any big system to consume, and the figure keeps growing with its active user base. Similarly, other social sites such as Instagram, Twitter, and LinkedIn generate tons of data every day.
The different types and natures of data are:
Unstructured data - data that has no predefined structure. Examples of unstructured data include emails, videos, audio files, web pages, and social media messages. Most of the data generated by big organizations is unstructured. To deal with it, some organizations use NoSQL databases such as MongoDB.
Structured data - data that has a defined length and format. This includes numbers, strings, dates, etc.
Semi-structured data - data that does not conform to a rigid schema but carries its own tags or markers. Examples of semi-structured data include XML and JSON documents, which are often stored in NoSQL systems.
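The distinction above can be seen in a small sketch: the JSON records below are hypothetical, but they show the defining property of semi-structured data, namely that each record carries its own tags and the set of fields may vary from record to record, so a consumer cannot assume a fixed layout the way it could with a structured table.

```python
import json

# Hypothetical semi-structured records: each JSON object carries its own
# tags, and the fields differ from record to record (unlike a fixed-schema table).
raw_records = [
    '{"user": "alice", "email": "alice@example.com"}',
    '{"user": "bob", "phone": "555-0100", "tags": ["new", "trial"]}',
]

records = [json.loads(r) for r in raw_records]

# A consumer must tolerate missing fields instead of assuming a fixed layout.
emails = [r.get("email") for r in records]
print(emails)  # the second record has no "email" field, so it yields None
```

This is exactly the flexibility that makes semi-structured data awkward to force into a relational table and a natural fit for document-oriented NoSQL stores.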
Following are the sources of volume:
● Click stream
● Active/passive sensor
● Printed corpus
● Social media
Volume is a very important component in the 4V concept of big data. It is basically the amount of big data that is being generated, stored, and managed or maintained by big organizations. There are multiple sources from which this volume is generated these days; the most important and crucial is the data generated via social media, which has become extremely diverse and wide-ranging. Take a look at the big data generated by some of the top social networking sites:
- Facebook has generated around 250 billion images, which is too much for any big system to consume, and the figure keeps growing with its active users, currently above 1 billion.
- Instagram users post 216,000 new photos every minute of the day.
- Twitter users tweet 277,000 times every minute of the day.
- Tinder users swipe 416,667 times every minute of the day.
- Apart from social networking sites, sites such as Google receive over 4 million searches every minute of the day.
- Users of mobile applications like WhatsApp share 347,222 photos every minute of the day.
- WordPress users alone publish over 2 million posts every day. That comes out to around 24 blog posts every second.
- YouTube users upload 73 hours of video every minute of the day.
As time passes and the number of active users of these giant web-based companies grows day by day, the amount of big data keeps increasing, which is one of the major concerns.
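To get a feel for what these per-minute rates mean at the scale of a day, they can be multiplied out directly. The rates below are the figures quoted above, not fresh measurements, and the scale-up assumes the rate holds constant around the clock:

```python
# Per-minute rates quoted in the text above (assumed constant over a day).
per_minute = {
    "instagram_photos": 216_000,
    "tweets": 277_000,
    "tinder_swipes": 416_667,
    "google_searches": 4_000_000,
}

MINUTES_PER_DAY = 24 * 60  # 1440 minutes in a day

# Scale each per-minute rate up to a per-day volume.
per_day = {name: rate * MINUTES_PER_DAY for name, rate in per_minute.items()}
print(per_day["tweets"])  # 398,880,000 tweets per day at that rate
```

Even a single one of these sources lands in the hundreds of millions of items per day, which is why storage and transfer show up so prominently in the challenge list below.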
Following are the big data challenges:
● Capturing the big data
● Data storage
● Data analysis
● Data search
● Data sharing
● Data transfer
● Data visualization
● Data querying
● Data updating
● Privacy of information
Big data virtualization is a way of gathering data from multiple sources into a single layer. The gathered data layer is virtual, i.e. it consists of virtual structures created for big data systems. For example, within an organization there can be a big data virtual pool which multiple projects can use to fetch data as needed.
Unlike other methods, most of the data remains in place as it is and is retrieved on demand directly from the source systems.
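The on-demand idea can be sketched in a few lines. This is a minimal illustration, not a real virtualization product: the class name, the registered source names, and the lambda "source systems" are all hypothetical stand-ins for the databases, logs, and APIs an organization would actually wire in.

```python
from typing import Callable, Dict, List


class VirtualDataLayer:
    """A toy virtual data pool: sources are registered once and
    queried on demand; nothing is copied up front."""

    def __init__(self) -> None:
        self._sources: Dict[str, Callable[[], List[dict]]] = {}

    def register(self, name: str, fetch: Callable[[], List[dict]]) -> None:
        # Register a source system by name; 'fetch' contacts it lazily.
        self._sources[name] = fetch

    def query(self, name: str) -> List[dict]:
        # Pull straight from the source system at query time,
        # so the data stays in place until somebody asks for it.
        return self._sources[name]()


# Two hypothetical "source systems" shared by multiple projects.
layer = VirtualDataLayer()
layer.register("crm", lambda: [{"customer": "acme", "region": "EU"}])
layer.register("clickstream", lambda: [{"page": "/home", "hits": 42}])

rows = layer.query("crm")  # fetched on demand, not pre-copied
```

The design choice worth noting is that the layer holds only fetch callables, never the rows themselves, which is what lets the underlying data remain in the source systems as the text describes.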
Following are the characteristics of virtualization that support the scalability and operating efficiency required for big data environments:
Network virtualization provides an efficient way to use networking as a pool of connection resources. Instead of relying on the physical network for managing traffic, multiple virtual networks can be created where all the networks utilize the same physical implementation.