Big Data

Rithik Sharma
3 min read · Sep 17, 2020

What is big data?

Big data is a concept that refers to massive collections of complex structured, unstructured and semi-structured data sets that are rapidly generated from various sources.

Big data is still data, but it is enormous and growing at an exponential rate; in short, it is so large that it cannot be stored or processed with traditional data management software.

Big data is usually characterized by the three V's, which were first identified by Doug Laney, but in recent years several other V's have been added to capture different aspects of big data.

The five V’s

Volume:-

As the name suggests, the amount of data matters. Organisations collect data from myriad sources and have to process huge amounts of it; the value of that data is often unknown and varies from one organisation to another.

Variety:-

Variety refers to the different types and nature of data: structured data in databases, unstructured data such as text and document files, and semi-structured data such as streaming data from sensors.

Velocity:-

In today’s world data is generated at an incredibly fast rate; the flow of data into companies is massive and continuous, and it needs to be processed and analyzed quickly.

Variability:-

Data can sometimes be inconsistent, which hampers the ability to handle and manage it effectively.

Value:-

Data has intrinsic value, as the example of some of the tech giants shows: a substantial part of the value they offer comes from their data, which is constantly analyzed to develop new products.

The difficulty of storing and processing big data

Handling big data imposes unique demands: processing huge volumes and varieties of data can overwhelm a single server, so many of the big companies working on big data tasks use technologies like Hadoop.

Achieving adequate velocity in a cost-effective manner is also a challenge, and maintaining large fleets of servers can be problematic, which is why cloud computing has become a primary way of hosting big data systems.

What is Hadoop and how does it solve the big data problem?

Hadoop is an open-source framework used to capture, manage and process data in a distributed fashion, as it works on a cluster of nodes. One of Hadoop’s core components is HDFS (Hadoop Distributed File System), which lets a single Hadoop cluster scale to hundreds of nodes and more; with more nodes come more storage and higher data processing speeds.
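As a rough illustration, here is a minimal sketch of copying a raw file into HDFS using the standard `hdfs dfs` commands (driven from Python for convenience); the file name and HDFS path are hypothetical, and a working Hadoop installation is assumed.

```python
import subprocess

# Hypothetical local log file and HDFS destination (assumes Hadoop is
# installed and the `hdfs` command is on the PATH).
LOCAL_FILE = "access.log"
HDFS_DIR = "/user/demo/raw_logs"

def hdfs(*args):
    """Run an `hdfs dfs` subcommand and raise if it fails."""
    subprocess.run(["hdfs", "dfs", *args], check=True)

# Create the target directory, copy the raw file in, and list it.
# HDFS splits the file into blocks and replicates them across the
# DataNodes in the cluster, which is what lets storage scale out.
hdfs("-mkdir", "-p", HDFS_DIR)
hdfs("-put", "-f", LOCAL_FILE, HDFS_DIR)
hdfs("-ls", HDFS_DIR)
```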

Hadoop is designed to manage the V’s of big data: Volume, Variety and Velocity. For volume, Hadoop uses a distributed architecture that is designed to scale out; when we need more data storage or computational power, all we have to do is add more nodes to the cluster. Hadoop also allows us to store data of any format, be it structured, unstructured or semi-structured: we can load our raw data into Hadoop and only later define how we want to view it, as sketched below.
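To make the “load the raw data first, decide how to view it later” idea concrete, here is a small self-contained Python sketch (not Hadoop-specific; the field layout and sample values are made up for illustration): the raw lines are kept exactly as received, and a schema is applied only at read time.

```python
import csv
import io

# Hypothetical semi-structured sensor readings, stored exactly as received.
RAW_LINES = [
    "2020-09-17T10:00:00,sensor-1,21.5",
    "2020-09-17T10:00:05,sensor-2,19.8",
]

def read_with_schema(lines, schema):
    """Apply a schema (ordered field names) to raw CSV-style lines at read time."""
    reader = csv.reader(io.StringIO("\n".join(lines)))
    return [dict(zip(schema, row)) for row in reader]

# The "view" is defined only now, when we query the data, not when it was stored.
records = read_with_schema(RAW_LINES, ["timestamp", "sensor_id", "temperature_c"])
print(records[0]["temperature_c"])  # -> "21.5"
```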

When it comes to analyzing the data stored in Hadoop, we also get distributed processing, which lets us process the data in parallel; Hadoop’s compute framework is known as MapReduce, illustrated in the sketch below.
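As a rough illustration of the MapReduce model, here is the classic word-count example written as a plain Python simulation of the map, shuffle and reduce phases (this is not actual Hadoop code; on a real cluster each phase would run across many nodes).

```python
from collections import defaultdict

# Toy input "splits"; on a real cluster each split would live on a different node.
splits = ["big data is big", "hadoop processes big data"]

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: sum the counts for a single word."""
    return key, sum(values)

mapped = [pair for line in splits for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}
```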

Hadoop has become one of the best-known software frameworks for dealing with big data, and the fact that big companies like Facebook and Amazon use it gives an idea of how well suited it is to big data workloads.
