Big Data World, Part 1: Definitions
This post is the first in a series about Big Data. In it, we’d like to tell you how we at JetBrains see Big Data, and consequently, how we’re creating products for it.
- This article
- Big Data World, Part 2: Roles
- Big Data World, Part 3: Building Data Pipelines
- Big Data World, Part 4: Architecture
- Big Data World, Part 5: CAP Theorem
The world of big data can seem mysterious, hidden behind a curtain of unknown and weird words. It’s time to clear up this mystery and define Big Data.
What is Big Data?
Like any term that has been overhyped at some point, “Big Data” has become overloaded with meanings. I will use the three definitions that I feel are most accurate:
Data that won’t fit the node’s memory
This is dependent on each piece of hardware, so we can’t define a universal, static value for what constitutes “big data”. I remember my ancient Intel 80386 – its 16 MB memory meant that anything more than 8 MB would be classed as “big data”.
100 MB of data looks small now, but it was considered huge in the past and required sophisticated algorithms to process.
Today, Big Data is much bigger in absolute terms, but still requires sophisticated processing, distributed computing, and special storage formats.
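To make this definition concrete, here is a minimal sketch in Python. It is an illustration, not a real sizing rule: the function name `fits_in_memory` and the `usable_fraction` parameter are hypothetical, and the default of 0.5 is only an assumption chosen because it reproduces the 16 MB / 8 MB example above.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def fits_in_memory(dataset_bytes: int, node_memory_bytes: int,
                   usable_fraction: float = 0.5) -> bool:
    """Return True if the dataset fits into the share of node memory
    that we assume is available for data (usable_fraction is an assumption)."""
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    return dataset_bytes <= node_memory_bytes * usable_fraction


# The same 100 MB dataset is "big" on one node and small on another.
print(fits_in_memory(8 * MB, 16 * MB))    # True: right at the assumed limit
print(fits_in_memory(100 * MB, 16 * MB))  # False: "big data" for the old machine
print(fits_in_memory(100 * MB, 16 * GB))  # True: small for a modern node
```

The point of the sketch is that the answer depends on the node, not on the dataset alone.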
Data that scales on 3V
3V (pronounced “triple-V”) stands for Volume, Velocity, and Variety. Scaling on 3V means that you won’t have to re-architect your storage, jobs, and processes if volume, velocity, or variety grows, say, ten times.
It’s hard to say what “ten times” means in terms of variety, but data tends to change frequently and rapidly in terms of form and velocity.
As you might have guessed, this definition is primarily determined by software.
Enough data to make reliable business decisions
Let’s not forget why data, big or small, matters in the first place – to do business. Taking this into consideration, defining “Big Data” in terms of business applications is useful.
Successful businesses are almost always data-driven and usually focus on making the business reliable, predictable, and consistent. Doing these things well, however, requires more data than merchants had during, say, the Middle Ages. The modern business model, which is user-centric and treats each person differently, is not possible without large amounts of data.
For example, most big e-commerce companies collect huge clickstreams (streams of user-generated events) and base their marketing on them, predicting which goods will be more popular than others.
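As a rough illustration of what a clickstream looks like, here is a small Python sketch. The event layout (user, event type, product) and the function name `popularity` are made up for this example, and counting events is only a crude stand-in for real prediction models.

```python
from collections import Counter

# Hypothetical clickstream: each event is (user_id, event_type, product_id).
clickstream = [
    ("u1", "view", "laptop"),
    ("u2", "view", "phone"),
    ("u1", "add_to_cart", "laptop"),
    ("u3", "view", "laptop"),
    ("u2", "purchase", "phone"),
    ("u3", "view", "headphones"),
]


def popularity(events):
    """Count how many events mention each product, most frequent first."""
    counts = Counter(product for _user, _event, product in events)
    return counts.most_common()


ranking = popularity(clickstream)
print(ranking)  # [('laptop', 3), ('phone', 2), ('headphones', 1)]
```

A real clickstream holds far more events than fit in a Python list, which is exactly where the earlier definitions of Big Data start to apply.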
Now that we understand what “Big Data” is, let’s try to understand who the consumers are.
There are three main categories of internal customers:
- Management needs reports to understand what’s going on in the company, improve existing plans, and create new ones.
- Product managers want to improve their products through experimentation and need data to analyze the results of experiments and propose new ideas.
- Marketing needs data to analyze marketing metrics, such as COA (cost of acquisition), LTV (lifetime value), and so on. They also need data to build successful marketing campaigns.
This is how we understand what big data is and who consumes the results of working with big data.
At JetBrains, our main projects for big data include:
In the next post I’ll define what kinds of professionals work with data and what qualifications they need.
If you would like to read more posts like this, please don’t forget to subscribe to our blog. Let us know what you think in the comments or on Twitter.