Big Data World, Part 1: Definitions

Pasha Finkelshteyn

This post is the first in a series about Big Data. In it, we’d like to tell you how we at JetBrains see Big Data, and consequently, how we’re creating products for it.

Table of contents:

What is Big Data?
Customers
Conclusion

The world of big data can seem mysterious, hidden behind a curtain of unknown and weird words. It’s time to clear up this mystery and define Big Data.

What is Big Data?

Like any term that has been overhyped at some point, “Big Data” has become overloaded with meanings. I will use the three definitions that I feel are most accurate:

Data that won’t fit the node’s memory

This depends on the hardware, so we can’t define a universal, static value for what constitutes “big data”. I remember my ancient Intel 80386: with its 16 MB of memory, anything more than 8 MB would be classed as “big data”.

100 MB of data looks small now, but it was considered huge in the past and required sophisticated algorithms to process.

Today, Big Data is much bigger in absolute terms, but still requires sophisticated processing, distributed computing, and special storage formats.
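To make the "won’t fit the node’s memory" idea concrete, here is a minimal Python sketch. The function names and the `usable_fraction` parameter are assumptions made for illustration (the 0.5 default simply mirrors the 8 MB out of 16 MB example above); this is not a real sizing rule. The second half shows the simplest possible workaround: processing a stream chunk by chunk so that only one chunk is held in memory at a time.

```python
import io
from typing import Iterator, TextIO


def fits_in_memory(data_bytes: int, memory_bytes: int,
                   usable_fraction: float = 0.5) -> bool:
    """Hypothetical rule of thumb: data is 'small' if it fits into
    a fraction of the node's memory (0.5 is an assumed value)."""
    return data_bytes <= memory_bytes * usable_fraction


def read_chunks(stream: TextIO, chunk_lines: int) -> Iterator[list[str]]:
    """Yield lists of at most chunk_lines lines from the stream."""
    chunk: list[str] = []
    for line in stream:
        chunk.append(line.rstrip("\n"))
        if len(chunk) == chunk_lines:
            yield chunk
            chunk = []
    if chunk:
        # Emit the last, possibly shorter, chunk.
        yield chunk


def streaming_sum(stream: TextIO, chunk_lines: int = 1000) -> int:
    """Sum one integer per line, keeping only one chunk in memory."""
    total = 0
    for chunk in read_chunks(stream, chunk_lines):
        total += sum(int(value) for value in chunk if value.strip())
    return total
```

With a real file, `streaming_sum(open(path))` would work the same way as with the in-memory `io.StringIO` streams that are convenient for trying it out. Once a single node cannot keep up even with chunked processing, distributed computing enters the picture.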

Data that scales on 3V

3V (pronounced “triple-v”) stands for Volume, Velocity, and Variety. Scaling on 3V means that you won’t have to re-architect your storage, jobs, and processes if volume, velocity, or variety grows, say, tenfold.

It’s hard to say what “ten times” means in terms of variety, but data tends to change frequently and rapidly in terms of form and velocity.

As you might have guessed, this definition is primarily determined by software.

Enough data to make reliable business decisions

Let’s not forget why data, big or small, matters in the first place – to do business. Taking this into consideration, defining “Big Data” in terms of business applications is useful.

Successful businesses are almost always data driven, and usually focus on making business reliable, predictable, and consistent. Doing these things well, however, requires more data than merchants had during, say, the Middle Ages. The modern, user-centric business model, which treats each customer individually, is not possible without large amounts of data.

For example, most big e-commerce companies collect huge clickstreams (streams of user-generated events) and base their marketing on them, predicting which goods will be more popular than others.
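As a rough illustration of what working with a clickstream looks like, the sketch below counts product views and ranks products by them. The `ClickEvent` fields and the `top_products` helper are hypothetical names chosen for this example, and counting raw views is only the most naive stand-in for the popularity predictions mentioned above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ClickEvent:
    """One user-generated event (hypothetical schema)."""
    user_id: str
    product_id: str
    action: str  # for example "view" or "purchase"


def top_products(events: Iterable[ClickEvent], n: int = 3) -> list[tuple[str, int]]:
    """Return the n most viewed products with their view counts."""
    counts: Counter[str] = Counter()
    for event in events:
        # Only "view" events count towards popularity in this sketch.
        if event.action == "view":
            counts[event.product_id] += 1
    return counts.most_common(n)
```

Because the function consumes any iterable, the events can come from a list in a test or from a generator reading a much larger stream.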

Customers

Now that we understand what “Big Data” is, let’s try to understand who the consumers are.

There are three main categories of internal customers:

Management
Product managers
Marketing

Management needs reports to understand what’s going on in the company, improve existing plans, and create new plans.

Product managers want to improve their products through experimentation and need data to analyze the results of experiments and propose new ideas.

Marketing needs data to analyze marketing metrics, such as COA (cost of acquisition), LTV (lifetime value), and so on. They also need data to build successful marketing campaigns.
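To give a feel for these metrics, here is a deliberately simplified Python sketch. The formulas are common textbook approximations, not definitions taken from this post, and all function and parameter names are assumptions made for illustration. Real calculations usually also account for margins, churn, and discounting.

```python
def cost_of_acquisition(marketing_spend: float, new_customers: int) -> float:
    """Average spend needed to acquire one new customer (simplified)."""
    if new_customers <= 0:
        raise ValueError("new_customers must be positive")
    return marketing_spend / new_customers


def lifetime_value(avg_order_value: float, orders_per_year: float,
                   years_retained: float) -> float:
    """Simplified LTV: revenue expected from one customer over
    the whole time they stay with the company."""
    return avg_order_value * orders_per_year * years_retained


# Example with made-up numbers: spending 1000 to win 50 customers,
# each placing 3 orders of 40 per year for 2 years.
coa = cost_of_acquisition(1000.0, 50)
ltv = lifetime_value(40.0, 3, 2)
ratio = ltv / coa  # how many times a customer pays back their acquisition cost
```

The more customers and orders a company has, the more data these seemingly simple ratios require, especially when they are computed per campaign or per customer segment.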

Conclusion

This is how we understand what big data is and who consumes the results of working with big data.

At JetBrains, our main projects for big data include:

In the next post I’ll define what kinds of professionals work with data and what qualifications they need.

If you would like to read more posts like this, please do not forget to subscribe to our blog. Please let us know what you think in the comments or on Twitter.
