Big Data Tools
Data Engineers Are Like Plumbers Who Install Pipes for Big Data
Roman Poborchiy, the Marketing Manager for the Machine Learning team, interviewed Pasha Finkelshteyn, a Big Data IDE Developer Advocate.
Why We Need Hive Metastore
Everybody in IT works with data, including frontend and backend developers, analysts, QA engineers, product managers, and people in many other roles. The data used and the data processing methods vary with the role, but data itself is more often than not the key. — "It's a very special key, m…
Kotlin API for Apache Spark: Streaming, Jupyter, and More
Hello, fellow data engineers! It’s Pasha here, and today I'm going to introduce you to the new release of Kotlin API for Apache Spark. It's been a long time since the last major release announcements, mainly because we wanted to avoid bothering you with minor improvements. But today's announcement i…
dbt® deeper concepts: materialization
In the first part of this blog series, I described basic dbt® concepts such as installation, creation of views, and describing models. I could have stopped there, but indeed, there are some drawbacks to only using views to build the whole transformation layer in our database. Sometimes we don't real…
How I started out with dbt®
For some time now, I’ve noticed that dbt® is gaining popularity. I’ve seen more questions and more success stories, so a couple of days ago I decided to try it out. But what exactly is dbt anyway? Here is the first phrase you can find in its documentation: “dbt (data build tool) enables anal…
Kotlin API for Apache Spark 1.0 Released
The Kotlin API for Apache Spark is now widely available. This is the first stable release of the API that we consider to be feature-complete with respect to the user experience and compatibility with core Spark APIs. Let’s take a look at the new features this release bring…
Big Data World, Part 6: PACELC
This is the sixth installment of our ongoing series on Big Data, how we see it, and how we build products for it. In this episode, we’ll cover the PACELC theorem. It extends the CAP theorem by describing the trade-offs that exist in distributed systems even when no partition has occurred.
Big Data World, Part 5: CAP Theorem
This is the fifth installment of our ongoing series on Big Data, how we see it, and how we build products for it. In this episode, we’ll cover the CAP theorem. What is it? Is it correct? And why is it needed for data engineers?
Big Data World, Part 4: Architecture
This is the fourth part of our ongoing series on Big Data, how we see it, and how we build products for it. In this installment, we’ll cover the second responsibility of data engineers: architecture.
Big Data World, Part 3: Building Data Pipelines
This is the third part of our ongoing series on Big Data, how we see it, and how we build products for it. In this installment, we’ll cover the first responsibility of the data engineer: building pipelines.
Big Data World, Part 2: Roles
In this part, we’ll talk about the roles of people working with Big Data. All these roles are data-centric, but they’re very different. Let’s describe them in broad brushstrokes to better understand who the people we target are.
Big Data World, Part 1: Definitions
This post is the first in a series about Big Data. It explains how we at JetBrains see Big Data and, consequently, how we're creating products for it. The world of Big Data can seem mysterious, hidden behind a curtain of unknown and weird words. It’s time to clear up this mystery and define Big Data.