Big Data Tools
Data Engineers Are Like Plumbers Who Install Pipes for Big Data
Roman Poborchiy, the Marketing Manager for the Machine Learning team, interviewed Pasha Finkelshteyn, a Big Data IDE Developer Advocate.
Why We Need Hive Metastore
Everybody in IT works with data, including frontend and backend developers, analysts, QA engineers, product managers, and people in many other roles. The data used and the data processing methods vary with the role, but data itself is more often than not the key.

— "It's a very special key, meant only for The One."
— "What does it unlock?"
— "The future."
— The Matrix Reloaded

In the data engineering world, data is more than “just data” – it’s the lifeblood of our work. It’s all we work with, most of the time. Our code is data-centric, and we use the only real 5th-generation language there…
Kotlin API for Apache Spark: Streaming, Jupyter, and More
Hello, fellow data engineers! It’s Pasha here, and today I'm going to introduce you to the new release of the Kotlin API for Apache Spark. It's been a long time since the last major release announcement, mainly because we wanted to avoid bothering you with minor improvements. But today's announcement is huge! First, let me remind you what the Kotlin API for Apache Spark is and why it was created. Apache Spark is a framework for distributed computation. It is usually used by data engineers to solve various tasks, for example ETL processes. It supports multiple languages straight out…
dbt® deeper concepts: materialization
In the first part of this blog series, I described basic dbt® concepts such as installation, creating views, and describing models. I could have stopped there, but there are some drawbacks to using only views to build the whole transformation layer in our database. Sometimes we don't really need a view, and a view may run slowly even in databases oriented toward analytical workloads. I’ll start by giving an overview of ephemeral views. Ephemeral views In some cases, we don't really want to have an entity for a dbt® model; rather, we want this model to be inlined in oth…
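The ephemeral idea described in the excerpt above — no view or table is created; the model's select statement is spliced into downstream queries instead — can be sketched outside dbt with plain SQLite. This is an illustration only, not dbt code, and the table and model names are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO payments VALUES (?, ?)", [(1, 3.0), (2, 4.0)])

# An "ephemeral" model: just a select statement kept as text.
# No database object is created for it.
ephemeral_model = "SELECT id, amount * 100 AS cents FROM payments"

# A downstream query references it, and the SQL is inlined as a subquery:
downstream = f"SELECT SUM(cents) AS total_cents FROM ({ephemeral_model})"
total = conn.execute(downstream).fetchone()[0]
print(total)
```

The trade-off this sketch hints at: the database stays free of intermediate entities, but the inlined SQL is recomputed by every query that references it.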
How I started out with dbt®
For some time now, I’ve noticed that dbt® is gaining popularity. I’ve seen more and more questions and success stories, so a couple of days ago I decided to try it out. But what exactly is dbt anyway? Here is the first phrase you can find in its documentation: “dbt (data build tool) enables analytics engineers to transform data in their warehouses by simply writing select statements. dbt handles turning these select statements into tables and views.” It sounds interesting, but maybe that’s not entirely clear. Here’s my interpretation: dbt is a half-declarative tool for describing transformati…
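The core idea quoted from the dbt documentation — you write a select statement, and the tool wraps it in the DDL needed to materialize it as a view or a table — can be illustrated without dbt using plain SQLite. This is a sketch of the mechanism only, with made-up table and model names, not actual dbt code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "paid"), (2, 5.0, "refunded"), (3, 7.5, "paid")],
)

# A dbt "model" is just a select statement:
model_sql = "SELECT status, SUM(amount) AS total FROM orders GROUP BY status"

# The tool wraps it in DDL for you — materialized as a view...
conn.execute(f"CREATE VIEW revenue_by_status AS {model_sql}")
# ...or as a table:
conn.execute(f"CREATE TABLE revenue_by_status_tbl AS {model_sql}")

print(conn.execute("SELECT * FROM revenue_by_status").fetchall())
```

Both materializations answer the same query; the view recomputes on every read, while the table stores the result once.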
Kotlin API for Apache Spark 1.0 Released
The Kotlin API for Apache Spark is now widely available. This is the first stable release of the API that we consider to be feature-complete with respect to the user experience and compatibility with core Spark APIs. Get on Maven Central Let’s take a look at the new features this release brings to the API:

Typed select and sort
More column functions
More KeyValueGroupedDataset wrapper functions
Support for Scala TupleN classes
Support for date and time types
Support for maps encoded as tuples
Conclusion

Typed select and sort The Scala API has a typed select method that returns Dataset…
Big Data World, Part 6: PACELC
This is the sixth installment of our ongoing series on Big Data, how we see it, and how we build products for it. In this episode, we’ll cover the PACELC theorem. It is an extension of the CAP theorem that also describes the trade-offs distributed systems face even when no partition is happening.

Big Data World, Part 1: Definitions
Big Data World, Part 2: Roles
Big Data World, Part 3: Building Data Pipelines
Big Data World, Part 4: Architecture
Big Data World, Part 5: CAP Theorem
This article

After reading Big Data World, Part 5: CAP Theorem, you might think that this theorem hardly helps in actual de…
Big Data World, Part 5: CAP Theorem
This is the fifth installment of our ongoing series on Big Data, how we see it, and how we build products for it. In this episode, we’ll cover the CAP theorem. What is it? Is it correct? And why is it needed for data engineers?

Big Data World, Part 1: Definitions
Big Data World, Part 2: Roles
Big Data World, Part 3: Building Data Pipelines
Big Data World, Part 4: Architecture
This article

Table of contents:

Distributed systems
Consistency
Availability
Partition tolerance
Trade-offs
Criticism

The life of a data engineer is basically built on working with distributed systems. Every syst…
Big Data World, Part 4: Architecture
This is the fourth part of our ongoing series on Big Data, how we see it, and how we build products for it. In this installment, we’ll cover the second responsibility of data engineers: architecture.
Big Data World, Part 3: Building Data Pipelines
This is the third part of our ongoing series on Big Data, how we see it, and how we build products for it. In this installment, we’ll cover the first responsibility of the data engineer: building pipelines.
Big Data World, Part 2: Roles
In this part, we’ll talk about the roles of people working with Big Data. All these roles are data-centric, but they’re very different. Let’s describe them in broad brushstrokes to better understand who the people we target are.
Big Data World, Part 1: Definitions
This post is the first in a series about Big Data. It is aimed at telling you how we at JetBrains see Big Data and, consequently, how we're creating products for it. The world of big data can seem mysterious, hidden behind a curtain of unfamiliar jargon. It’s time to clear up this mystery and define Big Data.