dbt® deeper concepts: materialization
In the first part of this blog series, I covered basic dbt® concepts such as installation, creating views, and describing models. I could have stopped there, but there are some drawbacks to building the whole transformation layer in our database out of views alone: sometimes we don't really need a view at all, and a view may run slowly even in databases oriented toward analytical workloads. I’ll start by giving an overview of ephemeral models. Ephemeral models In some cases, we don't really want a separate entity in the database for a dbt® model; rather, we want this model to be inlined in oth
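An ephemeral model is configured directly in the model's SQL file. A minimal sketch, with hypothetical model and column names:

```sql
-- models/staging/stg_orders.sql (hypothetical model name)
-- With materialized='ephemeral', dbt creates no table or view for this model;
-- its SQL is instead inlined as a CTE into every model that refs it.
{{ config(materialized='ephemeral') }}

select
    order_id,
    customer_id,
    amount
from {{ ref('raw_orders') }}
```

Any downstream model that calls `{{ ref('stg_orders') }}` gets this query spliced in as a common table expression at compile time.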
How I started out with dbt®
For some time now, I’ve noticed that dbt® has been gaining popularity. I’ve seen more questions and more success stories, so a couple of days ago I decided to try it out. But what exactly is dbt anyway? Here is the first phrase you can find in its documentation: “dbt (data build tool) enables analytics engineers to transform data in their warehouses by simply writing select statements. dbt handles turning these select statements into tables and views.” It sounds interesting, but maybe that’s not entirely clear. Here’s my interpretation: dbt is a half-declarative tool for describing transformati
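To illustrate the quoted definition: a dbt model is just a file containing a select statement, and dbt itself wraps it in the DDL needed to materialize it as a table or view. A minimal sketch, with hypothetical file and column names:

```sql
-- models/customer_totals.sql (hypothetical model)
-- dbt compiles this into CREATE VIEW/TABLE customer_totals AS <select>,
-- depending on the configured materialization.
select
    customer_id,
    sum(amount) as total_amount
from {{ ref('raw_orders') }}
group by customer_id
```

The `ref()` call is how dbt resolves the dependency on another model and orders the builds accordingly.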
Kotlin API for Apache Spark 1.0 Released
The Kotlin API for Apache Spark is now widely available. This is the first stable release of the API, which we consider feature-complete with respect to the user experience and compatibility with core Spark APIs. Get it on Maven Central. Let’s take a look at the new features this release brings to the API: typed select and sort, more column functions, more KeyValueGroupedDataset wrapper functions, support for Scala TupleN classes, support for date and time types, and support for maps encoded as tuples. Typed select and sort: the Scala API has a typed select method that returns Dataset
Big Data World, Part 6: PACELC
This is the sixth installment of our ongoing series on Big Data, how we see it, and how we build products for it. In this episode, we’ll cover the PACELC theorem. It is an extension of the CAP theorem that also describes the trade-offs distributed systems face even when no partition is happening. Big Data World, Part 1: Definitions; Big Data World, Part 2: Roles; Big Data World, Part 3: Building Data Pipelines; Big Data World, Part 4: Architecture; Big Data World, Part 5: CAP Theorem; this article. After reading Big Data World, Part 5: CAP Theorem, you might think that this theorem hardly helps in actual de
Big Data World, Part 5: CAP Theorem
This is the fifth installment of our ongoing series on Big Data, how we see it, and how we build products for it. In this episode, we’ll cover the CAP theorem. What is it? Is it correct? And why do data engineers need it? Big Data World, Part 1: Definitions; Big Data World, Part 2: Roles; Big Data World, Part 3: Building Data Pipelines; Big Data World, Part 4: Architecture; this article. Table of contents: distributed systems, consistency, availability, partition tolerance, trade-offs, and criticism. The life of a data engineer is basically built on working with distributed systems. Every syst
Big Data World, Part 4: Architecture
This is the fourth part of our ongoing series on Big Data, how we see it, and how we build products for it. In this installment, we’ll cover the second responsibility of data engineers: architecture.
Big Data World, Part 3: Building Data Pipelines
This is the third part of our ongoing series on Big Data, how we see it, and how we build products for it. In this installment, we’ll cover the first responsibility of the data engineer: building pipelines.
Big Data World, Part 2: Roles
In this part, we’ll talk about the roles of people working with Big Data. All these roles are data-centric, but they’re very different. Let’s describe them in broad brushstrokes to get a better picture of the people we’re targeting.
Big Data Tools Update Is Out: Experimental Python Support and Search Function in Zeppelin Notebooks
The Big Data Tools plugin for version 2021.1 of IntelliJ IDEA Ultimate, PyCharm Professional, and DataGrip has been released. You can install it from the JetBrains Marketplace or from inside your IDE. The plugin allows you to edit Zeppelin notebooks, upload files to cloud filesystems, and monitor Hadoop and Spark clusters. In this release, we've added experimental Python support and global search inside Zeppelin notebooks. We’ve also addressed a variety of bugs. Let's talk about the details. Experimental and preliminary Python support: Although PySpark in Zeppelin is getting a lot of
Big Data Tools Plugin for Apache Zeppelin
Zeppelin is a web-based notebook for data engineers that enables data-driven, interactive analytics with Spark, Scala, and more. The project recently reached version 0.9.0-preview2 and is being actively developed, but many things remain to be implemented. One of them is an API for getting comprehensive information about what's going on inside a notebook. There is already an API that fully covers high-level notebook management, but it doesn’t help if you want to do anything more complex. That was a real problem for Big Data Tools, a plugin for IntelliJ