Data Engineers Are Like Plumbers Who Install Pipes for Big Data
Roman Poborchiy interviewed Pasha Finkelshteyn, a Big Data IDE developer advocate. Pasha loves talking to people about big data and has broad experience across the IT sphere. He also has a degree in psychology and is a speaker, author, and host of several podcasts.
Roman: Good morning!
Pasha: Morning! Though 10:30 AM is not actually morning for me. I’m generally an early bird, but in December I have one more reason to get up early. Every day, I solve programming puzzles in Advent of Code. Do you know what that is?
Roman: There’s a hint in the name, but you’d better tell me.
Pasha: This year, more than a hundred thousand people are participating. Every day, new programming challenges are posted on the website, primarily in mathematics and algorithms. You can solve them in any programming language and submit your solutions. The task for the day is the same for everyone, but the input data, against which your solution is checked, is different for each participant.
Roman: And why wake up early?
Pasha: In my time zone (GMT+1), a new task is published at 6 AM, and ideally it has to be dealt with around the same time. To be honest, I’m not enough of an early bird to start it at 6 AM, but I do my best to get it done as early as possible. And since many of my peers are in Europe too, the competition is still tight!
Not to mention that I enjoy solving problems before getting down to work. It sort of gets me in a working mood.
Roman: Great! So there you are, in the right mood, and your workday begins. What do you do?
Pasha: I work as a developer advocate on the Big Data Tools team.
My job is about making users awesome. There is an amazing book called Badass: Making Users Awesome about how we should work to make people productive and happy at work.
I try to educate people on various topics in the hope that someday they will talk to me, and I will be able to listen to them and say, “Yes, we can implement this for you in Big Data Tools.”
The phenomenon of developer advocacy appeared in response to the fact that IT people don’t buy products that are advertised in traditional ways. They want to understand why the product exists. Internally, advocacy is often a kind of sales – people come to their managers and tell them which tool they want to use.
I’d like to believe that someday our Big Data Tools will become the default tool.
Roman: The default tool for whom?
Pasha: Big Data Tools is a plugin for data engineers…
Roman: Let me stop you right there. From your point of view, who is a data engineer? What kind of profession is it?
Pasha: I really like to make an analogy with plumbers when I talk about data engineers.
We always have sources and receivers, we can collect data, transfer it, pour it back and forth, and perhaps transform it with every operation. It can be a complex system, like plumbing or sewage. And along the way, we can say how to lay pipes correctly and in which facilities we should store data.
In other words, data engineers are people who know how to handle data. At the same time, they normally don’t extract any business value from the data, neither analytical nor scientific. For them, data is just a raw material that needs to be prepared and handed over to other experts.
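To make the plumbing analogy concrete, here is a minimal sketch in Python. It is only an illustration, not something discussed in the interview: the names `source`, `transform`, `sink`, `normalize_name`, and the sample records are all hypothetical. Data flows from a source, through a transformation, into a storage facility, and the code never interprets what the data means for the business.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, str]


def source(rows: Iterable[Record]) -> Iterator[Record]:
    """The tap: yields raw records one at a time."""
    for row in rows:
        yield row


def transform(records: Iterable[Record],
              step: Callable[[Record], Record]) -> Iterator[Record]:
    """A pipe segment: applies one transformation to every record passing through."""
    for record in records:
        yield step(record)


def sink(records: Iterable[Record], storage: List[Record]) -> int:
    """The receiver: stores records and returns how many were written."""
    count = 0
    for record in records:
        storage.append(record)
        count += 1
    return count


def normalize_name(record: Record) -> Record:
    """Hypothetical cleanup step: trim whitespace and lowercase the name field."""
    return {**record, "name": record["name"].strip().lower()}


# Hypothetical sample data standing in for a real source system.
raw_rows = [{"name": "  Alice "}, {"name": "BOB"}]
warehouse: List[Record] = []

written = sink(transform(source(raw_rows), normalize_name), warehouse)
```

Because each stage is a generator, records move through the pipe one at a time, and further `transform` stages can be chained in without changing the source or the sink.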
Roman: OK, let’s assume this is clear. What’s the problem with tools for data engineers?
Pasha: I think that Big Data is still too young a field. That’s why we are seeing a growing number of specializations, deepening knowledge in particular sub-fields, and that’s why we have data scientists, data analysts, and data engineers.
Sometimes, they distinguish between a lot of different kinds of data engineers. Often people are engaged in one very specific task which eats up all eight hours of their working time.
This isn’t cool. I believe that people should be pentagon-shaped. It’s when all of your knowledge is deep, and only one point of the figure, in my case it’s Java, is a little deeper than everything else.
Roman: Why doesn’t it work that way in Big Data?
Pasha: The field is not mature enough.
The tooling is still poor, and we need a lot of people who know how to work with different tools. A lot of things just haven’t gotten around to being done yet. For example, there is Apache Spark, a very popular tool. And, let’s say, we stumble upon a bug.
Roman: Our bug or a Spark bug?
Pasha: Our bug. In our code. If we have a bug in the backend of a standard enterprise application, we usually have an understandable stack trace. We can try debugging or using debug prints – these are working methods.
But there’s no way you can debug Spark. And you can’t do debug prints – you have too much data and won’t find what you’re looking for in the debug output. You can’t even download this output. So you have to look for the bug analytically, which isn’t always a trivial task.
Roman: You mean the only thing you can do is read the code?
Pasha: Sort of. I really hope that one day JetBrains will make a debugger for Spark. I have an idea how to approach the task, and we’re considering it.
And I’ve only mentioned Spark, which is the most advanced of all the tools. There is this picture, State of BigData Ecosystem, with more than a thousand technologies. There are so many because none of them solves the problem perfectly.
Roman: But, as a business, we don’t want to wait until the ecosystem evolves. We want to be able to act right now, even though the field is immature. How do you develop Big Data Tools?
Pasha: Actually, the fact that the industry is immature is to our advantage. There are so many tools on the market, they are so diverse and achieve such different goals, that we can integrate all of them (ultimately) and solve all the problems. When the industry matures, the number of tools will get smaller, but we will still support these tools and thus we’ll support most data engineers.
Roman: Or we could make our tool the perfect tool?
Pasha: That’s possible, but there’s a problem. At the moment, the no-code and low-code paradigms are becoming increasingly popular in data engineering.
If by chance the perfect tool happens to be a no-code tool, there will simply be no place for us. We make great IDEs, but visual programming has never been our strong suit. We have a different focus.
Therefore, beating a tool originally designed as no-code on its home field seems unrealistic.
For now, we’re doing well in the absence of a universal tool, where we can integrate a million different technologies into our solution. We can say, “No matter what you’ve got under the hood, you can work with it in our IDEs, and we will provide a consistent user experience.”
Roman: Could you give an example?
Pasha: For example, we’ve recently added an integration with Amazon EMR. It is not only a MapReduce, but a cluster with a Spark-like UI. This work was finished and rolled out, along with many other exciting things like the Tencent Cloud and Alibaba Cloud integrations.
Our next step could be to support Google Cloud Dataproc, to reach feature parity across the major cloud vendors.
Then there is integration with Zeppelin, which allows you to work with notebooks without having to leave the IDE. A lot of work has been done here. With a bit of luck, we might be able to reuse part of this work to support other notebooks.
And the work on massively requested features — dbt support and Avro Schema Registry — is in progress too.
Roman: What made this integration so hard to implement? You have a Zeppelin notebook, which can be located anywhere, and you need to connect to it?
Pasha: Connection is no big deal. The hardest part was to make a Zeppelin editor because it’s not just about showing a web interface in the IDE. It is a full-featured editor with lots of cells where every cell is another editor. It renders HTML for you and all sorts of settings.
We’ve also made ZTools. It’s a tool that shows you the schema of your Spark data frames. We know this schema and can offer autocompletion for columns in Spark SQL. The database doesn’t exist in reality, but we can create a synthetic Spark database, autocomplete column names in your code, and check whether the arguments are passed correctly.
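As a rough illustration of the idea, and not of how ZTools is actually implemented, here is a small Python sketch of schema-driven completion and checking. Everything in it is an assumption made for the example: the `SCHEMAS` dictionary, the `orders` table and its columns, and the helper names `complete_column` and `unknown_columns`. Once the column names of a data frame are known, suggesting completions and flagging misspelled columns becomes a simple lookup.

```python
from typing import Dict, List

# Hypothetical schema, standing in for what a tool could learn from a live Spark session.
SCHEMAS: Dict[str, List[str]] = {
    "orders": ["order_id", "customer_id", "order_total", "created_at"],
}


def complete_column(table: str, prefix: str) -> List[str]:
    """Return the known columns of `table` that start with `prefix`, sorted."""
    return sorted(c for c in SCHEMAS.get(table, []) if c.startswith(prefix))


def unknown_columns(table: str, used: List[str]) -> List[str]:
    """Return the columns referenced in code that the schema does not contain."""
    known = set(SCHEMAS.get(table, []))
    return [c for c in used if c not in known]


suggestions = complete_column("orders", "order_")
typos = unknown_columns("orders", ["order_id", "custmer_id"])
```

Here `suggestions` holds the two columns that begin with `order_`, and `typos` holds the misspelled `custmer_id`, which is the kind of mistake that would otherwise surface only when the job runs.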
Roman: And how do you choose and prioritize the directions for Big Data Tools development? How do you decide what needs to be done first?
Pasha: I do the research, but not the prioritization. But I can tell you how I do my research.
I hang out on data engineer forums, such as the international data engineers’ Slack channel. There are not many members, around a thousand people, but it’s interesting to read the questions that they ask.
These questions can’t always be answered, but they help me understand the trends. Then I go to our team lead, Ekaterina Podkhaliuzina, and tell her, “There is this tool, and people have been showing a lot of interest in it; perhaps we could help them.”
I have a kind of threshold in mind that helps me judge whether a tool is popular. For example, a couple of days ago I realized that dbt (data build tool) had crossed it. It allows you to describe your data model in YAML. In a nutshell, it allows you to create data marts.
It is extremely popular now, and perhaps we’ll be able to come up with something to support it. Since we have connectors to everything, we could provide coding assistance for SQL inside the models.
Roman: In order to follow what’s happening, you have to have some understanding of everything, right?
Pasha: I like the idea of bringing domains together. Anton Keks, a great speaker and person, in his talk “The world needs full-stack craftsmen” describes the need for artisans who can do any job. This resonates with me a lot because it fits well with my experience. I can do almost any job, except for something very science-intensive, like compilers or advanced mathematics. I can develop frontend with React or write databases – these are the things I’ve learned by doing.
Roman: How did you get all this experience? I know that you have a degree in organizational psychology. What is it all about?
Pasha: The aspect of psychology that has been gaining popularity deals with the issues of individuals. Psychotherapy, and all things related, can be called personality psychology.
But psychologists are needed not only to help individuals; sometimes they can also help organizations solve their issues. A popular request is formulating company values.
Roman: What brought you from psychology to software development? You once said that it was out of despair. Why despair? I’ve come across many psychological startups – it seems that a lot is going on in the sphere. I mean, there’s life there.
Pasha: It’s definitely not dead. A lot of interesting things are going on.
As for me, it is worth starting with my connection to IT. My parents were and still are mathematicians and developers. We got our first PC when I was 6 years old, back in 1992.
But I never did anything seriously with it and did not think I could. When I was 8 years old, I tried learning programming together with a girl who was my mother’s student. It didn’t work out, and my mother said that programming was probably not for me.
After university, I didn’t have that many options. I could do an internship in a consulting firm and work as a psychologist, but at that time it was very little money, and I was already married and wanted to earn a little more.
The other option was to try something different. Between ages 6 and 22, I managed to learn a thing or two about computers, I enjoyed exploring software and got a job in technical support at a company called Poligor. I’m grateful to them, although they fired me on the last day of my probation, saying that IT was probably not for me.
It wasn’t the first time I heard this, as you can imagine.
Roman: But that didn’t stop you.
Pasha: Right. I started working for Philips in technical support, but it was too bureaucratic for me, and to be honest, they also fired me.
Then I got a job at an insurance company where it turned out that during my time at Philips and Poligor I had become a good administrator. But the company started going bankrupt, and I got downsized. This time they didn’t say that IT was not for me.
Then a classmate invited me to a scientific institute to write code for them as an intern. I told him right away that programming wasn’t my thing. He replied that I didn’t have many options, but had my head screwed on right and had to try.
So I started programming with smart forms. This development paradigm doesn’t exist anymore.
Roman: Was it the visual programming that wasn’t our strong suit at JetBrains?
Pasha: Probably, although the term is from Delphi. And you know, I did it. And then it turned out that you can still write code on top of the forms, and I did that too.
I worked for 5 years in a different team at the same place. We developed secondary monitoring systems in Java for nuclear power plants all over Russia. It was an interesting and rewarding job.
You know what was the best part? I had a chance to explore a gazillion technologies. I had an amazing team leader who allowed me to experiment. After that, job interviews were a piece of cake, because I had gained some experience with pretty much everything. I used to include a huge list of technologies in my CV. Of course, I don’t do that anymore, but back then I could.
And then things were off and rolling. I became a developer, then a team lead, then worked as a CTO for a while with 35 people under me. Next, I returned to being a team lead and then moved to a regular (non-managerial) data engineer position.
Roman: Do you sometimes feel that you lack fundamental education in programming? If so, how do you fill the gaps?
Pasha: Honestly, working at JetBrains, it’s hard not to have this feeling. Too many people around you are very well versed in programming.
Roman: Looking closer, everyone at JetBrains, including marketers and UX-researchers, writes code. They are often very good at programming, regardless of their job title.
Pasha: Sure thing. How do I fill the gaps? The honest answer is that I don’t. I usually do research when I need to solve a particular problem.
Going back to the talk that I’ve mentioned about full-stack craftsmen, Anton said something cool: “The more you have learned, the easier it is for you to learn new things.” Polyglots can confirm this: they say that the more languages you speak, the easier it is for you to master a new one.
But not everyone can master languages easily. For some people, it requires a huge effort. It’s worth learning all sorts of things. I learned how to make clay pots at one point. It is also fun, and it increases your brain’s plasticity too. Honestly, I don’t really think that the ability to learn declines with age. Perhaps we become slower learners, but if you keep training this muscle, so to speak, it will be easier to learn new things.
So I keep trying new technologies. At my current job, I can afford to try anything.
Roman: Are you happy with your current role?
Pasha: Totally, I do what I enjoy doing. I enjoy talking to people and I’m always on the lookout for curious things that are going on in the industry. I’m really looking forward to the return of offline events (many conferences are still not in-person) because it is so much easier to meet new people there.
Roman: As a final thought, let’s hope that we and our readers can get back to enjoying in-person events as soon as possible.
Pasha: Right – and without having to worry so much.