IntelliJ Rust

Procedural macros under the hood: Part I

Have you ever wondered how Rust’s procedural macros work? In this blog post series, we will get into the details! A member of our team, Vladislav Beskrovny, recently gave a talk on the topic at RustCon. This series is based on that talk, with some slight modifications and additions.

In this post, we’ll look at the basics of macros in Rust and then get into the specifics of procedural macros, paying close attention to their API.

Macros in general

Macros are everywhere in Rust, but there are some programming languages that don’t use them at all. Let’s investigate what macros are and what opportunities they bring.

Macros serve three main purposes:

  • They allow you to write code that creates other code.
  • They allow you to expand the language syntax with custom constructs.
  • They help you reduce the amount of boilerplate code.

How macros generate new code

Let’s create a vector and push three numbers into it. The simplest code for this would be the following:

fn main() {
   let mut a = Vec::new();
   a.push(1);
   a.push(2);
   a.push(3);
}

We can rewrite this code using the standard library’s vec macro:
fn main() {
   let a = vec![1, 2, 3];
}

The vec![1, 2, 3] part is a call to the vec macro. It is a declarative macro with the following (simplified) declaration:
macro_rules! vec {
   ($($x:expr),+) => ({
       let mut v = Vec::new();
       $( v.push($x); )+
       v
   });
}

Here, $($x:expr),+ is called a macro pattern. The tokens inside the brackets of the macro call (1, 2, 3) are matched against this pattern:

[Figure: macro pattern]

Meta variables are then placed into the expansion template in the following way:

[Figure: macros meta variables]

Notice that the macro expansion code (on the right) looks very much like the initial code we used for this example. Indeed, when the expansion code replaces the macro call, it transforms the original code into the following:

fn main() {
   let a = {
       let mut v = Vec::new();
       v.push(1);
       v.push(2);
       v.push(3);
       v
   };
}

The compiler can then process this code as usual.
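
To try the expansion yourself, here is a minimal, self-contained sketch. The names my_vec and build are hypothetical and were chosen only for this illustration (so as not to shadow the standard vec); the real standard-library macro is more elaborate than this.

```rust
// A simplified stand-in for the standard `vec!` macro.
// `my_vec` is a hypothetical name used only for this illustration.
macro_rules! my_vec {
    // One or more comma-separated expressions, each bound to `$x`.
    ($($x:expr),+) => ({
        let mut v = Vec::new();
        // The `$( ... )+` block is repeated once per matched `$x`.
        $( v.push($x); )+
        v
    });
}

// Builds the same vector three ways: by hand, with `my_vec!`, and with `vec!`.
fn build() -> (Vec<i32>, Vec<i32>, Vec<i32>) {
    let mut by_hand = Vec::new();
    by_hand.push(1);
    by_hand.push(2);
    by_hand.push(3);
    (by_hand, my_vec![1, 2, 3], vec![1, 2, 3])
}
```

All three vectors returned by build() should compare equal, since the macro call is replaced by the same sequence of push calls before the compiler type-checks the code.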

How macros create new syntax

Let’s take a look at the procedural macro html from the yew library, which helps write web frontends in Rust (we will look at procedural macros later).

Here is an example of how this macro can be called:

use yew::html;
html! {
      <div>
         <div class="panel">
             { "Hello, World!" }
         </div>
      </div>
}

The macro call doesn’t look like Rust at all, does it? But html parses the body of its call as an HTML-like language and generates a hierarchical structure called a virtual DOM. The resulting expansion is pure Rust, which can then be compiled by rustc:
VTag::new(
   "div",
   vec![VTag::new("div", ...)],
);

This is how a macro can embed another language into Rust.

Tip: whitespace is not preserved in the macro body, so there’s no way to write, for example, a macro that would embed a whitespace-sensitive language like Python.

More examples of how macros can help create custom syntax include:

  • Collection literals like vec from the standard library.
  • Text formatting, provided by macros like println or format (println is a declarative macro that expands to a procedural macro, format_args_nl, which is built into rustc).

Note that the body of a macro call must still consist of valid Rust tokens: symbols that the Rust lexer rejects are not allowed inside it. In short, a procedural macro can accept custom syntax only as long as it is built from tokens that already exist in Rust.
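
The point about whitespace can be checked with a small declarative macro. This is only a sketch: count_tts, compact, and spaced are hypothetical names, and the macro simply counts the pieces of input it receives without ever evaluating them.

```rust
// `count_tts` is a hypothetical helper macro: it counts the pieces
// (token trees) of its input and never evaluates them.
macro_rules! count_tts {
    () => { 0usize };
    ($head:tt $($tail:tt)*) => { 1usize + count_tts!($($tail)*) };
}

// The same three tokens written without any spacing...
fn compact() -> usize {
    count_tts!(a+b)
}

// ...and spread over several lines with extra indentation.
fn spaced() -> usize {
    count_tts!(
        a
            +
        b
    )
}
```

Both functions see the same three tokens (a, +, b), so layout alone cannot carry meaning inside a macro body.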

How macros help reduce boilerplate code

To illustrate this, we’ll use the example of writing a structure. Usually, there are many traits to be implemented:

struct Foo { x: i32, y: i32 }
impl Copy for Foo { ... }
impl Clone for Foo { ... }
impl Ord for Foo { ... }
impl PartialOrd for Foo { ... }
impl Eq for Foo { ... }
impl PartialEq for Foo { ... }
impl Debug for Foo { ... }
impl Hash for Foo { ... }

This is where derive, a procedural macro, comes in handy. The implementations above can be replaced with the derive macros built into rustc:
#[derive(Copy, Clone, Ord, PartialOrd, Eq, PartialEq, Debug, Hash)]
struct Foo { x: i32, y: i32 }

Each derive will generate a particular impl based on the original structure.
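
As a quick illustration of what the generated impls give you, the sketch below exercises several of the derived traits. The demo function and the concrete field values are assumptions made up for this example.

```rust
use std::collections::HashSet;

#[derive(Copy, Clone, Ord, PartialOrd, Eq, PartialEq, Debug, Hash)]
struct Foo { x: i32, y: i32 }

// Exercises the derived impls and returns the observations.
fn demo() -> (bool, bool, String, usize) {
    let a = Foo { x: 1, y: 2 };
    let b = a;                          // Copy: `a` stays usable after this
    let equal = a == b;                 // PartialEq
    let less = a < Foo { x: 1, y: 3 };  // PartialOrd: fields compared in declaration order
    let text = format!("{:?}", a);      // Debug
    let mut set = HashSet::new();       // Hash + Eq
    set.insert(a);
    set.insert(b);
    (equal, less, text, set.len())
}
```

None of these operations required a hand-written impl block; every one of them relies on code generated from the struct definition.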

Procedural macros

Essentially, a procedural macro is a Rust function executed at compile time. Such functions belong to a special crate marked with the proc-macro flag. In Cargo.toml, this looks like the following:

[package]
name = "my-proc-macro"
version = "0.1.0"
edition = "2021"

[lib]
proc-macro = true

Types of procedural macros

There are three types of procedural macros:

  • Function-like procedural macros
    These macros are declared using the #[proc_macro] attribute and are invoked with the ! syntax, just like declarative macros:
    #[proc_macro]
    pub fn foo(body: TokenStream) -> TokenStream { ... }
    …
    foo!( foo bar baz );
  • Custom derive procedural macros
    These macros are declared using the #[proc_macro_derive] attribute and are used in #[derive] for structures and enums:
    #[proc_macro_derive(Bar)]
    pub fn bar(body: TokenStream) -> TokenStream { ... }
    …
    #[derive(Bar)]
    struct S;
  • Custom attributes
    These macros are declared using #[proc_macro_attribute] and are called as item attributes:
    #[proc_macro_attribute]
    pub fn baz(
       attr: TokenStream,
       item: TokenStream
    ) -> TokenStream { ... }
    …
    #[baz]
    fn some_item() {}

Procedural macros API

Procedural macro body

Let’s first clarify what a procedural macro body is. In the case of a function-like macro, the body is everything between the round brackets:

[Figure: proc macro body]

In the case of a custom derive macro, the body is the whole attributed structure:

[Figure: proc macro body]

For an attribute macro, the body includes the whole item (fn some_item() {}). The attribute itself can also carry tokens; they are passed to the function as a separate argument (attr):

[Figure: proc macro body]

To illustrate this, we’ll examine an identity macro, which simply returns the body that it takes, without doing anything else:

extern crate proc_macro;
use proc_macro::TokenStream;

#[proc_macro]
pub fn foo(body: TokenStream) -> TokenStream {
   return body
}

Suppose we have a program that calls hello(), where hello is defined inside a foo! call. Because foo returns its body unchanged, the expanded code looks as if there had been no macro in the first place:
use my_proc_macro::*;

// foo! {
   fn hello() {
       println!("Hello, world!");
   }
// }

fn main() {
   hello();
}

Similarly, this could be written with an attribute macro:
extern crate proc_macro;
use proc_macro::TokenStream;

#[proc_macro_attribute]
pub fn baz(
   attr: TokenStream,
   item: TokenStream
) -> TokenStream { 
   return item
}
…
use my_proc_macro::*;

#[baz]
fn hello() {
   println!("Hello, world!");
}

fn main() {
   hello();
}

Tokens, TokenStream, and TokenTree

The body of a procedural macro is divided into pieces called tokens:

[Figure: tokens]

A token is a string of characters with a particular type, which is assigned to it when the macro body is parsed. There are three types of tokens: identifiers, punctuation symbols, and literals.

Procedural macros operate with special data types from the proc_macro crate, which is a part of the standard library and is linked automatically when procedural macros are compiled. One of these special types, TokenTree, is an enum of the possible token types:

struct TokenStream(Vec<TokenTree>);
enum TokenTree {
   Ident(Ident),
   Punct(Punct),
   Literal(Literal),
   ...
}

Another data structure, TokenStream, represents the list of tokens and allows you to iterate the token list (body.into_iter()):
use proc_macro::{TokenStream, TokenTree};

#[proc_macro]
pub fn foo(body: TokenStream) -> TokenStream {
   // Print the kind of each top-level token to stderr at compile time.
   for tt in body.into_iter() {
       match tt {
           TokenTree::Ident(_) => eprintln!("Ident"),
           TokenTree::Punct(_) => eprintln!("Punct"),
           TokenTree::Literal(_) => eprintln!("Literal"),
           _ => {}
       }
   }
   // Expand to nothing.
   TokenStream::new()
}

$ cargo build

Ident
Punct
Literal
Punct

There is one more enum variant in the TokenTree, which is called Group:
enum TokenTree {
   Ident(Ident),
   Punct(Punct),
   Literal(Literal),
   Group(Group),
}

Groups appear when the parser encounters brackets. The brackets that form a group can be round brackets, square brackets, or braces.

For example, a macro with the following body

foo!( foo { 2 + 2 } bar );

will be parsed into two identifiers (foo and bar) and a group ({2+2}). A group here includes braces and another TokenStream (literals 2 and 2, and a punctuation symbol +):

[Figure: TokenStream and TokenTree]

We can see that TokenStream is not strictly a stream. It’s a kind of tree, where each node is formed by brackets and the leaves represent singular tokens.
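
The tree shape can be made concrete with a toy model in plain Rust. Note that ToyTree, count_leaves, and sample_body are hypothetical names invented for this sketch; they only mimic the structure described above and are not the real proc_macro types, which can be used only inside a proc-macro crate.

```rust
// A toy model of the token tree; these types are NOT the real `proc_macro` API.
#[derive(Debug, Clone, PartialEq)]
enum ToyTree {
    Ident(String),
    Punct(char),
    Literal(String),
    // A group holds its opening delimiter and a nested token list.
    Group(char, Vec<ToyTree>),
}

// Counts leaf tokens (everything except groups) at any nesting depth.
fn count_leaves(stream: &[ToyTree]) -> usize {
    stream
        .iter()
        .map(|tt| match tt {
            ToyTree::Group(_, inner) => count_leaves(inner),
            _ => 1,
        })
        .sum()
}

// Models the body of `foo!( foo { 2 + 2 } bar )`.
fn sample_body() -> Vec<ToyTree> {
    vec![
        ToyTree::Ident("foo".to_string()),
        ToyTree::Group(
            '{',
            vec![
                ToyTree::Literal("2".to_string()),
                ToyTree::Punct('+'),
                ToyTree::Literal("2".to_string()),
            ],
        ),
        ToyTree::Ident("bar".to_string()),
    ]
}
```

In this model, the top level of the sample body has three nodes (two identifiers and one group), while the tree as a whole contains five leaf tokens.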

How to write a procedural macro

Let’s write a simple procedural macro that will expand into a function call and pass all the arguments given to it:

[Figure: simple proc macro]

Here’s a variant of how we could write it:

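use proc_macro::{Delimiter, Group, Ident, Span, TokenStream, TokenTree};

#[proc_macro]
pub fn foo(body: TokenStream) -> TokenStream {
   // Emit the identifier `foo` followed by the macro body wrapped in parentheses.
   [
       TokenTree::Ident(Ident::new("foo", Span::mixed_site())),
       TokenTree::Group(Group::new(Delimiter::Parenthesis, body)),
   ].into_iter().collect()
}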

In the code above, we do the following:

  • Create an array of two elements
    1) Identifier foo: Ident(Ident::new("foo", Span::mixed_site()))
    2) A group with round brackets in which we place the macro body: Group(Group::new(Delimiter::Parenthesis, body)). Here, body is the parameter of our macro function: foo(body: TokenStream)
  • Collect the array into a TokenStream: .into_iter().collect()

Now the macro can be called in this way:

fn main() {
   foo!(1, 2);
}

Let’s see how our macro will be processed.

The macro body is 1, 2. When expanded, the body will be wrapped in parentheses and prepended with foo, so that it looks just like a function call:

[Figure: processing of a proc macro]
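
The same expansion can be tried out without setting up a proc-macro crate by using a declarative analogue. This is a sketch under assumptions: call_foo and sum_via_macro are hypothetical names, and the signature of the function foo (two i32 arguments, returning their sum) is made up for the example.

```rust
// An ordinary function for the macro to call; its signature is an assumption.
fn foo(a: i32, b: i32) -> i32 {
    a + b
}

// `call_foo` is a hypothetical declarative analogue of the procedural `foo!`:
// it captures the whole body as token trees and wraps it in `foo( ... )`.
macro_rules! call_foo {
    ($($body:tt)*) => { foo($($body)*) };
}

fn sum_via_macro() -> i32 {
    // Expands to `foo(1, 2)`.
    call_foo!(1, 2)
}
```

Like the procedural version, this macro does not inspect its arguments at all; it forwards the body tokens untouched and leaves type checking to the compiler.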

Spans

Why are TokenStream and TokenTree necessary for the procedural macros API? Why aren’t raw strings enough? One could imagine code like this (which will not work):

#[proc_macro]
pub fn foo(body: String) -> String { // this doesn't work!
   format!("foo({})", body)
}

To understand why the code above doesn’t work, we need to go back to the token structure.

Besides the type and the actual string of symbols, a token structure also includes a Span:

[Figure: spans]

Span contains information about where in the original code the token was placed. This is necessary for the compiler to highlight the errors correctly.

For example, we can take the same macro and intentionally pass a string instead of a numeric value into the call:

[Figure: error in proc macro]

Since the function expects an i32 value, the compiler will report an error. But where will the compiler report the error? It will be shown at the token where the error would be expected if it were a regular function call, not at the whole macro call:

error[E0308]: mismatched types
  --> src/main.rs:26:13
   |
26 |     foo!(1, "");
   |             ^^ expected `i32`, found `&str`

This is possible because we passed the whole TokenStream into the expansion, and each token contains a Span. A Span tells the compiler which fragment of the original source a given fragment of the macro expansion corresponds to. This way, the compiler can map errors that occur while compiling the expanded code back to the original code.
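
To see why carrying positions along matters, consider the following toy model in plain Rust. Everything here is hypothetical (ToySpan, ToyToken, expand, first_string_literal, sample_call_body) and heavily simplified: it is not the real proc_macro API, and marking synthesized tokens with None is a simplification made for this sketch.

```rust
// A toy model of spans; these types are NOT the real `proc_macro` API.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ToySpan {
    line: usize,
    column: usize,
}

#[derive(Debug, Clone, PartialEq)]
struct ToyToken {
    text: String,
    // `None` marks a token synthesized by the macro itself (a simplification).
    span: Option<ToySpan>,
}

fn token(text: &str, span: Option<ToySpan>) -> ToyToken {
    ToyToken { text: text.to_string(), span }
}

// Models the expansion `foo!(body)` -> `foo(body)`: new tokens get no source
// position, while body tokens keep the spans they had at the call site.
fn expand(body: Vec<ToyToken>) -> Vec<ToyToken> {
    let mut out = vec![token("foo", None), token("(", None)];
    out.extend(body);
    out.push(token(")", None));
    out
}

// A stand-in "type check": reports the span of the first string literal.
fn first_string_literal(expanded: &[ToyToken]) -> Option<ToySpan> {
    expanded
        .iter()
        .find(|t| t.text.starts_with('"'))
        .and_then(|t| t.span)
}

// Models the call `foo!(1, "");` from the error message above.
fn sample_call_body() -> Vec<ToyToken> {
    vec![
        token("1", Some(ToySpan { line: 26, column: 10 })),
        token(",", Some(ToySpan { line: 26, column: 11 })),
        token("\"\"", Some(ToySpan { line: 26, column: 13 })),
    ]
}
```

Because the string literal keeps its span through the expansion, the stand-in check can still point at line 26, column 13 of the original file. Had the body been flattened into a plain String, that position would have been lost.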

To summarize, the procedural macros API is built from the following blocks:

  • A TokenStream is a sequence of TokenTrees.
  • A TokenTree is an enum of three token types (Ident, Punct, Literal) plus Group.
  • A Group is formed by a pair of brackets and contains a nested TokenStream.
  • Each token has a Span, which is used for error mapping.

________________________________

Are procedural macros any clearer to you after this deep dive? In the second part of this series, we will cover the process of procedural macro compilation, the ABI in use, and the IDE’s way of dealing with them.

Stay tuned!

Your Rust team

JetBrains
The Drive to Develop