Procedural macros under the hood: Part I
Have you ever wondered how Rust’s procedural macros work? In this blog post series, we will get into the details! A member of our team, Vladislav Beskrovny, recently gave a talk on the topic at RustCon. This series is based on that talk, with some slight modifications and additions.
In this post, we’ll look at the basics of macros in Rust and then get into the specifics of procedural macros, paying close attention to their API.
Macros in general
Macros are everywhere in Rust, but there are some programming languages that don’t use them at all. Let’s investigate what macros are and what opportunities they bring.
Macros serve three main purposes:
- They allow you to write code that creates other code.
- They allow you to expand the language syntax with custom constructs.
- They help you reduce the amount of boilerplate code.
How macros generate new code
Let’s create a vector and push three numbers into it. The simplest code for this would be the following:

    fn main() {
        let mut a = Vec::new();
        a.push(1);
        a.push(2);
        a.push(3);
    }

We can rewrite this code using the standard library’s vec macro:

    fn main() {
        let a = vec![1, 2, 3];
    }

The vec![1, 2, 3] part is a call to the vec macro. This macro is a declarative macro with the following (simplified) declaration:

    macro_rules! vec {
        ($($x:expr),+) => ({
            let mut v = Vec::new();
            $( v.push($x); )+
            v
        });
    }

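Here, $($x:expr),+ is called a macro pattern. The body of the macro call, the 1, 2, 3 inside the brackets, is matched against this pattern, so the metavariable $x is bound to each of the expressions 1, 2, and 3 in turn.
The metavariables are then substituted into the expansion template: the repetition $( v.push($x); )+ produces one v.push(...) call per matched expression.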
Notice that the macro expansion code looks very much like the initial code we used for this example. Indeed, when the expansion code replaces the macro call, it transforms the original code into the following:

    fn main() {
        let a = {
            let mut v = Vec::new();
            v.push(1);
            v.push(2);
            v.push(3);
            v
        };
    }

The compiler can then process this code as usual.
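To experiment with this pattern-and-template mechanism, we can declare our own copy of the simplified macro. This is only a sketch: the names my_vec and build_numbers are hypothetical, chosen so that the standard vec macro is not shadowed.

```rust
// Hypothetical macro that mirrors the simplified `vec` declaration above.
macro_rules! my_vec {
    // Pattern: one or more comma-separated expressions, each bound to `$x`.
    ($($x:expr),+) => {{
        let mut v = Vec::new();
        // Template: the repetition emits one `push` call per matched `$x`.
        $( v.push($x); )+
        v
    }};
}

// The macro call below expands into three `push` calls,
// just like the hand-written version.
fn build_numbers() -> Vec<i32> {
    my_vec![1, 2, 3]
}
```

Calling build_numbers() should give the same vector as vec![1, 2, 3]. Unlike the real vec macro, this sketch does not accept a trailing comma or an empty list.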
How macros create new syntax
Let’s take a look at the procedural macro html from the yew library, which helps write web frontends in Rust (we will look at procedural macros later).
Here is an example of how this macro can be called:

    use yew::html;

    html! {
        <div>
            <div class="panel">
                { "Hello, World!" }
            </div>
        </div>
    }

The macro call doesn’t look like Rust at all, does it? But html parses the code of its call as a language similar to HTML and generates a hierarchical structure called a virtual DOM. The resulting expansion code is pure Rust, and it can then be compiled by rustc.

    VTag::new(
        "div",
        vec![VTag::new("div", ...)],
    );

This is how a macro can embed another language into Rust.
Tip: whitespace is not preserved in the macro body, so there’s no way to write, for example, a macro that would embed an indentation-sensitive language like Python.
More examples of how macros can help create custom syntax include:
- Collection literals like vec from the standard library.
- Text formatting, presented by macros like println or format (println is a declarative macro which expands to a procedural macro format_args_nl included in rustc).
Note that symbols the Rust lexer does not accept are not allowed inside a procedural macro call. In short, the body of a procedural macro can contain only tokens that are already valid in Rust.
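The same token rule applies to declarative macros, which makes custom syntax easy to try out in a single file. Below is a minimal sketch of a map literal. The macro name map_of, its key => value syntax, and the ports function are our own assumptions for illustration and are not part of any library. The syntax is accepted because => is an ordinary Rust token.

```rust
use std::collections::HashMap;

// Hypothetical collection-literal macro with a custom `key => value` syntax.
macro_rules! map_of {
    // Zero or more `key => value` pairs, with an optional trailing comma.
    ($($key:expr => $value:expr),* $(,)?) => {{
        let mut m = HashMap::new();
        // One `insert` call is generated per pair.
        $( m.insert($key, $value); )*
        m
    }};
}

// Example usage: the body reads like a literal, but expands to plain Rust.
fn ports() -> HashMap<&'static str, u16> {
    map_of! {
        "http" => 80,
        "https" => 443,
    }
}
```

A separator such as an arrow made of characters that do not form valid Rust tokens would be rejected before the macro even sees it.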
How macros help reduce boilerplate code
To illustrate this, we’ll use the example of writing a structure. Usually, there are many traits to be implemented:

    struct Foo {
        x: i32,
        y: i32
    }

    impl Copy for Foo { ... }
    impl Clone for Foo { ... }
    impl Ord for Foo { ... }
    impl PartialOrd for Foo { ... }
    impl Eq for Foo { ... }
    impl PartialEq for Foo { ... }
    impl Debug for Foo { ... }
    impl Hash for Foo { ... }

This is where derive, which is a procedural macro, comes in handy. The above traits can be implemented using the derive macros built into rustc:

    #[derive(Copy, Clone, Ord, PartialOrd, Eq, PartialEq, Debug, Hash, Default)]
    struct Foo {
        x: i32,
        y: i32
    }

Each derive will generate a particular impl based on the original structure.
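To get a feeling for how much code a derive saves, we can compare a derived struct with one whose impls are written by hand. This is a sketch for three of the traits only. The struct names Derived and Manual are hypothetical, and the hand-written impls are roughly what one would write manually, not necessarily the exact code the compiler generates.

```rust
use std::fmt;

// Three impls are generated from this single attribute.
#[derive(Clone, PartialEq, Debug)]
struct Derived {
    x: i32,
    y: i32,
}

// The same struct with the impls spelled out manually.
struct Manual {
    x: i32,
    y: i32,
}

impl Clone for Manual {
    fn clone(&self) -> Self {
        Manual { x: self.x, y: self.y }
    }
}

impl PartialEq for Manual {
    // Field-by-field comparison.
    fn eq(&self, other: &Self) -> bool {
        self.x == other.x && self.y == other.y
    }
}

impl fmt::Debug for Manual {
    // Prints the struct name followed by its named fields.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_struct("Manual")
            .field("x", &self.x)
            .field("y", &self.y)
            .finish()
    }
}
```

Both structs can be cloned, compared, and printed with {:?} in the same way. The manual version grows with every field and every trait, while the derived one stays a single line.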
Procedural macros
Essentially, a procedural macro is a Rust function executed at compile time. Such functions belong to a special crate marked with the proc-macro flag. In Cargo.toml, this looks like the following:

    [package]
    name = "my-proc-macro"
    version = "0.1.0"
    edition = "2021"

    [lib]
    proc-macro = true

Types of procedural macros
There are three types of procedural macros:
- Function-like procedural macros
  These macros are declared using the #[proc_macro] attribute and are invoked with an exclamation mark, just like declarative macros:

      #[proc_macro]
      pub fn foo(body: TokenStream) -> TokenStream { ... }
      …
      foo!( foo bar baz );

- Custom derive procedural macros
  These macros are declared using the #[proc_macro_derive] attribute and are used in #[derive] for structures and enums:

      #[proc_macro_derive(Bar)]
      pub fn bar(body: TokenStream) -> TokenStream { ... }
      …
      #[derive(Bar)]
      struct S;

- Custom attributes
  These macros are declared using #[proc_macro_attribute] and are called as item attributes:

      #[proc_macro_attribute]
      pub fn baz(
          attr: TokenStream,
          item: TokenStream
      ) -> TokenStream { ... }
      …
      #[baz]
      fn some_item() {}

Procedural macros API
Procedural macro body
Let’s first clarify what a procedural macro body is. In the case of a function-like macro, the body is everything between the round brackets: in foo!( foo bar baz ), the body is foo bar baz.
In the case of a custom derive macro, the body is the whole attributed structure: in the #[derive(Bar)] example above, the body is the struct S; declaration.
For an attribute macro, the body includes the whole item (fn some_item() {}). The attribute itself can also contain tokens of its own; they are passed to the function as a separate argument (attr in the declaration above).
To illustrate this, we’ll examine an identity macro, which simply returns the body that it takes, without doing anything else:

    extern crate proc_macro;
    use proc_macro::TokenStream;

    #[proc_macro]
    pub fn foo(body: TokenStream) -> TokenStream {
        body
    }

Suppose we have a program that calls hello(), where hello is defined inside a foo! macro call. In this situation, the foo macro is expanded in such a way that it looks like there was no macro in the first place:

    use my_proc_macro::*;

    // foo! {
    fn hello() {
        println!("Hello, world!");
    }
    // }

    fn main() {
        hello();
    }

Similarly, this could be written with an attribute macro:

    extern crate proc_macro;
    use proc_macro::TokenStream;

    #[proc_macro_attribute]
    pub fn baz(
        attr: TokenStream,
        item: TokenStream
    ) -> TokenStream {
        item
    }
    …
    use my_proc_macro::*;

    #[baz]
    fn hello() {
        println!("Hello, world!");
    }

    fn main() {
        hello();
    }

Tokens, TokenStream, and TokenTree
The body of a procedural macro is divided into pieces called tokens.
A token is a piece of source text with a type assigned to it when the macro body is parsed. There are three types of tokens: identifiers, punctuation symbols, and literals.
Procedural macros operate with special data types from the proc_macro crate, which is a part of the standard library and is linked automatically when procedural macros are compiled. One of these special types, TokenTree, is an enum of the possible token types:

    struct TokenStream(Vec<TokenTree>);

    enum TokenTree {
        Ident(Ident),
        Punct(Punct),
        Literal(Literal),
        ...
    }

Another data structure, TokenStream, represents the list of tokens and allows you to iterate over the token list (body.into_iter()):

    use proc_macro::{TokenStream, TokenTree};

    #[proc_macro]
    pub fn foo(body: TokenStream) -> TokenStream {
        for tt in body.into_iter() {
            match tt {
                TokenTree::Ident(_) => eprintln!("Ident"),
                TokenTree::Punct(_) => eprintln!("Punct"),
                TokenTree::Literal(_) => eprintln!("Literal"),
                _ => {}
            }
        }
        TokenStream::new()
    }

    $ cargo build
    Ident
    Punct
    Literal
    Punct

There is one more enum variant in TokenTree, which is called Group:

    enum TokenTree {
        Ident(Ident),
        Punct(Punct),
        Literal(Literal),
        Group(Group),
    }

Groups appear when the parser encounters brackets. The brackets that form a group can be round brackets, square brackets, or braces.
For example, the body of the following macro call

    foo!( foo { 2 + 2 } bar );

will be parsed into two identifiers (foo and bar) and a group ({ 2 + 2 }). The group here consists of the braces and another TokenStream (the literals 2 and 2, and the punctuation symbol +).
We can see that TokenStream is not strictly a stream. It’s a kind of tree, where each node is formed by brackets and the leaves represent singular tokens.
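The tree shape can be explored outside of a proc-macro crate with a small model. The sketch below is not the proc_macro API: the Tree enum and the tokenize, parse, closing, and take_run functions are our own simplified stand-ins, and the tokenizer only understands identifiers, integer literals, single-character punctuation, and brackets.

```rust
use std::iter::Peekable;
use std::str::Chars;

// Simplified stand-in for proc_macro::TokenTree. This is NOT the real API,
// only a model that can run outside of a proc-macro crate.
#[derive(Debug, PartialEq)]
enum Tree {
    Ident(String),
    Punct(char),
    Literal(String),
    // The opening bracket plus the nested token list.
    Group(char, Vec<Tree>),
}

// Maps an opening bracket to its closing counterpart.
fn closing(open: char) -> Option<char> {
    match open {
        '(' => Some(')'),
        '[' => Some(']'),
        '{' => Some('}'),
        _ => None,
    }
}

// Collects consecutive characters that satisfy `pred` into a string.
fn take_run(chars: &mut Peekable<Chars<'_>>, pred: impl Fn(char) -> bool) -> String {
    let mut run = String::new();
    while let Some(&c) = chars.peek() {
        if !pred(c) {
            break;
        }
        run.push(c);
        chars.next();
    }
    run
}

// Reads tokens until `close` (or the end of input) is reached.
// Every opening bracket starts a recursive call, which makes the result a tree.
fn parse(chars: &mut Peekable<Chars<'_>>, close: Option<char>) -> Vec<Tree> {
    let mut out = Vec::new();
    while let Some(&c) = chars.peek() {
        if c.is_whitespace() {
            chars.next();
        } else if Some(c) == close {
            chars.next();
            break;
        } else if let Some(end) = closing(c) {
            chars.next();
            out.push(Tree::Group(c, parse(chars, Some(end))));
        } else if c.is_alphabetic() || c == '_' {
            out.push(Tree::Ident(take_run(chars, |ch| ch.is_alphanumeric() || ch == '_')));
        } else if c.is_ascii_digit() {
            out.push(Tree::Literal(take_run(chars, |ch| ch.is_ascii_digit())));
        } else {
            chars.next();
            out.push(Tree::Punct(c));
        }
    }
    out
}

// Entry point: turns a macro body given as text into a list of trees.
fn tokenize(src: &str) -> Vec<Tree> {
    parse(&mut src.chars().peekable(), None)
}
```

For the body foo { 2 + 2 } bar, tokenize returns three top-level elements: an identifier, a group that holds three more tokens, and another identifier.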
How to write a procedural macro
Let’s write a simple procedural macro that expands into a function call and passes on all the arguments given to it. For example, foo!(1, 2) should expand into foo(1, 2).
Here’s one way we could write it:

    use proc_macro::{Delimiter, Group, Ident, Span, TokenStream, TokenTree};

    #[proc_macro]
    pub fn foo(body: TokenStream) -> TokenStream {
        [
            TokenTree::Ident(Ident::new("foo", Span::mixed_site())),
            TokenTree::Group(Group::new(Delimiter::Parenthesis, body)),
        ]
        .into_iter()
        .collect()
    }

In the code above, we do the following:
- Create an array of two elements:
  1) The identifier foo: Ident(Ident::new("foo", Span::mixed_site()))
  2) A group with round brackets in which we place the macro body: Group(Group::new(Delimiter::Parenthesis, body)). Notice that body is the parameter of the macro function: foo(body: TokenStream)
- Arrange the created array into a TokenStream: .into_iter().collect()
Now the macro can be called in this way:

    fn main() {
        foo!(1, 2);
    }

Let’s see how our macro will be processed.
The macro body is 1, 2. When expanded, the body will be wrapped in parentheses and prepended with foo, so that it looks just like a function call: foo(1, 2).
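The same expansion can be approximated with a declarative macro, which is easier to try in a single file. This is a sketch and not the procedural macro itself: the macro name call_foo, the demo function, and the body of foo are hypothetical. The tt fragment forwards arbitrary token trees unchanged, much like passing body into the group.

```rust
// Hypothetical target function; we only assume that `foo` takes two numbers.
fn foo(a: i32, b: i32) -> i32 {
    a + b
}

// Declarative counterpart of the procedural macro:
// prepend `foo` and wrap the whole body in parentheses.
macro_rules! call_foo {
    // `$body` matches every token tree in the macro body, commas included.
    ($($body:tt)*) => {
        foo($($body)*)
    };
}

// `call_foo!(1, 2)` expands into `foo(1, 2)`.
fn demo() -> i32 {
    call_foo!(1, 2)
}
```

Because the tokens are forwarded as they are, an expression such as call_foo!(2 + 3, 4) also works and expands into foo(2 + 3, 4).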
Spans
Why are TokenStream and TokenTree necessary for the procedural macros API? Why aren’t raw strings enough? One could imagine code like this (which will not work):

    #[proc_macro]
    pub fn foo(body: String) -> String { // this doesn't work!
        format!("foo({})", body)
    }

To understand why the code above doesn’t work, we need to go back to the token structure.
Besides the type and the actual string of symbols, a token structure also includes a Span.
A Span contains information about where in the original code the token was located. The compiler needs this information to highlight errors correctly.
For example, we can take the same macro and intentionally pass a string instead of a numeric value into the call, as in foo!(1, "");.
Since the function expects an i32 value, the compiler will report an error. But where will the compiler report the error? It will be shown at the token where the error would be expected if it were a regular function call, not at the whole macro call:

    error[E0308]: mismatched types
      --> src/main.rs:26:13
       |
    26 |     foo!(1, "");
       |             ^^ expected `i32`, found `&str`

This is possible because we passed the whole TokenStream into the expansion, and each token contains a Span. Span informs the compiler that this particular code fragment should be mapped to that particular fragment in macro expansion. This way, the compiler can map the errors that occur during the compilation of the expanded code.
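To see why positions have to travel with the tokens, here is a small model of spans. It is a sketch under strong assumptions: the real proc_macro::Span is an opaque type, while the Span and Token structs and the tokenize and underline functions below are our own simplified stand-ins that treat a span as a byte range in ASCII-only source text.

```rust
// Simplified model of a span: a byte range in the source text.
// The real proc_macro::Span is opaque; this struct is an assumption.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    start: usize,
    end: usize,
}

// A token remembers its text and where that text came from.
#[derive(Debug, PartialEq)]
struct Token {
    text: String,
    span: Span,
}

// Splits ASCII source text into tokens and records a span for each one.
// String literals are scanned up to the closing quote (escapes are not handled).
fn tokenize(src: &str) -> Vec<Token> {
    let bytes = src.as_bytes();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let start = i;
        let b = bytes[i];
        if b.is_ascii_whitespace() {
            i += 1;
            continue;
        } else if b == b'"' {
            i += 1;
            while i < bytes.len() && bytes[i] != b'"' {
                i += 1;
            }
            // Step over the closing quote if there is one.
            i = (i + 1).min(bytes.len());
        } else if b.is_ascii_alphanumeric() || b == b'_' {
            while i < bytes.len() && (bytes[i].is_ascii_alphanumeric() || bytes[i] == b'_') {
                i += 1;
            }
        } else {
            // Any other character is treated as single-character punctuation.
            i += 1;
        }
        tokens.push(Token {
            text: src[start..i].to_string(),
            span: Span { start, end: i },
        });
    }
    tokens
}

// Renders the caret line that a diagnostic would print under the source line.
fn underline(span: Span) -> String {
    format!(
        "{}{}",
        " ".repeat(span.start),
        "^".repeat(span.end - span.start)
    )
}
```

With spans attached, a diagnostic for the empty string literal in foo!(1, ""); can point at exactly those two characters. If the macro returned a plain String, this position information would be lost.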
To summarize, the procedural macro API is built from the following blocks:
- A TokenStream is a vector of TokenTrees
- A TokenTree is an enum of 3 token types plus a Group
- A Group is formed by brackets
- Each token has a Span, which is used for error mapping
________________________________
Are procedural macros any clearer to you after this deep dive? In the second part of this series, we will cover the process of procedural macro compilation, the ABI in use, and the IDE’s way of dealing with them.
Stay tuned!
Your Rust team
JetBrains
The Drive to Develop