diff --git a/src/librustc/README.md b/src/librustc/README.md index c24d3d82b2f7..f2abaa6f9573 100644 --- a/src/librustc/README.md +++ b/src/librustc/README.md @@ -13,49 +13,82 @@ https://github.com/rust-lang/rust/issues Your concerns are probably the same as someone else's. +You may also be interested in the +[Rust Forge](https://forge.rust-lang.org/), which includes a number of +interesting bits of information. + +Finally, at the end of this file is a GLOSSARY defining a number of +common (and not necessarily obvious!) names that are used in the Rust +compiler code. If you see some funky name and you'd like to know what +it stands for, check there! + The crates of rustc =================== -Rustc consists of a number of crates, including `libsyntax`, -`librustc`, `librustc_back`, `librustc_trans`, and `librustc_driver` -(the names and divisions are not set in stone and may change; -in general, a finer-grained division of crates is preferable): +Rustc consists of a number of crates, including `syntax`, +`rustc`, `rustc_back`, `rustc_trans`, `rustc_driver`, and +many more. The source for each crate can be found in a directory +like `src/libXXX`, where `XXX` is the crate name. -- [`libsyntax`][libsyntax] contains those things concerned purely with syntax – - that is, the AST, parser, pretty-printer, lexer, macro expander, and - utilities for traversing ASTs – are in a separate crate called - "syntax", whose files are in `./../libsyntax`, where `.` is the - current directory (that is, the parent directory of front/, middle/, - back/, and so on). +(NB. The names and divisions of these crates are not set in +stone and may change over time -- for the time being, we tend towards +a finer-grained division to help with compilation time, though as +incremental improves that may change.) -- `librustc` (the current directory) contains the high-level analysis - passes, such as the type checker, borrow checker, and so forth. - It is the heart of the compiler. 
+The dependency structure of these crates is roughly a diamond: -- [`librustc_back`][back] contains some very low-level details that are - specific to different LLVM targets and so forth. - -- [`librustc_trans`][trans] contains the code to convert from Rust IR into LLVM - IR, and then from LLVM IR into machine code, as well as the main - driver that orchestrates all the other passes and various other bits - of miscellany. In general it contains code that runs towards the - end of the compilation process. - -- [`librustc_driver`][driver] invokes the compiler from - [`libsyntax`][libsyntax], then the analysis phases from `librustc`, and - finally the lowering and codegen passes from [`librustc_trans`][trans]. - -Roughly speaking the "order" of the three crates is as follows: - - librustc_driver - | - +-----------------+-------------------+ - | | - libsyntax -> librustc -> librustc_trans +``` + rustc_driver + / | \ + / | \ + / | \ + / v \ +rustc_trans rustc_borrowck ... rustc_metadata + \ | / + \ | / + \ | / + \ v / + rustc + | + v + syntax + / \ + / \ + syntax_pos syntax_ext +``` -The compiler process: -===================== +The idea is that `rustc_driver`, at the top of this lattice, basically +defines the overall control-flow of the compiler. It doesn't have much +"real code", but instead ties together all of the code defined in the +other crates and defines the overall flow of execution. + +At the other extreme, the `rustc` crate defines the common and +pervasive data structures that all the rest of the compiler uses +(e.g., how to represent types, traits, and the program itself). It +also contains some amount of the compiler itself, although that is +relatively limited. 
+ +Finally, all the crates in the bulge in the middle define the bulk of +the compiler -- they all depend on `rustc`, so that they can make use +of the various types defined there, and they export public routines +that `rustc_driver` will invoke as needed (more and more, what these +crates export are "query definitions", but those are covered later +on). + +Below `rustc` lie various crates that make up the parser and error +reporting mechanism. For historical reasons, these crates do not have +the `rustc_` prefix, but they are really just as much an internal part +of the compiler and not intended to be stable (though they do wind up +getting used by some crates in the wild; a practice we hope to +gradually phase out). + +Each crate has a `README.md` file that describes, at a high level, +what it contains, and tries to give some kind of explanation (some +better than others). + +The compiler process +==================== The Rust compiler is comprised of six main compilation phases. @@ -172,3 +205,29 @@ The 3 central data structures: [back]: https://github.com/rust-lang/rust/tree/master/src/librustc_back/ [rustc]: https://github.com/rust-lang/rust/tree/master/src/librustc/ [driver]: https://github.com/rust-lang/rust/tree/master/src/librustc_driver + +Glossary +======== + +The compiler uses a number of...idiosyncratic abbreviations and +things. This glossary attempts to list them and give you a few +pointers for understanding them better. + +- AST -- the **abstract syntax tree** produced by the `syntax` crate; reflects user syntax + very closely. +- cx -- we tend to use "cx" as an abbreviation for context. See also tcx, infcx, etc. +- HIR -- the **High-level IR**, created by lowering and desugaring the AST. See `librustc/hir`. +- `'gcx` -- the lifetime of the global arena (see `librustc/ty`). 
+- generics -- the set of generic type parameters defined on a type or item +- infcx -- the inference context (see `librustc/infer`) +- MIR -- the **Mid-level IR** that is created after type-checking for use by borrowck and trans. + Defined in the `src/librustc/mir/` module, but much of the code that manipulates it is + found in `src/librustc_mir`. +- obligation -- something that must be proven by the trait system. +- sess -- the **compiler session**, which stores global data used throughout compilation +- substs -- the **substitutions** for a given generic type or item + (e.g., the `i32, u32` in `HashMap<i32, u32>`) +- tcx -- the "typing context", main data structure of the compiler (see `librustc/ty`). +- trans -- the code to **translate** MIR into LLVM IR. +- trait reference -- a trait and values for its type parameters (see `librustc/ty`). +- ty -- the internal representation of a **type** (see `librustc/ty`). diff --git a/src/librustc/hir/README.md b/src/librustc/hir/README.md new file mode 100644 index 000000000000..d4f4e48963a3 --- /dev/null +++ b/src/librustc/hir/README.md @@ -0,0 +1,123 @@ +# Introduction to the HIR + +The HIR -- "High-level IR" -- is the primary IR used in most of +rustc. It is a desugared version of the "abstract syntax tree" (AST) +that is generated after parsing, macro expansion, and name resolution +have completed. Many parts of HIR resemble Rust surface syntax quite +closely, with the exception that some of Rust's expression forms have +been desugared away (as an example, `for` loops are converted into a +`loop` and do not appear in the HIR). + +This README covers the main concepts of the HIR. + +### Out-of-band storage and the `Crate` type + +The top-level data-structure in the HIR is the `Crate`, which stores +the contents of the crate currently being compiled (we only ever +construct HIR for the current crate).
Whereas in the AST the crate +data structure basically just contains the root module, the HIR +`Crate` structure contains a number of maps and other things that +serve to organize the content of the crate for easier access. + +For example, the contents of individual items (e.g., modules, +functions, traits, impls, etc.) in the HIR are not immediately +accessible in the parents. So, for example, if you had a module item `foo` +containing a function `bar()`: + +``` +mod foo { + fn bar() { } +} +``` + +Then in the HIR the representation of module `foo` (the `Mod` +struct) would have only the **`ItemId`** `I` of `bar()`. To get the +details of the function `bar()`, we would look up `I` in the +`items` map. + +One nice result from this representation is that one can iterate +over all items in the crate by iterating over the key-value pairs +in these maps (without the need to trawl through the whole IR). +There are similar maps for things like trait items and impl items, +as well as "bodies" (explained below). + +The other reason to set up the representation this way is for better +integration with incremental compilation. This way, if you gain access +to a `&hir::Item` (e.g. for the mod `foo`), you do not immediately +gain access to the contents of the function `bar()`. Instead, you only +gain access to the **id** for `bar()`, and you must invoke some function to +look up the contents of `bar()` given its id; this gives us a chance to +observe that you accessed the data for `bar()` and record the +dependency. + +### Identifiers in the HIR + +Most of the code that has to deal with things in HIR tends not to +carry around references into the HIR, but rather to carry around +*identifier numbers* (or just "ids"). Right now, you will find four +sorts of identifiers in active use: + +- `DefId`, which primarily names "definitions" or top-level items. + - You can think of a `DefId` as being shorthand for a very explicit + and complete path, like `std::collections::HashMap`.
However, + these paths are able to name things that are not nameable in + normal Rust (e.g., impls), and they also include extra information + about the crate (such as its version number, as two versions of + the same crate can co-exist). + - A `DefId` really consists of two parts, a `CrateNum` (which + identifies the crate) and a `DefIndex` (which indexes into a list + of items that is maintained per crate). +- `HirId`, which combines the index of a particular item with an + offset within that item. + - The key point of a `HirId` is that it is *relative* to some item (which is named + via a `DefId`). +- `BodyId`, which is an absolute identifier that refers to a specific + body (definition of a function or constant) in the crate. It is currently + effectively a "newtype'd" `NodeId`. +- `NodeId`, which is an absolute id that identifies a single node in the HIR tree. + - While these are still in common use, **they are being slowly phased out**. + - Since they are absolute within the crate, adding a new node + anywhere in the tree causes the node-ids of all subsequent code in + the crate to change. This is terrible for incremental compilation, + as you can perhaps imagine. + +### HIR Map + +Most of the time when you are working with the HIR, you will do so via +the **HIR Map**, accessible in the tcx via `tcx.hir` (and defined in +the `hir::map` module). The HIR map contains a number of methods to +convert between ids of various kinds and to look up data associated +with a HIR node. + +For example, if you have a `DefId`, and you would like to convert it +to a `NodeId`, you can use `tcx.hir.as_local_node_id(def_id)`. This +returns an `Option<NodeId>` -- this will be `None` if the def-id +refers to something outside of the current crate (since then it has no +HIR node), but otherwise returns `Some(n)` where `n` is the node-id of +the definition. + +Similarly, you can use `tcx.hir.find(n)` to look up the node for a +`NodeId`.
This returns an `Option<Node<'hir>>`, where `Node` is an enum +defined in the map; by matching on this you can find out what sort of +node the node-id referred to and also get a pointer to the data +itself. Often, you know what sort of node `n` is -- e.g., if you know +that `n` must be some HIR expression, you can do +`tcx.hir.expect_expr(n)`, which will extract and return the +`&hir::Expr`, panicking if `n` is not in fact an expression. + +Finally, you can use the HIR map to find the parents of nodes, via +calls like `tcx.hir.get_parent_node(n)`. + +### HIR Bodies + +A **body** represents some kind of executable code, such as the body +of a function/closure or the definition of a constant. Bodies are +associated with an **owner**, which is typically some kind of item +(e.g., a `fn()` or `const`), but could also be a closure expression +(e.g., `|x, y| x + y`). You can use the HIR map to find the body +associated with a given def-id (`maybe_body_owned_by()`) or to find +the owner of a body (`body_owner_def_id()`). + + + + diff --git a/src/librustc/hir/map/README.md b/src/librustc/hir/map/README.md new file mode 100644 index 000000000000..34ed325705ab --- /dev/null +++ b/src/librustc/hir/map/README.md @@ -0,0 +1,4 @@ +The HIR map, accessible via `tcx.hir`, allows you to quickly navigate the +HIR and convert between various forms of identifiers. See [the HIR README] for more information. + +[the HIR README]: ../README.md diff --git a/src/librustc/hir/mod.rs b/src/librustc/hir/mod.rs index dd2a3978d884..ea3cdbaad413 100644 --- a/src/librustc/hir/mod.rs +++ b/src/librustc/hir/mod.rs @@ -413,6 +413,9 @@ pub struct WhereEqPredicate { pub type CrateConfig = HirVec<P<MetaItem>>; +/// The top-level data structure that stores the entire contents of +/// the crate currently being compiled.
+/// #[derive(Clone, PartialEq, Eq, RustcEncodable, RustcDecodable, Debug)] pub struct Crate { pub module: Mod, @@ -927,7 +930,27 @@ pub struct BodyId { pub node_id: NodeId, } -/// The body of a function or constant value. +/// The body of a function, closure, or constant value. In the case of +/// a function, the body contains not only the function body itself +/// (which is an expression), but also the argument patterns, since +/// those are something that the caller doesn't really care about. +/// +/// Example: +/// +/// ```rust +/// fn foo((x, y): (u32, u32)) -> u32 { +/// x + y +/// } +/// ``` +/// +/// Here, the `Body` associated with `foo()` would contain: +/// +/// - an `arguments` array containing the `(x, y)` pattern +/// - a `value` containing the `x + y` expression (maybe wrapped in a block) +/// - `is_generator` would be false +/// +/// All bodies have an **owner**, which can be accessed via the HIR +/// map using `body_owner_def_id()`. #[derive(Clone, PartialEq, Eq, RustcEncodable, RustcDecodable, Hash, Debug)] pub struct Body { pub arguments: HirVec<Arg>, diff --git a/src/librustc/lib.rs b/src/librustc/lib.rs index 2226bfcfd3c1..cd39ef709463 100644 --- a/src/librustc/lib.rs +++ b/src/librustc/lib.rs @@ -8,7 +8,28 @@ // option. This file may not be copied, modified, or distributed // except according to those terms. -//! The Rust compiler. +//! The "main crate" of the Rust compiler. This crate contains common +//! type definitions that are used by the other crates in the rustc +//! "family". Some prominent examples (note that each of these modules +//! has its own README with further details): +//! +//! - **HIR.** The "high-level (H) intermediate representation (IR)" is +//! defined in the `hir` module. +//! - **MIR.** The "mid-level (M) intermediate representation (IR)" is +//! defined in the `mir` module. This module contains only the +//! *definition* of the MIR; the passes that transform and operate +//! on MIR are found in the `librustc_mir` crate. +//! 
- **Types.** The internal representation of types used in rustc is +//! defined in the `ty` module. This includes the **type context** +//! (or `tcx`), which is the central context during most of +//! compilation, containing the interners and other things. +//! - **Traits.** Trait resolution is implemented in the `traits` module. +//! - **Type inference.** The type inference code can be found in the `infer` module; +//! this code handles low-level equality and subtyping operations. The +//! type check pass in the compiler is found in the `librustc_typeck` crate. +//! +//! For a deeper explanation of how the compiler works and is +//! organized, see the README.md file in this directory. //! //! # Note //! diff --git a/src/librustc/ty/README.md b/src/librustc/ty/README.md new file mode 100644 index 000000000000..0416be8b9ab3 --- /dev/null +++ b/src/librustc/ty/README.md @@ -0,0 +1,159 @@ +# Types and the Type Context + +The `ty` module defines how the Rust compiler represents types +internally. It also defines the *typing context* (`tcx` or `TyCtxt`), +which is the central data structure in the compiler. + +## The tcx and how it uses lifetimes + +The `tcx` ("typing context") is the central data structure in the +compiler. It is the context that you use to perform all manner of +queries. The struct `TyCtxt` defines a reference to this shared context: + +```rust +tcx: TyCtxt<'a, 'gcx, 'tcx> +// -- ---- ---- +// | | | +// | | innermost arena lifetime (if any) +// | "global arena" lifetime +// lifetime of this reference +``` + +As you can see, the `TyCtxt` type takes three lifetime parameters. +These lifetimes are perhaps the most complex thing to understand about +the tcx. During rust compilation, we allocate most of our memory in +**arenas**, which are basically pools of memory that get freed all at +once. When you see a reference with a lifetime like `'tcx` or `'gcx`, +you know that it refers to arena-allocated data (or data that lives as +long as the arenas, anyhow). 
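To make the arena idea above concrete, here is a minimal sketch. It is not rustc's arena implementation: the `Arena` type and its `alloc` and `len` methods are hypothetical names invented for illustration. What it tries to show is that every reference handed out borrows from the arena, so its lifetime (the analogue of `'tcx` or `'gcx`) cannot outlive the arena, and that all allocations are freed together when the arena is dropped.

```rust
use std::cell::RefCell;

/// Hypothetical toy arena (an assumption for illustration, not rustc's type).
/// Values are heap-allocated one at a time and freed all at once on drop.
struct Arena<T> {
    ptrs: RefCell<Vec<*mut T>>,
}

impl<T> Arena<T> {
    fn new() -> Self {
        Arena { ptrs: RefCell::new(Vec::new()) }
    }

    /// Allocates `value` and returns a reference whose lifetime is tied
    /// to the borrow of the arena itself.
    fn alloc(&self, value: T) -> &T {
        let ptr = Box::into_raw(Box::new(value));
        self.ptrs.borrow_mut().push(ptr);
        // SAFETY: `ptr` came from `Box::into_raw`, is never freed or
        // mutated before `Drop`, and `Drop` cannot run while `&self`
        // (and therefore the returned reference) is still alive.
        unsafe { &*ptr }
    }

    /// Number of values allocated so far.
    fn len(&self) -> usize {
        self.ptrs.borrow().len()
    }
}

impl<T> Drop for Arena<T> {
    fn drop(&mut self) {
        // Free every allocation in one pass, mirroring "freed all at once".
        for ptr in self.ptrs.get_mut().drain(..) {
            // SAFETY: each pointer was produced by `Box::into_raw` above
            // and is released exactly once here.
            unsafe { drop(Box::from_raw(ptr)) };
        }
    }
}

fn main() {
    let arena = Arena::new();
    let a: &String = arena.alloc(String::from("u32"));
    let b: &String = arena.alloc(String::from("bool"));
    println!("{} {} ({} allocations)", a, b, arena.len());
}
```

Trying to return `a` from a scope that also owns `arena` is rejected by the borrow checker, which is the same guarantee the `'tcx` and `'gcx` lifetimes give for arena-allocated compiler data.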
+ +We use two distinct levels of arenas. The outer level is the "global +arena". This arena lasts for the entire compilation: so anything you +allocate in there is only freed once compilation is basically over +(actually, when we shift to executing LLVM). + +To reduce peak memory usage, when we do type inference, we also use an +inner level of arena. These arenas get thrown away once type inference +is over. This is done because type inference generates a lot of +"throw-away" types that are not particularly interesting after type +inference completes, so keeping around those allocations would be +wasteful. + +Often, we wish to write code that explicitly asserts that it is not +taking place during inference. In that case, there is no "local" +arena, and all the types that you can access are allocated in the +global arena. To express this, the idea is to use the same lifetime +for the `'gcx` and `'tcx` parameters of `TyCtxt`. Just to be a touch +confusing, we tend to use the name `'tcx` in such contexts. Here is an +example: + +```rust +fn not_in_inference<'a, 'tcx>(tcx: TyCtxt<'a, 'tcx, 'tcx>, def_id: DefId) { + // ---- ---- + // Using the same lifetime here asserts + // that the innermost arena accessible through + // this reference *is* the global arena. +} +``` + +In contrast, if we want code that can be used during type inference, then we +need to declare distinct `'gcx` and `'tcx` lifetime parameters: + +```rust +fn maybe_in_inference<'a, 'gcx, 'tcx>(tcx: TyCtxt<'a, 'gcx, 'tcx>, def_id: DefId) { + // ---- ---- + // Using different lifetimes here means that + // the innermost arena *may* be distinct + // from the global arena (but doesn't have to be). +} +``` + +### Allocating and working with types + +Rust types are represented using the `ty::Ty<'tcx>` type.
This is in fact a simple type alias +for a reference with `'tcx` lifetime: + +```rust +pub type Ty<'tcx> = &'tcx TyS<'tcx>; +``` + +The `TyS` struct defines the actual details of how a type is +represented. The most interesting part of it is the `sty` field, which +contains an enum that lets us test what sort of type this is. For +example, it is very common to see code that tests what sort of type you have +that looks roughly like so: + +```rust +fn test_type<'tcx>(ty: Ty<'tcx>) { + match ty.sty { + ty::TyArray(elem_ty, len) => { ... } + ... + } +} +``` + +(Note though that doing such low-level tests on types during inference +can be risky, as there may be inference variables and other things +to consider, or sometimes types are not yet known that will become +known later.) + +To allocate a new type, you can use the various `mk_` methods defined +on the `tcx`. These have names that correspond mostly to the various kinds +of type variants. For example: + +```rust +let array_ty = tcx.mk_array(elem_ty, len * 2); +``` + +These methods all return a `Ty<'tcx>` -- note that the lifetime you +get back is the lifetime of the innermost arena that this `tcx` has +access to. In fact, types are always canonicalized and interned (so we +never allocate exactly the same type twice) and are always allocated +in the outermost arena where they can be (so, if they do not contain +any inference variables or other "temporary" types, they will be +allocated in the global arena). However, the lifetime `'tcx` is always +a safe approximation, so that is what you get back. + +NB. Because types are interned, it is possible to compare them for +equality efficiently using `==` -- however, this is almost never what +you want to do unless you happen to be hashing and looking for +duplicates. This is because often in Rust there are multiple ways to +represent the same type, particularly once inference is involved.
If +you are going to be testing for type equality, you probably need to +start looking into the inference code to do it right. + +You can also find various common types in the tcx itself by accessing +`tcx.types.bool`, `tcx.types.char`, etc (see `CommonTypes` for more). + +### Beyond types: Other kinds of arena-allocated data structures + +In addition to types, there are a number of other arena-allocated data +structures that you can allocate, and which are found in this +module. Here are a few examples: + +- `Substs`, allocated with `mk_substs` -- this will intern a slice of types, often used to + specify the values to be substituted for generics (e.g., `HashMap<i32, u32>` + would be represented as a slice `&'tcx [tcx.types.i32, tcx.types.u32]`). +- `TraitRef`, typically passed by value -- a **trait reference** + consists of a reference to a trait along with its various type + parameters (including `Self`), like `i32: Display` (here, the def-id + would reference the `Display` trait, and the substs would contain + `i32`). +- `Predicate` defines something the trait system has to prove (see `traits` module). + +### Import conventions + +Although there is no hard and fast rule, the `ty` module tends to be used like so: + +```rust +use ty::{self, Ty, TyCtxt}; +``` + +In particular, since they are so common, the `Ty` and `TyCtxt` types +are imported directly. Other types are often referenced with an +explicit `ty::` prefix (e.g., `ty::TraitRef<'tcx>`). But some modules +choose to import a larger or smaller set of names explicitly. + + + + diff --git a/src/librustc/ty/context.rs b/src/librustc/ty/context.rs index 8005714433f5..6a95c62a303f 100644 --- a/src/librustc/ty/context.rs +++ b/src/librustc/ty/context.rs @@ -793,9 +793,11 @@ impl<'tcx> CommonTypes<'tcx> { } } -/// The data structure to keep track of all the information that typechecker -/// generates so that so that it can be reused and doesn't have to be redone -/// later on. +/// The central data structure of the compiler.
Keeps track of all the +/// information that the typechecker generates so that it can be +/// reused and doesn't have to be redone later on. +/// +/// See [the README](README.md) for more details. #[derive(Copy, Clone)] pub struct TyCtxt<'a, 'gcx: 'a+'tcx, 'tcx: 'a> { gcx: &'a GlobalCtxt<'gcx>, diff --git a/src/librustc_back/README.md b/src/librustc_back/README.md new file mode 100644 index 000000000000..bd99c687bb6a --- /dev/null +++ b/src/librustc_back/README.md @@ -0,0 +1,6 @@ +NB: This crate is part of the Rust compiler. For an overview of the +compiler as a whole, see +[the README.md file found in `librustc`](../librustc/README.md). + +`librustc_back` contains some very low-level details that are +specific to different LLVM targets and so forth. diff --git a/src/librustc_driver/README.md b/src/librustc_driver/README.md new file mode 100644 index 000000000000..5331a05b5cd8 --- /dev/null +++ b/src/librustc_driver/README.md @@ -0,0 +1,12 @@ +NB: This crate is part of the Rust compiler. For an overview of the +compiler as a whole, see +[the README.md file found in `librustc`](../librustc/README.md). + +The `driver` crate is effectively the "main" function for the Rust +compiler. It orchestrates the compilation process and "knits together" +the code from the other crates within rustc. This crate itself does +not contain any of the "main logic" of the compiler (though it does +have some code related to pretty printing or other minor compiler +options). + + diff --git a/src/librustc_trans/README.md b/src/librustc_trans/README.md index cd43cbd70528..b69d632a6a0d 100644 --- a/src/librustc_trans/README.md +++ b/src/librustc_trans/README.md @@ -1 +1,7 @@ -See [librustc/README.md](../librustc/README.md). +NB: This crate is part of the Rust compiler. For an overview of the +compiler as a whole, see +[the README.md file found in `librustc`](../librustc/README.md). 
+ +The `trans` crate contains the code to convert from MIR into LLVM IR, +and then from LLVM IR into machine code. In general it contains code +that runs towards the end of the compilation process. diff --git a/src/libsyntax/README.md b/src/libsyntax/README.md new file mode 100644 index 000000000000..3bf735ee8680 --- /dev/null +++ b/src/libsyntax/README.md @@ -0,0 +1,7 @@ +NB: This crate is part of the Rust compiler. For an overview of the +compiler as a whole, see +[the README.md file found in `librustc`](../librustc/README.md). + +The `syntax` crate contains those things concerned purely with syntax +– that is, the AST ("abstract syntax tree"), parser, pretty-printer, +lexer, macro expander, and utilities for traversing ASTs.
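As a closing illustration of the out-of-band item storage that the HIR README in this change describes, the following is a heavily simplified sketch. The `ItemId`, `Item`, and `Crate` types here are hypothetical stand-ins, not the real `hir` definitions, and `build_example` and `child_names` are helper names assumed for the example. It models the `mod foo { fn bar() { } }` case: the module stores only the id of `bar`, and the contents must be fetched through a single lookup method on the crate.

```rust
use std::collections::BTreeMap;

// Hypothetical, simplified stand-ins for the HIR types (assumptions only).
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct ItemId(u32);

#[derive(Debug)]
enum Item {
    // A module holds only the ids of its children, never their contents.
    Mod { name: String, item_ids: Vec<ItemId> },
    Fn { name: String },
}

struct Crate {
    // Out-of-band storage: every item in the crate, keyed by id.
    items: BTreeMap<ItemId, Item>,
}

impl Crate {
    /// The single lookup point for item contents. A real compiler could
    /// record an incremental-compilation dependency here.
    fn item(&self, id: ItemId) -> &Item {
        &self.items[&id]
    }
}

/// Builds the crate for `mod foo { fn bar() { } }`.
fn build_example() -> Crate {
    let mut items = BTreeMap::new();
    items.insert(
        ItemId(0),
        Item::Mod { name: "foo".to_string(), item_ids: vec![ItemId(1)] },
    );
    items.insert(ItemId(1), Item::Fn { name: "bar".to_string() });
    Crate { items }
}

/// Resolves the names of a module's children by looking up each child id.
fn child_names(krate: &Crate, id: ItemId) -> Vec<String> {
    match krate.item(id) {
        Item::Mod { item_ids, .. } => item_ids
            .iter()
            .map(|&child| match krate.item(child) {
                Item::Mod { name, .. } | Item::Fn { name } => name.clone(),
            })
            .collect(),
        Item::Fn { .. } => Vec::new(),
    }
}

fn main() {
    let krate = build_example();
    println!("children of foo: {:?}", child_names(&krate, ItemId(0)));
    // Iterating the map visits every item without walking the module tree.
    println!("{} items in the crate", krate.items.len());
}
```

Routing every access through one method is the design point: holding the module alone gives no access to `bar`'s contents, so each such access can be observed.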