[dart2js]: Update README

One pass to update the general description of the compiler pipeline. Change-Id: I0597958139e9ea11b27dcb6072b8d70a90c9c937 Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/242505 Reviewed-by: Mayank Patke <fishythefish@google.com> Commit-Queue: Sigmund Cherem <sigmund@google.com>
2024-09-16 00:29:48 +00:00 · 2022-05-06 01:16:25 +00:00 · 2022-05-06 01:16:25 +00:00 · bf2cff83d5
parent 4f0ed6a45c
commit bf2cff83d5
1 changed files with 88 additions and 182 deletions
--- a/pkg/compiler/README.md
+++ b/pkg/compiler/README.md
@ -4,188 +4,101 @@ Welcome to the sources of the dart2js compiler!
 ## Architecture
-The compiler is currently undergoing a long refactoring process. As you navigate
+The compiler is structured to operate in several phases. By default these phases
-this code you may find it helpful to understand how the compiler used to be,
+are executed in sequence in a single process, but on some build systems, some of
-where it is going, and where it is today.
+these phases are split into separate processes. As such, there is plenty of
 indirection and data representations used mostly for the purpose of serializing
 intermediate results during compilation.
-### The near future architecture
+The current compiler phases are:
-The compiler will operate in these general phases:
+  1. **common front-end**: Execute traditional front-end compilation phases.
     Dart2js delegates to the common front-end (also used by DDC and the VM) to
     do all front-end features, this includes:
     *  parsing Dart source code,
     *  type checking,
     *  inferring implicit user types, like locals with a `var` declaration,
     *  lowering or simplifying Dart features. For example, this is how many
        syntactic features, like extension methods and list comprehensions, are
        implemented.
     * additional web-specific lowering or simplifications. For example,
       expansion of JS-interop features and web specific implementation of
       language features like late variables.
-  1. **load kernel**: Load all the code as kernel
+    The result of this phase is a kernel AST which is serialized as a `.dill`
-      * Collect dart sources transtively
+    file.
      * Convert to kernel AST
-  (this will be handled by invoking the front-end package)
+  2. **modular analysis**: Using kernel as input, compute data recording
     properties about each method in the program, especially around dependencies
     and features they may need. We call this "impact data" (i1).
-  Alternatively, the compiler can start compilation directly from kernel files.
+     When the compiler runs as a single process, this is done lazily/on-demand
     during the tree-shaking phase (below). However, this data can also be
     computed independently for individual methods, files, or packages in the
     application.  That makes it possible to run this modularly and in parallel.
-  2. **model**: Create a Dart model of the program
+     The result of this phase can be emitted as files containing impact data in
-     * The kernel ASTs could be used as a model, so this might be a no-op or just
+     a serialized format.
       creating a thin wrapper on top of kernel.
-  3. **tree-shake and create world**: Build world of reachable code
+  3. **tree-shake and create world**: Create a model to understand what parts of
-     * For each reachable piece of code:
+     the code are used by an application. This consists of:
-         * Compute impact (i1) from kernel AST
+        * creating an intermediate representation called the "K model" that
-     * Build a closed world (w1)
+          wraps our kernel representation
        * calculating which classes and methods are considered live in the
          program. This is done by incrementally combining impact data (i1)
          starting from `main`, then visiting reachable methods in the program
          with an Rapid Type Analysis (RTA) algorithm to aggregate impacts
          together.
-  4. **analyze**: Run a global analysis
+     The result of this phase is what we call a "closed world" (w1). The closed
-     * Assume closed world semantics (from w1)
+     world is also a datastructure that can answer interesting queries, such as:
-     * Produce a global result (g)
+     Is this interface implemented by a single class? Is this method available
-        * Like today (g) will contain type and nullability information
+     in any stubtype of some interface? The answers to these questions can help
-        * After we adopt strong-mode types, we want to explore simplifying this
+     the compiler generate higher quality JavaScript.
        to only contain native + nullability information.
-  5. **codegen model**: Create a JS model of the program
+  4. **global analysis**: Run a global analysis that assumes closed world
-     * Model JavaScript specific concepts (like the split of constructor bodies
+     semantics (from w1) and propagates information across method boundaries to
-       as separate elements) and provide a mapping to the Dart model
+     further understand what values flow through the program. This phase is
     very valuable in narrowing down possibilities that are ambiguous based
     solely on type information written by developers. It often finds
     oportunities that enable the compiler to devirtualize or inline method
     calls, generate code specializations, or trigger performance optimizations.
-  6. **codegen and tree-shake**: Generate code, as needed
+     The result of this phase is a "global result" (g).
-     * For each reachable piece of code:
+
-        * build ssa graph from kernel ASTs and global results (g)
+  5. **codegen model**: Create a JS or backend model of the program. This is an
-        * optimize ssa
+     intermediate representation of the entities in the program we referred to
     as the "J model". It is very similar to the "K model", but it is tailored
     to model JavaScript specific concepts (like the split of constructor bodies
     as separate elements) and provide a mapping to the Dart model.
  6. **codegen**: Generate code for each method that is deemed necessary. This
     includes:
        * build an SSA graph from kernel ASTs and global results (g)
        * optimize the SSA representation
        * compute impact (i2) from optimized code
        * emit JS ASTs for the code
     * Build a codegen closed world (w2) from new impacts (i2)
-  7. **emit**: Assemble and minify the program
+
-     * Build program structure from the compiled pieces (w2)
+  7. **link tree-shake**: Using the results of codegen, we perform a second
     round of tree-shaking. This is important because code that was deemed
     reachable in (w1) may be found unreachable after optimizations. The process
     is very similar to the earlier phase: we combine incrementally the codegen
     impact data (i2) and compute a codegen closed world (w2).
     When dart2js runs as a single process the codegen phase is done lazily and
     on-demand, together with the tree-shaking phase.
  8. **emit JavaScript files**: The final step is to assemble and minify the
     final program. This includes:
     * Build a JavaScript program structure from the compiled pieces (w2)
     * Use frequency namer to minify names.
     * Emit js and source map files.
-### The old architecture
+## Code organization
-The compiler used to operate as follows:
+### Some terminology used in the compiler
  1. **load dart**: Load all source files
     * Collect dart sources transtively
     * Scan enough tokens to build import dependencies.
  2. **model**: Create a Dart model (aka. Element Model) of the program
     * Do a diet-parse of the program to create the high-level element model
  3. **resolve and tree-shake**: Resolve and build world of reachable code (the
     resolution enqueuer)
     * For each reachable piece of code:
        * Parse the full body of the function
        * Resolve it and enqueue other pieces that are reachable
        * Type check the body of the function
  4. **analyze**: Run a global analysis
     * Assume closed world semantics (from everything enqueued by the resolver)
     * Produce a global result about type and nullability information of method
       arguments, return values, and receivers of dynamic sends.
  5. **codegen and tree-shake**: Generate code, as needed (via the codegen
     enqueuer)
     * For each reachable piece of code:
        * build ssa graph from resolved source ASTs global results (g)
        * optimize ssa
        * enqueue visible dependencies
        * emit js asts for the code
  6. **emit**: Assemble and minify the program
     * Build program structure from the compiled pieces
     * Use frequency namer to minify names.
     * Emit js and source map files.
 ### The architecture today (which might be changing while you read this!)
 When using the `--use-kernel` flag, you can test the latest state of the
 compiler as we are migrating to the new architecture. Currently it works as
 follows:
  1. **load dart**: (same as old compiler)
  2. **model**: (same element model as old compiler)
  3. **resolve, tree-shake and build world**: Build world of reachable code
     * For each reachable piece of code:
        * Parse full body of the function
        * Resolve it from the parsed source ASTs
        * Type check it (same as old compiler)
        * Compute impact (i1) from resolved source ASTs (no kernel)
     * Build a closed world (w1)
  4. **kernelize**: Create kernel ASTs
     * For all resolved elements in w1, compute their kernel representation using
       the `rasta` visitor.
  5. **analyze**: (almost same as old compiler)
  6. **codegen and tree-shake**: Generate code, as needed
     * For each reachable piece of code:
        * build ssa graph from kernel ASTs (uses global results g)
        * optimize ssa
        * compute impact (i2) from optimized code
        * emit js asts for the code
     * Build a codegen closed world (w2) from new impacts (i2)
  7. **emit**: (same as old compiler)
 Some additional details worth highlighting:
  * tree-shaking is close to working as we want: the notion of a world and world
    impacts are computed explicitly:
     * In the old compiler, the resolver and code generator directly
       enqueued items to be processed, there was no knowledge of what had
       to be done other than in the algorithm itself.
     * Now the information is computed explicitly in two ways:
       * The dependencies of a single element are computed as an "impact"
         object, these are derived from the structure of the
         code (either the resolved code or the generated code).
       * The closed world is now an explicit concept that can be replaced in the
         compiler.
     * This allows us to delete the resolver in the future and replace it
       with a kernel loader, an impact builder from kernel, and a kernel world.
     * There is an implementation of a kernel impact builder, but it is not yet
       in use in the compiler pipeline (gated on replacing the Dart model)
  * We still depend on the Dart model computed by resolution, but progress has
    been made introducing an abstraction common to the new and old models. The
    old model is the "Element model", the generic abstraction is called the
    "Entity model". Some portions of the compiler now refer to the entity model.
  * The ssa graph is built from the kernel ASTs, but it still depends on the old
    element model computed from resolution (accessed via a kernel2Ast adapter).
    The graph builder implementation covers a large chunk of the language
    features, but is not complete (89% of langage & corelib tests are passing).
  * Global analysis is still working on top of the dart2js ASTs.
 ## Code organization and history
 The compiler package was initially intended to be compiler for multiple targets:
 Javascript, Dart (dart2dart), and dartino bytecodes. It has now evolved to be a
 Javascript only compiler, but some of the abstractions to support multiple
 targets still remain.
 ### Possibly confusing terminology
 Some of the terminology in the compiler is confusing without knowing its
 history. We are cleaning this up as we are rearchitecting the system, but here
 are some of the legacy terminology we have:
  * **target**: the output the compiler is producing. Nowdays it just
    JavaScript, but in the past there was also Dart and dartino bytecodes.
  * **backend**: pieces of the compiler that were target-specific.
    Note: in the past we've used the term *backend* also for code that is used
    in the frontend of the compiler that happens to be target-specific, as well
    as and code that is used in the emitter or what traditionally is known
    as the backend of the compiler.
  * **frontend**: the parser, resolver, and other early stages of the compiler.
    The front-end however makes target-specific choices. For example, to compile
    a program with async-await, the dart2js backend needs to include some helper
    functions that are used by the expanded async-await code, these helpers need
    to be parsed by the frontend and added to the compilation pipeline.
  * **world**: the compiler exploits closed-world assumptions to do
    optimizations. The *world* encapsulates some of our knowledge of the
@ -201,29 +114,22 @@ are some of the legacy terminology we have:
  * **model**: there are many models in the compiler:
-    * **element model**: this is an abstraction describing the elements seen in
+    * **entity model**: this is an abstraction describing the elements seen in
-      Dart programs, like "libraries", "classes", "methods", etc.
+      Dart programs, like "libraries", "classes", "methods", etc. We currently
-
+      have two entity models, the "K model" (which is frontend centric and
-    * **entity model**: also describes elements seen in Dart programs, but it is
+      usually maps 1:1 with kernel entities) and the "J model" (which is backend
-      meant to be minimalistic and a super-hierarchy above the *element models*.
+      centric).
      This is a newer addition, is an added abstraction to make it possible to
      refactor our code from our old frontend to the kernel frontend.
    * **Dart vs JS models**: the compiler in the past had a single model to
      describe elements in the source and elements that were being compiled. In
      the future we plan to have two. Both input model and output models will be
      implementations of the *entity model*. The JS model is intended to have
      concepts specific about generating code in JS (like constructor-bodies as
      a separate entity than the constructor, closure classes, etc).
    * **emitter model**: this is a model just used for dumping out the structure
      of the program in a .js text file. It doesn't have enough semantic meaning
-      to be a JS model for compilation at this moment.
+      to be a JS model for compilation, which is why there is a separate "J
      model".
  * **enqueuer**: a work-queue used to achieve tree-shaking (or more precisely
    tree-growing): elements are added to the enqueuer as we recognize that they
-    are needed in a given application. Note that we even track how elements are
+    are needed in a given application (as described by the impact data). Note
-    used, since some ways of using an element require more code than others.
+    that we even track how elements are used, since some ways of using an
    element require more code than others.
 ### Code layout