dart-sdk

mirror of https://github.com/dart-lang/sdk synced 2024-09-15 22:41:41 +00:00


..
compile_benchmark	[dart2wasm] New typed data implementation	2023-07-20 09:47:39 +00:00
run_benchmark	[dart2wasm] New async implementation	2023-05-22 08:32:12 +00:00

Ömer Sinan Ağacan 0f54180b51 [dart2wasm] New typed data implementation

New typed data implementation that optimizes the common cases.

This uses the best possible representation for the fast case, with a
class like:

    class _I32List implements Int32List {
      final WasmIntArray<WasmI32> _data;

      int operator [](int index) {
        // range check
        return _data.read(index);
      }

      void operator []=(int index, int value) {
        // range check
        _data.writeSigned(index, value);
      }

      ...
    }

This gives us the best possible runtime performance in the common cases
of:

- The list is used directly.
- The list is used via a view of the same Wasm element type (e.g. a
  `Uint32List` view of an `Int32List`) and with an aligned byte offset.
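
As a rough illustration of the two fast cases from the user's side, the
sketch below uses only the public `dart:typed_data` API. The helper name
`viewDemo` is made up for this example, and which internal class backs each
list is a dart2wasm detail that is not visible from this code.

```dart
import 'dart:typed_data';

// Hypothetical helper: returns [first element seen through the unsigned
// view, second element of the original list after a write via a view].
List<int> viewDemo() {
  // Case 1: the list is used directly.
  final ints = Int32List(4);
  ints[0] = -1;

  // Case 2: views with the same element size (32 bits) and a byte offset
  // that is a multiple of the element size (0 and 4).
  final all = Uint32List.view(ints.buffer);
  final tail = Uint32List.view(ints.buffer, 4, 3);

  tail[0] = 7; // Writes through to ints[1].

  return [all[0], ints[1]];
}

void main() {
  final result = viewDemo();
  print(result[0]); // -1 reinterpreted as an unsigned 32-bit value
  print(result[1]); // 7
}
```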

All other classes (`ByteBuffer`, `ByteData`, and view classes) are
implemented to support this representation.

Summary of classes:

- One list class per Dart typed data list, with the matching Wasm array
  as the buffer (as shown in the example above): `_I8List`, `_U8List`,
  `_U8ClampedList`, `_I16List`, `_U16List`, ...

- One list class per Dart typed data list, with a mismatching Wasm array
  as the buffer. These classes are used when a view is created from a
  list, and the original list has a Wasm array with a different element
  type than the view needs: `_SlowI8List`, `_SlowU8List`, ...

  These classes use the `ByteData` interface to update the buffer.

- One list class for each of the classes listed above, for immutable
  views. `_UnmodifiableI32List`, `_UnmodifiableSlowU64List`, ...

  These classes inherit from their modifiable list classes and override
  update methods using a mixin.

- One `ByteData` class for each Wasm array type: `_I8ByteData`,
  `_I16ByteData`, ...

- One immutable `ByteData` view for each `ByteData` class.

- One `ByteBuffer` class for each Wasm array type: `_I8ByteBuffer`,
  `_I16ByteBuffer`, ...

- A single `ByteBuffer` class for the immutable view of a byte buffer.

  We don't need one immutable `ByteBuffer` view class per Wasm array
  type, as the `ByteBuffer` API does not provide direct access to the
  buffer.
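
The "slow" and "unmodifiable" classes in the summary above can be pictured
with a minimal sketch. All names here (`SlowU32View`, `UnmodifiableWrites`,
`UnmodifiableSlowU32View`) are hypothetical and are not the classes added by
this CL; the sketch assumes host byte order and only shows the general shape:
element access routed through `ByteData`, and a mixin that overrides the
update method.

```dart
import 'dart:typed_data';

// Hypothetical stand-in for a "slow" list: a Uint32 view whose buffer is
// reached only through ByteData instead of a matching 32-bit array.
class SlowU32View {
  SlowU32View(ByteBuffer buffer, this._offsetInBytes, this.length)
      : _bytes = ByteData.view(buffer);

  final ByteData _bytes;
  final int _offsetInBytes;
  final int length;

  int operator [](int index) {
    RangeError.checkValidIndex(index, this, 'index', length);
    return _bytes.getUint32(_offsetInBytes + index * 4, Endian.host);
  }

  void operator []=(int index, int value) {
    RangeError.checkValidIndex(index, this, 'index', length);
    _bytes.setUint32(_offsetInBytes + index * 4, value, Endian.host);
  }
}

// Hypothetical mixin that turns every update into an error.
mixin UnmodifiableWrites on SlowU32View {
  @override
  void operator []=(int index, int value) {
    throw UnsupportedError('Cannot modify an unmodifiable list');
  }
}

// Inherits reads from the modifiable class, overrides writes via the mixin.
class UnmodifiableSlowU32View extends SlowU32View with UnmodifiableWrites {
  UnmodifiableSlowU32View(ByteBuffer buffer, int offsetInBytes, int length)
      : super(buffer, offsetInBytes, length);
}

void main() {
  // The original list has 16-bit elements; the view needs 32-bit elements.
  final shorts = Uint16List.fromList([0x1234, 0xABCD, 0, 0]);
  final view = SlowU32View(shorts.buffer, 0, 2);
  view[1] = 0xDEADBEEF;

  // Same bytes as the built-in view over the same buffer.
  final builtIn = Uint32List.view(shorts.buffer);
  print(view[0] == builtIn[0]);
  print(view[1] == builtIn[1]);

  final readOnly = UnmodifiableSlowU32View(shorts.buffer, 0, 2);
  try {
    readOnly[0] = 1;
  } on UnsupportedError catch (e) {
    print(e.message);
  }
}
```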

Other optimizations:

- `setRange` now uses `array.copy` when possible, which causes a huge
  performance win in some benchmarks.

- The new implementation is pure Dart and needs no support or special
  cases from the compiler other than the Wasm array type support and
  intrinsics like `array.copy`. As a result this removes a bunch of
  `entry-point` pragmas and significantly reduces code size in some
  cases.
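
For reference, this is roughly what a `setRange` call between two lists of
the same element type looks like through the public API. The helper name
`copyPrefix` is made up for this example; whether a given call is lowered to
`array.copy` is decided inside the implementation and cannot be observed
from this code.

```dart
import 'dart:typed_data';

// Hypothetical helper: copies the first `count` elements of `source` into
// a new list. Source and target share the same element type.
Uint8List copyPrefix(Uint8List source, int count) {
  final target = Uint8List(count);
  target.setRange(0, count, source);
  return target;
}

void main() {
  final source = Uint8List.fromList([1, 2, 3, 4, 5]);
  print(copyPrefix(source, 3)); // [1, 2, 3]

  // The optional skipCount argument skips leading source elements.
  final target = Uint8List(2)..setRange(0, 2, source, 3);
  print(target); // [4, 5]
}
```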

Other changes:

- Patch and implementation files for typed data and SIMD types are split
  into separate files. `typed_data_patch.dart` and `simd_patch.dart` now
  contain only patched factories. Implementation classes are moved to
  `typed_data.dart` and `simd.dart` as libraries `dart:_typed_data` and
  `dart:_simd`.

Benchmark results:

This CL significantly improves common cases. The new implementation is
slower than the current implementation only when a view uses a Wasm
array type with an incompatible element type (for example, a
`Uint32List` created from a `Uint64List`).

These cases can still be improved by overriding the relevant `ByteData`
methods. For example, for a `Uint32List` view of a `Uint64List`,
`_I64ByteData.getUint32` can be overridden to do a single read when the
requested bytes don't cross element boundaries in the Wasm array.
These optimizations are left as future work.
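
One way to picture the single-read idea is the sketch below. It is not code
from this CL: `getUint32FromWords` is a hypothetical free function, a
`Uint64List` stands in for the 64-bit Wasm array, bytes are assumed to be
laid out little-endian within each element, and 64-bit integers are assumed
(so it does not apply to JavaScript targets). It uses the `>>>` operator
from Dart 2.14.

```dart
import 'dart:typed_data';

// Hypothetical sketch: reads an unsigned 32-bit value at `byteOffset` from
// storage made of 64-bit elements.
int getUint32FromWords(Uint64List words, int byteOffset) {
  final index = byteOffset ~/ 8;
  final within = byteOffset % 8; // Byte position inside the element.

  if (within <= 4) {
    // All four bytes live in one element: a single read plus a shift.
    return (words[index] >>> (within * 8)) & 0xFFFFFFFF;
  }

  // The value crosses an element boundary: combine two reads.
  final low = words[index] >>> (within * 8);
  final high = words[index + 1] << ((8 - within) * 8);
  return (low | high) & 0xFFFFFFFF;
}

void main() {
  final words = Uint64List.fromList([
    0x1122334455667788,
    0x0102030405060708,
  ]);
  print(getUint32FromWords(words, 0).toRadixString(16)); // 55667788
  print(getUint32FromWords(words, 4).toRadixString(16)); // 11223344
  print(getUint32FromWords(words, 6).toRadixString(16)); // 7081122
}
```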

Some sample benchmarks:

vector_math matrix_bench before:

    Binary size: 133,104 bytes.
    MatrixMultiply(RunTime): 201 us.
    SIMDMatrixMultiply(RunTime): 3,608 us.
    VectorTransform(RunTime): 94 us.
    SIMDVectorTransform(RunTime): 833 us.
    setViewMatrix(RunTime): 506 us.
    aabb2Transform(RunTime): 987 us.
    aabb2Rotate(RunTime): 721 us.
    aabb3Transform(RunTime): 1,710 us.
    aabb3Rotate(RunTime): 1,156 us.
    Matrix3.determinant(RunTime): 171 us.
    Matrix3.transform(Vector3)(RunTime): 8,550 us.
    Matrix3.transform(Vector2)(RunTime): 3,924 us.
    Matrix3.transposeMultiply(RunTime): 201 us.

vector_math matrix_bench after:

    Binary size: 135,198 bytes.
    MatrixMultiply(RunTime): 42 us.
    SIMDMatrixMultiply(RunTime): 2,068 us.
    VectorTransform(RunTime): 12 us.
    SIMDVectorTransform(RunTime): 272 us.
    setViewMatrix(RunTime): 82 us.
    aabb2Transform(RunTime): 167 us.
    aabb2Rotate(RunTime): 147 us.
    aabb3Transform(RunTime): 194 us.
    aabb3Rotate(RunTime): 199 us.
    Matrix3.determinant(RunTime): 70 us.
    Matrix3.transform(Vector3)(RunTime): 726 us.
    Matrix3.transform(Vector2)(RunTime): 504 us.
    Matrix3.transposeMultiply(RunTime): 53 us.

FluidMotion before:

    Binary size: 121,130 bytes.
    FluidMotion(RunTime): 270,625 us.

FluidMotion after:

    Binary size: 110,674 bytes.
    FluidMotion(RunTime): 71,357 us.

With bound checks omitted (not in this CL), FluidMotion becomes
competitive with `dart2js -O4`:

FluidMotion dart2js -O4:

    FluidMotion(RunTime): 47,813 us.

FluidMotion this CL + bound checks omitted:

    FluidMotion(RunTime): 51,289 us.
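
To read the listings above as ratios, a small helper along these lines can
be used. The function names are made up for this example; the input numbers
are copied from the listings, and only a few rows are included.

```dart
// Ratio of the "before" time to the "after" time; above 1.0 means faster.
double speedup(num before, num after) => before / after;

// Signed percentage change from `before` to `after`.
double percentChange(num before, num after) =>
    (after - before) / before * 100;

void main() {
  // [before, after] run times in microseconds.
  const runTimesUs = <String, List<int>>{
    'MatrixMultiply': [201, 42],
    'VectorTransform': [94, 12],
    'FluidMotion': [270625, 71357],
  };
  runTimesUs.forEach((name, t) {
    print('$name: ${speedup(t[0], t[1]).toStringAsFixed(2)}x faster');
  });

  // [before, after] binary sizes in bytes.
  const sizesBytes = <String, List<int>>{
    'matrix_bench': [133104, 135198],
    'FluidMotion': [121130, 110674],
  };
  sizesBytes.forEach((name, s) {
    final change = percentChange(s[0], s[1]).toStringAsFixed(1);
    print('$name binary size: $change%');
  });
}
```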

Fixes #52710.

Tested: With existing tests.
Change-Id: I33bf5585c3be5d3919a99af857659cf7d9393df0
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/312907
Reviewed-by: Joshua Litt <joshualitt@google.com>
Commit-Queue: Ömer Ağacan <omersa@google.com>

2023-07-20 09:47:39 +00:00
