[vm/compiler] Further optimize setRange on TypedData receivers.

When setRange is called on a TypedData receiver and the source is also
a TypedData object with the same element size and clamping is not
required, the VM implementation now calls _boundsCheckAndMemcpyN for
element size N. The generated IL for these methods performs the copy
using the MemoryCopy instruction (mostly, see the note below).
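
For intuition, a minimal C++ sketch of what such a helper does (the name, signature, and template parameter here are illustrative only, not the actual Dart library or VM code):

  #include <cassert>
  #include <cstddef>
  #include <cstdint>
  #include <cstring>

  // Conceptual sketch only: the real _boundsCheckAndMemcpyN is Dart library
  // code whose generated IL uses the MemoryCopy instruction. This shows the
  // shape of the operation: range-check both regions, then copy `count`
  // N-byte elements, using memmove since the two buffers may overlap.
  template <size_t N>
  void BoundsCheckAndCopy(uint8_t* dest, size_t dest_len_in_elements,
                          size_t dest_start, const uint8_t* src,
                          size_t src_len_in_elements, size_t src_start,
                          size_t count) {
    assert(dest_start + count <= dest_len_in_elements);
    assert(src_start + count <= src_len_in_elements);
    memmove(dest + dest_start * N, src + src_start * N, count * N);
  }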

Since the two TypedData objects might share the same underlying
buffer, this CL adds a can_overlap flag to the MemoryCopy instruction.
When can_overlap is set, the generated code checks for overlapping
regions and, when needed, performs the copy backwards instead of
forwards so that elements of the source region are read before they
are overwritten.
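
A small standalone C++ sketch of that direction rule (an illustration of the rule the generated code follows, not the emitted code itself):

  #include <cstddef>
  #include <cstdint>

  // Copies count * elem_size bytes from src to dest, choosing the loop
  // direction so that a possible overlap is handled safely.
  void CopyPossiblyOverlapping(uint8_t* dest, const uint8_t* src,
                               size_t count, size_t elem_size) {
    const size_t num_bytes = count * elem_size;
    if (reinterpret_cast<uintptr_t>(dest) <= reinterpret_cast<uintptr_t>(src)) {
      // Destination starts at or before the source: copy forwards, so each
      // overlapping byte is read before it is overwritten.
      for (size_t i = 0; i < num_bytes; i++) dest[i] = src[i];
    } else {
      // Source starts before the destination: copy backwards, so the tail of
      // the source is read before the destination's writes reach it.
      for (size_t i = num_bytes; i > 0; i--) dest[i - 1] = src[i - 1];
    }
  }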

The existing uses of the MemoryCopy instruction are adjusted as
follows:
* The IL generated for copyRangeFromUint8ListToOneByteString
  passes false for can_overlap, as all uses currently ensure that
  the OneByteString is non-external and thus cannot overlap.
* The IL generated for _memCopy, used by the FFI library, passes
  true for can_overlap, as there is no guarantee that the regions
  pointed at by the Pointer objects do not overlap.

The MemoryCopy instruction has also been adjusted so that all of its
numeric inputs (the two start offsets and the length) are uniformly
boxed or unboxed, rather than only the length being optionally unboxed.
This exposed an issue in the inliner: unboxed constants in the callee
graph were replaced with boxed constants when inlining into the caller
graph, which mattered because withList calls setRange with constant
starting offsets of 0. Now the representation of constants in the
callee graph is preserved when inlining the callee graph into the
caller graph.
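
A toy C++ model of that inliner fix (not the Dart VM's FlowGraph or inliner API; all names here are made up for illustration):

  #include <cstdint>

  // A constant definition carries a representation: boxed/tagged or unboxed.
  enum class Representation { kTagged, kUnboxedIntPtr };

  struct ToyConstant {
    int64_t value;
    Representation rep;
  };

  // Previously (conceptually): constants copied into the caller graph always
  // came back tagged, losing the callee's unboxed representation.
  ToyConstant InlineConstantBefore(const ToyConstant& callee_constant) {
    return {callee_constant.value, Representation::kTagged};
  }

  // Now: the callee's representation is preserved, so an unboxed constant
  // start offset of 0 stays unboxed after inlining.
  ToyConstant InlineConstantAfter(const ToyConstant& callee_constant) {
    return {callee_constant.value, callee_constant.rep};
  }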

Fixes https://github.com/dart-lang/sdk/issues/51237 by using TMP
and TMP2 for the LDP/STP calls in the 16-byte element size case, so no
temporaries need to be allocated for the instruction.

On ARM, when not unrolling the memory copy loop, the generated code
uses TMP and a single additional temporary for the LDM/STM instructions
in the 8-byte and 16-byte element cases, with the latter simply using
two LDM/STM pairs within the loop, a different approach than the one
described in https://github.com/dart-lang/sdk/issues/51229.

Note: Once the number of elements being copied reaches a certain
threshold (1048576 on X86, 256 otherwise), _boundsCheckAndMemcpyN
instead calls _nativeSetRange, which is a native call that uses memmove
from the standard C library for non-clamped inputs. It does this
because the code currently emitted for MemoryCopy performs poorly
compared to the more optimized memmove implementation when copying
larger regions of memory.
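
The dispatch described in this note can be sketched as follows (the thresholds are the ones quoted above; the function and parameter names are made up, and the byte loop merely stands in for the generated MemoryCopy code):

  #include <cstdint>
  #include <cstring>

  // Thresholds from the note above; names here are illustrative only.
  constexpr intptr_t kMemmoveThresholdX86 = 1048576;
  constexpr intptr_t kMemmoveThresholdOther = 256;

  void CopyElements(uint8_t* dest, const uint8_t* src, intptr_t count,
                    intptr_t element_size, bool is_x86) {
    const intptr_t threshold =
        is_x86 ? kMemmoveThresholdX86 : kMemmoveThresholdOther;
    if (count >= threshold) {
      // Large copies: fall back to the C library's memmove, which is heavily
      // optimized and handles overlap.
      memmove(dest, src, count * element_size);
      return;
    }
    // Small copies: in the VM this is where the generated MemoryCopy code
    // runs; a plain forward loop stands in for it here (overlap handling,
    // shown in the earlier sketch, is omitted).
    for (intptr_t i = 0; i < count * element_size; i++) {
      dest[i] = src[i];
    }
  }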

Notable benchmark changes for dart-aot:
* X64
  * TypedDataDuplicate.*.fromList improvement of ~13% to ~250%
  * Utf8Encode.*.10 improvement of ~50% to ~75%
  * MapCopy.Map.*.of.Map.* improvement of ~13% to ~65%
  * MemoryCopy.*.setRange.* improvement of ~13% to ~500%
* ARM7
  * Utf8Encode.*.10 improvement of ~35% to ~70%
  * MapCopy.Map.*.of.Map.* improvement of ~6% to ~75%
  * MemoryCopy.*.setRange.{8,64} improvement of ~22% to ~500%
    * Improvement of ~100% to ~200% for MemoryCopy.512.setRange.*.Double
    * Regression of ~40% for MemoryCopy.512.setRange.*.Uint8
    * Regression of ~85% for MemoryCopy.4096.setRange.*.Uint8
* ARM8
  * Utf8Encode.*.10 improvement of ~35% to ~70%
  * MapCopy.Map.*.of.Map.* improvement of ~7% to ~75%
  * MemoryCopy.*.setRange.{8,64} improvement of ~22% to ~500%
    * Improvement of ~75% to ~160% for MemoryCopy.512.setRange.*.Double
    * Regression of ~40% for MemoryCopy.512.setRange.*.Uint8
    * Regression of ~85% for MemoryCopy.4096.setRange.*.Uint8

TEST=vm/cc/IRTest_Memory, co19{,_2}/LibTest/typed_data,
     lib{,_2}/typed_data, corelib{,_2}/list_test

Issue: https://github.com/dart-lang/sdk/issues/42072
Issue: b/294114694
Issue: b/259315681

Change-Id: Ic75521c5fe10b952b5b9ce5f2020c7e3f03672a9
Cq-Include-Trybots: luci.dart.try:vm-aot-linux-debug-simarm_x64-try,vm-aot-linux-debug-simriscv64-try,vm-aot-linux-debug-x64-try,vm-aot-linux-debug-x64c-try,vm-kernel-linux-debug-x64-try,vm-kernel-precomp-linux-debug-x64-try,vm-linux-debug-ia32-try,vm-linux-debug-simriscv64-try,vm-linux-debug-x64-try,vm-linux-debug-x64c-try,vm-mac-debug-arm64-try,vm-mac-debug-x64-try,vm-aot-linux-release-simarm64-try,vm-aot-linux-release-simarm_x64-try,vm-aot-linux-release-x64-try,vm-aot-mac-release-arm64-try,vm-aot-mac-release-x64-try,vm-ffi-qemu-linux-release-riscv64-try,vm-ffi-qemu-linux-release-arm-try,vm-aot-msan-linux-release-x64-try,vm-msan-linux-release-x64-try,vm-aot-tsan-linux-release-x64-try,vm-tsan-linux-release-x64-try,vm-linux-release-ia32-try,vm-linux-release-simarm-try,vm-linux-release-simarm64-try,vm-linux-release-x64-try,vm-mac-release-arm64-try,vm-mac-release-x64-try,vm-kernel-precomp-linux-release-x64-try,vm-aot-android-release-arm64c-try,vm-ffi-android-debug-arm64c-try
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/319521
Reviewed-by: Daco Harkes <dacoharkes@google.com>
Reviewed-by: Alexander Markov <alexmarkov@google.com>
Commit-Queue: Tess Strickland <sstrickl@google.com>
Tess Strickland 2023-09-04 14:38:27 +00:00 committed by Commit Queue
parent b745fa8923
commit c93f924c82
29 changed files with 1754 additions and 969 deletions

View file

@ -66,90 +66,93 @@ DEFINE_NATIVE_ENTRY(TypedDataView_typedData, 0, 1) {
return TypedDataView::Cast(instance).typed_data();
}
static BoolPtr CopyData(const TypedDataBase& dst_array,
const TypedDataBase& src_array,
const Smi& dst_start,
const Smi& src_start,
const Smi& length,
bool clamped) {
const intptr_t dst_offset_in_bytes = dst_start.Value();
const intptr_t src_offset_in_bytes = src_start.Value();
const intptr_t length_in_bytes = length.Value();
ASSERT(Utils::RangeCheck(src_offset_in_bytes, length_in_bytes,
src_array.LengthInBytes()));
ASSERT(Utils::RangeCheck(dst_offset_in_bytes, length_in_bytes,
dst_array.LengthInBytes()));
if (length_in_bytes > 0) {
NoSafepointScope no_safepoint;
if (clamped) {
uint8_t* dst_data =
reinterpret_cast<uint8_t*>(dst_array.DataAddr(dst_offset_in_bytes));
int8_t* src_data =
reinterpret_cast<int8_t*>(src_array.DataAddr(src_offset_in_bytes));
for (intptr_t ix = 0; ix < length_in_bytes; ix++) {
int8_t v = *src_data;
if (v < 0) v = 0;
*dst_data = v;
src_data++;
dst_data++;
}
} else {
memmove(dst_array.DataAddr(dst_offset_in_bytes),
src_array.DataAddr(src_offset_in_bytes), length_in_bytes);
}
}
return Bool::True().ptr();
}
static bool IsClamped(intptr_t cid) {
switch (cid) {
case kTypedDataUint8ClampedArrayCid:
case kExternalTypedDataUint8ClampedArrayCid:
case kTypedDataUint8ClampedArrayViewCid:
case kUnmodifiableTypedDataUint8ClampedArrayViewCid:
return true;
default:
return false;
}
COMPILE_ASSERT((kTypedDataUint8ClampedArrayCid + 1 ==
kTypedDataUint8ClampedArrayViewCid) &&
(kTypedDataUint8ClampedArrayCid + 2 ==
kExternalTypedDataUint8ClampedArrayCid) &&
(kTypedDataUint8ClampedArrayCid + 3 ==
kUnmodifiableTypedDataUint8ClampedArrayViewCid));
return cid >= kTypedDataUint8ClampedArrayCid &&
cid <= kUnmodifiableTypedDataUint8ClampedArrayViewCid;
}
static bool IsUint8(intptr_t cid) {
switch (cid) {
case kTypedDataUint8ClampedArrayCid:
case kExternalTypedDataUint8ClampedArrayCid:
case kTypedDataUint8ClampedArrayViewCid:
case kUnmodifiableTypedDataUint8ClampedArrayViewCid:
case kTypedDataUint8ArrayCid:
case kExternalTypedDataUint8ArrayCid:
case kTypedDataUint8ArrayViewCid:
case kUnmodifiableTypedDataUint8ArrayViewCid:
return true;
default:
return false;
}
COMPILE_ASSERT(
(kTypedDataUint8ArrayCid + 1 == kTypedDataUint8ArrayViewCid) &&
(kTypedDataUint8ArrayCid + 2 == kExternalTypedDataUint8ArrayCid) &&
(kTypedDataUint8ArrayCid + 3 ==
kUnmodifiableTypedDataUint8ArrayViewCid) &&
(kTypedDataUint8ArrayCid + 4 == kTypedDataUint8ClampedArrayCid));
return cid >= kTypedDataUint8ArrayCid &&
cid <= kUnmodifiableTypedDataUint8ClampedArrayViewCid;
}
DEFINE_NATIVE_ENTRY(TypedDataBase_setRange, 0, 7) {
DEFINE_NATIVE_ENTRY(TypedDataBase_setRange, 0, 5) {
const TypedDataBase& dst =
TypedDataBase::CheckedHandle(zone, arguments->NativeArgAt(0));
const Smi& dst_start = Smi::CheckedHandle(zone, arguments->NativeArgAt(1));
const Smi& length = Smi::CheckedHandle(zone, arguments->NativeArgAt(2));
const Smi& dst_start_smi =
Smi::CheckedHandle(zone, arguments->NativeArgAt(1));
const Smi& dst_end_smi = Smi::CheckedHandle(zone, arguments->NativeArgAt(2));
const TypedDataBase& src =
TypedDataBase::CheckedHandle(zone, arguments->NativeArgAt(3));
const Smi& src_start = Smi::CheckedHandle(zone, arguments->NativeArgAt(4));
const Smi& to_cid_smi = Smi::CheckedHandle(zone, arguments->NativeArgAt(5));
const Smi& from_cid_smi = Smi::CheckedHandle(zone, arguments->NativeArgAt(6));
const Smi& src_start_smi =
Smi::CheckedHandle(zone, arguments->NativeArgAt(4));
if (length.Value() < 0) {
const String& error = String::Handle(String::NewFormatted(
"length (%" Pd ") must be non-negative", length.Value()));
Exceptions::ThrowArgumentError(error);
const intptr_t element_size_in_bytes = dst.ElementSizeInBytes();
ASSERT_EQUAL(src.ElementSizeInBytes(), element_size_in_bytes);
const intptr_t dst_start_in_bytes =
dst_start_smi.Value() * element_size_in_bytes;
const intptr_t dst_end_in_bytes = dst_end_smi.Value() * element_size_in_bytes;
const intptr_t src_start_in_bytes =
src_start_smi.Value() * element_size_in_bytes;
const intptr_t length_in_bytes = dst_end_in_bytes - dst_start_in_bytes;
if (!IsClamped(dst.ptr()->GetClassId()) || IsUint8(src.ptr()->GetClassId())) {
// We've already performed range checking in _boundsCheckAndMemcpyN prior
// to the call to _nativeSetRange, so just perform the memmove.
//
// TODO(dartbug.com/42072): We do this when the copy length gets large
// enough that a native call to invoke memmove is faster than the generated
// code from MemoryCopy. Replace the static call to _nativeSetRange with
// a CCall() to a memmove leaf runtime entry and remove the possibility of
// calling _nativeSetRange except in the clamping case.
NoSafepointScope no_safepoint;
memmove(dst.DataAddr(dst_start_in_bytes), src.DataAddr(src_start_in_bytes),
length_in_bytes);
return Object::null();
}
const intptr_t to_cid = to_cid_smi.Value();
const intptr_t from_cid = from_cid_smi.Value();
const bool needs_clamping = IsClamped(to_cid) && !IsUint8(from_cid);
return CopyData(dst, src, dst_start, src_start, length, needs_clamping);
// This is called on the fast path prior to bounds checking, so perform
// the bounds check even if the length is 0.
const intptr_t dst_length_in_bytes = dst.LengthInBytes();
RangeCheck(dst_start_in_bytes, length_in_bytes, dst_length_in_bytes,
element_size_in_bytes);
const intptr_t src_length_in_bytes = src.LengthInBytes();
RangeCheck(src_start_in_bytes, length_in_bytes, src_length_in_bytes,
element_size_in_bytes);
ASSERT_EQUAL(element_size_in_bytes, 1);
if (length_in_bytes > 0) {
NoSafepointScope no_safepoint;
uint8_t* dst_data =
reinterpret_cast<uint8_t*>(dst.DataAddr(dst_start_in_bytes));
int8_t* src_data =
reinterpret_cast<int8_t*>(src.DataAddr(src_start_in_bytes));
for (intptr_t ix = 0; ix < length_in_bytes; ix++) {
int8_t v = *src_data;
if (v < 0) v = 0;
*dst_data = v;
src_data++;
dst_data++;
}
}
return Object::null();
}
// Native methods for typed data allocation are recognized and implemented

View file

@ -173,7 +173,7 @@ namespace dart {
V(TypedData_Int32x4Array_new, 2) \
V(TypedData_Float64x2Array_new, 2) \
V(TypedDataBase_length, 1) \
V(TypedDataBase_setRange, 7) \
V(TypedDataBase_setRange, 5) \
V(TypedData_GetInt8, 2) \
V(TypedData_SetInt8, 3) \
V(TypedData_GetUint8, 2) \

View file

@ -629,6 +629,23 @@ class AssemblerBase : public StackResource {
virtual void SmiTag(Register r) = 0;
// If Smis are compressed and the Smi value in dst is non-negative, ensures
// the upper bits are cleared. If Smis are not compressed, is a no-op.
//
// Since this operation only affects the unused upper bits when Smis are
// compressed, it can be used on registers not allocated as writable.
//
// The behavior on the upper bits of signed compressed Smis is undefined.
#if defined(DART_COMPRESSED_POINTERS)
virtual void ExtendNonNegativeSmi(Register dst) {
// Default to sign extension and allow architecture-specific assemblers
// where an alternative like zero-extension is preferred to override this.
ExtendValue(dst, dst, kObjectBytes);
}
#else
void ExtendNonNegativeSmi(Register dst) {}
#endif
// Extends a value of size sz in src to a value of size kWordBytes in dst.
// That is, bits in the source register that are not part of the sz-sized
// value are ignored, and if sz is signed, then the value is sign extended.

View file

@ -1776,6 +1776,16 @@ void Assembler::cmpxchgl(const Address& address, Register reg) {
EmitOperand(reg, address);
}
void Assembler::cld() {
AssemblerBuffer::EnsureCapacity ensured(&buffer_);
EmitUint8(0xFC);
}
void Assembler::std() {
AssemblerBuffer::EnsureCapacity ensured(&buffer_);
EmitUint8(0xFD);
}
void Assembler::cpuid() {
AssemblerBuffer::EnsureCapacity ensured(&buffer_);
EmitUint8(0x0F);
@ -3126,46 +3136,6 @@ Address Assembler::ElementAddressForIntIndex(bool is_external,
}
}
static ScaleFactor ToScaleFactor(intptr_t index_scale, bool index_unboxed) {
if (index_unboxed) {
switch (index_scale) {
case 1:
return TIMES_1;
case 2:
return TIMES_2;
case 4:
return TIMES_4;
case 8:
return TIMES_8;
case 16:
return TIMES_16;
default:
UNREACHABLE();
return TIMES_1;
}
} else {
// Note that index is expected smi-tagged, (i.e, times 2) for all arrays
// with index scale factor > 1. E.g., for Uint8Array and OneByteString the
// index is expected to be untagged before accessing.
ASSERT(kSmiTagShift == 1);
switch (index_scale) {
case 1:
return TIMES_1;
case 2:
return TIMES_1;
case 4:
return TIMES_2;
case 8:
return TIMES_4;
case 16:
return TIMES_8;
default:
UNREACHABLE();
return TIMES_1;
}
}
}
Address Assembler::ElementAddressForRegIndex(bool is_external,
intptr_t cid,
intptr_t index_scale,

View file

@ -572,6 +572,9 @@ class Assembler : public AssemblerBase {
void lock();
void cmpxchgl(const Address& address, Register reg);
void cld();
void std();
void cpuid();
/*

View file

@ -2683,46 +2683,6 @@ Address Assembler::ElementAddressForIntIndex(bool is_external,
}
}
static ScaleFactor ToScaleFactor(intptr_t index_scale, bool index_unboxed) {
if (index_unboxed) {
switch (index_scale) {
case 1:
return TIMES_1;
case 2:
return TIMES_2;
case 4:
return TIMES_4;
case 8:
return TIMES_8;
case 16:
return TIMES_16;
default:
UNREACHABLE();
return TIMES_1;
}
} else {
// Note that index is expected smi-tagged, (i.e, times 2) for all arrays
// with index scale factor > 1. E.g., for Uint8Array and OneByteString the
// index is expected to be untagged before accessing.
ASSERT(kSmiTagShift == 1);
switch (index_scale) {
case 1:
return TIMES_1;
case 2:
return TIMES_1;
case 4:
return TIMES_2;
case 8:
return TIMES_4;
case 16:
return TIMES_8;
default:
UNREACHABLE();
return TIMES_1;
}
}
}
Address Assembler::ElementAddressForRegIndex(bool is_external,
intptr_t cid,
intptr_t index_scale,

View file

@ -1024,6 +1024,14 @@ class Assembler : public AssemblerBase {
Register scratch,
bool can_be_null = false) override;
#if defined(DART_COMPRESSED_POINTERS)
void ExtendNonNegativeSmi(Register dst) override {
// Zero-extends and is a smaller instruction to output than sign
// extension (movsxd).
orl(dst, dst);
}
#endif
// CheckClassIs fused with optimistic SmiUntag.
// Value in the register object is untagged optimistically.
void SmiUntagOrCheckClass(Register object, intptr_t class_id, Label* smi);

View file

@ -6590,8 +6590,23 @@ Representation StoreIndexedInstr::RequiredInputRepresentation(
return RepresentationOfArrayElement(class_id());
}
#if defined(TARGET_ARCH_ARM64)
// We can emit a 16 byte move in a single instruction using LDP/STP.
static const intptr_t kMaxElementSizeForEfficientCopy = 16;
#else
static const intptr_t kMaxElementSizeForEfficientCopy =
compiler::target::kWordSize;
#endif
Instruction* MemoryCopyInstr::Canonicalize(FlowGraph* flow_graph) {
if (!length()->BindsToSmiConstant() || !src_start()->BindsToSmiConstant() ||
if (!length()->BindsToSmiConstant()) {
return this;
} else if (length()->BoundSmiConstant() == 0) {
// Nothing to copy.
return nullptr;
}
if (!src_start()->BindsToSmiConstant() ||
!dest_start()->BindsToSmiConstant()) {
// TODO(https://dartbug.com/51031): Consider adding support for src/dest
// starts to be in bytes rather than element size.
@ -6603,7 +6618,7 @@ Instruction* MemoryCopyInstr::Canonicalize(FlowGraph* flow_graph) {
intptr_t new_dest_start = dest_start()->BoundSmiConstant();
intptr_t new_element_size = element_size_;
while (((new_length | new_src_start | new_dest_start) & 1) == 0 &&
new_element_size < compiler::target::kWordSize) {
new_element_size < kMaxElementSizeForEfficientCopy) {
new_length >>= 1;
new_src_start >>= 1;
new_dest_start >>= 1;
@ -6614,9 +6629,11 @@ Instruction* MemoryCopyInstr::Canonicalize(FlowGraph* flow_graph) {
}
Zone* const zone = flow_graph->zone();
// The new element size is larger than the original one, so it must be > 1.
// That means unboxed integers will always require a shift, but Smis
// may not if element_size == 2, so always use Smis.
auto* const length_instr = flow_graph->GetConstant(
Integer::ZoneHandle(zone, Integer::New(new_length, Heap::kOld)),
unboxed_length_ ? kUnboxedIntPtr : kTagged);
Integer::ZoneHandle(zone, Integer::New(new_length, Heap::kOld)));
auto* const src_start_instr = flow_graph->GetConstant(
Integer::ZoneHandle(zone, Integer::New(new_src_start, Heap::kOld)));
auto* const dest_start_instr = flow_graph->GetConstant(
@ -6625,9 +6642,154 @@ Instruction* MemoryCopyInstr::Canonicalize(FlowGraph* flow_graph) {
src_start()->BindTo(src_start_instr);
dest_start()->BindTo(dest_start_instr);
element_size_ = new_element_size;
unboxed_inputs_ = false;
return this;
}
void MemoryCopyInstr::EmitNativeCode(FlowGraphCompiler* compiler) {
const Register src_reg = locs()->in(kSrcPos).reg();
const Register dest_reg = locs()->in(kDestPos).reg();
const Location& src_start_loc = locs()->in(kSrcStartPos);
const Location& dest_start_loc = locs()->in(kDestStartPos);
const Location& length_loc = locs()->in(kLengthPos);
// Note that for all architectures, constant_length is only true if
// length() binds to a _small_ constant, so we can end up generating a loop
// if the constant length() was bound to is too large.
const bool constant_length = length_loc.IsConstant();
const Register length_reg = constant_length ? kNoRegister : length_loc.reg();
const intptr_t num_elements =
constant_length ? Integer::Cast(length_loc.constant()).AsInt64Value()
: -1;
// The zero constant case should be handled via canonicalization.
ASSERT(!constant_length || num_elements > 0);
EmitComputeStartPointer(compiler, src_cid_, src_reg, src_start_loc);
EmitComputeStartPointer(compiler, dest_cid_, dest_reg, dest_start_loc);
compiler::Label copy_forwards, done;
if (!constant_length) {
#if defined(TARGET_ARCH_IA32)
// Save ESI (THR), as we have to use it on the loop path.
__ PushRegister(ESI);
#endif
PrepareLengthRegForLoop(compiler, length_reg, &done);
}
// Omit the reversed loop for possible overlap if copying a single element.
if (can_overlap() && num_elements != 1) {
__ CompareRegisters(dest_reg, src_reg);
// Both regions are the same size, so if there is an overlap, then either:
//
// * The destination region comes before the source, so copying from
// front to back ensures that the data in the overlap is read and
// copied before it is written.
// * The source region comes before the destination, which requires
// copying from back to front to ensure that the data in the overlap is
// read and copied before it is written.
//
// To make the generated code smaller for the unrolled case, we do not
// additionally verify here that there is an actual overlap. Instead, only
// do that when we need to calculate the end address of the regions in
// the loop case.
__ BranchIf(UNSIGNED_LESS_EQUAL, &copy_forwards,
compiler::Assembler::kNearJump);
if (constant_length) {
EmitUnrolledCopy(compiler, dest_reg, src_reg, num_elements,
/*reversed=*/true);
} else {
EmitLoopCopy(compiler, dest_reg, src_reg, length_reg, &done,
&copy_forwards);
}
__ Jump(&done, compiler::Assembler::kNearJump);
}
__ Bind(&copy_forwards);
if (constant_length) {
EmitUnrolledCopy(compiler, dest_reg, src_reg, num_elements,
/*reversed=*/false);
} else {
EmitLoopCopy(compiler, dest_reg, src_reg, length_reg, &done);
}
__ Bind(&done);
#if defined(TARGET_ARCH_IA32)
if (!constant_length) {
// Restore ESI (THR).
__ PopRegister(ESI);
}
#endif
}
// EmitUnrolledCopy on ARM is different enough that it is defined separately.
#if !defined(TARGET_ARCH_ARM)
void MemoryCopyInstr::EmitUnrolledCopy(FlowGraphCompiler* compiler,
Register dest_reg,
Register src_reg,
intptr_t num_elements,
bool reversed) {
ASSERT(element_size_ <= 16);
const intptr_t num_bytes = num_elements * element_size_;
#if defined(TARGET_ARCH_ARM64)
// We use LDP/STP with TMP/TMP2 to handle 16-byte moves.
const intptr_t mov_size = element_size_;
#else
const intptr_t mov_size =
Utils::Minimum<intptr_t>(element_size_, compiler::target::kWordSize);
#endif
const intptr_t mov_repeat = num_bytes / mov_size;
ASSERT(num_bytes % mov_size == 0);
#if defined(TARGET_ARCH_IA32)
// No TMP on IA32, so we have to allocate one instead.
const Register temp_reg = locs()->temp(0).reg();
#else
const Register temp_reg = TMP;
#endif
for (intptr_t i = 0; i < mov_repeat; i++) {
const intptr_t offset = (reversed ? (mov_repeat - (i + 1)) : i) * mov_size;
switch (mov_size) {
case 1:
__ LoadFromOffset(temp_reg, src_reg, offset, compiler::kUnsignedByte);
__ StoreToOffset(temp_reg, dest_reg, offset, compiler::kUnsignedByte);
break;
case 2:
__ LoadFromOffset(temp_reg, src_reg, offset,
compiler::kUnsignedTwoBytes);
__ StoreToOffset(temp_reg, dest_reg, offset,
compiler::kUnsignedTwoBytes);
break;
case 4:
__ LoadFromOffset(temp_reg, src_reg, offset,
compiler::kUnsignedFourBytes);
__ StoreToOffset(temp_reg, dest_reg, offset,
compiler::kUnsignedFourBytes);
break;
case 8:
#if defined(TARGET_ARCH_IS_64_BIT)
__ LoadFromOffset(temp_reg, src_reg, offset, compiler::kEightBytes);
__ StoreToOffset(temp_reg, dest_reg, offset, compiler::kEightBytes);
#else
UNREACHABLE();
#endif
break;
case 16: {
#if defined(TARGET_ARCH_ARM64)
__ ldp(
TMP, TMP2,
compiler::Address(src_reg, offset, compiler::Address::PairOffset));
__ stp(
TMP, TMP2,
compiler::Address(dest_reg, offset, compiler::Address::PairOffset));
#else
UNREACHABLE();
#endif
break;
}
default:
UNREACHABLE();
}
}
}
#endif
bool Utf8ScanInstr::IsScanFlagsUnboxed() const {
return scan_flags_field_.is_unboxed();
}

View file

@ -3064,11 +3064,13 @@ class MemoryCopyInstr : public TemplateInstruction<5, NoThrow> {
Value* length,
classid_t src_cid,
classid_t dest_cid,
bool unboxed_length)
bool unboxed_inputs,
bool can_overlap = true)
: src_cid_(src_cid),
dest_cid_(dest_cid),
element_size_(Instance::ElementSizeFor(src_cid)),
unboxed_length_(unboxed_length) {
unboxed_inputs_(unboxed_inputs),
can_overlap_(can_overlap) {
ASSERT(IsArrayTypeSupported(src_cid));
ASSERT(IsArrayTypeSupported(dest_cid));
ASSERT(Instance::ElementSizeFor(src_cid) ==
@ -3091,11 +3093,11 @@ class MemoryCopyInstr : public TemplateInstruction<5, NoThrow> {
DECLARE_INSTRUCTION(MemoryCopy)
virtual Representation RequiredInputRepresentation(intptr_t index) const {
if (index == kLengthPos && unboxed_length_) {
return kUnboxedIntPtr;
if (index == kSrcPos || index == kDestPos) {
// The object inputs are always tagged.
return kTagged;
}
// All inputs are tagged (for now).
return kTagged;
return unboxed_inputs() ? kUnboxedIntPtr : kTagged;
}
virtual bool ComputeCanDeoptimize() const { return false; }
@ -3110,16 +3112,20 @@ class MemoryCopyInstr : public TemplateInstruction<5, NoThrow> {
Value* length() const { return inputs_[kLengthPos]; }
intptr_t element_size() const { return element_size_; }
bool unboxed_length() const { return unboxed_length_; }
bool unboxed_inputs() const { return unboxed_inputs_; }
bool can_overlap() const { return can_overlap_; }
// Optimizes MemoryCopyInstr with constant parameters to use larger moves.
virtual Instruction* Canonicalize(FlowGraph* flow_graph);
PRINT_OPERANDS_TO_SUPPORT
#define FIELD_LIST(F) \
F(classid_t, src_cid_) \
F(classid_t, dest_cid_) \
F(intptr_t, element_size_) \
F(bool, unboxed_length_)
F(bool, unboxed_inputs_) \
F(bool, can_overlap_)
DECLARE_INSTRUCTION_SERIALIZABLE_FIELDS(MemoryCopyInstr,
TemplateInstruction,
@ -3134,6 +3140,38 @@ class MemoryCopyInstr : public TemplateInstruction<5, NoThrow> {
Register array_reg,
Location start_loc);
// Generates an unrolled loop for copying a known amount of data from
// src to dest.
void EmitUnrolledCopy(FlowGraphCompiler* compiler,
Register dest_reg,
Register src_reg,
intptr_t num_elements,
bool reversed);
// Called prior to EmitLoopCopy() to adjust the length register as needed
// for the code emitted by EmitLoopCopy. May jump to done if the emitted
// loop(s) should be skipped.
void PrepareLengthRegForLoop(FlowGraphCompiler* compiler,
Register length_reg,
compiler::Label* done);
// Generates a loop for copying the data from src to dest, for cases where
// either the length is not known at compile time or too large to unroll.
//
// copy_forwards is only provided (not nullptr) when a backwards loop is
// requested. May jump to copy_forwards if backwards iteration is slower than
// forwards iteration and the emitted code verifies no actual overlap exists.
//
// May jump to done if no copying is needed.
//
// Assumes that PrepareLengthRegForLoop() has been called beforehand.
void EmitLoopCopy(FlowGraphCompiler* compiler,
Register dest_reg,
Register src_reg,
Register length_reg,
compiler::Label* done,
compiler::Label* copy_forwards = nullptr);
static bool IsArrayTypeSupported(classid_t array_cid) {
if (IsTypedDataBaseClassId(array_cid)) {
return true;

View file

@ -155,12 +155,17 @@ DEFINE_BACKEND(TailCall,
__ set_constant_pool_allowed(true);
}
// TODO(http://dartbug.com/51229): We can use TMP for LDM/STM, which means we
// only need one additional temporary for 8-byte moves. For 16-byte moves,
// attempting to allocate three temporaries causes too much register pressure,
// so just use two 8-byte sized moves there per iteration.
static constexpr intptr_t kMaxMemoryCopyElementSize =
2 * compiler::target::kWordSize;
LocationSummary* MemoryCopyInstr::MakeLocationSummary(Zone* zone,
bool opt) const {
const intptr_t kNumInputs = 5;
const intptr_t kNumTemps = element_size_ == 16 ? 4
: element_size_ == 8 ? 2
: 1;
const intptr_t kNumTemps = element_size_ >= kMaxMemoryCopyElementSize ? 1 : 0;
LocationSummary* locs = new (zone)
LocationSummary(zone, kNumInputs, kNumTemps, LocationSummary::kNoCall);
locs->set_in(kSrcPos, Location::WritableRegister());
@ -175,89 +180,149 @@ LocationSummary* MemoryCopyInstr::MakeLocationSummary(Zone* zone,
return locs;
}
void MemoryCopyInstr::EmitNativeCode(FlowGraphCompiler* compiler) {
const Register src_reg = locs()->in(kSrcPos).reg();
const Register dest_reg = locs()->in(kDestPos).reg();
const Location src_start_loc = locs()->in(kSrcStartPos);
const Location dest_start_loc = locs()->in(kDestStartPos);
const Location length_loc = locs()->in(kLengthPos);
const bool constant_length = length_loc.IsConstant();
void MemoryCopyInstr::EmitUnrolledCopy(FlowGraphCompiler* compiler,
Register dest_reg,
Register src_reg,
intptr_t num_elements,
bool reversed) {
const intptr_t num_bytes = num_elements * element_size_;
// The amount moved in a single load/store pair.
const intptr_t mov_size =
Utils::Minimum(element_size_, kMaxMemoryCopyElementSize);
const intptr_t mov_repeat = num_bytes / mov_size;
ASSERT(num_bytes % mov_size == 0);
// We can use TMP for all instructions below because element_size_ is
// guaranteed to fit in the offset portion of the instruction in the
// non-LDM/STM cases.
const Register temp_reg = locs()->temp(0).reg();
RegList temp_regs = 0;
for (intptr_t i = 0; i < locs()->temp_count(); i++) {
temp_regs |= 1 << locs()->temp(i).reg();
}
EmitComputeStartPointer(compiler, src_cid_, src_reg, src_start_loc);
EmitComputeStartPointer(compiler, dest_cid_, dest_reg, dest_start_loc);
if (constant_length) {
const intptr_t mov_repeat =
Integer::Cast(length_loc.constant()).AsInt64Value();
if (mov_size == kMaxMemoryCopyElementSize) {
RegList temp_regs = (1 << TMP);
for (intptr_t i = 0; i < locs()->temp_count(); i++) {
temp_regs |= 1 << locs()->temp(i).reg();
}
auto block_mode = BlockAddressMode::IA_W;
if (reversed) {
// When reversed, start the src and dest registers with the end addresses
// and apply the negated offset prior to indexing.
block_mode = BlockAddressMode::DB_W;
__ AddImmediate(src_reg, num_bytes);
__ AddImmediate(dest_reg, num_bytes);
}
for (intptr_t i = 0; i < mov_repeat; i++) {
compiler::Address src_address =
compiler::Address(src_reg, element_size_ * i);
compiler::Address dest_address =
compiler::Address(dest_reg, element_size_ * i);
switch (element_size_) {
case 1:
__ ldrb(temp_reg, src_address);
__ strb(temp_reg, dest_address);
break;
case 2:
__ ldrh(temp_reg, src_address);
__ strh(temp_reg, dest_address);
break;
case 4:
__ ldr(temp_reg, src_address);
__ str(temp_reg, dest_address);
break;
case 8:
case 16:
__ ldm(BlockAddressMode::IA_W, src_reg, temp_regs);
__ stm(BlockAddressMode::IA_W, dest_reg, temp_regs);
break;
}
__ ldm(block_mode, src_reg, temp_regs);
__ stm(block_mode, dest_reg, temp_regs);
}
return;
}
const Register length_reg = length_loc.reg();
for (intptr_t i = 0; i < mov_repeat; i++) {
const intptr_t byte_index =
(reversed ? mov_repeat - (i + 1) : i) * mov_size;
switch (mov_size) {
case 1:
__ ldrb(TMP, compiler::Address(src_reg, byte_index));
__ strb(TMP, compiler::Address(dest_reg, byte_index));
break;
case 2:
__ ldrh(TMP, compiler::Address(src_reg, byte_index));
__ strh(TMP, compiler::Address(dest_reg, byte_index));
break;
case 4:
__ ldr(TMP, compiler::Address(src_reg, byte_index));
__ str(TMP, compiler::Address(dest_reg, byte_index));
break;
default:
UNREACHABLE();
}
}
}
compiler::Label loop, done;
void MemoryCopyInstr::PrepareLengthRegForLoop(FlowGraphCompiler* compiler,
Register length_reg,
compiler::Label* done) {
__ BranchIfZero(length_reg, done);
}
void MemoryCopyInstr::EmitLoopCopy(FlowGraphCompiler* compiler,
Register dest_reg,
Register src_reg,
Register length_reg,
compiler::Label* done,
compiler::Label* copy_forwards) {
const intptr_t loop_subtract = unboxed_inputs() ? 1 : Smi::RawValue(1);
auto load_mode = compiler::Address::PostIndex;
auto load_multiple_mode = BlockAddressMode::IA_W;
if (copy_forwards != nullptr) {
// When reversed, start the src and dest registers with the end addresses
// and apply the negated offset prior to indexing.
load_mode = compiler::Address::NegPreIndex;
load_multiple_mode = BlockAddressMode::DB_W;
// Verify that the overlap actually exists by checking to see if
// dest_start < src_end.
const intptr_t shift = Utils::ShiftForPowerOfTwo(element_size_) -
(unboxed_inputs() ? 0 : kSmiTagShift);
if (shift < 0) {
__ add(TMP, src_reg, compiler::Operand(length_reg, ASR, -shift));
} else {
__ add(TMP, src_reg, compiler::Operand(length_reg, LSL, shift));
}
__ CompareRegisters(dest_reg, TMP);
__ BranchIf(UNSIGNED_GREATER_EQUAL, copy_forwards);
// There is overlap, so mov TMP to src_reg and adjust dest_reg now.
__ MoveRegister(src_reg, TMP);
if (shift < 0) {
__ add(dest_reg, dest_reg, compiler::Operand(length_reg, ASR, -shift));
} else {
__ add(dest_reg, dest_reg, compiler::Operand(length_reg, LSL, shift));
}
}
// We can use TMP for all instructions below because element_size_ is
// guaranteed to fit in the offset portion of the instruction in the
// non-LDM/STM cases.
compiler::Address src_address =
compiler::Address(src_reg, element_size_, compiler::Address::PostIndex);
compiler::Address(src_reg, element_size_, load_mode);
compiler::Address dest_address =
compiler::Address(dest_reg, element_size_, compiler::Address::PostIndex);
const intptr_t loop_subtract = unboxed_length_ ? 1 : Smi::RawValue(1);
__ BranchIfZero(length_reg, &done);
compiler::Address(dest_reg, element_size_, load_mode);
// Used only for LDM/STM below.
RegList temp_regs = (1 << TMP);
for (intptr_t i = 0; i < locs()->temp_count(); i++) {
temp_regs |= 1 << locs()->temp(i).reg();
}
compiler::Label loop;
__ Bind(&loop);
switch (element_size_) {
case 1:
__ ldrb(temp_reg, src_address);
__ strb(temp_reg, dest_address);
__ ldrb(TMP, src_address);
__ strb(TMP, dest_address);
break;
case 2:
__ ldrh(temp_reg, src_address);
__ strh(temp_reg, dest_address);
__ ldrh(TMP, src_address);
__ strh(TMP, dest_address);
break;
case 4:
__ ldr(temp_reg, src_address);
__ str(temp_reg, dest_address);
__ ldr(TMP, src_address);
__ str(TMP, dest_address);
break;
case 8:
COMPILE_ASSERT(8 == kMaxMemoryCopyElementSize);
ASSERT_EQUAL(Utils::CountOneBitsWord(temp_regs), 2);
__ ldm(load_multiple_mode, src_reg, temp_regs);
__ stm(load_multiple_mode, dest_reg, temp_regs);
break;
case 16:
__ ldm(BlockAddressMode::IA_W, src_reg, temp_regs);
__ stm(BlockAddressMode::IA_W, dest_reg, temp_regs);
COMPILE_ASSERT(16 > kMaxMemoryCopyElementSize);
ASSERT_EQUAL(Utils::CountOneBitsWord(temp_regs), 2);
__ ldm(load_multiple_mode, src_reg, temp_regs);
__ stm(load_multiple_mode, dest_reg, temp_regs);
__ ldm(load_multiple_mode, src_reg, temp_regs);
__ stm(load_multiple_mode, dest_reg, temp_regs);
break;
default:
UNREACHABLE();
break;
}
__ subs(length_reg, length_reg, compiler::Operand(loop_subtract));
__ b(&loop, NOT_ZERO);
__ Bind(&done);
}
void MemoryCopyInstr::EmitComputeStartPointer(FlowGraphCompiler* compiler,
@ -311,7 +376,8 @@ void MemoryCopyInstr::EmitComputeStartPointer(FlowGraphCompiler* compiler,
}
__ AddImmediate(array_reg, offset);
const Register start_reg = start_loc.reg();
intptr_t shift = Utils::ShiftForPowerOfTwo(element_size_) - 1;
intptr_t shift = Utils::ShiftForPowerOfTwo(element_size_) -
(unboxed_inputs() ? 0 : kSmiTagShift);
if (shift < 0) {
__ add(array_reg, array_reg, compiler::Operand(start_reg, ASR, -shift));
} else {

View file

@ -157,7 +157,7 @@ DEFINE_BACKEND(TailCall,
LocationSummary* MemoryCopyInstr::MakeLocationSummary(Zone* zone,
bool opt) const {
const intptr_t kNumInputs = 5;
const intptr_t kNumTemps = 1;
const intptr_t kNumTemps = 0;
LocationSummary* locs = new (zone)
LocationSummary(zone, kNumInputs, kNumTemps, LocationSummary::kNoCall);
locs->set_in(kSrcPos, Location::WritableRegister());
@ -166,108 +166,87 @@ LocationSummary* MemoryCopyInstr::MakeLocationSummary(Zone* zone,
locs->set_in(kDestStartPos, LocationRegisterOrConstant(dest_start()));
locs->set_in(kLengthPos,
LocationWritableRegisterOrSmiConstant(length(), 0, 4));
locs->set_temp(0, element_size_ == 16
? Location::Pair(Location::RequiresRegister(),
Location::RequiresRegister())
: Location::RequiresRegister());
return locs;
}
void MemoryCopyInstr::EmitNativeCode(FlowGraphCompiler* compiler) {
const Register src_reg = locs()->in(kSrcPos).reg();
const Register dest_reg = locs()->in(kDestPos).reg();
const Location src_start_loc = locs()->in(kSrcStartPos);
const Location dest_start_loc = locs()->in(kDestStartPos);
const Location length_loc = locs()->in(kLengthPos);
const bool constant_length = length_loc.IsConstant();
void MemoryCopyInstr::PrepareLengthRegForLoop(FlowGraphCompiler* compiler,
Register length_reg,
compiler::Label* done) {
__ BranchIfZero(length_reg, done);
}
Register temp_reg, temp_reg2;
if (locs()->temp(0).IsPairLocation()) {
PairLocation* pair = locs()->temp(0).AsPairLocation();
temp_reg = pair->At(0).reg();
temp_reg2 = pair->At(1).reg();
} else {
temp_reg = locs()->temp(0).reg();
temp_reg2 = kNoRegister;
}
EmitComputeStartPointer(compiler, src_cid_, src_reg, src_start_loc);
EmitComputeStartPointer(compiler, dest_cid_, dest_reg, dest_start_loc);
if (constant_length) {
const intptr_t mov_repeat =
Integer::Cast(length_loc.constant()).AsInt64Value();
for (intptr_t i = 0; i < mov_repeat; i++) {
compiler::Address src_address =
compiler::Address(src_reg, element_size_ * i);
compiler::Address dest_address =
compiler::Address(dest_reg, element_size_ * i);
switch (element_size_) {
case 1:
__ ldr(temp_reg, src_address, compiler::kUnsignedByte);
__ str(temp_reg, dest_address, compiler::kUnsignedByte);
break;
case 2:
__ ldr(temp_reg, src_address, compiler::kUnsignedTwoBytes);
__ str(temp_reg, dest_address, compiler::kUnsignedTwoBytes);
break;
case 4:
__ ldr(temp_reg, src_address, compiler::kUnsignedFourBytes);
__ str(temp_reg, dest_address, compiler::kUnsignedFourBytes);
break;
case 8:
__ ldr(temp_reg, src_address, compiler::kEightBytes);
__ str(temp_reg, dest_address, compiler::kEightBytes);
break;
case 16:
__ ldp(temp_reg, temp_reg2, src_address, compiler::kEightBytes);
__ stp(temp_reg, temp_reg2, dest_address, compiler::kEightBytes);
break;
}
void MemoryCopyInstr::EmitLoopCopy(FlowGraphCompiler* compiler,
Register dest_reg,
Register src_reg,
Register length_reg,
compiler::Label* done,
compiler::Label* copy_forwards) {
const intptr_t loop_subtract = unboxed_inputs() ? 1 : Smi::RawValue(1);
intptr_t offset = element_size_;
auto mode = element_size_ == 16 ? compiler::Address::PairPostIndex
: compiler::Address::PostIndex;
if (copy_forwards != nullptr) {
// When reversed, start the src and dest registers with the end addresses
// and apply the negated offset prior to indexing.
offset = -element_size_;
mode = element_size_ == 16 ? compiler::Address::PairPreIndex
: compiler::Address::PreIndex;
// Verify that the overlap actually exists by checking to see if
// dest_start < src_end.
if (!unboxed_inputs()) {
__ ExtendNonNegativeSmi(length_reg);
}
const intptr_t shift = Utils::ShiftForPowerOfTwo(element_size_) -
(unboxed_inputs() ? 0 : kSmiTagShift);
if (shift < 0) {
__ add(TMP, src_reg, compiler::Operand(length_reg, ASR, -shift));
} else {
__ add(TMP, src_reg, compiler::Operand(length_reg, LSL, shift));
}
__ CompareRegisters(dest_reg, TMP);
__ BranchIf(UNSIGNED_GREATER_EQUAL, copy_forwards);
// There is overlap, so move TMP to src_reg and adjust dest_reg now.
__ MoveRegister(src_reg, TMP);
if (shift < 0) {
__ add(dest_reg, dest_reg, compiler::Operand(length_reg, ASR, -shift));
} else {
__ add(dest_reg, dest_reg, compiler::Operand(length_reg, LSL, shift));
}
return;
}
const Register length_reg = length_loc.reg();
compiler::Label loop, done;
compiler::Address src_address =
compiler::Address(src_reg, element_size_, compiler::Address::PostIndex);
compiler::Address dest_address =
compiler::Address(dest_reg, element_size_, compiler::Address::PostIndex);
const intptr_t loop_subtract = unboxed_length_ ? 1 : Smi::RawValue(1);
__ BranchIfZero(length_reg, &done);
compiler::Address src_address = compiler::Address(src_reg, offset, mode);
compiler::Address dest_address = compiler::Address(dest_reg, offset, mode);
compiler::Label loop;
__ Bind(&loop);
switch (element_size_) {
case 1:
__ ldr(temp_reg, src_address, compiler::kUnsignedByte);
__ str(temp_reg, dest_address, compiler::kUnsignedByte);
__ ldr(TMP, src_address, compiler::kUnsignedByte);
__ str(TMP, dest_address, compiler::kUnsignedByte);
break;
case 2:
__ ldr(temp_reg, src_address, compiler::kUnsignedTwoBytes);
__ str(temp_reg, dest_address, compiler::kUnsignedTwoBytes);
__ ldr(TMP, src_address, compiler::kUnsignedTwoBytes);
__ str(TMP, dest_address, compiler::kUnsignedTwoBytes);
break;
case 4:
__ ldr(temp_reg, src_address, compiler::kUnsignedFourBytes);
__ str(temp_reg, dest_address, compiler::kUnsignedFourBytes);
__ ldr(TMP, src_address, compiler::kUnsignedFourBytes);
__ str(TMP, dest_address, compiler::kUnsignedFourBytes);
break;
case 8:
__ ldr(temp_reg, src_address, compiler::kEightBytes);
__ str(temp_reg, dest_address, compiler::kEightBytes);
__ ldr(TMP, src_address, compiler::kEightBytes);
__ str(TMP, dest_address, compiler::kEightBytes);
break;
case 16:
__ ldp(temp_reg, temp_reg2, src_address, compiler::kEightBytes);
__ stp(temp_reg, temp_reg2, dest_address, compiler::kEightBytes);
__ ldp(TMP, TMP2, src_address, compiler::kEightBytes);
__ stp(TMP, TMP2, dest_address, compiler::kEightBytes);
break;
default:
UNREACHABLE();
break;
}
__ subs(length_reg, length_reg, compiler::Operand(loop_subtract),
compiler::kObjectBytes);
__ b(&loop, NOT_ZERO);
__ Bind(&done);
}
void MemoryCopyInstr::EmitComputeStartPointer(FlowGraphCompiler* compiler,
@ -321,18 +300,19 @@ void MemoryCopyInstr::EmitComputeStartPointer(FlowGraphCompiler* compiler,
}
__ AddImmediate(array_reg, offset);
const Register start_reg = start_loc.reg();
intptr_t shift = Utils::ShiftForPowerOfTwo(element_size_) - 1;
intptr_t shift = Utils::ShiftForPowerOfTwo(element_size_) -
(unboxed_inputs() ? 0 : kSmiTagShift);
if (shift < 0) {
#if defined(DART_COMPRESSED_POINTERS)
__ sxtw(start_reg, start_reg);
#endif
if (!unboxed_inputs()) {
__ ExtendNonNegativeSmi(start_reg);
}
__ add(array_reg, array_reg, compiler::Operand(start_reg, ASR, -shift));
} else {
#if !defined(DART_COMPRESSED_POINTERS)
__ add(array_reg, array_reg, compiler::Operand(start_reg, LSL, shift));
#else
#if defined(DART_COMPRESSED_POINTERS)
} else if (!unboxed_inputs()) {
__ add(array_reg, array_reg, compiler::Operand(start_reg, SXTW, shift));
#endif
} else {
__ add(array_reg, array_reg, compiler::Operand(start_reg, LSL, shift));
}
}

View file

@ -87,8 +87,17 @@ LocationSummary* MemoryCopyInstr::MakeLocationSummary(Zone* zone,
LocationSummary(zone, kNumInputs, kNumTemps, LocationSummary::kNoCall);
locs->set_in(kSrcPos, Location::WritableRegister());
locs->set_in(kDestPos, Location::RegisterLocation(EDI));
locs->set_in(kSrcStartPos, LocationRegisterOrConstant(src_start()));
locs->set_in(kDestStartPos, LocationRegisterOrConstant(dest_start()));
const bool needs_writable_inputs =
(((element_size_ == 1) && !unboxed_inputs_) ||
((element_size_ == 16) && unboxed_inputs_));
locs->set_in(kSrcStartPos,
needs_writable_inputs
? LocationWritableRegisterOrConstant(src_start())
: LocationRegisterOrConstant(src_start()));
locs->set_in(kDestStartPos,
needs_writable_inputs
? LocationWritableRegisterOrConstant(dest_start())
: LocationRegisterOrConstant(dest_start()));
if (remove_loop) {
locs->set_in(
kLengthPos,
@ -104,61 +113,56 @@ LocationSummary* MemoryCopyInstr::MakeLocationSummary(Zone* zone,
return locs;
}
void MemoryCopyInstr::EmitNativeCode(FlowGraphCompiler* compiler) {
const Register src_reg = locs()->in(kSrcPos).reg();
const Register dest_reg = locs()->in(kDestPos).reg();
const Location src_start_loc = locs()->in(kSrcStartPos);
const Location dest_start_loc = locs()->in(kDestStartPos);
const Location length_loc = locs()->in(kLengthPos);
static inline intptr_t SizeOfMemoryCopyElements(intptr_t element_size) {
return Utils::Minimum<intptr_t>(element_size, compiler::target::kWordSize);
}
EmitComputeStartPointer(compiler, src_cid_, src_reg, src_start_loc);
EmitComputeStartPointer(compiler, dest_cid_, dest_reg, dest_start_loc);
if (length_loc.IsConstant()) {
const intptr_t num_bytes =
Integer::Cast(length_loc.constant()).AsInt64Value() * element_size_;
const intptr_t mov_size = Utils::Minimum(element_size_, 4);
const intptr_t mov_repeat = num_bytes / mov_size;
ASSERT(num_bytes % mov_size == 0);
const Register temp_reg = locs()->temp(0).reg();
for (intptr_t i = 0; i < mov_repeat; i++) {
const intptr_t disp = mov_size * i;
switch (mov_size) {
case 1:
__ movzxb(temp_reg, compiler::Address(src_reg, disp));
__ movb(compiler::Address(dest_reg, disp), ByteRegisterOf(temp_reg));
break;
case 2:
__ movzxw(temp_reg, compiler::Address(src_reg, disp));
__ movw(compiler::Address(dest_reg, disp), temp_reg);
break;
case 4:
__ movl(temp_reg, compiler::Address(src_reg, disp));
__ movl(compiler::Address(dest_reg, disp), temp_reg);
break;
}
}
return;
void MemoryCopyInstr::PrepareLengthRegForLoop(FlowGraphCompiler* compiler,
Register length_reg,
compiler::Label* done) {
const intptr_t mov_size = SizeOfMemoryCopyElements(element_size_);
// We want to convert the value in length_reg to an unboxed length in
// terms of mov_size-sized elements.
const intptr_t shift = Utils::ShiftForPowerOfTwo(element_size_) -
Utils::ShiftForPowerOfTwo(mov_size) -
(unboxed_inputs() ? 0 : kSmiTagShift);
if (shift < 0) {
ASSERT_EQUAL(shift, -kSmiTagShift);
__ SmiUntag(length_reg);
} else if (shift > 0) {
__ shll(length_reg, compiler::Immediate(shift));
}
}
// Save ESI which is THR.
__ pushl(ESI);
__ movl(ESI, src_reg);
if (element_size_ <= compiler::target::kWordSize) {
if (!unboxed_length_) {
__ SmiUntag(ECX);
}
void MemoryCopyInstr::EmitLoopCopy(FlowGraphCompiler* compiler,
Register dest_reg,
Register src_reg,
Register length_reg,
compiler::Label* done,
compiler::Label* copy_forwards) {
const intptr_t mov_size = SizeOfMemoryCopyElements(element_size_);
const bool reversed = copy_forwards != nullptr;
if (reversed) {
// Avoid doing the extra work to prepare for the rep mov instructions
// if the length to copy is zero.
__ BranchIfZero(length_reg, done);
// Verify that the overlap actually exists by checking to see if
// the first element in dest <= the last element in src.
const ScaleFactor scale = ToScaleFactor(mov_size, /*index_unboxed=*/true);
__ leal(ESI, compiler::Address(src_reg, length_reg, scale, -mov_size));
__ CompareRegisters(dest_reg, ESI);
__ BranchIf(UNSIGNED_GREATER, copy_forwards,
compiler::Assembler::kNearJump);
// ESI already has the right address, so we just need to adjust dest_reg
// appropriately.
__ leal(dest_reg,
compiler::Address(dest_reg, length_reg, scale, -mov_size));
__ std();
} else {
const intptr_t shift = Utils::ShiftForPowerOfTwo(element_size_) -
compiler::target::kWordSizeLog2 -
(unboxed_length_ ? 0 : kSmiTagShift);
if (shift != 0) {
__ shll(ECX, compiler::Immediate(shift));
}
// Move the start of the src array into ESI before the string operation.
__ movl(ESI, src_reg);
}
switch (element_size_) {
switch (mov_size) {
case 1:
__ rep_movsb();
break;
@ -166,14 +170,14 @@ void MemoryCopyInstr::EmitNativeCode(FlowGraphCompiler* compiler) {
__ rep_movsw();
break;
case 4:
case 8:
case 16:
__ rep_movsd();
break;
default:
UNREACHABLE();
}
if (reversed) {
__ cld();
}
// Restore THR.
__ popl(ESI);
}
void MemoryCopyInstr::EmitComputeStartPointer(FlowGraphCompiler* compiler,
@ -225,29 +229,22 @@ void MemoryCopyInstr::EmitComputeStartPointer(FlowGraphCompiler* compiler,
__ AddImmediate(array_reg, add_value);
return;
}
// Note that start_reg must be writable in the special cases below.
const Register start_reg = start_loc.reg();
ScaleFactor scale;
switch (element_size_) {
case 1:
__ SmiUntag(start_reg);
scale = TIMES_1;
break;
case 2:
scale = TIMES_1;
break;
case 4:
scale = TIMES_2;
break;
case 8:
scale = TIMES_4;
break;
case 16:
scale = TIMES_8;
break;
default:
UNREACHABLE();
break;
bool index_unboxed = unboxed_inputs_;
// Both special cases below assume that Smis are only shifted one bit.
COMPILE_ASSERT(kSmiTagShift == 1);
if (element_size_ == 1 && !index_unboxed) {
// Shift the value to the right by tagging it as a Smi.
__ SmiUntag(start_reg);
index_unboxed = true;
} else if (element_size_ == 16 && index_unboxed) {
// Can't use TIMES_16 on X86, so instead pre-shift the value to reduce the
// scaling needed in the leaq instruction.
__ SmiTag(start_reg);
index_unboxed = false;
}
auto const scale = ToScaleFactor(element_size_, index_unboxed);
__ leal(array_reg, compiler::Address(array_reg, start_reg, scale, offset));
}
@ -1612,22 +1609,24 @@ void LoadIndexedInstr::EmitNativeCode(FlowGraphCompiler* compiler) {
const Register array = locs()->in(0).reg();
const Location index = locs()->in(1);
compiler::Address element_address =
index.IsRegister() ? compiler::Assembler::ElementAddressForRegIndex(
IsExternal(), class_id(), index_scale(),
index_unboxed_, array, index.reg())
: compiler::Assembler::ElementAddressForIntIndex(
IsExternal(), class_id(), index_scale(), array,
Smi::Cast(index.constant()).Value());
if (index_scale() == 1 && !index_unboxed_) {
bool index_unboxed = index_unboxed_;
if (index_scale() == 1 && !index_unboxed) {
if (index.IsRegister()) {
__ SmiUntag(index.reg());
index_unboxed = true;
} else {
ASSERT(index.IsConstant());
}
}
compiler::Address element_address =
index.IsRegister() ? compiler::Assembler::ElementAddressForRegIndex(
IsExternal(), class_id(), index_scale(),
index_unboxed, array, index.reg())
: compiler::Assembler::ElementAddressForIntIndex(
IsExternal(), class_id(), index_scale(), array,
Smi::Cast(index.constant()).Value());
if ((representation() == kUnboxedFloat) ||
(representation() == kUnboxedDouble) ||
(representation() == kUnboxedFloat32x4) ||
@ -1678,7 +1677,7 @@ void LoadIndexedInstr::EmitNativeCode(FlowGraphCompiler* compiler) {
element_address =
index.IsRegister()
? compiler::Assembler::ElementAddressForRegIndex(
IsExternal(), class_id(), index_scale(), index_unboxed_,
IsExternal(), class_id(), index_scale(), index_unboxed,
array, index.reg(), kWordSize)
: compiler::Assembler::ElementAddressForIntIndex(
IsExternal(), class_id(), index_scale(), array,
@ -1804,17 +1803,19 @@ void StoreIndexedInstr::EmitNativeCode(FlowGraphCompiler* compiler) {
const Register array = locs()->in(0).reg();
const Location index = locs()->in(1);
bool index_unboxed = index_unboxed_;
if ((index_scale() == 1) && index.IsRegister() && !index_unboxed) {
__ SmiUntag(index.reg());
index_unboxed = true;
}
compiler::Address element_address =
index.IsRegister() ? compiler::Assembler::ElementAddressForRegIndex(
IsExternal(), class_id(), index_scale(),
index_unboxed_, array, index.reg())
index_unboxed, array, index.reg())
: compiler::Assembler::ElementAddressForIntIndex(
IsExternal(), class_id(), index_scale(), array,
Smi::Cast(index.constant()).Value());
if ((index_scale() == 1) && index.IsRegister() && !index_unboxed_) {
__ SmiUntag(index.reg());
}
switch (class_id()) {
case kArrayCid:
if (ShouldEmitStoreBarrier()) {
@ -1897,7 +1898,7 @@ void StoreIndexedInstr::EmitNativeCode(FlowGraphCompiler* compiler) {
element_address =
index.IsRegister()
? compiler::Assembler::ElementAddressForRegIndex(
IsExternal(), class_id(), index_scale(), index_unboxed_,
IsExternal(), class_id(), index_scale(), index_unboxed,
array, index.reg(), kWordSize)
: compiler::Assembler::ElementAddressForIntIndex(
IsExternal(), class_id(), index_scale(), array,
@ -3816,14 +3817,15 @@ void LoadCodeUnitsInstr::EmitNativeCode(FlowGraphCompiler* compiler) {
const Register str = locs()->in(0).reg();
const Location index = locs()->in(1);
compiler::Address element_address =
compiler::Assembler::ElementAddressForRegIndex(
IsExternal(), class_id(), index_scale(), /*index_unboxed=*/false, str,
index.reg());
bool index_unboxed = false;
if ((index_scale() == 1)) {
__ SmiUntag(index.reg());
index_unboxed = true;
}
compiler::Address element_address =
compiler::Assembler::ElementAddressForRegIndex(
IsExternal(), class_id(), index_scale(), index_unboxed, str,
index.reg());
if (representation() == kUnboxedInt64) {
ASSERT(compiler->is_optimizing());

View file

@ -6,6 +6,7 @@
#include <tuple>
#include "vm/class_id.h"
#include "vm/compiler/api/print_filter.h"
#include "vm/compiler/backend/il.h"
#include "vm/compiler/backend/linearscan.h"
@ -1370,6 +1371,40 @@ void StoreIndexedInstr::PrintOperandsTo(BaseTextBuffer* f) const {
}
}
void MemoryCopyInstr::PrintOperandsTo(BaseTextBuffer* f) const {
Instruction::PrintOperandsTo(f);
// kTypedDataUint8ArrayCid is used as the default cid for cases where
// the destination object is a subclass of PointerBase and the arguments
// are given in terms of bytes, so only print if the cid differs.
if (dest_cid_ != kTypedDataUint8ArrayCid) {
const Class& cls =
Class::Handle(IsolateGroup::Current()->class_table()->At(dest_cid_));
if (!cls.IsNull()) {
f->Printf(", dest_cid=%s (%d)", cls.ScrubbedNameCString(), dest_cid_);
} else {
f->Printf(", dest_cid=%d", dest_cid_);
}
}
if (src_cid_ != dest_cid_) {
const Class& cls =
Class::Handle(IsolateGroup::Current()->class_table()->At(src_cid_));
if (!cls.IsNull()) {
f->Printf(", src_cid=%s (%d)", cls.ScrubbedNameCString(), src_cid_);
} else {
f->Printf(", src_cid=%d", src_cid_);
}
}
if (element_size() != 1) {
f->Printf(", element_size=%" Pd "", element_size());
}
if (unboxed_inputs()) {
f->AddString(", unboxed_inputs");
}
if (can_overlap()) {
f->AddString(", can_overlap");
}
}
void TailCallInstr::PrintOperandsTo(BaseTextBuffer* f) const {
const char* name = "<unknown code>";
if (code_.IsStubCode()) {

View file

@ -186,131 +186,117 @@ LocationSummary* MemoryCopyInstr::MakeLocationSummary(Zone* zone,
return locs;
}
void MemoryCopyInstr::EmitNativeCode(FlowGraphCompiler* compiler) {
const Register src_reg = locs()->in(kSrcPos).reg();
const Register dest_reg = locs()->in(kDestPos).reg();
const Location src_start_loc = locs()->in(kSrcStartPos);
const Location dest_start_loc = locs()->in(kDestStartPos);
const Location length_loc = locs()->in(kLengthPos);
const bool constant_length = length_loc.IsConstant();
const Register length_reg = constant_length ? kNoRegister : length_loc.reg();
void MemoryCopyInstr::PrepareLengthRegForLoop(FlowGraphCompiler* compiler,
Register length_reg,
compiler::Label* done) {
__ BranchIfZero(length_reg, done);
}
EmitComputeStartPointer(compiler, src_cid_, src_reg, src_start_loc);
EmitComputeStartPointer(compiler, dest_cid_, dest_reg, dest_start_loc);
void MemoryCopyInstr::EmitLoopCopy(FlowGraphCompiler* compiler,
Register dest_reg,
Register src_reg,
Register length_reg,
compiler::Label* done,
compiler::Label* copy_forwards) {
const intptr_t loop_subtract = unboxed_inputs() ? 1 : Smi::RawValue(1);
// The size of an (sub)element in an individual load/store pair.
intptr_t mov_size = Utils::Minimum<intptr_t>(element_size_, XLEN / 8);
if (constant_length) {
const intptr_t num_bytes =
Integer::Cast(length_loc.constant()).AsInt64Value() * element_size_;
const intptr_t mov_size =
Utils::Minimum(element_size_, static_cast<intptr_t>(XLEN / 8));
const intptr_t mov_repeat = num_bytes / mov_size;
ASSERT(num_bytes % mov_size == 0);
for (intptr_t i = 0; i < mov_repeat; i++) {
switch (mov_size) {
case 1:
__ lb(TMP, compiler::Address(src_reg, mov_size * i));
__ sb(TMP, compiler::Address(dest_reg, mov_size * i));
break;
case 2:
__ lh(TMP, compiler::Address(src_reg, mov_size * i));
__ sh(TMP, compiler::Address(dest_reg, mov_size * i));
break;
case 4:
__ lw(TMP, compiler::Address(src_reg, mov_size * i));
__ sw(TMP, compiler::Address(dest_reg, mov_size * i));
break;
case 8:
#if XLEN == 64
__ ld(TMP, compiler::Address(src_reg, mov_size * i));
__ sd(TMP, compiler::Address(dest_reg, mov_size * i));
#else
UNREACHABLE();
#endif
break;
case 16:
#if XLEN == 128
__ lq(TMP, compiler::Address(src_reg, mov_size * i));
__ sq(TMP, compiler::Address(dest_reg, mov_size * i));
#else
UNREACHABLE();
#endif
break;
}
if (copy_forwards != nullptr) {
// Verify that the overlap actually exists by checking to see if
// the first element in dest <= the last element in src.
const intptr_t shift = Utils::ShiftForPowerOfTwo(element_size_) -
(unboxed_inputs() ? 0 : kSmiTagShift);
if (shift == 0) {
__ subi(TMP, length_reg, mov_size);
} else if (shift < 0) {
__ srai(TMP, length_reg, -shift);
__ subi(TMP, TMP, mov_size);
} else {
__ slli(TMP, length_reg, shift);
__ subi(TMP, TMP, mov_size);
}
return;
__ add(TMP, src_reg, TMP);
__ CompareRegisters(dest_reg, TMP);
__ BranchIf(UNSIGNED_GREATER, copy_forwards,
compiler::Assembler::kNearJump);
// There is overlap, so adjust dest_reg and src_reg appropriately.
__ add(dest_reg, dest_reg, TMP);
__ sub(dest_reg, dest_reg, src_reg);
__ MoveRegister(src_reg, TMP);
// Negate the increment to the next (sub)element. This way, the
// (sub)elements will be copied in reverse order (highest to lowest).
mov_size = -mov_size;
}
compiler::Label loop, done;
const intptr_t loop_subtract = unboxed_length_ ? 1 : Smi::RawValue(1);
__ beqz(length_reg, &done);
compiler::Label loop;
__ Bind(&loop);
switch (element_size_) {
case 1:
__ lb(TMP, compiler::Address(src_reg));
__ addi(src_reg, src_reg, 1);
__ addi(src_reg, src_reg, mov_size);
__ sb(TMP, compiler::Address(dest_reg));
__ addi(dest_reg, dest_reg, 1);
__ addi(dest_reg, dest_reg, mov_size);
break;
case 2:
__ lh(TMP, compiler::Address(src_reg));
__ addi(src_reg, src_reg, 2);
__ addi(src_reg, src_reg, mov_size);
__ sh(TMP, compiler::Address(dest_reg));
__ addi(dest_reg, dest_reg, 2);
__ addi(dest_reg, dest_reg, mov_size);
break;
case 4:
__ lw(TMP, compiler::Address(src_reg));
__ addi(src_reg, src_reg, 4);
__ addi(src_reg, src_reg, mov_size);
__ sw(TMP, compiler::Address(dest_reg));
__ addi(dest_reg, dest_reg, 4);
__ addi(dest_reg, dest_reg, mov_size);
break;
case 8:
#if XLEN == 32
__ lw(TMP, compiler::Address(src_reg, 0));
__ lw(TMP2, compiler::Address(src_reg, 4));
__ addi(src_reg, src_reg, 8);
__ sw(TMP, compiler::Address(dest_reg, 0));
__ sw(TMP2, compiler::Address(dest_reg, 4));
__ addi(dest_reg, dest_reg, 8);
#else
#if XLEN >= 64
__ ld(TMP, compiler::Address(src_reg));
__ addi(src_reg, src_reg, 8);
__ addi(src_reg, src_reg, mov_size);
__ sd(TMP, compiler::Address(dest_reg));
__ addi(dest_reg, dest_reg, 8);
__ addi(dest_reg, dest_reg, mov_size);
#else
__ lw(TMP, compiler::Address(src_reg));
__ lw(TMP2, compiler::Address(src_reg, mov_size));
__ addi(src_reg, src_reg, 2 * mov_size);
__ sw(TMP, compiler::Address(dest_reg));
__ sw(TMP2, compiler::Address(dest_reg, mov_size));
__ addi(dest_reg, dest_reg, 2 * mov_size);
#endif
break;
case 16:
#if XLEN == 32
__ lw(TMP, compiler::Address(src_reg, 0));
__ lw(TMP2, compiler::Address(src_reg, 4));
__ sw(TMP, compiler::Address(dest_reg, 0));
__ sw(TMP2, compiler::Address(dest_reg, 4));
__ lw(TMP, compiler::Address(src_reg, 8));
__ lw(TMP2, compiler::Address(src_reg, 12));
__ addi(src_reg, src_reg, 16);
__ sw(TMP, compiler::Address(dest_reg, 8));
__ sw(TMP2, compiler::Address(dest_reg, 12));
__ addi(dest_reg, dest_reg, 16);
#elif XLEN == 64
__ ld(TMP, compiler::Address(src_reg, 0));
__ ld(TMP2, compiler::Address(src_reg, 8));
__ addi(src_reg, src_reg, 16);
__ sd(TMP, compiler::Address(dest_reg, 0));
__ sd(TMP2, compiler::Address(dest_reg, 8));
__ addi(dest_reg, dest_reg, 16);
#elif XLEN == 128
#if XLEN >= 128
__ lq(TMP, compiler::Address(src_reg));
__ addi(src_reg, src_reg, 16);
__ addi(src_reg, src_reg, mov_size);
__ sq(TMP, compiler::Address(dest_reg));
__ addi(dest_reg, dest_reg, 16);
__ addi(dest_reg, dest_reg, mov_size);
#elif XLEN == 64
__ ld(TMP, compiler::Address(src_reg));
__ ld(TMP2, compiler::Address(src_reg, mov_size));
__ addi(src_reg, src_reg, 2 * mov_size);
__ sd(TMP, compiler::Address(dest_reg));
__ sd(TMP2, compiler::Address(dest_reg, mov_size));
__ addi(dest_reg, dest_reg, 2 * mov_size);
#else
__ lw(TMP, compiler::Address(src_reg));
__ lw(TMP2, compiler::Address(src_reg, mov_size));
__ sw(TMP, compiler::Address(dest_reg));
__ sw(TMP2, compiler::Address(dest_reg, mov_size));
__ lw(TMP, compiler::Address(src_reg, 2 * mov_size));
__ lw(TMP2, compiler::Address(src_reg, 3 * mov_size));
__ addi(src_reg, src_reg, 4 * mov_size);
__ sw(TMP, compiler::Address(dest_reg, 2 * mov_size));
__ sw(TMP2, compiler::Address(dest_reg, 3 * mov_size));
__ addi(dest_reg, dest_reg, 4 * mov_size);
#endif
break;
default:
UNREACHABLE();
break;
}
__ subi(length_reg, length_reg, loop_subtract);
__ bnez(length_reg, &loop);
__ Bind(&done);
}
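
For reference, here is a minimal Dart-level illustration of the behavior this backward-copy path preserves (plain dart:typed_data, no VM internals): when source and destination share a buffer and the destination starts after the source, each source element must be read before it is overwritten, which forces the copy to run from the end.

import 'dart:typed_data';

void main() {
  final list = Uint8List.fromList([1, 2, 3, 4, 5, 6]);
  // Copy elements 0..3 onto elements 2..5 of the same buffer. setRange
  // behaves like memmove, so every source element is read before it is
  // overwritten, which means the copy effectively runs backwards here.
  list.setRange(2, 6, list, 0);
  print(list); // [1, 2, 1, 2, 3, 4]
  // A naive forward, element-by-element copy would produce [1, 2, 1, 2, 1, 2]
  // instead, because list[2] and list[3] would already have been clobbered
  // by the time they are read as source elements.
}
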
void MemoryCopyInstr::EmitComputeStartPointer(FlowGraphCompiler* compiler,
@ -364,7 +350,8 @@ void MemoryCopyInstr::EmitComputeStartPointer(FlowGraphCompiler* compiler,
}
__ AddImmediate(array_reg, offset);
const Register start_reg = start_loc.reg();
intptr_t shift = Utils::ShiftForPowerOfTwo(element_size_) - 1;
intptr_t shift = Utils::ShiftForPowerOfTwo(element_size_) -
(unboxed_inputs() ? 0 : kSmiTagShift);
__ AddShifted(array_reg, array_reg, start_reg, shift);
}


@ -160,8 +160,17 @@ LocationSummary* MemoryCopyInstr::MakeLocationSummary(Zone* zone,
LocationSummary(zone, kNumInputs, kNumTemps, LocationSummary::kNoCall);
locs->set_in(kSrcPos, Location::RegisterLocation(RSI));
locs->set_in(kDestPos, Location::RegisterLocation(RDI));
locs->set_in(kSrcStartPos, LocationRegisterOrConstant(src_start()));
locs->set_in(kDestStartPos, LocationRegisterOrConstant(dest_start()));
const bool needs_writable_inputs =
(((element_size_ == 1) && !unboxed_inputs_) ||
((element_size_ == 16) && unboxed_inputs_));
locs->set_in(kSrcStartPos,
needs_writable_inputs
? LocationWritableRegisterOrConstant(src_start())
: LocationRegisterOrConstant(src_start()));
locs->set_in(kDestStartPos,
needs_writable_inputs
? LocationWritableRegisterOrConstant(dest_start())
: LocationRegisterOrConstant(dest_start()));
if (length()->BindsToSmiConstant() && length()->BoundSmiConstant() <= 4) {
locs->set_in(
kLengthPos,
@ -173,64 +182,57 @@ LocationSummary* MemoryCopyInstr::MakeLocationSummary(Zone* zone,
return locs;
}
void MemoryCopyInstr::EmitNativeCode(FlowGraphCompiler* compiler) {
const Register src_reg = locs()->in(kSrcPos).reg();
const Register dest_reg = locs()->in(kDestPos).reg();
const Location src_start_loc = locs()->in(kSrcStartPos);
const Location dest_start_loc = locs()->in(kDestStartPos);
const Location length_loc = locs()->in(kLengthPos);
static inline intptr_t SizeOfMemoryCopyElements(intptr_t element_size) {
return Utils::Minimum<intptr_t>(element_size, compiler::target::kWordSize);
}
EmitComputeStartPointer(compiler, src_cid_, src_reg, src_start_loc);
EmitComputeStartPointer(compiler, dest_cid_, dest_reg, dest_start_loc);
void MemoryCopyInstr::PrepareLengthRegForLoop(FlowGraphCompiler* compiler,
Register length_reg,
compiler::Label* done) {
const intptr_t mov_size = SizeOfMemoryCopyElements(element_size_);
if (length_loc.IsConstant()) {
const intptr_t num_bytes =
Integer::Cast(length_loc.constant()).AsInt64Value() * element_size_;
const intptr_t mov_size =
Utils::Minimum(element_size_, static_cast<intptr_t>(8));
const intptr_t mov_repeat = num_bytes / mov_size;
ASSERT(num_bytes % mov_size == 0);
for (intptr_t i = 0; i < mov_repeat; i++) {
const intptr_t disp = mov_size * i;
switch (mov_size) {
case 1:
__ movzxb(TMP, compiler::Address(src_reg, disp));
__ movb(compiler::Address(dest_reg, disp), ByteRegisterOf(TMP));
break;
case 2:
__ movzxw(TMP, compiler::Address(src_reg, disp));
__ movw(compiler::Address(dest_reg, disp), TMP);
break;
case 4:
__ movl(TMP, compiler::Address(src_reg, disp));
__ movl(compiler::Address(dest_reg, disp), TMP);
break;
case 8:
__ movq(TMP, compiler::Address(src_reg, disp));
__ movq(compiler::Address(dest_reg, disp), TMP);
break;
}
}
return;
}
if (element_size_ <= compiler::target::kWordSize) {
if (!unboxed_length_) {
__ SmiUntag(RCX);
}
// We want to convert the value in length_reg to an unboxed length in
// terms of mov_size-sized elements.
const intptr_t shift = Utils::ShiftForPowerOfTwo(element_size_) -
Utils::ShiftForPowerOfTwo(mov_size) -
(unboxed_inputs() ? 0 : kSmiTagShift);
if (shift < 0) {
ASSERT_EQUAL(shift, -kSmiTagShift);
__ SmiUntag(length_reg);
} else if (shift > 0) {
__ OBJ(shl)(length_reg, compiler::Immediate(shift));
} else {
const intptr_t shift = Utils::ShiftForPowerOfTwo(element_size_) -
compiler::target::kWordSizeLog2 -
(unboxed_length_ ? 0 : kSmiTagShift);
if (shift != 0) {
__ shll(RCX, compiler::Immediate(shift));
}
#if defined(DART_COMPRESSED_POINTERS)
__ orl(RCX, RCX);
#endif
__ ExtendNonNegativeSmi(length_reg);
}
switch (element_size_) {
}
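
As a rough model of the shift computed above (a sketch only, assuming kSmiTagShift == 1 and an 8-byte word; the helper names are illustrative), the length register is rescaled from a possibly Smi-tagged element count into a count of mov_size-sized chunks:

// Model of PrepareLengthRegForLoop's shift: convert a (possibly Smi-tagged)
// element count into a count of mov_size-sized chunks, where
// mov_size = min(elementSize, wordSize).
const int kSmiTagShift = 1; // assumption: Smis carry a one-bit tag
const int kWordSize = 8; // assumption: 64-bit target

int log2(int x) => x.bitLength - 1; // x must be a power of two

int chunkCount(int lengthRegValue, int elementSize, bool unboxedInputs) {
  final movSize = elementSize < kWordSize ? elementSize : kWordSize;
  final shift =
      log2(elementSize) - log2(movSize) - (unboxedInputs ? 0 : kSmiTagShift);
  // shift < 0 corresponds to SmiUntag, shift > 0 to a left shift, and 0 to a
  // no-op (modulo sign extension of compressed Smis).
  return shift < 0 ? lengthRegValue >> -shift : lengthRegValue << shift;
}

void main() {
  // Boxed (Smi-tagged) count of 5 one-byte elements: 5 one-byte chunks.
  print(chunkCount(5 << kSmiTagShift, 1, false)); // 5
  // Unboxed count of 5 sixteen-byte elements: 10 eight-byte chunks.
  print(chunkCount(5, 16, true)); // 10
  // Boxed count of 5 eight-byte elements: 5 eight-byte chunks.
  print(chunkCount(5 << kSmiTagShift, 8, false)); // 5
}
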
void MemoryCopyInstr::EmitLoopCopy(FlowGraphCompiler* compiler,
Register dest_reg,
Register src_reg,
Register length_reg,
compiler::Label* done,
compiler::Label* copy_forwards) {
const intptr_t mov_size = SizeOfMemoryCopyElements(element_size_);
const bool reversed = copy_forwards != nullptr;
if (reversed) {
// Avoid doing the extra work to prepare for the rep mov instructions
// if the length to copy is zero.
__ BranchIfZero(length_reg, done);
    // Check that the regions actually overlap by testing whether the first
    // element of dest lies at or before the last element of src (see the
    // Dart example after this function).
const ScaleFactor scale = ToScaleFactor(mov_size, /*index_unboxed=*/true);
__ leaq(TMP, compiler::Address(src_reg, length_reg, scale, -mov_size));
__ CompareRegisters(dest_reg, TMP);
__ BranchIf(UNSIGNED_GREATER, copy_forwards,
compiler::Assembler::kNearJump);
// The backwards move must be performed, so move TMP -> src_reg and do the
// same adjustment for dest_reg.
__ movq(src_reg, TMP);
__ leaq(dest_reg,
compiler::Address(dest_reg, length_reg, scale, -mov_size));
__ std();
}
switch (mov_size) {
case 1:
__ rep_movsb();
break;
@ -241,9 +243,13 @@ void MemoryCopyInstr::EmitNativeCode(FlowGraphCompiler* compiler) {
__ rep_movsd();
break;
case 8:
case 16:
__ rep_movsq();
break;
default:
UNREACHABLE();
}
if (reversed) {
__ cld();
}
}
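
A hedged Dart sketch of the reversed case handled above (the helper name is illustrative): after pointing both sides at their last element, the copy simply walks downwards, which is what the std / rep-movs / cld sequence does one chunk at a time.

import 'dart:typed_data';

// Copies length bytes from src[srcStart..] to dest[destStart..], starting at
// the last byte and walking down, mirroring the reversed rep-movs setup.
void copyBackwards(
    Uint8List dest, int destStart, Uint8List src, int srcStart, int length) {
  for (var i = length - 1; i >= 0; i--) {
    dest[destStart + i] = src[srcStart + i];
  }
}

void main() {
  final buffer = Uint8List.fromList([10, 20, 30, 40, 50]);
  // Overlapping regions with dest after src: descending order is required.
  copyBackwards(buffer, 1, buffer, 0, 4);
  print(buffer); // [10, 10, 20, 30, 40]
}
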
@ -296,43 +302,24 @@ void MemoryCopyInstr::EmitComputeStartPointer(FlowGraphCompiler* compiler,
__ AddImmediate(array_reg, add_value);
return;
}
// Note that start_reg must be writable in the special cases below.
const Register start_reg = start_loc.reg();
ScaleFactor scale;
switch (element_size_) {
case 1:
__ SmiUntag(start_reg);
scale = TIMES_1;
break;
case 2:
#if defined(DART_COMPRESSED_POINTERS)
// Clear garbage upper bits, as no form of lea will ignore them. Assume
// start is positive to use the shorter orl over the longer movsxd.
__ orl(start_reg, start_reg);
#endif
scale = TIMES_1;
break;
case 4:
#if defined(DART_COMPRESSED_POINTERS)
__ orl(start_reg, start_reg);
#endif
scale = TIMES_2;
break;
case 8:
#if defined(DART_COMPRESSED_POINTERS)
__ orl(start_reg, start_reg);
#endif
scale = TIMES_4;
break;
case 16:
#if defined(DART_COMPRESSED_POINTERS)
__ orl(start_reg, start_reg);
#endif
scale = TIMES_8;
break;
default:
UNREACHABLE();
break;
bool index_unboxed = unboxed_inputs_;
// Both special cases below assume that Smis are only shifted one bit.
COMPILE_ASSERT(kSmiTagShift == 1);
if (element_size_ == 1 && !index_unboxed) {
    // Shift the value to the right by removing the Smi tag.
__ SmiUntag(start_reg);
index_unboxed = true;
} else if (element_size_ == 16 && index_unboxed) {
// Can't use TIMES_16 on X86, so instead pre-shift the value to reduce the
// scaling needed in the leaq instruction.
__ SmiTag(start_reg);
index_unboxed = false;
} else if (!index_unboxed) {
__ ExtendNonNegativeSmi(start_reg);
}
auto const scale = ToScaleFactor(element_size_, index_unboxed);
__ leaq(array_reg, compiler::Address(array_reg, start_reg, scale, offset));
}
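
A small Dart model of the index adjustments above (a sketch only; it assumes kSmiTagShift == 1 and x86's maximum addressing-mode scale of 8): the byte offset index * element_size has to be expressed as adjusted_index * scale with scale <= 8, so the 1-byte/boxed and 16-byte/unboxed cases pre-shift the index register instead of relying on the scale alone.

const int kSmiTagShift = 1; // assumption: Smis carry a one-bit tag
const int kMaxScale = 8; // x86 addressing modes scale by at most 8

int log2(int x) => x.bitLength - 1; // x must be a power of two

// Returns the byte offset contributed by the index, modelling the
// (adjusted index, scale) pair chosen by the code above.
int byteOffset(int index, int elementSize, bool indexUnboxed) {
  var adjusted = indexUnboxed ? index : index << kSmiTagShift; // register value
  var unboxed = indexUnboxed;
  if (elementSize == 1 && !unboxed) {
    adjusted >>= kSmiTagShift; // SmiUntag
    unboxed = true;
  } else if (elementSize == 16 && unboxed) {
    adjusted <<= kSmiTagShift; // SmiTag: trade an unavailable x16 scale for x8
    unboxed = false;
  }
  final scale = 1 << (log2(elementSize) - (unboxed ? 0 : kSmiTagShift));
  assert(scale <= kMaxScale);
  return adjusted * scale;
}

void main() {
  print(byteOffset(3, 1, false)); // 3: untagged index, scale 1
  print(byteOffset(3, 16, true)); // 48: tagged index, scale 8
  print(byteOffset(3, 4, false)); // 12: tagged index, scale 2
}
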
@ -1617,15 +1604,10 @@ void OneByteStringFromCharCodeInstr::EmitNativeCode(
Register char_code = locs()->in(0).reg();
Register result = locs()->out(0).reg();
#if defined(DART_COMPRESSED_POINTERS)
// The upper half of a compressed Smi contains undefined bits, but no x64
// addressing mode will ignore these bits. Assume that the index is
// non-negative and clear the upper bits, which is shorter than
// sign-extension (movsxd). Note: we don't bother to ensure index is a
// writable input because any other instructions using it must also not
// rely on the upper bits.
__ orl(char_code, char_code);
#endif
// Note: we don't bother to ensure char_code is a writable input because any
// other instructions using it must also not rely on the upper bits when
// compressed.
__ ExtendNonNegativeSmi(char_code);
__ movq(result,
compiler::Address(THR, Thread::predefined_symbols_address_offset()));
__ movq(result,
@ -1868,24 +1850,20 @@ void LoadIndexedInstr::EmitNativeCode(FlowGraphCompiler* compiler) {
const Register array = locs()->in(0).reg();
const Location index = locs()->in(1);
intptr_t index_scale = index_scale_;
bool index_unboxed = index_unboxed_;
if (index.IsRegister()) {
if (index_scale == 1 && !index_unboxed_) {
if (index_scale_ == 1 && !index_unboxed) {
__ SmiUntag(index.reg());
} else if (index_scale == 16 && index_unboxed_) {
index_unboxed = true;
} else if (index_scale_ == 16 && index_unboxed) {
// X64 does not support addressing mode using TIMES_16.
__ SmiTag(index.reg());
index_scale >>= 1;
} else if (!index_unboxed_) {
#if defined(DART_COMPRESSED_POINTERS)
// The upper half of a compressed Smi contains undefined bits, but no x64
// addressing mode will ignore these bits. Assume that the index is
// non-negative and clear the upper bits, which is shorter than
// sign-extension (movsxd). Note: we don't bother to ensure index is a
// writable input because any other instructions using it must also not
// rely on the upper bits.
__ orl(index.reg(), index.reg());
#endif
index_unboxed = false;
} else if (!index_unboxed) {
// Note: we don't bother to ensure index is a writable input because any
// other instructions using it must also not rely on the upper bits
// when compressed.
__ ExtendNonNegativeSmi(index.reg());
}
} else {
ASSERT(index.IsConstant());
@ -1893,10 +1871,10 @@ void LoadIndexedInstr::EmitNativeCode(FlowGraphCompiler* compiler) {
compiler::Address element_address =
index.IsRegister() ? compiler::Assembler::ElementAddressForRegIndex(
IsExternal(), class_id(), index_scale,
index_unboxed_, array, index.reg())
IsExternal(), class_id(), index_scale_,
index_unboxed, array, index.reg())
: compiler::Assembler::ElementAddressForIntIndex(
IsExternal(), class_id(), index_scale, array,
IsExternal(), class_id(), index_scale_, array,
Smi::Cast(index.constant()).Value());
if ((representation() == kUnboxedFloat) ||
@ -1984,26 +1962,19 @@ LocationSummary* LoadCodeUnitsInstr::MakeLocationSummary(Zone* zone,
void LoadCodeUnitsInstr::EmitNativeCode(FlowGraphCompiler* compiler) {
// The string register points to the backing store for external strings.
const Register str = locs()->in(0).reg();
const Location index = locs()->in(1);
const Register index = locs()->in(1).reg();
bool index_unboxed = false;
if ((index_scale() == 1)) {
__ SmiUntag(index);
index_unboxed = true;
} else {
__ ExtendNonNegativeSmi(index);
}
compiler::Address element_address =
compiler::Assembler::ElementAddressForRegIndex(
IsExternal(), class_id(), index_scale(), /*index_unboxed=*/false, str,
index.reg());
IsExternal(), class_id(), index_scale(), index_unboxed, str, index);
if ((index_scale() == 1)) {
__ SmiUntag(index.reg());
} else {
#if defined(DART_COMPRESSED_POINTERS)
// The upper half of a compressed Smi contains undefined bits, but no x64
// addressing mode will ignore these bits. Assume that the index is
// non-negative and clear the upper bits, which is shorter than
// sign-extension (movsxd). Note: we don't bother to ensure index is a
// writable input because any other instructions using it must also not
// rely on the upper bits.
__ orl(index.reg(), index.reg());
#endif
}
Register result = locs()->out(0).reg();
switch (class_id()) {
case kOneByteStringCid:
@ -2118,24 +2089,20 @@ void StoreIndexedInstr::EmitNativeCode(FlowGraphCompiler* compiler) {
const Register array = locs()->in(0).reg();
const Location index = locs()->in(1);
intptr_t index_scale = index_scale_;
bool index_unboxed = index_unboxed_;
if (index.IsRegister()) {
if (index_scale == 1 && !index_unboxed_) {
if (index_scale_ == 1 && !index_unboxed) {
__ SmiUntag(index.reg());
} else if (index_scale == 16 && index_unboxed_) {
index_unboxed = true;
} else if (index_scale_ == 16 && index_unboxed) {
// X64 does not support addressing mode using TIMES_16.
__ SmiTag(index.reg());
index_scale >>= 1;
} else if (!index_unboxed_) {
#if defined(DART_COMPRESSED_POINTERS)
// The upper half of a compressed Smi contains undefined bits, but no x64
// addressing mode will ignore these bits. Assume that the index is
// non-negative and clear the upper bits, which is shorter than
// sign-extension (movsxd). Note: we don't bother to ensure index is a
// writable input because any other instructions using it must also not
// rely on the upper bits.
__ orl(index.reg(), index.reg());
#endif
index_unboxed = false;
} else if (!index_unboxed) {
// Note: we don't bother to ensure index is a writable input because any
// other instructions using it must also not rely on the upper bits
// when compressed.
__ ExtendNonNegativeSmi(index.reg());
}
} else {
ASSERT(index.IsConstant());
@ -2143,10 +2110,10 @@ void StoreIndexedInstr::EmitNativeCode(FlowGraphCompiler* compiler) {
compiler::Address element_address =
index.IsRegister() ? compiler::Assembler::ElementAddressForRegIndex(
IsExternal(), class_id(), index_scale,
index_unboxed_, array, index.reg())
IsExternal(), class_id(), index_scale_,
index_unboxed, array, index.reg())
: compiler::Assembler::ElementAddressForIntIndex(
IsExternal(), class_id(), index_scale, array,
IsExternal(), class_id(), index_scale_, array,
Smi::Cast(index.constant()).Value());
switch (class_id()) {
@ -6583,15 +6550,10 @@ void IndirectGotoInstr::EmitNativeCode(FlowGraphCompiler* compiler) {
Register offset_reg = locs()->temp(0).reg();
ASSERT(RequiredInputRepresentation(0) == kTagged);
#if defined(DART_COMPRESSED_POINTERS)
// The upper half of a compressed Smi contains undefined bits, but no x64
// addressing mode will ignore these bits. Assume that the index is
// non-negative and clear the upper bits, which is shorter than
// sign-extension (movsxd). Note: we don't bother to ensure index is a
// writable input because any other instructions using it must also not
// rely on the upper bits.
__ orl(index_reg, index_reg);
#endif
// Note: we don't bother to ensure index is a writable input because any
// other instructions using it must also not rely on the upper bits
// when compressed.
__ ExtendNonNegativeSmi(index_reg);
__ LoadObject(offset_reg, offsets_);
__ movsxd(offset_reg, compiler::Assembler::ElementAddressForRegIndex(
/*is_external=*/false, kTypedDataInt32ArrayCid,


@ -970,7 +970,8 @@ static void ReplaceParameterStubs(Zone* zone,
for (intptr_t i = 0; i < defns->length(); ++i) {
ConstantInstr* constant = (*defns)[i]->AsConstant();
if (constant != nullptr && constant->HasUses()) {
constant->ReplaceUsesWith(caller_graph->GetConstant(constant->value()));
constant->ReplaceUsesWith(caller_graph->GetConstant(
constant->value(), constant->representation()));
}
}
@ -978,7 +979,8 @@ static void ReplaceParameterStubs(Zone* zone,
for (intptr_t i = 0; i < defns->length(); ++i) {
ConstantInstr* constant = (*defns)[i]->AsConstant();
if (constant != nullptr && constant->HasUses()) {
constant->ReplaceUsesWith(caller_graph->GetConstant(constant->value()));
constant->ReplaceUsesWith(caller_graph->GetConstant(
constant->value(), constant->representation()));
}
SpecialParameterInstr* param = (*defns)[i]->AsSpecialParameter();
@ -4508,11 +4510,24 @@ bool FlowGraphInliner::TryInlineRecognizedMethod(
// Insert explicit unboxing instructions with truncation to avoid relying
// on [SelectRepresentations] which doesn't mark them as truncating.
arg_target_offset_in_bytes = UnboxInstr::Create(
kUnboxedIntPtr, new (Z) Value(arg_target_offset_in_bytes),
call->deopt_id(), Instruction::kNotSpeculative);
arg_target_offset_in_bytes->AsUnboxInteger()->mark_truncating();
flow_graph->AppendTo(*entry, arg_target_offset_in_bytes, env,
FlowGraph::kValue);
arg_source_offset_in_bytes = UnboxInstr::Create(
kUnboxedIntPtr, new (Z) Value(arg_source_offset_in_bytes),
call->deopt_id(), Instruction::kNotSpeculative);
arg_source_offset_in_bytes->AsUnboxInteger()->mark_truncating();
flow_graph->AppendTo(arg_target_offset_in_bytes,
arg_source_offset_in_bytes, env, FlowGraph::kValue);
arg_length_in_bytes =
UnboxInstr::Create(kUnboxedIntPtr, new (Z) Value(arg_length_in_bytes),
call->deopt_id(), Instruction::kNotSpeculative);
arg_length_in_bytes->AsUnboxInteger()->mark_truncating();
flow_graph->AppendTo(*entry, arg_length_in_bytes, env, FlowGraph::kValue);
flow_graph->AppendTo(arg_source_offset_in_bytes, arg_length_in_bytes, env,
FlowGraph::kValue);
*last = new (Z)
MemoryCopyInstr(new (Z) Value(arg_source), new (Z) Value(arg_target),
@ -4520,7 +4535,8 @@ bool FlowGraphInliner::TryInlineRecognizedMethod(
new (Z) Value(arg_target_offset_in_bytes),
new (Z) Value(arg_length_in_bytes),
/*src_cid=*/kTypedDataUint8ArrayCid,
/*dest_cid=*/kTypedDataUint8ArrayCid, true);
/*dest_cid=*/kTypedDataUint8ArrayCid,
/*unboxed_inputs=*/true, /*can_overlap=*/true);
flow_graph->AppendTo(arg_length_in_bytes, *last, env, FlowGraph::kEffect);
*result = flow_graph->constant_null();


@ -252,6 +252,14 @@ Location LocationRegisterOrSmiConstant(Value* value,
return Location::Constant(constant);
}
Location LocationWritableRegisterOrConstant(Value* value) {
ConstantInstr* constant = value->definition()->AsConstant();
return ((constant != nullptr) &&
compiler::Assembler::IsSafe(constant->value()))
? Location::Constant(constant)
: Location::WritableRegister();
}
Location LocationWritableRegisterOrSmiConstant(Value* value,
intptr_t min_value,
intptr_t max_value) {


@ -505,6 +505,7 @@ Location LocationRegisterOrSmiConstant(
Value* value,
intptr_t min_value = compiler::target::kSmiMin,
intptr_t max_value = compiler::target::kSmiMax);
Location LocationWritableRegisterOrConstant(Value* value);
Location LocationWritableRegisterOrSmiConstant(
Value* value,
intptr_t min_value = compiler::target::kSmiMin,


@ -37,31 +37,115 @@ static classid_t TypedDataCidForElementSize(intptr_t elem_size) {
UNIMPLEMENTED();
}
static inline intptr_t ExpectedValue(intptr_t i) {
return 1 + i % 100;
}
static void InitializeMemory(uint8_t* input, uint8_t* output) {
const bool use_same_buffer = input == output;
for (intptr_t i = 0; i < kMemoryTestLength; i++) {
input[i] = ExpectedValue(i); // Initialized.
if (!use_same_buffer) {
output[i] = kUnInitialized; // Empty.
}
}
}
static bool CheckMemory(Expect expect,
const uint8_t* input,
const uint8_t* output,
intptr_t dest_start,
intptr_t src_start,
intptr_t length,
intptr_t elem_size) {
ASSERT(Utils::IsPowerOfTwo(kMemoryTestLength));
expect.LessThan<intptr_t>(0, elem_size);
if (!Utils::IsPowerOfTwo(elem_size)) {
expect.Fail("Expected %" Pd " to be a power of two", elem_size);
}
expect.LessEqual<intptr_t>(0, length);
expect.LessEqual<intptr_t>(0, dest_start);
expect.LessEqual<intptr_t>(dest_start + length,
kMemoryTestLength / elem_size);
expect.LessEqual<intptr_t>(0, src_start);
expect.LessEqual<intptr_t>(src_start + length, kMemoryTestLength / elem_size);
const bool use_same_buffer = input == output;
const intptr_t dest_start_in_bytes = dest_start * elem_size;
const intptr_t dest_end_in_bytes = dest_start_in_bytes + length * elem_size;
const intptr_t index_diff = dest_start_in_bytes - src_start * elem_size;
for (intptr_t i = 0; i < kMemoryTestLength; i++) {
if (!use_same_buffer) {
const intptr_t expected = ExpectedValue(i);
const intptr_t got = input[i];
if (expected != got) {
expect.Fail("Unexpected change to input buffer at index %" Pd
", expected %" Pd ", got %" Pd "",
i, expected, got);
}
}
const intptr_t unchanged =
use_same_buffer ? ExpectedValue(i) : kUnInitialized;
const intptr_t got = output[i];
if (dest_start_in_bytes <= i && i < dest_end_in_bytes) {
// Copied.
const intptr_t expected = ExpectedValue(i - index_diff);
if (expected != got) {
if (got == unchanged) {
expect.Fail("No change to output buffer at index %" Pd
", expected %" Pd ", got %" Pd "",
i, expected, got);
} else {
expect.Fail("Incorrect change to output buffer at index %" Pd
", expected %" Pd ", got %" Pd "",
i, expected, got);
}
}
} else {
// Untouched.
if (got != unchanged) {
expect.Fail("Unexpected change to input buffer at index %" Pd
", expected %" Pd ", got %" Pd "",
i, unchanged, got);
}
}
}
return expect.failed();
}
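
As a concrete (and hedged) restatement of what CheckMemory expects, using plain dart:typed_data rather than the harness above and 0 as a stand-in for kUnInitialized: only the destination range changes, and it mirrors ExpectedValue shifted by the difference between the two start indices.

import 'dart:typed_data';

int expectedValue(int i) => 1 + i % 100; // same pattern as ExpectedValue above

void main() {
  const length = 32;
  final input = Uint8List(length);
  final output = Uint8List(length); // zero-filled stand-in for kUnInitialized
  for (var i = 0; i < length; i++) {
    input[i] = expectedValue(i);
  }
  // Copy 3 elements from input[4..6] into output[10..12].
  output.setRange(10, 13, input, 4);
  for (var i = 0; i < length; i++) {
    if (i >= 10 && i < 13) {
      // Copied region: mirrors the source, shifted by dest_start - src_start.
      assert(output[i] == expectedValue(i - (10 - 4)));
    } else {
      // Untouched region.
      assert(output[i] == 0);
    }
  }
  print(output.sublist(10, 13)); // [5, 6, 7]
}
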
#define CHECK_DEFAULT_MEMORY(in, out) \
do { \
if (CheckMemory(dart::Expect(__FILE__, __LINE__), in, out, 0, 0, 0, 1)) { \
return; \
} \
} while (false)
#define CHECK_MEMORY(in, out, start, skip, len, size) \
do { \
if (CheckMemory(dart::Expect(__FILE__, __LINE__), in, out, start, skip, \
len, size)) { \
return; \
} \
} while (false)
static void RunMemoryCopyInstrTest(intptr_t src_start,
intptr_t dest_start,
intptr_t length,
intptr_t elem_size,
bool length_unboxed) {
bool unboxed_inputs,
bool use_same_buffer) {
OS::Print("==================================================\n");
OS::Print("RunMemoryCopyInstrTest src_start %" Pd " dest_start %" Pd
" length "
"%" Pd "%s elem_size %" Pd "\n",
src_start, dest_start, length, length_unboxed ? " (unboxed)" : "",
src_start, dest_start, length, unboxed_inputs ? " (unboxed)" : "",
elem_size);
OS::Print("==================================================\n");
classid_t cid = TypedDataCidForElementSize(elem_size);
intptr_t dest_copied_start = dest_start * elem_size;
intptr_t dest_copied_end = dest_copied_start + length * elem_size;
ASSERT(dest_copied_end < kMemoryTestLength);
intptr_t expect_diff = (dest_start - src_start) * elem_size;
uint8_t* ptr = reinterpret_cast<uint8_t*>(malloc(kMemoryTestLength));
uint8_t* ptr2 = reinterpret_cast<uint8_t*>(malloc(kMemoryTestLength));
for (intptr_t i = 0; i < kMemoryTestLength; i++) {
ptr[i] = 1 + i % 100; // Initialized.
    ptr2[i] = kUnInitialized;  // Empty.
}
uint8_t* ptr2 = use_same_buffer
? ptr
: reinterpret_cast<uint8_t*>(malloc(kMemoryTestLength));
InitializeMemory(ptr, ptr2);
OS::Print("&ptr %p &ptr2 %p\n", ptr, ptr2);
@ -69,174 +153,348 @@ static void RunMemoryCopyInstrTest(intptr_t src_start,
auto kScript = Utils::CStringUniquePtr(OS::SCreate(nullptr, R"(
import 'dart:ffi';
void myFunction() {
void copyConst() {
final pointer = Pointer<Uint8>.fromAddress(%s%p);
final pointer2 = Pointer<Uint8>.fromAddress(%s%p);
anotherFunction();
noop();
}
void anotherFunction() {}
)", pointer_prefix, ptr, pointer_prefix, ptr2), std::free);
void callNonConstCopy() {
final pointer = Pointer<Uint8>.fromAddress(%s%p);
final pointer2 = Pointer<Uint8>.fromAddress(%s%p);
final src_start = %)" Pd R"(;
final dest_start = %)" Pd R"(;
final length = %)" Pd R"(;
copyNonConst(
pointer, pointer2, src_start, dest_start, length);
}
void noop() {}
void copyNonConst(Pointer<Uint8> ptr1,
Pointer<Uint8> ptr2,
int src_start,
int dest_start,
int length) {}
)", pointer_prefix, ptr, pointer_prefix, ptr2,
pointer_prefix, ptr, pointer_prefix, ptr2,
src_start, dest_start, length), std::free);
// clang-format on
const auto& root_library = Library::Handle(LoadTestScript(kScript.get()));
Invoke(root_library, "myFunction");
// Running this should be a no-op on the memory.
for (intptr_t i = 0; i < kMemoryTestLength; i++) {
EXPECT_EQ(1 + i % 100, static_cast<intptr_t>(ptr[i]));
EXPECT_EQ(kUnInitialized, static_cast<intptr_t>(ptr2[i]));
}
const auto& my_function =
Function::Handle(GetFunction(root_library, "myFunction"));
TestPipeline pipeline(my_function, CompilerPass::kJIT);
FlowGraph* flow_graph = pipeline.RunPasses({
CompilerPass::kComputeSSA,
});
StaticCallInstr* pointer = nullptr;
StaticCallInstr* pointer2 = nullptr;
StaticCallInstr* another_function_call = nullptr;
// Test the MemoryCopy instruction when the inputs are constants.
{
ILMatcher cursor(flow_graph, flow_graph->graph_entry()->normal_entry());
Invoke(root_library, "copyConst");
// Running this should be a no-op on the memory.
CHECK_DEFAULT_MEMORY(ptr, ptr2);
EXPECT(cursor.TryMatch({
kMoveGlob,
{kMatchAndMoveStaticCall, &pointer},
{kMatchAndMoveStaticCall, &pointer2},
{kMatchAndMoveStaticCall, &another_function_call},
}));
}
const auto& const_copy =
Function::Handle(GetFunction(root_library, "copyConst"));
Zone* const zone = Thread::Current()->zone();
TestPipeline pipeline(const_copy, CompilerPass::kJIT);
FlowGraph* flow_graph = pipeline.RunPasses({
CompilerPass::kComputeSSA,
});
auto* const src_start_constant_instr = flow_graph->GetConstant(
Integer::ZoneHandle(zone, Integer::New(src_start, Heap::kOld)), kTagged);
StaticCallInstr* pointer = nullptr;
StaticCallInstr* pointer2 = nullptr;
StaticCallInstr* another_function_call = nullptr;
{
ILMatcher cursor(flow_graph, flow_graph->graph_entry()->normal_entry());
auto* const dest_start_constant_instr = flow_graph->GetConstant(
Integer::ZoneHandle(zone, Integer::New(dest_start, Heap::kOld)), kTagged);
EXPECT(cursor.TryMatch({
kMoveGlob,
{kMatchAndMoveStaticCall, &pointer},
{kMatchAndMoveStaticCall, &pointer2},
{kMatchAndMoveStaticCall, &another_function_call},
}));
}
auto* const length_constant_instr = flow_graph->GetConstant(
Integer::ZoneHandle(zone, Integer::New(length, Heap::kOld)),
length_unboxed ? kUnboxedIntPtr : kTagged);
Zone* const zone = Thread::Current()->zone();
auto const rep = unboxed_inputs ? kUnboxedIntPtr : kTagged;
auto* const memory_copy_instr = new (zone)
MemoryCopyInstr(new (zone) Value(pointer), new (zone) Value(pointer2),
new (zone) Value(src_start_constant_instr),
new (zone) Value(dest_start_constant_instr),
new (zone) Value(length_constant_instr),
/*src_cid=*/cid,
/*dest_cid=*/cid, length_unboxed);
flow_graph->InsertBefore(another_function_call, memory_copy_instr, nullptr,
FlowGraph::kEffect);
auto* const src_start_constant_instr = flow_graph->GetConstant(
Integer::ZoneHandle(zone, Integer::New(src_start, Heap::kOld)), rep);
another_function_call->RemoveFromGraph();
auto* const dest_start_constant_instr = flow_graph->GetConstant(
Integer::ZoneHandle(zone, Integer::New(dest_start, Heap::kOld)), rep);
{
// Check we constructed the right graph.
ILMatcher cursor(flow_graph, flow_graph->graph_entry()->normal_entry());
EXPECT(cursor.TryMatch({
kMoveGlob,
kMatchAndMoveStaticCall,
kMatchAndMoveStaticCall,
kMatchAndMoveMemoryCopy,
}));
}
auto* const length_constant_instr = flow_graph->GetConstant(
Integer::ZoneHandle(zone, Integer::New(length, Heap::kOld)), rep);
{
auto* const memory_copy_instr = new (zone) MemoryCopyInstr(
new (zone) Value(pointer), new (zone) Value(pointer2),
new (zone) Value(src_start_constant_instr),
new (zone) Value(dest_start_constant_instr),
new (zone) Value(length_constant_instr),
/*src_cid=*/cid,
/*dest_cid=*/cid, unboxed_inputs, /*can_overlap=*/use_same_buffer);
flow_graph->InsertBefore(another_function_call, memory_copy_instr, nullptr,
FlowGraph::kEffect);
another_function_call->RemoveFromGraph();
{
// Check we constructed the right graph.
ILMatcher cursor(flow_graph, flow_graph->graph_entry()->normal_entry());
EXPECT(cursor.TryMatch({
kMoveGlob,
kMatchAndMoveStaticCall,
kMatchAndMoveStaticCall,
kMatchAndMoveMemoryCopy,
}));
}
{
#if !defined(PRODUCT) && !defined(USING_THREAD_SANITIZER)
SetFlagScope<bool> sfs(&FLAG_disassemble_optimized, true);
SetFlagScope<bool> sfs(&FLAG_disassemble_optimized, true);
#endif
pipeline.RunForcedOptimizedAfterSSAPasses();
pipeline.CompileGraphAndAttachFunction();
pipeline.RunForcedOptimizedAfterSSAPasses();
pipeline.CompileGraphAndAttachFunction();
}
{
// Check that the memory copy has constant inputs after optimization.
ILMatcher cursor(flow_graph, flow_graph->graph_entry()->normal_entry());
MemoryCopyInstr* memory_copy;
EXPECT(cursor.TryMatch({
kMoveGlob,
{kMatchAndMoveMemoryCopy, &memory_copy},
}));
EXPECT(memory_copy->src_start()->BindsToConstant());
EXPECT(memory_copy->dest_start()->BindsToConstant());
EXPECT(memory_copy->length()->BindsToConstant());
}
// Run the mem copy.
Invoke(root_library, "copyConst");
}
// Run the mem copy.
Invoke(root_library, "myFunction");
for (intptr_t i = 0; i < kMemoryTestLength; i++) {
EXPECT_EQ(1 + i % 100, static_cast<intptr_t>(ptr[i]));
if (dest_copied_start <= i && i < dest_copied_end) {
// Copied.
EXPECT_EQ(1 + (i - expect_diff) % 100, static_cast<intptr_t>(ptr2[i]));
} else {
// Untouched.
EXPECT_EQ(kUnInitialized, static_cast<intptr_t>(ptr2[i]));
CHECK_MEMORY(ptr, ptr2, dest_start, src_start, length, elem_size);
// Reinitialize the memory for the non-constant MemoryCopy version.
InitializeMemory(ptr, ptr2);
// Test the MemoryCopy instruction when the inputs are not constants.
{
Invoke(root_library, "callNonConstCopy");
// Running this should be a no-op on the memory.
CHECK_DEFAULT_MEMORY(ptr, ptr2);
const auto& copy_non_const =
Function::Handle(GetFunction(root_library, "copyNonConst"));
TestPipeline pipeline(copy_non_const, CompilerPass::kJIT);
FlowGraph* flow_graph = pipeline.RunPasses({
CompilerPass::kComputeSSA,
});
auto* const entry_instr = flow_graph->graph_entry()->normal_entry();
auto* const initial_defs = entry_instr->initial_definitions();
EXPECT(initial_defs != nullptr);
EXPECT_EQ(5, initial_defs->length());
auto* const param_ptr = initial_defs->At(0)->AsParameter();
EXPECT(param_ptr != nullptr);
auto* const param_ptr2 = initial_defs->At(1)->AsParameter();
EXPECT(param_ptr2 != nullptr);
auto* const param_src_start = initial_defs->At(2)->AsParameter();
EXPECT(param_src_start != nullptr);
auto* const param_dest_start = initial_defs->At(3)->AsParameter();
EXPECT(param_dest_start != nullptr);
auto* const param_length = initial_defs->At(4)->AsParameter();
EXPECT(param_length != nullptr);
ReturnInstr* return_instr;
{
ILMatcher cursor(flow_graph, entry_instr);
EXPECT(cursor.TryMatch({
kMoveGlob,
{kMatchReturn, &return_instr},
}));
}
Zone* const zone = Thread::Current()->zone();
Definition* src_start_def = param_src_start;
Definition* dest_start_def = param_dest_start;
Definition* length_def = param_length;
if (unboxed_inputs) {
      // Add the unbox instructions manually instead of leaving it up to the
      // SelectRepresentations pass.
length_def =
UnboxInstr::Create(kUnboxedWord, new (zone) Value(param_length),
DeoptId::kNone, Instruction::kNotSpeculative);
flow_graph->InsertBefore(return_instr, length_def, nullptr,
FlowGraph::kValue);
dest_start_def =
UnboxInstr::Create(kUnboxedWord, new (zone) Value(param_dest_start),
DeoptId::kNone, Instruction::kNotSpeculative);
flow_graph->InsertBefore(length_def, dest_start_def, nullptr,
FlowGraph::kValue);
src_start_def =
UnboxInstr::Create(kUnboxedWord, new (zone) Value(param_src_start),
DeoptId::kNone, Instruction::kNotSpeculative);
flow_graph->InsertBefore(dest_start_def, src_start_def, nullptr,
FlowGraph::kValue);
}
auto* const memory_copy_instr = new (zone) MemoryCopyInstr(
new (zone) Value(param_ptr), new (zone) Value(param_ptr2),
new (zone) Value(src_start_def), new (zone) Value(dest_start_def),
new (zone) Value(length_def),
/*src_cid=*/cid,
/*dest_cid=*/cid, unboxed_inputs, /*can_overlap=*/use_same_buffer);
flow_graph->InsertBefore(return_instr, memory_copy_instr, nullptr,
FlowGraph::kEffect);
{
// Check we constructed the right graph.
ILMatcher cursor(flow_graph, flow_graph->graph_entry()->normal_entry());
if (unboxed_inputs) {
EXPECT(cursor.TryMatch({
kMoveGlob,
kMatchAndMoveUnbox,
kMatchAndMoveUnbox,
kMatchAndMoveUnbox,
kMatchAndMoveMemoryCopy,
kMatchReturn,
}));
} else {
EXPECT(cursor.TryMatch({
kMoveGlob,
kMatchAndMoveMemoryCopy,
kMatchReturn,
}));
}
}
{
#if !defined(PRODUCT) && !defined(USING_THREAD_SANITIZER)
SetFlagScope<bool> sfs(&FLAG_disassemble_optimized, true);
#endif
pipeline.RunForcedOptimizedAfterSSAPasses();
pipeline.CompileGraphAndAttachFunction();
}
{
// Check that the memory copy has non-constant inputs after optimization.
ILMatcher cursor(flow_graph, flow_graph->graph_entry()->normal_entry());
MemoryCopyInstr* memory_copy;
EXPECT(cursor.TryMatch({
kMoveGlob,
{kMatchAndMoveMemoryCopy, &memory_copy},
}));
EXPECT(!memory_copy->src_start()->BindsToConstant());
EXPECT(!memory_copy->dest_start()->BindsToConstant());
EXPECT(!memory_copy->length()->BindsToConstant());
}
// Run the mem copy.
Invoke(root_library, "callNonConstCopy");
}
CHECK_MEMORY(ptr, ptr2, dest_start, src_start, length, elem_size);
free(ptr);
free(ptr2);
if (!use_same_buffer) {
free(ptr2);
}
}
#define MEMORY_COPY_TEST_BOXED(src_start, dest_start, length, elem_size) \
ISOLATE_UNIT_TEST_CASE( \
IRTest_MemoryCopy_##src_start##_##dest_start##_##length##_##elem_size) { \
RunMemoryCopyInstrTest(src_start, dest_start, length, elem_size, false); \
RunMemoryCopyInstrTest(src_start, dest_start, length, elem_size, false, \
false); \
}
#define MEMORY_COPY_TEST_UNBOXED(src_start, dest_start, length, el_si) \
ISOLATE_UNIT_TEST_CASE( \
IRTest_MemoryCopy_##src_start##_##dest_start##_##length##_##el_si##_u) { \
RunMemoryCopyInstrTest(src_start, dest_start, length, el_si, true); \
RunMemoryCopyInstrTest(src_start, dest_start, length, el_si, true, false); \
}
#define MEMORY_MOVE_TEST_BOXED(src_start, dest_start, length, elem_size) \
ISOLATE_UNIT_TEST_CASE( \
IRTest_MemoryMove_##src_start##_##dest_start##_##length##_##elem_size) { \
    RunMemoryCopyInstrTest(src_start, dest_start, length, elem_size, false,   \
                           true);                                             \
}
#define MEMORY_MOVE_TEST_UNBOXED(src_start, dest_start, length, el_si) \
ISOLATE_UNIT_TEST_CASE( \
IRTest_MemoryMove_##src_start##_##dest_start##_##length##_##el_si##_u) { \
RunMemoryCopyInstrTest(src_start, dest_start, length, el_si, true, true); \
}
#define MEMORY_COPY_TEST(src_start, dest_start, length, elem_size) \
MEMORY_COPY_TEST_BOXED(src_start, dest_start, length, elem_size) \
MEMORY_COPY_TEST_UNBOXED(src_start, dest_start, length, elem_size)
#define MEMORY_MOVE_TEST(src_start, dest_start, length, elem_size) \
MEMORY_MOVE_TEST_BOXED(src_start, dest_start, length, elem_size) \
MEMORY_MOVE_TEST_UNBOXED(src_start, dest_start, length, elem_size)
#define MEMORY_TEST(src_start, dest_start, length, elem_size) \
MEMORY_MOVE_TEST(src_start, dest_start, length, elem_size) \
MEMORY_COPY_TEST(src_start, dest_start, length, elem_size)
// No offset, varying length.
MEMORY_COPY_TEST(0, 0, 1, 1)
MEMORY_COPY_TEST(0, 0, 2, 1)
MEMORY_COPY_TEST(0, 0, 3, 1)
MEMORY_COPY_TEST(0, 0, 4, 1)
MEMORY_COPY_TEST(0, 0, 5, 1)
MEMORY_COPY_TEST(0, 0, 6, 1)
MEMORY_COPY_TEST(0, 0, 7, 1)
MEMORY_COPY_TEST(0, 0, 8, 1)
MEMORY_COPY_TEST(0, 0, 16, 1)
MEMORY_TEST(0, 0, 1, 1)
MEMORY_TEST(0, 0, 2, 1)
MEMORY_TEST(0, 0, 3, 1)
MEMORY_TEST(0, 0, 4, 1)
MEMORY_TEST(0, 0, 5, 1)
MEMORY_TEST(0, 0, 6, 1)
MEMORY_TEST(0, 0, 7, 1)
MEMORY_TEST(0, 0, 8, 1)
MEMORY_TEST(0, 0, 16, 1)
// Offsets.
MEMORY_COPY_TEST(2, 2, 1, 1)
MEMORY_COPY_TEST(2, 17, 3, 1)
MEMORY_COPY_TEST(20, 5, 17, 1)
MEMORY_TEST(2, 2, 1, 1)
MEMORY_TEST(2, 17, 3, 1)
MEMORY_TEST(20, 5, 17, 1)
// Other element sizes.
MEMORY_COPY_TEST(0, 0, 1, 2)
MEMORY_COPY_TEST(0, 0, 1, 4)
MEMORY_COPY_TEST(0, 0, 1, 8)
MEMORY_COPY_TEST(0, 0, 2, 2)
MEMORY_COPY_TEST(0, 0, 2, 4)
MEMORY_COPY_TEST(0, 0, 2, 8)
MEMORY_COPY_TEST(0, 0, 4, 2)
MEMORY_COPY_TEST(0, 0, 4, 4)
MEMORY_COPY_TEST(0, 0, 4, 8)
MEMORY_COPY_TEST(0, 0, 8, 2)
MEMORY_COPY_TEST(0, 0, 8, 4)
MEMORY_COPY_TEST(0, 0, 8, 8)
// TODO(http://dartbug.com/51237): Fix arm64 issue.
#if !defined(TARGET_ARCH_ARM64)
MEMORY_COPY_TEST(0, 0, 2, 16)
MEMORY_COPY_TEST(0, 0, 4, 16)
MEMORY_COPY_TEST(0, 0, 8, 16)
#endif
MEMORY_TEST(0, 0, 1, 2)
MEMORY_TEST(0, 0, 1, 4)
MEMORY_TEST(0, 0, 1, 8)
MEMORY_TEST(0, 0, 2, 2)
MEMORY_TEST(0, 0, 2, 4)
MEMORY_TEST(0, 0, 2, 8)
MEMORY_TEST(0, 0, 4, 2)
MEMORY_TEST(0, 0, 4, 4)
MEMORY_TEST(0, 0, 4, 8)
MEMORY_TEST(0, 0, 8, 2)
MEMORY_TEST(0, 0, 8, 4)
MEMORY_TEST(0, 0, 8, 8)
MEMORY_TEST(0, 0, 2, 16)
MEMORY_TEST(0, 0, 4, 16)
MEMORY_TEST(0, 0, 8, 16)
// Other element sizes with offsets.
MEMORY_COPY_TEST(1, 1, 2, 2)
MEMORY_COPY_TEST(0, 1, 4, 2)
MEMORY_COPY_TEST(1, 2, 3, 2)
MEMORY_COPY_TEST(123, 2, 4, 4)
MEMORY_COPY_TEST(5, 72, 1, 8)
MEMORY_TEST(1, 1, 2, 2)
MEMORY_TEST(0, 1, 4, 2)
MEMORY_TEST(1, 2, 3, 2)
MEMORY_TEST(2, 1, 3, 2)
MEMORY_TEST(123, 2, 4, 4)
MEMORY_TEST(2, 123, 4, 4)
MEMORY_TEST(24, 23, 8, 4)
MEMORY_TEST(23, 24, 8, 4)
MEMORY_TEST(5, 72, 1, 8)
MEMORY_TEST(12, 13, 3, 8)
MEMORY_TEST(15, 12, 8, 8)
// TODO(http://dartbug.com/51229): Fix arm issue.
// TODO(http://dartbug.com/51237): Fix arm64 issue.
#if !defined(TARGET_ARCH_ARM) && !defined(TARGET_ARCH_ARM64)
MEMORY_COPY_TEST(13, 14, 15, 16)
#endif
MEMORY_TEST(13, 14, 15, 16)
MEMORY_TEST(14, 13, 15, 16)
// Size promotions with offsets.
MEMORY_COPY_TEST(2, 2, 8, 1) // promoted to 2.
MEMORY_COPY_TEST(4, 4, 8, 1) // promoted to 4.
MEMORY_COPY_TEST(8, 8, 8, 1) // promoted to 8.
MEMORY_TEST(2, 2, 8, 1) // promoted to 2.
MEMORY_TEST(4, 4, 8, 1) // promoted to 4.
MEMORY_TEST(8, 8, 8, 1) // promoted to 8.
MEMORY_TEST(16, 16, 16, 1) // promoted to 16 on ARM64.
} // namespace dart


@ -269,14 +269,16 @@ Fragment BaseFlowGraphBuilder::UnboxedIntConstant(
Fragment BaseFlowGraphBuilder::MemoryCopy(classid_t src_cid,
classid_t dest_cid,
bool unboxed_length) {
bool unboxed_inputs,
bool can_overlap) {
Value* length = Pop();
Value* dest_start = Pop();
Value* src_start = Pop();
Value* dest = Pop();
Value* src = Pop();
auto copy = new (Z) MemoryCopyInstr(src, dest, src_start, dest_start, length,
src_cid, dest_cid, unboxed_length);
auto copy =
new (Z) MemoryCopyInstr(src, dest, src_start, dest_start, length, src_cid,
dest_cid, unboxed_inputs, can_overlap);
return Fragment(copy);
}


@ -314,7 +314,8 @@ class BaseFlowGraphBuilder {
Fragment CheckStackOverflowInPrologue(TokenPosition position);
Fragment MemoryCopy(classid_t src_cid,
classid_t dest_cid,
bool unboxed_length);
bool unboxed_inputs,
bool can_overlap = true);
Fragment TailCall(const Code& code);
Fragment Utf8Scan();


@ -925,6 +925,11 @@ bool FlowGraphBuilder::IsRecognizedMethodForFlowGraph(
case MethodRecognizer::kRecord_numFields:
case MethodRecognizer::kSuspendState_clone:
case MethodRecognizer::kSuspendState_resume:
case MethodRecognizer::kTypedData_checkBoundsAndMemcpy1:
case MethodRecognizer::kTypedData_checkBoundsAndMemcpy2:
case MethodRecognizer::kTypedData_checkBoundsAndMemcpy4:
case MethodRecognizer::kTypedData_checkBoundsAndMemcpy8:
case MethodRecognizer::kTypedData_checkBoundsAndMemcpy16:
case MethodRecognizer::kTypedData_ByteDataView_factory:
case MethodRecognizer::kTypedData_Int8ArrayView_factory:
case MethodRecognizer::kTypedData_Uint8ArrayView_factory:
@ -1133,6 +1138,27 @@ FlowGraph* FlowGraphBuilder::BuildGraphOfRecognizedMethod(
body += TailCall(resume_stub);
break;
}
case MethodRecognizer::kTypedData_checkBoundsAndMemcpy1:
// Pick an appropriate typed data cid based on the element size.
body +=
BuildTypedDataCheckBoundsAndMemcpy(function, kTypedDataUint8ArrayCid);
break;
case MethodRecognizer::kTypedData_checkBoundsAndMemcpy2:
body += BuildTypedDataCheckBoundsAndMemcpy(function,
kTypedDataUint16ArrayCid);
break;
case MethodRecognizer::kTypedData_checkBoundsAndMemcpy4:
body += BuildTypedDataCheckBoundsAndMemcpy(function,
kTypedDataUint32ArrayCid);
break;
case MethodRecognizer::kTypedData_checkBoundsAndMemcpy8:
body += BuildTypedDataCheckBoundsAndMemcpy(function,
kTypedDataUint64ArrayCid);
break;
case MethodRecognizer::kTypedData_checkBoundsAndMemcpy16:
body += BuildTypedDataCheckBoundsAndMemcpy(function,
kTypedDataInt32x4ArrayCid);
break;
#define CASE(name) \
case MethodRecognizer::kTypedData_##name##_factory: \
body += BuildTypedDataFactoryConstructor(function, kTypedData##name##Cid); \
@ -1215,7 +1241,8 @@ FlowGraph* FlowGraphBuilder::BuildGraphOfRecognizedMethod(
body += LoadLocal(parsed_function_->RawParameterVariable(3));
body += LoadLocal(parsed_function_->RawParameterVariable(4));
body += MemoryCopy(kTypedDataUint8ArrayCid, kOneByteStringCid,
/*unboxed_length=*/false);
/*unboxed_inputs=*/false,
/*can_overlap=*/false);
body += NullConstant();
break;
case MethodRecognizer::kImmutableLinkedHashBase_setIndexStoreRelease:
@ -1261,7 +1288,8 @@ FlowGraph* FlowGraphBuilder::BuildGraphOfRecognizedMethod(
body += LoadLocal(arg_length_in_bytes);
// Pointers and TypedData have the same layout.
body += MemoryCopy(kTypedDataUint8ArrayCid, kTypedDataUint8ArrayCid,
/*unboxed_length=*/false);
/*unboxed_inputs=*/false,
/*can_overlap=*/true);
body += NullConstant();
} break;
case MethodRecognizer::kFfiAbi:
@ -1730,6 +1758,99 @@ Fragment FlowGraphBuilder::BuildTypedDataViewFactoryConstructor(
return body;
}
Fragment FlowGraphBuilder::BuildTypedDataCheckBoundsAndMemcpy(
const Function& function,
intptr_t cid) {
ASSERT_EQUAL(parsed_function_->function().NumParameters(), 5);
LocalVariable* arg_to = parsed_function_->RawParameterVariable(0);
LocalVariable* arg_to_start = parsed_function_->RawParameterVariable(1);
LocalVariable* arg_to_end = parsed_function_->RawParameterVariable(2);
LocalVariable* arg_from = parsed_function_->RawParameterVariable(3);
LocalVariable* arg_from_start = parsed_function_->RawParameterVariable(4);
const Library& lib = Library::Handle(Z, Library::TypedDataLibrary());
ASSERT(!lib.IsNull());
const Function& check_set_range_args = Function::ZoneHandle(
Z, lib.LookupFunctionAllowPrivate(Symbols::_checkSetRangeArguments()));
ASSERT(!check_set_range_args.IsNull());
Fragment body;
body += LoadLocal(arg_to);
body += LoadLocal(arg_to_start);
body += LoadLocal(arg_to_end);
body += LoadLocal(arg_from);
body += LoadLocal(arg_from_start);
body += StaticCall(TokenPosition::kNoSource, check_set_range_args, 5,
ICData::kStatic);
// The length is guaranteed to be a Smi if bounds checking is successful.
LocalVariable* length_to_copy = MakeTemporary("length");
// If we're copying at least this many elements, calling _nativeSetRange,
// which calls memmove via a native call, is faster than using the code
// currently emitted by the MemoryCopy instruction.
//
// TODO(dartbug.com/42072): Improve the code generated by MemoryCopy to
// either increase the constants below or remove the need to call out to
// memmove() altogether.
#if defined(TARGET_ARCH_X64) || defined(TARGET_ARCH_IA32)
  // On X86, the break-even point for using a native call instead of
  // generating a loop via MemoryCopy() is around the same as the largest
  // benchmark (1048576 elements) on the machines we use.
  const intptr_t kCopyLengthForNativeCall = 1024 * 1024;
#else
  // On other architectures, the break-even point is much lower for our
  // benchmarks, and the loop overhead means that what matters is the
  // number of elements copied, not the amount of memory copied.
const intptr_t kCopyLengthForNativeCall = 256;
#endif
JoinEntryInstr* done = BuildJoinEntry();
TargetEntryInstr *is_small_enough, *is_too_large;
body += LoadLocal(length_to_copy);
body += IntConstant(kCopyLengthForNativeCall);
body += SmiRelationalOp(Token::kLT);
body += BranchIfTrue(&is_small_enough, &is_too_large);
Fragment use_instruction(is_small_enough);
use_instruction += LoadLocal(arg_from);
use_instruction += LoadLocal(arg_to);
use_instruction += LoadLocal(arg_from_start);
use_instruction += LoadLocal(arg_to_start);
use_instruction += LoadLocal(length_to_copy);
use_instruction += MemoryCopy(cid, cid,
/*unboxed_inputs=*/false, /*can_overlap=*/true);
use_instruction += Goto(done);
// TODO(dartbug.com/42072): Instead of doing a static call to a native
// method, make a leaf runtime entry for memmove and use CCall.
const Class& typed_list_base =
Class::Handle(Z, lib.LookupClassAllowPrivate(Symbols::_TypedListBase()));
ASSERT(!typed_list_base.IsNull());
const auto& error = typed_list_base.EnsureIsFinalized(H.thread());
ASSERT(error == Error::null());
const Function& native_set_range = Function::ZoneHandle(
Z,
typed_list_base.LookupFunctionAllowPrivate(Symbols::_nativeSetRange()));
ASSERT(!native_set_range.IsNull());
Fragment call_native(is_too_large);
call_native += LoadLocal(arg_to);
call_native += LoadLocal(arg_to_start);
call_native += LoadLocal(arg_to_end);
call_native += LoadLocal(arg_from);
call_native += LoadLocal(arg_from_start);
call_native += StaticCall(TokenPosition::kNoSource, native_set_range, 5,
ICData::kStatic);
call_native += Drop();
call_native += Goto(done);
body.current = done;
body += DropTemporary(&length_to_copy);
body += NullConstant();
return body;
}
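
A hedged Dart sketch of the decision this builder encodes (the threshold constants are copied from the code above; the function and parameter names here are illustrative, not real library entry points): lengths below the threshold use the inline MemoryCopy path, everything else falls back to the memmove-backed native call.

// Thresholds mirror kCopyLengthForNativeCall above.
const int copyLengthForNativeCallX86 = 1024 * 1024;
const int copyLengthForNativeCallOther = 256;

typedef CopyStrategy = void Function();

// Illustrative stand-in for the branch built in IL above.
CopyStrategy chooseStrategy(int length, bool isX86,
    {required CopyStrategy inlineMemoryCopy,
    required CopyStrategy nativeSetRange}) {
  final threshold =
      isX86 ? copyLengthForNativeCallX86 : copyLengthForNativeCallOther;
  return length < threshold ? inlineMemoryCopy : nativeSetRange;
}

void main() {
  final picked = chooseStrategy(512, false,
      inlineMemoryCopy: () => print('MemoryCopy IL'),
      nativeSetRange: () => print('_nativeSetRange (memmove)'));
  picked(); // prints "_nativeSetRange (memmove)", since 512 >= 256 off X86
}
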
Fragment FlowGraphBuilder::BuildTypedDataFactoryConstructor(
const Function& function,
classid_t cid) {


@ -146,6 +146,8 @@ class FlowGraphBuilder : public BaseFlowGraphBuilder {
FlowGraph* BuildGraphOfRecognizedMethod(const Function& function);
Fragment BuildTypedDataCheckBoundsAndMemcpy(const Function& function,
intptr_t cid);
Fragment BuildTypedDataViewFactoryConstructor(const Function& function,
classid_t cid);
Fragment BuildTypedDataFactoryConstructor(const Function& function,


@ -114,6 +114,16 @@ namespace dart {
V(Float32x4List, ., TypedData_Float32x4Array_factory, 0x0a6eefa8) \
V(Int32x4List, ., TypedData_Int32x4Array_factory, 0x5a09288e) \
V(Float64x2List, ., TypedData_Float64x2Array_factory, 0xecbc738a) \
V(_TypedListBase, _checkBoundsAndMemcpy1, \
TypedData_checkBoundsAndMemcpy1, 0xf9d326bd) \
V(_TypedListBase, _checkBoundsAndMemcpy2, \
TypedData_checkBoundsAndMemcpy2, 0xf0756646) \
V(_TypedListBase, _checkBoundsAndMemcpy4, \
TypedData_checkBoundsAndMemcpy4, 0xe8cfd800) \
V(_TypedListBase, _checkBoundsAndMemcpy8, \
TypedData_checkBoundsAndMemcpy8, 0xe945188e) \
V(_TypedListBase, _checkBoundsAndMemcpy16, \
TypedData_checkBoundsAndMemcpy16, 0xebd06cb3) \
V(::, _toClampedUint8, ConvertIntToClampedUint8, 0xd0e522d0) \
V(::, copyRangeFromUint8ListToOneByteString, \
CopyRangeFromUint8ListToOneByteString, 0xcc42cce1) \


@ -19,6 +19,11 @@
#error Unknown architecture.
#endif
#include "platform/assert.h"
#include "platform/utils.h"
#include "vm/pointer_tagging.h"
namespace dart {
// An architecture independent ABI for the InstantiateType stub.
@ -87,6 +92,17 @@ constexpr bool IsAbiPreservedRegister(Register reg) {
}
#endif
static inline ScaleFactor ToScaleFactor(intptr_t index_scale,
bool index_unboxed) {
RELEASE_ASSERT(index_scale >= 0);
const intptr_t shift = Utils::ShiftForPowerOfTwo(index_scale) -
(index_unboxed ? 0 : kSmiTagShift);
  // Boxed indexes with an index_scale whose log2 is less than kSmiTagShift
  // must be handled by the caller, and ScaleFactor is currently only defined
  // up to TIMES_16 == 4.
RELEASE_ASSERT(shift >= 0 && shift <= 4);
return static_cast<ScaleFactor>(shift);
}
} // namespace dart
#endif // RUNTIME_VM_CONSTANTS_H_


@ -67,6 +67,7 @@ static inline Condition InvertCondition(Condition c) {
F(leave, 0xC9) \
F(hlt, 0xF4) \
F(cld, 0xFC) \
F(std, 0xFD) \
F(int3, 0xCC) \
F(pushad, 0x60) \
F(popad, 0x61) \


@ -9102,6 +9102,11 @@ bool Function::RecognizedKindForceOptimize() const {
case MethodRecognizer::kRecord_numFields:
case MethodRecognizer::kUtf8DecoderScan:
case MethodRecognizer::kDouble_hashCode:
case MethodRecognizer::kTypedData_checkBoundsAndMemcpy1:
case MethodRecognizer::kTypedData_checkBoundsAndMemcpy2:
case MethodRecognizer::kTypedData_checkBoundsAndMemcpy4:
case MethodRecognizer::kTypedData_checkBoundsAndMemcpy8:
case MethodRecognizer::kTypedData_checkBoundsAndMemcpy16:
// Prevent the GC from running so that the operation is atomic from
// a GC point of view. Always double check implementation in
// kernel_to_il.cc that no GC can happen in between the relevant IL


@ -384,6 +384,7 @@ class ObjectPointerVisitor;
V(_Type, "_Type") \
V(_TypeParameter, "_TypeParameter") \
V(_TypeVariableMirror, "_TypeVariableMirror") \
V(_TypedListBase, "_TypedListBase") \
V(_Uint16ArrayFactory, "Uint16List.") \
V(_Uint16ArrayView, "_Uint16ArrayView") \
V(_Uint16List, "_Uint16List") \
@ -422,6 +423,7 @@ class ObjectPointerVisitor;
V(_await, "_await") \
V(_awaitWithTypeCheck, "_awaitWithTypeCheck") \
V(_backtrackingStack, "_backtrackingStack") \
V(_checkSetRangeArguments, "_checkSetRangeArguments") \
V(_classRangeCheck, "_classRangeCheck") \
V(_current, "_current") \
V(_ensureScheduleImmediate, "_ensureScheduleImmediate") \
@ -447,6 +449,7 @@ class ObjectPointerVisitor;
V(_mapGet, "_mapGet") \
V(_mapKeys, "_mapKeys") \
V(_name, "_name") \
V(_nativeSetRange, "_nativeSetRange") \
V(_objectEquals, "_objectEquals") \
V(_objectHashCode, "_objectHashCode") \
V(_objectNoSuchMethod, "_objectNoSuchMethod") \
@ -490,7 +493,9 @@ class ObjectPointerVisitor;
V(current_position, ":current_position") \
V(dynamic_assert_assignable_stc_check, \
":dynamic_assert_assignable_stc_check") \
V(end, "end") \
V(executable, "executable") \
V(from, "from") \
V(get, "get") \
V(index_temp, ":index_temp") \
V(isPaused, "isPaused") \
@ -506,8 +511,10 @@ class ObjectPointerVisitor;
V(relative, "relative") \
V(result, "result") \
V(set, "set") \
V(skip_count, "skipCount") \
V(stack, ":stack") \
V(stack_pointer, ":stack_pointer") \
V(start, "start") \
V(start_index_param, ":start_index_param") \
V(state, "state") \
V(string_param, ":string_param") \


@ -102,19 +102,83 @@ abstract final class _TypedListBase {
throw new UnsupportedError("Cannot remove from a fixed-length list");
}
@pragma("vm:prefer-inline")
void setRange(int start, int end, Iterable from, [int skipCount = 0]) =>
(from is _TypedListBase &&
(from as _TypedListBase).elementSizeInBytes == elementSizeInBytes)
? _fastSetRange(start, end, from as _TypedListBase, skipCount)
: _slowSetRange(start, end, from, skipCount);
// Method(s) implementing Object interface.
String toString() => ListBase.listToString(this as List);
// Internal utility methods.
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount);
void _slowSetRange(int start, int end, Iterable from, int skipCount);
// Returns true if operation succeeds.
// 'fromCid' and 'toCid' may be cid-s of the views and therefore may not
// match the cids of 'this' and 'from'.
// Uses toCid and fromCid to decide if clamping is necessary.
// Element size of toCid and fromCid must match (test at caller).
@pragma("vm:prefer-inline")
bool get _containsUnsignedBytes => false;
// Performs a copy of the elements starting at [skipCount] in [from] to
// [this] starting at [start] (inclusive) and ending at [end] (exclusive).
//
// Primarily called by Dart code to handle clamping.
//
// Element sizes of [this] and [from] must match (test at caller).
@pragma("vm:external-name", "TypedDataBase_setRange")
external bool _setRange(int startInBytes, int lengthInBytes,
_TypedListBase from, int startFromInBytes, int toCid, int fromCid);
@pragma("vm:entry-point")
external void _nativeSetRange(
int start, int end, _TypedListBase from, int skipOffset);
// Performs a copy of the elements starting at [skipCount] in [from] to
// [this] starting at [start] (inclusive) and ending at [end] (exclusive).
//
// The element sizes of [this] and [from] must be 1 (test at caller).
@pragma("vm:recognized", "other")
@pragma("vm:prefer-inline")
@pragma("vm:idempotent")
external void _checkBoundsAndMemcpy1(
int start, int end, _TypedListBase from, int skipCount);
// Performs a copy of the elements starting at [skipCount] in [from] to
// [this] starting at [start] (inclusive) and ending at [end] (exclusive).
//
// The element sizes of [this] and [from] must be 2 (test at caller).
@pragma("vm:recognized", "other")
@pragma("vm:prefer-inline")
@pragma("vm:idempotent")
external void _checkBoundsAndMemcpy2(
int start, int end, _TypedListBase from, int skipCount);
// Performs a copy of the elements starting at [skipCount] in [from] to
// [this] starting at [start] (inclusive) and ending at [end] (exclusive).
//
// The element sizes of [this] and [from] must be 4 (test at caller).
@pragma("vm:recognized", "other")
@pragma("vm:prefer-inline")
@pragma("vm:idempotent")
external void _checkBoundsAndMemcpy4(
int start, int end, _TypedListBase from, int skipCount);
// Performs a copy of the elements starting at [skipCount] in [from] to
// [this] starting at [start] (inclusive) and ending at [end] (exclusive).
//
// The element sizes of [this] and [from] must be 8 (test at caller).
@pragma("vm:recognized", "other")
@pragma("vm:prefer-inline")
@pragma("vm:idempotent")
external void _checkBoundsAndMemcpy8(
int start, int end, _TypedListBase from, int skipCount);
// Performs a copy of the elements starting at [skipCount] in [from] to
// [this] starting at [start] (inclusive) and ending at [end] (exclusive).
//
// The element sizes of [this] and [from] must be 16 (test at caller).
@pragma("vm:recognized", "other")
@pragma("vm:prefer-inline")
@pragma("vm:idempotent")
external void _checkBoundsAndMemcpy16(
int start, int end, _TypedListBase from, int skipCount);
}
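
A short usage-level example of the dispatch above, using nothing beyond dart:typed_data: a typed-data source with the same element size is eligible for _fastSetRange, while a source with a different element size or a plain List goes through _slowSetRange.

import 'dart:typed_data';

void main() {
  final dest = Uint16List(8);

  // Same element size (2 bytes): eligible for the fast path.
  final src16 = Uint16List.fromList([1, 2, 3, 4]);
  dest.setRange(0, 4, src16);

  // Different element size (4 bytes) or a plain List<int>: handled by the
  // slow path, element by element.
  final src32 = Uint32List.fromList([5, 6, 7, 8]);
  dest.setRange(4, 8, src32);
  dest.setRange(0, 2, <int>[9, 10]);

  print(dest); // [9, 10, 3, 4, 5, 6, 7, 8]
}
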
mixin _IntListMixin implements List<int> {
@ -397,41 +461,15 @@ mixin _TypedIntListMixin<SpawnedType extends List<int>> on _IntListMixin
implements List<int> {
SpawnedType _createList(int length);
void setRange(int start, int end, Iterable<int> from, [int skipCount = 0]) {
// Check ranges.
if (0 > start || start > end || end > length) {
RangeError.checkValidRange(start, end, length); // Always throws.
assert(false);
}
if (skipCount < 0) {
throw RangeError.range(skipCount, 0, null, "skipCount");
}
final count = end - start;
if ((from.length - skipCount) < count) {
throw IterableElementError.tooFew();
}
void _slowSetRange(int start, int end, Iterable from, int skipCount) {
final count = _checkSetRangeArguments(this, start, end, from, skipCount);
if (count == 0) return;
if (from is _TypedListBase) {
// Note: _TypedListBase is not related to Iterable<int> so there is
// no promotion here.
final fromAsTypedList = from as _TypedListBase;
if (this.elementSizeInBytes == fromAsTypedList.elementSizeInBytes) {
if ((count < 10) && (fromAsTypedList.buffer != this.buffer)) {
Lists.copy(from as List<int>, skipCount, this, start, count);
return;
} else if (this.buffer._data._setRange(
start * elementSizeInBytes + this.offsetInBytes,
count * elementSizeInBytes,
fromAsTypedList.buffer._data,
skipCount * elementSizeInBytes + fromAsTypedList.offsetInBytes,
ClassID.getID(this),
ClassID.getID(from))) {
return;
}
} else if (fromAsTypedList.buffer == this.buffer) {
if ((from as _TypedListBase).buffer == this.buffer) {
// Different element sizes, but same buffer means that we need
// an intermediate structure.
// TODO(srdjan): Optimize to skip copying if the range does not overlap.
@ -754,42 +792,15 @@ mixin _TypedDoubleListMixin<SpawnedType extends List<double>>
on _DoubleListMixin implements List<double> {
SpawnedType _createList(int length);
void setRange(int start, int end, Iterable<double> from,
[int skipCount = 0]) {
// Check ranges.
if (0 > start || start > end || end > length) {
RangeError.checkValidRange(start, end, length); // Always throws.
assert(false);
}
if (skipCount < 0) {
throw RangeError.range(skipCount, 0, null, "skipCount");
}
final count = end - start;
if ((from.length - skipCount) < count) {
throw IterableElementError.tooFew();
}
void _slowSetRange(int start, int end, Iterable from, int skipCount) {
final count = _checkSetRangeArguments(this, start, end, from, skipCount);
if (count == 0) return;
if (from is _TypedListBase) {
// Note: _TypedListBase is not related to Iterable<double> so there is
// no promotion here.
final fromAsTypedList = from as _TypedListBase;
if (this.elementSizeInBytes == fromAsTypedList.elementSizeInBytes) {
if ((count < 10) && (fromAsTypedList.buffer != this.buffer)) {
Lists.copy(from as List<double>, skipCount, this, start, count);
return;
} else if (this.buffer._data._setRange(
start * elementSizeInBytes + this.offsetInBytes,
count * elementSizeInBytes,
fromAsTypedList.buffer._data,
skipCount * elementSizeInBytes + fromAsTypedList.offsetInBytes,
ClassID.getID(this),
ClassID.getID(from))) {
return;
}
} else if (fromAsTypedList.buffer == this.buffer) {
if ((from as _TypedListBase).buffer == this.buffer) {
// Different element sizes, but same buffer means that we need
// an intermediate structure.
// TODO(srdjan): Optimize to skip copying if the range does not overlap.
@ -895,42 +906,15 @@ mixin _Float32x4ListMixin implements List<Float32x4> {
}
}
void setRange(int start, int end, Iterable<Float32x4> from,
[int skipCount = 0]) {
// Check ranges.
if (0 > start || start > end || end > length) {
RangeError.checkValidRange(start, end, length); // Always throws.
assert(false);
}
if (skipCount < 0) {
throw RangeError.range(skipCount, 0, null, "skipCount");
}
final count = end - start;
if ((from.length - skipCount) < count) {
throw IterableElementError.tooFew();
}
void _slowSetRange(int start, int end, Iterable from, int skipCount) {
final count = _checkSetRangeArguments(this, start, end, from, skipCount);
if (count == 0) return;
if (from is _TypedListBase) {
// Note: _TypedListBase is not related to Iterable<Float32x4> so there is
// no promotion here.
final fromAsTypedList = from as _TypedListBase;
if (this.elementSizeInBytes == fromAsTypedList.elementSizeInBytes) {
if ((count < 10) && (fromAsTypedList.buffer != this.buffer)) {
Lists.copy(from as List<Float32x4>, skipCount, this, start, count);
return;
} else if (this.buffer._data._setRange(
start * elementSizeInBytes + this.offsetInBytes,
count * elementSizeInBytes,
fromAsTypedList.buffer._data,
skipCount * elementSizeInBytes + fromAsTypedList.offsetInBytes,
ClassID.getID(this),
ClassID.getID(from))) {
return;
}
} else if (fromAsTypedList.buffer == this.buffer) {
if ((from as _TypedListBase).buffer == this.buffer) {
// Different element sizes, but same buffer means that we need
// an intermediate structure.
// TODO(srdjan): Optimize to skip copying if the range does not overlap.
@ -1253,42 +1237,15 @@ mixin _Int32x4ListMixin implements List<Int32x4> {
}
}
void setRange(int start, int end, Iterable<Int32x4> from,
[int skipCount = 0]) {
// Check ranges.
if (0 > start || start > end || end > length) {
RangeError.checkValidRange(start, end, length); // Always throws.
assert(false);
}
if (skipCount < 0) {
throw RangeError.range(skipCount, 0, null, "skipCount");
}
final count = end - start;
if ((from.length - skipCount) < count) {
throw IterableElementError.tooFew();
}
void _slowSetRange(int start, int end, Iterable from, int skipCount) {
final count = _checkSetRangeArguments(this, start, end, from, skipCount);
if (count == 0) return;
if (from is _TypedListBase) {
// Note: _TypedListBase is not related to Iterable<Int32x4> so there is
// no promotion here.
final fromAsTypedList = from as _TypedListBase;
if (this.elementSizeInBytes == fromAsTypedList.elementSizeInBytes) {
if ((count < 10) && (fromAsTypedList.buffer != this.buffer)) {
Lists.copy(from as List<Int32x4>, skipCount, this, start, count);
return;
} else if (this.buffer._data._setRange(
start * elementSizeInBytes + this.offsetInBytes,
count * elementSizeInBytes,
fromAsTypedList.buffer._data,
skipCount * elementSizeInBytes + fromAsTypedList.offsetInBytes,
ClassID.getID(this),
ClassID.getID(from))) {
return;
}
} else if (fromAsTypedList.buffer == this.buffer) {
if ((from as _TypedListBase).buffer == this.buffer) {
// Different element sizes, but same buffer means that we need
// an intermediate structure.
// TODO(srdjan): Optimize to skip copying if the range does not overlap.
@ -1610,42 +1567,15 @@ mixin _Float64x2ListMixin implements List<Float64x2> {
}
}
void setRange(int start, int end, Iterable<Float64x2> from,
[int skipCount = 0]) {
// Check ranges.
if (0 > start || start > end || end > length) {
RangeError.checkValidRange(start, end, length); // Always throws.
assert(false);
}
if (skipCount < 0) {
throw RangeError.range(skipCount, 0, null, "skipCount");
}
final count = end - start;
if ((from.length - skipCount) < count) {
throw IterableElementError.tooFew();
}
void _slowSetRange(int start, int end, Iterable from, int skipCount) {
final count = _checkSetRangeArguments(this, start, end, from, skipCount);
if (count == 0) return;
if (from is _TypedListBase) {
// Note: _TypedListBase is not related to Iterable<Float64x2> so there is
// no promotion here.
final fromAsTypedList = from as _TypedListBase;
if (this.elementSizeInBytes == fromAsTypedList.elementSizeInBytes) {
if ((count < 10) && (fromAsTypedList.buffer != this.buffer)) {
Lists.copy(from as List<Float64x2>, skipCount, this, start, count);
return;
} else if (this.buffer._data._setRange(
start * elementSizeInBytes + this.offsetInBytes,
count * elementSizeInBytes,
fromAsTypedList.buffer._data,
skipCount * elementSizeInBytes + fromAsTypedList.offsetInBytes,
ClassID.getID(this),
ClassID.getID(from))) {
return;
}
} else if (fromAsTypedList.buffer == this.buffer) {
if ((from as _TypedListBase).buffer == this.buffer) {
// Different element sizes, but same buffer means that we need
// an intermediate structure.
// TODO(srdjan): Optimize to skip copying if the range does not overlap.
@ -2235,6 +2165,10 @@ final class _Int8List extends _TypedList
Int8List _createList(int length) {
return new Int8List(length);
}
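// Same 1-byte element size: delegate to the recognized method that performs
// the bounds check and then copies the bytes, handling possibly overlapping
// source and destination regions.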
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy1(start, end, from, skipCount);
}
@patch
@ -2290,6 +2224,13 @@ final class _Uint8List extends _TypedList
Uint8List _createList(int length) {
return new Uint8List(length);
}
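// A Uint8List only ever holds values in 0..255, so it is safe to use as a
// memcpy source for a clamped destination without clamping each element.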
@pragma("vm:prefer-inline")
bool get _containsUnsignedBytes => true;
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy1(start, end, from, skipCount);
}
@patch
@ -2345,6 +2286,15 @@ final class _Uint8ClampedList extends _TypedList
Uint8ClampedList _createList(int length) {
return new Uint8ClampedList(length);
}
@pragma("vm:prefer-inline")
bool get _containsUnsignedBytes => true;
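// Clamping is only required when the source may hold values outside 0..255.
// If the source is known to contain only unsigned bytes, the copy can use
// the same bounds-check-and-memcpy fast path as Uint8List; otherwise it
// falls back to _nativeSetRange, which clamps each element in native code.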
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
from._containsUnsignedBytes
? _checkBoundsAndMemcpy1(start, end, from, skipCount)
: _nativeSetRange(start, end, from, skipCount);
}
@patch
@ -2391,8 +2341,8 @@ final class _Int16List extends _TypedList
_setIndexedInt16(index, _toInt16(value));
}
void setRange(int start, int end, Iterable<int> iterable,
[int skipCount = 0]) {
@pragma("vm:prefer-inline")
void setRange(int start, int end, Iterable iterable, [int skipCount = 0]) {
if (iterable is CodeUnits) {
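// CodeUnits (the UTF-16 code units of a String) is not a typed-data object,
// so it is special-cased here instead of going through the _fastSetRange /
// _slowSetRange paths used for other sources.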
end = RangeError.checkValidRange(start, end, this.length);
int length = end - start;
@ -2413,6 +2363,10 @@ final class _Int16List extends _TypedList
return new Int16List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy2(start, end, from, skipCount);
int _getIndexedInt16(int index) {
return _getInt16(index * Int16List.bytesPerElement);
}
@ -2466,8 +2420,8 @@ final class _Uint16List extends _TypedList
_setIndexedUint16(index, _toUint16(value));
}
void setRange(int start, int end, Iterable<int> iterable,
[int skipCount = 0]) {
@pragma("vm:prefer-inline")
void setRange(int start, int end, Iterable iterable, [int skipCount = 0]) {
if (iterable is CodeUnits) {
end = RangeError.checkValidRange(start, end, this.length);
int length = end - start;
@ -2488,6 +2442,10 @@ final class _Uint16List extends _TypedList
return new Uint16List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy2(start, end, from, skipCount);
int _getIndexedUint16(int index) {
return _getUint16(index * Uint16List.bytesPerElement);
}
@ -2550,6 +2508,10 @@ final class _Int32List extends _TypedList
return new Int32List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy4(start, end, from, skipCount);
int _getIndexedInt32(int index) {
return _getInt32(index * Int32List.bytesPerElement);
}
@ -2612,6 +2574,10 @@ final class _Uint32List extends _TypedList
return new Uint32List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy4(start, end, from, skipCount);
int _getIndexedUint32(int index) {
return _getUint32(index * Uint32List.bytesPerElement);
}
@ -2674,6 +2640,10 @@ final class _Int64List extends _TypedList
return new Int64List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy8(start, end, from, skipCount);
int _getIndexedInt64(int index) {
return _getInt64(index * Int64List.bytesPerElement);
}
@ -2736,6 +2706,10 @@ final class _Uint64List extends _TypedList
return new Uint64List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy8(start, end, from, skipCount);
int _getIndexedUint64(int index) {
return _getUint64(index * Uint64List.bytesPerElement);
}
@ -2799,6 +2773,10 @@ final class _Float32List extends _TypedList
return new Float32List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy4(start, end, from, skipCount);
double _getIndexedFloat32(int index) {
return _getFloat32(index * Float32List.bytesPerElement);
}
@ -2862,6 +2840,10 @@ final class _Float64List extends _TypedList
return new Float64List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy8(start, end, from, skipCount);
double _getIndexedFloat64(int index) {
return _getFloat64(index * Float64List.bytesPerElement);
}
@ -2924,6 +2906,10 @@ final class _Float32x4List extends _TypedList
return new Float32x4List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy16(start, end, from, skipCount);
Float32x4 _getIndexedFloat32x4(int index) {
return _getFloat32x4(index * Float32x4List.bytesPerElement);
}
@ -2986,6 +2972,10 @@ final class _Int32x4List extends _TypedList
return new Int32x4List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy16(start, end, from, skipCount);
Int32x4 _getIndexedInt32x4(int index) {
return _getInt32x4(index * Int32x4List.bytesPerElement);
}
@ -3048,6 +3038,10 @@ final class _Float64x2List extends _TypedList
return new Float64x2List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy16(start, end, from, skipCount);
Float64x2 _getIndexedFloat64x2(int index) {
return _getFloat64x2(index * Float64x2List.bytesPerElement);
}
@ -3091,6 +3085,10 @@ final class _ExternalInt8Array extends _TypedList
Int8List _createList(int length) {
return new Int8List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy1(start, end, from, skipCount);
}
@pragma("vm:entry-point")
@ -3130,6 +3128,13 @@ final class _ExternalUint8Array extends _TypedList
Uint8List _createList(int length) {
return new Uint8List(length);
}
@pragma("vm:prefer-inline")
bool get _containsUnsignedBytes => true;
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy1(start, end, from, skipCount);
}
@pragma("vm:entry-point")
@ -3169,6 +3174,15 @@ final class _ExternalUint8ClampedArray extends _TypedList
Uint8ClampedList _createList(int length) {
return new Uint8ClampedList(length);
}
@pragma("vm:prefer-inline")
bool get _containsUnsignedBytes => true;
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
from._containsUnsignedBytes
? _checkBoundsAndMemcpy1(start, end, from, skipCount)
: _nativeSetRange(start, end, from, skipCount);
}
@pragma("vm:entry-point")
@ -3206,6 +3220,10 @@ final class _ExternalInt16Array extends _TypedList
return new Int16List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy2(start, end, from, skipCount);
int _getIndexedInt16(int index) {
return _getInt16(index * Int16List.bytesPerElement);
}
@ -3250,6 +3268,10 @@ final class _ExternalUint16Array extends _TypedList
return new Uint16List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy2(start, end, from, skipCount);
int _getIndexedUint16(int index) {
return _getUint16(index * Uint16List.bytesPerElement);
}
@ -3294,6 +3316,10 @@ final class _ExternalInt32Array extends _TypedList
return new Int32List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy4(start, end, from, skipCount);
int _getIndexedInt32(int index) {
return _getInt32(index * Int32List.bytesPerElement);
}
@ -3338,6 +3364,10 @@ final class _ExternalUint32Array extends _TypedList
return new Uint32List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy4(start, end, from, skipCount);
int _getIndexedUint32(int index) {
return _getUint32(index * Uint32List.bytesPerElement);
}
@ -3382,6 +3412,10 @@ final class _ExternalInt64Array extends _TypedList
return new Int64List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy8(start, end, from, skipCount);
int _getIndexedInt64(int index) {
return _getInt64(index * Int64List.bytesPerElement);
}
@ -3426,6 +3460,10 @@ final class _ExternalUint64Array extends _TypedList
return new Uint64List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy8(start, end, from, skipCount);
int _getIndexedUint64(int index) {
return _getUint64(index * Uint64List.bytesPerElement);
}
@ -3470,6 +3508,10 @@ final class _ExternalFloat32Array extends _TypedList
return new Float32List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy4(start, end, from, skipCount);
double _getIndexedFloat32(int index) {
return _getFloat32(index * Float32List.bytesPerElement);
}
@ -3514,6 +3556,10 @@ final class _ExternalFloat64Array extends _TypedList
return new Float64List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy8(start, end, from, skipCount);
double _getIndexedFloat64(int index) {
return _getFloat64(index * Float64List.bytesPerElement);
}
@ -3558,6 +3604,10 @@ final class _ExternalFloat32x4Array extends _TypedList
return new Float32x4List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy16(start, end, from, skipCount);
Float32x4 _getIndexedFloat32x4(int index) {
return _getFloat32x4(index * Float32x4List.bytesPerElement);
}
@ -3602,6 +3652,10 @@ final class _ExternalInt32x4Array extends _TypedList
return new Int32x4List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy16(start, end, from, skipCount);
Int32x4 _getIndexedInt32x4(int index) {
return _getInt32x4(index * Int32x4List.bytesPerElement);
}
@ -3646,6 +3700,10 @@ final class _ExternalFloat64x2Array extends _TypedList
return new Float64x2List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy16(start, end, from, skipCount);
Float64x2 _getIndexedFloat64x2(int index) {
return _getFloat64x2(index * Float64x2List.bytesPerElement);
}
@ -4247,6 +4305,10 @@ final class _Int8ArrayView extends _TypedListView
Int8List _createList(int length) {
return new Int8List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy1(start, end, from, skipCount);
}
@pragma("vm:entry-point")
@ -4291,6 +4353,13 @@ final class _Uint8ArrayView extends _TypedListView
Uint8List _createList(int length) {
return new Uint8List(length);
}
@pragma("vm:prefer-inline")
bool get _containsUnsignedBytes => true;
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy1(start, end, from, skipCount);
}
@pragma("vm:entry-point")
@ -4335,6 +4404,15 @@ final class _Uint8ClampedArrayView extends _TypedListView
Uint8ClampedList _createList(int length) {
return new Uint8ClampedList(length);
}
@pragma("vm:prefer-inline")
bool get _containsUnsignedBytes => true;
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
from._containsUnsignedBytes
? _checkBoundsAndMemcpy1(start, end, from, skipCount)
: _nativeSetRange(start, end, from, skipCount);
}
@pragma("vm:entry-point")
@ -4370,8 +4448,8 @@ final class _Int16ArrayView extends _TypedListView
offsetInBytes + (index * Int16List.bytesPerElement), _toInt16(value));
}
void setRange(int start, int end, Iterable<int> iterable,
[int skipCount = 0]) {
@pragma("vm:prefer-inline")
void setRange(int start, int end, Iterable iterable, [int skipCount = 0]) {
if (iterable is CodeUnits) {
end = RangeError.checkValidRange(start, end, this.length);
int length = end - start;
@ -4392,6 +4470,10 @@ final class _Int16ArrayView extends _TypedListView
Int16List _createList(int length) {
return new Int16List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy2(start, end, from, skipCount);
}
@pragma("vm:entry-point")
@ -4427,8 +4509,8 @@ final class _Uint16ArrayView extends _TypedListView
offsetInBytes + (index * Uint16List.bytesPerElement), _toUint16(value));
}
void setRange(int start, int end, Iterable<int> iterable,
[int skipCount = 0]) {
@pragma("vm:prefer-inline")
void setRange(int start, int end, Iterable iterable, [int skipCount = 0]) {
if (iterable is CodeUnits) {
end = RangeError.checkValidRange(start, end, this.length);
int length = end - start;
@ -4450,6 +4532,10 @@ final class _Uint16ArrayView extends _TypedListView
Uint16List _createList(int length) {
return new Uint16List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy2(start, end, from, skipCount);
}
@pragma("vm:entry-point")
@ -4494,6 +4580,10 @@ final class _Int32ArrayView extends _TypedListView
Int32List _createList(int length) {
return new Int32List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy4(start, end, from, skipCount);
}
@pragma("vm:entry-point")
@ -4538,6 +4628,10 @@ final class _Uint32ArrayView extends _TypedListView
Uint32List _createList(int length) {
return new Uint32List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy4(start, end, from, skipCount);
}
@pragma("vm:entry-point")
@ -4582,6 +4676,10 @@ final class _Int64ArrayView extends _TypedListView
Int64List _createList(int length) {
return new Int64List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy8(start, end, from, skipCount);
}
@pragma("vm:entry-point")
@ -4626,6 +4724,10 @@ final class _Uint64ArrayView extends _TypedListView
Uint64List _createList(int length) {
return new Uint64List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy8(start, end, from, skipCount);
}
@pragma("vm:entry-point")
@ -4670,6 +4772,10 @@ final class _Float32ArrayView extends _TypedListView
Float32List _createList(int length) {
return new Float32List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy4(start, end, from, skipCount);
}
@pragma("vm:entry-point")
@ -4714,6 +4820,10 @@ final class _Float64ArrayView extends _TypedListView
Float64List _createList(int length) {
return new Float64List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy8(start, end, from, skipCount);
}
@pragma("vm:entry-point")
@ -4756,6 +4866,10 @@ final class _Float32x4ArrayView extends _TypedListView
Float32x4List _createList(int length) {
return new Float32x4List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy16(start, end, from, skipCount);
}
@pragma("vm:entry-point")
@ -4798,6 +4912,10 @@ final class _Int32x4ArrayView extends _TypedListView
Int32x4List _createList(int length) {
return new Int32x4List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy16(start, end, from, skipCount);
}
@pragma("vm:entry-point")
@ -4840,6 +4958,10 @@ final class _Float64x2ArrayView extends _TypedListView
Float64x2List _createList(int length) {
return new Float64x2List(length);
}
@pragma("vm:prefer-inline")
void _fastSetRange(int start, int end, _TypedListBase from, int skipCount) =>
_checkBoundsAndMemcpy16(start, end, from, skipCount);
}
@pragma("vm:entry-point")
@ -5230,6 +5352,28 @@ void _offsetAlignmentCheck(int offset, int alignment) {
}
}
// Checks the arguments provided to a setRange call. Throws if [start]..[end]
// is not a valid range for [to], if [skipCount] is negative, or if [from]
// does not have enough elements after skipping [skipCount] to fill the range.
// Returns the number of elements to copy.
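// For example (illustrative values): with to.length == 8, start == 2,
// end == 6, from.length == 5, and skipCount == 2, the requested count is 4
// but only 5 - 2 == 3 source elements remain, so the call throws.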
@pragma("vm:entry-point")
int _checkSetRangeArguments(
Iterable to, int start, int end, Iterable from, int skipCount) {
// Check ranges.
if (0 > start || start > end || end > to.length) {
RangeError.checkValidRange(start, end, to.length); // Always throws.
assert(false);
}
if (skipCount < 0) {
throw RangeError.range(skipCount, 0, null, "skipCount");
}
final count = end - start;
if ((from.length - skipCount) < count) {
throw IterableElementError.tooFew();
}
return count;
}
@patch
abstract class UnmodifiableByteBufferView implements ByteBuffer {
@patch