Tags: scala, inline, inlining, scala-2.13

Completely inlining an AnyVal implicit class constructor without -Yopt-inline-heuristics:everything


I'm experimenting with the incubating Vector API in Java 17 and decided to see whether I can create zero-cost syntactic sugar for it. Here is a small snippet of what I wrote:

import jdk.incubator.vector._

object VectorOps {
  implicit final class FloatVectorOps @inline() (val _this: FloatVector) extends AnyVal {
    @inline def +(that: FloatVector): FloatVector = _this.add(that)
    @inline def apply(i: Int): Float = _this.lane(i)
  }
}

class Test {
  def test(x: Float, y: Float): Float = {
    import VectorOps._
    val SSE = FloatVector.SPECIES_128
    val xv = FloatVector.broadcast(SSE, x)
    val yv = FloatVector.broadcast(SSE, y)
    (xv + yv)(0) // sugar for xv.add(yv).lane(0)
  }
}

I'm using Scala 2.13.5 and Java 17.

The Scala compiler is run with -optimize -opt:inline -opt-warnings:at-inline-failed -Yopt-inline-heuristics:at-inline-annotated -opt:nullness-tracking -opt:box-unbox -opt:copy-propagation -opt:unreachable-code -language:implicitConversions -opt:closure-invocations

The JVM is run with --add-modules jdk.incubator.vector.

However, the Scala compiler compiles the final line of the test method to

    GETSTATIC VectorOps$FloatVectorOps$.MODULE$ : LVectorOps$FloatVectorOps$;
    POP
    GETSTATIC VectorOps$.MODULE$ : LVectorOps$;
    GETSTATIC VectorOps$FloatVectorOps$.MODULE$ : LVectorOps$FloatVectorOps$;
    POP
    GETSTATIC VectorOps$.MODULE$ : LVectorOps$;
    ALOAD 4
    INVOKEVIRTUAL VectorOps$.FloatVectorOps (Ljdk/incubator/vector/FloatVector;)Ljdk/incubator/vector/FloatVector;
    ALOAD 5
    INVOKEVIRTUAL jdk/incubator/vector/FloatVector.add (Ljdk/incubator/vector/Vector;)Ljdk/incubator/vector/FloatVector;
    INVOKEVIRTUAL VectorOps$.FloatVectorOps (Ljdk/incubator/vector/FloatVector;)Ljdk/incubator/vector/FloatVector;
    ICONST_0
    INVOKEVIRTUAL jdk/incubator/vector/FloatVector.lane (I)F
    FRETURN

Those calls to the implicit class constructor (the two INVOKEVIRTUAL VectorOps$.FloatVectorOps instructions) completely throw HotSpot off: it is unable to unbox the vector variables, which kills performance. Note that the implicit class constructor is, bytecode-wise, an identity function, so it is effectively a no-op. All the MODULE$ loads are also unnecessary. But HotSpot does not see that.
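To make the identity-function point concrete, here is a hand-written approximation of what the value class looks like after erasure. It is only a sketch, not compiler output: the names VectorOpsErased, Companion, plusExtension, applyExtension and lastLine are made up for illustration (the real extension methods are named $plus$extension and apply$extension and live in VectorOps$FloatVectorOps$).

```scala
import jdk.incubator.vector._

// Hypothetical stand-in for VectorOps after value-class erasure.
object VectorOpsErased {
  // The implicit view: FloatVectorOps erases to FloatVector,
  // so the method simply returns its argument.
  def FloatVectorOps(self: FloatVector): FloatVector = self

  // Stand-in for the companion object that holds the extension methods.
  object Companion {
    def plusExtension(self: FloatVector, that: FloatVector): FloatVector = self.add(that)
    def applyExtension(self: FloatVector, i: Int): Float = self.lane(i)
  }

  // Shape of the last line of test once + and apply have been inlined
  // but the two calls to the view have not.
  def lastLine(xv: FloatVector, yv: FloatVector): Float =
    FloatVectorOps(FloatVectorOps(xv).add(yv)).lane(0)
}
```

The two remaining FloatVectorOps(...) calls in lastLine correspond to the two INVOKEVIRTUAL VectorOps$.FloatVectorOps instructions in the listing above.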

(Note that the method calls to + and apply were successfully inlined.)

Adding -Yopt-inline-heuristics:everything removes both the constructor calls and MODULE$, and fixes performance, but it's like using a sledgehammer to crack a nut. And like a sledgehammer, it doesn't feel safe.

Of course, writing the entire code in Java style also fixes the performance, but that's not the point.

So my questions:

  1. Can the calls be eliminated without -Yopt-inline-heuristics:everything and without rewriting everything in the original Java syntax?

  2. Scala 3 has some new inlining features. Can this be done in Scala 3 without aggressive optimization options?


Solution

  • I've figured out a solution: macros.

    import scala.language.experimental.macros // required to define macro methods
    import jdk.incubator.vector._

    object VectorOps {
      implicit final class FloatVectorOps @inline() (val _this: FloatVector) extends AnyVal {
        @inline def +(that: FloatVector): FloatVector = macro MacroVectorOps.add
        @inline def apply(i: Int): Float = macro MacroVectorOps.lane
      }
    }

    // Must be compiled before any code that uses VectorOps.
    object MacroVectorOps {
      import scala.reflect.macros.blackbox
      def add(c: blackbox.Context)(that: c.Tree): c.Tree = {
        import c.universe._
        // deconstruct the implicit conversion: c.prefix.tree is FloatVectorOps(in)
        val q"$conv($in)" = c.prefix.tree
        q"$in.add($that)"
      }
      def lane(c: blackbox.Context)(i: c.Tree): c.Tree = {
        import c.universe._
        // same deconstruction, then call lane on the unwrapped vector
        val q"$conv($in)" = c.prefix.tree
        q"$in.lane($i)"
      }
    }
    

    Code that uses this compiles to bytecode as efficient as if I had written it against the original Java API; there are no traces of the FloatVectorOps class or the VectorOps object in the bytecode at all.
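
One practical caveat with the macro approach: Scala 2 macro implementations have to be compiled before the code that expands them, and they need scala-reflect on the classpath. A minimal sbt sketch, assuming a two-module layout (the module names macros and core are hypothetical, not something from my actual build):

```scala
// build.sbt (sketch; module names are assumptions)

// Holds VectorOps and MacroVectorOps.
lazy val macros = project
  .settings(
    scalaVersion := "2.13.5",
    // provides scala.reflect.macros.blackbox
    libraryDependencies += "org.scala-lang" % "scala-reflect" % scalaVersion.value
  )

// Holds the code that calls xv + yv, e.g. the Test class.
lazy val core = project
  .dependsOn(macros)
  .settings(scalaVersion := "2.13.5")
```

With this layout sbt compiles macros first, so the expansions in core see already-compiled macro implementations.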