j

Sum of arrays with repeated indices


How can I add an array of numbers to another array by indices? Especially with repeated indices. Like that

   x
1 2 3 4
  idx
0 1 0
   y
5 6 7
   ] x add idx;y NB. (1 + 5 + 7) , (2 + 6) , 3 , 4
13 8 3 4

All nouns (x, idx, y) can be millions of items and I need to fast 'add' verb.

UPDATE

Solution (thanks to Dan Bron):

   cumIdx =: 1 : 0
   :
   'i z' =. y
   n =. ~. i
   x n}~ (n{x) + i u//. z
   )
   (1 2 3 4) + cumIdx (0 1 0);(5 6 7)
13 8 3 4

Solution

  • For now, a short answer in the "get it done" mode:

    data   =.  1 2 3 4
    idx    =.  0 1 0
    updat  =.  5 6 7
    
    cumIdx =:  adverb define
    :
      n =. ~. m
      y n}~ (n{y) + m +//. x
    )
    
      updat idx cumIdx data  NB. 13 8 3 4
    

    In brief:

    1. Start by grouping the update array (in your post, y¹) where your index array has the same value, and taking the sum of each group
      1. Accomplish this using the adverb key (/.) with sum (+/) as its verbal argument, deriving a dyadic verb whose arguments are idx on the left and the update array (your y, my updat) on the right.
    2. Get the nub (~.) of your index array
    3. Select these (unique) indices from your value array (your x, my data)
      1. This will, by definition, have the same length as the cumulative sums we calculated in (1.)
    4. Add these to the cumulative sum
    5. Now you have your final updates to the data; updat and idx have the same length, so you just merge them into your value array using }, as you did in your code

    Since we kept the update array small (never greater than its original length), this should have decent performance on larger inputs, though I haven't run any tests. The only performance drawback is the double computation of the nub of idx (once explicitly with ~. and once implicitly with /.), though since your values are integers, this should be relatively cheap; it's one of J's stronger areas, performance-wise.


    ¹ I realize renaming your arrays makes this answer more verbose than it needs to be. However, since you named your primary data x rather than y (which is the convention), if I had just kept your naming convention, then when I invoked cumIdx, the names of the nouns inside the definition would have the opposite meanings to the ones outside the definition, which I thought would cause greater confusion. For this reason, it's best to keep "primary data" on the right (y), and "control data" on the left (x).

    You might also consider constraining your use of the special names x,y,u,v,m and n to where they're already implicitly defined by invoking an explicit definition; definitely never change their nameclasses.