I think APL has a lot to teach us about parallel (specifically GPU) programming. I've rewatched a few talks by Aaron Hsu and listened to his ArrayCast interview[1].
In researching an upcoming blog post on monoids, I tried implementing the problem from Guy Steele's "Four Solutions to a Trivial Problem" talk[2] in APL (I used GNU APL, since I didn't need anything fancy). This is what I came up with:
+/((⌽⌈\⌽a)⌊⌈\a)-a
That in turn should be fully parallelizable and run fast on the GPU, as it's based on a couple of scan (generalized prefix sum) operations.
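For comparison, here's a rough numpy sketch of the same expression (the function name and the translation are mine, not from the talk):

```python
import numpy as np

def trapped_water(a):
    # ⌈\a : running max from the left
    left = np.maximum.accumulate(a)
    # ⌽⌈\⌽a : reverse, max-scan, reverse — a running max from the right
    right = np.maximum.accumulate(a[::-1])[::-1]
    # ⌊ takes the elementwise minimum, - subtracts the bar heights,
    # and +/ sums the per-position water depths
    return int(np.sum(np.minimum(left, right) - a))

trapped_water(np.array([0, 1, 0, 2, 1, 0, 1, 3, 2, 1, 2, 1]))  # → 6
```

Both scans are embarrassingly parallel on a GPU, which is the point of the exercise.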
There's a considerable amount of DNA from APL in modern compiler-based machine learning frameworks, mostly by way of numpy. Dr. Hsu makes a good case that APL has quite a bit of power which is considerably diluted in later spins such as numpy. I'm especially fond of the power (⍣) operator[3], which represents parallel iteration and I think can often be translated to a per-thread while or for loop in a compute shader. I don't believe numpy has anything quite like that (but feel free to correct me!).
⌈\ is a max-scan: moving to the right, the largest value (max) seen so far is kept and carried on.
Then ⌽ reverses the values in an array, so (⌽⌈\⌽a) does the same thing going from right to left: reverse, max-scan, then reverse the result.
If I remember the talk correctly (and the code works), this answers: if you drew this array as a bar graph and poured water into the top of it, how much water would it hold before it all overflowed? At each position, the highest bar to the left and the highest bar to the right bound the level the water can't get past. In between, the water at each position fills from the top of that bar up to the lower of those two maxima.
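To make that picture concrete, a small numpy trace (the example array is mine):

```python
import numpy as np

a = np.array([2, 5, 1, 4, 3])                 # bar heights
left  = np.maximum.accumulate(a)              # ⌈\a  : max seen from the left
right = np.maximum.accumulate(a[::-1])[::-1]  # ⌽⌈\⌽a : max seen from the right
depth = np.minimum(left, right) - a           # water held above each bar
print(left, right, depth, depth.sum())
# [2 5 5 5 5] [5 5 4 4 3] [0 0 3 0 0] 3
```

Only position 2 (height 1, bounded by the 5 on its left and the 4 on its right) holds any water.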
It's interesting that ⌽ is used for reverse; it's phi, right? As the lesser is to the greater, the greater is to the whole. There's a vague notion of invertibility to the golden ratio, but I can't find the words for it. Maybe: a ratio that works in both directions.
Am I way off base? Maybe there are other meanings to the character I'm unaware of.
The line is an axis of vertical symmetry in a 2-d context (circle); compare with ⊖, which shows an axis of horizontal symmetry. When applied to a matrix, ⌽ reverses rows and ⊖ reverses columns; the elements are mirrored around the pictured axis of symmetry. (In general, when applied to a high-rank array, ⊖ reverses the first axis of its argument, and ⌽ reverses the last.)
There’s also a circle with a diagonal line through it (on phone, no Unicode keyboard) which, consistently with the above axes of symmetry, transposes a matrix; I don’t remember if it transposes the first two or the last two axes on higher-rank arrays.
It reverses the order of the axes. There is also a dyadic variant, which provides a new order for the axes. Duplicating an axis is no problem, but there is a danger of leaving out an axis. So in J, instead, the specified axes are moved to the rear.
Thanks for the reference to Dismal Arithmetic, I found it very interesting. By the way, according to Wikipedia, Dismal Arithmetic is now called Lunar Arithmetic. https://en.m.wikipedia.org/wiki/Lunar_arithmetic
> power (⍣) operator[3], which represents parallel iteration and I think can often be translated to a per-thread while or for loop in a compute shader.
Power is for sequential iteration. You may be thinking of rank (⍤).
I spoke imprecisely. It is sequential iteration applied to each of the elements of an array, in parallel. That maps very well to compute shaders, but is hard to express in numpy because it's a higher-order operator, the right hand side is a function.
OK, thanks for the correction. I think I see where I went wrong: Dr. Hsu's algorithm for traversing up a tree to the root uses this, with = as the right operand so it's a fixpoint, but it just happens that in this case the left operand to the power operator is elementwise. So sometimes this can (at least in theory) be efficiently compiled to a compute shader, in other cases not.
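If I've understood that algorithm right, the tree-to-root fixpoint can be sketched in numpy like this (the parent-vector encoding and the `to_root` name are my assumptions, not Dr. Hsu's actual code):

```python
import numpy as np

def to_root(p):
    # p is a parent vector; roots point at themselves.
    # Repeatedly replace each node's parent with its grandparent
    # ("pointer jumping") until nothing changes — a fixpoint.
    p = np.asarray(p)
    while True:
        q = p[p]
        if np.array_equal(q, p):
            return p
        p = q

to_root([0, 0, 1, 3, 3])  # → [0, 0, 0, 3, 3]
```

Each step is a gather, so it maps nicely to a GPU dispatch; the catch, as you say below, is the whole-array convergence check between steps.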
Even in the case where f has low rank, I do not think it is given that you would want to run multiple invocations of f in parallel in e.g. f⍣≡. Even though f has low rank, ≡ does not, and so gets applied to the entirety of the result. Meaning that you need a synchronization after every iteration; if f is cheap enough, that may not be worthwhile. (And low-rank g in f⍣g is meaningless, as it needs to produce a single boolean.)
(You can do f⍣g⍤n, but then, again, that is rank doing the work of parallelization. And, performance considerations aside, I am no longer so convinced of the aesthetic utility of ⍤ as I once was. Wrt performance, when targeting CPUs, where branches are not so cheap, it might be worthwhile to design an f which can tolerate being called overly many times. Then give y an extraneous trailing axis the same size as a simd lane and do (f⍣g⍤1) y. That permits the whole computation to be vectorised, though it is also somewhat brittle.)
(Note ≡ is not the same as =. 1 1 1 ←→ 1 2 3 = 1 2 3, but 1 ←→ 1 2 3 ≡ 1 2 3. f⍣≡ is an idiom. f⍣= is meaningless except at rank 0, where it behaves the same as f⍣≡.)
Thanks for this explanation, it helps. I got into APL about 40 years ago (using an IBM 2741 terminal connected to a 360 if reconstruction of memory serves), but have only recently picked it up again, and not very deeply. From the kinds of things you're saying, the semantics line up somewhat with GPU-friendly parallelism, but not entirely.
If my (newly refreshed) understanding is correct, you can implement f⍣≡ as a per-thread fixpoint if you know that f is elementwise independent, which it seems like you'd be able to determine at least some of the time by static analysis. But I can also see that would be brittle, and if you fell back to do a dispatch per iteration, performance would tank.
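As a sketch of that idea (my own illustration, not production code): when f is elementwise independent, each element can run its own while loop to a fixpoint, which is exactly the per-thread shape a compute shader wants:

```python
import numpy as np

def fixpoint_each(f, a):
    # Per-"thread" fixpoint: each element iterates f on its own,
    # assuming f is elementwise independent and side-effect free.
    out = np.array(a, copy=True)
    for i, x in np.ndenumerate(out):   # conceptually one GPU thread per element
        while f(x) != x:
            x = f(x)
        out[i] = x
    return out

# Example f: halve even numbers until odd (terminates for positive ints)
fixpoint_each(lambda x: x // 2 if x % 2 == 0 else x, np.array([8, 12, 7]))
# → [1, 3, 7]
```

Each element converges after a different number of iterations, with no synchronization between them; the array-level f⍣≡ semantics would instead re-check the whole array every step.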
> If my (newly refreshed) understanding is correct, you can implement f⍣≡ as a per-thread fixpoint if you know that f is elementwise independent, which it seems like you'd be able to determine at least some of the time by static analysis
Hmm, yeah, that would work. Though you would also need to account for side effects in f.
> semantics line up somewhat with GPU-friendly parallelism, but not entirely
An APL machine was proposed[0]. I say it is the GPUs that are wrong! :)
[1]: https://www.arraycast.com/episodes/episode19-aaron-hsu
[2]: https://www.youtube.com/watch?v=ftcIcn8AmSY
[3]: https://help.dyalog.com/15.0/Content/Language/Primitive%20Op...