I think APL has a lot to teach us about parallel (specifically GPU) programming. I've rewatched a few talks by Aaron Hsu and listened to his ArrayCast interview[1].
In researching an upcoming blog post on monoids, I tried implementing the problem from Guy Steele's "Four Solutions to a Trivial Problem" talk[2] in APL (I used GNU APL, since I didn't need anything fancy). This is what I came up with:
+/((⌽⌈\⌽a)⌊⌈\a)-a
That in turn should be fully parallelizable and run fast on the GPU, as it's based on a couple of scan (generalized prefix sum) operations.
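For comparison, here's a rough numpy sketch of the same expression (the function name and the translation are mine, not from the talk):

```python
import numpy as np

def trapped_water(a):
    # ⌈\a : running max from the left
    left = np.maximum.accumulate(a)
    # ⌽⌈\⌽a : reverse, max-scan, reverse — a running max from the right
    right = np.maximum.accumulate(a[::-1])[::-1]
    # ⌊ takes the elementwise minimum, - subtracts the bar heights,
    # and +/ sums the per-position water depths
    return int(np.sum(np.minimum(left, right) - a))

trapped_water(np.array([0, 1, 0, 2, 1, 0, 1, 3, 2, 1, 2, 1]))  # → 6
```

Both scans are embarrassingly parallel on a GPU, which is the point of the exercise.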
There's a considerable amount of DNA from APL in modern compiler-based machine learning frameworks, mostly by way of numpy. Dr. Hsu makes a good case that APL has quite a bit of power which is considerably diluted in later spins such as numpy. I'm especially fond of the power (⍣) operator[3], which represents parallel iteration and I think can often be translated to a per-thread while or for loop in a compute shader. I don't believe numpy has anything quite like that (but feel free to correct me!).
⌈\ is a max-scan: moving to the right, the largest value (max) seen so far is kept and carried on.
Then ⌽ reverses the values in an array, so (⌽⌈\⌽a) does the same thing going from right to left: reverse, max-scan, then reverse the result.
If I remember the talk correctly (and the code works), this answers: if you drew this array as a bar graph and poured water into the top of it, how much water would it hold before it all overflowed? At each position, the highest bar to the left and the highest bar to the right bound the level the water can't get past. In between, the water at each position fills from the top of that bar up to the lower of those two maxima.
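To make that picture concrete, a small numpy trace (the example array is mine):

```python
import numpy as np

a = np.array([2, 5, 1, 4, 3])                 # bar heights
left  = np.maximum.accumulate(a)              # ⌈\a  : max seen from the left
right = np.maximum.accumulate(a[::-1])[::-1]  # ⌽⌈\⌽a : max seen from the right
depth = np.minimum(left, right) - a           # water held above each bar
print(left, right, depth, depth.sum())
# [2 5 5 5 5] [5 5 4 4 3] [0 0 3 0 0] 3
```

Only position 2 (height 1, bounded by the 5 on its left and the 4 on its right) holds any water.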
It's interesting that ⌽ is used for reverse; it's phi, right? As the lesser is to the greater, the greater is to the whole. There's a vague notion of invertibility to the golden ratio, but I can't find the words for it. Maybe: a ratio that works in both directions.
Am I way off base? Maybe there are other meanings to the character I'm unaware of.
The line is an axis of vertical symmetry in a 2-d context (circle); compare with ⊖, which shows an axis of horizontal symmetry. When applied to a matrix, ⌽ reverses rows and ⊖ reverses columns; the elements are mirrored around the pictured axis of symmetry. (In general, when applied to a high-rank array, ⊖ reverses the first axis of its argument, and ⌽ reverses the last.)
There’s also a circle with a diagonal line through it (on phone, no Unicode keyboard) which, consistently with the above axes of symmetry, transposes a matrix; I don’t remember if it transposes the first two or the last two axes on higher-rank arrays.
It reverses the order of the axes. There is also a dyadic variant, which provides a new order for the axes. Duplicating an axis is no problem, but there is a danger of leaving out an axis. So in J, instead, the specified axes are moved to the rear.
Thanks for the reference to Dismal Arithmetic, I found it very interesting. By the way, according to Wikipedia, Dismal Arithmetic is now called Lunar Arithmetic. https://en.m.wikipedia.org/wiki/Lunar_arithmetic
> power (⍣) operator[3], which represents parallel iteration and I think can often be translated to a per-thread while or for loop in a compute shader.
Power is for sequential iteration. You may be thinking of rank (⍤).
I spoke imprecisely. It is sequential iteration applied to each of the elements of an array, in parallel. That maps very well to compute shaders, but is hard to express in numpy because it's a higher-order operator, the right hand side is a function.
OK, thanks for the correction. I think I see where I went wrong: Dr. Hsu's algorithm for traversing up a tree to the root uses this, with = as the right operand so it's a fixpoint, but it just happens that in this case the left operand to the power operator is elementwise. So sometimes this can (at least in theory) be efficiently compiled to a compute shader, in other cases not.
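If I've understood that algorithm right, the tree-to-root fixpoint can be sketched in numpy like this (the parent-vector encoding and the `to_root` name are my assumptions, not Dr. Hsu's actual code):

```python
import numpy as np

def to_root(p):
    # p is a parent vector; roots point at themselves.
    # Repeatedly replace each node's parent with its grandparent
    # ("pointer jumping") until nothing changes — a fixpoint.
    p = np.asarray(p)
    while True:
        q = p[p]
        if np.array_equal(q, p):
            return p
        p = q

to_root([0, 0, 1, 3, 3])  # → [0, 0, 0, 3, 3]
```

Each step is a gather, so it maps nicely to a GPU dispatch; the catch, as you say below, is the whole-array convergence check between steps.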
Even in the case where f has low rank, I do not think it is given that you would want to run multiple invocations of f in parallel in e.g. f⍣≡. Even though f has low rank, ≡ does not, and so gets applied to the entirety of the result. Meaning that you need a synchronization after every iteration; if f is cheap enough, that may not be worthwhile. (And low-rank g in f⍣g is meaningless, as it needs to produce a single boolean.)
(You can do f⍣g⍤n, but then, again, that is rank doing the work of parallelization. And, performance considerations aside, I am no longer so convinced of the aesthetic utility of ⍤ as I once was. Wrt performance, when targeting CPUs, where branches are not so cheap, it might be worthwhile to design an f which can tolerate being called overly many times. Then give y an extraneous trailing axis the same size as a simd lane and do (f⍣g⍤1) y. That permits the whole computation to be vectorised, though it is also somewhat brittle.)
(Note ≡ is not the same as =. 1 1 1 ←→ 1 2 3 = 1 2 3, but 1 ←→ 1 2 3 ≡ 1 2 3. f⍣≡ is an idiom. f⍣= is meaningless except at rank 0, where it behaves the same as f⍣≡.)
Thanks for this explanation, it helps. I got into APL about 40 years ago (using an IBM 2741 terminal connected to a 360 if reconstruction of memory serves), but have only recently picked it up again, and not very deeply. From the kinds of things you're saying, the semantics line up somewhat with GPU-friendly parallelism, but not entirely.
If my (newly refreshed) understanding is correct, you can implement f⍣≡ as a per-thread fixpoint if you know that f is elementwise independent, which it seems like you'd be able to determine at least some of the time by static analysis. But I can also see that would be brittle, and if you fell back to do a dispatch per iteration, performance would tank.
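As a sketch of that idea (my own illustration, not production code): when f is elementwise independent, each element can run its own while loop to a fixpoint, which is exactly the per-thread shape a compute shader wants:

```python
import numpy as np

def fixpoint_each(f, a):
    # Per-"thread" fixpoint: each element iterates f on its own,
    # assuming f is elementwise independent and side-effect free.
    out = np.array(a, copy=True)
    for i, x in np.ndenumerate(out):   # conceptually one GPU thread per element
        while f(x) != x:
            x = f(x)
        out[i] = x
    return out

# Example f: halve even numbers until odd (terminates for positive ints)
fixpoint_each(lambda x: x // 2 if x % 2 == 0 else x, np.array([8, 12, 7]))
# → [1, 3, 7]
```

Each element converges after a different number of iterations, with no synchronization between them; the array-level f⍣≡ semantics would instead re-check the whole array every step.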
> If my (newly refreshed) understanding is correct, you can implement f⍣≡ as a per-thread fixpoint if you know that f is elementwise independent, which it seems like you'd be able to determine at least some of the time by static analysis
Hmm, yeah, that would work. Though you would also need to account for side effects in f.
> semantics line up somewhat with GPU-friendly parallelism, but not entirely
An APL machine was proposed[0]. I say it is the GPUs that are wrong! :)
[1]: https://www.arraycast.com/episodes/episode19-aaron-hsu
[2]: https://www.youtube.com/watch?v=ftcIcn8AmSY
[3]: https://help.dyalog.com/15.0/Content/Language/Primitive%20Op...