
I think APL has a lot to teach us about parallel (specifically GPU) programming. I've rewatched a few of Aaron Hsu's talks and listened to him on the ArrayCast podcast[1].

In researching an upcoming blog post on monoids, I tried implementing the problem from Guy Steele's "Four Solutions to a Trivial Problem" talk[2] in APL (I actually used GNU APL for this, because I didn't need anything fancy). This is what I came up with:

    +/((⌽⌈\⌽a)⌊⌈\a)-a
That in turn should be fully parallelizable and run fast on the GPU, as it's based on a couple of scan (generalized prefix sum) operations.
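
For comparison, roughly the same computation in numpy (a quick sketch of my own, not from the talk); np.maximum.accumulate plays the role of ⌈\:

    import numpy as np

    def trapped_water(a):
        a = np.asarray(a)
        left = np.maximum.accumulate(a)                   # ⌈\a
        right = np.maximum.accumulate(a[::-1])[::-1]      # ⌽⌈\⌽a
        return int(np.sum(np.minimum(left, right) - a))   # +/(left⌊right)-a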

There's a considerable amount of DNA from APL in modern compiler-based machine learning frameworks, mostly by way of numpy. Dr. Hsu makes a good case that APL has quite a bit of power which is considerably diluted in later spins such as numpy. I'm especially fond of the power (⍣) operator[3], which represents parallel iteration and I think can often be translated to a per-thread while or for loop in a compute shader. I don't believe numpy has anything quite like that (but feel free to correct me!).

[1]: https://www.arraycast.com/episodes/episode19-aaron-hsu

[2]: https://www.youtube.com/watch?v=ftcIcn8AmSY

[3]: https://help.dyalog.com/15.0/Content/Language/Primitive%20Op...



Following is an attempt to explain +/((⌽⌈\⌽a)⌊⌈\a)-a

The functions ⌊ and ⌈ take the minimum and maximum of two values, and \ is the functional-programming scan operator. Combining them, ⌈\ is a max-scan, which does this:

          twentyRandoms
    71 61 79 72 84 76 76 88 16 53 51 56 64 50 100 33 60 82 96 53

          ⌈\twentyRandoms
    71 71 79 79 84 84 84 88 88 88 88 88 88 88 100 100 100 100 100 100

Moving left to right, the largest value (max) seen so far is kept and carried along.
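
In numpy terms (a rough equivalent, for readers who think in that library), this is np.maximum.accumulate:

    import numpy as np
    a = np.array([71, 61, 79, 72, 84, 76, 76, 88, 16, 53])  # first ten values above
    np.maximum.accumulate(a)  # array([71, 71, 79, 79, 84, 84, 84, 88, 88, 88])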

Then ⌽ reverses the values in an array, so (⌽⌈\⌽a) does the same thing going from right to left: reverse, max-scan, then reverse the result:

          twentyRandoms
    71 61 79 72 84 76 76 88 16 53 51 56 64 50 100 33 60 82 96 53

          (⌽⌈\⌽twentyRandoms)
    100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 96 96 96 96 53

Then the ⌊ in the middle compares the two results one item at a time, taking the minimum of each pair:

          (⌽⌈\⌽twentyRandoms)⌊⌈\twentyRandoms
    71 71 79 79 84 84 84 88 88 88 88 88 88 88 100 96 96 96 96 53
The (...)-a part does item-by-item subtraction: each number in this result array minus the number in the same position in the original array.

          ((⌽⌈\⌽twentyRandoms)⌊⌈\twentyRandoms)-twentyRandoms
    0 10 0 7 0 8 8 0 72 35 37 32 24 38 0 63 36 14 0 0
Then +/ sums them up like a sum() or foldr or reduce:

          +/((⌽⌈\⌽twentyRandoms)⌊⌈\twentyRandoms)-twentyRandoms
    384
Which, if I remember the talk correctly and the code works, answers: if you drew this array as a bar graph and poured water over the top of it, how much water would it hold before the rest overflowed? At each position, the water can rise no higher than the lower of the tallest bar to its left and the tallest bar to its right, and the water depth there is that level minus the height of the bar at that position.


Yes! I believe being able to figure this out qualifies you as ⊢>+⌿÷≢


Above average, haha. Permanent beginner, I think :)


It's interesting that ⌽ is used for reverse; it's phi, right? As the lesser is to the greater, so the greater is to the whole? There's a vague notion of invertibility to the golden ratio, but I can't find the words for it. Maybe: a ratio that works in both directions.

Am I way off base? Maybe there are other meanings to the character I'm unaware of.


The line is an axis of vertical symmetry in a 2-d context (circle); compare with ⊖, which shows an axis of horizontal symmetry. When applied to a matrix, ⌽ reverses rows and ⊖ reverses columns; the elements are mirrored around the pictured axis of symmetry. (In general, when applied to a high-rank array, ⊖ reverses the first axis of its argument, and ⌽ reverses the last.)
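
A rough numpy analogue, for anyone who thinks in that vocabulary: ⌽ corresponds to flipping the last axis and ⊖ to flipping the first:

    import numpy as np
    m = np.arange(6).reshape(2, 3)
    np.flip(m, axis=-1)  # like ⌽m: reverses within each row (last axis)
    np.flip(m, axis=0)   # like ⊖m: reverses the order of the rows (first axis)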


There's also a circle with a diagonal line through it (on phone, no Unicode keyboard) which, consistently with the above axes of symmetry, transposes a matrix; I don't remember whether it transposes the first two or the last two axes on higher-rank arrays.


It reverses the order of the axes. There is also a dyadic variant, which provides a new order for the axes. Duplicating an axis is no problem, but there is a danger of leaving out an axis. So in J, instead, the specified axes are moved to the rear.
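
In numpy terms (my rough analogy, not part of the comment above): the monadic form is like reversing the axis order, the dyadic form is like handing np.transpose an explicit permutation, and the J behaviour is loosely np.moveaxis:

    import numpy as np
    t = np.arange(24).reshape(2, 3, 4)
    t.transpose().shape               # (4, 3, 2): axis order reversed
    np.transpose(t, (1, 0, 2)).shape  # (3, 2, 4): explicit new order for the axes
    np.moveaxis(t, 0, -1).shape       # (3, 4, 2): a specified axis moved to the rear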


(Axes of horizontal and vertical symmetry, respectively, of course; I don't know why I wrote that.)


Oh, that's much less esoteric, thanks for the explanation.


> I'm especially fond of the power (⍣) operator

Oh boy, I had a mind-melting moment with _inverse_ (⍣¯1).

Aaron helped me understand what was going on. He wrote about "Decoding Inverses" here: https://www.sacrideo.us/decoding-inverses/

And that took me from a clunky solution to Dismal Arithmetic (addition):

       {10(⊤⍣¯1)⍵}∘{⌈/⍵}∘{10(⊥⍣¯1)⍵}⊢ 100000 10000 1000 100 10 1
  111111
To this little gem:

       da ← 10⊥(⌈/10⊥⍣¯1⊢)
       da 169 248
  269
(Edit: add code example, fix language)
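
In plain Python, a long-winded but hopefully clarifying sketch of the same digit-wise max (my own translation, not the APL above):

    from functools import reduce

    def lunar_add(x, y):
        # dismal/lunar addition: keep the larger digit in each position, no carrying
        width = max(len(str(x)), len(str(y)))
        xs, ys = str(x).zfill(width), str(y).zfill(width)
        return int(''.join(max(a, b) for a, b in zip(xs, ys)))

    lunar_add(169, 248)                                   # 269
    reduce(lunar_add, [100000, 10000, 1000, 100, 10, 1])  # 111111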


Thanks for the reference to Dismal Arithmetic; I found it very interesting. By the way, according to Wikipedia, Dismal Arithmetic is now called Lunar Arithmetic. https://en.m.wikipedia.org/wiki/Lunar_arithmetic


> power (⍣) operator[3], which represents parallel iteration and I think can often be translated to a per-thread while or for loop in a compute shader.

Power is for sequential iteration. You may be thinking of rank (⍤).
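
For the numpy-minded, a loose analogy of mine (not an exact correspondence): rank is the "apply this function independently to each cell" operator, which is the part that parallelizes naturally:

    import numpy as np
    m = np.arange(12).reshape(3, 4)
    # apply a function to each rank-1 cell (row) independently, roughly what (f⍤1) does
    np.apply_along_axis(np.sum, -1, m)  # array([ 6, 22, 38])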


I spoke imprecisely. It is sequential iteration applied to each of the elements of an array, in parallel. That maps very well to compute shaders, but is hard to express in numpy because it's a higher-order operator: the right-hand operand is a function.


That is incorrect. The function is applied to its argument in its entirety. We can see this by using a function such as ⊂:

         ]boxing on
   Was OFF
         (⊂⍣5) 2 3 4
   ┌─────────────┐
   │┌───────────┐│
   ││┌─────────┐││
   │││┌───────┐│││
   ││││┌─────┐││││
   │││││2 3 4│││││
   ││││└─────┘││││
   │││└───────┘│││
   ││└─────────┘││
   │└───────────┘│
   └─────────────┘
(On GNU APL, it looks like the relevant formulation is ']boxing 3' rather than ']boxing on'; this simply affects the display of nested arrays.)
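
The same point in a Python sketch (my analogy, not from the comment above): iterating a function with a power-style combinator nests the whole argument; it does not map over its elements:

    def power(f, n):
        # apply f to the whole argument, n times in sequence (like f⍣n)
        def go(x):
            for _ in range(n):
                x = f(x)
            return x
        return go

    power(lambda x: [x], 5)([2, 3, 4])
    # [[[[[[2, 3, 4]]]]]]   (the whole list gets enclosed five times)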


OK, thanks for the correction. I think I understand where I went wrong: Dr. Hsu's algorithm for traversing up a tree to the root uses this, with = as the right operand so it's a fixpoint, but it just happens that in this case the left operand to the power operator is piecewise. So sometimes this can (at least in theory) be efficiently compiled to a compute shader, in other cases not.


Even in the case where f has low rank, I do not think it is a given that you would want to run multiple invocations of f in parallel in e.g. f⍣≡. Even though f has low rank, ≡ does not, and so it gets applied to the entirety of the result. Meaning that you need a synchronization after every iteration; if f is cheap enough, that may not be worthwhile. (And low-rank g in f⍣g is meaningless, as it needs to produce a single boolean.)

(You can do f⍣g⍤n, but then, again, that is rank doing the work of parallelization. And, performance considerations aside, I am no longer so convinced of the aesthetic utility of ⍤ as I once was. Wrt performance, when targeting CPUs, where branches are not so cheap, it might be worthwhile to design an f which can tolerate being called overly many times. Then give y an extraneous trailing axis the same size as a simd lane and do (f⍣g⍤1) y. That permits the whole computation to be vectorised, though it is also somewhat brittle.)

(Note ≡ is not the same as =. 1 1 1 ←→ 1 2 3 = 1 2 3, but 1 ←→ 1 2 3 ≡ 1 2 3. f⍣≡ is an idiom. f⍣= is meaningless except at rank 0, where it behaves the same as f⍣≡.)


Thanks for this explanation, it helps. I got into APL about 40 years ago (using an IBM 2741 terminal connected to a 360, if my reconstruction of memory serves), but have only recently picked it up again, and not very deeply. From the kinds of things you're saying, the semantics line up somewhat with GPU-friendly parallelism, but not entirely.

If my (newly refreshed) understanding is correct, you can implement f⍣≡ as a per-thread fixpoint if you know that f is elementwise independent, which it seems like you'd be able to determine at least some of the time by static analysis. But I can also see that would be brittle, and if you fell back to doing a dispatch per iteration, performance would tank.
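
A sketch of what I mean, in numpy rather than a shader (a hypothetical example of mine, assuming f really is elementwise independent; the inner while loop is what would become the per-thread loop):

    import numpy as np

    def per_element_fixpoint(f, x, tol=1e-12):
        # emulate f⍣≡ when f is elementwise independent: each element iterates
        # to its own fixed point, with no synchronization between elements
        x = np.array(x, dtype=float)
        out = np.empty_like(x)
        for i, v in enumerate(x.flat):
            while True:
                nv = f(v)
                if abs(nv - v) <= tol:
                    break
                v = nv
            out.flat[i] = nv
        return out

    # e.g. Newton's iteration for sqrt(2), started from several initial guesses
    per_element_fixpoint(lambda v: 0.5 * (v + 2.0 / v), [1.0, 2.0, 3.0])
    # array([1.41421356, 1.41421356, 1.41421356])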


> If my (newly refreshed) understanding is correct, you can implement f⍣≡ as a per-thread fixpoint if you know that f is elementwise independent, which it seems like you'd be able to determine at least some of the time by static analysis

Hmm, yeah, that would work. Though you would also need to account for side effects in f.

> semantics line up somewhat with GPU-friendly parallelism, but not entirely

An APL machine was proposed[0]. I say it is the GPUs that are wrong! :)

[0]: https://www.slac.stanford.edu/pubs/slacreports/reports07/sla...



