Are there functional programming languages that run on the GPU?

Using the traditional, sequential reduction approach, the following graph is reduced as:

(+ (+ 1 2) (+ 3 4)) ->
(+ 3 (+ 3 4)) ->
(+ 3 7) ->
10

Graph reductions are, though, inherently parallel. One could, instead, reduce it as:

(+ (+ 1 2) (+ 3 4)) ->
(+ 3 7) ->
10
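
To make the difference concrete, here is a minimal Haskell sketch (the names Expr, reduceSeq and reducePar are purely illustrative). The parallel version uses par/pseq from the parallel package, so it only sparks work across CPU cores, but it shows how the two independent subtrees can be reduced at the same time:

import Control.Parallel (par, pseq)

-- The expression graph from above.
data Expr = Lit Int | Add Expr Expr

example :: Expr
example = Add (Add (Lit 1) (Lit 2)) (Add (Lit 3) (Lit 4))

-- Sequential reduction: the left subtree is fully reduced before the right one.
reduceSeq :: Expr -> Int
reduceSeq (Lit n)   = n
reduceSeq (Add a b) = reduceSeq a + reduceSeq b

-- Parallel reduction: independent subtrees may be reduced simultaneously.
reducePar :: Expr -> Int
reducePar (Lit n)   = n
reducePar (Add a b) = x `par` (y `pseq` (x + y))
  where
    x = reducePar a
    y = reducePar b

main :: IO ()
main = print (reduceSeq example, reducePar example)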

As far as I know, every functional programming language uses the first approach. I believe this is mostly because, on the CPU, the cost of scheduling threads outweighs the benefits of doing parallel reductions. Recently, though, we've started using the GPU more than the CPU for parallel applications. If a language ran entirely on the GPU, those communication costs would vanish.

Are there functional languages making use of that idea?

Toucan answered 5/2, 2014 at 16:55 Comment(0)

What makes you think that, on the GPU, scheduling would not outweigh the benefits?

In fact, the kind of parallelism used in GPUs is far harder to schedule: it's SIMD parallelism, i.e. a whole batch of stream processors all do essentially the same thing at a time, except each one crunches a different bunch of numbers. So not only would you need to schedule the subtasks, you would also need to keep them synchronised. Doing that automatically for general computations is virtually impossible.

Doing it for specific tasks works out quite well and has been embedded into functional languages; check out the Accelerate project.
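
For a taste of what that looks like, below is the dot-product example from Accelerate's documentation. The computation is expressed as whole-array operations (zipWith and fold), which the backend can fuse into a single data-parallel kernel; it is run here with the reference interpreter, but a GPU backend (accelerate-cuda at the time, accelerate-llvm-ptx nowadays) exposes the same run function:

import Data.Array.Accelerate             as A
import Data.Array.Accelerate.Interpreter (run)  -- swap for a GPU backend to run on the device

-- A dot product over whole arrays; no explicit loops or threads.
dotp :: A.Acc (A.Vector Float) -> A.Acc (A.Vector Float) -> A.Acc (A.Scalar Float)
dotp xs ys = A.fold (+) 0 (A.zipWith (*) xs ys)

main :: IO ()
main = do
  let xs = A.fromList (Z :. 4) [1, 2, 3, 4] :: A.Vector Float
      ys = A.fromList (Z :. 4) [1, 1, 1, 1] :: A.Vector Float
  print (run (dotp (A.use xs) (A.use ys)))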

Hittite answered 5/2, 2014 at 17:14 Comment(11)
GPU scheduling would not be a problem if you only used the GPU, i.e., loaded your program onto it and left it running there for hours, only coming back later to check the results. And please define "general computations". What is stopping one from implementing an interpreter for the lambda calculus, for example, on the GPU? It is universal/general, and there is nothing stopping it from being efficient (just use actual numbers and lists instead of Church numerals).Toucan
Ah, see, there the trouble starts already: you can't define lists on the GPU the way Lisp or Haskell does; that requires plenty of heap allocation, pointer indirection, and garbage collection. On a GPU you need tight arrays, aligned in a very specific way, to get any parallelism at all. Perhaps you could write a complete lambda-calculus interpreter on the GPU... single-threaded, which would be far slower than on a CPU.Hittite
That is not true at all. Simple proof: you can implement the lambda calculus using Rule 110, which doesn't require any of that. (Note I'm not saying this is a good implementation, but it invalidates your claim that an implementation needs pointers, allocation and garbage collection and thus must be slow -- i.e., there could be a good implementation that is fast.)Toucan
I looked through the sources of Accelerate. It features some nice ideas (stuffing code-generated expressions into optimized CUDA templates may be rather effective). But I think it is still no more than a proof of concept. It wraps you in a very tight envelope that you can hardly push against: I attempted to make some patches, but realized that any little change affects almost every data structure and function.Barretter
@Viclib: I didn't say an implementation of lambda calculus needs pointers etc., I said Lisp-style lists do. What you definitely need is some way to approximate Turing-completeness. That requires each processor to have access to more than a tiny fraction of the memory, but on a GPU there's just a tiny bit of memory for each processor, while the global memory can only be accessed with decent performance in batch mode: you need to prepare a whole region of memory in accordance with the processor layout, and send it to the whole batch in one operation.Hittite
And I'm saying I don't think that accessing more than a tiny fraction of the memory per node is necessary for a useful, fast Turing-complete language implementation. Graph reductions are pretty self-contained, and lists can be implemented as heaps (binary trees), for example, which have all the advantages of arrays (no pointers at all, constant-time access, etc.) but encode trees (a sketch of this layout follows these comments)...Toucan
A heap-based implementation may not need explicit pointers, but it uses lots of offsets, which are basically locally-scoped pointers. – Graph reductions aren't exactly self-contained if you want them to spawn new threads... which isn't actually possible on a GPU at all; instead, you always need to predefine the code that all processors will run, retrieve all the results, and then you can start to think about the next parallel operation (in fact, you had better do that before retrieving the previous results).Hittite
And what is the problem with that approach? You do predefine the code that all processors will run, some of them will perform some reductions, others won't, and at each step you get closer to normal form, in far fewer steps than the sequential approach...Toucan
The problem is that most of the actual work, namely deciding which processor is to do which part of a reduction step, needs to be done beforehand, sequentially.Hittite
@Viclib Sorry for the very late comment, but in case anyone else is reading this and wondering about the same things as you were, the primary problem is that in a GPU, in order to have any efficiency at all, every thread within a warp needs to be following as close to the exact same code path as possible. While individual threads can take independent paths through the code, all parallelism will be lost when that happens. CUDA cores are not like CPU cores where different threads can be following completely different code paths.Hendrika
@Hendrika I understand. The part I don't understand is why that is a problem. The lambda calculus has only one rule (beta reduction), so a billion threads could be executing the very same instructions without any issue.Toucan
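
To make the pointer-free layout from this comment thread concrete, here is a minimal Haskell sketch of a complete binary tree stored implicitly in a flat array (the names are illustrative, and this runs on the CPU; it only demonstrates the indexing scheme, not a GPU implementation):

import qualified Data.Vector.Unboxed as V

-- The children of the node at index i live at 2*i + 1 and 2*i + 2,
-- so navigation is pure index arithmetic with no stored pointers.
leftChild, rightChild :: Int -> Int
leftChild  i = 2 * i + 1
rightChild i = 2 * i + 2

-- Sum the subtree rooted at index i, as a stand-in for one reduction step.
sumTree :: V.Vector Int -> Int -> Int
sumTree xs i
  | i >= V.length xs = 0
  | otherwise        = xs V.! i + sumTree xs (leftChild i) + sumTree xs (rightChild i)

main :: IO ()
main = print (sumTree (V.fromList [1 .. 7]) 0)  -- prints 28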

on the CPU, the cost of scheduling threads outweighs the benefits of doing parallel reductions

Thread scheduling is very effective in modern OSes. Thread initialization and termination may be a matter of concern, but there are plenty of techniques to eliminate those costs.

Graph reductions are, though, inherently parallel

As mentioned in the other answer, GPUs are very special devices. One can't simply take an arbitrary algorithm and make it 100 times faster just by rewriting it in CUDA. Speaking of CUDA, it is not exactly SIMD (Single Instruction, Multiple Data) but SIMT (Single Instruction, Multiple Threads). This is something far more complex, but let's think of it as mere vector processing. As the name suggests, vector processors are designed to handle dense vectors and matrices, i.e. simple linear data structures. So any branching within a warp drives the efficiency of the parallelism, and hence performance, down toward zero. Modern architectures (Fermi+) are capable of processing even some trees, but this is rather tricky and the performance isn't that impressive. So you simply can't accelerate arbitrary graph reduction.
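
As a conceptual illustration of that point (plain Haskell, not GPU code): the first function below is the kind of uniform, element-wise work SIMT hardware is built for, while in the second the amount of work per element depends on the value itself, so lanes in a warp would sit idle while their neighbours keep iterating:

-- Uniform: every element follows exactly the same code path.
scale :: Float -> [Float] -> [Float]
scale a = map (* a)

-- Divergent: the number of iterations depends on the data (Collatz step
-- count), so different elements take different code paths and lock-step
-- execution breaks down.
collatzSteps :: Int -> Int
collatzSteps 1 = 0
collatzSteps n
  | even n    = 1 + collatzSteps (n `div` 2)
  | otherwise = 1 + collatzSteps (3 * n + 1)

main :: IO ()
main = print (scale 2.0 [1, 2, 3], map collatzSteps [1 .. 10])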

As for functional languages for GPGPU, I don't believe they can be taken seriously yet. Most valuable CUDA code lives inside heavily optimized libraries written by PhDs, and it is aimed squarely at performance. The readability, declarativeness, clarity, and even safety of functional languages don't matter there.

Barretter answered 5/2, 2014 at 18:44 Comment(0)

SPOC provides some GPGPU access from OCaml.

Denudate answered 5/2, 2014 at 18:55 Comment(0)

The language Obsidian is a domain-specific language embedded in Haskell which targets GPGPU computations. It's rather more low-level than what you're asking for, but I thought I'd mention it anyway.

Carabin answered 5/2, 2014 at 19:45 Comment(0)

https://github.com/BenjaminTrapani/FunGPU provides a Racket-like functional language that runs entirely on GPUs and other devices that can be targeted with SYCL. The runtime automatically schedules independent sub-trees in a way that efficiently utilizes the GPU (multiple evaluations of the same instructions with different data are evaluated concurrently). It is still in its early stages but could be worth experimenting with. It already outperforms the Racket VM.

Landed answered 4/1, 2021 at 0:45 Comment(0)
