Raspberry Pi cluster, neuron networks and brain simulation

Since the RBPI (Raspberry Pi) has very low power consumption and a very low production price, one could build a very big cluster out of them. I'm not sure, but a cluster of 100,000 RBPIs would take little power and little room.

Now, I think it might not be as powerful as existing supercomputers in terms of FLOPS or other sorts of computing measurements, but could it allow better neural network simulation?

I'm not sure if saying "1 CPU = 1 neuron" is a reasonable statement, but it seems valid enough.

So would such a cluster be more efficient for neural network simulation, since it's far more parallel than other classical clusters?
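
To make the "1 CPU = 1 neuron" idea a bit more concrete, here is a minimal sketch of the kind of update loop I imagine each core running (just my assumption of a simple leaky integrate-and-fire model, not any real simulator):

```python
# Minimal sketch of "one neuron per core": a leaky integrate-and-fire update.
# This is an illustrative assumption, not a real brain-simulation kernel.

def step_neuron(v, inputs, dt=0.001, tau=0.02, v_thresh=1.0, v_reset=0.0):
    """Advance one neuron by one time step; return (new_voltage, spiked)."""
    v = v + (-v + sum(inputs)) * (dt / tau)  # leaky integration of the input current
    if v >= v_thresh:                        # threshold crossed: emit a spike and reset
        return v_reset, True
    return v, False

# Each core would run this loop, exchanging spike events with the cores
# hosting its neighbouring neurons between time steps.
v = 0.0
for t in range(1000):
    v, spiked = step_neuron(v, inputs=[0.05])  # constant drive, just for illustration
```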

Tomika answered 14/9, 2011 at 18:11 Comment(1)
People interested in the Raspberry Pi might want to support the StackExchange proposal for an R-Pi site. area51.stackexchange.com/proposals/37041/…Berardo

Using Raspberry Pi itself doesn't solve the whole problem of building a massively parallel supercomputer: how to connect all your compute cores together efficiently is a really big problem, which is why supercomputers are specially designed, not just made of commodity parts. That said, research units are really beginning to look at ARM cores as a power-efficient way to bring compute power to bear on exactly this problem: for example, this project that aims to simulate the human brain with a million ARM cores.

http://www.zdnet.co.uk/news/emerging-tech/2011/07/08/million-core-arm-machine-aims-to-simulate-brain-40093356/ "Million-core ARM machine aims to simulate brain"

http://www.eetimes.com/electronics-news/4217840/Million-ARM-cores-brain-simulator "A million ARM cores to host brain simulator"

It's very specialist, bespoke hardware, but conceptually it's not far from the network of Raspberry Pis you suggest. Don't forget that ARM cores have all the features that Territorial mentioned the Xeon has (Advanced SIMD instead of SSE, 64-bit calculations, instruction overlap, etc.), but sit at a very different MIPS-per-watt sweet spot; and you have different options for what features are included (if you don't want floating point, just buy a chip without floating point). So I can see why it's an appealing option, especially when you consider that power use is the biggest ongoing cost for a supercomputer.

Exurbia answered 28/9, 2011 at 19:29 Comment(1)
I still wonder how the transistor count grows with the word width; when you go from 8 bits to 32 bits, does the transistor count grow linearly? If not, then maybe 8-bit cores might cut power consumption.Tomika

Seems unlikely to be a good/cheap system to me. Consider a modern Xeon CPU. It has 8 cores running at 5 times the clock speed, so on that basis alone it can do 40 times as much work. Plus it has SSE, which seems suited to this application and will let it calculate 4 things in parallel, so we're up to maybe 160 times as much work. Then it has multithreading, can do 64-bit calculations, can overlap instructions, etc. I would guess it would be at least 200 times faster for this kind of work.
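
As a back-of-the-envelope check of those factors (they are rough estimates, not benchmarks):

```python
# Back-of-the-envelope check of the factors above (estimates, not measurements;
# the final 1.25 is just the "other features" fudge that rounds up to ~200x).
cores       = 8    # Xeon cores vs. one core on the Pi
clock_ratio = 5    # roughly 3.5 GHz vs. 700 MHz
simd_lanes  = 4    # SSE: 4 single-precision operations per instruction

speedup = cores * clock_ratio * simd_lanes   # 8 * 5 * 4 = 160
print(speedup, speedup * 1.25)               # 160 200.0
```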

Then, finally, the results of at least 200 local "neurons" would sit in local memory, but on the Raspberry Pi network you'd have to communicate between 200 separate machines... which would be very much slower.

I think the Raspberry Pi is great and I certainly plan to get at least one :P But you're not going to build a cheap and fast network of them that will compete with a network of "real" computers :P

Anyway, the fastest hardware for this kind of thing is likely to be a GPU, as it's designed to run many copies of a small program in parallel. Or just program an FPGA with a few hundred copies of a "hardware" neuron.
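
To illustrate the "many copies of a small program" point, here is a minimal NumPy sketch (a CPU-side illustration of the same data-parallel pattern, not an actual GPU kernel or FPGA design):

```python
import numpy as np

# Data-parallel sketch of "many copies of a small program": update N leaky
# integrate-and-fire neurons in lock-step. On a GPU the same element-wise
# maths would be spread across thousands of hardware threads.
N = 10_000
v = np.zeros(N)                        # membrane potentials
drive = np.random.rand(N) * 0.1        # illustrative per-neuron input
dt, tau = 0.001, 0.02
v_thresh, v_reset = 1.0, 0.0

for step in range(1000):
    v += (-v + drive) * (dt / tau)     # leaky integration, all neurons at once
    fired = v >= v_thresh              # boolean mask of neurons that spiked
    v[fired] = v_reset                 # reset just those neurons
```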

Territorial answered 14/9, 2011 at 20:25 Comment(9)
1. How much does a Xeon cost when it's running in a supercomputer, versus a mass-produced Raspberry Pi? 2. How much power does a Xeon individually consume? 3. If you use several multicore CPUs, you have a heterogeneous system where neurons don't have the same latency depending on whether two neurons are on the same Xeon or on different ones. 4. Clock speed has nothing to do with parallelization; memory access does. I want to center the discussion on massive parallelism, which can attack problems differently, because some algorithms scale at a different order of magnitude that clock speed alone cannot make up for.Tomika
5. SSE is awesome, yes, but compilers don't use it automatically, so that's an added cost.Tomika
Well yeah, but my point was just that you'd probably get more MIPS per $ from a network of x64 CPUs than this way...Territorial
I don't think brain power can be measured with MIPS or FLOPS; computer power and brain power are different things. I'm sure brains get their power because they have lots and lots of neurons. When working with a computer, speed matters less than how many things you can do at the same time: parallel computation is not time-dependent, and I think that's why our brains can do so many things.Tomika
However the number of neurons you can simulate does vary directly with the computer power.Territorial
Yes, but how do you define computer power? Does it relate to clock speed, or to concurrent execution? That's why we have GPUs and CPUs. Anyway, I'm sure nobody will want to fund a supercomputer project with millions and millions of 8-bit cores if it can't advertise "we are using Athlon" or "we are using Xeon".Tomika
@Tomika If you can simulate 4 neurons serially on a fast CPU in the same time as it takes for 4 slower CPUs to simulate one neuron each in parallel, then they are equivalent.Ceballos
32-bit CPUs can become expensive; we use 32-bit CPUs because we need floating-point precision. I don't think 8-bit cores scale at the same price; I mean, four 8-bit cores might cost much less than one 32-bit core. On top of that, SIMD can be a programming constraint: you have to think about the people coding the thing being able to do it with ease and still get good performance. I think it's more an electronics discussion than a computer science one.Tomika
Yeah just use an FPGA and simulate 1000s in parallel.Territorial

GPUs and FPGAs do this kind of thing much better than a CPU. The Nvidia GPUs that support CUDA programming have, in effect, hundreds of separate processing units. Or at least they can use the evolution of the pixel pipelines (where the card could render multiple pixels in parallel) to produce huge increases in speed. A CPU gives you a few cores that can carry out relatively complex steps; a GPU gives you hundreds of threads that can carry out simple steps.

So for tasks where you have simple threads, something like a single GPU will outperform a cluster of beefy CPUs (or a stack of Raspberry Pis).

However, for a cluster running something like Condor (http://research.cs.wisc.edu/condor/), which can be used for things like disease-outbreak modelling, where you run the same mathematical model millions of times with variable starting points (size of outbreak, wind direction, how infectious the disease is, etc.), something like the Pi would be ideal, as you generally just want a full-blown CPU that can run standard code.
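
As a minimal illustration of that work-package model (a local stand-in using Python's multiprocessing, not Condor's actual submission mechanism; the outbreak model itself is a made-up toy):

```python
from multiprocessing import Pool
import random

# Toy stand-in for an outbreak model: every run is independent, so each one
# can be farmed out to any node (a Pi, a desktop, a Condor slot...).
def run_model(params):
    size, infectiousness, seed = params
    rng = random.Random(seed)
    infected = float(size)
    for day in range(100):                 # crude, purely illustrative dynamics
        infected += rng.random() * infectiousness * infected * 0.01
    return params, infected

if __name__ == "__main__":
    jobs = [(10, r, seed) for r in (0.5, 1.0, 2.0) for seed in range(100)]
    with Pool() as pool:                   # Condor would spread these over machines
        results = pool.map(run_model, jobs)
    print(len(results), "runs completed")
```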

Some well-known uses of this approach are SETI@home and Folding@home (searching for aliens, and cancer research).

A lot of universities have a cluster like this, so I can see some of them trying the approach with multiple Raspberry Pis.

But for simulating neurons in a brain you need very low latency between the nodes. There are special OSes and applications that make multiple systems act as one, and you also need a special network to link it all together and keep the latency between nodes well under a millisecond.

http://en.wikipedia.org/wiki/InfiniBand

The Raspberry Pi just will not manage this in any way.

So yes, I think people will make clusters out of them, and I think they will be very pleased, but mostly universities and small organisations. They aren't going to compete with the top supercomputers.

That said, we are going to get a few and test them against the current nodes in our cluster, to see how they compare with a desktop that has a dual-core 3.2 GHz CPU and cost £650! I reckon for that we could get 25 Raspberry Pis, and they will use much less power, so it will be interesting to compare. This will be for disease-outbreak modelling.

Abamp answered 1/3, 2012 at 21:54 Comment(3)
I wonder, what is the best latency you can get out of an Ethernet or USB connection? About SETI: results are not continuously sent, so maybe some kind of compromise can be made in the way you use memory, to avoid spreading data unnecessarily...Tomika
The best latency on an Ethernet link is around 10 ms, but with standard Ethernet you have overhead and error checking that increase this further; things like InfiniBand do away with the error checking and overheads. As I said above, one of the two ways of creating clusters is the SETI-type model, where a package of work is sent to each node, completed, and sent back to a central server, which then sends out the next work package. This works well for distinct work packages where the nodes don't need to interact. (see next comment)Abamp
That model is only suitable for some applications, not for real-time modelling of a single system where all the nodes need to exchange data between their processes. When you consider that a 1 GHz processor carries out a billion operations per second and DDR2 RAM has a 4 or 5 nanosecond access time, a 10 ms network latency is a massive bottleneck: each round trip costs on the order of ten million clock cycles (even if you carefully use memory to buffer the effect). For that model, where single work threads may be distributed over multiple nodes, you need specialised hardware and a specialised network to support it.Abamp

I am undertaking a large amount of neural network research in the area of chaotic time-series prediction (with echo state networks). Although I think using Raspberry Pis in this way would offer little to no benefit over, say, a strong CPU or a GPU, I have been using a Raspberry Pi to manage the distribution of simulation jobs to multiple machines. For the simulations themselves, the processing power of a large core will beat anything possible on the Raspberry Pi; not only that, running multiple Pis in this configuration generates large overheads of waiting for them to sync, data transfer, and so on. Instead, due to the low cost and robustness of the Pi, I have it hosting the source of the network data, as well as mediating the jobs to the agent machines. It can also hard-reset and restart a machine if a simulation fails and takes the machine down with it, allowing for optimal uptime.
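
As a rough sketch of that mediator role (purely my own illustration of the idea, assuming the agent machines poll the Pi over HTTP for work; the endpoint and parameter names are made up, not my actual setup):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from queue import Queue, Empty

# Tiny job mediator a Pi could host: agent machines GET / for the next
# parameter set and POST their results back when a run finishes.
jobs = Queue()
for reservoir_size in (100, 200, 400):          # hypothetical ESN settings
    jobs.put({"reservoir_size": reservoir_size})

class Dispatcher(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            body = json.dumps(jobs.get_nowait()).encode()
        except Empty:
            body = b"{}"                        # empty object: no work left
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        _result = self.rfile.read(length)       # a real mediator would store this
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8000), Dispatcher).serve_forever()
```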

Byte answered 4/7, 2012 at 15:12 Comment(0)

Neural networks are expensive to train but very cheap to run. While I would not recommend using these (even clustered) to iterate over a training set for endless epochs, once you have the weights you can transfer them onto the Pis and just run the trained network there.

Used in this way, one Raspberry Pi should be useful for much more than a single neuron. Given its ratio of memory to CPU, it will likely be memory-bound in its scale. Assuming about 300 MB of free memory to work with (which will vary with the OS, drivers, etc.) and 8-byte double-precision weights, an all-to-all connected network hits an upper limit on the order of 5,000 "neurons" before becoming storage-bound, since the weight count grows as the square of the neuron count. So many other factors can change this, though, that it is like asking: "How long is a piece of string?"
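
For what it's worth, the back-of-the-envelope arithmetic behind that ballpark figure (assuming every neuron connects to every other, which is my reading of the estimate):

```python
# Memory-bound estimate for a fully connected net of double-precision weights.
free_bytes  = 300 * 1024 ** 2          # ~300 MB of usable RAM
bytes_per_w = 8                        # one double-precision weight
max_weights = free_bytes // bytes_per_w
max_neurons = int(max_weights ** 0.5)  # all-to-all: weights grow as neurons squared
print(max_weights, max_neurons)        # ~39 million weights, ~6270 neurons
```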

Cupriferous answered 22/3, 2013 at 1:39 Comment(0)

Some engineers at Southampton University built a Raspberry Pi supercomputer.

Palestine answered 2/4, 2013 at 15:18 Comment(0)

I have ported a spiking network (see http://www.raspberrypi.org/phpBB3/viewtopic.php?f=37&t=57385&e=0 for details) to the Raspberry Pi and it runs about 24 times slower than on my old Pentium-M notebook from 2005 with SSE and prefetch optimizations.

Gasometer answered 7/10, 2013 at 5:26 Comment(0)

It all depends on the type of computing you want to do. If you are doing very numerically intensive algorithms without much data movement between the processor caches and RAM, then a GPU solution is indicated. The middle ground is an Intel PC chip using the SIMD assembly instructions; you can still easily end up limited by the rate at which you can transfer data to and from RAM. For nearly the same cost you can get 50 ARM boards with, say, 4 cores per board and 2 GB of RAM per board. That's 200 cores and 100 GB of RAM, and the aggregate rate at which data can be shuffled between the CPUs and RAM is very high. It could be a good option for neural nets that use large weight vectors. Also, the latest ARM GPUs and the new Nvidia ARM-based chip (used in the slate tablet) have GPU compute as well.

Buchmanism answered 26/10, 2014 at 4:29 Comment(1)
What are those ARM boards you're talking about? And when you say "high data transfer rate", do you mean per board or between boards?Tomika