How can I have R utilize more of the processing power on my PC?

My task is to perform clustering on a data set.

The variables have been scaled and centered. I am using the following code to find the optimal number of clusters:

library(cluster)  # pam()
library(fpc)      # pamk() lives in fpc, not cluster

d <- dist(df, method = "euclidean")

pamk.best <- pamk(d)
plot(pam(d, pamk.best$nc))
str(df)
161976 obs. of 11 variables
R version: 3.2.4

RStudio version: 0.99.893

Windows 7

Intel i7

480 GB RAM

I have noticed that the system never uses more than 22% of the CPU's processing power.

I have taken the following actions so far:

  1. Unsuccessfully tried to change the Set Priority and Set Affinity settings for rsession.exe in the Processes tab of the Windows Task Manager. For some reason, the priority always reverts to Low even when I set it to High, Realtime, or anything else on that list. The Set Affinity dialog shows that the system is allowing R to use all of the cores.
  2. Adjusted the High Performance power plan by going into Control Panel -> Power Options -> Change advanced power settings -> Processor Power Management and setting it to 100%.
  3. Read the parallel-processing material in the CRAN Task View for High-Performance Computing. I may be wrong, but I don't think that calculating the distances between observations is a task that should be parallelized, in the sense of dividing the data set into subsets and performing the distance calculations on the subsets in parallel on different cores (see the sketch after this list). Please correct me if I am wrong.
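
For what it's worth, a minimal sketch of parallelizing exactly that step, assuming the parallelDist package is installed (its parDist() is intended as a multi-threaded replacement for dist(); the thread count below is a placeholder):

# Assumption: parallelDist is installed. parDist() returns the same kind of
# "dist" object as dist(), computed on multiple threads.
library(parallelDist)

d <- parDist(as.matrix(df), method = "euclidean", threads = 4)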

One option I have is to perform clustering on a subset of the data set and then predict cluster membership for the rest of the data set. But if I have the processing power and the memory available, why can't I perform the clustering on the whole data set?

Is there a way to have the machine or R use a higher percentage of the available processing power and complete the task more quickly?

EDIT: I think my issue is different from the one described in Multithreading in R, because I am not trying to run different functions in R. Rather, I am running one function on one dataset and would like the machine to use more of the processing power that is available to it.

Culex answered 4/5, 2016 at 17:49 Comment(3)
Possible duplicate of multithreading with R? – Softball
If your computer is less than 10 years old, then you have multiple cores (and, if Intel, probably multiple threads as well). Because R is single-threaded, any single operation can only run on a single thread at a time. While this one core is chugging away on your code, your other cores are sitting around, maybe doing a little cleanup or OS work. My guess from your 22% number is that you have a 4-core CPU, or a 2-core CPU with multi-threading, so the most R can use at one time (without parallel packages) is 25%. Dirk's answer on the link is worth a thorough read and re-read. – Softball
There should be a per-core utilization number in the Windows Task Manager. If 22% is the total utilization across all CPUs, then it's not a surprise. – Age

It is probably using one core only.

There is no automatic way to parallelize computations. So what you need to do is rewrite parts of R (here, probably the dist and pam functions, which are presumably C or Fortran code) to use more than one core.

Or you use a different tool where someone has already done the work. I'm a big fan of ELKI, but it's mostly single-core. I think Julia may be worth a look: it fills a similar niche to R (its syntax is very similar to Matlab's) and it was designed to use multiple cores better. Of course there may also be an R package that parallelizes this; I'd look at the Rcpp-based packages, which are usually very fast.

But the key to fast and scalable clustering is to avoid distance matrices. Consider: a 4-core system yields maybe a 3.5x speedup (often much less, because of Turbo Boost), and an 8-core system yields up to 6.5x better performance. But if you increase the data set size 10x, you need 100x as much memory and computation. This is a race you cannot win, except with clever algorithms.
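
One such clever algorithm ships with the cluster package already: CLARA runs PAM on repeated subsamples and assigns the remaining points to the nearest medoid, so the full n-by-n distance matrix is never built. A minimal sketch (k = 5 is a placeholder, not a recommendation):

library(cluster)

# pam() is applied to repeated random subsamples; the remaining observations
# are assigned to the nearest medoid, avoiding the full distance matrix.
fit <- clara(df, k = 5, metric = "euclidean", samples = 50)
table(fit$clustering)  # resulting cluster sizes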

Inarticulate answered 5/5, 2016 at 7:11 Comment(0)

Here is a quick example of using multiple CPU cores. The task has to be split up much like a for loop, but you cannot access intermediate results for further calculations until the loop has finished executing.

library(doParallel)  # also loads foreach and parallel

# register one worker per logical core
registerDoParallel(cores = detectCores(all.tests = FALSE, logical = TRUE))

This would be a basic example of how you can split a task:

vec <- c(1, 3, 5)
do <- function(n) n^2  # the work to perform on each element

# each iteration is dispatched to a worker; results come back as a list
foreach(i = seq_along(vec)) %dopar% do(vec[i])

If packages are required within your do() function, you can load them in the following way:

# "pkg1" and "pkg2" are placeholders for whatever packages do() needs
foreach(i = seq_along(vec), .packages = c("pkg1", "pkg2")) %dopar% do(vec[i])
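
As a small addition (my own, not part of the original answer; .combine is a standard foreach argument): %dopar% returns a list by default, but .combine collapses the results, for example into a numeric vector:

# .combine = c concatenates the per-iteration results into one vector
res <- foreach(i = seq_along(vec), .combine = c) %dopar% do(vec[i])
res  # 1 9 25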
Sleuth answered 11/11, 2019 at 15:30 Comment(0)
