Is it possible to accelerate (dynamic) LINQ queries using GPU?

I have been searching for several days for solid information on the possibility of accelerating LINQ queries using a GPU.

Technologies I have "investigated" so far:

  • Microsoft Accelerator
  • Cudafy
  • Brahma

In short: would it be possible at all to do in-memory filtering of objects on the GPU?

Let's say we have a list of some objects and we want to filter something like:

var result = myList.Where(x => x.SomeProperty == SomeValue);

Any pointers on this one?

Thanks in advance!

UPDATE

I'll try to be more specific about what I am trying to achieve :)

The goal is to use any technology that is able to filter a list of objects (ranging from ~50,000 to ~2,000,000 items) in the absolutely fastest way possible.

The operations I perform on the data once the filtering is done (Sum, Min, Max etc.) are done using the built-in LINQ methods and are already fast enough for our application, so that's not a problem.

The bottleneck is "simply" the filtering of data.

UPDATE

Just wanted to add that I have tested about 15 databases, including MySQL (checking a possible cluster approach / memcached solution), H2, HSQLDB, VelocityDB (currently investigating further), SQLite, MongoDB etc., and NONE is good enough when it comes to the speed of filtering data and/or returning the actual data (of course, the NoSQL solutions do not offer SQL-style filtering like the SQL ones do, but you get the idea).

Just to summarize what I/we need:

A database that is able to sort data consisting of 200 columns and about 250,000 rows in less than 100 ms.

I currently have a solution with parallelized LINQ which is able (on a specific machine) to spend only nanoseconds on each row when filtering AND processing the result!

So, we need something like sub-nanosecond filtering on each row.

  1. Why does it seem that only in-memory LINQ is able to provide this?
  2. Why would this be impossible?

Some figures from the logfile:

Total tid för 1164 frågor: 2579

This is Swedish and translates:

Total time for 1164 queries: 2579

Where the queries in this case are queries like:

WHERE SomeProperty = SomeValue

And those queries are all run in parallel on 225,639 rows.

So, 225,639 rows are being filtered in memory 1164 times in about 2.5 seconds.

That's 2.5 s / (1164 × 225,639) ≈ 9.5e-9 seconds, i.e. about 9.5 ns per row, BUT that also includes the actual processing of the numbers! We do Count (non-null), total count, Sum, Min, Max, Avg and Median. So, we have 7 operations on these filtered rows.

So, we could say it's actually 7 times faster than the databases we've tried, since we do NOT do any of that aggregation work in those cases!

So, in conclusion: why are the databases so poor at filtering data compared to in-memory LINQ filtering? Has Microsoft really done such a good job that it is impossible to compete with it? :)

It makes sense that in-memory filtering should be faster, but I don't want a mere sense that it is faster. I want to know what is faster, and, if possible, why.
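
For reference, here is a minimal sketch of the kind of parallel in-memory filtering described above. The Row type, the property names and the values are made up for illustration; only the shape of the code matters.

    // Sketch of the workload above: 1164 filter+aggregate "queries" run in
    // parallel over ~225k rows. Row and its properties are hypothetical.
    using System;
    using System.Diagnostics;
    using System.Linq;
    using System.Threading.Tasks;

    class Row { public double SomeProperty; public double Value; }

    class Demo
    {
        static void Main()
        {
            var rows = Enumerable.Range(0, 225639)
                                 .Select(i => new Row { SomeProperty = i % 100, Value = i })
                                 .ToArray();

            var sw = Stopwatch.StartNew();

            Parallel.For(0, 1164, q =>
            {
                // the filtering step (the bottleneck in question)
                var filtered = rows.Where(r => r.SomeProperty == q % 100).ToArray();

                // the aggregations mentioned above
                var count = filtered.Length;
                var sum = filtered.Sum(r => r.Value);
                var min = filtered.Min(r => r.Value);
                var max = filtered.Max(r => r.Value);
                var avg = filtered.Average(r => r.Value);
                // median: sort the filtered values and take the middle element
                var sorted = filtered.Select(r => r.Value).OrderBy(v => v).ToArray();
                var median = sorted[sorted.Length / 2];
            });

            Console.WriteLine($"Total time for 1164 queries: {sw.ElapsedMilliseconds} ms");
        }
    }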

Egalitarian answered 15/2, 2012 at 17:43 Comment(2)
not that I know of, but +1 for the idea. I anticipate the answers. – Lippe
Watch github.com/palladin/LinqOptimizer/tree/gpu – Quadric

I will answer definitively about Brahma since it's my library, but it probably applies to other approaches as well. The GPU has no knowledge of objects. Its memory is also almost completely separate from CPU memory.

If you do have a LARGE set of objects and want to operate on them, your only option is to pack the data you want to operate on into a buffer suitable for the GPU/API you're using and send it off to be processed; a sketch of that packing follows below.
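
For example, a minimal, API-agnostic sketch of that packing (the property name comes from the question; assuming SomeProperty is an int):

    // The GPU can't walk .NET object graphs, so extract just the field(s)
    // the kernel needs into a flat array (struct-of-arrays layout).
    int[] keys = myList.Select(x => x.SomeProperty).ToArray();

    // ...upload 'keys' with whatever API you use (Brahma/CUDA/OpenCL),
    // run the filter kernel there, download the matching indexes, and
    // map them back to the original objects on the CPU.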

Note that this will make two round trips over the CPU-GPU memory interface, so if you aren't doing enough work on the GPU to make it worthwhile, you'll be slower than if you simply used the CPU in the first place (like the sample above).

Hope this helps.

Legendre answered 15/2, 2012 at 17:49 Comment(3)
Thanks! I love your library and have used it for other stuff than what I am asking about above! All I really want to do is filter objects reeeaalllyy fast :) Keep up the good work with Brahma! – Egalitarian
@Ani, where can I find information about Brahma? Any tutorials and/or examples anywhere? – Umbrage
@AlexSanséau Sorry, it's been years since I actively developed this project. You can check out the F# version here (github.com/gsvgit/Brahma.FSharp) – Legendre

The GPU is really not intended for general-purpose computing of this kind, especially with object-oriented designs like this, and filtering an arbitrary collection of data this way would really not be an appropriate use of it.

GPU computations are great for cases where you are performing the same operation on a large dataset, which is why things like matrix operations and transforms can work very nicely. There, the data copying can be outweighed by the incredibly fast computational capabilities of the GPU.

In this case, you'd have to copy all of the data into the GPU to make this work, and restructure it into some form the GPU will understand, which would likely be more expensive than just performing the filter in software in the first place.

Instead, I would recommend looking at using PLINQ to speed up queries of this nature. Provided your filter is thread-safe (which it would have to be for any GPU-related work anyway...), this is likely a better option for general-purpose query optimization, as it won't require copying your data around. PLINQ would work by rewriting your query as:

var result = myList.AsParallel().Where(x => x.SomeProperty == SomeValue);

If the predicate is an expensive operation, or the collection is very large (and easily partitionable), this can make a significant improvement to the overall performance when compared to standard LINQ to Objects.
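
If you go down this route, it can also help to cap the degree of parallelism and materialize the result once, since re-enumerating a ParallelQuery re-runs the whole query. A minimal sketch using standard PLINQ operators:

    using System;
    using System.Linq;

    var result = myList.AsParallel()
                       .WithDegreeOfParallelism(Environment.ProcessorCount) // cap worker threads
                       .Where(x => x.SomeProperty == SomeValue)
                       .ToList(); // materialize once; enumerating again would re-run the query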

Selfexamination answered 15/2, 2012 at 17:51 Comment(14)
Thanks for answering! However, this is the approach we are using today. It's almost enough, but soon we will have a lot more rows and, well... not even the 16-core machine we have atm is enough anymore :( – Egalitarian
@Johan At some point, using a database, and techniques which push the filtering to the query side (like EF), will make this type of operation FAR faster with large datasets, especially when the filter is a simple property check... The cost of copying to the GPU would be prohibitively expensive for this type of operation. What are you actually doing? What type of data is this? – Selfexamination
Well, it's a List of objects which come from a table (I'm using Dapper to get the data). The table has many columns (not normalized, for performance reasons), but not that many rows atm (around 250,000). However, all the "databases" I have tried so far are WAY too slow compared to in-memory filtering. Do you know any extremely fast "where-optimized" technique for getting data? :) – Egalitarian
@Johan, even if you have an index on SomeProperty? – Lippe
@Johan If your query is using an indexed column, the DB should be FAR faster than any in-memory query. What database are you using? Are your query columns indexed? – Selfexamination
@ReedCopsey, we have tested: MySQL, H2, HSQLDB, (MongoDB), (RedisCache), eXtremeDB (uses reflection and is therefore slower), SQLite, and some other alternatives. They are maybe fast at filtering, but at some point you have to get the results back. – Egalitarian
@Johan Typically, the fastest way to handle this type of scenario is to do the entire query, including the sum, on the DB. With proper indexing, this should outperform any in-memory filtering by many orders of magnitude... – Selfexamination
@Jodrell, yes, the filtering is maybe fast enough, but not the transfer of the results. Well, I guess I'll just have to try harder :) Some info: today we use a Parallel.For loop. We are expecting about 1000 "concurrent queries" to be done. Would it be better to do a "normal" sequential loop and use the AsParallel extension method instead of Parallel.For without AsParallel? Thanks in advance! – Egalitarian
@Johan It really depends a lot on what your queries are doing. With large datasets, storage is typically the key... but the proper storage depends on how you'll be accessing the data, and what you're doing with it. – Selfexamination
@ReedCopsey, thanks for your answer! OK, I will take a second look at using a database approach. Do you have any recommendations (with respect to high performance)? – Egalitarian
@Johan Depends on the data in question, of course, but MySQL is typically pretty quick for queries, and SQL Server is very tunable... Nearly any "real" DB solution should handle this better than an in-memory approach, for most things, though. – Selfexamination
@ReedCopsey, ok, thanks. I'll look some more into this. However, I'm not very convinced, since I've tried A LOT of "databases" by now, including VelocityDB, BerkeleyDB etc. I would say almost all "known" databases, in fact. They are all too slow at either filtering the rows or returning them for further processing. :( – Egalitarian
@ReedCopsey, please look at my updated question if you have time. – Egalitarian
@Johan I suspect the problem is that you're not configuring the DB correctly. You should be doing the aggregation IN the DB, not pulling the results and filtering in memory, and I suspect it would out-perform this every time... The key is proper indexing, though. – Selfexamination

GpuLinq

GpuLinq's main mission is to democratize GPGPU programming through LINQ. The main idea is that we represent the query as an expression tree and, after various transformations and optimizations, compile it into fast OpenCL kernel code. In addition, we provide a very easy-to-use API without the need to mess with the details of the OpenCL API.

https://github.com/nessos/GpuLinq
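
For flavor, usage looks roughly like this (adapted from memory of the project's README; treat the exact names — GpuContext, CreateGpuArray, AsGpuQueryExpr, Run — as assumptions to verify against the repo):

    // Rough, unverified sketch of GpuLinq-style usage.
    using (var context = new GpuContext())              // wraps an OpenCL context
    using (var nums = context.CreateGpuArray(data))     // copy data to GPU memory once
    {
        var query = (from n in nums.AsGpuQueryExpr()    // builds an Expression tree
                     where n % 2 == 0                   // compiled to an OpenCL kernel
                     select n * n).Sum();
        var result = context.Run(query);                // executes on the GPU
    }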

Sideboard answered 3/9, 2014 at 9:17 Comment(0)

select *
from table1  -- contains 100k rows
left join table2 -- contains 1M rows
on table1.id1=table2.id2 -- this would run for ~100G times 
                         -- unless they are cached on sql side
where table1.id between 1 and 100000 -- but this optimizes things (depends)

could be turned into

select id1 from table1 -- 400k bytes if id1 is 32 bit 
-- no need to order

stored in memory

select id2 from table2 -- 4Mbytes if id2 is 32 bit
-- no need to order

stored in memory; both arrays are then sent to the GPU using a kernel (CUDA/OpenCL) like the one below:

__kernel void findJoinIndex(__global const int *id1,   // table1 keys
                            __global const int *id2,   // table2 keys
                            __global int *summary,     // matching table1 index per table2 row
                            const int id1Length)
{
    int i = get_global_id(0);   // to select an id2, we need a thread id
    int selectedID2 = id2[i];
    int summary__ = -1;         // -1 means "no matching id1 found"
    for (int j = 0; j < id1Length; j++)
    {
        int selectedID1 = id1[j];
        summary__ = (selectedID2 == selectedID1 ? j : summary__); // no branching
    }
    summary[i] = summary__;     // the target index for the "on table1.id1=table2.id2" part
}

On the host side, you can make

 select * from table1 --- query3

and

 select * from table2 --- query4

then use the index list from the GPU to join the data:

 // summary[i] holds the table1 index matching table2 row i (-1 = no match)
 Parallel.For(0, query4.Count, i =>
 {
     if (summary[i] >= 0)
         query4[i].LeftJoinData = query3[summary[i]]; // hypothetical property
 });

The GPU code shouldn't take more than about 50 ms on a GPU with constant memory, global broadcast ability and a few thousand cores.

If any trigonometric function is used for filtering, performance drops fast. Also, the left join makes the complexity O(m*n) in the two tables' row counts, so millions versus millions would be much slower. GPU memory bandwidth is important here.

Edit: A single gpu.findIdToJoin(table1, table2, "id1", "id2") operation (a left-join filter) on my HD7870 (1280 cores) and R7-240 (320 cores), with a products table of 64k rows and a categories table of 64k rows, took 48 milliseconds with an unoptimized kernel.

ADO.NET's "nosql"-style LINQ join took more than 2000 ms with only a 44k-row products table and a 4k-row categories table.

Edit-2:

A left join with a string search condition gets 50 to 200 times faster on the GPU when the tables grow to thousands of rows, each having at least hundreds of characters.

Nog answered 24/10, 2015 at 17:58 Comment(0)

The simple answer for your use case is no.

1) There is no solution for that kind of workload even in raw LINQ to Objects, much less in something that would replace your database.

2) Even if you were fine with loading the whole data set at once (this takes time), it would still be much slower: GPUs have high throughput, but access to them is high-latency. If you're looking for "very" fast solutions, GPGPU is often not the answer, as just preparing/sending the workload and getting back the results is slow, and in your case it would probably need to be done in chunks, too.

Muth answered 3/9, 2014 at 9:28 Comment(0)
