Find an efficient way to integrate different language libraries into one project using Python as the "glue"

Asked 7/8, 2011 at 2:35 Answered 8/8, 2011 at 9:40

I am about to get involved in an NLP-related project and I need to use various libraries. Some are in Java, others in C/C++ (for tasks that require more speed), and some are in Python. I was thinking of using Python as the "glue" and creating a wrapper class for every task that relies on a different language. The wrapper class would, for example, execute the Java program and communicate with it using pipes. My questions are:

1) Do you think that would work for CPU-demanding and highly repetitive tasks? Or would the overhead added by the pipe communication be too heavy?
2) Is there any other (preferably simple) architecture that you would suggest?

Avila answered 7/8, 2011 at 2:35 Comment(1)

I think it could be simpler to use Java as the glue and use Jython instead of CPython. – Afar 7/8, 2011 at 2:49

I would simply advise not doing this.

Don't implement stuff in C/C++ "for speed". The performance benefit is unlikely to be as great as you expect, e.g. compared with a Java implementation that follows "best practice" design and performance techniques.

Don't try to glue lots of languages together. You are setting yourself up for portability issues, difficulties in debugging, and reliability issues, e.g. due to C/C++ bugs crashing the JVM. In addition, there are performance overheads in bridging between languages, and there can be unexpected bottlenecks. (For instance, you may find that your C/C++ code has to be run single-threaded due to threading issues, and that you therefore can't get the benefit of Java multi-threading on a typical multi-core system.)

Instead, I advise you to look for libraries that allow you to implement the entire application in one language. If that is not possible, design it so that the different language components are different executables / processes, communicating via some kind of RPC, messaging, or whatever.
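As a rough illustration of the "separate processes talking over RPC" layout, here is a minimal sketch using XML-RPC from the Python standard library. The `tokenize` function and the `start_server` helper are hypothetical stand-ins for a real component, and the server runs in a thread only to keep the example in one file; in a real setup it would be its own executable, possibly written in another language that speaks XML-RPC.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def tokenize(text):
    """Hypothetical component: a plain whitespace tokenizer."""
    return text.split()


def start_server():
    """Start the RPC server on a free local port; return (server, url)."""
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_function(tokenize)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    return server, f"http://{host}:{port}"


if __name__ == "__main__":
    server, url = start_server()
    try:
        with ServerProxy(url) as client:
            # The call looks local but crosses a process-style boundary.
            print(client.tokenize("one process per component"))
    finally:
        server.shutdown()
        server.server_close()
```

The caller only sees a URL and a method name, so the component behind it can be replaced without touching the client.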

Thoreau answered 7/8, 2011 at 3:47 Comment(1)

In an ideal world, one language would solve all problems. But in NLP, lots of problems don't even admit a perfect solution so you have to select the libraries that give the best approximation. – Marven 8/8, 2011 at 9:47

Whether or not you'd have problems communicating over pipes / sockets depends not on how CPU-intensive the tasks are, but on how frequently you'd need to send information between the processes and how much data they need to send. Setting up threads to do your communication will have little processing overhead.
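To make the point about message frequency concrete, here is a small sketch that sends the same data to a child process either one item per message or in batches, and counts the round trips. The worker script and the `run` helper are made up for illustration, and a second Python process stands in for the Java or C/C++ program. Timings will vary by machine, so none are quoted here.

```python
import json
import subprocess
import sys
import time

# Hypothetical worker: each input line is a JSON list of strings; each reply
# is a JSON list holding the whitespace token count of every string.
WORKER = r"""
import json, sys
for line in sys.stdin:
    print(json.dumps([len(s.split()) for s in json.loads(line)]), flush=True)
"""


def run(sentences, batch_size):
    """Send sentences in batches; return (results, number of round trips)."""
    proc = subprocess.Popen(
        [sys.executable, "-c", WORKER],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
    )
    results, trips = [], 0
    try:
        for i in range(0, len(sentences), batch_size):
            proc.stdin.write(json.dumps(sentences[i:i + batch_size]) + "\n")
            proc.stdin.flush()
            results.extend(json.loads(proc.stdout.readline()))
            trips += 1
    finally:
        proc.stdin.close()
        proc.wait()
        proc.stdout.close()
    return results, trips


if __name__ == "__main__":
    data = ["the cat sat on the mat"] * 2000
    for size in (1, 200):
        start = time.perf_counter()
        _, trips = run(data, size)
        print(f"batch={size}: {trips} round trips, "
              f"{time.perf_counter() - start:.3f}s")
```

The work done by the child is identical in both runs; only the number of round trips changes, which is the cost the paragraph above is talking about.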

You can probably automatically wrap the C/C++ code with Python (SWIG, ctypesgen, Boost.Python), so the only glue you'll have to write yourself would then be talking to Java.
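For a sense of what such a wrapper looks like, here is a minimal hand-written sketch using `ctypes` from the standard library, which is the mechanism that ctypesgen generates code for (SWIG and Boost.Python produce compiled extension modules instead). It assumes a POSIX system where the C standard library can be loaded, and calls `strlen` purely as a stand-in for a function from your own C library; `c_strlen` is a hypothetical helper name.

```python
import ctypes
import ctypes.util

# Assumption: a POSIX system where the C standard library can be located.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(text):
    """Return the UTF-8 byte length of text as computed by C's strlen."""
    return libc.strlen(text.encode("utf-8"))


if __name__ == "__main__":
    print(c_strlen("tokenize me"))
```

Declaring `argtypes` and `restype` by hand is the tedious part that the generators automate for larger headers.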

You could also do it the other way -- run the Python code in the JVM with Jython so the Python and Java code are together, then talk to the C/C++ from there.

Foulmouthed answered 7/8, 2011 at 2:48 Comment(0)

You should take a look at Apache UIMA. It is designed exactly for this. From the project website:

The Frameworks run the components, and are available for both Java and C++. The Java Framework supports running both Java and non-Java components (using the C++ framework). The C++ framework, besides supporting annotators written in C/C++, also supports Perl, Python, and TCL annotators.

UIMA can manage pipes and annotators and is built to scale.

Bindman answered 7/8, 2011 at 13:36 Comment(0)

I would look at Jepp or JPype instead of using IPC for this. I would avoid Jython since loading the C/C++ libraries into Java would probably be harder than into CPython.

Leoleod answered 7/8, 2011 at 2:48 Comment(0)

1) Do you think that would work for cpu-demanding and highly repetitive tasks? Or would the overhead added by the pipe-communication be too heavy?

Depends on your task. If this is a typical NLP app where you have a large model loaded in memory and you only communicate relatively small pieces of data (strings in, label sequences/parse trees out), it may work. Pipe communication is hard to get right, though, since there are a lot of buffering and synchronization issues you have to tackle. Python is a very good glue language, but it doesn't solve everything.
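To show where the buffering issues bite, here is a minimal sketch of a pipe-based wrapper class with a one-line-in, one-line-out protocol. `PipedWorker` and the child script are hypothetical; a second Python process stands in for the Java program so the example runs anywhere. The two things that usually go wrong are forgetting to flush on either side and allowing newlines inside a message.

```python
import subprocess
import sys

# Hypothetical stand-in for the external (e.g. Java) tool: reads one line,
# writes one line, and flushes after every reply.
CHILD_SOURCE = r"""
import sys
for line in sys.stdin:
    sys.stdout.write(line.strip().upper() + "\n")
    sys.stdout.flush()
"""


class PipedWorker:
    """Wraps a long-running child process behind a line-based protocol."""

    def __init__(self, argv):
        self.proc = subprocess.Popen(
            argv,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,  # line-buffered on our side of the pipe
        )

    def request(self, text):
        if "\n" in text:
            raise ValueError("one request per line; newlines are not allowed")
        self.proc.stdin.write(text + "\n")
        self.proc.stdin.flush()  # without this the child may wait forever
        return self.proc.stdout.readline().rstrip("\n")

    def close(self):
        self.proc.stdin.close()  # EOF lets the child's read loop finish
        self.proc.wait()
        self.proc.stdout.close()

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()


if __name__ == "__main__":
    with PipedWorker([sys.executable, "-c", CHILD_SOURCE]) as worker:
        print(worker.request("hello pipes"))
```

The child is started once and reused for every request, which matches the "large model loaded in memory" scenario: startup cost is paid a single time.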

2) Is there any other (preferably simple) architecture that you would suggest?

Make your NLP components services and connect to them via REST interfaces. There are off-the-shelf tools that do this, e.g. CLAM. Pyro and SPIRO make communication between Java and Python even more direct and might be easier to use than HTTP/REST (but YMMV).
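A minimal sketch of the service idea, using only the Python standard library rather than CLAM or Pyro: a toy `/tag` endpoint and a client that posts JSON to it. The endpoint name, the `TagHandler` class, the JSON shape, and the NUM/WORD "tagger" are all invented for illustration; the server runs in a thread only so the example fits in one file.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class TagHandler(BaseHTTPRequestHandler):
    """Hypothetical /tag endpoint: labels each token as NUM or WORD."""

    def do_POST(self):
        if self.path != "/tag":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        text = json.loads(self.rfile.read(length))["text"]
        tags = [[tok, "NUM" if tok.isdigit() else "WORD"]
                for tok in text.split()]
        body = json.dumps({"tags": tags}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def start_service():
    """Start the service on a free local port and return the server."""
    server = HTTPServer(("127.0.0.1", 0), TagHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def tag(host, port, text):
    """Client side: POST the text and return the list of [token, tag]."""
    conn = HTTPConnection(host, port)
    try:
        conn.request("POST", "/tag", body=json.dumps({"text": text}),
                     headers={"Content-Type": "application/json"})
        return json.loads(conn.getresponse().read())["tags"]
    finally:
        conn.close()


if __name__ == "__main__":
    server = start_service()
    host, port = server.server_address
    try:
        print(tag(host, port, "call 911 now"))
    finally:
        server.shutdown()
        server.server_close()
```

Because the contract is just HTTP and JSON, the service side could equally be a Java or C++ program, with the Python client unchanged.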

The parts that are written in C/C++ can also be integrated with CPython using Cython. Don't start implementing things in C or C++ because you think they'll be faster, though; you can also implement them in Python first, then see if you can get the desired performance with NumPy and/or Cython.

Marven answered 8/8, 2011 at 9:40 Comment(0)
