FST (Finite-state transducers) Libraries, C++ or java
Asked Answered
A

7

8

I have a problem to solve using FSTs. Basically, I'll make a morphological parser, and in this moment i have to work with large transducers. The performance is The Big issue here.

Recently, i worked in c++ in other projects where the performance matters, but now, i'am considering java, because the java's benefits and because java is getting better.

I studied some comparisons between java and c++, but I cannot decide what language i should use for this specific problem because it depends on lib in use.

I can´t find much information about java's libs, so, my question is: Are there any open source java libs in which the performance is good, like The RWTH FSA Toolkit that i read in an article that is the fastest c++ lib?

Thanks all.

Anteroom answered 3/11, 2010 at 15:5 Comment(0)
P
3

What are the "benefits" of Java, for your purposes? What specific problem does that platform solve that you need? What is the performance constraint you must consider? Were the "comparisons" fair, because Java is actually extremely difficult to benchmark. So is C++, but you can at least get some algorithmic boundary guarantees from STL.

I suggest you look at OpenFst and the AT&T finite-state transducer tools. There are others out there, but I think your worry about Java puts the cart before the horse-- focus on what solves your problem well.

Good luck!

Pollitt answered 1/2, 2011 at 20:53 Comment(0)
A
2

http://jautomata.sourceforge.net/ and http://www.cs.duke.edu/csed/jflap/ are based Java finite state machine libraries, although I don't have experience using them so I cannot comment on the efficiency.

Abramabramo answered 11/8, 2011 at 5:32 Comment(0)
B
2

I'm one of the developers of the morfologik-stemming library. It's pure Java and its performance is very good, both when you build the automaton and when you use it. We use it for morphological analysis in LanguageTool.

Brilliant answered 19/6, 2012 at 8:21 Comment(0)
G
1

Lucene has a excellent implementation of FST, which is easy to use and high performance, making query engines like Elasticsearch, Solr deliver very fast sub-second term based query.Let me take an example:

import com.google.common.base.Preconditions;
import org.apache.lucene.store.ByteArrayDataInput;
import org.apache.lucene.store.DataInput;
import org.apache.lucene.store.GrowableByteArrayDataOutput;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRefBuilder;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

import java.io.IOException;

public class T {

    private final String inputValues[] = {"cat", "dog", "dogs"};
    private final long outputValues[] = {5, 7, 12};

    // https://lucene.apache.org/core/8_4_0/core/org/apache/lucene/util/fst/package-summary.html
    public static void main(String[] args) throws IOException {

        T t = new T();

        FST<Long> fst = t.buildFSTInMemory();
        System.out.println(String.format("memory used for fst is %d bytes", fst.ramBytesUsed()));
        t.searchFST(fst);
        byte[] bytes = t.serialize(fst);
        System.out.println(String.format("length of serialized fst is %d bytes", bytes.length));
        fst = t.deserialize(bytes);
        t.searchFST(fst);
    }

    private FST<Long> buildFSTInMemory() throws IOException {
        // Input values (keys). These must be provided to Builder in Unicode sorted order! Use Collections.sort() to sort inputValues first.

        PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
        Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE1, outputs);
        BytesRef scratchBytes = new BytesRef();
        IntsRefBuilder scratchInts = new IntsRefBuilder();
        for (int i = 0; i < inputValues.length; i++) {
//            scratchBytes.copyChars(inputValues[i]);
            scratchBytes.bytes = inputValues[i].getBytes();
            scratchBytes.offset = 0;
            scratchBytes.length = inputValues[i].length();
            builder.add(Util.toIntsRef(scratchBytes, scratchInts), outputValues[i]);
        }
        FST<Long> fst = builder.finish();

        return fst;
    }

    private FST<Long> deserialize(byte[] bytes) throws IOException {
        DataInput in = new ByteArrayDataInput(bytes);
        PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
        FST<Long> fst = new FST<Long>(in, outputs);
        return fst;
    }

    private byte[] serialize(FST<Long> fst) throws IOException {
        final int capicity = 32;
        GrowableByteArrayDataOutput out = new GrowableByteArrayDataOutput(capicity);
        fst.save(out);

        return out.getBytes();
    }

    private void searchFST(FST<Long> fst) throws IOException {
        for (int i = 0; i < inputValues.length; i++) {
            Long value = Util.get(fst, new BytesRef(inputValues[i]));
            Preconditions.checkState(value == outputValues[i], "fatal error");
        }
    }
}
Geognosy answered 18/7, 2021 at 3:9 Comment(0)
M
0

The problem here is the minimum size of your objects in Java. In C++, without virtual methods and run time type identification, your objects weight exactly their content. And the time your automata take to manipulate memory has a big impact on performance.

I think that should be the main reason for choosing C++ over Java.

Madrigalist answered 5/2, 2012 at 17:52 Comment(0)
P
0

OpenFST is a C++ finite state transducer framework that is really comprehensive. Some people from CMU ported it to Java for use in their natural language processing.

A blog post series describing it.
The code is located on svn.

Update: I ported it to java here

Permeate answered 14/10, 2014 at 4:11 Comment(0)
M
0

Try ginr. It can compile very complex regular expressions and relations. Regular relations that are 1-1 specify single-valued partial word functions (FSTs). See also ribose, which builds FSTs from ginr automata for use in the Java VM.

Mathamathe answered 21/10, 2023 at 0:11 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.