Store metadata into Jackrabbit repository

Can anybody explain how to proceed in the following scenario?

  1. receiving documents (MS docs, ODS, PDF)

  2. Dublin Core metadata extraction via Apache Tika + content extraction via jackrabbit-content-extractors

  3. using Jackrabbit to store documents (content) in the repository together with their metadata

  4. retrieving documents + metadata

I'm interested in points 3 and 4.

DETAILS: The application processes documents interactively. It runs some analysis (language detection, word count, etc.), gathers as many details as possible (Dublin Core metadata, content parsing, event handling), and returns the results to the user. It then stores the extracted content and the metadata (both extracted and custom user metadata) in the JCR repository.

I'd appreciate any help, thank you.

Immoderate answered 1/3, 2011 at 14:22 Comment(3)
Can you give some more context? Can you be more specific as to your question? Where did this list of items come from?Diez
@jzd: I'm not acquainted much with JCR and jackrabbit and I kinda cannot find any reference on how this is handled.Immoderate
The documents are uploaded to my application, on each document upload the document is processed and persistedImmoderate

Uploading files is basically the same for JCR 2.0 as it is for JCR 1.0. However, JCR 2.0 adds a few additional built-in property definitions that are useful.

The "nt:file" node type is intended to represent a file and has two built-in property definitions in JCR 2.0 (both of which are auto-created by the repository when nodes are created):

  • jcr:created (DATE)
  • jcr:createdBy (STRING)

and defines a single child named "jcr:content". This "jcr:content" node can be of any node type, but generally speaking all information pertaining to the content itself is stored on this child node. The de facto standard is to use the "nt:resource" node type, which has these properties defined:

  • jcr:data (BINARY) mandatory
  • jcr:lastModified (DATE) autocreated
  • jcr:lastModifiedBy (STRING) autocreated
  • jcr:mimeType (STRING) protected?
  • jcr:encoding (STRING) protected?

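Note that "jcr:mimeType" and "jcr:encoding" were already defined on "nt:resource" in JCR 1.0 (where "jcr:mimeType" was mandatory). What changed in JCR 2.0 is that they are optional and inherited from the "mix:mimeType" mixin, and that "jcr:lastModifiedBy" was added.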

In particular, the purpose of the "jcr:mimeType" property was to do exactly what you're asking for - capture the "type" of the content. However, the "jcr:mimeType" and "jcr:encoding" property definitions can be defined (by the JCR implementation) as protected (meaning the JCR implementation automatically sets them) - if this is the case, you would not be allowed to manually set these properties. I believe that Jackrabbit and ModeShape do not treat these as protected.

Here is some code that shows how to upload a file into a JCR 2.0 repository using these built-in node types:

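// Uses java.io.{BufferedInputStream, File, FileInputStream, InputStream} and javax.jcr.{Binary, Node}

// Get an input stream for the file ...
File file = ...
InputStream stream = new BufferedInputStream(new FileInputStream(file));

// Create the "nt:file" node and its mandatory "jcr:content" child
// (the node variable is named fileNode so it does not clash with the java.io.File above)
Node folder = session.getNode("/absolute/path/to/folder/node");
Node fileNode = folder.addNode("Article.pdf","nt:file");
Node content = fileNode.addNode("jcr:content","nt:resource");

// Wrap the stream in a Binary value and store it as the file's data
Binary binary = session.getValueFactory().createBinary(stream);
content.setProperty("jcr:data",binary);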

And if the JCR implementation does not treat the "jcr:mimeType" property as protected (i.e., Jackrabbit and ModeShape), you'd have to set this property manually:

content.setProperty("jcr:mimeType","application/pdf");
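
If you don't want to hard-code the MIME type, you can derive it before setting "jcr:mimeType". The following is only a minimal sketch that uses the JDK's file-name based lookup; the class name MimeTypes and the fallback value are my own choices, not part of JCR. Since the question already uses Tika, its content-based detection is probably the more reliable option in practice.

```java
import java.net.URLConnection;

final class MimeTypes {
    // Fallback used when the JDK does not recognize the file extension (an assumption of this sketch).
    static final String FALLBACK = "application/octet-stream";

    // Guesses a MIME type from the file name alone; the file content is not inspected.
    static String guess(String fileName) {
        String type = URLConnection.guessContentTypeFromName(fileName);
        return type != null ? type : FALLBACK;
    }

    public static void main(String[] args) {
        System.out.println(guess("Article.pdf"));
    }
}
```

The result could then be passed to content.setProperty("jcr:mimeType", ...) in place of the literal string.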

Metadata can very easily be stored on the "nt:file" and "jcr:content" nodes, but out-of-the-box the "nt:file" and "nt:resource" node types don't allow for extra properties. So before you can add other properties, you first need to add a mixin (or multiple mixins) that have property definitions for the kinds of properties you want to store. You can even define a mixin that would allow any property. Here is a CND file defining such a mixin:

<custom = 'http://example.com/mydomain'>
[custom:extensible] mixin
- * (undefined) multiple 
- * (undefined) 

After registering this node type definition, you can then use this on your nodes:

content.addMixin("custom:extensible");
content.setProperty("anyProp","some value");
content.setProperty("custom:otherProp","some other value");

You could also define and use a mixin that allowed for any Dublin Core element:

<dc = 'http://purl.org/dc/elements/1.1/'>
[dc:metadata] mixin
- dc:contributor (STRING)
- dc:coverage (STRING)
- dc:creator (STRING)
- dc:date (DATE)
- dc:description (STRING)
- dc:format (STRING)
- dc:identifier (STRING)
- dc:language (STRING)
- dc:publisher (STRING)
- dc:relation (STRING)
- dc:rights (STRING)
- dc:source (STRING)
- dc:subject (STRING)
- dc:title (STRING)
- dc:type (STRING)

All of these properties are optional and, unlike "custom:extensible", this mixin doesn't allow properties of arbitrary name or type. I've also not really addressed with this 'dc:metadata' mixin the fact that some of these are already represented with the built-in properties (e.g., "jcr:createdBy", "jcr:lastModifiedBy", "jcr:created", "jcr:lastModified", "jcr:mimeType") and that some of them may be more related to content while others more related to the file.
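
Because such a mixin only accepts the Dublin Core properties, extracted metadata has to be filtered before it is written to the node. Here is a rough sketch of that filtering step in plain Java. It assumes the extractor hands you a Map keyed by bare element names such as "title" (your extractor's key names may well differ), and the class DublinCoreNames is hypothetical. Note that the official element name is "rights".

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class DublinCoreNames {
    // The fifteen element names of Dublin Core 1.1.
    static final Set<String> ELEMENTS = new LinkedHashSet<>(Arrays.asList(
            "contributor", "coverage", "creator", "date", "description",
            "format", "identifier", "language", "publisher", "relation",
            "rights", "source", "subject", "title", "type"));

    // Keeps only entries whose key is a Dublin Core element and prefixes the key with "dc:".
    static Map<String, String> toDcProperties(Map<String, String> extracted) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : extracted.entrySet()) {
            String key = entry.getKey().toLowerCase(Locale.ROOT);
            if (ELEMENTS.contains(key) && entry.getValue() != null) {
                result.put("dc:" + key, entry.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> extracted = new LinkedHashMap<>();
        extracted.put("Title", "Annual report");
        extracted.put("word-count", "1234"); // not a Dublin Core element, so it is dropped
        System.out.println(toDcProperties(extracted));
    }
}
```

Each resulting entry can be written with setProperty(name, value) once the mixin has been added and the "dc" namespace is registered. Keep in mind that "dc:date" is declared as DATE, so its value has to be a Calendar or a string the repository can convert to a date.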

You could of course define other mixins that better suit your metadata needs, using inheritance where needed. But be careful using inheritance with mixins - since JCR allows a node to have multiple mixins, it's often best to design your mixins to be tightly scoped and facet-oriented (e.g., "ex:taggable", "ex:describable", etc.) and then simply apply the appropriate mixins to a node as needed.

(It's even possible, though much more complicated, to define a mixin that allows more children under the "nt:file" nodes, and to store some metadata there.)

Mixins are fantastic and give a tremendous amount of flexibility and power to your JCR content.

Oh, and when you've created all of the nodes you want, be sure to save the session:

session.save();
Tombaugh answered 2/3, 2011 at 15:46 Comment(3)
thank you very much for brilliant explanation. I now have an overall idea how to implement my application. Btw, how do you deal with use case, when documents are ALWAYS in pairs ? For instance if it was for a translation company : source file x target file (french > english). Do I create a parent node "Files" which would be a folder and two child nodes "sourceFile" and "targetFile" ?Immoderate
Supporting translations and multiple languages is tough. I can think of several ways of handling it: 1) Use separate files, and somehow link them together. Your suggestion of 'source' and 'target' is one way; another might be to have 'translatedFrom' as either a PATH or (WEAK)REFERENCE property. 2) Treat the files as the same and therefore have one "nt:file" node, but with multiple "jcr:content"-type nodes (e.g., maybe "jcr:content" for the default language and "ex:content-fr" and "ex:content-en"). There are likely other possibilities also.Tombaugh
I had to postpone this until now, because I needed it to be also CMIS compatible. CMIS and OpenCMIS jcr bindings don't deal with "secondary types" until tools.oasis-open.org/issues/browse/CMIS-713 ... But it's gonna need some more time. Now opencmis operates with folder, file and mix:simpleVersionable ... So finally I have only one choice - folder > [sourceFolder, targetFolder] > files ...Immoderate

I am a bit rusty with JCR and I have never used 2.0 but this should get you started.

See this link. You'll want to open up the second comment.

You just store the file in a node and add additional metadata to the node. Here is how to store the file:

Node folder = session.getRootNode().getNode("path/to/file/uploads"); 
Node file = folder.addNode(fileName, "nt:file"); 
Node fileContent = file.addNode("jcr:content"); 
fileContent.setProperty("jcr:data", fileStream);
// Add other metadata
session.save();

How you store meta-data is up to you. A simple way is to just store key value pairs:

fileContent.setProperty(key, value, PropertyType.STRING);
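
One thing to watch with arbitrary keys: as far as I understand the JCR 2.0 rules, a local name may not contain the characters / : [ ] | * and a colon is read as a namespace prefix separator, so an extracted key such as "meta:word-count" would fail unless the "meta" prefix is registered. A minimal sketch of a sanitizing helper (the class JcrNames and the choice of '_' as replacement are assumptions of mine, not an API of Jackrabbit):

```java
final class JcrNames {
    // Replaces characters that are not allowed in a JCR local name with '_'.
    static String toLocalName(String key) {
        if (key == null || key.trim().isEmpty()) {
            throw new IllegalArgumentException("metadata key must not be empty");
        }
        String name = key.trim().replaceAll("[/:\\[\\]|*]", "_");
        // "." and ".." are path segments, not names, so replace them as well.
        return (name.equals(".") || name.equals("..")) ? "_" : name;
    }

    public static void main(String[] args) {
        System.out.println(toLocalName("meta:word-count"));
    }
}
```

You would then call fileContent.setProperty(JcrNames.toLocalName(key), value). This still requires a node type or mixin on the node that allows properties with arbitrary names.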

To read the data back, call getProperty() and convert the returned Property to the type you need:

fileStream = fileContent.getProperty("jcr:data").getBinary().getStream();
value = fileContent.getProperty(key).getString();
Shadrach answered 1/3, 2011 at 23:31 Comment(2)
Thank you. The problem of this use case is, that the documents are totally different in type of metadata. So that if the node tree has a "group/user/category/document" or "category/group/user/document" structure (I'm not sure about that what is better), each document would have to have a property "type" if it is pdf/doc/odt/ppt etc., and I would have to test for this every timeImmoderate
I'd be surprised that line 3 in the above code snippet actually works, because per the JCR specification (Section 3.7.11.2 of JCR 2.0, and Section 6.7.22.6 of JCR 1.0) the "jcr:content" node is mandatory but not auto-created.Tombaugh

I am new to Jackrabbit and am working with version 2.4.2. As for your solution, you can check the document type with plain Java logic and branch on it to handle each variation.

You don't need to worry about saving the contents of different file types such as .txt or .pdf, because the content is stored as binary. Here is a small sample that uploads a PDF file to a Jackrabbit repository and downloads it again.
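
As a rough illustration of that kind of type check, here is a small sketch that classifies a document by its file extension. The DocumentKinds class, the Kind values and the extension list are made up for this example; in a real application you would more likely branch on the stored "jcr:mimeType" value.

```java
import java.util.Locale;

final class DocumentKinds {
    enum Kind { PDF, WORD, OPEN_DOCUMENT, PRESENTATION, OTHER }

    // Classifies a document by the (case-insensitive) extension of its file name.
    static Kind fromFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "pdf":
                return Kind.PDF;
            case "doc":
            case "docx":
                return Kind.WORD;
            case "odt":
            case "ods":
                return Kind.OPEN_DOCUMENT;
            case "ppt":
            case "pptx":
                return Kind.PRESENTATION;
            default:
                return Kind.OTHER;
        }
    }

    public static void main(String[] args) {
        System.out.println(fromFileName("Alfresco_E0_Training.pdf"));
    }
}
```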

    // Fragment from a larger method: assumes `session` is an open javax.jcr.Session and
    // `root` is session.getRootNode(). The enclosing method must declare RepositoryException.
    // This program is for sample purposes only, so everything is hard-coded.
    try {
        if (!root.hasNode("Alfresco_E0_Training.pdf")) {
            // Import the PDF file unless it has already been imported
            System.out.print("Importing PDF... ");
            Node file = root.addNode("Alfresco_E0_Training.pdf", "nt:file");
            Node content = file.addNode("jcr:content", "nt:resource");

            // Store the bytes of the file in the "jcr:data" property
            FileInputStream stream = new FileInputStream("<path of file>\\Alfresco_E0_Training.pdf");
            try {
                Binary binary = session.getValueFactory().createBinary(stream);
                content.setProperty("jcr:data", binary);
            } finally {
                stream.close();
            }
            session.save();
            System.out.println("done.");

            System.out.println("::::::::::::::::::::Checking content of the node:::::::::::::::::::::::::");
            System.out.println("File Node Name : " + file.getName());
            System.out.println("File Node Identifier : " + file.getIdentifier());
            System.out.println("Content Node Name : " + content.getName());
            System.out.println("Content Node Identifier : " + content.getIdentifier());
            System.out.println("Content Node Content : " + content.getProperty("jcr:data"));
            System.out.println(":::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::");
        } else {
            // Already imported: read the binary back and write it to a file on disk
            Node file = root.getNode("Alfresco_E0_Training.pdf");
            Node content = file.getNode("jcr:content");
            Binary bin = content.getProperty("jcr:data").getBinary();
            InputStream stream = bin.getStream();
            File f = new File("C:<path of the output file>\\Alfresco_E0_Training.pdf");
            OutputStream out = new FileOutputStream(f);
            try {
                byte[] buf = new byte[1024];
                int len;
                // read() returns -1 at end of stream
                while ((len = stream.read(buf)) != -1) {
                    out.write(buf, 0, len);
                }
            } finally {
                out.close();
                stream.close();
                bin.dispose();
            }
            System.out.println("\nFile is created...................................");

            System.out.println("::::::::::::::::::::Checking content of the node:::::::::::::::::::::::::");
            System.out.println("File Node Name : " + file.getName());
            System.out.println("File Node Identifier : " + file.getIdentifier());
            System.out.println("Content Node Name : " + content.getName());
            System.out.println("Content Node Identifier : " + content.getIdentifier());
            System.out.println("Content Node Content : " + content.getProperty("jcr:data"));
            System.out.println(":::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::");
        }
    } catch (IOException e) {
        System.out.println("Exception: " + e);
    } finally {
        session.logout();
    }
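
The manual copy loop in the download branch can also be written with java.nio.file.Files.copy, which is a bit shorter and harder to get wrong. This is just a sketch: StreamExport and writeTo are hypothetical names, and main feeds in a ByteArrayInputStream where you would pass the stream obtained from the "jcr:data" binary.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class StreamExport {
    // Writes the whole stream to target (replacing an existing file), closes the
    // stream, and returns the number of bytes written.
    static long writeTo(InputStream in, Path target) throws IOException {
        try (InputStream source = in) {
            return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("export", ".bin");
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        long written = writeTo(new ByteArrayInputStream(data), tmp);
        System.out.println(written + " bytes written to " + tmp);
        Files.delete(tmp);
    }
}
```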

Hope this helps

Excrescent answered 25/7, 2012 at 5:39 Comment(0)
