Does it make sense to use Neo4j to index a file system?

3

6

I am working on a Java-based backup client that scans the file system and populates a SQLite database with the directories and file names it finds to back up. Would it make sense to use Neo4j instead of SQLite? Would it be more performant and easier to use for this application? My thinking is that because a file system is a tree (or a graph, if you consider symbolic links), a graph database may be suitable. The SQLite schema defines only two tables, one for directories (full path and other info) and one for files (name only, with a foreign key to the containing directory in the directory table), so it's relatively simple.

The application needs to index many millions of files so the solution needs to be fast.

Mincey answered 21/6, 2011 at 8:38 Comment(0)
3

As long as you can perform the DB operations essentially by string matching on the stored file system paths, using a relational database makes sense. The moment the data model gets more complex and you can no longer express your queries as string matching but need to traverse a graph, a graph database will make this much easier.
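To make the "string matching on stored paths" case concrete, here is a minimal sketch in Python using the standard `sqlite3` module (the same SQL can be issued from a Java client). The table and column names (`directories`, `files`, `path`, `name`, `dir_id`) and the helper functions are assumptions made for illustration, modelled on the two-table schema described in the question; they are not the asker's actual schema.

```python
import sqlite3

def build_index(conn):
    # Two tables, as described in the question; all names here are assumed.
    conn.executescript("""
        CREATE TABLE directories (
            id   INTEGER PRIMARY KEY,
            path TEXT NOT NULL UNIQUE      -- full path of the directory
        );
        CREATE TABLE files (
            id     INTEGER PRIMARY KEY,
            name   TEXT NOT NULL,          -- file name only
            dir_id INTEGER NOT NULL REFERENCES directories(id)
        );
    """)

def add_file(conn, dir_path, name):
    # Register the directory once, then attach the file to it.
    conn.execute("INSERT OR IGNORE INTO directories(path) VALUES (?)", (dir_path,))
    (dir_id,) = conn.execute(
        "SELECT id FROM directories WHERE path = ?", (dir_path,)
    ).fetchone()
    conn.execute("INSERT INTO files(name, dir_id) VALUES (?, ?)", (name, dir_id))

def files_under(conn, root):
    # Prefix match expressed as a range: every descendant path starts with
    # root + '/', and '0' is the character right after '/' in ASCII, so
    # [root + '/', root + '0') covers exactly those paths and can use the
    # index behind the UNIQUE constraint.
    rows = conn.execute(
        """
        SELECT d.path || '/' || f.name
        FROM files f JOIN directories d ON d.id = f.dir_id
        WHERE d.path = ? OR (d.path >= ? AND d.path < ?)
        ORDER BY 1
        """,
        (root, root + "/", root + "0"),
    )
    return [row[0] for row in rows]
```

"Everything below this directory" stays a single indexed range scan here, with no graph traversal, which is the situation where the relational model is sufficient.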

Altruistic answered 21/6, 2011 at 13:13 Comment(1)
It really depends on the queries you want to run on this data set. In a graph, each directory and file would be its own node storing the meta-information, with the relationship probably holding the file name (as there might be symbolic or hard links to this node under different names).
Angulo
3

As I understand it, one of the earliest uses of Neo4j was to do exactly this, as part of the CMS that Neo4j originated from.

Lucene, the indexing backend for Neo4j, will allow you to build any indexes you might need.

You should read up on that and ask the Neo4j team directly.

Lazos answered 22/7, 2011 at 11:39 Comment(0)
0

I am considering a similar solution to index a data store on a file system. The remark above about the queries is right.

Examples of worst case queries:

For sqlite:

  • If you have a large number of subdirectories deep in the file system, SQLite's space usage will not be optimal, because the full path is stored for every small subdirectory (think of a code project, for instance).
  • If you need to move a directory, the closer it is to the root, the more rows you have to rewrite, so a move is not O(1) as it would be with Neo4j.
  • Can SQLite use multiple threads to scale?
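The cost of moving a directory in the full-path model can be sketched as follows. This is an illustration only: the `directories` table with a `path` column and the `move_directory` helper are assumed names, not taken from an actual schema.

```python
import sqlite3

def make_demo_db():
    # In-memory database with an assumed one-column-per-path directory table.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE directories (id INTEGER PRIMARY KEY, path TEXT NOT NULL UNIQUE)"
    )
    return conn

def move_directory(conn, old, new):
    """Rename directory `old` to `new`; return how many rows were rewritten."""
    cur = conn.execute(
        """
        UPDATE directories
        SET path = ? || substr(path, ?)   -- keep the part after the old prefix
        WHERE path = ? OR (path >= ? AND path < ?)
        """,
        # substr() is 1-based, so len(old) + 1 is the first character after
        # the old prefix; the range [old + '/', old + '0') selects descendants.
        (new, len(old) + 1, old, old + "/", old + "0"),
    )
    return cur.rowcount
```

The returned row count grows with the number of directories below the one being moved, which is why a move near the root is the expensive case. In a model where each directory only references its parent, the same move would touch a single reference.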

For neo4j:

  • Each time you search for a full path, you need to split it into components and build a Cypher query from all the elements of the path.
  • The data model will probably be more complex than two tables: the different object types, plus dir-in-dir, file-in-dir, and symlink relationships.
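The path-splitting step might look roughly like this. It is a sketch that only builds the query text and its parameters. The `Root` label, the `CONTAINS` relationship type, and the `name` property are hypothetical choices for illustration; a real graph model could equally keep the name on the relationship.

```python
from pathlib import PurePosixPath

def cypher_for_path(path):
    """Build a parameterised Cypher lookup for an absolute POSIX path.

    Assumed (hypothetical) model: a single node labelled Root, and
    CONTAINS relationships leading to child nodes that carry a `name`.
    """
    # PurePosixPath("/a/b").parts is ('/', 'a', 'b'); drop the root marker.
    parts = [p for p in PurePosixPath(path).parts if p != "/"]
    pattern = "(n0:Root)"
    params = {}
    for i, name in enumerate(parts, start=1):
        # One hop per path component; values are passed as parameters
        # rather than spliced into the query text.
        pattern += f"-[:CONTAINS]->(n{i} {{name: $p{i}}})"
        params[f"p{i}"] = name
    return f"MATCH {pattern} RETURN n{len(parts)}", params
```

The query grows by one hop per path component, so a lookup deep in the tree turns into a long pattern, which is the overhead the first bullet describes.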

Greetings, hj

Shipboard answered 1/10, 2017 at 5:57 Comment(0)
