Shortest sequence of operations transforming a file tree to another

Asked 1/8, 2011 at 19:43 Answered 9/8, 2011 at 16:12

Given two file trees A and B, is it possible to determine the shortest sequence of operations or a short sequence of operations that is necessary in order to transform A to B?

An operation can be:

Create a new, empty folder
Create a new file with any contents
Delete a file
Delete an empty folder
Rename a file
Rename a folder
Move a file inside another existing folder
Move a folder inside another existing folder

A and B are identical when they will have the same files with the same contents (or same size same CRC) and same name, in the same folder structure.

This question has been puzzling me for some time. For the moment I have the following, basic idea:

Compute a database:
- Store file names and their CRCs
- Then, find all folders with no subfolders, and compute a CRC from the CRCs of the files they contain, and a size from the total size of the files they contain
- Ascend the tree to make a CRC for each parent folder
Use the following loop having database A and database B:
- Compute A ∩ B and remove this intersection from both databases.
- Use an inner join to find matching CRCs in A and B, folders first, order by size desc
- while there is a result, use the first result to make a folder or file move (possibly creating new folders if necessary), remove from both database the source rows of the result. If there was a move then update CRCs of new location's parent folders in db A.
- Then remove all files and folders referenced in database A and create those referenced in database B.

However I think that this is really a suboptimal way to do that. What could you give me as advice?

Thank you!

Popeyed answered 1/8, 2011 at 19:43 Comment(3)

Perhaps unison or rsync can do this. – Vociferous 4/8, 2011 at 10:29

This could be very difficult to do in the general case. Especially if you impose the obvious constraint that two folders/files in a folder cannot contain the same name at any given point. – Guillaume 5/8, 2011 at 16:53

@tskuzzy: of course there is this constraint… Real-world uses of such an algorithm would be so nice do minimize data transfers when doing remote backups with a third, intermediary, storage medium… – Popeyed 5/8, 2011 at 17:54

This problem is a special case of the tree edit distance problem, for which finding an optimal solution is (unfortunately) known to be NP-hard. This means that there probably aren't any good, fast, and accurate algorithms for the general case.

That said, the paper I linked does contain several nice discussions of approximation algorithms and algorithms that work in restricted cases of the problem. You may find the discussion interesting, as it illuminates many of the issues that actually arise in solving this problem.

Hope this helps! And thanks for posting an awesome question!

Sought answered 5/8, 2011 at 17:18 Comment(3)

It's not really a special case; it's more of a related problem since the allowable edits are different and the distinction between leaf and non-leaf nodes. – Guillaume 5/8, 2011 at 17:31

@tskuzzy- You are correct that the allowable edits are different, but I believe that you can make the distinction between leaf and non-leaf nodes work by having each directory treat its files as leaf nodes. Perhaps I am incorrect about this? – Sought 5/8, 2011 at 17:41

I meant that tree edit distance doesn't make a distinction between "folders" and "files". For example, it might add insert a node onto a file. – Guillaume 5/8, 2011 at 17:45

You might want to check out tree-edit distance algorithms. I don't know if this will map neatly to your file system, but it might give you some ideas.

https://github.com/irskep/sleepytree (code and paper)

Belorussia answered 1/8, 2011 at 20:0 Comment(0)

The first step to do is figure out which files need to be created/renamed/deleted.

A) Create a hash map of the files of Tree A
B) Go through the files of Tree B
B.1) If there is an identical (name and contents) file in the hash map, then leave it alone
B.2) If the contents are the identical but the name is different, rename the file to that in the hash map
B.3) If the file contents doesn't exist in the hash map, remove it
B.4) (if one of 1,2,3 was true) Remove the file from the hash map

The files left over in the hash map are those that must be created. This should be the last step, after the directory structure has been resolved.

After the file differences have been resolved, it get's rather tricky. I wouldn't be surprised if there isn't an efficient optimal solution to this problem (NP-complete/hard).

The difficulty lies in that the problem doesn't naturally subdivide itself. Each step you do must consider the entire file tree. I'll think about it some more.

EDIT: It seems that the most studied tree edit distance algorithms consider only creating/deleting nodes and relabeling of nodes. This isn't directly applicable to this problem because this problem allows moving entire subtrees around which makes it significantly more difficult. The current fastest run-time for the "easier" edit distance problem is O(N^3). I'd imagine the run-time for this will be significantly slower.

Helpful Links/References

An Optimal Decomposition Algorithm for Tree Edit Distance - Demaine, Mozes, Weimann

Guillaume answered 5/8, 2011 at 17:5 Comment(0)

Enumerate all files in B and their associated sizes and checksums; sort by size/checksum.
Enumerate all files in A and their associated sizes and checksums; sort by size/checksum.
Now, doing an ordered list comparison, do the following:

a. for every file in A but not B, delete it.

b. for every file in B but not A, create it.

c. for every file in A and B, rename as many as you encounter from A to B, then make copies of the rest in B. If you are going to overwrite an existing file, save it off to the side in a separate list. If you find A in that list, use that as the source file.

Do the same for directories, deleting ones in A but not in B and adding those in B but not in A.

You iterate by checksum/size to ensure you never have to visit files twice or worry about deleting a file you will later need to resynchronize. I'm assuming you are trying to keep two directories in sync without unnecessary copying?

The overall complexity is O(N log N) plus however long it takes to read in all those files and their metadata.

This isn't the tree edit distance problem; it's more of a list synchronization problem that happens to generate a tree.

Ivied answered 9/8, 2011 at 16:12 Comment(0)

Only non trivial problem is moving folders and files. Renaming, deleting and creating is trivial and can be done in first step (or better in last when you finish).

You can then transform this problem into problem whit transforming tree both whit same leafs but different topology.

You decide which files will be moved from some folder/bucket and which files will be left in folder. Decision is based on number of same files in source and destination.
You apply same strategy to move folders in new topology.

I think that you should be near optimal or optimal if you forget about names of folders and think just about files and topology.

Gnome answered 2/8, 2011 at 7:42 Comment(0)

Helpful Links/References

Recommended topics

Hot tags