Compressing a folder with many duplicated files [closed]

I have a pretty big folder (~10GB) that contains many duplicated files throughout its directory tree. Many of these files are duplicated up to 10 times. The duplicated files don't reside side by side, but within different sub-directories.

How can I compress the folder to make it small enough?

I tried to use WinRAR in "Best" mode, but it didn't compress it at all. (Pretty strange)

Will zip/tar/cab/7z or any other compression tool do a better job?

I don't mind letting the tool work for a few hours - but not more.

I'd rather not do it programmatically myself.

Chary answered 13/12, 2014 at 9:18 Comment(1)
I select files in Windows Explorer > right-click > WinRAR (6.24) > Add to archive > Options tab > Save identical files as references. The Help file (available by clicking the question mark at upper right in that dialog) says, "If several identical files larger than 64 KB are found, the first file in the set is saved as usual file and all following files are saved as references to this first file." This appears to offer file-level deduplication. For deduplication of similar but non-identical files, consider Borg: raywoodcockslatest.wordpress.com/2022/06/24/borg-retry Carrycarryall
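For the command line, the same "save identical files as references" behaviour is available through the -oi switch of Rar.exe in recent WinRAR versions; a minimal sketch, assuming a default install path and placeholder archive/folder names:

"%ProgramFiles%\WinRAR\Rar.exe" a -ma5 -m5 -s -oi -r "%UserProfile%\archive.rar" "%UserProfile%\BigFolder"

-ma5 selects the RAR5 format, -m5 best compression, -s a solid archive and -r recursion into sub-directories.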

The best option in your case is 7-zip. Here are the options:

7za a -r -t7z -m0=lzma2 -mx=9 -mfb=273 -md=29 -ms=8g -mmt=off -mmtf=off -myx=9 -mqs=on -bt -bb3 archive_file_name.7z /path/to/files

a - add files to archive

-r - Recurse subdirectories

-t7z - Set type of archive (7z in your case)

-m0=lzma2 - Sets the compression method to LZMA2. LZMA is the default and general compression method of the 7z format. The main features of the LZMA method:

  • High compression ratio
  • Variable dictionary size (up to 4 GB)
  • Compressing speed: about 1 MB/s on 2 GHz CPU
  • Decompressing speed: about 10-20 MB/s on 2 GHz CPU
  • Small memory requirements for decompressing (depends on dictionary size)
  • Small code size for decompressing: about 5 KB
  • Supporting multi-threading and P4's hyper-threading

-mx=9 - Sets the level of compression. x=0 means Copy mode (no compression); x=9 means Ultra.

-mfb=273 - Sets the number of fast bytes for LZMA. It can be in the range from 5 to 273. The default value is 32 for normal mode and 64 for maximum and ultra modes. Usually, a bigger number gives a slightly better compression ratio and a slower compression process.

-md=29 - Sets the dictionary size for LZMA. You can specify the size in bytes, kilobytes, or megabytes. The maximum dictionary size is 1536 MB, but the 32-bit version of 7-Zip only allows specifying up to a 128 MB dictionary. The default values for LZMA are 24 (16 MB) in normal mode, 25 (32 MB) in maximum mode (-mx=7) and 26 (64 MB) in ultra mode (-mx=9). If you do not specify any symbol from the set [b|k|m|g], the dictionary size is calculated as DictionarySize = 2^Size bytes. To decompress a file compressed by the LZMA method with dictionary size N, you need about N bytes of RAM available.

I use -md=29 (2^29 bytes = 512 MB) because my server has only 16 GB of RAM available. With these settings, 7-Zip uses only about 5 GB regardless of the size of the directory being archived. If I use a bigger dictionary size, the system goes to swap.

-ms=8g - Enables solid mode with an 8 GB solid block size (the default is s=on). In solid mode, files are grouped together, which usually improves the compression ratio. In your case it is very important to make the solid block size as big as possible.

Limiting the solid block size usually decreases the compression ratio. Updating solid .7z archives can be slow, since it can require some recompression.

-mmt=off - Sets multithreading mode to OFF. You need to switch it off because we need similar or identical files to be processed by the same 7-Zip thread within one solid block. The drawback is slower archiving, no matter how many CPUs or cores your system has.

-mmtf=off - Set multithreading mode for filters to OFF.

-myx=9 - Sets the level of file analysis to maximum: analysis of all files (Delta and executable filters).

-mqs=on - Sorts files by type in solid archives, so that identical files are stored together.

-bt - Show execution time statistics

-bb3 - Set output log level
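Once the archive has been created, you can check how much the deduplication and solid compression saved with the standard list and test subcommands (same archive name as above):

7za l archive_file_name.7z
7za t archive_file_name.7z

l shows packed versus unpacked sizes, t verifies the archive's integrity.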

Cuspidate answered 12/10, 2018 at 3:3 Comment(2)
I found that this command created a smaller file than first creating a .wim file (with 7z a -twim name.wim folder/) and then compressing it with -mx=9 -m0=lzma2.Relator
The -r switch might be unnecessary or even cause unexpected behavior. In my case, under Linux, in addition to (expected) compression of TARGET folder content, it was causing (unexpected) compression of TARGET SIBLING and TARGET PARENT folder content. From man 7za: CAUTION: this flag does not do what you think, avoid using it. Also see: "7z: What does the -r flag do?".Manganin

7-zip supports the 'WIM' file format which will detect and 'compress' duplicates. If you're using the 7-zip GUI then you simply select the 'wim' file format.

If you're using command-line 7-zip, see this answer: https://serverfault.com/questions/483586/backup-files-with-many-duplicated-files
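As a rough command-line sketch (the paths are placeholders), the WIM container can be created directly with:

7z a -twim backup.wim /path/to/folder

and, for extra savings on top of the deduplication, the resulting .wim can then be wrapped in an ordinary compressed .7z (see the next answer for measured results).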

Interdict answered 3/9, 2016 at 11:44 Comment(0)

I suggest 3 options that I've tried (in Windows):

  1. 7zip LZMA2 compression with a dictionary size of 1536 MB
  2. WinRAR "solid" file
  3. 7zip WIM file

I had 10 folders with different versions of a web site (with files such as .php, .html, .js, .css, .jpeg, .sql, etc.) with a total size of 1 GB (100 MB average per folder). While standard 7zip or WinRAR compression gave me a file of about 400-500 MB, these options gave me a file of (1) 80 MB, (2) 100 MB & (3) 170 MB respectively.

Update edit: Thanks to @Cranmer's suggestion in the comments, I tried applying 7zip LZMA2 compression (the dictionary size seems to make no difference) to the 7zip WIM file. Sadly it is not the same backup file I used in the test years ago, but I could compress the WIM file to 70% of its size. I would give this two-step method a try with your specific set of files and compare it against method 1.

New edit: My backups were growing and now contain many image files. With 30 versions of the site, method 1 weighs 6 GB, while a 7zip WIM file inside a 7zip LZMA2 file weighs only 2 GB!
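For reference, a sketch of that two-step pipeline on the command line (file and folder names are placeholders):

7z a -twim site-backups.wim C:\path\to\site-versions
7z a -t7z -m0=lzma2 -mx=9 site-backups.7z site-backups.wim

The first step stores each identical file's data only once; the second compresses what remains with LZMA2.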

Byplay answered 2/11, 2016 at 20:6 Comment(3)
Your Solid Block size may have made a difference as well.Heideheidegger
The WIM doesn't compress, it just removes the duplicate data; I would expect you to end up with maybe 20-40 MB if you compress the WIM file using LZMA2. So first create a WIM archive, THEN compress that WIM archive.Cranmer
@Cranmer Nice suggestion! I don't know why there is not an option to do that automatically. Will try it and refresh the answer ;)Byplay

Do the duplicated files have the same names? Are they usually less than 64 MB in size? Then you should sort by file name (without the path), use tar to archive all of the files in that order into a .tar file, and then use xz to compress it into a .tar.xz archive. Duplicated files that are adjacent in the .tar file and smaller than the window size of the chosen xz compression level should compress to almost nothing. You can see the dictionary sizes ("DictSize") for the compression levels in the xz man page. They range from 256 KB to 64 MB.
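A minimal sketch of that pipeline on Linux, assuming GNU find, sort and tar plus xz are available (paths are placeholders):

cd /path/to/folder
find . -type f -printf '%f\t%p\n' | sort | cut -f2- > /tmp/by-basename.txt
tar -cf - -T /tmp/by-basename.txt | xz -9 > ../folder.tar.xz

The find/sort/cut chain orders the file paths by bare file name so that duplicates sit next to each other in the tar stream, and xz -9 uses a 64 MB dictionary.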

Understructure answered 13/12, 2014 at 18:55 Comment(4)
Thanks a lot! This helped me to shrink a tar.gz archive with many duplicate html files (same name but different directories) from over 1 GB to 450 MB.Tricolor
From your description, it seems like the compression should have been better than a factor of two.Understructure
Sorry, another large part of that archive was from larger binary files (no duplicates). I didn't do any further investigation, just was happy to shrink the data down to fit onto a single CD. Cheers!Tricolor
tar can handle hard links, so I find duplicated files and make hard links before exec tar. Here is my script: for path in path1 path2 path3 ...; do find $path -type f; done | xargs -d'\n' sha1sum | sort | while read -r sha path; do test "$sha" == "$prev_sha" && ln -f $prev_path $path; prev_sha=$sha; prev_path=$path; done.Pinzler

By default, WinRAR compresses each file separately. So by default there is no real gain when compressing a folder structure with many similar or even identical files.

But there is also the option to create a solid archive. Open WinRAR's help, open the item Archive types and parameters on the Contents tab, and click Solid archives. This help page explains what a solid archive is and which advantages and disadvantages this archive format has.

A solid archive with a larger dictionary size, in combination with best compression, can make an archive of many similar files very small. For example, I have a list of 327 binary files with sizes from 22 KB to 453 KB, totaling 47 MB (not counting the cluster size of the partition). I can compress those 327 similar, but not identical, files into a RAR archive with a dictionary size of 4 MB that is only 193 KB. That is of course a dramatic reduction in size.

After reading the help page about solid archives, follow the link to the help page about rarfiles.lst. It describes how you can control the order in which files are put into a solid archive. This file is located in the program files folder of WinRAR and can of course be customized to your needs.

You also have to take care of the option Files to store without compression when using the GUI version of WinRAR. This option can be found after clicking the symbol/command Add on the tab Files. It specifies file types which are just stored in the archive without any compression, like *.png, *.jpg, *.zip, *.rar, ... Those files usually already contain data in compressed format, and therefore it does not make much sense to compress them once again. But if duplicate *.jpg files exist in a folder structure and a solid archive is created, it makes sense to remove all file extensions from this option.

A suitable command line using the console version Rar.exe of WinRAR and the RAR5 archive file format would be:

"%ProgramFiles%\WinRAR\Rar.exe a -@ -cfg- -ep1 -idq -m5 -ma5 -md128 -mt1 -r -s -tl -y -- "%UserProfile%\ArchiveFileName.rar" "%UserProfile%\FolderToArchive\"

The switches used in this example are explained in the manual for Rar.exe, which is the text file Rar.txt in the program files directory of WinRAR. WinRAR.exe can also be used by replacing the switch -idq with -ibck, as explained in WinRAR's help on the page Alphabetic switches list (menu Help > Help topics > tab Contents > Command line mode > Switches > Alphabetic switches list).

By the way: there are applications like Total Commander, UltraFinder or UltraCompare, and many others, which support searching for duplicate files by various user-configurable criteria, such as finding files with the same name and size or, most reliably, files with the same size and content, and which provide functions to delete the duplicates.
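The "same size and same content" criterion can also be checked with standard tools; a minimal sketch for Linux or any system with GNU coreutils (the path is a placeholder):

find /path/to/folder -type f -exec sha256sum {} + | sort | uniq -w64 --all-repeated=separate

This hashes every file, sorts the output by hash and prints the groups of files whose content is identical.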

Detergent answered 13/12, 2014 at 13:32 Comment(0)

Try eXdupe from www.exdupe.com; it uses deduplication and is so fast that it's practically disk I/O bound.

Whalen answered 13/12, 2014 at 9:50 Comment(0)
