How to dump part of a binary file

7

10

I have a binary file and want to extract part of it, starting from a known byte string (e.g. FF D8 FF D0) and ending with another known byte string (AF FF D9).

In the past I've used dd to cut a part of a binary file off the beginning or end, but that command doesn't seem to support what I'm asking for.
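
For example, something like this to keep or drop the first N bytes (offsets made up):

dd if=file.bin of=head.bin bs=1 count=4096    # keep the first 4096 bytes
dd if=file.bin of=tail.bin bs=1 skip=4096     # keep everything after the first 4096 bytes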

What command-line tool can do this?

Kaine asked 26/2, 2012 at 9:16 Comment(0)
3

In a single pipe:

xxd -c1 -p file |
  awk -v b="ffd8ffd0" -v e="afffd9" '
    found == 1 {
      print $0
      str = str $0
      if (str == e) {found = 0; exit}
      if (length(str) == length(e)) str = substr(str, 3)}
    found == 0 {
      str = str $0
      if (str == b) {found = 1; print str; str = ""}
      if (length(str) == length(b)) str = substr(str, 3)}
    END{ exit found }' |
  xxd -r -p > new_file
test ${PIPESTATUS[1]} -eq 0 || rm new_file

The idea is to use awk between two xxd invocations to select the part of the file that is needed. Once the 1st pattern is found, awk prints the bytes until the 2nd pattern is found, then exits.

The case where the 1st pattern is found but the 2nd is not must be taken into account. This is handled in the END part of the awk script, which returns a non-zero exit status. It is caught via bash's ${PIPESTATUS[1]}, in which case the new file is deleted.

Note that an empty file also means that nothing has been found.
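
A quick way to sanity-check the result, using the two marker strings from the question:

xxd -p new_file | head -c 8      # should print ffd8ffd0
tail -c 3 new_file | xxd -p      # should print afffd9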

Maltreat answered 27/2, 2012 at 22:5 Comment(3)
Yet another mark reassignment - lOranger's solution fails if the 2nd pattern can be found before the 1st - giving $len a negative sign. This solution searches after the 1st pattern match, so it doesn't have that problem, nor does it generate an intermediate triple-size file.Kaine
After testing this more, I found it without issues, but it's rather slow on larger files. Does anyone see room for some optimisation, or is this the best one can get from xxd/awk?Kaine
Try the new sed version that I just posted. This one could be optimized by replacing string concatenation and extraction with rotating indexes into arrays, but it is less readable, and I do not want to do it if it is not needed ;-).Maltreat
7

Locate the start/end position, then extract the range.

$ xxd -g0 input.bin | grep -im1 FFD8FFD0  | awk -F: '{print $1}'
0000cb0
$ ^FFD8FFD0^AFFFD9^
0009590
$ dd ibs=1 count=$((0x9590-0xcb0+1)) skip=$((0xcb0)) if=input.bin of=output.bin
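
The second command uses bash's quick-substitution history shortcut (^old^new^ reruns the previous command with the first occurrence of old replaced by new), so written out in full the two lookups are:

xxd -g0 input.bin | grep -im1 FFD8FFD0 | awk -F: '{print $1}'   # offset of the line containing the start pattern
xxd -g0 input.bin | grep -im1 AFFFD9 | awk -F: '{print $1}'     # offset of the line containing the end pattern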
Drawing answered 26/2, 2012 at 9:49 Comment(5)
I found "..count=$((0x9590-0xcb0+2)) skip=$((0xcb0+1))..." to match exactly, starting from "FFD8.." and ending with "AFFF..". Thank you for your nice procedure. CheersKaine
After a couple of extractions I noticed that this is only an approximate solution. The +1, +2 all depend on the content. For example, 007d820: 74290068656c6c6f2e6a706700ffd8ff gives 007d820 for both '74 29 00 68' and '00 ff d8 ff', so something slightly different has to be done.Kaine
This does not work. If the pattern to match is split across two lines of xxd output it will never be found (by default xxd -g0 groups 16 bytes per line). For a 4-byte pattern the probability of a split is 25%. Also, the grep|awk will print the address of the beginning of the line where the pattern occurs, so a delta of up to the line size can happen; you end up with more data than you really want.Glyptodont
@lOranger use the -c 160 option to reduce the probability.Drawing
We're not talking about probability here, but certainty! Even with 160 (the max is 256 for xxd), the probability is more than 2%, which is huge. If you automate this, you need a script that works all the time, not 98% of the time. See my answer below for a proposal that works all the time.Glyptodont
2

This should work with standard tools (xxd, tr, grep, awk, dd). It correctly handles the "pattern split across lines" issue and looks for the pattern only at byte-aligned offsets (not nibble-aligned ones).

file=<yourfile>
outfile=<youroutputfile>
startpattern="ff d8 ff d0"
endpattern="af ff d9"
xxd -g0 -c1 -ps ${file} | tr '\n' ' ' > ${file}.hex 
start=$(($(grep -bo "${startpattern}" ${file}.hex \
    | head -1 | awk -F: '{print $1}')/3))
# each byte occupies 3 characters ("xx ") in ${file}.hex, hence the division by 3;
# the +3 below keeps the 3-byte end pattern in the extracted range
len=$(($(grep -bo "${endpattern}" ${file}.hex \
    | head -1 | awk -F: '{print $1}')/3-${start}+3))
dd ibs=1 count=${len} skip=${start} if=${file} of=${outfile}

Note: the script above uses a temporary file to avoid doing the binary-to-hex conversion twice. A space/time trade-off is to pipe the result of xxd directly into the two greps. A one-liner is also possible, at the expense of clarity.

One could also use tee and named pipes to avoid storing a temporary file and converting the output twice, but I'm not sure it would be faster (xxd is fast), and it is certainly more complex to write.
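
For illustration, a rough sketch of one such variant that skips the temporary file by keeping the hex dump in a shell variable (assumes bash; the variable is roughly three times the input size, and there is no check that the patterns were actually found):

file=input.bin                                        # illustrative names
outfile=output.bin
startpattern="ff d8 ff d0"
endpattern="af ff d9"
hex=$(xxd -c1 -p "${file}" | tr '\n' ' ')             # whole file as "xx xx xx ..."
start=$(($(grep -bo "${startpattern}" <<<"${hex}" | head -1 | cut -d: -f1)/3))
len=$(($(grep -bo "${endpattern}" <<<"${hex}" | head -1 | cut -d: -f1)/3-${start}+3))   # +3 keeps the 3-byte end pattern
dd ibs=1 count=${len} skip=${start} if="${file}" of="${outfile}"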

Virus answered 27/2, 2012 at 8:27 Comment(8)
lOranger, I used -c64 to compensate a bit, and cut and sed to calculate the correct address, but -c1 should be the real solution. I'll mark your solution, but only once I manage to make it work. First I needed to swap the places of grep's pattern and filename to make grep work, but regardless I get dd: invalid number; I imagine a problem in the start/len calculation/grammar. Also, can't we exclude the empty space and save 1/3 of the output .hex file, which would then be double the input file size instead of triple as it is now?Kaine
Sorry, there was a typo in the script: the grep pattern should come before the filename. I also added a | head -1 to cover the case where the pattern appears multiple times in the input, which can happen. Concerning your question, the space between hex bytes is necessary, otherwise you have the "nibble" issue (the pattern is not aligned on byte boundaries).Glyptodont
I'm afraid it still doesn't work; I get the input file as the result. I used my -c64 script and got the expected dump, but I was unwilling to post it here as it was fragile on boundaries (better than the one provided, but still..).Kaine
Please note that you have to convert your hex pattern to lowercase (or add the -i option to grep). I've just tested the script here with a big binary file and it works fine. Please print the values of ${start} and ${len} to debug (you can check that start and len > 0 to catch cases where the pattern is not found in the input).Glyptodont
Just in case: pastebin.com/raw.php?i=hZ5UqAF9 Patterns are in lower case. It simply returns the input file as the dump, so the start and end positions are 0 and the input file length.Kaine
Well, I tested your script here and it works fine as a bash and sh script (provided I change the patterns to match some data in my input file). Obviously you have to check that both patterns appear in the input. Which versions of the various tools are you using? Also please print ${start} and ${len} to check what's wrong. Please open the leftover .hex file and manually check that the patterns are present, just in case...Glyptodont
Try it yourself with the script from pastebin on this file: ge.tt/1EjaXGE/v/0 (160K)Kaine
let us continue this discussion in chatGlyptodont
1

See this link for a way to do a binary grep. Once you have the start and end offsets, you should be able to get what you need with dd.
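
A minimal sketch of that dd step, assuming the two byte offsets have already been found (the values below are placeholders):

start=$((0xcb0))                     # offset of the first byte of the start pattern (placeholder)
end=$((0x9593))                      # offset just past the last byte of the end pattern (placeholder)
dd ibs=1 skip=${start} count=$((end - start)) if=input.bin of=output.bin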

Virus answered 26/2, 2012 at 9:50 Comment(0)
1

A variation on the awk solution, assuming that your binary file, once converted to hex with spaces, fits in memory:

xxd -c1 -p file |
  tr "\n" " " |
  sed -n -e 's/.*\(ff d8 ff d0.*af ff d9\).*/\1/p' |
  xxd -r -p > new_file
Maltreat answered 27/2, 2012 at 22:13 Comment(6)
WOW, this is so sweet and looks so easy. Couldn't be better than this. I'll leave the mark on lOranger's answer as it is correct and was answered earlier, but this is by far my favourite snippet.Kaine
Too bad the quickest gets the mark, not the shortest... Anyway, it can still be optimized by removing the tr, replacing it inside sed with -e '1h' -e '2,$H' -e '${x;s/\n/ /g}' and performing the above substitution only on the last line. Note that this solution does not work on huge binary files, as the whole file needs to be held in memory by sed. On huge files, use the awk solution.Maltreat
Thanks. I tested this on a 1GB laptop, and it was fine for a 5MB file, but it made my system inaccessible on a 50MB file. Is there some general rule for determining the "limit" file size based on available RAM, in your opinion?Kaine
A 50MB file means 150MB once decoded and the bytes are separated by spaces. It is not that much, but it could cause sed to behave very slowly: a 150MB line is a lot! You could try sed's -u option to reduce buffering, but it could just worsen the problem. It is difficult to give an opinion on the limit: I do not know the sed implementation well. The best is to try several sizes. Sorry not to be able to help more.Maltreat
Thanks. You helped more than enough.Kaine
The three sets of wildcards make sed do a lot of recursive searching, probably... I think that may be the reason that things slow down when the file gets big.Pig
1

Another solution in sed, but using less memory:

xxd -c1 -p file |
  sed -n -e '1{N;N;N}' -e '/ff\nd8\nff\nd0/{:begin;p;s/.*//;n;bbegin}' -e 'N;D' | 
  sed -n -e '1{N;N}' -e '/af\nff\nd9/{p;Q1}' -e 'P;N;D' |
  xxd -r -p > new_file
test ${PIPESTATUS[2]} -eq 1 || rm new_file

The 1st sed prints from ff d8 ff d0 to the end of the file. Note that you need as many N commands in -e '1{N;N;N}' as there are bytes in your 1st pattern, less one.

The 2nd sed prints from the beginning of its input up to af ff d9. Note again that you need as many N commands in -e '1{N;N}' as there are bytes in your 2nd pattern, less one.

Again, a test is needed to check whether the 2nd pattern was found, and to delete the file if it was not.

Note that the Q command is a GNU extension to sed. If you do not have it, you need to discard the rest of the file once the pattern is found (in a loop like the 1st sed, but without printing), and check after the hex-to-binary conversion that new_file ends with the right pattern.
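
For example, that final check could look like this (assuming the 3-byte end pattern af ff d9 from the question):

tail -c 3 new_file | xxd -p | grep -q '^afffd9$' || rm new_file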

Maltreat answered 28/2, 2012 at 12:58 Comment(6)
I do have this GNU extension to sed, but can't make this script work for some reasonKaine
Sorry, typo in the 2nd sed: it should work if you replace /aa\nff\nd9/ with /af\nff\nd9/.Maltreat
I don't understand what difference that would make? Please try this sample: ge.tt/42cScKE/v/0?c (160K)Kaine
The link is not working :-(. If you do not have any output, it means that those 2 patterns are not found. You can debug the script by running the first 2 commands and adding the others after. About the change: I think you are looking for data between ff d8 ff d0 and af ff d9, but the script in my solution above was taking data between ff d8 ff d0 and aa ff d9.Maltreat
Sorry, the link must have expired. I uploaded it to another service, please try here: hotfile.com/dl/148193223/e90ab68/bin.dat.html The patterns are of course present in the file, I checked multiple times.Kaine
Ok, there was an error in the final test. I corrected it. The error was also in the awk version that I also corrected.Maltreat
0

You can use binwalk to do this. The tool will autodetect the files (and their offsets) in the input binary.

With the -e flag, it will extract all the files into the directory in which you are running the command.

It comes preinstalled on some distros, and you can easily install the CLI tool with sudo apt install binwalk.

Here is an example of a run where I have hidden a zip file, whose content is a text file called pass.txt, inside a .jpg image (screenshots of the run omitted).
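
The commands from that example look roughly like this (the image name is illustrative):

binwalk image.jpg        # list the embedded files binwalk detects, with their offsets
binwalk -e image.jpg     # extract them, by default into a _image.jpg.extracted/ directory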

Read the manual for further information.

Lazes answered 22/5, 2023 at 11:21 Comment(0)
