Regular expression parsing a binary file?
A

4

40

I have a file which mixes binary data and text data. I want to parse it through a regular expression, but I get this error:

TypeError: can't use a string pattern on a bytes-like object

I'm guessing that message means that Python doesn't want to parse binary files. I'm opening the file with the "rb" flags.

How can I parse binary files with regular expressions in Python?

EDIT: I'm using Python 3.2.0

Analiese answered 11/4, 2011 at 9:0 Comment(2)
I'm guessing from the reference to bytes-like object that you're using Python 3, is that correct? - Bigelow
Are you asking about how to run re's functions against a binary file? Or are you asking about how to run re's functions against a bytes-like object? I'm interested in the former, but these answers only seem to address the latter (in particular, they don't give any clue, as far as I can tell, as to how to run re's functions against non-rb'\x0d?\x0a'-delimited files that may be larger than the available RAM). - Foresail
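On the large-file point raised in the last comment: one possible approach (a sketch, not something proposed in the answers here) is to memory-map the file and hand the mmap object to re, since re accepts bytes-like objects. The sample file, the MAGIC marker and the variable names are made up for illustration.

```python
import mmap
import os
import re
import tempfile

# Hypothetical sample file: binary padding around one ASCII marker.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * 4096 + b"MAGIC-0042" + b"\xff" * 4096)
    path = tmp.name

pat = re.compile(br"MAGIC-(\d{4})")

with open(path, "rb") as f:
    # Map the whole file read-only; the OS pages data in on demand
    # instead of Python reading everything into one bytes object.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        m = pat.search(mm)
        # Copy what we need out of the match while the map is still open.
        found = (m.start(), m.group(1)) if m else None

os.remove(path)
print(found)
```

Whether this is enough for files larger than RAM depends on the platform's address space and on the pattern (a pattern that backtracks across the whole file will still touch all of it), so treat it as a starting point.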
L
39

I think you are using Python 3.

1. Opening a file in binary mode is simple but subtle. The only difference from opening it in text mode is that the mode parameter contains a 'b' character.

........

4. Here’s one difference, though: a binary stream object has no encoding attribute. That makes sense, right? You’re reading (or writing) bytes, not strings, so there’s no conversion for Python to do.

http://www.diveintopython3.net/files.html#read

So in Python 3, because a file opened in binary mode yields bytes, a regex applied to its contents must itself be defined as a sequence of bytes, not a sequence of characters.

In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (u'') instead. But in Python 3, a string is always what Python 2 called a Unicode string — that is, an array of Unicode characters (of possibly varying byte lengths).

http://www.diveintopython3.net/case-study-porting-chardet-to-python-3.html

and

In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in UTF-8, or a Python string encoded as CP-1252. “Is this string UTF-8?” is an invalid question. UTF-8 is a way of encoding characters as a sequence of bytes. If you want to take a string and turn it into a sequence of bytes in a particular character encoding, Python 3 can help you with that.

http://www.diveintopython3.net/strings.html#boring-stuff

and

4.6. Strings vs. Bytes

Bytes are bytes; characters are an abstraction. An immutable sequence of Unicode characters is called a string. An immutable sequence of numbers-between-0-and-255 is called a bytes object.

....

1. To define a bytes object, use the b' ' “byte literal” syntax. Each byte within the byte literal can be an ASCII character or an encoded hexadecimal number from \x00 to \xff (0–255).

http://www.diveintopython3.net/strings.html#boring-stuff

So you should define your regex as follows (the r prefix keeps Python from treating \d as a string escape):

pat = re.compile(br'[a-f]+\d+')

and not as:

pat = re.compile(r'[a-f]+\d+')
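To make that concrete, here is a small sketch that writes some mixed binary/text data to a temporary file, reopens it in 'rb' mode and runs the bytes pattern over it. The sample data and the temporary file are invented for the example; only the pattern comes from this answer.

```python
import os
import re
import tempfile

# Hypothetical sample: arbitrary binary bytes with ASCII text in between.
sample = b"\x00\x01\xffabc123\x00\xfe\x80deadbeef42\x00"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

# Raw bytes literal, so the backslash in \d reaches the regex engine as-is.
pat = re.compile(br'[a-f]+\d+')

with open(path, 'rb') as f:
    data = f.read()  # bytes, because of the 'b' in the mode

# Every match is itself a bytes object.
matches = [m.group() for m in pat.finditer(data)]

os.remove(path)
print(matches)
```

The matches come back as bytes; call .decode() on them if you need str and know the encoding of the text parts.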

More explanations here:

15.6.4. Can’t use a string pattern on a bytes-like object

Lentil answered 11/4, 2011 at 10:35 Comment(4)
@John Machin What do you mean, please? - Lentil
Is any legal regular expression at risk of confusing a regex for a \x00 or any valid alternatives to representing binary data? My IDE is complaining that there is an invalid escape sequence \d in my byte string. - Tortoiseshell
Why is it that examples people make always use hard-coded strings? This doesn't help at all if the regex pattern is read from a text file into a plain string without also explaining how to convert it to a (compatible) binary-string. 😕 - Geny
@Tortoiseshell You should instead type br'…\d…' or b'…\\d…'. Python doesn't have a separate Regex type like e.g. JS does, only strings, for which the re library expects for a digit "metachar" an actual ASCII backslash followed by an ASCII small D. - Foresail
B
35

In your re.compile you need to use a bytes object, signified by an initial b:

r = re.compile(b"(This)")

This is Python 3 being picky about the difference between strings and bytes.
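A quick way to see the pickiness for yourself (a minimal sketch; the sample bytes are made up): the str pattern raises TypeError against bytes, while the bytes pattern matches.

```python
import re

data = b"\x00This is mixed\xff"  # hypothetical bytes read from a binary file

try:
    re.search("(This)", data)    # str pattern against bytes
    str_pattern_ok = True
except TypeError:
    str_pattern_ok = False       # this branch is taken on Python 3

m = re.search(b"(This)", data)   # bytes pattern against bytes
print(str_pattern_ok, m.group(1))
```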

Bigelow answered 11/4, 2011 at 10:19 Comment(2)
And what if the pattern isn't hard-coded in the script but rather read from a text file into a string variable? How do you convert it to a (compatible) byte string? 🤨 - Geny
@Geny use .encode() on the string object to make it a bytes object - Masaryk
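A sketch of the .encode() suggestion from the last comment, assuming the pattern arrives as a plain str (for instance, read from a text file). The helper name compile_bytes_pattern and the sample pattern are hypothetical, and UTF-8 is only a guess at the right encoding; use whatever encoding the binary file's text parts actually use.

```python
import re

def compile_bytes_pattern(pattern_text, encoding="utf-8"):
    """Encode a str regex and compile it as a bytes pattern."""
    return re.compile(pattern_text.encode(encoding))

# Stand-in for a pattern read from a text file (hypothetical content).
pattern_text = r"ID=(\d+)"

pat = compile_bytes_pattern(pattern_text)
m = pat.search(b"\x00\x7fID=2011\xff")
print(m.group(1))
```

For pure-ASCII patterns the encoding makes no difference. Note that in a bytes pattern classes such as \d and \w only match ASCII characters by default.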
L
0

Here is how I do it in my own customized "clone" of egrep:

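import os
import re
import sys

try:
    # Encode the pattern argument so the regex is compiled as a bytes pattern.
    Pattern = re.compile(bytes(sys.argv[1], sys.getdefaultencoding()))
except IndexError:
    # No pattern argument was given on the command line.
    print('Usage:\n\t%s pattern file1.zip [file2.zip ...]' % sys.argv[0])
    sys.exit(os.EX_USAGE)
except re.error as e:
    # The argument is not a valid regular expression.
    print('Invalid search pattern "%s": %s' % (sys.argv[1], e))
    sys.exit(os.EX_USAGE)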

sys.getdefaultencoding() supplies the encoding used to convert the first command-line argument from a string into bytes (in Python 3 it is always 'utf-8'). Then, given bytes rather than a string, re.compile() will produce a regular expression suitable for checking bytes.
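A possible variant, offered as an assumption rather than as part of this answer: on POSIX, Python decodes command-line arguments with the filesystem encoding and the surrogateescape error handler, and os.fsencode() reverses exactly that, so it may recover the original argument bytes more faithfully. The helper name compile_cli_pattern and the sample pattern are hypothetical.

```python
import os
import re

def compile_cli_pattern(arg):
    """Compile a command-line argument (a str) as a bytes regex."""
    # os.fsencode() encodes with the filesystem encoding; on POSIX it
    # uses 'surrogateescape', mirroring how sys.argv was decoded.
    return re.compile(os.fsencode(arg))

# Stand-in for sys.argv[1]: the ZIP local-file-header signature,
# written with regex escapes as a user might type it in a shell.
pat = compile_cli_pattern(r"PK\x03\x04")
hit = pat.search(b"junkPK\x03\x04rest")
print(hit.start() if hit else None)
```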

Latinalatinate answered 18/4, 2024 at 19:20 Comment(0)
O
-3

This works for me in Python 2.6:

>>> import re
>>> r = re.compile(".*(ELF).*")
>>> f = open("/bin/ls")
>>> x = f.readline()
>>> r.match(x).groups()
('ELF',)
Obannon answered 11/4, 2011 at 9:13 Comment(1)
This code import re; r = re.compile("(This)"); f = open(r"C:\WINDOWS\system32\mspaint.exe", "rb"); x = f.readline(); r.match(x).groups() returns the same error as my original post - Analiese
