Regular expression parsing a binary file?

Asked 11/4, 2011 at 9:0 Answered 18/4, 2024 at 19:20

Solved python regex python-3.x parsing binary

I have a file that mixes binary data and text data. I want to parse it with a regular expression, but I get this error:

TypeError: can't use a string pattern on a bytes-like object

I'm guessing the message means that Python refuses to parse binary files. I'm opening the file in "rb" mode.

How can I parse binary files with regular expressions in Python?

EDIT: I'm using Python 3.2.0

Analiese answered 11/4, 2011 at 9:0 Comment(2)

I'm guessing from the reference to bytes-like object that you're using Python 3, is that correct? – Bigelow 11/4, 2011 at 9:25

Are you asking about how to run re's functions against a binary file? Or are you asking about how to run re's functions against a bytes-like object? I'm interested in the former, but these answers only seem to address the latter (in particular, they don't give any clue, as far as I can tell, as to how to run re's functions against non-rb'\x0d?\x0a'-delimited files that may be larger than the available RAM). – Foresail 12/10, 2022 at 15:49
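On the larger-than-RAM case raised in the comment above, one possible approach (not taken from any answer in this thread) is to memory-map the file and hand the mmap object to re, which accepts bytes-like objects. A minimal sketch, assuming a platform where mmap is available and a non-empty file; the temporary file and the "needle" pattern are made up for illustration:

```python
import mmap
import os
import re
import tempfile

# Build a throwaway binary file so the sketch is self-contained
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * 4096 + b"needle42" + b"\xff" * 4096)
    path = tmp.name

pattern = re.compile(rb"needle\d+")  # bytes pattern, hypothetical

try:
    with open(path, "rb") as f:
        # Map the file read-only; pages are loaded on demand by the OS
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            m = pattern.search(mm)
            offset = m.start()   # byte offset of the match in the file
            found = m.group(0)   # the matched bytes
finally:
    os.remove(path)
```

The match offsets are byte positions in the file, so they can be passed to seek() afterwards.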

I think you are using Python 3.

1. Opening a file in binary mode is simple but subtle. The only difference from opening it in text mode is that the mode parameter contains a 'b' character.

........

4. Here’s one difference, though: a binary stream object has no encoding attribute. That makes sense, right? You’re reading (or writing) bytes, not strings, so there’s no conversion for Python to do.

http://www.diveintopython3.net/files.html#read

So in Python 3, because a file opened in binary mode yields bytes, a regex used to analyse its contents must itself be defined as a sequence of bytes, not a sequence of characters.
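A minimal sketch of that rule. The `data` value is a made-up stand-in for what open(path, "rb").read() would return: a str pattern raises the TypeError from the question, while a bytes pattern matches.

```python
import re

# Hypothetical stand-in for bytes read from a file opened with "rb"
data = b"\x00\x01abc123\xff"

try:
    re.search(r"[a-f]+\d+", data)   # str pattern against bytes data
    error_message = None
except TypeError as exc:
    error_message = str(exc)        # the error from the question

# A bytes pattern works on bytes data; the match is bytes as well
match = re.search(rb"[a-f]+\d+", data)
found = match.group(0)
```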

In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (u'') instead. But in Python 3, a string is always what Python 2 called a Unicode string — that is, an array of Unicode characters (of possibly varying byte lengths).

http://www.diveintopython3.net/case-study-porting-chardet-to-python-3.html

and

In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in UTF-8, or a Python string encoded as CP-1252. “Is this string UTF-8?” is an invalid question. UTF-8 is a way of encoding characters as a sequence of bytes. If you want to take a string and turn it into a sequence of bytes in a particular character encoding, Python 3 can help you with that.

http://www.diveintopython3.net/strings.html#boring-stuff

and

4.6. Strings vs. Bytes

Bytes are bytes; characters are an abstraction. An immutable sequence of Unicode characters is called a string. An immutable sequence of numbers-between-0-and-255 is called a bytes object.

....

1. To define a bytes object, use the b' ' “byte literal” syntax. Each byte within the byte literal can be an ASCII character or an encoded hexadecimal number from \x00 to \xff (0–255).

http://www.diveintopython3.net/strings.html#boring-stuff
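The quoted passages can be illustrated with a short sketch. The sample values are made up; the point is that one string maps to different byte sequences depending on the encoding, and that a bytes object is a sequence of small integers.

```python
text = "caf\u00e9"                 # str: four Unicode characters ("cafe" with e-acute)

utf8 = text.encode("utf-8")        # one possible byte encoding (5 bytes)
cp1252 = text.encode("cp1252")     # same characters, different bytes (4 bytes)

literal = b"abc\x41\xff"           # bytes literal: ASCII characters and \xNN escapes
first = literal[0]                 # indexing a bytes object yields an int
```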

So you will define your regex as follows:

pat = re.compile(rb'[a-f]+\d+')

and not as

pat = re.compile(r'[a-f]+\d+')

More explanations here:

15.6.4. Can’t use a string pattern on a bytes-like object

Lentil answered 11/4, 2011 at 10:35 Comment(4)

@John Machin What do you mean, please ? – Lentil 24/6, 2011 at 9:57

Is any legal regular expression at risk of confusing a regex for a \x00 or any valid alternatives to representing binary data? My IDE is complaining that there is an invalid escape sequence \d in my byte string. – Tortoiseshell 27/12, 2018 at 23:53

Why is it that examples people make always use hard-coded strings? This doesn't help at all if the regex pattern is read from a text file into a plain string without also explaining how to convert it to a (compatible) binary-string. 😕 – Geny 22/10, 2020 at 22:48

@Tortoiseshell You should instead type br'…\d…' or b'…\\d…'. Python doesn't have a separate Regex type like e.g. JS does, only strings, for which the re library expects for a digit "metachar" an actual ASCII backslash followed by an ASCII small D. – Foresail 12/10, 2022 at 15:56
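A small sketch of the two spellings suggested in the last comment, with made-up sample data. Both produce the same bytes, and re itself interprets escapes such as \d and \x00 when it receives the backslash intact.

```python
import re

raw_pattern = br"[a-f]+\d+"        # raw bytes literal: backslash kept as-is
escaped_pattern = b"[a-f]+\\d+"    # doubled backslash in a normal bytes literal

same = raw_pattern == escaped_pattern

match = re.search(raw_pattern, b"\x00beef42\x00")
found = match.group(0)

# In a raw literal, \x00 reaches re as four characters; re reads it as a NUL byte
nul_offset = re.search(br"\x00", b"a\x00b").start()
```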

In your re.compile you need to use a bytes object, signified by an initial b:

r = re.compile(b"(This)")

This is Python 3 being picky about the difference between strings and bytes.
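For an end-to-end picture, here is a hedged sketch that writes a small mixed binary/text file, reads it back in "rb" mode, and searches it with a bytes pattern. The sample bytes and the "runs of printable ASCII" pattern are illustrative assumptions, not part of the original answer.

```python
import os
import re
import tempfile

# Hypothetical sample: text fragments embedded in arbitrary binary bytes
sample = b"\x7fELF\x02\x01\x00This is text\x00\xfe\xffabc123\x00"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    with open(path, "rb") as f:      # binary mode: read() returns bytes
        data = f.read()
finally:
    os.remove(path)

pattern = re.compile(rb"[ -~]{4,}")  # runs of 4 or more printable ASCII bytes
strings_found = pattern.findall(data)
```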

Bigelow answered 11/4, 2011 at 10:19 Comment(2)

And what if the pattern isn't hard-coded in the script but rather read from a text file into a string variable? How do you convert it to a (compatible) byte string? 🤨 – Geny 22/10, 2020 at 22:49

@Geny use .encode() on the string object to make it a bytes object – Masaryk 29/1, 2022 at 6:12
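A sketch of the .encode() suggestion, assuming the pattern arrives as a str (for example, read from a text file) and that UTF-8 is an acceptable encoding for it. The encoding should match how the searched data encodes its text; the pattern and sample bytes are hypothetical.

```python
import re

# Hypothetical pattern, as if read from a text file into a str
pattern_text = r"[a-f]+\d+"

# Encode the str to bytes before compiling
pattern = re.compile(pattern_text.encode("utf-8"))
match = pattern.search(b"\x00cafe2011\xff")
found = match.group(0)

# For a literal (non-regex) search term, escape it after encoding
literal = re.escape("a.b".encode("utf-8"))
```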

Here is how I do it in my own customized "clone" of egrep:

import os
import re
import sys

try:
    # Convert the str argument to bytes so the compiled regex can search bytes
    Pattern = re.compile(bytes(sys.argv[1], sys.getdefaultencoding()))
except IndexError:
    # No pattern was given on the command line
    print('Usage:\n\t%s pattern file1.zip [file2.zip ...]' % sys.argv[0])
    sys.exit(os.EX_USAGE)
except re.error as e:
    print('Invalid search pattern "%s": %s' % (sys.argv[1], e))
    sys.exit(os.EX_USAGE)

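Here sys.getdefaultencoding() names the encoding used to convert the first command-line argument from a string into bytes. In Python 3 it always returns 'utf-8', which is not necessarily the encoding the argument arrived in; on Unix, os.fsencode(sys.argv[1]) recovers the argument's original bytes. Given bytes rather than a string, re.compile() produces a regular expression suitable for searching bytes.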
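As a variation on the idea above, the conversion can be wrapped in a small helper. This is only a sketch: `compile_arg` is a hypothetical name, os.fsencode is swapped in for bytes(..., sys.getdefaultencoding()), and the argument shown (the ZIP local-header signature) is an assumed example.

```python
import os
import re

def compile_arg(arg):
    """Compile a command-line argument (a str) as a bytes regex."""
    return re.compile(os.fsencode(arg))

# Hypothetical argument: the ZIP local file header signature
pattern = compile_arg(r"PK\x03\x04")
hit = pattern.search(b"junkPK\x03\x04rest")
offset = hit.start()
```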

Latinalatinate answered 18/4, 2024 at 19:20 Comment(0)

-3

This works for me in Python 2.6:

>>> import re
>>> r = re.compile(".*(ELF).*")
>>> f = open("/bin/ls")
>>> x = f.readline()
>>> r.match(x).groups()
('ELF',)

Obannon answered 11/4, 2011 at 9:13 Comment(1)

This code

import re; r = re.compile("(This)"); f = open(r"C:\WINDOWS\system32\mspaint.exe", "rb"); x = f.readline(); r.match(x).groups()

returns the same error as my original post – Analiese 11/4, 2011 at 9:40
