How to escape string for generated C++?
A

3

9

I am writing a Python script that generates C++ code from data.

I have a Python variable, string, that holds a string which may contain characters such as " or newlines.

What is the best way to escape this string for code generation?

Apostolic answered 18/2, 2013 at 20:53 Comment(3)
Isn't this best solved by using a template engine like jinja which can already escape chars; I know I've done similar before when generating Java code - I'm aware that I may have misinterpreted your question. – Vladamar
Anyone know if pycparser fills this use-case? This seems like something it should support, but its documentation is non-existent so I can't really tell... – Zashin
This question is similar to: C-style escaping in python. If you believe it’s different, please edit the question, make it clear how it’s different and/or how the answers on that question are not helpful for your problem. – Bimah
A
7

The approach I use is based on the observation that C++ string literals basically follow the same rules for characters and escaping as JavaScript/JSON strings.

Python since version 2.6 has a built-in JSON library which can serialize Python data into JSON. Therefore, the code is, assuming we don't need enclosing quotes, just as follows:

import json
string_for_printing = json.dumps(original_string).strip('"')
Apostolic answered 18/2, 2013 at 20:53 Comment(4)
Except when there are Unicode characters in the string. Or when it ends with a quote. It also doesn't work for binary data. Escaping arbitrary data for C++ while keeping it readable is not as easy as it sounds; the last time I did this I just ended up turning every single byte into \xNN form. – Tally
@MattiVirkkunen of course for arbitrary data you want a hex dump like xxd -i. But if you can restrict to a reasonable character subset, is there a tool for this? – Rubble
@MattiVirkkunen C++ supports \unnnn and \Unnnnnnnn escape sequences. – Bradway
Use a slice [1:-1] instead of .strip('"') if original_string starts with or ends with ". – Bradway
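
The last comment's suggestion can be sketched as follows. The function name cpp_escape_via_json is hypothetical, and the sketch assumes the input is ordinary text whose JSON escapes your C++ compiler also accepts; non-ASCII characters come out as \uXXXX escapes and are not treated specially here.

```python
import json

def cpp_escape_via_json(s: str) -> str:
    # json.dumps always wraps the result in exactly one pair of double
    # quotes, so a slice removes them without touching escaped quotes.
    return json.dumps(s)[1:-1]

text = 'say "hi"'
print(cpp_escape_via_json(text))    # say \"hi\"
# strip('"') also eats the quote that belongs to the trailing \" escape:
print(json.dumps(text).strip('"'))  # say \"hi\
```

The second line of output ends in a lone backslash, which would escape the closing quote of the generated C++ literal.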
M
3

I use this code, so far without problems:

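def string(s, encoding='ascii'):
   # Python 3 port: text is encoded to bytes first (Python 2 used `unicode` here).
   if isinstance(s, str):
      s = s.encode(encoding)
   result = ''
   for c in s:  # iterating over bytes yields integers in Python 3
      # Escape everything outside printable ASCII, plus backslash and double quote.
      if not (32 <= c < 127) or c in (ord('\\'), ord('"')):
         result += '\\%03o' % c
      else:
         result += chr(c)
   return '"' + result + '"'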

It uses octal escapes to avoid all potentially problematic characters.
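
One reason fixed-width octal is a safe choice: a C++ octal escape takes at most three digits, while a \x escape keeps consuming hex digits for as long as they follow. The sketch below contrasts the two; both function names are hypothetical, and naive_hex_escape is included only to show the pitfall.

```python
def octal_escape(data: bytes) -> str:
    # Keep printable ASCII except backslash (0x5C) and double quote (0x22);
    # emit every other byte as a fixed-width three-digit octal escape.
    out = []
    for b in data:
        if 32 <= b < 127 and b not in (0x5C, 0x22):
            out.append(chr(b))
        else:
            out.append("\\%03o" % b)
    return '"' + "".join(out) + '"'

def naive_hex_escape(data: bytes) -> str:
    # For contrast only: the same policy with \xNN escapes.
    out = []
    for b in data:
        if 32 <= b < 127 and b not in (0x5C, 0x22):
            out.append(chr(b))
        else:
            out.append("\\x%02x" % b)
    return '"' + "".join(out) + '"'

sample = b"\x01b"  # byte 0x01 followed by the letter 'b'
print(octal_escape(sample))      # "\001b"  -> two characters in C++
print(naive_hex_escape(sample))  # "\x01b"  -> C++ reads this as the single character 0x1B
```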

Marquee answered 18/2, 2013 at 21:0 Comment(0)
M
1

We can do better using the specifics of C found here (https://www.gnu.org/software/gnu-c-manual/gnu-c-manual.html#Character-Constants) and Python's built-in string.printable constant:

import string

def c_escape():
  # Build a 256-entry table mapping each byte value to its C escape.
  mp = []
  for c in range(256):
    if c == ord('\\'): mp.append("\\\\")
    elif c == ord('?'): mp.append("\\?")
    elif c == ord('\''): mp.append("\\'")
    elif c == ord('"'): mp.append("\\\"")
    elif c == ord('\a'): mp.append("\\a")
    elif c == ord('\b'): mp.append("\\b")
    elif c == ord('\f'): mp.append("\\f")
    elif c == ord('\n'): mp.append("\\n")
    elif c == ord('\r'): mp.append("\\r")
    elif c == ord('\t'): mp.append("\\t")
    elif c == ord('\v'): mp.append("\\v")
    elif chr(c) in string.printable: mp.append(chr(c))
    else:
      x = "\\%03o" % c  # full three-digit octal form
      # Below 64 a shorter octal form exists; store (short, full) so the
      # caller can pick the full form when an octal digit follows.
      mp.append(x if c >= 64 else ("\\%o" % c, x))
  return mp

This has the advantage of being a table that maps the ordinal value of a character, ord(c), directly to its escaped string. Building a string with += is slow and bad practice; the table lets you use "".join(...) instead, which performs far better in Python, and indexing a list is faster than recomputing the escape for every character. The table also avoids wasting octal digits: for values below 64 it stores both a short form and the full three-digit form. To use the short form, you must verify that the next character is not a digit 0 through 7; otherwise you must use the three-digit octal form.

The table looks like:

[('\\0', '\\000'), ('\\1', '\\001'), ('\\2', '\\002'), ('\\3', '\\003'), ('\\4', '\\004'), ('\\5', '\\005'), ('\\6', '\\006'), '\\a', '\\b', '\\t', '\\n', '\\v', '\\f', '\\r', ('\\16', '\\016'), ('\\17', '\\017'), ('\\20', '\\020'), ('\\21', '\\021'), ('\\22', '\\022'), ('\\23', '\\023'), ('\\24', '\\024'), ('\\25', '\\025'), ('\\26', '\\026'), ('\\27', '\\027'), ('\\30', '\\030'), ('\\31', '\\031'), ('\\32', '\\032'), ('\\33', '\\033'), ('\\34', '\\034'), ('\\35', '\\035'), ('\\36', '\\036'), ('\\37', '\\037'), ' ', '!', '\\"', '#', '$', '%', '&', "\\'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '<', '=', '>', '\\?', '@', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '[', '\\\\', ']', '^', '_', '`', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '{', '|', '}', '~', '\\177', '\\200', '\\201', '\\202', '\\203', '\\204', '\\205', '\\206', '\\207', '\\210', '\\211', '\\212', '\\213', '\\214', '\\215', '\\216', '\\217', '\\220', '\\221', '\\222', '\\223', '\\224', '\\225', '\\226', '\\227', '\\230', '\\231', '\\232', '\\233', '\\234', '\\235', '\\236', '\\237', '\\240', '\\241', '\\242', '\\243', '\\244', '\\245', '\\246', '\\247', '\\250', '\\251', '\\252', '\\253', '\\254', '\\255', '\\256', '\\257', '\\260', '\\261', '\\262', '\\263', '\\264', '\\265', '\\266', '\\267', '\\270', '\\271', '\\272', '\\273', '\\274', '\\275', '\\276', '\\277', '\\300', '\\301', '\\302', '\\303', '\\304', '\\305', '\\306', '\\307', '\\310', '\\311', '\\312', '\\313', '\\314', '\\315', '\\316', '\\317', '\\320', '\\321', '\\322', '\\323', '\\324', '\\325', '\\326', '\\327', '\\330', '\\331', '\\332', '\\333', '\\334', '\\335', '\\336', '\\337', '\\340', '\\341', '\\342', '\\343', '\\344', '\\345', '\\346', '\\347', '\\350', '\\351', '\\352', '\\353', '\\354', 
'\\355', '\\356', '\\357', '\\360', '\\361', '\\362', '\\363', '\\364', '\\365', '\\366', '\\367', '\\370', '\\371', '\\372', '\\373', '\\374', '\\375', '\\376', '\\377']

Example usage: encoding some 4-byte integers as C strings in little-endian byte order, with a line break inserted every 50 bytes:

mp = c_escape()
vals = [30,50,100]
bytearr = [z for x in vals for z in x.to_bytes(4, 'little', signed=x<0)]
# Parenthesize the table lookup so the line break is appended for every byte, not only for tuple entries.
"".join((mp[x] if not isinstance(mp[x], tuple) else mp[x][1 if i != len(bytearr)-1 and bytearr[i+1] in range(ord('0'), ord('7')+1) else 0]) + ("\"\n\t\"" if (i % 50) == 49 else "") for i, x in enumerate(bytearr))

#output: '\\36\\0\\0\\0002\\0\\0\\0d\\0\\0\\0'
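
The table lookup and the lookahead rule can also be folded into one reusable function. This is only a sketch of the same idea, not the code above: the names SIMPLE, SHORT, FULL and escape_bytes are hypothetical, and it keeps two parallel tables instead of tuples so the lookahead becomes a plain choice between them.

```python
# Simple escapes, keyed by byte value.
SIMPLE = {0x07: "\\a", 0x08: "\\b", 0x09: "\\t", 0x0A: "\\n", 0x0B: "\\v",
          0x0C: "\\f", 0x0D: "\\r", ord("\\"): "\\\\", ord('"'): '\\"',
          ord("'"): "\\'", ord("?"): "\\?"}

def build_tables():
    # SHORT uses the fewest octal digits; FULL always uses three.
    short, full = [], []
    for c in range(256):
        if c in SIMPLE:
            short.append(SIMPLE[c])
            full.append(SIMPLE[c])
        elif 32 <= c < 127:
            short.append(chr(c))
            full.append(chr(c))
        else:
            short.append("\\%o" % c)
            full.append("\\%03o" % c)
    return short, full

SHORT, FULL = build_tables()
OCTAL_DIGITS = frozenset(b"01234567")  # byte values of the characters '0'..'7'

def escape_bytes(data: bytes) -> str:
    parts = []
    for i, b in enumerate(data):
        nxt = data[i + 1] if i + 1 < len(data) else None
        # A short octal escape would absorb a following octal digit,
        # so fall back to the three-digit form in that case.
        parts.append(FULL[b] if nxt in OCTAL_DIGITS else SHORT[b])
    return "".join(parts)

payload = b"".join(v.to_bytes(4, "little") for v in (30, 50, 100))
print(escape_bytes(payload))  # \36\0\0\0002\0\0\0d\0\0\0
```

For printable bytes and simple escapes the two tables hold the same entry, so the lookahead only changes the result for octal escapes below 64.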
Marvin answered 16/8, 2021 at 21:41 Comment(0)
