What is visible in an executable built with Cython, in case non-compiled Python code is executed?

When we write Cython code (with types), it is eventually compiled like C code, and the source code cannot be recovered (short of disassembling, which is no different from disassembling C code), as seen in "Are executables produced with Cython really free of the source code?".

But what happens when we write "normal Python code" (interpreted code without types) in a Cython .pyx file and we produce an executable? How much of it will be visible in the strings of the executable?

Example:

import bottle, random, json
app = bottle.Bottle()
@bottle.route('/')
def index():
    return 'hello'
@bottle.route('/random')
def testrand():
    return str(random.randint(0, 100))
@bottle.route('/jsontest')
def testjson():
    x = json.loads('{ "1": "2" }')
    return 'done'
bottle.run()

In this case, I see the following in test.c:

static const char __pyx_k_1_2[] = "{ \"1\": \"2\" }";
static const char __pyx_k_json[] = "json";
static const char __pyx_k_main[] = "__main__";
static const char __pyx_k_name[] = "__name__";
static const char __pyx_k_test[] = "__test__";
static const char __pyx_k_loads[] = "loads";
static const char __pyx_k_import[] = "__import__";
static const char __pyx_k_cline_in_traceback[] = "cline_in_traceback";

So in this second case (plain Python code in a .pyx file), won't all these strings be easily visible in the executable?

Dogface answered 17/5, 2022 at 13:49 Comment(6)
Variants of this question have been asked loads of times. Here's a recent one: #71269670. Yes, these strings will be visible fairly easily. Cython is not designed to obfuscate your code, so this is not a bug. - Atrice
Not particularly familiar with Cython, but I imagine if you're on Linux you could run strings myexe and find out. Intuitively I'd expect the strings to be relatively easy to extract if one knows what they're doing and where to look. - Catercornered
@Atrice Thanks. (Yes, I know this is not a bug.) I looked at various similar questions, but none of them has a concrete example. It would be great to have a canonical question+answer for all these duplicates, showing on a real minimal example what can be recovered, such as my bottle + JSON example. What do you think? - Dogface
Are you interested in recovery or obfuscation? If you're trying to hide API keys or other secrets, I feel like that's a different question. - Axel
The trouble with a real minimal example is: 1) it's hard to do it honestly (because we all know what Cython code went into it), 2) "minimal" actually makes it quite a bit easier by reducing the amount of code to scan, and 3) it doesn't put an upper limit on how smart the person trying to reverse engineer it is (so whatever I write, the answer could be worse). - Atrice
@Axel I'm interested in both; I want to know the boundaries of what can or can't be done with Cython. About your example of hiding API keys (it's another question indeed), would you have a link to a question+answer about this point? Or just some advice? - Dogface

In general, you won't be able to avoid having those strings in the resulting executable. This is just how Python works: they are needed at run time.

Consider a simple C program:


void do_nothing(void) { /* ... */ }

int main(void) {
  do_nothing();
  return 0;
}

Suppose we compile and link it statically. Once the linker is done, the call to do_nothing (let's assume it is not inlined or optimized out) is just a jump to a memory address. The name of the function is no longer needed and can be stripped from the resulting executable.

Python works differently: there is no linker, and calls are not made through raw memory addresses at run time. Instead, the Python machinery finds the functionality for us, given the name of the package/module and of the function. This information (the names) is therefore needed at run time and must be present in the executable.
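To make the name-based lookup concrete, here is a minimal sketch in plain Python of what "finding functionality by name" amounts to. The helper call_by_name is hypothetical and only for illustration; it is not what Cython generates, but it shows why the strings "json" and "loads" have to exist somewhere at run time.

```python
import importlib


def call_by_name(module_name, attr_name, *args):
    """Resolve a callable from plain strings, then call it."""
    # The module is located through its name, not through an address.
    module = importlib.import_module(module_name)
    # The function is looked up through its name as well.
    func = getattr(module, attr_name)
    return func(*args)


result = call_by_name("json", "loads", '{ "1": "2" }')
print(result)
```

Both lookups take ordinary strings as input, so whatever performs them, interpreter or compiled extension, must be able to produce those strings when the call happens.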


However, if you are willing to modify the generated C file, you can make the life of the "hacker" somewhat harder.

When a string is needed to call Python functionality, Cython generates code like the following (here for import json):

static const char __pyx_k_json[] = "json";

static PyObject *__pyx_n_s_json;

static __Pyx_StringTabEntry __pyx_string_tab[] = {
  ...
  {&__pyx_n_s_json, __pyx_k_json, sizeof(__pyx_k_json), 0, 0, 1, 1},
  ...
  {0, 0, 0, 0, 0, 0, 0}
};

static CYTHON_SMALL_CODE int __Pyx_InitGlobals(void) {
  if (__Pyx_InitStrings(__pyx_string_tab) < 0) __PYX_ERR(0, 1, __pyx_L1_error);
...
}
...
__pyx_t_1 = __Pyx_Import(__pyx_n_s_json, 0, 0); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 1, __pyx_L1_error)

So one could store "json" as "irnm" (every character shifted by -1) and then restore the real name at run time, before __Pyx_InitStrings is called in __Pyx_InitGlobals. For an in-place restore, the array could no longer be declared const.
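The transformation itself can be sketched in a few lines of Python, for example as an offline helper that computes the shifted literals to paste into the generated C file. The function names shift, encode_name, and decode_name are made up for this illustration; the actual run-time restore would have to be hand-written in C, which this sketch does not cover.

```python
def shift(text, offset):
    """Shift every character's code point by `offset`."""
    return "".join(chr(ord(ch) + offset) for ch in text)


def encode_name(name):
    """Form to store in the C file: every character shifted by -1."""
    return shift(name, -1)


def decode_name(stored):
    """Inverse step, to be done at run time before the strings are used."""
    return shift(stored, +1)


stored = encode_name("json")
print(stored)               # irnm
print(decode_name(stored))  # json
```

Because the shift maps each character to exactly one character, the stored string has the same length as the original, so the sizeof(...) entry in the string table keeps its value.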

Now, simply dumping the strings in the executable yields only meaningless combinations of characters. One could even go further and load the real names from somewhere after the program has started, if this is worth the trouble.
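As a rough check of that claim, the sketch below mimics a strings-style dump in Python: it collects runs of printable ASCII from a byte blob. The blob is a made-up stand-in for an executable's data section, and extract_strings is a hypothetical helper, similar in spirit to the strings tool but not a reimplementation of it.

```python
import re


def extract_strings(data, min_len=4):
    """Return runs of printable ASCII at least `min_len` bytes long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]


# Made-up bytes: "json" and "loads" stored shifted by -1, "__main__" left as is.
blob = b"\x00\x01irnm\x00\x7f__main__\x00\x02kn`cr\x00"
found = extract_strings(blob)
print(found)
```

The shifted names still show up in the dump, just not in readable form, so this only raises the bar for casual inspection. Anyone who finds the decoding routine can undo it.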

Anzio answered 17/5, 2022 at 15:51 Comment(0)
