Fast string array - Cython

Asked 7/7, 2013 at 10:26 Answered 7/1, 2019 at 20:7

Solved python arrays string cython python-2.x

Given the following hypothetical code:

cdef extern from "string.h":
    int strcmp(char* str1, char* str2)

def foo(list_str1, list_str2):
    cdef unsigned int i, j
    c_arr1 = ??
    c_arr2 = ??
    for i in xrange(len(list_str1)):
        for j in xrange(len(list_str2)):
            if not strcmp(c_arr1[i], c_arr2[j]):
                do some funny stuff

Is there a way to convert the lists to C arrays?

I have read and tried "Cython - converting list of strings to char **", but that only throws errors.

Alienation answered 7/7, 2013 at 10:26 Comment(1)

I recently added a Python 3 solution to a similar task here; you may be interested in reading it. – Dichlorodiphenyltrichloroethane 14/11, 2021 at 12:49

If you're on Python 3, here's an update to @falsetru's answer (untested on Python 2).

cdef extern from "Python.h":
    char* PyUnicode_AsUTF8(object unicode)

from libc.stdlib cimport malloc, free
from libc.string cimport strcmp

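cdef char ** to_cstring_array(list_str):
    # The array itself is malloc'ed and must be freed by the caller;
    # the strings it points to stay owned by the Python str objects.
    cdef char **ret = <char **>malloc(len(list_str) * sizeof(char *))
    for i in range(len(list_str)):
        ret[i] = PyUnicode_AsUTF8(list_str[i])
    return ret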

def foo(list_str1, list_str2):
    cdef unsigned int i, j
    cdef char **c_arr1 = to_cstring_array(list_str1)
    cdef char **c_arr2 = to_cstring_array(list_str2)

    for i in range(len(list_str1)):
        for j in range(len(list_str2)):
            if i != j and strcmp(c_arr1[i], c_arr2[j]) == 0:
                print(i, j, list_str1[i])
    free(c_arr1)
    free(c_arr2)

foo(['hello', 'python', 'world'], ['python', 'rules'])

Warning: The pointer returned by PyUnicode_AsUTF8 is cached in the parent unicode object. This has two consequences:

The pointer is only valid as long as the parent unicode object is alive. Accessing it afterwards leads to undefined behavior (e.g. a possible segmentation fault).
The caller of PyUnicode_AsUTF8 isn't responsible for freeing the memory.
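
The ownership rule in this warning can also be illustrated outside Cython. The following is a minimal pure-Python sketch using the standard ctypes module, not the method used in the answer: it builds an array of char* from bytes objects and returns those objects together with the array, so that whoever holds the array also holds the memory it points to. The helper name to_c_char_p_array, the two-value return, and the owners name are assumptions made for illustration.

```python
import ctypes


def to_c_char_p_array(strings):
    """Hypothetical helper: return (array, owners) for a list of str.

    `array` is a C array of char*; `owners` are the bytes objects whose
    buffers those pointers refer to.
    """
    # Encode once; each bytes object owns a NUL-terminated buffer.
    owners = [s.encode("utf-8") for s in strings]
    # Build a char*[len(owners)] whose elements point into those buffers.
    array = (ctypes.c_char_p * len(owners))(*owners)
    return array, owners


arr, owners = to_c_char_p_array(["hello", "python", "world"])
# Reading an element copies the C string back into a new bytes object.
assert arr[1] == b"python"
```

Returning the owners next to the array makes the rule explicit: the array only borrows the string memory, so nothing has to free the strings separately, but they must outlive every use of the array.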

Herson answered 7/1, 2019 at 20:7 Comment(2)

It may be worth mentioning that since Python 3.7 the signature is const char * PyUnicode_AsUTF8(...): docs.python.org/3/c-api/unicode.html#c.PyUnicode_AsUTF8 – Birdlime 8/1, 2019 at 7:36

Thank you, please edit the answer if you feel it needs correction! – Herson 8/1, 2019 at 18:29

Try the following code. The to_cstring_array function is what you want.

from libc.stdlib cimport malloc, free
from libc.string cimport strcmp
from cpython.string cimport PyString_AsString

cdef char ** to_cstring_array(list_str):
    cdef char **ret = <char **>malloc(len(list_str) * sizeof(char *))
    for i in xrange(len(list_str)):
        ret[i] = PyString_AsString(list_str[i])
    return ret

def foo(list_str1, list_str2):
    cdef unsigned int i, j
    cdef char **c_arr1 = to_cstring_array(list_str1)
    cdef char **c_arr2 = to_cstring_array(list_str2)

    for i in xrange(len(list_str1)):
        for j in xrange(len(list_str2)):
            if i != j and strcmp(c_arr1[i], c_arr2[j]) == 0:
                print i, j, list_str1[i]
    free(c_arr1)
    free(c_arr2)

foo(['hello', 'python', 'world'], ['python', 'rules'])
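
To check the output of foo, the expected result can be worked out with a plain Python loop. This is a minimal sketch in which foo_reference is a hypothetical helper that returns the matches instead of printing them. It assumes the strings contain no embedded NUL characters, since strcmp would stop comparing at the first one.

```python
def foo_reference(list_str1, list_str2):
    """Pure-Python stand-in for foo(): collect the matches instead of printing."""
    matches = []
    for i, s1 in enumerate(list_str1):
        for j, s2 in enumerate(list_str2):
            # Same condition as the Cython loop: skip equal indices,
            # keep pairs whose strings compare equal.
            if i != j and s1 == s2:
                matches.append((i, j, s1))
    return matches


print(foo_reference(['hello', 'python', 'world'], ['python', 'rules']))
# [(1, 0, 'python')]
```

For the sample call, the only match is 'python' at index 1 of the first list and index 0 of the second, so the compiled foo should print a single line with 1, 0 and python.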

Foreside answered 7/7, 2013 at 11:23 Comment(4)

PyString_AsString is Python 2 only, so this solution will not work for Python 3. – Birdlime 16/3, 2018 at 12:31

@ead, besides PyString_AsString, there are xrange calls in the OP's code, so I thought it was okay to assume it's Python 2 code. Any suggestion to make this solution work in both Python 2 and 3 is welcome. – Foreside 16/3, 2018 at 12:49

I found that in Python 3 PyUnicode_AsUTF8 is supposed to be used, but I ran into the error: Storing unsafe C derivative of temporary Python reference. When I refactored the code, assigning PyUnicode_AsUTF8(list_str[i]) to a temp variable, I ran into another error: 'PyUnicode_AsUTF8' is not a constant, variable or function identifier. I have no clue how to proceed at this point. – Abacus 10/6, 2018 at 13:23

It's worth making clear that the memory for the stored char* is owned by the Python strings, so it is only valid while the Python strings are still alive. It's fine in this question, but I came here from a linked question that had copied the code and ran into problems. – Anamorphoscope 24/3, 2019 at 23:8
