Passing a structured numpy array with strings to a cython function

I am attempting to create a function in cython that accepts a numpy structured array or record array by defining a cython struct type. Suppose I have the data:

a = np.recarray(3, dtype=[('a', np.float32),  ('b', np.int32), ('c', '|S5'), ('d', '|S3')])
a[0] = (1.1, 1, 'this\0', 'to\0')
a[1] = (2.1, 2, 'that\0', 'ta\0')
a[2] = (3.1, 3, 'dogs\0', 'ot\0')

(Note: the problem described below occurs with or without the null terminator)

I then have the cython code:

import numpy as np
cimport numpy as np

cdef packed struct tstruct:
    np.float32_t a
    np.int32_t b
    char[5] c
    char[3] d

def test_struct(tstruct[:] x):
    cdef:
        int k
        tstruct y

    for k in xrange(3):
        y = x[k]
        print y.a, y.b, y.c, y.d

When I try to run test_struct(a), I get the error:

ValueError: Expected a dimension of size 5, got 8

If the fields in the array and the corresponding struct are reordered so that the string fields are not adjacent to each other, the function works as expected. It appears that the Cython function does not detect the boundary between the c and d fields correctly and instead treats them as a single char array whose length is the sum of the two lengths.
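To make the numbers in the error message concrete, here is a minimal sketch of the packed layout using only Python's standard struct module. It is a stand-in for the layout of tstruct, not Cython's actual type-checking code; the FIELDS list and the packed_layout helper are hypothetical names introduced for illustration.

```python
import struct

# Field formats mirroring tstruct. The "<" prefix selects standard sizes
# with no padding, which corresponds to a packed struct.
FIELDS = [("a", "f"), ("b", "i"), ("c", "5s"), ("d", "3s")]


def packed_layout(fields):
    """Return {name: (offset, size)} for a packed record (hypothetical helper)."""
    layout = {}
    prefix = "<"
    for name, fmt in fields:
        offset = struct.calcsize(prefix)    # bytes used by all earlier fields
        size = struct.calcsize("<" + fmt)   # bytes used by this field
        layout[name] = (offset, size)
        prefix += fmt
    return layout


layout = packed_layout(FIELDS)
record_size = struct.calcsize("<" + "".join(fmt for _, fmt in FIELDS))

for name, (offset, size) in layout.items():
    print(name, "offset", offset, "size", size)
print("record size:", record_size)
print("size of c plus size of d:", layout["c"][1] + layout["d"][1])
```

The combined size of c and d is 8 bytes, which matches the "got 8" in the error and is consistent with the two adjacent char arrays being merged during the check.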

Short of reshuffling the data (which is possible but not ideal), is there another way to pass a recarray with fixed length string data into Cython?
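For reference, the reshuffling workaround mentioned above could look roughly like the sketch below, which copies the data into a new array whose string fields are separated by b. The reorder_fields helper is a hypothetical name, and the Cython struct would need its members declared in the same new order (a, c, b, d).

```python
import numpy as np

src_dtype = np.dtype([('a', np.float32), ('b', np.int32),
                      ('c', 'S5'), ('d', 'S3')])
a = np.zeros(3, dtype=src_dtype)
a[0] = (1.1, 1, b'this', b'to')
a[1] = (2.1, 2, b'that', b'ta')
a[2] = (3.1, 3, b'dogs', b'ot')


def reorder_fields(arr, new_order):
    """Copy arr into a new packed structured array with fields in new_order.

    Hypothetical helper: this makes a full copy of the data.
    """
    new_dtype = np.dtype([(name, arr.dtype[name]) for name in new_order])
    out = np.empty(arr.shape, dtype=new_dtype)
    for name in new_order:
        out[name] = arr[name]   # field-by-field copy
    return out


# Put b between c and d so the two string fields are no longer adjacent.
b = reorder_fields(a, ['a', 'c', 'b', 'd'])
print(b.dtype)
print(b)
```

This copies every record, which is the cost that makes reshuffling "possible but not ideal" for large arrays.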

Update: This appears to be a potential Cython bug. See the following discussion on the Cython google group that hints at where the problem is arising:

https://groups.google.com/forum/#!topic/cython-users/TbLbXdi0_h4

Update 2: This bug has been fixed in the master cython branch on Github as of Feb 23, 2014 and the patch is slated for inclusion in v0.20.2: https://github.com/cython/cython/commit/58d9361e0a6d4cb3d4e87775f78e0550c2fea836

Dosi answered 29/1, 2014 at 15:26 Comment(5)
I don't have a solution, just commiseration. I get the same error with cython 0.20. And it doesn't help to use a structured array with a dtype created using align=True (see docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html). - Convocation
@WarrenWeckesser I posted something similar to the cython list, so I'm hoping I might get some traction there. - Dosi
Hm, might there be some magic that adds those two strings together? (This would make it 8 instead of 5.) What would happen if you inserted, say, an int data type between the two strings? Or left out the last string, just to see what happens? - Favata
@Favata As I mentioned in the original question, if the two strings are not adjacent to one another (e.g. you were to place b between c and d) then everything works as expected. The problem is that the boundary between adjacent strings does not appear to be detected properly. - Dosi
@Favata No magic really. Cython produces code for the type checking. I followed it and found the piece that loops over the elements of a structure to figure out the length, etc. For some reason, when two strings come one after the other, it just adds up the lengths of both fields. It's not entirely clear to me why, or where exactly the bug is, as I'm not familiar with Cython's code. - Pumpernickel

This was a bug that has been fixed in the master cython branch on Github as of Feb 22, 2014 and the patch is slated for inclusion in v0.20.2: https://github.com/cython/cython/commit/58d9361e0a6d4cb3d4e87775f78e0550c2fea836

Dosi answered 14/3, 2014 at 20:29
