What exactly is the point of memoryview in Python?

Checking the documentation on memoryview:

memoryview objects allow Python code to access the internal data of an object that supports the buffer protocol without copying.

class memoryview(obj)

Create a memoryview that references obj. obj must support the buffer protocol. Built-in objects that support the buffer protocol include bytes and bytearray.

Then we are given the sample code:

>>> v = memoryview(b'abcefg')
>>> v[1]
98
>>> v[-1]
103
>>> v[1:4]
<memory at 0x7f3ddc9f4350>
>>> bytes(v[1:4])
b'bce'

Quotation over; now let's take a closer look:

>>> b = b'long bytes stream'
>>> b.startswith(b'long')
True
>>> v = memoryview(b)
>>> vsub = v[5:]
>>> vsub.startswith(b'bytes')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'memoryview' object has no attribute 'startswith'
>>> bytes(vsub).startswith(b'bytes')
True
>>> 

So what I gather from the above:

We create a memoryview object to expose the internal data of a buffer object without copying; however, in order to do anything useful with it (by calling the methods the underlying object provides), we have to create a copy anyway!

Usually memoryview (or the old buffer object) is needed when we have a large object whose slices can also be large. The need for better efficiency arises when we take large slices, or take small slices a large number of times.

With the above scheme, I don't see how it can be useful for either situation, unless someone can explain to me what I'm missing here.

Edit1:

We have a large chunk of data, and we want to process it by advancing through it from start to end, for example extracting tokens from the start of a string buffer until the buffer is consumed. In C terms, this is advancing a pointer through the buffer, and the pointer can be passed to any function expecting the buffer type. How can something similar be done in Python?

People suggest workarounds; for example, many string and regex functions take position arguments that can be used to emulate advancing a pointer. There are two issues with this: first, it's a workaround, so you are forced to change your coding style to overcome the shortcoming; and second, not all functions have position arguments. For example, regex functions and startswith() do, but encode()/decode() don't.
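For concreteness, here is a minimal sketch of that position-argument style (illustrative data only):

import re

data = b'long bytes stream'
pos = 5  # pretend we have already consumed b'long '

# startswith() accepts a start offset, so no slice/copy is needed
print(data.startswith(b'bytes', pos))      # True

# compiled regex patterns can also match from a given position
token_re = re.compile(rb'[a-z]+')
print(token_re.match(data, pos).group())   # b'bytes'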

Others might suggest loading the data in chunks, or processing the buffer in small segments larger than the maximum token. Okay, so we are aware of these possible workarounds, but we are supposed to be able to work in a more natural way in Python without bending our coding style to fit the language, aren't we?

Edit2:

A code sample would make things clearer. This is what I want to do, and what I assumed memoryview would allow me to do at first glance. Let's use pmview (proper memory view) for the functionality I'm looking for:

tokens = []
xlarge_str = get_string()
xlarge_str_view = pmview(xlarge_str)

while True:
    token = get_token(xlarge_str_view)
    if token:
        xlarge_str_view = xlarge_str_view.vslice(len(token))
        # vslice: view slice; default stop parameter is the end of the buffer
        tokens.append(token)
    else:
        break
Garnetgarnett answered 6/9, 2013 at 10:28 Comment(2)
possible duplicate of When should a memoryview be used?Covin
The answer in the referenced question doesn't provide detail. Nor does the question touch on potential issues from a learner's angle.Garnetgarnett

One reason memoryviews are useful is that they can be sliced without copying the underlying data, unlike bytes/str.

For example, take the following toy benchmark.

import time

# bytes: each b[1:] copies all the remaining data, so the whole loop is O(n^2)
for n in (100000, 200000, 300000, 400000):
    data = b'x'*n
    start = time.time()
    b = data
    while b:
        b = b[1:]
    print(f'     bytes {n} {time.time() - start:0.3f}')

# memoryview: each b[1:] just creates a new view of the same buffer, so the loop is O(n)
for n in (100000, 200000, 300000, 400000):
    data = b'x'*n
    start = time.time()
    b = memoryview(data)
    while b:
        b = b[1:]
    print(f'memoryview {n} {time.time() - start:0.3f}')

On my computer, I get

     bytes 100000 0.211
     bytes 200000 0.826
     bytes 300000 1.953
     bytes 400000 3.514
memoryview 100000 0.021
memoryview 200000 0.052
memoryview 300000 0.043
memoryview 400000 0.077

You can clearly see the quadratic complexity of the repeated bytes slicing. Even with only 400000 iterations, it's already unmanageable. Meanwhile, the memoryview version has linear complexity and is lightning fast.

Edit: Note that this was done in CPython. There was a bug in PyPy up to 4.0.1 that caused memoryviews to have quadratic performance.
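To make "without copying the underlying data" concrete, here is a small extra illustration (not part of the original answer) using a writable bytearray; writing through a sliced view mutates the original buffer:

data = bytearray(b'long bytes stream')
view = memoryview(data)

sub = view[5:10]    # a view of b'bytes'; nothing is copied
sub[:] = b'BYTES'   # writing through the slice...
print(data)         # ...changes the original: bytearray(b'long BYTES stream')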

Mani answered 13/12, 2015 at 22:51 Comment(8)
This answer doesn't address the fact that to do anything "useful" as the asker states you have to use bytes() which copies the object...Anjanette
@citizen2077 As my example shows, it is useful for doing intermediate manipulations efficiently, even if you ultimately copy it to a bytes object.Mani
"without copying the underlying data", so b[1:] is like returning a pointer/reference (whatever you name it) starting from index 1?Trews
That is correct. It's a bit like returning a (pointer, length) pair actuallyMani
Is it true that memoryview.tolist() does NOT copy the underlying data either, but rigs the list mechanism to operate on the uncopied data? At least until you modify it. It's not clear from the docs, but seems like it should be the case. Is there a test to verify this (other than performance)?Tarantella
memoryview.tolist() DOES COPY the underlying data best I can tell from reading cpython/Objects/memoryobject.c ; tolist_base() calls unpack_single() ... PyList_SET_ITEM() for each item.Tarantella
"As my example shows, it is useful for doing intermediate manipulations efficiently" Your example shows no such thing, you don't do any manipulation at all you just do indexing with a different offset than zero.Blane
Slicing was the intermediate operation I was talking about in that comment. The point is that you can slice it repeatedly without invoking O(n^2) copies.Mani

memoryview objects are great when you need subsets of binary data that only need to support indexing. Instead of having to take slices (and create new, potentially large objects) to pass to another API, you can just take a memoryview object.

One such API example would be the struct module. Instead of passing in a slice of the large bytes object to parse out packed C values, you pass in a memoryview of just the region you need to extract values from.

memoryview objects, in fact, support struct unpacking natively; you can target a region of the underlying bytes object with a slice, then use .cast() to 'interpret' the underlying bytes as long integers, or floating point values, or n-dimensional lists of integers. This makes for very efficient binary file format interpretations, without having to create more copies of the bytes.
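As a rough sketch of what that looks like in practice (the buffer contents and the offset here are invented purely for illustration):

import struct

data = bytes(range(256)) * 4   # stand-in for a large binary blob
view = memoryview(data)

# struct can unpack straight from a memoryview (or a slice of one): no bytes copy needed
(value,) = struct.unpack_from('<I', view, 16)   # little-endian uint32 at byte offset 16

# .cast() reinterprets the same memory, e.g. as unsigned 32-bit integers
as_uint32 = view.cast('I')   # total length must be a multiple of the item size
print(value, as_uint32[4])   # the same four bytes; values match on little-endian machines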

Erudition answered 6/9, 2013 at 10:31 Comment(13)
And what do you do when you need subsets that support more than indexing?!Garnetgarnett
@BaselShishani: not use a memoryview. You are dealing with text, not binary data then.Erudition
Yes, dealing with text. So we don't use memoryview, is there an alternative?Garnetgarnett
What problem are you trying to solve? Are the substrings you need to test that large?Erudition
@Martijn: Another question on your reply: 'you pass in a memoryview of just the region you need to extract values from': how do you pass a memoryview of just a region? What I see is that memoryview takes only a full object argument.Garnetgarnett
@BaselShishani: slicing a memoryview returns a new memoryview covering just that region.Erudition
@BaselShishani: you can use io.StringIO: docs.python.org/3/library/io.html#io.StringIOCirrostratus
@MartijnPieters So you're saying that casting elements one by one of a memoryview object obtained, for example, from a bytes object is faster than slicing the bytes itself, or converting the entire memoryview slice to bytes? Do you have a small benchmark?Cirrostratus
@MarcoSulla: a memory view is literally a view on a chunk of memory. Conversion to bytes produces a copy, so another area of memory needs reserving and everything is copied across. 'Casting' only happens insofar as the values are wrapped in a Python object to represent the type; you'd have to do that for a bytes value too (e.g. indexing a bytes object has to create a Python int object for each value). Not copying the data is the time saver.Erudition
@MarcoSulla: and the answer by Antimony already has a benchmark, so I refer you to that.Erudition
I read Antimony's answer. It only benchmarks the slicing, not the cast(). Slicing the bytes is not identical to slicing the memoryview; you have to cast(), as you said.Cirrostratus
@MarcoSulla: by cast(), do you mean memoryview.cast()? That simply creates a new view object on the same area memory with different parameters for how to treat values if you index values (at which point a boxed value is created) or assign to an index (unboxing the Python object into bytes). It's exactly the same operation as creating a struct.Struct() object, no actual conversions take place.Erudition
@MarcoSulla: e.g. memoryviewobj.cast("Q") is equivalent to using a struct.Struct("Q") object then using .pack(intvalue) or .unpack(bytesvalue), converting between int and bytes. Except there is no need to first create a bytes copy of the memoryview.Erudition

Let me make plain where the glitch in understanding lies here.

The questioner, like myself, expected to be able to create a memoryview that selects a slice of an existing array (for example a bytes or bytearray). We therefore expected something like:

desired_slice_view = memoryview(existing_array, start_index, end_index)

Alas, there is no such constructor, and the docs don't point out what to do instead.

The key is that you have to first make a memoryview that covers the entire existing array. From that memoryview you can create a second memoryview that covers a slice of the existing array, like this:

whole_view = memoryview(existing_array)
desired_slice_view = whole_view[10:20]

In short, the purpose of the first line is simply to provide an object whose slice implementation (__getitem__) returns a memoryview.
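A quick check (added here purely for illustration) makes that concrete: slicing bytes gives you new bytes, while slicing a memoryview gives you another memoryview:

existing_array = b'abcdefghij'

print(type(existing_array[2:5]))               # <class 'bytes'>       (a copy)
print(type(memoryview(existing_array)[2:5]))   # <class 'memoryview'>  (no copy)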

That might seem untidy, but one can rationalize it a couple of ways:

  1. Our desired output is a memoryview that is a slice of something. Normally we get a sliced object from an object of that same type, by using the slice operator [10:20] on it. So there's some reason to expect that we need to get our desired_slice_view from a memoryview, and that therefore the first step is to get a memoryview of the whole underlying array.

  2. The naive expectation of a memoryview constructor with start and end arguments fails to consider that the slice specification really needs all the expressivity of the usual slice operator (including things like [3::2] or [:-4], etc.). There is no way to just use the existing (and understood) operator in that one-liner constructor. You can't attach it to the existing_array argument, as that will make a slice of that array, instead of telling the memoryview constructor some slice parameters. And you can't use the operator itself as an argument, because it's an operator and not a value or object.

Conceivably, a memoryview constructor could take a slice object:

desired_slice_view = memoryview(existing_array, slice(1, 5, 2) )

... but that's not very satisfactory, since users would have to learn about the slice object and what its constructor's parameters mean, when they already think in terms of the slice operator's notation.
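Incidentally, the indexing syntax already accepts a slice object, so anyone who does prefer to build slice objects can pass one to a memoryview's [] today; a small sketch:

existing_array = bytes(range(10))
s = slice(1, 5, 2)

desired_slice_view = memoryview(existing_array)[s]   # same as memoryview(existing_array)[1:5:2]
print(desired_slice_view.tolist())                   # [1, 3]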

Liquefy answered 11/2, 2019 at 10:46 Comment(1)
Thanks, that was a helpful clarification.Drown

Here is the benchmark as Python 3 code.

#!/usr/bin/env python3

import time
for n in (100000, 200000, 300000, 400000):
    data = b'x'*n
    start = time.time()
    b = data
    while b:
        b = b[1:]
    print ('bytes {:d} {:f}'.format(n,time.time()-start))

for n in (100000, 200000, 300000, 400000):
    data = b'x'*n
    start = time.time()
    b = memoryview(data)
    while b:
        b = b[1:]
    print ('memview {:d} {:f}'.format(n,time.time()-start))
Palestrina answered 24/7, 2018 at 21:55 Comment(0)

Excellent example by Antimony. Actually, in Python 3, you can replace data = 'x'*n with data = bytes(n) and add parentheses to the print statements, as below:

import time
for n in (100000, 200000, 300000, 400000):
    #data = 'x'*n
    data = bytes(n)
    start = time.time()
    b = data
    while b:
        b = b[1:]
    print('bytes', n, time.time()-start)

for n in (100000, 200000, 300000, 400000):
    #data = 'x'*n
    data = bytes(n)
    start = time.time()
    b = memoryview(data)
    while b:
        b = b[1:]
    print('memoryview', n, time.time()-start)
Bereniceberenson answered 28/2, 2019 at 7:57 Comment(0)

The following code might explain it better. Suppose you don't have any control over how foreign_func is implemented. You can either call it with bytes directly or with a memoryview of those bytes:

from pandas import DataFrame
from timeit import timeit


def foreign_func(data):
    def _foreign_func(data):
        # Did you know that a memoryview slice can be compared to bytes directly?
        assert data[:3] == b'xxx'
    _foreign_func(data[3:-3])


# timeit
bytes_times = []
memoryview_times = []
data_lens = []
for n in range(1, 10):
    data = b'x' * 10 ** n
    data_lens.append(len(data))
    bytes_times.append(timeit(
        'foreign_func(data)', globals=globals(), number=10))
    memoryview_times.append(timeit(
        'foreign_func(memoryview(data))', globals=globals(), number=10))


# output
df = DataFrame({
    'data_len': data_lens,
    'memoryview_time': memoryview_times,
    'bytes_time': bytes_times
})
df['times_faster'] = df['bytes_time'] / df['memoryview_time']
print(df)
df[['memoryview_time', 'bytes_time']].plot()

Result:

     data_len  memoryview_time  bytes_time   times_faster
0          10         0.000019    0.000012       0.672033
1         100         0.000016    0.000011       0.690320
2        1000         0.000016    0.000013       0.833314
3       10000         0.000016    0.000037       2.387100
4      100000         0.000016    0.000086       5.300594
5     1000000         0.000018    0.001134      63.357466
6    10000000         0.000009    0.028672    3221.528855
7   100000000         0.000009    0.258822   28758.547214
8  1000000000         0.000009    2.779704  292601.789177

Calling with bytes gets slower roughly in proportion to the data size, since each call copies the sliced data, while the memoryview version stays essentially constant (for small inputs the memoryview wrapper is actually slightly slower).

Fernandofernas answered 9/9, 2022 at 10:27 Comment(0)
