Python: in-memory object database which supports indexing? [closed]

I'm doing some data munging which would be quite a bit simpler if I could stick a bunch of dictionaries in an in-memory database, then run simple queries against it.

For example, something like:

people = db([
    {"name": "Joe", "age": 16},
    {"name": "Jane", "favourite_color": "red"},
])
over_16 = db.filter(age__gt=16)
with_favourite_colors = db.filter(favourite_color__exists=True)

There are three confounding factors, though:

  • Some of the values will be Python objects, and serializing them is out of the question (too slow, breaks identity). Of course, I could work around this (e.g., by storing all the items in a big list, then serializing their indexes in that list… but that could take a fair bit of fiddling).
  • There will be thousands of items, and I will be running lookup-heavy operations (like graph traversals) against them, so it must be possible to perform efficient (i.e., indexed) queries.
  • As in the example, the data is unstructured, so systems which require me to predefine a schema would be tricky.

So, does such a thing exist? Or will I need to kludge something together?

Katowice answered 1/3, 2011 at 22:25 Comment(5)
I'm not sure any solution will meet all three of your requirements. In particular, things that will do the indexing for you are unlikely to work with unwieldy Python objects. In that case, you'll need to either serialise them or construct your own index.Zigzag
Why do you think indexing would be unwieldy with Python objects? The two main types of index — hash table and binary tree — only require objects to implement a hashing function or comparison operators, and Python objects can do both of those…Katowice
But, yes — it's possible that nobody has built a general purpose object database that behaves exactly the way I'd like it to… But I thought it would be worth asking about.Katowice
You might be interested in my answer here, or in Divmod's Axiom library (unfortunately their site suffered a major outage recently, so no link, but here's a saved version).Loganiaceous
Well it's 11 years late, but ducks does exactly what you are describing. Example in answer below.Aureolin

What about using an in-memory SQLite database via the sqlite3 standard library module, passing the special value :memory: for the connection? If you don't want to write your own SQL statements, you can always use an ORM, like SQLAlchemy, to access an in-memory SQLite database.
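
For illustration, a minimal sketch of that approach (the table layout and column names are just assumptions based on the question's example):

import sqlite3

# ":memory:" keeps the whole database in RAM; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, favourite_color TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("Joe", 16, None), ("Jane", None, "red")],
)
conn.execute("CREATE INDEX idx_age ON people (age)")  # indexed lookups

over_16 = conn.execute("SELECT * FROM people WHERE age > 16").fetchall()
with_colors = conn.execute(
    "SELECT * FROM people WHERE favourite_color IS NOT NULL"
).fetchall()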

EDIT: I noticed you stated that the values may be Python objects, and that you need to avoid serialization. Storing arbitrary Python objects in a database generally requires serializing them in some form.

Can I propose a practical solution if you must keep those two requirements? Why not just use Python dictionaries as indices into your collection of Python dictionaries? It sounds like you will have idiosyncratic needs for building each of your indices: figure out which values you're going to query on, then write a function to generate an index for each. The possible values for one key in your list of dicts will be the keys of an index; the values of the index will be lists of dictionaries. Query the index by giving the value you're looking for as the key.

import collections
import itertools

def make_indices(dicts):
    color_index = collections.defaultdict(list)
    age_index = collections.defaultdict(list)
    for d in dicts:
        if 'favorite_color' in d:
            color_index[d['favorite_color']].append(d)
        if 'age' in d:
            age_index[d['age']].append(d)
    return color_index, age_index


def make_data_dicts():
    ...


data_dicts = make_data_dicts()
color_index, age_index = make_indices(data_dicts)
# Query for everyone who has a favorite color: all values in the color index
with_color_dicts = list(
        itertools.chain.from_iterable(color_index.values()))
# Query for people over 16: only index entries whose key (age) exceeds 16
over_16 = list(
        itertools.chain.from_iterable(
            v for k, v in age_index.items() if k > 16)
)
Ta answered 1/3, 2011 at 22:31 Comment(4)
I'd considered that… But it would require a bit of engineering to make it all work… I would imagine something like this: paste.pocoo.org/show/346681Katowice
So if nothing else turns up, I may look at going with that… But I'd rather avoid it.Katowice
There's a question of how you're going to get back the original dictionary object based off the IDs returned in your SQL query. I'd check https://mcmap.net/q/57895/-get-object-by-id-duplicateTa
LiteBox implements this idea -- it stores the relevant attributes in a SQLite DB along with an object ID. See example in answer below.Aureolin

If the in-memory database solution ends up being too much work, here is a method for filtering your lists of dictionaries yourself that you may find useful.

The get_filter function takes arguments that define how you want to filter a dictionary, and returns a function that can be passed into the built-in filter function to filter a list of dictionaries.

import operator

def get_filter(key, op=None, comp=None, inverse=False):
    # This will invert the boolean returned by the function 'op' if 'inverse == True'
    result = lambda x: not x if inverse else x
    if op is None:
        # Without any function, just see if the key is in the dictionary
        return lambda d: result(key in d)

    if comp is None:
        # If 'comp' is None, assume the function takes one argument
        return lambda d: result(op(d[key])) if key in d else False

    # Use 'comp' as the second argument to the function provided
    return lambda d: result(op(d[key], comp)) if key in d else False

people = [{'age': 16, 'name': 'Joe'}, {'name': 'Jane', 'favourite_color': 'red'}]

print(list(filter(get_filter("age", operator.gt, 15), people)))
# [{'age': 16, 'name': 'Joe'}]
print(list(filter(get_filter("name", operator.eq, "Jane"), people)))
# [{'name': 'Jane', 'favourite_color': 'red'}]
print(list(filter(get_filter("favourite_color", inverse=True), people)))
# [{'age': 16, 'name': 'Joe'}]

This is pretty easily extensible to more complex filtering, for example to filter based on whether or not a value is matched by a regex:

import re

p = re.compile("[aeiou]{2}")  # matches two lowercase vowels in a row
print(list(filter(get_filter("name", p.search), people)))
# [{'age': 16, 'name': 'Joe'}]
Meninges answered 1/3, 2011 at 23:24 Comment(1)
This solution gives O(n) performance per query, for n dictionaries, because filter will iterate through every single dictionary. Performance will be unacceptable for large numbers of dictionaries and queries, which the question author indicates is the case. This is why a hash-based (amortized O(1) per query) solution is critical. It's the reason databases have indices for this kind of work, themselves.Ta

The only solution I know of is a package I stumbled across a few years ago on PyPI, PyDbLite. It's okay, but there are a few issues:

  1. It still wants to serialize everything to disk, as a pickle file. But that was simple enough for me to rip out. (It's also unnecessary: if the objects you insert are serializable, so is the collection as a whole.)
  2. The basic record type is a dictionary, into which it inserts its own metadata: two ints, under the keys __id__ and __version__.
  3. The indexing is very simple, based only on the values of the record dictionary. If you want something more complicated, like an index based on an attribute of an object in the record, you'll have to code it yourself. (That's something I've meant to do myself, but never got around to.)

The author does seem to be working on it occasionally. There are some new features since I last used it, including some nice syntax for complex queries.

Assuming you rip out the pickling (and I can tell you what I did), your example would be (untested code):

from PyDbLite import Base

db = Base()
db.create("name", "age", "favourite_color")

# You can insert records as either named parameters
# or in the order of the fields
db.insert(name="Joe", age=16, favourite_color=None)
db.insert("Jane", None, "red")

# These should return an object you can iterate over
# to get the matching records.  These are unindexed queries.
#
# The first might throw because of the None in the second record
over_16 = db("age") > 16
with_favourite_colors = db("favourite_color") != None

# Or you can make an index for faster queries
db.create_index("favourite_color")
with_favourite_color_red = db._favourite_color["red"]

Hopefully it will be enough to get you started.

Leggett answered 1/3, 2011 at 23:53 Comment(1)
PyDbLite looks very interesting, esp the technique it uses to enable the nice syntax. Thanks for pointing it out.Beulabeulah

As far as "identity" anything that is hashable you should be able to compare, to keep track of object identity.

Zope Object Database (ZODB): http://www.zodb.org/

PyTables works well: http://www.pytables.org/moin

Metakit for Python also works well: http://equi4.com/metakit/python.html
It supports columns and sub-columns, but not unstructured data.

Research "Stream Processing", if your data sets are extremely large this may be useful: http://www.trinhhaianh.com/stream.py/

Any in-memory database that can be serialized (written to disk) is going to have your identity problem. I would suggest representing the data you want to store as native types (list, dict) instead of objects, if at all possible.

Keep in mind that NumPy was designed to perform complex operations on in-memory data structures, and could possibly be a part of your solution if you decide to roll your own.
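
A rough, hedged sketch of that roll-your-own direction (not from the original answer; the columnar layout and field names are just assumptions based on the question's example):

import numpy as np

# Columnar representation of the records; NaN / empty string mark missing fields.
names = np.array(["Joe", "Jane"])
ages = np.array([16, np.nan])
colors = np.array(["", "red"])

# Vectorized boolean masks act as simple, throwaway indices.
over_16 = names[np.nan_to_num(ages, nan=-1) > 16]
with_colors = names[colors != ""]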

Bencion answered 2/3, 2011 at 0:47 Comment(0)

I wrote a simple module called Jsonstore that solves (2) and (3). Here's how your example would go:

from jsonstore import EntryManager
from jsonstore.operators import GreaterThan, Exists

db = EntryManager(':memory:')
db.create(name='Joe', age=16)
db.create({'name': 'Jane', 'favourite_color': 'red'})  # alternative syntax

db.search({'age': GreaterThan(16)})
db.search(favourite_color=Exists())  # again, 2 different syntaxes
Aluminous answered 13/11, 2012 at 6:37 Comment(0)

Not sure if it complies with all your requirements, but TinyDB (using in-memory storage) is probably also worth a try:

>>> from tinydb import TinyDB, Query
>>> from tinydb.storages import MemoryStorage
>>> db = TinyDB(storage=MemoryStorage)
>>> db.insert({'name': 'John', 'age': 22})
>>> User = Query()
>>> db.search(User.name == 'John')
[{'name': 'John', 'age': 22}]

Its simplicity and powerful query engine makes it a very interesting tool for some use cases. See http://tinydb.readthedocs.io/ for more details.

Connecticut answered 20/2, 2017 at 19:15 Comment(0)

If you are willing to work around serialization, MongoDB could work for you. PyMongo provides an interface almost identical to what you describe. If you do decide to serialize, the hit won't be as bad since MongoDB is memory-mapped.
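
For reference, a rough sketch of what those queries look like with PyMongo (this assumes a local mongod instance is running; the database and field names just follow the question's example):

from pymongo import MongoClient

client = MongoClient()              # assumes a local mongod instance
people = client.test.people
people.insert_many([
    {"name": "Joe", "age": 16},
    {"name": "Jane", "favourite_color": "red"},
])

over_16 = list(people.find({"age": {"$gt": 16}}))
with_colors = list(people.find({"favourite_color": {"$exists": True}}))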

Brahmana answered 1/3, 2011 at 22:32 Comment(1)
Not only would it require serialization, but it would also mean running a separate database server =\Katowice

It should be possible to do what you want with just isinstance(), hasattr(), getattr() and setattr().

However, things are going to get fairly complicated before you are done!

I suppose one could store all the objects in a big list, then run a query on each object, determining what it is and looking for a given attribute or value, then return the value and the object as a list of tuples. Then you could sort on your return values pretty easily. copy.deepcopy will be your best friend and your worst enemy.
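
A minimal sketch of that linear-scan idea (the helper name and signature are mine, not the answer's):

def query(objects, attr_name, predicate):
    """Scan every object; collect (value, object) tuples whose attribute passes."""
    results = []
    for obj in objects:
        if isinstance(obj, dict):
            if attr_name in obj and predicate(obj[attr_name]):
                results.append((obj[attr_name], obj))
        elif hasattr(obj, attr_name):
            value = getattr(obj, attr_name)
            if predicate(value):
                results.append((value, obj))
    return results

people = [{"name": "Joe", "age": 16}, {"name": "Jane", "favourite_color": "red"}]
over_15 = query(people, "age", lambda v: v > 15)  # [(16, {'name': 'Joe', 'age': 16})]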

Sounds like fun! Good luck!

Pitchy answered 1/3, 2011 at 22:59 Comment(2)
Huh? What do isinstance, deepcopy and friends have to do with this? Also, I realize there are a few ways I could implement it… But I'd rather find some code someone else has already written, then work off of that.Katowice
Well, assuming that one had a concatenation of objects, it would be handy to tell what sort of object they were before attempting to interrogate them. You did say that some of the values would be python objects. Deepcopy to avoid side-effects if you had to clone and change your objects. Sorry if it was somehow offensive, I honestly don't know how to gauge anyone else's knowledge having so little of my own. :-)Pitchy

I started developing one yesterday and it isn't published yet. It indexes your objects and allows you to run fast queries. All data is kept in RAM, and I'm thinking about smart load and save methods. For testing purposes it loads and saves through cPickle.

Let me know if you are still interested.

Downs answered 6/12, 2013 at 14:42 Comment(0)

ducks is exactly what you are describing.

  • It builds indexes on Python objects
  • It does not serialize or persist anything
  • Missing attributes are handled correctly
  • It uses C libraries so it's very fast and RAM-efficient

pip install ducks

from ducks import Dex, ANY

objects = [
    {"name": "Joe", "age": 16},
    {"name": "Jane", "favourite_color": "red"},
]


# Build the index
dex = Dex(objects, ['name', 'age', 'favourite_color'])

# Look up by any combination of attributes
dex[{'age': {'>=': 16}}]  # Returns Joe

# Match the special value ANY to find all objects with the attribute
dex[{'favourite_color': ANY}] # Returns Jane

This example uses dicts, but ducks works on any object type.

Aureolin answered 13/7, 2022 at 9:41 Comment(0)

Throwing another option into the mix: odex (I’m the author)

from odex import IndexedSet, attr, and_

class X:
    def __init__(self, a, b):
        self.a = a
        self.b = b

iset = IndexedSet(
    [
        X(a=1, b=4),
        X(a=2, b=5),
        X(a=2, b=6),
        X(a=3, b=7),
    ], 
    indexes=["a"]
)

# Filter objects with SQL-like expressions:
iset.filter("a = 2 AND b = 5") == {X(a=2, b=5)}

# Or, using the fluent interface:
iset.filter(
    and_(
        attr("a").eq(2),
        attr("b").eq(5)
    )
) == {X(a=2, b=5)}

Similar to ducks, this builds indexes on Python objects and doesn't serialize or persist anything.

Similar to SQLite, odex supports full-fledged logical expressions.

Peacemaker answered 21/7 at 15:1 Comment(0)
