Improper use of __new__ to generate class instances?
Asked Answered
W

3

13

I'm creating some classes for dealing with filenames in various types of file shares (nfs, afp, s3, local disk) etc. I get as user input a string that identifies the data source (i.e. "nfs://192.168.1.3" or "s3://mybucket/data") etc.

I'm subclassing the specific filesystems from a base class that has common code. Where I'm confused is in the object creation. What I have is the following:

import os

class FileSystem(object):
    class NoAccess(Exception):
        pass

    def __new__(cls,path):
        if cls is FileSystem:
            if path.upper().startswith('NFS://'): 
                return super(FileSystem,cls).__new__(Nfs)
            else: 
                return super(FileSystem,cls).__new__(LocalDrive)
        else:
            return super(FileSystem,cls).__new__(cls,path)

    def count_files(self):
        raise NotImplementedError

class Nfs(FileSystem):
    def __init__ (self,path):
        pass

    def count_files(self):
        pass

class LocalDrive(FileSystem):
    def __init__(self,path):
        if not os.access(path, os.R_OK):
            raise FileSystem.NoAccess('Cannot read directory')
        self.path = path

    def count_files(self):
        return len([x for x in os.listdir(self.path) if os.path.isfile(os.path.join(self.path, x))])

data1 = FileSystem('nfs://192.168.1.18')
data2 = FileSystem('/var/log')

print type(data1)
print type(data2)

print data2.count_files()

I thought this would be a good use of __new__ but most posts I read about it's use discourage it. Is there a more accepted way to approach this problem?

Werth answered 20/1, 2015 at 0:5 Comment(4)
Can you explain further what you're hoping to achieve using __new__() in this way?Zetland
I edited out the use of pre and code. Not sure if it was you or the editor but it was messing up the whitespaceDownbeat
Patrick, I'm looking for a way to make a class that is agnostic of the source of data. The example I use above is a simple case of just counting number of files in a directory. I don't want the user to have to worry about how that is handled behind the scenes.Werth
@Segsfault: Could you provide some links to examples of posts you read discouraging the use of __new__ to do this? Thx.Millhon
M
34

I don't think using __new__() to do what you want is improper. In other words, I disagree with the accepted answer to this question which claims factory functions are always the "best way to do it".

If you really want to avoid using it, then the only options are metaclasses or a separate factory function/method (however see Python 3.6+ Update below). Given the choices available, making the __new__() method one — since it's static by default — is a perfectly sensible approach.

That said, below is what I think is an improved version of your code. I've added a couple of class methods to assist in automatically finding all the subclasses. These support the most important way in which it's better — which is now adding subclasses doesn't require modifying the __new__() method. This means it's now easily extensible since it effectively supports what you could call virtual constructors.

A similar implementation could also be used to move the creation of instances out of the __new__() method into a separate (static) factory method — so in one sense the technique shown is just a relatively simple way of coding an extensible generic factory function regardless of what name it's given.

# Works in Python 2 and 3.

import os
import re

class FileSystem(object):
    class NoAccess(Exception): pass
    class Unknown(Exception): pass

    # Regex for matching "xxx://" where x is any non-whitespace character except for ":".
    _PATH_PREFIX_PATTERN = re.compile(r'\s*([^:]+)://')

    @classmethod
    def _get_all_subclasses(cls):
        """ Recursive generator of all class' subclasses. """
        for subclass in cls.__subclasses__():
            yield subclass
            for subclass in subclass._get_all_subclasses():
                yield subclass

    @classmethod
    def _get_prefix(cls, s):
        """ Extract any file system prefix at beginning of string s and
            return a lowercase version of it or None when there isn't one.
        """
        match = cls._PATH_PREFIX_PATTERN.match(s)
        return match.group(1).lower() if match else None

    def __new__(cls, path):
        """ Create instance of appropriate subclass using path prefix. """
        path_prefix = cls._get_prefix(path)

        for subclass in cls._get_all_subclasses():
            if subclass.prefix == path_prefix:
                # Using "object" base class method avoids recursion here.
                return object.__new__(subclass)
        else:  # No subclass with matching prefix found (& no default defined)
            raise FileSystem.Unknown(
                'path "{}" has no known file system prefix'.format(path))

    def count_files(self):
        raise NotImplementedError


class Nfs(FileSystem):
    prefix = 'nfs'

    def __init__ (self, path):
        pass

    def count_files(self):
        pass


class LocalDrive(FileSystem):
    prefix = None  # Default when no file system prefix is found.

    def __init__(self, path):
        if not os.access(path, os.R_OK):
            raise FileSystem.NoAccess('Cannot read directory')
        self.path = path

    def count_files(self):
        return sum(os.path.isfile(os.path.join(self.path, filename))
                     for filename in os.listdir(self.path))


if __name__ == '__main__':

    data1 = FileSystem('nfs://192.168.1.18')
    data2 = FileSystem('c:/')  # Change as necessary for testing.

    print(type(data1).__name__)  # -> Nfs
    print(type(data2).__name__)  # -> LocalDrive

    print(data2.count_files())  # -> <some number>

Python 3.6+ Update

The code above works in both Python 2 and 3.x. However in Python 3.6 a new class method was added to object named __init_subclass__() which makes the finding of subclasses simpler by using it to automatically create a "registry" of them instead of potentially having to check every subclass recursively as the _get_all_subclasses() method is doing in the above.

I got the idea of using __init_subclass__() to do this from the Subclass registration section in the PEP 487 -- Simpler customisation of class creation proposal. Since the method will be inherited by all the base class' subclasses, registration will automatically be done for sub-subclasses, too (as opposed to only to direct subclasses) — it completely eliminates the need for a method like _get_all_subclasses().

# Requires Python 3.6+

import os
import re

class FileSystem(object):
    class NoAccess(Exception): pass
    class Unknown(Exception): pass

    # Pattern for matching "xxx://"  # x is any non-whitespace character except for ":".
    _PATH_PREFIX_PATTERN = re.compile(r'\s*([^:]+)://')
    _registry = {}  # Registered subclasses.

    @classmethod
    def __init_subclass__(cls, /, path_prefix, **kwargs):
        super().__init_subclass__(**kwargs)
        cls._registry[path_prefix] = cls  # Add class to registry.

    @classmethod
    def _get_prefix(cls, s):
        """ Extract any file system prefix at beginning of string s and
            return a lowercase version of it or None when there isn't one.
        """
        match = cls._PATH_PREFIX_PATTERN.match(s)
        return match.group(1).lower() if match else None

    def __new__(cls, path):
        """ Create instance of appropriate subclass. """
        path_prefix = cls._get_prefix(path)
        subclass = cls._registry.get(path_prefix)
        if subclass:
            return object.__new__(subclass)
        else:  # No subclass with matching prefix found (and no default).
            raise cls.Unknown(
                f'path "{path}" has no known file system prefix')

    def count_files(self):
        raise NotImplementedError


class Nfs(FileSystem, path_prefix='nfs'):
    def __init__ (self, path):
        pass

    def count_files(self):
        pass

class Ufs(Nfs, path_prefix='ufs'):
    def __init__ (self, path):
        pass

    def count_files(self):
        pass

class LocalDrive(FileSystem, path_prefix=None):  # Default file system.
    def __init__(self, path):
        if not os.access(path, os.R_OK):
            raise self.NoAccess(f'Cannot read directory {path!r}')
        self.path = path

    def count_files(self):
        return sum(os.path.isfile(os.path.join(self.path, filename))
                     for filename in os.listdir(self.path))


if __name__ == '__main__':

    data1 = FileSystem('nfs://192.168.1.18')
    data2 = FileSystem('c:/')  # Change as necessary for testing.
    data4 = FileSystem('ufs://192.168.1.18')

    print(type(data1))  # -> <class '__main__.Nfs'>
    print(type(data2))  # -> <class '__main__.LocalDrive'>
    print(f'file count: {data2.count_files()}')  # -> file count: <some number>

    try:
        data3 = FileSystem('c:/foobar')  # A non-existent directory.
    except FileSystem.NoAccess as exc:
        print(f'{exc} - FileSystem.NoAccess exception raised as expected')
    else:
        raise RuntimeError("Non-existent path should have raised Exception!")

    try:
        data4 = FileSystem('foobar://42')  # Unregistered path prefix.
    except FileSystem.Unknown as exc:
        print(f'{exc} - FileSystem.Unknown exception raised as expected')
    else:
        raise RuntimeError("Unregistered path prefix should have raised Exception!")

Millhon answered 21/1, 2015 at 20:33 Comment(17)
Kolmar and martineau both had great suggestions. I'm using each now in different ways. But for this specific issue I went with the approach from @martineau. Thanks!Werth
Minor note: The inner for loop in _get_all_subclasses can be replaced with the line yield from subclass._get_all_subclasses() for the same result, and IMHO minorly more readable and prettier.Bayonne
@Jacco van Dorp: FWIW, I didn't use a yield from because it wasn't added to Python until version 3.3 via PEP 380 (as currently posted, my answer will work in both Python 2 and 3).Millhon
Ah, ok. I never worked with anything older than 3.4 myself.Bayonne
@Gloweye: FYI, the update I added for Python 3.6+ eliminates the _get_all_subclasses() method, which makes your point about yield from moot.Millhon
I believe you need to change super().__init_subclass__(**kwargs) to super().__init_subclass__() otherwise a TypeError is thrown per the PEP.Histochemistry
To expand, no TypeError is thrown here because of the use of kwargs.pop, but that is not called and **kwargs contains keywords, passing to super() will throw the TypeError due to the default implementation not accepting keyword arguments.Histochemistry
@pstatix: Not sure what you're talking about, so I think believe you're mistaken. The code worked when I posted it — I generally test all answers I can before posting them — and it still worked when I tested it again just now using Python 3.8.5. Further, whatever you're trying to say about kwargs.pop() is intelligible. What makes you think it isn't being called?Millhon
@Millhon I apologize for any confusion, my supplemental post hopefully clears it up. I do not have 3.8.5 on my system, but I tested my concern (observation) on 3.6 and 3.7 and resolved the same conclusion. If you have further input to my post, please provide! I found this to be an interesting lesson in the Python data model.Histochemistry
class Nfs(FileSystem, path_prefix='nfs'): - I'm unfamiliar with that syntax of path_prefix='nfs' in class declaration. What's that called?Transnational
@Stabilo: It's called a keyword argument and they are allowed in argument lists — such as those in class definitions — see the inheritance ::= "(" [argument_list] ")" part?Millhon
@Millhon thanks for answering, but what is it used for? is it like dfining a class variable? I can't find any information about that.Transnational
@Stabilo: I found a blog post titled __init_subclass__ – a simpler way to implement class registries in Python that explains what is going on fairly well.Millhon
@Millhon Thanks. Another question - what is the / in def __init_subclass__(cls, /, path_prefix, **kwargs):?Transnational
@Stabilo: That means that all arguments from that point on are keyword-only (can only be passed by specifying them using their name).Millhon
@Millhon thanks. I read that / was introduced in python3.8 (so the # Requires Python 3.6+ comment should be changed) and means Positional-only parameters, so maybe you meant to use * instead to enforce keyword-only parameters?Transnational
@Stabilo: I'll just remove its usage. It's only there because later versions of the documentation started using it in the example code shown.Millhon
D
7

In my opinion, using __new__ in such a way is really confusing for other people who might read your code. Also it requires somewhat hackish code to distinguish guessing file system from user input and creating Nfs and LocalDrive with their corresponding classes.

Why not make a separate function with this behaviour? It can even be a static method of FileSystem class:

class FileSystem(object):
    # other code ...

    @staticmethod
    def from_path(path):
        if path.upper().startswith('NFS://'): 
            return Nfs(path)
        else: 
            return LocalDrive(path)

And you call it like this:

data1 = FileSystem.from_path('nfs://192.168.1.18')
data2 = FileSystem.from_path('/var/log')
Denims answered 20/1, 2015 at 0:53 Comment(6)
IMHO, while this may be simpler to understand, a significant limitation with it (as well as the OP's implementation) is that the factory function/method has to know about all the derived classes there are in advance. In that sense they aren't really providing the same generality that true C++-like "virtual" constructors might provide (although it doesn't support them either). To do so in either language requires something more complicated that allows subclasses to override or augment the base class constructor in a way that creates instances of themselves when the proper conditions are met.Millhon
@Millhon I went with @staticmethod approach because in the original question, the class should know about its subclasses and call their constructors as well. In Python such virtual constructor behaviour you describe can usually be achieved with metaclasses (or maybe there are other ways, like module-level variables, not sure if this would work). Children classes are registered in some class-level set via metaclass, and base class provides a classmethod, that runs through this set and checks if the child is suitable for passed parameters. But I think it's out of scope of this question.Denims
Yes, metaclasses would seem like a logical choice in order to provide the extensibility I described -- although it obviously can be done with just __new__. However, I'm not aware of any idiomatic or canonical ways of using either to truly implement virtual constructors, much less factory methods (which is why I asked the OP where s/he saw posts discouraging the use of __new__ for implementing even the relatively simple static factory method pattern).Millhon
@Millhon Oh, wow, didn't know about __subclasses__. You learn something new every day, eh. Still, even using this, I'd make the factory method a @classmethod, or maybe even a @staticmethod instead of (ab)using __new__Denims
Who said anything about __subclasses__? ;-)Millhon
Thanks for the feedback. @Millhon One of a few posts on __new__ is here: link Though that is more dated. But aside from that and other posts, in any implementation I can get to work the code definitely looks more complex than using something like @staticmethodWerth
H
4

Edit [BLUF]: there is no problem with the answer provided by @martineau, this post is merely to follow up for completion to discuss a potential error encountered when using additional keywords in a class definition that are not managed by the metaclass.

I'd like to supply some additional information on the use of __init_subclass__ in conjuncture with using __new__ as a factory. The answer that @martineau has posted is very useful and I have implemented an altered version of it in my own programs as I prefer using the class creation sequence over adding a factory method to the namespace; very similar to how pathlib.Path is implemented.

To follow up on a comment trail with @martinaeu I have taken the following snippet from his answer:

import os
import re

class FileSystem(object):
    class NoAccess(Exception): pass
    class Unknown(Exception): pass

    # Regex for matching "xxx://" where x is any non-whitespace character except for ":".
    _PATH_PREFIX_PATTERN = re.compile(r'\s*([^:]+)://')
    _registry = {}  # Registered subclasses.

    @classmethod
    def __init_subclass__(cls, /, **kwargs):
        path_prefix = kwargs.pop('path_prefix', None)
        super().__init_subclass__(**kwargs)
        cls._registry[path_prefix] = cls  # Add class to registry.

    @classmethod
    def _get_prefix(cls, s):
        """ Extract any file system prefix at beginning of string s and
            return a lowercase version of it or None when there isn't one.
        """
        match = cls._PATH_PREFIX_PATTERN.match(s)
        return match.group(1).lower() if match else None

    def __new__(cls, path):
        """ Create instance of appropriate subclass. """
        path_prefix = cls._get_prefix(path)
        subclass = FileSystem._registry.get(path_prefix)
        if subclass:
            # Using "object" base class method avoids recursion here.
            return object.__new__(subclass)
        else:  # No subclass with matching prefix found (and no default).
            raise FileSystem.Unknown(
                f'path "{path}" has no known file system prefix')

    def count_files(self):
        raise NotImplementedError


class Nfs(FileSystem, path_prefix='nfs'):
    def __init__ (self, path):
        pass

    def count_files(self):
        pass


class LocalDrive(FileSystem, path_prefix=None):  # Default file system.
    def __init__(self, path):
        if not os.access(path, os.R_OK):
            raise FileSystem.NoAccess('Cannot read directory')
        self.path = path

    def count_files(self):
        return sum(os.path.isfile(os.path.join(self.path, filename))
                     for filename in os.listdir(self.path))


if __name__ == '__main__':

    data1 = FileSystem('nfs://192.168.1.18')
    data2 = FileSystem('c:/')  # Change as necessary for testing.

    print(type(data1).__name__)  # -> Nfs
    print(type(data2).__name__)  # -> LocalDrive

    print(data2.count_files())  # -> <some number>

    try:
        data3 = FileSystem('foobar://42')  # Unregistered path prefix.
    except FileSystem.Unknown as exc:
        print(str(exc), '- raised as expected')
    else:
        raise RuntimeError(
              "Unregistered path prefix should have raised Exception!")

This answer, as written works, but I wish to address a few items (potential pitfalls) others may experience through inexperience or perhaps codebase standards their team requires.

Firstly, for the decorator on __init_subclass__, per the PEP:

One could require the explicit use of @classmethod on the __init_subclass__ decorator. It was made implicit since there's no sensible interpretation for leaving it out, and that case would need to be detected anyway in order to give a useful error message.

Not a problem since its already implied, and the Zen tells us "explicit over implicit"; never the less, when abiding by PEPs, there you go (and rational is further explained).

In my own implementation of a similar solution, subclasses are not defined with an additional keyword argument, such as @martineau does here:

class Nfs(FileSystem, path_prefix='nfs'): ...
class LocalDrive(FileSystem, path_prefix=None): ...

When browsing through the PEP:

As a second change, the new type.__init__ just ignores keyword arguments. Currently, it insists that no keyword arguments are given. This leads to a (wanted) error if one gives keyword arguments to a class declaration if the metaclass does not process them. Metaclass authors that do want to accept keyword arguments must filter them out by overriding __init__.

Why is this (potentially) problematic? Well there are several questions (notably this) describing the problem surrounding additional keyword arguments in a class definition, use of metaclasses (subsequently the metaclass= keyword) and the overridden __init_subclass__. However, that doesn't explain why it works in the currently given solution. The answer: kwargs.pop().

If we look at the following:

# code in CPython 3.7

import os
import re

class FileSystem(object):
    class NoAccess(Exception): pass
    class Unknown(Exception): pass

    # Regex for matching "xxx://" where x is any non-whitespace character except for ":".
    _PATH_PREFIX_PATTERN = re.compile(r'\s*([^:]+)://')
    _registry = {}  # Registered subclasses.

    def __init_subclass__(cls, **kwargs):
        path_prefix = kwargs.pop('path_prefix', None)
        super().__init_subclass__(**kwargs)
        cls._registry[path_prefix] = cls  # Add class to registry.

    ...

class Nfs(FileSystem, path_prefix='nfs'): ...

This will still run without issue, but if we remove the kwargs.pop():

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)  # throws TypeError
        cls._registry[path_prefix] = cls  # Add class to registry.

The error thrown is already known and described in the PEP:

In the new code, it is not __init__ that complains about keyword arguments, but __init_subclass__, whose default implementation takes no arguments. In a classical inheritance scheme using the method resolution order, each __init_subclass__ may take out it's keyword arguments until none are left, which is checked by the default implementation of __init_subclass__.

What is happening is the path_prefix= keyword is being "popped" off of kwargs, not just accessed, so then **kwargs is now empty and passed up the MRO and thus compliant with the default implementation (receiving no keyword arguments).

To avoid this entirely, I propose not relying on kwargs but instead use that which is already present in the call to __init_subclass__, namely the cls reference:

# code in CPython 3.7

import os
import re

class FileSystem(object):
    class NoAccess(Exception): pass
    class Unknown(Exception): pass

    # Regex for matching "xxx://" where x is any non-whitespace character except for ":".
    _PATH_PREFIX_PATTERN = re.compile(r'\s*([^:]+)://')
    _registry = {}  # Registered subclasses.

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        cls._registry[cls._path_prefix] = cls  # Add class to registry.

    ...

class Nfs(FileSystem):
    _path_prefix = 'nfs'

    ...

Adding the prior keyword as a class attribute also extends the use in later methods if one needs to refer back to the particular prefix used by the subclass (via self._path_prefix). To my knowledge, you cannot refer back to supplied keywords in the definition (without some complexity) and this seemed trivial and useful.

So to @martineau I apologize for my comments seeming confusing, only so much space to type them and as shown it was more detailed.

Histochemistry answered 23/9, 2020 at 14:2 Comment(6)
For the record, the path_prefix popped-off is being accessed — even if its value is None. At one point developing what's currently in my answer, I considered using subclass attributes like was being done in the the non-__init_subclass__() version, but decided I liked having things more explicit by passing a keyword argument when declaring the subclasses. Nothing wrong I suppose with doing your approach, but as usual the are trade-offs between the two approaches, I think it's mostly a personal preference.Millhon
@Millhon Indeed it is personal preference. I just wanted to avoid the potential for this "deeper" TypeError that's thrown; figured posting here given that your answer is what inspired my variation and ultimately led me to uncover the need for kwargs.pop along with the difference between using a keyword argument or a class attribute and how that impacts the implementation of __init_subclass__.Histochemistry
Since __init_subclass__() is defined in the base class where the need to pop() the argument is taken care for derived classes, requiring the keyword argument doesn't seem like much of a burden for subclasses to get "right". If you'd like a better error message, the base class implementation could raise one if the keyword argument is missing I suppose.Millhon
@Millhon Its a burden by no means, just an anecdotal reply about your different approaches (since you show one with a class attribute and one without); only meant to expand that one cannot use __init_subclass__ with class definitions containing keywords without using pop() and explaining why that is. I prefer to use the class attribute with __init_subclass__ but as you've said, that's preferential.Histochemistry
I've updated my answer slightly so the pop() is no longer needed — similar to the example shown in the documentation — so that's one less thing to worry about.Millhon
@Millhon what I find interesting is that the documentation has an example that is a mix of both our implementations: it uses a keyword in the class definition but assigns it as a class attribute! I've edited my post to have a BLUF therefore hoping viewers review your content as the accepted (rightfully) answer and read mine only for clarity.Histochemistry

© 2022 - 2024 — McMap. All rights reserved.