Best way to initialize variable in a module?

Let's say I need to write incoming data into a dataset in the cloud. Whether, when, and where I need the dataset in my code depends on the incoming data. I only want to get a reference to the dataset once. What is the best way to achieve this?

  1. Initialize as global variable at start and access through global variable

    if __name__ == "__main__":
        dataset = fetch_dataset()  # placeholder name: get dataset from internet
    

This seems like the simplest way, but it initializes the variable even if it is never needed.

  2. Get the reference the first time the dataset is needed, save it in a global variable, and access it through a get_dataset() function

    dataset = None
    
    def get_dataset():
        global dataset
        if dataset is None:
            dataset = fetch_dataset()  # placeholder name: get dataset from internet
        return dataset
    
  3. Get the reference the first time the dataset is needed, save it as a function attribute, and access it through a get_dataset() function

    def get_dataset():
        if not hasattr(get_dataset, 'dataset'):
            get_dataset.dataset = fetch_dataset()  # placeholder name: get dataset from internet
        return get_dataset.dataset
    
  4. Any other way

Tedesco answered 16/9, 2017 at 0:50 Comment(0)

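The typical way to do this is to wrap the call that fetches the data in a class:

class MyService:
    dataset = None  # class-level default; replaced on the instance after the first fetch

    def get_data(self):
        # Fetch only on first access; get_my_data() stands for your own fetch call
        if self.dataset is None:
            self.dataset = get_my_data()
        return self.dataset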
    

Then you instantiate it once in your main and use it wherever you need it:

if __name__ == "__main__":
    data_service = MyService()
    data = data_service.get_data()
    # or pass the service to whoever needs it
    my_function_that_uses_data(data_service)

The dataset variable is internal but accessible through a discoverable function. You could also use a property on the instance of the class.
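
The property variant could look roughly like the sketch below. Here get_my_data is stubbed as a stand-in for the real fetch, and both the returned dict and the fetch_log counter are made up for illustration.

```python
fetch_log = []  # illustration only: records each simulated fetch


def get_my_data():
    # Hypothetical stand-in for the real network call.
    fetch_log.append("fetched")
    return {"rows": [1, 2, 3]}


class MyService:
    def __init__(self):
        self._dataset = None

    @property
    def dataset(self):
        # Fetch on first access only, then reuse the stored reference.
        if self._dataset is None:
            self._dataset = get_my_data()
        return self._dataset


data_service = MyService()
first = data_service.dataset
second = data_service.dataset  # no second fetch
```

Callers then read data_service.dataset like a plain attribute. On Python 3.8+, functools.cached_property gives similar behaviour with less code.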

Also, using objects and classes makes the code much clearer in a large project, as the functionality should be self-explanatory from the class name and methods.

Note that you can easily make this a generic service too, by passing it the way to fetch the data at initialization (a URL, for example), so it can be reused with different endpoints.
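
One possible shape for such a generic service is sketched below, assuming the "way to fetch data" is passed in as a zero-argument callable. The name LazyService and the two lambda fetchers are hypothetical; real fetchers would call different endpoints.

```python
from typing import Any, Callable


class LazyService:
    """Fetches data on first use with whatever callable it was given."""

    def __init__(self, fetch: Callable[[], Any]):
        self._fetch = fetch
        self._dataset = None
        self._loaded = False  # separate flag, so a fetch that returns None is not repeated

    def get_data(self) -> Any:
        if not self._loaded:
            self._dataset = self._fetch()
            self._loaded = True
        return self._dataset


# Hypothetical fetchers standing in for calls to two different endpoints.
users = LazyService(lambda: ["alice", "bob"])
orders = LazyService(lambda: [101, 102])
```

Neither fetcher runs until get_data() is first called on its service.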

One pitfall to avoid is instantiating the same class multiple times in your submodules instead of once in main. If you did, the data would be fetched and stored separately for each instance. On the other hand, you can pass the single instance to a submodule and only fetch the data when it is needed (it may never be fetched if the submodule never needs it), whereas with your options the dataset needs to be fetched first to be passed somewhere else.
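
To make the difference concrete, here is a small sketch that counts fetches. The handle function stands in for code living in a submodule, and get_my_data, fetch_log and the needs_dataset key are all invented for the example.

```python
fetch_log = []  # illustration only: records each simulated fetch


def get_my_data():
    # Hypothetical stand-in for the real network call.
    fetch_log.append("fetched")
    return {"rows": [1, 2, 3]}


class MyService:
    def __init__(self):
        self._dataset = None

    def get_data(self):
        if self._dataset is None:
            self._dataset = get_my_data()
        return self._dataset


def handle(record, service):
    # Receives the service and triggers the fetch only when a record needs it.
    if record.get("needs_dataset"):
        return len(service.get_data()["rows"])
    return 0


shared = MyService()
records = [{"needs_dataset": False}, {"needs_dataset": True}, {"needs_dataset": True}]
results = [handle(r, shared) for r in records]
fetches_with_shared_instance = len(fetch_log)

# Pitfall: creating a new instance per use fetches again every time.
fetch_log.clear()
for _ in range(2):
    MyService().get_data()
fetches_with_new_instances = len(fetch_log)
```

With the shared instance, nothing is fetched for the first record and a single fetch covers the other two.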

Notes on your proposed options:

  1. Initializing in the if __name__ == '__main__' section:

The variable is only initialized when the module is run as a script (for example from the shell). If the module is imported instead, that block never runs and the variable does not exist.

You also need to fetch the data up front to pass it somewhere else, even if main itself never needs it.

  2. Setting a global within a function:

The use of global is generally discouraged, as it is in most programming languages. Modifying variables outside the local scope is a recipe for odd behavior. It also tends to make the code harder to test, since it relies on a global that is only set in a specific workflow.

  3. Attribute on a function:

This one is a bit of an eyesore. It would certainly work, and the functionality is very similar to the class pattern I propose, but you have to admit attributes on functions are not very Pythonic. The advantage of the class is that you can initialize it in many ways, subclass it, and so on, and still not fetch the data until you need it. Using a plain function is 'simpler' but much more limited.
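
As a rough illustration of the subclassing point, the sketch below keeps the lazy logic in a base class and lets subclasses decide how to fetch. Both class names are hypothetical, and the in-memory subclass is the kind of thing you might swap in for tests.

```python
class LazyService:
    def __init__(self):
        self._dataset = None

    def fetch(self):
        # Subclasses supply the actual way of getting the data.
        raise NotImplementedError

    def get_data(self):
        # Shared lazy logic: fetch once, on first use.
        if self._dataset is None:
            self._dataset = self.fetch()
        return self._dataset


class InMemoryService(LazyService):
    # Hypothetical subclass with no network access, e.g. for tests.
    def fetch(self):
        return [1, 2, 3]


service = InMemoryService()
data = service.get_data()
```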

Woods answered 16/9, 2017 at 1:57 Comment(0)

You can also use the lru_cache decorator from the functools module for achieving the goal of running an expensive operation only once.

As long as the parameters are the same, calling the function again and again returns the same object.

https://docs.python.org/3/library/functools.html#functools.lru_cache

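from functools import lru_cache

@lru_cache  # bare form needs Python 3.8+; use @lru_cache() on older versions
def fun(input1, input2):
    result = ...  # expensive operation
    return result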
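
For the case in the question, where the fetch takes no arguments, a zero-argument function should behave the same way. A minimal sketch, with the network call replaced by a hypothetical stub and fetch_log added only to count calls:

```python
from functools import lru_cache

fetch_log = []  # illustration only: records each simulated fetch


@lru_cache(maxsize=None)
def get_dataset():
    # Hypothetical stand-in for the real network call.
    fetch_log.append("fetched")
    return {"rows": [1, 2, 3]}


a = get_dataset()  # runs the body
b = get_dataset()  # served from the cache, body is not run again
```

If the reference ever needs refreshing, get_dataset.cache_clear() empties the cache so the next call fetches again.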
Tuberculosis answered 29/6, 2022 at 16:12 Comment(0)

Similar to MrE's answer, it is best to encapsulate the data with a wrapper.

However, I would recommend using a Python closure instead of a class.

A class should be used to encapsulate data and relevant functions that are closely related to the data. A class should be something that you will instantiate objects of and objects will retain individuality. You can read more about this here

You can use closures in the following way:

def get_dataset_wrapper():
    dataset = None  # lives in the enclosing scope and is shared by the inner function

    def get_dataset():
        nonlocal dataset
        if dataset is None:
            dataset = fetch_dataset()  # placeholder name: get dataset from internet
        return dataset
    return get_dataset

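Create the accessor once and reuse it. Each call to get_dataset_wrapper() builds a new closure with its own empty dataset, so chaining get_dataset_wrapper()() on every access would fetch the data again each time:

get_dataset = get_dataset_wrapper()  # build the closure once
dataset = get_dataset()  # first call fetches; later calls return the cached reference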
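
A quick way to check how many fetches the closure actually performs is to count them. In this sketch, fetch_from_internet and fetch_log are hypothetical stand-ins added for the count.

```python
fetch_log = []  # illustration only: records each simulated fetch


def fetch_from_internet():
    # Hypothetical stand-in for the real network call.
    fetch_log.append("fetched")
    return {"rows": [1, 2, 3]}


def get_dataset_wrapper():
    dataset = None

    def get_dataset():
        nonlocal dataset
        if dataset is None:
            dataset = fetch_from_internet()
        return dataset

    return get_dataset


# Reusing one closure: the data is fetched once.
get_dataset = get_dataset_wrapper()
get_dataset()
get_dataset()
fetches_reusing_closure = len(fetch_log)

# Building a new closure per access: every access fetches again.
fetch_log.clear()
get_dataset_wrapper()()
get_dataset_wrapper()()
fetches_rebuilding_closure = len(fetch_log)
```

The cached value belongs to each closure, so the accessor has to be created once and shared to get the fetch-once behaviour.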
Paulus answered 1/7, 2022 at 6:39 Comment(0)
