python initialization global-variables function-attributes

Best way to initialize variable in a module?

Let's say I need to write incoming data into a dataset on the cloud. Whether, when, and where my code needs the dataset depends on the incoming data. I only want to get a reference to the dataset once. What is the best way to achieve this?

Initialize it as a global variable at startup and access it through that global variable:
```
if __name__ == "__main__":
    dataset = ...  # placeholder: get dataset from internet
```

This seems like the simplest way, but it initializes the variable even if it is never needed.

Get the reference the first time the dataset is needed, save it in a global variable, and access it through a get_dataset() function:

```
dataset = None

def get_dataset():
    global dataset
    if dataset is None:
        dataset = ...  # placeholder: get dataset from internet
    return dataset
```

Get the reference the first time the dataset is needed, save it as a function attribute, and access it through a get_dataset() function:

```
def get_dataset():
    if not hasattr(get_dataset, 'dataset'):
        get_dataset.dataset = ...  # placeholder: get dataset from internet
    return get_dataset.dataset
```

Any other way?

Solution

The typical way to do what you want is to wrap the call that fetches the data in a class:

```
class MyService:
    dataset = None  # class-level default; set on the instance after the first fetch

    def get_data(self):
        if self.dataset is None:
            self.dataset = get_my_data()  # fetch only on first use
        return self.dataset
```

Then instantiate it once in your main block and use it wherever you need it:

```
if __name__ == "__main__":
    data_service = MyService()
    data = data_service.get_data()
    # or pass the service to whoever needs it
    my_function_that_uses_data(data_service)
```

The dataset attribute stays internal to the class but is accessible through a discoverable method. You could also expose it as a property on the class.
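As a rough sketch of the property variant, assuming the same `get_my_data()` helper as above (stubbed here with placeholder data so the snippet runs on its own):

```python
def get_my_data():
    # Stand-in for the real fetch; swap in the actual cloud call.
    return {"rows": [1, 2, 3]}


class MyService:
    def __init__(self):
        self._dataset = None  # nothing fetched yet

    @property
    def dataset(self):
        # Fetch on first access only; later accesses reuse the cached value.
        if self._dataset is None:
            self._dataset = get_my_data()
        return self._dataset


data_service = MyService()
first = data_service.dataset   # triggers the fetch
second = data_service.dataset  # returns the cached object
```

Callers then read `data_service.dataset` like a plain attribute. On Python 3.8+, `functools.cached_property` should give much the same effect with less code.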

Also, using objects and classes makes the code much clearer in a large project, as the functionality should be self-explanatory from the class name and methods.

Note that you can easily make this a generic service too, by passing the way to fetch the data (a URL, for example) at initialization, so it can be reused with different endpoints.
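One possible shape for such a generic service is sketched below. The names `LazyDataService`, `fetch_sales` and `fetch_users` are made up for illustration, and the fetchers return placeholder data instead of calling real endpoints:

```python
from typing import Any, Callable


class LazyDataService:
    """Hypothetical generic version: the fetch strategy is supplied at construction."""

    def __init__(self, fetch: Callable[[], Any]):
        self._fetch = fetch
        self._dataset = None
        self._loaded = False  # separate flag, so a fetch that returns None is still cached

    def get_data(self) -> Any:
        if not self._loaded:
            self._dataset = self._fetch()
            self._loaded = True
        return self._dataset


# Placeholder fetchers; real ones would call different endpoints.
def fetch_sales():
    return ["sales-row-1", "sales-row-2"]


def fetch_users():
    return ["alice", "bob"]


sales_service = LazyDataService(fetch_sales)
users_service = LazyDataService(fetch_users)
```

Neither service fetches anything until its own `get_data()` is first called, and each keeps its own cached result.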

One pitfall to avoid is instantiating the same class multiple times in your submodules rather than once in main. If you did, the data would be fetched and stored separately for each instance. On the other hand, you can pass the single instance to a submodule and only fetch the data when it is needed (i.e., it may never be fetched if your submodule never needs it), whereas with your options, passing the dataset itself somewhere else requires fetching it first.

Notes on your proposed options:
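To make the sharing behaviour concrete, here is a small sketch. The `fetch_count` attribute and the `process()` function are hypothetical additions that exist only to make the behaviour observable, and the fetch is replaced by placeholder data:

```python
class MyService:
    def __init__(self):
        self.dataset = None
        self.fetch_count = 0  # illustration only: counts how often we fetch

    def get_data(self):
        if self.dataset is None:
            self.fetch_count += 1
            self.dataset = ["row"]  # stand-in for the real fetch
        return self.dataset


def process(record, service):
    """Hypothetical submodule function: touches the dataset only for some records."""
    if record.get("needs_dataset"):
        service.get_data().append(record["value"])


service = MyService()  # created once, then handed to whoever needs it

process({"needs_dataset": False}, service)
fetches_after_skip = service.fetch_count  # still 0: nothing needed the dataset

process({"needs_dataset": True, "value": 1}, service)
process({"needs_dataset": True, "value": 2}, service)
```

After the last two calls the shared instance has fetched exactly once, and both records were written to the same dataset object.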

Initializing in the if __name__ == '__main__' section:

The dataset is not initialized when the module is imported from other code; the block runs only when the file is executed directly, for example from the shell.

You need to fetch the data before you can pass it anywhere else, even if main itself never needs it.

Setting a global within a function:

The use of global is generally discouraged, as mutable global state is in most programming languages. Modifying variables outside the local scope is a recipe for odd behavior. It also tends to make the code harder to test, because it relies on a global that is only set in a specific workflow.
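The testing concern can be illustrated with a minimal sketch. The two `scenario_*` functions are hypothetical stand-ins for independent tests, and the fetch is replaced by placeholder data:

```python
dataset = None


def get_dataset():
    global dataset
    if dataset is None:
        dataset = {"rows": []}  # stand-in for the real fetch
    return dataset


def scenario_a():
    get_dataset()["rows"].append("left over from scenario_a")


def scenario_b():
    # Expects a pristine dataset, which only holds if scenario_a has not run first.
    return get_dataset()["rows"] == []


scenario_a()
b_sees_clean_state = scenario_b()  # False: state leaked through the global

# Resetting the global by hand restores isolation,
# but every test has to remember to do it.
dataset = None
b_after_reset = scenario_b()  # True
```

With a service class, each test can simply build its own instance instead of resetting module state.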

Setting an attribute on a function:

This one is a bit of an eyesore: it would certainly work, and the functionality is very similar to the class pattern I propose, but you have to admit that attributes on functions are not very Pythonic. The advantage of the class is that you can initialize it in many ways, subclass it, and so on, and still not fetch the data until you need it. Using a plain function is 'simpler' but much more limited.
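As an illustration of the subclassing point, here is a sketch in which a base class owns the lazy caching and subclasses only decide how to fetch. `BaseService` and `InMemoryService` are hypothetical names, and the in-memory rows stand in for a real data source:

```python
class BaseService:
    def __init__(self):
        self._dataset = None

    def fetch(self):
        raise NotImplementedError("subclasses decide how to fetch")

    def get_data(self):
        # Caching logic lives here once; subclasses inherit it unchanged.
        if self._dataset is None:
            self._dataset = self.fetch()
        return self._dataset


class InMemoryService(BaseService):
    """Hypothetical subclass, handy for tests or local runs."""

    def __init__(self, rows):
        super().__init__()
        self._rows = rows

    def fetch(self):
        return list(self._rows)


service = InMemoryService(["a", "b"])  # nothing fetched yet
data = service.get_data()              # fetch happens here, once
```

A function with an attribute has no comparable hook for swapping in a different fetch strategy.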