Let's say I need to write incoming data into a dataset in the cloud. Whether, when, and where my code needs the dataset depends on the incoming data. I only want to get a reference to the dataset once. What is the best way to achieve this?
Initialize it as a global variable at startup and access it through that global variable

    if __name__ == "__main__":
        dataset = ...  # get dataset from internet

This seems like the simplest way, but it fetches the dataset even if it is never needed.
Get the reference the first time the dataset is needed, save it in a global variable, and access it with a get_dataset() function

    dataset = None

    def get_dataset():
        global dataset
        if dataset is None:
            dataset = ...  # get dataset from internet
        return dataset

Get the reference the first time the dataset is needed, save it as a function attribute, and access it with a get_dataset() function

    def get_dataset():
        if not hasattr(get_dataset, 'dataset'):
            get_dataset.dataset = ...  # get dataset from internet
        return get_dataset.dataset

Any other way?
A typical way to do what you want is to wrap the call that fetches the data in a class:

    class MyService:
        dataset = None

        def get_data(self):
            # Fetch only on the first call; later calls reuse the stored value.
            if self.dataset is None:
                self.dataset = get_my_data()
            return self.dataset

Then you instantiate it once in your main and use it wherever you need it.

    if __name__ == "__main__":
        data_service = MyService()
        data = data_service.get_data()
        # or pass the service to whoever needs it
        my_function_that_uses_data(data_service)

The dataset variable is internal but accessible through a discoverable function. You could also use a property on the instance of the class.
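A minimal sketch of the property variant might look like the following. It assumes a placeholder get_my_data() standing in for the real cloud call; the dictionary it returns is made up purely for illustration.

```python
def get_my_data():
    # Hypothetical stand-in for the real call that fetches the dataset.
    return {"rows": [1, 2, 3]}


class MyService:
    def __init__(self):
        self._dataset = None  # nothing is fetched at construction time

    @property
    def dataset(self):
        # Fetch on first access only; later accesses reuse the stored value.
        if self._dataset is None:
            self._dataset = get_my_data()
        return self._dataset


data_service = MyService()
first = data_service.dataset   # triggers the fetch
second = data_service.dataset  # reuses the stored reference
```

Callers then read data_service.dataset like a plain attribute, and the fetch stays hidden behind it.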
Also, using objects and classes makes things much clearer in a large project, as the functionality should be self-explanatory from the class name and methods.
Note that you can easily make this a generic service too, by passing it the way to fetch the data (a URL, for example) at initialization, so it can be reused with different endpoints.
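One way to sketch the generic version is to pass a zero-argument callable to the constructor. The names LazyDataService, fetch_sales and fetch_users are hypothetical, and the fetchers return hard-coded lists instead of calling a real endpoint.

```python
class LazyDataService:
    def __init__(self, fetch):
        self._fetch = fetch    # any zero-argument callable that returns the data
        self._dataset = None
        self._loaded = False   # separate flag, so a fetch that returns None is not repeated

    def get_data(self):
        if not self._loaded:
            self._dataset = self._fetch()
            self._loaded = True
        return self._dataset


# Hypothetical fetchers standing in for calls to two different endpoints.
def fetch_sales():
    return ["sales-row-1", "sales-row-2"]


def fetch_users():
    return ["alice", "bob"]


sales_service = LazyDataService(fetch_sales)
users_service = LazyDataService(fetch_users)
```

Each service fetches independently, and only when its own get_data() is first called.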
One pitfall to avoid is instantiating the same class multiple times in your submodules instead of once in main. If you did, the data would be fetched and stored separately for each instance. On the other hand, you can pass the instance of the class to a submodule and only fetch the data when it's needed (i.e., it may never be fetched if your submodule never needs it), whereas with your options the dataset has to be fetched first before it can be passed somewhere else.
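To illustrate the point about passing the instance around, here is a small sketch that counts how often the fetch runs. The handle() function, the record layout and the fetch_calls counter are assumptions made up for the example; get_my_data() is again a stand-in for the real fetch.

```python
fetch_calls = []  # records every fetch so we can see when it happens


def get_my_data():
    # Hypothetical stand-in for the real fetch.
    fetch_calls.append("fetch")
    return ["row"]


class MyService:
    def __init__(self):
        self.dataset = None

    def get_data(self):
        if self.dataset is None:
            self.dataset = get_my_data()
        return self.dataset


def handle(record, data_service):
    # Hypothetical submodule function: it only needs the dataset for some records.
    if record["store"]:
        data_service.get_data().append(record["value"])


data_service = MyService()  # created once, in main

handle({"store": False, "value": 1}, data_service)
fetches_after_skip = len(fetch_calls)  # the dataset was not needed yet

handle({"store": True, "value": 2}, data_service)
handle({"store": True, "value": 3}, data_service)
fetches_after_use = len(fetch_calls)   # fetched on first need, then reused
```

Because the same instance is shared, the fetch happens at most once, and not at all if no record ever needs the dataset.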
Notes about your proposed options:
The if __name__ == '__main__' section: the dataset is not initialized globally if the module is imported as a module (it is only initialized when the module is run from the shell). You also need to fetch the data in order to pass it somewhere else, even if you don't need it in main.
The use of global is generally discouraged, as it is in any programming language. Modifying variables out of scope is a recipe for encountering odd behaviors. It also tends to make the code harder to test if you rely on this global which is only set in a specific workflow.
The function attribute is a bit of an eyesore: it would certainly work, and the functionality is very similar to the class pattern I propose, but you have to admit that attributes on functions are not very Pythonic. The advantage of the class is that you can initialize it in many ways, subclass it, and so on, and still not fetch the data until you need it. Using a plain function is 'simpler' but much more limited.