Complex task repetitions
Task repetitions are great when you want to execute many tasks without repeating yourself in code.
But, in any bigger project, repetitions can become hard to maintain because there are multiple layers or dimensions of repetition.
Here you find some tips on how to set up your project so that adding new dimensions and extending existing ones becomes much easier.
Example
You can write multiple loops around a task function where each loop stands for a different dimension. A dimension might represent different datasets or different model specifications to analyze the datasets, like in the following example. The task arguments are derived from the dimensions.
```python
from pathlib import Path
from typing import Annotated

from pytask import Product
from pytask import task

SRC = Path(__file__).parent
BLD = SRC / "bld"


for data_name in ("a", "b", "c"):
    for model_name in ("ols", "logit", "linear_prob"):

        @task(id=f"{model_name}-{data_name}")
        def task_fit_model(
            path_to_data: Path = SRC / f"{data_name}.pkl",
            path_to_model: Annotated[Path, Product] = BLD
            / f"{data_name}-{model_name}.pkl",
        ) -> None: ...
```
There is nothing wrong with using nested loops for simpler projects. But projects often grow over time, and then you run into these problems:

- When you add a new task, you need to duplicate the nested loops in another module.
- When you add a dimension, you need to touch multiple files in your project and add another loop and level of indentation.
Solution
The main idea of the solution is quickly explained:

1. First, we formalize dimensions into objects using NamedTuple or dataclass().
2. Secondly, we combine dimensions into multi-dimensional objects so that we only have to iterate over instances of this object in a single loop. For lack of a better name, we will call this object an experiment.
3. Lastly, we use the DataCatalog so that we are not bothered with defining paths.
See also
If you have not learned about the DataCatalog yet, start with the tutorial and continue with the how-to guide.
```python
from pathlib import Path
from typing import NamedTuple

from pytask import DataCatalog

SRC = Path(__file__).parent
BLD = SRC / "bld"

data_catalog = DataCatalog()


class Dataset(NamedTuple):
    name: str

    @property
    def path(self) -> Path:
        return SRC / f"{self.name}.pkl"


class Model(NamedTuple):
    name: str


DATASETS = [Dataset("a"), Dataset("b"), Dataset("c")]
MODELS = [Model("ols"), Model("logit"), Model("linear_prob")]


class Experiment(NamedTuple):
    dataset: Dataset
    model: Model

    @property
    def name(self) -> str:
        return f"{self.model.name}-{self.dataset.name}"

    @property
    def fitted_model_name(self) -> str:
        return f"{self.name}-fitted-model"


EXPERIMENTS = [Experiment(dataset, model) for dataset in DATASETS for model in MODELS]
```
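To see that the cross product behaves as intended, here is a standalone sketch (re-declaring only the minimal tuples, no pytask required) verifying that every combination of dataset and model yields a uniquely named experiment:

```python
from itertools import product
from typing import NamedTuple


class Dataset(NamedTuple):
    name: str


class Model(NamedTuple):
    name: str


class Experiment(NamedTuple):
    dataset: Dataset
    model: Model

    @property
    def name(self) -> str:
        return f"{self.model.name}-{self.dataset.name}"


DATASETS = [Dataset(n) for n in ("a", "b", "c")]
MODELS = [Model(n) for n in ("ols", "logit", "linear_prob")]
EXPERIMENTS = [Experiment(d, m) for d, m in product(DATASETS, MODELS)]

# 3 datasets x 3 models yield 9 experiments with unique, descriptive ids.
names = [e.name for e in EXPERIMENTS]
assert len(names) == 9
assert len(set(names)) == 9
assert "logit-a" in names
```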
There are some things to note:

- The .name attribute on each dimension needs to return a unique name, which ensures that by combining the names for the experiment, we get a unique and descriptive id.
- Dimensions might need more attributes than just a name, like paths, keys for the data catalog, or other arguments for the task.
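When a dimension grows more attributes, a frozen dataclass works just as well as a NamedTuple. A sketch of the Dataset dimension as a dataclass, where the seed field is a hypothetical example of such an extra attribute and Path("src") stands in for the project directory:

```python
from dataclasses import dataclass
from pathlib import Path

SRC = Path("src")  # stand-in for the project directory


@dataclass(frozen=True)
class Dataset:
    name: str
    seed: int = 0  # hypothetical extra attribute a dimension might carry

    @property
    def path(self) -> Path:
        return SRC / f"{self.name}.pkl"


assert Dataset("a").path.name == "a.pkl"
```

frozen=True makes instances immutable and hashable, so they behave like NamedTuples when used as dictionary keys or in sets.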
Next, we will use these newly defined data structures and see how our tasks change when we use them.
```python
from pathlib import Path
from typing import Annotated
from typing import Any

from myproject.config import EXPERIMENTS
from myproject.config import Model
from myproject.config import data_catalog
from pytask import PythonNode
from pytask import task


for experiment in EXPERIMENTS:

    @task(id=experiment.name)
    def task_fit_model(
        model: Annotated[Model, PythonNode(hash=True)] = experiment.model,
        path_to_data: Path = experiment.dataset.path,
    ) -> Annotated[Any, data_catalog[experiment.fitted_model_name]]: ...
```
As you see, we lost a level of indentation, and all the generation of names and paths has moved into the dimensions and multi-dimensional objects.
Using a PythonNode allows us to hash the model and re-execute the task if we define other model settings.
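Conceptually, this works because equal model specifications produce the same digest while any change to the settings produces a different one. A stdlib-only illustration of the idea (not pytask's actual implementation):

```python
import hashlib
from typing import NamedTuple


class Model(NamedTuple):
    name: str


def value_hash(value: object) -> str:
    # Hash the repr of the value; NamedTuples have a stable, complete repr.
    return hashlib.sha256(repr(value).encode()).hexdigest()


# Same settings -> same hash -> the task is considered up to date.
assert value_hash(Model("ols")) == value_hash(Model("ols"))
# Changed settings -> different hash -> the task would be rerun.
assert value_hash(Model("logit")) != value_hash(Model("ols"))
```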
Adding another level
Extending a dimension by another level is usually quickly done. For example, if we have another model that we want to fit to the data, we extend MODELS, which automatically leads to all downstream tasks being created.
```python
...
MODELS = [Model("ols"), Model("logit"), Model("linear_prob"), Model("new_model")]
...
```
Of course, you might need to alter task_fit_model so that it handles the new model as well as the others. This is where it pays off to use high-level interfaces in your code that handle all models with a single fitted_model = fit_model(data=data, model_name=model_name) call and return fitted models that behave alike.
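Such a high-level interface can be as simple as a dispatcher keyed by model name. A hypothetical sketch (fit_model and the per-model fitting functions are illustrative, not part of pytask or the example project):

```python
from typing import Any, Callable


def _fit_ols(data: Any) -> str:
    return "fitted-ols"  # placeholder for a real fitting routine


def _fit_logit(data: Any) -> str:
    return "fitted-logit"  # placeholder for a real fitting routine


# Registry: supporting a new model means adding one entry here, nothing else.
FITTERS: dict[str, Callable[[Any], Any]] = {
    "ols": _fit_ols,
    "logit": _fit_logit,
}


def fit_model(data: Any, model_name: str) -> Any:
    try:
        return FITTERS[model_name](data)
    except KeyError:
        raise ValueError(f"Unknown model: {model_name!r}") from None


assert fit_model(data=None, model_name="ols") == "fitted-ols"
```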
Executing a subset
What if you want to execute a subset of tasks, for example, all tasks related to a model or a dataset?
When you use the .name attributes of the dimensions and multi-dimensional objects as in the example above, you ensure that the dimension names are included in the ids of all downstream tasks.
Thus, you can simply call pytask with the following expression to execute all tasks related to the logit model:

```console
pytask -k logit
```
See also
Expressions and markers for selecting tasks are explained in Selecting tasks.
Extending repetitions
Some repeated tasks are costly to run, whether in terms of computing power, memory, or runtime. If you change a task module, you might accidentally trigger all other tasks in the module to be rerun. To avoid this, use the @pytask.mark.persist decorator, which is explained in more detail in this tutorial.