Write a task¶
Using the project structure from the previous tutorial, write your first task.
The task task_create_random_data is defined in
src/my_project/task_data_preparation.py and generates a data set stored in
bld/data.pkl.
The task_ prefix on module and function names matters because pytask uses it to
discover tasks automatically.
my_project
│
├───.pytask
│
├───bld
│   └────data.pkl
│
├───src
│   └───my_project
│       ├────__init__.py
│       ├────config.py
│       └────task_data_preparation.py
│
└───pyproject.toml
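The task module imports BLD from my_project.config. As a reminder of what that module could contain, here is a minimal sketch that derives the paths from the location of config.py. It assumes the layout shown in the tree above; project_paths is a hypothetical helper, not part of pytask.

```python
from pathlib import Path


def project_paths(config_file: Path) -> tuple[Path, Path]:
    """Return (SRC, BLD) for a config.py located in src/my_project/."""
    # SRC is the package directory that contains config.py.
    src = config_file.resolve().parent
    # Two levels above the package directory is the project root, which holds bld.
    bld = src.parents[1] / "bld"
    return src, bld


# Inside config.py you would call: SRC, BLD = project_paths(Path(__file__))
SRC, BLD = project_paths(Path("my_project/src/my_project/config.py"))
print(BLD.name)
```

Deriving the paths from the file location keeps them valid regardless of the directory from which pytask is started.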
Generally, a task is a function whose name starts with task_. Tasks produce outputs,
and the most common output is a file, which is what these tutorials focus on.
The following interfaces are different ways to specify the products of a task. pytask needs this information to run a workflow correctly. The interfaces are ordered from most to least recommended.
Important
You cannot mix different interfaces for the same task. Choose only one.
The task accepts the argument path that points to the file where the data set will be
stored. The path is passed to the task via the default value, BLD / "data.pkl". To
indicate that this file is a product we add some metadata to the argument.
The type hint Annotated[Path, Product] uses
Annotated syntax. The first entry specifies the argument type
(Path), and the second entry (Product) marks this
argument as a product.
# Content of task_data_preparation.py.
from pathlib import Path
from typing import Annotated

import numpy as np
import pandas as pd
from my_project.config import BLD
from pytask import Product


def task_create_random_data(path: Annotated[Path, Product] = BLD / "data.pkl") -> None:
    rng = np.random.default_rng(0)
    beta = 2

    x = rng.normal(loc=5, scale=10, size=1_000)
    epsilon = rng.standard_normal(1_000)

    y = beta * x + epsilon

    df = pd.DataFrame({"x": x, "y": y})
    df.to_pickle(path)
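To make the role of the Annotated hint more concrete, the following sketch inspects a function's annotations with the standard library and lists the arguments that carry a marker. ProductMarker, task_example, and product_arguments are hypothetical names used for illustration; this is not how pytask is implemented, only a demonstration of the metadata an Annotated hint exposes.

```python
from pathlib import Path
from typing import Annotated, get_args, get_origin, get_type_hints


class ProductMarker:
    """Hypothetical stand-in for pytask's Product marker."""


def task_example(path: Annotated[Path, ProductMarker] = Path("bld/data.pkl")) -> None:
    ...


def product_arguments(func) -> list[str]:
    """Return the argument names whose Annotated metadata contains ProductMarker."""
    # include_extras=True keeps the Annotated metadata instead of stripping it.
    hints = get_type_hints(func, include_extras=True)
    names = []
    for name, hint in hints.items():
        # get_args returns (Path, ProductMarker); everything after the type is metadata.
        if get_origin(hint) is Annotated and ProductMarker in get_args(hint)[1:]:
            names.append(name)
    return names


print(product_arguments(task_example))  # ['path']
```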
Tip
If you want to refresh your knowledge about type hints, read this guide.
Tasks can use produces as an argument name. Every value, or in this case path, passed
to this argument is automatically treated as a task product. Here, the path is given by
the default value of the argument.
# Content of task_data_preparation.py.
from pathlib import Path

import numpy as np
import pandas as pd
from my_project.config import BLD


def task_create_random_data(produces: Path = BLD / "data.pkl") -> None:
    rng = np.random.default_rng(0)
    beta = 2

    x = rng.normal(loc=5, scale=10, size=1_000)
    epsilon = rng.standard_normal(1_000)

    y = beta * x + epsilon

    df = pd.DataFrame({"x": x, "y": y})
    df.to_pickle(produces)
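Because a task is an ordinary Python function, you can also call it directly for a quick check and override the default path. The sketch below uses only the standard library so that it runs without numpy or pandas; task_write_greeting and greeting.txt are made-up names for illustration.

```python
import tempfile
from pathlib import Path


def task_write_greeting(produces: Path = Path("bld") / "greeting.txt") -> None:
    # The default value is the product path; a caller can override it.
    produces.write_text("hello")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "greeting.txt"
    # Call the task like any other function, writing to a temporary location.
    task_write_greeting(produces=target)
    content = target.read_text()

print(content)  # hello
```

A direct call like this bypasses pytask entirely, so it is only useful for checking the function body, not the workflow.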
Now, run pytask. It collects the tasks in the current directory and its subdirectories and executes them.
$ pytask
────────────────────────── Start pytask session ─────────────────────────
Platform: win32 -- Python 3.12.0, pytask 0.5.3, pluggy 1.3.0
Root: C:\Users\pytask-dev\git\my_project
Collected 1 task.
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ Task                                              ┃ Outcome ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━┩
│ task_data_preparation.py::task_create_random_data │ .       │
└───────────────────────────────────────────────────┴─────────┘
─────────────────────────────────────────────────────────────────────────
╭─────────── Summary ────────────╮
│  1  Collected tasks            │
│  1  Succeeded (100.0%)         │
╰────────────────────────────────╯
─────────────────────── Succeeded in 0.06 seconds ───────────────────────
Customize task names¶
Use the @task decorator to mark a function as a task regardless of
its function name. You can optionally pass a new name for the task. Otherwise, pytask
uses the function name.
from pytask import task


# The id will be ".../task_data_preparation.py::create_random_data".
@task
def create_random_data(): ...


# The id will be ".../task_data_preparation.py::create_data".
@task(name="create_data")
def create_random_data(): ...
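The two usages above, with and without parentheses, follow a common decorator pattern. The sketch below shows one way such a decorator can be written, including the fallback to the function name. mark_task and the task_name attribute are hypothetical and do not reflect pytask's internals.

```python
from typing import Callable, Optional


def mark_task(func: Optional[Callable] = None, *, name: Optional[str] = None):
    """Hypothetical stand-in for a task decorator that records a task name."""

    def decorate(f: Callable) -> Callable:
        # Fall back to the function name when no explicit name is given.
        f.task_name = name or f.__name__
        return f

    # Used as @mark_task: func is the decorated function itself.
    # Used as @mark_task(name=...): func is None, so return the decorator.
    return decorate(func) if func is not None else decorate


@mark_task
def create_random_data(): ...


@mark_task(name="create_data")
def create_other_data(): ...


print(create_random_data.task_name)  # create_random_data
print(create_other_data.task_name)  # create_data
```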
Customize task module names¶
Use the configuration value task_files if you prefer a different naming
scheme for the task modules. task_*.py is the default. You can specify one or multiple
patterns to collect tasks from other files.
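To get a feel for which file names a set of patterns would select, you can try them with the standard library. This is only an illustration of glob-style matching; the second pattern and the file list are made up, and pytask's own matching may differ in details.

```python
from fnmatch import fnmatchcase

# "task_*.py" is the default; "*_task.py" is a hypothetical additional pattern.
patterns = ["task_*.py", "*_task.py"]
files = ["task_data_preparation.py", "config.py", "plot_task.py", "__init__.py"]

# Keep every file name that matches at least one pattern.
collected = [f for f in files if any(fnmatchcase(f, p) for p in patterns)]

print(collected)  # ['task_data_preparation.py', 'plot_task.py']
```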