Pipelines

Tip

👉 Check out the creating a pipeline from scratch video to learn how to create a pipeline in the visual editor.

Pipelines are an interactive tool for creating and experimenting with your data workflow.

A pipeline is made up of steps and connections:

  • Steps are executable files that run in their own isolated environments.

  • Connections link steps together to define how data flows (see data passing) and the order of step execution.

Pipelines are edited visually and stored in JSON format in the pipeline definition file. This allows pipeline changes (e.g. adding a step) to be versioned.

Figure: The quickstart pipeline.

Parameterizing pipelines

Pipelines take parameters as input (e.g. the data source connection URL) to vary their behaviour. Jobs can use different parameters to iterate through multiple runs of the same pipeline. Parameters can be set in the visual pipeline editor.

Tip

👉 For secrets, use environment variables since parameters are versioned.
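Inside a step, parameters can be read back through the Orchest SDK. The snippet below is a minimal sketch: it assumes the SDK's parameter helpers orchest.get_step_param() and orchest.get_pipeline_param(), and the parameter names (data_source_url, batch_size) are purely illustrative.

import orchest

# Read a parameter defined on this step in the visual editor
# (the name "data_source_url" is an illustrative example).
data_source_url = orchest.get_step_param("data_source_url")

# Read a parameter defined on the pipeline as a whole.
batch_size = orchest.get_pipeline_param("batch_size")

print(f"Fetching from {data_source_url} in batches of {batch_size}")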

Running a pipeline

Note

In this section we will learn what it means to run a pipeline interactively, when to do it, how to do it, and what to keep in mind.

Once set up, you can run your pipeline in two ways:

  • Interactive runs inside the pipeline editor.

  • Job runs (see job).

Interactive runs are a great way to rapidly prototype your pipeline. When using Jupyter Notebook (.ipynb) files, the notebook files are actively changed as if you were running the individual cells in JupyterLab. The output of a pipeline step is stored when you run it as part of a session, so you can rerun just the parts of the pipeline you are working on rather than the entire pipeline. For notebook-based steps, you can access these outputs directly from within the JupyterLab kernel.

Data passing

Pipelines can pass data between steps. For example, in an ETL pipeline, data can be passed between individual extract, transform and load steps.

Data is passed using the Orchest SDK:

import orchest
# Get data from incoming steps.
input_data = orchest.get_inputs()
# Some code that transforms the `input_data`.
res = ...
# Output the data.
orchest.output(res, name="transformed-data")
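A downstream step can then pick up this named output. A minimal sketch, assuming orchest.get_inputs() returns the outputs of incoming steps keyed by the name passed to orchest.output():

import orchest

# Retrieve the outputs of all incoming steps.
input_data = orchest.get_inputs()

# Access the output the upstream step published under this name.
transformed = input_data["transformed-data"]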

See more in data passing.

Storing data locally

Pipeline steps can read from and write to the /data directory. For example:

# Get a text file from some external source.
txt_data = ...

with open("/data/nltk_example_text.txt", "w") as f:
    f.write(txt_data)
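Reading the file back in a later step works the same way; a minimal sketch using the path written above:

# Read the text file back in a downstream step.
with open("/data/nltk_example_text.txt") as f:
    txt_data = f.read()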