Data passing¶

Tip

👉 Check out the full data passing API reference.

To pass data between the different pipeline steps, across different languages, we make use of the Apache Arrow project. The Orchest SDK provides a convenience wrapper around Apache Arrow to be used within Orchest.

We will start with an example to illustrate how to pass data between pipeline steps before diving into more detail.

Python example¶

Note

💡 Orchest also supports data passing for languages other than Python. For example, check out the Orchest SDK section on R.

In this example we will use Python to illustrate how to pass data between different pipeline steps. Let the pipeline be defined as follows:

step-1 --> step-3
step-2 --> step-3

An example pipeline.¶

In both steps 1 and 2 we will create some data and pass it to step 3 under specific names so that we can later use those names to get the data.

"""step-1"""
import orchest

data = "Hello, World!"

# Output the data so that step-3 can retrieve it.
orchest.output(data, name="my_string")
"""step-2"""
import orchest

data = [3, 1, 4]

# Output the data so that step-3 can retrieve it.
orchest.output(data, name="my_list")

When outputting the data in steps 1 and 2 the data is actually copied to another location in shared memory so that other steps can access it. This explains why you can access the data from inside JupyterLab as well!
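To make the copy semantics concrete, here is a minimal sketch of step-1: once output() returns, the data lives in shared memory, so rebinding or mutating the local variable afterwards does not affect what other steps receive.

"""step-1"""
import orchest

data = "Hello, World!"
orchest.output(data, name="my_string")

# The output is a serialized copy in shared memory: rebinding the
# local variable afterwards does not change what step-3 receives.
data = "Goodbye, World!"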

Now that the data is in memory, step-3 can be executed to retrieve the data for further processing.

"""step-3"""
import orchest

# Get the input for step-3, i.e. the output of step-1 and step-2.
input_data = orchest.get_inputs()

Warning

🚨 Only call orchest.transfer.get_inputs() and orchest.transfer.output() once. Calling get_inputs() multiple times can break your code in jobs, and calling output() multiple times will overwrite previously outputted data.

The input_data in step-3 will be as follows:

{
 "my_list": [3, 1, 4],
 "my_string": "Hello, World!",
 "unnamed": []
}

You can see that both my_string and my_list, the outputs of steps 1 and 2 respectively, are in the received input data. But what is the unnamed key? We will answer this in the next section.
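Since get_inputs() returns a plain dictionary, step-3 can simply index into it by name. A minimal sketch, continuing the example:

"""step-3"""
import orchest

input_data = orchest.get_inputs()

# Access the outputs of step-1 and step-2 by the names they were given.
text = input_data["my_string"]   # "Hello, World!"
numbers = input_data["my_list"]  # [3, 1, 4]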

Tip

👉 You can increase the size of the shared memory (to allow for larger data to be passed) in the pipeline settings.

Passing data without a name¶

As you saw in the previous example, step-3 received input data with a special key called unnamed. When passing data it is not necessary to give the outputted data a name. For example, we could change what step-1 outputs:

"""step-1"""
import orchest

data = "Hello, World!"

# Output the data so that step-3 can retrieve it.
# But this time, don't give a name.
orchest.output(data, name=None)

The input_data in step-3 will now be equal to:

{
 "my_list": [3, 1, 4],
 "unnamed": ["Hello, World!"]
}

The list under the unnamed key gets populated with the values of the steps that outputted data without a name.

For example, we could change the code of step-2 to:

"""step-2"""
import orchest

data = [3, 1, 4]

orchest.output(data, name=None)

Making the input_data in step-3 equal to:

{
 "unnamed": ["Hello, World!", [3, 1, 4]]
}

But how exactly is this useful?

By outputting data without a name, the receiving step can treat the values as a collection (it is even an ordered collection, see order of unnamed data). Just like in regular programming, sometimes you would rather use a list than a dictionary to store your data.
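For instance, step-3 could iterate over the unnamed outputs without knowing their names or how many there are. A minimal sketch, assuming both steps output without a name as above:

"""step-3"""
import orchest

input_data = orchest.get_inputs()

# Treat the unnamed outputs as one ordered collection.
for value in input_data["unnamed"]:
    print(value)  # "Hello, World!" and then [3, 1, 4]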

Tip

👉 For the majority of cases passing data with a name is the way to go!

Order of unnamed data¶

Note

💡 orchest.transfer.get_inputs() actually infers the order via the pipeline definition. The UI simply stores the order in the pipeline definition file and provides a visual handle to it.

The image below is a screenshot of the properties pane of step-3 (from the example above) in the UI. The order of the list in the screenshot can be changed with a simple drag and drop.

[Screenshot of the properties pane of step-3, listing its incoming connections with step-2 above step-1.]

Having the above order of connections, the input_data in step-3 becomes (note how the order of the data in unnamed has changed!):

{
 "unnamed": [[3, 1, 4], "Hello, World!"]
}

Top-to-bottom in the UI corresponds with left-to-right in unnamed.
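If a step relies on this order, the unnamed list can also be unpacked directly. A minimal sketch, assuming the connection order from the screenshot (step-2 above step-1):

"""step-3"""
import orchest

input_data = orchest.get_inputs()

# step-2 is ordered above step-1 in the UI, so its output comes first.
my_list, my_string = input_data["unnamed"]
# my_list == [3, 1, 4] and my_string == "Hello, World!"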

Memory footprint¶

When passing data between steps, it is by default passed through memory and kept in memory for the lifetime of the session. This makes it easy to develop your pipeline without having to rerun it in its entirety. The data is essentially cached in memory.

If you are passing large amounts of data between steps, then you might have to increase the size of the cache. This can be done through the pipeline settings. Alternatively, you might want to enable auto-eviction so that cached data is removed once all receiving steps have obtained the passed data.
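To give a sense of scale, the sketch below outputs a NumPy array of roughly 800 MB. With a small shared memory size this might not fit, in which case you would increase the cache size or enable auto-eviction in the pipeline settings (the array shape here is just an illustrative assumption):

"""step-1"""
import numpy as np
import orchest

# Roughly 10_000 * 10_000 * 8 bytes ≈ 800 MB of float64 data.
data = np.random.rand(10_000, 10_000)
orchest.output(data, name="big_array")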