For Pipeline Designers

The pipeline idea

A LOST-Pipeline consists of different elements that will be processed in a defined order by the LOST-Engine to transform data into knowledge.

A LOST-Pipeline is defined by a PipelineTemplate and modeled as a directed graph. A Pipeline is an instance of a PipelineTemplate. A PipelineTemplate may define a graph that consists of the following PipelineElements:

  • Script: A user defined script that transforms input data to output data.
  • Datasource: Input data for an annotation pipeline. In most cases this will be a folder with images.
  • AnnotationTask: An image annotation task performed by a human annotator.
  • Visualization: Can display an image or html text in the web gui that was generated by a user defined script.
  • DataExport: Provides a download link to a file that was generated by a script.
  • Loop: A loop element points to another element in the Pipeline and creates a loop in the graph. It behaves similarly to a while loop in a programming language.

Designing a pipeline - A first example

In the following we will have a look at the sia_all_tools pipeline which is part of the sia pipeline project example in LOST. Based on this example we will discuss all the important steps when developing your own pipeline.

Pipeline Projects

A pipeline is defined by a json file and is related to Script elements. A Script is essentially a python file. Multiple pipelines and scripts can be bundled as a pipeline project and imported into LOST. A pipeline project is defined as a folder of pipeline and script files. The listing below shows the file structure of the sia pipeline project. In our example we will focus on the sia_all_tools pipeline (sia_all_tools.json) and its related scripts (export_csv.py, request_annos.py).

sia/
├── export_csv.py
├── request_annos.py
├── request_yolo_annos.py
├── semiauto_yolov3.json
└── sia_all_tools.json

0 directories, 5 files
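
To make the folder convention concrete, here is a minimal sketch that sorts the files of a pipeline project folder into pipeline definitions (json) and scripts (python). The helper names list_project and demo_project are hypothetical and not part of LOST; the sketch only relies on the file extensions described above.

```python
import tempfile
from pathlib import Path


def list_project(folder):
    """Split the files of a pipeline project folder into pipelines and scripts.

    Hypothetical helper: it only looks at file extensions.
    """
    folder = Path(folder)
    return {
        "pipelines": sorted(p.name for p in folder.glob("*.json")),
        "scripts": sorted(p.name for p in folder.glob("*.py")),
    }


def demo_project():
    """Recreate the file names of the sia project in a temporary folder."""
    names = [
        "export_csv.py",
        "request_annos.py",
        "request_yolo_annos.py",
        "semiauto_yolov3.json",
        "sia_all_tools.json",
    ]
    with tempfile.TemporaryDirectory() as tmp:
        for name in names:
            (Path(tmp) / name).touch()  # empty placeholder files
        return list_project(tmp)


if __name__ == "__main__":
    print(demo_project())
```

For the sia project this yields two pipelines and three scripts, which matches the five files in the listing.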

A Pipeline Definition File

Below you can see the pipeline definition file of the sia_all_tools pipeline. This pipeline will request annotations for all images inside a folder from the Single Image Annotation (SIA) tool and export these annotations to a csv file. The created csv file will be available for download by means of a DataExport element inside the web gui.

As you can see in the listing, the pipeline is defined by a json object that has a description, an author, a pipe-schema-version and a list of pipeline elements. Each element is defined by a json object and has a peN (pipeline element number), which is the identifier of the element itself. Every element also needs an attribute called peOut, which contains the list of elements that the current element is connected to.

The first element in the sia_all_tools pipeline is a Datasource (peN: 0) of type rawFile. This Datasource will provide a path to a folder with images inside of the LOST filesystem. The exact path is selected when the pipeline is started. The Datasource element is connected to the Script element with peN: 1. This Script element is defined by peN, peOut and a script path plus a script description. The script path needs to be defined relative to the pipeline project folder.

The Script element is connected to an AnnotationTask element with peN: 2 of type sia. Within the json object of the AnnotationTask you can specify the name of the task and also instructions for the annotator. In the configuration part all tools and actions are allowed for SIA in this pipeline. If you want an annotator to use only bounding boxes for annotation, you could set point, line and polygon to false.
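
As an illustration of the bounding-box-only setup, the following sketch starts from the tools section used in the sia_all_tools pipeline and switches point, line and polygon off. The function name bbox_only is hypothetical; in practice you would simply edit the values in the json file.

```python
import copy
import json

# Tools section as used in the sia_all_tools pipeline definition.
ALL_TOOLS = {"point": True, "line": True, "polygon": True, "bbox": True, "junk": True}


def bbox_only(tools):
    """Return a copy of the tools section with point, line and polygon disabled."""
    restricted = copy.deepcopy(tools)
    for tool in ("point", "line", "polygon"):
        restricted[tool] = False
    return restricted


if __name__ == "__main__":
    # Print the fragment as it could appear inside "configuration".
    print(json.dumps({"tools": bbox_only(ALL_TOOLS)}, indent=2))
```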

The AnnotationTask element is connected to a Script element with path: export_csv.py. This script will read all annotations from the AnnotationTask and create a csv file with these annotations inside the LOST filesystem. The created csv file will be made available for download by the DataExport element with peN: 4. A DataExport element may serve an arbitrary file from the LOST filesystem for download.

{
  "description": "This pipeline selects all images of a datasource and requests annotations.",
  "author": "Jonas Jaeger",
  "pipe-schema-version" : 1.0,
  "elements": [{
      "peN": 0,
      "peOut": [1],
      "datasource": {
        "type": "rawFile"
      }
    },
    {
      "peN": 1,
      "peOut": [2],
      "script": {
        "path": "request_annos.py",
        "description": "Request annotations for all images in a folder"
      }
    },
    {
      "peN": 2,
      "peOut": [3],
      "annoTask": {
        "name": "Single Image Annotation Task",
        "type": "sia",
        "instructions": "Please draw bounding boxes for all objects in image.",
        "configuration": {
          "tools": {
              "point": true,
              "line": true,
              "polygon": true,
              "bbox": true,
              "junk": true
          },
          "annos":{
              "multilabels": false,
              "actions": {
                  "draw": true,
                  "label": true,
                  "edit": true
              },
              "minArea": 250
          },
          "img": {
              "multilabels": false,
              "actions": {
                  "label": true
              }
          }
        }
      }
    },
    {
      "peN": 3,
      "peOut": [4],
      "script": {
        "path": "export_csv.py",
        "description": "Export all annotations to a csv file."
      }
    },
    {
      "peN": 4,
      "peOut": null,
      "dataExport": {}
    }
  ]
}
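
The graph structure encoded by peN and peOut can be explored with a few lines of Python. The sketch below is an illustration only and not how the LOST-Engine schedules elements: build_graph, element_kind and processing_order are hypothetical helpers, the pipeline dictionary is an abbreviated copy of the listing above, and the ordering assumes a pipeline without Loop elements (no cycles).

```python
# Abbreviated copy of the sia_all_tools definition (only the graph-related keys).
SIA_ALL_TOOLS = {
    "elements": [
        {"peN": 0, "peOut": [1], "datasource": {"type": "rawFile"}},
        {"peN": 1, "peOut": [2], "script": {"path": "request_annos.py"}},
        {"peN": 2, "peOut": [3], "annoTask": {"type": "sia"}},
        {"peN": 3, "peOut": [4], "script": {"path": "export_csv.py"}},
        {"peN": 4, "peOut": None, "dataExport": {}},
    ]
}


def build_graph(pipeline):
    """Map each peN to the list of peN values named in its peOut."""
    graph = {}
    for element in pipeline["elements"]:
        # peOut is null (None) for elements without a successor.
        graph[element["peN"]] = list(element["peOut"] or [])
    for pe_n, targets in graph.items():
        for target in targets:
            if target not in graph:
                raise ValueError(f"element {pe_n} points to unknown element {target}")
    return graph


def element_kind(element):
    """Return the key that names the element type, e.g. 'script'."""
    return [key for key in element if key not in ("peN", "peOut")][0]


def processing_order(graph):
    """Order elements so that each one follows all of its predecessors.

    Uses Kahn's algorithm; elements that are part of a cycle are left out.
    """
    incoming = {node: 0 for node in graph}
    for targets in graph.values():
        for target in targets:
            incoming[target] += 1
    ready = sorted(node for node, count in incoming.items() if count == 0)
    order = []
    while ready:
        node = ready.pop(0)
        order.append(node)
        for target in graph[node]:
            incoming[target] -= 1
            if incoming[target] == 0:
                ready.append(target)
    return order


if __name__ == "__main__":
    graph = build_graph(SIA_ALL_TOOLS)
    for pe_n in processing_order(graph):
        print(pe_n, element_kind(SIA_ALL_TOOLS["elements"][pe_n]), "->", graph[pe_n])
```

A check like the one in build_graph can catch a peOut entry that refers to a peN that does not exist before the pipeline project is imported.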

How to write a script?

request_annos.py

A script in LOST is just a normal python3 module. In the listing below you can see the request_annos.py script from our example pipeline (sia_all_tools). The request_annos.py script will read in a path to an imageset from the previous datasource element in the pipeline and will request annotations from the next annotation task element in the pipeline. This script will also send dummy annotation proposals to the annotation task if one of the arguments is set to true when the pipeline is started in the web gui.

from lost.pyapi import script
import os

ENVS = ['lost']
ARGUMENTS = {'polygon' : { 'value':'false',
                            'help': 'Add a dummy polygon proposal as example.'},
            'line' : { 'value':'false',
                            'help': 'Add a dummy line proposal as example.'},
            'point' : { 'value':'false',
                            'help': 'Add a dummy point proposal as example.'},
            'bbox' : { 'value':'false',
                            'help': 'Add a dummy bbox proposal as example.'}
            }
class LostScript(script.Script):
    '''Request annotations for each image of an imageset.

    An imageset is basically a folder with images.
    '''
    def main(self):
        for ds in self.inp.datasources:
            media_path = ds.path
            fs = ds.get_fs()
            annos = []
            anno_types = []
            if self.get_arg('polygon').lower() == 'true':
                polygon= [[0.1,0.1],[0.4,0.1],[0.2,0.3]]
                annos.append(polygon)
                anno_types.append('polygon')
            if self.get_arg('line').lower() == 'true':
                line= [[0.5,0.5],[0.7,0.7]]
                annos.append(line)
                anno_types.append('line')
            if self.get_arg('point').lower() == 'true':
                point= [0.8,0.1]
                annos.append(point)
                anno_types.append('point')
            if self.get_arg('bbox').lower() == 'true':
                box= [0.6,0.6,0.1,0.05]
                annos.append(box)
                anno_types.append('bbox')
            for img_path in fs.ls(media_path):
                self.outp.request_annos(img_path=img_path, annos=annos, anno_types=anno_types, fs=fs)
                self.logger.info('Requested annos for: {}'.format(img_path))

if __name__ == "__main__":
    my_script = LostScript() 

In order to write a LOST script you need to define a class that inherits from lost.pyapi.script.Script and defines a main method (see below).

from lost.pyapi import script

class LostScript(script.Script):
    '''Request annotations for each image of an imageset.

    An imageset is basically a folder with images.
    '''
    def main(self):
        for ds in self.inp.datasources:
            media_path = ds.path

Later on you need to instantiate it and your LOST script is done.

if __name__ == "__main__":
    my_script = LostScript()

In the request_annos.py script you can also see the special variables ENVS and ARGUMENTS. These variables are read during the import process. The ENVS variable provides meta information for the pipeline engine by defining a list of environments (similar to conda environments) in which this script may be executed. In this way you can ensure that a script is only executed in environments where all of its dependencies are installed. Environments are installed in workers, which execute your script. If several environments are listed in ENVS, the pipeline engine tries to assign the script to a worker in the order given by the list: if a worker that has the first environment installed is online, the script is assigned to that worker. If no worker with the first environment is online, the engine tries the second environment in the list, and so on.

ENVS = ['lost']
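
The assignment order described above can be sketched as a simple first-match search. This is an illustration of the rule only, not the engine's actual code: pick_environment is a hypothetical function, and the environment name 'my-gpu-env' is made up for the example ('lost' is the environment used by the example scripts).

```python
def pick_environment(envs, online_worker_envs):
    """Return the first entry of envs that some online worker provides.

    envs: the ENVS list of a script, in order of preference.
    online_worker_envs: set of environment names offered by online workers.
    Returns None if no listed environment is currently available.
    """
    for env in envs:
        if env in online_worker_envs:
            return env
    return None


if __name__ == "__main__":
    # 'my-gpu-env' is a hypothetical environment name used for illustration.
    envs = ['my-gpu-env', 'lost']
    print(pick_environment(envs, {'lost'}))                # falls back to 'lost'
    print(pick_environment(envs, {'lost', 'my-gpu-env'}))  # prefers 'my-gpu-env'
```

Listing the most specialised environment first and a general one last gives the script a fallback when the preferred worker is offline.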

The ARGUMENTS variable is used to provide script arguments that can be set in the web gui when a pipeline is started. ARGUMENTS is defined as a dictionary that maps each argument name to another dictionary with the keys value and help. As you can see in the listing below, the first argument is called polygon, its value is false and its help text is "Add a dummy polygon proposal as example."

ARGUMENTS = {'polygon' : { 'value':'false',
                            'help': 'Add a dummy polygon proposal as example.'},
            'line' : { 'value':'false',
                            'help': 'Add a dummy line proposal as example.'},
            'point' : { 'value':'false',
                            'help': 'Add a dummy point proposal as example.'},
            'bbox' : { 'value':'false',
                            'help': 'Add a dummy bbox proposal as example.'}
            }

Within your script you can access the value of an argument with the get_arg(…) method as shown below.

            if self.get_arg('polygon').lower() == 'true':
                polygon= [[0.1,0.1],[0.4,0.1],[0.2,0.3]]
                annos.append(polygon)
                anno_types.append('polygon')
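
Note that the argument values in this script are strings such as 'false', which is why the example compares the lowercased value against 'true'. A small helper can keep that comparison in one place. The sketch below is self-contained and the name arg_is_true is hypothetical; inside a script you would pass it the result of self.get_arg(...).

```python
def arg_is_true(value):
    """Interpret a script argument value such as 'true' or ' False ' as a bool.

    Anything other than the word 'true' (ignoring case and surrounding
    whitespace) counts as False.
    """
    return str(value).strip().lower() == 'true'


if __name__ == "__main__":
    for raw in ('true', 'True', ' TRUE ', 'false', '', 'yes'):
        print(repr(raw), '->', arg_is_true(raw))
```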

A script can access all the elements it is connected to. Each script has an input and an output object. Since the input of our request_annos.py script is connected to a Datasource element, we access it by iterating over all Datasource objects that are connected to the input and read out the path where a folder with images is provided:

        for ds in self.inp.datasources:
            media_path = ds.path
            fs = ds.get_fs()

Now we can use the path provided by the datasource to read all image files that are located there and request annotations for each image, as you can see in the listing below.

It would be sufficient to provide only the img_path argument to the request_annos(...) method, but in our example script there is also the option to send some dummy annotations to the annotation tool. In a semi-automatic setup, you could use an AI model to generate annotation proposals and send these proposals to the annotation tool in the same way.

            for img_path in fs.ls(media_path):
                self.outp.request_annos(img_path=img_path, annos=annos, anno_types=anno_types, fs=fs)
                self.logger.info('Requested annos for: {}'.format(img_path))

Since each script has a logger, we can also write which images we have requested to the pipeline log file. The log file can be downloaded in the web gui. The logger object is a standard python logger.
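
In a semi-automatic setup the proposals have to be split into the two parallel lists annos and anno_types that request_annos.py passes on. The following sketch shows one way to do that with a basic sanity check. It is not part of LOST: pack_proposals is a hypothetical helper, and the check assumes that coordinates are given relative to the image size (all dummy values in the example script lie between 0 and 1).

```python
ALLOWED_TYPES = {'point', 'line', 'polygon', 'bbox'}


def pack_proposals(proposals):
    """Split (anno_type, coordinates) pairs into the lists annos and anno_types.

    Assumption: every coordinate value is relative, i.e. within [0, 1].
    """
    annos, anno_types = [], []
    for anno_type, coords in proposals:
        if anno_type not in ALLOWED_TYPES:
            raise ValueError('unknown annotation type: {}'.format(anno_type))
        # Points and boxes are flat lists, lines and polygons are lists of [x, y].
        if anno_type in ('point', 'bbox'):
            values = list(coords)
        else:
            values = [v for xy in coords for v in xy]
        if not all(0.0 <= v <= 1.0 for v in values):
            raise ValueError('coordinates out of range for {}'.format(anno_type))
        annos.append(coords)
        anno_types.append(anno_type)
    return annos, anno_types


if __name__ == "__main__":
    # Same dummy values as in request_annos.py.
    proposals = [
        ('polygon', [[0.1, 0.1], [0.4, 0.1], [0.2, 0.3]]),
        ('bbox', [0.6, 0.6, 0.1, 0.05]),
    ]
    annos, anno_types = pack_proposals(proposals)
    print(annos, anno_types)
```

The two returned lists always have the same length and order, which is what the annos and anno_types arguments in the example script rely on.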


export_csv.py

The export_csv.py script (see Listing e1) will read all annotations from its input and create a csv file from these annotations. This csv file will then be added to a DataExport element, which will provide the file in the web gui for download.

Listing e1: Full export_csv.py script.
from lost.pyapi import script
import os

ENVS = ['lost']
ARGUMENTS = {'file_name_parquet' : { 'value':'annos.parquet',
                            'help': 'Name of the file with exported bbox annotations in parquet format.'},
            'file_name_csv' : { 'value':'annos.csv',
                            'help': 'Name of the file with exported bbox annotations in csv format.'},
            }

class LostScript(script.Script):
    '''This Script creates a csv file from image annotations and adds a data_export
    to the output of this script in pipeline.
    '''
    def main(self):
        df = self.inp.to_df()
        fs = self.get_filesystem()
        
        file_path_parquet = self.get_path(self.get_arg('file_name_parquet'), context='instance')
        file_path_csv = self.get_path(self.get_arg('file_name_csv'), context='instance')
        self.logger.info('File path parquet: {}'.format(file_path_parquet))
        self.logger.info('File path csv: {}'.format(file_path_csv))
        
        with fs.open(file_path_parquet, 'wb') as f:
            df.to_parquet(f)
        
        with fs.open(file_path_csv, 'wb') as f:
            df.to_csv(f,sep=',',
                      header=True,
                      index=False)
        
        self.logger.info('Wrote file: {}'.format(fs.ls(os.path.split(file_path_parquet)[0])))
        self.outp.add_data_export(file_path=file_path_parquet, fs=fs)
    
        self.logger.info('Wrote file: {}'.format(fs.ls(os.path.split(file_path_csv)[0])))
        self.outp.add_data_export(file_path=file_path_csv, fs=fs)

if __name__ == "__main__":
    my_script = LostScript()

Now we will walk through the code step by step.

Listing e2: ENVS and ARGUMENTS of export_csv.py.
ENVS = ['lost']
ARGUMENTS = {'file_name_parquet' : { 'value':'annos.parquet',
                            'help': 'Name of the file with exported bbox annotations in parquet format.'},
            'file_name_csv' : { 'value':'annos.csv',
                            'help': 'Name of the file with exported bbox annotations in csv format.'},
            }

As you can see in the listing above, the script is executed in the standard lost environment. The name of the csv file can be set by the argument file_name_csv and has a default value of annos.csv. In the same way, the argument file_name_parquet sets the name of the parquet file and defaults to annos.parquet.

Listing e3: Transforming all annotations from input into a pandas.DataFrame.
    def main(self):
        df = self.inp.to_df()

The lost.pyapi.inout.Input.to_df() method will read all annotations from self.inp (the Input of this script, lost.pyapi.script.Script.inp) and transform them into a pandas.DataFrame.

Listing e4: Get the path to store the csv file.
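        file_path_parquet = self.get_path(self.get_arg('file_name_parquet'), context='instance')
        file_path_csv = self.get_path(self.get_arg('file_name_csv'), context='instance')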

Now the script will calculate the path to store the csv file (lost.pyapi.script.Script.get_path()). In general a script can store files to three different contexts. Since our csv file should only be used by this instance of the script, the instance context was selected.

It would also be possible to store a file to a context called pipe. In the pipe context all scripts within an annotation pipeline can access the file and exchange information in this way. The third context is called static. The static context allows you to access and store files in the pipeline project folder.

Listing e5: Store csv file to the LOST filesystem.
        fs = self.get_filesystem()

        with fs.open(file_path_csv, 'wb') as f:
            df.to_csv(f,sep=',',
                      header=True,
                      index=False)

After we have calculated file_path_csv, the csv file can be stored to this path. To do that, the to_csv method of the pandas.DataFrame is used. The parquet file is written in the same way with to_parquet.
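
The pattern used here is to write text data through a file handle that was opened in binary mode ('wb'). The script does this with pandas; the sketch below shows the same idea with the Python standard library only, so it runs without LOST or pandas. The function write_csv and the column names are hypothetical, and io.BytesIO stands in for the file object returned by fs.open.

```python
import csv
import io


def write_csv(binary_file, rows, fieldnames):
    """Write dict rows as comma separated text into a file opened in binary mode."""
    # Wrap the binary handle so that csv can write text into it.
    text = io.TextIOWrapper(binary_file, encoding='utf-8', newline='')
    writer = csv.DictWriter(text, fieldnames=fieldnames, delimiter=',')
    writer.writeheader()
    writer.writerows(rows)
    text.flush()
    text.detach()  # leave the underlying binary file open for the caller


if __name__ == "__main__":
    # Hypothetical column names, chosen for illustration only.
    rows = [{'img_path': 'a.jpg', 'label': 'cat'}]
    buffer = io.BytesIO()  # stands in for the handle from fs.open(path, 'wb')
    write_csv(buffer, rows, ['img_path', 'label'])
    print(buffer.getvalue())
```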

Listing e6: Adding the csv file path to all connected DataExport elements.
        self.outp.add_data_export(file_path=file_path_csv, fs=fs)

As a final step, the path to the csv file is passed to the connected DataExport element in order to make the file available for download via the web gui.

Importing a pipeline project

After creating a pipeline it needs to be imported into LOST. Please see Importing a Pipeline Project into LOST for more information.

Debugging a script

When your script starts to throw errors it is time for debugging your script inside the docker container. Please see Debugging a Script for more information.