Skip to content

Run the project pipeline

In this section, we will look at the OpenSAFELY project pipeline.

So far, we have run the single dataset definition step, or scripted action, using the command line with the command:

opensafely exec ehrql:v1 generate-dataset analysis/dataset_definition.py`

A complete OpenSAFELY study may include multiple actions. For example, the first action might extract a dataset, and a subsequent action might generate a table or chart from that data.

The project.yaml file in the study repository defines the actions for an OpenSAFELY project pipeline.

The project.yaml file🔗

In the Visual Studio Code file Explorer, open the project.yaml file by clicking on it. This file will be near the end of the list of files and folders.

You should see a tab with the following content:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
version: "3.0"

# Ignore this `expectations` block. It is required but not used, and will be removed in future versions.
expectations:
  population_size: 1000

actions:
  generate_dataset:
    run: ehrql:v1 generate-dataset analysis/dataset_definition.py --output output/dataset.csv.gz
    outputs:
      highly_sensitive:
        dataset: output/dataset.csv.gz

There is a single action defined, called generate_dataset, in this project pipeline.

The highlighted line is the command that the action runs, and is very similar to the command we previously ran.

The difference is that generate_dataset defines an output stored in the output folder.

Running the action in the pipeline🔗

  1. In the Visual Studio Code file Explorer, confirm that the `output` folder is empty.
  2. In the Visual Studio Code Terminal, type:
    opensafely run generate_dataset
    
    and press ++enter++ to run the pipeline action. You should see output that ends something like the following:
    <...several lines of output...>
    generate_dataset: Extracting output file: output/dataset.csv.gz
    generate_dataset: Finished recording results
    generate_dataset: Completed successfully
    generate_dataset: Cleaning up container and volume
    
    => generate_dataset
       Completed successfully
    
       log file: metadata/generate_dataset.log
       outputs:
         output/dataset.csv.gz  - highly_sensitive
    
    The final line tells you a file of (randomly-generated) patient data has been created at `output/dataset.csv.gz`, and that it should be considered highly sensitive data. What you see here is exactly the same process that would happen on a real, secure server.
  3. When the command completes, recheck the `output` folder and see that it contains a `dataset.csv.gz` file.

Viewing the dataset output🔗

This .csv.gz file is a compressed CSV file that contains a small amount of dummy data (patient ID and sex) based on the dataset definition at analysis/dataset_definition.py.

To view it, first run opensafely unzip output, then open that file (by left-clicking the filename in Visual Studio Code's Explorer, or software like Excel). You'll see that it contains rows for ten randomly-generated dummy patients.

The difference between opensafely exec and opensafely run🔗

Both opensafely exec and opensafely run can run actions.

The difference between them is that:

  • opensafely exec runs actions outside of the project pipeline and is useful for quick feedback during interactive development
  • opensafely run runs actions inside the project pipeline - that is, just as they would be in the secure OpenSAFELY environment containing real patient data