SONG Python SDK

The SONG Python SDK is a simple Python module that allows you to interact with a SONG server through Python with minimal coding effort.

It lets you upload payloads synchronously or asynchronously, check their status and create analyses. From there, you can use the power of Python to process and analyze the data within those objects however you see fit.

Prerequisites

Python 3.6 or higher is REQUIRED, since the SDK uses the dataclasses module.

Installation

The official SONG Python SDK is publicly hosted on PyPI. To install it, just run the command below:

pip install overture-song

Configuration

To configure the SDK, import the ApiConfig class from the overture_song.model module. An ApiConfig object holds the SONG server URL, the study ID, and an access token, and is passed to the client classes in overture_song.client (such as Api and StudyClient), which use it to authenticate every request. The Configuration step of the Tutorial below walks through a complete example.

Tutorial

This section demonstrates example usage of the overture-song SDK. After completing this tutorial, you will have uploaded your first SONG metadata payload!

For the impatient, the code used below can be found in examples/example_upload.py.

Warning

Python 3.6 or higher is required.

Configuration

Create an ApiConfig object. This object contains the serverUrl, accessToken, and studyId that will be used to interact with the SONG API. In this example we will use https://song.cancercollaboratory.org for the serverUrl and ‘ABC123’ for the studyId. For the access token, please refer to Creating an Access Token.

from overture_song.model import ApiConfig
api_config = ApiConfig('https://song.cancercollaboratory.org', 'ABC123', <my_access_token>)

Next the main API client needs to be instantiated in order to interact with the SONG server.

from overture_song.client import Api
api = Api(api_config)

As a sanity check, ensure that the server is running. If the response is True, you may proceed to the next section; otherwise, the server is not running.

>>> api.is_alive()
True

Create a Study

If the studyId ‘ABC123’ does not exist on the server, the StudyClient must be instantiated in order to read and create studies.

First create a study client,

from overture_song.client import StudyClient
study_client = StudyClient(api)

If the study associated with the payload does not exist, then create a Study entity,

from overture_song.entities import Study
if not study_client.has(api_config.study_id):
    study = Study.create(api_config.study_id, "myStudyName", "myStudyDescription", "myStudyOrganization")
    study_client.create(study)

Create a Simple Payload

Now that the study exists, you can create your first payload! In this example, a SequencingReadAnalysis will be created. It follows the SequencingRead JsonSchema.

See also

Similarly, for the VariantCallAnalysis, refer to the VariantCall JsonSchema.

First, import all the entities to minimize the number of import statements.

from overture_song.entities import *

Next, create an example Donor entity:

donor = Donor()
donor.studyId = api_config.study_id
donor.donorGender = "male"
donor.donorSubmitterId = "dsId1"
donor.set_info("randomDonorField", "someDonorValue")

Create an example Specimen entity:

specimen = Specimen()
specimen.specimenClass = "Tumour"
specimen.specimenSubmitterId = "sp_sub_1"
specimen.specimenType = "Normal - EBV immortalized"
specimen.set_info("randomSpecimenField", "someSpecimenValue")

Create an example Sample entity:

sample = Sample()
sample.sampleSubmitterId = "ssId1"
sample.sampleType = "RNA"
sample.set_info("randomSample1Field", "someSample1Value")

Create 1 or more example File entities:

# File 1
file1 = File()
file1.fileName = "myFilename1.bam"
file1.studyId = api_config.study_id
file1.fileAccess = "controlled"
file1.fileMd5sum = "myMd51"
file1.fileSize = 1234561
file1.fileType = "VCF"
file1.set_info("randomFile1Field", "someFile1Value")

# File 2
file2 = File()
file2.fileName = "myFilename2.bam"
file2.studyId = api_config.study_id
file2.fileAccess = "controlled"
file2.fileMd5sum = "myMd52"
file2.fileSize = 1234562
file2.fileType = "VCF"
file2.set_info("randomFile2Field", "someFile2Value")

Create an example SequencingRead experiment entity:

# SequencingRead
sequencing_read_experiment = SequencingRead()
sequencing_read_experiment.aligned = True
sequencing_read_experiment.alignmentTool = "myAlignmentTool"
sequencing_read_experiment.pairedEnd = True
sequencing_read_experiment.insertSize = 0
sequencing_read_experiment.libraryStrategy = "WXS"
sequencing_read_experiment.referenceGenome = "GR37"
sequencing_read_experiment.set_info("randomSRField", "someSRValue")

Finally, use the SimplePayloadBuilder class along with the previously created entities to create a payload.

from overture_song.tools import SimplePayloadBuilder
builder = SimplePayloadBuilder(donor, specimen, sample, [file1, file2], sequencing_read_experiment)
payload = builder.to_dict()

Use a Custom AnalysisId

In some situations, the user may prefer to use a custom analysisId. If not specified in the payload, the analysisId is automatically generated by the SONG server during the Save the Analysis step. Although this tutorial uses the analysisId generated by the SONG server, a custom analysisId can be set as follows:

payload['analysisId'] = 'my_custom_analysis_id'

Upload the Payload

With the payload built, the data can now be uploaded to the SONG server for validation. There are 2 modes for validation:

  1. Synchronous - uploads are validated SYNCHRONOUSLY. This is the default mode; it can also be selected explicitly by setting the kwarg is_async_validation of the upload method to False.
  2. Asynchronous - uploads are validated ASYNCHRONOUSLY. This allows the user to submit a batch of payloads without waiting for each validation to finish. This mode is selected by setting is_async_validation to True.

After calling the upload method, the payload will be sent to the SONG server for validation, and a response will be returned:

>>> api.upload(json_payload=payload, is_async_validation=False)
{
    "status": "ok",
    "uploadId": "UP-c49742d0-1fc8-4b45-9a1c-ea58d282ac58"
}

If the status field of the response is ok, the payload was successfully submitted to the SONG server for validation. The randomly generated uploadId acts as a receipt for the upload request.

Check the Status of the Upload

Before continuing, the status of the previous upload must be checked using the status method, in order to ensure the payload was successfully validated. Using the previous uploadId, the status of the upload can be requested; it returns the following response:

>>> api.status('UP-c49742d0-1fc8-4b45-9a1c-ea58d282ac58')
{
    "analysisId": "",
    "uploadId": "UP-c49742d0-1fc8-4b45-9a1c-ea58d282ac58",
    "studyId": "ABC123",
    "state": "VALIDATED",
    "createdAt": [
        2019,
        2,
        16,
        0,
        54,
        31,
        73774000
    ],
    "updatedAt": [
        2019,
        2,
        16,
        0,
        54,
        31,
        75476000
    ],
    "errors": [
        ""
    ],
    "payload": {
        "analysisState": "UNPUBLISHED",
        "sample": [
            {
                "info": {
                    "randomSample1Field": "someSample1Value"
                },
                "sampleSubmitterId": "ssId1",
                "sampleType": "RNA",
                "specimen": {
                    "info": {
                        "randomSpecimenField": "someSpecimenValue"
                    },
                    "specimenSubmitterId": "sp_sub_1",
                    "specimenClass": "Tumour",
                    "specimenType": "Normal - EBV immortalized"
                },
                "donor": {
                    "info": {
                        "randomDonorField": "someDonorValue"
                    },
                    "donorSubmitterId": "dsId1",
                    "studyId": "ABC123",
                    "donorGender": "male"
                }
            }
        ],
        "file": [
            {
                "info": {
                    "randomFile1Field": "someFile1Value"
                },
                "fileName": "myFilename1.bam",
                "studyId": "ABC123",
                "fileSize": 1234561,
                "fileType": "VCF",
                "fileMd5sum": "myMd51",
                "fileAccess": "controlled"
            },
            {
                "info": {
                    "randomFile2Field": "someFile2Value"
                },
                "fileName": "myFilename2.bam",
                "studyId": "ABC123",
                "fileSize": 1234562,
                "fileType": "VCF",
                "fileMd5sum": "myMd52",
                "fileAccess": "controlled"
            }
        ],
        "analysisType": "sequencingRead",
        "experiment": {
            "info": {
                "randomSRField": "someSRValue"
            },
            "aligned": true,
            "alignmentTool": "myAlignmentTool",
            "insertSize": 0,
            "libraryStrategy": "WXS",
            "pairedEnd": true,
            "referenceGenome": "GR37"
        }
    }
}

In order to continue with the next section, the state field MUST have the value VALIDATED, which indicates the upload was validated without errors. If there were errors, the state field would have the value VALIDATION_ERROR, and the errors field would contain details of the validation issues. In that case, the user can simply correct the payload, re-upload it, and check the status again.
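When scripting many uploads, this state check can be factored into a small helper. The sketch below assumes the status response is the JSON-like structure shown above (a dict with state and errors fields); check_validated is a hypothetical helper name, not part of the SDK:

```python
def check_validated(status_response):
    """Inspect an upload status response.

    Returns True once state is VALIDATED, returns False while validation
    is still in progress, and raises ValueError with the server-reported
    errors on VALIDATION_ERROR.
    """
    state = status_response["state"]
    if state == "VALIDATED":
        return True
    if state == "VALIDATION_ERROR":
        raise ValueError(
            "Upload failed validation: {}".format(status_response.get("errors")))
    return False
```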

Save the Analysis

Once the upload is successfully validated, the upload must be saved using the save method. This generates the following response:

>>> api.save('UP-c49742d0-1fc8-4b45-9a1c-ea58d282ac58', ignore_analysis_id_collisions=False)
{
    "analysisId": "23c61f55-12b4-11e8-b46b-23a48c7b1324",
    "status": "ok"
}

The value ok in the status field of the response indicates that an analysis was successfully created. The analysis contains the same data as the payload, with the addition of server-side ids generated by a centralized id server. By default, the request DOES NOT IGNORE analysisId collisions; however, setting the save method parameter ignore_analysis_id_collisions to True will ignore them. This mechanism is an override and is heavily discouraged, but it is sometimes necessary given the complexities of managing genomic data.

Observe the UNPUBLISHED Analysis

Verify the analysis is unpublished by observing the value of the analysisState field in the response for the get_analysis call. The value should be UNPUBLISHED. Also, observe that the SONG server generated a unique sampleId, specimenId, donorId, analysisId and objectId:

>>> api.get_analysis('23c61f55-12b4-11e8-b46b-23a48c7b1324')
{
    "analysisType": "sequencingRead",
    "info": {},
    "analysisId": "23c61f55-12b4-11e8-b46b-23a48c7b1324",
    "study": "ABC123",
    "analysisState": "UNPUBLISHED",
    "sample": [
        {
            "info": {
                "randomSample1Field": "someSample1Value"
            },
            "sampleId": "SA599347",
            "specimenId": "SP196154",
            "sampleSubmitterId": "ssId1",
            "sampleType": "RNA",
            "specimen": {
                "info": {
                    "randomSpecimenField": "someSpecimenValue"
                },
                "specimenId": "SP196154",
                "donorId": "DO229595",
                "specimenSubmitterId": "sp_sub_1",
                "specimenClass": "Tumour",
                "specimenType": "Normal - EBV immortalized"
            },
            "donor": {
                "donorId": "DO229595",
                "donorSubmitterId": "dsId1",
                "studyId": "ABC123",
                "donorGender": "male",
                "info": {}
            }
        }
    ],
    "file": [
        {
            "info": {
                "randomFile1Field": "someFile1Value"
            },
            "objectId": "f553bbe8-876b-5a9c-a436-ff47ceef53fb",
            "analysisId": "23c61f55-12b4-11e8-b46b-23a48c7b1324",
            "fileName": "myFilename1.bam",
            "studyId": "ABC123",
            "fileSize": 1234561,
            "fileType": "VCF",
            "fileMd5sum": "myMd51                          ",
            "fileAccess": "controlled"
        },
        {
            "info": {
                "randomFile2Field": "someFile2Value"
            },
            "objectId": "6e2ee06b-e95d-536a-86b5-f2af9594185f",
            "analysisId": "23c61f55-12b4-11e8-b46b-23a48c7b1324",
            "fileName": "myFilename2.bam",
            "studyId": "ABC123",
            "fileSize": 1234562,
            "fileType": "VCF",
            "fileMd5sum": "myMd52                          ",
            "fileAccess": "controlled"
        }
    ],
    "experiment": {
        "analysisId": "23c61f55-12b4-11e8-b46b-23a48c7b1324",
        "aligned": true,
        "alignmentTool": "myAlignmentTool",
        "insertSize": 0,
        "libraryStrategy": "WXS",
        "pairedEnd": true,
        "referenceGenome": "GR37",
        "info": {
            "randomSRField": "someSRValue"
        }
    }
}

Generate the Manifest

With an analysis created, a manifest file must be generated using the ManifestClient, the analysisId from the previously generated analysis, the path to the directory containing the files to be uploaded, and an output file path. If the source_dir does not exist, or if the files to be uploaded are not present in that directory, an error will be shown. Calling the write_manifest method generates a Manifest object and writes it to a file. This step is required for the next section, which uploads the object files to the storage server.

from overture_song.client import ManifestClient
manifest_client = ManifestClient(api)
source_dir = "/path/to/directory/containing/files"
manifest_file_path = './manifest.txt'
manifest_client.write_manifest('23c61f55-12b4-11e8-b46b-23a48c7b1324', source_dir, manifest_file_path)

After successful execution, a manifest.txt file will be generated and will have the following contents:

23c61f55-12b4-11e8-b46b-23a48c7b1324
f553bbe8-876b-5a9c-a436-ff47ceef53fb    /path/to/directory/containing/files/myFilename1.bam    myMd51
6e2ee06b-e95d-536a-86b5-f2af9594185f    /path/to/directory/containing/files/myFilename2.bam    myMd52
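For scripting, this format is easy to consume: the first line carries the analysisId, and each subsequent line is one row per file with objectId, file path and md5. A sketch of a parser (parse_manifest is a hypothetical helper; it assumes the columns are tab-separated, as written by write_manifest):

```python
def parse_manifest(text):
    """Parse manifest.txt contents into (analysis_id, rows).

    Assumes the layout shown above: the analysisId alone on the first
    line, then one tab-separated row per file: objectId, path, md5.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    analysis_id = lines[0].strip()
    rows = []
    for line in lines[1:]:
        object_id, path, md5 = line.split("\t")
        rows.append({"objectId": object_id, "path": path, "md5": md5})
    return analysis_id, rows
```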

Upload the Object Files

Upload the object files specified in the payload, using the icgc-storage-client and the manifest file. This will upload the files specified in the manifest.txt file, which should all be located in the same directory.

For Collaboratory - Toronto:

./bin/icgc-storage-client --profile collab   upload --manifest ./manifest.txt

For AWS - Virginia:

./bin/icgc-storage-client --profile aws   upload --manifest ./manifest.txt

See also

Refer to the SCORE Client section for more information about installation, configuration and usage.

Publish the Analysis

Using the same analysisId as before, publish the analysis. Essentially, this is the handshake between the metadata stored in the SONG server (via the analysisId) and the object files stored in the storage server (the files described by the analysisId).

>>> api.publish('23c61f55-12b4-11e8-b46b-23a48c7b1324')
AnalysisId 23c61f55-12b4-11e8-b46b-23a48c7b1324 successfully published

Observe the PUBLISHED Analysis

Finally, verify the analysis is published by observing the value of the analysisState field in the response for the get_analysis call. If the value is PUBLISHED, then congratulations on your first metadata upload!!

>>> api.get_analysis('23c61f55-12b4-11e8-b46b-23a48c7b1324')
{
    "analysisType": "sequencingRead",
    "info": {},
    "analysisId": "23c61f55-12b4-11e8-b46b-23a48c7b1324",
    "study": "ABC123",
    "analysisState": "PUBLISHED",
    "sample": [
        {
            "info": {
                "randomSample1Field": "someSample1Value"
            },
            "sampleId": "SA599347",
            "specimenId": "SP196154",
            "sampleSubmitterId": "ssId1",
            "sampleType": "RNA",
            "specimen": {
                "info": {
                    "randomSpecimenField": "someSpecimenValue"
                },
                "specimenId": "SP196154",
                "donorId": "DO229595",
                "specimenSubmitterId": "sp_sub_1",
                "specimenClass": "Tumour",
                "specimenType": "Normal - EBV immortalized"
            },
            "donor": {
                "donorId": "DO229595",
                "donorSubmitterId": "dsId1",
                "studyId": "ABC123",
                "donorGender": "male",
                "info": {}
            }
        }
    ],
    "file": [
        {
            "info": {
                "randomFile1Field": "someFile1Value"
            },
            "objectId": "f553bbe8-876b-5a9c-a436-ff47ceef53fb",
            "analysisId": "23c61f55-12b4-11e8-b46b-23a48c7b1324",
            "fileName": "myFilename1.bam",
            "studyId": "ABC123",
            "fileSize": 1234561,
            "fileType": "VCF",
            "fileMd5sum": "myMd51                          ",
            "fileAccess": "controlled"
        },
        {
            "info": {
                "randomFile2Field": "someFile2Value"
            },
            "objectId": "6e2ee06b-e95d-536a-86b5-f2af9594185f",
            "analysisId": "23c61f55-12b4-11e8-b46b-23a48c7b1324",
            "fileName": "myFilename2.bam",
            "studyId": "ABC123",
            "fileSize": 1234562,
            "fileType": "VCF",
            "fileMd5sum": "myMd52                          ",
            "fileAccess": "controlled"
        }
    ],
    "experiment": {
        "analysisId": "23c61f55-12b4-11e8-b46b-23a48c7b1324",
        "aligned": true,
        "alignmentTool": "myAlignmentTool",
        "insertSize": 0,
        "libraryStrategy": "WXS",
        "pairedEnd": true,
        "referenceGenome": "GR37",
        "info": {
            "randomSRField": "someSRValue"
        }
    }
}