SONG Python SDK¶
The SONG Python SDK is a simple Python module that lets you interact with a SONG server from Python with a minimum of coding effort.
It lets you upload payloads synchronously or asynchronously, check their status and create analyses. From there, you can use the power of Python to process and analyze the data within those objects however you see fit.
Prerequisites¶
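Python 3.6 or higher is REQUIRED, since the SDK uses the dataclasses module (part of the standard library from Python 3.7, and available as a backport package for Python 3.6).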
Installation¶
The official SONG Python SDK is publicly hosted on PyPI. To install it, run the command below:
pip install overture-song
Configuration¶
To use the SDK, import ApiConfig from overture_song.model and create an instance holding the server URL, the study ID, and your access token. The Tutorial below walks through this step by step.
Tutorial¶
This section demonstrates example usage of the overture-song SDK.
After completing this tutorial, you will have uploaded your first SONG metadata payload!
For the impatient, the code used below can be found in examples/example_upload.py.
Warning
Python 3.6 or higher is required.
Configuration¶
Create an ApiConfig object. This object contains the serverUrl, accessToken, and studyId
that will be used to interact with the SONG API. In this example we will use https://song.cancercollaboratory.org for
the serverUrl and ‘ABC123’ for the studyId. For the access token, please refer to Creating an Access Token.
from overture_song.model import ApiConfig
api_config = ApiConfig('https://song.cancercollaboratory.org', 'ABC123', <my_access_token>)
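Hard-coding an access token in a script is easy to leak, so one option is to read the connection settings from environment variables. The following is a minimal sketch using only the standard library. The variable names SONG_SERVER_URL, SONG_STUDY_ID and SONG_ACCESS_TOKEN, the SongSettings class and the load_song_settings function are all hypothetical and are not part of the SDK.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class SongSettings:
    """Hypothetical container for the three values ApiConfig expects."""
    server_url: str
    study_id: str
    access_token: str


def load_song_settings(env: Mapping[str, str] = os.environ) -> SongSettings:
    """Read connection settings from environment-style variables.

    The variable names are assumptions; use whatever fits your deployment.
    """
    try:
        return SongSettings(
            server_url=env["SONG_SERVER_URL"],
            study_id=env["SONG_STUDY_ID"],
            access_token=env["SONG_ACCESS_TOKEN"],
        )
    except KeyError as missing:
        # Report which variable is absent instead of a bare KeyError.
        raise RuntimeError(f"missing required setting: {missing.args[0]}") from None
```

The loaded values could then be passed in the same order as in the example above, for instance ApiConfig(settings.server_url, settings.study_id, settings.access_token).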
Next the main API client needs to be instantiated in order to interact with the SONG server.
from overture_song.client import Api
api = Api(api_config)
As a sanity check, ensure that the server is running. If the response is True, you may proceed to the next
section; otherwise, the server is not running.
>>> api.is_alive()
True
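In a script, a single liveness check can fail simply because the server is still starting. The sketch below retries a liveness callable a few times before giving up. The wait_until_alive helper and its parameters are hypothetical; it only assumes that the callable returns a truthy value when the server is up, as api.is_alive does above.

```python
import time
from typing import Callable


def wait_until_alive(is_alive: Callable[[], bool],
                     attempts: int = 5,
                     delay_seconds: float = 2.0) -> bool:
    """Call `is_alive` up to `attempts` times, pausing between failed tries."""
    for attempt in range(attempts):
        try:
            if is_alive():
                return True
        except Exception:
            # Treat a raised error (e.g. connection refused) as "not alive yet".
            pass
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False
```

For example, wait_until_alive(api.is_alive) would return True as soon as one check succeeds, and False after five failed checks.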
Create a Study¶
If the studyId ‘ABC123’ does not exist, then the StudyClient must be
instantiated in order to read and create studies.
First, create a study client:
from overture_song.client import StudyClient
study_client = StudyClient(api)
If the study associated with the payload does not exist, then create
a Study entity:
from overture_song.entities import Study
if not study_client.has(api_config.study_id):
    study = Study.create(api_config.study_id, "myStudyName", "myStudyDescription", "myStudyOrganization")
    study_client.create(study)
Create a Simple Payload¶
Now that the study exists, you can create your first payload!
In this example, a SequencingReadAnalysis will be created. It follows the SequencingRead JsonSchema.
See also
Similarly, for the VariantCallAnalysis, refer to the VariantCall JsonSchema.
Firstly, import all the entities to minimize the import statements.
from overture_song.entities import *
Next, create an example Donor entity:
donor = Donor()
donor.studyId = api_config.study_id
donor.donorGender = "male"
donor.donorSubmitterId = "dsId1"
donor.set_info("randomDonorField", "someDonorValue")
Create an example Specimen entity:
specimen = Specimen()
specimen.specimenClass = "Tumour"
specimen.specimenSubmitterId = "sp_sub_1"
specimen.specimenType = "Normal - EBV immortalized"
specimen.set_info("randomSpecimenField", "someSpecimenValue")
Create an example Sample entity:
sample = Sample()
sample.sampleSubmitterId = "ssId1"
sample.sampleType = "RNA"
sample.set_info("randomSample1Field", "someSample1Value")
Create one or more example File entities:
# File 1
file1 = File()
file1.fileName = "myFilename1.bam"
file1.studyId = api_config.study_id
file1.fileAccess = "controlled"
file1.fileMd5sum = "myMd51"
file1.fileSize = 1234561
file1.fileType = "VCF"
file1.set_info("randomFile1Field", "someFile1Value")
# File 2
file2 = File()
file2.fileName = "myFilename2.bam"
file2.studyId = api_config.study_id
file2.fileAccess = "controlled"
file2.fileMd5sum = "myMd52"
file2.fileSize = 1234562
file2.fileType = "VCF"
file2.set_info("randomFile2Field", "someFile2Value")
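The fileMd5sum and fileSize values above are placeholders. For a real file, they can be computed with the standard library, as in the sketch below. The md5_and_size helper is hypothetical and not part of the SDK. It reads the file in chunks so that large files do not need to fit in memory.

```python
import hashlib
import os


def md5_and_size(path, chunk_size=1024 * 1024):
    """Return (hex MD5 digest, size in bytes) for the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest(), os.path.getsize(path)
```

The result could be assigned directly to an entity, for example file1.fileMd5sum, file1.fileSize = md5_and_size("/path/to/myFilename1.bam").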
Create an example SequencingRead experiment entity:
# SequencingRead
sequencing_read_experiment = SequencingRead()
sequencing_read_experiment.aligned = True
sequencing_read_experiment.alignmentTool = "myAlignmentTool"
sequencing_read_experiment.pairedEnd = True
sequencing_read_experiment.insertSize = 0
sequencing_read_experiment.libraryStrategy = "WXS"
sequencing_read_experiment.referenceGenome = "GR37"
sequencing_read_experiment.set_info("randomSRField", "someSRValue")
Finally, use the SimplePayloadBuilder class along with the previously
created entities to create a payload.
from overture_song.tools import SimplePayloadBuilder
builder = SimplePayloadBuilder(donor, specimen, sample, [file1, file2], sequencing_read_experiment)
payload = builder.to_dict()
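Before sending the payload to the server, a quick local check can catch obviously incomplete payloads. The sketch below assumes that the dictionary has the top-level keys and file fields seen in the server responses later in this tutorial (analysisType, sample, file, experiment). That shape is an assumption, the find_payload_problems helper is hypothetical, and the check is no substitute for the server-side JsonSchema validation.

```python
# Keys assumed from the example responses in this tutorial, not from the schema itself.
REQUIRED_TOP_LEVEL_KEYS = ("analysisType", "sample", "file", "experiment")
REQUIRED_FILE_FIELDS = ("fileName", "fileMd5sum", "fileSize", "fileType", "fileAccess")


def find_payload_problems(payload):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = [
        f"missing key: {key}"
        for key in REQUIRED_TOP_LEVEL_KEYS
        if key not in payload
    ]
    for index, entry in enumerate(payload.get("file", [])):
        for field in REQUIRED_FILE_FIELDS:
            if field not in entry:
                problems.append(f"file[{index}] missing field: {field}")
    return problems
```

Calling find_payload_problems(payload) and refusing to upload when the list is non-empty keeps trivial mistakes from reaching the server.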
Use a Custom AnalysisId¶
In some situations, the user may prefer to use a custom analysisId. If not specified in the payload, it is
automatically generated by the SONG server during the Save the Analysis step.
Although this tutorial uses the analysisId generated by the SONG server, a custom analysisId can be set
as follows:
payload['analysisId'] = 'my_custom_analysis_id'
Upload the Payload¶
With the payload built, the data can now be uploaded to the SONG server for validation. There are 2 modes for validation:
- Synchronous - uploads are validated SYNCHRONOUSLY. Although this is the default mode, it can be selected explicitly by setting the kwarg is_async_validation to False in the upload method.
- Asynchronous - uploads are validated ASYNCHRONOUSLY. This allows the user to upload a batch of payloads. This mode can be selected by setting is_async_validation to True.
After calling the upload method, the payload will be sent to the SONG server for validation, and a response will be returned:
>>> api.upload(json_payload=payload, is_async_validation=False)
{
"status": "ok",
"uploadId": "UP-c49742d0-1fc8-4b45-9a1c-ea58d282ac58"
}
If the status field of the response is ok, the payload was successfully submitted to the SONG server for validation. The returned uploadId is randomly generated and serves as a receipt for the upload request.
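The asynchronous mode is meant for submitting several payloads in a row. A minimal sketch of such a batch submission follows. The upload_batch and _field helpers are hypothetical. The sketch assumes only the keyword arguments shown above (json_payload and is_async_validation) and a response exposing status and uploadId, either as dictionary keys or as attributes.

```python
def _field(response, name):
    """Read `name` from a dict-like or an attribute-style response."""
    if isinstance(response, dict):
        return response.get(name)
    return getattr(response, name, None)


def upload_batch(upload, payloads):
    """Submit each payload for asynchronous validation and collect the uploadIds.

    `upload` is any callable with the keyword interface shown above,
    for example the bound method `api.upload`.
    """
    upload_ids = []
    for payload in payloads:
        response = upload(json_payload=payload, is_async_validation=True)
        if _field(response, "status") != "ok":
            raise RuntimeError(f"upload was not accepted: {response!r}")
        upload_ids.append(_field(response, "uploadId"))
    return upload_ids
```

For example, upload_batch(api.upload, [payload_a, payload_b]) would return the list of uploadIds, each of which still needs its status checked afterwards.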
Check the Status of the Upload¶
Before continuing, the status of the previous upload must be checked using the status method, in order to ensure the payload was successfully validated.
Using the previous uploadId, the status of the upload can be requested and will return the following response:
>>> api.status('UP-c49742d0-1fc8-4b45-9a1c-ea58d282ac58')
{
"analysisId": "",
"uploadId": "UP-c49742d0-1fc8-4b45-9a1c-ea58d282ac58",
"studyId": "ABC123",
"state": "VALIDATED",
"createdAt": [
2019,
2,
16,
0,
54,
31,
73774000
],
"updatedAt": [
2019,
2,
16,
0,
54,
31,
75476000
],
"errors": [
""
],
"payload": {
"analysisState": "UNPUBLISHED",
"sample": [
{
"info": {
"randomSample1Field": "someSample1Value"
},
"sampleSubmitterId": "ssId1",
"sampleType": "RNA",
"specimen": {
"info": {
"randomSpecimenField": "someSpecimenValue"
},
"specimenSubmitterId": "sp_sub_1",
"specimenClass": "Tumour",
"specimenType": "Normal - EBV immortalized"
},
"donor": {
"info": {
"randomDonorField": "someDonorValue"
},
"donorSubmitterId": "dsId1",
"studyId": "ABC123",
"donorGender": "male"
}
}
],
"file": [
{
"info": {
"randomFile1Field": "someFile1Value"
},
"fileName": "myFilename1.bam",
"studyId": "ABC123",
"fileSize": 1234561,
"fileType": "VCF",
"fileMd5sum": "myMd51",
"fileAccess": "controlled"
},
{
"info": {
"randomFile2Field": "someFile2Value"
},
"fileName": "myFilename2.bam",
"studyId": "ABC123",
"fileSize": 1234562,
"fileType": "VCF",
"fileMd5sum": "myMd52",
"fileAccess": "controlled"
}
],
"analysisType": "sequencingRead",
"experiment": {
"info": {
"randomSRField": "someSRValue"
},
"aligned": true,
"alignmentTool": "myAlignmentTool",
"insertSize": 0,
"libraryStrategy": "WXS",
"pairedEnd": true,
"referenceGenome": "GR37"
}
}
}
In order to continue with the next section, the state field MUST have the value VALIDATED, which indicates
that the upload was validated and there were no errors. If there were errors, the state field would have the value
VALIDATION_ERROR, and the errors field would contain details of the validation issues. If there is an error,
the user can simply correct the payload, re-upload it and check the status again.
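When uploads are validated asynchronously, the state may not be final on the first status call, so the check is typically repeated. The sketch below polls until the state is VALIDATED or VALIDATION_ERROR. The wait_for_validation and _field helpers are hypothetical, and any other state value is simply treated as "not finished yet", since this tutorial does not list the intermediate states.

```python
import time


def _field(response, name):
    """Read `name` from a dict-like or an attribute-style response."""
    if isinstance(response, dict):
        return response.get(name)
    return getattr(response, name, None)


def wait_for_validation(get_status, upload_id, attempts=10, delay_seconds=3.0):
    """Poll `get_status(upload_id)` until validation finishes.

    Returns the final response when the state is VALIDATED, raises ValueError
    on VALIDATION_ERROR, and raises TimeoutError if neither state is reached.
    """
    for attempt in range(attempts):
        response = get_status(upload_id)
        state = _field(response, "state")
        if state == "VALIDATED":
            return response
        if state == "VALIDATION_ERROR":
            errors = _field(response, "errors")
            raise ValueError(f"upload {upload_id} failed validation: {errors}")
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    raise TimeoutError(f"upload {upload_id} not validated after {attempts} checks")
```

For example, wait_for_validation(api.status, upload_id) would block until the upload is ready to be saved, or raise with the reported errors.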
Save the Analysis¶
Once the upload is successfully validated, the upload must be saved using the save method. This generates the following response:
>>> api.save('UP-c49742d0-1fc8-4b45-9a1c-ea58d282ac58', ignore_analysis_id_collisions=False)
{
"analysisId": "23c61f55-12b4-11e8-b46b-23a48c7b1324",
"status": "ok"
}
The value of ok in the status field of the response indicates that an analysis was successfully created. The analysis
will contain the same data as the payload, with the addition of server-side generated IDs, which are generated by a
centralized ID server. By default, the request DOES NOT IGNORE analysisId
collisions. Setting the save method parameter ignore_analysis_id_collisions to True causes collisions to
be ignored. This mechanism is considered an override and is heavily discouraged, but it is necessary considering the
complexities associated with managing genomic data.
Observe the UNPUBLISHED Analysis¶
Verify the analysis is unpublished by observing the value of the analysisState field in the response for the
get_analysis call. The value should be UNPUBLISHED. Also, observe that
the SONG server generated a unique sampleId, specimenId, analysisId and objectId:
>>> api.get_analysis('23c61f55-12b4-11e8-b46b-23a48c7b1324')
{
"analysisType": "sequencingRead",
"info": {},
"analysisId": "23c61f55-12b4-11e8-b46b-23a48c7b1324",
"study": "ABC123",
"analysisState": "UNPUBLISHED",
"sample": [
{
"info": {
"randomSample1Field": "someSample1Value"
},
"sampleId": "SA599347",
"specimenId": "SP196154",
"sampleSubmitterId": "ssId1",
"sampleType": "RNA",
"specimen": {
"info": {
"randomSpecimenField": "someSpecimenValue"
},
"specimenId": "SP196154",
"donorId": "DO229595",
"specimenSubmitterId": "sp_sub_1",
"specimenClass": "Tumour",
"specimenType": "Normal - EBV immortalized"
},
"donor": {
"donorId": "DO229595",
"donorSubmitterId": "dsId1",
"studyId": "ABC123",
"donorGender": "male",
"info": {}
}
}
],
"file": [
{
"info": {
"randomFile1Field": "someFile1Value"
},
"objectId": "f553bbe8-876b-5a9c-a436-ff47ceef53fb",
"analysisId": "23c61f55-12b4-11e8-b46b-23a48c7b1324",
"fileName": "myFilename1.bam",
"studyId": "ABC123",
"fileSize": 1234561,
"fileType": "VCF",
"fileMd5sum": "myMd51 ",
"fileAccess": "controlled"
},
{
"info": {
"randomFile2Field": "someFile2Value"
},
"objectId": "6e2ee06b-e95d-536a-86b5-f2af9594185f",
"analysisId": "23c61f55-12b4-11e8-b46b-23a48c7b1324",
"fileName": "myFilename2.bam",
"studyId": "ABC123",
"fileSize": 1234562,
"fileType": "VCF",
"fileMd5sum": "myMd52 ",
"fileAccess": "controlled"
}
],
"experiment": {
"analysisId": "23c61f55-12b4-11e8-b46b-23a48c7b1324",
"aligned": true,
"alignmentTool": "myAlignmentTool",
"insertSize": 0,
"libraryStrategy": "WXS",
"pairedEnd": true,
"referenceGenome": "GR37",
"info": {
"randomSRField": "someSRValue"
}
}
}
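The server-generated IDs are often needed in later steps, for example to match objectIds against a manifest. The sketch below gathers them from an analysis shaped like the response above. It assumes the response is available as a plain dictionary with exactly these field names, and the collect_generated_ids helper is hypothetical.

```python
def collect_generated_ids(analysis):
    """Gather the server-assigned IDs from an analysis dictionary."""
    samples = analysis["sample"]
    return {
        "analysisId": analysis["analysisId"],
        "sampleIds": [sample["sampleId"] for sample in samples],
        "specimenIds": [sample["specimenId"] for sample in samples],
        "donorIds": [sample["donor"]["donorId"] for sample in samples],
        # One objectId per file entry, in the order the files are listed.
        "objectIds": [entry["objectId"] for entry in analysis["file"]],
    }
```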
Generate the Manifest¶
With an analysis created, a manifest file must be generated using the ManifestClient, the analysisId from the previously generated analysis, a path to the directory containing the files to be uploaded, and an output file path. If the source_dir does not exist, or if the files to be uploaded are not present in that directory, then an error will be shown. Calling the write_manifest method generates a Manifest object and writes it to a file.
This step is required for the next section involving the upload of the object files to the storage server.
from overture_song.client import ManifestClient
manifest_client = ManifestClient(api)
source_dir = "/path/to/directory/containing/files"
manifest_file_path = './manifest.txt'
manifest_client.write_manifest('23c61f55-12b4-11e8-b46b-23a48c7b1324', source_dir, manifest_file_path)
After successful execution, a manifest.txt file will be generated and will have the following contents:
23c61f55-12b4-11e8-b46b-23a48c7b1324
f553bbe8-876b-5a9c-a436-ff47ceef53fb /path/to/directory/containing/files/myFilename1.bam myMd51
6e2ee06b-e95d-536a-86b5-f2af9594185f /path/to/directory/containing/files/myFilename2.bam myMd52
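To double-check a generated manifest before uploading, its contents can be read back. The sketch below assumes the layout shown above: the analysisId on the first line, then one line per file holding the objectId, the file path and the MD5 sum, separated by whitespace. The ManifestEntry type and parse_manifest function are hypothetical and not part of the SDK.

```python
from typing import List, NamedTuple, Tuple


class ManifestEntry(NamedTuple):
    object_id: str
    path: str
    md5: str


def parse_manifest(text: str) -> Tuple[str, List[ManifestEntry]]:
    """Split manifest text into (analysisId, list of file entries)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("manifest is empty")
    analysis_id = lines[0]
    entries = []
    for line in lines[1:]:
        # The first token is the objectId and the last is the MD5 sum;
        # everything in between is the path, which may contain spaces.
        object_id, rest = line.split(None, 1)
        path, md5 = rest.rsplit(None, 1)
        entries.append(ManifestEntry(object_id, path, md5))
    return analysis_id, entries
```

A typical use would be analysis_id, entries = parse_manifest(open(manifest_file_path).read()), followed by checking that every entry.path exists on disk.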
Upload the Object Files¶
Upload the object files specified in the payload, using the icgc-storage-client and the manifest file.
This will upload the files specified in the manifest.txt file, which should all be located in the same directory.
For Collaboratory - Toronto:
./bin/icgc-storage-client --profile collab upload --manifest ./manifest.txt
For AWS - Virginia:
./bin/icgc-storage-client --profile aws upload --manifest ./manifest.txt
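If the whole workflow is driven from one Python script, the storage client can be launched with the subprocess module. The sketch below only reuses the command-line arguments shown above. The storage_upload_command and run_storage_upload helpers are hypothetical, and the client path is an assumption that depends on where icgc-storage-client is installed.

```python
import subprocess


def storage_upload_command(profile, manifest_path, client="./bin/icgc-storage-client"):
    """Build the argument list for the upload command shown above."""
    return [client, "--profile", profile, "upload", "--manifest", manifest_path]


def run_storage_upload(profile, manifest_path):
    """Run the upload and raise CalledProcessError if the client exits non-zero."""
    subprocess.run(storage_upload_command(profile, manifest_path), check=True)
```

For example, run_storage_upload("collab", "./manifest.txt") corresponds to the Collaboratory command above.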
See also
Refer to the SCORE Client section for more information about installation, configuration and usage.
Publish the Analysis¶
Using the same analysisId as before, publish the analysis.
Essentially, this is the handshake between the metadata stored in the SONG server (via the analysisIds) and the object
files stored in the storage server (the files described by the analysisId).
>>> api.publish('23c61f55-12b4-11e8-b46b-23a48c7b1324')
AnalysisId 23c61f55-12b4-11e8-b46b-23a48c7b1324 successfully published
Observe the PUBLISHED Analysis¶
Finally, verify the analysis is published by observing the value of the analysisState field in the response for the
get_analysis call. If the value is PUBLISHED, then congratulations on your first metadata upload!!
>>> api.get_analysis('23c61f55-12b4-11e8-b46b-23a48c7b1324')
{
"analysisType": "sequencingRead",
"info": {},
"analysisId": "23c61f55-12b4-11e8-b46b-23a48c7b1324",
"study": "ABC123",
"analysisState": "PUBLISHED",
"sample": [
{
"info": {
"randomSample1Field": "someSample1Value"
},
"sampleId": "SA599347",
"specimenId": "SP196154",
"sampleSubmitterId": "ssId1",
"sampleType": "RNA",
"specimen": {
"info": {
"randomSpecimenField": "someSpecimenValue"
},
"specimenId": "SP196154",
"donorId": "DO229595",
"specimenSubmitterId": "sp_sub_1",
"specimenClass": "Tumour",
"specimenType": "Normal - EBV immortalized"
},
"donor": {
"donorId": "DO229595",
"donorSubmitterId": "dsId1",
"studyId": "ABC123",
"donorGender": "male",
"info": {}
}
}
],
"file": [
{
"info": {
"randomFile1Field": "someFile1Value"
},
"objectId": "f553bbe8-876b-5a9c-a436-ff47ceef53fb",
"analysisId": "23c61f55-12b4-11e8-b46b-23a48c7b1324",
"fileName": "myFilename1.bam",
"studyId": "ABC123",
"fileSize": 1234561,
"fileType": "VCF",
"fileMd5sum": "myMd51 ",
"fileAccess": "controlled"
},
{
"info": {
"randomFile2Field": "someFile2Value"
},
"objectId": "6e2ee06b-e95d-536a-86b5-f2af9594185f",
"analysisId": "23c61f55-12b4-11e8-b46b-23a48c7b1324",
"fileName": "myFilename2.bam",
"studyId": "ABC123",
"fileSize": 1234562,
"fileType": "VCF",
"fileMd5sum": "myMd52 ",
"fileAccess": "controlled"
}
],
"experiment": {
"analysisId": "23c61f55-12b4-11e8-b46b-23a48c7b1324",
"aligned": true,
"alignmentTool": "myAlignmentTool",
"insertSize": 0,
"libraryStrategy": "WXS",
"pairedEnd": true,
"referenceGenome": "GR37",
"info": {
"randomSRField": "someSRValue"
}
}
}