Data Lake Overview
The Data Lake is a managed repository of all clinical and genomic data ingested into the LifeOmic Platform. A SQL-like query language is used to select, filter and join data across multiple domains into a single view.
See What Data is Available
The content of the Data Lake depends on what data has been ingested into the LifeOmic Platform. As such, the domains (think SQL tables) available might vary from project to project.
There are several ways to ask the Data Lake what data is available for a specific project and how that data is structured.
Data Lake REST API
GET /api/v1/analytics/data-lake/schema?datasetId={dataset-id}
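A minimal way to call this endpoint from Python (the host and token below are placeholders, not part of the documented API):

import os
import requests

resp = requests.get(
    "https://api.example.com/api/v1/analytics/data-lake/schema",  # placeholder host
    params={"datasetId": "00000000-0000-0000-0000-000000000000"},
    headers={"Authorization": f"Bearer {os.environ['LO_TOKEN']}"},  # placeholder token
)
print(resp.json())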
LifeOmic CLI
lo data-lake list-schemas {dataset-id}
Python LifeOmic Platform SDK
>>> from phc import Session
>>> from phc.services import Analytics
>>> from phc.util import DataLakeQuery
>>> session = Session()
>>> client = Analytics(session)
>>> project_id = '00000000-0000-0000-0000-000000000000'
>>> client.list_data_lake_schemas(project_id)
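Each of these returns the schemas of the domains available in the given project, which tells you which columns you can reference in a query.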
Execute a Simple Query
The Data Lake uses a SQL-like query language to select, filter and join data across multiple domains. Queries are run in a deferred manner, meaning the request to execute a query will return a query ID rather than a result. The query ID can be used to monitor the progress of the query.
There are four states a query can be in at any given time.
running: The query is currently being executed.
succeeded: The query has completed and the result file is available.
cancelled: The query was cancelled by a user.
failed: The query failed during execution.
Once a query has completed successfully, the results are written as a CSV file to the location specified in the initiating query request.
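To make the lifecycle concrete, here is a minimal polling sketch against the REST endpoints documented below. The API host and the LO_TOKEN environment variable are placeholders, not part of the documented API.

import os
import time
import requests

API_BASE = "https://api.example.com"  # placeholder: your LifeOmic API host
HEADERS = {"Authorization": f"Bearer {os.environ['LO_TOKEN']}"}  # placeholder token

# Submit the query; the response carries a queryId, not the results.
body = {
    "query": "SELECT sample_id, gene FROM variant WHERE tumor_site='breast'",
    "datasetId": "00000000-0000-0000-0000-000000000000",
    "outputFileName": "my-query-results",
}
resp = requests.post(f"{API_BASE}/api/v1/analytics/data-lake/query", json=body, headers=HEADERS)
query_id = resp.json()["queryId"]

# Poll until the query leaves the running state.
query = requests.get(f"{API_BASE}/api/v1/analytics/data-lake/query/{query_id}", headers=HEADERS).json()
while query["state"] == "running":
    time.sleep(5)
    query = requests.get(f"{API_BASE}/api/v1/analytics/data-lake/query/{query_id}", headers=HEADERS).json()

print(query_id, query["state"])  # succeeded, cancelled, or failed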
There are multiple ways to execute a query against the Data Lake and poll for its completion.
Data Lake REST API
a. Execute the query
POST /api/v1/analytics/data-lake/query
{
"query": "SELECT sample_id, gene, impact, amino_acid_change, histology FROM variant WHERE tumor_site='breast'",
"datasetId": "00000000-0000-0000-0000-000000000000",
"outputFileName": "my-query-results"
}

Response
{
"message": "Query execution starting",
"queryId": "11111111-2222-3333-4444-000000000000"
}

b. Get the status of the query
GET /api/v1/analytics/data-lake/query/11111111-2222-3333-4444-000000000000
Response
{
"id": "11111111-2222-3333-4444-000000000000",
"accountId": "my-account",
"datasetId": "00000000-0000-0000-0000-000000000000",
"queryString": "U0VMRUNUIHNhbXBsZV9pZCwgZ2VuZSwgaW1wYWN0LCBhbWlub19hY2lkX2NoYW5nZSwgaGlzdG9sb2d5IEZST00gdmFyaWFudCBXSEVSRSB0dW1vcl9zaXRlPSdicmVhc3Qn",
"state": "succeeded",
"outputFileName": "my-query-results",
"startTime": 1589390692167,
"endTime": 1589390696640
}

Note that the queryString field in the response is the Base64-encoded query.

LifeOmic CLI
a. Execute the query
lo data-lake query 00000000-0000-0000-0000-000000000000 -q "SELECT sample_id, gene, impact, amino_acid_change, histology FROM variant WHERE tumor_site='breast'" -o my-query-results
Response
message: Query execution starting
queryId: 11111111-2222-3333-4444-000000000000

b. Get the status of the query
lo data-lake get-query 11111111-2222-3333-4444-000000000000
Response
"id": "11111111-2222-3333-4444-000000000000",
"accountId": "my-account",
"datasetId": "00000000-0000-0000-0000-000000000000",
"queryString": "U0VMRUNUIHNhbXBsZV9pZCwgZ2VuZSwgaW1wYWN0LCBhbWlub19hY2lkX2NoYW5nZSwgaGlzdG9sb2d5IEZST00gdmFyaWFudCBXSEVSRSB0dW1vcl9zaXRlPSdicmVhc3Qn",
"state": "succeeded",
"outputFileName": "my-query-results",
"startTime": 1589390692167,
"endTime": 1589390696640Python LifeOmic Platform SDK
Execute the query and await the result
>>> from phc import Session
>>> from phc.services import Analytics
>>> from phc.util import DataLakeQuery
>>> session = Session()
>>> client = Analytics(session)
>>> project_id = '00000000-0000-0000-0000-000000000000'
>>> query_string = "SELECT sample_id, gene, impact, amino_acid_change, histology FROM variant WHERE tumor_site='breast'"
>>> output_file_name = 'my-query-results'
>>> query = DataLakeQuery(project_id=project_id, query=query_string, output_file_name=output_file_name)
>>> dataframe = client.execute_data_lake_query_to_dataframe(query)
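The call blocks until the query completes and returns the results as a DataFrame (pandas, going by the SDK's naming), so you can inspect them directly:

>>> dataframe.head()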
Query Rate Limits
Each account is limited to 10 concurrent queries across all projects. A query submitted while the account is already at capacity will fail.
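If you run queries concurrently at scale, be prepared to resubmit. Here is a minimal sketch of one approach; submit_query and get_query_state are hypothetical caller-supplied helpers (for example, thin wrappers over the REST calls shown earlier).

import time

def run_with_retry(submit_query, get_query_state, body, max_attempts=5, backoff_seconds=30):
    # submit_query and get_query_state are hypothetical helpers wrapping
    # the REST calls shown earlier in this guide.
    for attempt in range(max_attempts):
        query_id = submit_query(body)
        state = get_query_state(query_id)
        while state == "running":
            time.sleep(5)
            state = get_query_state(query_id)
        if state == "succeeded":
            return query_id
        # A failed state may simply mean the account was at its 10-query cap;
        # back off and resubmit. A real implementation would check the cause.
        time.sleep(backoff_seconds * (attempt + 1))
    raise RuntimeError("query did not succeed after retries")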