Data Lake Service FAQ

What is LifeOmic Data Lake Service?

LifeOmic Data Lake Service is a managed repository of all clinical and genomic data ingested into the PHC.

The data is persisted in a semi-structured form, enabling users to query, shape, and combine data from different modalities into a single model that suits their unique needs. This democratizes data exploration, enabling analytics, data science, and machine learning.

What data is cataloged in the data lake when files or FHIR data is ingested?

The answer depends on what data has been ingested into the project.

To see the list of cataloged data for a specific project, run the following LifeOmic CLI command:

lo data-lake list-schemas <projectId>

Depending on the data ingested, the following Omic and FHIR data domains (or data pools) may be available:

Omic data

  • copy
  • number
  • fusion
  • gene
  • variant

FHIR data

  • condition
  • demographic
  • dosage
  • media
  • medication
  • observation
  • patient
  • procedure
  • sequence
  • specimen

What format does the data lake use to store data?

The data lake stores its data in Apache Parquet, a columnar format well suited to analytical queries.

How can I read data from the data lake?

There are currently four tools that can query the data lake and retrieve results:

  1. LifeOmic Notebook Service, which includes the PHC SDK for Python and the LifeOmic CLI.
  2. PHC SDK for Python
  3. LifeOmic CLI
  4. Data Lake REST API

What data formats are available for data lake query results?

Query results are currently available in a single output format: CSV.
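A CSV result can be loaded straight into a DataFrame for further analysis. The sketch below uses a hypothetical result snippet (the column names and values are illustrative, not actual data lake output), assuming pandas is available:

```python
import io

import pandas as pd

# Hypothetical CSV snippet resembling a data lake query result.
csv_result = """patient_id,code,value
p1,718-7,13.9
p2,718-7,14.2
"""

# Parse the CSV text into a DataFrame and compute a summary statistic.
df = pd.read_csv(io.StringIO(csv_result))
print(df["value"].mean())  # 14.05
```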

How can I explore the data available in the data lake for my project?

PHC notebooks are an ideal sandbox for data exploration.

The notebook environments are pre-installed with the PHC SDK for Python as well as modules useful for data exploration, such as NumPy. See the LifeOmic Notebook Service FAQ for more information.
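As a quick illustration of the kind of exploration these modules enable, here is a small NumPy sketch over hypothetical observation values (the numbers are invented for the example):

```python
import numpy as np

# Hypothetical numeric observation values pulled from a query result.
values = np.array([13.9, 14.2, 12.8, 15.1])

# Basic summary statistics for a first look at the data.
print(values.mean())  # 14.0
print(values.min(), values.max())
```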