Analyzed Document Data
After a document has been analyzed and is visible in PrecisionOCR, it can be helpful to pull down all of the predictions, whether to comb through the large number of suggestions or to build an integration.
This example retrieves an analyzed OCR document for a fictional patient, Dan Dennis, along with all of the suggested medical codes and the detailed textual elements of the uploaded PDF.
The first step is to get the matching patients by name.
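import phc.easy as phc

# Authentication and project selection are assumed to be configured already.
phc.Patient.get_data_frame(term={
    "name.family.keyword": "Dennis"
})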
id | account | resourceType | name_use | name_given_0 | name_family | birthDate.tz | birthDate.local | gender | meta.tag_lastUpdated.tz | meta.tag_lastUpdated.local |
---|---|---|---|---|---|---|---|---|---|---|
0c23c681-eeb7-491d-bb99-5ab77df53941 | sample | Patient | official | Dan | Dennis | 0 | 1983-02-24T00:00:00+00:00 | male | 0 | 2021-03-18T02:43:51.064000+00:00 |
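The id column is the value that later calls need. As a minimal sketch, the snippet below picks it out of a hand-built stand-in frame that reproduces a few of the columns shown above. In practice the frame would come from phc.Patient.get_data_frame. The helper name find_patient_id is illustrative and not part of the SDK.

```python
import pandas as pd

# Stand-in for the returned frame; only a subset of columns is reproduced.
patients = pd.DataFrame(
    [
        {
            "id": "0c23c681-eeb7-491d-bb99-5ab77df53941",
            "name_given_0": "Dan",
            "name_family": "Dennis",
            "gender": "male",
        }
    ]
)


def find_patient_id(frame: pd.DataFrame, given: str, family: str) -> str:
    """Return the id of the single patient matching both names.

    Hypothetical helper: raises LookupError when zero or several rows match,
    so an ambiguous name is never silently resolved to the first row.
    """
    matches = frame[
        (frame["name_given_0"] == given) & (frame["name_family"] == family)
    ]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one match, found {len(matches)}")
    return str(matches["id"].iloc[0])


patient_id = find_patient_id(patients, "Dan", "Dennis")
print(patient_id)
```

Failing loudly on zero or multiple matches is a deliberate choice here, since a family-name search can easily return more than one patient.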
Fetching Documents from PrecisionOCR
All documents in PrecisionOCR are stored as DocumentReference resources in the FHIR format. Since they have a unique code that differentiates them from other documents in the account, the phc.Ocr.Document class provides seamless access to just PrecisionOCR documents. The documents for Dan Dennis are retrieved using the patient_id parameter.
phc.Ocr.Document.get_data_frame(patient_id="0c23c681-eeb7-491d-bb99-5ab77df53941")
id | account | resourceType | meta.tag_system__lifeomic.com/fhir/dataset__code | meta.tag_system__lifeomic.com/fhir/source__code | meta.tag_lastUpdated.tz | meta.tag_lastUpdated.local | type.coding_system__loinc.org__code | type.coding_system__loinc.org__display | indexed | status | meta.tag_system__lifeomic.com/ocr/document/status__code | docStatus | content | description |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ebb2ae5a-6563-4bfd-bcf8-de095bb203b1 | sample | DocumentReference | 637475e1-3b26-4d78-87eb-5df66ab9ef59 | PrecisionOCR Service | 0 | 2021-03-18T02:44:57.061000+00:00 | 11488-4 | Consult note | 2021-03-18T02:43:41.846Z | current | SUCCESS | preliminary | [...] | ocr-uploads/D D Notes.pdf |
If the document ID is already known, a single record can be retrieved.
phc.Ocr.Document.get("ebb2ae5a-6563-4bfd-bcf8-de095bb203b1")
In addition to the columns seen above, the DocumentReference resources include a content JSON column that links to various files.
[{'attachment': {'contentType': 'application/pdf',
                 'url': 'https://api.dev.lifeomic.com/v1/files/d2b334dd-5461-4480-b289-9ddb66717360'},
  'format': {'system': 'https://lifeomic.com/fhir/identifier-type',
             'code': 'ocr-file-id',
             'display': 'OCR File Identifier'}},
 {'attachment': {'contentType': 'application/x-jsonlines',
                 'url': 'https://api.dev.lifeomic.com/v1/files/f6e73773-4d33-4953-b5a8-0728ed77aa33'},
  'format': {'system': 'https://lifeomic.com/fhir/identifier-type',
             'code': 'ocr-text-file-id',
             'display': 'OCR Text File Identifier'}}]
The PDF file coded as 'ocr-file-id' refers to the original file in the file service, while the line-by-line JSON file coded as 'ocr-text-file-id' contains the extracted textual elements from the PDF. This second file is discussed in more detail in the Fetching Extracted Textual Elements section.
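Selecting one of these attachments by its format code is a small dictionary lookup. The sketch below works on the content value copied from above. The helper names attachment_url and file_id are illustrative, and treating the last URL path segment as the file identifier is an assumption based on the URLs shown here, not a documented guarantee.

```python
from typing import Any, Dict, List, Optional

# The `content` value shown above, copied verbatim.
content: List[Dict[str, Any]] = [
    {
        "attachment": {
            "contentType": "application/pdf",
            "url": "https://api.dev.lifeomic.com/v1/files/d2b334dd-5461-4480-b289-9ddb66717360",
        },
        "format": {
            "system": "https://lifeomic.com/fhir/identifier-type",
            "code": "ocr-file-id",
            "display": "OCR File Identifier",
        },
    },
    {
        "attachment": {
            "contentType": "application/x-jsonlines",
            "url": "https://api.dev.lifeomic.com/v1/files/f6e73773-4d33-4953-b5a8-0728ed77aa33",
        },
        "format": {
            "system": "https://lifeomic.com/fhir/identifier-type",
            "code": "ocr-text-file-id",
            "display": "OCR Text File Identifier",
        },
    },
]


def attachment_url(entries: List[Dict[str, Any]], format_code: str) -> Optional[str]:
    """Return the URL of the first entry whose format code matches, else None."""
    for entry in entries:
        if entry.get("format", {}).get("code") == format_code:
            return entry.get("attachment", {}).get("url")
    return None


def file_id(url: str) -> str:
    """ASSUMPTION: the last path segment of the URL is the file identifier."""
    return url.rstrip("/").rsplit("/", 1)[-1]


text_url = attachment_url(content, "ocr-text-file-id")
print(text_url)
print(file_id(text_url) if text_url else None)
```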
Fetching Page Data for Documents
Each page of a document is represented as a Composition resource that carries a page-number tag and points back to its DocumentReference. The pages for a document are retrieved using the document_id parameter.
phc.Ocr.DocumentComposition.get_data_frame(
    document_id="ebb2ae5a-6563-4bfd-bcf8-de095bb203b1",
    all_results=True
)
resourceType | title | id | meta.tag_system__lifeomic.com/fhir/source__code | meta.tag_system__lifeomic.com/ocr/documents/page-number__code | meta.tag_lastUpdated.tz | meta.tag_lastUpdated.local | date.tz | date.local | status | subject.reference | author.reference | type.coding_system__loinc.org__code | type.coding_system__loinc.org__display | extension.url__lifeomic.com/fhir/ocr/page-rotation__valueInteger | extension.url__lifeomic.com/fhir/ocr/page-dates__valueString | extension.url__lifeomic.com/fhir/ocr/page-aspect-ratio__valueString | extension.url__lifeomic.com/fhir/ocr/page-image__valueString | extension.url__lifeomic.com/fhir/ocr/masked-word-ids__valueString | text.status | text.div | relatesTo.code | relatesTo.targetReference_reference | account |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Composition | ... | 83daf7bf-6468-45e1-a021-d634ef116521 | PrecisionOCR Service | 0 | 0 | 2021-03-18 02:44:53.520000+00:00 | 0 | 2021-03-18 02:44:50.787000+00:00 | final | user@example.com | 34765-8 | General medicine Note | 0 | nan | 595.28 x 841.89 | f14e940a-2690-406c-a916-c61ca935a71f | nan | generated | Sinus bradycardia. L... | transforms | DocumentReference/ebb2ae5a-6563-4bfd-bcf8-de095bb203b1 | sample | |
Composition | There is no pleural ... | 3124cbc7-fc06-47d0-9636-48efc0a59a21 | PrecisionOCR Service | 1 | 0 | 2021-03-18 02:44:53.522000+00:00 | 0 | 2021-03-18 02:44:50.817000+00:00 | final | user@example.com | 34765-8 | General medicine Note | 0 | [{"isRelative":false,"wordIds":["1385f0c... | 595.28 x 841.89 | b9c32f92-b3c0-452d-9cfd-6b2146fff097 | 29c02c00-f6ef-46bc-9a38-5ebd08d7ff94,f2b2ff2b... | generated | 2004-12-16 1:01 PM C... | transforms | DocumentReference/ebb2ae5a-6563-4bfd-bcf8-de095bb203b1 | sample | |
Composition | AORTIC VALVE: Normal... | 1d194ecd-9ec8-4b22-9e48-b9efa42c9660 | PrecisionOCR Service | 2 | 0 | 2021-03-18 02:44:53.526000+00:00 | 0 | 2021-03-18 02:44:50.955000+00:00 | final | user@example.com | 34765-8 | General medicine Note | 0 | nan | 595.28 x 841.89 | a6781a42-6435-4104-b341-2661a600e80e | 669ca3b9-d3c1-47b1-b412-a12b542dd3a4... | generated | PATIENT/TEST INFORMA... | transforms | DocumentReference/ebb2ae5a-6563-4bfd-bcf8-de095bb203b1 | sample | |
Composition | The left ventricular... | 848c3cd3-38e1-4663-8394-0fd317c8fe6b | PrecisionOCR Service | 3 | 0 | 2021-03-18 02:44:53.523000+00:00 | 0 | 2021-03-18 02:44:50.840000+00:00 | final | user@example.com | 34765-8 | General medicine Note | 0 | nan | 595.28 x 841.89 | 9775ee58-221a-450a-9fe6-854d5d508d2b | nan | generated | Conclusions: PRE-BYP... | transforms | DocumentReference/ebb2ae5a-6563-4bfd-bcf8-de095bb203b1 | sample | |
Composition | QS deflections in le... | 597ba7a0-85b0-4bc1-9055-6c497ea7ad19 | PrecisionOCR Service | 4 | 0 | 2021-03-18 02:44:53.524000+00:00 | 0 | 2021-03-18 02:44:50.876000+00:00 | final | user@example.com | 34765-8 | General medicine Note | 0 | [{"isRelative":false,"wordIds":["7e4c0fc... | 595.28 x 841.89 | 901844d1-fc53-4b6f-8a2a-35c299a547a1 | 7e4c0fc4-1f65-46d5-a3a6-1eaa2ecd33dc... | generated | Sinus rhythm. Left a... | transforms | DocumentReference/ebb2ae5a-6563-4bfd-bcf8-de095bb203b1 | sample |
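The page-aspect-ratio column holds the page dimensions as a string such as "595.28 x 841.89". A minimal sketch of turning that string into numbers follows. The function name parse_page_size is illustrative, and the unit is an assumption: the values happen to match an A4 page in PDF points, but the data itself does not state a unit.

```python
from typing import Tuple


def parse_page_size(value: str) -> Tuple[float, float]:
    """Parse a 'WIDTH x HEIGHT' string into a (width, height) tuple of floats.

    Raises ValueError if the string does not contain exactly two numbers
    separated by a single 'x'.
    """
    width_text, height_text = value.lower().split("x")
    # float() ignores the surrounding spaces left over from the split.
    return float(width_text), float(height_text)


width, height = parse_page_size("595.28 x 841.89")
print(width, height, round(width / height, 4))
```

Having the width and height as numbers is useful later when block coordinates need to be mapped onto a rendered page.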
Fetching Extracted Textual Elements
Aside from the medical code suggestions, documents contain a wealth of graphical and textual coordinates, whether they were scanned from a physical piece of paper or generated by a computer with text metadata. As seen in the earlier section on DocumentReference resources, the linked JSONL file contains the extracted text and layout information, including the pages, lines, words, and even tables in a given document.
[...{'attachment': {'contentType': 'application/x-jsonlines',
                    'url': 'https://api.dev.lifeomic.com/v1/files/f6e73773-4d33-4953-b5a8-0728ed77aa33'},
     'format': {'system': 'https://lifeomic.com/fhir/identifier-type',
                'code': 'ocr-text-file-id',
                'display': 'OCR Text File Identifier'}}]
The SDK refers to each of these elements as Block resources and automatically pulls the file down and converts it to a pandas DataFrame.
phc.Ocr.Block.get_data_frame(
    document_id="ebb2ae5a-6563-4bfd-bcf8-de095bb203b1"
)
BlockType | Relationships | Page | Confidence | Text | TextType | RowIndex | ColumnIndex | RowSpan | ColumnSpan | Polygon.X | Polygon.Y | Polygon.X_1 | Polygon.Y_1 | Polygon.X_2 | Polygon.Y_2 | Polygon.X_3 | Polygon.Y_3 | BoundingBox.Width | BoundingBox.Height | BoundingBox.Left | BoundingBox.Top |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
PAGE | [{'Type': 'CHILD', 'Ids': ['2076b805-39d3-4d78-ba9... | 1 | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0.9998 | 0 | 0.9998 | 1 | 0 | 1 | 0.9998 | 1 | 0 | 0 |
LINE | [{'Type': 'CHILD', 'Ids': ['840f6da1-3e59-4b8f-989... | 1 | 98.999 | Sinus bradycardia. Lateral T w... | nan | nan | nan | nan | nan | 0.0527 | 0.038 | 0.8421 | 0.038 | 0.8421 | 0.0503 | 0.0527 | 0.0503 | 0.7894 | 0.0122 | 0.0527 | 0.038 |
WORD | nan | 1 | 99.6008 | Sinus... | PRINTED | nan | nan | nan | nan | 0.0527 | 0.0382 | 0.0943 | 0.0382 | 0.0943 | 0.048 | 0.0527 | 0.048 | 0.0416 | 0.0098 | 0.0527 | 0.0382 |
WORD | nan | 1 | 97.2182 | bradycardia.... | PRINTED | nan | nan | nan | nan | 0.0991 | 0.0382 | 0.1909 | 0.0382 | 0.1909 | 0.0501 | 0.0991 | 0.0501 | 0.0919 | 0.0119 | 0.0991 | 0.0382 |
WORD | nan | 1 | 99.7431 | Lateral... | PRINTED | nan | nan | nan | nan | 0.1963 | 0.0384 | 0.2469 | 0.0384 | 0.2469 | 0.0479 | 0.1963 | 0.0479 | 0.0507 | 0.0095 | 0.1963 | 0.0384 |
Some of these elements reference child elements. For example, the line "Sinus bradycardia. Lateral T w..." contains the words "Sinus" and "bradycardia." Each block also has coordinates measured from the upper-left corner; in the rows above, these values all fall between 0 and 1.
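To draw a block on a rendered page, its coordinates have to be scaled to the page size. The sketch below does this for the WORD block "Sinus" from the table above. It assumes that the BoundingBox values are fractions of the page width and height, which is consistent with the PAGE block spanning roughly 0 to 1 but is not stated in the data. The Box class and scale_box function are illustrative names, not SDK types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box; left and top locate its upper-left corner."""

    left: float
    top: float
    width: float
    height: float

    @property
    def right(self) -> float:
        return self.left + self.width

    @property
    def bottom(self) -> float:
        return self.top + self.height


def scale_box(box: Box, page_width: float, page_height: float) -> Box:
    """Scale a box to page units. ASSUMES the input values are page fractions."""
    return Box(
        left=box.left * page_width,
        top=box.top * page_height,
        width=box.width * page_width,
        height=box.height * page_height,
    )


# BoundingBox.* values of the WORD block "Sinus" from the table above.
sinus = Box(left=0.0527, top=0.0382, width=0.0416, height=0.0098)

# Page size as reported in the page-aspect-ratio column ("595.28 x 841.89").
scaled = scale_box(sinus, 595.28, 841.89)
print(scaled)
```

As a consistency check, the right and bottom edges computed from the bounding box agree with the Polygon.X_1 and Polygon.Y_2 values of the same row (0.0943 and 0.048).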
Fetching FHIR Suggestions
Of course, the most interesting data is in the extracted suggestions. The interface condenses its list of suggested medications, procedures, and other resources down to clearly distinct codes, but the underlying data contains numerous slight variations. A user only discovers these after picking a suggestion and then seeing the variations in date ranges, code systems, and other features.
In contrast to the LifeOmic Platform interface, the SDK builds a data frame of all permutations of suggestions so that the data can be easily filtered. In the first two rows seen below, for example, the two observation suggestions look very similar, but their codes differ. In other words, some id values are repeated across rows while one or more other columns differ.
phc.Ocr.Suggestion.get_data_frame(
    document_id="ebb2ae5a-6563-4bfd-bcf8-de095bb203b1",
    all_results=True
)
id | type | account | project | documentReference | status | originalText | anchorDate | suggestionId | documentPage | date_value | date_confidence | date_isPHI | date_dataSource_source | value_value | value_confidence | value_dataSource_source | code_confidence | code_dataSource_source | code_value_system | code_value_code | code_value_display | date_sourceText | value_sourceText | code_sourceText | onsetDate_value | onsetDate_confidence | onsetDate_isPHI | onsetDate_dataSource_source | bodySite__item | bodySite_confidence | bodySite_dataSource_source | bodySite_value_system | bodySite_value_code | bodySite_value_display | onsetDate_sourceText | bodySite_sourceText | status_value | status_confidence | status_dataSource_source | status_sourceText |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
b424ee52-5a75-4df2-b077-1e2096ffc529 | observation | sandbox | 637475e1-3b26-4d78-87eb-5df66ab9ef59 | ebb2ae5a-6563-4bfd-bcf8-de095bb203b1 | OPEN | 2004-12-26 08:00AM BLOOD WBC-1... | 2021-03-18T12:44:35.969Z | 00015-00004-00001 | ebb2ae5a-6563-4bfd-bcf8-de095bb203b1:00015 | 2004-12-26T12:00:00.000Z | 0.999996 | 1 | comprehend | {'value': 2.68} | 0.967934 | comprehend | 0.967934 | loinc-codes | http://loinc.org | 33229-6 | RBC casts #/area in Urine by Computer assisted method | 2004-12-26 | 2.68 | RBC | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
b424ee52-5a75-4df2-b077-1e2096ffc529 | observation | sandbox | 637475e1-3b26-4d78-87eb-5df66ab9ef59 | ebb2ae5a-6563-4bfd-bcf8-de095bb203b1 | OPEN | 2004-12-26 08:00AM BLOOD WBC-1... | 2021-03-18T12:44:35.969Z | 00015-00004-00001 | ebb2ae5a-6563-4bfd-bcf8-de095bb203b1:00015 | 2004-12-26T12:00:00.000Z | 0.999996 | 1 | comprehend | {'value': 2.68} | 0.967934 | comprehend | 0.967934 | loinc-codes | http://loinc.org | 88970-9 | RBC casts #/area in Urine sediment | 2004-12-26 | 2.68 | RBC | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
b30c3405-ed20-4c04-a730-c17e5b7e777b | condition | sandbox | 637475e1-3b26-4d78-87eb-5df66ab9ef59 | ebb2ae5a-6563-4bfd-bcf8-de095bb203b1 | OPEN | Admission VS72 12 180/90 64"20... | 2021-03-18T12:44:35.577Z | 00014-00015-00000 | ebb2ae5a-6563-4bfd-bcf8-de095bb203b1:00014 | nan | nan | nan | nan | nan | nan | nan | 0.835827 | comprehend | http://hl7.org/fhir/sid/icd-10 | R09.89 | Other specified symptoms and signs involving the circulatory and respiratory systems | nan | nan | carotid bruits | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | ||
b30c3405-ed20-4c04-a730-c17e5b7e777b | condition | sandbox | 637475e1-3b26-4d78-87eb-5df66ab9ef59 | ebb2ae5a-6563-4bfd-bcf8-de095bb203b1 | OPEN | Admission VS72 12 180/90 64"20... | 2021-03-18T12:44:35.577Z | 00014-00015-00000 | ebb2ae5a-6563-4bfd-bcf8-de095bb203b1:00014 | nan | nan | nan | nan | nan | nan | nan | 0.835827 | comprehend | http://hl7.org/fhir/sid/icd-10 | I65.29 | Occlusion and stenosis of unspecified carotid artery | nan | nan | carotid bruits | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | ||
0dd951a7-abc0-4ef7-a7f0-b665fba4b7a9 | procedure | sandbox | 637475e1-3b26-4d78-87eb-5df66ab9ef59 | ebb2ae5a-6563-4bfd-bcf8-de095bb203b1 | OPEN | MCV-87 MCH-30.0 MCHC-34.7 RDW-... | 2021-03-18T12:44:35.970Z | 00015-00005-00000 | ebb2ae5a-6563-4bfd-bcf8-de095bb203b1:00015 | nan | nan | nan | nan | nan | nan | nan | 0.976198 | loinc-codes | http://loinc.org | 30428-7 | MCV [Entitic volume] | nan | MCV | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | |
0dd951a7-abc0-4ef7-a7f0-b665fba4b7a9 | procedure | sandbox | 637475e1-3b26-4d78-87eb-5df66ab9ef59 | ebb2ae5a-6563-4bfd-bcf8-de095bb203b1 | OPEN | MCV-87 MCH-30.0 MCHC-34.7 RDW-... | 2021-03-18T12:44:35.970Z | 00015-00005-00000 | ebb2ae5a-6563-4bfd-bcf8-de095bb203b1:00015 | nan | nan | nan | nan | nan | nan | nan | 0.976198 | loinc-codes | http://loinc.org | 787-2 | MCV [Entitic volume] by Automated count | nan | MCV | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
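The repeated id values can be inspected with ordinary pandas grouping. The sketch below uses a stand-in frame named sample_suggestions that copies four columns from the six rows above. It counts how many distinct codes each suggestion carries and then collapses the permutations to one row per id. Keeping the first row per id is an arbitrary illustrative choice, not how the platform interface selects a code.

```python
import pandas as pd

# Stand-in rows copied from the table above (subset of columns).
sample_suggestions = pd.DataFrame(
    [
        {"id": "b424ee52-5a75-4df2-b077-1e2096ffc529", "type": "observation",
         "code_value_code": "33229-6", "code_confidence": 0.967934},
        {"id": "b424ee52-5a75-4df2-b077-1e2096ffc529", "type": "observation",
         "code_value_code": "88970-9", "code_confidence": 0.967934},
        {"id": "b30c3405-ed20-4c04-a730-c17e5b7e777b", "type": "condition",
         "code_value_code": "R09.89", "code_confidence": 0.835827},
        {"id": "b30c3405-ed20-4c04-a730-c17e5b7e777b", "type": "condition",
         "code_value_code": "I65.29", "code_confidence": 0.835827},
        {"id": "0dd951a7-abc0-4ef7-a7f0-b665fba4b7a9", "type": "procedure",
         "code_value_code": "30428-7", "code_confidence": 0.976198},
        {"id": "0dd951a7-abc0-4ef7-a7f0-b665fba4b7a9", "type": "procedure",
         "code_value_code": "787-2", "code_confidence": 0.976198},
    ]
)

# Number of distinct codes attached to each suggestion id.
codes_per_id = sample_suggestions.groupby("id")["code_value_code"].nunique()

# One row per suggestion id, keeping the first permutation encountered.
one_per_id = sample_suggestions.drop_duplicates(subset="id")

print(codes_per_id)
print(one_per_id[["id", "type", "code_value_code"]])
```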
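With the suggestions in a data frame, ordinary pandas operations apply. The examples below assume that the result of phc.Ocr.Suggestion.get_data_frame has been assigned to suggestion_df.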
# =========================================================
# Get permutation counts of different types of suggestions
# =========================================================
suggestion_df.type.value_counts()

# =>
# condition                   9283
# medicationAdministration     931
# procedure                    914
# observation                  348
# Name: type, dtype: int64
# =====================================================
# Get medication source texts for the suggested codes
# =====================================================
suggestion_df[suggestion_df.type == "medicationAdministration"].code_sourceText.unique()

# =>
# ['K-3.9 CI-99 HCO3-29 AnGap-13', 'ASA', 'Lopressor XL', 'Lipitor',
#  'Moexpril', 'Glucosamine/chondroitin Gaviscon', 'Docusate Sodium',
#  'Ranitidine HCI', 'Aspirin', 'Oxycodone-Acetaminophen',
#  'Atorvastatin', 'Amiodarone', 'Insulin Lispro', 'Insulin Glargine',
#  'insulin', 'Furosemide', 'Warfarin', 'Insulin lantus', 'humalog',
#  'beta blockade', 'amiodarone', 'coumadin', 'Toprol XL', 'Moexipril',
#  'COumadin', 'Coumadin']
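Several of these source texts differ only in capitalization, for example 'coumadin', 'COumadin', and 'Coumadin'. A minimal sketch of grouping such variants follows, using the list above copied verbatim. The function name group_case_variants is illustrative. Only letter case is normalized, so spelling variants such as 'Moexpril' and 'Moexipril' remain separate.

```python
from collections import defaultdict
from typing import Dict, List

# The unique medication source texts shown above, copied verbatim.
source_texts = [
    "K-3.9 CI-99 HCO3-29 AnGap-13", "ASA", "Lopressor XL", "Lipitor",
    "Moexpril", "Glucosamine/chondroitin Gaviscon", "Docusate Sodium",
    "Ranitidine HCI", "Aspirin", "Oxycodone-Acetaminophen",
    "Atorvastatin", "Amiodarone", "Insulin Lispro", "Insulin Glargine",
    "insulin", "Furosemide", "Warfarin", "Insulin lantus", "humalog",
    "beta blockade", "amiodarone", "coumadin", "Toprol XL", "Moexipril",
    "COumadin", "Coumadin",
]


def group_case_variants(texts: List[str]) -> Dict[str, List[str]]:
    """Group strings that are equal when letter case is ignored."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for text in texts:
        groups[text.casefold()].append(text)
    return dict(groups)


groups = group_case_variants(source_texts)

# Keep only the keys that actually have more than one spelling.
variants = {key: values for key, values in groups.items() if len(values) > 1}
print(variants)
```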