Analyzed Document Data

After a document has been analyzed and is visible in PrecisionOCR, it can be useful to pull down all of its predictions, either to comb through the large number of suggestions or to build an integration.

For this example, the goal is to retrieve an analyzed OCR document for a fictional patient, Dan Dennis, and view all of the suggested medical codes as well as the detailed textual elements of the uploaded PDF.

The first step is to find the matching patients by family name.

phc.Patient.get_data_frame(term={
    "name.family.keyword": "Dennis"
})

id | account | resourceType | name_use | name_given_0 | name_family | birthDate.tz | birthDate.local | gender | meta.tag_lastUpdated.tz | meta.tag_lastUpdated.local
0c23c681-eeb7-491d-bb99-5ab77df53941 | sample | Patient | official | Dan | Dennis | 0 | 1983-02-24T00:00:00+00:00 | male | 0 | 2021-03-18T02:43:51.064000+00:00

Fetching Documents from PrecisionOCR

All documents in PrecisionOCR are stored as FHIR DocumentReference resources. Because they carry a unique code that distinguishes them from other documents in the account, the phc.Ocr.Document class can return only PrecisionOCR documents. The documents for Dan Dennis are retrieved by passing his patient ID as the patient_id parameter.

phc.Ocr.Document.get_data_frame(patient_id="0c23c681-eeb7-491d-bb99-5ab77df53941")

id | account | resourceType | meta.tag_system__lifeomic.com/fhir/dataset__code | meta.tag_system__lifeomic.com/fhir/source__code | meta.tag_lastUpdated.tz | meta.tag_lastUpdated.local | type.coding_system__loinc.org__code | type.coding_system__loinc.org__display | indexed | status | meta.tag_system__lifeomic.com/ocr/document/status__code | docStatus | content | description
ebb2ae5a-6563-4bfd-bcf8-de095bb203b1 | sample | DocumentReference | 637475e1-3b26-4d78-87eb-5df66ab9ef59 | PrecisionOCR Service | 0 | 2021-03-18T02:44:57.061000+00:00 | 11488-4 | Consult note | 2021-03-18T02:43:41.846Z | current | SUCCESS | preliminary | [...] | ocr-uploads/D D Notes.pdf

If the document ID is already known, a single record can be retrieved.

phc.Ocr.Document.get("ebb2ae5a-6563-4bfd-bcf8-de095bb203b1")

The content column, abbreviated as [...] in the output above, holds a JSON list that links to the files associated with the document.

[{'attachment': {'contentType': 'application/pdf',
                 'url': 'https://api.dev.lifeomic.com/v1/files/d2b334dd-5461-4480-b289-9ddb66717360'},
  'format': {'system': 'https://lifeomic.com/fhir/identifier-type',
             'code': 'ocr-file-id',
             'display': 'OCR File Identifier'}},
 {'attachment': {'contentType': 'application/x-jsonlines',
                 'url': 'https://api.dev.lifeomic.com/v1/files/f6e73773-4d33-4953-b5a8-0728ed77aa33'},
  'format': {'system': 'https://lifeomic.com/fhir/identifier-type',
             'code': 'ocr-text-file-id',
             'display': 'OCR Text File Identifier'}}]

The PDF file coded as 'ocr-file-id' is the original upload in the file service, while the JSON Lines file coded as 'ocr-text-file-id' contains the textual elements extracted from the PDF. The second file is discussed further in the Fetching Extracted Textual Elements section.
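
Picking the right attachment out of the content list is a small lookup on the format code. The sketch below is one way to do it, using the content value shown above. The helper names attachment_url and file_id_from_url are hypothetical, and treating the last URL path segment as the file ID is an assumption based on how these URLs look, not documented behavior.

```python
from typing import List, Optional

# Copied from the `content` column shown above.
content = [
    {
        "attachment": {
            "contentType": "application/pdf",
            "url": "https://api.dev.lifeomic.com/v1/files/d2b334dd-5461-4480-b289-9ddb66717360",
        },
        "format": {
            "system": "https://lifeomic.com/fhir/identifier-type",
            "code": "ocr-file-id",
            "display": "OCR File Identifier",
        },
    },
    {
        "attachment": {
            "contentType": "application/x-jsonlines",
            "url": "https://api.dev.lifeomic.com/v1/files/f6e73773-4d33-4953-b5a8-0728ed77aa33",
        },
        "format": {
            "system": "https://lifeomic.com/fhir/identifier-type",
            "code": "ocr-text-file-id",
            "display": "OCR Text File Identifier",
        },
    },
]


def attachment_url(entries: List[dict], format_code: str) -> Optional[str]:
    """Return the URL of the first attachment whose format code matches."""
    for entry in entries:
        if entry.get("format", {}).get("code") == format_code:
            return entry.get("attachment", {}).get("url")
    return None


def file_id_from_url(url: str) -> str:
    """Assumption: the last path segment of the URL is the file ID."""
    return url.rstrip("/").rsplit("/", 1)[-1]


text_url = attachment_url(content, "ocr-text-file-id")
print(text_url)
print(file_id_from_url(text_url))
```

Returning None for an unknown code keeps the lookup safe for documents that have not finished processing and may lack the text attachment.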

Fetching Page Data for Documents

phc.Ocr.DocumentComposition.get_data_frame(
    document_id="ebb2ae5a-6563-4bfd-bcf8-de095bb203b1",
    all_results=True
)

resourceType | title | id | meta.tag_system__lifeomic.com/fhir/source__code | meta.tag_system__lifeomic.com/ocr/documents/page-number__code | meta.tag_lastUpdated.tz | meta.tag_lastUpdated.local | date.tz | date.local | status | subject.reference | author.reference | type.coding_system__loinc.org__code | type.coding_system__loinc.org__display | extension.url__lifeomic.com/fhir/ocr/page-rotation__valueInteger | extension.url__lifeomic.com/fhir/ocr/page-dates__valueString | extension.url__lifeomic.com/fhir/ocr/page-aspect-ratio__valueString | extension.url__lifeomic.com/fhir/ocr/page-image__valueString | extension.url__lifeomic.com/fhir/ocr/masked-word-ids__valueString | text.status | text.div | relatesTo.code | relatesTo.targetReference_reference | account
Composition | ... | 83daf7bf-6468-45e1-a021-d634ef116521 | PrecisionOCR Service | 0 | 0 | 2021-03-18 02:44:53.520000+00:00 | 0 | 2021-03-18 02:44:50.787000+00:00 | final |  | user@example.com | 34765-8 | General medicine Note | 0 | nan | 595.28 x 841.89 | f14e940a-2690-406c-a916-c61ca935a71f | nan | generated | Sinus bradycardia. L... | transforms | DocumentReference/ebb2ae5a-6563-4bfd-bcf8-de095bb203b1 | sample
Composition | There is no pleural ... | 3124cbc7-fc06-47d0-9636-48efc0a59a21 | PrecisionOCR Service | 1 | 0 | 2021-03-18 02:44:53.522000+00:00 | 0 | 2021-03-18 02:44:50.817000+00:00 | final |  | user@example.com | 34765-8 | General medicine Note | 0 | [{"isRelative":false,"wordIds":["1385f0c... | 595.28 x 841.89 | b9c32f92-b3c0-452d-9cfd-6b2146fff097 | 29c02c00-f6ef-46bc-9a38-5ebd08d7ff94,f2b2ff2b... | generated | 2004-12-16 1:01 PM C... | transforms | DocumentReference/ebb2ae5a-6563-4bfd-bcf8-de095bb203b1 | sample
Composition | AORTIC VALVE: Normal... | 1d194ecd-9ec8-4b22-9e48-b9efa42c9660 | PrecisionOCR Service | 2 | 0 | 2021-03-18 02:44:53.526000+00:00 | 0 | 2021-03-18 02:44:50.955000+00:00 | final |  | user@example.com | 34765-8 | General medicine Note | 0 | nan | 595.28 x 841.89 | a6781a42-6435-4104-b341-2661a600e80e | 669ca3b9-d3c1-47b1-b412-a12b542dd3a4... | generated | PATIENT/TEST INFORMA... | transforms | DocumentReference/ebb2ae5a-6563-4bfd-bcf8-de095bb203b1 | sample
Composition | The left ventricular... | 848c3cd3-38e1-4663-8394-0fd317c8fe6b | PrecisionOCR Service | 3 | 0 | 2021-03-18 02:44:53.523000+00:00 | 0 | 2021-03-18 02:44:50.840000+00:00 | final |  | user@example.com | 34765-8 | General medicine Note | 0 | nan | 595.28 x 841.89 | 9775ee58-221a-450a-9fe6-854d5d508d2b | nan | generated | Conclusions: PRE-BYP... | transforms | DocumentReference/ebb2ae5a-6563-4bfd-bcf8-de095bb203b1 | sample
Composition | QS deflections in le... | 597ba7a0-85b0-4bc1-9055-6c497ea7ad19 | PrecisionOCR Service | 4 | 0 | 2021-03-18 02:44:53.524000+00:00 | 0 | 2021-03-18 02:44:50.876000+00:00 | final |  | user@example.com | 34765-8 | General medicine Note | 0 | [{"isRelative":false,"wordIds":["7e4c0fc... | 595.28 x 841.89 | 901844d1-fc53-4b6f-8a2a-35c299a547a1 | 7e4c0fc4-1f65-46d5-a3a6-1eaa2ecd33dc... | generated | Sinus rhythm. Left a... | transforms | DocumentReference/ebb2ae5a-6563-4bfd-bcf8-de095bb203b1 | sample
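
Each page row carries its size in the page-aspect-ratio extension column as a string such as "595.28 x 841.89". A minimal sketch for turning that string into numbers follows. The helper parse_page_size is hypothetical, and the "WIDTH x HEIGHT" layout is inferred from the values shown above rather than from any documented format. The numbers happen to match A4 paper measured in PostScript points, although the data itself does not state a unit.

```python
from typing import Tuple


def parse_page_size(value: str) -> Tuple[float, float]:
    """Parse a 'WIDTH x HEIGHT' string into a (width, height) tuple of floats."""
    width_text, height_text = value.lower().split("x")
    # float() ignores the surrounding spaces left over from the split.
    return float(width_text), float(height_text)


width, height = parse_page_size("595.28 x 841.89")
print(width, height)
print("portrait" if height > width else "landscape")
```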

Fetching Extracted Textual Elements

Aside from the medical code suggestions, each document carries detailed text and layout coordinates, whether it was scanned from a physical piece of paper or generated digitally with text metadata. As seen in the earlier section on DocumentReference resources, the linked JSONL file contains the extracted text and layout information, including the pages, lines, words, and even tables in a given document.

[...{'attachment': {'contentType': 'application/x-jsonlines',
                    'url': 'https://api.dev.lifeomic.com/v1/files/f6e73773-4d33-4953-b5a8-0728ed77aa33'},
     'format': {'system': 'https://lifeomic.com/fhir/identifier-type',
                'code': 'ocr-text-file-id',
                'display': 'OCR Text File Identifier'}}]
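
The application/x-jsonlines content type indicates a JSON Lines file, in which every non-empty line is a standalone JSON object. For anyone handling the file outside the SDK, the sketch below shows one way to parse such text. The two sample records are hypothetical: their field names only mirror the columns of the Block data frame in this section, and the real file may nest or name fields differently.

```python
import json
from typing import List

# Hypothetical two-record sample; not the actual file contents.
sample_jsonl = "\n".join([
    json.dumps({"BlockType": "LINE", "Page": 1, "Text": "Sinus bradycardia."}),
    json.dumps({"BlockType": "WORD", "Page": 1, "Text": "Sinus"}),
])


def parse_json_lines(text: str) -> List[dict]:
    """Parse JSON Lines text: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


blocks = parse_json_lines(sample_jsonl)
words = [block["Text"] for block in blocks if block["BlockType"] == "WORD"]
print(len(blocks), words)
```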

The SDK refers to each of these elements as Block resources and automatically pulls the file down and converts it to a pandas DataFrame.

phc.Ocr.Block.get_data_frame(
    document_id="ebb2ae5a-6563-4bfd-bcf8-de095bb203b1"
)

BlockType | Relationships | Page | Confidence | Text | TextType | RowIndex | ColumnIndex | RowSpan | ColumnSpan | Polygon.X | Polygon.Y | Polygon.X_1 | Polygon.Y_1 | Polygon.X_2 | Polygon.Y_2 | Polygon.X_3 | Polygon.Y_3 | BoundingBox.Width | BoundingBox.Height | BoundingBox.Left | BoundingBox.Top
PAGE | [{'Type': 'CHILD', 'Ids': ['2076b805-39d3-4d78-ba9... | 1 | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0.9998 | 0 | 0.9998 | 1 | 0 | 1 | 0.9998 | 1 | 0 | 0
LINE | [{'Type': 'CHILD', 'Ids': ['840f6da1-3e59-4b8f-989... | 1 | 98.999 | Sinus bradycardia. Lateral T w... | nan | nan | nan | nan | nan | 0.0527 | 0.038 | 0.8421 | 0.038 | 0.8421 | 0.0503 | 0.0527 | 0.0503 | 0.7894 | 0.0122 | 0.0527 | 0.038
WORD | nan | 1 | 99.6008 | Sinus... | PRINTED | nan | nan | nan | nan | 0.0527 | 0.0382 | 0.0943 | 0.0382 | 0.0943 | 0.048 | 0.0527 | 0.048 | 0.0416 | 0.0098 | 0.0527 | 0.0382
WORD | nan | 1 | 97.2182 | bradycardia.... | PRINTED | nan | nan | nan | nan | 0.0991 | 0.0382 | 0.1909 | 0.0382 | 0.1909 | 0.0501 | 0.0991 | 0.0501 | 0.0919 | 0.0119 | 0.0991 | 0.0382
WORD | nan | 1 | 99.7431 | Lateral... | PRINTED | nan | nan | nan | nan | 0.1963 | 0.0384 | 0.2469 | 0.0384 | 0.2469 | 0.0479 | 0.1963 | 0.0479 | 0.0507 | 0.0095 | 0.1963 | 0.0384

Some of these elements reference child blocks. For example, the line "Sinus bradycardia. Lateral T w..." contains the words "Sinus" and "bradycardia." Each block also carries its coordinates: the four corners of a polygon plus a bounding box whose Left and Top values locate its upper-left corner.
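
The polygon and bounding-box columns describe the same region in two ways, and one can be derived from the other. The sketch below recomputes the bounding box of the LINE block from its polygon corners and then scales it to page units. It assumes that coordinates are fractions of the page measured from the top-left corner, which the PAGE block spanning 0 to roughly 1 suggests but the source does not state. The function names and the 595.28 x 841.89 page size (taken from the page data earlier) are illustrative.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def bounding_box(polygon: List[Point]) -> Dict[str, float]:
    """Smallest axis-aligned box containing every polygon corner."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    left, top = min(xs), min(ys)
    return {"Left": left, "Top": top,
            "Width": max(xs) - left, "Height": max(ys) - top}


def to_page_units(box: Dict[str, float], page_width: float,
                  page_height: float) -> Dict[str, float]:
    """Scale fractional coordinates to page units (assumes a 0..1 range)."""
    return {
        "Left": box["Left"] * page_width,
        "Top": box["Top"] * page_height,
        "Width": box["Width"] * page_width,
        "Height": box["Height"] * page_height,
    }


# Polygon corners of the LINE block in the Block output above.
line_polygon = [(0.0527, 0.038), (0.8421, 0.038),
                (0.8421, 0.0503), (0.0527, 0.0503)]

box = bounding_box(line_polygon)
scaled = to_page_units(box, 595.28, 841.89)
print(box)
print(scaled)
```

The recomputed height (0.0503 minus 0.038) differs from the listed BoundingBox.Height of 0.0122 in the last digit, presumably because the displayed values are rounded.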

Fetching FHIR Suggestions

Of course, the most interesting data is in the extracted suggestions. The interface condenses the suggested medications, procedures, and other resources into a list of clearly distinct codes. The underlying data, however, contains numerous slight variations, which a user only discovers after picking a suggestion and then seeing the differences in date ranges, code systems, and other features.

In contrast to the LifeOmic Platform interface, the SDK builds a data frame of all permutations of suggestions so that the data can be easily filtered. In the first two rows below, for example, the observation suggestions look very similar, but their codes differ. In other words, some id values are repeated across rows, with one or more other columns differing.

phc.Ocr.Suggestion.get_data_frame(
    document_id="ebb2ae5a-6563-4bfd-bcf8-de095bb203b1",
    all_results=True
)
idtypeaccountprojectdocumentReferencestatusoriginalTextanchorDatesuggestionIddocumentPagedate_valuedate_confidencedate_isPHIdate_dataSource_sourcevalue_valuevalue_confidencevalue_dataSource_sourcecode_confidencecode_dataSource_sourcecode_value_systemcode_value_codecode_value_displaydate_sourceTextvalue_sourceTextcode_sourceTextonsetDate_valueonsetDate_confidenceonsetDate_isPHIonsetDate_dataSource_sourcebodySite__itembodySite_confidencebodySite_dataSource_sourcebodySite_value_systembodySite_value_codebodySite_value_displayonsetDate_sourceTextbodySite_sourceTextstatus_valuestatus_confidencestatus_dataSource_sourcestatus_sourceText
b424ee52-5a75-4df2-b077-1e2096ffc529observationsandbox637475e1-3b26-4d78-87eb-5df66ab9ef59ebb2ae5a-6563-4bfd-bcf8-de095bb203b1OPEN2004-12-26 08:00AM BLOOD WBC-1...2021-03-18T12:44:35.969Z00015-00004-00001ebb2ae5a-6563-4bfd-bcf8-de095bb203b1:000152004-12-26T12:00:00.000Z0.9999961comprehend{'value': 2.68}0.967934comprehend0.967934loinc-codeshttp://loinc.org33229-6RBC casts [#/area] in Urine by Computer assisted method2004-12-262.68RBCnannannannannannannannannannannannannannannannan
b424ee52-5a75-4df2-b077-1e2096ffc529observationsandbox637475e1-3b26-4d78-87eb-5df66ab9ef59ebb2ae5a-6563-4bfd-bcf8-de095bb203b1OPEN2004-12-26 08:00AM BLOOD WBC-1...2021-03-18T12:44:35.969Z00015-00004-00001ebb2ae5a-6563-4bfd-bcf8-de095bb203b1:000152004-12-26T12:00:00.000Z0.9999961comprehend{'value': 2.68}0.967934comprehend0.967934loinc-codeshttp://loinc.org88970-9RBC casts [#/area] in Urine sediment2004-12-262.68RBCnannannannannannannannannannannannannannannannan
b30c3405-ed20-4c04-a730-c17e5b7e777bconditionsandbox637475e1-3b26-4d78-87eb-5df66ab9ef59ebb2ae5a-6563-4bfd-bcf8-de095bb203b1OPENAdmission VS72 12 180/90 64"20...2021-03-18T12:44:35.577Z00014-00015-00000ebb2ae5a-6563-4bfd-bcf8-de095bb203b1:00014nannannannannannannan0.835827comprehendhttp://hl7.org/fhir/sid/icd-10R09.89Other specified symptoms and signs involving the circulatory and respiratory systemsnannancarotid bruitsnannannannannannannannannannannannannannan
b30c3405-ed20-4c04-a730-c17e5b7e777bconditionsandbox637475e1-3b26-4d78-87eb-5df66ab9ef59ebb2ae5a-6563-4bfd-bcf8-de095bb203b1OPENAdmission VS72 12 180/90 64"20...2021-03-18T12:44:35.577Z00014-00015-00000ebb2ae5a-6563-4bfd-bcf8-de095bb203b1:00014nannannannannannannan0.835827comprehendhttp://hl7.org/fhir/sid/icd-10I65.29Occlusion and stenosis of unspecified carotid arterynannancarotid bruitsnannannannannannannannannannannannannannan
0dd951a7-abc0-4ef7-a7f0-b665fba4b7a9proceduresandbox637475e1-3b26-4d78-87eb-5df66ab9ef59ebb2ae5a-6563-4bfd-bcf8-de095bb203b1OPENMCV-87 MCH-30.0 MCHC-34.7 RDW-...2021-03-18T12:44:35.970Z00015-00005-00000ebb2ae5a-6563-4bfd-bcf8-de095bb203b1:00015nannannannannannannan0.976198loinc-codeshttp://loinc.org30428-7MCV [Entitic volume]nanMCVnannannannannannannannannannannannannannannannan
0dd951a7-abc0-4ef7-a7f0-b665fba4b7a9proceduresandbox637475e1-3b26-4d78-87eb-5df66ab9ef59ebb2ae5a-6563-4bfd-bcf8-de095bb203b1OPENMCV-87 MCH-30.0 MCHC-34.7 RDW-...2021-03-18T12:44:35.970Z00015-00005-00000ebb2ae5a-6563-4bfd-bcf8-de095bb203b1:00015nannannannannannannan0.976198loinc-codeshttp://loinc.org787-2MCV [Entitic volume] by Automated countnanMCVnannannannannannannannannannannannannannannannan
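
Because each row is one permutation, grouping rows by id shows how many code variants sit behind a single suggestion. The sketch below does this with plain Python on the ids, types, and codes copied from the six rows above. Only those three columns are reproduced, and the variable names are illustrative.

```python
from collections import defaultdict

# id, type, and code_value_code copied from the six rows shown above.
rows = [
    {"id": "b424ee52-5a75-4df2-b077-1e2096ffc529", "type": "observation", "code_value_code": "33229-6"},
    {"id": "b424ee52-5a75-4df2-b077-1e2096ffc529", "type": "observation", "code_value_code": "88970-9"},
    {"id": "b30c3405-ed20-4c04-a730-c17e5b7e777b", "type": "condition", "code_value_code": "R09.89"},
    {"id": "b30c3405-ed20-4c04-a730-c17e5b7e777b", "type": "condition", "code_value_code": "I65.29"},
    {"id": "0dd951a7-abc0-4ef7-a7f0-b665fba4b7a9", "type": "procedure", "code_value_code": "30428-7"},
    {"id": "0dd951a7-abc0-4ef7-a7f0-b665fba4b7a9", "type": "procedure", "code_value_code": "787-2"},
]

# Collect the distinct codes suggested for each suggestion id.
codes_by_suggestion = defaultdict(set)
for row in rows:
    codes_by_suggestion[row["id"]].add(row["code_value_code"])

for suggestion_id, codes in codes_by_suggestion.items():
    print(suggestion_id, sorted(codes))
```

With the full pandas data frame, a comparable count per suggestion could be obtained with groupby("id")["code_value_code"].nunique().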

Once the result is assigned to a variable, here suggestion_df, the suggestions can be explored with ordinary pandas operations.

# =========================================================
# Get permutation counts of different types of suggestions
# =========================================================
suggestion_df.type.value_counts()
# =>
# condition                   9283
# medicationAdministration     931
# procedure                    914
# observation                  348
# Name: type, dtype: int64

# =====================================================
# Get medication source texts for the suggested codes
# =====================================================
suggestion_df[suggestion_df.type == "medicationAdministration"].code_sourceText.unique()
# =>
# ['K-3.9 CI-99 HCO3-29 AnGap-13', 'ASA', 'Lopressor XL', 'Lipitor',
#  'Moexpril', 'Glucosamine/chondroitin Gaviscon', 'Docusate Sodium',
#  'Ranitidine HCI', 'Aspirin', 'Oxycodone-Acetaminophen',
#  'Atorvastatin', 'Amiodarone', 'Insulin Lispro', 'Insulin Glargine',
#  'insulin', 'Furosemide', 'Warfarin', 'Insulin lantus', 'humalog',
#  'beta blockade', 'amiodarone', 'coumadin', 'Toprol XL', 'Moexipril',
#  'COumadin', 'Coumadin']
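
Several of these source texts differ only in capitalization. A first clean-up step might be to group them case-insensitively, as sketched below with the list copied from the output above. This is only an illustration: it does not reconcile misspellings or brand and generic names, which would require a drug vocabulary.

```python
from collections import defaultdict

# Source texts copied from the output above.
source_texts = [
    'K-3.9 CI-99 HCO3-29 AnGap-13', 'ASA', 'Lopressor XL', 'Lipitor',
    'Moexpril', 'Glucosamine/chondroitin Gaviscon', 'Docusate Sodium',
    'Ranitidine HCI', 'Aspirin', 'Oxycodone-Acetaminophen',
    'Atorvastatin', 'Amiodarone', 'Insulin Lispro', 'Insulin Glargine',
    'insulin', 'Furosemide', 'Warfarin', 'Insulin lantus', 'humalog',
    'beta blockade', 'amiodarone', 'coumadin', 'Toprol XL', 'Moexipril',
    'COumadin', 'Coumadin',
]

# Group the original spellings under a case-insensitive key.
variants = defaultdict(list)
for text in source_texts:
    variants[text.casefold()].append(text)

# Keep only the keys that appear with more than one spelling.
repeated = {key: texts for key, texts in variants.items() if len(texts) > 1}
print(len(source_texts), "texts,", len(variants), "distinct ignoring case")
print(repeated)
```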