Lab Instructions for Clinical Ontology Tokenization
Prerequisites
OpenRouter API Key
You will be using the free models available on OpenRouter for this exercise.
Sign up for OpenRouter
Go to Profile (Top Right corner) --> Keys
Click on Create Key
Fill in Name and Credit Limit ($5), then click Create Key
Copy the Key
Updated Code and Environment
Check out the lab repo from GitHub (CS595 Lab Repo)
Python version 3.10 or later
A Python virtual environment to link and use for the project
Run git pull on the lab repo to make sure you have the latest updates
Setup and Verify LOF Services
Open <Project Root>
Activate the python virtual environment
Go to the lof folder
Edit the .env file
Update the client_id and client_secret values with the credentials you have received:
client_id=
client_secret=
Install Requirements
pip install -r requirements.txt
Run services.py
You should see the message:
LoF Services verified successfully
Possible Error Messages and resolution:
Missing or Incorrect client_id
Failed to get LoF auth token: 400 : {"error":"Invalid client ID"}
Resolution: Set the correct client_id in the .env file
Missing or Incorrect client_secret
Failed to get LoF auth token: 401 : {"error":"Unauthorized: Client authentication failed","status":401}
Resolution: Set the correct client_secret in the .env file
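Both errors above come from empty or wrong values in .env, so a quick local check before running services.py can save a round trip to the auth server. The following is a minimal sketch, not part of the lab code: it assumes the .env file uses plain KEY=VALUE lines and sits at lof/.env relative to the project root, and the helper names parse_env and missing_credentials are made up for illustration. It can only detect missing or empty values, not incorrect ones.

```python
from pathlib import Path

REQUIRED_KEYS = ("client_id", "client_secret")  # key names from the lab's .env


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_credentials(text: str) -> list[str]:
    """Return the required keys that are absent or have an empty value."""
    values = parse_env(text)
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if __name__ == "__main__":
    env_path = Path("lof") / ".env"  # assumed location; adjust to your checkout
    if not env_path.exists():
        print(f"{env_path} not found")
    else:
        missing = missing_credentials(env_path.read_text())
        print("Missing or empty:", missing if missing else "none")
```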
Tokenization lab
In this lab we will:
Retrieve tokens for two sample notes using google/gemini-2.0-flash-lite-preview-02-05:free (through OpenRouter) and IMO
Compare the tokens from both and display the results in a table
Setup Tokenization lab
Open <Project Root>
Activate the python virtual environment
Go to /labs/tokenization
Install Requirements
pip install -r requirements.txt
Edit medical_note_tokenizer.py
Configure OPENROUTER_API_KEY (Created in Prerequisites)
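If you prefer not to paste the key into the source file, one option is to read it from an environment variable. This is a sketch under assumptions: how medical_note_tokenizer.py actually consumes OPENROUTER_API_KEY is not shown in these instructions, and load_openrouter_key and auth_headers are hypothetical helpers. The only external fact relied on is that OpenRouter's HTTP API authenticates with a bearer token in the Authorization header.

```python
import os


def load_openrouter_key(env=os.environ) -> str:
    """Return the OpenRouter key from the environment, or raise a clear error."""
    key = env.get("OPENROUTER_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENROUTER_API_KEY is not set; create a key in OpenRouter first"
        )
    return key


def auth_headers(key: str) -> dict[str, str]:
    """Build bearer-token headers for requests to OpenRouter."""
    return {"Authorization": f"Bearer {key}", "Content-Type": "application/json"}
```

Failing early with a readable message is easier to debug than an authentication error returned later by the API.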
Instructions
Understand the prompt and sample response structure in constants.py
Understand the below code blocks in medical_note_tokenizer.py
Tokenizers: OpenRouterTokenizer and IMOTokenizer
TokenizationResult
Implement the code blocks in medical_note_tokenizer.py
process_entity_codes
display_comparison
Run the Tokenizer
streamlit run medical_note_tokenizer.py
This launches the Streamlit Tokenizer web application
Select sample_1.txt from the sample_notes folder
Click on Tokenize and wait for results
Compare the results from OpenRouter and IMO. Note down your observations.
Difference in Assertion Status
Difference in codes captured (ignore imo:<code>; compare the others, such as SNOMED, ICD10, ICD9, LOINC, RxNorm, CPT, etc.)
If the codes differ, look up the codes given by OpenRouter and IMO in OHDSI Atlas (https://atlas-demo.ohdsi.org/) and note how the two representations differ. In the lab's example, the SNOMED code given by the OpenRouter Gemini model does not seem to be right: it returned codes for a body structure instead of a problem/condition.
Prepare a report with these observations
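The code comparison described above can also be done programmatically once you have the codes for an entity from each source. The sketch below is illustrative only and is not the lab's process_entity_codes or display_comparison: it assumes each code arrives as a "SYSTEM:code" string, which may not match the real TokenizationResult layout, and the function names are hypothetical. It drops imo: codes, as the comparison step asks, and reports per coding system which codes are unique to each source.

```python
from collections import defaultdict


def group_by_system(codes: list[str]) -> dict[str, set[str]]:
    """Group 'SYSTEM:code' strings by upper-cased system, skipping imo: codes."""
    grouped = defaultdict(set)
    for item in codes:
        system, sep, code = item.partition(":")
        if not sep or system.strip().lower() == "imo":
            continue  # malformed entry or an IMO-internal code
        grouped[system.strip().upper()].add(code.strip())
    return dict(grouped)


def diff_codes(openrouter_codes: list[str], imo_codes: list[str]) -> dict:
    """Per coding system, list codes unique to each source and codes shared."""
    left = group_by_system(openrouter_codes)
    right = group_by_system(imo_codes)
    report = {}
    for system in sorted(set(left) | set(right)):
        a, b = left.get(system, set()), right.get(system, set())
        report[system] = {
            "only_openrouter": sorted(a - b),
            "only_imo": sorted(b - a),
            "shared": sorted(a & b),
        }
    return report


if __name__ == "__main__":
    # Placeholder codes for illustration only, not real clinical codes.
    print(diff_codes(["SNOMED:111", "ICD10:A00"], ["SNOMED:222", "ICD10:A00"]))
```

Codes that appear under only_openrouter or only_imo are the ones worth looking up in Atlas.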
Debugging
At times the OpenRouter response may be incomplete.
This results in a json.decoder.JSONDecodeError.
Simply rerun if the error comes from OpenRouter.
This should not happen for IMO.
At times the OpenRouter response may not follow the format instructed in TOKEN_PROMPT.
You may see errors like:
OpenRouter API Error: list indices must be integers or slices, not str
KeyError 'entities'
Simply rerun if the error comes from OpenRouter.
If the error persists, you may need to adjust TOKEN_PROMPT so that the response follows the instructed JSON format.
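Instead of rerunning by hand, the same recovery can be sketched in code: validate the shape of the reply and retry a few times when it is unusable. This is a hedged illustration, not the lab's implementation. The 'entities' key comes from the error message above; treating a bare JSON list as the entity list is an assumption about what the model sometimes returns, and call_model stands in for whatever function fetches the raw reply text.

```python
import json


def extract_entities(raw: str) -> list:
    """Return the entity list from a model reply, or raise ValueError."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"incomplete or invalid JSON: {exc}") from exc
    if isinstance(data, list):
        # Assumption: a bare list is the entity list without its wrapper object.
        return data
    if isinstance(data, dict) and isinstance(data.get("entities"), list):
        return data["entities"]
    raise ValueError("reply does not contain an 'entities' list")


def tokenize_with_retry(call_model, attempts: int = 3) -> list:
    """Call call_model() until its reply parses, up to `attempts` times."""
    last_error = None
    for _ in range(attempts):
        try:
            return extract_entities(call_model())
        except ValueError as exc:
            last_error = exc  # remember why this attempt failed, then retry
    raise RuntimeError(f"no usable reply after {attempts} attempts") from last_error
```

Keeping the retry count small matters on free models, where repeated calls can run into rate limits.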
Submission
Create a single zip file containing the submission items below and submit it.
Short Report (1–2 pages PDF)
Document difficulties or errors you encountered, and how you resolved them.
Document your observations on using an LLM to tokenize/code and using IMO.
Observations on code differences for sample_1.txt and sample_2.txt
Screenshots of token listing for sample_1.txt and sample_2.txt
CSV download, from the Tokens Table, for sample_1.txt and sample_2.txt
You can mouse-hover or expand the table to see the download option
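If you want to script the packaging step, the standard-library zipfile module is enough. The sketch below is optional and makes assumptions: build_submission is a hypothetical helper, and the file names in the usage note are placeholders for your own report, screenshots, and CSV exports.

```python
import zipfile
from pathlib import Path


def build_submission(zip_path, files) -> list[str]:
    """Write the files that exist into zip_path; return the ones that are missing."""
    missing = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in files:
            path = Path(name)
            if path.is_file():
                # Store each file flat, under its base name only.
                archive.write(path, arcname=path.name)
            else:
                missing.append(str(name))
    return missing
```

For example, build_submission("submission.zip", ["report.pdf", "sample_1_tokens.csv"]) returns the names it could not find, so you can spot a forgotten item before uploading.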