Lab Instructions

Lab Instructions for Clinical Ontology Tokenization

Prerequisites

  1. OpenRouter API Key. You will use the free models available on OpenRouter for this exercise.

    1. Sign up for OpenRouter

    2. Go to Profile (Top Right corner) --> Keys

    3. Click on Create Key

    4. Fill in the Name and Credit Limit ($5), then click Create Key

    5. Copy the Key

  2. Updated Code and Environment

    1. Check out the lab repo from GitHub (CS595 Lab Repo)

    2. Run git pull to get the latest updates from the GitHub repo

    3. Python version 3.10 or later

    4. A Python virtual environment set up for the project


  3. Setup and Verify LOF Services

    1. Open <Project Root>

    2. Activate the python virtual environment

    3. Go to the lof folder

    4. Edit the .env file: update the client_id and client_secret values with the credentials you received

      client_id=
      client_secret=
    5. Install requirements: pip install -r requirements.txt

    6. Run services.py

      1. You should see the message: LoF Services verified successfully

    7. Possible error messages and resolutions:

      1. Missing or Incorrect client_id

        1. Failed to get LoF auth token: 400 : {"error":"Invalid client ID"}

        2. Resolution: update the .env file with the correct client_id

      2. Missing or Incorrect client_secret

        1. Failed to get LoF auth token: 401 : {"error":"Unauthorized: Client authentication failed","status":401}

        2. Resolution: update the .env file with the correct client_secret
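As a rough illustration of what services.py verifies, the sketch below reads the credentials from .env and exchanges them for an auth token. The token URL, payload fields, and helper names here are assumptions for illustration, not the actual LoF API.

```python
# Hedged sketch: how a verification script like services.py might read .env
# credentials and request an auth token. The token endpoint URL and payload
# shape below are placeholders, not the real LoF API.

def parse_env(path=".env"):
    """Minimal .env parser: KEY=value lines; blanks and # comments ignored."""
    env = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env

def get_lof_token(env, token_url="https://example.invalid/oauth/token"):
    """Exchange client credentials for a token (placeholder URL and payload)."""
    import requests  # deferred so the .env parsing sketch runs standalone

    resp = requests.post(token_url, data={
        "client_id": env.get("client_id", ""),
        "client_secret": env.get("client_secret", ""),
        "grant_type": "client_credentials",
    })
    if resp.status_code != 200:
        # Mirrors the lab's error pattern: 400 for an invalid client ID,
        # 401 when client authentication fails.
        raise RuntimeError(
            f"Failed to get LoF auth token: {resp.status_code} : {resp.text}")
    return resp.json()["access_token"]
```

If services.py reports one of the errors above, re-check the two lines in .env before rerunning.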

Tokenization lab

In this lab you will:

  1. Retrieve tokens for two sample notes

    1. Using google/gemini-2.0-flash-lite-preview-02-05:free and IMO

  2. Compare the tokens from both and display the results in a table

Setup Tokenization lab

  1. Open <Project Root>

  2. Activate the python virtual environment

  3. Go to /labs/tokenization

  4. Install requirements: pip install -r requirements.txt

  5. Edit medical_note_tokenizer.py

    1. Configure OPENROUTER_API_KEY (created in Prerequisites)
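For orientation, a minimal OpenRouter chat-completions call looks roughly like the sketch below. The endpoint and Authorization header follow OpenRouter's public API; the helper name and prompt variable are illustrative, and the lab's OpenRouterTokenizer may structure this differently.

```python
# Illustrative sketch of an OpenRouter chat-completions request; the lab's
# OpenRouterTokenizer may differ. The helper name is ours, not the lab's.

def build_openrouter_request(api_key, model, prompt):
    """Assemble the URL, headers, and JSON payload for a chat completion."""
    url = "https://openrouter.ai/api/v1/chat/completions"
    headers = {"Authorization": f"Bearer {api_key}"}
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, headers, payload

# Sending it would look like:
#   import requests
#   url, headers, payload = build_openrouter_request(
#       OPENROUTER_API_KEY, MODEL, note_text)
#   resp = requests.post(url, headers=headers, json=payload)
```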

Instructions

  1. Understand the prompt and sample response structure in constants.py

  2. Understand the following code blocks in medical_note_tokenizer.py

    1. Tokenizers: OpenRouterTokenizer and IMOTokenizer

    2. TokenizationResult

  3. Implement the code blocks in medical_note_tokenizer.py

    1. process_entity_codes

    2. display_comparison

  4. Run the Tokenizer

    streamlit run medical_note_tokenizer.py
    1. This launches the Streamlit Tokenizer web application

    2. Select sample_1.txt from the sample_notes folder

    3. Click on Tokenize and wait for results

    4. Compare the results from OpenRouter and IMO. Note down your observations.

      1. Difference in Assertion Status

      2. Difference in codes captured (ignore imo:<code>; compare the others, such as SNOMED, ICD10, ICD9, LOINC, RxNorm, CPT, etc.)

        1. In case of differences, search OHDSI Atlas (https://atlas-demo.ohdsi.org/) for the codes given by OpenRouter and IMO, and note your observations on how the representations differ. As showcased in the example below, the SNOMED code given by the OpenRouter Gemini model does not seem right: we got codes for a body structure instead of a problem/condition.

        2. Prepare a report with these observations
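The two code blocks you implement can be sketched as follows. The field names ("codes", "SYSTEM:value" strings) are assumptions about the response structure; match them to what constants.py actually defines.

```python
# Hedged sketch of process_entity_codes / display_comparison. The entity
# and code field names are assumptions; adapt them to the response
# structure defined in constants.py.

def process_entity_codes(entities):
    """Group codes by coding system, skipping imo:<code> entries."""
    by_system = {}
    for entity in entities:
        for code in entity.get("codes", []):
            system, _, value = code.partition(":")
            if system.lower() == "imo":  # lab says to ignore imo:<code>
                continue
            by_system.setdefault(system.upper(), set()).add(value)
    return by_system

def display_comparison(openrouter_codes, imo_codes):
    """Build rows comparing the two tokenizers, one per coding system."""
    rows = []
    for system in sorted(set(openrouter_codes) | set(imo_codes)):
        rows.append({
            "System": system,
            "OpenRouter": sorted(openrouter_codes.get(system, set())),
            "IMO": sorted(imo_codes.get(system, set())),
        })
    return rows  # e.g. feed into pandas.DataFrame / st.table in the app
```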

Debugging

  1. At times the OpenRouter response may be incomplete

    1. This will result in json.decoder.JSONDecodeError.

    2. Simply rerun the tokenization if the error is from OpenRouter.

    3. This should not happen for IMO

  2. At times OpenRouter response may not be in the same format as instructed in TOKEN_PROMPT

    1. You may see errors like

      1. OpenRouter API Error: list indices must be integers or slices, not str

      2. KeyError 'entities'

    2. Simply rerun the tokenization if the error is from OpenRouter.

    3. If the error persists, you may need to adjust TOKEN_PROMPT so the model returns the response in the instructed JSON format
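One way to handle both failure modes in code, instead of rerunning by hand, is a small retry wrapper; fetch below stands in for whatever function performs the actual OpenRouter call, and the function name is ours.

```python
import json

# Hedged sketch: retry truncated or malformed OpenRouter responses a few
# times before giving up. `fetch` stands in for the actual API call.
def entities_with_retry(fetch, attempts=3):
    last_err = None
    for _ in range(attempts):
        raw = fetch()  # returns the model's raw JSON text
        try:
            data = json.loads(raw)      # incomplete text -> JSONDecodeError
            return data["entities"]     # wrong shape -> KeyError / TypeError
        except (json.JSONDecodeError, KeyError, TypeError) as err:
            last_err = err              # mirrors the "simply rerun" advice
    raise last_err
```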

Submission

Create a single zip file containing the submission items below and submit it.

  1. Short Report (1–2 pages PDF)

    1. Document difficulties or errors you encountered, and how you resolved them.

    2. Document your observations on using an LLM to tokenize/code and using IMO.

  2. Observations on code differences for sample_1.txt and sample_2.txt

  3. Screenshots of token listing for sample_1.txt and sample_2.txt

  4. CSV download, from the Tokens Table, for sample_1.txt and sample_2.txt

    1. Hover over or expand the table to see the download option
