Research Projects

Radiomics-Based Risk Stratification in Head & Neck Cancer

DBT/Wellcome Trust India Alliance Funded Project | QIRAIL, CMC Vellore

Aug 2024 - Present

Background: Accurately estimating each patient's risk of early disease failure is crucial for understanding tumor biology, stratifying patients, and tailoring treatment. Current risk stratification methods predict locoregional recurrence in head and neck cancer patients poorly, creating a clear need for better prediction models.

The Challenge: Cancer radiomics converts routine radiological images, traditionally interpreted qualitatively, into quantifiable data describing tumor phenotypes. Unlike tissue biopsies, radiomics captures information from the entire tumor non-invasively, reducing sampling errors and requiring no additional radiation exposure for patients.

Our Approach: We systematically compared 8 metaheuristic feature selection algorithms (Particle Swarm Optimization, Genetic Algorithms, Grey Wolf Optimizer, Whale Optimization Algorithm, etc.) combined with 7 machine learning classifiers, giving 56 model combinations, on pretreatment CT scans from 367 patients. The key research question: Can pretreatment radiomics signatures accurately identify advanced head and neck cancer (HNC) patients at higher risk of recurrence?

367 Patients Analyzed | AUC 0.81 (95% CI: 0.62-0.95) | 10-Feature Signature | 56 Model Combinations

🎯 Key Achievements

Developed interpretable prediction model: Created a clinically meaningful 10-feature signature (4 clinical + 6 radiomics features) that achieved an AUC of 0.81 (95% CI: 0.62-0.95) on a held-out test set while remaining interpretable enough for clinical adoption
Systematic metaheuristic comparison: First comprehensive evaluation of 8 metaheuristic optimizers (PSO, GA, GWO, WOA, etc.) across multiple classifiers (Logistic Regression, Naive Bayes, SVM, Random Forest, etc.) specifically for radiomics feature selection in HNC
Mechanistic insights into model behavior: Documented that larger feature sets underperformed because of overfitting in high-dimensional radiomics data, providing guidance for future model development
Clinical validation framework: Established collaboration with oncology team to validate biological plausibility of selected radiomic features, ensuring features reflect real tumor biology rather than imaging artifacts
Reproducible pipeline: Built end-to-end analysis pipeline from DICOM retrieval through feature extraction to model validation, enabling future multi-institutional validation studies

🔬 Methodology Details

Feature extraction: Extracted comprehensive radiomic features from pretreatment CT scans using PyRadiomics, including first-order statistics, shape features, and texture features (GLCM, GLRLM, GLSZM)
Feature selection strategies: Implemented and compared LASSO regularization, SelectKBest univariate selection, and nature-inspired metaheuristics (PSO, WOA) for optimal feature subset identification
Model validation: Employed rigorous temporal splitting and nested cross-validation to prevent data leakage and ensure unbiased performance estimates
Performance metrics: Evaluated models using ROC AUC, calibration curves, decision curve analysis, and clinical net benefit to assess both discrimination and clinical utility
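
The discrimination result for this project is reported as an AUC with a 95% confidence interval. As a minimal sketch of how such an interval can be obtained, the snippet below computes AUC from pairwise rankings and a percentile bootstrap interval using only the Python standard library. The function names, the resample count, and the toy labels and scores are illustrative assumptions, not the project's code or data, and the project may have used a different interval method.

```python
import random


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: resample cases with replacement, recompute AUC."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # AUC is undefined when a resample holds only one class
        aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lo_idx = int((alpha / 2) * n_boot)
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot))
    return aucs[lo_idx], aucs[hi_idx]


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    y_true = [0, 0, 1, 1]
    y_score = [0.1, 0.4, 0.35, 0.8]
    print("AUC:", roc_auc(y_true, y_score))  # 3 of 4 pairs ranked correctly -> 0.75
    print("95% CI:", bootstrap_auc_ci(y_true, y_score))
```

With few test cases, and especially few events, the resampled AUC values spread widely, which is one reason a held-out estimate can carry a wide interval.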

Technologies & Tools

Python PyRadiomics scikit-learn XGBoost PSO Genetic Algorithms Grey Wolf Optimizer SHAP Orthanc DICOM

Reproducibility Study: CNN-Based Head and Neck Cancer Prognosis

QIRAIL, CMC Vellore

2024

Critical examination of reproducibility claims in published CNN model for head and neck cancer outcomes, revealing significant dataset and documentation issues while successfully reproducing results.

🎯 Key Contributions

Challenged reproducibility claims of published CNN model by attempting complete replication across three HNC outcomes (distant metastasis, locoregional failure, overall survival)
Identified major dataset and documentation issues: Incorrectly provided datasets, multiple errors in data files, inadequate result reporting protocols, and poor documentation that contradicted reproducibility claims
Successfully reproduced results despite paper's flaws by correcting dataset errors, implementing missing preprocessing steps, and establishing proper validation protocols
Authors acknowledged reproducibility failures: Communicated findings that led to author recognition of dataset errors and documentation inadequacies in their published work

Technologies & Tools

Python PyTorch CNN DICOM Data Validation

🔗 GitHub Repository

CHAVI: CompreHensive Digital ArchiVe of Cancer Imaging

India's First National Oncology Imaging Biobank | Led by Tata Memorial Centre & IIT Kharagpur

2024

The National Challenge: Development of AI and machine learning models in oncology requires access to large-scale, high-quality, and standardized imaging datasets. However, region-specific imaging biobanks remain scarce in South Asia, limiting AI research capabilities and model generalizability to diverse patient populations.

CHAVI's Mission: Creating India's first centralized repository of de-identified oncology imaging data to democratize medical data for research, foster reproducibility and collaboration, and enable AI-driven innovations in early cancer detection, prognosis prediction, and personalized treatment strategies.

Our Contribution: Designed and implemented comprehensive data preparation and integration pipeline for seamless upload of Head and Neck Cancer imaging and clinical data to CHAVI, ensuring ethical compliance with HIPAA and GDPR while maintaining data utility for AI research.

304+ HNC Cases Curated | 100% FAIR Compliant | Zero Privacy Violations | Multi-Institutional

🎯 Key Achievements

Curated 304+ anonymized HNC cases with comprehensive validated clinical and imaging metadata, establishing CMC Vellore as a major contributor to India's national cancer imaging repository
Automated FAIR-compliant pipeline: Built end-to-end data processing system ensuring datasets are Findable, Accessible, Interoperable, and Reusable for multi-institutional research collaboration
Rigorous de-identification protocols: Implemented comprehensive PHI/PII removal from both DICOM headers and clinical reports while maintaining data linkage integrity through anonymized identifiers
Quality control framework: Established systematic validation protocols detecting formatting errors, completing missing fields through logical inference, and standardizing data formats across institutional sources
Ethical compliance documentation: Maintained complete audit trails ensuring adherence to HIPAA and GDPR throughout data handling lifecycle

🔬 Technical Implementation

Data preprocessing pipeline: Automated detection and correction of formatting errors, completion of missing fields, and standardization according to CHAVI's structured requirements
De-identification workflow: Multi-stage process removing patient names, hospital identifiers, and PII from imaging metadata (DICOM headers) and masking free-text fields in clinical reports
Data harmonization: Mapped clinical data into standardized formats ensuring interoperability across research platforms, with consistent imaging parameters and metadata fields
Anonymized linking: Assigned unique non-identifiable keys enabling multimodal research (combining clinical, imaging, and outcome data) without compromising patient privacy
AI-readiness optimization: Structured datasets specifically for machine learning applications, ensuring compatibility with common deep learning frameworks
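
The de-identification and anonymized-linking steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the pipeline used for CHAVI: headers are modelled as plain dictionaries keyed by DICOM keyword, the tag list is a small hypothetical subset rather than a complete confidentiality profile, and the `pseudonym` and `deidentify_header` helpers are names invented for this example. A keyed hash (HMAC-SHA256) gives every record of the same patient the same non-identifying key, so imaging and clinical data can still be joined.

```python
import hashlib
import hmac

# Hypothetical subset of identifying DICOM keywords, for illustration only.
# A real profile would follow the DICOM confidentiality profiles and local policy.
IDENTIFYING_TAGS = {
    "PatientName",
    "PatientBirthDate",
    "InstitutionName",
    "ReferringPhysicianName",
    "AccessionNumber",
}


def pseudonym(patient_id: str, secret_key: bytes, length: int = 16) -> str:
    """Derive a stable, non-reversible key from the hospital patient ID."""
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return "ANON-" + digest[:length]


def deidentify_header(header: dict, secret_key: bytes) -> dict:
    """Return a copy with identifying tags dropped and PatientID replaced."""
    clean = {k: v for k, v in header.items() if k not in IDENTIFYING_TAGS}
    clean["PatientID"] = pseudonym(header["PatientID"], secret_key)
    return clean


if __name__ == "__main__":
    key = b"replace-with-a-secret-kept-outside-the-dataset"  # placeholder secret
    imaging = {"PatientID": "H12345", "PatientName": "Doe^Jane", "Modality": "CT"}
    clinical = {"PatientID": "H12345", "stage": "T3N1"}  # invented clinical row
    anon_image = deidentify_header(imaging, key)
    anon_clinical = dict(clinical, PatientID=pseudonym(clinical["PatientID"], key))
    # Both records now share the same anonymized key and can be linked.
    print(anon_image["PatientID"] == anon_clinical["PatientID"])
```

Keeping the secret key outside the shared dataset matters here: without it, the mapping from hospital ID to anonymized key cannot be recomputed by a data recipient.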

🌐 Broader Impact

Democratizing oncology research: Contributing to public repository enables researchers nationwide to access curated datasets without institutional barriers
Addressing regional disparities: CHAVI fills critical gap in South Asian imaging biobanks, enabling development of AI models trained on locally relevant patient populations
Fostering collaboration: Standardized data formats facilitate multi-institutional validation studies and reproducible research across cancer centers
Enabling precision oncology: Large-scale datasets support development of personalized treatment strategies based on imaging biomarkers

Technologies & Tools

Python DICOM Processing Orthanc XNAT FAIR Principles HIPAA/GDPR Data Anonymization ETL Pipeline

CT-only Automated Segmentation Using 3D nnU-Net

Collaboration with NIT Surathkal | Submitted to Journal of Imaging Informatics in Medicine

2024

Clinical Problem: Head and neck cancer requires precise tumor delineation for effective radiotherapy planning. Manual segmentation is time-consuming (30-60 minutes per case) and subject to significant inter-observer variability. Existing automated methods rely on multimodal PET/CT imaging, which is costly ($1000+ per scan), less accessible in resource-limited settings, and burdensome for patients.

Our Innovation: We developed a CT-only automated segmentation framework using 3D nnU-Net that eliminates the need for expensive PET scans while maintaining clinical-grade accuracy. This approach offers a robust, cost-effective, and scalable solution that can democratize access to precision radiotherapy planning.

Why CT-only Matters: CT scanners are roughly 10x more common than PET/CT scanners, both in India and globally. By achieving comparable performance with CT alone, we enable advanced radiotherapy planning in community hospitals and resource-limited settings where PET/CT is unavailable.

167 Total Cases | Global Dice (Combined) 0.65 | Median Dice 0.71 | HD95 Boundary 23.6 mm

🎯 Key Achievements

Multi-institutional dataset curation: Assembled and harmonized 167 cases from two institutions (137 MAASTRO public + 30 CMC private), implementing comprehensive de-identification and quality control protocols
Performance on public HN1 dataset: Achieved Global Dice of 0.63, Median Dice of 0.60, demonstrating robust baseline performance on publicly available data
Impact of private data integration: Adding 30 CMC cases (18% of training data) improved Global Dice to 0.65 (+3.99% relative) and Median Dice to 0.71 (+18.56% relative), demonstrating the value of institutional data diversity
Fold-wise analysis insights: Identified that Fold 1 benefited most from additional data (Global Dice: 0.68, IoU: 0.52), while performance varied by fold, revealing sensitivity to case distribution and data heterogeneity
Precision-sensitivity trade-offs: Achieved high precision (0.79-0.81) with moderate sensitivity (0.52-0.59), indicating a conservative segmentation strategy that minimizes false positives, which is critical for clinical safety

🔬 Technical Methodology

Architecture: Implemented 3D nnU-Net with self-configuring preprocessing pipeline, automatically optimizing patch size, batch size, and network topology for head and neck CT data
Training strategy: Three-fold cross-validation with careful patient-level splitting to prevent data leakage, trained on NVIDIA L40 GPU (48GB) using PyTorch and CUDA 12.6
Evaluation metrics: Comprehensive assessment using Global Dice, Median Dice, IoU, Precision, Sensitivity, Specificity, and HD95 for boundary accuracy
Data harmonization: Standardized imaging parameters, voxel spacing, and intensity normalization across public MAASTRO and private CMC datasets
Clinical validation: Qualitative evaluation by radiation oncologists confirmed strong spatial agreement with expert annotations, with minor under-segmentation in diffuse tumor boundaries
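
Patient-level splitting, mentioned in the training strategy above, means every scan from one patient lands in the same fold, so no patient appears in both training and validation. The sketch below shows one simple way to build such folds with the standard library. The function names, the round-robin assignment, and the case and patient IDs are assumptions made for illustration; the split actually used in the study may have been generated differently.

```python
import random
from collections import defaultdict


def patient_level_folds(case_to_patient: dict, n_folds: int = 3, seed: int = 0):
    """Assign whole patients (not individual cases) to folds, round-robin after shuffling."""
    patients = sorted(set(case_to_patient.values()))
    random.Random(seed).shuffle(patients)
    fold_of_patient = {p: i % n_folds for i, p in enumerate(patients)}
    folds = defaultdict(list)
    for case, patient in sorted(case_to_patient.items()):
        folds[fold_of_patient[patient]].append(case)
    return [folds[i] for i in range(n_folds)]


def train_val_for_fold(folds, k):
    """Fold k is held out for validation; the remaining folds form the training set."""
    train = [c for i, fold in enumerate(folds) if i != k for c in fold]
    return train, list(folds[k])


if __name__ == "__main__":
    # Invented IDs: patient p1 has two scans that must stay together.
    cases = {"scan_a": "p1", "scan_b": "p1", "scan_c": "p2", "scan_d": "p3"}
    folds = patient_level_folds(cases, n_folds=3, seed=0)
    for k in range(3):
        train, val = train_val_for_fold(folds, k)
        print(k, "train:", train, "val:", val)
```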

📊 Key Findings & Challenges

Boundary accuracy challenge: HD95 increased from 16 mm to 24 mm with additional data, indicating reduced boundary precision, likely due to inter-institutional annotation style differences
Center-specific variability: Fold-dependent performance suggests sensitivity to institutional imaging protocols and annotation practices, highlighting need for domain adaptation techniques
Conservative segmentation strategy: High specificity (>0.9998) but lower sensitivity indicates model tends to under-segment rather than over-segment, which is clinically safer but may miss small tumor extensions
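
To make the relationship between these overlap metrics concrete, the sketch below computes Dice, IoU, precision, sensitivity, and specificity from voxel counts. It is an illustration only: the counts are invented toy values, and it assumes that "Global Dice" pools voxel counts across all cases while "Median Dice" is the median of per-case scores, which is a common convention but is not confirmed by the text above.

```python
from statistics import median


def _ratio(num, den):
    """Safe division; returns NaN when the denominator is zero."""
    return num / den if den else float("nan")


def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN over two equal-length binary voxel sequences."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    tn = sum(1 for p, t in zip(pred, truth) if not p and not t)
    return tp, fp, fn, tn


def overlap_metrics(tp, fp, fn, tn):
    return {
        "dice": _ratio(2 * tp, 2 * tp + fp + fn),
        "iou": _ratio(tp, tp + fp + fn),
        "precision": _ratio(tp, tp + fp),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
    }


def global_and_median_dice(per_case_counts):
    """Global Dice pools counts over cases (assumed definition); median uses per-case Dice."""
    tp = sum(c[0] for c in per_case_counts)
    fp = sum(c[1] for c in per_case_counts)
    fn = sum(c[2] for c in per_case_counts)
    per_case = [_ratio(2 * c[0], 2 * c[0] + c[1] + c[2]) for c in per_case_counts]
    return _ratio(2 * tp, 2 * tp + fp + fn), median(per_case)


if __name__ == "__main__":
    # Toy volume of 1,000,000 voxels with a 1,000-voxel tumor (invented numbers).
    m = overlap_metrics(tp=550, fp=140, fn=450, tn=998_860)
    for name, value in m.items():
        print(f"{name}: {value:.4f}")
```

Because background voxels vastly outnumber tumor voxels in a 3D scan, true negatives dominate the specificity ratio, so values very close to 1 are expected even when sensitivity is moderate. Precision and sensitivity are therefore the more informative pair for judging under-segmentation.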

🚀 Future Directions

Integrate domain adaptation to reduce center-specific variability
Leverage few-shot learning to address limited annotation challenges
Expand private dataset to 100+ cases for more balanced fold representation
Explore foundation models (MedSAM) and hybrid nnU-Net approaches
Systematic benchmarking of CT-only vs. PET/CT fusion methods

Technologies & Tools

Python 3.9+ nnU-Net V2 PyTorch MONAI 3D CNN NVIDIA L40 GPU CUDA 12.6 DICOM Processing

📄 Paper Under Review (JIIM) 🏆 3rd Prize - Winter Symposium 2025

Large-Scale Imaging Data Curation for Prospective Radiomics Trials

DBT/Wellcome Trust India Alliance Funded | ₹1.35 Crore Grant | PI: Dr. Hannah Mary Thomas T

2020 - 2025

Project Vision: This prospective study aims to answer a fundamental question in precision oncology: Can pre-treatment radiomics signatures accurately identify advanced head and neck cancer patients at higher risk of recurrence and poor survival outcomes? By developing robust image analysis pipelines and predictive models, we seek to enable personalized treatment strategies and optimize limited radiotherapy resources.

Scientific Foundation: Cancer radiomics converts routine radiological images into quantifiable tumor phenotypes. Unlike tissue biopsies that sample only small regions, radiomics captures information from the entire tumor non-invasively from standard clinical scans, providing comprehensive tumor characterization without additional radiation exposure or procedures for patients.

Clinical Significance: Accurately estimating individual patient risk is crucial for understanding tumor biology, stratifying patients based on risk profiles, tailoring personalized treatment strategies, and optimizing use of limited radiotherapy resources in resource-constrained healthcare settings.

~1700 Patients Enrolled | 5-Year Prospective Study | ₹1.35 Cr Funding | Multimodal PET+CT

🎯 My Contributions (2024-2025)

End-to-end radiomics pipeline development: Designed and implemented the complete workflow from DICOM retrieval (Orthanc) → GTV-P segmentation (Citric) → PyRadiomics feature extraction, enabling reproducible model training across the ~1700-patient cohort
Quality assurance at scale: Established systematic protocols for imaging and clinical data validation, ensuring data integrity across 5-year prospective collection period
AWS cloud infrastructure: Implemented automated S3 pipelines for secure backup and disaster recovery of imaging data, ensuring long-term data preservation and accessibility
Data annotation coordination: Helped coordinate workflows between radiation oncologists, medical physicists, and data scientists for consistent GTV-P delineation across large patient cohort
Infrastructure planning: Drafted NVIDIA Academic Grant Proposal justifying need for high-performance GPU infrastructure for in-house deployment of large-scale deep learning models in clinical environment

📊 Project Aims & Progress

Aim 1 - Robust Segmentation Pipeline: Developed automated tumor segmentation workflow for PET and CT imaging, with collaborative validation through NIT Surathkal partnership
Aim 2 - Predictive Modeling: Built machine learning models using LASSO, SelectKBest, PSO, and WOA feature selection across multiple classifiers (Logistic Regression, SVM, Random Forest, etc.)
Aim 3 - Imaging Archive: Successfully collected data from 1550+ HNC patients prospectively, forming valuable resource for validation and future multi-institutional studies

🔬 Technical Infrastructure

DICOM management: Orthanc-based PACS for centralized storage, retrieval, and anonymization of imaging data with RESTful API integration
Segmentation workflow: Citric platform for collaborative GTV-P delineation with version control and quality metrics tracking
Feature extraction: PyRadiomics-based automated extraction of first-order statistics, shape features, and texture features (GLCM, GLRLM, GLSZM)
Cloud infrastructure: AWS S3 with automated backup schedules, encryption at rest, and multi-region replication for disaster recovery
Data governance: Implemented FAIR principles, institutional IRB compliance, and patient consent tracking systems
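
As a rough sketch of the DICOM retrieval step, the snippet below prepares two calls against Orthanc's REST API: a `/tools/find` query for CT series and the `/studies/{id}/archive` URL that downloads a study as a ZIP. The base URL is a placeholder (port 8042 is Orthanc's default), authentication is omitted, and the helper names are invented for this example; none of this is taken from the project's actual configuration or code.

```python
import json
import urllib.request

ORTHANC_URL = "http://localhost:8042"  # placeholder address, not the project's server


def build_find_request(base_url: str, modality: str = "CT") -> urllib.request.Request:
    """Prepare a POST to /tools/find that asks for series of one modality."""
    payload = {"Level": "Series", "Query": {"Modality": modality}}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/tools/find",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def study_archive_url(base_url: str, study_id: str) -> str:
    """URL that returns a ZIP archive of all DICOM files in one study."""
    return base_url.rstrip("/") + "/studies/" + study_id + "/archive"


def find_series_ids(base_url: str, modality: str = "CT"):
    """Run the query; needs a reachable Orthanc server (credentials not handled here)."""
    with urllib.request.urlopen(build_find_request(base_url, modality)) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    req = build_find_request(ORTHANC_URL)
    print(req.get_method(), req.full_url)
    print(study_archive_url(ORTHANC_URL, "example-study-id"))  # hypothetical ID
```

Separating request construction from the network call keeps the query logic checkable without a running server, which is convenient when the PACS is only reachable from inside the hospital network.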

🏆 Academic Contributions

Oral Presentation: "Can CT Radiomics Predict Recurrence in Head and Neck Cancer? Early Results from a Prospective Imaging Trial" at 14th Annual Research Day, CMC Vellore (Oct 2024)
Conference Participation: 2nd Annual Winter Symposium on Health Data and AI - Organizing Team Member (March 2025)
CME Attendance: Revolution and Precision in Radiation Oncology, CMC Vellore (March 2025)

Technologies & Tools

Python Orthanc PACS PyRadiomics AWS S3 DICOM Docker Citric RESTful API ETL Pipeline

Quantitative Finance

Volatility-Based Trading Strategy Development

🎯 Key Contributions