🏥

QIRAIL, CMC Vellore

Reproducibility Study: CNN-Based Head and Neck Cancer Prognosis

QIRAIL, CMC Vellore

2024

A critical examination of the reproducibility claims for a published CNN model of head and neck cancer outcomes. The study revealed significant dataset and documentation issues, yet ultimately reproduced the reported results.

🎯 Key Contributions

  • Challenged reproducibility claims of published CNN model by attempting complete replication across three HNC outcomes (distant metastasis, locoregional failure, overall survival)
  • Identified major dataset and documentation issues: incorrectly provided datasets, multiple errors in data files, inadequate result-reporting protocols, and poor documentation, all of which contradicted the paper's reproducibility claims
  • Successfully reproduced results despite paper's flaws by correcting dataset errors, implementing missing preprocessing steps, and establishing proper validation protocols
  • Authors acknowledged reproducibility failures: Communicated our findings to the authors, who recognized the dataset errors and documentation inadequacies in their published work
Technologies & Tools
Python, PyTorch, CNN, DICOM, Data Validation

CHAVI: CompreHensive Digital ArchiVe of Cancer Imaging

India's First National Oncology Imaging Biobank | Led by Tata Memorial Centre & IIT Kharagpur

2024

The National Challenge: Development of AI and machine learning models in oncology requires access to large-scale, high-quality, and standardized imaging datasets. However, region-specific imaging biobanks remain scarce in South Asia, limiting AI research capabilities and model generalizability to diverse patient populations.

CHAVI's Mission: Creating India's first centralized repository of de-identified oncology imaging data to democratize medical data for research, foster reproducibility and collaboration, and enable AI-driven innovations in early cancer detection, prognosis prediction, and personalized treatment strategies.

Our Contribution: Designed and implemented a comprehensive data preparation and integration pipeline for uploading Head and Neck Cancer imaging and clinical data to CHAVI, ensuring compliance with HIPAA and GDPR while preserving data utility for AI research.

HNC Cases Curated: 304+
FAIR Compliant: 100%
Privacy Violations: Zero
Multi-Institutional

🎯 Key Achievements

  • Curated 304+ anonymized HNC cases with comprehensive validated clinical and imaging metadata, establishing CMC Vellore as a major contributor to India's national cancer imaging repository
  • Automated FAIR-compliant pipeline: Built end-to-end data processing system ensuring datasets are Findable, Accessible, Interoperable, and Reusable for multi-institutional research collaboration
  • Rigorous de-identification protocols: Implemented comprehensive PHI/PII removal from both DICOM headers and clinical reports while maintaining data linkage integrity through anonymized identifiers
  • Quality control framework: Established systematic validation protocols detecting formatting errors, completing missing fields through logical inference, and standardizing data formats across institutional sources
  • Ethical compliance documentation: Maintained complete audit trails ensuring adherence to HIPAA and GDPR throughout data handling lifecycle
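
The quality-control checks listed above (detecting formatting errors, flagging missing fields, standardizing formats) can be sketched roughly as follows. This is a minimal illustration, not the project's pipeline: the field names (`patient_key`, `diagnosis_date`, `t_stage`) and the accepted date layouts are assumptions, and CHAVI's actual schema is not reproduced here.

```python
from datetime import datetime

# Hypothetical required fields and accepted date layouts (assumptions for illustration).
REQUIRED_FIELDS = ("patient_key", "diagnosis_date", "t_stage")
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y")


def standardize_date(value):
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if no known layout matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def validate_record(record):
    """Return (cleaned_record, issues) for one clinical record given as a dict."""
    cleaned = dict(record)
    issues = []
    for field in REQUIRED_FIELDS:
        value = cleaned.get(field)
        if value is None or not str(value).strip():
            issues.append(f"missing field: {field}")
    raw_date = cleaned.get("diagnosis_date")
    if raw_date:
        iso = standardize_date(raw_date)
        if iso is None:
            issues.append(f"unrecognized date format: {raw_date!r}")
        else:
            cleaned["diagnosis_date"] = iso
    return cleaned, issues


record = {"patient_key": "HNC-0001", "diagnosis_date": "05/03/2021", "t_stage": ""}
cleaned, issues = validate_record(record)
print(cleaned["diagnosis_date"])  # 2021-03-05
print(issues)                     # ['missing field: t_stage']
```

Returning the issues alongside the cleaned record, rather than raising on the first problem, lets a curator review every flagged case in one pass.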

🔬 Technical Implementation

  • Data preprocessing pipeline: Automated detection and correction of formatting errors, completion of missing fields, and standardization according to CHAVI's structured requirements
  • De-identification workflow: Multi-stage process removing patient names, hospital identifiers, and PII from imaging metadata (DICOM headers) and masking free-text fields in clinical reports
  • Data harmonization: Mapped clinical data into standardized formats ensuring interoperability across research platforms, with consistent imaging parameters and metadata fields
  • Anonymized linking: Assigned unique non-identifiable keys enabling multimodal research (combining clinical, imaging, and outcome data) without compromising patient privacy
  • AI-readiness optimization: Structured datasets specifically for machine learning applications, ensuring compatibility with common deep learning frameworks
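
The de-identification and anonymized-linking steps above might look like the following sketch. It works on plain dictionaries standing in for DICOM headers and clinical rows. The list of identifying fields, the `ANON-` prefix, and the keyed-hash scheme are illustrative assumptions, not the workflow actually used; a production system would read real DICOM files and follow a complete confidentiality profile.

```python
import hashlib
import hmac

# Hypothetical set of direct identifiers to drop (an assumption for illustration;
# a real workflow would cover many more attributes and free-text fields).
IDENTIFYING_FIELDS = (
    "PatientName",
    "PatientBirthDate",
    "InstitutionName",
    "ReferringPhysicianName",
)


def pseudonymous_key(patient_id, secret):
    """Derive a stable key from the hospital ID using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return "ANON-" + digest[:12]


def deidentify(record, secret):
    """Return a copy of the record without identifiers and with a pseudonymous PatientID."""
    cleaned = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
    cleaned["PatientID"] = pseudonymous_key(record["PatientID"], secret)
    return cleaned


# The secret must be stored separately from the released dataset.
secret = b"example-secret-kept-outside-the-dataset"

ct_header = {"PatientID": "MRN123", "PatientName": "DOE^JANE", "Modality": "CT"}
clinical_row = {"PatientID": "MRN123", "InstitutionName": "Example Hospital", "Stage": "III"}

ct_clean = deidentify(ct_header, secret)
clinical_clean = deidentify(clinical_row, secret)

# Both modalities carry the same anonymized key, so they can still be joined.
print(ct_clean["PatientID"] == clinical_clean["PatientID"])  # True
```

Because the key is derived with a secret rather than a plain hash, someone who knows a hospital ID cannot recompute the anonymized key without also holding the secret.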

🌐 Broader Impact

  • Democratizing oncology research: Contributing to public repository enables researchers nationwide to access curated datasets without institutional barriers
  • Addressing regional disparities: CHAVI fills critical gap in South Asian imaging biobanks, enabling development of AI models trained on locally relevant patient populations
  • Fostering collaboration: Standardized data formats facilitate multi-institutional validation studies and reproducible research across cancer centers
  • Enabling precision oncology: Large-scale datasets support development of personalized treatment strategies based on imaging biomarkers
Technologies & Tools
Python, DICOM Processing, Orthanc, XNAT, FAIR Principles, HIPAA/GDPR, Data Anonymization, ETL Pipeline

CT-only Automated Segmentation Using 3D nnU-Net

Collaboration with NIT Surathkal | Submitted to Journal of Imaging Informatics in Medicine

2024

Clinical Problem: Head and neck cancer requires precise tumor delineation for effective radiotherapy planning. Manual segmentation is time-consuming (30-60 minutes per case) and subject to significant inter-observer variability. Existing automated methods rely on multimodal PET/CT imaging, which is costly ($1000+ per scan), less accessible in resource-limited settings, and burdensome for patients.

Our Innovation: We developed a CT-only automated segmentation framework using 3D nnU-Net that eliminates the need for expensive PET scans while maintaining clinical-grade accuracy. This approach offers a robust, cost-effective, and scalable solution that can democratize access to precision radiotherapy planning.

Why CT-only Matters: CT scanners are 10x more common than PET/CT in India and globally. By achieving comparable performance with CT alone, we enable advanced radiotherapy planning in community hospitals and resource-limited settings where PET/CT is unavailable.

Total Cases: 167
Global Dice (Combined): 0.65
Median Dice: 0.71
HD95 Boundary: 23.6 mm

🎯 Key Achievements

  • Multi-institutional dataset curation: Assembled and harmonized 167 cases from two institutions (137 MAASTRO public + 30 CMC private), implementing comprehensive de-identification and quality control protocols
  • Performance on public HN1 dataset: Achieved Global Dice of 0.63, Median Dice of 0.60, demonstrating robust baseline performance on publicly available data
  • Impact of private data integration: Adding 30 CMC cases (18% of training data) improved Global Dice to 0.65 (+3.99%) and Median Dice to 0.71 (+18.56%), demonstrating value of institutional data diversity
  • Fold-wise analysis insights: Identified that Fold 1 benefited most from additional data (Global Dice: 0.68, IoU: 0.52), while performance varied by fold, revealing sensitivity to case distribution and data heterogeneity
  • Precision-sensitivity trade-offs: Achieved high precision (0.79-0.81) with moderate sensitivity (0.52-0.59), indicating a conservative segmentation strategy that minimizes false positives, which is critical for clinical safety
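
To make the overlap metrics above concrete, here is a small sketch of how Global Dice, Median Dice, IoU, precision, and sensitivity relate to each other. It assumes that "Global Dice" pools voxel counts across all cases and that "Median Dice" is the median of per-case scores; the source does not spell out these definitions. The masks are toy values, not study data.

```python
from statistics import median


def confusion_counts(pred, truth):
    """Count true positives, false positives, and false negatives for flat binary masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    return tp, fp, fn


def dice(tp, fp, fn):
    """Dice coefficient; defined as 1.0 when both masks are empty."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def summarize(cases):
    """cases: list of (prediction, ground_truth) pairs of flat 0/1 masks."""
    counts = [confusion_counts(p, t) for p, t in cases]
    per_case = [dice(*c) for c in counts]
    tp, fp, fn = (sum(c[i] for c in counts) for i in range(3))
    return {
        "global_dice": dice(tp, fp, fn),   # pooled over all voxels (assumed definition)
        "median_dice": median(per_case),   # median of per-case scores
        "iou": tp / (tp + fp + fn) if tp + fp + fn else 1.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
    }


# Toy masks in which the prediction under-segments two of three cases.
cases = [
    ([1, 1, 0, 0], [1, 1, 1, 0]),
    ([1, 0, 0, 0], [1, 1, 1, 1]),
    ([0, 1, 1, 0], [0, 1, 1, 0]),
]
for name, value in summarize(cases).items():
    print(name, round(value, 3))
# global_dice 0.714, median_dice 0.8, iou 0.556, precision 1.0, sensitivity 0.556
```

The toy output shows the same pattern as the reported results: under-segmentation keeps precision high while pulling sensitivity down, and the pooled and median Dice can differ noticeably when case sizes vary.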

🔬 Technical Methodology

  • Architecture: Implemented 3D nnU-Net with self-configuring preprocessing pipeline, automatically optimizing patch size, batch size, and network topology for head and neck CT data
  • Training strategy: Three-fold cross-validation with careful patient-level splitting to prevent data leakage, trained on NVIDIA L40 GPU (48GB) using PyTorch and CUDA 12.6
  • Evaluation metrics: Comprehensive assessment using Global Dice, Median Dice, IoU, Precision, Sensitivity, Specificity, and HD95 for boundary accuracy
  • Data harmonization: Standardized imaging parameters, voxel spacing, and intensity normalization across public MAASTRO and private CMC datasets
  • Clinical validation: Qualitative evaluation by radiation oncologists confirmed strong spatial agreement with expert annotations, with minor under-segmentation in diffuse tumor boundaries
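
Patient-level splitting, as mentioned under the training strategy, means that all scans from one patient land in the same fold. The sketch below shows one simple way to do that. The function name, the scan-to-patient mapping, and the round-robin assignment are assumptions for illustration, not the split procedure actually used.

```python
import random
from collections import defaultdict


def patient_level_folds(scan_to_patient, n_folds=3, seed=0):
    """Assign whole patients to folds so that no patient appears in two folds."""
    patients = sorted(set(scan_to_patient.values()))
    random.Random(seed).shuffle(patients)  # seeded for reproducible splits
    fold_of_patient = {p: i % n_folds for i, p in enumerate(patients)}
    folds = defaultdict(list)
    for scan, patient in sorted(scan_to_patient.items()):
        folds[fold_of_patient[patient]].append(scan)
    return [folds[i] for i in range(n_folds)]


# Hypothetical scan identifiers; P01 and P04 each have two scans.
scans = {
    "P01_ct1": "P01", "P01_ct2": "P01",
    "P02_ct1": "P02",
    "P03_ct1": "P03",
    "P04_ct1": "P04", "P04_ct2": "P04",
}
folds = patient_level_folds(scans)
for i, fold in enumerate(folds):
    print(f"fold {i}: {fold}")
```

Splitting by scan instead of by patient would let near-identical anatomy appear in both training and validation sets, which inflates validation scores.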

📊 Key Findings & Challenges

  • Boundary accuracy challenge: HD95 increased from 16 mm to 24 mm with the additional data, indicating reduced boundary precision, likely due to inter-institutional differences in annotation style
  • Center-specific variability: Fold-dependent performance suggests sensitivity to institutional imaging protocols and annotation practices, highlighting need for domain adaptation techniques
  • Conservative segmentation strategy: High specificity (>0.9998) but lower sensitivity indicates model tends to under-segment rather than over-segment, which is clinically safer but may miss small tumor extensions
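
For readers unfamiliar with HD95, the following sketch computes a 95th-percentile Hausdorff distance between two small point sets. Several variants of HD95 exist; this one takes the larger of the two directed 95th percentiles and uses linear-interpolation percentiles, both of which are assumptions here rather than the evaluation code used in the study. Points are assumed to be surface coordinates in millimetres.

```python
import math


def percentile(values, q):
    """Linear-interpolation percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def directed_distances(src, dst):
    """For each point in src, the distance to its nearest point in dst."""
    return [min(math.dist(a, b) for b in dst) for a in src]


def hd95(surface_a, surface_b):
    """95th-percentile Hausdorff distance between two point sets (input units)."""
    forward = percentile(directed_distances(surface_a, surface_b), 95)
    backward = percentile(directed_distances(surface_b, surface_a), 95)
    return max(forward, backward)


# Toy surfaces: the reference has one outlying point the prediction misses.
predicted = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
reference = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (10, 0, 0)]
print(round(hd95(predicted, reference), 2))
```

Here the plain (maximum) Hausdorff distance would be 8 mm, driven entirely by the single missed point, while the percentile version is somewhat lower. That damping of outliers is why HD95 is usually preferred for reporting boundary error.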

🚀 Future Directions

  • Integrate domain adaptation to reduce center-specific variability
  • Leverage few-shot learning to address limited annotation challenges
  • Expand private dataset to 100+ cases for more balanced fold representation
  • Explore foundation models (MedSAM) and hybrid nnU-Net approaches
  • Systematic benchmarking of CT-only vs. PET/CT fusion methods
Technologies & Tools
Python 3.9+, nnU-Net V2, PyTorch, MONAI, 3D CNN, NVIDIA L40 GPU, CUDA 12.6, DICOM Processing

Large-Scale Imaging Data Curation for Prospective Radiomics Trials

DBT/Wellcome Trust India Alliance Funded | ₹1.35 Crore Grant | PI: Dr. Hannah Mary Thomas T

2020 - 2025

Project Vision: This prospective study aims to answer a fundamental question in precision oncology: Can pre-treatment radiomics signatures accurately identify advanced head and neck cancer patients at higher risk of recurrence and poor survival outcomes? By developing robust image analysis pipelines and predictive models, we seek to enable personalized treatment strategies and optimize limited radiotherapy resources.

Scientific Foundation: Cancer radiomics converts routine radiological images into quantifiable tumor phenotypes. Unlike tissue biopsies that sample only small regions, radiomics captures information from the entire tumor non-invasively from standard clinical scans, providing comprehensive tumor characterization without additional radiation exposure or procedures for patients.

Clinical Significance: Accurately estimating individual patient risk is crucial for understanding tumor biology, stratifying patients based on risk profiles, tailoring personalized treatment strategies, and optimizing use of limited radiotherapy resources in resource-constrained healthcare settings.

Patients Enrolled: ~1700
Prospective Study: 5 Years
Funding: ₹1.35 Crore
Multimodal: PET+CT

🎯 My Contributions (2024-2025)

  • End-to-end radiomics pipeline development: Designed and implemented complete workflow from DICOM retrieval (Orthanc) → GTV-P segmentation (Citric) → PyRadiomics feature extraction, enabling reproducible model training across ~1700 patient cohort
  • Quality assurance at scale: Established systematic protocols for imaging and clinical data validation, ensuring data integrity across 5-year prospective collection period
  • AWS cloud infrastructure: Implemented automated S3 pipelines for secure backup and disaster recovery of imaging data, ensuring long-term data preservation and accessibility
  • Data annotation coordination: Helped coordinate workflows between radiation oncologists, medical physicists, and data scientists for consistent GTV-P delineation across large patient cohort
  • Infrastructure planning: Drafted NVIDIA Academic Grant Proposal justifying need for high-performance GPU infrastructure for in-house deployment of large-scale deep learning models in clinical environment

📊 Project Aims & Progress

  • Aim 1 - Robust Segmentation Pipeline: Developed automated tumor segmentation workflow for PET and CT imaging, with collaborative validation through NIT Surathkal partnership
  • Aim 2 - Predictive Modeling: Built machine learning models using LASSO, SelectKBest, PSO, and WOA feature selection across multiple classifiers (Logistic Regression, SVM, Random Forest, etc.)
  • Aim 3 - Imaging Archive: Successfully collected data from 1550+ HNC patients prospectively, forming valuable resource for validation and future multi-institutional studies
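
As a rough illustration of the univariate filtering idea behind SelectKBest in Aim 2, the sketch below ranks features by the absolute Pearson correlation with a binary outcome and keeps the top k. It is a pure-Python stand-in, not scikit-learn's implementation and not the study's code. The feature names and values are made-up toy numbers.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0


def select_k_best(features, outcome, k):
    """features: dict of name -> per-patient values; outcome: list of 0/1 labels."""
    scores = {name: abs(pearson(vals, outcome)) for name, vals in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Toy data for six hypothetical patients (not study data).
features = {
    "shape_volume": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "glcm_contrast": [0.9, 0.1, 0.8, 0.2, 0.7, 0.3],
    "firstorder_mean": [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],
}
outcome = [0, 0, 0, 1, 1, 1]

selected, scores = select_k_best(features, outcome, k=2)
print(selected)  # ['shape_volume', 'glcm_contrast']
```

A filter like this scores each feature on its own, so it cannot detect redundancy between correlated radiomic features. Embedded methods such as LASSO or wrapper searches such as PSO and WOA address that at higher computational cost.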

🔬 Technical Infrastructure

  • DICOM management: Orthanc-based PACS for centralized storage, retrieval, and anonymization of imaging data with RESTful API integration
  • Segmentation workflow: Citric platform for collaborative GTV-P delineation with version control and quality metrics tracking
  • Feature extraction: PyRadiomics-based automated extraction of first-order statistics, shape features, and texture features (GLCM, GLRLM, GLSZM)
  • Cloud infrastructure: AWS S3 with automated backup schedules, encryption at rest, and multi-region replication for disaster recovery
  • Data governance: Implemented FAIR principles, institutional IRB compliance, and patient consent tracking systems
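
To show what a texture feature such as GLCM contrast actually measures, here is a small pure-Python sketch that builds a gray-level co-occurrence matrix for a 2D image and computes its contrast. It illustrates the concept only and is not PyRadiomics code; the single horizontal offset, the symmetric counting, and the toy images are assumptions for the example.

```python
def glcm(image, levels, offset=(0, 1)):
    """Symmetric, normalized co-occurrence matrix for a 2D image of integer gray levels.

    Assumes the image contains at least one valid pixel pair for the given offset.
    """
    dr, dc = offset
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1
                counts[j][i] += 1  # count each pair in both directions
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]


def contrast(p):
    """GLCM contrast: sum over (i, j) of (i - j)^2 * p(i, j)."""
    return sum((i - j) ** 2 * v for i, row in enumerate(p) for j, v in enumerate(row))


# Two toy images with the same gray-level histogram but different texture.
smooth = [[0, 0, 1, 1],
          [0, 0, 1, 1]]
checker = [[0, 1, 0, 1],
           [1, 0, 1, 0]]

print(round(contrast(glcm(smooth, levels=2)), 3))   # 0.333
print(round(contrast(glcm(checker, levels=2)), 3))  # 1.0
```

Both images have identical first-order statistics (four pixels of each gray level), yet their contrast differs, which is the kind of spatial information texture features add beyond intensity histograms.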

🏆 Academic Contributions

  • Oral Presentation: "Can CT Radiomics Predict Recurrence in Head and Neck Cancer? Early Results from a Prospective Imaging Trial" at 14th Annual Research Day, CMC Vellore (Oct 2024)
  • Conference Participation: 2nd Annual Winter Symposium on Health Data and AI - Organizing Team Member (March 2025)
  • CME Attendance: Revolution and Precision in Radiation Oncology, CMC Vellore (March 2025)
Technologies & Tools
Python, Orthanc PACS, PyRadiomics, AWS S3, DICOM, Docker, Citric, RESTful API, ETL Pipeline
📊

Quantitative Finance

Volatility-Based Trading Strategy Development

STARlab Capital, Lucknow

Dec 2023 - Jun 2024

Designed, backtested, and deployed sophisticated volatility-based trading strategies using proprietary tools, demonstrating transferable skills in quantitative analysis and model optimization.

ROI Improvement: 52.38%

🎯 Key Contributions

  • Designed, backtested, and deployed volatility-based strategies (Nebula, ARUT, A2) using OptionNet Explorer, Mesosim, OptiTrade, and OptiBot
  • Enhanced the ARUT strategy, increasing ROI by 52.38% through scenario-driven optimization and real-time feedback
  • Refined internal platforms: improved trade logs, added dynamic filters, and led contributions to OptiTrade's open-source GitHub repository
Technologies & Tools
Python, OptionNet Explorer, Backtesting, Strategy Optimization, Risk Management