Status: Preprint pending arXiv submission (awaiting NASA approval)

Overview

Machine learning framework for cloud base height retrieval from NASA ER-2 airborne observations, comparing feature-based (gradient boosting) and image-based (CNN) approaches. On this dataset, atmospheric features outperformed raw imagery: gradient boosting reached R² = 0.744 versus R² = 0.320 for the CNN, a relative improvement of roughly 130%.

Developed during my NASA Goddard Space Flight Center OSTEM internship (Summer 2025).

Key Results

Metric                  Value
Gradient Boosting R²    0.744
CNN R²                  0.320
Improvement             130%
Labeled Samples         933
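
The improvement figure can be checked from the two R² values. The snippet below is a small arithmetic check, not part of the project code; `relative_improvement` is an illustrative helper name.

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0

# R² values as reported in the results table
gain = relative_improvement(0.744, 0.320)
print(f"Relative improvement: {gain:.1f}%")
```

The exact ratio is (0.744 - 0.320) / 0.320 = 1.325, which is reported above as a rounded 130%.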

Technical Approach

Data Pipeline

  • HDF5 preprocessing pipeline for NASA ER-2 observations
  • Temporal interpolation and radiometric correction
  • Integration with ERA5 reanalysis atmospheric data
  • 933 labeled samples across multiple flight campaigns
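
As an illustration of the temporal interpolation step, the sketch below aligns a coarse time series with observation timestamps. The source does not state which interpolation scheme was used, so linear interpolation is an assumption here, and the function name, timestamps, and temperature values are all hypothetical.

```python
from bisect import bisect_right

def interp_linear(t, times, values):
    """Linearly interpolate `values` (sampled at sorted `times`) to time `t`."""
    if not times[0] <= t <= times[-1]:
        raise ValueError("t lies outside the sampled time range")
    hi = bisect_right(times, t)
    if hi == len(times):  # t coincides with the last sample time
        return values[-1]
    lo = hi - 1
    w = (t - times[lo]) / (times[hi] - times[lo])
    return (1.0 - w) * values[lo] + w * values[hi]

# Hypothetical hourly grid (seconds since flight start, temperature in K)
grid_times = [0.0, 3600.0, 7200.0]
grid_temps = [280.0, 282.0, 281.0]

# Hypothetical observation timestamps falling between grid times
obs_times = [900.0, 3600.0, 5400.0]
obs_temps = [interp_linear(t, grid_times, grid_temps) for t in obs_times]
print(obs_temps)
```

With real data, the same alignment would typically be done with array libraries such as NumPy or Pandas rather than a per-sample loop.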

Model Comparison

  • Feature-based: XGBoost gradient boosting with atmospheric variables
  • Image-based: Convolutional neural networks on raw imagery
  • Result: Atmospheric features outperform raw images by a wide margin (R² = 0.744 vs. 0.320)
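
To show what the feature-based approach does conceptually, here is a toy gradient boosting regressor that fits depth-1 trees (stumps) to residuals under squared loss. It is a teaching sketch only: the project used XGBoost, which adds regularization, deeper trees, and far more efficient split finding. The two synthetic features, the target, and all function names are made up for illustration.

```python
from statistics import fmean

def fit_stump(X, residuals):
    """Find the single split (feature, threshold) minimizing squared error."""
    mean_r = fmean(residuals)
    best, best_sse = (0, float("inf"), mean_r, mean_r), float("inf")
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lm, rm = fmean(left), fmean(right)
            sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
            if sse < best_sse:
                best_sse, best = sse, (j, thr, lm, rm)
    return best

def predict_stump(stump, row):
    j, thr, left_value, right_value = stump
    return left_value if row[j] <= thr else right_value

def fit_gbm(X, y, n_rounds=100, lr=0.3):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = fmean(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        pred = [pi + lr * predict_stump(stump, row) for pi, row in zip(pred, X)]
    return base, lr, stumps

def predict_gbm(model, row):
    base, lr, stumps = model
    return base + lr * sum(predict_stump(s, row) for s in stumps)

# Synthetic data: a step in feature 0 plus a linear trend in feature 1
X = [[i / 10.0, (i * 7 % 10) / 10.0] for i in range(40)]
y = [(1.0 if x0 > 2.0 else 0.0) + 0.5 * x1 for x0, x1 in X]

model = fit_gbm(X, y)
preds = [predict_gbm(model, row) for row in X]

y_mean = fmean(y)
ss_res = sum((t - p) ** 2 for t, p in zip(y, preds))
ss_tot = sum((t - y_mean) ** 2 for t in y)
train_r2 = 1.0 - ss_res / ss_tot
print(f"Training R² on synthetic data: {train_r2:.3f}")
```

The score printed here is a training-set fit on synthetic data and says nothing about the reported results; it only confirms that the residual-fitting loop behaves as expected.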

Domain Shift Analysis

  • Leave-one-flight-out cross-validation to assess generalization
  • Identified generalization failure (R² = -1.007 on held-out flights, worse than a constant mean prediction)
  • Proposed domain-adversarial solutions for improved robustness
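
The evaluation protocol above can be sketched as follows. This is a minimal illustration with hypothetical flight IDs and toy height values (units assumed to be km), using a constant mean predictor as a stand-in for a trained model. It shows how R² becomes negative when the held-out flight's distribution differs from the training flights.

```python
from statistics import fmean

def r2_score(y_true, y_pred):
    """Coefficient of determination; negative when worse than the mean of y_true."""
    mean_true = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def leave_one_flight_out(flight_ids):
    """Yield (held_out_flight, train_indices, test_indices) for each flight."""
    for held_out in sorted(set(flight_ids)):
        train = [i for i, f in enumerate(flight_ids) if f != held_out]
        test = [i for i, f in enumerate(flight_ids) if f == held_out]
        yield held_out, train, test

# Toy data: each hypothetical flight samples a different height regime
flight_ids = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
heights = [0.9, 1.0, 1.1, 1.9, 2.0, 2.1, 2.9, 3.0, 3.1]

scores = {}
for flight, train_idx, test_idx in leave_one_flight_out(flight_ids):
    # Stand-in "model": predict the mean height of the training flights
    train_mean = fmean(heights[i] for i in train_idx)
    y_true = [heights[i] for i in test_idx]
    y_pred = [train_mean] * len(test_idx)
    scores[flight] = r2_score(y_true, y_pred)

for flight, score in scores.items():
    print(f"held-out flight {flight}: R² = {score:.2f}")
```

Splitting by flight rather than by random sample keeps correlated samples from the same flight out of the training set, which is what exposes this kind of failure.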

Application

Cloud base height is critical for:

  • Aviation safety and flight planning
  • Climate modeling and weather prediction
  • Satellite calibration/validation

This work demonstrates that physics-informed features can outperform deep learning on image data when domain expertise is available.

Technology Stack

  • ML Frameworks: PyTorch, TensorFlow, scikit-learn
  • Gradient Boosting: XGBoost, LightGBM
  • Data Processing: HDF5, NetCDF, Pandas, NumPy
  • Atmospheric Data: ERA5 reanalysis

Affiliation

NASA Goddard Space Flight Center
OSTEM Intern – Atmospheric Remote Sensing
May – August 2025