Highlights

What are the main findings?
- A novel encoder-decoder framework integrating the self-supervised DINOv3 encoder with a hybrid Transformer-LSTM decoder achieves a 9-12% improvement over CNN and Vision Transformer baselines on the BLEU, CIDEr, METEOR, and ROUGE-L metrics for the RSICD and UCM-Captions datasets.
- DINOv3's self-supervised visual representations eliminate the need for domain-specific supervised pretraining while producing semantically rich features that outperform traditional supervised encoders (VGG16, ResNet50) on remote sensing image description tasks.

What are the implications of the main findings?
- Self-supervised vision transformers are a robust alternative to supervised CNN-based encoders for multi-modal remote sensing applications, and they are particularly valuable when annotated training data is scarce or expensive to obtain.
- The proposed LSTM aggregation module between the encoder and decoder effectively captures spatial continuity in structured patterns (roads, rivers, boundaries), demonstrating that lightweight sequential processing enhances caption coherence for aerial imagery analysis.

Abstract

Effective interpretation of coherent and usable information from aerial images (e.g., satellite imagery or high-altitude drone photography) can greatly reduce human effort in many situations, both natural (e.g., earthquakes, forest fires, tsunamis) and man-made (e.g., highway pile-ups, traffic congestion), particularly in disaster management. This research proposes a novel encoder-decoder framework for remote sensing image captioning that integrates self-supervised DINOv3 visual features with a hybrid Transformer-LSTM decoder. Unlike existing approaches that rely on supervised CNN-based encoders (e.g., ResNet, VGG), the proposed method leverages DINOv3's self-supervised learning capabilities to extract dense, semantically rich features from aerial images without requiring domain-specific labeled pretraining. The hybrid decoder combines Transformer layers for global context modeling with LSTM layers for sequential caption generation, producing coherent and context-aware descriptions. Feature extraction is performed with the DINOv3 model, which employs the gram-anchoring technique to stabilize dense feature maps.
Captions are generated by a hybrid of Transformer and Long Short-Term Memory (LSTM) layers, which adds contextual meaning to captions through sequential hidden-state modeling with gated memory. The model is first evaluated on two standard remote sensing image captioning datasets, RSICD and UCM-Captions. Multiple evaluation metrics, including Bilingual Evaluation Understudy (BLEU), Consensus-based Image Description Evaluation (CIDEr), Recall-Oriented Understudy for Gisting Evaluation (ROUGE-L), and Metric for Evaluation of Translation with Explicit ORdering (METEOR), are used to quantify the performance and robustness of the proposed DINOv3 hybrid model. The proposed model outperforms conventional Convolutional Neural Network (CNN)- and Vision Transformer (ViT)-based models by approximately 9-12% on most evaluation metrics. Attention heatmaps are also used to qualitatively validate the model's ability to identify and describe key spatial elements. In addition, the proposed model is evaluated on advanced remote sensing datasets, including RSITMD, DisasterM3, and GeoChat. The results demonstrate that self-supervised vision transformers are robust encoders for multi-modal understanding in remote sensing image analysis and captioning.
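For concreteness, the following is a minimal PyTorch sketch of the pipeline summarized above: frozen DINOv3 patch features, an LSTM aggregation stage, Transformer decoder layers for global context, and an LSTM head for sequential word generation. This is an illustration, not the authors' code: the backbone loading step is omitted (the encoder is assumed to return patch embeddings of shape (B, N, D)), and all module names and layer sizes are assumptions.

```python
# Illustrative sketch of the described architecture (not the authors' code).
# Assumes DINOv3 patch features are precomputed; loading the backbone is omitted.
import torch
import torch.nn as nn

class HybridCaptionDecoder(nn.Module):
    """Transformer layers for global context + LSTM layers for sequential decoding."""

    def __init__(self, feat_dim=768, d_model=512, vocab_size=10000,
                 n_heads=8, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)                      # map DINOv3 features
        self.feat_lstm = nn.LSTM(d_model, d_model, batch_first=True)  # aggregation module
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerDecoder(layer, n_layers)
        self.word_lstm = nn.LSTM(d_model, d_model, batch_first=True)  # gated sequential pass
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, patch_feats, captions):
        # patch_feats: (B, N, feat_dim) DINOv3 patch embeddings; captions: (B, T) ids
        memory, _ = self.feat_lstm(self.proj(patch_feats))  # spatially aggregated memory
        tgt = self.embed(captions)
        mask = nn.Transformer.generate_square_subsequent_mask(captions.size(1))
        ctx = self.transformer(tgt, memory, tgt_mask=mask)  # global cross-attention context
        seq, _ = self.word_lstm(ctx)                        # sequential refinement
        return self.out(seq)                                # (B, T, vocab_size) logits

# Smoke test with random stand-ins for DINOv3 features and caption tokens.
feats = torch.randn(2, 196, 768)                  # 14x14 patch grid, ViT-B width
tokens = torch.randint(0, 10000, (2, 20))
logits = HybridCaptionDecoder()(feats, tokens)    # -> torch.Size([2, 20, 10000])
```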
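The quantitative evaluation can be reproduced with standard captioning scorers. The sketch below uses the pycocoevalcap package as one common choice; the abstract does not name the toolkit actually used, and the captions shown are invented placeholders.

```python
# Hedged example: scoring candidate captions against references with pycocoevalcap.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor  # requires a local Java runtime
from pycocoevalcap.rouge.rouge import Rouge

# Both dicts map image id -> list of caption strings (pre-tokenized, lowercased).
gts = {"img1": ["many buildings are around a green square",
                "buildings surround a small green park"]}
res = {"img1": ["several buildings surround a green square"]}

for name, scorer in [("BLEU", Bleu(4)), ("CIDEr", Cider()),
                     ("METEOR", Meteor()), ("ROUGE-L", Rouge())]:
    score, _ = scorer.compute_score(gts, res)
    print(name, score)  # BLEU yields a list of BLEU-1..BLEU-4 scores
```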
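Finally, the qualitative attention-heatmap inspection can be sketched in a few lines. This is a hypothetical example: `attn` is assumed to hold one generated word's cross-attention weights over the DINOv3 patch grid (a 14x14 grid for a 224x224 image with 16-pixel patches), and the upsampling is a simple nearest-neighbor expansion.

```python
# Hypothetical sketch: overlay per-patch attention weights on the input image.
import numpy as np
import matplotlib.pyplot as plt

def show_attention(image, attn, grid=(14, 14)):
    """image: (H, W, 3) array; attn: (grid[0]*grid[1],) attention weights."""
    heat = attn.reshape(grid)
    # Nearest-neighbor upsampling of the patch grid to image resolution
    # (assumes H and W are divisible by the grid dimensions).
    heat = np.kron(heat, np.ones((image.shape[0] // grid[0],
                                  image.shape[1] // grid[1])))
    plt.imshow(image)
    plt.imshow(heat, cmap="jet", alpha=0.4)  # translucent heatmap overlay
    plt.axis("off")
    plt.show()
```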