Cross-Domain Dilemma in Photovoltaic Defect Detection: Jiangxing Intelligence Delivers a Solution via Cross-Modal Alignment Technology

As photovoltaic power plants grow larger in scale, inspection challenges have become increasingly prominent.

A practical pain point emerges: when a drone moves from Power Station A to Power Station B, changes in terrain, lighting conditions and camera parameters can make a defect detection model that was highly accurate at the first site unreliable at the second.

What does this mean? Operators have to recollect data, relabel samples and retrain the model every time they deploy the system at a new power station. Repetitive workloads, excessive costs and lengthy waiting periods are the prevailing bottlenecks facing photovoltaic inspection across the industry today.

Rather than simply expanding datasets or scaling up large models, Jiangxing Intelligence addresses this challenge by grounding the model in the physical characteristics of defects. Using cross-modal alignment technology, we enable AI to capture the semantic correlations between different data modalities.

Our latest research achievement, DD-LIVM, has been accepted by ACM MobiCom 2025, the top international conference in mobile computing. Tested on 7,078 pairs of infrared and visible-light images collected from 9 real-world scenarios across 4 cities, our model achieves an average detection accuracy of 87.7%, outperforming the previous state-of-the-art method by 17.3 percentage points.

Large Models Are No Panacea — Alignment Holds the Key

Over the past two years, large models have been widely adopted across industries. Nevertheless, a critical problem remains unsolved in industrial scenarios: model performance degrades drastically when the underlying data distribution shifts.

Photovoltaic defect detection relies on drones capturing both infrared images (to identify hotspots) and visible-light images (to detect surface abnormalities). However, severe semantic misalignment exists between these two modalities:

Infrared images capture all defect-induced hotspots but can only sort them by shape into four categories (point, stripe, block and patch); they cannot identify the specific defect type.
Visible-light images clearly reveal external defects such as weeds, dust and panel breakages, yet internal faults including open circuits and diode failures are completely invisible under visible light.

Directly concatenating features from the two modalities for model training often leads to severe feature mismatch. During training, the model tends to memorise background specifics such as shadow orientations at a particular power station, and may even learn spurious correlations, for instance predicting faults from fixed pixel positions in the image rather than from genuine physical defect characteristics. Once transferred to a new site, such a model loses its reliability.

Jiangxing Intelligence’s R&D team has thoroughly investigated this issue within our three-tier JX-Phi Universe Physical AI architecture. As defined in JX-Phi Brain (the model layer):

S-VLM (Spatial Vision-Language Model) delivers perception and environmental comprehension, enabling machines to interpret 3D spatial relationships, equipment connections and industrial business semantics.
LT-VLA (Long-Task Vision-Language-Action Model) connects perception with execution, empowering robots to decompose and carry out complex industrial workflows.

DD-LIVM represents a pivotal breakthrough of S-VLM tailored for photovoltaic inspection, with its core design centred on achieving genuine semantic alignment between sensor data from two distinct modalities.

Cross-Modal Alignment: A Three-Step Strategy for Modular Specialisation & Feature Fusion

Our DD-LIVM framework adopts a three-step Defect-Targeted Fine-Tuning (DTFT) strategy: each encoder is first trained separately on its own specialised task, and the two are then optimised jointly.

Step 1: Train the Infrared Encoder for Precise Defect Localisation

Infrared imagery excels at hotspot positioning despite its relatively low resolution. We freeze the visible-light encoder and train only the infrared encoder to learn the shape features of hotspots via contrastive learning. This enlarges the feature distance between dissimilar hotspot shapes (e.g., point-shaped dust vs. stripe-shaped weed coverage) while narrowing the gap between hotspots with similar geometries, allowing the infrared encoder to accurately pinpoint potential defect regions.
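
The pull-together, push-apart behaviour described here can be illustrated with a small sketch. The article does not state which contrastive loss DD-LIVM uses, so the code below assumes a generic supervised contrastive (InfoNCE-style) loss; the function name `shape_contrastive_loss`, the temperature of 0.1 and the toy 2-D embeddings are all hypothetical.

```python
import math

def _normalise(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def shape_contrastive_loss(embeddings, shape_labels, temperature=0.1):
    """Generic supervised contrastive loss (an assumption, not the paper's exact loss).

    Embeddings that share a hotspot-shape label are pulled together;
    embeddings with different labels are pushed apart.
    """
    z = [_normalise(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n)
                     if j != i and shape_labels[j] == shape_labels[i]]
        if not positives:
            continue  # an anchor with no same-shape partner contributes nothing
        # Cosine similarity of anchor i to every other sample, scaled by temperature.
        sims = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                for j in range(n) if j != i}
        max_sim = max(sims.values())  # subtracted for numerical stability
        log_denom = max_sim + math.log(sum(math.exp(s - max_sim) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

# Toy embeddings: two "point" hotspots and two "stripe" hotspots.
labels = ["point", "point", "stripe", "stripe"]
well_separated = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
mixed_up = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

loss_good = shape_contrastive_loss(well_separated, labels)
loss_bad = shape_contrastive_loss(mixed_up, labels)
```

In this toy setting the loss is lower when same-shape embeddings sit close together (`loss_good`) than when shapes are interleaved (`loss_bad`), which is the signal an encoder would be trained to minimise.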

Step 2: Train the Visible-Light Encoder for Defect Type Recognition

While infrared sensors cannot distinguish specific fault categories, visible-light cameras capture the detailed appearance of external defects including weeds, dust and panel damage. We freeze the infrared encoder and train the visible-light encoder to extract surface defect features. For internal faults that cannot be seen in the visible spectrum, contrastive learning clusters these samples into a single category and separates their feature distribution from that of external defects, enabling the encoder to recognise distinct defect appearances.
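
One way to picture the labelling this step implies: external defects keep their own class, while every internal fault collapses into one shared class. The sketch below is only an illustration; the defect names come from the text above, but the constants and the function name `visible_light_label` are hypothetical and do not reflect the actual DD-LIVM label set.

```python
# Defect names are taken from the article; the grouping constants and the
# function name are hypothetical, chosen only to illustrate the labelling idea.
EXTERNAL_DEFECTS = {"weeds", "dust", "panel_breakage"}
INTERNAL_FAULTS = {"open_circuit", "diode_failure"}
INTERNAL_CLASS = "internal_fault"

def visible_light_label(defect_type: str) -> str:
    """Return the class used when training the visible-light encoder."""
    if defect_type in EXTERNAL_DEFECTS:
        return defect_type        # visible on the panel surface: keep its own class
    if defect_type in INTERNAL_FAULTS:
        return INTERNAL_CLASS     # not visible: merged into a single cluster
    raise ValueError(f"unknown defect type: {defect_type}")

samples = ["dust", "open_circuit", "weeds", "diode_failure"]
labels = [visible_light_label(s) for s in samples]
```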

Step 3: Dual-Modality Fusion for Accurate Defect Detection & Classification

Following the two preceding training phases, we jointly fine-tune both encoders for full-range defect detection and classification. The infrared encoder identifies defect locations, while the visible-light encoder characterises surface fault patterns. Their fusion delivers precise localisation and fine-grained classification while effectively mitigating overfitting risks.
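
The freeze-then-unfreeze order of the three steps can be written down as a simple schedule. This is a minimal sketch assuming nothing beyond what the steps above state; the `Stage` class, the stage names and `trainable_encoders` are hypothetical. In a framework such as PyTorch, the two flags would typically correspond to toggling `requires_grad` on each encoder's parameters.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    train_infrared: bool   # False means the infrared encoder is frozen
    train_visible: bool    # False means the visible-light encoder is frozen

# Hypothetical stage names; the freeze pattern follows Steps 1 to 3 in the text.
DTFT_SCHEDULE = [
    Stage("step1_infrared_localisation", train_infrared=True, train_visible=False),
    Stage("step2_visible_recognition", train_infrared=False, train_visible=True),
    Stage("step3_joint_fine_tuning", train_infrared=True, train_visible=True),
]

def trainable_encoders(stage: Stage) -> list[str]:
    """List the encoders whose weights are updated in the given stage."""
    names = []
    if stage.train_infrared:
        names.append("infrared")
    if stage.train_visible:
        names.append("visible")
    return names

plan = [(s.name, trainable_encoders(s)) for s in DTFT_SCHEDULE]
```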

This methodology inherits the mixture-of-experts residual strategy proposed in our earlier DyGRO-VLA research. By assigning dedicated responsibilities to individual modules to avoid feature conflicts, we significantly enhance generalisation across multi-task and cross-scenario deployments. DyGRO-VLA achieved a 97.1% success rate on the LIBERO benchmark and resolved catastrophic forgetting in VLA models; DD-LIVM reaches 87.7% cross-domain photovoltaic detection accuracy and eliminates model performance degradation across deployment sites. Both innovations empower AI systems with superior stability and reliability for real industrial applications.

Beyond Experimental Results: An Engineering-Ready Deployable System

In addition to the three-step fine-tuning strategy, DD-LIVM incorporates two core engineering designs to translate academic research into field-deployable technology.

1. Generic Spatial Alignment Algorithm

In practice, positional offsets always exist between infrared and visible-light cameras, and such deviations vary across drone devices and mounting heights. Leveraging the consistent physical width of photovoltaic panels in both image modalities, our algorithm automatically calculates scaling and offset parameters via background removal, contour extraction and template matching, achieving pixel-level cross-modal alignment without prior scene knowledge.
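
The core idea, using the panel's width as a shared ruler between the two images, can be sketched in a few lines. This is a deliberately simplified illustration, not the DD-LIVM algorithm: it assumes background removal has already produced a binary mask containing a single panel in each modality, and it replaces contour extraction and template matching with a plain bounding box. All function names are hypothetical.

```python
def panel_bbox(mask):
    """Bounding box (left, top, right, bottom) of non-zero pixels, inclusive."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return cols[0], rows[0], cols[-1], rows[-1]

def estimate_alignment(ir_mask, vis_mask):
    """Estimate (scale, dx, dy) so that vis = scale * ir + offset.

    The scale is the ratio of panel widths in pixels; the offset is chosen
    so that the top-left corners of the two bounding boxes coincide.
    """
    ir_l, ir_t, ir_r, _ = panel_bbox(ir_mask)
    vis_l, vis_t, vis_r, _ = panel_bbox(vis_mask)
    scale = (vis_r - vis_l + 1) / (ir_r - ir_l + 1)
    dx = vis_l - scale * ir_l
    dy = vis_t - scale * ir_t
    return scale, dx, dy

def make_mask(height, width, left, top, panel_w, panel_h):
    """Synthetic binary mask with one rectangular panel (test data only)."""
    return [[1 if left <= c < left + panel_w and top <= r < top + panel_h else 0
             for c in range(width)] for r in range(height)]

# The same panel seen at 3 px wide in infrared and 6 px wide in visible light.
ir_mask = make_mask(8, 10, left=2, top=1, panel_w=3, panel_h=2)
vis_mask = make_mask(16, 20, left=5, top=4, panel_w=6, panel_h=4)
scale, dx, dy = estimate_alignment(ir_mask, vis_mask)
```

With these synthetic masks, an infrared pixel at column 2 maps to visible-light column `scale * 2 + dx = 5`, the left edge of the panel in the visible image.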

2. Hotspot-Driven Intelligent Data Augmentation

Small hotspots caused by dust or stains are highly susceptible to environmental interference. We simulate heat diffusion using the Laplacian operator to generate diversified hotspot variants under varying solar irradiance and wind speeds. Exposing the model to these varied conditions during training gives it strong environmental adaptability.
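
A rough sketch of Laplacian-based heat diffusion on a thermal patch is given below. It uses a standard explicit finite-difference step with the 5-point discrete Laplacian. How DD-LIVM maps solar irradiance and wind speed onto diffusion parameters is not described in this article, so `rate` and `steps` are hypothetical stand-ins for those conditions, and the wrap-around border is a simplification.

```python
def laplacian(grid):
    """5-point discrete Laplacian with wrap-around (periodic) borders."""
    h, w = len(grid), len(grid[0])
    return [[grid[(r - 1) % h][c] + grid[(r + 1) % h][c]
             + grid[r][(c - 1) % w] + grid[r][(c + 1) % w]
             - 4.0 * grid[r][c]
             for c in range(w)] for r in range(h)]

def diffuse_hotspot(grid, rate=0.2, steps=3):
    """Apply explicit heat-equation steps: T <- T + rate * laplacian(T)."""
    if not 0.0 < rate <= 0.25:
        raise ValueError("rate must be in (0, 0.25] for a stable explicit scheme")
    for _ in range(steps):
        lap = laplacian(grid)
        # Build a new grid each step so the caller's input is never modified.
        grid = [[t + rate * l for t, l in zip(row, lap_row)]
                for row, lap_row in zip(grid, lap)]
    return grid

# A single hot pixel on a 7x7 patch (temperature excess, arbitrary units).
patch = [[0.0] * 7 for _ in range(7)]
patch[3][3] = 10.0

# Different diffusion rates yield differently blurred variants of one hotspot.
variants = {rate: diffuse_hotspot(patch, rate=rate, steps=3)
            for rate in (0.05, 0.1, 0.2)}
peaks = {rate: max(max(row) for row in grid) for rate, grid in variants.items()}
```

Each variant keeps the same total heat but spreads it over a wider area, so a larger `rate` gives a flatter, more diffuse hotspot from the same original sample.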

DD-LIVM achieves over 80% detection accuracy across all 9 test scenarios, with an average accuracy of 87.7%. Most notably, cross-domain testing between vastly different sites such as plain ground-mounted power stations and rooftop facilities yields an accuracy improvement ranging from 14% to 23%. This verifies that the model learns the intrinsic physical characteristics of defects rather than superficial scene features.

Industrial Value: Shifting from Customised Training to Plug-and-Play Deployment

A photovoltaic asset operator typically manages dozens of geographically dispersed power stations located on plains, in gobi deserts and on building rooftops. Customised data collection, labelling and model retraining for each new site impose prohibitive operational costs.

DD-LIVM enables a "train once, deploy anywhere" paradigm. After initial model training, no extra labelling is required for newly commissioned power plants; the system operates autonomously following a single drone inspection mission. This is a core capability of the S-VLM module within Jiangxing Intelligence’s three-tier architecture, resolving semantic ambiguity and spatial misalignment among multi-sensor data in the physical world.

Our DyGRO-VLA previously addressed catastrophic forgetting in robotic multi-task learning, whereas DD-LIVM overcomes cross-scenario performance decay. Together, these two technologies underpin the industrial deployment of Jiangxing’s Physical AI by delivering robust cross-task and cross-domain generalisation.

We will gradually integrate DD-LIVM into our full Physical AI product portfolio to deliver plug-and-play intelligent inspection for photovoltaic operators, equipping every power station with a tireless AI-powered monitoring guardian.

Paper Information

Title: DD-LIVM: Pioneering Cross-Domain Photovoltaic Defect Detection Using Large Infrared-Visible Model

Conference: ACM MobiCom 2025 (Top International Conference in Mobile Computing)

Co-Affiliations: The Hong Kong University of Science and Technology, Jiangxing Intelligence
