From Image-to-LoRA to In-Context Edit
Some time ago, we released an Image-to-LoRA model for Qwen-Image, which can directly convert a set of images into a LoRA model that generates similar images.
Since releasing that model, we have been considering whether this capability could be extended to image editing models. Unfortunately, we were unable to train an Image-to-LoRA model for image editing. However, we achieved similar functionality through an In-Context Edit approach and released a new model. This article explains how we implemented this functionality on the newly released Qwen-Image-Edit-2511.
What Can LoRA Models for Image Editing Do?
Moving from image generation to image editing, the role of LoRA models changes. In image generation, LoRA is commonly used to control the style of the generated images; in image editing, LoRA is typically used to encode a specific "image-to-image transformation."
For example, the model dx8152/Qwen-Edit-2509-Light-Migration can relight images.
Typically, the editing capabilities these LoRAs implement are difficult to describe precisely in text. In the relighting example above, natural language struggles to capture every detail: which direction the light comes from, what tone it has, and how bright it should be. However, when we provide "before-and-after image pairs" as examples, the intended edit becomes crystal clear. LoRA models for image editing are trained on exactly such "before-and-after image pairs." From this training data, the model learns the editing requirement and applies the "image-to-image transformation" to new images.
Why Can't We Implement an Image-to-LoRA Model for Image Editing?
To be precise, an Image-to-LoRA model for image editing would actually be an Image-Pair-to-LoRA model. For example, in the following case of modifying facial expressions, the Image-Pair-to-LoRA model would take the two images in the first row as input, understand that the transformation is "make the person laugh heartily," and output a LoRA model. We would then apply this LoRA to a new edit: given the third image, it should make the elderly person in that image laugh heartily and produce the fourth image.
If we train an Image-Pair-to-LoRA model on plain "before-and-after image pairs," the model tends to simply generate the edited image rather than focusing on the change between the two images. Each training sample must therefore contain four images: an example before/after pair that demonstrates the edit, plus a new input image and its ground-truth edited result, so that the model is supervised to transfer the demonstrated edit rather than reproduce it (a sample layout is sketched below). This stringent data format makes it extremely difficult to construct training datasets at scale; we managed to cobble together a dataset of only 30,000 samples. Such a meager amount of data makes an Image-Pair-to-LoRA model infeasible.
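As a concrete illustration, a single sample in this four-image format could be organized along the following lines. The field names and loading helper here are purely illustrative and do not reflect the actual dataset schema.

```python
from dataclasses import dataclass
from PIL import Image


@dataclass
class InContextEditSample:
    """One four-image training sample (illustrative schema, not the actual dataset format)."""
    example_before: Image.Image  # image 1: input of the demonstration pair
    example_after: Image.Image   # image 2: result of applying the edit to image 1
    query_before: Image.Image    # image 3: new image the same edit should be applied to
    query_after: Image.Image     # image 4: ground-truth edited result, used as the training target


def load_sample(paths: dict) -> InContextEditSample:
    """Assemble one sample from image files on disk (paths keyed by the field names above)."""
    return InContextEditSample(
        **{name: Image.open(path).convert("RGB") for name, path in paths.items()}
    )
```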
How Do We Activate the Model's In-Context Edit Capability?
Since an Image-Pair-to-LoRA model is not feasible, we looked to other technical approaches that could deliver similar functionality. Note that the entire process can be regarded as multi-image editing: the model receives image 1, image 2, and image 3, applies the transformation from image 1 to image 2 onto image 3, and generates image 4. The recently released Qwen-Image-Edit-2511 happens to be a multi-image editing model, so we can leverage its multi-image editing capability directly to realize this In-Context Edit functionality. In-Context Edit is another line of work we have been exploring, and here it converges with the Image-to-LoRA capability.
We trained and open-sourced such a model. Structurally, it is a standard LoRA that activates Qwen-Image-Edit-2511's In-Context Edit capability: given an example of an edit, the model understands it and applies it to new images on its own. Most importantly, because this structure inherits the editing model's own multi-image editing capability, it can be trained with relatively little data (our 30,000 samples). Through this alternative approach, we achieved functionality similar to an Image-Pair-to-LoRA model.
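To make the interface concrete, here is a minimal usage sketch. It assumes a diffusers-style multi-image editing pipeline (we use QwenImageEditPlusPipeline as a stand-in); the repository id, the LoRA path, the prompt wording, and the step count are illustrative placeholders rather than confirmed release identifiers.

```python
import torch
from PIL import Image
from diffusers import QwenImageEditPlusPipeline  # assumption: the multi-image edit pipeline also serves the 2511 checkpoint

# Placeholder repo id and LoRA path -- substitute the actual releases.
pipe = QwenImageEditPlusPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit-2511", torch_dtype=torch.bfloat16
)
pipe.to("cuda")
pipe.load_lora_weights("path/to/in-context-edit-lora")  # the In-Context Edit LoRA described in this article

# Images 1 and 2 demonstrate the edit; image 3 is the new image to be edited.
example_before = Image.open("example_before.png").convert("RGB")
example_after = Image.open("example_after.png").convert("RGB")
new_image = Image.open("new_image.png").convert("RGB")

# Image 4 is generated: the transformation from image 1 to image 2, applied to image 3.
result = pipe(
    image=[example_before, example_after, new_image],
    prompt="Apply the edit shown from the first image to the second image onto the third image.",
    num_inference_steps=40,
).images[0]
result.save("edited.png")
```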
What Potential Does In-Context Edit Capability Hold?
A few years ago, a wave of large language models, exemplified by GPT, emerged. The rapid progress of large language model technology brought dividends to "text-to-text" tasks and completely transformed research in natural language understanding. Today, models like Qwen-Image-Edit have achieved breakthroughs in "image-to-image" tasks, and these image editing models hold promise for application across numerous computer vision tasks.
For example, our In-Context Edit model can be used for image segmentation.
It can also be used for depth estimation.
Of course, the reverse direction also works: generating an image from a condition map such as a segmentation or depth map, similar to ControlNet.
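Under the same assumptions as the usage sketch above (the pipeline class, repository id, and LoRA path remain placeholders), switching tasks only requires changing the demonstration pair. For segmentation, for instance, the example pair becomes a photo and its mask, and the model is asked to produce a mask for a new photo; the file names and prompt below are illustrative.

```python
# Continuing the sketch above: `pipe`, imports, and the loaded LoRA are reused as-is.
from PIL import Image

# The demonstration pair now defines the task: a photo and its segmentation mask.
example_photo = Image.open("example_photo.png").convert("RGB")  # illustrative file names
example_mask = Image.open("example_mask.png").convert("RGB")
query_photo = Image.open("query_photo.png").convert("RGB")

mask = pipe(
    image=[example_photo, example_mask, query_photo],
    prompt="Produce the same kind of segmentation mask for the third image.",
    num_inference_steps=40,
).images[0]
mask.save("query_mask.png")
```

The same pattern covers depth estimation or the ControlNet-style reverse direction by swapping in the corresponding example pair.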
This suggests that large image editing models can indeed be applied directly to numerous computer vision tasks, a direction worth exploring in future research.
What Will We Do Next?
- This model's performance still has significant room for improvement. We are refining the model structure and will release improved models in the future to further unleash the model's In-Context Edit capabilities.
- A model's overall capability is a combination of a series of atomic abilities, and covering them requires more data. We are continuing to build larger datasets, which will be open-sourced in the future.
- This model enables image editing models to be applied to many computer vision tasks. We will validate its effectiveness on certain tasks and release detailed technical reports in the future.