We focus on the task of retrieving nail design images based on dense intent descriptions, which represent multi-layered user intent for nail designs. This is challenging because such descriptions specify unconstrained painted elements and pre-manufactured embellishments as well as visual characteristics, themes, and overall impressions. In addition to these descriptions, we assume that users provide palette queries by specifying zero or more colors via a color picker, enabling the expression of subtle and continuous color nuances. Existing vision-language foundation models often struggle to incorporate such descriptions and palettes.
To address this, we propose NaiLIA, a multimodal retrieval method for nail design images, which comprehensively aligns with dense intent descriptions and palette queries during retrieval. Our approach introduces a relaxed loss based on confidence scores for unlabeled images that can align with the descriptions.
To evaluate NaiLIA, we constructed a benchmark consisting of 10,625 images collected from people with diverse cultural backgrounds. The images were annotated with long and dense intent descriptions given by over 200 annotators. Experimental results demonstrate that NaiLIA outperforms standard methods.
In this study, we define the Nail design Semantic Text-palette Aligned Retrieval (NAIL-STAR) task as follows: given a dense intent description and a palette query for a nail design, the goal is to retrieve a ranked list of appropriate images.
Given the description provided above, the model should retrieve the image enclosed in a green frame. The design is painted with expressive creativity ("mermaid fin patterns and shell"), and the shells are decorated with pre-manufactured embellishments ("rhinestones"). These elements symbolize a theme ("mermaid"), forming an overall impression ("dreamy look"). Moreover, the design's color tones closely align with the colors specified in the palette query.
The terminology used in this study is defined as follows: a target nail design image refers to a nail design image that aligns with the dense intent description and is explicitly labeled as a positive. Meanwhile, an unlabeled positive is a nail design image that could be considered as the target nail design image but lacks explicit labeling.
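To make the task interface concrete, the following is a minimal sketch of a NAIL-STAR-style ranking function. It is not NaiLIA itself: the precomputed embeddings, the per-image list of dominant RGB colors, the names `rank_images` and `palette_score`, the RGB distance, and the `palette_weight` parameter are all assumptions introduced for illustration.

```python
import math

# Largest possible Euclidean distance between two 8-bit RGB colors.
MAX_RGB_DIST = math.sqrt(3 * 255 ** 2)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def palette_score(query_colors, image_colors):
    """Mean closeness of each query color to its nearest image color.

    Returns 0.0 when the palette query is empty, since users may
    specify zero colors. (Hypothetical scoring rule, not the paper's.)
    """
    if not query_colors or not image_colors:
        return 0.0
    total = 0.0
    for q in query_colors:
        nearest = min(math.dist(q, c) for c in image_colors)
        total += 1.0 - nearest / MAX_RGB_DIST
    return total / len(query_colors)


def rank_images(text_emb, query_colors, gallery, palette_weight=0.5):
    """Return image ids sorted from most to least appropriate.

    `gallery` is a list of dicts with keys "id", "emb", and "colors"
    (assumed layout; embeddings would come from some encoder).
    """
    scored = []
    for item in gallery:
        score = cosine(text_emb, item["emb"])
        score += palette_weight * palette_score(query_colors, item["colors"])
        scored.append((score, item["id"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [image_id for _, image_id in scored]
```

In this sketch, the palette term only reorders images whose text similarity is comparable; an empty palette query reduces the ranking to text similarity alone.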
We propose NaiLIA, a multimodal retrieval method for nail design images based on dense intent descriptions and palette queries. NaiLIA estimates confidence scores for unlabeled positives, that is, images that align with the given description and can be considered positive examples but are not explicitly labeled. Incorporating these scores into the loss function improves training efficiency because it avoids undesired anti-correlation between pairs that should be correlated. In addition, NaiLIA models the relationship between dense intent descriptions and palette queries, enabling it to rank nail designs that closely align with the user's intended color tones more highly.
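One way to picture a confidence-relaxed loss is the sketch below. It is an illustrative assumption, not the formulation used in NaiLIA: it takes a standard InfoNCE-style contrastive loss and down-weights each non-target image in the denominator by `1 - confidence`, so that a likely unlabeled positive is pushed away less strongly. The function name `relaxed_info_nce`, the weighting rule, and the `temperature` default are hypothetical.

```python
import math


def relaxed_info_nce(pos_sim, other_sims, confidences, temperature=0.07):
    """InfoNCE-style loss with confidence-weighted negatives (a sketch).

    pos_sim:      similarity between the description and its labeled target.
    other_sims:   similarities to the remaining images in the batch.
    confidences:  values in [0, 1]; a higher value means the image is more
                  likely an unlabeled positive and contributes less as a
                  negative. (Assumed weighting, not the paper's definition.)
    """
    if len(other_sims) != len(confidences):
        raise ValueError("other_sims and confidences must have equal length")
    pos_term = math.exp(pos_sim / temperature)
    neg_term = sum(
        (1.0 - c) * math.exp(s / temperature)
        for s, c in zip(other_sims, confidences)
    )
    return -math.log(pos_term / (pos_term + neg_term))


# With all confidences at 0 this is the usual InfoNCE loss; raising the
# confidence of a similar-looking image lowers the penalty it induces.
strict = relaxed_info_nce(0.8, [0.7, 0.1], [0.0, 0.0])
relaxed = relaxed_info_nce(0.8, [0.7, 0.1], [0.9, 0.0])
```

Under these assumptions, `relaxed` is smaller than `strict` because the image with similarity 0.7 is treated mostly as a plausible positive rather than as a hard negative.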
@article{amemiya2026nailia,
title={NaiLIA: Multimodal Nail Design Retrieval Based on Dense Intent Descriptions and Palette Queries},
author={Kanon Amemiya and Daichi Yashima and Kei Katsumata and Takumi Komatsu and Ryosuke Korekata and Seitaro Otsuki and Komei Sugiura},
year={2026},
eprint={2603.05446},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.05446},
}