We will add the links as soon as possible. For now, our dataset, code, and additional report are provided as supplementary materials.
We focus on the task of retrieving nail design images based on dense intent descriptions, which represent long and multi-layered user intent for nail designs. This is challenging because such descriptions specify flexibly created paintings and pre-manufactured embellishments, as well as visual characteristics, spatial relationships, themes, and overall impressions. Existing vision-and-language foundation models often struggle to incorporate such multi-layered intent descriptions.
To address this, we propose NaiLIA, a multimodal retrieval method that comprehensively aligns nail design images with dense intent descriptions. Our approach introduces a relaxed loss based on confidence scores for unlabeled images that may align with the descriptions.
To evaluate NaiLIA, we constructed a benchmark consisting of 10,625 images collected from people with diverse cultural backgrounds. The images were annotated with long and dense intent descriptions given by over 200 annotators. Experimental results demonstrate that the proposed method outperforms standard methods.
In this study, we define the Nail design Semantic Text-image Aligned Retrieval (NAIL-STAR) task as follows: given a dense intent description for a nail design, the goal is to retrieve nail design images that align with the description at a high rank in the output image list. This is difficult because a nail design generally consists of a painted portion, which allows for creative flexibility, and a decorative portion, which can only be modified through the selection and arrangement of pre-manufactured embellishments. Furthermore, the descriptions often include the themes and spatial relationships of the designs, in addition to visual characteristics.
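The task definition above can be illustrated with a minimal ranking sketch. It assumes that the description and each image have already been encoded into fixed-length vectors and that cosine similarity is used for scoring; neither assumption comes from the text. The function names (`rank_images`, `recall_at_k`) and the Recall@K measure are hypothetical choices for illustration, not the evaluation protocol of this study.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_images(text_emb, image_embs):
    """Return image indices ordered from most to least similar to the description."""
    scores = [cosine(text_emb, emb) for emb in image_embs]
    return sorted(range(len(image_embs)), key=lambda i: scores[i], reverse=True)


def recall_at_k(ranking, target_ids, k):
    """Fraction of target images that appear in the top-k positions (assumed metric)."""
    top_k = set(ranking[:k])
    return sum(1 for t in target_ids if t in top_k) / len(target_ids)


# Toy embeddings, purely illustrative.
description = [1.0, 0.0]
images = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]
ranking = rank_images(description, images)
```

In this toy setting, an image whose embedding points in nearly the same direction as the description embedding is placed first, which is the behavior the task asks for: target images should appear at a high rank in the output list.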
Given the description provided above, the model should retrieve the image enclosed in a green frame. The design is painted with expressive creativity ("fins and shells with a light blue appearance"), and the shells are decorated with pre-manufactured embellishments ("pearl nail accessories"). These elements symbolize a theme ("mermaid"), forming an overall impression ("a fresh and sparkling look").
The terminology used in this study is defined as follows. A target nail design image is a nail design image that aligns with the dense intent description and is explicitly labeled as a positive. An unlabeled positive is a nail design image that could be considered a target nail design image but lacks explicit labeling.
We propose NaiLIA, a multimodal retrieval method for nail design images based on dense intent descriptions. It differs from existing approaches in two respects. First, NaiLIA estimates confidence scores for unlabeled positives and incorporates these scores into the loss function. This can make training more efficient by avoiding undesired anti-correlation between descriptions and images that should be correlated. Second, NaiLIA decomposes the dense intent descriptions and structures natural language descriptions of nail design images so that the two can be aligned in a multi-layered manner.
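To make the idea of a confidence-based relaxed loss more concrete, the following is a minimal sketch of one possible form, not the loss actually used in NaiLIA. It assumes an InfoNCE-style contrastive loss for a single description, in which each non-target image contributes to the denominator with weight (1 - confidence), so that likely unlabeled positives are pushed away less strongly. The function name, the weighting scheme, and the temperature value are all assumptions made for illustration.

```python
import math


def relaxed_contrastive_loss(sims, pos_idx, confidences, temperature=0.07):
    """Confidence-relaxed contrastive loss for one description (illustrative sketch).

    sims:        similarity between the description and each image in the batch
    pos_idx:     index of the labeled target image
    confidences: assumed scores in [0, 1]; higher means the image is more likely
                 an unlabeled positive (the entry at pos_idx is ignored)
    """
    # Subtract the maximum logit for numerical stability; the loss is unchanged.
    max_logit = max(s / temperature for s in sims)
    pos_term = math.exp(sims[pos_idx] / temperature - max_logit)

    neg_sum = 0.0
    for j, (s, c) in enumerate(zip(sims, confidences)):
        if j == pos_idx:
            continue
        # Down-weight images that are likely unlabeled positives.
        neg_sum += (1.0 - c) * math.exp(s / temperature - max_logit)

    return -math.log(pos_term / (pos_term + neg_sum))


# Toy similarities: image 1 is visually close to the target image 0.
sims = [0.9, 0.8, 0.1]
standard = relaxed_contrastive_loss(sims, 0, [0.0, 0.0, 0.0])
relaxed = relaxed_contrastive_loss(sims, 0, [0.0, 0.9, 0.0])
```

With all confidences set to zero, this reduces to a standard contrastive loss in which every non-target image is treated as a negative. Raising the confidence of image 1 lowers its penalty, which is one way to avoid forcing anti-correlation between a description and an image that may in fact match it.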