A multimodal embedding modifier generates a modified seed search selection embedding that is used to provide a set of search results. The multimodal embedding modifier improves the accuracy with which a user's true intent is identified when the user searches an online marketplace. For example, embodiments disclosed herein can allow a user to navigate multiple modalities for an item. In some embodiments, a user may select a search result returned for an initial search query and then modify the selected search result by inputting a modifier (e.g., a textual modifier). The multimodal embedding modifier can be trained using a training dataset that includes a text embedding, an image embedding, another type of embedding, or a combination thereof.
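The flow described above (select a seed result, apply a textual modifier, retrieve new results) can be illustrated with a minimal sketch. The source does not specify how the trained modifier combines embeddings, so the weighted sum below is only an assumed stand-in for it. The function names (`modify_seed_embedding`, `search`), the `weight` parameter, the three-dimensional vectors, and the catalog entries are all hypothetical and chosen for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def modify_seed_embedding(seed, modifier, weight=0.5):
    """Assumed stand-in for the trained modifier: a weighted sum of the
    unit-length seed embedding and the unit-length modifier embedding."""
    seed_n, mod_n = normalize(seed), normalize(modifier)
    mixed = [(1 - weight) * s + weight * m for s, m in zip(seed_n, mod_n)]
    return normalize(mixed)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def search(query, catalog, k=2):
    """Return the names of the k catalog items most similar to the query."""
    ranked = sorted(catalog, key=lambda name: cosine(query, catalog[name]),
                    reverse=True)
    return ranked[:k]


# Hypothetical item embeddings (e.g., produced by an image encoder).
catalog = {
    "red sneaker": [1.0, 0.0, 0.1],
    "blue sneaker": [0.0, 1.0, 0.1],
    "blue boot": [0.0, 0.9, 0.9],
}

# The user selects "red sneaker" as the seed, then enters a textual
# modifier such as "blue", represented here by a hypothetical text embedding.
seed = catalog["red sneaker"]
blue_modifier = [0.0, 1.0, 0.0]

modified = modify_seed_embedding(seed, blue_modifier, weight=0.7)
results = search(modified, catalog, k=2)
print(results)
```

In this toy setup, the modified embedding moves away from the seed and toward items matching the modifier, so the blue items outrank the original red one. A trained modifier would replace the fixed weighted sum with a learned mapping.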