Can this Model Also Recognize Dogs? Zero-Shot Model Search from Weights

Hebrew University of Jerusalem
An overview of our method

We propose a method for searching large model repositories using only the model weights. Our method embeds each logit of a classifier model separately and matches that representation with a text prompt. Thus, we can search for models that recognize a target concept, such as "Dog".

Abstract

With the increasing number of publicly available models, there are probably pre-trained, online models for most tasks users require. However, current model search methods are rudimentary, essentially a text-based search over the documentation, so users often cannot find the relevant models. This paper presents ProbeLog, a method for retrieving classification models that can recognize a target concept, such as "Dog", without access to model metadata or training data. Differently from previous probing methods, ProbeLog computes a descriptor for each output dimension (logit) of each model, by observing its responses on a fixed set of inputs (probes). Our method supports both logit-based retrieval ("find more logits like this") and zero-shot, text-based retrieval ("find all logits corresponding to dogs"). As probing-based representations require multiple costly feedforward passes through the model, we develop a method, based on collaborative filtering, that reduces the cost of encoding repositories by 3x. We demonstrate that ProbeLog achieves high retrieval accuracy, both on real-world and fine-grained search tasks, and is scalable to full-size repositories.

Task & Motivation

Our objective is to accurately and efficiently search a large repository for models that can recognize a target concept, e.g., ``Dog''. While existing search approaches rely on the text of user-uploaded documentation, our analysis shows that real-world models are often poorly documented. We examined all 1.2M model cards on Hugging Face and found that 60% of models have no information in their model cards. Thus, we aim to search for models based on their weights alone, without assuming access to the training data or metadata.

A figure showing that HuggingFace Models are poorly documented

Logit-level Descriptors

We introduce ProbeLog, a probing-based, logit-level descriptor designed specifically for model search. That is, ProbeLog represents each model output (logit) separately, instead of using a single representation for the entire model. To extract the ProbeLog descriptor of a given logit, we pass a set of n ordered, fixed input samples (probes) through the model. Intuitively, these are a set of standardized questions that we ask the model. In practice, we compose the list of probes by randomly sampling images (without replacement) from an out-of-distribution image dataset.

We define the ProbeLog descriptor for logit i of model f as the normalized responses of all probes at this logit:

\[ ProbeLog(f,i) = [f(x_1)[i], f(x_2)[i], \cdots, f(x_n)[i]] \]

Comparing ProbeLog descriptors lets us find models that are similar in their ability to recognize a specific concept. Next, we show how to use these descriptors to search for models by text.
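As a concrete illustration, the sketch below extracts a ProbeLog descriptor for a single logit and compares two descriptors by cosine similarity. It is a minimal PyTorch sketch; the function names and the L2 normalization choice are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch (PyTorch) of extracting and comparing ProbeLog descriptors.
# `extract_probelog` and the L2 normalization are illustrative assumptions.
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_probelog(model, probes, logit_idx):
    """probes: (n, C, H, W) fixed, ordered probe images; logit_idx: output dimension i."""
    model.eval()
    logits = model(probes)            # (n, num_classes)
    d = logits[:, logit_idx]          # responses of all n probes at logit i
    return F.normalize(d, dim=0)      # normalized ProbeLog(f, i)

def descriptor_similarity(d1, d2):
    """Cosine similarity between two ProbeLog descriptors (same probe order assumed)."""
    return torch.dot(d1, d2).item()
```

Because the probe set is fixed and ordered, descriptors extracted from different models live in the same space and can be compared directly.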

A figure showing our ProbeLog logit descriptors

Zero-Shot Model Search

On their own, ProbeLog descriptors only support search by logit, which assumes the user already has such a model. To enable finding new models, we extend our method to a search-by-text setting, where the user can search for concepts simply by naming them in text, making our search method zero-shot. To do so, we utilize a multimodal text-alignment model (e.g., CLIP) to generate ProbeLog-like descriptors from text alone. First, we compute the embeddings of each probe as well as of the user's text description of the target concept. We then define the zero-shot ProbeLog descriptor of the target concept as the vector of dot products between the embedding of each probe and that of the target text.
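The sketch below shows one way to build such a zero-shot descriptor with the openai `clip` package. It assumes the probes have already been preprocessed with CLIP's image transform; all names here are illustrative, not the paper's code.

```python
# Sketch: zero-shot ProbeLog-like descriptor from a text prompt via CLIP.
# Assumes `probes` are CLIP-preprocessed image tensors in the fixed probe order.
import torch
import clip

@torch.no_grad()
def zero_shot_descriptor(probes, concept, device="cpu"):
    model, _ = clip.load("ViT-B/32", device=device)
    probe_emb = model.encode_image(probes.to(device))                     # (n, d)
    text_emb = model.encode_text(clip.tokenize([concept]).to(device))     # (1, d)
    probe_emb = probe_emb / probe_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    # dot product between each probe embedding and the target text embedding
    return (probe_emb @ text_emb.T).squeeze(-1)                           # (n,)
```

The resulting vector is ordered by the same probe list as the model-side descriptors, so it can be matched against them to retrieve logits (and hence models) for the named concept.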

A figure showing our zero-shot ProbeLog logit descriptors

Collaborative Probing

Lastly, as creating ProbeLog representations for an entire model repository can be very costly, we present Collaborative Probing, a method that reduces the number of probes passed through each model by 3x. Instead of probing all models with all probes, we use only a random subset of the probes for each model and treat the remaining probes as missing data. We then complete the missing information with matrix-factorization-based collaborative filtering. We later show that this greatly improves performance when the number of probes is low.
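Below is a minimal sketch of the completion step, assuming the partially observed responses are arranged in a (logit x probe) matrix. The rank, learning rate, and iteration count are illustrative choices that would need tuning, not the paper's configuration.

```python
# Sketch: complete a partially observed (logit x probe) response matrix
# via low-rank matrix factorization (gradient descent on observed entries).
import numpy as np

def complete_responses(R, mask, rank=32, lr=0.01, epochs=500, seed=0):
    """R: (num_logits, num_probes) responses, arbitrary where mask == 0.
    mask: 1 for observed entries (probes actually run), 0 for missing ones."""
    rng = np.random.default_rng(seed)
    n_logits, n_probes = R.shape
    U = 0.1 * rng.standard_normal((n_logits, rank))
    V = 0.1 * rng.standard_normal((n_probes, rank))
    for _ in range(epochs):
        E = mask * (R - U @ V.T)   # reconstruction error on observed entries only
        U += lr * E @ V            # gradient step on the logit factors
        V += lr * E.T @ U          # gradient step on the probe factors
    return U @ V.T                 # completed response matrix
```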

A figure illustrating our Collaborative Probing setting

Results

We showcase ProbeLog's effectiveness on two real-world datasets that we curate: one based on models we train for fine-grained logit-search evaluation, and the other containing models downloaded from Hugging Face. Our method scales to large models with high effectiveness and efficiency. It achieves high retrieval accuracy, reaching over 40% top-1 accuracy when predicting whether a model can recognize an ImageNet target concept from text (where a random baseline scores only 0.1%). Furthermore, we establish the strong performance of our Collaborative Probing approach, showing that it can substantially reduce the number of probes passed through each model.


BibTeX

@misc{kahana2025modelrecognizedogszeroshot,
  title={Can this Model Also Recognize Dogs? Zero-Shot Model Search from Weights},
  author={Jonathan Kahana and Or Nathan and Eliahu Horwitz and Yedid Hoshen},
  year={2025},
  eprint={2502.09619},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2502.09619}
}