A major technical pain point in Task-Oriented Dialogue (TOD) systems is the “cold start” problem in new domains. Traditional reinforcement learning policies are often tied to a specific domain’s ontology, so a model trained for “hotel booking” cannot interpret the actions required for “train scheduling” without extensive retraining. This lack of zero-shot generalization limits the scalability of AI assistants in dynamic, multi-domain environments. Developing a policy that can understand and act in an entirely new domain based on semantic intuition is a critical challenge for the field.
In response, the research team from Beihang University developed the UR-QVP framework. The “Unified Representation” (UR) component moves away from fixed labels to an ontology-agnostic approach, representing both states and actions by their natural language semantic features. This enables the model to infer the purpose of a novel action by comparing it to known actions in a shared latent space. Complementing this, the “Q-Values Perturbation” (QVP) family of exploration strategies lets the model keep its decision-making adaptive even when faced with the uncertainties of a new domain.
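The two ideas can be illustrated with a minimal sketch. Note the simplifying assumptions: a toy bag-of-words embedding over a hand-picked vocabulary stands in for the learned semantic features of the Unified Representation, and Gaussian noise added to Q-estimates stands in for one member of the QVP exploration family; the function names, vocabulary, and Q-values are illustrative, not from the paper.

```python
import numpy as np

# Toy vocabulary for the stand-in embedding; a real system would use
# learned semantic features of the natural-language descriptions.
VOCAB = ["book", "request", "hotel", "train", "price", "ticket", "room"]

def embed(text):
    """Map an action description to a unit vector in a shared space."""
    vec = np.array([text.lower().split().count(w) for w in VOCAB], float)
    n = np.linalg.norm(vec)
    return vec / n if n else vec

def nearest_known(novel, known):
    """Infer a novel action's purpose via its closest known action."""
    sims = {k: float(embed(novel) @ embed(k)) for k in known}
    return max(sims, key=sims.get), sims

def qvp_act(q_values, scale, rng):
    """Act greedily on noise-perturbed Q-values (one QVP-style strategy)."""
    return int(np.argmax(q_values + rng.normal(0.0, scale, len(q_values))))

if __name__ == "__main__":
    # A "train" action unseen in training maps onto the semantically
    # closest known "hotel" action because both describe booking.
    known = ["book hotel room", "request hotel price"]
    best, sims = nearest_known("book train ticket", known)
    print(best)  # "book hotel room" (cosine 1/3 vs 0.0)

    # With enough perturbation noise, even lower-valued actions are
    # occasionally selected, which drives exploration in a new domain.
    rng = np.random.default_rng(0)
    picks = {qvp_act([1.0, 0.9, 0.2], scale=0.5, rng=rng) for _ in range(200)}
    print(sorted(picks))
```

Setting `scale=0.0` recovers the plain greedy policy, so the perturbation magnitude directly controls the exploration–exploitation trade-off.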
Evaluation on the MultiWOZ datasets confirms that UR-QVP achieves effective zero-shot transfer. The model navigates complex multi-turn dialogues in unseen domains, reaching success rates that traditional RL policies attain only after fine-tuning. This work offers a robust roadmap for building more flexible conversational agents capable of immediate deployment across diverse services.