Authors
Le Van Nguyen¹ and Rory Sie²
¹University of Wollongong, Australia
²NIB Group, Australia
Abstract
Reinforcement Learning from Human Feedback (RLHF) has significantly enhanced the performance of large language models (LLMs) on tasks such as summarization, dialogue generation, and content moderation. However, the reliance on human-annotated data makes RLHF expensive and difficult to scale. To address these challenges, Reinforcement Learning from AI Feedback (RLAIF) has emerged as a promising alternative. In RLAIF, AI-generated preference labels replace human feedback, offering a more cost-effective and scalable solution while maintaining competitive performance. Despite its success within single model families, RLAIF's generalizability across diverse model architectures and scales remains unclear. This study extends the evaluation of RLAIF by applying it to three model families (T5, Phi-3.5, and Llama 3.2) representing a variety of model sizes and architectures. We compare RLAIF with traditional supervised fine-tuning (SFT) and examine the impact of model size on its effectiveness. Our findings reveal that RLAIF improves model alignment across all three architectures, although the extent of the improvement varies with model type. This research contributes to the broader discussion on improving the efficiency and scalability of reinforcement learning techniques for LLM alignment. By evaluating RLAIF across multiple architectures, our work provides practical guidance for implementing AI-feedback-based alignment techniques applicable to a wide range of LLMs, advancing the field of AI model fine-tuning.
Keywords
Reinforcement Learning, AI Feedback, Large Language Models, Alignment, Scaling