ChatGPT is a member of the generative pre-trained transformer (GPT) family of language models. It was fine-tuned (an approach to transfer learning)[6] from an improved version of OpenAI's GPT-3 known as "GPT-3.5".[7] The fine-tuning process used both supervised learning and reinforcement learning in a technique called reinforcement learning from human feedback (RLHF).[8][9] Both approaches relied on human trainers to improve the model's performance. In the supervised learning step, the model was given conversations in which the trainers played both sides: the user and the AI assistant. In the reinforcement learning step, human trainers first ranked responses that the model had generated in a previous conversation.[10] These rankings were used to create "reward models" on which the model was further fine-tuned using several iterations of proximal policy optimization (PPO).[8][11] Proximal policy optimization algorithms offer a cost-effective alternative to trust region policy optimization algorithms, avoiding many of their computationally expensive operations while training faster.[12][13] The models were trained in collaboration with Microsoft on its Azure supercomputing infrastructure, using Nvidia GPUs; the "supercomputer developed for OpenAI is a single system with more than 285,000 CPU cores, 10,000 GPUs and 400 gigabits per second of network connectivity for each GPU server".[14]
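The Python sketch below illustrates, in textbook form, the two ingredients named above: a pairwise ranking loss of the kind typically used to fit a reward model to human preference rankings, and PPO's clipped surrogate objective, which replaces the expensive trust-region constraint of TRPO with a simple clipping of the policy probability ratio. All function names and numerical values here are illustrative assumptions, not taken from OpenAI's implementation.

```python
import numpy as np

def pairwise_reward_loss(score_preferred, score_rejected):
    """Pairwise ranking loss for training a reward model from human rankings:
    the model learns to assign a higher scalar score to the response the
    human trainer preferred. (Illustrative form, not OpenAI's actual code.)"""
    return -np.log(1.0 / (1.0 + np.exp(-(score_preferred - score_rejected))))

def ppo_clipped_objective(new_logprob, old_logprob, advantage, clip_eps=0.2):
    """PPO's clipped surrogate objective: the probability ratio between the
    updated and old policies is clipped to [1 - eps, 1 + eps], which keeps
    each update close to the previous policy without the second-order
    computations required by trust region policy optimization (TRPO)."""
    ratio = np.exp(new_logprob - old_logprob)        # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return np.minimum(unclipped, clipped)            # objective to be maximized

# Toy values only: one reward-model comparison and one policy-gradient term.
print(pairwise_reward_loss(score_preferred=1.3, score_rejected=0.4))
print(ppo_clipped_objective(new_logprob=-1.1, old_logprob=-1.4, advantage=0.8))
```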
OpenAI collects data from ChatGPT users to further train and fine-tune the service. Users can upvote or downvote responses they receive from ChatGPT and fill out a text field with additional feedback.[15][16]





