ChatGPT was trained on massive amounts of data gathered from the internet and other sources through 2021, and was then fine-tuned using Reinforcement Learning from Human Feedback (RLHF).
First, demonstration data written by humans is collected and used to train a supervised policy;
Next, the model is run and humans manually rank (label) its outputs from best to worst, and this comparison data is used to train a reward model;
Finally, a reinforcement learning algorithm (Proximal Policy Optimization, PPO) optimizes the policy against the reward model. These three steps are the key phases of RLHF.
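The reward-model and PPO steps above can be illustrated with a minimal sketch. It is not ChatGPT's actual training code: the function names, the example reward values, and the clipping range of 0.2 are assumptions chosen for illustration. It shows one common way to turn a human ranking into pairwise comparisons, a pairwise loss that pushes the reward model to score the preferred output higher, and the clipped surrogate objective that PPO maximizes.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_outputs):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs.

    Hypothetical helper: combinations() keeps the input order, so the
    first item of every pair is ranked above the second.
    """
    return list(combinations(ranked_outputs, 2))


def pairwise_ranking_loss(reward_preferred, reward_rejected):
    """Compute -log(sigmoid(r_preferred - r_rejected)) for one comparison.

    The loss is small when the reward model scores the human-preferred
    output higher than the rejected one, and large otherwise.
    """
    diff = reward_preferred - reward_rejected
    # Numerically stable form of log(1 + exp(-diff)).
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """PPO clipped surrogate objective for a single sample.

    ratio is pi_new(action) / pi_old(action); clip_eps=0.2 is an
    assumed, commonly used value, not one stated in the text.
    """
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Three model outputs, ranked by a human from best to worst.
    pairs = ranking_to_pairs(["answer A", "answer B", "answer C"])
    print(pairs)

    # Made-up reward scores: agreeing with the human gives a lower loss.
    print(pairwise_ranking_loss(2.0, -1.0))
    print(pairwise_ranking_loss(-1.0, 2.0))

    # A large policy change on a good action is capped by the clip.
    print(ppo_clipped_objective(ratio=1.5, advantage=1.0))
```

In a real system the rewards come from a neural reward model and the advantage is estimated from the reward model's scores. The clipping keeps each policy update close to the previous policy, which is the main idea behind PPO.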
ChatGPT is the third version; it is fine-tuned from a model in the GPT-3.5 series (a code model), which makes its responses more human-like.