OpenAI: Two new versions and the output dataset of GPT-2 out!

Today, OpenAI released new versions of GPT-2, its AI language model. GPT-2 can generate coherent paragraphs of text without any task-specific training. The release includes a medium 345M-parameter version and a small 117M-parameter version of GPT-2. OpenAI has also shared the 762M and 1.5B versions with partners in the AI and security communities who are working to improve societal preparedness for large language models. The earlier version, GPT, was released in 2018. In February 2019, OpenAI announced GPT-2 along with many samples and a discussion of its policy implications.

The team at OpenAI has decided on a staged release of GPT-2, in which a family of models is released gradually over time. The staged release is meant to give people time to assess the properties of these models, discuss their societal implications, and evaluate the impact of each release before the next stage.

The 345M-parameter version of GPT-2 performs better than the 117M version, though it still falls short of the largest model in how easily it generates coherent text. The team also judges that the 345M version would be difficult to misuse.

In deciding to release the 345M version at this stage, the team considered many factors, including the ease of use for generating coherent text, the role of humans in the text generation process, the likelihood and timing of future replication and publication by others, evidence of use in the wild, and expert-informed inferences about unobservable uses.

The team hopes that ongoing research on bias, detection, and misuse will give it the confidence to publish larger models. In six months, it plans to share a fuller analysis of language models’ societal implications and its heuristics for release decisions.

The team at OpenAI is looking for partnerships with academic institutions, non-profits, and industry labs which will focus on increasing societal preparedness for large language models. They are also open to collaborating with researchers working on language model output detection, bias, and publication norms, and with organizations potentially affected by large language models.

The output dataset contains GPT-2 outputs from all four model sizes, generated with and without top-k truncation, as well as a subset of the WebText corpus used to train GPT-2. The dataset features approximately 250,000 samples per model/hyperparameter pair, which should be enough to help a wider range of researchers perform quantitative and qualitative analysis.
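
For readers unfamiliar with the term, top-k truncation means sampling each next token only from the k highest-scoring candidates instead of from the full vocabulary. The Python sketch below illustrates the general idea and is not OpenAI's code: the function names `top_k_truncate` and `sample_token`, the toy scores, and the convention that a non-positive k disables truncation are all assumptions made for this example.

```python
import math
import random


def top_k_truncate(logits, k):
    """Return next-token probabilities with all but the k highest scores zeroed out.

    Assumption for this sketch: k <= 0 (or k >= vocabulary size) means no truncation.
    """
    indices = range(len(logits))
    if k <= 0 or k >= len(logits):
        kept = list(indices)
    else:
        # Indices of the k largest scores.
        kept = sorted(indices, key=lambda i: logits[i], reverse=True)[:k]

    # Softmax over the kept scores only (shifted by the max for numerical stability).
    peak = max(logits[i] for i in kept)
    weights = {i: math.exp(logits[i] - peak) for i in kept}
    total = sum(weights.values())
    return [weights.get(i, 0.0) / total for i in indices]


def sample_token(logits, k, rng=random):
    """Draw one token index from the truncated distribution."""
    probs = top_k_truncate(logits, k)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Toy scores for a hypothetical four-token vocabulary.
scores = [2.0, 1.0, 0.5, -1.0]
print(top_k_truncate(scores, 2))  # only the two best tokens keep nonzero probability
print(sample_token(scores, 2))    # always index 0 or 1
```

Samples drawn without truncation use the model's full distribution, while truncated samples trade some diversity for fewer low-probability tokens, which is presumably why the dataset includes both settings.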

To learn more about the release, see the official release announcement.