  The most prestigious law school admissions discussion board in the world.

broadest commercially usable open collections of synthetic data for agentic AI

Mauve weed whacker
  02/11/26



Date: February 11th, 2026 8:39 PM
Author: Mauve weed whacker

The Nemotron dataset collection spans pre- and post-training, personas, safety, RL, and RAG datasets, including over 10T language tokens and 18 million supervised fine-tuning (SFT) data samples.

Generating, filtering, and curating data at this scale is a huge undertaking. With these datasets openly available under permissive licenses, researchers and developers can now train, fine-tune, and evaluate models with greater transparency and build models faster.

(http://www.autoadmit.com/thread.php?thread_id=5833870&forum_id=2betting#49664402)