NVIDIA Unveils Nemotron-CC: A Trillion-Token Dataset for Enhanced LLM Training

Joerg Hiller
May 07, 2025 15:38

NVIDIA introduces Nemotron-CC, a trillion-token dataset for large language models, integrated with NeMo Curator. This innovative pipeline optimizes data quality and quantity for advanced AI model training.




NVIDIA has integrated its Nemotron-CC pipeline into NeMo Curator, offering a groundbreaking approach to curating high-quality datasets for large language models (LLMs). The Nemotron-CC dataset leverages a 6.3-trillion-token English-language collection from Common Crawl and, according to NVIDIA, aims to significantly improve the accuracy of LLMs.

Advancements in Data Curation

The Nemotron-CC pipeline addresses the limitations of traditional data curation methods, which often discard potentially useful data due to heuristic filtering. By employing classifier ensembling and synthetic data rephrasing, the pipeline generates 2 trillion tokens of high-quality synthetic data, recovering up to 90% of the content lost to filtering.
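To make the recovery idea concrete, here is a minimal, hypothetical sketch of the "rephrase instead of discard" pattern: documents that fail a heuristic quality check are routed to an LLM rephrasing step rather than being dropped. All function names and the toy scoring heuristic are illustrative assumptions, not the actual NeMo Curator API.

```python
def heuristic_quality_score(doc: str) -> float:
    """Toy stand-in for a heuristic quality score in [0, 1].

    Here we simply penalize repetitive text; real pipelines combine
    many signals (length, symbol ratio, perplexity, etc.).
    """
    words = doc.split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)  # unique-word ratio


def rephrase_with_llm(doc: str) -> str:
    """Placeholder for an LLM call that rewrites low-quality text."""
    return doc.strip().capitalize()


def curate(docs, threshold=0.5):
    """Keep high-scoring docs; recover low-scoring ones via rephrasing."""
    kept, recovered = [], []
    for doc in docs:
        if heuristic_quality_score(doc) >= threshold:
            kept.append(doc)
        else:
            recovered.append(rephrase_with_llm(doc))  # recover, don't discard
    return kept, recovered
```

The key design point is that the low-quality branch still produces training tokens, which is how the pipeline can recover a large fraction of content that strict filtering would otherwise throw away.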

Innovative Pipeline Features

The pipeline's data curation process begins with HTML-to-text extraction using tools such as jusText, with fastText for language identification. It then applies deduplication to remove redundant data, relying on NVIDIA RAPIDS libraries for efficient processing. The process includes 28 heuristic filters to ensure data quality and a PerplexityFilter module for further refinement.
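The stages described above can be sketched as a simple sequential pipeline. This is a self-contained toy illustration under stated assumptions: the stand-in functions below mimic the roles of jusText, fastText, GPU deduplication, and heuristic filtering, but none of them are the real NeMo Curator or RAPIDS implementations.

```python
import hashlib


def extract_text(html: str) -> str:
    # Stand-in for jusText-style HTML-to-text extraction
    return html.replace("<p>", "").replace("</p>", "").strip()


def is_english(text: str) -> bool:
    # Stand-in for a fastText language-ID model (crude ASCII check)
    return all(ord(c) < 128 for c in text)


def dedupe(docs):
    # Exact deduplication via content hashing; real pipelines also do
    # fuzzy dedup on GPU with RAPIDS
    seen, out = set(), []
    for d in docs:
        h = hashlib.sha256(d.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            out.append(d)
    return out


def heuristic_filters(text: str) -> bool:
    # Two of the many possible heuristics: minimum length, not all-caps
    return len(text.split()) >= 3 and not text.isupper()


def run_pipeline(pages):
    """Extraction -> language ID -> dedup -> heuristic filtering."""
    docs = [extract_text(p) for p in pages]
    docs = [d for d in docs if is_english(d)]
    docs = dedupe(docs)
    return [d for d in docs if heuristic_filters(d)]
```

A perplexity filter (scoring each document with a language model and dropping high-perplexity outliers) would slot in as one more list-comprehension stage at the end.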

Quality labeling is achieved through an ensemble of classifiers that assess and categorize documents into quality levels, enabling targeted synthetic data generation. This approach supports the creation of diverse QA pairs, distilled content, and organized knowledge lists from the text.

Impact on LLM Training

Training LLMs with the Nemotron-CC dataset yields significant improvements. For instance, a Llama 3.1 model trained on a 1-trillion-token subset of Nemotron-CC achieved a 5.6-point gain in MMLU score compared with models trained on traditional datasets. Moreover, models trained on long-horizon tokens, including Nemotron-CC, saw a 5-point increase in benchmark scores.

Getting Started with Nemotron-CC

The Nemotron-CC pipeline is available to developers aiming to pretrain foundation models or perform domain-adaptive pretraining across various fields. NVIDIA provides a step-by-step tutorial and APIs for customization, enabling users to tailor the pipeline to their specific needs. The integration into NeMo Curator allows for seamless development of both pretraining and fine-tuning datasets.

For more information, visit the NVIDIA blog.

Image source: Shutterstock

