
    NVIDIA Introduces Nemotron-CC: A Massive Dataset for LLM Pretraining

    By CryptoExpert, January 13, 2025




    Iris Coleman
    Jan 10, 2025 14:13

    NVIDIA debuts Nemotron-CC, a 6.3-trillion-token English dataset, enhancing pretraining for large language models with innovative data curation methods.





    NVIDIA has announced the release of Nemotron-CC, a 6.3-trillion-token English-language dataset designed to advance the pretraining of large language models (LLMs). The dataset, derived from Common Crawl, aims to improve the accuracy and efficiency of LLMs through new data curation techniques, and 1.9 trillion of its tokens are synthetically generated, according to NVIDIA.

    Enhancing LLM Pretraining

    NVIDIA’s initiative addresses a critical need in LLM training, where the quality of the pretraining dataset plays a pivotal role. Recent models such as Meta’s Llama series have been trained on datasets of up to 15 trillion tokens, but the exact composition of those datasets remains largely undisclosed. Nemotron-CC seeks to fill this gap by giving the wider community a high-quality dataset that supports training over both short and long token horizons.

    Existing curation approaches often discard up to 90% of the data to improve benchmark accuracy, which limits their usefulness for long-horizon training. Nemotron-CC, by contrast, shows how to transform Common Crawl data into a dataset whose trained models surpass even Llama 3.1 8B, using methods such as classifier ensembling and synthetic data rephrasing.

    Significant Results

    Nemotron-CC’s efficacy is evidenced by its performance in various benchmarks. When training 8B parameter models for one trillion tokens, the high-quality subset Nemotron-CC-HQ outperforms leading datasets like DCLM, increasing MMLU scores by 5.6 points. Furthermore, the complete 6.3-trillion-token dataset matches DCLM on MMLU while offering four times more unique real tokens. This enables effective training over long token horizons, with Nemotron-CC-trained models surpassing Llama 3.1 8B in multiple metrics, including a 5-point increase in MMLU and a 3.1-point rise in ARC-Challenge scores.


    Innovative Data Curation Techniques

    The development of Nemotron-CC rested on several key insights. By ensembling different model-based classifiers, NVIDIA was able to select a broader and more diverse set of high-quality tokens. Rephrasing low-quality text reduced noise and errors, while rephrasing high-quality text yielded diverse and valuable variants of the data. Disabling traditional heuristic filters for high-quality data further increased the yield of high-quality tokens without hurting accuracy.
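
    The classifier-ensembling idea can be illustrated with a minimal sketch. The article does not say how the classifier scores are combined, so taking the maximum score is an assumption made here for illustration, and both scoring functions are hypothetical toy stand-ins for model-based classifiers. The point it shows is that a document is kept if any one classifier rates it highly, which selects a broader set than a single classifier would.

```python
from typing import Callable, List

Classifier = Callable[[str], float]  # maps a document to a score in [0, 1]


def length_score(text: str) -> float:
    """Hypothetical toy scorer: longer documents (up to 50 words) score higher."""
    return min(len(text.split()) / 50.0, 1.0)


def vocabulary_score(text: str) -> float:
    """Hypothetical toy scorer: fraction of distinct words in the document."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0


def ensemble_score(text: str, classifiers: List[Classifier]) -> float:
    """Combine classifier scores by taking the maximum (an assumed rule)."""
    return max(clf(text) for clf in classifiers)


def select_high_quality(
    docs: List[str], classifiers: List[Classifier], threshold: float
) -> List[str]:
    """Keep documents whose ensembled score reaches the threshold."""
    return [d for d in docs if ensemble_score(d, classifiers) >= threshold]


docs = ["spam spam spam spam", "the quick brown fox jumps"]
kept = select_high_quality(docs, [length_score, vocabulary_score], threshold=0.5)
```

    Here the second document is too short to pass the length scorer alone, but the vocabulary scorer rates it highly, so the ensemble keeps it.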

    NVIDIA used its NeMo Curator tool to extract and refine data from Common Crawl, applying language filtering, deduplication, and quality classification. This process was complemented by synthetic data generation, which contributed roughly 1.9 trillion tokens to the dataset.
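
    The three filtering stages named above can be sketched in plain Python. This is not the NeMo Curator API: every function below is a hypothetical placeholder (an ASCII-letter ratio instead of a language-identification model, exact hashing instead of large-scale deduplication, and a word-count heuristic instead of a quality classifier), and the thresholds are arbitrary. It only shows the order of the stages.

```python
import hashlib
from typing import List


def is_english(text: str) -> bool:
    """Placeholder language filter: at least 90% of letters are ASCII."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return False
    ascii_letters = sum(1 for c in letters if c.isascii())
    return ascii_letters / len(letters) >= 0.9


def deduplicate(docs: List[str]) -> List[str]:
    """Exact deduplication on case- and whitespace-normalized text."""
    seen = set()
    unique = []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)  # keep the first occurrence
    return unique


def quality_score(text: str) -> float:
    """Placeholder quality classifier: word count scaled to [0, 1]."""
    return min(len(text.split()) / 20.0, 1.0)


def curate(docs: List[str], min_quality: float = 0.25) -> List[str]:
    """Apply language filtering, then deduplication, then quality filtering."""
    english = [d for d in docs if is_english(d)]
    unique = deduplicate(english)
    return [d for d in unique if quality_score(d) >= min_quality]


raw_docs = [
    "Common Crawl pages need careful filtering first",
    "common crawl pages  need careful filtering FIRST",  # near-identical copy
    "Привет мир это тест документ",                      # non-English
    "too short",                                         # low toy quality score
]
curated = curate(raw_docs)
```

    Only the first document survives all three stages in this toy run.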

    Future Prospects

    Nemotron-CC is positioned as a vital resource for pretraining state-of-the-art LLMs over varying token horizons. NVIDIA plans to expand its offerings by releasing more specialized datasets, including those focused on specific domains like mathematics, to further enhance LLM capabilities.



