
    NVIDIA Unveils Nemotron-CC: A Trillion-Token Dataset for Enhanced LLM Training

    By CryptoExpert, May 8, 2025

    Joerg Hiller
    May 07, 2025 15:38

    NVIDIA introduces Nemotron-CC, a trillion-token dataset for large language models, and integrates its curation pipeline with NeMo Curator. The pipeline aims to improve both the quality and the quantity of training data.





    NVIDIA has integrated its Nemotron-CC pipeline into NeMo Curator, offering a new approach to curating high-quality datasets for large language models (LLMs). According to NVIDIA, the Nemotron-CC dataset is a 6.3-trillion-token English-language collection built from Common Crawl and is intended to significantly improve LLM accuracy.

    Advancements in Data Curation

    The Nemotron-CC pipeline addresses the limitations of traditional data curation methods, which often discard potentially useful data due to heuristic filtering. By employing classifier ensembling and synthetic data rephrasing, the pipeline generates 2 trillion tokens of high-quality synthetic data, recovering up to 90% of content lost by filtering.

    Innovative Pipeline Features

    The pipeline’s data curation process begins with HTML-to-text extraction using jusText, followed by language identification with FastText. It then applies deduplication to remove redundant data, using NVIDIA RAPIDS libraries for efficient processing. The process includes 28 heuristic filters to ensure data quality and a PerplexityFilter module for further refinement.
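The deduplication and heuristic filtering stages described above can be illustrated with a minimal, standard-library sketch. It does not use NeMo Curator or RAPIDS. The function names, the two filters, and their thresholds are hypothetical stand-ins chosen for illustration, not the 28 filters the pipeline actually applies.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash alike."""
    return re.sub(r"\s+", " ", text).strip().lower()


def exact_dedup(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, compared by content hash."""
    seen: set[str] = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def min_word_count(doc: str, threshold: int = 5) -> bool:
    """Hypothetical heuristic: reject very short documents."""
    return len(doc.split()) >= threshold


def max_symbol_ratio(doc: str, threshold: float = 0.3) -> bool:
    """Hypothetical heuristic: reject documents dominated by symbols."""
    if not doc:
        return False
    symbols = sum(1 for ch in doc if not (ch.isalnum() or ch.isspace()))
    return symbols / len(doc) <= threshold


# Assumed filter list; a real pipeline would chain many more checks.
HEURISTIC_FILTERS = [min_word_count, max_symbol_ratio]


def curate(docs: list[str]) -> list[str]:
    """Deduplicate, then keep only documents that pass every heuristic filter."""
    return [d for d in exact_dedup(docs) if all(f(d) for f in HEURISTIC_FILTERS)]


corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick   brown fox jumps over the lazy dog.",  # duplicate after normalization
    "Too short.",
    "$$$ ### !!! @@@ %%% ^^^ &&& *** ((( )))",
    "Common Crawl snapshots contain many near-identical pages.",
]
curated = curate(corpus)
```

At Common Crawl scale the same idea would run on GPU-accelerated, distributed data frames rather than Python lists, and it would typically be paired with fuzzy deduplication, which this sketch does not attempt.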

    Quality labeling is achieved through an ensemble of classifiers that assess and categorize documents into quality levels, facilitating targeted synthetic data generation. This approach enables the creation of diverse QA pairs, distilled content, and organized knowledge lists from the text.
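As a rough sketch of how classifier ensembling might feed quality labeling, the example below combines per-classifier scores and maps the result to a quality level. The combination rule (taking the maximum), the five-level bucketing, the classifier names, and the routing policy are all assumptions made for illustration. The article does not specify how the ensemble combines scores or how levels map to generation tasks.

```python
def ensemble_score(scores: dict[str, float]) -> float:
    """Combine per-classifier scores by taking the maximum (an assumed rule)."""
    return max(scores.values())


def quality_bucket(score: float, num_buckets: int = 5) -> int:
    """Map a score in [0, 1] to an integer level 0 (lowest) .. num_buckets - 1."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    return min(int(score * num_buckets), num_buckets - 1)


def route(bucket: int, num_buckets: int = 5) -> str:
    """Hypothetical routing: top level gets QA/distillation, the rest get rephrased."""
    if bucket == num_buckets - 1:
        return "generate_qa_distill_knowledge_lists"
    return "rephrase"


# Hypothetical scores from three quality classifiers for two documents.
doc_scores = {
    "doc_a": {"clf_1": 0.2, "clf_2": 0.9, "clf_3": 0.5},
    "doc_b": {"clf_1": 0.1, "clf_2": 0.35, "clf_3": 0.3},
}

labels = {
    doc_id: quality_bucket(ensemble_score(scores))
    for doc_id, scores in doc_scores.items()
}
plans = {doc_id: route(bucket) for doc_id, bucket in labels.items()}
```

Taking the maximum lets a document count as high quality if any one classifier rates it highly, which favors recall. Averaging the scores would be a stricter alternative.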

    Impact on LLM Training

    According to NVIDIA, training LLMs on the Nemotron-CC dataset yields measurable gains. For instance, a Llama 3.1 model trained on a 1-trillion-token subset of Nemotron-CC scored 5.6 points higher on MMLU than models trained on traditional datasets. Models trained over a long token horizon on data that included Nemotron-CC also saw a 5-point boost in benchmark scores.

    Getting Started with Nemotron-CC

    The Nemotron-CC pipeline is available to developers who want to pretrain foundation models or perform domain-adaptive pretraining in a specific field. NVIDIA provides a step-by-step tutorial and APIs for customization, so users can tune the pipeline to their needs. The integration into NeMo Curator supports building both pretraining and fine-tuning datasets.

    For more information, visit the NVIDIA blog.
