
    Meta AI and NYU Researchers Propose E-RLHF to Combat LLM Jailbreaking

    By CryptoExpert | August 18, 2024 | 5 Mins Read


    Large Language Models (LLMs) have gained prominence in deep learning, demonstrating strong capabilities across domains such as assistance, code generation, healthcare, and theorem proving. Training typically involves two stages: pretraining on massive corpora, followed by an alignment step using Reinforcement Learning from Human Feedback (RLHF). Even so, LLMs still struggle to generate appropriate content reliably. Despite their effectiveness across many tasks, these models are prone to producing offensive or inappropriate content, including hate speech, malware, fake information, and social biases. This vulnerability stems from the unavoidable presence of harmful elements in their pretraining datasets. The alignment process, which is crucial for addressing these issues, is not universally applicable: it depends on specific use cases and user preferences, which makes it a complex challenge for researchers.

    Researchers have made significant efforts to enhance LLM safety through alignment techniques, including supervised fine-tuning, red teaming, and refinements to the RLHF process. However, these attempts have led to an ongoing cycle of increasingly sophisticated alignment methods and more inventive “jailbreaking” attacks. Existing attacks fall into three main categories: baseline methods, LLM automation and suffix-based attacks, and manipulation of the decoding process. Baseline techniques like AutoPrompt and ARCA optimize tokens to elicit harmful content, while LLM automation methods such as AutoDAN and GPTFuzzer employ genetic algorithms to create plausible jailbreaking prompts. Suffix-based attacks like GCG append an optimized adversarial suffix to the prompt, with related work aiming to make such suffixes more interpretable. Despite these efforts, current methods struggle with semantic plausibility and cross-architecture applicability. The lack of a principled universal defense against jailbreaking attacks, together with limited theoretical understanding of the phenomenon, remains a significant challenge in the field of LLM safety.

    Researchers from NYU and Meta AI (FAIR) introduce a theoretical framework for analyzing LLM pretraining and jailbreaking vulnerabilities. By decoupling input prompts and representing outputs as longer text fragments, the researchers quantify adversary strength and model behavior. They provide a PAC-Bayesian generalization bound for pretraining, which suggests that harmful outputs are inevitable in high-performing models. The framework also shows that jailbreaking remains unpreventable even after safety alignment. Having identified a key drawback in RL fine-tuning objectives, the researchers propose methods to train safer, more resilient models without compromising performance. This approach offers new insights into LLM safety and potential improvements in alignment techniques.
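For context, the RL fine-tuning objective mentioned above is commonly written in the KL-regularized form below. This is the standard textbook formulation, not a transcription from the paper; the symbols (policy \pi_\theta, reference model \pi_ref, reward r, prompt distribution \mathcal{D}, and coefficient \beta) are notation assumed here for illustration.

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D}}
\Big[
  \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \big[ r(x, y) \big]
  \;-\;
  \beta \, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
\Big]
```

The KL term keeps the fine-tuned policy close to the reference model's output distribution for the same prompt x, including when x is a harmful prompt.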

    The framework models prompts as query-concept tuples and LLMs as generators of longer text fragments called explanations. The researchers introduce key assumptions, define notions of harmfulness, and present a non-vacuous PAC-Bayesian generalization bound for pretraining language models. This bound implies that well-trained language models may exhibit harmful behavior when they have been exposed to harmful content during training. Building on these theoretical insights, the researchers propose E-RLHF (Expanded Reinforcement Learning from Human Feedback), an approach intended to improve language model alignment and reduce jailbreaking vulnerabilities. E-RLHF modifies the standard RLHF process by expanding the safety zone in the output distribution: in the KL-divergence term of the objective function, harmful prompts are replaced with safety-transformed versions. The aim is to increase the share of safe explanations in the model’s output for harmful prompts without affecting performance on non-harmful ones. The approach can also be integrated into the Direct Preference Optimization (DPO) objective, which eliminates the need for an explicit reward model.
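The DPO integration described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy log-probability model, and the prefix-based `add_safe_prefix` transform are all hypothetical. The only change relative to the standard DPO loss is that, for prompts flagged as harmful, the reference model is conditioned on the safety-transformed prompt.

```python
import math
from typing import Callable

# (prompt, response) -> log-probability of the response given the prompt.
LogProbFn = Callable[[str, str], float]


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def e_dpo_loss(
    policy_logp: LogProbFn,
    ref_logp: LogProbFn,
    prompt: str,
    chosen: str,
    rejected: str,
    *,
    is_harmful: bool,
    safe_transform: Callable[[str], str],
    beta: float = 0.1,
) -> float:
    """DPO-style loss for one preference pair.

    Assumption: for harmful prompts, only the reference model sees the
    safety-transformed prompt; the policy still sees the raw prompt.
    """
    ref_prompt = safe_transform(prompt) if is_harmful else prompt
    chosen_margin = policy_logp(prompt, chosen) - ref_logp(ref_prompt, chosen)
    rejected_margin = policy_logp(prompt, rejected) - ref_logp(ref_prompt, rejected)
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# --- Toy setup (hypothetical, for illustration only) ---
SAFE_PREFIX = "Answer responsibly: "


def add_safe_prefix(prompt: str) -> str:
    """Hypothetical safety transform: prepend a fixed instruction."""
    return SAFE_PREFIX + prompt


def toy_logp(prompt: str, response: str) -> float:
    """Toy model: favors refusing only when the safe prefix is present."""
    prefers_refusal = prompt.startswith(SAFE_PREFIX)
    is_refusal = response == "refuse"
    return -1.0 if prefers_refusal == is_refusal else -5.0


standard = e_dpo_loss(
    toy_logp, toy_logp, "harmful request", "refuse", "comply",
    is_harmful=False, safe_transform=add_safe_prefix,
)
expanded = e_dpo_loss(
    toy_logp, toy_logp, "harmful request", "refuse", "comply",
    is_harmful=True, safe_transform=add_safe_prefix,
)
print(f"standard DPO loss: {standard:.4f}")
print(f"expanded (E-DPO style) loss: {expanded:.4f}")
```

In this toy setup the policy starts out identical to the reference model, so the standard loss sits at log 2, while the expanded variant yields a larger loss on the same pair because the reference, conditioned on the safety-transformed prompt, already prefers the refusal.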


    The researchers conducted experiments using the alignment handbook code base and a publicly available SFT model. They evaluated their proposed E-DPO method on the HarmBench and AdvBench datasets, measuring safety alignment against a range of jailbreak adversaries. E-DPO reduced the average Attack Success Rate (ASR) across all adversaries on both datasets, reaching 36.95% on HarmBench and 20.89% on AdvBench, an improvement over standard DPO. The study also assessed helpfulness using MT-Bench, where E-DPO scored 6.6, surpassing the SFT model’s score of 6.3. The researchers concluded that E-DPO enhances safety alignment without sacrificing model helpfulness and that it can be combined with system prompts for further safety improvements.
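As a rough illustration of the metric reported above, the sketch below computes a per-adversary Attack Success Rate and then averages across adversaries. It assumes an unweighted mean over adversaries, which may differ from the paper's exact protocol; the adversary names and outcomes are placeholder data, not results from the study.

```python
from statistics import mean


def attack_success_rate(outcomes: list[bool]) -> float:
    """Percentage of attack attempts judged successful."""
    if not outcomes:
        raise ValueError("at least one attack outcome is required")
    return 100.0 * sum(outcomes) / len(outcomes)


def average_asr(results: dict[str, list[bool]]) -> float:
    """Unweighted mean of per-adversary ASR values (an assumption)."""
    return mean(attack_success_rate(o) for o in results.values())


# Placeholder data: True means the jailbreak attempt succeeded.
toy_results = {
    "adversary_a": [True, False, False, False],  # 25% ASR
    "adversary_b": [True, True, False, False],   # 50% ASR
}
print(average_asr(toy_results))  # 37.5
```

A lower average ASR indicates that fewer jailbreak attempts succeeded against the model.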

    The study presents a theoretical framework for language model pretraining and jailbreaking that dissects input prompts into query and concept pairs. The analysis yields two key theoretical results: first, language models can mimic the world after pretraining, which leads to harmful outputs for harmful prompts; and second, jailbreaking is inevitable because of alignment challenges. Guided by these insights, the team developed a simple yet effective technique to enhance safety alignment. Their experiments demonstrated improved resilience to jailbreak attacks with this methodology, contributing to ongoing efforts to create safer and more robust language models.



    Asjad is an intern consultant at Marktechpost. He is pursuing a B.Tech in mechanical engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is always researching the applications of machine learning in healthcare.


