    Crypto Go Lore News

    Microsoft drops Florence-2, a unified model to handle a variety of vision tasks

    By CryptoExpert, June 19, 2024

    Today, Microsoft’s Azure AI team dropped a new vision foundation model called Florence-2 on Hugging Face.

    Available under a permissive MIT license, the model can handle a variety of vision and vision-language tasks using a unified, prompt-based representation. It comes in two sizes, 232M and 771M parameters, and already excels at tasks such as captioning, object detection, visual grounding and segmentation, performing on par with or better than many large vision models.

    While the model's real-world performance is yet to be tested, the work is expected to give enterprises a single, unified approach to handling different types of vision applications. This would spare them from investing in separate task-specific vision models that fail to go beyond their primary function without extensive fine-tuning.

    What makes Florence-2 unique?

    Today, large language models (LLMs) sit at the heart of enterprise operations. A single model can provide summaries, write marketing copy and even handle customer service in many cases. The level of adaptability across domains and tasks has been amazing. But this success has also left researchers wondering: Can vision models, which have been largely task-specific, do the same?

    At their core, vision tasks are more complex than text-based natural language processing (NLP). They demand comprehensive perceptual ability. Essentially, to achieve a universal representation of diverse vision tasks, a model must understand spatial data across different scales, from broad image-level concepts such as object location to fine-grained pixel details, as well as semantic detail ranging from high-level captions to detailed descriptions.

    When Microsoft tried solving this, it found two key roadblocks: a scarcity of comprehensively annotated visual datasets, and the absence of a unified pretraining framework with a single network architecture that integrates the ability to understand spatial hierarchy and semantic granularity.

    To address these, the company first used specialized models to generate a visual dataset called FLD-5B. It included a total of 5.4 billion annotations for 126 million images, covering details from high-level descriptions to specific regions and objects. Then, using this data, it trained Florence-2, which uses a sequence-to-sequence architecture (a type of neural network designed for tasks involving sequential data) integrating an image encoder and a multi-modality encoder-decoder. This enables the model to handle various vision tasks without requiring task-specific architectural modifications.
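
To put the dataset figures above in perspective, here is a quick back-of-the-envelope calculation of how densely FLD-5B is annotated. The two totals come from the article; the function name is only illustrative.

```python
# Figures quoted in the article for the FLD-5B dataset.
TOTAL_ANNOTATIONS = 5.4e9  # 5.4 billion annotations
TOTAL_IMAGES = 126e6       # 126 million images


def annotations_per_image(annotations: float, images: float) -> float:
    """Return the average number of annotations attached to each image."""
    return annotations / images


avg = annotations_per_image(TOTAL_ANNOTATIONS, TOTAL_IMAGES)
print(f"~{avg:.1f} annotations per image")  # ~42.9 annotations per image
```

In other words, each image carries roughly 43 annotations on average, which is consistent with the dataset covering everything from whole-image descriptions down to individual regions and objects.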

    “All annotations in the dataset, FLD-5B, are uniformly standardized into textual outputs, facilitating a unified multi-task learning approach with consistent optimization with the same loss function as the objective,” the researchers wrote in the paper detailing the model. “The outcome is a versatile vision foundation model capable of performing a variety of tasks… all within a single model governed by a uniform set of parameters. Task activation is achieved through textual prompts, reflecting the approach used by large language models.”
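
One way to picture "standardized into textual outputs" is to turn a bounding box into a short string that a text decoder can be trained to emit like any other sequence. The sketch below is only an illustration under stated assumptions: the `<loc_N>` token format, the 1,000 coordinate bins and all function names are hypothetical and are not confirmed by the article.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels

NUM_BINS = 1000  # assumed number of coordinate bins (illustrative)


def quantize(value: float, size: int, bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate in [0, size] to an integer bin in [0, bins - 1]."""
    index = int(value * bins / size)
    return min(max(index, 0), bins - 1)


def box_to_text(label: str, box: Box, width: int, height: int) -> str:
    """Serialize one labeled box as the label followed by four location tokens."""
    x0, y0, x1, y1 = box
    bins = [
        quantize(x0, width), quantize(y0, height),
        quantize(x1, width), quantize(y1, height),
    ]
    return label + "".join(f"<loc_{b}>" for b in bins)


def serialize_detections(detections: List[Tuple[str, Box]],
                         width: int, height: int) -> str:
    """Join all labeled boxes for one image into a single target string."""
    return "".join(box_to_text(lbl, box, width, height) for lbl, box in detections)


target = serialize_detections([("car", (64, 120, 320, 400))], width=640, height=480)
print(target)  # car<loc_100><loc_250><loc_500><loc_833>
```

Once every annotation type is expressed as text in this way, a caption and a detection result are both just target token sequences, which is what makes a single loss function across tasks possible.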

    Performance better than larger models

    When prompted with images and text inputs, Florence-2 handles a variety of tasks, including object detection, captioning, visual grounding and visual question answering. More importantly, it delivers this with quality on par with or better than many larger models.
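
Prompt-based task selection means the caller changes the text prompt rather than the model. The following is a minimal sketch of that idea; the task names, the prompt strings and the `build_prompt` helper are assumptions made for illustration, not an API documented in the article.

```python
from typing import Optional

# Hypothetical task-to-prompt table; the exact prompt strings are assumptions.
TASK_PROMPTS = {
    "caption": "<CAPTION>",
    "object_detection": "<OD>",
    "phrase_grounding": "<CAPTION_TO_PHRASE_GROUNDING>",
}

# Tasks that need extra text appended to the prompt (e.g. the phrase to locate).
TASKS_WITH_TEXT_INPUT = {"phrase_grounding"}


def build_prompt(task: str, text_input: Optional[str] = None) -> str:
    """Return the text prompt that selects a task for a single shared model."""
    if task not in TASK_PROMPTS:
        raise ValueError(f"unknown task: {task!r}")
    prompt = TASK_PROMPTS[task]
    if task in TASKS_WITH_TEXT_INPUT:
        if not text_input:
            raise ValueError(f"task {task!r} needs extra text input")
        return prompt + text_input
    return prompt


print(build_prompt("object_detection"))               # <OD>
print(build_prompt("phrase_grounding", "a red car"))  # <CAPTION_TO_PHRASE_GROUNDING>a red car
```

The same image and the same weights would be used in every case; only the string passed alongside the image differs.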

    For instance, in a zero-shot captioning test on the COCO dataset, both the 232M and 771M versions of Florence-2 outperformed DeepMind’s 80B-parameter Flamingo visual language model, with CIDEr scores of 133 and 135.6, respectively. They even did better than Microsoft’s own visual grounding-specific Kosmos-2 model.
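
The size gap behind that comparison is easy to quantify from the parameter counts quoted in the article. This short sketch computes how many times larger Flamingo is than each Florence-2 variant; the dictionary labels are illustrative names.

```python
FLAMINGO_PARAMS = 80e9  # 80B parameters, as quoted in the article

FLORENCE2_PARAMS = {
    "Florence-2 (232M)": 232e6,
    "Florence-2 (771M)": 771e6,
}

# Ratio of Flamingo's parameter count to each Florence-2 variant.
ratios = {name: FLAMINGO_PARAMS / params for name, params in FLORENCE2_PARAMS.items()}

for name, ratio in ratios.items():
    print(f"Flamingo is ~{ratio:.0f}x larger than {name}")
# Flamingo is ~345x larger than Florence-2 (232M)
# Flamingo is ~104x larger than Florence-2 (771M)
```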

    When fine-tuned with public human-annotated data, Florence-2, despite its compact size, was able to compete closely with several larger specialist models across tasks like visual question answering.

    “The pre-trained Florence-2 backbone enhances performance on downstream tasks, e.g. COCO object detection and instance segmentation, and ADE20K semantic segmentation, surpassing both supervised and self-supervised models,” the researchers noted. “Compared to pre-trained models on ImageNet, ours improves training efficiency by 4X and achieves substantial improvements of 6.9, 5.5, and 5.9 points on COCO and ADE20K datasets.”

    As of now, both pre-trained and fine-tuned versions of Florence-2 232M and 771M are available on Hugging Face under a permissive MIT license that allows unrestricted distribution and modification for commercial or private use.

    It will be interesting to see how developers put it to use and eliminate the need for separate vision models for different tasks. Small, task-agnostic models can not only spare developers from working with multiple models but also cut compute costs by a significant margin.
