{"id":4049703,"date":"2025-09-02T06:15:00","date_gmt":"2025-09-02T10:15:00","guid":{"rendered":"https:\/\/www.computerworld.com\/article\/4049703\/microsoft-researchers-develop-new-tech-for-video-ai-agents.html"},"modified":"2025-09-02T13:21:31","modified_gmt":"2025-09-02T17:21:31","slug":"microsoft-researchers-develop-new-tech-for-video-ai-agents","status":"publish","type":"post","link":"https:\/\/www.computerworld.com\/article\/4049703\/microsoft-researchers-develop-new-tech-for-video-ai-agents.html","title":{"rendered":"Microsoft researchers develop new tech for video AI agents"},"content":{"rendered":"<div id=\"remove_no_follow\">\n\t\t<div class=\"grid grid--cols-10@md grid--cols-8@lg article-column\">\n\t\t\t\t\t  <div class=\"col-12 col-10@md col-6@lg col-start-3@lg\">\n\t\t\t\t\t\t<div class=\"article-column__content\">\n<section class=\"wp-block-bigbite-multi-title\"><div class=\"container\"><\/div><\/section>\n\n\n\n<p>Microsoft researchers are developing technologies for a new class of video AI agents to explore three-dimensional spaces before making decisions.<\/p>\n\n\n\n<p>The technology framework, called MindJourney, uses a range of AI technologies to understand and analyze 3D spaces, reason about the surroundings, and predict movement, the researchers wrote in a <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/blog\/mindjourney-enables-ai-to-explore-simulated-3d-worlds-to-improve-spatial-interpretation\/\" target=\"_blank\" rel=\"noreferrer noopener\">blog entry<\/a> late last month.<\/p>\n\n\n\n<h5 class=\"wp-block-heading\"><strong>[ Related: <\/strong><a href=\"https:\/\/www.computerworld.com\/article\/3843138\/agentic-ai-ongoing-coverage-of-its-impact-on-the-enterprise.html\"><strong>Agentic AI &ndash; News and insights<\/strong><\/a><strong> ]<\/strong><\/h5>\n\n\n\n<p>MindJourney includes video-generation systems, vision language models (VLMs), and reasoning techniques that can predict surroundings, patterns, and movement. 
These technologies are packaged around &ldquo;world models&rdquo; that simulate real-world surroundings.<\/p>\n\n\n\n<p>Vision language models analyze pixels in a scene to identify objects and reason about them and their surroundings. For example, <a href=\"https:\/\/www.computerworld.com\/article\/4037662\/nvidias-new-genai-model-helps-robots-think-like-humans.html\">recent work by Nvidia<\/a> on its Cosmos VLMs helps robots move and take action in their surroundings.<\/p>\n\n\n\n<p>MindJourney explores spaces by combining real-world images with scenes generated by the world model. For example, the framework&rsquo;s reasoning capabilities generate multiple visual scenarios that agents may see when moving in different directions. This is much like how text-based AI generators work.<\/p>\n\n\n\n<p>&ldquo;This enhancement could enable agents more accurately interpret spatial relationships and physical dynamics, helping them to operate effectively in changing environments,&rdquo; the researchers wrote in the blog entry.<\/p>\n\n\n\n<p>According to the Microsoft researchers, VLMs excel at 2D surroundings, but the visual world is 3D; MindJourney provides better viewpoints of real-world scenarios and ultimately aims to forecast how scenes change over time.<\/p>\n\n\n\n<p>MindJourney &ldquo;sketches a concise camera trajectory, while the world model synthesizes the corresponding view at each step. 
The VLM then reasons over this multi-view evidence gathered during the interactive exploration,&rdquo; the researchers wrote in a <a href=\"https:\/\/arxiv.org\/abs\/2507.12508\" target=\"_blank\" rel=\"noreferrer noopener\">paper<\/a>.<\/p>\n\n\n\n<p>MindJourney&rsquo;s technologies could improve assistive robots and remote inspection, and enrich virtual and augmented reality experiences, the researchers wrote in the paper.<\/p>\n\n\n\n<p>But there are also concerns.<\/p>\n\n\n\n<p>&ldquo;More capable spatial reasoning can enhance autonomous surveillance systems or military platforms; and greater autonomy could displace certain manual-labor jobs,&rdquo; the researchers wrote.<\/p>\n\n\n\n<p>Early AI research, such as Google&rsquo;s milestone <a href=\"https:\/\/research.google.com\/archive\/unsupervised_icml2012.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">cat detector<\/a> (PDF), focused on identifying objects in still images through vision models.<\/p>\n\n\n\n<p>Video AI is the next frontier, with Nvidia leading the charge. Nvidia is focused on strong vision capabilities through robotic eyes. In late August, the company announced a new <a href=\"https:\/\/www.computerworld.com\/article\/4045542\/nvidias-new-computer-gives-ai-brains-to-robots.html\">computer for robots called Jetson Thor<\/a> that is capable of running VLMs locally.<\/p>\n\n\n\n<p>Most of the popular large language models are now able to handle images, video, and text, but are limited in scope when it comes to visual AI.<\/p>\n<\/div><\/div><\/div><\/div>","protected":false},"excerpt":{"rendered":"<p>Microsoft researchers are developing technologies for a new class of video AI agents to explore three-dimensional spaces before making decisions. The technology framework, called MindJourney, uses a range of AI technologies to understand and analyze 3D spaces, reason about the surroundings, and predict movement, the researchers wrote in a blog entry late last month. 
[ [&hellip;]<\/p>\n","protected":false},"author":513,"featured_media":100062339,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"__idg_published_ids":[],"__idg_published_status":"draft","embargo_date":"","multi_title":"{\"titles\":{\"headline\":{\"value\":\"Microsoft researchers develop new tech for video AI agents\",\"additional\":{\"short_title\":\"Microsoft researchers develop new tech for video AI agents\",\"headline_subheadline\":\"The MindJourney framework helps AI agents explore simulated 3D spaces to improve spatial reasoning.\",\"headline_desc\":\"The MindJourney framework helps AI agents explore simulated 3D spaces to improve spatial reasoning.\"}},\"seo\":{\"value\":\"Microsoft researchers develop new tech for video AI agents\",\"additional\":{\"seo_desc\":\"The MindJourney framework helps AI agents explore simulated 3D spaces to improve spatial reasoning.\"}},\"social\":{\"value\":\"Microsoft researchers develop new tech for video AI agents\",\"additional\":{\"social_desc\":\"The MindJourney framework helps AI agents explore simulated 3D spaces to improve spatial 
reasoning.\"}}},\"subtitles\":[]}","old_id_in_onecms":"","_idg_updated_flag":false,"_idg_updated_date":"","hreflang_xdefault":0,"content_type":"News","suppress_html_meta":"{}","byline":"","featured_video_id":0,"supress_floating_video":false,"prevent_index":0,"has_duration":0,"teaser_paragraphs":"","is_translated_post":0,"idg_original_post_id":0,"idg_translated_post_ids":[],"idg_original_post_publication":"","idg_original_post_language":"","idg_original_post_brand":"","reviews":null,"suppress_monetization":"{}","is_premium":0,"external_post_link":"","suppress_fake_sidebar":"{}","first_published_date":"2025-09-02T13:07:57-04:00","hide_featured_image_for_post":false,"post_featured_image_nocaption":true,"post_featured_image_caption":"","automatic_content_time":3,"manual_content_time":0,"most_popular_author":null,"more_from_author":null,"footnotes":null},"categories":[1885,2200,2888,1896,2199],"tags":[],"languages":[21],"editions":[12],"publication":[9,10],"territory":[],"story_types":[32],"article_type":[],"sponsorships":[],"blogs":[],"podcast_series":[],"origin":[7179],"coauthors":[6776],"class_list":{"0":"post-4049703","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-artificial-intelligence","8":"category-augmented-reality","9":"category-generative-ai","10":"category-robotics","11":"category-virtual-reality","12":"languages-en","13":"editions-global","14":"publication-computerworld","15":"publication-us-default","16":"story_types-news","17":"origin-wp"},"jetpack_featured_media_url":"https:\/\/www.computerworld.com\/wp-content\/uploads\/2025\/09\/4049703-0-77812500-1756833723-agentai.jpg?quality=50&strip=all","eyebrow":{"eyebrow":"news","eyebrow_style":"default","eyebrow_feed_title":"news","eyebrow_feed_style":"default"},"review_score":null,"article_type_name":"","author_name":"Agam Shah","author_meta":[{"authorID":513,"name":"Agam 
Shah","url":"https:\/\/www.computerworld.com\/profile\/agam-shah\/","img":"","defaultUrl":"https:\/\/secure.gravatar.com\/avatar\/900322c501916366324b839032259b3f273a3deaedd7d9cade31b31feb0f3c8c?s=96&d=mm&r=g","profileImage":null,"job_title":"Senior Reporter"}],"multiple_name":"Agam Shah","_embedded":"Agam Shah","_links":{"self":[{"href":"https:\/\/www.computerworld.com\/wp-json\/wp\/v2\/posts\/4049703","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.computerworld.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.computerworld.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.computerworld.com\/wp-json\/wp\/v2\/users\/513"}],"replies":[{"embeddable":true,"href":"https:\/\/www.computerworld.com\/wp-json\/wp\/v2\/comments?post=4049703"}],"version-history":[{"count":0,"href":"https:\/\/www.computerworld.com\/wp-json\/wp\/v2\/posts\/4049703\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.computerworld.com\/wp-json\/wp\/v2\/media\/100062339"}],"wp:attachment":[{"href":"https:\/\/www.computerworld.com\/wp-json\/wp\/v2\/media?parent=4049703"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.computerworld.com\/wp-json\/wp\/v2\/categories?post=4049703"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.computerworld.com\/wp-json\/wp\/v2\/tags?post=4049703"},{"taxonomy":"languages","embeddable":true,"href":"https:\/\/www.computerworld.com\/wp-json\/wp\/v2\/languages?post=4049703"},{"taxonomy":"editions","embeddable":true,"href":"https:\/\/www.computerworld.com\/wp-json\/wp\/v2\/editions?post=4049703"},{"taxonomy":"publication","embeddable":true,"href":"https:\/\/www.computerworld.com\/wp-json\/wp\/v2\/publication?post=4049703"},{"taxonomy":"territory","embeddable":true,"href":"https:\/\/www.computerworld.com\/wp-json\/wp\/v2\/territory?post=4049703"},{"taxonomy":"story_types","embeddable":true,"href":"https:\/\/www.computerworld.com\/wp-j
son\/wp\/v2\/story_types?post=4049703"},{"taxonomy":"article_type","embeddable":true,"href":"https:\/\/www.computerworld.com\/wp-json\/wp\/v2\/article_type?post=4049703"},{"taxonomy":"sponsorships","embeddable":true,"href":"https:\/\/www.computerworld.com\/wp-json\/wp\/v2\/sponsorships?post=4049703"},{"taxonomy":"blogs","embeddable":true,"href":"https:\/\/www.computerworld.com\/wp-json\/wp\/v2\/blogs?post=4049703"},{"taxonomy":"podcast_series","embeddable":true,"href":"https:\/\/www.computerworld.com\/wp-json\/wp\/v2\/podcast_series?post=4049703"},{"taxonomy":"origin","embeddable":true,"href":"https:\/\/www.computerworld.com\/wp-json\/wp\/v2\/origin?post=4049703"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.computerworld.com\/wp-json\/wp\/v2\/coauthors?post=4049703"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}