• More News

  • Can Large Language Models Help Robots To Navigate?

    int2

    ByNidhi Agarwal

    Someday, you might want a home robot to carry laundry to the basement, a task requiring it to combine verbal instructions with visual cues. However, this is challenging for AI agents as current systems need multiple complex machine-learning models and extensive visual data, which are hard to obtain.

    Researchers from MIT and the MIT-IBM Watson AI Lab have developed a navigation method that translates visual inputs into text descriptions. A large language model then processes these descriptions to guide a robot through multistep tasks. This approach, which uses text captions instead of computationally intensive visual representations, allows the model to generate extensive synthetic training data efficiently. 

    Solving a vision problem with language

    Researchers have developed a navigation method for robots using a simple captioning model that translates visual observations into text descriptions. These descriptions, along with verbal instructions, are input into a large language model, which then decides the robot’s next step. After each step, the model generates a scene caption to help update the robot’s trajectory, continually guiding it towards its goal. The information is standardized in templates, presenting it as a series of choices based on the surroundings, like choosing to move towards a door or an office, streamlining the decision-making process.

    Advantages of language

    When tested, this language-based navigation approach didn’t outperform vision-based methods but offered distinct advantages. It uses fewer resources, allowing for rapid synthetic data generation—for instance, creating 10,000 synthetic trajectories from only 10 real-world ones. Also, its use of natural language makes the system more understandable to humans and versatile across different tasks, using a single type of input. However, it does lose some information that vision-based models capture, like depth. Surprisingly, combining this language-based approach with vision-based methods improves navigation capabilities.

    Researchers aim to enhance their method by developing a navigation-focused captioner and exploring how large language models can demonstrate spatial awareness to improve navigation.

  • Job Shop

    Subscribe today to the industry leading magazine and newsletter

    Keep up-to-date with information about worldwide trends straight to your inbox or through our printed magazine.