The field of artificial intelligence is evolving at an unprecedented pace, and OpenAI is at the forefront of this transformation. In a recent announcement, OpenAI introduced major enhancements to ChatGPT that let it see, hear, and speak. The update adds voice and image capabilities, making ChatGPT more versatile and accessible.
Voice & Image Capabilities
One of the most significant updates to ChatGPT is the introduction of voice capabilities. Users can now engage in dynamic, back-and-forth conversations with their AI assistant. Whether you’re on the move, looking for a bedtime story for your family, or settling a dinner table debate, ChatGPT’s voice feature has got you covered.
Voice in Action
Unlocking the Power of Voice
To get started with voice, open Settings > New Features in the mobile app and opt into voice conversations. Then tap the headphone button in the top-right corner of the home screen and choose your preferred voice from five distinct options. The voices come from a new text-to-speech model that can generate human-like audio from text and just a few seconds of sample speech. OpenAI worked with professional voice actors to create each voice, and it uses Whisper, its open-source speech recognition system, to transcribe your spoken words into text.
Image: A New Dimension of Interaction
Seeing the World through ChatGPT’s Eyes
In addition to voice, ChatGPT can now work with images. Users can show it one or more images and talk about them: troubleshoot why a grill won't start, explore the contents of the fridge to plan a meal, or analyze a complex graph of work-related data. ChatGPT can help you understand and discuss a wide range of visual content.
How It Works
To get started with images, tap the photo button to capture or choose an image; on iOS and Android, tap the plus button first. You can also discuss multiple images or use the drawing tool to guide your assistant. Image understanding is powered by multimodal GPT-3.5 and GPT-4, which apply their language reasoning skills to a wide range of images, including photographs, screenshots, and documents containing both text and images.
Gradual Deployment of Voice & Image
These new features reflect OpenAI's approach of deploying advanced AI models gradually, with safety in mind. Voice technology opens the door to creative and accessibility-focused applications, but it also poses new risks, such as malicious actors impersonating public figures or committing fraud. OpenAI is therefore using it for one specific use case, voice chat, with voices created by voice actors it has worked with directly. It is also collaborating in a targeted way with select partners, such as Spotify, which is piloting a Voice Translation feature for podcasts.
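Vision-based models are promising, but they also bring challenges, such as hallucinations about people and users relying on the model's interpretation of images in high-stakes domains. Before broader deployment, OpenAI tested the model with red teamers and alpha testers to assess these risks. Its collaboration with Be My Eyes, a free mobile app for blind and low-vision people, also helped it understand the model's uses and limitations. In addition, OpenAI took technical measures to significantly limit ChatGPT's ability to analyze and make direct statements about people, since the model is not always accurate and such systems should respect individuals' privacy.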
Transparency and Limitations
OpenAI says it is transparent about the model's limitations and discourages higher-risk use cases, such as relying on ChatGPT for specialized topics in fields like research, without proper verification. The model is proficient at transcribing English text but performs poorly with some other languages, especially those written in non-Roman scripts, so OpenAI advises non-English users against using ChatGPT for transcription.
Voice and image capabilities are rolling out to Plus and Enterprise users first, with plans to expand access to other groups, including developers, soon after. OpenAI hopes to use real-world usage and feedback to improve and refine these features.