OpenAI's GPT-4 showcases multimodal capabilities that could revolutionize various industries, but it still lags behind human performance when reading text from complex images.
GPT-4's multimodal capabilities are being showcased through this week's releases, spanning text-to-3D, speech-to-text, and embodiment, with language and visual models complementing each other and solving CAPTCHAs without slowing down.
OpenAI's GPT-4 can attain outstanding results, exceeding human performance levels on medical questions even without vision, but it struggles with questions that depend on media such as images.
GPT-4's multimodal capabilities allow it to understand humor, read menus, interpret the physical world, and read graphs and text from images, which has the potential to change the world.
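As a rough illustration of how such an image-plus-question request might look in practice, here is a minimal sketch using the OpenAI Python SDK's chat-completions interface; the model name, image URL, and prompt are illustrative assumptions rather than details from the demos described above.

```python
# Minimal sketch: asking a multimodal model a question about an image.
# Assumes the openai Python SDK (>=1.0) and an OPENAI_API_KEY in the environment;
# the model name and image URL below are placeholders, not from the source.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # assumed vision-capable model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Which dishes on this menu are vegetarian?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/menu.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```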
GPT-4 scores 78 on the VQA benchmark for reading text from complex images, outperforming the previous state-of-the-art model, but it still falls short of human performance by only about 7%.
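For context on what a VQA-style score measures, the standard VQA accuracy metric treats an answer as fully correct when at least three of the ten human annotators gave the same answer; the short sketch below implements that formula for clarity and is not taken from the source.

```python
# Sketch of the standard VQA accuracy metric (for illustration only):
# an answer scores min(#matching human answers / 3, 1.0), so agreeing with
# at least three of the ten annotators counts as fully correct.
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    matches = sum(1 for a in human_answers if a.strip().lower() == predicted.strip().lower())
    return min(matches / 3.0, 1.0)

# Example: 4 of 10 annotators wrote "stop sign", so the prediction scores 1.0.
annotators = ["stop sign"] * 4 + ["sign"] * 6
print(vqa_accuracy("Stop Sign", annotators))  # -> 1.0
```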
Natural language can now be translated directly into Blender code, creating detailed 3D models with fascinating physics, and the borders between text, image, 3D, and embodiment are beginning to break down.
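To make the Blender example concrete, below is the kind of short bpy script a language model might emit for a prompt like "drop a ball onto a plane"; the object sizes and physics settings are illustrative assumptions, not the code shown in the demo.

```python
# Illustrative sketch of Blender Python (bpy) code a language model might generate
# for "drop a ball onto a plane"; the values here are assumptions, not from the demo.
import bpy

# Ground plane the ball will collide with (passive rigid body).
bpy.ops.mesh.primitive_plane_add(size=10, location=(0, 0, 0))
plane = bpy.context.active_object
bpy.ops.rigidbody.object_add()
plane.rigid_body.type = 'PASSIVE'

# Sphere placed above the plane (active rigid body, so gravity pulls it down).
bpy.ops.mesh.primitive_uv_sphere_add(radius=0.5, location=(0, 0, 5))
ball = bpy.context.active_object
bpy.ops.rigidbody.object_add()
ball.rigid_body.type = 'ACTIVE'
ball.rigid_body.mass = 2.0

# Run this from Blender's scripting tab, then press play to simulate the physics.
```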
Conformer, a new speech recognition API, outperforms OpenAI's Whisper API with fewer errors and has the potential to revolutionize industries such as law and medicine.
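For comparison, the basic shape of a speech-to-text call against OpenAI's Whisper API, the baseline Conformer is measured against here, looks roughly like the sketch below; the audio file name is a placeholder, and a hosted Conformer endpoint would follow a similar upload-then-transcribe pattern.

```python
# Rough sketch of a speech-to-text request against OpenAI's Whisper API.
# Assumes the openai Python SDK (>=1.0) and OPENAI_API_KEY in the environment;
# "deposition.mp3" is a placeholder file name, not from the source.
from openai import OpenAI

client = OpenAI()

with open("deposition.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcript.text)
```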
Assembly-line robots are now commercially available, and improvements in text, audio, 3D, and embodiment are starting to merge and complement one another, potentially leading to revolutionary advances.