- 7 rules to follow before installing a home security camera - and where you should never put one
- You can control your Chromebook with just a glance now
- Is classic Outlook crashing when you start or reply to an email? A fix is on the way
- Samsung will still give you $50 for reserving a Galaxy S25 preorder within the next few hours
- Preparing for the PCI 4.0 Implementation in the Retail environment
OpenAI lets developers build real-time voice apps – at a substantial premium
OpenAI’s annual developer day took place Wednesday in San Francisco, with a raft of product and feature announcements. The event’s centerpiece was the company’s introduction of its real-time application programming interface (API).
The feature for developers makes it possible to send and receive spoken-language inputs and outputs during inference operations, or making predictions with a production large language model (LLM). It is hoped this type of interaction can enable a more fluid, real-time conversation between a person and a language model.
Also: OpenAI’s Altman sees ‘superintelligence’ just around the corner – but he’s short on details
This capability also comes at a hefty premium. OpenAI currently prices the GPT-4o large language model, which is the model that forms the basis for the real-time API, at $2.50 per million tokens of input text, and $10 per million output tokens.
The real-time input and output cost is at least twice that rate, based on both text and audio tokens, since GPT-4o needs both kinds of input and output. Input and output tokens for GPT-4o when using the real-time API cost $5 and $20, respectively, per million tokens.
For voice tokens, the cost is a whopping $100 per million audio input tokens and $200 per million audio output tokens.
Also: How to use ChatGPT to optimize your resume
OpenAI notes that with standard statistics for voice conversations, the pricing of audio tokens “equates to approximately $0.06 per minute of audio input and $0.24 per minute of audio output.”
OpenAI gives examples of how real-time voice can be used in generative AI, including an automated health coach giving a person advice, and a language tutor that can engage in conversations with a student to practice a new language.
During the developer conference, OpenAI offered a way to reduce the total cost to developers, with prompt caching, which is re-using tokens on inputs that have been previously submitted to the model. That approach cuts the price of GPT-4o input text tokens in half.
Also: OpenAI’s budget GPT-4o mini model is now cheaper to fine-tune, too
Also introduced Wednesday was LLM “distillation”, which lets developers use the data from larger models to train smaller models.
A developer captures the input and output of one of OpenAI’s more capable language models, such as GPT-4o, using the technique known as “stored completions”. Those stored completions then become the training data to “fine tune” a smaller model, such as GPT-4o mini.
OpenAI bills the distillation service as a way to eliminate a lot of iterative work required by developers to train smaller models from larger models.
“Until now, distillation has been a multi-step, error-prone process,” says the company’s blog on the matter, “which required developers to manually orchestrate multiple operations across disconnected tools, from generating datasets to fine-tuning models and measuring performance improvements.”
Also: Businesses can reach decision dominance using AI. Here’s how
Distillation comes in addition to OpenAI’s existing fine-tuning service, the difference being that you can use the larger model’s input-output pairs as the fine-tuning data. To the fine-tuning service, the company Wednesday added image fine tuning. A developer submits a data set of images, just as they would with text, to make an existing model, such as GPT-4o, more specific to a task or a domain of knowledge.
An example in practice is work by food delivery service Grab. The company uses real-world images of street signs to have GPT-4o perform mapping of the company’s delivery routes. “Grab was able to improve lane count accuracy by 20% and speed limit sign localization by 13% over a base GPT-4o model, enabling them to better automate their mapping operations from a previously manual process,” states OpenAI.
Pricing is based on chopping up each image a developer submits into tokens, which are then priced at $3.75 per million input tokens and $15 per million output tokens, the same as standard fine-tuning. For training image models, the cost is $25 per million tokens.