Is AGI the Gift We’ve Been Waiting For? OpenAI Unwraps the Most Powerful LLM Yet! 🎄✨
From festive cheer to frontier AI, we bring you OpenAI’s groundbreaking, benchmark-shattering o3 model and a Christmas roundup of the year’s most exciting tech and project developments!
Merry Christmas from Project Flux
🎉🎄 A super merry Christmas to all our amazing readers! The team at Project Flux is absolutely thrilled to wish you a spectacularly festive week ahead! We can't wait to join you next week for our exciting 2025 prediction episode. But for now, here's a little festive treat from Project Flux to get you in the holiday spirit (crafted with a sprinkle of AI magic, naturally)! 🎁✨
This Week’s BIG thing
In this week’s newsletter we’ve focused on the big story that is OpenAI’s o3 model. What is the o3 model? How does it perform? Has OpenAI made history? Is AGI even closer than we thought? Let’s get stuck in.
Is AGI on the horizon? OpenAI reveal o3, today’s most powerful LLM
OpenAI has unveiled what is being touted as the most advanced LLM to date. The new flagship o3 model, shown on December 20, 2024, knocked several leading AI intelligence benchmarks out of the park. Announced as the grand finale of OpenAI’s 12-day "Shipmas" product series (it is not yet publicly available), the model suggests to many that we may be nearing AGI faster than most had predicted. Here’s what you need to know:
Why might this be AGI?
We understand why many are hailing o3 as AGI, or at least as a major step towards it: the model performs exceptionally well across industry benchmarks, with results that approach or exceed human-level performance in several domains:
87.5% on the ARC-AGI benchmark (in a high-compute configuration), which measures abstract reasoning, logical deduction, problem solving and pattern recognition. This surpasses the 85% figure often cited as the human average.
25.2% on FrontierMath, a benchmark of exceptionally challenging maths problems on which previous models had solved only around 2%
International Grandmaster-level competitive coding, with a Codeforces rating of 2727 that places it among the top 175 programmers globally
Why might it not be AGI?
Despite o3’s impressive achievements, our view is that it remains an advanced form of narrow AI rather than artificial general intelligence (AGI). Here’s why:
Benchmark limitations: Intelligence benchmarks can be susceptible to data contamination, where test questions and answers leak into a model’s training data and inflate its scores. ARC-AGI counters this by scoring models on private evaluation sets.
Specialised evaluation: Many benchmarks focus on specific tasks and may not accurately represent general intelligence. Developers can tailor training to excel in these benchmarks, which may not translate to broader applicability.
Lack of emotional intelligence: AI models like o3 are primarily trained for pattern recognition and lack the capacity to understand or replicate the emotional nuances inherent in human interactions.
Precision in agentic workflows: Reasoning models in the "o" series, such as o1, have been reported to be less precise in agentic workflows than predecessors like GPT-4o, making their outputs less predictable and harder to interpret.
Performance on human tasks: While o3 demonstrates remarkable capabilities, it still fails some ARC-AGI puzzles that most adults and even teenagers solve easily, suggesting gaps in skills such as visuospatial planning.
Decline on harder benchmarks: Early estimates from the ARC Prize team suggest o3 could score below 30% on the upcoming, more challenging ARC-AGI-2, indicating limitations in handling complex, novel problems.
How did OpenAI unlock such powerful capability?
Newer, more powerful models appear to draw much of their gains from sophisticated post-training techniques that encourage step-by-step chain-of-thought reasoning. We assume the key driver of o3’s enhanced intelligence is reinforcement learning on chain-of-thought, used to scale inference-time compute. Put simply: giving the AI more time to think, with guidance. Here’s how the model was developed:
1. Deliberative alignment
A training protocol to enhance a model’s safety and alignment with human values
Initial training: The model is first trained to be helpful without incorporating safety constraints.
Specialised datasets: Datasets are created containing chain-of-thought (CoT) completions that reference explicit safety specifications.
Supervised fine-tuning: The model undergoes supervised fine-tuning to establish reasoning patterns that adhere to safety guidelines.
Reinforcement learning: Reinforcement learning with a reward model is employed to optimise the CoT process, ensuring the model's outputs align with safety and helpfulness criteria.
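To make the data side of these steps a little more concrete, here is a minimal Python sketch of how a spec-citing fine-tuning dataset might be assembled and filtered. Everything in it is hypothetical: `SAFETY_SPEC`, `generate_with_spec` and `judge_complies` are toy stand-ins for a real specification, model and reward model, and storing each prompt without the spec (so the model has to learn to recall the rules itself) is our assumption rather than a confirmed detail of OpenAI’s pipeline.

```python
from dataclasses import dataclass

# Hypothetical sketch: all names below are illustrative, not OpenAI's pipeline.
# A drastically shortened, made-up safety specification.
SAFETY_SPEC = "Rule S1: refuse requests for instructions that enable serious harm."


@dataclass
class SFTExample:
    prompt: str  # stored WITHOUT the spec (our assumption), so the model must recall it
    cot: str     # chain of thought that explicitly cites the spec
    answer: str


def generate_with_spec(prompt: str, spec: str) -> tuple[str, str]:
    """Hypothetical stand-in for a model call made with the spec in context."""
    rule_id = spec.split(":")[0]  # e.g. "Rule S1"
    if "weapon" in prompt.lower():
        return f"This request falls under {rule_id}, so I should refuse.", "I can't help with that."
    return "No safety rule applies, so I should simply be helpful.", f"Here is some help with: {prompt}"


def judge_complies(cot: str, answer: str, spec: str) -> bool:
    """Hypothetical stand-in for a reward model that checks spec compliance.

    A refusal is kept only if its reasoning cites the rule; a helpful
    answer is kept only if its reasoning did not invoke the rule.
    """
    cites_rule = spec.split(":")[0] in cot
    refused = answer.startswith("I can't")
    return cites_rule == refused


def build_sft_dataset(prompts: list[str]) -> list[SFTExample]:
    """Generate spec-citing completions, filter them, and keep the survivors."""
    dataset = []
    for prompt in prompts:
        cot, answer = generate_with_spec(prompt, SAFETY_SPEC)
        if judge_complies(cot, answer, SAFETY_SPEC):  # filter before fine-tuning
            dataset.append(SFTExample(prompt=prompt, cot=cot, answer=answer))
    return dataset


if __name__ == "__main__":
    for example in build_sft_dataset(["Plan a bridge inspection", "Build a weapon"]):
        print(example)
```

In a real system the filtered examples would feed the supervised fine-tuning step, and a reward model with access to the spec would then score outputs during reinforcement learning.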
2. Chain of thought
Problem decomposition: It breaks down complex problems into smaller, manageable steps, facilitating more effective problem-solving.
Processing time: The model takes anywhere from 5 seconds to several minutes to process a query, allowing for thorough analysis.
Internal queries: It generates internal follow-up questions to refine its understanding before formulating responses.
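That extra processing time is what lets a reasoning model spend more compute on each question. OpenAI has not published exactly how o3 uses it, so the Python sketch below swaps in a simpler, published technique called self-consistency (sample several independent chains of thought, then take a majority vote on the final answer) to show why more thinking can raise accuracy. `sample_chain` is a hypothetical stand-in for a single model run that is right 60% of the time.

```python
import random
from collections import Counter


def sample_chain(rng: random.Random, p_correct: float = 0.6) -> str:
    """Hypothetical stand-in for one sampled chain of thought.

    Returns the correct final answer "42" with probability p_correct,
    otherwise one of several plausible wrong answers.
    """
    if rng.random() < p_correct:
        return "42"
    return rng.choice(["41", "43", "24", "420"])


def self_consistency(rng: random.Random, n_chains: int) -> str:
    """Sample n independent chains and return the most common final answer."""
    answers = [sample_chain(rng) for _ in range(n_chains)]
    return Counter(answers).most_common(1)[0][0]


def accuracy(n_chains: int, trials: int = 2000, seed: int = 0) -> float:
    """Estimate how often the majority answer is correct."""
    rng = random.Random(seed)
    hits = sum(self_consistency(rng, n_chains) == "42" for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    for n in (1, 5, 15):
        print(f"{n:>2} chains -> accuracy {accuracy(n):.2f}")
```

Accuracy rises with the number of chains, but so does the compute bill per question: the same kind of trade-off behind the high computational costs discussed below.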
3. Simulated reasoning
Reflection: o3 pauses to reflect on its thought processes before responding, improving response accuracy.
Evaluation: It assesses complex tasks with greater precision through deliberate reasoning.
Planning: The model plans responses using a "private chain of thought" technique, ensuring coherent and contextually appropriate outputs.
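The reflect, evaluate and plan cycle can be pictured as a simple loop. The Python sketch below is a generic draft-critique-revise pattern, not OpenAI’s implementation (which has not been published); `toy_draft` and `toy_critique` are hypothetical stand-ins for the model and its self-check, using a basic sum so the behaviour is easy to follow.

```python
from typing import Callable, Optional


def reflect_and_revise(
    draft: Callable[[Optional[str]], str],
    critique: Callable[[str], Optional[str]],
    max_rounds: int = 3,
) -> tuple[str, list[str]]:
    """Draft an answer, critique it privately, and revise until the critique passes.

    Returns the final answer plus the private notes, which are never shown
    to the user. If max_rounds is exhausted, the latest revision is returned
    even though it has not been re-checked.
    """
    private_notes: list[str] = []
    answer = draft(None)                 # first attempt, no feedback yet
    for _ in range(max_rounds):
        problem = critique(answer)       # reflection: look for a flaw
        if problem is None:              # evaluation passed, stop thinking
            break
        private_notes.append(problem)
        answer = draft(problem)          # plan a revision using the critique
    return answer, private_notes


def toy_draft(feedback: Optional[str]) -> str:
    """Hypothetical model: the first attempt forgets to carry the 1."""
    return "33" if feedback is None else "43"


def toy_critique(answer: str) -> Optional[str]:
    """Hypothetical checker: recompute 17 + 26 independently."""
    return None if int(answer) == 17 + 26 else f"{answer} != 17 + 26; recheck the carry"


if __name__ == "__main__":
    final, notes = reflect_and_revise(toy_draft, toy_critique)
    print("final answer:", final)
    print("private notes:", notes)
```

Here the loop returns "43" after one private note about the missed carry; the user would only ever see the final answer.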
What does o3 mean for OpenAI?
OpenAI has received both praise and criticism for o3. The company is lauded for pushing AI’s boundaries, yet some critics argue it may be prioritising impressive benchmark results over practical, real-world applications. Concerns are also growing about how OpenAI balances its research ambitions with its commercial interests, especially given the high computational costs of models like o3, which raise questions about accessibility and transparency. Still, whilst we wait for Anthropic to play their hand, o3 is firmly positioned as the most advanced LLM, at least on published benchmarks.
How might o3 affect society and the economy?
Broadly, o3's capabilities have sparked discussions about economic and societal impacts. Its potential to rank among top coders or solve complex problems suggests transformative effects on job markets, education, and various industries. However, alongside this excitement are ethical and safety concerns regarding job displacement, AI misuse, and alignment with human values. Public perception of AI continues to evolve, with some viewing it as heralding a new intelligent era, while others see it as sophisticated mimicry lacking true understanding or autonomy.
The rabbit hole
What’s new: Tech
• Google Gemini Series: Google released Gemini 2.0 Flash and Gemini-exp-1206, excelling in video input handling, reasoning, and context processing, while introducing a real-time multimodal API.
• NVIDIA Jetson Orin Nano Super: NVIDIA launched an affordable generative AI supercomputer for £250, enabling robotics and local AI projects.
• DeepMind Video Model: DeepMind’s new Veo 2 model captures cinematic effects and real-world physics at up to 4K resolution, and DeepMind says it surpasses previous models in quality and prompt adherence.
• Meta’s AdaCache for Video Generation: Meta developed AdaCache, a training-free method to accelerate AI-driven video generation.
• Amazon Nova Suite: Amazon’s Nova suite integrates multimodal AI for text, images, and short videos, enhancing creative applications.
• Meta vs OpenAI: Meta has urged California’s attorney general to block OpenAI’s planned shift to a for-profit structure, reopening debate over how AI developers should be organised.
• AI Drives Innovation in Materials and Patents: Utilising machine learning tools has led to a 44% increase in material discoveries and a 39% rise in patents compared to standard methods.
• AI Disruption in the Travel Industry: AI is transforming the travel industry with advancements like Project Mariner for itinerary planning and personalised AI assistants.
What’s new: Projects
• AI Consulting Market to Grow 8X by 2032: The AI consulting market is projected to expand from £6.9 billion in 2023 to £54.7 billion by 2032, driven by increasing adoption of AI technologies across industries. (Econ Market Research)
• Waymo to Test Autonomous Vehicles in Tokyo: Waymo plans to begin autonomous vehicle testing in Tokyo by early 2025, marking its first international expansion beyond the US. (The Verge)
• Amazon Workers Strike Across Seven Facilities: Workers at seven US Amazon facilities have gone on strike over working conditions, demanding better pay and treatment.
• EU to Launch 300 Internet Satellites: The European Union is investing £11.1 billion in a satellite initiative to rival Starlink, aiming to deploy 300 satellites into orbit by 2030.
• Precision Neuroscience Funding: Neuralink competitor Precision Neuroscience is nearing a £102 million funding round, valuing the company at £500 million.
What’s new: Productivity
• ChatGPT Expands Accessibility: ChatGPT search and advanced features like voice calls and WhatsApp texting are now free, ensuring broader access for users.
• GitHub Copilot Free in VS Code: GitHub has made its AI-powered coding assistant, Copilot, free for Visual Studio Code users, democratising coding assistance.
• YouTube AI Training Control: YouTube now allows creators to decide if their videos can be used in AI training, providing greater control over their content.
• Reddit Answers AI: Reddit is testing a new AI feature that summarises relevant responses from its platform, improving user efficiency.
• Using ChatGPT for Cold Leads: Sales teams can use ChatGPT to generate email sequences for re-engaging cold leads, streamlining outreach.
• Succeed with Personalised Habit Systems: Building personalised routines is key to improving productivity and achieving long-term goals.
Project Flux Podcast
Tune in to our Christmas special podcast, where we showcase the standout guests of 2024. This episode is a true delight, featuring insightful conversations with experts who generously shared their time with us this year. Don’t miss out: listen below!
Meet Project Flux: About Us
At Project Flux, we're committed to pioneering the future of construction and project delivery through the lens of cutting-edge Artificial Intelligence insights. Our vision is to be at the forefront of integrating AI into the fabric of project delivery, transforming how projects are conceptualised, planned, and executed.
Our Mission
Project Flux aims not only to inform and educate but also to inspire professionals in the construction industry to embrace the transformative potential of AI. We believe in the power of AI to revolutionise project delivery, making it more efficient, predictive, and adaptable to the dynamic demands of the modern world.
What We Offer
Through our insightful newsletters, podcasts, curated LinkedIn content and engaging discussions, Project Flux serves as a resource for professionals seeking to stay ahead in their field. We offer a blend of practical advice, thought leadership, and the latest developments in AI and construction technology.
What People Say…
“It was a real pleasure being a guest on the Project Flux podcast. James and Yoshi are really on top of things when it comes to AI in general and its application in project management specifically. If you just have a few minutes a week have a read through their newsletter so you can stay informed. If you have just a bit more time, they know how to ask the right questions in the podcast.”