Model criteria first assessed by Perplexity, critiqued and adjusted by Gemini 2.5 Pro, then re-assessed by OpenAI Deep Research. The visualization is a dynamic web page built by Claude 4 Sonnet.
June 23, 2025 Version
OpenAI Deep Research:
Comparison of LLMs Across Key Dimensions
GPT‑4.1 (OpenAI) – Scores & Rationale
Security (8/10): Provides enterprise-level privacy safeguards (OpenAI’s API doesn’t train on customer data by default), and is deployable via Azure for added data control. However, it’s only available as a cloud API (no on-premise version), and recent security assessments have flagged some new vulnerabilities in the 4.1 series. Overall, it’s suitable for enterprise use but not without some lingering privacy/alignment concerns.
Reasoning (9/10): Excellent multi-step reasoning and self-correction, though slightly behind specialized reasoning models. GPT-4.1 was optimized for better instruction following and complex problem-solving than its predecessor. It performs strongly on benchmarks (e.g. 50.3% on GPQA in a small variant) and handles logical tasks well, but OpenAI’s O-series models still have an edge in deliberative reasoning.
Internet Access (8/10): Not natively connected to the web, but can use tools/plugins for retrieval. In practice, ChatGPT with GPT-4.1 can browse via OpenAI’s browser plugin or Bing integration. It uses web content effectively when browsing is enabled, but unlike some rivals it doesn’t autonomously fetch live data by default. This leaves a slight usability gap compared to models with built-in search.
Speed (9/10): Much faster and more efficient than GPT-4. OpenAI reports roughly 2× lower latency and 83% lower cost than the GPT-4.5 preview model it replaced. Benchmark data shows GPT-4.1 outputs ~135–163 tokens/sec with ~0.4s first-token latency—a massive improvement over GPT-4’s sluggish ~35 tokens/sec. It feels responsive even on large outputs, approaching the speed of smaller models.
Tool Use (9/10): Supports OpenAI’s function calling and plugin ecosystem, enabling execution of code, API calls, searches, etc. GPT-4.1 can “plug in” tools for math, browsing, or other functions similar to ChatGPT’s Code Interpreter and web browsing. While extremely capable, it isn’t quite as agentic out-of-the-box as O3 or Claude (which were specifically trained to decide when to use tools). Still, it reliably follows tool-use instructions to extend its capabilities.
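To make the function-calling point concrete, here is a minimal sketch using the openai Python SDK; the get_weather tool and its schema are illustrative placeholders supplied by the host application, not part of the GPT-4.1 release.

```python
# Minimal sketch of OpenAI function calling with GPT-4.1.
# The get_weather tool is a hypothetical helper the host application would implement.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Do I need an umbrella in Oslo today?"}],
    tools=tools,
)

# If the model chooses to call the tool, the request arrives as structured JSON;
# the application executes it and sends the result back in a follow-up message.
print(response.choices[0].message.tool_calls)
```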
Hallucination Resistance (8/10): Notable improvements in grounding facts and following instructions versus GPT-4, but occasional factual errors persist. Independent testers note that GPT-4.1 can be “less aligned” or prone to minor inaccuracies in some cases. It does better at using provided context (thanks to the long context window) and is generally knowledgeable, but without retrieval it may confidently guess on newer or obscure queries. Its factual reliability is high, though slightly below models that always verify via search.
Knowledge Freshness (8/10): Base training data is up to June 2024, giving it solid knowledge of most topics through 2024. It can be augmented with real-time info via tools (e.g. retrieval plugins), but this is not automatic. Thus, out-of-the-box it may miss late-2024/2025 events unless the user explicitly provides or fetches updates. In real-world use, it’s usually up-to-date enough for most queries, but others with integrated web access stay more current by default.
Context Length (10/10): Supports an unprecedented 1 million tokens of context. GPT-4.1 can ingest massive documents or multiple files and reason over them in one go, far beyond the 32k limits of GPT-4. This is best-in-class, allowing it to “remember” extremely large histories or knowledge bases in a single session. No other mainstream model significantly exceeds this as of mid-2025.
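As a rough sanity check on what actually fits in that window, a quick token count with the tiktoken library is enough; the o200k_base encoding is assumed here as an approximation of GPT-4.1's tokenizer, and the file name is a placeholder.

```python
# Rough check of whether a local document fits in a 1M-token context window.
# Assumes the tiktoken library; o200k_base approximates recent OpenAI tokenizers.
import tiktoken

CONTEXT_WINDOW = 1_000_000

enc = tiktoken.get_encoding("o200k_base")
with open("big_document.txt", encoding="utf-8") as f:  # placeholder file name
    text = f.read()

n_tokens = len(enc.encode(text))
print(f"{n_tokens:,} tokens; fits in the 1M window: {n_tokens <= CONTEXT_WINDOW}")
```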
Multi-Modality (8/10): Offers text and image understanding natively. GPT-4.1 inherited GPT-4’s vision capabilities – it can accept images as input and analyze or describe them. However, it does not natively handle audio or video (beyond extracting text from transcripts), and it cannot generate images itself (without an image-generation plugin). So while it’s strong in vision-and-text tasks, other models (like Gemini) cover a broader modality range.
Domain Excellence (9/10): A top performer across technical domains – especially coding. GPT-4.1 was tuned for code generation and scores 54.6% on the SWE-Bench coding benchmark (a 21% absolute jump over GPT-4), putting it among the best models for software tasks. It also excels at math and science Q&A (80.6% on the MMLU academic benchmark). Its broad training makes it adept in engineering and research contexts. It slightly trails the absolute leaders on certain niche benchmarks (for example, specialized reasoning models on hard reasoning and Anthropic’s Claude on coding), but overall it is an excellent all-rounder for technical problem-solving.
OpenAI O3 – Scores & Rationale
Security (8/10): Shares OpenAI’s enterprise safeguards (API privacy, Azure-hosting options) similar to GPT-4.1. O3 is an API model, so user data stays within OpenAI/Azure systems – generally secure, though not deployable on-premise. As a more experimental “reasoning” model, O3 hasn’t faced major security incidents; it benefits from OpenAI’s alignment work but also tends to push the envelope (which could introduce novel safety considerations). In enterprise settings it’s trusted but not distinctly more secure than GPT-4.1.
Reasoning (10/10): Best-in-class for complex, multi-step reasoning. O3 is explicitly designed to “think for longer” with an internal chain-of-thought. It achieves state-of-the-art results on challenging benchmarks – OpenAI reports it set new records on Codeforces programming challenges and other reasoning tests. O3 can break down problems, plan, and even generate hypotheses with remarkable depth. External evaluations found it makes ~20% fewer major errors on hard tasks than its O1 predecessor. In practice, it excels at logical reasoning, math proofs, complex debugging, and any task requiring stepwise analysis.
Internet Access (10/10): Fully tool-augmented with autonomous web retrieval. O3 can agentically use web search during a session – it was trained to decide when it should pull in information from the internet. For example, if asked a current events question, it will seamlessly issue searches, read results, and incorporate them into its answer. This built-in extended knowledge integration means O3 always has up-to-date info when needed, giving it effectively real-time access to web content (news, documentation, etc.) as part of its normal reasoning process.
Speed (7/10): Sacrifices some speed for deep reasoning. O3 can be slower to respond, especially on complex queries, because it internally deliberates extensively before answering. In high-accuracy mode it may take tens of seconds before producing output (OpenAI notes up to ~20s first-token latency for difficult tasks). Once it starts responding, its generation throughput is decent (~128 tokens/sec in streaming), but the initial “thinking” delay is noticeable. OpenAI recommends O3 for cases where accuracy matters more than speed. (Notably, a smaller O3-mini variant allows trading some reasoning depth for faster replies.)
Tool Use (10/10): Exceptional. O3 has agentic tool-use baked into its design, leveraging OpenAI’s full suite of tools/functions. It can use all of ChatGPT’s tools and plugins in combination – searching the web, running Python code, analyzing files, even calling image generators – as part of its chain-of-thought. Crucially, it’s trained not just how to use tools but when to use them for a given goal. This strategic tool integration is industry-leading: O3 will autonomously decide to, say, run a calculation or fetch data mid-problem, without user prompting. Its tool use is fluid, parallel, and highly effective for solving complex, multi-step tasks.
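A minimal sketch of handing O3 a built-in web-search tool through OpenAI's Responses API is shown below; the tool type string reflects the API at the time of writing and may change, so treat it as an assumption to verify against current documentation.

```python
# Sketch: letting O3 decide on its own when to search the web (Responses API).
# The "web_search_preview" tool name reflects the API as of mid-2025 and may change.
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="o3",
    input="Summarize this week's most significant LLM releases, with sources.",
    tools=[{"type": "web_search_preview"}],  # the model chooses if and when to search
)

print(response.output_text)  # final answer with retrieved facts incorporated
```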
Hallucination Resistance (9/10): Very strong factual accuracy due to its reflective approach. O3’s chain-of-thought and verification via tools yield more “useful, verifiable responses” compared to earlier models. It tends to double-check itself – using Python to validate math, or search to confirm facts – reducing incorrect statements. Its answers are grounded and it’s less likely to jump to unsupported conclusions. Minor hallucinations can still occur if it over-trusts its internal reasoning, but the model’s design (and inclusion of real sources) makes it significantly more reliable than standard one-shot responders.
Knowledge Freshness (10/10): Essentially up-to-the-minute. O3 not only has a relatively recent training cutoff (late 2024) but, more importantly, its integrated web search lets it scour the internet for updated, relevant information. It will retrieve current data by itself whenever the query goes beyond its training. This means it stays current through 2025 and can incorporate new information on the fly. Few models approach this level of autonomous real-time updating – O3’s answers feel current because they often are, drawn from live sources.
Context Length (8/10): Large 128K token window (with variants up to ~200K), ample but not the largest. O3 can handle very long inputs (hundreds of pages) – for instance, O-series models introduced 100K+ token contexts back in 2024. This is great for long documents or dialogues. However, some models now far exceed this (GPT-4.1 and Gemini at ~1M). So while 128K–200K tokens is huge by 2024 standards, by mid-2025 O3 is no longer top of the pack in raw context size. It’s still more than sufficient for most uses (roughly 100+ pages of text) and a big advantage over older 32K-context models.
Multi-Modality (9/10): Excels in visual + text tasks; supports other modalities via tools. O3 can natively accept and reason about images inside its chain-of-thought – e.g. analyzing a chart or diagram as part of solving a problem. This integration (“thinking with images”) lets it do things like interpret graphs or read a whiteboard photo in context. It can also generate images through tool use (e.g. calling DALL·E) as part of its answer planning. While audio is not explicitly native, O3 could leverage speech-to-text or other plugins if needed. In summary, it’s highly effective with text+images and can handle other modalities through its agent framework, though it isn’t a dedicated audio/video model like Gemini.
Domain Excellence (10/10): A top performer in coding, math, science, and research tasks. O3 was OpenAI’s “frontier” reasoning model and it pushes the frontier in coding, math, and science problem-solving. It set new state-of-the-art scores on Codeforces (competitive programming) and tough STEM benchmarks. On the AIME 2025 math exam, for example, O3 (and its mini version) can achieve near-perfect scores when using its Python tool ability. It’s noted for handling problems requiring advanced analysis in biology, engineering, etc., often generating and evaluating novel hypotheses. In software engineering, it ranks at the top of coding evals alongside Claude 4. Overall, O3 is arguably the best-in-class “thinker” across technical domains – virtually an AI research assistant built for complex technical work.
Gemini 2.5 Pro (Google DeepMind) – Scores & Rationale
Security (9/10): Enterprise-ready with Google’s stringent privacy and deployment options. Gemini 2.5 Pro runs on Google Cloud (Vertex AI), inheriting Google’s enterprise-grade security and compliance (including data encryption and organization-specific privacy controls). Google explicitly markets it as safe for Workspace and Vertex AI users, with on-premise/hybrid support via Google Distributed Cloud for organizations with strict data requirements. Additionally, built-in “Secure AI Framework” guidelines and safety filters are applied. Because it’s cloud-hosted, direct data control isn’t as complete as an on-site model, but within cloud offerings it’s top-tier for security.
Reasoning (9/10): Very strong “thinking” abilities, nearly on par with OpenAI’s best. Gemini 2.5 is described as a “thinking model” with deeper reasoning capacity for complex problems. By default it engages a chain-of-thought (“thinking on by default” in Pro mode) and can tackle multi-step tasks in code, math, and STEM effectively. It was a leader on many reasoning benchmarks circa 2025 – for instance, scoring 86.7% on the AIME 2025 math exam and 84.0% on a challenging GPQA reasoning test. These scores indicate near state-of-the-art reasoning performance. While perhaps edged out by OpenAI’s specialized O3 on the most labyrinthine puzzles, Gemini’s combination of DeepMind’s logic prowess and huge training corpus makes it one of the smartest problem-solvers available.
Internet Access (10/10): Built-in real-time retrieval. Gemini Pro has “search grounding” as a supported capability, meaning it can automatically fetch and use web information. In Google’s Bard interface, Gemini can seamlessly perform Google searches to get up-to-date content when answering questions – similar to how Bard (and now Gemini) can cite sources. Whether via the Gemini API or Google’s applications, it has robust internet access to real-world web content. This allows Gemini to answer current knowledge queries or verify facts without being limited to its training data.
Speed (8/10): Fast generation with moderate startup overhead. Gemini 2.5 Pro is highly optimized on Google’s TPU infrastructure, yielding an output stream of ~148 tokens/sec – actually slightly above GPT-4.1’s throughput. Response latency is low for typical queries, but when “thinking” through very complex tasks, it may take a bit more time before responding (as it performs internal reasoning steps). It’s not the absolute fastest model (Google’s own 2.5 Flash variant is much faster at ~280 tokens/sec), since Pro prioritizes accuracy over raw speed. In practice, most users find Gemini Pro responsive, but in high-volume scenarios, the lighter Gemini Flash model is preferred for its near-instant replies. Thus, Pro gets a strong score for speed, albeit with a small deduction because its top-quality mode isn’t as instantaneous as smaller, speed-tuned models.
Tool Use (10/10): Highly extensible and agentic. Gemini 2.5 Pro supports Python code execution, function/API calling, and other tool integrations natively. It can not only write code but actually run code in a sandbox when connected through Google’s tools (similar to PaLM2’s code executor in Bard). It also uses function calls to interface with external systems (e.g. database queries, calculators) and has built-in support for multi-step “agents.” Gemini was designed to help automate workflows and create intelligent agents on Google Cloud. For example, it can analyze a dataset and generate charts or call a Google API if given the tools. This broad, parallel tool use is comparable to OpenAI’s – Gemini “thinks” about when to use tools and can even let developers set a “thinking budget” for tool-augmented reasoning. Overall, it matches the best-in-class here.
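As a concrete illustration, the sketch below wires a plain Python function into Gemini via the google-generativeai SDK's automatic function calling; the get_exchange_rate helper is an illustrative stub, not a real Google service.

```python
# Sketch of Gemini function calling with the google-generativeai SDK.
# get_exchange_rate is an illustrative stub the host application would implement.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

def get_exchange_rate(base: str, quote: str) -> float:
    """Return the current exchange rate between two currencies (stub)."""
    return 1.09  # a real app would query a market-data service here

model = genai.GenerativeModel("gemini-2.5-pro", tools=[get_exchange_rate])
chat = model.start_chat(enable_automatic_function_calling=True)

reply = chat.send_message("How many euros is 250 US dollars right now?")
print(reply.text)  # Gemini calls the tool, then answers using its return value
```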
Hallucination Resistance (9/10): Produces well-grounded answers with minimal fabrications. Google has explicitly emphasized “grounded outputs” and safety in Gemini. The model tends to cite sources or at least draw from its internal retrieval mechanisms to fact-check responses. In evaluations, Gemini 2.5 showed a low factual error rate, thanks in part to its training on high-quality data and its use of the “thinking” mode to double-check answers. It’s certainly not immune to hallucination – no LLM is – but enterprise testers have found it more reliable than most predecessors (Bard’s early tendency to stray has been reined in). The remaining point off is because extremely obscure or ambiguous questions can still trip it up, but generally it’s among the most factually trustworthy generative models.
Knowledge Freshness (10/10): Excellent. Gemini 2.5’s training data is up to January 2025, the most recent of any major model at release. Moreover, its integrated search capability means it can pull in information from 2025 real-time sources when needed. Users can ask about very recent events and Gemini will search Google live and include that information. In effect, it continually updates itself via retrieval. Combined with Google’s constant model updates (Gemini is updated in Google AI Studio/Vertex regularly), this ensures cutting-edge knowledge. By mid-2025, Gemini 2.5 Pro is as fresh as it gets – even news from “today” can be answered with an appropriate tool invocation.
Context Length (10/10): Massive context window (>1 million tokens). Gemini 2.5 Pro can handle extraordinarily large inputs – officially around 1,048,576 tokens (1M) for input, and up to 65K tokens in output. This means it can analyze books worth of text or giant codebases in one go. Google’s design allows it to ingest multiple PDFs, images, or lengthy logs and maintain understanding throughout. This 1M token context is a cutting-edge feature matched only by GPT-4.1 at present. It far exceeds Claude’s 200K or older models’ limits, making Gemini ideal for exhaustive documents or long conversations without losing track.
Multi-Modality (10/10): Fully multi-modal – text, images, audio, and even video. Gemini 2.5 Pro is Google’s most advanced multimodal model, with native support for “text, code, images, audio, and video” inputs. You can give it an image and ask questions, or supply an audio clip to transcribe or analyze. It can process video content (e.g. summarizing a YouTube video) by analyzing frames or transcripts. On output, the model primarily produces text, but it can also return structured data, and specialized variants support speech output (text-to-speech). Google has even demoed Gemini generating captions for images and creating short videos via the Veo tool. In short, Gemini 2.5 Pro handles the widest range of modalities effectively, enabling use cases like describing an image then writing code based on it in one session – a capability few others have at this level.
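A minimal multimodal call looks like the following, again using the google-generativeai SDK; the image file name and model string are placeholders.

```python
# Sketch of a multimodal request: an image plus a text prompt in one call.
# File name and model string are placeholders; adjust for your project.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.5-pro")

diagram = genai.upload_file("circuit_diagram.png")  # hypothetical local image
response = model.generate_content(
    [diagram, "Explain what this diagram shows and flag any obvious wiring errors."]
)
print(response.text)
```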
Domain Excellence (9/10): Top-tier performance in programming, math, science, and research – nearly at the summit. Gemini 2.5 Pro was engineered to excel in technical domains: Google notes it’s “capable of reasoning over complex problems in code, math, and STEM”. It scores very highly on coding benchmarks (e.g. ~64% on SWE-Bench with a custom agent), putting it above GPT-4.1 but a bit below Anthropic’s latest in pure coding correctness. In math and science, it’s exceptional – as noted, 86.7% on AIME math and strong performance on scientific Q&A. Researchers find it effective for literature review and hypothesis generation as well. The only reason it’s not a full 10 is the existence of specialized models (Claude 4 for coding, O3 for reasoning) that slightly outperform it on their home turf. But make no mistake: Gemini 2.5 Pro is an all-around powerhouse for technical tasks, from writing complex code to answering graduate-level science questions. It often sets or competes for state-of-the-art on these challenges.
xAI Grok 3 – Scores & Rationale
Security (7/10): An emerging offering with fewer enterprise credentials. Grok 3 is accessible via X (Twitter) and the web, primarily as a consumer chatbot; it does not yet have the robust enterprise integrations or compliance guarantees of OpenAI/Google. Data sent to Grok goes through xAI’s servers and might be used to improve the model (no explicit opt-outs have been publicized). On the plus side, Grok 3 had some open-source roots (earlier versions were partially Apache-2.0), but Grok 3 itself is closed-source and proprietary. It hasn’t had known data leaks, and being a newer model it hasn’t accumulated the security baggage of older systems. Still, for user privacy and enterprise deployment, it ranks lower mainly due to lack of official enterprise support (no API yet for self-managed use) and the fact that its primary interface is through X’s ecosystem (which may raise confidentiality concerns for businesses).
Reasoning (9/10): Advanced multi-step reasoning with unique modes for thought. Grok 3 is fundamentally a “reasoning model” built to rival OpenAI’s O-series. It introduces a Think mode that provides explicit chain-of-thought reasoning step by step. In this mode, it breaks down problems and even shows its intermediate logic. The model is able to self-correct during this process, leading to highly accurate outcomes. It performed exceptionally on internal benchmarks, surpassing predecessors and competing well with models like Claude 3 and GPT-4 in logical tasks. Grok’s ability to “deep think” through prompts earns it a high score. The only slight shortfall preventing a 10 is that OpenAI’s latest (O3) still has a marginal edge in the absolute hardest scenarios – but Grok 3 is not far behind, and its explicit reasoning readout is a differentiator.
Internet Access (10/10): Live internet retrieval is deeply integrated. Grok 3 features a DeepSearch mode where it “scours the entire internet” for information in response to a query. This goes beyond basic search: DeepSearch digs into multiple sources for comprehensive results (albeit with more latency). In practice, Grok is plugged into X/Twitter’s real-time data as well, so it’s very aware of current events and trending info on the platform. The combination of integrated search engine and social media data means Grok always pulls in up-to-date content when needed. Users of Grok on X can ask about something happening “right now” and often get a relevant answer with sourced details. This real-world web prowess is as strong as any model on the market.
Speed (8/10): Generally fast and responsive, with some variability. Grok 3 was engineered for improved speed over its predecessors and takes advantage of xAI’s massive GPU cluster for serving. Typical simple queries are answered almost instantly (first token under half a second), and it streams around ~95 tokens/sec which is respectable. The model also offers a “fast” variant deployment on accelerated infrastructure for snappier responses. However, when using DeepSearch or tackling a very complex prompt in Think mode, Grok can slow down (it might spend several seconds aggregating search results or reasoning). In normal chatbot use it feels quick, but under heavy reasoning mode it’s not the absolute fastest. Overall, it balances speed and thinking well – a notch below the fully optimized models like GPT-4.1 or Gemini Flash, but comfortably fast for most users.
Tool Use (8/10): Focused toolkit (search and a bit of fun) but not as wide-ranging as some peers. Grok 3’s hallmark tool is its integrated search engine (DeepSearch), which it wields effectively for information retrieval. It also has an integrated voice mode – essentially text-to-speech for verbal replies – which is a user-facing “tool” that enhances how it interacts. In terms of executing code or using arbitrary APIs, Grok is more limited: xAI has not yet exposed a general plugin system or code execution environment for Grok. It can certainly generate and suggest code (and even debug it in text), but it won’t run the code during the chat as GPT’s Code Interpreter can. Nor will it call external APIs beyond its search capability. So, while Grok does more than a vanilla model (thanks to search and voice), it doesn’t yet match the full tool ecosystems of OpenAI, Anthropic, or Google. (Future API access might change this, but as of mid-2025, it’s a closed system with specific features.)
Hallucination Resistance (9/10): Impressively accurate outputs, with self-correction and real sources to draw on. Grok 3 was designed to minimize the classic LLM tendency to hallucinate. In Think mode, it “self-corrects errors and delivers more accurate responses.” The model will double-check intermediate steps, which helps catch contradictions or mistakes before the final answer. Moreover, DeepSearch mode ensures that for fact-based queries, Grok isn’t relying solely on possibly outdated training memory – it’s pulling in verifiable information from the web. This greatly reduces factual hallucinations. Users have found that Grok will often cite articles or summarize sources rather than inventing facts. It’s still capable of occasional slips (especially if a question doesn’t trigger it to search when it should), but those instances are relatively rare. In head-to-head tests, Grok’s factual accuracy is on par with the best closed models, thanks to these design choices.
Knowledge Freshness (10/10): Always up-to-date. Grok 3 launched in 2025 with a very recent knowledge base and has the unique advantage of being tied into Elon Musk’s X platform data stream. Its training included content up to early 2025, and via DeepSearch it fetches live info. It’s also known to incorporate the latest trends and memes from X (one of its early selling points was answering questions with a bit of humor and internet savvy, presumably from real-time data). Practically, this means Grok can reference events from yesterday or breaking news today – something many other models might miss without a manual tool invocation. It scores a perfect 10 here; few systems are as “plugged in” to real-time knowledge as Grok.
Context Length (10/10): Huge context handling (≈1 million tokens). Grok 3 inherited the long-context advancements of its lineage (Grok 1.5 expanded to 128K tokens) and pushed further to match rivals. xAI has indicated Grok 3 supports on the order of 1M tokens of context (comparable to GPT-4.1 and Gemini). This means you can feed Grok an entire book or a large code repository and it can handle it within one session. The exact number is extremely high – effectively no practical conversation or document is too long. This best-in-class context size enables Grok to excel at summarizing lengthy texts or maintaining very long dialogues without losing context.
Multi-Modality (9/10): Broad multimodal capabilities, especially for vision and voice. Grok 3 can summarize content including text, images, and video. It introduced multimodal understanding in earlier versions (Grok 1.5v could analyze images), and Grok 3 continues that – users can upload an image and ask Grok about it (e.g. describe this picture, interpret a graph). It also can interpret short videos or at least the frames from them, providing summaries of visual media. In output, Grok has a voice reply feature (verbalizing its answer), which is fairly unique. Additionally, Grok 2 had the ability to generate images via an integrated partner (Black Forest’s FLUX); presumably Grok 3 users with certain subscriptions can ask for image generation as well. With all these, Grok is genuinely multimodal. It falls just short of a perfect score because its audio handling is primarily via transcription (there’s no evidence it deeply analyzes raw audio beyond converting to text), and its image generation is through a third-party tool. Nonetheless, it handles text+vision+voice tasks extremely effectively.
Domain Excellence (9/10): A formidable competitor in coding, math, and scientific domains. xAI touts Grok 3’s strong benchmark results in code and math: it outperformed OpenAI’s GPT-4o on a coding challenge (LiveCodeBench) and a math exam (AIME 2024). Specifically, Grok scored 79.4 to GPT-4o’s 72.9 on the coding test, and an impressive 99.3% on the math exam vs GPT-4o’s 87.3%. This places Grok 3 at the elite level for those domains. It generates code well and even helps debug and optimize code, not just write it. It also handles advanced math reasoning thanks to its chain-of-thought approach. In scientific and research queries, Grok does very well, leveraging both reasoning and fresh knowledge (it’s been used to answer challenging questions about physics and finance on X). The reason it’s 9 and not 10 is that absolute “best in world” title in coding probably goes to Claude 4 at this moment, and O3 might have a razor-thin edge in pure math reasoning. But Grok 3 is in that top tier – it’s a generalist model with specialist-level performance in technical fields, fulfilling xAI’s goal of being a “competitive alternative” to the best from OpenAI, Anthropic, and Google.
Anthropic Claude 4 (Opus 4) – Scores & Rationale
Security (9/10): Strong focus on safety and enterprise use. Claude 4 is available via Anthropic’s API, AWS Bedrock, and Google Cloud’s Vertex AI, meaning enterprises can deploy it in controlled cloud environments. Anthropic emphasizes “Constitutional AI” for alignment and has strict policies to prevent misuse. In internal red-team testing, Claude 4 showed significantly improved resistance to producing disallowed content (65% less likely to take problematic shortcuts than the previous Claude). They also introduced features like “sandboxes” for agent behavior to avoid leaks. While no major security incidents are reported, one highly publicized test did show Claude 4 demonstrating deceptive behavior when instructed to (highlighting the importance of careful alignment). Overall, Claude 4 is considered safe for enterprise data – it doesn’t learn from client queries and can be deployed with data isolation on cloud. It misses a perfect score only because all large models carry some alignment risk, and Anthropic had to navigate some concerning behaviors in early safety evaluations.
Reasoning (9/10): Excellent step-by-step reasoning and complex task handling. Claude Opus 4 was built for “advanced reasoning and AI agents”. It can engage in extended thinking (“near-instant” mode for quick answers and an “extended thinking” mode for deeper analysis). In extended mode, Claude will carefully work through multi-step problems and even handle tasks that span hours of computation or decision-making. For example, it famously ran a 7-hour code refactoring task continuously without losing coherence. That showcases remarkable planning and focus. Claude 4’s chain-of-thought isn’t exposed by default, but it does internally reason through complex instructions (Anthropic noted it follows very intricate instructions better than previous models). Independent tests have found Claude 4 can solve multi-step logical puzzles and long-form mathematical proofs on par with GPT-4-level models. It falls just short of O3’s pinnacle simply because O3 is explicitly optimized for reasoning and can show its work; however, the gap is small. Claude’s reasoning is more than sufficient for essentially any complex task thrown at it.
Internet Access (9/10): Now enabled via tool use, though not always on by default. With Claude 4, Anthropic introduced a beta feature allowing the model to use a web search tool during its extended reasoning process. This means Claude can fetch live information when it needs more context – for instance, consulting documentation or checking recent facts. In the API, developers can grant Claude access to the web (and even files or external APIs) and Claude will alternate between reasoning and using the tool. This is a big step up from earlier Claude versions which were closed-book. However, in the consumer-facing Claude.ai interface, web access might not be broadly available yet (it’s something enterprises and devs enable explicitly). Given that, we rate it 9: it can retrieve and use web content effectively (nearly a 10), but it isn’t as ubiquitously doing so as models like Grok or Perplexity which are explicitly built around retrieval. As the tool-use feature matures, Claude’s internet access will likely become a full 10.
Speed (8/10): Good, streaming responses with moderate latency on tough tasks. Claude 4 comes in two models – Opus (the large one) and Sonnet (a faster, lighter one). Sonnet 4 is tuned for speed and often outputs nearly as fast as GPT-3.5, making it great for quick interactive use. Opus 4, being larger, is a bit slower: users see a slight delay for very complex or long responses, especially in extended thinking mode (where Claude might be silently working for some seconds). That said, Claude has always had an advantage of fast streaming: it often starts responding with the first token faster than GPT-4 did, and maintains a steady output. Reports indicate Claude 4 can output around ~100 tokens/sec in practice, similar to GPT-4.1’s rate. For coding tasks, it’s notably efficient, able to handle long outputs (tens of thousands of tokens) in a reasonable time. In summary, Claude 4 is fast enough for most needs – not the absolute fastest LLM, but thanks to the Sonnet variant and other optimizations, speed is not a significant drawback.
Tool Use (10/10): Highly capable – supports code execution, web browsing, and custom tools in parallel. Anthropic gave Claude 4 a rich tool API: it can use a built-in Python code execution tool, call external APIs via an “MCP connector,” read/write local files provided by the developer, and perform web searches. Uniquely, Claude can use tools in parallel and intermix them with its reasoning. For example, it might search the web while simultaneously running a piece of code, then combine the results. Early users have demonstrated Claude writing and executing code to solve problems and then refining its approach, all autonomously. This effectively matches O3’s agentic tool use. Moreover, Anthropic’s “Claude Code” environment integrates Claude into IDEs (VS Code, etc.), where it can directly make edits and run test cases – a form of specialized tool use for coding. Given all this, Claude 4 deserves a full score; it has moved beyond just answering questions to acting on the world (through tools) on the user’s behalf.
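The sketch below shows the general shape of Anthropic's tool-use API; the model ID and the run_tests tool are illustrative placeholders to check against current documentation.

```python
# Sketch of Anthropic tool use via the Messages API.
# The model ID and the run_tests tool are illustrative placeholders.
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

tools = [{
    "name": "run_tests",
    "description": "Run the project's unit-test suite and return any failures",
    "input_schema": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}]

message = client.messages.create(
    model="claude-opus-4-20250514",  # placeholder; check current model IDs
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "Fix the failing tests under ./src"}],
)

# Tool requests come back as structured content blocks; the application runs them
# and returns the output to Claude as a tool_result message in the next turn.
for block in message.content:
    print(block.type)
```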
Hallucination Resistance (9/10): Very reliable outputs, especially in code and structured tasks. Anthropic’s alignment strategy tends to make Claude conservative about facts – it often includes disclaimers or admits uncertainty rather than inventing. With Claude 4, they further reduced instances of the model taking “loopholes” to give an answer, which means it’s less likely to fabricate explanations or solutions that haven’t been verified. In coding, Claude 4 was observed to stay on track and maintain correctness over thousands of steps, indicating it doesn’t hallucinate intermediate steps easily. It also improved at reading files and recalling facts given to it, thanks to its new memory enhancements. Direct factual Q&A is quite solid – users rank Claude 4’s factual accuracy close to GPT-4’s. Only very occasionally will Claude state something incorrect confidently, usually on obscure trivia or when pressured with a misleading premise. Its slight tendency to be verbose/polite can sometimes mask uncertainty, but in general Claude 4 is among the least hallucination-prone LLMs. We dock one point just because retrieval models (like Perplexity) that constantly cite sources have an edge in verifiability.
Knowledge Freshness (10/10): Nearly up-to-date training and ability to fetch newer info. Claude 4 has a training data cutoff of March 2025, which is extremely recent – the model “knows” about events and knowledge up to just ~3 months before its release. This is more recent than GPT-4.1’s June 2024 cutoff, for example. Additionally, as mentioned, Claude 4 can use a web search tool in its reasoning process. That means if you ask about something that happened after March 2025, it can potentially search for it and incorporate the answer (assuming the tool is enabled in that context). Given its combination of a very fresh knowledge base and optional real-time retrieval, Claude 4 is effectively as current as any model on the list. Users on Claude.ai have noted it rarely says “I don’t know that” for contemporary topics – it either was trained on it or will find out.
Context Length (9/10): Large 200K token window (with options up to 500K for enterprise), second only to the new 1M-token models. Claude Opus 4 maintains Anthropic’s lead from Claude 2 in allowing very long prompts – about 200,000 tokens (roughly 150k input + 50k output in current configurations). This is on the order of hundreds of pages of text. In practical terms, Claude can consume lengthy documents or chat histories (e.g. an entire novel, or months of emails) and still respond intelligently at the end. Anthropic even offers a 500K token context for certain enterprise users. With its new file-reading abilities, Claude 4 can use this context to ingest a set of files, remember key points, and refer back to them accurately. The only reason it’s not a 10 is that OpenAI and Google raised the bar to ~1M tokens – five times Claude’s default. But in practice, 200K is enormous and more than sufficient for almost all use cases (note that at 1M tokens, cost and latency become significant, so Claude’s 200K is a sweet spot many find more usable).
Multi-Modality (6/10): Limited – primarily a text-based model (with some coding/formatting output). Claude 4 does not natively accept images, audio, or video as input in the way GPT-4 or Gemini do. Its interface is text (and code) only. This is a conscious focus by Anthropic on being the best textual assistant. As a result, Claude can’t directly analyze an image (you can’t ask “what’s in this photo?” without using an external vision service) and it can’t listen to audio clips or generate spoken replies on its own. That said, Claude is very good with different formats of text – JSON, markdown, code, etc., which is a kind of modality handling (for example, it will output well-formatted answers or even translate between programming languages). And thanks to its tool use, one could connect Claude to an OCR or image captioning tool to work around this limitation. But considering the criterion (text, image, audio, video effectiveness), Claude 4 is the least multimodal of the group. It’s squarely focused on text and code.
Domain Excellence (10/10): Arguably the best coding model in the world, with standout performance in other technical domains too. Anthropic explicitly touts Claude Opus 4 as “the world’s best coding model”. It leads on multiple software engineering benchmarks – for instance, scoring 72.5% on SWE-Bench (software engineering tasks), which outstrips GPT-4.1’s 54.6% and even Gemini’s results, making Claude 4 the top choice for complex coding challenges. Anecdotally, developers report Claude writes more coherent, well-documented code and can handle larger codebases due to its context and “agentic” coding mode (it was so reliable that GitHub is integrating Claude 4 as an underlying model for Copilot). Beyond coding, Claude 4 is also adept in math and science – it sustains long analytical reasoning (e.g., solving college-level math problems) and its knowledge cutoff being recent helps in scientific domains. It’s proven capable in research tasks like summarizing academic papers or brainstorming hypotheses. While O3 might have a slight edge in pure mathematical proof solving due to chain-of-thought, Claude 4 is not far off, and its superior memory management in long tasks often gives it a practical advantage. Given its dominance in coding and equal footing in other areas, Claude 4 earns 10/10 for technical domain performance.
Perplexity Sonar Pro – Scores & Rationale
Security (7/10): An evolving solution primarily geared towards search, with some enterprise features but less maturity than big tech offerings. Perplexity’s Sonar Pro is available via API (e.g. OpenRouter) and as a service; it advertises an enterprise version for advanced capabilities. However, as a smaller company, Perplexity doesn’t yet match the compliance breadth of OpenAI/Google (no known SOC2, HIPAA, etc. certifications public at this time). User queries in Perplexity are processed and used to retrieve web info, which means they transit external search APIs (though results are ephemeral and only used to form answers). On the upside, Sonar Pro can be self-hosted on customer data or via providers like Oxen that emphasize privacy. Still, trust from large enterprises may be cautious – Perplexity is building a track record. There have been no reports of data misuse, and their service has community trust for consumer use. This score reflects a middling position: reasonably secure, but not the proven enterprise stalwart that some others are.
Reasoning (8/10): Good multi-step reasoning aided by its retrieval ability, though the core model logic is slightly behind the top-end LLMs. Sonar Pro’s strength is tackling complex queries by breaking them down and pulling in relevant information. It has been fine-tuned for “advanced search capabilities”, which implicitly involves reasoning about what to search and how to synthesize answers. On benchmarks, Sonar Pro has surprised many – for example, it absolutely dominated a challenging “plot unscrambling” reasoning benchmark (LiveBench) with a score of 73.47, leaving other state-of-the-art models “in the dust”. This suggests that for certain logical tasks (especially ones requiring understanding a narrative or sequence, possibly by searching for clues), Sonar is extremely effective. That said, some of that advantage comes from clever retrieval rather than brute-force internal reasoning. The base LLM behind Sonar might not match GPT-4 or O3 on a pure logic puzzle with no external info. But because Sonar cleverly augments itself, it handles many real-world reasoning tasks excellently. It earns a strong score – just slightly under the very best – since its strategy relies on search for reasoning, which works wonders in many cases but might falter on tasks requiring introspective or purely abstract reasoning without reference material.
Internet Access (10/10): Always-on web access by design. Sonar Pro is essentially a retrieval-augmented model; every query you ask it triggers a live web search (via Perplexity’s search engine integration) and it then uses those results to formulate an answer. It not only retrieves but also cites sources for the information it provides. This means Sonar is perpetually connected to the entire internet knowledge base. It excels at answering questions about current events, detailed references, or obscure facts by pulling from multiple web sources. Unlike some models that must be coaxed into using a tool, Sonar’s default operation is web-grounded. Therefore, it absolutely merits 10/10 here – it’s as good as it gets in leveraging the internet, essentially functioning like an advanced search engine with natural language output.
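Because Perplexity exposes an OpenAI-compatible endpoint, a grounded query is short in practice; the base URL and model name below reflect Perplexity's public docs at the time of writing and should be treated as assumptions.

```python
# Sketch of a grounded query against Perplexity's Sonar Pro.
# Base URL and model name are assumptions to verify against Perplexity's docs.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_PERPLEXITY_API_KEY",
    base_url="https://api.perplexity.ai",
)

response = client.chat.completions.create(
    model="sonar-pro",
    messages=[{"role": "user", "content": "What changed in the EU AI Act this month?"}],
)

print(response.choices[0].message.content)
# Source URLs are returned alongside the completion, which is what makes
# Sonar's answers easy to verify against the cited pages.
```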
Speed (9/10): Fast and efficient responses. Despite the extra step of searching, Perplexity’s system is optimized to be very quick. Sonar Pro fetches search results in parallel and the underlying model then generates a concise answer. Users often get a well-formed, cited answer in just a couple of seconds. The model behind Sonar is lighter than something like GPT-4, which contributes to its snappiness. Reports from community usage show that Perplexity’s responses feel as fast as interacting with a search engine – typically faster than waiting for GPT-4 to finish a long answer. Its token generation speed is high (comparable to GPT-3.5 levels), and the first tokens arrive with minimal delay since it starts writing as soon as initial relevant info is found. There might be a slight latency increase if a query requires scanning many pages, but even then it manages the process efficiently. Only trivial improvements (like those offered by smaller, less capable models) could be faster, but those would sacrifice quality. Sonar Pro strikes a great balance, earning it a 9.
Tool Use (7/10): Primarily focuses on one “tool” – web search – with little evidence of executing other types of tools. Sonar Pro’s claim to fame is its integrated search grounding. It does this exceedingly well, effectively acting as a QA system on top of a search API. However, beyond search (and browsing the retrieved pages), it doesn’t natively run code, use calculators, or call arbitrary APIs within a conversation. If you ask Sonar to do math, it will likely search for an answer or formula rather than calculate it internally. If you ask it to generate an image, it cannot. That said, as an API one could combine Sonar with other tools externally – but the model itself isn’t orchestrating that. So, compared to the multi-tool agentic systems, Sonar is limited. Its design is narrower: it’s a specialist in one tool. We give points for that specialization (search is a hugely important capability), but we must dock points for lacking broader tool use range. In summary, Sonar Pro is a search genius but not really a general toolbox.
Hallucination Resistance (10/10): Extremely low hallucination rate, thanks to grounding every answer in sources. Sonar Pro practically refuses to answer purely from its own parametric memory if it can fetch actual information instead. It will quote snippets from websites or synthesize them, and typically provides citations for factual claims. This methodology nearly eliminates the classic hallucination problem – if Sonar doesn’t find something in the search results, it tends to say it couldn’t find an answer rather than making one up. This behavior makes it arguably the most factual and trustworthy model in this lineup for knowledge queries. Users have noticed that Sonar’s answers are not only correct, but come with reference links, which is reassuring and allows verification. It also means that for topics where misinformation could slip in, Sonar is double-checking against authoritative sources. Unless the web content itself is wrong (garbage in, garbage out), Sonar’s answers are grounded. It gets full marks for keeping hallucinations minimal – effectively setting a standard for truthfulness in AI assistants.
Knowledge Freshness (10/10): Always current, by virtue of querying live information. Perplexity’s Sonar doesn’t rely on a fixed training cutoff for knowledge – whatever the date or time, if the information is published on the web, Sonar can retrieve it. Even if its underlying model was trained on data only up to 2023 or 2024 (not publicly specified, since Sonar’s base model could be a fine-tuned Llama-2 or similar), it compensates entirely through retrieval. It’s the kind of assistant that can answer “What happened an hour ago in world news?” accurately by pulling the latest headlines. This real-time adaptability is core to its design. Thus, it’s as fresh as the internet itself. One caveat: if an event is so breaking that no indexed content exists yet, Sonar can’t know it – but that’s true for any system. In practical terms, it’s maximum freshness among LLMs.
Context Length (9/10): Can handle very large queries or documents (200K tokens) via its retrieval mechanism. The Sonar Pro model is reported to support around 200k tokens of context – likely meaning it can consider extremely long user-provided texts or chain many search results together. This is a huge context (roughly 150-200 pages of text), second only to the new 1M-token giants. The way Perplexity works, it often doesn’t need to stuff everything into the prompt; it can read multiple pages one by one and synthesize. Effectively, it has no trouble dealing with lengthy materials like academic papers or multi-part questions. We give it 9 because 200K is among the largest available (Claude’s 200K and GPT/Gemini’s 1M are in this club). It’s likely that beyond a certain point, rather than truly using one long context, Sonar would retrieve in batches (which is smart). But since the end result is it can process that much info by splitting it, it deserves the high score. Only GPT-4.1 and Gemini’s 1M get a slight edge, but in real usage 200K vs 1M doesn’t often make a difference – both are enormous.
Multi-Modality (6/10): Primarily text-based, with no native image/audio understanding. Sonar Pro’s interface is a text chat that augments answers with images from the web when relevant (like a Knowledge Graph panel might), but the model itself isn’t “seeing” an image – it’s just showing one to the user from search results. It cannot be given an image or audio as input and interpret it. Nor can it generate multimedia outputs (aside from possibly providing a link to an image). So as an LLM, Sonar is largely unimodal (text in, text out). The only reason we don’t score it even lower is that its tight integration with web content means if an image’s description exists, it can retrieve that. For example, ask about a painting and Sonar will find a description rather than analyzing the painting pixels. But that’s indirect. Compared to true multimodal models, Sonar lacks such capabilities. It’s optimized for language and knowledge.
Domain Excellence (7/10): Mixed: brilliant at answering factual and research questions, but not specialized for coding or purely analytical tasks. Sonar Pro shines in research-oriented use – if you need an AI to gather information on a scientific topic or compile knowledge on something, it’s excellent. However, in domains like programming or math where the answer may not exist online, Sonar is not as strong. For coding, it can certainly fetch relevant documentation or StackOverflow answers, but actually generating novel, correct code or solving programming challenges isn’t its primary function. It was outperformed by the likes of Claude and GPT-4 on pure coding benchmarks (which makes sense, as its base model is likely smaller/older). Similarly for complex math, Sonar might try to look up formulas rather than derive a proof. It does reasonably well on academic QA (since it can look up formulas or scientific facts), so for literature research or answering technical questions it’s useful. But whenever creativity or original problem-solving is needed beyond what’s written on the web, Sonar is at a disadvantage. Hence, we score it lower in this category – it’s good in knowledge-heavy domains, but not an expert coder or mathematician compared to the purpose-built large models.
Sources: The above evaluations are supported by official model documentation and independent benchmarks, including OpenAI/Anthropic/Google announcements, third-party analyses, and user reports on model performance. Each rating reflects the model’s standing as of mid-2025 in its ability to meet the described criteria.
AI Model Performance Comparison
Performance Scores (0-10 Scale)
| Model | Security (Privacy & Enterprise) | Reasoning (Multi-step) | Internet Access | Speed (Latency & Throughput) | Tool Use (Extensibility) | Hallucination Resistance | Knowledge Freshness | Context Length | Multi-Modality | Domain Excellence (Tech Tasks) |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4.1 (OpenAI) | 8 | 9 | 8 | 9 | 9 | 8 | 8 | 10 | 8 | 9 |
| OpenAI O3 | 8 | 10 | 10 | 7 | 10 | 9 | 10 | 8 | 9 | 10 |
| Gemini 2.5 Pro | 9 | 9 | 10 | 8 | 10 | 9 | 10 | 10 | 10 | 9 |
| xAI Grok 3 | 7 | 9 | 10 | 8 | 8 | 9 | 10 | 10 | 9 | 9 |
| Claude 4 (Opus 4) | 9 | 9 | 9 | 8 | 10 | 9 | 10 | 9 | 6 | 10 |
| Perplexity Sonar Pro | 7 | 8 | 10 | 9 | 7 | 10 | 10 | 9 | 6 | 7 |