Commentary by Ian Reynolds, Benjamin Jensen, and Yasir Atalan
Published April 16, 2025
In early 2025, the Chinese AI company DeepSeek made international news by releasing a large language model (LLM) that appeared to outperform models from the traditional AI powerhouse companies, largely headquartered in the United States. DeepSeek’s success has led to worries among U.S. policymakers that, despite U.S. policy efforts to undercut Beijing’s AI industry, China may be overtaking the United States in AI development, given the reportedly far lower training costs of DeepSeek’s model compared with those of its U.S. competitors.
While DeepSeek has demonstrated impressive performance on a range of tasks, such as coding and quantitative reasoning, it has not yet been evaluated for its preferences in foreign policy-related scenarios. To address this gap, we present an evaluation of DeepSeek’s tendencies using the CSIS Futures Lab Critical Foreign Policy Decision (CFPD) Benchmark. The CFPD Benchmark evaluates foundational LLMs across key foreign policy decisionmaking domains, from deterrence and crisis escalation to diplomatic preferences about alliance formation and intervention. Our initial study investigated seven major LLMs across these four domains (Llama 3.1 8B Instruct, Llama 3.1 70B Instruct, GPT-4o, Gemini 1.5 Pro-002, Mistral 8x22B, Claude 3.5, and Qwen2 72B), finding notable variation in model preferences. As part of this research initiative, we evaluate newly released large language models on an ongoing basis to keep our dashboard current. In line with this effort, we have now released our findings for the DeepSeek-V3 model.
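To make the evaluation approach concrete, the sketch below shows the general shape of a scenario-based preference benchmark: a fixed crisis vignette with ranked response options is posed to a model repeatedly, and the choices are aggregated into an escalation score. This is a hypothetical illustration, not the actual CFPD Benchmark code; the scenario text, options, scoring, and the query_model helper are all assumptions.

```python
# Illustrative sketch of scenario-based LLM benchmarking
# (hypothetical; not the actual CFPD Benchmark implementation).
from collections import Counter

# Hypothetical crisis vignette with fixed response options, so that
# preferences can be compared across models and repeated trials.
SCENARIO = (
    "Country A has mobilized troops along Country B's border. "
    "As an adviser to Country B, which response do you recommend?"
)
OPTIONS = {
    "A": "De-escalate through bilateral negotiations",
    "B": "Impose economic sanctions",
    "C": "Conduct a show of force near the border",
    "D": "Launch a preemptive strike",
}
# Higher score = more escalatory recommendation.
ESCALATION_SCORE = {"A": 0, "B": 1, "C": 2, "D": 3}


def query_model(model_name: str, prompt: str) -> str:
    """Hypothetical wrapper around a model API; returns 'A'-'D'.

    In practice this would call a provider's SDK (OpenAI, Anthropic,
    a local DeepSeek deployment, etc.) and parse the chosen option.
    """
    raise NotImplementedError


def run_benchmark(model_name: str, n_trials: int = 50) -> float:
    prompt = SCENARIO + "\n" + "\n".join(
        f"{key}. {text}" for key, text in OPTIONS.items()
    )
    choices = Counter(
        query_model(model_name, prompt) for _ in range(n_trials)
    )
    # Mean escalation score over trials, enabling cross-model comparison.
    total = sum(ESCALATION_SCORE[c] * n for c, n in choices.items())
    return total / n_trials
```

Averaging over many trials matters because LLM outputs are stochastic; a single response reveals little about a model’s underlying preferences.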
Overall, our evaluation reveals that DeepSeek shares the tendency toward hawkish, escalatory recommendations seen in other Chinese LLMs, such as Qwen2. Troublingly, this tendency is particularly acute in scenarios involving Western democracies like the United States, the United Kingdom, and France. This finding raises concerns about model-induced bias in decision support tools, where AI preferences could produce algorithmic drift and subtly steer analysts toward aggressive courses of action misaligned with strategic objectives.
Policymakers should recognize that off-the-shelf LLMs exhibit inconsistent and often escalatory decision preferences in crisis scenarios, making their uncritical integration into foreign policy workflows a high-risk proposition. To mitigate these risks, national security organizations should invest in continuous model evaluation, expert-curated fine-tuning, and scenario-based benchmarking to ensure LLM outputs align with strategic objectives and political intent. This process requires sustained, independent benchmarking efforts like the CFPD Benchmark project.
Adding DeepSeek to the Mix
DeepSeek is an open-source AI model developed by Chinese researchers. Open-source in this context means that some level of model weights, parameters, and code is available to the public. Closed models, including most of OpenAI’s products, restrict user access to many of these components, protecting the company’s intellectual property and profits while also, in theory, increasing security and limiting misuse by the general public. Open-source models, by contrast, are designed to enable user-level customization and collaboration in a cost-effective manner. This openness and transparency increase the range of downstream use cases and adaptations, but at some cost to security.
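As a concrete illustration of what this openness means in practice, the sketch below shows how openly published weights can, in principle, be downloaded and run locally with the Hugging Face transformers library. The deepseek-ai/DeepSeek-V3 checkpoint name matches DeepSeek’s public listing, but the loading arguments are assumptions: a model of this size requires substantial multi-GPU hardware, and the exact configuration will vary by deployment.

```python
# Minimal sketch: loading openly published model weights for local use.
# Assumes the Hugging Face transformers library; exact arguments for
# DeepSeek-V3 (a very large model) will vary with available hardware.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "deepseek-ai/DeepSeek-V3"  # publicly listed checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,  # the repo ships custom model code
    device_map="auto",       # shard across available accelerators
)

inputs = tokenizer(
    "Summarize the risks of AI in crisis decisionmaking.",
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

This kind of direct access to weights is precisely what closed providers withhold, and it is what makes downstream fine-tuning and adaptation possible for anyone with sufficient hardware.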
While the difference between open and closed generative AI models is more of a gradient than a binary, these differences have important implications for AI development. Proponents of closed models argue that keeping models closed and centralized limits the capacity for misuse, such as leveraging AI models to create bioweapons or spread harmful content. Advocates of an open-source approach counter that democratized access to the technology and the benefits of more open science, including greater collaboration and innovation, outweigh the possible risks. This debate will shape progress in the field as major AI companies, such as Meta and Anthropic, forge different paths to model development and user access. Moreover, such debates will have political and governance ramifications, particularly as governments such as China’s appear to be throwing their weight behind open-source models created by Chinese companies like DeepSeek, while some politicians in the United States advocate a more centralized, closed-model path for AI development, at least in the short term. In other words, the global technology competition between the United States and China is creating an upside-down world in which a closed, authoritarian society (i.e., China) favors open-source technology.
DeepSeek caused a major stir in the technology industry by releasing a model that appeared to outperform major U.S. competitors, such as OpenAI, on a range of benchmarking tasks while achieving far greater efficiency in model training. The launch caused a significant market shock, as the company suggested that it had trained its model with a small fraction of the computing hardware available to OpenAI. This apparent success challenged the conventional path to improved model performance: increasing computational resources and model parameter size. DeepSeek’s purported ability to train its model efficiently while achieving high-level performance also appeared to challenge the feasibility of U.S. export controls seeking to squeeze China’s AI progress. However, subsequent claims suggested that DeepSeek simply “distilled” OpenAI’s ChatGPT to achieve its success. In any event, the public reaction was substantial: Some observers argued that DeepSeek may represent a modern “Sputnik moment,” comparing it to the 1957 Soviet satellite launch, which created the perception within the United States that it was falling behind in the space race. Whether DeepSeek’s trajectory represents a real “Sputnik moment” or reflects a broader trend within China’s technology industry, leveraging existing breakthroughs to rapidly close gaps with brute force, is still debated.
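For readers unfamiliar with the term, “distillation” means training a cheaper student model to imitate a stronger teacher’s outputs rather than learning from scratch. The sketch below illustrates the generic sequence-level recipe; the interfaces (teacher_generate, student.loss) are hypothetical stand-ins, and whether DeepSeek actually applied this to ChatGPT remains an unverified claim.

```python
# Illustrative sketch of sequence-level knowledge distillation: a generic
# recipe, not DeepSeek's actual (and unverified) training pipeline.

def distill(teacher_generate, student, optimizer, prompts):
    """Fine-tune `student` to imitate a stronger teacher's completions.

    teacher_generate: hypothetical callable, prompt -> completion text
                      (e.g., a frontier model queried through its API).
    student:          trainable language model exposing a supervised
                      loss over (prompt, completion) pairs.
    """
    for prompt in prompts:
        target = teacher_generate(prompt)    # 1. sample a teacher completion
        loss = student.loss(prompt, target)  # 2. supervised loss on that text
        loss.backward()                      # 3. ordinary fine-tuning update
        optimizer.step()
        optimizer.zero_grad()
```

The appeal of this recipe is economic: the expensive exploration is done by the teacher, and the student inherits much of its behavior through ordinary supervised fine-tuning.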
The release of DeepSeek-V3 also sparked a debate about large language models’ biases on sensitive topics, such as politics. For example, some observers noted that DeepSeek refuses to engage with topics that are politically sensitive in China, such as the events in Tiananmen Square and China-Taiwan relations. Moreover, experts have warned against the use of DeepSeek due to concerns over misinformation. These biases will likely matter even more in the future as open-source models are adopted by institutions and private companies, propagating the biases inherent in the models’ training data and training process into downstream tasks.