There is a fundamental problem with modern AI that most practitioners do not talk about. We have built systems that can recognize faces, translate languages, and generate human-like text. But ask these systems a simple question like “Why did that happen?” and they fall apart. This is because nearly all machine learning algorithms are built on correlation, not causation. They find patterns in data, but they cannot reason about the underlying mechanisms that generate that data.
This distinction matters more than you might think. If you are building a recommendation system, correlation might be enough. But if you are building a robot that needs to decide whether to push a button, or a medical system that needs to recommend a treatment, you need to understand why things happen, not just what tends to happen together.
In this post, I want to explore why causality is the missing piece in modern AI, what tools exist for causal reasoning, and how this connects to practical applications in robotics and reinforcement learning.
The Correlation Trap: Why Pattern Matching is Not Enough
Let me start with a simple example that illustrates the problem. Suppose you are training a model to predict whether a patient will recover from a disease. Your data shows a strong correlation: patients who receive a certain treatment are less likely to recover. A naive model would learn that the treatment is harmful and recommend avoiding it.
But here is the catch. What if doctors only prescribe this treatment to the sickest patients? In that case, the treatment might actually be beneficial, but the correlation is reversed because the patients who receive it were already more likely to die. This is a classic example of confounding, where a third variable (disease severity) affects both the treatment decision and the outcome.
A correlation-based model cannot distinguish between these scenarios. It sees the pattern and reports it. A causal model, on the other hand, would ask: “What would happen if we intervened and gave this treatment to a randomly selected patient?” This is a fundamentally different question, and it requires a fundamentally different framework to answer.
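To see how this plays out numerically, here is a minimal synthetic simulation (the numbers are invented for illustration): the treatment cuts the risk of death by 30%, yet because doctors give it mainly to the sickest patients, the raw comparison makes it look harmful.

import numpy as np

# Synthetic data: sicker patients are more likely to get the treatment,
# but the treatment itself multiplies the risk of death by 0.7.
rng = np.random.default_rng(42)
n = 100_000
severity = rng.random(n)                                 # 0 = mild, 1 = severe
treated = rng.random(n) < severity                       # doctors treat the sickest
p_death = 0.4 * severity * np.where(treated, 0.7, 1.0)
died = rng.random(n) < p_death

# Observed association: the treated group dies MORE often
print("P(death | treated)   =", round(died[treated].mean(), 3))    # ~0.19
print("P(death | untreated) =", round(died[~treated].mean(), 3))   # ~0.13

# Intervention: because we wrote the simulator, we can force treatment on or off
print("P(death | do(treat))    =", round((rng.random(n) < 0.4 * severity * 0.7).mean(), 3))  # ~0.14
print("P(death | do(no treat)) =", round((rng.random(n) < 0.4 * severity).mean(), 3))        # ~0.20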
This problem is not hypothetical. A systematic review of 90 studies on immune checkpoint inhibitors in cancer treatment found that while 72% used traditional machine learning and 22% used deep learning, none incorporated causal inference. As a result, these models were not included in phase III clinical trial designs or referenced in major clinical guidelines. The models could predict, but they could not inform action.
The Ladder of Causation: Three Levels of Reasoning
Judea Pearl, a computer scientist who won the Turing Award for his work on causality, proposed a framework called the “Ladder of Causation” (also known as the Pearl Causal Hierarchy). This framework distinguishes three qualitatively different levels of reasoning:
Level 1: Association (Seeing)
This is the domain of most machine learning. At this level, you observe data and ask questions like “What is the probability of Y given that I observe X?” This is correlation. You are seeing what happens, but you are not doing anything.
Example question: “What is the probability that a customer will buy product B given that they bought product A?”
Level 2: Intervention (Doing)
At this level, you ask what would happen if you actively changed something in the world. The key mathematical tool here is the “do” operator, written as do(X=x). This is different from conditioning on X=x because when you intervene, you break the natural relationships that caused X to take a certain value.
Example question: “What would happen to sales if we changed the price to $10?” (not “What do we observe when the price is $10?”)
Level 3: Counterfactual (Imagining)
This is the highest level. Here, you ask questions about alternate realities. Given that something specific happened, what would have happened if circumstances had been different?
Example question: “The patient took the treatment and died. Would they have survived if they had not taken the treatment?”
Most machine learning operates at Level 1. Some recent work pushes into Level 2. Very few systems can reason at Level 3. But for building truly intelligent systems, especially in robotics and reinforcement learning, we need all three levels.
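To give a flavour of Level 3, here is a tiny worked counterfactual on a made-up model of my own (the standard recipe is abduction, then action, then prediction):

# Made-up structural model: the patient recovers if they are moderately
# resilient AND treated, or highly resilient regardless of treatment.
def recovers(treatment, resilience):
    return int((treatment == 1 and resilience == 1) or resilience == 2)

# Observed fact: the patient was treated (T = 1) and did not recover (Y = 0).
# Step 1, abduction: which unobserved resilience values fit what we saw?
consistent = [u for u in (0, 1, 2) if recovers(1, u) == 0]     # -> [0]

# Step 2, action: intervene and set T = 0 in the model.
# Step 3, prediction: recompute the outcome under the inferred resilience.
print([recovers(0, u) for u in consistent])                    # [0]: no recovery either way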
Structural Causal Models: The Mathematical Foundation
To reason about causality, we need a formal language. Structural Causal Models (SCMs) provide this foundation. An SCM consists of three components:
- Variables: Both observed (endogenous) and unobserved (exogenous)
- Structural equations: Functions that define how each variable is determined by its causes
- A directed acyclic graph (DAG): A visual representation where arrows indicate causal relationships
Here is a simple example. Suppose we want to model the relationship between studying (S), intelligence (I), and exam scores (E). A structural causal model might look like this:
I = f_I(U_I) # Intelligence is determined by unobserved factors
S = f_S(U_S) # Study hours are determined by unobserved factors
E = f_E(I, S, U_E) # Exam score is caused by intelligence and studying
The corresponding DAG would show arrows from I to E and from S to E, but no arrow between I and S (assuming they are independent).
This formal structure allows us to do something powerful: we can mathematically derive what would happen under interventions. If we “do” (intervene on) S by forcing everyone to study for 10 hours, we replace the equation for S with S=10 and propagate the effects through the model.
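Here is that mechanic in code, using a hand-written simulator of the toy model above (the functional forms and coefficients are invented for illustration):

import numpy as np

rng = np.random.default_rng(0)
n = 10_000

def simulate(do_study=None):
    u_i = rng.normal(100, 15, n)            # U_I
    u_s = rng.uniform(0, 8, n)              # U_S
    u_e = rng.normal(0, 5, n)               # U_E
    intelligence = u_i                                          # I = f_I(U_I)
    study = u_s if do_study is None else np.full(n, do_study)   # equation replaced under do(S = s)
    exam = 0.3 * intelligence + 2.0 * study + u_e               # E = f_E(I, S, U_E)
    return exam

print("Mean exam score, observational:  ", round(simulate().mean(), 1))             # ~38
print("Mean exam score under do(S = 10):", round(simulate(do_study=10).mean(), 1))  # ~50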
The Building Blocks: Confounders, Colliders, and Mediators
To apply causal reasoning in practice, you need to understand three fundamental structures that appear in causal graphs. These structures determine whether and how you should adjust for variables when estimating causal effects.
Confounders (Forks)
A confounder is a common cause of both the treatment and the outcome. In a graph, it looks like X <- C -> Y, where C is the confounder. If you do not adjust for C, you will see a spurious association between X and Y even if X has no causal effect on Y.
Example: The relationship between coffee consumption and lung cancer. Both are caused by a common factor (smoking behavior), so adjusting for smoking is necessary to see the true effect of coffee.
Colliders
A collider is a common effect: a variable caused by two other variables, as in X -> C <- Y. Here is the counterintuitive part: if you adjust for a collider, you create a spurious association between X and Y, even if none exists.
Example: Consider a study of hospitalized patients looking at the relationship between two diseases. Hospitalization is a collider (caused by having either disease). If you only study hospitalized patients, you might find a negative correlation between the diseases, even if they are independent in the general population.
Mediators
A mediator is a variable that lies on the causal path between treatment and outcome: X -> M -> Y. Whether you adjust for a mediator depends on your question. If you want the total effect of X on Y, do not adjust. If you want the direct effect (bypassing M), you might need to adjust, but this opens up other complexities.
Example: The effect of education (X) on income (Y) through occupation (M). If you want to know the total effect of education, do not adjust for occupation. If you want to know whether education affects income through channels other than occupation, the analysis becomes more complex.
Understanding these structures is essential because the same statistical operation (adjusting for a variable) can reduce bias (with confounders), create bias (with colliders), or answer a different question (with mediators).
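A quick synthetic simulation (my own sketch) makes the collider case vivid: two diseases that are independent in the population become negatively correlated once we look only at hospitalized patients.

import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Two independent diseases, and a collider: hospitalization
disease_a = rng.random(n) < 0.10
disease_b = rng.random(n) < 0.10
hospitalized = disease_a | disease_b | (rng.random(n) < 0.02)

a, b = disease_a.astype(float), disease_b.astype(float)
print("Correlation, whole population:", round(np.corrcoef(a, b)[0, 1], 3))                               # ~0.0
print("Correlation, hospital sample: ", round(np.corrcoef(a[hospitalized], b[hospitalized])[0, 1], 3))   # negative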
Simpson’s Paradox: When Correlation Lies
One of the most striking demonstrations of why causality matters is Simpson’s Paradox. This is a phenomenon where a trend appears in several groups of data but disappears or reverses when the groups are combined.
A famous real-world example comes from UC Berkeley admissions data. When looking at overall admission rates, women appeared to be admitted at a lower rate than men, suggesting potential discrimination. But when the data was broken down by department, women were actually admitted at equal or higher rates in most departments.
What happened? Women tended to apply to more competitive departments with lower acceptance rates, while men tended to apply to departments with higher acceptance rates. The aggregate data was confounded by department choice.
The question “Should we aggregate or disaggregate?” cannot be answered by statistics alone. It requires understanding the causal structure. In this case, if we want to know whether the admissions process is biased, we should disaggregate by department because departments make the decisions. But if we want to know whether women face systemic barriers (including being steered toward more competitive fields), the aggregate data tells a different story.
Pearl’s do-calculus provides a formal framework for resolving Simpson’s Paradox. When the back-door criterion is satisfied (meaning you can block all confounding paths), you can calculate the true causal effect. If not, do-calculus offers other identification strategies, such as front-door adjustment.
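Here is a small numeric illustration of the reversal with pandas (the counts are invented to reproduce the pattern, not the real admissions figures):

import pandas as pd

# Invented counts chosen to reproduce the reversal; not the real Berkeley data
df = pd.DataFrame({
    "dept":     ["Easy", "Easy", "Hard", "Hard"],
    "gender":   ["men", "women", "men", "women"],
    "applied":  [800, 100, 200, 700],
    "admitted": [480, 65, 20, 80],
})

by_group = df.assign(rate=(df.admitted / df.applied).round(2))
print(by_group)   # women's admission rate is higher in BOTH departments

overall = df.groupby("gender")[["applied", "admitted"]].sum()
print((overall.admitted / overall.applied).round(2))   # yet women look worse overall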
Causal Discovery: Learning Causal Structure from Data
In practice, we often do not know the true causal structure. Causal discovery algorithms attempt to learn the causal graph from observational data. Two of the most widely used families of approaches are:
Constraint-Based Methods (PC, FCI)
These algorithms use conditional independence tests to rule out certain causal structures. The PC algorithm assumes no unobserved confounders and produces a partially directed graph (some edges remain undirected because the direction cannot be determined from data alone). The FCI algorithm relaxes the no-confounder assumption but produces an even more ambiguous output.
Functional Methods (LiNGAM)
These algorithms make stronger assumptions about the data-generating process. LiNGAM assumes linear relationships with non-Gaussian noise. Under these assumptions, the causal direction can be uniquely identified because the distribution of residuals differs depending on which direction is causal.
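As a sketch of what this looks like in practice, here is DirectLiNGAM run on synthetic data, assuming the lingam package (pip install lingam); with linear, non-Gaussian data it should recover that x0 causes x1:

import numpy as np
import lingam   # assumed: the `lingam` package

rng = np.random.default_rng(0)
n = 2000

# Linear model with non-Gaussian (uniform) noise: x0 -> x1
x0 = rng.uniform(-1, 1, n)
x1 = 1.5 * x0 + rng.uniform(-1, 1, n)
X = np.column_stack([x0, x1])

model = lingam.DirectLiNGAM()
model.fit(X)
print(model.causal_order_)       # expected: [0, 1], i.e. x0 comes before x1
print(model.adjacency_matrix_)   # estimated coefficients of the linear SCM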
In Python, the DoWhy library provides a principled interface for causal effect estimation (causal discovery itself is typically handled by companion libraries such as causal-learn or lingam). Here is a simplified example, with a small synthetic dataset so it runs end to end:
import numpy as np
import pandas as pd
from dowhy import CausalModel

# Toy observational data: age and gender influence both treatment and outcome
rng = np.random.default_rng(0)
n = 2000
age = rng.normal(50, 10, n)
gender = rng.integers(0, 2, n)
treated = rng.random(n) < 1 / (1 + np.exp(-(age - 50) / 10))   # older patients treated more often
outcome = 2.0 * treated + 0.1 * age + 0.5 * gender + rng.normal(0, 1, n)
df = pd.DataFrame({"treatment": treated, "outcome": outcome, "age": age, "gender": gender})

# 1. Model the problem: declare treatment, outcome, and confounders
model = CausalModel(
    data=df,
    treatment="treatment",
    outcome="outcome",
    common_causes=["age", "gender"],   # confounders
)

# 2. Identify: can the causal effect be expressed in terms of observed data?
identified_estimand = model.identify_effect()

# 3. Estimate the effect, here via propensity score matching
estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.propensity_score_matching",
)

# 4. Refute the estimate (check robustness), e.g. by adding a random common cause
refutation = model.refute_estimate(
    identified_estimand,
    estimate,
    method_name="random_common_cause",
)

print(estimate.value)
print(refutation)
The key insight is that DoWhy separates the steps: define the causal model, identify whether the effect can be estimated, estimate it, and then try to refute the estimate using various checks.
Causal Reinforcement Learning: Where Causality Meets Robotics
This is where causality becomes essential for building intelligent systems. Reinforcement Learning (RL) is about learning to take actions to maximize rewards. But standard RL learns from correlations in the data, which can lead to brittle policies that fail when the environment changes.
Consider a robot learning to pick up objects. A correlation-based RL agent might learn that “when my gripper is near a red object, I get a reward.” But this fails to capture the causal mechanism: the reward comes from successfully grasping the object, not from being near red things. If the training environment happened to have many red objects, the agent learns the wrong lesson.
Causal RL addresses this by incorporating causal models into the learning process. There are several approaches:
Causal Influence Detection
This technique uses interventions to identify which state variables actually influence outcomes. Researchers have used this to identify relevant features for robotic manipulation. The robot learns which factors causally influence success and which are spurious correlations. The result is dramatically better generalization from simulation to real-world deployment.
Counterfactual Data Augmentation
Using a causal model, agents can generate “counterfactual trajectories” by asking questions like “What would have happened if I had taken a different action?” This allows learning from imagined experiences, improving sample efficiency.
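As a toy sketch of the idea (my own illustration, not any specific published algorithm), suppose the agent has a learned dynamics model with additive noise, next_state = state + action + noise. From one real transition it can infer the noise term and then replay the step under a different action:

import numpy as np

# Toy learned dynamics: next_state = state + action + noise
def dynamics(state, action, noise):
    return state + action + noise

# One observed transition from the real environment
state, action, next_state = np.array([1.0]), np.array([0.5]), np.array([1.8])

# Abduction: infer the noise that must have occurred on this step
noise = next_state - (state + action)            # u = 0.3

# Counterfactual rollout: what if a different action had been taken?
alt_action = np.array([-0.5])
counterfactual_next = dynamics(state, alt_action, noise)
print(counterfactual_next)                       # [0.8] -> an imagined transition for training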
Causal World Models
Instead of learning a black-box dynamics model, agents learn a structured causal model of the environment. This model can then be used for counterfactual reasoning, intervention planning, and transfer to new tasks.
The results are compelling. Causal RL agents show improved sample efficiency (they learn faster), better robustness to distribution shift, and improved interpretability (we can understand why they make decisions).
Can LLMs Do Causal Reasoning?
Large Language Models like GPT-4 have shown impressive capabilities in many tasks. Can they reason causally? The research here is nuanced.
On standard benchmarks, LLMs perform surprisingly well on some causal tasks. GPT-4 achieves 97% accuracy on pairwise causal discovery (determining which of two variables causes the other) and 92% on certain counterfactual reasoning tasks. This significantly outperforms previous approaches.
But there are serious limitations. LLMs exhibit “brittle” behavior with high sensitivity to how questions are phrased. Their performance drops significantly when dealing with complex multi-variable scenarios that resemble real-world data. And most concerning, some research shows that LLMs often arrive at correct answers through flawed reasoning processes.
The fundamental issue is that LLMs are trained to predict text, not to model causal mechanisms. They may have memorized many causal facts from their training data, but they cannot derive new causal relationships from first principles. As one paper put it, they function as “causal parrots,” capable of reciting causal knowledge without truly understanding or applying it.
This does not mean LLMs are useless for causal tasks. They can be valuable for generating hypotheses about causal structure, translating domain knowledge into formal causal models, and explaining causal reasoning to users. But they should not replace rigorous causal inference frameworks for important decisions.
Practical Advice: Getting Started with Causality
If you want to incorporate causal thinking into your work, here is where I would start:
1. Learn to Draw DAGs. Before any analysis, sketch the causal graph. What causes what? What are the potential confounders? What are you trying to estimate? Tools like DAGitty (dagitty.net) make this easy and can automatically identify which variables you need to adjust for.
2. Distinguish Prediction from Causal Inference. Not every problem requires causal reasoning. If you just need to predict what will happen (without intervening), correlation-based ML is fine. But if you need to know what will happen if you change something, you need causal methods.
3. Use the Right Tools. Python libraries like DoWhy provide a principled framework for causal inference. The key steps are: model the problem causally, identify whether the causal effect can be estimated from your data, estimate it, and refute (test the robustness of) your estimate.
4. Read the Foundational Work. “The Book of Why” by Judea Pearl and Dana Mackenzie is an accessible introduction. For more technical depth, “Causal Inference in Statistics: A Primer” by Pearl, Glymour, and Jewell provides the mathematical foundations. Miguel Hernán’s free course “Causal Diagrams: Draw Your Assumptions Before Your Conclusions” on edX is excellent for epidemiological applications.
Conclusion: The Path Forward for Intelligent Systems
Causality is not just an academic curiosity. It is the foundation for building AI systems that can reason about interventions, plan actions, and learn robustly from limited data. As we push toward more autonomous systems in robotics, healthcare, and other high-stakes domains, the ability to reason causally becomes essential.
The good news is that the tools exist. Judea Pearl’s do-calculus provides a complete framework for causal reasoning. Structural Causal Models give us a formal language. Causal discovery algorithms help us learn structure from data. And causal reinforcement learning shows how to build agents that understand what actually causes outcomes.
The challenge is adoption. Most machine learning education focuses on prediction, not causal inference. Most tools are optimized for finding patterns, not testing interventions. Changing this requires a shift in how we think about the problems AI should solve.
If you take one thing from this post, let it be this: the next time you see a correlation in your data, pause and ask “But is it causal?” That simple question opens up a world of more rigorous, more reliable, and ultimately more intelligent systems.
References
- Pearl, Judea. “The Seven Tools of Causal Inference with Reflections on Machine Learning.” Communications of the ACM, 2019.
- Schölkopf et al. “Toward Causal Representation Learning.” Proceedings of the IEEE, 2021.
- Bareinboim, Elias. “Causal Reinforcement Learning.” Columbia University Causal AI Lab.
- Kıcıman et al. “Causal Reasoning and Large Language Models: Opening a New Frontier for Causality.” arXiv, 2023.
- Deng et al. “Causal Reinforcement Learning: A Survey.” arXiv, 2023.