← Back to index

Below is a detailed development of the equations for the reinforcement learning (RL) framework applied to agent-based economic systems with dynamic asset pricing. These equations cover agent dynamics, asset pricing mechanisms, reward functions, and optimization processes.


1. Agent Dynamics

1.1. State Space ($S_i(t)$)

The state of agent $i$ at time $t$ includes its balance, asset holdings, and market observations:
$$ S_i(t) = \left[ B_i(t), H_{i,1}(t), H_{i,2}(t), \dots, H_{i,M}(t), P_1(t), P_2(t), \dots, P_M(t), \mathbf{M}(t) \right] $$
where:
- $B_i(t)$: Balance of agent $i$ at time $t$.
- $H_{i,j}(t)$: Holdings of asset $j$ by agent $i$ at time $t$.
- $P_j(t)$: Price of asset $j$ at time $t$.
- $\mathbf{M}(t)$: Market conditions (e.g., total supply, demand, external shocks).

1.2. Action Space ($A_i(t)$)

The action of agent $i$ at time $t$ includes buying, selling, or holding assets:
$$ A_i(t) = \left[ a_{i,1}(t), a_{i,2}(t), \dots, a_{i,M}(t) \right] $$
where $a_{i,j}(t)$ represents the action for asset $j$:
$$ a_{i,j}(t) \in {-1, 0, 1} $$
- $-1$: Sell one unit of asset $j$.
- $0$: Hold asset $j$.
- $1$: Buy one unit of asset $j$.

1.3. Policy ($\pi_i$)

The policy $\pi_i$ maps states to actions:
$$ \pi_i: S_i(t) \rightarrow A_i(t) $$
For example, using a softmax policy:
$$ \pi_i(a_i(t) | S_i(t)) = \frac{e^{Q_i(S_i(t), a_i(t)) / \tau}}{\sum_{a'} e^{Q_i(S_i(t), a') / \tau}} $$
where $Q_i$ is the action-value function, and $\tau$ is the temperature parameter.


2. Asset Pricing Mechanisms

2.1. Bonding Curve

The price of asset $j$ is determined by a bonding curve $f_j$:
$$ P_j(t) = f_j(S_j(t), \mathbf{\Phi}_j(t)) $$
where:
- $S_j(t)$: Total supply of asset $j$ at time $t$.
- $\mathbf{\Phi}_j(t)$: Parameters of the bonding curve (e.g., slope, intercept).

Examples of bonding curves:
- Linear:
$$ P_j(t) = m_j S_j(t) + b_j $$
- Exponential:
$$ P_j(t) = a_j e^{k_j S_j(t)} $$
- Sigmoid:
$$ P_j(t) = \frac{K_j}{1 + e^{-k_j(S_j(t) - S_{0,j})}} $$

2.2. Market Clearing

Prices can also emerge from market clearing conditions:
$$ D_j(P_j(t)) = O_j(P_j(t)) $$
where $D_j$ and $O_j$ are the demand and supply functions for asset $j$.


3. Reward Function

3.1. Utility-Based Reward

The reward $r_i(t)$ for agent $i$ is based on its utility function $U_i$:
$$ r_i(t) = U_i(B_i(t), H_{i,j}(t), P_j(t)) $$
For example, a profit-maximizing agent might have:
$$ U_i = \sum_{j=1}^M \left[ H_{i,j}(t) \cdot P_j(t) \right] + B_i(t) $$

3.2. Risk-Adjusted Reward

A risk-averse agent might include a penalty for volatility:
$$ U_i = \sum_{j=1}^M \left[ H_{i,j}(t) \cdot P_j(t) \right] + B_i(t) - \lambda \cdot \text{Var}(P_j(t)) $$
where $\lambda$ is the risk aversion coefficient.


4. System Dynamics

4.1. Asset Supply Update

The supply of asset $j$ changes based on minting, burning, or agent actions:
$$ S_j(t+1) = S_j(t) + \sum_{i=1}^N a_{i,j}(t) $$

4.2. Agent Balance Update

The balance of agent $i$ is updated based on trading and policy interventions:
$$ B_i(t+1) = B_i(t) - \sum_{j=1}^M a_{i,j}(t) \cdot P_j(t) + \text{Income}_i(t) + \text{PolicyEffects}_i(t) $$

4.3. Agent Holdings Update

The holdings of agent $i$ for asset $j$ are updated as:
$$ H_{i,j}(t+1) = H_{i,j}(t) + a_{i,j}(t) $$


5. Reinforcement Learning Algorithms

5.1. Q-Learning

The action-value function $Q_i$ is updated as:
$$ Q_i(S_i(t), A_i(t)) \leftarrow Q_i(S_i(t), A_i(t)) + \alpha \left[ r_i(t) + \gamma \max_{a'} Q_i(S_i(t+1), a') - Q_i(S_i(t), A_i(t)) \right] $$
where:
- $\alpha$: Learning rate.
- $\gamma$: Discount factor.

5.2. Policy Gradient

The policy parameters $\theta_i$ are updated using the gradient:
$$ \theta_i \leftarrow \theta_i + \alpha \nabla_{\theta_i} \log \pi_i(A_i(t) | S_i(t)) \cdot G_i(t) $$
where $G_i(t)$ is the cumulative reward:
$$ G_i(t) = \sum_{k=0}^T \gamma^k r_i(t+k) $$

5.3. Deep RL

For high-dimensional state spaces, a neural network approximates the Q-function or policy:
$$ Q_i(S_i(t), A_i(t)) \approx Q_i(S_i(t), A_i(t); \mathbf{W}_i) $$
where $\mathbf{W}_i$ are the network weights.


6. Optimization

6.1. Bonding Curve Parameter Optimization

The parameters $\mathbf{\Phi}j(t)$ are optimized to minimize price volatility:
$$ \min
{\mathbf{\Phi}_j} \text{Var}(P_j(t)) $$

6.2. Airdrop Strategy Optimization

The airdrop strategy is optimized to maximize adoption:
$$ \max_{\mathbf{AirdropParams}} \sum_{i=1}^N \mathbb{I}(H_{i,j}(t) > 0) $$
where $\mathbb{I}$ is the indicator function.


7. Metrics for Evaluation

7.1. Price Volatility

$$ \text{Volatility}(P_j) = \sqrt{\frac{1}{T} \sum_{t=1}^T (P_j(t) - \bar{P}_j)^2} $$

7.2. Wealth Distribution (Gini Coefficient)

$$ G = \frac{\sum_{i=1}^N \sum_{k=1}^N |B_i(t) - B_k(t)|}{2N \sum_{i=1}^N B_i(t)} $$

7.3. Resource Utilization

$$ \text{Utilization} = \frac{\sum_{i=1}^N Q_{i,j}(t)}{C_j} $$
where $C_j$ is the total capacity of resource $j$.


These equations provide a comprehensive mathematical foundation for applying RL to agent-based economic systems. They can be adapted and extended for specific use cases, such as DeFi protocols, token economies, or resource allocation systems.