Skip to content

Problem Formulation

1.1 Environment: Ring Road Mixed Traffic

This project studies a single-lane ring road where one RL-controlled autonomous vehicle (CAV) shares the road with a human-driven head vehicle. The ring topology eliminates boundary effects (on-ramps, traffic lights) and isolates the core car-following dynamics.

The head vehicle (car0) periodically changes its cruising speed at random, creating disturbances that propagate through the ring. The AV (car1) must learn a speed-control policy that maintains safe following distances, tracks the traffic flow, and avoids abrupt acceleration changes.

The problem is formulated as a Markov Decision Process (MDP):

Component Description
State Normalized velocities and positions of all vehicles on the ring
Action Longitudinal acceleration command applied to the CAV
Reward Quadratic cost penalizing velocity error, spacing error, and control effort
Transition Determined by SUMO's car-following models and the applied acceleration

The simulation runs at a time step of \(\Delta t = 0.1\,\text{s}\), and each episode lasts up to 500 s (5,000 steps).


1.2 State Space

The observation is a fixed-length vector of normalized velocities and positions for all \(N\) vehicles:

\[ \mathbf{s} = \bigl[\underbrace{v_0/v_{\max},\;\dots,\;v_{N-1}/v_{\max}}_{\text{velocities}},\;\underbrace{p_0/L,\;\dots,\;p_{N-1}/L}_{\text{positions}}\bigr] \]

where:

  • \(v_{\max} = 30\;\text{m/s}\) is the maximum speed.
  • \(L\) is the ring circumference (computed from the SUMO network at startup).
  • Vehicles are sorted by ID so each index always maps to the same vehicle.
  • If fewer than \(N\) vehicles are present, the vector is zero-padded to maintain a fixed shape of \((2N,)\).
  • All values are clipped to \([0, 1]\).

For the default two-vehicle scenario, the observation has shape \((4,)\): [v_head, v_agent, p_head, p_agent].


1.3 Action Space

The native action space is a continuous scalar acceleration:

\[ a \in [a_{\min},\; a_{\max}] = [-3.0,\; 3.0]\;\text{m/s}^2 \]

The acceleration is integrated into velocity each step:

\[ v_{t+1} = \text{clip}\bigl(v_t + a \cdot \Delta t,\; 0,\; v_{\max}\bigr) \]

and applied to the vehicle via TraCI's setSpeed() for immediate response.

Discretization

Because both Q-learning and DQN operate over discrete actions, the continuous range is discretized into \(n\) evenly spaced bins using DiscretizeActionWrapper:

\[ \mathcal{A} = \bigl\{a_{\min},\;\; a_{\min} + \delta,\;\; a_{\min} + 2\delta,\;\;\dots,\;\; a_{\max}\bigr\}, \quad \delta = \frac{a_{\max} - a_{\min}}{n - 1} \]

1.4 Reward Function

The active reward function is based on the DeeP-LCC [1] cost formulation, which penalizes deviations from equilibrium velocity, desired spacing, and excessive control effort.

Components

Velocity error — penalizes deviation from the target speed (\(v^{*} = 15\;\text{m/s}\)):

\[ R_v = -w_v \cdot (v_{\text{ego}} - v^{*})^2 \]

Spacing error — penalizes deviation from the OVM equilibrium spacing \(s^{*}\):

\[ R_s = -w_s \cdot \bigl(\text{clip}(d_{\text{gap}} - s^{*},\;-20,\;20)\bigr)^2 \]

where \(s^{*}\) is computed using the Optimal Velocity Model (OVM):

\[ s^{*} = \frac{\arccos\!\bigl(1 - 2\,v^{*}/v_{\max}\bigr)}{\pi}\,(s_{\text{go}} - s_{\text{st}}) + s_{\text{st}} \]

with \(s_{\text{st}} = 5\;\text{m}\) (stop spacing) and \(s_{\text{go}} = 35\;\text{m}\) (free-flow spacing).

Control effort — penalizes large accelerations for smooth driving:

\[ R_u = -w_u \cdot a^2 \]

Safety constraint — a hard penalty when the gap falls below a minimum:

\[ R_{\text{safety}} = \begin{cases} -100 & \text{if } d_{\text{gap}} < s_{\min} \\ 0 & \text{otherwise} \end{cases} \]

Total Reward

\[ R = \frac{R_v + R_s + R_u + R_{\text{safety}}}{100} \]

The division by 100 scales per-step values into a range suitable for policy gradient methods.

Default Weights

Parameter Symbol Default
Velocity \(w_v\) 0.8
Spacing \(w_s\) 0.7
Control \(w_u\) 0.1
Min gap \(s_{\min}\) 5.0 m

1.5 Constrained MDP (CMDP) Formulation

The standard MDP formulation treats spacing violations as reward penalties. This conflates safety with performance — the agent can trade off collision risk against velocity tracking if the penalty weights are not tuned carefully.

A cleaner formulation is to separate the objective from the constraints using a Constrained Markov Decision Process (CMDP):

Objective

Maximize the expected cumulative DeeP-LCC reward:

\[ \max_\pi \; \mathbb{E}_\pi \left[\sum_{t=0}^{T} \gamma^t \, r(s_t, a_t)\right] \]

where \(r(s_t, a_t)\) is the DeeP-LCC reward (velocity error + spacing error + control effort, Section 1.4).

Constraints

Two spacing constraints must be satisfied in expectation:

Constraint 1 — Minimum spacing (\(s_\text{min}\)): the bumper-to-bumper gap to the physical leader must stay above a safe following distance.

\[ \mathbb{E}_\pi \left[\sum_{t=0}^{T} \gamma^t \, c_1(s_t, a_t)\right] \leq d_1 \]

where:

\[ c_1(s_t, a_t) = \max\!\bigl(0,\; s_\text{min} - s_t^{\text{leader}}\bigr) / s_\text{min} \]

Constraint 2 — Maximum spacing (\(s_\text{max}\)): the gap to the head vehicle must stay below a threshold to prevent the CAV from drifting away and exploiting the reward structure.

\[ \mathbb{E}_\pi \left[\sum_{t=0}^{T} \gamma^t \, c_2(s_t, a_t)\right] \leq d_2 \]

where:

\[ c_2(s_t, a_t) = \max\!\bigl(0,\; s_t^{\text{head}} - s_\text{max}\bigr) / s_\text{max} \]

Both cost functions are normalised to \([0, 1]\) so the Lagrange multipliers operate on a consistent scale regardless of the physical units.

Lagrangian relaxation

The CMDP is solved via Lagrangian relaxation, converting the constrained problem into an unconstrained one with adaptive penalty multipliers:

\[ r_\text{aug}(s_t, a_t) = r(s_t, a_t) - \lambda_1 \cdot c_1(s_t, a_t) - \lambda_2 \cdot c_2(s_t, a_t) \]

The multipliers \(\lambda_i\) are updated via dual gradient ascent at each rollout:

\[ \lambda_i \leftarrow \text{clip}\!\left(\lambda_i + \alpha_\lambda \cdot (\bar{c}_i - d_i),\; 0,\; \lambda_\text{max}\right) \]

where \(\bar{c}_i\) is the mean cost over the rollout and \(d_i\) is the violation tolerance.

Enforcement strategy

In practice, the two constraints are enforced differently:

Constraint Soft (Lagrangian) Hard (Safety Layer) Rationale
\(s_\text{min}\) (too close) Yes Yes Both mechanisms: hard clip prevents collision, soft penalty shapes policy
\(s_\text{max}\) (too far) No Yes Lagrangian penalty alone was too slow (400k+ steps still drifting); hard constraint prevents exploit immediately

The safety layer is a physics-based filter that clips accelerations before they reach SUMO (see Lagrangian PPO: Safety Layer). It operates independently of the Lagrangian penalty — the hard constraint prevents violations from occurring, while the soft penalty shapes the policy gradient to learn behaviors that avoid needing the constraint. In practice, the safety layer is so effective that the Lagrangian multiplier decays to zero (see exploration log analysis).

Parameters

Parameter Symbol Value Description
Min spacing \(s_\text{min}\) 5.0 m Minimum bumper-to-bumper gap to leader
Max spacing \(s_\text{max}\) 40.0 m Maximum gap to head vehicle (\(2 \times s^*\))
Violation tolerance \(d_i\) 0.1 Threshold below which \(\lambda\) decreases
Multiplier LR \(\alpha_\lambda\) 0.05 Dual gradient step size
Initial \(\lambda\) \(\lambda_0\) 1.0 Starting multiplier value
Max \(\lambda\) \(\lambda_\text{max}\) 10.0 Upper bound on multiplier

References

[1] Wang, J., Zheng, Y., Li, K., & Xu, Q. (2023). DeeP-LCC: Data-EnablEd Predictive Leading Cruise Control in Mixed Traffic Flow. IEEE Transactions on Control Systems Technology, 31(6), 2760–2776. doi:10.1109/tcst.2023.3288636