Graphical Abstract Figure: System Architecture for Training and Evaluating the RL-Based Chaser Drone Policy with Gazebo, ROS, YOLOv5, DDPG, and Ardupilot

Abstract

Unmanned aerial vehicles (UAVs) are fast becoming a low-cost, widely accessible tool for various security and surveillance tasks. This accessibility has also led to the use of UAVs (drones) for unlawful activities such as spying or infringing on restricted or private airspaces. This rogue use of drone technology makes it challenging for security agencies to maintain the safety of critical infrastructure. Additionally, because of drones’ varied low-cost designs and agility, it has become challenging to identify and track them using conventional radar systems. This paper proposes a deep reinforcement learning-based approach for identifying and tracking an intruder drone using a chaser drone. Our proposed solution employs computer vision techniques interleaved with a deep reinforcement learning controller for tracking the intruder drone within the chaser’s field of view. The complete end-to-end system has been implemented using the robot operating system and Gazebo, with an Ardupilot-based flight controller for flight stabilization and maneuverability. The proposed approach has been evaluated on multiple dynamic intruder trajectories and compared with a proportional-integral-derivative-based controller. The results show that the deep reinforcement learning policy achieves a tracking accuracy of 85%. The intruder localization module is able to localize drones in 98.5% of the frames. Furthermore, the learned policy can track the intruder even when there is a change in the speed or orientation of the intruder drone.

1 Introduction

Unmanned aerial vehicles (UAVs), commonly known as drones, have quickly become a tool for carrying out remote surveillance missions. They have almost undetectable radar signatures and can perform various controlled maneuvers that are both complex and highly unpredictable. These maneuvers can include rapid accelerations, sharp turns, and sudden changes in altitude, making drones highly agile and difficult to track or intercept using traditional methods such as patrol officers, radars, or passive surveillance systems like CCTV. These capabilities are further enhanced by the ability of drones to be programmed for complex waypoint-based missions and autonomous operations without direct human intervention. In addition, the ability of drones to operate autonomously complicates detection strategies because they can make real-time decisions based on the data they collect, alter their flight paths, and adopt new tactics to evade detection and countermeasures. This presents unique challenges for legacy defense systems that rely on predictable flight dynamics, controlled waypoint execution, and human-in-the-loop oversight to detect and track an intruder in the airspace. In recent years, many drone applications have emerged, from the delivery of medical supplies in remote locations to security applications such as border patrol and surveillance [1]. In addition, drones have proven their utility and effectiveness in search and rescue operations during natural disasters [2], highlighting their cost effectiveness and adaptability as an emerging technology.

Due to the diverse capabilities inherent in drones, they are increasingly being used for illicit activities, such as unauthorized surveillance [3], intrusion into secure spaces, and covert transport of contraband and weapons. Current research addressing the application of pursuit-evasion techniques for the timely detection and mitigation of intrusions caused by these drones is limited [4]. The complexity of the issue is increased because intruder drones often evade traditional radar systems due to their varied miniature designs and configurations and their ability to fly at high speeds and low altitudes with minimal acoustic emissions. This necessitates rethinking traditional defense systems and developing new techniques and strategies capable of countering the unique challenges posed by drones.

In Ref. [5], we investigated the scenario of chasing intruder drones using a monocular camera, a deep reinforcement learning (RL) model, and an onboard computation unit to ensure security and a timely response to the threats posed by drones to a secured area. Extending that work, this paper presents a comprehensive reward formulation, scales up the simulation to new chasing strategies, implements novel training environments, evaluates the policy in dynamic environments that resemble real-world conditions, and compares it with a proportional-integral-derivative (PID)-based controller. In highly dynamic pursuit-evasion environments characterized by rapidly evolving intruder positions, conventional drone control methods prove insufficient. Hence, a learning-based framework presents a promising approach for controlling chaser drones and intercepting unpredictable target drones. Specifically, RL offers an adaptive technique to derive optimal policies in non-stationary and dynamic environments. Unlike hard-coded logical programs or simple control flow scripts, RL systems enable agents to optimize behavior through iterative interactions with a simulated environment. This requires real-time sensing and trajectory modeling to estimate intruder motions continuously. To achieve the task of tracking and following an intruder drone, our approach involves rapid processing of the chaser drone’s incoming visuals, continuous estimation of subsequent movement actions, a high-level controller for the drone’s movements, and extensive testing to further fine-tune the learned control policy. The extended contributions of this work are as follows:

  1. The reward function optimizes the chaser drone’s alignment and speed during pursuit.

  2. A penalty is added when the chaser gets too close to the intruder to ensure safe dynamics.

  3. Multiple dynamic trajectories are used to train a robust chaser drone policy.

  4. Extensive experiments are conducted in varied environmental settings to evaluate the proposed model, along with comparisons with a PID-based controller on the same task.

2 Related Work

Various technological approaches have been proposed to track an intruder drone using the camera feed from a flying UAV. References [6] and [7] have explored deep learning for detecting and tracking cooperative and non-cooperative UAVs using visual cameras. In Ref. [8], the authors developed an autonomous long-range drone detection system with a high accuracy of 95.5% at 250 m. In Ref. [9], the authors present a novel approach utilizing a stereo camera system to detect, track, and intercept a faster UAV by reconstructing the intruder’s trajectory. This technique is suitable for medium-range detection of drones and can be used to identify the trajectory the intruder follows. Furthermore, Ref. [10] proposed an approach to detect flying objects such as UAVs and aircraft when they occupy a small portion of the field of view (FOV), possibly moving against complex backgrounds and filmed with a moving camera. In the context of tracking moving targets with UAV-borne radars, Ref. [11] presented improved Kalman filter variants for UAV tracking with radar motion models, which could provide insight into integrating radar-based tracking with visual feeds for enhanced intruder drone chases. In Ref. [12], the authors proposed using a 5×5 2D grid to partition the environment into 25 distinct states and assign Q values. Although this approach may be viable in basic, indoor, and familiar settings, its effectiveness is limited when applied to real-world scenarios. The study specifically focuses on a relatively small state space (25 states), which is inadequate for most practical applications. In Ref. [13], the authors presented a solution for tracking drones using an action–decision network approach, which helps determine the optimal placement of the bounding box for subsequent steps. Although the method outlined in the paper accounts for the likely location of the chaser in the FOV and the direction of the next search, it does not address the need for real-time decisions to continuously pursue the intruder. In Ref. [14], the authors proposed a system capable of identifying objects in the sky. It can distinguish between birds, clouds, and drones, detect false positives, and provide accurate detection of drones. In Ref. [15], the authors published a data set of 500 video pairs, along with around 580k manually annotated bounding boxes. This data set is used to benchmark various drone detection and tracking methods, and the paper uses a dual-flow semantic consistency method for drone tracking. In Ref. [16], the authors proposed a method for real-time agricultural surveillance using drones to detect, classify, and track objects while comparing various models such as YOLOv7, SSD, Mask R-CNN, and Faster R-CNN for object detection. In Ref. [17], the authors introduced a novel framework to improve drone navigation. This system dynamically adjusts task execution locations, input resolution, and image compression ratios to achieve low inference latency, high prediction accuracy, and extended flight distances. As previously discussed, while there are various techniques for detecting and localizing drones through visual data, there remains a notable gap in the research surrounding continuous interception and responsive counter-attack of intruder drones via reinforcement learning-based controls. Our proposed approach addresses this issue and establishes a reliable solution.

3 System Description

This work involves two drones: the chaser drone, δchaser, and the intruder drone, δintruder. The objective of δchaser is to track δintruder over a long period of time. δchaser is equipped with a monocular camera to capture the latter within its FOV. FOV is the area captured by the camera in its image plane. In addition, the intended trajectory and velocity of δintruder are unknown to δchaser. Successful tracking over a long period occurs when δintruder is present in δchaser’s FOV with a pre-defined configuration. The pre-defined configuration for drone tracking is described as
(1)

In Eq. (1), (FOVcenterX, FOVcenterY) represents the central coordinates of δchaser’s FOV, and (Xδintruder(t), Yδintruder(t)) represents the coordinates of δintruder in δchaser’s FOV at time t. W and H represent the width and height of δchaser’s FOV in pixel units. The size of δintruder in δchaser’s FOV, defined as σ(δintruder, FOV), provides localization and a noisy estimate of the depth between δchaser and δintruder, denoted by d(δchaser, δintruder), as represented in Eq. (2). The distance between δchaser and δintruder is shown in Fig. 1.

Fig. 1: Reward structure representation for the chaser drone
(2)
The main goal is to minimize the gap between the chaser drone and the intruder drone while simultaneously maintaining the δintruder’s position at the center of the FOV of δchaser, given as
(3)
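
To make the tracking criterion and the bounding-box-based distance cue concrete, the short Python sketch below shows one way they could be evaluated; the FOV resolution, the centering tolerances, and the distance-proxy constant are illustrative assumptions rather than the exact forms of Eqs. (1) and (2).

```python
# Illustrative sketch only: one plausible evaluation of the centering condition
# and a box-size-based distance proxy. Resolution, tolerances, and the scale
# constant k are assumptions, not values from the paper.
W, H = 640, 480  # assumed FOV resolution in pixels


def is_tracked(box, tol_x=0.25, tol_y=0.25):
    """box = (x_low, y_low, x_high, y_high) of the intruder in the chaser's FOV."""
    cx, cy = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0
    within_x = abs(cx - W / 2) <= tol_x * (W / 2)
    within_y = abs(cy - H / 2) <= tol_y * (H / 2)
    return within_x and within_y


def distance_proxy(box, k=1500.0):
    """Noisy, monotonically decreasing proxy for d(chaser, intruder):
    a larger bounding-box perimeter means the intruder is closer."""
    perimeter = 2 * ((box[2] - box[0]) + (box[3] - box[1]))
    return k / max(perimeter, 1e-6)
```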

4 Proposed Methodology

This work proposes an RL-based framework for the autonomous tracking and chasing of an intruder drone by a chaser drone. An end-to-end pipeline is designed that captures images from the chaser drone’s FOV via the robot operating system (ROS), runs a computer vision framework to detect δintruder in δchaser’s FOV using YOLOv5, feeds the required state information into the RL framework, and translates the output into appropriate high-level control signals for δchaser, which are then fed into a quadcopter running Ardupilot. YOLOv5 outputs a bounding box for δintruder, which is then preprocessed and fed into the RL-based framework. The source code for the implementation and the annotated data set used to train YOLOv5 are available publicly. The next section describes the computer vision-based localization module used in the framework.

4.1 Intruder Localization.

The chaser drone, δchaser, captures raw frames by subscribing to the ROS topic /drone1/iriscam/image_raw. For detecting δintruder in the FOV, the you only look once (YOLOv5) object detection framework is employed. It takes a raw image as input, preprocesses it, and provides the bounding box coordinates of the detected intruder, represented by (Xlow, Ylow, Xhigh, Yhigh). The localization module is trained on 8000 manually annotated images drawn from real-world footage and from frames captured in the Gazebo simulation. During the creation of the dataset, various orientations and heights of the chaser drone are used along with multiple weather conditions and backgrounds. The dataset is expanded to 24,000 images using various transformation techniques, such as rotation, flipping, occlusion, and cropping. The resultant model can localize drones in 98.5% of the frames. The localization module can also be seamlessly integrated into real-world drone tracking applications, and it can be swapped with other frameworks, such as YOLOv4, SSDs, or R-CNN, without significant changes to the overall system architecture. In Ref. [18], the authors compared YOLOv5 with YOLOv4 and YOLOv3 for detecting landing zones for UAVs; the results showed improvements in accuracy, precision, and recall when using YOLOv5.
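
As a concrete illustration of this step, the sketch below subscribes to the camera topic and runs YOLOv5 on each frame, assuming rospy, cv_bridge, and a custom-trained model loaded through torch.hub; the weights filename and the confidence threshold are assumptions, not values from the paper.

```python
# Sketch of the intruder-localization node (weights path and threshold are assumed).
import rospy
import torch
from cv_bridge import CvBridge
from sensor_msgs.msg import Image

bridge = CvBridge()
# Custom YOLOv5 model trained on the drone dataset; "drone_weights.pt" is assumed.
model = torch.hub.load("ultralytics/yolov5", "custom", path="drone_weights.pt")


def on_frame(msg):
    frame = bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
    detections = model(frame).xyxy[0]  # rows: x_low, y_low, x_high, y_high, conf, cls
    if len(detections) > 0:
        x_low, y_low, x_high, y_high, conf, _ = detections[0].tolist()
        if conf > 0.5:  # assumed confidence threshold
            rospy.loginfo("intruder box: %.0f %.0f %.0f %.0f", x_low, y_low, x_high, y_high)


rospy.init_node("intruder_localization")
rospy.Subscriber("/drone1/iriscam/image_raw", Image, on_frame)
rospy.spin()
```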

4.2 Markov Decision Process.

Typically, an RL problem is framed as a Markov decision process (MDP), with the agent having full access to the environment’s state. However, in this problem, the environment is partially observable; therefore, a limited amount of historical information is encoded in the state to make the system Markovian. An MDP is a tuple (S, A, P, R), where S denotes the set of environment states, A refers to the agent’s permissible actions, P represents the environment’s transition probabilities, and R is the reward function for the agent’s actions. However, in this particular problem, the environment’s transition probabilities P are not known. Consequently, the drone must learn a policy by repeatedly interacting with and sampling the environment.

States: In the framework, δchaser has access to the images captured by the mounted camera and to its own velocity. The camera image FOV(δchaser) is communicated using ROS to the intruder localization module and processed using YOLOv5. This network localizes δintruder in FOV(δchaser) and returns the bounding box coordinates. These pixel coordinates are then used as part of the state space for δchaser. The overall observation of δchaser has a total of eight components, represented as O = (Xlow, Ylow, Xhigh, Yhigh, VX, VY, VZ, Yδchaser), where VX, VY, VZ are the velocities of δchaser along the X, Y, Z axes and Yδchaser represents the current orientation (yaw) of δchaser. Furthermore, we keep the five most recent observation tuples to form a single state of the environment at any point in time, represented by St = f(Ot−4, Ot−3, Ot−2, Ot−1, Ot).
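
A minimal sketch of this state construction is shown below, stacking the five most recent 8-component observations into a single vector; the flattening order and the episode-start padding are assumptions.

```python
# Sketch of state construction from the five most recent observations.
from collections import deque

import numpy as np

HISTORY = 5
obs_history = deque(maxlen=HISTORY)


def build_state(box, velocity, yaw):
    """box=(x_low, y_low, x_high, y_high), velocity=(vx, vy, vz), yaw in radians."""
    obs = np.array([*box, *velocity, yaw], dtype=np.float32)  # 8 components
    obs_history.append(obs)
    while len(obs_history) < HISTORY:  # pad with copies at episode start (assumed)
        obs_history.appendleft(obs.copy())
    return np.concatenate(list(obs_history))  # shape (40,)
```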

Actions: The action space of the chaser drone is defined as A = (VX, VY, VZ, Yδchaser), where the first three values represent velocities in the forward, lateral, and vertical directions and Yδchaser represents the yaw of the drone. All components of the action tuple are continuous, and each component is clipped to the range of (−2, 2) m/s.

Rewards: The most crucial aspect of learning an RL policy is to design a well-suited reward model. Rewards encode the objective of the system within the MDP problem. In our system, we consider two types of reward, namely Rtrack and Ralign. This reward model ensures that the chaser drone δchaser always keeps δintruder in the center of FOV(δchaser), while also reducing the distance between δintruder and δchaser, d(δintruder, δchaser). Ralign is calculated from the Euclidean distance between the center of FOV(δchaser), i.e., (FOVcenterX, FOVcenterY), and (Xδintruder(t), Yδintruder(t)), while Rtrack is based on the total perimeter of the bounding box encompassing δintruder in FOV(δchaser), given as
(4)
where
(5)
(6)
(7)
where κalign ∈ [ω1, ω2) and κtrack ∈ [ψ1, ψ2). A substantial penalty ρ2 is imposed on the chaser drone when the intruder’s position remains outside the chaser drone’s FOV for an extended period. Consequently, if, for a duration of TMax steps, the chaser drone fails to detect δintruder within its field of view, FOV(δchaser), the episode ends early, accompanied by a penalty of ρ2. In an alternative scenario, if the chaser drone approaches δintruder too closely, thus increasing the risk of a collision, a penalty of ρ1 is incurred, but the episode continues. The various components of the rewards are shown in Fig. 1.
(8)
The total reward (R) for any step at time t is calculated as a linear combination of three components, given as
(9)
where αr,βr,γr are the constant non-zero weights for tracking, alignment, and penalty terms, respectively.
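
For concreteness, the sketch below shows one way such a step reward could be computed; the weights follow Table 1, but the normalization of the align and track terms and the penalty triggers are illustrative assumptions rather than the exact piecewise forms of Eqs. (4)-(8).

```python
# Illustrative step reward: weighted sum of track, align, and penalty terms.
# Normalisation constants and penalty magnitudes are assumptions.
import math

W, H = 640, 480  # assumed FOV resolution in pixels
ALPHA_R, BETA_R, GAMMA_R = 0.4, 0.4, 0.2  # weights from Table 1
T_MAX = 50  # steps the intruder may stay out of the FOV before the episode ends


def step_reward(box, too_close, lost_steps):
    cx, cy = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0
    # Align term: larger when the intruder sits near the FOV centre.
    offset = math.hypot(cx - W / 2, cy - H / 2)
    r_align = 1.0 - offset / math.hypot(W / 2, H / 2)
    # Track term: larger when the bounding-box perimeter (a closeness cue) grows.
    r_track = 2 * ((box[2] - box[0]) + (box[3] - box[1])) / (2 * (W + H))
    # Penalties: a rho_1 analogue if dangerously close, a rho_2 analogue if lost too long.
    penalty = 0.0
    if too_close:
        penalty -= 1.0
    if lost_steps >= T_MAX:
        penalty -= 5.0
    return ALPHA_R * r_track + BETA_R * r_align + GAMMA_R * penalty
```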

4.3 Learning the Tracking Policy.

As we are dealing with a continuous action space A rather than a discrete one, policy gradient methods fit this setting better, particularly deep deterministic policy gradients (DDPG) [19]. DDPG consists of two deep neural networks, the actor and the critic network, as shown in Fig. 2. The actor’s policy network, denoted as μ(s|θμ), is a function of the state space that outputs an action given a state; these actions are executed by δchaser in the environment, while the critic network Q(s,a|θQ) is used to evaluate the viability of the actions generated by the actor. DDPG is a suitable framework for our objective of controlling δchaser continuously according to the visual feed and generating continuous actions in the X, Y, Z directions, as well as controlling the orientation of δchaser. There is a replay buffer, which collects and stores previous samples from the environment in the form of (st, at, rt+1, st+1), where st is the current state of the drone, at is the action taken in st, rt+1 is the reward observed after taking at in st, and st+1 is the next state the drone ends up in. The replay buffer mitigates sample inefficiency and makes updates more effective.

Fig. 2: The proposed DDPG-based model consists of actor, critic, and target networks for learning the control policy for the chaser drone
In each step of the episode, a mini-batch of samples is drawn uniformly at random, and the actor and critic networks are updated. The actor parameters are updated in the direction of the gradient of the performance objective J
(10)
where the expectation is taken over states st drawn from a discounted state visitation distribution for a stochastic behavior policy β. The critic is updated by minimizing the expected loss between the critic’s value and the target generated from the target critic network, given as
(11)
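
A condensed PyTorch-style sketch of one such update is shown below, assuming the actor, critic, target networks, and optimizers already exist; it mirrors the structure of Eqs. (10) and (11) but omits details such as terminal-state masking.

```python
# Sketch of a single DDPG update on a sampled mini-batch (PyTorch).
import torch
import torch.nn.functional as F

GAMMA, TAU = 0.99, 0.001  # discount factor and target update rate from Table 1


def ddpg_update(batch, actor, critic, target_actor, target_critic, actor_opt, critic_opt):
    s, a, r, s_next = batch  # tensors sampled from the replay buffer
    # Critic: regress Q(s, a) toward the bootstrapped target (cf. Eq. (11)).
    with torch.no_grad():
        y = r + GAMMA * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
    # Actor: ascend the deterministic policy gradient (cf. Eq. (10)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    # Soft-update the target networks toward the learned networks.
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1.0 - TAU).add_(TAU * p.data)
```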

Detailed instructions and the environment setup scripts are available publicly.

5 Experimental Setup

To train and assess the proposed approach, we implemented the system for chaser and intruder drones using Gazebo and ROS. Gazebo, a 3D simulator with a robust physics engine, allows realistic simulation of various scenarios and interactions, incorporating sensors such as cameras, LiDARs, and GPS. ROS, a widely used open-source middleware for robotic functionalities, adopts a subscriber–publisher model with libraries and tools that facilitate communication among different modules within a robotic system. It promotes the development of reusable code in a standardized API format, enhancing the construction, modification, and interaction with robots.

Ardupilot, an open-source flight controller, was employed to control the drone. Its functionalities include GPS navigation, waypoint movement, return to launch, hovering, and an inertial measurement unit model. In our implementation, Gazebo serves as the 3D simulation platform, ROS facilitates communication between the chaser drone and the Gazebo environment, and Ardupilot guides the flight maneuvers of the chaser drone based on learned control from the DDPG model. Figure 3 shows the overall system architecture. ROS is central to the implementation, providing middleware support through its publisher–subscriber framework. Various ROS topics, as shown in Fig. 3, perform specific functions such as drone image capture, drone detection, training, and translation of actions for the drone.

Fig. 3: Comprehensive system architecture for training and evaluating the RL-based chaser drone policy using Gazebo, ROS, YOLOv5, DDPG, and Ardupilot

5.1 Training Simulation.

The training environment is simulated in Gazebo, along with a localization and communication module using ROS and YOLOv5. The intruder’s velocity varies from episode to episode, ranging between 5 m/s and 10 m/s, defaulting to the values produced during random trajectory generation. The policy is initialized for training with high exploration noise, enabling the DDPG model to gather a wider variety of observations. The training is performed for a total of 4000 episodes. During training, a different trajectory is generated at random for each episode. Both δchaser and δintruder are randomly spawned in the world, then δintruder starts following the generated trajectory, and the episode continues for a maximum of 1500 time-steps or until δintruder is out of sight for 50 consecutive time-steps. To facilitate exploration during policy learning, additional noise is introduced, sampled from an Ornstein–Uhlenbeck (OU) process N, given by
(12)
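
A minimal sketch of such an OU process, and of how its sample could be added to the actor output before clipping to the action limits, is given below; the θ, σ, and dt values are common defaults and are assumptions, as the paper does not list them.

```python
# Sketch of Ornstein-Uhlenbeck exploration noise (theta, sigma, dt are assumed).
import numpy as np


class OUNoise:
    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1.0):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.full(size, mu, dtype=np.float64)

    def sample(self):
        dx = self.theta * (self.mu - self.x) * self.dt
        dx += self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape)
        self.x = self.x + dx
        return self.x


noise = OUNoise(size=4)  # one component per action dimension
action = np.zeros(4)  # placeholder for the actor's output
noisy_action = np.clip(action + noise.sample(), -2.0, 2.0)  # respect the (-2, 2) m/s limit
```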

The comprehensive training of the DDPG model is conducted on a DELL server equipped with an Intel Xeon processor, an NVIDIA Quadro RTX A4000 8 GB graphics card, and 64 GB RAM. The Gazebo simulation runs on an ASUS system that features an AMD 5800H processor, 16 GB RAM, and an NVIDIA RTX 3060 6 GB graphics card. Communication between the DDPG training and the Gazebo simulation uses Python API calls through the Flask framework. This distributed implementation streamlines the processes and establishes a controller-responder architecture, facilitating scalability to more clusters when necessary. The intruder localization module is tested under various environmental conditions: broad daylight, night, fog, and mist. The module had difficulty detecting intruders in foggy conditions but performed well in the other three conditions. The module is further tested in various environments, including rural versus urban settings and hilly areas. To deploy the solution in the real world, a UAV with an onboard camera unit and a computation unit is required. The computing unit is responsible for intruder localization and policy execution, and the updated velocity commands are sent to the Ardupilot firmware, which in turn is responsible for the movement of the UAV. Sim-to-real transfer is challenging, but by designing the state space to be independent of environmental variations, the required fine-tuning can be minimized. A velocity clipping function, which clips the model output to the range of (−2, 2) m/s, ensures that no abrupt changes in the movement of the UAV arise during real-world testing. Incorporating safety guarantees in real-world deployment is part of our ongoing research. The specific hyperparameter values are detailed in Table 1, which includes fine-tuning parameters for the deep RL framework and the upper and lower limits for the reward structure.
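
To illustrate the controller-responder split, a minimal Flask sketch of the simulation-side endpoint is shown below; the route name, payload schema, and the simulate_one_step helper are hypothetical stand-ins for the actual ROS/Gazebo step logic, not the paper's API.

```python
# Hypothetical simulation-side endpoint for distributed DDPG training.
from flask import Flask, jsonify, request

app = Flask(__name__)


def simulate_one_step(action):
    """Stand-in for the ROS/Gazebo step; returns dummy values in this sketch."""
    return [0.0] * 40, 0.0, False


@app.route("/step", methods=["POST"])
def step():
    action = request.get_json()["action"]  # [vx, vy, vz, yaw_rate]
    observation, reward, done = simulate_one_step(action)
    return jsonify({"obs": observation, "reward": reward, "done": done})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

The training machine would then call such an endpoint, for example with requests.post("http://<sim-host>:5000/step", json={"action": action}), and push the returned transition into the replay buffer.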

Table 1: Hyper-parameters for DDPG model training

Discount factor (γ): 0.99
Mini-batch size (B): 128
Actor learning rate (ηπ): 0.001
Critic learning rate (ηQ): 0.001
Replay buffer size (N): 100,000
Target update parameter (τ): 0.001
RTrack range (ω1, ω2, ω3): (100, +5, 200)
RAlign range (ψ1, ψ2, ψ3): (150, +10, 200)
Penalty range (ρ1, ρ2): (50, 250)
Time-step for penalty (TMax): 50
Reward function weights (αr, βr, γr): (0.4, 0.4, 0.2)

5.2 Deep Network Architectures.

DDPG utilizes two principal neural network architectures: the actor network and the critic network. The actor network directly maps states to actions and outputs the best-learned action for any given state, aiming to maximize the policy’s performance. The critic network evaluates the action output by the actor by computing the value function, which estimates the quality of the action taken from a particular state. Both networks update their weights to better predict and evaluate actions. The key hyperparameters in DDPG include separate learning rates for the actor and critic networks, which determine how quickly the networks adjust during training. The discount factor (γ), usually set between 0.9 and 0.99, balances immediate and future rewards. The size of the replay buffer influences the range of experiences for learning, while the batch size dictates the number of experiences sampled for network updates. The τ value, usually around 0.001, controls the rate at which the target networks are updated. Finally, the noise processes, defined by mean, θ, and σ parameters, govern the exploration behavior using the OU process.
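
Since the paper does not specify layer dimensions, the sketch below shows one plausible PyTorch realization of the actor and critic, with the input size implied by the five stacked 8-component observations and the actor output scaled to the ±2 m/s action limit; the hidden-layer widths are assumptions.

```python
# Illustrative actor/critic architectures (hidden widths are assumptions).
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, MAX_ACTION = 40, 4, 2.0  # 5 x 8 observations, 4 actions, +/-2 limit


class Actor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, ACTION_DIM), nn.Tanh(),  # squash, then scale to the limit
        )

    def forward(self, state):
        return MAX_ACTION * self.net(state)


class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```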

5.3 Testing Environments.

In this section, we discuss the various test environments, ranging from low-speed straight-line trajectories to high-speed circular trajectories of the intruder drone, used to stress test and evaluate the proposed chaser drone model. During the training phase, multiple trajectories of δintruder must be included to make the training more robust and dynamic. If random starting locations and random trajectories of δintruder are not used, the framework may learn a suboptimal policy that does not capture the typical evading trajectories δintruder may take during real-world deployment.

Straight Path: In this scenario, the intruder drone navigates primarily along a straight path with slight turns in between. The speed of the intruder is fixed at 5 m/s, and the intruder is equipped with a mechanism for dynamic evasion tactics whenever δchaser approaches too close.

Zig-Zag Path: In this scenario, the intruder moves with a velocity of 5 m/s in a zig-zag path. This path introduces rapid changes in direction, challenging the chaser drone to adapt quickly to the unpredictable movements of the intruder.

Circular Path: In this scenario, the intruder follows a smooth, continuous circular path, which introduces a new challenge in chasing strategy. Here, the trajectory is described by significant curvature in a smooth manner, unlike scenarios with abrupt turns or zig-zag patterns.

Sinusoidal Trajectory: In this scenario, the intruder follows a trajectory similar to a sine wave, whose amplitude and frequency are given by y(t) = 50 sin((2π/60)·3t + ϕ). It oscillates three times per minute and has an amplitude of 50 m.
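
As an illustration, the short sketch below generates waypoints for this sinusoidal trajectory; the forward speed along x and the sampling interval are assumptions.

```python
# Waypoints for the sinusoidal intruder trajectory y(t) = 50*sin(3*(2*pi/60)*t + phi).
import numpy as np


def sinusoidal_waypoints(duration_s=120.0, dt=0.1, forward_speed=5.0, phi=0.0):
    t = np.arange(0.0, duration_s, dt)
    x = forward_speed * t  # assumed steady progress along x
    y = 50.0 * np.sin(3.0 * (2.0 * np.pi / 60.0) * t + phi)
    return np.stack([x, y], axis=1)  # (N, 2) waypoints in metres


waypoints = sinusoidal_waypoints()
```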

Random Trajectory: In this scenario, the intruder attempts a highly unpredictable trajectory with many turns and sudden speed changes, which introduces high uncertainty into its movements. This scenario is designed to simulate situations where the intruder may employ very erratic and unpredictable movements to evade.

High-speed: In this scenario, the chaser policy is evaluated against an intruder moving at a steady speed of 10 m/s, testing cases where high speed can be used to evade or cover large areas rapidly. The chaser’s ability to adapt to this increased speed is crucial to maintaining an effective pursuit.

Varying Speed: In this scenario, the intruder’s speed is varied pseudorandomly in the range (5, 10) m/s so that it can drift out of the FOV. This scenario demands extensive use of the track reward component of the total reward to track the intruder properly.

Occlusions: In this scenario, while the intruder is being followed, it becomes occluded behind buildings and is not visible in the FOV of the chaser. This situation poses a new challenge, where the chaser needs to make extensive use of previously available information to estimate the approximate trajectory of the intruder and continue tracking until the intruder is visible again.

5.4 Performance Metrics.

During training, we tracked the progress of our DDPG model using several metrics, although it is challenging to evaluate the output policy accurately with any single measure. We describe the metrics that helped us keep track of the training and evaluation process:

  1. Total reward: The sum of rewards over time reflects policy performance. Alone, it does not guarantee the robustness of policy.

  2. Critic loss: This metric indicates convergence behavior during training.

  3. Absolute value error: Discrepancy between the actual return and the predicted Q-value. It assesses the agent’s understanding of the environment.

  4. Average mean trajectory error: Measures the alignment between chaser and intruder UAVs during evaluation.

  5. Episode length: Duration for which the chaser tracks the intruder; it reflects policy improvement.

  6. Average mean chase distance error: Measures proximity to the intruder without collision during the chase for successful pursuit (a computation sketch for this and the trajectory error follows this list).
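
A minimal sketch of how the two trajectory-level errors (items 4 and 6) can be computed from time-aligned position logs is given below; the desired standoff distance is an assumption, since the paper does not state one.

```python
# Sketch of the trajectory and chase-distance error metrics (standoff is assumed).
import numpy as np


def mean_trajectory_error(chaser_xy, intruder_xy):
    """Mean Euclidean offset between time-aligned (N, 2) position arrays."""
    return float(np.mean(np.linalg.norm(chaser_xy - intruder_xy, axis=1)))


def mean_chase_distance_error(chaser_xy, intruder_xy, desired_standoff=10.0):
    """Mean absolute deviation of the chase distance from an assumed standoff."""
    distances = np.linalg.norm(chaser_xy - intruder_xy, axis=1)
    return float(np.mean(np.abs(distances - desired_standoff)))
```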

Next, we present the results gathered during the training and further evaluation of the trained policy of the chaser drone on various test scenarios.

6 Results

In this section, we present the results of the performance of our proposed approach on various metrics as described in Sec. 5.4. During the performance evaluation of the chaser drone, the policy is not updated and only trained weights are used to generate control actions (A) for the chaser drone. Multiple evaluation episodes are executed with random starting locations of δchaser and δintruder to evaluate the effectiveness of the learned control policy.

6.1 Policy Improvement During Training.

The training is performed in Gazebo and ROS, where the DDPG-based model learns the chaser control policy. The simulation runs for 5000 episodes until convergence of the total episodic return. Figure 4(a) shows the total reward collected in every episode during training. From the plot, it can be seen that as more episodes are completed, the total reward improves, which correlates with episode length. During the training phase, there is a provision for exploration to find alternative ways of tracking δintruder. A persistent exploration parameter, introduced by the OU noise, ensures that δchaser continues to experiment with various strategies for tracking δintruder. The gradual increase toward higher reward values in the reward graph indicates that the chaser drone has acquired the ability to effectively track δintruder, leading to a higher accumulation of rewards in later episodes.

Fig. 4: (a) Total reward per episode during training episodes of the δchaser and (b) critic loss, when learning policy for δchaser

Figure 4(b) depicts the critic’s loss from the DDPG model. This graph essentially shows the performance of the critic model that learns the value function of an action. As the number of episodes increases, the critic loss decreases and reaches a very low value. This shows that the critic model can better estimate the value of the chosen actions; hence, DDPG can learn a good control policy for the chaser drone.

Regarding the influence of Ralign and Rtrack on overall reward R, initial observations indicate that, in the early stages, the contribution of Rtrack to total reward is minimal. However, as the episodes unfold, there is a continuous improvement in Rtrack as δchaser becomes more adept at following δintruder, thereby leading to an overall increase in R. Initially, there are instances where δchaser prioritizes the maximization of Ralign at the expense of Rtrack. However, with the progression of episodes, there is a notable improvement in Rtrack, accompanied by a slight reduction in Ralign. The total reward plot indicates that δchaser aligns properly with δintruder. In later episodes, the focus shifts to minimizing the distance between δchaser and δintruder for consistent tracking. This underscores the effectiveness of the proposed approach in learning a control policy for the pursuit of an intruder drone. The DDPG model demonstrates convergence toward a more refined tracking and following policy, emphasizing the suitability of the proposed reward function for the given task.

6.2 Policy Performance During Testing.

To test the policy learned in the previous section, we executed more than 2500 episodes of test runs in which the policy parameters were not updated in the DDPG model. Each test episode involves a mixture of trajectories from the testing environments described in Sec. 5.3.

Figure 5(a) shows the average reward per episode received by the chaser drone δchaser while tracking δintruder for a 2500-episode test run. From the graph, it can be observed that δchaser displays a consistent performance wherein the variations in per episode total reward are within a fixed range. Figure 5(b) shows the absolute value error for δchaser during the test episodes. Absolute value error also displays consistent performance across 2500 episodes. These plots indicate the stable system performance of the chaser drone in the identification and tracking tasks of the intruder.

Fig. 5: (a) Total reward per episode during evaluation, when the trained policy was used without any updates and (b) the absolute value error during this process

Figure 6 shows the visualization of the chaser drone FOV for a single test episode where δintruder is moving in a zig-zag path. Beneath the FOV snapshot of the chaser drone, the track reward and align reward values are also mentioned along with the action vector of the chaser drone policy. The trajectory locations from where the chaser drone’s FOV snapshots are taken are shown in the graph below, numbered 1–8. As can be observed, the size of the δintruder as visible in the δchaser’s FOV greatly affects the track reward, while the distance of δintruder from the center of the FOV affects the align reward.

Fig. 6: δchaser’s view and trajectory followed. The top part shows the FOV view, and the bottom part shows the trajectory of δchaser and δintruder. The X- and Y-axis represent distance in meters.

6.3 Testing Endurance.

We also ran an extended endurance test for the chaser drone in which a single episode ran continuously for 4 h (30,000 time-steps) in our ROS and Gazebo simulation. Figure 7 shows the performance over this single 30,000 time-step episode in the varying-speed test environment. The graphs show the total reward per step (middle), the align reward (top), and the track reward (bottom). As can be observed, the chaser drone performed consistently during the long endurance test up to the end of the 30,000 steps. The align reward also shows consistent performance, while the track reward fluctuates, reaching a maximum at around 12,500 steps and then declining. This is expected, as the align reward keeps δintruder close to the center of δchaser’s FOV, whereas the track reward focuses on following δintruder and is affected by the speed of δintruder and its sudden turns or changes in trajectory. The proposed DDPG-based learned policy maintains a fine trade-off between the align and track rewards such that δintruder is always in the FOV of δchaser. This result also shows that the proposed approach can nullify the effects of compounding errors in a long chase of δintruder. With commercial drones typically having 30–60 min of flight time on average, the learned chaser policy can run continuously until the intruder runs out of battery power. We further note that the average mean trajectory error and average mean distance error are 37.5 m and 57.2 m, respectively, for the long-endurance episode. These errors are well within the maximum error ranges of 50 m and 75 m, respectively.

Fig. 7: Reward spread along with its sub-components for long endurance test during evaluation and average mean trajectory and distance errors when the episode ran continuously for 4 h in Gazebo

6.4 Performance in Different Testing Environments.

We test the chaser drone policy on various testing environments as described in Sec. 5.3 using 4000 test episodes.

The graph in Fig. 8(a) depicts the average mean trajectory error for the different trajectories adopted by the intruder. As can be observed, the average mean trajectory error is the lowest for the zig-zag, straight path, and high-speed scenarios. This parallels common intuition, as these scenarios require fewer adjustments to keep the intruder in the center of the FOV. The average mean trajectory error is greater for the sinusoidal and circular trajectories, as these scenarios require constant realignment of the chaser to keep the intruder close to the center of the FOV. For the random trajectory and the trajectory with occlusions, the average mean trajectory error is the highest (20 m), mainly because the intruder is not in the FOV for many time-steps. In the variable-speed case, the error is greater because a sudden change in speed during an ongoing chase requires realignment and causes the intruder to drift from the center of the FOV. However, all trajectory errors are below half of the maximum trajectory error limit of 50 m. This shows the robustness of the chaser drone policy in tracking and following the intruder drone under varying circumstances.

Fig. 8: (a) Average mean trajectory error and (b) average mean distance error in various testing environments

Figure 8(b) depicts the average mean chase distance error during the test runs in various environments. As observed, the distance error is the lowest for the straight path scenario, as the chaser can focus on decreasing the distance while few realignment decisions need to be made. For the zig-zag path scenario, at both 5 m/s and 10 m/s speeds, the error is 30 m on average, which indicates that the chaser is also required to make some realignment decisions. In the variable-speed scenario, the path remains mostly the same and only the speed is adjusted, leading to quick adjustments of the chaser’s velocity. For the circular and sinusoidal trajectories, the reported values are 37.673 m and 35.847 m, respectively. In these scenarios, the chaser UAV must constantly adjust its speed and orientation to keep up with the intruder UAV, leading to a slight increase in the average distance. For random trajectories, the reported value is 40.583 m, which is higher than in the scenarios discussed above. This is due to the large number of orientation adjustments required to keep track of the intruder UAV. In the case of occlusions, the highest error is reported because the chaser UAV must perform multiple trajectory adjustments to relocate the intruder UAV and resume the chase. With a typical UAV visual range of 100–150 m, the error values are well below this limit and lead neither to sudden loss of sight of the intruder UAV nor to decision-making time limitations.

The graph in Fig. 9 shows the total reward collected per episode by δchaser while chasing δintruder on the various trajectories described in Sec. 5.3. As can be seen in the graph, δchaser consistently performs well for straight and zig-zag paths. In the evaluation phase, δchaser can track circular trajectories owing to the inclusion of the five most recent frames in the state space. This history helps δchaser identify the general direction of movement of the intruder and track it effectively. The total reward accumulation is the lowest for the circular and sinusoidal trajectories of the intruder.

Fig. 9: Total reward per episode during evaluation episodes of the chaser drone in various testing environments

6.5 Comparison With a Proportional-Integral-Derivative Controller.

In this section, we compare the performance of the proposed RL-based chaser drone policy against a PID-based chaser drone controller. The proposed RL-based method can handle complex, non-linear systems and real-world scenarios where states evolve over time. In contrast, the PID controller calculates an “error,” the difference between the measured state and the desired set-point, and is widely used in industrial control systems where the relationship between the input and the output is known and relatively stable. To perform the comparison, both controllers are evaluated on a test trajectory of δintruder that consists of various portions, including straight, zig-zag, circular, sinusoidal, and random paths. Figure 10 shows the total reward per time-step during the evaluation of our proposed DDPG-based chaser policy and the PID-based controller. The chase starts with a straight path for the first 2000 time-steps, followed by a zig-zag path at a speed of 5 m/s until 6000 time-steps. During this initial evaluation, δchaser can track δintruder with both DDPG and PID; however, the total rewards they generate differ by an average of 30 reward points. For the next 4000 steps, δintruder moves in a sinusoidal path until 10,000 time-steps and then again in a zig-zag path at a speed of 10 m/s until 12,000 time-steps. In this part of the evaluation episode, the gap between the DDPG policy’s and the PID controller’s reward decreased, mainly due to the sudden maneuvers required to keep up with δintruder. After that, δintruder moves in a straight path with varied velocity up to 14,000 time-steps, and lastly in a pseudo-random path up to 17,500 time-steps. On this part of the evaluation track, the PID controller cannot keep up with δintruder, mainly because once δintruder is lost from δchaser’s FOV, the PID controller is not able to reacquire and track it again. In the last phase of the evaluation track, δchaser under the learned policy is able to chase δintruder consistently until the end of the evaluation. Overall, the DDPG controller maintains a higher reward than the PID controller and can sustain the chase for a longer duration across varied trajectories.
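
For reference, the sketch below shows one simple form such a PID baseline could take, driving the pixel error between the FOV center and the intruder's box center to zero through lateral and vertical velocity commands; the gains and loop rate are illustrative assumptions, not the tuned values used in the comparison.

```python
# Illustrative PID baseline: pixel errors drive lateral/vertical velocity commands.
# Gains, loop rate, and velocity limits are assumptions.
W, H = 640, 480  # assumed FOV resolution in pixels


class PID:
    def __init__(self, kp, ki, kd, dt=0.05):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_error = 0.0, 0.0

    def step(self, error):
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


pid_lateral = PID(kp=0.004, ki=0.0001, kd=0.001)
pid_vertical = PID(kp=0.004, ki=0.0001, kd=0.001)


def pid_velocity_command(box):
    """box = (x_low, y_low, x_high, y_high) of the intruder in the chaser's FOV."""
    cx, cy = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0
    vy = max(-2.0, min(2.0, pid_lateral.step(W / 2 - cx)))   # lateral correction
    vz = max(-2.0, min(2.0, pid_vertical.step(H / 2 - cy)))  # vertical correction
    return vy, vz
```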

Fig. 10: Comparison of the learned policy versus the PID controller during evaluation for variable trajectories. The graph shows the total reward accumulated over time, highlighting the effectiveness of the RL-based policy in maintaining the intruder within the FOV and adapting to sudden changes in trajectory and speed.

6.6 Total Tracking Time for Comparison With Proportional-Integral-Derivative Controller.

In addition to the reward-based evaluation, a new performance metric, total tracking time (TTT), has been introduced to provide a more comprehensive comparison between the proposed RL-based chaser drone policy and the PID controller. The TTT metric measures the duration for which the intruder drone remains visible in the FOV of the chaser drone across various trajectories. Figure 11 presents the tracking counts of the proposed RL-based and PID-based approaches, highlighting the significant differences in their tracking capabilities. While chasing δintruder, the DDPG-learned policy detects it in 1846 instances versus 1421 for the PID controller on the initial straight-line trajectory. For the zig-zag path, DDPG tracks for 1879 instances versus 1511 for the PID controller. Similarly, for the sinusoidal and high-speed zig-zag paths, DDPG tracks for 1726 and 1300 instances, respectively, while the PID controller tracks for 1695 and 157 instances. During the final stage of the evaluation trajectory, PID is unable to track δintruder, while DDPG tracks for 1654 and 1718 instances. From the graph, it can be seen that the DDPG-learned policy maintains consistent tracking for a longer duration.

Fig. 11: Comparison of learned policy versus PID controller during evaluation based on TTT. When the δintruder is detected in FOV, the instance is recorded and shown in the graph.

7 Limitations

This section presents the scenarios in which δchaser may lose track of δintruder. In scenarios where the speed of δintruder is too high compared to that of δchaser, the intruder cannot be tracked for long. The δchaser’s camera operates at 30 frames per second, and if the intruder’s movement is too fast to be captured between frames, δintruder can escape. These cases may be handled with more training and further fine-tuning of the policy. Under extreme environmental conditions, the localization module can also provide incorrect location information, leading δchaser to move in the wrong direction.

8 Conclusions and Future Work

Deep reinforcement learning methods have shown increasingly strong performance in various control tasks. This paper focuses on the problem of tracking an intruder drone with a chaser drone governed by a deep reinforcement learning-based policy. The reward function has been formulated using the FOV of the chaser drone’s camera view. Training and evaluation of the proposed approach are performed using Gazebo and ROS, along with Ardupilot as the flight controller. The deep reinforcement learning-based chaser drone policy has been evaluated in various test environments using several evaluation metrics, and the learned policy is also compared to a PID controller. The results show that the learned policy is robust to various maneuvers of the intruder drone and can continue tracking the intruder for a long duration. The comparison with the PID controller further validates the adaptability and superior performance of the proposed approach.

As part of future work, the deployment of chaser swarms for intruder pursuit and neutralization with a larger combined FOV can be explored. Furthermore, exploration of hierarchical policies for autonomous takeoff, pursuit, and return to recharging stations for drone swarms is envisioned. This approach seeks to advance intelligent pursuit strategies and effectively enhance restricted airspace protection.

Conflict of Interest

There are no conflicts of interest.

Data Availability Statement

The data and information that support the findings of this article are freely available online.

References

1. Zeng, Y., Lyu, J., and Zhang, R., 2018, "Cellular-Connected UAV: Potential, Challenges, and Promising Technologies," IEEE Wireless Commun., 26(1), pp. 120–127.
2. Garg, A., and Jha, S. S., 2023, "Decentralized Critical Area Coverage Using Multi-UAV System With Guided Explorations During Floods," CASE, pp. 1–6.
3. Basak, S., Rajendran, S., Pollin, S., and Scheers, B., 2021, "Combined RF-Based Drone Detection and Classification," IEEE Trans. Cognit. Commun. Networking, 8(1), pp. 111–120.
4. De Souza, C. N., 2021, "Decentralized Multi-agent Pursuit Using Deep Reinforcement Learning," IEEE Rob. Autom. Lett., 6(3), pp. 4552–4559.
5. Kainth, S., Sahoo, S., Pal, R., and Jha, S. S., 2023, "Chasing the Intruder: A Reinforcement Learning Approach for Tracking Unidentified Drones," Advances in Robotics, Ropar, India, Aug. 29, ACM, pp. 1–6.
6. Opromolla, R., Inchingolo, G., and Fasano, G., 2019, "Airborne Visual Detection and Tracking of Cooperative UAVs Exploiting Deep Learning," Sensors, 19(19), p. 4332.
7. Unlu, E., Zenou, E., Riviere, N., and Dupouy, P.-E., 2019, "Deep Learning-Based Strategies for the Detection and Tracking of Drones Using Several Cameras," IPSJ Trans. Comput. Vis. Appl., 11(1), pp. 1–13.
8. Zhang, X., and Kusrini, K., 2021, "Autonomous Long-Range Drone Detection System for Critical Infrastructure Safety," Multimedia Tools Appl., 80(15), pp. 23723–23743.
9. Barisic, A., Petric, F., and Bogdan, S., 2021, "Brain Over Brawn: Using a Stereo Camera to Detect, Track, and Intercept a Faster UAV by Reconstructing the Intruder's Trajectory," preprint arXiv:2107.00962.
10. Rozantsev, A., Lepetit, V., and Fua, P., 2015, "Flying Objects Detection From a Single Moving Camera," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, June 7–12, pp. 4128–4136.
11. Wei, Y., Hong, T., and Kadoch, M., 2020, "Improved Kalman Filter Variants for UAV Tracking With Radar Motion Models," Electronics, 9(5), p. 768.
12. Pham, H. X. L., 2018, "Reinforcement Learning for Autonomous UAV Navigation Using Function Approximation," 2018 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR), Philadelphia, PA, Aug. 6–8, IEEE, pp. 1–6.
13. Akhloufi, M. A., Arola, S., and Bonnet, A., 2019, "Drones Chasing Drones: Reinforcement Learning and Deep Search Area Proposal," Drones, 3(3), p. 58.
14. Fernandes, A. B., 2019, "Drone, Aircraft and Bird Identification in Video Images Using Object Tracking and Residual Neural Networks," 2019 11th International Conference on Electronics, Computers and Artificial Intelligence (ECAI), Pitesti, Romania, June 27–29, IEEE, pp. 1–6.
15. Hu, Y., Wu, X., Zheng, G., and Liu, X., 2019, "Object Detection of UAV for Anti-UAV Based on Improved YOLO v3," 2019 Chinese Control Conference (CCC), IEEE, pp. 8386–8390.
16. Gao, J. B., 2024, "Analysis of Various Machine Learning Algorithms for Using Drone Images in Livestock Farms," Agriculture, 14(4), p. 522.
17. Zeng, L., Chen, H., Feng, D., Zhang, X., and Chen, X., 2024, "A3D: Adaptive, Accurate, and Autonomous Navigation for Edge-Assisted Drones," IEEE/ACM Trans. Networking, 32(1), pp. 713–728.
18. Nepal, U., 2022, "Comparing YOLOv3, YOLOv4 and YOLOv5 for Autonomous Landing Spot Detection in Faulty UAVs," Sensors, 22(2), p. 464.
19. Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M., 2014, "Deterministic Policy Gradient Algorithms," Proceedings of Machine Learning Research, pp. 387–395. https://api.semanticscholar.org/CorpusID:13928442