The DRL agent is a very simple design with only one skill. This agent does not use machine teaching to decompose the task into skills that can be trained separately. Instead, the entire reaction is controlled by a single skill trained with deep reinforcement learning.
Let's get started configuring this agent!
1. Publish the DRL Skill to Your Project
This agent has a single skill called Control Full Reaction. To publish that skill to your project, open your favorite code editor and terminal. In the terminal, navigate to the skills folder and run the following command with the Composabl CLI.
composabl skill publish control_reaction
Return to the agent builder studio and refresh the page. You will see the skill in the skills menu on the left of your page.
2. Explore the Code Files
All skills, perceptors, and selectors contain a minimum of two files: a Python file with the code the agent will use, and a config file.
pyproject.toml, a config file with the following information.
See the Code
[project]
name = "Control Full Reaction"
version = "0.1.0"
description = ""
authors = [{ name = "John Doe", email = "john.doe@composabl.com" }]
dependencies = ["composabl-core", "numpy"]

[composabl]
type = "skill-teacher"
entrypoint = "control_reaction.teacher:BaseCSTR"
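The entrypoint value follows the familiar module:Class convention: everything before the colon is the Python module path (teacher.py inside the control_reaction package), and everything after it is the class to load (BaseCSTR, defined below). As a rough illustration of that convention only, not the actual Composabl loader, a string of this form can be resolved like this:

import importlib

def resolve_entrypoint(entrypoint: str):
    """Split a 'package.module:ClassName' string and import the named class."""
    module_path, class_name = entrypoint.split(":")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)

# Resolving the teacher class named in pyproject.toml
# (this only works once the control_reaction package is on your Python path)
TeacherClass = resolve_entrypoint("control_reaction.teacher:BaseCSTR")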
A Python file. For this agent, we use a teacher class with the following code; explanations are included as inline comments.
See the Code
import math

import numpy as np
from composabl import Teacher


class BaseCSTR(Teacher):
    def __init__(self, *args, **kwargs):
        """
        Initialize the BaseCSTR skill with default values.

        Args:
            *args: Variable length argument list.
            **kwargs: Arbitrary keyword arguments.
        """
        # Initialize observation history to track past observations
        self.obs_history = None

        # Initialize lists to store reward and error histories
        self.reward_history = []
        self.error_history = []
        self.rms_history = []  # Root Mean Square of errors

        # Initialize variable to store the last computed reward
        self.last_reward = 0

        # Initialize a counter to track the number of steps or iterations
        self.count = 0

    async def transform_sensors(self, obs, action):
        """
        Process and potentially modify sensor observations before they are used.

        Args:
            obs (dict): Current sensor observations.
            action: The action to be taken.

        Returns:
            dict: Transformed sensor observations.

        Note:
            - Currently, this method returns the observations unchanged.
            - This can be customized to apply transformations if needed.
        """
        # Currently, no transformation is applied to sensors
        return obs

    async def transform_action(self, transformed_obs, action):
        """
        Process and potentially modify the action before it is executed.

        Args:
            transformed_obs (dict): Transformed sensor observations.
            action: The original action.

        Returns:
            The (potentially) modified action.

        Note:
            - Currently, this method returns the action unchanged.
            - This can be customized to modify actions based on certain criteria.
        """
        # Currently, no transformation is applied to the action
        return action

    async def filtered_sensor_space(self):
        """
        Define which sensors are relevant for this skill.

        Returns:
            list: Names of the sensors to be used.

        Note:
            - Specifies a list of sensor names that this skill will utilize.
            - Helps in focusing the skill's operations on relevant data.
        """
        # Specify the sensors that this skill will use
        return ['T', 'Tc', 'Ca', 'Cref', 'Tref', 'Conc_Error', 'Eps_Yield', 'Cb_Prod']

    async def compute_reward(self, transformed_obs, action, sim_reward):
        """
        Compute the reward based on the transformed observations and action.

        Args:
            transformed_obs (dict): Transformed sensor observations.
            action: The action taken.
            sim_reward: The reward from the simulation/environment.

        Returns:
            float: Calculated reward.

        Behavior:
            - If `obs_history` is None, initializes it with the current observation and returns 0.0.
            - Otherwise, appends the current observation to `obs_history`.
            - Calculates the squared error between reference concentration (`Cref`) and actual concentration (`Ca`).
            - Appends the error to `error_history`.
            - Computes the Root Mean Square (RMS) of the error history and appends it to `rms_history`.
            - Calculates the reward using an exponential decay function based on the sum of all errors.
            - Appends the calculated reward to `reward_history`.
            - Increments the `count`.
            - Returns the calculated reward.
        """
        if self.obs_history is None:
            # If this is the first observation, initialize the history
            self.obs_history = [transformed_obs]
            return 0.0  # No reward on the first step
        else:
            # Append the current observation to the history
            self.obs_history.append(transformed_obs)

        # Calculate the squared error between reference concentration and actual concentration
        try:
            cref = float(transformed_obs['Cref'])
            ca = float(transformed_obs['Ca'])
        except (KeyError, ValueError, TypeError) as e:
            # Handle missing or invalid sensor data
            print(f"Error accessing 'Cref' or 'Ca' in transformed_obs: {e}")
            return 0.0

        error = (cref - ca) ** 2
        self.error_history.append(error)  # Store the error

        # Calculate the Root Mean Square (RMS) of the error history
        rms = math.sqrt(np.mean(self.error_history))
        self.rms_history.append(rms)  # Store the RMS value

        # Compute the reward as an exponential decay based on the sum of errors
        reward = math.exp(-0.01 * np.sum(self.error_history))
        self.reward_history.append(reward)  # Store the reward

        # Increment the step counter
        self.count += 1

        return reward  # Return the calculated reward

    async def compute_action_mask(self, transformed_obs, action):
        """
        Optionally compute an action mask to restrict available actions.

        Args:
            transformed_obs (dict): Transformed sensor observations.
            action: The action to be masked.

        Returns:
            Optional[List[bool]]: A mask indicating which actions are allowed.
                                  Returns None, meaning no action masking is applied.

        Note:
            - Currently, no action masking is implemented.
            - This can be customized to enforce action constraints.
        """
        # Currently, no action masking is applied
        return None

    async def compute_success_criteria(self, transformed_obs, action):
        """
        Determine whether the success criteria have been met.

        Args:
            transformed_obs (dict): Transformed sensor observations.
            action: The action taken.

        Returns:
            bool: True if success criteria are met, False otherwise.

        Behavior:
            - Currently always returns False.
            - Can be implemented with logic to check if certain conditions are satisfied.
        """
        # Placeholder for success criteria logic
        success = False
        # Implement actual success condition based on observations and actions
        return success

    async def compute_termination(self, transformed_obs, action):
        """
        Determine whether the training episode should terminate.

        Args:
            transformed_obs (dict): Transformed sensor observations.
            action: The action taken.

        Returns:
            bool: True if the episode should terminate, False otherwise.

        Behavior:
            - Currently always returns False.
            - Can be implemented with logic to terminate based on certain conditions.
        """
        # Placeholder for termination condition logic
        return False
3. Add the Skill to Your DRL Agent
Drag the skill control_reaction that you can now see on the left-hand side of your project onto the Skill Layer.
4. Run Your Training Session
Now you are ready to train your agent and see the results. We suggest running 50 training cycles.
5. View Results
When training has completed, you can view your results in the training sessions tab in the UI. This will show you how well the agent is learning.
You will most likely see a steep learning curve as the agent experiments with different control strategies and learns from the results. When the learning curve plateaus, that usually means that the skill is trained.
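If you prefer a numerical check over eyeballing the curve, a simple sketch like the following, assuming you have exported the per-cycle reward values from the training session, flags a plateau when the average reward of the most recent cycles has stopped improving compared with the preceding window:

import numpy as np

def has_plateaued(rewards, window=10, tolerance=0.01):
    """Return True if the mean reward of the last `window` cycles improved
    by less than `tolerance` over the previous window."""
    if len(rewards) < 2 * window:
        return False  # not enough history to judge
    recent = np.mean(rewards[-window:])
    previous = np.mean(rewards[-2 * window:-window])
    return (recent - previous) < tolerance

# Hypothetical reward curve: fast early improvement, then flat
rewards = [0.2, 0.35, 0.5, 0.62, 0.7, 0.74, 0.76, 0.77] + [0.78] * 20
print(has_plateaued(rewards))  # True once the curve flattens out

The window and tolerance here are illustrative defaults; tune them to the scale of your own reward values.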
Analyze the DRL Agent's Performance
Conversion rate: 90%
Thermal runaway risk: Low
We tested this fully trained agent and plotted the results.
The DRL agent performs well. Its relatively thin shadow on the plot means that it performs consistently across different conditions, and it stays within the safety threshold almost every time.
This agent controls the initial steady state very well, staying right on the benchmark line. But during the transition, the DRL agent strays from the benchmark line quite a bit. It doesn't notice right away when the transition phase begins, staying too long in the lower region of the graph and then overcorrecting. That's because DRL works by experimentation, teaching itself how to get results by exploring every possible way to tackle a problem. It has no prior knowledge or understanding of the situation and relies entirely on trial and error. That means it is potentially well suited to complex processes, like the transition phase, that can't easily be represented mathematically.
But its behavior is erratic because it can't distinguish between the phases. The DRL agent's skill does better than the traditional automation benchmark but still leaves some room for improvement.