The strategy pattern is one of the key design patterns of Machine Teaching. When you use the strategy pattern, you break down the task into specific skills that each handle one aspect of the process to be controlled. This allows you to "teach" the agent using subject matter expertise.
In the strategy pattern, each skill is either trained using deep reinforcement learning or controlled with a programmed algorithm. Then a special skill called a selector decides which skill should make the decision based on the current conditions.
In the industrial mixer problem, the process is divided into three skills, one for each phase of the process. All three action skills and the selector are trained with DRL: each skill practices in the conditions it will face and learns to control its part of the process by experimenting over time.
Think of the strategy pattern as a math class with three students. Student A loves fractions, Student B is great at decimal problems, and Student C thinks in percentages. The selector is their teacher: she reads each question, sees what kind of problem it is, and then assigns it to the student whose particular math talent best fits that problem.
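If it helps to see the idea in code, here is a minimal sketch of a selector dispatching to specialized skills. This is plain illustrative Python, not the Composabl SDK, and the function names, phase thresholds, and sensor values are hypothetical.

```python
from typing import Callable, Dict

# Hypothetical skill controllers: each maps sensor readings to a control action.
def start_reaction_skill(obs: Dict[str, float]) -> float:
    return 297.0  # e.g., hold the coolant temperature near a startup setpoint

def control_transition_skill(obs: Dict[str, float]) -> float:
    return obs["Tc"] - 1.0  # e.g., ramp the coolant temperature down gradually

def produce_product_skill(obs: Dict[str, float]) -> float:
    return obs["Tc"]  # e.g., hold steady at the production setpoint

def selector(obs: Dict[str, float]) -> Callable[[Dict[str, float]], float]:
    """The 'teacher' skill: choose which specialist acts, based on current conditions."""
    if obs["Cref"] > 8.0:       # reference concentration typical of the startup phase
        return start_reaction_skill
    elif obs["Cref"] > 2.0:     # reference concentration ramping down: transition phase
        return control_transition_skill
    else:                       # low reference concentration: production phase
        return produce_product_skill

# One control step: the selector picks a skill, and that skill chooses the action.
obs = {"Cref": 8.57, "Ca": 8.56, "T": 311.0, "Tc": 292.0}
action = selector(obs)(obs)
```

In the Composabl agent built below, both the specialized skills and the selector learn their behavior through training rather than being hand-coded like this.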
Let's get started configuring this agent!
1. Publish the Skills to Your Project
This agent has three skills called start_reaction, control_transition, and produce_product. To publish them to your project, open up your favorite code editor and terminal. In your terminal, navigate to the skills folder and publish each skill with the Composabl CLI.
See the Start Reaction Skill Code

teacher.py

```python
import math

import numpy as np
from composabl import Teacher


class BaseCSTR(Teacher):
    def __init__(self, *args, **kwargs):
        """Initialize the BaseCSTR skill with default values."""
        # Initialize observation history to track past observations
        self.obs_history = None

        # Initialize lists to store reward and error histories
        self.reward_history = []
        self.error_history = []
        self.rms_history = []  # Root Mean Square of errors

        # Initialize variable to store the last computed reward
        self.last_reward = 0

        # Initialize a counter to track the number of steps or iterations
        self.count = 0

    async def transform_sensors(self, obs, action):
        """Process and potentially modify sensor observations before they are used.

        Currently returns the observations unchanged; customize this method to
        apply transformations if needed.
        """
        return obs

    async def transform_action(self, transformed_obs, action):
        """Process and potentially modify the action before it is executed.

        Currently returns the action unchanged; customize this method to modify
        actions based on certain criteria.
        """
        return action

    async def filtered_sensor_space(self):
        """Define which sensors are relevant for this skill."""
        return ['T', 'Tc', 'Ca', 'Cref', 'Tref', 'Conc_Error', 'Eps_Yield', 'Cb_Prod']

    async def compute_reward(self, transformed_obs, action, sim_reward):
        """Compute the reward from the transformed observations and action.

        Tracks the squared error between the reference concentration (`Cref`)
        and the actual concentration (`Ca`), keeps an RMS history of that error,
        and returns a reward that decays exponentially as total error accumulates.
        """
        if self.obs_history is None:
            # If this is the first observation, initialize the history
            self.obs_history = [transformed_obs]
            return 0.0  # No reward on the first step
        else:
            # Append the current observation to the history
            self.obs_history.append(transformed_obs)

        # Calculate the squared error between reference and actual concentration
        try:
            cref = float(transformed_obs['Cref'])
            ca = float(transformed_obs['Ca'])
        except (KeyError, ValueError, TypeError) as e:
            # Handle missing or invalid sensor data
            print(f"Error accessing 'Cref' or 'Ca' in transformed_obs: {e}")
            return 0.0

        error = (cref - ca) ** 2
        self.error_history.append(error)  # Store the error

        # Calculate the Root Mean Square (RMS) of the error history
        rms = math.sqrt(np.mean(self.error_history))
        self.rms_history.append(rms)  # Store the RMS value

        # Compute the reward as an exponential decay based on the sum of errors
        reward = math.exp(-0.01 * np.sum(self.error_history))
        self.reward_history.append(reward)  # Store the reward

        # Increment the step counter
        self.count += 1

        return reward  # Return the calculated reward

    async def compute_action_mask(self, transformed_obs, action):
        """Optionally compute an action mask to restrict available actions.

        Currently no action masking is applied, so this returns None.
        """
        return None

    async def compute_success_criteria(self, transformed_obs, action):
        """Determine whether the success criteria have been met.

        Currently always returns False; implement the actual success condition
        based on observations and actions.
        """
        success = False
        return success

    async def compute_termination(self, transformed_obs, action):
        """Determine whether the training episode should terminate.

        Currently always returns False; implement termination logic based on
        certain conditions.
        """
        return False
```
See the Control Transition Skill Code
pyproject.toml
```toml
name = "Control Transition"
version = "0.1.0"
description = ""
authors = [{ name = "John Doe", email = "john.doe@composabl.com" }]
dependencies = ["composabl-core", "numpy"]

[composabl]
type = "skill-teacher"
entrypoint = "control_transition.teacher:BaseCSTR"
```
teacher.py
```python
import math

import numpy as np
from composabl import Teacher


class BaseCSTR(Teacher):
    def __init__(self, *args, **kwargs):
        """Initialize the BaseCSTR skill with default values."""
        # Initialize observation history to track past observations
        self.obs_history = None

        # Initialize lists to store reward and error histories
        self.reward_history = []
        self.error_history = []
        self.rms_history = []  # Root Mean Square of errors

        # Initialize variable to store the last computed reward
        self.last_reward = 0

        # Initialize a counter to track the number of steps or iterations
        self.count = 0

    async def transform_sensors(self, obs, action):
        """Process and potentially modify sensor observations before they are used.

        Currently returns the observations unchanged; customize this method to
        apply transformations if needed.
        """
        return obs

    async def transform_action(self, transformed_obs, action):
        """Process and potentially modify the action before it is executed.

        Currently returns the action unchanged; customize this method to modify
        actions based on certain criteria.
        """
        return action

    async def filtered_sensor_space(self):
        """Define which sensors are relevant for this skill."""
        return ['T', 'Tc', 'Ca', 'Cref', 'Tref', 'Conc_Error', 'Eps_Yield', 'Cb_Prod']

    async def compute_reward(self, transformed_obs, action, sim_reward):
        """Compute the reward from the transformed observations and action.

        Tracks the squared error between the reference concentration (`Cref`)
        and the actual concentration (`Ca`), keeps an RMS history of that error,
        and returns a reward that decays exponentially as total error accumulates.
        """
        if self.obs_history is None:
            # If this is the first observation, initialize the history
            self.obs_history = [transformed_obs]
            return 0.0  # No reward on the first step
        else:
            # Append the current observation to the history
            self.obs_history.append(transformed_obs)

        # Calculate the squared error between reference and actual concentration
        try:
            cref = float(transformed_obs['Cref'])
            ca = float(transformed_obs['Ca'])
        except (KeyError, ValueError, TypeError) as e:
            # Handle missing or invalid sensor data
            print(f"Error accessing 'Cref' or 'Ca' in transformed_obs: {e}")
            return 0.0

        error = (cref - ca) ** 2
        self.error_history.append(error)  # Store the error

        # Calculate the Root Mean Square (RMS) of the error history
        rms = math.sqrt(np.mean(self.error_history))
        self.rms_history.append(rms)  # Store the RMS value

        # Compute the reward as an exponential decay based on the sum of errors
        reward = math.exp(-0.01 * np.sum(self.error_history))
        self.reward_history.append(reward)  # Store the reward

        # Increment the step counter
        self.count += 1

        return reward  # Return the calculated reward

    async def compute_action_mask(self, transformed_obs, action):
        """Optionally compute an action mask to restrict available actions.

        Currently no action masking is applied, so this returns None.
        """
        return None

    async def compute_success_criteria(self, transformed_obs, action):
        """Determine whether the success criteria have been met.

        Currently always returns False; implement the actual success condition
        based on observations and actions.
        """
        success = False
        return success

    async def compute_termination(self, transformed_obs, action):
        """Determine whether the training episode should terminate.

        Currently always returns False; implement termination logic based on
        certain conditions.
        """
        return False
```
2. Add the Skills to Your Strategy Pattern Agent
Drag the skills start_reaction, control_transition, and produce_product, which you can now see on the left-hand side of your project, onto the skills layer, in the order you would like them to be used.
3. Configure the Selector
The green diamond that appears when you place multiple skills alongside each other is the selector. This is the "math teacher" skill that decides which of the action skills should make each decision.
Click on the selector to configure it. In this case, the default configurations are most likely correct.
The goals of the top-level selector in an agent should be the same as the goals of the agent as a whole. When the UI automatically creates a selector, it adds the project-level goals by default.
For a phased process like the industrial mixer reaction, a fixed-order sequence is appropriate. That means that the selector has the agent apply the skills one at a time, rather than switching back and forth between skills.
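To make "fixed-order sequence" concrete, here is an illustrative sketch, in plain Python rather than Composabl's own selector implementation, of a selector that steps through the skills in order and never jumps back. The hand-off thresholds on Cref are hypothetical; a trained selector learns when to hand off from experience.

```python
PHASES = ["start_reaction", "control_transition", "produce_product"]

class FixedOrderSelector:
    """Advance through the skills in order; never return to an earlier skill."""

    def __init__(self):
        self.current = 0  # index into PHASES

    def select(self, obs: dict) -> str:
        # Hypothetical hand-off conditions based on the reference concentration.
        if self.current == 0 and obs["Cref"] < 8.5:
            self.current = 1   # startup finished -> hand off to the transition skill
        elif self.current == 1 and obs["Cref"] <= 2.0:
            self.current = 2   # transition finished -> hand off to the production skill
        return PHASES[self.current]

selector = FixedOrderSelector()
print(selector.select({"Cref": 8.57}))  # start_reaction
print(selector.select({"Cref": 5.00}))  # control_transition
print(selector.select({"Cref": 2.00}))  # produce_product
```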
4. Configure Scenarios
Scenarios are a key piece of successfully training an agent with the strategy pattern. Scenarios are different possible conditions represented within the simulation. Skills train to specialize in the different scenarios - for example, the Start Reaction skill specializes in controlling the reaction when the temperature and concentration levels are those found at the beginning of the reaction.
This is what allows the skills to differentiate from each other. The three specialized skills practice only on their designated phase of the process and learn to succeed in their own specific conditions. The selector practices with the whole process so that it knows which skill to choose at any point.
4.1 Add Scenarios
Go to the Scenarios page using the left-hand navigation menu. There, click Add Scenario to create a new scenario for your agent to use in training.
When you are building an agent for your own use case, you will define the scenarios based on your knowledge of the task and process. In this case, we provide the values that define the phases of the chemical manufacturing process. Create these scenarios for your agent:
- Full reaction: Cref Is 8.57, Tref Is 311
- Startup: Cref Is 8.5698, Tref Is 311.2612
- Transition: Cref Is 8.56, Tref Is 311, Is 22
- Production: Cref Is 2, Tref Is 373.1311
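If it helps to see these outside the UI, the same scenarios can be written down as plain Python dictionaries of sensor conditions. This is an illustrative representation only, not necessarily the format the platform stores them in.

```python
# Scenario definitions taken from the values above; each maps a sensor to the
# reference value that characterizes that phase of the reaction.
scenarios = {
    "full_reaction": {"Cref": 8.57,   "Tref": 311},
    "startup":       {"Cref": 8.5698, "Tref": 311.2612},
    "transition":    {"Cref": 8.56,   "Tref": 311},
    "production":    {"Cref": 2,      "Tref": 373.1311},
}
```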
4.2 Create Scenario Flows
Scenario flows allow you to connect scenarios that have a sequential relationship to ensure that your agent gets practice in navigating the different conditions in the order in which they will occur.
For this problem, you do not need to create sequential connections between the scenarios. Drag all the scenarios to the first column to make them available to your skills and selectors.
4.3 Add Scenarios to Skills and Selectors
Once you have your scenarios set up and connected with scenario flows, you can add them to skills and selectors to tell them what conditions they need to practice in. This helps them develop their specialized expertise.
In the Agent Builder Studio, click on each skill and the selector in turn. For each, click on Scenarios and then click the dropdown arrows to show the available scenarios. Check the box for each scenario that applies to that skill or selector:
- Start reaction: Startup
- Control transition: Transition
- Produce product: Production
- Selector: Full reaction
5. Run Your Training Session
Now we are ready to train the agent and see the results. We suggest running 50 training cycles. You will see the skills training one at a time, and each skill will train for the selected number of cycles.
6. View Results
When the training has been completed, you can view your results in the training sessions tab in the UI. This will show you information on how well the agent is learning.
You will likely see a steep learning curve as the agent experiments with different control strategies and learns from the results. When the learning curve plateaus, that usually means that the skill is trained.
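One rough way to judge a plateau from a reward history is to compare the average reward of the most recent training cycles against the window just before it and treat the skill as trained when the improvement becomes negligible. The sketch below is illustrative only, not a Composabl feature, and the example curve is made up.

```python
import numpy as np

def has_plateaued(rewards: list[float], window: int = 5, tolerance: float = 0.01) -> bool:
    """Return True when the mean reward of the last `window` cycles has stopped
    improving (relative gain below `tolerance`) compared with the previous window."""
    if len(rewards) < 2 * window:
        return False  # not enough history to compare two full windows
    recent = np.mean(rewards[-window:])
    previous = np.mean(rewards[-2 * window:-window])
    return (recent - previous) <= tolerance * max(abs(previous), 1e-8)

# Hypothetical learning curve that rises steeply and then levels off.
curve = [0.1, 0.3, 0.5, 0.7, 0.8, 0.85, 0.88, 0.9, 0.9, 0.91,
         0.91, 0.92, 0.92, 0.92, 0.92, 0.92, 0.92, 0.92, 0.92, 0.92]
print(has_plateaued(curve))  # True: the last five cycles barely improve on the five before
```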
Analyzing the Strategy Pattern Agent's Performance
Conversion rate: 92%
Thermal runaway risk: Low
We tested this fully trained agent and plotted the results.
This agent's performance is not perfect, but it stays closer to the benchmark line than either of the two single-skill agents. It just needs some help avoiding thermal runaway. We can provide that by adding a perception layer.