The strategy pattern is one of the key design patterns of Machine Teaching. When you use the strategy pattern, you break down the task into specific skills that each handle one aspect of the process to be controlled. This allows you to "teach" the agent using subject matter expertise.
In the strategy pattern, each skill is either trained using deep reinforcement learning (DRL) or controlled with a programmed algorithm. A special skill called the selector then decides which skill should make each decision, based on the current conditions.
In the industrial mixer problem, the process is divided into three skills based on the phase of the process. All three action skills and the selector are trained with DRL: each skill practices in the conditions it will face and learns to control its part of the process by experimenting over time.
Think of the strategy pattern as a math class with three students. Student A loves fractions, Student B is great at decimal problems, and Student C thinks in percentages. The selector is their teacher. She reads each question, sees what kind of problem it is, and then assigns it to the student whose special math talent makes them best able to solve it.
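To make the idea concrete, here is a purely illustrative sketch in plain Python (not the Composabl API): the selector looks at the current sensor readings and routes each decision to whichever skill specializes in those conditions. The skills, thresholds, and actions below are made up for illustration.

# Illustrative sketch only -- not the Composabl API. Skills and thresholds are made up.
def start_reaction(obs):
    return {"Tc_adjust": 1.0}    # placeholder startup-phase policy

def control_transition(obs):
    return {"Tc_adjust": -0.5}   # placeholder transition-phase policy

def produce_product(obs):
    return {"Tc_adjust": 0.0}    # placeholder production-phase policy

def selector(obs):
    """Choose which skill should act, based on the current conditions."""
    if obs["Cref"] > 8.0:        # startup-like conditions
        return start_reaction
    elif obs["Cref"] > 2.5:      # transition-like conditions
        return control_transition
    return produce_product       # production-like conditions

obs = {"Cref": 8.57, "Tref": 311}
action = selector(obs)(obs)      # the chosen skill makes the decision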
Let's get started configuring this agent!
1. Publish the Skills to Your Project
This agent has three skills called start_reaction, control_transition, and produce_product. To publish them to your project, open your code editor and a terminal. In the terminal, navigate to the skills folder and publish each skill with the Composabl CLI.
Return to the agent builder studio and refresh the page. The skills will appear in the skills menu on the left of your page.
Explore the Code Files
All skills, perceptors, and selectors contain a minimum of two files:
pyproject.toml, a config file with the information shown below.
A Python file with the code the agent will use. For this agent, we use teacher functions, explained with inline comments in the code.
File Structure
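The detailed files for each skill appear below. Based on the entrypoints in the config files (for example start_reaction.teacher:BaseCSTR), a typical layout for the skills folder looks roughly like this; your exact layout may differ slightly:

skills/
└── start_reaction/
    ├── pyproject.toml            # skill config (shown below)
    └── start_reaction/
        └── teacher.py            # BaseCSTR teacher class (shown below)

The control_transition and produce_product skills follow the same pattern.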
See the Start Reaction Skill Code
pyproject.toml
[project]
name = "Start Reaction"
version = "0.1.0"
description = ""
authors = [{ name = "John Doe", email = "john.doe@composabl.com" }]
dependencies = [
"composabl-core",
"numpy"
]
[composabl]
type = "skill-teacher"
entrypoint = "start_reaction.teacher:BaseCSTR"
teacher.py
import math
import numpy as np
from composabl import Teacher
class BaseCSTR(Teacher):
def __init__(self, *args, **kwargs):
"""
Initialize the BaseCSTR skill with default values.
Args:
*args: Variable length argument list.
**kwargs: Arbitrary keyword arguments.
"""
# Initialize observation history to track past observations
self.obs_history = None
# Initialize lists to store reward and error histories
self.reward_history = []
self.error_history = []
self.rms_history = [] # Root Mean Square of errors
# Initialize variable to store the last computed reward
self.last_reward = 0
# Initialize a counter to track the number of steps or iterations
self.count = 0
async def transform_sensors(self, obs, action):
"""
Process and potentially modify sensor observations before they are used.
Args:
obs (dict): Current sensor observations.
action: The action to be taken.
Returns:
dict: Transformed sensor observations.
Note:
- Currently, this method returns the observations unchanged.
- This can be customized to apply transformations if needed.
"""
# Currently, no transformation is applied to sensors
return obs
async def transform_action(self, transformed_obs, action):
"""
Process and potentially modify the action before it is executed.
Args:
transformed_obs (dict): Transformed sensor observations.
action: The original action.
Returns:
The (potentially) modified action.
Note:
- Currently, this method returns the action unchanged.
- This can be customized to modify actions based on certain criteria.
"""
# Currently, no transformation is applied to the action
return action
async def filtered_sensor_space(self):
"""
Define which sensors are relevant for this skill.
Returns:
list: Names of the sensors to be used.
Note:
- Specifies a list of sensor names that this skill will utilize.
- Helps in focusing the skill's operations on relevant data.
"""
# Specify the sensors that this skill will use
return ['T', 'Tc', 'Ca', 'Cref', 'Tref', 'Conc_Error', 'Eps_Yield', 'Cb_Prod']
async def compute_reward(self, transformed_obs, action, sim_reward):
"""
Compute the reward based on the transformed observations and action.
Args:
transformed_obs (dict): Transformed sensor observations.
action: The action taken.
sim_reward: The reward from the simulation/environment.
Returns:
float: Calculated reward.
Behavior:
- If `obs_history` is None, initializes it with the current observation and returns 0.0.
- Otherwise, appends the current observation to `obs_history`.
- Calculates the squared error between reference concentration (`Cref`) and actual concentration (`Ca`).
- Appends the error to `error_history`.
- Computes the Root Mean Square (RMS) of the error history and appends it to `rms_history`.
- Calculates the reward using an exponential decay function based on the sum of all errors.
- Appends the calculated reward to `reward_history`.
- Increments the `count`.
- Returns the calculated reward.
"""
if self.obs_history is None:
# If this is the first observation, initialize the history
self.obs_history = [transformed_obs]
return 0.0 # No reward on the first step
else:
# Append the current observation to the history
self.obs_history.append(transformed_obs)
# Calculate the squared error between reference concentration and actual concentration
try:
cref = float(transformed_obs['Cref'])
ca = float(transformed_obs['Ca'])
except (KeyError, ValueError, TypeError) as e:
# Handle missing or invalid sensor data
print(f"Error accessing 'Cref' or 'Ca' in transformed_obs: {e}")
return 0.0
error = (cref - ca) ** 2
self.error_history.append(error) # Store the error
# Calculate the Root Mean Square (RMS) of the error history
rms = math.sqrt(np.mean(self.error_history))
self.rms_history.append(rms) # Store the RMS value
# Compute the reward as an exponential decay based on the sum of errors
reward = math.exp(-0.01 * np.sum(self.error_history))
self.reward_history.append(reward) # Store the reward
# Increment the step counter
self.count += 1
return reward # Return the calculated reward
async def compute_action_mask(self, transformed_obs, action):
"""
Optionally compute an action mask to restrict available actions.
Args:
transformed_obs (dict): Transformed sensor observations.
action: The action to be masked.
Returns:
Optional[List[bool]]: A mask indicating which actions are allowed.
Returns None, meaning no action masking is applied.
Note:
- Currently, no action masking is implemented.
- This can be customized to enforce action constraints.
"""
# Currently, no action masking is applied
return None
async def compute_success_criteria(self, transformed_obs, action):
"""
Determine whether the success criteria have been met.
Args:
transformed_obs (dict): Transformed sensor observations.
action: The action taken.
Returns:
bool: True if success criteria are met, False otherwise.
Behavior:
- Currently always returns False.
- Can be implemented with logic to check if certain conditions are satisfied.
"""
# Placeholder for success criteria logic
success = False
# Implement actual success condition based on observations and actions
return success
async def compute_termination(self, transformed_obs, action):
"""
Determine whether the training episode should terminate.
Args:
transformed_obs (dict): Transformed sensor observations.
action: The action taken.
Returns:
bool: True if the episode should terminate, False otherwise.
Behavior:
- Currently always returns False.
- Can be implemented with logic to terminate based on certain conditions.
"""
# Placeholder for termination condition logic
return False
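To see how this reward shaping behaves, here is a small standalone sketch (with made-up Cref and Ca values) that mirrors the calculation in compute_reward above: the squared tracking error between Cref and Ca accumulates over the episode, and the reward decays exponentially as the total error grows.

import math
import numpy as np

# Made-up sequence of (Cref, Ca) pairs to illustrate the reward calculation
observations = [(8.57, 8.60), (8.57, 8.40), (8.57, 8.00)]

error_history = []
for cref, ca in observations:
    error_history.append((cref - ca) ** 2)            # squared tracking error
    rms = math.sqrt(np.mean(error_history))           # RMS, tracked for analysis
    reward = math.exp(-0.01 * np.sum(error_history))  # decays as total error grows
    print(f"error={error_history[-1]:.4f}  rms={rms:.4f}  reward={reward:.4f}")

# Small errors keep the reward near 1.0; large or accumulating errors push it toward 0.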
See the Control Transition Skill Code
pyproject.toml
[project]
name = "Control Transition"
version = "0.1.0"
description = ""
authors = [{ name = "John Doe", email = "john.doe@composabl.com" }]
dependencies = [
"composabl-core",
"numpy"
]
[composabl]
type = "skill-teacher"
entrypoint = "control_transition.teacher:BaseCSTR"
teacher.py
import math
import numpy as np
from composabl import Teacher
class BaseCSTR(Teacher):
def __init__(self, *args, **kwargs):
"""
Initialize the BaseCSTR skill with default values.
Args:
*args: Variable length argument list.
**kwargs: Arbitrary keyword arguments.
"""
# Initialize observation history to track past observations
self.obs_history = None
# Initialize lists to store reward and error histories
self.reward_history = []
self.error_history = []
self.rms_history = [] # Root Mean Square of errors
# Initialize variable to store the last computed reward
self.last_reward = 0
# Initialize a counter to track the number of steps or iterations
self.count = 0
async def transform_sensors(self, obs, action):
"""
Process and potentially modify sensor observations before they are used.
Args:
obs (dict): Current sensor observations.
action: The action to be taken.
Returns:
dict: Transformed sensor observations.
Note:
- Currently, this method returns the observations unchanged.
- This can be customized to apply transformations if needed.
"""
# Currently, no transformation is applied to sensors
return obs
async def transform_action(self, transformed_obs, action):
"""
Process and potentially modify the action before it is executed.
Args:
transformed_obs (dict): Transformed sensor observations.
action: The original action.
Returns:
The (potentially) modified action.
Note:
- Currently, this method returns the action unchanged.
- This can be customized to modify actions based on certain criteria.
"""
# Currently, no transformation is applied to the action
return action
async def filtered_sensor_space(self):
"""
Define which sensors are relevant for this skill.
Returns:
list: Names of the sensors to be used.
Note:
- Specifies a list of sensor names that this skill will utilize.
- Helps in focusing the skill's operations on relevant data.
"""
# Specify the sensors that this skill will use
return ['T', 'Tc', 'Ca', 'Cref', 'Tref', 'Conc_Error', 'Eps_Yield', 'Cb_Prod']
async def compute_reward(self, transformed_obs, action, sim_reward):
"""
Compute the reward based on the transformed observations and action.
Args:
transformed_obs (dict): Transformed sensor observations.
action: The action taken.
sim_reward: The reward from the simulation/environment.
Returns:
float: Calculated reward.
Behavior:
- If `obs_history` is None, initializes it with the current observation and returns 0.0.
- Otherwise, appends the current observation to `obs_history`.
- Calculates the squared error between reference concentration (`Cref`) and actual concentration (`Ca`).
- Appends the error to `error_history`.
- Computes the Root Mean Square (RMS) of the error history and appends it to `rms_history`.
- Calculates the reward using an exponential decay function based on the sum of all errors.
- Appends the calculated reward to `reward_history`.
- Increments the `count`.
- Returns the calculated reward.
"""
if self.obs_history is None:
# If this is the first observation, initialize the history
self.obs_history = [transformed_obs]
return 0.0 # No reward on the first step
else:
# Append the current observation to the history
self.obs_history.append(transformed_obs)
# Calculate the squared error between reference concentration and actual concentration
try:
cref = float(transformed_obs['Cref'])
ca = float(transformed_obs['Ca'])
except (KeyError, ValueError, TypeError) as e:
# Handle missing or invalid sensor data
print(f"Error accessing 'Cref' or 'Ca' in transformed_obs: {e}")
return 0.0
error = (cref - ca) ** 2
self.error_history.append(error) # Store the error
# Calculate the Root Mean Square (RMS) of the error history
rms = math.sqrt(np.mean(self.error_history))
self.rms_history.append(rms) # Store the RMS value
# Compute the reward as an exponential decay based on the sum of errors
reward = math.exp(-0.01 * np.sum(self.error_history))
self.reward_history.append(reward) # Store the reward
# Increment the step counter
self.count += 1
return reward # Return the calculated reward
async def compute_action_mask(self, transformed_obs, action):
"""
Optionally compute an action mask to restrict available actions.
Args:
transformed_obs (dict): Transformed sensor observations.
action: The action to be masked.
Returns:
Optional[List[bool]]: A mask indicating which actions are allowed.
Returns None, meaning no action masking is applied.
Note:
- Currently, no action masking is implemented.
- This can be customized to enforce action constraints.
"""
# Currently, no action masking is applied
return None
async def compute_success_criteria(self, transformed_obs, action):
"""
Determine whether the success criteria have been met.
Args:
transformed_obs (dict): Transformed sensor observations.
action: The action taken.
Returns:
bool: True if success criteria are met, False otherwise.
Behavior:
- Currently always returns False.
- Can be implemented with logic to check if certain conditions are satisfied.
"""
# Placeholder for success criteria logic
success = False
# Implement actual success condition based on observations and actions
return success
async def compute_termination(self, transformed_obs, action):
"""
Determine whether the training episode should terminate.
Args:
transformed_obs (dict): Transformed sensor observations.
action: The action taken.
Returns:
bool: True if the episode should terminate, False otherwise.
Behavior:
- Currently always returns False.
- Can be implemented with logic to terminate based on certain conditions.
"""
# Placeholder for termination condition logic
return False
See the Produce Product Skill Code
pyproject.toml
[project]
name = "Produce Product"
version = "0.1.0"
description = ""
authors = [{ name = "John Doe", email = "john.doe@composabl.com" }]
dependencies = [
"composabl-core"
]
[composabl]
type = "skill-teacher"
entrypoint = "produce_product.teacher:BaseCSTR"
teacher.py
import math
import numpy as np
from composabl import Teacher
class BaseCSTR(Teacher):
def __init__(self, *args, **kwargs):
"""
Initialize the BaseCSTR skill with default values.
Args:
*args: Variable length argument list.
**kwargs: Arbitrary keyword arguments.
"""
# Initialize observation history to track past observations
self.obs_history = None
# Initialize lists to store reward and error histories
self.reward_history = []
self.error_history = []
self.rms_history = [] # Root Mean Square of errors
# Initialize variable to store the last computed reward
self.last_reward = 0
# Initialize a counter to track the number of steps or iterations
self.count = 0
async def transform_sensors(self, obs, action):
"""
Process and potentially modify sensor observations before they are used.
Args:
obs (dict): Current sensor observations.
action: The action to be taken.
Returns:
dict: Transformed sensor observations.
Note:
- Currently, this method returns the observations unchanged.
- This can be customized to apply transformations if needed.
"""
# Currently, no transformation is applied to sensors
return obs
async def transform_action(self, transformed_obs, action):
"""
Process and potentially modify the action before it is executed.
Args:
transformed_obs (dict): Transformed sensor observations.
action: The original action.
Returns:
The (potentially) modified action.
Note:
- Currently, this method returns the action unchanged.
- This can be customized to modify actions based on certain criteria.
"""
# Currently, no transformation is applied to the action
return action
async def filtered_sensor_space(self):
"""
Define which sensors are relevant for this skill.
Returns:
list: Names of the sensors to be used.
Note:
- Specifies a list of sensor names that this skill will utilize.
- Helps in focusing the skill's operations on relevant data.
"""
# Specify the sensors that this skill will use
return ['T', 'Tc', 'Ca', 'Cref', 'Tref', 'Conc_Error', 'Eps_Yield', 'Cb_Prod']
async def compute_reward(self, transformed_obs, action, sim_reward):
"""
Compute the reward based on the transformed observations and action.
Args:
transformed_obs (dict): Transformed sensor observations.
action: The action taken.
sim_reward: The reward from the simulation/environment.
Returns:
float: Calculated reward.
Behavior:
- If `obs_history` is None, initializes it with the current observation and returns 0.0.
- Otherwise, appends the current observation to `obs_history`.
- Calculates the squared error between reference concentration (`Cref`) and actual concentration (`Ca`).
- Appends the error to `error_history`.
- Computes the Root Mean Square (RMS) of the error history and appends it to `rms_history`.
- Calculates the reward using an exponential decay function based on the sum of all errors.
- Appends the calculated reward to `reward_history`.
- Increments the `count`.
- Returns the calculated reward.
"""
if self.obs_history is None:
# If this is the first observation, initialize the history
self.obs_history = [transformed_obs]
return 0.0 # No reward on the first step
else:
# Append the current observation to the history
self.obs_history.append(transformed_obs)
# Calculate the squared error between reference concentration and actual concentration
try:
cref = float(transformed_obs['Cref'])
ca = float(transformed_obs['Ca'])
except (KeyError, ValueError, TypeError) as e:
# Handle missing or invalid sensor data
print(f"Error accessing 'Cref' or 'Ca' in transformed_obs: {e}")
return 0.0
error = (cref - ca) ** 2
self.error_history.append(error) # Store the error
# Calculate the Root Mean Square (RMS) of the error history
rms = math.sqrt(np.mean(self.error_history))
self.rms_history.append(rms) # Store the RMS value
# Compute the reward as an exponential decay based on the sum of errors
reward = math.exp(-0.01 * np.sum(self.error_history))
self.reward_history.append(reward) # Store the reward
# Increment the step counter
self.count += 1
return reward # Return the calculated reward
async def compute_action_mask(self, transformed_obs, action):
"""
Optionally compute an action mask to restrict available actions.
Args:
transformed_obs (dict): Transformed sensor observations.
action: The action to be masked.
Returns:
Optional[List[bool]]: A mask indicating which actions are allowed.
Returns None, meaning no action masking is applied.
Note:
- Currently, no action masking is implemented.
- This can be customized to enforce action constraints.
"""
# Currently, no action masking is applied
return None
async def compute_success_criteria(self, transformed_obs, action):
"""
Determine whether the success criteria have been met.
Args:
transformed_obs (dict): Transformed sensor observations.
action: The action taken.
Returns:
bool: True if success criteria are met, False otherwise.
Behavior:
- Currently always returns False.
- Can be implemented with logic to check if certain conditions are satisfied.
"""
# Placeholder for success criteria logic
success = False
# Implement actual success condition based on observations and actions
return success
async def compute_termination(self, transformed_obs, action):
"""
Determine whether the training episode should terminate.
Args:
transformed_obs (dict): Transformed sensor observations.
action: The action taken.
Returns:
bool: True if the episode should terminate, False otherwise.
Behavior:
- Currently always returns False.
- Can be implemented with logic to terminate based on certain conditions.
"""
# Placeholder for termination condition logic
return False
2. Add the Skills to Your Strategy Pattern Agent
Drag the skills start_reaction, control_transition, and produce_product, which now appear on the left-hand side of your project, onto the skills layer in the order you would like them to be used.
3. Configure the Selector
The green diamond that appears when you place multiple skills alongside each other is the selector. This is the "math teacher" skill that decides which action skill should make each decision.
Click on the selector to configure it. In this case, the default configurations are most likely correct.
The goals of the top-level selector in an agent should be the same as the goals of the agent as a whole. When the UI automatically creates a selector, it adds the project-level goals by default.
For a phased process like the industrial mixer reaction, a fixed-order sequence is appropriate. That means that the selector has the agent apply the skills one at a time, rather than switching back and forth between skills.
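As a rough illustration (plain Python, not the platform's configuration format), fixed-order sequencing means the selector only ever hands control forward through the phases and never returns to an earlier skill:

# Illustrative only: a fixed-order selector moves forward through the skills and never goes back.
SKILL_ORDER = ["start_reaction", "control_transition", "produce_product"]

def next_skill_index(current_index, phase_complete):
    """Advance to the next skill once the current phase is done; otherwise stay put."""
    if phase_complete and current_index < len(SKILL_ORDER) - 1:
        return current_index + 1
    return current_index

idx = next_skill_index(0, phase_complete=True)   # start_reaction -> control_transition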
4. Configure Scenarios
Scenarios are a key part of successfully training an agent with the strategy pattern. A scenario is a set of possible conditions represented within the simulation. Skills train to specialize in different scenarios - for example, the Start Reaction skill specializes in controlling the reaction when the temperature and concentration levels are those found at the beginning of the reaction.
This is what allows the skills to differentiate from each other. The three specialized skills practice only on their designated phase of the process and learn to succeed in their own specific conditions. The selector practices with the whole process so that it knows which skill to choose at any point.
4.1 Add Scenarios
Go to the Scenarios page using the left-hand navigation menu. There, click Add Scenario to create a new scenario for your agent to use in training.
When you are building an agent for your own use case, you will define the scenarios based on your knowledge of the task and process. In this case, we provide the values that define the phases of the chemical manufacturing process. Create these scenarios for your agent:
Full reaction: Cref Is 8.57, Tref Is 311
Startup: Cref Is 8.5698, Tref Is 311.2612
Transition: Cref Is 8.56, Tref Is 311, Is 22
Production: Cref Is 2, Tref Is 373.1311
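In these definitions, "Is" pins a sensor to a fixed value. As a rough illustration (not the exact format the platform uses), the same information could be written as plain Python dictionaries mapping sensor names to values:

# Illustrative representation of the scenarios above (not the platform's exact format).
scenarios = {
    "Full reaction": {"Cref": 8.57, "Tref": 311},
    "Startup": {"Cref": 8.5698, "Tref": 311.2612},
    "Transition": {"Cref": 8.56, "Tref": 311},   # additional unnamed condition omitted
    "Production": {"Cref": 2, "Tref": 373.1311},
}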
4.2 Create Scenario Flows
Scenario flows allow you to connect scenarios that have a sequential relationship to ensure that your agent gets practice in navigating the different conditions in the order in which they will occur.
For this problem, you do not need to create sequential connections between the scenarios. Drag all the scenarios to the first column to make them available to your skills and selectors.
4.3 Add Scenarios to Skills and Selectors
Once you have your scenarios set up and connected with scenario flows, you can add them to skills and selectors to tell each one what conditions it needs to practice in. This is what helps each skill develop its specialized expertise.
In the Agent Builder Studio, click on each skill and the selector in turn. For each, click on Scenarios and then click the dropdown arrows to show the available scenarios. Check the box for the scenario that applies, as listed below:
Start reaction: Startup
Transition: Transition
Produce product: Production
Selector: Full reaction
5. Run Your Training Session
Now we are ready to train your agent and see the results. We suggest running 50 training cycles. You will see the skills training one at a time, and each skill will train for the selected number of cycles.
6. View Results
When the training has been completed, you can view your results in the training sessions tab in the UI. This will show you information on how well the agent is learning.
You will likely see a steep learning curve as the agent experiments with different control strategies and learns from the results. When the learning curve plateaus, that usually means that the skill is trained.
Analyzing the Strategy Pattern Agent's Performance
Conversion rate: 92%
Thermal runaway risk: Low
We tested this fully trained agent and plotted the results.
This agent's performance is not perfect, but it stays closer to the benchmark line than either of the two single-skill agents. It just needs some help avoiding thermal runaway. We can provide that by adding a perception layer.