Analyze Data in Detail with the Historian
In this tutorial, we will explore how to use the historian to validate a trained AI agent in Composabl against its training logs. The historian stores historical time-series data in an optimized format, Parquet (https://www.databricks.com/glossary/what-is-parquet), which helps in evaluating how the agent performs over the course of training.
Step 1: Accessing the Historian Data
The historian file stores time-series data essential for validating agent training. There are several ways to access and store the historian data, but the recommended format is a Delta file (Parquet-based).
Understanding the Format:
The historian data is typically large, around 500 MB for standard operations. It is stored in the Delta Lake file format, which is optimized for time-series data and supports efficient queries.
The file can be downloaded as XML or another format (e.g., CSV or XLS), but Delta Lake is the most efficient for handling larger datasets.
Downloading the Historian File:
From the Composabl UI, download the historian file. It will likely come in a compressed format (e.g., `.gz`). After extracting it, you should see the delta file containing time-series data.
Step 2: Setting Up for Validation
Unpacking the Historian File:
If the historian file is compressed (e.g., `.gz`), unpack it using a tool like `gzip`. Once unzipped, you'll see a 10 MB+ delta file with historical time-series data.
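If you prefer to stay in Python, the same unpacking can be done with the standard library. This is a minimal sketch; the file names in the usage comment are placeholders for whatever your download is actually called:

```python
import gzip
import shutil

def extract_gz(src_path: str, dst_path: str) -> None:
    """Decompress a .gz download (equivalent to `gzip -d` on the command line)."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Stream the decompressed bytes to disk without loading the
        # whole file into memory.
        shutil.copyfileobj(src, dst)

# Usage (placeholder names -- substitute your actual download):
# extract_gz("historian.gz", "historian")
```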
Understanding the Delta File:
The delta file is optimized for fast reads and writes of time-series data.
It supports an append-only structure, which ensures that each new piece of data can be added efficiently without modifying the existing data.
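To make the append-only idea concrete, here is a toy sketch of the pattern. It is illustrative only: real Delta Lake writes Parquet data files plus JSON transaction entries in a `_delta_log` directory, while this sketch uses JSON for both. The point is that each write adds new files and a log entry, and existing files are never touched:

```python
import json
import os
import tempfile

def append(table_dir: str, records: list, version: int) -> None:
    """Append records as a new data file plus a transaction-log entry."""
    data_name = f"part-{version:05d}.json"
    with open(os.path.join(table_dir, data_name), "w") as f:
        json.dump(records, f)
    log_dir = os.path.join(table_dir, "_delta_log")
    os.makedirs(log_dir, exist_ok=True)
    # One log entry per commit, named so entries sort in commit order.
    with open(os.path.join(log_dir, f"{version:020d}.json"), "w") as f:
        json.dump({"add": data_name}, f)

def read_table(table_dir: str) -> list:
    """Replay the log in order and concatenate the referenced data files."""
    log_dir = os.path.join(table_dir, "_delta_log")
    rows = []
    for entry in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, entry)) as f:
            added = json.load(f)["add"]
        with open(os.path.join(table_dir, added)) as f:
            rows.extend(json.load(f))
    return rows

table = tempfile.mkdtemp()
append(table, [{"t": 0, "reward": 0.1}], version=0)
append(table, [{"t": 1, "reward": 0.3}], version=1)
print(read_table(table))  # both appends visible, nothing overwritten
```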
Step 3: Querying the Historian Data
Setting Up a Query Environment:
To validate your agent’s training, you’ll need to set up an environment that allows you to query the delta file. Delta Lake integrates well with systems like Apache Spark, but for simple querying, you can use tools like pandas in Python.
Querying for Agent Training Logs:
Extract and analyze relevant historical data from the delta file. Here's a simple Python example for querying the delta file using pandas:
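A minimal sketch of such a query. The `deltalake` loading snippet in the comment assumes you have run `pip install deltalake`, and the column names (`episode`, `reward`) are illustrative assumptions about what your historian contains:

```python
import pandas as pd

# Loading the delta file: with the `deltalake` package installed,
# the table loads straight into pandas:
#
#   from deltalake import DeltaTable
#   df = DeltaTable("path/to/historian").to_pandas()
#
# For illustration, a tiny frame with columns the historian might
# plausibly contain (column names here are assumptions):
df = pd.DataFrame({
    "episode": [1, 1, 2, 2],
    "reward":  [0.1, 0.3, 0.5, 0.7],
})

# Mean reward per training episode -- rising values across episodes
# are a quick sanity check that the agent is learning:
per_episode = df.groupby("episode")["reward"].mean()
print(per_episode)
```

From here, the same `groupby`/filter pattern extends to any column the historian records, such as slicing by timestamp range to inspect a specific stretch of training.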
Key Benefits of Using the Historian for Validation:
Optimized Data Handling: The Delta Lake format is designed for fast querying, making it ideal for time-series data.
Efficient Storage: The append-only nature ensures that new data can be added without overwriting or modifying existing data, making it easy to track data over time.
Continuous Monitoring: By continuously adding data to the historian, you can validate your agent's long-term impact on machine performance, uptime, and safety.