**Data Introspection**
*This page is a work in progress.*
A problem I see come up time and time again, both in private projects and at work, is introspection. You have some kind of data, and you want a way to accomplish some of the following things:
* Serialize/deserialize the data.
* Display the data in a human-readable way for debugging.
* Change the format of the data, without having to make changes in a lot of places in your source code.
* Wrap/layer access to the data. Access may be tunnelled through a network protocol, optimizers, validators, etc.
* Store differences between versions of the data, perhaps to implement undo/redo or synching over the network.
We want some way of implementing the following types and functions in an extensible way.
```c++
struct object_t;
struct property_t;
struct any_t;
struct visitor_t;
any_t get(object_t&, property_t);
void set(object_t&, property_t, any_t);
void visit(object_t&, visitor_t&);
void call(visitor_t&, property_t, any_t&);
const char* name(property_t);
```
# Solutions
## Switches over Enumerations
This is the most common solution I've seen. The `property_t` is an enumeration, and each function has a big switch statement. This works well for small numbers of properties, but when you are dealing with dozens of them, it is easy for the switches to get out of sync.
```c++
#include <any>
#include <functional>
#include <stdexcept>
#include <string>

struct object_t {
    int x, y;
    std::string name;
};
enum struct property_t {
    x, y, name
};
typedef std::any any_t;
typedef std::function<void(property_t, any_t&)> visitor_t;

any_t get(object_t& o, property_t p) {
    switch (p) {
    case property_t::x: return o.x;
    case property_t::y: return o.y;
    case property_t::name: return o.name;
    default: throw std::runtime_error("invalid property");
    }
}
// std::any_cast needs the target type; it throws std::bad_any_cast on a mismatch.
void set(object_t& o, property_t p, any_t v) {
    switch (p) {
    case property_t::x: o.x = std::any_cast<int>(v); break;
    case property_t::y: o.y = std::any_cast<int>(v); break;
    case property_t::name: o.name = std::any_cast<std::string>(v); break;
    default: throw std::runtime_error("invalid property");
    }
}
// The visitor takes any_t&, so each member is copied into a std::any first.
void visit(object_t& o, visitor_t& v) {
    any_t x = o.x, y = o.y, name = o.name;
    v(property_t::x, x);
    v(property_t::y, y);
    v(property_t::name, name);
}
const char* name(property_t p) {
    switch (p) {
    case property_t::x: return "x";
    case property_t::y: return "y";
    case property_t::name: return "name";
    default: throw std::runtime_error("invalid property");
    }
}
```
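To show what this buys us, here is a minimal sketch of the "display the data for debugging" use case built on `visit` and `name`. Both functions are restated in compact form so the block compiles on its own, with each member copied into a `std::any` before it is handed to the visitor. `debug_string` is a hypothetical helper, not part of the interface above.

```c++
#include <any>
#include <functional>
#include <sstream>
#include <stdexcept>
#include <string>

struct object_t {
    int x, y;
    std::string name;
};
enum struct property_t { x, y, name };
typedef std::any any_t;
typedef std::function<void(property_t, any_t&)> visitor_t;

// Compact restatement of visit() and name() from the section above.
void visit(object_t& o, visitor_t& v) {
    any_t x = o.x, y = o.y, name = o.name;
    v(property_t::x, x);
    v(property_t::y, y);
    v(property_t::name, name);
}
const char* name(property_t p) {
    switch (p) {
    case property_t::x: return "x";
    case property_t::y: return "y";
    case property_t::name: return "name";
    default: throw std::runtime_error("invalid property");
    }
}

// Hypothetical helper: formats every property as "name=value;".
std::string debug_string(object_t& o) {
    std::ostringstream out;
    visitor_t print = [&out](property_t p, any_t& value) {
        out << name(p) << '=';
        if (auto* i = std::any_cast<int>(&value)) out << *i;
        else if (auto* s = std::any_cast<std::string>(&value)) out << *s;
        else out << '?';  // a type this printer does not know about
        out << ';';
    };
    visit(o, print);
    return out.str();
}
```

The printer never mentions `x`, `y` or `name` itself, so adding a property only requires touching the switches, which is exactly where they can get out of sync.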
## Macros
I only came across this approach recently, while talking to a friend about this topic. [OpenXR](https://github.com/KhronosGroup/OpenXR-SDK/blob/main/include/openxr/openxr_reflection.h#L33) uses it. It avoids the code duplication by using macros. What I like about this approach is that it is the only one that completely avoids redundancy, but the usual downsides of macros apply: they are poorly supported by IDEs and require more context to understand the code. We could even avoid listing the name of the property as a string, but I kept it in to demonstrate how arbitrary additional data can be stored with each property.
```c++
#include <any>
#include <functional>
#include <stdexcept>

#define MY_FOR_EACH_PROPERTY(_) \
    _(x, int, "x") \
    _(y, int, "y") \
    _(name, const char*, "name")

struct object_t {
#define MY_DEFINITION(identifier, type, name) type identifier;
    MY_FOR_EACH_PROPERTY(MY_DEFINITION)
#undef MY_DEFINITION
};
enum struct property_t {
    // No semicolon after the expansion: it would be a syntax error inside an enum.
#define MY_ENUMERATE(identifier, type, name) identifier,
    MY_FOR_EACH_PROPERTY(MY_ENUMERATE)
#undef MY_ENUMERATE
};
typedef std::any any_t;
typedef std::function<void(property_t, any_t&)> visitor_t;

any_t get(object_t& o, property_t p) {
    switch (p) {
#define MY_CASE(identifier, type, name) \
    case property_t::identifier: return o.identifier;
    MY_FOR_EACH_PROPERTY(MY_CASE)
#undef MY_CASE
    default: throw std::runtime_error("invalid property");
    }
}
void set(object_t& o, property_t p, any_t v) {
    switch (p) {
    // The type column supplies the template argument for std::any_cast.
#define MY_CASE(identifier, type, name) \
    case property_t::identifier: \
        o.identifier = std::any_cast<type>(v); break;
    MY_FOR_EACH_PROPERTY(MY_CASE)
#undef MY_CASE
    default: throw std::runtime_error("invalid property");
    }
}
void visit(object_t& o, visitor_t& v) {
    // The visitor takes any_t&, so each member is copied into a std::any first.
#define MY_VISIT(identifier, type, name) \
    { any_t value = o.identifier; v(property_t::identifier, value); }
    MY_FOR_EACH_PROPERTY(MY_VISIT)
#undef MY_VISIT
}
const char* name(property_t p) {
    switch (p) {
#define MY_CASE(identifier, type, name) \
    case property_t::identifier: return name;
    MY_FOR_EACH_PROPERTY(MY_CASE)
#undef MY_CASE
    default: throw std::runtime_error("invalid property");
    }
}
```
## Tuples
## Templates
## "Reflection"
[Boost.PFR](https://www.boost.org/doc/libs/1_85_0/doc/html/boost_pfr.html) uses aggregate initialization and structured bindings to identify the number and types of members.
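To make the idea concrete, here is a simplified sketch of the underlying trick in C++17. It is not Boost.PFR's actual implementation, and the names `any_type`, `braces_constructible`, `field_count` and `for_each_field` are made up for this example. The member count is found by trying aggregate initialization with more and more placeholder values until it fails; a structured binding then exposes the members.

```c++
#include <cstddef>
#include <string>
#include <type_traits>
#include <utility>

// Placeholder that claims to convert to any type. The conversion is only
// declared, never defined, because it is used in unevaluated contexts only.
struct any_type {
    template <class T> operator T() const;
};

// Chosen when T{args...} is a valid aggregate initialization.
template <class T, class... Args>
auto braces_constructible(int)
    -> decltype(T{std::declval<Args>()...}, std::true_type{});
// Fallback when there are more initializers than members.
template <class T, class... Args>
std::false_type braces_constructible(...);

// Adds placeholders until initialization fails; the last count that worked
// is the number of members.
template <class T, class... Args>
constexpr std::size_t field_count() {
    if constexpr (decltype(braces_constructible<T, Args..., any_type>(0))::value)
        return field_count<T, Args..., any_type>();
    else
        return sizeof...(Args);
}

struct object_t {
    int x, y;
    std::string name;
};
static_assert(field_count<object_t>() == 3);

// With the count known, a structured binding gives access to each member.
// This sketch hard-codes the three-member case.
template <class F>
void for_each_field(object_t& o, F&& f) {
    auto& [a, b, c] = o;
    f(a);
    f(b);
    f(c);
}
```

This simple version only handles flat aggregates. Nested aggregates, arrays and base classes need extra care, which is a good reason to use the library rather than a hand-rolled version.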
# Data Types
First we need to define what kind of data we want to be able to store. For the purpose of this article, a data type is just a set of possible instances of the data. (One might also include the set of operations valid on the data and the meaning of each possible instance, but I'm ignoring those for now.) The task of defining a data type is therefore to define this set. In terms of complexity, we have two extremes:
* Arbitrary strings of bytes.
In the simplest case, the set is just all values that are representable with an arbitrary number of bytes. These are scalar numbers, strings, arrays, C structs, or SQL tables without constraints.
* The set of Turing machines that halt.
On the other extreme is the set of all possible programs that halt. Generally, it is not feasible to iterate the values of this set, nor to verify whether a given value is part of it.
Languages like SQL or JSON Schema strike a balance between the two by allowing us to specify certain kinds of constraints. This (almost?) always entails describing the data as the cross product of various simpler types, plus a set of constraints that can be evaluated. The problem becomes identifying and manipulating those primitive values, as well as finding a way to check the validity of a given combination of them.
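As a small illustration of "cross product plus constraints", consider a calendar date. The type `date_t` and the function `is_valid` are hypothetical examples: the struct is the cross product of three integers, and the validity check narrows that set down to the combinations that are actual dates.

```c++
// The unconstrained type: every combination of three ints is an instance.
struct date_t {
    int year;
    int month;
    int day;
};

// The constraints: can be evaluated for any given combination of values.
bool is_valid(const date_t& d) {
    if (d.month < 1 || d.month > 12) return false;
    const bool leap =
        (d.year % 4 == 0 && d.year % 100 != 0) || d.year % 400 == 0;
    const int days[] = {31, leap ? 29 : 28, 31, 30, 31, 30,
                        31, 31,             30, 31, 30, 31};
    return d.day >= 1 && d.day <= days[d.month - 1];
}
```

Note that the constraint on `day` depends on the other two members, so it cannot be checked by looking at each primitive value in isolation.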
# API
Considerations when designing the API to read and manipulate the data:
* Ease of use.
* Performance.
* ABI compatibility.
Multiple APIs can be implemented side by side.
## Enumeration Based
Libraries like OpenGL and ffmpeg offer an API like this. The represented data is a collection of properties, each of a primitive type or an array of a primitive type. They are accessed through typed getters and setters, which take enumeration values as arguments. This allows adding new properties without breaking ABI compatibility with calling or wrapping code, as long as the property is of an existing primitive type. However, this can lead to situations where the available types do not fit a property well. For example, OpenGL represents textures as plain `GLuint` values because there is no dedicated primitive type for them, which makes them more error-prone to use. Constraints may be validated in setters or in separate operations.
```c
GLfloat width;
glGetFloatv(GL_LINE_WIDTH, &width);
GLint binding;
glGetIntegeri_v(GL_SHADER_STORAGE_BUFFER_BINDING, 7, &binding);
```
Libraries like OpenSSL follow a similar approach. Instead of having typed getters and setters, they use typed property values, similar to `std::variant`. The getters and setters return and accept structs that store the type of the property as a member variable. Helper functions and macros are provided for common types. This allows adding new property types without breaking callers or wrappers. A wrapper can just pass along unknown types unchanged.
```c
uint32_t memory = 65536;
uint32_t lanes = 1;
OSSL_PARAM parameters[] = {
    OSSL_PARAM_construct_uint32(OSSL_KDF_PARAM_ARGON2_MEMCOST, &memory),
    OSSL_PARAM_construct_uint32(OSSL_KDF_PARAM_ARGON2_LANES, &lanes),
    // more parameters
    OSSL_PARAM_construct_end()
};
EVP_KDF_derive(context, hash, size, parameters);
```
An API like this allows users of the library to access members of an object without needing to know its exact layout. Members can be added to the library without existing code being affected.
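The pass-through property can be sketched with a hypothetical tagged value. This is loosely modelled on the idea behind `OSSL_PARAM`, but `param_t`, `param_type`, `find_int` and `wrapped_find_int` are invented for this example and are not OpenSSL's API.

```c++
#include <cstdint>
#include <cstring>

// Hypothetical tagged value: the type travels with the data.
enum struct param_type : std::uint32_t { end, int32, utf8 };
struct param_t {
    const char* key;
    param_type type;
    const void* data;
};

// Library side: reads the parameters it understands and skips the rest.
int find_int(const param_t* params, const char* key, int fallback) {
    for (; params->type != param_type::end; ++params) {
        if (params->type == param_type::int32 &&
            std::strcmp(params->key, key) == 0) {
            return *static_cast<const std::int32_t*>(params->data);
        }
    }
    return fallback;
}

// Wrapper side: forwards the list without inspecting any value, so a type
// tag added to the library later passes through unchanged.
int wrapped_find_int(const param_t* params, const char* key, int fallback) {
    return find_int(params, key, fallback);
}
```

The wrapper was compiled without any knowledge of `utf8` values, yet a list containing one still reaches the library intact.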
To serialize/deserialize objects, the library can provide the available enumeration values as a list.
```c++
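#include <ostream>

struct point;
enum struct point_property {
    x, y, z
};
// Scoped enumerators must be qualified with the name of the enumeration.
const point_property point_properties[] = {
    point_property::x, point_property::y, point_property::z
};
int get_int(point &p, point_property property);
void serialize(point &p, std::ostream &os) {
    for (auto property : point_properties) {
        os << get_int(p, property) << ",";
    }
}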
```
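Deserialization works the same way in reverse. The following sketch completes the round trip under some assumptions of mine: the layout of `point`, the `member` helper, and the `set_int` and `deserialize` functions are hypothetical and only there to make the example self-contained.

```c++
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>

// Assumed layout; a real library would keep this hidden from callers.
struct point { int x, y, z; };
enum struct point_property { x, y, z };
const point_property point_properties[] = {
    point_property::x, point_property::y, point_property::z
};

// Maps a property to the member that stores it.
int& member(point& p, point_property property) {
    switch (property) {
    case point_property::x: return p.x;
    case point_property::y: return p.y;
    case point_property::z: return p.z;
    }
    throw std::runtime_error("invalid property");
}
int get_int(point& p, point_property property) {
    return member(p, property);
}
void set_int(point& p, point_property property, int value) {
    member(p, property) = value;
}

void serialize(point& p, std::ostream& os) {
    for (auto property : point_properties) {
        os << get_int(p, property) << ",";
    }
}

// Reads values back in the order in which serialize() wrote them.
void deserialize(point& p, std::istream& is) {
    for (auto property : point_properties) {
        int value;
        char comma;
        if (!(is >> value >> comma) || comma != ',') {
            throw std::runtime_error("malformed input");
        }
        set_int(p, property, value);
    }
}
```

Neither `serialize` nor `deserialize` names a specific property, so a property added to `point_properties` is picked up by both. The format does depend on the order of the list, though, so inserting a property in the middle breaks previously stored data.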
*I want to write more examples here.*
## Getters and Setters
The library provides getter and setter functions for each primitive property. Adding a property requires adding a getter and a setter to the library and to every wrapper around it. Having worked with an API like this that also had about six layers, each repeating the same dozens of getters and setters, I consider this my least favourite approach.
```c++
class point {
public:
    int get_x();
    void set_x(int);
    int get_y();
    void set_y(int);
    std::string get_name();
    void set_name(std::string);
};
class network_point {
public:
    int get_x();
    void set_x(int);
    int get_y();
    void set_y(int);
    std::string get_name();
    void set_name(std::string);
};
```
A slight improvement in usability, though not performance, is to implement the getters and setters on wrappers around primitive types.
```c++
class int_value {
public:
    int get();
    void set(int);
};
class string_value {
public:
    std::string get();
    void set(std::string);
};
class network_int {
public:
    int get();
    void set(int);
};
class network_string {
public:
    std::string get();
    void set(std::string);
};
class point {
public:
    int_value x;
    int_value y;
    string_value name;
};
class network_point {
public:
    network_int x;
    network_int y;
    network_string name;
};
```
Although this example takes up more space, the total number of functions is actually smaller. This approach doesn't allow iterating over the properties.
## Member Variables
The library may also expose property values as member variables of a struct. Although this does not solve any of the initial problems, it is the most flexible option for manipulating the data and may be provided in addition to one of the other APIs. The user can access properties directly, retrieve pointers to them, and pass those pointers to other libraries. In turn, it is the least flexible API for the library authors. It almost necessitates a specific memory layout and limits the possibilities for changing this layout afterwards, or else requires costly copying and transforming of data. Validation of constraints may be performed during library operations on the data.
```c++
struct point {
    int x, y;
    std::string name;
};
point get();
void set(point p);
```
# Memory representation
## `struct`
Perhaps the most "natural" representation. Care must be taken not to break ABI when changing the size of `struct`s, e.g. by only adding members at the end and performing allocation in the library.
```c++
#include <stdexcept>
#include <string>

struct point {
    int x, y;
    std::string name;
};
enum struct property_t {
    x, y, name
};
// One typed getter per primitive type; each accepts only matching properties.
int get_int(point *p, property_t prop) {
    switch (prop) {
    case property_t::x: return p->x;
    case property_t::y: return p->y;
    default: throw std::runtime_error("invalid property");
    }
}
std::string get_string(point *p, property_t prop) {
    switch (prop) {
    case property_t::name: return p->name;
    default: throw std::runtime_error("invalid property");
    }
}
```
Ties and tuples can be used to allow iterating over values. An ABI-compatible interface can be implemented on top of that.
```c++
auto to_tie(point *p) {
    return std::tie(p->x, p->y, p->name);
}
```
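Iterating over the tie can be done with `std::apply` and a fold expression (C++17). In this sketch, `for_each_member` and `to_csv` are hypothetical helpers built on top of `to_tie`.

```c++
#include <sstream>
#include <string>
#include <tuple>

struct point {
    int x, y;
    std::string name;
};
auto to_tie(point* p) {
    return std::tie(p->x, p->y, p->name);
}

// Calls f once per member, in declaration order.
template <class F>
void for_each_member(point* p, F&& f) {
    std::apply([&f](auto&... member) { (f(member), ...); }, to_tie(p));
}

// Hypothetical use: write every member to a string, separated by commas.
std::string to_csv(point* p) {
    std::ostringstream out;
    for_each_member(p, [&out](const auto& member) { out << member << ','; });
    return out.str();
}
```

Because the tie holds references, a callback that takes its argument by non-const reference can also modify the members in place.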
## Maps/Dictionaries
```c++
enum struct property_t {
    x, y, name
};
struct point {
    std::map<property_t, std::variant<int, std::string>> properties;
};
```
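With this representation, generic access needs no switch at all. A minimal sketch follows; the alias `value_t` and the functions `get`, `set` and `count_ints` are assumptions of mine, chosen to mirror the interface from the beginning of the page.

```c++
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <variant>

enum struct property_t { x, y, name };
typedef std::variant<int, std::string> value_t;
struct point {
    std::map<property_t, value_t> properties;
};

// Returns the stored value; throws if the property was never set.
const value_t& get(const point& p, property_t property) {
    auto it = p.properties.find(property);
    if (it == p.properties.end()) {
        throw std::out_of_range("property not set");
    }
    return it->second;
}

void set(point& p, property_t property, value_t value) {
    p.properties[property] = std::move(value);
}

// Iteration comes for free: the map already knows which properties exist.
std::size_t count_ints(const point& p) {
    std::size_t n = 0;
    for (const auto& entry : p.properties) {
        if (std::holds_alternative<int>(entry.second)) ++n;
    }
    return n;
}
```

The price is that nothing ties a property to its type any more: `set(p, property_t::x, std::string("oops"))` compiles, so type checks have to happen at run time, and every access costs a map lookup.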
## Arrays/Tables
Arrays are a very efficient structure.
*I want to write more about this too.*